
    Class BeiderMorseEncoder

    Encodes strings into their Beider-Morse phonetic encoding.

    Inheritance
    System.Object
    BeiderMorseEncoder
    Implements
    IStringEncoder
    Namespace: Lucene.Net.Analysis.Phonetic.Language.Bm
    Assembly: Lucene.Net.Analysis.Phonetic.dll
    Syntax
    public class BeiderMorseEncoder : object, IStringEncoder
    Remarks

    Beider-Morse phonetic encodings are optimised for family names. However, they may be useful for a wide range of words.

    This encoder is intentionally mutable to allow dynamic configuration through its properties. As a result, it may not be thread-safe. If you require guaranteed thread-safe encoding, use PhoneticEngine directly.

    Encoding overview

    Beider-Morse phonetic encoding is a multi-step process. First, a table of rules is consulted to guess what language the word comes from. For example, if the word ends in "ault", it is inferred to be French. Next, the word is translated into a phonetic representation using a language-specific phonetics table. Some runs of letters can be pronounced in multiple ways, and a single run of letters may be broken up into phonemes at different places, so this stage results in a set of possible language-specific phonetic representations. Lastly, these language-specific phonetic representations are processed by a table of rules that rewrites them phonetically, taking into account systematic pronunciation differences between languages, to move them towards a pan-Indo-European phonetic representation. Again, there are sometimes multiple ways this could be done, and sometimes things that can be pronounced in several ways in the source language have only one representation in this average phonetic language, so the result is again a set of phonetic spellings.

    Some names are treated as having multiple parts. This can be due to two things. Firstly, they may be hyphenated. In this case, each individual hyphenated word is encoded, and then these are combined end-to-end for the final encoding. Secondly, some names have standard prefixes, for example, "Mac/Mc" in Scottish (English) names. As sometimes it is ambiguous whether the prefix is intended or is an accident of the spelling, the word is encoded once with the prefix and once without it. The resulting encoding contains one and then the other result.

    Encoding format

    Individual phonetic spellings of an input word are represented in upper- and lower-case Roman characters. Where there are multiple possible phonetic representations, these are joined with a pipe (|) character. If multiple hyphenated words were found, or if the word may contain a name prefix, each encoded word is placed in parentheses and these blocks are then joined with hyphens. For example, "d'ortley" has a possible prefix. The form without the prefix encodes to ortlaj|ortlej, while the form with the prefix encodes to dortlaj|dortlej. Thus, the full, combined encoding is (ortlaj|ortlej)-(dortlaj|dortlej).

    The encoded forms are often quite a bit longer than the input strings. This is because a single input may have many potential phonetic interpretations. For example, Renault encodes to rYnDlt|rYnalt|rYnult|rinDlt|rinalt|rinult. The APPROX rules will tend to produce larger encodings as they consider a wider range of possible, approximate phonetic interpretations of the original word. Down-stream applications may wish to further process the encoding for indexing or lookup purposes, for example, by splitting on pipe (|) and indexing under each of these alternatives.
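    As an illustration of the downstream processing mentioned above, the following sketch splits an encoded string into its parenthesised blocks and pipe-separated alternatives. BmEncodingParser and its methods are hypothetical helpers, not part of Lucene.NET, and the sketch assumes that individual phonetic spellings never contain '(', ')', '-' or '|'.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class BmEncodingParser
{
    // Returns one array of alternatives per encoded block.
    // "a|b"         -> [ [a, b] ]
    // "(a|b)-(c|d)" -> [ [a, b], [c, d] ]
    public static List<string[]> Parse(string encoded)
    {
        var blocks = new List<string[]>();
        if (string.IsNullOrEmpty(encoded))
        {
            return blocks;
        }

        bool isMultiPart = encoded.StartsWith("(", StringComparison.Ordinal)
                        && encoded.EndsWith(")", StringComparison.Ordinal);
        if (!isMultiPart)
        {
            // A single word: just split the alternatives on the pipe.
            blocks.Add(encoded.Split('|'));
            return blocks;
        }

        // Strip the outermost parentheses, then split on the ")-(" separator.
        string inner = encoded.Substring(1, encoded.Length - 2);
        foreach (string block in inner.Split(new[] { ")-(" }, StringSplitOptions.None))
        {
            blocks.Add(block.Split('|'));
        }
        return blocks;
    }

    // Flattens all alternatives into a set of distinct terms to index under.
    public static HashSet<string> IndexTerms(string encoded)
    {
        return new HashSet<string>(Parse(encoded).SelectMany(b => b));
    }
}
```

    Calling BmEncodingParser.Parse on the "d'ortley" encoding shown earlier yields two blocks of two alternatives each. Whether to index every alternative or only a subset is an application-level choice.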

    since 1.6

    Properties


    IsConcat

    Gets or sets how multiple possible phonetic encodings are combined: true if multiple encodings are to be combined with a '|', false if only the first one is to be considered.

    Declaration
    public virtual bool IsConcat { get; set; }
    Property Value
    Type Description
    System.Boolean

    NameType

    Gets or sets the name type currently in operation. Use GENERIC unless you specifically want phonetic encodings optimized for Ashkenazi or Sephardic Jewish family names.

    Declaration
    public virtual NameType NameType { get; set; }
    Property Value
    Type Description
    NameType

    RuleType

    Gets or sets the rule type to apply. This widens or narrows the range of phonetic encodings considered: APPROX for approximate phonetic matches, EXACT for exact ones.

    Declaration
    public virtual RuleType RuleType { get; set; }
    Property Value
    Type Description
    RuleType

    Methods


    Encode(String)

    Declaration
    public virtual string Encode(string source)
    Parameters
    Type Name Description
    System.String source
    Returns
    Type Description
    System.String

    SetMaxPhonemes(Int32)

    Sets the maximum number of phonemes that shall be considered by the engine.

    since 1.7

    Declaration
    public virtual void SetMaxPhonemes(int maxPhonemes)
    Parameters
    Type Name Description
    System.Int32 maxPhonemes

    the maximum number of phonemes returned by the engine
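    The properties and methods documented on this page might be combined as in the sketch below. It assumes a project that references the Lucene.Net.Analysis.Phonetic assembly and that the NameType and RuleType enums expose the GENERIC and APPROX members named in the descriptions above. The wrapper class BeiderMorseEncoderExample and the limit of 10 phonemes are illustrative choices, not recommendations from the library.

```csharp
using System;
using Lucene.Net.Analysis.Phonetic.Language.Bm;

// Hypothetical wrapper for illustration only.
public static class BeiderMorseEncoderExample
{
    public static string EncodeName(string name)
    {
        // The encoder is mutable and may not be thread-safe, so this sketch
        // creates a fresh instance per call rather than sharing one.
        var encoder = new BeiderMorseEncoder
        {
            NameType = NameType.GENERIC, // not specialised for Ashkenazi or Sephardic names
            RuleType = RuleType.APPROX,  // wider, approximate phonetic matching
            IsConcat = true              // join alternatives with '|'
        };

        // Arbitrary cap chosen for this example.
        encoder.SetMaxPhonemes(10);

        return encoder.Encode(name);
    }
}
```

    The returned string follows the encoding format described in the remarks, so callers may still need to split it on '|' before indexing.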

    Implements

    IStringEncoder
    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)