    Class HyphenationTree

    This tree structure stores the hyphenation patterns in an efficient way for fast lookup. It provides the method to hyphenate a word.

    This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). It has been slightly modified.

    Inheritance
    System.Object
    TernaryTree
    HyphenationTree
    Implements
    IPatternConsumer
    Inherited Members
    TernaryTree.m_lo
    TernaryTree.m_hi
    TernaryTree.m_eq
    TernaryTree.m_sc
    TernaryTree.m_kv
    TernaryTree.m_root
    TernaryTree.m_freenode
    TernaryTree.m_length
    TernaryTree.BLOCK_SIZE
    TernaryTree.Init()
    TernaryTree.Insert(String, Char)
    TernaryTree.Insert(Char[], Int32, Char)
    TernaryTree.StrCmp(Char[], Int32, Char[], Int32)
    TernaryTree.StrCmp(String, Char[], Int32)
    TernaryTree.StrCpy(Char[], Int32, Char[], Int32)
    TernaryTree.StrLen(Char[], Int32)
    TernaryTree.StrLen(Char[])
    TernaryTree.Find(String)
    TernaryTree.Find(Char[], Int32)
    TernaryTree.Knows(String)
    TernaryTree.Length
    TernaryTree.Clone()
    TernaryTree.InsertBalanced(String[], Char[], Int32, Int32)
    TernaryTree.Balance()
    TernaryTree.TrimToSize()
    TernaryTree.GetEnumerator()
    TernaryTree.PrintStats(TextWriter)
    System.Object.Equals(System.Object)
    System.Object.Equals(System.Object, System.Object)
    System.Object.GetHashCode()
    System.Object.GetType()
    System.Object.MemberwiseClone()
    System.Object.ReferenceEquals(System.Object, System.Object)
    System.Object.ToString()
    Namespace: Lucene.Net.Analysis.Compound.Hyphenation
    Assembly: Lucene.Net.Analysis.Common.dll
    Syntax
    public class HyphenationTree : TernaryTree, IPatternConsumer

    Constructors


    HyphenationTree()

    Declaration
    public HyphenationTree()

    Fields


    m_classmap

    This map stores the character classes.

    Declaration
    protected TernaryTree m_classmap
    Field Value
    Type Description
    TernaryTree

    m_stoplist

    This map stores the hyphenation exceptions.

    Declaration
    protected IDictionary<string, IList<object>> m_stoplist
    Field Value
    Type Description
    System.Collections.Generic.IDictionary<System.String, System.Collections.Generic.IList<System.Object>>

    m_vspace

    Value space: stores the interletter values.

    Declaration
    protected ByteVector m_vspace
    Field Value
    Type Description
    ByteVector

    Methods


    AddClass(String)

    Add a character class to the tree. It is used by PatternParser as a callback to add character classes. Character classes define the valid word characters for hyphenation. If a word contains a character that is not defined in any of the classes, it is not hyphenated. A character class also defines how to normalize characters so that they can be compared with the stored patterns. Pattern files usually use only lowercase characters. In that case a class for the letter 'a', for example, should be defined as "aA", with the first character being the normalization character.

    Declaration
    public virtual void AddClass(string chargroup)
    Parameters
    Type Name Description
    System.String chargroup
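
    The normalization rule described above can be illustrated with a small stand-alone sketch. It does not use HyphenationTree itself: the CharClassSketch class and its Build and Normalize methods are hypothetical names invented for this example, and a plain dictionary stands in for the class map.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.Net: models character classes as a
// map from every member character to the first character of its class.
public static class CharClassSketch
{
    public static Dictionary<char, char> Build(params string[] classes)
    {
        var map = new Dictionary<char, char>();
        foreach (string chargroup in classes)
        {
            if (string.IsNullOrEmpty(chargroup)) continue;
            char normalized = chargroup[0]; // first character is the normalization char
            foreach (char c in chargroup) map[c] = normalized;
        }
        return map;
    }

    // Returns the normalized word, or null when a character belongs to no
    // class (such a word would not be hyphenated).
    public static string Normalize(string word, Dictionary<char, char> map)
    {
        var buffer = new char[word.Length];
        for (int i = 0; i < word.Length; i++)
        {
            if (!map.TryGetValue(word[i], out char n)) return null;
            buffer[i] = n;
        }
        return new string(buffer);
    }
}
```

    With the classes "aA" and "bB", the word "AbBa" normalizes to "abba", while "abc" is rejected because 'c' is in no class.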

    AddException(String, IList<Object>)

    Add an exception to the tree. It is used by the PatternParser class as a callback to store the hyphenation exceptions.

    Declaration
    public virtual void AddException(string word, IList<object> hyphenatedword)
    Parameters
    Type Name Description
    System.String word

    normalized word

    System.Collections.Generic.IList<System.Object> hyphenatedword

    a vector of alternating strings and Hyphen objects.


    AddPattern(String, String)

    Add a pattern to the tree. It is mainly used by the PatternParser class as a callback to add a pattern to the tree.

    Declaration
    public virtual void AddPattern(string pattern, string ivalue)
    Parameters
    Type Name Description
    System.String pattern

    the hyphenation pattern

    System.String ivalue

    Interletter weight values indicating the desirability and priority of hyphenating at a given point within the pattern. It should contain only digit characters ('0' to '9').
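
    To make the relationship between pattern and ivalue concrete, the sketch below splits a TeX-style pattern with inline digits (for example "hy3ph") into a letter string and a digit string with one value per gap. The inline-digit input format and the PatternSplitSketch.Split helper are assumptions made for this example; the source only states that ivalue contains digits.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Lucene.Net.
public static class PatternSplitSketch
{
    // Splits e.g. "hy3ph" into ("hyph", "00300"): one digit for the gap
    // before each letter plus one for the gap after the last letter.
    public static (string Pattern, string IValue) Split(string raw)
    {
        var letters = new StringBuilder();
        var values = new StringBuilder();
        bool pendingDigit = false; // true when the current gap already has a digit
        foreach (char c in raw)
        {
            if (c >= '0' && c <= '9')
            {
                values.Append(c);
                pendingDigit = true;
            }
            else
            {
                if (!pendingDigit) values.Append('0'); // gap without explicit weight
                letters.Append(c);
                pendingDigit = false;
            }
        }
        if (!pendingDigit) values.Append('0'); // trailing gap
        return (letters.ToString(), values.ToString());
    }
}
```

    The two strings returned for "hy3ph" have the shape that AddPattern(pattern, ivalue) expects according to the parameter descriptions above.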


    FindPattern(String)

    Declaration
    public virtual string FindPattern(string pat)
    Parameters
    Type Name Description
    System.String pat
    Returns
    Type Description
    System.String

    GetValues(Int32)

    Declaration
    protected virtual byte[] GetValues(int k)
    Parameters
    Type Name Description
    System.Int32 k
    Returns
    Type Description
    System.Byte[]

    HStrCmp(Char[], Int32, Char[], Int32)

    String compare. Returns 0 if the strings are equal or if t is a substring of s.

    Declaration
    protected virtual int HStrCmp(char[] s, int si, char[] t, int ti)
    Parameters
    Type Name Description
    System.Char[] s
    System.Int32 si
    System.Char[] t
    System.Int32 ti
    Returns
    Type Description
    System.Int32
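
    One possible reading of this comparison is sketched below. It assumes that both arrays are null-terminated and that "substring" means that t, starting at ti, matches the beginning of s starting at si. The HStrCmpSketch class is a hypothetical stand-alone helper, not the library's implementation.

```csharp
using System;

// Hypothetical stand-alone sketch, not the Lucene.Net implementation.
public static class HStrCmpSketch
{
    // Both arrays must end with '\0' (assumption).
    public static int Compare(char[] s, int si, char[] t, int ti)
    {
        for (; s[si] == t[ti]; si++, ti++)
        {
            if (s[si] == '\0') return 0; // both strings ended: equal
        }
        if (t[ti] == '\0') return 0;     // t ended first: t matches the start of s
        return s[si] - t[ti];            // ordinary character difference
    }
}
```

    Under these assumptions, comparing "hyphen" against "hyp" yields 0, while comparing "hyp" against "hyphen" yields a negative value.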

    Hyphenate(Char[], Int32, Int32, Int32, Int32)

    Hyphenate word and return an array of hyphenation points.

    Declaration
    public virtual Hyphenation Hyphenate(char[] w, int offset, int len, int remainCharCount, int pushCharCount)
    Parameters
    Type Name Description
    System.Char[] w

    char array that contains the word

    System.Int32 offset

    Offset to first character in word

    System.Int32 len

    Length of word

    System.Int32 remainCharCount

    Minimum number of characters allowed before the hyphenation point.

    System.Int32 pushCharCount

    Minimum number of characters allowed after the hyphenation point.

    Returns
    Type Description
    Hyphenation

    a Hyphenation object representing the hyphenated word or null if word is not hyphenated.

    Remarks

    w = "*nnllllllnnn", where n is a non-letter and l is a letter. All n may be absent. The first n is at offset, and the first l is at offset + iIgnoreAtBeginning.

    word = ".llllll.'\0'*", where all l in w are copied into word.

    In the first part of the routine len = w.length; in the second part of the routine len = word.length.

    Three indices are used:
        index(w), the index in w
        index(word), the index in word
        letterindex(word), the index in the letter part of word

    The following relations exist:
        index(w) = offset + i - 1
        index(word) = i - iIgnoreAtBeginning
        letterindex(word) = index(word) - 1 (see first loop)

    It follows that:
        index(w) - index(word) = offset - 1 + iIgnoreAtBeginning
        index(w) = letterindex(word) + offset + iIgnoreAtBeginning


    Hyphenate(String, Int32, Int32)

    Hyphenate word and return a Hyphenation object.

    Declaration
    public virtual Hyphenation Hyphenate(string word, int remainCharCount, int pushCharCount)
    Parameters
    Type Name Description
    System.String word

    the word to be hyphenated

    System.Int32 remainCharCount

    Minimum number of characters allowed before the hyphenation point.

    System.Int32 pushCharCount

    Minimum number of characters allowed after the hyphenation point.

    Returns
    Type Description
    Hyphenation

    a Hyphenation object representing the hyphenated word or null if word is not hyphenated.
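
    The effect of remainCharCount and pushCharCount can be sketched in isolation. The example assumes the usual convention of Liang-style hyphenation that an odd interletter value marks a permitted break, and it uses a simplified indexing (il[i] is the value of the gap before letter i) rather than the library's internal indexing. HyphenPointSketch.SelectPoints is a hypothetical helper.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper with simplified indexing, not part of Lucene.Net.
public static class HyphenPointSketch
{
    // il has len + 1 entries; il[i] is the value of the gap before letter i.
    public static List<int> SelectPoints(byte[] il, int len, int remainCharCount, int pushCharCount)
    {
        var points = new List<int>();
        for (int i = 1; i < len; i++)
        {
            bool breakAllowed = (il[i] & 1) == 1;        // odd value: break permitted (assumption)
            bool enoughBefore = i >= remainCharCount;    // characters left before the break
            bool enoughAfter = len - i >= pushCharCount; // characters pushed after the break
            if (breakAllowed && enoughBefore && enoughAfter) points.Add(i);
        }
        return points;
    }
}
```

    For a six-letter word with a single odd value at gap 2, the point is kept with remainCharCount = 2 and pushCharCount = 2, and dropped when remainCharCount is raised to 3.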


    LoadPatterns(FileInfo)

    Read hyphenation patterns from an XML file.

    Declaration
    public virtual void LoadPatterns(FileInfo f)
    Parameters
    Type Name Description
    System.IO.FileInfo f

    a System.IO.FileInfo object representing the file

    Exceptions
    Type Condition
    System.IO.IOException

    In case the parsing fails


    LoadPatterns(FileInfo, Encoding)

    Read hyphenation patterns from an XML file.

    Declaration
    public virtual void LoadPatterns(FileInfo f, Encoding encoding)
    Parameters
    Type Name Description
    System.IO.FileInfo f

    a System.IO.FileInfo object representing the file

    System.Text.Encoding encoding

    The character encoding to use

    Exceptions
    Type Condition
    System.IO.IOException

    In case the parsing fails


    LoadPatterns(Stream)

    Read hyphenation patterns from an XML file.

    Declaration
    public virtual void LoadPatterns(Stream source)
    Parameters
    Type Name Description
    System.IO.Stream source

    System.IO.Stream input source for the file

    Exceptions
    Type Condition
    System.IO.IOException

    In case the parsing fails


    LoadPatterns(Stream, Encoding)

    Read hyphenation patterns from an XML file.

    Declaration
    public virtual void LoadPatterns(Stream source, Encoding encoding)
    Parameters
    Type Name Description
    System.IO.Stream source

    System.IO.Stream input source for the file

    System.Text.Encoding encoding

    The character encoding to use

    Exceptions
    Type Condition
    System.IO.IOException

    In case the parsing fails


    LoadPatterns(String)

    Read hyphenation patterns from an XML file.

    Declaration
    public virtual void LoadPatterns(string filename)
    Parameters
    Type Name Description
    System.String filename

    the filename

    Exceptions
    Type Condition
    System.IO.IOException

    In case the parsing fails


    LoadPatterns(String, Encoding)

    Read hyphenation patterns from an XML file.

    Declaration
    public virtual void LoadPatterns(string filename, Encoding encoding)
    Parameters
    Type Name Description
    System.String filename

    the filename

    System.Text.Encoding encoding

    The character encoding to use

    Exceptions
    Type Condition
    System.IO.IOException

    In case the parsing fails


    LoadPatterns(XmlReader)

    Read hyphenation patterns from a System.Xml.XmlReader.

    Declaration
    public virtual void LoadPatterns(XmlReader source)
    Parameters
    Type Name Description
    System.Xml.XmlReader source

    System.Xml.XmlReader input source for the file

    Exceptions
    Type Condition
    System.IO.IOException

    In case the parsing fails


    PackValues(String)

    Packs the values by storing each of them in 4 bits, two values per byte. Values range from 0 to 9. Zero is used as the terminator, so 1 is added to each value before it is stored.

    Declaration
    protected virtual int PackValues(string values)
    Parameters
    Type Name Description
    System.String values

    a string of digits from '0' to '9' representing the interletter values.

    Returns
    Type Description
    System.Int32

    the index into the vspace array where the packed values are stored.
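
    The packing scheme can be sketched as a stand-alone round trip. Unlike the documented method, which stores the bytes in the value space and returns an index, this hypothetical NibblePackSketch class returns a fresh array. Placing the first value of each pair in the high nibble is an assumption of this sketch.

```csharp
using System;
using System.Text;

// Hypothetical stand-alone sketch, not the Lucene.Net implementation.
public static class NibblePackSketch
{
    public static byte[] Pack(string values)
    {
        int n = values.Length;
        // One byte per two digits, plus a zero byte as terminator.
        int m = (n & 1) == 1 ? (n >> 1) + 2 : (n >> 1) + 1;
        var packed = new byte[m];
        for (int i = 0; i < n; i++)
        {
            int j = i >> 1;
            byte v = (byte)((values[i] - '0' + 1) & 0x0f); // shift by 1 so 0 can terminate
            if ((i & 1) == 1) packed[j] = (byte)(packed[j] | v); // second value: low nibble
            else packed[j] = (byte)(v << 4);                     // first value: high nibble
        }
        return packed; // the last byte stays 0 and acts as terminator
    }

    public static string Unpack(byte[] packed)
    {
        var sb = new StringBuilder();
        foreach (byte b in packed)
        {
            int hi = b >> 4;
            if (hi == 0) break;
            sb.Append((char)(hi - 1 + '0'));
            int lo = b & 0x0f;
            if (lo == 0) break;
            sb.Append((char)(lo - 1 + '0'));
        }
        return sb.ToString();
    }
}
```

    For the digit string "00300" the stored nibbles are 1, 1, 4, 1, 1, giving the bytes 0x11, 0x41, 0x10 followed by a zero byte, and unpacking them returns "00300".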


    SearchPatterns(Char[], Int32, Byte[])

    Search for all possible partial matches of word starting at index and update the interletter values. In other words, it does something like:

    for (int i = 0; i < patterns.Length; i++)
    {
        if (word.Substring(index).StartsWith(patterns[i], StringComparison.Ordinal))
            update_interletter_values(patterns[i]);
    }

    However, it is done in an efficient way, since the patterns are stored in a ternary tree. In fact, this is the whole purpose of having the tree: doing this search without having to test every single pattern. The number of patterns for languages such as English ranges from 4000 to 10000, so doing thousands of string comparisons for each word to hyphenate would be very slow without the tree. The tradeoff is memory, but using a ternary tree instead of a trie almost halves the memory used by Lout or TeX. It is also faster than using a hash table.

    Declaration
    protected virtual void SearchPatterns(char[] word, int index, byte[] il)
    Parameters
    Type Name Description
    System.Char[] word

    null-terminated word to match

    System.Int32 index

    start index from word

    System.Byte[] il

    interletter values array to update
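
    For comparison, the naive search that the ternary tree avoids can be written out as a small reference sketch. It assumes that each pattern carries a digit string with one value per gap and that the larger value wins when several patterns touch the same gap, as in Liang-style hyphenation. NaiveSearchSketch and its dictionary-based pattern store are hypothetical and are not how HyphenationTree stores patterns.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical reference sketch of the slow approach, not part of Lucene.Net.
public static class NaiveSearchSketch
{
    // patterns maps a pattern's letters to its interletter digits,
    // e.g. "hyph" -> "00300". il needs word.Length + 1 entries.
    public static void SearchPatterns(string word, int index,
        IDictionary<string, string> patterns, byte[] il)
    {
        string rest = word.Substring(index);
        foreach (KeyValuePair<string, string> entry in patterns)
        {
            if (!rest.StartsWith(entry.Key, StringComparison.Ordinal)) continue;
            string values = entry.Value;
            for (int j = 0; j < values.Length && index + j < il.Length; j++)
            {
                byte v = (byte)(values[j] - '0');
                if (v > il[index + j]) il[index + j] = v; // keep the larger value (assumption)
            }
        }
    }
}
```

    Calling this once for every start index of ".hyphen." with the patterns "hyph" -> "00300" and "hen" -> "0020" sets il[3] to 3 and il[6] to 2. Every call tests every pattern, which is the cost the tree lookup is designed to avoid.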


    UnpackValues(Int32)

    Declaration
    protected virtual string UnpackValues(int k)
    Parameters
    Type Name Description
    System.Int32 k
    Returns
    Type Description
    System.String

    Implements

    IPatternConsumer
    Copyright © 2020 The Apache Software Foundation, Licensed under the Apache License, Version 2.0
    Apache Lucene.Net, Lucene.Net, Apache, the Apache feather logo, and the Apache Lucene.Net project logo are trademarks of The Apache Software Foundation.
    All other marks mentioned may be trademarks or registered trademarks of their respective owners.