Class HyphenationTree
This tree structure stores the hyphenation patterns in an efficient way for fast lookup. It provides the method to hyphenate a word.
This class has been taken from the Apache FOP project (http://xmlgraphics.apache.org/fop/). It has been slightly modified.
Namespace: Lucene.Net.Analysis.Compound.Hyphenation
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
public class HyphenationTree : TernaryTree, IPatternConsumer
Constructors
HyphenationTree()
Declaration
public HyphenationTree()
Fields
m_classmap
This map stores the character classes
Declaration
protected TernaryTree m_classmap
Field Value
| Type | Description |
|---|---|
| TernaryTree | |
m_stoplist
This map stores hyphenation exceptions
Declaration
protected IDictionary<string, IList<object>> m_stoplist
Field Value
| Type | Description |
|---|---|
| System.Collections.Generic.IDictionary<System.String, System.Collections.Generic.IList<System.Object>> | |
m_vspace
Value space: stores the interletter values.
Declaration
protected ByteVector m_vspace
Field Value
| Type | Description |
|---|---|
| ByteVector | |
Methods
AddClass(String)
Add a character class to the tree. PatternParser uses this method as a callback to add character classes. Character classes define the valid word characters for hyphenation: if a word contains a character that is not defined in any class, the word is not hyphenated. A class also defines how characters are normalized before they are compared with the stored patterns. Pattern files usually contain only lowercase characters, so a class for the letter 'a', for example, should be defined as "aA", with the first character being the normalization character.
Declaration
public virtual void AddClass(string chargroup)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | chargroup | the character class; its first character is the normalization character |
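The normalization rule described above can be illustrated with a minimal, self-contained sketch. The `CharClassSketch` type and its methods are hypothetical helpers written for this illustration; they are not part of Lucene.NET and use a plain dictionary rather than the ternary tree that `m_classmap` actually is.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET: illustrates how character
// classes such as "aA" can drive normalization.
public static class CharClassSketch
{
    // Maps every character of each class to the class's first character.
    public static Dictionary<char, char> BuildClassMap(IEnumerable<string> charGroups)
    {
        var map = new Dictionary<char, char>();
        foreach (string group in charGroups)
        {
            if (string.IsNullOrEmpty(group)) continue;
            char normalized = group[0]; // first character is the normalization char
            foreach (char c in group)
            {
                map[c] = normalized;
            }
        }
        return map;
    }

    // Returns the normalized word, or null when some character belongs to
    // no class (such a word would not be hyphenated).
    public static string Normalize(string word, IReadOnlyDictionary<char, char> map)
    {
        var buffer = new char[word.Length];
        for (int i = 0; i < word.Length; i++)
        {
            if (!map.TryGetValue(word[i], out char normalized)) return null;
            buffer[i] = normalized;
        }
        return new string(buffer);
    }
}
```

With the classes "aA" and "bB", the word "AbBa" normalizes to "abba", while "abc" yields null because 'c' is not covered by any class.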
AddException(String, IList<Object>)
Add an exception to the tree. It is used by the PatternParser class as a callback to store the hyphenation exceptions.
Declaration
public virtual void AddException(string word, IList<object> hyphenatedword)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | word | normalized word |
| System.Collections.Generic.IList<System.Object> | hyphenatedword | a vector of alternating strings and Hyphen objects. |
AddPattern(String, String)
Add a pattern to the tree. It is mainly intended to be used by the PatternParser class as a callback to add a pattern to the tree.
Declaration
public virtual void AddPattern(string pattern, string ivalue)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | pattern | the hyphenation pattern |
| System.String | ivalue | interletter weight values indicating the desirability and priority of hyphenating at a given point within the pattern. It should contain only digit characters. (i.e. '0' to '9'). |
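To make the relationship between the two arguments concrete, the sketch below splits a pattern written in the TeX-style notation commonly found in pattern files (digits interleaved with letters, such as "hy3ph") into a letter string and a digit string. It assumes one value per interletter position, with '0' where no digit is written. `PatternSplitSketch` is a hypothetical helper for illustration; whether PatternParser derives its arguments exactly this way is an assumption.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Lucene.NET.
public static class PatternSplitSketch
{
    // Splits e.g. "hy3ph" into the letters "hyph" and the values "00300".
    // Assumes at most one digit between two letters.
    public static (string Pattern, string IValue) Split(string texPattern)
    {
        var letters = new StringBuilder();
        var values = new StringBuilder();
        bool pendingDigit = false; // true when the previous char was a digit

        foreach (char c in texPattern)
        {
            if (c >= '0' && c <= '9')
            {
                values.Append(c);
                pendingDigit = true;
            }
            else
            {
                // No explicit digit before this letter: the value is 0.
                if (!pendingDigit) values.Append('0');
                letters.Append(c);
                pendingDigit = false;
            }
        }

        // Position after the last letter.
        if (!pendingDigit) values.Append('0');
        return (letters.ToString(), values.ToString());
    }
}
```

Under these assumptions, the resulting pair could be passed on as `AddPattern(pattern, ivalue)`; the value string is always one character longer than the letter string.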
FindPattern(String)
Declaration
public virtual string FindPattern(string pat)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | pat | |
Returns
| Type | Description |
|---|---|
| System.String | |
GetValues(Int32)
Declaration
protected virtual byte[] GetValues(int k)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Int32 | k | |
Returns
| Type | Description |
|---|---|
| System.Byte[] | |
HStrCmp(Char[], Int32, Char[], Int32)
String comparison. Returns 0 if the strings are equal or if t is a substring of s.
Declaration
protected virtual int HStrCmp(char[] s, int si, char[] t, int ti)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Char[] | s | |
| System.Int32 | si | |
| System.Char[] | t | |
| System.Int32 | ti | |
Returns
| Type | Description |
|---|---|
| System.Int32 | |
Hyphenate(Char[], Int32, Int32, Int32, Int32)
Hyphenate word and return an array of hyphenation points.
Declaration
public virtual Hyphenation Hyphenate(char[] w, int offset, int len, int remainCharCount, int pushCharCount)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Char[] | w | char array that contains the word |
| System.Int32 | offset | Offset to first character in word |
| System.Int32 | len | Length of word |
| System.Int32 | remainCharCount | Minimum number of characters allowed before the hyphenation point. |
| System.Int32 | pushCharCount | Minimum number of characters allowed after the hyphenation point. |
Returns
| Type | Description |
|---|---|
| Hyphenation | a Hyphenation object representing the hyphenated word or null if word is not hyphenated. |
Remarks
w = "*nnllllllnnn", where n is a non-letter and l is a letter. All n may be absent; the first n is at offset and the first l is at offset + iIgnoreAtBeginning.
word = ".llllll.'\0'*", where all l in w are copied into word.
In the first part of the routine, len = w.length; in the second part of the routine, len = word.length.
Three indices are used:
index(w), the index in w;
index(word), the index in word;
letterindex(word), the index in the letter part of word.
The following relations exist:
index(w) = offset + i - 1
index(word) = i - iIgnoreAtBeginning
letterindex(word) = index(word) - 1 (see first loop)
It follows that:
index(w) - index(word) = offset - 1 + iIgnoreAtBeginning
index(w) = letterindex(word) + offset + iIgnoreAtBeginning
Hyphenate(String, Int32, Int32)
Hyphenate word and return a Hyphenation object.
Declaration
public virtual Hyphenation Hyphenate(string word, int remainCharCount, int pushCharCount)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | word | the word to be hyphenated |
| System.Int32 | remainCharCount | Minimum number of characters allowed before the hyphenation point. |
| System.Int32 | pushCharCount | Minimum number of characters allowed after the hyphenation point. |
Returns
| Type | Description |
|---|---|
| Hyphenation | a Hyphenation object representing the hyphenated word or null if word is not hyphenated. |
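The effect of `remainCharCount` and `pushCharCount` can be sketched independently of the tree. The example below assumes the rule from Liang's hyphenation algorithm, in which an odd interletter value marks a permitted break, and then discards breaks that would leave too few characters on either side. `HyphenPointSketch` and its indexing convention (`interletterValues[i]` is the value between letter `i - 1` and letter `i`) are assumptions made for this illustration, not the class's actual internals.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET.
public static class HyphenPointSketch
{
    // Returns the positions i (number of characters before the break)
    // at which the word may be split.
    public static int[] SelectPoints(
        byte[] interletterValues, int wordLength,
        int remainCharCount, int pushCharCount)
    {
        var points = new List<int>();
        for (int i = 1; i < wordLength; i++)
        {
            bool breakAllowed = (interletterValues[i] & 1) == 1; // odd value
            bool enoughBefore = i >= remainCharCount;
            bool enoughAfter = i <= wordLength - pushCharCount;
            if (breakAllowed && enoughBefore && enoughAfter)
            {
                points.Add(i);
            }
        }
        return points.ToArray();
    }
}
```

For a six-letter word with the values `{0, 1, 3, 0, 1, 1, 0}`, odd values occur at positions 1, 2, 4 and 5; with `remainCharCount = 2` and `pushCharCount = 2`, only positions 2 and 4 remain.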
LoadPatterns(FileInfo)
Read hyphenation patterns from an XML file.
Declaration
public virtual void LoadPatterns(FileInfo f)
Parameters
| Type | Name | Description |
|---|---|---|
| System.IO.FileInfo | f | a System.IO.FileInfo object representing the file |
Exceptions
| Type | Condition |
|---|---|
| System.IO.IOException | In case the parsing fails |
LoadPatterns(FileInfo, Encoding)
Read hyphenation patterns from an XML file.
Declaration
public virtual void LoadPatterns(FileInfo f, Encoding encoding)
Parameters
| Type | Name | Description |
|---|---|---|
| System.IO.FileInfo | f | a System.IO.FileInfo object representing the file |
| System.Text.Encoding | encoding | The character encoding to use |
Exceptions
| Type | Condition |
|---|---|
| System.IO.IOException | In case the parsing fails |
LoadPatterns(Stream)
Read hyphenation patterns from an XML file.
Declaration
public virtual void LoadPatterns(Stream source)
Parameters
| Type | Name | Description |
|---|---|---|
| System.IO.Stream | source | System.IO.Stream input source for the file |
Exceptions
| Type | Condition |
|---|---|
| System.IO.IOException | In case the parsing fails |
LoadPatterns(Stream, Encoding)
Read hyphenation patterns from an XML file.
Declaration
public virtual void LoadPatterns(Stream source, Encoding encoding)
Parameters
| Type | Name | Description |
|---|---|---|
| System.IO.Stream | source | System.IO.Stream input source for the file |
| System.Text.Encoding | encoding | The character encoding to use |
Exceptions
| Type | Condition |
|---|---|
| System.IO.IOException | In case the parsing fails |
LoadPatterns(String)
Read hyphenation patterns from an XML file.
Declaration
public virtual void LoadPatterns(string filename)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | filename | the filename |
Exceptions
| Type | Condition |
|---|---|
| System.IO.IOException | In case the parsing fails |
LoadPatterns(String, Encoding)
Read hyphenation patterns from an XML file.
Declaration
public virtual void LoadPatterns(string filename, Encoding encoding)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | filename | the filename |
| System.Text.Encoding | encoding | The character encoding to use |
Exceptions
| Type | Condition |
|---|---|
| System.IO.IOException | In case the parsing fails |
LoadPatterns(XmlReader)
Read hyphenation patterns from a System.Xml.XmlReader.
Declaration
public virtual void LoadPatterns(XmlReader source)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Xml.XmlReader | source | System.Xml.XmlReader input source for the file |
Exceptions
| Type | Condition |
|---|---|
| System.IO.IOException | In case the parsing fails |
PackValues(String)
Packs the values by storing each one in 4 bits, two values per byte. Values range from 0 to 9. Zero is used as the terminator, so 1 is added to each value before it is stored.
Declaration
protected virtual int PackValues(string values)
Parameters
| Type | Name | Description |
|---|---|---|
| System.String | values | a string of digits from '0' to '9' representing the interletter values. |
Returns
| Type | Description |
|---|---|
| System.Int32 | the index into the vspace array where the packed values are stored. |
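The packing scheme described above can be sketched in isolation. The example below writes into a fresh byte array rather than into `m_vspace`, and it places the first value of each pair in the high nibble, which is an assumption about byte layout. `NibblePackSketch` is a hypothetical helper, not the actual implementation.

```csharp
using System;
using System.Text;

// Hypothetical helper, not part of Lucene.NET.
public static class NibblePackSketch
{
    // Packs digits '0'..'9' two per byte, storing value + 1 in each nibble
    // so that a zero nibble can serve as the terminator.
    public static byte[] Pack(string values)
    {
        int n = values.Length;
        // One byte per pair of digits, plus room for a zero terminator byte.
        int m = (n & 1) == 1 ? (n >> 1) + 2 : (n >> 1) + 1;
        var packed = new byte[m];
        for (int i = 0; i < n; i++)
        {
            int v = (values[i] - '0' + 1) & 0x0f;
            if ((i & 1) == 0)
            {
                packed[i >> 1] = (byte)(v << 4); // even index: high nibble
            }
            else
            {
                packed[i >> 1] |= (byte)v;       // odd index: low nibble
            }
        }
        return packed; // the trailing byte stays 0 and terminates the run
    }

    // Reverses Pack: reads nibbles until a zero nibble is found.
    public static string Unpack(byte[] packed)
    {
        var sb = new StringBuilder();
        foreach (byte b in packed)
        {
            int high = b >> 4;
            if (high == 0) break;
            sb.Append((char)(high - 1 + '0'));
            int low = b & 0x0f;
            if (low == 0) break;
            sb.Append((char)(low - 1 + '0'));
        }
        return sb.ToString();
    }
}
```

In this sketch, `Pack("00300")` produces the bytes `0x11, 0x41, 0x10, 0x00`, and `Unpack` restores the original digit string.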
SearchPatterns(Char[], Int32, Byte[])
Search for all possible partial matches of word starting at index and update the interletter values. In other words, it does something like:
for (int i = 0; i < patterns.Length; i++)
{
    if (word.Substring(index).StartsWith(patterns[i], StringComparison.Ordinal))
    {
        update_interletter_values(patterns[i]);
    }
}
But it is done efficiently because the patterns are stored in a ternary tree. In fact, this is the whole purpose of the tree: performing this search without testing every single pattern. The number of patterns for languages such as English ranges from 4000 to 10000, so doing thousands of string comparisons for each word to hyphenate would be very slow without the tree. The tradeoff is memory, but using a ternary tree instead of a trie almost halves the memory used by Lout or TeX. It is also faster than using a hash table.
Declaration
protected virtual void SearchPatterns(char[] word, int index, byte[] il)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Char[] | word | null-terminated word to match |
| System.Int32 | index | start index from word |
| System.Byte[] | il | interletter values array to update |
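For comparison, the naive loop that this method replaces can be written as a runnable sketch. It assumes that updating the interletter values means keeping the maximum of the existing value and the pattern's value at each position, as in Liang's algorithm. `NaiveSearchSketch`, its dictionary of patterns, and its `string` word parameter are hypothetical and stand in for the ternary tree and null-terminated `char[]` used by the real method.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of Lucene.NET: tests every pattern,
// which is exactly the cost the ternary tree avoids.
public static class NaiveSearchSketch
{
    // patterns maps pattern letters (e.g. "hyph") to interletter digits
    // (e.g. "00300"); il must have at least word.Length + 1 entries.
    public static void SearchPatterns(
        string word, int index,
        IReadOnlyDictionary<string, string> patterns, byte[] il)
    {
        foreach (KeyValuePair<string, string> entry in patterns)
        {
            string letters = entry.Key;
            // Does the word, starting at index, begin with this pattern?
            if (string.CompareOrdinal(word, index, letters, 0, letters.Length) != 0)
            {
                continue;
            }

            // Merge the pattern's values, keeping the larger one.
            string digits = entry.Value;
            for (int k = 0; k < digits.Length; k++)
            {
                byte v = (byte)(digits[k] - '0');
                if (v > il[index + k]) il[index + k] = v;
            }
        }
    }
}
```

Calling this once per start index of a word visits every pattern each time, so the work grows with the number of patterns; the tree-based search only follows the branches that match the word's characters.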
UnpackValues(Int32)
Declaration
protected virtual string UnpackValues(int k)
Parameters
| Type | Name | Description |
|---|---|---|
| System.Int32 | k | |
Returns
| Type | Description |
|---|---|
| System.String | |