Class CJKTokenizer
CJKTokenizer is designed for Chinese, Japanese, and Korean languages.
The tokens returned are overlapping pairs of adjacent characters (bigrams).
Example: "java C1C2C3C4" is segmented into: "java" "C1C2" "C2C3" "C3C4".
Additionally, the following is applied to Latin text (such as English):
- Text is converted to lowercase.
- Numeric digits, '+', '#', and '_' are tokenized as letters.
- Full-width forms are converted to half-width forms.
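The rules above can be illustrated with a small, self-contained sketch. This is not the library's implementation: `CjkBigramSketch` and its helpers are hypothetical names, any non-ASCII letter is treated as a CJK character, surrogate pairs are ignored, and an isolated CJK character is assumed to be emitted as a single-character token.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical helper for illustration only; not part of Lucene.NET.
public static class CjkBigramSketch
{
    public static List<string> Tokenize(string text)
    {
        var tokens = new List<string>();
        var latin = new StringBuilder(); // current run of Latin-style characters
        var cjk = new StringBuilder();   // current run of CJK characters

        foreach (char raw in text)
        {
            char c = raw;
            // Fold full-width ASCII variants (U+FF01..U+FF5E) to half-width.
            if (c >= '\uFF01' && c <= '\uFF5E') c = (char)(c - 0xFEE0);

            if (c < 128 && (char.IsLetterOrDigit(c) || c == '+' || c == '#' || c == '_'))
            {
                // Digits, '+', '#', and '_' count as letters; text is lowercased.
                FlushCjk(cjk, tokens);
                latin.Append(char.ToLowerInvariant(c));
            }
            else if (c >= 128 && char.IsLetter(c))
            {
                // Assumption: every non-ASCII letter is handled as CJK.
                FlushLatin(latin, tokens);
                cjk.Append(c);
            }
            else
            {
                // Whitespace and punctuation end the current run.
                FlushLatin(latin, tokens);
                FlushCjk(cjk, tokens);
            }
        }

        FlushLatin(latin, tokens);
        FlushCjk(cjk, tokens);
        return tokens;
    }

    private static void FlushLatin(StringBuilder run, List<string> tokens)
    {
        if (run.Length > 0) tokens.Add(run.ToString());
        run.Clear();
    }

    private static void FlushCjk(StringBuilder run, List<string> tokens)
    {
        // Assumption: a lone CJK character becomes a one-character token.
        if (run.Length == 1) tokens.Add(run.ToString());

        // Emit every pair of adjacent characters, overlapping by one.
        for (int i = 0; i + 1 < run.Length; i++)
        {
            tokens.Add(run.ToString(i, 2));
        }
        run.Clear();
    }
}
```

With this sketch, `CjkBigramSketch.Tokenize("Java 日本語学")` yields "java", "日本", "本語", "語学", mirroring the "java C1C2C3C4" example.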
Namespace: Lucene.Net.Analysis.Cjk
Assembly: Lucene.Net.Analysis.Common.dll
Syntax
[Obsolete("Use StandardTokenizer, CJKWidthFilter, CJKBigramFilter, and LowerCaseFilter instead.")]
public sealed class CJKTokenizer : Tokenizer, IDisposable
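Because the class is marked obsolete, new code would normally build the filter chain named in the attribute message. The following is a minimal sketch of such a chain, assuming Lucene.NET 4.8 and `LuceneVersion.LUCENE_48`; the class name `CjkReplacementAnalyzer` is hypothetical, and the filter order is one reasonable choice rather than a prescribed one.

```csharp
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Cjk;
using Lucene.Net.Analysis.Core;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Util;

// Hypothetical analyzer name; assumes the Lucene.NET 4.8 packages are referenced.
public sealed class CjkReplacementAnalyzer : Analyzer
{
    private const LuceneVersion MatchVersion = LuceneVersion.LUCENE_48;

    protected override TokenStreamComponents CreateComponents(string fieldName, TextReader reader)
    {
        // StandardTokenizer splits the input into words and single CJK characters.
        Tokenizer source = new StandardTokenizer(MatchVersion, reader);

        // CJKWidthFilter folds full-width forms to half-width forms.
        TokenStream result = new CJKWidthFilter(source);

        // LowerCaseFilter lowercases Latin text.
        result = new LowerCaseFilter(MatchVersion, result);

        // CJKBigramFilter joins adjacent CJK characters into overlapping bigrams.
        result = new CJKBigramFilter(result);

        return new TokenStreamComponents(source, result);
    }
}
```

The output of this chain is not guaranteed to match CJKTokenizer token for token; for example, StandardTokenizer does not treat '+' and '#' as letters.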
Constructors
CJKTokenizer(AttributeFactory, TextReader)
Constructs a token stream processing the given input, using the given attribute factory to create token attributes.
Declaration
public CJKTokenizer(AttributeSource.AttributeFactory factory, TextReader @in)
Parameters
Type | Name | Description |
---|---|---|
AttributeSource.AttributeFactory | factory | Attribute factory used to create token attributes |
TextReader | in | I/O reader |
CJKTokenizer(TextReader)
Construct a token stream processing the given input.
Declaration
public CJKTokenizer(TextReader @in)
Parameters
Type | Name | Description |
---|---|---|
TextReader | in | I/O reader |
Methods
End()
This method is called by the consumer after the last token has been
consumed, after Lucene.Net.Analysis.TokenStream.IncrementToken() returned false
(using the new Lucene.Net.Analysis.TokenStream API). Streams implementing the old API
should upgrade to use this feature.
If you override this method, always call base.End().
Declaration
public override sealed void End()
Exceptions
Type | Condition |
---|---|
IOException | If an I/O error occurs |
IncrementToken()
Advances to the next token in the stream. Returns true if a token is available, or false at the end of the stream (EOS). See http://java.sun.com/j2se/1.3/docs/api/java/lang/Character.UnicodeBlock.html for details.
Declaration
public override bool IncrementToken()
Returns
Type | Description |
---|---|
bool | false for end of stream, true otherwise |
Exceptions
Type | Condition |
---|---|
IOException | If a read error occurs in the input reader |
Reset()
This method is called by a consumer before it begins consumption using Lucene.Net.Analysis.TokenStream.IncrementToken().
Resets this stream to a clean state. Stateful implementations must implement this method so that they can be reused, just as if they had been created fresh. If you override this method, always call base.Reset(); otherwise some internal state will not be correctly reset (e.g., Lucene.Net.Analysis.Tokenizer will throw InvalidOperationException on further usage).
Declaration
public override void Reset()
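The consumer workflow described for Reset(), IncrementToken(), and End() can be sketched as follows. This is a minimal example that assumes the Lucene.NET 4.8 packages are referenced; the class name `CjkTokenizerDemo` and the sample input are illustrative only.

```csharp
using System;
using System.IO;
using Lucene.Net.Analysis.Cjk;
using Lucene.Net.Analysis.TokenAttributes;

// Hypothetical demo class; assumes Lucene.NET 4.8 is referenced.
public static class CjkTokenizerDemo
{
    public static void Main()
    {
#pragma warning disable CS0618 // CJKTokenizer is marked [Obsolete]
        using (var tokenizer = new CJKTokenizer(new StringReader("Java 日本語学")))
#pragma warning restore CS0618
        {
            // Obtain the attribute that exposes each token's text.
            ICharTermAttribute term = tokenizer.AddAttribute<ICharTermAttribute>();

            // Reset() must be called before the first IncrementToken().
            tokenizer.Reset();

            // IncrementToken() returns false at the end of the stream.
            while (tokenizer.IncrementToken())
            {
                Console.WriteLine(term.ToString());
            }

            // End() is called once, after the last token has been consumed.
            tokenizer.End();
        } // Dispose() releases the underlying reader.
    }
}
```

Skipping Reset() is the usual cause of the InvalidOperationException mentioned above, so the call order shown here is worth keeping even in short test code.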