
    Class SoraniNormalizer

    Normalizes the Unicode representation of Sorani text.

    Normalization consists of:

    • Alternate forms of 'y' (064A, 0649) are converted to 06CC (FARSI YEH)
    • Alternate form of 'k' (0643) is converted to 06A9 (KEHEH)
    • Alternate forms of vowel 'e' (0647+200C, word-final 0647, 0629) are converted to 06D5 (AE)
    • Alternate (joining) form of 'h' (06BE) is converted to 0647
    • Alternate forms of 'rr' (0692, word-initial 0631) are converted to 0695 (REH WITH SMALL V BELOW)
    • Harakat, tatweel, and formatting characters such as directional controls are removed.
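    The rules above can be sketched as a single in-place pass over a character buffer. This is an illustrative reconstruction from the list, not the library's source: the class name `SoraniRulesSketch` and the helpers `Delete` and `Apply` are hypothetical. It assumes that the buffer holds a single word (so "word-initial" and "word-final" mean the first and last buffer positions), that harakat means the range 064B to 0652, and that "formatting characters" means the Unicode Format (Cf) category.

```csharp
using System;
using System.Globalization;

// Illustrative sketch only; not the Lucene.NET implementation.
public static class SoraniRulesSketch
{
    // Hypothetical helper: removes s[pos] by shifting the tail left; returns the new length.
    private static int Delete(char[] s, int pos, int len)
    {
        if (pos < len - 1)
            Array.Copy(s, pos + 1, s, pos, len - pos - 1);
        return len - 1;
    }

    public static int Normalize(char[] s, int len)
    {
        for (int i = 0; i < len; i++)
        {
            switch (s[i])
            {
                case '\u064A': // YEH
                case '\u0649': // ALEF MAKSURA (dotless yeh)
                    s[i] = '\u06CC'; // FARSI YEH
                    break;
                case '\u0643': // KAF
                    s[i] = '\u06A9'; // KEHEH
                    break;
                case '\u200C': // ZWNJ: a preceding HEH becomes AE, then the ZWNJ is dropped
                    if (i > 0 && s[i - 1] == '\u0647') s[i - 1] = '\u06D5';
                    len = Delete(s, i, len);
                    i--;
                    break;
                case '\u0647': // HEH in the last buffer position becomes AE
                    if (i == len - 1) s[i] = '\u06D5';
                    break;
                case '\u0629': // TEH MARBUTA
                    s[i] = '\u06D5'; // AE
                    break;
                case '\u06BE': // HEH DOACHASHMEE (joining form of 'h')
                    s[i] = '\u0647'; // HEH
                    break;
                case '\u0631': // REH in the first buffer position
                    if (i == 0) s[i] = '\u0695';
                    break;
                case '\u0692': // REH WITH SMALL V (above)
                    s[i] = '\u0695'; // REH WITH SMALL V BELOW
                    break;
                default:
                    // Assumption: tatweel (0640), harakat (064B..0652), and Cf characters are removed.
                    bool isTatweel = s[i] == '\u0640';
                    bool isHarakat = s[i] >= '\u064B' && s[i] <= '\u0652';
                    bool isFormat = char.GetUnicodeCategory(s[i]) == UnicodeCategory.Format;
                    if (isTatweel || isHarakat || isFormat)
                    {
                        len = Delete(s, i, len);
                        i--;
                    }
                    break;
            }
        }
        return len;
    }

    // Convenience wrapper for experimenting with strings.
    public static string Apply(string text)
    {
        char[] buffer = text.ToCharArray();
        int newLength = Normalize(buffer, buffer.Length);
        return new string(buffer, 0, newLength);
    }
}
```

    For instance, `SoraniRulesSketch.Apply("\u0647\u200C")` yields the single character 06D5, because the HEH before the ZWNJ is rewritten to AE and the ZWNJ itself is deleted.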

    Inheritance
    System.Object
    SoraniNormalizer
    Namespace: Lucene.Net.Analysis.Ckb
    Assembly: Lucene.Net.Analysis.Common.dll
    Syntax
    public class SoraniNormalizer : object

    Methods


    Normalize(Char[], Int32)

    Normalizes an input buffer of Sorani text.

    Declaration
    public virtual int Normalize(char[] s, int len)
    Parameters
    Type             Name    Description
    System.Char[]    s       input buffer
    System.Int32     len     length of input buffer

    Returns
    Type             Description
    System.Int32     length of input buffer after normalization
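
    A minimal usage sketch follows. It assumes a project reference to the Lucene.Net.Analysis.Common assembly and a public parameterless constructor on SoraniNormalizer (this page does not list constructors). The wrapper class `SoraniNormalizerUsage` and its method `NormalizeWord` are hypothetical names used only for illustration. Because the method returns the length after normalization, the sketch treats the buffer as modified in place and reads only the first returned-length characters.

```csharp
using System;
using Lucene.Net.Analysis.Ckb;

public static class SoraniNormalizerUsage
{
    // Hypothetical helper: normalizes one word and returns it as a new string.
    public static string NormalizeWord(string word)
    {
        // Normalize works on a char[] plus a length, so copy the string into a buffer.
        char[] buffer = word.ToCharArray();

        // Assumption: SoraniNormalizer exposes a public parameterless constructor.
        var normalizer = new SoraniNormalizer();

        // The return value is the buffer length after normalization.
        int newLength = normalizer.Normalize(buffer, buffer.Length);

        // Only the first newLength characters are meaningful after the call.
        return new string(buffer, 0, newLength);
    }
}
```

    Building the result from `newLength` rather than `buffer.Length` matters because removed characters (harakat, tatweel, formatting characters) shorten the text while the array itself keeps its original size.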

    Copyright © 2020 Licensed to the Apache Software Foundation (ASF)