Page 341 - AI Computer 10

Text Normalisation

        Text Normalisation is the process of transforming text into a standard form. It reduces the complexity
        of textual data to improve the accuracy and efficiency of later stages, such as sentiment analysis. Normalisation
        works on a group of documents, called a corpus.
        The various steps used to normalise textual data are:

        Sentence Segmentation / Sentence Tokenisation

        Sentence Segmentation refers to the process of splitting the entire corpus into sentences. Each sentence is taken
        as a different data element.
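
        The splitting step above can be sketched in Python as follows. This is a minimal illustration, not a
        definitive method: it assumes that a sentence ends at a full stop, exclamation mark, or question mark
        followed by whitespace, so abbreviations such as "Dr." would be split incorrectly. The function name
        segment_sentences and the sample corpus are hypothetical.

```python
import re

def segment_sentences(corpus):
    """Split every document in the corpus into a flat list of sentences.

    Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    """
    sentences = []
    for document in corpus:
        # Split at whitespace that comes right after sentence-ending punctuation.
        parts = re.split(r"(?<=[.!?])\s+", document.strip())
        # Skip empty strings produced by empty documents.
        sentences.extend(part for part in parts if part)
    return sentences

# Hypothetical two-document corpus used only for illustration.
corpus = [
    "Raj likes to play football. Vijay prefers online games!",
    "Is normalisation the first step? Yes, it is.",
]
sentences = segment_sentences(corpus)
for sentence in sentences:
    print(sentence)
```

        Each printed line is one sentence, which is then treated as a separate data element.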

        Sentence Segmentation is followed by Tokenisation. It is the process of dividing each sentence further into
        smaller units called Tokens. A Token is a term used for any word, number, or special character occurring in a
        sentence. After Tokenisation, every word, number and special character is considered separately.
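
        Tokenisation of a single sentence can be sketched as shown here. The sketch assumes that a token is either
        a run of letters and digits or a single punctuation character, so words containing an apostrophe would be
        split into several tokens. The function name tokenise and the sample sentence are hypothetical.

```python
import re

def tokenise(sentence):
    """Return the words, numbers and special characters of a sentence as separate tokens."""
    # \w+ matches a run of letters, digits or underscores (a word or a number);
    # [^\w\s] matches one character that is neither a word character nor whitespace.
    return re.findall(r"\w+|[^\w\s]", sentence)

# Hypothetical sample sentence used only for illustration.
tokens = tokenise("Vijay scored 2 goals in 90 minutes!")
print(tokens)
```

        In the resulting list, every word, each number, and the exclamation mark appear as separate tokens.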

        Elimination of Stopwords, Special Characters, and Numbers

        Once the whole corpus has been segmented into sentences and then into tokens, only some of those tokens are
        needed. In this step, the unnecessary tokens are removed from the token list. These unnecessary tokens are
        stopwords, special characters, and numbers.
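
        The removal step can be sketched as follows. The stopword set used here is a small, hand-picked assumption
        made only for illustration, and the function name remove_unnecessary_tokens is hypothetical. A token is
        dropped if it is a stopword, if it consists only of digits, or if it contains any character that is not a
        letter or a digit.

```python
# Hypothetical, hand-picked stopword set (an assumption, not a standard list).
STOPWORDS = {"a", "an", "the", "is", "in", "to", "and", "of"}

def remove_unnecessary_tokens(tokens):
    """Keep only tokens that are not stopwords, numbers or special characters."""
    kept = []
    for token in tokens:
        if token.lower() in STOPWORDS:
            continue  # stopword, compared without regard to case
        if token.isdigit():
            continue  # number made up only of digits
        if not token.isalnum():
            continue  # special character or symbol
        kept.append(token)
    return kept

# Hypothetical token list, as it might come out of the tokenisation step.
tokens = ["The", "team", "scored", "2", "goals", "in", "the", "match", "!"]
filtered = remove_unnecessary_tokens(tokens)
print(filtered)
```

        Only the tokens that carry the content of the sentence remain in the filtered list.
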
        Stopwords are words that occur very frequently in the corpus but add little value to it. Humans use
        grammar to make their sentences meaningful, but grammatical words such as articles, prepositions, and
        connectors add little to the meaning that the statement is meant to convey, hence they come under
        the category of stopwords. Some examples of stopwords are:
