Page 342 - AI Computer 10

Along with stopwords, a corpus might contain special characters and/or numbers. The decision to keep or discard these special symbols and numbers depends on the type of document being worked on. For example, if a document contains email IDs, then removing the special character “@” would be harmful. Similarly, in a document containing invoice numbers or bill amounts, the digits are relevant. Thus, care should be taken when eliminating such tokens.
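As a minimal sketch of this idea (the helper name and character whitelist are assumptions for illustration), punctuation can be stripped selectively so that email IDs and invoice digits survive:

```python
import re

def clean_token(token):
    # Keep letters, digits, "@" and "."; drop other special characters,
    # so email IDs and invoice numbers are preserved.
    return re.sub(r"[^A-Za-z0-9@.]", "", token)

tokens = ["Hello!!", "user@example.com", "Invoice#4521", "$100"]
print([clean_token(t) for t in tokens])
# ['Hello', 'user@example.com', 'Invoice4521', '100']
```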
            Changing Letter Case

After eliminating stop words from the tokens, we convert the whole text to lowercase to remove the issue of case-sensitivity: a machine would otherwise treat the same word written in different cases as different words. Thus, the words “TIGER”, “Tiger”, “tiger”, “TiGeR” and “TigeR” will all be converted to “tiger”.
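This step can be sketched in a few lines of Python using the built-in `str.lower()` method:

```python
# Normalising case: every spelling of "tiger" maps to one token.
tokens = ["TIGER", "Tiger", "tiger", "TiGeR", "TigeR"]
lowered = [t.lower() for t in tokens]
print(set(lowered))  # {'tiger'}
```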
            Stemming and Lemmatisation

Stemming is an elementary, rule-based process that removes the affixes of words and reduces them to their base form, called the stem. For example, laughing, laughed, laughs and laugh will all become laugh after the stemming process.

                                               Word           Affixes          Stem
                                              Laughs             -s            Laugh
                                              Laughed           -ed            Laugh

                                             Laughing           -ing           Laugh
                                               Caring           -ing            Car
                                                Tries           -es              Tri

You should remember that stemming is not always a good approach for normalisation, as some words will not be meaningful after the stemming phase. For example, ‘tak’, the stemmed form of “taking”, is not a meaningful word.
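A toy rule-based stemmer can reproduce the table above (this is a deliberately simple sketch, not the Porter algorithm used by real stemmers; the function name and suffix list are illustrative assumptions):

```python
def simple_stem(word):
    # Strip the first matching suffix; a toy rule-based stemmer that,
    # like real stemming, can produce non-words such as "car" and "tak".
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

for w in ["laughs", "laughed", "laughing", "caring", "tries", "taking"]:
    print(w, "->", simple_stem(w))
# laughs -> laugh, laughed -> laugh, laughing -> laugh,
# caring -> car, tries -> tri, taking -> tak
```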
Lemmatisation is a systematic process of removing the affixes of a word and transforming it into its root form, called the lemma. It ensures that the lemma is a meaningful word, and hence it takes longer to execute than stemming. For example, ‘take’ is the lemma of ‘taking’.

                                               Word           Affixes         Lemma
                                              Laughs             -s            Laugh
                                              Laughed           -ed            Laugh
                                             Laughing           -ing           Laugh
                                               Caring           -ing            Care
                                                Tries           -es             Try
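One way to picture the difference from stemming is a dictionary lookup (the lookup table below is a hypothetical miniature; real lemmatisers such as NLTK's WordNetLemmatizer consult a full lexicon to guarantee the lemma is a valid word):

```python
# Hypothetical mini-lexicon mapping inflected forms to their lemmas.
LEMMA_TABLE = {
    "laughs": "laugh", "laughed": "laugh", "laughing": "laugh",
    "caring": "care", "tries": "try", "taking": "take",
}

def lemmatise(word):
    # Fall back to the word itself if it is not in the lexicon.
    return LEMMA_TABLE.get(word, word)

print(lemmatise("caring"))  # care
print(lemmatise("taking"))  # take
```

Note how ‘caring’ becomes the meaningful word ‘care’, where the stemmer above produced the non-word ‘car’.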

            BAG OF WORDS

Bag of Words (BoW) is a simple and popular method of feature extraction from text documents; the extracted features can be used for training machine learning algorithms. In this approach, we take the tokenised words of each observation and determine how many times each word is used in the corpus. The Bag of Words algorithm returns:
 • a vocabulary of words for the corpus.
 • the frequency of these words.
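Both outputs can be produced with Python's standard-library `collections.Counter` (the sample corpus below is an illustrative assumption):

```python
from collections import Counter

# A minimal Bag of Words sketch: build the vocabulary of a small corpus
# and count how often each token occurs.
corpus = "the tiger chased the deer and the deer ran"
tokens = corpus.split()
bow = Counter(tokens)

print(sorted(bow))               # vocabulary
print(bow["the"], bow["deer"])   # frequencies: 3 2
```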

