
For example, suppose we have two sentences in the text document.

         • “binary code is an efficient language.”
         • “binary code is a code used in digital computers.”
        Now, each sentence is segmented into tokens, excluding the punctuation marks. Thus, we make a list of all
        the unique tokens as:

        ‘binary’, ‘code’, ‘is’, ‘an’, ‘efficient’, ‘language’, ‘a’, ‘used’, ‘in’, ‘digital’, ‘computers’
        Now, the Bag of Words algorithm creates a vector by determining the frequency of each word or token in the
        whole corpus, like:

        Word:       binary  code  is  an  efficient  language  a  used  in  digital  computers
        Frequency:    2      3    2   1      1          1      1   1    1     1         1

        Thus, we can say that the BoW algorithm creates a vocabulary of all the unique words occurring in all the
        documents in the training set.
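        To make the idea concrete, here is a minimal Python sketch of this process for the two sentences above,
        using only the standard library; the variable names are illustrative, not from the book.

        from collections import Counter
        import string

        corpus = [
            "binary code is an efficient language.",
            "binary code is a code used in digital computers",
        ]

        # Segment each sentence into tokens, dropping the punctuation marks.
        tokens = []
        for sentence in corpus:
            cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
            tokens.extend(cleaned.lower().split())

        # Vocabulary of all the unique words, in order of first appearance.
        vocabulary = list(dict.fromkeys(tokens))
        print(vocabulary)
        # ['binary', 'code', 'is', 'an', 'efficient', 'language', 'a', 'used', 'in', 'digital', 'computers']

        # Frequency of each token in the whole corpus.
        print(Counter(tokens))
        # Counter({'code': 3, 'binary': 2, 'is': 2, 'an': 1, 'efficient': 1, 'language': 1,
        #          'a': 1, 'used': 1, 'in': 1, 'digital': 1, 'computers': 1})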

              Knowledge Bot
          Bag of Words (BoW) is so named because it is analogous to a bag containing all the words in a text.


        Bag of Words Algorithm

        To implement the Bag of Words algorithm, follow the given steps:
          Step I:  Text Normalisation

                    The whole corpus is segmented into tokens, and the removal of stopwords and other symbols is
                    carried out.
          Step II:  Create Dictionary

                     Make a list of all the unique words occurring in the corpus.
          Step III:  Create Vectors for each Document

                     Generate a vector for a single document by determining the frequency of each word of the
                    dictionary in that document.
           Step IV:  Create Document Vectors for all the Documents
                     The last step is to repeat the previous step so that such vectors exist for all the documents
                    in the corpus.
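
        The four steps above can be put together in a few lines of Python. This is only a minimal sketch, not the
        book's program: the function name bag_of_words, the whitespace tokeniser, and the empty default stopword
        set are assumptions made for the illustration.

        import string

        def bag_of_words(corpus, stopwords=frozenset()):
            # Step I: Text Normalisation - lowercase each document, strip
            # punctuation marks, segment into tokens, and drop any stopwords.
            normalised = []
            for document in corpus:
                cleaned = document.translate(str.maketrans("", "", string.punctuation))
                tokens = [t for t in cleaned.lower().split() if t not in stopwords]
                normalised.append(tokens)

            # Step II: Create Dictionary - every unique word in the corpus,
            # kept in order of first appearance.
            dictionary = list(dict.fromkeys(t for tokens in normalised for t in tokens))

            # Steps III and IV: for each document, count how often each
            # dictionary word occurs in it to build that document's vector.
            vectors = [[tokens.count(word) for word in dictionary] for tokens in normalised]
            return dictionary, vectors

        dictionary, vectors = bag_of_words([
            "Subin and Sohan are friends.",
            "Subin went to school.",
            "Sohan went to park.",
        ])
        print(dictionary)  # ['subin', 'and', 'sohan', 'are', 'friends', 'went', 'to', 'school', 'park']
        print(vectors)     # [[1, 1, 1, 1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1, 1, 1, 0], [0, 0, 1, 0, 0, 1, 1, 0, 1]]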
        Example:

        Let us understand all the steps of BoW algorithm with the help of an example:
        Suppose a corpus contains three documents, such as:

              Document 1: Subin and Sohan are friends.
              Document 2: Subin went to school.
              Document 3: Sohan went to park.
        Step I: Text Normalisation

        After text normalisation, the text becomes:
              Document 1: [‘Subin’, ‘and’, ‘Sohan’, ‘are’, ‘friends’]
              Document 2: [‘Subin’, ‘went’, ‘to’, ‘school’]
              Document 3: [‘Sohan’, ‘went’, ‘to’, ‘park’]
        Notice that no tokens have been removed in the stopwords removal step, because we have very little data
        and the frequency of the words is almost the same.
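
        For reference, Step I can be reproduced with a few lines of Python. This is a minimal sketch under the
        assumption of a simple whitespace tokeniser; lowercasing is skipped here so that the output matches the
        token lists shown above.

        import string

        documents = [
            "Subin and Sohan are friends.",
            "Subin went to school.",
            "Sohan went to park.",
        ]

        for doc in documents:
            # Strip the punctuation marks, then segment the document into tokens.
            cleaned = doc.translate(str.maketrans("", "", string.punctuation))
            print(cleaned.split())

        # ['Subin', 'and', 'Sohan', 'are', 'friends']
        # ['Subin', 'went', 'to', 'school']
        # ['Sohan', 'went', 'to', 'park']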
