
For example, suppose we have two sentences in the text document.

         • “binary code is an efficient language.”
         • “binary code is a code used in digital computers.”
        Now, each sentence is segmented into tokens, excluding the punctuation marks. Thus, we make a list of all
        the unique tokens as:

        ‘binary’, ‘code’, ‘is’, ‘an’, ‘efficient’, ‘language’, ‘a’, ‘used’, ‘in’, ‘digital’, ‘computers’
        Now, the Bag of Words algorithm creates a vector by determining the frequency of each word or token in the
        whole corpus, like:

        Word:       binary  code  is  an  efficient  language  a  used  in  digital  computers
        Frequency:    2      3    2   1      1          1      1   1    1     1         1

        Thus, we can say that the BoW algorithm creates a vocabulary of all the unique words occurring in all the
        documents in the training set.
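        To make the idea concrete, here is a minimal Python sketch of this process for the two sentences above,
        using only the standard library; the variable names are illustrative, not from the book.

        from collections import Counter
        import string

        corpus = [
            "binary code is an efficient language.",
            "binary code is a code used in digital computers",
        ]

        # Segment each sentence into tokens, dropping the punctuation marks.
        tokens = []
        for sentence in corpus:
            cleaned = sentence.translate(str.maketrans("", "", string.punctuation))
            tokens.extend(cleaned.lower().split())

        # Vocabulary of all the unique words, in order of first appearance.
        vocabulary = list(dict.fromkeys(tokens))
        print(vocabulary)
        # ['binary', 'code', 'is', 'an', 'efficient', 'language', 'a', 'used', 'in', 'digital', 'computers']

        # Frequency of each token in the whole corpus.
        print(Counter(tokens))
        # Counter({'code': 3, 'binary': 2, 'is': 2, 'an': 1, 'efficient': 1, 'language': 1,
        #          'a': 1, 'used': 1, 'in': 1, 'digital': 1, 'computers': 1})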

              Knowledge Bot
          Bag of Words (BoW) is so named because it is analogous to a bag containing all the words in a text.


        Bag of Words Algorithm

        To implement the Bag of Words algorithm, follow the given steps:
          Step I:  Text Normalisation

                    The whole corpus is segmented into tokens, and the removal of stopwords and other symbols is
                    carried out.
          Step II:  Create Dictionary

                     Make a list of all the unique words occurring in the corpus.
          Step III:  Create Vectors for each Document

                     Generate a vector for a single document by determining the frequency of each word of the
                    dictionary in that document.
           Step IV:  Create Document Vectors for all the Documents
                     The last step is to repeat the previous step so that such vectors exist for all the documents
                    in the corpus.
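
        The four steps above can be put together in a few lines of Python. This is only a minimal sketch, not the
        book's program: the function name bag_of_words, the whitespace tokeniser, and the empty default stopword
        set are assumptions made for the illustration.

        import string

        def bag_of_words(corpus, stopwords=frozenset()):
            # Step I: Text Normalisation - lowercase each document, strip
            # punctuation marks, segment into tokens, and drop any stopwords.
            normalised = []
            for document in corpus:
                cleaned = document.translate(str.maketrans("", "", string.punctuation))
                tokens = [t for t in cleaned.lower().split() if t not in stopwords]
                normalised.append(tokens)

            # Step II: Create Dictionary - every unique word in the corpus,
            # kept in order of first appearance.
            dictionary = list(dict.fromkeys(t for tokens in normalised for t in tokens))

            # Steps III and IV: for each document, count how often each
            # dictionary word occurs in it to build that document's vector.
            vectors = [[tokens.count(word) for word in dictionary] for tokens in normalised]
            return dictionary, vectors

        dictionary, vectors = bag_of_words([
            "Subin and Sohan are friends.",
            "Subin went to school.",
            "Sohan went to park.",
        ])
        print(dictionary)  # ['subin', 'and', 'sohan', 'are', 'friends', 'went', 'to', 'school', 'park']
        print(vectors)     # [[1, 1, 1, 1, 1, 0, 0, 0, 0], [1, 0, 0, 0, 0, 1, 1, 1, 0], [0, 0, 1, 0, 0, 1, 1, 0, 1]]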
        Example:

        Let us understand all the steps of BoW algorithm with the help of an example:
        Suppose a corpus contains three documents, such as:

              Document 1: Subin and Sohan are friends.
              Document 2: Subin went to school.
              Document 3: Sohan went to park.
        Step I: Text Normalisation

        After text normalisation, the text becomes:
              Document 1: [‘Subin’, ‘and’, ‘Sohan’, ‘are’, ‘friends’]
              Document 2: [‘Subin’, ‘went’, ‘to’, ‘school’]
              Document 3: [‘Sohan’, ‘went’, ‘to’, ‘park’]
        Notice that no tokens have been removed in the stopwords removal step, because we have very little data
        and the frequency of the words is almost the same.
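
        For reference, Step I can be reproduced with a few lines of Python. This is a minimal sketch under the
        assumption of a simple whitespace tokeniser; lowercasing is skipped here so that the output matches the
        token lists shown above.

        import string

        documents = [
            "Subin and Sohan are friends.",
            "Subin went to school.",
            "Sohan went to park.",
        ]

        for doc in documents:
            # Strip the punctuation marks, then segment the document into tokens.
            cleaned = doc.translate(str.maketrans("", "", string.punctuation))
            print(cleaned.split())

        # ['Subin', 'and', 'Sohan', 'are', 'friends']
        # ['Subin', 'went', 'to', 'school']
        # ['Sohan', 'went', 'to', 'park']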
