Page 343 - AI Computer 10
For example, suppose we have two sentences in the text document.
• “binary code is an efficient language.”
• “binary code is a code used in digital computers.”
Now, each sentence is segmented into tokens, excluding the punctuation marks. Listing every distinct token once,
we get:
‘binary’, ‘code’, ‘is’, ‘an’, ‘efficient’, ‘language’, ‘a’, ‘used’, ‘in’, ‘digital’, ‘computers’
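Now, the Bag of Words algorithm creates a vector by determining the frequency of each word or token in the whole
corpus. Here, ‘code’ occurs three times, ‘binary’ and ‘is’ occur twice each, and every other token occurs once.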
Thus, we can say that the BoW algorithm creates a vocabulary of all the unique words occurring in all the documents
in the training set.
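The counting described above can be sketched in a few lines of Python. This is only an illustration, assuming tokens are lowercase runs of letters; the helper name `tokenize` and the regular expression are choices made for this sketch, not part of the algorithm itself.

```python
import re
from collections import Counter

sentences = [
    "binary code is an efficient language.",
    "binary code is a code used in digital computers",
]

def tokenize(text):
    # Assumption: a token is a run of letters; lowercasing and this
    # pattern together drop the punctuation marks.
    return re.findall(r"[a-z]+", text.lower())

# All tokens of the corpus, in reading order.
tokens = [tok for s in sentences for tok in tokenize(s)]

# dict.fromkeys keeps first-seen order, giving each unique token once.
vocabulary = list(dict.fromkeys(tokens))

# Frequency of each token over the whole corpus.
frequency = Counter(tokens)

for word in vocabulary:
    print(word, frequency[word])
```

Running it prints each of the eleven vocabulary words with its count in the corpus.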
Knowledge Bot
Bag of Words (BoW) is so named because it is analogous to a bag containing all the words in a text.
Bag of Words Algorithm
To implement the Bag of Words algorithm, follow the given steps:
Step I: Text Normalisation
The whole corpus is segmented into tokens, and stopwords and other symbols are
removed.
Step II: Create Dictionary
Make a list of all the unique words occurring in the corpus.
Step III: Create Vectors for each Document
Generate vectors for each document in the corpus by determining the frequency of words in the
document.
Step IV: Create Document Vectors for all the Documents
The last step is to generate vectors for all the documents that exist in the corpus.
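The four steps can be sketched as small Python functions. This is a minimal sketch under stated assumptions: the function names (`normalise`, `build_dictionary`, `document_vector`, `bag_of_words`), the optional `stopwords` parameter and the letters-only token pattern are all hypothetical choices made for illustration. The two sentences from the start of this section are reused as the corpus.

```python
import re

def normalise(document, stopwords=frozenset()):
    # Step I: lowercase, split into word tokens (dropping punctuation),
    # then discard any stopwords. The default stopword set is empty.
    tokens = re.findall(r"[a-z]+", document.lower())
    return [t for t in tokens if t not in stopwords]

def build_dictionary(tokenised_docs):
    # Step II: unique words of the corpus, in order of first appearance.
    return list(dict.fromkeys(t for doc in tokenised_docs for t in doc))

def document_vector(tokens, dictionary):
    # Step III: count how often each dictionary word occurs in one document.
    return [tokens.count(word) for word in dictionary]

def bag_of_words(corpus, stopwords=frozenset()):
    tokenised = [normalise(doc, stopwords) for doc in corpus]
    dictionary = build_dictionary(tokenised)
    # Step IV: repeat Step III for every document in the corpus.
    vectors = [document_vector(doc, dictionary) for doc in tokenised]
    return dictionary, vectors

corpus = [
    "binary code is an efficient language.",
    "binary code is a code used in digital computers",
]
dictionary, vectors = bag_of_words(corpus)
print(dictionary)
for vector in vectors:
    print(vector)
```

Each vector has one position per dictionary word, so all document vectors have the same length and can be compared position by position.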
Example:
Let us understand all the steps of the BoW algorithm with the help of an example:
Suppose a corpus contains three documents:
Document 1: Subin and Sohan are friends.
Document 2: Subin went to school.
Document 3: Sohan went to park.
Step I: Text Normalisation
After text normalisation, the text becomes:
Document 1: [‘Subin’, ‘and’, ‘Sohan’, ‘are’, ‘friends’]
Document 2: [‘Subin’, ‘went’, ‘to’, ‘school’]
Document 3: [‘Sohan’, ‘went’, ‘to’, ‘park’]
Notice that no tokens have been removed in the stopword removal step, because the corpus is very small and
all the words occur with almost the same frequency.
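Step I for this corpus can be reproduced with a short Python sketch. It assumes, for illustration only, that a token is a run of letters and that case is left unchanged so the output matches the token lists shown above. It also counts each token over the whole corpus to show how similar the frequencies are.

```python
import re
from collections import Counter

documents = [
    "Subin and Sohan are friends.",
    "Subin went to school.",
    "Sohan went to park.",
]

# Keep runs of letters only, so the full stops are dropped.
# Case is left unchanged to match the token lists in the text.
normalised = [re.findall(r"[A-Za-z]+", doc) for doc in documents]

for number, tokens in enumerate(normalised, start=1):
    print(f"Document {number}: {tokens}")

# Corpus-wide counts: every token occurs only once or twice.
counts = Counter(tok for tokens in normalised for tok in tokens)
print(counts)
```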