Along with stopwords, a corpus may also contain special characters and/or numbers. The decision to keep or discard these symbols and numbers depends on the type of document being worked on. For example, if a document contains email IDs, then removing the special character "@" is not advisable. Similarly, in a document containing invoice numbers or bill amounts, the digits and numbers are relevant. Thus, care should be taken while eliminating such tokens.
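As a minimal sketch of this step (assuming Python and its built-in re module; the sample sentence is only an illustration), the pattern below strips special characters while deliberately keeping "@", ".", and digits, since those carry meaning in email IDs and amounts:

import re

text = "Contact us at help@example.com, invoice #4521, amount Rs. 1500!"

# Keep letters, digits, whitespace, '@' and '.'; drop every other special character.
cleaned = re.sub(r"[^A-Za-z0-9@.\s]", "", text)
print(cleaned)   # Contact us at help@example.com invoice 4521 amount Rs. 1500

Which characters go inside the "keep" set is a design choice that depends on the document, exactly as described above.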
Changing Letter Case
After eliminating stop words from the tokens, we convert the whole text into lowercase to remove the issue of case-sensitivity, because a machine treats the same word written in different letter cases as different words. Thus, the words "TIGER", "Tiger", "tiger", "TiGeR", and "TigeR" will all be converted to "tiger".
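A one-line sketch of this step in Python (the token list here is only an illustration):

tokens = ["TIGER", "Tiger", "tiger", "TiGeR", "TigeR"]

# Convert every token to lowercase so differently-cased spellings collapse into one form.
lowered = [t.lower() for t in tokens]
print(lowered)        # ['tiger', 'tiger', 'tiger', 'tiger', 'tiger']
print(set(lowered))   # {'tiger'} - only one distinct word remains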
Stemming and Lemmatisation
Stemming is an elementary, rule-based process that removes the affixes of words and reduces them to their base form, called the stem. For example, laughing, laughed, laughs, and laugh will all become laugh after the stemming process.
Word Affixes Stem
Laughs -s Laugh
Laughed -ed Laugh
Laughing -ing Laugh
Caring -ing Car
Tries -es Tri
You should remember that stemming is not a good approach for normalisation, as some words will not be meaningful after the stemming phase. For example, 'tak', the stemmed form of "taking", is not a meaningful word.
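A short sketch of stemming, assuming the NLTK library (not mentioned above) and its PorterStemmer are available:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["laughs", "laughed", "laughing", "caring", "tries"]

# Strip suffixes using simple rules; the result need not be a dictionary word.
# Exact output depends on the algorithm's rules, so Porter's results may differ
# slightly from the plain suffix-stripping shown in the table above.
for w in words:
    print(w, "->", stemmer.stem(w))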
Lemmatisation is a systematic process of removing the affixes of a word and transforming it into its root form, called the lemma. It ensures that the lemma is a meaningful word, and hence it takes longer to execute as compared to stemming. For example, 'take' is the lemma of 'taking'.
Word Affixes Lemma
Laughs -s Laugh
Laughed -ed Laugh
Laughing -ing Laugh
Caring -ing Care
Tries -es Try
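A comparable sketch of lemmatisation, again assuming NLTK (its WordNetLemmatizer needs the WordNet data, downloaded once with nltk.download):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # WordNet data is needed the first time
lemmatizer = WordNetLemmatizer()
words = ["laughs", "laughed", "laughing", "caring", "tries"]

# pos="v" tells the lemmatiser to treat each word as a verb,
# so inflected forms map back to a meaningful dictionary word
# (e.g. caring -> care, tries -> try).
for w in words:
    print(w, "->", lemmatizer.lemmatize(w, pos="v"))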
BAG OF WORDS
Bag of Words (BoW) is a simple and popular method of extracting features from text documents. These features can be used for training machine learning algorithms. In this approach, we take the tokenised words of each observation (document) and determine how many times each word is used in the corpus. The Bag of Words algorithm returns:
• a vocabulary of words for the corpus.
• the frequency of these words.
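A minimal Bag of Words sketch in plain Python (the two example sentences are only placeholders), building the vocabulary and the per-document word counts by hand:

from collections import Counter

documents = [
    "the tiger is a wild animal",
    "the cat is a domestic animal",
]

# Tokenise each document with a simple lowercase whitespace split.
tokenised = [doc.lower().split() for doc in documents]

# Vocabulary: the set of all distinct words in the corpus.
vocabulary = sorted({word for doc in tokenised for word in doc})
print("Vocabulary:", vocabulary)

# Frequency: for each document, count how often each vocabulary word occurs.
for doc in tokenised:
    counts = Counter(doc)
    vector = [counts[word] for word in vocabulary]
    print(vector)

Each printed vector has one position per vocabulary word, which is exactly the representation used to train machine learning algorithms on text.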