Along with stopwords, a corpus may also contain special characters and/or numbers. The decision to keep or discard these symbols and numbers depends on the type of document being worked on. For example, if a document contains email IDs, then removing the special character "@" is not advisable. Similarly, in a document containing invoice numbers or bill amounts, the digits and numbers are relevant. Thus, care should be taken while eliminating such tokens.
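As a minimal sketch of this step (assuming Python and its built-in re module; the sample sentence is only an illustration), the pattern below strips special characters while deliberately keeping "@", ".", and digits, since those carry meaning in email IDs and amounts:

import re

text = "Contact us at help@example.com, invoice #4521, amount Rs. 1500!"

# Keep letters, digits, whitespace, '@' and '.'; drop every other special character.
cleaned = re.sub(r"[^A-Za-z0-9@.\s]", "", text)
print(cleaned)   # Contact us at help@example.com invoice 4521 amount Rs. 1500

Which characters go inside the "keep" set is a design choice that depends on the document, exactly as described above.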
Changing Letter Case
After eliminating stop words from the tokens, we convert the whole text into lowercase to remove the issue of case-sensitivity, because a machine treats the same word written in different letter cases as different words. Thus, the words "TIGER", "Tiger", "tiger", "TiGeR", and "TigeR" will all be converted to "tiger".
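A one-line sketch of this step in Python (the token list here is only an illustration):

tokens = ["TIGER", "Tiger", "tiger", "TiGeR", "TigeR"]

# Convert every token to lowercase so differently-cased spellings collapse into one form.
lowered = [t.lower() for t in tokens]
print(lowered)        # ['tiger', 'tiger', 'tiger', 'tiger', 'tiger']
print(set(lowered))   # {'tiger'} - only one distinct word remains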
Stemming and Lemmatisation
Stemming is an elementary, rule-based process that removes the affixes of words and reduces them to their base form, called the stem. For example, laughing, laughed, laughs, and laugh will all become laugh after the stemming process.
Word Affixes Stem
Laughs -s Laugh
Laughed -ed Laugh
Laughing -ing Laugh
Caring -ing Car
Tries -es Tri
You should remember that stemming is not a good approach for normalisation, as some words will not be meaningful after the stemming phase. For example, 'tak', the stemmed form of "taking", is not a meaningful word.
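A short sketch of stemming, assuming the NLTK library (not mentioned above) and its PorterStemmer are available:

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
words = ["laughs", "laughed", "laughing", "caring", "tries"]

# Strip suffixes using simple rules; the result need not be a dictionary word.
# Exact output depends on the algorithm's rules, so Porter's results may differ
# slightly from the plain suffix-stripping shown in the table above.
for w in words:
    print(w, "->", stemmer.stem(w))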
Lemmatisation is a systematic process of removing the affixes of a word and transforming it into its root form, called the lemma. It ensures that the lemma is a meaningful word, and hence it takes longer to execute as compared to stemming. For example, 'take' is the lemma of 'taking'.
Word Affixes Lemma
Laughs -s Laugh
Laughed -ed Laugh
Laughing -ing Laugh
Caring -ing Care
Tries -es Try
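A comparable sketch of lemmatisation, again assuming NLTK (its WordNetLemmatizer needs the WordNet data, downloaded once with nltk.download):

import nltk
from nltk.stem import WordNetLemmatizer

nltk.download("wordnet", quiet=True)   # WordNet data is needed the first time
lemmatizer = WordNetLemmatizer()
words = ["laughs", "laughed", "laughing", "caring", "tries"]

# pos="v" tells the lemmatiser to treat each word as a verb,
# so inflected forms map back to a meaningful dictionary word
# (e.g. caring -> care, tries -> try).
for w in words:
    print(w, "->", lemmatizer.lemmatize(w, pos="v"))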
BAG OF WORDS
Bag of Words (BoW) is a simple and popular method of extracting features from text documents. These features can be used for training machine learning algorithms. In this approach, we take the tokenised words of each observation (document) and determine how many times each word is used in the corpus. The Bag of Words algorithm returns:
• a vocabulary of words for the corpus.
• the frequency of these words.
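A minimal Bag of Words sketch in plain Python (the two example sentences are only placeholders), building the vocabulary and the per-document word counts by hand:

from collections import Counter

documents = [
    "the tiger is a wild animal",
    "the cat is a domestic animal",
]

# Tokenise each document with a simple lowercase whitespace split.
tokenised = [doc.lower().split() for doc in documents]

# Vocabulary: the set of all distinct words in the corpus.
vocabulary = sorted({word for doc in tokenised for word in doc})
print("Vocabulary:", vocabulary)

# Frequency: for each document, count how often each vocabulary word occurs.
for doc in tokenised:
    counts = Counter(doc)
    vector = [counts[word] for word in vocabulary]
    print(vector)

Each printed vector has one position per vocabulary word, which is exactly the representation used to train machine learning algorithms on text.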