Page 346 - AI Computer 10
P. 346

Let us understand each term one by one in detail.

            Term Frequency

            Term frequency is a technical term used for the frequency of a word in one document. It can be easily found
            from the document vector table.

               Subin        and        Sohan        are       friends      went          to        school       park
                 1           1           1           1           1           0           0           0           0
                 1           0           0           0           0           1           1           1           0

                 0           0           1           0           0           1           1           0           1
            In this table, we have mentioned the number ‘1’ for the words that exist in the document and the number ‘0’ for
            unmatched words. These numbers are nothing but the term frequencies.
            Inverse Document Frequency

            Inverse  Document Frequency  is a measure  of the  rareness  of a term.  Conceptually,  we  start  by  measuring
            document frequency.

            Document Frequency is the number of documents in which the word occurs irrespective of how many times it
            has occurred in those documents. The document frequency for the above table would be:

            Document Frequency:
               Subin        and        Sohan        are        friends     went          to        school       park
                 2           1           2           1           1           2           2           1           1

            To compute inverse document frequency, we divide the total number of documents by the document frequency
            as shown below:
            Inverse document frequency:

               Subin        and        Sohan        are        friends     went          to        school       park
                3/2         3/1         3/2         3/1         3/1         3/2         3/2         3/1         3/1

            Finally, the formula of TFIDF for any word W becomes: TFIDF(W) = TF(W) * log( IDF(W) )
            Where, W=words    TF=Term frequency     IDF=Inverse document frequency

            Here, log is to the base of 10. You can use a calculator to calculate the log values.
            Now, let’s multiply the IDF values to the TF values. Note that the TF values are for each document while the IDF
            values are for the whole corpus. Hence, we need to multiply the IDF values to each row of the document vector
            table.
               Subin        and        Sohan        are        friends     went          to        school       park
             1*log(3/2) 1*log(3/1) 1*log(3/2) 1*log(3/1) 1*log(3/1) 0*log(3/2) 0*log(3/2) 0*log(3/1) 0*log(3/1)
             1*log(3/2) 0*log(3/1) 0*log(3/2) 0*log(3/1) 0*log(3/1) 1*log(3/2) 1*log(3/2) 1*log(3/1) 0*log(3/1)

             0*log(3/2) 0*log(3/1) 1*log(3/2) 0*log(3/1) 0*log(3/1) 1*log(3/2) 1*log(3/2) 0*log(3/1) 1*log(3/1)
            After calculating values, the table look like as follows:

             Subin       and         Sohan       are         friends     went        To          school      park
             0.176       0.477       0.176       0.477       0.477       0           0           0           0
             0.176       0           0           0           0           0.176       0.176       0.477       0
             0           0           0.176       0           0           0.176       0.176       0           0.477


                212
                212
   341   342   343   344   345   346   347   348   349   350   351