Term Frequency-Inverse Document Frequency (TF-IDF)
- Integrated Circuits -
- An Online Book -
Integrated Circuits: http://www.globalsino.com/ICs/

=================================================================================

Term Frequency-Inverse Document Frequency (TF-IDF) is a numerical statistic used in natural language processing and information retrieval to measure how important a term is to a document within a collection of documents (a corpus). TF-IDF is commonly employed for tasks such as document ranking, text classification, and search.

The TF-IDF representation of a term in a document is calculated by combining two components:

1. Term Frequency (TF): Term Frequency measures how often a specific term appears in a document, normalized by the document's length so that long and short documents can be compared. It represents the relative importance of a term within that document. A higher TF value indicates that the term occurs frequently in the document and is likely to be significant in conveying the document's meaning.

TF = (Number of occurrences of the term in the document) / (Total number of terms in the document)

2. Inverse Document Frequency (IDF): Inverse Document Frequency measures the significance of a term across the entire collection of documents. It penalizes terms that occur frequently across many documents, as they are likely to be common and less informative. On the other hand, terms that occur in only a few documents are given higher IDF scores and are considered more discriminative.

IDF = log((Total number of documents) / (Number of documents containing the term))
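The two formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the toy corpus, and the choice of the natural logarithm are assumptions made here, and each document is assumed to be a non-empty list of already-tokenized lowercase words.

```python
import math
from collections import Counter

def term_frequency(term, document):
    """TF: occurrences of `term` divided by the total number of terms."""
    counts = Counter(document)
    return counts[term] / len(document)

def inverse_document_frequency(term, corpus):
    """IDF: log of (number of documents / documents containing `term`)."""
    containing = sum(1 for document in corpus if term in document)
    if containing == 0:
        # The plain formula is undefined for terms absent from the corpus.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / containing)

# Hypothetical toy corpus: each document is a list of lowercase tokens.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

tf_cat = term_frequency("cat", corpus[0])             # 1 of 6 tokens
idf_the = inverse_document_frequency("the", corpus)   # appears in all 3 documents
idf_mat = inverse_document_frequency("mat", corpus)   # appears in 1 of 3 documents
```

In this sketch, "the" occurs in every document, so its IDF is log(3/3) = 0, while "mat" occurs in only one document and receives the largest IDF in the corpus, log(3).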

The TF-IDF score for a term in a document is obtained by multiplying its TF value with its IDF value. The resulting score quantifies how important the term is to the specific document while considering its significance in the entire collection.
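As a rough illustration of the multiplication described above, the following sketch scores every term of one document and ranks the terms by TF-IDF. The helper name `tf_idf`, the toy corpus, and the convention of returning 0.0 for a term that appears nowhere in the corpus are assumptions for this example only.

```python
import math

def tf_idf(term, document, corpus):
    """TF-IDF of `term` in `document`, relative to `corpus` (lists of tokens)."""
    tf = document.count(term) / len(document)
    containing = sum(1 for doc in corpus if term in doc)
    # Assumption: a term found in no document gets an IDF of 0.0.
    idf = math.log(len(corpus) / containing) if containing else 0.0
    return tf * idf

# Hypothetical toy corpus of tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

document = corpus[0]
scores = {term: tf_idf(term, document, corpus) for term in set(document)}

# Terms ordered from most to least characteristic of the first document.
ranked = sorted(scores, key=scores.get, reverse=True)
```

Although "the" is the most frequent term in the first document, it occurs in every document, so its score is zero and it ranks last; "mat", which occurs only in that document, ranks first.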

TF-IDF is a useful technique for representing documents in a vector space, where each dimension corresponds to a unique term, and the value in each dimension is the TF-IDF score of the corresponding term in the document. The TF-IDF vector representation allows documents to be compared and ranked based on their content, making it valuable for various text-based applications, including search engines, document clustering, and text classification.
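The vector-space view can be sketched as follows: each document becomes a list of TF-IDF scores, one per vocabulary term, and two documents are compared with cosine similarity. Cosine similarity is one common choice of comparison, not the only one; the function names and the toy corpus are assumptions introduced for this example.

```python
import math

def build_vocabulary(corpus):
    """Sorted list of unique terms; each term is one vector dimension."""
    return sorted({term for doc in corpus for term in doc})

def tf_idf_vector(document, corpus, vocabulary):
    """TF-IDF score of every vocabulary term for one document."""
    n_docs = len(corpus)
    vector = []
    for term in vocabulary:
        tf = document.count(term) / len(document)
        # The vocabulary is built from the corpus, so df is at least 1.
        df = sum(1 for doc in corpus if term in doc)
        vector.append(tf * math.log(n_docs / df))
    return vector

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

# Hypothetical toy corpus of tokenized documents.
corpus = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]

vocabulary = build_vocabulary(corpus)
vectors = [tf_idf_vector(doc, corpus, vocabulary) for doc in corpus]

sim_01 = cosine_similarity(vectors[0], vectors[1])
sim_02 = cosine_similarity(vectors[0], vectors[2])
```

With this corpus, the first document comes out closer to the second (they share "sat" and "on") than to the third (they share only "cat"); the shared word "the" contributes nothing because its IDF is zero.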

TF-IDF can be defined as a measure of how relevant a word is to a document within a series of documents or corpus. The relevance increases in proportion to the number of times the word appears in the document, but is offset by how frequently the word occurs across the corpus (data set).

A very popular representation for text is the product of TF and IDF:
TFIDF(t,d) = TF(t,d) x IDF(t)
where,
TF -- The frequency of a "term" in a given "document" (for example, a sentence). It is the number of occurrences of the term in the document divided by the total number of terms in the document.
IDF -- Constant per corpus for a given term; it accounts for the fraction of documents that include that specific "term". It is the log of the total number of documents divided by the number of documents containing the term.

The TFIDF value is specific to a single document d, whereas IDF depends on the entire corpus.

Bag-of-words representation, typical preprocessing steps:
i) Stemming
ii) Stopword elimination
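A minimal sketch of a bag-of-words pipeline with stemming and stopword elimination is shown below. The stopword list and the suffix-stripping `crude_stem` function are hypothetical stand-ins chosen for brevity; a real system would more likely use an established stemmer (such as the Porter algorithm) and a fuller stopword list.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list.
STOPWORDS = {"a", "an", "and", "are", "is", "of", "on", "the", "to"}

# Suffixes tried in order; a crude stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def crude_stem(token):
    """Strip the first matching suffix if at least 3 characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def bag_of_words(text):
    """Lowercase, tokenize, drop stopwords, stem, and count the stems."""
    tokens = re.findall(r"[a-z]+", text.lower())
    kept = [token for token in tokens if token not in STOPWORDS]
    return Counter(crude_stem(token) for token in kept)

bag = bag_of_words("The cats are chasing the dogs and the dog chased the cat")
```

Here "cats" and "cat" collapse to the same stem, as do "chasing" and "chased", so the bag holds three distinct entries with a count of two each. Word order is discarded, which is the defining property of the bag-of-words representation.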

=================================================================================