Tf–idf

From testwiki
Revision as of 07:36, 10 January 2025 by imported>ThoughtWarden (Open access status updates in citations with OAbot #oabot)

In information retrieval, tf–idf (also TF*IDF, TFIDF, TF–IDF, or Tf–idf), short for term frequency–inverse document frequency, is a measure of the importance of a word to a document in a collection or corpus, adjusted for the fact that some words appear more frequently in general.[1] Like the bag-of-words model, it models a document as a multiset of words, without word order. It refines the simple bag-of-words model by allowing the weight of each word to depend on the rest of the corpus.

It was often used as a weighting factor in searches of information retrieval, text mining, and user modeling. A survey conducted in 2015 showed that 83% of text-based recommender systems in digital libraries used tf–idf.[2] Variations of the tf–idf weighting scheme were often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.

One of the simplest ranking functions is computed by summing the tf–idf for each query term; many more sophisticated ranking functions are variants of this simple model.
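The simple ranking function described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function name `rank_documents`, the toy corpus, the use of relative term frequency, and the base-10 logarithm are all assumptions made for the example.

```python
import math
from collections import Counter

def rank_documents(query_terms, documents):
    """Rank tokenized documents by the sum of tf-idf over the query terms.

    Hypothetical helper: tf is the relative frequency, idf is log10(N / n_t).
    """
    n_docs = len(documents)
    counts = [Counter(doc) for doc in documents]
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for c in counts for term in c)

    scores = []
    for index, c in enumerate(counts):
        total_terms = sum(c.values())
        score = 0.0
        for term in query_terms:
            if doc_freq[term] == 0 or total_terms == 0:
                continue  # a term absent from the corpus contributes nothing
            tf = c[term] / total_terms
            idf = math.log10(n_docs / doc_freq[term])
            score += tf * idf
        scores.append((index, score))
    # Highest score first; Python's sort is stable, so ties keep corpus order.
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

# Toy corpus (an assumption for illustration only).
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
ranking = rank_documents(["cat", "mat"], corpus)
for doc_index, score in ranking:
    print(doc_index, round(score, 4))
```

With this toy corpus the first document ranks highest because it contains both query terms, and the third document scores zero because it contains neither.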

Motivations

Karen Spärck Jones (1972) conceived a statistical interpretation of term-specificity called Inverse Document Frequency (idf), which became a cornerstone of term weighting.[3]

For example, the df (document frequency) and idf for some words in Shakespeare's 37 plays are as follows:[4]

Word df idf
Romeo 1 1.57
salad 2 1.27
Falstaff 4 0.967
forest 12 0.489
battle 21 0.246
wit 34 0.037
fool 36 0.012
good 37 0
sweet 37 0

We see that "Romeo", "Falstaff", and "salad" appear in very few plays, so seeing one of these words gives a good idea of which play it might be. In contrast, "good" and "sweet" appear in every play and are completely uninformative as to which play it is.
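The idf column of the table can be approximately reproduced from the df column. The sketch below assumes the base-10 logarithm and idf = log(N / df) with N = 37; the source does not state the base here, but these assumptions match the tabulated values to within rounding.

```python
import math

N_PLAYS = 37  # number of plays in the corpus, as stated in the text

# Document frequencies copied from the table.
document_frequency = {
    "Romeo": 1, "salad": 2, "Falstaff": 4, "forest": 12, "battle": 21,
    "wit": 34, "fool": 36, "good": 37, "sweet": 37,
}

# Assumed formula: idf = log10(N / df).
idf = {word: math.log10(N_PLAYS / df) for word, df in document_frequency.items()}

for word, value in idf.items():
    print(f"{word:10s} {document_frequency[word]:3d} {value:.3f}")
```

Words that occur in all 37 plays get an idf of exactly 0, which is the numerical counterpart of "completely uninformative".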

Definition

  1. The tf–idf is the product of two statistics, term frequency and inverse document frequency. There are various ways for determining the exact values of both statistics.
  2. A formula that aims to define the importance of a keyword or phrase within a document or a web page.
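Variants of term frequency (tf) weight
weighting scheme tf weight
binary $0, 1$
raw count $f_{t,d}$
term frequency $f_{t,d} \Big/ \sum_{t' \in d} f_{t',d}$
log normalization $\log(1 + f_{t,d})$
double normalization 0.5 $0.5 + 0.5 \cdot \frac{f_{t,d}}{\max_{\{t' \in d\}} f_{t',d}}$
double normalization K $K + (1 - K) \frac{f_{t,d}}{\max_{\{t' \in d\}} f_{t',d}}$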

Term frequency

Term frequency, $\mathrm{tf}(t,d)$, is the relative frequency of term $t$ within document $d$,

$$\mathrm{tf}(t,d) = \frac{f_{t,d}}{\sum_{t' \in d} f_{t',d}},$$

where $f_{t,d}$ is the raw count of a term in a document, i.e., the number of times that term $t$ occurs in document $d$. Note that the denominator is simply the total number of terms in document $d$ (counting each occurrence of the same term separately). There are various other ways to define term frequency,[5] for example the augmented form (the "double normalization 0.5" scheme), which divides the raw count by that of the most frequent term in the document:

$$\mathrm{tf}(t,d) = 0.5 + 0.5 \cdot \frac{f_{t,d}}{\max\{f_{t',d} : t' \in d\}}$$
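The term-frequency variants can be compared side by side on a single document. The following is a minimal sketch; the function name `tf_variants`, the dictionary keys, and the sample token list are assumptions for illustration, and the document is assumed to be non-empty.

```python
import math
from collections import Counter

def tf_variants(term, tokens, k=0.5):
    """Return several tf weights for `term` in a non-empty token list.

    Hypothetical helper; `k` is the K of the double normalization scheme.
    """
    counts = Counter(tokens)
    f = counts[term]              # raw count f_{t,d}
    total = sum(counts.values())  # total number of term occurrences in d
    max_f = max(counts.values())  # count of the most frequent term in d
    return {
        "binary": 1 if f > 0 else 0,
        "raw count": f,
        "term frequency": f / total,
        "log normalization": math.log(1 + f),
        "double normalization K": k + (1 - k) * f / max_f,
    }

doc = "this is another another example example example".split()
weights = tf_variants("example", doc)
for scheme, value in weights.items():
    print(f"{scheme}: {value}")
```

With the default `k=0.5`, the most frequent term in a document always receives the weight 1 under double normalization, whatever the document length; this is how the scheme reduces the bias towards longer documents.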


Inverse document frequency

Variants of inverse document frequency (idf) weight
weighting scheme idf weight ($n_t = |\{d \in D : t \in d\}|$)
unary $1$
inverse document frequency $\log \frac{N}{n_t} = -\log \frac{n_t}{N}$
inverse document frequency smooth $\log\left(\frac{N}{1 + n_t}\right) + 1$
inverse document frequency max $\log\left(\frac{\max_{\{t' \in d\}} n_{t'}}{1 + n_t}\right)$
probabilistic inverse document frequency $\log \frac{N - n_t}{n_t}$

The inverse document frequency is a measure of how much information the word provides, i.e., how common or rare it is across all documents. It is the logarithmically scaled inverse fraction of the documents that contain the word (obtained by dividing the total number of documents by the number of documents containing the term, and then taking the logarithm of that quotient):

$$\mathrm{idf}(t,D) = \log \frac{N}{|\{d : d \in D \text{ and } t \in d\}|}$$

with

  • $N$: total number of documents in the corpus, $N = |D|$
  • $|\{d \in D : t \in d\}|$: number of documents where the term $t$ appears (i.e., $\mathrm{tf}(t,d) \neq 0$). If the term is not in the corpus, this will lead to a division by zero. It is therefore common to adjust the numerator to $1 + N$ and the denominator to $1 + |\{d \in D : t \in d\}|$.

[Figure: plot of different inverse document frequency functions: standard, smooth, probabilistic.]
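The division-by-zero issue and the common adjustment can be illustrated with a few small functions. This is a sketch under stated assumptions: the function names are hypothetical, the natural logarithm is used, and `n_t` and `n_docs` stand for the number of documents containing the term and the corpus size N.

```python
import math

def idf_standard(n_t, n_docs):
    # log(N / n_t); raises ZeroDivisionError when the term is not in the corpus.
    return math.log(n_docs / n_t)

def idf_smooth(n_t, n_docs):
    # log(N / (1 + n_t)) + 1, the "smooth" variant.
    return math.log(n_docs / (1 + n_t)) + 1

def idf_probabilistic(n_t, n_docs):
    # log((N - n_t) / n_t); undefined when n_t is 0 or equals N.
    return math.log((n_docs - n_t) / n_t)

def idf_adjusted(n_t, n_docs):
    # Numerator adjusted to 1 + N and denominator to 1 + n_t,
    # so the value is defined even when n_t == 0.
    return math.log((1 + n_docs) / (1 + n_t))

n_docs = 4  # hypothetical corpus size
for n_t in (1, 2, 3):
    print(n_t, idf_standard(n_t, n_docs), idf_smooth(n_t, n_docs),
          idf_probabilistic(n_t, n_docs), idf_adjusted(n_t, n_docs))

# A term that appears in no document is only handled by the adjusted form.
print(idf_adjusted(0, n_docs))
```

The adjusted form behaves as if one extra document containing every term had been added to the corpus, which keeps both numerator and denominator strictly positive.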


Term frequency–inverse document frequency

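Variants of term frequency-inverse document frequency (tf–idf) weights
weighting scheme tf-idf
count-idf $f_{t,d} \cdot \log \frac{N}{n_t}$
double normalization-idf $\left(0.5 + 0.5 \frac{f_{t,q}}{\max_t f_{t,q}}\right) \cdot \log \frac{N}{n_t}$
log normalization-idf $(1 + \log f_{t,d}) \cdot \log \frac{N}{n_t}$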

Then tf–idf is calculated as

$$\mathrm{tfidf}(t,d,D) = \mathrm{tf}(t,d) \cdot \mathrm{idf}(t,D)$$

A high weight in tf–idf is reached by a high term frequency (in the given document) and a low document frequency of the term in the whole collection of documents; the weights hence tend to filter out common terms. Since the ratio inside the idf's log function is always greater than or equal to 1, the value of idf (and tf–idf) is greater than or equal to 0. As a term appears in more documents, the ratio inside the logarithm approaches 1, bringing the idf and tf–idf closer to 0.
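The behaviour described above can be checked numerically. The sketch below fixes a hypothetical term frequency and corpus size (both assumptions) and lets the document frequency of the term grow from 1 to N, using the standard idf log(N / n_t).

```python
import math

N_DOCS = 10  # hypothetical corpus size
TF = 0.2     # hypothetical, fixed term frequency in the document of interest

# tf-idf of the same term as it appears in 1, 2, ..., N documents.
tfidf_weights = [TF * math.log(N_DOCS / n_t) for n_t in range(1, N_DOCS + 1)]

for n_t, weight in enumerate(tfidf_weights, start=1):
    print(n_t, round(weight, 4))
```

The weights are never negative, shrink as the term spreads over more documents, and reach exactly 0 when the term occurs in every document.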


Justification of idf

Idf was introduced as "term specificity" by Karen Spärck Jones in a 1972 paper. Although it has worked well as a heuristic, its theoretical foundations remained troublesome for at least three decades afterward, with many researchers trying to find information-theoretic justifications for it.[7]

Spärck Jones's own explanation did not propose much theory, aside from a connection to Zipf's law.[7] Attempts have been made to put idf on a probabilistic footing,[8] by estimating the probability that a given document $d$ contains a term $t$ as the relative document frequency,

$$P(t|D) = \frac{|\{d \in D : t \in d\}|}{N},$$

so that we can define idf as

$$\mathrm{idf} = -\log P(t|D) = \log \frac{1}{P(t|D)} = \log \frac{N}{|\{d \in D : t \in d\}|}$$

Namely, the inverse document frequency is the logarithm of "inverse" relative document frequency.

This probabilistic interpretation in turn takes the same form as that of self-information. However, applying such information-theoretic notions to problems in information retrieval leads to problems when trying to define the appropriate event spaces for the required probability distributions: not only documents need to be taken into account, but also queries and terms.[7]

Both term frequency and inverse document frequency can be formulated in terms of information theory; it helps to understand why their product has a meaning in terms of joint informational content of a document. A characteristic assumption about the distribution p(d,t) is that:

$$p(d|t) = \frac{1}{|\{d \in D : t \in d\}|}$$

This assumption and its implications, according to Aizawa: "represent the heuristic that tf–idf employs."[9]

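The conditional entropy of a "randomly chosen" document in the corpus $D$, conditional on the fact that it contains a specific term $t$ (and assuming that all documents have equal probability to be chosen), is:

$$H(\mathcal{D}|\mathcal{T}=t) = -\sum_d p_{d|t} \log p_{d|t} = -\log \frac{1}{|\{d \in D : t \in d\}|} = \log \frac{|\{d \in D : t \in d\}|}{|D|} + \log |D| = -\mathrm{idf}(t) + \log |D|$$

In terms of notation, $\mathcal{D}$ and $\mathcal{T}$ are "random variables" corresponding to drawing a document or a term, respectively. The mutual information can be expressed as

$$M(\mathcal{T};\mathcal{D}) = H(\mathcal{D}) - H(\mathcal{D}|\mathcal{T}) = \sum_t p_t \cdot \left(H(\mathcal{D}) - H(\mathcal{D}|\mathcal{T}=t)\right) = \sum_t p_t \cdot \mathrm{idf}(t)$$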

The last step is to expand $p_t$, the unconditional probability to draw a term, with respect to the (random) choice of a document, to obtain:

$$M(\mathcal{T};\mathcal{D}) = \sum_{t,d} p_{t|d} \cdot p_d \cdot \mathrm{idf}(t) = \sum_{t,d} \mathrm{tf}(t,d) \cdot \frac{1}{|D|} \cdot \mathrm{idf}(t) = \frac{1}{|D|} \sum_{t,d} \mathrm{tf}(t,d) \cdot \mathrm{idf}(t).$$

This expression shows that summing the tf–idf of all possible terms and documents recovers the mutual information between documents and terms, taking into account all the specificities of their joint distribution.[9] Each tf–idf hence carries the "bit of information" attached to a term × document pair.
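The algebra of this derivation can be checked on a toy corpus. The sketch below is only a consistency check under the assumptions stated in the text (documents drawn uniformly, and p(d|t) uniform over the documents containing t); it is not independent evidence for those assumptions. The two-document corpus, the variable names, and the natural logarithm are choices made for the example.

```python
import math
from collections import Counter

# Toy corpus (an assumption for illustration).
docs = [
    "this is a a sample".split(),
    "this is another another example example example".split(),
]
n_docs = len(docs)
counts = [Counter(d) for d in docs]
vocab = sorted(set().union(*counts))

def tf(term, c):
    # Relative term frequency, playing the role of p(t|d).
    return c[term] / sum(c.values())

# n_t: number of documents containing each term; idf(t) = log(|D| / n_t).
n_t = {t: sum(1 for c in counts if c[t] > 0) for t in vocab}
idf = {t: math.log(n_docs / n_t[t]) for t in vocab}

# p_t = sum_d p(t|d) * p_d with p_d = 1 / |D|.
p_t = {t: sum(tf(t, c) for c in counts) / n_docs for t in vocab}

# Entropy form: sum_t p_t * (H(D) - H(D|T=t)),
# with H(D) = log|D| and H(D|T=t) = log n_t under the stated assumptions.
entropy_form = sum(p_t[t] * (math.log(n_docs) - math.log(n_t[t])) for t in vocab)

# tf-idf form: (1 / |D|) * sum over all (t, d) pairs of tf(t, d) * idf(t).
tfidf_form = sum(tf(t, c) * idf[t] for t in vocab for c in counts) / n_docs

print(entropy_form, tfidf_form)
print(math.isclose(entropy_form, tfidf_form))
```

Terms that occur in every document contribute nothing to either side, since their idf is zero.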

Example of tf–idf

Suppose that we have term count tables of a corpus consisting of only two documents, as listed below.

Document 1
Term Term Count
this 1
is 1
a 2
sample 1

Document 2
Term Term Count
this 1
is 1
another 2
example 3

The calculation of tf–idf for the term "this" is performed as follows:

In its raw frequency form, tf is just the frequency of "this" in each document. In each document, the word "this" appears once; but as document 2 has more words, its relative frequency is smaller.

$$\mathrm{tf}(\text{"this"}, d_1) = \frac{1}{5} = 0.2$$
$$\mathrm{tf}(\text{"this"}, d_2) = \frac{1}{7} \approx 0.14$$

An idf is constant per corpus, and accounts for the ratio of documents that include the word "this". In this case, we have a corpus of two documents and all of them include the word "this".

$$\mathrm{idf}(\text{"this"}, D) = \log\left(\frac{2}{2}\right) = 0$$

So tf–idf is zero for the word "this", which implies that the word is not very informative as it appears in all documents.

$$\mathrm{tfidf}(\text{"this"}, d_1, D) = 0.2 \times 0 = 0$$
$$\mathrm{tfidf}(\text{"this"}, d_2, D) = 0.14 \times 0 = 0$$

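The word "example" is more interesting, as it occurs three times but only in the second document:

$$\mathrm{tf}(\text{"example"}, d_1) = \frac{0}{5} = 0$$
$$\mathrm{tf}(\text{"example"}, d_2) = \frac{3}{7} \approx 0.429$$
$$\mathrm{idf}(\text{"example"}, D) = \log\left(\frac{2}{1}\right) = 0.301$$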

Finally,

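$$\mathrm{tfidf}(\text{"example"}, d_1, D) = \mathrm{tf}(\text{"example"}, d_1) \times \mathrm{idf}(\text{"example"}, D) = 0 \times 0.301 = 0$$
$$\mathrm{tfidf}(\text{"example"}, d_2, D) = \mathrm{tf}(\text{"example"}, d_2) \times \mathrm{idf}(\text{"example"}, D) = 0.429 \times 0.301 \approx 0.129$$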

(using the base 10 logarithm).
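The worked example can be reproduced in a few lines. This is a minimal sketch: the function names `tf`, `idf`, and `tfidf` are chosen for the example, the term counts are copied from the two tables, and the base-10 logarithm is used as in the text.

```python
import math
from collections import Counter

# Term counts copied from the two document tables.
d1 = Counter({"this": 1, "is": 1, "a": 2, "sample": 1})
d2 = Counter({"this": 1, "is": 1, "another": 2, "example": 3})
corpus = [d1, d2]

def tf(term, doc):
    # Relative frequency of the term within one document.
    return doc[term] / sum(doc.values())

def idf(term, docs):
    # Base-10 log of (number of documents / documents containing the term).
    # Assumes the term occurs in at least one document.
    n_t = sum(1 for doc in docs if doc[term] > 0)
    return math.log10(len(docs) / n_t)

def tfidf(term, doc, docs):
    return tf(term, doc) * idf(term, docs)

for term in ("this", "example"):
    for name, doc in (("d1", d1), ("d2", d2)):
        print(term, name, round(tfidf(term, doc, corpus), 3))
```

The only non-zero weight is that of "example" in document 2, about 0.129, which matches the hand calculation.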

Beyond terms

The idea behind tf–idf also applies to entities other than terms. In 1998, the concept of idf was applied to citations.[10] The authors argued that "if a very uncommon citation is shared by two documents, this should be weighted more highly than a citation made by a large number of documents". In addition, tf–idf was applied to "visual words" with the purpose of conducting object matching in videos,[11] and entire sentences.[12] However, the concept of tf–idf did not prove to be more effective in all cases than a plain tf scheme (without idf). When tf–idf was applied to citations, researchers could find no improvement over a simple citation-count weight that had no idf component.[13]

Derivatives

A number of term-weighting schemes have been derived from tf–idf. One of them is TF–PDF (term frequency * proportional document frequency).[14] TF–PDF was introduced in 2001 in the context of identifying emerging topics in the media. The PDF component measures the difference in how often a term occurs in different domains. Another derivative is TF–IDuF.[15] In TF–IDuF, idf is not calculated based on the document corpus that is to be searched or recommended. Instead, idf is calculated on users' personal document collections. The authors report that TF–IDuF was as effective as tf–idf but could also be applied in situations where, e.g., a user modeling system has no access to a global document corpus.


References

Template:Reflist