MEDLARS collection: publications from medical reviews
Time magazine collection: archives of the general-interest magazine Time from 1963
The legacy of the SMART system includes the so-called SMART triple notation, a mnemonic scheme for denoting tf-idf weighting variants in the vector space model. The mnemonic for a combination of weights takes the form ddd.qqq, where the first three letters represent the term weighting of the collection document vector and the second three letters represent the term weighting of the query document vector. For example, ltc.lnn represents the ltc weighting applied to a collection document and the lnn weighting applied to a query document.
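As a rough illustration of how a ddd.qqq mnemonic can be read, the following Python sketch splits a notation string into its document and query triples and applies a small subset of weightings. It assumes the commonly used later-scheme meanings of the letters (l = 1 + log tf, n = no change, t = log(N/df), c = cosine normalization) and natural logarithms. The function names `parse_smart` and `weight` are hypothetical and are not part of the SMART system itself.

```python
import math
from collections import Counter

def parse_smart(notation):
    """Split a 'ddd.qqq' mnemonic into (document triple, query triple)."""
    doc, query = notation.split(".")
    if len(doc) != 3 or len(query) != 3:
        raise ValueError("expected a notation of the form ddd.qqq")
    return doc, query

def weight(tokens, triple, df, n_docs):
    """Weight one token list according to a three-letter SMART code.

    Only a subset of letters is handled (assumed meanings):
      term frequency:     'n' raw tf, 'l' 1 + log(tf)
      document frequency: 'n' none,   't' log(N / df)
      normalization:      'n' none,   'c' cosine
    """
    tf_code, df_code, norm_code = triple
    weights = {}
    for term, tf in Counter(tokens).items():
        if tf_code == "l":
            w = 1.0 + math.log(tf)
        elif tf_code == "n":
            w = float(tf)
        else:
            raise ValueError("unsupported tf code: " + tf_code)
        if df_code == "t":
            w *= math.log(n_docs / df[term])
        elif df_code != "n":
            raise ValueError("unsupported df code: " + df_code)
        weights[term] = w
    if norm_code == "c":
        # Cosine normalization: divide by the Euclidean length of the vector.
        length = math.sqrt(sum(w * w for w in weights.values()))
        if length > 0:
            weights = {t: w / length for t, w in weights.items()}
    elif norm_code != "n":
        raise ValueError("unsupported normalization code: " + norm_code)
    return weights

# Usage: ltc for the collection document, lnn for the query.
doc_code, query_code = parse_smart("ltc.lnn")
df = {"a": 1, "b": 2}  # toy document frequencies (made up for illustration)
doc_vec = weight(["a", "a", "b"], doc_code, df, n_docs=4)
query_vec = weight(["a", "b"], query_code, df, n_docs=4)
```

With these toy inputs the document vector has unit length because of the trailing c, while the query vector keeps its unnormalized logarithmic weights.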
The following tables establish the SMART notation:[2]
Symbols and notation
D = {w_1, w_2, ..., w_t} represents a document vector, where w_i is the weight of the term i in D and t is the number of unique terms in D. Positive features characterize terms that are present in a document, and the weight of zero is used for terms that are absent from a document.
Occurrence frequency of term i in document D
Number of unique terms in document D
Number of collection documents
Average number of unique terms in a document
Number of documents with term i present
Number of characters in document D
Occurrence frequency of the most common term in document D
Average number of characters in a document
Average occurrence frequency of a term in document D
Global collection statistics
The slope in the context of pivoted document length normalization[3]
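The quantities listed above can be gathered in a single pass over a collection. The following Python sketch is only an illustration under simple assumptions: documents are plain strings, terms are whitespace-separated tokens, and the character count is the length of the raw string. The function name `collection_stats` and the dictionary keys are hypothetical.

```python
from collections import Counter

def collection_stats(docs):
    """Compute per-document and collection-level statistics.

    docs is a list of strings; tokenization by whitespace is an assumption.
    """
    df = Counter()   # number of documents in which each term is present
    per_doc = []
    for text in docs:
        counts = Counter(text.split())   # occurrence frequency of each term
        df.update(counts.keys())         # each document counts once per term
        unique = len(counts)
        per_doc.append({
            "tf": counts,
            "unique_terms": unique,
            "chars": len(text),
            "max_tf": max(counts.values(), default=0),
            "avg_tf": sum(counts.values()) / unique if unique else 0.0,
        })
    n_docs = len(docs)
    return {
        "n_docs": n_docs,
        "df": df,
        "avg_unique_terms": sum(d["unique_terms"] for d in per_doc) / n_docs if n_docs else 0.0,
        "avg_chars": sum(d["chars"] for d in per_doc) / n_docs if n_docs else 0.0,
        "per_doc": per_doc,
    }

stats = collection_stats(["a b a", "b c"])
```

For this toy collection, the first document has two unique terms, a most common term occurring twice, and an average term frequency of 1.5, and the term b is present in both documents.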
The gray letters in the first, fifth, and ninth columns are the scheme used by Salton and Buckley in their 1988 paper.[4] The bold letters in the second, sixth, and tenth columns are the scheme used in experiments reported thereafter.