Semantic compression

In natural language processing, semantic compression is the process of compacting the lexicon used to build a textual document (or a set of documents) by reducing language heterogeneity while maintaining the text's semantics. As a result, the same ideas can be represented using a smaller set of words. In most applications, semantic compression is lossy: the increase in prolixity does not compensate for the lexical compression, and the original document cannot be reconstructed in a reverse process.

By generalization

Semantic compression is achieved in two steps, using frequency dictionaries and a semantic network:
Step 1 requires assembling word frequencies and information on semantic relationships, specifically hyponymy. Moving upwards in the word hierarchy, a cumulative concept frequency is calculated by adding the sum of the hyponyms' frequencies to the frequency of their hypernym:

f_cum(w_h) = f(w_h) + Σ f(w_i), over all w_i such that w_h is a hypernym of w_i.

Then the desired number of words with the top cumulative frequencies is chosen to build the target lexicon. In the second step, compression mapping rules are defined for the remaining words, so that every occurrence of a less frequent hyponym is handled as its hypernym in the output text.
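The two steps above can be sketched in code. This is a minimal illustration, not a reference implementation: the hypernym hierarchy and word frequencies are invented toy data, whereas a real system would draw on a WordNet-style semantic network and corpus-derived frequency dictionaries.

```python
from collections import Counter

# Toy child -> parent (hyponym -> hypernym) relation; invented for illustration.
HYPERNYM = {
    "poodle": "dog", "beagle": "dog",
    "dog": "animal", "cat": "animal",
}

def cumulative_frequencies(freq):
    """Step 1: f_cum(h) = f(h) + sum of f over all transitive hyponyms of h."""
    cum = Counter(freq)
    for word, f in freq.items():
        parent = HYPERNYM.get(word)
        while parent is not None:
            cum[parent] += f
            parent = HYPERNYM.get(parent)
    return cum

def compress(tokens, lexicon_size):
    freq = Counter(tokens)
    cum = cumulative_frequencies(freq)
    # Target lexicon: the words with the highest cumulative frequencies.
    lexicon = {w for w, _ in cum.most_common(lexicon_size)}

    def generalize(word):
        # Step 2: replace a word outside the lexicon by the nearest
        # hypernym that was kept in the lexicon.
        while word not in lexicon and word in HYPERNYM:
            word = HYPERNYM[word]
        return word

    return [generalize(t) for t in tokens]

tokens = ["poodle", "beagle", "dog", "cat", "cat"]
print(compress(tokens, lexicon_size=2))
# -> ['dog', 'dog', 'dog', 'animal', 'animal']
```

Here "dog" survives because the frequencies of "poodle" and "beagle" accumulate onto it, and the rarer hyponyms are then mapped up to it, which is exactly the lossy nature of the process: the output no longer distinguishes poodles from beagles.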
The following fragment of text has been processed by semantic compression; words in bold have been replaced by their hypernyms.
The procedure outputs the following text:
Implicit semantic compression

The natural tendency to keep natural language expressions concise can be perceived as a form of implicit semantic compression: omitting words that carry little meaning, or redundant meaningful words (especially to avoid pleonasms).[2]

Applications and advantages

In the vector space model, compacting the lexicon leads to a reduction of dimensionality, which results in lower computational complexity and a positive influence on efficiency. Semantic compression is advantageous in information retrieval tasks, improving their effectiveness in terms of both precision and recall.[3] This is due to more precise descriptors: the effect of language diversity is reduced, language redundancy is limited, and the lexicon moves a step towards a controlled dictionary. As in the example above, it is possible to display the output as natural text (re-applying inflection, adding stop words).
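The dimensionality reduction in the vector space model can be made concrete with a small sketch. The documents and the hyponym-to-hypernym mapping below are invented examples; the point is only that after compression the bag-of-words vocabulary, and hence the vector dimension, shrinks, and documents that differed only in hyponym choice become identical.

```python
# Invented hyponym -> hypernym mapping for illustration.
MAPPING = {"poodle": "dog", "beagle": "dog", "kitten": "cat"}

def bag_of_words(docs):
    """Return the sorted vocabulary and one count vector per document."""
    vocab = sorted({w for d in docs for w in d})
    return vocab, [[d.count(w) for w in vocab] for d in docs]

docs = [["poodle", "runs"], ["beagle", "runs"], ["kitten", "sleeps"]]
vocab, _ = bag_of_words(docs)

# Apply the compression mapping, then rebuild the vector space.
compressed = [[MAPPING.get(w, w) for w in d] for d in docs]
cvocab, _ = bag_of_words(compressed)

print(len(vocab), len(cvocab))  # -> 5 4  (fewer dimensions after compression)
```

Note that the first two documents collapse onto the same vector ("dog runs"), which illustrates why a query for one of them now retrieves both, improving recall.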