Language model
A language model is a model of natural language.[1] Language models are useful for a variety of tasks, including speech recognition,[2] machine translation,[3] natural language generation (generating more human-like text), optical character recognition, route optimization,[4] handwriting recognition,[5] grammar induction,[6] and information retrieval.[7][8]
Large language models, currently their most advanced form, combine larger datasets (frequently scraped from the public internet), feedforward neural networks, and transformers. They have superseded recurrent neural network-based models, which had previously superseded purely statistical models such as the word n-gram language model.
History
Noam Chomsky did pioneering work on language models in the 1950s by developing a theory of formal grammars, which became fundamental to the field of programming languages.[9]
In 1980, statistical approaches were explored and found to be more useful for many purposes than rule-based formal grammars. Discrete representations such as word n-gram language models, which assign probabilities to discrete combinations of words, brought significant advances.
In the 2000s, continuous representations of words, such as word embeddings, began to replace discrete representations.[10] Typically, the representation is a real-valued vector that encodes the meaning of the word in such a way that words closer together in the vector space are expected to be similar in meaning, and common relationships between pairs of words, such as plurality or gender, correspond to roughly consistent offsets between their vectors.
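The toy Python sketch below (with invented three-dimensional vectors; real embeddings are learned from data and typically have hundreds of dimensions) illustrates both properties: words with related meanings lie close together, and a relationship such as gender appears as a roughly consistent offset between vectors.

```python
import numpy as np

# Toy embeddings, invented purely for illustration.
emb = {
    "king":  np.array([0.8, 0.7, 0.1]),
    "queen": np.array([0.8, 0.1, 0.7]),
    "man":   np.array([0.3, 0.9, 0.1]),
    "woman": np.array([0.3, 0.2, 0.8]),
}

def cosine(u, v):
    """Cosine similarity: words closer in the vector space score nearer 1."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Related words end up close together in the vector space...
print(cosine(emb["king"], emb["queen"]))

# ...and a relationship such as gender can appear as a consistent offset:
# king - man + woman lands nearest to queen.
target = emb["king"] - emb["man"] + emb["woman"]
print(max(emb, key=lambda w: cosine(emb[w], target)))
```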
Pure statistical models
In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.[11]
Models based on word n-grams
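A word n-gram model conditions each word's probability on the preceding n − 1 words, with the probabilities estimated from counts in a training corpus. The following is a minimal, illustrative Python sketch for bigrams (n = 2), using unsmoothed maximum-likelihood estimates; the toy corpus and helper names are invented for this example.

```python
from collections import Counter, defaultdict

# Toy corpus; in practice the counts come from a much larger text collection.
corpus = "the cat sat on the mat the cat ate".split()

# Count how often each word follows each preceding word.
bigram_counts = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    bigram_counts[prev][word] += 1

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev) from raw counts."""
    context_total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / context_total if context_total else 0.0

print(bigram_prob("the", "cat"))  # 2/3 in this toy corpus
print(bigram_prob("cat", "sat"))  # 1/2
```

In practice such counts are smoothed (for example with back-off or interpolation) so that unseen n-grams do not receive zero probability.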
Exponential
Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

P(w_m \mid w_1, \ldots, w_{m-1}) = \frac{\exp\bigl(a^{\mathsf{T}} f(w_1, \ldots, w_m)\bigr)}{Z(w_1, \ldots, w_{m-1})}

where Z(w_1, \ldots, w_{m-1}) is the partition function, a is the parameter vector, and f(w_1, \ldots, w_m) is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on a or some form of regularization.
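As an illustrative sketch of this formulation (not an implementation from the cited literature), the Python snippet below uses one indicator feature per (preceding word, next word) pair over a tiny invented vocabulary and normalizes with the partition function Z; in practice the parameter vector a would be fitted to training data under a prior or regularizer rather than drawn at random.

```python
import numpy as np

vocab = ["the", "cat", "sat", "mat"]

def features(history, word):
    """Indicator features: one dimension per (last word of history, word) pair."""
    f = np.zeros(len(vocab) * len(vocab))
    f[vocab.index(history[-1]) * len(vocab) + vocab.index(word)] = 1.0
    return f

# Parameter vector a; random here only to keep the sketch short.
a = np.random.default_rng(0).normal(size=len(vocab) * len(vocab))

def prob(word, history):
    """P(word | history) = exp(a . f(history, word)) / Z(history)."""
    scores = np.array([a @ features(history, w) for w in vocab])
    Z = np.exp(scores).sum()  # partition function over the vocabulary
    return np.exp(a @ features(history, word)) / Z

print(prob("cat", ["the"]))
print(sum(prob(w, ["the"]) for w in vocab))  # probabilities sum to 1
```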
The log-bilinear model is another example of an exponential language model.
Skip-gram model
Neural models
Recurrent neural network
Continuous representations or embeddings of words are produced in recurrent neural network-based language models (also known as continuous space language models).[12] Such continuous space embeddings help to alleviate the curse of dimensionality: the number of possible word sequences grows exponentially with the size of the vocabulary, which causes a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.[13]
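A minimal sketch of such a model in Python with PyTorch is shown below; the vocabulary size and layer widths are arbitrary illustrative choices, and a real model would add a training loop that maximizes the probability of each observed next word.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim, hidden_dim = 1000, 64, 128  # illustrative sizes

class RNNLanguageModel(nn.Module):
    def __init__(self):
        super().__init__()
        # Each word is mapped to a dense real-valued vector (its embedding).
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # The recurrent layer summarizes the preceding words in a hidden state.
        self.rnn = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        # The output layer scores every vocabulary word as the next word.
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids):
        hidden_states, _ = self.rnn(self.embed(token_ids))
        return self.out(hidden_states)  # next-word logits at each position

model = RNNLanguageModel()
tokens = torch.randint(0, vocab_size, (2, 5))  # batch of 2 toy sequences
print(model(tokens).shape)                     # torch.Size([2, 5, 1000])
```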
Large language models
Although large language models sometimes match human performance, it is not clear whether they are plausible cognitive models. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do.[14]
Evaluation and benchmarks
Evaluation of the quality of language models is mostly done by comparison to human-created sample benchmarks derived from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves.[15]
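One widely used intrinsic measure of this kind is perplexity: the exponentiated average negative log-probability a model assigns to held-out text, with lower values indicating a better fit. The Python sketch below is a toy illustration; model_prob is a hypothetical stand-in for any function returning P(word | history).

```python
import math

def perplexity(model_prob, tokens):
    """Perplexity of a model over a held-out token sequence (lower is better)."""
    log_prob_sum = sum(math.log(model_prob(w, tokens[:i])) for i, w in enumerate(tokens))
    return math.exp(-log_prob_sum / len(tokens))

# A deliberately naive model: uniform over a 1,000-word vocabulary.
uniform = lambda word, history: 1 / 1000
print(perplexity(uniform, "the cat sat on the mat".split()))  # 1000.0
```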
Various data sets have been developed for use in evaluating language processing systems.[16] These include:
- Massive Multitask Language Understanding (MMLU)[17]
- Corpus of Linguistic Acceptability[18]
- GLUE benchmark[19]
- Microsoft Research Paraphrase Corpus[20]
- Multi-Genre Natural Language Inference
- Question Natural Language Inference
- Quora Question Pairs[21]
- Recognizing Textual Entailment[22]
- Semantic Textual Similarity Benchmark
- Stanford Question Answering Dataset (SQuAD)[23]
- Stanford Sentiment Treebank[24]
- Winograd NLI
- BoolQ, PIQA, SIQA, HellaSwag, WinoGrande, ARC, OpenBookQA, NaturalQuestions, TriviaQA, RACE, BIG-bench hard, GSM8k, RealToxicityPrompts, WinoGender, CrowS-Pairs[25]
See also
- Cache language model
- Deep linguistic processing
- Ethics of artificial intelligence
- Factored language model
- Generative pre-trained transformer
- Katz's back-off model
- Language technology
- Semantic similarity network
- Statistical model
References
- ↑ Template:Cite book
- ↑ Kuhn, Roland, and Renato De Mori (1990). "A cache-based natural language model for speech recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence 12 (6): 570–583.
- ↑ Andreas, Jacob, Andreas Vlachos, and Stephen Clark (2013). "Semantic parsing as machine translation". Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers).
- ↑ Template:Cite journal
- ↑ Pham, Vu, et al. (2014). "Dropout improves recurrent neural networks for handwriting recognition". 14th International Conference on Frontiers in Handwriting Recognition. IEEE.
- ↑ Htut, Phu Mon, Kyunghyun Cho, and Samuel R. Bowman (2018). "Grammar induction with neural language models: An unusual replication". arXiv preprint.
- ↑ Template:Cite conference
- ↑ Template:Cite conference
- ↑ Template:Cite journal
- ↑ Template:Cite news
- ↑ Template:Cite journal
- ↑ Template:Cite web
- ↑ Template:Cite encyclopedia
- ↑ Template:Cite book
- ↑ Template:Citation
- ↑ Template:Cite arXiv
- ↑ Template:Citation
- ↑ Template:Cite web
- ↑ Template:Cite web
- ↑ Template:Cite web
- ↑ Template:Citation
- ↑ Template:Cite web
- ↑ Template:Cite web
- ↑ Template:Cite web
- ↑ Template:Cite web