Language Models: Application and Limitations

Vaibhav Jagtap
May 23, 2021 · 4 min read



What is a Language Model?

A language model is a mathematical representation of a process. Models are almost always an approximation of the process they describe. There are several reasons for this, but the two most important are:
1. We usually observe the process only a limited number of times.
2. The underlying process can be exceptionally complex, so we simplify it in the model.

A model is built by observing samples generated by the phenomenon to be modelled. In the same way, a language model is built by observing some text. It provides the context needed to distinguish between words and phrases that sound similar. For example, in American English, the phrases “recognize speech” and “wreck a nice beach” sound similar but mean very different things.

Language modeling is central to many important natural language processing tasks.

Recently, neural-network-based language models have demonstrated better performance than classical methods, both on their own and as components of more challenging natural language processing tasks.

Statistical Language Model -

A statistical language model is a probability distribution over sequences of words. Given such a sequence, say of length m, it assigns a probability P(w1, …, wm) to the whole sequence.

Statistical language modeling, or language modeling (LM) for short, is the development of probabilistic models that can predict the next word in a sequence given the words that precede it.
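
These two views are connected by the chain rule of probability, which factors the sequence probability into next-word predictions (a standard identity, shown here only for completeness):

```latex
% Chain rule: the sequence probability factors into next-word probabilities.
P(w_1, \dots, w_m) = \prod_{i=1}^{m} P(w_i \mid w_1, \dots, w_{i-1})
```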

Language Model Types -

  1. Unigram and n-gram
  2. Bidirectional
  3. Exponential
  4. Neural Networks

1. Unigram and n-gram Language Model

In natural language processing, an n-gram is a sequence of n words. For example, “Model” is a unigram (n = 1), “Language Model” is a bigram (n = 2), “Statistical Language Model” is a trigram (n = 3), and so on. Longer n-grams are simply identified by their length, such as 4-gram, 5-gram, and so on. In this section, we will focus only on language models based on unigrams, i.e. single words.
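
To make this concrete, here is a small, self-contained sketch (toy example, not from the original post) of extracting n-grams from a list of words:

```python
# Extract the n-grams (as tuples) from a list of words.
def ngrams(words, n):
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

words = "statistical language model".split()
print(ngrams(words, 1))  # unigrams: [('statistical',), ('language',), ('model',)]
print(ngrams(words, 2))  # bigrams:  [('statistical', 'language'), ('language', 'model')]
print(ngrams(words, 3))  # trigrams: [('statistical', 'language', 'model')]
```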

Training the model

A language model estimates the probability of a word in a sentence, typically based on the words that have come before it. For example, for the sentence “I have a dream”, our goal is to estimate the probability of each word based on the words preceding it in the same sentence: P(I), P(have | I), P(a | I have), and P(dream | I have a).

The unigram language model makes the following assumptions:

  1. The probability of each word is independent of any words before it.
  2. Instead, it depends only on the fraction of times this word appears among all the words in the training text. In other words, training the model is nothing but calculating these fractions for all unigrams in the training text (see the sketch after this list).
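
Here is a minimal sketch of that training procedure, using a made-up toy corpus (a real model would use a large training text and smooth the counts for unseen words):

```python
from collections import Counter

# Toy training text (made up for illustration).
corpus = "i have a dream that one day i have a dream".split()

# Training: each word's probability is its count divided by the total word count.
counts = Counter(corpus)
total = sum(counts.values())
unigram_prob = {word: count / total for word, count in counts.items()}

def sentence_prob(sentence):
    """Probability of a sentence under the unigram independence assumption."""
    prob = 1.0
    for word in sentence.split():
        prob *= unigram_prob.get(word, 0.0)  # unseen words get probability 0 here
    return prob

print(unigram_prob["dream"])            # 2/11, the fraction of "dream" in the corpus
print(sentence_prob("i have a dream"))  # product of the four word probabilities
```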


The Bigram Model

As the name suggests, the bigram model approximates the probability of a word given all the previous words by using only the conditional probability of the one preceding word. In other words, you approximate P(w_n | w_1, …, w_{n-1}) with a probability such as P(the | that).

And so, when you use a bigram model to predict the conditional probability of the next word, you are making the following approximation: P(w_n | w_1, …, w_{n-1}) ≈ P(w_n | w_{n-1}).

This assumption that the probability of a word depends only on the previous word is also known as the Markov assumption.

Markov models are the class of probabilistic models that assume we can predict the probability of some future unit without looking too far into the past.

You can further generalize the bigram model to the trigram model, which looks two words into the past, and this in turn generalizes to the N-gram model.
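
A minimal count-based sketch of the bigram idea is shown below (toy corpus, unsmoothed maximum-likelihood estimates; a real model would add smoothing for unseen word pairs):

```python
from collections import Counter, defaultdict

# Toy training text (made up for illustration).
corpus = "i have a dream that one day i have a friend".split()

# Training: count each (previous word, current word) pair.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def bigram_prob(prev, curr):
    """Maximum-likelihood estimate of P(curr | prev) = count(prev, curr) / count(prev)."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][curr] / total if total else 0.0

print(bigram_prob("have", "a"))   # 1.0: in this corpus "have" is always followed by "a"
print(bigram_prob("a", "dream"))  # 0.5: "a" is followed by "dream" once and "friend" once
```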

2. Bidirectional Language Model

The usual approach in building a language model is to predict a word given the previous words. We can use either an n-gram language model or a variant of a recurrent neural network (RNN). An RNN (theoretically) gives us infinite left context (the words to the left of the target word). But what we would really like is to use both the left and right contexts to see how well the word fits in the sentence. A bidirectional language model enables this.

The problem statement: predict every word in the sentence given the rest of the words in the sentence.

Most deep learning frameworks support bidirectional RNNs. They usually return two sets of hidden vectors: one is the output of the forward RNN and the other is the output of the backward RNN. These hidden vectors are used to predict the next word in the sentence, where for the backward RNN the “next” word is the previous word.
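
Here is a minimal sketch of this idea, assuming PyTorch (the class name, hyperparameters, and indexing scheme below are illustrative choices, not from the original post). The forward LSTM summarizes the left context, the backward LSTM the right context, and the two are combined to predict the word in between:

```python
import torch
import torch.nn as nn

class BidirectionalLM(nn.Module):
    """Predict each interior word from its left context (forward LSTM) and
    right context (backward LSTM), never from the word itself."""

    def __init__(self, vocab_size, embed_dim=64, hidden_dim=128):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, vocab_size)

    def forward(self, tokens):                    # tokens: (batch, seq_len)
        h, _ = self.lstm(self.embed(tokens))      # (batch, seq_len, 2 * hidden_dim)
        fwd, bwd = h[..., :self.hidden_dim], h[..., self.hidden_dim:]
        # The forward state after word t-1 summarizes the left context of word t;
        # the backward state after word t+1 summarizes its right context.
        left = fwd[:, :-2, :]                     # forward states after w_1 .. w_{n-2}
        right = bwd[:, 2:, :]                     # backward states after w_3 .. w_n
        return self.out(torch.cat([left, right], dim=-1))  # logits for w_2 .. w_{n-1}
```

In training, the returned logits would be compared against tokens[:, 1:-1] with a cross-entropy loss, so every interior word is predicted from the rest of the sentence, as stated in the problem statement above.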
