N-grams are contiguous sequences of n items (words, characters, or symbols) extracted from a text. In NLP, they are most commonly used to analyze or model text by breaking it into smaller chunks. For example, a unigram (n=1) is a single word like “cat,” a bigram (n=2) is a pair like “black cat,” and a trigram (n=3) might be “the black cat.” N-grams help capture local patterns in text, such as common phrases or contextual relationships between words. They are simple to compute and serve as foundational building blocks for many NLP tasks.
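The extraction described above is straightforward to implement with a sliding window. As a minimal sketch (the function name `ngrams` is just an illustrative choice):

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams (as tuples) from a list of tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the black cat sat".split()
print(ngrams(tokens, 1))  # unigrams: [('the',), ('black',), ('cat',), ('sat',)]
print(ngrams(tokens, 2))  # bigrams: [('the', 'black'), ('black', 'cat'), ('cat', 'sat')]
print(ngrams(tokens, 3))  # trigrams: [('the', 'black', 'cat'), ('black', 'cat', 'sat')]
```

The same function works for character n-grams: pass a string instead of a word list, since Python slicing treats both the same way.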
In practice, n-grams are used to build statistical language models that predict the likelihood of word sequences. For instance, a bigram model calculates the probability of a word given its immediate predecessor (e.g., “cat” after “black”). These models power applications like autocomplete, spelling correction, and text generation. N-grams also act as features in machine learning pipelines. When converting text into numerical data, bag-of-words models often include n-grams to retain some context. For example, in sentiment analysis, bigrams like “not good” or “very happy” can improve accuracy by encoding negation or intensity that single words might miss.
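A bigram model of this kind can be estimated directly from counts: the probability of a word given its predecessor is the pair's count divided by the predecessor's count (maximum-likelihood estimation). A minimal sketch, with `bigram_probs` as a hypothetical helper name:

```python
from collections import Counter

def bigram_probs(tokens):
    """Estimate P(w2 | w1) = count(w1, w2) / count(w1) from a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

tokens = "the black cat saw the black dog".split()
probs = bigram_probs(tokens)
# "black" appears twice, once followed by "cat", so:
print(probs[("black", "cat")])  # 0.5
```

An autocomplete system built on this would simply rank all `(w1, w2)` pairs sharing the typed prefix word `w1` by their estimated probability.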
A concrete example is training a spam classifier. By extracting n-grams from emails (e.g., “free money” as a bigram), the model learns which phrases are more common in spam versus legitimate messages. Similarly, search engines use n-grams for query suggestions—typing “how to” might trigger trigrams like “how to cook” or “how to code.” However, larger n-grams (e.g., n=4 or higher) can lead to sparse data, as many combinations rarely occur. To mitigate this, techniques like smoothing or backoff (using smaller n when data is insufficient) are applied. While n-grams lack deep semantic understanding, their simplicity and effectiveness make them a staple in NLP workflows.
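One common smoothing technique mentioned above is add-one (Laplace) smoothing, which gives unseen n-grams a small non-zero probability instead of zero. A minimal sketch (function name and example corpus are illustrative):

```python
from collections import Counter

def laplace_bigram_prob(w1, w2, tokens):
    """Add-one smoothed P(w2 | w1): add 1 to every bigram count and
    add the vocabulary size to the denominator, so unseen pairs get
    a small non-zero probability rather than zero."""
    vocab_size = len(set(tokens))
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return (bigrams[(w1, w2)] + 1) / (unigrams[w1] + vocab_size)

tokens = "free money free money now".split()
print(laplace_bigram_prob("free", "money", tokens))  # seen pair: 0.6
print(laplace_bigram_prob("free", "now", tokens))    # unseen pair: 0.2, not 0
```

Backoff works differently but addresses the same sparsity problem: when a trigram has no counts, the model falls back to the corresponding bigram (and then unigram) estimate instead of assigning zero probability.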
Zilliz Cloud is a managed vector database built on Milvus, well suited for building GenAI applications.