How truncation, wildcards, stemming and lemmatization help your literature search

Any good search is built on the “right” search terms—terms that retrieve literature relevant to the question under investigation.

But the English language is tricky. Many variations of a word can capture a single concept. If a researcher writes one version in their article title or abstract and you search with another, you can miss relevant—and important—articles. On the other hand, in some databases you may type one word in the search box, but get results with different, albeit related, words.

What accounts for these results? What is the best way to navigate search interfaces to get the results you need?

Four processes (truncation, wildcards, stemming and lemmatization) can expand what you type to capture more versions of that term. Truncation and wildcards are simple modifications you incorporate into a term you type. Stemming and lemmatization are algorithmic adjustments built into a database platform. Knowing how they work, and how to work with them, gives you an easy way to improve your literature searches.

Ways you can make your search more comprehensive

The main way a researcher can optimize their search is with truncation. Wildcards are helpful, too. In each of these methods, you type some of your search term’s letters and combine those letters with symbols that stand in for the possibility of letters you are not actually typing.

What is truncation?

To truncate a search term, you type the starting letters, or stem, of a word followed by a designated symbol, such as *, $, or !.

When the truncation symbol is added to a stem, the database brings back any results that match the letters you typed plus any results that have more letters following on from what you typed. For example, results for toxin* could include toxin, toxins, toxinogenesis, toxinogenic, toxinotype, toxinotypes, toxinotyping, toxinfective, toxinaemia, and toxinometer.
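To make the matching rule concrete, here is a minimal Python sketch of right-hand truncation. It is only an illustration of the idea, not how any particular database implements it: the function name truncation_to_regex and the sample titles are made up, and treating "letters" as the regex class \w is a simplifying assumption.

```python
import re

def truncation_to_regex(term, symbol="*"):
    """Turn a right-truncated term such as 'toxin*' into a compiled regex.

    Hypothetical helper for illustration; real platforms have their own rules.
    """
    if not term.endswith(symbol):
        raise ValueError("expected a term ending in the truncation symbol")
    stem = term[: -len(symbol)]
    # \b anchors the stem at the start of a word;
    # \w* stands in for zero or more further characters after the stem.
    return re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)

# Made-up titles used only to show which ones a truncated term would catch.
titles = [
    "Toxins in stored grain",
    "Toxinotyping of Clostridium perfringens isolates",
    "Detoxification of aflatoxin B1",
    "Intoxication risk assessment",
]

pattern = truncation_to_regex("toxin*")
matches = [t for t in titles if pattern.search(t)]
print(matches)
```

Only the first two titles match, because the stem has to sit at the start of a word. Words that merely contain the stem, such as aflatoxin, are not picked up by right-hand truncation.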

Truncating a search term is a powerful way to expand a search.

How do you know which symbol to use for truncation?

The truncation symbol is often an asterisk (*). Some databases use a dollar sign ($) instead, and at least a couple of databases use an exclamation point (!). Instructions on how to truncate terms can be found in every database’s help section.

The truncation symbol will always follow the letters you have typed with no space between letters and symbol.

Which databases let you truncate search terms?

Truncation is a universal database search technique. If it doesn’t seem to be working for you, check that you are using the correct truncation symbol for that database.

How do you know where to truncate a term?

If you are not sure how many letters you should be typing to capture your term and its variations without getting too many irrelevant results, experiment with truncating your term at different points. It’s usually worth trying a shorter stem than you think you need in case you are surprised by the relevant results you see. If you are flooded by irrelevant results, make your stem a little longer.

Some databases limit how short a truncated stem can be. PubMed, for instance, requires four typed letters before you can truncate a term.

Can you limit how many letters can follow your truncated stem?

Sometimes. On the Ovid platform, typing adult$1 returns results with adult and adults, but not adultery or adulteration. Typing adult$3 returns results with adult, adults, and adultery, but still not adulteration. Most platforms, however, do not offer this option.
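The adult$1 and adult$3 examples can be sketched in Python as a stem followed by a capped number of extra letters. This is a rough model of the behavior described above, not Ovid's own code, and the function name limited_truncation is hypothetical.

```python
import re

def limited_truncation(stem, max_extra):
    """Match the stem plus at most max_extra further letters (illustrative only)."""
    # [a-z]{0,n} allows between zero and n letters after the stem.
    return re.compile(re.escape(stem) + r"[a-z]{0," + str(max_extra) + r"}", re.IGNORECASE)

words = ["adult", "adults", "adultery", "adulteration"]

# fullmatch requires the whole word to fit the pattern, not just part of it.
up_to_one = [w for w in words if limited_truncation("adult", 1).fullmatch(w)]
up_to_three = [w for w in words if limited_truncation("adult", 3).fullmatch(w)]

print(up_to_one)
print(up_to_three)
```

Counting the letters after the stem shows why the results differ: adults adds one letter, adultery adds three, and adulteration adds seven.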

A database’s help pages detail its truncation symbols’ exact functionality.

Can you truncate the beginning of a word?

Some database platforms do let you truncate both ends of a word. In these databases, typing *toxin* would return toxin, toxins, aflatoxin, aflatoxins, ochratoxin, ochratoxins, and many more results containing the string toxin somewhere in them.
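Truncating both ends amounts to asking whether the typed letters appear anywhere inside a word. A small Python sketch, assuming a made-up helper called double_truncation and a made-up word list, shows the effect:

```python
import re

def double_truncation(term, symbol="*"):
    """Model '*toxin*': any characters, then the core letters, then any characters."""
    core = term.strip(symbol)
    return re.compile(r"\w*" + re.escape(core) + r"\w*", re.IGNORECASE)

# Hypothetical vocabulary for illustration.
words = ["toxin", "toxins", "aflatoxin", "ochratoxins", "toxic", "detoxify"]

pattern = double_truncation("*toxin*")
hits = [w for w in words if pattern.fullmatch(w)]
print(hits)
```

Here toxic and detoxify are left out because neither contains the full string toxin, while aflatoxin and ochratoxins are now included.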

Wildcards

Wildcards are symbols inserted into the middle of words.[1] They allow you to span spelling variations. Some databases use different symbols for what they term mandatory and optional wildcards. Mandatory means a letter, any letter, must stand in for the symbol; optional means that any letter might stand in for the symbol, but no letter might, too. For instance, on the Ovid platform the mandatory wildcard # in organi#ation captures both the British and American spellings organisation and organization, while the optional wildcard ? in favo?rite captures both favourite and favorite.
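The difference between mandatory and optional wildcards is easy to see in a short Python sketch. It assumes the Ovid-style symbols from the example (# for exactly one character, ? for zero or one character); the function name wildcard_to_regex is hypothetical, and other platforms may assign these symbols differently.

```python
import re

def wildcard_to_regex(term):
    """Translate '#' (mandatory) and '?' (optional) wildcards into a regex."""
    parts = []
    for ch in term:
        if ch == "#":
            parts.append(r"\w")    # mandatory: exactly one character must appear
        elif ch == "?":
            parts.append(r"\w?")   # optional: one character or none
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.IGNORECASE)

mandatory = wildcard_to_regex("organi#ation")
optional = wildcard_to_regex("favo?rite")

print(bool(mandatory.fullmatch("organisation")))  # a letter fills the slot
print(bool(mandatory.fullmatch("organiation")))   # nothing fills the slot
print(bool(optional.fullmatch("favourite")))
print(bool(optional.fullmatch("favorite")))
```

The mandatory wildcard rejects a word with nothing in the marked position, which is why an optional wildcard is the better fit for spellings where one version simply drops a letter.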

Ways a database platform might be making your search more comprehensive

Some database platforms aim to make our searching lives easier by building in some automated extensions of search terms. They may search for both US and UK spellings of words regardless of which you type. They may also employ algorithms for stemming and lemmatization to broaden your searches.

Stemming is an automatic process in which the database searches for the word you type, the stem of that word, plus that stem with other possible endings. So, if you search for asked, stemming would be the reason why you see results not just with asked, but also with ask, asks, and asking in them.
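A toy Python stemmer gives a feel for what happens behind the scenes. This is a deliberately crude sketch, assuming a three-suffix list and made-up record titles; real platforms use far more careful stemming algorithms, and the names toy_stem and stem_search are hypothetical.

```python
SUFFIXES = ("ing", "ed", "s")  # assumed, deliberately tiny suffix list

def toy_stem(word):
    """Strip one common ending, keeping at least three letters of stem."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def stem_search(query, documents):
    """Return documents containing any word that shares the query's stem."""
    target = toy_stem(query.lower())
    return [
        doc for doc in documents
        if any(toy_stem(w) == target for w in doc.lower().split())
    ]

# Made-up titles for illustration.
documents = [
    "Patients were asked about diet",
    "Who asks the questions",
    "Asking better questions",
    "A basket of fruit",
    "Ask the expert",
]

results = stem_search("asked", documents)
print(results)
```

Every title containing ask, asks, asked, or asking comes back, but basket does not, because the comparison is between stems of whole words rather than strings of letters.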

Lemmatization cleverly identifies the lemma, or lexical root, of a typed word, and goes out to find results with the different word versions tied to that root. Imagine we type grew into our search field. Grew is the past tense of grow. Lemmatization of grew brings you not only grew and grow, but also growing and grown. Similarly, a database that is lemmatizing terms would take a search for mouse and bring back some results with mouse and some with mice.
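Because irregular forms such as grew and mice cannot be reached by chopping off endings, lemmatization relies on knowledge about the language, which can be pictured as a lookup table. The Python sketch below assumes a tiny hand-written table called LEMMAS; a real lemmatizer draws on a full dictionary and grammatical analysis.

```python
# Hypothetical, hand-written table mapping word forms to their lemma.
LEMMAS = {
    "grew": "grow",
    "grown": "grow",
    "growing": "grow",
    "grows": "grow",
    "mice": "mouse",
}

def lemma(word):
    """Look up the lemma of a word, falling back to the word itself."""
    w = word.lower()
    return LEMMAS.get(w, w)

def expand(query):
    """List every known form that shares the query's lemma."""
    root = lemma(query)
    forms = {root} | {form for form, base in LEMMAS.items() if base == root}
    return sorted(forms)

print(expand("grew"))
print(expand("mouse"))
```

Typing grew leads back to the lemma grow and out again to its other forms, and typing mouse picks up mice, even though the two words share only their first letter.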

If your search returns results that are the product of stemming or lemmatization and they are not helpful, you can generally override the behavior by typing your search term inside quotation marks (e.g. “mouse”) to stipulate that only results containing exactly what you typed are returned.

It’s important to note that although stemming and lemmatization somewhat extend your results net, they are not the same thing as truncation. Stemming and lemmatization both work within predefined boundaries. Truncation returns any word that fits the rules you set. Stemming toxin returns toxin and toxins. Truncating toxin* returns toxin and toxins, plus toxinaemia, toxinogenic, toxinogenics, toxinotyping, toxinotypes, toxin-3α-glucoside, toxinogenesis, etc.

It’s also important to know that while lemmatization overlaps slightly with the function of a database thesaurus, it is much more limited than a thesaurus.

Lemmatization is only based on linguistic connections between words, while the thesaurus pulls together terms based on scientific usage that often goes well beyond dictionary usage. Lemmatization, for instance, would not collate bovine, heifer, cows, oxen, steers, bulls, and calves with cattle, nor BCAA with branched-chain amino acids. The FSTA thesaurus would.

Not all database platforms incorporate automated extensions of your search terms. Some only incorporate them in certain search modes. A quick way to check if what you are typing is being changed or taken literally is to sort your results by date rather than relevance. To make the check even quicker, restrict your sample search to the title field so you do not need to open records to spot which words are being captured with what you typed.

In summary

Preset algorithms in a search interface can help you find the research information you need. If you grasp the limitations of these algorithms, however, and are able to take matters into your own hands by truncating terms or using wildcards, you can significantly improve your searches.

The better you understand the tools you use to find research literature, the better a job you’ll do at building powerful, efficient, and effective searches.

--------------------------------
[1] Some database platforms call all symbols, wherever they are inserted into a word, wildcards. If you can’t find truncation in a database’s help section, look at what they say about wildcards.

Research Skills Blog