What are tokenization, stemming, and named-entity recognition, and why should language-learners care?
Natural Language Processing (NLP) is the automatic, computerized analysis of the language in texts: NLP software processes a text and returns data about the language patterns it contains. Developers of language-learning apps like WordBrewery can use NLP technologies to create efficient and innovative learning software.
WordBrewery uses many NLP technologies to help people learn languages, and we are constantly looking for new ways to use them. Here are some examples of how we use NLP and the challenges each task involves.
Sentence segmentation to divide news articles into sentences
WordBrewery’s sentence collector uses sentence segmentation to identify where sentences begin and end, a computing task that is more difficult and error-prone than one might think (a small illustration in code follows the screenshots below). Here are screenshots of (now-fixed) sentence segmentation bugs that have cluttered our database along the way:
Here, the sentence was split at a semicolon.
This Arabic sentence has a stray angle bracket.
This German sentence has the opening quotation mark, but not the closing quotation mark.
This Italian sentence has the opposite problem: only a closing quotation mark.
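To make the problem concrete, here is a minimal, illustrative Python sketch (not our actual segmenter; the example sentence and rules are invented for this post) showing how a naive splitting rule produces exactly these kinds of bugs, and how a slightly more careful rule handles the semicolon and the closing quotation mark:

```python
import re

# A naive rule of the kind that causes the bugs above: split on any
# punctuation-plus-space, treating semicolons like sentence boundaries
# and ignoring quotation marks entirely.
def naive_split(text):
    return [s.strip() for s in re.split(r'[.;!?]\s+', text) if s.strip()]

# A slightly better rule: split only after sentence-final punctuation and
# keep a closing quotation mark attached to the sentence it closes.
SENTENCE = re.compile(r'.+?[.!?]+["”»」]?(?:\s+|$)', re.S)

def split_sentences(text):
    return [m.group().strip() for m in SENTENCE.finditer(text)]

text = 'Sie sagte: "Ich komme morgen." Danach ging sie; niemand folgte ihr.'
print(naive_split(text))      # breaks at the semicolon, misses the boundary after the quote
print(split_sentences(text))  # two sentences, closing quotation mark kept
```

Even the second rule still fails on abbreviations such as "Dr." or the German "z. B.", which is why production segmenters rely on trained models or long lists of language-specific exceptions.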
Word tokenization to identify the boundaries of each word
Then, in order to show learners definitions and example sentences when they click on a word, WordBrewery uses word tokenization to separate a word from its surrounding text. This is particularly difficult for languages like Chinese and Japanese that do not use spaces between words. The challenges of processing these languages have led some of the largest language-learning apps to focus their efforts elsewhere (an approach we disagree with). In Chinese, for example, we recently fixed a high-priority bug in which clicking a character selected only that character rather than the full word it belongs to. This is a great example of a tricky word tokenization problem (a toy sketch of the problem follows the screenshot below):
WordBrewery just fixed this Chinese word tokenization bug.
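For readers curious about why this is hard, here is a toy Python sketch of forward maximum matching, one classic dictionary-based way to segment unspaced text, and of mapping a clicked character back to the word that contains it. The dictionary, sentence, and function names are invented for illustration; this is not WordBrewery’s tokenizer, and real systems use statistical or neural segmenters.

```python
# Toy dictionary for the sentence 我们明天去图书馆 ("We're going to the library tomorrow").
DICTIONARY = {"我们", "明天", "去", "图书馆", "我", "们"}

def segment(sentence, dictionary=DICTIONARY, max_len=4):
    """Forward maximum matching: always take the longest dictionary word that starts here."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            candidate = sentence[i:i + length]
            if length == 1 or candidate in dictionary:  # fall back to a single character
                words.append(candidate)
                i += length
                break
    return words

def word_at(sentence, char_index):
    """Return the full word containing the clicked character, not just the character."""
    pos = 0
    for word in segment(sentence):
        if pos <= char_index < pos + len(word):
            return word
        pos += len(word)

print(segment("我们明天去图书馆"))    # ['我们', '明天', '去', '图书馆']
print(word_at("我们明天去图书馆", 6))  # '图书馆': clicking 书 selects the whole word
```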
Named-entity recognition to identify names, places, and other proper nouns
We also use named-entity recognition to identify proper nouns. Our sentences consist almost exclusively of high-frequency words: sentences are limited to the top 500 words (lemmas, strictly speaking) at the beginner level, the top 3,000 at the intermediate level, the top 10,000 at the advanced level, and the top 20,000 at the master level. But we don’t want to filter out sentences that include names, especially since those are often the most interesting and timely sentences: when I am studying Italian on WordBrewery and come across a sentence from an Italian newspaper that mentions “Obama” or “Trump,” for example, I am immediately curious. Those are names, though, not words we want to teach or include in our frequency lists. So, for each of WordBrewery’s nineteen languages, we have to build in ways for the WordBrewery algorithm to identify whether a word is a name; in other words, we have to add support for named-entity recognition on a language-by-language basis. Early on we tried to take a shortcut by instructing the algorithm simply to ignore words it did not find on its frequency list; we had to scrap that approach, as this bug screenshot from March illustrates (a small sketch of an entity-aware check follows below):
In this bug, an English sentence appeared among the French sentences.
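As a rough illustration of the entity-aware approach (not our production code), here is a sketch using spaCy, assuming spaCy and its small English model en_core_web_sm are installed; the frequency list and example sentence are made up:

```python
import spacy

# Assumes: pip install spacy && python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

def unknown_words(sentence, frequency_list):
    """Words that are neither named entities nor on the frequency list."""
    doc = nlp(sentence)
    unknown = []
    for token in doc:
        if not token.is_alpha:
            continue                 # skip punctuation and numbers
        if token.ent_type_:
            continue                 # part of a named entity: a name, place, date, etc.
        if token.lemma_.lower() not in frequency_list:
            unknown.append(token.text)
    return unknown

# A made-up stand-in for a "top 500 lemmas" list.
top_500 = {"the", "president", "speak", "to", "reporter", "on"}
print(unknown_words("President Obama spoke to reporters on Monday.", top_500))
# Expect few or no unknowns if the model tags "Obama" and "Monday" as entities.
```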
Lemmatization and stemming to recognize different forms of the same base word
We use lemmatization and stemming to treat different forms of a word as a single word: we don’t want English users to have to learn “phone,” “phones,” and “phoned” as separate words, for example. Lemmatization is an imperfect science, especially with languages that have fewer publicly available NLP technologies, and as with named-entity recognition, we have to add support for it on a language-by-language basis. Here is an example of a lemmatization bug I found while studying Portuguese: the algorithm was confusing one particular reflexive verb, trata-se (it is), with the pronoun se (it) that appears in all Portuguese reflexive constructions, including se dizem (they say), as in the screenshot below.
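As a small illustration of the difference (not our actual lemmatizer), here is a sketch using NLTK’s rule-based Snowball stemmers, assuming NLTK is installed; it also hints at why the Portuguese clitic needs special handling:

```python
from nltk.stem import SnowballStemmer

# English: a rule-based stemmer already collapses these three forms.
english = SnowballStemmer("english")
print({english.stem(w) for w in ["phone", "phones", "phoned"]})  # {'phone'}

# Portuguese: the stemmer only strips suffixes; it has no notion of the
# hyphenated clitic, so "trata-se" is not reduced to the verb "tratar".
# That is exactly the kind of case that needs language-specific handling.
portuguese = SnowballStemmer("portuguese")
print(portuguese.stem("trata-se"))
```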
Frequency distributions to identify high-frequency words
Finally, the NLP technique at the core of WordBrewery’s sentence finder and sentence-scoring algorithm is word-frequency analysis. We are constantly looking for and refining frequency lists for all nineteen languages that WordBrewery teaches. To paraphrase Schopenhauer, “A prerequisite to learning high-frequency words is ignoring low-frequency words: for life is short.” Here is a snapshot of our Japanese word frequency list:
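As a toy illustration of what a frequency distribution does (this is not WordBrewery’s actual scoring algorithm, and the corpus here is made up), here is a minimal Python sketch that counts lemmas, ranks them, and scores a sentence by the rank of its rarest lemma:

```python
from collections import Counter

# A made-up, already-lemmatized corpus; in practice this would be millions of sentences.
corpus = [
    ["the", "president", "speak", "to", "reporter"],
    ["the", "reporter", "ask", "a", "question"],
    ["the", "question", "be", "short"],
]

# Count every lemma, then assign rank 1 to the most frequent lemma, rank 2 to the next, etc.
counts = Counter(lemma for sentence in corpus for lemma in sentence)
rank = {lemma: i + 1 for i, (lemma, _) in enumerate(counts.most_common())}

def hardest_rank(sentence_lemmas):
    """Rank of the rarest lemma in a sentence; lower means easier to read."""
    return max(rank.get(lemma, len(rank) + 1) for lemma in sentence_lemmas)

print(rank["the"])                                 # 1: the most frequent lemma
print(hardest_rank(["the", "reporter", "speak"]))  # driven by the rarest lemma in the sentence
```

In this toy version, a sentence whose hardest lemma falls within the top 500 would qualify for the beginner level, within the top 3,000 for the intermediate level, and so on.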
WordBrewery is constantly brainstorming new ways to help people learn to read their target language and master the high-frequency vocabulary they need for real fluency.
We would love to hear your thoughts about what features you would like to see WordBrewery add in the future. And if you haven’t done so yet, please take a moment to subscribe to this blog by RSS or email.