The questions

I am a longtime user and fan of Duolingo, a platform for learning new languages free of charge. And as a PhD student researching language learning and bilingualism, I have long awaited the opportunity to get my hands on any of the data they accumulate. Well, I finally have. Burr Settles, of the Duolingo team, published a paper in the ACL proceedings discussing spaced repetition in Duolingo data. With this publication they shared their code and data on GitHub. So I’m going to play with it a bit! Here’s what I want to know:

  1. Which languages are hardest to learn?
  2. Which kinds of words are hardest to learn?
  3. Can we build a model to test the differences we see in the data?

Spoil your appetite and skip to the answers.

Clean the data

So before doing anything else let’s load some relevant libraries and take a peek at the data.

library(data.table)
library(ggplot2)
library(Rmisc)
library(stringr)
library(stringdist)
library(lme4)
library(SnowballC)


#Data can be found here: https://github.com/duolingo/halflife-regression

#fread is a faster means of loading a big dataset, which we know this will be
data.raw = fread('bigset.csv')
## Read 12854226 rows and 12 (of 12) columns from 1.219 GB file in 00:00:17

And let’s take a brief look at the data…

head(data.raw)
##    p_recall  timestamp    delta user_id learning_language ui_language
## 1:      1.0 1362076081 27649635    u:FO                de          en
## 2:      0.5 1362076081 27649635    u:FO                de          en
## 3:      1.0 1362076081 27649635    u:FO                de          en
## 4:      0.5 1362076081 27649635    u:FO                de          en
## 5:      1.0 1362076081 27649635    u:FO                de          en
## 6:      1.0 1362076081 27649635    u:FO                de          en
##                           lexeme_id                    lexeme_string
## 1: 76390c1350a8dac31186187e2fe1e178 lernt/lernen<vblex><pri><p3><sg>
## 2: 7dfd7086f3671685e2cf1c1da72796d7    die/die<det><def><f><sg><nom>
## 3: 35a54c25a2cda8127343f6a82e6f6b7d         mann/mann<n><m><sg><nom>
## 4: 0cf63ffe3dda158bc3dbd55682b355ae         frau/frau<n><f><sg><nom>
## 5: 84920990d78044db53c1b012f5bf9ab5   das/das<det><def><nt><sg><nom>
## 6: 56429751fdaedb6e491f4795c770f5a4    der/der<det><def><m><sg><nom>
##    history_seen history_correct session_seen session_correct
## 1:            6               4            2               2
## 2:            4               4            2               1
## 3:            5               4            1               1
## 4:            6               5            2               1
## 5:            4               4            1               1
## 6:            4               3            1               1
str(data.raw)
## Classes 'data.table' and 'data.frame':   12854226 obs. of  12 variables:
##  $ p_recall         : num  1 0.5 1 0.5 1 1 1 1 1 0.75 ...
##  $ timestamp        : int  1362076081 1362076081 1362076081 1362076081 1362076081 1362076081 1362076081 1362082032 1362082044 1362082044 ...
##  $ delta            : int  27649635 27649635 27649635 27649635 27649635 27649635 27649635 444407 5963 5963 ...
##  $ user_id          : chr  "u:FO" "u:FO" "u:FO" "u:FO" ...
##  $ learning_language: chr  "de" "de" "de" "de" ...
##  $ ui_language      : chr  "en" "en" "en" "en" ...
##  $ lexeme_id        : chr  "76390c1350a8dac31186187e2fe1e178" "7dfd7086f3671685e2cf1c1da72796d7" "35a54c25a2cda8127343f6a82e6f6b7d" "0cf63ffe3dda158bc3dbd55682b355ae" ...
##  $ lexeme_string    : chr  "lernt/lernen<vblex><pri><p3><sg>" "die/die<det><def><f><sg><nom>" "mann/mann<n><m><sg><nom>" "frau/frau<n><f><sg><nom>" ...
##  $ history_seen     : int  6 4 5 6 4 4 4 3 8 6 ...
##  $ history_correct  : int  4 4 4 5 4 3 4 3 6 5 ...
##  $ session_seen     : int  2 2 1 2 1 1 1 1 6 4 ...
##  $ session_correct  : int  2 1 1 1 1 1 1 1 6 3 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Ok, wow. We have almost 13 million datapoints from over 115 thousand users learning 6 languages, as well as information about every word learned. Let’s look more at what’s in this dataset.

Each line of the dataset is a word for a given user, for a given session. So the first line is the word “lernt” (seen in the lexeme_string) for user u:FO for some session of German. In this particular session they’ve seen the word twice (session_seen) and gotten it right twice (session_correct). Before this session they’ve seen the word 6 times (history_seen) and gotten it right 4 times (history_correct).

The lexeme_string variable has a lot of juicy information. First we see the surface form, which is the word in question, as it appears. After the /, we see the lemma, which is the base form of the word (unchanged to note person or tense or anything like that). Then in the first set of <>, we have the part of speech, and following that we have a lot of information about how the word is modified - so lernt is a lexical verb, in the present tense, third person, singular (in that order).
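To make that structure concrete, here is a minimal base R sketch that splits a single lexeme_string into its parts. The helper name parse_lexeme is made up for illustration and is not part of the Duolingo code; it assumes the surface/lemma<tag><tag>... layout described above.

```r
# Split one lexeme_string into surface form, lemma, part of speech and modifiers.
# parse_lexeme is a hypothetical helper, assuming the "surface/lemma<tag>..." layout.
parse_lexeme <- function(x) {
  surface <- sub("/.*$", "", x)        # text before the first "/"
  rest    <- sub("^[^/]*/", "", x)     # text after the first "/"
  lemma   <- sub("<.*$", "", rest)     # text before the first "<"
  # Every "<...>" chunk is one tag; strip the brackets afterwards.
  raw_tags <- regmatches(rest, gregexpr("<[^>]+>", rest))[[1]]
  tags     <- gsub("[<>]", "", raw_tags)
  list(surface = surface, lemma = lemma, pos = tags[1], modifiers = tags[-1])
}

parsed <- parse_lexeme("lernt/lernen<vblex><pri><p3><sg>")
str(parsed)
```

The first tag is treated as the part of speech and everything after it as modifiers, which matches the reading of lernt given above.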

Because this data set is massive and R isn’t super for such datasets, we need to clean up the data and add some variables to make it as memory and time efficient as possible.

#Removing all lines with NA data, then keeping only users whose interface language is English, for simplicity
data.raw = data.raw[complete.cases(data.raw),]
data.raw = data.raw[data.raw$ui_language == "en"]

#Getting rid of variables that mean nothing to us
data.raw$timestamp = NULL
data.raw$lexeme_id = NULL

So we’ll be focusing on English-speaking users learning German (de), Spanish (es), French (fr), Italian (it) and Portuguese (pt).

First, I want to find the total number of times a person has seen a word or gotten a word correct, across sessions. To do this I need to find the highest value of history_seen for each word and each person, and remove all rows that aren’t the max for that person and word. This leaves us with each person’s last session for a given word. We will then add that session’s counts to the history counts to calculate a total for each word.

#Create a temporary factor that includes both a user_id and a lexeme_string
data.raw$temp = as.factor(paste(data.raw$user_id, data.raw$lexeme_string, sep = "_"))

#Remove all rows that aren't the max value
data.reduced = data.raw[data.raw[, .I[history_seen == max(history_seen)], by=temp]$V1]

#Create total_variables
data.reduced$total_seen = data.reduced$history_seen + data.reduced$session_seen
data.reduced$total_correct = data.reduced$history_correct + data.reduced$session_correct

#This one was especially fun to name
data.reduced$total_recall = data.reduced$total_correct/data.reduced$total_seen
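As a sanity check on that logic, here is a small base R sketch of the same idea on made-up data (the toy rows and the use of ave() instead of data.table are my own choices for illustration). It keeps the row with the largest history_seen per user and word, then adds the session counts to the history counts.

```r
# Toy data: one user, two words, several sessions each (all values are made up).
toy <- data.frame(
  user_id         = c("u:A", "u:A", "u:A", "u:A", "u:A"),
  lexeme_string   = c("w1", "w1", "w1", "w2", "w2"),
  history_seen    = c(2, 4, 7, 1, 3),
  history_correct = c(2, 3, 5, 1, 2),
  session_seen    = c(2, 3, 1, 2, 2),
  session_correct = c(1, 2, 1, 1, 2),
  stringsAsFactors = FALSE
)

# Key identifying a user/word pair, like the temp column.
key <- paste(toy$user_id, toy$lexeme_string, sep = "_")

# ave() returns the per-group maximum aligned with the original rows.
is_last <- toy$history_seen == ave(toy$history_seen, key, FUN = max)
last <- toy[is_last, ]

# Totals: everything seen before the last session plus the last session itself.
last$total_seen    <- last$history_seen + last$session_seen
last$total_correct <- last$history_correct + last$session_correct
last$total_recall  <- last$total_correct / last$total_seen
print(last)
```

Only one row per user and word survives, and its totals cover the whole learning history for that word.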

Next we’ll aggregate over subjects. By averaging every subject’s response to a lexeme_string, we significantly reduce the size of the dataset, having only an average value for each lexeme string produced. We lose some variability due to averaging, but the dataset is so big that shouldn’t be a problem.

data.reduced = data.frame(aggregate(cbind(total_seen, total_correct, total_recall) ~
                                      lexeme_string + learning_language,
                                    data = data.reduced, mean))

#make learning_language a factor
data.reduced$learning_language = as.factor(data.reduced$learning_language)

#Peek
hist(data.reduced$total_recall)

So this is definitely pretty skewed, but nothing strange considering Duolingo is designed to get people to high levels of recall over time.

Add a lemma variable

That aggregation makes our data MUCH more manageable, reducing our dataset to about 1% of the original data size. Now we’ll add a lemma column. The lexeme_string contains the information about the word that we want to extract. While the first part, the surface form, isn’t that important to us, the “lemma” is. It’s the base word that we’ll be working with. The lemma is the word built into the lexeme string after the / and before the part of speech noted with <. So we’ll simply tell R to check every lexeme string, and extract the characters after the / and before the first <.

#This removes the surface form, i.e. everything before the first / (the / itself is kept)
data.reduced$lexeme_string = gsub("^.*?/","/",data.reduced$lexeme_string)

#The lemma sits between the leading / and the first <
data.reduced$lemma = substr(data.reduced$lexeme_string, 2, as.integer(regexpr('<', data.reduced$lexeme_string, fixed = TRUE)) - 1)

Add item and cognate status variables

Now I want to add a column that expresses every word in the same language. For example, it’s better for analysis if I can represent the French word chien as its meaning plus its language (something like dog, French), so that when I compare it to the Italian word cane (dog, Italian) I know they’re the same item. For this I extracted all of the unique lemmas and fed them through Google Translate, then re-uploaded the result as a .csv here. (Using Google’s API to get translations directly isn’t a free service, but translating a few thousand words in the browser is, so it’s a slightly more time-intensive solution than I’d prefer, but it gets the job done.)

#CSV containing all of our translations.
trans = read.csv('translations.csv', encoding = "UTF-8")


#Add a column combining learning_language and lemma, so that we can match the two documents together
data.reduced$ll_lemma = paste(data.reduced$learning_language, data.reduced$lemma, sep = "_")
trans$ll_lemma = paste(trans$learning_language, trans$lemma, sep = "_")

#We'll add the actual item column in a minute

Additionally, I think cognate status could impact learning. A cognate is a word that is the same, or very similar, between two languages (like animal in Spanish and English). We know that cognate status improves word learning, so I wanted to implement a simple measure of it, using Levenshtein distance.

#Cognate status
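## stringsim returns a similarity score between 0 (nothing in common) and 1 (identical), based on the number of edits needed to change one word into the other.
## Its default method is optimal string alignment ("osa"), a Levenshtein variant that also counts swaps of adjacent characters.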
trans$cognatestatus = stringsim(as.character(trans$item),as.character(trans$lemma))
data.reduced$cognatestatus = with(trans, cognatestatus[match(data.reduced$ll_lemma,ll_lemma)])

#Peek at cognatestatus
hist(trans$cognatestatus)

Cognatestatus looks pretty good. Many words have next to no letter overlap (low values), but others have a pretty even distribution.
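For intuition about what those values mean, here is a rough base R sketch of an edit-distance similarity. It uses adist() from base R rather than stringdist, and the function name edit_similarity and the example word pairs are my own, so treat it as an approximation of the idea rather than a reproduction of what stringsim computes internally.

```r
# Edit-distance similarity in base R: 1 means identical strings,
# 0 means the distance is as large as the longer string.
# edit_similarity is a hypothetical helper, not part of stringdist.
edit_similarity <- function(a, b) {
  # adist() gives the Levenshtein distance; compare the strings pairwise.
  d <- mapply(function(x, y) adist(x, y)[1, 1], a, b, USE.NAMES = FALSE)
  1 - d / pmax(nchar(a), nchar(b))
}

english <- c("animal", "dog", "family")
spanish <- c("animal", "perro", "familia")
sims <- edit_similarity(english, spanish)
print(round(sims, 2))
```

An exact cognate like animal scores 1, a near cognate like family/familia lands in between, and an unrelated pair like dog/perro scores 0.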

#Here I trim the endings off the translations to make sure plural words are not considered different than singular words.
##This is the simplest way to do that without manually editing all translations
trans$item = wordStem(trans$item, language = "english")


#Add item category
data.reduced$item = with(trans, item[match(data.reduced$ll_lemma,ll_lemma)])
data.reduced$ll_lemma = NULL
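The match() lookup used for both cognatestatus and item works like a left join on the ll_lemma key. A minimal sketch with made-up rows (the toy tables below are assumptions for illustration, not the real translations.csv):

```r
# Hypothetical translation table and word list (made-up rows).
trans_toy <- data.frame(
  ll_lemma = c("fr_chien", "it_cane", "es_gato"),
  item     = c("dog", "dog", "cat"),
  stringsAsFactors = FALSE
)
words_toy <- data.frame(
  ll_lemma = c("it_cane", "fr_chien", "de_hund"),
  stringsAsFactors = FALSE
)

# match() gives, for each word, the row of its key in the lookup table
# (NA when the key is missing), so indexing by it acts like a left join.
words_toy$item <- trans_toy$item[match(words_toy$ll_lemma, trans_toy$ll_lemma)]
print(words_toy)
```

A lemma with no entry in the lookup table comes back as NA, so any word missing from the translations file ends up with a missing item.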

Add a part of speech variable

Now let’s add a couple simple variables that might help us capture differences in data. I’d like to extract the part of speech of each word, located in the lexeme_string, found in the first set of <>.

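#Positions of the first < and the first > in each lexeme string
pos_start = as.integer(regexpr('<', data.reduced$lexeme_string, fixed = TRUE))
pos_end = as.integer(regexpr('>', data.reduced$lexeme_string, fixed = TRUE))
data.reduced$pos = substr(data.reduced$lexeme_string, pos_start + 1, pos_end - 1)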

Now I’d like to simplify the part of speech variable. It would be too overwhelming for me to compare every category to one another. Categories like nouns will have big Ns, but the verbs, adjectives and other category words are broken down into lots of subsections that I want to aggregate together. To do this I used the lexeme_reference.txt found on Duolingo’s GitHub and edited it to make simpler categories. All of the categories were distilled into Nouns, Verbs, Function words (like the, on, with etc.) and describer, which is a category I just made up to cover things like adjectives and adverbs.

lexref = read.csv('lexeme_reference.csv')
#Add simplePos based on the POS from the lexeme reference guide
data.reduced$simplePos = with(lexref, Type[match(data.reduced$pos,pos)])


data.reduced = data.reduced[complete.cases(data.reduced),]

#Remove intricate part of speech variable
data.reduced$pos = NULL

#Remove "other" items
data.reduced = data.reduced[data.reduced$simplePos != "other",]

Add Number of Modifiers variable

Next, I want to know how complicated a word is. Surely a word with more modifiers should be more difficult: a basic noun should be easier to get correct than a noun carrying a bunch of endings denoting feminine, plural, accusative case, etc. (Yeah, I’m looking at you, German.) To do that I’ll simply add a variable that counts the number of < characters, as each modifier adds a left bracket. Every lexeme_string value has at least one tag (the part of speech), so this count is always one higher than the number of modifiers, but that shouldn’t matter so long as we consider the value only relatively.

#Add number of modifiers
data.reduced$NoMod = str_count(data.reduced$lexeme_string,pattern = "<")

#Peek again
hist(data.reduced$NoMod)

So the distribution looks pretty reasonable. There are some outliers with ~14 modifiers, but there are really very few of them so we’ll let them be.
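To see what the count looks like on individual words, here is a small base R sketch that counts < characters without stringr. The helper name count_tags is made up, and the example strings are the first few lexemes from the data with the surface form already removed.

```r
# Count "<" characters per string using base R only.
# count_tags is a hypothetical helper mirroring the str_count() call.
count_tags <- function(x) {
  lengths(regmatches(x, gregexpr("<", x, fixed = TRUE)))
}

examples <- c("/lernen<vblex><pri><p3><sg>",
              "/die<det><def><f><sg><nom>",
              "/mann<n><m><sg><nom>")
n_tags <- count_tags(examples)
print(n_tags)
```

Each count includes the part of speech tag, so subtracting one would give the number of modifiers proper; since that shifts every word by the same amount, relative comparisons are unaffected.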

Look at the data

Ok whew. We’ve reduced, added and altered our data down to 16K observations of 13 variables. Now let’s have some fun!

So our first question was to see which languages are harder or easier. The simplest, easiest, grayest way to do that is like this:

#create summary statistics table
sumstat = summarySE(data.reduced, measurevar = "total_recall", groupvars = c("learning_language"))

#Generate simple bar graph
ggplot(sumstat, aes(x=learning_language, y=total_recall, fill = learning_language)) +
  geom_bar(stat="identity", position=position_dodge(), size=.3, aes(fill = learning_language))
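For readers without Rmisc installed, the following base R sketch computes the same kind of per-group summary on made-up values. As far as I understand it, summarySE reports N, the mean, the standard deviation, the standard error and the half-width of a 95% confidence interval; the helper summarise_group below is my own approximation of that, not the package’s code.

```r
# Made-up recall values for two languages.
toy <- data.frame(
  learning_language = c("de", "de", "de", "fr", "fr", "fr"),
  total_recall      = c(0.80, 0.90, 1.00, 0.70, 0.90, 0.80)
)

# Hypothetical helper: N, mean, sd, standard error and 95% CI half-width.
summarise_group <- function(x) {
  n  <- length(x)
  se <- sd(x) / sqrt(n)
  c(N = n, mean = mean(x), sd = sd(x), se = se,
    ci = se * qt(0.975, df = n - 1))
}

# One row per language, one column per statistic.
stats <- t(sapply(split(toy$total_recall, toy$learning_language), summarise_group))
print(round(stats, 3))
```

The bar heights in the plot correspond to the mean column, and the se or ci columns are what you would use for error bars.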