The goal: find a model that accurately predicts a relevant XKCD given multiple words (a Reddit comment). This essentially gave me a bunch of positively single-labeled data, since each comment usually had only one “relevant XKCD” sub-comment attached.
Example: Searching “Python” would return XKCD comics that are related to Python.
I chose Python because of scikit-learn (the sklearn package) and the depth of the reference material available online.
The tutorial on the scikit-learn website was extremely helpful in getting me started on my model. As I moved along I tried different algorithms and finally found SGDClassifier. You should probably read that tutorial; here are some comments that I had along the way.
Naive Bayes Classifier
At first, when I was searching for an example application, I came upon the Naive Bayes Classifier. However, I soon realized that its feasibility was extremely limited, because it could only learn whether a specific piece of text was “positive” or “negative.” Link I used HERE.
This trained really quickly, but it was obvious it wouldn’t fit what I had in mind because of how the model works. The training data that Reddit provides is only positively labeled, and assuming that anything not labeled positive for a particular comment was negative would’ve been really stupid.
So I settled on the
Multinomial Naive Bayes Classifier
This was my first working example. Common XKCD comics produced pretty accurate answers (“Bobby tables”, “python”), but less well known comics did really badly. I think taking the top 5 probabilities gave me ~20% accuracy with the test data.
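For reference, here is a minimal sketch of how a "top 5" accuracy number like this can be computed. The helper name top_k_accuracy and the toy data are made up for illustration; with scikit-learn, prob_rows would be the output of clf.predict_proba and classes would be clf.classes_.

```python
def top_k_accuracy(prob_rows, classes, true_labels, k=5):
    """Fraction of rows whose true label is among the k most probable labels."""
    hits = 0
    for probs, truth in zip(prob_rows, true_labels):
        # Pair each probability with its label, then keep the k largest.
        ranked = sorted(zip(probs, classes), key=lambda pair: pair[0], reverse=True)
        top_labels = [label for _, label in ranked[:k]]
        if truth in top_labels:
            hits += 1
    return hits / float(len(true_labels))


# Toy data (made up): three comments, four candidate comic ids.
classes = [37, 327, 386, 1053]
prob_rows = [
    [0.10, 0.60, 0.20, 0.10],  # true comic 327 is ranked first
    [0.40, 0.05, 0.30, 0.25],  # true comic 327 is ranked last
    [0.25, 0.15, 0.20, 0.40],  # true comic 37 is ranked second
]
true_labels = [327, 327, 37]

print(top_k_accuracy(prob_rows, classes, true_labels, k=1))
print(top_k_accuracy(prob_rows, classes, true_labels, k=2))
```

Counting a prediction as correct when the right comic is anywhere in the top k is much more forgiving than plain accuracy, which is worth keeping in mind when reading the percentages in this post.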
While there did seem to be a lot of training data, I soon realized that a bunch of it was garbage. While some comments were related, some were just wildly off topic, and this messed up the Naive Bayes part of the algorithm. Also, because of the naivety of the classifier, the top probabilities essentially always belonged to pretty generic XKCD comics (#1053, #386, #37). These particular comics had the most training data as well as the longest (since they are so generic), which isn’t helpful at all.
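A quick way to see this kind of imbalance is to count how many training comments each comic has. This is only a sketch: y_train_example is a made-up stand-in for the real label list.

```python
from collections import Counter

# Hypothetical labels: one comic id per training comment.
y_train_example = [1053, 386, 1053, 37, 1053, 386, 327]

counts = Counter(y_train_example)
total = len(y_train_example)

# Most common comics first, with their share of the training set.
for comic_id, n in counts.most_common(3):
    print('#%d: %d comments (%.0f%%)' % (comic_id, n, 100.0 * n / total))
```

If a handful of comics own a large share of the comments, a classifier can score deceptively well just by guessing those comics every time.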
A weird thing is that MultinomialNB in scikit-learn improves its accuracy the more iterations of the SAME training data you pass through. That’s weird because Naive Bayes is supposed to be naive, and multiple passes over the same data shouldn’t have affected anything. Also, the predict_proba output was confusing until I realized what it was returning. Essentially, you need to put your text classification labels (in my case comic ids) in sorted (alphabetical) order so that you can match each probability with its label:
# Order the y_train labels alphabetically (the same sorted order as clf.classes_)
from sortedcontainers import SortedSet
y_map = SortedSet(y_train)
predicted = clf.predict_proba(X_new_tfidf)
# predicted has one row per document in X_new_tfidf
# predicted[i] is an array of probabilities, one per label, in sorted label order
for doc_index, probs in enumerate(predicted):
    for label_index, prob in enumerate(probs):
        print('doc %d => %s (%d%%)' % (doc_index, y_map[label_index], int(prob * 100)))
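Going back to the repeated-training-data oddity: one possible explanation is smoothing. This is a guess, and it assumes the repeated passes ended up as extra training rows (for example through partial_fit or by concatenating the data), which I can't confirm from what I ran. Multinomial Naive Bayes adds a constant alpha to every word count, so when the counts grow and alpha stays the same, the smoothing matters less. The sketch below uses plain Python and made-up counts; smoothed_word_probs is a hypothetical helper, not a scikit-learn function.

```python
def smoothed_word_probs(word_counts, alpha=1.0):
    """Additive (Laplace) smoothing as used by multinomial Naive Bayes:
    P(word | class) = (count + alpha) / (total + alpha * vocabulary_size)."""
    total = sum(word_counts)
    vocab_size = len(word_counts)
    return [(c + alpha) / (total + alpha * vocab_size) for c in word_counts]


# Made-up word counts for one comic over a three-word vocabulary.
counts_once = [2, 1, 0]
# The same training data fed in ten times: every count is ten times larger.
counts_repeated = [c * 10 for c in counts_once]

print(smoothed_word_probs(counts_once))      # [0.5, 0.333..., 0.166...]
print(smoothed_word_probs(counts_repeated))  # [0.636..., 0.333..., 0.030...]
```

The unseen word drops from about 17% to about 3%, so the model trusts the observed counts more. With TF-IDF features the counts are small fractions, which would make the default smoothing relatively heavy and the effect of repeating the data more visible.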
My accuracy was still around 20%, so I decided: why not try a neural network solution? Didn’t work. I think it’s because I didn’t have enough training data, and running epochs (repeated passes over the same training data) was too time-consuming for my short attention span. Since I only had around 100,000 training examples to cover around 1,800 comics, that was only about 55 per comic on average. My accuracy was <1%.
SGDClassifier (Stochastic Gradient Descent)
I was dumb not to read the entire tutorial page and didn’t realize that this classifier existed. It is way better at learning than MultinomialNB, but the cost is the time required to train it as well as the memory required to keep the model in RAM (and storage, for that matter).
This classifier performed the best once I got this thing to work, and near the end I got over 50% (if you count the top 5 results) in classifying my testing data!
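from sklearn.linear_model import SGDClassifier
# n_iter is the parameter name in older scikit-learn releases; newer ones call it max_iter
clf = SGDClassifier(loss='modified_huber', penalty='l2',
                    alpha=6e-4, n_iter=10, n_jobs=6, random_state=42).fit(X_train_tfidf, y_train)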
The modified Huber loss allowed me to use the predict_proba function. I didn’t have the patience to play around with alpha or n_iter simply because it took way too long to train a single model (even with n_jobs, the number of processors, set to 6, which is 75% of my CPU’s resources!).
So yay! It worked! Now I needed to test it with my own data — not just the testing data I compiled from Reddit :P.