Retrieving Data From Google BigQuery (Reddit Relevant XKCD)

So I embarked on a machine learning quest. Here are my notes on how I obtained the data.

 

My first attempt, a few months ago, was to download the entire Reddit corpus (in CSV form) and parse the comments manually. However, it soon became clear that my computer took way too long to run these queries, simply because of the size of the dataset.

Luckily Reddit comments have been imported to Kaggle and Google BigQuery. I used Google BigQuery and started the free year-long trial. If you don’t start the trial, the queries will be really slow. Here is the query I used to pull the data.

I filtered with LIKE '%xkcd.com%' because LIKE '%XKCD%' gave inconsistent results. Matching on the domain also means every row contains a comic link.
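The original query is not reproduced in these notes. As a rough sketch of the kind of filter described above, the query text can be built in Python. The table name and the `body` column are assumptions, not the author's actual values, so swap in whichever comment tables you are querying.

```python
# Hypothetical table name: assumed to be one monthly shard of a public
# Reddit comment dataset on BigQuery. Replace it with your own table.
ASSUMED_TABLE = "fh-bigquery.reddit_comments.2015_01"


def build_xkcd_query(table: str = ASSUMED_TABLE) -> str:
    """Return a standard-SQL query selecting comments that mention xkcd.com.

    The column name `body` is an assumption about the table schema.
    """
    return (
        "SELECT body\n"
        f"FROM `{table}`\n"
        "WHERE body LIKE '%xkcd.com%'"
    )


if __name__ == "__main__":
    # Paste the printed text into the BigQuery console, or pass it to a client.
    print(build_xkcd_query())
```

Note that LIKE is case-sensitive in BigQuery, so '%XKCD%' and '%xkcd%' match different sets of comments.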

The results were quite good. With the data from 201501 onward, I got 81,876 rows. I exported them into my own table and downloaded that table as a CSV. (That took some time.)

The tables contain a lot of information I didn't need, so I reparsed the CSV to keep only 2 columns: the comment and the XKCD comic id.

I extracted the ID from the URL. I removed newline characters, “*”, and “~~”, since those are specific to Reddit formatting and are of no use to me. I also lowercased everything (so “Hello” and “hello” are recognized as the same word) and removed all Unicode characters.
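The cleaning steps above could look something like the sketch below. It is not the original code: the URL pattern, the function names, and the reading of "removed Unicode" as "drop every non-ASCII character" are all assumptions.

```python
import re
from typing import Optional, Tuple

# Assumed link shape: xkcd.com/<number>. Links such as imgs.xkcd.com/comics/...
# carry no numeric id and are treated as "no id found".
XKCD_ID = re.compile(r"xkcd\.com/(\d+)", re.IGNORECASE)


def extract_xkcd_id(comment: str) -> Optional[int]:
    """Return the first comic id linked in the comment, or None."""
    match = XKCD_ID.search(comment)
    return int(match.group(1)) if match else None


def clean_comment(comment: str) -> str:
    """Strip Reddit formatting, lowercase, and drop non-ASCII characters."""
    text = re.sub(r"[\r\n]+", " ", comment)           # newlines -> spaces
    text = text.replace("*", "").replace("~~", "")    # bold/italic, strikethrough
    text = text.lower()
    text = text.encode("ascii", "ignore").decode("ascii")  # assumption: ASCII only
    return re.sub(r"\s+", " ", text).strip()          # collapse leftover spaces


def to_row(comment: str) -> Optional[Tuple[str, int]]:
    """Build one (comment, comic id) row, or None when no id is present."""
    comic_id = extract_xkcd_id(comment)
    if comic_id is None:
        return None
    return clean_comment(comment), comic_id
```

Each returned tuple maps directly onto one line of the two-column CSV, for example via `csv.writer(...).writerow(row)`.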

Later on I realized 2 huge problems:

  • Some comments are pure garbage, but there isn’t a reliable way of detecting them.
    • However, it seemed to me that extremely long comments were the culprit.
      Some comments were extremely long and others were really short. I attempted to remedy the problem by including a summarizer that capped each comment at 2 sentences.
  • Some XKCD ids were overrepresented, especially 1053, 37, and 386. These comics can be applied to a wide variety of situations, so they always managed to come out on top of the predictions. I removed those rows completely since they were too general.
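Both fixes can be sketched in a few lines. The sentence cap below is a plain truncation that stands in for the summarizer, which these notes do not describe, and the sentence splitter is deliberately naive. The function names are hypothetical.

```python
import re
from typing import Iterable, List, Set, Tuple

# The ids named above as too general to be useful.
TOO_GENERAL_IDS: Set[int] = {1053, 37, 386}

# Naive sentence boundary: whitespace that follows '.', '!' or '?'.
# This will mis-split abbreviations like "e.g. this".
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def cap_sentences(comment: str, max_sentences: int = 2) -> str:
    """Keep only the first `max_sentences` sentences of a comment."""
    sentences = SENTENCE_END.split(comment.strip())
    return " ".join(sentences[:max_sentences])


def drop_general_ids(
    rows: Iterable[Tuple[str, int]],
    excluded: Set[int] = TOO_GENERAL_IDS,
) -> List[Tuple[str, int]]:
    """Remove (comment, comic id) rows whose id is in `excluded`."""
    return [(comment, comic_id) for comment, comic_id in rows
            if comic_id not in excluded]
```

Dropping the rows outright is the simplest option. Downsampling the frequent ids would keep them predictable at the cost of a little more code.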

Now I have some pretty decent data to begin training.

 

The code will be on the main page.
