How to Study for “Jeopardy!”: A Data-Driven Approach
Some kind internet-dweller has put together a dataset of all the Jeopardy! questions up through 2004, roughly 200,000 of them. This seems like it could be an interesting dataset. I wanted to ask and answer a specific question: where should a prospective Jeopardy! player concentrate their effort? The data comes in JSON format and has the following information:
- 'category' : the question category, e.g. "HISTORY"
- 'value' : $ value of the question as string, e.g. "$200"
  - Note: This is "None" for Final Jeopardy! and Tiebreaker questions
- 'question' : text of question
  - Note: This sometimes contains hyperlinks and other messy text, such as when there's a picture or video question
- 'answer' : text of answer
- 'round' : one of "Jeopardy!", "Double Jeopardy!", "Final Jeopardy!" or "Tiebreaker"
  - Note: Tiebreaker questions do happen but they're very rare (like once every 20 years)
- 'show_number' : string of show number, e.g. '4680'
- 'air_date' : the show air date in format YYYY-MM-DD
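Since most of these fields arrive as strings, a little cleaning helps before any counting. Here is a minimal sketch of that step, assuming records shaped like the field list above. The two sample records are made-up placeholders rather than rows from the dataset, and the helper names `parse_value` and `clean_record` are my own for illustration.

```python
import json
from datetime import date


def parse_value(raw):
    """Turn a value string such as "$2,000" into an int.

    Returns None when the value is missing (Final Jeopardy! and
    Tiebreaker questions), whether that arrives as JSON null or
    as the literal string "None".
    """
    if raw is None or raw == "None":
        return None
    return int(raw.replace("$", "").replace(",", ""))


def clean_record(rec):
    """Normalize one raw record into typed fields."""
    return {
        "category": rec["category"].strip().upper(),
        "value": parse_value(rec["value"]),
        "round": rec["round"],
        "show_number": int(rec["show_number"]),
        "air_date": date.fromisoformat(rec["air_date"]),
    }


# Made-up placeholder records in the documented shape (not real data).
sample_json = """
[
  {"category": "HISTORY", "value": "$200",
   "question": "placeholder clue", "answer": "placeholder",
   "round": "Jeopardy!", "show_number": "4680", "air_date": "2004-12-31"},
  {"category": "potpourri ", "value": null,
   "question": "placeholder clue", "answer": "placeholder",
   "round": "Final Jeopardy!", "show_number": "4680", "air_date": "2004-12-31"}
]
"""

records = [clean_record(r) for r in json.loads(sample_json)]
for rec in records:
    print(rec["category"], rec["value"], rec["air_date"].year)
```

To run this against the real file, swap `json.loads(sample_json)` for `json.load` on an open file handle.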
So after fetching and transforming the data, an obvious first question is this: what are the top Jeopardy! categories, and is it possible to game the system? To my surprise, the show does not have many ‘favorite’ categories. Here are the top 5:
- Before & After: 587 questions
- Science: 519 questions
- Literature: 496 questions
- American History: 418 questions
- Potpourri: 401 questions
This seems like a lot of questions for a few categories, but in fact over this span there are no fewer than 27,995 categories and 216,930 questions. The top 5 categories represent a whopping 1% of all questions asked, which is…not particularly helpful in targeting your studying. The top 100 categories contain only 11% of all questions.
You might suspect that Jeopardy! has a “long tail” of question categories, and you’d be right. The distribution of question frequency is on the left, with a logged version on the right. The long flat line represents categories with only 5 questions; that is, categories that have appeared only once. The tail is this long mostly because, as long-time viewers will know, categories often have specific and descriptive titles even if their “real category” falls into science, literature, or history. The punny ones don’t make it any easier, either.
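The coverage figures are easy to check. The sketch below redoes the top-5 arithmetic from the counts quoted in this post, then shows a small helper for computing the share of any top-N from a `collections.Counter`. The helper name `top_n_share` and the toy counter are assumptions for illustration, not part of the original analysis.

```python
from collections import Counter


def top_n_share(category_counts, n):
    """Fraction of all questions that fall in the n most common categories."""
    total = sum(category_counts.values())
    top = sum(count for _, count in category_counts.most_common(n))
    return top / total


# Counts quoted in the post for the five most common categories.
top5 = {
    "BEFORE & AFTER": 587,
    "SCIENCE": 519,
    "LITERATURE": 496,
    "AMERICAN HISTORY": 418,
    "POTPOURRI": 401,
}
total_questions = 216_930  # total quoted in the post

share = sum(top5.values()) / total_questions
print(f"Top 5 share: {share:.1%}")  # Top 5 share: 1.1%

# Toy counter (made-up numbers) showing the helper on a full tally.
toy = Counter({"A": 5, "B": 3, "C": 2})
print(top_n_share(toy, 1))  # 0.5
```

With the full dataset, the counter would be built from the category field of every record, for example `Counter(rec["category"] for rec in records)`.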
Incidentally, the data-generating process isn’t easy to model. The weird baseline at five, combined with a few ones (i.e., Final Jeopardy! questions), resists simple quantification, and most common probability distributions fit poorly. I got the best fit by far with a log-normal distribution, but that fails to capture both the extremity of the left tail and the low level of the right tail. It would be neat if the Jeopardy! question distribution mimicked naturally occurring probability distributions, but sadly this does not seem to be the case.
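For anyone who wants to try the fit themselves, here is one simple way to do it; it is a sketch and not necessarily the procedure used for this post. A two-parameter log-normal can be fitted by maximum likelihood by taking the mean and population standard deviation of the logged counts. The `toy_counts` list is made up to imitate the spike at five and is not drawn from the dataset.

```python
import math
from statistics import NormalDist, fmean, pstdev


def fit_lognormal(counts):
    """Maximum-likelihood mu and sigma of a log-normal (location fixed at 0)."""
    logs = [math.log(c) for c in counts]
    return fmean(logs), pstdev(logs)


def lognormal_cdf(x, mu, sigma):
    """P(X <= x) under a log-normal with parameters mu and sigma."""
    return NormalDist(mu, sigma).cdf(math.log(x))


# Made-up questions-per-category counts: a few ones, a big spike at
# five, and a thin tail of frequently reused categories.
toy_counts = [1] * 3 + [5] * 80 + [10] * 10 + [15] * 4 + [50, 120, 400]

mu, sigma = fit_lognormal(toy_counts)
empirical = sum(c <= 5 for c in toy_counts) / len(toy_counts)
fitted = lognormal_cdf(5, mu, sigma)

print(f"mu={mu:.3f}, sigma={sigma:.3f}")
print(f"share of categories with <= 5 questions: empirical={empirical:.2f}, fitted={fitted:.2f}")
```

Comparing the empirical and fitted shares at a few thresholds is a quick way to see where a smooth curve struggles with a spike like the one at five.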
This whole process has given me a lot of sympathy for IBM’s Jeopardy!-playing robot Watson, because this data is extremely messy. The long tail of category labels means the dataset is not easy to work with at all, and brute-force attacks on it will yield almost nothing of value. The only way they could produce such impressive performance was with a very high degree of sophistication in natural language processing.
And unfortunately, the data doesn’t suggest that studying for Jeopardy! is particularly easy to hack. There are no real shortcuts in the question distribution, nor any categories that are disproportionately helpful. A survey of the top categories suggests that you should know your history, but “know everything about history” isn’t exactly easily actionable advice. If you were hoping for a clear and easy answer, I have bad news for you: you can study your ass off and know everything and win the hard way, or build practical weak AI and win the much, much harder way.