Question classification

Dormant threads from the high school sections are preserved here.
Habitat_Against_Humanity
Rikku
Posts: 476
Joined: Sun Jan 21, 2007 8:51 pm
Location: Syracuse, NY

Question classification

Post by Habitat_Against_Humanity »

Hello all,

So I'm in the middle of taking a class on Text Mining and have to submit a proposal for a final project (or else do the boring standard project). I was wondering if any work has been done on having a program work through question text and determine its category. I think developing a rudimentary system for doing so would be quite interesting, and I'm curious if anyone has tried tackling this before. The details of how I would do it are still kind of coalescing in my head, so I was hoping for input. Specifically, I'm told I have to explain how I'll be using Perl and Text Mining in the project. Any thoughts?
Rachel
UChicago 09
grapesmoker
Sin
Posts: 6345
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC

Re: Question classification

Post by grapesmoker »

Habitat_Against_Humanity wrote:Hello all,

So I'm in the middle of taking a class on Text Mining and have to submit a proposal for a final project (or else do the boring standard project). I was wondering if any work has been done on having a program work through question text and determine its category. I think developing a rudimentary system for doing so would be quite interesting, and I'm curious if anyone has tried tackling this before. The details of how I would do it are still kind of coalescing in my head, so I was hoping for input. Specifically, I'm told I have to explain how I'll be using Perl and Text Mining in the project. Any thoughts?
oh god no not perl anything but that

You might be interested in contacting Jordan Boyd-Graber, who's been doing lots of work on NLP in quizbowl.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuissance
jfuchs
Lulu
Posts: 12
Joined: Thu Oct 17, 2013 12:54 am

Re: Question classification

Post by jfuchs »

The guy who made Protobowl wrote a bit about how he classified questions, but doesn't give much detail: https://github.com/neotenic/protobowl#b ... classifier.
Julian Fuchs
MIT '18
Mike Bentley
Sin
Posts: 6461
Joined: Fri Mar 31, 2006 11:03 pm
Location: Bellevue, WA

Re: Question classification

Post by Mike Bentley »

I've been wanting to do this for a while but haven't gotten around to it. QEMS1 seems like a pretty rich source of accurately classified questions to train on if you could get the raw data.
Mike Bentley
Treasurer, Partnership for Academic Competition Excellence
Adviser, Quizbowl Team at University of Washington
University of Maryland, Class of 2008
Habitat_Against_Humanity
Rikku
Posts: 476
Joined: Sun Jan 21, 2007 8:51 pm
Location: Syracuse, NY

Re: Question classification

Post by Habitat_Against_Humanity »

So I spent a couple weeks on and off trying to make a quick and dirty question classifier using multinomial logistic regression. Using a fairly small dataset (3 HSAPQ tournaments in the training set, 1 HSAPQ tourney in the test set), I was able to get around 80% overall accuracy, which seems tolerable for a first start which ended up testing my ability to use regexes more than anything.

EDIT: Added an "up" that I guess could alter how one construes this post.
Last edited by Habitat_Against_Humanity on Wed May 06, 2015 10:37 pm, edited 1 time in total.
Rachel
UChicago 09
UlyssesInvictus
Yuna
Posts: 845
Joined: Thu Feb 10, 2011 7:38 pm

Re: Question classification

Post by UlyssesInvictus »

Habitat_Against_Humanity wrote:So I spent a couple weeks on and off trying to make a quick and dirty question classifier using multinomial logistic regression. Using a fairly small dataset (3 HSAPQ tournaments in the training set, 1 HSAPQ tourney in the test set), I was able to get around 80% overall accuracy, which seems tolerable for a first start which ended testing my ability to use regexes more than anything.
Is this code publicly available? For multinomial regression, that's pretty good, and I'm interested to see how it would do on a more diverse training set (i.e. mix college and HS, housewrites and HSAPQ).
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM
Adventure Temple Trail
Auron
Posts: 2754
Joined: Tue Jul 15, 2008 9:52 pm

Re: Question classification

Post by Adventure Temple Trail »

UlyssesInvictus wrote:
Habitat_Against_Humanity wrote:So I spent a couple weeks on and off trying to make a quick and dirty question classifier using multinomial logistic regression. Using a fairly small dataset (3 HSAPQ tournaments in the training set, 1 HSAPQ tourney in the test set), I was able to get around 80% overall accuracy, which seems tolerable for a first start which ended testing my ability to use regexes more than anything.
Is this code publicly available? For multinomial regression, that's pretty good, and I'm interested to see how it would do on a more diverse training set (i.e. mix college and HS, housewrites and HSAPQ).
Some sort of automatic category-assigner might be a real godsend to Quinterest and/or QBDB 3.0, with the opportunity for users to report errors as needed. (I Am Not A Computer Scientist, but perhaps it may also be possible to automatically assign certain keywords to categories if it's unlikely that word could come up in any other category, e.g. assume that any question containing the word "Chatterley" will be British Literature unless specified otherwise? Or is that the ultimate end-state of what your program does?).
Matt Jackson
University of Chicago '24
Yale '14, Georgetown Day School '10
member emeritus, ACF
Mike Bentley
Sin
Posts: 6461
Joined: Fri Mar 31, 2006 11:03 pm
Location: Bellevue, WA

Re: Question classification

Post by Mike Bentley »

Matthew J wrote:
UlyssesInvictus wrote:
Habitat_Against_Humanity wrote:So I spent a couple weeks on and off trying to make a quick and dirty question classifier using multinomial logistic regression. Using a fairly small dataset (3 HSAPQ tournaments in the training set, 1 HSAPQ tourney in the test set), I was able to get around 80% overall accuracy, which seems tolerable for a first start which ended testing my ability to use regexes more than anything.
Is this code publicly available? For multinomial regression, that's pretty good, and I'm interested to see how it would do on a more diverse training set (i.e. mix college and HS, housewrites and HSAPQ).
Some sort of automatic category-assigner might be a real godsend to Quinterest and/or QBDB 3.0, with the opportunity for users to report errors as needed. (I Am Not A Computer Scientist, but perhaps it may also be possible to automatically assign certain keywords to categories if it's unlikely that word could come up in any other category, e.g. assume that any question containing the word "Chatterley" will be British Literature unless specified otherwise? Or is that the ultimate end-state of what your program does?).
This is essentially how a machine learning algorithm would work. You give it some training data where it's able to learn that words such as Chatterley generally correspond to literature, and when it then sees such a word in new data it's more likely to assign that new question to literature.
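To make that concrete, here is a minimal sketch of the idea in Python, assuming a simple word-count (naive Bayes style) learner with add-one smoothing. This is not the code anyone in this thread actually used; the function names and the four toy training questions are made up purely for illustration.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def train(labeled_questions):
    """Count how often each word appears in each category."""
    word_counts = defaultdict(Counter)
    category_counts = Counter()
    for text, category in labeled_questions:
        category_counts[category] += 1
        word_counts[category].update(tokenize(text))
    return word_counts, category_counts


def classify(text, word_counts, category_counts):
    """Pick the category whose training words best match the new question."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(category_counts.values())
    best_category, best_score = None, -math.inf
    for category, counts in word_counts.items():
        # Start from how common the category is, then add evidence per word.
        score = math.log(category_counts[category] / total_docs)
        denom = sum(counts.values()) + len(vocab)
        for word in tokenize(text):
            # Add-one smoothing so unseen words do not zero out a category.
            score += math.log((counts[word] + 1) / denom)
        if score > best_score:
            best_category, best_score = category, score
    return best_category


# Toy training data, invented for this example only.
TRAINING = [
    ("This author wrote a novel about Lady Chatterley and her lover", "Literature"),
    ("The title character of this novel is a gamekeeper", "Literature"),
    ("This element has atomic number six and forms graphite", "Science"),
    ("This quantity is the derivative of velocity with respect to time", "Science"),
]

model = train(TRAINING)
print(classify("Chatterley appears in this novel", *model))
```

With real data you would train on thousands of categorized questions rather than four, but the mechanism is the same: words like "Chatterley" pull a new question toward whichever category they showed up in during training.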
Mike Bentley
Treasurer, Partnership for Academic Competition Excellence
Adviser, Quizbowl Team at University of Washington
University of Maryland, Class of 2008
jekbradbury
Lulu
Posts: 23
Joined: Fri Jul 09, 2010 9:41 pm

Re: Question classification

Post by jekbradbury »

I found fairly good results last year for question classification using an oldish and not very popular machine learning technique called Latent Dirichlet Allocation (a type of generative topic model); that might be worth looking into. Feel free to email/DM me if you want more details, though classification wasn’t actually the goal of my project so I don’t have accuracy numbers.
James Bradbury, Stanford '16, TJHSST '12
Habitat_Against_Humanity
Rikku
Posts: 476
Joined: Sun Jan 21, 2007 8:51 pm
Location: Syracuse, NY

Re: Question classification

Post by Habitat_Against_Humanity »

Before anyone really gets their hopes up, some caveats/disclaimers:

1. This was done in just a few weeks as a final project for my text mining class. Only tossup questions were used. More than anything, the instructor wanted to see how well I could use Perl and regular expressions. Currently it's not in any sort of implementable state and would need a lot of work before being deployed in any sort of serious use.

2. My stats and comp sci background is almost entirely self-taught so the code is not great to look at. There are a lot of improvements to be made in the code.

3. For now, I'm going to wait to post the code. I'm toying with the idea of basing my thesis around this and don't want to give the appearance of academic dishonesty about what exactly constitutes "my" work by allowing others to make big changes to it without keeping me in the loop. I don't like doing it this way, but them's the breaks. Perhaps I'll put it on GitHub later, but for now I'll just describe what I did.


The reason the data set is relatively small is that I had to manually read and categorize 60 packets' worth of questions. I sorted questions into one of six categories: Lit, History/Geography/Current Events, Science, Fine Arts, RMP and Social Science, and Trash. You can quibble with these designations, but I thought these would work for a preliminary classification. I used the questions from Tournaments 1-4 from the 2008-2009 year found on HSAPQ's website. After converting the PDFs to txts, I did some text cleaning to get rid of page numbers, weird spacing and characters, etc. So I ended up with a txt file containing 60 packets' worth of tossups.

From there, I wrote a simple word frequency program to count the number of occurrences of each word, then took commonly used words and assigned them to one of the categories. So Literature had a set of 10 words associated with it, like forms of "character," "title," "novel," and the like. Science had words like "quantity," "element," "function," and so on. The main Perl program counted the number of occurrences of each word and put it in a csv file.

In R, I used the total number of words in each category as a variable. Thus, if a question used the word "author" twice, "title" four times and no other literature words, its associated literature variable would be six. Similarly, I counted the number of History words, Science words, Art words, and RMPSS words (I kind of ignored Trash). I also threw in average word length as a variable under the logic that science questions might have longer words in general.

From there, I just ran a multinomial logistic regression and looked at how it did. Overall, it predicted 78% of questions correctly on the test set and was close to 90% accurate on lit questions. Like I said, there are a huge number of improvements to be made. Unfortunately, I won't have much time to work on this for some time, but hopefully I'll get around to it sometime in the Fall.
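For anyone who wants to picture the feature step described above, here is a rough sketch in Python (not the original Perl/R code, which hasn't been posted). It assumes prefix matching to catch "forms of" a word, and it only includes the Literature and Science words actually named in this post; the full 10-word lists and the other categories are not shown here, so CATEGORY_STEMS, features and to_csv are illustrative names only.

```python
import csv
import io
import re

# Only the words named in the post; the real lists had about 10 words each.
# Prefix matching ("character" also matches "characters") is an assumption.
CATEGORY_STEMS = {
    "lit": ("character", "title", "novel", "author"),
    "science": ("quantity", "element", "function"),
}


def features(question):
    """Return one row: a word count per category plus average word length."""
    words = re.findall(r"[a-z]+", question.lower())
    row = {}
    for category, stems in CATEGORY_STEMS.items():
        # str.startswith accepts a tuple, so this checks every stem at once.
        row[category] = sum(1 for w in words if w.startswith(stems))
    row["avg_word_len"] = sum(len(w) for w in words) / len(words) if words else 0.0
    return row


def to_csv(questions):
    """Write one feature row per question, ready to load into R or similar."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=[*CATEGORY_STEMS, "avg_word_len"])
    writer.writeheader()
    for question in questions:
        writer.writerow(features(question))
    return out.getvalue()


# The example from the post: "author" twice and "title" four times gives six.
print(features("author author title title title title"))
```

The resulting csv has one row per tossup and one column per variable, which is the shape a multinomial logistic regression expects. Note that prefix matching is crude (for instance "elementary" would count as a science word), so a real version would want a proper stemmer or explicit word lists.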
Rachel
UChicago 09
UlyssesInvictus
Yuna
Posts: 845
Joined: Thu Feb 10, 2011 7:38 pm

Re: Question classification

Post by UlyssesInvictus »

Habitat_Against_Humanity wrote: The reason the data set is relatively small is that I had to manually read and categorize 60 packets worth of questions.

In R, I used the total number of words in each category as a variable.

Unfortunately, I won't have much time to work on this for some time, but hopefully I'll get around to it sometime in the Fall.
1. Since this seems to be the bottleneck, I'd love to see how large a training set we could get using help from the quizbowl community and Jerry's parser.

2. Have you considered using presence rather than frequency? I made a Naive Bayesian classifier that used similar features for email classification, and it improved when I considered whether or not a word was there at all rather than how many times it was there. I also did some preprocessing to find the words with the highest relative difference between categories, though I was doing this with two categories--it might be harder to generalize this to multiple.
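To show the difference between the two feature styles, here is a small Python sketch (not my actual email classifier). The scoring formula for "relative difference" below, |a - b| / (a + b) over per-category document rates, is just one plausible choice and is an assumption; the function names are made up for this example.

```python
import re
from collections import Counter


def tokens(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def frequency_features(text, vocabulary):
    """How many times each vocabulary word occurs in the text."""
    counts = Counter(tokens(text))
    return [counts[w] for w in vocabulary]


def presence_features(text, vocabulary):
    """1 if the vocabulary word occurs at all, 0 otherwise."""
    present = set(tokens(text))
    return [1 if w in present else 0 for w in vocabulary]


def most_distinctive(docs_a, docs_b, top_n=5):
    """Rank words by relative difference in document rate between two categories."""
    # Count each word once per document, so these are document frequencies.
    df_a = Counter(w for d in docs_a for w in set(tokens(d)))
    df_b = Counter(w for d in docs_b for w in set(tokens(d)))
    scores = {}
    for w in set(df_a) | set(df_b):
        rate_a = df_a[w] / len(docs_a)
        rate_b = df_b[w] / len(docs_b)
        scores[w] = abs(rate_a - rate_b) / (rate_a + rate_b)
    # Highest score first; ties are broken alphabetically for stable output.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_n]


vocab = ["novel", "title"]
print(frequency_features("title title novel", vocab))
print(presence_features("title title novel", vocab))
```

One caveat with this particular score: any word that appears in only one category gets the maximum value, even if it appears in a single document, so in practice you would probably want a minimum-count cutoff before ranking.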

3. Okay, well since you said you'd like to keep this private for thesis work, I'll leave the code to you, but I'd love to talk about the approach you used later in the year, if that's still permissible.
Raynor Kuang
quizdb.org
Harvard 2017, TJHSST 2013
I wrote GRAPHIC and FILM
Mike Bentley
Sin
Posts: 6461
Joined: Fri Mar 31, 2006 11:03 pm
Location: Bellevue, WA

Re: Question classification

Post by Mike Bentley »

UlyssesInvictus wrote:
Habitat_Against_Humanity wrote: The reason the data set is relatively small is that I had to manually read and categorize 60 packets worth of questions.

In R, I used the total number of words in each category as a variable.

Unfortunately, I won't have much time to work on this for some time, but hopefully I'll get around to it sometime in the Fall.
1. Since this seems to be the bottleneck, I'd love to see how large a training set we could get using help from the quizbowl community and Jerry's parser.

2. Have you considered using presence rather than frequency? I made a Naive Bayesian classifier that used similar features for email classification, and it improved when I considered whether or not a word was there at all rather than how many times it was there. I also did some preprocessing to find the words with the highest relative difference between categories, though I was doing this with two categories--it might be harder to generalize this to multiple.

3. Okay, well since you said you'd like to keep this private for thesis work, I'll leave the code to you, but I'd love to talk about the approach you used later in the year, if that's still permissible.
As I think I mentioned earlier, I think the best semi-publicly available training set is from QEMS1, which HSAPQ and some other organizations used to produce questions. Perhaps someone from HSAPQ can work on pulling this data. I could probably look into it in June.

I can very easily get the data for public sets in QEMS2, but the number of these is relatively low.
Mike Bentley
Treasurer, Partnership for Academic Competition Excellence
Adviser, Quizbowl Team at University of Washington
University of Maryland, Class of 2008
Alejandro
Wakka
Posts: 226
Joined: Mon Jul 10, 2006 8:39 pm
Location: Seattle, WA

Re: Question classification

Post by Alejandro »

Mike Bentley wrote:
UlyssesInvictus wrote:
Habitat_Against_Humanity wrote: The reason the data set is relatively small is that I had to manually read and categorize 60 packets worth of questions.

In R, I used the total number of words in each category as a variable.

Unfortunately, I won't have much time to work on this for some time, but hopefully I'll get around to it sometime in the Fall.
1. Since this seems to be the bottleneck, I'd love to see how large a training set we could get using help from the quizbowl community and Jerry's parser.

2. Have you considered using presence rather than frequency? I made a Naive Bayesian classifier that used similar features for email classification, and it improved when I considered whether or not a word was there at all rather than how many times it was there. I also did some preprocessing to find the words with the highest relative difference between categories, though I was doing this with two categories--it might be harder to generalize this to multiple.

3. Okay, well since you said you'd like to keep this private for thesis work, I'll leave the code to you, but I'd love to talk about the approach you used later in the year, if that's still permissible.
As I think I mentioned earlier, I think the best semi-publicly available training set is from QEMS1, which HSAPQ and some other organizations used to produce questions. Perhaps someone from HSAPQ can work on pulling this data. I could probably look into it in June.

I can very easily get the data for public sets in QEMS2, but the number of these is relatively low.
I have a bunch of categorized questions from TriviaBot. There should be several thousand questions; I can send you a zip file with them if you're interested.
Alejandro
Naperville Central '07
Harvey Mudd '11
University of Washington '17
Tejas
Rikku
Posts: 258
Joined: Sun May 29, 2011 9:51 pm
Location: Chicago

Re: Question classification

Post by Tejas »

I fixed up the TriviaBot files and extracted all of the questions into one table. I'll try to create a classifier for it, PM me if you want to get the data.
Tejas Raje
Cornell '14
ezubaric
Rikku
Posts: 369
Joined: Mon Feb 09, 2004 8:02 pm
Location: College Park, MD

Re: Question classification

Post by ezubaric »

Sorry to resurrect a dead thread, but we have some code to do question classification here:

https://github.com/miyyer/qb

You can download the code and then run the target:

make data/classifier/category.pkl
Jordan Boyd-Graber
UMD (College Park, MD), Faculty Advisor 2018-present
UC Boulder, Founder / Faculty Advisor 2014-2017
UMD (College Park, MD), Faculty Advisor 2010-2014
Princeton, Player 2004-2009
Caltech (Pasadena, CA), Player / President 2000-2004
Ark Math & Science (Hot Springs, AR), Player 1998-2000
Monticello High School, Player 1997-1998

Human-Computer Question Answering:
http://qanta.org/
Habitat_Against_Humanity
Rikku
Posts: 476
Joined: Sun Jan 21, 2007 8:51 pm
Location: Syracuse, NY

Re: Question classification

Post by Habitat_Against_Humanity »

So, I'm beginning to get more serious about fiddling around with this as a real project. A number of people in this thread have mentioned having access to questions. Would anyone have access to (hopefully already categorized) plain-text ACF questions from, say, the last 10 years? Just roughly, I'm thinking such a set would have around 7,500 tossups, give or take. I don't want to go too far back into the nebulous past, and I think ACF questions are probably the most consistent over time in terms of structure and distribution. Any help is appreciated.
Rachel
UChicago 09
jonah
Auron
Posts: 2383
Joined: Thu Jul 20, 2006 5:51 pm
Location: Chicago

Re: Question classification

Post by jonah »

Habitat_Against_Humanity wrote:So, I'm beginning to get more serious about fiddling around with this as a real project. A number of people in this thread have mentioned having access to questions. Would anyone have access to (hopefully already categorized) plain-text ACF questions from, say, the last 10 years? Just roughly, I'm thinking such a set would have around 7,500 tossups, give or take. I don't want to go too far back into the nebulous past, and I think ACF questions are probably the most consistent over time in terms of structure and distribution. Any help is appreciated.
You might get in touch with Carlo Angiuli about ACFDB's data, although it hasn't been updated in some time. Let me know if you want me to facilitate that communication.
Jonah Greenthal
National Academic Quiz Tournaments