return (sorta) of the question database

Posted: Sun Jan 13, 2008 12:22 am
by grapesmoker
As some people know, I've been working on this on and off for the last year. In the past, I had a complicated system that was supposed to allow various editors to collaborate on editing packets. This was a complicated thing that I never got to work, and now that we have Google Docs, there isn't really a need for such a complex thing anyway.

However, I did think that a nice, searchable database that presented things in a useful format was still a good thing. So I've simplified the system, a lot, to make it much less interactive, but at the same time, I think more useful. Anyway, I've only just gotten the bones of the system up, but all the user-end functions work. There aren't many right now: you can basically view packets that are there and search them. The usefulness will come, I'm sure, when there are more packets available.

At the moment, the only person who is authorized to upload packets is myself. This is mostly due to the fact that my scripts for converting Word files to a parsable XML can often be confused by people's esoteric numberings or markup that appears in places where the script doesn't expect it. So right now, I'm following up on the excellent work done previously by Andrew Hart in bringing many packets in line with reasonable formatting standards. However, if people would like to help, here's how. One way is if you're programming-savvy and know some Perl, I can send you my script and you can help convert files to uploadable XML. The other way is if you have packets that you want to make available, please format them in Word appropriately, in the way described in the next post.

Meanwhile, here is the database. If you find any error messages anywhere, I'm still working on some things.

edit: There are only two tournaments up right now, the 2007 Penn Bowl and Terrapin, but more are coming.

Posted: Sun Jan 13, 2008 12:35 am
by grapesmoker
Here's how you can format your packets if you want to be helpful:

Filename: must be of the form <year> - <tournament name> - <author>.doc
for example: 2007 - Michigan MLK - Jerry Vinokurov.doc
or: 2005 - ACF Nationals - Editors 1.doc
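
A minimal sketch of how the filename convention above could be checked before a packet is sent in. It is written in Python rather than the Perl the import script uses, and the pattern and the name `parse_packet_filename` are assumptions made for illustration, not part of the actual script.

```python
import re

# Hypothetical pattern for "<year> - <tournament name> - <author>.doc".
# The tournament part is matched lazily, so a tournament name that itself
# contains " - " would be split at the first separator.
FILENAME_RE = re.compile(
    r"^(?P<year>\d{4}) - (?P<tournament>.+?) - (?P<author>.+)\.doc$"
)

def parse_packet_filename(filename):
    """Return (year, tournament, author), or None if the name does not fit."""
    match = FILENAME_RE.match(filename)
    if match is None:
        return None
    return (
        int(match.group("year")),
        match.group("tournament"),
        match.group("author"),
    )

print(parse_packet_filename("2007 - Michigan MLK - Jerry Vinokurov.doc"))
print(parse_packet_filename("2005 - ACF Nationals - Editors 1.doc"))
print(parse_packet_filename("MLK packet final.doc"))
```

The first two names are the examples given above; the third does not follow the convention and yields `None`.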

Internal formatting: the words "tossups" and "bonuses" must precede, preferably by themselves, the respective sections of tossups and bonuses.

Tossups: Tossups may be numbered or not, it doesn't matter as long as they are in whatever the right order is. The text of a tossup should not have any line breaks in it. Following the text of the tossup, there should be one or more line breaks, followed by the text "Answer:" which is followed, on the same line, by the text of the answer.

THIS IS IMPORTANT: The text of the answer should not contain any formatting whatsoever beyond the accepted bold+underline combination that designates the required part of the answer. In particular, please do not italicize any part of the answer; I just strip it out anyway, because it tends to confuse the script.
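
To show why the rules above matter to a script, here is a rough Python sketch of a parser that relies on them: one line per tossup, optional numbering, then an "Answer:" line. It works on plain text that has already been extracted from the Word file, and the function name `parse_tossups` and both patterns are assumptions, not the real import routine.

```python
import re

ANSWER_RE = re.compile(r"^answer:\s*(.*)$", re.IGNORECASE)
NUMBER_RE = re.compile(r"^\d+[.)]\s*")  # optional leading "1." or "1)"

def parse_tossups(text):
    """Pair each tossup line with the 'Answer:' line that follows it."""
    tossups = []
    question = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines between question and answer are allowed
        answer = ANSWER_RE.match(line)
        if answer and question is not None:
            tossups.append((question, answer.group(1)))
            question = None
        elif not answer:
            # A tossup broken across several lines would lose all but the
            # last line here, which is why line breaks inside the text hurt.
            question = NUMBER_RE.sub("", line)
    return tossups

sample = """1. This author wrote Tristram Shandy. For 10 points, name him.

Answer: Laurence Sterne
This president succeeded James Garfield.
Answer: Chester Arthur
"""

for question, answer in parse_tossups(sample):
    print(answer, "<-", question)
```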

Bonuses: Bonuses may also be numbered or not. The leadin should be followed by a line break, after which the bonus parts begin. A bonus part is designated by its point value, e.g. [10] or [5].

ALSO IMPORTANT: Do not be inventive in your formatting. Don't write things like "A." or "1)" or "[5 each]." Just write the total bonus value in brackets; if you have a bonus part which admits a range of positive values (for example, the aforementioned "5 each"), then specify the awarding of the points in the text of the bonus part itself, e.g. "Name the guys who wrote Moby Dick and The Scarlet Letter, for five points each."

The text of the bonus part should be followed by a line break and then the word "Answer:" followed on the same line by the text of the answer. Answer text should be formatted as above.
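
The bonus layout described above (leadin, then parts marked by a bracketed point value, each followed by an "Answer:" line) can be sketched the same way. This is an illustration in Python under the assumption that the text has already been pulled out of Word; `parse_bonus` and its patterns are hypothetical names, and it does not strip optional bonus numbering.

```python
import re

PART_RE = re.compile(r"^\[(\d+)\]\s*(.*)$")          # e.g. "[10] Name this..."
ANSWER_RE = re.compile(r"^answer:\s*(.*)$", re.IGNORECASE)

def parse_bonus(text):
    """Split one bonus into its leadin and a list of (points, part, answer)."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if not lines:
        return "", []
    leadin = lines[0]
    parts = []
    pending = None  # (points, part text) still waiting for its answer line
    for line in lines[1:]:
        part = PART_RE.match(line)
        answer = ANSWER_RE.match(line)
        if part:
            pending = (int(part.group(1)), part.group(2))
        elif answer and pending is not None:
            parts.append((pending[0], pending[1], answer.group(1)))
            pending = None
    return leadin, parts

bonus = """Name these American authors, for the stated number of points.
[10] Name the guys who wrote Moby Dick and The Scarlet Letter, for five points each.
Answer: Herman Melville and Nathaniel Hawthorne
[10] He wrote Portnoy's Complaint.
Answer: Philip Roth
"""

leadin, parts = parse_bonus(bonus)
print(leadin)
for points, part, answer in parts:
    print(points, answer)
```

A marker like "[5 each]" would not match the part pattern at all, so that part and its answer would silently disappear from the result.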

That is all! For an example of a properly formatted tossup or bonus, please refer to the ACF official formatting guide. For an example of a properly formatted packet, please look here.

Posted: Sun Jan 13, 2008 12:49 am
by grapesmoker
One thing I forgot to mention:

When writing packets, for the love of all that is holy, avoid special characters. Some, like the fancy dash or the "smart" (read: stupid) quotes are predictable, but wacky Unicode is not. The time it takes you to find that "o" with the umlaut over it is not worth it; no one will see it and it won't help anyone. XML, however, hates Unicode, so when they come up, I have to track them down and kill them because I can never predict which ones are coming.
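
For anyone who wants to check a packet before sending it, a small Python sketch of the idea: replace the predictable characters and list whatever non-ASCII characters are left. The replacement table and the name `clean_predictable` are assumptions for illustration, not what the import script actually does.

```python
# Replacements for the predictable offenders; this table is illustrative only.
PREDICTABLE = {
    "\u2018": "'", "\u2019": "'",   # single "smart" quotes
    "\u201c": '"', "\u201d": '"',   # double "smart" quotes
    "\u2013": "-", "\u2014": "--",  # en dash and em dash
    "\u2026": "...",                # ellipsis character
}

def clean_predictable(text):
    """Replace the known characters and report whatever non-ASCII remains."""
    for fancy, plain in PREDICTABLE.items():
        text = text.replace(fancy, plain)
    leftovers = sorted({ch for ch in text if ord(ch) > 127})
    return text, leftovers

cleaned, leftovers = clean_predictable("\u201cMoby\u2013Dick\u201d by G\u00f6del")
print(cleaned)
print(leftovers)
```

Anything that shows up in `leftovers` is one of the unpredictable characters that would otherwise have to be hunted down by hand.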

Posted: Sun Jan 13, 2008 9:03 am
by pray for elves
How hard would it be to handle quotes to ensure you get your whole answer? While searching for "Philip Roth" I got a bunch of history questions on various Kings Philip.

Posted: Sun Jan 13, 2008 10:18 am
by grapesmoker
DeisEvan wrote:How hard would it be to handle quotes to ensure you get your whole answer? While searching for "Philip Roth" I got a bunch of history questions on various Kings Philip.
Good point. That's something I overlooked but I'm sure it's not that hard to implement.

edit: it looks like an issue relating to how MySQL does fulltext searches vs. matching with the LIKE clause, if you're interested.

By the way, please, if you have a suggestion for a useful function, let me know and I'll do my best to implement it.

Posted: Sun Jan 13, 2008 4:53 pm
by Schweizerkas
If I choose the options "tight search", "both", and "both", and I do a search for Rauschenberg, the search results show the same bonus repeated 4 times (once for each of the 4 tournaments in the database).

As for additional features you could add, it would be nice if one could restrict the search to a certain span of years, like, "I only want to search through tournaments written after 2002." This is a common annoyance with searching through the Stanford archive, that 75% of the search results are from terrible tournaments written ten years ago.

Posted: Sun Jan 13, 2008 5:31 pm
by pray for elves
I'm getting the same repeating bonus issue. Even on just a tight search for my earlier example of Philip Roth, I get two bonuses mentioning him and one tossup. The tossup correctly appears only once, under "Penn Bowl 2007 tossups", but the two bonuses are repeated under the heading of every tournament.

EDIT: After a test, the same does not occur when performing a loose search. The items are properly sorted and appear only once. Also, the repeating issue happens even when just "Bonuses" is selected. Checking with something other than Philip Roth, "Both" for the "Question text/Answer text" selection must be chosen for the repeats to occur.

Posted: Sun Jan 13, 2008 6:40 pm
by grapesmoker
The magic of misplaced parentheses was causing that error. It's fixed now. I will also add an option to choose what years you would like to search.
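
The actual query isn't shown in this thread, so the following is only a guess at the kind of mistake involved: a sketch in Python with an in-memory SQLite database standing in for MySQL, using a hypothetical two-table schema. Because AND binds more tightly than OR, leaving out the parentheses lets the answer-text test escape the join condition, and a matching bonus is then paired with every tournament.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE tournaments (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bonuses (id INTEGER PRIMARY KEY, tournament_id INTEGER,
                          question TEXT, answer TEXT);
    INSERT INTO tournaments VALUES (1, 'Penn Bowl 2007'), (2, 'Terrapin 2007');
    INSERT INTO bonuses VALUES
        (1, 1, 'He painted Bed and Monogram.', 'Robert Rauschenberg'),
        (2, 2, 'Name this author of Tristram Shandy.', 'Laurence Sterne');
""")

# Without parentheses this parses as (join AND question) OR answer,
# so an answer match is returned once per tournament.
broken = """
    SELECT t.name, b.id FROM tournaments t, bonuses b
    WHERE b.tournament_id = t.id AND b.question LIKE ?
       OR b.answer LIKE ?
"""
# With parentheses the join condition applies to both text tests.
fixed = """
    SELECT t.name, b.id FROM tournaments t, bonuses b
    WHERE b.tournament_id = t.id AND (b.question LIKE ?
       OR b.answer LIKE ?)
"""

term = "%Rauschenberg%"
broken_rows = conn.execute(broken, (term, term)).fetchall()
fixed_rows = conn.execute(fixed, (term, term)).fetchall()
print(len(broken_rows), len(fixed_rows))
```

This would at least be consistent with the reports above, where the repeats only appeared when both question text and answer text were searched.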

Posted: Sun Jan 13, 2008 11:15 pm
by KGeee
Nice.

Posted: Mon Jan 14, 2008 3:09 am
by theMoMA
Jerry, could we get an option that lumps together "quality" tournaments of a certain difficulty?

So all well-written regular difficulty questions would be an option, all well-written novice-level, and all well-written CO/Nats level questions. Also, this could be broken down by year, so if you wanted all hard stuff from 2001-2004, Manu/Mill/Auspicious Incident would come up, but if you wanted more recent stuff, it wouldn't. I think this would make the archive a resource for question-writing, because you could say 'hey, I wonder what good tossups on Tamerlane at the novice level look like?' and find out rather quickly.

Posted: Mon Jan 14, 2008 3:35 am
by grapesmoker
theMoMA wrote:Jerry, could we get an option that lumps together "quality" tournaments of a certain difficulty?
I suppose I could set some sort of "quality" flag to mark certain tournaments, but in reality, I'm not sure how useful this would be. While I think having people know which tournaments are supposed to be "quality" is useful, I think it's common knowledge; maybe it would be easier to just put a recommendation on the main page suggesting that those who are interested look at particular tournaments.

At some point, once more sets have been added, I'm going to cross reference the wiki to the database; that is perhaps a better place for people to look if they are interested in knowing what tournaments are good, since that's supposedly a repository of collective circuit wisdom, whereas the database is designed to automate many things and make searching easy.
theMoMA wrote:Also, this could be broken down by year
I'm working on adding searchability by year. It's not evident now because I've neglected to add anything from years before 2007, but I will add a couple of sets tomorrow and then you'll be able to search by year as well.

edit: I also plan to add some frequency analysis, which will make it possible to see how many times something has come up and perhaps even break that down by date/tournament.
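
A rough sketch of what such a frequency analysis might look like, in Python with made-up sample rows; the function name `answer_frequencies` and the (tournament, answer) input shape are assumptions, and the lowercase normalization is far cruder than real answer lines would need.

```python
from collections import Counter

def answer_frequencies(questions):
    """Count how often each answer appears, overall and per tournament."""
    overall = Counter()
    by_tournament = {}
    for tournament, answer in questions:
        key = answer.strip().lower()  # crude normalization of the answer line
        overall[key] += 1
        by_tournament.setdefault(key, Counter())[tournament] += 1
    return overall, by_tournament

# Hypothetical rows: (tournament, answer).
rows = [
    ("2007 Penn Bowl", "Philip Roth"),
    ("2007 Terrapin", "Philip Roth"),
    ("2006 ACF Fall", "philip roth "),
    ("2007 Penn Bowl", "Laurence Sterne"),
]

overall, by_tournament = answer_frequencies(rows)
print(overall.most_common(2))
print(dict(by_tournament["philip roth"]))
```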

Posted: Mon Jan 14, 2008 11:40 am
by Jeremy Gibbs Sampling
Ö is part of eight-bit ASCII, and so are a bunch of common accented characters. If you write Händel and I search for Handel, though, do I hit your question? I'd suspect not. That'd be another reason for not using non-keyboard characters.

Posted: Mon Jan 14, 2008 12:12 pm
by grapesmoker
deep_friar wrote:Ö is part of eight-bit ASCII, and so are a bunch of common accented characters. If you write Händel and I search for Handel, though, do I hit your question? I'd suspect not. That'd be another reason for not using non-keyboard characters.
You're right, the search wouldn't find it.

Posted: Mon Jan 14, 2008 4:29 pm
by theMoMA
grapesmoker wrote:
theMoMA wrote:Jerry, could we get an option that lumps together "quality" tournaments of a certain difficulty?
I suppose I could set some sort of "quality" flag to mark certain tournaments, but in reality, I'm not sure how useful this would be. While I think having people know which tournaments are supposed to be "quality" is useful, I think it's common knowledge; maybe it would be easier to just put a recommendation on the main page suggesting that those who are interested look at particular tournaments.
I just think it would be useful to writers to be able to see only well-written questions at a certain difficulty so they use the archive as a resource for writing better questions. It seems to me that this would be harder to do if you had to search tournament-by-tournament.

Posted: Mon Jan 14, 2008 8:02 pm
by Matt Weiner
grapesmoker wrote:You're right, the search wouldn't find it.
Irrespective of what people do in the future, this seems like a potential searching problem for the many hundreds of past tournaments that used such characters. Could you maybe come up with a search-and-replace script that changes the most commonly used characters to their nearest visual equivalent (a with an umlaut or an accent to plain a, r with a line through it to r, whatever else is thought of or noticed)? I'm sure the collective knowledge of the board could make a list of the 50 or so most common special characters fairly quickly.

Posted: Mon Jan 14, 2008 8:31 pm
by grapesmoker
Matt Weiner wrote:
grapesmoker wrote:You're right, the search wouldn't find it.
Irrespective of what people do in the future, this seems like a potential searching problem for the many hundreds of past tournaments that used such characters. Could you maybe come up with a search-and-replace script that changes the most commonly used characters to their nearest visual equivalent (a with an umlaut or an accent to plain a, r with a line through it to r, whatever else is thought of or noticed)? I'm sure the collective knowledge of the board could make a list of the 50 or so most common special characters fairly quickly.
Yeah, I'm going to incorporate something like this into the import routine.
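
One way such a replacement step could work, sketched in Python rather than as part of the real import routine: decompose accented letters and drop the accents, with a small hand-made table for characters that do not decompose (such as a barred l or a slashed o). The table `EXTRA` and the name `fold_accents` are assumptions, and the table is only a starting point, not a complete list.

```python
import unicodedata

# Characters that do not split into "base letter + accent" need explicit entries.
EXTRA = {
    "\u0142": "l", "\u0141": "L",   # l with stroke
    "\u00f8": "o", "\u00d8": "O",   # o with slash
    "\u0111": "d", "\u0110": "D",   # d with stroke
    "\u00df": "ss",                 # sharp s
    "\u00e6": "ae", "\u00c6": "AE", # ae ligature
}

def fold_accents(text):
    """Map accented letters to their nearest plain equivalent."""
    text = "".join(EXTRA.get(ch, ch) for ch in text)
    decomposed = unicodedata.normalize("NFKD", text)
    # Combining marks are the accents left over after decomposition.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(fold_accents("H\u00e4ndel"))
print(fold_accents("\u0141\u00f3d\u017a"))
```

Applying the same folding to both the stored text and the search string would let a search for Handel find a question that was written with the umlaut.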

Posted: Tue Jan 15, 2008 5:14 am
by grapesmoker
Schweizerkas wrote:As for additional features you could add, it would be nice if one could restrict the search to a certain span of years,
Searchability by years is now enabled. I've added ACF Fall 2006 so people could see what that looks like.

Posted: Tue Jan 15, 2008 3:47 pm
by qroper224
I searched for "Sterne" and I got a 2007 Penn Bowl tossup (which does mention him) but then I also get a 2007 ACF Nats bonus on Lang Xang...

I have "both" on for both tossup-bonus and answer text/question text. I'm searching in 2007.

Posted: Tue Jan 15, 2008 4:27 pm
by grapesmoker
qroper224 wrote:I searched for "Sterne" and I got a 2007 Penn Bowl tossup (which does mention him) but then I also get a 2007 ACF Nats bonus on Lang Xang...

I have "both" on for both tossup-bonus and answer text/question text. I'm searching in 2007.
This issue is due to my ineptitude in writing proper code for searches. If you actually search for "sterne" in the text of the bonus, you'll see that the search matched "westerners." Right now, what I'm doing is brute-force matching text that matches the search string, so what I need to figure out is how to exclude the false positives.

For those who know about searching MySQL databases, something I'm learning on the fly, the distinction is between searching using the "LIKE" clause (what I call "tight"), which allows you to use pattern matches, and using the "MATCH AGAINST" clause ("loose") which uses a fulltext search. "MATCH AGAINST" is smart enough to know that you're looking for distinct words when you search for, say, "Sterne," but not smart enough to figure out that when you search for "Chester Arthur" what you don't want is anything that mentions either a "chester" or an "arthur." Eventually, what I think will happen is that I will figure out the pattern matching syntax and advise people to use the tight search with pattern matching to find more complex strings. Right now, I know for a fact that the "%" is the wildcard symbol, so if you entered "chester%arthur" you would only find those texts which have "chester" followed by any number of characters followed by "arthur." For many searches this is good, but you can see how searching for two common terms this way will return many false positives.

For now, my advice is to keep your searches to the minimum you think you need to match what you're looking for. Play around with the options and keep me posted on what seems to work best for people. As more tournaments are added to the database, search issues will become increasingly important, so I want to be sure to get that right.

edit: I am retarded. I just realized that MySQL can do regular expressions, so hopefully I'll have a more functional search up in the next couple days that will allow you much more control over how you want to search.
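
A rough Python illustration of the matching behaviours discussed in this post, using regular expressions instead of MySQL. The function names are made up for the example, and the "loose" version only approximates the described behaviour of a fulltext search; it ignores everything else MySQL does, such as stopwords and relevance ranking.

```python
import re

def substring_match(term, text):
    """Roughly LIKE '%term%': match anywhere, even inside another word."""
    return term.lower() in text.lower()

def whole_word_match(term, text):
    """Match the term only when it stands as a distinct word."""
    pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

def any_word_match(query, text):
    """Roughly the loose behaviour: a hit on any one query word counts."""
    return any(whole_word_match(word, text) for word in query.split())

def phrase_match(query, text):
    """Require the query words as distinct words, in order, side by side."""
    words = [re.escape(word) for word in query.split()]
    pattern = r"\b" + r"\s+".join(words) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None

bonus = "This kingdom had little contact with westerners."
print(substring_match("sterne", bonus))    # substring hit inside "westerners"
print(whole_word_match("sterne", bonus))   # no distinct word "sterne"

tossup = "Arthur Miller wrote The Crucible."
print(any_word_match("Chester Arthur", tossup))
print(phrase_match("Chester Arthur", tossup))
```

The phrase version is the behaviour most people expect when they put a name in quotes.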

Posted: Tue Jan 15, 2008 4:45 pm
by Red-necked Phalarope
There appear to be some formatting issues with a few of the packets, notably some of the bonuses in the Casey Retterer packet and pretty much all of the tossups in the Finals packet for "Maryland Terrapins". This looks fantastic conceptually, though, and I did get a kick out of reading about that famous John Cougar Mellencamp song "Captain John Yossarian"...

Posted: Tue Jan 15, 2008 4:59 pm
by grapesmoker
Casanova Frankenstein wrote:There appear to be some formatting issues with a few of the packets, notably some of the bonuses in the Casey Retterer packet and pretty much all of the tossups in the Finals packet for "Maryland Terrapins". This looks fantastic conceptually, though, and I did get a kick out of reading about that famous John Cougar Mellencamp song "Captain John Yossarian"...
Thanks for bringing that to my attention. The import routine has been changing with every set that I upload to catch problems I didn't foresee, and the Terrapin packets were imported with an older version. I'll fix this at some point today.

Posted: Wed Jan 16, 2008 2:46 am
by Skepticism and Animal Feed
Do you mind if we email you packets that we've formatted to your specifications?

Posted: Wed Jan 16, 2008 9:45 am
by grapesmoker
Bruce wrote:Do you mind if we email you packets that we've formatted to your specifications?
Not at all! This would be a most welcome thing.

Please send packets to [email protected]. My Brown account likes to filter out compressed archives for some reason.

Posted: Thu Jan 17, 2008 12:56 pm
by grapesmoker
I've rewritten the search routine to be more user-friendly and more accurate. Please try it out. The "tight/loose" options have been eliminated and a simple commonsense syntax (consisting of AND, OR, and * operators) has been substituted. It works exactly like you think it might, but if you're confused, click on the "click here for explanation" link in the search menu.
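
The exact rules of the new syntax are only described on the site, so what follows is one plausible reading sketched in Python, not the real implementation: OR separates alternatives, AND requires every term in an alternative, a bare term must match as whole words, and * stands for any run of letters. The names `term_to_regex` and `matches` are hypothetical.

```python
import re

def term_to_regex(term):
    """Turn one term into a whole-word pattern; * matches any run of letters."""
    pieces = [re.escape(piece) for piece in term.strip().split("*")]
    return re.compile(r"\b" + r"\w*".join(pieces) + r"\b", re.IGNORECASE)

def matches(query, text):
    """True if some OR-branch has all of its AND-terms present in the text."""
    for branch in query.split(" OR "):
        terms = [term for term in branch.split(" AND ") if term.strip()]
        if terms and all(term_to_regex(term).search(text) for term in terms):
            return True
    return False

print(matches("Philip Roth", "Philip Roth wrote Portnoy's Complaint."))
print(matches("Philip Roth", "King Philip II of Spain"))
print(matches("chester AND arthur", "Chester Alan Arthur"))
print(matches("roth OR updike", "John Updike wrote Rabbit, Run."))
print(matches("rausch*", "Robert Rauschenberg"))
```

Under this reading a multi-word term behaves like a quoted phrase, which addresses the Philip Roth complaint, and whole-word matching avoids the "westerners" false positive.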

Posted: Wed Feb 20, 2008 3:27 pm
by vandyhawk
While we're piling on Jerry's coding work, I thought I'd mention that the queries no longer show which tournament a particular question came from. I'm assuming that's not by design.

Posted: Wed Feb 20, 2008 11:09 pm
by grapesmoker
vandyhawk wrote:While we're piling on Jerry's coding work, I thought I'd mention that the queries no longer show which tournament a particular question came from. I'm assuming that's not by design.
Yeah, that shouldn't be happening. I'll look into it.

Posted: Sat Apr 19, 2008 7:50 pm
by Mechanical Beasts
Not to resurrect an old thread or anything, but I was wondering--any reformatting work you'd like me / anyone to do? I'm willing to help out, and not just by contrast with how unwilling I am to start writing final papers.

Posted: Sun Apr 20, 2008 12:31 pm
by grapesmoker
everyday847 wrote:Not to resurrect an old thread or anything, but I was wondering--any reformatting work you'd like me / anyone to do? I'm willing to help out, and not just by contrast with how unwilling I am to start writing final papers.
Andy, I'm always glad to have people help out. I'll probably start putting many more things up over the summer. If you are interested in helping, email me and I will explain the requirements. The short of it is that any packet formatted to the official ACF guidelines should be good.

Posted: Sun Apr 20, 2008 2:26 pm
by walter12
Quick question:

How difficult would it be to add a feature that allows users to search for questions of a specific topic? I'm talking about the most general level of subject tags, such as science, literature, philosophy, etc.

If the code were to be rewritten to incorporate subject tags, I would be happy to go through the question database and tag all questions sometime early this summer.

Posted: Sun Apr 20, 2008 2:41 pm
by ezubaric
I have code that will automatically assign top-level tags to questions (and finer tags to science questions). I sent the code to Jerry some time ago (but Java doesn't play well with a purely php environment, and I don't know php, so it might not make its way into the program), but I'd also be willing to assign topics automatically if I got a database dump. It's not perfect, but it has over 90% accuracy.

Posted: Sun Apr 20, 2008 8:47 pm
by grapesmoker
I do have Jordan's code and I plan to incorporate it into the database at some point. I would say that while categorization is a neat feature, I think it's of secondary importance to the overall functionality of the database. I'm going to add a category field to the database and maybe let some people categorize things by hand; the first goal is to get as many questions as possible, after which I'll work to translate the categorizer to PHP and implement it.

Posted: Mon Apr 21, 2008 7:57 pm
by ezubaric
grapesmoker wrote:the first goal is to get as many questions as possible, after which I'll work to translate the categorizer to PHP and implement it.
Another possibility, rather than incorporating it into php, is just to have a Python or Java daemon that runs nightly and classifies all questions that currently don't have categories. I'd also be willing to write such a thing given a database dump or a database schema (it would be really simple for me to tweak my code to do that).
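
A bare-bones sketch of what such a nightly job could look like in Python. Everything specific here is an assumption: the `questions` table and its columns are a hypothetical schema, SQLite stands in for MySQL so the example runs on its own, and the keyword-counting `classify` function is only a placeholder for a real classifier.

```python
import sqlite3

# Placeholder keyword lists; a real classifier would replace classify() entirely.
KEYWORDS = {
    "literature": ("novel", "poem", "author"),
    "science": ("enzyme", "particle", "theorem"),
    "history": ("king", "treaty", "battle"),
}

def classify(text):
    """Return the category whose keywords appear most often, or None."""
    lowered = text.lower()
    scores = {
        category: sum(lowered.count(word) for word in words)
        for category, words in KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def categorize_uncategorized(conn):
    """Fill in the category of every question that does not have one yet."""
    rows = conn.execute(
        "SELECT id, text FROM questions WHERE category IS NULL"
    ).fetchall()
    updated = 0
    for question_id, text in rows:
        category = classify(text)
        if category is not None:
            conn.execute(
                "UPDATE questions SET category = ? WHERE id = ?",
                (category, question_id),
            )
            updated += 1
    conn.commit()
    return updated

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE questions (id INTEGER PRIMARY KEY, text TEXT, category TEXT)"
)
conn.executemany(
    "INSERT INTO questions (text, category) VALUES (?, ?)",
    [
        ("This author wrote a novel about a whale.", None),
        ("This treaty ended a long battle.", None),
        ("Name this thing.", None),
        ("This enzyme unwinds DNA.", "science"),
    ],
)
updated = categorize_uncategorized(conn)
print(updated)
```

Because the job only touches rows whose category is still empty, it can be run every night without overwriting anything that was categorized by hand.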

Posted: Mon Apr 21, 2008 11:05 pm
by fleurdelivre
what's the database structure on this? can there be a table of categories that users can then help link to questions? it seems like the basic principle should be easy enough considering the limited number of tags we would need...

Posted: Mon Apr 21, 2008 11:19 pm
by grapesmoker
fleurdelivre wrote:what's the database structure on this? can there be a table of categories that users can then help link to questions? it seems like the basic principle should be easy enough considering the limited number of tags we would need...
It's just a MySQL database. Adding a category field is easy but I don't want to give everyone and their dog the option of changing category classifications. I may create database-level accounts for a couple of people to help out or I might do a dump and pass it off to Jordan and see if he can't do something with it. It won't be for at least another month just because I have a bunch of other things I need to do.