S-value revision

Mechanical Beasts
Banned Cheater
Posts: 5673
Joined: Thu Jun 08, 2006 10:50 pm

Re: S-value revision

Post by Mechanical Beasts »

Competitors in this model may not experience changes in ability relative to each other as question difficulty varies.
I find that this isn't usually true in practice; that said, it's an enormously helpful simplifying assumption, it's not hugely far from the truth, and I guess that since SCT questions are going to be of mostly similar difficulty, this won't matter so much. Just thought it was worth pointing out in case down the road someone decides to create a latent ability rating for a team intended to be more broadly applicable (to predict finishes at a variety of tournaments, for example).

I do want to say that this looks awesome, though. Also:
it requires no additional data collection, as all the inputs of the model are tabulated on the official scoresheet
Does this mean that you're retroactively determining difficulty ratings based on conversion statistics? This seems to be a fine way to do it*; I just don't know if you ever explicitly say that this is how it's gonna be.

* but won't DII/DI questions have depressed conversion by the DII population playing them? Or will we construct difficulty ratings by considering conversion stats only by DI teams? I'm sorry if I'm missing something, or if this is determined after page 17.
Andrew Watkins
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Re: S-value revision

Post by cvdwightw »

I would like to point out a couple of things:

--D1 teams playing on D2 questions will artificially deflate D2 difficulty ratings; D2 teams playing on D1 questions will artificially inflate D1 difficulty ratings. It sounds like you're barking up the same "use common questions to estimate ratings" tree that I was suggesting in private correspondence (not on the board), but I'm not sure where or how you treat this adjustment, or whether extrapolation is the best or most practical method of doing this.

--As you've noted, a neg artificially inflates the probability of a tossup being answered by the opposing team (and thus IIA goes out the window). Therefore, a common opponent would be more likely to answer more questions against neg-happy Maryland (55 negs in 270 TUH) than neg-averse Carleton (15 negs in 299 TUH), so we would expect teams that neg more to have lower r-values than teams that neg less. I would also highly suspect that tossups that go dead against either of those teams are significantly more difficult than tossups that go dead due to one of those teams negging; thus a question's difficulty is potentially related to the number of teams that neg the tossup (it's possible that this averages out, but I'm not convinced one way or the other). Furthermore, I'm not sure how you can claim that negs are too rare to make a significant impact when a few teams neg on >20% of the questions and (just an eyeball estimate) the average appears to be around 10%.

--It is demonstrated that team ability can be estimated from both tossup and bonus statistics, but nowhere is it stated which one is used or how these estimates are combined.

--I would be in favor of using this system, provided that the model can improve on my R^2 values. I was able to pull all the data for 2009, compute each OAOAP by reading the result of each game and plugging the relevant subtractions/divisions into a calculator, and use Excel to find my standard deviations, all in approximately the time it took the computer program to run, possibly less. I do not think it wise to require additional computational power for equivalent predictive power.
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
The Friar
Wakka
Posts: 159
Joined: Fri Jul 10, 2009 2:39 pm

Re: S-value revision

Post by The Friar »

Crazy Andy Watkins wrote:
Competitors in this model may not experience changes in ability relative to each other as question difficulty varies.
I find that this isn't usually true in practice; that said, it's an enormously helpful simplifying assumption, it's not hugely far from the truth, and I guess that since SCT questions are going to be of mostly similar difficulty, this won't matter so much. Just thought it was worth pointing out in case down the road someone decides to create a latent ability rating for a team intended to be more broadly applicable (to predict finishes at a variety of tournaments, for example).
Yes. In case I wasn't sufficiently explicit about that, let me second Andy: I don't believe in independence of irrelevant alternatives in this context and neither should you, but I think that suspending disbelief and assuming it holds will not divorce the model from reality so far that it produces unhelpful ratings, so I leave the assumption in for simplicity.
Gordon Arsenoff
Rochester '06
WUStL '14 (really)

Developer of WUStL Updates Statistics Live!
The Friar
Wakka
Posts: 159
Joined: Fri Jul 10, 2009 2:39 pm

Re: S-value revision

Post by The Friar »

Dwight, thanks very much for the comments! If I may be so callous, I'll take them out of order.
cvdwightw wrote:It is demonstrated that team ability can be estimated from both tossup and bonus statistics, but nowhere is it stated which one is used or how these estimates are combined.
The r-values in the probability equations for tossup outcomes and for bonus outcomes are the same value for a given team; the estimation routine searches the (P+T+3B)-dimensional space of possible parameter values for the unique vector of r-values (and betas and thetas) at which all the question results simultaneously would have been most likely to have been observed. (The maximum likelihood estimates of linear regression coefficients, for instance, are asymptotically equal to the least-squares estimates.) For some more on maximum likelihood estimation, see http://en.wikipedia.org/wiki/Maximum_likelihood.

Caveat: the estimation actually performed is Bayesian rather than classical, meaning a prior belief about the parameter values (in this case, an extremely vague and uninformative one, included as a formality) was incorporated; as the precision of the prior approaches zero, the Bayesian estimate of a parameter approaches the maximum likelihood estimate, and the estimation procedures come to look identical.
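For concreteness, here is a minimal toy version of that simultaneous search in Python -- a binary right/wrong analogue with invented data, not FRIAR's actual three-outcome code; the tiny ridge penalty plays the part of the vague prior just mentioned:
[code]
import numpy as np
from scipy.optimize import minimize

# Records of (team, question, answered correctly?); invented toy data.
data = [(0, 0, 1), (0, 1, 0), (0, 2, 1), (1, 0, 1), (1, 1, 1), (1, 2, 0)]
T, Q = 2, 3  # number of teams, number of questions

def penalized_nll(params):
    r, beta = params[:T], params[T:]                 # abilities, difficulties
    nll = 0.0
    for t, q, y in data:
        p = 1.0 / (1.0 + np.exp(-(r[t] - beta[q])))  # logit functional form
        nll -= y * np.log(p) + (1 - y) * np.log(1 - p)
    # Small ridge term: the vague-prior formality, which also pins down the
    # scale's origin (the model is identified only up to a constant shift).
    return nll + 0.01 * np.sum(params ** 2)

fit = minimize(penalized_nll, np.zeros(T + Q))
print("abilities:", fit.x[:T])
print("difficulties:", fit.x[T:])
[/code]
With real data the only change is bookkeeping: more teams and questions, and the three-outcome tossup probabilities in place of the binary ones.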
cvdwightw wrote:D1 teams playing on D2 questions will artificially deflate D2 difficulty ratings; D2 teams playing on D1 questions will artificially inflate D1 difficulty ratings. It sounds like you're barking up the same "use common questions to estimate ratings" tree that I was suggesting in private correspondence (not on the board), but I'm not sure where or how you treat this adjustment, or whether extrapolation is the best or most practical method of doing this.
Good hearing, Dwight! That's just the tree I'm barking up. Had I known you were proposing this, I'd have given you some credit; in any case, I'm glad to hear I'm not the only one thinking this.

Anyhow, it is once again due to the simultaneous estimation of the difficulty and ability parameters that we get this automatic bridging. Here's a toy example. Suppose Alice took a test consisting of math questions from the SAT I and the GRE General, and Brenda took one consisting of the same problems from the GRE General plus some from the Math GRE. If Alice did as well on the SAT questions as Brenda did on the Math GRE questions, but Alice did worse on the crossover items than the ones she had to herself while Brenda did better, we'd know the following things:

* Brenda is probably better at math than Alice, because she did better on the questions they had in common;
* SAT I math questions are probably easier than GRE General math questions, because Alice did better on the former than on the latter;
* Math GRE questions are probably harder than GRE General math questions, because Brenda did better on the latter than on the former.

Our estimates of how much harder each set of questions was than the last would improve the more Alices and Brendas (and, of course, the more total questions) we had. FRIAR doesn't have access to the divisional identity of any question or any player, but, really, neither do we in the toy example above. The Bayesian or maximum likelihood estimation procedure employed will pick up on these things by considering all questions and players simultaneously, rather than by perturbation or extrapolation starting with only the common-questions part of the data.
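If it helps intuition, here is that toy example as a quick simulation (invented numbers, and not part of FRIAR): only the GRE General block is shared between the two groups, yet a single joint fit recovers the ability gap and both difficulty gaps once the common block anchors the scale.
[code]
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
true_abil = {"alice": -0.5, "brenda": 0.5}
true_diff = np.array([-1.0, 0.0, 1.5])      # SAT I, GRE General, Math GRE
seen = {"alice": [0, 1], "brenda": [1, 2]}  # block 1 is the common bridge
N = 500                                     # questions per block per person

# Simulated number correct on each block each test-taker saw.
counts = {(g, b): rng.binomial(N, 1 / (1 + np.exp(-(a - true_diff[b]))))
          for g, a in true_abil.items() for b in seen[g]}

def nll(params):
    a = {"alice": params[0], "brenda": params[1]}
    d = np.array([params[2], 0.0, params[3]])   # anchor GRE General at 0
    ll = 0.0
    for (g, b), k in counts.items():
        p = 1 / (1 + np.exp(-(a[g] - d[b])))
        ll += k * np.log(p) + (N - k) * np.log(1 - p)
    return -ll

fit = minimize(nll, np.zeros(4))
print("abilities:", fit.x[:2], "difficulties (SAT, Math GRE):", fit.x[2:])
[/code]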

Andrew, I hope the above gets to your question about relative scaling of Division I and Division II teams as well.
cvdwightw wrote:As you've noted, a neg artificially inflates the probability of a tossup being answered by the opposing team (and thus IIA goes out the window). Therefore, a common opponent would be more likely to answer more questions against neg-happy Maryland (55 negs in 270 TUH) than neg-averse Carleton (15 negs in 299 TUH), so we would expect teams that neg more to have lower r-values than teams that neg less. I would also highly suspect that tossups that go dead against either of those teams are significantly more difficult than tossups that go dead due to one of those teams negging; thus a question's difficulty is potentially related to the number of teams that neg the tossup (it's possible that this averages out, but I'm not convinced one way or the other). Furthermore, I'm not sure how you can claim that negs are too rare to make a significant impact when a few teams neg on >20% of the questions and (just an eyeball estimate) the average appears to be around 10%.
Yep. Negs are hairy. You're absolutely right that negs aren't exactly so rare that we wouldn't observe some difference in ratings if we could model them. My position on them is weaker, however: they are rare enough that, at only -5 points a pop, their contribution to game scores is too small to be worth the extreme effort needed to model them accurately. Constructing a theory of negs that could be used in FRIAR would probably take at least as long as the rest of the work done on it to this point, if I or any helpful fellow contributor could do it at all. Modeling them wrong, on the other hand, might be even worse than not modeling them at all, depending on where in parameter space the real world happens to lie.

The lack of a theory of negs and what they mean about ability (and difficulty and question quality) is a misspecification of the data-generating process, though, and model fit could only be improved by including a correct treatment of them. The distortion caused by the way negs break IIA will be somewhat reduced, however, by incorporating powers into the model, as will be done in the end. Opponents of teams that neg a lot will indeed get more tossups, inflating their r-values; however, they'll also power fewer of them, which will press the r-values downward. The effects may not be of the same size, but it is at least encouraging that they point in opposite directions.
cvdwightw wrote:I would be in favor of using this system, provided that the model can improve on my R^2 values. I was able to pull all the data for 2009, compute each OAOAP by reading the result of each game and plugging the relevant subtractions/divisions into a calculator, and use Excel to find my standard deviations, all in approximately the time it took the computer program to run, possibly less. I do not think it wise to require additional computational power for equivalent predictive power.
Ooh, a model-off! Seriously, I'm with you 100%; the main criterion for use should be that the model performs best. All I would suggest is:

* Only a fully-developed version of FRIAR, with powers and nonstandard bonus formats included, optimally tuned with tossup and power specificity parameters, should be compared. This is only because, well, that's what we would want to deploy anyway.
* I assume you're talking about the Spearman (rank order) rather than Pearson correlation when you mention R^2; if not, you should be. It is inappropriate to examine the correlation between cardinal values on one scale and ranks on another, since the ranks might have come from cardinal (known or latent) parameters that naturally live on any range, not necessarily the same one. Rather, to use a coefficient of correlation as a metric, what should be compared are the finish order at ICT and the ordinal ranks of teams on these scales. Again, ultimately, ordinal rank is what we would deploy in order to issue invitations anyhow. (I suspect, for instance, that the correlation of teams' ranks on your metric with their finishes at ICT would be higher than the R^2s you reported if the reported ones are based on the cardinal value on the rating scale.)
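For concreteness, here is that check in Python (invented numbers; finish is negated so that larger means better on both scales):
[code]
import numpy as np
from scipy.stats import pearsonr, spearmanr

rating = np.array([2.1, 1.4, 0.9, 0.2, -0.8])  # cardinal latent-scale values
finish = np.array([1, 2, 4, 3, 5])             # ICT order of finish

r_pearson, _ = pearsonr(rating, -finish)    # cardinal scale against ranks
r_spearman, _ = spearmanr(rating, -finish)  # ranks against ranks
print(r_pearson, r_spearman)
[/code]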
Last edited by The Friar on Mon Aug 17, 2009 6:30 pm, edited 1 time in total.
Gordon Arsenoff
Rochester '06
WUStL '14 (really)

Developer of WUStL Updates Statistics Live!
The Friar
Wakka
Posts: 159
Joined: Fri Jul 10, 2009 2:39 pm

Re: S-value revision

Post by The Friar »

A caveat regarding FRIAR that I don't think I brought out in the paper and a note on its performance, both of which I forgot in the last post.

First, although, as I note, no ranking system can get rid of all opportunities for gaming it, and it will be necessary for NAQT to reserve the right to review and adjust scores when fraud is suspected no matter what is adopted, Dwight's measure is more resistant to gaming than mine is; that is, I expect that right to review will need to be exercised at least somewhat more often under FRIAR. I don't think the difference will be of enormous size, and I do think FRIAR's other advantages outweigh this relative shortcoming, but it should not be overlooked that the Wynne measure has done a very good job of minimizing opportunities for manipulation.

Second, in practice, the model's speed will be increased by a factor of a little more than 2 by running only one Markov chain (dual chains were run for diagnostic purposes, but are not necessary, and perhaps are not even desirable, for inference) and by running WinBUGS on an actual Windows machine rather than using Wine in Linux, which is what I did. The model's speed could be increased by a factor of around 6.1 kajillion by estimating it under maximum likelihood, because MLE algorithms are much faster than Markov chain Monte Carlo techniques for Bayesian estimation (there's no numerical integration in the former) and because then the model could be run in a fast language like C rather than a slow, slow one like WinBUGS. I have so far not taken the trouble to derive the likelihood function of the model, because that would require effort on my part rather than the computer's and because I'm bad at algebra, but it could surely be done.
Gordon Arsenoff
Rochester '06
WUStL '14 (really)

Developer of WUStL Updates Statistics Live!
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Re: S-value revision

Post by cvdwightw »

The Friar wrote:I have so far not taken the trouble to derive the likelihood function of the model, because that would require effort on my part rather than the computer's and because I'm bad at algebra, but it could surely be done.
I went ahead and tried to do this, and the MLE is rather ugly.

For a given tossup i, we assume that r(x_k) is constant = r_k. Furthermore, we make the slight modification that beta(t_i) = r_0; that is, there is a third pseudo-team that earns the tossup only if the other two do not (since there is not an actual team 0, r_0 should be constant for all choices of teams 1 and 2).

We can then represent the probability p(X = k) = e^[sum_j(2*(delta(k,j)-0.5)*r_j)] / sum_l(e^[sum_j(2*(delta(l,j)-0.5)*r_j)]) (sorry this isn't prettier), where the inner sums over j and the outer sum over l in the denominator all run over {0, 1, 2}, and delta is the Kronecker delta function: delta(x,y) = 1 if x = y and 0 otherwise.
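To sanity-check my reading of that: the factor 2*(delta(k,j)-0.5) is just +1 when j = k and -1 otherwise, so the whole expression is a softmax over signed sums of the r's (toy Python with invented r-values):
[code]
import numpy as np

def outcome_probs(r):
    # r = [r_0, r_1, r_2]: the dead-tossup pseudo-team, team 1, team 2.
    signs = 2 * np.eye(3) - 1           # +1 on the diagonal, -1 off it
    logits = signs @ r                  # exponent for each outcome k
    w = np.exp(logits - logits.max())   # subtract the max for stability
    return w / w.sum()

print(outcome_probs(np.array([0.0, 1.0, 0.5])))  # P(dead), P(team 1), P(team 2)
[/code]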

After a whole lot of algebra, I get that r_0,est = 0.5*a_3 + 0.5*ln(e^a_1+e^a_2) + 0.5*ln(0.5+(2/N)*sum(delta(X_m,0))) - 0.5*ln(1.5-(2/N)*sum(delta(X_m,0))), where:

N is the number of random draws
X_m is the variable on the mth of N random draws
a_1 is r_1,est - r_2,est
a_2 is r_2,est - r_1,est
a_3 is r_1,est + r_2,est
the sums over m run from 1 to N

r_1,est and r_2,est are nearly identical (substitute either 1 or 2 for 0 in the above equation and 0 for either 1 or 2 in a_1, a_2, a_3)

This gives us three nonlinear equations in three variables, which I've given up trying to solve to find out whether r_k,est is a maximum or minimum.
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
The Friar
Wakka
Posts: 159
Joined: Fri Jul 10, 2009 2:39 pm

Re: S-value revision

Post by The Friar »

cvdwightw wrote:I went ahead and tried to do this
Wow! Thanks, man.
cvdwightw wrote: and the MLE is rather ugly.
I didn't suspect it would be elegant. In practice, likelihood functions of non-trivial models rarely are. Still, apply the Newton-Raphson algorithm and the computer will solve it in no time flat compared to the length of time it would need to get a good sample of draws out of an MCMC setup.
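For instance, a sketch with scipy's Newton-type solver on a made-up stand-in system (not Dwight's actual equations):
[code]
import numpy as np
from scipy.optimize import fsolve

def equations(v):
    x, y, z = v  # stand-ins for r_0,est, r_1,est, r_2,est
    return [x + np.log(np.exp(y) + np.exp(z)) - 1.0,
            y + np.log(np.exp(x) + np.exp(z)) - 0.5,
            z + np.log(np.exp(x) + np.exp(y)) + 0.2]

root = fsolve(equations, np.zeros(3))
print(root, "residuals:", equations(root))  # residuals should be ~0
[/code]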

Now I fear my earlier post about DI/DII relative rating may have slipped under your radar.
Gordon Arsenoff
Rochester '06
WUStL '14 (really)

Developer of WUStL Updates Statistics Live!
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Re: S-value revision

Post by cvdwightw »

A few things I thought of on my way back:
The Friar wrote:* I assume you're talking about the Spearman (rank order) rather than Pearson correlation when you mention R^2; if not, you should be. It is inappropriate to examine the correlation between cardinal values on one scale and ranks on another, since the ranks might have come from cardinal (known or latent) parameters that naturally live on any range, not necessarily the same one. Rather, to use a coefficient of correlation as a metric, what should be compared are the finish order at ICT and the ordinal ranks of teams on these scales. Again, ultimately, ordinal rank is what we would deploy in order to issue invitations anyhow. (I suspect, for instance, that the correlation of teams' ranks on your metric with their finishes at ICT would be higher than the R^2s you reported if the reported ones are based on the cardinal value on the rating scale.)
You are right that I reported the wrong correlation. Assuming I did it right, the Spearman for the 2009 non-Penn data is actually 0.65, less than the reported R^2 in all circumstances. A Pearson for winning pct. vs ASV got up to 0.81 but I'm not sure that will hold for other years.

Looking back through my data prompted the question, why are we looking at ICT finish in the first place? Who decided that the S-Value needs to be a predictor of ICT success? Because of lineup changes and whatnot, it is not uncommon that the (n+1)th team from one ICT bracket is better than the nth-place team from a different bracket, but ends up lower in the ICT standings. I suspect that within-bracket finish is much more likely to correlate than overall finish, but with only 10-12 teams each year that keep a roughly consistent SCT/ICT lineup, it's unlikely such a correlation will actually tell us anything.

Lastly, while the model is very good, there are some things that make it a bit difficult to work with: first, while doing stuff with the data can be cut down to a matter of minutes, the real time sink is going to be collecting and inputting the data; after all, ~20 tossups a game, for what I will conservatively estimate as 30 games a sectional, across at least 8 sectionals, comes to 4800+ bits of data that need to be, essentially, hand-entered as to which team (if any) scored the question and how many bonus points (if any) were awarded. I do not doubt the increased power of a question-level model, but it is unknown whether NAQT wants to take the time to input that data (I suspect the answer is yes, but NAQT would have to approve doing such a thing). Second, this model would work very well if NAQT gave up the clock, but I think it may not be feasible for NAQT to put a randomized packet in each room and a randomized packet order for each sectional, and it is certainly not feasible if the restriction is that each room reads the set of 360/390 tossups in a completely different order. I would absolutely love to see this model perform on a set of real ACF data, since with ACF the packets used are essentially random across sites (there is no a priori reason that a certain packet will be used more often across all sites than another), there is no clock, and there are no powers or wonky bonus formats.
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
setht
Auron
Posts: 1205
Joined: Mon Oct 18, 2004 2:41 pm
Location: Columbus, Ohio

Re: S-value revision

Post by setht »

This is impressive-looking stuff, Gordon, but I'm not sure I'm sold on FRIAR (at least not yet). It's very likely I haven't fully understood the advantages of FRIAR, but from reading through the pdf it's not clear to me that there is actually that much difference between the quality of FRIAR's r-value and the quality of simpler measures like tossup points per tossup heard or bonus conversion. You're obviously trying to do much more at the level of individual question data than other models, but I'm not sure that data is all that great to begin with (on a single-question level), and I'm not sure the assumptions going into the model are close enough to reality that the extra effort in modeling is rewarded. Personally, I would rather use a simpler model that doesn't make possibly-dangerous assumptions about the nature of the data, unless a more complicated model can show a significant improvement in predictive power using real data--and it's not even clear how much predictive power any model can ever hope to have, as Dwight has pointed out. In addition, using FRIAR would require that every scoresheet be sent in to NAQT for tabulation following SCT. This isn't necessarily an impossible feat, but I am again a bit dubious that the work is worth the effort.

Here are some more things I find troubling about FRIAR:

-assigning difficulty ratings to individual tossups and bonuses using conversion statistics. The conversion statistics may be based on a small number of data points, for questions (especially bonuses) that appear late in rounds, appear in rounds that many sites don't use, or (in the case of tossups) have hard answers that many teams can't pull even on the giveaway. In some sense the conversion data could be used as a definition of how hard a question was for the field that played it, but taking that data further than a post-tournament review of question difficulty by plugging it back into the model to determine knowledge ratings of each team seems iffy to me.

-no good way to deal with negs (or rather, the effects of negs on opponent tossup conversion and question difficulty ratings). It's true that many games don't feature tons of negs, but what happens to a team playing against a latter-day Nathan Freeburg? What happens to all of the teams in that sectional?

-assigning a single r-value meant to quantify how well a team will do on tossups and on bonuses. I think it's fair to assume that tossup performance (independent of opponent strength) and bonus performance are correlated, but I think some of the simpler models that look at TPTH and bonus conversion separately may have the edge on this aspect.

-assigning bonus difficulty ratings on the assumption that teams don't ever do silly things like miss the easy part and score 20 on the medium and hard parts. If I've understood this point of the model correctly, the assumption is that every 10- or 20-point conversion represents the same (or nearly the same) level of knowledge--teams only 10 a bonus if they know enough to know the easy part but not enough to know the medium part, for instance. I don't know how important this assumption is to modeling bonus difficulty, and I think it's not a horrible assumption, but I know from personal experience that teams occasionally manage to miss easy (or medium) parts of bonuses that they should know, or randomly guess hard parts of bonuses.

-it's probably too complex for most players to wrap their heads around. To a large extent I don't really care about this point--it should be clear to everyone that the best playing strategy is to score as many points as possible on each question, which is what I really care about--but it might be nice if people could actually understand how things work in the new, open source S-value algorithm.


I think my main concerns are the ones I mentioned in my first paragraph. If FRIAR can be tested on real data and proves itself superior to simpler models, it will be clear that the assumptions it uses are close enough, and then we can think about whether its improvement in predictive performance is worth the effort. Since there is no real data available for use in testing FRIAR at the moment, I think NAQT should use a simpler S-value algorithm for at least this next year. If people think FRIAR is worth investigating further, NAQT should ask SCT hosts to send in all scoresheets. In the meantime, testing FRIAR on simulated tournament data (as Gordon has done) seems like a fine step; I'd also like to see simpler models tested on the same set of simulated data to see how different the "post-SCT" rankings are.

-Seth
Seth Teitler
Formerly UC Berkeley and U. Chicago
President of NAQT
Emeritus member of ACF
The Friar
Wakka
Posts: 159
Joined: Fri Jul 10, 2009 2:39 pm

Re: S-value revision

Post by The Friar »

cvdwightw wrote:You are right that I reported the wrong correlation. Assuming I did it right, the Spearman for the 2009 non-Penn data is actually 0.65, less than the reported R^2 in all circumstances. A Pearson for winning pct. vs ASV got up to 0.81 but I'm not sure that will hold for other years.
That actually really surprises me and I wonder what would cause you to have a lower Spearman than Pearson correlation, but it does suggest that the ASV, at least, isn't particularly hard to interpret numerically.

Y'know, it occurred to me as early as last night -- ASV has room for a tuning constant entirely analogous to the "specificity parameter" (read: "fudge factor") I brought up in the Rules of the Game part of the paper. Here's where:
cvdwightw wrote:4. Convert from the RTSC to the normalized tossup score (NTSC), which is the number of standard deviations away from the mean RTSC.

5. Convert bonus conversion to the normalized bonus score (NBSC), which is the number of standard deviations away from the mean BC.

6. Compute the cumulative standard deviations (CSD) = NTSC + NBSC.
Why does one standard deviation on bonuses have to equal one on tossups? Perhaps being better than average at getting tossups and the same amount worse than average at getting bonuses doesn't add up to a team being exactly average at winning quizbowl games. If the goal actually is to best predict success at the game level, I'd suggest generalizing that to CSD = k*NTSC + (1-k)*NBSC, where k is just a constant between 0 and 1 chosen to maximize the appropriate correlation.
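Tuning such a k is mechanical -- a grid search over [0, 1] against the rank correlation. A sketch, with random placeholders standing in for the real NTSC/NBSC columns and ICT finishes:
[code]
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
ntsc = rng.normal(size=30)                          # placeholder NTSC column
nbsc = 0.6 * ntsc + rng.normal(scale=0.8, size=30)  # placeholder NBSC column
latent = 0.7 * ntsc + 0.3 * nbsc + rng.normal(scale=0.3, size=30)
ict_finish = np.argsort(np.argsort(-latent)) + 1    # 1 = best finish

def fit_quality(k):
    csd = k * ntsc + (1 - k) * nbsc
    return spearmanr(csd, -ict_finish)[0]   # negate: larger = better finish

best_k = max(np.linspace(0, 1, 101), key=fit_quality)
print(best_k, fit_quality(best_k))
[/code]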
cvdwightw wrote:Looking back through my data prompted the question, why are we looking at ICT finish in the first place? Who decided that the S-Value needs to be a predictor of ICT success? Because of lineup changes and whatnot, it is not uncommon that the (n+1)th team from one ICT bracket is better than the nth-place team from a different bracket, but ends up lower in the ICT standings. I suspect that within-bracket finish is much more likely to correlate than overall finish, but with only 10-12 teams each year that keep a roughly consistent SCT/ICT lineup, it's unlikely such a correlation will actually tell us anything.
Well, it's kind of a natural way to proceed from a model design perspective -- these are training data and those are test data. These and those happen to be given to us as two nicely separated groups -- SCTs and a corresponding ICT. We have to find some function of the test data that we'd like the model to maximize, and order of finish is a really intuitive one. Still, we have a much bigger set of training than test data if we use this approach, and the independent variables in what test data we do have are from a different-looking underlying population than the training ones -- only the best teams on the ability scale we estimated, and also harder questions.

That said, I'm starting to drift away from the idea of ICT performance as a yardstick, and especially the idea of a tuning constant in the model (whatever model it is) to optimize against that yardstick for other reasons. For more, see my next post.
cvdwightw wrote:Lastly, while the model is very good, there are some things that make it a bit difficult to work with: first, while doing stuff with the data can be cut down to a matter of minutes, the real time sink is going to be collecting and inputting the data; after all, ~20 tossups a game, for what I will conservatively estimate as 30 games a sectional, across at least 8 sectionals, comes to 4800+ bits of data that need to be, essentially, hand-entered as to which team (if any) scored the question and how many bonus points (if any) were awarded. I do not doubt the increased power of a question-level model, but it is unknown whether NAQT wants to take the time to input that data (I suspect the answer is yes, but NAQT would have to approve doing such a thing).
I will gladly spend a working week of my life coding scoresheets for NAQT in order to pull off this model. While I would even more gladly do it if I were compensated for my time, if NAQT came to me of a March morning and said, "Will you do this for free?", they'd have it done at the end of my spring break.
cvdwightw wrote: Second, this model would work very well if NAQT gave up the clock, but I think it may not be feasible for NAQT to put a randomized packet in each room and a randomized packet order for each sectional, and it is certainly not feasible if the restriction is that each room reads the set of 360/390 tossups in a completely different order.
How difficult the randomization is really depends mostly on what format the questions are stored in. If they are written up as, say, individual plain text files, and only formatted for printing at the end, whether using LaTeX or Word or whatever, a little shell script would accomplish the randomization of questions within packets very fast. It only gets tough if questions begin and end life in a proprietary word-processor format and may never leave. Randomization of packets within the set would be even faster.
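Something like this would do it (Python rather than shell, and the directory layout is hypothetical):
[code]
import random
from pathlib import Path

# One directory per packet, one plain-text file per question. Write each
# packet back out with its questions in a freshly shuffled order.
for packet_dir in sorted(Path("packets").iterdir()):
    questions = list(packet_dir.glob("tossup_*.txt"))
    random.shuffle(questions)
    text = "\n\n".join(q.read_text() for q in questions)
    Path(f"{packet_dir.name}_shuffled.txt").write_text(text)
[/code]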
cvdwightw wrote:I would absolutely love to see this model perform on a set of real ACF data, since with ACF the packets used are essentially random across sites (there is no a priori reason that a certain packet will be used more often across all sites than another), there is no clock, and there are no powers or wonky bonus formats.
How good is ACF about preserving their scoresheets? I would imagine that, till now, they would have been even less likely to do it than NAQT. But I'd love to run some ACF data through this model, too.
Gordon Arsenoff
Rochester '06
WUStL '14 (really)

Developer of WUStL Updates Statistics Live!
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Re: S-value revision

Post by cvdwightw »

The Friar wrote:
cvdwightw wrote:You are right that I reported the wrong correlation. Assuming I did it right, the Spearman for the 2009 non-Penn data is actually 0.65, less than the reported R^2 in all circumstances. A Pearson for winning pct. vs ASV got up to 0.81 but I'm not sure that will hold for other years.
That actually really surprises me and I wonder what would cause you to have a lower Spearman than Pearson correlation, but it does suggest that the ASV, at least, isn't particularly hard to interpret numerically.
In short, two things: first, the prelim bracket of death contained 3 of my 11 data points; second, 4 of the 11 data points were clustered at 35-38 in the overall ASV rankings and another 5 were clustered between 8 and 13. The fact that the bracket of death was screwing up my correlations was what prompted me to reconsider trying to correlate with ICT finish in the first place.

In general, getting the extra tossup is far more worthwhile than the extra bonus part. A tossup is worth, on average, TP+BCY+BCO, where TP is tossup points, BCY is your BC, and BCO is your opponent's bonus conversion (actually, this underestimates tossup worth); an extra bonus part is worth 10 points.

I thus used the idea that tossup weight is (TPTHadj,avg+2*BC,avg)/(TPTHadj,avg+3*BC,avg) and bonus weight was (BC,avg)/(TPTHadj,avg+3*BC,avg). This got me up to a Spearman correlation of 0.7 on the 2009 data, but I can't get much higher (interestingly, computing an incorrect Spearman - overall order of finish at ICT vs overall order of finish in rankings - yields an R^2 of 0.79).
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
The Friar
Wakka
Posts: 159
Joined: Fri Jul 10, 2009 2:39 pm

Re: S-value revision

Post by The Friar »

Thanks for the feedback, Seth! There is quite a lot to chew on here.
setht wrote:it's not clear to me that there is actually that much difference between the quality of FRIAR's r-value and the quality of simpler measures like tossup points per tossup heard or bonus conversion
My explication must have allowed some confusion about which is an input and which an output, then. The r-value is a parameter, an output; TPTH and BC are dependent variables, which are inputs. The inputs of FRIAR are TPTH (almost; negs don't count) and PPB, except that they are taken over the individual question (N=1) rather than averaged over a team's entire schedule (N ~ 330). The output, for FRIAR and for all the models proposed and for the existing S-Value, is a function of TPTH and PPB. Only if raw TPTH or BC were the metric on which teams were ranked in some proposed system would it make sense to hold composite rankings up against them in terms of simplicity.
setht wrote: I'm not sure that data is all that great to begin with (on a single-question level)
Well, the data are all initially recorded as individual question outcomes. If there are problems in them, they don't go away when you add them up and divide by N. Any model that uses all the available data, whether they have been aggregated into rate statistics first or not, averages over whatever problems there are in the whole set. Do you mean, by the way, that the data are not particularly clean at the question level? That is, do you think there's a significant amount of misrecording? If so, as I said, that doesn't mean it's better to start with aggregate variables, but if you're really worried about it, maybe NAQT should ask whether it's responsible to use those data to determine the winners of games. If that's not what you meant, then I am out of ideas with respect to what you did mean, but what I said applies to pretty much anything you could have.
setht wrote:I'm not sure the assumptions going into the model are close enough to reality that the extra effort in modeling is rewarded. Personally, I would rather use a simpler model that doesn't make possibly-dangerous assumptions about the nature of the data
Any given simpler model will not make as many possibly-dangerous explicit assumptions about the nature of the data. See Sussman attains enlightenment. The simpler model might not make certain assumptions, or it might, but we just haven't proved they're there and we may never know what they are. The more detailed model, on the other hand, allows us at least to identify which assumption is probably causing trouble if the model doesn't work, and fix it. For example, someone could actually come up with a theory of negs that works in this framework, and that would fix one assumption I make.
setht wrote:unless a more complicated model can show a significant improvement in predictive power using real data--and it's not even clear how much predictive power any model can ever hope to have, as Dwight has pointed out.
I agree with you up to a point, although less than I would have a couple days ago, before mulling it over some more. Predictive power with regard to some function of the test data is the gold standard for yardsticks (there's a confusing thought, though I guess at one point there actually was a platinum-iridium standard for meter sticks), but the question is what that function should be in our case. For reasons I'll outline below (CLIFFHANGER WARNING), I am no longer convinced that that has to be order of finish, and the appropriate answer might not even be well-defined.
setht wrote:In addition, using FRIAR would require that every scoresheet be sent in to NAQT for tabulation following SCT. This isn't necessarily an impossible feat, but I am again a bit dubious that the work is worth the effort.
The number of data points is equal to R_bar * Q_bar * T_bar/2 * S, where R_bar is the average number of rounds read at each site (let's plug in 15), Q_bar is the average number of questions heard per round (~22), T_bar the average number of teams hearing each round at each site (may we hope for 20?), and S is the number of sites (~10). Thus, we might expect an SCT to generate 33,000 data points, and I'd guess that's on the high end. If it takes one second to code each one, that's 550 minutes, or just over a full work day. If it takes five seconds (as a former administrative assistant, this seems more reasonable, maybe even generous), it's a full work week.
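The arithmetic, for anyone who wants to check it:
[code]
R_bar, Q_bar, T_bar, S = 15, 22, 20, 10
points = R_bar * Q_bar * (T_bar // 2) * S
print(points)             # 33000 data points
print(points * 1 / 3600)  # ~9.2 hours at 1 second each: a long work day
print(points * 5 / 3600)  # ~45.8 hours at 5 seconds each: a full work week
[/code]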

As I've said above, I'd be entirely too happy to spend that much time coding data for NAQT, even for free. This is one thing the professor(s) employing me at any given time tend to be grateful for as well: I will code your butt some data. In the absolute worst-case scenario, where you're not willing to let me take it on for integrity reasons or whatever and no one else is willing to do it themselves, it's still perfectly tractable to get it done by hiring a temp worker.

Yes, hiring. When I was a temp back in 2006-07, I made $15.00 an hour, and this kind of data entry was the least technical and least strenuous of the duties I pulled. In this economy, and in Minneapolis rather than the DC metro area, I expect rates would be even less, but let's stick with $15.00 for now. $15.00 per hour over a 40-hour week is $600.00. That's two people's travel to and from HSNCT. So hire a body for a week to get data entry taken care of if you have to. Heck, hire three for reliability's sake. Then charge HSNCT volunteers about a 2.5% token co-pay on their travel and you'll probably come out ahead.

The other solution would be to have all this done in-game by replacing paper scoresheets with a computer program for entering stats in real time. Between TAFT, BEeS, and the other proposals people have floated for next-generation stats collection, I'm pretty sure it would be no challenge to find someone who would implement this; the only obstacle would be making sure the computing machinery was available. Even if only some SCT data was prepared on the fly this way, though, it would cut deeply into the time or money one would have to spend on the back end to code scoresheets.
setht wrote:Here are some more things I find troubling about FRIAR:

-assigning difficulty ratings to individual tossups and bonuses using conversion statistics. The conversion statistics may be based on a small number of data points, for questions (especially bonuses) that appear late in rounds, appear in rounds that many sites don't use, or (in the case of tossups) have hard answers that many teams can't pull even on the giveaway. In some sense the conversion data could be used as a definition of how hard a question was for the field that played it, but taking that data further than a post-tournament review of question difficulty by plugging it back into the model to determine knowledge ratings of each team seems iffy to me.
I remind you that there is no "plugging back in"; all parameters are estimated simultaneously. That said, the model uses every data point it has equally. The fewer data points there are about a given question, the broader will be the distribution of possible values for its difficulty parameter, and, as a consequence, the less information it will contribute to -- the less weight it will carry in -- the estimation of ability parameters for teams that play on it. This is a natural consequence of MLE (where, in the following, MLE will always be taken to mean "MLE or Bayesian estimation"). The likelihood function of the whole model is the product of the likelihoods of the individual question outcomes given the parameter values, and each outcome's likelihood constrains only the parameters that bear on it. Thus, a question whose parameter values are less certain shapes the overall likelihood less than one whose parameter values are more certain.
setht wrote:no good way to deal with negs (or rather, the effects of negs on opponent tossup conversion and question difficulty ratings). It's true that many games don't feature tons of negs, but what happens to a team playing against a latter-day Nathan Freeburg? What happens to all of the teams in that sectional?
Yep, negs are the big missing thing in this model. Guess what? They're missing from other models, too. In a model that uses only aggregate TPTH and BC, negs only show up as -5/N TPTH. That's all they do. To juice them for all the information they contain, they must at least constitute an independent variable of some kind.

As I outlined above, there are countervailing effects on the estimation of a team's ability when the opponent negs, as long as the model includes powers. Thus, I don't believe there will be much difference at all in their r-values between cases where they play conservative or aggressive teams. There will be some distortion of question difficulty values, which might in turn push abilities around, but (a) it will do so in the same way across the entire field in which a question that induces a neg is heard, and (b) we're talking about a third-order effect on ability ratings, and thus one that is going to be heavily damped, to the point of being pretty well averaged out.
setht wrote:assigning a single r-value meant to quantify how well a team will do on tossups and on bonuses. I think it's fair to assume that tossup performance (independent of opponent strength) and bonus performance are correlated, but I think some of the simpler models that look at TPTH and bonus conversion separately may have the edge on this aspect.
If a model was actually different from FRIAR on this aspect, it would be useless as a replacement for S-Value, because that would mean having multiple ratings for each team, not a single rating on which they could be compared. Ultimately nothing looks at tossup and bonus stats completely separately; at some point all the models combine them (even Greg's, because wins are a composite function of both). The only difference is how.

(By the way, I'm headed back toward the cliffhanger I left you with earlier.)

Any of these models could be very straightforwardly optimized by adjusting the constant that weights the ultimate measure of tossup information and the ultimate measure of bonus information relative to each other so that the (rank-order) correlation of rating and winningness is greatest. In FRIAR, that's the specificity parameters on the Boltzmann factors for probability of powering and of correctly answering each tossup. In ASV, it's k where CSD = k*NTSC + (1-k)*NBSC.

However, if it's known how the two are weighted, and the mix of skills needed to perform best on tossups and on bonuses differs at all (which I agree with you that it does), then, in principle, it's known what the best mix of skills is to acquire in order to maximize your expected (improved) S-Value.

Tossup and bonus performance will correlate, again, to the extent that tossups and bonuses reward the same skill set. I submit to you that the main skill that tossups and bonuses reward in the same way is knowledge. Good teamwork, buzzer speed, lateral thinking (if any), knowledge of your teammates -- all these things bear on one or the other class of questions much more heavily than the other. If we are measuring a single common component of ability to answer tossups and ability to answer bonuses, that component is knowledge. I submit that r-value is a measure specifically of the component of quizbowl ability comprised of knowledge. Thus, the message to teams getting better at quizbowl is: if you want your r-value to go up, know more.

Look, all of the systems proposed have the property that, during the game, the best thing to do is always to play to the best of your ability. A correct answer on a tossup is always better than any alternative, and the same for a correct answer on a bonus. However, a system tuned to rate teams along a dimension as close as possible to win probability encourages preparation along the same dimension, to improve the same mix of skills. On the other hand, a system that rewards exclusively or disproportionately the knowledge component of quizbowl skill, as FRIAR does, encourages teams to improve themselves along a dimension closer to that of pure knowledge.

So we're down to a debate about the philosophy of quizbowl. Do we want players to make themselves better at the game for the game's sake, or, ultimately, do we want them to be motivated to know more? I don't know where in that philosophy debate you come down, but I tend to hear a lot more voices on this board arguing that good quizbowl promotes knowledge as much as possible than I do taking any other position. I personally kind of began life on the side of "quizbowl is about winning quizbowl" and have come to be mostly on the "quizbowl is about knowledge" side in these latter days.

That ultimately is the reason why I've talked myself into being skeptical of making finish at ICT (or at SCT, for that matter) the yardstick against which NAQT is to hold the improved S-Value system it finally adopts. The right yardstick would be a measure of knowledge that could be derived from other data. Unfortunately, that kind of measure is a snipe we'd be advised not to bother hunting. I'm therefore coming down on the side of letting theoretical concerns and provision of the characteristics outlined in the press release dictate the choice of a measure more than correlation with tournament outcomes.
setht wrote:assigning bonus difficulty ratings on the assumption that teams don't ever do silly things like miss the easy part and score 20 on the medium and hard parts. If I've understood this point of the model correctly, the assumption is that every 10- or 20-point conversion represents the same (or nearly the same) level of knowledge--teams only 10 a bonus if they know enough to know the easy part but not enough to know the medium part, for instance. I don't know how important this assumption is to modeling bonus difficulty, and I think it's not a horrible assumption, but I know from personal experience that teams occasionally manage to miss easy (or medium) parts of bonuses that they should know, or randomly guess hard parts of bonuses.
This is actually not assumed. All that is assumed is that, on average, a team with more knowledge is more likely to get at least any given number of points on each bonus. How they get there is, for FRIAR's purposes, their business. (This would be assumed only if we modeled bonus parts separately from each other, which I've already recommended against doing because some formats used in NAQT can't be divided that way, and we imposed a restriction on the estimated probability of getting each part that explicitly designated one as the easy part, one the middle, and one the hard.)
setht wrote:it's probably too complex for most players to wrap their heads around. [...] it might be nice if people could actually understand how things work in the new, open source S-value algorithm.
This is the point I'm going to get most defensive about.

First, FRIAR Really Isn't Anything Revolutionary. It is not mind-bending to suggest that "probability you get the question right increases with your ability and decreases with your opponent's ability, if applicable, and the difficulty of the question according to this particular functional form." It's not tough to get used to "we choose log odds as the functional form because it takes the interval of probability [0,1] and maps it to the whole real line from minus to plus infinity, and when we map it that way, the log odds of success change by exactly one unit for a 1-unit change in ability or difficulty no matter where on the scale the parameters are". It's certainly conceptually straightforward to state, "we take the whole pattern of data from all the tournaments and then search for the parameter values at which those equations state we'd be most likely to see exactly these data as opposed to any other." All that is left to be insisted on is either an explanation of how MLE works, at which point a reference to Wikipedia is appropriate, or an actual derivation of the logit form from first principles, in which case the best thing to do is refer people to how many other contexts it comes up in (thermodynamics! chess! the computer-adaptive SAT/GRE!).
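In code, the whole functional form is two lines, and the one-unit property drops right out:
[code]
import numpy as np

def p_correct(r, d):        # ability r, difficulty d
    return 1 / (1 + np.exp(-(r - d)))

def log_odds(p):
    return np.log(p / (1 - p))

r, d = 1.0, 0.3
# Raising ability by 1 raises the log odds by exactly 1, wherever you start.
print(log_odds(p_correct(r + 1, d)) - log_odds(p_correct(r, d)))  # 1.0
[/code]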

FRIAR isn't any more complicated conceptually than Greg's model, because it is Greg's model, at a higher level of detail. Actually, it's simpler, because it doesn't need to spend an auxiliary step assigning virtual wins. It has a more elegant form and a shorter path connecting variables to ratings than ASV, as long as it is OK to tell people "then we let a computer search for the ability and difficulty levels where that pattern of data was most likely to have happened." Based on how many other facets of our lives call for us to implicitly trust that, I object to the notion that it isn't OK to do that.

Second, I believe it was in deep wisdom that NAQT placed mathematical simplicity only as high as "nice to have" on the list of desiderata for the new system. FRIAR makes the most complete use of data of anything proposed and solves major problems no other proposed system has attempted to. There is an old saying in computer science: "Good; cheap; quick: pick two." I'm pretty sure FRIAR has picked Good and asks for Cheap and Quick to be traded off against each other based on NAQT's preferences about spending money on temp workers or netbooks (or smartphones, or even just making sure host sites have accessible computers) in game rooms. Relative to the other choices, it may have chosen Double Good. Philosophically, I believe it would be a disservice to the teams competing for NAQT to be willing to sacrifice much at all of Good in order to get Cheap or Quick, which is what mathematical or computational simplicity will buy you.
setht wrote:It should be clear to everyone that the best playing strategy is to score as many points as possible on each question, which is what I really care about
Bueno. Here is the proof that giving the right answer is always a dominant strategy.

There can be said to be two possible types of strategies in the game in which players attempt to maximize their r-values: try your hardest to get questions right and thereby increase your r-value directly, and any mixture of that and trying not to answer in the hope that you'll inflate the difficulty level of the question and the ability of the opponent and that that will pull your ability rating upward. However, if anyone is playing a strategy of not answering questions in order to increase their difficulty ratings, the optimal mixture of strategies for another team playing the same questions shifts, if anything, toward more effort to answer the questions correctly: the true difficulty of the questions has not changed, but the reward for getting them right has increased because their estimated difficulties have gone up. Other teams may therefore free-ride on someone's choice to do other than try their hardest to get all the questions right. Thus, such a strategy is not a best response to itself, and in a symmetric game of "let's maximize our r-value", it cannot be part of a Nash equilibrium of that game. Since every team is playing some mix of the "let's maximize our r-value" game and the "let's finish highest in the tournament" game, and the second type of strategy is not a best response to any strategy of the opponent in the latter, either, it must be dominated by the first type of strategy no matter what the team's exact objective is.

The category missing from the above is negs. Given a belief that you absolutely cannot get a tossup, is it ever better to neg than not to answer? No. Negging makes it more likely that your opponent will get the tossup for 10 and less likely that they will power it. On average, this will lower the opponent's ability rating, which can only make the negging team look worse, since a neg and a no answer each leave the opponent with the same total probability of getting the tossup. Meanwhile, it will raise the apparent bar for power. At best this is expected to be a wash, but if one of the effects is greater, it should be the downward effect on the opponent's r-value. Therefore, negging instead of sitting can be at best a weak best response in the "maximize r-value" game, and is obviously dominated in the regular game; thus, in any mixture of the two, and in particular any reasonable mixture, it is dominated.

Actually, it looks like I didn't give FRIAR enough credit before for resistance to gaming of one's own rating through untoward play. All gaming of one's own rating must apparently be through failure to play. Anyhow, a proof of this type should be constructible for essentially any rating system. The construction of one covering ASV is left as an exercise to the reader (it is most strongly expected to exist, since a proof of another kind has already been found).
setht wrote:If people think FRIAR is worth investigating further, NAQT should ask SCT hosts to send in all scoresheets. In the meantime, testing FRIAR on simulated tournament data (as Gordon has done) seems like a fine step; I'd also like to see simpler models tested on the same set of simulated data to see how different the "post-SCT" rankings are.
What I will work on when I do this (I must begin the semester soon, and will have a little less time to devote to this project) is coming up with a richer, more nuanced simulation specification than FRIAR itself, to generate some data with noise in them both of the type FRIAR expects to pick up and of the types it doesn't. This would include specifying different team ability parameters for different types of questions (tossup, power, neg, bonus) and different subjects, incorporating the NAQT distribution, and modeling the clock. (Those things are much easier to claim to do in a simulation than to claim to capture in a statistical model, because we get less and less traction on each of those parameters as we try to divine more of them from the same amount of data, but when we simulate, we may fix those parameters with arbitrary precision.) I'll generate several data sets with different distributions of those parameters, and hopefully get expert eyeballs to look at the summary stats of each one and provide some opinions on which ones look most like real SCTs.
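As a taste of one such ingredient (all category names and numbers invented): per-subject team abilities, so that relative strength shifts with each tossup's category.
[code]
import numpy as np

rng = np.random.default_rng(2)
subjects = ["lit", "sci", "hist"]
abil = {"A": {"lit": 1.2, "sci": -0.3, "hist": 0.4},
        "B": {"lit": 0.1, "sci": 0.9, "hist": 0.5}}

def play_tossup(subj, diff):
    # Each team independently "knows" the answer with logit probability; if
    # both know it, a coin flip stands in for buzzer-race dynamics.
    knows = [t for t in abil
             if rng.random() < 1 / (1 + np.exp(-(abil[t][subj] - diff)))]
    return rng.choice(knows) if knows else None  # None = dead tossup

results = [play_tossup(rng.choice(subjects), rng.normal()) for _ in range(20)]
print(results)
[/code]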

If you have read this far, thanks for your attention! I greatly apologize if I have come off as hostile or offensive, especially if I have put myself in that position by presenting something to you that it was really my duty to explain more fully the first time. Now, if you'll excuse me -- this post has been my day's work. I am off to acquire some real [not simulated] dinner.
Last edited by The Friar on Tue Aug 18, 2009 10:42 pm, edited 5 times in total.
Gordon Arsenoff
Rochester '06
WUStL '14 (really)

Developer of WUStL Updates Statistics Live!
User avatar
grapesmoker
Sin
Posts: 6345
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC
Contact:

Re: S-value revision

Post by grapesmoker »

Gordon, I'm going to try and collect the data from this year's EFT and give it to you so you can run your program on it. It's not quite an SCT but I'm interested in the results.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuisance
User avatar
The Friar
Wakka
Posts: 159
Joined: Fri Jul 10, 2009 2:39 pm

Re: S-value revision

Post by The Friar »

Cool, Jerry! Thanks.

BTW, to diminish the extent to which I come off as rude to Seth -- "If you have read this far" is directed at everyone, because I wrote a whole lot; I clearly already had Seth's attention and am grateful.
Gordon Arsenoff
Rochester '06
WUStL '14 (really)

Developer of WUStL Updates Statistics Live!
User avatar
setht
Auron
Posts: 1205
Joined: Mon Oct 18, 2004 2:41 pm
Location: Columbus, Ohio

Re: S-value revision

Post by setht »

The Friar wrote:Thanks for the feedback, Seth! There is quite a lot to chew on here.
You're welcome for the feedback, and I apologize for doing a poor job of writing up my concerns in an intelligible fashion in my previous post. Rather than fumble my way through the details of what's going on in FRIAR and the various things I think can happen in quizbowl matches that might confound attempts to translate detailed question data into measures of team ability (or knowledge or whatever), I'll focus on the bigger picture of trying to decide between several proposed models.
The Friar wrote:Look, all of the systems proposed have the property that, during the game, the best thing to do is always to play to the best of your ability. A correct answer on a tossup is always better than any alternative, and the same for a correct answer on a bonus. However, a system tuned to rate teams along a dimension as close as possible to win probability encourages preparation along the same dimension, to improve the same mix of skills. On the other hand, a system that rewards exclusively or disproportionately the knowledge component of quizbowl skill, as FRIAR does, encourages teams to improve themselves along a dimension closer to that of pure knowledge.

So we're down to a debate about the philosophy of quizbowl. Do we want players to make themselves better at the game for the game's sake, or, ultimately, do we want them to be motivated to know more? I don't know where in that philosophy debate you come down, but I tend to hear a lot more voices on this board arguing that good quizbowl promotes knowledge as much as possible than I do taking any other position. I personally kind of began life on the side of "quizbowl is about winning quizbowl" and have come to be mostly on the "quizbowl is about knowledge" side in these latter days.
I think FRIAR and Dwight's system and probably the other proposed systems all primarily reward the knowledge component of quizbowl skill, as long as the questions do (and if the questions don't I don't think there's any reasonable system that can possibly hope to do so). If all these systems are set up so that players should play to the best of their ability on each question, I don't see how we can use this aspect of model construction to select one system over another. If FRIAR or Dwight's system or any other system decides to give different weights to measures of tossup and bonus performance, I don't think that will change anything--if tossups and bonuses are both written to reward knowledge, the best approach to making oneself a better player is to learn things, regardless of what weights are used. I don't see what players would do differently in order to prepare for a system that gives equal weight to tossup and bonus performance versus a system that gives more weight to bonus performance, or how either of these preparation schemes would differ from "learn stuff so you can answer quizbowl questions."

In general, I think the job of promoting knowledge resides with the question set. If the questions do a good job of distinguishing (and advantaging) several levels of knowledge actually present in the playing field, then I think we can safely build bid selection schemes off of question performance without having to worry that the schemes will promote weird preparation routines. I think in that case we can also take order of finish within a large, common field as a worthwhile yardstick for evaluating systems.
The Friar wrote:That ultimately is the reason why I've talked myself into being skeptical of making finish at ICT (or at SCT, for that matter) the yardstick against which NAQT is to hold the improved S-Value system it finally adopts. The right yardstick would be a measure of knowledge that could be derived from other data. Unfortunately, that kind of measure is a snipe we'd be advised not to bother hunting. I'm therefore coming down on the side of letting theoretical concerns and provision of the characteristics outlined in the press release dictate the choice of a measure more than correlation with tournament outcomes.
Again, I think we should let the question writers/editors tackle the job of setting things up so the questions reward knowledge (to the extent that quizbowl ever rewards knowledge), then go ahead and look at things like tournament performance as acceptable proxies for relative levels of knowledge. I think precise order of finish at ICT is problematic, but what about order of finish at ICT within the group of teams that don't change composition significantly? That is, if there are 10 teams that didn't change the majority of their scoring line-up between SCT and ICT 2009, perhaps we should look at the ordering of those 10 teams at ICT without worrying about whether team A finished 2 or 12 spots ahead of team B. I guess we could also look at order-of-r-values and order-of-RSVs for teams at SCT and at ICT to see if one measure does a better job of maintaining relative orderings between the two tournaments.

-Seth
Seth Teitler
Formerly UC Berkeley and U. Chicago
President of NAQT
Emeritus member of ACF
User avatar
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA
Contact:

Re: S-value revision

Post by cvdwightw »

setht wrote:I think precise order of finish at ICT is problematic, but what about order of finish at ICT within the group of teams that don't change composition significantly? That is, if there are 10 teams that didn't change the majority of their scoring line-up between SCT and ICT 2009, perhaps we should look at the ordering of those 10 teams at ICT without worrying about whether team A finished 2 or 12 spots ahead of team B.
Seth,

This is exactly what I looked at in 2009 (it's the Spearman correlation coefficient with R^2 ~0.65). Just about everything that looks at something outside of these 11 teams in a "vacuum" yields a better correlation (NTSC vs. NBSC, ASV vs. win%, order-of-RSV vs. ICT order-of-finish including other teams, etc.). I have not bothered to compute the ICT RSVs for each team, but I suspect that might also prove an interesting thing to look at.
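For anyone who wants to reproduce this kind of check, here is a minimal sketch of the rank-correlation computation using SciPy's spearmanr; the ratings and finishes below are invented for illustration, not the actual 2009 numbers.

[code]
from scipy.stats import spearmanr

# Hypothetical (SCT rating, ICT finish) pairs for teams whose rosters
# stayed intact between the two tournaments; not real data.
sct_rating = [2.10, 1.85, 1.80, 1.55, 1.40, 1.22, 1.10, 0.95, 0.80, 0.60, 0.45]
ict_finish = [1, 3, 2, 5, 4, 8, 6, 7, 10, 9, 11]

rho, p_value = spearmanr(sct_rating, ict_finish)
# Finish is "lower is better," so a good rating should give rho near -1;
# squaring rho gives an R^2-style number comparable to the one quoted above.
print(f"Spearman rho = {rho:.3f} (p = {p_value:.3f}), rho^2 = {rho**2:.3f}")
[/code]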
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry