Thanks for the feedback, Seth! There is quite a lot to chew on here.
setht wrote:it's not clear to me that there is actually that much difference between the quality of FRIAR's r-value and the quality of simpler measures like tossup points per tossup heard or bonus conversion
My explanation must have left some confusion about which quantities are inputs and which are outputs, then. The r-value is a parameter, an output; TPTH and BC are dependent variables, which are inputs. The inputs of FRIAR are TPTH (almost; negs don't count) and PPB, except that they enter at the level of the individual question (N = 1) rather than averaged over a team's entire schedule (N ~ 330). The output, for FRIAR, for all the models proposed, and for the existing S-Value, is a function of TPTH and PPB. Only if raw TPTH or BC were the metric on which teams were ranked in some proposed system would it make sense to hold composite rankings up against them in terms of simplicity.
setht wrote: I'm not sure that data is all that great to begin with (on a single-question level)
Well, the data are all initially recorded as individual question outcomes. If there are problems in them, those problems don't go away when you add them up and divide by N. Any model that uses all the available data, whether they have been aggregated into rate statistics first or not, averages over whatever problems exist in the whole set. Do you mean, by the way, that the data are not particularly clean at the question level? That is, do you think there's a significant amount of misrecording? If so, as I said, that doesn't mean it's better to start with aggregate variables, but if you're really worried about it, maybe NAQT should ask whether it's responsible to use those data to determine the winners of games. If that's not what you meant, then I am out of ideas as to what you did mean, but what I said applies to pretty much anything you could have.
setht wrote:I'm not sure the assumptions going into the model are close enough to reality that the extra effort in modeling is rewarded. Personally, I would rather use a simpler model that doesn't make possibly-dangerous assumptions about the nature of the data
Any given simpler model will not make as many possibly-dangerous explicit assumptions about the nature of the data. See Sussman attains enlightenment. The simpler model might not make certain assumptions, or it might, but we just haven't proved they're there and we may never know what they are. The more detailed model, on the other hand, allows us at least to identify which assumption is probably causing trouble if the model doesn't work, and to fix it. For example, someone could actually come up with a theory of negs that works in this framework, and that would fix one assumption I make.
setht wrote:unless a more complicated model can show a significant improvement in predictive power using real data--and it's not even clear how much predictive power any model can ever hope to have, as Dwight has pointed out.
I agree with you up to a point, although less than I would have a couple of days ago, before mulling it over some more. Predictive power with regard to some function of the test data is the gold standard for yardsticks (there's a confusing thought, though I guess at one point there actually was a platinum-iridium standard for meter sticks), but the question is what that function should be in our case. For reasons I'll outline below (CLIFFHANGER WARNING), I am no longer convinced that it has to be order of finish, and the appropriate answer might not even be well-defined.
setht wrote:In addition, using FRIAR would require that every scoresheet be sent in to NAQT for tabulation following SCT. This isn't necessarily an impossible feat, but I am again a bit dubious that the work is worth the effort.
The number of data points is equal to R_bar * Q_bar * (T_bar/2) * S, where R_bar is the average number of rounds read at each site (let's plug in 15), Q_bar is the average number of questions heard per round (~22), T_bar is the average number of teams hearing each round at each site (may we hope for 20? the division by 2 is because each game pairs two teams), and S is the number of sites (~10). Thus, we might expect an SCT to generate 33,000 data points, and I'd guess that's on the high end. If it takes one second to code each one, that's 550 minutes, or just over a full work day. If it takes five seconds (which, speaking as a former administrative assistant, seems more reasonable, maybe even generous), it's a full work week.
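For anyone who wants to fiddle with those inputs, here's a quick back-of-the-envelope calculator; every default below is one of my guesses above, not an official NAQT figure.

[code]
# Back-of-the-envelope estimate of the SCT data-entry workload. All the
# defaults are my guesses from the paragraph above, not NAQT numbers.
def coding_workload(rounds=15, questions=22, teams=20, sites=10,
                    seconds_per_point=5):
    data_points = rounds * questions * (teams // 2) * sites
    hours = data_points * seconds_per_point / 3600
    return data_points, hours

points, hours = coding_workload()
print(points, "data points,", round(hours, 1), "hours to code")
# 33000 data points, 45.8 hours to code -- the full work week I mentioned
[/code]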
As I've said above, I'd be entirely too happy to spend that much time coding data for NAQT, even for free. This is one thing the professor(s) employing me at any given time tend to be grateful for as well: I will code your butt some data. In the absolute worst-case scenario, where you're not willing to let me take it on for integrity reasons or whatever and no one else is willing to do it themselves, it's still perfectly tractable to get it done by hiring a temp worker.
Yes, hiring. When I was a temp back in 2006-07, I made $15.00 an hour, and this kind of data entry was the least technical and least strenuous of the duties I pulled. In this economy, and in Minneapolis rather than the DC metro area, I expect rates would be even lower, but let's stick with $15.00 for now. $15.00 per hour over a 40-hour week is $600.00. That's two people's travel to and from HSNCT. So hire a body for a week to get data entry taken care of if you have to. Heck, hire three for reliability's sake. Then charge HSNCT volunteers about a 2.5% token co-pay on their travel and you'll probably come out ahead.
The other solution would be to have all this done in-game by replacing paper scoresheets with a computer program for entering stats in real time. Between TAFT, BEeS, and the other proposals people have floated for next-generation stats collection, I'm pretty sure it would be no challenge to find someone to implement this; the only obstacle would be making sure the computing machinery was available. Even if only some SCT data were prepared on the fly this way, though, it would cut deeply into the time or money one would have to spend on the back end to code scoresheets.
setht wrote:Here are some more things I find troubling about FRIAR:
-assigning difficulty ratings to individual tossups and bonuses using conversion statistics. The conversion statistics may be based on a small number of data points, for questions (especially bonuses) that appear late in rounds, appear in rounds that many sites don't use, or (in the case of tossups) have hard answers that many teams can't pull even on the giveaway. In some sense the conversion data could be used as a definition of how hard a question was for the field that played it, but taking that data further than a post-tournament review of question difficulty by plugging it back into the model to determine knowledge ratings of each team seems iffy to me.
I remind you that there is no "plugging back in"; all parameters are estimated simultaneously. That said, the model uses every data point it has equally. The fewer data points there are about a given question, the broader will be the distribution of possible values for its difficulty parameter, and, as a consequence, the less information it will contribute to -- the less weight it will carry in -- the estimation of ability parameters for the teams that play on it. This is a natural consequence of MLE (where, in the following, MLE will always be taken to mean "MLE or Bayesian estimation"). The likelihood function of the whole model is the product of the likelihoods of the individual question outcomes, and the likelihood contributed by each outcome is the probability of that outcome given the parameter values that bear on it. Thus, a question about which the data are sparse contributes a flat factor to the overall likelihood, and a flat factor shapes the location of the maximum much less than a sharply peaked one does.
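To make "estimated simultaneously" concrete, here is a minimal sketch of joint MLE under a bare-bones logit model, one ability per team and one difficulty per question. To be clear, this is a toy in the same family as FRIAR, not FRIAR itself; it has no powers, negs, bonuses, or opponents.

[code]
# Minimal sketch of simultaneous MLE, assuming a bare Rasch-style model:
# P(correct) = 1 / (1 + exp(-(ability - difficulty))).
import numpy as np
from scipy.optimize import minimize

# Toy outcomes: (team index, question index, answered correctly?).
outcomes = [(0, 0, 1), (1, 0, 0), (2, 0, 1),
            (0, 1, 0), (1, 1, 1), (2, 1, 0)]
n_teams, n_questions = 3, 2

def neg_log_likelihood(params):
    # Pin team 0's ability at 0 for identifiability: the logit form is
    # unchanged if you shift all abilities and difficulties together.
    ability = np.concatenate(([0.0], params[:n_teams - 1]))
    difficulty = params[n_teams - 1:]
    nll = 0.0
    for i, j, correct in outcomes:
        p = 1.0 / (1.0 + np.exp(-(ability[i] - difficulty[j])))
        nll -= np.log(p if correct else 1.0 - p)
    return nll

# One search moves every ability and every difficulty at once until the
# whole data set is as likely as it can be -- nothing is "plugged back in."
fit = minimize(neg_log_likelihood, np.zeros(n_teams - 1 + n_questions))
abilities = np.concatenate(([0.0], fit.x[:n_teams - 1]))
difficulties = fit.x[n_teams - 1:]
[/code]

A question that appears in only a handful of outcomes just contributes a flat factor to that product, exactly as described above.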
setht wrote:no good way to deal with negs (or rather, the effects of negs on opponent tossup conversion and question difficulty ratings). It's true that many games don't feature tons of negs, but what happens to a team playing against a latter-day Nathan Freeburg? What happens to all of the teams in that sectional?
Yep, negs are the big missing thing in this model. Guess what? They're missing from other models, too. In a model that uses only aggregate TPTH and BC, negs show up only as -5/N in TPTH. That's all they do. To juice them for all the information they contain, they must at least constitute an independent variable of some kind.
As I outlined above, there are countervailing effects on the estimation of a team's ability when the opponent negs, as long as the model includes powers. Thus, I don't believe there will be much difference at all in teams' r-values between cases where they play conservative opponents and cases where they play aggressive ones. There will be some distortion of question difficulty values, which might in turn push abilities around, but (a) it will act in the same way across the entire field in which a neg-inducing question is heard, and (b) we're talking about a third-order effect on ability ratings, one that is going to be heavily damped, to the point of being pretty well averaged out.
setht wrote:assigning a single r-value meant to quantify how well a team will do on tossups and on bonuses. I think it's fair to assume that tossup performance (independent of opponent strength) and bonus performance are correlated, but I think some of the simpler models that look at TPTH and bonus conversion separately may have the edge on this aspect.
If a model were actually different from FRIAR on this aspect, it would be useless as a replacement for the S-Value, because that would mean having multiple ratings for each team, not a single rating on which they could be compared. Ultimately, nothing looks at tossup and bonus stats completely separately; at some point every model combines them (even Greg's, because wins are a composite function of both). The only difference is how.
(By the way, I'm headed back toward the cliffhanger I left you with earlier.)
Any of these models could be very straightforwardly optimized by adjusting the constant that weights the ultimate measure of tossup information and the ultimate measure of bonus information relative to each other, so that the (rank-order) correlation between rating and winningness is greatest. In FRIAR, those weights are the specificity parameters on the Boltzmann factors for the probability of powering and of correctly answering each tossup. In ASV, it's k, where CSD = k*NTSC + (1-k)*NBSC.
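Here's what that tuning step might look like for ASV, sketched with random placeholder numbers, since I obviously don't have real NTSC/NBSC figures sitting around:

[code]
# Hypothetical sketch of tuning ASV's k in CSD = k*NTSC + (1-k)*NBSC to
# maximize rank-order correlation with winningness. The arrays are random
# placeholders, not real SCT statistics.
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
ntsc = rng.normal(size=30)                       # normalized tossup scores
nbsc = rng.normal(size=30)                       # normalized bonus scores
win_pct = 0.6 * ntsc + 0.4 * nbsc + rng.normal(scale=0.5, size=30)

def rank_corr(k):
    csd = k * ntsc + (1 - k) * nbsc
    return spearmanr(csd, win_pct)[0]            # Spearman's rho

best_k = max(np.linspace(0, 1, 101), key=rank_corr)
[/code]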
However, if it's known how the two are weighted, and the mix of skills needed to perform best on tossups and on bonuses differs at all (which I agree with you that it does), then, in principle, it's known what the best mix of skills is to acquire in order to maximize your expected (improved) S-Value.
Tossup and bonus performance will correlate, again, to the extent that tossups and bonuses reward the same skill set. I submit to you that the main skill tossups and bonuses reward in the same way is knowledge. Good teamwork, buzzer speed, lateral thinking (if any), knowledge of your teammates -- all these bear much more heavily on one class of questions than on the other. If we are measuring a single common component of the ability to answer tossups and the ability to answer bonuses, that component is knowledge. I submit that the r-value is a measure specifically of the knowledge component of quizbowl ability. Thus, the message to teams trying to get better at quizbowl is: if you want your r-value to go up, know more.
Look, all of the systems proposed have the property that, during the game, the best thing to do is always to play to the best of your ability. A correct answer on a tossup is always better than any alternative, and the same goes for a correct answer on a bonus. However, a system tuned to rate teams along a dimension as close as possible to win probability encourages preparation along that same dimension, to improve the same mix of skills. On the other hand, a system that rewards exclusively or disproportionately the knowledge component of quizbowl skill, as FRIAR does, encourages teams to improve themselves along a dimension closer to that of pure knowledge.
So we're down to a debate about the philosophy of quizbowl. Do we want players to make themselves better at the game for the game's sake, or, ultimately, do we want them to be motivated to know more? I don't know where in that philosophy debate you come down, but I tend to hear a lot more voices on this board arguing that good quizbowl promotes knowledge as much as possible than I do taking any other position. I personally kind of began life on the side of "quizbowl is about winning quizbowl" and have come to be mostly on the "quizbowl is about knowledge" side in these latter days.
That, ultimately, is the reason why I've talked myself into being skeptical of making finish at ICT (or at SCT, for that matter) the yardstick against which NAQT is to hold the improved S-Value system it finally adopts. The right yardstick would be a measure of knowledge that could be derived from other data. Unfortunately, that kind of measure is a snipe we'd be advised not to bother hunting. I'm therefore coming down on the side of letting theoretical concerns and provision of the characteristics outlined in the press release dictate the choice of a measure, more than correlation with tournament outcomes.
setht wrote:assigning bonus difficulty ratings on the assumption that teams don't ever do silly things like miss the easy part and score 20 on the medium and hard parts. If I've understood this point of the model correctly, the assumption is that every 10- or 20-point conversion represents the same (or nearly the same) level of knowledge--teams only 10 a bonus if they know enough to know the easy part but not enough to know the medium part, for instance. I don't know how important this assumption is to modeling bonus difficulty, and I think it's not a horrible assumption, but I know from personal experience that teams occasionally manage to miss easy (or medium) parts of bonuses that they should know, or randomly guess hard parts of bonuses.
This is actually not assumed. All that is assumed is that, on average, a team with more knowledge is more likely to get at least any given number of points on each bonus. How they get there is, for FRIAR's purposes, their business. (This would be assumed only if we modeled bonus parts separately from each other, which I've already recommended against doing because some formats used in NAQT events can't be divided that way, and we imposed a restriction on the estimated probability of getting each part that explicitly designated one as the easy part, one as the medium, and one as the hard.)
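In code, the assumption amounts to nothing more than threshold probabilities that rise with ability and fall with difficulty; the logit form and the offsets below are illustrative choices of mine, not FRIAR's exact numbers.

[code]
# Sketch of what IS assumed: P(at least t points on a bonus) rises with
# ability and falls with difficulty, with no claim about *which* parts
# produced the score. The offsets are made up for illustration.
import math

def p_at_least(ability, difficulty, offset):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty - offset)))

def bonus_curve(ability, difficulty):
    # Bigger offsets make higher thresholds harder, so the probabilities
    # come out ordered: P(>=10) > P(>=20) > P(>=30).
    return {t: p_at_least(ability, difficulty, off)
            for t, off in ((10, -1.5), (20, 0.0), (30, 1.5))}

print(bonus_curve(ability=0.5, difficulty=0.0))
[/code]

A team that 20s a bonus by botching the easy part and pulling the hard part still just counts as "at least 20"; the model never asks how they got there.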
setht wrote:it's probably too complex for most players to wrap their heads around. [...] it might be nice if people could actually understand how things work in the new, open source S-value algorithm.
This is the point I'm going to get most defensive about.
First, FRIAR Really Isn't Anything Revolutionary. It is not mind-bending to suggest that "the probability you get the question right increases with your ability and decreases with your opponent's ability, if applicable, and with the difficulty of the question, according to this particular functional form." It's not tough to get used to "we choose log odds as the functional form because it takes the interval of probabilities (0,1) and maps it onto the whole real line from minus to plus infinity, and when we map it that way, the log odds of success change by exactly one unit for a one-unit change in ability or difficulty, no matter where on the scale the parameters are." It's certainly conceptually straightforward to state, "we take the whole pattern of data from all the tournaments and then search for the parameter values at which those equations say we'd be most likely to see exactly these data as opposed to any other." All that is left to be insisted on is either an explanation of how MLE works, at which point a reference to Wikipedia is appropriate, or an actual derivation of the logit form from first principles, in which case the best thing to do is point out how many other contexts it comes up in (thermodynamics! chess! the computer-adaptive SAT/GRE!).
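And if anyone would rather check that one-unit property than take it on faith, a few lines of arithmetic do it:

[code]
# Numeric check of the log-odds property: a one-unit change in
# (ability - difficulty) moves the log odds by exactly one unit,
# no matter where on the scale you start.
import math

def p_correct(ability, difficulty):
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def log_odds(p):
    return math.log(p / (1.0 - p))

for ability in (-2.0, 0.0, 3.0):
    shift = log_odds(p_correct(ability + 1, 0.0)) - log_odds(p_correct(ability, 0.0))
    print(round(shift, 10))  # 1.0 every time
[/code]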
FRIAR isn't any more complicated conceptually than Greg's model, because it is Greg's model, at a higher level of detail. Actually, it's simpler, because it doesn't need to spend an auxiliary step assigning virtual wins. And it has a more elegant form and a shorter path connecting variables to ratings than ASV, as long as it is OK to tell people, "then we let a computer search for the ability and difficulty levels at which that pattern of data was most likely to have happened." Given how many other facets of our lives call for us to trust exactly that kind of procedure implicitly, I object to the notion that it isn't OK here.
Second, I believe it was in deep wisdom that NAQT placed mathematical simplicity only as high as "nice to have" on the list of desiderata for the new system. FRIAR makes the most complete use of data of anything proposed and solves major problems no other proposed system has attempted to. There is an old saying in computer science: "Good; cheap; quick: pick two." I'm pretty sure FRIAR has picked Good and asks for Cheap and Quick to be traded off against each other based on NAQT's preferences about spending money on temp workers or netbooks (or smartphones, or even just making sure host sites have accessible computers) in game rooms. Relative to the other choices, it may have chosen Double Good. Philosophically, I believe it would be a disservice to the teams competing for NAQT to be willing to sacrifice much at all of Good in order to get Cheap or Quick, which is what mathematical or computational simplicity will buy you.
setht wrote:It should be clear to everyone that the best playing strategy is to score as many points as possible on each question, which is what I really care about.
Bueno. Here is the proof that giving the right answer is always a dominant strategy.
There can be said to be two possible types of strategies in the game in which players attempt to maximize their r-values: try your hardest to get questions right and thereby increase your r-value directly, or any mixture of that and trying not to answer, in the hope that you'll inflate the difficulty rating of the question and the ability rating of the opponent and that that will pull your own ability rating upward. However, if anyone is playing a strategy of not answering questions in order to increase their difficulty ratings, the optimal mixture of strategies for another team playing the same questions shifts, if anything, toward more effort to answer the questions correctly: the true difficulty of the questions has not changed, but the reward for getting them right has increased, because their estimated difficulties have gone up. Other teams may therefore free-ride on someone's choice to do anything other than try their hardest to get all the questions right. Thus, such a strategy is not a best response to itself, and in the symmetric game of "let's maximize our r-value," it cannot be part of a Nash equilibrium. Since every team is playing some mix of the "let's maximize our r-value" game and the "let's finish highest in the tournament" game, and the second type of strategy is not a best response to any strategy of the opponent in the latter game either, it must be dominated by the first type of strategy no matter what the team's exact objective is.
The category missing from the above is negs. Given a belief that you absolutely cannot get a tossup, is it ever better to neg than not to answer? No. Negging makes it more likely that your opponent will get the tossup for 10 and less likely that they will power it. On average, this will lower the question's difficulty rating, which can only make the negging team look worse, since a neg and a no-answer each leave the opponent with the same total probability of getting the tossup. Meanwhile, it will raise the apparent bar for power. At best this is expected to be a wash, but if one of the effects is greater, it should be the downward effect on the opponent's r-value. Therefore, negging instead of sitting can be at best a weak best response in the "maximize r-value" game, and is obviously dominated in the regular game; thus, in any mixture of the two, and in particular any reasonable mixture, it is dominated.
Actually, it looks like I didn't give FRIAR enough credit before for resistance to gaming of one's own rating through untoward play. All gaming of one's own rating must apparently go through failure to play. Anyhow, a proof of this type should be constructible for essentially any rating system. The construction of one covering ASV is left as an exercise for the reader (it is most strongly expected to exist, since a proof of another kind has already been found).
setht wrote:If people think FRIAR is worth investigating further, NAQT should ask SCT hosts to send in all scoresheets. In the meantime, testing FRIAR on simulated tournament data (as Gordon has done) seems like a fine step; I'd also like to see simpler models tested on the same set of simulated data to see how different the "post-SCT" rankings are.
What I will work on when I do this (I must begin the semester soon, and will have a little less time to devote to this project) is coming up with a richer, more nuanced simulation specification than FRIAR itself, to generate data with noise in them both of the type FRIAR expects to pick up and of the types it doesn't. This would include specifying different team ability parameters for different types of questions (tossup, power, neg, bonus) and different subjects, incorporating the NAQT distribution, and modeling the clock. (Those things are much easier to do in a simulation than to capture in a statistical model: as we try to divine more parameters from the same amount of data, we get less and less traction on each of them, but when we simulate, we may fix them with arbitrary precision.) I'll generate several data sets with different distributions of those parameters, and hopefully get some expert eyeballs to look at the summary stats of each one and offer opinions on which look most like real SCTs.
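For a taste of what I mean, here is a stripped-down sketch of that kind of generator, with per-category abilities layered on a common core; every number and category name is a placeholder of mine, not any kind of NAQT specification.

[code]
# Stripped-down sketch of the richer simulation described above: team
# ability varies by category, giving the data structure that a single
# r-value does not model. All names and numbers are illustrative.
import numpy as np

rng = np.random.default_rng(42)
categories = ["literature", "history", "science", "arts", "other"]
n_teams, n_questions = 20, 200

# Per-category ability = shared core + category-specific wobble. The
# wobble is exactly what FRIAR's single ability parameter can't capture.
core = rng.normal(size=n_teams)
ability = {c: core + rng.normal(scale=0.5, size=n_teams) for c in categories}

rows = []
for q in range(n_questions):
    cat = rng.choice(categories)        # stand-in for the NAQT distribution
    difficulty = rng.normal()
    for team in range(n_teams):
        p = 1.0 / (1.0 + np.exp(-(ability[cat][team] - difficulty)))
        rows.append((team, q, cat, rng.random() < p))
# rows now holds per-question outcomes to feed to FRIAR or simpler models.
[/code]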
If you have read this far, thanks for your attention! I sincerely apologize if I have come off as hostile or offensive, especially if I put myself in that position by presenting something that it was really my duty to explain more fully the first time around. Now, if you'll excuse me -- this post has been my day's work. I am off to acquire some real [not simulated] dinner.