DP-FRIAR: Simpler! More Accurate! Almost Testable!

The Friar · Post by **The Friar** » Wed Aug 04, 2010 1:22 am

In the past few days, in the course of programming and debugging FRIAR with negs and powers, I devised separately a team rating model simpler than FRIAR that nonetheless is theoretically sound in the same sense in which FRIAR is. I refer to this as DP-FRIAR, where DP stands for Divisional Packets. This model would estimate, via maximum likelihood, an appropriate conversion factor between scores in different divisions, without using computationally and administratively expensive question-level data, as long as any SCT questions used in both DI and DII are heard in the same packet across divisions.

(1) DP-FRIAR in one division: the binomial rotated scores model

Scoresheets parametrize the outcome of a game in terms of team 1's score and team 2's score. In these terms, it is difficult to estimate each team's intrinsic propensity to score points. However, the same total information is provided by team 1's final score and the total score, and these are easier to describe in terms of single-team ability levels.

The maximum total score possible in a NAQT SCT packet is 1170 (45 * 26); the minimum is -130, for a range of 1300. Divide by five to normalize the smallest unit of scoring to 1, and we may speak of each packet containing 260 absolute points. Then the total score in absolute points is a binomial random variable, with n, the number of trials, equal to 260 and p, the probability of success, equal to the probability that at least one team scores any given available point:

Y ∼ Binomial(260, p)

Assuming independence, p is equal to the sum of the probabilities of each team scoring a given point playing alone, minus the probability of both doing so:

Y ∼ Binomial(260, p_1 + p_2 - (p_1 * p_2))

This alone is a sufficient model on which to estimate p_t for each team t in the tournament, across fields, but only half the information in the total scores has been used. We were given two inputs, Y_1 and Y_2, but we have estimated a model for only one transformed input, Y_1 + Y_2. We can still estimate one for a linearly independent quantity, such as Y_1 alone. Conditional on the total score, team 1's score is also a binomial random variable. The probability that a given point scored was scored by team 1 is a simple affine transformation of the difference in the solo scoring probabilities p_1 and p_2:

Y_1 | Y ∼ Binomial(Y, 0.5 + (p_1 - p_2) / (2 * (p_1 + p_2)))
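The two binomial pieces above can be combined into a per-game log-likelihood. The following is only a rough sketch, not the author's implementation: the function names are made up for illustration, scores are assumed to have already been converted to non-negative absolute points (the post does not spell out how negs map onto that scale), and p_1 + p_2 is assumed to be positive.

```python
import math

# 26 tossups, at most 45 and at least -5 points each, in units of 5 points.
N_ABSOLUTE = (45 * 26 - (-5 * 26)) // 5  # 260


def log_binom_pmf(k: int, n: int, p: float) -> float:
    """Log of the Binomial(n, p) probability mass at k."""
    if not 0 <= k <= n:
        return float("-inf")
    if p <= 0.0:
        return 0.0 if k == 0 else float("-inf")
    if p >= 1.0:
        return 0.0 if k == n else float("-inf")
    log_choose = math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
    return log_choose + k * math.log(p) + (n - k) * math.log1p(-p)


def game_log_likelihood(y1: int, y2: int, p1: float, p2: float,
                        n: int = N_ABSOLUTE) -> float:
    """Log-likelihood of one game under the rotated-scores model.

    y1, y2: team scores in absolute points (assumed non-negative integers).
    p1, p2: solo scoring probabilities (assumed to satisfy p1 + p2 > 0).
    """
    y = y1 + y2
    p_total = p1 + p2 - p1 * p2                    # at least one team scores the point
    share_1 = 0.5 + (p1 - p2) / (2.0 * (p1 + p2))  # algebraically p1 / (p1 + p2)
    return log_binom_pmf(y, n, p_total) + log_binom_pmf(y1, y, share_1)
```

Summing `game_log_likelihood` over all games and maximizing over the p_t values would give the maximum likelihood estimates; the choice of optimizer is left open here.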

(2) DP-FRIAR across divisions

If two teams played on the Division II version of a packet instead of the Division I version, there would be some fraction of the points they could get in DII that would have been too hard for them to get in DI. We can treat this multiplied by (p_1 + p_2) as another probability p * p_D (bounded below by 0 and above by 1) and write an altered binomial model for total scores in Division II:

Y | DII ∼ Binomial(260, p + (p * p_D) - (p^2 * p_D))

The Division I model in part (1) is then simply this model with p_D equal to 0: no points in Division I are easier to get than Division I points. p * p_D is used, rather than p_D alone, to capture the idea that no point is a free gift: two teams of buzzer rocks will still score 0 under this specification, as expected.
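As a quick check on the two properties just described, here is a small sketch of the per-point probability for total scores. The function name `total_point_prob` is hypothetical, and p is taken to be p_1 + p_2 - p_1 * p_2 as in part (1).

```python
def total_point_prob(p1: float, p2: float, p_d: float = 0.0) -> float:
    """Per-point success probability for the total score.

    p_d = 0 reproduces the Division I model; 0 < p_d <= 1 gives the
    Division II model p + p * p_D - p^2 * p_D.
    """
    p = p1 + p2 - p1 * p2
    return p + p * p_d - p * p * p_d


print(total_point_prob(0.0, 0.0, p_d=0.5))  # two buzzer rocks: 0.0
```

With p_d = 0 the function returns p unchanged, and the Division II value never exceeds 1, since p + p * (1 - p) = 1 - (1 - p)^2.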

(3) Discussion: advantages over FRIAR and obstacles to testing

In order for the model in part (2) to be identified, there must be some teams who play both DI and DII packets. If this is not the case, then an increase in p_D could be offset by a commensurate decrease in p_t for every team t in Division II without changing the likelihood of any result. Unfortunately, all combined DI/DII fields to date have played either one or the other. Thus, there are no data on which to estimate p_D for past SCTs. The best that could be done for testing purposes would be to estimate separate models for DI and DII, omitting combined fields (or characterizing all teams in a combined field as being in the division actually read).

In the future, combined fields could play Division I packets through their first round robin and DII after, or vice versa, or Division II in preliminary rounds and Division I in playoffs, or even Division II in preliminary rounds followed by Division I in the upper playoff bracket and Division II in the lower (if large enough combined fields exist, which should not happen). These schemes would limit the round-to-round fluctuation of difficulty while providing the bridging observations necessary to estimate p_D.

If implemented, DP-FRIAR would offer significant advantages over FRIAR. First, DP-FRIAR has fewer parameters and takes fewer data points, and is therefore easier to code, easier to compile data for, and much easier for a computer to numerically optimize once code and data are provided, than FRIAR. Second, the ratings from DP-FRIAR are expected to correlate better with probability of winning than those from FRIAR. FRIAR models knowledge, while DP-FRIAR models ability to score points. Sitting instead of negging may provide a different amount of evidence about a team's knowledge level than answering a tossup before instead of after the power mark, and these would consequently affect FRIAR ratings differently. Each, however, is worth the same number of extra points, and therefore contributes equally to the probability a team will go on to win given the current state of the game, so it is advantageous that each is treated equally by DP-FRIAR.

Like FRIAR, DP-FRIAR is a model of the quizbowl data-generating process itself. Therefore, like FRIAR, it can be run backward to simulate games for the purposes of resolving intransitivities in the ranking of teams by winning percentage and rating and of penalizing play against the spirit of the game. Like estimation of the model parameters, simulation of games using DP-FRIAR would run much faster than using FRIAR.

(4) Conclusion

Assuming the conditions necessary for estimating p_D are not too onerous, I strongly recommend implementing them and giving preference to DP-FRIAR over FRIAR as a rating system for determining SCT invitations. I do not here address the relative merits of DP-FRIAR and the D-Value, except to note that I have previously recommended FRIAR over the D-Value and other measures calculated without estimates of error, and therefore recommend DP-FRIAR over these as well.

I will continue, in my declining free time this summer, to attempt the estimation of the full FRIAR model. If and when I have determined that (a) I've got it working or (b) I'm not going to finish it by the end of August as previously promised, I will change gears and attempt to implement DP-FRIAR. I welcome any comments, corrections, or suggestions.

cvdwightw · Post by **cvdwightw** » Wed Aug 04, 2010 2:35 pm

I see a couple of inherent problems with this model:

1. The model assumes a range of 1300 points. As previously noted in, well, pretty much every critique of every NAQT tournament ever, the maximum number of points that can be scored is a direct function of moderator speed and clarity (faster moderators get through more questions, clearer moderators prevent teams from mis-hearing clues they know). The vast majority of games will have less than a 1300-point range available.

2. The mathematical assumptions behind step 1 of Part (1) of the model are unsound because bonuses are not reboundable in NAQT format. Therefore the p-value used as defined will always be underestimated due to bonus parts answerable by the team that does not control the bonus.

First off, the problem behind (2) can be partially remedied by using three different P-values with bonus P-values empirically estimated from bonus conversion. Unfortunately, one would have to use question-level data to get a reasonable standard deviation, but one could estimate using sqrt(p*(1-p)), where p is (BC/30).

Second, neither of these problems applies to PACE format, which has both a fixed number of questions and reboundable bonuses. Accordingly, one can discard Part (2) of the model and apply it directly to the PACE NSC, for which previous (2010) data exists, in order to validate Part (1) of the model as described. If someone were to write a regular-season high school set in PACE NSC format for 2010-2011, one could use DP-FRIAR to estimate teams' success at PACE NSC, although the variability in team makeup and the increase in difficulty from regular to nationals difficulty would confound conclusions.
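For the sqrt(p*(1-p)) estimate suggested above, a minimal sketch follows. It treats p = BC/30 as the success probability of a single Bernoulli trial, which is an assumption about how the estimate is meant to be used; the function name `bonus_sd` is hypothetical.

```python
import math


def bonus_sd(bonus_conversion: float) -> float:
    """Per-trial standard deviation sqrt(p * (1 - p)) with p = BC / 30.

    bonus_conversion is a team's average points per bonus (0 to 30).
    """
    p = bonus_conversion / 30.0
    if not 0.0 <= p <= 1.0:
        raise ValueError("bonus conversion must lie between 0 and 30")
    return math.sqrt(p * (1.0 - p))


print(bonus_sd(15.0))  # 0.5
```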

The Friar · Post by **The Friar** » Wed Aug 04, 2010 9:58 pm

cvdwightw wrote:I see a couple of inherent problems with this model:

1. The model assumes a range of 1300 points. As previously noted in, well, pretty much every critique of every NAQT tournament ever, the maximum number of points that can be scored is a direct function of moderator speed and clarity (faster moderators get through more questions, clearer moderators prevent teams from mis-hearing clues they know). The vast majority of games will have less than a 1300-point range available.

2. The mathematical assumptions behind step 1 of Part (1) of the model are unsound because bonuses are not reboundable in NAQT format. Therefore the p-value used as defined will always be underestimated due to bonus parts answerable by the team that does not control the bonus.

First off, the problem behind (2) can be partially remedied by using three different P-values with bonus P-values empirically estimated from bonus conversion. Unfortunately, one would have to use question-level data to get a reasonable standard deviation, but one could estimate using sqrt(p*(1-p)), where p is (BC/30).

Second, neither of these problems applies to PACE format, which has both a fixed number of questions and reboundable bonuses. Accordingly, one can discard Part (2) of the model and apply it directly to the PACE NSC, for which previous (2010) data exists, in order to validate Part (1) of the model as described. If someone were to write a regular-season high school set in PACE NSC format for 2010-2011, one could use DP-FRIAR to estimate teams' success at PACE NSC, although the variability in team makeup and the increase in difficulty from regular to nationals difficulty would confound conclusions.

I thought about point (1) and kind of dismissed it. Either moderator effects can be estimated (chuck in yet another p factor! why not?) or we can rely on the fact that 260 is a pretty large n for a binomial distribution and say, "eh, forget it, we weren't gonna predict anyone getting close to all the points anyway" -- that is, ignore them as frankly small compared to the total range of available points.

Point (2) is more serious. You're right that bonus points are only available conditional on converting the tossup and only to one team, and that that fact really is important enough to break the binomial assumption. This is what I get for getting excited and hurrying. There would be ways to get around this while still estimating a "point-scoring ability" parameter from game-level data (total tossup points distributed binomial, each team's bonus points distributed binomial of its own ability where n is the total number of bonus points it could have scored?), but I can't think of all its ramifications. Instead of fixing it right away, my plan will be to lie awake at night and try and think of a story that justifies the binomial distributional approximation after all.

I should note that, while the ratings of DP-FRIAR are expected to be more accurate in terms of which team is more likely to win a game than those of FRIAR with powers and negs, the latter are expected to be more precise. This kind of model should be expected to give us bigger standard errors on team ratings relative to the separation between one and the next than one that allows the amount of information contributed by each data point to vary based on an estimated specificity parameter.

Mechanical Beasts · Post by **Mechanical Beasts** » Wed Aug 04, 2010 10:17 pm

cvdwightw wrote:2. The mathematical assumptions behind step 1 of Part (1) of the model are unsound because bonuses are not reboundable in NAQT format. Therefore the p-value used as defined will always be underestimated due to bonus parts answerable by the team that does not control the bonus.

Doesn't it still fail even with reboundable bonuses? The team that gets the better (first) chance to answer is linked to tossup conversion; against a team that gets all the tossups, you could still get a result of zero for a team that gets zero tossups but would have thirtied all the bonuses--certainly a corner case, but there are bunches of grails that happen against teams that might have managed, say, 8-12ppb. They're still strictly speaking available, but it's a more complicated function of opponent skill. Or perhaps I'm misunderstanding those assumptions and what would make them unsound.

cvdwightw · Post by **cvdwightw** » Thu Aug 05, 2010 12:59 pm

Crazy Andy Watkins wrote:Or perhaps I'm misunderstanding those assumptions and what would make them unsound.

A reboundable (i.e. "bounceback") bonus is one whose parts rebound to the other team if the controlling team misses them (there are various similar definitions but I believe this is the one that the NSC uses). Thus a reboundable bonus gives both teams an opportunity to score the point (although the controlling team gets first chance). If p is trying to estimate the probability that "at least one team scores any available point," it would still be inherently underestimated due to the number of bonuses associated with dead tossups (that is, there are some bonus parts that are answerable by one or both teams but which are not answered because the associated tossup was not answered correctly), but not in the model-breaking way that non-reboundable bonuses would underestimate p.

jonpin · Post by **jonpin** » Thu Aug 05, 2010 3:26 pm

cvdwightw wrote:
Crazy Andy Watkins wrote:Or perhaps I'm misunderstanding those assumptions and what would make them unsound.
A reboundable (i.e. "bounceback") bonus is one whose parts rebound to the other team if the controlling team misses them (there are various similar definitions but I believe this is the one that the NSC uses). Thus a reboundable bonus gives both teams an opportunity to score the point (although the controlling team gets first chance). If p is trying to estimate the probability that "at least one team scores any available point," it would still be inherently underestimated due to the number of bonuses associated with dead tossups (that is, there are some bonus parts that are answerable by one or both teams but which are not answered because the associated tossup was not answered correctly), but not in the model-breaking way that non-reboundable bonuses would underestimate p.

Indeed. There are (obviously pathological) cases where Team X playing against empty chairs at empty tables would score more points than the combined score of Team X vs Team Y (even without negs). For instance, Team X can get any tossup at the end and 30 every bonus, but Team Y can occasionally buzz mid-question and doesn't 30 every bonus.
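A worked instance of that pathological case, with made-up numbers (20 tossups heard, Team Y taking 5 of them and averaging 15 points per bonus), might look like this:

```python
TOSSUPS = 20  # hypothetical number of tossups heard

# Team X alone: gets every tossup at the end (10 points) and thirties every bonus.
x_alone = TOSSUPS * (10 + 30)

# Team X vs Team Y: Y beats X to a few tossups mid-question but converts
# fewer bonus points (both figures are hypothetical).
y_tossups = 5
y_points_per_bonus = 15
x_vs_y = (TOSSUPS - y_tossups) * (10 + 30)
y_vs_x = y_tossups * (10 + y_points_per_bonus)
combined = x_vs_y + y_vs_x

print(x_alone, combined)  # 800 725
```

Here the combined score of the two-team game is lower than what Team X would score alone, even without any negs.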

The Friar · Post by **The Friar** » Wed Aug 11, 2010 1:07 pm

I suggest that the problem with the binomial assumption in DP-FRIAR is more one of interpreting p too concretely (as encouraged by my original post) than one of mis-specifying the data-generating process.

Properly accounting for, say, what teams would have scored on bonuses that went dead would probably look like making the total number of points available conditional on the total number of tossups correctly answered. That would give us a p tied more closely to a team's actual ability to score a point given its availability.

It would also increase the extent to which this rating system misses the point, which, ultimately, is to rate teams' ability to win quizbowl games. The more conditionals introduced into the process, the more we will have a rating of teams' ability to do a quizbowl subtask (or of a parameter, like knowledge, that maps onto their ability to do several related subtasks), instead of a rating of their ability to win quizbowl games. This is what I have been criticizing about FRIAR with negs and powers: it's a good rater of knowledge, which I would call philosophically desirable, but DP-FRIAR would be a better rating of quizbowl winning ability.

Put another way, what we want to rate teams on is not their ability to get any given point as it becomes available but their ex ante probability of scoring a point or not. Doing this from the outset would mean something like calculating the expected points each team would have gotten on a bonus times the probability that they get the associated tossup. That would be... hard. The easier thing would be to treat it as I did, where a team may earn some bonus points if it does get the tossup but gets none if it doesn't, and argue from the asymptotic equivalence of the two approaches and the large size of our data set.

Put another way, you should have gotten the tossup.

By the way, the p = p_1 + p_2 - p_1*p_2 expression isn't helping. That alone is encouraging thinking of p as something easier to interpret than it actually is, which helped me conceive of the model initially but is getting in the way at this point. The desired property that two infinitely bad teams score no points on any questions, no matter how easy, also obtains if p = (e^(r1) + e^(r2)) / (1 + e^(r1) + e^(r2)) where r1 and r2 are team ratings living between -Inf and +Inf, and the p_d is just something like (e^(r1) + e^(r2) + e^d) / (1 + e^(r1) + e^(r2) + e^d). This logit parametrization is canonical and much better supported theoretically for models like this than some kind of individual probabilities undergoing Lorentzian addition.
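A small sketch of this logit parametrization follows. The function names are hypothetical, and the Division II expression is read with the whole sum e^(r1) + e^(r2) + e^d in the numerator, which is an interpretation of a formula offered only as "something like". Very large ratings would overflow `math.exp`, so moderate values are assumed.

```python
import math


def point_prob_logit(r1: float, r2: float) -> float:
    """p = (e^r1 + e^r2) / (1 + e^r1 + e^r2) for ratings on (-inf, +inf)."""
    s = math.exp(r1) + math.exp(r2)
    return s / (1.0 + s)


def point_prob_logit_d2(r1: float, r2: float, d: float) -> float:
    """Division II analogue with an extra e^d term (assumed reading)."""
    s = math.exp(r1) + math.exp(r2) + math.exp(d)
    return s / (1.0 + s)


# Two very weak teams score essentially nothing under the first expression.
print(point_prob_logit(-50.0, -50.0) < 1e-20)  # True
```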

cvdwightw · Post by **cvdwightw** » Fri Aug 13, 2010 5:04 pm

Okay, I've gone through the math and I see sort of where you're coming from; correct me if something doesn't make sense or is completely off base.

We start with a very basic assumption that a team with a higher rating ought to defeat a team with a lower rating a higher proportion of the time. We can model this canonically using a logistic model:

P(Team 1 wins) = 1/(1+e^(-a1*(x1-x2))), where x1 and x2 are the ratings of team 1 and team 2 on (-inf, inf). Note that when x1 = x2 the probability that Team 1 wins is 0.5 and that a higher value for a1 more greatly magnifies small differences between x1 and x2. We can find a1 using logistic regression given the results of every game.

Once we have a1 and x_i for all teams i, we can calculate each team's expected number of wins if we have a round robin tournament featuring every team. This seems to be a useful S-value interpretation and is consistent with the canonical use of logistic regression.
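The expected-wins calculation can be sketched as follows, assuming a1 and the ratings have already been estimated. The function names and the example ratings are invented for illustration.

```python
import math


def win_prob(x1: float, x2: float, a1: float) -> float:
    """Logistic probability that the team rated x1 beats the team rated x2."""
    return 1.0 / (1.0 + math.exp(-a1 * (x1 - x2)))


def expected_round_robin_wins(ratings: dict, a1: float) -> dict:
    """Expected wins for each team in a single round robin over all teams."""
    return {
        team: sum(win_prob(x, y, a1) for other, y in ratings.items() if other != team)
        for team, x in ratings.items()
    }


# Hypothetical ratings for three teams.
wins = expected_round_robin_wins({"A": 1.0, "B": 0.0, "C": -1.0}, a1=1.0)
```

The expected wins across all teams sum to the number of games played, which is a convenient sanity check on any fitted set of ratings.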

The question is, therefore, how to calculate x_i for each team.

In your model, you assume a second logistic function P(the teams combine to score the maximum points) = 1/(1+e^-(b0+b1*x1+b2*x2)), where x1 and x2 are the ratings. We assume that we can plug in a room's actual score for P(teams combine to score the maximum points) since we can treat the actual score as some amount of "combining to score maximum points" balanced with "combining to score zero points" (note that these are not quite the same thing, but that we will treat them as the same thing). Furthermore, we can take the "average team" rating to be 0, which conveniently allows us to estimate b0 = ln(as) - ln(1-as), where as is the average score across all games. In your model you suggest that we use units of points/5; I suggest instead that we use a variant equal to (PPTH+5)/50, in order to (1) control for variable numbers of tossups heard and (2) normalize to the range [0,1]. We can therefore estimate our b's and x's by the equation:

b1x1 + b2x2 = ln(k(1-as)) - ln(as(1-k)), where k is the total score of the game in (PPTH+5)/50 and as is the average score of all games in (PPTH+5)/50.
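The right-hand side of that equation is just the difference between the logit of the game's normalized score and the logit of the average. A short sketch, with hypothetical function names and `avg_score` standing in for "as" (which is a reserved word in Python):

```python
import math


def normalized_score(ppth: float) -> float:
    """Map points per tossup heard from [-5, 45] onto [0, 1]."""
    return (ppth + 5.0) / 50.0


def logit(q: float) -> float:
    return math.log(q) - math.log(1.0 - q)


def rating_term(k: float, avg_score: float) -> float:
    """b1*x1 + b2*x2 = ln(k * (1 - as)) - ln(as * (1 - k))."""
    return math.log(k * (1.0 - avg_score)) - math.log(avg_score * (1.0 - k))
```

A game whose normalized total equals the average gives a rating term of zero, consistent with taking the average team's rating to be 0.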

Now what's the problem with this? This gives a perverse incentive for poor teams to just play like buzzer rocks any time they are matched up against a better team, since the other team will score more on bonus points and therefore inflate both teams' ratings. Therefore we can't actually use this as a measure of, well, anything.

What we can use is P(Team 1 scores all the points). We make a similar assumption that we can estimate P using T1/k = 1/(1+e^-(c1x1-c2x2)), where T1 is Team 1's (PPTH+5)/50. Note the minus sign here; if Team 2 is really really good then we should see P = 0 and I prefer to keep my coefficients positive when possible. We can further note that k = T1+T2 (the combined Team 1 and Team 2 scores). This yields that c1x1 - c2x2 = ln(T1/T2).
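The last step can be checked numerically: taking c1x1 - c2x2 = ln(T1/T2) and passing it back through the logistic function should recover Team 1's share T1/(T1+T2). A minimal sketch with hypothetical names and example values:

```python
import math


def share_logit(t1: float, t2: float) -> float:
    """c1*x1 - c2*x2 = ln(T1 / T2), for positive normalized scores."""
    return math.log(t1 / t2)


def share_from_logit(z: float) -> float:
    """Inverse step: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


t1, t2 = 0.3, 0.1  # hypothetical normalized scores
recovered = share_from_logit(share_logit(t1, t2))  # approximately t1 / (t1 + t2)
```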

The math up to this point at least appears sound to me. My problem is that I don't think we can use maximum likelihood estimation to find the b_i's and c_i's (and we do need both sets), since we don't know the x_i's! Am I missing something here?

The Friar · Post by **The Friar** » Sun Aug 22, 2010 9:56 pm

Sorry to take so long to follow up. Even this will be just a couple points.

First,

cvdwightw wrote:We start with a very basic assumption that a team with a higher rating ought to defeat a team with a lower rating a higher proportion of the time. We can model this canonically using a logistic model:

P(Team 1 wins) = 1/(1+e^(-a1*(x1-x2))), where x1 and x2 are the ratings of team 1 and team 2 on (-inf, inf).

I'm not modeling winning. That's what KRACH does. I am strictly modeling scores.

Second,

cvdwightw wrote:Now what's the problem with this? This gives a perverse incentive for poor teams to just play like buzzer rocks any time they are matched up against a better team, since the other team will score more on bonus points and therefore inflate both teams' ratings. Therefore we can't actually use this as a measure of, well, anything.

The model I originally put forth should not have this problem. The upward effect on the weaker team's rating due to a decreased margin of victory should eat up any downward effect on the weaker team's rating induced by the lower total score resulting from the weaker team getting more tossups and therefore fewer total bonus points getting earned.

To find out for certain, we should look at the second partial derivative of the likelihood function with respect to the weaker team's rating and to a linear combination of the weaker team's score and the stronger team's score representing the maximum point swing (i.e. representing two more absolute points for the weaker team and eight fewer for the stronger team). My suspicion is that this second derivative should be positive everywhere or almost everywhere; that is, the likelihood of observing the weaker team getting one more tossup and zeroing one more bonus while the stronger team gets one tossup fewer and thirties one bonus fewer should increase as the weaker team's rating increases. I will get around to doing the actual calculus by next February 30th or when I next get the chance to take a long trip by flying pig (a mentally stimulating means of travel; I am super productive every time I ride Swine Air).

Meanwhile, I successfully forced modeling steps for both negs and powers into classic FRIAR this month, but found this could only be done by constraining one or two global parameters to lie in regions where I suspect that, in the real world, they simply don't. This evening I came up with a different and altogether better set of expressions for the probability of observing any given tossup outcome, which I will write up soon (sooner than the directional second derivatives of DP-FRIAR, anyway) as FRIAR 2.0. As with DP-FRIAR, I just wish this tossup model had occurred to me before I spent a whole month banging my head against the old one.

EDIT: (HINT: simultaneous latent variables)
