Dwight Introduces Yet Another QB Rating System

cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Dwight Introduces Yet Another QB Rating System

Post by cvdwightw »

For those of you who don't know, I'm the kind of crazy person who's been attempting to fit the Elo rating model (used in, e.g., chess) to quizbowl for the past couple of years. With the advent of the Hart Win Share Statistic and the transition to the logarithmic scale, I believe I'm pretty close to something that makes sense, though I'm always looking for people who recognize things that send me back to the drawing board.

For the model, I use a modified version of the Hart Stat: TPC+ = Player10s*(10+BC) - 10*Player10s*(Team10s-Player10s)/(PlayerTUH-Player10s); this assumes that a certain percentage of the player's correct buzzes would have been picked up by his teammates anyway.
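To make that arithmetic concrete, here is a minimal Python sketch of the formula as written; the function name and the numbers in the example are made up.

def tpc_plus(player_tens, player_tuh, team_tens, bonus_conversion):
    # TPC+ = Player10s*(10+BC) - 10*Player10s*(Team10s-Player10s)/(PlayerTUH-Player10s)
    raw = player_tens * (10 + bonus_conversion)
    # Fraction of the tossups the player heard but didn't convert that the
    # rest of the team converted anyway; this is the "a teammate would have
    # gotten it" rate applied to the player's own correct buzzes.
    pickup_rate = (team_tens - player_tens) / (player_tuh - player_tens)
    return raw - 10 * player_tens * pickup_rate

# Made-up line: 30 tossups for the player, 50 for the team, 200 tossups heard,
# team bonus conversion of 15.
print(tpc_plus(player_tens=30, player_tuh=200, team_tens=50, bonus_conversion=15))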

THE MODEL
The model computes ratings as follows: New = Old + K1*K2*(WS - WS_thr), where
New = New Rating
Old = Old Rating
K1 = a "difficulty factor" ranging from 5 (high school) to 25 (masters), with "novice difficulty" = 10, "standard difficulty" = 15, "nationals difficulty" =20. A severe quality issue will drop the difficulty factor to the next-lowest level (e.g. if a tournament is advertised as "novice difficulty" and majorly blows, it will receive a 5 difficulty factor instead of a 10).
K2 = an "experience factor", currently set to 20 for new players, probably decreasing to 10 after several tournaments and then to 5 after several more (though I haven't figured that out). This may also change to a "rating factor" used by chess to prevent overdeflation of ratings.
WS = Hart Win Shares
WS_thr = Theoretical Win Shares (see below)

In general, the model computes a "theoretical win share," which is how much a player is expected to contribute toward the team's theoretical winning percentage, subtracts this from the actual win share, and from this difference computes a new rating (a quick sketch of this update follows).
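As a minimal sketch of just that update rule (the dictionary mirrors the K1 values listed above; the K2 value and the win-share numbers in the example are made up):

# K1 difficulty factors as listed above.
DIFFICULTY_K1 = {"high school": 5, "novice": 10, "standard": 15,
                 "nationals": 20, "masters": 25}

def updated_rating(old, k1, k2, win_share, theoretical_win_share):
    # New = Old + K1 * K2 * (WS - WS_thr)
    return old + k1 * k2 * (win_share - theoretical_win_share)

# Made-up example: a new player (K2 = 20) at a standard-difficulty tournament
# whose actual win share beats the theoretical one by 0.3.
print(updated_rating(1000, DIFFICULTY_K1["standard"], 20, 0.9, 0.6))   # 1090.0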

STEPS TO THE MODEL (a code sketch of steps 2-5 follows the list)
1. Find the Old Rating and compute Hart Win Shares. I am currently starting everyone at the average of 1000; once these ratings get more stable, a new player would probably start at about 700.
2. Convert to an "individual rating factor" via the following formula: %TUH*4^((rating-1000)/500).
3. Compute a "team rating factor" by summing the individual rating factors for a team.
4. Create a "competition matrix" detailing each team's expected chance of a win over the other teams, assuming differences in the log of team ratings are roughly normal with mean zero and std.dev. log(2). This is accomplished in Excel using the command NORMDIST(log(team rating)-log(opponent rating),0,log(2),TRUE) for each entry in the matrix.
5. Each team, over the course of their games, plays several of these opponents. For each game, add the "expected chance" of winning that game. Then divide by the total number of games to get the "theoretical winning percentage".
6. Multiply the theoretical winning percentage by the ratios %TUH/(%TUH,total,team) and RatingFactor/TeamRatingFactor to get a theoretical win share.
7. Subtract the theoretical win share from the actual win share, multiply by K1 and K2, and add to the Old Rating to get the New Rating.
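Here is a rough Python sketch of steps 2 through 5, using Python's NormalDist in place of Excel's NORMDIST; the ratings, %TUH values, and opponent factors in the example are made up.

from math import log
from statistics import NormalDist

def rating_factor(pct_tuh, rating):
    # Step 2: individual rating factor = %TUH * 4^((rating - 1000) / 500)
    return pct_tuh * 4 ** ((rating - 1000) / 500)

def team_rating_factor(players):
    # Step 3: sum the individual rating factors; players is a list of
    # (%TUH, rating) pairs, so an empty chair simply contributes nothing.
    return sum(rating_factor(pct, r) for pct, r in players)

def expected_win(team_factor, opp_factor):
    # Step 4: one competition-matrix entry, i.e. Excel's
    # NORMDIST(LOG(team factor) - LOG(opp factor), 0, LOG(2), TRUE).
    # Any log base works as long as the difference and the standard
    # deviation use the same base.
    return NormalDist(0, log(2)).cdf(log(team_factor) - log(opp_factor))

def theoretical_win_pct(team_factor, opponent_factors):
    # Step 5: average the expected chance of winning over the games played.
    return sum(expected_win(team_factor, f) for f in opponent_factors) / len(opponent_factors)

# Made-up example: a 1200-rated player who heard every tossup, with two
# 1000-rated teammates, against three opponents with the listed team factors.
tf = team_rating_factor([(1.0, 1200), (1.0, 1000), (1.0, 1000)])
print(theoretical_win_pct(tf, [2.5, 4.0, 5.5]))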

The model is calibrated such that a solo player would roughly split games against a team of four players each rated 500 points lower; e.g., a 1500-rated player playing solo is expected to win half his games against a full team of four 1000-rated players, a 1000-rated player playing solo is expected to go .500 against a team of four 500-rated players, and so on. Thus, I believe this model accurately accounts for empty chairs, which effectively have a rating factor of zero. Using the exponential scale also weights the theoretical win share much more heavily toward the team's high scorers, which matches what's typically seen in the actual Win Shares.
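A quick check of that calibration, again with NormalDist standing in for NORMDIST:

from math import log
from statistics import NormalDist

solo_1500 = 1.0 * 4 ** ((1500 - 1000) / 500)           # rating factor 4.0
four_1000s = 4 * (1.0 * 4 ** ((1000 - 1000) / 500))    # rating factor 4.0
print(NormalDist(0, log(2)).cdf(log(solo_1500) - log(four_1000s)))   # 0.5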

There is a small amount of roundoff error due to rounding Actual Win Shares to 3 decimal places and Ratings to whole numbers.

The obvious problems with this are that tournaments need to provide accessible statistics in order for this to work, including some way of telling who played under what pseudonym at which tournament. Also, the competition matrices get cumbersome for tournaments with large fields; for instance, ICT's 32-team field in each division could be a problem. However, this could likely be avoided with a rough table approximating the difference in "team rating factor" needed for each p-value.
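For what that rough table might look like, here is a sketch that just inverts the normal CDF at a handful of arbitrary p-values:

from math import exp, log
from statistics import NormalDist

# The matrix entry is Phi((log(team factor) - log(opp factor)) / log(2)),
# so the log-difference cutoff for a given p-value is log(2) * Phi^-1(p),
# which corresponds to a factor ratio of exp(that cutoff).
for p in (0.55, 0.65, 0.75, 0.85, 0.95):
    cutoff = log(2) * NormalDist().inv_cdf(p)
    print(f"p = {p:.2f}: log difference ~ {cutoff:.3f}, factor ratio ~ {exp(cutoff):.2f}")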

I've done a pilot study using the four West Coast tournaments this year; from those I have a range of ratings from 903 to 1417 for West Coast players, under the assumption that every new player I encountered began with a 1000 rating.
First Chairman
Auron
Posts: 3651
Joined: Sat Apr 19, 2003 8:21 pm
Location: Fairfax VA

Post by First Chairman »

When anyone says we have a new rating system, I ask...

Have you tried this system on NAQT SCT data from last year and compared your results with the NAQT ICT data? Same thing with those teams that competed at ACF Nationals last year compared to their ACF Fall or Regionals data? The reason why I ask is that the "difficulty" of the field and of the questions should somehow be taken into account when computing these data, and you need to show that your factors for difficulty have validity. Same thing with your experience factor... have you done a longitudinal survey of an individual who has participated for three years at ICT or SCT to come up with your coefficients there?

I also noticed... NAQT has posted Elo ratings for high school teams. I suppose it's just because we don't have data for college teams until SCT, but will you be posting those data as well once those events have taken place? What numbers did you use for experience and difficulty, especially if a tournament were using A-series questions?
Emil Thomas Chuck, Ph.D.
Founder, PACE
Facebook junkie and unofficial advisor to aspiring health professionals in quiz bowl
---
Pimping Green Tea Ginger Ale (Canada Dry)
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Post by cvdwightw »

Will Run PACE for Reese's wrote:Have you tried this system on NAQT SCT data from last year and compared your results with the NAQT ICT data? Same thing with those teams that competed at ACF Nationals last year compared to their ACF Fall or Regionals data?
Not yet. Like I said, I've only done a pilot study using the four West Coast tournaments this year, and got encouraging results. My next step was going to be to input stats from the other regional sites and then to work with upcoming Cardinal Classic/SCT/ACF Regs data; however, you may be right that it would be easier and better to look over the course of an entire competition year to see how teams' final ratings were affected by major regionals/nationals performances.
Will Run PACE for Reese's wrote:The reason why I ask is that the "difficulty" of the field and of the questions should somehow be taken into account when computing these data, and you need to show that your factors for difficulty have validity. Same thing with your experience factor... have you done a longitudinal survey of an individual who has participated for three years at ICT or SCT to come up with your coefficients there?
The difficulty of the field is automatically taken into account when computing "expected win shares". If you're a 1500 player teaming with a 2000 player against a bunch of 700 players, you're going to be expected to have a very high win share, but not as high as your teammate. Essentially, this compensates for both field strength and team strength. Also, this model adequately compensates for empty chairs - their "individual rating factor" is assumed to be zero since they contribute nothing (a zero-rated player is still better than an empty chair).

I believe that a player's ranking should depend more on his ability to score points and win games at the highest difficulty levels than at the lowest, and on high-quality questions rather than low-quality ones. Accordingly, I set the difficulty factors to 5 for high school, 10 for novice, 15 for standard, 20 for nationals, and 25 for masters (more difficult than nationals). So a given player's rating will always depend roughly twice as much on ICT/ACF Nats results as on ACF Fall results. These factors do not correlate perfectly with any direct difficulty measure (e.g. ACF Nats is not necessarily 200% as hard as EFT), but are deliberately set above such a measure to give "bonus points" to players who perform well on harder questions.

Once I settled on a difficulty factor, the next step was to provide a corresponding experience factor such that players who had participated in more tournaments would be less adversely affected than those who had played in fewer. For instance, Jerry performed quite poorly at the 2006 ACF Nats compared to what would be expected of Jerry; with a "less experienced" factor his rating would go down significantly, but with a "more experienced" factor it would go down only slightly. As I'm typing this, I think a fairer approach would be a chess-like "rating factor": once a player has proven he can perform at a high level, he should not be overly punished for a bad performance or overly rewarded for a dominant one. Since differences in win shares are often in the thousandths, the combined multiplicative factor would have to be somewhere in the neighborhood of the hundreds to ensure that ratings actually change from tournament to tournament. Currently the 20/10/5 scale is completely arbitrary, and I haven't input enough data to confirm that it works well.

I had nothing to do with the NAQT ratings, and since their link is apparently nonfunctional I can't see how they did theirs. I will note two important differences.

First, NAQT computes these rankings for teams, whereas I compute them for individuals. The reason for this is mostly that in high school, a team like "Thomas Jefferson A" is more or less fixed over the course of a season, while a team like "UCLA A" has used something like 12 different players over the 4 tournaments, and it would be unfair to compare the EFT UCLA A to the MLK UCLA A since they had no players in common. Thus, it makes sense to compute ratings for an individual who is likely to play on several different teams with several different teammates (and then compute the overall team rating from the individual ratings) rather than to combine a quite experienced team with a novice team to get a "UCLA A" rating.

Second, NAQT computes the results per game; I compute them per tournament. The reason for this is that the Hart Win Share Statistic, which I use, is most fair in determining a player's overall contribution to a team's winning percentage. In addition, the p-value takes into account the team's rating over the entire tournament. It seems far too time-consuming to comb through the "individual details" to find exactly which set of players played which game, who substituted in or left halfway through, etc. If the model can give a roughly accurate depiction of team strength (and I believe it can), there is no need to nitpick game-by-game lineups. I can see an argument for computing the ratings using Wins instead of Win Shares, but someone would have to convince me why it's more valid.
First Chairman
Auron
Posts: 3651
Joined: Sat Apr 19, 2003 8:21 pm
Location: Fairfax VA

Post by First Chairman »

Just for convenience, link to the page. But yes, they haven't fixed that "how we compute these ratings" link. Hopefully that will be fixed soon.

I worked with modeling and systems in a past life, so you've got to show me (proverbially speaking) that the model works well on a complete set of old data before you can show me that the data set you're currently compiling (and its analysis) is valid. Otherwise you get the stupid BCS "tweaking" issue where the formula changes year after year for the sake of convenience. (Then again, that's not the goal for this particular system.) There is now a lot of data floating around that one could use to test a rating system fairly well, or at least to give me some sense of what the ranking really means.

But it would be of interest to see how this could work. I also think that having some way to show how a player actually improves over time is very useful. (CUT-tournament qualification comes to mind.)

I agree with all your other points regarding switching rosters around.
Emil Thomas Chuck, Ph.D.
Founder, PACE
Facebook junkie and unofficial advisor to aspiring health professionals in quiz bowl
---
Pimping Green Tea Ginger Ale (Canada Dry)
cdcarter
Yuna
Posts: 945
Joined: Thu Nov 15, 2007 12:06 am
Location: Minneapolis, MN

Re: Dwight Introduces Yet Another QB Rating System

Post by cdcarter »

cvdwightw wrote: 2. Convert to an "individual rating factor" via the following formula: %TUH*4^((rating-1000)/500).

6. Multiply the theoretical winning percentage by the ratios %TUH/(%TUH,total,team) and RatingFactor/TeamRatingFactor to get a theoretical win share.
Forgive my ignorance, but I am a bit confused by these two steps, mainly by the way you present them. Does %TUH refer to PlayerTUH/TeamTUH or Player10/PlayerTUH? And what does (%TUH,total,team) mean? The commas are throwing me off.
theMoMA
Forums Staff: Administrator
Posts: 6001
Joined: Mon Oct 23, 2006 2:00 am

Post by theMoMA »

This seems really interesting. I'd like to see this done for all tournaments throughout the year, and I'd be willing to help input some if you send me the spreadsheet and walk me through how to do it.
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Re: Dwight Introduces Yet Another QB Rating System

Post by cvdwightw »

cdcarter wrote:
cvdwightw wrote: 2. Convert to an "individual rating factor" via the following formula: %TUH*4^((rating-1000)/500).

6. Multiply the theoretical winning percentage by the ratios %TUH/(%TUH,total,team) and RatingFactor/TeamRatingFactor to get a theoretical win share.
Forgive my ignorance, but I am a bit confused by these two steps, mainly by the way you present them. Does %TUH refer to PlayerTUH/TeamTUH or Player10/PlayerTUH? And what does (%TUH,total,team) mean? The commas are throwing me off.
In step 2, if a 1000-rated player heard 40 tossups and the team heard a total of 200 tossups (i.e. the player played 2 full games out of the team's 10), the individual rating factor would be (40/200)*4^((1000-1000)/500), or 0.2. In step 6, my formula is wrong: you simply multiply the theoretical winning percentage by the individual rating factor and divide by the team rating factor. So if your team's theoretical winning percentage was 0.5, your individual rating factor was 1, and the team rating factor was 4, your theoretical win share would be (0.5)*(1/4), or 0.125.
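In code form, those same two calculations look like this (the 0.5, 1, and 4 are the made-up numbers from above):

# Step 2: a 1000-rated player who heard 40 of the team's 200 tossups.
individual_rating_factor = (40 / 200) * 4 ** ((1000 - 1000) / 500)
print(individual_rating_factor)   # 0.2

# Step 6, corrected: theoretical winning percentage times the player's
# share of the team rating factor.
theoretical_win_share = 0.5 * (1 / 4)
print(theoretical_win_share)      # 0.125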
trphilli
Lulu
Posts: 37
Joined: Thu Feb 10, 2005 3:12 pm

Post by trphilli »

cvdwightw wrote: Also, this model adequately compensates for empty chairs - their "individual rating factor" is assumed to be zero since they contribute nothing (a zero-rated player is still better than an empty chair).
I'll play devil's advocate here. An empty chair doesn't incur negs or penalties, and it doesn't blurt out incorrect bonus answers.

While normally any help is better than no help, I've had experiences where an empty chair would have been preferable.
theMoMA
Forums Staff: Administrator
Posts: 6001
Joined: Mon Oct 23, 2006 2:00 am

Post by theMoMA »

trphilli wrote:
cvdwightw wrote: Also, this model adequately compensates for empty chairs - their "individual rating factor" is assumed to be zero since they contribute nothing (a zero-rated player is still better than an empty chair).
I'll play devil's advocate here. An empty chair doesn't incur negs or penalties, and it doesn't blurt out incorrect bonus answers.

While normally any help is better than no help, I've had experiences where an empty chair would have been preferable.
There is actually an extensive correction to the TPC formula that normalizes for players who neg and crowd out their teammates. So there are certainly situations in which a player is worse than an empty chair, but those situations are handled in the formula.

The way this works is that the win share formula calculates a player's positive and negative contributions to the team and adds them together to get the player's total contribution.

The positive contribution is given as:

(10 * [Tossups correct]) + ([Tossups correct] * [Team bonus conversion]).

The negative contribution (in simplistic form; the real formula accounts for players who didn't play every tossup) is broken down into two parts:

-5 * [Tossups negged]

This is just the raw point value of all the player's neg fives. This is added to:

(([Team tossups correct] - [Player tossups correct]) / ([Player tossups heard] - [Player tossups correct])) * ([Player negs] * [Team bonus conversion])

This is the part that tries to correct for the further negative value of negs. The first part of the formula is the number of tossups answered correctly by the team but not by the player in question, divided by the number of tossups the player heard but did not answer correctly. This is the formula's way of estimating what chance the team had of answering the question correctly had the player not negged. This is then multiplied by the number of the player's negs to estimate how many of those negs would have been converted by the rest of the team had the player in question not buzzed, and that is multiplied by team bonus conversion to get the total estimated points lost.
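Here is a rough sketch of those simplified formulas in code (it ignores the full correction for players who sat out tossups, and the numbers in the example line are made up):

def total_contribution(player_tens, player_negs, player_tuh, team_tens, bonus_conversion):
    # Positive contribution: the player's tossup points plus the bonus points
    # those tossups earned at the team's average bonus conversion.
    positive = 10 * player_tens + player_tens * bonus_conversion
    # Negative contribution, part one: the raw -5s.
    negative = -5 * player_negs
    # Part two: the chance the rest of the team would have converted the
    # tossup had the player not negged, times the negs, times bonus
    # conversion, gives the estimated bonus points lost.
    team_pickup = (team_tens - player_tens) / (player_tuh - player_tens)
    negative -= team_pickup * player_negs * bonus_conversion
    return positive + negative

# Made-up line: 25 tossups, 6 negs, 180 heard, 45 team tossups, bonus conversion 14.
print(total_contribution(25, 6, 180, 45, 14))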

See this thread for a more detailed explanation. Hope this helps.

Andrew
rylltraka
Tidus
Posts: 568
Joined: Mon Oct 31, 2005 10:55 pm
Location: Los Angeles

Post by rylltraka »

I'm not normally a stats junkie, but since you did a pilot study of West Coast tournaments, any way you could make that available?
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Post by cvdwightw »

Mik,

When Andrew and I went back and looked at the statistic, we noticed a couple of issues, which I've since corrected. Here's the top of the ratings, starting everyone with a rating of 1000.

Player     Team      #tournaments  Rating
Dwight     UCI       4             1412
Ray        UCLA      3             1181
Andrew     Stanford  1             1111
Yogesh     Berkeley  3             1106
Seth       Chicago   1             1102
Cliff      UCLA      3             1089
Steve      UCLA      3             1081
Paul       Berkeley  1             1068
Ray        UCI       3             1063
Brian      Stanford  2             1062
Jeff       Berkeley  1             1041
Mik        USC       2             1040
Juliana    Berkeley  1             1038
Kristiaan  Stanford                1030
Justin     Stanford                1022

So, although there are some problems with using a limited data set (there's no way I'm 300 points better than Andrew or Seth), you can see most of the usual suspects for the best West Coast players starting to break away.

If you're interested in the stats beyond this I can send you the file.
salamanca
Lulu
Posts: 65
Joined: Tue May 17, 2005 1:00 pm

Hmmm

Post by salamanca »

Congratulations, you have just proven how pointless this statistic is.

Ezequiel
Captain Sinico
Auron
Posts: 2675
Joined: Sun Sep 21, 2003 1:46 pm
Location: Champaign, Illinois

Post by Captain Sinico »

Laughing on line.

MaS
Mike Sorice
Former Coach, Centennial High School of Champaign, IL (2014-2020) & Team Illinois (2016-2018)
Alumnus, Illinois ABT (2000-2002; 2003-2009) & Fenwick Scholastic Bowl (1999-2000)
Member, ACF (Emeritus), IHSSBCA, & PACE
AKKOLADE
Sin
Posts: 15790
Joined: Thu Apr 24, 2003 8:08 am

Post by AKKOLADE »

A sample size of, at most, four isn't exactly grounds to either endorse or discredit any statistic. The reasoning seems within the realm of sane; I'd like to see what this can do on a notably higher number of tournaments before a decision is rendered.
Fred Morlan
University of Kentucky CoP, 2017
International Quiz Bowl Tournaments, CEO, co-owner
former PACE member, president, etc.
former hsqbrank manager, former NAQT writer & subject editor, former hsqb Administrator/Chief Administrator