S-value revision

Important Bird Area
Forums Staff: Administrator
Posts: 6112
Joined: Thu Aug 28, 2003 3:33 pm
Location: San Francisco Bay Area

S-value revision

Post by Important Bird Area »

naqt.com wrote:During Summer 2009, NAQT plans to revise the S-value formula it uses to rank teams based on their performance at SCTs and thereby determine the majority of the invitations for the ICT.

The primary goal is to come up with a new system that is not vulnerable to unsportsmanlike manipulation and which can, therefore, be made public. A secondary goal is improving the rankings produced by the system.

NAQT hopes to work with members of the quiz bowl community during the summer to devise the new system.
Full details

Just starting the process now, with a call for volunteers to join a group I'm assembling to draft a new S-value. Qualifications: 1) you have played or will play SCT; 2) you like tinkering with formulas for evaluating quizbowl teams. Email me if you're interested. The expected timeline is that we'll generate a draft and present it for public discussion in early September.
Jeff Hoppes
President, Northern California Quiz Bowl Alliance
former HSQB Chief Admin (2012-13)
VP for Communication and history subject editor, NAQT
Editor emeritus, ACF

"I wish to make some kind of joke about Jeff's love of birds, but I always fear he'll turn them on me Hitchcock-style." -Fred
theMoMA
Forums Staff: Administrator
Posts: 5993
Joined: Mon Oct 23, 2006 2:00 am

Re: S-value revision

Post by theMoMA »

Sorry if you didn't want responses in this thread, but I'm going to respond anyway. Here is the simplest, easiest-to-customize S-value that I can come up with. It's sort of like OPS+.

1. Take whatever factors you want (personally, I'd stick to just PPG and PPB, but you can add in powers or win percentage or any other factors you feel are valuable).

2. Find the average across all SCT sites (you'd do the DI and DII fields separately, of course) of all the factors you want (I would not include combined SCT sites in this average).

3. Find the number of standard deviations the team in question is from each average.

4. Find the "park factor" for context-dependent stats like PPG or powers. This adjusts for the fact that it's harder to score points at stronger SCTs than at weaker ones. Bonus conversion should be relatively stable regardless of the competition, and needs no adjustment. The "park factor" for any context-dependent stat would be calculated by taking the average of a stat [say PPG] for each individual SCT, and then dividing by the national average.

5. Multiply the standard deviations for any context-dependent stats by the "park factor."

6. Combine the standard deviations somehow: either add them, average them, or weight them in any way you choose. Obviously, the first two options are equivalent, just on different scales; the third is useful if you want to include stats like powers or win percentage with reduced importance. You could also multiply the result by a constant to give it a more intuitive scale. I'd prefer to average them (or take a weighted average, as the case may be) and multiply by 100, since that would give you a roughly three-digit output that basically tells you how well you performed against the average team, adjusted for strength of field.
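Purely as an illustration, the six steps above might be sketched like this (the field data, the choice of PPG and PPB as the step-1 factors, and the equal weights are all hypothetical placeholders):

```python
from statistics import mean, stdev

# Hypothetical SCT field: each team has a site, PPG, and PPB (step 1 factors)
teams = [
    {"site": "West",  "ppg": 380.0, "ppb": 18.2},
    {"site": "West",  "ppg": 250.0, "ppb": 14.1},
    {"site": "North", "ppg": 310.0, "ppb": 16.5},
    {"site": "North", "ppg": 180.0, "ppb": 11.0},
]

def s_value(team, field, weights=None):
    weights = weights or {"ppg": 1.0, "ppb": 1.0}
    total = 0.0
    for stat, w in weights.items():
        vals = [t[stat] for t in field]                  # step 2: national pool
        z = (team[stat] - mean(vals)) / stdev(vals)      # step 3: std deviations
        if stat == "ppg":                                # context-dependent stat
            site = [t[stat] for t in field if t["site"] == team["site"]]
            z *= mean(site) / mean(vals)                 # steps 4-5: "park factor"
        total += w * z
    return 100.0 * total / sum(weights.values())         # step 6: weighted avg x 100
```

On this toy field the top West team comes out around +118 and the bottom North team well below -100, i.e. a scale centered on zero (add 100 if you want it centered on 100).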

The benefits to this are that the numbers are directly comparable, and easily and fairly take into account the strength of field. They also tell you exactly where you stand in terms of the national average. The main downside is that combined SCTs, especially ones played on the DII set, would be very hard to judge. I'm assuming that these are already very hard to compare to other SCTs, and my solution would be to call for eliminating the combined SCT played on a DII set, or if that's not feasible, to evaluate those on a case-by-case basis. It might also be harsh for teams to see results in the negative numbers, which could be alleviated by adding 100 to the final number (centering the scale at 100, though the worst teams might still have a negative value).

I'd be interested to hear what Jeff or anyone else thinks about this.
Andrew Hart
Minnesota alum
Important Bird Area
Forums Staff: Administrator
Posts: 6112
Joined: Thu Aug 28, 2003 3:33 pm
Location: San Francisco Bay Area

Re: S-value revision

Post by Important Bird Area »

theMoMA wrote:Sorry if you didn't want responses in this thread, but I'm going to anyway.
Here is fine; we'll certainly take this into consideration.
Jeff Hoppes
President, Northern California Quiz Bowl Alliance
former HSQB Chief Admin (2012-13)
VP for Communication and history subject editor, NAQT
Editor emeritus, ACF

"I wish to make some kind of joke about Jeff's love of birds, but I always fear he'll turn them on me Hitchcock-style." -Fred
Mechanical Beasts
Banned Cheater
Posts: 5673
Joined: Thu Jun 08, 2006 10:50 pm

Re: S-value revision

Post by Mechanical Beasts »

A system of the above kind looks best, and I can't imagine a better version of steps 2 onward. The fun will come in wrangling over the nature of step 1.
Andrew Watkins
setht
Auron
Posts: 1205
Joined: Mon Oct 18, 2004 2:41 pm
Location: Columbus, Ohio

Re: S-value revision

Post by setht »

Andrew's data-processing scheme looks reasonable to me. I think the real challenge will be coming up with raw data that can be published without encouraging degenerate behavior during matches--throwing games, not answering tossups to avoid getting bonuses or for whatever reason, feeding answers to the other team on their bonuses, etc. One of the appealing features of Andrew's proposal is that it offers the possibility of multiple "park factors" for different statistics, rather than producing a single "strength-of-schedule" adjustment based on a combination of opponent statistics. I think this is appealing because it means that some statistics (most notably bonus conversion; that may be it, or there may be others that I'm not thinking of) don't need any adjustment at all and shouldn't encourage bad gamesmanship.

Win-loss record should be safe--any attempt to inflate an opponent's strength (or the local "park factor") by helping them beat your team will presumably hurt your team more than it helps.

Including PPG would encourage teams to answer tossups and bonuses, but it could also encourage teams to give answers to the other team after negging or on their bonuses. One option here might be to calculate the PPG "park factor" using opponents' PPG from matches not involving a given team--so if one team tries to inflate its opponents' PPG, it won't help them at all.

Bonus conversion should be safe if there's no "park factor" for BC--then there's no incentive for one team to feed bonus answers to its opponents. If for some reason it's decided that there should be a "park factor" for all statistics, including BC, then I guess BC could be adjusted in the same way as PPG--the contribution of opponent BC to the "park factor" would be based only on opponent BC in games not involving a given team.
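The leave-one-out idea for opponent stats could be sketched as follows (the game-log format here is invented for illustration):

```python
# Hypothetical game log: (team_a, team_b, points_a, points_b)
games = [
    ("X", "Y", 300, 200),
    ("Y", "Z", 250, 150),
    ("X", "Z", 310, 100),
]

def opponent_ppg_excluding(team, opponent, games):
    """Average PPG of `opponent`, using only games that do not involve `team`,
    so `team` cannot inflate the strength-of-schedule adjustment it receives."""
    pts = [pa if a == opponent else pb
           for (a, b, pa, pb) in games
           if team not in (a, b) and opponent in (a, b)]
    return sum(pts) / len(pts) if pts else None
```

Here Y's PPG as seen from X's schedule is computed only from the Y-Z game, so the X-Y result (which X could manipulate) never enters X's own adjustment.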

-Seth
Seth Teitler
Formerly UC Berkeley and U. Chicago
President of NAQT
Emeritus member of ACF
Theory Of The Leisure Flask
Yuna
Posts: 761
Joined: Fri Aug 29, 2003 11:04 am
Location: Brooklyn

Re: S-value revision

Post by Theory Of The Leisure Flask »

Andrew's framework is generally good; I imagine NAQT will end up including win percentage, and I'm personally a fan of including powers in the system. A few ideas:

1. PPTH is probably a better metric than PPG, because a team's PPG can be dragged down by slow moderators in a timed format.

2. How to deal with D1/D2 comparability.

There are probably some technical issues with this idea, especially for bonuses that are alike except for an easier part in D2 swapped in for a harder part in D1:
a) Assemble a dedicated playtest group with a balanced knowledge base, who didn't write or edit anything for the SCT.
b) Have them play all the bonus parts in both sets, D1 and D2.
c) Compare the conversion on D1 questions to the conversion on D2 questions, and use the disparity to create a factor by which PPB (perhaps also PPTH?) can be adjusted.
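A minimal sketch of step (c), with made-up playtest numbers: the disparity between the group's conversion on the two sets becomes a single scaling factor.

```python
# Hypothetical playtest results from the dedicated group (steps a-b)
playtest_ppb_d1 = 14.8   # group's conversion on the D1 set
playtest_ppb_d2 = 18.6   # group's conversion on the (easier) D2 set

# Step c: the disparity becomes an adjustment factor
difficulty_factor = playtest_ppb_d1 / playtest_ppb_d2

def d2_ppb_on_d1_scale(ppb):
    """Deflate a PPB earned on D2 questions onto the harder D1 scale."""
    return ppb * difficulty_factor
```

The same factor could in principle be applied to PPTH, though tossup conversion depends on buzzer races in a way bonus conversion does not.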

3. How to deal with (some instances of) degenerate behavior.

A team's "sectional" consists of the stats of the teams they've played. If they forfeit a game, they don't get the benefit (or detriment) of their opponent's statistics; it only counts as an additional L in the "win percentage" column. Trickier is a situation where a team decides to stall as much as possible and keep the "tossups heard" number low when playing a good team, so that the game doesn't affect its PPTH as much. In many cases this is not degenerate behavior at all but simply good play: fewer tossups means more variability, and a better chance of pulling the upset. But it may turn into poor strategy/degenerate "gaming" once a team is far behind and needs to catch up. If NAQT is worried about this, a solution could be to calculate PPTH on a per-game basis and use that (PPTHPG) number instead.

Feeding opponents their bonus answers is something that should just be warned about as being against the rules, and against the spirit of the game.
Chris White
Bloomfield HS (New Jersey) '01, Swarthmore College '05, University of Pennsylvania '10. Still writes questions occasionally.
Maxwell Sniffingwell
Auron
Posts: 2164
Joined: Sun Feb 12, 2006 3:22 pm
Location: Des Moines, IA

Re: S-value revision

Post by Maxwell Sniffingwell »

For another thing, feeding the other team bonus answers isn't necessarily "degenerate behavior" under NAQT's definition.
naqt.com wrote:"Degenerate behavior" is any behavior that intentionally reduces the chance of a team winning an individual game.
Once you have a lock game, degenerate behavior no longer exists? I feel like this definition needs to be fixed.
Greg Peterson

Northwestern University '18
Lawrence University '11
Maine South HS '07

"a decent player" - Mike Cheyne
grapesmoker
Sin
Posts: 6345
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC

Re: S-value revision

Post by grapesmoker »

cornfused wrote:For another thing, feeding the other team bonus answers isn't necessarily "degenerate behavior" under NAQT's definition.
naqt.com wrote:"Degenerate behavior" is any behavior that intentionally reduces the chance of a team winning an individual game.
Once you have a lock game, degenerate behavior no longer exists? I feel like this definition needs to be fixed.
Perhaps it should say something like "maximize number of points scored," since that would subsume the definition of winning under itself and eliminate the ambiguity.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuisance
jonah
Auron
Posts: 2383
Joined: Thu Jul 20, 2006 5:51 pm
Location: Chicago

Re: S-value revision

Post by jonah »

grapesmoker wrote:
cornfused wrote:For another thing, feeding the other team bonus answers isn't necessarily "degenerate behavior" under NAQT's definition.
naqt.com wrote:"Degenerate behavior" is any behavior that intentionally reduces the chance of a team winning an individual game.
Once you have a lock game, degenerate behavior no longer exists? I feel like this definition needs to be fixed.
Perhaps it should say something like "maximize number of points scored," since that would subsume the definition of winning under itself and eliminate the ambiguity.
How about "maximize the margin of victory", so as to exclude feeding the other team bonus answers?
Jonah Greenthal
National Academic Quiz Tournaments
grapesmoker
Sin
Posts: 6345
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC

Re: S-value revision

Post by grapesmoker »

jonah wrote:How about "maximize the margin of victory", so as to exclude feeding the other team bonus answers?
Even better.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuisance
Important Bird Area
Forums Staff: Administrator
Posts: 6112
Joined: Thu Aug 28, 2003 3:33 pm
Location: San Francisco Bay Area

Re: S-value revision

Post by Important Bird Area »

jonah wrote:How about "maximize the margin of victory", so as to exclude feeding the other team bonus answers?
That sounds good, since this is one of the degenerate behaviors we're worried about.

Although I think a good solution for that would be something like what Seth suggests above: measuring an opponent's strength in games not involving team X. (So any degenerate behavior would have to involve collusion by multiple teams to fix future matches with unknown outcomes, and would thus have the direct drawback of causing teams to lose.)
Jeff Hoppes
President, Northern California Quiz Bowl Alliance
former HSQB Chief Admin (2012-13)
VP for Communication and history subject editor, NAQT
Editor emeritus, ACF

"I wish to make some kind of joke about Jeff's love of birds, but I always fear he'll turn them on me Hitchcock-style." -Fred
dxdtdemon
Rikku
Posts: 391
Joined: Sat Mar 03, 2007 4:46 pm
Location: Beavercreek, OH

Re: S-value revision

Post by dxdtdemon »

Also, for "maximizing the margin of victory", it should be margin per 20TUH so that teams who are really blowing out someone shouldn't have to really pile it on the other team by bonus blitzing and stuff like that to speed up the game.
Jonathan Graham
Beavercreek HS 1999-2003, Ohio State 2003-2007, Wright State (possibly playing) 2012-2015
moderator/scorekeeper at some tournaments in Ohio, and sometimes elsewhere
"Ohio has a somewhat fractured quizbowl circuit, with a few small pockets of intense competition (like in Mahoning County) and with the rest scattered around the state."-Chris Chiego
jonpin
Auron
Posts: 2266
Joined: Wed Feb 04, 2004 6:45 pm
Location: BCA NJ / WUSTL MO / Hackensack NJ

Re: S-value revision

Post by jonpin »

quantumfootball wrote:Also, for "maximizing the margin of victory", it should be margin per 20TUH so that teams who are really blowing out someone shouldn't have to really pile it on the other team by bonus blitzing and stuff like that to speed up the game.
Failure to blitz on a bonus does not intentionally and conclusively reduce a margin of victory, because the tossup you don't get to by not blitzing may be your own.
Jon Pinyan
Coach, Bergen County Academies (NJ); former player for BCA (2000-03) and WUSTL (2003-07)
HSQB forum mod, PACE member
Stat director for: NSC '13-'15, '17; ACF '14, '17, '19; NHBB '13-'15; NASAT '11

"A [...] wizard who controls the weather" - Jerry Vinokurov
grapesmoker
Sin
Posts: 6345
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC

Re: S-value revision

Post by grapesmoker »

quantumfootball wrote:Also, for "maximizing the margin of victory", it should be margin per 20TUH so that teams who are really blowing out someone shouldn't have to really pile it on the other team by bonus blitzing and stuff like that to speed up the game.
Nobody has to do that and if you think there's anything that's going to stop a team from buzzing or compel them to buzz other than the desire to answer questions, you're being silly.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuisance
setht
Auron
Posts: 1205
Joined: Mon Oct 18, 2004 2:41 pm
Location: Columbus, Ohio

Re: S-value revision

Post by setht »

grapesmoker wrote:
quantumfootball wrote:Also, for "maximizing the margin of victory", it should be margin per 20TUH so that teams who are really blowing out someone shouldn't have to really pile it on the other team by bonus blitzing and stuff like that to speed up the game.
Nobody has to do that and if you think there's anything that's going to stop a team from buzzing or compel them to buzz other than the desire to answer questions, you're being silly.
I think normalizing stats like PPG or margin of victory by tossups heard rather than games played is a good idea, mostly because it makes for fairer comparisons between different SCTs that may have faster/slower groups of readers.

-Seth
Seth Teitler
Formerly UC Berkeley and U. Chicago
President of NAQT
Emeritus member of ACF
Important Bird Area
Forums Staff: Administrator
Posts: 6112
Joined: Thu Aug 28, 2003 3:33 pm
Location: San Francisco Bay Area

Re: S-value revision

Post by Important Bird Area »

Among my notes is the assumption that all tossup-dependent stats will be rated on tossups heard rather than games played.
Jeff Hoppes
President, Northern California Quiz Bowl Alliance
former HSQB Chief Admin (2012-13)
VP for Communication and history subject editor, NAQT
Editor emeritus, ACF

"I wish to make some kind of joke about Jeff's love of birds, but I always fear he'll turn them on me Hitchcock-style." -Fred
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Re: S-value revision

Post by cvdwightw »

I have been working with a system fairly similar to Andrew's proposed system, though I think mine is better in some respects and his is better in others.

In brief, some differences:

--My system accounts for forfeits as follows: the forfeiting team earns a score of 0 on 20 TUH after all opponent-dependent calculations have been made (so as not to penalize other teams for the forfeit); the winner-by-forfeit is unaffected.

--I have looked at trying to model the "national average" as both a normal distribution and a gamma distribution; some data sets and statistics fit either model better than others.

--I separate tossup points and bonus points and compute corrections differently. For tossup-dependent statistics I am currently using a PATH-like stat (e.g. % of powers your opponents did not get). This is not always working the way I want it to; I may consider switching to Andrew's "park factor" stat instead. Bonuses I use either straight or as a percentage per bonus (PPB/30*100). In particular, I never use points per anything, only tossup and bonus points per tossup/bonus heard.

--I have been working with several different weighted averages to attempt to "retroactively predict" previous bids as accurately as possible.

--I have collected and used raw data from the 2008 and 2009 SCTs, D1-only fields only.

--My proposal for combined fields is to use a "subjective difficulty rating" (SDR) by which bonus questions unchanged from D1 to D2 earn an SDR of 1, questions changed from D1 to D2 earn an SDR of 0.75, and questions new to D2 earn an SDR of 0.5 (to roughly correspond to the NAQT difficulty scale; note that the NAQT difficulty scale may be logarithmic, in which case these SDRs would need to change). Since NAQT has exact data on what was changed between the two sets, it should be able to assign an SDR over the entire set. D2 bonus conversions for D1 teams are then multiplied by the SDR (for D2 teams on D1 questions, divide by the SDR to estimate what they likely would have earned on D2 questions). Tossup questions and diluted fields are a whole different story, and I am not sure that similar pre-calculation SDR corrections (multiplying powers and tossup points by the SDR) are adequate.
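A minimal sketch of the SDR idea (the 1 / 0.75 / 0.5 values are from the proposal above; the edit-status labels and the example set are invented):

```python
# Proposed ratings; the keys are hypothetical edit-status labels
SDR = {"unchanged": 1.0, "changed": 0.75, "new_to_d2": 0.5}

def set_sdr(statuses):
    """Average SDR over a whole D2 set, given each bonus's edit status."""
    return sum(SDR[s] for s in statuses) / len(statuses)

def d1_team_on_d2(ppb, sdr):
    return ppb * sdr          # D1 team on D2 questions: deflate conversion

def d2_team_on_d1(ppb, sdr):
    return ppb / sdr          # D2 team on D1 questions: inflate conversion
```

Since NAQT knows exactly which bonuses were edited between the sets, `set_sdr` could in principle be computed from the editing records rather than guessed.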
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Re: S-value revision

Post by cvdwightw »

After thinking about it critically, here is my issue with the "maximize margin of victory" idea:

This should give teams an incentive to win by as much as possible--e.g., by not letting the other team have tossups and not giving them bonus parts. Conversely, it also gives teams an incentive to lose by as little as possible. As these factors are currently set up, teams are rewarded for not telling the other teams answers; however, average-to-poor teams are also rewarded for forfeiting against really good teams (they get a loss and a victory margin of 0, as opposed to a loss and a victory margin of somewhere between -200 and -600).

One early correction I tried was along the Gaddis lines: in the S-value calculations, you earn 25% of the points earned by your opponents. This avoids the forfeit problem (a forfeiting team now has a total of 0 instead of a positive score for that round) and the problem of "just letting the other team have the tossup" (since the points you score on a tossup/bonus cycle are always at least as many as the points you'd get through the other team), but it reintroduces the problem of telling the other team the answer after a neg or feeding them a bonus answer, and it potentially penalizes or rewards teams for not having an opponent to play.

It is clear that there is a tradeoff at each extreme.
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Re: S-value revision

Post by cvdwightw »

I thought I'd point out that for the D1-field-only data I have for 2008 and 2009, both TPTH (R^2 values of 0.69-0.96) and Bonus Conversion (R^2 values of 0.58-0.98) correlate quite well with within-sectional order of finish (defined as % of teams in same sectional that a given team finished ahead of).

What this means is that we can use the slope and intercept for each separate sectional to determine where in a given sectional a team from a different sectional is likely to have finished.

The intercept of the regression line indicates how bad the worst team in the sectional is likely to be. The slope indicates the variability of the field. For instance, the 2008 Lake Wobegon West Sectional BC regression line had an intercept of 13.96 but a slope of only 6.25, indicating that all the teams were somewhat evenly matched. The 2008 Central Sectional BC regression line had an intercept of 5.11 but a slope of 11.66, indicating that there was high variability in ability levels at that sectional.

I'm not sure what value the regression equation has yet, but it seems to be a better indicator of overall field strength than a simple average. We can also find the intersection of two regression lines; a team scoring at the intersection would place ahead of the same % of teams in either sectional. Any ideas for how to use these TPTH and BC regression-line tools?
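For concreteness, here is one way the regression-line machinery could look (pure Python, no real data; ordinary least squares with the stat as y and the within-sectional finish fraction as x, matching the intercept/slope reading above):

```python
def fit_line(finish_fracs, stat_vals):
    """Least-squares fit of a stat (e.g. BC) on finish fraction; returns (slope, intercept).
    The intercept estimates the worst team's stat; the slope, the field's spread."""
    n = len(finish_fracs)
    mx = sum(finish_fracs) / n
    my = sum(stat_vals) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(finish_fracs, stat_vals))
             / sum((x - mx) ** 2 for x in finish_fracs))
    return slope, my - slope * mx

def intersection(line_a, line_b):
    """Point where two sectionals' lines cross: a team with that stat value would
    finish ahead of the same fraction of the field in either sectional."""
    (sa, ia), (sb, ib) = line_a, line_b
    x = (ib - ia) / (sa - sb)
    return x, sa * x + ia
```

Feeding in two sectionals' (slope, intercept) pairs, such as the Lake Wobegon West and Central examples above, gives the finish fraction and stat value at which the two fields rate a team identically.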

Relevant calculations can be found here. Note that I just took the columns from the relevant (much larger) Excel files.
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
cdcarter
Yuna
Posts: 945
Joined: Thu Nov 15, 2007 12:06 am
Location: Minneapolis, MN

Re: S-value revision

Post by cdcarter »

cvdwightw wrote:I thought I'd point out that for the D1-field-only data I have for 2008 and 2009, both TPTH (R^2 values of 0.69-0.96) and Bonus Conversion (R^2 values of 0.58-0.98) correlate quite well with within-sectional order of finish (defined as % of teams in same sectional that a given team finished ahead of).
Unless you can come up with a regression method that regularly accounts for 90% of the variation, I don't think this will be useful for a purely computational S-value, though it could be interesting.
Christian Carter
Minneapolis South High School '09 | Emerson College '13
PACE Member (retired)
Stained Diviner
Auron
Posts: 5085
Joined: Sun Jun 13, 2004 6:08 am
Location: Chicagoland

Re: S-value revision

Post by Stained Diviner »

Chris is wrong. The S-value will be the best system the group can come up with, whether its reliability is above or below an arbitrary standard.

To get back to what Dwight was saying, keep in mind that PPB is much simpler than TPTH, because only the latter is site-dependent: a team with 18 PPB performed better on bonuses than a team with 17 PPB.

If you are going to use TPTH, you have to be able to answer the following question: If one team has .6 TPTH at a site where the average team has .4 TPTH and another team has .65 TPTH at a site where the average team has .3 TPTH, then which team performed better on tossups? I think the best way to answer this question is to use data from the past two or three years comparing how well teams did at their Sectional and how strong their Sectional was to how well those teams fared at Nationals. I have never used the statistical techniques myself, but I am pretty sure that there are standard processes to answer such questions (and I don't think that multiplying the standard deviation times a park factor is it). The goal would be to have some formula where you could put in the team TPTHs and site TPTHs and end up with some sort of rating. (It's possible that you could get better results using PATH-type statistics or including power conversion some other way.) I think it's obvious that the team TPTH would weigh more heavily than the site TPTH, but I would leave up to somebody who understands statistics and crunches the numbers to determine the best way to relate the numbers.
David Reinstein
Head Writer and Editor for Scobol Solo, Masonics, and IESA; TD for Scobol Solo and Reinstein Varsity; IHSSBCA Board Member; IHSSBCA Chair (2004-2014); PACE President (2016-2018)
dtaylor4
Auron
Posts: 3733
Joined: Tue Nov 16, 2004 11:43 am

Re: S-value revision

Post by dtaylor4 »

Shcool wrote:If you are going to use TPTH, you have to be able to answer the following question: If one team has .6 TPTH at a site where the average team has .4 TPTH and another team has .65 TPTH at a site where the average team has .3 TPTH, then which team performed better on tossups? I think the best way to answer this question is to use data from the past two or three years comparing how well teams did at their Sectional and how strong their Sectional was to how well those teams fared at Nationals. I have never used the statistical techniques myself, but I am pretty sure that there are standard processes to answer such questions (and I don't think that multiplying the standard deviation times a park factor is it). The goal would be to have some formula where you could put in the team TPTHs and site TPTHs and end up with some sort of rating. (It's possible that you could get better results using PATH-type statistics or including power conversion some other way.) I think it's obvious that the team TPTH would weigh more heavily than the site TPTH, but I would leave up to somebody who understands statistics and crunches the numbers to determine the best way to relate the numbers.
The problem with using historical data is that you have to assume that teams don't change, which is false the vast majority of the time.
Important Bird Area
Forums Staff: Administrator
Posts: 6112
Joined: Thu Aug 28, 2003 3:33 pm
Location: San Francisco Bay Area

Re: S-value revision

Post by Important Bird Area »

dtaylor4 wrote:The problem with using historical data is that you have to assume that teams don't change, which is false the vast majority of the time.
There's enough continuity between SCT and ICT rosters that we could filter the data to account for that and probably still have something useful left over.
Jeff Hoppes
President, Northern California Quiz Bowl Alliance
former HSQB Chief Admin (2012-13)
VP for Communication and history subject editor, NAQT
Editor emeritus, ACF

"I wish to make some kind of joke about Jeff's love of birds, but I always fear he'll turn them on me Hitchcock-style." -Fred
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA

Re: S-value revision

Post by cvdwightw »

Shcool wrote:If you are going to use TPTH, you have to be able to answer the following question: If one team has .6 TPTH at a site where the average team has .4 TPTH and another team has .65 TPTH at a site where the average team has .3 TPTH, then which team performed better on tossups?
My argument is that the regression line is a better indicator of site strength than just the average. If you want the average, it corresponds pretty well to the 50% point of the regression line (accurate to two decimal places). The regression line gives you the average as well as the expected value for the best and worst teams in the sectional, and all points in between (obviously, since it's a regression line, they won't exactly correspond).
I think the best way to answer this question is to use data from the past two or three years comparing how well teams did at their Sectional and how strong their Sectional was to how well those teams fared at Nationals. I have never used the statistical techniques myself, but I am pretty sure that there are standard processes to answer such questions (and I don't think that multiplying the standard deviation times a park factor is it). The goal would be to have some formula where you could put in the team TPTHs and site TPTHs and end up with some sort of rating. (It's possible that you could get better results using PATH-type statistics or including power conversion some other way.) I think it's obvious that the team TPTH would weigh more heavily than the site TPTH, but I would leave up to somebody who understands statistics and crunches the numbers to determine the best way to relate the numbers.
This is misleading because of three gigantic confounding variables before we even get into the "teams aren't the same" argument: 1) number of teams that decline bids (and their corresponding expected placement); 2) proportion (and strength) of teams at the sectional that are not offered bids; 3) strength of host teams that did not compete on the SCT question set.

As a completely hypothetical situation, let's put together a four-team sectional where three of the teams average around 5 TPTH and the other is at 1 TPTH--in other words, the 2009 West Sectional if every team were a little weaker. The average here would be 4 TPTH. At another sectional, roughly half the teams are around or slightly above 5 TPTH and the other half average around 3 TPTH; the average here would again be around 4 TPTH. This is roughly like the 2008 North Sectional. Because the bottom of the nationals field is often diluted, we can reasonably expect that a couple of teams in the bottom half of the second sectional might come to ICT; however, it is extremely unlikely that the 1 TPTH team would be invited. Thus, the second sectional (with more teams and a more competitive bottom end) would likely perform worse, on average, at ICT, even though its average TPTH is the same.

As a different hypothetical situation, let's say that every team above 3.5 TPTH gets invited to ICT. Two sectionals each have five teams scoring ~6, ~5, ~4, ~3, and ~1.5 TPTH, in order of finish. Sectional A has its winner decline the autobid; Sectional B has its third-place team decline. Both sectionals have an identical average, but Sectional A will, on average, perform worse than Sectional B. The regression line takes this into account (because the third-place teams should place roughly identically), whereas a simple average does not.
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
User avatar
Stained Diviner
Auron
Posts: 5085
Joined: Sun Jun 13, 2004 6:08 am
Location: Chicagoland
Contact:

Re: S-value revision

Post by Stained Diviner »

I'm not saying that you should follow the results of each Sectional but that you should follow the results of each team. Each team has two independent values, its TPTH at Sectionals and its site's TPTH, and one dependent value, either its TPTH at Nationals or some other measure of its performance at Nationals. The performance of a Sectional as a whole at Nationals does not get taken into account, so the fact that many Sectional teams don't compete at Nationals and vice versa is not important.
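In code, this per-team regression might be sketched like so. This is a hedged sketch: all the numbers are fabricated (and exactly linear, so the recovered coefficients are easy to check); real SCT/ICT data would of course be noisy, and the variable names are mine.

```python
# Sketch of the per-team regression described above: predict each team's
# ICT performance from two independent values, its own SCT TPTH and its
# site's average TPTH. All numbers are fabricated and exactly linear
# (ict = 0.1 + 0.7*team + 0.2*site) so the fit is easy to verify.
import numpy as np

team_tpth = np.array([5.8, 4.9, 4.1, 3.6, 5.2, 4.4])       # team TPTH at SCT
site_tpth = np.array([3.9, 3.9, 3.4, 3.4, 4.2, 4.2])       # site average TPTH
ict_tpth = np.array([4.94, 4.31, 3.65, 3.30, 4.58, 4.02])  # TPTH at ICT

# Least-squares fit: ict_tpth ~ b0 + b1*team_tpth + b2*site_tpth
X = np.column_stack([np.ones_like(team_tpth), team_tpth, site_tpth])
(b0, b1, b2), *_ = np.linalg.lstsq(X, ict_tpth, rcond=None)

def rating(team, site):
    """Predicted ICT TPTH for a team, given its TPTH and its site's TPTH."""
    return b0 + b1 * team + b2 * site
```

On real data the interesting question is exactly the one raised above: how much larger b1 (the team's own weight) comes out than b2 (the site's weight). Here it is larger by construction.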
David Reinstein
Head Writer and Editor for Scobol Solo, Masonics, and IESA; TD for Scobol Solo and Reinstein Varsity; IHSSBCA Board Member; IHSSBCA Chair (2004-2014); PACE President (2016-2018)
User avatar
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA
Contact:

Re: S-value revision

Post by cvdwightw »

Shcool wrote:I'm not saying that you should follow the results of each Sectional but that you should follow the results of each team. Each team has two independent values, its TPTH at Sectionals and its site's TPTH, and one dependent value, either its TPTH at Nationals or some other measure of its performance at Nationals.
This is well-intentioned but silly. As I have (now repeatedly) said, a site's average TPTH in and of itself is a number with little meaning. Here are two Sectionals from 2009:

Sectional A: 6 teams, Average (forfeit-adjusted) TPTH 3.75
Sectional B: 6 teams, Average (forfeit-adjusted) TPTH 3.76

Which one is tougher? From that alone it's hard to tell. Let's say that you've got an average TPTH and that you beat all teams with a lower TPTH, split with teams with a roughly equal (arbitrarily defined as within 0.2) TPTH, and lose to all teams with a higher TPTH.

Sectional A: A double round robin puts you at 8-4.
Sectional B: A double round robin puts you at 5-7.

I think it suffices to say that if you're a borderline team, you would much rather play at Sectional A, even though the ostensible "average" strength of each sectional is the same. Why is this?

Sectional A (East) has one team around 5.8 TPTH, one team around 4.8, and four teams that are below average but not necessarily weak (2.34 TPTH lowest average). Sectional B (Mideast) has three teams around 5 TPTH, one team around average, one team at about 2.8, and a bottom feeder at 0.88.

Sectional C (2008, 4 teams) has an average TPTH of 3.83.
Sectional D (2008, 9 teams) has an average TPTH of 3.83.

A double round robin against Sectional C puts you (average team) at 5-3.
A full round robin against Sectional D puts you (average team) at 4.5-4.5.

It's not as pronounced as the last example, but I think I'd take Sectional C if I were an average team. This time, Sectional C (East) has one team around 5.8 TPTH, two teams between 3.0 and 4.0, and one team at 2.45. Sectional D (South) has two teams between 5.0 and 5.6, two teams between 4.0 and 5.0, two teams between 3.0 and 4.0, two teams between 2.45 and 3.0, and one team at 1.91 - a much more "balanced" distribution of teams. Even though the difference in averages is very small, the distribution of teams at the sectional can have huge impacts.
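The beat-weaker/split-equal/lose-stronger rule used in these examples is easy to mechanize. A sketch (the field TPTHs in the usage example are illustrative, not the actual 2008-09 fields):

```python
# The win rule from the examples above: win against teams more than 0.2
# TPTH below you, split with teams within 0.2, lose to the rest.
def expected_record(my_tpth, opponent_tpths, games_each=2, margin=0.2):
    """Expected (wins, losses) for a round robin with games_each per opponent."""
    wins = 0.0
    for opp in opponent_tpths:
        if my_tpth - opp > margin:
            wins += games_each       # clearly weaker opponent: sweep
        elif abs(my_tpth - opp) <= margin:
            wins += games_each / 2   # roughly equal: split
        # clearly stronger opponent: no wins added
    losses = games_each * len(opponent_tpths) - wins
    return wins, losses

# A 4.0 TPTH team, double round robin against a top-heavy field:
record = expected_record(4.0, [5.0, 5.1, 4.9, 2.8, 0.88])
```

Running the two fields above through a function like this is all the examples are doing: same average TPTH, very different expected records.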

I hope it becomes apparent from these examples that using a single number (average TPTH) to quantify overall Sectional performance is a fool's errand. Obviously, I've also just shown that using a single point on the regression line to do anything is equally silly, but using two numbers (slope and intercept) to quantify how performance goes up across the Sectional is still, I think, valid.
cdcarter wrote:Unless you can come up with a method of regression that can accurately account for 90% of the variation regularly, I don't think this will be useful for a purely computational S-value, although it could be interesting.
I can give you >88% for all 2009 data and >86% for all 2008 data save the wacky West Sectional (only 76.49%) by correlating forfeit-adjusted TPTH with adjusted win% (if you finish with a higher win% than a team in a higher playoff bracket, your adjusted win% is the lowest win% of any team in that higher bracket). No word yet on whether these correlations are significant.
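One way to read that adjusted win% definition in code (the bracket structure and the numbers in the test are hypothetical, and this is only my reading of the definition):

```python
# Adjusted win% as described above: a team's win% is capped by the lowest
# win% appearing in any higher playoff bracket.
def adjusted_win_pcts(brackets):
    """brackets: list of lists of raw win% values, top bracket first.
    Returns the adjusted win% values, flattened in the same order."""
    adjusted = []
    cap = 1.0                          # the top bracket is uncapped
    for bracket in brackets:
        adjusted.extend(min(pct, cap) for pct in bracket)
        cap = min(cap, min(bracket))   # this bracket's floor caps everyone below
    return adjusted
```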
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
User avatar
grapesmoker
Sin
Posts: 6345
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC
Contact:

Re: S-value revision

Post by grapesmoker »

cvdwightw wrote:After critically thinking about it, here is my issue with the "maximize margin of victory" idea:

This should give teams an incentive to win by as much as possible, e.g., by not letting the other team have tossups and not giving them bonus parts. Conversely, it also gives teams an incentive to lose by as little as possible. The current way these factors are set up, teams are rewarded for not telling the other teams answers; however, average to poor teams are also rewarded for forfeiting against really good teams (they have a loss and a victory margin of 0, as opposed to a loss and a victory margin of somewhere between -200 and -600).
But how could a team forfeit a match against a stronger team on purpose? Since you don't know who you'll be playing before you show up at the tournament, you'd better show up or risk forfeiting to a worse team, but once you're there, you can't not show up to a match and have that not be detected as degenerate behavior.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuisance
User avatar
setht
Auron
Posts: 1205
Joined: Mon Oct 18, 2004 2:41 pm
Location: Columbus, Ohio

Re: S-value revision

Post by setht »

grapesmoker wrote:But how could a team forfeit a match against a stronger team on purpose? Since you don't know who you'll be playing before you show up at the tournament, you'd better show up or risk forfeiting to a worse team, but once you're there, you can't not show up to a match and have that not be detected as degenerate behavior.
"We got lost on the way back from lunch" and "we have to leave now or we'll be late for our hair-washing appointments" both seem like possibilities. I'm pretty sure I've seen at least one non-degenerate example of the first issue (not sure if the team in question actually got lost or was instead stuck in traffic or had exceptionally slow service). The second might look more suspicious but could also be legitimate--I remember a tournament at UIUC (ACF Fall a year or three ago, I think) where Chicago teams had to leave before the final match because they needed to catch a train.

-Seth
Seth Teitler
Formerly UC Berkeley and U. Chicago
President of NAQT
Emeritus member of ACF
User avatar
Sen. Estes Kefauver (D-TN)
Chairman of Anti-Music Mafia Committee
Posts: 5647
Joined: Wed Jul 26, 2006 11:46 pm

Re: S-value revision

Post by Sen. Estes Kefauver (D-TN) »

Yeah, at the WUSTL sectional this year, Tulsa took forever to be served lunch and missed at least one game.
Charlie Dees, North Kansas City HS '08
"I won't say more because I know some of you parse everything I say." - Jeremy Gibbs

"At one TJ tournament the neg prize was the Hampshire College ultimate frisbee team (nude) calender featuring one Evan Silberman. In retrospect that could have been a disaster." - Harry White
User avatar
grapesmoker
Sin
Posts: 6345
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC
Contact:

Re: S-value revision

Post by grapesmoker »

setht wrote:"We got lost on the way back from lunch" and "we have to leave now or we'll be late for our hair-washing appointments" both seem like possibilities. I'm pretty sure I've seen at least one non-degenerate example of the first issue (not sure if the team in question actually got lost or was instead stuck in traffic or had exceptionally slow service). The second might look more suspicious but could also be legitimate--I remember a tournament at UIUC (ACF Fall a year or three ago, I think) where Chicago teams had to leave before the final match because they needed to catch a train.
Sorry, I'm not particularly sympathetic to either of the first two arguments. I would think that grown adults would be able to navigate their immediate environs with a map, or barring that, would be able to call someone and ask for directions. I've been attending quizbowl tournaments for almost a decade now and I've never had a problem going to get lunch somewhere and coming back. If you're not clever enough to figure out that a quizbowl tournament is not the best time to go for a sit-down lunch at Chez Panisse, then I guess it sucks to be you, but the round is going to start regardless. As for leaving early, presumably some level of foresight needs to exist among teams to arrange for these things; if you think the difference between making ICT and not is one late game, then plan accordingly.

It's been said that bad cases make bad law; I don't know if that's the case, but I think building a formula around anomalous events is going to result in a bad and complicated formula. Since whatever formula we come up with is going to include PPTUH as a factor, a legitimate forfeit shouldn't count in your TUH column while an illegitimate one should, and that determination is going to be up to the discretion of the TD. I don't want to penalize a team that has a legitimate problem making a match because of, say, difficulties with their car, but I also don't care to come up with crazy disincentives for behavior that you shouldn't be engaging in in the first place. I think simply factoring TUH in this way accounts for both legitimate forfeits and teams being dumb.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuisance
User avatar
theMoMA
Forums Staff: Administrator
Posts: 5993
Joined: Mon Oct 23, 2006 2:00 am

Re: S-value revision

Post by theMoMA »

grapesmoker wrote:It's been said that bad cases make bad law; I don't know if that's the case, but I think building a formula around anomalous events is going to result in a bad and complicated formula. Since whatever formula we come up with is going to include PPTUH as a factor, a legitimate forfeit shouldn't count in your TUH column while an illegitimate one should, and that determination is going to be up to the discretion of the TD. I don't want to penalize a team that has a legitimate problem making a match because of, say, difficulties with their car, but I also don't care to come up with crazy disincentives for behavior that you shouldn't be engaging in in the first place. I think simply factoring TUH in this way accounts for both legitimate forfeits and teams being dumb.
I agree completely with Jerry. In addition to the TD's power to rule on such things, if NAQT suspects that a forfeit is for the wrong reasons, it should reserve the right to review the case and count the TUH against the team if necessary.
Andrew Hart
Minnesota alum
User avatar
Captain Sinico
Auron
Posts: 2675
Joined: Sun Sep 21, 2003 1:46 pm
Location: Champaign, Illinois

Re: S-value revision

Post by Captain Sinico »

grapesmoker wrote:Sorry, I'm not particularly sympathetic to either of the first two arguments. I would think that grown adults would be able to navigate their immediate environs with a map, or barring that, would be able to call someone and ask for directions.
Seth isn't positing those as arguments for missing matches or that make it okay to miss matches; he's just saying that they're plausible enough that you couldn't outright accuse a team of cheating/ducking opponents if they made them. He's absolutely right in that.
I'd suggest that, rather than receiving an outright forfeit, a team whose opponent doesn't make it can play a round against empty chairs and have the statistics from it counted the same way they otherwise would have been, including whatever corrections for opponent strength are used (the opponent here being nobody, those corrections will obviously be sharp downward ones). In this way, a forfeited round still generates useful ranking data; bonus conversion, for example, ought to match that of the hypothetical game in which the opponent showed up to a high degree of precision. That leaves open the question of what to give the forfeited team, which, I suppose, is the actual question here, but I think there's probably no best answer to that; information is inevitably destroyed by a forfeit.

MaS
Mike Sorice
Former Coach, Centennial High School of Champaign, IL (2014-2020) & Team Illinois (2016-2018)
Alumnus, Illinois ABT (2000-2002; 2003-2009) & Fenwick Scholastic Bowl (1999-2000)
Member, ACF (Emeritus), IHSSBCA, & PACE
User avatar
grapesmoker
Sin
Posts: 6345
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC
Contact:

Re: S-value revision

Post by grapesmoker »

Captain Sinico wrote:Seth isn't positing those as arguments for missing matches or that make it okay to miss matches; he's just saying that they're plausible enough that you couldn't outright accuse a team of cheating/ducking opponents if they made them. He's absolutely right in that.
Sure, and I agree. What I am saying is that regardless of whether a team is thought to be cheating or not, no one should have trouble stepping outside, grabbing some food, eating it, and coming back all in the space of an hour. If you can't do that and as a result the round is read to your opponent against empty chairs, those tossups are going to be counted against you in the TUH column. That should be a sufficient disincentive for anyone to do stupid things like order sit-down food that takes too long or just not show up to a match.
Captain Sinico wrote:I'd suggest that, rather than receiving an outright forfeit, a team whose opponent doesn't make it can play a round against empty chairs and have the statistics from it counted the same way they otherwise would have been, including whatever corrections for opponent strength are used (the opponent here being nobody, those corrections will obviously be sharp downward ones). In this way, a forfeited round still generates useful ranking data; bonus conversion, for example, ought to match that of the hypothetical game in which the opponent showed up to a high degree of precision. That leaves open the question of what to give the forfeited team, which, I suppose, is the actual question here, but I think there's probably no best answer to that; information is inevitably destroyed by a forfeit.
Well, as I said above, I would give the forfeiting team a zero and however many TUH were read during the round. If it turns out later that the forfeiting team had a good excuse, we can scrub those TUH from their record. I agree that we should just read the round to the opponent, but the onus for not forfeiting a round is on the teams involved; it's not the formula's job to accommodate forfeits that shouldn't be happening.
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuisance
User avatar
rjaguar3
Rikku
Posts: 278
Joined: Tue Dec 05, 2006 8:39 am

Re: S-value revision

Post by rjaguar3 »

Pardon me for bumping, but I just came up with an idea.

The idea is to use KRACH ratings without phantom teams (see http://www.collegehockeynews.com/info/?d=krach and http://slack.net/~whelan/tbrw/tbrw.cgi?krach ). Within each sectional, each team gets the wins that it accumulates against other teams in that sectional. However, each pair of teams NOT in the same sectional receives fractions of wins based on PPB. Choose 0 < lambda < 1, and split lambda wins between each pair of teams A, B not from the same sectional by the formula

WINS(A) = lambda*[0.5 + k*(PPB(A) - PPB(B))]

where k is a positive constant that results from the least-squares regression of the data points (PPB(W)-PPB(L),0.5) and (PPB(L)-PPB(W),-0.5) collected over all matches. Further, the term in brackets must be from 0 to 1, inclusive; if the actual value is outside that range, then it will become 0 or 1, as appropriate.

Put all the teams into a huge matrix along with the number of wins against each team (actual wins for teams in the same sectional, formula wins for teams in different sectionals) and compute the KRACH rating for each team.
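A minimal solver for these ratings might look like the sketch below. This is only an illustration of the fixed-point computation, not a specified implementation; the function and data layout are mine.

```python
# A minimal KRACH (Bradley-Terry) solver for the scheme described above:
# real games within a sectional plus lightly weighted "formula games"
# between sectionals.
def krach(wins, games, iterations=2000):
    """wins[i][j]: (possibly fractional) wins of team i over team j.
    games[i][j]: games between i and j (use lambda for cross-sectional pairs).
    Returns ratings r such that each team's expected wins equal its actual
    wins: W_i = sum_j games[i][j] * r_i / (r_i + r_j)."""
    n = len(wins)
    ratings = [1.0] * n
    for _ in range(iterations):
        new = []
        for i in range(n):
            total_wins = sum(wins[i])
            denom = sum(games[i][j] / (ratings[i] + ratings[j])
                        for j in range(n) if j != i and games[i][j] > 0)
            new.append(total_wins / denom if denom else ratings[i])
        scale = sum(new) / n      # only ratios matter; keep the mean at 1
        ratings = [r / scale for r in new]
    return ratings
```

Note that without phantom teams an undefeated (or winless) team's rating diverges; the fractional cross-sectional wins are what guarantee every team some wins and some losses.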

BENEFITS:
  • It is difficult to game the system through non-obvious manipulation without collusion among multiple sectionals. The only way that a team at a sectional has the potential to game the system to increase their ranking is to feed opponents bonus answers, which not only has unpredictable effects (because it affects the entire field and therefore has results dependent on the results from other sectionals) but also is easily detected.
  • It produces a (relatively) easily verifiable result: to check that the ratings are accurate, compute the total number of actual wins plus formula wins and compare that to the dot product of the number of games played against each team (either the number of actual games or lambda for teams in different sectionals) and the quotient of the teams' ratings.
  • The ratings serve as an intuitive way of determining not only which team is the favorite but how often that team would win against another team.
  • Bonus conversion is independent of opponent.
DISADVANTAGES:
  • There is currently no way, apart from regression from previous years' data, to adapt DI PPB to corresponding DII scores. According to the NAQT page, this is a fatal flaw.
  • Currently, no tests have been performed to see if the results of this method conform to the actual pattern of invitations. I will have to do that research in the next few weeks.
  • The formula requires a computer.
  • Weak teams that do not convert many toss-ups may have unreliable PPB. This may have a huge ripple effect; as such, it might be necessary to "damp" the PPB by assuming each team gets a 0 and 30 on two fictional bonuses.
  • Paradoxically, it is possible that adding another sectional can cause already-ranked teams to switch rankings.
I'm just throwing this out there. I hope to be able to do some tests in the future.
Greg (Vanderbilt 2012, Wheaton North 2008)
User avatar
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA
Contact:

Re: S-value revision

Post by cvdwightw »

rjaguar3 wrote:It is difficult to game the system through non-obvious manipulation without collusion among multiple sectionals. The only way that a team at a sectional has the potential to game the system to increase their ranking is to feed opponents bonus answers, which not only has unpredictable effects (because it affects the entire field and therefore has results dependent on the results from other sectionals) but also is easily detected.
YOU ARE WRONG: there is a very easy way to game this system, and that is by not answering tossups. Teams have absolutely no incentive to continue buzzing when the following criteria are met:
  • They are satisfied with their bonus conversion for that round (usually because it is above their bonus conversion coming into that round, so an additional bonus would likely bring it back down)
  • The outcome of the game is no longer in doubt: they are playing a really weak team that is unlikely to come back, or a really strong team against which they are unlikely to come back. In short, when answering an additional tossup cannot realistically change who wins.

Let's use a hypothetical, but not at all unrealistic situation: With 5 minutes to go in the second half, the reader has finished 15 questions, Team A has gone 4-8-1 and had an above-average bonus conversion of 20 ppb for a total of 375 points; Team B has gone 1-1-0 and gotten good draws on both bonuses for a total of 75 points. At the reader's pace, it is likely that either 20 or 21 questions will be read if both teams continue playing; this means that there is a total of no more than 270 points available the rest of the game. Team A is up by 300 points and has a bonus conversion that is more likely to decrease with another bonus heard; Team A has no further incentive to buzz with the correct answer. Team B is down by 300 points and has a bonus conversion that is much more likely to decrease with another bonus heard; Team B has no further incentive to buzz with the correct answer. It's not unusual for teams to be in either Team A or Team B's position (though it's less likely that a Team A and Team B position occur in the same game). A team in Team A's position might be suspected of degenerate behavior if it stops buzzing, but this becomes impossible to prove if Team A already has a reputation for being fast and loose with the buzzer and piles up 5 straight negs after the outcome of the game is no longer in doubt. Similarly, Team B could just let the other team (who has a blowout lead anyway) have the rest of the questions, and no one would know the difference (and if a question did get all the way to an uncontested buzz, just have a player give a reasonably plausible but wrong guess).
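The incentive here is plain averaging arithmetic: hearing one more bonus moves your PPB toward the expected value of that bonus, so a team already above its expectation lowers its average by buzzing again. A sketch with illustrative numbers:

```python
# One more bonus moves PPB toward the expected value of that bonus, so a
# team whose PPB already exceeds its expectation on the next bonus is
# hurt by answering another tossup. Numbers below are illustrative.
def ppb_after_next_bonus(points_so_far, bonuses_so_far, expected_next):
    """PPB after hearing one more bonus worth expected_next points."""
    return (points_so_far + expected_next) / (bonuses_so_far + 1)

current = 240 / 12                            # 20.0 PPB through 12 bonuses
after = ppb_after_next_bonus(240, 12, 14.0)   # expect ~14 on the next one
assert after < current                        # another bonus drags PPB down
```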

The idea is reasonable, but it's way too susceptible to non-obvious degenerate behavior.
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
User avatar
rjaguar3
Rikku
Posts: 278
Joined: Tue Dec 05, 2006 8:39 am

Re: S-value revision

Post by rjaguar3 »

I don't think this will be much of a problem unless the sample size of bonuses heard is too small, which can only happen for weak teams. For what it's worth, you never know whether the next bonus in the packet will help or hurt PPB.

If you're a good team, the effect of hearing one additional bonus on PPB is very small because you will have gotten so many toss-ups that it's very likely that the PPB is close to the true value.

If you still think that is a problem, then replace any references to PPB with downward-damped PPB (DDPPB), defined as the PPB for all bonuses heard plus one additional fictional 0-point bonus per game (per 20 TUH). The problem is that such a fix will negatively affect teams who play stronger teams; however, this problem should be less prevalent among the bubble teams, whose rankings we REALLY need to get right.
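In code, a sketch of that definition (my reading of it; the function name and example numbers are illustrative):

```python
# Downward-damped PPB: one fictional 0-point bonus per game's worth of
# tossups (20 TUH), pulling small-sample PPB toward zero.
def ddppb(bonus_points, bonuses_heard, tossups_heard):
    """Downward-damped points per bonus."""
    fictional = tossups_heard / 20        # one zero bonus per 20 TUH
    return bonus_points / (bonuses_heard + fictional)

# 200 bonus points on 10 bonuses over 40 TUH: raw PPB 20.0, damped lower.
damped = ddppb(200, 10, 40)
```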
Greg (Vanderbilt 2012, Wheaton North 2008)
User avatar
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA
Contact:

Re: S-value revision

Post by cvdwightw »

I've gone through the math and I've concluded that when we're looking at about 100 bonuses, the average borderline team expects to save only about 0.002 PPB per tossup not answered, depending on how much the average bonus to that point in the game would bring up its bonus conversion. You are right about how much gaming the system can help or hinder you in the long run.

However, I think that though the expected value is practically the same, we also need to look at the maximum amount a team might benefit from this. Using the 2009 Irvine data set, which is the only set of line scores I have handy access to: we would have improved by 0.33 PPB had we stopped buzzing at the point at which each game was decided (167 tossups answered, 29 of them after the result of the game had been mathematically determined based on how many tossups were ultimately heard; that's 17.3% of the tossups we answered when we didn't need to). Obviously it's a very risky strategy - a team could just as easily lose that much. If we consider two teams of equal real skill level, each of which attempts this strategy with these completely opposite results, then the lucky one looks 0.66 PPB better than the unlucky one.

It would be unreasonable for a borderline team to reason that x PPB is not enough to get them into the field but x+0.33 PPB might be, and to decide that it's worth the risk of ending up at x-0.33 PPB; then again, some teams don't always think reasonably. Trying to game the system does not necessarily mean succeeding at it, but in this case even the attempt can play havoc with results: if any team tries to game the system then, whether it succeeds or not, there is a distinct possibility that the invitation order of some teams (possibly entirely unrelated to the team that gamed the system) will get switched.

Here is what I like about the old S-value:
1. It is always better to win a game than to not win a game.
2. It is always better to answer a tossup than to not answer a tossup.
3. It is always better to answer a bonus part than to not answer a bonus part.
4. It is always better to play a stronger opponent than to play a weaker opponent.

I may be the only person who actually thinks this, but personally, I think these four principles need to guide any search for a new S-value. Under this proposed system, there is no benefit to answering a tossup and no benefit to playing a stronger opponent. In fact, if a team is better than you at getting tossups but your bonus conversions are roughly equal, you are actually better off with that team in a different sectional: they would win the majority of your head-to-head games, but you will win 50% of the "fractional" games. We see all the time that teams with higher bonus conversions end up lower in the standings because teams with lower bonus conversions were consistently able to make up the bonus-conversion deficit by answering more tossups.

For instance, 2009 Princeton had a bonus conversion 0.62 points lower than Harvard C but finished three games ahead in the same playoff bracket, partially due to converting 52.3% of tossups heard compared to 47.2% for Harvard C. It stands to reason that against a team with Princeton's bonus conversion, Princeton would do better in the long run because they would be more likely to answer more tossups and thus prevent that team from exercising its advantage on the bonuses. Yet, against a team with Princeton's bonus conversion, Princeton would receive 0.5 wins and Harvard C more than 0.5 using this formula.

To summarize a long post, I think you're right that the gaming-the-system won't have a large impact on the field as a whole, but it may have an impact on the ordering of that field; a more important issue is that teams are not at all rewarded for answering tossups even though that may be a bigger predictor of head-to-head and within-field success.
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
User avatar
rjaguar3
Rikku
Posts: 278
Joined: Tue Dec 05, 2006 8:39 am

Re: S-value revision

Post by rjaguar3 »

My problem with incorporating toss-ups into the mix is that it creates even more possibilities for match fixing. Winning by a lot? Neg with no answer, or with an answer that is correct but padded with additional incorrect information, allowing your opponents to rebound, making them look stronger and your victory better. Losing by a lot? Do the same thing to make your loss appear to be to a much stronger team. It is therefore very difficult to incorporate toss-ups into the mix without allowing far more possibilities for match-fixing.
Greg (Vanderbilt 2012, Wheaton North 2008)
User avatar
Important Bird Area
Forums Staff: Administrator
Posts: 6112
Joined: Thu Aug 28, 2003 3:33 pm
Location: San Francisco Bay Area
Contact:

Re: S-value revision

Post by Important Bird Area »

cvdwightw wrote:Here is what I like about the old S-value:
1. It is always better to win a game than to not win a game.
2. It is always better to answer a tossup than to not answer a tossup.
3. It is always better to answer a bonus part than to not answer a bonus part.
4. It is always better to play a stronger opponent than to play a weaker opponent.

I may be the only person who actually thinks this, but personally, I think these four principles need to guide any search for a new S-value.
You're not; I strongly agree with the first three, and also concur with (4), noting that we may have difficulty determining what "stronger" means in marginal cases.
rjaguar3 wrote:My problem with incorporating toss-ups into the mix is that it creates even more possibilities for match fixing. Winning by a lot? Neg with no answer, or with an answer that is correct but padded with additional incorrect information, allowing your opponents to rebound, making them look stronger and your victory better. Losing by a lot? Do the same thing to make your loss appear to be to a much stronger team. It is therefore very difficult to incorporate toss-ups into the mix without allowing far more possibilities for match-fixing.
I'm not actually worried about this. I think the right solution is to do both: 1) incorporate tossup stats (which does need to be done, for the reasons Dwight outlines above) and 2) use a measure of schedule strength that is not vulnerable to the attacks mentioned. (For instance, "team X's schedule strength is the average bonus conversion of its opponents in all games not involving team X." That has various other problems, but match-fixing it would require multiple teams to conspire together in games of unknown outcome, which is both logistically challenging for the cheaters and likely to hurt them in actual game play in a way that deliberately negging the mop-up tossup 20 isn't.)
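A sketch of that schedule-strength measure, for concreteness. The game-record format here is entirely hypothetical; the point is only that each opponent's PPB is computed while excluding every game involving the team being rated:

```python
# Team X's strength of schedule = its opponents' average bonus conversion,
# computed only from games not involving X.
def schedule_strength(team, games):
    """games: dicts like {'teams': ('A', 'B'),
                          'bonus_points': {'A': 180, 'B': 120},
                          'bonuses_heard': {'A': 10, 'B': 8}}.
    Returns the average PPB of `team`'s opponents in games without `team`."""
    opponents = {t for g in games if team in g['teams']
                 for t in g['teams'] if t != team}
    ppbs = []
    for opp in opponents:
        pts = heard = 0
        for g in games:
            if opp in g['teams'] and team not in g['teams']:
                pts += g['bonus_points'][opp]
                heard += g['bonuses_heard'][opp]
        if heard:
            ppbs.append(pts / heard)
    return sum(ppbs) / len(ppbs) if ppbs else 0.0
```

Because none of the games a team plays in feed its own schedule-strength number, feeding your opponent bonus answers can't inflate it.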
Jeff Hoppes
President, Northern California Quiz Bowl Alliance
former HSQB Chief Admin (2012-13)
VP for Communication and history subject editor, NAQT
Editor emeritus, ACF

"I wish to make some kind of joke about Jeff's love of birds, but I always fear he'll turn them on me Hitchcock-style." -Fred
User avatar
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA
Contact:

Re: S-value revision

Post by cvdwightw »

rjaguar3 wrote:My problem with incorporating toss-ups into the mix is that it creates even more possibilities for match fixing. Winning by a lot? Neg with no answer, or with an answer that is correct but padded with additional incorrect information, allowing your opponents to rebound, making them look stronger and your victory better. Losing by a lot? Do the same thing to make your loss appear to be to a much stronger team. It is therefore very difficult to incorporate toss-ups into the mix without allowing far more possibilities for match-fixing.
Yeah, you're pretty much right about this. I guess the issue is whether or not we use a model that pretty much by definition can't use any measure of tossup strength; there are pros and cons either way, and as I've stated, I'm on the side of "include tossup strength as an important factor in the S-value." If a lot of other people are on the other side, I'd have no problem dropping my objections; however, I think that more people are on my side and that the solution to issues with tossup strength is not to drop it from the equation entirely.

As the number of tossups answered decreases, the bonus conversion needed for the inferior team to win, given a certain difference in the number of tossups answered, decreases (or conversely, the bonus conversion needed for the superior team to win increases). I think this should be fairly intuitive. What this means is also intuitive - more dead tossups or a slower reader increases the chance of upsets. I'm pretty sure the chance of getting a tossup against a given opponent is not linearly related to BC (for instance, I would expect a team at 10 ppb to beat a team at 7 ppb more often than a 15 ppb to beat a team at 12 ppb, which would happen more often than a team at 20 ppb beating a team at 17 ppb - I have a feeling that the ratio of two teams' bonus conversion is more important than the difference).
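One way to see the slow-reader effect with a little arithmetic (a sketch assuming every answered tossup earns a bonus and ignoring powers and negs, with the weaker-converting team holding a fixed tossup edge):

```python
# If the weaker-converting team answers d more tossups than the stronger
# team, the bonus conversion it needs to win falls as the total number of
# answered tossups shrinks. Score = tossups * (10 + PPB), since each
# answered tossup earns 10 points plus one bonus.

def breakeven_ppb(weak_tossups, strong_tossups, strong_ppb):
    """Minimum PPB the weaker team needs to outscore the stronger team."""
    strong_score = strong_tossups * (10 + strong_ppb)
    return strong_score / weak_tossups - 10

# Same 2-tossup edge against a 20-PPB team, fast game vs. slow game:
print(breakeven_ppb(11, 9, 20))  # fast game: needs about 14.5 PPB
print(breakeven_ppb(6, 4, 20))   # slow game: needs only 10.0 PPB
```

So with more dead tossups or a slower reader, a smaller conversion gap suffices for the upset, which is the intuition stated above.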

The KRACH ratings also necessarily work because teams play games outside of their conferences; we don't have that luxury with sectionals - you don't get 12 games against the East Sectional and a teleconference game against a Canadian team.

I think it's safe to say that the concept of multiplying opponent-dependent statistics by a strength of schedule factor is going to be a component of the new S-value. The major question is, how do we compute that strength of schedule factor and what opponent-dependent statistics do we include?
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
User avatar
rjaguar3
Rikku
Posts: 278
Joined: Tue Dec 05, 2006 8:39 am

Re: S-value revision

Post by rjaguar3 »

Dwight, that's why I use PPB to create games between sectionals, so that the sectionals are linked by artificial (lightly-weighted) games in order to produce rankings among teams in all sectionals. Otherwise, I wouldn't bother with PPB.
Greg (Vanderbilt 2012, Wheaton North 2008)
User avatar
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA
Contact:

Re: S-value revision

Post by cvdwightw »

I went back and calculated something very similar to Andrew's suggestion. It adds the standard deviations for raw bonus conversion and opponent-adjusted tossup conversion, converts the result to a whole number, and corrects for order-of-finish deviations. Steps as follows:

1. Find the tossup points per tossup heard and the bonus conversion.

2. Compute the Opponent-Averaged Opponent Average Performance (OAOAP, I'm searching for a better and less cumbersome descriptor). This is a strength-of-schedule factor that is unique to each team, and is computed as follows:
A. Set up an n×n matrix, where n is the number of teams in the sectional.
B. For each matrix entry (R, C), compute the tossup points per tossup heard of team R in all games that did not involve team C. Obviously, (R, R) is not applicable and is not used in subsequent steps. Note that forfeit games are not currently counted as games involving other teams. Any future workaround solutions for counting forfeits (e.g. a game against empty chairs) would be included in this step.
C. Take the average of each column weighted by number of games played against each team. So if you had 8 teams, played a full round robin and then split into 4-team brackets with a double RR in each bracket, your column's average would be (1*(each of the teams not in your bracket)+3*(each of the teams in your bracket))/(13 total games played).
D. The resulting number is a rough measure of how hard you had to work to get tossups. Higher numbers mean that your opponents scored tossups more easily against other teams, and therefore you would have had to work harder to get the tossups you did earn. Lower numbers mean that your opponents scored fewer tossups against other teams, and therefore you would not have had to work as hard to get your tossups.

3. Multiply each team's TPTH by its OAOAP and divide by the average OAOAP. This is a team's raw tossup score (RTSC), and represents the TPTH you are likely to have gotten playing an exactly average schedule.

4. Convert from the RTSC to the normalized tossup score (NTSC), which is the number of standard deviations away from the mean RTSC.

5. Convert bonus conversion to the normalized bonus score (NBSC), which is the number of standard deviations away from the mean BC.

6. Compute the cumulative standard deviations (CSD) = NTSC + NBSC.

7. Convert the CSD to a 3 or 4 digit Raw S Value (RSV) = 1000+100*CSD.

8. Convert the RSV to the Adjusted S Value (ASV) as follows:
A. Rank each team within sectional by order of finish and by RSV.
B. For each sectional, if a team's RSV rank is k spots below its order-of-finish rank (note that this does not apply to a team that is, e.g., 5th in RSV and in a 3-way tie for 3rd in order of finish, only a team whose RSV rank could not possibly equal its order-of-finish rank):
For that team and the k teams above it by RSV, the ASV is equal to (1/(k+2))*(SUM(RSV) + RSV_i), where SUM(RSV) runs over all k+1 teams in the group and RSV_i is the RSV at rank i by RSV, i being the team's order-of-finish rank within the group.
So, for a team that is 1 spot lower by RSV than its order of finish, its score is (2/3)*higher RSV + (1/3)*lower RSV. For a team that is 2 spots lower, its score is (1/2)*(highest of 3 RSVs) + (1/4)*(middle of 3 RSVs) + (1/4)*(lowest of 3 RSVs). And so on.
C. For all other teams, the ASV is simply the RSV.
D. Continue this process until for each sectional, the order of finish and RSV within-sectional rankings match.

9. Round all ASVs to the nearest integer and sort from highest to lowest. If there is a tie in ASV, higher CSD breaks the tie. Note that steps 7 and 8 can be switched and it won't affect the ASV.

The ASV will always correct for order-of-finish such that a team that finishes higher at the SCT itself will always be invited first. Additionally, the OAOAP is a better strength-of-schedule factor because it takes into account that every team at the sectional plays a different schedule, and because it never takes into account games that are played between two teams, it is impervious to within-game gaming (it is still theoretically susceptible to forfeit manipulation, though note that if the lower-TPTH team forfeits, it artificially increases its opponent's OAOAP and decreases its own; it remains to be seen whether this artificial increase/decrease is enough to offset the potential TPTH gain/loss, but I am skeptical that it will). The ASV is given as an easy-to-read 3 or 4 digit number and calculations can be done using two Excel Sheets (one for OAOAP and one for everything else). This method will also work if raw D2/D1 numbers are converted to numbers in the other division, though a way to do that has not yet been established.
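Steps 1 through 7 can be sketched in a few lines (a sketch under assumed data structures, not NAQT's implementation; it uses population standard deviations and folds powers and negs into tossup points, as TPTH does):

```python
# Sketch of steps 1-7: TPTH and BC per team, the OAOAP strength-of-schedule
# factor, and the Raw S Value. Game records are hypothetical.
from statistics import mean, pstdev

def raw_s_values(teams, games):
    """teams: list of names. games: list of dicts like
    {"teams": ("A", "B"), "tuh": 20, "tu_pts": {...},
     "bonus_pts": {...}, "bonuses": {...}}"""
    def tpth(r, exclude=None):
        # Tossup points per tossup heard for team r, optionally
        # excluding all games involving team `exclude` (step 2B).
        pts = heard = 0
        for g in games:
            if r in g["teams"] and exclude not in g["teams"]:
                pts += g["tu_pts"][r]
                heard += g["tuh"]
        return pts / heard if heard else 0.0

    # Step 2: OAOAP for team c = average over its opponents r of r's TPTH
    # in games not involving c, weighted by games played against r (2C).
    oaoap = {}
    for c in teams:
        w = {r: sum(1 for g in games if set(g["teams"]) == {r, c})
             for r in teams if r != c}
        oaoap[c] = (sum(w[r] * tpth(r, exclude=c) for r in w)
                    / sum(w.values()))

    avg_oaoap = mean(oaoap.values())
    rtsc = {t: tpth(t) * oaoap[t] / avg_oaoap for t in teams}   # step 3
    bc = {t: sum(g["bonus_pts"][t] for g in games if t in g["teams"])
             / sum(g["bonuses"][t] for g in games if t in g["teams"])
          for t in teams}

    def z(d, t):  # steps 4-5: signed standard deviations from the mean
        return (d[t] - mean(d.values())) / pstdev(d.values())

    # Steps 6-7: CSD = NTSC + NBSC, RSV = 1000 + 100 * CSD
    return {t: 1000 + 100 * (z(rtsc, t) + z(bc, t)) for t in teams}
```

Since the z-scores sum to zero across the field, the RSVs average exactly 1000 by construction.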

EDIT: People wishing to see the relevant calculations should see the OAOAP and ASV sheets here.

EDIT2: I'd just like to point out that independent of any other advantages of this system, the raw-to-adjusted S value conversion completely solves part 1 of NAQT's "three most difficult parts of assembling this system." "Team A" will always be invited ahead of "Team B" and the invitation order for "Team C" can be found directly from the ASV rankings.
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
User avatar
setht
Auron
Posts: 1205
Joined: Mon Oct 18, 2004 2:41 pm
Location: Columbus, Ohio

Re: S-value revision

Post by setht »

cvdwightw wrote:I went back and calculated something very similar to Andrew's suggestion. It adds the standard deviations for raw bonus conversion and opponent-adjusted tossup conversion, converts the result to a whole number, and corrects for order-of-finish deviations. Steps as follows:

1. Find the tossup points per tossup heard and the bonus conversion.

2. Compute the Opponent-Averaged Opponent Average Performance (OAOAP, I'm searching for a better and less cumbersome descriptor). This is a strength-of-schedule factor that is unique to each team, and is computed as follows:
A. Set up an n×n matrix, where n is the number of teams in the sectional.
B. For each matrix entry (R, C), compute the tossup points per tossup heard of team R in all games that did not involve team C. Obviously, (R, R) is not applicable and is not used in subsequent steps. Note that forfeit games are not currently counted as games involving other teams. Any future workaround solutions for counting forfeits (e.g. a game against empty chairs) would be included in this step.
C. Take the average of each column weighted by number of games played against each team. So if you had 8 teams, played a full round robin and then split into 4-team brackets with a double RR in each bracket, your column's average would be (1*(each of the teams not in your bracket)+3*(each of the teams in your bracket))/(13 total games played).
D. The resulting number is a rough measure of how hard you had to work to get tossups. Higher numbers mean that your opponents scored tossups more easily against other teams, and therefore you would have had to work harder to get the tossups you did earn. Lower numbers mean that your opponents scored fewer tossups against other teams, and therefore you would not have had to work as hard to get your tossups.
I'm with you so far, Dwight, but before we move forward with using tossup points per tossup heard and figuring out methods for correcting that statistic for strength-of-schedule effects I think we should check that bonus conversion by itself is not as good as or better than some combination of TPTH and bonus conversion for predicting performance at ICT--focusing only on cases with decent sample size (say, teams that answered at least 50 bonuses at SCT). If it turns out that bonus conversion by itself is as good as or better than a mix of tossup and bonus statistics as a predictive statistic, then we should use that for all teams with decent numbers of bonuses and we can skip figuring out how to handle strength-of-schedule adjustments. If it turns out that including tossup statistics seems to give better predictive power then I think the scheme you present here is a good starting point.

I think 50 bonuses should be more than sufficient as a sample size to deal with fluctuations in conversion by subject or whatever; if that doesn't seem to be the case the cutoff could be moved to 100 bonuses.
cvdwightw wrote: 3. Multiply each team's TPTH by its OAOAP and divide by the average OAOAP. This is a team's raw tossup score (RTSC), and represents the TPTH you are likely to have gotten playing an exactly average schedule.

4. Convert from the RTSC to the normalized tossup score (NTSC), which is the number of standard deviations away from the mean RTSC.

5. Convert bonus conversion to the normalized bonus score (NBSC), which is the number of standard deviations away from the mean BC.

6. Compute the cumulative standard deviations (CSD) = NTSC + NBSC.

7. Convert the CSD to a 3 or 4 digit Raw S Value (RSV) = 1000+100*CSD.
I assume the NTSC and NBSC are signed quantities.

Dwight (or someone else interested in crunching data), would you be willing to test out the predictive power of the RSV a bit by looking through SCT/ICT results from 2007-2009? In particular, would you be willing to play with giving different weights to the NTSC and NBSC in calculating the CSD? For the purposes of this calculation, I think it should be okay to calculate opponent strength using full opponent tossup points per tossup heard (that is, we don't need to exclude the R vs. C matches in calculating the entry (R,C) in the OAOAP matrix).
cvdwightw wrote: 8. Convert the RSV to the Adjusted S Value (ASV) as follows:
A. Rank each team within sectional by order of finish and by RSV.
B. For each sectional, if a team's RSV rank is k spots below its order-of-finish rank (note that this does not apply to a team that is, e.g., 5th in RSV and in a 3-way tie for 3rd in order of finish, only a team whose RSV rank could not possibly equal its order-of-finish rank):
For that team and the k teams above it by RSV, the ASV is equal to (1/(k+2))*(SUM(RSV) + RSV_i), where SUM(RSV) runs over all k+1 teams in the group and RSV_i is the RSV at rank i by RSV, i being the team's order-of-finish rank within the group.
So, for a team that is 1 spot lower by RSV than its order of finish, its score is (2/3)*higher RSV + (1/3)*lower RSV. For a team that is 2 spots lower, its score is (1/2)*(highest of 3 RSVs) + (1/4)*(middle of 3 RSVs) + (1/4)*(lowest of 3 RSVs). And so on.
C. For all other teams, the ASV is simply the RSV.
D. Continue this process until for each sectional, the order of finish and RSV within-sectional rankings match.

9. Round all ASVs to the nearest integer and sort from highest to lowest. If there is a tie in ASV, higher CSD breaks the tie. Note that steps 7 and 8 can be switched and it won't affect the ASV.

The ASV will always correct for order-of-finish such that a team that finishes higher at the SCT itself will always be invited first. Additionally, the OAOAP is a better strength-of-schedule factor because it takes into account that every team at the sectional plays a different schedule, and because it never takes into account games that are played between two teams, it is impervious to within-game gaming (it is still theoretically susceptible to forfeit manipulation, though note that if the lower-TPTH team forfeits, it artificially increases its opponent's OAOAP and decreases its own; it remains to be seen whether this artificial increase/decrease is enough to offset the potential TPTH gain/loss, but I am skeptical that it will). The ASV is given as an easy-to-read 3 or 4 digit number and calculations can be done using two Excel Sheets (one for OAOAP and one for everything else). This method will also work if raw D2/D1 numbers are converted to numbers in the other division, though a way to do that has not yet been established.

EDIT: People wishing to see the relevant calculations should see the OAOAP and ASV sheets here.

EDIT2: I'd just like to point out that independent of any other advantages of this system, the raw-to-adjusted S value conversion completely solves part 1 of NAQT's "three most difficult parts of assembling this system." "Team A" will always be invited ahead of "Team B" and the invitation order for "Team C" can be found directly from the ASV rankings.
I'm not sure we should set up anything like the ASV; I'll get back to that in a moment. Random comments: in 8D, the process loops until order of finish matches order of ASV, not RSV, right? Also, if a lower-TPTH team forfeits a match to a higher-TPTH team, that doesn't necessarily decrease its own OAOAP: that depends on whether the higher-TPTH team is above or below the TPTH average of the lower-TPTH team's other opponents.

About ASV vs. RSV: I know that there's a note on NAQT's S-value revision page about finding a formula that nearly always respects order of finish within an SCT, but I think we may already have that with RSV and I think any scheme that always respects order of finish is going too far. The RSV includes strength-of-schedule adjustments for the actual schedule of each team, so if team A is in the top bracket and team B is in the second bracket, team A should have a heftier strength-of-schedule adjustment to compensate for (possibly) having lower unadjusted TPTH. If team B's best player missed the first couple rounds, it seems possible that team B could wind up in a second bracket behind team A despite clearly being a stronger team; if team B can put up good enough stats in the remaining rounds to compensate for its bad showing in the first several rounds, I see no problem with having B receive an invitation ahead of A. Leapfrogging within the same bracket seems even more justifiable--if team A and team B both wind up in the top bracket, and team B has better TPTH and bonus conversion but went 2-3 in close matches while team A went 3-2, I'd say that team B is the stronger team and deserves a bid to ICT ahead of team A. I might also say that team A played a better tournament that day, but if we're trying to predict who will do better at a later tournament I'd go with team B.

If Dwight (or someone else) wants to crunch some data to see how often RSV ordering disagrees with order of finish for the 2007-2009 SCTs, I think it would be interesting to see whether RSV already "nearly always" respects order of finish within an SCT. If not, I would be in favor of adding some third component to the CSD based on win-loss record rather than using the ASV scheme.

-Seth
Seth Teitler
Formerly UC Berkeley and U. Chicago
President of NAQT
Emeritus member of ACF
User avatar
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA
Contact:

Re: S-value revision

Post by cvdwightw »

setht wrote:I'm with you so far, Dwight, but before we move forward with using tossup points per tossup heard and figuring out methods for correcting that statistic for strength-of-schedule effects I think we should check that bonus conversion by itself is not as good as or better than some combination of TPTH and bonus conversion for predicting performance at ICT--focusing only on cases with decent sample size (say, teams that answered at least 50 bonuses at SCT). If it turns out that bonus conversion by itself is as good as or better than a mix of tossup and bonus statistics as a predictive statistic, then we should use that for all teams with decent numbers of bonuses and we can skip figuring out how to handle strength-of-schedule adjustments. If it turns out that including tossup statistics seems to give better predictive power then I think the scheme you present here is a good starting point.
I only included teams for which players accounting for at least 80% of the ICT scoring also played at SCT, and players accounting for at least 80% of the SCT scoring also played at ICT. This implies that no major contributors left or joined the team between SCT and ICT.

For the 2009 Data (12 teams):
Finish vs ASV: R^2 of 0.7191
Finish vs RSV: R^2 of 0.7156
Finish vs NBSC: R^2 of 0.7066
Finish vs NTSC: R^2 of 0.708

I'm not convinced that any of these are significantly different; also note that one of those 12 teams was Penn, which had several ICT forfeit losses. Deleting Penn puts all four R^2 values in the neighborhood of 0.78. I think the interesting thing will be looking at the 2008 data (14 teams), in which raw TPTH correlates with finish essentially not at all (R^2 of 0.0618) and BC correlates not particularly well (R^2 of 0.4865).

Because getting tossups is such an integral part of the game, I would rather include tossups if it gives greater or equal predictive power. Of course, given my preliminary results on the 2008 data, I'm not so sure about this.
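For reference, the R^2 values quoted here are squared Pearson correlations (equivalently, the R^2 of a one-predictor linear regression); a sketch with made-up placeholder numbers, not the actual 2008/2009 data:

```python
# R^2 between ICT finish and a candidate statistic: the square of the
# Pearson correlation coefficient. The data points below are illustrative
# placeholders only.
from statistics import mean

def r_squared(xs, ys):
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    varx = sum((x - mx) ** 2 for x in xs)
    vary = sum((y - my) ** 2 for y in ys)
    return cov * cov / (varx * vary)

finishes = [1, 2, 3, 4, 5]            # hypothetical ICT finishes
tpth     = [6.1, 5.2, 4.8, 3.9, 2.5]  # hypothetical TPTH values
print(r_squared(finishes, tpth))
```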
setht wrote:I assume the NTSC and NBSC are signed quantities.
They are.
setht wrote:I'm not sure we should set up anything like the ASV; I'll get back to that in a moment. Random comments: in 8D, the process loops until order of finish matches order of ASV, not RSV, right? Also, if a lower TPTH team forfeits a match to a higher TPTH team, that actually doesn't necessarily decrease its own OAOP: that depends on whether the higher TPTH team is above or below the TPTH average for all the other opponents of the lower TPTH team.
The ASV and order-of-finish within sectionals match. The RSV does not match anything. You are right on the second point. I also neglected to point out that apparently I am counting a forfeit win as 0 TUH and a forfeit loss as 0 TPTH in 20 TUH, and that I am apparently doing that for all calculations. Thus, a forfeit loss will (most of the time) decrease your TPTH regardless of what it does to the OAOAP. I say most of the time because it's possible to end up with a negative score in a game, but it's also possible to lose worse than 9-0 in baseball.
setht wrote:If Dwight (or someone else) wants to crunch some data to see how often RSV ordering disagrees with order of finish for the 2007-2009 SCTs, I think it would be interesting to see whether RSV already "nearly always" respects order of finish within an SCT. If not, I would be in favor of adding some third component to the CSD based on win-loss record rather than using the ASV scheme.
The following were instances in 2009 where the as-currently-computed RSV did not match order of finish:

Canadian Sectional: Rochester 889 (6th place), McGill 881 (5th place), UWO 857 (T-3rd place). Adjusted to 871, 877, and 879, respectively. Effect: Missouri S&T (873) would now be invited after UWO and McGill but before Rochester, instead of after Rochester and McGill and before UWO.

Mideast Sectional: Penn 1128 (2nd place), CMU 1116 (1st place). Adjusted to 1120 and 1124, respectively. Effect: other than flipping invitation order, none.

Southeast Sectional: Florida State 1200 (2nd place), Florida 1161 (1st place). Adjusted to 1174 and 1187, respectively. Effect: Stanford B (1190) would be invited ahead of both Florida State and Florida, instead of ahead of Florida only.

South Sectional: LA-Lafayette 664 (5th place), Alabama B 650 (4th place). Adjusted to 654 and 659, respectively. Effect: other than flipping invitation order, none.
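The step-8 averaging that produces these adjusted numbers can be sketched for a single leapfrog group; the helper below is hypothetical in form but reproduces the Mideast and Canadian figures above:

```python
# One leapfrog group: the team whose RSV rank sits k spots below its order
# of finish, plus the k teams above it by RSV (k+1 teams total). Each
# team's ASV averages all k+1 RSVs together with the RSV at its own
# finish position counted a second time.

def adjust_group(rsvs_by_finish):
    """rsvs_by_finish: the group's RSVs listed in order of finish (so the
    values are out of sorted order). Returns ASVs in the same order."""
    by_rsv = sorted(rsvs_by_finish, reverse=True)  # RSVs in rank order
    total = sum(rsvs_by_finish)
    n = len(rsvs_by_finish)  # n = k + 1
    return [(total + by_rsv[i]) / (n + 1) for i in range(n)]

# Mideast DI 2009: CMU finished 1st with RSV 1116, Penn 2nd with 1128.
print(adjust_group([1116, 1128]))  # -> [1124.0, 1120.0]
```

Running it on the Canadian group [857, 881, 889] (UWO, McGill, Rochester in finish order) gives 879, 877, 871, matching the adjustments listed above.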
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
User avatar
setht
Auron
Posts: 1205
Joined: Mon Oct 18, 2004 2:41 pm
Location: Columbus, Ohio

Re: S-value revision

Post by setht »

cvdwightw wrote:I think the interesting thing will be looking at the 2008 data (14 teams), in which raw TPTH correlates with finish essentially not at all (R^2 of 0.0618) and BC correlates not particularly well (R^2 of 0.4865).
That's distressing. Does any combination of TPTH and BC do better? Is there some reason to believe that the 2008 data are just not very good (perhaps the 14 teams all clumped in one part of the field because a bunch of really good [or really bad] teams hosted SCT that year, or something)?
cvdwightw wrote:
setht wrote:If Dwight (or someone else) wants to crunch some data to see how often RSV ordering disagrees with order of finish for the 2007-2009 SCTs, I think it would be interesting to see whether RSV already "nearly always" respects order of finish within an SCT. If not, I would be in favor of adding some third component to the CSD based on win-loss record rather than using the ASV scheme.
The following were instances in 2009 where the as-currently-computed RSV did not match order of finish:

Canadian Sectional: Rochester 889 (6th place), McGill 881 (5th place), UWO 857 (T-3rd place). Adjusted to 871, 877, and 879, respectively. Effect: Missouri S&T (873) would now be invited after UWO and McGill but before Rochester, instead of after Rochester and McGill and before UWO.

Mideast Sectional: Penn 1128 (2nd place), CMU 1116 (1st place). Adjusted to 1120 and 1124, respectively. Effect: other than flipping invitation order, none.

Southeast Sectional: Florida State 1200 (2nd place), Florida 1161 (1st place). Adjusted to 1174 and 1187, respectively. Effect: Stanford B (1190) would be invited ahead of both Florida State and Florida, instead of ahead of Florida only.

South Sectional: LA-Lafayette 664 (5th place), Alabama B 650 (4th place). Adjusted to 654 and 659, respectively. Effect: other than flipping invitation order, none.
This seems to me to be an acceptably small number of changes in order of invitations--out of 88 teams in DI, we have about 5 switches in ordering (and of those, I think only 2 switches involved pairs of teams that actually received invites). As far as I can tell none of the changes in invite order could possibly have changed the composition of the ICT field.

If you have time, I'm curious: what do things look like in Div II, and in 2007 and 2008?

-Seth
Seth Teitler
Formerly UC Berkeley and U. Chicago
President of NAQT
Emeritus member of ACF
User avatar
Important Bird Area
Forums Staff: Administrator
Posts: 6112
Joined: Thu Aug 28, 2003 3:33 pm
Location: San Francisco Bay Area
Contact:

Re: S-value revision

Post by Important Bird Area »

setht wrote:
cvdwightw wrote:I think the interesting thing will be looking at the 2008 data (14 teams), in which raw TPTH correlates with finish essentially not at all (R^2 of 0.0618) and BC correlates not particularly well (R^2 of 0.4865).
That's distressing. Does any combination of TPTH and BC do better? Is there some reason to believe that the 2008 data are just not very good (perhaps the 14 teams all clumped in one part of the field because a bunch of really good [or really bad] teams hosted SCT that year, or something)?
Second Seth's request for a look at the 2007 and 2008 data. I think in general we should look at larger sample sizes wherever possible; it's easy to imagine situations where one year's worth of data isn't really useful. (Both the host effects Seth mentions, and more/less substantial roster change between SCT and ICT.)
Jeff Hoppes
President, Northern California Quiz Bowl Alliance
former HSQB Chief Admin (2012-13)
VP for Communication and history subject editor, NAQT
Editor emeritus, ACF

"I wish to make some kind of joke about Jeff's love of birds, but I always fear he'll turn them on me Hitchcock-style." -Fred
User avatar
cvdwightw
Auron
Posts: 3291
Joined: Tue May 13, 2003 12:46 am
Location: Southern CA
Contact:

Re: S-value revision

Post by cvdwightw »

There are a total of 37 teams from 2007 to 2009 that had 80% of their ICT scoring at SCT and 80% of their SCT scoring at ICT. Raw TPTH correlated with ICT finish R^2 = 0.4616; Raw BC correlated with ICT finish R^2 = 0.7034. I suspect that throwing in un-normed data sets from three different years is not the right thing to do and may go back and look at normalized TPTH, normalized BC, and OAOAP-adjusted TPTH.
setht wrote:This seems to me to be an acceptably small number of changes in order of invitations--out of 88 teams in DI, we have about 5 switches in ordering (and of those, I think only 2 switches involved pairs of teams that actually received invites). As far as I can tell none of the changes in invite order could possibly have changed the composition of the ICT field.
Actually, I think that's only out of about 60 teams, since I excluded all the combined field teams. It's theoretically possible, but unlikely, that the composition of the ICT field would change when converting from RSV to ASV. My main reason for doing so is to avoid teams that beat up on a lower playoff bracket getting invited ahead of teams struggling in a higher bracket (as we saw with the 2009 Canada data, the OAOAP isn't always enough of a correction to ensure this), or teams that get crushed in the finals getting invited behind the 3rd place team from that sectional. I'll see if I can get D2 data sometime next week, unless someone else wants to do the data mining.

Finishing ahead of a team may not be the best predictor of ICT success, but at the same time, I feel it is important to somehow reward teams that do what's going to get them high ICT finishes in the first place: win games. Yes, there might be some random factors in there that cause a superior team to lose 2 out of 3 close games to an inferior team, and that's why the ASV doesn't simply swap values; instead, it lets each team keep part of its own value.

I'm also not convinced that ICT finish is exactly what we want to look at here. The ICT in 2007 especially and to a degree in 2008 had largely unbalanced prelim brackets, and 2009 got thrown off by Penn's airline mishaps. Perhaps we want to see how it correlates with tossup/bonus performance at ICT instead?
Dwight Wynne
socalquizbowl.org
UC Irvine 2008-2013; UCLA 2004-2007; Capistrano Valley High School 2000-2003

"It's a competition, but it's not a sport. On a scale, if football is a 10, then rowing would be a two. One would be Quiz Bowl." --Matt Birk on rowing, SI On Campus, 10/21/03

"If you were my teammate, I would have tossed your ass out the door so fast you'd be emitting Cerenkov radiation, but I'm not classy like Dwight." --Jerry
User avatar
Important Bird Area
Forums Staff: Administrator
Posts: 6112
Joined: Thu Aug 28, 2003 3:33 pm
Location: San Francisco Bay Area
Contact:

Re: S-value revision

Post by Important Bird Area »

cvdwightw wrote:I'm also not convinced that ICT finish is exactly what we want to look at here. The ICT in 2007 especially and to a degree in 2008 had largely unbalanced prelim brackets, and 2009 got thrown off by Penn's airline mishaps. Perhaps we want to see how it correlates with tossup/bonus performance at ICT instead?
I would (at least) throw out any of the 2009 data that measures travel disasters rather than quizbowl skill.

More generally: I'd be fine with examining correlation to ICT tossup/bonus performance rather than actual finish, if we can show some data demonstrating that there is no such thing as clutch quizbowl. (That is: that a team's record should always regress to what would be predicted by calculating a Pythagorean record based on its TPTH and bonus conversion.)
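A sketch of what such a Pythagorean check might look like, assuming one bonus per answered tossup and treating the exponent as a free parameter that would have to be fit to SCT/ICT data (the value 2 below is purely illustrative, not a claim about quizbowl):

```python
# Pythagorean expected record: build points per game out of TPTH and
# bonus conversion, then predict win percentage from points for and
# against. The exponent x is a free parameter to be fit; 2 is a
# placeholder, not an estimated value.

def points_per_game(tpth, ppb, tuh_per_game=20):
    """Tossup points plus bonus points, assuming one bonus per answered
    tossup and ignoring powers/negs (tossups answered ~ tu_pts / 10)."""
    tu_pts = tpth * tuh_per_game
    return tu_pts + (tu_pts / 10) * ppb

def pythagorean_win_pct(pf, pa, x=2):
    """Classic Pythagorean expectation on points for (pf) / against (pa)."""
    return pf**x / (pf**x + pa**x)
```

Comparing each team's actual SCT record to this prediction over several years would show whether residual "clutchness" carries any signal worth keeping.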
Jeff Hoppes
President, Northern California Quiz Bowl Alliance
former HSQB Chief Admin (2012-13)
VP for Communication and history subject editor, NAQT
Editor emeritus, ACF

"I wish to make some kind of joke about Jeff's love of birds, but I always fear he'll turn them on me Hitchcock-style." -Fred
User avatar
The Friar
Wakka
Posts: 159
Joined: Fri Jul 10, 2009 2:39 pm

Re: S-value revision

Post by The Friar »

First of all, sorry to be somewhat late to this party. I've been working on the proposal below since about 10:30 Tuesday morning.

That said, I wish to propose a rather new approach: using the outcomes of individual questions as data points, rather than the outcomes of games or rate statistics aggregated over the course of a tournament.

I propose a statistical model called FRIAR, for FRIAR Ranks Items And Respondents, which does so. Here is the pitch.

First, question-level data points may be up to 20 times more numerous than game-level ones from the same competition, meaning that a question-level model has much richer information to draw from than one based on data aggregated at a higher level. (Data at the question level also allow the data-generating process to be modeled in greater detail, more faithfully reproducing the structure of random errors present in real-life quizbowl, and thus reducing bias in estimates of strength.)

Second, using data at the question level allows questions present in both the Division I and Division II sets to serve as a bridge between the two separate groups of teams. Since all teams have their performance evaluated on those questions, appropriate scaling of ratings between the divisions is automatic in FRIAR, as is correct ranking of DI teams who play in combined fields on DII questions and of CC teams who play in separate fields on DII sets.

Third, the model may be run in reverse with fixed ability parameters in order to simulate games, providing a natural way to convert differences in ratings to an expected-wins metric and thereby resolve problems of intransitivity in the desired order of invitation (that is, to decide where to slot a team C from another sectional whose rating falls between the higher-rated B and the lower-rated A, when A won more games than B and so must be invited ahead of B).

Fourth, FRIAR is based upon well-developed statistical theory -- specifically, item response theory, an approach to evaluating psychometric test results, with the same mathematical form as the Bradley-Terry model underpinning the KRACH ratings proposed above. In more detail, FRIAR models the probability that team A will answer a tossup against team B as proportional to e^((ability rating of A) - (ability rating of B) - (difficulty rating of tossup)). (Physics and chemistry students: yes, that's a Boltzmann factor. It's funny how many places reasonable assumptions lead to the same statistical form.) FRIAR also derives information from bonuses, using an ordinal version of the IRT model, in which the probability of getting exactly 10n points on a bonus equals (e^((ability rating of A) - (10n-point difficulty threshold of bonus)) / (1 + e^((ability rating of A) - (10n-point difficulty threshold of bonus)))) - (probability of getting more), where, of course, the probability of getting more than 30 points is 0. Extensions to powers and bonuses with 5n-point thresholds are discussed in the paper.
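A minimal numerical sketch of those two probability expressions may make them concrete. The ability ratings, difficulty values, and the baseline weight of 1 for a dead tossup are all illustrative assumptions of this sketch, not figures from the paper:

```python
import math

def logistic(x):
    # Standard logistic function sigma(x) = 1 / (1 + e^(-x)).
    return 1.0 / (1.0 + math.exp(-x))

def tossup_probs(theta_a, theta_b, difficulty):
    """Per the Rasch/Bradley-Terry form described above: team A's chance of
    answering is proportional to e^(theta_A - theta_B - difficulty), and
    symmetrically for B. A baseline weight of 1 for a dead tossup is an
    assumption made here so the outcomes normalize to a distribution.
    Returns (P(A answers), P(B answers), P(dead))."""
    wa = math.exp(theta_a - theta_b - difficulty)
    wb = math.exp(theta_b - theta_a - difficulty)
    total = wa + wb + 1.0
    return wa / total, wb / total, 1.0 / total

def bonus_probs(theta, thresholds):
    """Ordinal (graded-response) bonus model: P(score >= 10n) =
    logistic(theta - b_10n), so P(exactly 10n) = P(>= 10n) - P(>= 10(n+1)),
    with P(more than 30) = 0. `thresholds` holds the 10-, 20-, and 30-point
    difficulty thresholds in increasing order.
    Returns [P(0), P(10), P(20), P(30)]."""
    at_least = [1.0] + [logistic(theta - b) for b in thresholds] + [0.0]
    return [at_least[i] - at_least[i + 1] for i in range(len(at_least) - 1)]
```

For instance, `bonus_probs(0.5, [-1.0, 0.0, 1.0])` yields a valid four-outcome distribution over 0, 10, 20, and 30 points that sums to one.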

I have run FRIAR on a set of 9000 simulated data points to make sure it runs, which it does. I would love at some point to test the performance of the model on real NAQT results; I'll need scoresheets from a year's SCTs, unless data at the question level are already tabulated. If scoresheets aren't saved, I'll do what I can to simulate an SCT.
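Generating simulated data amounts to running the model forward with fixed ability parameters, as described in the third point above. The sketch below draws tossup outcomes from the Bradley-Terry weights; the ability values, the difficulty list, and the baseline weight of 1 for a dead tossup are assumptions of this illustration:

```python
import math
import random

def simulate_tossups(theta_a, theta_b, difficulties, seed=0):
    """Simulate the winner of each tossup between teams with fixed ability
    ratings theta_a and theta_b. Outcome weights follow the model's form,
    e^(theta_A - theta_B - difficulty) for A and the mirror image for B;
    the dead-tossup weight of 1 is this sketch's assumption.
    Returns a list of "A", "B", or "dead", one entry per tossup."""
    rng = random.Random(seed)
    results = []
    for d in difficulties:
        wa = math.exp(theta_a - theta_b - d)
        wb = math.exp(theta_b - theta_a - d)
        total = wa + wb + 1.0
        u = rng.random() * total
        if u < wa:
            results.append("A")
        elif u < wa + wb:
            results.append("B")
        else:
            results.append("dead")
    return results
```

Repeating such draws over many simulated games, and counting how often each team wins, is one way to turn a rating difference into the expected-wins metric mentioned earlier.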

I welcome questions and comments with reference to the paper. Thanks very much for your attention.

FRIAR: An Exact Model for Ranking Quizbowl Teams using Question-Level Results (pdf)

Abstract: Most of the proposals for an improved NAQT S-Value use only data aggregated at the game or tournament level. Much more information is available at the level of the individual question. By plugging individual question results into a statistical model of the tossup-bonus cycle itself, we may estimate the most accurate possible player ability rankings using either maximum-likelihood or Bayesian estimation. In this paper, such a Bayesian model is developed as an extension of the standard one-parameter item response theory (Rasch or Bradley-Terry) model. The proposed model, named FRIAR, naturally supplies some of the most challenging desiderata for an ability ranking, such as scaling of separate divisions relative to each other and resolution of intransitivity in the desired order of invitation. The FRIAR model also provides for simultaneous estimation of question difficulty levels.
Gordon Arsenoff
Rochester '06
WUStL '14 (really)

Developer of WUStL Updates Statistics Live!
User avatar
grapesmoker
Sin
Posts: 6345
Joined: Sat Oct 25, 2003 5:23 pm
Location: NYC
Contact:

Re: S-value revision

Post by grapesmoker »

On principle I support any metric that uses Boltzmann factors!
Jerry Vinokurov
ex-LJHS, ex-Berkeley, ex-Brown, sorta-ex-CMU
presently: John Jay College Economics
code ape, loud voice, general nuisance
Locked