Introducing BPA, a new evaluation metric using detailed stats

ryanrosenberg · Post by **ryanrosenberg** » Wed Oct 17, 2018 3:52 pm

What is BPA?

BPA stands for Buzz Point AUC (area under the curve). It is the total area under the curve of [% of tossups gotten successfully] against [% of question elapsed].

The theoretical maximum is 100 (i.e., if all tossups were gotten near-instantly); however, top players will generally get somewhere between 10 and 15 at regular difficulty, which corresponds to preventing about 10-15% of the total words in the tournament's tossups from being read by getting the question. Top teams will generally get around 20-25 at regular difficulty. As an illustration, below is Chris Ray's buzz point graph from 2018 ACF Regionals.

BPA can be calculated for any tournament that records buzz points.

How do I calculate BPA?

BPA is actually pretty easy to calculate, especially for an individual player. The below screenshot shows an example of calculating conversion percent at each buzz point (max gets is the number of possible tossups heard, so games played times 20), and BPA is simply the sum of column F (over all buzz points 0.01 to 1).
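As a rough sketch of that calculation in Python (not the spreadsheet from the screenshot): the function name `bpa`, its parameters, and the sample buzzes are made up for illustration, and it assumes buzz locations are recorded as fractions of the question from 0 to 1 and rounded to the nearest hundredth.

```python
def bpa(buzz_locations, games_played, tossups_per_game=20):
    """Sum cumulative conversion percent over buzz points 0.01, 0.02, ..., 1.00.

    buzz_locations: locations of one player's correct buzzes, each a
        fraction of the question elapsed (0 = first word, 1 = end).
    games_played: games the player appeared in.
    tossups_per_game: assumed to be 20, as in the description above.
    """
    max_gets = games_played * tossups_per_game
    # Work in integer hundredths to avoid floating-point comparison issues.
    buzz_points = [round(loc * 100) for loc in buzz_locations]
    total = 0.0
    for point in range(1, 101):
        # Gets made at or before this point in the question.
        cum_gets = sum(1 for b in buzz_points if b <= point)
        total += cum_gets / max_gets
    return total


# Hypothetical player: 10 games, three gets at 50%, 75%, and 100% of the question.
example = bpa([0.5, 0.75, 1.0], games_played=10)
# Buzzing at the very start of all 20 tossups in one game gives the maximum.
maximum = bpa([0.0] * 20, games_played=1)
```

Here `example` comes out to about 0.39 and `maximum` to 100, which matches the theoretical ceiling described earlier.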

What are the advantages of BPA over other quizbowl stats?

BPA is the first metric to take advantage of buzz point tracking and provide a more detailed view into how early people are getting questions. This reveals player skill that may be masked by traditional stats.

For example, let's look at the top two scorers from the Minnesota site of CMST: Shan Kothari and Auroni Gupta. Shan outscored Auroni by about a tossup per game, and recorded seven powers to Auroni's four. However, Auroni had a 6.9 BPA, while Shan came in at 6.47, since Auroni was buzzing earlier on a higher percentage of tossups, particularly in the late-middle clues, before Shan overtook him during giveaways. BPA ranking Auroni over Shan is in line with subjective appraisals of the two players (the player poll had Auroni as a top-5 player in grad school, and Shan in the 10-15 range), but neither of the traditional stats (PPG and powers) captures this difference in skill.

What are BPA's shortcomings?

BPA is still, like PPG, a heavily context-dependent stat, and is not exactly comparable across fields of different strength (or even across different schedules in the same field). Teammate effects are also fairly strong; BPA does not incorporate the PATH adjustment for shadow effect since I believe that introduces more false positives than the false negatives it corrects.

Who does BPA say is good at quizbowl?

The top 10 players at CMST were Jordan Brownstein (18.04), Jacob Reed (11.36), Stephen Liu (10.51), Neil Gurram (10.21), Eric Mukherjee (9.05), John Lawrence (8.37), Rafael Krichevsky (7.99), Matt Bollinger (7.95), Will Alston (7.36), and Auroni Gupta (6.9).
The top 5 teams were Brownstein et al. (23.86), Yale (23.13), BHSU A (20.45), Bloq Mayus (18.95), and Chicago A (18.13).

The top 10 players at 2018 Regionals were Eric Mukherjee (17.15), Jakob Myers (15.68), Aseem Keyal (14.33), Evan Lynch (12.89), Rafael Krichevsky (12.84), Eric Wolfsberg (12.72), Adam Silverman (12.56), Chris Ray (12.18), John Lawrence (11.82), and Derek So (11.64).
The top 5 teams were Penn A (28.32), Berkeley A (27.6), Chicago A (25.35), Columbia A (25.05), and Maryland A (25.03).

There's also category-specific BPA! Here are overall and category-specific rankings for 2018 Regionals and CMST.

vinteuil · Post by **vinteuil** » Wed Oct 17, 2018 4:09 pm

I think this might be the most precise (and intuitively useful) non-PATH-like stat we've ever had—thanks to Ryan for the computations and visualizations!

naan/steak-holding toll · Wed Oct 17, 2018 4:17 pm

Auroni Gupta (6.9)

nice

t-bar · Post by **t-bar** » Wed Oct 17, 2018 6:47 pm

This is awesome! Thanks for putting the work into coming up with this.

This is also an interesting statistic to look at on a game-by-game basis, though you have to take the results with a grain of salt. Here are the top 10 games from 2018 ACF Regionals by total BPA:

Code: Select all

Winner          Loser            Score    Winner BPA  Loser BPA  Total BPA
Berkeley A      UC San Diego B   500-80   37.915      6.7        44.615
Cambridge B     Oxford B         315-290  24.515      19.83      44.345
Penn A          Villanova        490-50   37.735      6.595      44.33
Penn A          Johns Hopkins A  375-240  31.585      12.31      43.895
Northwestern A  MSU A            320-285  15.855      26.835     42.69
Columbia A      Amherst          355-200  26.67       15.785     42.455
McGill A        McGill B         315-175  26.845      15.42      42.265
Penn A          Delaware         490-115  31.745      10.04      41.785
Northwestern A  Ohio State A     385-215  22.425      19.155     41.58
Columbia A      Harvard A        375-170  26.98       14.565     41.545

Note that in the fifth game, Northwestern A beat MSU A despite having a significantly lower BPA. This is partly due to the fact that Northwestern waited until the end on all three of MSU's negs, while not negging at all themselves. However, even on the 7 live tossups they converted, Northwestern had an average buzz location of 0.547, substantially later than MSU's average of 0.463.

Here are the five games with the closest margin of BPA, selected from among games with a total BPA of at least 30:

Code: Select all

Winner        Loser      Score    Winner BPA  Loser BPA  Total BPA
Ohio State A  Chicago B  325-240  15.1        15.26      30.36
Harvard A     Yale A     310-245  15.9        14.855     30.755
McGill A      Toronto A  270-240  14.95       16.84      31.79
MSU A         Chicago A  310-260  17.33       19.585     36.915
Berkeley B    Stanford   305-230  15.215      17.56      32.775

In all but one of these games, the winner had the lower BPA. However, only some of them can be chalked up to a negstorm by the losing team. For example, in the McGill-Toronto game, McGill went 9/4 to Toronto's 10/2 and won on the strength of their bonus conversion.

Interesting questions for future BPA analysis: what fraction of games are won by the team with the lower BPA? In these situations, can we discriminate between occurrences of (a) one team waiting to the end on a bunch of negs, (b) one team out-bonusing the other, (c) one team having a large advantage in certain categories and being able to sit on those questions, (d) something else? Perhaps it's fruitful to only consider tossups that were not negged, in order to restrict the analysis to situations in which both teams are playing each tossup live. This requires a bit more careful work to determine the number of tossups heard, but it's certainly possible with the data we have.
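The first of those questions is easy to sketch in Python once per-game BPA is in hand. The snippet below is only an illustration: the function name is made up, and the data is just the five close games listed above, which were selected for having small BPA margins and so say nothing about the rate across the whole tournament.

```python
# (winner, loser, winner BPA, loser BPA), copied from the close-games table above.
close_games = [
    ("Ohio State A", "Chicago B", 15.1, 15.26),
    ("Harvard A", "Yale A", 15.9, 14.855),
    ("McGill A", "Toronto A", 14.95, 16.84),
    ("MSU A", "Chicago A", 17.33, 19.585),
    ("Berkeley B", "Stanford", 15.215, 17.56),
]


def lower_bpa_win_fraction(games):
    """Fraction of games in which the winner had the lower BPA."""
    if not games:
        raise ValueError("need at least one game")
    upsets = sum(1 for _, _, winner_bpa, loser_bpa in games if winner_bpa < loser_bpa)
    return upsets / len(games)


fraction = lower_bpa_win_fraction(close_games)
```

For these five games the fraction is 0.8, matching the "all but one" count above. Running the same function over every game in the data set would answer the question properly.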

AGoodMan · Post by **AGoodMan** » Wed Oct 17, 2018 11:58 pm

This is super cool! Is there any chance we can see similar metrics for EFT?

ryanrosenberg · Post by **ryanrosenberg** » Thu Oct 18, 2018 8:07 am

AGoodMan wrote: ↑Wed Oct 17, 2018 11:58 pm This is super cool! Is there any chance we can see similar metrics for EFT?

Yes, I'll post EFT BPA later today.

ryanrosenberg · Post by **ryanrosenberg** » Mon Oct 22, 2018 10:02 pm

Here's a public link to code used to generate overall BPA for last year's Regionals.

ProfessorIanDuncan · Post by **ProfessorIanDuncan** » Wed Oct 24, 2018 1:40 pm

Does this metric factor in negs? Would that be a useful feature? It seems that adding a negative value, namely the difference between the neg point and the earlier of the end of the question and the correct answer's buzz point, could shed some light on how negs affect how much of the tournament is heard. I suppose that this would fail to take into account teams waiting until the end of the question to convert, so maybe it's not that useful of an addition.
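One possible reading of that proposal, sketched in Python. This is not part of BPA as defined in this thread; the function name and the choice to measure everything as a fraction of the question (0 to 1) are assumptions made for the example.

```python
def extra_fraction_heard(neg_loc, correct_loc=None):
    """Fraction of the question read after a neg.

    neg_loc: where the neg happened, as a fraction of the question (0 to 1).
    correct_loc: where the other team answered correctly, or None if the
        tossup went dead. Buzzes after the question ended are capped at 1.
    """
    if not 0.0 <= neg_loc <= 1.0:
        raise ValueError("neg_loc must be between 0 and 1")
    end = 1.0 if correct_loc is None else min(1.0, correct_loc)
    # The correct buzz should come after the neg; clamp at 0 just in case.
    return max(0.0, end - neg_loc)


# Neg at 40% of the question, other team converts at 90%.
converted = extra_fraction_heard(0.4, 0.9)
# Neg at 40% of the question, tossup goes dead.
dead = extra_fraction_heard(0.4)
```

The penalty would be this value subtracted from the negging player's total. As noted above, it grows when the other team simply waits until the end, so it measures the opponent's patience as much as the cost of the neg.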

Jasconius · Post by **Jasconius** » Sun Apr 12, 2020 9:12 pm

I calculated BPA for BLAST Online this afternoon. Although I might be the only person a post like this applies to, I thought I'd list a few of the traps I fell into and how to avoid them.

I calculated it in R, using Ryan's script posted above. I'd never used R before this afternoon, but I was able to download R and RStudio easily enough. Ryan's code uses the tidyverse library, which I had to install and load before the code would work, but that was easy enough to find online.

When my spreadsheet finally worked, it used the following columns (everything in single quotes is the name of a column): 'round', 'packet', 'tossup', 'answer', 'category', 'subcategory', 'team', 'player', 'buzz_value', and 'buzz_location_pct'. Of these, 'packet' is super important: the script relies on each packet having a name, even if it's just the name of the round the packet was played in. From glancing over the code, I think 'team', 'player', 'category', 'buzz_value', and 'buzz_location_pct' are all necessary. And because of the way Regionals worked, you can't use S as a packet name without changing the code.

I ended up getting all of this from the file used to generate the ACF Regionals BPA, which Ryan posted here: https://github.com/quizbowl/open-data/b ... ossups.tsv.

Ryan's code also takes in a tsv file. If your data is in a different format, either save it as a .tsv file or change "read_tsv" in the second line of code to "read_csv" and save it as a .csv. Using an .xlsx as the input file doesn't really work, and it's easy enough to convert one to a csv.

For category stats, Ryan was kind enough to share his code for that, which can be found below. I'm pretty sure it only works after you've run the overall code. Also, another mistake I made was failing to recognize the difference between "Arts" and "Fine Arts".

Code: Select all

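# Needs the tidyverse, plus regs_tossups, regs_games_played, and
# regs_category_counts, which should already exist after running the overall code.
library(tidyverse)

category_bpa <- regs_tossups %>%
  # Keep only rows with a recorded buzz location
  filter(!is.na(buzz_location_pct)) %>%
  left_join(regs_games_played) %>%
  left_join(regs_category_counts) %>%
  # Maximum possible gets in the category
  mutate(max_gets = tu_count*n) %>%
  mutate(conv_flag = ifelse(buzz_value == "10", 1, 0),
         #Uncomment below line if set has powers
         #conv_flag = ifelse(buzz_value %in% c("15","10"), 1, 0),
         buzz_location_pct = ifelse(is.na(buzz_location_pct), 1, round(buzz_location_pct, 2)),
         # Factor levels 0, 0.01, ..., 1 so every buzz point can be filled in below
         buzz_location_pct = factor(buzz_location_pct, levels = seq(0,1,.01))) %>%
  group_by(player, category, team, max_gets, buzz_location_pct) %>%
  summarize(gets = sum(conv_flag)) %>%
  # Add a zero-get row for every buzz point a player never buzzed at
  complete(nesting(player, category, team, max_gets), buzz_location_pct, fill = list(gets = 0)) %>%
  group_by(player, category, team) %>%
  # Cumulative gets and conversion percent at each buzz point
  mutate(cum_gets = cumsum(gets),
         conv_pct = cum_gets/max_gets,
         buzz_location_pct = buzz_location_pct %>% as.character() %>% as.numeric()) %>%
  ungroup() %>%
  mutate(player = paste0(player, " (", team, ")")) %>%
  group_by(player, category, team) %>%
  # BPA is the sum of conversion percent over all buzz points
  summarize(BPA = sum(conv_pct)) %>%
  arrange(-BPA)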

Alejandro · Post by **Alejandro** » Wed Apr 15, 2020 2:15 am

Jasconius wrote: ↑Sun Apr 12, 2020 9:12 pm Using an .xlsx as the input file doesn't really work, and it's easy enough to change into a csv.

If you have Excel 2016 or later, you can also use Power Query to calculate BPA by adding the Excel table as a source (disclaimer: I work on Power Query). Here's a sample for ACF Regionals 2018.

The approach to calculate the area is different, but should give similar results (instead of joining a list of percentages, calculate the number of buzzes at each percentage a player has buzzed at, and multiply it by how far that percentage is from 1).
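To make the comparison concrete, here is a small Python sketch of both approaches (this is not the Power Query sample itself). The function names, the scaling to a 0-100 range, the 0.01-to-1 grid, and the sample buzzes are all assumptions made for the example.

```python
from collections import Counter


def bpa_grid(buzz_locations, max_gets):
    """Sum cumulative gets over buzz points 0.01..1.00, divided by max gets."""
    points = [round(loc * 100) for loc in buzz_locations]
    total_gets = sum(sum(1 for b in points if b <= p) for p in range(1, 101))
    return total_gets / max_gets


def bpa_weighted(buzz_locations, max_gets):
    """Count buzzes at each percentage and weight by distance from the end."""
    counts = Counter(round(loc * 100) for loc in buzz_locations)
    # (100 - point) is how far the buzz is from 1, in hundredths of the question.
    return sum(n * (100 - point) for point, n in counts.items()) / max_gets


# Hypothetical player: two gets at 50% and one at 75%, out of 200 tossups heard.
buzzes = [0.5, 0.5, 0.75]
grid = bpa_grid(buzzes, max_gets=200)
weighted = bpa_weighted(buzzes, max_gets=200)
```

Under these assumptions `grid` is 0.64 and `weighted` is 0.625. The gap is one grid point per get (3/200 here), because the grid sum counts the buzz point itself, so the two versions should rank players almost identically.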

Post by **Smuttynose Island** » Wed Apr 15, 2020 2:39 am

I'm late to the party, but you can find a Jupyter Notebook template similar to the one I use to compute BPA on my github.
