Stunt Hunting and Stunt Busting

*An Exploration of Protection Against Pass Rush Games*

Authors: Mark Surma and Luke Wiley

Abstract:

Pass rush stunts have long been used as a tactic to generate pressure on the opposing quarterback. Over time the offensive line coaching community has developed a core set of philosophies on how to defeat these stunts. Here we put three of them to the test: keeping blockers at the "same level", maintaining "square shoulders", and "stopping the penetrator". After isolating and classifying stunts pitting two rushers against two pass protectors, we determined that while each philosophy has merit, they differ in utility. Ensuring that protectors maintain a low relative depth (stay near the same level) is particularly useful when facing "ET" stunts but is much less relevant against "TE" stunts. For those two stunt types it benefits both protectors to stay square, but the offensive tackle has much less tolerance for outside rotation early in the rep. Conversely, on "TT" stunts the center's degree of "squareness" matters little before the stunt declares. Regardless of stunt type, impeding the penetrator is a worthy goal. In measuring feature importance using a variety of model algorithms we determined that these are the three most important factors in protecting against stunts, in reverse order of presentation above. One caveat, however, is that we found measuring a blocker's squareness with respect to the quarterback to be more informative than with respect to the line of scrimmage. The speed of individual protectors emerged as a relevant aspect, with higher speed being associated with greater likelihood of rusher success. Whether or not protectors are able to exchange rushers also plays a role, with higher exchange rates being linked to successful protection on "ET" and "TT" stunts. Finally, we discuss how measuring these aspects and modeling pass rush win rate using all relevant features can be applied to player evaluation in opponent, self and pro scouting.

1. Introduction

With the NFL passing game ever evolving, effective pass protection has never been more important. Ensuring that the quarterback is given enough time to throw the football is critical to developing a winning offense. One way that defenses try to attack pass protections is by blitzing, sending five or more pass rushers. However, this comes at a cost, as it creates larger windows in the pass coverage. Another way to manufacture pressure without compromising pass coverage is to run a stunt. Running stunts allows a defense to be sound in coverage while also dictating matchups and/or creating confusion in pass protection. For this study, we defined a stunt as an intentional exchange of pass rush lanes between two or more rushers aligned within 2 yards of the line of scrimmage. Additional tests, detailed below, were applied in order to ensure that the lane exchanges selected for this study match the traditional definition of a pass rush stunt.

After identifying stunts from Weeks 1-8 of the 2021 NFL season, our project attempts to distill the aspects of pass protection that are most pivotal in determining success. We begin by evaluating commonly held beliefs of how stunts are defeated by pass protectors. Three practices that offensive line coaches preach for defeating the twist are:

  1. To be on the "same level" vertically in order to effectively pass off the rushers as they overlap,
  2. Having "square shoulders" so that one is in a position to collect a penetrating or looping rusher, and
  3. "Stopping the penetrator" close to the line of scrimmage so that the looper has a farther path to the QB.

Like many coaching philosophies, these arise from countless hours of film study, deliberate practice and demonstrated results. Here, tracking data and modern modeling techniques will allow us to not only determine how much each matters in defeating stunts but also bring to light other metrics not previously considered. We conclude by discussing how our findings and modeling generally can be applied in gameplan, practice and player evaluation settings.

2. Data Preparation

True Pass Sets

An important first step is determining which plays are most likely to yield stable and informative results. Pass stunts are unique in that they are designed to exploit traditional 5- and 6-man pass protections on straight dropback passes. In particular, defensive play-callers hope to run a stunt on the "man" side of those protections, where offensive linemen are primarily responsible for an individual rusher rather than an area of space. However, pass types such as rollouts or play action are often paired with protection schemes in which every lineman sets close to the line of scrimmage, blocking a gap relative to his alignment. This has the effect of neutralizing penetrators at the line and effectively killing the stunt. Therefore, we included only passes with dropBackType 'TRADITIONAL' or 'SCRAMBLE', as these include the types of protections that stunts are ideally suited to exploit.

Transforming the plane

After selecting plays of interest we isolated the pass rushers and pass protectors from the data provided for those plays. We then transformed the tracking data of those players to orient them as one would expect on an offensive diagram or scout card, with the offense below the line of scrimmage and the defense above. We mapped the spot of the ball to the origin (0, 0), putting the line of scrimmage at y = 0. Players to the left of the football have negative x values, while those to the right have positive values. Finally, we transformed the orientation values to lie on the interval (-180, 180] such that an offensive player facing the line of scrimmage has orientation 0. Offensive linemen with negative orientation values are facing left, while those with positive values are facing right.
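As a sketch, the whole transformation fits in a few lines of pandas. The column names (x, y, o, ball_x, ball_y, play_direction) are illustrative, and the exact orientation offset depends on the raw feed's zero direction:

```python
import pandas as pd

def transform_plane(tracking: pd.DataFrame) -> pd.DataFrame:
    """Orient tracking data like a scout card: ball at (0, 0), offense below."""
    out = tracking.copy()
    # Center on the spot of the ball so the line of scrimmage sits at y = 0.
    out["x_t"] = out["x"] - out["ball_x"]
    out["y_t"] = out["y"] - out["ball_y"]
    # Mirror plays driven toward the other end zone so all reps read the same.
    flip = out["play_direction"] == "left"
    out.loc[flip, ["x_t", "y_t"]] *= -1
    # Wrap orientation onto (-180, 180] with 0 = facing the line of scrimmage;
    # a constant offset may be needed first, depending on the raw convention.
    out["o_t"] = -(((-out["o"] + 180.0) % 360.0) - 180.0)
    return out
```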

Identifying Stunts For Exploration

Our first challenge in this study was extracting stunt cases to study. Without the benefit of film to analyze, identifying stunts came down to manipulating the tracking data itself. To create a list of stunt candidates, we first identified rushers whose paths crossed horizontally at a given frame. We then narrowed the list based on whether:

  1. Player interactions within the play fit our definition of a traditional stunt and
  2. The stunt was likely to have taken place intentionally.

Traditional Stunts vs Blitzes / Sims

In this study we were particularly interested in looking at what is traditionally considered a "stunt", "game", or "twist" and not necessarily rush lane exchanges that occur during a blitz or simulated pressure. Traditional stunts involve rushers who are aligned at or near the line of scrimmage and are typically accounted for by offensive linemen in protection schemes. We therefore excluded any rushers aligned 2 or more yards from the line of scrimmage at the snap or rushers whose alignment was identified as that of a secondary member (corners, slot corners and safeties).

Rusher Overlap

Perhaps the most fundamental concept to understand in investigating stunts is the idea of rusher overlap. During the execution of any stunt, one or more rushers crosses the face of another rusher. The former are known as "penetrators" and are responsible for setting a pick and/or moving the QB off of his spot. Meanwhile, the "looper" plays off of the penetrators and ultimately comes free to pressure the QB directly or clean up after a penetrator has flushed the QB. The time of overlap, when one rusher crosses the face of another, is when the stunt is said to have "declared". To determine whether a pair of rushers overlapped we compared their respective x-coordinates at each frame and tagged any frame in which their horizontal position flipped relative to that at the ball's snap.
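In code, the overlap check reduces to comparing each frame's horizontal order against the order at the snap; a minimal sketch with an illustrative column name:

```python
import pandas as pd

def overlap_frames(rusher_a: pd.DataFrame, rusher_b: pd.DataFrame) -> pd.Series:
    """Flag frames where two rushers' horizontal order differs from the snap.

    Both DataFrames are assumed to be frame-indexed, share an index and
    carry the transformed horizontal coordinate in a column `x_t`.
    """
    order_at_snap = rusher_a["x_t"].iloc[0] < rusher_b["x_t"].iloc[0]
    return (rusher_a["x_t"] < rusher_b["x_t"]) != order_at_snap
```

The first frame tagged True marks the moment the stunt declares.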

Intersecting Paths

Next, it is critical that the looper replace the last penetrator he overlaps in the latter's rush lane. Mathematically, this takes place if the looper's path after overlap passes through the path the penetrator traced before overlap. Using the tracking data and a few algebra-based user-defined functions, we were able to determine when overlapping rushers cross paths in this manner. The interaction was then tagged as a stunt candidate and each rusher involved was identified as either a looper or penetrator.

Rush paths that overlap but do not intersect
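Under the hood, those algebra-based helpers amount to a standard line-segment intersection test applied to consecutive pairs of tracked points; the function names here are ours, not the original code's:

```python
def cross(o, a, b) -> float:
    """2-D cross product of the vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])

def segments_intersect(p1, p2, q1, q2) -> bool:
    """True when segment p1-p2 strictly crosses segment q1-q2."""
    d1, d2 = cross(q1, q2, p1), cross(q1, q2, p2)
    d3, d4 = cross(p1, p2, q1), cross(p1, p2, q2)
    return d1 * d2 < 0 and d3 * d4 < 0

# Applied pairwise: does any segment of the looper's path after overlap
# cross any segment of the penetrator's path from before overlap?
```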

The Cross-Face Test

Finally, we dictated that in order for a lane exchange to truly be considered a stunt, the widest rusher involved must cross the face of the offensive lineman on whom he is aligned. To understand why this is critical, consider the case of a defensive end "bull-rushing" the offensive tackle he is matched up against. As the tackle gets driven inside, the path of an interior rusher to the QB is restricted. The interior rusher instinctively adjusts his path outside, and in doing so overlaps and intersects that of the defensive end. However, this exchange of rush lanes was not predetermined, and therefore we would not consider it a stunt. Applying this "cross-face test" allows us to identify and eliminate candidates whose paths may have crossed unintentionally. To do so, we first identified the "technique" of each rusher in the stunt (0-7 with an "i" tag for inside shades). We then compared the width of the rusher with the highest "technique" value with that of the lineman on whom he was initially matched up. Stunt candidates that did not pass this "cross-face test" were excluded from the study.

Rush paths that intersect but fail the cross-face test

Removing Incidental Stunts

One issue we sought to mitigate was mistaking an incidental exchange of pass rush lanes for a planned stunt. The former typically takes place as a rusher retraces his steps after running past the level of the quarterback or adjusts his path on a scramble. As these reactions typically take place late in the play, we imposed an upper bound on the amount of time after the snap we would look for a stunt to declare. Initially, the bound on overlaps was set at 2.6 seconds, as this is the median time to throw of pass plays in our sample. However, after collecting info on all potential stunts we observed that the rate at which offensive linemen exchange rushers in protection drops from 57.4% to 40.5% for stunts whose first overlap occurs at 2.3 and 2.4 seconds respectively. While it is possible that some designed stunts did declare after 2.3 seconds, the drop in exchange rate indicates that a higher percentage of rushers overlapping after this threshold did so incidentally. We therefore removed any stunt candidates where the first overlap occurred after 2.3 seconds.

Rush paths that cross incidentally at the top of the rush

Naming Stunt Types

A common convention for referring to different types of stunts is to use the letter of the rushers involved ("E" for "End", "T" for "Tackle") and to refer to the looper last. For example, a "TE" stunt involves a Tackle penetrating from the interior while an End loops inside. On an "ETT" stunt, an End and Tackle (typically aligned next to one another) both penetrate with another Tackle looping outside off of them. In this way the looper would overlap the penetrating Tackle first and the penetrating End second. Stunts in our study were named in accordance with this convention. During the stunt identification process rushers were assigned a "relative position" of either "E" or "T". Those getting an "E" were both the widest rusher on their side of the football AND were aligned head-up on the offensive tackle or wider (technique value of at least 4).

Determining Matchups

Once stunts were identified, we identified individual matchups before and after overlap in order to determine whether the pass protectors involved exchanged rushers after the stunt declared. To determine matchups we first assigned to each rusher the closest protector in terms of mean distance for the half-second before or after overlap. If a single protector was assigned to multiple rushers, we looked at the next closest protector to each rusher and chose the pairing with the lesser mean distance for re-assignment. In this way we were able to ensure one-to-one matchups in both phases. If the matchups before overlap differed from those after, we noted that an exchange had taken place. After determining matchups, our evaluation of a particular stunt involved only the rushers and protectors involved in that stunt.
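For the 2-vs-2 case studied below, the assignment and tie-breaking rule can be sketched as follows; the distance matrix and function names are our own framing of the procedure:

```python
import numpy as np

def assign_matchups(dist: np.ndarray) -> dict:
    """One-to-one rusher -> protector matchups for a 2-vs-2 stunt rep.

    dist[i, j] holds the mean distance between rusher i and protector j
    over the half-second window before (or after) overlap.
    """
    ranked = np.argsort(dist, axis=1)            # protectors by closeness
    match = {i: int(ranked[i, 0]) for i in range(dist.shape[0])}
    if match[0] == match[1]:                     # both claim one protector
        # Re-assign the rusher whose next-closest pairing is the tighter fit.
        move = min((0, 1), key=lambda i: dist[i, ranked[i, 1]])
        match[move] = int(ranked[move, 1])
    return match

# An exchange is flagged when the pre- and post-overlap matchups differ:
# assign_matchups(dist_before) != assign_matchups(dist_after)
```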

3. Feature Engineering

Given the great variety of stunt types and matchup scenarios, we decided to limit this study to stunts with "2 vs 2" matchups, i.e. those in which exactly two protectors were involved in blocking a stunt run by two rushers. These include cases in which the pass protectors either stayed on the rusher they were blocking or decided to exchange rushers following overlap. Using our domain knowledge as former college coaches (as well as that of former colleagues), we engineered features that would allow us to

  1. Test the philosophies mentioned in the introduction and
  2. Uncover other aspects of protection that may be critical to success for either side.

Each feature falls into one of two categories: those measuring aspects of the individual protectors (derived directly from the tracking data) and those describing an interaction or providing a composite measure of the two protectors. For consistency of comparison, pass protectors were identified as either the "inside" or "outside" protector based on which was aligned closer to the ball at the snap. Feature values were then assessed at 0.1 second intervals, mirroring the format in which the tracking data was provided.

Individual Features

  • width - float - Horizontal distance in yards from the original spot of the ball
  • depth - float - Vertical distance in yards from the line of scrimmage
  • a - float - Acceleration in yards/second^2
  • s - float - Speed in yards/second
  • squareness - float - Absolute angle of deviation (in degrees, [0,180]) from facing the line of scrimmage. Found by taking the absolute value of the transformed orientation value (see Transforming the Plane above).
  • open_outside - float - Orientation angle (in degrees, (-180,180]) with respect to the original spot of the ball. Those facing away from the spot have a positive value, while those facing toward the spot have a negative value.
  • rotation_outside - float - Total angle of rotation (in degrees, unbounded) from 0 (facing the line of scrimmage) with respect to the protector's original alignment to the left/right of the football. Like open_outside, positive values are associated with rotation away from the football, and negative values toward the football. However, values exceed 180 or -180 for players who experience more than a half-turn in a given direction. For example, a Right Tackle beaten outside by a speed move may rotate 200 degrees to his right, resulting in a rotation_outside value of 200 but an open_outside value of -160. Also, while the sign of the value for open_outside will change if a protector crosses the original spot of the ball, the sign of rotation_outside will not. (See the unwrapping sketch following this list.)
  • moving_outside - float - Angle of motion (in degrees, (-180,180]) with respect to the original spot of the ball. Those moving away from the spot have a positive value, while those moving toward the spot have a negative value.
  • qb_dist - float - Euclidean distance from the QB
  • qb_squareness - float - Absolute angle of deviation (in degrees, [0,180]) from facing away from the QB. Derived using the pass protector's transformed orientation value and the QB's location on the field.
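Computing rotation_outside requires "unwrapping" the wrapped orientation series so that turns past ±180 keep accumulating. A sketch of one way to do it, under the naming assumptions above (not the authors' exact code):

```python
import numpy as np

def rotation_outside(o_t: np.ndarray, aligned_left: bool) -> np.ndarray:
    """Accumulated rotation (degrees) from a wrapped orientation series.

    o_t is the transformed per-frame orientation on (-180, 180]; the sign
    flip for players aligned left of the ball makes positive values always
    mean rotation away from the original spot of the football.
    """
    step = np.diff(o_t)
    step = (step + 180.0) % 360.0 - 180.0      # shortest signed turn per frame
    total = np.concatenate(([o_t[0]], o_t[0] + np.cumsum(step)))
    return -total if aligned_left else total
```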

Relative / Composite Features

  • stunt_type - categorical - Stunt type name as described above
  • exchange - boolean - Whether or not the pass protectors "passed off" rushers following overlap
  • x_diff - float - Difference in width between the outside and inside protectors. Has a negative value if the blocker labeled 'outside' at the snap winds up inside his teammate
  • y_diff - float - Difference in depth between the outside and inside protectors. Has a negative value whenever the outside protector is closer to the line of scrimmage than his teammate
  • dist - float - Euclidean distance between the pass protectors
  • min_qb_dist - float - Lesser value of qb_dist for the two pass protectors
  • rel_rotation - float - Difference between the rotation values for the two pass protectors. Describes the extent to which the protectors have rotated toward or away from one another. Generally, positive values indicate they are facing away from one another, negative values toward each other, and 0 facing in the same direction.
  • penetrator_depth - float - Vertical distance in yards of the penetrating rusher from the line of scrimmage. Has a negative value early in the play before the penetrator has crossed the line of scrimmage.
  • mean_squareness - float - Arithmetic mean of individual squareness values
  • mean_qb_squareness - float - Arithmetic mean of individual qb_squareness values

For both measures of squareness, we also tested the harmonic mean (favors the lesser value), root mean square (favors the greater value) and maximum (ignores the lesser value) as possible composite values. In both cases, the arithmetic mean won out in terms of correlation with the target (rush wins) and feature importance in a preliminary tree-based (XGBoost) model. However, we must note that some of these differences were not at all substantial (<.01 difference in correlation with target, for example) and the superiority of the arithmetic mean as a composite measure should not be assumed in future studies.
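For concreteness, here are the four composites for a pair of squareness values (a guard would be needed if both values were exactly zero in the harmonic mean):

```python
import numpy as np

def squareness_composites(a: float, b: float) -> dict:
    """The four composite summaries tested for two protectors' squareness."""
    return {
        "mean": (a + b) / 2,                     # arithmetic mean (the keeper)
        "harm": 2 * a * b / (a + b),             # harmonic: favors the lesser
        "rms": np.sqrt((a ** 2 + b ** 2) / 2),   # RMS: favors the greater
        "max": max(a, b),                        # ignores the lesser value
    }

# squareness_composites(20.0, 80.0) -> mean 50.0, harm 32.0, rms ~58.3, max 80.0
```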

4. Exploratory Data Analysis

Time to Overlap

Our process identified 924 stunts matching two rushers against two pass protectors, with rushers winning on 300 (32.5%) of those reps. The median frame of overlap (when stunts "declare") occurred at 1.7 seconds after the snap of the ball. The distribution of frame_from_snap for frames of overlap had a mean of 17.4 (1.74 seconds) and standard deviation of 3.06; 71.8% of reps declared between 1.4 and 2.0 seconds after the snap. The distribution resembles a normal distribution with a truncated right tail, as we bounded frame_from_snap for overlap frames at 23 (2.3 seconds).

Stunts of all four types were identified, led by "TT" with 439 reps. We found a single "EE" stunt rep, which was likely incidental (it declared at 2.3 seconds) and later removed from the study. A summary of the overlap frame distributions and win rates for each type can be found in the table below.

Overlap Frames (frame_from_snap) and Win Rate by Stunt Type

| stunt_type | count | mean | std | min | 25% | 50% | 75% | max | wins | win_rate |
|------------|-------|-------|------|-----|-----|-----|-----|-----|------|----------|
| EE | 1 | 23.00 | NaN | 23 | 23 | 23 | 23 | 23 | 0 | 0.000 |
| ET | 286 | 18.21 | 2.40 | 11 | 17 | 18 | 20 | 23 | 91 | 0.318 |
| TE | 198 | 17.39 | 3.05 | 10 | 15 | 17 | 20 | 23 | 75 | 0.379 |
| TT | 439 | 16.85 | 3.32 | 3 | 15 | 17 | 19 | 23 | 134 | 0.305 |

Matchups

As one would expect, stunts of type "ET" and "TE" were most often contested by offensive tackles and guards, while "TT" stunts were picked up by a guard and center. A wide receiver was involved once for each of the former types, while a tight end helped defeat an 'ET' stunt. A running back assisted a guard or center on 16 "TT" stunts.

Penetrator Technique

On "TE" and "TT" stunts the penetrating rusher most often aligned in a 4 technique (head up to inside shade of offensive tackle), while a 5 technique (outside shade of offensive tackle) was the likeliest penetrator on "ET" stunts. Note that for this query we lumped in "i" techniques (inside shades) with their head-up counterparts; each technique therefore had an integer value of 0 to 7. Also, keep in mind that rushers were assigned a technique number based on which offensive linemen they were aligned closest to at the snap of the ball rather than which hand was down in their stance (which we don't know).

Winners by Role

When we examined which rushers were winning (and how) by stunt type we uncovered some surprising facts. First, on "ET" and "TT" stunts the penetrator was nearly as likely to apply pressure as the looper. The ratio observed for "TE" stunts is more consistent with the common understanding that the penetrator is sacrificing himself for the gain of the looper. In the next section we dive deeper into how pass protectors handle different types of stunts and explore how this discrepancy may arise.

The second surprising result is that while the 924 stunts identified netted a rush win 32.5% of the time, not a single rep resulted in a sack. In the analytics community a take that has been gaining traction is that sacks are a quarterback statistic. It may be the case that there is something inherent to stunts that allows quarterbacks to more readily evade pressure, whether by leaving the pocket or getting the ball out. However, as this is a protection study we don't explore that avenue further here.

Pass Rush Wins and Win Types by Role

| stunt_type | count | wins | penetrator_hurry | penetrator_hit | penetrator_sack | penetrator_win | looper_hurry | looper_hit | looper_sack | looper_win |
|------------|-------|------|------------------|----------------|-----------------|----------------|--------------|------------|-------------|------------|
| EE | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| ET | 286 | 91 | 32 | 9 | 0 | 41 | 31 | 10 | 0 | 41 |
| TE | 198 | 75 | 17 | 5 | 0 | 22 | 31 | 9 | 0 | 40 |
| TT | 439 | 134 | 37 | 20 | 0 | 57 | 46 | 10 | 0 | 56 |

Feature Expression by Outcome

At this point in the exploration we turn our attention to the differences in values within features of interest between rush wins and losses. Following from our desire to study "traditional" stunts and stunt protection, at this point we filtered out any stunt in which a skill player (HB/WR/TE) was involved. This left only stunts in which rushers were matched up against offensive linemen; there were 904 such reps, with 298 rush wins (33.0%) and 606 losses. We also removed the lone "EE" type stunt from the sample. Here we can begin to test the philosophies described in the introduction by determining if and when each is relevant. To do so, we would like to compare average values of select features at a given number of frames after the play begins. Before we do, however, we must consider the possible confounder of overlap time.

Time to Overlap

When teaching how to execute a stunt, defensive line coaches typically emphasize that a given amount of time must pass in order for it to develop. More specifically, the looper must allow the penetrator to set up the stunt before replacing him in his lane. This time element involves either an internal clock (one-one thousand, two-one thousand) or a visual cue (the looper waiting to see the penetrator cross his face). In either case, overlapping too early could allow the protection to more easily pass off the stunt (by not putting the protectors "on different levels"), while overlapping too late gives the quarterback more time to get the ball out. In this way the overlap time itself could determine the success or failure of a rep. Also consider how the looper's timing can affect other features of interest like penetrator_depth. If the looper goes too early, it could allow his matchup to help stop the penetrator early in the rep. This would result in relatively low penetrator_depth values and a low probability of success for the rusher. If the average time to overlap between rush wins and losses differed significantly, it would be difficult to assess whether differences in feature values such as penetrator_depth were contributing to the outcome of the rep or were merely a byproduct of the time to overlap.

Fortunately, in our sample there is very little difference in time to overlap between rush wins and losses. The frame of overlap for rush wins has a mean frame_from_snap of 17.48 with standard deviation 2.94, while those measures for rush losses are 17.36 and 3.11 respectively. Comparing these sample means using a two-sided two-sample Student's t-test (assuming normality and equal variance) yields a p-value of 0.60, hardly a significant difference. While time to overlap may be related to other features of interest (frame_from_overlap is strongly correlated with penetrator_depth), it is clear that this timing does not meaningfully impact the outcome of the rep. Having eliminated time to overlap as a possible confounder, we can proceed with comparing average values for other features. First, though, a quick note about which frames we chose to consider (and why).

Frames of Interest

In comparing average feature values, we first removed from consideration any frame occurring more than one second after overlap (frame_from_overlap < 11). We did the same in constructing our predictive models (presented later). One heavy lift we did not attempt here was trying to determine at exactly which frame a rep was won or lost. Having this information would have been particularly useful in the modeling phase, as we would know exactly which frames to exclude in mitigating data leakage. Ideally, model inputs should be used to predict a future outcome rather than reflect an outcome that has already occurred. Instead of determining that exact moment for each rep, we sought to identify a threshold that would strike a balance between:

  • Limiting the number of frames that may have occurred after the rep was won and
  • Keeping enough frames to ensure that results are reliable at the time points we did consider.

To get a general sense of when reps are being won and lost, we considered the summary data for qb_squareness and rel_rotation. Assuming that a pass protector will generally be facing the rusher he is matched up against, having a qb_squareness value greater than 90 degrees indicates that the protector can do little to impede the progress of the rusher toward the quarterback. In that moment, the protector has either lost or is sure to lose soon. Consider also that it only takes one protector to lose his matchup to register a win for the rush. Among rush wins, max_qb_squareness (the greater of the values for each protector) reached a mean value above 90 degrees at 5 frames after overlap. Also, for both rush wins and losses the rel_rotation data shows negative average values (protectors facing each other) from several frames before overlap until between 5 and 6 frames after overlap, when they shift to positive values (protectors facing away). This indicates that when pass protectors are passing off rushers, the exchange is completed around 0.5 seconds after overlap. Here again we can assume that outcomes are being decided shortly after this point. Finally, we note that 701 / 904 (77.5%) of reps have a frame that occurs 1.0 second after overlap; this is the largest such threshold that retains data from more than 75% of reps. Therefore, we settled on 10 frames after overlap as the upper bound for frames of interest in the exploration.

Relative Depth

Having cleared those hurdles, we can now test the three philosophies highlighted at the start. After filtering out all frames occurring more than 1.0 second after overlap, sufficient data remained to test the values for each feature of interest at frames 0 through 32 after the snap. At each frame we compared the sample means of each metric for rush wins and losses using a one-sided Student's t-test and noted where significant differences existed (at p $\leq$ 0.05). As a general rule the alternative hypothesis was that the metric for rush wins would be greater than that of rush losses, but that inequality was flipped where appropriate. In order to conduct the t-tests, we first tested the normality of each distribution using scipy's normaltest method, which is based on D'Agostino and Pearson's test of normality. Additionally, we assumed equality of variance if the "rule of thumb" ratio held - that the larger variance was less than four times the size of the smaller variance.
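For reference, a single frame's comparison might look like the following scipy sketch, with the normality and variance-ratio checks just described (the helper is ours; `alternative` requires scipy >= 1.6):

```python
import numpy as np
from scipy import stats

def compare_at_frame(wins: np.ndarray, losses: np.ndarray,
                     alternative: str = "greater", alpha: float = 0.05):
    """One-sided two-sample t-test for one feature at one frame.

    Returns the p-value, or None when D'Agostino-Pearson normality fails
    for either sample.
    """
    for sample in (wins, losses):
        if stats.normaltest(sample).pvalue < alpha:
            return None
    v1, v2 = wins.var(ddof=1), losses.var(ddof=1)
    equal_var = max(v1, v2) / min(v1, v2) < 4     # rule-of-thumb ratio
    return stats.ttest_ind(wins, losses, equal_var=equal_var,
                           alternative=alternative).pvalue
```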

We first dive into "staying at the same level", which is represented by the y_diff feature. Recall that y_diff is the difference in depth between the outside and inside protector. A value of 0 puts the protectors at the same level, while positive values indicate that the outside protector is deeper. For every frame after that of the snap, y_diff is larger on average for rush wins than for rush losses. This difference becomes significant after frame 4, and ceases to be so only at frames 28, 31 and 32. After frame 20 sample sizes for both distributions begin to drop substantially, contributing to greater variance in the data. This acts to push the p-value above 0.05 at the aforementioned frames.

For this question in particular things get more interesting when we look at the three different stunt types in isolation. For 'ET' stunts, the difference in y_diff becomes significant at frame 2 and remains so through the 29th frame, the highest that we have sufficient samples to test. However, for 'TE' stunts the difference is significant for only 3 frames (9-11) and for 'TT' only after frame 18. Recalling that the median overlap frame is 17, the metrics for 'ET' stunts have an out-sized impact on the general trend particularly when considering frames before overlap.

It is also helpful to look at what happens to the metrics in the 15-20 frame range (when most overlaps occur) for "ET" and "TT" stunts. For rush wins, the y_diff value starts to increase after having leveled off for several frames. However, for rush losses y_diff values continue to decrease until the latter end of that range. This suggests that maintaining a low relative depth early in the play helps partners in protection stay around the same level through overlap, which undoubtedly is useful for exchanging rushers.

On the other hand, "TE" stunts seem to have a completely different dynamic. The y_diff for rush wins accelerates to the point where it is significantly greater between frames 9-11. However, after frame 11 we see monotonically decreasing mean values regardless of outcome. By frame 20 y_diff is lower on average for rush wins, and by frame 27 protectors are at the same level. If 'staying at the same level' is relevant to success on "TE" stunts, it may be only in the first 1.1 seconds of the rep.

To understand why this is the case, think back to how executing a "TE" stunt is taught. The interior rusher (most often a 4 technique matched up against an offensive guard) charges upfield while the edge rusher (matched up against a tackle) is taught to take a few hard steps upfield before eventually looping inside. In theory, the edge rusher is selling an outside speed rush in order to 'attract the attention' of the Tackle and prevent him from helping the Guard. However, the looper must eventually throttle down in order to loop inside. This progression forces the Guard to work for depth while the Tackle slows his retreat, bringing the two closer to the same level. The stunt itself therefore seems to have the effect of putting the protectors on the same level late in the rep. This may explain why maintaining a close relationship vertically is less relevant (if at all) to the outcome of the rep.

In summary, the importance of 'staying at the same level' is dependent on the type of stunt being executed. On "ET" stunts, protectors have significantly more vertical separation throughout the play on rush wins than they do on protection wins. For "TT" stunts, the difference is less pronounced and only becomes significant after rushers begin to overlap. On "TE" stunts, the difference is significant only during a small window around 1.0 second after the snap during which the looper is still making his upfield push. While there is evidence that 'staying at the same level' is generally important, examining the three stunt types in isolation provides context as to when that strategy is relevant.

Squareness

There are several lenses through which we can investigate the role of 'having square shoulders'. Here, we will start with perhaps the simplest and most intuitive: mean_squareness. Recall that squareness refers to the extent to which a pass protector is facing the line of scrimmage, and is measured in absolute terms. Those whose shoulders are parallel to the line get a value of 0.0, while deviations from that (facing left or right, inside or outside) garner positive values up to 180 degrees (back to the line). The mean_squareness is the arithmetic average of the squareness for the two protectors. As the graphs below show, mean_squareness differs little between rush wins and losses until 1.3 seconds into the rep, at which point rush wins show a significantly greater value. This corresponds with 0.4 seconds before overlap, at which point we observe the same significant difference. Isolating the stunt types we see the same general trend but with significant differences emerging at different points: "ET" at frame 15 from the snap, "TE" at frame 10, and "TT" at frame 17.

In order to get a clearer picture of who is getting turned and which way, we turned to rotation_outside. Like squareness, rotation_outside gives a 0.0 value for being square to the line of scrimmage but has negative values for turning in the relative direction of the original spot of the ball. Here we looked at the rotation_outside for each protector by stunt_type, starting with "ET" stunts. Metrics for the inside protector follow the familiar trend: no difference early in the rep, building to differences of increasing significance starting at frame 17. Win or lose, the Guard (inside protector) is getting turned out on an "ET" stunt. The difference between success and failure is a matter of magnitude that starts to differentiate at overlap.

However, when zooming in on the outside protector we observe a unique pattern: significantly greater outward rotation in frames 10-13 followed by the opposite after frame 19. This is the first time we see the paths of winners and losers radically deviate over the life of a rep. The difference in outward turn at frame 10 is slight (approx. 8 degrees) but significant (p = 0.015). This leads to a significantly larger inward rotation (approx. 20 degrees of difference at frame 21) after overlap for unsuccessful protectors. The fact that this rotation stays negative on average indicates that these protectors (often tackles) are less able to exchange rushers. Meanwhile, tackles who stay sufficiently square before overlap seem to have the ability to maintain that squareness through overlap. They then finish turned out - away from the QB after having successfully exchanged rushers.

For "TE" stunts the dynamic is somewhat similar to that of "ET". Inside protectors that are part of losing combinations turn out more than successful ones. However, as these blockers are facing the penetrator it happens more quickly: the difference in rotation_outside becomes significant 1.0 second into the rep. By 2.0 seconds, the inside protector starts to square back up as the looper crosses his face. On the other hand, unsuccessful outside protectors rotate out a significant deal more from the 7th to the 14th frame before rapidly turning in as the edge rusher loops. This leads to a significantly larger inside rotation from frames 22 to 27. Here again, outside protectors who are able to maintain a relatively square position before overlap sustain it through overlap. One difference to note is that regardless of outcome the outside protector tends to finish with shoulders turned inside. Since the interior rusher is penetrating on a "TE" stunt, his momentum is more vertical than horizontal. Is is therefore more difficult on a "TE" stunt than on a "ET" stunt for the outside protector to gain inside position on the interior rusher after exchanging the stunt. It is typical on a protection win for the penetrator to either be stopped entirely or flattened across the face of the QB.

Finally, in turning our attention to "TT" stunts we see that they are unique in that a significant difference in rotation is observed only for the outside protector. Recall that on a "TT" stunt the inside protector is typically a center, while the outside protector is a guard. One can see below that regardless of outcome the centers tend to stay fairly square throughout the rep. However, the guards (who are often responsible for the penetrator initially) turn inside to a larger extent on losing reps. This difference is observed starting at 1.0 second from the snap and becomes significant at 1.5 seconds. As it concerns "TT" stunts, the ability to maintain square shoulders seems relevant only for the outside protector.

In conclusion, keeping one's shoulders square is relevant for each type of stunt in our study. Generally, successful protection tandems collectively stay square to a larger degree than their unsuccessful counterparts. Specifically, on "ET" and "TE" stunts inside protectors (guards) tend to turn out more on losing reps. On those same reps outside protectors (tackles) rotate out more before overlap, then rotate in radically after the edge rusher crosses their face. As we just saw, on "TT" stunts the outside protector (a guard) turns in on losing reps, while the inside protector (a center) tends to stay square regardless. In all cases, successful protectors are relatively square before overlap and maintain that squareness through the overlap and exchange windows.

Penetrator Depth

Finally, we test the strategy of "stopping the penetrator" by looking at how much depth pass protectors allow the rusher with that role to get. Here we observe a pattern very similar to that of mean_squareness: on pass rush wins the penetrator achieves significantly greater depth than on pass rush losses starting at frame 5. This significant difference only becomes more pronounced as the play progresses. In isolating the stunt types, we see that they differ only in the frame at which a significant difference arises. For 'ET' stunts, this occurs at 3 frames from the snap, 12 frames for 'TE' stunts and 10 frames (1.0 second) for 'TT'. The figures below reflect the general trend.

Little further elaboration is needed in examining these trends. Successful penetrators push deeper into the backfield early in the rep and are able to sustain their rate of gain in a nearly linear fashion. Those who are not successful see their rate of gain drop off to a near stop after 3.0 seconds. An open question is to what extent rushers and protectors control the depth that the penetrator is able to gain throughout the rep. While the penetrator may primarily drive this number early in the rep, the slowing of depth gain on protection wins suggests that protectors are able to 'stop the penetrator' to some extent later in the rep. After all, the penetrator would not intentionally slow himself down. While determining a specific measure of control is not the focus of this study, we can conclude that both parties have some effect on penetrator_depth and that it is clearly related to the outcome of the rep.

5. Model and Feature Selection

In the previous section, we determined that each coaching philosophy (as represented by a feature or set of features) relates to the outcome of a stunt matchup in some way. In this section we attempt to determine to what extent the features tested matter and to uncover other relevant features not yet considered. Feature and model selection occurred simultaneously: we began by testing a variety of feature representation approaches with several different classification algorithms.

In determining which model algorithms to test, we followed the lead of Yurko et al. (2019) in Going Deep: Models for Continuous-Time Within-Play Valuation of Game Outcomes in American Football with Tracking Data. In doing so we employed the following:

  • LASSO: Logistic Regression Classifier with L1 Regularization (Sci-Kit Learn implementation)
  • XGBoost: Gradient Boosted Tree Classifier (with Sci-Kit Learn API)
  • FNN: Feedforward Neural Network with ReLU Activation (Pytorch implementation)
  • LSTM: Long Short-Term Memory Network (Pytorch implementation)

As Yurko et al. lay out, each of the models above is well regarded for its ability to handle high-dimensional data. However, as a linear model, LASSO regression does not deal well with non-linear features and feature interactions, both of which are present to a large degree in our data. Both the LASSO and XGBoost models provide the benefit of feature importance measures, as does a specialized FNN implementation called LassoNet (used only in the dimensionality reduction step explained later). Finally, only the LSTM explicitly factors in changes in features over time in generating predictions.

Incorporating Change Over Time

Initial experimentation therefore centered around engineering features that would successfully incorporate the time component using the first three models listed above. We tested a true "sliding window" approach (where each observation includes feature values for several consecutive frames) as well as a "moving average" approach (essentially using summary statistics of those sliding windows as model inputs). Both of these approaches improved model performance over the baseline of inputting the instantaneous value of features at each frame. However, we found those approaches to be neither intuitive nor simple and feared that interpretation would suffer as a result.

Fortunately, a simpler alternative did emerge: adding to our current set of features the change in each numeric feature from the previous frame. These additional features are denoted with a _delta suffix in the code. By simply including the deltas we achieved a nearly identical improvement in model performance while maintaining the original (and most intuitive) representation of our features of interest. This came at the small cost of losing the first frame of each rep (the frame in which the ball is snapped). Due to the low variance in feature values at this frame (and thus low information conveyed), one can argue that this consequence in itself contributed to increased performance.
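In code, the delta construction is just a grouped one-frame difference; the key names here (rep_id, frame_from_snap) are illustrative:

```python
import pandas as pd

def add_deltas(frames: pd.DataFrame, feature_cols: list) -> pd.DataFrame:
    """Append per-frame changes for each numeric feature, computed per rep."""
    frames = frames.sort_values(["rep_id", "frame_from_snap"])
    deltas = frames.groupby("rep_id")[feature_cols].diff()
    deltas.columns = [f"{c}_delta" for c in feature_cols]
    # The snap frame has no previous frame, so it drops out here.
    return pd.concat([frames, deltas], axis=1).dropna(subset=deltas.columns)
```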

Dimensionality Reduction

After nearly doubling our feature set, in the next step we sought to eliminate redundant features. In addition to the common benefits of such practice (simplifying the model, increasing predictive power, reducing overfitting), we had a strong desire that each feature primarily measure only one thing. In that way, if/when new features emerged as relevant we could point to a specific aspect of defeating stunts. This process involved identifying collinear and multicollinear sets of features and keeping only one feature from each set. The primary criterion for choosing this feature was predictive power; for this we used a combination of absolute correlation with the target (rush_win) and average feature importance in an XGBoost cross-validation paradigm employing all features. One feature was selected from each multicollinear set until no such sets remained. Two features were identified as collinear if they had a correlation coefficient $r \geq$ 0.7.
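The pruning loop can be sketched as a single greedy pass over features ordered by the blended ranking just described (our simplification, not the authors' exact code):

```python
import pandas as pd

def prune_collinear(X: pd.DataFrame, rank: pd.Series, r: float = 0.7) -> list:
    """Keep the best-ranked feature from each collinear set.

    `rank` scores each column, e.g. a blend of |correlation with target|
    and mean XGBoost importance; higher is better.
    """
    corr = X.corr().abs()
    kept = []
    for col in sorted(X.columns, key=lambda c: -rank[c]):   # best first
        if all(corr.loc[col, k] < r for k in kept):
            kept.append(col)                                # not redundant
    return kept
```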

In going through this process several noteworthy items emerged. First, penetrator_depth_delta topped the correlation with target list at $r =$ 0.199. The next 9 features on that list involved squareness, with 8 of those being the different ways we explored 'averaging' the individual squareness values. Each composite measure of qb_squareness (arithmetic mean, root mean square, maximum, and harmonic mean) was more correlated with the target ($r \geq$ 0.150) than the composite measures of squareness (0.137 $\leq r \leq$ 0.150). The winner among these was mean_qb_squareness at $r =$ 0.172, which was also the overall leader on the feature importance list. With mean_qb_squareness being collinear to all composites except harm_squareness and the individual qb_squareness values, most of the rest of this category was eliminated. We then made the decision to retain squareness_out over harm_squareness (which were also collinear) in order to preserve measures of squareness for the individual protectors. In addition, whenever we removed an original feature its delta went out with it. Therefore, surviving this round of cuts were mean_qb_squareness, squareness_in, squareness_out and their associated deltas.

Another set of features getting the axe were the individual depth values and their deltas. Both depth features were in the same multicollinear set as penetrator_depth, and the deltas were collinear with their respective speed measures, s_in and s_out. The latter result is intuitive given that the primary technique used in pass protection is to 'kick back'. Here we can also mention that penetrator_depth and the speed features rank among the top remaining features in correlation to target at $r =$ 0.129 for s_out, 0.098 for penetrator_depth and 0.097 for s_in. Another to hit the chopping block was width_out and its delta due to their correlation with x_diff. These features and their deltas scored similarly with regard to both absolute correlation with target and feature importance. In the end, width_out_delta was also collinear with moving_outside_out; we chose to eliminate the former in order to save the latter. Finally, min_qb_dist was collinear with both of the individual measures of qb_dist, as were the deltas. Here, the composite measure ranked higher in absolute correlation with target. This also makes sense given that it only takes one protector being close to the QB to move him off of his launch point, while a high value of min_qb_dist indicates that neither blocker is in close proximity. Therefore, the individual features for qb_dist and their deltas were removed.

After whittling down the feature set we ran another XGBoost cross-validation and observed near identical performance to the original in terms of ROC-AUC and F1 score. Included below are the correlations to target and relative importance (as determined by the CV run) of the remaining features.

As a final note we would like to address the relatively low bar of $r \geq$ 0.7 used to define collinearity. Here again we sought to strike a balance between eliminating truly redundant features and preserving those that may be highly correlated but still have interesting differences. A good example can be seen in examining the relationship of penetrator_depth to other features. We have already discussed how the individual depth measures of protectors were eliminated due to a near-perfect ($r \gt$ 0.9) correlation with penetrator_depth. Additionally, we observe high correlations of the latter feature with each composite measure of squareness as well as the individual measures of speed ($r \gt$ 0.55 for all). Meanwhile, three of the four composite measures of qb_squareness correlate with penetrator_depth at just below $r =$ 0.5. Choosing a moderate threshold on the absolute value of $r$ thus allowed us to explore how our models handled the interactions between related (but not necessarily redundant) features.

Model Training and Validation

Each model type that we considered has its own idiosyncrasies around input formatting, training procedures and evaluation of results. Here we explicitly lay out the steps taken in implementing each model.

LASSO

First, to aid in model convergence we standardized all numeric features. Our only categorical feature, stunt_type, was broadcast to three separate features (one for each type) using one-hot encoding. We conducted the same pre-processing for the FNN and LSTM models. Next, in order to compensate for class imbalance in the target variable (rush_win, mean = 0.332) we set the class_weight parameter to 'balanced'. This improves model accuracy by prompting it to assign more weight to (in our case) the positive class. Finally, as the LASSO model carries out feature selection explicitly through L1 regularization, we tuned the model's regularization parameter C prior to employing other feature selection techniques. Our aim here was to select an optimal value that would maximize both regularization (a low C value) and model performance. Through CV results we settled on a parameter value of C = 0.1 with minimal performance loss. LASSO models trained on inputs from the individual frames of each stunt rep (including deltas from the previous frame), treating each frame as a unique observation with no reference to the specific rep from which it came. Trained models then produced a pseudo-probability relating each frame from the test set to the positive class.
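A minimal sklearn sketch of that configuration, assuming stunt_type has already been one-hot encoded upstream:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# L1-penalized logistic regression with balanced class weights and the
# C = 0.1 value settled on above; 'saga' is one of sklearn's L1-capable solvers.
lasso = make_pipeline(
    StandardScaler(),
    LogisticRegression(penalty="l1", solver="saga", C=0.1,
                       class_weight="balanced", max_iter=5000),
)
# lasso.fit(X_train, y_train)
# p_win = lasso.predict_proba(X_test)[:, 1]   # frame-level pseudo-probability
```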

XGBoost

Being a tree-based model, XGBoost required no feature standardization. XGBoost also offers a built-in parameter for handling categorical variables, which we enabled. We addressed class imbalance by setting the scale_pos_weight parameter to 2, reflecting the nearly 2-to-1 ratio of negative to positive class observations. Like the LASSO models, XGBoost classifiers trained on individual frames and generated a pseudo-probability for each test frame. Within each CV fold we utilized early stopping by evaluating improvements in AUC (ROC-AUC) on the test set. This process terminated tree building after 10 rounds in which the AUC metric did not substantially improve and reverted to the ensemble that achieved the best results (eliminating the last 10 trees built). We chose this approach to benefit from the training efficiency and mitigation of overfitting that it provides. Also, we opted for AUC as an evaluation metric as it reflects how well the model is differentiating between the classes without being tethered to a specific prediction threshold. AUC would also serve as our metric for model comparison, a choice we will justify shortly.
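A comparable XGBoost setup might look like this; parameter spellings follow recent versions of the sklearn-compatible API, and stunt_type is assumed to carry a pandas category dtype:

```python
from xgboost import XGBClassifier

xgb = XGBClassifier(
    scale_pos_weight=2,          # ~2:1 negative-to-positive class ratio
    enable_categorical=True,     # native handling of stunt_type
    tree_method="hist",          # required for categorical support
    eval_metric="auc",
    early_stopping_rounds=10,    # revert to the best-scoring ensemble
)
# xgb.fit(X_train, y_train, eval_set=[(X_test, y_test)])
# p_win = xgb.predict_proba(X_test)[:, 1]
```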

FNN

As in LASSO pre-processing, we standardized numeric features and one-hot encoded the stunt_type variable prior to training the feedforward neural networks. After experimenting with several different combinations of hidden layers and units, we observed that simpler models were less prone to overfitting, achieving better overall performance. Our initial FNN, implemented using Python's torch package, used a single hidden layer with fifteen hidden units and ReLU activation. Regardless of hidden parameters these models generated two outputs, which we converted to pseudo-probabilities using the softmax function. This allowed us to determine loss using the cross-entropy function, which we used for backpropagation in conjunction with an Adam optimizer. In addition, we utilized the weight parameter of pytorch's CrossEntropyLoss object, which makes adjustments to compensate for class imbalance. Finally, we again employed early stopping to terminate training, this time using total cross-entropy loss (the sum of the loss calculated at each frame) to evaluate test set performance.
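A minimal PyTorch rendering of that architecture; the feature count and the 2:1 class weight shown here are illustrative:

```python
import torch
import torch.nn as nn

class StuntFNN(nn.Module):
    """Single hidden layer of 15 ReLU units with two output logits."""
    def __init__(self, n_features: int, hidden: int = 15):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_features, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 2))

    def forward(self, x):
        return self.net(x)   # softmax applied only when probabilities are needed

model = StuntFNN(n_features=20)
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]))  # class imbalance
optimizer = torch.optim.Adam(model.parameters())
# loss = criterion(model(x_batch), y_batch); loss.backward(); optimizer.step()
```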

LSTM

Here again we began by transforming our inputs in the same manner as with the LASSO and FNN models. However, the "recurrent" nature of a long short-term memory network requires additional pre-processing in order to proceed. Recall that a key difference (and benefit) of the LSTM is that it incorporates information retained from previous time frames into predictions made for a given time frame. In our case frames are loaded sequentially by play; predictions are updated with the new information while a certain amount of information (determined by the LSTM) is retained at each step. In training, this required that frames for each play be loaded into the model as a "sequence" unit rather than individually. Also, in order to maintain the efficiency offered by pytorch's DataLoader object we needed to zero-pad any play that had fewer frames than the longest play (33 frames in our case). Although not customary with LSTMs, we still wanted to evaluate model performance at a frame level as opposed to a play level. We achieved this by "unrolling" the LSTM hidden layers and evaluating loss (again with cross-entropy) at each hidden state. Outputs generated for zero-padded frames were excluded during loss calculation for both backpropagation and test set evaluation. Otherwise, the parameters and procedures used mirrored those of the FNN: we used a single LSTM layer with fifteen hidden units trained with an Adam optimizer. Cross-entropy loss with a target class weight adjustment was used during training and test set evaluation and for early stopping.
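The frame-level LSTM can be sketched similarly, with per-frame logits and padded frames masked out of the loss; the shapes and names here are ours:

```python
import torch
import torch.nn as nn

class StuntLSTM(nn.Module):
    """One LSTM layer of 15 hidden units emitting logits at every frame."""
    def __init__(self, n_features: int, hidden: int = 15):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 2)

    def forward(self, x):                  # x: (batch, frames, features)
        out, _ = self.lstm(x)              # hidden state at every time step
        return self.head(out)              # (batch, frames, 2) logits

criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 2.0]),
                                reduction="none")
# per_frame = criterion(logits.flatten(0, 1), y.flatten())
# loss = (per_frame * pad_mask.flatten()).sum()   # zero out padded frames
```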

*As a final note, we decided to include the "delta" features for LSTM training after observing performance improvement over initial iterations where they were not used as inputs.

Validation

Again following the lead of Yurko et al., we generated rush_win probabilities for each frame in our eight-week sample using "Leave-One-Week-Out" (LOWO) cross-validation. Essentially, for each week, estimates were generated using a model trained on observations from the remaining seven weeks. As with any cross-validation approach, this first helps us ensure that the models we employ are performing well across the sample. Additionally, it allows us to generate a probability for every frame of interest of each stunt rep identified. This maximizes the sample size of reps we can use in later application steps, where we examine feature expression in likely wins and losses and assess player and team performance. Finally, this approach mitigates data leakage by preventing the use of frames from the same game/rep in both the training and testing sets. For example, it would be bad practice to make an estimate for the 5th frame of a rep using a model trained with the 20th frame of that rep, given that the latter happens later.
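Schematically, the LOWO loop looks like this; the week column, feature list and model_factory callable are our own framing:

```python
import numpy as np
import pandas as pd

def lowo_probabilities(frames: pd.DataFrame, features: list,
                       model_factory, weeks=range(1, 9)) -> np.ndarray:
    """Out-of-week rush_win probabilities for every frame in the sample."""
    probs = np.empty(len(frames))
    for week in weeks:
        test = (frames["week"] == week).to_numpy()
        model = model_factory()                      # fresh, untrained model
        model.fit(frames.loc[~test, features], frames.loc[~test, "rush_win"])
        probs[test] = model.predict_proba(frames.loc[test, features])[:, 1]
    return probs
```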

Evaluation

At the conclusion of each CV run the probabilities generated were used to calculate ROC-AUC scores for the overall model as well as scores reflecting performance at each frame_from_snap and frame_from_overlap. These scores provided the basis for model comparison both within and between model types. As discussed previously, we opted for this metric primarily for its power of differentiating between classes regardless of a specific probability threshold. This would prove useful as the models we ultimately selected were "aggressive" in that on average they generated probabilities exceeding the observed pass rush win rate. In addition, we did not feel that the degree of class imbalance (a nearly 2-to-1 ratio of negative to positive) warranted pivoting to another metric such as AUCPR (area under the precision-recall curve). One pattern we observed in initial experimentation was the tendency of models to be very conservative early in the rep (low input variance) and more aggressive later (high variance). Nonetheless, employing the mechanisms used in each model to account for class imbalance mitigated this issue greatly, resulting in much lower variance in the average probability generated at each frame. Finally, for our purposes we have equal interest in examining feature expression for both likely rush wins and losses. As ROC-AUC weighs performance in predicting the positive and negative classes equally, it was the logical choice.

Feature Selection

In the final step of our modeling process we carried out several feature selection methods for each model type. Here we were motivated to not only maximize model performance but also determine which inputs promote high performance and to compare those feature sets across model type. To establish a baseline performance score for each model we first conducted CV runs using the features remaining after the dimensionality reduction step.

Backward and Forward Selection

We then conducted traditional 'backward' and 'forward' selection processes for each model. In backward selection (also known as 'recursive feature elimination') one evaluates model performance after removing a single feature, doing this for each feature at a given step. The feature set that produces the best performance gain is retained, and the feature removed from that set is permanently eliminated. One then iterates on this method until a minimum threshold of performance gain is not reached. Conversely, in forward selection one starts with no features and continually adds the one that best improves model performance. This process also terminates at the step when performance has not improved beyond a pre-determined threshold. In our case, that threshold was a .001 improvement in the overall ROC-AUC score of the model. While we achieved performance gains across all model types by employing these methods, neither produced the best performance for any model type.

Modified Stepwise Selection

In every case, the best improvement resulted from a modified version of stepwise selection that we designed and employed. Traditional stepwise selection starts as forward selection, but after a predetermined number of steps incorporates backward selection by considering candidate features for removal at every step. In addition, after a feature is added or removed it is common practice to withhold it from consideration for several subsequent steps. Here again the add/remove cycle terminates when neither step produces significant performance gain.

For the sake of computational efficiency we sought to employ a "smarter" process by incorporating the correlation with the target and the measures of feature importance calculated previously. We accomplished this in two ways. First, during each step we considered only the three most obvious candidates for addition or removal at a time. If at least one of the three resulting test sets produced a performance gain above threshold, we selected the top one and continued. If the threshold was not reached, we considered the next three candidates; this continued until all candidates for inclusion and removal were exhausted. For all model types, the "most obvious" candidates for addition were ranked by order of correlation with the target. In the LASSO and XGBoost models, candidates for removal were sorted by relative importance in the model produced with the current feature set. For example, consider a LASSO model trained using thirteen features, with nine of those being candidates for removal (and the four most recently added features housed in the holdout set). The nine candidates would be ranked in terms of average absolute coefficient across the eight CV folds (and therefore eight models) used to produce estimates. The three candidates with the lowest values were considered first for removal, and so on. In the XGBoost models, we substituted in the average feature importance to serve this purpose.
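
A sketch of a single batched step of this procedure, under the same cv_auc() assumption; add_ranked and remove_ranked stand in for the candidate lists pre-sorted by correlation with the target and by model importance, respectively (holdout-set bookkeeping omitted for brevity).

```python
def batched_step(current, add_ranked, remove_ranked, cv_auc, base_auc,
                 min_gain=0.001, batch=3):
    """Test the three 'most obvious' additions and removals; if none clears
    the gain threshold, move to the next three until candidates run out."""
    adds = [f for f in add_ranked if f not in current]
    drops = [f for f in remove_ranked if f in current]
    while adds or drops:
        trials = [current + [f] for f in adds[:batch]]
        trials += [[g for g in current if g != f] for f in drops[:batch]]
        scored = [(cv_auc(s), s) for s in trials]
        best_auc, best_set = max(scored, key=lambda t: t[0])
        if best_auc - base_auc >= min_gain:
            return best_set, best_auc  # keep the best gaining set
        adds, drops = adds[batch:], drops[batch:]
    return current, base_auc  # terminate: no batch produced sufficient gain
```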

Second, the initial CV run for each model using all features provided rankings of feature importance for both the LASSO and XGBoost models. We were therefore able to test iterations of our modified process that started with the top $n$ features for a large range of $n$. A key difference here is that any feature included at the start of the run was immediately deemed a candidate for removal. While our modifications did not guarantee convergence to the optimal feature set, they did allow us to test a large and diverse array of feature sets in a directed and computationally feasible way. This advantage would prove critical in attempting a similar process with our neural networks. The FNN and LSTM algorithms require far more time and resources to run and do not inherently contain attributes that can be used to measure feature importance. For that purpose we turned to a specialized FNN algorithm known as LassoNet.

LassoNet

In short, the creators of LassoNet developed a method which teases out a measure of feature importance from a feedforward neural network with one simple modification. This involves adding a "skip" layer connection from each model input to the final output. The weight of each skip connection provides an upper bound on the weights from that input node to the model's hidden layer. This amplifies the effect of applying a LASSO penalty by encouraging the model to eliminate all the weights coming from a given input node. By progressively increasing the L1 penalty at each training epoch, LassoNet determines feature importance by noting the epoch at which each feature's outgoing weights have been eliminated. Resources and documentation on the theory and application of LassoNet are available from its creators.
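
Usage is straightforward with the authors' lassonet Python package; below is a minimal sketch, where X and y are hypothetical arrays of frame-level features and labels, and the exact API may differ by package version.

```python
from lassonet import LassoNetClassifier

model = LassoNetClassifier(hidden_dims=(30,))  # one hidden layer, 30 units
path = model.path(X, y)  # trains along an increasing L1 penalty path
# A feature's importance is the penalty level at which its weights vanish;
# features that survive stronger penalties matter more.
importances = model.feature_importances_
```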

Using the same LOWO CV protocol mentioned previously, we conducted several LassoNet runs across a range of hidden layer and hidden unit counts. We opted to utilize the results of a single-layer LassoNet FNN with 30 hidden units: although a simpler model (15 hidden units) performed marginally better (.673 vs .671 AUC), the 30-unit model produced its best results under a greater regularization penalty. As before, feature importance was determined by taking the average across CV folds. Feature importance rankings from the LassoNet CV were then used for stepwise feature selection in the FNN and LSTM models. One key difference from the previous application is that we did not conduct multiple runs of the modified stepwise procedure with $n$ starting features. Due to time concerns we conducted only a single run for each neural net type with the same set of starting features. Here the decision as to how many of the top features to include in the initial set was somewhat arbitrary. Ultimately, we chose the lowest threshold that excluded each of the encoded features representing stunt_type values. Included below are the feature importance values derived from the selected LassoNet CV run.

Hyperparameter Tuning

After determining the best features for each model type we performed hyperparameter tuning in hopes of achieving a final boost to model performance. While each tuning run used a grid-search approach, specific parameters tuned were dependent on model architecture. Generally, the hyperparameters used to produce the best overall AUC were chosen for inclusion in the final model.

For the LASSO model, the only parameter we experimented with was again the regularization value C. The performance gains we observed with larger C values (less regularization) were minute ($\approx$ .0002 change in AUC), so we retained the parameter value C = 0.1.

After performing two grid-search sweeps for the XGBoost model, the following hyperparameters produced the best performance: max_depth = 6, learning_rate = 0.25, reg_lambda = 20.0 and colsample_bytree = 0.95. We also tested a range of values for gamma (used for tree "pruning") in the first run, but the best results were obtained with a gamma value of 0.0 (no pruning).

Finally, for the neural networks we tuned the same four parameters: number of hidden layers (n_layers), number of hidden units at each layer (n_hidden; with each layer having an equal unit count), dropout rate between hidden layers (dropout; where n_layers $\geq$ 2) and learning rate of the Adam optimizer (eta). For the FNN the following parameters produced the best AUC: n_layers = 2, n_hidden = 40, dropout = 0.3 and eta = 0.01. The LSTM grid search saw best results with n_layers = 3, n_hidden = 40, dropout = 0.15 and eta = 0.005.
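
A sketch of the grid-search loop for the neural nets follows; the grids are illustrative (anchored on the reported best values), and lowo_auc_fnn() is a hypothetical wrapper that runs the full LOWO CV with the given hyperparameters and returns the overall AUC.

```python
from itertools import product

grid = {
    "n_layers": [1, 2, 3],
    "n_hidden": [20, 40, 80],
    "dropout": [0.0, 0.15, 0.3],
    "eta": [0.001, 0.005, 0.01],
}
best_params, best_auc = None, 0.0
for combo in product(*grid.values()):
    params = dict(zip(grid, combo))
    auc = lowo_auc_fnn(**params)  # hypothetical LOWO CV wrapper
    if auc > best_auc:
        best_params, best_auc = params, auc
```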

6. Results and Discussion

Model Performance

Included below are final ROC-AUC curves and metrics (in the legend) for each model type. The "Baseline" line represents a hypothetical model with effectively no discrimination of the target variable at any threshold. Focusing on the curves for the four models considered, we draw two broad conclusions:

  1. Results for each model are remarkably similar (range $\approx$ 0.035 AUC)
  2. FNN is the clear winner on overall AUC, with the XGBoost model also outperforming the LSTM

A helpful way to interpret the chart above is that at virtually any tolerance for false positive rate (% of protection wins predicted as pass rush wins) the FNN produces the best true positive rate (% of pass rush wins correctly predicted).

We also evaluated models on their performance by frame, for which we have two perspectives: frame_from_snap and frame_from_overlap. In the left chart below, we see that the FNN separates itself early in the play and cedes an advantage to the XGBoost model only on a small range of frames. Meanwhile, on the right we see that the LSTM performed particularly well on the early frames of reps with a relatively high time_to_overlap before seeing a decline in performance. Keep in mind that there are far fewer reps to evaluate in that early range (36 at frame_from_overlap == -22 (4.0% of total reps) rising to 834 (83.4%) by frame_from_overlap == -14). Although tough to distinguish in these visualizations, the logistic regression model is a distant fourth to the others until around 5 frames before overlap, when it begins to track with the rest.

Considering the perspectives offered above, we can summarize model performance results in the following way:

  • As expected, we observed the poorest overall performance with the logistic regression model. It was particularly bad at discriminating wins from losses early in the rep. While useful for its ability to provide a measure of individual feature importance (average absolute coefficient), logistic regression is not an appropriate choice as a predictive model due to the highly correlated nature of our inputs.
  • Conversely, the relatively poor performance of the long short-term memory network was surprising. To speculate on a root cause, the sequential nature of its training process was potentially detrimental to performance over time. By retaining information from the beginning of the rep (where input variance is low), it is possible that the LSTM was slower to detect and/or properly weigh subtle differences in key features.
  • The XGBoost classifier acquitted itself well in discriminating wins from losses throughout the rep, at times producing the best results for a given frame. However, at lower tolerances for false positive rate (less than 0.5) it performed comparably to the LSTM and was outshined by...
  • The feedforward neural network, which performed the best across the majority of frames and at nearly every tolerance for false positive rate. Among the models we considered, the FNN is the obvious choice for distinguishing rush wins from losses on a frame-by-frame basis.

Feature Importance

To this point we've already shown and put to use some measures of feature importance using models trained on every feature under consideration. Now we can home in on the aspects of pass protection that are most relevant by examining:

  • Which features survived the feature selection process for each model type and
  • How they rank in terms of relative importance in the final models.

First, consider the features that made it into each optimized model as displayed in the chart below.

Here we can think of each feature as belonging to one of four tiers, with each tier defined by the number of optimized model types its features contributed to. In the top tier, five features were included in the best model for each type. Twelve features contributed to 3 of the optimized models, with four being included in 2 types and eight surviving the cut in just 1 model. Keep in mind that the tiers tell us nothing about the amount of contribution a feature makes to predictions in a model. Rather, belonging to a higher tier validates a feature by indicating that it contributes something to a greater number of models.

We are therefore not surprised to see that three of the features in the top tier (qb_squareness_delta, penetrator_depth_delta, y_diff) relate to the initial subjects of our investigation (squareness, penetrator depth, relative depth of protectors). Other features that one could call 'adjacent' to the aforementioned appear in the top two tiers. They include open_outside_in, open_outside_out and rel_rotation, which, like mean_qb_squareness and its delta, are derived from the orientation of the protectors. Also, as with y_diff, the depth of individual protectors is used (along with other factors) to determine dist_delta and min_qb_dist_delta. On the other hand, several features with no obvious relation to those previously considered contributed across model types: namely the width of the inside protector and the speed of both protectors. This was somewhat portended (and perhaps driven) by their relatively high correlation to the target variable. Finally, whether or not the protectors exchanged rushers influenced predictions in all but the logistic regression models. As our only boolean variable, it will be handled somewhat differently in the discussion that follows.

Next, we calculate relative feature importance in the same manner as before. The logistic regression and XGBoost models have built-in measures which we average across CV folds and report as a percentage of the summed averages. Since our neural networks do not have these attributes, we again use LassoNet as a reasonable proxy. Here we trained a LassoNet with the same hyperparameters as our final FNN using the optimal feature set of that same model type. In the chart below, features are listed top-to-bottom in order of average rank of relative importance over the three model types. The shading (opacity) of each rectangle represents relative feature importance value, while the numbers indicate rank within that model type.

Here we have an emphatic answer to the question posed at the beginning of our investigation. We have shown to this point that stopping the penetrator, staying square and minimizing relative depth are relevant factors in predicting pass protection success. Now we see consensus among our models that these are the most important factors, though in a slightly different way than originally defined. The winner of our contest by unanimous decision is penetrator_depth_delta, indicating that slowing the penetrator is absolutely crucial to defeating stunts. Deemed next most important is mean_qb_squareness. This measure and its delta rank higher than any other features derived from the orientation of the pass protectors. Each of those squareness-related features varies slightly in terms of what it measures, and many make some contribution to producing predictions in one of our models. Nonetheless, the evidence suggests that the best framing of squareness is the extent to which protectors collectively keep their backs to the QB. Rounding out the top 3 is y_diff, which is the very same measure of relative depth previously considered.

Moving down the chart, we see that the type of stunt being executed factored heavily into the XGBoost models (where stunt_type was treated as a single categorical variable) but not the others (where the categories were broken up into individual encoded variables). Here we note that while the TT encoded variable was in the optimal feature set for the LSTM models, we lack an appropriate way to measure its importance. The fact that the FNN, our best performing model type, does not consider stunt_type speaks to the universality of the other features under consideration. Moving forward, we will treat stunt_type as we have previously by demonstrating how the other features vary in expression by stunt_type when relevant.

Finally, we see the exchange feature playing a prominent role in several model types, with the speed and width measures factoring in to lesser degrees. It is particularly interesting that while changes in speed (s_in_delta, s_out_delta) contributed significantly across model types, the measures of acceleration (derived from a in the tracking data) played no role whatsoever. This suggests that for our purposes examining average acceleration (change in speed over a time interval) is more useful than instantaneous acceleration (acceleration at a specific time point). It is also noteworthy that width_in_delta contributes modestly to predictions across model types while its relative x_diff_delta is perhaps the most mercurial, ranging from prominent in the logistic regression to entirely absent in the neural networks. Considering previous discussion about the relative strengths of each model type, this serves to validate width_in_delta as worthy of consideration moving forward.

Feature Expression

To this point we have established a general idea of which factors matter in determining pass protection success and have devised a way to rank those factors in terms of importance. However, we have not yet explored exactly how the values for a particular feature influence the predictions of our models. Such explanations would be useful as they ultimately could help to determine benchmark values to strive for or avoid. Unfortunately, determining exactly how this takes place mathematically becomes more difficult as the complexity of the model increases. Further, clearly explaining such a function to the coaches putting the information to practical use would itself be a tall task. Instead, we opted to answer a more intuitive and directly observable question: For each feature of interest, what does our optimal model think good and poor pass protection look like?

Likely Wins and Losses

To answer this question we first must define "good" and "poor" in the context of our model's estimates. Recall that for each frame of a pass rush rep our model generated a pseudo-probability which can be used to predict success or failure. Higher probabilities indicate a greater chance of a pass rush win, while lower probabilities are associated with protection wins. However, as previously mentioned the probabilities generated by our FNN CV leaned aggressive, with an average estimate of 0.476. Evaluated at a decision threshold of 0.5, our probabilities would predict a rush win rate of 0.438. Both of these values far exceed the observed win rate (by frame) of 0.332. Therefore, some of the decision boundaries we set will not be intuitive from a probabilistic standpoint. Although we will continue to refer to them as "probabilities", it may be helpful to think of our model estimates as "scores" instead.

In any case, we decided to set the threshold for "poor" protection (a likely pass rush win) at the lowest boundary that produces at least a 0.75 precision score. In other words, we chose a maximal number of frames such that at least 75% of those frames were associated with observed pass rush wins. Specifically, any frame with a "probability" estimate at or above 0.775 was deemed a "likely (pass rush) win", and 1395 such frames were identified. For the sake of consistency, we then used the same thought process to determine a threshold for "likely losses". This time we chose the highest boundary that produced a 0.75 precision score in predicting the negative class. This threshold was set at 0.561, and 17415 frames fell below it. Finally, our end goal in identifying likely wins and losses is to compare the average of their feature values by frame. In order to guarantee a high level of confidence in the means for likely wins, we eliminated from consideration any frame that did not have at least 20 likely wins. This had the effect of filtering out the initial 12 and final 3 frames from the snap and balancing our numbers quite a bit, with 8636 likely losses and 1334 likely wins remaining. The figure below shows the final results of this process.
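
Programmatically, the two boundary searches amount to scanning the sorted estimates; a sketch, where proba and rush_win are hypothetical aligned numpy arrays of frame-level estimates and outcomes:

```python
import numpy as np

def likely_win_threshold(proba, rush_win, target_precision=0.75):
    """Lowest cutoff such that frames at/above it are >= 75% observed rush wins."""
    for t in np.sort(np.unique(proba)):  # lowest boundary first
        above = proba >= t
        if rush_win[above].mean() >= target_precision:
            return t

def likely_loss_threshold(proba, rush_win, target_precision=0.75):
    """Highest cutoff such that frames below it are >= 75% observed protection wins."""
    for t in np.sort(np.unique(proba))[::-1]:  # highest boundary first
        below = proba < t
        if below.any() and (1 - rush_win[below]).mean() >= target_precision:
            return t
```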

Penetrator Depth

The charts below demonstrate how penetrator_depth and its delta compare between likely wins and losses. Note that the shading around each line represents the 95% confidence interval for the mean value at each frame, and not the distributions of the values themselves. As in the histogram above, the distribution of values for likely wins and losses overlap significantly even when individual frames are isolated. We consider the mean at each frame in order to observe the general trends and see how the model is discriminating wins from losses.

As in previous evaluations of penetrator depth, very little explanation is needed. On good pass protection reps, the penetrator is slowed virtually to a stop by the end of the rep. However, failing to slow the penetrator past a minimal level is likely to result in a pass rush win. We explore specific benchmark values in the application section.

QB Squareness

For mean_qb_squareness we observe a slightly different trend than above. Successful protection is characterized by maintaining a low value throughout the rep (i.e. keeping their backs to the QB), while poor protection sees blockers get turned toward the QB to a greater extent as the rep progresses. By 2.5 seconds into the rep the delta becomes less meaningful as the mean for likely wins reaches 90 degrees, indicating that at least one of the protectors has "opened the gate" and allowed a clear path to the QB.

Relative Depth

The story on relative depth is similar to that of QB squareness: good protection involves maintaining the same relatively low value throughout the rep. While successful protectors are not necessarily at the same level, they maintain only a small buffer zone. However, we see that for likely wins y_diff is more of a sustained high value than an increasing one. Further, we see in the bottom right chart that the confidence intervals for y_diff_delta overlap almost completely. Taken together, this suggests that when considering the relative depth of protectors it is more useful to look at the instantaneous measure y_diff than its delta. This claim is supported by our measures of feature importance, each of which attributed more impact to the former.

Also, note below on the left that unlike previous examples the confidence intervals for the two groups are much closer together (and even overlap in frame 18). Although a clear trend can be established, this indicates a higher degree of overlap in the distribution of values between the two groups.

Now recall that in the exploratory analysis section we identified only a small window (frames 9-11) in which the means of y_diff differed significantly between rush wins and losses on "TE" stunts. When we split the stunt types into "TE" and "non-TE" and take a second look at y_diff from the model's perspective, we see that much of this overlap can be attributed to "TE" stunts. For "ET" and "TT" stunts y_diff averages become more differentiated, while for "TE" stunts there is separation only at the end of the rep. This suggests that the importance of y_diff to the model is driven primarily by variation of feature values on "non-TE" stunts. We will return to this topic in our discussion of exchanges next.

Exchange

Perhaps the most interesting feature to explore is the difference in exchange rate between likely wins and losses. As previously mentioned, we treat the exchange feature differently: not only is it our only boolean variable, but every frame within a rep receives the same value. For this reason we evaluate exchange rate at the rep level as opposed to the frame level. To do this within our set of likely win and loss frames we queried only the latest occurring frame of each rep in the set. Overall, exchanges occurred in 71.3% of likely losses but only 56.7% of likely wins. The chart below breaks down these percentages by stunt type. For 'TT' and 'ET' stunts the results are intuitive and follow the general pattern: exchanging rushers on stunts is associated with success for the protection. For each of these stunt types there is a clear difference in exchange rate, with the rate for likely wins falling well below the confidence interval for that of likely losses. Conversely, the exchange rate for likely wins on 'TE' stunts sits just above the confidence interval for likely losses of that stunt type.
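
In pandas terms this rep-level evaluation amounts to keeping one row per rep before averaging; a sketch with hypothetical table and column names:

```python
# Keep only the latest frame of each rep within the likely win/loss set,
# then compute exchange rate by group and stunt type.
last_frames = (
    likely_frames.sort_values("frame_from_snap")
                 .groupby(["gameId", "playId", "stuntId"])
                 .tail(1)
)
exchange_rates = last_frames.groupby(["likely_label", "stunt_type"])["exchange"].mean()
```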

To help explain this disparity, we return our focus to y_diff. In order for an exchange to take place, protectors must be at or near "the same level" vertically, i.e. have minimal y_diff. Considering that exchanges happen just after rusher overlap occurs, we should expect a collection of reps to have a similar exchange rate if their y_diff is similar in the frames after overlap (say frames 18-23). For likely wins and losses, this is true for "TE" stunts but not among the other types. On average, protectors have about 0.5 yards greater relative depth on likely wins of "non-TE" stunts. This makes executing exchanges more difficult and to some extent explains the differences in exchange rate for those stunt types. As with y_diff, we can reasonably conclude that the exchange feature is most impactful in helping distinguish good from poor protection on "ET" and "TT" stunts.

It is less obvious why exchange rates may be higher when protection is poor on "TE" stunts. First, we should point out that the observed exchange rates between rush wins and losses on "TE" stunts were 82.7% and 77.0% respectively. Although not significantly different, these proportions align well with those shown above. This provides some evidence that the difference in rates for likely wins and losses is not just the result of variance within the samples. We can only speculate at this point, but one possibility is that the protectors get themselves into such a disadvantageous position that exchanging rushers does more harm than good. For example, consider a scenario where the penetrator is not sufficiently slowed and/or the Guard (blocking the penetrator) turns his shoulders too far outside. By leaving the penetrator to pick up the looper, the Guard may allow the penetrator a free path to the QB before the Tackle can collect him. If possible, the best option may be for the protectors to stay on and hope the Tackle can wash the looper across the pocket. Again, this is purely speculation and represents one of many hypotheses that could be tested moving forward to answer this question.

Speed

As in the case of penetrator_depth, little elaboration is required when discussing the differences in speed for good and poor pass protection. For both the inside and outside protector, lower rates of speed are associated with success at every frame of interest. In addition, the tendency on likely rush wins is for the protector to accelerate through overlap, while on likely losses protectors start to decelerate by the frame of overlap and continue to slow down through the rep. There can be little doubt that the speed and average acceleration of protectors are factors, and that the ability to anchor down is associated with good protection. This should be of little surprise given the high correlation of speed with penetrator_depth and the clear association of "slowing the penetrator" with pass pro success.

The speed results presented here pose perhaps our biggest "chicken or the egg" question: Are protectors likely to lose because they are going faster, or are they going faster because they are losing? While we cannot come to a precise answer here, it almost certainly depends on the frames being considered. The fact that a difference in speed is apparent as early as 1.3 seconds into the rep suggests that the speed with which protectors choose (consciously or not) to come out of their stance affects their likelihood of success. This process is not entirely free from the influence of the rush: the alignment and get-off of rushers will affect the initial speed of the protectors tasked with stopping them. However, before making contact protectors are still determining their own speed, whereas after contact the upfield charge of rushers will physically alter how fast blockers are moving. Given that differences in speed are exacerbated as the rep continues, we posit that moving faster prior to overlap creates sufficient momentum that it becomes more difficult to slow down once contact with the rush occurs. To evaluate this hypothesis, we would need precise data on when contact occurs. As this can be more reliably charted from game or practice film than from tracking data, it represents yet another suggested approach for future research.

7. Application and Future Research

Process Automation

Beyond conducting research, one of the most exciting possibilities for leveraging tracking data is using it to automate weekly scouting processes. Here we refer specifically to the opponent- and self-scouting that coaches and support staff conduct each week in order to identify the schemes employed by opponents and isolate areas of weakness that can be exploited. This typically involves incorporating data collected by staff members as well as external vendors (such as PFF) who meticulously chart and grade specific aspects of each play. While these collection methods likely ensure the highest quality of data upon completion, they require substantial resources in terms of time and personnel and are always subject to human error and discretion. In contrast, running a computer program using tracking data to identify schemes and evaluate performance provides a quick and unbiased alternative. In this section we argue not that the latter should replace the former, but rather that automation be used to make existing processes more efficient and less subjective.

Identifying Stunts

One of the first challenges we were tasked with in this investigation was identifying and naming stunts without the benefit of film. To do so we created a working definition for a "stunt" and extracted stunts from pass plays using a series of logical tests based on that definition. We then applied a similar process in naming the stunts we extracted. Once the code was written, extracting stunts from Weeks 1-8 of the 2021 season required only a few minutes of run time. For a team looking to scout all of the stunts their opponent has run or faced, a dataset of the plays and players involved as well as other relevant data (such as time_to_overlap or exchange) can be produced shortly after tracking data is made available. This data can then be matched with game film prior to manual charting. This speeds up the charting process by allowing staff to simply verify the quality of the data rather than take steps to collect it. It also allows for the possibility of collecting certain data points that teams may not have previously chosen to collect due to the tedious or time-consuming nature of doing so. Here time_to_overlap provides a good example, as it otherwise would have to be manually recorded several times per stunt in order to ensure accuracy. Finally, we have no delusions that our algorithm is perfect in capturing all stunts: any that are missed in initial detection can be identified during manual charting, and the automated subprocess that collects stunt data can be re-applied.

Further, we would like to point out that although our algorithm for identifying stunts was useful in this exploration, it is by no means the only one that can be employed. In developing our definition of a stunt, we specified that we were looking for "traditional" stunts and that we wanted to ignore "incidental" stunts to the extent possible. Teams may have differing opinions on whether lane exchanges involving second-level or out-of-box defenders should be included or up to what point in time overlaps should be considered. To the extent that they do, it should be reflected in their algorithms. Also, recall that defining the concept of a "stunt" was necessary because we were provided no information directly indicating whether one had taken place. Conversely, it is likely that many teams have or can acquire recent historical data detailing when, between whom and how stunts occurred. This information can then be used in conjunction with the tracking data to train supervised models that identify and name stunts to increasing levels of accuracy as errors are corrected over time. Finally, while the algorithms discussed here apply very specifically to stunt identification, similar methods can be (and have been) used to automate other processes such as identifying run schemes, pass concepts and defensive coverages.

The Utility of a Model

In our exploration we used modeling to rank features in terms of importance and to examine what those features look like when the model predicts either good or poor performance. Here we will discuss a more direct application of model outputs: using them to assess pass protection performance. Before we describe exactly how we would execute this, we will first explain why doing so is useful.

For most of the history of statistical analysis in sports, practitioners have used results-based metrics to evaluate athletic performance. In our specific context the "sack" represents this type of metric, and in many ways is still the most popular metric used to evaluate pass rush and protection performance. However, in recent years sports analysts have turned their attention to process-based metrics, which measure aspects of play that often lead to a desired outcome. For this study we are provided with PFF's charting data for pressures and hits in addition to sacks, all of which can be aggregated into a total pressures metric. Sacks have a direct effect on the quarterback, but pressures and hits can force the quarterback to move off his launch point and ultimately make an inaccurate or ill-advised throw. Also, whether a pressure becomes a sack is often dependent on the quarterback's ability to evade the rush either by scrambling or getting rid of the ball. For these reasons it is often better to look at total pressures than just sacks to get a more accurate sense of how rushers and pass protectors are performing.

We are also given data on whether a rusher "beat" a blocker but did not record a pressure. These events often occur for reasons outside the control of the rusher and protector involved, such as a quarterback releasing the ball before the rusher could record a pressure. While the rusher does not affect the QB on such reps, it is assumed that a pressure would have occurred had that other event not preceded it. In this way plays where just a "beaten by defender" is recorded represent our best example of a process-based event. Like pressures, we can aggregate all of the events discussed here into the broader metric of "pass rush wins". Not all wins become pressures, but they are still an important indicator of performance if you want to project how a protector will perform with a quarterback whose time to throw is longer. From a holistic standpoint, pass rush wins surrendered give a better sense of performance than pressures allowed alone.

At this point one might wonder why a model predicting pass rush wins is useful if we already have access to the observed outcome. The answer here is two-fold. First, just as not every pressure becomes a sack and not every rush win results in pressure, poor protection technique does not always result in a rush win. If, for example, the quarterback gets rid of the ball shortly after overlap the rush may not have an opportunity to register even a win. The opposite situation can also be true; good protection technique can be beaten by a well-executed stunt. What you then want is something that measures whether the protection is in a position to succeed or fail, and not necessarily the outcome itself. We argue here that the models used in our investigation do precisely that. In training them, the majority of features used were designed to measure specific aspects of the movement and positioning of pass protectors, and each of those is controlled to some extent by the protectors themselves. While penetrator_depth does not directly measure a single aspect of blocker technique, it is certainly influenced by several aspects as evidenced by its correlation to many other important features. It can therefore be said that our models effectively distill multiple aspects of protection into a single metric relating technique to success.

Second, while the model outputs can be used to make binary predictions, it is critical that the outputs themselves are continuous numbers. Unlike the observed binary outcome of rush win or protection win, this allows for comparison and ranking of individual reps within all possible groups. In layman's terms, this enables us to determine that a rep assigned a pseudo-probability of 0.9 involves worse protection than one with a 0.8 output even though both represent likely wins for the rush. It also presents a means for comparison between teams, players and protection pairs beyond the observed win rate. As we will discuss shortly, one can use these values to compute both an average value across reps as well as an expected win rate. While expected and observed win rates should converge over a large number of reps, the key point is that stunts are run infrequently enough that teams often do not have many reps to evaluate. If over 5 reps a protection pair surrendered only one win but was expected to lose three, that is an important distinction that the model can catch. In summary, the model outputs measure not just whether protectors were in a position to succeed but also the extent to which good technique was used. More than any metric previously discussed, this embodies the ideal of evaluating process over result.

Measuring Performance - Mean Probability and Expected Win Rate

With that, let us explain how we would deploy the models. First, recall that our models take as input feature values at a single frame of a rep and return a pseudo-probability for that frame. These estimates would be useful if teams were interested solely in isolating likely wins (or losses) or if they wanted to evaluate each rep at a particular frame (say frame_of_overlap). However, we contend that the most practical use involves having a single value for each rep that encapsulates information from throughout the play. In this way one can compare and aggregate reps regardless of their length in addition to getting a sense of overall performance. In our exploration we found that the optimized FNN model performed best at distinguishing wins from losses at the frame level. One could certainly devise a way of combining frame-level outputs from that model to form a single rep-level value, say by computing a weighted average. Nevertheless, we are fortunate in that we already have a model which incorporates information from throughout the rep: the LSTM. We mentioned previously that LSTM outputs are traditionally evaluated only after the last input frame of a given sequence has run through the model. Taking the LSTM estimates from just the last frame of each rep, we observe a model performance of AUC = 0.785. For the sake of practicality we have deemed this sufficient to move forward.

Now that we have a single value measuring performance on each rep, we can explore how to aggregate those values over multiple reps to get an overall picture of protection performance. The first and simplest way we will do this is to average the values across reps. While this does provide a general sense of performance, it does not tell the full story, as the variation in these values across reps matters a great deal. Let's call our performance metric $\hat{p}$ and consider two pairs of protection partners $A$ and $B$, both with mean $\bar{\hat{p}}$ = 0.5 over 5 reps. Pair $A$ is consistently mediocre, registering a value of $\hat{p}$ = 0.5 on each rep, while Pair $B$ is more inconsistent, with values of $\hat{p}$ = 0.2, 0.3, 0.5, 0.7 and 0.8. We've already discussed the aggressive nature of the models, implying that our threshold for predicting a win will be greater than 0.5. As such we would predict exactly 0 rush wins for Pair $A$ while we would expect up to 2 rush wins against Pair $B$. While this is a highly simplistic example, it demonstrates that having a measure of expected wins and losses accounts for the variation in individual values of $\hat{p}$. When converted to a win rate, it will also provide a basis for comparison to the actual win rate observed. What remains then is to determine a threshold for predicting rush wins.

Since we are working with an unbalanced target class and are not prioritizing predicting either wins or losses, we opted to choose the prediction threshold that maximizes the f1 score. With an f1 = 0.635, that optimal threshold is $\hat{p}$ = 0.533. Any probability at or above that threshold is predicted to be a rush win, while those below are expected rush losses. The confusion matrix below shows the results of making predictions at this threshold for each rep.
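
A sketch of that threshold search, where rep_proba and rep_win are hypothetical per-rep arrays of LSTM estimates and observed outcomes:

```python
import numpy as np
from sklearn.metrics import f1_score

# Evaluate f1 at every distinct estimate; keep the threshold that maximizes it.
thresholds = np.unique(rep_proba)
f1_scores = [f1_score(rep_win, rep_proba >= t) for t in thresholds]
best_threshold = thresholds[int(np.argmax(f1_scores))]  # 0.533 in our sample
```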

With that we can calculate both of our aggregate metrics for protection pairs in the first 8 weeks of the 2021 season. The table below includes the top 10 pairs (min 5 reps) sorted first by win_rate_exp, then by mean_proba (the arithmetic mean of the $\hat{p}$ values). Note how these metrics serve as a "tie-breaker" for each other; when evaluating pairs with the same expected win rate, one rates the pair with the lower mean probability as having performed better. Here we also include win_rate_below_exp, which compares the expected and actual win rates. We can interpret win rate below expected as a measure of luck, with those who allowed fewer rush wins than predicted registering positive values and vice versa. When we examine team performance we will observe less deviation from 0.0 in win_rate_below_exp as actual and expected win rates converge with higher rep counts.
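
The aggregation itself is a simple groupby; a sketch, where reps is a hypothetical one-row-per-rep table carrying proba, the thresholded pred and the observed rush_win:

```python
pair_metrics = (
    reps.groupby(["player_in", "position_in", "player_out", "position_out", "team"])
        .agg(count=("proba", "size"),
             mean_proba=("proba", "mean"),
             win_rate_exp=("pred", "mean"),
             win_rate_act=("rush_win", "mean"))
        .query("count >= 5")
        .assign(win_rate_below_exp=lambda d: d["win_rate_exp"] - d["win_rate_act"])
        .sort_values(["win_rate_exp", "mean_proba"])  # ascending: best pairs first
)
```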

Top 10 Stunt Protection Pairs by Expected Win Rate and Mean Predicted Probability

player_in position_in player_out position_out team count mean_proba win_rate_exp win_rate_act win_rate_below_exp
Andre James C John Simpson G LV 7.0 0.172977 0.000000 0.142857 -0.142857
Ali Marpet G Donovan Smith T TB 8.0 0.204209 0.000000 0.125000 -0.125000
Trey Hopkins C Quinton Spain G CIN 6.0 0.242197 0.000000 0.333333 -0.333333
Bradley Bozeman C Ben Powers G BAL 5.0 0.242506 0.000000 0.000000 0.000000
Creed Humphrey C Trey Smith G KC 7.0 0.259482 0.000000 0.285714 -0.285714
Michael Deiter C Solomon Kindley G MIA 5.0 0.276511 0.000000 0.000000 0.000000
Sam Mustipher C James Daniels G CHI 8.0 0.218483 0.125000 0.375000 -0.250000
Lloyd Cushenberry C Graham Glasgow G DEN 8.0 0.387180 0.125000 0.375000 -0.250000
Connor Williams G Tyron Smith T DAL 7.0 0.260263 0.142857 0.000000 0.142857
Kendrick Green C Kevin Dotson G PIT 7.0 0.289395 0.142857 0.285714 -0.142857

The next table lays out the bottom 10 protection pairs for the same time frame. Here we see in the top two entries the exact situation presented previously: pairs of offensive linemen who "should" have lost several more times than they actually did. Again, this highlights the value of developing a model that can be used to identify such situations. We will lay out in the next section how teams can leverage this information in scouting themselves or their opponent.

Bottom 10 Stunt Protection Pairs by Expected Win Rate and Mean Predicted Probability

player_in position_in player_out position_out team count mean_proba win_rate_exp win_rate_act win_rate_below_exp
Will Hernandez G Matt Peart T NYG 5.0 0.518651 0.600000 0.200000 0.400000
Cesar Ruiz C Calvin Throckmorton G NO 5.0 0.538397 0.600000 0.200000 0.400000
Tytus Howard G Laremy Tunsil T HOU 5.0 0.624961 0.600000 0.600000 0.000000
Chris Lindstrom G Kaleb McGary T ATL 8.0 0.523538 0.625000 0.500000 0.125000
Jonah Jackson G Penei Sewell T DET 12.0 0.598842 0.666667 0.666667 0.000000
Matt Hennessy C Jalen Mayfield G ATL 6.0 0.613503 0.666667 0.500000 0.166667
Olisaemeka Udoh G Brian O'Neill T MIN 7.0 0.600777 0.714286 0.571429 0.142857
Joel Bitonio G Jedrick Wills T CLE 5.0 0.598496 0.800000 0.200000 0.600000
Trai Turner G Chukwuma Okorafor T PIT 5.0 0.652755 0.800000 0.400000 0.400000
Kyle Fuller C Gabe Jackson G SEA 5.0 0.664014 0.800000 0.800000 0.000000

For the curious, the following tables display the measures evaluated at the individual and team level. The individual metrics should be taken with a grain of salt as they assign equal credit/blame for performance to each protector in a given pair. Developing an accurate picture of individual performance would require more historical data with individual players being paired with multiple partners.

Top and Bottom 5 Stunt Protectors by Expected Win Rate and Mean Predicted Probability

player position team count mean_proba win_rate_exp win_rate_act win_rate_below_exp
Brandon Brooks G PHI 5 0.133654 0.0 0.000000 0.000000
Ted Karras G NE 7 0.197829 0.0 0.142857 -0.142857
Donovan Smith T TB 8 0.204209 0.0 0.125000 -0.125000
Trey Smith G KC 10 0.253931 0.0 0.300000 -0.300000
Michael Deiter C MIA 8 0.305257 0.0 0.125000 -0.125000
... ... ... ... ... ... ... ...
Kyle Fuller C SEA 9 0.630153 0.777778 0.666667 0.111111
Jedrick Wills T CLE 5 0.598496 0.800000 0.200000 0.600000
Jamarco Jones G SEA 5 0.643065 0.800000 0.400000 0.400000
Jamarco Jones T SEA 5 0.643065 0.800000 0.400000 0.400000
Chukwuma Okorafor T PIT 6 0.682835 0.833333 0.500000 0.333333

On the other hand, we get a much clearer idea of how teams compare to one another simply because there are more reps to evaluate. This mitigates the effect of random variation, which can be observed in the lower absolute differences between expected and actual win rate. Of course, the two most notable exceptions are listed below in Buffalo (who boast the highest actual win rate allowed on the second-fewest reps faced) and Cleveland (whose second-highest expected win rate tracks with the fourth-highest mean probability).

Top and Bottom 5 Stunt Protection Teams by Expected Win Rate and Mean Predicted Probability

team count mean_proba win_rate_exp win_rate_act win_rate_below_exp
KC 30 0.347607 0.200000 0.266667 -0.066667
PHI 21 0.342927 0.238095 0.285714 -0.047619
LV 40 0.347954 0.250000 0.250000 0.000000
BUF 18 0.441171 0.277778 0.611111 -0.333333
MIA 25 0.397591 0.280000 0.240000 0.040000
... ... ... ... ... ...
CAR 22 0.543320 0.500000 0.454545 0.045455
TEN 19 0.471435 0.526316 0.368421 0.157895
CLE 30 0.516289 0.566667 0.300000 0.266667
SEA 20 0.517943 0.600000 0.500000 0.100000

Gameplan Example

In recent years, perhaps no team has earned more notoriety for their love of pass rush stunts than the Dallas Cowboys under defensive coordinator Dan Quinn. In our study we tallied the Cowboys as having run the second-most stunts (64) on the fourth-most plays (54). Their mentality concerning stunts is not whether to run them, but rather who to attack and how. In Week 9 the Cowboys were set to face the Denver Broncos at home. Based on the 2-on-2 stunts in our study, Denver appeared to be at best a mediocre stunt protection team; they lost on 34.2% of the 38 stunt reps they faced, ranking 21st out of 32 teams. However, our model expected them to lose only 28.9% of those reps, the 7th-lowest expected win rate allowed in the league. Even armed with this information Dallas was not likely to be deterred because, after all, the Broncos had not played them yet.

In deciding who to attack we can begin by examining the expected and actual success of Denver's pass protection pairs. A breakdown is detailed in the table below. While the table includes all of the pairs who faced a stunt through Week 8, we will focus on the Broncos' projected starters.

Denver Broncos Stunt Protection Pairs

player_in nflId_in pos_in player_out nflId_out pos_out team count mean_proba win_rate_exp win_rate_act win_rate_below_exp
Lloyd Cushenberry 52491.0 C Quinn Meinerz 53527.0 LG DEN 1.0 0.739397 1.000000 0.000000 1.000000
Lloyd Cushenberry 52491.0 C Dalton Risner 47824.0 LG DEN 7.0 0.425620 0.428571 0.285714 0.142857
Dalton Risner 47824.0 LG Garett Bolles 44832.0 LT DEN 5.0 0.476174 0.400000 0.200000 0.200000
Graham Glasgow 43384.0 RG Bobby Massie 38642.0 RT DEN 10.0 0.381573 0.300000 0.500000 -0.200000
Lloyd Cushenberry 52491.0 C Netane Muti 52589.0 RG DEN 4.0 0.482076 0.250000 0.500000 -0.250000
Lloyd Cushenberry 52491.0 C Graham Glasgow 43384.0 RG DEN 8.0 0.387180 0.125000 0.375000 -0.250000
Quinn Meinerz 53527.0 LG Garett Bolles 44832.0 LT DEN 1.0 0.408734 0.000000 0.000000 0.000000
Netane Muti 52589.0 RG Bobby Massie 38642.0 RT DEN 1.0 0.366985 0.000000 0.000000 0.000000
Lloyd Cushenberry 52491.0 C Bobby Massie 38642.0 RT DEN 1.0 0.120528 0.000000 0.000000 0.000000

As previously stated, our model is particularly useful in identifying situations where protectors are potentially performing better or worse than their observed results would indicate. One such example involves starting right guard Graham Glasgow and right tackle Bobby Massie. Having surrendered 5 rush wins in 10 attempts, they seem like prime candidates for attack. However, their 0.30 expected win rate allowed and mean probability of 0.38 are better than the actual results would imply. With such a small rep count, we can get a clearer picture by looking at a breakdown by stunt type and individual predictions for each rep.

Protection Summary by Stunt Type

stunt_type count mean_proba win_rate_exp win_rate_act win_rate_below_exp
ET 4 0.274027 0.0 0.250000 -0.250000
TE 6 0.453271 0.5 0.666667 -0.166667

Predictions and Results for Stunts Faced

rep gameId playId stuntId player_in pos_in player_out pos_out team stunt_type exchange proba pred rush_win
1 2021091212 702 1.0 Graham Glasgow RG Bobby Massie RT DEN ET 1 0.080456 0 0.0
2 2021091212 3181 1.0 Graham Glasgow RG Bobby Massie RT DEN TE 1 0.087262 0 1.0
3 2021101709 374 1.0 Graham Glasgow RG Bobby Massie RT DEN TE 1 0.659949 1 1.0
4 2021101709 2502 1.0 Graham Glasgow RG Bobby Massie RT DEN TE 1 0.184090 0 0.0
5 2021101709 3134 1.0 Graham Glasgow RG Bobby Massie RT DEN TE 1 0.880401 1 1.0
6 2021101709 3248 1.0 Graham Glasgow RG Bobby Massie RT DEN ET 1 0.419034 0 0.0
7 2021101709 4063 1.0 Graham Glasgow RG Bobby Massie RT DEN ET 1 0.131873 0 0.0
8 2021101709 4111 1.0 Graham Glasgow RG Bobby Massie RT DEN TE 1 0.599247 1 1.0
9 2021102100 1205 1.0 Graham Glasgow RG Bobby Massie RT DEN TE 1 0.308676 0 0.0
10 2021102100 2106 1.0 Graham Glasgow RG Bobby Massie RT DEN ET 0 0.464745 0 1.0

We would first like to make two points about the prediction errors seen in the table above at reps 2 and 10. First, as we said before, the model is not perfect. It attempts to make predictions based on what it has deemed the most relevant features, but it is ultimately limited to the information fed into it. Situations like rep 2, where the probability differs wildly from the observed outcome, require special attention in film study to determine if the win resulted from something outside the purview of the model. Second, from a probabilistic standpoint that scenario will arise from time to time. If the model output of 0.087 for rep 2 were a true probability, we would expect a rush win roughly once every eleven reps. Again, the model is best used to draw attention to events where process and outcome may not be aligned and is not meant to be deterministic.

Next, the summary by stunt type clearly indicates that Glasgow and Massie have handled "ET" stunts better than "TE" in both actual and expected terms. At face value, a recommendation of favoring "TE" stunts against this pair is warranted. Considering that such a recommendation is likely to come from a non-practitioner, it would be helpful to isolate which features are driving probabilities up on predicted wins. There are several ways one could attempt this, and a modern approach that immediately comes to mind is calculating Shapley values for the inputs on each rep. While this seems plausible, it would be computationally expensive and require careful selection of randomized feature values due to the multicollinear nature of our features. Also, one would have to devise a way to interpret and present the Shapley values in a way that is compelling to coaches. That is a can of worms we will not open here. Another approach would be to simply average values for each relevant feature at each frame and compare them to the means of the entire sample. However, in a low-rep environment one very poor play with extreme input values could cancel out many good reps when an average is taken. Even normalizing feature values in this context can be futile because the same normalized value for different features (say one standard deviation from the mean for penetrator_depth or squareness_in) is likely to differ in its impact on predictions.

The approach taken here attempts to assess the frequency with which the pair of protectors demonstrates "poor" technique. It does so by linking pre-determined feature values to observed win rates, something which is quite common in coaching circles. For example, it has become sort of an adage that "a team will win 85% of the time when its defense forces three turnovers". For our purposes, we first selected a subset of relevant features to focus on and divided the sample data by stunt type. For each feature, we then calculated the lowest value at each frame (from the snap) for which the observed win rate at or above that value was at least 50%. Inspired by "survival" curves, the plot below demonstrates this visually for a single feature and frame. On "TE" stunts, protectors that allow 4.56 or more yards of penetration within the first 1.7 seconds lose approximately 50% of the time. While these thresholds are not strictly tied to our model, comparing observed values to them provides an educated guess as to which factors are driving specific model outputs. More importantly, since the information is framed in a way that coaches intuitively understand they can act on it without delay.
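
A sketch of that threshold search for a single feature at a single frame, where values and wins are hypothetical arrays of the feature's values and the observed outcomes across reps:

```python
import numpy as np

def win_rate_threshold(values, wins, target=0.5):
    """Lowest feature value such that reps at/above it lose at >= a 50% rate."""
    for v in np.sort(np.unique(values)):  # lowest qualifying value first
        if wins[values >= v].mean() >= target:
            return v
    return np.nan  # the "NaN" cells: no qualifying threshold exists
```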

Calculating such values for each feature of interest at each frame resulted in the table below. Most of the features included in the table's header are obvious choices as ones that greatly impact model outputs. We chose to include squareness and open_outside for both protectors to provide additional context for mean_qb_squareness. Specifically, when the latter feature is elevated the former give a sense of which blocker is not square and in what direction (inside or outside). Cells with value "NaN" appear where there does not exist a threshold above which a 50% or greater rush win rate was observed.

frame penetrator_depth penetrator_depth_delta mean_qb_squareness mean_qb_squareness_delta squareness_in squareness_out open_outside_in open_outside_out y_diff s_in s_out
1 -0.08 0.09 96.272763 NaN 177.60 NaN NaN NaN NaN 0.60 0.83
2 0.01 0.16 92.135822 2.252017 176.58 NaN NaN NaN NaN 0.89 1.12
3 0.10 0.21 57.953026 NaN 168.21 NaN NaN NaN NaN 1.39 1.40
4 0.26 0.23 56.653009 5.076606 167.59 NaN NaN NaN 0.87 1.75 1.67
5 0.49 0.25 56.663824 NaN 166.15 NaN NaN NaN 0.88 NaN 2.02
6 0.72 0.29 54.396176 NaN 158.80 NaN NaN NaN 1.02 NaN 2.09
7 1.03 0.33 54.300634 NaN 156.05 60.52 NaN 21.31 NaN NaN 2.27
8 1.34 0.33 55.260885 NaN 168.55 21.78 NaN 13.81 0.68 2.95 2.34
9 1.73 0.33 54.127060 NaN 60.32 18.03 56.29 17.29 0.61 3.26 2.64
10 1.98 0.37 60.451030 5.217881 62.16 21.65 63.91 16.11 NaN 3.46 NaN
11 2.30 0.38 57.515736 3.351481 48.89 25.45 49.77 14.92 NaN 3.64 2.63
12 2.71 0.38 41.359162 1.046716 52.28 24.73 52.88 22.95 0.91 NaN 2.66
13 2.89 0.36 41.405108 2.517329 55.87 36.59 56.72 21.11 0.93 NaN 2.55
14 3.27 0.35 42.943847 1.753492 56.95 38.98 57.69 25.53 0.94 3.55 2.62
15 3.71 0.34 38.881475 2.293357 60.43 32.46 61.33 28.24 0.86 2.54 2.55
16 4.10 0.32 35.345441 2.041986 56.69 32.28 59.96 32.14 0.72 2.53 2.55
17 4.39 0.31 36.972663 2.526260 57.39 57.53 59.38 33.97 0.75 2.71 2.55
18 4.56 0.29 38.693093 2.547165 53.15 35.01 56.66 59.21 0.77 2.38 2.65
19 4.77 0.28 38.299821 4.071512 52.22 33.78 53.86 91.96 0.79 2.28 2.59
20 4.88 0.25 38.279336 3.592320 44.37 44.62 53.51 79.91 2.08 2.13 2.48
21 4.91 0.24 39.462294 2.462110 41.89 42.22 73.72 82.49 2.23 2.08 2.23
22 5.04 0.21 35.036204 2.510727 34.19 41.59 86.63 94.73 1.53 1.84 2.27
23 5.28 0.23 33.042665 3.350514 47.68 55.69 90.51 109.36 1.71 2.00 2.27
24 5.50 0.21 33.830020 2.013921 47.98 51.79 88.54 114.51 1.90 1.82 2.22
25 5.49 0.15 30.634889 0.774795 33.22 48.96 76.33 107.26 0.99 1.50 1.71
26 5.01 0.08 26.348230 0.880717 25.42 45.47 76.33 110.08 0.53 1.30 1.15
27 4.75 0.04 20.533145 1.556416 26.46 12.92 79.01 115.18 -0.08 1.15 0.86
28 4.32 0.02 21.434744 -0.971644 20.15 8.44 46.74 52.80 -0.16 0.76 0.70
29 5.08 0.02 34.575697 -1.544540 38.12 27.18 74.89 57.42 0.81 1.19 1.08
30 6.48 0.10 49.148040 6.691900 64.56 32.03 75.14 61.97 0.88 1.67 1.54
31 6.69 0.24 75.532546 5.519362 70.07 NaN 70.07 NaN 1.88 NaN 2.17
32 6.66 0.19 74.096730 5.707643 59.93 54.60 59.93 NaN NaN 2.29 1.90
33 7.24 0.09 81.935967 NaN 75.09 74.84 75.09 NaN NaN 1.92 NaN

To further explore Glasgow and Massie's "TE" reps, we then calculated the relative frequency at which they exceeded the threshold values above. Note that each rep does not include the same number of frames of interest, as reflected in the diminishing values at the bottom of the count column. Higher rates appearing at the bottom of the table should therefore be given less weight as they are derived from fewer reps. Also, cells with a corresponding threshold of "NaN" always result in a relative frequency of 0.0 as one cannot exceed a threshold which does not exist. Within each feature, we are searching for sustained intervals (3 or more consecutive frames) of high frequency (0.50 or greater) that start within the first 20 frames. Trends that emerge too long after overlap are less vulnerable to exploitation as the likelihood that the ball is thrown rises.
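
Computing the table that follows reduces to comparing each rep's values to the frame-indexed thresholds and averaging; a sketch for one feature, with hypothetical pair_te_frames and thresholds tables keyed by frame_from_snap:

```python
# Per frame: share of the pair's "TE" reps whose penetrator_depth exceeds
# the sample-wide 50% win-rate threshold for that frame.
merged = pair_te_frames.merge(thresholds, on="frame_from_snap", suffixes=("", "_thr"))
merged["over"] = merged["penetrator_depth"] > merged["penetrator_depth_thr"]
over_rate = merged.groupby("frame_from_snap")["over"].mean()
```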

frame count penetrator_depth penetrator_depth_delta mean_qb_squareness mean_qb_squareness_delta squareness_in squareness_out open_outside_in open_outside_out y_diff s_in s_out
1 6.0 0.500000 0.166667 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.166667 0.333333
2 6.0 0.500000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.166667 0.333333
3 6.0 0.500000 0.000000 0.166667 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.333333
4 6.0 0.500000 0.000000 0.333333 0.166667 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.333333
5 6.0 0.500000 0.000000 0.333333 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.166667
6 6.0 0.500000 0.000000 0.166667 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.333333
7 6.0 0.500000 0.000000 0.166667 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.333333
8 6.0 0.500000 0.000000 0.166667 0.000000 0.000000 0.000000 0.000000 0.166667 0.166667 0.000000 0.333333
9 6.0 0.166667 0.000000 0.166667 0.000000 0.000000 0.166667 0.000000 0.166667 0.500000 0.000000 0.000000
10 6.0 0.500000 0.000000 0.000000 0.166667 0.166667 0.166667 0.166667 0.333333 0.000000 0.000000 0.000000
11 6.0 0.500000 0.000000 0.000000 0.333333 0.500000 0.166667 0.333333 0.333333 0.000000 0.000000 0.166667
12 6.0 0.166667 0.000000 0.166667 0.833333 0.500000 0.000000 0.500000 0.000000 0.000000 0.000000 0.166667
13 6.0 0.500000 0.000000 0.333333 0.666667 0.666667 0.000000 0.666667 0.166667 0.000000 0.000000 0.166667
14 6.0 0.333333 0.166667 0.333333 0.333333 0.666667 0.000000 0.666667 0.166667 0.000000 0.000000 0.000000
15 6.0 0.166667 0.166667 0.500000 0.333333 0.666667 0.333333 0.666667 0.166667 0.000000 0.833333 0.000000
16 6.0 0.166667 0.333333 0.833333 0.500000 0.666667 0.333333 0.666667 0.166667 0.000000 0.833333 0.000000
17 6.0 0.166667 0.166667 0.833333 0.166667 0.500000 0.166667 0.500000 0.166667 0.000000 0.500000 0.166667
18 6.0 0.166667 0.333333 0.666667 0.166667 0.833333 0.333333 0.500000 0.000000 0.000000 0.833333 0.000000
19 6.0 0.166667 0.166667 0.666667 0.166667 0.833333 0.500000 0.666667 0.000000 0.000000 0.666667 0.166667
20 6.0 0.166667 0.333333 0.500000 0.166667 0.833333 0.166667 0.666667 0.000000 0.000000 0.666667 0.166667
21 6.0 0.166667 0.333333 0.500000 0.833333 0.833333 0.500000 0.333333 0.000000 0.000000 0.666667 0.166667
22 5.0 0.200000 0.600000 0.600000 0.400000 0.800000 0.400000 0.000000 0.000000 0.000000 1.000000 0.200000
23 5.0 0.200000 0.600000 0.600000 0.400000 0.600000 0.200000 0.000000 0.000000 0.000000 0.800000 0.200000
24 5.0 0.200000 0.400000 0.800000 0.600000 0.600000 0.400000 0.200000 0.000000 0.000000 1.000000 0.200000
25 5.0 0.400000 0.400000 0.800000 1.000000 0.800000 0.400000 0.200000 0.000000 0.000000 1.000000 0.600000
26 4.0 1.000000 0.750000 1.000000 1.000000 0.750000 0.500000 0.250000 0.000000 0.500000 1.000000 1.000000
27 3.0 1.000000 0.666667 1.000000 1.000000 0.666667 1.000000 0.000000 0.000000 0.666667 1.000000 1.000000
28 2.0 1.000000 1.000000 1.000000 1.000000 0.500000 1.000000 0.500000 0.000000 1.000000 1.000000 1.000000
29 2.0 1.000000 1.000000 0.500000 0.000000 0.500000 0.500000 0.000000 0.000000 0.500000 0.500000 1.000000
30 1.0 1.000000 1.000000 0.000000 0.000000 0.000000 1.000000 0.000000 1.000000 1.000000 0.000000 1.000000
31 1.0 1.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000 0.000000
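
For reference, the computation described above (over threshold frequencies per frame, followed by a search for sustained intervals) can be sketched in a few lines of pandas. This is a minimal illustration under assumed data structures: a dict of frame-indexed, per-rep feature DataFrames and a frame-indexed threshold table. The function names are ours.

```python
import pandas as pd

def over_threshold_freq(reps, thresholds):
    """Per-frame relative frequency of reps exceeding each feature's
    threshold. `reps` maps a rep id to a frame-indexed DataFrame of
    feature values; `thresholds` is frame-indexed with matching columns.
    Comparisons against NaN thresholds are False, so a missing threshold
    yields a frequency of 0.0 by construction."""
    exceeded = [df.gt(thresholds.reindex(df.index)) for df in reps.values()]
    # Stack the boolean indicators and average across reps at each frame;
    # shorter reps simply stop contributing, shrinking the count column.
    return pd.concat(exceeded).groupby(level=0).mean()

def sustained_intervals(freq, min_len=3, level=0.50, max_start=20):
    """Runs of `min_len` or more consecutive frames at or above `level`
    that begin within the first `max_start` frames (integer frame index)."""
    runs, start = [], None
    for frame, hot in freq.ge(level).items():
        if hot and start is None:
            start = frame
        elif not hot and start is not None:
            if frame - start >= min_len and start <= max_start:
                runs.append((start, frame - 1))
            start = None
    # Close out a run that extends through the final frame.
    if start is not None and freq.index[-1] - start + 1 >= min_len and start <= max_start:
        runs.append((start, freq.index[-1]))
    return runs
```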

The first trend we recognize is an interval of high frequencies for penetrator_depth to start the rep, dissipating after frame 13. Note that penetrator_depth_delta is seldom over threshold at the same frames, indicating nothing more than a good initial get-off by the penetrator. The real signal appears in mean_qb_squareness, where the over threshold frequency hits 0.5 at frame 15 and stays at or above that mark through frame 29. Moving to the squareness and open_outside columns we identify a clear culprit, with the inside protector (RG Glasgow) showing high rates in both categories starting at frame 12. While his open_outside rates bottom out shortly after overlap, Glasgow's sustained high squareness rates indicate a hard rotation inside. Conversely, Massie sees only a modest rise in over threshold rate for squareness through overlap. Finally, by frame 15 Glasgow's speed is over threshold about as often as his squareness. These trends are depicted in the figures below, with shaded areas representing intervals in which the over threshold rate is at or above 0.50.

Based on this evidence we can confidently state that Glasgow is opening too far outside before overlap on "TE" stunts. This ultimately drives up the pair's mean_qb_squareness values, and may result from excessive penetration early in the rep. We can then speculate that, with elevated speed, Glasgow struggles to adjust back inside to flatten the looper after overlap. While coaches will gain further insight from film study, there exists a clear opportunity to exploit the deficiencies of the Broncos' right guard with "TE" stunts.

Next we turn our attention to center Lloyd Cushenberry and left guard Dalton Risner. Unlike Glasgow and Massie, this pair surrendered one fewer loss than expected on their seven reps. Here again we can gain insight by inspecting predictions and results for individual reps and seeing where the two were consistently over threshold.

Protection Summary by Stunt Type

count mean_proba win_rate_exp win_rate_act win_rate_below_exp
TT 7 0.42562 0.428571 0.285714 0.142857

Predictions and Results for Stunts Faced

gameId playId stuntId player_in pos_in player_out pos_out team stunt_type proba pred rush_win
1 2021091212 1199 1.0 Lloyd Cushenberry C Dalton Risner LG DEN TT 0.900030 1 1.0
2 2021091904 2723 1.0 Lloyd Cushenberry C Dalton Risner LG DEN TT 0.836632 1 0.0
3 2021101006 2192 1.0 Lloyd Cushenberry C Dalton Risner LG DEN TT 0.075959 0 0.0
4 2021101709 4001 1.0 Lloyd Cushenberry C Dalton Risner LG DEN TT 0.564830 1 0.0
5 2021102100 267 1.0 Lloyd Cushenberry C Dalton Risner LG DEN TT 0.223576 0 0.0
6 2021102100 2248 1.0 Lloyd Cushenberry C Dalton Risner LG DEN TT 0.074102 0 0.0
7 2021103110 1970 1.0 Lloyd Cushenberry C Dalton Risner LG DEN TT 0.304212 0 1.0
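
As a side note, the summary table above can be reproduced from the per-rep predictions with a single groupby. Below is a minimal sketch using the seven rows shown, under the assumption (consistent with both summary tables in this section) that win_rate_below_exp is simply the expected rush win rate minus the actual one:

```python
import pandas as pd

# Per-rep predictions for Cushenberry and Risner, copied from the table above.
preds = pd.DataFrame({
    "stunt_type": ["TT"] * 7,
    "proba": [0.900030, 0.836632, 0.075959, 0.564830,
              0.223576, 0.074102, 0.304212],
    "pred": [1, 1, 0, 1, 0, 0, 0],
    "rush_win": [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.0],
})

summary = preds.groupby("stunt_type").agg(
    count=("proba", "size"),
    mean_proba=("proba", "mean"),       # average predicted rush win probability
    win_rate_exp=("pred", "mean"),      # share of reps predicted as rush wins
    win_rate_act=("rush_win", "mean"),  # share of reps actually lost by the pair
)
summary["win_rate_below_exp"] = summary["win_rate_exp"] - summary["win_rate_act"]
print(summary)  # reproduces the Protection Summary row for "TT"
```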

As one would expect, each rep faced by this pair of interior linemen was a "TT" stunt. While interior stunts will generally be classified this way, it is important to note that they can be run in a multitude of ways. For example, the wider rusher can be either the penetrator or the looper. In some cases this is determined "on the fly", with the penetrating rusher coming away from the initial slide of the center. In recent years the tactic of twisting two interior rushers from the same side of the center has also become popular. So while the homogeneity of the classification suggests a lack of choice for defensive strategists, this is not necessarily the case. To account for this, teams can specify stunt subtypes either in their automated stunt capture algorithm or with manual charting (a sketch of one such labeling rule follows below). In any case, it helps to know the ways other teams have attacked the opponent you are scouting and how that opponent has responded.
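
To illustrate, a subtype label could be attached at capture time with a simple rule over pre-snap alignments. The sketch below is hypothetical; the fields and subtype names are ours rather than part of our capture algorithm, but the rule covers the variants described above:

```python
def tt_subtype(pen_align, loop_align):
    """Label a "TT" stunt given each rusher's pre-snap horizontal alignment
    relative to the center (negative = offense's left). Subtype names and
    fields are illustrative, not drawn from our study."""
    if (pen_align < 0) == (loop_align < 0):
        return "TT_same_side"        # both rushers twist from one side of the center
    if abs(pen_align) > abs(loop_align):
        return "TT_wide_penetrator"  # the wider rusher crashes inside first
    return "TT_wide_looper"          # the wider rusher works up and around
```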

The plots below highlight the trends that emerge in looking at where Cushenberry and Risner are frequently in poor position. These focus exclusively on the orientation of the protectors and implicate Risner (the outside protector) as the likely cause of protection issues. Like Glasgow in the previous example, persistently high rates of squareness_out coupled with a drop-off in open_outside_out suggest that Risner turning his shoulders too far outside before overlap precipitates a sharp inside turn just after. It is clear that defenses have opted to work the wider rusher upfield first rather than having him penetrate the left A gap immediately. With the center's shoulders getting turned only after overlap (as evidenced by the squareness_in plot), we can speculate that he is more often facing the looper. Taken as a whole, this tracks with our research, which indicated that success on "TT" stunts is largely dependent on the guard staying square. Here the Cowboys potentially have an opportunity to attack Risner to the inside by first getting him to open outside.

Finally, we stay on the left side of Denver's line to consider their expected guard-tackle pairing. With starting left tackle Garett Bolles ruled out after sustaining an injury in Week 8, second-year reserve Calvin Anderson got his first start of the season. To that point Anderson had seen only limited action, with 13 total reps (according to Pro Football Reference) and no stunts faced (according to our algorithm). Due to the complex nature of stunt and blitz protection, consistency of working partners is critical to an offensive line's success. Over countless practice and game reps, pass protectors develop an intuitive feel for how the men next to them will handle certain looks. An inexperienced player entering the fold is therefore grounds enough for attacking with those schemes, even in the absence of evidence. Nonetheless, we are not operating completely in the dark, as Risner worked with Bolles on five stunt reps to that point. Looking for any deficiencies the guard displayed on those plays may uncover a preferred mode of attack. Summaries are once again provided below.

Protection Summary by Stunt Type

count mean_proba win_rate_exp win_rate_act win_rate_below_exp
ET 4 0.371397 0.25 0.25 0.0
TE 1 0.895283 1.00 0.00 1.0

Predictions and Results for Stunts Faced

gameId playId stuntId player_in pos_in player_out pos_out team stunt_type proba pred rush_win
1 2021091904 1962 1.0 Dalton Risner LG Garett Bolles LT DEN ET 0.900928 1 1.0
2 2021091904 3533 1.0 Dalton Risner LG Garett Bolles LT DEN ET 0.235012 0 0.0
3 2021101709 1757 1.0 Dalton Risner LG Garett Bolles LT DEN TE 0.895283 1 0.0
4 2021101709 2823 1.0 Dalton Risner LG Garett Bolles LT DEN ET 0.131048 0 0.0
5 2021103110 3017 2.0 Dalton Risner LG Garett Bolles LT DEN ET 0.218599 0 0.0

In their five stunt reps, Risner and Bolles had seen mainly "ET" stunts. Here the model correctly predicted the outcome of each: three good reps resulting in rush losses and one very poor rep that led to a rush win. On the other hand, they saw only one "TE" stunt, which the model erroneously predicted they would lose. Here we will leave the "TE" rep to film study and focus on the "ET" stunts. The plots below represent the salient trends emerging from our comparison to threshold values for "ET" stunts. As the probabilities above indicate, Bolles and Risner were seldom in poor position. Only Bolles shows a tendency to get turned in late in the rep (high squareness_out rates with no associated open_outside_out rates).

Nonetheless, the y_diff plot grabs our attention here, as our research suggested this feature is particularly important on "ET" stunts. A closer look at Bolles' depth shows that he is never above threshold at any point. Given the y_diff plot, this indicates that on several reps Risner did not get sufficient depth. As we discussed previously, being on the same level allows protectors to exchange twisting rushers, and exchange rates are heavily tied to success on "ET" stunts. If Risner does not get sufficient depth he is unlikely to be able to help Anderson, who will be isolated on one of Dallas' many formidable edge rushers. Given this information, it may benefit Dallas to open with "ET" stunts against Denver's left side.

With that we conclude this example. As a final note, we would like to mention that Denver can carry out the same steps within their self-scout process. In self-scout, teams evaluate their own performance and look for tendencies that their opponents may try to exploit. By uncovering the insights discussed here, the Broncos could address their vulnerabilities by making schematic adjustments and/or refining technique in practice. In this way the models, and the tracking data they are trained on, provide yet another layer of feedback to aid players and coaches in their constant pursuit of improvement. Our final example explores how measurement and modeling can be used in practice scenarios to ensure favorable outcomes on Sundays.

Practice Application

Practical Considerations

As the saying goes, NFL athletes in whom much has been invested are "being evaluated all the time". Nearly all practice drills and competitive periods are filmed (often from multiple angles), and that film is later scrutinized and often meticulously graded. However, even with ideally placed cameras and multiple sets of well-trained eyes, it can be difficult to detect subtle differences in technique. For example, it is not obvious that a veteran coach can detect an eight-degree difference in shoulder tilt or a 0.4 yard-per-second change in speed. Yet our research indicates that even these small distinctions can make the difference between defeating a stunt or not. Tracking data therefore provides a useful perspective, measuring and identifying aspects of technique that one may not be able to distinguish in person or off of tape.

With the modern state of NFL training facilities, there is little doubt that many (if not all) franchises are capable of collecting tracking data in practice just as the league does in-game. As only a handful of teams still leave their home facilities for training camp, data collection and processing workflows could commence and be refined starting with the first practices of the season. It is also reasonable to expect that data collected in-house would be more accurate than game data due to greater control of the environment and the sheer proximity of the equipment. Possible applications are limited less by technological infrastructure and more by the quantity and skill sets of organizational personnel and their collective imagination.

Measurement in Drills and Competitions

In our specific context, practice application essentially boils down to using the tracking data to measure the relevant aspects of stunt protection that we identified with model feature importance. This can be done in any practice scenario where protecting against stunts is relevant. In individual drills, technical concepts are often practiced "on air" or against a slower-tempo "look" opponent in order to emphasize optimal technique. Measurements outside of a desired range can then be flagged using thresholds calculated in a similar way as in the previous section. The main difference is that the emphasis would likely be placed on staying under a threshold associated with a rush win rate below baseline. For example, consider the graph below, which displays rush win rates on "TE" stunts when s_in (the speed of the inside protector) stays below a given value.

Here we see that staying under 2.0 yards/second at 1.5 seconds after the snap is associated with a rush win rate of only 25%. As before, similar thresholds can be calculated at every frame for any feature of interest. In drill-work, an immediate feedback mechanism can be used to alert a lineman or coach when he is above these optimal thresholds. One can get creative with what that mechanism is: a flashing light, vibration in the shoulder pads, or perhaps an alert within a tablet app. This would ensure that players are truly practicing good habits in drills where their technique is largely within their own control. It would also free up coaches to focus on other aspects of technique that are more visually obvious but not easily measured, such as hand placement and weight distribution. As practice progresses to sub-unit and full-team competitions, data can continue to be collected and processed so that players can be given quick feedback on their reps as they come off the field. It can then be presented in conjunction with film when performance is evaluated in after-practice meetings.
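
To make this concrete, below is a minimal sketch of deriving an under-threshold win rate curve like the one above and of checking a live drill measurement against a chosen threshold. The data structures and names are assumptions for illustration; the 10 frames-per-second rate matches the tracking data, putting 1.5 seconds at frame 15:

```python
import numpy as np
import pandas as pd

def under_threshold_win_rates(reps, feature, candidates):
    """Rush win rate among reps where `feature` (sampled at the frame of
    interest) stayed at or below each candidate threshold value."""
    return pd.Series({t: reps.loc[reps[feature] <= t, "rush_win"].mean()
                      for t in candidates})

# Hypothetical usage on "TE" reps, with s_in sampled at frame 15
# (1.5 seconds after the snap at 10 frames per second):
# curve = under_threshold_win_rates(te_reps, "s_in_frame15",
#                                   np.arange(1.5, 3.5, 0.1))

def drill_alert(measured, threshold=2.0):
    """True when a live drill measurement exceeds the optimal threshold,
    triggering whatever feedback mechanism (light, buzzer, app) is wired up."""
    return measured > threshold
```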

Utilizing Additional Data

Collecting tracking data at practice also opens the door for experimentation and individualized performance evaluation. In our research we were only able to uncover associations between feature values and performance that apply generally to players across the league. However, practice scenarios are already known for being scripted and highly controlled. Coaches therefore have a unique opportunity to set up experiments through which causal links can be established between technique (measured using the tracking data) and results. By giving specific technical cues to players, coaches can encourage feature values within certain ranges and note the effect not just on wins and losses but also on other feature values expressed later in the play. Many of the hypotheses and open questions put forth here could then be adequately tested. While dedicating time and resources to experiments may seem ludicrous in the short term, the long-term benefit of having a tactical edge over one's opponent is worth considering. Regardless, if there is any hope of advancing our research to determine causality, it must happen on the practice field.

Collecting large amounts of data in practice would also allow teams to evaluate players against their own past performance and to train individual predictive models. For example, an exceptional player may have a greater tolerance with respect to squareness if his ability to react to lateral movement is superior to that of his peers. For a replacement-level player we might say the opposite. In an ideal scenario one would calculate individualized thresholds and evaluate each player with respect to those. Likewise, training models on data for an individual or a small group of players could produce feature importances that differ from those of the league generally. This would give coaches and players a better sense of the most important aspects of protection for them and set a direction for future work in practice. If a player knows the areas where he must improve and/or has to be solid, he can work on those aspects outside of a practice setting.
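
Below is a sketch of what individualized modeling might look like, assuming a rep-level feature table is on hand. The column names, the 50-rep cutoff and the choice of logistic regression are illustrative rather than prescriptive:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

def fit_player_models(df, feature_cols, min_reps=50):
    """Train one rush-win model per player with enough reps; players under
    the cutoff would fall back to the league-wide model. Per-player feature
    importances can then be read off each model's coefficients."""
    models = {}
    for player, grp in df.groupby("player_in"):
        if len(grp) >= min_reps:
            model = LogisticRegression(max_iter=1000)
            model.fit(grp[feature_cols], grp["rush_win"])
            models[player] = model
    return models
```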

Finally, a more direct use of modeling game and practice data is in long-term player evaluation. We have already discussed the short-term benefits of using the data to reinforce sound technical habits. Earlier, we laid out how models can be used to create a measure of expected performance (as a function of technique) and described its use in gameplan and self-scout settings. As a final note, the expected values derived from model outputs can act as a data point for coaches in determining playing time, and for team builders when deciding whom to retain or bring in. Although stunts represent only a small fraction of total reps, being able to handle them at an acceptable level is critical at football's highest level. Offensive line units are only as good as their weakest link, and defenses will relentlessly attack any deficiencies they identify. While performance against stunts is certainly not the biggest driver of these decisions, it could be the deciding factor between playing, being cut, or being signed.

7. Future Research

In addition to the directions previously mentioned, an obvious next step is to investigate the higher-dimensional cases identified at the beginning of our study. The next most numerous case pitted two rushers against three protectors, for which we identified 221 reps. Considering the limitations we dealt with in the present study with over four times that many reps, acquiring several seasons' worth of data would be required for those cases. One must also take into account that the rush win rate drops to 20.4% in this scenario, which might limit coaches' interest in the study. Defensive strategists are experts at devising ways to get either free rushers or even numbers with favorable matchups; exploring how to operate at a disadvantage may not offer enough upside to garner interest. Getting back to even numbers, the rep count is bleaker still for three-on-three stunt scenarios at 96. Assuming that one can amass enough reps from historical data, one would then have to devise how to measure our features (or others) at the added dimension. There are multiple ways, for example, that one could approach measuring the relative depth (y_diff) of three pass protectors; two candidates are sketched below. The one saving grace is that teams have been running stunts in largely the same way for decades, and will likely continue to do so as long as they are effective. Diving deeper into higher-dimensional cases therefore becomes a matter of time and access to the data itself.
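
For instance, two natural candidates are the worst-case pairwise gap and the dispersion around the group's mean depth; which better captures "same level" play would itself be a research question. A minimal sketch:

```python
import numpy as np

def y_diff_candidates(depths):
    """Two candidate generalizations of relative depth to three (or more)
    pass protectors, given their depths behind the line of scrimmage."""
    d = np.asarray(depths, dtype=float)
    return {"max_pairwise": float(d.max() - d.min()),  # worst-case level gap
            "std_dev": float(d.std())}                 # spread around mean level
```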

Next, it would be helpful to compare the important factors of stunt protection with those in the absence of a stunt. Prior to the snap the offensive line may have some indication, based on defensive alignment, of whether a stunt is coming, but can never truly know until the stunt declares. Focusing on optimal stunt protection technique therefore needs to be weighed against general technique, especially early in the play. Setting aside stunt protection and blitz pick-up, the most difficult situation a pass blocker can find himself in is being isolated one-on-one with a rusher. Conducting a study of true one-on-one matchups, similar to ours, would allow us to compare the most important factors on which success depends in each. As they already do, coaches can then weigh the relative importance of these factors when deciding how to build out their training framework.

8. Conclusion

In a general sense, we largely confirmed the wisdom of coaches regarding stunts in this study. After capturing 924 two-on-two stunts and studying the great majority of them, we concluded that penetrator depth, squareness and relative depth were more than just relevant factors. In measuring feature importance over a range of different model types we determined that "stopping the penetrator", "staying square" and "staying at the same level" were the most important factors in discriminating success from failure for pass protection against stunts. In addition, we identified that the protectors' ability to exchange rushers after overlap plays a role, and that this role potentially differs depending on the stunt type. The speed of individual protectors and the width of the inside protector also emerged as meaningful, albeit less impactful, factors. Studying how these features express in both observed wins and losses and "likely" wins and losses (decided according to model outputs) added context to how and when each feature impacts end success across the three stunt types.

We then discussed how the methods used in this study and its findings can be applied to organizational processes. Having an algorithm to identify and name stunt reps using the tracking data can help streamline the gameplan and self-scout processes. In the same vein, models that produce measures of expected success can be used to identify situations in which pass blockers' quality of technique does not align with their actual success rates. Once trends are identified, specific areas of weakness can be explored by evaluating recorded values of relevant features against benchmarks associated with high rates of rusher success. Additionally, collecting tracking data at practice presents the opportunity to monitor subtle differences in technique that may not be noticeable to the human eye. This has implications both for providing immediate feedback in those areas and for freeing up coaches to focus on aspects which current tracking technology cannot easily measure. Over a short time, enough data would accumulate for each player to allow comparison against individual precedents and modeling of the performance of individuals or small units. Finely tuned measures of expectation could then be developed to supplement the evaluation process for coaches and front office members.

Future research into stunt protection can go in several different directions. Many of the associations uncovered and open questions posed here can be tested by teams on the practice field; there exist few better options for determining causal links between aspects of player technique and their collective effect on performance. With a larger base of data one can explore higher-dimensional cases, keeping in mind that those where the numbers of rushers and protectors are equal will likely garner the most interest. Finally, a general investigation into the critical aspects of pass protection technique is warranted. That would provide a base for comparison with our findings and potentially influence the aspects coaches choose to focus on moving forward. Despite the narrow focus of our investigation, we contend that the methods used here can be applied to almost any aspect of the game to test prevailing wisdom, uncover relevant aspects not previously considered and improve upon existing evaluation processes. Each would act in service to the ultimate goals (ours and many others') of advancing knowledge of the game and improving the on-field product.

9. Acknowledgements

We would first like to thank Pat Taylor, truster of the process and offensive line coach at Kutztown University, for sharing his domain expertise for this project. Next, a big thank you to Josh Starmer of StatQuest (Bam!) for clearly explaining many of the modeling algorithms used here on his YouTube channel. In addition, the research of Yurko et al. (as presented in Going Deep and associated YouTube videos) provided direction and insight into best practices for our exploration. Finally, we want to recognize and cite the developers of LassoNet, which played a critical role in allowing us to measure feature importance within a neural net framework.

10. Code

Code can be found here.