#1
|
|||
|
|||
|
Regression NCAAF Stats onto the line
I mentioned earlier how differences in team philosophy skew the aggregation of line data. To account for this I created a z-score for each team with regard to rushing attempts and passing attempts, and set the extreme run-heavy and pass-heavy teams aside. I’ll discuss that approach later. With the remaining 892 teams in the sample of eight years of football, I continued the regression with one single framework.
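As a rough illustration of the z-score idea, here is a minimal Python sketch. The team names, the attempt counts, and the 1.5 cutoff are all made up for the example; the post does not say which cutoff was actually used.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return the z-score of each value relative to the whole list."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation
    return [(v - mu) / sigma for v in values]

# Hypothetical rushing attempts per game (illustrative numbers only).
rush_attempts = {
    "Team A": 58.0,
    "Team B": 38.0,
    "Team C": 36.0,
    "Team D": 34.0,
    "Team E": 34.0,
}

z = dict(zip(rush_attempts, z_scores(list(rush_attempts.values()))))

# Assumed cutoff: flag a team as run-heavy when its z-score exceeds 1.5.
run_heavy = [team for team, score in z.items() if score > 1.5]
```

Teams flagged this way would get their own regression instead of being pooled with everyone else.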
If you don't care about the logic leading into the regression, just scroll down until you see the regression numbers.

I decided to isolate the different statistical variables I have collected into three distinct groups. Isolate in essence, anyway. What distinguishes one group from another is the quality and value of the information each variable carries. The isolation was somewhat arbitrary. It's arbitrary in the sense that the boundaries between one group and another are a matter of preference, though arbitrary in this case certainly does not mean without reason.

I consider a variable to be any statistic that, after being exposed to the conditions in which it functions the most, persists long enough to serve as a unit of regression. Uniform color is not a variable. I should admit that turnovers and penalties, based purely on my judgment, I consider to be accumulations of a wide array of intangibles and randomness. They were also omitted for convenience. Additionally, I did not include special teams. When I first built the database I forgot about special teams, and when I later tried to include them it was clear there was little correlation, not enough for me to go through the trouble of modifying my database to fit them in. Though in a way, through an indirect stream of logic, yards per point is shaped by a team’s performance on special teams. The reasons are fairly obvious and I don’t feel it necessary to expand on that point here.

Moving on. Variables that are ‘indicative’, and in this instance the only such variable is points per game, conspicuously determine the fundamental concept of the line itself. Obviously the only factor that determines an ATS win or loss is the final score, and the point differential by which a team beats its opponents is reflected in that team’s line. This is simple enough, but I felt it essential to set ‘ppg’ as the base that the other variables coalesce around.
Now it would be too easy just to regress average line on ‘ppg’, and doing so gives a sound enough measure of the interval within which one might expect a team’s line to fall. But a single-variable regression can be refined by its co-adaptive performance enhancers. These are basically the other statistics that lead to the realization of a team’s ‘ppg’ differential. I call these variables ‘imperative’. ‘Imperative’ statistics provide a framework for the nature of a team’s viability. ‘Indicative’ statistics are susceptible to various degrees of randomness and luck. ‘Imperative’ variables are markers of team performance that correlate strongly with wins and losses, but without the luck factor. Efficiency measures of performance can be considered ‘imperative’.

A more glaring analogy would be WHIP and FIP in baseball, which are meant as measures of pitcher acuity devoid of luck and randomness. For example, Pitcher A faces a sequence of batters with the following results:

Single, Strikeout, Homerun, Groundout, Strikeout

Pitcher B produces similar results with a slight variation in order:

Single, Double Play, Homerun, Strikeout

These two scenarios result in the exact same WHIP, and Pitcher A even has more strikeouts over his sequence of batters. But Pitcher A gives up two runs on the homerun, because the single is still on base, while Pitcher B erased that runner with the double play and gives up only one. The one extra strikeout actually penalized Pitcher A compared to B, and therefore the ‘indicative’ performance was not an accurate reflection of the ‘imperative’ performance when the two pitchers are placed side by side.

Aside from this severe digression, the ‘imperative’ variables have high-quality information content and can be used to assess team viability. I assigned the following to the label ‘imperative’:

Yards, Rushing Yards per Attempt, Rushing Yards, Completion Percentage, Passing Yards per Attempt, Passing Yards, Yards per Play, Yards per Point, Plays.

This seems reasonable enough. So I’ve given authority to a number of variables.
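The pitcher comparison can be checked with a small Python sketch. The base-running rules are simplifying assumptions of mine, not from the post: a single puts a runner on first, runners score only on a homerun, and a double play removes one runner.

```python
def summarize(events):
    """Tally WHIP, runs allowed, and strikeouts for a sequence of batters.

    Simplified rules (assumed): a single adds one runner, a homerun scores
    the batter plus every runner, and a double play erases one runner.
    """
    hits = outs = runs = runners = 0
    for event in events:
        if event == "single":
            hits += 1
            runners += 1
        elif event == "homerun":
            hits += 1
            runs += runners + 1
            runners = 0
        elif event in ("strikeout", "groundout"):
            outs += 1
        elif event == "double play":
            outs += 2
            runners = max(runners - 1, 0)
    innings = outs / 3
    whip = hits / innings  # there are no walks in either sequence
    return {"whip": whip, "runs": runs, "strikeouts": events.count("strikeout")}

pitcher_a = summarize(["single", "strikeout", "homerun", "groundout", "strikeout"])
pitcher_b = summarize(["single", "double play", "homerun", "strikeout"])
```

Under these assumptions both pitchers post the same WHIP over one inning, while Pitcher A allows two runs to Pitcher B's one.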
Now take into consideration that each of the aforementioned statistics has three different levels: Offense, Defense, and Differential. This creates a very complex, multifaceted regression schematic.

The other statistics I call ‘sterile’. While ‘sterile’ statistics can give some indication of a team’s systematic gameplan, or philosophy, the numbers themselves have zero privilege over the nature of team viability. These include passing attempts, passing completions, and rushing attempts, amongst some others. (Even penalties I consider ‘sterile’. On the other hand, as has been said before, a high number of penalties could lead one to think that a particular team is very aggressive, e.g. Southern Cal.) Now I guess the argument can be made that a team with a high number of rushing attempts per game controls the clock, does not turn the ball over, and can shorten the game, but this argument is combated by the powerful mechanism known as correlation. I think I posted the correlation matrix in one of my earlier college football dirges, and the concepts were explained to capacity previously. If I didn't, feel free to inquire and I'll be happy to accommodate.

Three levels of each of the twelve variables makes for a combinatorial spectrum that exceeds the scope of human comprehension. To decode every possible permutation by hand is virtually unattainable. That is where the sorting into different groups has its advantage. Even with the distinction into desirable groups, the number of permutations of possible elements is extremely large, and it would still be exhausting to decode each scenario. Therefore, with a little luck, I had to find the ‘zone of attraction’, or where certain ‘imperative’ values betray a higher sense of determination. I won’t expound further on the process, and will let the wonders of Stata convey the results through its ingenious data processing system.

Code:
      Source |      SS        df      MS              Number of obs =     892
-------------+------------------------------          F(  9,   882) =  600.03
       Model |  64934.9358     9  7214.99287          Prob > F      =  0.0000
    Residual |  10605.4221   882  12.0242881          R-squared     =  0.8596
-------------+------------------------------          Adj R-squared =  0.8582
       Total |  75540.3579   891  84.7815465          Root MSE      =  3.4676
------------------------------------------------------------------------------
     avgline |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ppg |  -.2327069   .0268559    -8.67   0.000    -.2854158    -.179998
        ydsd |    .035118    .005302     6.62   0.000     .0247119    .0455241
        yppo |    -4.1972   .2556919   -16.42   0.000    -4.699036   -3.695365
       rypad |   1.016908   .2712062     3.75   0.000     .4846229    1.549193
       pypad |   .5005141   .2467422     2.03   0.043     .0162437    .9847845
       yppto |   .4438111   .0748342     5.93   0.000     .2969371     .590685
       ypptd |  -.4937034   .0727098    -6.79   0.000    -.6364077    -.350999
        pctd |   6.502121   3.446404     1.89   0.060    -.2619893    13.26623
      playso |  -.1727749   .0300915    -5.74   0.000    -.2318342   -.1137157
       _cons |   11.73068   3.509905     3.34   0.001     4.841943    18.61943
------------------------------------------------------------------------------
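To make the table concrete, here is a minimal Python sketch that plugs one team's stat line into the fitted coefficients. The coefficients are copied from the table above. The sample team and the units are assumptions for illustration only, in particular that ‘pctd’ is a fraction rather than a percentage and that ‘ppg’ is a per-game differential.

```python
# Coefficients copied from the full-sample regression table (892 team-seasons).
COEFS = {
    "ppg": -0.2327069,
    "ydsd": 0.035118,
    "yppo": -4.1972,
    "rypad": 1.016908,
    "pypad": 0.5005141,
    "yppto": 0.4438111,
    "ypptd": -0.4937034,
    "pctd": 6.502121,
    "playso": -0.1727749,
}
CONS = 11.73068

def predicted_line(stats):
    """Return the regressed average line for one team-season."""
    return CONS + sum(coef * stats[name] for name, coef in COEFS.items())

# Hypothetical stat line; every value here is invented for the example.
team = {
    "ppg": 7.0,      # points per game differential (assumed meaning)
    "ydsd": 340.0,   # yards allowed per game
    "yppo": 5.8,     # yards per play, offense
    "rypad": 3.9,    # rushing yards per attempt allowed
    "pypad": 6.6,    # passing yards per attempt allowed
    "yppto": 14.0,   # yards per point, offense
    "ypptd": 17.0,   # yards per point, defense
    "pctd": 0.57,    # completion percentage allowed, as a fraction (assumed)
    "playso": 70.0,  # offensive plays per game
}

line = predicted_line(team)
```

A negative result means the team projects as a favorite on average, a positive one as an underdog.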
‘ydsd’ – yards allowed
‘yppo’ – yards per play offense
‘rypad’ – rushing yards per attempt defense
‘pypad’ – passing yards per attempt defense
‘yppto’ – yards per point offense
‘ypptd’ – yards per point defense
‘pctd’ – completion percentage defense
‘playso’ – plays offense
‘_cons’ – constant

A brief survey of the table shows the results are very encouraging. The particular combination of variables, as I said, comes from the ‘zone of attraction’ method. Some ‘imperative’ variables offered a greater sense of co-adaptation and mutual assistance with the ‘indicative’ points per game. The brilliance of the Stata program lies in the quickness and efficiency with which different regressions can be run. For those fortunate enough to have Stata on their computer, I provided the file below so you can manipulate the data however you please.

Now how do we decipher these coefficients? Let’s look at points per game. The coefficient is negative, so an increase in points per game differential lowers the line (makes the team a bigger favorite) and a decrease raises it, since we are dealing with a scale of negative for a favorite to positive for an underdog. For each variable, a one-unit increase or decrease moves the average line, the dependent variable, by the value of that variable's coefficient, holding the other variables fixed. The model demonstrates by its coefficients that the results are governed by reason, as well as by the optimal combination of independent variables that creates a number closely resembling the dependent variable.

What I wanted to accomplish was to produce a new line whose descriptive statistics strongly resemble those of the average line. I think the resulting product is highly encouraging, not only because of the regression results but because of the value of the content as well. Here are the descriptive statistics comparing the average line and the regressed line.

Code:
    Variable |       Obs        Mean   Std. Dev.        Min        Max
-------------+--------------------------------------------------------
     newline |       892    .4234546    8.536902  -24.99476   28.78546
     avgline |       892    .4234641     9.20769     -24.69      29.29
Code:
Mean estimation                        Number of obs =     892
--------------------------------------------------------------
             |       Mean  Std. Err.      [95% Conf. Interval]
-------------+------------------------------------------------
    line_mad |   2.746412    .069846      2.609331    2.883494
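The post does not define ‘line_mad’. Assuming it is the absolute difference between the regressed line and the average line for each team-season, its mean and standard error could be computed along these lines in Python. The four data points are invented for the example.

```python
from math import sqrt
from statistics import mean, stdev

def mean_abs_deviation(predicted, actual):
    """Return (mean, standard error) of the absolute prediction errors."""
    diffs = [abs(p - a) for p, a in zip(predicted, actual)]
    avg = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))  # sample std dev over sqrt(n)
    return avg, std_err

# Hypothetical regressed and average lines for four team-seasons.
newline = [-10.5, 3.0, 7.5, -1.0]
avgline = [-12.0, 5.0, 7.0, 2.0]

mad, se = mean_abs_deviation(newline, avgline)
```

Read that way, the table above would say the regressed line misses a team's average line by about 2.75 points on average.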
Code:
Average Line = (-.2327069*ppg) + (.035118*ydsd) + (-4.1972*yppo) + (1.016908*rypad) + (.5005141*pypad) + (.4438111*yppto) + (-.4937034*ypptd) + (6.502121*pctd) + (-.1727749*playso) + 11.73068

Just as a frame of reference, here are the z-scores for Air Force for each season from 2002-2009 (‘ra’ – rushing attempts, ‘att’ – passing attempts):

Then, using similar methods from above, finding a ‘zone of attraction’ that relies upon logic, luck, and the content of the variable as it relates to the aggregation of teams, this is the equation for predominantly running offenses (Air Force, Navy, etc.):

Code:
      Source |      SS        df      MS              Number of obs =      34
-------------+------------------------------          F(  4,    29) =   38.41
       Model |  1406.92746     4  351.731865          Prob > F      =  0.0000
    Residual |  265.570477    29  9.15760265          R-squared     =  0.8412
-------------+------------------------------          Adj R-squared =  0.8193
       Total |  1672.49794    33  50.6817557          Root MSE      =  3.0262
------------------------------------------------------------------------------
     avgline |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ppg |  -.5942952    .117771    -5.05   0.000     -.835164   -.3534264
         ryd |   .0578676   .0316136     1.83   0.077    -.0067894    .1225246
        rypa |   2.399145   1.128734     2.13   0.042     .0906247    4.707666
        ydso |  -.0430914   .0162273    -2.66   0.013    -.0762799   -.0099029
       _cons |   7.322725   5.051622     1.45   0.158    -3.009002    17.65445
------------------------------------------------------------------------------
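With only 34 observations it is worth seeing how much the adjusted R-squared gives back. This Python sketch recomputes the fit statistics from the sums of squares printed in the two tables above. The formulas are the standard ones for OLS with a constant; the function name is my own.

```python
from math import sqrt

def fit_summary(ss_model, ss_resid, n_obs, n_predictors):
    """Recompute R-squared, adjusted R-squared, F, and root MSE for OLS with a constant."""
    df_resid = n_obs - n_predictors - 1
    ss_total = ss_model + ss_resid
    r2 = ss_model / ss_total
    adj_r2 = 1 - (1 - r2) * (n_obs - 1) / df_resid
    ms_resid = ss_resid / df_resid
    f_stat = (ss_model / n_predictors) / ms_resid
    return {"r2": r2, "adj_r2": adj_r2, "f": f_stat, "rmse": sqrt(ms_resid)}

# Sums of squares taken from the Stata output above.
full_sample = fit_summary(64934.9358, 10605.4221, n_obs=892, n_predictors=9)
run_heavy = fit_summary(1406.92746, 265.570477, n_obs=34, n_predictors=4)

# Penalty paid for the number of predictors relative to the sample size.
full_penalty = full_sample["r2"] - full_sample["adj_r2"]
run_penalty = run_heavy["r2"] - run_heavy["adj_r2"]
```

The adjustment costs the 34-team model far more than the 892-team model, which is the usual price of fitting several predictors to a small sample.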
Fortunately, the model for teams with a high z-score in passing attempts lends itself more readily to a thoughtful and sensible train of thought.

Code:
      Source |      SS        df      MS              Number of obs =      23
-------------+------------------------------          F(  6,    17) =   54.49
       Model |  1612.32956     6  268.721594          Prob > F      =  0.0000
    Residual |  83.8426096    17  4.93191821          R-squared     =  0.9506
-------------+------------------------------          Adj R-squared =  0.9331
       Total |  1696.17217    23  73.7466163          Root MSE      =  2.2208
------------------------------------------------------------------------------
     avgline |      Coef.  Std. Err.        t   P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        ppgo |  -.3231369   .2326192    -1.39   0.183    -.8139204    .1676467
        yppd |   7.833663    1.62701     4.81   0.000     4.400972    11.26635
        yppo |  -1.753175   1.266587    -1.38   0.184    -4.425439     .919089
        ydsd |  -.0242419   .0198519    -1.22   0.239    -.0661258    .0176419
         pyd |  -.0657155   .0261637    -2.51   0.022    -.1209162   -.0105149
        yppt |   1.137832   .4093928     2.78   0.013     .2740884    2.001575
------------------------------------------------------------------------------
Code:
Average Line = (-.3231369*ppgo) + (7.833663*yppd) + (-1.753175*yppo) + (-.0242419*ydsd) + (-.0657155*pyd) + (1.137832*yppt)

At length I’ll try to put this new equation into practice by selecting a few games at random over the last eight years and creating a line for each matchup based on the difference between the two teams' lines. Additionally I plan on regressing some mixture of pythagorean, line pythagorean, and linear line-to-win prediction onto year n+1 wins. Maybe I'll do something similar with the NFL.

Stata file: NCAA Football 2002-2009.dta
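The matchup idea can be sketched in a few lines of Python. The post does not spell out the method, so this assumes the simplest reading: the matchup line is one team's regressed average line minus the other's. The ‘home_edge’ parameter is a hypothetical add-on of mine, not something from the post.

```python
def matchup_line(line_a, line_b, home_edge=0.0):
    """Line for team A against team B, built from each team's regressed average line.

    Negative means A is favored. home_edge is a hypothetical adjustment in
    points, subtracted when A is the home team.
    """
    return line_a - line_b - home_edge

# Hypothetical inputs: A averages a 10-point favorite, B a 4-point underdog.
neutral = matchup_line(-10.0, 4.0)
a_at_home = matchup_line(-10.0, 4.0, home_edge=3.0)
```

On a neutral field this makes A a 14-point favorite. Whether a plain difference holds up against real closing lines is exactly what the random sample of games would test.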
__________________
"Nobody goes there anymore, its too crowded." --Yogi Berra "Always tell the truth, that way you won't have to remember what you said." --Mark Twain *=$50,000 |
|
#2
|
|||
|
|||
|
Great stuff as always. It's a shame that more people don't appreciate it here. Question for you uva:
Did you come across drastically different findings in the correlation between time of possession and winning in the NFL vs. NCAAF? |
|
#3
|
|||
|
|||
|
We appreciate it, it's just that besides you and him no one understands it.
|
|
#4
|
|||
|
|||
|
Quote:
I would think for the NFL, since virtually every single statistic is normally distributed (i.e. teams fluctuate between 250-350 yds, 50-70 plays) because of the parity (or at least statistical parity), that NFL TOP would be relatively consistent from team 1 through 32, so it would be hard to draw any conclusions concerning advantages and disadvantages.

For NCAAF I included plays as an imperative variable, but didn't find that it was a big enough factor over the league as a whole to include as an independent variable to regress onto the line.

So I guess my answer is inconclusive, or I have no idea.
|
#5
|
|||
|
|||
|
Yeah I guess appreciate wasn't the right word because I know you and others do. Engaged is probably better. He puts a shit load of effort into them and doesn't get much of a response.
|
|
#6
|
|||
|
|||
|
I'm just not smart enough to get as in depth as both of you. I'd love to build a model for starters, just have no clue how to lol.
Post this stuff and I'll skim thru it and try to get a gist of it. Post plays and I'll tail, but you modelers and technicappers are stingy |
|
#7
|
|||
|
|||
|
modelers don't post plays
|
#8
|
|||
|
|||
|
lol I know it's why forumville says modelers don't make money, but i know Tim does.
W/E doesn't matter to me, it's your info do whatever you want with it |
|
#9
|
|||
|
|||
|
Haha Seanie is good people. But uva's right - modeling doesn't lend itself to posting plays because of the inherent long-term approach. People like to ride "hot" cappers (as if there is such a thing as getting hot at picking independent events), not follow along for an entire season trying to hit numbers at openers and/or peak value. But I'll see if I can't throw out some plays from time to time. You know where I hang out. Maybe uva will do the same.
|
|
#10
|
|||
|
|||
|
in reality this sub forum has changed UVA. he used to text me for plays and now can't hook a brother up.
|
|
#11
|
|||
|
|||
|
Quote:
Maybe every week, for me, in football, you could post just what you're producing that the line should be. I have problems creating my own line and don't trust it at all |
|
#12
|
|||
|
|||
|
Sure, I'll see what I can do. I know they don't have PMs here but you know where to get in touch with me.
Another reason modelers don't post plays is because guys like those long writeups with all the silly angles. Saying I'm betting Atl -138 because my model made the line -152 doesn't appeal to the general population. Last edited by IrishTim; 08-05-2010 at 12:38 AM. |
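For anyone wondering what "betting Atl -138 because my model made the line -152" works out to, here is a minimal Python sketch that converts American moneylines to implied win probabilities. It ignores the bookmaker's vig, which is a simplification.

```python
def implied_probability(moneyline):
    """Convert an American moneyline to its implied win probability (no vig removed)."""
    if moneyline < 0:
        return -moneyline / (-moneyline + 100)
    return 100 / (moneyline + 100)

market = implied_probability(-138)  # price being offered
model = implied_probability(-152)   # price the model considers fair
edge = model - market               # positive means the model sees value
```

The market price implies roughly a 58% chance and the model price roughly 60%, so the play is a bet on a gap of about two percentage points.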
|
#13
|
|||
|
|||
|
Quote:
lol it appeals to me. it's guys that have "feelings" and post a trend from 9 years ago that doesn't appeal to me. I wish I could just get something like Reagan has. Baseball is my worst sport by far |
|
#14
|
|||
|
|||
|
Tim you using SO or did you spring for Don Best?
|
|
#15
|
|||
|
|||
|
i've been posting ways to make your own line in nba and mlb all year, even posting ways of gathering information with one click
nba playoff model worked wonders :D