
Capping All handicapping, betting systems, spreadsheets, mathematics & quantitative technicapping.

  #1  
Old 08-04-2010, 10:40 PM
Hall of Fame
 
Join Date: Oct 2005
Location: Salem, VA
Posts: 22,450
Rewards: 1,800
Regression NCAAF Stats onto the line

I mentioned earlier how the difference in team philosophies skews the aggregation of line data, and to account for this I created a z-score for each team with regard to rushing attempts and passing attempts. I’ll discuss that approach later. With the remaining 892 teams in the sample of eight years of football, I continued the regression with a single framework.

If you don't care about the logic leading into the regression, just scroll down until you see the regression numbers.


I decided to isolate the different statistical variables I have collected into three distinct groups. Isolate in essence, anyway. The factors that distinguish one variable from another are quality and value content. The isolation was somewhat arbitrary. It’s arbitrary in the sense that the delimiters between one group and another are a matter of preference, though arbitrary in this case certainly does not mean without reason.

I consider a variable to be any statistic that, after being exposed to the conditions under which it functions the most, persists long enough to serve as a unit of regression. Uniform color is not a variable. I should admit that turnovers and penalties, based purely on my judgment, I consider to be accumulations of a wide array of intangibles and randomness. They were also omitted for convenience. Additionally, I did not include special teams: when I first built the database I forgot about them, and when I later attempted to include them it was clear there was little correlation, not enough for me to go through the trouble of modifying my database to fit them in. Though in a way, through an indirect stream of logic, yards per point is shaped by a team’s performance on special teams. The reasons are fairly obvious and I don’t feel it necessary to expand on that point here.


Moving on, variables that are ‘indicative’, and in this instance the only such variable is points per game, conspicuously determine the fundamental concept of the line itself. Obviously the only factor that determines an ATS win or loss is the final score, and the point differential by which a team beats its opponents is reflected in that team’s line. This is simple enough, but I felt it essential to place the variable ‘ppg’ at the center around which the other variables coalesce.

Now it would be too easy just to regress average line on ‘ppg’, and doing so results in a sound enough estimate of the interval within which one might expect a team’s line to fall. But a single-variable regression can ultimately be refined by its co-adaptive performance enhancers. These are basically the other statistics that can lead to the realization of a team’s ‘ppg’ differential.


I call these variables ‘imperative’. ‘Imperative’ statistics provide a framework for the nature of a team’s viability. ‘Indicative’ statistics are susceptible to various degrees of randomness and luck. ‘Imperative’ variables are markers of team performance that correlate strongly with wins and losses, but without the luck factor. Efficiency measures of performance can be considered ‘imperative’. A closer analogy would be WHIP and FIP in baseball, which are measures of pitcher acuity meant to strip out luck and randomness. For example, Pitcher A has a sequence of batters with the following results:


Single, Strikeout, Homerun, Groundout, Strikeout

Pitcher B produces similar results with a slight variation in order:


Single, Double Play, Homerun, Strikeout

These two scenarios result in the exact same WHIP, and Pitcher A even has the higher strikeout rate over his sequence of batters, yet the sequencing penalizes Pitcher A: his home run comes with a runner still on base and scores two runs, while Pitcher B’s double play erased the runner first, so his home run scores only one. The ‘indicative’ performance is therefore not an accurate reflection of the ‘imperative’ performance when the two pitchers are placed side by side.
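The two sequences can be tallied in a few lines of Python. This is only a sketch under simplifying assumptions that are mine, not the post's: no walks, the double play removes the runner who singled, and every runner on base scores on a home run. The function name `summarize` is made up for the illustration.

```python
def summarize(events):
    """Return (WHIP, strikeouts, runs) for a list of plate-appearance results.

    Simplified base-running: a single puts one runner on, a home run scores
    the batter plus everyone on base, a double play erases one runner.
    """
    hits = walks = outs = strikeouts = runs = runners = 0
    for event in events:
        if event == "Single":
            hits += 1
            runners += 1
        elif event == "Homerun":
            hits += 1
            runs += runners + 1
            runners = 0
        elif event == "Strikeout":
            strikeouts += 1
            outs += 1
        elif event == "Groundout":
            outs += 1
        elif event == "Double Play":
            outs += 2
            runners -= 1
        else:
            raise ValueError(f"unknown event: {event}")
    innings = outs / 3
    whip = (walks + hits) / innings
    return whip, strikeouts, runs


pitcher_a = ["Single", "Strikeout", "Homerun", "Groundout", "Strikeout"]
pitcher_b = ["Single", "Double Play", "Homerun", "Strikeout"]

print(summarize(pitcher_a))  # same WHIP as B, more strikeouts, more runs
print(summarize(pitcher_b))
```

Both pitchers record three outs on two hits, so the WHIP is identical, while the run totals differ purely because of the order of events.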


Setting aside this lengthy digression, the ‘imperative’ variables have high-quality information content and can be used to assess team viability. I assigned the following to the label ‘imperative’:


Yards, Rushing Yards per Attempt, Rushing yards, Completion Percentage, Passing Yards per Attempt, Passing yards, Yards per play, Yards per point, Plays.


This seems reasonable enough. So I’ve given authority to a number of variables. Now take into consideration that each of the aforementioned statistics has three different levels: Offense, Defense, and Differential. This creates a very complex, multifaceted regression scheme.


The other statistics I call ‘sterile’. While ‘sterile’ statistics can give some indication of a team’s systematic gameplan, or philosophy, the numbers themselves have no bearing on team viability. These include passing attempts, passing completions, and rushing attempts, among others. (I consider even penalties ‘sterile’. On the other hand, as has been said before, a high number of penalties could lead one to think that a particular team is very aggressive, e.g. Southern Cal.) Now I guess the argument can be made that a team with a high number of rushing attempts per game controls the clock, does not turn the ball over, and can shorten the game, but this argument is countered by the powerful mechanism known as correlation. I think I posted the correlation matrix in one of my earlier college football dirges, and the concepts were explained at length previously. If I didn't, feel free to ask and I'll be happy to accommodate.


Three levels of each of the twelve variables makes for a combinatorial space that exceeds the scope of human comprehension. To decode every possible permutation by hand is virtually impossible. That is where the serrying of variables into different groups has its advantage. Even with the division into groups, the number of possible combinations is extremely large, and it would still be exhausting to decode each scenario. Therefore, with a little luck, I had to find the ‘zone of attraction’, where certain ‘imperative’ values show a higher degree of determination.
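To put a rough number on that space, here is a small Python sketch. It takes the post's count of twelve variables at three levels at face value, and it assumes (my assumption) that a "scenario" means a choice of which candidate regressors to include; the nine-term case is included only because the league-wide model further down uses nine regressors.

```python
import math

n_variables = 12  # count stated in the post
n_levels = 3      # offense, defense, differential
candidates = n_variables * n_levels  # 36 candidate regressors

# Every non-empty subset of candidates is a possible model specification.
all_specifications = 2 ** candidates - 1

# Specifications that use exactly nine regressors.
nine_term_specifications = math.comb(candidates, 9)

print(f"{candidates} candidates")
print(f"{all_specifications:,} non-empty subsets")
print(f"{nine_term_specifications:,} nine-variable subsets")
```

That is on the order of tens of billions of subsets overall and tens of millions with exactly nine terms, which is why some grouping or search heuristic is needed.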



I won’t expound further on the process, and will let the wonders of Stata convey the results through its ingenious data processing system.

Code:
      Source |       SS       df       MS              Number of obs =     892
-------------+------------------------------           F(  9,   882) =  600.03
       Model |  64934.9358     9  7214.99287           Prob > F      =  0.0000
    Residual |  10605.4221   882  12.0242881           R-squared     =  0.8596
-------------+------------------------------           Adj R-squared =  0.8582
       Total |  75540.3579   891  84.7815465           Root MSE      =  3.4676

------------------------------------------------------------------------------
     avgline |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ppg |  -.2327069   .0268559    -8.67   0.000    -.2854158    -.179998
        ydsd |    .035118    .005302     6.62   0.000     .0247119    .0455241
        yppo |    -4.1972   .2556919   -16.42   0.000    -4.699036   -3.695365
       rypad |   1.016908   .2712062     3.75   0.000     .4846229    1.549193
       pypad |   .5005141   .2467422     2.03   0.043     .0162437    .9847845
       yppto |   .4438111   .0748342     5.93   0.000     .2969371     .590685
       ypptd |  -.4937034   .0727098    -6.79   0.000    -.6364077    -.350999
        pctd |   6.502121   3.446404     1.89   0.060    -.2619893    13.26623
      playso |  -.1727749   .0300915    -5.74   0.000    -.2318342   -.1137157
       _cons |   11.73068   3.509905     3.34   0.001     4.841943    18.61943
------------------------------------------------------------------------------
‘ppg’ – points per game
‘ydsd’ – yards allowed
‘yppo’ – yards per play offense
‘rypad’ – rushing yards per attempt defense
‘pypad’ – passing yards per attempt defense
‘yppto’ – yards per point offense
‘ypptd’ – yards per point defense
‘pctd’ – completion percentage defense
‘playso’ – plays offense
‘_cons’ – constant


A brief survey of the table shows the results are very encouraging. The particular combination of variables, as I said, follows from the ‘zone of attraction’ method. Some ‘imperative’ variables offered a greater sense of co-adaptation and mutual assistance with the ‘indicative’ points per game. The brilliance of the Stata program lies in the quickness and efficiency with which different regressions can be run. For those fortunate enough to have Stata on their computer, I provided the file below so you can manipulate the data however you please.


Now how do we decipher these coefficients? Let’s look at points per game. It seems logical that a decrease in points per game differential would weaken a team’s line, which here means increasing it, since we are dealing with a scale that runs from negative for a favorite to positive for an underdog. The negative coefficient on ‘ppg’ says exactly that. The average line, the dependent variable, changes by the value of each coefficient for every one-unit change in the corresponding variable, holding the others fixed. And the coefficients show that the results are governed by reason as well as by the optimal combination of independent variables, which produces a number that closely resembles the dependent variable.


What I wanted to accomplish was to produce a new line whose descriptive statistics strongly resemble the descriptive statistics of the average line. I think the resulting product is highly encouraging, judged not only by the regression results but by the value of the content as well. Here are the descriptive statistics comparing the average line and the regressed line.

Code:
    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
     newline |       892    .4234546    8.536902  -24.99476   28.78546
     avgline |       892    .4234641     9.20769     -24.69      29.29
The mean, standard deviation, and range are almost identical, with only the slightest discrepancies. Here is the mean absolute difference between the actual line and the new line.

Code:
    Mean estimation                     Number of obs    =     892

--------------------------------------------------------------
             |       Mean   Std. Err.     [95% Conf. Interval]
-------------+------------------------------------------------
    line_mad |   2.746412    .069846      2.609331    2.883494
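For readers without Stata, a figure like ‘line_mad’ can be sketched in Python as the average of the absolute gaps between each team's actual and regressed line. The function name and the three sample values below are invented for illustration and are not taken from the data set.

```python
def mean_absolute_difference(actual, predicted):
    """Average of |actual - predicted| over paired observations."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Made-up lines for three hypothetical teams.
avgline = [-7.0, 3.5, 0.0]
newline = [-5.0, 3.0, 1.5]

print(mean_absolute_difference(avgline, newline))
```

Run over all 892 rows of ‘avgline’ and ‘newline’, the same calculation would be expected to reproduce the mean reported in the table.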
Using the coefficients above, this is the equation to formulate an accurate estimation of a team’s average line:

Code:
Average Line = (-.2327069*ppg) + (.035118*ydsd) + (-4.1972*yppo) + (1.016908*rypad) + (.5005141*pypad) + (.4438111*yppto) + (-.4937034*ypptd) + (6.502121*pctd) + (-.1727749*playso) + 11.73068
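The same equation can be written as a small Python function, with the coefficients copied from the 892-team regression table. The team below is hypothetical: its values are invented, and treating ‘pctd’ as a fraction rather than a percentage is my assumption about how the data is scaled.

```python
# Coefficients from the 892-team regression table.
COEFFICIENTS = {
    "ppg": -0.2327069,
    "ydsd": 0.035118,
    "yppo": -4.1972,
    "rypad": 1.016908,
    "pypad": 0.5005141,
    "yppto": 0.4438111,
    "ypptd": -0.4937034,
    "pctd": 6.502121,
    "playso": -0.1727749,
}
CONSTANT = 11.73068


def predicted_line(stats):
    """Constant plus coefficient times value, summed over all nine variables."""
    return CONSTANT + sum(COEFFICIENTS[name] * stats[name] for name in COEFFICIENTS)


# Invented season averages for a hypothetical team (illustration only).
example_team = {
    "ppg": 7.0, "ydsd": 350.0, "yppo": 5.8, "rypad": 3.9, "pypad": 6.5,
    "yppto": 14.0, "ypptd": 16.0, "pctd": 0.58, "playso": 70.0,
}

base = predicted_line(example_team)
bumped = predicted_line({**example_team, "ppg": example_team["ppg"] + 1})

print(round(base, 2))
print(round(bumped - base, 4))  # effect of one extra point of 'ppg'
```

Raising ‘ppg’ by one while holding everything else fixed moves the predicted line by exactly the ‘ppg’ coefficient, which is the one-unit interpretation of a regression coefficient.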
The remaining 57 teams have been left untouched by the previous regression. How did I isolate these 57 teams? I found the z-score of each team’s offensive passing attempts and rushing attempts and then separated those with a z-score greater than two on either measure. Two is again rather arbitrary, though it should be said that passing attempts resemble a normal distribution as the sample size increases, which makes a z-score of two a workable marker. Rushing attempts are more asymmetric, though perhaps as the sample approaches infinity the central limit theorem applies. Over time the league changes as a whole and the replication of ideas oscillates from one extreme to another, therefore an asymptomatic system is probably the average. Regardless, I used two as the line of demarcation between a “typical” and an “atypical” gameplan. To find the z-score, divide the difference between the value xi and the population mean by the population standard deviation: z = (xi - mean) / sd.
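The z-score split described above can be sketched in Python as follows. The function names and the sample numbers are mine; the only things taken from the post are the population z-score formula and the cutoff of two on either measure.

```python
import statistics


def z_scores(values):
    """Population z-score for each value: (x - mean) / population sd."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def is_atypical(z_pass_attempts, z_rush_attempts, cutoff=2.0):
    """Flag a team whose passing or rushing attempts sit above the cutoff."""
    return z_pass_attempts > cutoff or z_rush_attempts > cutoff


# Toy numbers, not real attempt counts.
sample = [2, 4, 4, 4, 5, 5, 7, 9]
print(z_scores(sample))
print(is_atypical(0.3, 2.6))  # heavy rushing team
print(is_atypical(0.3, 1.1))  # typical team
```

In practice the scores would be computed separately for ‘att’ and ‘ra’ across all team-seasons, and the flagged rows set aside for their own regressions.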





Just as a frame of reference, here are the z-scores for Air Force for each season from 2002-2009 (‘ra’ – rushing attempts, ‘att’ – passing attempts):



Then, using similar methods from above, finding a ‘zone of attraction’ which relies upon logic, luck, and the content of the variable, related to the aggregation of teams, this is the equation for predominantly running offenses (Air Force, Navy, etc…):

Code:
      Source |       SS       df       MS              Number of obs =      34
-------------+------------------------------           F(  4,    29) =   38.41
       Model |  1406.92746     4  351.731865           Prob > F      =  0.0000
    Residual |  265.570477    29  9.15760265           R-squared     =  0.8412
-------------+------------------------------           Adj R-squared =  0.8193
       Total |  1672.49794    33  50.6817557           Root MSE      =  3.0262

------------------------------------------------------------------------------
     avgline |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         ppg |  -.5942952    .117771    -5.05   0.000     -.835164   -.3534264
         ryd |   .0578676   .0316136     1.83   0.077    -.0067894    .1225246
        rypa |   2.399145   1.128734     2.13   0.042     .0906247    4.707666
        ydso |  -.0430914   .0162273    -2.66   0.013    -.0762799   -.0099029
       _cons |   7.322725   5.051622     1.45   0.158    -3.009002    17.65445
------------------------------------------------------------------------------
It’s not necessary for me to point out the massive flaws in logic here. The system appears to break down with teams inclined to rush the football more often than not. It’s counter-intuitive to think that a high rushing yards per attempt differential can have an inverse effect on the line (the ‘rypa’ coefficient is positive, which pushes the line toward the underdog side), certainly with a running team.


Fortunately, the model for teams with a high z-score in passing attempts follows a more thoughtful and sensible train of thought.

Code:
      Source |       SS       df       MS              Number of obs =      23
-------------+------------------------------           F(  6,    17) =   54.49
       Model |  1612.32956     6  268.721594           Prob > F      =  0.0000
    Residual |  83.8426096    17  4.93191821           R-squared     =  0.9506
-------------+------------------------------           Adj R-squared =  0.9331
       Total |  1696.17217    23  73.7466163           Root MSE      =  2.2208

------------------------------------------------------------------------------
     avgline |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        ppgo |  -.3231369   .2326192    -1.39   0.183    -.8139204    .1676467
        yppd |   7.833663    1.62701     4.81   0.000     4.400972    11.26635
        yppo |  -1.753175   1.266587    -1.38   0.184    -4.425439     .919089
        ydsd |  -.0242419   .0198519    -1.22   0.239    -.0661258    .0176419
         pyd |  -.0657155   .0261637    -2.51   0.022    -.1209162   -.0105149
        yppt |   1.137832   .4093928     2.78   0.013     .2740884    2.001575
------------------------------------------------------------------------------
Code:
Average Line = (-.3231369*ppgo) + (7.833663*yppd) + (-1.753175*yppo) + (-.0242419*ydsd) + (-.0657155*pyd) + (1.137832*yppt)
The previous two models, based on the z-scores of passing and running schemes, are inherently limited by the sample size, or lack thereof. Nevertheless, discriminating the data by z-score yields a more precise equation than fitting the league as a whole. For teams with a high z-score, a re-configured regression produces a new line that is closer to the actual average line than the base framework applied to the 892 teams whose gameplan is more or less in line with the prevailing wisdom of the eight-year sample.


Eventually I’ll try to put this new equation into practice by selecting a few games at random over the last eight years and creating a line based on the difference between the two teams’ lines in a given matchup. Additionally, I plan on regressing some mixture of pythagorean, line pythagorean, and linear line-to-win prediction against year n+1 wins. Maybe I’ll do something similar with the NFL.


Stata file:


NCAA Football 2002-2009.dta
__________________
"Nobody goes there anymore, its too crowded." --Yogi Berra

"Always tell the truth, that way you won't have to remember what you said." --Mark Twain


*=$50,000
  #2  
Old 08-05-2010, 12:04 AM
Registered User
 
Join Date: Apr 2010
Posts: 111
Rewards: 171
Great stuff as always. It's a shame that more people don't appreciate it here. Question for you uva:

Did you come across drastically different findings in the correlation between time of possession and winning in the NFL vs. NCAAF?
  #3  
Old 08-05-2010, 12:05 AM
Hall Of Fame '11
 
Join Date: Aug 2004
Location: Philadelphia
Posts: 35,968
Rewards: 475
we appreciate it's just that besides you and him no one understands it
  #4  
Old 08-05-2010, 12:17 AM
Hall of Fame
 
Join Date: Oct 2005
Location: Salem, VA
Posts: 22,450
Rewards: 1,800
Quote:
Originally Posted by IrishTim View Post
Great stuff as always. It's a shame that more people don't appreciate it here. Question for you uva:

Did you come across drastically different findings in the correlation between time of possession and winning in the NFL vs. NCAAF?
i'm working on NFL right now

I would think for NFL since virtually every single statistic is normally distributed (i.e. teams fluctuate between 250-350 yds, 50-70 plays) because of the parity (or at least statistical parity), that NFL TOP would be relatively consistent from team 1 through 32, so it would be hard to draw any conclusions concerning advantages and disadvantages

for NCAAF I included plays as an imperative variable, but didn't find that it was a big enough factor over the league as a whole to include as an independent variable to regress onto the line

so i guess my answer is inconclusive, or I have no idea
  #5  
Old 08-05-2010, 12:18 AM
Registered User
 
Join Date: Apr 2010
Posts: 111
Rewards: 171
Quote:
Originally Posted by Seanie Mac View Post
we appreciate it's just that besides you and him no one understands it
Yeah I guess appreciate wasn't the right word because I know you and others do. Engaged is probably better. He puts a shit load of effort into them and doesn't get much of a response.
  #6  
Old 08-05-2010, 12:21 AM
Hall Of Fame '11
 
Join Date: Aug 2004
Location: Philadelphia
Posts: 35,968
Rewards: 475
I'm just not smart enough to get as in depth as both of you. I'd love to build a model for starters, just have no clue how to lol.

Post this stuff and I'll skim thru it and try to get a gist of it

Post plays and I'll tail <tup> but you modelers and technicappers are stingy
  #7  
Old 08-05-2010, 12:24 AM
Hall of Fame
 
Join Date: Oct 2005
Location: Salem, VA
Posts: 22,450
Rewards: 1,800
modelers don't post plays
  #8  
Old 08-05-2010, 12:27 AM
Hall Of Fame '11
 
Join Date: Aug 2004
Location: Philadelphia
Posts: 35,968
Rewards: 475
lol I know it's why forumville says modelers don't make money, but i know Tim does.

W/E doesn't matter to me, it's your info do whatever you want with it
  #9  
Old 08-05-2010, 12:30 AM
Registered User
 
Join Date: Apr 2010
Posts: 111
Rewards: 171
Haha Seanie is good people. But uva's right - modeling doesn't lend itself to posting plays because of the inherent long-term approach. People like to ride "hot" cappers (as if there is such a thing as getting hot at picking independent events), not follow along for an entire season trying to hit numbers at openers and/or peak value. But I'll see if I can't throw out some plays from time to time. You know where I hang out. Maybe uva will do the same.
  #10  
Old 08-05-2010, 12:30 AM
Hall Of Fame '11
 
Join Date: Aug 2004
Location: Philadelphia
Posts: 35,968
Rewards: 475
in reality this sub forum has changed UVA. he used to text me for plays and now can't hook a brother up.
  #11  
Old 08-05-2010, 12:33 AM
Hall Of Fame '11
 
Join Date: Aug 2004
Location: Philadelphia
Posts: 35,968
Rewards: 475
Quote:
Originally Posted by IrishTim View Post
Haha Seanie is good people. But uva's right - modeling doesn't lend itself to posting plays because of the inherent long-term approach. People like to ride "hot" cappers (as if there is such a thing as getting hot at picking independent events), not follow along for an entire season trying to hit numbers at openers and/or peak value. But I'll see if I can't throw out some plays from time to time. You know where I hang out. Maybe uva will do the same.


Maybe every week, for me, in football, you could post just what you're producing that the line should be. I have problems creating my own line and don't trust it at all
  #12  
Old 08-05-2010, 12:37 AM
Registered User
 
Join Date: Apr 2010
Posts: 111
Rewards: 171
Sure, I'll see what I can do. I know they don't have PMs here but you know where to get in touch with me.

Another reason modelers don't post plays is because guys like those long writeups with all the silly angles. Saying I'm betting Atl -138 because my model made the line -152 doesn't appeal to the general population.
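For anyone following along, the -138 versus -152 comparison comes down to break-even probabilities. Here is a minimal Python sketch of the standard American-odds conversion; the function name is mine, and the calculation ignores the bookmaker's margin.

```python
def implied_probability(american_odds):
    """Break-even win probability for an American moneyline price."""
    if american_odds < 0:
        return -american_odds / (-american_odds + 100)
    return 100 / (american_odds + 100)


market = implied_probability(-138)  # price being offered
model = implied_probability(-152)   # price the model says is fair

print(round(market, 4), round(model, 4))
print(model > market)  # the model rates the side more likely than the price requires
```

When the model's probability is higher than the break-even probability of the offered price, the bet has positive expected value according to the model, and that is the whole writeup.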

Last edited by IrishTim; 08-05-2010 at 12:38 AM.
  #13  
Old 08-05-2010, 12:40 AM
Hall Of Fame '11
 
Join Date: Aug 2004
Location: Philadelphia
Posts: 35,968
Rewards: 475
Quote:
Originally Posted by IrishTim View Post
Sure, I'll see what I can do. I know they don't have PMs here but you know where to get in touch with me.

Another reason modelers don't post plays is because guys like those long writeups with all the silly angles. Saying I'm betting Atl -138 because my model made the line -152 doesn't appeal to the general population.


lol it appeals to me. it's guys that have "feelings" and post a trend from 9 years ago that doesn't appeal to me.

I wish I could just get something like Reagan has. Baseball is my worst sport by far
  #14  
Old 08-05-2010, 12:41 AM
Hall Of Fame '11
 
Join Date: Aug 2004
Location: Philadelphia
Posts: 35,968
Rewards: 475
Tim you using SO or did you spring for Don Best?
  #15  
Old 08-05-2010, 03:18 AM
Hall of Fame
 
Join Date: Oct 2005
Location: Salem, VA
Posts: 22,450
Rewards: 1,800
i've been posting ways to make own line in nba and mlb all year, even posting ways of gathering information with one click

nba playoff model worked wonders :D
All times are GMT -5.