This is my statistics project for the end of the semester. I think it's pretty friggin sweet that I can just look up baseball stats and call it school work. MBA = much easier than a Bachelor's in Electrical Engineering. Well, here's the ole project proposal. I'll be posting the results and analysis whenever I finish the project / do the work...
When constructing a baseball lineup in any league, whether it is the major leagues, a high school team or a little league team, there have always been general guidelines to go by. Through this regression analysis I would like to find what correlation, if any, these guidelines have with the overall production of a major league offense, or whether the premises I have been taught my whole life about building the ideal lineup are actually erroneous. With new theories about ideal hitters and the statistics developed through the ‘Moneyball’ philosophy, it is important to see whether these age-old theories still hold up.
Dependent Variable: Runs. The easiest way to analyze how a baseball team's offense performed over the course of a season is to use how many runs the team scored.
Independent Variable #1: DH Rule. The first thing this regression needs to account for is the difference between NL and AL teams: NL lineups have the pitcher hit, while AL lineups use a DH. Each team's data set will be assigned a value of 1 if the team is in the AL and 0 if it is in the NL.
Independent Variable #2: Leadoff Hitter Pitches Per Plate Appearance. A common piece of advice for a leadoff hitter is that it is essential to take a lot of pitches, especially in the first at bat. Taking several pitches is supposed to give the batters later in the lineup a better look at what the pitcher is throwing.
Independent Variable #3: Leadoff Hitter Stolen Bases Per Game. The ideal leadoff hitter is supposed to be fast. The best measure of a player’s speed is the stolen base, which should help his team score some cheap runs.
Independent Variable #4: Leadoff Hitter On Base Percentage. Another important job of the leadoff hitter is to get on base. If the leadoff hitter does not get on base, then the big hitters that ‘should’ be hitting 3rd and 4th in the lineup will not have RBI opportunities.
Independent Variable #5: 2nd Hitter (Strikeouts - Sacrifices) Per Game. The second batter in the lineup is said to need a few characteristics. First, he needs to be able to put the ball in play, which can be measured by strikeouts: the fewer strikeouts, obviously the better. Second, a #2 hitter should be adept at moving a runner over on the base paths via a bunt. To combine these characteristics into one number I came up with Strikeouts - Sacrifices, where a lower value is better.
Independent Variable #6: 3rd Hitter On Base Percentage + Slugging Percentage. The 3rd hitter is ‘supposed’ to be the best hitter in your lineup. The OPS stat (On Base Percentage plus Slugging Percentage) is commonly used as the best single depiction of a hitter's ability.
Independent Variable #7: Cleanup Hitter Slugging Percentage. The cleanup hitter should be the masher in your lineup: a HR hitter, a guy who pretty much hits for extra bases.
Independent Variable #8: 5th Hitter Batting Average with Runners in Scoring Position. The #5 hitter is there to clean up whatever mess is left on base behind the masher. Batting Average with Runners in Scoring Position shows how effective he is at knocking home the runners left on base.
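To make the variable definitions above concrete, here is a rough sketch of how one team's row of regression inputs could be put together. The function name, the dictionary keys and every number in the example are made up for illustration; only the formulas (the 1/0 DH flag, stolen bases per game, strikeouts minus sacrifices per game, and OPS as OBP plus SLG) come from the proposal.

```python
def team_row(league, leadoff, second, third, cleanup, fifth):
    """Build the eight independent variables for one team.

    Each hitter argument is a dict of season stats for the player who
    filled that lineup spot. All key names are hypothetical.
    """
    return {
        # Variable 1: 1 for AL (DH), 0 for NL (pitcher hits)
        "dh": 1 if league == "AL" else 0,
        # Variable 2: leadoff pitches seen per plate appearance
        "leadoff_p_pa": leadoff["pitches"] / leadoff["pa"],
        # Variable 3: leadoff stolen bases per game
        "leadoff_sb_g": leadoff["sb"] / leadoff["games"],
        # Variable 4: leadoff on base percentage
        "leadoff_obp": leadoff["obp"],
        # Variable 5: (strikeouts - sacrifices) per game, lower is better
        "second_k_minus_sh_g": (second["so"] - second["sh"]) / second["games"],
        # Variable 6: OPS = OBP + SLG
        "third_ops": third["obp"] + third["slg"],
        # Variable 7: cleanup slugging percentage
        "cleanup_slg": cleanup["slg"],
        # Variable 8: batting average with runners in scoring position
        "fifth_risp_avg": fifth["risp_avg"],
    }


# Invented numbers, not real player statistics.
row = team_row(
    "AL",
    leadoff={"pitches": 2400, "pa": 600, "sb": 30, "games": 150, "obp": 0.360},
    second={"so": 100, "sh": 10, "games": 150},
    third={"obp": 0.400, "slg": 0.500},
    cleanup={"slg": 0.550},
    fifth={"risp_avg": 0.290},
)
print(row)
```

One row like this per team, plus that team's runs scored, is all the regression needs.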
Data Collection: All data will be gathered from online sources, most likely Sportsline.com and ESPN.com. Since it will be difficult to gather exact data for an entire team at a given position in the lineup, I will use the data of the player who hit most often in each lineup spot for each team. This way I can get accurate data for the majority of the games played in the major leagues last season. I am using rate stats (percentages and per-game figures) for all of my independent variables, which will to an extent counteract the differences in games played. I will note games played and the player for each team at each lineup position on a separate spreadsheet.
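Picking "the player that hit most in each lineup spot" is easy to do by hand, but here is a small sketch of the same idea in code, assuming the starting lineups are available as one (team, lineup spot, player) entry per game. The function name, that input layout and the sample entries are assumptions for illustration, not something the sites above necessarily provide in this form.

```python
from collections import Counter, defaultdict


def primary_hitters(starts):
    """Return {(team, spot): (player, games_started)} for the player who
    started most often at each lineup spot.

    `starts` is an iterable of (team, lineup_spot, player) tuples,
    one tuple per game started (a hypothetical input layout).
    """
    counts = defaultdict(Counter)
    for team, spot, player in starts:
        counts[(team, spot)][player] += 1
    # most_common(1) gives the single most frequent player and his count
    return {key: c.most_common(1)[0] for key, c in counts.items()}


# Made-up sample: three games for the leadoff spot, one for the 2nd spot.
sample = [
    ("Team X", 1, "Player A"),
    ("Team X", 1, "Player A"),
    ("Team X", 1, "Player B"),
    ("Team X", 2, "Player C"),
]
result = primary_hitters(sample)
print(result)
```

The games-started count that comes back alongside each player is the same "games played" figure the proposal plans to track on the separate spreadsheet.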
In addition to using percentages for variables, I wanted to make sure I didn’t have variables with a direct connection to the overall runs scored by a team. Stats like runs scored by the leadoff hitter, or runs batted in for the 3rd batter or cleanup hitter, would not bring much insight to the end results, since they are simply a share of the team's run total.
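With the variables settled, the regression itself can be sketched as an ordinary least squares fit of runs on the lineup variables. The block below is only a minimal illustration under loud assumptions: it uses just three of the eight variables, the 30 "teams" are randomly generated with invented coefficients, and the helper names `solve` and `fit_ols` are mine. It shows the mechanics of comparing how much of the runs total each variable explains alone against all of them together, not any actual result.

```python
import random


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        # Partial pivoting: move the largest entry in this column up
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_ols(rows, y):
    """Return (coefficients with intercept first, R^2) for y ~ rows."""
    X = [[1.0] + list(r) for r in rows]
    k = len(X[0])
    # Normal equations: (X'X) beta = X'y
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, x)) for x in X]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return beta, 1.0 - ss_res / ss_tot


# Made-up data only: 30 fake teams, coefficients invented for the demo.
random.seed(0)
teams, runs = [], []
for i in range(30):
    dh = i % 2                        # stand-in for the DH flag
    obp = random.uniform(0.30, 0.40)  # stand-in for leadoff OBP
    slg = random.uniform(0.40, 0.60)  # stand-in for cleanup SLG
    teams.append((dh, obp, slg))
    runs.append(500 + 40 * dh + 400 * obp + 300 * slg + random.gauss(0, 20))

_, r2_full = fit_ols(teams, runs)
r2_single = [fit_ols([(t[j],) for t in teams], runs)[1] for j in range(3)]
print("R^2, all variables together:", round(r2_full, 3))
print("R^2, each variable alone:   ", [round(r, 3) for r in r2_single])
```

On the same data, the combined fit can never have a lower R^2 than any single-variable fit, so the more interesting check on the real numbers will be the significance tests a spreadsheet or stats package reports for each coefficient and for the model as a whole.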
Predictions: My guess is that none of the variables individually will explain the overall run production of a team but that a combination of the variables