This week’s guest blogger is Dataspora’s own Antonio Piccolboni. The original post can be found on his personal blog.

Here I describe how to use the R programming language to do some simple analysis of baseball statistics.

The first step is getting the data we’ll analyze, which can be downloaded as a MySQL database at baseball-databank.org. After downloading the zipped files, I populated this database on my Linux machine as follows (replace the user, password, and host with your own as necessary).


mysql -vv -u'root' -p'pwd' -hlocalhost <<EOF
drop database if exists bbdb;
create database bbdb;
EOF
mysql -u'root' -p'pwd' -hlocalhost -Dbbdb < bbdb.sql

Once you’ve created the database, you can do some basic analysis of hitting statistics within R. First, we connect to the local MySQL database from R (the following code should be executed at the R prompt ‘>’).


library(RMySQL)
con <- dbConnect(dbDriver('MySQL'),
user='usr',
password = 'pwd',
host = 'localhost',
dbname = 'bbdb')

Next, query the “teams” table for the team-level batting statistics, and pull the results into an R data frame.


resultSet <- dbSendQuery(con,
"select AB,BB,H,2B,3B,HR,SF,HBP,G,R
from teams
where yearID between 2000 and 2005")
teamStats <- fetch(resultSet, n=-1)

Finally, calculate the two batting statistics, runs per game and batting average, and plot them.

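# runs per game, from the columns of the teamStats data frame
rpg <- teamStats$R / teamStats$G
# batting average: hits divided by at-bats
bavg <- teamStats$H / teamStats$AB
plot(rpg ~ bavg)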
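
The calculation above can also be written with with(), which looks the column names up inside the data frame. The sketch below is only an illustration: it swaps the database query for a tiny hand-made teamStats data frame (the numbers are invented, not real MLB totals) so that it runs without MySQL. The axis labels and the lm() trend line are my own additions.

```r
# Hypothetical stand-in for the teamStats data frame returned by the query.
# These numbers are invented for illustration and are NOT real MLB totals.
teamStats <- data.frame(
  AB = c(5500, 5600, 5450, 5520),
  H  = c(1430, 1512, 1363, 1490),
  G  = c(162, 162, 162, 162),
  R  = c(729, 810, 648, 770)
)

# with() evaluates each expression using the data frame's columns as variables
rpg  <- with(teamStats, R / G)    # runs per game
bavg <- with(teamStats, H / AB)   # batting average

# Fit a simple linear model of runs per game on batting average
fit <- lm(rpg ~ bavg)

# Scatter plot with labeled axes, plus the fitted trend line
plot(rpg ~ bavg, xlab = "Batting average", ylab = "Runs per game")
abline(fit)
```

With the real query results in teamStats, the first statement would simply be dropped.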

If you have an X11 interface set up properly (I use Xming), the last plot command should pop up the image below.
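
If no X11 display is available (for example, on a remote machine), one alternative is to send the plot to a file instead of the screen. This is a minimal sketch, not part of the original workflow: the file name is hypothetical, and the two short vectors are made-up stand-ins for the rpg and bavg values computed earlier.

```r
# Made-up values standing in for the rpg and bavg vectors computed earlier
bavg <- c(0.250, 0.260, 0.270)
rpg  <- c(4.0, 4.5, 5.0)

# Hypothetical output file; tempdir() is used here only to keep the sketch tidy
outFile <- file.path(tempdir(), "rpg_vs_bavg.pdf")

pdf(outFile, width = 6, height = 6)  # open a PDF graphics device
plot(rpg ~ bavg)                     # draw into the file instead of a window
invisible(dev.off())                 # close the device so the file is written
```

The pdf() device is part of base R and does not need a display; png() works the same way where PNG support is available.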

MLB runs per game vs. batting average