Archive for the ‘R’ Category

Visualization of Prosper.com’s Loan Data Part I of II – Compare and Contrast with Lending Club

Tuesday, December 6th, 2011

Due to the positive feedback received on this post I thought I would re-create the analysis on another peer-to-peer lending dataset, courtesy of Prosper.com. You can access the Prosper Marketplace data via an API or by simply downloading XML files that are updated nightly http://www.prosper.com/tools/.

If you are going to follow the route I took and download the latest XML file, ProsperDataExport_xml.zip, you will find this utility helpful in converting the XML files to CSVs: Convert Prosper XML to CSV

Once you have downloaded the .jar file run the following command (changing the parameters of course!):
java -jar ProsperXMLtoCSV.jar ProsperXMLFileLocation CSVDestinationDirectory

Similar to Lending Club, Prosper provides loan-level data such as interest rate, amount funded/requested, borrower state, borrower debt to income ratio, etc. However, Prosper also provides additional information regarding their user base and loan performance history. This information includes extended credit profiles of users, groups that users belong to, social networks within the user base and even retroscores, or how a loan would be rated by Prosper under a new heuristic given macroeconomic shifts over time.

Let’s jump right into the visualizations by state:


library(ggplot2)
library(maps)


## Warning: this is a very large dataset that required ~10 minutes ## to read into R on a fast 8-core Xeon server.
loans <- read.csv("Loans.CSV", header=TRUE)
listings <- read.csv("Listings.CSV", header=TRUE)


## Obtain the active loans from the Listings file, since it
## contains more detailed information than the Loans file
listings.match <- listings[match(loans$ListingKey, listings$Key),]


listings.match$BorrowerState <- as.character(listings.match$BorrowerState)
loans <- listings.match
states <- map_data("state")


## Change state abbreviations to full names so we can merge our
## data frames together
state.names <- unlist(sapply(loans$BorrowerState, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]) )
loans$BorrowerState <- tolower(state.names)
colnames(loans)[11] <- "region"
state.counts <- data.frame(table(loans$region))
colnames(state.counts) <- c("region", "Num.Loans")
result<-merge(state.counts, states, by=c("region"))
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75) + opts(title = 'Number of Issued Loans by State')
print(p)

Click for Larger Image

It comes as no surprise that a majority of issued loans originate in California. As with Lending Club, Prosper is a San Francisco-based peer-to-peer lending company.

Now we will take the log of the number of loans issued by state and compare Prosper’s market reach with Lending Club’s.


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75) + opts(title = 'Log Number of Issued Loans by State')
print(p)

Prosper

Click for Larger Image

Lending Club

Click for Larger Image

The two maps are extremely similar. Both lending companies issue the most loans in California, Texas and Florida. There are some minor differences such as Lending Club issuing more loans than Prosper in Wyoming and Montana.

Instead of the Monthly Income by State map that I created for Lending Club, we will observe Debt to Income Ratios by state for both Prosper borrowers and Lending Club borrowers.


## Aggregate median debt to income ratio by state
debt.to.income <-aggregate(loans$DebtToIncomeRatio, by=list(loans$region), function(x) median(x, na.rm=TRUE))
colnames(debt.to.income) <- c("region", "debt.to.income")
result <- merge(debt.to.income, states, by="region")
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=debt.to.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75) + labs(fill="Debt to Income Ratio") + opts(title = 'Median Debt to Income Ratio of Borrowers by State')

Prosper

Click for Larger Image

Lending Club

Click for Larger Image

Does anyone want to start pointing fingers for the United States debt crisis yet? The states that Prosper loans to the most are also the ones with the lowest Debt to Income Ratios. New Yorkers, in particular, have the lowest median Debt to Income Ratio. Lending Club seems to have much more homogeneous interest rates. We can compare the distributions of the two companies' Debt to Income Ratios with a call to ggplot (after a bit of pre-processing that I left out due to real estate on this page):


ggplot(combined, aes(x=DebtToIncomeRatio)) + geom_histogram() + facet_grid(Company ~ .)

Click for Larger Image

It appears as if Lending Club has a hard cut-off at a 0.30 Debt to Income Ratio for borrowers. Note that this data is taking into account all loans since the inception of both companies. Prosper implemented stricter borrowing guidelines and interest rates after 2009, which can be seen in the animation below.


issue.year <- substr(loans$StartDate, 0, 4)
loans$Issued.Year <- issue.year
interest.by.year.by.state<-aggregate(loans$BorrowerRate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE))
years <- c("2006", "2007", "2008", "2009", "2010", "2011")
colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate")
interest.by.year.by.state$interest.rate <- interest.by.year.by.state$interest.rate * 100

result <- merge(interest.by.year.by.state, states, by="region")
result <- result[order(result$order),]


#Calculate the lower and upper bounds for the gradient
lower <- floor(summary(interest.by.year.by.state$interest.rate)[1])[[1]]
upper <- ceiling(summary(interest.by.year.by.state$interest.rate)[6])[[1]]


states2 <- data.frame(map("state", plot=FALSE)[c("x","y")])
animateMap <- function(year){
result.year <- result[grep(year, result$year),]
usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate))
print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + opts(title = paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab(""))
}


saveMovie(for (i in 1:length(years)) animateMap(years[i]), clean = T);


Notice the interest rates are the most varied in 2006, the year of Prosper’s inception.
It also worth noting that the median interest rates for borrowers soared after 2009, when Prosper implemented stricter guidelines for borrowers, which also resulted in lower default rates.

Stay tuned for a "social network" analysis of Prosper.com's member data coming up in Part II!

Mining Lending Club’s Goldmine of Loan Data Part I of II – Visualizations by State

Friday, October 14th, 2011

I have a few friends that keep bragging about their 14% annual returns by investing their money with Lending Club, a peer-to-peer lending service that cuts out the complexities and difficulties of getting approved for a loan through a bank. To give you an idea of the sheer amount of volume Lending Club has been dealing with, here’s a snapshot of the Company Statistics as of 10/14/2011:

  • Loans funded to date: $387,043,375
  • Loans funded last month: $24,945,400
  • Interest paid to investors since inception: $32,135,688

Currently Lending Club is boasting that 91% of borrowers earn between 6-18%. Now of course, higher returns are correlated with higher risk. You can choose to diversify your investment across hundreds of different loans with different credit grades – the worse the credit grade, the higher the return percentage, and the higher the risk. I thought it would be interesting to investigate Lending Club a bit more so I navigated over to their site and found something that only a data scientist would consider to be gold:

https://www.lendingclub.com/info/download-data.action


Yes, they have provided complete loan data in CSV format for all of us data geeks to devour.  The data include the current loan status (Current, Late, Fully Paid, etc.), credit grades, interest rates, loan purposes, and all sorts of other juicy tidbits of borrower information.

I downloaded the data and quickly determined that this csv file contained information on 37,122 loans. Of course the first thing I did was fire up R:


library(ggplot2)
library(maps)
loans <- read.csv("LoanStats.csv", header=TRUE, skip = 1)

One of the three sexy skills of the data geek includes data munging, otherwise known as suffering. This post will briefly touch on 2 of the 3 skills - Data Munging and Data Visualization. But first, we need to get the data into a format that our tool, in this case R, can handle. We’ll replace some percentage signs and change a factor to a character string.


loans$Debt.To.Income.Ratio <- as.numeric(gsub("%", "", loans$Debt.To.Income.Ratio))
loans$State <- as.character(loans$State)
loans$Interest.Rate <- as.numeric(gsub("%", "", loans$Interest.Rate))
loans$Revolving.Line.Utilization <- as.numeric(gsub("%", "", loans$Revolving.Line.Utilization))

Conveniently, map_data(“state”) will get all of the latitude and longitude data for each state so that we can draw a map of the U.S.


states <- map_data("state")
loans<-loans[-which(loans$State == ""),]


#Change state abbreviations to full names so we can merge our data frames together
state.names <- unlist(sapply(loans$State, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]) )
loans$State <- tolower(state.names)
colnames(loans)[23] <- "region"

Now we will determine the number of loans by state and merge this data.frame with the states data so we can plot this all out on a map using ggplot2.


state.counts <- data.frame(table(loans$region))
colnames(state.counts) <- c("region", "Num.Loans")
result<-merge(state.counts, states, by=c("region"))
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75)
print(p)

It doesn’t take a geography whiz to realize that a state is missing from this map! Turns out, Lending Club has zero borrowers in North Dakota as of 10/14/2011. Since the number of loans can range anywhere from 3 in Maine to 6,452 in California, we can also plot the log of the total number of loans in order to more easily compare each state's loan activity visually. Why don’t we also add poor North Dakota onto our map? We will assign its Num.Loans variable a value of 1 since we will be taking the log for our next visualization and log(1) = 0.


nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),]
nd$Num.Loans <- 1
result <- rbind(result, nd)
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75)
print (p)

That’s better. We can see that most of Lending Club’s borrowers are from CA, which makes sense given that Lending Club is headquartered in San Francisco. They also have vast reach across Texas, Florida, New York, a good portion of the east coast, and states bordering California. They have the least number of borrowers in Maine and parts of the West to Midwest.

Now let’s explore some of the other variables and project them onto our map. We will look at the median monthly incomes by state.


monthly.income <-aggregate(loans$Monthly.Income, by=list(loans$region), function(x) median(x, na.rm=TRUE))
colnames(monthly.income) <- c("region", "monthly.income")
result <- merge(monthly.income, states, by="region")
nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),]
nd$monthly.income <- 0
result <- rbind(result, nd)
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=monthly.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75)
print(p)

You may recall that Lending Club has only issued 3 loans in Maine. This means we are only looking at 3 data points, which is not a large sample size. We can add any type of information we would like to the center of each state on our map. Let’s add the total number of loans in each state using geom_text() to the center of each state to give this information a little more context.


state.info<-data.frame(region = tolower(state.name), long=state.center$x, lat=state.center$y)
state.info <- subset(state.info, !region %in% c("alaska", "hawaii"))
totals <- data.frame(table(loans$region))
colnames(totals) <- c("region", "total")
state.info <- merge(state.info, totals)


p + geom_text(data=state.info, aes(label=total, cex=0.5))

I have one more trick up my sleeve, which I hacked together thanks to this post from r-bloggers.com. We will look at how the median interest rate for loans issued by Lending Club have varied over the past 4 years by state.


library(animation)


#Pull out just the year from the Issued.Date for each loan
loans$Issued.Year <- substr(loans$Issued.Date, 1, 4)
interest.by.year.by.state<-aggregate(loans$Interest.Rate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE))
years <- c("2007", "2008", "2009", "2010", "2011")
colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate")


result <- merge(interest.by.year.by.state, states, by="region")
result <- result[order(result$order),]


#Calculate the lower and upper bounds for the gradient
lower <- floor(summary(interest.by.year.by.state$interest.rate)[1])[[1]]
upper <- ceiling(summary(interest.by.year.by.state$interest.rate)[6])[[1]]


states2 <- data.frame(map("state", plot=FALSE)[c("x","y")])
animateMap <- function(year){
result.year <- result[grep(year, result$year),]
usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate))
print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + opts(title = paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab(""))
}
saveMovie(for (i in 1:length(years)) animateMap(years[i]), clean = T);

If we observe our award-winning animated GIF created in R, we can see that the interest rates that Lending Club calculated for issued loans in 2007, the year of its inception, were much more heterogeneous than they are now. They are the highest in 2009 at around 14% across a majority of the U.S. and now they are more constant, hovering around 11% for most states. States without color simply indicate that there were no loans issued by Lending Club in that state for the given year.

What are some interesting visualizations you have come up with using Lending Club’s trove of borrower data?

UPDATE

Due to high demand, I have created a map of "Good" vs. "Bad" borrowers broken down by state. Since some states have many more borrowers than others, I also included the total number of borrowers that went into the ratio, depicted as a number on each state's center. I filtered the original loan data down to two classes of borrowers. "Good" borrowers as those that were fully paid and "Bad" borrowers are those that either charged off, defaulted, or were late on payments. This resulted in 1,542 "Bad" borrowers and 4,647 "Good" borrowers. I then simply calculated the percentage of "Bad" customers by state. Keep in mind this does not include data on the ~24,000 other loans that are current! Click the image to see a larger version.

As you can see, the only states with 0% "Bad" borrowers are those with fewer than 13 borrowers. If we compare states with multiple hundreds of borrowers, Florida consists of about 40% "Bad" borrowers! That's approximately 167 borrowers out of 418! Texas, New York and Pennsylvania borrowers on the other hand, are pretty diligent with paying back their loans and are boasting that only 20% of their borrowers are naughty. Meanwhile California, Lending Club's home state, has the most borrowers and about 30% of those have either charged off, defaulted or were late on their payments.

Hang tight for a more quantitative analysis in which we will try to determine which factors other than state of residence are most important in determining what makes a "good" or "bad" borrower.

Color: The Cinderella of dataviz

Friday, March 13th, 2009

“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.”  — Envisioning Information, Edward Tufte, Graphics Press, 1990   

multivariate color strip plot Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.

Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.

While obsessing about how to implement color on the Dataspora Labs’ PitchFX viewer I began with a basic motivating question: (more…)

People who love scatter plots & connecting dots

Thursday, February 19th, 2009


We hosted the first Dataviz Salon SF on Tuesday night, with lightning talks by boredom cop Shane Booth, dataviz wiz Lee Byron , computational journalist Brad Stenger, data wrangler Pete Skomoroch , and any/all data enthusiast Brendan O’Connor .

I was going to blog all about it — but Tom Carden of Stamen Design already has a great write-up.

… Dataspora invited a few people to a Dataviz Salon yesterday evening. Mike and I went along and huddled in a brick-built basement in SoMa to listen to the following:

.

How Google and Facebook are using R

Thursday, February 19th, 2009


(March 26th Update: Video now available)
Last night, I moderated our Bay Area R Users Group kick-off event with a panel discussion entitled “The R and Science of Predictive Analytics”, co-located with the Predictive Analytics World conference here in SF.

The panel comprised of four recognized R users from industry:

  • Bo Cowgill, Google
  • Itamar Rosenn, Facebook
  • David Smith, Revolution Computing
  • Jim Porzak, The Generations Network (and Co-Chair of our R Users Group)

The panelists were asked to explain how they use R for predictive analytics within their firms, its strengths and weaknesses as a tool, and provide a case study. What follows is my summary with comments.

(more…)

What I’ll be presenting at O’Reilly Money Tech 2009

Tuesday, October 21st, 2008

(April 2009 Update:  Unfortunately, The Money Tech Conference was indefinitely postponed, but fortunately I will be presenting a version of this talk in July at OSCON 2009).

I’ve been invited to speak at O’Reilly’s Money Tech conference this coming February 4-6th in New York City and thought I’d share the abstract for my talk here.  I’ll likely be in New York for several days, if you’d like to get together to chat about data drop me a line!

My talk is entitled “Open Source Analytics: Visualization and Predictive Modeling of Big Data with the R Programming Language”
(more…)

How do you measure a major league slugger?

Monday, September 1st, 2008

I gave a talk last month at SAP Labs in Palo Alto, along with Jim Porzak of ResponSys, introducing the R Statistical Language to a Business Intelligence interest group.  The goal was to highlight how open source tools, like R, can be used to build predictive models.  The example I gave centered around baseball and a simple question:  how do you measure a baseball slugger?

Michael Lewis, in Moneyball , described how the baseball analyst Bill James was frustrated by the fact that major league hitters were consistently rated by their batting averages. James wrote: (more…)