Mining Lending Club’s Goldmine of Loan Data Part I of II – Visualizations by State

by Tanya Cashorali | October 14th, 2011

I have a few friends that keep bragging about their 14% annual returns by investing their money with Lending Club, a peer-to-peer lending service that cuts out the complexities and difficulties of getting approved for a loan through a bank. To give you an idea of the sheer amount of volume Lending Club has been dealing with, here’s a snapshot of the Company Statistics as of 10/14/2011:

  • Loans funded to date: $387,043,375
  • Loans funded last month: $24,945,400
  • Interest paid to investors since inception: $32,135,688

Currently Lending Club is boasting that 91% of borrowers earn between 6-18%. Now of course, higher returns are correlated with higher risk. You can choose to diversify your investment across hundreds of different loans with different credit grades – the worse the credit grade, the higher the return percentage, and the higher the risk. I thought it would be interesting to investigate Lending Club a bit more so I navigated over to their site and found something that only a data scientist would consider to be gold:

https://www.lendingclub.com/info/download-data.action


Yes, they have provided complete loan data in CSV format for all of us data geeks to devour.  The data include the current loan status (Current, Late, Fully Paid, etc.), credit grades, interest rates, loan purposes, and all sorts of other juicy tidbits of borrower information.

I downloaded the data and quickly determined that this csv file contained information on 37,122 loans. Of course the first thing I did was fire up R:


library(ggplot2)
library(maps)
loans <- read.csv("LoanStats.csv", header=TRUE, skip = 1)

One of the three sexy skills of the data geek includes data munging, otherwise known as suffering. This post will briefly touch on 2 of the 3 skills - Data Munging and Data Visualization. But first, we need to get the data into a format that our tool, in this case R, can handle. We’ll replace some percentage signs and change a factor to a character string.


loans$Debt.To.Income.Ratio <- as.numeric(gsub("%", "", loans$Debt.To.Income.Ratio))
loans$State <- as.character(loans$State)
loans$Interest.Rate <- as.numeric(gsub("%", "", loans$Interest.Rate))
loans$Revolving.Line.Utilization <- as.numeric(gsub("%", "", loans$Revolving.Line.Utilization))

Conveniently, map_data(“state”) will get all of the latitude and longitude data for each state so that we can draw a map of the U.S.


states <- map_data("state")
loans<-loans[-which(loans$State == ""),]


#Change state abbreviations to full names so we can merge our data frames together
state.names <- unlist(sapply(loans$State, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]) )
loans$State <- tolower(state.names)
colnames(loans)[23] <- "region"

Now we will determine the number of loans by state and merge this data.frame with the states data so we can plot this all out on a map using ggplot2.


state.counts <- data.frame(table(loans$region))
colnames(state.counts) <- c("region", "Num.Loans")
result<-merge(state.counts, states, by=c("region"))
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75)
print(p)

It doesn’t take a geography whiz to realize that a state is missing from this map! Turns out, Lending Club has zero borrowers in North Dakota as of 10/14/2011. Since the number of loans can range anywhere from 3 in Maine to 6,452 in California, we can also plot the log of the total number of loans in order to more easily compare each state's loan activity visually. Why don’t we also add poor North Dakota onto our map? We will assign its Num.Loans variable a value of 1 since we will be taking the log for our next visualization and log(1) = 0.


nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),]
nd$Num.Loans <- 1
result <- rbind(result, nd)
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75)
print (p)

That’s better. We can see that most of Lending Club’s borrowers are from CA, which makes sense given that Lending Club is headquartered in San Francisco. They also have vast reach across Texas, Florida, New York, a good portion of the east coast, and states bordering California. They have the least number of borrowers in Maine and parts of the West to Midwest.

Now let’s explore some of the other variables and project them onto our map. We will look at the median monthly incomes by state.


monthly.income <-aggregate(loans$Monthly.Income, by=list(loans$region), function(x) median(x, na.rm=TRUE))
colnames(monthly.income) <- c("region", "monthly.income")
result <- merge(monthly.income, states, by="region")
nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),]
nd$monthly.income <- 0
result <- rbind(result, nd)
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=monthly.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75)
print(p)

You may recall that Lending Club has only issued 3 loans in Maine. This means we are only looking at 3 data points, which is not a large sample size. We can add any type of information we would like to the center of each state on our map. Let’s add the total number of loans in each state using geom_text() to the center of each state to give this information a little more context.


state.info<-data.frame(region = tolower(state.name), long=state.center$x, lat=state.center$y)
state.info <- subset(state.info, !region %in% c("alaska", "hawaii"))
totals <- data.frame(table(loans$region))
colnames(totals) <- c("region", "total")
state.info <- merge(state.info, totals)


p + geom_text(data=state.info, aes(label=total, cex=0.5))

I have one more trick up my sleeve, which I hacked together thanks to this post from r-bloggers.com. We will look at how the median interest rate for loans issued by Lending Club have varied over the past 4 years by state.


library(animation)


#Pull out just the year from the Issued.Date for each loan
loans$Issued.Year <- substr(loans$Issued.Date, 1, 4)
interest.by.year.by.state<-aggregate(loans$Interest.Rate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE))
years <- c("2007", "2008", "2009", "2010", "2011")
colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate")


result <- merge(interest.by.year.by.state, states, by="region")
result <- result[order(result$order),]


#Calculate the lower and upper bounds for the gradient
lower <- floor(summary(interest.by.year.by.state$interest.rate)[1])[[1]]
upper <- ceiling(summary(interest.by.year.by.state$interest.rate)[6])[[1]]


states2 <- data.frame(map("state", plot=FALSE)[c("x","y")])
animateMap <- function(year){
result.year <- result[grep(year, result$year),]
usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate))
print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + opts(title = paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab(""))
}
saveMovie(for (i in 1:length(years)) animateMap(years[i]), clean = T);

If we observe our award-winning animated GIF created in R, we can see that the interest rates that Lending Club calculated for issued loans in 2007, the year of its inception, were much more heterogeneous than they are now. They are the highest in 2009 at around 14% across a majority of the U.S. and now they are more constant, hovering around 11% for most states. States without color simply indicate that there were no loans issued by Lending Club in that state for the given year.

What are some interesting visualizations you have come up with using Lending Club’s trove of borrower data?

UPDATE

Due to high demand, I have created a map of "Good" vs. "Bad" borrowers broken down by state. Since some states have many more borrowers than others, I also included the total number of borrowers that went into the ratio, depicted as a number on each state's center. I filtered the original loan data down to two classes of borrowers. "Good" borrowers as those that were fully paid and "Bad" borrowers are those that either charged off, defaulted, or were late on payments. This resulted in 1,542 "Bad" borrowers and 4,647 "Good" borrowers. I then simply calculated the percentage of "Bad" customers by state. Keep in mind this does not include data on the ~24,000 other loans that are current! Click the image to see a larger version.

As you can see, the only states with 0% "Bad" borrowers are those with fewer than 13 borrowers. If we compare states with multiple hundreds of borrowers, Florida consists of about 40% "Bad" borrowers! That's approximately 167 borrowers out of 418! Texas, New York and Pennsylvania borrowers on the other hand, are pretty diligent with paying back their loans and are boasting that only 20% of their borrowers are naughty. Meanwhile California, Lending Club's home state, has the most borrowers and about 30% of those have either charged off, defaulted or were late on their payments.

Hang tight for a more quantitative analysis in which we will try to determine which factors other than state of residence are most important in determining what makes a "good" or "bad" borrower.

  1. Peter Renton says:

    This is fascinating. Good work here. How about some data visualizations for defaults (expressed as a percentage of total), percentage of loans paid early and plain old ROI. That would be even more interesting IMHO.

  2. tanya says:

    Thanks Peter. Those sorts of questions will be addressed in Part II where we will try to determine what factors make a “good” or “bad” borrower, with “paid fully” and “defaulting” as good and bad outcomes, respectively.

  3. Wiseclerk says:

    Great job, Tanya. Another map about the default risk by state would be great.

    As for interesting visualizations. I don’t remember one about Lending Club, but I liked thsi one about kiva:
    http://www.wiseclerk.com/group-news/servicesmicrofinance/kiva-stunning-visualization-of-kiva-loan-growth/
    And you probably can tell us with which tool it was created?

    • tanya says:

      That is a great video. My best guess would be that it was created using Processing http://processing.org/, probably in combination with other movie editing tools.

      I do have a map of good to bad customers by state. I will post a follow-up!

  4. Peter Renton says:

    Great. I look forward to part II. Every investor wishes they could tell the difference between a good borrower and bad borrower. So this will have tremendous benefit for all p2p lending investors.

  5. [...] Shared Mining Lending Club’s Goldmine of Loan Data Part I of II – Visualizations by State. [...]

  6. eran says:

    Super nice. Thanks for sharing, the code is superb.

  7. [...] Dataspora - Mining Lending Club’s Goldmine of Loan Data Part I of II – Visualizations by State [...]

  8. [...] Dataspora - Mining Lending Club’s Goldmine of Loan Data Part I of II – Visualizations by State [...]

  9. Adam says:

    This is really cool. It’s great seeing P2P lending analyses + good work in R and ggplot! Congratulations, very nice!

  10. Mike says:

    Very interesting graphs Tanya. Looking at the state default rates, I’m wondering if the difference might be partially explained by the economic condition of each state.

  11. eran says:

    I think there is a mistake in the code:

    loans$Issued.Year <- substr(loans$Issued.Date, 1, 4)

    while:
    head(substr(loans$Issued.Date, 1, 4) ,1)
    [1] "6/27"

    Its not years… the following can fix it:

    loans$Issued.Year <- unlist(strsplit(as.character(loans$Issued.Date), "/"))[3]

  12. Charlie H says:

    Tanya

    Now we are cooking with fire!

  13. I agree with Peter. This is a very interesting analysis.

    We would love to see you do a similar analysis on Prosper, where we believe our net returns (yields minus defaults) are significantly higher than Lending Club’s. We are 100% transparent with our data.

    What kind of data can we give you?

    BTW – North Dakota, Maine and Iowa do not allow peer to peer lending, thus the reason for so few loans in those states.

    Glenn G. Millar
    (Prosper Employee)