Archive for December, 2011

Prosper Loan Data Part II of II – Social Network Analysis: What is the Value of a Friend?

Wednesday, December 14th, 2011

Since Prosper provides data on members and their friends who are also members, we can conduct a simple “social network” analysis. What is the value of a friend when getting approved for a loan through Prosper? I first determined how many borrowers were approved and how many borrowers were declined for a loan. Next, I determined how many approved friends each borrower had. From that data, we get the following contingency table of counts:

Now we can calculate the following probabilities: the probability that you are approved given that you have at least 1 approved friend, or P(A | F), where A = Approved and F = Has at least 1 approved friend. We can also calculate the probability that you are approved given that you have zero approved friends, or P(A | F’).

Following the rules of conditional probability we have P(A | F) = P(A ∩ F) / P(F).

Probability of being approved: P(A) = 37212 / 286791 = 0.129
Probability of having at least 1 approved friend: P(F) = 5692 / 286791 = 0.0198
Probability of being approved and having at least 1 approved friend: P(A ∩ F) = 2838 / 286791 = 0.0098
Probability of being approved given that you have at least 1 approved friend:
P(A ∩ F) / P(F) = 0.0098 / 0.0198 = 0.4949

Now we will also calculate the probability of being approved given that you do not have at least 1 friend:

Probability of being approved: P(A) = 0.129
Probability of having zero approved friends: (F’) = 281099/286791 = 0.980
Probability of being approved and having zero approved friends: P(A ∩ F’) = 34374 / 286791 = 0.119
Probability of being approved given that you have zero approved friends: P(A ∩ F’) / P(F’) = 0.119 / .980 = 0.12

Therefore:
P(A | F) = 0.49 (49% of applicants with at least one friend in the network were approved.)
P(A | F’) = 0.12 (12% of applicants with no friends in the network were approved.)

We can calculate a risk ratio from these two quantities:
Risk Ratio: P(A | F) / P(A | F’) = 4.08

Members with at least 1 approved friend are 4.08x more likely to be approved for a loan than members who have 0 approved friends

While this is an interesting statement, it does not mean that having an approved friend causes approval for a loan, nor does it mean that being approved for a loan causes one to have an approved friend. It is simply an observation of two correlated variables. In fact, I would be willing to bet that being approved for a loan actually causes one to have approved friends as a result of word of mouth referrals.

Dataspora leverages a proprietary platform that can distinguish correlation from causality between variables from massive data sets. This complex yet extremely important notion of causality vs. correlation applied to business intelligence will be discussed in further detail in a future post.

Visualization of Prosper.com’s Loan Data Part I of II – Compare and Contrast with Lending Club

Tuesday, December 6th, 2011

Due to the positive feedback received on this post I thought I would re-create the analysis on another peer-to-peer lending dataset, courtesy of Prosper.com. You can access the Prosper Marketplace data via an API or by simply downloading XML files that are updated nightly http://www.prosper.com/tools/.

If you are going to follow the route I took and download the latest XML file, ProsperDataExport_xml.zip, you will find this utility helpful in converting the XML files to CSVs: Convert Prosper XML to CSV

Once you have downloaded the .jar file run the following command (changing the parameters of course!):
java -jar ProsperXMLtoCSV.jar ProsperXMLFileLocation CSVDestinationDirectory

Similar to Lending Club, Prosper provides loan-level data such as interest rate, amount funded/requested, borrower state, borrower debt to income ratio, etc. However, Prosper also provides additional information regarding their user base and loan performance history. This information includes extended credit profiles of users, groups that users belong to, social networks within the user base and even retroscores, or how a loan would be rated by Prosper under a new heuristic given macroeconomic shifts over time.

Let’s jump right into the visualizations by state:


library(ggplot2)
library(maps)


## Warning: this is a very large dataset that required ~10 minutes ## to read into R on a fast 8-core Xeon server.
loans <- read.csv("Loans.CSV", header=TRUE)
listings <- read.csv("Listings.CSV", header=TRUE)


## Obtain the active loans from the Listings file, since it
## contains more detailed information than the Loans file
listings.match <- listings[match(loans$ListingKey, listings$Key),]


listings.match$BorrowerState <- as.character(listings.match$BorrowerState)
loans <- listings.match
states <- map_data("state")


## Change state abbreviations to full names so we can merge our
## data frames together
state.names <- unlist(sapply(loans$BorrowerState, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]) )
loans$BorrowerState <- tolower(state.names)
colnames(loans)[11] <- "region"
state.counts <- data.frame(table(loans$region))
colnames(state.counts) <- c("region", "Num.Loans")
result<-merge(state.counts, states, by=c("region"))
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75) + opts(title = 'Number of Issued Loans by State')
print(p)

Click for Larger Image

It comes as no surprise that a majority of issued loans originate in California. As with Lending Club, Prosper is a San Francisco-based peer-to-peer lending company.

Now we will take the log of the number of loans issued by state and compare Prosper’s market reach with Lending Club’s.


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75) + opts(title = 'Log Number of Issued Loans by State')
print(p)

Prosper

Click for Larger Image

Lending Club

Click for Larger Image

The two maps are extremely similar. Both lending companies issue the most loans in California, Texas and Florida. There are some minor differences such as Lending Club issuing more loans than Prosper in Wyoming and Montana.

Instead of the Monthly Income by State map that I created for Lending Club, we will observe Debt to Income Ratios by state for both Prosper borrowers and Lending Club borrowers.


## Aggregate median debt to income ratio by state
debt.to.income <-aggregate(loans$DebtToIncomeRatio, by=list(loans$region), function(x) median(x, na.rm=TRUE))
colnames(debt.to.income) <- c("region", "debt.to.income")
result <- merge(debt.to.income, states, by="region")
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=debt.to.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75) + labs(fill="Debt to Income Ratio") + opts(title = 'Median Debt to Income Ratio of Borrowers by State')

Prosper

Click for Larger Image

Lending Club

Click for Larger Image

Does anyone want to start pointing fingers for the United States debt crisis yet? The states that Prosper loans to the most are also the ones with the lowest Debt to Income Ratios. New Yorkers, in particular, have the lowest median Debt to Income Ratio. Lending Club seems to have much more homogeneous interest rates. We can compare the distributions of the two companies' Debt to Income Ratios with a call to ggplot (after a bit of pre-processing that I left out due to real estate on this page):


ggplot(combined, aes(x=DebtToIncomeRatio)) + geom_histogram() + facet_grid(Company ~ .)

Click for Larger Image

It appears as if Lending Club has a hard cut-off at a 0.30 Debt to Income Ratio for borrowers. Note that this data is taking into account all loans since the inception of both companies. Prosper implemented stricter borrowing guidelines and interest rates after 2009, which can be seen in the animation below.


issue.year <- substr(loans$StartDate, 0, 4)
loans$Issued.Year <- issue.year
interest.by.year.by.state<-aggregate(loans$BorrowerRate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE))
years <- c("2006", "2007", "2008", "2009", "2010", "2011")
colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate")
interest.by.year.by.state$interest.rate <- interest.by.year.by.state$interest.rate * 100

result <- merge(interest.by.year.by.state, states, by="region")
result <- result[order(result$order),]


#Calculate the lower and upper bounds for the gradient
lower <- floor(summary(interest.by.year.by.state$interest.rate)[1])[[1]]
upper <- ceiling(summary(interest.by.year.by.state$interest.rate)[6])[[1]]


states2 <- data.frame(map("state", plot=FALSE)[c("x","y")])
animateMap <- function(year){
result.year <- result[grep(year, result$year),]
usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate))
print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + opts(title = paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab(""))
}


saveMovie(for (i in 1:length(years)) animateMap(years[i]), clean = T);


Notice the interest rates are the most varied in 2006, the year of Prosper’s inception.
It also worth noting that the median interest rates for borrowers soared after 2009, when Prosper implemented stricter guidelines for borrowers, which also resulted in lower default rates.

Stay tuned for a "social network" analysis of Prosper.com's member data coming up in Part II!