Archive for the ‘data’ Category

Prosper Loan Data Part II of II – Social Network Analysis: What is the Value of a Friend?

Wednesday, December 14th, 2011

Since Prosper provides data on members and their friends who are also members, we can conduct a simple “social network” analysis. What is the value of a friend when getting approved for a loan through Prosper? I first determined how many borrowers were approved and how many borrowers were declined for a loan. Next, I determined how many approved friends each borrower had. From that data, we get the following contingency table of counts:
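The counts quoted in the calculations below are enough to rebuild that table. A minimal sketch in R (the object and dimension names are illustrative, not taken from the original analysis); the two cells that are not quoted directly are filled in by subtraction:

```r
## Counts quoted in the calculations in this post
total       <- 286791  # all borrowers
approved    <- 37212   # borrowers approved for a loan
with.friend <- 5692    # borrowers with at least 1 approved friend
both        <- 2838    # approved AND at least 1 approved friend

## Fill in the remaining cells by subtraction
counts <- matrix(
  c(both,            with.friend - both,
    approved - both, (total - with.friend) - (approved - both)),
  nrow = 2, byrow = TRUE,
  dimnames = list(Friends  = c("At least 1 approved friend", "No approved friends"),
                  Decision = c("Approved", "Declined")))

## Add row and column totals
print(addmargins(counts))
```

Each row total is the denominator of one of the conditional probabilities that follow.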

Now we can calculate the following probabilities: the probability that you are approved given that you have at least 1 approved friend, or P(A | F), where A = Approved and F = Has at least 1 approved friend. We can also calculate the probability that you are approved given that you have zero approved friends, or P(A | F’).

Following the rules of conditional probability we have P(A | F) = P(A ∩ F) / P(F).

Probability of being approved: P(A) = 37212 / 286791 = 0.1298
Probability of having at least 1 approved friend: P(F) = 5692 / 286791 = 0.0198
Probability of being approved and having at least 1 approved friend: P(A ∩ F) = 2838 / 286791 = 0.0099
Probability of being approved given that you have at least 1 approved friend:
P(A | F) = P(A ∩ F) / P(F) = 2838 / 5692 = 0.4986 (the totals cancel, so the raw counts can be divided directly)

Now we will also calculate the probability of being approved given that you have zero approved friends:

Probability of being approved: P(A) = 0.1298
Probability of having zero approved friends: P(F’) = 281099 / 286791 = 0.9802
Probability of being approved and having zero approved friends: P(A ∩ F’) = 34374 / 286791 = 0.1199
Probability of being approved given that you have zero approved friends: P(A | F’) = P(A ∩ F’) / P(F’) = 34374 / 281099 = 0.1223

Therefore:
P(A | F) = 0.4986 (about 50% of applicants with at least one approved friend were approved.)
P(A | F’) = 0.1223 (about 12% of applicants with no approved friends were approved.)

We can calculate a risk ratio from these two quantities:
Risk Ratio: P(A | F) / P(A | F’) = 4.08

Members with at least 1 approved friend are 4.08 times as likely to be approved for a loan as members who have 0 approved friends.
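As a quick check, the same ratio can be computed straight from the counts, which avoids carrying rounded intermediate values from step to step. A minimal sketch in R (variable names are illustrative):

```r
## P(A | group) = approved borrowers in the group / size of the group
p.given.friend    <- 2838  / 5692     # P(A | F)
p.given.no.friend <- 34374 / 281099   # P(A | F')

## Risk ratio: how many times as likely approval is with an approved friend
risk.ratio <- p.given.friend / p.given.no.friend

print(round(c(p.given.friend, p.given.no.friend, risk.ratio), 4))
```

Rounded to two decimals, the ratio comes out at the 4.08 quoted above.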

While this is an interesting statement, it does not by itself show that having an approved friend causes approval for a loan, nor that being approved for a loan causes one to have an approved friend. It is simply an observation of two correlated variables. If I had to bet on a direction, though, I would guess that being approved for a loan leads to having approved friends, as a result of word-of-mouth referrals.

Dataspora leverages a proprietary platform that can distinguish correlation from causality between variables from massive data sets. This complex yet extremely important notion of causality vs. correlation applied to business intelligence will be discussed in further detail in a future post.

Visualization of Prosper.com’s Loan Data Part I of II – Compare and Contrast with Lending Club

Tuesday, December 6th, 2011

Due to the positive feedback received on my Lending Club post, I thought I would re-create the analysis on another peer-to-peer lending dataset, courtesy of Prosper.com. You can access the Prosper Marketplace data via an API or by simply downloading XML files that are updated nightly at http://www.prosper.com/tools/.

If you are going to follow the route I took and download the latest XML file, ProsperDataExport_xml.zip, you will find this utility helpful in converting the XML files to CSVs: Convert Prosper XML to CSV

Once you have downloaded the .jar file, run the following command (changing the parameters, of course!):
java -jar ProsperXMLtoCSV.jar ProsperXMLFileLocation CSVDestinationDirectory

Similar to Lending Club, Prosper provides loan-level data such as interest rate, amount funded/requested, borrower state, borrower debt to income ratio, etc. However, Prosper also provides additional information regarding their user base and loan performance history. This information includes extended credit profiles of users, groups that users belong to, social networks within the user base and even retroscores, or how a loan would be rated by Prosper under a new heuristic given macroeconomic shifts over time.

Let’s jump right into the visualizations by state:


library(ggplot2)
library(maps)


## Warning: this is a very large dataset that required ~10 minutes
## to read into R on a fast 8-core Xeon server.
loans <- read.csv("Loans.CSV", header=TRUE)
listings <- read.csv("Listings.CSV", header=TRUE)


## Obtain the active loans from the Listings file, since it
## contains more detailed information than the Loans file
listings.match <- listings[match(loans$ListingKey, listings$Key),]


listings.match$BorrowerState <- as.character(listings.match$BorrowerState)
loans <- listings.match
states <- map_data("state")


## Change state abbreviations to full names so we can merge our
## data frames together
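## Vectorized lookup; anything that is not one of the 50 state
## abbreviations is labeled "District of Columbia", as before
state.names <- state.name[match(loans$BorrowerState, state.abb)]
state.names[is.na(state.names)] <- "District of Columbia"
loans$BorrowerState <- tolower(state.names)
## Rename by column name rather than by position
colnames(loans)[colnames(loans) == "BorrowerState"] <- "region"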
state.counts <- data.frame(table(loans$region))
colnames(state.counts) <- c("region", "Num.Loans")
result<-merge(state.counts, states, by=c("region"))
result <- result[order(result$order),]


## ggtitle() replaces the older opts(title = ...)
p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) +
  geom_polygon() +
  scale_fill_gradient(low = "yellow", high = "blue") +
  coord_equal(ratio=1.75) +
  ggtitle('Number of Issued Loans by State')
print(p)


It comes as no surprise that a majority of issued loans originate in California. As with Lending Club, Prosper is a San Francisco-based peer-to-peer lending company.
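One step in the code above that is easy to overlook is the re-sort after merge(). A small sketch of why it matters, using an invented six-row stand-in for map_data("state") (the coordinates and counts here are made up for illustration):

```r
## Hypothetical stand-in for map_data("state"): two regions whose polygon
## vertices are numbered by the 'order' column
states.toy <- data.frame(
  region = c("texas", "texas", "texas", "california", "california", "california"),
  long   = c(-100, -95, -97, -120, -115, -118),
  lat    = c(30, 30, 34, 35, 35, 40),
  order  = 1:6)

state.counts <- data.frame(region    = c("california", "texas"),
                           Num.Loans = c(3, 2))

## merge() sorts its result on the key column by default, so the vertices
## come back grouped by region name rather than in drawing order
result <- merge(state.counts, states.toy, by = "region")
print(result$order)

## Restoring the original vertex order keeps geom_polygon() from
## connecting points out of sequence
result <- result[order(result$order), ]
print(result$order)
```

Without the second step the polygons can be drawn with their vertices out of sequence, which shows up as streaks across the map.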

Now we will take the log of the number of loans issued by state and compare Prosper’s market reach with Lending Club’s.


## ggtitle() replaces the older opts(title = ...)
p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) +
  geom_polygon() +
  scale_fill_gradient(low = "yellow", high = "blue", trans="log") +
  coord_equal(ratio=1.75) +
  ggtitle('Log Number of Issued Loans by State')
print(p)

Prosper


Lending Club


The two maps are extremely similar. Both lending companies issue the most loans in California, Texas and Florida. There are some minor differences such as Lending Club issuing more loans than Prosper in Wyoming and Montana.

Instead of the Monthly Income by State map that I created for Lending Club, we will observe Debt to Income Ratios by state for both Prosper borrowers and Lending Club borrowers.


## Aggregate median debt to income ratio by state
debt.to.income <-aggregate(loans$DebtToIncomeRatio, by=list(loans$region), function(x) median(x, na.rm=TRUE))
colnames(debt.to.income) <- c("region", "debt.to.income")
result <- merge(debt.to.income, states, by="region")
result <- result[order(result$order),]


## ggtitle() replaces the older opts(title = ...)
p <- ggplot(result, aes(x=long, y=lat)) +
  geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=debt.to.income)) +
  scale_fill_gradient(low="yellow", high="purple") +
  coord_equal(ratio=1.75) +
  labs(fill="Debt to Income Ratio") +
  ggtitle('Median Debt to Income Ratio of Borrowers by State')
print(p)

Prosper


Lending Club


Does anyone want to start pointing fingers for the United States debt crisis yet? The states that Prosper loans to the most are also the ones with the lowest Debt to Income Ratios. New Yorkers, in particular, have the lowest median Debt to Income Ratio. Lending Club seems to have much more homogeneous Debt to Income Ratios across states. We can compare the distributions of the two companies' Debt to Income Ratios with a call to ggplot (after a bit of pre-processing that I left out to save space on this page):


ggplot(combined, aes(x=DebtToIncomeRatio)) + geom_histogram() + facet_grid(Company ~ .)
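The pre-processing itself is not shown here. One plausible sketch, assuming each company's ratios are available as a numeric vector (the two toy vectors and their values below are invented for illustration), is to stack them into one long data frame with a Company label, which is the shape facet_grid(Company ~ .) expects:

```r
## Invented toy values standing in for each company's debt-to-income ratios
prosper.dti      <- c(0.12, 0.45, 0.27, NA, 0.80)
lending.club.dti <- c(0.10, 0.22, 0.29, 0.15)

## Stack the two sources into one long data frame with a Company column
combined <- rbind(
  data.frame(Company = "Prosper",      DebtToIncomeRatio = prosper.dti),
  data.frame(Company = "Lending Club", DebtToIncomeRatio = lending.club.dti))

## Drop missing ratios so the histogram is not fed NAs
combined <- combined[!is.na(combined$DebtToIncomeRatio), ]

print(table(combined$Company))
```

If the two sources record the ratio on different scales (for example a fraction in one and a percentage in the other), they would need to be put on a common scale before stacking.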


It appears as if Lending Club has a hard cut-off at a 0.30 Debt to Income Ratio for borrowers. Note that this data is taking into account all loans since the inception of both companies. Prosper implemented stricter borrowing guidelines and interest rates after 2009, which can be seen in the animation below.


## R strings are 1-indexed: characters 1 to 4 hold the year
issue.year <- substr(loans$StartDate, 1, 4)
loans$Issued.Year <- issue.year
interest.by.year.by.state<-aggregate(loans$BorrowerRate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE))
years <- c("2006", "2007", "2008", "2009", "2010", "2011")
colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate")
interest.by.year.by.state$interest.rate <- interest.by.year.by.state$interest.rate * 100

result <- merge(interest.by.year.by.state, states, by="region")
result <- result[order(result$order),]


## Calculate the lower and upper bounds for the gradient
lower <- floor(min(interest.by.year.by.state$interest.rate, na.rm=TRUE))
upper <- ceiling(max(interest.by.year.by.state$interest.rate, na.rm=TRUE))


states2 <- data.frame(map("state", plot=FALSE)[c("x","y")])
animateMap <- function(year){
  ## Keep only the rows for the requested year
  result.year <- result[result$year == year,]
  usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path() + geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate))
  ## ggtitle() replaces the older opts(title = ...)
  print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + ggtitle(paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab(""))
}


## saveMovie() comes from the animation package, so load it first
library(animation)
saveMovie(for (i in 1:length(years)) animateMap(years[i]), clean = TRUE)


Notice the interest rates are the most varied in 2006, the year of Prosper’s inception.
It is also worth noting that the median interest rates for borrowers soared after 2009, when Prosper implemented stricter guidelines for borrowers, which also resulted in lower default rates.
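For anyone who wants to see the per-year medians as numbers rather than colors, the aggregation behind the animation can be tried on a handful of rows. A minimal sketch with invented toy data (the column names follow the loans data frame used above; the dates and rates are made up):

```r
## Invented toy rows; column names follow the loans data frame in this post
loans.toy <- data.frame(
  StartDate    = c("2006-03-01", "2006-07-15", "2010-01-20", "2010-05-02", "2010-09-09"),
  region       = c("texas", "texas", "texas", "texas", "texas"),
  BorrowerRate = c(0.10, 0.20, 0.25, NA, 0.35),
  stringsAsFactors = FALSE)

## The first four characters of the date string give the year
loans.toy$Issued.Year <- substr(loans.toy$StartDate, 1, 4)

## Median rate for every (year, region) pair, ignoring missing rates
med <- aggregate(loans.toy$BorrowerRate,
                 by = list(year = loans.toy$Issued.Year, region = loans.toy$region),
                 FUN = function(x) median(x, na.rm = TRUE))
colnames(med)[3] <- "interest.rate"
med$interest.rate <- med$interest.rate * 100   # express as a percentage
print(med)
```

Using the median rather than the mean keeps a few extreme rates in a small state from dominating its color on the map.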

Stay tuned for a "social network" analysis of Prosper.com's member data coming up in Part II!

Data scientists or data composers? Four steps to a symphony of data

Friday, September 16th, 2011

My 10-year-old son recently asked me what a data scientist does. I’m a visual guy, and like to paint a picture, so I thought about how best to explain this. I liked an explanation I came across a while back at http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html, describing the relationship between data, information and knowledge. I would take it one step further, because that’s what we do here at Dataspora. We take data, and transform it into actionable intelligence.

A data scientist is someone who takes your data and transforms it into actionable intelligence. But how do you explain that to a 10-year-old? Well, it just so happens that this 10-year-old is starting to play the clarinet, so music seemed to be a good choice to use as an example.

Let’s say the note this aspiring Benny Goodman squeaks out of his shiny new instrument is a piece of data. All alone, hanging in the air in my living room, echoing off the walls, it doesn’t mean a whole lot. It’s raw, unadulterated data. Now, what if we do something with it to put it in context? I can play a chord on the piano, and have him play his (squeaky) note, and suddenly we can tell if he’s sharp or flat or in tune. We can measure how long the note is. His note is no longer alone, but has some context. We have some information. We know he was playing middle C. This is the equivalent of step 2. We turned data into information by giving it context.

Is that enough? Perhaps. If that’s all you want to know, sure. But you probably want more. My son has a pretty good ear, and can pick up a rhythm fairly quickly. If he were to play several notes in a particular rhythm, we could have a motif. Wow. That’s more useful than a single note. He’s now taken several notes, with varying durations, added some pauses and made something larger – a motif.  In this analogy, the motif is equivalent to a bit of knowledge. More motifs = more knowledge. If I were a composer, I could combine various motifs to make a symphony (really important knowledge). The more skilled the composer (data scientist), the better the symphony (= better knowledge).

This is great! But we’re still just at the knowledge state, and I want to do something useful. I have a score sitting in front of me that started from a 10-year-old boy blowing on his rented clarinet. What can I do with it? This is where things get interesting. If I were a conductor, I could choose how and when to present this new score. Do I give it to my 5th grade band to play, or pass it on to the Philharmonic? Maybe I change a few things, and prepare it for a string quartet. The choice of action is up to me. This is the final stage – ACTION.

So, some might say that a data scientist fiddles around with data, but I prefer to look at the larger picture. A data scientist transforms data into actionable intelligence, picking and choosing what’s useful and what’s not. It doesn’t really matter if you have the data if you can’t actually do something with it – even if your choice is to do nothing at all.

How XML Threatens Big Data

Saturday, August 22nd, 2009

Credit: http://www.flickr.com/photos/digitalart/2101765353

Confessions from a Massive, Nightmarish Data Project

Back in 2000, I went to France to build a genomics platform. A biotech hired me to combine their in-house genome data with that of public repositories like Genbank. The problem was that the repositories, all with millions of records, each had their own format. It sounded like a massive, nightmarish data interoperability project. And an ideal fit for a hot new technology: XML.

So I dove in, spending my days designing DTDs, writing parsers, tweaking tags (“taxon” or “species”? attribute or element?). At night I dreamt in ontologies. It was perfect.

Then reality struck. The pipeline was slow: Oracle loaded XML at a crawl. And it was a memory hog, since XSLT required putting full document trees in RAM.

We had a deadline to meet (and, mon dieu, a 35 hour work-week). So we changed course. We hacked our Perl scripts to emit a flat tab-delimited format — “TabML” — which was bulk loaded into Oracle. It wasn’t elegant, but it was fast and it worked.
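The real pipeline was Perl feeding Oracle's bulk loader, so what follows is only an illustration of how little machinery a flat tab-delimited file needs. It is a sketch in R, and the record fields and values are invented:

```r
## Invented records standing in for parsed genome entries
records <- data.frame(
  accession = c("X00001", "X00002"),
  species   = c("Homo sapiens", "Mus musculus"),
  length    = c(1520L, 987L),
  stringsAsFactors = FALSE)

## One record per line, fields separated by tabs, no quoting or nesting
path <- tempfile(fileext = ".tsv")
write.table(records, path, sep = "\t", quote = FALSE, row.names = FALSE)

## Reading it back is a simple line-oriented parse; no document tree is built
round.trip <- read.delim(path, stringsAsFactors = FALSE)
print(round.trip)
```

A format this plain cannot carry nested structure, which is the trade-off that was accepted here in exchange for speed.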

Yet looking back, I realize that XML was the wrong format from the start. And as I’ll argue here, our unhealthy obsession with XML formats threatens to slow or impede many open data projects, including initiatives like Data.gov.

In the next sections, I discuss how XML fails for Big Data because of its unnatural form, bulk, and complexity. Finally, I generalize to three rules that advocate a more liberal approach to data.

(more…)

The Rise of the Data Web

Thursday, August 20th, 2009

The future of the web is data, not documents. The web has evolved from Tim Berners-Lee’s original vision of “some big, virtual documentation system in the sky” into a vibrant ecosystem of data where documents (and human actors) will play an ever smaller role.

As others have noted, we’ve reached a tipping point in history: more data is being manufactured by machines — servers, cell phones, GPS-enabled cars — than by people. The early, document-centric web was populated by hand-coded hypertext files; today, a hand-coded web page is as rare as hand-woven clothing.

Through web frameworks, wikis, and blogs, we have industrialized the creation of hypertext. Similarly, we’ve also industrialized the collection of data, and spliced out the human steps in many data flows, such that data entry clerks may soon be as rare as typesetters.

The web we experience will continue to be dominated by documents: e-mail, blogs, and news. And while many sites are data-centric (Google Maps, Weather.com, and Yahoo Finance), it’s the web that we can’t see that is surging with data. It’s not about us, it’s about servers in the cloud mediating entire pipelines of data, only occasionally surfacing in a browser.

But the web’s data architecture is fractious and in flux: many competing standards exist for serializing, parsing, and describing data. As we build out the data web, we ought to embrace standards that mirror data’s form in its natural habitats — as programmatic data structures, relational tables, or key-value pairs — while taking advantage of data’s stream-like nature. Mark-up languages like HTML and XML are ideal for documents, but they are poor containers for data, especially Big Data.

(more…)

Color: The Cinderella of dataviz

Friday, March 13th, 2009

“Avoiding catastrophe becomes the first principle in bringing color to information: Above all, do no harm.”  — Envisioning Information, Edward Tufte, Graphics Press, 1990   

Color is one of the most abused and neglected tools in data visualization. It is abused when we make poor color choices; it is neglected when we rely on poor software defaults. Yet despite its historically poor treatment at the hands of engineers and end-users alike, if used wisely, color is unrivaled as a visualization tool.

Most of us think twice before walking outside in fluorescent red underoos. If only we were as cautious in choosing colors for infographics. The difference is that few of us design our own clothes. But until good palettes (like ColorBrewer) are commonplace, to get colors that fit our purposes, we must be our own tailors.

While obsessing about how to implement color on the Dataspora Labs’ PitchFX viewer I began with a basic motivating question: (more…)

How Google and Facebook are using R

Thursday, February 19th, 2009


(March 26th Update: Video now available)
Last night, I moderated our Bay Area R Users Group kick-off event with a panel discussion entitled “The R and Science of Predictive Analytics”, co-located with the Predictive Analytics World conference here in SF.

The panel comprised four recognized R users from industry:

  • Bo Cowgill, Google
  • Itamar Rosenn, Facebook
  • David Smith, Revolution Computing
  • Jim Porzak, The Generations Network (and Co-Chair of our R Users Group)

The panelists were asked to explain how they use R for predictive analytics within their firms, to describe its strengths and weaknesses as a tool, and to provide a case study. What follows is my summary with comments.

(more…)

Is Big Data at a tipping point?

Friday, January 9th, 2009

(5/18/09 update – included an overdue reference to linked data!) 

Stuart Kauffman, in one of his books about complexity, discusses tipping points in networks (what he calls phase transitions) by way of buttons. Suppose you’re sitting on a floor strewn with 400 buttons, and you begin tying them together with pieces of string at random. At first, you have just pairs of buttons. Then, you have clusters of threes, which in turn get tied into ever larger clumps. The question is: How long until picking any button off the floor pulls them all off together, in one connected mass?

It turns out that this supercluster of buttons doesn’t build gradually as we tie more threads; it emerges suddenly.  This rapid phase transition, from relatively unconnected to mostly connected, occurs right around where we have about half as many threads as buttons (see figure).  This is the tipping point of the system:  where a few threads make a big difference.
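The button experiment is easy to try at home without any string. Here is a rough simulation sketch in R. It is my own reading of the thought experiment rather than Kauffman's code, and the function name, its arguments, and the random seed are all arbitrary choices:

```r
## Tie random pairs of buttons together and track the largest cluster
simulate.buttons <- function(n.buttons = 400, n.threads = 400, seed = 1) {
  set.seed(seed)
  parent <- seq_len(n.buttons)   # every button starts as its own cluster

  ## Follow parent links up to the representative button of a cluster
  find.root <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }

  largest <- integer(n.threads)
  for (t in seq_len(n.threads)) {
    pair <- sample.int(n.buttons, 2)   # two distinct random buttons
    a <- find.root(pair[1])
    b <- find.root(pair[2])
    if (a != b) parent[b] <- a         # the thread merges two clusters
    roots <- vapply(seq_len(n.buttons), find.root, integer(1))
    largest[t] <- max(table(roots))    # size of the biggest cluster so far
  }
  data.frame(threads = seq_len(n.threads), largest = largest)
}

sim <- simulate.buttons()
## Share of all buttons in the largest cluster at 100, 200, 300 and 400 threads
print(sim$largest[c(100, 200, 300, 400)] / 400)
```

Plotting sim$largest against sim$threads should show the curve staying nearly flat at first and then climbing steeply once the thread count passes roughly half the button count.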

A similar phase transition has already occurred with regard to data inside business ecosystems. For the past several decades, an increasing number of business processes (sales, customer service, shipping) have come online, along with the data they throw off.  As these individual databases are linked, via common formats or labels, a tipping point is reached: suddenly, every part of the company organism is connected to the data center.  And every action (sales lead, mouse click, shipping update) is stored.  The result:  organizations are overwhelmed by what feels like a tsunami of data.

The same trend is occurring in the larger universe of data that these organizations inhabit: Big Data is being unleashed by the “Industrial Revolution of Data”, whether from public agencies, non-profit institutes, or forward-thinking private firms.

At present, much of the world’s Big Data is iceberg-like: frozen and mostly underwater. It’s frozen because incompatible formats and missing meta-data standards make it hard for data to flow from one place to another:  comparing the SEC’s financial data with Europe’s requires common formats and labels (ahem, XBRL) that don’t yet exist. Data is “underwater” when, whether for reasons of competitiveness, privacy, or sheer incompetence, it’s not shared: US medical records may contain a wealth of data, but much of it is on paper and offline (not so in Europe, enabling studies with huge cohorts).

Yet there’s a slow thaw underway as evidenced by a number of initiatives:  Aaron Swartz’s theinfo.org, Flip Kromer’s infochimps, Carl Malamud’s bulk.resource.org, the Tim-Berners-Lee-inspired LinkedData.org, as well as Numbrary, Swivel, Freebase, and Amazon’s public data sets.  These are all ambitious projects, but the challenge of weaving these data sets together is still greater.

How far are we from the tipping point of Big Data? When will the world’s icebergs of data melt into one sea? More importantly, when it happens, will we be ready to do something useful with it all?

What I’ll be presenting at O’Reilly Money Tech 2009

Tuesday, October 21st, 2008

(April 2009 Update:  Unfortunately, The Money Tech Conference was indefinitely postponed, but fortunately I will be presenting a version of this talk in July at OSCON 2009).

I’ve been invited to speak at O’Reilly’s Money Tech conference this coming February 4-6th in New York City and thought I’d share the abstract for my talk here.  I’ll likely be in New York for several days; if you’d like to get together to chat about data, drop me a line!

My talk is entitled “Open Source Analytics: Visualization and Predictive Modeling of Big Data with the R Programming Language”
(more…)

Data: bigger, faster, cheaper. And more valuable than ever.

Sunday, June 1st, 2008

“Information about money has become almost as important as money itself.” — Walter Wriston, former Chairman of Citicorp

“Some firms believe that in 10 years half their business will come from moving information about goods, rather than moving the goods themselves.” — The 20-Ton Packet, Wired 7.10

Information has always been valuable, but it’s only in the last decade that it has become so dramatically cheap — to store, to move, and to process. But it’s still not “too cheap to meter” (Stewart Brand’s phrase) and probably never will be.

Yet despite this dramatic drop in cost, the real value of information, by any measure, has not diminished. And because the costs of other goods — whether shipping containers, materials, or home furnishings — have fallen more slowly, information contributes an ever larger fraction to a firm’s profits.

A corollary to the falling cost of information, and its persistent value, is that as more kinds of information come online, more of this data is worth keeping. Even data whose value is metered in cents per terabyte is increasingly worth storing, and eventually analyzing, as it may yield several cents profit.
(more…)