Prosper Loan Data Part II of II – Social Network Analysis: What is the Value of a Friend?

by Tanya Cashorali | December 14th, 2011

Since Prosper provides data on members and their friends who are also members, we can conduct a simple “social network” analysis. What is the value of a friend when getting approved for a loan through Prosper? I first determined how many borrowers were approved and how many borrowers were declined for a loan. Next, I determined how many approved friends each borrower had. From that data, we get the following contingency table of counts:

                              Approved   Declined     Total
At least 1 approved friend       2,838      2,854     5,692
Zero approved friends           34,374    246,725   281,099
Total                           37,212    249,579   286,791

Now we can calculate the following probabilities: the probability that you are approved given that you have at least 1 approved friend, or P(A | F), where A = Approved and F = Has at least 1 approved friend. We can also calculate the probability that you are approved given that you have zero approved friends, or P(A | F’).

Following the rules of conditional probability we have P(A | F) = P(A ∩ F) / P(F).

Probability of being approved: P(A) = 37212 / 286791 = 0.130
Probability of having at least 1 approved friend: P(F) = 5692 / 286791 = 0.0198
Probability of being approved and having at least 1 approved friend: P(A ∩ F) = 2838 / 286791 = 0.0099
Probability of being approved given that you have at least 1 approved friend:
P(A ∩ F) / P(F) = 2838 / 5692 = 0.4986

Now we will also calculate the probability of being approved given that you have zero approved friends:

Probability of being approved: P(A) = 0.130
Probability of having zero approved friends: P(F’) = 281099 / 286791 = 0.980
Probability of being approved and having zero approved friends: P(A ∩ F’) = 34374 / 286791 = 0.120
Probability of being approved given that you have zero approved friends: P(A ∩ F’) / P(F’) = 34374 / 281099 = 0.1223

Therefore:
P(A | F) = 0.4986 (About 50% of applicants with at least one approved friend were approved.)
P(A | F’) = 0.1223 (About 12% of applicants with zero approved friends were approved.)

We can calculate a risk ratio from these two quantities:
Risk Ratio: P(A | F) / P(A | F’) = 4.08

Members with at least 1 approved friend are 4.08x more likely to be approved for a loan than members who have 0 approved friends.
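The arithmetic above can be reproduced in a few lines of R. This is a minimal sketch that uses only the counts quoted in this post. The declined counts are derived by subtraction, and the object names (counts, p.approved, risk.ratio) are illustrative rather than taken from the original analysis.

```r
## Counts quoted in the post; the declined cells are derived by subtraction
counts <- matrix(c(2838, 5692 - 2838,
                   34374, 281099 - 34374),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(Friends = c("AtLeastOne", "None"),
                                 Outcome = c("Approved", "Declined")))

## P(A | F) and P(A | F'): the approved share within each row
p.approved <- counts[, "Approved"] / rowSums(counts)
p.approved.given.f <- p.approved[["AtLeastOne"]]
p.approved.given.not.f <- p.approved[["None"]]

## Risk ratio computed from the unrounded probabilities
risk.ratio <- p.approved.given.f / p.approved.given.not.f
round(c(p.approved.given.f, p.approved.given.not.f, risk.ratio), 3)
```

Working from the raw counts instead of rounded intermediate values keeps rounding error out of the conditional probabilities, and the ratio still rounds to 4.08.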

While this is an interesting statement, it does not mean that having an approved friend causes approval for a loan, nor does it mean that being approved for a loan causes one to have an approved friend. It is simply an observation of two correlated variables. In fact, I would be willing to bet that being approved for a loan actually causes one to have approved friends as a result of word of mouth referrals.

Dataspora leverages a proprietary platform that can distinguish correlation from causality between variables from massive data sets. This complex yet extremely important notion of causality vs. correlation applied to business intelligence will be discussed in further detail in a future post.

Visualization of Prosper.com’s Loan Data Part I of II – Compare and Contrast with Lending Club

by Tanya Cashorali | December 6th, 2011

Due to the positive feedback received on this post, I thought I would re-create the analysis on another peer-to-peer lending dataset, courtesy of Prosper.com. You can access the Prosper Marketplace data via an API or by simply downloading XML files that are updated nightly at http://www.prosper.com/tools/.

If you are going to follow the route I took and download the latest XML file, ProsperDataExport_xml.zip, you will find this utility helpful in converting the XML files to CSVs: Convert Prosper XML to CSV

Once you have downloaded the .jar file run the following command (changing the parameters of course!):
java -jar ProsperXMLtoCSV.jar ProsperXMLFileLocation CSVDestinationDirectory

Similar to Lending Club, Prosper provides loan-level data such as interest rate, amount funded/requested, borrower state, borrower debt to income ratio, etc. However, Prosper also provides additional information regarding their user base and loan performance history. This information includes extended credit profiles of users, groups that users belong to, social networks within the user base and even retroscores, or how a loan would be rated by Prosper under a new heuristic given macroeconomic shifts over time.

Let’s jump right into the visualizations by state:


library(ggplot2)
library(maps)


## Warning: this is a very large dataset that required ~10 minutes
## to read into R on a fast 8-core Xeon server.
loans <- read.csv("Loans.CSV", header=TRUE)
listings <- read.csv("Listings.CSV", header=TRUE)


## Obtain the active loans from the Listings file, since it
## contains more detailed information than the Loans file
listings.match <- listings[match(loans$ListingKey, listings$Key),]


listings.match$BorrowerState <- as.character(listings.match$BorrowerState)
loans <- listings.match
states <- map_data("state")


## Change state abbreviations to full names so we can merge our
## data frames together
state.names <- unlist(sapply(loans$BorrowerState, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]) )
loans$BorrowerState <- tolower(state.names)
colnames(loans)[11] <- "region"
state.counts <- data.frame(table(loans$region))
colnames(state.counts) <- c("region", "Num.Loans")
result<-merge(state.counts, states, by=c("region"))
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75) + opts(title = 'Number of Issued Loans by State')
print(p)
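The abbreviation-to-name conversion used above can also be written with match(), which looks up every abbreviation in one vectorized call. The following is only a sketch on a toy vector (abbs and full.names are illustrative names), and it relies on R's built-in state.abb and state.name vectors.

```r
## Toy vector of abbreviations; "DC" is not in R's built-in state.abb
abbs <- c("CA", "NY", "DC", "TX")

## match() returns the position of each abbreviation in state.abb (NA if absent)
full.names <- state.name[match(abbs, state.abb)]

## Treat unmatched abbreviations as the District of Columbia, as the code above does
full.names[is.na(full.names)] <- "District of Columbia"
full.names <- tolower(full.names)
full.names
```

Because match() compares whole strings, it avoids the pattern matching that grep() performs, and it does not need to loop over every loan.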


It comes as no surprise that a majority of issued loans originate in California. As with Lending Club, Prosper is a San Francisco-based peer-to-peer lending company.

Now we will take the log of the number of loans issued by state and compare Prosper’s market reach with Lending Club’s.


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75) + opts(title = 'Log Number of Issued Loans by State')
print(p)

Prosper


Lending Club


The two maps are extremely similar. Both lending companies issue the most loans in California, Texas and Florida. There are some minor differences such as Lending Club issuing more loans than Prosper in Wyoming and Montana.

Instead of the Monthly Income by State map that I created for Lending Club, we will observe Debt to Income Ratios by state for both Prosper borrowers and Lending Club borrowers.


## Aggregate median debt to income ratio by state
debt.to.income <-aggregate(loans$DebtToIncomeRatio, by=list(loans$region), function(x) median(x, na.rm=TRUE))
colnames(debt.to.income) <- c("region", "debt.to.income")
result <- merge(debt.to.income, states, by="region")
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=debt.to.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75) + labs(fill="Debt to Income Ratio") + opts(title = 'Median Debt to Income Ratio of Borrowers by State')
print(p)
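To see what the aggregation step produces before it is merged with the map data, here is a small sketch on made-up values. The toy data frame and its numbers are invented for illustration; only the aggregate() pattern mirrors the code above.

```r
## Toy data: three borrowers in one state, two in another, with one missing ratio
toy <- data.frame(region = c("california", "california", "california", "texas", "texas"),
                  DebtToIncomeRatio = c(0.10, 0.30, 0.20, NA, 0.40))

## Median per state, ignoring missing values
toy.median <- aggregate(toy$DebtToIncomeRatio, by = list(toy$region),
                        function(x) median(x, na.rm = TRUE))
colnames(toy.median) <- c("region", "debt.to.income")
toy.median
```

The result has one row per state, which is why merging it with the states data frame repeats each state's median across all of that state's polygon points.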

Prosper


Lending Club


Does anyone want to start pointing fingers for the United States debt crisis yet? The states that Prosper lends to the most are also the ones with the lowest Debt to Income Ratios. New Yorkers, in particular, have the lowest median Debt to Income Ratio. Lending Club seems to have much more homogeneous Debt to Income Ratios across states. We can compare the distributions of the two companies' Debt to Income Ratios with a call to ggplot (after a bit of pre-processing that I left out to save space on this page):


ggplot(combined, aes(x=DebtToIncomeRatio)) + geom_histogram() + facet_grid(Company ~ .)
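The pre-processing that was left out is not shown in this post, so the following is only a guess at its shape: stack the two companies' ratios into one data frame with a Company column for facet_grid() to split on. The toy values, the prosper and lending.club objects, and the assumption that Prosper stores a fraction while Lending Club stores a percentage are all illustrative.

```r
## Hypothetical stand-ins for the two loan tables
## (assumption: Prosper stores a fraction, Lending Club a percentage)
prosper <- data.frame(DebtToIncomeRatio = c(0.12, 0.25, 0.41))
lending.club <- data.frame(Debt.To.Income.Ratio = c(10.5, 22.0, 29.9))

## Put both on the same 0-1 scale and label each row with its company
combined <- rbind(
  data.frame(Company = "Prosper",
             DebtToIncomeRatio = prosper$DebtToIncomeRatio),
  data.frame(Company = "Lending Club",
             DebtToIncomeRatio = lending.club$Debt.To.Income.Ratio / 100)
)
table(combined$Company)
```

With the data in this long format, a single histogram call can draw one panel per company.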


It appears as if Lending Club has a hard cut-off at a 0.30 Debt to Income Ratio for borrowers. Note that this data is taking into account all loans since the inception of both companies. Prosper implemented stricter borrowing guidelines and interest rates after 2009, which can be seen in the animation below.


issue.year <- substr(loans$StartDate, 1, 4)
loans$Issued.Year <- issue.year
interest.by.year.by.state<-aggregate(loans$BorrowerRate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE))
years <- c("2006", "2007", "2008", "2009", "2010", "2011")
colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate")
interest.by.year.by.state$interest.rate <- interest.by.year.by.state$interest.rate * 100

result <- merge(interest.by.year.by.state, states, by="region")
result <- result[order(result$order),]


#Calculate the lower and upper bounds for the gradient
lower <- floor(summary(interest.by.year.by.state$interest.rate)[1])[[1]]
upper <- ceiling(summary(interest.by.year.by.state$interest.rate)[6])[[1]]


states2 <- data.frame(map("state", plot=FALSE)[c("x","y")])
animateMap <- function(year){
result.year <- result[grep(year, result$year),]
usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate))
print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + opts(title = paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab(""))
}


## saveMovie() comes from the animation package
library(animation)
saveMovie(for (i in 1:length(years)) animateMap(years[i]), clean = TRUE)


Notice the interest rates are the most varied in 2006, the year of Prosper’s inception.
It is also worth noting that the median interest rates for borrowers soared after 2009, when Prosper implemented stricter guidelines for borrowers, which also resulted in lower default rates.

Stay tuned for a "social network" analysis of Prosper.com's member data coming up in Part II!

Mining Lending Club’s Goldmine of Loan Data Part I of II – Visualizations by State

by Tanya Cashorali | October 14th, 2011

I have a few friends who keep bragging about their 14% annual returns from investing their money with Lending Club, a peer-to-peer lending service that cuts out the complexities and difficulties of getting approved for a loan through a bank. To give you an idea of the sheer volume Lending Club has been dealing with, here’s a snapshot of the Company Statistics as of 10/14/2011:

  • Loans funded to date: $387,043,375
  • Loans funded last month: $24,945,400
  • Interest paid to investors since inception: $32,135,688

Currently Lending Club is boasting that 91% of investors earn between 6% and 18%. Now of course, higher returns are correlated with higher risk. You can choose to diversify your investment across hundreds of different loans with different credit grades – the worse the credit grade, the higher the return percentage, and the higher the risk. I thought it would be interesting to investigate Lending Club a bit more so I navigated over to their site and found something that only a data scientist would consider to be gold:

https://www.lendingclub.com/info/download-data.action


Yes, they have provided complete loan data in CSV format for all of us data geeks to devour.  The data include the current loan status (Current, Late, Fully Paid, etc.), credit grades, interest rates, loan purposes, and all sorts of other juicy tidbits of borrower information.

I downloaded the data and quickly determined that this csv file contained information on 37,122 loans. Of course the first thing I did was fire up R:


library(ggplot2)
library(maps)
loans <- read.csv("LoanStats.csv", header=TRUE, skip = 1)

One of the three sexy skills of the data geek is data munging, otherwise known as suffering. This post will briefly touch on 2 of the 3 skills - Data Munging and Data Visualization. But first, we need to get the data into a format that our tool, in this case R, can handle. We’ll replace some percentage signs and change a factor to a character string.


loans$Debt.To.Income.Ratio <- as.numeric(gsub("%", "", loans$Debt.To.Income.Ratio))
loans$State <- as.character(loans$State)
loans$Interest.Rate <- as.numeric(gsub("%", "", loans$Interest.Rate))
loans$Revolving.Line.Utilization <- as.numeric(gsub("%", "", loans$Revolving.Line.Utilization))
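As a quick illustration of what the gsub() and as.numeric() pair above does, here is a sketch on a few made-up strings (raw and cleaned are illustrative names, not columns from the Lending Club file).

```r
## Toy values in the style of the raw CSV columns
raw <- c("10.65%", "7.9%", "15.27%")

## Strip the percent sign, then convert the remaining text to numbers
cleaned <- as.numeric(gsub("%", "", raw))
cleaned
```

The values stay on the percentage scale (10.65 means 10.65%), so divide by 100 if a fraction is needed.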

Conveniently, map_data("state") will get all of the latitude and longitude data for each state so that we can draw a map of the U.S.


states <- map_data("state")
loans<-loans[-which(loans$State == ""),]


#Change state abbreviations to full names so we can merge our data frames together
state.names <- unlist(sapply(loans$State, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]) )
loans$State <- tolower(state.names)
colnames(loans)[23] <- "region"

Now we will determine the number of loans by state and merge this data.frame with the states data so we can plot this all out on a map using ggplot2.


state.counts <- data.frame(table(loans$region))
colnames(state.counts) <- c("region", "Num.Loans")
result<-merge(state.counts, states, by=c("region"))
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75)
print(p)

It doesn’t take a geography whiz to realize that a state is missing from this map! Turns out, Lending Club has zero borrowers in North Dakota as of 10/14/2011. Since the number of loans can range anywhere from 3 in Maine to 6,452 in California, we can also plot the log of the total number of loans in order to more easily compare each state's loan activity visually. Why don’t we also add poor North Dakota onto our map? We will assign its Num.Loans variable a value of 1 since we will be taking the log for our next visualization and log(1) = 0.


nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),]
nd$Num.Loans <- 1
result <- rbind(result, nd)
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75)
print (p)
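To see why the log scale makes states easier to compare, consider the three counts mentioned above. This is a small sketch; num.loans and log.loans are illustrative names.

```r
## Loan counts mentioned in the post: North Dakota placeholder, Maine, California
num.loans <- c("north dakota" = 1, "maine" = 3, "california" = 6452)

## On the raw scale California is over 2,000 times Maine;
## on the natural-log scale the values sit much closer together
log.loans <- log(num.loans)
round(log.loans, 2)
```

North Dakota's placeholder value of 1 maps to 0 on the log scale, so it anchors the low end of the color gradient without producing -Inf, which log(0) would.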

That’s better. We can see that most of Lending Club’s borrowers are from CA, which makes sense given that Lending Club is headquartered in San Francisco. They also have vast reach across Texas, Florida, New York, a good portion of the east coast, and states bordering California. They have the fewest borrowers in Maine and parts of the West to Midwest.

Now let’s explore some of the other variables and project them onto our map. We will look at the median monthly incomes by state.


monthly.income <-aggregate(loans$Monthly.Income, by=list(loans$region), function(x) median(x, na.rm=TRUE))
colnames(monthly.income) <- c("region", "monthly.income")
result <- merge(monthly.income, states, by="region")
nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),]
nd$monthly.income <- 0
result <- rbind(result, nd)
result <- result[order(result$order),]


p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=monthly.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75)
print(p)

You may recall that Lending Club has only issued 3 loans in Maine. This means we are only looking at 3 data points, which is not a large sample size. We can add any type of information we would like to the center of each state on our map. Let’s add the total number of loans in each state using geom_text() to the center of each state to give this information a little more context.


state.info<-data.frame(region = tolower(state.name), long=state.center$x, lat=state.center$y)
state.info <- subset(state.info, !region %in% c("alaska", "hawaii"))
totals <- data.frame(table(loans$region))
colnames(totals) <- c("region", "total")
state.info <- merge(state.info, totals)


## Text size is a fixed setting, so it goes outside aes()
p + geom_text(data=state.info, aes(label=total), size=3)

I have one more trick up my sleeve, which I hacked together thanks to this post from r-bloggers.com. We will look at how the median interest rate for loans issued by Lending Club has varied over the past 4 years by state.


library(animation)


#Pull out just the year from the Issued.Date for each loan
loans$Issued.Year <- substr(loans$Issued.Date, 1, 4)
interest.by.year.by.state<-aggregate(loans$Interest.Rate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE))
years <- c("2007", "2008", "2009", "2010", "2011")
colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate")


result <- merge(interest.by.year.by.state, states, by="region")
result <- result[order(result$order),]


#Calculate the lower and upper bounds for the gradient
lower <- floor(summary(interest.by.year.by.state$interest.rate)[1])[[1]]
upper <- ceiling(summary(interest.by.year.by.state$interest.rate)[6])[[1]]


states2 <- data.frame(map("state", plot=FALSE)[c("x","y")])
animateMap <- function(year){
result.year <- result[grep(year, result$year),]
usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate))
print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + opts(title = paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab(""))
}
saveMovie(for (i in 1:length(years)) animateMap(years[i]), clean = T);

If we observe our award-winning animated GIF created in R, we can see that the interest rates that Lending Club calculated for issued loans in 2007, the year of its inception, were much more heterogeneous than they are now. They are the highest in 2009 at around 14% across a majority of the U.S. and now they are more constant, hovering around 11% for most states. States without color simply indicate that there were no loans issued by Lending Club in that state for the given year.

What are some interesting visualizations you have come up with using Lending Club’s trove of borrower data?

UPDATE

Due to high demand, I have created a map of "Good" vs. "Bad" borrowers broken down by state. Since some states have many more borrowers than others, I also included the total number of borrowers that went into the ratio, depicted as a number on each state's center. I filtered the original loan data down to two classes of borrowers. "Good" borrowers are those whose loans were fully paid, and "Bad" borrowers are those who either charged off, defaulted, or were late on payments. This resulted in 1,542 "Bad" borrowers and 4,647 "Good" borrowers. I then simply calculated the percentage of "Bad" customers by state. Keep in mind this does not include data on the ~24,000 other loans that are current!

As you can see, the only states with 0% "Bad" borrowers are those with fewer than 13 borrowers. If we compare states with multiple hundreds of borrowers, Florida consists of about 40% "Bad" borrowers! That's approximately 167 borrowers out of 418! Texas, New York and Pennsylvania borrowers on the other hand, are pretty diligent with paying back their loans and are boasting that only 20% of their borrowers are naughty. Meanwhile California, Lending Club's home state, has the most borrowers and about 30% of those have either charged off, defaulted or were late on their payments.
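The "Good" vs. "Bad" breakdown described above could be computed along the following lines. This is a sketch on invented data, not the code behind the map: the toy data frame, the Status column name, and the exact status labels are assumptions.

```r
## Toy loan statuses; the column name and status labels are assumed here
toy <- data.frame(
  region = c("florida", "florida", "florida", "texas", "texas", "texas", "texas"),
  Status = c("Fully Paid", "Charged Off", "Current",
             "Fully Paid", "Fully Paid", "Late", "Fully Paid"),
  stringsAsFactors = FALSE
)

good <- c("Fully Paid")
bad <- c("Charged Off", "Default", "Late")

## Keep only the two classes of interest; current loans are left out
closed <- toy[toy$Status %in% c(good, bad), ]
closed$is.bad <- closed$Status %in% bad

## Percentage of "Bad" borrowers and number of borrowers per state
pct.bad <- aggregate(is.bad ~ region, data = closed, FUN = function(x) 100 * mean(x))
n.borrowers <- aggregate(is.bad ~ region, data = closed, FUN = length)
merge(pct.bad, n.borrowers, by = "region", suffixes = c(".pct", ".n"))
```

Keeping the borrower count next to the percentage matters for the reason given above: a 0% or 100% figure means little when only a handful of loans went into it.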

Hang tight for a more quantitative analysis in which we will try to determine which factors other than state of residence are most important in determining what makes a "good" or "bad" borrower.

Data scientists or data composers? Four steps to a symphony of data

by bartev | September 16th, 2011

My 10-year-old son recently asked me what a data scientist does. I’m a visual guy, and like to paint a picture, so I thought about how best to explain this. I liked an explanation I came across a while back at http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html, describing the relationship between data, information and knowledge. I would take it one step further, because that’s what we do here at Dataspora. We take data, and transform it into actionable intelligence.

A data scientist is someone who takes your data and transforms it into actionable intelligence. But how do you explain that to a 10 year old? Well it just so happens, this 10 year old is starting to play the clarinet, so music seemed to be a good choice to use as an example.

Let’s say the note this aspiring Benny Goodman squeaks out of his shiny new instrument is a piece of data. All alone, hanging in the air in my living room, echoing off the walls, it doesn’t mean a whole lot. It’s raw, unadulterated data. Now how about if we do something with it to put it in context? I can play a chord on the piano, and have him play his (squeaky) note, and suddenly we can tell if he’s sharp or flat or in tune. We can measure how long the note is. His note is no longer alone, but has some context. We have some information. We know he was playing middle C. This is the equivalent of step 2. We turned data into information by giving it context.

Is that enough? Perhaps. If that’s all you want to know, sure. But you probably want more. My son has a pretty good ear, and can pick up a rhythm fairly quickly. If he were to play several notes in a particular rhythm, we could have a motif. Wow. That’s more useful than a single note. He’s now taken several notes, with varying durations, added some pauses and made something larger – a motif.  In this analogy, the motif is equivalent to a bit of knowledge. More motifs = more knowledge. If I were a composer, I could combine various motifs to make a symphony (really important knowledge). The more skilled the composer (data scientist), the better the symphony (= better knowledge).

This is great! But we’re still just at the knowledge state, and I want to do something useful. I have a score sitting in front of me that started from a 10 year-old boy blowing on his rented clarinet. What can I do with it? This is where things get interesting. If I were a conductor, I could choose how and when to present this new score. Do I give it to my 5th grade band to play, or pass it on to the Philharmonic? Maybe I change a few things, and prepare it for a string quartet. The choice of action is up to me. This is the final stage – ACTION.

So, some might say that a data scientist fiddles around with data, but I prefer to look at the larger picture. A data scientist transforms data into actionable intelligence, picking and choosing what’s useful and what’s not. It doesn’t really matter if you have the data if you can’t actually do something with it – even if your choice is to do nothing at all.

Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages

by Alan | April 25th, 2011

This week’s guest blogger is Dataspora’s own Antonio Piccolboni. The original post can be found on his personal blog.

On a quest for an elegant and effective map reduce language, I went through a number of options and put together some considerations. And the winner is …

In a couple of blog entries from my personal blog I described some map-reduce algorithms for statistical and graph problems and sketched their implementation using pseudo-code. Pseudo-code has two problems: not everybody agrees on what a statement means and it doesn’t run, so you can’t test it or use it. Real programming languages on the other hand tend to obscure the logic of a program with unnecessary detail and have other issues that hinder readability, which is the reason why people resort to pseudo-code. But there is more to it than just aesthetics. Conciseness of code is related to programming abstractions, constructs that achieve higher generality and remove unnecessary detail; to reuse, whereby the same code is used in different contexts, reducing total program size; and even to testing, in that concise programs can be tested more easily. In sum, shorter programs are better. The elegance of less is hardly my own or a software engineering discovery. As Antoine de Saint-Exupery, French writer and aviator, so eloquently put it:
Read more

Mining the Tar Sands of Big Data

by mike | February 14th, 2011

This post was co-authored by Roger Ehrenberg, founder and managing partner at IA Ventures.  A variation of this post was published by the GigaOm Media network.

The tar sands of Alberta, Canada contain the largest reserves of oil on the planet. However, they remain largely untouched, and for one reason: economics. It costs as much as $40 to extract a barrel of oil from tar sand, and until recently, petroleum companies could not profitably mine these reserves.

In a similar vein, much of the world’s most valuable information is trapped in digital sand, siloed in servers scattered around the globe. These vast expanses of data — streaming from our smart phones, DVRs, and GPS-enabled cars — require mining and distillation before they can be useful.

Oil and sand, information and data share another parallel: in recent years, technology has catalyzed dramatic drops in the costs of extracting each.

Unlike oil reserves, data is an abundant resource on our wired planet. Though much of it is noise, at scale and with the right mining algorithms, this data can yield information that can predict traffic jams, entertainment trends, even flu outbreaks.

These are hints of the promise of big data, which will mature in the coming decade, driven by advances in three principal areas: sensor networks, cloud computing, and machine learning. Read more

The Seven Secrets of Successful Data Scientists

by mike | August 27th, 2010

At O’Reilly’s “Making Data Work” seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data.

What follows is a blog-ified and amended version of that talk, originally entitled “Secrets of Successful Data Scientists.”

1. Choose The Right-Sized Tool

Or, as I like to say, you don’t need a chainsaw to cut butter.

If you’ve got 600 lines of CSV data that you need to work with on a one-time basis, paste it into Excel or Emacs and just do it (yes, curse the Flying Spaghetti Monster, I’ve just endorsed that dull knife called Excel).

In fact, Excel’s and Emacs’ program-by-example keyboard macros can be a fantastic tool for quick and dirty data clean-up.
Read more

The Data Singularity, Part II: Human-Sizing Big Data

by mike | May 27th, 2010

“There are no more promising or important targets for basic scientific research than understanding how human minds… solve problems and make decisions effectively.” – Herbert Simon

In my previous post, I discussed the forces behind what I’m calling The Data Singularity. My basic thesis is that as information generating processes become more frictionless, as humans have been excised from information read-write loops, the velocity and volume of data in the world are increasing, and at an exponential rate.

But where do we go from here? What are the consequences of living in an age where every datum is stored? Where are the bottlenecks, pain points, and opportunities? Which technologies are addressing these?

The upshot is this: a new class of tools is evolving for Big Data because traditional approaches can’t scale up. But these tools share a common goal: scaling down data, and making it human-sized. That’s the “reduce” part of MapReduce, the single statistic from analysis, or the hundred pixel line from one hundred million events.

What’s happening today isn’t entirely new, though. There were echoes of it decades ago, when surveillance satellites first began scanning the globe.

VI. How Satellite Data Paralyzed the CIA

Beginning in the early 1970s the CIA began relying more on global satellite reconnaissance imagery for its intelligence operations. But according to one history, this massive, rich data didn’t accelerate the pace of US intelligence: it slowed it down.

Why? Because confronted with this firehose, CIA leaders attempted to analyze every image, chase every half-formed hypothesis, simply because it was possible. The few good leads were washed out by the many mediocre. The CIA didn’t adjust their decision-making to this new scale, and they were drowned by it.

Many organizations are at a similar inflection point now, with access to massive, rich data about their customers or products. And, like the CIA in the 1970s, they find themselves paralyzed by the possibilities.

VII. People Still Pull the Big Levers

That Big Data paralyzes human decision-makers matters, because humans still make the big decisions. When someone praises a company as being “data-driven”, I’d like to imagine that this is literally true: that the company is nothing more than a few server racks blinking & humming away, slinging bits and earning money.

But no such company exists. What “data-driven” really means is that the executives & employees use data as inputs for making decisions. Companies may be data-fueled, but they’re people-driven.

VIII. Human-sizing Big Data: Filter & Crunch
Read more

The Data Singularity is Here

by mike | March 8th, 2010

In this blog post I’ll attempt to sketch the forces behind what I’m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences.

In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren’t even at the terminal node of action. International cargo shipments, high-frequency stock trades, and genetic diagnoses are all made without us.

Absent humans, these data and decision loops have far less friction; they become constrained only by the costs of bandwidth, computation, and storage– all of which are dropping exponentially.

The result is an explosion of data thrown off from these machine-mediated pipelines, along with data about those flows (and data about that data, and so on). The machines all around us — our smart phones, smart cars, and fee-happy bank accounts — are talking, and increasingly we’re being left out of the conversation.

So whether or not the Singularity is Near, the Data Singularity is here, and its consequences are being felt.

But before I discuss these consequences, I’d like to expand on the premise. The world wasn’t always drowning in this data deluge, so how did we get here?

I. Data at the Speed of Speech
Read more

SQL is Dead. Long Live SQL!

by mike | November 25th, 2009

“The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.”– E.F. Codd, 1969

“Database research has produced a number of good results, but the relational database is not one of them.” – Henry Baker, 1991

Outside of programming language flame wars, few questions raise the hackles of hackers more than: “how should I store my data?”

I will argue here that, as in many such debates, the answer is: it depends on what you’re doing.

While the rise of non-relational data stores serves a much-needed niche, the death of SQL and relational databases has been much exaggerated.  E.F. Codd may be dead, but SQL is alive and well as a simple yet powerful data query language.

3NF Crusaders vs NoSQL Rebels

While the current critique of relational databases shares features of earlier debates (such as in the 1990s, when object-oriented databases were heralded as the next big thing), it has some new twists.  Thus to review the players and their positions:

On our right are the relational curmudgeons, the kind of folks who pen manifestos and crusade against NULL values. They have converted nearly all of big business to their ministry, and have billions of dollars in their coffers to show for it. They insist that data should be stored in terms of its relations, to protect its integrity and facilitate its analysis. Ideally that means third-normal form, but more liberal branches of the church exist.

Read more