<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Dataspora</title>
	<atom:link href="http://www.dataspora.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.dataspora.com</link>
	<description></description>
	<lastBuildDate>Wed, 14 Dec 2011 20:32:52 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Prosper Loan Data Part II of II &#8211; Social Network Analysis: What is the Value of a Friend?</title>
		<link>http://www.dataspora.com/2011/12/prosper-loan-data-part-2-social-network-analysis-what-is-the-value-of-a-friend/</link>
		<comments>http://www.dataspora.com/2011/12/prosper-loan-data-part-2-social-network-analysis-what-is-the-value-of-a-friend/#comments</comments>
		<pubDate>Wed, 14 Dec 2011 16:28:28 +0000</pubDate>
		<dc:creator>Tanya Cashorali</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[social network analysis]]></category>

		<guid isPermaLink="false">http://www.dataspora.com/?p=899</guid>
		<description><![CDATA[Since Prosper provides data on members and their friends who are also members, we can conduct a simple “social network” analysis. What is the value of a friend when getting approved for a loan through Prosper? I first determined how many borrowers were approved and how many borrowers were declined for a loan. Next, I [...]]]></description>
			<content:encoded><![CDATA[<p>Since Prosper provides data on members and their friends who are also members, we can conduct a simple “social network” analysis. What is the value of a friend when getting approved for a loan through Prosper? I first determined how many borrowers were approved and how many borrowers were declined for a loan. Next, I determined how many approved friends each borrower had. From that data, we get the following contingency table of counts:</p>
<p><img src="http://www.dataspora.com/wp-content/uploads/2011/12/Contingency_Table.png" alt="" title="Contingency_Table" width="500" height="125" class="aligncenter size-full wp-image-888" /></p>
<p>Now we can calculate the following probabilities: the probability that you are approved given that you have at least 1 approved friend, or P(A | F), where A = Approved and F = Has at least 1 approved friend. We can also calculate the probability that you are approved given that you have zero approved friends, or P(A | F’). </p>
<p>Following the rules of conditional probability we have P(A | F) = P(A ∩ F) / P(F). </p>
<p><i>Probability of being approved:</i> P(A) = 37212 / 286791 = 0.130<br />
<i>Probability of having at least 1 approved friend:</i> P(F) = 5692 / 286791 = 0.0198<br />
<i>Probability of being approved and having at least 1 approved friend:</i> P(A ∩ F) = 2838 / 286791 = 0.0099<br />
<i>Probability of being approved given that you have at least 1 approved friend:</i><br />
P(A | F) = P(A ∩ F) / P(F) = 2838 / 5692 = 0.4986</p>
<p>Now we will also calculate the probability of being approved given that you have zero approved friends:</p>
<p><i>Probability of being approved:</i> P(A) = 0.130<br />
<i>Probability of having zero approved friends:</i> P(F’) = 281099 / 286791 = 0.980<br />
<i>Probability of being approved and having zero approved friends:</i> P(A ∩ F’) = 34374 / 286791 = 0.120<br />
<i>Probability of being approved given that you have zero approved friends:</i> P(A | F’) = P(A ∩ F’) / P(F’) = 34374 / 281099 = 0.1223</p>
<p>Therefore:<br />
P(A | F) = 0.4986 (about 50% of applicants with at least one approved friend were approved.)<br />
P(A | F’) = 0.1223 (about 12% of applicants with no approved friends were approved.)</p>
<p>We can calculate a risk ratio from these two quantities:<br />
<b>Risk Ratio: P(A | F) / P(A | F’) = 4.08</b></p>
<h5>Members with at least 1 approved friend are 4.08x more likely to be approved for a loan than members who have 0 approved friends</h5>
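<p>The same arithmetic can be checked in R. This is a minimal sketch that works directly from the counts quoted above instead of from rounded intermediate probabilities; the variable names are illustrative and not part of the original analysis.</p>
<p><code><br />
## Counts quoted in the text above<br />
n.total <- 286791            # all borrowers<br />
n.friend <- 5692             # borrowers with at least 1 approved friend<br />
n.approved.friend <- 2838    # approved, with at least 1 approved friend<br />
n.approved.nofriend <- 34374 # approved, with zero approved friends<br />
<br />
## Conditional probabilities P(A | F) and P(A | F')<br />
p.a.given.f <- n.approved.friend / n.friend<br />
p.a.given.notf <- n.approved.nofriend / (n.total - n.friend)<br />
<br />
## Risk ratio<br />
risk.ratio <- p.a.given.f / p.a.given.notf<br />
round(c(p.a.given.f, p.a.given.notf, risk.ratio), 4)<br />
</code></p>
<p>Dividing the counts directly avoids compounding rounding error, and the ratio still rounds to 4.08.</p>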
<p><a href="http://www.dataspora.com/wp-content/uploads/2011/12/iStock_000011995388XSmall.jpg"><img src="http://www.dataspora.com/wp-content/uploads/2011/12/iStock_000011995388XSmall.jpg" alt="" title="iStock_000011995388XSmall" width="300" height="250" class="alignleft size-full wp-image-940" /></a></p>
<p>While this is an interesting statement, it does not show that having an approved friend <b>causes</b> approval for a loan, nor that being approved for a loan <b>causes</b> one to have an approved friend. It is simply an observation of two <b>correlated</b> variables. If I had to bet on a direction, I would guess that being approved for a loan leads to having approved friends, as a result of word-of-mouth referrals.</p>
<p>Dataspora leverages a proprietary platform that can <i>distinguish correlation from causality</i> between variables from massive data sets. This complex yet extremely important notion of causality vs. correlation applied to business intelligence will be discussed in further detail in a future post. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2011/12/prosper-loan-data-part-2-social-network-analysis-what-is-the-value-of-a-friend/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Visualization of Prosper.com’s Loan Data Part I of II &#8211; Compare and Contrast with Lending Club</title>
		<link>http://www.dataspora.com/2011/12/visualization-of-prosper-com%e2%80%99s-loan-data-part-i-compare-and-contrast-with-lending-club/</link>
		<comments>http://www.dataspora.com/2011/12/visualization-of-prosper-com%e2%80%99s-loan-data-part-i-compare-and-contrast-with-lending-club/#comments</comments>
		<pubDate>Tue, 06 Dec 2011 18:34:03 +0000</pubDate>
		<dc:creator>Tanya Cashorali</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[dataviz]]></category>
		<category><![CDATA[R]]></category>
		<category><![CDATA[34374]]></category>
		<category><![CDATA[data visualization]]></category>
		<category><![CDATA[investing]]></category>
		<category><![CDATA[lending club]]></category>
		<category><![CDATA[peer-to-peer lending]]></category>
		<category><![CDATA[prosper]]></category>

		<guid isPermaLink="false">http://www.dataspora.com/?p=823</guid>
		<description><![CDATA[Due to the positive feedback received on this post I thought I would re-create the analysis on another peer-to-peer lending dataset, courtesy of Prosper.com. You can access the Prosper Marketplace data via an API or by simply downloading XML files that are updated nightly http://www.prosper.com/tools/. If you are going to follow the route I took [...]]]></description>
			<content:encoded><![CDATA[<p>Due to the positive feedback received on <a href = "http://www.dataspora.com/2011/10/mining-lending-clubs-goldmine-of-loan-data-part-i-of-ii-visualizations-by-state/" target="_blank">this post</a> I thought I would re-create the analysis on another peer-to-peer lending dataset, courtesy of Prosper.com. You can access the Prosper Marketplace data via an API or by simply downloading XML files that are updated nightly <a href = "http://www.prosper.com/tools/" target="_blank">http://www.prosper.com/tools/</a>. </p>
<p>If you are going to follow the route I took and download the latest XML file, ProsperDataExport_xml.zip, you will find this utility helpful in converting the XML files to CSVs: <a href = "http://www.rateladder.com/2008/02/22/convert-prosper-xml-to-csv/" target="_blank">Convert Prosper XML to CSV</a></p>
<p>Once you have downloaded the .jar file, run the following command (changing the parameters, of course!):<br />
<code>java -jar ProsperXMLtoCSV.jar ProsperXMLFileLocation CSVDestinationDirectory</code></p>
<p>Similar to Lending Club, Prosper provides loan-level data such as interest rate, amount funded/requested, borrower state, borrower debt to income ratio, etc. However, Prosper also provides additional information regarding their user base and loan performance history.  This information includes extended credit profiles of users, groups that users belong to, social networks within the user base and even retroscores, or how a loan would be rated by Prosper under a new heuristic given macroeconomic shifts over time.</p>
<p>Let’s jump right into the visualizations by state:</p>
<p><code><br />
library(ggplot2)<br />
library(maps)<br />
</code></p>
<p><code><br />
## Warning: this is a very large dataset that required ~10 minutes<br />
## to read into R on a fast 8-core Xeon server.<br />
loans <- read.csv("Loans.CSV", header=TRUE)<br />
listings <- read.csv("Listings.CSV", header=TRUE)<br />
</code></p>
<p><code><br />
## Obtain the active loans from the Listings file, since it<br />
## contains more detailed information than the Loans file<br />
listings.match <- listings[match(loans$ListingKey, listings$Key),]<br />
</code></p>
<p><code><br />
listings.match$BorrowerState <- as.character(listings.match$BorrowerState)<br />
loans <- listings.match<br />
states <- map_data("state")<br />
</code></p>
<p><code><br />
## Change state abbreviations to full names so we can merge our<br />
## data frames together<br />
state.names <- unlist(sapply(loans$BorrowerState, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]) )<br />
loans$BorrowerState <- tolower(state.names)<br />
## Rename by column name instead of by position<br />
colnames(loans)[colnames(loans) == "BorrowerState"] <- "region"<br />
state.counts <- data.frame(table(loans$region))<br />
colnames(state.counts) <- c("region", "Num.Loans")<br />
result<-merge(state.counts, states, by=c("region"))<br />
result <- result[order(result$order),]<br />
</code></p>
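<p>As an aside, the abbreviation lookup above can also be written with match(), which compares whole strings instead of treating each abbreviation as a regular expression. This is only a sketch under stated assumptions: the helper name abbr.to.region is hypothetical, and unlike the code above it leaves unrecognized abbreviations as NA instead of labeling them as the District of Columbia.</p>
<p><code><br />
## Hypothetical helper: map two-letter abbreviations to lower-case state names<br />
abbr.to.region <- function(abbr) {<br />
	full <- state.name[match(abbr, state.abb)]  # NA when there is no exact match<br />
	full[abbr %in% "DC"] <- "District of Columbia"  # DC is not in state.abb<br />
	tolower(full)<br />
}<br />
<br />
abbr.to.region(c("CA", "NY", "DC", "XX"))<br />
</code></p>
<p>Because match() is vectorized, the whole column is converted in one call and the result always has the same length as the input, even when some values are blank or missing.</p>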
<p><code><br />
p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75) + ggtitle('Number of Issued Loans by State')<br />
print(p)<br />
</code></p>
<p><a href = "http://www.dataspora.com/wp-content/uploads/2011/12/Prosper.Num_.Loans_.By_.State_.png"><img src="http://www.dataspora.com/wp-content/uploads/2011/12/Prosper.Num_.Loans_.By_.State_.png" alt="Click for Larger Image" title="Click for Larger Image" width="600" height="350" class="aligncenter size-full wp-image-841" /></a></p>
<p>It comes as no surprise that a majority of issued loans originate in California. As with Lending Club, Prosper is a San Francisco-based peer-to-peer lending company. </p>
<p>Now we will take the log of the number of loans issued by state and compare Prosper’s market reach with Lending Club’s. </p>
<p><code><br />
p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75) + ggtitle('Log Number of Issued Loans by State')<br />
print(p)<br />
</code></p>
<h2>Prosper</h2>
<p><a href="http://www.dataspora.com/wp-content/uploads/2011/12/Prosper.Log_.Num_.Loans_.by_.State_.png"><img src="http://www.dataspora.com/wp-content/uploads/2011/12/Prosper.Log_.Num_.Loans_.by_.State_.png" alt="Click for Larger Image" title="Click for Larger Image" width="550" height="350" class="aligncenter size-full wp-image-848" /></a></p>
<h2>Lending Club</h2>
<p><a href="http://www.dataspora.com/wp-content/uploads/2011/12/Log.Num_.Issued.Loans_.by_.State_.png"><img src="http://www.dataspora.com/wp-content/uploads/2011/12/Log.Num_.Issued.Loans_.by_.State_.png" alt="Click for Larger Image" title="Click for Larger Image" width="550" height="350" class="aligncenter size-full wp-image-911" /></a></p>
<p>The two maps are extremely similar. Both lending companies issue the most loans in California, Texas and Florida. There are some minor differences such as Lending Club issuing more loans than Prosper in Wyoming and Montana.</p>
<p>Instead of the Monthly Income by State map that I created for Lending Club, we will observe Debt to Income Ratios by state for both Prosper borrowers and Lending Club borrowers. </p>
<p><code><br />
## Aggregate median debt to income ratio by state<br />
debt.to.income <-aggregate(loans$DebtToIncomeRatio, by=list(loans$region), function(x) median(x, na.rm=TRUE))<br />
colnames(debt.to.income) <- c("region", "debt.to.income")<br />
result <- merge(debt.to.income, states, by="region")<br />
result <- result[order(result$order),]<br />
</code></p>
<p><code><br />
p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=debt.to.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75) + labs(fill="Debt to Income Ratio") + ggtitle('Median Debt to Income Ratio of Borrowers by State')<br />
print(p)<br />
</code></p>
<h2>Prosper</h2>
<p><a href="http://www.dataspora.com/wp-content/uploads/2011/12/Prosper.Median.Debt_.to_.Income.Ratio_.By_.State2_.png"><img src="http://www.dataspora.com/wp-content/uploads/2011/12/Prosper.Median.Debt_.to_.Income.Ratio_.By_.State2_.png" alt="Click for Larger Image" title="Click for Larger Image" width="550" height="350" class="aligncenter size-full wp-image-903" /></a></p>
<h2>Lending Club</h2>
<p><a href="http://www.dataspora.com/wp-content/uploads/2011/12/LC.Median.Debt_.to_.Income.Ratio_.of_.Borrowers.by_.State_.png"><img src="http://www.dataspora.com/wp-content/uploads/2011/12/LC.Median.Debt_.to_.Income.Ratio_.of_.Borrowers.by_.State_.png" alt="Click for Larger Image" title="Click for Larger Image" width="550" height="350" class="aligncenter size-full wp-image-907" /></a></p>
<p>Does anyone want to start pointing fingers for the United States debt crisis yet? The states that Prosper loans to the most are also the ones with the lowest Debt to Income Ratios. New Yorkers, in particular, have the lowest median Debt to Income Ratio. Lending Club seems to have much more homogeneous Debt to Income Ratios across states. We can compare the distributions of the two companies' Debt to Income Ratios with a call to ggplot (after a bit of pre-processing that I left out to save space on this page):</p>
<p><code><br />
ggplot(combined, aes(x=DebtToIncomeRatio)) + geom_histogram() + facet_grid(Company ~ .)<br />
</code></p>
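<p>For reference, the pre-processing step might look something like this. It is a sketch on made-up numbers, not the code used for the plot: it assumes that Prosper stores the ratio as a fraction in DebtToIncomeRatio and that Lending Club stores it as a percentage in Debt.To.Income.Ratio (as in the Lending Club post), and the toy data frames are purely illustrative.</p>
<p><code><br />
## Toy stand-ins for the two loan tables (values are made up)<br />
prosper.toy <- data.frame(DebtToIncomeRatio = c(0.12, 0.35, 0.48))<br />
lc.toy <- data.frame(Debt.To.Income.Ratio = c(10.5, 22.0, 29.9))<br />
<br />
## Stack both companies into one long data frame on a common 0-1 scale<br />
combined <- rbind(<br />
	data.frame(Company = "Prosper", DebtToIncomeRatio = prosper.toy$DebtToIncomeRatio),<br />
	data.frame(Company = "Lending Club", DebtToIncomeRatio = lc.toy$Debt.To.Income.Ratio / 100)<br />
)<br />
</code></p>
<p>With one row per loan and a Company column, facet_grid(Company ~ .) can draw one histogram per company.</p>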
<p><a href="http://www.dataspora.com/wp-content/uploads/2011/12/Distributions.of_.Debt_.To_.Income.Ratios.By_.Company.png"><img src="http://www.dataspora.com/wp-content/uploads/2011/12/Distributions.of_.Debt_.To_.Income.Ratios.By_.Company.png" alt="Click for Larger Image" title="Click for Larger Image" width="550" height="350" class="aligncenter size-full wp-image-914" /></a></p>
<p>It appears that Lending Club has a hard cut-off at a Debt to Income Ratio of 0.30 for borrowers. Note that these data cover all loans since the inception of both companies. Prosper implemented stricter borrowing guidelines and interest rates after 2009, which can be seen in the animation below.</p>
<p><code><br />
issue.year <- substr(loans$StartDate, 1, 4)<br />
loans$Issued.Year <- issue.year<br />
interest.by.year.by.state<-aggregate(loans$BorrowerRate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE))<br />
years <- c("2006", "2007", "2008", "2009", "2010", "2011")<br />
colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate")<br />
interest.by.year.by.state$interest.rate <- interest.by.year.by.state$interest.rate * 100<br />
</code><br />
<code><br />
result <- merge(interest.by.year.by.state, states, by="region")<br />
result <- result[order(result$order),]<br />
</code><br />
<code><br />
#Calculate the lower and upper bounds for the gradient<br />
lower <- floor(summary(interest.by.year.by.state$interest.rate)[1])[[1]]<br />
upper <- ceiling(summary(interest.by.year.by.state$interest.rate)[6])[[1]]<br />
</code><br />
<code><br />
states2 <- data.frame(map("state", plot=FALSE)[c("x","y")])<br />
animateMap <- function(year){<br />
	result.year <- result[grep(year, result$year),]<br />
	usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate))<br />
	print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + ggtitle(paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab(""))<br />
}<br />
</code></p>
<p><code><br />
library(animation)<br />
saveMovie(for (year in years) animateMap(year), clean = TRUE)<br />
</code></p>
<p><img src="http://www.dataspora.com/wp-content/uploads/2011/12/animation.gif" alt="" title="animation" width="480" height="480" class="aligncenter size-full wp-image-871" /></p>
<p>Notice the interest rates are the most varied in 2006, the year of Prosper’s inception.<br />
It is also worth noting that the median interest rates for borrowers soared after 2009, when Prosper implemented stricter guidelines for borrowers, which also resulted in lower default rates.</p>
<p>Stay tuned for a "social network" analysis of Prosper.com's member data coming up in Part II!</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2011/12/visualization-of-prosper-com%e2%80%99s-loan-data-part-i-compare-and-contrast-with-lending-club/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Mining Lending Club&#8217;s Goldmine of Loan Data Part I of II &#8211; Visualizations by State</title>
		<link>http://www.dataspora.com/2011/10/mining-lending-clubs-goldmine-of-loan-data-part-i-of-ii-visualizations-by-state/</link>
		<comments>http://www.dataspora.com/2011/10/mining-lending-clubs-goldmine-of-loan-data-part-i-of-ii-visualizations-by-state/#comments</comments>
		<pubDate>Fri, 14 Oct 2011 13:39:54 +0000</pubDate>
		<dc:creator>Tanya Cashorali</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[dataviz]]></category>
		<category><![CDATA[R]]></category>

		<guid isPermaLink="false">http://www.dataspora.com/?p=532</guid>
		<description><![CDATA[I have a few friends that keep bragging about their 14% annual returns by investing their money with Lending Club, a peer-to-peer lending service that cuts out the complexities and difficulties of getting approved for a loan through a bank. To give you an idea of the sheer amount of volume Lending Club has been [...]]]></description>
			<content:encoded><![CDATA[<p>I have a few friends who keep bragging about their 14% annual returns from investing with <a href="http://www.lendingclub.com/">Lending Club</a>, a peer-to-peer lending service that cuts out the complexities and difficulties of getting approved for a loan through a bank. To give you an idea of the sheer volume Lending Club has been dealing with, here’s a snapshot of the Company Statistics as of 10/14/2011:</p>
<ul>
<li>Loans funded to date: <strong>$387,043,375</strong></li>
<li>Loans funded last month: <strong>$24,945,400</strong></li>
<li>Interest paid to investors since inception: <strong>$32,135,688</strong></li>
</ul>
<p>Currently Lending Club is boasting that 91% of investors earn between 6% and 18%. Now of course, higher returns are correlated with higher risk. You can choose to diversify your investment across hundreds of different loans with different credit grades &#8211; the worse the credit grade, the higher the return percentage, and the higher the risk. I thought it would be interesting to investigate Lending Club a bit more, so I navigated over to their site and found something that only a data scientist would consider to be gold:</p>
<p><a href="https://www.lendingclub.com/info/download-data.action" target="_blank">https://www.lendingclub.com/info/download-data.action</a></p>
<p><img src="http://www.dataspora.com/wp-content/uploads/2011/10/gold-coin-treasure-chest.jpg" alt="" title="treasure" width="250" height="250" class="alignleft size-full wp-image-427" /><br />
Yes, they have provided complete loan data in CSV format for all of us data geeks to devour.  The data include the current loan status (Current, Late, Fully Paid, etc.), credit grades, interest rates, loan purposes, and all sorts of other juicy tidbits of borrower information.</p>
<p>I downloaded the data and quickly determined that this csv file contained information on 37,122 loans. Of course the first thing I did was fire up R:</p>
<p><code><br />
library(ggplot2)<br />
library(maps)<br />
loans <- read.csv("LoanStats.csv", header=TRUE, skip = 1)<br />
</code></p>
<p>One of the <a href="http://www.dataspora.com/2009/05/sexy-data-geeks/" target="_blank">three sexy skills of the data geek</a> is data munging, otherwise known as suffering. This post will briefly touch on 2 of the 3 skills: Data Munging and Data Visualization. But first, we need to get the data into a format that our tool, in this case R, can handle. We’ll strip some percentage signs and change a factor to a character string.</p>
<p><code><br />
loans$Debt.To.Income.Ratio <- as.numeric(gsub("%", "", loans$Debt.To.Income.Ratio))<br />
loans$State <- as.character(loans$State)<br />
loans$Interest.Rate <- as.numeric(gsub("%", "", loans$Interest.Rate))<br />
loans$Revolving.Line.Utilization <- as.numeric(gsub("%", "", loans$Revolving.Line.Utilization))<br />
</code></p>
<p>Conveniently, map_data("state") will get all of the latitude and longitude data for each state so that we can draw a map of the U.S.</p>
<p><code><br />
states <- map_data("state")<br />
## Drop rows with a blank state; unlike -which(...), this also works when no rows are blank<br />
loans <- loans[loans$State != "",]<br />
</code></p>
<p><code><br />
#Change state abbreviations to full names so we can merge our data frames together<br />
state.names <- unlist(sapply(loans$State, function(x) if(length(state.name[grep(x, state.abb)]) == 0) "District of Columbia" else state.name[grep(x, state.abb)]) )<br />
loans$State <- tolower(state.names)<br />
colnames(loans)[colnames(loans) == "State"] <- "region"<br />
</code></p>
<p>Now we will determine the number of loans by state and merge this data.frame with the states data so we can plot this all out on a map using ggplot2.</p>
<p><code><br />
state.counts <- data.frame(table(loans$region))<br />
colnames(state.counts) <- c("region", "Num.Loans")<br />
result<-merge(state.counts, states, by=c("region"))<br />
result <- result[order(result$order),]<br />
</code></p>
<p><code><br />
p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue") + coord_equal(ratio=1.75)<br />
print(p)<br />
</code></p>
<p><img src="http://www.dataspora.com/wp-content/uploads/2011/10/Num.Loans_.by_.State_.png" alt="" title="Num.Loans.by.State" width="550" height="350" class="alignleft size-full wp-image-649" /></p>
<p>It doesn’t take a geography whiz to realize that a state is missing from this map! Turns out, Lending Club has zero borrowers in North Dakota as of 10/14/2011. Since the number of loans can range anywhere from 3 in Maine to 6,452 in California, we can also plot the log of the total number of loans in order to more easily compare each state's loan activity visually. Why don’t we also add poor North Dakota onto our map? We will assign its Num.Loans variable a value of 1 since we will be taking the log for our next visualization and log(1) = 0.</p>
<p><code><br />
nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),]<br />
nd$Num.Loans <- 1<br />
result <- rbind(result, nd)<br />
result <- result[order(result$order),]<br />
</code></p>
<p><code><br />
p <- ggplot(result, aes(x=long, y=lat, group=group, fill=Num.Loans)) + geom_polygon() + scale_fill_gradient(low = "yellow", high = "blue", trans="log") + coord_equal(ratio=1.75)<br />
print (p)<br />
</code></p>
<p><img src="http://www.dataspora.com/wp-content/uploads/2011/10/Log.Num_.Loans_.by_.State_.png" alt="" title="Log.Num.Loans.by.State" width="550" height="350" class="alignleft size-full wp-image-648" /></p>
<p>That’s better. We can see that most of Lending Club’s borrowers are from CA, which makes sense given that Lending Club is headquartered in San Francisco. They also have vast reach across Texas, Florida, New York, a good portion of the east coast, and states bordering California. They have the fewest borrowers in Maine and parts of the West to Midwest.</p>
<p>Now let’s explore some of the other variables and project them onto our map. We will look at the median monthly incomes by state.</p>
<p><code><br />
monthly.income <-aggregate(loans$Monthly.Income, by=list(loans$region), function(x) median(x, na.rm=TRUE))<br />
colnames(monthly.income) <- c("region", "monthly.income")<br />
result <- merge(monthly.income, states, by="region")<br />
nd<-map_data("state")[grep("north dakota", map_data("state")[,5]),]<br />
nd$monthly.income <- 0<br />
result <- rbind(result, nd)<br />
result <- result[order(result$order),]<br />
</code></p>
<p><code><br />
p <- ggplot(result, aes(x=long, y=lat)) + geom_polygon(data=result, aes(x=long, y=lat, group = group, fill=monthly.income)) + scale_fill_gradient(low="yellow", high="purple") + coord_equal(ratio=1.75)<br />
print(p)<br />
</code></p>
<p><img src="http://www.dataspora.com/wp-content/uploads/2011/10/Monthly.Income.by_.State_.png" alt="" title="Monthly.Income.by.State" width="550" height="350" class="alignleft size-full wp-image-647" /></p>
<p>You may recall that Lending Club has only issued 3 loans in Maine. This means Maine’s median is based on only 3 data points, which is a very small sample. We can add any type of information we would like to the center of each state on our map. Let’s use geom_text() to print the total number of loans at the center of each state, to give this information a little more context.</p>
<p><code><br />
state.info<-data.frame(region = tolower(state.name), long=state.center$x, lat=state.center$y)<br />
state.info <- subset(state.info, !region %in% c("alaska", "hawaii"))<br />
totals <- data.frame(table(loans$region))<br />
colnames(totals) <- c("region", "total")<br />
state.info <- merge(state.info, totals)<br />
</code><br />
<code><br />
## size is a fixed setting, so it belongs outside aes(); the value 3 is an arbitrary choice<br />
p + geom_text(data=state.info, aes(label=total), size=3)<br />
</code></p>
<p><img src="http://www.dataspora.com/wp-content/uploads/2011/10/Monthly.Income.by_.State_.Labels.png" alt="" title="Monthly.Income.by.State.Labels" width="550" height="350" class="alignleft size-full wp-image-668" /></p>
<p>I have one more trick up my sleeve, which I hacked together thanks to <a href="http://www.r-bloggers.com/visualizing-growth-of-a-retail-chain/" target="_blank">this post</a> from r-bloggers.com. We will look at how the median interest rate for loans issued by Lending Club has varied by state over the past 4 years.</p>
<p><code><br />
library(animation)<br />
</code></p>
<p><code><br />
#Pull out just the year from the Issued.Date for each loan<br />
loans$Issued.Year <- substr(loans$Issued.Date, 1, 4)<br />
interest.by.year.by.state<-aggregate(loans$Interest.Rate,by=list(loans$Issued.Year, loans$region), function(x) median(x, na.rm=TRUE))<br />
years <- c("2007", "2008", "2009", "2010", "2011")<br />
colnames(interest.by.year.by.state) <- c("year", "region", "interest.rate")<br />
</code></p>
<p><code><br />
result <- merge(interest.by.year.by.state, states, by="region")<br />
result <- result[order(result$order),]<br />
</code></p>
<p><code><br />
#Calculate the lower and upper bounds for the gradient<br />
lower <- floor(summary(interest.by.year.by.state$interest.rate)[1])[[1]]<br />
upper <- ceiling(summary(interest.by.year.by.state$interest.rate)[6])[[1]]<br />
</code></p>
<p><code><br />
states2 <- data.frame(map("state", plot=FALSE)[c("x","y")])<br />
animateMap <- function(year){<br />
	result.year <- result[grep(year, result$year),]<br />
	usamap <- ggplot(data=states2, aes(x=x, y=y)) + geom_path()+ geom_polygon(data=result.year, aes(x=long, y=lat, group = group, fill=interest.rate))<br />
	print(usamap + scale_fill_gradient(low="yellow", high="blue", limits=c(lower, upper)) + coord_equal(ratio=2.00) + ggtitle(paste('Median Interest Rates for all Issued Loans by State in', year)) + labs(fill="Interest Rate (%)") + xlab("") + ylab(""))<br />
}<br />
saveMovie(for (year in years) animateMap(year), clean = TRUE)<br />
</code></p>
<p><img src="http://www.dataspora.com/wp-content/uploads/2011/10/animation.gif" alt="" title="animation" width="515" height="480" class="alignleft size-full wp-image-630" /></p>
<p>If we observe our award-winning animated GIF created in R, we can see that the interest rates that Lending Club calculated for issued loans in 2007, the year of its inception, were much more heterogeneous than they are now. They are the highest in 2009 at around 14% across a majority of the U.S. and now they are more constant, hovering around 11% for most states. States without color simply indicate that there were no loans issued by Lending Club in that state for the given year.</p>
<p>What are some interesting visualizations you have come up with using Lending Club’s trove of borrower data? </p>
<p><strong>UPDATE</strong></p>
<p>Due to high demand, I have created a map of "Good" vs. "Bad" borrowers broken down by state. Since some states have many more borrowers than others, I also included the total number of borrowers that went into the ratio, depicted as a number on each state's center. I filtered the original loan data down to two classes of borrowers: "Good" borrowers are those whose loans were fully paid, and "Bad" borrowers are those who either charged off, defaulted, or were late on payments. This resulted in 1,542 "Bad" borrowers and 4,647 "Good" borrowers. I then simply calculated the percentage of "Bad" borrowers by state. Keep in mind this does not include data on the ~24,000 other loans that are current! Click the image to see a larger version.</p>
<p><a href="http://www.dataspora.com/wp-content/uploads/2011/10/Percentage.Bad_.by_.State_1.png"><img src="http://www.dataspora.com/wp-content/uploads/2011/10/Percentage.Bad_.by_.State_1.png" alt="" title="Percentage.Bad.by.State" width="550" height="350" class="alignleft size-full wp-image-674" /></a></p>
<p>As you can see, the only states with 0% "Bad" borrowers are those with fewer than 13 borrowers. If we compare states with multiple hundreds of borrowers, Florida consists of about 40% "Bad" borrowers! That's approximately 167 borrowers out of 418! Texas, New York and Pennsylvania borrowers on the other hand, are pretty diligent with paying back their loans and are boasting that only 20% of their borrowers are naughty. Meanwhile California, Lending Club's home state, has the most borrowers and about 30% of those have either charged off, defaulted or were late on their payments. </p>
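<p>For anyone who wants to reproduce this kind of breakdown, the filtering and percentage calculation might look roughly like the sketch below. The column name Status, the exact status labels, and the toy data are assumptions made for illustration; check them against the actual CSV.</p>
<p><code><br />
## Toy data (made up); the Status labels are assumed, not taken from the real file<br />
toy <- data.frame(<br />
	region = c("florida", "florida", "florida", "texas", "texas", "maine"),<br />
	Status = c("Fully Paid", "Charged Off", "Late (31-120 days)", "Fully Paid", "Default", "Current"),<br />
	stringsAsFactors = FALSE<br />
)<br />
<br />
## "Good" = fully paid; "Bad" = charged off, defaulted, or late<br />
is.good <- toy$Status == "Fully Paid"<br />
is.bad <- toy$Status %in% c("Charged Off", "Default") | grepl("^Late", toy$Status)<br />
<br />
## Keep only the two classes, then compute the percentage of "Bad" loans per state<br />
closed <- toy[is.good | is.bad, ]<br />
closed$bad <- is.bad[is.good | is.bad]<br />
pct.bad <- aggregate(bad ~ region, data = closed, FUN = function(x) 100 * mean(x))<br />
pct.bad<br />
</code></p>
<p>Loans that are still current fall into neither class, so they drop out before the percentages are computed, which is why a state with only current loans would not appear in the result.</p>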
<p>Hang tight for a more quantitative analysis in which we will try to determine which factors other than state of residence are most important in determining what makes a "good" or "bad" borrower. </p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2011/10/mining-lending-clubs-goldmine-of-loan-data-part-i-of-ii-visualizations-by-state/feed/</wfw:commentRss>
		<slash:comments>15</slash:comments>
		</item>
		<item>
		<title>Data scientists or data composers? Four steps to a symphony of data</title>
		<link>http://www.dataspora.com/2011/09/data-scientists-or-data-composers/</link>
		<comments>http://www.dataspora.com/2011/09/data-scientists-or-data-composers/#comments</comments>
		<pubDate>Fri, 16 Sep 2011 23:10:16 +0000</pubDate>
		<dc:creator>bartev</dc:creator>
				<category><![CDATA[analytics]]></category>
		<category><![CDATA[Big data]]></category>
		<category><![CDATA[data]]></category>
		<category><![CDATA[data scientist]]></category>

		<guid isPermaLink="false">http://www.dataspora.com/?p=341</guid>
		<description><![CDATA[My 10-year-old son recently asked me what a data scientist does. I’m a visual guy, and like to paint a picture, so I thought about how best to explain this. I liked an explanation I came across a while back at http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html, describing the relationship between data, information and knowledge. I would take it one step [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.dataspora.com/wp-content/uploads/2011/09/Clarinet.png"><img class="alignleft" title="Clarinet" src="http://www.dataspora.com/wp-content/uploads/2011/09/Clarinet.png" alt="" width="250" height="165" /></a></p>
<p><span class="Apple-style-span" style="font-size: 13px; font-weight: normal;">My 10-year-old son recently asked me what a data scientist does. I’m a visual guy, and like to paint a picture, so I thought about how best to explain this. I liked an explanation I came across a while back at <a href="http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html">http://hadoop-karma.blogspot.com/2010/03/how-much-data-is-generated-on-internet.html</a>, describing the relationship between data, information and knowledge. I would take it one step further, because that’s what we do here at Dataspora. We take data, and transform it into actionable intelligence.</span></p>
<p style="text-align: center;"><a href="http://www.dataspora.com/wp-content/uploads/2011/09/FourStepsToAction.png"><img class="aligncenter size-full wp-image-346" title="FourStepsToAction" src="http://www.dataspora.com/wp-content/uploads/2011/09/FourStepsToAction.png" alt="" width="511" height="74" /></a></p>
<p><strong>A data scientist is someone who takes your data and transforms it into actionable intelligence</strong>. But how do you explain that to a 10 year old? Well it just so happens, this 10 year old is starting to play the clarinet, so music seemed to be a good choice to use as an example.</p>
<p>Let’s say the note this aspiring Benny Goodman squeaks out of his shiny new instrument is a piece of <strong>data</strong>. All alone, hanging in the air in my living room, echoing off the walls, it doesn’t mean a whole lot. It’s raw, unadulterated data. Now how about if we do something with it to put it in context? I can play a chord on the piano, and have him play his (squeaky) note, and suddenly we can tell if he’s sharp or flat or in tune. We can measure how long the note is. His note is no longer alone, but has some context. We have some <strong>information</strong>. We know he was playing middle C. This is the equivalent of step 2. We turned data into information by giving it context.</p>
<p>Is that enough? Perhaps. If that’s all you want to know, sure. But you probably want more. My son has a pretty good ear, and can pick up a rhythm fairly quickly. If he were to play several notes in a particular rhythm, we could have a motif. Wow. That’s more useful than a single note. He’s now taken several notes, with varying durations, added some pauses and made something larger – a motif.  In this analogy, the motif is equivalent to a bit of <strong>knowledge</strong>. More motifs = more knowledge. If I were a composer, I could combine various motifs to make a symphony (really important knowledge). The more skilled the composer (data scientist), the better the symphony (= better knowledge).</p>
<p><a href="http://www.dataspora.com/wp-content/uploads/2011/09/Symphony.png"><img class="alignright" title="Symphony" src="http://www.dataspora.com/wp-content/uploads/2011/09/Symphony.png" alt="" width="192" height="144" /></a>This is great! But we’re still just at the knowledge state, and I want to do something useful. I have a score sitting in front of me that started from a 10 year-old boy blowing on his rented clarinet. What can I do with it? This is where things get interesting. If I were a conductor, I could choose how and when to present this new score. Do I give it to my 5<sup>th</sup> grade band to play, or pass it on to the Philharmonic? Maybe I change a few things, and prepare it for a string quartet. The choice of action is up to me. This is the final stage – <strong>ACTION</strong>.</p>
<p>So, some might say that a data scientist fiddles around with data, but I prefer to look at the larger picture. A data scientist transforms data into <strong>actionable intelligence</strong>, picking and choosing what’s useful and what’s not. It doesn’t really matter if you have the data if you can’t actually do something with it – even if your choice is to do nothing at all.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2011/09/data-scientists-or-data-composers/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Pigs, Bees, and Elephants: A Comparison of Eight MapReduce Languages</title>
		<link>http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/</link>
		<comments>http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/#comments</comments>
		<pubDate>Mon, 25 Apr 2011 19:48:21 +0000</pubDate>
		<dc:creator>Alan</dc:creator>
				<category><![CDATA[computing]]></category>

		<guid isPermaLink="false">http://new.dataspora.com/?p=193</guid>
		<description><![CDATA[This week&#8217;s guest blogger is Dataspora&#8217;s own Antonio Piccolboni. The original post can be found on his personal blog. On a quest for an elegant and effective map reduce language, I went through a number of options and put together some considerations. And the winner is … In a couple of blog entries from my [...]]]></description>
			<content:encoded><![CDATA[<p>This week&#8217;s guest blogger is Dataspora&#8217;s own Antonio Piccolboni. The original post can be found on his <a href="http://blog.piccolboni.info/2011/04/looking-for-map-reduce-language.html">personal blog</a>.</p>
<p>On a quest for an elegant and effective map reduce language, I went through a number of options and put together some considerations. And the winner is …</p>
<p>In a couple of blog entries from my personal blog I described some map-reduce algorithms for <a href="http://blog.piccolboni.info/2010/07/algorithm-for-sample-quantiles-in-map.html">statistical</a> and <a href="http://blog.piccolboni.info/2010/07/map-reduce-algorithm-for-connected.html">graph</a> problems and sketched their implementation using pseudo-code. Pseudo-code has two problems: not everybody agrees on what a statement means and it doesn&#8217;t run, so you can&#8217;t test it or use it. Real programming languages on the other hand tend to obscure the logic of a program with unnecessary detail and have other issues that hinder readability, the reason why people resort to pseudo-code. But there is more to it than just aesthetics. Conciseness of code is related to programming abstractions, constructs that achieve higher generality and remove unnecessary detail; to reuse, whereby the same code is used in different contexts, reducing total program size; and even <a href="http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.34.567">testing</a>, that is concise programs can be tested more easily. In sum, shorter programs are better. The elegance of less is hardly my own or a software engineering discovery. As Antoine de Saint-Exupery, French writer and aviator, so eloquently put it :<br />
<span id="more-193"></span></p>
<blockquote><p>Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away.</p></blockquote>
<p>Unfortunately, in some circles, dull, predictable, repetitive code is considered simpler than short and to the point code, or at least tolerable. From java.util.Arrays:</p>
<blockquote><p>478     /*<br />
479      * The code for each of the seven primitive types is largely identical.<br />
480      * C&#8217;est la vie.<br />
481      */</p></blockquote>
<p>In this case, repetition gets a free pass in exchange for efficiency. Very expressive languages tend to exact a higher toll on resources, and the different map-reduce environments we will look into are no exception.<br />
I will present for each language or library the implementation of a word count program, lifted from its documentation, since this has become sort of the &#8220;Hello World&#8221; for map reduce. I don&#8217;t think such a simple program is the ultimate test of the quality of a language, so this is just to give a taste of the language. What I am most interested in is:</p>
<ul>
<li>Can I write reasonably concise, abstract programs in this language or library?</li>
<li>Can I write the &#8220;inside&#8221; of map reduce, that is the code for the mapper and the reducer, as well as the &#8220;outside&#8221;, the logic that decides which map reduce jobs to run?</li>
<li>Is it general? Can I write any map-reduce program, including programs that require multiple map-reduce jobs, including the case of a data dependent number and type of jobs?</li>
</ul>
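<p>To make these questions concrete, here is a minimal sketch in plain Python, with no Hadoop involved. The run_job function is a hypothetical in-memory stand-in for the framework, not part of any library discussed here; mapper and reducer are the &#8220;inside&#8221;, and the last lines are the &#8220;outside&#8221;.</p>
<pre class="brush: python">from collections import defaultdict

def mapper(_key, line):
    # "Inside" of the job: emit (word, 1) for every word in a line.
    for word in line.split():
        yield word, 1

def reducer(word, counts):
    # "Inside" of the job: add up all the ones emitted for a word.
    yield word, sum(counts)

def run_job(records, mapper, reducer):
    # Stand-in for the framework: map, group by key (shuffle), reduce.
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    output = []
    for out_key in sorted(groups):
        output.extend(reducer(out_key, groups[out_key]))
    return output

# "Outside" of the job: decide what to run and on which input.
lines = [(0, "the quick brown fox"), (1, "the lazy dog"), (2, "the fox")]
counts = dict(run_job(lines, mapper, reducer))
print(counts)</pre>
<p>Every option reviewed here has to express these same pieces; what differs is how much ceremony surrounds them.</p>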
<h5>Java Hadoop</h5>
<p>This is the original, the real thing, the current performance champion and what &#8220;real men&#8221; write in. It is also the most mature of the different options. But take a look:</p>
<pre class="brush: java">// Word count, old org.apache.hadoop.mapred API (import statements omitted)
public class WordCount {

    public static class Map extends MapReduceBase implements Mapper&lt;LongWritable, Text, Text, IntWritable&gt; {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, OutputCollector&lt;Text, IntWritable&gt; output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());
          output.collect(word, one);
        }
      }
    }

    public static class Reduce extends MapReduceBase implements Reducer&lt;Text, IntWritable, Text, IntWritable&gt; {
      public void reduce(Text key, Iterator&lt;IntWritable&gt; values, OutputCollector&lt;Text, IntWritable&gt; output, Reporter reporter) throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }

    public static void main(String[] args) throws Exception {
      JobConf conf = new JobConf(WordCount.class);
      conf.setJobName("wordcount");

      conf.setOutputKeyClass(Text.class);
      conf.setOutputValueClass(IntWritable.class);

      conf.setMapperClass(Map.class);
      conf.setCombinerClass(Reduce.class);
      conf.setReducerClass(Reduce.class);

      conf.setInputFormat(TextInputFormat.class);
      conf.setOutputFormat(TextOutputFormat.class);

      conf.setInputPath(new Path(args[0]));
      conf.setOutputPath(new Path(args[1]));

      JobClient.runJob(conf);
    }
}</pre>
<p>48 lines to write a word count program (and I stripped the import statements at the top out of mercy)! My favorite line is number 5, a line devoted to redefining the number one. This makes sense in a world where programmer productivity is measured by number of lines of code written or for a production job that runs on a 1,000 node cluster for 5 hours every night, in which case efficiency may trump other considerations. But for a blog, for discussing and enjoying code, anything remotely more interesting than a word count program would not fit the size of an entry but would have to be an attachment, as John Mount did with his painstaking <a href="http://www.win-vector.com/blog/2010/12/large-data-logistic-regression-with-example-hadoop-code/">Java/Hadoop implementation of logistic regression</a>. I wonder how many people opened that tar file and read it through and through.</p>
<p>Hadoop was started by Doug Cutting and is developed by a large community with a significant group employed at Yahoo.</p>
<h5>Cascading</h5>
<p>Cascading is a Java library written on top of Hadoop. It enables programming in a dataflow style, with some primitives inspired by SQL (like GroupBy). But according to a person closely related to the project, &#8220;it&#8217;s still Java, it&#8217;s still boilerplate code&#8221;. My favorite line is number 18. Remarkably, it trims down the line count for the word count program to half as many as plain Hadoop. I don&#8217;t have first hand experience with Cascading, but since there is no or little performance penalty compared to the real thing — depending on programmer skill, it could actually be better — it&#8217;s worth a try for production work.</p>
<p>Cascading is developed by Chris Wensel at Concurrent.</p>
<pre class="brush: java">Scheme sourceScheme = new TextLine( new Fields( "line" ) );
Tap source = new Hfs( sourceScheme, inputPath );

Scheme sinkScheme = new TextLine( new Fields( "word", "count" ) );
Tap sink = new Hfs( sinkScheme, outputPath, SinkMode.REPLACE );

Pipe assembly = new Pipe( "wordcount" );

String regex = "(?&lt;!\\pL)(?=\\pL)[^ ]*(?&lt;=\\pL)(?!\\pL)";
Function function = new RegexGenerator( new Fields( "word" ), regex );
assembly = new Each( assembly, new Fields( "line" ), function );

assembly = new GroupBy( assembly, new Fields( "word" ) );

Aggregator count = new Count( new Fields( "count" ) );
assembly = new Every( assembly, count );

Properties properties = new Properties();
FlowConnector.setApplicationJarClass( properties, Main.class );

FlowConnector flowConnector = new FlowConnector( properties );
Flow flow = flowConnector.connect( "word-count", source, sink, assembly );

flow.complete();</pre>
<h5>Pipes — C++</h5>
<p>C++ fits into the Java environment not without some effort, which is encapsulated in a library called Pipes. The word count program looks more compact than in Java/Hadoop. I&#8217;ve read conflicting comments on the efficiency of Pipes/C++ vs Hadoop/Java and I suspect it may depend on the specific problem being tackled. Even though I used to be quite proficient in C++, I do not remember fondly the 8000-character template-induced error messages, and I don&#8217;t think it is the type of language I would want to use to discuss algorithms or for prototyping.</p>
<p>Pipes is developed as part of the Hadoop project.</p>
<pre class="brush: cpp">class WordCountMap: public HadoopPipes::Mapper {
public:
 WordCountMap(HadoopPipes::TaskContext&amp; context){}
 void map(HadoopPipes::MapContext&amp; context) {
   std::vector&lt;std::string&gt; words =
     HadoopUtils::splitString(context.getInputValue(), " ");
   for(unsigned int i=0; i &lt; words.size(); ++i) {
     context.emit(words[i], "1");
   }
 }
};

class WordCountReduce: public HadoopPipes::Reducer {
public:
 WordCountReduce(HadoopPipes::TaskContext&amp; context){}
 void reduce(HadoopPipes::ReduceContext&amp; context) {
   int sum = 0;
   while (context.nextValue()) {
     sum += HadoopUtils::toInt(context.getInputValue());
   }
   context.emit(context.getInputKey(), HadoopUtils::toString(sum));
 }
};

int main(int argc, char *argv[]) {
 return HadoopPipes::runTask(HadoopPipes::TemplateFactory&lt;WordCountMap,
                             WordCountReduce&gt;());
}</pre>
<h5>Hive</h5>
<p>Hive is a SQL-like language that is interpreted on top of Hadoop. It can also be combined with small programs written in a variety of languages, to make up for the fact that the language itself is not general purpose. For what it does, it is very concise and expressive, but outside that you need to supplement it with other languages. Case in point, the word count example where two additional scripts are left as an exercise for the reader.<br />
Hive started as part of the Hadoop project.</p>
<pre>FROM
(MAP docs.contents USING 'tokenizer_script' AS word, cnt
FROM docs
CLUSTER BY word) map_output

REDUCE map_output.word, map_output.cnt USING 'count_script' AS word, cnt;</pre>
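<p>For readers who would rather not do the exercise, the two scripts could look roughly like the sketch below. It assumes that Hive hands rows to a script as tab-separated text on standard input and reads rows back from its standard output; the function names are hypothetical, and the example feeds them in-memory rows rather than real streams.</p>
<pre class="brush: python">from itertools import groupby

def tokenize(rows):
    # Role of tokenizer_script: one "word TAB 1" row per word of each input row.
    for row in rows:
        for word in row.split():
            yield word + "\t1"

def count(rows):
    # Role of count_script. It relies on rows for the same word arriving
    # next to each other, which is what CLUSTER BY word is there for.
    pairs = (row.rstrip("\n").split("\t") for row in rows)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word + "\t" + str(sum(int(cnt) for _, cnt in group))

# In real scripts the rows would come from sys.stdin and go to print().
docs = ["to be or not to be\n", "to do\n"]
tokenized = sorted(tokenize(docs))   # sorting imitates the clustering step
counted = list(count(tokenized))
print(counted)</pre>
<p>The counting logic only works because equal keys are adjacent; without the clustering step it would emit several partial counts per word.</p>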
<h5>Pig</h5>
<p>Pig adds to the limitations of Hive the hubris of creating a brand new language, as if creating a new programming language were easy. As you can see, it is inspired by SQL to a degree. It is not a general purpose language as clearly explained <a href="http://wiki.apache.org/pig/TuringCompletePig">here</a>. It interfaces with any JVM based language for custom extensions.</p>
<pre>A = load '/tmp/bible+shakes.nopunc';
B = foreach A generate flatten(TOKENIZE((chararray)$0)) as word;
C = filter B by word matches '\w+';
D = group C by word;
E = foreach D generate COUNT(C) as count, group as word;
F = order E by count desc;
store F into '/tmp/wc';</pre>
<p>Pig development was started at Yahoo.</p>
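<p>For readers who do not know Pig, the steps A to F above map onto ordinary collection operations. The Python sketch below follows them one by one on a couple of made-up lines; it is an illustration of the data flow, not a description of how Pig executes it, and plain whitespace splitting is a simplification of what TOKENIZE does.</p>
<pre class="brush: python">import re
from collections import Counter

def word_count(lines):
    # B: split every line into words.
    words = (word for line in lines for word in line.split())
    # C: keep only tokens made entirely of word characters (matches '\w+').
    words = (word for word in words if re.fullmatch(r"\w+", word))
    # D and E: group by word and count the members of each group.
    counts = Counter(words)
    # F: order by count, largest first (ties broken alphabetically here).
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))

ranked = word_count(["to be or not to be", "that is the question :"])
print(ranked[:3])</pre>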
<h5>Rhipe</h5>
<p>Rhipe is an R package to describe and execute map-reduce jobs. It is reasonably high level and satisfies all the criteria I listed above. It&#8217;s not a speed demon because of R itself, there are some quirks in the API, and it&#8217;s still at an early stage of development, but it is interesting.</p>
<pre>rhinit()
m &lt;- expression({
 y &lt;- strsplit(unlist(map.values)," ")
 lapply(y,function(r) rhcollect(r,T))
})
r &lt;- expression(
   pre={
     count=0
   },
   reduce={
     count &lt;- sum(as.numeric(unlist(reduce.values)),count)
   },post={
     rhcollect(reduce.key,count)
   })
z=rhmr(map=m,reduce=r,comb=T,inout=c("text","sequence"),ifolder="/tmp/50mil",ofolder='/tmp/tof')
rhex(z)</pre>
<p>Rhipe is developed by Saptarshi Guha at Mozilla in Mountain View, California.</p>
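<p>The reducer above is split into three expressions: pre, reduce and post. The Python sketch below mimics that life cycle to show why the running count lives outside the reduce step. It rests on an assumption that has not been checked against the Rhipe sources, namely that reduce can be called more than once per key, each time with a batch of values. The class, the driver function and the batch size are all hypothetical.</p>
<pre class="brush: python">class SumReducer:
    # Mirrors the three stages of the Rhipe reducer shown above.
    def pre(self, key):
        # pre: runs once per key, before any value is seen.
        self.key, self.count = key, 0

    def reduce(self, values):
        # reduce: may run several times per key, once per batch of values.
        self.count += sum(values)

    def post(self):
        # post: runs once per key, after the last batch; emits the result.
        return self.key, self.count

def run_reducer(reducer, key, values, batch_size=2):
    # Assumed driver: feeds the values for one key in small batches.
    reducer.pre(key)
    for start in range(0, len(values), batch_size):
        reducer.reduce(values[start:start + batch_size])
    return reducer.post()

result = run_reducer(SumReducer(), "fox", [1, 1, 1, 1, 1])
print(result)</pre>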
<h5>Dumbo</h5>
<p>Dumbo is a Hadoop library for Python, but it also imposes a set of tools to run a Dumbo program. If you look at the word count program in Dumbo, below, it almost looks like pseudo-code! Finally! But there is a serious catch. There can be only one run statement per Dumbo-powered program (I <a href="https://groups.google.com/d/msg/dumbo-user/9FFGeFAZqQc/N3uuo-S-61YJ">asked the author himself</a> after seeing some outlandish looking errors). To coordinate two runs, for instance one that starts based on the output of the first, one has to run separate Python programs and go through the Unix shell. This is different from static composition of jobs, which is well supported, but not general enough for my purposes. Other options for Python include mrjob and pydoop, but I haven&#8217;t had time to look into these yet.</p>
<pre class="brush: python">def mapper(key,value):
  for word in value.split(): yield word,1
def reducer(key,values):
  yield key,sum(values)
if __name__ == "__main__":
  import dumbo
  dumbo.run(mapper,reducer)</pre>
<p>Dumbo is developed by Klaas Bosteels at last.fm.</p>
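<p>To show what a data dependent number of jobs means in practice, here is a sketch of a driver loop in plain Python. Nothing in it is Dumbo code: run_job is a hypothetical in-memory stand-in for one map-reduce run, and label propagation for connected components is chosen only for illustration. The point is that the loop decides after each run, by looking at the output, whether another run is needed.</p>
<pre class="brush: python">from collections import defaultdict

def run_job(records, mapper, reducer):
    # Stand-in for one map-reduce run (hypothetical, in memory).
    groups = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]

def mapper(node, state):
    label, neighbors = state
    yield node, ("state", label, neighbors)      # pass the node itself along
    for neighbor in neighbors:
        yield neighbor, ("label", label, None)   # offer my label to neighbors

def reducer(node, messages):
    # Keep the adjacency list and adopt the smallest label seen so far.
    neighbors = next(m[2] for m in messages if m[0] == "state")
    yield node, (min(m[1] for m in messages), neighbors)

def connected_components(graph):
    # Assumes a symmetric adjacency list in which every node is a key.
    records = [(node, (node, nbrs)) for node, nbrs in sorted(graph.items())]
    runs = 0
    while True:
        updated = run_job(records, mapper, reducer)
        runs += 1
        if updated == records:       # data dependent stopping rule
            return {node: label for node, (label, _) in updated}, runs
        records = updated

graph = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4]}
labels, runs = connected_components(graph)
print(labels, runs)</pre>
<p>With one run statement per program, the while loop above would have to live in a shell script that starts a new Python process for every iteration.</p>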
<h5>Cascalog</h5>
<p>Built on top of the already powerful cascading as a domain specific language within Clojure, Cascalog wins the word count conciseness contest with a one-liner. Indeed, word counting is simple enough that a line is all that it should take. But look at what a line:</p>
<pre>(?&lt;- (stdout) [?word ?count] (sentence ?s) (split ?s :&gt; ?word) (c/count ?count))</pre>
<p>It probably looks familiar to anybody who&#8217;s familiar with it. Conciseness can become terseness, but once some domain specific concepts have been grasped a terse program such as this might become perfectly clear. It was to me at some point. My misgivings here are more about the JVM-powered revival of LISP in the form of Clojure. LISP has been around some 50-odd years without taking off despite several attempts at its revival (Common LISP, Scheme, Arc and now Clojure). I suspect something is wrong with it, even if popularity is not an accurate gauge of language quality, as BASIC has long proved. Personally, I dislike LISP&#8217;s odd syntax, the widespread use of side effects in a functional language and the poor abstraction that lists represent over RAM, from a performance point of view; indeed LISP variants often add additional data structures, somehow negating the &#8220;LIS&#8221; part of the language. In the specific case of Clojure, the fact that a compiled language is compiled into an interpreted one, JVM bytecode, combining a slow dev cycle with suboptimal performance, makes me think Clojure users must be gluttons for punishment.</p>
<p>Cascalog is developed by Nathan Marz at BackType.</p>
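<p>For readers who do not read Clojure, the query above can be paraphrased in Python. This is a loose translation for illustration only: the sentences list is made up and stands in for the sentence source, and nothing here runs on Hadoop.</p>
<pre class="brush: python">from collections import Counter

# Stands in for the (sentence ?s) source of the query.
sentences = ["the quick brown fox", "the lazy dog"]

# The split predicate emits one ?word per token of each sentence, and
# c/count counts the rows in each group, where groups are defined by ?word.
word_counts = Counter(word for s in sentences for word in s.split())
print(word_counts)</pre>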
<h5>Final thoughts</h5>
<p>At the end of this necessarily incomplete and unscientific comparison of languages and libraries, there is a winner and there isn&#8217;t. There isn&#8217;t because language comparison is always multidimensional and subjective, but also because the intended applications are very different. On the other hand, if one is looking for a general purpose, moderately elegant, not necessarily most efficient, not necessarily mature language for exploration purposes, Rhipe seems to fit the bill pretty nicely. First, it is just a library, which means that one can continue to use the tools one is familiar with. I found it particularly useful to run map-reduce jobs in the interpreter, inspecting the inputs and outputs of each, an invaluable debugging help (but no, you cannot step into a mapper or reducer; I use counters instead to trace what&#8217;s going on in there). I also like that one can read and write sequence files with one call, to examine the output of previous jobs and decide what to do next. Additionally, since R is a statistical language and Hadoop is the tool of choice for big data analytics, this seems like a natural fit. Personally, I am familiar with both, which helps, and I have used R, in combination with Hive or Hadoop, to do analytics in the past, but not at this level of integration. Since there is nothing like trying a more substantial example than word count to figure out a language&#8217;s pros and cons, stay tuned for a fairly complex example. After that is published, I plan to pose a friendly challenge to experts in the languages and libraries above or other Hadoop related languages and see what an implementation of the same algorithm would look like in their language of choice and learn something from the comparison. Maybe among my &#8220;25 readers&#8221; there is someone who will take it up.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2011/04/pigs-bees-and-elephants-a-comparison-of-eight-mapreduce-languages/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Mining the Tar Sands of Big Data</title>
		<link>http://www.dataspora.com/2011/02/mining-the-tar-sands-of-big-data/</link>
		<comments>http://www.dataspora.com/2011/02/mining-the-tar-sands-of-big-data/#comments</comments>
		<pubDate>Mon, 14 Feb 2011 07:05:22 +0000</pubDate>
		<dc:creator>mike</dc:creator>
				<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://www.dataspora.com/blog/?p=121</guid>
		<description><![CDATA[This post was co-authored by Roger Ehrenberg, founder and managing partner at IA Ventures.  A variation of this post was published by the GigaOm Media network. The tar sands of Alberta, Canada contain the largest reserves of oil on the planet. However, they remain largely untouched, and for one reason: economics. It costs as much [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.dataspora.com/wp-content/uploads/2011/02/tar_sands.png"><img class="alignleft size-thumbnail wp-image-122" title="tar_sands" src="http://www.dataspora.com/wp-content/uploads/2011/02/tar_sands-150x150.png" alt="" width="150" height="150" /></a></p>
<p><em>This post was co-authored by Roger Ehrenberg, founder and managing partner at IA Ventures.  A variation of this post was published by the GigaOm Media network.</em></p>
<p>The tar sands of Alberta, Canada contain the largest reserves of oil on the planet. However, they remain largely untouched, and for one reason: economics. It costs as much as $40 to extract a barrel of oil from tar sand, and until recently, petroleum companies could not profitably mine these reserves.</p>
<p>In a similar vein, much of the world&#8217;s most valuable information is trapped in digital sand, siloed in servers scattered around the globe. These vast expanses of data &#8212; streaming from our smart phones, DVRs, and GPS-enabled cars &#8212; require mining and distillation before they can be useful.</p>
<p>Both oil and sand, information and data share another parallel: in recent years, technology has catalyzed dramatic drops in the costs of extracting each.</p>
<p>Unlike oil reserves, data is an abundant resource on our wired planet. Though much of it is noise, at scale and with the right mining algorithms, this data can yield information that can predict traffic jams, entertainment trends, even flu outbreaks.</p>
<p><strong> </strong></p>
<p>These are hints of the promise of big data, which will mature in the coming decade, driven by advances in three principal areas: sensor networks, cloud computing, and machine learning.<span id="more-183"></span> The first, sensor networks, historically included devices ranging from NASA satellites and traffic monitors to grocery scanners and Nielsen rating boxes. Expensive to deploy and maintain, these were the exclusive province of governments and industry. But another, wider sensor network has emerged in the last decade: smart phones and web-connected consumer devices. These sensors, and the Tweets, check-ins, and digital pings they generate, form the tendrils of a global digital nervous system, pulsing with petabytes.</p>
<p>Just as these devices have multiplied, so have the data centers that they communicate with. Housed in climate-controlled warehouses, they consume an estimated 2 percent &#8212; and represent the fastest growing segment &#8212; of the United States energy budget. These data centers are at the heart of cloud computing, the second driver of big data.</p>
<p>Cloud computing reframes compute power as a utility, like electricity or water. It offers large-scale computing to even the smallest start-ups: with a few keystrokes, one can lease 100 virtual machines from Amazon&#8217;s Elastic Compute Cloud for less than $10 per hour.</p>
<p>Yet this computing brawn is only valuable when combined with intelligence. Enter machine learning, the third principal component driving value in the industrial age of data.</p>
<p>Machine learning is a discipline that blends statistics with computer science to classify and predict patterns in data. Its algorithms lie at the heart of spam filters, self-driving cars, and movie recommendation systems, including the one to which Netflix awarded its million-dollar prize in 2009. While data storage and distributed computing technologies are being commoditized, machine learning is increasingly a source of competitive advantage among data-driven firms.</p>
<p><strong> </strong></p>
<p>Together, these three technology advances lead us to make several predictions for the coming decade:</p>
<p>1<strong>. A spike in demand for &#8220;data scientists.” </strong>Fueled by the oversupply of data, more firms will need individuals who are facile with manipulating and extracting meaning from large data sets. Until universities adapt their curricula to match these market realities, the battle for these scarce human resources will be intense.</p>
<p>2. <strong>A reassertion of control by data producers. </strong>Firms such as retailers, banks, and online publishers are recognizing that they have been giving away their most precious asset &#8212; customer data &#8212; to transaction processors and other third-parties. We expect firms to spend more effort protecting, structuring and monetizing their data assets.</p>
<p>3. <strong>The end of privacy as we know it. </strong>With devices tracking our every point and click, acceptable practice for personal data will shift from preventing disclosures towards policing uses. It&#8217;s not what our databases know that matters &#8212; for soon they will know everything &#8212; it&#8217;s how this data is used in advertising, consumer finance, and health care.</p>
<p>4. <strong>The rise of data start-ups. </strong>A class of companies is emerging whose supply chains consist of nothing but data. Their inputs are collected through partnerships or from publicly available sources, processed, and transformed into traffic predictions, news aggregations, or real estate valuations. Data start-ups are the wildcatters of the information age, searching for opportunities across a vast and virgin data landscape.</p>
<p>The consequence of sensor networks, cloud computing, and machine learning is that the data landscape is broadening: data is abundant, cheap, and more valuable than ever. It&#8217;s a rich, renewable resource that will shape how we live in the decades ahead, long after the last barrel has been squeezed from the tar sands of Athabasca.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2011/02/mining-the-tar-sands-of-big-data/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>The Seven Secrets of Successful Data Scientists</title>
		<link>http://www.dataspora.com/2010/08/the-seven-secrets-of-successful-data-scientists/</link>
		<comments>http://www.dataspora.com/2010/08/the-seven-secrets-of-successful-data-scientists/#comments</comments>
		<pubDate>Fri, 27 Aug 2010 13:12:04 +0000</pubDate>
		<dc:creator>mike</dc:creator>
				<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=114</guid>
		<description><![CDATA[At O&#8217;Reilly&#8217;s &#8220;Making Data Work&#8221; seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data. What follows is a blog-ified and amended version of that talk, originally entitled &#8220;Secrets of Successful Data Scientists.&#8221; 1. Choose The [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://dataspora.com/images/ds200.jpg"><img class="alignleft size-thumbnail wp-image-100" title="phoenix" src="http://dataspora.com/images/ds200.jpg" alt="" width="200"/></a>At O&#8217;Reilly&#8217;s &#8220;Making Data Work&#8221; seminar earlier this summer, I teamed up with a few other folks (data diva Hilary Mason, R extraordinaire Joe Adler, and visualization guru Ben Fry) to talk about data.</p>
<p>What follows is a blog-ified and amended version of that talk, originally entitled &#8220;Secrets of Successful Data Scientists.&#8221;</p>
<p><strong> 1. Choose The Right-Sized Tool </strong></p>
<p>Or, as I like to say, you don&#8217;t need a chainsaw to cut butter.</p>
<p>If you&#8217;ve got 600 lines of CSV data that you need to work with on a one-time basis, paste it into Excel or Emacs and just do it (yes, curse the Flying Spaghetti Monster, I&#8217;ve just endorsed that dull knife called Excel).</p>
<p>In fact, Excel&#8217;s and Emacs&#8217; program-by-example keyboard macros can be a <a href="http://emacs-fu.blogspot.com/2010/07/keyboard-macros.html">fantastic tool for quick and dirty data clean-up</a>.<br />
<span id="more-179"></span></p>
<p>Alternatively, if you&#8217;ve got 600 million lines of data and you need something simple, piping together several Unix tools (cut, uniq, sort) with a dash of <a href="http://www.ibm.com/developerworks/linux/library/l-p102.html#1">Perl one-liner foo</a> may get you there.</p>
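<p>When the shell is not at hand, the same cut, sort, uniq idea fits in a few lines of Python that stream over the input one line at a time. This is only a sketch: the function name, the column layout and the sample rows are made up, and the rows are assumed to be well-formed CSV.</p>
<pre class="brush: python">import csv
import io
from collections import Counter

def count_column(lines, column, delimiter=","):
    # Pick one column (cut) and tally its distinct values (sort, uniq -c).
    # Only the tallies are kept in memory, never the whole input.
    reader = csv.reader(lines, delimiter=delimiter)
    return Counter(row[column] for row in reader if row)

# Any iterable of lines works here, including an open file.
sample = io.StringIO("alice,CA\nbob,NY\ncarol,CA\n")
by_state = count_column(sample, 1)
print(by_state.most_common())</pre>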
<p>But don&#8217;t confuse this kind of data exploration, where the goal is to size up the data, with building proper data plumbing, where you want robustness and maintainability.  Perl and bash scripts are nice for the former, but can be a nightmare for building data pipelines.</p>
<p>When your data gets very large, so big it can&#8217;t fit reasonably on your laptop (in 2010, that&#8217;s north of a terabyte), then you&#8217;re in Hadoop, <a href="http://www.greenplum.com">parallelized database</a>, or <a href="http://oracle.com">overpriced Big Iron</a> territory.</p>
<p>So, when it comes to choosing tools: scale them up as you need, and focus on getting results first.</p>
<p><strong> 2. Compress Everything </strong></p>
<p>We live in an IO-bound world, where the dominant bottlenecks to data flow are disk read-speed and network bandwidth.</p>
<p>As I was writing this, I was downloading an uncompressed CSV file via a web API.  Uncompressed, it was 257MB; ZIP-compressed, 9MB.</p>
<p>Compression typically gives you a 6-8x bump out of the gate, and for repetitive text like that CSV (roughly 28x) it can be far more.  When moving or crunching data of a certain heft, compress everything, always: it will save you time and money.</p>
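<p>Compressed data does not have to be unpacked to disk before you use it. As a rough sketch, Python&#8217;s standard <code>gzip</code> module can stream rows straight out of compressed bytes. The payload below is invented, and it is far more repetitive than real data, so its ratio will be unrealistically high.</p>

```python
import gzip
import io

# Hypothetical, highly repetitive CSV payload standing in for a real export.
raw = ("2010-08-01,visit,CA\n" * 10000).encode("utf-8")
packed = gzip.compress(raw)
ratio = len(raw) / len(packed)

# Read rows directly from the compressed bytes; no temporary file is written.
with gzip.open(io.BytesIO(packed), mode="rt", encoding="utf-8") as handle:
    row_count = sum(1 for _ in handle)
```

<p>The same <code>gzip.open</code> call also accepts a file name, so a script that reads plain CSV can usually switch to a compressed file with a one-line change.</p>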
<p>That said, because compression can render data difficult to introspect, I don&#8217;t recommend compressing TBs of data into a single tarball, but rather splitting it up, as I discuss next.</p>
<p><strong> 3. Split Up Your Data </strong></p>
<p>&#8220;Monolithic&#8221; is a bad word in software development.</p>
<p>It&#8217;s also, in my experience, a bad word when it comes to data.</p>
<p>The real world is partitioned – whether as zip codes, states, hours, or top-level web domains – and your data should be too. Respect the grain of your data, because eventually you&#8217;ll need to use it to shard your database or distribute it across your file system.</p>
<p>Even more, it&#8217;s this splitting up of data that enables the parallel execution in Hadoop and commercial data platforms (such as Greenplum, Aster, and Netezza).</p>
<p>Splitting is part of a larger design pattern succinctly identified in a paper by Hadley Wickham as:     <strong> <a href="http://had.co.nz/plyr/plyr-intro-090510.pdf"> split, apply, combine </a></strong>.</p>
<p>This is, in my mind, a more lucid formulation of &#8220;map, reduce&#8221; to include key selection (&#8220;split&#8221;) as a distinct step before any map/apply.</p>
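<p>To make the three steps concrete, here is a small Python sketch (not taken from the plyr paper) in which each step is its own function. The records and the helper names <code>split</code>, <code>apply_each</code>, and <code>combine</code> are hypothetical.</p>

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (state, session_minutes)
records = [("CA", 12.0), ("AK", 30.0), ("CA", 8.0), ("AK", 10.0), ("NY", 5.0)]

def split(rows, key):
    """Split: group rows by a chosen key."""
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return groups

def apply_each(groups, func):
    """Apply: run the same function on every group independently."""
    return {name: func(rows) for name, rows in groups.items()}

def combine(results):
    """Combine: gather the per-group results into one sorted table."""
    return sorted(results.items())

by_state = split(records, key=lambda row: row[0])
means = apply_each(by_state, lambda rows: mean(minutes for _, minutes in rows))
summary = combine(means)
```

<p>Since each group is processed independently in the apply step, that step is the natural place to parallelize.</p>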
<p><strong> 4.  Sample Your Data </strong></p>
<p>Let&#8217;s say hypothetically you&#8217;ve got 200 GBs of data from your <a href="http://en.wikipedia.org/wiki/Portmanteau">portmanteau</a> of a start-up, FaceLink.  Someone wants to know if more people visit on Mondays or Fridays. What do you do?</p>
<p>Before you wonder &#8220;if only I had 64 GB of RAM on my MacBook Pro&#8221;, or fire up a Hadoop streaming job, try this: look at a 10k sample of data.</p>
<p>It&#8217;s easy to visually inspect, or pull into R and plot.</p>
<p>Sampling allows you to quickly iterate your approach, and work around edge cases (say, pesky unescaped line terminators), before running a many-hour job on the full monty.</p>
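<p>One way to pull a 10k sample without loading the whole file is reservoir sampling, which keeps a uniform random sample from a stream of unknown length in a single pass. This is only a sketch of that technique; the function name and the generated lines are placeholders.</p>

```python
import random

def reservoir_sample(lines, k, seed=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = random.Random(seed)
    sample = []
    for index, line in enumerate(lines):
        if len(sample) != k:
            sample.append(line)            # fill the reservoir first
        else:
            slot = rng.randint(0, index)   # inclusive on both ends
            if slot in range(k):
                sample[slot] = line        # happens with probability k / (index + 1)
    return sample

# Hypothetical stream standing in for a very large log file.
stream = ("line %d" % n for n in range(100000))
picked = reservoir_sample(stream, k=10000, seed=42)
```

<p>Passing an open file handle instead of the generator works the same way, since both simply yield lines.</p>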
<p>That said, sampling can bite you if you&#8217;re not careful: when data is skewed, which it always is, it can be hard to estimate joint-distributions – comparing the means of California vs Alaska, for example, if your sample is dominated by Californians (an issue that statistics, that sexy skill, can address).</p>
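<p>One common statistical remedy, offered here as an assumption rather than as the only answer, is stratified sampling: draw up to a fixed number of rows from each group so that small groups are not swamped. The sketch below holds every group in memory, so it suits data that has already been cut down to a manageable size. All names and rows are hypothetical.</p>

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, per_group, seed=None):
    """Draw up to per_group rows from each group defined by key."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for row in rows:
        groups[key(row)].append(row)
    return {
        name: rng.sample(members, min(per_group, len(members)))
        for name, members in groups.items()
    }

# Hypothetical skewed data: 990 Californians, 10 Alaskans.
rows = [("CA", n) for n in range(990)] + [("AK", n) for n in range(10)]
strata = stratified_sample(rows, key=lambda row: row[0], per_group=50, seed=1)
```

<p>Any estimate built from such a sample has to be reweighted by the true group sizes before it is reported for the whole population.</p>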
<p><strong> 5. Smart Borrows, But Genius Uses Open Source </strong></p>
<p>Before you create something new out of whole cloth, pause and consider that someone else may have already seen it, solved it, and open-sourced it.</p>
<p>A Google Code Search may turn up a regular expression for that obscure data format.</p>
<p>The open source community allows you, if not to stand on the shoulders of giants, to at least rely on the gruntwork of fellow geeks.</p>
<p><strong> 6. Keep Your Head in the Cloud </strong></p>
<p>This past week, an engineer friend was just thinking about buying a dream desktop: a high RAM, multi-core box to run machine learning code over TBs of data.</p>
<p>I told him it was a terrible idea.</p>
<p>Why?  Because the data he wants to work on isn&#8217;t local, it&#8217;s on an Amazon EC2 cluster.  It&#8217;d take hours to download those TBs over a cable connection.</p>
<p>If you want to compute locally, pull down a sample.  But if your data is in the cloud, that&#8217;s where your tools and code should be.</p>
<p><strong> 7. Don&#8217;t Be Clever </strong></p>
<p>I once heard Brewster Kahle discuss managing the Internet Archive&#8217;s many-petabyte data platform: &#8220;every time one of our engineers comes to me with a new, ingenious and clever idea for managing our data, I have a response: &#8216;You&#8217;re fired.&#8217;&#8221;</p>
<p>Hyperbole aside, his point is well-taken: cleverness doesn&#8217;t scale.</p>
<p>When dealing with big data, embrace standards and use commonly available tools.  Most of all, keep it simple, because simplicity scales.</p>
<p>I know of a firm that, several years ago, decided to fork one part of Hadoop because they had a more clever approach.  Today, they are several versions behind the latest release, and devoting time &amp; energy to back-porting changes.</p>
<p>Cleverness rarely pays off.  Focus your precious programmer-hours on the problems that are unsolved, not simply unoptimized.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2010/08/the-seven-secrets-of-successful-data-scientists/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>The Data Singularity, Part II:  Human-Sizing Big Data</title>
		<link>http://www.dataspora.com/2010/05/new-tools-for-big-data/</link>
		<comments>http://www.dataspora.com/2010/05/new-tools-for-big-data/#comments</comments>
		<pubDate>Thu, 27 May 2010 14:10:39 +0000</pubDate>
		<dc:creator>mike</dc:creator>
				<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=110</guid>
		<description><![CDATA[&#8220;There are no more promising or important targets for basic scientific research than understanding how human minds&#8230; solve problems and make decisions effectively.&#8221; &#8211; Herbert Simon In my previous post , I discussed the forces behind what I&#8217;m calling The Data Singularity. My basic thesis is that as information generating processes become more frictionless &#8212; [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>
&#8220;There are no more promising or important targets for basic scientific research than understanding how human minds&#8230; solve problems and make decisions effectively.&#8221; &#8211; <a href="http://dieoff.org/page163.htm">Herbert Simon</a>
</p></blockquote>
<p><a href="http://www.dataspora.com/wp-content/uploads/2010/05/cern_supercollider.jpg"><img class="alignleft size-thumbnail wp-image-109" title="cern_supercollider" src="http://www.dataspora.com/wp-content/uploads/2010/05/cern_supercollider-150x150.jpg" alt="" width="150" height="150" /></a></p>
<p>In my <a href="http://dataspora.com/blog/the-data-singularity-is-here/">previous post</a>, I discussed the forces behind what I&#8217;m calling The Data Singularity. My basic thesis is that as information generating processes become more frictionless &#8212; as humans have been excised from information read-write loops &#8212; the velocity and volume of data in the world are increasing, and at an exponential rate.</p>
<p>But where do we go from here? What are the consequences of living in an age where every datum is stored? Where are the bottlenecks, pain points, and opportunities? Which technologies are addressing these?</p>
<p>The upshot is this: a new class of tools is evolving for Big Data because traditional approaches can&#8217;t scale up.  But these tools share a common goal: scaling down data, and making it human-sized.  That&#8217;s the &#8220;reduce&#8221; part of MapReduce, the single statistic from analysis, or the hundred pixel line from one hundred million events.</p>
<p>What&#8217;s happening today isn&#8217;t entirely new, though. There were echoes of it decades ago, when surveillance satellites first began scanning the globe.</p>
<p><strong>VI. How Satellite Data Paralyzed the CIA </strong></p>
<p>Beginning in the early 1970s the CIA began relying more on global satellite reconnaissance imagery for its intelligence operations. But according to <a href="http://www.amazon.com/exec/obidos/ASIN/140004684X">one history</a>, this massive, rich data didn&#8217;t accelerate the pace of US intelligence: it slowed it down.</p>
<p>Why? Because confronted with this firehose, CIA leaders attempted to analyze every image, chase every half-formed hypothesis, simply because it was possible. The few good leads were washed out by the many mediocre. The CIA didn&#8217;t adjust their decision-making to this new scale, and they were drowned by it.</p>
<p>Many organizations are at a similar inflection point now, with access to massive, rich data about their customers or products. And, like the CIA in the 1970s, they find themselves paralyzed by the possibilities.</p>
<p><strong>VII. People Still Pull the Big Levers </strong></p>
<p>That Big Data paralyzes human decision-makers matters, because humans still make the big decisions. When someone praises a company as being &#8220;data-driven&#8221;, I&#8217;d like to imagine that this is literally true: that the company is nothing more than a few server racks blinking &amp; humming away, slinging bits and earning money.</p>
<p>But no such company exists. What &#8220;data-driven&#8221; really means is that the executives &amp; employees use data as inputs for making decisions. Companies may be data-fueled, but they&#8217;re people-driven.</p>
<p><strong>VIII. Human-sizing Big Data: Filter &amp; Crunch </strong><br />
<span id="more-182"></span><br />
All of the analytics in the world won&#8217;t matter if it remains inaccessible to the people driving an organization &#8212; the human decision-makers.</p>
<p>We have processes all around us acting as data amplifiers, recording events at a pace &amp; scale that we can&#8217;t comprehend. But this has created a disequilibrium: our capacity to create data is vastly outstripping our ability to consume it.  Analytics is the act of taking Big Data streams and human-sizing them for our small data brains.  </p>
<p>We can reduce data by either filtering it, which sifts through but does not alter data, or by crunching it, reducing many data points to a few.</p>
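<p>The distinction is easy to see in a few lines of Python. In this illustrative sketch, with a made-up event log, filtering returns a subset of the original records untouched, while crunching collapses many records into one number.</p>

```python
from statistics import mean

# Hypothetical event log: (page, load_time_ms)
events = [("home", 120), ("cart", 340), ("home", 95), ("cart", 410), ("home", 130)]

# Filtering: select a subset; the surviving records are unchanged.
cart_events = [event for event in events if event[0] == "cart"]

# Crunching: reduce many records to a single summary statistic.
mean_cart_load = mean(ms for _, ms in cart_events)
```

<p>Filtering keeps the data at its original grain, so the result can still be large; crunching is what produces a human-sized answer.</p>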
<p><strong> Google and Facebook are Filters </strong>.  Many consumer web technologies might be viewed as powerful filters.  Google is a relevance filter for 20 billion web pages.  Facebook is a social filter for baby photos.  FourSquare is a geo-social filter for hipster bars.  Amazon is a filter for retail products, combining search with a powerful recommendation engine.</p>
<p><strong> Wikipedia is a Natural Language Cruncher </strong>.  Crunching data is harder than filtering it.  Perhaps the toughest nut to crack involves processing natural language:  if you read a thousand web pages about the Gutenberg Bible, how would you describe it <a href="http://en.wikipedia.org/wiki/Gutenberg_Bible">in a few paragraphs</a>?  Wikipedia is a human-powered natural language cruncher, powered by <a href="http://www.aaronsw.com/2002/whowriteswikipedia/">its army of mechanical turks</a>, whose collective actions <a href="http://trendingtopics.org">even reveal news trends.</a></p>
<p><strong> Crunch the Past to Predict the Future </strong>.  Crunching of quantitative data is at the heart of many prediction tasks: the National Weather Service aggregates weather station measurements into forecasts, Fair Isaac calculates a score of credit-worthiness by examining your credit history, and a sports contest might be construed as an algorithm &#8212; operating on a sequence of individually played points &#8212; to predict the best team or athlete.</p>
<p>Number crunching has its more banal forms, as well, in the kind of sums and averages found in your phone or utility bill. These are necessary, but predictive algorithms &#8212; the kind involved in weather forecasting &#8212; will continue to grow in importance. For at a certain scale of data, exact reporting becomes an insurmountable task: we can only hope to have probabilistic answers.</p>
<p><strong>IX. Business Intelligence is Dead: New Tools for a New Era </strong></p>
<p>That our traditional tools don&#8217;t operate at scale was highlighted by Tim O&#8217;Reilly recently, when he declared <a href="http://www.slideshare.net/timoreilly/the-future-of-business-intelligence">&#8220;Business intelligence as we knew it is dead.&#8221;</a></p>
<p>A new class of tools is emerging along the Big Data stack, in three areas: (1) storage &amp; computation, (2) analytics, and (3) dashboards &amp; visualization.</p>
<p>These tools will disrupt and attack many of the traditional Business Intelligence firms, ranging from tool-makers like SAS and SPSS, to relational database vendors like Oracle, to custom hardware providers.</p>
<ul>
<li><strong>1. Storage &amp; Computation:  Mixed Platforms, not Monolithic Databases </strong>. At the lowest level of storage &amp; computation, Big Data is driving the success of cloud computing platforms like Amazon&#8217;s Elastic Compute Cloud &#8212; a massive, virtualized commodity-hardware grid &#8212; as an alternative to the Big Iron sold by hardware makers. Big Data has also catalyzed widespread adoption of the distributed, fault-tolerant Hadoop platform &#8212; an open-source implementation of Google&#8217;s MapReduce and Google File System designs that was developed largely at Yahoo, and is now commercially supported by Cloudera.
<p>A bit further up the stack, relational databases are suffering: newer commercial entrants in this space &#8212; such as Greenplum, Aster Data, Vertica, and Netezza &#8212; offer parallelized relational systems that operate at greater scale and lower cost than Oracle and Teradata. Many open-source, non-relational data stores &#8212; with a colorful constellation of names such as HBase, MongoDB, CouchDB, Cassandra, and Voldemort &#8212; have gained traction for high-traffic, content-driven web sites.</p>
<p><strong> SQL &amp; NoSQL are Complementary, Not Antagonistic</strong>.  While some may view storage technologies as antagonistic, either-or choices, the truth is that most Big Data-driven companies use a mixture of tools in complementary ways.  Hadoop is often used for batch-processing and transformation of log data that is fed to more structured data stores, such as a distributed RDBMS, in backend systems. Non-relational data stores are in turn ideal for front-facing, high-performance web applications, where queries return a bolus of data related to a single key &#8212; often a product, user, or page identifier.  All of these pieces working together form <a href="http://my.safaribooksonline.com/9780596801656/information_platforms_and_the_rise_of_th">an information platform</a>: an ecosystem of APIs working together. </li>
</ul>
<ul>
<li><strong>2. Analytics: There Are No Turnkey Solutions</strong>. Imagine if any piece of data you ever wanted was within a query&#8217;s reach:  what would you do with it?  We&#8217;re fast approaching this scenario, and making data meaningful is the bottleneck.  But unlike storing data &#8212; where use cases &amp; technologies are common and becoming commoditized &#8212; the ways that firms filter and crunch their data vary widely.
<p>This reflects the range of analytics needs that firms have: for example, a financial firm may need low-latency, continuous analysis of data streams, while an online retailer or pharmaceutical firm can tolerate 24-hour delays for analysis.</p>
<p><strong>Scaling Up Analytics is Hard</strong>.  R, my favorite analytics tool, is fantastic for modeling either aggregated data sets or samples of data that can fit in memory, but methods for deploying R in a large-scale data environment are still nascent. One promising approach is <a href="http://www.stat.purdue.edu/~sguha/rhipe/">Saptarshi Guha&#8217;s RHIPE</a>, which combines R with Hadoop (see the <a href="http://files.meetup.com/1225993/RHIPE%20-%20Saptarshi%20Guha.pdf">slides</a> from his March presentation at the <a href="http://www.meetup.com/R-Users">Bay Area R Users Group</a>).  Another MapReduce-based framework for large-scale data analysis is the <a href="http://mahout.apache.org/">Apache Mahout project</a>.</p>
<p><strong>Learn, Then Apply:  But Stay Close to the Data</strong>.  In general, there are two pieces in any analytics pipeline: (i) learning, or the training of a model with historical data, and (ii) prediction, or the application of a model to new data.  On the learning side, it&#8217;s been said that <a href="http://anand.typepad.com/datawocky/2008/03/more-data-usual.html">more data beats better algorithms</a>, and this is certainly true for many classification problems. In general, training a model is a computationally intensive task, and the development of methods that can train on massive data sets is <a href="http://research.google.com/pubs/pub36296.html">an area of active research</a>.</p>
<p>On the application/prediction side of modeling, the challenges often revolve around deployment: how do we get the model to the data? (The reverse, pushing data to the model, is more expensive.) To make models portable across different environments, <a href="http://en.wikipedia.org/wiki/Predictive_Model_Markup_Language">PMML (Predictive Model Markup Language)</a> has been developed, and it is supported by a range of database vendors.</p>
<p>The meme of &#8220;in-database analytics&#8221; is resonating because given data&#8217;s increasing heft, efficient analytics will follow the pattern of having the training &amp; execution of models stay close to where the data lives.</p>
<p>As it will be several years before either open-source or commercial analytics tools are mature here, the most successful Big Data modelers will be those data scientists who can build and glue together their own methods, tailored for individual environments and needs.</li>
</ul>
<ul>
<li><strong>3. Dashboards &amp; Visualization:  Why &#8220;I See&#8221; is a Synonym for &#8220;I Understand&#8221; </strong>. The most visible way in which Big Data is disrupting old tools is by changing the way we look at data.  The ultimate end-point for most data analysis is a human decision-maker, whose highest bandwidth channel is his or her eyeballs.  To take optimal advantage of the human visual system, dashboards and data visualization must be well-designed, and until recently, tools that achieved even a minimal standard of competence were rare.
<p><strong>Visual Literacy is on the Rise</strong>.  But a new set of visualization tools and packages, as well as growing popular interest in data visualization &#8212; catalyzed by the books of Edward Tufte, blogs like <a href="http://www.flowingdata.com">Nathan Yau&#8217;s FlowingData</a> and <a href="http://www.ted.com/talks/hans_rosling_shows_the_best_stats_you_ve_ever_seen.html">talks at T.E.D. conferences</a> &#8212; are changing this. As I&#8217;ve written before, there are two distinct kinds of data visualization pathways: (i) exploratory, a highly interactive path whereby a data scientist may permute through dozens or even hundreds of views of a data set to understand its shape or fit to a hypothesized model, and (ii) narrative, a more constrained path whereby only one or several views of the data are presented.</p>
<p><strong>Exploring Data Requires Fast, Frequent Feedback </strong>.  For the exploratory path, desktop tools are ideal. The open-source language R has <a href="http://www.slideshare.net/dataspora/a-survey-of-r-graphics">several outstanding visualization packages</a>, including <a href="http://had.co.nz/ggplot2/">ggplot2 </a>and <a href="http://lmdvr.r-forge.r-project.org/">lattice </a>(based on William Cleveland&#8217;s trellis).  Two solid commercial products for exploratory visualization are <a href="http://spotfire.tibco.com/">SpotFire </a>and <a href="http://www.tableausoftware.com/">Tableau </a>(the latter of which has <a href="http://www.perceptualedge.com/blog/?p=191">been praised </a>by the hard-to-please Stephen Few).</p>
<p><strong>Sharing Visualizations:  Web Dashboards Are Ideal</strong>.  Ultimately, however, visualizations need to be shared beyond a single user, to an audience. Web-driven dashboards are an ideal form for sharing narrative visualizations, by allowing navigation along defined axes of the data. The challenge is moving visualizations from the desktop to the web. Tableau has this capacity, but with R the process is less straightforward. One promising route is via <a href="http://rapache.net/">Jeff Horner&#8217;s RApache tool</a>, which embeds R inside an Apache server (which I&#8217;ve used for my <a href="http://labs.dataspora.com/gameday/">MLB Pitch F/X tool</a>, and which Jeroen Ooms uses to power his <a href="http://www.stat.ucla.edu/~jeroen/ggplot2.html">ggplot2 web app</a>).</p>
<p>The major limitation of R-driven web graphics is that achieving some interactivity within the graphic itself is difficult, as R&#8217;s graphics model is focused on static graphics.  There are, however, several routes for achieving highly interactive, web-based data visualizations, whether by using Javascript, HTML5&#8217;s Canvas, or Flash. Two in particular are:  (i) Ben Fry&#8217;s <strong><a href="http://processing.org">Processing</a></strong>, an expressive language for vector animation, which recently added <a href="http://www.processing.js">Javascript</a> as one of its implementations, and (ii) the <strong><a href="http://vis.stanford.edu/protovis/">Protovis</a></strong> framework out of Stanford: a Javascript graphing toolkit whose conceptual integrity and expressive flexibility were inspired (like ggplot2) by Wilkinson&#8217;s grammar of graphics.</li>
</ul>
<p><strong>X.  Collaborating with Big Data: Analytics is a Social Process </strong></p>
<p><a href="http://www.dataspora.com/wp-content/uploads/2010/05/greenplum_chorus1.png"><img class="alignleft size-thumbnail wp-image-112" title="greenplum_chorus1" src="http://dataspora.com/blog/wp-content/uploads/2010/05/greenplum_chorus1-150x150.png" alt="" width="150" height="150" /></a>In the same talk in which Tim O&#8217;Reilly proclaimed the death of BI &#8220;as we knew it&#8221;, he also highlighted a new initiative by Greenplum called <a href="http://www.greenplum.com/products/chorus/">Chorus</a> (Greenplum is a Dataspora client, but I confess I&#8217;ve only seen a limited preview).</p>
<p>The animating spirit of Chorus is that analytics is not only about data, models, and visualizations &#8212; it&#8217;s also about the people who work on these various pieces.  One of the reasons I love Box.net is the layer of social information that&#8217;s overlaid onto my files: appended notes, access statistics from collaborators, automatic notifications when a change is made.</p>
<p>Chorus is a vision to do this with Big Data; it allows, for instance, an analyst to link a data visualization to an underlying data source, include the R code that created the visualization, and append a note about a recent change to it.</p>
<p>As the Big Data stack matures, tools that help manage the workflow from data to analytics to visualizations, and ultimately to decisions, will be critical.  Someday, creating and sharing a data analysis through a web dashboard should be as easy as writing a blog post.  Until that day, there&#8217;s plenty of work to keep us data scientists well-employed.</p>
<p><em> If crunching terabytes of data is the kind of thing you&#8217;d like to do for breakfast, please send me a note at med @ dataspora.com.  I&#8217;m looking to hire technologists &amp; analytics experts for a new venture&#8230; more on that soon. </em></p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2010/05/new-tools-for-big-data/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>The Data Singularity is Here</title>
		<link>http://www.dataspora.com/2010/03/the-data-singularity-is-here/</link>
		<comments>http://www.dataspora.com/2010/03/the-data-singularity-is-here/#comments</comments>
		<pubDate>Mon, 08 Mar 2010 08:36:22 +0000</pubDate>
		<dc:creator>mike</dc:creator>
				<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=104</guid>
		<description><![CDATA[In this blog post I&#8217;ll attempt to sketch the forces behind what I&#8217;m calling, somewhat sensationally, the Data Singularity, and then (in a following post) discuss what I see as its consequences. In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren&#8217;t [...]]]></description>
			<content:encoded><![CDATA[<p><a href="http://www.dataspora.com/wp-content/uploads/2010/03/thematrix.jpg"><img class="alignleft size-full wp-image-108" title="thematrix" src="http://www.dataspora.com/wp-content/uploads/2010/03/thematrix.jpg" alt="" width="150" height="113" /></a>In this blog post I&#8217;ll attempt to sketch the forces behind what I&#8217;m calling, somewhat sensationally, the Data Singularity, and then (in a <a href="http://dataspora.com/blog/new-tools-for-big-data/">following post</a>) discuss what I see as its consequences.</p>
<p>In a nutshell, the Data Singularity is this: humans are being spliced out of the data-driven processes around us, and frequently we aren&#8217;t even at the terminal node of action.  International cargo shipments, high-frequency stock trades, and genetic diagnoses are all made without us.</p>
<p>Absent humans, these data and decision loops have far less friction; they become constrained only by the costs of bandwidth, computation, and storage, all of which are dropping exponentially.</p>
<p>The result is an explosion of data thrown off from these machine-mediated pipelines, along with data about those flows (and data about that data, and so on).  The machines all around us &#8212; our smart phones, smart cars, and fee-happy bank accounts &#8212; are talking, and increasingly we&#8217;re being left out of the conversation.</p>
<p>So whether or not the Singularity is Near, the Data Singularity is here, and its consequences are being felt.</p>
<p>But before I discuss these consequences, I&#8217;d like to expand on the premise.  The world wasn&#8217;t always drowning in this data deluge, so how did we get here?</p>
<p><strong>I.  Data at the Speed of Speech</strong><br />
<span id="more-104"></span><br />
For most of human history, information traveled no faster than the sound of the human voice.  The origin of human language was the original singularity:  it marked the birth of a non-biological information channel,  distinct from our DNA.</p>
<p>But despite this achievement, the production of information &#8212; whether farmers&#8217; almanacs or merchants&#8217; ledgers &#8212; was still constrained by the costs of ink and parchment and the write-speed of the human hand.</p>
<p>All 70,000 volumes of the Library of Alexandria, the collected body of human knowledge in antiquity, could fit on two thumb drives today.</p>
<p>Thus the transmission and production of data, when it was done at all, was painstaking in form, small in scale, and occurred between people.</p>
<p><code> People --&gt; People </code></p>
<p><strong>II.  Data at the Speed of Light</strong></p>
<p>With the telegraph, for the first time, data flowed at the speed of light.</p>
<p>In the late 18th century, the first substantive telegraph line connected Paris to a suburb 210 kilometers to its north, using optical semaphores rather than electrical currents to communicate.  Yet while data hopped between stations at light speed, it had to be routed by human operators at each station.</p>
<p>Centuries earlier, the printing press dramatically reduced the production costs of information.  Still, human authors transmitted their hand-drafted manuscripts to typesetters, who set type with fonts optimally designed for human eyes.</p>
<p><strong>III. Programmable Looms and Reading Machines</strong></p>
<p>Punch cards represented the movement of data away from human-readable, anthropocentric substrates, onto a medium designed principally for consumption by machines.</p>
<p>Punch cards were developed in the early 18th century <a href="http://en.wikipedia.org/wiki/Basile_Bouchon"> to control industrial looms </a>, in France.</p>
<p>Now, machines were the final terminus of data transmission.  This act of communicating with our machines, <em>programming</em> them, was at the heart of Charles Babbage&#8217;s Analytical Engine, which came more than a century later.</p>
<p><code> People --&gt; Machines</code></p>
<p><strong>IV.  Phonographs and Recording Machines </strong></p>
<p>Developing on the other side of the communication spectrum were machines that excelled at writing and storing data.</p>
<p>The <a href="http://www-03.ibm.com/ibm/history/exhibits/storage/storage_350.html">modern rotating disk drive</a> feels inspired less by punch cards than by Thomas Edison&#8217;s cylinder machines, better known as phonographs.</p>
<p>The human voice was a natural data format, and if early pioneers had a vision for the modern human-machine interface, I imagine it would have been to program machines by voice.  It&#8217;s a vision that still eludes us.</p>
<p>By the middle of the 20th century, a slew of semiconductor technologies emerged to close the loop of data generation: we had machines that produced digital data, and machines that continuously consumed it, without human intervention.</p>
<p><code> Machines --&gt; Machines</code></p>
<p>These technologies also sparked the beginning of a less-celebrated, but equally important exponential curve: the falling cost of data storage.</p>
<p><a href="http://www.dataspora.com/wp-content/uploads/2010/03/cost_of_data_storage_360.png"><img class="alignnone size-full wp-image-106" title="cost_of_data_storage_360" src="http://www.dataspora.com/wp-content/uploads/2010/03/cost_of_data_storage_360.png" alt="" width="360" height="360" /></a></p>
<p><strong>V.  Listening to the Pulse of the Planet</strong></p>
<p>The exponential drop in data storage costs has meant that logging historical data about a process, or billions of processes, is economically feasible.</p>
<p>I conjecture that the largest share of data on the planet sits in log files; these are the EKGs of the server farms that manage our cell phones, our e-mail accounts, and every other facet of our online existence &#8212; and which consume 3% of the <a href="http://arstechnica.com/old/content/2007/08/epa-power-usage-in-data-centers-could-double-by-2011.ars">US energy budget </a>.</p>
<p>Ubiquitous networking and cheap bandwidth have meant these pools of storage are no longer isolated on individual sensors, phones, or servers, but form the tributaries feeding an ocean of data in the Cloud.</p>
<p>And yet, funneling these massive volumes of data creates enormous technological pressures, against which companies struggle.  So why keep the data?</p>
<p>Because inside these log files, amidst the myriad conversations recorded between machines, lies the pulse of their customers.</p>
<p>Collectively, these logs reveal the pulse of the planet &#8212; flight delays, package shipments, job losses, and human sentiments.</p>
<p>And as I&#8217;ll discuss <a href="http://dataspora.com/blog/new-tools-for-big-data/">in my next post</a>, those who can extract a meaningful signal from this thunderous cacophony &#8212; the analysts, statisticians, and data scientists &#8212; are uniquely positioned to change the world.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2010/03/the-data-singularity-is-here/feed/</wfw:commentRss>
		<slash:comments>26</slash:comments>
		</item>
		<item>
		<title>SQL is Dead.  Long Live SQL!</title>
		<link>http://www.dataspora.com/2009/11/sql-is-dead-long-live-sql/</link>
		<comments>http://www.dataspora.com/2009/11/sql-is-dead-long-live-sql/#comments</comments>
		<pubDate>Wed, 25 Nov 2009 10:58:14 +0000</pubDate>
		<dc:creator>mike</dc:creator>
				<category><![CDATA[analytics]]></category>

		<guid isPermaLink="false">http://dataspora.com/blog/?p=97</guid>
		<description><![CDATA[&#8220;The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.&#8221;– E.F. Codd, 1969 &#8220;Database research has produced a number of good results, but the relational database is not one of them.&#8221; – Henry Baker, 1991 Outside of programming language flame wars, few questions raise the hackles [...]]]></description>
			<content:encoded><![CDATA[<blockquote><p>&#8220;The adoption of a relational model of data, as described above, permits the development of a universal data sub-language.&#8221;– <a href="http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf">E.F. Codd, 1969</a></p></blockquote>
<blockquote><p>&#8220;Database research has produced a number of good results, but the relational database is not one of them.&#8221; – <a href="http://home.pipeline.com/~hbaker1/letters/CACM-RelationalDatabases.html">Henry Baker, 1991</a></p></blockquote>
<p><a href="http://www.dataspora.com/wp-content/uploads/2009/11/relational_theory.png"><img class="alignleft size-thumbnail wp-image-102" title="relational_theory" src="http://dataspora.com/wp-content/uploads/2009/11/relational_theory-150x150.png" alt="" width="150" height="150" /></a> Outside of programming language flame wars, few questions raise the hackles of hackers more than: &#8220;how should I store my data?&#8221;</p>
<p>I will argue here that, as in many such debates, the answer is:  it depends on what you&#8217;re doing.</p>
<p>While the rise of non-relational data stores serves a much-needed niche, the death of SQL and relational databases <a href="http://www.readwriteweb.com/enterprise/2009/02/is-the-relational-database-doomed.php">has been much exaggerated</a>.  E.F. Codd may be dead, but SQL is alive and well as a simple yet powerful data query language.</p>
<p><strong>3NF Crusaders vs NoSQL Rebels</strong></p>
<p>While the current critique of relational databases shares features of earlier debates (such as in the 1990s, when object-oriented databases were heralded as the next big thing), it has some new twists.  So, to review the players and their positions:</p>
<p>On our right are the relational curmudgeons, the kind of folks who <a href="http://www.thethirdmanifesto.com/"> pen manifestos and crusade against NULL values</a>.  They have converted nearly all of big business to their ministry, and have billions of dollars in their coffers to show for it.  They insist that data should be stored in terms of its relations, to protect its integrity and facilitate its analysis.  Ideally that means third-normal form, but <a href="http://www.amazon.com/exec/obidos/ASIN/0471200247"> more liberal branches of the church </a> exist.</p>
<p><span id="more-173"></span>On our left are the folks from the misnomered NoSQL movement, <a href="http://blog.oskarsson.nu/2009/06/nosql-debrief.html">shaggy kids</a> from <a href="http://gigaom.com/2009/08/15/how-yahoo-facebook-amazon-and-google-think-about-big-data/"> the likes of Facebook and Twitter </a>.  They&#8217;ve rebelled against the shackles of relational tables (and bear the scars of MySQL scaling struggles).  They believe that data should be persisted as it&#8217;s programmed: in objects.  And they&#8217;ve spawned a constellation of colorfully named open-source projects – Cassandra, Voldemort, CouchDB, MongoDB, and <a href="http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html">Dynamo</a> – to consummate their cause.</p>
<p><strong>A Three-Pronged Attack on SQL:  Syntax, Schemas, and Performance</strong></p>
<p>At the heart of the NoSQL movement are three distinct critiques:</p>
<ol>
<li>A dislike for SQL&#8217;s syntax, which is ill-fitted to programming patterns.  It&#8217;s painful to write select statements to grab the data spread out across many tables, when all you want is a record.  Within web frameworks, the interface problem has been solved to a large degree by object-relational-mappers, such as Ruby&#8217;s ActiveRecord.</li>
<li>A rejection of the strong typing of relational schemas, which makes it painfully difficult to alter one&#8217;s data model.  It also makes <a href="http://codemonkeyism.com/essential-storage-tradeoff-simple-reads-simple-writes/">writing to the data store a complex process</a>.</li>
<li>A critique of performance, which in turn relates to how concurrency and partitioning of computation are handled.  Most relational databases maintain a shared state, which strives for perfect concurrency, but complicates distributed computation over many nodes.  NoSQL architectures are built on languages and tools, like Erlang and Hadoop, that favor distributed processes which (to use two favorite catch phrases) &#8220;share nothing&#8221; but are &#8220;<a href="http://www.allthingsdistributed.com/2008/12/eventually_consistent.html">eventually consistent</a>.&#8221;  The NoSQL philosophy also weighs heavily against joins.</li>
</ol>
<p>These critical threads are mirrored in the movement and its associated projects.  On the one hand you have developers who prefer the programmatic ease of interacting with NoSQL data stores, such as Cassandra and CouchDB.  They also don&#8217;t suffer the performance penalties of scale:  unlike with relational tables, the performance of look-ups does not degrade as the stored number of objects rises.</p>
<p>On the other, you have Big Data analysts (like myself), who love Hadoop because it allows easy distributed computation over massive, loosely typed data sets.</p>
<p><strong>Analytics:  MapReduce for Munging, SQL for Set Operations</strong></p>
<p>With regard to analytics, the Hadoop ecosystem makes it easy to dump several billion records of varying formats into a data store and process them – without having to conform them to a common data model.  Thus a NoSQL framework is great for massive data munging.</p>
<p>But if I had to access an already structured massive data set, I prefer SQL&#8217;s declarative syntax to MapReduce constructs.</p>
<p>I recently sat down at an SQL terminal with several hundred billion call records behind it.  With a simple SQL query, I determined how many distinct people the average American telephones more than once in a given month (answer: five).  In a few hundred seconds, I&#8217;d generated a report on the global state of the customer calling network.</p>
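<p>To give a flavor of that kind of query, here is a minimal sketch using Python and an in-memory SQLite database. The <code>calls</code> table, its columns, and the sample rows are all hypothetical; the actual schema and query behind the figure above are not shown in this post.</p>

```python
import sqlite3

# Hypothetical schema: one row per call. Table and column names are
# illustrative assumptions, not the real call-record warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (caller TEXT, callee TEXT, call_month TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [
        ("alice", "bob", "2009-10"),
        ("alice", "bob", "2009-10"),
        ("alice", "carol", "2009-10"),
        ("alice", "carol", "2009-10"),
        ("alice", "dave", "2009-10"),  # called only once, so not counted
        ("bob", "alice", "2009-10"),
        ("bob", "alice", "2009-10"),
    ],
)

# Inner query: (caller, callee) pairs with more than one call in the month.
# Middle query: number of such repeat contacts per caller.
# Outer query: average over callers who have at least one repeat contact.
query = """
SELECT AVG(repeat_contacts) FROM (
    SELECT caller, COUNT(*) AS repeat_contacts
    FROM (
        SELECT caller, callee
        FROM calls
        WHERE call_month = ?
        GROUP BY caller, callee
        HAVING COUNT(*) > 1
    ) AS repeated_pairs
    GROUP BY caller
) AS per_caller
"""
avg_repeat_contacts = conn.execute(query, ("2009-10",)).fetchone()[0]
print(avg_repeat_contacts)
```

<p>Note that this sketch averages only over callers with at least one repeat contact; including callers with none would take an outer join against a list of all callers.</p>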
<p>Contrary to what the NoSQL camp may claim, it&#8217;s not that relational databases can&#8217;t scale – in fact, they can scale to petabytes, as <a href="http://www.dbms2.com/2009/04/30/ebays-two-enormous-data-warehouses/">those who know Fortune 500 enterprise computing can attest</a>.  The problem is that relational databases require lots of ETL cruft to munge fluid blobs of data into strongly typed tables.</p>
<p>I can&#8217;t imagine the programmer pain and suffering that went into building one, unified, global database.  But once it&#8217;s there, I&#8217;d much prefer to access it with SQL statements than MapReduce code.</p>
<p>And I&#8217;m not alone in feeling this way:  Jeff Hammerbacher of Cloudera recently told me that, for an enterprise deployment, usage jumped 10x when an SQL interface – HIVE (which I mention below) – was placed on the cluster.</p>
<p><strong>NoSQL is a Misnomer: SQL is Innocent!</strong></p>
<p>Which brings me to my defense of SQL.  I agree with two of the three critiques above that embody the NoSQL philosophy, namely the need for schema-less storage and distributed architectures.  But when they go after SQL, and name the movement in opposition to it, they&#8217;ve named the wrong villain.  (Your honor,) <a href="http://cacm.acm.org/blogs/blog-cacm/50678-the-nosql-discussion-has-nothing-to-do-with-sql/fulltext">SQL is just an innocent query language!</a></p>
<p>As evidence of innocence, look no further than <a href="http://code.google.com/appengine/docs/python/datastore/gqlreference.html">Google&#8217;s GQL</a> and <a href="http://wiki.apache.org/hadoop/Hive"> Hadoop&#8217;s HIVE</a>, two SQL-style query languages for NoSQL data stores.</p>
<p>Why SQL in a NoSQL data store?   For one, it&#8217;s a language that both business analysts and developers already know; so the zero-th order adoption step is shorter.</p>
<p>But SQL lives on for a deeper reason: it is a simple yet powerful language for set operations.  SQL captures the essential patterns of data manipulation, such as:</p>
<ol>
<li>intersections (JOINs)</li>
<li>filters (WHEREs)</li>
<li>reductions or aggregations (GROUP BYs)</li>
</ol>
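<p>The three patterns listed above can be combined in a single statement. The sketch below runs one through Python and an in-memory SQLite database; the <code>stores</code> and <code>sales</code> tables and their contents are made up for illustration.</p>

```python
import sqlite3

# Hypothetical retail tables, invented for this example.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stores (store_id INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE sales  (store_id INTEGER, amount REAL);
INSERT INTO stores VALUES (1, 'west'), (2, 'west'), (3, 'east');
INSERT INTO sales  VALUES (1, 10.0), (1, 5.0), (2, 20.0), (3, 7.0), (3, 1.0);
""")

rows = conn.execute("""
SELECT st.region, SUM(sa.amount) AS total
FROM sales AS sa
JOIN stores AS st ON st.store_id = sa.store_id  -- intersection
WHERE sa.amount >= 5.0                          -- filter
GROUP BY st.region                              -- reduction
ORDER BY st.region
""").fetchall()
print(rows)
```

<p>The query says what result is wanted (regional totals of sales of at least 5.0) and leaves the looping and bookkeeping to the database.</p>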
<p>I suspect that many developers who profess a disdain for SQL have been deceived by its simplicity.  One of my favorite packages in R is <a href="http://code.google.com/p/sqldf/">sqldf</a>, which allows SQL queries on R data frames.  SQL&#8217;s declarative expressions are frequently more readable and compact than their R programmatic equivalents.</p>
<p><strong>MapReduce is Possible in SQL</strong></p>
<p>Until very recently one of the more difficult operations to perform in SQL was a top-K query, for example, finding the five highest-priced items in every store in a retail database.  But so-called window functions, which make such queries easy to express, have become part of the SQL standard and are now natively supported in Postgres.</p>
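<p>As a rough illustration, a top-K query with a window function might look like the sketch below. It uses Python with SQLite rather than Postgres so that it is self-contained, and it assumes the SQLite library bundled with Python is version 3.25 or later (the first with window functions). The <code>items</code> table and its rows are hypothetical, and K is 2 rather than five to keep the data small.</p>

```python
import sqlite3

# Hypothetical retail data; requires SQLite 3.25+ for window functions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (store TEXT, item TEXT, price REAL)")
conn.executemany(
    "INSERT INTO items VALUES (?, ?, ?)",
    [
        ("north", "lamp", 40.0), ("north", "desk", 120.0), ("north", "pen", 2.0),
        ("south", "sofa", 300.0), ("south", "rug", 80.0), ("south", "mug", 6.0),
    ],
)

K = 2  # number of items to keep per store

# ROW_NUMBER() restarts at 1 within each store, ordered by descending price,
# so keeping rows with rnk <= K yields the K highest-priced items per store.
top_k = conn.execute("""
SELECT store, item, price FROM (
    SELECT store, item, price,
           ROW_NUMBER() OVER (PARTITION BY store ORDER BY price DESC) AS rnk
    FROM items
) AS ranked
WHERE rnk <= ?
ORDER BY store, price DESC
""", (K,)).fetchall()
print(top_k)
```

<p>If ties in price should all be kept, <code>RANK()</code> could be swapped in for <code>ROW_NUMBER()</code>, at the cost of sometimes returning more than K rows per store.</p>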
<p>Window functions are powerful because they provide a &#8220;split-apply&#8221; functionality, otherwise known as a map function.  Combine these with SQL&#8217;s GROUP BY operation, which is a reduce function, and you have achieved – voila! – map-reduce in SQL.  And as with all map functions, window operations are massively parallelizable (something that has not gone unnoticed by <a href="http://www.greenplum.com">some commercial vendors</a>).</p>
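<p>One loose way to picture the analogy is the sketch below, again in Python with an in-memory SQLite database (version 3.25 or later assumed). A window function computes a per-row value within each partition (the &#8220;map&#8221; step), and an outer GROUP BY collapses those rows to one value per group (the &#8220;reduce&#8221; step). The <code>sales</code> table and its numbers are invented for illustration, and this is only one reading of the analogy, not a full MapReduce implementation.</p>

```python
import sqlite3

# Hypothetical sales data; integer amounts keep the arithmetic exact.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (store TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("north", 30), ("north", 60), ("south", 50), ("south", 50)],
)

max_share = conn.execute("""
SELECT store, MAX(pct_of_store) AS largest_pct     -- reduce: one row per store
FROM (
    -- map: each sale becomes its percentage of the store total
    SELECT store,
           amount * 100 / SUM(amount) OVER (PARTITION BY store) AS pct_of_store
    FROM sales
) AS mapped
GROUP BY store
ORDER BY store
""").fetchall()
print(max_share)
```

<p>Because each partition is processed independently of the others, a database is free to spread the partitions across workers, which is the property that makes the pattern attractive for parallel execution.</p>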
<p><strong>Verdict:  Don&#8217;t Use a Chainsaw to Cut Butter (Use the Right Tool)</strong></p>
<p>Both NoSQL and SQL have their place in an analytics ecosystem.   In the <a href="http://dataspora.com/blog/sexy-data-geeks/">Big Data workflow</a> that I&#8217;ve advocated in the past, I view SQL as a pipe feeding data into more sophisticated modeling and visualization tools, such as R.  But it is an easy-to-use pipe, and it allows analysts to quickly pull out a subset of data &#8212; and start asking questions of that data.</p>
<p>The verdict in the great NoSQL debate is:  know your tools and know your goals.  In the Big Data space today, there can be an undue focus on formats or mechanics, but these are just a means to one end:  products.  Remember, Paul Graham and his team wrote Viaweb in Lisp, and it just worked.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.dataspora.com/2009/11/sql-is-dead-long-live-sql/feed/</wfw:commentRss>
		<slash:comments>8</slash:comments>
		</item>
	</channel>
</rss>

