Databricks, the company trying to commercialize the Apache Spark data-processing framework, announced on Monday at the second-annual Spark Summit that it has raised $33 million in series B venture capital, and is rolling out a cloud computing service for easily creating, deploying and running Spark workloads. NEA led the funding round, which also included existing investor Andreessen Horowitz and follows up on a $14 million series A round in September 2013.
While the pace and amount of Databricks' fundraising are impressive, the new service, called Databricks Cloud, is the real big deal here. It gives a lot of guidance as to the company's plan for making money from Spark -- a plan that includes a lot more than just certifying applications to run on Spark, and providing enterprise support and certification to Hadoop vendors that support it. (By the way, Hortonworks, MapR, Pivotal, IBM and NoSQL database vendor DataStax are now offering Databricks-certified Spark distributions.) Rather, Databricks is betting on a big demand for cloud-based big data workloads and frameworks, a la Google's new Dataflow service, that simplify the user experience.
For more info on Spark's history and why it's so popular right now (the short version is that it's fast, flexible and easy to program), check out this recent Structure Show interview with Spark co-creator and Databricks co-founder and CTO Matei Zaharia.
Databricks co-founder and CEO Ion Stoica said Databricks Cloud is designed to simplify the data pipeline, which stretches from managing ETL on one end to actually building products or visualizing data on the other end. In between are the myriad data stores and data-processing systems (e.g., for batch processing, stream processing, interactive SQL and graph processing) that companies need to manage. Databricks Cloud combines much of this functionality -- including the various processing engines and "notebook" and dashboard features for building and displaying machine learning models -- into a single platform under a single API.
The service will initially run atop Amazon Web Services, but should soon run atop other clouds as well. Data is stored in Amazon S3 by default -- because so many companies already have data stored there, Stoica said -- but can be stored in HDFS as well if users have Hadoop clusters already running in AWS. Databricks Cloud can read data from, and export data to, MongoDB, MySQL and Amazon Redshift.
Stoica said Databricks already has quite a few companies running Spark in the cloud, including as part of a Databricks Cloud closed beta program that began earlier this year, and he expects many more to sign up as the service gradually rolls out to the public. Still, he noted, the company intends to support hybrid cloud-local Spark environments over time, and built Databricks Cloud entirely on open source Spark in order to ensure workload portability.
The Job Launcher UI in Databricks Cloud.
However, while the Hadoop community largely supports Spark and understands its value, not everyone in it is ready to proclaim Spark's inevitable ascension over MapReduce and various other Hadoop-based processing engines -- at least not any time soon. When I asked John Schroeder, CEO of newly enriched Hadoop vendor MapR, about what it would mean for business if Spark overtakes MapReduce, he said, "[It's] way too early to draw that conclusion ... It's the fact that we have to hear about things two years ahead of them being ready for primetime that ends up confusing the customers."
It's an understandable attitude considering how many customer workloads (nearly all of them) are still based on MapReduce and the untold dollars and human resources Hadoop vendors have put into building their technologies and ecosystems. Embracing Spark's popularity is a wise bet on the future, but right now it's also an additional cost center in terms of technological integration, expertise and field support -- all for a framework that doesn't technically even need Hadoop's storage layer in order to work.
The Spark stack. Source: Databricks
By releasing Databricks Cloud, though, the company hopes it can make the Spark experience easy enough that anyone trying to build a new big data application would be foolish not to at least give it a try. "Really, I think the biggest problem with big data is how do you use it if you're not a tech company, if you don't have a lot of expertise, a lot of Ph.D.s in-house that can work with it," CTO Zaharia said in his podcast interview. "I think innovation in the tools that let the next 10 times more companies use big data is where the action will be."