When a team of Yahoo engineers left the company in 2011 to spin out Hadoop startup Hortonworks, the change wasn't as stark for either company as one might have expected. That's because Yahoo didn't just invest in Hortonworks and let it go on its way transforming the big data technology, which during its formative years was largely developed within Yahoo, into a commercial software product. In some respects, the two companies never really parted ways.

"We more or less virtualized the engineering departments between the two companies," said Hortonworks Vice President of Engineering Greg Pavlik, in a recent interview. "... There's a pretty tight overlap between what Yahoo needs and what Hortonworks is looking to productize."

In fact, Yahoo Senior Vice President of Platforms and Personalization Jay Rossiter noted, "not a week goes by" when engineers from the two companies aren't getting together and working closely on building new Hadoop technologies or improving the existing ones. On the surface, it's not such a big deal given how Hortonworks came to be, but it's definitely an advantage.

L to R: Sumit Singh (Yahoo), Jay Rossiter (Yahoo), Greg Pavlik (Hortonworks), Tim Hall (Hortonworks). Source: John Curley

Early on, debate among Hadoop vendors -- particularly Cloudera and Hortonworks -- was as much about their engineers' pedigrees and whose business model was best as it was about actual technology. But now that all the vendors in the space are largely staking out their own paths technologically -- at least when it comes to non-core technologies such as security, interactive queries and cluster management -- an engineering agreement with a large end-user seems pretty meaningful.

"I believe that the most interesting data management work happening on the planet right now is happening in the consumer internet, in general, and at Google in particular," Cloudera co-founder and Chief Strategy Officer Mike Olson said on the Structure Show podcast in February. "We watch very carefully what is happening at the big scale-out web properties as basically a prediction of what more traditional enterprises are going to want in the future."

Say what you will about its business, but Yahoo, which operates a many, many-petabyte Hadoop environment and runs 26 million Hadoop jobs a month, probably counts, too.

And its insights have already paid dividends for Hortonworks. Consider, for example, the rollout of the Hortonworks Data Platform 2.0, the second generation of the company's Hadoop distribution, in October 2013. Its release was timed to coincide with its technological foundation -- Hadoop 2 -- achieving general availability status as an Apache project. Yahoo played a big role in getting Hadoop 2 GA-ready by stress-testing it across its massive Hadoop environment.

"When you drive a system at that scale, you shake out a lot of bugs, you shake out a lot of problems and you make it really real," Rossiter said. Listen to the Structure Show embed below to hear former Yahoo CTO and current Altiscale CEO Raymie Stata talk about the importance of webscale experience in building out Hadoop software.

Hortonworks was working right alongside Yahoo all through that process. They've also worked together on things like rolling upgrades so Hadoop users can upgrade software without taking down a cluster. However, as Cloudera's Olson alluded to, the more important thing going forward might be how the companies can collaborate on those pieces of satellite Hadoop technology that are really helping to clear up the distinctions among competing platforms.

Already, Hortonworks created Stinger -- an evolution of Apache Hive capable of running faster SQL queries -- and Yahoo is now running it in production (and, presumably, troubleshooting it) to the tune of 2.5 million jobs per month. Yahoo is working on capabilities such as Pig on Tez, HBase multitenancy, and Storm on YARN that aren't part of the Hortonworks commercial distribution but could make their way in if customers start asking for them. (Apologies for the preceding mess of Hadoop jargon.)

Of course, none of this is to say that it takes a partnership with a large user like Yahoo in order to successfully build commercial Hadoop software. Cloudera has a mega engineering and corporate partnership in place with microprocessor giant Intel, and enough cash on hand to buy some innovative startups where needed. Cloudera, Hortonworks and MapR have all developed some impressive technologies entirely in-house, as well. Pivotal, the data- and cloud-centric spinoff of EMC and VMware, likes to tout the hundreds of engineers working on its Hadoop software.

Traverso explains the architecture of Facebook's new Presto engine. Source: Jordan Novet

A Facebook engineer presenting at its Analytics@Webscale event that also featured LinkedIn and Twitter. Source: Jordan Novet

There's also a set of web companies besides Yahoo and Google (which technically isn't even a Hadoop shop) building and open sourcing some impressive Hadoop technologies. Facebook and Twitter are probably the most active, although even Microsoft and Netflix have gotten into the act. The Hadoop vendor community is no doubt watching what they're doing and whether their projects are catching on among the greater user community, and assessing how or when to integrate them or turn them into enterprise software.

But considering the complexity of Hadoop, both as a distributed system expected to run at scale and as a technology expected to plug into myriad existing enterprise data systems, any sort of meaningful engineering partnership really has to help. "Is it possible for a company to move the tech forward on its own? Yes," Hortonworks' Pavlik said. "Is it desirable? Our view is no."