Real World Use of Hadoop at Yahoo

 

Eric Baldeschwieler
Eric Baldeschwieler

In an earlier post, I featured how Yahoo and Hadoop Sort Big Data. At Hadoop World, Eric Baldeschwieler of Yahoo gave a featured presentation on Hadoop Applications at Yahoo! 

Yahoo! have been using Hadoop for years. They are also the biggest contributors, testers, and users of Hadoop. Yahoo! has a 4,000 node cluster to run Hadoop — surely one of the largest in the world.
 
While the ideas behind Hadoop are simple — break down the task of analying large data files into a mapping step and a reduction step, both of which can be run on independent nodes. The reality is that running huge Hadoop tasks on hundreds or thousands of servers is a massive technical challenge. Yahoo! scientiests and researchers have 82 petabytes of data distributed on 25,000 nodes. Eric and his team keep it all running.

Going from Weeks to Days 

 A concrete example that Eric presented is the database for Yahoo!s Search Assist ™ feature. This feature gives predictive input to users as they type in the search window. The database is built using Hadoop by analysing three years of log data using a 20-step map-reduce. Here are the results of using Hadoop in the before and after picture of Search Assist:
Yahoo! Search Assist Performance

Both the execution time and development time improvements seem outlandish. But time and again, Eric has seen these types of improvements at Yahoo! If you have lots of data and need to tame it in near real time, Yahoo! has shown that Hadoop is one of the solutions.

Hadoop World Excitement and the Challenges Ahead

Times Square, New York City

I just returned from Hadoop World in New York City. Hadoop is the open source project that has made Google’s ideas for distributed file systems and algorithms to map and reduce huge data sources to useable information available to the rest of us. Cloudera, the commercial company supporting Hadoop, put on Hadoop World to bring together users and practitioners of Hadoop.

Why all the excitment? In talk after talk we were shown how Hadoop is being used today to solve real world problems. Many times we heard of cases where there was so much data to process that either it could never be used or it would take weeks to process a simple query. Hadoop has changed the game by producing answers on terabyte and petabyte data sets in hours or minutes.

There are ony a couple of times in my career when I’ve felt on the cusp of technology that can really change the way we do things. The previous ones that I’ve seen include:

  1. The early days of mini-computers, including the HP 3000 that I specialized on, where transactional processing could be done at one tenth the cost of mainframes.
  2. When I registeredthe 80,000th web site in the world. Six months later there were 250,000.

There is much more to come, but Hadoop is in production and making a difference today. There is much to learn. Alexander Sicular does a great job of giving us an overview in his blog posting “Are You New to Hadoop? Settle in …”

Understanding Hadoop and the Challenges Ahead

For decades many of us have been dealing with a fundamental problem. We have too much data to be able to select, sort, and report on for the given hardware available to us. I’ve seen this problem since the 1980′s. While there have been many hardware and software advances since then, data is still growing faster than existing solutions can provide answers. The problem is getting worse as the Internet and other technologies produce data at a faster rate than ever before.

Thanks to Google, a whole new way of solving the problem has been created:

  • Distribute the data over many commodity servers
  • Use two algorithms called Map and Reduce to select and sort the data you want
  • Execute Map and Reduce in parallel on the hundreds of servers where the data has been distributed
  • Hide as many details as possible of the distributed architecture from the end user

Hadoop is the open source implementation of what is described above. Parallel computing is really difficult and Hadoop abstracts most of the difficulties out of the way. Based on the many talks that I heard at Hadoop World, Hadoop is out of the lab and in use solving real world problems at many organizations such as Yahoo and Facebook.

While Hadoop has seen numerous progress recently, there are still many challenges for wide adoption of the platform:

  1. It is a technical solution with many different parts. It requires a highly technical person to understand and install all of the components.
  2. You need many servers to use Hadoop. At least ten or more and many are using hundreds. That’s fine if you are Yahoo or Facebook, but daunting if you are a Fortune 500 company. There are third-party solutions from Amazon, Rackspace, and Softlayer, but using a third-party supplier can introduce security issues when trying to host your data outside of the corporate firewall.
  3. You need to code your own Map and Reduce funtions. The best way to code these functions is in Java. For business analysts who are trying to ask questions about their data, this introduces a major impediment. They don’t have the skills to code their own Map and Reduce functions so they need to find and work with a top notch programmer to get the functions done.

All these challenges will have solutions. Cloudera is making it easier to deploy Hadoop. Amazon has a Hadoop service making it easy to deploy Hadoop jobs. There are many efforts to introduce new scripting and SQL-like tools to enable analysts to ask questions without having to learn Map and Reduce.

What used to be impossible is now possible thanks to Hadoop. The questions we should be asking ourselves are:

  1. What information is hiding in our data that we could use to improve our business?
  2. If we could now ask questions of enormous data sets that we could never ask before, what should those questions be?

Good luck finding answers with from your big data.

Yahoo and Hadoop Sort Big Data

I’ve spent a lot of my career working with big data sets, selecting subsets, and sorting. As data sets have grown exponentially over the last decade, we are starting to deal with petabytes of data. New ways of selecting and sorting datasets using concurrent execution have been invented to deal with these large datasets.

Using the open source Hadoop, Yahoo! have demonstrated that they can sort a terabyte in just over one minute. Using the same configuration, Yahoo! sorted a petabyte in 16.25 hours. The sort tests were done using Jim Gray’s Sort Benchmark.

What I found interesting is that the sort tests are really similar to those that I was running on HP’s HP 3000 platform in 2000. The tests have a 10-byte random key followed by 90 bytes of data for 100-byte records. These are typical of many business applications. In my experience, data movement of 100-byte records can overwhelm sorting times versus the time to compare keys. It’s a tribute to both Hadoop and Yahoo! that they have been able to achieve the times that they have.

For more details on how Yahoo did it see Winning a 60 Second Dash with a Yellow Elephant by Owen O’Malley and Arun Murthy of Yahoo!

A Petabyte for Under $120K

Over at Backblaze they have created a custom solution for a petabyte of storage that only costs a bit more than the hard disc drives themselves. They even tell you how to do it yourself in their blog posting Petabytes On A Budget.

Their platform runs on open source solutions, so there are no additional software costs. The form factor of 4U is designed to fit in modern data centers. I think they are several interesting take aways from Backblaze’s work:

  • Storage is cheap. Putting it together can cost a lot, especially if you use some of the popular commercial solutions.
  • It’s still possible to come up with elegant designs using off the shelf components.
  • Elegant designs still take an amazing amount of detail to pull off (just read the part of the post on vibration issues and how they were handled).

Software and Storage

The post glosses over the software and access details, but points out that both control and data access is through HTTPS running through Tomcat and Backblazes own software. We are a point in time where petabytes of storage are a commodity being accessed through a simple protocol.

For me, the post gets really interesting just at the point they start talking about “Cloud Storage: The Next Step”. Quoting from the post:

Building a cloud includes not only deploying a large quantity of hardware, but, critically, deploying software to manage it. At Backblaze we have developed software that de-duplicates and chops data into blocks; encrypts and transfers it for backup; reassembles, decrypts, re-duplicates, and packages the data for recovery; and monitors and manages the entire cloud storage system. This process is proprietary technology that we have developed over the years.

As storage needs keep expanding the amount of duplicated data stored around the planet grows exponentially. Yet removing duplicated data is just one of the things that the Backblaze software has to deal with. For Cloud Computing to take off, we need to move a couple of layers up the application stack to deal with data as more than just storage.

Do you know where your data is being stored?

Posted in Petabyte. No Comments »

Finding Global Leaders Dealing with Petabyte Storage and Performance Challenges

I am refocusing my professional efforts and going back to my technical roots. I am passionate about taking software technology and applying it to business. I have always focused on the leading edge of software technology and I see challenges ahead where I want to help make a difference.

Petabytes and Performance

I’m looking for entpreneurs and leaders who face these challenges:

  • Customers demanding higher performance
  • An organization struggling to cost effectively deliver the performance that customers are demanding
  • Software platforms handling petabytes of data
  • Dealing with the next generation of delivery, whether inside the enterprise or over the Internet
  • Either have or desire global customers who require solutions that fit with the way they do things (both in language and culture)

What I Offer

For anyone dealing with these challenges, this is a reminder of how I might be able to help

  • Full-time energy to take on a significant project (3-12 months duration)
  • Help you reliably deliver your solutions to global customers by using my twenty years experience delivering technically challenging mission critical multilingual software solutions around the planet
  • Find the people and solutions to solve big technical challenges
  • Bridge the gap between all stake holders, technical and non-technical, inside and outside the company, by writing and communicating, in person and remotely
  • Evangelize, raise awareness, and create pull demand for your products with innovative marketing

Finding the Challenges

I’m looking for your help to find those leaders around the planet who are struggling with the challenge of dealing with petabytes of data with cost effective performance. Who are those individuals? Where are they writing and blogging about the next generation solutions to high performance big data problems? I’m looking forward to finding these leaders and helping them to make a difference.