I just returned from Hadoop World in New York City. Hadoop is the open source project that makes Google’s ideas, distributed file systems plus Map and Reduce algorithms for turning huge data sources into usable information, available to the rest of us. Cloudera, the commercial company supporting Hadoop, put on Hadoop World to bring together users and practitioners of Hadoop.
Why all the excitement? In talk after talk we were shown how Hadoop is being used today to solve real-world problems. Many times we heard of cases where there was so much data to process that it either could never be used or took weeks to answer a simple query. Hadoop has changed the game by producing answers on terabyte and petabyte data sets in hours or minutes.
There are only a couple of times in my career when I’ve felt on the cusp of a technology that can really change the way we do things. The previous ones that I’ve seen include:
- The early days of minicomputers, including the HP 3000 that I specialized in, where transactional processing could be done at one tenth the cost of mainframes.
- When I registered the 80,000th web site in the world. Six months later there were 250,000.
There is much more to come, but Hadoop is in production and making a difference today. There is much to learn. Alexander Sicular does a great job of giving us an overview in his blog post “Are You New to Hadoop? Settle in …”
Understanding Hadoop and the Challenges Ahead
For decades many of us have been dealing with a fundamental problem: we have too much data to select, sort, and report on with the hardware available to us. I’ve seen this problem since the 1980s. While there have been many hardware and software advances since then, data is still growing faster than existing solutions can provide answers. The problem is getting worse as the Internet and other technologies produce data at a faster rate than ever before.
Thanks to Google, a whole new way of solving the problem has been created:
- Distribute the data over many commodity servers
- Use two algorithms called Map and Reduce to select and sort the data you want
- Execute Map and Reduce in parallel on the hundreds of servers where the data has been distributed
- Hide as many details as possible of the distributed architecture from the end user
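To make the Map and Reduce steps concrete, here is a minimal sketch, in Python rather than Hadoop’s native Java, that simulates the three phases of a word-count job on a single machine. The function names (map_fn, reduce_fn, run_job) are illustrative, not part of any Hadoop API; in a real cluster the shuffle step and the parallel execution across servers are what Hadoop handles for you.

```python
from collections import defaultdict

def map_fn(line):
    # Map phase: emit a (word, 1) pair for every word in an input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_fn(word, counts):
    # Reduce phase: combine every count emitted for a single word.
    return (word, sum(counts))

def run_job(lines):
    # Shuffle phase: group intermediate pairs by key, as Hadoop does
    # between the map and reduce phases, before handing each group
    # to a reducer.
    grouped = defaultdict(list)
    for line in lines:
        for word, count in map_fn(line):
            grouped[word].append(count)
    return dict(reduce_fn(w, c) for w, c in grouped.items())

print(run_job(["big data big answers", "big questions"]))
# {'big': 3, 'data': 1, 'answers': 1, 'questions': 1}
```

Because each map call touches only its own line and each reduce call only its own key, the two phases can run on hundreds of servers at once, which is exactly the property Hadoop exploits.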
Hadoop is the open source implementation of what is described above. Parallel computing is really difficult, and Hadoop abstracts most of the difficulty away. Based on the many talks that I heard at Hadoop World, Hadoop is out of the lab and in use solving real-world problems at organizations such as Yahoo and Facebook.
While Hadoop has made considerable progress recently, there are still many challenges to wide adoption of the platform:
- It is a technical solution with many different parts. It requires a highly technical person to understand and install all of the components.
- You need many servers to use Hadoop. At least ten, and many installations use hundreds. That’s fine if you are Yahoo or Facebook, but daunting if you are a Fortune 500 company. There are third-party solutions from Amazon, Rackspace, and SoftLayer, but using a third-party supplier can introduce security issues, since your data is hosted outside the corporate firewall.
- You need to code your own Map and Reduce functions. The best way to code these functions is in Java. For business analysts who are trying to ask questions about their data, this is a major impediment. They don’t have the skills to code their own Map and Reduce functions, so they need to find and work with a top-notch programmer to get the functions done.
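Even the gentler paths require real programming. Hadoop’s Streaming interface, for example, lets you write the two functions in any language that reads standard input and writes standard output, but you still have to write them. A minimal sketch of a streaming-style word-count pair might look like this (the function names are mine; in practice each would be a separate script handed to the Hadoop Streaming jar):

```python
import sys

def mapper(stdin=sys.stdin, stdout=sys.stdout):
    # Mapper script: emit "word<TAB>1" for every word.
    # Hadoop sorts these lines by key before the reduce step.
    for line in stdin:
        for word in line.split():
            stdout.write(f"{word.lower()}\t1\n")

def reducer(stdin=sys.stdin, stdout=sys.stdout):
    # Reducer script: because input arrives sorted by key, all the
    # counts for one word are adjacent and can be summed in a single pass.
    current, total = None, 0
    for line in stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                stdout.write(f"{current}\t{total}\n")
            current, total = word, 0
        total += int(count)
    if current is not None:
        stdout.write(f"{current}\t{total}\n")
```

Simple as this is, it is still code, with edge cases around sorting and key boundaries, which is exactly why analysts end up needing a programmer.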
All these challenges will have solutions. Cloudera is making Hadoop easier to deploy. Amazon offers a Hadoop service that makes it easy to run Hadoop jobs. And there are many efforts to build scripting and SQL-like tools that let analysts ask questions without having to learn Map and Reduce.
What used to be impossible is now possible thanks to Hadoop. The questions we should be asking ourselves are:
- What information is hiding in our data that we could use to improve our business?
- If we could now ask questions of enormous data sets that we could never ask before, what should those questions be?
Good luck finding answers from your big data.