How to Start With Hadoop: 3 Key Questions to Ask Yourself
This is part one of a two-part series on setting up Hadoop. Link to part 2
One of the big challenges people have with Hadoop is knowing how to scale beyond a single personal computer. Even more advanced folks I have talked to, on their second, third, or fourth-plus cluster, spend a lot of time rethinking their strategy. Any way you slice it, setting up and maintaining your cluster can be a real head-scratcher no matter what software vendors or cloud providers tell (or promise) you. Using a cloud provider’s infrastructure is just as complex, no matter how easy, secure, or reliable they claim it to be. In this blog article I want to share some of the things I have learned over the years and the issues we have run into at UpStream.
I have had the pleasure, or maybe the curse, of building several clusters – both in the “cloud” and on “traditional” dedicated hardware – using both MapR and Cloudera. At the start of each of those projects there were a lot of unknowns, since we were doing something new and different each time. The three key questions I think you should ask (and attempt to answer) are:
1 - How will the cluster be used?
2 - Who will be involved?
3 - What configuration options are available that meet your objectives and fit within your company’s IT design?
These questions are difficult to answer, and you will likely need to revisit them after you have done your research and have some options in hand. It can’t hurt to think through some scenarios up front; just remember to revise and then communicate them in the end, because this is what your users are going to think about on the first day they get their hands on the cluster. The important advice at this stage is to try before you buy. No matter where the cluster is, who provides the distribution, or whether the application and data it relies on are half-baked – take it for a test drive. I guarantee that once you get engineers, IT, and business users testing different combinations, you will uncover pros and cons you had no idea existed.
How will the cluster be used?
Your engineering team will likely be the first users of Hadoop. Your developers will be dumping data into the cluster (maybe with the help of IT or Ops) and running jobs as coding is underway. Also think about how efficient your users will be at writing jobs and what trade-offs your engineering department might make. Is it more efficient to use brute force and churn out code quickly, or will they spend some percentage of their time optimizing and tuning? Whoever holds the P&L and strategy goals will need to make that trade-off. Testing out some code is also going to reveal quite a lot about the types of jobs they are running (CPU-heavy, IO-heavy, etc.) and hopefully shine a light on how those match up to the hardware.
In a perfect world you would get some test gear in and benchmark their jobs to help figure out server configuration and the number of nodes, but in all likelihood you are going to have to guess. Ask your engineering team what other solutions are going to be in the mix. Is Hadoop going to be just a “slow” batch job processor? Are they going to try to add in HBase? Are they trying to tie in Cassandra, Riak, or other NoSQL options (or even other SQL options)? Most Hadoop solutions are hybrids. Get a dialog going on where the full architecture is headed.
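If you do get the chance to benchmark, even a back-of-the-envelope comparison of a job’s runtime against its raw disk-read floor can tell you whether it is IO-heavy or CPU-heavy. Here is a minimal sketch of that arithmetic – the disk throughput, node count, and runtimes are entirely hypothetical placeholders for your own measurements:

```python
# Rough check: is a benchmarked job IO-bound or CPU-bound?
# disk_mb_per_sec, input sizes, and runtimes below are hypothetical --
# substitute measurements from your own test runs.

def classify_job(input_gb, runtime_sec, nodes, disk_mb_per_sec=100):
    """Compare actual runtime to the theoretical minimum time needed just
    to read the input from disk across all nodes. A job that runs close to
    that floor is IO-bound; one that takes far longer is likely CPU-bound
    (or shuffle-heavy)."""
    io_floor_sec = (input_gb * 1024) / (disk_mb_per_sec * nodes)
    ratio = runtime_sec / io_floor_sec
    label = "IO-bound" if ratio < 2 else "CPU-bound (or shuffle-heavy)"
    return label, ratio

# e.g. 500 GB of input, a 30-minute runtime, 10 nodes
label, ratio = classify_job(input_gb=500, runtime_sec=1800, nodes=10)
```

The useful output is the ratio, not the label: a job running at three to four times its IO floor suggests tuning effort should go into the compute and shuffle phases rather than into faster disks.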
Eventually your cluster(s) will need to support production jobs with real or perceived SLAs. If those already exist, or you can get some idea of the goals, it is great to capture them, since they will be important to your decision process. You will probably not know what to do with them out of the gate, especially if this is your first cluster or you are making radical design changes. Also think about your backup plan. Hadoop is complex, and even if you have support through a vendor, inevitably it will break or be down, much like any other complex system (Teradata, Oracle, etc.). Talk about your company’s comfort level with data loss and downtime. It is better for everyone to be aware than to be surprised later.
Who is going to be involved?
Once enough people in the organization find out about this gem, expect other groups to want access in some shape or form. You might receive requests from analysts to run Hive jobs, since they are comfortable with its SQL-like syntax. Also expect analytics teams to want to export certain elements to their data warehouse, or to expose them to other tools like Tableau. Inevitably people will gravitate to this new toy, and some team in your company is going to need to broker their requests and play peacemaker.
Finally, think about your IT investment in the cluster. You will need to define who will be involved in the project: network engineer, hardware engineer, procurement, Hadoop software configuration engineer, and 24/7 support (onshore and offshore). Ask yourself whether you have the right team in place, whether you need outside help or training, or maybe some of both. Supporting Hadoop takes time and resources. I personally don’t think it is as complex as an ERP or high-end database system; some “slop” is OK, and Hadoop is actually pretty fault tolerant. But the law of large numbers will catch you. Eventually, when you have a lot of boxes, terabytes-plus of data, and hundreds of daily jobs, the care and feeding of the beast becomes time intensive. Keep all this information about how you are thinking of supporting the cluster for later – I guarantee it will raise more questions than answers.
What options are available?
Now that you have spent some time thinking through who will use the cluster and how, turn your attention to the data. This will be the single biggest driver of how you design and scale the cluster. Inevitably you will underestimate how much data people are going to want to feed it. Any estimate engineering or the business gives you will be wrong, probably by an order of magnitude. Don’t beat them up too much; this isn’t their fault. The cluster is going to take on a life of its own: the more success you have with projects, the more data people are going to want to throw at it. Also, don’t think about Hadoop the way you might approach a database installation, since it is completely different. In the database world IO is king and hugely expensive, and super-fast disk arrays are a prime consideration. In Hadoop, IO is important (and usually a bottleneck), but the real pain in the neck is the amount of storage space. In most real-world cases, convincing people to delete or clean up data is a task second only to going to the dentist. As long as you are not violating any corporate data retention policies, I would err on the side of keeping your important data forever, or at least err on the side of space abundance.
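To put some numbers behind the space problem, here is a minimal capacity-sizing sketch. The replication factor of 3 is the HDFS default; the 25% reserve for intermediate job output and the growth figures are assumptions you should replace with your own:

```python
import math

# Rough HDFS capacity sizing. replication=3 is the HDFS default;
# temp_reserve and the growth numbers below are assumptions to adjust
# for your own environment.

def nodes_needed(raw_tb, growth_per_year, years, disk_tb_per_node,
                 replication=3, temp_reserve=0.25):
    """Project raw data forward, multiply by the replication factor,
    add headroom for intermediate/temp output, and divide by the
    usable disk per node."""
    projected_raw_tb = raw_tb * ((1 + growth_per_year) ** years)
    required_tb = projected_raw_tb * replication * (1 + temp_reserve)
    return math.ceil(required_tb / disk_tb_per_node)

# e.g. 50 TB today, doubling every year, planning 2 years out,
# 24 TB of usable disk per node
print(nodes_needed(raw_tb=50, growth_per_year=1.0, years=2,
                   disk_tb_per_node=24))  # -> 32
```

The point of the exercise is less the exact node count than seeing how quickly replication, temp space, and growth compound a modest raw-data estimate into a much larger cluster.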
By now the wheels should be turning. You are probably tossing out questions like: Do I need one cluster or two? Do I trick out each node with lots of RAM and disks, or keep my per-node price cheap and err on the side of having more of them? If you already have a cluster under your belt (or more than one), at some point you will have to address the need for more computing capacity and conflicting SLAs (Joe can’t run his uber-query and your production report job at the same time). Whatever you choose, you will inevitably need to expand.
You have two expansion paths to choose from: add more nodes or bring up another cluster. Adding more nodes is generally pretty easy, especially if you are using automation tools like Puppet or Chef, though the first couple of times you will miss some things, especially with regard to tuning. Bringing up a second cluster is another good option. It gives you a backup plan in case one cluster goes down, and it may be easier if you want to isolate people and jobs. It also provides a safer option if the word “upgrade” is being used on your primary cluster – no process is scarier. Long term, you will probably expand along both paths, adding nodes to your cluster as well as bringing up additional clusters.