I’ve been trying to get my head around what exactly Hadoop is, how one would use it, and how I can take advantage of it. Well, it turns out that Hadoop is a lot more than just an open-source Java implementation of MapReduce. It’s also an entire supporting infrastructure for replicating data and intelligently distributing your MapReduce jobs across your cluster.
Knowing that I’m not the only one wishing there was a single document with all the Hadoop and MapReduce basics, I wrote one. Or, more accurately, I summarized the info from various pages into one. Enjoy, and let me know if it was useful to you.
P.S. The really short version is that if you have large data sets or files that need mining, and lots of computers (or a willingness to run your data through Amazon’s EC2 and S3), Hadoop is worth checking out. I’ll try installing it this weekend and write up my notes for you.