This is a mirror of the official site: http://jasper-net.blogspot.com/

Apache Hadoop: Best Practices and Anti-Patterns

Thursday, November 11, 2010
Apache Hadoop is a software framework for building large-scale, shared storage and computing infrastructures. Hadoop clusters are used for a variety of research and development projects, and for a growing number of production processes at Yahoo!, eBay, Facebook, LinkedIn, Twitter, and other companies. Hadoop is a key component of several business-critical endeavors and represents a very significant investment. Appropriate use of the clusters and of Hadoop is therefore critical to getting the best possible return on this investment.

This blog post is a compendium of best practices for applications running on Apache Hadoop. We introduce the notion of a Grid Pattern which, like a Design Pattern, represents a general, reusable solution for applications running on the Grid.

This blog post enumerates characteristics of well-behaved applications and provides guidance on appropriate uses of various features and capabilities of the Hadoop framework. It is largely prescriptive in nature: applications that follow, in spirit, the best practices prescribed here are very likely to be efficient and well-behaved in the multi-tenant environment of Apache Hadoop clusters, and unlikely to run afoul of most policies and limits.

This blog post also attempts to highlight some of the anti-patterns for applications running on the Apache Hadoop clusters.

Overview

Applications processing data on Hadoop are written using the Map-Reduce paradigm.

A Map-Reduce job usually splits the input data-set into independent chunks, which are processed by the map tasks in a completely parallel manner. The framework sorts the outputs of the maps, which then become the input to the reduce tasks. Typically both the input and the output of the job are stored in a file-system. The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
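To make the flow described above concrete, here is a minimal in-memory sketch of a word-count job in plain Python. It is only an illustration of the split, map, sort, and reduce steps. It does not use the Hadoop API, and it leaves out what the framework handles for real jobs (distributed storage, scheduling, monitoring, and re-execution of failed tasks). The function names (split_input, map_task, sort_map_outputs, reduce_task, run_job) and the chunk_size parameter are hypothetical, chosen for this example.

```python
from itertools import groupby
from operator import itemgetter


def split_input(lines, chunk_size):
    """Split the input data-set into independent chunks of lines."""
    return [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]


def map_task(chunk):
    """Map step: emit a (word, 1) pair for every word in one chunk."""
    return [(word.lower(), 1) for line in chunk for word in line.split()]


def sort_map_outputs(map_outputs):
    """Merge the outputs of all map tasks and sort them by key.

    This stands in for the sort the framework performs between
    the map and reduce phases.
    """
    merged = [pair for output in map_outputs for pair in output]
    return sorted(merged, key=itemgetter(0))


def reduce_task(key, values):
    """Reduce step: sum all counts emitted for one key."""
    return key, sum(values)


def run_job(lines, chunk_size=2):
    """Run the whole job sequentially and return {word: count}."""
    chunks = split_input(lines, chunk_size)
    # Each chunk is processed independently, so a real framework
    # could run these map tasks in parallel.
    map_outputs = [map_task(chunk) for chunk in chunks]
    sorted_pairs = sort_map_outputs(map_outputs)
    results = {}
    # groupby relies on the pairs being sorted by key already.
    for key, group in groupby(sorted_pairs, key=itemgetter(0)):
        word, total = reduce_task(key, (count for _, count in group))
        results[word] = total
    return results


if __name__ == "__main__":
    print(run_job(["a b a", "b c", "a"]))
```

Because each chunk is mapped without reference to any other chunk, the map calls could be distributed across machines without changing the result. The sort between the two phases is what guarantees that each reduce call sees all values for its key together.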

Read more: Yahoo developer network
