{"id":330,"date":"2010-05-09T14:37:26","date_gmt":"2010-05-09T19:37:26","guid":{"rendered":"http:\/\/jamesdevine.info\/?p=330"},"modified":"2010-05-12T20:06:51","modified_gmt":"2010-05-13T01:06:51","slug":"getting-hadoop-mapreduce-0-20-2-running-on-ubuntu","status":"publish","type":"post","link":"https:\/\/jamesdevine.info\/index.php\/2010\/05\/getting-hadoop-mapreduce-0-20-2-running-on-ubuntu\/","title":{"rendered":"Getting Hadoop MapReduce 0.20.2 Running On Ubuntu"},"content":{"rendered":"<p>I decided to set up a Hadoop cluster and write a MapReduce job for my distributed systems final project. I had done this before with an earlier release and it was fairly straightforward. It turns out it is still straightforward with Hadoop 0.20.2, but the process is not well documented and the configuration has changed. Hopefully I can clear up the process here.<\/p>\n<h3>What is Hadoop MapReduce?<\/h3>\n<p>MapReduce is a powerful distributed computation technique pioneered by Google. Hadoop MapReduce is an open source implementation written in Java that is maintained by the Apache Software Foundation. Hadoop MapReduce consists of two main parts: the Hadoop Distributed File System (HDFS) and the MapReduce system.<\/p>\n<h3>Getting Hadoop<\/h3>\n<p>The first step is to download Hadoop from <a href=\"http:\/\/hadoop.apache.org\/mapreduce\" target=\"_blank\">http:\/\/hadoop.apache.org\/mapreduce<\/a>. It is worthwhile to read up on how Hadoop and MapReduce work before moving on to the installation and configuration.<\/p>\n<h3>Plan The Installation<\/h3>\n<p>Before the actual installation there is a bit of planning to be done. Hadoop works best when run from a local file system. However, for convenience it is also nice to have a common NFS file share for configuration and log files. Below is a diagram of what I set up. 
For the distributed setup at least two nodes are required.<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/network.jpg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"aligncenter size-full wp-image-348\" title=\"network\" src=\"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/network.jpg?resize=280%2C213&#038;ssl=1\" alt=\"\" width=\"280\" height=\"213\" srcset=\"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/network.jpg?w=776&amp;ssl=1 776w, https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/network.jpg?resize=300%2C228&amp;ssl=1 300w\" sizes=\"(max-width: 280px) 100vw, 280px\" \/><\/a><\/p>\n<h3>Initial Setup<\/h3>\n<p>Before setting up the actual Hadoop system there is some initial setup to complete, namely the creation of a directory on each node and a shared ssh key. The first step is the easiest. A Hadoop install directory needs to be created on each node that is going to be part of the system. The directory must have the same name and location on each node. It is recommended not to use an NFS file share for the installation directory as it can hurt performance.<\/p>\n<p>After the install directory has been created, a shared ssh key needs to be generated on each node and added to the authorized_keys file. 
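As a sketch of that key setup (the paths here are illustrative; on a real node you would use the OpenSSH defaults ~\/.ssh\/id_rsa and ~\/.ssh\/authorized_keys, and append each node's public key to every other node's authorized_keys):

```shell
# Generate a passwordless RSA key and authorize it (sketch; repeat on each node).
# /tmp paths are placeholders for the real ~/.ssh locations.
rm -f /tmp/hadoop_id_rsa /tmp/hadoop_id_rsa.pub           # start clean for the demo
ssh-keygen -t rsa -P "" -f /tmp/hadoop_id_rsa             # empty passphrase = passwordless login
cat /tmp/hadoop_id_rsa.pub >> /tmp/demo_authorized_keys   # authorize the key
```

After distributing the keys, ssh from the master to each slave once so the host keys are accepted.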
This allows for passwordless ssh login and is required by the Hadoop cluster startup scripts.<\/p>\n<h3>Open Firewall Ports<\/h3>\n<p>Hadoop requires a number of ports to be open for the system to work.<\/p>\n<table border=\"0\" align=\"center\">\n<tbody>\n<tr>\n<td><strong>Port<\/strong><\/td>\n<td><strong>Function<\/strong><\/td>\n<\/tr>\n<tr>\n<td>50010<\/td>\n<td>DataNode Port<\/td>\n<\/tr>\n<tr>\n<td>50020<\/td>\n<td>DataNode IPC Port<\/td>\n<\/tr>\n<tr>\n<td>50030<\/td>\n<td>MapReduce Administrative Page<\/td>\n<\/tr>\n<tr>\n<td>50105<\/td>\n<td>Backup\/Checkpoint node<\/td>\n<\/tr>\n<tr>\n<td>54310<\/td>\n<td>HDFS File System<\/td>\n<\/tr>\n<tr>\n<td>54311<\/td>\n<td>JobTracker Service<\/td>\n<\/tr>\n<tr>\n<td>50060<\/td>\n<td>TaskTracker Port<\/td>\n<\/tr>\n<tr>\n<td>50070<\/td>\n<td>DFS Administrative Webpage (namenode)<\/td>\n<\/tr>\n<tr>\n<td>50075<\/td>\n<td>DataNode Port<\/td>\n<\/tr>\n<tr>\n<td>50090<\/td>\n<td>SecondaryNameNode Port<\/td>\n<\/tr>\n<\/tbody>\n<\/table>\n<h3>Configuration Files<\/h3>\n<p>There are three main configuration files that need to be edited: hdfs-site.xml, mapred-site.xml, and core-site.xml. Each file resides in the conf folder of the extracted Hadoop installation. There are a lot of parameters that can go into each file but only a few basic ones need to be set. I have provided my configuration files below. The final file that needs to be edited is hadoop-env.sh, which is a shell script that sets up Hadoop environment variables. 
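The one line in hadoop-env.sh that always needs attention looks something like this (the JDK path is illustrative; point it at the JDK actually installed on your nodes):

```shell
# conf/hadoop-env.sh -- the shipped JAVA_HOME line, uncommented and set.
# The path below is an example (Ubuntu's Sun JDK 6 package); substitute your own.
export JAVA_HOME=/usr/lib/jvm/java-6-sun
```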
At the very least the $JAVA_HOME variable needs to be uncommented and properly set.<\/p>\n<p><a href=\"https:\/\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/core-site.xml\">core-site.xml<\/a><\/p>\n<p><a href=\"https:\/\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/hdfs-site.xml\">hdfs-site.xml<\/a><\/p>\n<p><a href=\"https:\/\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/mapred-site.xml\">mapred-site.xml<\/a><\/p>\n<p><a href=\"https:\/\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/hadoop-env.sh\">hadoop-env.sh<\/a><\/p>\n<h3>Set the Slaves and Master<\/h3>\n<p>The master node needs to be defined in the hadoop_dir\/conf\/masters file. Each slave node needs to be listed in the hadoop_dir\/conf\/slaves file, one machine name\/IP address per line.<\/p>\n<h3>Deploy the Installation and Configuration Files<\/h3>\n<p>The installation and configuration files need to be deployed to each node in the cluster. The easiest way to do this is through scp. I wrote the script below so that I could run a command on each node in my cluster. An alternative is the Cluster SSH program (cssh). Either approach is preferable to logging in to each node to run a command.<\/p>\n<p>Using my run_comm.sh script I ran scp on each node in the cluster:<\/p>\n<pre>.\/run_comm.sh \"scp -r ~\/hadoop \/opt\/hadoop\/hadoop\"<\/pre>\n<p>This runs the command in quotes on each node in the cluster. In this case I copied the Hadoop installation from the NFS share (my home directory) to a local directory on each node.<\/p>\n<p><a href=\"https:\/\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/run_comm.sh\">run_comm.sh<\/a><\/p>\n<h3>Formatting the NameNode<\/h3>\n<p>Now that the Hadoop files are on each node, the NameNode can be formatted to set up the Hadoop File System.<\/p>\n<pre>hadoop_dir\/bin\/hadoop namenode -format<\/pre>\n<h3>Starting the Hadoop File System<\/h3>\n<p>Now that the namenode has been formatted the distributed file system (DFS) can be started. 
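Before bringing it up, it is worth sanity-checking the configuration. As an illustration only (my actual files are linked above; the host name and directory here are placeholders), a minimal 0.20-style core-site.xml pointing the file system at port 54310 on the master looks like:

```xml
<?xml version="1.0"?>
<!-- Minimal illustrative core-site.xml; "master" and the tmp dir are placeholders. -->
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://master:54310</value>
  </property>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/opt/hadoop/tmp</value>
  </property>
</configuration>
```

hdfs-site.xml and mapred-site.xml use the same property format. With the configuration deployed and the NameNode formatted, the DFS can be brought up.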
This is done by using the start-dfs.sh script in the bin directory of the Hadoop installation.<\/p>\n<pre>hadoop_dir\/bin\/start-dfs.sh<\/pre>\n<p>The status of the Hadoop File System can be viewed from the administrative page on the master server, http:\/\/master_server:50070. <a href=\"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/dfs.jpg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"aligncenter size-full wp-image-352\" title=\"dfs\" src=\"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/dfs.jpg?resize=515%2C485&#038;ssl=1\" alt=\"\" width=\"515\" height=\"485\" srcset=\"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/dfs.jpg?w=859&amp;ssl=1 859w, https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/dfs.jpg?resize=300%2C282&amp;ssl=1 300w\" sizes=\"(max-width: 515px) 100vw, 515px\" \/><\/a><\/p>\n<h3>Starting the MapReduce System<\/h3>\n<p>The final step to setting up MapReduce is to start the MapReduce system. 
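As with the DFS daemons, the JobTracker and TaskTrackers need their ports (see the table earlier) reachable between nodes. One low-risk way to handle this is to print the firewall rules first and review them before applying with root privileges (iptables syntax shown; adjust to your firewall):

```shell
# Print (not apply) iptables rules opening the Hadoop ports from the table above.
# Pipe the output to "sudo sh" only after reviewing it.
for port in 50010 50020 50030 50060 50070 50075 50090 50105 54310 54311; do
  echo "iptables -A INPUT -p tcp --dport ${port} -j ACCEPT"
done
```

Once the ports are open, the MapReduce daemons can be started.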
This is done by using the start-mapred.sh script that is located in the bin directory of the Hadoop installation.<\/p>\n<pre>hadoop_dir\/bin\/start-mapred.sh<\/pre>\n<p>The status of the MapReduce system can be viewed from the administrative page on the master server, http:\/\/master_server:50030.<\/p>\n<p style=\"text-align: center;\"><a href=\"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/mapred.jpg?ssl=1\"><img data-recalc-dims=\"1\" decoding=\"async\" class=\"aligncenter size-full wp-image-353\" title=\"mapred\" src=\"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/mapred.jpg?resize=585%2C485&#038;ssl=1\" alt=\"\" width=\"585\" height=\"485\" srcset=\"https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/mapred.jpg?w=975&amp;ssl=1 975w, https:\/\/i0.wp.com\/jamesdevine.info\/wp-content\/uploads\/2010\/05\/mapred.jpg?resize=300%2C248&amp;ssl=1 300w\" sizes=\"(max-width: 585px) 100vw, 585px\" \/><\/a><\/p>\n<h3>Submitting a MapReduce Job<\/h3>\n<p>Now that the cluster is up and running, it is ready to accept MapReduce jobs. This is done using the hadoop executable from the bin directory of the Hadoop installation and a jar file that contains a MapReduce program. An example of running the WordCount demo program provided with Hadoop is shown below.<\/p>\n<pre>hadoop_dir\/bin\/hadoop jar jar_location\/wordcount.jar org.myorg.WordCount \/file_dir_in_hdfs \/output_dir_in_hdfs<\/pre>\n","protected":false},"excerpt":{"rendered":"<p>I decided to set up a Hadoop cluster and write a MapReduce job for my distributed systems final project. 
I had [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"om_disable_all_campaigns":false,"_monsterinsights_skip_tracking":false,"_monsterinsights_sitenote_active":false,"_monsterinsights_sitenote_note":"","_monsterinsights_sitenote_category":0,"_uf_show_specific_survey":0,"_uf_disable_surveys":false,"_jetpack_memberships_contains_paid_content":false,"footnotes":""},"categories":[17,22,26,1],"tags":[],"class_list":["post-330","post","type-post","status-publish","format-standard","hentry","category-linux","category-systems","category-server","category-general-information"],"aioseo_notices":[],"jetpack_featured_media_url":"","jetpack-related-posts":[],"jetpack_sharing_enabled":true,"_links":{"self":[{"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/posts\/330","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/comments?post=330"}],"version-history":[{"count":28,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/posts\/330\/revisions"}],"predecessor-version":[{"id":341,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/posts\/330\/revisions\/341"}],"wp:attachment":[{"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/media?parent=330"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/categories?post=330"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/jamesdevine.info\/index.php\/wp-json\/wp\/v2\/tags?post=330"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}