How can I add Apache Mahout to my Hadoop instance?

Apache Mahout is a suite of machine learning libraries. Depending on the algorithm, Mahout can work with or without Hadoop.

We will show how Mahout can be added to a Bright cluster that already has a Hadoop instance installed. In this example the instance is named "CDH5.2.1" and runs Cloudera CDH 5.2.1. An example of using Mahout to run a MapReduce job is then given.
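
Before starting, the presence of the Hadoop instance can be confirmed on the active head node, for example by listing the available Hadoop environment modules. The module shown is the one loaded in step 3 below:

# module avail hadoop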

1. Download the Apache Mahout tarball and unpack it
Execute the following commands on the active head node as the root user:

# cd /tmp/
# curl -O http://archive.cloudera.com/cdh5/cdh/5/mahout-0.9-cdh5.2.1.tar.gz
# cd /cm/shared/apps/hadoop/Cloudera
# tar xvzf /tmp/mahout-0.9-cdh5.2.1.tar.gz
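
A quick check confirms that the tarball unpacked into the location expected by the mahout invocation in step 4:

# ls /cm/shared/apps/hadoop/Cloudera/mahout-0.9-cdh5.2.1/bin/mahout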

2. Grant access to HDFS for user "foobar"
Granting access will create the directory /user/foobar in HDFS.

# cmsh
% user use foobar
% set hadoophdfsaccess cdh5.2.1 
% commit
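
After quitting cmsh, the new HDFS home directory can be checked as a sanity test. This loads the same Hadoop module that is used in step 3; the listing of /user should show a foobar directory owned by foobar:

# su - foobar
$ module load hadoop/CDH5.2.1/Cloudera/2.5.0-cdh5.2.1
$ hdfs dfs -ls /user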

3. Prepare the input for the Mahout test
For the Naive Bayes classifier test, a sample of Wikipedia articles in XML format will be used.

# su - foobar
$ curl -O http://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
$ bunzip2 enwiki-latest-pages-articles1.xml-p000000010p000010000.bz2
$ module load hadoop/CDH5.2.1/Cloudera/2.5.0-cdh5.2.1 
$ hdfs dfs -mkdir /user/foobar/wiki
$ hdfs dfs -copyFromLocal enwiki-latest-pages-articles1.xml-p000000010p000010000 /user/foobar/wiki
$ hdfs dfs -ls /user/foobar/wiki
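
Optionally, the size of the uploaded dump can be compared with the local file to confirm that the copy completed:

$ ls -lh enwiki-latest-pages-articles1.xml-p000000010p000010000
$ hdfs dfs -du -h /user/foobar/wiki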

4. Execute the Mahout job (as a YARN application) and check the result

# su - foobar
$ /cm/shared/apps/hadoop/Cloudera/mahout-0.9-cdh5.2.1/bin/mahout seqwiki -i /user/foobar/wiki/enwiki-latest-pages-articles1.xml-p000000010p000010000 -o /user/foobar/wiki/seqfiles
$ hdfs dfs -ls /user/foobar/wiki/seqfiles
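
While the job runs, its progress can be followed as a YARN application, and afterwards the generated sequence files can be inspected with Mahout's seqdumper utility. The chunk file name below is only an assumed example; substitute a name shown by the ls command above:

$ yarn application -list
$ /cm/shared/apps/hadoop/Cloudera/mahout-0.9-cdh5.2.1/bin/mahout seqdumper -i /user/foobar/wiki/seqfiles/chunk-0 | head -n 40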