The Bayes example package provides some helper classes for training the Naive Bayes classifier on the Twenty Newsgroups data. See {@link org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups} for details on formatting the Twenty Newsgroups data properly for training and on running the trainer.
The easiest way to prepare the data is to use the ant task in core/build.xml:
    ant extract-20news-18828
  
This runs the preparation step with the following argument line:
    -p ${working.dir}/20news-18828/ -o ${working.dir}/20news-18828-collapse -a ${analyzer} -c UTF-8
  
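For reference, the sketch below shows what running the preparation step by hand might look like, assuming {@link org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups} is the class the ant target drives with that argument line. The classpath placeholder, the input and output paths, and the analyzer class (Lucene's StandardAnalyzer is used here purely as an example value for ${analyzer}) are assumptions to substitute for your own setup:
    # Hypothetical direct invocation; the classpath must include Mahout and its dependencies.
    java -cp <MAHOUT AND DEPENDENCY CLASSPATH> \
        org.apache.mahout.classifier.bayes.PrepareTwentyNewsgroups \
        -p /path/to/20news-18828/ \
        -o /path/to/20news-18828-collapse \
        -a org.apache.lucene.analysis.standard.StandardAnalyzer \
        -c UTF-8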
 
To run the Wikipedia examples (this assumes you have built the Mahout job jar):
  1. Download the Wikipedia Dataset. Use the Ant target: ant enwiki-files
  2. Chunk the data using the WikipediaXmlSplitter (run from the Hadoop home directory; an expanded sketch of this command follows the list):
    bin/hadoop jar <PATH TO MAHOUT>/target/mahout-examples-0.2 org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d <MAHOUT_HOME>/examples/temp/enwiki-latest-pages-articles.xml -o <MAHOUT_HOME>/examples/work/wikipedia/chunks/ -c 64
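For clarity, here is the same chunking command from step 2 with hedged placeholders expanded. The Hadoop and Mahout install locations and the exact name of the examples jar/job file produced by your build are assumptions; adjust them for your environment. The -c flag sets the chunk size in megabytes (64 MB chunks here).
    # Hypothetical locations; substitute your own Hadoop and Mahout directories.
    export HADOOP_HOME=/usr/local/hadoop
    export MAHOUT_HOME=/path/to/mahout
    cd $HADOOP_HOME
    bin/hadoop jar $MAHOUT_HOME/target/mahout-examples-0.2 \
        org.apache.mahout.classifier.bayes.WikipediaXmlSplitter \
        -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles.xml \
        -o $MAHOUT_HOME/examples/work/wikipedia/chunks/ \
        -c 64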
Copyright © 2008 Apache Software Foundation