Introduction

The bayes package provides a MapReduce-enabled implementation of a naive Bayes classifier. The naive Bayes classifier is a simple classifier that counts how often each word occurs in association with each label. Those counts are then used to estimate the likelihood that a new document, given its words, should be assigned a particular label.
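
The counting idea can be illustrated with a small in-memory sketch. This is not the package's MapReduce implementation: the class name NaiveBayesSketch and all of its methods are hypothetical, and the use of raw counts with add-one smoothing is an assumption made to keep the example short.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory sketch; none of these names come from the bayes package. */
final class NaiveBayesSketch {
    /** label -> (word -> number of times the word was seen with that label) */
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    /** label -> total number of words seen with that label */
    private final Map<String, Integer> totalWords = new HashMap<>();
    /** label -> number of training documents with that label */
    private final Map<String, Integer> docCounts = new HashMap<>();
    /** every distinct word seen during training */
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    /** Counts the words of one labelled training document. */
    void train(String label, List<String> words) {
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String word : words) {
            counts.merge(word, 1, Integer::sum);
            totalWords.merge(label, 1, Integer::sum);
            vocabulary.add(word);
        }
        docCounts.merge(label, 1, Integer::sum);
        totalDocs++;
    }

    /** Log prior of the label plus the smoothed log likelihood of each word. */
    double logScore(String label, List<String> words) {
        Map<String, Integer> counts =
            wordCounts.getOrDefault(label, Collections.<String, Integer>emptyMap());
        int total = totalWords.getOrDefault(label, 0);
        int docs = docCounts.getOrDefault(label, 0);
        double score = Math.log((double) docs / totalDocs);
        for (String word : words) {
            // Add-one smoothing (an assumption) keeps unseen words from zeroing the score.
            int count = counts.getOrDefault(word, 0);
            score += Math.log((count + 1.0) / (total + vocabulary.size()));
        }
        return score;
    }

    /** Returns the highest-scoring label, or null if nothing has been trained. */
    String classify(List<String> words) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : docCounts.keySet()) {
            double score = logScore(label, words);
            if (best == null || score > bestScore) {
                best = label;
                bestScore = score;
            }
        }
        return best;
    }
}
```

Scores are summed in log space so that long documents do not underflow to zero. After one call to train per labelled document, classify returns the label whose counts best explain the new document's words.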

Implementation

The implementation is divided up into three parts:

  1. The Trainer -- responsible for counting the words and the labels
  2. The Model -- responsible for holding the training data in a useful way
  3. The Classifier -- responsible for using the trainer's output to determine the category of previously unseen documents

The Trainer

The trainer is implemented by several classes:

  1. {@link org.apache.mahout.classifier.bayes.BayesDriver} -- Creates the Hadoop Naive Bayes job and outputs the model. This driver encapsulates many intermediate MapReduce classes.
  2. {@link org.apache.mahout.classifier.bayes.common.BayesFeatureDriver}
  3. {@link org.apache.mahout.classifier.bayes.common.BayesTfIdfDriver}
  4. {@link org.apache.mahout.classifier.bayes.common.BayesWeightSummerDriver}
  5. {@link org.apache.mahout.classifier.bayes.BayesThetaNormalizerDriver}

The trainer assumes that the input files are in the {@link org.apache.hadoop.mapred.KeyValueTextInputFormat} format, i.e. the first token of each line is the label, separated from the remaining tokens on the line by a tab. The remaining tokens are the unique features (words). Thus, input documents might look like:
      hockey puck stick goalie forward defenseman referee ice checking slapshot helmet
      football field football pigskin referee helmet turf tackle
    
where hockey and football are the labels and the remaining words are the features associated with those particular labels.
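
To make the line layout concrete, here is a minimal sketch of splitting one such line into its label and features. TrainingLine is a hypothetical helper, not a class in this package, and it assumes that the features after the tab are separated by whitespace.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical holder for one parsed training line; not part of the bayes package. */
final class TrainingLine {
    final String label;
    final List<String> features;

    private TrainingLine(String label, List<String> features) {
        this.label = label;
        this.features = features;
    }

    /** Splits "label TAB feature feature ..." into its label and feature list. */
    static TrainingLine parse(String line) {
        int tab = line.indexOf('\t');
        if (tab < 0) {
            throw new IllegalArgumentException("no tab between label and features: " + line);
        }
        String label = line.substring(0, tab).trim();
        String rest = line.substring(tab + 1).trim();
        List<String> features;
        if (rest.isEmpty()) {
            features = Collections.emptyList();
        } else {
            // Assumption: features are separated by runs of whitespace.
            features = Arrays.asList(rest.split("\\s+"));
        }
        return new TrainingLine(label, features);
    }
}
```

Splitting on the first tab only mirrors the key/value division described above: everything before the tab is the label, and everything after it is treated as feature text.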

The output from the trainer is a {@link org.apache.hadoop.io.SequenceFile}.

The Model

The {@link org.apache.mahout.classifier.bayes.BayesModel} is the data structure used to represent the results of the training for use by the {@link org.apache.mahout.classifier.bayes.BayesClassifier}. A model can be created by hand or, when the {@link org.apache.mahout.classifier.bayes.BayesDriver} is used, from the {@link org.apache.hadoop.io.SequenceFile} that the driver outputs. To create it from the SequenceFile, use the {@link org.apache.mahout.classifier.bayes.io.SequenceFileModelReader} located in the io subpackage.

The Classifier

The {@link org.apache.mahout.classifier.bayes.BayesClassifier} is responsible for using a {@link org.apache.mahout.classifier.bayes.BayesModel} to classify documents into categories.

 
Copyright © 2008 Apache Software Foundation