Map/Reduce (Parallel) implementation of the FP-Growth Algorithm for Frequent Itemset Mining
 
We have a Top K Parallel FPGrowth implementation. Given a huge transaction list, we find all unique features (field values) and eliminate those features whose frequency in the whole dataset is less than minSupport. For each of the N remaining features, we find the top K closed patterns, generating up to N*K patterns. The FPGrowth algorithm is implemented generically, so any Object type can be used to denote a feature. The current implementation, however, requires you to use a String as the object type. You may implement a version for any other object type by creating Iterators, Convertors and a TopKPatternWritable for that particular type. For more information, please refer to the package org.apache.mahout.fpm.pfpgrowth.convertors.string
For example:
 FPGrowth<String> fp = new FPGrowth<String>();
 Set<String> features = new HashSet<String>();
 fp.generateTopKStringFrequentPatterns(
     // transaction stream read from the input file for the mining pass
     new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern),
     // frequency list, built by a separate pass over the same input
     fp.generateFList(
         new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern),
         minSupport),
     minSupport,
     maxHeapSize,
     features,
     new StringOutputConvertor(new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer))
 );
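
The pruning step described above (count every feature, then drop those below minSupport) can be illustrated with a small stand-alone sketch. This is not the Mahout code: the class FListSketch and the method frequentFeatures are hypothetical names, and the sketch assumes that a feature repeated inside one transaction counts only once and that the result is ordered by descending frequency.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of org.apache.mahout.fpm.pfpgrowth.
public class FListSketch {

  // Returns (feature, count) pairs for features seen in at least minSupport transactions.
  public static List<Map.Entry<String, Long>> frequentFeatures(
      List<List<String>> transactions, long minSupport) {
    Map<String, Long> counts = new HashMap<>();
    for (List<String> transaction : transactions) {
      // Assumption: duplicates within a single transaction are counted once.
      for (String feature : new HashSet<>(transaction)) {
        counts.merge(feature, 1L, Long::sum);
      }
    }
    List<Map.Entry<String, Long>> fList = new ArrayList<>();
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      if (e.getValue() >= minSupport) {
        fList.add(Map.entry(e.getKey(), e.getValue()));
      }
    }
    // Descending by count; ties broken alphabetically so the order is deterministic.
    fList.sort((a, b) -> {
      int byCount = Long.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    return fList;
  }
}
```

The N features that survive this filter are the ones for which top K patterns are then mined.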
 

The command-line launcher for string transaction data, org.apache.mahout.fpm.pfpgrowth.FPGrowthJob, has other features, including specifying the regex pattern for splitting a string line of a transaction into its constituent features.
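
As a rough illustration of what splitting a transaction line by a regex pattern means, the sketch below uses only java.util.regex. The class SplitSketch and its method are hypothetical, and trimming tokens and dropping empty ones are assumptions of this sketch rather than documented behaviour of FPGrowthJob.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not the splitter used by FPGrowthJob.
public class SplitSketch {

  // Splits one transaction line into features using the given regex pattern.
  public static List<String> features(String line, Pattern splitter) {
    List<String> result = new ArrayList<>();
    for (String token : splitter.split(line)) {
      String trimmed = token.trim();
      // Assumption: empty tokens (e.g. from consecutive separators) are skipped.
      if (!trimmed.isEmpty()) {
        result.add(trimmed);
      }
    }
    return result;
  }
}
```

For instance, with Pattern.compile(",") the line "milk, bread ,eggs" yields the three features milk, bread and eggs.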

The numGroups parameter in FPGrowthJob specifies the number of groups into which the transactions are decomposed. The numTreeCacheEntries parameter specifies the number of generated conditional FP-Trees to keep in memory so that they do not have to be regenerated. Increasing this number increases memory consumption but may improve speed up to a certain point; the best value depends entirely on the dataset in question. A value of 5-10 is recommended for mining up to the top 100 patterns for each feature.
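
The memory versus speed trade-off of numTreeCacheEntries is that of any bounded cache. The sketch below shows one common way to build such a cache in plain Java, using LinkedHashMap in access order. The class name TreeCacheSketch is hypothetical, and the least-recently-used eviction policy is an assumption made for illustration; the text above does not say which policy the actual implementation uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the cache used by the FPGrowth implementation.
public class TreeCacheSketch<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;

  private final int maxEntries;

  public TreeCacheSketch(int maxEntries) {
    // accessOrder = true: iteration order runs from least to most recently used.
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    // Evict the least recently used entry once the bound is exceeded.
    return size() > maxEntries;
  }
}
```

A larger bound keeps more entries resident and so avoids more regeneration, at the cost of memory.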
 
Copyright © 2009 Apache Software Foundation - Mahout