We have a Top K Parallel FPGrowth implementation. Given a huge transaction list, it finds all unique features (field values) and eliminates those whose frequency in the whole dataset is less than minSupport. For each of the N remaining features, it then finds the top K closed patterns, generating up to NK patterns in total.
The FPGrowth algorithm is implemented generically, so any Object type can be used to denote a feature. The current implementation, however, requires you to use a String as the object type. You may implement a version for any other object type by creating Iterators, Convertors and a TopKPatternWritable for that particular object. For more information, please refer to the package org.apache.mahout.fpm.pfpgrowth.convertors.string.
For example:
    FPGrowth<String> fp = new FPGrowth<String>();
    Set<String> features = new HashSet<String>();
    fp.generateTopKStringFrequentPatterns(
        new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern),
        fp.generateFList(
            new StringRecordIterator(new FileLineIterable(new File(input), encoding, false), pattern),
            minSupport),
        minSupport,
        maxHeapSize,
        features,
        new StringOutputConvertor(new SequenceFileOutputCollector<Text, TopKStringPatterns>(writer))
    );
- The first argument is the iterator over the transactions; in this case it is an Iterator<List<String>>.
- The second argument is the output of the generateFList function, which returns the frequent items and their frequencies from the given transaction database iterator.
- The third argument is the minimum support of the patterns to be generated.
- The fourth argument is the maximum number of patterns to be mined for each feature.
- The fifth argument is the set of features for which the frequent patterns have to be mined.
- The last argument is an output collector that takes [key, value] pairs of Feature and TopK Patterns, in the format [String, List<Pair<List<String>, Long>>], and writes them to the appropriate writer class, which takes care of storing the object, in this case in the Sequence File output format.
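The frequent-item list passed as the second argument can be pictured with a minimal sketch in plain Java. The class FListSketch and its method frequentFeatures are hypothetical names used only for illustration and are not part of Mahout. The sketch assumes that a feature is counted once per transaction and that the result is ordered by descending frequency; the actual generateFList may differ in both respects.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Mahout code.
public class FListSketch {

  // Counts in how many transactions each feature occurs and keeps only
  // the features whose count is at least minSupport.
  public static List<Map.Entry<String, Long>> frequentFeatures(
      List<List<String>> transactions, long minSupport) {
    Map<String, Long> counts = new HashMap<>();
    for (List<String> transaction : transactions) {
      // Assumption: a feature repeated inside one transaction counts once.
      for (String feature : new HashSet<>(transaction)) {
        counts.merge(feature, 1L, Long::sum);
      }
    }
    List<Map.Entry<String, Long>> frequent = new ArrayList<>();
    for (Map.Entry<String, Long> e : counts.entrySet()) {
      if (e.getValue() >= minSupport) {
        frequent.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
      }
    }
    // Most frequent first; ties are broken alphabetically for a stable order.
    frequent.sort((a, b) -> {
      int byCount = Long.compare(b.getValue(), a.getValue());
      return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
    });
    return frequent;
  }
}
```

With the transactions [a, b], [a, c] and [a, b, d] and a minSupport of 2, this sketch keeps a (3) and b (2) and drops c and d.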
The command-line launcher for string transaction data, org.apache.mahout.fpm.pfpgrowth.FPGrowthJob, has other features, including specifying the regex pattern for splitting a string line of a transaction into its constituent features.
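As a rough illustration of what such a splitter pattern does, the sketch below turns one transaction line into a list of features using java.util.regex.Pattern. The class SplitSketch and the pattern used in the usage note are assumptions made for this example; they are not taken from FPGrowthJob, whose default pattern may be different.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration, not Mahout code.
public class SplitSketch {

  // Splits one transaction line into features and drops empty tokens,
  // such as the one produced by a separator at the start of the line.
  public static List<String> splitFeatures(String line, Pattern splitter) {
    List<String> features = new ArrayList<>();
    for (String token : splitter.split(line)) {
      String trimmed = token.trim();
      if (!trimmed.isEmpty()) {
        features.add(trimmed);
      }
    }
    return features;
  }
}
```

For instance, calling splitFeatures with the line "milk, bread" and Pattern.compile("[ ,\\t]+") yields the two features milk and bread.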
The numGroups parameter in FPGrowthJob specifies the number of groups into which transactions have to be decomposed.
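One simple way to picture a decomposition into numGroups groups is to spread the frequency-ranked features over the groups in round-robin order, so that each group can be mined independently. This is only a sketch under that assumption; the class GroupSketch is hypothetical, and the assignment scheme actually used by FPGrowthJob may differ.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not Mahout code.
public class GroupSketch {

  // Assigns the feature at rank i to group (i % numGroups).
  public static Map<Integer, List<String>> assignGroups(
      List<String> rankedFeatures, int numGroups) {
    if (numGroups <= 0) {
      throw new IllegalArgumentException("numGroups must be positive");
    }
    Map<Integer, List<String>> groups = new TreeMap<>();
    for (int i = 0; i < rankedFeatures.size(); i++) {
      groups.computeIfAbsent(i % numGroups, k -> new ArrayList<>())
            .add(rankedFeatures.get(i));
    }
    return groups;
  }
}
```

With five ranked features a, b, c, d, e and numGroups set to 2, group 0 receives a, c, e and group 1 receives b, d.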
The numTreeCacheEntries parameter specifies the number of generated conditional FP-Trees to be kept in memory so that they do not have to be regenerated. Increasing this number increases memory consumption but might improve speed up to a certain point. This depends entirely on the dataset in question. A value of 5-10 is recommended for mining up to the top 100 patterns for each feature.
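The memory-versus-speed trade-off of a bounded cache can be sketched with a small least-recently-used map built on java.util.LinkedHashMap. The class BoundedCacheSketch is hypothetical, and LRU eviction is an assumption chosen for the example; Mahout's own tree cache may be implemented differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not Mahout code.
public class BoundedCacheSketch<K, V> extends LinkedHashMap<K, V> {
  private static final long serialVersionUID = 1L;

  private final int maxEntries;

  public BoundedCacheSketch(int maxEntries) {
    // accessOrder = true keeps the least recently used entry first.
    super(16, 0.75f, true);
    this.maxEntries = maxEntries;
  }

  // Called by LinkedHashMap after each insertion; returning true
  // evicts the least recently used entry.
  @Override
  protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
    return size() > maxEntries;
  }
}
```

A larger maxEntries keeps more entries alive, which costs memory but avoids rebuilding them, while a smaller value does the opposite.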