Itemset Mining

Implements itemset mining algorithms.

Algorithms

High-utility itemset mining (HUIM)

HUIM generalizes the problem of frequent itemset mining (FIM) by considering item values and weights. A popular application of HUIM is discovering all sets of items purchased together by customers that yield a high profit for the retailer. In that case, the item values record not just that a loaf of bread is in a basket, but how many loaves there are, and the weights record the profit from each loaf.

More technically, HUIM requires two inputs: a transaction “database” in which each item in each transaction carries an internal utility (i.e. an item value, such as a quantity), and a “database” of external utilities that gives one weight per item.
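To make the two kinds of utility concrete, here is a minimal sketch of how the utility of one itemset could be computed by hand. It does not use this package; the function name `itemset_utility` and the bread/butter data are hypothetical and chosen only for illustration. It assumes the usual definition: for every transaction that contains all items of the itemset, sum internal utility times external utility over those items.

```python
from typing import Dict, Iterable, List, Tuple

# One transaction is a list of (item, internal utility) pairs, e.g. quantities.
Transaction = List[Tuple[str, int]]


def itemset_utility(
    itemset: Iterable[str],
    transactions: List[Transaction],
    external_utilities: Dict[str, float],
) -> float:
    """Sum the utility of `itemset` over every transaction containing all its items."""
    wanted = set(itemset)
    total = 0.0
    for transaction in transactions:
        quantities = dict(transaction)
        # Only transactions that contain the whole itemset contribute.
        if wanted.issubset(quantities):
            total += sum(quantities[i] * external_utilities[i] for i in wanted)
    return total


# Hypothetical data: quantities per basket and profit per unit.
transactions = [
    [("bread", 2), ("butter", 1)],
    [("bread", 1)],
    [("bread", 3), ("butter", 2)],
]
external_utilities = {"bread": 1.0, "butter": 2.0}

print(itemset_utility(["bread", "butter"], transactions, external_utilities))  # 11.0
print(itemset_utility(["bread"], transactions, external_utilities))  # 6.0
```

The second basket contains no butter, so it adds nothing to the utility of {bread, butter}, but it does count toward the utility of {bread} alone.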

Algorithm     Class                                     How to Cite
Two-Phase*    itemset_mining.two_phase_huim.TwoPhase    Link

* Includes max length support

Roadmap (high to low priority):

  • Address low-correlation HUIs with one of bond, all-confidence, or affinity. Itemsets that have high utility but whose items aren’t correlated can be misleading when making marketing decisions. E.g. if an itemset of a TV and a pen is a HUI, it’s likely just because the TV is expensive, not because it’s an interesting pattern.

  • Add *average* utility measure support, so that choosing minutil is easier and more intuitive.

  • Support discount strategies via a discount strategy table and upgraded external utilities table.

  • Add top-k HUI support.

  • Support identifying periodic high-utility itemsets. This detects purchase patterns among high-utility itemsets, enabling cross-promotions to customers who buy sets of items periodically.

  • Support items’ on-shelf time. Ignoring on-shelf time biases results toward items that have more shelf time, since they have more chances to generate higher utility.

  • Allow incremental transaction updates without rerunning everything.

  • Support concise HUI representations, specifically closed HUIs. This allows the algorithm to be more efficient and to show only the longer itemsets, which may be the most interesting ones (correlation issues aside).
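As an illustration of the first roadmap item, the sketch below computes the bond of an itemset, assuming the common definition: the number of transactions containing all of the items divided by the number containing at least one of them. Nothing here is part of this package; the `bond` function and the TV/pen/paper data are hypothetical.

```python
def bond(itemset, transactions):
    """Conjunctive support divided by disjunctive support (0.0 if no item occurs)."""
    wanted = set(itemset)
    # Drop the internal utilities: bond only looks at which items co-occur.
    baskets = [{item for item, _ in transaction} for transaction in transactions]
    conjunctive = sum(1 for basket in baskets if wanted <= basket)
    disjunctive = sum(1 for basket in baskets if wanted & basket)
    return conjunctive / disjunctive if disjunctive else 0.0


# Hypothetical data: pens are bought often, TVs rarely, and almost never together.
transactions = [
    [("TV", 1), ("pen", 1)],
    [("pen", 3), ("paper", 1)],
    [("pen", 1), ("paper", 2)],
    [("pen", 2), ("paper", 1)],
    [("TV", 1)],
]

print(bond(["TV", "pen"], transactions))     # 0.2
print(bond(["pen", "paper"], transactions))  # 0.75
```

A low bond such as 0.2 for {TV, pen} suggests the items rarely appear together relative to how often either appears, so a minimum bond threshold could filter out that itemset even when its utility is high.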

Installation:

pip install itemset-mining

Example:

>>> from operator import attrgetter
>>> from itemset_mining.two_phase_huim import TwoPhase
>>> transactions = [
...     [("Coke 12oz", 6), ("Chips", 2), ("Dip", 1)],
...     [("Coke 12oz", 1)],
...     [("Coke 12oz", 2), ("Chips", 1), ("Filet Mignon 1lb", 1)],
...     [("Chips", 1)],
...     [("Chips", 2)],
...     [("Coke 12oz", 6), ("Chips", 1)]
... ]
>>> # ARP for each item
>>> external_utilities = {
...     "Coke 12oz": 1.29,
...     "Chips": 2.99,
...     "Dip": 3.49,
...     "Filet Mignon 1lb": 22.99
... }
>>> # Minimum dollar value generated by an itemset we care about across all transactions
>>> minutil = 20.00

>>> hui = TwoPhase(transactions, external_utilities, minutil)
>>> result = hui.get_hui()
>>> sorted(result, key=attrgetter('itemset_utility'), reverse=True)
... 
[HUIRecord(items=('Chips', 'Coke 12oz'), itemset_utility=30.02),
 HUIRecord(items=('Chips', 'Coke 12oz', 'Filet Mignon 1lb'), itemset_utility=28.56),
 HUIRecord(items=('Chips', 'Filet Mignon 1lb'), itemset_utility=25.979999999999997),
 HUIRecord(items=('Coke 12oz', 'Filet Mignon 1lb'), itemset_utility=25.57),
 HUIRecord(items=('Filet Mignon 1lb',), itemset_utility=22.99),
 HUIRecord(items=('Chips',), itemset_utility=20.93)]
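The utilities listed above can be cross-checked without the package. The sketch below is a brute-force enumeration, not the Two-Phase algorithm: it scores every possible itemset directly, which is only feasible for tiny inputs like this one. The function name `brute_force_hui` is hypothetical, and it assumes an itemset qualifies when its utility is at least minutil.

```python
from itertools import combinations

transactions = [
    [("Coke 12oz", 6), ("Chips", 2), ("Dip", 1)],
    [("Coke 12oz", 1)],
    [("Coke 12oz", 2), ("Chips", 1), ("Filet Mignon 1lb", 1)],
    [("Chips", 1)],
    [("Chips", 2)],
    [("Coke 12oz", 6), ("Chips", 1)],
]
external_utilities = {
    "Coke 12oz": 1.29,
    "Chips": 2.99,
    "Dip": 3.49,
    "Filet Mignon 1lb": 22.99,
}
minutil = 20.00


def brute_force_hui(transactions, external_utilities, minutil):
    """Return {itemset: utility} for every itemset whose utility is >= minutil."""
    items = sorted({item for transaction in transactions for item, _ in transaction})
    results = {}
    for size in range(1, len(items) + 1):
        for itemset in combinations(items, size):
            total = 0.0
            for transaction in transactions:
                quantities = dict(transaction)
                # Only transactions containing the whole itemset contribute.
                if all(item in quantities for item in itemset):
                    total += sum(
                        quantities[item] * external_utilities[item] for item in itemset
                    )
            if total >= minutil:
                # Round to cents to hide floating-point noise.
                results[itemset] = round(total, 2)
    return results


for itemset, utility in sorted(
    brute_force_hui(transactions, external_utilities, minutil).items(),
    key=lambda pair: pair[1],
    reverse=True,
):
    print(itemset, utility)
```

This prints the same six itemsets as the example, with utilities rounded to cents (so 25.98 rather than 25.979999999999997). "Coke 12oz" on its own totals 19.35 and falls just below the threshold.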