Association in Python

This is a short tutorial with my simple understanding of association analysis for marketing applications in python

Read in a sample retail dataset available from Kaggle

The dataset contains mock retail orders of products that we can use as our template for association analysis

https://www.kaggle.com/datasets/knightbearr/sales-product-data. The order ID contains a unique transaction or order while the product column contains typical items that are frequently bought together

import pandas as pd 
import numpy as np
sales_april = pd.read_csv('../python/data/Sales_April_2019.csv')

sales_april.head()

  Order ID  ...                      Purchase Address
0   176558  ...          917 1st St, Dallas, TX 75001
1      NaN  ...                                   NaN
2   176559  ...     682 Chestnut St, Boston, MA 02215
3   176560  ...  669 Spruce St, Los Angeles, CA 90001
4   176560  ...  669 Spruce St, Los Angeles, CA 90001

[5 rows x 6 columns]

print('Total item sales (ignoring quanitities)')

Total item sales (ignoring quanitities)

print(len(sales_april))

print('Total orders')

Total orders

print(len(sales_april['Order ID'].value_counts()))

print('Total unique products')

Total unique products

print(len(sales_april['Product'].value_counts()))

There were 18,383 items sold in April 2019, with 17538 unique orders and 20 unique products. How do we figure out which pairs of products to market together to lift sales the next month? The unique products (electronics) available in the dataset are shown below:

sales_april['Product'].value_counts()

Product
Lightning Charging Cable      2201
USB-C Charging Cable          2079
AA Batteries (4-pack)         2063
AAA Batteries (4-pack)        1989
Wired Headphones              1890
Apple Airpods Headphones      1515
Bose SoundSport Headphones    1280
27in FHD Monitor               734
iPhone                         693
34in Ultrawide Monitor         650
Google Phone                   581
27in 4K Gaming Monitor         563
Flatscreen TV                  458
Macbook Pro Laptop             453
ThinkPad Laptop                392
20in Monitor                   390
Vareebadd Phone                220
LG Dryer                        77
LG Washing Machine              61
Product                         35
Name: count, dtype: int64

Restructure the dataset into an order ID, product list format

This is so it’s easier to figure out which items sold together in orders.

One way to do this is to create a pivot table of orders and quantity using the pandas pivot method https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.pivot.html

sales_filter = sales_april[['Order ID', 'Product']] #Ignore date and purchase address for now
sales_filter = sales_filter.reset_index()

df_orders = sales_filter.pivot_table(index = 'Order ID', columns = 'Product', aggfunc = 'count')

df_orders.head()

                index                         ...                        
Product  20in Monitor 27in 4K Gaming Monitor  ... Wired Headphones iPhone
Order ID                                      ...                        
176558            NaN                    NaN  ...              NaN    NaN
176559            NaN                    NaN  ...              NaN    NaN
176560            NaN                    NaN  ...              1.0    NaN
176561            NaN                    NaN  ...              1.0    NaN
176562            NaN                    NaN  ...              NaN    NaN

[5 rows x 20 columns]

#Fill null values in the dataframe
df_orders = df_orders.fillna(0)
df_orders.head()

                index                         ...                        
Product  20in Monitor 27in 4K Gaming Monitor  ... Wired Headphones iPhone
Order ID                                      ...                        
176558            0.0                    0.0  ...              0.0    0.0
176559            0.0                    0.0  ...              0.0    0.0
176560            0.0                    0.0  ...              1.0    0.0
176561            0.0                    0.0  ...              1.0    0.0
176562            0.0                    0.0  ...              0.0    0.0

[5 rows x 20 columns]

Print the support of each item

i.e. how frequently the item occurs over the total number of orders

We only count each item once in an order for simplicity.

Based on the summary below it looks like lightning charging cables, USBC, AA Batteries, Wired headphones and airpods are more frequent selling items. However, note that low support items are more expensive so it’s a good idea to cross sell them alongside a high frequency selling item

#Convert df orders to boolean
col_names = df_orders.columns
df_orders = df_orders[col_names].astype('bool')
df_orders.head()

                index                         ...                        
Product  20in Monitor 27in 4K Gaming Monitor  ... Wired Headphones iPhone
Order ID                                      ...                        
176558          False                  False  ...            False  False
176559          False                  False  ...            False  False
176560          False                  False  ...             True  False
176561          False                  False  ...             True  False
176562          False                  False  ...            False  False

[5 rows x 20 columns]

df_orders.mean().sort_values(ascending=False).plot(kind='bar')

Use the apriori algorithm to identify a set of meaningful rules

These rules will help cross sell products that go well together to boost sales. For example, if we want to increase the sales of headsets - would it make more sense to cross sell the mackbook pro with the airpods or with the bose headphones? We will only consider cross selling pairs of items together so this will consider 2 rule combinations only

from mlxtend.frequent_patterns import association_rules
from mlxtend.frequent_patterns import apriori

#Compute items that sell at minimum in 0.001% of all orders placed in the month (min_support)
frequent_itemsets = apriori(df_orders, min_support = 0.00000001,max_len = 2, use_colnames = True)

#Compute pairs of orders that have a meaningful lift of more than 1 in their sales when marketed together
frequent_itemsets_rules = association_rules(frequent_itemsets,metric = "lift", min_threshold = 1.0)

frequent_itemsets_rules

                           antecedents  ... zhangs_metric
0      ((index, USB-C Charging Cable))  ...      0.391666
1              ((index, Google Phone))  ...      0.357158
2                    ((index, iPhone))  ...      0.181122
3  ((index, Lightning Charging Cable))  ...      0.198853
4           ((index, Vareebadd Phone))  ...      0.319023
5      ((index, USB-C Charging Cable))  ...      0.357294

[6 rows x 10 columns]

Conclusion

Though more close inspection of all individual pairs of items is needed, this dataset shows that marketing USB sales with google phones can lift the sales of individual items. However, we should also consider other metrics such as the support of individual items. For example, USB cables and lightning charging cables also sell well on their own. However, retailers may get a slight boost in the sales of their phones if they pair them up with the cables as well