Recursive Hierarchical Clustering

This script performs recursive hierarchical clustering on clickstream data, and generate clusters of user behaviors. The system does three things: (1) group similar users into clusters; (2) identify the natural hierarchy within user clusters; (3) select key features to help to interpret the semantic meaning of each cluster. [Code Download]

Dependency

Python library numpy, scipy is required.

Usage

The name of the main script is recursiveHierarchicalCustering.py. Two ways of executing the script: command line interface or python import.

Command Line Interface

$> python recursiveHierarchicalClustering.py inputFile outputPath/ sizeThreshold

Arguments

inputFile: input file that contains the clickstream information for users to be clustered. Each line represents one user.

user_id \t A(1)G(10)

where A and G are action patterns and 1 and 10 represent how many times that the action pattern appears in this user's clickstream. In our case, action patterns refer to k-grams or subsequences of the clickstream. user_id grows from 1 to the total number of users.

outputPath: The directory to store all the temporary files and the final result.

sizeThreshold (optional): Defines the minimum size of the clusters. 0.05 means that clusters containing less than 5% of the total users will not be further split.

Python Interface

import recursiveHierarchicalClustering as rhc
data = rhc.getSidNgramMap(inputPath)
matrixPath = '%smatrix.dat' % inputPath
treeData = rhc.runDiana(outputPath, data, matrixPath)

treeData is the resulting cluster tree. Same as output/result.json if ran through CLI.

Result

outputPath/matrix.dat: A distance matrix for the root level (a temporary file for quick computation). If the file is available, the script will read in the matrix instead of calculating it again. The file format is a N*N distance matrix scaled to integer in the range of (0-100).

outputPath/result.json: A file for the clustering result, in the form of ['t', sub-cluster list, cluster info] or ['l', user list, cluster info].

Node type: l means leaf node that cannot be further split. t means tree cluster who has child clusters.
Sub-cluster list: a list of clusters that is the resulting clusters derived from splitting the current cluster.
User list: a list of user ids representing the users in the given cluster.
Cluster info: a dictionary containing meta data for the cluster.
- gini: gini-coefficient for chi-square score distribution, the skewness of feature importance distribution.
- sweetspot: the modularity for the best k we picked when further splitting this cluster.
- exclusions: a list of top features (ranked) that helps to distinguish the cluster from others.
- exclusionsScore: the chi-square scores correspond to the top features listed in exclusions.

Configuration

This script is designed to be distributed on multiple machines with shared file system. However, it is also be configured to run locally on a single machine. The configuration file is server.json in the following format:

{
    "threadNum": 5,
    "minPerSlice": 1000,
    "server":
        ["server1.example.com", "server2.example.com"]
}

threadNum specifies how many threads can be ran on each server.
minPerSlice specifies the minimum number of users each thread should handle.
server specifies the server name to be used for matrix computation task. If you want to run it locally, specify it as ["localhost"].