Clustering Large Datasets Using Quantum-Inspired Methods


Authors: Brian Mao, MMath [Head of Algorithms], Saavan Patel, PhD [CTO], InfinityQ Technology Inc


Introduction:


Clustering is the general process of uncovering structure within datasets by grouping strongly similar data points together.


There are many different applications for clustering algorithms, including:

• Image Segmentation

• Anomaly Detection

• Medical Imaging

• Social Network Analysis

• Data Compression



The figure above demonstrates an example of image segmentation applied to camera data from an autonomous vehicle.


Different objects such as pedestrians, roads, traffic lights, and other vehicles are all segmented from the original image through clustering.

 

Clustering can also be applied as an initial step towards solving Vehicle Routing Problems (VRPs). Many established solutions involve clustering a large set of delivery addresses into distinct non-overlapping groups as the first step.


Afterwards, an individual Traveling Salesman Problem (TSP) is solved within each previously identified cluster.


Clearly, the quality of the initial clusters generated will greatly affect the quality of the overall VRP solution.



Many different clustering algorithms of varying complexity exist, such as K-Means, K-Medoids, and DBSCAN.


However, most of these generic algorithms do not allow for the incorporation of additional problem-specific constraints, and many of them assume that the points are embedded within a two-dimensional plane, which may not hold for complex datasets.


One example of a complex constraint, in the context of the VRP, is to make the number of delivery addresses per vehicle roughly equal such that driver shifts are uniform across the fleet.


Basic clustering algorithms, such as the ones mentioned previously, do not provide simple methodologies to incorporate this desired property among generated clusters.


This motivated InfinityQ to develop a novel optimization-based clustering approach.


Mathematical Formulation:


Clustering is posed as an optimization problem with binary variables $x_{ic}$, where $x_{ic} = 1$ if data point $i$ is assigned to cluster $c$, in the following mathematical formulation:

$$\min_{x} \; \sum_{c=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij} \, x_{ic} \, x_{jc}$$

$$\text{subject to} \quad \sum_{c=1}^{k} x_{ic} = 1 \quad \forall \, i \in \{1, \dots, n\}, \qquad x_{ic} \in \{0, 1\}$$

where:


• The objective function is defined to minimize the distance between points within a particular cluster. Note that $d_{ij}$ represents the distance between data point $i$ and data point $j$.


• A set partitioning constraint is also included to ensure that each data point belongs to exactly one cluster. In the context of the VRP, this refers to each address being assigned to exactly one delivery vehicle.


Note that the total number of variables in this formulation is n × k, where n is the number of data points to cluster and k is the number of desired clusters.


In the context of the VRP, n represents the number of delivery addresses and k represents the total number of delivery vehicles utilized.


The mathematical formulation above can then be easily set up and solved on TitanQ.


Constructing the model only requires a few lines of code with the Python SDK, as seen within the InfinityQ Examples GitHub.
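For illustration, below is a minimal NumPy sketch of how the weights and bias for this formulation can be assembled. This is not the TitanQ SDK itself (see the GitHub examples for the actual SDK calls); the point-major variable ordering, the energy convention $x^\top W x + b^\top x$, and the penalty weight A enforcing the set-partitioning constraint are illustrative assumptions.

```python
import numpy as np

def build_clustering_qubo(points, k, A=10.0):
    """Assemble QUBO weights W and bias b for the clustering formulation.

    Binary variables x[i, c] = 1 iff point i is assigned to cluster c,
    flattened point-major so that index(i, c) = i * k + c. The energy is
    assumed to be x^T W x + b^T x. A is an assumed penalty weight for the
    set-partitioning constraint sum_c x[i, c] == 1 (tune per problem).
    """
    points = np.asarray(points)
    n = len(points)

    # Pairwise Euclidean distances d_ij.
    D = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

    # Objective: sum_c sum_ij d_ij * x_ic * x_jc. With point-major
    # flattening this is D spread across the k diagonal cluster slots.
    W = np.kron(D, np.eye(k))

    # Penalty A * sum_i (sum_c x_ic - 1)^2, expanded for binaries
    # (x^2 = x): off-diagonal couplings within each point's k-block
    # plus a linear term of -A per variable (constant term dropped).
    W += A * np.kron(np.eye(n), np.ones((k, k)) - np.eye(k))
    bias = -A * np.ones(n * k)
    return W, bias
```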


Performance Metrics:


The main performance metrics of interest are the intracluster distance and the intercluster distance.


Intracluster distance refers to the distance between data points within the same cluster, $S$. More formally, we use the centroid diameter distance, which we define as:

$$\Delta(S) = \frac{2}{|S|} \sum_{x \in S} \lVert x - c_S \rVert$$

where $c_S$ and $|S|$ are the center and the number of points in $S$ respectively. Note that $c_S$ can be simply calculated as:

$$c_S = \frac{1}{|S|} \sum_{x \in S} x$$



The intracluster distance is visually represented below:



The worst-case scenario for intracluster distance is when there are a small number of addresses which are significantly further away from most of the other addresses.


Significant penalties would be incurred under this scenario.


However, a cluster of this variety would still be formed in the context of the VRP to avoid dispatching an entirely separate vehicle to deliver to the far-off addresses.



The intercluster distance refers to the distance between points among different clusters. More formally, we use the centroid linkage distance, which we define as:

$$\delta(S, T) = \lVert c_S - c_T \rVert$$

where $c_S$ and $c_T$ are the centers of clusters $S$ and $T$ respectively. Note that $c_S$ and $c_T$ can be simply calculated as:

$$c_S = \frac{1}{|S|} \sum_{x \in S} x, \qquad c_T = \frac{1}{|T|} \sum_{x \in T} x$$

where $|S|$ and $|T|$ are the number of points in clusters $S$ and $T$ respectively.
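For concreteness, here is a small NumPy sketch of both metrics exactly as defined above (function names are illustrative):

```python
import numpy as np

def centroid_diameter(S):
    """Intracluster (centroid diameter) distance of cluster S.

    S: array of shape (|S|, d) holding the points of the cluster.
    Returns twice the mean distance from the points to the centroid c_S.
    """
    c_S = S.mean(axis=0)  # c_S = (1 / |S|) * sum of the points in S
    return 2.0 * np.linalg.norm(S - c_S, axis=1).mean()

def centroid_linkage(S, T):
    """Intercluster (centroid linkage) distance between clusters S and T."""
    return np.linalg.norm(S.mean(axis=0) - T.mean(axis=0))
```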


The intercluster distance is visually represented below:


The worst-case scenario for intercluster distance is a densely packed region where multiple clusters are directly adjacent to each other.


The Capacitated Vehicle Routing Problem (CVRP) is a variant of the VRP in which each vehicle has a limited carrying capacity.


In this context, the worst-case scenario identified above often occurs within concentrated urban areas with many different addresses in close vicinity.



Overall, ideal clusters would have the intracluster distance minimized, and the intercluster distance maximized.

 

Results:


The performance of the optimization-based clustering approach is demonstrated on a dataset acquired from a delivery company based in Montreal, Canada.


The dataset contains a total of 3000 different delivery addresses.



The results gathered throughout this section from the clustering formulation are benchmarked against the classical K-Means algorithm.


Both the median intracluster distance among all clusters and the median intercluster distance between all pairs of generated clusters are compared.


It is also worth noting that TitanQ is a general optimization solver that can be utilized for clustering using the previously specified mathematical formulation.


Conversely, K-Means is specifically a clustering algorithm, designed to be used within a 2D Euclidean space.
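For reference, a K-Means baseline of this kind can be produced with scikit-learn along the following lines (a sketch; the file name and parameter settings are illustrative, and the exact configuration behind the benchmark numbers is not specified here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical file holding the (n, 2) array of delivery coordinates.
coords = np.load("delivery_addresses.npy")

# Fit k = 5 clusters and obtain a cluster label per address.
labels = KMeans(n_clusters=5, n_init=10, random_state=0).fit_predict(coords)
```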


The first set of results is from a subset of the original dataset containing 100 delivery addresses. k=5 clusters are generated, and TitanQ and the classical K-Means algorithm achieve similar performance on both the intracluster distance and the intercluster distance.


TitanQ vs K-Means on generating 5 clusters out of 100 delivery addresses:

|                              | TitanQ   | K-Means  |
| ---------------------------- | -------- | -------- |
| Visual Result                | (figure) | (figure) |
| Median Intracluster Distance | 0.0702   | 0.0873   |
| Median Intercluster Distance | 0.175    | 0.185    |


Next, results are displayed for generating k=4 clusters out of a set of 500 delivery addresses.


Once again, the performance of TitanQ and K-Means is relatively similar on both the intracluster distance and intercluster distance metrics.


TitanQ vs K-Means on generating 4 clusters out of 500 delivery addresses:

|                              | TitanQ   | K-Means  |
| ---------------------------- | -------- | -------- |
| Visual Result                | (figure) | (figure) |
| Median Intracluster Distance | 0.121    | 0.126    |
| Median Intercluster Distance | 0.203    | 0.219    |

 

Finally, results are displayed for generating k=4 clusters on the full dataset of 3000 delivery addresses across Montreal.


Similar performance metrics are yielded once again with regard to intracluster and intercluster distances, despite the two methods generating visibly different clusters.


TitanQ vs K-Means on generating 4 clusters out of 3000 delivery addresses:

|                              | TitanQ   | K-Means  |
| ---------------------------- | -------- | -------- |
| Visual Result                | (figure) | (figure) |
| Median Intracluster Distance | 0.125    | 0.127    |
| Median Intercluster Distance | 0.206    | 0.218    |


Overall, the clustering solutions generated using TitanQ are quite comparable in quality to those generated using a classical algorithm such as K-Means.


However, the benefits of this new clustering scheme become far clearer when incorporating additional constraints that cannot be so easily applied to primitive clustering algorithms.


The next section provides an example in more detail.


Cluster Rebalancing:


One example of an additional constraint is to balance the number of data points across clusters.


This can be quite useful in the context of the CVRP, as it implies that the number of delivery addresses per vehicle is roughly equal, yielding uniform driver shifts across the fleet.


This constraint is accomplished by adding a Lagrangian penalty term to the objective function, as shown below:

$$\min_{x} \; \sum_{c=1}^{k} \sum_{i=1}^{n} \sum_{j=1}^{n} d_{ij} \, x_{ic} \, x_{jc} \;+\; \lambda \sum_{c=1}^{k} \left( \sum_{i=1}^{n} x_{ic} - k_{\text{avg}} \right)^{2}$$

To yield balanced clusters in the formulation above, $k_{\text{avg}}$ is set to the ratio of the number of coordinates to the number of clusters:

$$k_{\text{avg}} = \frac{n}{k}$$
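Continuing the illustrative NumPy sketch from the formulation section, the balance penalty can be folded into the same QUBO weights as follows (λ, like A earlier, is an assumed penalty weight; the expansion again uses x² = x for binary variables):

```python
import numpy as np

def add_balance_penalty(W, bias, n, k, lam=1.0):
    """Add lam * sum_c (sum_i x_ic - k_avg)^2 to an existing QUBO.

    Uses the same point-major flattening index(i, c) = i * k + c as
    build_clustering_qubo; lam is an assumed penalty weight.
    """
    k_avg = n / k  # target number of points per cluster

    # Quadratic part: lam couples every pair of distinct points
    # assigned to the same cluster slot c.
    W += lam * np.kron(np.ones((n, n)) - np.eye(n), np.eye(k))

    # Linear part: lam * (1 - 2 * k_avg) per variable, using x^2 = x;
    # the constant lam * k * k_avg^2 is dropped.
    bias += lam * (1.0 - 2.0 * k_avg) * np.ones(n * k)
    return W, bias
```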


Updated results for the example of generating 4 clusters out of the total dataset containing 3000 delivery addresses using the new mathematical formulation are displayed below:

| Clustering Method      | # Points Within Each Cluster | Standard Deviation |
| ---------------------- | ---------------------------- | ------------------ |
| K-Means                | [1729, 413, 594, 264]        | 666.47             |
| TitanQ W/O Rebalancing | [954, 503, 667, 876]         | 204.44             |
| TitanQ With Rebalancing| [897, 555, 672, 876]         | 164.92             |

It is important to note that the standard deviation of the number of points within each cluster is significantly lower for TitanQ than for the clusters generated by applying K-Means to the same dataset, and that the rebalancing term lowers it further.

 

Comparison to Gurobi:


Gurobi is a classical branch & bound/branch & cut solver which is also able to provide high-quality heuristic solutions to complex optimization problems.


We choose to compare against Gurobi as it is considered to be an example of a state-of-the-art, commercially available mixed-integer quadratic programming (MIQP) solver.


This section contains four separate comparisons between the performance of TitanQ and Gurobi as optimization solvers on the same mathematical formulation of the clustering problem.


Note that TitanQ version 0.25.0 and Gurobi Optimizer version 11.0.3 are used throughout each of the comparisons.
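For reference, the Gurobi side of such a comparison can be set up with gurobipy roughly as follows (a sketch of the same formulation under a time limit; the exact benchmark script is not shown here, and the function and variable names are illustrative):

```python
import gurobipy as gp
from gurobipy import GRB
import numpy as np

def solve_clustering_miqp(D, k, time_limit=6.0):
    """Solve the clustering formulation with Gurobi under a time limit.

    D: (n, n) pairwise distance matrix; k: number of clusters.
    Returns an (n,) array of cluster labels, or None if no feasible
    solution was found before the timeout.
    """
    n = D.shape[0]
    m = gp.Model("clustering")
    x = m.addVars(n, k, vtype=GRB.BINARY, name="x")

    # Objective: total pairwise distance between points placed
    # in the same cluster.
    m.setObjective(
        gp.quicksum(
            D[i, j] * x[i, c] * x[j, c]
            for i in range(n) for j in range(i + 1, n) for c in range(k)
        ),
        GRB.MINIMIZE,
    )

    # Set partitioning: each point is assigned to exactly one cluster.
    m.addConstrs((x.sum(i, "*") == 1 for i in range(n)), name="assign")

    m.Params.TimeLimit = time_limit
    m.optimize()

    if m.SolCount == 0:
        return None
    return np.array([max(range(k), key=lambda c: x[i, c].X) for i in range(n)])
```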


In the first comparison, 3 clusters are generated out of a dataset containing 500 coordinates.


Both solvers perform very well for this small problem containing 1500 total binary variables.


The median intracluster and intercluster distances are also quite close to each other.


TitanQ vs Gurobi on a 1500-variable problem of generating 3 clusters out of 500 coordinates:

|                                     | TitanQ   | Gurobi   |
| ----------------------------------- | -------- | -------- |
| Visual Result After 6 Second Timeout | (figure) | (figure) |
| Median Intracluster Distance        | 0.170    | 0.165    |
| Median Intercluster Distance        | 0.186    | 0.193    |


The second comparison is conducted on a 9000-variable problem where 3 clusters are generated out of the full set of 3000 coordinates.


TitanQ is capable of generating a visually pleasing result after only 10 seconds.


However, even after 60 seconds, Gurobi is unable to produce a reasonable result in which most of the coordinates are not lumped into a single group.


TitanQ vs Gurobi on a 9000-variable problem of generating 3 clusters out of 3000 coordinates:

|                                      | TitanQ   | Gurobi   |
| ------------------------------------ | -------- | -------- |
| Visual Result After 10 Second Timeout | (figure) | (figure) |
| Median Intracluster Distance         | 0.178    | 0.226    |
| Median Intercluster Distance         | 0.187    | 0.0628   |
| Visual Result After 60 Second Timeout | (figure) | (figure) |
| Median Intracluster Distance         | 0.177    | 0.225    |
| Median Intercluster Distance         | 0.189    | 0.0749   |


Next, a third comparison is made on a 10,000-variable problem where 5 clusters are generated out of a dataset containing 2000 coordinates.


Once again, TitanQ is capable of outputting a visually pleasing result after only 10 seconds while Gurobi is unable to do so after an entire 60 seconds.


TitanQ vs Gurobi on a 10,000-variable problem of generating 5 clusters out of 2000 coordinates:

|                                      | TitanQ   | Gurobi   |
| ------------------------------------ | -------- | -------- |
| Visual Result After 10 Second Timeout | (figure) | (figure) |
| Median Intracluster Distance         | 0.104    | 0.171    |
| Median Intercluster Distance         | 0.192    | 0.117    |
| Visual Result After 60 Second Timeout | (figure) | (figure) |
| Median Intracluster Distance         | 0.104    | 0.172    |
| Median Intercluster Distance         | 0.192    | 0.135    |


Finally, a fourth comparison is made on a 12,000-variable problem where 4 clusters are generated out of the full dataset of 3000 coordinates.


Note that Gurobi crashed entirely when attempting to create the model, as the problem was too large to handle.


TitanQ vs Gurobi on a 12,000-variable problem of generating 4 clusters out of 3000 coordinates:

|                                     | TitanQ   | Gurobi |
| ----------------------------------- | -------- | ------ |
| Visual Result After 6 Second Timeout | (figure) | N/A    |
| Median Intracluster Distance        | 0.127    | N/A    |
| Median Intercluster Distance        | 0.205    | N/A    |


Overall, TitanQ and Gurobi demonstrate similar performance on smaller problems.


However, TitanQ notably outperforms Gurobi on larger problems containing around 10,000 variables and beyond.


Problem-Specific Compression:


A further improvement upon the initial clustering formulation is the incorporation of a compression technique specific to the clustering problem.


Many different matrix-vector computations of the following form are involved within the TitanQ solver:

$$y = Wx + b$$

The weights matrix $W$ can become very large when handling large instances.


However, it is possible to take advantage of the sparse structure of the weights matrix that arises in the clustering formulation to optimize the amount of memory required.


In addition, the high costs associated with moving large matrices between platforms, such as between GPUs and CPUs, can also be minimized.


In the clustering formulation, the objective decouples across clusters: after a suitable ordering of the variables, $W$ is block diagonal, consisting of $k$ identical copies of the $n \times n$ distance matrix $D$. In this problem-specific compression scheme, the matrix is therefore reformatted so that only the single repeated block $D$ is stored, rather than the full $(nk) \times (nk)$ matrix.

A new and more efficient computation then arises:

$$Wx = \mathrm{vec}(DX)$$

where $X$ is the variable vector $x$ reshaped into an $n \times k$ matrix with one column per cluster, and the result is reshaped back into a vector.
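A minimal sketch of the resulting matrix-vector product, assuming the block structure described above and the point-major variable ordering used in the earlier sketches:

```python
import numpy as np

def compressed_matvec(D, x, k):
    """Compute y = W x without ever materializing W.

    W consists of k identical copies of the (n, n) distance matrix D,
    one per cluster, so only D (n^2 entries) is stored instead of W
    ((n * k)^2 entries). x is assumed flattened point-major, x[i * k + c].
    """
    n = D.shape[0]
    X = x.reshape(n, k)            # column c holds cluster c's variables
    return (D @ X).reshape(n * k)  # one (n, n) @ (n, k) product
```

A single $n \times k$ matrix product replaces the $(nk) \times (nk)$ matrix-vector product, and only $D$ ever needs to be moved between CPU and GPU.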


This scheme allows for scaling problems easily beyond 10,000 variables.


The overall problem sizes of the previously shown examples, with and without the problem-specific compression scheme, are compared in the following table:

 

|                              | Problem Size W/O Compression | Problem Size With Compression |
| ---------------------------- | ---------------------------- | ----------------------------- |
| 100 Data Points x 5 Clusters | 1 MB                         | 0.04 MB                       |
| 500 Data Points x 4 Clusters | 16 MB                        | 1 MB                          |
| 3000 Data Points x 4 Clusters| 576 MB                       | 36 MB                         |
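These sizes are consistent with dense matrices of 32-bit floats (an assumption, though it matches every row of the table). For the largest case, with $n = 3000$ and $k = 4$, i.e. 12,000 variables:

$$(nk)^2 \times 4 \;\text{bytes} = 12000^2 \times 4 \;\text{bytes} = 576 \;\text{MB}, \qquad n^2 \times 4 \;\text{bytes} = 3000^2 \times 4 \;\text{bytes} = 36 \;\text{MB}$$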


Conclusion:


A novel optimization-based clustering algorithm has been developed by InfinityQ and can be easily solved using TitanQ.


The main advantage of this clustering approach is that it allows for the incorporation of additional constraints which are unsupported by many primitive algorithms.


Results demonstrate comparable performance to the K-Means clustering algorithm with regard to the intracluster and intercluster distance metrics.


It was also demonstrated that TitanQ outperforms Gurobi when solving the mathematical formulation at larger problem sizes, with Gurobi crashing entirely on the largest, 12,000-variable problem.


Finally, a problem-specific compression scheme was also presented to allow for efficient scaling to larger problem sizes.


A demo of the clustering scheme presented, along with additional examples of use cases of TitanQ, is available within the InfinityQ Examples GitHub.


About the Authors:


Brian Mao, MMath - Brian is the Head of Algorithms at InfinityQ with a background in both mechanical engineering and applied mathematics.


Prior to InfinityQ, he was the Path Planning and Controls Lead at MIT Driverless where he led multiple sub-teams to program a racecar to drive fully autonomously at over 240 km/h.

 

He also worked at Apple Inc. under the Metal Tooling team where he developed an internal iOS app and created simulations of various manufacturing processes.


Brian continues to pursue the intersection between engineering and mathematics through solving many different mathematical optimization problems at InfinityQ.


Saavan Patel, PhD - Saavan is the Chief Technology Officer at InfinityQ.


Prior to InfinityQ, he was a PhD Candidate at the University of California, Berkeley, where he completed his studies under Professor Sayeef Salahuddin, working on developing accelerated solutions to optimization problems using quantum-inspired methods.


He also spent time at Meta Reality Labs, working on Machine Learning hardware for next generation VR systems.


His interests are broadly in optimization, quantum computation, machine learning, and hardware design.

