Authors: Royg Garcia [GPU Programmer Intern] & Kevin Jordan [Lead GPU Researcher] - **InfinityQ Technology Inc.**

**Abstract**

TitanQ is a novel solver platform for non-convex optimization problems that leverages quantum-inspired, probabilistic computing on off-the-shelf hardware.

The platform delivers exceptional speed, state-of-the-art solution quality, and unparalleled cost-effectiveness.

These properties place TitanQ among the most advanced solver systems available today.

The recent incorporation of sparse matrix support not only solidifies this lead but propels TitanQ even further ahead.

Many real-world problems have sparse structure, which gives the TitanQ system a speed boost on a wide class of important problems.

Thanks to sparse matrix formats, TitanQ now solves NP-hard combinatorial optimization problems with improved speed and solution quality.

**Background**

When solving NP-hard combinatorial optimization problems for business and research applications, the primary concern is obtaining a solution that meets time constraints while maintaining an acceptable level of quality.

A trade-off is frequently seen in practical applications: beyond some point, the cost of a marginal improvement in solution quality no longer fits the use case, so an acceptable quality target is defined instead.

While not provably optimal, a solution that represents a drastic improvement in runtime will certainly be selected when available if the solution's quality is within the target margin.

The TitanQ solver finds solutions through rapid stochastic sampling. It begins with a random solution and probabilistically improves it until a given timeout is reached.

This adaptable approach caters to the time and quality requirements of useful applications.

An integral part of TitanQ’s sampling process involves matrix multiplication. This operation significantly impacts the time required to take a sample solution.

Typically, at least 85% of the sampling time is consumed by matrix multiplication.

The time taken for matrix multiplication depends primarily on the matrix size, which is proportional to the problem size. Therefore, investigating performance gains in matrix multiplication is crucial, especially when dealing with larger problem sizes.

A significant portion of the real-world problems TitanQ encounters involve sparse matrices — matrices composed primarily of zeros, frequently above 95% zeros.

Without considering this inherent sparsity, storing the matrix naïvely results in zero values occupying as much memory space as non-zero elements. This represents a large space inefficiency.

In terms of runtime efficiency, when sparsity is not considered, the operation of multiplying by zero is carried out for every stored zero value, so these wasted operations cost time as well as space.

Mathematically, this is a useless computation as anything multiplied by zero will also be zero.

This is the main feature exploited by considering sparsity in system design, since sparse format matrices give both computational and memory benefits over the classic dense format.

**Sparse Matrix Formats**

Given the ubiquitous nature of matrix multiplication across various domains, a large volume of work exists exploring sparsity in matrix operations. Consequently, various methods and formats to represent sparse matrices have emerged.

Many well-known frameworks include built-in tools and implementations for various sparse matrix formats.

The most frequently used and supported formats include the Coordinate format (COO), the Compressed Sparse Row format (CSR), and the Compressed Sparse Column format (CSC), among others.

The Coordinate format (COO) represents a sparse matrix as three parallel arrays: the row indices, the column indices, and the values of the matrix's non-zero elements.

All three arrays have length equal to the number of non-zero elements in the matrix.

The entries in all three arrays are sorted by row index to ensure proper alignment.

This conveniently maps to what's known in graph theory as an edge list.

An example of both traditional dense and COO format is included below as Fig. 1, also including the size requirements in bytes to store each format.
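As a quick illustration (a small SciPy sketch, separate from the TitanQ internals), the three COO arrays for a toy matrix with three non-zero values look like this:

```python
import numpy as np
from scipy import sparse

# A small matrix with three non-zero values.
dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 0]], dtype=np.float64)

coo = sparse.coo_matrix(dense)
print(coo.row)   # row indices of the non-zeros -> [0 1 2]
print(coo.col)   # column indices               -> [2 0 1]
print(coo.data)  # the non-zero values          -> [3. 4. 5.]
```

Note that SciPy scans the dense matrix in row-major order, so the resulting triples come out already sorted by row index.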

CSR format is similar to COO, but the array of row indices is replaced by a compressed array: the cumulative count of non-zero elements in each row, in increasing row order.

Instead of the length being equal to the number of non-zero elements as in COO, the compressed row indices array is of length m+1, where m is the number of rows and the +1 is the added leading zero for algorithmic convenience.

CSC is analogous to CSR but compresses the column indices instead of the rows.
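Continuing the same toy example in SciPy, the compressed row array (`indptr` in SciPy's naming) has length m+1 and begins with the leading zero described above:

```python
import numpy as np
from scipy import sparse

dense = np.array([[0, 0, 3],
                  [4, 0, 0],
                  [0, 5, 0]], dtype=np.float64)

csr = sparse.csr_matrix(dense)
# indptr[i + 1] - indptr[i] gives the number of non-zeros in row i.
print(csr.indptr)   # cumulative row counts -> [0 1 2 3]
print(csr.indices)  # column indices        -> [2 0 1]
print(csr.data)     # non-zero values       -> [3. 4. 5.]
```

Here each row holds exactly one non-zero, so the cumulative counts simply increase by one per row.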

Fig. 2 below shows the same example matrix as Fig. 1 represented in both COO and CSR format.

As is apparent, for sparse matrices these formats offer the benefit of requiring less memory than the traditional dense format.
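The memory saving is easy to check directly. The following CPU sketch (using SciPy, not the TitanQ implementation) compares the footprint of a dense matrix that is roughly 99% zeros against its CSR representation:

```python
import numpy as np
from scipy import sparse

# Build a 1000x1000 matrix that is ~99% zeros.
rng = np.random.default_rng(0)
dense = rng.random((1000, 1000))
dense[dense > 0.01] = 0.0  # zero out ~99% of the entries

csr = sparse.csr_matrix(dense)

dense_bytes = dense.nbytes
# CSR stores only the non-zero values plus two index arrays.
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes

print(f"non-zeros: {csr.nnz} of {dense.size}")
print(f"dense: {dense_bytes:,} bytes; CSR: {csr_bytes:,} bytes")
```

At this density the CSR arrays occupy well under a tenth of the dense footprint; the exact ratio depends on the value and index dtypes used.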

**Results**

We investigated whether sparse format representations of sparse matrices could yield significant computational benefits for our solver, TitanQ.

To achieve this, we conducted benchmarks on an NVIDIA RTX A1000 GPU. Initially, we generated random sparse matrices of varying sizes and densities.

Our goal was to observe how density variations impact the performance of matrix multiplication with an array — a relevant operation for our solver.

To measure execution time accurately, the CUDA executions were synchronized to ensure completion of all computations, and the NVIDIA GPU profiler tool was used.

This ensured that the time measured for the execution of each type of matrix multiplication was exclusively the GPU operations and did not include CPU operations or transfer overheads.
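The GPU benchmarks themselves require CUDA tooling, but the shape of the experiment can be sketched on CPU with SciPy. This is illustrative only: absolute times and crossover densities on CPU will differ from the GPU results reported here.

```python
import time
import numpy as np
from scipy import sparse

n = 2000
rng = np.random.default_rng(0)
x = rng.random(n)

def best_time(fn, reps=5):
    """Best-of-N wall-clock time for one call to fn."""
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - t0)
    return best

for density in (0.001, 0.01, 0.1):
    m = sparse.random(n, n, density=density, format="csr", random_state=0)
    d = m.toarray()
    # Both representations must produce the same product.
    assert np.allclose(m @ x, d @ x)
    t_csr = best_time(lambda: m @ x)
    t_dense = best_time(lambda: d @ x)
    print(f"density {density:>6.1%}: dense {t_dense:.2e}s, CSR {t_csr:.2e}s")
```

Sweeping the density as above, while holding the matrix size fixed, is what produces crossover plots like the ones discussed below.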

The execution times collected for various densities of a 5000x5000 square matrix, for each of the sparse format types, are included below as Fig. 3.

In Fig. 3 above we observe that for a 5000x5000 sparse matrix in CSR format, there is a significant speedup when the density is below approximately 6%.

However, the COO and CSC formats exhibit degraded performance, becoming slower than the dense format above roughly 1% density. Also, as anticipated, the performance of the dense format remains constant regardless of the density of non-zero elements.

We observed that the crossover point below which CSR runtime improves on the dense format, while affected by various factors such as matrix size, generally falls at approximately 5% density.

An example for a large 15000x15000 sparse matrix is included below as Fig. 4.

In terms of memory benefits, Fig. 5 below indicates that up to 20% density, the CSR format is the most memory-efficient, closely followed by other sparse formats.

The exact point at which the CSR format becomes less memory-efficient is somewhat easier to predict than the runtime crossover point, as it is largely independent of the hardware.

Given the demonstrated performance edge of the CSR format over COO and CSC in these preliminary results, the remaining work considers exclusively CSR format as an alternative to dense format.

These very promising initial results motivated further investigation: integrating sparse matrix formats into the sampling step used by TitanQ.

The goal of the below experiments was to explore how the reduced multiplication time affected sampling efficiency and the rate at which high-quality solutions are achieved.

The results of these investigations are summarized in Figs. 6 and 7 below.

The plot above in Fig. 6 confirms that faster sampling is achievable with sparse matrix formats used in the system.

Instance g000476, a benchmark instance from MQLib, a library commonly used for testing solvers similar to TitanQ, was used for this experiment. g000476 is a problem instance with an 8000 x 8000 matrix at 0.2% density [1].

When we apply this enhanced sampling rate to the same problem, we expect to achieve the same near-optimal solution more rapidly.

At a faster sampling rate, less time should be required to obtain a solution of the same quality, since the quality improvement per sample is equivalent in both cases. Fig. 7 below illustrates exactly this predicted improvement.

The metric of optimality represents the ratio of the found solution to the known optimal solution. Fig. 7 demonstrates that when TitanQ employs CSR format to represent the sparse matrix, higher sampling rate results in faster convergence towards a near optimal solution.
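For reference, the optimality metric can be written as a one-line helper (a hypothetical illustration, not part of the TitanQ API):

```python
def optimality(found_objective: float, optimal_objective: float) -> float:
    """Ratio of the found objective to the known optimum (1.0 = optimal)."""
    return found_objective / optimal_objective

# e.g. a solution with objective 980 against a known optimum of 1000
print(optimality(980.0, 1000.0))  # -> 0.98
```

Plotting this ratio against wall-clock time, for the dense and CSR runs, yields the convergence curves compared in Fig. 7.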

For this instance, a factor of approximately 6.9x speedup in convergence is seen.

The plot in Fig. 5 predicted savings in memory usage, and these savings were confirmed when conducting these experiments.

The peak memory allocation during matrix multiplication with TitanQ operating in Dense mode was 270 MB, whereas with the new sparse CSR version, it reduced significantly to 17 MB.

Not only does the inclusion of support for a CSR representation of sparse matrices enable faster sampling, but it also conserves GPU memory.

As a result, we can handle even larger problems before exhausting memory resources.

These gains help TitanQ scale to ultra-large problems, enabling solutions to previously intractable instances.

Adjacent to the experimental goals, it was observed that the time required to convert a sparse matrix from dense format into CSR format was negligible compared to the time spent on sampling.

**Conclusion**

Through the various experiments conducted and review of state-of-the-art literature, significant performance and memory advantages via sparse matrix formats were demonstrated.

These improvements were observed both independently and as a component of TitanQ.

The seamless integration into TitanQ has enabled faster convergence toward near-optimal solutions and expands the range of problem sizes the solver can handle.

With these findings, we expect TitanQ to find application in larger problem scopes and new industries, and to see ever wider adoption in solving the world’s most difficult computational problems.

**References**

[1] I. Dunning, S. Gupta, and J. Silberholz, “What Works Best When? A Systematic Evaluation of Heuristics for Max-Cut and QUBO,” *INFORMS Journal on Computing*, vol. 30, no. 3, pp. 608–624, Aug. 2018.

**About the Authors:**

**Royg Garcia**

Royg is a GPU Programming Intern at InfinityQ, and the main investigator of this work.

He is currently pursuing a degree in Software Engineering with a concentration in Machine Learning from École de Technologie Supérieure (ÉTS) in Montreal.

After previous experiences in web development, a growing interest in lower-level, deep tech, and research computing prompted him to take up his role at InfinityQ.

His primary areas of interest include GPU programming, machine learning, as well as functional and logic programming. Research and the ability to contribute to cutting-edge technology developments are a favorite part of his work at InfinityQ.

**Kevin Jordan**

Kevin leads the GPU research efforts at InfinityQ.

His background includes recent research collaborations with the DASLab at Harvard SEAS, investigating multi-GPU systems infrastructure for distributed AI workloads, as part of his graduate study.

He has previously conducted research at the Berkeley SETI Research Center, advancing data processing infrastructure of radio telescope signals via GPU, research at NASA Ames Research Center in planetary rover mission data synthesis / visualization platforms, and worked as a systems engineer integrating a state-of-the-art satellite constellation at MDA Space in Montreal.

Scientific computing, hardware accelerated solutions, and computing systems research are what drive and excite him.