Appendix F: Algorithmic & Performance Details

This document provides a technical summary of the key algorithmic optimizations implemented in the PyEQSP Python package to ensure scalability and efficiency for large-scale sphere partitioning analysis.

Algorithmic Improvements

Min-Distance Optimization

Status: Improved from \(O(N^2)\) to \(O(N \log N)\).

Previous Bottleneck: Pairwise distance matrices created using scipy.spatial.distance.pdist or NumPy broadcasting consumed \(O(N^2)\) memory and time.
Approach: We leverage KDTrees (scipy.spatial.KDTree) for \(S^d\) (\(d \le 4\)) to perform localized neighbour searches. For higher dimensions or specific partition types, we use structure-aware searches that exploit the recursive nature of the EQ algorithm to bound the search space.
Result: Calculating the min-distance for \(N=100,000\) points now takes seconds rather than minutes, and memory usage remains linear.

Recursive Partitioning Scaling

Status: Verified \(O(\mathcal{N}^{0.6})\) scaling ([Leo07], Section 3.10.2).

Approach: The eq_regions function implements the Recursive Zonal Equal Area Partitioning algorithm. We performed a high-fidelity verification sweep for \(d\) up to 11 and \(\mathcal{N}\) up to \(2^{22}\) (\(\approx 4.2 \times 10^6\)).
Result: The performance follows the theoretical \(O(\mathcal{N}^{0.6})\) scaling for \(S^2\), ensuring that even millions of regions can be calculated in minutes.

Riesz Energy Calculations

Status: Memory-efficient \(O(N^2)\) with Symmetry Exploitation (\(0.5 \times\) work).

Previous Bottleneck: Creating a full \(N \times N\) distance matrix to sum \(d_{ij}^{-s}\) lead to \(O(N^2)\) memory exhaustion (e.g., 3.2 GB for \(N=20,000\)).
Approach:
- Block-based Summation (Tiling): Calculations are performed in blocks of size \(M \times M\), keeping the peak memory footprint at \(O(N \times M)\) instead of \(O(N^2)\).
- Symmetry Exploitation: Since \(d_{ij} = d_{ji}\), we only compute the upper triangle of the interaction matrix, effectively doubling the performance.
Result: \(N=50,000\) energy calculations complete in under 15 seconds on standard 2026 hardware (e.g. Ryzen 7).

Histogram-Based Region Lookup (\(S^2\))

Status: Fully Vectorized (Stable with “Index Rotation”).

Previous Bottleneck: Assigning points to regions on \(S^2\) used a recursive Python loop. The first vectorized iteration faced accuracy failures at the \(0/2\pi\) longitude boundary due to non-monotonic search tables in wrapping collars.
Improved Approach: We implement Domain Translation (Index Rotation) within each collar: a. Relative Mapping: Points and region boundaries are shifted by the collar’s first boundary (\(\phi_0\)) using a modulo \(2\pi\) operation: (v - phi0) % 2pi. b. Monotonic Conversion: The final boundary (which would wrap back to \(0\)) is explicitly set to \(2\pi\) to ensure the table is strictly increasing. c. Direct Lookup: A single np.searchsorted call handles the entire collar in one pass, without the overhead of the legacy lookup_table utility (which has been retired to achieve 100% test coverage). d. Precision Rounding: Latitude (band) lookups include \(10^{-12}\) rounding to prevent points on a boundary from jumping bands due to float variance.
Result: Millions of sample points can be binned with mathematical fidelity in milliseconds. This provides a \(1000\times\) speedup over the previous recursive loop.

Symmetric Partition Performance (`even_collars`)

Status: Vectorized support in all properties functions.

Approach: The symmetric partitioning logic (even_collars=True) ensures an even number of collars. All property calculation functions (eq_area_error, eq_min_dist, eq_energy_dist, eq_diam_coeff) are fully vectorized to support this parameter.
Result: Symmetry calculation adds negligible overhead compared to standard partitions. In some cases, the simplified collar recurrence in symmetric mode can lead to slight performance improvements for high \(N\).

Optimized NumPy & SciPy Patterns

During development, many “hot” paths were refactored to use more efficient library patterns:

Vectorized Root Finding

In sradius_of_cap, we replaced a Python loop over scipy.optimize.root_scalar with a vectorized Newton-Raphson implementation using scipy.optimize.newton on arrays. This provides a 10–50x speedup for high-dimensional spherical cap radius calculations.

Efficient Coordinate Transforms

Functions in utilities.py (like cart2polar2) were refactored to use:

np.atleast_2d to handle both single points and large arrays uniformly.
Direct array operations instead of for loops.
np.arctan2 and np.arccos for numerically stable conversions.

Avoiding Massive Intermediate Arrays

Common patterns like np.linalg.norm(a[:, None] - b, axis=2) were replaced with more memory-conscious implementations or block-based loops where the intermediate broadcasting would exceed available RAM.

Parallel Dimension Calculations ([Leo07], Figure 3.7)

fig_3_7_max_diam_multi_dim.py calculates max diameter coefficients for EQ partitions across dimensions \(d=2\) to \(d=8\). The dimension-8 calculation alone accounts for approximately 81% of the total CPU time, making it the dominant bottleneck.

Approach: Uses concurrent.futures.ProcessPoolExecutor with max_workers=2. Dimensions are dispatched in decreasing order so that dim=8 begins immediately, while the second worker handles dim=7 and then the remaining smaller dimensions sequentially.
Result: For a full-fidelity thesis run (\(N=2^{20}\)), wall-clock time is reduced (e.g., from ~1h 45m to ~1h 24m). This matches the theoretical max improvement for a 2-worker strategy given the Amdahl’s Law limit imposed by the serial dim=8 task.

For more details on the performance roadmap and items currently under review for future releases, see the Release Roadmap.

For detailed benchmarks and instructions on running performance tests yourself, see the Performance Benchmarks.