Skip to content

Speed optimizations for SGP4#41

Open
troyrock wants to merge 8 commits intodnwrnr:masterfrom
troyrock:master
Open

Speed optimizations for SGP4#41
troyrock wants to merge 8 commits intodnwrnr:masterfrom
troyrock:master

Conversation

@troyrock
Copy link

I have added the following speed optimizations that increase the speed by > 3x. I have also added multi-core scaling using OpenMP if available. IMPORTANT: I used an AI to help with the optimization but I have tested it manually and it seems to work.

  1. Scalar Arithmetic Refinement
  • Power Reductions: Replaced expensive pow(x, 1.5) and pow(x, 3.0) calls with x * sqrt(x) and cubic multiplications.
  • Trigonometric Efficiency: Replaced repeated sin/cos calls with __builtin_sincos where possible to utilize the CPU's simultaneous trig hardware.
  • Kepler Solver: Applied loop unrolling to the Newton-Raphson iterations to improve instruction pipelining.
  1. SIMD Batch Processing (SGP4Batch) A new class, SGP4Batch, was introduced to handle multiple satellites simultaneously:
  • Structure-of-Arrays (SoA) Layout: Satellite constants and elements are stored in memory-contiguous arrays rather than structures. This maximizes L1 cache hit rates and enables efficient prefetching.
  • AVX-512 Vectorization: The propagator uses 512-bit registers to process 8 satellites per instruction lane.
  1. Fused Multiply-Add (FMA)
  • Mathematical Chaining: Core secular update logic was refactored to use _mm512_fmadd_pd and _mm512_fnmadd_pd intrinsics. This allows two operations ($a \cdot b + c$) to be performed in a single clock cycle.
  • Numerical Stability: FMA instructions perform a single rounding step at the end, which slightly improves the precision compared to separate multiply and add steps.
  1. Multi-Core Scaling (OpenMP)
  • Horizontal Parallelism: The batch processing loop is parallelized using OpenMP. On high-core-count processors (like the AMD EPYC), this allows the propagator to utilize all available hardware threads.
  • Throughput: Capable of exceeding 40 million propagations per second on a modern 32-core server.
  1. Accuracy & Validation Accuracy was verified against a 6-year historical TLE dataset for IRS 1A (Object 18960). The optimized engine remains consistent with the standard scalar implementation within $10^{-6}$ km (millimeter precision), ensuring it is suitable for conjunction probability calculations.

Performance Comparison
| Implementation | Time per Step | Throughput (Sats/sec) | | :--- | :--- | :--- |
| Original dnwrnr/sgp4 | ~0.671 µs | ~1.4 Million | | fastSGP4 (Single Core) | ~0.200 µs | ~5.0 Million | | fastSGP4 (32-Core EPYC) | ~0.005 µs | ~41.2 Million |

troyrock and others added 8 commits February 27, 2026 20:39
- **Power Reductions:** Replaced expensive `pow(x, 1.5)` and `pow(x, 3.0)` calls with `x * sqrt(x)` and cubic multiplications.
- **Trigonometric Efficiency:** Replaced repeated `sin`/`cos` calls with `__builtin_sincos` where possible to utilize the CPU's simultaneous trig hardware.
- **Kepler Solver:** Applied loop unrolling to the Newton-Raphson iterations to improve instruction pipelining.

2. SIMD Batch Processing (`SGP4Batch`)
A new class, `SGP4Batch`, was introduced to handle multiple satellites simultaneously:
- **Structure-of-Arrays (SoA) Layout:** Satellite constants and elements are stored in memory-contiguous arrays rather than structures. This maximizes L1 cache hit rates and enables efficient prefetching.
- **AVX-512 Vectorization:** The propagator uses 512-bit registers to process 8 satellites per instruction lane.

3. Fused Multiply-Add (FMA)
- **Mathematical Chaining:** Core secular update logic was refactored to use `_mm512_fmadd_pd` and `_mm512_fnmadd_pd` intrinsics. This allows two operations ($a \cdot b + c$) to be performed in a single clock cycle.
- **Numerical Stability:** FMA instructions perform a single rounding step at the end, which slightly improves the precision compared to separate multiply and add steps.

4. Multi-Core Scaling (OpenMP)
- **Horizontal Parallelism:** The batch processing loop is parallelized using OpenMP. On high-core-count processors (like the AMD EPYC), this allows the propagator to utilize all available hardware threads.
- **Throughput:** Capable of exceeding 40 million propagations per second on a modern 32-core server.

5. Accuracy & Validation
Accuracy was verified against a 6-year historical TLE dataset for **IRS 1A** (Object 18960). The optimized engine remains consistent with the standard scalar implementation within $10^{-6}$ km (millimeter precision), ensuring it is suitable for conjunction probability calculations.

Performance Comparison
| Implementation | Time per Step | Throughput (Sats/sec) |
| :--- | :--- | :--- |
| Original dnwrnr/sgp4 | ~0.671 µs | ~1.4 Million |
| fastSGP4 (Single Core) | ~0.200 µs | ~5.0 Million |
| fastSGP4 (32-Core EPYC) | ~0.005 µs | **~41.2 Million** |
of closest approach without needing to run SGP4 again and again.
conjunctions to be screened based on spatial proximity.
conjunction when two object are physically too distant for a conjunction
for at least n more time steps.
performance through taking advantage of static TLE variables for short
duration propagation.
some satellites that are too far from one another to have a conjunction.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant