Distributed QEM Simplification is a high-performance C++ implementation for 3D mesh simplification based on Quadric Error Metrics (QEM). It provides multiple implementations to handle massive meshes using both Shared Memory parallelism and Distributed Memory clusters.
- Parallel QEM: High-speed edge collapse algorithm using OpenMP and FastFlow.
- Spatial Partitioning: Supports both Uniform Grid and Octree decomposition for workload balancing.
- Distributed Computing: Scalable MPI implementations (Hybrid and Full MPI) for processing meshes across multiple nodes.
- Modern C++: Built with C++23 features for efficiency and robust memory management.
- Topology Management: Powered by OpenMesh for reliable half-edge data structures.
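As background, in the standard Garland-Heckbert formulation (the first reference below), every vertex accumulates a quadric from the planes of its incident faces, and the cost of collapsing an edge $(v_1, v_2)$ into a point $\bar{\mathbf{v}}$ is the summed quadric error:

$$
K_p = \mathbf{p}\,\mathbf{p}^{\top},\qquad
Q_v = \sum_{p \in \mathrm{planes}(v)} K_p,\qquad
\Delta(\bar{\mathbf{v}}) = \bar{\mathbf{v}}^{\top}\,(Q_{v_1} + Q_{v_2})\,\bar{\mathbf{v}},
$$

where $\mathbf{p} = (a, b, c, d)^{\top}$ encodes a face plane $ax + by + cz + d = 0$ (with $a^2 + b^2 + c^2 = 1$) and $\bar{\mathbf{v}}$ is the homogeneous position of the merged vertex, usually chosen to minimize $\Delta$. Collapses are applied greedily in order of increasing cost until the target face count is reached.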
- Surface Simplification Using Quadric Error Metrics
- Scalable Algorithms for Distributed-Memory Adaptive Mesh Refinement
- Mesh Simplification in Parallel
- Distributed Processing of Mesh Simplification
- External Memory Management and Simplification of Huge Meshes
- cxxopts for CLI parsing
- Eigen for math computations
- OpenMesh for mesh topology management
- cpp-utils for generic utilities
- FastFlow for parallel pattern testing
- CMake as the build system
- C++: >= C++23
- OpenMPI: >= 5.0
- OpenMP: >= 6.1
- CMake: >= 3.20
- Compiler: GCC 13+ / Clang 16+ / MSVC with C++23 support
```bash
git clone https://github.com/bigmat18/distributed-qem-simplification.git
cd distributed-qem-simplification
git submodule update --init --recursive

./SETUP.sh <type>
./BUILD.sh <type>
```

The `<type>` parameter can be `debug`, `release`, or `reldeb`, which enable different optimization levels.
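For example, `./SETUP.sh release` followed by `./BUILD.sh release` configures and builds the optimized variant (assuming, as shown above, that the scripts take the build type as their only argument).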
The project generates different executables based on the parallelization strategy:
Shared Memory (OMP/FastFlow):
```bash
# Uniform Grid OpenMP
./build/examples/02-omp-uniform input.obj -n <target_faces_num>

# Octree OpenMP
./build/examples/03-omp-octree input.obj -n <target_faces_num>

# FastFlow Uniform Grid
./build/examples/04-ff-uniform input.obj -n <target_faces_num>
```

Main Options:
- `-i, --filename`: Input filename list
- `-t, --threads`: Number of threads (default: -1, uses all available)
- `-p, --partitions`: Starting number of partitions (default: 16)
- `-n, --target`: Target number of faces
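To give a feel for how the shared-memory variants are organized, the sketch below shows one plausible way to simplify the spatial partitions in parallel with OpenMP. It is illustrative only: `Partition`, `simplify_partition`, and `simplify_shared_memory` are hypothetical names, not the project's actual API.

```cpp
#include <omp.h>
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical type for illustration only: a Partition would own the faces that
// fall into one uniform-grid cell (or octree leaf) of the spatial decomposition.
struct Partition {
    std::size_t num_faces = 0;
    // ... cell-local mesh data (vertices, half-edges, per-vertex quadrics) ...
};

// Placeholder for the per-cell QEM pass: collapse edges in order of increasing
// quadric error until the cell reaches its face budget.
void simplify_partition(Partition& p, std::size_t target_faces) {
    p.num_faces = std::min(p.num_faces, target_faces);
}

// Sketch of the shared-memory strategy: cells are simplified independently by
// OpenMP threads; edges crossing partition boundaries need a separate pass.
void simplify_shared_memory(std::vector<Partition>& partitions,
                            std::size_t total_target, int num_threads) {
    if (num_threads > 0) omp_set_num_threads(num_threads);   // -1 => all cores
    const std::size_t per_cell =
        total_target / std::max<std::size_t>(std::size_t{1}, partitions.size());

    #pragma omp parallel for schedule(dynamic)
    for (std::size_t i = 0; i < partitions.size(); ++i)
        simplify_partition(partitions[i], per_cell);
}

int main() {
    std::vector<Partition> cells(16, Partition{3511});        // 16 starting partitions
    simplify_shared_memory(cells, 1000, /*num_threads=*/-1);  // target: 1,000 faces
}
```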
Distributed Memory (MPI):
Use the provided bash script to launch MPI executions. The script safely wraps `mpirun`, handling hardware thread detection, CPU binding, and OpenMP thread mapping for the master and worker processes.
```bash
./RUN.sh <type> <exe> -nw 4 -wt 2 -mt 4 -i data_folder/ -p 16 -t 10
```

Script Usage: `./RUN.sh <type> <exe> [options]`
- `<type>`: `debug`, `release`, or `reldeb`
- `<exe>`: Target executable (e.g., `06-mpi-omp`, `07-mpi`)
- `-nw`: Number of MPI worker processes
- `-wt`: Number of OpenMP threads per worker
- `-mt`: Number of OpenMP threads for the master
- `-p`: Number of partitions
- `-t`: Target percentage
- `-pr`: Enable performance profiling via `perf record`
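As a rough sketch of the hybrid architecture (illustrative only; the project's actual rank layout, message formats, and option handling may differ), a hybrid MPI+OpenMP program typically requests a threaded MPI environment and gives the master and the workers different OpenMP team sizes, mirroring the `-mt` and `-wt` options:

```cpp
#include <mpi.h>
#include <omp.h>
#include <cstdio>

// Minimal hybrid MPI+OpenMP skeleton: rank 0 acts as the master (partitioning and
// coordination), all other ranks act as workers that would simplify their
// partitions with a local team of OpenMP threads.
int main(int argc, char** argv) {
    int provided = 0;
    // FUNNELED: only the thread that initialized MPI makes MPI calls, which is
    // enough when OpenMP is used purely for node-local computation.
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int master_threads = 4;  // would come from -mt
    const int worker_threads = 2;  // would come from -wt
    omp_set_num_threads(rank == 0 ? master_threads : worker_threads);

    #pragma omp parallel
    {
        #pragma omp single
        std::printf("rank %d/%d running with %d OpenMP threads\n",
                    rank, size, omp_get_num_threads());
    }

    // ... master: read mesh, build partitions, scatter work, gather results ...
    // ... workers: receive partitions, run local QEM simplification, send back ...

    MPI_Finalize();
    return 0;
}
```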
- Figure: Uniform Grid Reduction, Stanford Bunny from 56,172 faces to 1,000.
- Figure: Octree Reduction, Stanford Bunny from 56,172 faces to 1,000 with wireframe visualization.
The performance of the various parallel implementations (Shared Memory and Distributed Memory) was evaluated against a sequential baseline and MeshLab's decimation algorithm. The efficiency of these parallel solutions is highly correlated with the input model's complexity. Small meshes, such as the Bunny (68K faces), show limited improvement due to thread management and communication overhead. Conversely, massive models like Lucy (28M faces) achieve up to a 10x speedup in distributed environments.
Figures: benchmark models used for the evaluation: Bunny (68K triangles), Armadillo (300K triangles), Dragon (7.5M triangles), and Lucy (28M triangles).
Strong scaling analysis highlights the operational limits of both architectures. Shared memory approaches scale well initially but saturate near 32 cores due to Amdahl's Law, synchronization overhead, and memory bandwidth limits. Distributed implementations (Full MPI and Hybrid MPI+OMP) process massive datasets efficiently but face severe communication bottlenecks and load imbalance beyond 128 cores, primarily driven by I/O constraints.
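For reference, the shared-memory saturation follows directly from Amdahl's law: if a fraction $f$ of the work parallelizes, the speedup on $p$ cores is bounded by

$$
S(p) \;=\; \frac{1}{(1 - f) + f/p} \;\le\; \frac{1}{1 - f},
$$

so even with $f = 0.95$ the speedup can never exceed 20x, and synchronization and memory-bandwidth costs push the practical ceiling lower still.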
- Figure: Shared Memory Strong Scaling, execution time variation up to 32 physical cores.
- Figure: Distributed Memory Scalability, comparison of Hybrid MPI+OMP and Full MPI architectures.
Detailed Documentation: For a comprehensive explanation of the mathematical models, spatial partitioning strategies (Uniform Grid vs. Octree), and in-depth performance analysis, please refer to the full project report:
📄 Relation_SPM.pdf