Distributed QEM Simplification

1. Introduction

Distributed QEM Simplification is a high-performance C++ implementation for 3D mesh simplification based on Quadric Error Metrics (QEM). It provides multiple implementations to handle massive meshes using both Shared Memory parallelism and Distributed Memory clusters.
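The core QEM primitive is a 4×4 quadric per vertex, accumulated from the planes of its incident faces; an edge collapse is scored by the quadric error of the merged vertex. The sketch below shows that error computation with Eigen (one of the libraries listed later); the function names and code organization are illustrative assumptions, not taken from this repository.

#include <Eigen/Dense>

// Quadric of the plane n·x + d = 0 (n unit length): Q = [n d]^T [n d].
Eigen::Matrix4d plane_quadric(const Eigen::Vector3d& n, double d) {
    Eigen::Vector4d p(n.x(), n.y(), n.z(), d);
    return p * p.transpose();
}

// QEM error of placing a vertex at v under an accumulated quadric Q:
// err(v) = [v 1] Q [v 1]^T.
double quadric_error(const Eigen::Matrix4d& Q, const Eigen::Vector3d& v) {
    Eigen::Vector4d vh(v.x(), v.y(), v.z(), 1.0);
    return vh.dot(Q * vh);
}

In the full algorithm each vertex quadric is the sum of plane_quadric over its incident triangles, collapsing an edge adds the two endpoint quadrics, and the collapse position is chosen to minimize quadric_error (by solving a small linear system, or falling back to an endpoint or the midpoint when the system is singular).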

Features

  • Parallel QEM: High-speed edge collapse algorithm using OpenMP and FastFlow.
  • Spatial Partitioning: Supports both Uniform Grid and Octree decomposition for workload balancing (a cell-indexing sketch follows this list).
  • Distributed Computing: Scalable MPI implementations (Hybrid and Full MPI) for processing meshes across multiple nodes.
  • Modern C++: Built with C++23 features for efficiency and robust memory management.
  • Topology Management: Powered by OpenMesh for reliable half-edge data structures.
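As a rough illustration of the Uniform Grid decomposition mentioned above, the sketch below maps a vertex position to a cell of a regular grid laid over the mesh bounding box; per-cell buckets can then be grouped into the partitions that threads or MPI ranks simplify independently. The function name, resolution handling, and partitioning policy are assumptions for illustration, not this project's code.

#include <Eigen/Dense>
#include <algorithm>
#include <array>

// Map a point to the (i, j, k) cell of a uniform grid with `res` cells per
// axis over the axis-aligned bounding box [bb_min, bb_max].
std::array<int, 3> grid_cell(const Eigen::Vector3d& p,
                             const Eigen::Vector3d& bb_min,
                             const Eigen::Vector3d& bb_max,
                             int res) {
    std::array<int, 3> idx{};
    for (int a = 0; a < 3; ++a) {
        double extent = std::max(bb_max[a] - bb_min[a], 1e-12);
        int i = static_cast<int>((p[a] - bb_min[a]) / extent * res);
        idx[a] = std::clamp(i, 0, res - 1);  // points on the max face fall in the last cell
    }
    return idx;
}

An Octree follows the same idea but refines cells adaptively where the mesh is dense, which improves load balancing on unevenly sampled models.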

2. Papers, Libraries, and Requirements

Reference Papers and Algorithms

Libraries Used

  • cxxopts for CLI parsing
  • eigen for math computation
  • openmesh for mesh topological management
  • cpp-utils for generic utilities
  • fastflow for parallel pattern testing
  • CMake as the build system

Software Requirements

  • C++: >= C++23
  • OpenMPI: >= 5.0
  • OpenMP: >= 6.1
  • CMake: >= 3.20
  • Compiler: GCC 13+ / Clang 16+ / MSVC with C++23 support

3. Installation and Usage

Installation

git clone https://github.com/bigmat18/distributed-qem-simplification.git
cd distributed-qem-simplification
git submodule update --init --recursive

Build

./SETUP.sh <type>
./BUILD.sh <type>

The type parameter can be debug, release, or reldeb, which select different optimization settings.

Basic Usage

The project generates different executables based on the parallelization strategy:

Shared Memory (OMP/FastFlow):

# Uniform Grid OpenMP
./build/examples/02-omp-uniform input.obj -n <target_faces_num>
# Octree OpenMP
./build/examples/03-omp-octree input.obj -n <target_faces_num>
# FastFlow Uniform
./build/examples/04-ff-uniform input.obj -n <target_faces_num>

Main Options:

  • -i, --filename : Input filename list
  • -t, --threads : Number of threads (default: -1, uses all available)
  • -p, --partitions : Start partitions (default: 16)
  • -n, --target : Target faces

Distributed Memory (MPI): Use the provided bash script to launch MPI runs. The script wraps mpirun and takes care of hardware thread detection, CPU binding, and OpenMP thread mapping for the master and worker processes; a minimal hybrid MPI+OMP sketch follows the option list below.

./RUN.sh <type> <exe> -nw 4 -wt 2 -mt 4 -i data_folder/ -p 16 -t 10

Script Usage: ./RUN.sh <type> <exe> [options]

  • <type> : debug, release, or reldeb
  • <exe> : Target executable (e.g., 06-mpi-omp, 07-mpi)
  • -nw : Number of MPI worker processes
  • -wt : Number of OpenMP threads per worker
  • -mt : Number of OpenMP threads for the master
  • -p : Number of partitions
  • -t : Target percentage
  • -pr : Enable performance profiling via perf record
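To make the Hybrid MPI+OMP layout concrete, here is a minimal skeleton of the structure those parameters map onto: one MPI process for the master plus -nw workers, each running an OpenMP team sized by -mt or -wt. This is an illustrative sketch under those assumptions, not the repository's actual entry point.

#include <mpi.h>
#include <omp.h>
#include <cstdio>

int main(int argc, char** argv) {
    // OpenMP threads do the local simplification; only the main thread calls MPI.
    int provided = 0;
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

    int rank = 0, size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    // Rank 0 would partition the mesh and merge results; the other ranks
    // would simplify their assigned partitions in parallel with OpenMP.
    #pragma omp parallel
    {
        #pragma omp single
        std::printf("rank %d of %d running %d OpenMP threads\n",
                    rank, size, omp_get_num_threads());
    }

    MPI_Finalize();
    return 0;
}

Built with mpicxx -fopenmp, this runs under the same mpirun invocation that RUN.sh assembles.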

4. Results

Uniform Grid Reduction
Stanford Bunny from 56,172 faces to 1,000.
Octree Reduction
Stanford Bunny from 56,172 faces to 1,000 with wireframe visualization.

Uniform Grid Partitions Reduction


5. Benchmark and Performance Analysis

Overall Speedup Comparison

The performance of the various parallel implementations (Shared Memory and Distributed Memory) was evaluated against a sequential baseline and MeshLab's decimation algorithm. The efficiency of these parallel solutions is highly correlated with the input model's complexity. Small meshes, such as the Bunny (68K faces), show limited improvement due to thread management and communication overhead. Conversely, massive models like Lucy (28M faces) achieve up to a 10x speedup in distributed environments.


Per-model speedup charts: Bunny (68K triangles), Armadillo (300K triangles), Dragon (7.5M triangles), and Lucy (28M triangles).

Scalability Analysis

Strong scaling analysis highlights the operational limits of both architectures. Shared memory approaches scale well initially but saturate near 32 cores due to Amdahl's Law, synchronization overhead, and memory bandwidth limits. Distributed implementations (Full MPI and Hybrid MPI+OMP) process massive datasets efficiently but face severe communication bottlenecks and load imbalance beyond 128 cores, primarily driven by I/O constraints.
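For intuition on the shared-memory saturation point, Amdahl's Law bounds the speedup on p cores at S(p) = 1 / (s + (1 - s)/p) for a serial fraction s. With a purely hypothetical s = 0.05, S(32) ≈ 12.5 and the limit S(∞) = 20, so beyond a few dozen cores additional threads mainly add synchronization and memory-bandwidth pressure rather than useful work.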


Shared Memory Strong Scaling
Execution time variation up to 32 physical cores.

Distributed Memory Scalability
Comparison of Hybrid MPI+OMP and Full MPI architectures.

Detailed Documentation: For a comprehensive explanation of the mathematical models, spatial partitioning strategies (Uniform Grid vs. Octree), and in-depth performance analysis, please refer to the full project report:
📄 Relation_SPM.pdf
