To optimize the performance of an HTML parsing pipeline, I experimented with several parallelization strategies using Python's `multiprocessing` and `concurrent.futures` modules.
Note that this is tested locally, but not on AWS ECS.
- Pages: 10
- Properties: 400
- Units: 4,504
- List building (BeautifulSoup): 256.8691 s
| Method | Time (s) |
|---|---|
| Single process (base case) | 19.1319 |
| `apply_async` | 17.0994 |
| `apply_async` + large chunks (25 chunks) | 18.5841 |
| `apply_async` + small chunks (88 chunks) | 16.9023 |
| `map` | 18.0836 |
| `ThreadPoolExecutor` (multithread) | 20.3130 |
| `apply_async` + small chunks (run only first 25) | 6.6897 |
Despite using multiprocessing, the performance improvement was marginal. Here is a breakdown of the findings (the benchmark pattern is sketched after this list):
- Replacing `apply_async` with `map` to avoid shared state: no effect
- Commenting out the logic that modifies `Manager().list()`: no effect
- Even returning early without running any parsing code: no improvement
- Reducing the number of tasks using chunking (25 vs. 88 chunks): no effect
- However, running only the first 25 chunks of the 88-chunk version was significantly faster
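For reference, the experiments above all followed roughly this shape. The function names and the extraction inside `parse_chunk` are placeholders, not the actual pipeline code; the key detail is that pre-built soup objects cross the process boundary with every task:

```python
from multiprocessing import Pool
from bs4 import BeautifulSoup

def parse_chunk(soups):
    # Placeholder extraction; the real pipeline pulls property/unit fields.
    return [soup.get_text(strip=True) for soup in soups]

def parse_all(soups, n_chunks=88):
    # Split the pre-built BeautifulSoup objects into n_chunks tasks.
    chunks = [soups[i::n_chunks] for i in range(n_chunks)]
    with Pool() as pool:
        # Every apply_async call pickles a whole chunk of soup objects in
        # the main process before any worker can touch it.
        handles = [pool.apply_async(parse_chunk, (chunk,)) for chunk in chunks]
        return [item for h in handles for item in h.get()]
```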
According to the official Python documentation on `multiprocessing`:

> "When using multiple processes, one generally uses message passing for communication between processes and avoids having to use any synchronization primitives like locks. ... One difference from other Python queue implementations, is that multiprocessing queues serializes all objects that are put into them using pickle. The object return by the get method is a re-created object that does not share memory with the original object."
In our case, the main process must serialize large BeautifulSoup objects for every task, and all of that pickling happens on a single core. When the parsing work itself is lightweight, serialization dominates, so multiprocessing brings only marginal improvement until that cost is reduced.
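A quick way to see the asymmetry is to compare how much data `pickle` produces for a raw HTML string versus its parsed tree. This is an illustrative measurement, not taken from the pipeline; the synthetic HTML is a stand-in for a real page, and very deep parse trees may need an even higher recursion limit to pickle at all:

```python
import pickle
import sys
from bs4 import BeautifulSoup

# Pickling a soup recurses through the element tree, so give it headroom.
sys.setrecursionlimit(20_000)

# Synthetic stand-in for one property page.
html = "<html><body>" + "<div class='unit'><p>unit</p></div>" * 500 + "</body></html>"
soup = BeautifulSoup(html, "html.parser")

print(f"raw HTML string: {len(pickle.dumps(html)):>9,} bytes")
print(f"soup object:     {len(pickle.dumps(soup)):>9,} bytes")
```

The parsed tree typically pickles to several times the size of the source string, and that cost is paid in the main process for every task.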
The fix is to refactor the code to pass lightweight HTML strings instead of BeautifulSoup objects; each worker then rebuilds the soup locally. This significantly reduces serialization cost and improves parallel throughput.
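A minimal sketch of the refactored pattern, under the same assumptions as before (placeholder names, placeholder extraction):

```python
from multiprocessing import Pool
from bs4 import BeautifulSoup

def parse_chunk(html_strings):
    results = []
    for html in html_strings:
        # Rebuild the soup inside the worker: only a cheap string crossed
        # the process boundary.
        soup = BeautifulSoup(html, "html.parser")
        results.append(soup.get_text(strip=True))  # placeholder extraction
    return results

def parse_all(html_strings, n_chunks=25):
    chunks = [html_strings[i::n_chunks] for i in range(n_chunks)]
    with Pool() as pool:
        handles = [pool.apply_async(parse_chunk, (chunk,)) for chunk in chunks]
        return [item for h in handles for item in h.get()]
```

The only change from the earlier sketch is the task payload: strings go in, and the soup is built locally in each worker.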
- List building (without turning into soup objects beforehand): 190.0905 s
| Method | Time (s) |
|---|---|
| Single process (base case) | 52.3781 |
| `apply_async` | 10.8709 |
| `apply_async` + large chunks (25) | 11.1214 |
| `apply_async` + small chunks (88) | 11.6920 |
| `map` | 10.9354 |
| `ThreadPoolExecutor` | 57.2707 |

The single-process baseline rises from 19.1 s to 52.4 s because soup construction now happens during parsing instead of list building, but that work parallelizes well across processes. `ThreadPoolExecutor` is slower than the baseline for the same reason: soup construction is CPU-bound Python code, so the GIL prevents threads from running it in parallel.
- Pass HTML strings instead of soup objects to avoid heavy serialization.
- Use `apply_async` for simple, efficient parallelism.
- Overall speed improvement: ~27%
| Stage | Before | After |
|---|---|---|
| List building | 256.8691 s | 190.0905 s |
| Parsing | 19.1319 s | 10.8709 s |
| Total | 276.0010 s | 200.9614 s |
Avoid passing complex objects (like soup objects from `BeautifulSoup`) between processes. Instead, pass raw data (e.g., strings) and reconstruct the objects within the worker. This small change significantly improves performance and scalability.