Architecture Internals¶
This document describes OpenWPM’s internal architecture in detail: process model, communication channels, timeouts, and coupling points. For a high-level overview, see Platform-Architecture.md.
Process Model¶
Main Process (TaskManager)
├── StorageController Process (asyncio TCP server)
│ ├── StructuredStorageProvider (SQLite/Parquet/S3/GCS)
│ └── UnstructuredStorageProvider (LevelDB/Gzip/S3/GCS)
├── BrowserManager Process 1..N
│ ├── Selenium WebDriver → Firefox
│ └── WebExtension (HTTP/Cookie/JS/Nav/DNS instrumentation)
├── MPLogger Process (log aggregation)
└── Watchdog Threads (memory/process monitoring)
The TaskManager is the user-facing entry point. It spawns:
A StorageController in a child process, which runs an asyncio TCP server to receive data from both the WebExtension (JSON serialization) and the TaskManager itself (dill serialization via
DataSocket).One or more BrowserManager child processes, each driving a Firefox instance via Selenium WebDriver. Each browser loads the OpenWPM WebExtension for instrumentation.
An MPLogger thread for centralized log aggregation.
Watchdog threads for memory and process monitoring.
Communication Channels¶
| Channel | Direction | Mechanism | Pattern |
|---|---|---|---|
| TaskManager → BrowserManager | command_queue (mp.Queue) |
Request (one command at a time) | |
| BrowserManager → TaskManager | status_queue (mp.Queue) |
Reply ("OK" / "FAILED" / "CRITICAL" / "NETERROR") |
|
| Extension → StorageController | TCP socket (JSON) | Fire-and-forget | |
| TaskManager → StorageController | DataSocket (TCP, dill) |
Fire-and-forget | |
| StorageController → TaskManager | status_queue (mp.Queue) |
Periodic push (every 5s) | |
| StorageController → TaskManager | completion_queue (mp.Queue) |
Polled by callback thread (every 1s/5s) | |
| TaskManager → StorageController | shutdown_queue (mp.Queue) |
One-shot signal | |
| Extension startup → BrowserManager | File-based (extension_port.txt) |
Polling every 0.1s for 5s |
Key observation: No channel supports bidirectional communication. The
status_queue between BrowserManager and TaskManager is the closest, but it’s
strictly request-reply: TaskManager sends one command, then blocks waiting for
exactly one status response.
Timeout Inventory¶
| Location | Timeout | Purpose |
|---|---|---|
BrowserManagerHandle._SPAWN_TIMEOUT |
120s | Browser spawn per attempt |
CommandSequence.get() default |
60s | Page navigation |
CommandSequence.screenshot() default |
30s | Screenshot capture |
CommandSequence.dump_profile() default |
120s | Profile archiving |
BrowserManager._start_extension |
5s | Find extension_port.txt |
BrowserManager._start_extension |
10s | Find OPENWPM_STARTUP_SUCCESS.txt |
BrowserManagerHandle.close_browser_manager |
30s | Webdriver shutdown response |
BrowserManagerHandle.close_browser_manager |
30s | Browser process exit |
StorageController.STATUS_TIMEOUT |
120s | Storage status heartbeat |
StorageController.BATCH_COMMIT_TIMEOUT |
30s | Flush if idle |
StorageController.STATUS_UPDATE_INTERVAL |
5s | Status push interval |
StorageController.update_completion_queue |
5s | Completion poll interval |
_mark_command_sequences_complete |
1s | Callback thread poll |
BrowserManagerHandle.execute_command_sequence |
2s | Post-sequence drain sleep |
FinalizeCommand.execute |
5s | Extension buffer drain |
| Watchdog | 10s | Check interval |
| Process watchdog buffer | 300s | Avoid killing fresh browsers |
Coupling Analysis¶
BrowserManagerHandle ↔ TaskManager¶
BrowserManagerHandle.execute_command_sequence (in openwpm/browser_manager.py)
takes a TaskManager reference and directly accesses these internal fields:
task_manager.sock—DataSocketfor recording crawl historytask_manager.failure_status—dict, set on critical errorstask_manager.failure_count—int, incremented on failurestask_manager.threadlock—threading.Locktask_manager.closing—booltask_manager.failure_limit—inttask_manager.sock.finalize_visit_id()— visit lifecycle management
This creates a circular dependency: TaskManager imports BrowserManagerHandle,
and BrowserManagerHandle uses TaskManager (via TYPE_CHECKING forward ref).
BaseCommand Coupling¶
Commands (in openwpm/commands/types.py) require concrete types: Firefox
webdriver, BrowserParamsInternal, ManagerParamsInternal, ClientSocket.
Commands cannot be executed without a real browser and extension socket.
TaskManager.init Sequencing¶
The constructor performs all initialization sequentially: validates config → starts MPLogger → launches StorageController → initializes browsers → launches browsers → starts watchdog → starts storage logger → saves config → starts callback thread. Any test touching TaskManager must pay all these costs.