Architecture Internals

This document describes OpenWPM’s internal architecture in detail: process model, communication channels, timeouts, and coupling points. For a high-level overview, see Platform-Architecture.md.

Process Model

Main Process (TaskManager)
  ├── StorageController Process (asyncio TCP server)
  │   ├── StructuredStorageProvider (SQLite/Parquet/S3/GCS)
  │   └── UnstructuredStorageProvider (LevelDB/Gzip/S3/GCS)
  ├── BrowserManager Process 1..N
  │   ├── Selenium WebDriver → Firefox
  │   └── WebExtension (HTTP/Cookie/JS/Nav/DNS instrumentation)
  ├── MPLogger Process (log aggregation)
  └── Watchdog Threads (memory/process monitoring)

The TaskManager is the user-facing entry point. It spawns:

  • A StorageController in a child process, which runs an asyncio TCP server to receive data from both the WebExtension (JSON serialization) and the TaskManager itself (dill serialization via DataSocket).

  • One or more BrowserManager child processes, each driving a Firefox instance via Selenium WebDriver. Each browser loads the OpenWPM WebExtension for instrumentation.

  • An MPLogger thread for centralized log aggregation.

  • Watchdog threads for memory and process monitoring.

Communication Channels

Channel Direction Mechanism Pattern
TaskManager → BrowserManager command_queue (mp.Queue) Request (one command at a time)
BrowserManager → TaskManager status_queue (mp.Queue) Reply ("OK" / "FAILED" / "CRITICAL" / "NETERROR")
Extension → StorageController TCP socket (JSON) Fire-and-forget
TaskManager → StorageController DataSocket (TCP, dill) Fire-and-forget
StorageController → TaskManager status_queue (mp.Queue) Periodic push (every 5s)
StorageController → TaskManager completion_queue (mp.Queue) Polled by callback thread (every 1s/5s)
TaskManager → StorageController shutdown_queue (mp.Queue) One-shot signal
Extension startup → BrowserManager File-based (extension_port.txt) Polling every 0.1s for 5s

Key observation: No channel supports bidirectional communication. The status_queue between BrowserManager and TaskManager is the closest, but it’s strictly request-reply: TaskManager sends one command, then blocks waiting for exactly one status response.

Timeout Inventory

Location Timeout Purpose
BrowserManagerHandle._SPAWN_TIMEOUT 120s Browser spawn per attempt
CommandSequence.get() default 60s Page navigation
CommandSequence.screenshot() default 30s Screenshot capture
CommandSequence.dump_profile() default 120s Profile archiving
BrowserManager._start_extension 5s Find extension_port.txt
BrowserManager._start_extension 10s Find OPENWPM_STARTUP_SUCCESS.txt
BrowserManagerHandle.close_browser_manager 30s Webdriver shutdown response
BrowserManagerHandle.close_browser_manager 30s Browser process exit
StorageController.STATUS_TIMEOUT 120s Storage status heartbeat
StorageController.BATCH_COMMIT_TIMEOUT 30s Flush if idle
StorageController.STATUS_UPDATE_INTERVAL 5s Status push interval
StorageController.update_completion_queue 5s Completion poll interval
_mark_command_sequences_complete 1s Callback thread poll
BrowserManagerHandle.execute_command_sequence 2s Post-sequence drain sleep
FinalizeCommand.execute 5s Extension buffer drain
Watchdog 10s Check interval
Process watchdog buffer 300s Avoid killing fresh browsers

Coupling Analysis

BrowserManagerHandle ↔ TaskManager

BrowserManagerHandle.execute_command_sequence (in openwpm/browser_manager.py) takes a TaskManager reference and directly accesses these internal fields:

  • task_manager.sockDataSocket for recording crawl history

  • task_manager.failure_statusdict, set on critical errors

  • task_manager.failure_countint, incremented on failures

  • task_manager.threadlockthreading.Lock

  • task_manager.closingbool

  • task_manager.failure_limitint

  • task_manager.sock.finalize_visit_id() — visit lifecycle management

This creates a circular dependency: TaskManager imports BrowserManagerHandle, and BrowserManagerHandle uses TaskManager (via TYPE_CHECKING forward ref).

BaseCommand Coupling

Commands (in openwpm/commands/types.py) require concrete types: Firefox webdriver, BrowserParamsInternal, ManagerParamsInternal, ClientSocket. Commands cannot be executed without a real browser and extension socket.

TaskManager.init Sequencing

The constructor performs all initialization sequentially: validates config → starts MPLogger → launches StorageController → initializes browsers → launches browsers → starts watchdog → starts storage logger → saves config → starts callback thread. Any test touching TaskManager must pay all these costs.