openwpm.config module¶
- class openwpm.config.BrowserParams(extension_enabled: bool = True, cookie_instrument: bool = True, js_instrument: bool = False, js_instrument_settings: ~typing.List[str | dict] = <factory>, http_instrument: bool = False, navigation_instrument: bool = False, save_content: bool | str = False, callstack_instrument: bool = False, dns_instrument: bool = False, seed_tar: ~pathlib.Path | None = None, display_mode: ~typing.Literal['native', 'headless', 'xvfb'] = 'native', browser: str = 'firefox', prefs: dict = <factory>, tp_cookies: str = 'always', bot_mitigation: bool = False, profile_archive_dir: ~pathlib.Path | None = None, tmp_profile_dir: ~pathlib.Path = PosixPath('/tmp'), maximum_profile_size: int | None = None, recovery_tar: ~pathlib.Path | None = None, donottrack: bool = False, tracking_protection: bool = False, custom_params: ~typing.Dict[~typing.Any, ~typing.Any] = <factory>)[source]¶
Bases: DataClassJsonMixin
Configuration that might differ per browser.
OpenWPM allows you to run multiple browsers with different configurations in parallel, and this class allows you to customize the behaviour of an individual browser.
- bot_mitigation: bool = False¶
- browser: str = 'firefox'¶
- callstack_instrument: bool = False¶
- cookie_instrument: bool = True¶
- custom_params: Dict[Any, Any]¶
- display_mode: Literal['native', 'headless', 'xvfb'] = 'native'¶
- dns_instrument: bool = False¶
- donottrack: bool = False¶
- extension_enabled: bool = True¶
- http_instrument: bool = False¶
- js_instrument: bool = False¶
- js_instrument_settings: List[str | dict]¶
- maximum_profile_size: int | None = None¶
The total amount of on-disk space, in bytes, that the generated browser profiles and residual files are allowed to consume. If this option is not set, no checks will be performed.
Rationale¶
This option can serve as a happy medium between killing a browser after each crawl and allowing the application to still perform quickly.
It is a way to save space in a limited environment with minimal detriment to speed.
If the maximum_profile_size is exceeded after a CommandSequence is completed, the browser will be shut down and a new one will be created. Because the check only happens after a CommandSequence completes, you may temporarily have more disk usage than the sum of all maximum_profile_size values. In exchange, a CommandSequence is always allowed to complete without undue interruptions.
Sample values¶
1073741824: 1GB
20971520: 20MB - for testing purposes
52428800: 50MB
73400320: 70MB
104857600: 100MB - IDEAL for 10+ browsers
- prefs: dict¶
- profile_archive_dir: Path | None = None¶
- recovery_tar: Path | None = None¶
- save_content: bool | str = False¶
- seed_tar: Path | None = None¶
- tmp_profile_dir: Path = PosixPath('/tmp')¶
The tmp_profile_dir defaults to the OS’s temporary file folder (typically /tmp) and is where the generated browser profiles and residual files are stored.
- tp_cookies: str = 'always'¶
- tracking_protection: bool = False¶
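The sample values listed for maximum_profile_size are binary multiples (1024-based), and the documented behaviour is a size check after each CommandSequence. The following is a minimal sketch of that arithmetic and of such a check, using only the standard library. The names MIB, GIB, SAMPLE_LIMITS, directory_size and exceeds_limit are hypothetical helpers for illustration and are not part of OpenWPM; how OpenWPM itself measures the profile is not shown here.

```python
import tempfile
from pathlib import Path
from typing import Optional

MIB = 1024 ** 2  # bytes in one binary megabyte
GIB = 1024 ** 3  # bytes in one binary gigabyte

# The sample values from the list above, rebuilt from the unit constants.
SAMPLE_LIMITS = {
    "20MB": 20 * MIB,
    "50MB": 50 * MIB,
    "70MB": 70 * MIB,
    "100MB": 100 * MIB,
    "1GB": 1 * GIB,
}


def directory_size(root: Path) -> int:
    """Sum the sizes of all regular files below root, in bytes."""
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


def exceeds_limit(root: Path, limit: Optional[int]) -> bool:
    """Hypothetical check: a limit of None disables it, like the documented default."""
    if limit is None:
        return False
    return directory_size(root) > limit


# Small demonstration on a throwaway directory standing in for a profile.
with tempfile.TemporaryDirectory() as tmp:
    profile = Path(tmp)
    (profile / "cache.bin").write_bytes(b"\0" * 2048)
    print(exceeds_limit(profile, 1024))  # 2048 bytes against a 1024-byte limit
    print(exceeds_limit(profile, None))  # no limit configured
```

In a real crawl the chosen value would be passed as maximum_profile_size on the BrowserParams instance for each browser.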
- class openwpm.config.BrowserParamsInternal(extension_enabled: bool = True, cookie_instrument: bool = True, js_instrument: bool = False, js_instrument_settings: List[Union[str, dict]] = <factory>, http_instrument: bool = False, navigation_instrument: bool = False, save_content: Union[bool, str] = False, callstack_instrument: bool = False, dns_instrument: bool = False, seed_tar: Optional[pathlib.Path] = None, display_mode: Literal['native', 'headless', 'xvfb'] = 'native', browser: str = 'firefox', prefs: dict = <factory>, tp_cookies: str = 'always', bot_mitigation: bool = False, profile_archive_dir: Optional[pathlib.Path] = None, tmp_profile_dir: pathlib.Path = PosixPath('/tmp'), maximum_profile_size: Optional[int] = None, recovery_tar: Optional[pathlib.Path] = None, donottrack: bool = False, tracking_protection: bool = False, custom_params: Dict[Any, Any] = <factory>, browser_id: Optional[openwpm.types.BrowserId] = None, profile_path: Optional[pathlib.Path] = None, cleaned_js_instrument_settings: Optional[List[Dict[str, Any]]] = None)[source]¶
Bases: BrowserParams
- browser_id: BrowserId | None = None¶
- cleaned_js_instrument_settings: List[Dict[str, Any]] | None = None¶
- profile_path: Path | None = None¶
- class openwpm.config.ConfigEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)[source]¶
Bases: JSONEncoder
- default(obj)[source]¶
Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).
For example, to support arbitrary iterators, you could implement default like this:
    def default(self, o):
        try:
            iterable = iter(o)
        except TypeError:
            pass
        else:
            return list(iterable)
        # Let the base class default method raise the TypeError
        return JSONEncoder.default(self, o)
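Since the parameter classes in this module carry Path fields, one plausible use of a custom default is turning paths into strings. The sketch below illustrates that pattern with a hypothetical PathAwareEncoder; it is an assumption for illustration and not necessarily how ConfigEncoder is implemented.

```python
import json
from pathlib import PurePath, PurePosixPath


class PathAwareEncoder(json.JSONEncoder):
    """Hypothetical encoder that serializes path objects as strings."""

    def default(self, o):
        if isinstance(o, PurePath):
            return str(o)
        # Let the base class raise TypeError for anything else.
        return super().default(o)


# PurePosixPath keeps the example independent of the host operating system.
config = {
    "log_path": PurePosixPath("/tmp/openwpm/openwpm.log"),
    "num_browsers": 1,
}
encoded = json.dumps(config, cls=PathAwareEncoder, sort_keys=True)
print(encoded)
```

Objects the encoder does not recognise still raise TypeError through the base class, which is the behaviour the docstring above describes.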
- class openwpm.config.ManagerParams(data_directory: Path = PosixPath('/home/docs/openwpm'), log_path: Path = PosixPath('/home/docs/openwpm/openwpm.log'), testing: bool = False, memory_watchdog: bool = False, process_watchdog: bool = False, num_browsers: int = 1, _failure_limit: int | None = None)[source]¶
Bases: DataClassJsonMixin
Configuration for the TaskManager. The configuration will be the same for all browsers running on the same TaskManager. It can be used to control storage locations or which watchdogs should run.
- data_directory: Path = PosixPath('/home/docs/openwpm')¶
The directory into which screenshots and page dumps will be saved.
- property failure_limit: int¶
- log_path: Path = PosixPath('/home/docs/openwpm/openwpm.log')¶
The path to the file in which OpenWPM will log. The directory given will be created if it does not exist.
- memory_watchdog: bool = False¶
A watchdog that tries to ensure that no Firefox instance takes up too much memory. It is mostly useful for long-running cloud crawls.
- num_browsers: int = 1¶
- process_watchdog: bool = False¶
Creates another thread that kills off GeckoDriver (or Xvfb) instances that haven’t been spawned by OpenWPM. (GeckoDriver is used by Selenium to control Firefox, and Xvfb is a “virtual display” that simulates having graphics when running on a server.)
- testing: bool = False¶
A platform-wide flag that can be used to only run certain functionality while testing, for example the JavaScript instrumentation.
- class openwpm.config.ManagerParamsInternal(data_directory: pathlib.Path = PosixPath('/home/docs/openwpm'), log_path: pathlib.Path = PosixPath('/home/docs/openwpm/openwpm.log'), testing: bool = False, memory_watchdog: bool = False, process_watchdog: bool = False, num_browsers: int = 1, _failure_limit: int | None = None, storage_controller_address: Tuple[str, int] | None = None, logger_address: Tuple[str, ...] | None = None, screenshot_path: pathlib.Path | None = None, source_dump_path: pathlib.Path | None = None)[source]¶
Bases: ManagerParams
- logger_address: Tuple[str, ...] | None = None¶
- screenshot_path: Path | None = None¶
- source_dump_path: Path | None = None¶
- storage_controller_address: Tuple[str, int] | None = None¶
- openwpm.config.validate_browser_params(browser_params: BrowserParams) → None [source]¶
- openwpm.config.validate_crawl_configs(manager_params: ManagerParams, browser_params: List[BrowserParams]) → None [source]¶
- openwpm.config.validate_manager_params(manager_params: ManagerParams) → None [source]¶
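The validation functions above are listed without descriptions, so the following is only a sketch of the kind of check such functions could perform, written against stand-in dataclasses so that it runs without OpenWPM installed. The allowed display_mode values come from the BrowserParams signature; everything else (the Sketch* classes, the sketch_* functions, the use of ValueError, and the rule that there is one BrowserParams per browser) is an assumption for illustration, not OpenWPM's actual behaviour.

```python
from dataclasses import dataclass
from typing import List

# Taken from the documented Literal type of BrowserParams.display_mode.
ALLOWED_DISPLAY_MODES = ("native", "headless", "xvfb")


@dataclass
class SketchBrowserParams:
    """Stand-in for BrowserParams with a single documented field."""
    display_mode: str = "native"


@dataclass
class SketchManagerParams:
    """Stand-in for ManagerParams with a single documented field."""
    num_browsers: int = 1


def sketch_validate_browser_params(browser_params: SketchBrowserParams) -> None:
    """Raise ValueError (an assumed error type) for an unknown display mode."""
    if browser_params.display_mode not in ALLOWED_DISPLAY_MODES:
        raise ValueError(f"unknown display_mode: {browser_params.display_mode!r}")


def sketch_validate_crawl_configs(
    manager_params: SketchManagerParams,
    browser_params: List[SketchBrowserParams],
) -> None:
    """Assumed rule: exactly one parameter object per configured browser."""
    if len(browser_params) != manager_params.num_browsers:
        raise ValueError("expected one BrowserParams per browser")
    for params in browser_params:
        sketch_validate_browser_params(params)


manager = SketchManagerParams(num_browsers=2)
browsers = [
    SketchBrowserParams(display_mode="headless"),
    SketchBrowserParams(display_mode="xvfb"),
]
sketch_validate_crawl_configs(manager, browsers)  # passes silently, returns None
```

Returning None on success and raising on failure matches the documented None return type of the real functions.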