openwpm.config module

class openwpm.config.BrowserParams(extension_enabled: bool = True, cookie_instrument: bool = True, js_instrument: bool = False, js_instrument_settings: List[str | dict] = <factory>, http_instrument: bool = False, navigation_instrument: bool = False, save_content: bool | str = False, callstack_instrument: bool = False, dns_instrument: bool = False, seed_tar: Path | None = None, display_mode: Literal['native', 'headless', 'xvfb'] = 'native', browser: str = 'firefox', prefs: dict = <factory>, tp_cookies: str = 'always', bot_mitigation: bool = False, profile_archive_dir: Path | None = None, tmp_profile_dir: Path = PosixPath('/tmp'), maximum_profile_size: int | None = None, recovery_tar: Path | None = None, donottrack: bool = False, tracking_protection: bool = False, custom_params: Dict[Any, Any] = <factory>)

Bases: DataClassJsonMixin

Configuration that might differ per browser.

OpenWPM can run multiple browsers with different configurations in parallel. This class customizes the behaviour of an individual browser.
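
As a rough illustration of per-browser configuration, the sketch below uses a hypothetical stand-in dataclass, `BrowserParamsSketch`, that mirrors a few of the fields and defaults from the signature above. It is not the OpenWPM class itself, and the `label` key is made up for the example.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Literal


@dataclass
class BrowserParamsSketch:
    """Hypothetical stand-in mirroring a few documented BrowserParams fields."""

    display_mode: Literal["native", "headless", "xvfb"] = "native"
    http_instrument: bool = False
    cookie_instrument: bool = True
    # Mutable defaults need a factory, matching the <factory> markers
    # shown in the class signature.
    custom_params: Dict[Any, Any] = field(default_factory=dict)


# One configuration object per browser; each can be changed independently.
browser_params = [BrowserParamsSketch() for _ in range(2)]
browser_params[0].display_mode = "headless"
browser_params[1].http_instrument = True
browser_params[1].custom_params["label"] = "with-http"  # hypothetical key
```

Because `custom_params` is built by a factory, each instance gets its own dictionary, so changing one browser's parameters does not leak into another's.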

bot_mitigation: bool = False
browser: str = 'firefox'
callstack_instrument: bool = False
cookie_instrument: bool = True
custom_params: Dict[Any, Any]
display_mode: Literal['native', 'headless', 'xvfb'] = 'native'
dns_instrument: bool = False
donottrack: bool = False
extension_enabled: bool = True
http_instrument: bool = False
js_instrument: bool = False
js_instrument_settings: List[str | dict]
maximum_profile_size: int | None = None

The total amount of on-disk space, in bytes, that the generated browser profiles and residual files are allowed to consume. If this option is not set, no checks are performed.

Rationale

This option is a compromise between killing the browser after each crawl and keeping the application fast.

It saves space in a limited environment with minimal detriment to speed.

If maximum_profile_size is exceeded after a CommandSequence has completed, the browser is shut down and a new one is created. Even with this setting, disk usage may temporarily exceed the sum of all maximum_profile_size values. However, checking only after completion ensures that a CommandSequence can finish without undue interruption.

Sample values

  • 20971520: 20MB - for testing purposes

  • 52428800: 50MB

  • 73400320: 70MB

  • 104857600: 100MB - IDEAL for 10+ browsers

  • 1073741824: 1GB
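
The sample values are multiples of 1024 * 1024 bytes. The following sketch shows how such a limit could be computed and checked once a command sequence has finished. `directory_size` and `needs_restart` are hypothetical helpers written for this example, not OpenWPM functions, and the way OpenWPM actually measures profile size is not shown on this page.

```python
import tempfile
from pathlib import Path
from typing import Optional

MIB = 1024 * 1024  # 100 * MIB == 104857600, the "100MB" sample value


def directory_size(path: Path) -> int:
    """Return the total size in bytes of all regular files below path."""
    return sum(p.stat().st_size for p in path.rglob("*") if p.is_file())


def needs_restart(profile_dir: Path, maximum_profile_size: Optional[int]) -> bool:
    """Hypothetical check, meant to run after a CommandSequence completes."""
    if maximum_profile_size is None:
        return False  # option not set: no checks are performed
    return directory_size(profile_dir) > maximum_profile_size


# Simulate a profile directory holding 2 MiB of data.
with tempfile.TemporaryDirectory() as tmp:
    profile = Path(tmp)
    (profile / "cache.bin").write_bytes(b"\0" * (2 * MIB))
    over_limit = needs_restart(profile, 1 * MIB)
    under_limit = needs_restart(profile, 100 * MIB)
    unchecked = needs_restart(profile, None)
```

Running the check only between command sequences is what lets usage overshoot the limit temporarily while a sequence is still in progress.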

navigation_instrument: bool = False
prefs: dict
profile_archive_dir: Path | None = None
recovery_tar: Path | None = None
save_content: bool | str = False
seed_tar: Path | None = None
tmp_profile_dir: Path = PosixPath('/tmp')

The tmp_profile_dir defaults to the OS’s temporary file folder (typically /tmp) and is where the generated browser profiles and residual files are stored.
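
In Python, the OS's temporary folder can be looked up with `tempfile.gettempdir()`. The snippet below is a small sketch of that lookup; whether OpenWPM computes its default exactly this way is an assumption, and `default_tmp_profile_dir` is a name chosen for the example.

```python
import tempfile
from pathlib import Path

# gettempdir() consults the TMPDIR, TEMP and TMP environment variables
# before falling back to platform defaults such as /tmp.
default_tmp_profile_dir = Path(tempfile.gettempdir())
```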

tp_cookies: str = 'always'
tracking_protection: bool = False
class openwpm.config.BrowserParamsInternal(extension_enabled: bool = True, cookie_instrument: bool = True, js_instrument: bool = False, js_instrument_settings: List[Union[str, dict]] = <factory>, http_instrument: bool = False, navigation_instrument: bool = False, save_content: Union[bool, str] = False, callstack_instrument: bool = False, dns_instrument: bool = False, seed_tar: Optional[pathlib.Path] = None, display_mode: Literal['native', 'headless', 'xvfb'] = 'native', browser: str = 'firefox', prefs: dict = <factory>, tp_cookies: str = 'always', bot_mitigation: bool = False, profile_archive_dir: Optional[pathlib.Path] = None, tmp_profile_dir: pathlib.Path = PosixPath('/tmp'), maximum_profile_size: Optional[int] = None, recovery_tar: Optional[pathlib.Path] = None, donottrack: bool = False, tracking_protection: bool = False, custom_params: Dict[Any, Any] = <factory>, browser_id: Optional[openwpm.types.BrowserId] = None, profile_path: Optional[pathlib.Path] = None, cleaned_js_instrument_settings: Optional[List[Dict[str, Any]]] = None)

Bases: BrowserParams

browser_id: BrowserId | None = None
cleaned_js_instrument_settings: List[Dict[str, Any]] | None = None
profile_path: Path | None = None
class openwpm.config.ConfigEncoder(*, skipkeys=False, ensure_ascii=True, check_circular=True, allow_nan=True, sort_keys=False, indent=None, separators=None, default=None)

Bases: JSONEncoder

default(obj)

Implement this method in a subclass such that it returns a serializable object for o, or calls the base implementation (to raise a TypeError).

For example, to support arbitrary iterators, you could implement default like this:

def default(self, o):
    try:
        iterable = iter(o)
    except TypeError:
        pass
    else:
        return list(iterable)
    # Let the base class default method raise the TypeError
    return JSONEncoder.default(self, o)
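
Since the configuration classes hold `Path` values, which the standard `json` module cannot serialize on its own, an encoder for them plausibly converts paths to strings. The sketch below shows that pattern with a hypothetical `PathAwareEncoder`; the real `ConfigEncoder.default` is not reproduced on this page and may behave differently.

```python
import json
from pathlib import PurePath, PurePosixPath


class PathAwareEncoder(json.JSONEncoder):
    """Hypothetical encoder that serializes path objects as strings."""

    def default(self, obj):
        if isinstance(obj, PurePath):
            return str(obj)
        # Let the base class raise TypeError for anything else.
        return super().default(obj)


encoded = json.dumps(
    {"log_path": PurePosixPath("/tmp/openwpm.log")},
    cls=PathAwareEncoder,
)
```

Passing the class through the `cls` argument keeps the call site unchanged for values that `json` already knows how to handle.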
class openwpm.config.ManagerParams(data_directory: Path = PosixPath('/home/docs/openwpm'), log_path: Path = PosixPath('/home/docs/openwpm/openwpm.log'), testing: bool = False, memory_watchdog: bool = False, process_watchdog: bool = False, num_browsers: int = 1, _failure_limit: int | None = None)

Bases: DataClassJsonMixin

Configuration for the TaskManager. The configuration is the same for all browsers running on the same TaskManager. It can be used to control storage locations or which watchdogs should run.

data_directory: Path = PosixPath('/home/docs/openwpm')

The directory into which screenshots and page dumps will be saved.

property failure_limit: int
log_path: Path = PosixPath('/home/docs/openwpm/openwpm.log')

The path to the file in which OpenWPM will log. The directory given will be created if it does not exist.

memory_watchdog: bool = False

A watchdog that tries to ensure that no Firefox instance takes up too much memory. It is mostly useful for long-running cloud crawls.

num_browsers: int = 1
process_watchdog: bool = False

Creates another thread that kills off GeckoDriver (or Xvfb) instances that haven’t been spawned by OpenWPM. (GeckoDriver is used by Selenium to control Firefox, and Xvfb is a “virtual display” that simulates having graphics when running on a server.)

testing: bool = False

A platform-wide flag that can be used to run certain functionality only while testing, for example in the JavaScript instrumentation.

class openwpm.config.ManagerParamsInternal(data_directory: pathlib.Path = PosixPath('/home/docs/openwpm'), log_path: pathlib.Path = PosixPath('/home/docs/openwpm/openwpm.log'), testing: bool = False, memory_watchdog: bool = False, process_watchdog: bool = False, num_browsers: int = 1, _failure_limit: int | None = None, storage_controller_address: Tuple[str, int] | None = None, logger_address: Tuple[str, ...] | None = None, screenshot_path: pathlib.Path | None = None, source_dump_path: pathlib.Path | None = None)

Bases: ManagerParams

logger_address: Tuple[str, ...] | None = None
screenshot_path: Path | None = None
source_dump_path: Path | None = None
storage_controller_address: Tuple[str, int] | None = None
openwpm.config.path_to_str(path: Path | None) → str | None
openwpm.config.str_to_path(string: str | None) → Path | None
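
Going only by the two signatures above, these helpers convert between `Path` and `str` while passing `None` through unchanged. A minimal sketch of that contract follows; the `_sketch` functions are hypothetical, and the real helpers may do more, such as normalizing the path.

```python
from pathlib import Path
from typing import Optional


def path_to_str_sketch(path: Optional[Path]) -> Optional[str]:
    """Convert a Path to a string, leaving None untouched."""
    return None if path is None else str(path)


def str_to_path_sketch(string: Optional[str]) -> Optional[Path]:
    """Convert a string to a Path, leaving None untouched."""
    return None if string is None else Path(string)


# A value survives the round trip through its string form.
original = Path("data") / "crawl.sqlite"
round_tripped = str_to_path_sketch(path_to_str_sketch(original))
```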
openwpm.config.validate_browser_params(browser_params: BrowserParams) → None
openwpm.config.validate_crawl_configs(manager_params: ManagerParams, browser_params: List[BrowserParams]) → None
openwpm.config.validate_manager_params(manager_params: ManagerParams) → None
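
This page does not describe what the validators check. As an illustration only, the sketch below assumes two plausible rules: `display_mode` must be one of the documented literals, and the number of browser configurations must equal `num_browsers`. Every class and function name in it is a hypothetical stand-in, not part of OpenWPM.

```python
from dataclasses import dataclass
from typing import List


class ConfigErrorSketch(Exception):
    """Hypothetical error type raised for an invalid configuration."""


@dataclass
class ManagerStub:
    num_browsers: int = 1


@dataclass
class BrowserStub:
    display_mode: str = "native"


ALLOWED_DISPLAY_MODES = ("native", "headless", "xvfb")


def validate_crawl_configs_sketch(
    manager_params: ManagerStub, browser_params: List[BrowserStub]
) -> None:
    """Raise ConfigErrorSketch if the assumed rules are violated."""
    for params in browser_params:
        if params.display_mode not in ALLOWED_DISPLAY_MODES:
            raise ConfigErrorSketch(f"unknown display_mode: {params.display_mode!r}")
    if len(browser_params) != manager_params.num_browsers:
        raise ConfigErrorSketch(
            f"expected {manager_params.num_browsers} browser configurations, "
            f"got {len(browser_params)}"
        )


# A matching pair of configurations passes silently.
validate_crawl_configs_sketch(ManagerStub(num_browsers=2), [BrowserStub(), BrowserStub("headless")])

# A mismatch between num_browsers and the list length is rejected.
try:
    validate_crawl_configs_sketch(ManagerStub(num_browsers=2), [BrowserStub()])
    mismatch_rejected = False
except ConfigErrorSketch:
    mismatch_rejected = True
```

Validating before any browser is launched surfaces a misconfiguration immediately rather than partway through a crawl.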