openwpm.storage.in_memory_storage module

This module contains storage providers that keep their results in memory. These classes make parallel testing easier because tests share no resources. They also make results easier to verify, since no round trip through a persistent storage provider is needed.

class openwpm.storage.in_memory_storage.MemoryArrowProvider

Bases: ArrowProvider

async shutdown() -> None

Close all open resources. After this method has been called, no further calls should be made to the object.

async write_table(table_name: TableName, table: Table) -> None

Write out the table to persistent storage.

This should only return once the table has actually been saved out.

class openwpm.storage.in_memory_storage.MemoryProviderHandle(queue: Queue)

Bases: object

Call poll_queue to load all available data into the dict at self.storage.
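
The drain-into-a-dict pattern described above can be sketched as follows. This is only an illustration, not OpenWPM's implementation: `ToyHandle`, the use of the standard library `queue.Queue`, and the `(table, record)` item shape are all assumptions made for the example.

```python
import queue
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List


class ToyHandle:
    """Hypothetical stand-in for a handle that drains a queue into a dict."""

    def __init__(self, q: "queue.Queue") -> None:
        self.queue = q
        # table name -> list of records collected so far
        self.storage: DefaultDict[str, List[Dict[str, Any]]] = defaultdict(list)

    def poll_queue(self) -> None:
        # Pull items without blocking until the queue reports empty.
        while True:
            try:
                table, record = self.queue.get_nowait()
            except queue.Empty:
                break
            self.storage[table].append(record)


q = queue.Queue()
q.put(("site_visits", {"visit_id": 1, "site_url": "https://example.com"}))
q.put(("site_visits", {"visit_id": 2, "site_url": "https://example.org"}))

handle = ToyHandle(q)
handle.poll_queue()
```

After the call, everything that was waiting in the queue is available under `handle.storage`, so a test can assert on plain Python data.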

poll_queue(*args: Any, **kwargs: Any) -> None

class openwpm.storage.in_memory_storage.MemoryStructuredProvider

Bases: StructuredStorageProvider

This storage provider passes all its data to the MemoryStructuredProviderHandle in a process-safe way.

This makes it ideal for testing.

It also aims to only save out data as late as possible to ensure that storage_controller only relies on the guarantees given in the interface.

cache1: DefaultDict[VisitId, DefaultDict[TableName, List[Dict[str, Any]]]]

The cache for entries before they are finalized.

cache2: DefaultDict[TableName, List[Dict[str, Any]]]

The cache for all entries that have been finalized but not yet flushed out to the queue.
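
A minimal sketch of how two caches with these shapes could work together, assuming records move from the first to the second when a visit is finalized. The function names and the use of plain `int` and `str` in place of `VisitId` and `TableName` are hypothetical simplifications, not the provider's actual code.

```python
from collections import defaultdict
from typing import Any, DefaultDict, Dict, List

# Stage 1: records grouped by visit, then by table, while the visit is open.
cache1: DefaultDict[int, DefaultDict[str, List[Dict[str, Any]]]] = defaultdict(
    lambda: defaultdict(list)
)
# Stage 2: records of finalized visits, grouped by table, waiting for a flush.
cache2: DefaultDict[str, List[Dict[str, Any]]] = defaultdict(list)


def store_record(table: str, visit_id: int, record: Dict[str, Any]) -> None:
    cache1[visit_id][table].append(record)


def finalize_visit(visit_id: int) -> None:
    # Move everything belonging to this visit from stage 1 to stage 2.
    for table, records in cache1.pop(visit_id, {}).items():
        cache2[table].extend(records)


store_record("http_requests", 1, {"url": "https://example.com/a"})
store_record("http_requests", 2, {"url": "https://example.com/b"})
finalize_visit(1)
```

Only the records of visit 1 reach `cache2`; visit 2 stays in `cache1` until it is finalized as well.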

async finalize_visit_id(visit_id: VisitId, interrupted: bool = False) -> Task[None]

This method is invoked to inform the StructuredStorageProvider that no more records for this visit_id will be submitted.

This method returns once the data is ready to be written out. If the data is written out immediately, nothing is returned. Otherwise an awaitable is returned that resolves once the records have been saved out to persistent storage.
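
The "nothing or an awaitable" return contract can be handled by the caller as sketched below. `ToyProvider` and its `deferred` flag are hypothetical and exist only to exercise both branches; they are not part of OpenWPM.

```python
import asyncio
from typing import List, Optional, Tuple


class ToyProvider:
    """Hypothetical provider that writes either immediately or in a task."""

    def __init__(self, deferred: bool) -> None:
        self.deferred = deferred
        self.saved: List[int] = []

    async def _write(self, visit_id: int) -> None:
        await asyncio.sleep(0)  # stands in for slow I/O
        self.saved.append(visit_id)

    async def finalize_visit_id(
        self, visit_id: int
    ) -> Optional["asyncio.Task[None]"]:
        if not self.deferred:
            # Written out immediately: nothing to hand back.
            await self._write(visit_id)
            return None
        # Written out later: hand back something the caller can await.
        return asyncio.create_task(self._write(visit_id))


async def finalize_and_wait(provider: ToyProvider, visit_id: int) -> None:
    token = await provider.finalize_visit_id(visit_id)
    if token is not None:
        await token  # resolves once the records are saved


async def main() -> Tuple[List[int], List[int]]:
    eager, lazy = ToyProvider(deferred=False), ToyProvider(deferred=True)
    await finalize_and_wait(eager, 1)
    await finalize_and_wait(lazy, 2)
    return eager.saved, lazy.saved


eager_saved, lazy_saved = asyncio.run(main())
```

Checking for `None` before awaiting lets the same calling code work with providers that persist eagerly and with providers that persist in the background.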

async flush_cache() -> None

Write out any cached data to the respective storage, blocking until the write completes.

async init() -> None

Initializes the StorageProvider for use.

Guaranteed to be called in the process the StorageController runs in.

lock: Lock

async shutdown() -> None

Close all open resources. After this method has been called, no further calls should be made to the object.

async store_record(table: TableName, visit_id: VisitId, record: Dict[str, Any]) -> None

Submit a record to be stored. The storing might not happen immediately.

class openwpm.storage.in_memory_storage.MemoryUnstructuredProvider

Bases: UnstructuredStorageProvider

This storage provider stores all data in memory under self.storage as a dict from filename to content. Use this provider for writing tests and for small crawls where no persistence is required.

async flush_cache() -> None

Write out any cached data to the respective storage, blocking until the write completes.

async init() -> None

Initializes the StorageProvider for use.

Guaranteed to be called in the process the StorageController runs in.

async shutdown() -> None

Close all open resources. After this method has been called, no further calls should be made to the object.

async store_blob(filename: str, blob: bytes, compressed: bool = True, skip_if_exists: bool = True) -> None

Stores the given bytes under the provided filename.
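
A dict-backed blob store of the kind described for this class might look roughly like the sketch below. `ToyBlobStore` is hypothetical, and the meaning given to `compressed` (gzip the payload) and `skip_if_exists` (keep the first copy) is an assumption inferred from the parameter names, not confirmed behaviour of the provider.

```python
import gzip
from typing import Dict


class ToyBlobStore:
    """Hypothetical in-memory blob store keyed by filename."""

    def __init__(self) -> None:
        self.storage: Dict[str, bytes] = {}

    def store_blob(
        self,
        filename: str,
        blob: bytes,
        compressed: bool = True,
        skip_if_exists: bool = True,
    ) -> None:
        if skip_if_exists and filename in self.storage:
            return  # assumed semantics: keep the first copy
        # Assumed semantics: gzip the payload unless compression is disabled.
        self.storage[filename] = gzip.compress(blob) if compressed else blob


store = ToyBlobStore()
store.store_blob("page.html", b"<html>first</html>")
store.store_blob("page.html", b"<html>second</html>")  # skipped: name exists
store.store_blob("raw.bin", b"\x00\x01", compressed=False)
```

Because everything lives in an ordinary dict, a test can read a blob back directly by filename without touching the filesystem.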