API¶
Core¶
- class aioscraper.core.scraper.AIOScraper(*scrapers, config=None, lifespan=None, sessionmaker_factory=None)[source]¶
Bases:
object
Core entrypoint that wires scrapers, middlewares, and pipelines.
- Parameters:
*scrapers (Scraper) – Callable scrapers queued on startup.
config (Config | None) – Pre-built configuration; when None, the scraper loads one lazily via load_config() on start.
lifespan (Lifespan | None) – Optional async context manager factory that wraps the scraper’s lifecycle (setup/teardown).
sessionmaker_factory (SessionMakerFactory | None) – Override the function that builds HTTP sessions (defaults to aioscraper.core.session.factory.get_sessionmaker()).
- __call__(scraper)[source]¶
Add a scraper callable and return it for decorator use.
- Parameters:
scraper (Callable[[...], Awaitable[Any]])
- Return type:
Callable[[...], Awaitable[Any]]
- add_dependencies(**kwargs)[source]¶
Add shared dependencies to inject into scraper callbacks.
- Parameters:
kwargs (Any)
- lifespan(lifespan)[source]¶
Attach a lifespan callback to run before/after scraping.
- Parameters:
lifespan (Callable[[AIOScraper], AsyncGenerator[None, None]])
- Return type:
Callable[[AIOScraper], AsyncGenerator[None, None]]
- property middleware: MiddlewareHolder¶
Access the middleware registry for request/response hooks.
- property pipeline: PipelineHolder¶
Access the pipeline registry and middleware helpers.
- async aioscraper.core.runner.run_scraper(scraper)[source]¶
Public entrypoint that runs a scraper with signal handling.
- Parameters:
scraper (AIOScraper)
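The decorator form of `__call__` described above ("add a scraper callable and return it") can be illustrated without the library. The sketch below is not aioscraper code: `MiniScraper` and `run_all` are hypothetical stand-ins that mimic only the registration pattern, and running callables in registration order is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable

ScraperFn = Callable[..., Awaitable[Any]]


class MiniScraper:
    """Hypothetical stand-in that mimics only the decorator-registration pattern."""

    def __init__(self, *scrapers: ScraperFn) -> None:
        # Callables passed positionally are queued up front.
        self.scrapers: list[ScraperFn] = list(scrapers)

    def __call__(self, scraper: ScraperFn) -> ScraperFn:
        # Register the callable and hand it back unchanged, so the
        # decorated name still refers to the original function.
        self.scrapers.append(scraper)
        return scraper


app = MiniScraper()


@app
async def scrape_index() -> str:
    return "index"


async def run_all(scraper: MiniScraper) -> list[Any]:
    # Assumption: registered callables run in registration order.
    return [await fn() for fn in scraper.scrapers]
```

With the real classes, the equivalent flow would presumably be to decorate a coroutine with an `AIOScraper` instance and pass that instance to `run_scraper()`.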
Configuration¶
- class aioscraper.config.models.Config(session=SessionConfig(timeout=60.0, ssl=True, proxy=None, http_backend=None, retry=RequestRetryConfig(enabled=False, attempts=3, backoff=<BackoffStrategy.EXPONENTIAL_JITTER: 'exponential_jitter'>, base_delay=0.5, max_delay=30.0, statuses=(500, 502, 503, 504, 522, 524, 408, 429), exceptions=(<class 'TimeoutError'>, )), rate_limit=RateLimitConfig(enabled=False, group_by=None, default_interval=0.0, cleanup_timeout=60.0, adaptive=None)), scheduler=SchedulerConfig(concurrent_requests=64, pending_requests=1, close_timeout=0.1, ready_queue_max_size=0), execution=ExecutionConfig(timeout=None, shutdown_timeout=0.1, shutdown_check_interval=0.1, log_level=40), pipeline=PipelineConfig(strict=True))[source]¶
Bases:
object
Main configuration class that combines all configuration components.
- Parameters:
session (SessionConfig)
scheduler (SchedulerConfig)
execution (ExecutionConfig)
pipeline (PipelineConfig)
- class aioscraper.config.models.SessionConfig(timeout=60.0, ssl=True, proxy=None, http_backend=None, retry=RequestRetryConfig(enabled=False, attempts=3, backoff=<BackoffStrategy.EXPONENTIAL_JITTER: 'exponential_jitter'>, base_delay=0.5, max_delay=30.0, statuses=(500, 502, 503, 504, 522, 524, 408, 429), exceptions=(<class 'TimeoutError'>, )), rate_limit=RateLimitConfig(enabled=False, group_by=None, default_interval=0.0, cleanup_timeout=60.0, adaptive=None))[source]¶
Bases:
object
HTTP session settings shared by every request.
- Parameters:
timeout (float) – Request timeout in seconds
ssl (ssl.SSLContext | bool) – SSL handling; bool toggles verification, SSLContext can carry custom CAs
proxy (str | dict[str, str | None] | None) – Default proxy passed to the HTTP client
http_backend (HttpBackend | None) – Force aiohttp/httpx; None lets the factory auto-detect
retry (RequestRetryConfig) – Controls built-in retry middleware behaviour
rate_limit (RateLimitConfig) – Controls built-in rate limiting behaviour
- class aioscraper.config.models.RequestRetryConfig(enabled=False, attempts=3, backoff=BackoffStrategy.EXPONENTIAL_JITTER, base_delay=0.5, max_delay=30.0, statuses=(500, 502, 503, 504, 522, 524, 408, 429), exceptions=(<class 'TimeoutError'>, ))[source]¶
Bases:
object
Retry behaviour applied by the built-in retry middleware.
- Parameters:
enabled (bool) – Toggle retries on or off.
attempts (int) – Maximum number of retry attempts per request.
backoff (BackoffStrategy) – Backoff strategy for retries.
base_delay (float) – Base delay between retries in seconds.
max_delay (float) – Maximum delay between retries in seconds.
statuses (tuple[int, ...]) – HTTP status codes that should trigger a retry.
exceptions (tuple[type[BaseException], ...]) – Exception types that should trigger a retry.
- class aioscraper.config.models.SchedulerConfig(concurrent_requests=64, pending_requests=1, close_timeout=0.1, ready_queue_max_size=0)[source]¶
Bases:
object
Configuration for request scheduler.
- Parameters:
concurrent_requests (int) – Maximum number of concurrent requests
pending_requests (int) – Number of pending requests to maintain
close_timeout (float | None) – Timeout for closing scheduler in seconds
ready_queue_max_size (int) – Maximum size of the ready queue (0 for unlimited)
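A cap like `concurrent_requests` is commonly enforced with a semaphore. How aioscraper's scheduler does it is not stated here, so the following is only a general sketch of the technique; `run_limited` and `fake_request` are hypothetical names.

```python
import asyncio


async def run_limited(n_tasks: int, concurrent_requests: int) -> int:
    """Run n_tasks fake requests and return the peak number in flight."""
    semaphore = asyncio.Semaphore(concurrent_requests)
    in_flight = 0
    peak = 0

    async def fake_request() -> None:
        nonlocal in_flight, peak
        async with semaphore:  # at most `concurrent_requests` holders at once
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0)  # yield control, standing in for network I/O
            in_flight -= 1

    await asyncio.gather(*(fake_request() for _ in range(n_tasks)))
    return peak
```

Calling `asyncio.run(run_limited(20, 4))` never lets more than four fake requests overlap.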
- class aioscraper.config.models.ExecutionConfig(timeout=None, shutdown_timeout=0.1, shutdown_check_interval=0.1, log_level=40)[source]¶
Bases:
object
Configuration for execution.
- Parameters:
timeout (float | None) – Overall execution timeout in seconds
shutdown_timeout (float) – Timeout for graceful shutdown in seconds
log_level (int) – Log level for timeout events (e.g., logging.ERROR, logging.WARNING). Defaults to logging.ERROR.
shutdown_check_interval (float) – Interval between shutdown checks in seconds
- class aioscraper.config.models.PipelineConfig(strict=True)[source]¶
Bases:
object
Configuration for pipelines.
- Parameters:
strict (bool) – Raise an exception if a pipeline for an item is missing
- class aioscraper.config.models.HttpBackend(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
StrEnum
- class aioscraper.config.models.BackoffStrategy(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]¶
Bases:
StrEnum
Backoff strategy for retries.
- CONSTANT¶
Constant backoff
- LINEAR¶
Linear backoff
- EXPONENTIAL¶
Exponential backoff
- EXPONENTIAL_JITTER¶
Exponential backoff with jitter
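The four strategies can be compared with a small delay calculator. The formulas below are textbook definitions, not taken from the aioscraper source: `Backoff`, `retry_delay`, the member values other than `'exponential_jitter'`, and the "full jitter" variant are all assumptions. The defaults mirror `base_delay=0.5` and `max_delay=30.0`.

```python
import random
from enum import Enum
from typing import Callable


class Backoff(str, Enum):
    # Hypothetical mirror of the documented strategy names.
    CONSTANT = "constant"
    LINEAR = "linear"
    EXPONENTIAL = "exponential"
    EXPONENTIAL_JITTER = "exponential_jitter"


def retry_delay(
    strategy: Backoff,
    attempt: int,
    base_delay: float = 0.5,
    max_delay: float = 30.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Delay in seconds before retry number `attempt` (1-based), capped at max_delay."""
    if attempt < 1:
        raise ValueError("attempt must be >= 1")
    if strategy is Backoff.CONSTANT:
        delay = base_delay
    elif strategy is Backoff.LINEAR:
        delay = base_delay * attempt
    else:
        # Both exponential variants double the delay on every attempt.
        delay = base_delay * 2 ** (attempt - 1)
    delay = min(delay, max_delay)
    if strategy is Backoff.EXPONENTIAL_JITTER:
        # Assumed "full jitter": scale the capped delay by a random factor in [0, 1).
        delay *= rng()
    return delay
```

Jitter spreads retries from many clients over time, which is the usual reason to prefer it as a default.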
- class aioscraper.config.models.RateLimitConfig(enabled=False, group_by=None, default_interval=0.0, cleanup_timeout=60.0, adaptive=None)[source]¶
Bases:
object
Configuration for rate limiting.
- Parameters:
enabled (bool) – Toggle rate limiting on or off.
group_by (Callable[[Request], tuple[Hashable, float]] | None) – Function to group requests by.
default_interval (float) – Default interval for group.
cleanup_timeout (float) – Timeout in seconds before cleaning up an idle request group.
adaptive (AdaptiveRateLimitConfig | None) – Adaptive rate limiting configuration (EWMA + AIMD).
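A `group_by` callable has the shape `Callable[[Request], tuple[Hashable, float]]`. The sketch below assumes the second tuple element is the interval in seconds for that group, which the parameter description does not spell out. `FakeRequest`, `HOST_INTERVALS`, and `group_by_host` are hypothetical; `FakeRequest` exposes only the `url` field of the documented `Request`.

```python
from dataclasses import dataclass
from typing import Hashable
from urllib.parse import urlsplit


@dataclass
class FakeRequest:
    # Stand-in exposing only the `url` field of the documented Request.
    url: str


# Hypothetical per-host intervals in seconds.
HOST_INTERVALS = {"api.example.com": 1.0}
DEFAULT_INTERVAL = 0.25


def group_by_host(request: FakeRequest) -> tuple[Hashable, float]:
    """Return (group key, assumed interval in seconds) for a request."""
    # `hostname` is lower-cased and excludes the port.
    host = urlsplit(request.url).hostname or ""
    return host, HOST_INTERVALS.get(host, DEFAULT_INTERVAL)
```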
- class aioscraper.config.models.AdaptiveRateLimitConfig(min_interval=0.001, max_interval=5.0, increase_factor=2.0, decrease_step=0.01, success_threshold=5, ewma_alpha=0.3, respect_retry_after=True, inherit_retry_triggers=True, custom_trigger_statuses=(), custom_trigger_exceptions=())[source]¶
Bases:
object
Configuration for adaptive rate limiting using EWMA + AIMD.
Adaptively adjusts request intervals based on server response patterns. Uses EWMA (Exponentially Weighted Moving Average) for latency tracking and AIMD (Additive Increase Multiplicative Decrease) for interval adjustment.
- Parameters:
min_interval (float) – Minimum allowed interval between requests (seconds).
max_interval (float) – Maximum allowed interval between requests (seconds).
increase_factor (float) – Multiplicative factor for interval increase on failure (must be > 1.0).
decrease_step (float) – Additive step for interval decrease on success (seconds).
success_threshold (int) – Number of consecutive successes before decreasing interval.
ewma_alpha (float) – EWMA smoothing factor for latency (0 < alpha <= 1, higher = more weight to recent).
respect_retry_after (bool) – Whether to use Retry-After header as interval override.
inherit_retry_triggers (bool) – Whether to use RequestRetryConfig statuses/exceptions as triggers.
custom_trigger_statuses (tuple[int, ...]) – Additional HTTP statuses to trigger adaptive slowdown.
custom_trigger_exceptions (tuple[type[BaseException], ...]) – Additional exception types to trigger adaptive slowdown.
- aioscraper.config.loader.load_config()[source]¶
Load configuration from environment variables.
Reads configuration from environment variables prefixed with SESSION, SCHEDULER, EXECUTION, and PIPELINE. When parameters are None, values are read from corresponding environment variables. Defaults are used when env vars are not set.
- Returns:
Complete configuration object with all settings resolved.
- Return type:
Config
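The env-with-defaults pattern can be sketched for two session settings. The variable names `SESSION_TIMEOUT` and `SESSION_SSL` are guesses based only on the documented `SESSION` prefix, and `MiniSessionConfig`, `_parse_bool`, and `load_session_config` are hypothetical; the loader's actual variable names and parsing rules are not given here.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class MiniSessionConfig:
    # Defaults mirror the documented SessionConfig defaults.
    timeout: float = 60.0
    ssl: bool = True


def _parse_bool(raw: str) -> bool:
    # Assumed truthy spellings; the real loader may accept others.
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def load_session_config(env: Mapping[str, str] = os.environ) -> MiniSessionConfig:
    """Build a config from env vars, falling back to defaults when unset."""
    defaults = MiniSessionConfig()
    timeout = env.get("SESSION_TIMEOUT")  # hypothetical variable name
    ssl = env.get("SESSION_SSL")  # hypothetical variable name
    return MiniSessionConfig(
        timeout=float(timeout) if timeout is not None else defaults.timeout,
        ssl=_parse_bool(ssl) if ssl is not None else defaults.ssl,
    )
```

Passing the environment as a mapping keeps the function easy to test without mutating `os.environ`.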
Session¶
- class aioscraper.core.session.base.BaseRequestContextManager(request)[source]¶
Bases:
ABC
Asynchronous context manager that encapsulates request execution lifecycle.
- Parameters:
request (Request)
- class aioscraper.core.session.base.BaseSession[source]¶
Bases:
ABC
Base abstract class for HTTP session.
- abstract async close()[source]¶
Close the session and release all resources.
This method should be called after finishing work with the session to properly release resources.
- aioscraper.core.session.factory.get_sessionmaker(config)[source]¶
Return a factory that builds a session using the chosen or available HTTP backend.
- Parameters:
config (SessionConfig)
- Return type:
Callable[[], BaseSession]
- class aioscraper.types.session.Request(*, url, method=<HTTPMethod.GET>, params=None, data=None, json_data=None, files=None, cookies=None, headers=None, auth=None, proxy=None, proxy_auth=None, proxy_headers=None, timeout=None, allow_redirects=True, max_redirects=10, delay=None, priority=0, callback=None, cb_kwargs=<factory>, errback=None, state=<factory>)[source]¶
Bases:
object
Represents an HTTP request with all its parameters.
- Parameters:
url (str) – Target URL
method (str) – HTTP method
params (QueryParams | None) – URL query parameters
data (Any) – Request body data
files (RequestFiles | None) – Multipart files mapping
json_data (Any) – JSON data to be sent in the request body
cookies (RequestCookies | None) – Request cookies
headers (RequestHeaders | None) – Request headers
auth (BasicAuth | None) – Basic authentication credentials
proxy (str | None) – Proxy URL (per-request proxies are honored only by the aiohttp backend)
proxy_auth (BasicAuth | None) – Proxy authentication credentials
proxy_headers (RequestHeaders | None) – Proxy headers
timeout (float | None) – Request timeout in seconds
allow_redirects (bool) – Whether to follow HTTP redirects
max_redirects (int) – Maximum number of redirects to follow
delay (float | None) – Delay before sending the request
priority (int) – Priority of the request
callback (Callable[..., Awaitable] | None) – Async callback function to be called after successful request
cb_kwargs (dict[str, Any]) – Keyword arguments for the callback function
errback (Callable[..., Awaitable] | None) – Async error callback function
state (dict[str, Any]) – State for middlewares
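The `priority` field implies an ordered ready queue. A common way to build one is a binary heap with a tie-breaking counter, sketched below. Whether aioscraper serves lower or higher numbers first is not stated here; this sketch assumes lower values go first and that equal priorities keep insertion order. `QueuedRequest` and `ReadyQueue` are hypothetical names.

```python
import heapq
import itertools
from dataclasses import dataclass, field


@dataclass(order=True)
class QueuedRequest:
    priority: int
    seq: int  # insertion counter, breaks ties between equal priorities
    url: str = field(compare=False)


class ReadyQueue:
    """Hypothetical priority queue; assumes lower priority values are served first."""

    def __init__(self) -> None:
        self._heap: list[QueuedRequest] = []
        self._counter = itertools.count()

    def push(self, url: str, priority: int = 0) -> None:
        heapq.heappush(self._heap, QueuedRequest(priority, next(self._counter), url))

    def pop(self) -> str:
        return heapq.heappop(self._heap).url

    def __len__(self) -> int:
        return len(self._heap)
```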
- class aioscraper.types.session.Response(url, method, status, headers, cookies, read)[source]¶
Bases:
object
Represents an HTTP response with all its components.
- Parameters:
url (str)
method (str)
status (int)
headers (Mapping[str, str])
cookies (SimpleCookie)
read (Callable[[], Awaitable[bytes]])
- property cookies: SimpleCookie¶
Parsed response cookies.
- get_encoding()[source]¶
Resolve response encoding from the Content-Type header.
Parses the Content-Type header for a charset parameter. Returns “utf-8” as a safe default if no charset is found or if the charset is invalid.
- Returns:
Detected charset or "utf-8" as a safe default.
- Return type:
str
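The behaviour described for `get_encoding()` can be approximated with the standard library. This is a sketch of the idea, not the library's implementation: `resolve_encoding` is a hypothetical helper, and treating "invalid" as "unknown to Python's codec registry" is an assumption.

```python
import codecs
from email.message import Message
from typing import Optional


def resolve_encoding(content_type: Optional[str], default: str = "utf-8") -> str:
    """Return the charset named in a Content-Type value, or `default`."""
    if not content_type:
        return default
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lower-cases the value and returns None when absent.
    charset = msg.get_content_charset()
    if not charset:
        return default
    try:
        codecs.lookup(charset)  # reject names Python cannot decode with
    except LookupError:
        return default
    return charset
```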
- property headers: Mapping[str, str]¶
Response headers.
- async json(*, encoding=None, loads=<function loads>)[source]¶
Read and decode the JSON response.
- Parameters:
encoding (str | None)
loads (Callable[[str], Any])
- Return type:
Any
- property method: str¶
HTTP method used.
- property ok: bool¶
Returns True if status is less than 400, False if not.
- property status: int¶
HTTP status code.
- async text(encoding='utf-8', errors='strict')[source]¶
Read response payload and decode.
- Parameters:
encoding (str | None)
errors (str)
- Return type:
str
- property url: str¶
Final URL of the response.
Pipeline¶
- class aioscraper.core.pipeline.PipelineDispatcher(config, pipelines, global_middleware_factories=None, dependencies=None)[source]¶
Bases:
object
Routes items through the registered pipeline chain.
- Parameters:
config (PipelineConfig)
pipelines (Mapping[Any, PipelineContainer])
global_middleware_factories (list[Callable[[...], GlobalPipelineMiddleware[Any]]] | None)
dependencies (Mapping[str, Any] | None)
- class aioscraper.types.pipeline.Pipeline(*args, **kwargs)[source]¶
Bases:
Protocol[PipelineItemType]
Protocol for callables that accept an item and return the processed item.
- class aioscraper.types.pipeline.BasePipeline(*args, **kwargs)[source]¶
Bases:
Protocol[PipelineItemType]
Interface for classes that process scraped items of a specific type.
- class aioscraper.types.pipeline.PipelineMiddleware(*args, **kwargs)[source]¶
Bases:
Protocol[PipelineItemType]
Async hook used before or after pipeline execution; must return the item.
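The "accept an item and return the processed item" contract can be sketched with a generic protocol. The exact method names and signatures of the aioscraper protocols are not listed here, so `ItemProcessor`, `Article`, and `run_chain` are hypothetical, and the async call signature is an assumption.

```python
import asyncio
from dataclasses import dataclass
from typing import Protocol, TypeVar

ItemT = TypeVar("ItemT")


class ItemProcessor(Protocol[ItemT]):
    # Hypothetical protocol: an async callable that takes an item and returns it.
    async def __call__(self, item: ItemT) -> ItemT: ...


@dataclass
class Article:
    title: str


async def strip_title(item: Article) -> Article:
    item.title = item.title.strip()
    return item


async def upper_title(item: Article) -> Article:
    item.title = item.title.upper()
    return item


async def run_chain(item: ItemT, *steps: ItemProcessor[ItemT]) -> ItemT:
    # Each step receives the item returned by the previous one.
    for step in steps:
        item = await step(item)
    return item
```

Returning the item from every step is what lets the steps be chained without knowing about each other.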
Middlewares¶
- class aioscraper.middlewares.retry.RetryMiddleware(config, send_request)[source]¶
Bases:
object
Request middleware that retries failed requests based on configuration.
- Parameters:
config (RequestRetryConfig)
Execution¶
- class aioscraper.core.executor.ScraperExecutor(config, scrapers, dependencies, middleware_holder, pipeline_dispatcher, sessionmaker)[source]¶
Bases:
object
Executes scrapers and manages the scraping process.
This class is responsible for running scraper functions, managing the request scheduler, and handling the graceful shutdown of the scraping process.
- Parameters:
config (Config)
scrapers (list[Callable[[...], Awaitable[Any]]])
dependencies (dict[str, Any])
middleware_holder (MiddlewareHolder)
pipeline_dispatcher (PipelineDispatcher)
sessionmaker (Callable[[], BaseSession])
- class aioscraper.core.request_manager.RequestManager(scheduler_config, rate_limit_config, retry_config, shutdown_check_interval, sessionmaker, dependencies, middleware_holder)[source]¶
Bases:
object
Manages HTTP requests with priority queuing, rate limiting, and middleware support.
- Parameters:
scheduler_config (SchedulerConfig) – Configuration for the request scheduler.
rate_limit_config (RateLimitConfig) – Configuration for the request rate limiter.
retry_config (RequestRetryConfig) – Configuration for request retries.
shutdown_check_interval (float) – Interval between shutdown checks in seconds
sessionmaker (SessionMaker) – A factory for creating session objects.
dependencies (dict[str, Any]) – Additional dependencies to be injected into middleware and callbacks.
middleware_holder (MiddlewareHolder) – A container for middleware collections.
- class aioscraper.core.rate_limiter.RateLimitManager(config, retry_config, schedule)[source]¶
Bases:
object
Manages rate limiting for requests using group-based throttling.
Requests are grouped by a configurable key (default: hostname) and processed with a specified interval between requests in each group. Groups are created dynamically and cleaned up automatically after inactivity.
- Parameters:
config (RateLimitConfig) – Rate limiting configuration including grouping strategy and intervals.
retry_config (RequestRetryConfig) – Retry configuration for inheriting trigger conditions.
schedule (Callable[[PRequest], Awaitable[Any]]) – Callback function to schedule request execution.
- property active: bool¶
Check if any request groups have pending requests.
- get_group_key(request)[source]¶
Get group key for a request.
- Parameters:
request (Request)
- Return type:
Hashable
- on_request_outcome(outcome)[source]¶
Handle request outcome and adjust group interval adaptively.
- Parameters:
outcome (RequestOutcome)
- class aioscraper.core.rate_limiter.RequestGroup(key, interval, cleanup_timeout, schedule, on_finished)[source]¶
Bases:
object
Manages a group of requests that share the same rate limit interval.
Each group processes requests sequentially with a configured delay between them. Groups automatically clean up after a period of inactivity.
- Parameters:
key (Hashable) – Unique identifier for this request group.
interval (float) – Delay in seconds between processing requests in this group.
cleanup_timeout (float) – Timeout in seconds before cleaning up an idle group.
schedule (Callable[[PRequest], Awaitable[None]]) – Callback function to schedule request execution.
on_finished (Callable[[Hashable, RequestGroup], None]) – Callback invoked when the group finishes or becomes idle.
- property active: bool¶
Check if the group has pending requests in its queue.
- property interval: float¶
Get the current interval for this group.
- class aioscraper.core.rate_limiter.AdaptiveStrategy(*, min_interval=0.001, max_interval=5.0, increase_factor=2.0, decrease_step=0.01, success_threshold=5, ewma_alpha=0.3, trigger_statuses=(429, 500, 502, 503, 504, 522, 524, 408), trigger_exceptions=(<class 'TimeoutError'>, ), respect_retry_after=True)[source]¶
Bases:
object
EWMA + AIMD adaptive rate limiting strategy.
Fast multiplicative increase on overload (server pushback). Slow additive decrease on sustained success (probing for capacity).
- Parameters:
enabled (bool) – Enable adaptive rate limiting.
min_interval (float) – Minimum allowed interval (seconds).
max_interval (float) – Maximum allowed interval (seconds).
increase_factor (float) – Multiplicative factor for interval increase on failure.
decrease_step (float) – Additive step for interval decrease on success.
success_threshold (int) – Number of consecutive successes before decreasing interval.
ewma_alpha (float) – Smoothing factor for latency EWMA (0 < alpha <= 1).
trigger_statuses (tuple[int, ...]) – HTTP statuses that trigger adaptive slowdown.
trigger_exceptions (tuple[type[BaseException], ...]) – Exception types that trigger adaptive slowdown.
respect_retry_after (bool) – Whether to use Retry-After header as override.
- calculate_interval(group_key, current_interval, outcome)[source]¶
Calculate new interval based on request outcome.
Algorithm:
- On failure: interval = min(max_interval, interval * increase_factor)
- On success: if success_count >= threshold: interval = max(min_interval, interval - decrease_step)
- Retry-After override: Use header value if present and enabled
- Returns:
New interval in seconds.
- Parameters:
group_key (Hashable)
current_interval (float)
outcome (RequestOutcome)
- Return type:
float
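The three rules of the algorithm translate into a small pure function. This is a sketch, not the `AdaptiveStrategy` code: `next_interval` is a hypothetical name, the caller is assumed to track the success streak, and using the Retry-After value as-is (without clamping to the min/max bounds) is an assumption.

```python
from typing import Optional


def next_interval(
    current: float,
    *,
    failed: bool,
    success_streak: int,
    min_interval: float = 0.001,
    max_interval: float = 5.0,
    increase_factor: float = 2.0,
    decrease_step: float = 0.01,
    success_threshold: int = 5,
    retry_after: Optional[float] = None,
    respect_retry_after: bool = True,
) -> float:
    """Return the new interval in seconds after one request outcome."""
    if respect_retry_after and retry_after is not None:
        # Assumption: the header value overrides the computed interval unchanged.
        return retry_after
    if failed:
        # Multiplicative increase: back off quickly, capped at max_interval.
        return min(max_interval, current * increase_factor)
    if success_streak >= success_threshold:
        # Additive decrease: probe for capacity slowly, floored at min_interval.
        return max(min_interval, current - decrease_step)
    return current
```

With the defaults, one failure at 1.0 s doubles the interval to 2.0 s, while recovering that second takes a long run of successes at 0.01 s per step.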
- class aioscraper.core.rate_limiter.RequestOutcome(group_key, latency, retry_after=None, status_code=None, exception_type=None)[source]¶
Bases:
object
Captures the result of a request execution.
- Parameters:
group_key (Hashable)
latency (float)
retry_after (float | None)
status_code (int | None)
exception_type (type[BaseException] | None)
- group_key¶
The RequestGroup key this outcome belongs to.
- Type:
Hashable
- latency¶
Request latency in seconds (start to finish).
- Type:
float
- retry_after¶
Value from Retry-After header if present.
- Type:
float | None
- status_code¶
HTTP status code if applicable.
- Type:
int | None
- exception_type¶
Type of exception if one occurred.
- Type:
type[BaseException] | None
- class aioscraper.core.rate_limiter.AdaptiveMetrics(ewma_latency=0.0, ewma_alpha=0.3, success_count=0, failure_count=0, last_outcome_time=None, last_outcome_success=True, total_requests=0)[source]¶
Bases:
object
Tracks metrics for adaptive rate limiting using EWMA + AIMD.
- Parameters:
ewma_latency (float)
ewma_alpha (float)
success_count (int)
failure_count (int)
last_outcome_time (float | None)
last_outcome_success (bool)
total_requests (int)
- ewma_latency¶
Exponentially weighted moving average of request latency.
- Type:
float
- ewma_alpha¶
Smoothing factor for EWMA (0 < alpha <= 1).
- Type:
float
- success_count¶
Consecutive successful requests since last failure.
- Type:
int
- failure_count¶
Consecutive failures since last success.
- Type:
int
- last_outcome_time¶
Timestamp of last completed request.
- Type:
float | None
- last_outcome_success¶
Whether last request was successful.
- Type:
bool | None
- total_requests¶
Total number of completed requests in this group.
- Type:
int
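The `ewma_latency` and `ewma_alpha` fields correspond to the standard EWMA update `new = alpha * sample + (1 - alpha) * old`. The sketch below shows that update in isolation. `LatencyTracker` is a hypothetical class, and seeding the average with the first sample is an assumption about initialisation.

```python
from dataclasses import dataclass


@dataclass
class LatencyTracker:
    """Hypothetical tracker showing the standard EWMA update rule."""

    ewma_alpha: float = 0.3
    ewma_latency: float = 0.0
    total_requests: int = 0

    def observe(self, latency: float) -> float:
        if self.total_requests == 0:
            # Assumption: seed with the first sample rather than blending with 0.0.
            self.ewma_latency = latency
        else:
            self.ewma_latency = (
                self.ewma_alpha * latency + (1 - self.ewma_alpha) * self.ewma_latency
            )
        self.total_requests += 1
        return self.ewma_latency
```

A higher `ewma_alpha` makes the average react faster to recent samples, as the parameter description notes.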
Holders¶
- class aioscraper.holders.middleware.MiddlewareHolder[source]¶
Bases:
object
Stores request middleware factories in registration order.
- __call__(factory)[source]¶
Decorator form: register a middleware factory.
- Parameters:
factory (Callable[[...], RequestMiddleware])
- Return type:
Callable[[...], RequestMiddleware]
- add(*factories)[source]¶
Register request middleware factories in order.
Each factory can accept injected dependencies and must return a middleware with signature async def mw(call_next, request): ... which wraps the request handler chain for every request.
- Parameters:
factories (Callable[[...], RequestMiddleware])
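The `async def mw(call_next, request)` signature describes an "onion" of wrappers around the final handler. The sketch below shows how such a chain is typically composed. It is not the library's dispatcher: `build_chain`, `make_tagger`, and `send` are hypothetical, a plain dict stands in for `Request`, and running the first registered middleware outermost is an assumption.

```python
import asyncio
from typing import Any, Awaitable, Callable

Handler = Callable[[Any], Awaitable[Any]]
Middleware = Callable[[Handler, Any], Awaitable[Any]]


def build_chain(handler: Handler, middlewares: list[Middleware]) -> Handler:
    # Wrap from the last middleware inward so the first registered runs outermost.
    for mw in reversed(middlewares):
        def wrapped(request: Any, _mw: Middleware = mw, _next: Handler = handler) -> Awaitable[Any]:
            return _mw(_next, request)
        handler = wrapped
    return handler


def make_tagger(tag: str) -> Middleware:
    # Hypothetical factory: returns a middleware closed over `tag`.
    async def mw(call_next: Handler, request: dict) -> Any:
        request["trace"].append(f"{tag}:before")
        result = await call_next(request)
        request["trace"].append(f"{tag}:after")
        return result
    return mw


async def send(request: dict) -> str:
    # Innermost handler, standing in for the actual HTTP call.
    request["trace"].append("send")
    return "response"


async def demo() -> list[str]:
    request = {"trace": []}
    chain = build_chain(send, [make_tagger("a"), make_tagger("b")])
    await chain(request)
    return request["trace"]
```

Because each middleware decides whether and when to await `call_next`, it can short-circuit, retry, or time the rest of the chain.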
- class aioscraper.holders.pipeline.PipelineHolder[source]¶
Bases:
object
Keeps pipeline containers and exposes decorator helpers.
- __call__(item_type, *args, **kwargs)[source]¶
Return a decorator that instantiates and registers a pipeline class for the given item type.
- Parameters:
item_type (type[PipelineItemType])
- Return type:
Callable[[type[BasePipeline[PipelineItemType]]], type[BasePipeline[PipelineItemType]]]
- add(item_type, *pipelines)[source]¶
Add pipelines to process scraped data.
- Parameters:
item_type (type[PipelineItemType])
pipelines (BasePipeline[PipelineItemType])
- add_global_middlewares(*factories)[source]¶
Register global pipeline middleware factories in order.
Each factory can accept injected dependencies and must return a middleware with signature async def mw(handler, item): ... which wraps the entire pipeline chain for every item type.
- Parameters:
factories (Callable[[...], GlobalPipelineMiddleware[PipelineItemType]])
- add_middlewares(middleware_type, item_type, *middlewares)[source]¶
Add pipeline processing middlewares.
- Parameters:
middleware_type (Literal['pre', 'post'])
item_type (type[PipelineItemType])
middlewares (PipelineMiddleware[PipelineItemType])
- global_middleware(factory)[source]¶
Decorator form of add_global_middlewares().
- Parameters:
factory (Callable[[...], GlobalPipelineMiddleware[PipelineItemType]])
- Return type:
Callable[[...], GlobalPipelineMiddleware[PipelineItemType]]
- middleware(middleware_type, item_type)[source]¶
Return a decorator that registers a pipeline middleware for the given stage.
- Parameters:
middleware_type (Literal['pre', 'post'])
item_type (type[PipelineItemType])
- Return type:
Callable[[PipelineMiddleware[PipelineItemType]], PipelineMiddleware[PipelineItemType]]
Exceptions¶
- class aioscraper.exceptions.ClientException[source]¶
Bases:
AIOScraperException
Base exception class for all client-related errors.
- class aioscraper.exceptions.HTTPException(url, method, status_code, headers, message)[source]¶
Bases:
ClientException
Exception raised when an HTTP request fails with a specific status code.
- Parameters:
status_code (int) – The HTTP status code of the failed request
message (str) – Error message describing the failure
url (str) – The URL that was being accessed
method (str) – The HTTP method used for the request
headers (Mapping[str, str]) – Response headers returned by the server
- class aioscraper.exceptions.PipelineException[source]¶
Bases:
AIOScraperException
Base exception class for all pipeline-related errors.
- class aioscraper.exceptions.StopItemProcessing[source]¶
Bases:
AIOScraperException
Raised by pipeline middlewares to stop processing the current item.
- class aioscraper.exceptions.StopMiddlewareProcessing[source]¶
Bases:
AIOScraperException
Stop further pipeline middlewares in the current phase (pre/post).
- class aioscraper.exceptions.InvalidRequestData[source]¶
Bases:
AIOScraperException
Raised when request payload fields conflict.
- class aioscraper.exceptions.CLIError[source]¶
Bases:
AIOScraperException
Raised when CLI arguments are invalid or cannot be resolved.