Configuration
aioscraper ships sane defaults but exposes configuration for sessions, scheduling, execution, and pipeline dispatching.
You can build a Config and pass it to AIOScraper via AIOScraper(config=...), or override values via environment variables.
The CLI reads well-known environment variables (for example SESSION_REQUEST_TIMEOUT, SCHEDULER_CONCURRENT_REQUESTS, EXECUTION_TIMEOUT, PIPELINE_STRICT) and applies them before launching the scraper.
The HTTP client is chosen at runtime: aiohttp is used when installed, otherwise httpx. Install one of the extras from the Installation Guide so that requests can be executed.
Set SessionConfig.http_backend (or SESSION_HTTP_BACKEND) to a value from HttpBackend if you want to force one client even when both are available.
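The selection rule described above can be pictured with a small standard-library sketch. This is not aioscraper's actual factory: the helper name `pick_backend` and the injectable `is_installed` callback are hypothetical, and only the preference order (aiohttp first, then httpx, unless one backend is forced) is taken from the text.

```python
import importlib.util
from typing import Callable, Optional


def _installed(module: str) -> bool:
    """Return True when a top-level module can be found in this environment."""
    return importlib.util.find_spec(module) is not None


def pick_backend(
    forced: Optional[str] = None,
    is_installed: Callable[[str], bool] = _installed,
) -> str:
    """Hypothetical helper: pick an HTTP client name, preferring aiohttp."""
    if forced is not None:
        # A forced backend must still be importable.
        if not is_installed(forced):
            raise RuntimeError(f"{forced} is not installed")
        return forced
    for candidate in ("aiohttp", "httpx"):
        if is_installed(candidate):
            return candidate
    raise RuntimeError("install aiohttp or httpx so requests can be executed")
```

Passing `forced="httpx"` mirrors what setting SessionConfig.http_backend is described as doing: the choice is pinned even when both clients are importable.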
import logging

from aioscraper import AIOScraper, run_scraper
from aioscraper.config import (
    Config,
    SessionConfig,
    SchedulerConfig,
    ExecutionConfig,
    PipelineConfig,
    RateLimitConfig,
)

config = Config(
    session=SessionConfig(
        timeout=20,
        rate_limit=RateLimitConfig(default_interval=0.05),
        ssl=True,
        proxy="http://localhost:8080",
    ),
    scheduler=SchedulerConfig(
        concurrent_requests=32,
        pending_requests=4,
        close_timeout=0.5,
        ready_queue_max_size=1000,
    ),
    execution=ExecutionConfig(
        timeout=60,
        shutdown_timeout=0.5,
        shutdown_check_interval=0.1,
        log_level=logging.WARNING,
    ),
    pipeline=PipelineConfig(strict=False),
)


async def main():
    scraper = AIOScraper(config=config)
    await run_scraper(scraper)
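The environment-variable overrides mentioned at the top of this page follow a simple pattern: read a variable, parse it, and fall back to the default when it is unset. The sketch below shows that pattern with the standard library only. The helper names `env_value` and `parse_bool`, the accepted boolean spellings, and the mapping from variable names to settings are assumptions for illustration, not aioscraper's loader.

```python
import os
from typing import Callable, Mapping, Optional, TypeVar

T = TypeVar("T")


def parse_bool(raw: str) -> bool:
    """Assumed convention: 1/true/yes/on (any case) mean True."""
    return raw.strip().lower() in {"1", "true", "yes", "on"}


def env_value(
    name: str,
    cast: Callable[[str], T],
    default: T,
    environ: Optional[Mapping[str, str]] = None,
) -> T:
    """Return the parsed variable, or the default when it is not set."""
    env = os.environ if environ is None else environ
    raw = env.get(name)
    return default if raw is None else cast(raw)


# A fake environment keeps the example deterministic.
fake_env = {"SESSION_REQUEST_TIMEOUT": "20", "PIPELINE_STRICT": "false"}
settings = {
    "session.timeout": env_value("SESSION_REQUEST_TIMEOUT", float, 60.0, fake_env),
    "scheduler.concurrent_requests": env_value(
        "SCHEDULER_CONCURRENT_REQUESTS", int, 64, fake_env
    ),
    "pipeline.strict": env_value("PIPELINE_STRICT", parse_bool, True, fake_env),
}
```

The defaults used here (60.0, 64, True) are the ones shown in the API signatures at the end of this page.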
Graceful shutdown
execution.timeout - overall budget (None by default, i.e. no total limit); on expiry the runner logs at execution.log_level and cancels all tasks.
execution.shutdown_timeout - grace period after SIGINT/SIGTERM/timeout before hard cancelling in-flight work.
execution.shutdown_check_interval - pause between drain checks while waiting for the scheduler/queue to empty.
Signals: first SIGINT/SIGTERM initiates shutdown, second triggers force-exit. Lifespan is shielded so cleanup still runs.
These settings are honored by both the CLI and run_scraper, giving consistent stop behavior in code or from the terminal.
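To make the interplay of shutdown_timeout and shutdown_check_interval concrete, here is a minimal asyncio sketch of a drain loop: poll until in-flight work finishes, and cancel whatever is left once the grace period runs out. The function name `drain` and its signature are hypothetical; the library's runner may be structured differently.

```python
import asyncio
from typing import Iterable


async def drain(
    pending: Iterable[asyncio.Task],
    shutdown_timeout: float,
    check_interval: float,
) -> bool:
    """Wait for tasks to finish; hard-cancel leftovers after the grace period.

    Returns True when everything finished on its own.
    """
    tasks = list(pending)
    loop = asyncio.get_running_loop()
    deadline = loop.time() + shutdown_timeout
    # Poll at check_interval until the work is done or the budget is spent.
    while any(not t.done() for t in tasks) and loop.time() < deadline:
        await asyncio.sleep(check_interval)
    leftovers = [t for t in tasks if not t.done()]
    for task in leftovers:
        task.cancel()
    # Let cancelled tasks unwind so their cleanup code runs.
    await asyncio.gather(*leftovers, return_exceptions=True)
    return not leftovers
```

A shorter check interval notices completion sooner at the cost of more wake-ups; the grace period bounds how long a stuck request can delay exit.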
Proxies
SessionConfig.proxy accepts two shapes; pick the one your HTTP client supports:
aiohttp - "http://localhost:8000" (single proxy applied to every request).
httpx (single proxy) - "http://localhost:8000" when one proxy handles all schemes.
httpx (per-scheme) - {"http": "http://localhost:8000", "https": "http://localhost:8001"} to route http/https separately.
Warning
httpx only supports client-scoped proxies, so per-request overrides are ignored. aiohttp does the opposite: a proxy passed directly in Request(..., proxy=...) takes precedence over config.session.proxy.
Authentication
Authenticated proxies can be provided by embedding credentials directly in the proxy URL, for example:
http://username:password@localhost:8030
This works for both aiohttp and httpx proxy configurations.
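Credentials that contain characters such as `@` or `:` need to be percent-encoded before they are embedded in a URL, otherwise the URL becomes ambiguous. The sketch below builds such a proxy URL with the standard library. The helper name `proxy_url` is hypothetical and not part of aioscraper.

```python
from urllib.parse import quote


def proxy_url(
    host: str,
    port: int,
    username: str,
    password: str,
    scheme: str = "http",
) -> str:
    """Build a proxy URL with percent-encoded credentials."""
    # safe="" also encodes "/" so the userinfo part cannot leak into the path.
    user = quote(username, safe="")
    secret = quote(password, safe="")
    return f"{scheme}://{user}:{secret}@{host}:{port}"


url = proxy_url("localhost", 8030, "username", "p@ss:word")
```

The resulting string can be used wherever the text above uses a literal proxy URL, for example as the value of SessionConfig.proxy.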
Rate Limiting
Set SessionConfig.rate_limit or override values via environment variables to enable built-in rate limiting.
Rate limiting groups requests by a key (by default, the URL hostname) and enforces a minimum interval between requests within each group. This helps avoid overwhelming target servers and getting blocked.
from aioscraper.config import RateLimitConfig

rate_limit_config = RateLimitConfig(
    enabled=True,
    default_interval=0.5,  # 500ms between requests per host
    cleanup_timeout=60.0,  # Clean up idle groups after 60 seconds
)
Configuration options:
enabled: Toggle rate limiting on or off (default: False).
group_by: Custom function to group requests and specify per-group intervals. Must return tuple[Hashable, float] where the first element is the group key and the second is the interval in seconds.
default_interval: Default delay in seconds between requests within each group (default: 0.0).
cleanup_timeout: Timeout in seconds for cleaning up inactive request groups (default: 60.0).
adaptive: Enable adaptive rate limiting (default: None).
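The idea of a minimum interval per group can be sketched in a few lines. The class below is an illustration only, not the library's limiter: the name `GroupIntervalLimiter` is hypothetical, the clock is passed in explicitly to keep the example deterministic, and requests are grouped by URL hostname as the text describes for the default case.

```python
from typing import Dict, Optional
from urllib.parse import urlsplit


class GroupIntervalLimiter:
    """Track, per hostname, the earliest time the next request may start."""

    def __init__(self, default_interval: float) -> None:
        self.default_interval = default_interval
        self._next_allowed: Dict[Optional[str], float] = {}

    def delay_for(self, url: str, now: float) -> float:
        """Return how long a request to `url` should wait, and reserve its slot."""
        key = urlsplit(url).hostname
        ready_at = max(now, self._next_allowed.get(key, now))
        # The following request in this group must start one interval later.
        self._next_allowed[key] = ready_at + self.default_interval
        return ready_at - now
```

Requests to different hosts never delay each other in this sketch, while back-to-back requests to the same host are spaced at least default_interval apart.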
Custom grouping
You can define custom grouping logic to apply different rate limits per domain or endpoint:
from yarl import URL

from aioscraper.config import RateLimitConfig


def custom_group_by(request):
    """Group by domain with custom intervals."""
    host = URL(request.url).host
    if host == "api.example.com":
        return (host, 0.1)  # 100ms for API
    elif host == "www.example.com":
        return (host, 1.0)  # 1 second for website
    return (host, 0.5)  # 500ms default


rate_limit_config = RateLimitConfig(enabled=True, group_by=custom_group_by)
When enabled=False (default), group-based rate limiting is bypassed. However, if default_interval is set, it will still apply a simple delay between all requests without grouping logic.
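Custom grouping can also work per endpoint rather than per domain. The sketch below groups by host plus the first path segment and returns the `(key, interval)` tuple that group_by is documented to return. `FakeRequest` is a stand-in used only to make the example self-contained, and the intervals and the "search" endpoint are made-up values.

```python
from dataclasses import dataclass
from typing import Hashable, Tuple
from urllib.parse import urlsplit


@dataclass
class FakeRequest:
    """Stand-in for the library's Request; only `url` is used here."""

    url: str


def group_by_endpoint(request) -> Tuple[Hashable, float]:
    """Group by (host, first path segment) with a slower rate for /search."""
    parts = urlsplit(str(request.url))
    first_segment = parts.path.strip("/").split("/")[0]
    key = (parts.hostname, first_segment)
    # Hypothetical policy: the search endpoint is expensive, so slow it down.
    interval = 2.0 if first_segment == "search" else 0.25
    return key, interval
```

Any hashable value works as the key, so a tuple is a convenient way to combine several request attributes into one group.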
Adaptive Rate Limiting
The adaptive rate limiting feature automatically adjusts request intervals based on server responses, using a hybrid EWMA (Exponentially Weighted Moving Average) + AIMD (Additive Increase Multiplicative Decrease) algorithm inspired by TCP congestion control.
How it works:
Fast multiplicative increase on server overload (429, 503, timeouts) - backs off aggressively to avoid hammering struggling servers
Slow additive decrease on sustained success - gradually probes for increased capacity
Respects Retry-After headers - server-provided backoff takes priority over heuristics
Per-group adaptation - each hostname/group adapts independently
from aioscraper.config import RateLimitConfig, AdaptiveRateLimitConfig

rate_limit_config = RateLimitConfig(
    enabled=True,
    default_interval=0.1,  # Starting interval: 100ms
    adaptive=AdaptiveRateLimitConfig(
        min_interval=0.001,  # Min: 1ms (won't go below)
        max_interval=5.0,  # Max: 5s (won't exceed)
        increase_factor=2.0,  # Double interval on failure
        decrease_step=0.01,  # Subtract 10ms on success
        success_threshold=5,  # Decrease after 5 consecutive successes
        ewma_alpha=0.3,  # Latency smoothing factor
        respect_retry_after=True,  # Honor server Retry-After headers
    ),
)
Configuration options:
min_interval: Minimum allowed interval in seconds (default: 0.001)
max_interval: Maximum allowed interval in seconds (default: 5.0)
increase_factor: Multiplicative factor for interval increase on failure (default: 2.0)
decrease_step: Additive step for interval decrease on success in seconds (default: 0.01)
success_threshold: Number of consecutive successes before decreasing interval (default: 5)
ewma_alpha: Smoothing factor for latency EWMA, between 0 and 1 (default: 0.3)
respect_retry_after: Use Retry-After header as override (default: True)
inherit_retry_triggers: Inherit trigger statuses/exceptions from retry config (default: True)
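For readers unfamiliar with the ewma_alpha parameter, the standard EWMA update is shown below. This is the textbook formula, not code taken from aioscraper, and the class name `LatencyEwma` is hypothetical; how the library feeds latency samples into its decisions is not covered here.

```python
from typing import Optional


class LatencyEwma:
    """Exponentially weighted moving average of observed latencies."""

    def __init__(self, alpha: float = 0.3) -> None:
        if not 0 < alpha <= 1:
            raise ValueError("alpha must be in (0, 1]")
        self.alpha = alpha
        self.value: Optional[float] = None

    def update(self, sample: float) -> float:
        """Fold one latency sample (seconds) into the running average."""
        if self.value is None:
            # The first sample seeds the average.
            self.value = sample
        else:
            self.value = self.alpha * sample + (1 - self.alpha) * self.value
        return self.value
```

A higher alpha gives more weight to recent samples, so the average reacts faster but is noisier.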
Behavior:
When a request fails with a trigger status (429, 500, 502, 503, 504, etc.) or exception (timeout):
If Retry-After header present and respect_retry_after=True → use that value
Otherwise, multiply current interval by increase_factor (e.g., 0.1s → 0.2s → 0.4s)
When requests succeed consistently:
After success_threshold consecutive successes, subtract decrease_step from interval
This gradually probes for increased capacity (e.g., 0.4s → 0.39s → 0.38s)
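The two rules above can be condensed into a small state machine. The sketch below is an illustration under stated assumptions rather than the library's implementation: the class name `AdaptiveInterval` is hypothetical, the success counter is reset after each decrease, and a Retry-After value is clamped to the min/max bounds (the text does not say whether the library clamps it).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveInterval:
    """Multiplicative increase on failure, additive decrease on sustained success."""

    interval: float = 0.1
    min_interval: float = 0.001
    max_interval: float = 5.0
    increase_factor: float = 2.0
    decrease_step: float = 0.01
    success_threshold: int = 5
    respect_retry_after: bool = True
    successes: int = 0

    def _clamp(self, value: float) -> float:
        return max(self.min_interval, min(self.max_interval, value))

    def on_failure(self, retry_after: Optional[float] = None) -> float:
        """Back off: prefer the server's Retry-After, else multiply."""
        self.successes = 0
        if retry_after is not None and self.respect_retry_after:
            self.interval = self._clamp(retry_after)  # clamping is an assumption
        else:
            self.interval = self._clamp(self.interval * self.increase_factor)
        return self.interval

    def on_success(self) -> float:
        """Speed up by one step after enough consecutive successes."""
        self.successes += 1
        if self.successes >= self.success_threshold:
            self.interval = self._clamp(self.interval - self.decrease_step)
            self.successes = 0
        return self.interval
```

Starting from 0.1s, two failures move the interval to 0.2s and then 0.4s, and five consecutive successes bring it down to 0.39s, which matches the figures quoted in the rules.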
Example scenario:
Time   Event                  Interval
----   -----                  --------
0.0s   Start                  0.100s (default)
0.1s   Request #1 → 429       0.100s → 0.200s (×2)
0.3s   Request #2 → 503       0.200s → 0.400s (×2)
0.7s   Request #3 → 200 OK    0.400s (no change, count=1)
1.1s   Request #4 → 200 OK    0.400s (no change, count=2)
...    (3 more successes)     ...
2.7s   Request #8 → 200 OK    0.400s → 0.390s (count≥5, -0.01)
3.1s   Request #9 → 200 OK    0.390s (no change, count=1)
Integration with retry middleware:
When both adaptive rate limiting and retry middleware are enabled:
Retry middleware handles retry logic (attempts, backoff)
Adaptive rate limiter adjusts the sending rate to prevent future failures
Trigger statuses/exceptions are shared when inherit_retry_triggers=True
This prevents the system from repeatedly hammering an overloaded server while retries are ongoing.
Retries
Set SessionConfig.retry or override values via environment variables to enable the built-in retry middleware.
You can configure the number of retry attempts, the backoff strategy, and the status codes and exception types that trigger a retry.
The backoff option accepts the following values:
CONSTANT: uses a fixed delay for every retry attempt.
LINEAR: delay increases linearly with each attempt: delay = base_delay * attempt.
EXPONENTIAL: delay grows exponentially with each attempt: delay = base_delay * (2 ** attempt).
EXPONENTIAL_JITTER: exponential backoff with added randomness (jitter) to prevent thundering herd effects.
For EXPONENTIAL_JITTER, the delay is calculated as follows:
delay = base_delay * (2 ** attempt)
delay = (delay / 2) + random.uniform(0, delay / 2)
For both EXPONENTIAL and EXPONENTIAL_JITTER, max_delay caps the final delay to avoid excessively long waits.
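The formulas above can be collected into one function to compare the strategies side by side. This is a sketch, not the middleware's code: the function name `backoff_delay` is hypothetical, strategies are passed as lowercase strings for simplicity, and whether the library counts attempts from 0 or 1 is not stated here, so the attempt number is taken as given.

```python
import random
from typing import Optional


def backoff_delay(
    strategy: str,
    attempt: int,
    base_delay: float,
    max_delay: float,
    rng: Optional[random.Random] = None,
) -> float:
    """Compute the retry delay in seconds for one attempt."""
    rng = rng or random.Random()
    if strategy == "constant":
        return base_delay
    if strategy == "linear":
        return base_delay * attempt
    if strategy not in ("exponential", "exponential_jitter"):
        raise ValueError(f"unknown strategy: {strategy}")
    delay = base_delay * (2 ** attempt)
    if strategy == "exponential_jitter":
        # Keep half of the delay fixed and randomize the other half.
        delay = (delay / 2) + rng.uniform(0, delay / 2)
    # max_delay caps the final value for both exponential variants.
    return min(delay, max_delay)
```

With base_delay=1.0 and max_delay=5.0, the exponential strategy yields 2.0, 4.0 and then 5.0 for attempts 1 to 3, while the jittered variant lands somewhere between half of the uncapped value and the cap.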
import asyncio

from aioscraper.config import RequestRetryConfig, BackoffStrategy

retry_config = RequestRetryConfig(
    enabled=True,
    attempts=5,
    backoff=BackoffStrategy.EXPONENTIAL_JITTER,
    base_delay=1.0,
    max_delay=5.0,
    statuses=(500, 502, 503),
    exceptions=(asyncio.TimeoutError,),
)
When enabled, RetryMiddleware is registered automatically as the innermost middleware (closest to dispatch) and reschedules the request through the internal queue.
Server-side Retry-After
When the server responds with a Retry-After header (RFC 9110), the middleware respects it and uses the server-specified delay instead of the configured backoff strategy. This only applies to 429 Too Many Requests and 503 Service Unavailable responses.
The Retry-After header can be specified as:
Seconds: Retry-After: 120 (wait 120 seconds)
HTTP-date: Retry-After: Wed, 21 Oct 2015 07:28:00 GMT
The delay from Retry-After is capped at 600 seconds (10 minutes) to prevent indefinite delays.
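Both header forms and the 600-second cap can be handled with the standard library alone. The sketch below is one way to do it and is not the middleware's parser: the function name `parse_retry_after` is hypothetical, and returning None for an unparseable value and 0 for a date in the past are assumptions.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional

MAX_RETRY_AFTER = 600.0  # cap from the text above: 10 minutes


def parse_retry_after(value: str, now: Optional[datetime] = None) -> Optional[float]:
    """Return the delay in seconds, or None when the header is unparseable."""
    value = value.strip()
    if value.isascii() and value.isdigit():
        seconds = float(value)  # delay-seconds form
    else:
        try:
            when = parsedate_to_datetime(value)  # HTTP-date form
        except (TypeError, ValueError):
            return None
        if when.tzinfo is None:
            # Treat a date without zone information as UTC.
            when = when.replace(tzinfo=timezone.utc)
        now = now or datetime.now(timezone.utc)
        seconds = (when - now).total_seconds()
    # A date in the past means "retry now"; long waits are capped.
    return min(max(seconds, 0.0), MAX_RETRY_AFTER)
```

Passing `now` explicitly makes the HTTP-date branch easy to test without depending on the wall clock.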
API
- class aioscraper.config.models.Config(session=SessionConfig(timeout=60.0, ssl=True, proxy=None, http_backend=None, retry=RequestRetryConfig(enabled=False, attempts=3, backoff=<BackoffStrategy.EXPONENTIAL_JITTER: 'exponential_jitter'>, base_delay=0.5, max_delay=30.0, statuses=(500, 502, 503, 504, 522, 524, 408, 429), exceptions=(<class 'TimeoutError'>, )), rate_limit=RateLimitConfig(enabled=False, group_by=None, default_interval=0.0, cleanup_timeout=60.0, adaptive=None)), scheduler=SchedulerConfig(concurrent_requests=64, pending_requests=1, close_timeout=0.1, ready_queue_max_size=0), execution=ExecutionConfig(timeout=None, shutdown_timeout=0.1, shutdown_check_interval=0.1, log_level=40), pipeline=PipelineConfig(strict=True))
Bases: object
Main configuration class that combines all configuration components.
- Parameters:
session (SessionConfig)
scheduler (SchedulerConfig)
execution (ExecutionConfig)
pipeline (PipelineConfig)
- aioscraper.config.loader.load_config()
Load configuration from environment variables.
Reads configuration from environment variables prefixed with SESSION, SCHEDULER, EXECUTION, and PIPELINE. When parameters are None, values are read from corresponding environment variables. Defaults are used when env vars are not set.
- Returns:
Complete configuration object with all settings resolved.
- Return type:
- class aioscraper.config.models.SessionConfig(timeout=60.0, ssl=True, proxy=None, http_backend=None, retry=RequestRetryConfig(enabled=False, attempts=3, backoff=<BackoffStrategy.EXPONENTIAL_JITTER: 'exponential_jitter'>, base_delay=0.5, max_delay=30.0, statuses=(500, 502, 503, 504, 522, 524, 408, 429), exceptions=(<class 'TimeoutError'>, )), rate_limit=RateLimitConfig(enabled=False, group_by=None, default_interval=0.0, cleanup_timeout=60.0, adaptive=None))
Bases: object
HTTP session settings shared by every request.
- Parameters:
timeout (float) – Request timeout in seconds
ssl (ssl.SSLContext | bool) – SSL handling; bool toggles verification, SSLContext can carry custom CAs
proxy (str | dict[str, str | None] | None) – Default proxy passed to the HTTP client
http_backend (HttpBackend | None) – Force aiohttp/httpx; None lets the factory auto-detect
retry (RequestRetryConfig) – Controls built-in retry middleware behaviour
rate_limit (RateLimitConfig) – Controls built-in rate limiting behaviour
- class aioscraper.config.models.SchedulerConfig(concurrent_requests=64, pending_requests=1, close_timeout=0.1, ready_queue_max_size=0)
Bases: object
Configuration for request scheduler.
- Parameters:
concurrent_requests (int) – Maximum number of concurrent requests
pending_requests (int) – Number of pending requests to maintain
close_timeout (float | None) – Timeout for closing scheduler in seconds
ready_queue_max_size (int) – Maximum size of the ready queue (0 for unlimited)
- class aioscraper.config.models.ExecutionConfig(timeout=None, shutdown_timeout=0.1, shutdown_check_interval=0.1, log_level=40)
Bases: object
Configuration for execution.
- Parameters:
timeout (float | None) – Overall execution timeout in seconds
shutdown_timeout (float) – Timeout for graceful shutdown in seconds
log_level (int) – Log level for timeout events (e.g., logging.ERROR, logging.WARNING). Defaults to logging.ERROR.
shutdown_check_interval (float)
- class aioscraper.config.models.PipelineConfig(strict=True)
Bases: object
Configuration for pipelines.
- Parameters:
strict (bool) – Raise an exception if a pipeline for an item is missing
- class aioscraper.config.models.RequestRetryConfig(enabled=False, attempts=3, backoff=BackoffStrategy.EXPONENTIAL_JITTER, base_delay=0.5, max_delay=30.0, statuses=(500, 502, 503, 504, 522, 524, 408, 429), exceptions=(<class 'TimeoutError'>, ))
Bases: object
Retry behaviour applied by the built-in retry middleware.
- Parameters:
enabled (bool) – Toggle retries on or off.
attempts (int) – Maximum number of retry attempts per request.
backoff (BackoffStrategy) – Backoff strategy for retries.
base_delay (float) – Base delay between retries in seconds.
max_delay (float) – Maximum delay between retries in seconds.
statuses (tuple[int, ...]) – HTTP status codes that should trigger a retry.
exceptions (tuple[type[BaseException], ...]) – Exception types that should trigger a retry.
- class aioscraper.config.models.RateLimitConfig(enabled=False, group_by=None, default_interval=0.0, cleanup_timeout=60.0, adaptive=None)
Bases: object
Configuration for rate limiting.
- Parameters:
enabled (bool) – Toggle rate limiting on or off.
group_by (Callable[[Request], tuple[Hashable, float]] | None) – Function to group requests by.
default_interval (float) – Default interval for group.
cleanup_timeout (float) – Timeout in seconds before cleaning up an idle request group.
adaptive (AdaptiveRateLimitConfig | None) – Adaptive rate limiting configuration (EWMA + AIMD).
- class aioscraper.config.models.AdaptiveRateLimitConfig(min_interval=0.001, max_interval=5.0, increase_factor=2.0, decrease_step=0.01, success_threshold=5, ewma_alpha=0.3, respect_retry_after=True, inherit_retry_triggers=True, custom_trigger_statuses=(), custom_trigger_exceptions=())
Bases: object
Configuration for adaptive rate limiting using EWMA + AIMD.
Adaptively adjusts request intervals based on server response patterns. Uses EWMA (Exponentially Weighted Moving Average) for latency tracking and AIMD (Additive Increase Multiplicative Decrease) for interval adjustment.
- Parameters:
min_interval (float) – Minimum allowed interval between requests (seconds).
max_interval (float) – Maximum allowed interval between requests (seconds).
increase_factor (float) – Multiplicative factor for interval increase on failure (must be > 1.0).
decrease_step (float) – Additive step for interval decrease on success (seconds).
success_threshold (int) – Number of consecutive successes before decreasing interval.
ewma_alpha (float) – EWMA smoothing factor for latency (0 < alpha <= 1, higher = more weight to recent).
respect_retry_after (bool) – Whether to use Retry-After header as interval override.
inherit_retry_triggers (bool) – Whether to use RequestRetryConfig statuses/exceptions as triggers.
custom_trigger_statuses (tuple[int, ...]) – Additional HTTP statuses to trigger adaptive slowdown.
custom_trigger_exceptions (tuple[type[BaseException], ...]) – Additional exception types to trigger adaptive slowdown.