CLI¶
Run scrapers from the command line without wiring up the event loop yourself.
pip install aioscraper
aioscraper scraper
See the minimal code in Quickstart.
Entrypoint contract¶
The CLI loads a module (file path or module.path) and optionally a specific attribute using module:attr.
Entry rules:
Without :attr, the CLI looks for a scraper attribute that is either an AIOScraper instance or a callable returning one.
With :attr pointing to an AIOScraper, the CLI uses that instance.
With :attr pointing to a callable (sync or async), the CLI executes/awaits it and expects an AIOScraper instance in return.
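The entry rules above can be sketched roughly as follows. This is an illustration only, not aioscraper's actual loader: FakeScraper stands in for AIOScraper so the snippet runs without the library, resolve_entrypoint and the demo_entry module are hypothetical names, and loading from a file path is left out.

```python
import asyncio
import importlib
import inspect
import sys
import types


class FakeScraper:
    """Stand-in for AIOScraper so the sketch runs without aioscraper installed."""


async def resolve_entrypoint(spec: str, default_attr: str = "scraper"):
    # Split "module:attr"; fall back to the default attribute name.
    module_ref, _, attr = spec.partition(":")
    module = importlib.import_module(module_ref)
    target = getattr(module, attr or default_attr)

    # An instance is used as-is.
    if isinstance(target, FakeScraper):
        return target

    # Otherwise treat it as a factory: call it, and await the result if needed.
    if callable(target):
        result = target()
        if inspect.isawaitable(result):
            result = await result
        if isinstance(result, FakeScraper):
            return result

    raise TypeError(f"{spec!r} did not resolve to a scraper instance")


async def make_async():
    return FakeScraper()


# Register a throwaway in-memory module to exercise the three rules.
demo = types.ModuleType("demo_entry")
demo.scraper = FakeScraper()
demo.make = lambda: FakeScraper()
demo.make_async = make_async
sys.modules["demo_entry"] = demo

resolved = [
    asyncio.run(resolve_entrypoint(spec))
    for spec in ("demo_entry", "demo_entry:make", "demo_entry:make_async")
]
```

Checking for an instance before checking for a callable keeps the two cases from being confused if the scraper object itself happens to be callable.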
Examples¶
aioscraper scraper # uses scraper variable from scraper.py
aioscraper mypkg.scraper:custom_app # uses custom_app AIOScraper instance
aioscraper mypkg.factory:make # calls make() (sync factory)
aioscraper mypkg.factory:make_async # awaits make_async() (async factory)
For resource setup/teardown around the same scraper instance, attach a lifespan(scraper) when constructing the scraper in code (see Lifespan).
Running without the CLI¶
You can run the same scraper programmatically using run_scraper:
import asyncio

from aioscraper import AIOScraper, Request, SendRequest, run_scraper
from aioscraper.config import load_config


async def scrape(send_request: SendRequest):
    await send_request(Request(url="https://example.com"))


async def main():
    # Build the config from environment variables and attach it up front.
    scraper = AIOScraper(scrape, config=load_config())
    await run_scraper(scraper)


if __name__ == "__main__":
    asyncio.run(main())
This gives you the same signal handling and graceful shutdown behavior as the CLI.
run_scraper expects scraper.config to be set ahead of time, which is why the example passes config=load_config() to the constructor.
Configuration¶
Configuration precedence (when the CLI needs to load a config): CLI flags -> environment variables -> Config defaults.
If the resolved AIOScraper already has config set, the CLI leaves it untouched and CLI flags/env vars are ignored.
See Configuration for detailed configuration options and examples.
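To make the precedence order concrete, here is a minimal sketch of resolving two scheduler settings from CLI flags, then environment variables, then defaults. SchedulerSettings, pick, and resolve_settings are hypothetical helpers, and the default values are placeholders rather than aioscraper's real defaults; only the environment variable names come from this page.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class SchedulerSettings:
    # Placeholder defaults for illustration; not aioscraper's real defaults.
    concurrent_requests: int = 8
    pending_requests: int = 1


def pick(flag: Optional[int], env: Mapping[str, str], env_name: str, default: int) -> int:
    # CLI flag wins, then the environment variable, then the default.
    if flag is not None:
        return flag
    if env_name in env:
        return int(env[env_name])
    return default


def resolve_settings(
    env: Mapping[str, str],
    concurrent_requests: Optional[int] = None,
    pending_requests: Optional[int] = None,
) -> SchedulerSettings:
    defaults = SchedulerSettings()
    return SchedulerSettings(
        concurrent_requests=pick(
            concurrent_requests, env, "SCHEDULER_CONCURRENT_REQUESTS", defaults.concurrent_requests
        ),
        pending_requests=pick(
            pending_requests, env, "SCHEDULER_PENDING_REQUESTS", defaults.pending_requests
        ),
    )


env = {"SCHEDULER_CONCURRENT_REQUESTS": "32", "SCHEDULER_PENDING_REQUESTS": "4"}
# The flag value (64) overrides the environment (32); pending_requests falls back to the env value.
settings = resolve_settings(env, concurrent_requests=64)
```

Passing the environment in as a mapping, instead of reading os.environ inside the function, keeps the resolution step easy to test.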
CLI flags¶
--concurrent-requests: Max concurrent requests (overrides SCHEDULER_CONCURRENT_REQUESTS).
--pending-requests: Pending requests to keep queued (overrides SCHEDULER_PENDING_REQUESTS).
Environment variables¶
All environment variables map directly to fields in Config and its nested configuration classes.
The CLI reads these variables automatically. For programmatic use, call load_config to read environment variables and construct a Config instance.
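As a rough picture of how prefixed variable names can map onto nested configuration fields, the sketch below fills small dataclasses from an environment mapping. It does not reproduce load_config: the section classes, their defaults, the field types, and the accepted boolean spellings are all assumptions made for illustration.

```python
from dataclasses import dataclass, field, fields
from typing import Mapping


@dataclass
class SchedulerSection:
    concurrent_requests: int = 8   # placeholder default
    close_timeout: float = 1.0     # placeholder default


@dataclass
class PipelineSection:
    strict: bool = True            # placeholder default


@dataclass
class Settings:
    scheduler: SchedulerSection = field(default_factory=SchedulerSection)
    pipeline: PipelineSection = field(default_factory=PipelineSection)


def coerce(raw: str, kind: type):
    # Booleans need explicit parsing: bool("false") would be True.
    if kind is bool:
        return raw.strip().lower() in {"1", "true", "yes", "on"}
    return kind(raw)


def load_section(section, prefix: str, env: Mapping[str, str]):
    # PREFIX_FIELD_NAME in the environment overrides section.field_name.
    for f in fields(section):
        raw = env.get(f"{prefix}_{f.name.upper()}")
        if raw is not None:
            # Use the type of the current default to decide how to parse the string.
            kind = type(getattr(section, f.name))
            setattr(section, f.name, coerce(raw, kind))
    return section


def load_settings(env: Mapping[str, str]) -> Settings:
    settings = Settings()
    load_section(settings.scheduler, "SCHEDULER", env)
    load_section(settings.pipeline, "PIPELINE", env)
    return settings


settings = load_settings({"SCHEDULER_CONCURRENT_REQUESTS": "16", "PIPELINE_STRICT": "false"})
```

Fields without a matching variable keep their defaults, which mirrors the last step of the precedence order described above.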
SessionConfig¶
HTTP session and client behavior.
SESSION_REQUEST_TIMEOUT → timeout
SESSION_SSL → ssl
SESSION_PROXY → proxy (docs)
SESSION_HTTP_BACKEND → http_backend
RequestRetryConfig¶
Retry middleware behavior (docs).
SESSION_RETRY_ENABLED → enabled
SESSION_RETRY_ATTEMPTS → attempts
SESSION_RETRY_BACKOFF → backoff
SESSION_RETRY_BASE_DELAY → base_delay
SESSION_RETRY_MAX_DELAY → max_delay
SESSION_RETRY_STATUSES → statuses
SESSION_RETRY_EXCEPTIONS → exceptions
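To show how attempts, base_delay, and max_delay typically interact, here is a small sketch of capped exponential backoff. It assumes an exponential strategy purely for illustration; this page does not list which strategies the backoff field accepts, and backoff_delays is a hypothetical helper, not part of aioscraper.

```python
def backoff_delays(attempts: int, base_delay: float, max_delay: float) -> list[float]:
    # Delay before retry n (0-based) doubles each time and is capped at max_delay.
    return [min(max_delay, base_delay * (2 ** n)) for n in range(attempts)]


delays = backoff_delays(attempts=5, base_delay=0.5, max_delay=4.0)
print(delays)  # [0.5, 1.0, 2.0, 4.0, 4.0]
```

The cap matters once the doubled delay exceeds max_delay: from that point every further retry waits the same maximum time.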
RateLimitConfig¶
Rate limiting behavior (docs).
SESSION_RATE_LIMIT_ENABLED → enabled
SESSION_RATE_LIMIT_INTERVAL → default_interval
SESSION_RATE_LIMIT_CLEANUP_TIMEOUT → cleanup_timeout
AdaptiveRateLimitConfig¶
Adaptive rate limiting (EWMA + AIMD) (docs).
Set SESSION_RATE_LIMIT_ADAPTIVE_ENABLED=true to enable adaptive rate limiting, then tune its behavior with the remaining variables:
SESSION_RATE_LIMIT_ADAPTIVE_MIN_INTERVAL → min_interval
SESSION_RATE_LIMIT_ADAPTIVE_MAX_INTERVAL → max_interval
SESSION_RATE_LIMIT_ADAPTIVE_INCREASE_FACTOR → increase_factor
SESSION_RATE_LIMIT_ADAPTIVE_DECREASE_STEP → decrease_step
SESSION_RATE_LIMIT_ADAPTIVE_SUCCESS_THRESHOLD → success_threshold
SESSION_RATE_LIMIT_ADAPTIVE_EWMA_ALPHA → ewma_alpha
SESSION_RATE_LIMIT_ADAPTIVE_RESPECT_RETRY_AFTER → respect_retry_after
SESSION_RATE_LIMIT_ADAPTIVE_INHERIT_RETRY_TRIGGERS → inherit_retry_triggers
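For readers unfamiliar with EWMA and AIMD, the sketch below shows one plausible way the parameters above could fit together. It is not aioscraper's algorithm: the AdaptiveInterval class, its default values, and the exact meaning given to each field are assumptions. Here a failure multiplies the interval by increase_factor, and every success_threshold consecutive successes subtract decrease_step. The EWMA of latency is only tracked, since this page does not say how the library feeds it into the interval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdaptiveInterval:
    interval: float
    # Placeholder defaults; the field names mirror the config, the values are made up.
    min_interval: float = 0.1
    max_interval: float = 10.0
    increase_factor: float = 2.0
    decrease_step: float = 0.1
    success_threshold: int = 3
    ewma_alpha: float = 0.3
    smoothed_latency: Optional[float] = None
    _streak: int = 0

    def observe(self, latency: float, ok: bool) -> float:
        # EWMA: blend the new latency sample into the running average.
        if self.smoothed_latency is None:
            self.smoothed_latency = latency
        else:
            a = self.ewma_alpha
            self.smoothed_latency = a * latency + (1 - a) * self.smoothed_latency

        if not ok:
            # Multiplicative back-off: widen the interval sharply on failure.
            self._streak = 0
            self.interval = min(self.max_interval, self.interval * self.increase_factor)
        else:
            self._streak += 1
            if self._streak >= self.success_threshold:
                # Additive recovery: shrink the interval by a fixed step.
                self._streak = 0
                self.interval = max(self.min_interval, self.interval - self.decrease_step)
        return self.interval


limiter = AdaptiveInterval(interval=1.0)
after_failure = limiter.observe(latency=0.2, ok=False)
for _ in range(3):
    after_recovery = limiter.observe(latency=0.2, ok=True)
```

Backing off multiplicatively while recovering additively means the limiter reacts quickly to trouble and speeds up again only gradually.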
SchedulerConfig¶
Request scheduler behavior.
SCHEDULER_CONCURRENT_REQUESTS → concurrent_requests
SCHEDULER_PENDING_REQUESTS → pending_requests
SCHEDULER_CLOSE_TIMEOUT → close_timeout
SCHEDULER_READY_QUEUE_MAX_SIZE → ready_queue_max_size
ExecutionConfig¶
Execution and shutdown behavior.
EXECUTION_TIMEOUT → timeout
EXECUTION_SHUTDOWN_TIMEOUT → shutdown_timeout
EXECUTION_SHUTDOWN_CHECK_INTERVAL → shutdown_check_interval
EXECUTION_LOG_LEVEL → log_level
PipelineConfig¶
Pipeline dispatching behavior.
PIPELINE_STRICT → strict