Quickstart
==========

Build your first API data collector in minutes. You'll fetch data from GitHub's REST API, extract repository stats, aggregate them in a pipeline, and run everything from the CLI.

Before you start, install ``aioscraper`` with an HTTP backend; see :doc:`installation` for details.

Create your first scraper
-------------------------

Save this as ``scraper.py``:

.. code-block:: python

    import logging
    from dataclasses import dataclass

    from aioscraper import AIOScraper, Request, Response, SendRequest, Pipeline

    logger = logging.getLogger("github_repos")
    scraper = AIOScraper()


    @dataclass(slots=True)
    class RepoStats:
        """Data model for extracted repository stats."""

        name: str
        stars: int
        language: str


    # This decorator registers the pipeline to handle RepoStats items.
    @scraper.pipeline(RepoStats)
    class StatsPipeline:
        """Pipeline for processing extracted repository data."""

        def __init__(self):
            self.total_stars = 0

        async def put_item(self, item: RepoStats) -> RepoStats:
            """
            Called for each extracted item.

            This is where you'd:
            - Save to database
            - Send to message queue
            - Perform validation/transformation
            - Aggregate statistics
            """
            self.total_stars += item.stars
            logger.info("✓ %s: ⭐ %s (%s)", item.name, item.stars, item.language)
            return item

        async def close(self):
            """
            Called when scraper shuts down.

            Use for:
            - Final aggregations
            - Closing database connections
            - Cleanup operations
            """
            logger.info("Total stars collected: %s", self.total_stars)


    # This decorator marks the function as the scraper's entry point.
    @scraper
    async def get_repos(send_request: SendRequest):
        """
        Entry point: defines what to scrape.

        Receives send_request - a function to schedule HTTP requests.
        """
        repos = (
            "django/django",
            "fastapi/fastapi",
            "pallets/flask",
            "encode/httpx",
            "aio-libs/aiohttp",
        )

        for repo in repos:
            await send_request(
                Request(
                    url=f"https://api.github.com/repos/{repo}",  # API endpoint
                    callback=parse_repo,  # Success handler
                    errback=on_failure,  # Error handler (network failures, timeouts)
                    cb_kwargs={"repo": repo},  # Additional arguments to pass to callbacks
                    headers={"Accept": "application/vnd.github+json"},  # Media type recommended by GitHub's REST API
                )
            )


    async def parse_repo(response: Response, pipeline: Pipeline):
        """
        Success callback: parse response and extract data.

        The `pipeline` dependency is automatically injected by aioscraper.
        """
        data = await response.json()  # Parse JSON response from API

        await pipeline(  # Send extracted item to pipeline
            RepoStats(
                name=data["full_name"],
                stars=data["stargazers_count"],
                # GitHub returns null when no language is detected, so a
                # .get() default alone would not apply here.
                language=data.get("language") or "Unknown",
            )
        )


    async def on_failure(exc: Exception, repo: str):
        """
        Error callback: handle request/processing failures.

        Use for:
        - Logging errors
        - Sending alerts
        - Custom retry logic
        """
        logger.error("%s: request failed: %s", repo, exc)

Run it
------

Execute your scraper from the command line:

.. code-block:: bash

    aioscraper scraper --concurrent-requests=4

The ``--concurrent-requests`` flag controls how many requests run simultaneously. Without it, the default concurrency limit of 64 applies.

What happens when it runs
-------------------------

1. **CLI loads the module**: The ``aioscraper`` command finds your ``scraper.py`` file and locates the ``AIOScraper`` instance.

2. **Entry point executes**: Your ``get_repos()`` function runs and schedules 5 requests to GitHub's API.

3. **Concurrent execution**: The requests execute concurrently, up to the limit set by ``--concurrent-requests`` (4 in the command above). The async HTTP client makes non-blocking calls, so responses can arrive in any order.

4. **Callbacks process responses**:

   - If successful: ``parse_repo()`` extracts data and sends it to the pipeline
   - If failed: ``on_failure()`` logs the error

5. **Pipeline processes items**: ``StatsPipeline.put_item()`` runs for each ``RepoStats`` item, aggregating the total stars.

6. **Cleanup on shutdown**: After all requests complete, ``StatsPipeline.close()`` logs the total stars collected.

Customize for your use case
----------------------------

**Change the API**

Replace the GitHub API with your target API:

.. code-block:: python

    await send_request(
        Request(
            url="https://api.example.com/products",
            callback=parse_product,
            headers={"Authorization": "Bearer YOUR_TOKEN"},
        )
    )

**Add query parameters**

Use the ``params`` argument:

.. code-block:: python

    Request(
        url="https://api.example.com/search",
        params={"q": "python", "limit": 100},
        callback=parse_results,
    )

**Save to database**

In ``put_item()``, use your ORM or database client:

.. code-block:: python

    async def put_item(self, item: RepoStats) -> RepoStats:
        await self.db.execute(
            "INSERT INTO repos (name, stars, language) VALUES (?, ?, ?)",
            (item.name, item.stars, item.language)
        )
        return item

**Handle pagination**

Send follow-up requests from callbacks:

.. code-block:: python

    async def parse_page(response: Response, send_request: SendRequest, page: int):
        data = await response.json()

        # Process items...

        if data.get("next_page"):
            await send_request(
                Request(
                    url=data["next_page"],
                    callback=parse_page,
                    cb_kwargs={"page": page + 1},
                )
            )

Production configuration
------------------------

For production use, configure retries, rate limiting, and concurrency via environment variables:

.. code-block:: bash

    # Enable retries for transient failures
    export RETRY_ENABLED=true
    export RETRY_MAX_ATTEMPTS=3

    # Enable rate limiting
    export RATE_LIMIT_ENABLED=true
    export RATE_LIMIT_DEFAULT_INTERVAL=1.0

    # Set concurrency
    export SCHEDULER_CONCURRENT_REQUESTS=10

    aioscraper scraper

See :doc:`cli` for all available configuration options.

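
To build intuition for what the three groups of production settings control, the following standard-library sketch applies a retry budget, a minimum interval between request starts, and a concurrency cap to arbitrary coroutines. It is not how ``aioscraper`` implements these features. The helper names ``read_settings`` and ``run_all`` are hypothetical, and the parsing rules, the fallback values other than the concurrency default of 64, and the reading of ``RETRY_MAX_ATTEMPTS`` as a total attempt count are all assumptions made for illustration.

.. code-block:: python

    import asyncio
    import os
    from collections.abc import Awaitable, Callable, Mapping


    def read_settings(env: Mapping[str, str]) -> dict:
        """Read the variables exported above (parsing rules are assumed)."""

        def flag(name: str) -> bool:
            return env.get(name, "false").strip().lower() == "true"

        retries_on = flag("RETRY_ENABLED")
        limit_on = flag("RATE_LIMIT_ENABLED")
        return {
            # Assumption: the value counts total attempts; disabled means one try.
            "attempts": max(1, int(env.get("RETRY_MAX_ATTEMPTS", "1"))) if retries_on else 1,
            # Seconds between request starts; 0.0 when rate limiting is off.
            "interval": float(env.get("RATE_LIMIT_DEFAULT_INTERVAL", "0")) if limit_on else 0.0,
            # 64 is the default concurrency limit mentioned under "Run it".
            "concurrency": int(env.get("SCHEDULER_CONCURRENT_REQUESTS", "64")),
        }


    async def run_all(jobs: list[Callable[[], Awaitable[object]]], settings: dict) -> list:
        """Run zero-argument coroutine functions under the three limits."""
        slots = asyncio.Semaphore(settings["concurrency"])  # concurrency cap
        pace = asyncio.Lock()  # serializes request starts

        async def run_one(job):
            error = None
            for _ in range(settings["attempts"]):
                async with slots:
                    async with pace:
                        # Holding the lock while sleeping keeps consecutive
                        # starts at least `interval` seconds apart.
                        await asyncio.sleep(settings["interval"])
                    try:
                        return await job()
                    except Exception as exc:  # treated as a transient failure
                        error = exc
            return error  # retry budget exhausted: hand back the last error

        # gather() returns results in the same order as the input jobs.
        return await asyncio.gather(*(run_one(job) for job in jobs))


    if __name__ == "__main__":

        async def fake_request() -> str:
            await asyncio.sleep(0)  # stands in for an HTTP call
            return "ok"

        print(asyncio.run(run_all([fake_request] * 5, read_settings(os.environ))))

In this sketch a job that keeps failing yields its last exception as a result instead of raising, which loosely mirrors how a failed request ends up in an error callback rather than stopping the run.
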
Next steps
----------

- **Learn about pipelines**: See :doc:`concepts/pipelines` for advanced item processing, error handling, and multiple pipelines.
- **Add middlewares**: See :doc:`concepts/middlewares` for request/response transformation, auth, logging, and circuit breaking.
- **Manage resources**: See :doc:`concepts/lifespan` for setting up database connections, external services, and cleanup.
- **Dependency injection**: See :doc:`concepts/wiring` to inject custom dependencies into callbacks and pipelines.
- **Configuration**: See :doc:`concepts/config` for programmatic configuration and advanced settings.