Welcome to the Advanced Web Scraper API documentation. It covers the architecture, core components, and features of the API in detail.
The Advanced Web Scraper API follows a modular monolith architecture with clear boundaries between components. This approach provides the benefits of a microservices architecture (separation of concerns, independent development) while avoiding the complexity of distributed systems during initial development.
```mermaid
graph TD
    Client[Client Applications] --> API[API Layer]
    API --> Scraper[Scraper Module]
    API --> Navigation[Navigation Module]
    API --> Captcha[CAPTCHA Module]
    API --> Proxy[Proxy Module]
    API --> AI[AI Module]
    Scraper --> Browser[Browser Automation]
    Scraper --> Extraction[Data Extraction]
    Scraper --> Storage[Data Storage]
    Navigation --> Browser
    Navigation --> Extraction
    Captcha --> Browser
    Browser --> Human[Human Emulation]
    Browser --> Proxy
    AI --> LLM[(External LLM API)]
    subgraph Data_Storage[Data Storage]
        MemoryStorage[Memory Storage]
        FileStorage[File Storage]
        MongoDB[(MongoDB)]
        Redis[(Redis)]
        ApiStorage[API Storage]
    end
    Storage --> MemoryStorage
    Storage --> FileStorage
    Storage --> MongoDB
    Storage --> Redis
    Storage --> ApiStorage
    AI --> Storage
```
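The modular-monolith boundaries in the diagram can be sketched as plain in-process interfaces. The following TypeScript is illustrative only: the interface and class names (`ScraperModule`, `ApiLayer`, `ScrapeRequest`, and so on) are assumptions for the sketch, not the project's real API.

```typescript
// Hypothetical sketch of the module boundaries shown in the diagram above.
// All names and signatures here are illustrative assumptions.

interface ScrapeRequest {
  url: string;
  selectors: Record<string, string>;
}

interface ScrapeResult {
  url: string;
  data: Record<string, string>;
}

// Each module exposes a narrow interface; callers depend only on the
// interface, never on module internals.
interface ScraperModule {
  scrape(req: ScrapeRequest): Promise<ScrapeResult>;
}

// In a modular monolith, "routing" to a module is an ordinary in-process
// call, so the boundary costs nothing at runtime but keeps modules
// independently replaceable (e.g. by a service later).
class ApiLayer {
  constructor(private scraper: ScraperModule) {}

  async handle(req: ScrapeRequest): Promise<ScrapeResult> {
    // Authentication, validation, and rate limiting would run here
    // before the request is dispatched to the module.
    return this.scraper.scrape(req);
  }
}

// A stub standing in for the real scraper module.
const stubScraper: ScraperModule = {
  async scrape(req: ScrapeRequest): Promise<ScrapeResult> {
    return { url: req.url, data: { title: "stub" } };
  },
};

const api = new ApiLayer(stubScraper);
api
  .handle({ url: "https://example.com", selectors: { title: "h1" } })
  .then((res) => console.log(res.url));
```

Because the API layer sees only the `ScraperModule` interface, the scraper's internals (browser automation, extraction, storage) can evolve without touching the HTTP-facing code.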
The Advanced Web Scraper API consists of the following core components:
- **API Layer**: Handles HTTP requests and responses, manages authentication and authorization, implements rate limiting and request validation, and routes requests to appropriate modules.
- **Scraper Module**: Coordinates the scraping process, manages browser instances, handles data extraction and transformation, and stores results in the database.
- **Navigation Module**: Implements multi-step navigation flows, manages state during navigation, handles pagination and crawling, and executes conditional logic.
- **CAPTCHA Module**: Detects various types of CAPTCHAs, implements solving strategies, integrates with external solving services, and manages token application.
- **Proxy Module**: Manages a pool of proxies, implements rotation strategies, monitors proxy health and performance, and handles authentication and session management.
- **Browser Automation**: Controls browser instances, manages browser contexts and pages, implements stealth techniques, and handles resource optimization.
- **Human Emulation**: Simulates human-like behavior, implements realistic mouse movements, creates variable timing patterns, and adds randomization to interactions.
- **Data Extraction**: Implements various selector strategies, extracts structured data from pages, transforms and cleans extracted data, and validates against schemas.
- **Data Storage**: Provides a flexible and extensible system for storing, retrieving, updating, and deleting extraction results, with support for multiple storage destinations.
- **AI Module** (new): Handles interactions with Large Language Models (LLMs) to provide features like configuration generation from natural language prompts.
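The Data Storage component's support for multiple destinations is typically achieved with a single adapter interface that each backend implements. The sketch below illustrates that pattern; the `StorageAdapter` interface, `MemoryStorage` class, and `ExtractionResult` shape are assumed names for illustration, not the project's actual types.

```typescript
// Hypothetical sketch of a pluggable storage layer, as described above.
// Names and shapes are illustrative assumptions.

interface ExtractionResult {
  id: string;
  url: string;
  data: unknown;
}

// One interface, many destinations: memory, file, MongoDB, Redis, or a
// remote API can each implement this contract.
interface StorageAdapter {
  store(result: ExtractionResult): Promise<void>;
  retrieve(id: string): Promise<ExtractionResult | null>;
  delete(id: string): Promise<boolean>;
}

// In-memory backend: handy for tests and local development.
class MemoryStorage implements StorageAdapter {
  private items = new Map<string, ExtractionResult>();

  async store(result: ExtractionResult): Promise<void> {
    this.items.set(result.id, result);
  }

  async retrieve(id: string): Promise<ExtractionResult | null> {
    return this.items.get(id) ?? null;
  }

  async delete(id: string): Promise<boolean> {
    return this.items.delete(id);
  }
}

async function demo(): Promise<void> {
  // Callers hold a StorageAdapter, so swapping MemoryStorage for a
  // MongoDB- or Redis-backed adapter requires no caller changes.
  const storage: StorageAdapter = new MemoryStorage();
  await storage.store({
    id: "1",
    url: "https://example.com",
    data: { title: "Example" },
  });
  const found = await storage.retrieve("1");
  console.log(found?.url); // https://example.com
}
demo();
```

Keeping every backend behind the same async interface also lets in-memory storage stand in for the real database during unit tests.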
For information on how to get started with the Advanced Web Scraper API, please refer to the main README file.
Contributions to the Advanced Web Scraper API are welcome. Please refer to the contributing guidelines for more information.
This project is licensed under the MIT License - see the LICENSE file for details.