adv-web-scraper-api

Navigation API Documentation

This document describes the endpoints for executing multi-step browser navigation flows, crawling websites, and retrieving navigation results.

Base Path

/api/v1/navigate

Common Response Format

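Successful responses share a common structure. Some endpoints place fields at the top level as well as, or instead of, data: jobId and statusUrl for Execute Navigation Flow, count for List Navigation Results. The general shape is: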

{
  "success": true,
  "message": "Operation successful",
  "data": { /* Endpoint-specific data */ },
  "timestamp": "2023-10-27T10:00:00.000Z"
}

Error responses also follow a common structure:

{
  "success": false,
  "message": "Operation failed",
  "error": "Detailed error message",
  "timestamp": "2023-10-27T10:00:00.000Z"
}
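
A client can tell the two envelopes apart using only the success flag. The following is a minimal sketch of that check, assuming the body arrives as a JSON string; the names ApiError and unwrap are hypothetical and not part of the API.

```python
import json


class ApiError(Exception):
    """Raised when a response envelope reports "success": false."""


def unwrap(raw):
    """Parse a response body and return the envelope, or raise ApiError."""
    envelope = json.loads(raw)
    if not envelope.get("success"):
        # Error envelopes carry a summary in "message" and detail in "error".
        raise ApiError(f'{envelope.get("message")}: {envelope.get("error")}')
    return envelope


ok = unwrap(
    '{"success": true, "message": "Operation successful", '
    '"data": {}, "timestamp": "2023-10-27T10:00:00.000Z"}'
)
print(ok["message"])
```

Raising on failure keeps the calling code on the success path and surfaces both the message and error fields in one place.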

Endpoints

1. Execute Navigation Flow

Queues a job to execute a predefined sequence of browser actions (steps) starting from a given URL.

Request Body:

{
  "startUrl": "https://example.com",
  "steps": [
    { "type": "goto", "url": "https://example.com/login" },
    { "type": "type", "selector": "#username", "text": "user" },
    { "type": "type", "selector": "#password", "text": "password" },
    { "type": "click", "selector": "button[type='submit']" },
    { "type": "waitForNavigation" },
    {
      "type": "extract",
      "name": "userData",
      "selector": ".user-profile",
      "fields": {
        "name": ".name",
        "email": ".email"
      }
    }
  ],
  "variables": {
    "searchTerm": "example product"
  },
  "options": {
    "timeout": 60000,
    "proxy": true,
    "solveCaptcha": true,
    "humanEmulation": true,
    "screenshots": true,
    "screenshotsPath": "/path/to/screenshots",
    "useSession": true,
    "alwaysCheckCaptcha": false,
    "javascript": true
  }
}

Request Body Parameters:

Successful Response (202 Accepted):

Indicates the job was successfully queued.

{
  "success": true,
  "message": "Navigation job queued successfully",
  "jobId": "nav_1678886400000",
  "statusUrl": "/api/jobs/nav_1678886400000",
  "timestamp": "2023-10-27T10:00:00.000Z"
}
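
As an illustration, the request and the 202 response above could be handled from Python roughly as follows. This is a sketch under stated assumptions: the host in BASE_URL is hypothetical, the request is assumed to be a POST to the base path (as the Step Types section suggests), and the helper names build_navigation_request and job_reference are invented for this example. The request is built but not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:3000"  # hypothetical host; adjust to your deployment
NAVIGATE_PATH = "/api/v1/navigate"  # base path from this document


def build_navigation_request(start_url, steps, options=None):
    """Build (but do not send) a POST request that queues a navigation flow."""
    payload = {"startUrl": start_url, "steps": steps}
    if options:
        payload["options"] = options
    return urllib.request.Request(
        BASE_URL + NAVIGATE_PATH,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def job_reference(response_body):
    """Extract (jobId, statusUrl) from a 202 Accepted body."""
    body = json.loads(response_body)
    return body["jobId"], body["statusUrl"]


request = build_navigation_request(
    "https://example.com",
    [{"type": "goto", "url": "https://example.com/login"}],
    {"timeout": 60000},
)
# To send it, call urllib.request.urlopen(request) and pass the decoded
# response body to job_reference.
```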

Error Responses:

2. Retrieve Navigation Result

Fetches the result of a navigation job by its ID. The status field of the result is "completed", "failed", or "partial".

Path Parameters:

Successful Response (200 OK):

{
  "success": true,
  "message": "Navigation result retrieved successfully",
  "data": {
    "id": "nav_1678886400000",
    "url": "https://example.com",
    "status": "completed", // "completed", "failed", "partial"
    "stepsExecuted": 6,
    "data": {
      "userData": {
        "name": "John Doe",
        "email": "john.doe@example.com"
      }
    },
    "timestamp": "2023-10-27T10:05:00.000Z",
    "error": null // Contains error message if status is 'failed'
  },
  "timestamp": "2023-10-27T10:06:00.000Z"
}
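
A caller will usually branch on the status field before reading the extracted data. The sketch below shows one way to do that on an already-parsed envelope. It assumes that "partial" results still carry whatever data was extracted, which this document does not confirm; the function name extracted_data is hypothetical.

```python
def extracted_data(envelope):
    """Return the extracted data of a result envelope, or raise if the job failed."""
    result = envelope["data"]
    if result["status"] == "failed":
        # For failed jobs the "error" field holds the error message.
        raise RuntimeError(f"job {result['id']} failed: {result['error']}")
    # Assumption: "completed" and "partial" results both expose "data".
    return result.get("data") or {}


sample = {
    "success": True,
    "data": {
        "id": "nav_1678886400000",
        "status": "completed",
        "data": {"userData": {"name": "John Doe", "email": "john.doe@example.com"}},
        "error": None,
    },
}
print(extracted_data(sample)["userData"]["name"])
```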

Error Responses:

3. Start Crawling Job

Initiates a crawling process starting from a URL, automatically discovering and scraping linked pages based on selectors.

Request Body:

{
  "startUrl": "https://example.com/products",
  "maxPages": 10,
  "selectors": {
    "itemSelector": ".product-item",
    "fields": {
      "name": ".product-name",
      "price": ".product-price",
      "link": "a@href"
    }
  },
  "filters": {
    "nextPageSelector": "a.next-page",
    "waitForSelector": ".products-loaded"
  },
  "options": {
    "timeout": 300000,
    "proxy": true,
    "solveCaptcha": false,
    "humanEmulation": false,
    "screenshots": false,
    "useSession": false,
    "javascript": true
  }
}
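
The nesting of selectors and filters in the body above is easy to get wrong by hand, so a small builder can help. This is only a sketch: build_crawl_payload is a hypothetical helper, the default of 10 pages is copied from the example rather than from any documented default, and the maxPages check is a client-side assumption, not a documented server rule.

```python
import json


def build_crawl_payload(start_url, item_selector, fields, max_pages=10,
                        next_page_selector=None, wait_for_selector=None):
    """Assemble a crawl request body in the shape of the example above."""
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    payload = {
        "startUrl": start_url,
        "maxPages": max_pages,
        "selectors": {"itemSelector": item_selector, "fields": dict(fields)},
    }
    # Only include filter keys that were actually provided.
    filters = {
        "nextPageSelector": next_page_selector,
        "waitForSelector": wait_for_selector,
    }
    filters = {key: value for key, value in filters.items() if value is not None}
    if filters:
        payload["filters"] = filters
    return payload


payload = build_crawl_payload(
    "https://example.com/products",
    ".product-item",
    {"name": ".product-name", "price": ".product-price", "link": "a@href"},
    next_page_selector="a.next-page",
)
print(json.dumps(payload, indent=2))
```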

Request Body Parameters:

Successful Response (202 Accepted):

Indicates the crawl job was started.

{
  "success": true,
  "message": "Crawling job started",
  "data": {
    "id": "crawl_1678887000000",
    "startUrl": "https://example.com/products",
    "maxPages": 10,
    "status": "pending", // Initial status
    "timestamp": "2023-10-27T10:10:00.000Z"
  },
  "timestamp": "2023-10-27T10:10:00.000Z"
}

Error Responses:

4. List Navigation Results

Retrieves a list of past navigation and crawling results, with filtering and pagination options.

Query Parameters:

Successful Response (200 OK):

{
  "success": true,
  "message": "Navigation results retrieved successfully",
  "data": [
    {
      "id": "crawl_1678887000000",
      "url": "https://example.com/products",
      "status": "completed",
      "stepsExecuted": 10, // Corresponds to pages crawled in this context
      "data": { /* Aggregated crawl data */ },
      "timestamp": "2023-10-27T10:20:00.000Z",
      "error": null
    },
    {
      "id": "nav_1678886400000",
      "url": "https://example.com",
      "status": "completed",
      "stepsExecuted": 6,
      "data": { /* Extracted navigation data */ },
      "timestamp": "2023-10-27T10:05:00.000Z",
      "error": null
    }
    // ... other results
  ],
  "count": 2, // Total number of results returned in this response
  "timestamp": "2023-10-27T10:30:00.000Z"
}
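
A list response can be summarized client-side, for example by tallying results per status. The sketch below works on an already-parsed envelope; tally_statuses is a hypothetical name, and the consistency check relies on count meaning the number of items in this response, as the example's comment states.

```python
from collections import Counter


def tally_statuses(envelope):
    """Count the results in a list response by their status value."""
    results = envelope["data"]
    # In the example, "count" equals the number of items in "data".
    if envelope.get("count") != len(results):
        raise ValueError("count does not match the number of results")
    return Counter(item["status"] for item in results)


listing = {
    "success": True,
    "data": [
        {"id": "crawl_1678887000000", "status": "completed"},
        {"id": "nav_1678886400000", "status": "completed"},
    ],
    "count": 2,
}
print(dict(tally_statuses(listing)))
```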

Error Responses:


Step Types

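The steps array in the POST / request (Execute Navigation Flow) consists of objects, each defining one action. The properties used in the example request above are:

type: the action to perform (goto, type, click, waitForNavigation, or extract in the example)
url: the target URL of a goto step
selector: the selector of the element that a type, click, or extract step acts on
text: the text entered by a type step
name: the key under which an extract step stores its output in the result data
fields: for an extract step, a map of output field names to selectors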

(Note: A full definition of all step types and their parameters belongs in a separate, more detailed document, which should be linked here.)
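
Until that reference exists, a client can at least check its steps against the shapes shown in the example request. The table below is inferred from that example only and is not an authoritative schema: the server may accept other step types or optional properties. REQUIRED_KEYS and check_steps are hypothetical names.

```python
# Required keys per step type, inferred from the example request only.
REQUIRED_KEYS = {
    "goto": {"url"},
    "type": {"selector", "text"},
    "click": {"selector"},
    "waitForNavigation": set(),
    "extract": {"name", "selector", "fields"},
}


def check_steps(steps):
    """Return a list of human-readable problems; an empty list means none found."""
    problems = []
    for index, step in enumerate(steps):
        kind = step.get("type")
        if kind not in REQUIRED_KEYS:
            problems.append(f"step {index}: type {kind!r} is not in the example")
            continue
        missing = REQUIRED_KEYS[kind] - set(step)
        if missing:
            problems.append(f"step {index}: missing {sorted(missing)}")
    return problems


steps = [
    {"type": "goto", "url": "https://example.com/login"},
    {"type": "type", "selector": "#username", "text": "user"},
    {"type": "click", "selector": "button[type='submit']"},
    {"type": "waitForNavigation"},
]
print(check_steps(steps))
```

Returning a list of problems rather than raising on the first one lets a caller report every issue in a flow at once.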