Documentation

CLI

Command-line interface for ScrapingBee

Installation

Recommended — install with uv (no virtual environment needed):

curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install scrapingbee-cli

Alternative — install with pip in a virtual environment:

pip install scrapingbee-cli

Verify the installation:

scrapingbee --version

Authentication

Save your API key so all commands can use it automatically.

Interactive prompt (recommended for first-time setup):

scrapingbee auth

Non-interactive (CI/CD, scripts):

scrapingbee auth --api-key YOUR_API_KEY

Environment variable (alternative — no file stored):

export SCRAPINGBEE_API_KEY=YOUR_API_KEY

The CLI also reads .env files in the current directory.
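For example, a minimal .env file in the working directory (the key value is a placeholder):

```shell
# .env file read automatically by the CLI from the current directory
SCRAPINGBEE_API_KEY=YOUR_API_KEY
```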

Show stored key location:

scrapingbee auth --show

Remove stored key:

scrapingbee logout

Credits and Plan

Check your API credit balance and plan concurrency:

scrapingbee usage

Example response:

{
  "max_api_credit": 1000000,
  "used_api_credit": 42150,
  "max_concurrency": 100,
  "current_concurrency": 0,
  "renewal_subscription_date": "2025-07-26T04:57:13.580067"
}
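Since the response is plain JSON, remaining headroom is simple arithmetic over these fields. A quick Python sketch using the example response above:

```python
import json

# Example usage response from the docs above
usage = json.loads("""{
  "max_api_credit": 1000000,
  "used_api_credit": 42150,
  "max_concurrency": 100,
  "current_concurrency": 0,
  "renewal_subscription_date": "2025-07-26T04:57:13.580067"
}""")

# Remaining credits and free concurrency slots
remaining = usage["max_api_credit"] - usage["used_api_credit"]
free_slots = usage["max_concurrency"] - usage["current_concurrency"]
print(remaining, free_slots)  # 957850 100
```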

scrape

The scrape command calls the HTML API to fetch any web page. CLI flags map 1:1 to API parameters (underscores become hyphens: render_js → --render-js). For predefined values like sort orders, both hyphens and underscores are accepted (e.g. --sort-by price-low and --sort-by price_low both work).

See the HTML API documentation — every code snippet includes a CLI tab with the equivalent command.

Basic Usage

Scrape a page and print HTML to stdout:

scrapingbee scrape "https://example.com"

Save output to a file (extension auto-detected):

scrapingbee scrape "https://example.com" --output-file result

Scrape with JavaScript rendering and premium proxy:

scrapingbee scrape "https://example.com" --render-js true --premium-proxy true

Extract data with AI:

scrapingbee scrape "https://example.com" --ai-query "extract the main article title and author"

Return page as markdown (great for LLM pipelines):

scrapingbee scrape "https://example.com" --return-page-markdown true

Scrape Parameters

Scrape flags correspond directly to HTML API parameters. The table below groups them by category — click any parameter name to see full documentation on the HTML API page.

Rendering

Flag | API Parameter | Description
--render-js | render_js | Enable/disable JavaScript rendering
--js-scenario | js_scenario | JavaScript scenario to execute
--wait | wait | Wait time in ms before returning
--wait-for | wait_for | CSS/XPath selector to wait for
--wait-browser | wait_browser | Browser event to wait for
--block-ads | block_ads | Block ads on the page
--block-resources | block_resources | Block images and CSS
--window-width | window_width | Viewport width in pixels
--window-height | window_height | Viewport height in pixels

Proxy

Flag | API Parameter | Description
--premium-proxy | premium_proxy | Use premium/residential proxies (25 credits with JS)
--stealth-proxy | stealth_proxy | Use stealth proxies for hard-to-scrape sites (75 credits)
--country-code | country_code | Proxy country code (ISO 3166-1)
--own-proxy | own_proxy | Use your own proxy (user:pass@host:port)

Headers

Flag | API Parameter | Description
-H / --header | Custom headers | Add custom headers (repeatable: -H "Key:Value")
--forward-headers | forward_headers | Forward custom headers to target
--forward-headers-pure | forward_headers_pure | Forward only custom headers

Output Format

Flag | API Parameter | Description
--json-response | json_response | Wrap response in JSON
--return-page-source | return_page_source | Return original HTML before JS rendering
--return-page-markdown | return_page_markdown | Return content as markdown
--return-page-text | return_page_text | Return content as plain text

Screenshots

Flag | API Parameter | Description
--screenshot | screenshot | Capture a screenshot
--screenshot-selector | screenshot_selector | CSS selector for screenshot area
--screenshot-full-page | screenshot_full_page | Capture full-page screenshot

Extraction

Flag | API Parameter | Description
--extract-rules | extract_rules | CSS/XPath extraction rules as JSON
--ai-query | ai_query | Natural language extraction (+5 credits)
--ai-selector | ai_selector | CSS selector to focus AI extraction
--ai-extract-rules | ai_extract_rules | AI extraction rules as JSON (+5 credits)

Request

Flag | API Parameter | Description
--session-id | session_id | Session ID for sticky IP (0-10000000)
--timeout | timeout | Timeout in ms (1000-140000)
--cookies | cookies | Custom cookies
--device | device | Device type: desktop or mobile
--custom-google | custom_google | Scrape Google domains (true/false). 15 credits per request
--transparent-status-code | transparent_status_code | Return target's status code and body as-is (true/false)
-X / --method | HTTP method | GET, POST, or PUT
-d / --data | Request body | Request body for POST/PUT

Configuration

Flag | API Parameter | Description
--scraping-config | scraping_config | Apply a pre-saved scraping configuration by name

Scraping Configurations

Use --scraping-config to apply a pre-saved configuration from your ScrapingBee dashboard. This lets you reuse commonly used settings without typing them each time.

scrapingbee scrape "https://example.com" --scraping-config "My-Config"

Inline options override configuration settings — so you can use a saved config as a base and customize individual parameters per request:

scrapingbee scrape "https://example.com" --scraping-config "My-Config" --premium-proxy false

Create and manage configurations in the ScrapingBee request builder. Configuration names are case-sensitive and only accept alphanumeric characters, hyphens, and underscores.

Presets

Presets apply a predefined set of options. They only set flags you haven't already set, so you can override any preset value.

Preset | Description
screenshot | Capture a viewport screenshot (enables --screenshot and --render-js)
screenshot-and-html | Full-page screenshot + HTML in a single JSON response
fetch | Fast fetch without JavaScript rendering (--render-js false)
extract-links | Extract all <a href> links from the page as JSON
extract-emails | Extract all mailto: links from the page
extract-phones | Extract all tel: links from the page
scroll-page | Infinite scroll with JS rendering (loads lazy content)

scrapingbee scrape "https://example.com" --preset screenshot --output-file page

CLI-Only Scrape Flags

These flags are specific to the CLI and do not have API parameter equivalents.



Escalate Proxy

--escalate-proxy [flag] (false)

On 403 or 429 responses, automatically retry with premium proxy, then stealth proxy. Useful for sites with aggressive bot detection.

scrapingbee scrape "https://example.com" --escalate-proxy


Chunk Size

--chunk-size [integer] (0)

Split text/markdown output into chunks of N characters for LLM or vector DB pipelines. Outputs NDJSON (one JSON object per chunk). Set to 0 to disable.

scrapingbee scrape "https://example.com" --return-page-markdown true --chunk-size 2000 --chunk-overlap 200


Chunk Overlap

--chunk-overlap [integer] (0)

Number of overlapping characters between consecutive chunks. Only used when --chunk-size > 0.
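Together, --chunk-size and --chunk-overlap describe a sliding character window. A minimal Python sketch of that windowing, as an illustration of the described behavior rather than the CLI's actual algorithm:

```python
def chunk_text(text: str, size: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of `size` characters, each sharing
    `overlap` characters with the previous chunk."""
    if size <= 0:
        return [text]  # chunking disabled
    step = max(size - overlap, 1)
    return [text[i:i + size] for i in range(0, len(text), step)]

print(chunk_text("abcdefghij", size=4, overlap=1))
# ['abcd', 'defg', 'ghij', 'j']
```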



Force Extension

--force-extension [string]

Force the output file extension (e.g. html, json). Skips automatic extension inference when --output-file has no extension.

scrapingbee scrape "https://example.com" --output-file result --force-extension md

crawl

The crawl command follows links across pages using Scrapy under the hood. Three modes are available:

1. Quick crawl — start from URL(s), follow same-domain links:

scrapingbee crawl "https://example.com" --max-depth 2 --max-pages 50

2. Sitemap crawl — fetch all URLs from a sitemap:

scrapingbee crawl --from-sitemap "https://example.com/sitemap.xml" --max-pages 100

3. Project spider — run a Scrapy project spider with ScrapingBee middleware:

scrapingbee crawl my_spider --project ./my_scrapy_project

All scrape rendering, proxy, and extraction flags are also available for crawl (e.g. --render-js, --premium-proxy, --ai-query). Batch utility flags are also available: -H/--header, --retries, --backoff, --verbose, --output-file, --extract-field, --fields.

Quick Crawl

Start from one or more URLs and follow same-domain links. Each page is saved as a numbered file in the output directory, with a manifest.json mapping URLs to files.

scrapingbee crawl "https://docs.example.com" \
  --max-depth 3 \
  --max-pages 200 \
  --return-page-markdown true \
  --output-dir docs_crawl

Restrict crawling with URL patterns:

scrapingbee crawl "https://example.com" \
  --include-pattern "/blog/" \
  --exclude-pattern "/tag/" \
  --max-pages 50

Save only specific pages while crawling the full site for link discovery:

scrapingbee crawl "https://example.com" \
  --save-pattern "/product/" \
  --ai-query "extract the product name and price" \
  --max-pages 100

Sitemap Crawl

Fetch and parse a sitemap (including sitemap indexes) then crawl all discovered URLs:

scrapingbee crawl --from-sitemap "https://example.com/sitemap.xml" \
  --return-page-markdown true \
  --concurrency 20

Project Spider

Run any Scrapy spider from a project directory. ScrapingBee middleware and your API key are automatically injected:

scrapingbee crawl my_spider --project /path/to/scrapy/project --concurrency 10

The crawl command also supports --scraping-config to apply a pre-saved configuration from your dashboard. All scrape parameters (rendering, proxy, extraction) are passed to each page request.

Crawl Parameters

name [type] (default)
Description
target [string] required
One or more URLs to start crawling from, or a Scrapy spider name (project mode)
--allow-external-domains [boolean] (false)
Follow links to any domain instead of same-domain only. Quick-crawl only
--allowed-domains [string] ("")
Comma-separated list of domains to crawl (default is same domain as start URLs). Quick-crawl only
--autothrottle [boolean] (false)
Enable Scrapy AutoThrottle to automatically adapt the request rate based on server load
--concurrency [integer] (0)
Maximum concurrent requests (0 = auto-detect from your plan's concurrency limit)
--download-delay [float] ("")
Delay in seconds between requests (Scrapy DOWNLOAD_DELAY)
--exclude-pattern [string] ("")
Regex: skip URLs matching this pattern
--from-sitemap [string] ("")
Fetch URLs from a sitemap.xml and crawl them (URL or path to sitemap)
--include-pattern [string] ("")
Regex: only follow URLs matching this pattern
--max-depth [integer] (0)
Maximum link depth when following same-domain links (0 = unlimited). Quick-crawl only
--max-pages [integer] (0)
Maximum pages to fetch from the API (0 = unlimited). Each page costs API credits
--on-complete [string] ("")
Shell command to run after the crawl completes
--output-dir [path] ("crawl_<timestamp>")
Crawl output folder. Defaults to crawl_<timestamp>
--project [path] ("")
Path to a Scrapy project directory. Spider mode only
--resume [boolean] (false)
Skip already-crawled URLs from a previous run (reads manifest.json in output dir)
--save-pattern [string] ("")
Regex: only save pages whose URL matches this pattern. Other pages are visited for link discovery but not saved



Target

target [string] required

The positional argument — one or more URLs to start crawling from. In project spider mode, this is the spider name instead of a URL.

scrapingbee crawl "https://example.com"
scrapingbee crawl "https://example.com" "https://blog.example.com"
scrapingbee crawl my_spider --project ./my_project


From Sitemap

--from-sitemap [string] ("")

Accepts a URL to a sitemap.xml file. The CLI fetches the sitemap (through the ScrapingBee API for proxy support), parses it (handling sitemap indexes recursively up to depth 2), and starts crawling all discovered page URLs.

scrapingbee crawl --from-sitemap "https://example.com/sitemap.xml"


Max Depth

--max-depth [integer]

Controls how many link-hops deep the crawler will follow from the start URLs. A depth of 0 means unlimited. Depth 1 means only pages directly linked from the start URLs.

scrapingbee crawl "https://example.com" --max-depth 2


Max Pages

--max-pages [integer]

Limits the total number of pages fetched from the ScrapingBee API. Each page costs API credits. A value of 0 means unlimited.

scrapingbee crawl "https://example.com" --max-pages 100


Save Pattern

--save-pattern [string] ("")

When set, only pages whose URL matches this regex are saved to disk. All other pages are still visited for link discovery (using lightweight HTML-only requests) but their content is not saved. This lets you crawl an entire site for structure while only saving the pages you care about.

scrapingbee crawl "https://example.com" --save-pattern "/product/" --ai-query "extract product details"


Resume

--resume [boolean] (false)

When resuming a previous crawl, the CLI reads manifest.json in the output directory to skip already-crawled URLs and continue numbering files from where the previous run left off.

scrapingbee crawl "https://example.com" --output-dir my_crawl --resume


On Complete

--on-complete [string] ("")

Requires advanced features setup. This feature executes shell commands and is disabled by default.

Run a shell command after the crawl finishes. The command receives $SCRAPINGBEE_OUTPUT_DIR, $SCRAPINGBEE_SUCCEEDED, and $SCRAPINGBEE_FAILED environment variables.

scrapingbee crawl "https://example.com" --on-complete "echo 'Done! Files in $SCRAPINGBEE_OUTPUT_DIR'"


Project

--project [path] ("")

Path to a Scrapy project directory for running project spiders. The CLI injects ScrapingBee middleware and your API key into the project's Scrapy settings automatically.

scrapingbee crawl my_spider --project /path/to/scrapy/project


Allowed Domains

--allowed-domains [string] ("")

Comma-separated list of domains the crawler is allowed to visit. By default, the crawler only follows links on the same domain as the start URL(s). Use this to explicitly whitelist additional domains.

scrapingbee crawl "https://example.com" --allowed-domains "example.com,blog.example.com"


Allow External Domains

--allow-external-domains [boolean] (false)

Follow links to any domain, not just the start URL's domain. Use with caution — the crawl can expand rapidly. Combine with --max-pages to set a hard limit.

scrapingbee crawl "https://example.com" --allow-external-domains --max-pages 50


Include Pattern

--include-pattern [string] ("")

A regex pattern that URLs must match to be followed. Only links whose full URL matches this pattern will be visited. Useful for restricting crawls to specific sections of a site.

scrapingbee crawl "https://example.com" --include-pattern "/docs/" --max-pages 100


Exclude Pattern

--exclude-pattern [string] ("")

A regex pattern for URLs to skip. Links matching this pattern will not be followed, even if they match --include-pattern. Useful for avoiding pagination, tags, or other low-value pages.

scrapingbee crawl "https://example.com" --exclude-pattern "/tag/|/page/|/author/"


Download Delay

--download-delay [float] ("")

Delay in seconds between consecutive requests. Useful for being polite to the target server or avoiding rate limits. Accepts decimal values.

scrapingbee crawl "https://example.com" --download-delay 1.5


Autothrottle

--autothrottle [boolean] (false)

Enable Scrapy's AutoThrottle extension, which automatically adjusts the download delay based on the server's response time and load. Recommended for large crawls where you don't want to overwhelm the target.

scrapingbee crawl "https://example.com" --autothrottle --max-pages 500


Output Directory

--output-dir [path] ("crawl_<timestamp>")

Folder where crawl results are saved. Each page is written as a numbered file with a manifest.json mapping URLs to files. Defaults to crawl_<timestamp>.

scrapingbee crawl "https://example.com" --output-dir my_crawl


Concurrency

--concurrency [integer]

Maximum number of concurrent requests. Set to 0 (default) to auto-detect from your plan's concurrency limit. Higher values speed up crawls but use more credits in parallel. The CLI caps concurrency at min(--concurrency, --max-pages) to prevent overshoot.

scrapingbee crawl "https://example.com" --concurrency 20 --max-pages 100

Batch Processing

The --input-file flag enables batch mode on scrape, google, and all other scraper commands. Instead of processing a single item, the CLI reads a file of URLs (or queries, ASINs, etc.) and processes them concurrently.

Input

Batch input supports .txt (one URL per line), .csv, and .tsv files. Use --input-column for CSV files:

# Text file (one URL per line)
scrapingbee scrape --input-file urls.txt

# CSV file with a "url" column
scrapingbee scrape --input-file sites.csv --input-column url

# Pipe from stdin
cat urls.txt | scrapingbee scrape --input-file -

Output

Results are saved as numbered files in the output directory (default: batch_<timestamp>):

scrapingbee scrape --input-file urls.txt --output-dir my_results

Alternative output formats:

# Single CSV file
scrapingbee google --input-file queries.txt --output-format csv --output-dir results

# NDJSON to stdout (great for piping)
scrapingbee scrape --input-file urls.txt --output-format ndjson | jq .title

Additional Options

Deduplication and Sampling

Clean up your input before spending credits. --deduplicate normalizes URLs (lowercases domains, strips fragments and trailing slashes) and removes duplicates. --sample picks N random items for testing your configuration before committing to a full run.

scrapingbee scrape --input-file urls.txt --deduplicate --sample 10

Post-Processing

Requires advanced features setup. This feature executes shell commands and is disabled by default.

Transform each result before it's written to disk by piping it through a shell command. The result body is sent to stdin, and the command's stdout replaces it. Works with any tool: jq for JSON filtering, sed for text manipulation, or custom scripts.

# Keep only the first 3 organic results from each Google search
scrapingbee google --input-file queries.txt --post-process "jq '.organic_results[:3]'"

# Extract just the title from each scraped page
scrapingbee scrape --input-file urls.txt --post-process "jq -r '.title // empty'"

Note: --post-process applies to files and ndjson output formats, but not to --update-csv.

Update CSV In-Place

Fetch fresh data for each row and add the results as new columns directly into the original CSV. The existing columns are preserved and new data is merged alongside them. Ideal for enriching datasets with live web data — prices, stock levels, ratings, or any extracted field.

scrapingbee scrape --input-file products.csv --input-column url \
  --extract-rules '{"price":".price","title":"h1"}' \
  --update-csv

The CLI reads the CSV, scrapes each URL in the specified column, flattens the JSON response, and writes the enriched CSV back. Nested JSON is automatically flattened to dot-notation columns (e.g. buybox.price).

Resume After Interruption

If a batch is interrupted (Ctrl+C, network issue, credit limit), re-run with --resume and the same --output-dir. The CLI scans existing output files and skips already-completed items, continuing from where it left off.

scrapingbee scrape --input-file urls.txt --output-dir my_batch --resume

Extract Specific Fields

Pull values from JSON responses using dot-path notation. The output is one value per line, ready to pipe into another command or save as a list. If the path traverses an array, values from every item are extracted.

# Extract all URLs from Google search results
scrapingbee google "best laptops 2025" --extract-field organic_results.url

# Extract product ASINs from Amazon search, then fetch each product
scrapingbee amazon-search "headphones" --extract-field products.asin > asins.txt
scrapingbee amazon-product --input-file asins.txt --output-dir products

If the path doesn't match any data, the CLI prints a warning with all available dot-paths to help you find the correct one.

Run a Command After Completion

Requires advanced features setup. This feature executes shell commands and is disabled by default.

Trigger a notification, sync results to a database, or start a downstream pipeline when the batch finishes. The command receives environment variables with the results summary.

scrapingbee scrape --input-file urls.txt --on-complete "echo 'Done: $SCRAPINGBEE_SUCCEEDED ok, $SCRAPINGBEE_FAILED failed'"

Batch Parameters

name [type] (default)
Description
--input-file [string] required
Path to input file with one item per line (URL, query, ASIN, etc.). Use - for stdin. Supports .txt, .csv, .tsv
--backoff [float] (2)
Retry backoff multiplier (exponential)
--concurrency [integer] (0)
Maximum concurrent requests (0 = auto-detect from your plan's concurrency limit)
--deduplicate [boolean] (false)
Normalize URLs and remove duplicates from the input before processing
--extract-field [string] ("")
Extract values from JSON responses using dot-path notation (e.g. organic_results.url)
--fields [string] ("")
Comma-separated top-level JSON keys to include in output
--input-column [string] ("")
For CSV input, the column name or 0-based index containing the target values
--no-progress [boolean] (false)
Suppress the progress bar during batch processing
--on-complete [string] ("")
Shell command to run after the batch completes
--output-dir [path] ("batch_<timestamp>")
Folder for batch output files. Defaults to batch_<timestamp>
--output-file [path] ("")
Write output to a specific file instead of stdout
--output-format ["files" | "csv" | "ndjson"] ("files")
Batch output format: files (one file per result), csv (single CSV), or ndjson (newline-delimited JSON to stdout)
--post-process [string] ("")
Pipe each result through a shell command (e.g. 'jq .title')
--resume [boolean] (false)
Skip items already saved in --output-dir from a previous run
--retries [integer] (3)
Number of retry attempts on transient errors
--sample [integer] (0)
Process only N random items from input for testing (0 = all)
--update-csv [boolean] (false)
Fetch fresh data and update the input CSV file in-place with new result columns
--verbose [boolean] (false)
Show response headers and status code for each request



Input File

--input-file [string] required

Path to the file containing one item per line. Supports .txt, .csv, and .tsv formats. Use - to read from stdin. For CSV/TSV files, combine with --input-column to specify which column contains the target values.



Output Format

--output-format ["files" | "csv" | "ndjson"] ("files")

Controls how batch results are written:

  • files (default): One file per result in the output directory, with a manifest.json index.
  • csv: All results merged into a single CSV file.
  • ndjson: Newline-delimited JSON streamed to stdout (ideal for piping to jq or other tools).


Update CSV

--update-csv [boolean] (false)

When used with a CSV input file, fetches fresh data for each row and adds the results as new columns in the original CSV. Useful for enriching existing datasets.

scrapingbee scrape --input-file products.csv --input-column url \
  --extract-rules '{"price":".price"}' --update-csv


Post Process

--post-process [string] ("")

Requires advanced features setup.

Pipe each individual result through a shell command before writing to disk. The command receives the result body on stdin. Useful for filtering or transforming JSON.

scrapingbee google --input-file queries.txt --post-process "jq '.organic_results[:5]'"


Resume

--resume [boolean] (false)

When resuming a previous batch, the CLI scans the output directory for already-completed items and skips them. Numbering continues from the previous run.



On Complete

--on-complete [string] ("")

Requires advanced features setup.

Shell command to run after batch completion. Your script receives three environment variables: $SCRAPINGBEE_OUTPUT_DIR (path to the results folder), $SCRAPINGBEE_SUCCEEDED (number of successful requests), and $SCRAPINGBEE_FAILED (number of failed requests) — so it can process the output, trigger downstream workflows, or send alerts based on results.



Input Column

--input-column [string] ("")

For CSV/TSV input files, specifies which column contains the target values. Accepts a column name (from the header row) or a 0-based index. When omitted, the first column is used.

scrapingbee scrape --input-file sites.csv --input-column url
scrapingbee scrape --input-file data.csv --input-column 2


Output Directory

--output-dir [path] ("batch_<timestamp>")

Folder where batch results are saved. Each result is written as a numbered file (e.g. 1.html, 2.json) with a manifest.json index mapping inputs to files. Defaults to batch_<timestamp>.



Output File

--output-file [path] ("")

Write output to a specific file instead of stdout. For single-item commands (not batch), this saves the response directly. The file extension is auto-detected from the response type (HTML, JSON, PNG, etc.) unless you include one.



Concurrency

--concurrency [integer]

Maximum number of concurrent requests. Set to 0 (default) to auto-detect from your plan's concurrency limit via the usage API. Higher values speed up batch processing but use more credits simultaneously.

scrapingbee scrape --input-file urls.txt --concurrency 20


Deduplicate

--deduplicate [boolean] (false)

Normalize URLs and remove duplicates from the input before processing. URL normalization lowercases the domain, strips fragments, and removes trailing slashes. Useful when your input file may contain duplicate or near-duplicate URLs.

scrapingbee scrape --input-file urls.txt --deduplicate
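The normalization rules can be sketched with the standard library. This is an illustration of the rules as described (lowercase domain, drop fragment, strip trailing slash), not the CLI's exact code:

```python
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url: str) -> str:
    """Normalize a URL the way --deduplicate is described: lowercase the
    domain, drop the fragment, and strip a trailing slash from the path."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/")
    return urlunsplit((parts.scheme, parts.netloc.lower(), path, parts.query, ""))

print(normalize_url("https://Example.COM/blog/#top"))  # https://example.com/blog
print(normalize_url("https://example.com/blog"))       # https://example.com/blog
```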


Sample

--sample [integer]

Process only N random items from the input file. Useful for testing your batch configuration on a subset before running the full job. Set to 0 (default) to process all items.

scrapingbee scrape --input-file urls.txt --sample 10 --output-dir test_run


No Progress

--no-progress [boolean] (false)

Suppress the per-item progress counter during batch processing. Useful when piping output or running in CI/CD where the progress updates would clutter logs.



Verbose

--verbose [boolean] (false)

Show HTTP status code, credit cost, resolved URL, and other response headers for each request. In verbose mode, the CLI displays exact credit costs for SERP commands (e.g. Credit Cost: 10) based on the request parameters.



Extract Field

--extract-field [string] ("")

Extract values from JSON responses using a dot-path expression, outputting one value per line. Supports nested paths and automatically iterates over arrays. The output is newline-separated, making it ideal for piping into --input-file of another command.

scrapingbee google "pizza" --extract-field organic_results.url
scrapingbee amazon-search "laptop" --extract-field products.asin > asins.txt

If the path doesn't match any data, the CLI prints a warning with all available dot-paths to help you find the correct one.
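The dot-path resolution described above (nested keys plus automatic fan-out over arrays) can be sketched as follows; a conceptual illustration, not the CLI's actual code:

```python
def extract_field(data, path: str):
    """Resolve a dot-path against parsed JSON, fanning out over lists."""
    values = [data]
    for key in path.split("."):
        next_values = []
        for v in values:
            if isinstance(v, list):  # iterate arrays automatically
                next_values.extend(item[key] for item in v
                                   if isinstance(item, dict) and key in item)
            elif isinstance(v, dict) and key in v:
                next_values.append(v[key])
        values = next_values
    return values

results = {"organic_results": [{"url": "https://a.example"},
                               {"url": "https://b.example"}]}
print(extract_field(results, "organic_results.url"))
# ['https://a.example', 'https://b.example']
```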



Fields

--fields [string] ("")

Filter JSON output to include only the specified comma-separated top-level keys. Useful for reducing output size when you only need certain parts of the response.

scrapingbee google "test" --fields "organic_results,meta_data"


Retries

--retries [integer] (3)

Number of retry attempts on transient errors (HTTP 5xx, connection errors). Default is 3. Each retry uses exponential backoff controlled by --backoff.



Backoff

--backoff [float] (2)

Multiplier for exponential backoff between retries. Default is 2.0, meaning delays of 2s, 4s, 8s between retries. Lower values retry faster; higher values are gentler on the API.
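Under pure exponential backoff, the wait before retry n is backoff**n seconds. A sketch of that schedule (the CLI may add jitter or caps, which this ignores):

```python
def retry_delays(retries: int = 3, backoff: float = 2.0) -> list[float]:
    """Delays (in seconds) before each retry, assuming pure exponential
    backoff: 2s, 4s, 8s for the defaults."""
    return [backoff ** attempt for attempt in range(1, retries + 1)]

print(retry_delays())             # [2.0, 4.0, 8.0]
print(retry_delays(backoff=1.5))  # [1.5, 2.25, 3.375]
```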

export

The export command merges numbered output files from a batch or crawl into a single file. It reads manifest.json (if present) to annotate each record with its source URL.

Examples

# Merge to NDJSON (default)
scrapingbee export --input-dir batch_20250101_120000 --output-file all.ndjson

# Merge to plain text
scrapingbee export --input-dir crawl_20250101 --format txt --output-file pages.txt

# Merge to CSV with flattened nested JSON
scrapingbee export --input-dir serps/ --format csv --flatten --output-file results.csv

# CSV with specific columns only
scrapingbee export --input-dir serps/ --format csv --columns "title,url,price" --output-file filtered.csv

# Deduplicate CSV rows
scrapingbee export --input-dir batch/ --format csv --deduplicate --output-file unique.csv

Export Parameters

name [type] (default)
Description
--input-dir [path] required
Batch or crawl output directory to read from
--columns [string] ("")
CSV mode: comma-separated column names to include. Rows missing all selected columns are dropped
--deduplicate [boolean] (false)
CSV mode: remove duplicate rows
--flatten [boolean] (false)
CSV mode: recursively flatten nested dicts to dot-notation columns (e.g. buybox.price)
--format ["ndjson" | "txt" | "csv"] ("ndjson")
Output format: ndjson (one JSON object per line), txt (plain text blocks), or csv (flat table from JSON arrays)
--output-file [path] ("")
Write output to file instead of stdout



Format

--format ["ndjson" | "txt" | "csv"] ("ndjson")

  • ndjson (default): One JSON object per line. If the source file is valid JSON, it's output as-is with an added _url field. Non-JSON files are wrapped as {"content": "...", "_url": "..."}.
  • txt: Plain text output. Each file's content is separated by a blank line, prefixed with # URL when manifest is available.
  • csv: Flattens JSON files into tabular rows. JSON arrays inside each file are expanded into individual rows. Use --flatten for nested objects and --columns to select specific fields.


Flatten

--flatten [boolean] (false)

In CSV mode, recursively flattens nested dictionaries to dot-notation column names. For example, {"buybox": {"price": 29.99}} becomes a column named buybox.price. Lists of dictionaries are indexed: buybox.0.price, buybox.1.price, etc.
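The flattening rule can be sketched recursively; this illustration matches the examples above but is not the export command's actual code:

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts and lists to dot-notation keys."""
    out = {}
    items = obj.items() if isinstance(obj, dict) else enumerate(obj)
    for key, value in items:
        name = f"{prefix}.{key}" if prefix else str(key)
        if isinstance(value, (dict, list)):
            out.update(flatten(value, name))  # descend into nested structures
        else:
            out[name] = value
    return out

print(flatten({"buybox": {"price": 29.99}}))
# {'buybox.price': 29.99}
print(flatten({"buybox": [{"price": 1}, {"price": 2}]}))
# {'buybox.0.price': 1, 'buybox.1.price': 2}
```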



Input Directory

--input-dir [path] required

The batch or crawl output directory to read from. The export command looks for numbered files (e.g. 1.json, 2.html) and optionally reads manifest.json to annotate each record with its source URL.



Deduplicate

--deduplicate [boolean] (false)

In CSV mode, remove duplicate rows from the output. Two rows are considered duplicates if all their column values are identical.



Columns

--columns [string] ("")

In CSV mode, include only the specified comma-separated column names. Rows missing all selected columns are dropped. Useful for extracting specific fields from large JSON responses.

scrapingbee export --input-dir results/ --format csv --columns "title,url,price" --output-file filtered.csv


Output File

--output-file [path] ("")

Write the merged output to a file instead of stdout. The default outputs to stdout, which is useful for piping to other tools.

schedule

Requires advanced features setup. The schedule command executes shell commands via cron and is disabled by default.

The schedule command creates cron jobs to run any ScrapingBee CLI command at fixed intervals.

Creating a Schedule

# Monitor a price every 5 minutes
scrapingbee schedule --every 5m --name btc-price \
  scrape "https://example.com/price" --extract-rules '{"price":".amount"}'

# Scrape news headlines every hour
scrapingbee schedule --every 1h --name news \
  google "breaking news" --search-type news

# Daily crawl
scrapingbee schedule --every 1d --name daily-crawl \
  crawl "https://example.com" --max-pages 50 --return-page-markdown true

Managing Schedules

# List all active schedules
scrapingbee schedule --list

# Stop a specific schedule
scrapingbee schedule --stop btc-price

# Stop all schedules
scrapingbee schedule --stop all

How It Works

The CLI uses your system's cron to run commands at the specified interval. Each schedule:

  • Creates a cron entry tagged with the schedule name
  • Logs output to ~/.config/scrapingbee-cli/logs/<name>.log
  • Tracks metadata in ~/.config/scrapingbee-cli/schedules.json

Interval syntax: 5s (seconds, converted to minutes), 5m (minutes), 1h (hours), 2d (days). Minimum interval is 1 minute.

Schedule Parameters

name [type] (default)       Description
--every [string] required   Run interval using duration syntax: 5m (minutes), 1h (hours), 2d (days). Minimum 1m. Uses system cron
--list [boolean] (false)    List all active schedules with their intervals, run times, and commands
--name [string] ("")        Name for this schedule. Auto-generated from the command if omitted
--stop [string] ("")        Stop a schedule by name (e.g. --stop btc-price), or stop all schedules (--stop all)



Every

--every [string] required

Duration string specifying how often to run the command. Uses cron under the hood:

  • 5m → runs every 5 minutes (*/5 * * * *)
  • 1h → runs every hour (0 */1 * * *)
  • 2d → runs every 2 days (0 0 */2 * *)
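The mappings above can be sketched as a small conversion function (the `duration_to_cron` helper is hypothetical, not part of the CLI):

```python
def duration_to_cron(duration: str) -> str:
    """Map a duration string (e.g. '5m', '1h', '2d') to a cron expression,
    following the documented mappings."""
    value, unit = int(duration[:-1]), duration[-1]
    if unit == "m":
        return f"*/{value} * * * *"
    if unit == "h":
        return f"0 */{value} * * *"
    if unit == "d":
        return f"0 0 */{value} * *"
    raise ValueError(f"unsupported unit: {unit!r}")

print(duration_to_cron("5m"))  # */5 * * * *
```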


Stop

--stop [string] ("")

Stop a schedule by name, removing its cron entry and registry record. Use --stop all to stop all active schedules.

scrapingbee schedule --stop btc-price


Name

--name [string] ("")

A human-readable name for the schedule. Used to identify it in --list output and to stop it with --stop. If omitted, a name is auto-generated from the command arguments.

scrapingbee schedule --every 1h --name hourly-news google "breaking news"


List

--list [boolean] (false)

Display all active schedules in a table showing the name, interval, how long each has been running, and the full command. Useful for checking what's scheduled before adding or removing jobs.

scrapingbee schedule --list

Pipelines

The real power of the CLI emerges when you chain commands together. Every command is designed to compose — output from one step feeds naturally into the next. This turns the CLI into a data pipeline engine where web scraping is just the first stage.

Scrape to LLM: Building a Knowledge Base

Large language models and RAG (Retrieval-Augmented Generation) systems need clean text. The CLI can crawl an entire documentation site and convert every page to markdown — ready for embedding and indexing in a vector database.

scrapingbee crawl "https://docs.example.com" \
  --return-page-markdown true --max-pages 500 --output-dir knowledge_base

For single-page ingestion, use --chunk-size on the scrape command to split content into overlapping NDJSON chunks with metadata (URL, chunk index, total chunks, timestamp) — ready to pipe directly into an embedding API.

scrapingbee scrape "https://docs.example.com/guide" \
  --return-page-markdown true --chunk-size 2000 --chunk-overlap 200
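A rough sketch of what overlapping chunking does, in Python. The field names are illustrative assumptions, not the CLI's exact NDJSON schema:

```python
import json

def chunk_text(text, url, chunk_size=2000, overlap=200):
    """Split text into overlapping chunks and emit NDJSON-style records.
    Each chunk shares `overlap` characters with the previous one."""
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    return [
        json.dumps({
            "url": url,              # source page
            "chunk_index": i,        # position of this chunk
            "total_chunks": len(chunks),
            "text": chunk,
        })
        for i, chunk in enumerate(chunks)
    ]

records = chunk_text("x" * 5000, "https://docs.example.com/guide", 2000, 200)
print(len(records))  # 3
```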

Unix Piping: Composing with Standard Tools

The CLI speaks stdin and stdout fluently. Use --input-file - to read from a pipe and --output-format ndjson to stream structured results — connecting ScrapingBee to the entire Unix ecosystem.

Extract titles from a list of URLs and filter with jq:

cat urls.txt | scrapingbee scrape --input-file - \
  --output-format ndjson --extract-rules '{"title":"h1"}' | jq -r '.title'

Chain two ScrapingBee commands — search Google, then scrape the top results:

scrapingbee google "best python libraries 2025" \
  --extract-field organic_results.url | scrapingbee scrape --input-file - \
  --return-page-markdown true --output-dir articles

Data Enrichment: Augmenting Existing Datasets

Start with a CSV of products, competitors, or leads — and enrich it with live web data. The --update-csv flag adds scraped results as new columns directly into your existing file, preserving all original data.

scrapingbee scrape --input-file products.csv --input-column url \
  --extract-rules '{"price":".price","stock":".availability"}' --update-csv

This is particularly powerful for monitoring workflows: run it on a schedule and your CSV accumulates fresh data over time. Use --extract-rules to target exactly the fields you need — keeping your dataset clean and focused.

ETL: Extract, Transform, Load

For larger datasets, the batch → export → transform pattern gives you full control over each stage. Scrape in parallel, merge the results, then reshape into exactly the format your downstream system needs.

scrapingbee amazon-search --input-file queries.txt --output-dir raw_results
scrapingbee export --input-dir raw_results --format csv --flatten --output-file products.csv

The --flatten flag recursively expands nested JSON into dot-notation columns (buybox.price, seller.0.name), turning deeply nested API responses into flat CSV rows that work in any spreadsheet or database.
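The dot-notation expansion can be sketched as a short recursive function; this illustrates the behavior, not the CLI's actual code:

```python
def flatten(obj, prefix=""):
    """Recursively flatten nested dicts and lists into dot-notation keys,
    e.g. {"buybox": {"price": 9.99}} -> {"buybox.price": 9.99}."""
    if isinstance(obj, dict):
        items = obj.items()
    elif isinstance(obj, list):
        items = ((str(i), v) for i, v in enumerate(obj))
    else:
        return {prefix: obj}  # leaf value: emit the accumulated path
    flat = {}
    for key, value in items:
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(value, path))
    return flat

print(flatten({"buybox": {"price": 9.99}, "seller": [{"name": "Acme"}]}))
# {'buybox.price': 9.99, 'seller.0.name': 'Acme'}
```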

Monitoring: Scheduled Data Collection

Requires advanced features setup.

Combine schedule with any pipeline to run it automatically. The CLI registers a cron job that executes your command at the specified interval, with output logged for debugging.

scrapingbee schedule --every 1h --name competitor-prices \
  scrape --input-file competitors.csv --input-column url \
  --extract-rules '{"price":".price"}' --update-csv

Each run appends fresh data. Use --on-complete to trigger a notification, sync to a database, or kick off a downstream analysis when a batch job finishes.

scrapingbee schedule --every 6h --name news-digest \
  google --input-file queries.txt --output-dir news_results \
  --on-complete "python analyze.py"

Save-Pattern Crawling: Surgical Data Extraction

Sometimes you need to crawl an entire site for navigation structure but only extract data from specific pages. The --save-pattern flag crawls all pages for link discovery (using lightweight HTML requests) but only applies your expensive extraction options to pages whose URLs match the pattern.

scrapingbee crawl "https://store.example.com" \
  --save-pattern "/product/" --ai-query "extract product name, price, and reviews" \
  --max-pages 500

This can dramatically reduce API credit usage on large sites where only a fraction of pages contain the data you need.

Scraper API Commands

These commands wrap ScrapingBee's specialized scraper APIs. Full parameter documentation lives on each API's page — select the CLI tab for command-line examples.

Command                                   API Page
scrapingbee google "query"                Google Search API →
scrapingbee fast-search "query"           Fast Search API →
scrapingbee amazon-product ASIN           Amazon Product API →
scrapingbee amazon-search "query"         Amazon Search API →
scrapingbee walmart-product ID            Walmart Product API →
scrapingbee walmart-search "query"        Walmart Search API →
scrapingbee youtube-search "query"        YouTube API →
scrapingbee youtube-metadata VIDEO_ID     YouTube API →
scrapingbee chatgpt "prompt"              ChatGPT API →

All scraper commands support --input-file for batch processing and the same output flags (--output-file, --output-format, --extract-field, --fields).

Quick Examples

# Google search
scrapingbee google "web scraping best practices" --output-file results.json

# Fast search (lightweight, 1 credit per request)
scrapingbee fast-search "python web scraping"

# Amazon product
scrapingbee amazon-product B08N5WRWNW --output-file product.json

# Amazon search
scrapingbee amazon-search "wireless headphones" --sort-by bestsellers

# Walmart product
scrapingbee walmart-product 123456789 --output-file product.json

# Walmart search
scrapingbee walmart-search "gaming laptop" --sort-by price-low

# YouTube search
scrapingbee youtube-search "python tutorial" --upload-date this-week

# YouTube metadata
scrapingbee youtube-metadata dQw4w9WgXcQ --output-file video.json

# ChatGPT query
scrapingbee chatgpt "Summarize the latest AI news" --search true

Batch Examples

# Batch Google search
scrapingbee google --input-file queries.txt --output-format csv --output-dir serps

# Batch Amazon products
scrapingbee amazon-product --input-file asins.txt --output-dir products

Advanced Features

The --post-process, --on-complete, and schedule features execute arbitrary shell commands on your machine. To prevent accidental or unauthorized use, these are disabled by default and require explicit setup.

Why Are They Gated?

In AI agent environments, scraped web content could contain prompt injection attempts that trick an AI into constructing malicious shell commands. The exec gate ensures these features can only run when a human has deliberately enabled them.

How to Enable

Three conditions must be met before these features will run:

Step 1 — Set the environment variable:

export SCRAPINGBEE_ALLOW_EXEC=1

Add this to your ~/.bashrc or ~/.zshrc to persist across sessions.

Step 2 — Run the unsafe verification command:

scrapingbee auth --unsafe

This writes a verification flag to your config file (~/.config/scrapingbee-cli/.env).

Step 3 (optional) — Restrict allowed commands:

export SCRAPINGBEE_ALLOWED_COMMANDS="jq,head,python3 /path/to/transform.py"

Comma-separated list of allowed command prefixes. When set, only commands matching these prefixes can be executed by --post-process and --on-complete. If not set, any command is allowed once the first two conditions are met.
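The prefix check can be sketched as follows, assuming simple string-prefix semantics (the CLI's actual matching logic may differ; `is_allowed` is a hypothetical helper):

```python
def is_allowed(command: str, allowed_env: str) -> bool:
    """Check a shell command against a comma-separated prefix allow-list,
    as read from SCRAPINGBEE_ALLOWED_COMMANDS. Empty list allows everything."""
    if not allowed_env:
        return True  # no allow-list set: any command permitted
    prefixes = [p.strip() for p in allowed_env.split(",") if p.strip()]
    return any(command.startswith(p) for p in prefixes)

allowed = "jq,head,python3 /path/to/transform.py"
print(is_allowed("jq -r '.title'", allowed))  # True
print(is_allowed("rm -rf /", allowed))        # False
```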

Status and Audit

# Check if advanced features are enabled and view audit log
scrapingbee unsafe --list

# View recent shell command audit log
scrapingbee unsafe --audit

# View only the last N lines of the audit log
scrapingbee unsafe --audit --audit-lines 20

Disabling

To revoke advanced features:

scrapingbee logout

This removes both the API key and the unsafe verification flag. Alternatively, unset the environment variable (unset SCRAPINGBEE_ALLOW_EXEC).

To revoke only the unsafe flag while keeping your API key stored:

scrapingbee unsafe --disable

Utility Commands



unsafe

unsafe [command]

Manage advanced features status and view the shell command audit log:

scrapingbee unsafe --list     # Check status
scrapingbee unsafe --audit    # View audit log


docs

docs [command]

Print or open the ScrapingBee documentation URL:

scrapingbee docs           # Print the URL
scrapingbee docs --open    # Open in browser


Version

--version [flag]

scrapingbee --version


usage

usage [command]

Check your API credit balance and plan concurrency. See Credits and Plan for details.

scrapingbee usage


auth

auth [command]

Save or display your API key. See Authentication for details.

scrapingbee auth


logout

logout [command]

Remove stored API key and unsafe verification flag. See Authentication for details.

scrapingbee logout


Help

--help [flag]

Get help for any command:

scrapingbee --help
scrapingbee scrape --help
scrapingbee crawl --help