In recent years, the surge in web scraping activities has prompted the emergence of diverse APIs provided by proxy services and data collection firms.

This report delves into seven prominent vendors in the web scraping API landscape, analyzing their features, scraping capabilities, parsing efficiency, and cost-effectiveness.

Focusing on three key website categories (search engines, e-commerce platforms, and social media), we aim to provide insights into the evolving realm of web scraping APIs.

Evolution of Web Scraping APIs

Web scraping APIs act as remote web scrapers, accepting API requests with target URLs and optional parameters.

Behind the scenes, these APIs utilize proxies, headers, and even headless browsers to retrieve HTML content. Some advanced APIs employ AI vision and pattern recognition for sophisticated tasks.

Pricing models are often based on successful requests, ensuring predictability. However, some providers exhibit opaque pricing structures.

Key Insights

➡️ Data Output and Parsing:

  • Six out of seven APIs return raw HTML, with advanced parsers available for specific websites.
  • Google and Amazon are the most targeted websites, with Oxylabs offering a machine-learning model for parsing most e-commerce stores.

➡️ Data Transfer and Customization:

  • APIs transfer data over open connections, often acting as proxies for seamless integration.
  • Customization options include location selection, device specification, and custom headers.
  • Four APIs accept CSS selectors and three support browser interactions for dynamic scraping scenarios.

➡️ Performance and Reliability:

  • Performance tests reveal varying speeds, with some APIs excelling in Google and Amazon scraping.
  • Social media, especially its GraphQL endpoint, proves challenging for many APIs.
  • Oxylabs, Smartproxy, and Bright Data emerge as the most reliable performers, boasting robust parsers.

➡️ Pricing Models:

  • Bright Data charges a uniform price for all features, while Oxylabs and Smartproxy differentiate prices by target group.
  • ScraperAPI and Zyte employ tiered pricing, with rates differing significantly based on the target website.

Participant Overview

We engaged with seven prominent companies offering web scraping APIs, including established names and proxy providers transitioning into this domain.

The participants willingly provided access to their APIs for scraping Google, Amazon, and a social media network.

Participant Snapshot

Provider | APIs Tested | Starting Price
Oxylabs | Web Scraper API, SERP Scraper API, E-Commerce Scraper API | $99
Bright Data | Web Unlocker, SERP API | $3 (pay as you go), $500 (plan)
Smartproxy | Web Scraping API, SERP Scraping API, E-Commerce Scraping API | $50
Zyte | Zyte API | $0 (pay as you go), $25 (plan)
Rayobyte | Scraping Robot | $0.0018/req
ScraperAPI | ScraperAPI | $49
Shifter | Web Scraping API, SERP API | $44.95

Feature Overview

Integration Methods

In theory, all web scraping APIs use the same basic structure: there's an endpoint where you pass URLs you want to scrape with one or more parameters.

In practice, the implementation can differ significantly. Here are the four main methods we've encountered:

Provider | API (open connection) | API (asynchronous) | Proxy | Library/SDK
Oxylabs | ✅ | ✅ | ✅ | ❌
Bright Data | ❌ | ✅ | ✅ | ❌
Smartproxy | ✅ | ❌ | ✅ | ❌
Zyte | ✅ | ❌ | ❌ | ✅
Rayobyte | ✅ | ❌ | ❌ | ❌
ScraperAPI | ✅ | ✅ | ✅ | ✅
Shifter | ✅ | ❌ | ❌ | ✅

  • API (open connection): Open connection means sending requests to an API endpoint and waiting for the response. GET and POST methods are used, with variations in implementation.
  • API (asynchronous): Asynchronous delivery allows sending API calls with an ID and fetching results over a webhook, which is useful for scraping in bulk.
  • Proxy: Most APIs can integrate as proxies, easing the transition from regular proxy servers.
  • Library/SDK: Some providers offer SDKs for additional convenience.
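
The three request-based methods above can be sketched with the Python standard library alone. The endpoint `api.example-scraper.com`, the proxy host, the credentials, and the payload fields (`url`, `geo`, `callback_url`) are placeholders invented for illustration; each provider defines its own names, so treat this as a shape to adapt rather than any vendor's actual interface.

```python
import json
import urllib.request

# Hypothetical values: no provider in this report uses these exact names.
API_ENDPOINT = "https://api.example-scraper.com/v1/scrape"
PROXY_URL = "http://USERNAME:PASSWORD@proxy.example-scraper.com:8000"


def _json_post(endpoint: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) a JSON POST request."""
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def build_sync_request(target_url: str, geo: str = "us") -> urllib.request.Request:
    """Open connection: send the target and wait on the same connection for the page."""
    return _json_post(API_ENDPOINT, {"url": target_url, "geo": geo})


def build_async_request(target_url: str, callback_url: str) -> urllib.request.Request:
    """Asynchronous: submit a job; the result is delivered later to callback_url."""
    return _json_post(
        API_ENDPOINT + "/jobs",
        {"url": target_url, "callback_url": callback_url},
    )


def build_proxy_opener() -> urllib.request.OpenerDirector:
    """Proxy: route ordinary requests through the provider as if it were a proxy server."""
    handler = urllib.request.ProxyHandler({"http": PROXY_URL, "https": PROXY_URL})
    return urllib.request.build_opener(handler)


if __name__ == "__main__":
    # Building a request does not touch the network; sending it would.
    request = build_sync_request("https://example.com/")
    print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```

To actually send a request, pass it to `urllib.request.urlopen(request, timeout=150)`, or call `build_proxy_opener().open("https://example.com/", timeout=150)` for the proxy variant.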

HTML Scraping

General-purpose APIs have one endpoint that attempts to scrape any website, returning pages in raw HTML.

All participants offer an API for general-purpose scraping:

Provider | Relevant Tool
Oxylabs | Web Scraper API
Bright Data | Web Unlocker
Smartproxy | Web Scraping API
Zyte | Zyte API
Rayobyte | Scraping Robot
ScraperAPI | ScraperAPI
Shifter | Web Scraping API

Parameters like geolocation, residential proxies, device type, sessions, cookies, and data input are common among APIs.

Headless Scraping

Headless scraping is crucial for overcoming website protection systems.

Most providers manage headless browsers for you:

Provider | JavaScript Rendering | Screenshots | Browser Actions
Oxylabs | ✅ | ✅ | ❌
Bright Data | ✅ (handled automatically) | ❌ | ❌
Smartproxy | ✅ | ✅ | ❌
Zyte | ✅ | ✅ | ✅
Rayobyte | ✅ | ✅ | ✅
ScraperAPI | ✅ | ❌ | ❌
Shifter | ✅ | ✅ | ✅ (advanced interactions)

JavaScript rendering is universally available, and some providers allow interactions with the browser, such as clicking and scrolling.

Specialized APIs

Specialized APIs target specific website groups, ensuring compatibility and structured scraping:

Provider | Search Engine APIs | E-commerce APIs | Social Media APIs
Oxylabs | Google, Baidu, Bing, Yandex | Amazon, Walmart, eBay, Wayfair + 7 more |
Bright Data | Google, Bing, DuckDuckGo, Yandex | |
Smartproxy | Google, Baidu, Bing, Yandex | Amazon, Idealo, Wayfair |
Zyte | ❌ | ❌ |
Rayobyte | Google | Amazon |
ScraperAPI | ❌ | ❌ |
Shifter | Google, Bing, Yandex | |

Search engines and e-commerce sites are common targets, with Google and Amazon receiving the most attention.

Google Features

Google Features | Oxylabs | Bright Data | Smartproxy | Rayobyte | Shifter
APIs | Search, ads, hotels, images, autocomplete, search volume, trends | Search, maps, trends, reviews, hotels, reverse image | Search, ads, hotels, images, autocomplete, trends | Search | Search, maps, autocomplete, scholar, product, reverse image, jobs, events, Google Play, trends
Search Type (tbm) | ✅ | ✅ | ✅ | ❌ | ✅
Device Type | ✅ | ✅ | ✅ | ❌ | ✅
Location Selection | City-level | City-level | City-level | Country-level | City-level
Localization | Domain, language | Domain, language | Domain, language | Domain, language | Domain, language
Pagination | Start, number of pages | Start, number of pages | Start, number of pages | Number of pages | Start, number of pages

Amazon Features

Amazon Features Oxylabs Smartproxy Rayobyte
APIs Bestsellers, pricing, product, QA, reviews, search, sellers Product, pricing, reviews, QA, search, sellers Product
Device Type
Domain
Delivery Location
Pagination Start, number of pages Start, number of pages

Data Parsing

Parsing capabilities vary among providers. Some offer specialized APIs with built-in parsers, while others expose selectors for manual parsing. Overall parsing capabilities are as follows:

Provider | Manual Parsing | Search Engine Parsers | E-commerce Parsers
Oxylabs | ❌ | Google | Amazon, Walmart, eBay, Wayfair, Target, Etsy, AI parsing
Bright Data | ❌ | Google, Bing, Yandex, DuckDuckGo | ❌
Smartproxy | ❌ | Google | Amazon
Zyte | CSS selectors | ❌ | ❌
Rayobyte | CSS, XPath selectors | Google | ❌
ScraperAPI | ❌ | Google | Amazon
Shifter | CSS selectors | Google, Bing, Yandex | ❌

Pre-built parsers for Google are common, and manual parsing is offered by a few providers. Specialized parsers for Amazon are available, with Oxylabs supporting targets beyond Amazon.
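
When an API returns raw HTML and has no pre-built parser for the target, the parsing happens on your side. The sketch below shows what that involves, using Python's built-in `html.parser` instead of CSS selectors, so it does not reproduce any provider's selector feature. The sample markup and the `product-title` class are made up for illustration.

```python
from html.parser import HTMLParser


class TitleExtractor(HTMLParser):
    """Collect the text of every <h2> whose class list contains "product-title"."""

    def __init__(self) -> None:
        super().__init__()
        self.titles = []
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h2" and "product-title" in classes:
            self._in_title = True
            self.titles.append("")  # start a new title buffer

    def handle_endtag(self, tag):
        if tag == "h2":
            self._in_title = False

    def handle_data(self, data):
        # Text inside nested tags (e.g. <b>) is appended to the current title.
        if self._in_title:
            self.titles[-1] += data


def extract_titles(html: str) -> list:
    parser = TitleExtractor()
    parser.feed(html)
    parser.close()
    return [title.strip() for title in parser.titles]


# Hypothetical page fragment standing in for raw HTML returned by an API.
SAMPLE_HTML = """
<div><h2 class="product-title">Kettle</h2><p>Steel</p></div>
<div><h2 class="product-title featured">Sketchbook <b>A4</b></h2></div>
<h2 class="other">Ignored</h2>
"""

if __name__ == "__main__":
    print(extract_titles(SAMPLE_HTML))
```

A pre-built parser saves exactly this kind of code, along with the maintenance whenever the target site changes its markup.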

Google Parsing

Google Parsing Oxylabs Bright Data Smartproxy Rayobyte ScraperAPI Shifter
Data Formats JSON, CSV JSON JSON JSON JSON JSON
Parsable Elements SERP ✅ Supports parsing Search Engine Results Page (SERP). ✅ Supports parsing SERP. ✅ Supports parsing SERP. ✅ Supports parsing SERP. ✅ Supports parsing SERP.
Search Types (tbms) Images, news, shopping Images, news, shopping, videos, maps, hotels Shopping ❌ Does not support specifying search types. Shopping Images, news, shopping, videos, maps
Other Ads, autocomplete, reverse image, monthly search volume, trends Reverse image, trends, reviews Ads, autocomplete, trends ❌ Does not support specialized parsing. ❌ Does not support specialized parsing. Autocomplete, reverse image, scholar, Play, trends

Amazon Parsing

Amazon Parsing Oxylabs Smartproxy Rayobyte ScraperAPI
Data Formats JSON JSON JSON JSON
Parsable Elements Search ✅ Supports parsing search results. ✅ Supports parsing search results. ✅ Supports parsing offer listings.
Product ✅ Supports parsing product information. ✅ Supports parsing product information. ✅ Supports parsing product information.
Reviews ✅ Supports parsing reviews. ❌ Does not support parsing reviews. ✅ Supports parsing reviews.
Others Bestsellers, ASIN prices, QA, seller info ASIN prices, QA ❌ Does not support specialized parsing. ❌ Does not support specialized parsing.

Performance Benchmarks of Web Scraping APIs

We ran the performance benchmarks with a custom Python script built on the asyncio and aiohttp libraries, sending requests asynchronously with a 150-second timeout.

The tests covered Google, Amazon, and a photo-centric social media platform across several scenarios. The snippet below illustrates the request pattern in simplified form.

import asyncio

import aiohttp

# Total time allowed per request, in seconds (the benchmark used 150).
REQUEST_TIMEOUT = aiohttp.ClientTimeout(total=150)


async def fetch_data(session: aiohttp.ClientSession, url: str) -> str:
    """Return the page body as text, or an empty string if the request fails."""
    try:
        async with session.get(url, timeout=REQUEST_TIMEOUT) as response:
            # Treat HTTP errors such as 429 or 520 as failed requests.
            response.raise_for_status()
            # The targets return HTML, so read text rather than JSON.
            return await response.text()
    except (aiohttp.ClientError, asyncio.TimeoutError) as e:
        print(f"Error fetching data from {url}: {e}")
        return ""


async def scrape(session: aiohttp.ClientSession, name: str, url: str) -> None:
    html = await fetch_data(session, url)
    print(f"{name}: received {len(html)} characters")


async def main() -> None:
    # Placeholder targets; a real test would call each provider's API endpoint.
    targets = {
        "Google": "https://www.google.com",
        "Amazon": "https://www.amazon.com",
    }
    # One shared session reuses connections across all requests.
    async with aiohttp.ClientSession() as session:
        await asyncio.gather(
            *(scrape(session, name, url) for name, url in targets.items())
        )


if __name__ == "__main__":
    asyncio.run(main())
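
The tables below report two numbers per provider: success rate and average response time. A minimal sketch of how such figures can be derived from raw measurements follows. The `Result` record is hypothetical, and averaging only the successful requests is an assumption, since the report does not say how failed requests counted toward timing.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Result:
    ok: bool        # True if the request returned a usable page
    seconds: float  # wall-clock time the request took


def summarize(results):
    """Return (success rate in percent, average seconds of successful requests)."""
    if not results:
        return 0.0, 0.0
    successes = [r for r in results if r.ok]
    success_rate = 100 * len(successes) / len(results)
    # Assumption: failed requests are excluded from the timing average.
    avg_seconds = mean(r.seconds for r in successes) if successes else 0.0
    return round(success_rate, 2), round(avg_seconds, 2)


if __name__ == "__main__":
    sample = [
        Result(True, 4.0),
        Result(True, 6.0),
        Result(False, 150.0),  # e.g. a request that hit the timeout
        Result(True, 5.0),
    ]
    print(summarize(sample))
```

Including failures in the average would pull the figure toward the timeout value, which is why the choice is worth stating alongside any published numbers.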

Google

Unparsed Results

Provider | Success Rate | Avg. Response Time (s)
Oxylabs | 100% | 6.04
Bright Data | 98.42% | 4.62
Smartproxy | 100% | 6.09
Zyte | 99.47% | 4.72
Rayobyte | 100% | 6.53
ScraperAPI | 94.10% | 12.58
Shifter | 81.76% | 1.67

Most APIs performed well, with notable exceptions. Shifter's general-purpose scraper faced challenges with Google, resulting in a 429 detection error every fifth request. The specialized API improved performance but experienced a decrease in speed.

Parsed Results

Provider | Success Rate | Avg. Response Time (s)
Oxylabs | 99.90% | 6.15
Bright Data | 99.71% | 6.03
Smartproxy | 99.85% | 6.04
Zyte | | 10.03
Rayobyte | 99.93% | 13.24
ScraperAPI | 96.88% | 10.08
Shifter | 96.65% |

Using a data parser had little effect on response time for most providers. The exception was Rayobyte, whose JSON results took about 6.7 seconds longer than its raw HTML results (13.24 s versus 6.53 s), for reasons that remain unexplained.

Amazon

Provider | Success Rate | Avg. Response Time (s)
Oxylabs | 100% | 4.69
Bright Data | 98.42% | 4.31
Smartproxy | 100% | 4.66
Zyte | 85.50% | 4.51
Rayobyte | 95.60% | 20.70
ScraperAPI | 95.80% | 9.69
Shifter | 98.80% | 5.35

Bright Data, Oxylabs, and Smartproxy consistently delivered excellent results. Rayobyte's slow response was attributed to defaulting to datacenter IPs for Amazon, necessitating multiple request retries. Zyte encountered 520 errors, and ScraperAPI mirrored its Google performance. Shifter performed well, but its scraper faced challenges.

Photo-Focused Social Media Platform

GraphQL Endpoint

Provider | Success Rate | Avg. Response Time (s)
Oxylabs | 100% | 17.89
Bright Data | 73.40% | 3.71
Smartproxy | 100% | 8.95
Zyte | 98.40% | 2.59
Rayobyte | 80% | 4.52
ScraperAPI* | 24.80% | 8.08
Shifter | 54.80% | 1.77

The GraphQL endpoint posed a serious challenge, with Shifter struggling even with rendering enabled. ScraperAPI faced difficulties, while Zyte stood out with commendable performance.

Headless Rendering

Provider | Success Rate | Avg. Response Time (s)
Oxylabs | 100% | 28.88
Bright Data | 100% | 4.10
Smartproxy | 100% | 29.09
Zyte | 94.00% | 28.14
Rayobyte | 98.60% | 23.05
ScraperAPI* | 98.20% | 16.05
Shifter | 62.40% | 4.42

The headless test was more forgiving, with Bright Data demonstrating superior results. Shifter was fast but faced errors. ScraperAPI had improved performance, and Oxylabs and Smartproxy maintained success rates at the expense of some speed.

Concurrency

Provider | Concurrency
Oxylabs | 5 req/s to unlimited
Bright Data | Unlimited
Smartproxy | Unspecified
Zyte | 2 req/s
Rayobyte | 100 req/min
ScraperAPI | 200-400 threads
Shifter | Unspecified

Concurrency varied. Bright Data, Smartproxy, and Oxylabs allowed highly parallel requests, while Rayobyte (100 requests per minute) and Zyte (2 requests per second) had more restrictive default limits, which matter mainly for enterprise-scale workloads.
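
On the client side, a limit like the ones above is usually respected by capping how many requests are in flight at once. The sketch below does this with `asyncio.Semaphore`. Here `fake_request` only sleeps and stands in for a real API call, and the cap of 2 is an arbitrary example value. A semaphore caps concurrency, not requests per second, so a per-second or per-minute quota would need additional pacing.

```python
import asyncio


async def fake_request(i: int, delay: float = 0.01) -> int:
    """Stand-in for a real API call: waits briefly and returns its index."""
    await asyncio.sleep(delay)
    return i


async def run_limited(n_requests: int, max_in_flight: int):
    """Run n_requests calls with at most max_in_flight running at once.

    Returns (results in submission order, peak number of concurrent calls).
    """
    semaphore = asyncio.Semaphore(max_in_flight)
    in_flight = 0
    peak = 0

    async def worker(i: int) -> int:
        nonlocal in_flight, peak
        async with semaphore:  # blocks while max_in_flight calls are running
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await fake_request(i)
            finally:
                in_flight -= 1

    results = await asyncio.gather(*(worker(i) for i in range(n_requests)))
    return list(results), peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_limited(n_requests=10, max_in_flight=2))
    print(f"completed {len(results)} requests, peak concurrency {peak}")
```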

Evaluation of Parsing Capabilities in Web Scraping APIs

We also ran a qualitative test of parsing ability on four types of pages: a localized Google search desktop query, a localized Google search mobile query, a Google Shopping query, and Amazon product pages.

Google SERP, Localized Desktop Query

For the localized desktop query "best hairdresser near me" in London, the APIs were evaluated based on various elements:

Provider Localized Organic Snack Pack Map Related Searches People Also Ask
Oxylabs
Bright Data
Smartproxy
Rayobyte
ScraperAPI
Shifter

While ScraperAPI and Rayobyte focused on essential information, others aimed to parse the entire SERP.

Notably, Bright Data even provided a screenshot of the map. Shifter faced issues with the location parameter, hindering local result retrieval.

Google SERP, Localized Mobile Query

The mobile query with the same parameters as the desktop query yielded the following results:

Provider Localized Organic Snack Pack Map Related Searches People Also Ask
Oxylabs
Bright Data
Smartproxy
Rayobyte
ScraperAPI
Shifter

Bright Data, Oxylabs, and Smartproxy successfully returned complete and accurate results. However, ScraperAPI failed to scrape anything, and Shifter's mobile parser regressed to main page elements, missing local data.

Google Shopping

The Google Shopping query for "Nike Air Max" in London was analyzed for various aspects:

Provider Localized Search Filters Ads Item Pricing Merchant Delivery Evaluation Other
Oxylabs
Bright Data Price Comparison
Smartproxy
ScraperAPI Filter by Material, Related Searches, Price Comparison
Shifter

ScraperAPI provided the most comprehensive results, including related searches and the "you might like" block. It successfully retrieved ad results, a feature absent in other providers. Bright Data and Shifter failed to localize the page for this specific query.

Amazon Product Pages

Various product pages from art supplies, kitchenware, and electronics were targeted for parsing. The evaluation included elements such as breadcrumbs, item details, images, pricing, merchant information, availability, bestsellers rank, delivery, evaluation, and warranty.

Provider Breadcrumbs Item Images Item Variations Pricing Merchant Availability Bestsellers Rank Delivery Evaluation Warranty
Oxylabs
Smartproxy
Rayobyte
ScraperAPI

All four APIs demonstrated the ability to parse most page elements. Oxylabs and Smartproxy provided the most comprehensive results, including discounts, delivery, and warranty information. Rayobyte's parser was less informative, excluding item variations, delivery, and warranty information. ScraperAPI chose to exclude buy box data and experienced a few formatting errors.

In summary, this qualitative test unveiled the varying parsing capabilities of web scraping APIs, shedding light on their strengths and limitations across different types of web pages.

Pricing Models

Most web scraping APIs charge per successful request, which keeps cost estimates simple. The standard metric for comparing providers is CPM, the cost per 1,000 requests.
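
As a worked example of the CPM metric, the sketch below converts a per-request price and a plan price into cost per 1,000 requests. Rayobyte's $0.0018 per request comes from this report. The $50 plan with 25,000 included requests is a made-up figure, since the report lists starting prices but not how many requests they include.

```python
def cpm_from_request_price(price_per_request: float) -> float:
    """CPM for pay-as-you-go pricing: price of one request times 1,000."""
    return round(price_per_request * 1000, 4)


def cpm_from_plan(plan_price: float, included_requests: int) -> float:
    """CPM for a subscription: plan price spread over the included requests."""
    if included_requests <= 0:
        raise ValueError("included_requests must be positive")
    return round(plan_price * 1000 / included_requests, 4)


if __name__ == "__main__":
    # Rayobyte's listed rate of $0.0018 per request.
    print(cpm_from_request_price(0.0018))
    # Hypothetical plan: $50 for 25,000 successful requests.
    print(cpm_from_plan(50, 25_000))
```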

Provider | Pricing Model | Structure | Starting Price | Trial
Oxylabs | Subscription | Successful requests | $99 | 5,000 req for a week
Bright Data | Pay as you go, Subscription | Successful requests | $3 (pay as you go), $500 (plan) | 7 days for companies
Smartproxy | Subscription | Successful requests | $50 | 3,000 req for 3 days
Zyte | Pay as you go, Subscription | Successful requests | $0 (pay as you go), $25 (plan) | $5 free credit
Rayobyte | Pay as you go | Successful requests | $0.0018/request | 5,000 free per month (renewed)
ScraperAPI | Subscription | Successful requests | $49 | 5,000 credits for a week
Shifter | Subscription | Successful requests | $44.95 | Money-back guarantee

The dominant pricing model remains the monthly subscription, but variations exist. Zyte introduces an intriguing approach where users set a monthly limit and pay half in advance each month. Notably, trials are available with most providers, with a standard offering of 5,000 requests.

Calculating Request Price

While the pricing model appears straightforward, some web scraping APIs introduce complexities in calculating a request's price.

Factors such as the target website, JavaScript rendering, residential proxies, and more contribute to price modifiers, leading to significant cost variations.

Provider | Price Modifiers | Max Price Difference
Oxylabs | Search engines, e-commerce websites | x2-3
Bright Data | | x1
Smartproxy | Search engines, e-commerce websites | x1.5-3
Zyte | Target, JS rendering, premium proxies, screenshots, browser actions | Custom
Rayobyte | | x1
ScraperAPI | Premium, super premium proxies, premium targets, JS rendering | x75
Shifter | Premium proxies, JS rendering, search engines | x25

ScraperAPI stands out with a complex structure involving three tiers of proxy networks and JavaScript rendering.

The pricing varies based on factors like the use of residential proxies, headless scraping, and rates for specific websites such as Google, Amazon, and social media.

Oxylabs and Smartproxy differentiate prices by target group: search engine scrapers cost more than general-purpose scraping, and e-commerce scrapers cost approximately double.

Shifter follows a similar strategy for search engines, while its regular scraper aligns with ScraperAPI's structure.

Bright Data and Rayobyte maintain consistent pricing irrespective of whether they use custom scrapers or render JavaScript, simplifying the process of scraping challenging targets.

Zyte, on the other hand, dynamically calculates the price per request for each website, considering its difficulty, JavaScript rendering, screenshots, and browser actions. This dynamic approach makes it challenging to estimate expenses in advance.
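
To see what these modifiers can mean for a budget, the sketch below multiplies a base cost by the upper bound of each provider's "Max Price Difference" from the table above. The base CPM of $1.00 and the volume of 100,000 requests are placeholder numbers, not any provider's actual rate. Zyte is omitted because its per-request price is calculated dynamically.

```python
# Upper bounds of "Max Price Difference" as listed in this report.
MAX_MULTIPLIER = {
    "Oxylabs": 3,
    "Bright Data": 1,
    "Smartproxy": 3,
    "Rayobyte": 1,
    "ScraperAPI": 75,
    "Shifter": 25,
}


def base_cost(base_cpm: float, requests: int) -> float:
    """Cost of the given number of successful requests at the unmodified rate."""
    return round(base_cpm * requests / 1000, 2)


def worst_case_cost(base_cpm: float, requests: int, max_multiplier: float) -> float:
    """Cost if every request is charged at the highest modifier."""
    return round(base_cost(base_cpm, requests) * max_multiplier, 2)


if __name__ == "__main__":
    # Placeholder inputs: $1.00 base CPM, 100,000 successful requests.
    for provider, multiplier in MAX_MULTIPLIER.items():
        low = base_cost(1.00, 100_000)
        high = worst_case_cost(1.00, 100_000, multiplier)
        print(f"{provider}: ${low:.2f} to ${high:.2f}")
```

The spread between the two figures is the practical reason a low starting price says little about the cost of scraping a demanding target.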

Conclusion

The web scraping API landscape is dynamic, offering diverse features and pricing structures.

Key insights include the evolution of advanced features, the targeting of major websites like Google and Amazon, and the importance of parsing capabilities.

Performance and reliability vary, with Oxylabs, Smartproxy, and Bright Data emerging as reliable performers.

Pricing models are generally based on successful requests, but some providers introduce complexity with differentiated pricing.

Organizations should carefully assess their needs and budget constraints when choosing a web scraping API, considering factors like data output, customization, and parsing capabilities. Ongoing monitoring is essential in this competitive and evolving ecosystem.

Frequently Asked Questions

How do web scraping APIs handle pricing?

Web scraping APIs typically follow a pricing model based on successful requests. Users are charged for the number of requests that are completed successfully. Some providers introduce additional complexities, such as differentiated pricing for specific websites or features.

What are the key features to consider when evaluating a web scraping API?

Important features include data output format, customization options (e.g., location selection, device specification), parsing capabilities, and performance/reliability. Consideration of the target websites and the ability to handle dynamic content and JavaScript is also crucial.

What are some challenges associated with web scraping, and how can they be addressed?

Challenges include handling dynamic content, CAPTCHAs, and changes in website structure. To address these challenges, choose a web scraping API with robust parsing capabilities and support for JavaScript rendering, and consider implementing techniques like rotating proxies and user agents to avoid detection. Regularly monitor and adapt your scraping strategy as websites evolve.
