How to scrape Flipkart reviews by SKU

Flipkart product reviews contain rating, review text, reviewer name, date, upvotes, and product context. The ScrapeNow Flipkart reviews scraper extracts all of these from a product SKU and returns structured JSON.

Where this scraper fits

Use the Flipkart Reviews Search by SKU scraper when you already have Flipkart product SKUs. If your pipeline starts earlier in the product discovery flow, use a product or category scraper first.

Starting point	Scraper to use	Output you get
You have a Flipkart SKU	Extract Flipkart reviews by SKU	Review rows for that product
You have a product URL	Extract Flipkart product data	Product metadata and identifiers
You have a category path	Flipkart products by category	Product list for a category
You have a search URL	Search Flipkart results	Product results from a search page

A production review pipeline usually has five steps:

Extract products from category pages, search pages, or product URLs.
Store each product SKU in your product table.
Send those SKUs to the reviews scraper on a schedule.
Normalize review rows into your warehouse.
Deduplicate records by review_id.

The reviews scraper starts at step 3. It assumes your system already knows the product SKU and needs review data for that SKU.

How to use this scraper

The Flipkart Reviews Scraper job pipeline, from input to stored output.

The ScrapeNow scraper slug is the Flipkart Reviews Search by SKU scraper. It takes one input field named sku.

The input must be the SKU value only. Send itm5309cefa035da, not a full Flipkart product URL.

A valid request input looks like this:

{
  "sku": "itm5309cefa035da"
}

Step 1. Find the Flipkart SKU

Open flipkart.com/search.

Flipkart search dropdown suggesting violin instrument categories above listings — Search for the product whose reviews you want to extract

Search for a product, such as `violin`.

Flipkart violin search results grid with KADENCE first listing highlighted — Click the product listing to open its detail page and find the SKU

Click the product you want from the search results.

Flipkart Kadence violin product page with SKU in URL — Extract the SKU from the product URL, starting with the itm prefix

Extract the SKU from the product page URL. The SKU starts with `itm` and appears as the last path parameter before optional query parameters.

Example SKU:

itm5309cefa035da

Use that value as the sku input. Keep only the SKU value.

For example, this URL contains the SKU in the product path:

https://www.flipkart.com/example-product/p/itm5309cefa035da?pid=ACGGX9M23RZ2UPZE

The scraper input should contain this value:

{
  "sku": "itm5309cefa035da"
}

If you already store Flipkart product URLs, extract the SKU once during ingestion. Store the SKU as its own column, then reuse it for review jobs.

Step 2. Run the API request

Install the dependency:

pip install requests

Then run this Python script with your ScrapeNow API key:

"""
Configuration:
    - Set SCRAPER_SLUG to the scraper you want to run.
    - Set SCRAPER_INPUTS to the list of input dicts matching that scraper's schema.
    - Set API_KEY to your scraper API key.
"""

import sys
import time
import json
import requests
import os

API_KEY = "YOUR_API_KEY"

SCRAPER_SLUG = "flipkart-reviews-search-by-sku"

SCRAPER_INPUTS = [
    {
        "sku": "itm5309cefa035da"
    }
]



BASE_URL = "https://api.scrapenow.io/api/v1/scraping"
TIMEOUT_SECONDS = 3600
POLL_INTERVAL = 5
SPINNER = "|/-\\"


def build_headers(api_key: str, content_type: str | None = None) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    if content_type:
        headers["Content-Type"] = content_type
    return headers


def trigger_scrape(slug: str, inputs: list[dict]) -> str:
    url = f"{BASE_URL}/scrape?scraper={slug}"
    response = requests.post(
        url,
        headers=build_headers(API_KEY, "application/json"),
        json={"inputs": inputs},
    )
    response.raise_for_status()
    return response.json()["data"]["job_id"]


def poll_until_done(job_id: str) -> str:
    start = time.time()
    i = 0
    while True:
        elapsed = time.time() - start
        if elapsed > TIMEOUT_SECONDS:
            print(f"\nTimeout after {TIMEOUT_SECONDS}s")
            sys.exit(1)
        response = requests.get(
            f"{BASE_URL}/jobs/{job_id}",
            headers=build_headers(API_KEY),
        )
        response.raise_for_status()
        data = response.json()
        status = data["data"]["status"]
        mins, secs = divmod(int(elapsed), 60)
        sys.stdout.write(
            f"\r[{SPINNER[i % 4]}] Waiting... {status} ({mins}m {secs:02d}s)  "
        )
        sys.stdout.flush()
        if status in ("completed", "failed"):
            print()
            return status
        time.sleep(POLL_INTERVAL)
        i += 1


def fetch_results(job_id: str) -> dict:
    response = requests.get(
        f"{BASE_URL}/jobs/{job_id}/results?format=json",
        headers=build_headers(API_KEY),
    )
    response.raise_for_status()
    return response.json()


def save_results(data: dict, slug: str) -> str:
    os.makedirs("output", exist_ok=True)
    filename = os.path.join("output", f"{slug}.json")
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return filename


def main() -> None:
    print(f"Triggering scraper: {SCRAPER_SLUG}")
    job_id = trigger_scrape(SCRAPER_SLUG, SCRAPER_INPUTS)
    print(f"Job started: {job_id}")
    final_status = poll_until_done(job_id)
    if final_status != "completed":
        print(f"Job failed with status: {final_status}")
        sys.exit(1)
    print("Fetching results...")
    results = fetch_results(job_id)
    output_file = save_results(results, SCRAPER_SLUG)
    print(f"Results saved to: {output_file}")


if __name__ == "__main__":
    main()

The script performs these steps:

Starts a scrape job with POST /scrape?scraper=flipkart-reviews-search-by-sku.
Polls the job every 5 seconds.
Stops after 3600 seconds if the job does not finish.
Saves the JSON output to output/flipkart-reviews-search-by-sku.json.

Run the script:

python flipkart_reviews.py

For production runs, keep POLL_INTERVAL at 5 seconds or higher. Faster polling adds API traffic and does not make Flipkart extraction finish faster.

Set limit_per_input based on the number of reviews you need per SKU. The sample uses 1 so you can test the request and inspect one row before running a larger job.

For scheduled jobs, treat the script as a starting point. Move API_KEY, BASE_URL, batch size, and output paths into environment variables or your job configuration.

Step 3. Read the JSON output

A successful result looks like this:

[
  {
    "inputs": {
      "sku": "itm5309cefa035da"
    },
    "scrape_status": "success",
    "rootdomain": "flipkart.com",
    "sku": "itm5309cefa035da",
    "variant_sku": null,
    "review_id": "2d8f0692-24bf-4030-afb5-e1ff620d7090",
    "author": "Santu  Roy",
    "rating": "4",
    "date": "2026-04-13T00:00:00.000Z",
    "purchased_date": null,
    "location": "Birpara Tea Garden, West Bengal",
    "attributes": "Verified Buyer",
    "title": "Wonderful",
    "text": "Good product 👌",
    "product_name": "KADENCE KAD-BB01-NAT(with Online Classes) Acoustic Guitar Hard Wood Rosewood Right Hand Orientation",
    "recommended_review": null,
    "product_has_been_tried": null,
    "is_verified_purchase": true,
    "is_verified_buyer": true,
    "is_incentivized": null,
    "sold_and_shipped_by": null,
    "helpful_count": {
      "up": 0,
      "down": null
    },
    "brand_response": null,
    "syndicated": null,
    "program": null,
    "link": "https://flipkart.com/reviews/ACGGX9M23RZ2UPZE:146?reviewId=2d8f0692-24bf-4030-afb5-e1ff620d7090",
    "review_images_url": [],
    "seller_id": null,
    "variant_attributes": null,
    "url": "https://flipkart.com/",
    "timestamp": "2026-05-13T09:30:46.910Z",
    "input": {
      "url": "https://flipkart.com/",
      "sku": "itm5309cefa035da"
    },
    "discovery_input": {
      "sku": "itm5309cefa035da"
    },
    "scrape_error": null,
    "scrape_error_code": null
  }
]

Store the full raw row for audit and debugging. Write a normalized row into analytics tables after type conversion, deduplication, and error checks.

Keep raw JSON in object storage or a raw warehouse table. Raw rows help when Flipkart changes a field, your downstream schema changes, or analysts ask why a metric moved.

A raw row also preserves fields you do not model today. That matters when a product team asks for review images or badge history six months later.

What data you get back

Flipkart Reviews Scraper output schema — Flipkart Reviews Scraper output fields grouped by category.

Each result row is one Flipkart review. The API returns normalized fields, so your code does not parse review cards, star icons, badges, or dates from the page.

Field	Type	What it means
`sku`	string	Flipkart product identifier used as the input
`review_id`	string	Stable review identifier from the review URL
`author`	string	Reviewer display name
`rating`	string	Rating value, usually `1` through `5`
`date`	ISO string	Review date in UTC format
`location`	string or null	Reviewer location shown by Flipkart
`attributes`	string or null	Review badges, for example `Verified Buyer`
`title`	string	Review headline
`text`	string	Full review body
`product_name`	string	Product name attached to the review
`is_verified_purchase`	boolean or null	Purchase verification flag
`is_verified_buyer`	boolean or null	Buyer verification flag
`helpful_count.up`	number or null	Upvotes on the review
`helpful_count.down`	number or null	Downvotes on the review
`review_images_url`	array	Image URLs attached to the review
`link`	string	Direct review URL
`scrape_status`	string	`success` for completed rows
`scrape_error_code`	string or null	Error code if extraction failed

Ready to get this data? Extract Flipkart reviews by SKU.

Use review_id as your primary key. It is safer than author + date + text because names repeat, text changes, and dates are not unique.

For product monitoring, join review rows back to product rows by sku. The Flipkart products scraper guide covers product extraction when your pipeline needs product metadata before reviews.

Keep both sku and product_name in your warehouse. Product names change, and the SKU gives you the stable join key.

Treat nullable fields as expected output. Flipkart does not show every badge, image, location, seller response, or verification marker on every review.

Production tips

Validate SKUs before you send jobs

Invalid SKU inputs waste credits and add unnecessary retry entries. Reject any value that does not start with itm.

import re

SKU_RE = re.compile(r"^itm[a-zA-Z0-9]+$")

def validate_sku(sku: str) -> str:
    sku = sku.strip()

    if not SKU_RE.match(sku):
        raise ValueError(f"Invalid Flipkart SKU: {sku}")

    return sku


raw_skus = [
    "itm5309cefa035da",
    "https://www.flipkart.com/some-product/p/itm5309cefa035da",
    "5309cefa035da"
]

valid_skus = []

for sku in raw_skus:
    try:
        valid_skus.append(validate_sku(sku))
    except ValueError as exc:
        print(exc)

print(valid_skus)

Expected output:

["itm5309cefa035da"]

If your source data contains full Flipkart URLs, extract the SKU before validation. This keeps the validation function strict and keeps URL parsing in one place.

from urllib.parse import urlparse

def extract_sku_from_flipkart_url(url: str) -> str:
    path = urlparse(url).path
    parts = [part for part in path.split("/") if part]

    for part in reversed(parts):
        if part.startswith("itm"):
            return part

    raise ValueError(f"No SKU found in URL: {url}")


url = "https://www.flipkart.com/example-product/p/itm5309cefa035da?pid=ACGGX9M23RZ2UPZE"
print(extract_sku_from_flipkart_url(url))

Expected output:

itm5309cefa035da

Add this check before you create API jobs.

Deduplicate by review ID

Run review scraping on a schedule and store new rows only. The right key is review_id.

def dedupe_reviews(reviews: list[dict]) -> list[dict]:
    seen = set()
    output = []

    for review in reviews:
        review_id = review.get("review_id")

        if not review_id:
            continue

        if review_id in seen:
            continue

        seen.add(review_id)
        output.append(review)

    return output

For warehouse tables, use a unique index on review_id. If you track multiple marketplaces in the same table, use (rootdomain, review_id).

Use a warehouse upsert keyed by review_id for cleaner scheduled runs.

Store ratings as integers

The response returns rating as a string. Cast it before writing analytics tables.

def normalize_review(row: dict) -> dict:
    return {
        "rootdomain": row.get("rootdomain"),
        "sku": row.get("sku"),
        "review_id": row.get("review_id"),
        "author": row.get("author"),
        "rating": int(row["rating"]) if row.get("rating") else None,
        "date": row.get("date"),
        "title": row.get("title"),
        "text": row.get("text"),
        "product_name": row.get("product_name"),
        "is_verified_buyer": row.get("is_verified_buyer"),
        "helpful_up": (row.get("helpful_count") or {}).get("up"),
        "helpful_down": (row.get("helpful_count") or {}).get("down"),
        "link": row.get("link"),
        "scrape_timestamp": row.get("timestamp"),
    }

A compact normalized schema works well:

Column	Suggested type
`rootdomain`	text
`sku`	text
`review_id`	text unique
`author`	text
`rating`	integer
`date`	timestamp
`title`	text
`text`	text
`product_name`	text
`is_verified_buyer`	boolean
`helpful_up`	integer
`helpful_down`	integer
`link`	text
`scrape_timestamp`	timestamp

Keep date and scrape_timestamp separate. The review date tells you when the customer posted the review.

The scrape timestamp tells you when your system observed the review. That difference matters when you backfill old products or compare daily snapshots.

Use integer ratings for averages, distribution charts, and alert rules. Keep the raw rating value in your raw table if you need replayable transformations.

Handle row-level scrape errors

Treat job success and row success as separate checks. A job can complete while individual inputs return scrape_error_code.

def split_success_and_errors(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    success = []
    errors = []

    for row in rows:
        if row.get("scrape_status") == "success" and not row.get("scrape_error_code"):
            success.append(row)
        else:
            errors.append({
                "input": row.get("input") or row.get("inputs"),
                "scrape_status": row.get("scrape_status"),
                "scrape_error": row.get("scrape_error"),
                "scrape_error_code": row.get("scrape_error_code"),
            })

    return success, errors

Retry only the failed SKUs, and cap retries at 2 attempts. Write failed inputs to a separate table with the input, job ID, and error code.

Batch inputs by product list

The sample sends 1 SKU with limit_per_input set to 1. In production, build inputs from product or search pipelines and send a list.

skus = [
    "itm5309cefa035da",
    "itm123456789abcd",
    "itmabcdef1234567"
]

SCRAPER_INPUTS = [{"sku": sku} for sku in skus]

If your SKU source starts from search pages, use the Search Flipkart results scraper to build the product list first. Browse all 86+ scrapers to see all available Flipkart scrapers.

Set a batch size that matches your downstream retry model, such as one batch per category or brand.

Monitor review volume changes

Review count changes tell you when a product gains traction or receives negative attention. Store daily counts by sku, then compare today’s count with the previous scrape.

A small table with sku, scrape_date, and review_count is enough for most teams. Add rating averages after you normalize the rating field to an integer.

For QA pipelines, alert when a product receives multiple 1-star reviews in a short period. For catalog monitoring, alert when a product stops returning reviews after previous successful runs.

Track rating distribution as buckets from 1 to 5. Averages hide movement when both 1-star and 5-star counts rise at the same time.

A basic monitoring table looks like this:

Column	Suggested type	Use
`sku`	text	Product join key
`scrape_date`	date	Daily snapshot date
`review_count`	integer	Total reviews returned
`avg_rating`	numeric	Mean rating after casting
`one_star_count`	integer	QA alert input
`five_star_count`	integer	Positive trend input

This table gives analysts a stable daily series. It also keeps your alert code away from raw scraper output.

Use the daily table for alerting, not raw review rows. Raw review rows work for audit and debugging, while aggregates work for monitoring.

Keep review ingestion idempotent

Scheduled review scraping reruns the same SKU many times. Your write path should handle duplicate rows without manual cleanup.

Use an upsert keyed by review_id or (rootdomain, review_id). Update mutable fields such as helpful_up, helpful_down, scrape_timestamp, and product_name.

Avoid delete-and-reload for review tables. It creates gaps in dashboards when a scheduled run fails halfway through the product list.

A simple warehouse pattern looks like this:

Load raw scraper output into flipkart_review_raw.
Split successful rows and failed rows.
Normalize successful rows into a staging table.
Upsert staging rows into flipkart_review.
Insert failed rows into flipkart_review_error.

This pattern keeps reruns safe. It also gives your team a clear place to inspect failures without mixing them into analytics rows.

Pick a review limit per use case

limit_per_input controls how many review rows the API returns for each SKU. Use a small value for smoke tests and a larger value for production backfills.

Use case	Suggested `limit_per_input`
API smoke test	`1`
Daily monitoring	`25` to `100`
New product backfill	`500` or higher
Sentiment dataset build	Match your dataset target

The right limit depends on how many SKUs you run and how much history you need. A daily monitor usually needs recent reviews, while a sentiment dataset often needs a larger sample.

Start with 1 row while testing credentials, schema, and storage. Increase the limit after your pipeline writes raw rows and normalized rows correctly.

Pricing

ScrapeNow charges per returned row. One row costs one credit, starting at $0.04 per credit for small runs and dropping with volume. No monthly contracts, no proxy fees, no charges for failed rows. See the pricing page for current rates.