Facebook page scraper for pages and profiles

Facebook page metadata includes name, category, location, follower count, verification status, and contact fields. The ScrapeNow Facebook page scraper extracts all of these from a public URL and returns structured JSON, without browser sessions or proxy management.

Use this scraper when you already have the Facebook URL and need structured metadata back. The scraper handles public pages and profiles, then returns one result object for each submitted URL.

Run it as the first step in a Facebook data pipeline. Pull the page metadata once, store the canonical URL and Facebook ID, then use those keys for posts, events, creator records, or downstream enrichment.

The scraper is built for known URLs. If your source list contains search terms, shared links, mobile links, or redirect URLs, clean those inputs before submission.

How to use this scraper

The Facebook Page Scraper job pipeline, from input to stored output.

Use the the Facebook Pages and Profiles Extract by URL scraper scraper when you already have a Facebook page or profile URL. The input has one field, url, and the value must start with https://www.facebook.com/.

The scraper expects a direct Facebook URL. Normalize search result URLs, redirect links, mobile URLs, and shortened links before you create the job.

Step 1. Find the Facebook page or profile URL

Open facebook.com.

Facebook search bar autocompleting Coldplay page and music suggestions — Type the page or profile name in the Facebook search bar

In the search bar, type the name of the page or person. Use a query such as Coldplay or Harry Styles.

Facebook search results showing verified Coldplay musician page — Identify the verified Page from the search results before clicking

Click the target profile or page on the results page. Copy the URL from the browser address bar.

Facebook Coldplay Moon Music page with facebook.com/coldplay URL highlighted — Copy the canonical page URL from the address bar for scraper input

The URL should match one of these examples:

https://www.facebook.com/harrystyles
https://www.facebook.com/coldplay

Avoid URLs with tracking parameters from search results or shared links. A clean profile path gives you predictable dedupe logic later.

Keep one URL per page or profile in your source table. If your input source contains multiple variants, normalize them before the first scrape.

Store the normalized URL before you call the scraper. That gives your ingestion job a single source of truth for retries, audit logs, and dedupe checks.

If you need posts from a Facebook page after extracting page metadata, use the Extract Facebook page posts scraper. It takes a page URL and returns public post records.

Step 2. Run the scraper from Python

Use this Python script. Replace YOUR_API_KEY with your ScrapeNow API key.

"""
Configuration:
    - Set SCRAPER_SLUG to the scraper you want to run.
    - Set SCRAPER_INPUTS to the list of input dicts matching that scraper's schema.
    - Set API_KEY to your scraper API key.
"""

import sys
import time
import json
import requests
import os

API_KEY = "YOUR_API_KEY"

SCRAPER_SLUG = "facebook-pages-and-profiles-extract-by-url"

SCRAPER_INPUTS = [
    {
        "url": "https://www.facebook.com/harrystyles"
    }
]



BASE_URL = "https://api.scrapenow.io/api/v1/scraping"
TIMEOUT_SECONDS = 3600
POLL_INTERVAL = 5
SPINNER = "|/-\\"


def build_headers(api_key: str, content_type: str | None = None) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    if content_type:
        headers["Content-Type"] = content_type
    return headers


def trigger_scrape(slug: str, inputs: list[dict]) -> str:
    url = f"{BASE_URL}/scrape?scraper={slug}"
    response = requests.post(
        url,
        headers=build_headers(API_KEY, "application/json"),
        json={"inputs": inputs},
    )
    response.raise_for_status()
    return response.json()["data"]["job_id"]


def poll_until_done(job_id: str) -> str:
    start = time.time()
    i = 0
    while True:
        elapsed = time.time() - start
        if elapsed > TIMEOUT_SECONDS:
            print(f"\nTimeout after {TIMEOUT_SECONDS}s")
            sys.exit(1)
        response = requests.get(
            f"{BASE_URL}/jobs/{job_id}",
            headers=build_headers(API_KEY),
        )
        response.raise_for_status()
        data = response.json()
        status = data["data"]["status"]
        mins, secs = divmod(int(elapsed), 60)
        sys.stdout.write(
            f"\r[{SPINNER[i % 4]}] Waiting... {status} ({mins}m {secs:02d}s)  "
        )
        sys.stdout.flush()
        if status in ("completed", "failed"):
            print()
            return status
        time.sleep(POLL_INTERVAL)
        i += 1


def fetch_results(job_id: str) -> dict:
    response = requests.get(
        f"{BASE_URL}/jobs/{job_id}/results?format=json",
        headers=build_headers(API_KEY),
    )
    response.raise_for_status()
    return response.json()


def save_results(data: dict, slug: str) -> str:
    os.makedirs("output", exist_ok=True)
    filename = os.path.join("output", f"{slug}.json")
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return filename


def main() -> None:
    print(f"Triggering scraper: {SCRAPER_SLUG}")
    job_id = trigger_scrape(SCRAPER_SLUG, SCRAPER_INPUTS)
    print(f"Job started: {job_id}")
    final_status = poll_until_done(job_id)
    if final_status != "completed":
        print(f"Job failed with status: {final_status}")
        sys.exit(1)
    print("Fetching results...")
    results = fetch_results(job_id)
    output_file = save_results(results, SCRAPER_SLUG)
    print(f"Results saved to: {output_file}")


if __name__ == "__main__":
    main()

The script does four things.

Starts a ScrapeNow job with the page or profile URL.
Polls the job every 5 seconds.
Stops after 3600 seconds if the job does not finish.
Writes the JSON response to output/facebook-pages-and-profiles-extract-by-url.json.

This pattern works for small tests and scheduled production jobs. For larger batches, pass multiple input objects in SCRAPER_INPUTS, then process each returned row by scrape_status.

The script sets a 30 second request timeout for each API call. That prevents a stalled network request from blocking the worker indefinitely.

Keep the polling interval at 5 seconds unless your workload requires faster status updates. Shorter intervals add API chatter without changing extraction speed.

For broader Facebook extraction, Browse all 86+ scrapers across 14 platforms. Use the scraper that matches your input type so the pipeline stays predictable.

Step 3. Read the JSON output

A successful run returns an array of result objects. This response shows the shape returned by the Facebook pages and profiles scraper.

[
  {
    "inputs": {
      "url": "https://www.facebook.com/harrystyles"
    },
    "scrape_status": "success",
    "id": "100044630460255",
    "url": "https://www.facebook.com/harrystyles",
    "page_name": "Harry Styles",
    "username": "harrystyles",
    "entity_type": "PAGE",
    "summary_text": "Official Harry Styles Facebook Page.",
    "primary_category": "Musician",
    "address": {
      "formatted": "Holmes Chapel, United Kingdom"
    },
    "contact_and_basic_info": {
      "categories": [
        "Musician"
      ],
      "contact_info": {
        "address": {
          "formatted": "Holmes Chapel, United Kingdom"
        }
      }
    },
    "followers": 17000000,
    "logo": "https://scontent.fcok10-2.fna.fbcdn.net/v/t39.30808-1/616198291_1463482668482735_6359717705419257039_n.jpg?stp=dst-jpg_s200x200_tt6&_nc_cat=109&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=ZWuNTTiGg4kQ7kNvwFqqZ8O&_nc_oc=AdqKf9a7FCVK8sYP3mnFkpNC3MpHl2n8c0_XMaYdPOSRjkIobEz7wSlYbgS-4_MUsx4&_nc_zt=24&_nc_ht=scontent.fcok10-2.fna&_nc_gid=jCKYlMcS-Nk6ThoCq122yw&_nc_ss=79289&oh=00_Af7fbNSUSi7U3nPRlo8q7GzH9pEWRBIE3rBL0QQr-4kJUA&oe=6A0A113C",
    "is_verified": true,
    "scrape_error": null,
    "scrape_error_code": null
  }
]

What data you get back

Facebook Page Scraper output schema — Facebook Page Scraper output fields grouped by category.

The response contains the original input, scrape status, public Facebook identifiers, page metadata, and error fields. Treat id and url as your strongest keys.

Use id for dedupe when Facebook returns it. Use url as the fallback key when the ID is missing from a failed or partial row.

Keep inputs.url in storage after you receive a canonical URL. It tells you what your system submitted and helps debug invalid source data.

Field	Type	How to use it
`inputs.url`	string	Original URL you submitted. Keep it for traceability.
`scrape_status`	string	`success` means the row extracted cleanly.
`id`	string	Facebook entity ID. Use it as the stable dedupe key when present.
`url`	string	Canonical Facebook page or profile URL returned by the scraper.
`page_name`	string	Display name shown on the page.
`username`	string	Public handle from the Facebook URL.
`entity_type`	string	Entity type such as `PAGE`.
`summary_text`	string	Public page summary or description.
`primary_category`	string	Main page category, such as `Musician`.
`address.formatted`	string	Public formatted location if available.
`contact_and_basic_info.categories`	array	Category list from the page metadata.
`followers`	integer	Public follower count.
`logo`	string	Public profile image URL.
`is_verified`	boolean	Verification status.
`scrape_error`	string or null	Error message for failed rows.
`scrape_error_code`	string or null	Machine-readable error code for retries and reporting.

Store the followers field as an integer. Text storage breaks numeric sorting and creates extra casting work in analytics queries.

The logo value often contains a long Facebook CDN URL. Facebook rotates these URLs, so refresh the value during your next scrape cycle.

The nested contact_and_basic_info object leaves room for extra page metadata. Your top-level schema stays stable when Facebook exposes another public field.

For CRM enrichment, map page_name, username, primary_category, and followers to typed columns. Keep the raw JSON payload so you reprocess rows without rescraping Facebook.

If your CRM stores social accounts separately from companies or contacts, keep Facebook IDs in the social account table. Then join the account to the person, artist, brand, venue, or organization record.

For analytics, store scrape timestamps beside the extracted values. Follower counts and verification status change over time, and timestamped rows make that history queryable.

Ready to get this data? Get Facebook page data.

Production tips

The Facebook Page Scraper data pipeline that shapes raw results into warehouse tables.

Validate before you submit jobs. Dedupe before you write rows. Check row-level status before you mark the batch as finished.

Validate URLs before sending them

Reject mobile URLs, shortened links, and non-Facebook domains before creating the job.

from urllib.parse import urlparse

def validate_facebook_url(url: str) -> str:
    parsed = urlparse(url.strip())

    if parsed.scheme != "https":
        raise ValueError(f"Invalid scheme: {url}")

    if parsed.netloc != "www.facebook.com":
        raise ValueError(f"Invalid host: {url}")

    if not parsed.path or parsed.path == "/":
        raise ValueError(f"Missing page or profile path: {url}")

    return f"{parsed.scheme}://{parsed.netloc}{parsed.path.rstrip('/')}"


urls = [
    "https://www.facebook.com/harrystyles",
    "https://www.facebook.com/coldplay",
]

validated_urls = [validate_facebook_url(url) for url in urls]
print(validated_urls)

Expected output:

[
  "https://www.facebook.com/harrystyles",
  "https://www.facebook.com/coldplay"
]

This validator keeps the path and removes a trailing slash. Add your own rules if your source system stores m.facebook.com, shared links, or internal redirect URLs.

Run validation before billing starts. Also remove query strings before scraping page or profile metadata.

Deduplicate by Facebook ID first

A page can be submitted as a vanity URL and returned with a canonical URL. Use id as your primary dedupe key when it exists.

def dedupe_results(rows: list[dict]) -> list[dict]:
    seen = set()
    deduped = []

    for row in rows:
        key = row.get("id") or row.get("url") or row.get("inputs", {}).get("url")

        if key in seen:
            continue

        seen.add(key)
        deduped.append(row)

    return deduped

For example, https://www.facebook.com/harrystyles and another URL variant for the same page should collapse into one row. Both rows return id as 100044630460255.

Use the canonical URL from url for future scrapes. Enforce dedupe at both the application layer and the database layer with a PRIMARY KEY or unique index on facebook_id.

Store a stable schema

Keep your database schema plain. Store top-level fields as typed columns, then keep the full JSON object for replay and debugging.

CREATE TABLE facebook_pages (
    facebook_id TEXT PRIMARY KEY,
    input_url TEXT NOT NULL,
    canonical_url TEXT,
    page_name TEXT,
    username TEXT,
    entity_type TEXT,
    primary_category TEXT,
    formatted_address TEXT,
    followers BIGINT,
    logo_url TEXT,
    is_verified BOOLEAN,
    scrape_status TEXT NOT NULL,
    scrape_error TEXT,
    scrape_error_code TEXT,
    raw_json JSONB NOT NULL,
    scraped_at TIMESTAMP NOT NULL DEFAULT NOW()
);

This schema keeps raw_json for fields you decide to use later. Use BIGINT for follower counts and keep nullable fields nullable.

Handle row-level failures

A completed job can contain failed rows. Check scrape_status per result instead of trusting the job status alone.

def split_success_and_failed(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    success = []
    failed = []

    for row in rows:
        if row.get("scrape_status") == "success":
            success.append(row)
        else:
            failed.append(row)

    return success, failed


rows = [
    {
        "inputs": {"url": "https://www.facebook.com/harrystyles"},
        "scrape_status": "success",
        "id": "100044630460255"
    },
    {
        "inputs": {"url": "https://www.facebook.com/invalid-input"},
        "scrape_status": "failed",
        "scrape_error": "Page unavailable",
        "scrape_error_code": "PAGE_UNAVAILABLE"
    }
]

success, failed = split_success_and_failed(rows)

print(f"success={len(success)} failed={len(failed)}")

Expected output:

success=1 failed=1

Retry failed rows only when the error code is retryable. Cap retries at 3 attempts with backoff.

Track schema drift

Treat optional fields as nullable. Save raw_json on every row so you can backfill new columns later. Track row counts by scrape_status and alert on sudden drops in expected fields.

When to use the adjacent Facebook scrapers

Use the pages and profiles scraper when you need entity metadata from a known URL. Use a different scraper when your input type or target output changes.

Goal	Scraper to use
Extract public metadata from a known page or profile URL	Facebook Pages and Profiles Extract by URL
Extract public posts from a known page URL	Extract Facebook page posts
Extract event details from a known event URL	Extract Facebook event data
Search events by venue	Find Facebook events by venue
Extract a marketplace listing from a known URL	Get Marketplace listing data
Search marketplace listings by keyword	Search Marketplace listings

A common pipeline starts with page or profile metadata, then pulls page posts for the same canonical URL. That gives you account-level fields such as followers and post-level fields from the page posts scraper.

The same pattern works for creator databases, music rosters, brand monitoring, and public entity directories. Store the page row once, then attach posts, events, or marketplace records by canonical URL or Facebook ID.

For brand monitoring, refresh high-priority pages daily and long-tail pages weekly. Store each scrape timestamp so analysts separate current values from historical snapshots.

For creator databases, keep follower count history in a separate table. One row per scrape gives you growth curves without overwriting yesterday's value.

For music rosters, join Facebook page records to artist IDs in your internal catalog. Store aliases because stage names, legal names, and page names often differ.

For venue databases, keep the formatted address separate from the raw contact object. Address fields change across regions, and raw JSON helps during cleanup.

Pricing

ScrapeNow charges per returned row. One row costs one credit, starting at $0.04 per credit for small runs and dropping with volume. No monthly contracts, no proxy fees, no charges for failed rows. See the pricing page for current rates.

Run the code above with your own URL against the Get Facebook page data scraper. If you are mapping the rest of your Facebook data pipeline, Browse all 86+ scrapers and pick the scraper that matches your input type.