Skip to main content
Blog

Facebook posts scraper for post-level data extraction

Use the facebook posts scraper to extract post text, reactions, comments, shares, attachments, and page metadata from Facebook post URLs.

ScrapersFacebookMay 9, 2026
Facebook posts scraper for post-level data extraction

ScrapeNow's Facebook Posts Scraper extracts post text, post ID, author page, reaction breakdown, comment count, share count, media attachments, verification status, and follower count from a Facebook post URL.

Growth teams, data engineers, and social analytics teams use it when they need structured post records. They avoid maintaining selectors, browser sessions, proxy routing, and anti-bot handling for every Facebook markup change.

This scraper works best when you already have post URLs from search, page feeds, internal reports, or another ScrapeNow job. One input URL returns one post-level record.

How to use this scraper

Facebook Posts Scraper job pipeline
The Facebook Posts Scraper job pipeline, from input to stored output.

The scraper takes one input field, url. Pass a Facebook post URL that starts with https://www.facebook.com.

Use the Extract Facebook page posts scraper when you need to collect post URLs from a page. Use this scraper after that step to extract the full post payload.

ScrapeNow also has Facebook scrapers for Extract Facebook event data, Find Facebook events by venue, Search Marketplace listings, and Get Marketplace listing data.

Step 1. Open Facebook and search for the page or keyword

Open facebook.com.

Type a keyword in the search bar, such as Coldplay. You can also search for a brand name, creator name, public figure, campaign hashtag, or news topic.

Facebook search bar autocompleting Coldplay suggestions from the news feed
Search for the page or keyword to find posts you want to extract

Step 2. Open the post from the timestamp link

On the results page, click the post timestamp link.

Facebook opens the post in a new tab. The timestamp link gives you a post URL that works better for extraction than a feed URL with extra tracking parameters.

Facebook results highlighting Coldplay post timestamp link to open
Click the post timestamp link to open it in a new tab with a clean URL

Step 3. Copy the post URL

Copy the URL from the browser address bar.

The input must start with https://www.facebook.com. Remove fragments and unrelated query parameters if your internal pipeline stores canonical URLs separately.

Facebook Coldplay Adventure Of A Lifetime video post URL highlighted
Copy the post URL from the address bar for use as scraper input

Step 4. Run the API job

Use this Python script. Replace YOUR_API_KEY with your ScrapeNow API key.

The script starts a job, polls until completion, downloads JSON results, and writes them to output/facebook-posts-extract-by-url.json.

"""
Configuration:
    - Set SCRAPER_SLUG to the scraper you want to run.
    - Set SCRAPER_INPUTS to the list of input dicts matching that scraper's schema.
    - Set API_KEY to your Scraper API key.
"""

import sys
import time
import json
import requests
import os

API_KEY = "YOUR_API_KEY"

SCRAPER_SLUG = "facebook-posts-extract-by-url"

SCRAPER_INPUTS = [
    {
        "url": "https://www.facebook.com/harrystyles/posts/pfbid02eZdTeHzUxcgh4MrbJFPoBbkWzxjS4ezwLQBLPd8udHnNerJaGc5z6oqvxSKgimc2l"
    }
]



BASE_URL = "https://api.scrapenow.io/api/v1/scraping"
TIMEOUT_SECONDS = 3600
POLL_INTERVAL = 5
SPINNER = "|/-\\"


def build_headers(api_key: str, content_type: str | None = None) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    if content_type:
        headers["Content-Type"] = content_type
    return headers


def trigger_scrape(slug: str, inputs: list[dict]) -> str:
    url = f"{BASE_URL}/scrape?scraper={slug}"
    response = requests.post(
        url,
        headers=build_headers(API_KEY, "application/json"),
        json={"inputs": inputs},
    )
    response.raise_for_status()
    return response.json()["data"]["job_id"]


def poll_until_done(job_id: str) -> str:
    start = time.time()
    i = 0
    while True:
        elapsed = time.time() - start
        if elapsed > TIMEOUT_SECONDS:
            print(f"\nTimeout after {TIMEOUT_SECONDS}s")
            sys.exit(1)
        response = requests.get(
            f"{BASE_URL}/jobs/{job_id}",
            headers=build_headers(API_KEY),
        )
        response.raise_for_status()
        data = response.json()
        status = data["data"]["status"]
        mins, secs = divmod(int(elapsed), 60)
        sys.stdout.write(
            f"\r[{SPINNER[i % 4]}] Waiting... {status} ({mins}m {secs:02d}s)  "
        )
        sys.stdout.flush()
        if status in ("completed", "failed"):
            print()
            return status
        time.sleep(POLL_INTERVAL)
        i += 1


def fetch_results(job_id: str) -> dict:
    response = requests.get(
        f"{BASE_URL}/jobs/{job_id}/results?format=json",
        headers=build_headers(API_KEY),
    )
    response.raise_for_status()
    return response.json()


def save_results(data: dict, slug: str) -> str:
    os.makedirs("output", exist_ok=True)
    filename = os.path.join("output", f"{slug}.json")
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return filename


def main() -> None:
    print(f"Triggering scraper: {SCRAPER_SLUG}")
    job_id = trigger_scrape(SCRAPER_SLUG, SCRAPER_INPUTS)
    print(f"Job started: {job_id}")
    final_status = poll_until_done(job_id)
    if final_status != "completed":
        print(f"Job failed with status: {final_status}")
        sys.exit(1)
    print("Fetching results...")
    results = fetch_results(job_id)
    output_file = save_results(results, SCRAPER_SLUG)
    print(f"Results saved to: {output_file}")


if __name__ == "__main__":
    main()

Step 5. Read the JSON output

A completed job returns an array of records. One input URL creates one output row.

The response keeps the original inputs object, so you can trace every row back to the submitted URL. That matters when you retry failed rows or compare duplicate URLs from search, page feeds, and shared links.

[
  {
    "inputs": {
      "url": "https://www.facebook.com/harrystyles/posts/pfbid02eZdTeHzUxcgh4MrbJFPoBbkWzxjS4ezwLQBLPd8udHnNerJaGc5z6oqvxSKgimc2l"
    },
    "scrape_status": "success",
    "url": "https://www.facebook.com/harrystyles/posts/pfbid02eZdTeHzUxcgh4MrbJFPoBbkWzxjS4ezwLQBLPd8udHnNerJaGc5z6oqvxSKgimc2l",
    "post_id": "1503996387764696",
    "user_url": "https://www.facebook.com/harrystyles",
    "user_username_raw": "Harry Styles",
    "content": "KATTDO Pop-Ups.",
    "date_posted": "2026-03-04T14:05:38.000Z",
    "hashtags": [],
    "num_comments": 805,
    "num_shares": 408,
    "num_likes_type": [
      {
        "type": "Love",
        "num": 6166
      },
      {
        "type": "Like",
        "num": 4078
      },
      {
        "type": "Care",
        "num": 128
      },
      {
        "type": "Sad",
        "num": 15
      },
      {
        "type": "Wow",
        "num": 14
      },
      {
        "type": "Haha",
        "num": 2
      },
      {
        "type": "Angry",
        "num": 2
      }
    ],
    "profile_id": "100044630460255",
    "page_logo": "https://scontent-lga3-2.xx.fbcdn.net/v/t39.30808-1/616198291_1463482668482735_6359717705419257039_n.jpg?stp=dst-jpg_s200x200_tt6&_nc_cat=109&ccb=1-7&_nc_sid=2d3e12&_nc_ohc=ZWuNTTiGg4kQ7kNvwH9PWjE&_nc_oc=AdoSdEJMwgcbSUumVFJg1aHPRjK_6keDPS4zIbEOlN5MmjpytNpZOEJyjt-QkqTeQOU&_nc_zt=24&_nc_ht=scontent-lga3-2.xx&_nc_gid=ab0PQNIzDbyjNQ10sFPdIQ&_nc_ss=79289&oh=00_Af77eENADVP1Eut6b0KRTS4yjhyS4-5KBOMevKpgCrveIQ&oe=6A0A113C",
    "page_likes": null,
    "page_followers": 17000000,
    "page_is_verified": true,
    "original_post": {
      "hashtags": null,
      "attachments": null
    },
    "attachments": [
      {
        "id": "1503996281098040",
        "type": "photo",
        "url": "https://scontent-fra5-2.xx.fbcdn.net/v/t51.82787-15/645775119_18330818206217391_8897228986295042063_n.jpg?_nc_cat=106&ccb=1-7&_nc_sid=127cfc&_nc_ohc=CVyeiZB1rXkQ7kNvwGfR347&_nc_oc=AdonIQgnofhdMZthyTbvYtukz34WV9olz8pbIOto0OlExtKsC3WsC3k8KNH0g4QrXwA&_nc_zt=23&_nc_ht=scontent-fra5-2.xx&_nc_gid=37_KihaFJHLXBTgEAzlDYg&_nc_ss=79289&oh=00_Af6Upgg70TAG-SEdM77xzgtBTd-TG1GTvd7rx3xCz9BlhQ&oe=6A0A0813",
        "video_length": null,
        "source_type": null,
        "attachment_url": "https://www.facebook.com/photo.php?fbid=1503996281098040&set=a.286918702805810&type=3",
        "video_url": null,
        "thumbnail_url": null
      },
      {
        "id": "1503996314431370",
        "type": "photo",
        "url": "https://scontent-fra5-2.xx.fbcdn.net/v/t51.82787-15/645728379_18330818215217391_4742701517630113152_n.jpg?_nc_cat=106&ccb=1-7&_nc_sid=127cfc&_nc_ohc=bGc8GWmZClsQ7kNvwFzlTDg&_nc_oc=AdoMvpegk_cQzdq4LOlPI1jNH_iG639Al5FqdgWV185rtypCp1MiiQcinLNV7H6DvLw&_nc_zt=23&_nc_ht=scontent-fra5-2.xx&_nc_gid=37_KihaFJHLXBTgEAzlDYg&_nc_ss=79289&oh=00_Af7E5LWh41SKW1YRF4fpeeEZbz5xHbbnDVMbB86NyyaE_A&oe=6A0A0E7C",
        "video_length": null,
        "source_type": null,
        "attachment_url": "https://ww
... (truncated)

What data you get back

Facebook Posts Scraper output schema
Facebook Posts Scraper output fields grouped by category.

The response gives you the original input, scrape status, normalized post URL, and extracted post fields.

Field Type What to use it for
scrape_status string Filter successful rows from failed rows
url string Store the canonical post URL
post_id string Deduplicate posts across repeated runs
user_url string Join posts back to a Facebook page or profile
user_username_raw string Display name from the post author
content string Post text
date_posted ISO datetime Time-series analysis and freshness checks
hashtags array Tag extraction
num_comments integer Engagement metric
num_shares integer Engagement metric
num_likes_type array Reaction counts by type
profile_id string Stable author identifier when present
page_logo string Page image URL
page_followers integer or null Page audience size
page_is_verified boolean Verification status
attachments array Photos, videos, thumbnails, and attachment URLs

num_likes_type is more useful than a single total because Facebook reactions carry different intent. In the sample above, Love has 6166, Like has 4078, and Care has 128.

Store reaction counts as separate columns if analysts compare sentiment across posts. Keep the original array as raw JSON so you can backfill new reaction types if Facebook changes the payload.

attachments is an array because one post can contain multiple photos or videos. Each attachment includes an id, type, direct media URL, Facebook attachment URL, and video fields when Facebook exposes them.

Treat media URLs as time-sensitive assets. Facebook CDN URLs can expire, so store the attachment metadata and fetch media quickly if your workflow requires local copies.

original_post helps when the post references another post. Keep it as a nested object in your raw table, then flatten it later if your warehouse schema needs columns.

Ready to get this data? Extract Facebook post data.

Production tips

Facebook Posts Scraper post processing pipeline
The Facebook Posts Scraper post processing pipeline that shapes raw results into warehouse tables.

Validate inputs, deduplicate results, and write failed rows to a retry table.

Validate input URLs before creating jobs

Catch invalid URLs before calling the API.

from urllib.parse import urlparse

def validate_facebook_post_url(url: str) -> None:
    parsed = urlparse(url)

    if parsed.scheme != "https":
        raise ValueError("Facebook post URL must use https")

    if parsed.netloc != "www.facebook.com":
        raise ValueError("Facebook post URL must start with https://www.facebook.com")

    if not parsed.path or parsed.path == "/":
        raise ValueError("Facebook post URL path is empty")

urls = [
    "https://www.facebook.com/harrystyles/posts/pfbid02eZdTeHzUxcgh4MrbJFPoBbkWzxjS4ezwLQBLPd8udHnNerJaGc5z6oqvxSKgimc2l"
]

for url in urls:
    validate_facebook_post_url(url)

For page-level collection, use Extract Facebook page posts as the source of post URLs.

Deduplicate on post_id

Use post_id as your primary dedupe key. If post_id is missing on a failed scrape, fall back to the input URL.

This pattern keeps successful rows stable across repeated runs. It also keeps failed rows tied to the exact input that produced the failure.

import json

with open("output/facebook-posts-extract-by-url.json", "r", encoding="utf-8") as f:
    rows = json.load(f)

seen = set()
deduped = []

for row in rows:
    key = row.get("post_id") or row.get("inputs", {}).get("url")

    if key in seen:
        continue

    seen.add(key)
    deduped.append(row)

print(f"Input rows: {len(rows)}")
print(f"Deduped rows: {len(deduped)}")

Run this after every batch if you scrape the same accounts daily.

Store raw JSON plus a flat table

Keep the full JSON response in object storage or a raw_json column. Then write a flat table for the fields your app queries.

A practical warehouse schema looks like this:

Column Type
post_id text
url text
user_url text
user_username_raw text
content text
date_posted timestamp
num_comments integer
num_shares integer
page_followers integer
page_is_verified boolean
reaction_love integer
reaction_like integer
reaction_care integer
reaction_sad integer
reaction_wow integer
reaction_haha integer
reaction_angry integer
attachments_count integer
raw_json json

Flatten reactions with a small helper.

def reaction_map(row: dict) -> dict:
    reactions = {
        "reaction_love": 0,
        "reaction_like": 0,
        "reaction_care": 0,
        "reaction_sad": 0,
        "reaction_wow": 0,
        "reaction_haha": 0,
        "reaction_angry": 0,
    }

    for item in row.get("num_likes_type") or []:
        key = f"reaction_{item.get('type', '').lower()}"
        if key in reactions:
            reactions[key] = item.get("num") or 0

    return reactions

Use 0 for missing reaction types when you build metric columns. Keep null for fields that Facebook did not expose, such as page_likes in the sample response.

That difference matters in reporting. A missing field means the scraper did not receive the value, while 0 means the value exists and has no count.

Retry only failed rows

The API response includes scrape_status. Use it to split successful rows from rows that need another run.

def split_results(rows: list[dict]) -> tuple[list[dict], list[dict]]:
    success = []
    retry_inputs = []

    for row in rows:
        if row.get("scrape_status") == "success":
            success.append(row)
        else:
            original_input = row.get("inputs")
            if original_input:
                retry_inputs.append(original_input)

    return success, retry_inputs

Cap retries at 3 attempts. After that, move the URL to a review table.

Batch by job size

A practical starting point is 100 post URLs per job. For larger queues, persist job_id as soon as ScrapeNow returns it so a worker restart does not lose a running job.

Common use cases for this scraper

Use this scraper when you already have Facebook post URLs and need the post-level payload.

Good fits include:

  • Tracking engagement on a known set of brand posts
  • Enriching post URLs collected from Facebook search
  • Pulling media attachment URLs from public posts
  • Building a daily table of comments, shares, and reactions
  • Checking verification and follower counts for the author page
  • Auditing campaign posts across creator and brand pages
  • Refreshing metrics on posts that your team already tracks

If you need event data, use Extract Facebook event data for known event URLs. Use Find Facebook events by venue when the venue is your starting point.

For commerce data, the Facebook scraping set includes Search Marketplace listings and marketplace extraction by URL.

Pricing

ScrapeNow charges per returned row. One row costs one credit, starting at $0.04 per credit for small runs and dropping with volume. No monthly contracts, no proxy fees, no charges for failed rows. See the pricing page for current rates.

Start with Extract Facebook page posts. Copy one public post URL, run the Python script above, and store the output fields your pipeline needs.

For most teams, the minimum raw table should include post_id, url, content, date_posted, num_comments, num_shares, num_likes_type, attachments, page_followers, and page_is_verified. Keep the full raw_json payload next to those columns so schema changes do not destroy data you already paid to collect.

Related articles

View all

Start collecting data in under five minutes.

Free credits included - no credit card required.

Free credits included - no credit card required