
How to extract LinkedIn company data by URL

Use the LinkedIn companies scraper to extract public company profile data by URL with async jobs, clean output, and ready-to-load JSON.

Scrapers | LinkedIn | April 18, 2026

LinkedIn company pages show industry, size, headquarters, specialties, and follower counts, but LinkedIn's anti-scraping measures block most automated tools. ScrapeNow's Extract LinkedIn company data scraper reads a public LinkedIn company page and returns one structured record per input URL.

How the scraper behaves

The scraper takes one required input field named url. The value must be a public LinkedIn company profile URL.

A valid input uses this pattern:

https://www.linkedin.com/company/{company-slug}/

Each input object returns one output record. The result keeps the original input under inputs.url, so your pipeline maps every response back to its source row.

ScrapeNow runs the scrape as an async job. You submit inputs, receive a job_id, poll job status, then download results when the job reaches a terminal state.

The API returns job status values through the jobs endpoint. The sample script below treats completed and failed as terminal states.

This async flow works better than a long blocking request for batch scraping. If your client process exits during polling, the job still has a server-side ID you can query later.

How to use this scraper

[Figure: The LinkedIn Companies Scraper job lifecycle, from input to stored output.]

ScrapeNow's Extract LinkedIn company data scraper takes one input value. That value is the public LinkedIn company profile URL.

The input URL must start with:

https://www.linkedin.com/

Example input:

{
  "url": "https://www.linkedin.com/company/universalmusicgroup/"
}

The scraper reads the company profile linked by that URL and returns one output record for that input. For batch runs, submit multiple input objects in the same inputs array.

Keep each input object small. Send the URL field only unless the scraper schema adds more fields later.

Step 1. Find the input field

The scraper accepts a single field named url.

[Figure: The URL input field for the LinkedIn companies scraper]

Use a company profile URL. Do not send search result URLs, people URLs, post URLs, job URLs, school URLs, or hashtag URLs.

A valid company input uses this path pattern:

https://www.linkedin.com/company/{company-slug}/

For example, universalmusicgroup is the company slug in this URL:

https://www.linkedin.com/company/universalmusicgroup/

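Treat the slug as a useful dedupe signal and the full normalized URL as the storage key, because the output echoes that URL under inputs.url. Slugs, and the URLs built from them, can change after a company rebrand, so also keep company_id when the output includes it.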

If you store both, use the normalized URL as the primary lookup key. Keep the slug as a secondary field for matching and reporting.
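
As a minimal sketch of storing both values, the helper below pulls the slug out of a company URL. The function name slug_from_company_url is illustrative and not part of the ScrapeNow API.

```python
from typing import Optional
from urllib.parse import urlparse


def slug_from_company_url(url: str) -> Optional[str]:
    """Return the slug from a /company/{slug}/ URL, or None if absent."""
    # Drop empty segments produced by leading and trailing slashes.
    parts = [p for p in urlparse(url).path.split("/") if p]
    # Expect the path to begin with "company" followed by the slug.
    if len(parts) >= 2 and parts[0] == "company":
        return parts[1]
    return None


print(slug_from_company_url("https://www.linkedin.com/company/universalmusicgroup/"))
# prints: universalmusicgroup
```

Returning None for non-company paths lets the caller reject the row instead of storing an empty slug.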

Step 2. Open LinkedIn

Open linkedin.com.

[Figure: Search for a company on LinkedIn using the search bar]

In the search bar, type a company name such as Universal Music Group.

Use the company result that points to a company profile. LinkedIn search also returns people, jobs, posts, groups, and schools for the same query.

Check the browser path before copying the URL. Company pages use /company/, while job pages use /jobs/ and people profiles use /in/.

This check prevents a common ingestion error. A people URL and a company URL can share similar names, especially for founders and small businesses.
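
The path check can also run in code before a URL reaches your input file. This is a rough sketch that covers only the three prefixes named above; the function name and the "other" label are assumptions made for illustration.

```python
from urllib.parse import urlparse

# Path prefixes named in the text; every other path is labeled "other".
PREFIX_TO_KIND = {
    "/company/": "company",
    "/jobs/": "job",
    "/in/": "person",
}


def classify_linkedin_url(url: str) -> str:
    """Label a LinkedIn URL by its path prefix."""
    path = urlparse(url).path
    for prefix, kind in PREFIX_TO_KIND.items():
        if path.startswith(prefix):
            return kind
    return "other"


for candidate in (
    "https://www.linkedin.com/company/universalmusicgroup/",
    "https://www.linkedin.com/in/example-person/",
):
    print(candidate, "->", classify_linkedin_url(candidate))
```

Only URLs labeled company should go into the scraper input.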

Step 3. Open the company card and copy the URL

On the results page, click the company's card.

[Figure: Copy the company page URL from the address bar]

Copy the URL from the browser address bar. It should look like this:

https://www.linkedin.com/company/universalmusicgroup/

Remove tracking parameters before storing the URL. Keep the clean company URL and remove ?trk=... from the end.

Clean URLs improve dedupe rates and reduce wasted rows. They also make retry files easier to inspect by hand.

Normalize URLs before you build the API payload. This keeps the source file, request batch, and warehouse key aligned.

Step 4. Run the API request

Use this Python script. It starts a scrape job, polls for completion, fetches the JSON result, and writes the output to disk.

"""
Configuration:
    - Set SCRAPER_SLUG to the scraper you want to run.
    - Set SCRAPER_INPUTS to the list of input dicts matching that scraper's schema.
    - Set API_KEY to your scraper API key.
"""

import sys
import time
import json
import requests
import os

API_KEY = "YOUR_API_KEY"

SCRAPER_SLUG = "linkedin-companies-extract-by-url"

SCRAPER_INPUTS = [
    {
        "url": "https://www.linkedin.com/company/universalmusicgroup/"
    }
]



BASE_URL = "https://api.scrapenow.io/api/v1/scraping"
TIMEOUT_SECONDS = 3600
POLL_INTERVAL = 5
SPINNER = "|/-\\"


def build_headers(api_key: str, content_type: str | None = None) -> dict:
    headers = {"Authorization": f"Bearer {api_key}"}
    if content_type:
        headers["Content-Type"] = content_type
    return headers


def trigger_scrape(slug: str, inputs: list[dict]) -> str:
    url = f"{BASE_URL}/scrape?scraper={slug}"
    response = requests.post(
        url,
        headers=build_headers(API_KEY, "application/json"),
        json={"inputs": inputs},
        timeout=30,  # fail fast instead of hanging on a stalled connection
    )
    response.raise_for_status()
    return response.json()["data"]["job_id"]


def poll_until_done(job_id: str) -> str:
    start = time.time()
    i = 0
    while True:
        elapsed = time.time() - start
        if elapsed > TIMEOUT_SECONDS:
            print(f"\nTimeout after {TIMEOUT_SECONDS}s")
            sys.exit(1)
        response = requests.get(
            f"{BASE_URL}/jobs/{job_id}",
            headers=build_headers(API_KEY),
            timeout=30,  # a stalled request would otherwise bypass TIMEOUT_SECONDS
        )
        response.raise_for_status()
        data = response.json()
        status = data["data"]["status"]
        mins, secs = divmod(int(elapsed), 60)
        sys.stdout.write(
            f"\r[{SPINNER[i % 4]}] Waiting... {status} ({mins}m {secs:02d}s)  "
        )
        sys.stdout.flush()
        if status in ("completed", "failed"):
            print()
            return status
        time.sleep(POLL_INTERVAL)
        i += 1


def fetch_results(job_id: str) -> dict:
    response = requests.get(
        f"{BASE_URL}/jobs/{job_id}/results?format=json",
        headers=build_headers(API_KEY),
        timeout=30,  # fail fast instead of hanging on a stalled connection
    )
    response.raise_for_status()
    return response.json()


def save_results(data: dict, slug: str) -> str:
    os.makedirs("output", exist_ok=True)
    filename = os.path.join("output", f"{slug}.json")
    with open(filename, "w", encoding="utf-8") as f:
        json.dump(data, f, indent=2, ensure_ascii=False)
    return filename


def main() -> None:
    print(f"Triggering scraper: {SCRAPER_SLUG}")
    job_id = trigger_scrape(SCRAPER_SLUG, SCRAPER_INPUTS)
    print(f"Job started: {job_id}")
    final_status = poll_until_done(job_id)
    if final_status != "completed":
        print(f"Job failed with status: {final_status}")
        sys.exit(1)
    print("Fetching results...")
    results = fetch_results(job_id)
    output_file = save_results(results, SCRAPER_SLUG)
    print(f"Results saved to: {output_file}")


if __name__ == "__main__":
    main()

Replace YOUR_API_KEY with your ScrapeNow API key.

The script starts a scrape job and polls every 5 seconds. It waits up to 3600 seconds, fetches JSON results, and writes them to output/linkedin-companies-extract-by-url.json.

For small test runs, one input URL is enough. For production runs, pass a list of input dictionaries under SCRAPER_INPUTS.

Example batch input:

SCRAPER_INPUTS = [
    {"url": "https://www.linkedin.com/company/universalmusicgroup/"},
    {"url": "https://www.linkedin.com/company/microsoft/"},
    {"url": "https://www.linkedin.com/company/google/"}
]

Keep batches small enough for your retry strategy. A 500-row batch is easier to replay than a 50,000-row batch when the upstream file contains invalid URLs.
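
One way to keep batches replayable is to split the input list before submitting it. The sketch below assumes a batch size of 500, taken from the example above; chunk_inputs is a hypothetical helper, and each chunk would be submitted as its own job.

```python
from typing import Iterator


def chunk_inputs(inputs: list[dict], batch_size: int = 500) -> Iterator[list[dict]]:
    """Yield consecutive slices of at most batch_size input objects."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(inputs), batch_size):
        yield inputs[start:start + batch_size]


# Placeholder URLs for illustration only.
all_inputs = [
    {"url": f"https://www.linkedin.com/company/example-{n}/"} for n in range(1200)
]
batches = list(chunk_inputs(all_inputs))
print([len(batch) for batch in batches])  # prints: [500, 500, 200]
```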

For scheduled jobs, store the job_id with your batch metadata. That gives you a direct lookup path when a run fails during polling or result download.

Store the submitted input count with the job record. Compare that count with completed, failed, and loaded rows after every run.
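
A small reconciliation step makes that comparison mechanical. This sketch assumes the result rows carry scrape_status as shown later in this article; the JobRecord class, the reconcile function, and the "failed" status value in the sample are illustrative names, not ScrapeNow API objects.

```python
from dataclasses import dataclass


@dataclass
class JobRecord:
    job_id: str      # returned when the job is triggered
    submitted: int   # number of input objects sent in the batch


def reconcile(job: JobRecord, rows: list[dict]) -> dict:
    """Compare the submitted count against what came back."""
    succeeded = sum(1 for row in rows if row.get("scrape_status") == "success")
    return {
        "job_id": job.job_id,
        "submitted": job.submitted,
        "succeeded": succeeded,
        "failed": len(rows) - succeeded,
        # Inputs with no output row at all.
        "missing": job.submitted - len(rows),
    }


summary = reconcile(
    JobRecord(job_id="job-123", submitted=3),
    [{"scrape_status": "success"}, {"scrape_status": "failed"}],  # sample rows
)
print(summary)
```

A non-zero missing count is worth alerting on, since every input is expected to return one record.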

Step 5. Check the output

A successful run returns one JSON object per input URL.

[
  {
    "inputs": {
      "url": "https://www.linkedin.com/company/universalmusicgroup/"
    },
    "scrape_status": "success",
    "id": "universalmusicgroup",
    "name": "Universal Music Group",
    "country_code": "US,NL,HK",
    "locations": [
      "2220 Colorado Avenue Santa Monica, California 90401, US",
      "Hilversum, NL",
      "1755 Broadway Ave New York, NY 10018, US",
      "21301 Burbank Blvd Los Angeles, CA 91367, US",
      "2220 Colorado Ave Santa Monica, CA 90404, US",
      "392 Kwun Tong Rd Kwun Tong, Hong Kong, HK"
    ],
    "followers": 1083276,
    "employees_in_linkedin": 19257,
    "about": "Universal Music Group (UMG) is the world leader in music-based entertainment, with a broad array of businesses engaged in recorded music, music publishing, merchandising and audiovisual content in more than 60 countries...",
    "specialties": "Music and entertainment",
    "company_size": "10,001+ employees",
    "organization_type": "Public Company",
    "industries": "Entertainment Providers",
    "website": "https://www.universalmusic.com/",
    "crunchbase_url": "https://www.crunchbase.com/organization/universal-music-group?utm_source=linkedin&utm_medium=referral&utm_campaign=linkedin_companies&utm_content=profile_cta_anon&trk=funding_crunchbase",
    "company_id": "30"
  }
]

Use the Pull structured LinkedIn job listings scraper for job post URLs. Use the company scraper when the source record is a LinkedIn company page.

The output preserves the input URL under inputs.url. Keep that value in your database so you can audit records, retry failures, and map results back to source rows.

Load the JSON into a staging table before merging it into production tables. Staging gives you a place to validate status, normalize keys, and inspect failed rows.

What data you get back

[Figure: LinkedIn Companies Scraper output fields grouped by category.]

The response is built for storage. You get raw input tracking, scrape status, stable company identifiers, profile fields, and company metadata in one record.

Field | Type | What to do with it
inputs.url | string | Store the original URL used for the scrape. Use it for audit logs and retries.
scrape_status | string | Check for success before loading the record into your warehouse.
id | string | LinkedIn company slug from the URL. Use it for dedupe support.
name | string | Company display name. Use it for CRM enrichment and account matching.
country_code | string | Comma-separated country codes found on the profile.
locations | array | Office addresses listed on the company profile.
followers | integer | LinkedIn follower count at scrape time.
employees_in_linkedin | integer | Number of LinkedIn members associated with the company.
about | string | Company description from the profile.
specialties | string | LinkedIn specialties text.
company_size | string | LinkedIn size band such as 10,001+ employees.
organization_type | string | Public company, privately held, nonprofit, and similar labels.
industries | string | LinkedIn industry category.
website | string | Company website listed on LinkedIn.
crunchbase_url | string | Crunchbase profile URL when LinkedIn includes it.
company_id | string | LinkedIn internal company ID when available.

Use typed columns for fields that drive matching, filtering, or reporting. Store followers and employees_in_linkedin as integers rather than strings.

Store locations as an array or JSON field. A single company profile can list multiple offices across countries.

Keep the raw JSON alongside your typed table during the first few runs. Raw records make field mapping reviews faster when business users ask where a value came from.

Document how each output field maps to your warehouse columns. This saves time when sales ops asks why CRM counts differ from LinkedIn counts.
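
The mapping can live in code as well as in documentation. The sketch below converts one output record into a typed row; the column names on the left are assumptions chosen for illustration, while the field names on the right come from the sample output.

```python
import json
from typing import Optional


def _to_int(value) -> Optional[int]:
    """Coerce a count to int; return None when it is absent or not numeric."""
    if value is None:
        return None
    try:
        return int(str(value).replace(",", ""))
    except ValueError:
        return None


def to_warehouse_row(record: dict) -> dict:
    """Map scraper output fields to typed warehouse columns."""
    return {
        "linkedin_company_url": record.get("inputs", {}).get("url"),
        "linkedin_slug": record.get("id"),
        "linkedin_company_id": record.get("company_id"),
        "name": record.get("name"),
        "followers": _to_int(record.get("followers")),
        "employees_in_linkedin": _to_int(record.get("employees_in_linkedin")),
        # Serialized so the value loads into a JSON or text column.
        "locations": json.dumps(record.get("locations") or []),
        "scrape_status": record.get("scrape_status"),
    }


sample = {
    "inputs": {"url": "https://www.linkedin.com/company/universalmusicgroup/"},
    "scrape_status": "success",
    "id": "universalmusicgroup",
    "name": "Universal Music Group",
    "locations": ["Hilversum, NL"],
    "followers": 1083276,
    "employees_in_linkedin": 19257,
    "company_id": "30",
}
row = to_warehouse_row(sample)
print(row["linkedin_slug"], row["followers"])
```

Keeping the mapping in one function gives reviewers a single place to check where each column comes from.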

Use scrape_status as your load gate

Do not load every row without checking it. Treat scrape_status as the first validation check.

import json

with open("output/linkedin-companies-extract-by-url.json", "r", encoding="utf-8") as f:
    rows = json.load(f)

clean_rows = []
failed_rows = []

for row in rows:
    if row.get("scrape_status") == "success":
        clean_rows.append(row)
    else:
        failed_rows.append({
            "url": row.get("inputs", {}).get("url"),
            "status": row.get("scrape_status")
        })

print(f"Loaded rows: {len(clean_rows)}")
print(f"Failed rows: {len(failed_rows)}")

Expected output for one successful company URL:

Loaded rows: 1
Failed rows: 0

Send failed rows to retry storage with the original input URL and the returned status. That gives you a clean replay file without searching logs.

Store the failed row as structured data. Logs expire, while retry files need to survive batch replays and warehouse backfills.

Include job_id in the retry record when your pipeline has it. That makes support tickets and internal investigations faster.
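
A retry file can be as simple as one JSON object per line. The sketch below assumes failed rows shaped like the failed_rows list built above (url and status keys); the write_retry_file helper and the sample status value are illustrative.

```python
import json
import tempfile
from pathlib import Path


def write_retry_file(failed_rows: list[dict], job_id: str, path: Path) -> int:
    """Write failed rows as JSON Lines and return how many were written."""
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w", encoding="utf-8") as f:
        for row in failed_rows:
            record = {
                "url": row.get("url"),
                "status": row.get("status"),
                "job_id": job_id,
            }
            f.write(json.dumps(record) + "\n")
    return len(failed_rows)


# Demo in a temporary directory so nothing is left on disk.
with tempfile.TemporaryDirectory() as tmp:
    retry_path = Path(tmp) / "retry" / "failed.jsonl"
    written = write_retry_file(
        [{"url": "https://www.linkedin.com/company/example/", "status": "failed"}],
        job_id="job-123",
        path=retry_path,
    )
    lines = retry_path.read_text(encoding="utf-8").splitlines()

print(written, lines[0])
```

JSON Lines keeps each failed input on its own line, so a replay job can stream the file without parsing it whole.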

Use id and company_id for matching

The id field is the readable LinkedIn slug. For https://www.linkedin.com/company/universalmusicgroup/, the slug is universalmusicgroup.

company_id is useful when LinkedIn exposes a numeric ID. Store both because URLs and display names change more often than internal IDs.

Use this matching order in production:

  1. Match on company_id when both records have it.
  2. Match on normalized LinkedIn company URL.
  3. Match on slug when the URL is missing.
  4. Match on company name after adding website or domain checks.

Name-only matching creates false positives. Regional subsidiaries and parent companies often share similar display names.

A domain check removes many false matches. Compare normalized website domains before merging records with similar names.

Store the match method on merged records. A company_id match deserves more trust than a name plus domain match.
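
The matching order above can be sketched as a single function that returns the method used, so the result can be stored on the merged record. The record keys (company_id, url, slug, name, website) are assumed flattened names, and both helpers are hypothetical.

```python
from typing import Optional
from urllib.parse import urlparse


def website_domain(url: Optional[str]) -> Optional[str]:
    """Lowercased host without a leading www., or None when unusable."""
    if not url:
        return None
    # Tolerate bare domains such as "example.com" by assuming https.
    host = urlparse(url if "://" in url else f"https://{url}").hostname
    if not host:
        return None
    return host[4:] if host.startswith("www.") else host


def _norm(text: Optional[str]) -> str:
    return (text or "").strip().lower()


def match_method(a: dict, b: dict) -> Optional[str]:
    """Return the first matching rule in priority order, or None."""
    if a.get("company_id") and a.get("company_id") == b.get("company_id"):
        return "company_id"
    if a.get("url") and a.get("url") == b.get("url"):
        return "url"
    # The slug is only a fallback when one side has no URL.
    url_missing = not (a.get("url") and b.get("url"))
    if url_missing and a.get("slug") and a.get("slug") == b.get("slug"):
        return "slug"
    # Name matches must be confirmed by the website domain.
    domain = website_domain(a.get("website"))
    same_name = bool(_norm(a.get("name"))) and _norm(a.get("name")) == _norm(b.get("name"))
    if same_name and domain and domain == website_domain(b.get("website")):
        return "name_and_domain"
    return None


crm = {"name": "universal music group ", "website": "universalmusic.com"}
scraped = {"name": "Universal Music Group", "website": "https://www.universalmusic.com/"}
print(match_method(crm, scraped))  # prints: name_and_domain
```

URLs are compared as given, so normalize them before calling match_method.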

Production tips

[Figure: Validation steps the LinkedIn Companies Scraper applies before loading data.]

Run validation before you send jobs to the API. Keep raw responses in cold storage and track run-level metrics from the start.

Validate URLs before submitting jobs

The input URL must start with https://www.linkedin.com/. For company pages, also require /company/ in the path.

from urllib.parse import urlparse

def validate_linkedin_company_url(url: str) -> bool:
    parsed = urlparse(url)

    if parsed.scheme != "https":
        return False

    if parsed.netloc != "www.linkedin.com":
        return False

    if not parsed.path.startswith("/company/"):
        return False

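    # Slice off the prefix instead of using str.replace, which would also
    # strip any later "/company/" segment in the path.
    slug = parsed.path[len("/company/"):].strip("/")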
    return len(slug) > 0


urls = [
    "https://www.linkedin.com/company/universalmusicgroup/",
    "https://www.linkedin.com/jobs/view/123456789/",
    "http://www.linkedin.com/company/example/"
]

valid_urls = [url for url in urls if validate_linkedin_company_url(url)]

print(valid_urls)

Expected output:

['https://www.linkedin.com/company/universalmusicgroup/']

If your workflow starts from job listings, route job URLs into the Search LinkedIn jobs by keyword scraper or the Search LinkedIn jobs by URL scraper.

Add validation at the ingestion boundary, before the ScrapeNow API request. Return invalid rows to the source owner with a reason code like wrong_path, wrong_scheme, or empty_slug.
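
Reason codes are easy to produce from the same checks. This sketch returns the three codes named above plus wrong_host, which is an added assumption for URLs on another domain; rejection_reason is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlparse

COMPANY_PREFIX = "/company/"


def rejection_reason(url: str) -> Optional[str]:
    """Return a reason code for an invalid URL, or None when it is valid."""
    parsed = urlparse(url)
    if parsed.scheme != "https":
        return "wrong_scheme"
    if parsed.netloc != "www.linkedin.com":
        return "wrong_host"  # assumed code, not named in the article
    if not parsed.path.startswith(COMPANY_PREFIX):
        return "wrong_path"
    if not parsed.path[len(COMPANY_PREFIX):].strip("/"):
        return "empty_slug"
    return None


for candidate in (
    "https://www.linkedin.com/company/universalmusicgroup/",
    "https://www.linkedin.com/jobs/view/123456789/",
    "http://www.linkedin.com/company/example/",
    "https://www.linkedin.com/company/",
):
    print(candidate, "->", rejection_reason(candidate))
```

Returning the first failing check keeps the feedback to the source owner specific to one fix per row.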

Normalize company URLs before dedupe

LinkedIn URLs often carry tracking parameters and inconsistent trailing slashes. Normalize them before dedupe.

from urllib.parse import urlparse

def normalize_company_url(url: str) -> str:
    parsed = urlparse(url)
    path = parsed.path.rstrip("/") + "/"
    return f"https://www.linkedin.com{path}"


raw_urls = [
    "https://www.linkedin.com/company/universalmusicgroup/?trk=public_jobs",
    "https://www.linkedin.com/company/universalmusicgroup",
    "https://www.linkedin.com/company/universalmusicgroup/"
]

deduped = sorted({normalize_company_url(url) for url in raw_urls})

print(deduped)

Expected output:

['https://www.linkedin.com/company/universalmusicgroup/']

At 10,000 URLs, a 3 percent duplicate rate means 300 wasted rows. Normalize before you calculate batch size or write retry files. Keep the original URL too for debugging upstream source quality.
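
To keep the original URLs while deduping, group the raw values under their normalized form. This is a minimal sketch that repeats the normalization logic so it runs on its own; group_by_normalized is an illustrative name.

```python
from collections import defaultdict
from urllib.parse import urlparse


def normalize_company_url(url: str) -> str:
    # Drop the query string and force exactly one trailing slash.
    path = urlparse(url).path.rstrip("/") + "/"
    return f"https://www.linkedin.com{path}"


def group_by_normalized(raw_urls: list[str]) -> dict[str, list[str]]:
    """Map each normalized URL to the raw URLs that produced it."""
    groups: dict[str, list[str]] = defaultdict(list)
    for raw in raw_urls:
        groups[normalize_company_url(raw)].append(raw)
    return dict(groups)


raw_urls = [
    "https://www.linkedin.com/company/universalmusicgroup/?trk=public_jobs",
    "https://www.linkedin.com/company/universalmusicgroup",
    "https://www.linkedin.com/company/microsoft/",
]
groups = group_by_normalized(raw_urls)
duplicates = len(raw_urls) - len(groups)
print(len(groups), duplicates)  # prints: 2 1
```

The keys are what you submit; the grouped values show which upstream rows produced each duplicate.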

Store a fixed schema

Do not write raw JSON into random columns and clean it later. Store a fixed schema on day 1.

CREATE TABLE linkedin_companies (
    linkedin_company_url TEXT PRIMARY KEY,
    linkedin_slug TEXT,
    linkedin_company_id TEXT,
    name TEXT,
    website TEXT,
    industries TEXT,
    company_size TEXT,
    organization_type TEXT,
    followers INTEGER,
    employees_in_linkedin INTEGER,
    country_code TEXT,
    locations JSONB,
    about TEXT,
    specialties TEXT,
    crunchbase_url TEXT,
    scrape_status TEXT,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);

For analytics warehouses that do not support JSONB, store locations as a JSON string. Split it into a child table later if you need office-level reporting.

A child table works well for location analytics:

CREATE TABLE linkedin_company_locations (
    linkedin_company_url TEXT,
    location_text TEXT,
    scraped_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (linkedin_company_url, location_text)
);

Keep the parent table focused on one row per company. The primary key on the normalized company URL acts as the unique constraint that prevents duplicate account records.
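
Child rows for a location table can be derived from the locations array in each record. The sketch below assumes a child table keyed on company URL plus location text; the location_rows helper is hypothetical, and it drops repeated addresses so a composite key on those two columns would not collide.

```python
def location_rows(record: dict) -> list[dict]:
    """Explode one company record into one row per distinct location."""
    url = record.get("inputs", {}).get("url")
    seen: set[str] = set()
    rows: list[dict] = []
    for text in record.get("locations") or []:
        text = text.strip()
        # Skip blanks and repeats within the same company.
        if not text or text in seen:
            continue
        seen.add(text)
        rows.append({"linkedin_company_url": url, "location_text": text})
    return rows


record = {
    "inputs": {"url": "https://www.linkedin.com/company/universalmusicgroup/"},
    "locations": ["Hilversum, NL", "Hilversum, NL", "392 Kwun Tong Rd Kwun Tong, Hong Kong, HK"],
}
children = location_rows(record)
print(len(children))  # prints: 2
```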

Use idempotent inserts

Company data changes. Follower counts, employee counts, locations, and descriptions can change between runs.

Use an upsert keyed on the normalized LinkedIn company URL or company_id.

INSERT INTO linkedin_companies (
    linkedin_company_url,
    linkedin_slug,
    linkedin_company_id,
    name,
    website,
    industries,
    company_size,
    organization_type,
    followers,
    employees_in_linkedin,
    country_code,
    locations,
    about,
    specialties,
    crunchbase_url,
    scrape_status
)
VALUES (
    %(linkedin_company_url)s,
    %(linkedin_slug)s,
    %(linkedin_company_id)s,
    %(name)s,
    %(website)s,
    %(industries)s,
    %(company_size)s,
    %(organization_type)s,
    %(followers)s,
    %(employees_in_linkedin)s,
    %(country_code)s,
    %(locations)s,
    %(about)s,
    %(specialties)s,
    %(crunchbase_url)s,
    %(scrape_status)s
)
ON CONFLICT (linkedin_company_url)
DO UPDATE SET
    name = EXCLUDED.name,
    website = EXCLUDED.website,
    industries = EXCLUDED.industries,
    company_size = EXCLUDED.company_size,
    organization_type = EXCLUDED.organization_type,
    followers = EXCLUDED.followers,
    employees_in_linkedin = EXCLUDED.employees_in_linkedin,
    country_code = EXCLUDED.country_code,
    locations = EXCLUDED.locations,
    about = EXCLUDED.about,
    specialties = EXCLUDED.specialties,
    crunchbase_url = EXCLUDED.crunchbase_url,
    scrape_status = EXCLUDED.scrape_status,
    scraped_at = CURRENT_TIMESTAMP;

This keeps your company table current without creating duplicate rows. For trend reporting, add a history table to track follower and employee count changes across refreshes.

Separate empty fields from failed scrapes

A field can be missing because LinkedIn did not show it on the profile. That differs from a failed scrape.

Use this rule:

Condition | Meaning | Action
scrape_status != "success" | Scrape failed | Retry or send to dead-letter storage
scrape_status == "success" and website is empty | Profile has no website field | Store null
scrape_status == "success" and locations is empty | Profile has no visible locations | Store empty array
scrape_status == "success" and followers is null | Field unavailable in output | Store null and keep the row

A strict pipeline saves the input URL, status, and raw response for every row. That makes backfills easier when you change your schema.

Do not treat empty optional fields as scraper failures. Use nullable columns for optional profile fields and use scrape_status to classify actual failures.
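
The rules above can be sketched as one classification function. It covers only the three optional fields listed in the table; the classify_row name and the "retry" and "load" labels are assumptions made for this example.

```python
def classify_row(row: dict) -> tuple[str, dict]:
    """Return an action plus the cleaned optional fields for one record."""
    if row.get("scrape_status") != "success":
        # Failed scrape: no field values are trusted.
        return "retry", {}
    return "load", {
        "website": row.get("website") or None,     # empty string becomes null
        "locations": row.get("locations") or [],   # missing becomes empty array
        "followers": row.get("followers"),         # null stays null, row is kept
    }


action, fields = classify_row(
    {"scrape_status": "success", "website": "", "locations": None, "followers": None}
)
print(action, fields)
# prints: load {'website': None, 'locations': [], 'followers': None}
```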

Retry failed rows with limits

Cap retries at 3 attempts per URL. After the cap, write the input to a dead-letter file for manual inspection.
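
A capped retry loop might look like the sketch below. The scrape_once callable is a stand-in for whatever submits one URL and reports success; it is hypothetical and not a ScrapeNow function. A real pipeline would more likely resubmit failed URLs as a batch job per attempt, but the cap logic stays the same.

```python
from typing import Callable

MAX_ATTEMPTS = 3  # cap taken from the text above


def retry_with_cap(
    urls: list[str],
    scrape_once: Callable[[str], bool],
    max_attempts: int = MAX_ATTEMPTS,
) -> tuple[list[str], list[dict]]:
    """Try each URL up to max_attempts times; collect the rest as dead letters."""
    succeeded: list[str] = []
    dead_letter: list[dict] = []
    for url in urls:
        for _attempt in range(max_attempts):
            if scrape_once(url):
                succeeded.append(url)
                break
        else:
            # The loop finished without a break: every attempt failed.
            dead_letter.append({"url": url, "attempts": max_attempts})
    return succeeded, dead_letter


# Fake scraper for the demo: one URL succeeds on its second try, one never does.
attempts_seen: dict[str, int] = {}


def fake_scrape(url: str) -> bool:
    attempts_seen[url] = attempts_seen.get(url, 0) + 1
    return url.endswith("/ok/") and attempts_seen[url] >= 2


ok, dead = retry_with_cap(
    ["https://www.linkedin.com/company/ok/", "https://www.linkedin.com/company/bad/"],
    fake_scrape,
)
print(ok, dead)
```

Write the dead-letter entries to a file in the same structured form as your retry records so manual review uses one format.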

Pricing

ScrapeNow charges per returned row. One row costs one credit, starting at $0.04 per credit for small runs and dropping with volume. No monthly contracts, no proxy fees, no charges for failed rows. See the pricing page for current rates.
