Scrape Any Website and Feed It to GPT Using n8n (My Real Workflow)

AI Tools, Templates & Resources · Apr 15, 2025

This one’s been sitting in my backlog for a while, but I finally got around to putting it together the way I actually use it. If you’ve ever wanted to scrape content from a website and feed it into GPT for analysis, rewriting, or summarizing, this post walks you through the whole thing: the exact flow I use, real techniques, different options depending on your setup, and a downloadable JSON file.

Why I Use This Workflow (and When You Might Too)

I first built this setup to analyze one of my own blog posts. I wanted to:

  • Check structure and SEO automatically
  • Feed the content to GPT for summary + improvement suggestions
  • Log it somewhere (Notion, Slack, etc.)

But this works for much more:

  • Summarizing competitor blog posts
  • Pulling product descriptions
  • Extracting pricing pages for internal research
  • Auto-generating summaries or Q&As from docs and help pages

Basically, if the content is on the web and not behind a login, you can grab it 😉.

What the Workflow Does

Here’s what we’re going to build:

  1. Get the latest blog post URL (from an RSS feed or direct input)
  2. Fetch the full HTML
  3. Extract clean text using Cheerio
  4. Send it to GPT (with a tailored prompt)
  5. Output a markdown SEO/structure summary
  6. Push it to Slack or save it in Notion/Docs (your choice)

The Scraping Techniques (Pick What Works for You)

Depending on the site you’re working with, you’ve got a few options:

1. Basic: HTML + Regex Cleanup

// Function node: crude cleanup with regex
const html = $json["body"];
// Replace tags with a space (not ''), so adjacent words don't get glued
// together, then collapse runs of whitespace
const clean = html.replace(/<[^>]*>/g, ' ').replace(/\s{2,}/g, ' ').trim();
return [{ json: { content: clean } }];
  • Works for clean, minimal pages
  • Can break badly if the page has lots of nested tags or JS
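
One gotcha with the regex approach: inline `<script>` and `<style>` blocks aren't tags, so their contents survive the tag-stripping and end up as "text". A slightly safer sketch (still regex, still fragile) strips those blocks first. The `stripHtml` helper name is mine, not part of the workflow:

```javascript
// Sketch: strip <script>/<style> blocks before removing the remaining tags,
// so inline JS and CSS don't leak into the text GPT sees.
function stripHtml(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop embedded JS
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop embedded CSS
    .replace(/<[^>]*>/g, ' ')                    // drop remaining tags
    .replace(/\s{2,}/g, ' ')                     // collapse whitespace
    .trim();
}

// In an n8n Function node you'd call it on the fetched page:
// return [{ json: { content: stripHtml($json["body"]) } }];
```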

2. Accurate: Use cheerio (My Pick)

// Function node: cheerio gives you jQuery-style selectors over the raw HTML
const cheerio = require('cheerio');
const $ = cheerio.load($json['body']);
// Pull only the main article body, skipping nav, sidebar and footer
const text = $('article').text();
return [{ json: { content: text.trim() } }];
  • Extracts structured content by tag (like article, .main, etc.)
  • Lets you target specific elements (like h1, .post-body, etc.)
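
Not every site uses `<article>`, so in practice I try a few selectors in priority order. Here's a sketch of that fallback logic as a plain helper (`firstMatch` is my name for it; with cheerio loaded, `extract` would be `s => $(s).text()`):

```javascript
// Sketch: try selectors in priority order until one yields non-empty text.
// `extract` is any function mapping a selector string to its text content —
// with cheerio it would be (sel) => $(sel).text().
function firstMatch(extract, selectors) {
  for (const sel of selectors) {
    const text = extract(sel).trim();
    if (text) return text; // first selector with real content wins
  }
  return '';
}

// In an n8n Function node with cheerio available:
// const $ = cheerio.load($json['body']);
// const content = firstMatch(s => $(s).text(), ['article', '.post-body', '.content']);
// return [{ json: { content } }];
```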

If you’re self-hosting n8n on Railway (like me), you can install cheerio via npm and use it in Function nodes. I wrote about that setup here.

3. For JavaScript-Heavy Sites

Some pages render content dynamically with JS, so a simple HTTP request won’t cut it.

Options:

  • Use an external Puppeteer script
  • Use browser automation like Playwright (triggered via webhook)
  • Or just use a headless browser scraping API (ScraperAPI, Browserless, etc.)

I usually avoid this unless absolutely necessary.
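
If you do go the scraping-API route, it usually amounts to building one URL that asks the provider's headless browser to render the page before returning HTML. A sketch of the ScraperAPI-style request shape (parameter names are from their docs as I remember them — double-check before relying on this):

```javascript
// Sketch: build a ScraperAPI-style request URL. `render: 'true'` asks the
// service to execute the page's JavaScript before returning the HTML.
function buildRenderUrl(apiKey, target) {
  const params = new URLSearchParams({
    api_key: apiKey, // your provider API key
    url: target,     // the page you actually want
    render: 'true',  // run JS in a headless browser first
  });
  return `https://api.scraperapi.com/?${params.toString()}`;
}

// Feed the result into a normal HTTP Request node — the response body is
// the fully rendered HTML, which then flows into the cheerio step as usual.
```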

Feeding It to GPT (Prompt + Config)

Once we have the cleaned content, the next step is to get GPT to do something smart with it.

Here’s a prompt I use to generate a markdown-style SEO report:

You are an expert blog editor and SEO consultant. Here is a blog post:

{{content}}

Please analyze the following:
1. Summary (3 lines max)
2. On-page SEO issues
3. Internal/external links
4. Suggested keywords
5. Structural problems
6. Suggested improvements

Respond in clean markdown format.
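
If you'd rather fill the `{{content}}` placeholder yourself in a Function node (instead of using n8n's expression syntax inline), a plain string replace is all it takes — `buildPrompt` here is just an illustrative helper name:

```javascript
// Sketch: substitute the cleaned page text into the prompt template
// before passing it to the GPT node.
function buildPrompt(template, content) {
  // A single replace is enough — the placeholder appears once in the template
  return template.replace('{{content}}', content);
}

// In an n8n Function node:
// const prompt = buildPrompt(PROMPT_TEMPLATE, $json['content']);
// return [{ json: { prompt } }];
```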

You can modify this to:

  • Rephrase content
  • Generate tweet threads
  • Extract action items or product features
  • Build FAQs from the page

Output Options (Pick One or Chain Them)

I’ve sent the GPT response to:

  • Slack → for real-time review
  • Notion → for logging SEO audits
  • Google Docs → for sharing with a client

Just add a Notion or Google Docs node at the end and pass {{$json["text"]}} as the content.
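
For Slack specifically, the dedicated node is the easy path, but a plain incoming webhook also works and just expects a small JSON payload. A sketch (`SLACK_WEBHOOK_URL` is a placeholder for your own webhook URL from Slack's app settings):

```javascript
// Sketch: wrap the GPT markdown in the minimal payload shape a Slack
// incoming webhook expects: { "text": "..." }.
function slackPayload(markdown) {
  return { text: markdown };
}

// From a Function node (or swap in the HTTP Request node):
// await fetch(SLACK_WEBHOOK_URL, {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(slackPayload($json['text'])),
// });
```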

Want the Workflow?

Download the n8n JSON file here → includes:

  • RSS feed reader
  • HTML fetcher
  • cheerio text extractor
  • GPT-4 node
  • Slack message output

You can plug in Notion instead if that’s your style.

Stuff I’ve Learned Doing This

Here are a few things worth noting that don’t always get mentioned:

  • Don’t scrape aggressively — many sites have rate limits or bot protection
  • Check your selectors — some blogs don’t use <article>, so you might need .post, .content or .entry
  • Always clean the text before feeding to GPT — it saves tokens and avoids noisy input
  • GPT does better when you’re specific — don’t just say “analyze this,” say “find SEO issues”
  • Add timestamps to logs so you can compare improvements over time
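
That last point is a one-liner in practice: stamp each record before it hits Notion or Slack so repeated audits of the same URL are comparable. A sketch (`auditRecord` is my name for the helper):

```javascript
// Sketch: attach an ISO timestamp so audits of the same URL can be
// compared run-over-run in whatever log you push them to.
function auditRecord(url, report) {
  return {
    url,
    report,
    checkedAt: new Date().toISOString(), // e.g. "2025-04-15T10:30:00.000Z"
  };
}

// Function node just before the Notion/Slack output step:
// return [{ json: auditRecord($json['url'], $json['text']) }];
```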

Get In Touch

Have a question or just want to say hi? I'd love to hear from you.

Use this form to send me a message, and I aim to respond within 24 hours.

© 2025 MohitAneja.com. All rights reserved.