
This one’s been sitting in my backlog for a while now, but I finally got around to putting it together the way I actually use it. If you’ve ever wanted to scrape content from a website and then feed that into GPT for analysis, rewriting, or summarizing, this post will walk you through the full thing.
It covers the exact flow I use, with real techniques, different options depending on your setup, and a downloadable JSON file.
Why I Use This Workflow (and When You Might Too)
I first built this setup to analyze one of my own blog posts. I wanted to:
- Check structure and SEO automatically
- Feed the content to GPT for summary + improvement suggestions
- Log it somewhere (Notion, Slack, etc.)
But this works for much more:
- Summarizing competitor blog posts
- Pulling product descriptions
- Extracting pricing pages for internal research
- Auto-generating summaries or Q&As from docs and help pages
Basically, if the content is on the web and not behind a login, you can grab it 😉.
What the Workflow Does
Here’s what we’re going to build:
- Get the latest blog post URL (from an RSS feed or direct input)
- Fetch the full HTML
- Extract clean text using Cheerio
- Send it to GPT (with a tailored prompt)
- Output a markdown SEO/structure summary
- Push it to Slack or save it in Notion/Docs (your choice)
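For the first step, here is a minimal sketch of picking the newest post out of a list of feed items. It assumes each item carries `link` and `pubDate` fields, which is how parsed RSS items commonly look, so check your feed's actual output. `pickLatestPost` is a hypothetical helper name, not something n8n provides.

```javascript
// Hypothetical helper: return the feed item with the most recent pubDate.
// Assumes items look like { title, link, pubDate } (an assumption about the feed output).
function pickLatestPost(items) {
  const dated = items
    .map((item) => ({ item, time: Date.parse(item.pubDate) }))
    .filter((entry) => !Number.isNaN(entry.time)); // skip items with unparseable dates
  if (dated.length === 0) return null;
  dated.sort((a, b) => b.time - a.time); // newest first
  return dated[0].item;
}

// Example with made-up feed items
const latest = pickLatestPost([
  { title: 'Older post', link: 'https://example.com/older', pubDate: 'Mon, 01 Jan 2024 10:00:00 GMT' },
  { title: 'Newer post', link: 'https://example.com/newer', pubDate: 'Tue, 05 Mar 2024 08:30:00 GMT' },
]);
console.log(latest.link);
```

Sorting by date instead of trusting the feed order is a defensive choice, since not every feed lists its newest entry first.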
The Scraping Techniques (Pick What Works for You)
Depending on the site you’re working with, you’ve got a few options:
1. Basic: HTML + Regex Cleanup
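const html = $json["body"]; // raw HTML from the HTTP Request node
// Drop <script>/<style> blocks first, then strip the remaining tags and collapse whitespace
const clean = html.replace(/<(script|style)\b[\s\S]*?<\/\1\s*>/gi, ' ')
  .replace(/<[^>]*>/g, ' ').replace(/\s+/g, ' ').trim();
return [{ json: { content: clean } }];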
- Works for clean, minimal pages
- Can break badly if the page has lots of nested tags or JS
2. Accurate: Use cheerio (My Pick)
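const cheerio = require('cheerio'); // external module: must be installed alongside n8n
const $ = cheerio.load($json['body']); // parse the fetched HTML
const text = $('article').text(); // combined text of every <article> element
return [{ json: { content: text.trim() } }];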
- Extracts structured content by tag (like article, .main, etc.)
- Lets you target specific elements (like h1, .post-body, etc.)
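If you’re self-hosting n8n on Railway (like me), you can install cheerio via npm and use it in Function nodes. Note that n8n only lets Function nodes require external modules listed in the NODE_FUNCTION_ALLOW_EXTERNAL environment variable, so add cheerio there too. I wrote about that setup here.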
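Not every blog wraps its post in `article`, so a fallback chain of selectors is handy. A minimal sketch: `extractMainText` is a hypothetical helper that takes the `$` returned by `cheerio.load()` and tries selectors in order. The selector list is only an example and should be adjusted per site.

```javascript
// Hypothetical helper: try selectors in order and return the first non-empty text.
// `$` is expected to be the function returned by cheerio.load(html).
function extractMainText($, selectors = ['article', '.post-body', '.main']) {
  for (const selector of selectors) {
    const match = $(selector);
    if (match.length === 0) continue; // selector not present on this page
    const text = match.first().text().replace(/\s+/g, ' ').trim();
    if (text) return { selector, text };
  }
  return { selector: null, text: '' }; // nothing matched
}

// Usage inside a Function node (assuming cheerio is installed and allowed):
// const cheerio = require('cheerio');
// const result = extractMainText(cheerio.load($json['body']));
// return [{ json: { content: result.text, matchedSelector: result.selector } }];
```

Returning the matched selector alongside the text makes it easier to spot when a site changes its markup and the workflow quietly falls through to a later selector.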
3. For JavaScript-Heavy Sites
Some pages render content dynamically with JS, so a simple HTTP request won’t cut it.
Options:
- Use an external Puppeteer script
- Use browser automation like Playwright (triggered via webhook)
- Or just use a headless browser scraping API (ScraperAPI, Browserless, etc.)
I usually avoid this unless absolutely necessary.
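One way to decide whether a page needs the heavier route is to check how much readable text the plain HTTP fetch actually returned. The sketch below is a rough heuristic, not a reliable detector: `looksClientRendered` and the 200-character threshold are assumptions for illustration.

```javascript
// Rough heuristic (an assumption, not a standard): if the HTML holds very little
// visible text once scripts, styles, and tags are removed, the content is probably
// rendered client-side and a headless browser may be needed.
function looksClientRendered(html, minTextLength = 200) {
  const visibleText = html
    .replace(/<(script|style|noscript)\b[\s\S]*?<\/\1\s*>/gi, ' ')
    .replace(/<[^>]*>/g, ' ')
    .replace(/\s+/g, ' ')
    .trim();
  return visibleText.length < minTextLength;
}

// Made-up examples: an empty app shell versus a page with real text
const shell = '<html><body><div id="root"></div><script>renderApp()</script></body></html>';
const article = '<html><body><article>' + 'Real paragraph text. '.repeat(20) + '</article></body></html>';
console.log(looksClientRendered(shell));   // true
console.log(looksClientRendered(article)); // false
```

A check like this could sit right after the HTTP request, so only pages that fail it get routed to Puppeteer, Playwright, or a scraping API.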
Feeding It to GPT (Prompt + Config)
Once we have the cleaned content, the next step is to get GPT to do something smart with it.
Here’s a prompt I use to generate a markdown-style SEO report:
You are an expert blog editor and SEO consultant. Here is a blog post:
{{content}}
Please analyze the following:
1. Summary (3 lines max)
2. On-page SEO issues
3. Internal/external links
4. Suggested keywords
5. Structural problems
6. Suggested improvements
Respond in clean markdown format.
You can modify this to:
- Rephrase content
- Generate tweet threads
- Extract action items or product features
- Build FAQs from the page
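To wire the prompt up in code, the `{{content}}` placeholder can be filled in with a small helper before the request goes out. This is a sketch under assumptions: `buildSeoPrompt` is a hypothetical name, and the 12,000-character cap is an arbitrary stand-in for whatever limit fits your model's context window.

```javascript
// The prompt from above, kept as a template with a {{content}} placeholder
const PROMPT_TEMPLATE = [
  'You are an expert blog editor and SEO consultant. Here is a blog post:',
  '{{content}}',
  'Please analyze the following:',
  '1. Summary (3 lines max)',
  '2. On-page SEO issues',
  '3. Internal/external links',
  '4. Suggested keywords',
  '5. Structural problems',
  '6. Suggested improvements',
  'Respond in clean markdown format.',
].join('\n');

// Hypothetical helper: trim the scraped text, cap its length, and fill the template.
function buildSeoPrompt(content, maxChars = 12000) {
  const trimmed = content.trim();
  const body = trimmed.length > maxChars
    ? trimmed.slice(0, maxChars) + ' [truncated]'
    : trimmed;
  // A replacer function keeps "$" sequences in the post from being read as replacement patterns
  return PROMPT_TEMPLATE.replace('{{content}}', () => body);
}

const prompt = buildSeoPrompt('  My short test post about n8n.  ');
console.log(prompt);
```

Swapping the numbered list in the template is all it takes to turn this into one of the variations listed above.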
Output Options (Pick One or Chain Them)
I’ve sent the GPT response to:
- Slack → for real-time review
- Notion → for logging SEO audits
- Google Docs → for sharing with a client
Just add a Notion or Google Docs node at the end and pass {{$json["text"]}} as the content.
Want the Workflow?
Download the n8n JSON file here → includes:
- RSS feed reader
- HTML fetcher
- cheerio text extractor
- GPT-4 node
- Slack message output
You can plug in Notion instead if that’s your style.
Stuff I’ve Learned Doing This
Here are a few things worth noting that don’t always get mentioned:
- Don’t scrape aggressively — many sites have rate limits or bot protection
- Check your selectors: some blogs don’t use <article>, so you might need .post, .content, or .entry
- Always clean the text before feeding to GPT — it saves tokens and avoids noisy input
- GPT does better when you’re specific — don’t just say “analyze this,” say “find SEO issues”
- Add timestamps to logs so you can compare improvements over time
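For the timestamp tip, here is a small sketch of what a log record could look like before it goes to Notion or Slack. The field names (`url`, `report`, `loggedAt`) and `buildLogEntry` are hypothetical, so map them to whatever your destination expects.

```javascript
// Hypothetical helper: bundle the GPT report with its source URL and an ISO timestamp.
// `now` is injectable so the function is easy to test.
function buildLogEntry(url, report, now = new Date()) {
  return {
    url,
    report,
    loggedAt: now.toISOString(), // ISO 8601 in UTC, sorts chronologically as a string
  };
}

// Example with a fixed date so the output is predictable
const entry = buildLogEntry(
  'https://example.com/my-post',
  '## Summary\n...',
  new Date(Date.UTC(2024, 2, 5, 8, 30, 0))
);
console.log(entry.loggedAt); // 2024-03-05T08:30:00.000Z
```

Keeping the URL next to the timestamp means repeated audits of the same post can be lined up and compared later.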