Scrape Any Website and Feed It to GPT Using n8n (My Real Workflow)

AI Tools, Templates & Resources · Apr 15, 2025

This one’s been sitting in my backlog for a while, but I finally got around to putting it together the way I actually use it. If you’ve ever wanted to scrape content from a website and feed it into GPT for analysis, rewriting, or summarizing, this post walks you through the whole thing: the exact flow I use, with real techniques, different options depending on your setup, and a downloadable JSON file.

Why I Use This Workflow (and When You Might Too)

I first built this setup to analyze one of my own blog posts. I wanted to:

  • Check structure and SEO automatically
  • Feed the content to GPT for summary + improvement suggestions
  • Log it somewhere (Notion, Slack, etc.)

But this works for much more:

  • Summarizing competitor blog posts
  • Pulling product descriptions
  • Extracting pricing pages for internal research
  • Auto-generating summaries or Q&As from docs and help pages

Basically, if the content is on the web and not behind a login, you can grab it 😉.

What the Workflow Does

Here’s what we’re going to build:

  1. Get the latest blog post URL (from an RSS feed or direct input)
  2. Fetch the full HTML
  3. Extract clean text using Cheerio
  4. Send it to GPT (with a tailored prompt)
  5. Output a markdown SEO/structure summary
  6. Push it to Slack or save it in Notion/Docs (your choice)

The Scraping Techniques (Pick What Works for You)

Depending on the site you’re working with, you’ve got a few options:

1. Basic: HTML + Regex Cleanup

// n8n Function node: strip all HTML tags, then collapse whitespace
const html = $json["body"];
const clean = html.replace(/<[^>]*>/g, '').replace(/\s{2,}/g, ' ').trim();
return [{ json: { content: clean } }];
  • Works for clean, minimal pages
  • Can break badly if the page has lots of nested tags or JS

2. Accurate: Use cheerio (My Pick)

// n8n Function node: parse the fetched HTML with cheerio
const cheerio = require('cheerio');
const $ = cheerio.load($json['body']);
const text = $('article').text(); // grab just the main article body
return [{ json: { content: text.trim() } }];
  • Extracts structured content by tag (like article, .main, etc.)
  • Lets you target specific elements (like h1, .post-body, etc.)

If you’re self-hosting n8n on Railway (like me), you can install cheerio via npm and use it in Function nodes. I wrote about that setup here.
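For reference, getting cheerio working in a self-hosted Function node is two steps: install the package where n8n can resolve it, and tell n8n to allow it as an external module (via the `NODE_FUNCTION_ALLOW_EXTERNAL` environment variable). Exact paths depend on your hosting setup:

```shell
# Install cheerio in the directory n8n runs from
npm install cheerio

# Allow Function nodes to require() external modules
export NODE_FUNCTION_ALLOW_EXTERNAL=cheerio
```

On Railway, you'd set that environment variable in the service settings instead of exporting it in a shell.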

3. For JavaScript-Heavy Sites

Some pages render content dynamically with JS, so a simple HTTP request won’t cut it.

Options:

  • Use an external Puppeteer script
  • Use browser automation like Playwright (triggered via webhook)
  • Or just use a headless browser scraping API (ScraperAPI, Browserless, etc.)

I usually avoid this unless absolutely necessary.

Feeding It to GPT (Prompt + Config)

Once we have the cleaned content, the next step is to get GPT to do something smart with it.

Here’s a prompt I use to generate a markdown-style SEO report:

You are an expert blog editor and SEO consultant. Here is a blog post:

{{content}}

Please analyze the following:
1. Summary (3 lines max)
2. On-page SEO issues
3. Internal/external links
4. Suggested keywords
5. Structural problems
6. Suggested improvements

Respond in clean markdown format.
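If you're wiring this up in a Function node rather than typing the prompt directly into the GPT node, the assembly step is just filling the `{{content}}` placeholder. A minimal sketch (the `buildPrompt` helper and sample string are mine, not part of the workflow file):

```javascript
// Sketch: fill the {{content}} placeholder before passing the prompt
// to the GPT node. In n8n the content would come from $json['content'];
// here a sample string stands in so the snippet runs on its own.
const template = [
  'You are an expert blog editor and SEO consultant. Here is a blog post:',
  '',
  '{{content}}',
  '',
  'Respond in clean markdown format.'
].join('\n');

function buildPrompt(tmpl, content) {
  // trim first — leading/trailing whitespace wastes tokens
  return tmpl.replace('{{content}}', content.trim());
}

const prompt = buildPrompt(template, '  My cleaned blog text.  ');
```

In the actual workflow you'd `return [{ json: { prompt } }]` and reference that field in the GPT node.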

You can modify this to:

  • Rephrase content
  • Generate tweet threads
  • Extract action items or product features
  • Build FAQs from the page

Output Options (Pick One or Chain Them)

I’ve sent the GPT response to:

  • Slack → for real-time review
  • Notion → for logging SEO audits
  • Google Docs → for sharing with a client

Just add a Notion or Google Docs node at the end and pass {{$json["text"]}} as the content.
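If you prefer a Slack incoming webhook over the Slack node, the message body is just a small JSON payload. A sketch (the `toSlackPayload` helper and sample report are placeholders — in the workflow, the report comes from the GPT node's output field):

```javascript
// Sketch: wrap the GPT markdown report in a Slack incoming-webhook payload.
function toSlackPayload(report, title) {
  return {
    text: `*${title}*\n${report}`, // Slack mrkdwn: *bold* title, report below
    unfurl_links: false            // keep the review message compact
  };
}

const payload = toSlackPayload('1. Summary: ...', 'SEO Audit');
```

You'd POST that object to the webhook URL with an HTTP Request node.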

Want the Workflow?

Download the n8n JSON file here → includes:

  • RSS feed reader
  • HTML fetcher
  • cheerio text extractor
  • GPT-4 node
  • Slack message output

You can plug in Notion instead if that’s your style.

Stuff I’ve Learned Doing This

Here are a few things worth noting that don’t always get mentioned:

  • Don’t scrape aggressively — many sites have rate limits or bot protection
  • Check your selectors — some blogs don’t use <article>, so you might need .post, .content or .entry
  • Always clean the text before feeding to GPT — it saves tokens and avoids noisy input
  • GPT does better when you’re specific — don’t just say “analyze this,” say “find SEO issues”
  • Add timestamps to logs so you can compare improvements over time
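On the selector point: when a blog doesn't use `<article>`, I fall back through the common wrapper classes. With cheerio you'd just try each selector in turn; here's a rough regex-based sketch of the same fallback idea that runs without any dependencies (the sample HTML is made up):

```javascript
// Sketch: try <article> first, then divs with common content class names.
// Note: the lazy </div> match is naive — nested <div>s will cut it short,
// which is exactly why cheerio is the better pick for real pages.
function extractBody(html) {
  const patterns = [
    /<article[^>]*>([\s\S]*?)<\/article>/i,
    /<div[^>]*class="[^"]*\b(?:post|content|entry)\b[^"]*"[^>]*>([\s\S]*?)<\/div>/i
  ];
  for (const re of patterns) {
    const m = html.match(re);
    if (m) return m[1].replace(/<[^>]*>/g, '').trim(); // strip inner tags
  }
  return ''; // nothing matched — time to inspect the page and tweak selectors
}

const sample = '<div class="post entry"><p>Hello world</p></div>';
const body = extractBody(sample);
```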
