Scrape Any Website and Feed It to GPT Using n8n (My Real Workflow)

AI Tools, Templates & Resources · Apr 15, 2025

This one’s been sitting in my backlog for a while, but I finally got around to putting it together the way I actually use it. If you’ve ever wanted to scrape content from a website and feed it into GPT for analysis, rewriting, or summarizing, this post walks you through the whole thing: the exact flow I use, real techniques, different options depending on your setup, and a downloadable JSON file.

Why I Use This Workflow (and When You Might Too)

I first built this setup to analyze one of my own blog posts. I wanted to:

  • Check structure and SEO automatically
  • Feed the content to GPT for summary + improvement suggestions
  • Log it somewhere (Notion, Slack, etc.)

But this works for much more:

  • Summarizing competitor blog posts
  • Pulling product descriptions
  • Extracting pricing pages for internal research
  • Auto-generating summaries or Q&As from docs and help pages

Basically, if the content is on the web and not behind a login, you can grab it 😉.

What the Workflow Does

Here’s what we’re going to build:

  1. Get the latest blog post URL (from an RSS feed or direct input)
  2. Fetch the full HTML
  3. Extract clean text using Cheerio
  4. Send it to GPT (with a tailored prompt)
  5. Output a markdown SEO/structure summary
  6. Push it to Slack or save it in Notion/Docs (your choice)

The Scraping Techniques (Pick What Works for You)

Depending on the site you’re working with, you’ve got a few options:

1. Basic: HTML + Regex Cleanup

// Function node: crude cleanup with regex
const html = $json["body"];
// Replace tags with a space (not ''), so adjacent words don't get glued
// together, then collapse runs of whitespace
const clean = html.replace(/<[^>]*>/g, ' ').replace(/\s{2,}/g, ' ').trim();
return [{ json: { content: clean } }];
  • Works for clean, minimal pages
  • Can break badly if the page has lots of nested tags or JS
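
One gotcha with the regex approach: inline `<script>` and `<style>` blocks aren't tags, so their contents survive the tag-stripping and end up as "text". A slightly safer sketch (still regex, still fragile) strips those blocks first. The `stripHtml` helper name is mine, not part of the workflow:

```javascript
// Sketch: strip <script>/<style> blocks before removing the remaining tags,
// so inline JS and CSS don't leak into the text GPT sees.
function stripHtml(html) {
  return html
    .replace(/<script[\s\S]*?<\/script>/gi, ' ') // drop embedded JS
    .replace(/<style[\s\S]*?<\/style>/gi, ' ')   // drop embedded CSS
    .replace(/<[^>]*>/g, ' ')                    // drop remaining tags
    .replace(/\s{2,}/g, ' ')                     // collapse whitespace
    .trim();
}

// In an n8n Function node you'd call it on the fetched page:
// return [{ json: { content: stripHtml($json["body"]) } }];
```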

2. Accurate: Use cheerio (My Pick)

// Function node: cheerio gives you jQuery-style selectors over the raw HTML
const cheerio = require('cheerio');
const $ = cheerio.load($json['body']);
// Pull only the main article body, skipping nav, sidebar and footer
const text = $('article').text();
return [{ json: { content: text.trim() } }];
  • Extracts structured content by tag (like article, .main, etc.)
  • Lets you target specific elements (like h1, .post-body, etc.)
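
Not every site uses `<article>`, so in practice I try a few selectors in priority order. Here's a sketch of that fallback logic as a plain helper (`firstMatch` is my name for it; with cheerio loaded, `extract` would be `s => $(s).text()`):

```javascript
// Sketch: try selectors in priority order until one yields non-empty text.
// `extract` is any function mapping a selector string to its text content —
// with cheerio it would be (sel) => $(sel).text().
function firstMatch(extract, selectors) {
  for (const sel of selectors) {
    const text = extract(sel).trim();
    if (text) return text; // first selector with real content wins
  }
  return '';
}

// In an n8n Function node with cheerio available:
// const $ = cheerio.load($json['body']);
// const content = firstMatch(s => $(s).text(), ['article', '.post-body', '.content']);
// return [{ json: { content } }];
```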

If you’re self-hosting n8n on Railway (like me), you can install cheerio via npm and use it in Function nodes. I wrote about that setup here.

3. For JavaScript-Heavy Sites

Some pages render content dynamically with JS, so a simple HTTP request won’t cut it.

Options:

  • Use an external Puppeteer script
  • Use browser automation like Playwright (triggered via webhook)
  • Or just use a headless browser scraping API (ScraperAPI, Browserless, etc.)

I usually avoid this unless absolutely necessary.
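
If you do go the scraping-API route, it usually amounts to building one URL that asks the provider's headless browser to render the page before returning HTML. A sketch of the ScraperAPI-style request shape (parameter names are from their docs as I remember them — double-check before relying on this):

```javascript
// Sketch: build a ScraperAPI-style request URL. `render: 'true'` asks the
// service to execute the page's JavaScript before returning the HTML.
function buildRenderUrl(apiKey, target) {
  const params = new URLSearchParams({
    api_key: apiKey, // your provider API key
    url: target,     // the page you actually want
    render: 'true',  // run JS in a headless browser first
  });
  return `https://api.scraperapi.com/?${params.toString()}`;
}

// Feed the result into a normal HTTP Request node — the response body is
// the fully rendered HTML, which then flows into the cheerio step as usual.
```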

Feeding It to GPT (Prompt + Config)

Once we have the cleaned content, the next step is to get GPT to do something smart with it.

Here’s a prompt I use to generate a markdown-style SEO report:

You are an expert blog editor and SEO consultant. Here is a blog post:

{{content}}

Please analyze the following:
1. Summary (3 lines max)
2. On-page SEO issues
3. Internal/external links
4. Suggested keywords
5. Structural problems
6. Suggested improvements

Respond in clean markdown format.
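
If you'd rather fill the `{{content}}` placeholder yourself in a Function node (instead of using n8n's expression syntax inline), a plain string replace is all it takes — `buildPrompt` here is just an illustrative helper name:

```javascript
// Sketch: substitute the cleaned page text into the prompt template
// before passing it to the GPT node.
function buildPrompt(template, content) {
  // A single replace is enough — the placeholder appears once in the template
  return template.replace('{{content}}', content);
}

// In an n8n Function node:
// const prompt = buildPrompt(PROMPT_TEMPLATE, $json['content']);
// return [{ json: { prompt } }];
```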

You can modify this to:

  • Rephrase content
  • Generate tweet threads
  • Extract action items or product features
  • Build FAQs from the page

Output Options (Pick One or Chain Them)

I’ve sent the GPT response to:

  • Slack → for real-time review
  • Notion → for logging SEO audits
  • Google Docs → for sharing with a client

Just add a Notion or Google Docs node at the end and pass {{$json["text"]}} as the content.
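
For Slack specifically, the dedicated node is the easy path, but a plain incoming webhook also works and just expects a small JSON payload. A sketch (`SLACK_WEBHOOK_URL` is a placeholder for your own webhook URL from Slack's app settings):

```javascript
// Sketch: wrap the GPT markdown in the minimal payload shape a Slack
// incoming webhook expects: { "text": "..." }.
function slackPayload(markdown) {
  return { text: markdown };
}

// From a Function node (or swap in the HTTP Request node):
// await fetch(SLACK_WEBHOOK_URL, {
//   method: 'POST',
//   headers: { 'Content-Type': 'application/json' },
//   body: JSON.stringify(slackPayload($json['text'])),
// });
```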

Want the Workflow?

Download the n8n JSON file here → includes:

  • RSS feed reader
  • HTML fetcher
  • cheerio text extractor
  • GPT-4 node
  • Slack message output

You can plug in Notion instead if that’s your style.

Stuff I’ve Learned Doing This

Here are a few things worth noting that don’t always get mentioned:

  • Don’t scrape aggressively — many sites have rate limits or bot protection
  • Check your selectors — some blogs don’t use <article>, so you might need .post, .content or .entry
  • Always clean the text before feeding to GPT — it saves tokens and avoids noisy input
  • GPT does better when you’re specific — don’t just say “analyze this,” say “find SEO issues”
  • Add timestamps to logs so you can compare improvements over time
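
That last point is a one-liner in practice: stamp each record before it hits Notion or Slack so repeated audits of the same URL are comparable. A sketch (`auditRecord` is my name for the helper):

```javascript
// Sketch: attach an ISO timestamp so audits of the same URL can be
// compared run-over-run in whatever log you push them to.
function auditRecord(url, report) {
  return {
    url,
    report,
    checkedAt: new Date().toISOString(), // e.g. "2025-04-15T10:30:00.000Z"
  };
}

// Function node just before the Notion/Slack output step:
// return [{ json: auditRecord($json['url'], $json['text']) }];
```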

Get In Touch

Have a question or just want to say hi? I'd love to hear from you.

Use this form to send me a message, and I aim to respond within 24 hours.

© 2025 MohitAneja.com. All rights reserved.