In the world of web scraping and automation, dealing with Cloudflare protection can be a challenge. Puppeteer, a Node.js library that drives a real headless Chromium browser, can often get past Cloudflare’s browser checks and retrieve the rendered page content. In this tutorial, we’ll walk through bypassing Cloudflare with Puppeteer, add an optional proxy setup (commented out by default), and return the results in JSON format.

Prerequisites

Before diving into the tutorial, make sure you have the following in place:

  1. Node.js Installed: Ensure Node.js (version 14 or higher) is installed on your system.
  2. PHP Environment: A web server running PHP with shell_exec enabled, so grab.php can launch the Node.js script.
  3. Puppeteer Knowledge: Basic familiarity with Puppeteer and JavaScript.
  4. Proxy (Optional): If you plan to use a proxy, have the proxy details ready (IP, port, username, password).

Step 1: Setting Up the Environment

First, create a project folder and initialize it with Node.js. Install Puppeteer by running the following commands in your terminal:

mkdir cloudflare-bypass
cd cloudflare-bypass
npm init -y
npm install puppeteer

This sets up your environment with Puppeteer installed.
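If you want to confirm that Puppeteer can actually launch its bundled Chromium before going further, a quick sanity check like the following will do. The file name check.js is just an illustrative helper and isn’t needed for the rest of the tutorial:

// check.js (hypothetical helper): verify that Puppeteer can launch Chromium
const puppeteer = require('puppeteer');

(async () => {
    const browser = await puppeteer.launch({ headless: true });
    // Prints the bundled Chromium version string if the launch worked
    console.log('Launched:', await browser.version());
    await browser.close();
})();

Run it with node check.js; if a version string prints, the installation is working.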

Step 2: Creating the Puppeteer Script

Create a file named cloudflare.js in your project folder. This script will use Puppeteer to bypass Cloudflare and fetch the page content based on a URL passed as an argument.

const puppeteer = require('puppeteer');

(async () => {
    const url = process.argv[2]; // Get URL from command line argument
    if (!url) {
        console.log(JSON.stringify({
            status: false,
            message: 'No URL provided'
        }));
        return;
    }

    let browser;
    try {
        browser = await puppeteer.launch({
            headless: true,
            args: [
                '--no-sandbox',
                '--disable-setuid-sandbox',
                // Optional proxy configuration (commented out by default)
                // '--proxy-server=http://proxy-ip:port'
            ]
        });

        const page = await browser.newPage();

        // Optional: Set proxy credentials if using a proxy
        // await page.authenticate({
        //     username: 'proxy-username',
        //     password: 'proxy-password'
        // });

        // Set a realistic user agent to reduce the chance of detection
        // (consider swapping in a current Chrome user-agent string)
        await page.setUserAgent('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36');

        // Navigate to the URL and wait until network activity settles
        const response = await page.goto(url, { waitUntil: 'networkidle2', timeout: 60000 });

        // Check the HTTP status (page.goto can return null for some navigations)
        if (!response || response.status() !== 200) {
            throw new Error(`Failed to load page. Status code: ${response ? response.status() : 'unknown'}`);
        }

        // Get the fully rendered page HTML
        const content = await page.content();

        // Return success response in JSON
        console.log(JSON.stringify({
            status: true,
            content: content
        }));
    } catch (error) {
        console.log(JSON.stringify({
            status: false,
            message: error.message
        }));
    } finally {
        // Always close the browser, even when navigation fails
        if (browser) {
            await browser.close();
        }
    }
})();

This script launches a headless browser, navigates to the provided URL, and prints the page content as JSON on stdout. The browser is closed in the finally block, so a failed navigation doesn’t leave a stray Chromium process behind. The proxy settings are included but commented out by default.
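Before wiring this up to PHP, you can exercise cloudflare.js from another Node script the same way grab.php will: spawn it as a child process and parse whatever it prints. The sketch below assumes a hypothetical test-run.js sitting next to cloudflare.js and uses https://example.com as a stand-in target:

// test-run.js (hypothetical): run cloudflare.js as a child process and parse its JSON output
const { execFile } = require('child_process');

execFile('node', ['cloudflare.js', 'https://example.com'],
    { maxBuffer: 10 * 1024 * 1024 }, // allow large HTML payloads
    (err, stdout) => {
        if (err && !stdout) {
            console.error('Could not run cloudflare.js:', err.message);
            return;
        }
        const result = JSON.parse(stdout); // cloudflare.js always prints JSON
        if (result.status) {
            console.log(`Fetched ${result.content.length} characters of HTML`);
        } else {
            console.error('Scrape failed:', result.message);
        }
    }
);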

Step 3: Creating the PHP Wrapper

Next, create a file named grab.php to act as a wrapper. This file will accept a URL via a GET parameter, execute the Puppeteer script, and return the result in JSON.

<?php
header('Content-Type: application/json');

if (!isset($_GET['url']) || empty($_GET['url'])) {
    echo json_encode([
        'status' => false,
        'message' => 'URL parameter is missing'
    ]);
    exit;
}

// Basic validation before passing the value to the shell
if (filter_var($_GET['url'], FILTER_VALIDATE_URL) === false) {
    echo json_encode([
        'status' => false,
        'message' => 'Invalid URL'
    ]);
    exit;
}

$url = escapeshellarg($_GET['url']); // Escape URL for shell safety
$nodePath = 'node'; // Adjust this path if necessary
$scriptPath = escapeshellarg(__DIR__ . '/cloudflare.js');

// Execute the Puppeteer script and capture output (including stderr)
$result = shell_exec("$nodePath $scriptPath $url 2>&1");
if ($result === null || json_decode($result) === null) {
    echo json_encode([
        'status' => false,
        'message' => 'Failed to execute Puppeteer script'
    ]);
    exit;
}

// Output the result (already in JSON format from cloudflare.js)
echo $result;

The PHP script validates the URL, runs the Node.js script with shell_exec, checks that the output is valid JSON, and returns that JSON response directly.

Step 4: Testing the Setup

Place both grab.php and cloudflare.js in your server’s web directory, along with the node_modules folder created in Step 1 (cloudflare.js needs to resolve the Puppeteer package from its own directory). Then, test it by accessing the PHP file with a URL parameter, like this:

http://your-server.com/grab.php?url=https://example.com

If successful, you’ll get a JSON response like:

{
    "status": true,
    "content": "..."
}

If there’s an error (e.g., invalid URL or Cloudflare blocking), you’ll see something like:

{
    "status": false,
    "message": "Failed to load page. Status code: 403"
}
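Any HTTP client that can parse JSON can consume this endpoint. As a rough illustration, assuming Node 18+ for the built-in fetch and with your-server.com standing in for your actual host:

// consume-grab.js (hypothetical): call grab.php and handle both response shapes
const endpoint = 'http://your-server.com/grab.php?url=' + encodeURIComponent('https://example.com');

fetch(endpoint)
    .then((res) => res.json())
    .then((data) => {
        if (data.status) {
            console.log('Got', data.content.length, 'characters of HTML');
        } else {
            console.error('Scrape failed:', data.message);
        }
    })
    .catch((err) => console.error('Request failed:', err.message));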

Step 5: Using a Proxy (Optional)

If you need to use a proxy to bypass Cloudflare’s restrictions, uncomment the proxy lines in cloudflare.js and fill in your proxy details:

const browser = await puppeteer.launch({
    headless: true,
    args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--proxy-server=http://proxy-ip:port'
    ]
});

await page.authenticate({
    username: 'proxy-username',
    password: 'proxy-password'
});

Save the changes and test again. The proxy will route your requests, potentially helping you avoid Cloudflare blocks.
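If you would rather not hard-code credentials, one option is to read them from environment variables inside cloudflare.js. This is only a sketch: the PROXY_SERVER, PROXY_USER, and PROXY_PASS names are assumptions, and the fragment replaces the launch and authenticate calls inside the existing async function:

// Inside the async function in cloudflare.js: read proxy settings from the
// environment instead of hard-coding them (variable names are assumptions)
const proxyServer = process.env.PROXY_SERVER; // e.g. http://proxy-ip:port

browser = await puppeteer.launch({
    headless: true,
    args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        // Only add the proxy flag when a proxy is configured
        ...(proxyServer ? [`--proxy-server=${proxyServer}`] : [])
    ]
});

const page = await browser.newPage();
if (process.env.PROXY_USER && process.env.PROXY_PASS) {
    await page.authenticate({
        username: process.env.PROXY_USER,
        password: process.env.PROXY_PASS
    });
}

Set the variables in the environment of whatever invokes the script (your shell or the PHP process), and the rest of cloudflare.js stays unchanged.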

Conclusion

Bypassing Cloudflare with Puppeteer is a powerful technique for web scraping and automation. By combining PHP and Puppeteer, you can create a flexible system that retrieves webpage content and returns it in a structured JSON format. Experiment with proxies and additional Puppeteer options to fine-tune your setup for specific needs. Happy scraping!
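As a starting point for that experimentation, here is a hedged sketch of standard Puppeteer page settings that are often worth trying inside cloudflare.js after browser.newPage(); the values are purely illustrative:

// Optional page tweaks to try inside the async function, after browser.newPage()
await page.setViewport({ width: 1366, height: 768 });                    // common desktop size
await page.setExtraHTTPHeaders({ 'Accept-Language': 'en-US,en;q=0.9' }); // look like a regular browser
page.setDefaultNavigationTimeout(90000);                                 // allow slow challenge pages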