Complete Guide to Web Scraping with Nodejs and Puppeteer

by Solomon Eseme


Updated Mon Apr 24 2023


Web scraping is one of the most powerful tools for data collection, and this guide to web scraping with Nodejs and Puppeteer will show you how to collect and analyze data using web scraping techniques.

You have probably heard the terms “Web Scraping” and “Puppeteer”, and about the cool things you can do with Puppeteer web scraping.

You may also want to learn more and even get started with it immediately.

There are lots of reasons you might want to web scrape with Nodejs, such as:

Maybe you want to analyze the prices of goods and services across the internet, collect all the events happening around you, or, as a backend developer, gather all the latest backend development job openings.

There are numerous reasons to learn puppeteer web scraping and how to web scrape using JavaScript and Nodejs.

In this article, we are going to explore the complete guide to web scraping with Nodejs and Puppeteer.

I will walk you through an example showing how I scraped an event website with Nodejs and Puppeteer.

You will also learn all the tips and tricks to master the art of puppeteer web scraping and gathering any data you want with Nodejs and Puppeteer.

Before we delve in, if you're a backend developer or looking to move into this career path, join other developers receiving daily articles on backend development that will boost your productivity.

What is Puppeteer?

Puppeteer is a Node.js library maintained by Google that provides a high-level API for controlling a Chromium instance from Node.js, and it is widely used for scraping web pages.

There are thousands of things you can do with Puppeteer, such as:

  1. Scraping web pages and analyzing the data.
  2. Tracking page load performance and insights.
  3. Automating form submissions.
  4. Generating page screenshots.
  5. Generating PDFs of web pages.
  6. Automated UI testing.
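
As a quick taste, here is a minimal sketch of two of the tasks above (the URL and file names are just placeholders):

const puppeteer = require('puppeteer')

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://example.com', { waitUntil: 'networkidle2' })

  // Generate a page screenshot (item 4)
  await page.screenshot({ path: 'example.png' })

  // Generate a PDF of the page (item 5)
  await page.pdf({ path: 'example.pdf', format: 'A4' })

  await browser.close()
})()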

In this guide, we are going to use Puppeteer to scrape an event listing website and generate a JSON response with the data collected.

Creating an Event Scraper

We are going to learn how to scrape an event listing website and display the data as a JSON response, as opposed to scraping a job website.

Let’s get started:

Create a new Node/Express project

Before we start scraping web pages, we need to install and set up our Express server properly. We will start by installing the necessary dependencies.

Create a new project directory

mkdir NodeScrapper
cd NodeScrapper

Run the following commands to initialize the project and install all dependencies:

npm init -y
npm i express puppeteer

Next, create an index.js file that will contain our business logic.

touch index.js

Open the index.js file and paste in the following script

const express = require('express')
const Events = require('./eventScript')

const app = express()
const port = 9000

app.get('/events', async (req, res) => {
  const events = await Events.getEvents()
  res.json(events)
})

app.listen(port, () => console.log(`Server listening on port ${port}`))

Next, we will create the eventScript.js file and paste in the following script.

touch eventScript.js

Creating the Event class

First, we will import puppeteer and define the URL we want to scrape.

const puppeteer = require('puppeteer')
const eventUrl = `https://www.meetup.com/find/?keywords=backend&dateRange=this-week`
let page
let browser

class Events {

  // Methods will go in here

}

Creating the Init method

Next, we will create the init method, which initializes Puppeteer with some useful configuration.

Inside the init method, we launch Puppeteer and create a new page with the browser object.

Then we use the page object to visit a particular URL and waitForSelector to wait for the particular CSS selector we want to scrape to load.

Before this, you will need to visit the website and page you want to scrape and open Chrome DevTools' Inspector to find the particular selector or tag that houses the content you want to scrape.

In this demo, I just visit the URL above and inspect the page.

(Screenshot: inspecting the event cards on the Meetup page in Chrome DevTools)

Looking at the image above, we notice that the selector containing all the events we want to scrape is .css-j7qwjs, which is why we wait for it to load with page.waitForSelector.

static async init() {
    // console.log('Loading Page ...')
    browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process', // <- this one doesn't work on Windows
        '--disable-gpu',
      ],
    })
    page = await browser.newPage()
    // Pass the promises to Promise.race without awaiting them individually,
    // otherwise they run sequentially and the race is pointless.
    await Promise.race([
      page.goto(eventUrl, { waitUntil: 'networkidle2' }).catch(() => {}),
      page.waitForSelector('.css-j7qwjs').catch(() => {}),
    ])
  }

Some of the flags inside the args array help boost performance, and they are recommended by Heroku if you are deploying there.

Creating the Resolve method

Next, we will create the resolve method, which calls the init method and then evaluates the page with the page.evaluate method.

static async resolve() {
    await this.init()
    const eventURLs = await page.evaluate(() => {
      const cards = document.querySelectorAll('.css-1gl3lql')
      const cardArr = Array.from(cards)
      const cardLinks = []
      cardArr.forEach((card) => {
        const eventLink = card.querySelector('.css-2ne5m0')
        const eventTitle = card.querySelector('.css-1jy1jkx')
        const eventGroupName = card.querySelector('.css-ycqk9')
        const eventImage = card.querySelector('img')
        const eventDate = card.querySelector('.css-ai9mht')
        if (eventLink === null) return // skip cards without a link
        const { host, protocol, pathname, search } = eventLink
        const eventURL = protocol + '//' + host + pathname + search
        const eventGroup =
          eventGroupName !== null
            ? eventGroupName.textContent.split('Group name:')[1]
            : eventGroupName
        cardLinks.push({
          eventText: eventTitle !== null ? eventTitle.textContent : eventTitle,
          eventURLHost: host,
          eventURL: eventURL,
          eventGroup: eventGroup,
          eventImage: eventImage !== null ? eventImage.src : eventImage,
          date: eventDate !== null ? eventDate.textContent : eventDate,
        })
      })
      return cardLinks
    })
    return eventURLs
  }

Calling page.evaluate on a webpage with Puppeteer gives you the flexibility of manipulating the DOM of that page with your normal DOM functions.
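
For example, here is a small standalone sketch (the URL is just a placeholder) that uses page.evaluate to pull the page title and all link URLs out of the DOM:

const puppeteer = require('puppeteer')

;(async () => {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://example.com')

  // The callback runs inside the browser page, so normal DOM APIs are available.
  // Only serializable values can be returned back to Node.js.
  const data = await page.evaluate(() => ({
    title: document.title,
    links: Array.from(document.querySelectorAll('a'), (a) => a.href),
  }))

  console.log(data)
  await browser.close()
})()

Keep in mind that the callback is executed in the page context, so it cannot close over Node.js variables directly; anything it needs must be passed as extra arguments to page.evaluate.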

In our case, we used document.querySelectorAll() to select all the nodes that have the particular class we want to scrape.

If you look at the event page inspection again, you will see that each of the events has the class css-1gl3lql on its root parent element.

So after collecting all the events with document.querySelectorAll(), we loop through each of them, map the data we need into the cardLinks array, and return the scraped data.

Creating the GetEvents method

Lastly, we create the getEvents method and call the resolve method within it.

static async getEvents() {
    const events = await this.resolve()
    await browser.close()
    return events
  }

After getting the events, we close the browser object and return the events.
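
One caveat worth knowing: if resolve throws, browser.close() is never reached and the Chromium process keeps running. A slightly more defensive variant (just a sketch; the full listing below keeps the simpler version) wraps the call in try/finally:

static async getEvents() {
    try {
      return await this.resolve()
    } finally {
      // Always close the browser, even if scraping throws.
      if (browser) await browser.close()
    }
  }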

Let’s put everything together for a clearer view.

const puppeteer = require('puppeteer')
const eventUrl = `https://www.meetup.com/find/?keywords=backend&dateRange=this-week`
let page
let browser
class Events {
  static async init() {
    // console.log('Loading Page ...')
    browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process', // <- this one doesn't work on Windows
        '--disable-gpu',
      ],
    })
    page = await browser.newPage()
    await Promise.race([
      page.goto(eventUrl, { waitUntil: 'networkidle2' }).catch(() => {}),
      page.waitForSelector('.css-j7qwjs').catch(() => {}),
    ])
  }
  static async resolve() {
    await this.init()
    const eventURLs = await page.evaluate(() => {
      const cards = document.querySelectorAll('.css-1gl3lql')
      const cardArr = Array.from(cards)
      const cardLinks = []
      cardArr.forEach((card) => {
        const eventLink = card.querySelector('.css-2ne5m0')
        const eventTitle = card.querySelector('.css-1jy1jkx')
        const eventGroupName = card.querySelector('.css-ycqk9')
        const eventImage = card.querySelector('img')
        const eventDate = card.querySelector('.css-ai9mht')
        if (eventLink === null) return // skip cards without a link
        const { host, protocol, pathname, search } = eventLink
        const eventURL = protocol + '//' + host + pathname + search
        const eventGroup =
          eventGroupName !== null
            ? eventGroupName.textContent.split('Group name:')[1]
            : eventGroupName
        cardLinks.push({
          eventText: eventTitle !== null ? eventTitle.textContent : eventTitle,
          eventURLHost: host,
          eventURL: eventURL,
          eventGroup: eventGroup,
          eventImage: eventImage !== null ? eventImage.src : eventImage,
          date: eventDate !== null ? eventDate.textContent : eventDate,
        })
      })
      return cardLinks
    })
    return eventURLs
  }

  static async getEvents() {
    const events = await this.resolve()
    await browser.close()
    return events
  }
}

module.exports = Events

To test our newly developed product, run the following command and visit the URL afterward.

N.B.: The class names on the webpage might have changed by the time you read this, so always use Chrome DevTools to inspect and find the right selector names.

node index.js

// Then visit

http://localhost:9000/events

If you set up everything correctly, you should be greeted with a JSON response containing your events.
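
You can also test the endpoint directly from the terminal:

curl http://localhost:9000/events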

(Screenshot: the JSON response returned by the /events endpoint)

Take a break and subscribe to receive daily articles on Nodejs backend development that will boost your productivity.

Is Web Scraping Illegal?

Web scraping is not illegal per se, but there are situations where it can be, so you should be very careful when web scraping, whether with Puppeteer or any other tool.

Below are some factors to consider when web scraping, to help you determine whether your actions are legal or not.

Copyright Infringement.

As popular as copyright is, you might not know to what extent it applies to web scraping.

Well, some of the data we may scrape is copyright protected, so you might want to check the website's copyright policy to see what is allowed and what is not.

Robots.txt

You need to respect the information provided in the robots.txt file: if it says no scraping is allowed, then it would be illegal to do otherwise.
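
If you want to check a site's robots.txt programmatically, a minimal sketch (using the global fetch available in Node 18+; the URL is just an example) could look like this:

// checkRobots.js — fetch and print a site's robots.txt
const checkRobots = async (origin) => {
  const res = await fetch(new URL('/robots.txt', origin))
  if (!res.ok) {
    console.log('No robots.txt found for', origin)
    return
  }
  console.log(await res.text())
}

checkRobots('https://www.meetup.com')

For real projects, you would feed this into a proper robots.txt parser rather than reading it by eye.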

Using the API

If the website provides an API, use it instead of scraping the data.

When you bypass the provided API and scrape the data in a way that infringes copyright, it becomes illegal.

Terms of Services

You need to review the terms of service of the particular website to know what is allowed and what is not.

You need to follow the guidelines in the terms of service if you want to scrape legally.

Scraping Public Content

As long as you scrape public content, you are generally fine, but if you scrape privately owned data, you should review the owner's terms and be very careful.

Debugging with Puppeteer

Puppeteer is also a great tool for debugging because it opens the web page in Chromium just like a normal user would.

So you can use it for automated UI testing and to determine how your webpage responds to user events and other metrics.
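
For example, you can launch a visible browser and slow everything down so you can watch what your script is doing (headless, slowMo, and devtools are all standard puppeteer.launch options):

const browser = await puppeteer.launch({
  headless: false, // show the Chromium window instead of running headless
  slowMo: 250, // slow each operation down by 250ms so you can follow along
  devtools: true, // open DevTools automatically for each new tab
})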

Taking Screenshots

Sometimes, you might want to take screenshots of particular points on the web page while you scrape your data.

To take screenshots with Puppeteer, add the following line to the script above.

await page.screenshot({ path: 'screenshot.png' });

Here, screenshot.png is the file name for the screenshot; you can also specify a full path to where the screenshot should be saved.
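
page.screenshot also accepts a few other useful options; for instance, fullPage captures the entire scrollable page and clip captures only a region:

// Capture the entire scrollable page
await page.screenshot({ path: 'full.png', fullPage: true })

// Capture only a region of the page
await page.screenshot({
  path: 'region.png',
  clip: { x: 0, y: 0, width: 800, height: 600 },
})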


Conclusion 

We have walked through a complete guide to web scraping with Nodejs and Puppeteer.

We discussed why you might want to web scrape, its importance and use cases, the legal aspects of Puppeteer web scraping, and how to get started with web scraping with Nodejs and Puppeteer.

Whenever you're ready

There are 4 ways we can help you become a great backend engineer:

The MB Platform

Join 1000+ backend engineers learning backend engineering. Build real-world backend projects, learn from expert-vetted courses and roadmaps, track your learnings and set schedules, and solve backend engineering tasks, exercises, and challenges.

The MB Academy

The “MB Academy” is a 6-month intensive Advanced Backend Engineering BootCamp to produce great backend engineers.

Join Backend Weekly

If you like posts like this, you will absolutely enjoy our exclusive weekly newsletter, sharing exclusive backend engineering resources to help you become a great backend engineer.

Get Backend Jobs

Find 2,000+ tailored international remote backend jobs, or reach 50,000+ backend engineers on the #1 Backend Engineering Job Board.
