Complete Guide to Web Scraping with Nodejs and Puppeteer


Web scraping is a powerful tool for data collection, and this guide to web scraping with Node.js and Puppeteer will show you how to collect and analyze data using web scraping techniques.

You have probably heard of the terms “web scraping” and “Puppeteer”, and the cool things you can do with Puppeteer web scraping.

You may also want to learn more and even get started with it immediately.

There are many reasons you might want to web scrape with Node.js, such as:

Maybe you want to analyze the prices of goods and services across the internet, collect all the events happening around you, or, as a backend developer, gather all the latest backend development job openings.

There are numerous reasons to learn Puppeteer web scraping and how to scrape the web using JavaScript and Node.js.

In this article, we are going to explore a complete guide to web scraping with Node.js and Puppeteer.

I will walk you through an example showing how I scraped an event website with Node.js and Puppeteer.

You will also learn tips and tricks to master the art of Puppeteer web scraping and gather any data you want with Node.js and Puppeteer.


    What is Puppeteer?

Puppeteer is a Node.js library from Google that provides a high-level API for controlling a Chromium instance from Node.js, which also makes it a great tool for scraping web pages.

There are many things you can do with Puppeteer, such as:

1. Scraping web pages and analyzing the data.
2. Tracking page-load performance and insights.
3. Automating form submissions.
4. Generating page screenshots.
5. Generating PDFs of website pages.
6. Automated testing.

    In this guide, we are going to use Puppeteer to scrape an event listing website and generate a JSON response with the data collected.

Creating an Event Scraper

We are going to learn how to scrape an event listing website and display the collected data as a JSON response.

    Let’s get started:

    Create a new Node/Express project

Before we start scraping web pages, we need to set up our Express server properly. We will start by installing the necessary dependencies.

    Create a new project directory

    mkdir NodeScrapper
    cd NodeScrapper

Initialize the project and install all dependencies:

npm init -y
npm i express puppeteer

    Next, create an index.js file that will contain our business logic.

    touch index.js

    Open the index.js file and paste in the following script

    const express = require('express')
    const Events = require('./eventScript')
    
    const app = express()
    const port = 9000
    
    app.get('/events', async (req, res) => {
      const events = await Events.getEvents()
      res.json(events)
    })
    
app.listen(port, () => console.log(`Server running on port ${port}`))

    Next, we will create the eventScript.js file and paste in the following script.

    touch eventScript.js

    Creating the Event class

    First, we will import puppeteer and define the URL we want to scrape.

const puppeteer = require('puppeteer')
const eventUrl = `https://www.meetup.com/find/?keywords=backend&dateRange=this-week`
let page
let browser

class Events {

  // The scraping methods will go in here

}

    Creating the Init method

Next, we will create the init method, which initializes Puppeteer with some useful configuration.


Within the init method, we launch Puppeteer and create a new page from the browser object.

Then we use the page object to visit a particular URL, and waitForSelector to wait for the particular CSS selector we want to scrape to load.

Before this, you will need to go to the website and page you want to scrape and open your browser DevTools inspector to find the particular selector or tag that houses the content you want to scrape.

In this demo, I just visit the URL above and inspect the page.

[Screenshot: the Meetup event listing inspected in DevTools]

Looking at the image above, we notice that the selector containing all the events we want to scrape is .css-j7qwjs, so that is the selector we wait for with page.waitForSelector.

static async init() {
    browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process', // <- this one doesn't work on Windows
        '--disable-gpu',
      ],
    })
    page = await browser.newPage()
    // Continue as soon as either navigation settles or the selector appears.
    // Note: no `await` inside the array, otherwise there is nothing to race.
    await Promise.race([
      page.goto(eventUrl, { waitUntil: 'networkidle2' }).catch(() => {}),
      page.waitForSelector('.css-j7qwjs').catch(() => {}),
    ])
  }

The flags inside the args array help boost performance, and they are recommended by Heroku if you are deploying there.

    Creating the Resolve method

    Next, we will create the resolve method which will call the init method and evaluate the page with the page.evaluate method.

static async resolve() {
    await this.init()
    const eventURLs = await page.evaluate(() => {
      const cards = document.querySelectorAll('.css-1gl3lql')
      const cardLinks = []
      Array.from(cards).forEach((card) => {
        const eventLink = card.querySelector('.css-2ne5m0')
        if (eventLink === null) return // skip cards without a link
        const eventTitle = card.querySelector('.css-1jy1jkx')
        const eventGroupName = card.querySelector('.css-ycqk9')
        const eventImage = card.querySelector('img')
        const eventDate = card.querySelector('.css-ai9mht')
        // Anchor elements expose the same URL parts as window.location
        const { protocol, host, pathname, search } = eventLink
        const eventURL = protocol + '//' + host + pathname + search
        const eventGroup =
          eventGroupName !== null
            ? eventGroupName.textContent.split('Group name:')[1]
            : null
        cardLinks.push({
          eventText: eventTitle !== null ? eventTitle.textContent : null,
          eventURLHost: host,
          eventURL: eventURL,
          eventGroup: eventGroup,
          eventImage: eventImage !== null ? eventImage.src : null,
          date: eventDate !== null ? eventDate.textContent : null,
        })
      })
      return cardLinks
    })
    return eventURLs
  }
    

When you call page.evaluate on a web page with Puppeteer, the callback runs inside the browser context, which gives you the flexibility of manipulating the DOM of that page with your normal DOM functions.


In our case, we used document.querySelectorAll() to select all the nodes that have the particular class we want to scrape.

If you look at the event page inspection again, you will see that each event has the class css-1gl3lql on its root parent.

So after collecting all the events with document.querySelectorAll(), we loop through each one, map the data we need into the cardLinks array, and return the scraped data.
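The URL concatenation inside resolve can be sanity-checked outside the browser: Node's WHATWG URL class exposes the same protocol, host, pathname, and search fields as a DOM anchor element. Here is a small sketch (the Meetup-style URL below is a made-up example):

```javascript
// Node's URL class mirrors the fields we read off each event link,
// so we can verify that protocol + '//' + host + pathname + search
// reassembles the original absolute URL.
const link = new URL('https://www.meetup.com/some-group/events/123/?recId=abc')

const { protocol, host, pathname, search } = link
const eventURL = protocol + '//' + host + pathname + search

console.log(eventURL)
```

This is also a handy way to unit test the mapping logic without launching a browser.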

    Creating the GetEvents method

Lastly, we create the getEvents method and call the resolve method within it.

    static async getEvents() {
        const events = await this.resolve()
        await browser.close()
        return events
      }

    After getting the events, we close the browser object and return the events.
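Note that if resolve() throws, the browser is never closed and the Chromium process leaks. One way to guard against that (my own addition, not part of the original script) is a try/finally wrapper; it is sketched here with a stand-in browser object so it runs without Puppeteer:

```javascript
// Sketch: guarantee cleanup of the browser resource with try/finally,
// so close() runs on success and on failure alike.
async function withBrowser(launch, work) {
  const browser = await launch()
  try {
    return await work(browser)
  } finally {
    await browser.close() // always runs, even when work() throws
  }
}

// Stand-in for puppeteer.launch(): records whether close() was called.
const fakeBrowser = { closed: false, async close() { this.closed = true } }

const demo = withBrowser(
  async () => fakeBrowser,
  async () => { throw new Error('scrape failed') },
).catch((err) => `${err.message} | browser closed: ${fakeBrowser.closed}`)

demo.then((msg) => console.log(msg))
```

In the real scraper you would pass () => puppeteer.launch({ ... }) as the first argument.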

    Let’s put everything together for a clearer view.

const puppeteer = require('puppeteer')
const eventUrl = `https://www.meetup.com/find/?keywords=backend&dateRange=this-week`
let page
let browser

class Events {
  static async init() {
    browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process', // <- this one doesn't work on Windows
        '--disable-gpu',
      ],
    })
    page = await browser.newPage()
    // Continue as soon as either navigation settles or the selector appears.
    await Promise.race([
      page.goto(eventUrl, { waitUntil: 'networkidle2' }).catch(() => {}),
      page.waitForSelector('.css-j7qwjs').catch(() => {}),
    ])
  }

  static async resolve() {
    await this.init()
    const eventURLs = await page.evaluate(() => {
      const cards = document.querySelectorAll('.css-1gl3lql')
      const cardLinks = []
      Array.from(cards).forEach((card) => {
        const eventLink = card.querySelector('.css-2ne5m0')
        if (eventLink === null) return // skip cards without a link
        const eventTitle = card.querySelector('.css-1jy1jkx')
        const eventGroupName = card.querySelector('.css-ycqk9')
        const eventImage = card.querySelector('img')
        const eventDate = card.querySelector('.css-ai9mht')
        // Anchor elements expose the same URL parts as window.location
        const { protocol, host, pathname, search } = eventLink
        const eventURL = protocol + '//' + host + pathname + search
        const eventGroup =
          eventGroupName !== null
            ? eventGroupName.textContent.split('Group name:')[1]
            : null
        cardLinks.push({
          eventText: eventTitle !== null ? eventTitle.textContent : null,
          eventURLHost: host,
          eventURL: eventURL,
          eventGroup: eventGroup,
          eventImage: eventImage !== null ? eventImage.src : null,
          date: eventDate !== null ? eventDate.textContent : null,
        })
      })
      return cardLinks
    })
    return eventURLs
  }

  static async getEvents() {
    const events = await this.resolve()
    await browser.close()
    return events
  }
}

module.exports = Events

To test our newly developed scraper, run the following command, and then visit the URL below.

N.B.: The page's class names might have changed by the time you read this, so always use your browser DevTools to inspect the right selector names.

node index.js

// Then visit

http://localhost:9000/events

    If you set up everything correctly, you should be greeted with a JSON response containing your events.

[Screenshot: JSON response containing the scraped event data]


    Is Web Scraping Illegal?

Web scraping is not illegal in itself, but there are situations in which it can be, so you should be very careful when scraping, whether with Puppeteer or any other tool.


Here are some points to consider when web scraping, to help you determine whether your actions are legal.

    Copyright Infringement.

As well known as copyright is, you might not know to what extent it applies to web scraping.

Some of the data we scrape is copyright protected, so you should check the website's copyright notice to see what is allowed and what is not.

robots.txt

You need to respect the directives in the site's robots.txt file; if it disallows crawling the pages you are after, you should not scrape them.
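As a rough illustration, here is a minimal and deliberately simplified check of a robots.txt Disallow list. Real robots.txt parsing handles per-agent groups, Allow overrides, and wildcards, so a dedicated parser library is the safer choice in production:

```javascript
// Simplified sketch: does any Disallow rule prefix-match the path?
function isDisallowed(robotsTxt, path) {
  const rules = robotsTxt
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.toLowerCase().startsWith('disallow:'))
    .map((line) => line.slice('disallow:'.length).trim())
    .filter(Boolean) // a bare "Disallow:" line allows everything

  return rules.some((rule) => path.startsWith(rule))
}

const robots = 'User-agent: *\nDisallow: /admin/\nDisallow: /private'

console.log(isDisallowed(robots, '/admin/users')) // true
console.log(isDisallowed(robots, '/events'))      // false
```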

Using the API

If the website provides an API, use it instead of scraping the data.

Bypassing a provided API and scraping the data in a way that infringes copyright can make your scraping illegal.

    Terms of Services

You need to review the terms of service of the particular website to know what is allowed and what is not.

Follow the guidelines in the terms of service if you want to scrape legally.

    Scraping Public Content

As long as you scrape publicly available content, you are generally free to do so; but if you scrape privately owned data, review the owner's terms and be very careful.

    Debugging with Puppeteer

    Puppeteer is also a great tool for debugging because it opens the web page with Chromium just like a normal user would.

So you can use it for automated UI testing, to determine how your web page responds to user events and other metrics.

    Taking Screenshots

    Sometimes, you might want to take screenshots of particular points on the web page while you scrape your data.

To take screenshots with Puppeteer, add the following line to the script above.

    await page.screenshot({ path: 'screenshot.png' });

Here, screenshot.png is the file name of the screenshot; you can also specify a full path where the screenshot will be saved.

    Conclusion 

We have discussed web scraping with Node.js and Puppeteer extensively.

We covered why you might want to scrape the web, its importance and use cases, the legal aspects of Puppeteer web scraping, and how to get started with web scraping in Node.js and Puppeteer.
