Web scraping is one of the most powerful tools for data collection, and this guide to web scraping with Node.js and Puppeteer will show you how to collect and analyze data using web scraping techniques.
You have probably heard of the terms “web scraping” or “Puppeteer” and the cool things you can do with Puppeteer web scraping, and you want to learn more and even get started with it right away.
There are lots of reasons you might want to web scrape with Node.js:
Maybe you want to compare the prices of goods and services across the internet, collect all the events happening around you, or, as a backend developer, gather the latest backend development job openings.
Whatever the reason, learning Puppeteer web scraping and how to scrape with JavaScript and Node.js is well worth it.
In this article, we are going to explore a complete guide to web scraping with Node.js and Puppeteer.
I will walk you through an example showing how I scraped an event website with Node.js and Puppeteer.
You will also learn tips and tricks for mastering Puppeteer web scraping and gathering any data you want with Node.js.
Before we dive in: if you’re a backend developer, or you’re looking to move into that career path, join other developers receiving daily articles on backend development that will boost your productivity.
What is Puppeteer?
Puppeteer is a Node.js library from Google that is commonly used to scrape web pages and, more generally, to control a Chromium instance from Node.js.
There are plenty of things you can do with Puppeteer (a minimal example follows this list), such as:
- Scraping web pages and analyzing the data.
- Tracking page-load performance and insights.
- Automating form submissions.
- Generating screenshots of pages.
- Generating PDFs of web pages.
- Automated UI testing.
- And much more.
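To give you a feel for how little code some of these tasks take, here is a minimal, self-contained sketch, separate from the event scraper we build below, that opens a page, reads its title, saves a screenshot, and exports a PDF. The URL is only a placeholder; swap in any page you like.
const puppeteer = require('puppeteer')
async function demo() {
  const browser = await puppeteer.launch()
  const page = await browser.newPage()
  await page.goto('https://example.com', { waitUntil: 'networkidle2' })
  console.log(await page.title()) // simple data extraction
  await page.screenshot({ path: 'example.png' }) // screenshot of the page
  await page.pdf({ path: 'example.pdf', format: 'A4' }) // PDF of the page (headless mode only)
  await browser.close()
}
demo()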
In this guide, we are going to use Puppeteer to scrape an event listing website and generate a JSON response with the data collected.
Creating an Event Scraper
We are going to learn how to scrape an event listing website and serve the data as a JSON response.
Let’s get started:
Create a new Node/Express project
Before we start scraping any web pages, we need to install and set up our Express server properly, so we will start by installing the necessary dependencies.
Create a new project directory
mkdir NodeScrapper
cd NodeScrapper
Run the following commands to install all dependencies:
npm i express puppeteer
Next, create an index.js file that will contain our business logic.
touch index.js
Open the index.js file and paste in the following script:
const express = require('express')
const Events = require('./eventScript')
const app = express()
const port = 9000
// GET /events – run the scraper and return the collected events as JSON
app.get('/events', async (req, res) => {
  const events = await Events.getEvents()
  res.json(events)
})
app.listen(port, () => console.log(`Event scraper listening on port ${port}`))
Next, we will create the eventScript.js file and paste in the following script.
touch eventScript.js
Creating the Event class
First, we will import puppeteer and define the URL we want to scrape.
const puppeteer = require('puppeteer')
const eventUrl = `https://www.meetup.com/find/?keywords=backend&dateRange=this-week`
let page
let browser
class Events {
  // Lots of code will go in here
}
Creating the Init method
Next, we will create the init method, which will initialize Puppeteer with some useful configuration.
Inside the init method, we launch Puppeteer and create a new page from the browser object. We then use the page object to visit a particular URL and waitForSelector to wait for the CSS selector we want to scrape to appear.
Before doing this, you will need to open the website and page you want to scrape and use the Chrome DevTools inspector to find the selector or tag that houses the content you want to scrape.
In this demo, I simply visited the URL above and inspected the page. The selector that contains all the events we want to scrape is .css-j7qwjs, so that is what we wait for with page.waitForSelector.
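If you want to confirm the selector yourself before writing any Node.js code, you can run a quick check in the DevTools console on the open page (the class names below are simply what the page used at the time of writing):
// Paste into the DevTools console on the open Meetup search page
document.querySelector('.css-j7qwjs') // the container we wait for
document.querySelectorAll('.css-1gl3lql').length // how many event cards match
With the selector confirmed, the init method looks like this: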
static async init() {
  // console.log('Loading Page ...')
  browser = await puppeteer.launch({
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--no-first-run',
      '--no-zygote',
      '--single-process', // <- this flag doesn't work on Windows
      '--disable-gpu',
    ],
  })
  page = await browser.newPage()
  // Resolve as soon as either navigation finishes or the selector appears
  await Promise.race([
    page.goto(eventUrl, { waitUntil: 'networkidle2' }).catch(() => {}),
    page.waitForSelector('.css-j7qwjs').catch(() => {}),
  ])
}
Some of the flags inside the args array just help boost performance, and they are recommended by Heroku if you are deploying there.
Creating the Resolve method
Next, we will create the resolve method, which calls the init method and then evaluates the page with the page.evaluate method.
static async resolve() {
  await this.init()
  const eventURLs = await page.evaluate(() => {
    const cards = document.querySelectorAll('.css-1gl3lql')
    const cardArr = Array.from(cards)
    const cardLinks = []
    cardArr.forEach((card) => {
      const eventLink = card.querySelector('.css-2ne5m0')
      const eventTitle = card.querySelector('.css-1jy1jkx')
      const eventGroupName = card.querySelector('.css-ycqk9')
      const eventImage = card.querySelector('img')
      const eventDate = card.querySelector('.css-ai9mht')
      // Skip cards that are missing a link or an image
      if (eventLink === null || eventImage === null) return
      const { host, protocol } = eventLink
      const pathName = eventLink.pathname
      const query = eventLink.search
      const eventURL = protocol + '//' + host + pathName + query
      const eventGroup =
        eventGroupName !== null
          ? eventGroupName.textContent.split('Group name:')[1]
          : eventGroupName
      cardLinks.push({
        eventText: eventTitle !== null ? eventTitle.textContent : eventTitle,
        eventURLHost: host,
        eventURL: eventURL,
        eventGroup: eventGroup,
        eventImage: eventImage.src,
        date: eventDate !== null ? eventDate.textContent : eventDate,
      })
    })
    return cardLinks
  })
  return eventURLs
}
When you call page.evaluate on a web page with Puppeteer, the callback runs inside that page, which gives you the flexibility of manipulating its DOM with your normal DOM functions. In our case, we used document.querySelectorAll() to select all the nodes with the particular class we want to scrape.
If you inspect the event page again, you will see that each event card has the class css-1gl3lql on its root parent element. So after collecting all the event cards with document.querySelectorAll(), we loop through each of them, map the data we need into the cardLinks array, and return the scraped data.
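If page.evaluate is new to you, here is a tiny sketch of the pattern on its own, assuming you are inside an async function and page is an already-opened Puppeteer page like the one init creates: the callback executes in the browser, so DOM APIs are available there, and whatever serializable value it returns is handed back to the Node side.
// The callback runs in the page context, so document is available there;
// the returned array of strings is serialized back to Node.
const headings = await page.evaluate(() =>
  Array.from(document.querySelectorAll('h2')).map((el) => el.textContent)
)
console.log(headings)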
Creating the GetEvents method
Lastly, we create the getEvents method and call the resolve method within it.
static async getEvents() {
  const events = await this.resolve()
  await browser.close()
  return events
}
After getting the events, we close the browser object and return the events.
Let’s put everything together for a clearer view.
const puppeteer = require('puppeteer')
const eventUrl = `https://www.meetup.com/find/?keywords=backend&dateRange=this-week`
let page
let browser
class Events {
  static async init() {
    // console.log('Loading Page ...')
    browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process', // <- this flag doesn't work on Windows
        '--disable-gpu',
      ],
    })
    page = await browser.newPage()
    // Resolve as soon as either navigation finishes or the selector appears
    await Promise.race([
      page.goto(eventUrl, { waitUntil: 'networkidle2' }).catch(() => {}),
      page.waitForSelector('.css-j7qwjs').catch(() => {}),
    ])
  }

  static async resolve() {
    await this.init()
    const eventURLs = await page.evaluate(() => {
      const cards = document.querySelectorAll('.css-1gl3lql')
      const cardArr = Array.from(cards)
      const cardLinks = []
      cardArr.forEach((card) => {
        const eventLink = card.querySelector('.css-2ne5m0')
        const eventTitle = card.querySelector('.css-1jy1jkx')
        const eventGroupName = card.querySelector('.css-ycqk9')
        const eventImage = card.querySelector('img')
        const eventDate = card.querySelector('.css-ai9mht')
        // Skip cards that are missing a link or an image
        if (eventLink === null || eventImage === null) return
        const { host, protocol } = eventLink
        const pathName = eventLink.pathname
        const query = eventLink.search
        const eventURL = protocol + '//' + host + pathName + query
        const eventGroup =
          eventGroupName !== null
            ? eventGroupName.textContent.split('Group name:')[1]
            : eventGroupName
        cardLinks.push({
          eventText: eventTitle !== null ? eventTitle.textContent : eventTitle,
          eventURLHost: host,
          eventURL: eventURL,
          eventGroup: eventGroup,
          eventImage: eventImage.src,
          date: eventDate !== null ? eventDate.textContent : eventDate,
        })
      })
      return cardLinks
    })
    return eventURLs
  }

  static async getEvents() {
    const events = await this.resolve()
    await browser.close()
    return events
  }
}
module.exports = Events
To test our newly built scraper, run the following command and then visit the URL below.
N.B.: The class names on the page may have changed by the time you read this, so always use Chrome DevTools to find the current selector names.
node index.js
// Then visit
http://localhost:9000/events
If you set up everything correctly, you should be greeted with a JSON response containing your events.
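The exact values depend on whatever Meetup is listing at the time, but going by the fields we push into cardLinks, each entry in the response will have roughly this shape (the values below are placeholders, not real scraped data):
[
  {
    "eventText": "Example Backend Meetup",
    "eventURLHost": "www.meetup.com",
    "eventURL": "https://www.meetup.com/example-group/events/000000000/",
    "eventGroup": "Example Group",
    "eventImage": "https://example.com/event-image.jpeg",
    "date": "Example date text"
  }
]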
Take a break and subscribe to receive daily articles on Node.js backend development that will boost your productivity.
Is Web Scraping Illegal?
Web scraping is not illegal in itself, but in some situations it can be, so you should be careful whenever you scrape, whether with Puppeteer or any other tool.
Here are some things to consider when web scraping to determine whether what you are doing is legal.
Copyright Infringement.
As familiar as copyright is, you might not know how far it reaches into web scraping. Some of the data you scrape may be copyright protected, so you should check the website's copyright notice to see what is allowed and what is not.
Robots.txt
You need to respect the rules in the site's robots.txt file; if it disallows crawling the pages you are interested in, scraping them anyway can get you blocked and may land you in legal trouble.
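If you want to check a site's robots.txt programmatically before scraping it, a rough sketch could look like this. It assumes Node 18+ (where fetch is built in) and only prints the file; a dedicated parser such as the robots-parser package would interpret the rules for you.
// Rough sketch: download robots.txt and print it so you can read the rules yourself
async function checkRobots(origin) {
  const res = await fetch(new URL('/robots.txt', origin))
  console.log(await res.text())
}
checkRobots('https://www.meetup.com').catch(console.error)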
Using the API
If the website provides an API, use it instead of scraping the data.
Bypassing the API and scraping in a way that infringes copyright or the site's terms can put you on the wrong side of the law.
Terms of Service
You need to review the terms of service of the particular website to know what is allowed and what is not.
Follow those guidelines if you want to scrape legally.
Scraping Public Content
As long as you scrape publicly available content, you are generally fine, but if the data is privately owned or sits behind a login, you should review the owner's terms and be very careful.
Debugging with Puppeteer
Puppeteer is also a great tool for debugging because it opens the web page with Chromium just like a normal user would.
So you can use it for automated UI testing, to see how your web page responds to user events, and to collect other metrics.
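A common trick while developing a scraper is to temporarily run the browser in headful mode and slow it down so you can watch what the script does. For example, while debugging you could swap the launch call inside the init method for something like this (headless and slowMo are standard Puppeteer launch options):
browser = await puppeteer.launch({
  headless: false, // open a visible browser window instead of running headless
  slowMo: 100, // wait 100 ms between Puppeteer operations so you can follow along
})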
Taking Screenshots
Sometimes, you might want to take screenshots at particular points on the web page while you scrape your data.
To take screenshots with Puppeteer, add the following line to the scripts above.
await page.screenshot({ path: 'screenshot.png' });
Here screenshot.png is the file name of the screenshot; you can also specify a full path to where the screenshot should be saved.
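Two related options worth knowing: fullPage captures the entire scrollable page rather than just the viewport, and page.pdf saves the page as a PDF instead (PDF generation only works in headless mode). For example:
await page.screenshot({ path: 'screenshot.png', fullPage: true })
await page.pdf({ path: 'page.pdf', format: 'A4' })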
Useful Resources
- The Ultimate Guide to Web Scraping with Node.js
- Web Scraping with Nuxtjs using Puppeteer
- The Definitive Guide to Web Scraping with NodeJs & Puppeteer
Conclusion
We have walked through a complete guide to web scraping with Node.js and Puppeteer.
We discussed why you might want to web scrape, its importance and use cases, the legal aspects of web scraping, and how to get started with Node.js and Puppeteer.