This article demonstrates how to set up and configure Puppeteer for web scraping with Nuxt.js, and how to use it to scrape a job listing website and display the results on your own site.
Puppeteer is a server-side Node package, so it is difficult to get it working with a client-side library like Vue.js, and there are few tutorials online that show how to set this up properly.
Here I demonstrate how I solved the problem in a client’s project using Nuxt.js and Puppeteer.
Web Scraping Fundamentals
Web scraping can sound strange at first, but it is a simple idea.
Web scraping is the technique of extracting data from websites and saving it in a desired format for further processing.
It automates the otherwise manual process of collecting information from websites and storing it electronically.
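To make "saving it in any desired format" concrete, here is a minimal sketch that turns scraped records into CSV text. The records and their field names (name, price) are made up for illustration; a real scraper would produce its own shape.

```javascript
// Hypothetical records, shaped like the output of a scraper.
const records = [
  { name: 'Desk lamp', price: '19.99' },
  { name: 'Monitor, 27"', price: '249.00' },
]

// Quote a value for CSV: wrap it in double quotes and double any embedded quotes.
function toCsvField(value) {
  return '"' + String(value).replace(/"/g, '""') + '"'
}

// Convert an array of flat objects to CSV text.
// The keys of the first record are used as the header row.
function toCsv(rows) {
  if (rows.length === 0) return ''
  const header = Object.keys(rows[0])
  const lines = rows.map((row) =>
    header.map((key) => toCsvField(row[key])).join(',')
  )
  return [header.map(toCsvField).join(','), ...lines].join('\n')
}

console.log(toCsv(records))
```

Quoting every field keeps commas and quotes inside values from breaking the columns.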
Usage
- Extracting product details such as prices, product names, and images from e-commerce websites.
- Web scraping is very useful in research as it can help to gather structured data from multiple websites.
- Gathering data from different sources for analysis can easily be automated with web scraping.
- It can be used to collect data for testing and training machine learning models.
Methods of web scraping
- Web scraping software: This is the most popular way of web scraping where pre-made software is deployed for the purpose of web scraping.
- Writing code: This is a method where a developer is hired to develop the scraping scripts based on the input of the owner to scrape a specific website.
Introduction to Puppeteer
Puppeteer is a Node library used to scrape web pages, automate form submissions, and more.
It is Google’s official Node library for controlling Google Chrome instances from Node.js, and it can be configured to run in headless mode so that it runs in the background.
Puppeteer covers many use cases; I will list only a few below.
- Web page scraping.
- Tracking page load performance.
- Automating form submissions.
- Generating page screenshots.
- Generating PDFs of website pages.
- Automated testing.
- Any other browser automation.
- Rendering a single-page app on the server side for preview.
- etc.
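As a quick illustration of two of the use cases above (screenshots and PDFs), here is a minimal sketch. The function name capturePage and its parameters are my own; the Puppeteer library is passed in as an argument so the function can be tried with a stand-in object, without launching Chrome.

```javascript
// Save a screenshot and a PDF of one page, then return the page title.
// `puppeteerLib` is the object returned by require('puppeteer'); it is passed in
// so the function can be exercised without launching a real browser.
async function capturePage(puppeteerLib, url, basePath) {
  const browser = await puppeteerLib.launch() // headless by default
  try {
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle2' })
    await page.screenshot({ path: basePath + '.png', fullPage: true })
    await page.pdf({ path: basePath + '.pdf', format: 'A4' })
    return await page.title()
  } finally {
    await browser.close() // always release the Chrome process
  }
}

// Usage (requires `npm i puppeteer`):
// capturePage(require('puppeteer'), 'https://example.com', 'example').then(console.log)
```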
Building a JobScrapper Application with Nuxtjs using Puppeteer
Creating a new Nuxt Project
Before we start developing our web scraper, we need to install and set up Nuxt.js. Following the steps in the official documentation will speed up the process.
Run the following command to set up the project, accepting the default options.
yarn create nuxt-app <project-name>
After installation, let’s create the components, stores, and pages needed in this project.
Create a component called Jobs to display a list of all the jobs scraped.
cd components
touch Jobs.vue
Next, create a new job store in the store folder to manage our jobs state.
cd store
touch job.js
Lastly, let’s create a jobs page inside the pages folder for navigation, should we need it.
cd pages
touch jobs.vue
Of course, this is a minimal setup; your project may be more complex and contain many more components, pages, and stores to manage different states.
Installing dependencies.
Next, install the dependencies needed to scrape pages with Nuxt.js and Puppeteer.
npm i puppeteer net tls
Running this command installs the puppeteer library and the other supporting libraries.
Configuring Puppeteer
This is the difficult part. I ran into several issues configuring Puppeteer to work with Nuxt.js, because Nuxt.js is both a client-side and a server-side framework.
That makes it hard to know where to place Puppeteer and how to call it, since Puppeteer is a server-side Node library and only works on the server side of Nuxt.js.
I will go ahead and explain how I got it working in my project.
First, let’s create a new script.js file in the root directory and paste in the following code.
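// Patches puppeteer's package.json; meant to run after every install.
const saveFile = require('fs').writeFileSync
// require.main.paths[0] is the node_modules folder next to this script.
const pkgJsonPath = require.main.paths[0] + '/puppeteer' + '/package.json'
// console.log(pkgJsonPath)
const json = require(pkgJsonPath)
// Make sure a "browser" object exists before touching it.
// eslint-disable-next-line no-prototype-builtins
if (!json.hasOwnProperty('browser')) {
  json.browser = {}
}
// Remove the "ws" entry from the "browser" object and write the file back.
delete json.browser.ws
saveFile(pkgJsonPath, JSON.stringify(json, null, 2))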
If it is not obvious what the script does, here is an explanation.
It opens the node_modules/puppeteer/package.json file and deletes one particular entry.
Before deleting it, the script checks whether package.json has a browser object and creates an empty one if not. It then deletes the ws property of the browser object and saves the file.
The script will run each time we run npm install.
The ws entry is Puppeteer’s web socket mapping, which pointed to a web socket module that does not exist in our project.
By deleting that entry each time npm install runs, Puppeteer defaults to using the web socket package in our node_modules folder.
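To see the effect of the script in isolation, here is a small sketch of the same transformation written as a pure function. The name stripBrowserWs and the sample manifest are hypothetical; the sample is not Puppeteer’s real package.json.

```javascript
// Return a copy of a package.json object with `browser.ws` removed,
// creating an empty `browser` object first if there is none.
function stripBrowserWs(pkg) {
  const browser = { ...(pkg.browser || {}) }
  delete browser.ws
  return { ...pkg, browser }
}

// Hypothetical manifest, for illustration only.
const sample = {
  name: 'some-package',
  browser: { ws: './shim.js', other: false },
}

console.log(stripBrowserWs(sample))
```

Unlike script.js, this version does not touch the file system or mutate its input, which makes it easy to check what the patch changes before running it against node_modules.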
Running the Script
Now, let’s add the script to our package.json file, where it will be executed as a postinstall script.
Open your package.json file and add the following code.
.......
"scripts": {
"dev": "nuxt",
"build": "nuxt build",
"start": "nuxt start",
"export": "nuxt export",
"serve": "nuxt serve",
"lint:js": "eslint --ext .js,.vue --ignore-path .gitignore .",
"lint": "yarn lint:js",
"test": "jest",
"postinstall": "node script"
},
....
You also need to add the following code to your package.json file.
.......
"browser": {
"fs": false,
"path": false,
"os": false,
"tls": false
}
.......
This sets fs, path, os, and tls to false, because these modules are only needed on the server side.
Now that the hard part is done, let’s configure webpack to deal with Puppeteer correctly.
Open your nuxt.config.js file and add the following inside the build object.
build: {
  extend(config, { isServer, isClient }) {
    config.externals = config.externals || {}
    if (!isServer) {
      config.node = {
        fs: 'empty',
      }
      if (Array.isArray(config.externals)) {
        config.externals.push({
          puppeteer: require('puppeteer'),
        })
      } else {
        config.externals.puppeteer = require('puppeteer')
      }
    }
    config.output.globalObject = 'this'
    return config
  },
},
This configuration requires puppeteer and adds it to the externals only for the client-side build, where it also sets fs to empty.
If you did everything right, Puppeteer should now be ready to use with Nuxt.js to scrape pages. If you’re stuck, you can grab the repository here.
Now we can move on to the easy part.
Web Scraping
Create a file called JobScrapper.js and paste in the following code.
In my project, I was given a list of websites to scrape, so as to avoid violating any scraping rules (just saying 🙂).
const puppeteer = require('puppeteer')
const jobUrl = '' // SITE URL HERE
let page
let browser
class Jobs {
  // We will add 3 methods here
  // Initializes and creates the puppeteer instance
  static async init() {}
  // Visits the page and retrieves the jobs
  static async resolver() {}
  // Returns the jobs together with their total count
  static async getJobs() {}
}
export default Jobs
Create the Init method
static async init() {
  browser = await puppeteer.launch({
    // headless: false,
    args: [
      '--no-sandbox',
      '--disable-setuid-sandbox',
      '--disable-dev-shm-usage',
      '--disable-accelerated-2d-canvas',
      '--no-first-run',
      '--no-zygote',
      '--single-process', // <- this one doesn't work on Windows
      '--disable-gpu',
    ],
  })
  page = await browser.newPage()
  // Load the page, then wait for the job cards; errors are swallowed on purpose
  await page.goto(jobUrl, { waitUntil: 'networkidle2' }).catch(() => {})
  await page.waitForSelector('.search-card').catch(() => {})
}
The init function launches Puppeteer with several flags, creates a new page with browser.newPage(), visits our URL with await page.goto(...), and waits for the job cards to appear with await page.waitForSelector(...).
Create the resolver method.
static async resolver() {
  await this.init()
  // This callback runs inside the page, not in Node.js
  const jobURLs = await page.evaluate(() => {
    const cards = document.querySelectorAll('.search-card')
    const cardArr = Array.from(cards)
    const cardLinks = []
    cardArr.forEach((card) => {
      const cardTitle = card.querySelector('.card-title-link')
      const cardDesc = card.querySelector('.card-description')
      const cardCompany = card.querySelector(
        'a[data-cy="search-result-company-name"]'
      )
      const cardDate = card.querySelector('.posted-date')
      // The title link is an anchor, so it exposes the parts of its URL
      const { text } = cardTitle
      const { host } = cardTitle
      const { protocol } = cardTitle
      const pathName = cardTitle.pathname
      const query = cardTitle.search
      const titleURL = protocol + '//' + host + pathName + query
      const company = cardCompany.textContent
      cardLinks.push({
        titleText: text,
        titleURLHost: host,
        titleURLPathname: pathName,
        titleURLSearchQuery: query,
        titleURL: titleURL,
        titleDesc: cardDesc.innerHTML,
        titleCompany: company,
        titleDate: cardDate.textContent,
      })
    })
    return cardLinks
  })
  return jobURLs
}
This method does all the work.
It selects all the jobs listed, converts them to a JavaScript array, and loops through each of them, retrieving the data needed.
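The way the resolver rebuilds each job URL from its parts can be tried outside the browser. This is a minimal sketch using the URL class built into Node.js, which exposes the same protocol, host, pathname, and search properties as an anchor element. The function name buildTitleURL and the sample link are hypothetical.

```javascript
// Rebuild a URL from its parts, the same way the resolver does.
function buildTitleURL({ protocol, host, pathname, search }) {
  return protocol + '//' + host + pathname + search
}

// Hypothetical job link, parsed with Node's built-in URL class.
const link = new URL('https://jobs.example.com/detail/123?q=vue#apply')

// The fragment (#apply) is not part of the rebuilt URL.
console.log(buildTitleURL(link))
```

Note that host includes the port when the URL has one, so links on non-default ports survive the round trip.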
Create a getJobs method
static async getJobs() {
const jobs = await this.resolver()
await browser.close()
const data = {}
data.jobs = jobs
data.total_jobs = jobs.length
return data
}
The method simply returns the jobs from the resolver method, along with their total count, and closes the browser.
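One thing to watch: if the scraping step throws, browser.close() is never reached and the Chrome process keeps running. A possible way to guard against that, sketched here with a hypothetical helper called withBrowser, is to close the browser in a finally block. This is not what my project does, only an option to consider.

```javascript
// Run `work(browser)` and close the browser afterwards, even if `work` throws.
// `launch` is any function that resolves to an object with a close() method,
// for example () => puppeteer.launch().
async function withBrowser(launch, work) {
  const browser = await launch()
  try {
    return await work(browser)
  } finally {
    await browser.close()
  }
}

// Usage sketch (assumes puppeteer is installed):
// const title = await withBrowser(
//   () => require('puppeteer').launch(),
//   async (browser) => {
//     const page = await browser.newPage()
//     await page.goto('https://example.com')
//     return page.title()
//   }
// )
```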
Creating Vuex action
Next, we are going to set up our Vuex store to retrieve the jobs each time we dispatch the getJobs action and store them in state.
Open the store/job.js file and add the following code.
import JobScrapper from '~/JobScrapper'
// Action
export const actions = {
  async getJobs({ commit }) {
    const data = await JobScrapper.getJobs()
    if (data.total_jobs) {
      commit('STORE_JOBS', data)
      return data.jobs
    }
    // Nothing was scraped; return an empty list instead of undefined
    return []
  },
}
// Mutation
export const mutations = {
  STORE_JOBS(state, payload) {
    state.jobs = payload.jobs
    state.total_jobs = payload.total_jobs
  },
}
// Getter
export const getters = {
  getJobs: (state) => () => {
    return state.jobs
  },
}
// State
export const state = () => ({
  jobs: [],
  total_jobs: 0,
})
Displaying Jobs
Open the pages/jobs.vue file and add the following code.
<template>
  <div class="row mt-5">
    <div class="card-group">
      <div class="row">
        <div class="col-md-8">
          <Job v-for="(job, i) in jobs" :key="i" :job="job" />
        </div>
      </div>
    </div>
  </div>
</template>
<script>
export default {
  async asyncData({ store }) {
    const getJobs = store.getters['job/getJobs']
    let jobs = getJobs()
    if (!jobs.length) {
      jobs = await store.dispatch('job/getJobs')
    }
    return { jobs }
  }
}
</script>
This is just one way to dispatch the action in each page that needs it, but it has to happen inside the asyncData() hook, because that hook runs on the server side when the page is first requested.
Another way, and the one I prefer, is to dispatch the action inside the nuxtServerInit action, which dispatches it on every new page load.
Let me show you how to do that.
Create an index.js file inside the store folder and add the following code.
export const actions = {
  async nuxtServerInit({ dispatch }) {
    try {
      await dispatch('job/getJobs')
    } catch (error) {
      // Scraping failed; render the page without jobs
    }
  },
}
This will scrape the jobs and save them to state. You can then use ...mapState or ...mapGetters to retrieve the jobs and display them in your component.
In my project, I use the nuxtServerInit approach together with ...mapState in any component where I want to display the jobs.
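Because nuxtServerInit runs on every new page load, the site gets scraped again for each request. If that becomes too slow, one option is to reuse the last result for a while. The helper below is a minimal sketch of that idea and is not part of my project; the name cacheFor, the ttlMs parameter, and the injectable now clock are all assumptions.

```javascript
// Wrap an async function so its result is reused for `ttlMs` milliseconds.
// `now` is injectable to make the behaviour easy to check.
function cacheFor(ttlMs, fn, now = Date.now) {
  let value
  let expiresAt = 0
  let pending = null
  return async function cached() {
    // Serve the stored result while it is still fresh
    if (now() < expiresAt) return value
    // Share one in-flight call between concurrent callers
    if (!pending) {
      pending = fn()
        .then((result) => {
          value = result
          expiresAt = now() + ttlMs
          return result
        })
        .finally(() => {
          pending = null
        })
    }
    return pending
  }
}

// Usage sketch: scrape at most once every 10 minutes.
// const getJobsCached = cacheFor(10 * 60 * 1000, () => JobScrapper.getJobs())
```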
Jobs Component
<template>
  <section>
    ........
    <div class="row mb-1 mt-5" v-if="jobs.length !== 0">
      <div
        v-for="job in jobs"
        :key="job.titleURL"
        class="col-md-6 col-sm-12 mb-4"
      >
        <!-- My Job component to display a specific job -->
        <Job :job="job" />
      </div>
    </div>
    <div v-else class="row mb-1 mt-5">No Jobs at this time</div>
    ..........
  </section>
</template>
<script>
import { mapState } from 'vuex'
export default {
  computed: {
    ...mapState({
      // Show only the first 10 jobs
      jobs: (state) => {
        return [...state.job.jobs].slice(0, 10)
      },
    }),
  },
}
</script>
<style></style>
That’s all.
Unless you want to see my Job component, in which case clone the repository here; everything can be found there.
P.S.
This method of web scraping with Nuxt.js using Puppeteer involves many workarounds and may be a little difficult for beginners to understand, but it works; I have used it in my projects.
I have a better approach to web scraping using Node/Express and Puppeteer, and I will be writing about it too.
Conclusion:
Congratulations on making it this far. By now you should have a solid understanding of web scraping using Puppeteer in Nuxt.js.
You should also have built and completed the JobScrapper project.
Keep coding 🙂