Web Scraping with Nuxtjs using Puppeteer

by Solomon Eseme

Updated Mon Apr 24 2023

This article demonstrates how to set up and configure Puppeteer so that it works properly with Nuxtjs, and how to use it to scrape a job listing website and display the results on your own site.

Because Puppeteer is a server-side Node package, it is difficult to get it working with a client-side library like Vue.js, and there are not many tutorials online that show how to set this up properly.

This article shows how I solved the problem in my client’s project using Nuxt.js and Puppeteer.

Web Scraping Fundamentals

Web scraping can sound strange at first, but it is a simple idea.

Web scraping is the technique of extracting data from websites and saving it in a desired format for further processing.

It automates what would otherwise be a manual process of copying information from websites and storing it electronically.
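
As a toy illustration of that definition, the sketch below pulls product names and prices out of a hard-coded HTML string and turns them into JSON. The markup, the class names, and the extractProducts helper are all made up for this example. A regular expression is used only to keep the example free of dependencies; a real scraper would use a proper HTML parser or a browser such as Puppeteer.

```javascript
// Hypothetical markup standing in for a downloaded product page.
const html = `
  <ul>
    <li class="product"><span class="name">Keyboard</span><span class="price">$49.99</span></li>
    <li class="product"><span class="name">Mouse</span><span class="price">$19.50</span></li>
  </ul>
`

// Extract { name, price } pairs from the markup above.
function extractProducts(markup) {
  const pattern = /<span class="name">(.*?)<\/span><span class="price">\$([\d.]+)<\/span>/g
  const products = []
  for (const match of markup.matchAll(pattern)) {
    products.push({ name: match[1], price: Number(match[2]) })
  }
  return products
}

// "Saving in a desired format": here, a JSON string ready to be written to disk.
const products = extractProducts(html)
const asJson = JSON.stringify(products, null, 2)
console.log(asJson)
```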

Usage

  1. Extracting product details such as prices, product names, and images from e-commerce websites.
  2. Gathering structured data from multiple websites for research.
  3. Automating the collection of data from different sources for analysis.
  4. Collecting data for training and testing machine learning models.

Methods of web scraping

  1. Web scraping software: This is the most popular approach, where pre-made software is deployed to do the scraping.
  2. Writing code: Here a developer writes custom scraping scripts for a specific website, based on what the site owner or client needs.

Introduction to Puppeteer

Puppeteer is a Node library that is used to scrape web pages, automate form submissions, and more.

It is Google’s official Node library for controlling Google Chrome instances from Node.js. It can also be configured to run in headless mode, so that Chrome runs in the background without a visible window.

Puppeteer can be used for many things; here are a few:

  1. Scraping web pages.
  2. Tracking page load performance.
  3. Automating form submissions.
  4. Generating page screenshots.
  5. Generating PDFs of web pages.
  6. Automated testing.
  7. General browser automation.
  8. Rendering a single-page app on the server side for preview.
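
Two of the use cases above, screenshots and PDFs, can be sketched in a few lines. This is a minimal sketch and not code from the project: capturePage is a hypothetical helper, the output file names are placeholders, and the Puppeteer module is passed in as an argument so the function makes no assumption about how it is installed.

```javascript
// Hypothetical helper: saves a screenshot and a PDF of `url`.
// `puppeteer` is the module returned by require('puppeteer').
async function capturePage(puppeteer, url) {
  const browser = await puppeteer.launch() // headless by default
  try {
    const page = await browser.newPage()
    await page.goto(url, { waitUntil: 'networkidle2' })
    await page.screenshot({ path: 'page.png', fullPage: true })
    await page.pdf({ path: 'page.pdf', format: 'A4' })
    return ['page.png', 'page.pdf']
  } finally {
    // Always release the Chrome process, even if a step above throws.
    await browser.close()
  }
}
```

From a Node script it could be called as capturePage(require('puppeteer'), 'https://example.com').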

Building a JobScrapper Application with Nuxtjs using Puppeteer

Creating a new Nuxt Project

Before we start developing our web scraper, we need to install and set up Nuxtjs. Following the steps in the official documentation will speed up the process.

Run the following command to set up the project, accepting the default options.

yarn create nuxt-app <project-name>

After installation, let’s start by creating the different components, stores, and pages that will be needed in this project.

Create a component called Jobs to display a list of all the scraped jobs.

touch components/Jobs.vue

Next, create a new job store in the store folder to manage our jobs state.

touch store/job.js

Lastly, let’s create a jobs page inside the pages folder for navigation, in case we need it.

touch pages/jobs.vue

Of course, this is a minimal setup; your own project may be more complex and contain many more components, pages, and stores to manage different states.

Installing dependencies

Next, install the dependencies needed to scrape pages with Nuxtjs and Puppeteer.

npm i puppeteer net tls

This command installs the Puppeteer library and the other supporting libraries.

Configuring Puppeteer

This is the difficult part. I ran into several issues configuring Puppeteer to work with Nuxtjs, because Nuxtjs is both a client-side and a server-side framework.

It is hard to know where to place Puppeteer or how to call it, since Puppeteer is a server-side Node library and only works on the server side of Nuxtjs.

I will go ahead and explain how I got it working in my project.

First, let’s create a new script.js file in the root directory and paste in the following code.

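const { writeFileSync } = require('fs')
const path = require('path')

// Locate node_modules/puppeteer/package.json next to this script
const pkgJsonPath = path.join(require.main.paths[0], 'puppeteer', 'package.json')
const json = require(pkgJsonPath)

// Make sure a "browser" object exists before touching it
if (!Object.prototype.hasOwnProperty.call(json, 'browser')) {
  json.browser = {}
}

// Remove the "ws" entry from the "browser" object
delete json.browser.ws

// Write the modified package.json back to disk
writeFileSync(pkgJsonPath, JSON.stringify(json, null, 2))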

If it is not obvious what the script does, here is an explanation.

It opens the node_modules/puppeteer/package.json file and deletes one particular entry.

Before deleting it, the script checks whether package.json has a browser object and creates an empty one if not. It then deletes the ws property of the browser object and saves the file.

The script is going to run each time we run npm install.

The ws entry points Puppeteer’s web socket at a file that does not exist in our project.

By deleting that entry each time we run npm install, Puppeteer will default to using the web socket package that is in our node_modules folder.
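
To make the effect concrete, here is the same transformation applied to an in-memory object rather than the real file. The sample browser entries are invented for illustration and are not copied from any particular Puppeteer release, and removeWsOverride is a hypothetical name.

```javascript
// Hypothetical helper mirroring what script.js does to puppeteer's package.json.
function removeWsOverride(pkg) {
  // Work on a copy so the input object is left untouched.
  const copy = { ...pkg, browser: { ...(pkg.browser || {}) } }
  delete copy.browser.ws
  return copy
}

// Invented sample data, for illustration only.
const before = {
  name: 'puppeteer',
  browser: { ws: './some/browser/WebSocket.js', other: false },
}

const after = removeWsOverride(before)
console.log(after.browser) // { other: false }

// A manifest without a "browser" field simply gains an empty one.
console.log(removeWsOverride({ name: 'puppeteer' }).browser) // {}
```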

Running the Script

Now, let’s add the script to our package.json file where it will be executed as a postinstall script.

Open your package.json file and add the following code.

.......
"scripts": {
    "dev": "nuxt",
    "build": "nuxt build",
    "start": "nuxt start",
    "export": "nuxt export",
    "serve": "nuxt serve",
    "lint:js": "eslint --ext .js,.vue --ignore-path .gitignore .",
    "lint": "yarn lint:js",
    "test": "jest",
    "postinstall": "node script"
  },
....

You also need to add the following code into your package.json file.

.......
"browser": {
    "fs": false,
    "path": false,
    "os": false,
    "tls": false
  }
.......

This sets fs, path, os, and tls to false because these modules are only needed on the server side.

Now that the hard part is out of the way, let’s configure Webpack to deal with Puppeteer correctly.

Open your nuxt.config.js file and add the following build configuration.

build: {
    extend(config, { isServer, isClient }) {
      config.externals = config.externals || {}
      if (!isServer) {
        config.node = {
          fs: 'empty',
        }
        if (Array.isArray(config.externals)) {
          config.externals.push({
            puppeteer: require('puppeteer'),
          })
        } else {
          config.externals.puppeteer = require('puppeteer')
        }
      }
      config.output.globalObject = 'this'
      return config
    },
  },

This configuration requires puppeteer and adds it to Webpack’s externals only when Nuxtjs is building for the client side, and it also sets fs to empty there.

If you did everything right, Puppeteer should now be ready to use with Nuxtjs to scrape pages. If you’re stuck, you can grab the repository here.

Now we can move on to the easy part.
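
The two branches in that function exist because Webpack’s externals option may be either an array or a plain object. The sketch below reproduces just that branching on mock config objects so the behaviour can be checked outside of Nuxt. The addExternal helper and the 'puppeteer-stub' value are stand-ins invented for this example.

```javascript
// Hypothetical helper: registers an external on a webpack-style config,
// handling both shapes that `externals` can take.
function addExternal(config, isServer, name, value) {
  config.externals = config.externals || {}
  if (isServer) return config // the server bundle is left untouched
  if (Array.isArray(config.externals)) {
    config.externals.push({ [name]: value })
  } else {
    config.externals[name] = value
  }
  return config
}

const arrayStyle = addExternal({ externals: [] }, false, 'puppeteer', 'puppeteer-stub')
const objectStyle = addExternal({}, false, 'puppeteer', 'puppeteer-stub')
const serverSide = addExternal({}, true, 'puppeteer', 'puppeteer-stub')

console.log(arrayStyle.externals)  // [ { puppeteer: 'puppeteer-stub' } ]
console.log(objectStyle.externals) // { puppeteer: 'puppeteer-stub' }
console.log(serverSide.externals)  // {}
```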

Web Scraping

Create a file called JobScrapper.js and paste in the following code.

In my project, I was given a list of websites I should scrape, to avoid violating any scraping rules (just saying 🙂).

const puppeteer = require('puppeteer')
const jobUrl = '' // SITE URL HERE
let page
let browser
let cardArr = []
class Jobs {
  // We will add 3 methods here

  // Launches Puppeteer and opens the job listing page
  static async init() {}

  // Visits the page and retrieves the jobs
  static async resolver() {}

  // Returns the jobs and their total count
  static async getJobs() {}
}
export default Jobs

Create the Init method

static async init() {
    browser = await puppeteer.launch({
      // headless: false,
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage',
        '--disable-accelerated-2d-canvas',
        '--no-first-run',
        '--no-zygote',
        '--single-process', // <- this one doesn't work on Windows
        '--disable-gpu',
      ],
    })

    page = await browser.newPage()
    // Continue as soon as either the navigation settles or the job cards appear.
    // The promises are passed to Promise.race un-awaited so they run concurrently.
    await Promise.race([
      page.goto(jobUrl, { waitUntil: 'networkidle2' }).catch(() => {}),
      page.waitForSelector('.search-card').catch(() => {}),
    ])
  }

The init function launches Puppeteer with several configuration flags, creates a new page with browser.newPage(), visits our URL with page.goto(...), and waits for the job cards to appear with page.waitForSelector(...).
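
Promise.race settles as soon as the first of the promises it receives settles. The sketch below shows that behaviour with timers standing in for the two Puppeteer calls; the delays and labels are arbitrary. Note that the promises are handed to Promise.race directly: awaiting each one inside the array would run them one after another and defeat the race.

```javascript
// Resolves with `label` after `ms` milliseconds.
const delay = (ms, label) =>
  new Promise((resolve) => setTimeout(() => resolve(label), ms))

async function raceDemo() {
  // Stand-ins for page.goto(...) and page.waitForSelector(...).
  const navigation = delay(50, 'navigation finished')
  const selector = delay(10, 'selector found')

  // The promises are passed un-awaited, so the faster one wins.
  return Promise.race([navigation, selector])
}

raceDemo().then((winner) => console.log(winner)) // selector found
```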

Create a resolver method

static async resolver() {
    await this.init()
    // The callback below runs inside the browser page, not in Node
    const jobURLs = await page.evaluate(() => {
      const cards = document.querySelectorAll('.search-card')
      const cardArr = Array.from(cards)
      const cardLinks = []
      cardArr.forEach((card) => {
        const cardTitle = card.querySelector('.card-title-link')
        const cardDesc = card.querySelector('.card-description')
        const cardCompany = card.querySelector(
          'a[data-cy="search-result-company-name"]'
        )
        const cardDate = card.querySelector('.posted-date')
        // The title is a link, so it exposes the parts of its URL
        const { text } = cardTitle
        const { host } = cardTitle
        const { protocol } = cardTitle
        const pathName = cardTitle.pathname
        const query = cardTitle.search
        const titleURL = protocol + '//' + host + pathName + query
        const company = cardCompany.textContent
        cardLinks.push({
          titleText: text,
          titleURLHost: host,
          titleURLPathname: pathName,
          titleURLSearchQuery: query,
          titleURL: titleURL,
          titleDesc: cardDesc.innerHTML,
          titleCompany: company,
          titleDate: cardDate.textContent,
        })
      })
      return cardLinks
    })
    return jobURLs
  }

This method does all the work.

It selects all the listed jobs, converts them to a JavaScript array, and loops through each of them, retrieving the data we need.
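
An anchor element exposes the same protocol, host, pathname, and search properties as the standard URL class, so the way titleURL is assembled can be tried in plain Node. The address below is a placeholder. Note that rebuilding the URL from these parts drops any #fragment.

```javascript
// Placeholder address standing in for a job card's link.
const link = new URL('https://example.com/jobs/123?ref=search#apply')

const { protocol, host, pathname, search } = link
const titleURL = protocol + '//' + host + pathname + search

console.log(titleURL) // https://example.com/jobs/123?ref=search
```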

Create a getJobs method

static async getJobs() {
    const jobs = await this.resolver()
    await browser.close()
    const data = {}
    data.jobs = jobs
    data.total_jobs = jobs.length
    return data
  }

The method gets the job array from the resolver method, closes the browser, and returns the jobs together with their total count.
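
One thing to keep in mind is that if the scraping step throws, browser.close() is never reached and a Chrome process is left running. A try/finally wrapper is one way to guard against that. The sketch below is an assumption about how this could be structured, not part of the original project: withBrowser and getJobsSafely are hypothetical helpers, and the launcher is passed in so the example runs without Puppeteer.

```javascript
// Hypothetical helper: runs `work(browser)` and always closes the browser.
async function withBrowser(launch, work) {
  const browser = await launch()
  try {
    return await work(browser)
  } finally {
    await browser.close()
  }
}

// Shapes the scraped list the same way getJobs() does.
async function getJobsSafely(launch, scrape) {
  const jobs = await withBrowser(launch, scrape)
  return { jobs, total_jobs: jobs.length }
}
```

With Puppeteer this could be called as getJobsSafely(() => puppeteer.launch(), (browser) => scrapeWith(browser)), where scrapeWith stands for your own scraping function.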

Creating Vuex action

Next, we are going to set up our Vuex store to retrieve the jobs each time we dispatch the getJobs action and store them in state.

Open the store/job.js file and add the following code.

import JobScrapper from '~/JobScrapper'

// State
export const state = () => ({
  jobs: [],
  total_jobs: 0,
})

// Getters
export const getters = {
  getJobs: (state) => () => {
    return state.jobs
  },
}

// Mutations
export const mutations = {
  STORE_JOBS(state, payload) {
    state.jobs = payload.jobs
    state.total_jobs = payload.total_jobs
  },
}

// Actions
export const actions = {
  async getJobs({ commit }) {
    const data = await JobScrapper.getJobs()
    if (data.total_jobs) {
      commit('STORE_JOBS', data)
      return data.jobs
    }
    // Nothing was scraped: return an empty list rather than undefined
    return []
  },
}
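
Because mutations and getters are plain functions, their behaviour can be checked without Vuex. The sketch below re-declares the STORE_JOBS mutation and the getJobs getter locally and runs them against a hand-made state object; the sample job is invented.

```javascript
// Same shapes as in store/job.js, declared locally so this runs on its own.
const state = () => ({ jobs: [], total_jobs: 0 })

const mutations = {
  STORE_JOBS(state, payload) {
    state.jobs = payload.jobs
    state.total_jobs = payload.total_jobs
  },
}

const getters = {
  // Returns a function, so it is called as getters.getJobs(state)()
  getJobs: (state) => () => state.jobs,
}

const current = state()
mutations.STORE_JOBS(current, {
  jobs: [{ titleText: 'Backend Engineer' }], // invented sample
  total_jobs: 1,
})

console.log(getters.getJobs(current)().length) // 1
console.log(current.total_jobs) // 1
```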

Displaying Jobs

Open the pages/jobs.vue file and add the following code.

<template>
  <div class="row mt-5">
    <div class="card-group">
      <div class="row">
        <div class="col-md-8">
          <Job v-for="(job, i) in jobs" :key="i" :job="job" />
        </div>
      </div>
    </div>
  </div>
</template>

<script>
export default {
  async asyncData({ store }) {
    const getJobs = store.getters['job/getJobs']
    let jobs = getJobs()
    if (!jobs.length) {
      jobs = await store.dispatch('job/getJobs')
    }
    return { jobs }
  }
}
</script>

This is just one way to dispatch the action from any page that needs it. It has to happen inside the asyncData() hook, because that hook runs on the server when the page is first requested, and Puppeteer only works on the server.

Another way, and the one I prefer, is to dispatch the action inside the nuxtServerInit action, which runs on the server for every new page load.

Let me show you how to do that.

Create an index.js file inside the store folder and add the following code.

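export const actions = {
  async nuxtServerInit({ dispatch }) {
    try {
      await dispatch('job/getJobs')
    } catch (error) {
      // Log the failure instead of silently swallowing it
      console.error(error)
    }
  },
}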

This will scrape the jobs and save them to state. You can then use ...mapState or ...mapGetters to retrieve the jobs and display them in your component.

In my project, I use the nuxtServerInit approach and ...mapState in any component where I want to display the jobs.

Jobs Component

<template>
  <section>
    ........
        <div class="row mb-1 mt-5" v-if="jobs.length !== 0">
          <div
            v-for="job in jobs"
            :key="job.titleURL"
            class="col-md-6 col-sm-12 mb-4"
          >
            <!-- My Job component to display a specific job -->
            <Job :job="job" />
          </div>
        </div>
        <div v-else class="row mb-1 mt-5">No Jobs at this time</div>
    ..........

  </section>
</template>
<script>
import { mapState } from 'vuex'
export default {
  computed: {
    ...mapState({
      jobs: (state) => {
        return [...state.job.jobs].slice(0, 10)
      },
    }),
  },
}
</script>
<style></style>

That’s all.

If you want to see my Job component, clone the repository here; everything can be found there.

P.S.
This method of web scraping with Nuxtjs using Puppeteer involves several workarounds and may be a little difficult for beginners to understand, but it works: I have used it in my own projects.

I have a better approach to web scraping using Node/Express and Puppeteer, and I will be writing about it too.

Consider joining our newsletter to never miss a thing when it drops.

Conclusion

Congratulations on making it this far. By now you should have a good understanding of web scraping using Puppeteer in Nuxt.js.

You should also have built and completed the JobScrapper project.

Keep coding 🙂
