
Scraping For Images Using Puppeteer

Kenneth Jimmy

10/16/2020

Once upon a time, I was searching for a partner...then I stumbled upon my beloved and immediately fell in love. Then I whispered to myself, my eyes glistening with tears of joy: "My days of loneliness are finally over!" 😥

OK, that was the story of my life...well, not in that way (although I am single 😅), but in a much, much different way. I was actually looking for a way to automate the boring stuff I had to do on several web pages when I stumbled upon Puppeteer. No, I am not talking about that dude who manipulates dolls, but a piece of software for browser automation!

But What Exactly Is Puppeteer?

According to the documentation, Puppeteer is a Node library that provides a high-level API to control Chromium or Chrome over the DevTools Protocol. Let me break that down in simple terms.

Puppeteer is a Node.js library (or module) that lets you automatically perform actions on the web, like opening pages, navigating around a website, evaluating JavaScript and so much more. It uses Node.js and Chrome to work its wonders.

More recent versions of Puppeteer can also drive Firefox instead of Chrome.

The relationship between the browser (Chrome) and Node.js is actually encoded in Puppeteer's logo, as you can see below:

Img

What I love about Puppeteer is that it makes web scraping a breeze. In fact, in this article, I am going to show you how to scrape a comic website for image src URL values using Puppeteer. Cool, right? 😎

There's so much more to say about Puppeteer. If you are interested in learning more, you can visit the official documentation site 👉🏽 here.

Enough of the introduction. Now let's start scraping some data!

The Objective

How do you get the `src` value of an image on a website? Simple: place your mouse pointer on the image, right-click, and select 'Inspect'. DevTools opens, and there's your src value:

Img

Not much work for one image, right? But what if you had to repeat the same steps for, say, a hundred images? That could easily become a nightmare. Trust me, I know.

This is where a Puppeteer script comes in: it does that boring task for us automatically, saving us a ton of time and stress.

We will be fetching our data from a website I usually visit to read comic books. (By the way, I love reading comic books. It’s a childhood thing…don’t worry 😉)

The Website

The first thing you want to do is decide which website you want to scrape and understand how its elements are structured. The former shouldn't be a problem: in our case, it's a comic website, and we are trying to scrape some image src values. Now, how do we find out how the images are structured? Let me guide you:

Go to the comic website. You will be greeted with this interface on the homepage:

Img

Are those the images we want? Well, maybe for you. But we are looking for a set of images of any issue of any comic title. So we have to navigate to another page.

Let’s say we are targeting the ‘Batman: The Adventures Continue (2020)’ title. We have to click on that title from the ‘Popular Titles’ section on the home page.

If you can’t find this title on the ‘Popular Titles’ section on the website home page at the time of your reading this article, please, search for the title on the website or use another title of your choice and you will be able to follow along just fine. The pattern remains the same.

Now we are on the overview page. On this page, we have the list of all the issues available for that comic, ‘Batman: The Adventures Continue (2020)’ in our case.

Img

Remember, the goal is to get to the images (or pages) of the comic title issue.

To get to the "book pages" page, we must select an issue. So, let's select 'Batman The Adventures Continue 2020 Issue #1'.

Now we are here! 😎

You'll quickly notice the common pattern for navigating through the book pages on this website.

Img

I suppose you have been paying attention to the route we took to get to this page? We will instruct Puppeteer to follow the same route, and we will also tell it how to find the images we want.

Now, let's dive into Puppeteer!

The Robot

Alright, first things first: make sure you have Node.js and npm installed, since Puppeteer relies heavily on both. Learn how to download the Node.js installer and npm here.

The next thing to do is to initialize a new Node.js project. To that end, create a new folder and name it whatever you like. In my case, I'll name it Puppeteer-Tutorial. Open your terminal and cd into your newly created directory.

Pssst: you can find this article's code on Github.

Now run:

npm init

On running that command, you will be prompted to fill in some essential information about your new project. Go ahead and do so, skipping whatever you really don't need (I'll leave you to decide). In the end, you should see a familiar outcome like this:

Img

It simply means that your project has been initialized with a package.json file containing all the information you passed to it earlier.

Now is the time for you to install Puppeteer. But before installing Puppeteer, you should open your project with your code editor (preferably VS Code) and create an index.js file on the same path level as your package.json file.

Then run:

npm install puppeteer

This will install Puppeteer from the npm registry. You'll also notice that it downloads a recent version of Chromium (~170MB Mac, ~282MB Linux, ~280MB Win) that is guaranteed to work with the API.

Img

In the end, you should have 3 files, namely index.js, package-lock.json and package.json, plus a node_modules folder. When you open your package.json file, you should see a newly created dependencies object with a "puppeteer": "<latest version>" pair inside.

Img

Now rub your palms together, for we are about to test Puppeteer for the first time! Isn't that exciting? 😃

To run a test, we will use the typical example from the documentation.

Create a new test.js file and paste the following code inside it:

const puppeteer = require("puppeteer");

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto("https://kenjimmy.me");
  await page.screenshot({ path: "example.png" });

  await browser.close();
})();

Now, open your integrated terminal in VS Code by pressing ctrl (cmd) + `. Then, run:

node test.js

node is a command that is executable because you have Node.js installed on your machine. Thus, node test.js executes the JavaScript code (particularly Puppeteer code in this case) inside our test.js file on the Node.js runtime.

I won't explain what each line of code is doing in our test.js file because I just want to test if Puppeteer has been correctly installed. We're saving the explanation for the actual script that will soon follow.

On running that command, you should see a new example.png file containing a screenshot of my website's home page at 800×600px.

Img
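Curious where the 800×600px comes from? That is Puppeteer's default viewport size. If you want a different size, page.setViewport() changes it before the screenshot is taken. Here is a minimal sketch; screenshotAt is a hypothetical helper name of mine (not part of Puppeteer), and the 1280×720 size in the usage comment is just an example:

```javascript
// Hypothetical helper: `page` is a Puppeteer Page object.
// Sets the viewport size, opens `url`, and saves a screenshot to `path`.
async function screenshotAt(page, url, path, width, height) {
  await page.setViewport({ width, height });
  await page.goto(url);
  await page.screenshot({ path });
}

// Usage inside test.js, in place of the goto/screenshot lines:
// await screenshotAt(page, "https://kenjimmy.me", "example.png", 1280, 720);
```

The helper takes the page as a parameter, so it can be dropped into the test script without changing how the browser is launched.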

Now we are good to go.

Scraping For Image Src Values Using Puppeteer

Inside your index.js file, let's require the fs (file system) module. This will allow you to write data fetched from the comic website into a file.

const fs = require("fs");

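Since you are writing some data into a file, you need a file to write to. Strictly speaking, fs.writeFileSync() creates the file if it doesn't exist yet, so this step is optional, but go ahead and create a data.json file so you know exactly where the output will land.

Besides the file system module, we definitely need to import (require) Puppeteer. The following code does just that: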
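If you want to see for yourself how fs.writeFileSync() treats a file that is not there yet, here is a small standalone sketch. It works in a temporary folder so it won't touch your project, and the folder prefix and sample values are made up for illustration:

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// Work in a fresh temporary directory so nothing in the project is touched.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "puppeteer-tutorial-"));
const file = path.join(dir, "data.json");

console.log(fs.existsSync(file)); // false: the file does not exist yet

// writeFileSync creates the file, or overwrites it if it already exists.
fs.writeFileSync(file, JSON.stringify(["a.jpg", "b.jpg"]));

// Read the file back and parse it to confirm the round trip.
const saved = JSON.parse(fs.readFileSync(file, "utf8"));
console.log(saved);
```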

const puppeteer = require("puppeteer");

In JavaScript, there's a concept known as an Immediately Invoked Function Expression (IIFE). In simple terms, an IIFE is a JavaScript function that runs as soon as it is defined. We are going to use an IIFE as a wrapper around our script, which is a common way of writing Puppeteer scripts.

This is how we write an IIFE:

(async () => {
 // your code goes here...
})();

Note that our anonymous function is preceded by the async keyword. async/await is syntax added to JavaScript in ES2017 (ES8) to work with promises in a more comfortable fashion. Inside an async function we can use await to pause until a promise settles, and we are going to do that in a jiffy.

Now, inside our IIFE, let's write a try...catch statement. This will help us handle any errors that occur.
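To get a feel for async and await before Puppeteer enters the picture, here is a tiny sketch with no browser involved. The delay function is a made-up stand-in for a slow browser call, and the strings are just labels:

```javascript
// Resolves with `value` after `ms` milliseconds (a stand-in for a slow browser call).
function delay(ms, value) {
  return new Promise((resolve) => setTimeout(() => resolve(value), ms));
}

// An async function always returns a promise; `await` pauses inside it
// until the awaited promise settles, so the steps run one after the other.
async function loadTwoThings() {
  const first = await delay(10, "page opened");
  const second = await delay(10, "page loaded");
  return [first, second];
}

// The IIFE form: define the async function and call it immediately.
(async () => {
  const steps = await loadTwoThings();
  console.log(steps.join(" -> "));
})();
```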

(async () => {
  try {
    // ...
  } catch (error) {
    console.log(error);
  }
})();

All we are doing in the catch block is logging the error to the console if one is thrown.

Now, let's write our script inside the try block.

First, we launch Chromium in headless mode. This simply means you won't see your Chromium browser in action while it is manipulating the website. Then, open the web page we want to manipulate:

// ...

// Initialize Puppeteer
const browser = await puppeteer.launch();
const page = await browser.newPage();

// Specify comic issue page url
await page.goto(
  "https://comicpunch.net/readme/index.php?title=batman-the-adventures-continue-2020&chapter=1"
);
console.log("page has been loaded!");

// ...
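While you are still developing the script, it can help to actually watch the browser work. puppeteer.launch() accepts an options object for that: headless: false shows the browser window, and slowMo slows every operation down. Below is a small sketch; launchOptions and its debug flag are hypothetical names of mine, and 250ms is an arbitrary delay:

```javascript
// Returns an options object for puppeteer.launch().
// `debug` is a hypothetical flag: true while developing, false otherwise.
function launchOptions(debug) {
  // headless: false opens a visible browser window;
  // slowMo delays each Puppeteer operation by the given number of milliseconds.
  return debug ? { headless: false, slowMo: 250 } : {};
}

// Usage inside the script:
// const browser = await puppeteer.launch(launchOptions(true));
```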

Where did we get the URL we passed in page.goto()?

Remember the comic title page? We just have to copy one of the issue links:

Img
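If you would rather not copy a link for every issue, you could build the URL in code. This sketch assumes that every issue on the site follows the same pattern as the one link we copied (a title and a chapter query parameter), which I have not verified for other titles. buildIssueUrl is a hypothetical helper name:

```javascript
// Assumption: every issue URL follows the pattern of the one link copied above.
const READER_URL = "https://comicpunch.net/readme/index.php";

// Builds the reader URL for a given title slug and chapter number.
function buildIssueUrl(titleSlug, chapter) {
  const url = new URL(READER_URL);
  url.searchParams.set("title", titleSlug);
  url.searchParams.set("chapter", String(chapter));
  return url.toString();
}

console.log(buildIssueUrl("batman-the-adventures-continue-2020", 1));
// https://comicpunch.net/readme/index.php?title=batman-the-adventures-continue-2020&chapter=1
```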

What we have just told Puppeteer to do is launch the Chromium browser with the .launch() method, open a new page (tab) with .newPage(), and go to the URL we specified with .goto().

What should it do next? It should click on the "Full Chapter" button on the page:

Img

Clicking on the "Full Chapter" button will render all the images we need on the page. And that's exactly what we want.

We need a way to let Puppeteer know where the "Full Chapter" button is and click on it. Thus, we need to use .click(). This method expects a selector.

We can get the selector of any element by inspecting it on the browser. In this case, we will be targeting "button.button4", where ".button4" is the class name given to the "Full Chapter" button.

// ...

// Wait for 1s, then click on the 'Full Chapter' button and do the rest
await page.waitFor(1000);
await page.click("button.button4");
console.log("'Full Chapter' button has been clicked!");

// ...

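Oh boy! There's a line before .click(). Yes, we use .waitFor() and pass in 1000 (milliseconds) to give the page a second to settle before we click; without that pause, the rest of the code may not work as intended. If you want to experiment, go ahead and remove that line and see what happens. 😁

By the way, .waitFor() has been deprecated, and newer Puppeteer releases no longer include it. If it is missing in your version, page.waitForSelector("button.button4") does a similar job by waiting until the button itself is on the page.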
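Since a fixed one-second pause is a bit of a guess anyway, one alternative is to wait for the button itself with page.waitForSelector() and click it once it shows up. This is only a sketch of that idea, not what the script in this article does; clickWhenReady is a hypothetical helper name and the 5000ms timeout is an arbitrary choice:

```javascript
// Hypothetical helper: waits until `selector` matches an element, then clicks it.
// `page` is a Puppeteer Page; waitForSelector rejects if nothing matches
// within `timeout` milliseconds.
async function clickWhenReady(page, selector, timeout = 5000) {
  await page.waitForSelector(selector, { timeout });
  await page.click(selector);
}

// Usage inside the script, in place of the waitFor/click lines:
// await clickWhenReady(page, "button.button4");
```

The upside is that the script moves on as soon as the button exists instead of always sleeping for a full second.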

Now, let's tell Puppeteer to carry out the main task: convert the NodeList of images returned from the DOM into an array, map over each item to get its src attribute value, and store the result in the srcs variable. That array is returned and becomes the value of the issueSrcs variable.

The .evaluate() method accepts a function to be evaluated in the page context. So this is where we write our main code to get all image URLs with the ".comicpic" class:

// ...

const issueSrcs = await page.evaluate(() => {
  const srcs = Array.from(
    document.querySelectorAll(".comicpic")
  ).map((image) => image.getAttribute("src"));
  return srcs;
});
console.log("Page has been evaluated!");

// ...
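As an aside, Puppeteer also has page.$$eval(), which runs document.querySelectorAll() in the page for you and hands the matches to your callback as a plain array, so the Array.from() step is not needed. Here is a sketch of the same extraction written that way; getImageSrcs is a hypothetical helper name, and the article's script does not depend on it:

```javascript
// Hypothetical helper: returns the src attribute of every element matching `selector`.
// page.$$eval runs querySelectorAll inside the page and passes the matched
// elements, as an array, to the callback.
async function getImageSrcs(page, selector) {
  return page.$$eval(selector, (images) =>
    images.map((image) => image.getAttribute("src"))
  );
}

// Usage inside the script, in place of the page.evaluate() call:
// const issueSrcs = await getImageSrcs(page, ".comicpic");
```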

Now, we have all our data stored in our "issueSrcs" variable. But we are not done yet. We need to persist the data into our data.json file, right? This is where we use the fs.writeFileSync() method:

// ...

// Persist data into data.json file
fs.writeFileSync("./data.json", JSON.stringify(issueSrcs));
console.log("File is created!");

// ...
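One caveat about the saved values: getAttribute("src") returns the attribute exactly as it is written in the HTML, so a value may be a relative path rather than a full URL. If that turns out to be the case on the site you scrape, you could resolve each value against the page URL before saving. A sketch, where toAbsoluteUrls is a hypothetical helper and the sample paths are invented for illustration:

```javascript
// Hypothetical helper: resolves each src against the URL of the page it came from.
// Values that are already absolute are kept; missing (null or empty) values are dropped.
function toAbsoluteUrls(srcs, pageUrl) {
  return srcs
    .filter((src) => typeof src === "string" && src.length > 0)
    .map((src) => new URL(src, pageUrl).href);
}

// Invented sample values, just to show the behavior.
const example = toAbsoluteUrls(
  ["/img/page-1.jpg", "https://cdn.example.com/page-2.jpg", null],
  "https://comicpunch.net/readme/index.php?title=x&chapter=1"
);
console.log(example);
// [ 'https://comicpunch.net/img/page-1.jpg', 'https://cdn.example.com/page-2.jpg' ]
```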

Finally, we close the browser using the .close() method. Otherwise? Well...😆

// ...

// End Puppeteer
await browser.close();

// ...

In the end, your code should look like this:

const fs = require("fs");
const puppeteer = require("puppeteer");

(async () => {
  try {
    // Initialize Puppeteer
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Specify comic issue page url
    await page.goto(
      "https://comicpunch.net/readme/index.php?title=batman-the-adventures-continue-2020&chapter=1"
    );
    console.log("page has been loaded!");

    // Wait for 1s, then click on the 'Full Chapter' button and do the rest
    await page.waitFor(1000);
    await page.click("button.button4");
    console.log("'Full Chapter' button has been clicked!");

    // Evaluate/Compute the main task:
    // Convert the NodeList of images returned from the DOM into an array, map each
    // item to its src attribute value, and store the result in 'srcs', which is
    // returned and becomes the value of 'issueSrcs'.
    const issueSrcs = await page.evaluate(() => {
      const srcs = Array.from(
        document.querySelectorAll(".comicpic")
      ).map((image) => image.getAttribute("src"));
      return srcs;
    });
    console.log("Page has been evaluated!");

    // Persist data into data.json file
    fs.writeFileSync("./data.json", JSON.stringify(issueSrcs));
    console.log("File is created!");

    // End Puppeteer
    await browser.close();
  } catch (error) {
    console.log(error);
  }
})();
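One thing to be aware of in the script above: browser.close() sits inside the try block, so if an earlier step throws, the browser is never closed and the Node process may keep running. A possible way around that is a try...finally wrapper. This is a sketch of the idea, not part of the original script; withBrowser is a hypothetical helper name:

```javascript
// Hypothetical helper: runs `task(browser)` and always closes the browser,
// whether the task succeeds or throws.
// `launch` is any function that returns a promise of a browser,
// for example () => puppeteer.launch().
async function withBrowser(launch, task) {
  const browser = await launch();
  try {
    return await task(browser);
  } finally {
    await browser.close();
  }
}

// Usage inside the script (assumes `puppeteer` has been required):
// await withBrowser(() => puppeteer.launch(), async (browser) => {
//   const page = await browser.newPage();
//   // ...the scraping steps...
// });
```

Errors still reach the surrounding catch block; the finally part only makes sure the cleanup happens first.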

Cool! We are ready to try out our script. I am curious. What about you? 😃

Now, in your terminal, run:

node index.js

And...yippee! We did it!! Yes, we successfully fetched all the comic image URLs with the 'comicpic' class on that page and stored them in JSON format inside our data.json file. Great job. Well done! You just improved your IQ by 10. (Obviously, that was a joke.)

You should see something like this when you open your data.json file:

Img

Conclusion

Puppeteer is a great tool for web automation and even web testing. Most things that you can do manually in the browser can be done using Puppeteer!

We have experimented with it in this article by scraping some image src values from a web page, but here are a few more examples you can try out yourself:

  • Crawl a SPA (Single-Page Application) and generate pre-rendered content (i.e. "SSR" (Server-Side Rendering)).
  • Automate form submission, UI testing, keyboard input, etc.
  • Create an up-to-date, automated testing environment. Run your tests directly in the latest version of Chrome using the latest JavaScript and browser features.
  • Capture a timeline trace of your site to help diagnose performance issues.
  • Test Chrome Extensions.

Make the most of this amazing tool! (and why not 😎)