A Guide to Web Scraping Using JavaScript: Techniques, Tools, and Best Practices in 2024

Web scraping is the automated extraction of data from websites. It is employed for many purposes, including price tracking, market research, and other analytics. Web scraping is a powerful tool, but it needs to be handled with care: before scraping a site, always check its policies to understand what is and is not allowed.

Web scraping using JavaScript – When is it needed?

JavaScript, especially when run with Node.js, has emerged as one of the most widely adopted choices for web scraping thanks to its asynchronous I/O model, its rich ecosystem of libraries, and its versatility. Node.js brings JavaScript to the server, which makes it well suited for scraping work.

Some references substantiate this stance:

  • The State of JavaScript 2021: This annual survey highlights the popularity of JavaScript and its frameworks/libraries, including those used for web scraping.

Prerequisites

To get started with web scraping in JavaScript, you will need:

  • A basic knowledge of JavaScript
  • Node.js installed on your machine
  • Familiarity with npm (Node Package Manager)

Setting Up the Environment

1. Installing Node.js and npm

Installation is straightforward: download Node.js from the official website, nodejs.org. npm is bundled with Node.js, so you do not need to install it separately.
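
You can verify the installation from your terminal:

node -v
npm -v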

2. Initialize a New Project

Open up your terminal and make a new project directory:

mkdir web-scraping-js  
cd web-scraping-js  
npm init -y
3. Packages Required for the Project

We will use Axios to make HTTP requests, Cheerio to parse and traverse HTML, and Puppeteer to handle pages that render their content dynamically with JavaScript.

npm install axios cheerio puppeteer

Simple Web Scraping Using Axios And Cheerio

Making an HTTP Request with Axios

Let us start by fetching a web page with Axios. For this example, we will scrape the sample website example.com.

const axios = require('axios');  
  
axios.get('https://example.com')  
  .then(response => {  
    console.log(response.data);  
  })  
  .catch(error => {  
    console.error(`Could not fetch the page: ${error}`);  
  });
Parsing HTML with Cheerio

Next, we will parse the fetched HTML so that we can extract the title from the page.

const axios = require('axios');
const cheerio = require('cheerio');
  
axios.get('https://example.com')  
  .then(response => {  
    const $ = cheerio.load(response.data);  
    const title = $('title').text();  
    console.log(`Title: ${title}`);  
  })  
  .catch(error => {  
    console.error(`Could not fetch the page: ${error}`);  
  });

Advanced Web Scraping with Puppeteer

What is Puppeteer?

Puppeteer is a Node.js library that provides a high-level API for controlling Chrome or Chromium, which can be run in headless mode. It is especially effective for scraping content that is generated dynamically by JavaScript.

Basic Example with Puppeteer

Let us fetch and scrape the title of example.com with Puppeteer.

const puppeteer = require('puppeteer');  
  
(async () => {  
  const browser = await puppeteer.launch();  
  const page = await browser.newPage();  
  await page.goto('https://example.com');  
    
  const title = await page.title();  
  console.log(`Title: ${title}`);  
    
  await browser.close();  
})();

Practical Example: Scraping a Real Website

Deciding on the Target Website

In this example, we will scrape product data from https://books.toscrape.com/, a demo online bookstore built specifically for scraping practice. Even so, it is essential to always check a website's terms of service before scraping it.

Data Extraction

We will extract the titles and prices of the books listed on the homepage.

const axios = require('axios');
const cheerio = require('cheerio');
  
axios.get('https://books.toscrape.com/')  
  .then(response => {  
    const $ = cheerio.load(response.data);  
    const books = [];  
      
    $('.product_pod').each((index, element) => {  
      const title = $(element).find('h3 a').attr('title');  
      const price = $(element).find('.price_color').text();  
      books.push({ title, price });  
    });  
      
    console.log(books);  
  })  
  .catch(error => {  
    console.error(`Could not fetch the page: ${error}`);  
  });  
Pagination

To scrape additional pages, we need a way to move from one page to the next. For instance, to scrape the first two catalogue pages:

const axios = require('axios');
const cheerio = require('cheerio');

const scrapePage = async (pageUrl) => {
  const response = await axios.get(pageUrl);  
  const $ = cheerio.load(response.data);  
  const books = [];  
    
  $('.product_pod').each((index, element) => {  
    const title = $(element).find('h3 a').attr('title');  
    const price = $(element).find('.price_color').text();  
    books.push({ title, price });  
  });  
    
  return books;  
};  
  
const main = async () => {  
  const baseUrl = 'https://books.toscrape.com/catalogue/page-';  
  const allBooks = [];  
    
  for (let i = 1; i <= 2; i++) {  
    const pageUrl = `${baseUrl}${i}.html`;  
    const books = await scrapePage(pageUrl);  
    allBooks.push(...books);  
  }  
    
  console.log(allBooks);  
};  
  
main();
Error Handling and Optimization

Error Handling – Solving the Most Common Problems

While scraping, you will commonly run into network failures, server error responses, and timeouts. It is advisable to anticipate such issues and wrap your requests in try-catch blocks.

const axios = require('axios');

const fetchData = async (url) => {
  try {  
    const response = await axios.get(url);  
    return response.data;  
  } catch (error) {  
    console.error(`Error fetching data from ${url}: ${error}`);  
    return null;  
  }  
};
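
Building on fetchData, a useful refinement is to set a request timeout and retry failed requests a few times before giving up. The sketch below uses a fixed retry count and a growing back-off delay; both values are assumptions to tune for your use case.

const axios = require('axios');

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

// Fetch a URL with a timeout, retrying up to `retries` times on failure.
const fetchWithRetry = async (url, retries = 3) => {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      const response = await axios.get(url, { timeout: 5000 }); // fail fast after 5s
      return response.data;
    } catch (error) {
      console.error(`Attempt ${attempt} failed for ${url}: ${error.message}`);
      if (attempt === retries) throw error;
      await delay(1000 * attempt); // back off a little more each time
    }
  }
};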
Rate Limiting and Throttling

To avoid overloading the server we are targeting, we must put measures in place to keep the request rate down, such as rate limiting and throttling. A simple approach is to add a delay after every request.

const delay = ms => new Promise(resolve => setTimeout(resolve, ms));  
  
const main = async () => {  
  const baseUrl = 'https://books.toscrape.com/catalogue/page-';  
  const allBooks = [];  
    
  for (let i = 1; i <= 2; i++) {  
    const pageUrl = `${baseUrl}${i}.html`;  
    const books = await scrapePage(pageUrl);  
    allBooks.push(...books);  
    await delay(1000); // Delay for 1 second  
  }  
    
  console.log(allBooks);  
};  
  
main();
Data Cleanup and Storage in Various Formats

Always clean up the data after scraping, then save it to disk or load it into a database. Here is how the data can be saved as JSON:

const fs = require('fs');  
  
const saveDataToFile = (data, filename) => {  
  fs.writeFileSync(filename, JSON.stringify(data, null, 2));  
};  
  
const main = async () => {  
  const baseUrl = 'https://books.toscrape.com/catalogue/page-';  
  const allBooks = [];  
    
  for (let i = 1; i <= 2; i++) {  
    const pageUrl = `${baseUrl}${i}.html`;  
    const books = await scrapePage(pageUrl);  
    allBooks.push(...books);  
    await delay(1000); // Delay for 1 second  
  }  
    
  saveDataToFile(allBooks, 'books.json');  
  console.log('Data saved to books.json');  
};  
  
main();  
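
As a minimal example of the cleanup step, the hypothetical cleanBooks helper below trims the titles and converts price strings such as '£51.77' into numbers before they are saved:

// Normalize raw scraped records: trim titles, convert '£51.77' -> 51.77.
const cleanBooks = books =>
  books.map(({ title, price }) => ({
    title: title.trim(),
    price: parseFloat(price.replace(/[^0-9.]/g, '')),
  }));

You could then call saveDataToFile(cleanBooks(allBooks), 'books.json') in main().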

Questions and Answers (Q&A)

  1. Is web scraping illegal?

The legality of web scraping depends on the jurisdiction and the specific site. Always check the target website's terms and conditions, and keep relevant laws in mind, such as the CFAA in the USA and the GDPR in the EU. Scraping publicly available data is usually not illegal, but targeting copyrighted content or content behind a login can lead to legal complications.

  2. What types of websites do not permit scraping activities?

Steer clear of websites that explicitly forbid scraping in their terms of service, as well as login-protected pages, CAPTCHA-protected sites, and any sites with anti-scraping measures. Typical examples include social networks, online shops with dynamic content, and sites with strict data-usage policies.

  3. How do I handle CAPTCHAs while scraping?

CAPTCHA challenges are put in place precisely to block automated access. They can be handled using third-party CAPTCHA-solving services, machine learning models trained for the task, or manual solving where practical. Nevertheless, bear in mind that circumventing CAPTCHAs may infringe the website's terms of service.

  4. What are the main security concerns in web scraping, and what practices are recommended?
  • Avoid detection by rotating IP addresses.
  • Use rate limiting so that you do not overload the target server.
  • Do not scrape sensitive, personal, or otherwise private information.
  • Use secure connections (HTTPS) to protect data in transit.
  • Respect the website’s robots.txt file, although it is not legally binding (see the sketch after this list).
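
As a minimal illustration of the robots.txt point above, the sketch below fetches a site's robots.txt and prints its Disallow lines. This naive line scan is only for illustration; a production crawler would use a dedicated parser such as the robots-parser package.

const axios = require('axios');

// Naive robots.txt check: print every Disallow rule so you can see
// which paths the site asks crawlers to avoid.
const checkRobots = async (baseUrl) => {
  try {
    const { data } = await axios.get(`${baseUrl}/robots.txt`);
    const disallowed = data
      .split('\n')
      .filter(line => line.trim().toLowerCase().startsWith('disallow:'));
    console.log(disallowed.join('\n') || 'No Disallow rules found.');
  } catch {
    console.log('No robots.txt found.');
  }
};

checkRobots('https://books.toscrape.com');
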
  5. How can I improve the efficiency and performance of my web scraping programs?
  • Use asynchronous programming to handle multiple requests concurrently instead of blocking on each one.
  • Add caching logic so that redundant requests are eliminated (both points are shown in the sketch after this list).
  • Store data in databases or compressed files rather than plain text files.
  • Optimize your parsing logic to reduce CPU and memory usage.
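
To make the first two points concrete, here is a minimal sketch that fetches several pages concurrently with Promise.all and caches responses in a Map so that repeated URLs are fetched only once. In practice, cap the concurrency so you stay within the rate limits discussed earlier.

const axios = require('axios');

const cache = new Map();

// Fetch a URL, reusing the cached response when we already have one.
const cachedFetch = async (url) => {
  if (cache.has(url)) return cache.get(url);
  const { data } = await axios.get(url);
  cache.set(url, data);
  return data;
};

// Fetch several pages concurrently instead of one at a time.
const urls = [
  'https://books.toscrape.com/catalogue/page-1.html',
  'https://books.toscrape.com/catalogue/page-2.html',
];
Promise.all(urls.map(cachedFetch))
  .then(pages => console.log(`Fetched ${pages.length} pages`))
  .catch(error => console.error(error));
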
  6. Can web scraping go undetected?

Websites can detect scraping through signals such as suspicious IP addresses, unusual request patterns, and non-human behaviour. Careful scrapers reduce the chance of detection by distributing requests across IP addresses, mimicking human behaviour, and observing rate limits.
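
For example, a common low-effort measure is rotating the User-Agent header across requests. The pool of user-agent strings below is a hypothetical example; rotating IP addresses would additionally require a proxy service.

const axios = require('axios');

// A small, hypothetical pool of user-agent strings to rotate through.
const userAgents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7)',
];

// Pick a random user agent for each request.
const fetchWithRandomUA = (url) => {
  const ua = userAgents[Math.floor(Math.random() * userAgents.length)];
  return axios.get(url, { headers: { 'User-Agent': ua } });
};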

  7. What are some other JavaScript options for web scraping?
  • request-promise: a simple promise-based HTTP client (now deprecated in favour of alternatives such as Axios).
  • jsdom: a pure-JavaScript implementation of the DOM, useful for parsing and querying HTML (see the sketch after this list).
  • selenium-webdriver: a browser automation tool for scraping dynamically updated content.
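
As a brief illustration of the jsdom option (installed with npm install jsdom), here is roughly how the earlier title-extraction example would look:

const axios = require('axios');
const { JSDOM } = require('jsdom');

axios.get('https://example.com')
  .then(response => {
    // Parse the HTML into a standards-compliant DOM.
    const dom = new JSDOM(response.data);
    const title = dom.window.document.querySelector('title').textContent;
    console.log(`Title: ${title}`);
  })
  .catch(error => console.error(`Could not fetch the page: ${error}`));
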
  8. How can I scrape content that is rendered dynamically with asynchronous JavaScript?

Headless browsers such as Puppeteer or Selenium can render the page for you and let you scrape the resulting dynamic content. Because they execute JavaScript, you can collect data that loads asynchronously or only appears after user interactions.
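
For instance, Puppeteer can wait for a selector to appear before extracting data. The sketch below reuses the .product_pod markup from books.toscrape.com; that site is actually static, but the same pattern applies to pages that render their content asynchronously.

const puppeteer = require('puppeteer');

(async () => {
  const browser = await puppeteer.launch();
  const page = await browser.newPage();
  await page.goto('https://books.toscrape.com/');

  // Wait until the rendered elements exist in the DOM.
  await page.waitForSelector('.product_pod');

  // Extract the title attribute of each book link.
  const titles = await page.$$eval('.product_pod h3 a',
    links => links.map(link => link.getAttribute('title')));
  console.log(titles);

  await browser.close();
})();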

Conclusion

In this article, we covered the essential aspects of web scraping with JavaScript. We began by sending HTTP requests and parsing HTML with Axios and Cheerio, moved on to scraping dynamic content with Puppeteer, and finished with practical concerns such as error handling, rate limiting, and data storage.

After this step-by-step tutorial, you are ready to start real web scraping projects with JavaScript. Happy scraping!
