Web Crawler in NodeJS using Cheerio
Author: Jimmy Rousseau | Published: 10/31/2023

While browsing the web for ideas and cool things people are doing, I came across the Vercel Pinecone Starter Application repo, which includes a crawler written by a user named rschwabco. The crawler uses a queue to track crawl depth and enforce a page limit. I thought the approach was really neat and wanted to post it here for others (and myself) to refer to, as it's also a great example of when to use a class definition.

The full source is available in this repo.

You can initialize the crawler like so:
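Since the snippet isn't reproduced here, this is a sketch of the crawler's state and constructor; the field and parameter names are my assumptions based on the post's description, not a verbatim copy of the repo.

```typescript
// Sketch of the crawler's state and constructor (names assumed, not verbatim).
class Crawler {
  // URLs waiting to be visited, each tagged with its distance from the start URL
  private queue: { url: string; depth: number }[] = [];
  // URLs already visited, so the same page is never fetched twice
  private seen = new Set<string>();
  // Pages collected so far; its length is checked against maxPages
  private pages: { url: string; content: string }[] = [];

  constructor(private maxDepth = 2, private maxPages = 100) {}
}

// A crawler that follows links at most 2 levels deep and stops at 100 pages
const crawler = new Crawler(2, 100);
```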

This creates an instance of the crawler initialized with a depth of 2 and a page limit of 100.

You can then use it like this:
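Here is a runnable sketch of how the pieces fit together. The method names (crawl, shouldContinueCrawling, addNewUrlsToQueue) follow the post, but the bodies are my reconstruction; an injectable fetcher stands in for the real HTTP fetch plus Cheerio link extraction so the example runs without the network.

```typescript
// Reconstruction of the crawl loop (an assumption, not the repo's exact code).
type Page = { url: string; content: string };
type Fetched = { content: string; links: string[] };

class Crawler {
  private queue: { url: string; depth: number }[] = [];
  private seen = new Set<string>();
  private pages: Page[] = [];

  constructor(
    private fetcher: (url: string) => Promise<Fetched>, // stand-in for fetch + Cheerio
    private maxDepth = 2,
    private maxPages = 100
  ) {}

  async crawl(startUrl: string): Promise<Page[]> {
    this.queue.push({ url: startUrl, depth: 0 });
    while (this.shouldContinueCrawling()) {
      const { url, depth } = this.queue.shift()!;
      // Skip URLs that are too deep or already visited
      if (depth > this.maxDepth || this.seen.has(url)) continue;
      this.seen.add(url);
      const { content, links } = await this.fetcher(url);
      this.pages.push({ url, content });
      this.addNewUrlsToQueue(links, depth);
    }
    return this.pages;
  }

  private shouldContinueCrawling(): boolean {
    return this.queue.length > 0 && this.pages.length < this.maxPages;
  }

  private addNewUrlsToQueue(urls: string[], depth: number): void {
    this.queue.push(...urls.map((url) => ({ url, depth: depth + 1 })));
  }
}

// Usage against a fake two-page "site" so the example is deterministic
const site: Record<string, Fetched> = {
  "https://example.com": { content: "home", links: ["https://example.com/a"] },
  "https://example.com/a": { content: "page a", links: [] },
};
const crawler = new Crawler(async (url) => site[url], 2, 100);
crawler.crawl("https://example.com").then((pages) => {
  console.log(pages.map((p) => p.url));
});
```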

The crawler's crawl method is a loop that calls the method "shouldContinueCrawling" to decide whether to keep going.

Do we still have items in our queue, AND is the number of collected pages less than the limit we initialized with?
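That check can be sketched in isolation like this; the class here carries only the two pieces of state the method reads, with names mirroring the post's description (assumed, not verbatim from the repo).

```typescript
// Focused sketch of shouldContinueCrawling (names assumed).
class CrawlState {
  queue: { url: string; depth: number }[] = [];
  pages: { url: string; content: string }[] = [];

  constructor(public maxPages = 100) {}

  // True only while there is queued work AND we are under the page limit
  shouldContinueCrawling(): boolean {
    return this.queue.length > 0 && this.pages.length < this.maxPages;
  }
}
```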

The addNewUrlsToQueue method is what increments our depth:
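A sketch of that method: links found on a page at some depth get queued at depth + 1, which is the only place depth ever grows (again, names are assumptions based on the post).

```typescript
// Sketch of addNewUrlsToQueue (names assumed).
class CrawlQueue {
  queue: { url: string; depth: number }[] = [];

  addNewUrlsToQueue(urls: string[], depth: number): void {
    // Every child link is one hop further from the start URL than its parent
    this.queue.push(...urls.map((url) => ({ url, depth: depth + 1 })));
  }
}
```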

It's not hard or complex, just straightforward, clear, and elegant, which I really appreciate.