SEO understanding level – Basic to Medium
There have been a significant number of recent updates within the world of SEO web crawling software.
Of particular note, Screaming Frog released the latest version of their crawling software – updated to incorporate a fresher, more visual appearance.
Deepcrawl, another well-known crawling software company, received a significant amount of funding in August, demonstrating that this type of software is being taken seriously by venture capital investors.
Recently, we’ve been contacted by a number of other, lesser-known/up-and-coming crawling software companies, either announcing that the Beta version of their tool is available to try, or suggesting that we sign up and download their software.
Overall, web crawling software for SEO seems to be an increasingly competitive space, with each company doing everything they can to secure valuable business.
Each web crawler has its own merits, with each offering a slightly different product.
In light of this, we’ve decided to take a more in-depth look at web crawling, including what it actually is, and how this type of software can be so valuable when conducting an SEO campaign.
What is web crawling?
Crawling is the process of discovering individual resources, URLs, websites and the whole (hyperlinked) internet in an automated and methodical way.
This process is carried out by a web crawler, bot or spider. Googlebot, for example, is an incredibly powerful crawler, able to crawl and index the roughly one billion websites on the world wide web. Its ability to discover this data means that, when you perform a Google search, results can be returned from this web crawl. The actual results you see displayed, however, are delivered via a serving tree sitting atop an index.
What does an SEO web crawler actually do?
Using web crawling software for SEO is a way of troubleshooting, and effectively emulating what crawlers such as Googlebot may discover when visiting your website. This allows you to find and fix errors with the technical structure of your site, as well as identify a whole variety of other technical issues. These may include issues with your canonical setup, duplicate/missing metadata or ‘thin content’ by way of low word count.
If you use a crawler such as Screaming Frog, Deepcrawl or Sitebulb, you will need to provide it with a list of URLs, or a single starting point (such as the site homepage, or a specific entry page such as a product page). The software then follows each and every link that it finds on that page (and subsequent pages), until there are no more unique URLs to follow.
These results are usually then displayed via the tool interface, highlighting any issues and errors, as well as providing an insight into the general structure of your website.
How do SEO web crawlers work?
Starting with a single URL, crawling software scans through the source code of that page and looks for links to other URLs within that code. It then adds these links to an overall database/list that is stored locally, on whatever machine is being used to crawl.
The crawler then goes back to that database, identifying which URLs have been crawled already, and then crawls the URLs that have not yet been crawled.
This process continues until there are no more unique URLs remaining.
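The loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not how any particular tool is built: the `crawl` function, the `fetch` callable and the in-memory `SITE` dictionary are all hypothetical names, and the dictionary stands in for real HTTP requests so that the example runs without network access.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag in a page's source code."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def crawl(start_url, fetch):
    """Crawl from start_url until no unique, uncrawled URLs remain.

    `fetch` is a hypothetical callable returning the HTML source of a URL
    as a string, or None if the page could not be retrieved.
    """
    queue = deque([start_url])  # URLs discovered but not yet crawled
    seen = {start_url}          # every URL discovered so far
    crawled = []                # URLs in the order they were requested

    while queue:
        url = queue.popleft()
        crawled.append(url)
        html = fetch(url)
        if html is None:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            # Resolve relative links and drop #fragments so the same page
            # is not queued twice under different spellings.
            absolute, _fragment = urldefrag(urljoin(url, href))
            if absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return crawled


# A tiny in-memory "website" standing in for real HTTP requests.
SITE = {
    "https://example.com/": '<a href="/about">About</a> <a href="/products">Products</a>',
    "https://example.com/about": '<a href="/">Home</a>',
    "https://example.com/products": '<a href="/about#team">Team</a> <a href="/missing">Old</a>',
}

pages = crawl("https://example.com/", SITE.get)
```

Here `pages` ends up holding four URLs, including `/missing`, which is requested once even though no page exists for it. A real crawler would normally also restrict the queue to the hostname being audited, which this sketch leaves out.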
When the crawler attempts to visit a certain URL, a request is made to the server where the URL is hosted (in the same way as when you visit a page in your browser). Together with the site's robots.txt file, which a well-behaved crawler checks before requesting a page, this request establishes:
- Whether that particular crawler is ‘allowed’ to see the requested URL (by following robots.txt directives)
- Whether that URL exists at the requested location
- What language the page is requested in
- Other details, such as which content types the crawler accepts
A request header carrying these details is submitted with every page that is requested.
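As a rough illustration, the robots.txt check and the request header might look like this in Python, using only the standard library. The crawler name `ExampleSEOCrawler/1.0`, the `build_request` helper and the inline robots.txt rules are assumptions made for the example, and the request is only built, never sent.

```python
from urllib import robotparser
from urllib.request import Request

USER_AGENT = "ExampleSEOCrawler/1.0"  # hypothetical crawler name

# robots.txt rules supplied inline so the sketch runs without network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /checkout/
"""

rules = robotparser.RobotFileParser()
rules.parse(ROBOTS_TXT.splitlines())


def build_request(url):
    """Return a Request for url, or None if robots.txt disallows it."""
    if not rules.can_fetch(USER_AGENT, url):
        return None
    return Request(
        url,
        headers={
            "User-Agent": USER_AGENT,             # identifies the crawler
            "Accept-Language": "en-GB,en;q=0.8",  # preferred page language
            "Accept": "text/html",                # content type wanted
        },
    )


allowed = build_request("https://example.com/products/widget")
blocked = build_request("https://example.com/checkout/basket")
```

With these rules, `blocked` is `None` because the URL sits under `/checkout/`, while `allowed` is a request object carrying the three headers. Note that honouring robots.txt is the crawler's own decision; the server does not enforce it.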
The server then generates a “Header Response” which provides the crawler with all of the information it has requested. In some instances the server may not allow the requested page to be crawled; in these instances a 4xx status code (typically 403 Forbidden) will be returned, along with a short reason phrase. A 5xx status code, by contrast, means that the server itself encountered an error and could not fulfil the request.
Other status codes can also be returned at this point, such as:
- 404 Not Found tells the crawler that the requested URL was not found on the server. The crawler then makes a record of this, and moves on to the next URL in the list.
- 3xx status codes tell the crawler that the requested URL is redirected to another location, which the crawler will record and then follow, until it finds a URL without a 3xx status code. Googlebot will only follow a redirect up to around 5 steps; after this the chain is deemed too long. Crawling software is therefore great at identifying such issues, as it can be configured to follow redirect chains for many more steps.
- 200 Status Code – OK – tells the crawler that the webpage was found, and it is OK to crawl.
(image from web-sniffer.net)
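The redirect handling described in the status code list can be sketched as follows. The `resolve` function, its `max_hops` parameter and the `RESPONSES` table are hypothetical; the table stands in for real header responses, and treating an unknown URL as a 404 is an assumption made purely to keep the example small.

```python
# Hypothetical header responses: URL -> (status code, Location header or None)
RESPONSES = {
    "http://example.com/old": (301, "https://example.com/old"),
    "https://example.com/old": (302, "https://example.com/new"),
    "https://example.com/new": (200, None),
}


def resolve(url, responses, max_hops=10):
    """Follow 3xx responses from url, returning (chain, final_status).

    chain lists every URL requested, in order. final_status is None if the
    chain needed more than max_hops redirects or looped back on itself.
    """
    chain = [url]
    while True:
        # Assumption for this sketch: URLs missing from the table are 404s.
        status, location = responses.get(url, (404, None))
        if not (300 <= status < 400 and location):
            return chain, status  # 200, 404, 5xx, etc.: stop here
        if len(chain) > max_hops or location in chain:
            return chain, None    # chain too long, or a redirect loop
        url = location
        chain.append(url)


chain, status = resolve("http://example.com/old", RESPONSES)
short_chain, short_status = resolve("http://example.com/old", RESPONSES, max_hops=1)
```

The first call follows both redirects and finishes on a 200, so `chain` holds three URLs. The second call, limited to one hop, gives up after the first redirect and reports `None`, which is how a tool could flag a chain as too long for a stricter crawler.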
If the crawler receives an ‘OK’ Response, the software will then conduct ‘parsing’; this effectively scans through an entire webpage, extracting certain elements and storing these in the database.
These elements can include:
- Meta tags, such as titles and descriptions
- H1 tags
- Canonical tags
- Many, many more
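A minimal sketch of this parsing step, using Python's built-in `html.parser`, might look like the following. The class name `SeoElementParser` and the sample `PAGE` are invented for illustration; commercial crawlers extract far more elements and cope with much messier markup.

```python
from html.parser import HTMLParser


class SeoElementParser(HTMLParser):
    """Extracts the title, meta description, canonical URL and H1s."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = None
        self.canonical = None
        self.h1s = []
        self._capture = None  # tag whose text is currently being collected

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._capture = "title"
        elif tag == "h1":
            self._capture = "h1"
            self.h1s.append("")
        elif tag == "meta" and (attrs.get("name") or "").lower() == "description":
            self.description = attrs.get("content")
        elif tag == "link" and (attrs.get("rel") or "").lower() == "canonical":
            self.canonical = attrs.get("href")

    def handle_endtag(self, tag):
        if tag == self._capture:
            self._capture = None

    def handle_data(self, data):
        # Text may arrive in several chunks, so it is appended, not assigned.
        if self._capture == "title":
            self.title += data
        elif self._capture == "h1":
            self.h1s[-1] += data


PAGE = """
<html><head>
<title>Blue Widgets | Example Shop</title>
<meta name="description" content="Hand-made blue widgets.">
<link rel="canonical" href="https://example.com/widgets/blue">
</head><body><h1>Blue Widgets</h1><p>Only twelve left.</p></body></html>
"""

parser = SeoElementParser()
parser.feed(PAGE)
parser.close()
```

After parsing, the extracted values are available as `parser.title`, `parser.description`, `parser.canonical` and `parser.h1s`, ready to be stored against the page's URL. A page with no H1 simply leaves `parser.h1s` empty, which is itself something a report could flag.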
How is SEO web crawl data stored and presented?
The crawling software stores all of this information in the same database associated with the relevant URL. All of this data is stored locally, which is why most standard PCs cannot handle multi-million page crawls – they just run out of RAM to store this data in.
However, setting up a crawler in a service such as AWS can help resolve this issue, as it allows you to dedicate much more RAM to the crawler. Some crawlers also offer a combined service, where they utilise the power of cloud computing to enable you to crawl many more pages.
When the crawl completes, most software will present the information to you in a spreadsheet-style report which, as highlighted above, can identify many different issues with your website clearly and quickly.
Some crawlers add styling elements to the report and present the data to you in a graphical way – fantastic if you need to analyse the data quickly, or present to clients or management teams.
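To make the idea of a spreadsheet-style report concrete, here is a small sketch that turns stored crawl data into CSV with an "issues" column. The field names, the sample `crawl_data` and the 200-word `THIN_CONTENT_WORDS` threshold are all assumptions for the example rather than values taken from any specific tool.

```python
import csv
import io

THIN_CONTENT_WORDS = 200  # hypothetical threshold for 'thin content'

# Crawl results keyed by URL, as they might sit in the crawler's database.
crawl_data = {
    "https://example.com/": {"status": 200, "title": "Home", "word_count": 640},
    "https://example.com/about": {"status": 200, "title": "", "word_count": 85},
    "https://example.com/old": {"status": 404, "title": "", "word_count": 0},
}


def find_issues(record):
    """Return a list of human-readable issues for one crawled URL."""
    if record["status"] != 200:
        # Content checks are meaningless for pages that did not load.
        return [f"status {record['status']}"]
    issues = []
    if not record["title"]:
        issues.append("missing title")
    if record["word_count"] < THIN_CONTENT_WORDS:
        issues.append("thin content")
    return issues


def build_report(data):
    """Render the crawl data as CSV text, one row per URL."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["url", "status", "title", "word_count", "issues"])
    for url, record in data.items():
        writer.writerow([
            url,
            record["status"],
            record["title"],
            record["word_count"],
            "; ".join(find_issues(record)),
        ])
    return buffer.getvalue()


report = build_report(crawl_data)
```

The resulting text can be saved as a `.csv` file and opened in any spreadsheet application, where the issues column can be filtered or sorted.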
Web crawling benefits for SEO
As we’ve seen, web crawling is a necessary part of SEO. It can help to identify issues from “under the hood” of the website that may not be apparent purely by looking/clicking around. It’s by far the easiest and best insight into how search engines will discover and view your content, and therefore the data a crawler can provide is invaluable to anyone running an SEO campaign.