Google image SEO & Amazon (AWS) S3 – A case study for resolving missing images
Posted by Luci Wood on December 5, 2018
What was the challenge?
Earlier this year, whilst using Search Console to conduct a Fetch & Render on a client site, I noticed a high number of images that were being returned as ‘temporarily unreachable’. This seemed a little strange, so I rendered a few more pages and found the same issue on them too.
After taking a wider sample of pages from across the website, it soon became apparent that the problem was sitewide, and since the client owns two other domains that we look after, I thought I’d check to see if either of those had any problems.
A quick check in Fetch & Render showed that there was definitely an issue across all three websites with ‘temporarily unreachable’ images. For some reason, Google wasn’t able to get to images across all three domains.
The next step was to check Google Images. Upon performing a site: search within Google Images for each of the three domains, I realised that the only images being indexed were from an archived blog that the client no longer used; none of the images from the main website were visible. For a travel website, naturally, this is a big problem.
A grand total of 17 images indexed for one of the client’s domains. Not ideal.
So began the investigations. First suspect – the recently activated CDN, Cloudflare. We’d had a little trouble a couple of weeks previously when trying to run crawls on the client site; the new CDN, keen to prove its worth, blocked all our attempts. No major issue – the client whitelisted our IP address, and all was well.
Closer inspection of the image URLs pointed to a possible issue with Amazon Web Services (AWS), or more specifically, Amazon S3, which the client uses to serve their images. For those unfamiliar with Amazon Simple Storage Service (known as Amazon S3), it allows you to store data as objects within resources called “buckets”.
There are several advantages to using Amazon S3 vs. hosting images on your own server:
- Lower costs
- Greater flexibility
- Good scalability
- Low latency
- Robust data security and protection
So far so good, right? Perhaps not…
After much investigation, we started to narrow down the issue: only images served from the S3 bucket were being affected. AWS S3 was now the ‘Prime’ suspect (sorry, couldn’t resist!). However, AWS gave away no clues; there was nothing to suggest what the problem might be, and images requested from a browser seemed fine.
Eventually, we came to the realisation that AWS was blocking Googlebot from accessing the images, and we all know what happens if you block Google from accessing things… the spiders throw a fit and refuse to crawl (ok, not quite, but essentially no access = nothing crawled and nothing indexed).
What was the solution for AWS S3?
As with most things in life, identifying the problem is really only half the battle. Finding a solution is quite another.
Whilst searching for the problem we had checked for a robots.txt file on the S3 bucket and found that there wasn’t one. There was nothing we could see that was specifically denying Googlebot access, but by the same token, also nothing explicitly allowing it. For that reason, the client added a robots.txt file with specific directives for Googlebot (and Bingbot) to one of the S3 buckets for their three domains.
This seemed to do the trick; the images began to return to Fetch & Render, so the client added a robots.txt file to the S3 buckets for each of the three domains. Within a couple of days, we saw the return of all the images to Google Images (phew!).
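If you want to sanity-check a robots.txt file before uploading it to a bucket, a few lines of Python will do it. This is only a rough sketch: the directives below are my assumption of what an "allow Googlebot and Bingbot" file might look like (the exact file the client used isn't reproduced here), and the bucket URL is a placeholder.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, assumed for illustration only.
ROBOTS_TXT = """\
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
"""

# Placeholder image URL; swap in one of your own S3 image URLs.
IMAGE_URL = "https://example-bucket.s3.amazonaws.com/images/beach.jpg"


def can_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


for bot in ("Googlebot", "Bingbot"):
    print(bot, "allowed:", can_fetch(ROBOTS_TXT, bot, IMAGE_URL))
```

Bear in mind that Python's parser is a simplified reading of the rules, so treat it as a quick check and not as a guarantee of how Google itself will interpret the file.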
Things to check if your images aren’t indexed
Before you panic about why your images aren’t available in Google Image Search, this piece of work showed us that there can be several reasons for it. Here are a few things you can check if you spot issues with your images…
The first thing to do is to head over to Search Console – in the old version, you can use Fetch & Render to compare how Googlebot sees the page vs. how a visitor to your website would see the page. This will usually give you an immediate indication if something isn’t quite right.
Scroll further down the page and you’ll typically see a list of resources that Googlebot isn’t able to reach. You might expect some blocking of scripts here, but you certainly wouldn’t want to see a long list of images marked ‘temporarily unreachable’.
After running your crawl, you should then be able to see something similar to Search Console’s Fetch & Render visualisation in the bottom pane of the screen.
If the steps above don’t point to what the issue may be, the next thing to do is to check whether the platform you’re using to serve the images is blocking access to bots. If so, make sure a robots.txt file is in place that allows the specific search engine bots you care about to crawl. Run a site: search in Google Images after a couple of days to see if that’s done the job.
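One rough way to test whether a platform treats bots differently is to request the same image twice, once with a browser-style user agent and once with Googlebot's user agent string, and compare the status codes. The sketch below assumes the block is based on the user agent header; it won't catch anything based on IP address, and a spoofed request is never quite the same as a real Googlebot visit. The function names are my own, and the browser string is just an example.

```python
import urllib.error
import urllib.request

USER_AGENTS = {
    # Example browser string; any current browser user agent would do.
    "browser": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
    ),
    # Googlebot's published desktop user agent token.
    "googlebot": (
        "Mozilla/5.0 (compatible; Googlebot/2.1; "
        "+http://www.google.com/bot.html)"
    ),
}


def fetch_status(url, user_agent, timeout=10):
    """Request url with the given user agent and return the HTTP status."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        # 4xx and 5xx responses are raised as exceptions; keep the code.
        return error.code


def compare_access(url, fetch=fetch_status):
    """Return a dict of status codes keyed by user agent label."""
    return {name: fetch(url, agent) for name, agent in USER_AGENTS.items()}


def looks_blocked(statuses):
    """True when the browser request succeeds but the Googlebot one doesn't."""
    return statuses["browser"] == 200 and statuses["googlebot"] != 200
```

To use it, call `compare_access()` with one of your own image URLs and pass the result to `looks_blocked()`. If the two status codes differ, that's a strong hint that something between the bot and the image is filtering requests.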
Naturally, if you want to perform well in visual search, you need to start with a strong foundation. This would include activities such as compressing your images to ensure they can load quickly, ensuring you’ve implemented relevant alt text and making use of image sitemaps. You can find more on that topic in my previous blog post about visual search.
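For anyone who hasn't built an image sitemap before, here's a minimal sketch of generating one with Python's standard library. It uses Google's image sitemap namespace; the page and image URLs are placeholders, and the function name is my own.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
IMAGE_NS = "http://www.google.com/schemas/sitemap-image/1.1"


def build_image_sitemap(pages):
    """Build sitemap XML from a dict of page URL -> list of image URLs."""
    # Register prefixes so the output uses a default namespace and "image:".
    ET.register_namespace("", SITEMAP_NS)
    ET.register_namespace("image", IMAGE_NS)

    urlset = ET.Element(f"{{{SITEMAP_NS}}}urlset")
    for page_url, image_urls in pages.items():
        url = ET.SubElement(urlset, f"{{{SITEMAP_NS}}}url")
        ET.SubElement(url, f"{{{SITEMAP_NS}}}loc").text = page_url
        for image_url in image_urls:
            image = ET.SubElement(url, f"{{{IMAGE_NS}}}image")
            ET.SubElement(image, f"{{{IMAGE_NS}}}loc").text = image_url
    return ET.tostring(urlset, encoding="unicode")


# Placeholder URLs for illustration only.
print(build_image_sitemap({
    "https://www.example.com/destinations/": [
        "https://example-bucket.s3.amazonaws.com/images/beach.jpg",
    ],
}))
```

The image URLs can sit on a different host from the page, which is handy when the images are served from an S3 bucket.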
In my next blog post I’ll be looking at CDNs in more depth, and discussing how you can ensure you’re optimising for images whilst delivering a great user experience.