Log files record every request, or ‘hit’, made to a web server.
Analyzing log files is an often-overlooked but important aspect of technical SEO, and it really comes into its own when dealing with huge websites with thousands, or hundreds of thousands of pages.
Here, we’ll be explaining what a log file is, where to find it, what it means for your site, and how you can squeeze SEO juice out of it.
What is a Log File?
Every request made to a web server is logged and saved.
The log file is written to your hosting web server, where you’ll be able to find it and analyze it.
When you look at a raw log file, it’ll be densely packed with written info and data – but don’t panic!
Some logged request information you’ll see may include:
- IP Address
- Type of HTTP request (GET/POST, etc)
- User Agent
- Requested URL
- HTTP Status Code
Other attributes might include:
- Host name
- Bytes downloaded
- Request/Client IP
This information can be useful for troubleshooting issues, but for SEO, we’re primarily interested in HTTP requests – this recorded information tells us about how GoogleBot and other crawlers crawl a site.
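Because the server itself records each request, the log is a direct record of how GoogleBot and other search engine bots crawl a site and its pages, rather than an estimate. Bear in mind that a user agent string can be spoofed, so a request that claims to be GoogleBot may need verifying.
Log files record the activity of crawlers, giving SEOs clues as to how a site and its pages are being crawled, and by whom.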
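To make the fields listed above a little more concrete, here is a minimal sketch that pulls them out of one raw line. It assumes the server writes the widely used Apache/NGINX "combined" log format. The sample line, the `LOG_PATTERN` regex, and the `parse_line` helper are illustrative names only, and other log formats would need a different pattern.

```python
import re

# Assumption: the server uses the "combined" log format:
# ip identity user [time] "method url protocol" status bytes "referrer" "user agent"
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def parse_line(line):
    """Return the request fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["status"] = int(fields["status"])
    # Servers write "-" in the bytes column when no response body was sent.
    fields["bytes"] = 0 if fields["bytes"] == "-" else int(fields["bytes"])
    return fields


# Made-up sample line for illustration only.
SAMPLE = (
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /blog/post-1 HTTP/1.1" '
    '200 5120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; '
    '+http://www.google.com/bot.html)"'
)

print(parse_line(SAMPLE))
```

Each parsed line becomes a small dictionary, which is far easier to filter and count than the raw text.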
How Do I Find a Log File?
It depends on your server setup and configuration.
If you have access to, or have configured, your own server, you can download a log file yourself using the guide below. This is often a convoluted process, so if you are not responsible for the server, you can ask your web developer or tech team for the log files.
Using cPanel to Download a Log File
cPanel is web hosting control panel software that simplifies web and server management.
If you have access to your site’s cPanel then downloading log files is much simpler (thankfully!).
Log into cPanel and you’ll typically need to head to Metrics and Raw Access Logs. You’ll need to tick the Archive box to ensure that logs will be saved from then on.
After a new log is created, you’ll then be able to download the log for analysis.
This is by far the easiest DIY method to retrieve logs and anyone can do it, provided you can access your site’s cPanel.
Once you have the log file, you’ll need to look into log analyzer tools to analyze the data.
You can manually convert and filter the file, but there’ll be tons of info in there which is totally irrelevant to SEO. In fact, some log files can reach hundreds of megabytes, and parsing and sorting the info DIY is painstaking.
It’s worth noting that you may need multiple log files to cover longer periods of time – one day’s worth of data is not likely to be enough to analyze how GoogleBot crawls your site.
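If you do end up with several log files, a small helper can stitch them together before analysis. The sketch below is one possible approach, assuming the files sit in a single folder and that archived logs are gzip-compressed with a `.gz` extension (common, but check your own host). The function name `iter_log_lines` and the folder layout are hypothetical.

```python
import gzip
from pathlib import Path


def iter_log_lines(log_dir, pattern="*"):
    """Yield every line from the plain or gzip-compressed logs in log_dir."""
    # Sorting by file name keeps the output order predictable.
    for path in sorted(Path(log_dir).glob(pattern)):
        if not path.is_file():
            continue
        # Assumption: compressed archives end in ".gz"; everything else is plain text.
        opener = gzip.open if path.suffix == ".gz" else open
        with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
            for line in handle:
                yield line.rstrip("\n")
```

Because the function is a generator, it reads one line at a time rather than loading hundreds of megabytes into memory at once.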
Why Analyze Log Files?
It’s a lot of hassle, right?!
Well, if you use cPanel then it’s not too bad!
But still, what do you do with all that data?!
Identify Crawl Bots
SEO is usually all Google, Google, and Google, but other crawl bots matter too, especially if you’re looking to tap into emerging markets or, for example, audiences in China via Baidu.
Your server log will tell you if requests have been made by GoogleBot as well as BingBot, Baidu, Yahoo, and Yandex, and any other user agents.
So, for example, you may find that you’re not getting crawled by Baidu, but you want to appear in China. The log file contains this sort of information.
You’ll also be able to see if your robots.txt file is doing its job in disallowing certain crawlers.
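As a rough sketch of both checks, the snippet below counts requests per crawler and flags any crawler request for a URL that your robots.txt disallows for that bot. It assumes combined-format log lines and trusts the user agent string as written (which can be spoofed). The `BOT_TOKENS` table and the function names are illustrative choices, not a definitive list of crawler signatures.

```python
import re
from collections import Counter
from urllib.robotparser import RobotFileParser

# Assumption: these lowercase tokens appear in each crawler's user agent string.
BOT_TOKENS = {
    "Googlebot": "googlebot",
    "Bingbot": "bingbot",
    "Baidu": "baiduspider",
    "Yahoo": "slurp",
    "Yandex": "yandex",
}

# Picks the requested URL and the user agent out of a combined-format line.
REQUEST = re.compile(
    r'"\S+ (?P<url>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def identify_bot(user_agent):
    """Return the crawler name for a user agent string, or None for other traffic."""
    lowered = user_agent.lower()
    for name, token in BOT_TOKENS.items():
        if token in lowered:
            return name
    return None


def crawler_report(lines, robots_txt):
    """Count hits per crawler and list hits on URLs that robots.txt disallows."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    hits = Counter()
    disallowed_hits = []
    for line in lines:
        match = REQUEST.search(line)
        if match is None:
            continue
        bot = identify_bot(match["ua"])
        if bot is None:
            continue
        hits[bot] += 1
        # A crawler requesting a disallowed URL suggests robots.txt is being ignored.
        if not robots.can_fetch(BOT_TOKENS[bot], match["url"]):
            disallowed_hits.append((bot, match["url"]))
    return hits, disallowed_hits
```

A crawler that is missing from `hits` altogether (Baidu, say) is just as informative as one that shows up in `disallowed_hits`.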
Search engines don’t crawl your site indefinitely, and they don’t have unlimited resources either.
Crawls are periodic, and only a limited quantity of crawling resources, known as crawl budget, will be dedicated to your site. Log files let you find and assess how many pages are being crawled across your site over a given period of time – usually daily.
More authoritative, healthy, and optimized sites will be allocated greater crawling resources.
Consider a site like Wikipedia which has some 6 million pages, with over 600 newly added articles each day. That’s going to need a lot of crawling and as such, GoogleBot probably crawls many thousands or even millions of Wiki pages every day to check for updates and changes.
Wikipedia is one of the most authoritative domains in the world and thus, it has a huge crawl budget to suit its vast quantity of pages and ultra-high authority.
Your site might be different, but you might still have thousands of pages. If your site isn’t being allocated the appropriate crawl budget, then your newly created or updated pages might sit there uncrawled for days, weeks, months, or even years in exceptional circumstances!
You could also find that crawler bots are spending their time on old pages, redirects, orphaned pages, and other useless URLs when you want them to crawl your new or updated content.
Log analysis lets you calculate how many pages of your site are being crawled – and which ones – so you can make necessary changes to your site, robots.txt and sitemap to make sure your top pages are prioritized.
Once you’ve assessed crawl budget and timing, you can modify your XML sitemap to prioritize which pages Google crawls regularly.
This might be your blog or similar, whereas other parts of your site that remain the same for long periods of time will not need to be crawled so often.
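To put a number on crawl budget, you can count how many requests, and how many distinct URLs, a crawler makes per day. The sketch below is one way to do that, assuming combined-format lines with timestamps like `10/Oct/2023:13:55:36 +0000`. The function name `daily_crawl_stats` and the default `bot_token` are assumptions made for illustration.

```python
import re
from collections import defaultdict
from datetime import datetime

# Captures the timestamp, requested URL and user agent from a combined-format line.
LINE = re.compile(
    r'\[(?P<time>[^\]]+)\] "\S+ (?P<url>\S+) [^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<ua>[^"]*)"'
)


def daily_crawl_stats(lines, bot_token="Googlebot"):
    """Return {date: (total_requests, unique_urls)} for one crawler."""
    requests = defaultdict(int)
    urls = defaultdict(set)
    for line in lines:
        match = LINE.search(line)
        if match is None or bot_token.lower() not in match["ua"].lower():
            continue
        # The date is taken in the time zone the server wrote into the log.
        day = datetime.strptime(match["time"], "%d/%b/%Y:%H:%M:%S %z").date()
        requests[day] += 1
        urls[day].add(match["url"])
    return {day: (requests[day], len(urls[day])) for day in sorted(requests)}
```

A big gap between total requests and unique URLs on a given day hints that the crawler keeps returning to the same handful of pages.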
Temporary 302 Redirects and Duplicate URLs
Temporary 302 redirects direct users and crawlers to a temporary page. They’re not harmful to SEO but do eat into crawl budget as the search engine will be continually crawling to see if the redirect is still there.
Log analysis also helps you pin down content that you can instruct Google not to crawl at all, such as duplicate URLs. This can also help reduce analytics errors.
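Here is a small sketch of how both problems might be surfaced from the logs: it counts 302 responses per URL and groups URLs that look like duplicates of one another. The duplicate rule used here (same path once the query string, letter case, and trailing slash are ignored) is an assumption that will not suit every site, and the function name is hypothetical.

```python
import re
from collections import Counter, defaultdict
from urllib.parse import urlsplit

# Captures the requested URL and the status code from a combined-format line.
LINE = re.compile(r'"\S+ (?P<url>\S+) [^"]*" (?P<status>\d{3}) ')


def redirect_and_duplicate_report(lines):
    """Return (302 hit counts per URL, {normalized path: [URL variants]})."""
    redirects = Counter()
    variants = defaultdict(set)
    for line in lines:
        match = LINE.search(line)
        if match is None:
            continue
        url = match["url"]
        if match["status"] == "302":
            redirects[url] += 1
        # Assumption: URLs differing only by query string, case or trailing
        # slash serve the same content.
        path = urlsplit(url).path.lower().rstrip("/") or "/"
        variants[path].add(url)
    duplicates = {
        path: sorted(urls) for path, urls in variants.items() if len(urls) > 1
    }
    return redirects, duplicates
```

The URLs with the highest 302 counts are good candidates for a permanent redirect or an updated internal link.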
Check Page Crawl Times
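You can analyze log files to discover the time to first byte (TTFB) and time to last byte (TTLB) associated with each page, helping you quickly find your fastest and slowest-loading pages. Not every log format records timings by default, so your server may need to be configured to log them.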
This is a quick and easy way to check page load stats for big sites, and you’ll be able to sort your largest web pages so you can check those for particularly slow content (e.g. high-res images).
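As an illustration, the sketch below ranks URLs by average response time and average response size. It rests on a big assumption: that your server has been configured to append the request duration in seconds as the last field of each line (for example via NGINX's `$request_time` variable). That figure is total request processing time as the server sees it, so treat it as a proxy rather than a true TTFB or TTLB. The function name `slowest_pages` is made up for this example.

```python
import re
from collections import defaultdict

# Assumption: a combined-format line with the request duration in seconds
# appended as the final field, e.g. '... "Googlebot/2.1" 0.245'.
LINE = re.compile(
    r'"\S+ (?P<url>\S+) [^"]*" \d{3} (?P<bytes>\d+|-) .* '
    r'(?P<seconds>\d+(?:\.\d+)?)$'
)


def slowest_pages(lines, top=10):
    """Return (url, average_seconds, average_bytes) tuples, slowest first."""
    totals = defaultdict(lambda: [0, 0.0, 0])  # hits, seconds, bytes
    for line in lines:
        match = LINE.search(line.strip())
        if match is None:
            continue
        entry = totals[match["url"]]
        entry[0] += 1
        entry[1] += float(match["seconds"])
        # "-" in the bytes column means no response body was sent.
        entry[2] += 0 if match["bytes"] == "-" else int(match["bytes"])
    rows = [
        (url, seconds / hits, size / hits)
        for url, (hits, seconds, size) in totals.items()
    ]
    return sorted(rows, key=lambda row: row[1], reverse=True)[:top]
```

Sorting the same rows by the third column instead gives you the heaviest pages, which is a quick way to spot oversized images.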
Discover Orphaned Pages
Orphaned pages are pages that still exist on your site and get crawled, but have no internal links pointing to them. They exist to Google (if the sitemap tells it to crawl them, or external links point to them) but not to you or the user (unless the user follows one of those external links).
These can be created by site structure changes, internal linking errors, and old redirects.
Either way, they’re unlikely to rank well, as they have no internal links. Log analysis finds orphaned pages so you can inspect them and deal with them.
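At its simplest, finding orphaned pages is a comparison of two lists: URLs that crawlers request in your logs, and URLs that your own internal links reach. The sketch below assumes you already have the second list (for example, exported from a site crawl), with paths written the same way as in the logs. The function name and the `internally_linked` argument are hypothetical.

```python
import re

# Captures the requested URL and the user agent from a combined-format line.
LINE = re.compile(
    r'"\S+ (?P<url>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def find_orphan_candidates(log_lines, internally_linked, bot_token="Googlebot"):
    """Return crawled URLs that are absent from the internally linked set."""
    crawled = set()
    for line in log_lines:
        match = LINE.search(line)
        if match and bot_token.lower() in match["ua"].lower():
            crawled.add(match["url"])
    # Anything crawled but never linked internally is an orphan candidate.
    return sorted(crawled - set(internally_linked))
```

The result is a list of candidates rather than a verdict, since redirects and parameterized URLs will also show up and need checking by hand.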
Best Log File Analyzers
Here is a compilation of log file analyzer tools that take the graft out of analyzing complex log files for SEO purposes.
In Semrush’s mighty set of SEO tools, you’ll find the Semrush Log File Analyzer, which was only released in 2018. This tool is available with Semrush’s main set of tools, costing $119/mo for the Pro package.
You’ll be able to upload your log files for analysis where the tool will break down all of the important SEO-related metrics and display them in graphs.
The tool will break down:
- Requests from various search engine bots (e.g. GoogleBot)
- HTTP status codes found each day
- The different file types crawled each day
This lets you easily assess your crawl budget, and shows how often pages and/or files are crawled. This enables you to achieve the main goal: to analyze the crawlability of your site and point GoogleBot towards the pages that require crawl priority (e.g. a regularly updated blog).
- Analyze bot activity
- Discover most crawled and least crawled pages
The Screaming Frog Log File Analyser is probably one of the most complete log analyzers around and doubles up as an excellent technical SEO tool.
Costing just £99.00 per year, it’s one of the cheaper tools around and the free version still lets you analyze 1,000 lines of log data, plenty for single-site owners and SEO novices.
This tool will break down crawler activity in full, filterable by bot, e.g. GoogleBot, Baidu, Yandex, etc. It’ll point you towards broken links, redirects, and orphaned pages. Sorting by page size is simple too, so you can quickly ascertain page load speed and size.
For SEO purposes, the tool breaks down crawl budget and reveals the most crawled pages and content, enabling you to discover how efficiently your site makes use of its crawl budget.
- Discover and verify bots
- Analyze crawl frequency
- Find your most crawled pages to analyze crawl budget
- Find orphaned pages
This is more of a log management tool for DevOps and tech teams. It aggregates many types of logs, including server log files, for centralized accessibility across an organization.
Log alerts can be sent via Slack or other team management platforms and data can be imported into software utilities such as Hadoop.
- Aggregates logs for engineers and dev teams
- Email alert system for anomalies
- Access and download log files for further analysis
- Wide technical remit (not really an SEO tool)
Graylog is a sophisticated log and data analysis platform that enables aggregation of logs of virtually any type, from any source. It’s designed for enterprise and organization-level businesses that require rigorous analysis of many data log outputs.
Graylog is an advanced tool for engineers and dev teams; it’s designed for analyzing data from many inputs or outputs – not just servers.
- Can store huge quantities of data
- Designed for enterprise-wide data analysis
- Real-time analysis
- Unprecedented scalability
Loggly aggregates log data across large enterprise-level or organization-wide networks. It’s an enterprise-level platform that allows for real-time analysis of server logs as well as log data from near-limitless sources.
Loggly enables the analysis of all server data, but it’s also an interdisciplinary tool for log aggregation and analysis.
- Store and analyze server logs across large distributed systems
- Secure, cloud-enabled analytics platform
- Near-limitless scalability
- Enables cross-department collaboration
Another enterprise-level server log file aggregator. Designed to pool and centralize server info from distributed networks, Log Entries enables advanced analysis of server resources and system health.
A professional engineering/dev tool that breaks down server metrics in minuscule detail to audit the health of colossal networks.
- Collect server logs across large networks
- Real-time alert system
- Server anomaly detection
- Enables cross-department collaboration
A fast, terminal-based log analyzer, GoAccess breaks down key server requests in detail for real-time analysis. It shows visitor numbers as well as crawlers and spiders so you can analyze how often your site is crawled. It’s also great for assessing page response time and server load.
- Terminal-based dashboard
- Analyze web traffic and crawlers in real-time
- Analyze server resources and bandwidth
Optimized for SEO log analysis, SEOLyzer provides seamless live log tracking tools to detect crawling errors rapidly. It has an easy-to-use interface and the focus here is on speed; it enables you to locate and home in on errors before your other systems (e.g. Search Console) pick up the error.
SEOLyzer has built an impressive repertoire of clients and has proven a superb technical SEO tool for auditing useful server logs for key information. The aim of SEOLyzer is to detect logged errors and work to fix them before site coverage declines.
The software also allows you to aggregate information into graphs and KPIs.
- Log analysis for SEO specifically
- Aggregate log data for analysis
- Find errors before they hit your site coverage
- Free version for single-site users (limited analysis capacity)
SEO log analysis may seem fairly niche and complex, but it’s a powerful weapon for your technical SEO armory.
SEO log analysis’s main strengths lie in debugging and troubleshooting, but it’s also an excellent tool for analyzing crawl budget in the natural habitat of the crawlers themselves.
The log is 100% accurate, data-rich, and hard-linked to the requests made by crawlers – that is its main advantage.