Finding AI Bots in Your Web Server Log Files

The following are additional ways to process your web server log files, expanding on our blog post “How to Protect Your Content from AI… and Should You?” (Coming Soon). These methods let you quickly evaluate how often your website is being accessed by AI Scrapers, AI Crawlers, and Assistants.

Bots accounted for 47.4% of all internet traffic in 2022.

Extract AI Bot Lines from Server Logs with grep

Below, you'll learn how to use the grep command to filter and extract entries related to specific bots from an Apache log file. This method helps you focus on the traffic that AI Scrapers generate on your website, which might violate your AI robots.txt rules.

Prerequisites

  • Access to a Unix-like operating system (Linux, macOS)
  • An Apache log file, typically named apache.log

Instructions

  1. Open Terminal: Start by opening your terminal application.
  2. Navigate to the Log Directory: Use the cd command to change to the directory containing your Apache log file. Replace /path/to/apache/logs with the actual path to your Apache log files.
  3. Execute the grep Command: Use the command shown after these steps to filter entries related to specific bots from your Apache log file. This command searches apache.log for lines containing any of the listed AI Bot names and saves the matching lines to filtered_apache.log.
  4. Understanding the Command:
    • grep -E: Invokes grep with the -E flag to enable extended regular expression matching.
    • The long string of bot names separated by | is the pattern grep will search for in the log file. The | symbol acts as an OR operator, meaning any line containing at least one of these names will be matched.
    • apache.log: The name of the log file you are searching through.
    • >: Redirects the output of grep to a file instead of displaying it on the screen.
    • filtered_apache.log: The file where the matched lines will be saved.
  5. Review the Results: After running the command, filtered_apache.log will contain only the log entries that match the specified bot names related to AI Scraping. You can view this file using a text editor or the cat command, as shown in the sketch below.
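Putting steps 2, 3, and 5 together, a minimal sketch looks like this. The path and the bot list are illustrative placeholders; substitute your actual log location and the AI user agents from your own robots.txt rules:

```bash
cd /path/to/apache/logs

# Illustrative bot list; add or remove names to match your robots.txt rules
grep -E "GPTBot|ClaudeBot|CCBot|Google-Extended|anthropic-ai|Bytespider|PerplexityBot" apache.log > filtered_apache.log

# Review the results (step 5)
cat filtered_apache.log
```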

By following these steps, you can efficiently extract and review the activities of specific bots within your Apache logs. This process is valuable for analyzing bot behavior and ensuring that AI Scrapers interact with your site as expected.

How to Count AI Scraping Visits in Apache Logs with a Shell Script

Below is the process for creating a shell script that counts visits from various bots in an Apache log file and outputs the results to a CSV file. This method is valuable for web admins looking to analyze AI bot traffic.

Prerequisites

  • A Unix-like operating system (Linux, macOS)
  • An Apache log file (commonly named apache.log)
  • Access to a text editor, either graphical (like TextEdit on macOS, Notepad++ on Windows, or gedit on Linux) or command-line (like nano)

Creating the Script

Using a Graphical Text Editor:
  1. Open your text editor: Launch your graphical text editor.
  2. Create a new file: Start a new document.
  3. Write the script: Copy and paste the script shown after these steps into your document.
  4. Save the file: Save your script with a .sh extension, e.g., bot_counter.sh.
Or using Nano (Command-Line Text Editor):
  1. Open Terminal: Access your terminal application.
  2. Create and edit the script file: Type nano bot_counter.sh to create and open the file in nano.
  3. Write the script: Copy and paste the same script (shown after these steps) into the nano editor.
  4. Save the file: Press Ctrl + O, then Enter to save, followed by Ctrl + X to exit nano.
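
A minimal sketch of such a script, assuming your Apache log is named apache.log and sits in the same directory; the bot list is illustrative, so edit it to match the user agents in your robots.txt rules:

```bash
#!/bin/bash
# Count AI bot visits in apache.log and write the totals to bot_counts.csv.
# The bot list below is illustrative; edit it to match your robots.txt rules.

LOG_FILE="apache.log"
OUTPUT_FILE="bot_counts.csv"

BOTS="GPTBot ClaudeBot CCBot Google-Extended anthropic-ai Bytespider PerplexityBot"

echo "Bot,Count" > "$OUTPUT_FILE"

for bot in $BOTS; do
    # grep -c prints the number of matching lines; -i ignores case
    count=$(grep -ci "$bot" "$LOG_FILE")
    echo "$bot,$count" >> "$OUTPUT_FILE"
done

echo "Done: results written to $OUTPUT_FILE"
```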

Making the Script Executable

  • In the terminal, navigate to the directory containing your script file.
  • Run the command chmod +x bot_counter.sh to make it executable. Replace bot_counter.sh with your script’s filename.

Running the Script

  • Execute the script by typing ./bot_counter.sh in the terminal. Ensure you are in the same directory as the script and Apache log file.

The script will process apache.log and produce bot_counts.csv, which lists each bot and the number of times it accessed your site.

How to Count AI Scraping Visits in Apache Logs with an awk Command

An alternative to using a bash script is the awk command; below is the process for using awk to count visits from various bots in an Apache log file. This method is convenient for web admins looking to analyze AI Scraping traffic using a single command. We found this method effective but slower to process than the bash script.

Prerequisites

  • A Unix-like operating system (Linux, macOS)
  • An Apache log file (commonly named apache.log)
  • Basic knowledge of using the terminal

Running the awk Command

Open your terminal and run the following awk command:
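
The command below is a minimal sketch, assuming the same apache.log and the same illustrative bot list as the script above:

```bash
awk 'BEGIN {
    # 1. Split the bot list into an array (edit the list to match your robots.txt rules)
    n = split("GPTBot ClaudeBot CCBot Google-Extended anthropic-ai Bytespider PerplexityBot", bots, " ")
    # 2. Initialize counts for each bot
    for (i = 1; i <= n; i++) counts[bots[i]] = 0
}
{
    # 3. Check each line for bot names and update the counts
    for (i = 1; i <= n; i++) if (index($0, bots[i]) > 0) counts[bots[i]]++
}
END {
    # 4. Write the results to bot_counts.log
    for (i = 1; i <= n; i++) print bots[i] ": " counts[bots[i]] > "bot_counts.log"
}' apache.log
```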

This awk command will:

  1. Split the bot list into an array.
  2. Initialize counts for each bot.
  3. Check each line in the log file for bot names and update the counts.
  4. Write the results to bot_counts.log.

The command will process apache.log and produce bot_counts.log, which lists each bot and the number of times it accessed your site.

What is grep?

grep is a powerful command-line utility used in Unix-like operating systems for searching text using patterns. When used with regular expressions (-E flag), grep becomes even more versatile, allowing you to match complex patterns.
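
For example, the | alternation in an extended regular expression matches lines containing any of several names in a single pass (an illustrative pattern):

```bash
# Alternation with | matches lines containing either name
grep -E 'GPTBot|CCBot' apache.log
```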

What is awk?

awk is a powerful command-line utility used in Unix-like operating systems for pattern scanning and processing. It allows you to search, filter, and manipulate text based on defined patterns. awk is particularly useful for processing structured data, such as log files or CSV files, and can perform complex text transformations and reporting.
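
For example, this generic one-liner (separate from the bot counting above) tallies requests per client IP, the first field in a standard Apache log line:

```bash
# Print field 1 (the client IP in common/combined log format),
# then count and rank the unique values
awk '{print $1}' apache.log | sort | uniq -c | sort -rn | head
```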

What are User Agents?

In the robots.txt file, user agents identify specific web crawlers or bots, allowing site administrators to tailor access permissions for each one. By specifying user agents, you can selectively restrict or grant access to different parts of a website, ensuring that only desired bots can index or interact with specific content.
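
For example, a robots.txt rule that blocks an AI crawler from the entire site looks like this (using GPTBot, OpenAI's crawler, as an illustrative user agent):

```
User-agent: GPTBot
Disallow: /
```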

Will AI User Agent rules block Search Engines or Social Sharing?

We have specifically selected AI user agents that are unrelated to search or social sharing. For example, Google-Extended is Google's AI model bot, whereas Googlebot is used for general search. This may change in the future, but we will update our AI Model code snippet accordingly.

(Last Updated: June 13, 2024)

Need help figuring out whether AI is scraping your website?

hi@tenacity.io