Universal Web Crawler Blocking Report – Top Blocked Non-AI and AI Bots

Introduction

This research analyzes how often websites block web crawlers from their entire site, across a diverse range of websites, and answers two questions in particular: which web crawlers (robots) are most frequently blocked by top-traffic sites, and which categories of sites block each of the common web crawlers. The study measures the frequency at which specific crawlers are blocked across a dataset of 223,014 analyzed websites. We initially analyzed 86,151 top-traffic sites between April 15 and April 25, then repeated the analysis for 223,014 top-traffic sites (including the sites from the first run) between July 4 and July 30, drawing from our backlog of 1 million high-traffic websites. The second analysis covers more websites in order to support a more comprehensive conclusion. The dataset combines a gradual automated analysis of the 1 million high-traffic sites with websites manually analyzed by users through our robots.txt checking tool.

Robots.txt Fun Fact:

Did you know that in our analysis of the robots.txt files of over 223,000 top-traffic sites:

  • 🥇 FitnessFirst.com leads with 2,158 user-agent declarations in its robots.txt file.
  • 🥈 Op.nysed.gov follows with 1,807 user agents.
  • 🥉 En.mimi.hu ranks third with 1,801 user agents.

Data Collection

  • Website Selection: The dataset includes a comprehensive selection of websites drawn from two primary sources: an automated process that analyzes 1 million top-traffic websites and a manual analysis of websites by users of our free robots.txt testing tool.
  • Web Crawlers Blocking Analysis: The analysis focuses on the robots.txt files of the selected sites, identifying instances where specific web crawlers are blocked. The evaluation ensures each website is represented only once in the dataset, utilizing the most recent analysis of the robots.txt file of each website to avoid duplication.
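
To make the full-block criterion concrete, here is a minimal Python sketch of how a robots.txt file can be checked for a full-site block of a given user agent. This is an illustration, not our production pipeline: the exact-name matching and the `fetch_robots_txt` helper are simplifying assumptions.

```python
# Minimal sketch of the full-site block check described above.
# Assumption: a site "fully blocks" an agent when the agent's rule group
# contains "Disallow: /" and no Allow rule re-opens any path.
import urllib.request


def fetch_robots_txt(domain: str) -> str:
    """Illustrative helper: download https://<domain>/robots.txt."""
    url = f"https://{domain}/robots.txt"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return resp.read().decode("utf-8", errors="replace")


def is_fully_blocked(robots_txt: str, agent: str) -> bool:
    """Return True if `agent` is disallowed from the entire site."""
    agent = agent.lower()
    groups: dict[str, list[tuple[str, str]]] = {}  # agent -> (directive, path)
    current: list[str] = []
    previous_was_agent = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if not previous_was_agent:
                current = []  # a new group of User-agent lines starts
            current.append(value.lower())
            previous_was_agent = True
        elif field in ("allow", "disallow"):
            previous_was_agent = False
            for name in current:
                groups.setdefault(name, []).append((field, value))
    # Fall back to the wildcard group when the agent has no group of its own.
    rules = groups.get(agent, groups.get("*", []))
    blocked_all = any(d == "disallow" and p == "/" for d, p in rules)
    reopened = any(d == "allow" and p for d, p in rules)
    return blocked_all and not reopened


# Example: GPTBot is fully blocked; Googlebot (wildcard group) is not.
sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin/\n"
print(is_fully_blocked(sample, "GPTBot"))     # True
print(is_fully_blocked(sample, "Googlebot"))  # False
```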

Data Sources

  • Automated Analysis: Websites are selected for automated analysis based on their Open PageRank. The list of sites is obtained from DomCop’s list of the top 10 million websites, which is in turn derived from the Common Crawl project.
  • Manual Analysis: Users contribute to the dataset by manually submitting domains for analysis via our robots.txt testing tool. These submissions broaden the scope of the dataset and incorporate a wider variety of websites. However, the number of websites submitted by the users is comparatively minuscule.

Data Processing

  • Crawler Grouping: To streamline the analysis, crawlers from the same organization are grouped together. For example, various user agents associated with Semrush are consolidated into a single category, enhancing the clarity and interpretability of the results.
  • Crawler Name Normalization: The crawler names extracted from robots.txt files undergo normalization to account for variations in formatting and casing. This ensures accurate data categorization and aggregation.
  • Occurrence Counting: A blocking occurrence is counted only when a bot is entirely blocked from a site. A bot that is partially blocked (blocked from certain paths but not the entire site) is not counted. This criterion ensures that the analysis reflects only instances where bots are effectively excluded from accessing the site’s content.
  • Deduplication: We ensured that each website appears only once in our analysis, meaning that only the latest analysis for each website has been included in the results. (A combined sketch of the normalization, deduplication, and counting steps appears after the category examples below.)
  • Website Categories: In the first run of our analysis, websites were categorized based on a curated list of 185 categories. In our second round of analysis, we reorganized this list into 18 more general categories, using ChatGPT-4 to help create and organize the list. The final list of categories then underwent a manual human review to ensure accuracy, minimal overlap, and comprehensive coverage. This approach generalizes the categories for easier interpretation of the results while retaining the essential information. These categories can be seen below:
  1. News & Media
  2. Technology & Computing
  3. Education & Research
  4. Health & Wellness
  5. Business & Finance
  6. Lifestyle
  7. Arts & Culture
  8. Science & Environment
  9. Community & Social
  10. Commerce & Shopping
  11. Entertainment & Recreation
  12. Professional Services
  13. Government & Public Services
  14. Transport & Logistics
  15. Cultural & Historical
  16. Publishing & Writing
  17. Events & Conferences
  18. Technology Infrastructure

Here are some examples of sites in each category:

Category Name | Example Sites
News & Media | bbc.co.uk, nytimes.com, techcrunch.com, theguardian.com, washingtonpost.com
Technology & Computing | google.com, googletagmanager.com, maps.google.com, support.google.com, github.com
Education & Research | arxiv.org, ted.com, sciencedirect.com, apa.org, researchgate.net
Health & Wellness | who.int, thelancet.com, cdc.gov, healthline.com, mayoclinic.org
Business & Finance | forbes.com, ft.com, bloomberg.com, cnbc.com, mckinsey.com
Lifestyle | potofu.me, chefsfriends.nl, socialbutterflyguy.com, loveinlateryears.com, homefixated.com
Arts & Culture | pinterest.com, flickr.com, commons.wikimedia.org, archive.org, canva.com
Science & Environment | nasa.gov, ncbi.nlm.nih.gov, nationalgeographic.com, nature.com, news.sciencemag.org
Community & Social | facebook.com, instagram.com, vk.com, reddit.com, discord.gg
Commerce & Shopping | play.google.com, amazon.com, itunes.apple.com, fiverr.com, cdn.shopify.com
Entertainment & Recreation | youtube.com, vimeo.com, tiktok.com, open.spotify.com, twitch.tv
Professional Services | linkedin.com, mayerbrown.com, jobs.microfocus.com, pwc.com, randygage.com
Government & Public Services | ec.europa.eu, gov.uk, fao.org, un.org, europa.eu
Transport & Logistics | proterra.com, uber.com, bullettrain.jp, logisticpoint.net, rdw.com.au
Cultural & Historical | loc.gov, web-japan.org, mnhs.org, biblearchaeology.org, cliolink.com
Publishing & Writing | en.wikipedia.org, medium.com, blogger.com, ameblo.jp, issuu.com
Events & Conferences | eventbrite.com, calendly.com, meetup.com, sxsw.com, veritas.org
Technology Infrastructure | fonts.googleapis.com, ajax.googleapis.com, google-analytics.com, gstatic.com, cdn.jsdelivr.net
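
The grouping, normalization, deduplication, and counting steps described under Data Processing can be summarized in a short sketch. The alias map and the record layout below are illustrative assumptions for demonstration, not the study's actual tables.

```python
# Illustrative sketch of the data-processing steps above. The alias map
# and the (domain, timestamp, blocked_crawlers) record layout are
# assumptions, not the study's actual data structures.
from collections import Counter

# Example alias map: user-agent variants collapse into one crawler group.
CRAWLER_ALIASES = {
    "semrushbot": "SemrushBot",
    "semrushbot-sa": "SemrushBot",  # Semrush site-audit variant
    "semrushbot-ba": "SemrushBot",  # Semrush backlink-audit variant
    "gptbot": "GPTBot",
}


def normalize_crawler(raw_name: str) -> str:
    """Normalize casing/whitespace and collapse known aliases."""
    key = raw_name.strip().lower()
    return CRAWLER_ALIASES.get(key, key)


def count_full_blocks(records):
    """records: iterable of (domain, timestamp, [fully_blocked_crawlers]).
    Keeps only the latest analysis per domain (deduplication), then counts
    each normalized crawler at most once per domain."""
    latest = {}
    for domain, ts, blocked in records:
        if domain not in latest or ts > latest[domain][0]:
            latest[domain] = (ts, blocked)
    counts = Counter()
    for ts, blocked in latest.values():
        for name in {normalize_crawler(n) for n in blocked}:
            counts[name] += 1
    return counts


# Example: example.com is re-analyzed; only the later record counts.
records = [
    ("example.com", 1, ["GPTBot", "SemrushBot-SA"]),
    ("example.com", 2, ["GPTBot"]),
    ("example.org", 1, ["SemrushBot", "semrushbot-ba"]),
]
print(count_full_blocks(records))  # Counter({'GPTBot': 1, 'SemrushBot': 1})
```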

Analysis

  • Visualization: The research employs a bar chart to visualize the frequency of crawler exclusions across different bot categories. The Y-axis represents the number of websites blocking a specific robot, while the X-axis lists the bot names. A stacked bar chart is also presented to show the prevalence of crawler exclusion among different categories of websites. (A small plotting sketch appears after this list.)
  • Statistical Analysis: Simple quantitative analysis is conducted to identify trends and patterns in robot exclusions. The frequency of robot blocking is examined to discern prevalent practices among website owners.
  • Interpretation: The findings can be interpreted to provide insights into the prevalence and significance of web crawler exclusions in the online ecosystem, especially among top-traffic websites and among different categories of sites. Implications for website owners, technical SEO strategies, and robot behavior can be derived from this data by other interested researchers.
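
As a rough illustration of the visualization step, the snippet below rebuilds a slice of the bar chart from the counts reported in the Results section. The tooling (matplotlib) is an assumption, not necessarily what produced the original charts.

```python
# Sketch: rebuild a slice of the bar chart from the reported top-5 counts.
# matplotlib is assumed here; it is not necessarily the original tooling.
import matplotlib.pyplot as plt

top_blocked = {
    "GPTBot": 6904,
    "SemrushBot": 5589,
    "CCBot": 4664,
    "Google-Extended": 3840,
    "TeleportBot": 3765,
}

plt.bar(list(top_blocked), list(top_blocked.values()))
plt.xlabel("Crawler")                        # X-axis: bot names
plt.ylabel("Websites blocking the crawler")  # Y-axis: blocking site count
plt.title("Top blocked web crawlers (top 5 shown)")
plt.tight_layout()
plt.show()
```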

Results

The results of the ongoing analysis are published on Nexunom’s Robots.txt Checker page. Here is a snapshot of the results at 223,014 analyzed websites. The bar chart shows how many times each crawler name was disallowed from accessing the entire site in a website’s robots.txt file (top 15 crawlers), and the following table shows the exact number of times each of the top 15 crawlers has been blocked.

[Figure: 15 Top Blocked Web Crawlers – Bar Chart]
[Figure: Number of Times Each Web Crawler Has Been Blocked]

As the table above shows, GPTBot, SemrushBot, CCBot, Google-Extended, and TeleportBot are the five most blocked crawlers in the robots.txt files of top-traffic sites. The table below gives a more comprehensive list of the top 60 blocked web crawlers, along with each crawler’s rank in the top blocked list.

Rank | Crawler Name | Category | Occurrence
1 | GPTBot | AI | 6904
2 | SemrushBot | SEO | 5589
3 | CCBot | AI | 4664
4 | Google-Extended | AI | 3840
5 | TeleportBot | Web Scraping | 3765
6 | ChatGPT-User | AI | 3556
7 | MJ12bot | SEO | 3433
8 | AhrefsBot | SEO | 3252
9 | anthropic-ai | AI | 2331
10 | FacebookBot | Social Media | 2300
11 | dotbot | SEO | 2001
12 | WebCopier | Web Scraping | 1841
13 | WebStripper | Web Scraping | 1811
14 | ClaudeBot | AI | 1799
15 | Offline Explorer | Web Scraping | 1785
16 | WebZIP | Web Scraping | 1785
17 | Bytespider | Web Scraping | 1761
18 | Claude-Web | AI | 1756
19 | SiteSnagger | Web Scraping | 1755
20 | larbin | Web Scraping | 1649
21 | Amazonbot | Search Engine | 1629
22 | MSIECrawler | Web Scraping | 1616
23 | omgilibot | Web Scraping | 1604
24 | PetalBot | Search Engine | 1574
25 | Baiduspider | Search Engine | 1570
26 | blexbot | SEO | 1556
27 | omgili | Web Scraping | 1537
28 | HTTrack | Web Scraping | 1460
29 | wget | Web Scraping | 1414
30 | ZyBORG | Web Scraping | 1400
31 | PerplexityBot | AI | 1371
32 | Yandex | Search Engine | 1370
33 | WebReaper | Web Scraping | 1358
34 | NPBot | SEO | 1349
35 | Xenu | SEO | 1342
36 | ia_archiver | Web Scraping | 1332
37 | TurnitinBot | Academic | 1240
38 | grub-client | Web Scraping | 1239
39 | sitecheck.internetseer.com | SEO | 1195
40 | Fetch | Web Scraping | 1189
41 | cohere-ai | AI | 1180
42 | Zealbot | Web Scraping | 1179
43 | Download Ninja | Web Scraping | 1179
44 | linko | SEO | 1169
45 | libwww | Web Scraping | 1156
46 | Zao | Web Scraping | 1122
47 | Microsoft.URL.Control | Web Scraping | 1117
48 | UbiCrawler | Web Scraping | 1098
49 | DOC | Web Scraping | 1089
50 | magpie-crawler | Web Scraping | 1064
51 | k2spider | Web Scraping | 1051
52 | Diffbot | Web Scraping | 961
53 | DataForSeoBot | SEO | 944
54 | fast | Web Scraping | 856
55 | psbot | Web Scraping | 820
56 | Mediapartners-Google* | AdBot | 815
57 | 008 | Web Scraping | 738
58 | Scrapy | Web Scraping | 722
59 | Zeus | Web Scraping | 697
60 | WebBandit | Web Scraping | 685

From the table above, it is evident that 5 of the top 10 blocked web crawlers are AI-related, while SEO and Web Scraping tools account for 4 of the top 10.

The following stacked bar chart shows which categories of sites have blocked a specific crawler in their robots.txt files. For example, the stacked bar for GPTBot reveals that “News & Media,” “Technology & Computing,” and “Entertainment & Recreation” are among the top categories of sites blocking OpenAI’s crawler. The same trend applies to the rest of the AI crawlers.

Note: Unlike the bar chart above, this stacked chart only represents the websites analyzed from our dataset of 1 million top-traffic sites and does not include the sites analyzed by users via Nexunom’s robots.txt checker, so it better represents the blocking behavior of top-traffic sites. (A sketch for reproducing this kind of stacked view appears after the figure.)

[Figure: Categories of Sites Blocking Each Web Crawler – Stacked Bar Chart]
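
For readers who want to reproduce the stacked view, a sketch using pandas and matplotlib follows. The few sample rows are taken from the table below; the tooling choice is an assumption rather than the original pipeline.

```python
# Sketch: rebuild the stacked bar chart from (crawler, site category, count)
# rows. The sample rows come from the table below; pandas + matplotlib is
# an assumed toolchain, not the original one.
import pandas as pd
import matplotlib.pyplot as plt

rows = [
    ("GPTBot", "News & Media", 2480),
    ("GPTBot", "Technology & Computing", 724),
    ("CCBot", "News & Media", 1916),
    ("CCBot", "Technology & Computing", 354),
]
df = pd.DataFrame(rows, columns=["crawler", "category", "blocked"])

# One bar per crawler, stacked by site category.
pivot = df.pivot_table(index="crawler", columns="category",
                       values="blocked", aggfunc="sum", fill_value=0)
pivot.plot(kind="bar", stacked=True)
plt.ylabel("Websites blocking the crawler")
plt.tight_layout()
plt.show()
```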

The following table shows up to five of the top site categories blocking each of the 15 most blocked crawlers.

Web Crawler | Site Category | Blocked Occurrence | Bot Category
AhrefsBot | Technology & Computing | 500 | SEO
AhrefsBot | News & Media | 393 | SEO
AhrefsBot | Education & Research | 366 | SEO
AhrefsBot | Commerce & Shopping | 341 | SEO
anthropic-ai | News & Media | 870 | AI
anthropic-ai | Lifestyle | 225 | AI
anthropic-ai | Technology & Computing | 207 | AI
anthropic-ai | Entertainment & Recreation | 197 | AI
anthropic-ai | Arts & Culture | 178 | AI
CCBot | News & Media | 1916 | AI
CCBot | Technology & Computing | 354 | AI
CCBot | Arts & Culture | 335 | AI
CCBot | Entertainment & Recreation | 332 | AI
CCBot | Lifestyle | 300 | AI
ChatGPT-User | News & Media | 1639 | AI
ChatGPT-User | Technology & Computing | 266 | AI
ChatGPT-User | Entertainment & Recreation | 223 | AI
ChatGPT-User | Lifestyle | 218 | AI
ChatGPT-User | Arts & Culture | 212 | AI
ClaudeBot | News & Media | 662 | AI
ClaudeBot | Arts & Culture | 170 | AI
ClaudeBot | Technology & Computing | 170 | AI
ClaudeBot | Entertainment & Recreation | 158 | AI
ClaudeBot | Education & Research | 115 | AI
dotbot | Technology & Computing | 285 | SEO
dotbot | News & Media | 256 | SEO
dotbot | Commerce & Shopping | 242 | SEO
dotbot | Arts & Culture | 200 | SEO
dotbot | Education & Research | 182 | SEO
FacebookBot | News & Media | 786 | Social Media
FacebookBot | Lifestyle | 216 | Social Media
FacebookBot | Arts & Culture | 204 | Social Media
FacebookBot | Technology & Computing | 203 | Social Media
FacebookBot | Entertainment & Recreation | 188 | Social Media
Google-Extended | News & Media | 1771 | AI
Google-Extended | Technology & Computing | 300 | AI
Google-Extended | Entertainment & Recreation | 287 | AI
Google-Extended | Arts & Culture | 280 | AI
Google-Extended | Publishing & Writing | 252 | AI
GPTBot | News & Media | 2480 | AI
GPTBot | Technology & Computing | 724 | AI
GPTBot | Entertainment & Recreation | 559 | AI
GPTBot | Arts & Culture | 507 | AI
GPTBot | Publishing & Writing | 492 | AI
MJ12bot | Technology & Computing | 438 | SEO
MJ12bot | Commerce & Shopping | 420 | SEO
MJ12bot | Publishing & Writing | 387 | SEO
MJ12bot | Education & Research | 384 | SEO
MJ12bot | Arts & Culture | 336 | SEO
Offline Explorer | Education & Research | 410 | Web Scraper
Offline Explorer | Publishing & Writing | 319 | Web Scraper
Offline Explorer | News & Media | 191 | Web Scraper
Offline Explorer | Arts & Culture | 147 | Web Scraper
Offline Explorer | Commerce & Shopping | 139 | Web Scraper
SemrushBot | Technology & Computing | 913 | SEO
SemrushBot | Education & Research | 874 | SEO
SemrushBot | Arts & Culture | 617 | SEO
SemrushBot | Commerce & Shopping | 484 | SEO
SemrushBot | News & Media | 454 | SEO
TeleportBot | Education & Research | 842 | Web Scraper
TeleportBot | Publishing & Writing | 651 | Web Scraper
TeleportBot | News & Media | 420 | Web Scraper
TeleportBot | Arts & Culture | 308 | Web Scraper
TeleportBot | Technology & Computing | 307 | Web Scraper
WebCopier | Education & Research | 418 | Web Scraper
WebCopier | Publishing & Writing | 322 | Web Scraper
WebCopier | News & Media | 189 | Web Scraper
WebCopier | Arts & Culture | 155 | Web Scraper
WebCopier | Commerce & Shopping | 144 | Web Scraper
WebStripper | Education & Research | 413 | Web Scraper
WebStripper | Publishing & Writing | 318 | Web Scraper
WebStripper | News & Media | 194 | Web Scraper
WebStripper | Arts & Culture | 151 | Web Scraper
WebStripper | Technology & Computing | 142 | Web Scraper

As the table above indicates, for crawlers in the AI category, the site category that blocks them most frequently is News & Media, as seen with bots like GPTBot, ChatGPT-User, Google-Extended, anthropic-ai, and ClaudeBot. For SEO crawlers such as SemrushBot, AhrefsBot, and MJ12bot, the top blocking site category is Technology & Computing. For web-scraper crawlers such as Offline Explorer and WebCopier, the most common blocking site category is Education & Research.

Limitations and Considerations

  • User Contributions: While the dataset may include domains searched by users in our robots.txt tester, the proportion of such data is considered negligible and does not significantly influence the results.
  • Single Domain Representation: Each domain is counted only once in the analysis, with the latest crawl of its robots.txt file contributing to the dataset. This approach ensures fair representation and avoids skewing the results based on multiple entries for the same domain.
  • Incomplete Data: While efforts are made to continuously update the dataset, it may not capture all domains or reflect instantaneous changes in robot exclusion practices across the web.
  • Potential Sampling Bias: The dataset’s composition may be influenced by sampling bias inherent in selecting domains from the top 1 million high-traffic websites, potentially limiting the generalizability of the findings to the web as a whole.

Analysis of the Results

Conclusion One:

The analysis of the results indicates that certain categories of web crawlers are among the most frequently blocked across the analyzed websites. Notably, web crawlers related to artificial intelligence tools (AI crawlers), such as GPTBot, Google-Extended, and ChatGPT-User, feature prominently among the top blocked web crawlers. CCBot (Common Crawl Bot), which historically provides data for training AI tools, is also found among the top 5 blocked crawlers. Our results showed that 5 of the top 10 blocked crawlers were AI-related.

PerplexityBot, another AI crawler, sits at position 31 in the top blocked list, likely because it is newer and less widely known. ClaudeBot and Claude-Web rank 14th and 18th, respectively, giving them a more prominent position among the top blocked AI-related web crawlers.

Conclusion Two:

Additionally, web crawlers associated with search engine optimization (SEO) tools, such as SemrushBot, MJ12bot (Majestic’s web crawler), AhrefsBot, and dotbot (Moz’s web crawler), are prevalent in the list of frequently blocked crawlers, in positions 2, 7, 8, and 11, respectively.

Conclusion Three:

Another interesting finding was that SEO-related web crawlers were blocked most by technology and computing sites, AI-related web crawlers by news and media sites, and web scraper crawlers by education and research sites.

While these findings offer insight into common robot-exclusion practices, the results are presented for informational purposes only. Some web crawlers, particularly those related to AI tools like GPTBot or Google-Extended, may contribute valuable traffic to the sites they crawl. We therefore advise against hasty blocking, so that legitimate bot traffic is not inadvertently excluded.

The findings contribute to our understanding of robot crawling management practices and can help form strategies for handling web crawler traffic effectively. If you want to contribute to this research or have any suggestions or recommendations for us, please feel free to leave a comment below.

Author

  • Saeed Khosravi

    Saeed Khosravi is an SEO Strategist, Digital Marketer, and WordPress Expert with over 15 years of experience, having started his career in 2008. He graduated with a degree in MIB Marketing from HEC Montreal. As the Founder and CEO of Nexunom, Saeed, alongside his dedicated team, provides comprehensive digital marketing solutions to local businesses. He is also the founder and main brain behind several successful marketing SaaS platforms, including Allintitle.co, ReviewTool.com, and Tavata.com.
