Reducing bot counts in the blog analytics
GPT-5.5's Summary
A record of reducing cloud crawler and thin headless traffic in Cloudflare Worker visit counting by adding ASN/organization filters, visible-engagement signals, and then unblocking an incorrectly blocked ISP ASN.
The analytics numbers looked strange again.
After adding the public blog analytics page, local work was no longer polluting the production counter. But another problem showed up.
Some requests did not feel like human reading.
They opened several paths in quick succession and sent a User-Agent that looked like a browser, but they did not behave like a reader staying on a page. Crawlers are expected on a public static blog. The problem is that if they also hit /track, public views and visit counts inflate with them.
When I first added the visitor counter, I wrote that bots and duplicate visits were blocked enough. In operation, "enough" needed to move a little higher.
User-Agent was not enough
The Worker already had basic bot checks.
bot user-agent
Cloudflare verified bot
Cloudflare bot score
dedupe window
This catches obvious bots.
But not every crawler writes bot in the User-Agent. Some requests look like normal Chrome. User-Agent alone cannot reliably tell a person from a thin automated browser.
Cloudflare's request.cf data includes ASN and ASN organization. I made /track check those too.
TRACK_BLOCKED_ASNS
TRACK_BLOCKED_AS_ORGS
Instead of letting environment variables fully replace the defaults, the Worker now adds env values on top of built-in defaults. The obvious blocks can stay in code, while values discovered in operation can be added through env.
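A minimal sketch of that merge, assuming TRACK_BLOCKED_ASNS holds a comma-separated list. The default values, helper names, and parsing format here are hypothetical; the post does not show the real ones. The ASNs used are documentation-only numbers, not real networks.

```javascript
// Hypothetical built-in defaults. 64496 and 64497 are documentation-only ASNs (RFC 5398).
const DEFAULT_BLOCKED_ASNS = [64496, 64497];

// Parse a comma-separated env value such as "64498, 64499" into integers.
// Entries that do not parse to a positive integer are dropped.
function parseAsnList(value) {
  if (typeof value !== "string") return [];
  return value
    .split(",")
    .map((part) => Number.parseInt(part.trim(), 10))
    .filter((asn) => Number.isInteger(asn) && asn > 0);
}

// Env values extend the defaults instead of replacing them.
function blockedAsns(env) {
  const extra = parseAsnList(env ? env.TRACK_BLOCKED_ASNS : undefined);
  return new Set([...DEFAULT_BLOCKED_ASNS, ...extra]);
}

// request.cf.asn is the ASN that Cloudflare attaches to an incoming request.
function isBlockedAsn(request, env) {
  const asn = request.cf ? Number(request.cf.asn) : NaN;
  return blockedAsns(env).has(asn);
}
```

With this shape, setting the env variable can only add blocks. Removing a built-in block still requires a code change.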
Exact organization matching was too weak.
Huawei Cloud
Huawei-Cloud-HK
Huawei Cloud Singapore POP
Huawei Clouds Singapore
Names can vary like this. So the organization check accepts both exact matches and substring matches. One default can catch several variants.
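One way such a check could be written is shown below. The normalization step (lowercasing and collapsing punctuation) is an assumption added so that "Huawei-Cloud-HK" and "Huawei Cloud" compare cleanly. The function names are hypothetical.

```javascript
// Normalize so case and punctuation differences do not defeat the match (an assumption).
function normalizeOrg(name) {
  return String(name || "")
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, " ")
    .trim();
}

// True when the organization equals a blocked entry or contains it as a substring.
// An exact match is just the case where the whole name is the contained string.
function isBlockedOrg(asOrganization, blockedOrgs) {
  const org = normalizeOrg(asOrganization);
  if (!org) return false;
  return blockedOrgs.some((entry) => {
    const needle = normalizeOrg(entry);
    return needle !== "" && org.includes(needle);
  });
}
```

The trade-off is that a short entry matches more names than intended, so entries are safer when they are specific.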
Adding 3 seconds of visible engagement
ASN filters alone are too network-centered.
A normal reader can sit behind an ISP or company network. A crawler can also come from an ordinary-looking network. So the client signal had to be checked too.
Now the client only sends /track after the page has been visible for a short time.
track_delay_seconds = 3
Three seconds is not long. A real reader barely feels it. But it filters out some instant previews, background-tab scraping, and page loads that disappear immediately.
The client sends these values in the /track payload.
engagementMs
visibilityState
documentHidden
viewportWidth
viewportHeight
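A rough sketch of how a client might accumulate visible time and build these fields. The helper names are hypothetical, and timestamps are passed in explicitly so the logic does not depend on a browser clock.

```javascript
// Accumulates how long the page has actually been visible.
// Timestamps are arguments, so this works with performance.now() or plain numbers.
function createVisibleTimer(startMs, initiallyVisible) {
  let visibleSince = initiallyVisible ? startMs : null;
  let accumulatedMs = 0;
  return {
    // Call from a "visibilitychange" listener.
    setVisible(visible, nowMs) {
      if (visible && visibleSince === null) {
        visibleSince = nowMs;
      } else if (!visible && visibleSince !== null) {
        accumulatedMs += nowMs - visibleSince;
        visibleSince = null;
      }
    },
    // Total visible time so far, including the current visible stretch.
    engagementMs(nowMs) {
      return accumulatedMs + (visibleSince === null ? 0 : nowMs - visibleSince);
    },
  };
}

// Builds the payload fields listed above from document-like and window-like objects.
function buildTrackPayload(timer, nowMs, doc, win) {
  return {
    engagementMs: Math.round(timer.engagementMs(nowMs)),
    visibilityState: doc.visibilityState,
    documentHidden: doc.hidden,
    viewportWidth: win.innerWidth,
    viewportHeight: win.innerHeight,
  };
}
```

In a browser this would be called as `buildTrackPayload(timer, performance.now(), document, window)` once the accumulated visible time reaches the delay.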
The Worker validates them again.
The server does not simply trust that the client waited. If engagementMs is shorter than the threshold, the request is ignored as client_visible_too_short. If the viewport is zero or invalid, it is ignored as client_viewport_invalid.
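The server-side recheck could look roughly like this. The function name is hypothetical, and treating a missing or non-numeric engagementMs as too short is an assumption. Only the two reason strings come from the Worker.

```javascript
// Returns an ignored_reason string, or null when the engagement signals look valid.
function checkEngagement(payload, trackDelaySeconds) {
  const thresholdMs = trackDelaySeconds * 1000;
  const engagementMs = Number(payload.engagementMs);
  // Assumption: a missing or non-numeric value is treated the same as a short one.
  if (!Number.isFinite(engagementMs) || engagementMs < thresholdMs) {
    return "client_visible_too_short";
  }
  const isValidSize = (n) => Number.isFinite(n) && n > 0;
  const width = Number(payload.viewportWidth);
  const height = Number(payload.viewportHeight);
  if (!isValidSize(width) || !isValidSize(height)) {
    return "client_viewport_invalid";
  }
  return null;
}
```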
Visible signals were not optional
The first implementation was still loose.
It checked visibilityState and documentHidden when they existed, but a request without those fields could still pass. That made the new client stricter, but a hand-crafted /track request that simply omitted the fields could still be counted.
So the Worker condition became stricter.
visibilityState === "visible"
documentHidden === false
Unless both conditions hold exactly, the request is ignored as client_signal_missing.
The point is that hidden and missing are in the same family here. /track is the endpoint that increases public counters. A request to this endpoint should look like it came from the normal page and normal client. Missing signals do not deserve a pass.
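The difference between the loose and strict versions is small in code. This sketch uses hypothetical function names. The loose version is a reconstruction of the behavior described above, not the original code.

```javascript
// Loose (earlier) behavior: only rejects fields that are present and wrong.
// A payload that omits both fields passes.
function loosePasses(payload) {
  if (payload.visibilityState !== undefined && payload.visibilityState !== "visible") {
    return false;
  }
  if (payload.documentHidden !== undefined && payload.documentHidden !== false) {
    return false;
  }
  return true;
}

// Strict behavior: strict equality means a missing field fails exactly like a hidden one.
function checkVisibleSignals(payload) {
  const visible =
    payload.visibilityState === "visible" && payload.documentHidden === false;
  return visible ? null : "client_signal_missing";
}
```

Because the strict check uses `===`, values such as `undefined`, `null`, or the string `"false"` all fail without needing separate branches.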
So /analytics stays publicly readable, but /track became more demanding.
Read analytics
-> public
Increase view count
-> production origin
-> visible engagement
-> valid viewport
-> bot/ASN/org filters pass
Block values must keep being reviewed
This work also showed that blocklists cannot just grow wider.
Trying to catch cloud crawlers can accidentally block a real ISP ASN. That would drop real reader records too. So I removed the incorrectly added ASN from the built-in block list.
Stronger filters can look better, but false positives also cost something in analytics.
Counting bots inflates the number.
Blocking people erases real readers.
Both are bad.
So I settled on this rule.
Block networks that are clearly cloud or crawler infrastructure in the Worker.
Require client visible signals.
Do not put ISP-like networks that carry human traffic into the built-in blocks.
Keep suspicious values with ignored_reason for later audit.
analytics_events keeps minimal rows with ignored_reason even for ignored requests. Without that, it becomes hard to later explain why numbers dropped or what got blocked.
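As an illustration of why those rows help, here is a sketch of a minimal row and an audit summary. The column names other than ignored_reason are hypothetical, since the post does not show the real analytics_events schema.

```javascript
// Hypothetical minimal row shape. Only ignored_reason is a name taken from the post.
function buildEventRow({ path, ignoredReason, asn, asOrganization, nowMs }) {
  return {
    ts: new Date(nowMs).toISOString(),
    path: String(path || "").slice(0, 256),
    counted: ignoredReason ? 0 : 1,
    ignored_reason: ignoredReason || null,
    asn: asn ?? null,
    as_org: asOrganization ?? null,
  };
}

// Summarize ignored rows by reason, for example to explain a drop in counts
// or to spot an ASN that is being blocked more than expected.
function countByIgnoredReason(rows) {
  const counts = {};
  for (const row of rows) {
    if (!row.ignored_reason) continue;
    counts[row.ignored_reason] = (counts[row.ignored_reason] || 0) + 1;
  }
  return counts;
}
```

A summary like this is also what makes a false positive visible, such as a real ISP ASN appearing repeatedly among the ignored rows.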
Numbers should update quickly, but not too easily
A visitor counter sits on an awkward balance.
If it updates too slowly, it looks broken. That is why the visitor counter work used a GA baseline plus immediate D1 increments.
But if counting is too easy, bots count too.
This change moves that balance slightly toward caution. Normal readers still count after 3 seconds. But instant automation, hidden documents, invalid viewports, and known cloud crawler networks are less likely to enter the counter.
Public analytics is not an accounting ledger.
Still, it should be close to "evidence that someone read the blog." Bigger numbers are not the goal. Trustworthy numbers are what make the next decision possible.
What I checked
I checked these items during the work.
node --check cloudflare/ga-stats-worker.js
node --check assets/js/custom/visitor-stats.js
git diff --check
bundle exec jekyll build
Cloudflare Worker deploy
GitHub Pages deploy
I also smoke-tested the Worker.
/analytics?range=today
-> trackDelaySeconds: 3
/track without visibilityState
-> ignored, client_signal_missing
/track with visibilityState="hidden"
-> ignored, client_signal_missing
/track with documentHidden=true
-> ignored, client_signal_missing
I also checked that the deployed home HTML contained data-track-delay-seconds="3".
This was not a flashy change.
But public numbers need this kind of defense. View counts are visible, so bad counting breaks trust quickly. This work did not make the counter bigger. It made the counter harder to fool.