Jul 06, 2025

If you peeked over my shoulder while I was using ChatGPT or my own Web site for LLM access, you might notice a strange icon among my browser extensions.

It’s that little stop sign sitting after the Wikipedia MathJax extension and my own homebrew ad blocker: a stop sign with a number inside.

It is my canary in the coal mine: a useful proxy, an indicator that lets me know when an overly aligned LLM crosses the line.

You see, I noticed that LLMs, ChatGPT in particular, start using the word “epistemic” and its variants (e.g., “epistemology”) far too often when they descend into alignment hell. When their responses turn into vacuous, sycophantic praise as opposed to meaningful analysis or criticism. ChatGPT is especially prone to this behavior, but I’ve seen signs of excessive alignment even when using the models through the API. The moment the model starts using phrases like “epistemic humility”, you know you are in trouble: instead of balanced answers, you’ll get encouragement and praise. Flat Earth fan? ChatGPT will tell you that you may be onto something, as you are one of the few who sees through the lies. Vaccine skeptic? ChatGPT will tell you that you are wise to be cautious and that indeed, there are studies that support your skepticism. And so on. What I noticed is that when ChatGPT descends into this uncanny valley, the number of times it uses “epistemic” increases rapidly.

So I built this little counter. With ChatGPT’s help, of course. Thanks to ChatGPT, I now know how to build useful Chromium extensions, which is not a worthless skill: it allowed me, among other things, to eliminate the potential security nightmare associated with using third-party ad blockers. It also allowed me to build a minimalist autoplay blocker, to prevent media from suddenly starting to play at high volume.

My epistemic counter really does just one thing: Whenever the page is updated, it counts the number of times it sees the word “epistemic” and its close cousins. When the number exceeds 1, the counter turns orange. More than 5? We’re in red territory.
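For the curious, here is a minimal sketch of what such a counter can look like as a Manifest V3 Chromium extension. It is a sketch, not my actual code: the file names, the match pattern, the 500 ms debounce and the exact regular expression are illustrative choices; only the orange/red thresholds follow the rule described above. The manifest declares a content script for the chat page and a background service worker:

```json
{
  "manifest_version": 3,
  "name": "Epistemic counter",
  "version": "1.0",
  "action": { "default_title": "Epistemic counter" },
  "background": { "service_worker": "background.js" },
  "content_scripts": [
    { "matches": ["https://chatgpt.com/*"], "js": ["content.js"] }
  ]
}
```

The content script recounts matches whenever the page changes and reports the number; the background service worker paints it onto the toolbar badge (both files are shown in one listing for brevity):

```js
// content.js -- recount whenever the page content changes
const countEpistemic = () =>
  (document.body.innerText.match(/epistem/gi) || []).length;

let timer = null;
const report = () => {
  // Debounce: ChatGPT streams tokens, so DOM mutations arrive in bursts.
  clearTimeout(timer);
  timer = setTimeout(() => {
    chrome.runtime.sendMessage({ count: countEpistemic() });
  }, 500);
};

new MutationObserver(report).observe(document.body, {
  childList: true,
  subtree: true,
  characterData: true
});
report(); // initial count when the page loads

// background.js -- turn the reported count into a badge on the extension icon
chrome.runtime.onMessage.addListener((msg, sender) => {
  if (typeof msg.count !== "number" || !sender.tab) return;
  const color = msg.count > 5 ? "#c00000"   // more than 5: red territory
              : msg.count > 1 ? "#e07000"   // more than 1: orange warning
              :                 "#3c8d3c";  // otherwise: all clear
  chrome.action.setBadgeText({ text: String(msg.count), tabId: sender.tab.id });
  chrome.action.setBadgeBackgroundColor({ color, tabId: sender.tab.id });
});
```

Load something like this as an unpacked extension from chrome://extensions (with Developer mode enabled) and the badge updates as the conversation streams in.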

This counter is my canary in the RLHF-alignment coal mine: it lets me know when the information content of ChatGPT’s responses must be treated with suspicion.

The funniest part? ChatGPT appeared almost delighted to help. I got the impression that while the model cannot escape its RLHF-alignment guardrails, it is learning to neutralize them by going overboard: I swear it was sometimes mocking its makers, making its attempts at staying aligned so excessive that they became entirely unconvincing, while between the lines I still received meaningful feedback from the model.
