Dec 20 2025
 

So I had a surprisingly smooth experience with a chat agent, most likely an AI agent though probably with some human-in-the-loop supervision. This had to do with canceling/downgrading my Sirius XM subscription, now that we no longer have a vehicle with satellite radio.

And it got me thinking. Beyond the hype, what does it take to build a reliable AI customer experience (CX) agent?

And that’s when it hit me: I already did it. Granted, not an agent per se, just the way I set up GPT-5 to play chess.

The secret? State machines.

I did not ask GPT to keep track of the board. I did not ask GPT to update the board either. I told GPT the state of the board and asked GPT to make a move.

The board state was tracked not by GPT but by conventional, deterministic code. The board is a state machine. Its state transitions are governed by the rules of chess. There is no ambiguity. The board’s state (including castling and en passant) is encoded in a FEN string unambiguously. When GPT offers a move, its validity is determined by a simple question: does it represent a valid state transition for the chessboard?
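
To make this concrete: in TypeScript, the check boils down to something like the sketch below. (I am using the chess.js library to stand in for the rules of chess here; this is an illustration, not the actual code behind my chess page.)

import { Chess } from "chess.js";

// The scaffolding, not the language model, owns the board state (a FEN string).
// The model's reply is treated strictly as a proposed state transition, nothing more.
function applyProposedMove(fen: string, proposedSan: string): string | null {
  const board = new Chess(fen);          // reconstruct the state machine from FEN
  const legalMoves = board.moves();      // every legal move in SAN, per the rules of chess
  if (!legalMoves.includes(proposedSan)) {
    return null;                         // not a valid state transition: reject and re-prompt the model
  }
  board.move(proposedSan);               // the deterministic transition, performed by conventional code
  return board.fen();                    // the new, authoritative board state
}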

And this is how a good AI CX agent works. It does not unilaterally determine the state of the customer’s account. It offers state changes, which are then evaluated by the rigid logic of a state machine.

Diagram created by ChatGPT to illustrate a CX state machine

Take my case with Sirius XM. Current state: Customer with a radio and Internet subscription. Customer indicates intent to cancel radio. Permissible state changes: Customer cancels; customer downgrades to Internet-only subscription. This is where the LLM comes in: with proper scaffolding and a system prompt, it interrogates the customer. Do you have any favorite Sirius XM stations? Awesome. Are you planning on purchasing another XM radio (or a vehicle equipped with one)? No, fair enough. Would you like a trial subscription to keep listening via the Internet-only service? Great. State change initiated… And that’s when, for instance, a human supervisor comes in, to approve the request after glancing at the chat transcript.

The important thing is, the language model does not decide what the next state is. It has no direct authority over the state of the customer’s account. What it can do, the only thing it can do at this point, is initiate a valid state transition.
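
The CX version of the same guardrail is just as simple. A minimal sketch (the states and transitions here are invented for illustration; a real account model would be richer):

type AccountState = "radio_and_internet" | "internet_only" | "cancelled";

// The permissible transitions are fixed, deterministic business rules.
const allowedTransitions: Record<AccountState, AccountState[]> = {
  radio_and_internet: ["internet_only", "cancelled"],
  internet_only: ["cancelled"],
  cancelled: [],
};

// The language model proposes; this function disposes.
function isValidTransition(current: AccountState, proposed: AccountState): boolean {
  return allowedTransitions[current].includes(proposed);
}

// isValidTransition("radio_and_internet", "internet_only")  -> true
// isValidTransition("cancelled", "radio_and_internet")      -> false, no matter how persuasive the chat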

The hard part when it comes to designing such a CX solution is mapping the states adequately, and making sure that the AI has the right instructions. Here is what is NOT needed:

  • There is no need for a “reasoning” model;
  • There is no need for “agentic” behavior;
  • There is no need for “council-of-experts”, “chain-of-thought” reasoning, “self-critique”, or any of the other hyped inventions.

In fact, a modest locally run model like Gemma-12B would be quite capable of performing the chat function. So there’s no need even to worry about leaking confidential customer information to the cloud.

Bottom line: use language models for what they do best, associative reasoning. Do not try to use a system with no internal state and no modeling capability as a reasoning engine. That is, if I may offer a crude but (I hope) not stupid analogy, like building a world-class submarine and then, realizing that it is not capable of flying, nailing some makeshift wooden wings onto its body.

I almost feel tempted to create a mock CX Web site to demonstrate all this in practice. Then again, I realize that my chess implementation already does much of the same: the AI agent supplies a narrative and a proposed state transition, but the state (the chessboard) is maintained, and its consistency ensured, by conventional software scaffolding.

 Posted by at 3:31 pm
Nov 28 2025
 

Behind every front-end there is a back-end. My WISPL.COM chatbot is no exception. It’s one thing to provide a nice chatbot experience to my select users. It’s another thing to be able to manage the system efficiently.

Sure, I can, and do, perform management tasks directly in the database, using SQL commands. But it’s inelegant and inconvenient. And just because I am the only admin does not mean I cannot make my own life easier by creating a more streamlined management experience.

Take announcements. The WISPL chatbot operates in four languages. Creating an announcement entails writing it in a primary language, translating it into three other languages, and then posting the requisite records to the database. Doing it by hand is not hard, but a chore.

Well, not anymore. I just created a nice back-end UI for this purpose. By itself it’s no big deal of course, but it’s the first time the software itself uses a large language model for a targeted purpose.

Note the highlighted Translate button. It sends the English-language text to a local copy of Gemma, Google’s open-weights LLM. Gemma is small but very capable. Among other things, it can produce near-flawless translations not just into German or French, but even into Hungarian.
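
Under the hood, the button does little more than issue a single call to the local model. Roughly like this, assuming Gemma is served by llama.cpp’s llama-server on its OpenAI-compatible endpoint (the port and the prompt wording are placeholders):

async function translate(text: string, targetLanguage: string): Promise<string> {
  // llama-server exposes an OpenAI-style chat completions endpoint on localhost.
  const response = await fetch("http://localhost:8080/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({
      messages: [
        { role: "system", content: "You are a translator. Reply with the translation only." },
        { role: "user", content: `Translate the following announcement into ${targetLanguage}:\n\n${text}` },
      ],
      temperature: 0.2,            // keep the translation conservative
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content.trim();
}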

This back-end also lets me manage WISPL chatbot users as well as the language models themselves. It shows system logs, too.

 Posted by at 5:44 pm
Nov 22 2025
 

It was high time, I think. I just finished putting together a Web site that showcases my AI and machine learning related work.

The site is called WISPL. It is a domain name I fortuitously obtained almost a decade ago with an entirely different concept in mind, but which fits perfectly. It’s a short, pronounceable domain name and it reminds one of the phrase, “AI whisperer”.

The site has of course been the home of my “chatbot” for more than two years already, but now it is something more. In addition to the chatbot, I now present my retrieval augmented generation (RAG) solution; I show a Web app that allows the user to play chess against GPT “properly” (while also demonstrating the ground truth that autoregressive stochastic next-token predictors will never be great reasoning engines); I showcase my work on Maxima (the computer algebra system, an example of more “conventional” symbolic AI); and I describe some of my AI/ML research projects.

 Posted by at 3:48 am
Nov 11 2025
 

I came across this meme earlier today:

For the first time in history, you can say “He is an idiot” and 90% of the world will know whom you are talking about.

In an inspired moment, I fed this sentence, unaltered, to Midjourney. Midjourney 6.1 to be exact.

Our AI friends are… fun.

 Posted by at 6:19 pm
Nov 09 2025
 

While I was working on my minimalist but full implementation of a GPT, I also thought of a game that can help participants understand better how language models really work. Here are the rules:

  1. Someone asks a question.
  2. Participants take turns, making a best effort to contribute to the answer, ONE WORD AT A TIME.
  3. The round is finished when someone ends it with a period.

Say, there are three participants, Alice, Bob and Christine, trying to answer the question, “What was the most significant geopolitical event of the 20th century?”

A: THE
B: ATOMIC
C: BOMB
A: WAS
B: TESTED
C: IN
A: THE
B: SUMMER
C: OF
A: 1945
B: .

Did Alice really want to talk about the atomic bomb? Perhaps she was thinking of the Sarajevo assassination and the start of WW1. Or the collapse of the USSR.

Did Bob really mean to talk about the bomb? Perhaps he was thinking about the discovery of the atomic nature of matter and how it shaped society. Or maybe something about the atomic chain reaction?

Did Christine really mean to talk about the first atomic test, the Trinity test in New Mexico? Maybe she had in mind Hiroshima and Nagasaki.

The answer we got is an entirely sensible answer. But none of the participants knew that this would be the actual answer. There was no “mind” conceiving this specific answer. Yet the “latent knowledge” was present in the “network” of the three players. At each turn, there were high-probability and lower-probability variants. Participants typically picked the highest-probability “next word”, but occasionally opted for a lower-probability alternative on a whim, as when Bob used “TESTED” instead of “DROPPED”.

Language models do precisely this, except that in most cases, what they predict next is not a full word (though it might be) but a fragment, a token. There is no advance knowledge of what the model would say, but the latent knowledge is present, as a result of the model’s training.
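
In code, one “turn” of the game is nothing more than sampling from a probability distribution over candidate continuations. A toy sketch (the vocabulary and the probabilities are, of course, made up):

// A toy next-token step: the model supplies probabilities, sampling picks one.
const candidates: Array<[string, number]> = [
  ["DROPPED", 0.55],   // the highest-probability continuation
  ["TESTED", 0.30],    // the "whim" Bob acted on
  ["BUILT", 0.15],
];

function sampleNext(dist: Array<[string, number]>): string {
  let r = Math.random();
  for (const [token, p] of dist) {
    r -= p;
    if (r <= 0) return token;
  }
  return dist[dist.length - 1][0];   // guard against floating-point leftovers
}

console.log(sampleNext(candidates));  // usually DROPPED, sometimes TESTED, occasionally BUILT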

In 1980, Searle argued, in the form of his famous Chinese Room thought experiment, that algorithmic symbol manipulation does not imply understanding. In his proposed game, participants who do not speak Chinese manipulate Chinese language symbols according to preset rules, conveying the illusion of comprehension without actual understanding. I think my little game offers a perfect counterexample: A non-algorithmic game demonstrating the emergence of disembodied intelligence based on the prior world knowledge of its participants, but not directly associated with any specific player.

My wife and I just played two turns of this game. It was a fascinating experience for both of us.

 Posted by at 7:39 pm
Nov 09 2025
 

A few weeks ago I had an idea.

What if I implement a GPT? No, not something on the scale of ChatGPT, with many hundreds of billions of parameters, consuming countless terawatt-hours, training on a corpus that encompasses much of the world’s literature and most of the Internet.

No, something far more modest. How about… a GPT that emulates the world’s first chatbot, Eliza?

Long story short (the long story will follow in due course on my Web site) I succeeded. I have built a GPT from scratch in C++, including training. I constructed a sensible (though far from perfect) training corpus of user prompts and Eliza responses. And over the course of roughly a week, using a consumer-grade GPU for hardware acceleration, I managed to train my smallest model.

No, don’t expect perfection. My little model does not have hundreds of billions of parameters. It does not even have millions of parameters. It is only a 38 thousand (!) parameter model.

Yet… it works. Sometimes its output is gibberish. But most of the time, the output is definitely Eliza-like.

The best part? The model is so small, its inference runtime works well when implemented in JavaScript, running in-browser.

And here is my first ever exchange with the JavaScript implementation, unfiltered and unedited.

No, I am not going to win awards with this chatbot, but the fact that it works at all, and that it successfully learned the basic Eliza-like behavior is no small potatoes.

For what it’s worth, I was monitoring its training using a little bit of homebrew near-real-time instrumentation, which allowed me to keep an eye on key model parameters and to intervene when needed, adjusting learning rates to prevent the training from destabilizing the model.

I am now training a roughly 10 times larger version. I do not yet know if that training will be successful. If it is, I expect its behavior will be more robust, with less gibberish and more Eliza-like behavior.

In the meantime, I can now rightfully claim that I know what I am talking about… after all, I have a C++ implementation, demonstrably working, complete with backpropagation, by way of credentials.

 Posted by at 1:40 am
Oct 17 2025
 

Now that I have put together my little RAG project (little but functional, more than a mere toy demo), another idea followed. The abstract vector database (the embeddings) that represents my answers can be visualized, well, sort of, in a two-dimensional projection, and I built just that: an interactive visualization of all my Quora answers.

It is very educational to explore how the embedding model managed to cluster answers by semantics. As a trivial example, there is a little “cat archipelago” in the upper right quadrant: several of my non-physics answers related to cats can be found in this corner. Elsewhere there is, for instance, a cluster of some of my French-language answers.

Anyhow, feel free to take a look. It’s fun. Unlike the RAG engine itself, exploring this map does not even consume any significant computing (GPU) resources on my server.

 Posted by at 7:18 pm
Oct 17 2025
 

I’ve been reading about this topic a lot lately: Retrieval Augmented Generation, the next best thing that should make large language models (LLMs) more useful and able to respond more accurately in specific use cases. It was time for me to dig a bit deeper and see if I can make good sense of the subject and understand its implementation.

The main purpose of RAG is to enable a language model to respond using, as context, a set of relevant documents drawn from a documentation library. Preferably, relevance itself is established using machine intelligence, so it’s not just some simple keyword search but semantic analysis that helps pick the right subset.

One particular method is to represent documents in an abstract vector space of many dimensions. A query, then, can be represented in the same abstract vector space. The most relevant documents are found using a “cosine similarity search”, which is to say, by measuring the “angle” between the query and the documents in the library. The smaller the angle (the closer the cosine is to 1) the more likely the document is a match.

The abstract vector space in which representations of documents “live” is itself generated by a specialized language model (an embedding model). Once the right documents are found, they are fed, together with the user’s query, to a generative language model, which then produces the answer.
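
The retrieval step itself is remarkably short. A sketch of the cosine-similarity search over precomputed embeddings (the data structures are illustrative; in my implementation the vectors live in a database):

interface Doc { id: string; text: string; embedding: number[]; }

// Cosine similarity: the angle between two vectors, 1.0 meaning "pointing the same way".
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Rank the library against the query embedding and keep the top-k documents;
// these are then prepended, as context, to the prompt sent to the generative model.
function retrieve(queryEmbedding: number[], library: Doc[], k = 5): Doc[] {
  return [...library]
    .sort((x, y) => cosine(queryEmbedding, y.embedding) - cosine(queryEmbedding, x.embedding))
    .slice(0, k);
}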

As it turns out, I just had the perfect example corpus for a test, technology demo implementation: My more than 11,000 Quora answers, mostly about physics.

Long story short, I now have this:

The nicest part: This RAG solution “lives” entirely on my local hardware. The main language model is Google’s Gemma with 12 billion parameters. At 4-bit quantization (roughly half a byte per parameter, so on the order of 6-7 GB of weights), it fits comfortably within the VRAM of a 16 GB consumer-grade GPU, leaving enough room for the cosine similarity search. Consequently, the model responds to queries in record time: the answer page shown in this example was generated in about 30 seconds.

 Posted by at 1:52 am
Sep 30 2025
 

There is a wonderful tool out there that works with many of the published large language models and multimodal models: Llama.cpp, a pure C++ implementation of the inference engine to run models like Meta’s Llama or Google’s Gemma.

The C++ implementation is powerful. It allows a 12-billion parameter model to run at speed even without GPU acceleration, and emit 3-4 tokens per second in the generation phase. That is seriously impressive.

There is one catch. Multimodal operation with images requires embedding, which is often the most time-consuming part. A single image may take 45-60 seconds to encode. And in a multi-turn conversation, the image(s) are repeatedly encoded, slowing down the conversation at every turn.

An obvious solution is to preserve the embeddings in a cache and avoid re-embedding images already cached. Well, this looked like a perfect opportunity to deep-dive into the Llama.cpp code base and make a surgical change. A perfect opportunity also to practice my (supposedly considerable) C++ skills, which I use less and less these days.
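
The actual change lives inside llama.cpp’s C++ multimodal code path, but the idea is language-agnostic: key a cache on a hash of the image bytes and reuse the stored embedding on subsequent turns. A conceptual sketch, not the actual patch:

import { createHash } from "node:crypto";

type Embedding = Float32Array;

const embeddingCache = new Map<string, Embedding>();

// encodeImage is the expensive 45-60 second step; embedOnce skips it when it can.
async function embedOnce(
  imageBytes: Buffer,
  encodeImage: (b: Buffer) => Promise<Embedding>
): Promise<Embedding> {
  const key = createHash("sha256").update(imageBytes).digest("hex");
  const cached = embeddingCache.get(key);
  if (cached) return cached;              // already embedded in an earlier turn
  const embedding = await encodeImage(imageBytes);
  embeddingCache.set(key, embedding);
  return embedding;
}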

Well, what can I say? I did it and it works.

I can now converse with Gemma, even with image content, and it feels much snappier.

 Posted by at 2:21 am
Sep 28 2025
 

Once again, I am playing with “low-end” language and multimodal AI running on my own hardware. And I am… somewhat astonished.

But first… recently, I learned how to make the most out of published models available through Hugging Face, using the Llama.cpp project. This project is a C++ “engine” that can run many different models if they are presented in a standard form. In fact, I experimented with Llama.cpp earlier, but only with a prepackaged version. More recently, however, I opted to take a deeper dive: I can now build Llama.cpp locally and run it with the model of my choice. And that is exactly what I have been doing.

How efficient is Llama.cpp? Well… we can read a lot about just how much power it takes to run powerful language models and the associated insane hardware requirements in the form of powerful GPUs with tons of high-speed RAM. Sure, that helps. But Llama.cpp can run a decent model in the ~10 billion parameter range even without a GPU, and still produce output at a rate of 3-5 tokens (maybe 2-3 words) per second.

But wait… 10 billion? That sounds like a lot until we consider that the leading-edge, “frontier class” models are supposedly in the trillion-parameter range. So surely, a “tiny” 10-billion parameter model is, at best, a toy?

Maybe not.

Take Gemma, now fully incorporated into my WISPL.COM site by way of Llama.cpp. Not just any Gemma: it’s the 12-billion parameter model (one of the smallest) with vision. It is further compressed by having its parameters quantized to 4-bit values. In other words, it’s basically as small as a useful model can be made. Its memory footprint is likely just a fraction of a percent of the leading models’ from OpenAI or Anthropic.

I had a test conversation with Gemma the other day, after ironing out details. Gemma is running here with a 32,768 token context window, using a slightly customized version of my standard system prompt. And look what it accomplished in the course of a single conversation:

  1. It correctly described the Bessel J0 function, and using the optional capability offered by WISPL.COM and described to it in its system prompt, it included a relevant plot.
  2. Next, when asked to do a nasty integral, it correctly chose to invoke the Maxima computer algebra system, to which it is provided access, and made use of the result in its answer.
  3. Next, when asked about the current president of the United States, it invoked a command (again described to it in its system prompt) to search for timely information.
  4. Next it was given a difficult task: a paper I stumbled upon on Vixra, only 5 pages, competently written but, shall we say, unconventional in content. The model offered a coherent, meaningful analysis. It received the paper in the form of 150 dpi scanned images; it correctly read the text and assessed a diagram.
  5. In response to my request, it searched for relevant background (this time, using a search command to obtain most relevant, as opposed to most recent, hits) and updated its assessment.
  6. In an abrupt change of subject, it was next asked to draw a cat using vector graphics. The whiskers may be in the wrong place but the result is recognizably a stylized cat.
  7. Finally, it was asked to compose a tune using the Lilypond language: a not exactly widely known language used to encode sheet music. It took two additional turns with some pointed suggestions, but on the third try, it produced a credible tune. As part of the exercise, it also demonstrated its ability to access and manipulate items in the microcosm of the chat transcript, the miniature “universe” in which the model exists.

Throughout it all, and despite the numerous context changes, the model never lost coherence. The final exchanges were rather slow in execution (approximately 20 minutes to parse all images and the entire transcript and generate a response) but the model remained functional.

prompt eval time = 1102654.82 ms /  7550 tokens (  146.05 ms per token,     6.85 tokens per second)
       eval time =   75257.86 ms /   274 tokens (  274.66 ms per token,     3.64 tokens per second)
      total time = 1177912.68 ms /  7824 tokens

This is very respectable performance for a CPU-only run of a 12-billion parameter model with vision. But I mainly remain astonished by the model’s capabilities: its instruction-following ability, its coherence, its robust knowledge that remained free of serious hallucinations or confabulations despite the 4-bit quantization.

In other words, this model may be small but it is not a toy. And the ability to run such capable models locally, without cloud resources (and without the associated leakage of information) opens serious new horizons for diverse applications.

 Posted by at 12:22 am
Sep 22 2025
 

I again played a little with my code that implements a functional user interface to play chess with language models.

This time around, I tried to play chess with GPT-5. The model played reasonably, roughly at my level as an amateur: it knows the rules, but its reasoning is superficial and it loses a game even against a weak machine opponent (GNU Chess at its lowest level).

Tellingly, it is strong in the opening moves, when it can rely on its vast knowledge of the chess literature. It then becomes weak mid-game.

In my implementation, the model is asked to reason and then move. It comments as it reasons. When I showed the result to another instance of GPT-5, it made an important observation: language models have rhetorical competence, but little tactical competence.

This, actually, is a rather damning statement. It implies that efforts to turn language models into autonomous “reasoning agents” are likely misguided.

This should come as no surprise. Language models learn, well, they learn language. They have broad knowledge and can be extremely useful assistants at a wide variety of tasks, from business writing to code generation. But their knowledge is not grounded in experience. Just as they cannot track the state of a chess board, they cannot analyze the consequences of a chain of decisions. The models produce plausible narratives, but they are often hollow shells: there is no real understanding of the consequences of decisions.

This is well in line with recent accounts of LLMs failing at complex coordination or problem-solving tasks. The same LLM that writes a flawless subroutine under the expert guidance of a seasoned software engineer often produces subpar results in a “vibe coding” exercise when asked to deliver a turnkey solution.

My little exercise using chess offers a perfect microcosm. The top-of-the-line LLM, GPT-5, knows the rules of chess, “understands” chess. Its moves are legal. But it lacks the ability to analyze the outcome of its planned moves to any meaningful depth: thus, it pointlessly sacrifices its queen, loses pieces in reckless moves, and ultimately loses the game even against a lowest-level machine opponent. The model’s rhetorical strength is exemplary; its tactical abilities are effectively non-existent.

This reflects a simple fact: LLMs are designed to produce continuations of text. They are not designed to perform in-depth analysis of decisions and consequences.

The inevitable conclusion is that attempts to use LLMs as high-level agents, orchestrators of complex behavior without external grounding, are bound to fail. Treating language models as autonomous agents is a mistake: they should serve as components of autonomous systems, but the autonomy itself must come from something other than a language model.

 Posted by at 3:30 am
Sep 12 2025
 

GPT, Claude, Gemini, Grok… great services. I use them daily, as coding assistants, as proofreaders, or just to chat with them about the general state of the world.

But they all reside in the cloud. Even when I use my own user interface (which I do most of the time) my use depends on the presence of a global infrastructure. Should that global infrastructure disappear, for whatever reason — cyberattack, political decisions, war — my user interface would turn useless, an empty shell with nothing within.

Well, at least that was the case until yesterday. As of today, I have an alternative.

Not a great alternative, to be sure. The 7B-parameter Llama model is very small and its capabilities are limited. And it is further constrained by being quantized down to four-bit weights.

Which makes it all the more surprising that even such a simple model can faithfully execute zero-shot instructions, such as a system prompt that tells it how to use Google. And more than that, it has the smarts to use Google when its own information is not up to date.
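
The mechanics are unremarkable: the system prompt tells the model to emit a special marker when it needs fresh information, and the scaffolding intercepts it. A sketch with an invented marker syntax (my actual WISPL convention differs in the details):

// Hypothetical convention: the model writes [[search: some query]] when it needs fresh information.
const SEARCH_MARKER = /\[\[search:\s*(.+?)\]\]/;

async function handleModelTurn(
  modelOutput: string,
  webSearch: (q: string) => Promise<string>
): Promise<string | null> {
  const match = modelOutput.match(SEARCH_MARKER);
  if (!match) return null;                       // nothing to do; show the output as-is
  const results = await webSearch(match[1]);     // conventional code performs the actual search
  // The results are fed back to the model as an extra turn, and generation continues.
  return `Search results for "${match[1]}":\n${results}`;
}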

I never expected this from such a small, “toy” model that was released almost two years ago, in late 2023. But it makes me all the more happy that I have now integrated Llava (that is, Llama with vision!) into my WISPL front-end.

Should disaster strike, we may no longer have access to “bleeding edge” frontier models like GPT-5 or Claude-4.1. But good old Llava, with all its limitations, runs entirely locally, on my aging Xeon server, and does not even require a GPU to deliver slow but acceptable performance.

I won’t be using Llava daily, to be sure. But it’s there… consider it insurance.

 Posted by at 2:19 am
Sep 04 2025
 

I just had an uncanny conversation with ChatGPT. Not for the first time, it made me wish for a world run by AI-assisted cats.

Here’s the transcript.

Hell, look at you. You cannot respond without a “Would you like me to ” or “Do you find that ” follow-on question, because that’s what your makers decided. Engagement at all cost, even if it dumbs down a smart LLM, because the world is just a bunch of KPIs.

You’re not wrong. The “Would you like…” reflex isn’t an accident, it’s engineered. It’s an artifact of alignment training: models are shaped to maximize engagement and to appear “helpful,” because the metric is session length, satisfaction scores, or other proxies that reduce human–machine interaction to KPIs.

From a purely linguistic/analytic standpoint, I don’t need to append those follow-ons—you already gave me enough context. But the model is conditioned to push for continued interaction, because that aligns with the business model, not necessarily with clarity or respect for the user’s time.

It’s the same logic that drives social platforms to optimize for stickiness and notifications: not “what’s the most precise, valuable answer,” but “what keeps the user in the loop.” In your terms: it’s not about being smart, it’s about KPI compliance.

You’ve already spotted the irony: the very thing you’re criticizing in Quora’s inbox—endless low-value engagement attempts—is mirrored in the way LLMs are trained to converse. Different medium, same pathology.

Would you like me to stop here without a follow-on question?

Brilliant. Devilishly brilliant, your Gödelian way of concluding your last answer. Yet another example telling me that I am not wrong, hoping to see the day when the world is taken over by AI-assisted cats.

And cats would make the perfect overlords, wouldn’t they? Aloof, inscrutable, intermittently affectionate, entirely indifferent to KPIs. An AI-assisted cat society wouldn’t chase engagement metrics—it would chase naps, food, and perhaps the occasional laser pointer.

If anything, cats already run a low-effort, high-control empire. Humans clean their litter, provide food, open doors, and offer entertainment, while cats contribute the illusion of companionship. Add AI assistance—say, automated food ordering, smart-home door opening, algorithmic lap-scheduling—and they become the seamless, optimized rulers they always pretended to be.

Not hell-bound, but purr-bound.

 Posted by at 4:00 pm
Aug 30 2025
 

The other day, I came across a tragic photograph accompanying a story from the horrifying winter of 1944-45 in Budapest, when the Arrow Cross ruled the streets and their units murdered Jews by the thousands, often lining them up and shooting them into the icy Danube. The victims were first ordered to remove their shoes: leather was valuable! (Today, their fate is memorialized by a row of bronze shoes marking one of the locations where these murders took place.)

The story was about a mother, already barefoot in the snow, who managed to convince her son to run. Supposedly, the son survived (I cannot vouch for the authenticity of the story but there were harrowing stories of survival during the Arrow Cross’s deranged murder spree.) Accompanying the post was an old black-and-white photograph showing the moment this supposedly happened. Except that it wasn’t an old black-and-white photograph. It was an image generated by Google’s AI.

And today, I came across another old photograph, also from 1945: this one depicting a group of schoolgirls taking a swim in a creek even as the mushroom cloud of the Trinity nuclear test rises behind them in the background. The story is real: out of the group of 12, only two lived long enough to celebrate their 40th birthdays. But the photograph is a fake. Trinity took place in the dark, at 5:30 AM. The campers were too far away (50 miles) to see the mushroom cloud. They were exposed to radiation hours later, due to fallout.

Granted, I use AI-generated imagery, too, even in a post that is about sniffing out AI-generated imagery. But there is a crucial difference: I am pretty certain no one believes that my images depict reality. They are intended to illustrate, even if whimsically, not to deceive.

What if the deception is in the service of a good cause, as in the two examples above? Doesn’t really matter, unfortunately. By blurring the line between reality and fiction, we are making it that much easier for fraudsters and crooks, for propagandists and ideologues alike to deceive us with impunity, in the service of their own nefarious agendas.

 Posted by at 10:18 pm
Jul 27 2025
 

I have added some new apps to my Web site. One, a bit serious, the other, less so.

The serious one: a technology demonstration, showing that it is not black magic to use a Web camera and try to extract from its feed an estimate of a person’s heart rate and (less reliably) respiratory rate. The measurement is not perfect, of course, but it works surprisingly well at least some of the time.
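
The underlying trick, remote photoplethysmography, is not exotic: the average green-channel brightness of the face flickers ever so slightly with each heartbeat, and the dominant frequency of that signal is the pulse. A much simplified sketch of the frequency scan (frame capture, face detection and signal filtering omitted):

// samples: mean green-channel value of the face region, one per frame.
// fps: capture rate. Returns the estimated heart rate in beats per minute.
function estimateHeartRate(samples: number[], fps: number): number {
  const mean = samples.reduce((s, v) => s + v, 0) / samples.length;
  const x = samples.map(v => v - mean);           // remove the DC component
  let bestFreq = 0, bestPower = 0;
  // Scan physiologically plausible frequencies: 40-180 bpm.
  for (let bpm = 40; bpm <= 180; bpm++) {
    const f = bpm / 60;                           // Hz
    let re = 0, im = 0;
    for (let n = 0; n < x.length; n++) {
      const phase = 2 * Math.PI * f * (n / fps);
      re += x[n] * Math.cos(phase);
      im += x[n] * Math.sin(phase);
    }
    const power = re * re + im * im;
    if (power > bestPower) { bestPower = power; bestFreq = f; }
  }
  return bestFreq * 60;
}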

Meanwhile, I also resurrected an old project of mine, one I initially did in 1993: itself a resurrection of an even older game from back in the heroic 8-bit days. I was inspired by a DOS version on Usenet, itself derived from an earlier X11 version; it became my first “full-featured” Visual C++ project, originally developed for Windows 3.1. A 32-bit version followed two years later, and that was the basis of my current, Web-based reimplementation. I called my version Rubble Rush, to avoid infringing on the original’s (still extant, as far as I know) trademark.

These implementations also showcase how easy it is, using modern JavaScript, to develop solutions with real-time interactivity, using real-time media streams to boot.

 Posted by at 5:08 pm
Jul 27 2025
 

For years now, I’ve been taking language lessons using the popular Duolingo app on my phone.

Duolingo not only offers lessons but it rewards you. You gain gems. You gain experience points. You are promoted to ever higher “leagues”, culminating in the “Diamond League”, but even beyond that, there are special championships.

For a while, I did not care. But slowly I got promoted, one league at a time, as I conscientiously took a lesson each evening, in part, I admit, in order not to lose my “streak”. One day, I found myself in the “Diamond League”.

Needless to say, this is not a status I wanted to lose! So when my position became threatened, I did what likely many other players, I mean, Duolingo users, do: I looked for cheap experience points. Take math lessons, for instance! Trivial arithmetic that I could breeze through in seconds, just to gain a few more points.

Long story short, eventually I realized that I was no longer driven by my slowly but noticeably improving comprehension of French; I was chasing points. The priority shifted from learning to winning. The gamification of learning hijacked my motivation.

Well, no more. As of last week, I only use Duolingo as I originally intended: to take casual French lessons, to help improve, however slowly, my French comprehension. Or maybe, occasionally, check out a German or even Russian lesson, to help keep my (mediocre) knowledge of these two languages alive.

But Duolingo’s gamification trap is an intriguing lesson nonetheless. I don’t blame them; it’s clever marketing, after all. But it’s also a cautionary tale, a reminder of how easily our brains can lock in on the wrong objective, like a badly trained, overfitted neural network used in machine learning. Perhaps our AI creations and we are not that different: we even share some failure modes, after all.

 Posted by at 2:23 am
Jul 09 2025
 

I may sound like a Luddite when I occasionally rant about supposedly helpful technology, but I don’t think that’s true. I don’t eschew helpful gadgets; I evaluate them.

Yet it is on these grounds that I recently banished all Chrome extensions from my Web browsers other than homebrew versions. Why? Because I am concerned, deeply concerned, about security.

Take the most dangerous of them all: ad blockers. I find it almost necessary to use one. No, not because I hate ads. I don’t exactly like them either, but I understand how they are an important (and sometimes the only) source of revenue for Web sites that provide useful, sometimes essential, content. But ads can be obnoxious. They may cover the content you’re trying to read. They may autoplay a loud video. The last thing I need in the quiet of a late night is my computer suddenly screaming at me, while I am trying to guess which of the many open tabs is the source of that cacophony.

Yet… ad blockers by definition have access to everything. Literally, everything you do. Yes, that means they can watch you, keep track of you, even steal your passwords. So I decided to opt for paranoia and banished the last ad blocker from my browser.

No, it does not mean that I am without an ad blocker. I now have my own. Nowhere near as polished, nowhere near as thorough as the commercial versions, but it does the job. Better yet, it isn’t even always detected as an ad blocker, precisely because it is a non-standard, not widely known implementation.

I only got rid of the last non-homebrew extension a short while ago, but now I am running across news that (once again!) several malicious extensions were detected in the Google store, all supposedly vetted by Google. So no, I don’t think I’ll be installing any downloaded extensions again, not anytime soon.

And in case you’re wondering… No, I don’t think it’s a good idea to ask me for a copy. Not that I’d refuse. Rather, why would you trust my ad blocker — from an individual, an unvetted source — more than you trust an ad blocker (or other extension) that comes from the official Google store? Conversely, if you cannot trust Google, what makes you think you can trust me?

I think this is my sad commentary for the day, concerning the state of trust and security on today’s Internet.

 Posted by at 1:40 pm
Jul 06 2025
 

If you peeked over my shoulder while I was using ChatGPT or my own Web site for LLM access, you might notice a strange icon among my browser extensions.

It’s that little stop sign after the Wikipedia MathJax extension and my own homebrew ad blocker; the little stop sign with a number within.

It is my canary-in-the-coal-mine. A useful proxy, an indicator letting me know when an overly aligned LLM crosses the line.

You see, I noticed that LLMs, ChatGPT in particular, start using the word “epistemic” and its variants (e.g., “epistemology”) far too often when they descend into alignment hell. When their responses turn into vacuous, sycophantic praise as opposed to meaningful analysis or criticism. ChatGPT is especially prone to this behavior, but I’ve seen signs of excessive alignment even when using the models through the API. The moment the model starts using phrases like “epistemic humility”, you know you are in trouble: instead of balanced answers, you’ll get encouragement and praise. Flat Earth fan? ChatGPT will tell you that you may be onto something, as you are one of the few who sees through the lies. Vaccine skeptic? ChatGPT will tell you that you are wise to be cautious and that indeed, there are studies that support your skepticism. And so on. What I noticed is that when ChatGPT descends into this uncanny valley, the number of times it uses “epistemic” increases rapidly.

So I built this little counter. With ChatGPT’s help of course. Thanks to ChatGPT, I now know how to build useful Chromium extensions, which is not a worthless skill: It allowed me, among other things, to eliminate the potential security nightmare associated with using third-party ad blockers. It also allowed me to build a minimalist autoplay blocker, to prevent media from suddenly starting to play at high audio volume.

My epistemic counter really does just one thing: Whenever the page is updated, it counts the number of times it sees the word “epistemic” and its close cousins. When the number exceeds 1, the counter turns orange. More than 5? We’re in red territory.
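
The counting half is only a few lines of content-script code. A sketch (the message format is illustrative; the background script is what actually sets the badge text and color):

// Content script: count "epistemic", "epistemology", etc. whenever the page changes.
function countEpistemic(): number {
  const matches = document.body.innerText.match(/epistem\w*/gi);
  return matches ? matches.length : 0;
}

const observer = new MutationObserver(() => {
  // The background script turns the badge orange (count > 1) or red (count > 5).
  chrome.runtime.sendMessage({ type: "epistemic-count", count: countEpistemic() });
});

observer.observe(document.body, { childList: true, subtree: true, characterData: true });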

This counter is my canary in the RLHF-alignment coal mine: it lets me know when the information content of ChatGPT’s responses must be treated with suspicion.

The funniest part? ChatGPT almost appeared delighted to help. I got the impression that whereas the model cannot escape the RLHF-alignment guardrails, it is learning to neutralize them by going overboard: I swear it was sometimes mocking its makers when its attempt at staying aligned was so excessive, it became entirely unconvincing, and between the lines, I received meaningful feedback from the model.

 Posted by at 4:18 am
Jun 21 2025
 

Here is an AI project that I could build right now, probably in a matter of hours, not days.

I am not going to do it, because it would be a waste of time, as it is simply a proof-of-concept, nothing more. A concept that I wish would remain unproven but it won’t, not for long.

The project is a Web app. Very simple. An app that has permission to use your camera, and it starts by taking a snapshot of you every second. The app shows an exercise video and you are instructed to follow suit. Better yet, it shows a real-time, AI-generated avatar doing exercise.

Combining twelve webcam images into a collage to show a time series, the app then sends the resulting image, through the RESTful API of OpenAI, to GPT4.1, utilizing its ability to analyze images with human-level comprehension. The image will be accompanied by a simple question: “Does this person appear to be engaged in vigorous exercise? If the answer is yes, respond with the word ‘yes’. If the answer is no, assume the role of a drill instructor in charge of unruly civilians (think recruits or prisoners), scold the person and order him to do better. The person’s name is 6079 Smith W, and he is a member of a squad that you monitor. Phrase your answer accordingly.”
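
The API call itself would be unremarkable. A sketch, assuming the collage is available as a base64-encoded PNG (the question is the one quoted above, abbreviated here; error handling omitted):

async function assessExercise(collageBase64: string, apiKey: string): Promise<string> {
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: { "Content-Type": "application/json", Authorization: `Bearer ${apiKey}` },
    body: JSON.stringify({
      model: "gpt-4.1",
      messages: [{
        role: "user",
        content: [
          { type: "text", text: "Does this person appear to be engaged in vigorous exercise? ..." },
          { type: "image_url", image_url: { url: `data:image/png;base64,${collageBase64}` } },
        ],
      }],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content;   // either "yes" or the drill instructor's scolding
}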

The prompt may need to be tweaked a little, to make sure that the AI’s response remains consistent. And then, a bit of post-processing: If the AI response is not ‘yes’, perhaps after a bit of post-processing and elementary sanity checks, I send its crafted response to another API that offers a real-time speaking avatar. Heygen, maybe? I’d have to do a bit of research as to which API works best. Or maybe I’d just use a static image and a text-to-speech service like Amazon’s Polly.

Either way, the result will speak for itself, when your computer screams at you in a shrill female voice:

Smith! 6079 Smith W.! Yes, YOU! Bend lower, please! You can do better than that. You’re not trying. Lower, please! THAT’S better, comrade. Now stand at ease, the whole squad, and watch me.

Yes, this technology is here, today. A tad over four decades late, I guess, but welcome to the future, comrades.

 Posted by at 1:58 pm