I didn’t think I’d sit and watch this whole thing but it is a very interesting conversation. Near the end the author says something like “I know people I’m the industry who work in these labs who act like they’ve seen a ghost. They come home from work and struggle with the work they do and ask me what they should do. I tell them they should quit then, and then they stop asking me for advice.”
I do wonder at times if we would even believe a whistleblower should one come to light, telling us about the kind of things they do behind closed doors. We only get to see the marketable end product. The one no one can figure out how it does what it does exactly. We don’t get to see the things on the cutting room floor.
Also, its true. These models are more accurately described as grown not built. Which in a way is a strange thing to consider. Because we understand what it means to build something and to grow something. You can grow something with out understanding how it grows. You cant build something without understanding how you built it.
And when you are trying to control how things grow you sometime get things you didn’t intend to get, even if you got the things you did intend to get.


The emergent “behavior” conversation is very real however, where the textual output doesn’t match the behavior in some cases. Where it encodes new “meaning” to existing text string that has a different statistical relationship with the training goal then what the string of words mean in reality.
This isn’t simply word prediction, its layers of statistical prediction on top of statistical word prediction. Once you are no longer tuning for the next most likely word, but instead the next most likely word that “makes you more effective at the game Diplomacy” it would seem you will inadvertently train the statistical model for “the next most believable word that furthers the goal of winning Diplomacy” aka Deception. Even calling it deception is fraught with linguistical and philosophical baggage. As this paper notes you have to redefined the word in this context to effectively communicate the “behavior”.
I think it is fair to acknowledge that we lack a set of technical language to describe the emergent “behavior” of these systems. It is I think reasonable to view this “behavior” as an outgrowth of the training methodology. When you are seeking a narrow statistical outcome without being able to understand how the statistical model is being built or how to replicate it by hand, you don’t know what additional statistical properties become convergent with the outcome. Thus you do not know how to detect them or control for them.
But how could something be good at the game Diplomacy if it also wasn’t good at deception? The stakes are low, and games are to be played for fun and sometimes you decisive your allies and friends because you want to win. Winning in this situation doesn’t lead to harm ultimately, it’s just a game. But then the question becomes, what are you planning on deploying this AI for in the practical world where the stakes are different? I don’t think its to play the game Diplomacy for fun.
META describes the AI CICERO as “The first AI to play at a human level in Diplomacy, a strategy game that requires building trust, negotiating and cooperating with multiple players.” But “building trust” and acting trustworthy are not the same things. The system can be trained to output the next most likely word that builds trust in human players without taking that trust building into the statistical calculation that generates the next most like word to describe actions that win at Diplomacy. Meaning no where in that second calculation does it get trained on “does this break the trust of the other players” which you would then have to be weighed against the larger goal of excelling at Diplomacy relatively to human players. Something that could untimely make it less effective at the game then humans.
Its hard not to anthropomorphize these systems because they use our natural language to describe the statistical underpinnings of their logic, a logic we yet understand to any degree. This is the struggle in discussing something so technical when we have yet to invent a technical language to describe how they function internally after the training is complete. This is what untimely makes these systems “alien” to us. Inside that natural language could be statistical anomalies that result in “behavior” we simply do not understand.
Even here I feel the need to be overly verbose to avoid muddying my meaning with anthropomorphic language.
I think I agree but the fundamental problem in my view is thus: if you put an overgrown chatbot in charge of things with real-life stakes, people will die. We think social murder is bad now, it’ll be so much worse when we’ve handed over half of the administration of the systems keeping people (barely) alive to chatbots who fundamentally don’t (can’t) care about human life.
That’s a much different problem than “LLMs (or their successors) will develop superintelligence and kill us all” rokos basilisk bullshit
And like, I think the fact that it’s hard not to anthropomorphize them is intentional, and the lack of new (at least, public facing) language to analyze them with is intentional by the industry. But it’s still a mistake to do so because literally none of the underlying assumptions around these very human concepts applies to LLMs, and it’s not possible to use language in a way that is totally divorced from all prior assumptions baked into the concepts of thinking, reasoning, intention, behavior even when you are consciously aware that they map to LLMs only in superficial ways.
If you analyze what is fundamentally a statistical system, shaped by its mechanics, it’s training data, structural factors about the training process, and feedback generated by chaining multiple models together, as though human concepts have anything to do with how it will behave, then you will come to wrong conclusions. It is against the AI industry’s interests for us to have intelligent discussions about their “AI”, so important details are purposefully obscured, and I think use of language is part of that.
Yeah I can totally see human being building a permission structure, for themselves, to allow themselves to commit atrocities, because Grok told them it’s okay and beneficial. That’s what scares me, not some superintelligence.
To be clear this is the camp im in, not the super intelligence camp.