Aicadium Senior Vice-President of Data Science, Dimitry Fisher, shares his thoughts on recent conversations he has had around ChatGPT and Generative AI in general, their shortcomings, and how the industry can advance these technologies.
Some Thoughts on ChatGPT and Generative AI
I recently had the opportunity to speak with Sydney Freedberg Jr from Breaking Defense about ChatGPT and how the Pentagon might be able to use it and other Generative AI technologies. The consensus in his article, “Pentagon should experiment with AIs like ChatGPT – but don’t trust them yet: DoD’s ex-AI chiefs”, is, as the title suggests, that generative AI is not ready for this kind of application yet, either in the business or in the military domain. In this blog I will elaborate on why that is the case and what developments or advances may enable these technologies to provide more benefit in the next few years. I will also share a few ideas and directions to focus the current research.

This is a fast-moving topic, and progress in the Generative AI field is much too quick for patenting and intellectual property protection to remain relevant, with ideas and methods becoming outdated faster than patents are published or evaluated. It is the free exchange of ideas that allows this field to progress at such a remarkable speed. Of course, in 5, 10, or 20 years this field will reach maturity, as many other fields of machine learning and AI have before it; it will become commoditised and then be superseded by new architectures, methods, and capabilities. Yet the entities that fail to understand both the benefits and the shortcomings of this technology today will find themselves at a disadvantage in the near future.
Highlights from the Interview with Sydney Freedberg Jr
Here are the main points I discussed with Sydney regarding transformer-based Large Language Models (LLMs):
- The great power of transformer-based approaches, such as ChatGPT, lies in their computational efficiency: they can train on, and in some form learn from (more on that later), monumental amounts of data, and adjust a monumental number of parameters to perform well on that data. The notion of “perform well” is rather well defined for tasks such as completing a sentence or inserting a missing word. It becomes fuzzier, though, when the LLM (or a chatbot based on it) must produce answers to free-response questions. There the LLM training becomes highly dependent on human feedback, and that takes time.
- The price the transformer-based approaches pay for this efficiency, however, is heavy. Everything an LLM “knows” is encoded in its parameters (“weights”), not in the declarative, episodic, or semantic long-term memory that makes an actual conversation (or culture) possible. The LLM parameters serve it remarkably well for coming up with the next word, and thus for building a conversation one word at a time, but they fail to produce any semblance of an idea. Humans (or at least the humans you may enjoy talking to) have an idea of what they are going to say before they start talking; LLMs do the exact opposite. “Wise men speak because they have something to say; fools – because they have to say something.” Alas, the present-day LLM chatbots fall squarely into the latter category. There is no “idea” or even an emergent representation of an idea; there is just the most likely next word. As Sydney has illustrated in the article, even easily verifiable facts fall victim to the “most likely next word” mechanism, due to the lack of any dedicated long-term memory.
- One of the worst side effects of the next-word prediction mechanism is the exponentially vanishing chance of making sense in long sentences or passages, as Yann LeCun has explained; the toy calculation below illustrates the compounding. Get one word wrong and it will be very hard to get back on track, as the idea is not there to begin with.
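To make the compounding concrete, here is a toy calculation of my own (the per-token error rate is an assumed, illustrative number, not a figure from LeCun’s argument): if every generated token stays “on track” with some fixed independent probability, the chance that an entire passage remains coherent shrinks exponentially with its length.

```python
# Toy illustration of the compounding-error argument. The 2% per-token error rate
# is an assumed, illustrative value; only the exponential trend matters.
e = 0.02  # assumed probability that any single generated token derails the passage
for n in (10, 50, 200, 1000):
    p_coherent = (1 - e) ** n
    print(f"{n:5d} tokens -> P(still on track) ~ {p_coherent:.2e}")
# 10 tokens -> ~8.2e-01, 50 -> ~3.6e-01, 200 -> ~1.8e-02, 1000 -> ~1.7e-09
```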
Canned Answers are not Knowledge
ChatGPT-4 seems to overcome this problem by having a large set of canned answers memorised and reproduced verbatim. This still doesn’t mean that it knows what it’s saying, or that it makes any sense. It does, however, mean that it has gone through the “1984”-inspired procedure of having its knowledge of past events thoroughly filtered to follow the present-day Party line (“guardrails” in LLM newspeak) of whoever OpenAI trusts with the all-important Reinforcement Learning from Human Feedback (RLHF) task. As the Party line changes, so will our knowledge of the past.
On the other hand, ChatGPT-4’s ability to handle images natively offers a glimmer of hope (at least to me) that, with some work still required, it would be able to form a visual representation of something like an idea or a narrative. There is a very good reason toddlers are taught using picture books. Unlike spoken language, which is necessarily serial, an image or painting conveys a scene all at once. It is then up to the reader (or LLM) to make sense of it and turn it into a serial narration, story, or some form of language representation, but the idea is already there in pictorial form.
LLM Hallucination (it’s really confabulation)
Another well-publicised problem of the LLMs is “hallucination”. They are prone to giving nonsensical answers, or answers that sound plausible but are largely or entirely made up. This is a deeper problem than either the lack of an idea or the absence of semantic or episodic long-term memory. The absence of long-term memory makes an LLM unable to produce accurate quotes or reference correct sources (except for the canned answers mentioned above). Instead, everything the LLM has ever learned is encoded in its parameter values. These values are used, together with the query prompt and the previously generated text (including word positions and punctuation, but excluding prosody), to generate further text. Now, here’s the problem: statistical interrelations between words don’t make them facts. For example, if crime was mentioned in a text about a certain politician, it contributes to that correlation regardless of whether the said politician fought crime, was involved in crime, or merely mentioned crime in a speech. The LLM may therefore present any one of these three “hypotheses” as fact, each with some nonzero probability; the toy co-occurrence count below illustrates why. It’s even worse if the politician has a commonly occurring name. To be accurate, this is not really hallucination; this is confabulation. However, the name “hallucination” has stuck to this phenomenon, unfortunately, so it will be hard to change it now.
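A toy co-occurrence count makes the point. The sentences and the counting scheme below are invented purely for illustration; they do not reflect how any particular LLM is trained, only why word-level statistics cannot distinguish the three roles the politician might have played.

```python
from collections import Counter
from itertools import combinations

# Three invented sentences in which "smith" plays very different roles with respect to crime.
corpus = [
    "senator smith fought organized crime for a decade",
    "senator smith was charged with a serious crime",
    "senator smith mentioned rising crime in a speech",
]

# Count unordered word co-occurrences within each sentence.
cooccurrence = Counter()
for sentence in corpus:
    for a, b in combinations(sorted(set(sentence.split())), 2):
        cooccurrence[(a, b)] += 1

# Every sentence contributes the same ("crime", "smith") pair, so a purely
# statistical model sees the three roles as interchangeable evidence.
print(cooccurrence[("crime", "smith")])  # -> 3
```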
Real hallucination is when the brain (or model) represents and classifies something that is not in the input data. For example, a noise input generates a valid percept of an object, but the object isn’t really there (assuming you believe in objective reality; otherwise all bets are off). Google’s DeepDream architectures are excellent at this, although they call it a “dream” rather than a hallucination. More recently, stable diffusion in latent-representation space has produced some spectacular examples.
Delusions are related but distinct. That is when the brain misrepresents or misidentifies something in the input. When a deep neural network is shown pixel noise but sees a stop sign – that’s a hallucination. When a deep neural network is shown a bunny but sees a stop sign – that’s a delusion. When an LLM makes up someone’s biography, this is neither hallucination nor delusion, rather, this is confabulation.
The Way Forward – Where does Generative AI go next?
The way forward for the LLMs, as I see it, must focus on 3 items: idea, idea, and idea. Indeed, idea representation is the key to GPT reaching acceptable performance on both images and text. Otherwise, it is fundamentally a parrot, generating speech from memorised tokens with no overall goal other than getting fed or played with. I love parrots, but I wouldn’t trust them with any important roles in either business or the military.
Item #1 is grounding. This is not at all a recent requirement; it has been around for several decades. The ability to manipulate tokens (object images, words, word parts, etc.) does not alone endow an AI with human-level intelligence; rather, the AI needs to be aware of the real-life counterparts of those tokens, concrete or abstract. It is an open question how to interpret or represent qualia across the full spectrum of human diversity, let alone between humans and AI. However, this effort must happen if the AI is to acquire both empathy and the ability to coexist meaningfully and harmoniously with humans. Grounding is a fundamental aspect of independent idea generation and vetting for an AI, and a far better path than the thought-police approach of present-day RLHF.
Item #2 is common sense. This is related to grounding, but distinct. Whereas grounding in the previous (narrow) sense focuses more on the interaction between AI and humans, common sense focuses more broadly on the interaction between AI and the physical world, including the consequences of one’s actions.
When an LLM generates an instruction (computer code, an action recommendation, etc.), it does not currently have any capacity to represent, let alone predict, the state of the world when or after that instruction has been followed. It cannot even evaluate the feasibility or validity of the instructions. Nor can it infer or represent intent, beyond the user-provided prompt. At the high end, these capabilities go beyond the notion of an idea and spill over into the much broader realm of human-level executive function. At the low end, though, this is basic reasoning and ideation at a toddler level: put the cup on the desk and not next to the desk; put the pants on first and then the shoes; and so on. Towards that end, the LLM must be aware of (must have learned) the basic rules of the physical world, such as object permanence and conservation (in Piaget’s sense), at least at a toddler’s level.
There are currently no dedicated components in the GPT architecture to do so; therefore, the belief in an emergent common sense may be magical or wishful thinking at this stage. An embodied GPT model trained to interact with the physical world could address some of these issues, but a modification to the fundamental architecture may be required to represent the state of self and the state of the world for a length of time sufficient to cover the span of relations between causes and effects. On the other hand, GPT-4 is already capable of answering logic questions at the LSAT level by pure token manipulation (without either grounding or embodiment). There are also ongoing efforts to improve common-sense language generation using image-captioning approaches (such as the CommonGen dataset and task), although these do not go beyond token manipulation either.
Item #3 is idea representation. Unlike board or arcade games with highly constrained rules and action spaces, the space of ideas in the real world is vast. If one were to limit any idea to an English sentence of, say, no more than 20 words, the number of possible ideas would still be much larger than any dataset humanity has ever dealt with. Like images, of which there can be a staggering number (24 bits per pixel for millions of independent pixels), ideas either make sense or they do not, and most do not. That is, the vast majority of possible ideas are nonsense, just as the vast majority of possible images are noise. It is all the more fitting, therefore, that the latent representation of ideas for a neural network should be generated and handled in a similar fashion to the latent representation of images. It may even be the case that the stable diffusion algorithm is applicable to the latent representation of ideas in much the same way as it is to the latent representation of images, producing new ideas (good or bad) from old ideas and a prompt; a sketch of what that might look like follows below.
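For concreteness, here is a minimal sketch of what a diffusion step over an “idea” latent could look like, assuming a hypothetical denoiser trained on such latents. Every name here (denoise_idea_latent, the denoiser callable, the prompt embedding) is an illustrative assumption of mine, not part of any existing system; the loop itself is just a standard deterministic DDIM-style reverse pass.

```python
import torch

# Hypothetical: reverse-diffuse an "idea" latent, conditioned on a prompt embedding.
# The denoiser is assumed to predict the noise component, as in standard latent diffusion.
def denoise_idea_latent(z_T, denoiser, alphas_cumprod, prompt_embedding):
    z = z_T  # start from pure noise in the idea latent space
    for t in reversed(range(len(alphas_cumprod))):
        a_t = alphas_cumprod[t]
        a_prev = alphas_cumprod[t - 1] if t > 0 else torch.tensor(1.0)
        eps = denoiser(z, t, prompt_embedding)                  # predicted noise at step t
        z0_hat = (z - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # current estimate of the clean latent
        z = a_prev.sqrt() * z0_hat + (1 - a_prev).sqrt() * eps  # deterministic DDIM update
    return z  # to be decoded downstream into an image or a terse sentence expressing the idea
```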
I’m not speaking narrowly of the picture-book approach I mentioned earlier, although that would probably be a good place to start. What I mean here is a much broader class of idea representations, which may make sense to an adult human but not to a toddler, or which may only make sense to an AI. Consider, as perhaps one of the most famous examples, Picasso’s Guernica. The idea it conveys to most is the horror of war (any war, not just the Spanish Civil War), yet Picasso himself said of it, “What ideas and conclusions you have got I obtained too, but instinctively, unconsciously. I make the painting for the painting. I paint the objects for what they are.” In other words, he painted the image generated by his latent representation of the idea. It is likely that images are indeed a valid way to represent ideas for GPT-based LLM architectures (and their latent representations a valid way to encode them).
More generally, one can think of an idea as a member of a superset of image captions. That is, current or historical image captions (e.g., “Shiba-inu riding a bicycle” or “Ivan the Terrible Having Murdered His Own Son”) are ideas. All image captions are ideas, but not all ideas are currently image captions; they can, however, be made so. If we allow the casting of ideas as image captions, then the same cosine metric and the same contrastive objective that are used in CLIP may apply here as well, as sketched below.
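As a reference point, here is a minimal sketch of the CLIP-style symmetric contrastive objective the previous paragraph refers to, written as if the text branch encoded “ideas” rather than captions. The function name and the assumption of pre-computed idea/image embeddings are mine; the loss itself follows the standard CLIP recipe (cosine similarity plus a symmetric cross-entropy over matching pairs).

```python
import torch
import torch.nn.functional as F

def clip_style_idea_loss(idea_embeddings, image_embeddings, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (idea, image) embedding pairs."""
    # L2-normalise so that the dot product below is exactly the cosine similarity.
    ideas = F.normalize(idea_embeddings, dim=-1)
    images = F.normalize(image_embeddings, dim=-1)

    logits = ideas @ images.t() / temperature      # pairwise cosine similarities, scaled
    targets = torch.arange(len(ideas))             # matching pairs lie on the diagonal

    loss_idea_to_image = F.cross_entropy(logits, targets)
    loss_image_to_idea = F.cross_entropy(logits.t(), targets)
    return (loss_idea_to_image + loss_image_to_idea) / 2
```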
The greatest challenge is that there is no human-agreed correspondence between images and ideas, so this cannot be trained with supervision the same way CLIP is. Rather, it would have to be a multi-stage, indirect process: a piece of text (e.g., a paragraph) becomes an idea (e.g., a terse statement or sentence), which becomes an image, and then back from the image to an idea and to a paragraph. Some training can be done with self-supervised reconstruction (encoder-decoder), some with self-supervised completion (such as fill-in-the-blank or end-the-sentence for language, or inpainting/denoising for images), and some with (minimal) human supervision somewhere along the way, but only if absolutely necessary; a sketch of the round trip follows below. Better than human supervision, though, such a network may generate a diversity of ideas beyond what humans are currently capable of representing. Eventually, this “pictorial language” itself may become a common language between humans and machines, overcoming entirely the mutual incomprehensibility of human languages (as well as machine instructions).
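Here is a conceptual sketch of the round-trip training signal described above. All module names (text_to_idea, idea_to_image, image_to_idea, idea_to_text) are hypothetical placeholders for the stages named in the text, and the particular losses are only one plausible choice under those assumptions.

```python
import torch.nn.functional as F

def round_trip_loss(paragraph_tokens, models):
    """Self-supervised objective for the paragraph -> idea -> image -> idea -> paragraph loop."""
    idea = models.text_to_idea(paragraph_tokens)       # paragraph compressed into an idea latent
    image = models.idea_to_image(idea)                 # idea rendered in pictorial form
    idea_back = models.image_to_idea(image)            # idea recovered from the image alone
    text_logits = models.idea_to_text(idea_back)       # paragraph reconstructed from the recovered idea

    # The paragraph should survive the round trip, and the idea latent should be
    # stable across the detour through the image.
    text_loss = F.cross_entropy(
        text_logits.view(-1, text_logits.size(-1)), paragraph_tokens.view(-1)
    )
    idea_loss = F.mse_loss(idea_back, idea)
    return text_loss + idea_loss
```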
Generating and representing an idea also does not entirely alleviate the confabulation problem described above. It helps the LLM-generated text to make sense, but it does not necessarily help it to be factually correct. As discussed above, factual correctness (retroactive editing of history by political parties notwithstanding) requires explicit long-term memory. Furthermore, such memory likely has to be associative and/or content-addressable; that, however, is a subject for a separate blog entirely.
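To give a flavour of what “content-addressable” means in this context, here is a minimal sketch of a memory that stores facts as (embedding, payload) pairs and retrieves them by similarity to a query embedding. The class and its interface are purely illustrative assumptions on my part, not a preview of the separate blog mentioned above.

```python
import torch
import torch.nn.functional as F

class ContentAddressableMemory:
    """Illustrative key-value store retrieved by cosine similarity rather than by exact address."""

    def __init__(self, dim):
        self.keys = torch.empty(0, dim)   # one embedding per stored fact
        self.values = []                  # e.g., source passages, quotes, citations

    def write(self, key, value):
        self.keys = torch.cat([self.keys, F.normalize(key, dim=-1).unsqueeze(0)])
        self.values.append(value)

    def read(self, query, k=3):
        sims = F.normalize(query, dim=-1) @ self.keys.t()         # cosine similarity to every key
        top = sims.topk(min(k, len(self.values))).indices
        return [self.values[i] for i in top]                      # most relevant stored facts
```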