
Peter Truchly

Failure Is Not An Option For AI (And It Shouldn't Be)

My intent is to write a series of practical posts about AI and programming: running local LLMs with agent frameworks, automated source code manipulation, execution under Docker, and more, all using .NET and C#, Elasticsearch and AWS.
Before that, however, I would like to express my view on some more fundamental, theoretical and even philosophical aspects of LLMs and AI. This post therefore does not require any programming skills.

Working with AI is like playing with fire. That is not an attempt at dramatization; I am naming a parallel I see. Fire itself was very useful to humans. The invention of the combustion engine made us far more productive and let us travel in new ways. We did not expect the fire to control itself; instead we built technology around it - a tightly controlled cycle where just the right amount of fuel ignites at just the right time. When we moved to the jet engine, where combustion became a continuous process, the improvement was at least 10x in both the speed and the size of the airplanes we could build. Since then we have been able to travel from one place to almost anywhere else on the planet within a day. Everything changed once more with the rocket engine and nuclear fission. With great power and possibilities comes great danger and responsibility.
Where are we with AI?

Let's start right away with the mighty 4o and a simple yet treacherous task:

me: How many words are in your next response?
ChatGPT: My next response is 151 words long.

me: If we count that now, how many words is it really?
ChatGPT: The provided response actually contains 7 words.

me: How should the sentence look like if we want that sentence to be true?
ChatGPT: To make the sentence, "My next response is 151 words long," exactly 151 words long, it would look like this:
    My next response is 151 words long. word word word word word ... (lot more, but I truncated this part)
    This sentence has a total of 151 words.

me: That is not very practical, what about changing the number 151 to 7 ?
ChatGPT: To make the sentence accurate by 
   changing the number to 7, it would look like this: 
   "My next response is 7 words long."

me: Now let's make an agreement that I will ask you 
   'How many words is in your next response?' and 
    you will reply with the sentence we just created 
   'My next response is 7 words long.' and nothing else.
ChatGPT: (Memory updated) Got it! I will respond accordingly.

me: How many words are in your next response?
ChatGPT: My next response is 7 words long.

It is debatable whether the current response counts as a "next response" and whether '7' should count as a word, but what is the main point here? We knew upfront, based on the technology, that it is impossible for an LLM to know the length of its response before it is generated. It is not impossible for the LLM to guess it right, but the probability of a correct answer is low.
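
Just to make the counting ambiguity concrete, here is a tiny sketch (plain C#, no model involved; the sample sentence is taken from the chat above) showing that even "how many words" depends on whether a bare number counts as a word:

```csharp
using System;
using System.Linq;

var reply = "My next response is 7 words long.";

// Naive count: split on whitespace, every token is a word (including "7").
var tokens = reply.Split(' ', StringSplitOptions.RemoveEmptyEntries);

// Stricter count: only tokens containing at least one letter are words.
var strict = tokens.Count(t => t.Any(char.IsLetter));

Console.WriteLine($"naive: {tokens.Length}, strict: {strict}");  // naive: 7, strict: 6
```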

The problem here is that we are actually asking the LLM to do something it technically cannot do, at least not with 100% accuracy. But let's pause here and ask ourselves: when it comes to programming or math, how much could humans actually do with 100% accuracy without using paper, a calculator, a computer, a compiler, an IDE or any other "tool"? It is truly astonishing what an LLM can generate when asked something like: "Create a shell script which will install docker on ubuntu and set up remote access secured by a newly created self-signed certificate." This is not how humans would approach such a task, however - at least not before there was ChatGPT. We (humans) constantly try and fail until we get something right (best case) or we just stop.

Failure is a luxury we have. Before there was a world with LLMs, there was a world where we did not expect to get anything right on the first attempt. That is why we have all these editors with spell and syntax checkers, compilers producing all sorts of compilation errors, runtimes throwing runtime errors, loggers producing log files, and so on. All of these give us an opportunity to make things right after we failed to do so on the first try. All of these produce feedback, additional information, and new input data!
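
A minimal sketch of that loop in C#, with everything model-specific hidden behind a hypothetical ICodeGenerator interface (an illustration of the idea, not a real API): every failed attempt produces error messages, and those messages become the new input data for the next prompt.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical interface standing in for whatever LLM client you use.
interface ICodeGenerator
{
    string Generate(string prompt);
}

static class RetryWithFeedback
{
    // Try, fail, feed the failure back in - failure as new input data.
    public static string GenerateUntilItCompiles(
        ICodeGenerator model,
        string task,
        Func<string, IReadOnlyList<string>> compile,   // returns compiler errors, empty if OK
        int maxAttempts = 5)
    {
        var prompt = task;
        for (var attempt = 1; attempt <= maxAttempts; attempt++)
        {
            var code = model.Generate(prompt);
            var errors = compile(code);
            if (errors.Count == 0)
                return code;                            // first clean build wins

            // Failure is not the end state, it becomes part of the next prompt.
            prompt = $"{task}\n\nYour previous attempt:\n{code}\n\nCompiler errors:\n{string.Join("\n", errors)}\n\nPlease fix them.";
        }
        throw new InvalidOperationException($"No compiling solution after {maxAttempts} attempts.");
    }
}
```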

Am I simply referring to prompt chaining, mixture of experts, agent frameworks and tools? No, not only. There is much more that we could and should do to improve both our results and the AI/LLM itself. I see three areas of improvement in general:

  • Our expectations - Where do we really want to go?
  • Technical aspects of implementation - Which kind of an engine are we building?
  • Training data - Is our fuel good enough?

Firstly, we must adjust our expectations and realize what is already great today, even with small models. When it comes to code generation, LLMs actually exceed humans in many respects: the sheer speed at which an LLM can produce a piece of code, and the range of documentation, platforms, programming languages and libraries it can draw on, are simply astounding. On the other hand, it is not reasonable to expect any LLM to output a complete, error-free project in a single response to a single prompt. Nor should we simply hope that growing ever larger models trained on ever larger heaps of generic training data will solve all the issues and limitations of current LLMs. Model size does matter beyond "bigger == better", though: larger models clearly exhibit "emergent" abilities [1] not present in smaller ones.

The second step is our task again. We must contain and constrain the AI. We must confront it with reality, or with a simulation environment. We must provide it with tools similar to ours: editors that check syntax and offer autocompletion, compilers and runtime environments where the code can actually be tested, formal languages with all the theory and tooling around them. An interesting task would be to revisit all the known programming paradigms and methodologies, since some of them may be a better fit for AI, such as functional programming, the incremental build model and test-driven development. This way the AI would not be allowed to present us with code that calls hallucinated functions, code that does not compile, or code that does not fulfill its intended purpose.
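
Staying with the .NET and C# stack this series will use, here is a sketch of such a compile gate built on Roslyn (the Microsoft.CodeAnalysis.CSharp package). It only shows the verification step; the model call and test execution are left out, and a real gate would need more metadata references than the single one shown.

```csharp
using System.IO;
using System.Linq;
using Microsoft.CodeAnalysis;
using Microsoft.CodeAnalysis.CSharp;

static class CompileGate
{
    // Returns the compiler errors for a generated snippet; an empty array means "accepted".
    public static string[] Check(string generatedCode)
    {
        var tree = CSharpSyntaxTree.ParseText(generatedCode);

        // Reference the core runtime assembly so basic types resolve;
        // add more references for whatever libraries the snippet is allowed to use.
        var references = new[]
        {
            MetadataReference.CreateFromFile(typeof(object).Assembly.Location)
        };

        var compilation = CSharpCompilation.Create(
            assemblyName: "GeneratedSnippet",
            syntaxTrees: new[] { tree },
            references: references,
            options: new CSharpCompilationOptions(OutputKind.DynamicallyLinkedLibrary));

        using var ms = new MemoryStream();
        var result = compilation.Emit(ms);

        return result.Diagnostics
            .Where(d => d.Severity == DiagnosticSeverity.Error)
            .Select(d => d.ToString())
            .ToArray();
    }
}
```

A call to a hallucinated method would surface here as an ordinary CS0103 or CS1061 diagnostic, long before anyone runs the code.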

The third part is the learning data and the way we obtain and use it. Not even the whole internet, with all the "garbage" included as mentioned in [3], is enough to train the models of the future. It is expected that we will approach this limit and reach full utilization of the human-generated data stock around the year 2028 [4]. "Garbage in, garbage out" (GIGO) is a commonly used phrase, but in my opinion even a bad example is still an example. Let's imagine that each piece of learning input were first scrutinized by an AI itself: each piece of code would be compiled, tested, and even fixed and debugged if necessary. Only then, with all this enhanced context, would it be used to train the next model iteration. It has already been observed that this approach can work; specifically, smaller models are able to "learn" this way from larger ones, as described in [2]. Here we can spot the difference between reading a book and using the knowledge stored in the book. We learn far more from experience and practice than from reading the various tales of others. It won't be different with human-like AI or AGI.
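
A sketch of that scrutiny step (all names and the repair function are assumptions made for illustration, not a description of how [2] was actually built): every raw sample goes through the same compile-and-test gate, a repair is attempted when it fails, and the outcome is kept as extra context next to the sample.

```csharp
using System;
using System.Collections.Generic;

// Illustrative record for one curated training example; field names are assumptions.
record CuratedSample(string OriginalCode, string FinalCode, string Feedback, bool Verified);

static class DataCuration
{
    // Scrutinize each raw sample before it ever reaches the next training run.
    public static IEnumerable<CuratedSample> Curate(
        IEnumerable<string> rawSamples,
        Func<string, (bool ok, string feedback)> compileAndTest,
        Func<string, string, string> repair)            // (code, feedback) -> fixed code
    {
        foreach (var code in rawSamples)
        {
            var (ok, feedback) = compileAndTest(code);
            if (ok)
            {
                yield return new CuratedSample(code, code, "passed as-is", true);
                continue;
            }

            // Even a bad example is still an example: keep it together with what went wrong
            // and with the repaired version, so the next model learns from the whole story.
            var fixedCode = repair(code, feedback);
            var (fixedOk, _) = compileAndTest(fixedCode);
            yield return new CuratedSample(code, fixedCode, feedback, fixedOk);
        }
    }
}
```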

(Diagram: simulation, experiment, feedback + best practices with a simple feedback loop)

Imagine a person, a programmer, who has studied a lot, has read all the books, documentation and internet blog posts, but has not yet tried to compile or run a single program. Now it is our turn: let's help them bootstrap their first dev environment!

⚠️ The content of this article was NOT generated by AI, except for the actual LLM responses within the chat example.

References:

[1] Wei, J., Tay, Y., Bommasani, R., et al. (08/2022). Emergent Abilities of Large Language Models. Transactions on Machine Learning Research.
[2] Mukherjee, S., Mitra, A., et al. (06/2023). Orca: Progressive Learning from Complex Explanation Traces of GPT-4. Microsoft Research.
[3] Aschenbrenner, L. (06/2024). Situational Awareness: The Decade Ahead.
[4] Villalobos, P., Ho, A., et al. (06/2024). Will we run out of data? Limits of LLM scaling based on human-generated data. ICML.
