Many LLM challenges can already be solved just by writing prompts creatively:

  • CoT (let's think step by step)
  • Toolformer
  • Various adversarial techniques (have the LLM critique itself)
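
To make these patterns concrete, here is a minimal sketch of a CoT-plus-self-critique loop. The `llm` callable is an assumption, a stand-in for whatever completion API you happen to use:

```python
from typing import Callable

# A minimal sketch of CoT plus self-critique. `llm` is a stand-in for
# a completion API; it takes a prompt and returns the model's text.
def answer_with_critique(
    llm: Callable[[str], str], question: str, rounds: int = 2
) -> str:
    # Chain-of-thought: nudge the model to reason before answering.
    answer = llm(f"{question}\nLet's think step by step.")
    for _ in range(rounds):
        # Ask the model to attack its own answer...
        critique = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            "List any mistakes or weaknesses in this answer."
        )
        # ...then fold the critique back into a revised attempt.
        answer = llm(
            f"Question: {question}\nAnswer: {answer}\n"
            f"Critique: {critique}\nWrite an improved answer."
        )
    return answer
```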

One of the better write-ups can be seen here, where GPT-3 is used to write SQL from natural language. The write-up uses some of the techniques listed above, but it only scratches the surface. The SQL example uses a series of 26 prompts. Each prompt probably takes somewhere around 10 seconds, so 260 seconds in total. That is a little slow, but probably much, much faster than a human could manage. The article also notes that the outputs are sometimes inaccurate. It seems that this, too, could perhaps be solved by having the model generate litmus tests that it could then execute.
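
A sketch of what that could look like, assuming an `llm` callable wrapping your completion API and a SQLite database to run the checks against (both are my assumptions, not details from the article):

```python
import sqlite3
from typing import Callable

def sql_passes_litmus_tests(
    llm: Callable[[str], str], question: str, sql: str, db_path: str
) -> bool:
    # Ask the model for simple executable checks on the generated query.
    tests = llm(
        f"Question: {question}\nSQL: {sql}\n"
        "Write one SQL query per line, each returning a single row with the "
        "value 1 if the SQL above is correct and 0 otherwise."
    )
    conn = sqlite3.connect(db_path)
    try:
        for test in filter(None, (t.strip() for t in tests.splitlines())):
            row = conn.execute(test).fetchone()
            if not row or row[0] != 1:
                return False  # a litmus test failed; reject this SQL
        return True
    except sqlite3.Error:
        return False  # the generated test itself didn't run
    finally:
        conn.close()
```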

So it's slow and error-prone, but it seems that many problems can be fixed by writing and executing more prompts! We do, however, face the following challenges:

  • Context length is limited
  • Speed is slow
  • System resource requirements are high (network calls are necessary)

What if instead of using 26 prompts, we used something ludicrous, like 1,000,000 prompts? How far can you get with that? Can you write an entire application end to end? I don't think this is far-fetched. No doubt you would need to invent a lot of prompt scaffolding: which prompts do you need, and in what order? How do you hook the generation process up to external feedback, like a compiler? This is not trivial work, but it doesn't seem completely impossible either. LLMs are already generating all of the individual pieces needed to assemble an application; all we need is a framework for building the pieces in isolation and composing them together. Throw in a good process for refactoring existing code and I think we are there.
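
Hooking generation up to a compiler might look something like this sketch. I'm using Python's built-in compile() as a cheap stand-in for a real compiler, and `llm` is again an assumed wrapper around your completion API:

```python
from typing import Callable

def generate_until_it_compiles(
    llm: Callable[[str], str], spec: str, max_rounds: int = 5
) -> str:
    prompt = f"Write a Python module that does the following:\n{spec}"
    source = llm(prompt)
    for _ in range(max_rounds):
        try:
            # Syntax check only; a real system would shell out to gcc, tsc, etc.
            compile(source, "<generated>", "exec")
            return source
        except SyntaxError as err:
            # Feed the compiler error back into the next prompt.
            source = llm(
                f"{prompt}\n\nYour previous attempt:\n{source}\n"
                f"failed to compile with: {err}\nFix it and return the full module."
            )
    raise RuntimeError("no compiling version within the round limit")
```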

Imagine the loop: generate a task, generate 10 attempts at solving it, select the best one, then generate a new task; rinse and repeat (see the sketch below). This would be a kind of depth-first search of all the possibilities, guided by feedback from tools (like a generated test suite) and a human-researched chain of tasks.
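
A sketch of that loop, with two assumed callables: `llm` wraps the completion API, and `run_tests` executes a generated test suite against a candidate solution and returns how many tests pass:

```python
from typing import Callable

def solve_chain(
    llm: Callable[[str], str],
    run_tests: Callable[[str, str], int],
    goal: str,
    steps: int = 100,
    attempts_per_task: int = 10,
) -> list[str]:
    done: list[str] = []
    for _ in range(steps):
        # Generate the next task given what has been completed so far.
        task = llm(f"Goal: {goal}\nCompleted so far: {done}\nWhat is the next task?")
        tests = llm(f"Write a test suite for this task:\n{task}")
        # Sample several candidate solutions and keep the best-scoring one.
        candidates = [
            llm(f"Solve this task:\n{task}") for _ in range(attempts_per_task)
        ]
        done.append(max(candidates, key=lambda c: run_tests(c, tests)))
    return done
```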

Right now this is limited by the engineering challenges mentioned above. Executing 1,000,000 prompts in sequence would take roughly 116 days given 10 seconds per prompt. Figuring out the framework needed to sequence the prompts in the right order will probably take at least another 100,000 prompts on its own. Not to mention that it costs a fortune. But what happens when/if those problems are solved? Say we have the following things figured out:

  • 100K token context length
  • Instant inference
  • Can be run locally

It is now cost-effective and simple to build elaborate prompt scaffolds, in the same sense that it is cost-effective and simple to compute pi to 10,000 decimals. What happens then?