Overview
Large Language Models (LLMs) have been around for several years, but
recent advances have revolutionized the fields of natural language
processing (NLP) and artificial intelligence (AI), opening up
possibilities across many domains. OpenAI's ChatGPT has taken the
world by storm, showing a remarkable ability to generate human-like
responses to a wide range of queries.
These models require training on vast amounts of data and can, in
some sense, be thought of as a way to summarize and expose the
information in the training dataset. It is important to remember,
however, that the information returned in response to a query is a
statistical answer, not the result of the kind of logical reasoning
humans are capable of. This leads to the much-discussed hallucination
effect: a confidently written but factually wrong response to a
query.
Between discussion sites like StackOverflow and code repositories like
GitHub and GitLab, there is a tremendous amount of source code
available online for potential training. Efforts like GitHub Copilot
show the power of AI-assisted coding to boost productivity. Our
testing has shown, however, that while Copilot can greatly assist a
developer, it is also prone to hallucinations: inventing methods that
don't exist or using outdated APIs that have been removed from recent
versions of a library.
We decided to experiment with adding a feedback loop to the process of
code generation, to reduce model hallucinations and iterate towards a
working solution faster. We also experimented with conversational
feedback on the results, rather than the auto-complete mechanism that
Copilot currently implements (note: Copilot has just announced a
preview release of a conversational, chat-based mode).
As an experiment we developed a Python package, pseudocode, which
allows developers to use an LLM chat session to produce correct and
tested code simply by providing code annotations and tests. We believe
that pseudocode acts as a "higher-level language" for writing Python
code. An example is shown below. We emphasize that the interface is
via type annotations and function docstrings, with easy ways to
include automated tests. The overall flow is:
- We define the code to be generated by providing a function
  signature with type annotations
- We define tests that must pass
- We submit these to the OpenAI ChatGPT 3.5 turbo API with some
  instructions
- If the user provides feedback, then the feedback is sent back to
  ChatGPT and we repeat the process
- We run the resulting code, and if the tests fail, we resubmit the
  errors to the ChatGPT session
- Any feedback is resubmitted to the ChatGPT session to continue
  refining the code (a code sketch of this loop follows the list)
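To make the loop concrete, here is a minimal sketch of the control
flow. This is an illustration, not pseudocode's actual source: the
helper run_tests is a hypothetical stand-in for executing the declared
tests, and the API call uses the 0.x-era openai.ChatCompletion
interface that was current at the time.

import openai  # 0.x SDK

def generate_code(messages: list) -> str:
    """Ask the chat model for (revised) code given the conversation so far."""
    response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
    return response["choices"][0]["message"]["content"]

def run_tests(code: str) -> str:
    """Hypothetical stand-in: execute the code's tests, return "" on success."""
    try:
        exec(code, {})
        return ""
    except Exception as exc:
        return f"{type(exc).__name__}: {exc}"

def refine(specification: str, max_rounds: int = 5) -> str:
    """Iterate LLM output against user feedback and test failures."""
    messages = [{"role": "user", "content": specification}]
    code = generate_code(messages)
    for _ in range(max_rounds):
        feedback = input("What code feedback would you like to provide? ").strip()
        if not feedback:                # empty input means approval ...
            feedback = run_tests(code)  # ... so run the automated tests
            if not feedback:
                return code             # tests pass: we are done
        messages.append({"role": "user", "content": feedback})
        code = generate_code(messages)  # resubmit feedback/errors for a revision
    raise RuntimeError("no passing solution within max_rounds")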
Below we walk through a demonstration of this process.
Example
Let's use a concrete, slightly non-trivial task: fetching GitHub
issues.
import datetime
import typing

from pseudocode import pseudo_function

@pseudo_function(review=True)
def get_issues(repository: str) -> typing.List[typing.Tuple[str, int, datetime.datetime]]:
    """A function to fetch all issues created by

    A function that does the following:
     - assume you have a github token environment variable GITHUB_TOKEN
     - assume that repository is of the form organization/repo
     - use the requests library
     - fetch all github issues from repository in last 10 days
     - only show issue numbers which are odd
     - return a tuple with issue titles, number, and date created

    >>> all([_[1] % 2 == 1 for _ in get_issues("conda/conda")])
    True
    """

print('github issues', get_issues("conda/conda"))
We feel this is a good example because it is hard to remember exactly
how to use the GitHub API, and the task inevitably requires some back
and forth between the PyGithub Python library, raw API requests, and
the API documentation. LLMs show promise for these applications since
there is a large amount of surrounding documentation but no single
example that does exactly what you need. pseudocode enforces the
practice of creating an interface specification which:
- declares the function signature:
  def get_issues(repository: str) -> typing.List[typing.Tuple[str, int, datetime.datetime]]
- uses the function's docstring to specify what the function should do
- automates the tests run by pseudocode to give the LLM feedback on
  the quality and correctness of the generated code
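For example, the doctest embedded in the docstring above can be run
with Python's standard doctest machinery. A minimal sketch of how such
test feedback might be collected (pseudocode's actual mechanism may
differ):

import doctest

def collect_doctest_feedback(func) -> str:
    """Run a function's doctests; return failure text, or "" if all pass."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    failures = []
    for test in finder.find(func):
        result = runner.run(test)  # returns TestResults(failed, attempted)
        if result.failed:
            failures.append(f"{result.failed} of {result.attempted} examples failed in {test.name}")
    return "\n".join(failures)

An empty string means every example passed; otherwise the failure text
is exactly the kind of message that can be sent back to the LLM.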
The specification is essential for helping the LLM avoid hallucinating, and we have found that it is highly effective for the examples we've experimented with. When you run the example above, you will see the following.
╭────────────────────────────────────────────────────── Function Specification ────────────────────────────────────╮
│ The following is a short description of what this function should do "A function to fetch all issues created by".│
│ A longer more detailed description of what this function should do is as follows: │
│ A function that does the following: │
│ - assume you have a github token environment variable GITHUB_TOKEN │
│ - assume that repository is of the form organization/repo │
│ - use the requests library │
│ - fetch all github issues from repository in last 10 days │
│ - only show issue numbers which are odd │
│ - return a tuple with issue titles, number, and date created │
│ The function takes the following arguments: │
│  - variable "repository" of python type "<class 'str'>"                                                           │
│ This function must return a result of python type "typing.List[typing.Tuple]". │
│ Output code that will satisfy the given requirements. │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
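A specification like the one in the box above can be assembled
mechanically from the decorated function. A rough sketch using the
standard inspect module (the exact template pseudocode uses may
differ):

import inspect

def build_prompt(func) -> str:
    """Build a specification prompt from a function's signature and docstring."""
    sig = inspect.signature(func)
    doc = inspect.getdoc(func) or ""
    short, _, rest = doc.partition("\n")
    lines = [
        f'The following is a short description of what this function should do "{short.strip()}".',
        "A longer more detailed description of what this function should do is as follows:",
        rest.strip(),
        "The function takes the following arguments:",
    ]
    for name, param in sig.parameters.items():
        lines.append(f' - variable "{name}" of python type "{param.annotation}"')
    lines.append(f'This function must return a result of python type "{sig.return_annotation}".')
    lines.append("Output code that will satisfy the given requirements.")
    return "\n".join(lines)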
╭────────────────────────────────────────── Code Review ──────────────────────────────────────────╮
│ def run(repository: str) -> list:                                                               │
│     headers = {                                                                                 │
│         'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}",                                │
│         'Accept': 'application/vnd.github.v3+json'                                              │
│     }                                                                                           │
│     url = f"https://api.github.com/repos/{repository}/issues"                                   │
│     now = datetime.datetime.now()                                                               │
│     ten_days_ago = now - datetime.timedelta(days=10)                                            │
│     response = requests.get(url, headers=headers, params={"since": ten_days_ago})               │
│     issues = response.json()                                                                    │
│     odd_issue_numbers = [f"{issue['number']}" for issue in issues if issue['number'] % 2 != 0]  │
│     result = []                                                                                 │
│     for issue in issues:                                                                        │
│         if str(issue['number']) in odd_issue_numbers:                                           │
│             result.append((issue['title'], issue['number'], issue['created_at']))               │
│     return result                                                                               │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
What code feedback would you like to provide? (leave empty for approval):
pseudocode generates appropriate prompts to the LLM, asking it to
provide code that meets the user's requirements. In this case we
specified review=True in the decorator, which prompts pseudocode to
ask the user for feedback on the generated code.
What code feedback would you like to provide? (leave empty for approval): this code is messy can you simplify it?
╭─────────────────────────────────────────────────────────── LLM Feedback ─────────────────────────╮
│ this code is messy can you simplify it? │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────────────────────────────────── Code Review ─────────────────────────╮
│ def run(repository: str) -> list: │
│ headers = {'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}"} │
│ url = f"https://api.github.com/repos/{repository}/issues" │
│ since = datetime.datetime.now() - datetime.timedelta(days=10) │
│ params = {'since': since.isoformat()} │
│ response = requests.get(url, headers=headers, params=params) │
│ issues = response.json() │
│ return [(i['title'], i['number'], i['created_at']) for i in issues if i['number'] % 2 != 0] │
╰──────────────────────────────────────────────────────────────────────────────────────────────────╯
What code feedback would you like to provide? (leave empty for approval):
After some quick initial feedback we can already see that the LLM
takes the user's needs into account; as a bonus, the simplified
version serializes the since parameter with isoformat(), matching the
ISO 8601 timestamp the GitHub API expects. Once the user approves, the
automated tests are run. If those tests fail or raise exceptions, the
errors are automatically passed back to the LLM for corrections. In
this case the tests pass, and the script prints the fetched issues
(output abridged):
pass result = str(all([_[1] % 2 == 1 for _ in run("conda/conda")]))
github issues [
    ('Use appropriate `defaults` value for Windows is use when subdir=win-*', 12555, '2023-03-30T17:44:55Z'),
    ...
    ('Document steps to debug tests with vscode', 12459, '2023-03-07T18:20:08Z')
]
The final generated code after passing all tests:

def run(repository: str) -> list:
    headers = {'Authorization': f"Bearer {os.environ['GITHUB_TOKEN']}"}
    url = f"https://api.github.com/repos/{repository}/issues"
    since = datetime.datetime.now() - datetime.timedelta(days=10)
    params = {'since': since.isoformat()}
    response = requests.get(url, headers=headers, params=params)
    issues = response.json()
    return [(i['title'], i['number'], i['created_at']) for i in issues if i['number'] % 2 != 0]
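For intuition, the decorator mechanics can be sketched as follows.
This is a simplification rather than pseudocode's actual source
(build_prompt and refine are the sketches from earlier): the stub's
own body is never executed; instead, the generated run function is
exec'd into a namespace and called in its place.

import functools

def pseudo_function(review: bool = False):
    """Sketch of a decorator that swaps an annotated stub for generated code."""
    def decorator(func):
        namespace: dict = {}
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if "run" not in namespace:
                # review would gate the interactive feedback prompt in refine
                code = refine(build_prompt(func))
                exec(code, namespace)  # defines run(...)
            return namespace["run"](*args, **kwargs)
        return wrapper
    return decorator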
Conclusion
There are several key ideas we want to highlight about why we think
the approaches taken in pseudocode are unique.

- Autogenerated code needs guidance. Feedback comes from the user, and
  the rest is guided by automated testing and exception reporting back
  to the LLM.
- Interface design is naturally a high-level task and is the key to
  composable code. Declaring interfaces for LLMs to operate within is
  important.
- Separation of code declarations and generated code. Similar to .py
  and .pyc files, we should separate the interfaces from the
  autogenerated code.
We mentioned earlier that this was an experiment in the interaction
between automated testing, user feedback, and LLMs. We want to explore
this further, and we already have additional ideas, like applying
formatting and linting to generated LLM code via black, ruff, and
isort, and caching generated LLM code to avoid repetitive API calls
while building a database of reusable code.
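A minimal sketch of the caching idea, keyed on a hash of the
specification so an unchanged interface never triggers a new API call.
The cache location is our assumption, generate_code is the sketch from
earlier, and black.format_str normalizes the formatting before the
code is stored:

import hashlib
import pathlib

import black

CACHE_DIR = pathlib.Path("~/.cache/pseudocode").expanduser()  # hypothetical location

def cached_generate(specification: str) -> str:
    """Reuse previously generated code when the specification is unchanged."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    key = hashlib.sha256(specification.encode()).hexdigest()
    cache_file = CACHE_DIR / f"{key}.py"
    if cache_file.exists():
        return cache_file.read_text()  # cache hit: no API call needed
    code = generate_code([{"role": "user", "content": specification}])
    code = black.format_str(code, mode=black.Mode())  # normalize formatting
    cache_file.write_text(code)
    return code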
Similar to the pseudocode.pseudo_function(...) decorator shown
earlier, it would be nice to have an equivalent for arbitrary files,
as in the sketch below. This would provide a feel somewhat like
cookiecutter on steroids.
def test_dockercompose(contents):
    """A docker-compose file to generate a running redis service on port 5000 with no password"""
To see the full example, have a look at this video.
Check out the project at Quansight/pseudocode.