
Experimenting and Putting Prompt Engineering Tactics into Practice

This is Part 2 of the Prompt Engineering series. It puts prompt engineering tactics into practice, covering model choice and LLM parameters like temperature. Check out Part 1 on the importance of creating effective prompts.


Prompt engineering is the practice of enhancing LLM responses by writing optimized prompts and tuning parameters like temperature or topP. There are many strategies and tactics for creating effective prompts, but how do these translate into real-life situations? Which tactics are the most effective? How should I combine prompt writing and parameter tuning? Let’s explore some real examples of prompt engineering put into practice.

Engineering better LLM responses

OpenAI documentation provides six main strategies for prompt engineering:

  1. Write clear instructions
  2. Provide reference text
  3. Split complex tasks into simpler subtasks
  4. Give the model time to "think"
  5. Use external tools
  6. Test changes systematically

These strategies point to the overarching principle of prompt engineering: providing clear, specific, well-structured instructions. There are many ways to split hairs about how exactly to do that, with numerous guides available online (including our previous Guide to Prompt Engineering, where we introduced the basics of creating clear and effective prompts). What we want to do is to dig deeper into how model choice, prompt writing tactics, and parameter tuning interact and interplay to produce the ideal responses.

In the following sections, we will explore each facet in turn and finally put it all together, so you have a concrete idea of what to focus on when implementing prompt engineering in your AI projects.

Model choice

Without any prompt engineering, your choice of LLM serves as the ringfence determining the base quality of the response output. To illustrate, here is the response by Cohere’s command model compared to Mistral’s mistral-small model. Both models were asked to summarize an article about NASA’s Europa Clipper mission:

Summarize ${text}
// Assume that ${text} is a dynamic variable that is injected with the relevant content at runtime

While both models provide a relatively accurate summary of the article, Mistral offers a well-structured response out of the box without additional prompting.

Diagram of Cohere’s command model’s output vs Mistral’s mistral-small-latest’s output, in response to the user query “Summarize text”.
Various LLM models differ in their instinctual, un-engineered responses.

Depending on your use case and system requirements, Mistral's response could be over-engineered or exactly what you need. Either way, using prompt engineering tactics, you can get command to respond just like mistral-small.
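
A quick way to see this for yourself is to send the identical prompt to each candidate model and compare the raw outputs side by side. Below is a minimal sketch of that comparison in Python, assuming a hypothetical complete() helper that wraps each vendor's SDK; only the model names (command and mistral-small-latest) come from the comparison above.

# Minimal sketch: send the identical prompt to each candidate model and
# compare the raw, un-engineered outputs side by side.
ARTICLE = "NASA has launched the Europa Clipper mission ..."  # injected at runtime

def complete(provider: str, model: str, prompt: str) -> str:
    """Hypothetical helper: wrap each vendor's SDK (Cohere, Mistral, etc.)
    behind one signature. Replace this stub with real client calls."""
    return f"[{provider}/{model}] would respond to: {prompt[:40]}..."

prompt = f"Summarize {ARTICLE}"

for provider, model in [("cohere", "command"), ("mistral", "mistral-small-latest")]:
    print(f"--- {provider}/{model} ---")
    print(complete(provider, model, prompt))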

Tactics for effective prompts

To provide clear, specific, well-structured instructions, these are some of the tactics employed in the prompts shown below:

  • Break up the instructions for clarity.

    Instead of combining the article and the instructions in a single sentence, the prompt is now divided into distinct sections: one for the content and another for the instructions on what to do with it.

  • Demarcate your prompt with special characters or tags where appropriate.

    The article content is now enclosed in quotation marks to explicitly mark the text that should be summarized.

  • Include details about the desired output. Quantify where possible.

    In one variation, the prompt now asks for the response to be “clear and concise” and even specifies the desired format of using bullet points. In another, it explicitly states the desired length of the output.

Diagram showing three different LLM prompts, with relevant phrases highlighted to demonstrate the prompt tactics.
There are numerous ways to implement prompt engineering tactics depending on what output you need.

Check out how the responses have improved in each of these prompts.

Prompt: Here is the text from an article: "${text}". Provide a summary 
of this text in a clear and concise manner.

Response: NASA has launched the Europa Clipper mission, a $5.2 billion 
robotic spacecraft, to explore Jupiter's moon, Europa, which is believed 
to have a global ocean of liquid water beneath its icy crust and could 
potentially support extraterrestrial life. The mission aims to determine 
if Europa has the necessary ingredients for life, including water, 
organic compounds, and an energy source. The spacecraft, the largest 
and most expensive planetary science mission in NASA's history, will 
conduct 49 close encounters with Europa, studying its ocean, atmosphere, 
and surface features. Scientists are particularly interested in the 
possibility of hydrothermal vents on Europa's ocean floor, which could 
support simple life forms. The mission also aims to find potential 
landing sites for future lander missions. Europa Clipper's findings will 
shape future exploration of Europa and the search for life beyond Earth.
Prompt: Here is the text from an article: "${text}". Provide a summary 
of this text in a clear and concise manner, starting with the title 
**Summary**, and including the following bullet points:
- **Mission:**
- **Objective:**
- **Spacecraft:**
- **Journey:**
- **Scientific Goals:**
- **Future Missions:**

Response: Summary

- **Mission:** NASA's Europa Clipper mission aims to explore Jupiter's 
moon, Europa, a promising location for finding extraterrestrial life.
- **Objective:** The primary goal is to determine if Europa has the 
necessary ingredients for life, particularly simple, single-celled 
organisms.
- **Spacecraft:** Europa Clipper is the largest and most expensive 
planetary science spacecraft NASA has ever launched. It carries a 
sophisticated suite of nine instruments to study Europa's ocean, 
surface, and atmosphere.
- **Journey:** The spacecraft was launched on a SpaceX Falcon Heavy 
rocket and will take a 5.5-year journey to Jupiter. It will fly by Mars 
and Earth for gravity assists, reaching Jupiter in 2030.
- **Scientific Goals:** Europa Clipper will study Europa's ocean, ice 
shell, atmosphere, and potential plumes to understand its habitability. 
It will search for water, organic compounds, and energy sources 
necessary for life.
- **Future Missions:** A Europa lander is a challenging prospect, but 
a compelling discovery could prompt a follow-up mission. Other high-
priority missions include a Uranus orbiter and a mission to Saturn's 
moon Enceladus.
Prompt: Here is the text from an article: "${text}". Summarize the text 
clearly in 280 characters using this format "Article summary: <your_summary_here>".

Response: Article summary: NASA's Europa Clipper mission launched on a 
SpaceX Falcon Heavy rocket to explore Jupiter's moon Europa, which may 
harbor life. The spacecraft will study Europa's ocean, atmosphere, and 
surface, seeking signs of habitability and potential life.
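
In a real application, prompts like these are usually assembled from a reusable template rather than typed out each time. Here is a minimal sketch of that idea in plain Python string formatting, mirroring the third prompt above; no particular SDK is assumed.

# Template for the third prompt variant: quotation marks demarcate the
# article content, and the format string plus character limit spell out
# the desired output.
SUMMARY_PROMPT = (
    'Here is the text from an article: "{article}". '
    'Summarize the text clearly in 280 characters using this format '
    '"Article summary: <your_summary_here>".'
)

def build_summary_prompt(article: str) -> str:
    return SUMMARY_PROMPT.format(article=article.strip())

print(build_summary_prompt("NASA has launched the Europa Clipper mission ..."))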

Temperature

In some cases, it is also useful to tune the temperature of the responses. This LLM parameter affects how creative or deterministic the output will be — the higher the temperature, the more random and creative the output will be.

While temperature is associated with creative tasks like generating a poem or a story, it can also apply to factual writing, like summaries. Take a look at this response below, where the temperature is set to 0.1. The text output reuses similar words or phrasing, and this repetitiveness makes the summary rather stale and unengaging.

Diagram of Cohere’s response to the summarization prompt at temperature 0.1.
A low temperature favors more frequent words or phrases, which can make the output repetitive.

In contrast, a temperature of 1 leads to more dynamic and lively text, but tends to exaggerate or include less important information. This would mean that the summarizer is prone to creating misleading snapshots of the articles.

Diagram of Cohere’s response to the summarization prompt at temperature 1.
A high temperature produces rarer words or phrases, which can lead to a focus on less important information.

A more moderate temperature of 0.4 could be the sweet spot for an article summarizer, providing relevant information with livelier language.

Diagram of Cohere’s response to the summarization prompt at temperature 0.4.
Tune your temperature to a more moderate value to avoid repetitiveness and steer clear of overly-exaggerated information at the same time.
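
To find the sweet spot for your own use case, you can sweep a few temperature values with the same prompt and compare the outputs. The sketch below is for illustration only: it uses the OpenAI Python SDK and a placeholder model name rather than the Cohere model shown in the figures, since most chat-completion APIs expose temperature in the same way.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

prompt = (
    'Here is the text from an article: "<article text here>". '
    'Provide a summary of this text in a clear and concise manner.'
)

for temperature in (0.1, 0.4, 1.0):
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; swap in your provider's model
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    print(f"--- temperature={temperature} ---")
    print(response.choices[0].message.content)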

Evaluating your LLM responses

We’ve seen individually how model choice, prompt writing tactics, and LLM parameters can affect your desired output. But how do these factors stack together to get optimized results at scale? While prompt writing tactics get you the biggest mileage in obtaining a particular result, LLM parameter tuning and model choice could be the distinguishing factor between an average response and a stellar one.

Using the following prompt with different models and temperature values:

Here is the text from an article: "${text}". Summarize the text clearly 
in 280 characters using this format "Article summary: <your_summary_here>".

We get the following responses:

Diagram of different model responses at different temperatures for the summarization prompt.

At a small sample size, it is easy to pick which answer is more desirable. But in a production setting, it is best practice to systematically test changes using a larger sample size to distinguish between randomness and true improvement.

One way to systematically track changes would be to evaluate the LLM responses against a benchmark — using relevant criteria or gold-standard answers. In this case, we want summaries that are concise yet comprehensive — answering the 5W1H (what, who, where, when, why, and how).

Annotated diagram of different model responses at different temperatures for the summarization prompt.

These evaluations can be manually done by humans or AI-assisted. Whichever route you choose, ensure that you can evaluate a large sample set with sufficient variety to cover all the edge cases in your use scenarios.
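
As a concrete starting point, a benchmark evaluation can be as simple as a script that scores each response against your criteria. The sketch below is illustrative: the 280-character cap comes from the prompt above, while the 5W1H keyword lists are hypothetical stand-ins for whatever rubric or gold-standard answers you define, and could just as easily be replaced by an LLM-as-judge step.

# Illustrative 5W1H keyword lists; in practice these come from your own
# rubric, gold-standard answers, or an LLM acting as a judge.
CRITERIA_KEYWORDS = {
    "what": ["mission", "spacecraft"],
    "who": ["nasa"],
    "where": ["europa", "jupiter"],
    "when": ["2030", "5.5-year"],
    "why": ["life", "habitability"],
    "how": ["falcon heavy", "instruments"],
}

def score_summary(summary: str) -> dict:
    text = summary.lower()
    covered = {q: any(k in text for k in ks) for q, ks in CRITERIA_KEYWORDS.items()}
    return {
        "within_280_chars": len(summary) <= 280,
        "coverage": sum(covered.values()) / len(covered),
        "missing": [q for q, hit in covered.items() if not hit],
    }

# Score a batch of responses collected from different models and temperatures.
samples = {
    "command @ 0.4": "Article summary: NASA's Europa Clipper mission ...",
    "mistral-small @ 0.4": "Article summary: The Europa Clipper spacecraft ...",
}
for label, summary in samples.items():
    print(label, score_summary(summary))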

Optimizing LLM responses for AI app development

In practice, an application that leverages generative AI is likely far more complex than a simple one-shot prompt. Multiple prompts can be used across different tasks in a process, or AI techniques like retrieval-augmented generation (RAG), computer vision, and other data pre-processing tasks may come into play.

An AI orchestration engine like Orkes Conductor can streamline such developmental efforts with its built-in AI tasks and feature-rich AI prompt builder, where you can test various prompt engineering strategies. With one-step integrations for a dozen LLM providers (including OpenAI, HuggingFace, and more), Orkes’ AI prompt builder unlocks the convenience of plugging, testing, and playing with prompts across models.

Testing Interface for prompts in Orkes Conductor.
Orkes Conductor AI prompt builder, as part of its AI Orchestration capabilities.

Besides testing, these prompts can be templatized, saved, and safely used in a production setting for AI-driven workflows and applications, such as document classification, RAG-based search retrieval, approval journeys, and more.

Wrap up

In this article, we have explored effective ways to combine model choice, prompt writing tactics, and parameter tuning, and to systematically test these changes.

Beyond what we have explored here, there are many other ways to improve model outputs, such as model fine-tuning or retrieval-augmented generation (RAG). A strategic combination of these different techniques and methods will unleash the full potential of generative AI in automating creative, complex, or human-involved tasks.


Using Orkes Conductor, you can rapidly build, optimize, and test AI-enabled applications in a distributed environment. Conductor is an open-source orchestration platform used widely in many mission-critical applications for orchestrating microservices, AI services, event handling, and more. Try out a fully-managed and hosted Conductor service with the 14-day free trial for Orkes Cloud.
