Prompt engineering


Prompt engineering is the process of structuring or crafting an instruction in order to produce the best possible output from a generative artificial intelligence (AI) model.[1]

A prompt is natural language text describing the task that an AI should perform.[2] A prompt for a text-to-text language model can be a query, a command, or a longer statement including context, instructions, and conversation history. Prompt engineering may involve phrasing a query, specifying a style, choosing words and grammar,[3] providing relevant context, or describing a character for the AI to mimic.[1]

When communicating with a text-to-image or a text-to-audio model, a typical prompt is a description of a desired output such as "a high-quality photo of an astronaut riding a horse"[4] or "Lo-fi slow BPM electro chill with organic samples".[5] Prompting a text-to-image model may involve adding, removing, emphasizing, and re-ordering words to achieve a desired subject, style,[6] layout, lighting,[7] and aesthetic.

History

In 2018, researchers first proposed that all previously separate tasks in natural language processing (NLP) could be cast as a question-answering problem over a context. In addition, they trained a first single, joint, multi-task model that would answer any task-related question, such as "What is the sentiment?", "Translate this sentence to German", or "Who is the president?"[8]

The AI boom saw an increase in the number of "prompting techniques" used to get the model to output the desired outcome and avoid nonsensical output, a process characterized by trial and error.[9] After the release of ChatGPT in 2022, prompt engineering was soon seen as an important business skill, albeit one with an uncertain economic future.[1]

A repository for prompts reported that over 2,000 public prompts for around 170 datasets were available in February 2022.[10] In 2022, the chain-of-thought prompting technique was proposed by Google researchers.[11][12] In 2023, several text-to-text and text-to-image prompt databases were made publicly available.[13][14] The Personalized Image-Prompt (PIP) dataset, a generated image-text dataset that has been categorized by 3,115 users, was also made publicly available in 2024.[15]

Text-to-text

Multiple distinct prompt engineering techniques have been published.

Chain-of-thought


According to Google Research, chain-of-thought (CoT) prompting is a technique that allows large language models (LLMs) to solve a problem as a series of intermediate steps before giving a final answer. In 2022, Google Brain reported that chain-of-thought prompting improves reasoning ability by inducing the model to answer a multi-step problem with steps of reasoning that mimic a train of thought.[11][16] Chain-of-thought techniques were developed to help LLMs handle multi-step reasoning tasks, such as arithmetic or commonsense reasoning questions.[17][18]

For example, given the question, "Q: The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?", Google claims that a CoT prompt might induce the LLM to answer "A: The cafeteria had 23 apples originally. They used 20 to make lunch. So they had 23 - 20 = 3. They bought 6 more apples, so they have 3 + 6 = 9. The answer is 9."[11] When applied to PaLM, a 540 billion parameter language model, according to Google, CoT prompting significantly aided the model, allowing it to perform comparably with task-specific fine-tuned models on several tasks, achieving state-of-the-art results at the time on the GSM8K mathematical reasoning benchmark.[11] It is possible to fine-tune models on CoT reasoning datasets to enhance this capability further and stimulate better interpretability.[19][20]

An example of a CoT prompt:[21]

   Q: {question}
   A: Let's think step by step.

As originally proposed by Google,[11] each CoT prompt included a few Q&A examples, making it a few-shot prompting technique. However, according to researchers at Google and the University of Tokyo, simply appending the words "Let's think step by step"[21] has also proven effective, which makes CoT a zero-shot prompting technique. OpenAI claims that this prompt allows for better scaling, as a user no longer needs to formulate many specific CoT Q&A examples.[22]
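
The following is a minimal Python sketch of zero-shot CoT prompting. The llm_complete function is a hypothetical placeholder for whatever model or API is being prompted, and the second call, which extracts a final answer from the generated reasoning, is one common way to use the technique rather than something described in the text above:

   def llm_complete(prompt: str) -> str:
       """Placeholder for a call to any text-completion model or API."""
       raise NotImplementedError

   def zero_shot_cot(question: str) -> str:
       # Stage 1: elicit the intermediate reasoning steps.
       reasoning = llm_complete(f"Q: {question}\nA: Let's think step by step.")
       # Stage 2: ask for a final answer distilled from the model's own reasoning.
       answer = llm_complete(
           f"Q: {question}\nA: Let's think step by step. {reasoning}\n"
           "Therefore, the answer is"
       )
       return answer.strip()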

In-context learning

In-context learning refers to a model's ability to temporarily learn from prompts. For example, a prompt may include a few examples for a model to learn from, such as asking the model to complete "maison → house, chat → cat, chien →" (the expected response being dog),[23] an approach called few-shot learning.[24]
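
A minimal Python sketch of the few-shot pattern above; llm_complete is a hypothetical placeholder for any completion model, and no model weights are changed by this code:

   def llm_complete(prompt: str) -> str:
       """Placeholder for a call to any text-completion model or API."""
       raise NotImplementedError

   EXAMPLES = [("maison", "house"), ("chat", "cat")]

   def few_shot_translate(word: str) -> str:
       # The examples are supplied entirely in the prompt; no weights are updated.
       prompt = "".join(f"{src} -> {tgt}\n" for src, tgt in EXAMPLES)
       prompt += f"{word} -> "
       return llm_complete(prompt).strip()

   # few_shot_translate("chien") is expected to return "dog".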

In-context learning is an emergent ability[25] of large language models. It is an emergent property of model scale, meaning that breaks[26] in downstream scaling laws occur, such that its efficacy increases at a different rate in larger models than in smaller models.[27][11] Unlike training and fine-tuning, which produce lasting changes, in-context learning is temporary.[28] Training models to perform in-context learning can be viewed as a form of meta-learning, or "learning to learn".[29]

Self-consistency decoding

Self-consistency decoding[30] performs several chain-of-thought rollouts, then selects the most commonly reached conclusion out of all the rollouts. If the rollouts disagree significantly, a human can be queried for the correct chain of thought.[31]
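
A minimal Python sketch of self-consistency decoding; sample_cot and extract_answer are hypothetical placeholders for a sampling-based model call and for pulling the final answer out of a chain of thought:

   from collections import Counter
   from typing import Callable

   def self_consistency(question: str,
                        sample_cot: Callable[[str], str],
                        extract_answer: Callable[[str], str],
                        n_rollouts: int = 10) -> str:
       # Sample several independent chain-of-thought rollouts (temperature > 0).
       rollouts = [sample_cot(f"Q: {question}\nA: Let's think step by step.")
                   for _ in range(n_rollouts)]
       answers = [extract_answer(r) for r in rollouts]
       # Majority vote: return the most commonly reached conclusion.
       return Counter(answers).most_common(1)[0][0]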

Tree-of-thought

Tree-of-thought prompting generalizes chain-of-thought by prompting the model to generate one or more "possible next steps", and then running the model on each of the possible next steps using breadth-first, beam, or some other method of tree search.[32] The LLM has additional modules that feed the history of the problem-solving process back to it, which allows the system to backtrack over steps in the problem-solving process.
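
A minimal Python sketch of tree-of-thought prompting with beam search; the llm and score functions, the proposal prompt, and the default search parameters are illustrative assumptions rather than the interface of any published implementation:

   from typing import Callable, List

   def tree_of_thought(problem: str,
                       llm: Callable[[str], str],
                       score: Callable[[str, str], float],
                       depth: int = 3,
                       branching: int = 3,
                       beam_width: int = 2) -> str:
       # Each partial solution is the history of steps taken so far.
       frontier: List[str] = [""]
       for _ in range(depth):
           candidates = []
           for partial in frontier:
               for _ in range(branching):
                   step = llm(f"Problem: {problem}\n"
                              f"Steps so far:\n{partial}\n"
                              "Propose one possible next step:")
                   candidates.append(partial + step.strip() + "\n")
           # Keep only the highest-scored partial solutions (beam search);
           # discarding a branch corresponds to backtracking.
           frontier = sorted(candidates,
                             key=lambda c: score(problem, c),
                             reverse=True)[:beam_width]
       return frontier[0]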

Prompting to disclose uncertainty

By default, the output of language models may not contain estimates of uncertainty. The model may output text that appears confident, though the underlying token predictions have low likelihood scores. Large language models like GPT-4 can have accurately calibrated likelihood scores in their token predictions,[33] and so the model output uncertainty can be directly estimated by reading out the token prediction likelihood scores.
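
A minimal Python sketch of turning per-token log-probability scores into a confidence estimate, assuming the model or API exposes a log-probability for each generated token (the list format used here is an assumption):

   import math
   from typing import List

   def sequence_confidence(token_logprobs: List[float]) -> float:
       # Product of per-token probabilities = exp(sum of log-probabilities).
       # A well-calibrated model assigns low values to answers it is unsure of.
       return math.exp(sum(token_logprobs))

   # Example: three tokens predicted with probabilities 0.9, 0.8, and 0.95.
   print(sequence_confidence([math.log(0.9), math.log(0.8), math.log(0.95)]))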

Prompting to estimate model sensitivity

Research consistently demonstrates that LLMs are highly sensitive to subtle variations in prompt formatting, structure, and linguistic properties. Some studies have shown differences of up to 76 accuracy points across formatting changes in few-shot settings.[34] Linguistic features such as morphology, syntax, and lexico-semantic changes significantly influence prompt effectiveness and can meaningfully enhance performance across a variety of tasks.[3][35] Clausal syntax, for example, improves consistency and reduces uncertainty in knowledge retrieval.[36] This sensitivity persists even with larger model sizes, additional few-shot examples, or instruction tuning.

To address sensitivity of models and make them more robust, several methods have been proposed. FormatSpread facilitates systematic analysis by evaluating a range of plausible prompt formats, offering a more comprehensive performance interval.[34] Similarly, PromptEval estimates performance distributions across diverse prompts, enabling robust metrics such as performance quantiles and accurate evaluations under constrained budgets.[37]
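
A minimal Python sketch in the spirit of FormatSpread: the same examples are evaluated under several plausible prompt formats and the resulting accuracy interval is reported. The formats and the llm function are illustrative assumptions, not the published tool's interface:

   from typing import Callable, List, Tuple

   FORMATS = [
       "Question: {q}\nAnswer:",
       "Q: {q}\nA:",
       "{q}\nThe answer is",
   ]

   def format_spread(dataset: List[Tuple[str, str]],
                     llm: Callable[[str], str]) -> Tuple[float, float]:
       accuracies = []
       for fmt in FORMATS:
           correct = sum(
               llm(fmt.format(q=q)).strip() == gold for q, gold in dataset
           )
           accuracies.append(correct / len(dataset))
       # Report the performance interval rather than a single number.
       return min(accuracies), max(accuracies)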

Automatic prompt generation

Retrieval-augmented generation


Two-phase process of document retrieval using dense embeddings and LLM for answer formulation

Retrieval-augmented generation (RAG) is a two-phase process involving document retrieval and answer generation by a large language model. The initial phase uses dense embeddings to retrieve documents. This retrieval can be based on a variety of database formats depending on the use case, such as a vector database, summary index, tree index, or keyword table index.[38] In response to a query, a document retriever selects the most relevant documents. This relevance is typically determined by first encoding both the query and the documents into vectors, then identifying documents whose vectors are closest in Euclidean distance to the query vector. Following document retrieval, the LLM generates an output that incorporates information from both the query and the retrieved documents.[39] RAG can also be used as a few-shot learner.
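
A minimal Python sketch of the two phases described above, using a hypothetical embed function and brute-force nearest-neighbour search by Euclidean distance; a production system would typically use a vector database rather than re-embedding every document per query:

   import math
   from typing import Callable, List, Sequence

   def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
       return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

   def rag_answer(query: str,
                  documents: List[str],
                  embed: Callable[[str], List[float]],
                  llm: Callable[[str], str],
                  k: int = 3) -> str:
       # Phase 1: retrieval -- rank documents by distance to the query embedding.
       q_vec = embed(query)
       ranked = sorted(documents, key=lambda d: euclidean(embed(d), q_vec))
       context = "\n\n".join(ranked[:k])
       # Phase 2: generation -- the LLM answers using the query plus retrieved context.
       return llm(f"Context:\n{context}\n\nQuestion: {query}\nAnswer:")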

Graph retrieval-augmented generation

GraphRAG with a knowledge graph combining access patterns for unstructured, structured, and mixed data

GraphRAG[40] (coined by Microsoft Research) is a technique that extends RAG with the use of a knowledge graph (usually, LLM-generated) to allow the model to connect disparate pieces of information, synthesize insights, and holistically understand summarized semantic concepts over large data collections. It was shown to be effective on datasets like the Violent Incident Information from News Articles (VIINA).[41]

Earlier work showed the effectiveness of using a knowledge graph for question answering via text-to-query generation.[42] These techniques can be combined to search across both unstructured and structured data, providing expanded context and improved ranking.

Using language models to generate prompts

Large language models (LLMs) themselves can be used to compose prompts for large language models.[43] The automatic prompt engineer algorithm uses one LLM to beam search over prompts for another LLM (a sketch of this loop follows the list below):[44][45]

  • There are two LLMs. One is the target LLM, and the other is the prompting LLM.
  • The prompting LLM is presented with example input-output pairs, and asked to generate instructions that could have caused a model following the instructions to generate the outputs, given the inputs.
  • Each of the generated instructions is used to prompt the target LLM, followed by each of the inputs. The log-probabilities of the outputs are computed and added. This is the score of the instruction.
  • The highest-scored instructions are given to the prompting LLM for further variations.
  • Repeat until some stopping criterion is reached, then output the highest-scored instructions.
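
A minimal Python sketch of this loop; propose, logprob_of_output, and vary are hypothetical placeholders for the prompting-LLM and target-LLM calls described above:

   from typing import Callable, List, Tuple

   def automatic_prompt_engineer(
           pairs: List[Tuple[str, str]],
           propose: Callable[[List[Tuple[str, str]]], List[str]],
           logprob_of_output: Callable[[str, str, str], float],
           vary: Callable[[List[str]], List[str]],
           rounds: int = 3,
           beam_width: int = 5) -> str:
       def instruction_score(instr: str) -> float:
           # Score of an instruction: summed log-probability of each reference
           # output when the target LLM is prompted with instruction + input.
           return sum(logprob_of_output(instr, x, y) for x, y in pairs)

       # The prompting LLM proposes candidate instructions from the example pairs.
       candidates = propose(pairs)
       for _ in range(rounds):
           best = sorted(candidates, key=instruction_score,
                         reverse=True)[:beam_width]
           # The prompting LLM generates further variations of the best instructions.
           candidates = best + vary(best)
       return max(candidates, key=instruction_score)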

CoT examples can be generated by LLMs themselves. In "auto-CoT",[46] a library of questions is converted to vectors by a model such as BERT. The question vectors are clustered, and the questions nearest to the centroid of each cluster are selected. An LLM performs zero-shot CoT on each selected question, and the resulting CoT examples are added to the dataset. When prompted with a new question, CoT examples for the nearest questions can be retrieved and added to the prompt.
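
A minimal Python sketch of auto-CoT; embed, kmeans, and zero_shot_cot are hypothetical placeholders for the sentence encoder (such as BERT), the clustering routine, and the zero-shot CoT call:

   from typing import Callable, List, Tuple

   def build_auto_cot_demos(
           questions: List[str],
           embed: Callable[[str], List[float]],
           kmeans: Callable[[List[List[float]], int], Tuple[List[int], List[List[float]]]],
           zero_shot_cot: Callable[[str], str],
           n_clusters: int = 8) -> List[Tuple[str, str]]:
       vectors = [embed(q) for q in questions]
       labels, centroids = kmeans(vectors, n_clusters)  # cluster the question vectors
       demos = []
       for c, centroid in enumerate(centroids):
           members = [i for i, label in enumerate(labels) if label == c]
           if not members:
               continue
           # Select the question nearest to the cluster centroid ...
           nearest = min(members, key=lambda i: sum(
               (a - b) ** 2 for a, b in zip(vectors[i], centroid)))
           # ... and let the LLM produce a zero-shot chain of thought for it.
           demos.append((questions[nearest], zero_shot_cot(questions[nearest])))
       return demos  # these CoT examples are later prepended to prompts for new questions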

Text-to-image


Example of prompt engineering for text-to-image generation, with Fooocus

In 2022, text-to-image models like DALL-E 2, Stable Diffusion, and Midjourney were released to the public.[47] These models take text prompts as input and use them to generate AI-generated images. Text-to-image models typically do not understand grammar and sentence structure in the same way as large language models,[48] and thus may require a different set of prompting techniques.

Text-to-image models do not natively understand negation. The prompt "a party with no cake" is likely to produce an image including a cake.[48] As an alternative, negative prompts allow a user to indicate, in a separate prompt, which terms should not appear in the resulting image.[49] Techniques such as framing the normal prompt into a sequence-to-sequence language modeling problem can be used to automatically generate an output for the negative prompt.[50]
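
A minimal sketch of supplying a negative prompt with the Hugging Face diffusers library; the model checkpoint named here is one common choice, and other text-to-image interfaces expose negative prompts differently:

   # Assumes: pip install diffusers transformers torch
   from diffusers import StableDiffusionPipeline

   pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
   image = pipe(
       prompt="a party with balloons and streamers",
       negative_prompt="cake",  # terms that should not appear in the output image
   ).images[0]
   image.save("party.png")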


Prompt formats

A text-to-image prompt commonly includes a description of the subject of the art, the desired medium (such as digital painting or photography), style (such as hyperrealistic or pop-art), lighting (such as rim lighting or crepuscular rays), color, and texture.[51] Word order also affects the output of a text-to-image prompt. Words closer to the start of a prompt may be emphasized more heavily.[6]

The Midjourney documentation encourages short, descriptive prompts: instead of "Show me a picture of lots of blooming California poppies, make them bright, vibrant orange, and draw them in an illustrated style with colored pencils", an effective prompt might be "Bright orange California poppies drawn with colored pencils".[48]

Artist styles

Some text-to-image models are capable of imitating the style of particular artists by name. For example, the phrase in the style of Greg Rutkowski has been used in Stable Diffusion and Midjourney prompts to generate images in the distinctive style of Polish digital artist Greg Rutkowski.[52] The styles of famous artists such as Vincent van Gogh and Salvador Dalí have also been used for styling and testing.[53]

Non-text prompts

Some approaches augment or replace natural language text prompts with non-text input.

Textual inversion and embeddings

For text-to-image models, textual inversion[54] performs an optimization process to create a new word embedding based on a set of example images. This embedding vector acts as a "pseudo-word" which can be included in a prompt to express the content or style of the examples.
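
An abstract PyTorch sketch of the textual-inversion optimization loop; reconstruction_loss is a hypothetical placeholder for the diffusion model's image-reconstruction objective, and the embedding size and hyperparameters are illustrative:

   from typing import Callable, List

   import torch

   def learn_pseudo_word(example_images: List[torch.Tensor],
                         reconstruction_loss: Callable[[torch.Tensor, torch.Tensor], torch.Tensor],
                         embedding_dim: int = 768,
                         steps: int = 1000) -> torch.Tensor:
       # The new "pseudo-word" is a single trainable embedding vector; all
       # weights of the text-to-image model itself stay frozen.
       pseudo_word = torch.nn.Parameter(torch.randn(embedding_dim) * 0.01)
       optimizer = torch.optim.Adam([pseudo_word], lr=5e-3)
       for _ in range(steps):
           for image in example_images:
               # reconstruction_loss stands in for the diffusion model's
               # denoising objective conditioned on the pseudo-word embedding.
               loss = reconstruction_loss(pseudo_word, image)
               optimizer.zero_grad()
               loss.backward()
               optimizer.step()
       return pseudo_word.detach()  # usable in prompts as the learned "pseudo-word"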

Image prompting

In 2023, Meta's AI research released Segment Anything, a computer vision model that can perform image segmentation by prompting. As an alternative to text prompts, Segment Anything can accept bounding boxes, segmentation masks, and foreground/background points.[55]

Using gradient descent to search for prompts

In "prefix-tuning",[56] "prompt tuning", or "soft prompting",[57] floating-point-valued vectors are searched directly by gradient descent to maximize the log-likelihood on outputs.

Formally, let E = {e_1, ..., e_k} be a set of soft prompt tokens (tunable embeddings), while X = {x_1, ..., x_m} and Y = {y_1, ..., y_n} be the token embeddings of the input and output respectively. During training, the tunable embeddings, input, and output tokens are concatenated into a single sequence concat(E; X; Y) and fed to the LLM. The losses are computed over the Y tokens; the gradients are backpropagated to prompt-specific parameters: in prefix-tuning, they are parameters associated with the prompt tokens at each layer; in prompt tuning, they are merely the soft tokens added to the vocabulary.[58]

More formally, this is prompt tuning. Let an LLM be written as LLM(X) = F(E(X)), where X is a sequence of linguistic tokens, E is the token-to-vector function, and F is the rest of the model. In prompt tuning, one provides a set of input-output pairs {(X_i, Y_i)}_i, and then uses gradient descent to search for argmax_{Z~} Σ_i log Pr[Y_i | concat(Z~; E(X_i))]. In words, log Pr[Y_i | concat(Z~; E(X_i))] is the log-likelihood of outputting Y_i if the model first encodes the input X_i into the vector E(X_i), then prepends the "prefix vector" Z~ to that vector, and then applies F.

Prefix tuning is similar, but the "prefix vector" Z~ is instead prepended to the hidden states in every layer of the model.
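
A minimal PyTorch sketch of a prompt-tuning training step following the concat(E; X; Y) description above; model and embed are hypothetical frozen components, and the optimizer is assumed to have been constructed over the soft prompt parameters only:

   import torch

   def prompt_tuning_step(model,       # hypothetical frozen LLM body: embeddings -> logits
                          embed,       # frozen token-to-vector function E
                          soft_prompt: torch.nn.Parameter,   # k tunable embeddings, shape (k, d)
                          optimizer: torch.optim.Optimizer,  # built over [soft_prompt] only
                          x_ids: torch.Tensor,               # input token ids, shape (m,)
                          y_ids: torch.Tensor) -> float:     # output token ids, shape (n,)
       # concat(E; X; Y): soft prompt, then input embeddings, then output embeddings.
       inputs_embeds = torch.cat([soft_prompt, embed(x_ids), embed(y_ids)], dim=0)
       logits = model(inputs_embeds.unsqueeze(0)).squeeze(0)  # shape (k + m + n, vocab)
       # The loss is computed over the Y tokens only: each Y token is predicted
       # from the position immediately before it (next-token prediction).
       n = y_ids.shape[0]
       loss = torch.nn.functional.cross_entropy(logits[-n - 1:-1], y_ids)
       optimizer.zero_grad()
       loss.backward()   # gradients reach only the soft prompt; model and embed stay frozen
       optimizer.step()
       return loss.item()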

An earlier result[59] uses the same idea of gradient descent search, but is designed for masked language models like BERT, and searches only over token sequences rather than numerical vectors. Formally, it searches for argmax_{X~} Σ_i log Pr[Y_i | concat(X~; X_i)], where X~ ranges over token sequences of a specified length.

Prompt injection


Prompt injection is a family of related computer security exploits carried out by getting a machine learning model (such as an LLM) which was trained to follow human-given instructions to follow instructions provided by a malicious user. This stands in contrast to the intended operation of instruction-following systems, wherein the ML model is intended only to follow trusted instructions (prompts) provided by the ML model's operator.[60][61][62]

References

  1. 1.0 1.1 1.2 Template:Cite web
  2. Template:Cite web
  3. 3.0 3.1 Template:Cite book
  4. Template:Cite web
  5. Template:Cite web
  6. 6.0 6.1 Template:Cite web
  7. Template:Cite web
  8. Template:Cite arXiv
  9. Template:Cite journal
  10. Template:Cite arXiv
  11. 11.0 11.1 11.2 11.3 11.4 11.5 Template:Cite conference
  12. Template:Cite web
  13. Template:Cite web
  14. Template:Cite news
  15. Template:Cite book
  16. Template:Cite web
  17. Template:Cite web
  18. Template:Cite web
  19. Template:Cite arXiv
  20. Template:Cite web
  21. 21.0 21.1 Template:Cite arXiv
  22. Template:Cite web
  23. Template:Cite arXiv
  24. Template:Cite journal
  25. Template:Cite arXiv
  26. Template:Citation
  27. Template:Cite arXiv
  28. Template:Cite web
  29. Template:Cite journal
  30. Template:Cite arXiv
  31. Template:Cite arXiv
  32. Template:Cite arXiv
  33. Template:Cite arXiv [See Figure 8.]
  34. 34.0 34.1 Template:Cite arXiv
  35. Template:Cite journal
  36. Template:Cite book
  37. Template:Cite arXiv
  38. Template:Cite web
  39. Template:Cite journal
  40. Template:Citation
  41. Template:Cite arXiv
  42. Template:Cite arXiv
  43. Template:Cite arXiv
  44. Template:Cite arXiv
  45. Template:Cite journal
  46. Template:Cite arXiv
  47. Template:Cite web
  48. 48.0 48.1 48.2 Template:Cite web
  49. Template:Cite web
  50. Template:Cite journal
  51. Template:Cite web
  52. Template:Cite web
  53. Template:Cite web
  54. Template:Cite arXiv
  55. Template:Cite arXiv
  56. Template:Cite book
  57. Template:Cite book
  58. Template:Cite arXiv
  59. Template:Cite book
  60. Template:Cite web
  61. Template:Cite web
  62. Template:Cite web
