Prompt engineering tools — the curated list
A curated list of prompt engineering, evaluation, and orchestration tools worth knowing about: PromptShip, LangChain, DSPy, Promptfoo, LangSmith, and more.
The LLM tooling space is a swamp of overlapping categories with marketing pages that all claim to do everything. The question isn't "what tools exist" (you can find that with a Google search). It's "what category does my problem live in, and which tool in that category is honest about its trade-offs."
This is a curated map. Five categories, each with the tools worth knowing, each with an honest one-line take. We use most of these ourselves; the ones we don't are still here because they fit a use case ours doesn't.
How to pick a tool from this list
The shortest path to picking the wrong tool is starting from a category list. The shortest path to picking the right one:
- Start from a problem. "Our prompts keep regressing silently." "We can't tell which prompt change made things worse." "Our agent loops forever."
- Identify the category. Each problem above maps to one category in this guide (eval, observability, orchestration).
- Pick the smallest tool that solves it. Don't pick a framework when a library will do; don't pick a library when a function will do.
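To make that last point concrete: before reaching for any framework, check whether a plain API call wrapped in a short function already covers the use case. A minimal sketch using the OpenAI Python SDK (assumes the `openai` package and an `OPENAI_API_KEY` in your environment; the model name is a placeholder):

```python
# A "function will do" baseline: no framework, just the vendor SDK.
# Assumes `pip install openai` and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

def summarize(text: str) -> str:
    """One prompt, one call. If this covers the use case, stop here."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder; use whichever model you prototype with
        messages=[
            {"role": "system", "content": "Summarize the user's text in two sentences."},
            {"role": "user", "content": text},
        ],
    )
    return response.choices[0].message.content
```

If this function is all your feature needs, a framework only adds indirection; reach for the categories below when the function stops being enough.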
Prompt management
Tools for saving, versioning, sharing, and organizing prompts. The category we live in. See What is prompt management for what makes this category distinct.
- PromptShip, that's us (promptship.co)
Save, version, and share prompts across teams. Built around variables, version history, and a public library of curated prompts. Free for up to 200 prompts.
- PromptLayer (promptlayer.com)
Prompt registry with version history and analytics, with a focus on observability for prompts running in production.
- Humanloop (humanloop.com)
Prompt management plus eval workflows, oriented toward enterprise teams shipping LLM features in regulated environments.
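These tools all expose different APIs, so rather than guess at any one of them, here is the do-it-yourself baseline they improve on: prompt templates as version-controlled files with named variables. A hedged sketch; the file layout and names are made up for illustration:

```python
# The DIY baseline that prompt-management tools replace: templates on disk,
# versioned by filename, variables filled with str.format. Paths are illustrative.
from pathlib import Path

PROMPT_DIR = Path("prompts")  # e.g. prompts/support-reply.v3.txt, checked into git

def load_prompt(name: str, version: str) -> str:
    return (PROMPT_DIR / f"{name}.{version}.txt").read_text()

def render(template: str, **variables: str) -> str:
    """Fill {placeholders}; raises KeyError if a variable is missing."""
    return template.format(**variables)

# template = load_prompt("support-reply", "v3")
# prompt = render(template, customer_name="Ada", product="PromptShip")
```

This works until you need history across a team, a diff between versions, or analytics on which version is live, which is the point where the tools above earn their keep.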
Application frameworks (orchestration)
Frameworks for building LLM apps — chains, agents, RAG pipelines, retries, fallbacks. Pick based on how much magic you want vs. how much you want to write yourself.
- LangChain (langchain.com)
The most popular Python/JS framework for building LLM applications. Heavy on abstractions; great when you want batteries included, less great when you want to know what's happening underneath. A minimal chain sketch follows this list.
- LlamaIndex (llamaindex.ai)
RAG-focused framework. Strong document loaders, retrievers, and query engines — the right pick if your app is mostly retrieval over your own data.
- DSPy (dspy.ai)
A different approach: declare what you want the LLM to do, let DSPy compile the prompts. Steep learning curve, powerful for systems you'd otherwise hand-tune endlessly.
- Haystack (haystack.deepset.ai)
Production-oriented orchestration framework. Less hype, more reliability — popular in enterprise RAG deployments.
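As referenced above, this is roughly what LangChain's "batteries included" trade-off looks like: a prompt template, a model, and an output parser composed into a chain. A minimal sketch assuming the `langchain-core` and `langchain-openai` packages and an `OPENAI_API_KEY`; the package layout has shifted across versions:

```python
# Minimal LangChain chain: prompt -> model -> string output.
# Assumes `pip install langchain-core langchain-openai` and OPENAI_API_KEY set.
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI

prompt = ChatPromptTemplate.from_messages([
    ("system", "You answer in one short paragraph."),
    ("human", "Explain what {topic} is for."),
])

# The | operator composes runnables; invoke() runs the whole pipeline.
chain = prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()

# print(chain.invoke({"topic": "a vector database"}))
```

Three lines of composition buys you retries, streaming, and tracing hooks for free; the cost is that every step is wrapped in an abstraction you may eventually have to read the source of.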
Evaluation and observability
Tools for measuring whether your prompts and pipelines actually work in production — and catching regressions before users do. Cross-references the A/B testing workflow.
- Promptfoo (promptfoo.dev)
Open-source LLM eval and red-teaming. Define test cases as YAML, run across multiple models, get a comparison matrix. The right starting point for systematic prompt testing.
- LangSmith (smith.langchain.com)
LangChain's observability and eval platform. Trace every prompt and tool call, build datasets from production traffic, run evaluations.
- Helicone (helicone.ai)
LLM observability with a focus on cost, latency, and usage analytics. Lightweight to integrate; great for catching cost regressions early.
- Braintrust (braintrust.dev)
Eval-first platform. Build LLM-as-judge graders, version your eval sets, run regression tests on every prompt change. Strong for teams with dedicated ML ops.
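Each of these tools has its own config format, so the sketch below is not any one tool's API; it is just the shape of the regression loop they all automate: fixed test cases, a model call, an assertion, a pass rate. Assumes the `openai` package; the cases and checks are illustrative:

```python
# The eval-loop idea these tools automate, stripped to its core.
# Not promptfoo's or Braintrust's API, just a hand-rolled regression check.
from openai import OpenAI

client = OpenAI()

TEST_CASES = [  # illustrative: an input plus a substring the answer must contain
    {"input": "Refund me, the app deleted my data.", "must_contain": "refund"},
    {"input": "How do I export my prompts?", "must_contain": "export"},
]

def run_prompt(system_prompt: str, user_input: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # run the same cases against other models to compare
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    )
    return response.choices[0].message.content

def pass_rate(system_prompt: str) -> float:
    passed = sum(
        case["must_contain"].lower() in run_prompt(system_prompt, case["input"]).lower()
        for case in TEST_CASES
    )
    return passed / len(TEST_CASES)

# Run this on every prompt change; a drop in pass rate is a regression caught early.
```

The tools above add what this sketch leaves out: richer assertions, LLM-as-judge grading, comparison matrices across models, and a history of results per prompt version.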
Vendor playgrounds
Where to prototype quickly. Each vendor's playground is the easiest way to learn that vendor's patterns — see ChatGPT, Claude, and Gemini guides for what each does best.
- OpenAI Playground (platform.openai.com)
The default place to prototype prompts for OpenAI models. Compare prompts side by side across model versions, configure system prompts, and test function calling.
- Anthropic Workbench (console.anthropic.com)
Claude's playground. Strong support for prefilling, XML tags, and the Claude-specific patterns that don't exist elsewhere. A prefill sketch follows this list.
- Google AI Studio (aistudio.google.com)
Gemini's playground. Best place to test long-context and multimodal prompts — supports image, video, and audio inputs natively.
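For the prefilling pattern mentioned above: once a prototype works in the Workbench, the same trick in the API is ending the message list with a partial assistant turn that Claude continues. A minimal sketch assuming the `anthropic` package and an `ANTHROPIC_API_KEY`; the model name is a placeholder:

```python
# Prefilling: start Claude's answer for it so the reply continues from your text.
# Assumes `pip install anthropic` and ANTHROPIC_API_KEY set; model id is a placeholder.
import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder model id
    max_tokens=300,
    messages=[
        {"role": "user", "content": "List three risks of shipping unversioned prompts."},
        # The trailing assistant turn is the prefill; Claude continues inside the tag.
        {"role": "assistant", "content": "<risks>"},
    ],
)

print(response.content[0].text)  # the continuation that follows the prefilled "<risks>"
```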
Vector databases (for RAG)
The retrieval half of Retrieval-Augmented Generation. Pick based on scale, deployment preference, and whether you already run Postgres.
- Pinecone (pinecone.io)
Managed vector database. The default if you want to ship RAG quickly without running infrastructure.
- Weaviate (weaviate.io)
Open-source vector database with hybrid search built in. Strong for teams wanting more control over deployment and search algorithms.
- Qdrant (qdrant.tech)
Performance-focused open-source vector DB. Rust-based, fast, and increasingly common in production deployments.
- pgvector (github.com/pgvector/pgvector)
Postgres extension for vector search. The right pick when you already run Postgres and don't want to add a new service for moderate-scale RAG.
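Because pgvector is just a Postgres extension, the retrieval half of RAG is ordinary SQL. A minimal sketch using psycopg and OpenAI embeddings (assumes Postgres with the pgvector extension, the `psycopg` and `openai` packages, and an `OPENAI_API_KEY`; the DSN, table, and dimensions are illustrative):

```python
# Nearest-neighbour retrieval with pgvector: plain SQL, no extra service.
# Assumes a Postgres database with the pgvector extension available,
# `pip install psycopg openai`, and OPENAI_API_KEY set.
import psycopg
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> list[float]:
    return client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding

with psycopg.connect("dbname=app") as conn:  # illustrative DSN
    conn.execute("CREATE EXTENSION IF NOT EXISTS vector")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docs (id bigserial PRIMARY KEY, body text, embedding vector(1536))"
    )

    # Store a document alongside its embedding (passed as a '[...]' literal, cast to vector).
    body = "pgvector adds vector similarity search to Postgres."
    conn.execute(
        "INSERT INTO docs (body, embedding) VALUES (%s, %s::vector)",
        (body, str(embed(body))),
    )

    # Retrieve the closest documents to a query; <=> is pgvector's cosine-distance operator.
    rows = conn.execute(
        "SELECT body FROM docs ORDER BY embedding <=> %s::vector LIMIT 3",
        (str(embed("How do I search my own data?")),),
    ).fetchall()
    print([r[0] for r in rows])
```

If retrieval quality or scale outgrows this, that is the signal to look at the dedicated vector databases above rather than the other way around.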
Quick reference
The 60-second summary
- Five categories: prompt management, orchestration, eval, playgrounds, vector DBs.
- Pick from a problem, not from the list. Each category solves a specific class of problem; mismatched tools are the most common failure.
- Start small: Promptfoo for eval, vendor playgrounds for prototyping, pgvector for RAG if you have Postgres.
- The honest take: one-line summaries here include trade-offs. Don't pick a tool whose trade-off doesn't match your team's strengths.
What to read next
For papers describing the patterns these tools implement, see papers. For external guides on using them well, see readings. For the benchmarks they're evaluated against, see datasets.
Put this guide to work
Save your prompts, version every change, and share them with your team — free for up to 200 prompts.