For many of the same reasons. A plethora of alternatives, personal preference, weird ideology, appropriateness for the task, inertia, not-invented-here.
The list goes on.
Edit: read the article (it's really good); that cycle of AI engineering progression is spot on.
> Typed I/O for every LLM call. Use Pydantic. Define what goes in and out.
Sure, but that's not related to DSPy, and it's completely table stakes. Also not sure why the whole article assumes the only language in the world is Python.
> Separate prompts from code. Forces you to think about prompts as distinct things.
There's really no reason prompts must live in a file with a .md or .json or .txt extension rather than .py/.ts/.go/.., except if you indeed work at a company that decided it's a good idea to let random people change prod runtime behavior. If someone can think of a scenario where this is actually a good idea, feel free to enlighten me. I don't see how it's any more advisable than editing code in prod while it's running.
> Composable units. Every LLM call should be testable, mockable, chainable.
> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.
And LiteLLM or `ai` (Vercel), the actually most used packages, aren't? You're comparing downloads with Langchain, probably the worst package to gain popularity of the last decade. It was just first to market, then after a short while most realized it's horrifically architected, and now it's just coasting on former name recognition while everyone who needs to get shit done uses something lighter like the above two.
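For what it's worth, the "abstract model calls" pattern doesn't need a heavy framework at all; a thin wrapper with an injectable transport gets you most of the way. A minimal sketch, where the `call_fn` seam and the names are hypothetical stand-ins for whatever SDK (OpenAI, Anthropic, LiteLLM) you actually use:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class LLMClient:
    """Thin wrapper: the transport is injected, so swapping providers is one line."""
    model: str
    call_fn: Callable[[str, str], str]  # (model, prompt) -> completion text

    def complete(self, prompt: str) -> str:
        return self.call_fn(self.model, prompt)

def fake_transport(model: str, prompt: str) -> str:
    # Stand-in for a real SDK call; lets everything run offline in tests.
    return f"[{model}] echo: {prompt}"

client = LLMClient(model="claude-sonnet", call_fn=fake_transport)
print(client.complete("hi"))  # [claude-sonnet] echo: hi
```

Swapping providers is then just a different `call_fn`, which is roughly what LiteLLM and `ai` give you without the framework tax.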
> Eval infrastructure early. Day one. How will you know if a change helped?
Sure, to an extent. Outside of programming, most things where LLMs deliver actual value are very nondeterministic with no right answer. That's exactly what they offer. Plenty of which an LLM can't judge the quality of. Having basic evals is useful, but you can quickly run into their development taking more time than it's worth.
But above all.. the comments on this post immediately make clear that the biggest differentiator of DSPy is the prompt optimization. Yet this article doesn't mention that at all? Weird.
I think one thing that's lost in all of the LLM tooling is that it's LLM-or-nothing and people have lost knowledge of other ML approaches that actually work just fine, like entity recognition.
I understand it's easier to just throw every problem at an LLM but there are things where off-the-shelf ML/NLP products work just as well without the latency or expense.
I conjecture that the core value proposition of DSPy is its optimizer? Yet the article doesn't really touch it in any important way. How does it work? How would I integrate it into my production? Is it even worth it for usual use-cases? Adding a retry is not a problem, creating and maintaining an AI control plane is. LangChain provides services for observability, online and offline evaluation, prompt engineering, deployment, you name it.
This takes a ton of upfront work and careful thinking. As soon as you move the goalposts of what you're trying to achieve you also have to update the training and evaluation dataset to cover that new use case.
This can actually get in the way of moving fast. Often teams are not trying to optimize their prompts but even trying to figure out what the set of questions and right answers should be!
I think I might have just misunderstood how to use it.
I think a problem to DSPy is that they don't know the concept of THE WHOLE PRODUCT: https://en.wikipedia.org/wiki/Whole_product
Look at https://mastra.ai/ and https://www.copilotkit.ai/ to see how more inviting their pages look. A company is not selling only the product itself but all the other things around the product = THE WHOLE PRODUCT
A similar concept in developer tools is the docs are the product
Also I'm a fullstack javascript engineer and I don't use Python. Docs usually have a switch for the language at the top. Stripe.com is famous for its docs and Developer Experience: https://docs.stripe.com/search#examples It's great to study other great products to get inspiration and copy the best traits that are relevant to your product as well.
1. Up until about six months ago, modifying prompts by hand and incorporating terminology with very specific intent and observing edge cases and essentially directing the LLM in a direction to the intended outcome was somewhat meticulous and also somewhat tricky. This is what the industry was commonly referring to as prompt engineering.
2. With the current state of SOTA models like Opus 4.6, the agent that is developing my applications alongside of me often has a more intelligent and/or generalized view of the system that we're creating.
We've reached a point in the industry where smaller models can accomplish tasks that were reserved for only the largest models. And now that we use the most intelligent models to create those systems, the feedback loop which was patterned by DSPy has essentially become adopted as part of my development workflow.
I can write an agent and a prompt as a first pass using an agentic coder, and then based on the observation of the performance of the agent by my agentic coder, continue to iterate on my prompts until I arrive at satisfactory results. This is further supported by all of the documentation, specifications, data structures, and other I/O aspects of the application that the agent integrates in which the coding agent can take into account when constructing and evaluating agentic systems.
So DSPy was certainly onto something, but the level of abstraction, at least in my personal use case, has moved up a layer instead of being integrated into the actual system.
The fact that you have to bundle input+output signatures and everything is dynamically typed (sometimes into the args) just make it annoying to use in codebases that have type annotations everywhere.
Plus their out-of-the-box agent loop has been a joke for the longest time, and writing your own is feasible, but it's night and day when trying to get something done with pydantic-ai.
Too bad because it has a lot of nice things, I wish it were more popular.
1) It's slow: you first have to get acquainted with DSPy and then get hand-labeled data for prompt optimization. This can be a slow process, so it's important to label just the cases that are ambiguous, not the obvious ones.
2) They know that manual prompt engineering is brittle, and want a prompt that's optimized and robust against a model they're invoking, which DSPy offers. However, it's really the optimizer (ex. GEPA) doing the heavy-lifting.
3) They don't actually want a model or prompt at all. They want a task completed, reliably, and they want that task to not regress in performance. Ideally, the task keeps improving in production.
Curious if folks in this thread feel more of these pains than the ones in the article.
So this article seems surprising, since it emphasizes the non-prompt-optimization aspects more. If that was the selling point, I would rather use something like Pydantic AI, since I already use Pydantic for so much of the rest.
The real killer feature is the prompt compilation; it's also the hardest to get to an effective place and I frequently found myself needing more control over the context than it would allow. This was a while ago, so things may have improved. But good evals are hard and the really fancy algorithms will burn a lot of tokens to optimize your prompts.
In my opinion, the reason people don't use DSPy is because DSPy aims to be a machine learning platform. And like the article says -- this feels different or hard to people who are not used to engineering with probabilistic outputs. But these days, many more people are programming with probability machines than ever before.
The absolute biggest time sink and 'here be dragons' of using LLMs is poke and hope prompt "engineering" without proper evaluation metrics.
> You don’t have to use DSPy. But you should build like someone who understands why it exists.
And this is the salient point, and I think it's very well stated. It's not about the framework per se, but about the methodology.
Stranger still: it seems like every company I have worked with ends up building a half-baked version of Dspy.
Wrong. There can be a lot of subjectivity and pretending that some golden answer exists does more harm and narrows down the scope of what you can build.
My other main problem with data extraction tasks, and why I'm not satisfied with any of the existing eval tools, is that the schemas I write can change drastically as my understanding of the problem increases. And nothing really seems to handle that well; I mostly just resort to reading diffs of what happens when I change something and reading the input/output data very closely. Marimo is fantastic for anything visual like this btw.
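One lightweight way to cope with schema churn is to diff the extracted fields themselves rather than eyeballing raw outputs. A rough sketch (the helper and field names here are invented for illustration):

```python
def diff_extractions(old: dict, new: dict) -> dict:
    """Compare two extraction results field by field: added, removed, changed."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k]) for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"added": added, "removed": removed, "changed": changed}

# Example: a schema change renamed nothing but shifted one value and added a field.
before = {"company_name": "Acme Corp", "confidence": 0.9}
after = {"company_name": "Acme", "confidence": 0.9, "industry": "Manufacturing"}
print(diff_extractions(before, after))
```

Run over a fixed sample of inputs before and after each schema change, this gives a structured diff to review instead of two walls of JSON.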
Also there is a difference between: the problem in reality → the business model → your db/application schema → the schema you send to the LLM. And to actually improve your schema/prompt you have to be mindful of the entire problem stack and how you might separate things that are handled through post processing rather than by the LLM directly.
> Abstract model calls. Make swapping GPT-4 for Claude a one-line change.
And in practice random limitations like structured output API schema limits between providers can make this non-trivial. God I hate the Gemini API.
Useful for upcoming consultants to learn how to price services too.
Are we playing philosophy here? If you move some part of the code from the repo and into a database, then changing that database is still part of the deployment, but now you've just given your versioning an identity crisis. Just put your prompts in your git repo and say no when someone requests an anti-pattern be implemented.
The only thing I'd grab DSPy for at this point is to automate the edges of the agentic pipeline that could be improved with RL patterns. But if that is true, you're really shorting yourself by handing your domain to DSPy. You should be building your own RL learning loops.
My experience: If you find yourself reaching for a tool like Dspy, you might be sitting on a scenario where reinforcement learning approaches would help even further up the stack than your prompts, and you're probably missing where the real optimization win is. (Think bigger)
I agree but you'd be surprised at how many people will argue against static typing with a straight face. It's happened to me on at least three occasions that I can count and each time the usual suspects were trotted out: "it's quicker", "you should have tests to validate anyhow", "YOLO polymorphism is amazing", "Google writes Python so it's OK", etc.
It must be cultural as it always seems to be a specific subset of Python and ECMAScript devs making these arguments. I'm glad that type hints and Typescript are gaining traction as I fall firmly on the other side of this debate. The proliferation of LLM coding workflows has likely accelerated adoption since types provide such valuable local context to the models.
https://github.com/ax-llm/ax (if you're in the typescript world)
But I think it misses the point of what Dspy "is". It's less that Dspy is about prompt optimization and more that, Dspy encourages you to design your systems in a way that better _enables_ optimization.
You can apply the same principles without Dspy too :)
And hopefully it's clear enough from the post: I'm not necessarily suggesting people use Dspy, just that there are important lessons to take with you, even if you don't use it :)
I think it solves some of this friction!
The reality is that you don't want to re-deploy for every prompt change, especially early on. You want to get a really tight feedback loop. If prompt change requires a re-deploy, that is usually too slow. You don't have to use a database to solve this, but it's pretty common to see in my experience.
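You don't strictly need a database for that tight loop, either; a development-time override that falls back to the prompts checked into git gets you hot reloading without making prod mutable. A sketch, assuming a hypothetical `PROMPT_OVERRIDE_DIR` environment variable (all names invented for illustration):

```python
import os
from pathlib import Path

# Prompts versioned in git alongside the code (the production source of truth).
PROMPTS = {
    "extract_company": "Extract the company name from: {text}",
}

def get_prompt(key: str) -> str:
    """Dev-time override: if PROMPT_OVERRIDE_DIR is set, hot-reload from disk."""
    override_dir = os.environ.get("PROMPT_OVERRIDE_DIR")
    if override_dir:
        candidate = Path(override_dir) / f"{key}.txt"
        if candidate.exists():
            return candidate.read_text()
    return PROMPTS[key]

print(get_prompt("extract_company").format(text="Acme filed an S-1."))
```

In production the env var is simply unset, so behavior stays pinned to what's in the repo.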
1. People don't want to switch frameworks, even though you can pull prompts generated by DSPy and use them elsewhere, it feels weird.
2. You need to do some up-front work to set up some of the optimizers which a lot of people are averse to.
But on the other hand, I think people unintentionally end up re-implementing a lot of Dspy.
I'll think about how to word this better, thanks for the feedback!
So in practice I imagine you get at a lot of the same ideas / benefits!
It's annoying/difficult in practice if this is strictly in code. I don't think a database is necessarily the way to go, but it's just a common pattern I see. And I really strongly believe this is more of a need for a "development time override" than the primary way to deploy to production, to be clear.
This was my take as well.
My company recently started using Dspy, but you know what? We had to stand up an entire new repo in Python for it, because the vast majority of our code is not Python.
I'm curious what other practitioners are doing.
As someone who has done traditional NLP work as at least part of my job for the last 15 years, LLMs do offer a vastly superior NER solution over any previous NLP options.
I agree with your overall statement that frequently people rush to grab an LLM when superior options already exist (classification is a big example, especially when the power of embeddings can be leveraged), but NER is absolutely a case where LLMs are the superior option (unless you have latency/cost requirements that force you to accept inferior quality as the trade-off; your default should be an LLM today).
Dspy encourages you to write your code in a way that better enables optimization, yes (and provides direct abstractions for that). But this isn't in a sense unique to Dspy: you can get these same benefits by applying the right patterns.
And they are the patterns I just find people constantly implementing these without realizing it, and think they could benefit from understanding Dspy a bit better to make better implementations :)
I highly recommend checking out this community plugin from Maxime, it helps "bridge the gap": https://github.com/dspy-community/dspy-template-adapter
https://google.github.io/adk-docs/
Disclaimer, I use ADK, haven't really looked at Dspy (though I have prior heard of it). ADK certainly addresses all of the points you have in the post.
```
from openai import OpenAI

# Point the client to the TensorZero Gateway
client = OpenAI(base_url="http://localhost:3000/openai/v1", api_key="not-used")

response = client.chat.completions.create(
    # Call any model provider (or TensorZero function)
    model="tensorzero::model_name::anthropic::claude-sonnet-4-6",
    messages=[
        {
            "role": "user",
            "content": "Share a fun fact about TensorZero.",
        }
    ],
)
```
You can layer additional features only as needed (fallbacks, templates, A/B testing, etc).
For example: I don't use Dspy at work! And I'm working in a primarily dotnet stack, so we definitely don't use Dspy... But still, I see the same patterns seeping through that I think are important to understand.
And then there's a question of "how do we implement these patterns idiomatically and ergonomically in our codebase/language?"
I agree with all the points that they list, but I fear that if I looked closely at the code and how they did it, I wouldn't stop cringing until I looked away. Frameworks like this tend to point out 10 concerns that you should be worried about but aren't, and make users learn a lot of new stuff to bend their work around the framework, but users rarely get a clear understanding of what the concerns are, where exactly the value comes from the framework, etc.
That is, if you are trying to sell something you can do a lot better with something crazy and one-third-baked like OpenClaw, which will make your local Apple Store sell out of minis, than anything that rationally explains "you are going to have to invent all the stuff that is in this framework that looks like incomprehensible bloat to you right now." I mean, it is rational, it is true, but I can say empirically as a person-who-sells-things that it doesn't sell, in fact if you wanted me to make a magic charm that looks like it would sell things and make sure you don't sell anything it would be that.
They themselves are turning into wrapper code for other libraries (e.g. the LLM abstraction which litellm handles for them).
Can also add:
Option 3: Use instructor + litellm (probably Pydantic AI too, but I have not tried that yet)
Edit: As others pointed out, their optimizing algorithms are very good (GEPA is great and lets you easily visualize / track the changes it makes to the prompt)
You're right: prompts are overfit to models. You can't just change the provider or target and know that you're giving it a fair shake. But if you have eval data and have been using a prompt optimizer with DSPy, you can try models with the one-line change followed by rerunning the prompt optimizer.
Dropbox just published a case study where they talk about this:
> At the same time, this experiment reinforced another benefit of the approach: iteration speed. Although gemma-3-12b was ultimately too weak for our highest-quality production judge paths, DSPy allowed us to reach that conclusion quickly and with measurable evidence. Instead of prolonged debate or manual trial and error, we could test the model directly against our evaluation framework and make a confident decision.
https://dropbox.tech/machine-learning/optimizing-dropbox-das...
CV too for that matter, object recognition before deep learning required a white background and consistent angles. Remember this XKCD from only 2014? https://xkcd.com/1425/
I think the unfortunate part is: the way it encourages you to structure your code is good for other reasons that might not be an 'acute' pain. And over time, it seems inevitable you'll end up building something that looks like it.
Implementations are generally going to be messy; still, I feel like not all the messiness is essential. A lot of it is accidental :)
That metric is the key piece. I don't know the right way to build an automated metric for a lot of the systems I want to build that will stand the test of time.
I've been fiddling around with many prototypes to try to figure out the right way to do this, but it feels challenging; I'm not yet familiar enough with how to do this ergonomically and idiomatically in dotnet haha
DSPy: 4.7M monthly downloads
LangChain: 222M monthly downloads
For a framework that promises to solve the biggest challenges in AI engineering, this gap is suspicious. Still, companies using DSPy consistently report the same benefits.
They can test a new model quickly, even if their current prompt doesn't transfer well. Their systems are more maintainable. They are focusing on the context more than the plumbing.
So why aren’t more people using it?
DSPy’s problem isn’t that it’s wrong. It’s that it’s hard. The abstractions are unfamiliar and force you to think a little differently. And what you want right now is not to think differently; you just want the pain to go away.
But I keep watching the same thing happen: people end up implementing a worse version of DSPy. I like to jokingly say there’s a Khattab’s Law now (riffing on Greenspun’s Tenth Rule about Common Lisp):
Any sufficiently complicated AI system contains an ad hoc, informally-specified, bug-ridden implementation of half of DSPy.
You’re going to build these patterns anyway. You’ll just do it worse, after a lot of time, and through a lot of pain.
Let’s walk through how virtually every team ends up implementing their own “DSPy at home”. We’ll use a simple structured extraction task as an example throughout. Don’t let the simplicity of the example fool you though; these patterns only become more important as the system becomes more complex.
Let’s say you need to extract company names from some text, you might start out with the OpenAI API:
```
from openai import OpenAI

client = OpenAI()

def extract_company(text: str) -> str:
    response = client.chat.completions.create(
        model="gpt-5.2",
        messages=[{"role": "user", "content": f"Extract the company name from: {text}"}]
    )
    return response.choices[0].message.content
```
It basically works. So you ship it and life is good.
But inevitably, Product will want to iterate faster. Redeploying for every prompt change is too annoying. So you decide to store prompts in a database:
```
from openai import OpenAI
from myapp.config import get_prompt

client = OpenAI()

def extract_company(text: str) -> str:
    prompt_template = get_prompt("extract_company")
    prompt = prompt_template.format(text=text)
    response = client.chat.completions.create(
        model="gpt-5.2",
        messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content
```
Now you have a prompts table, a little admin UI to edit it. And of course you had to add version history after someone broke prod last Tuesday.
You notice the model sometimes returns "Company: Acme Corp" instead of just "Acme Corp". So you add structured outputs:
```
from openai import OpenAI
from pydantic import BaseModel
from myapp.config import get_prompt

client = OpenAI()

class CompanyExtraction(BaseModel):
    company_name: str
    confidence: float

def extract_company(text: str) -> CompanyExtraction:
    prompt_template = get_prompt("extract_company_v2")
    response = client.chat.completions.parse(
        model="gpt-5.2",
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        response_format=CompanyExtraction,
    )
    return response.choices[0].message.parsed
```
You now have typed inputs and outputs and higher confidence the system is doing what it should.
After running for a while, you’ll notice transient failures like 529 errors or rare cases where parsing fails. So you add retries:
```
from openai import OpenAI
from tenacity import retry, stop_after_attempt, wait_exponential
from myapp.config import get_prompt

client = OpenAI()

@retry(stop=stop_after_attempt(3), wait=wait_exponential(min=1, max=10))
def extract_company(text: str) -> CompanyExtraction:
    prompt_template = get_prompt("extract_company_v2")
    response = client.chat.completions.parse(
        model="gpt-5.2",
        messages=[{"role": "user", "content": prompt_template.format(text=text)}],
        response_format=CompanyExtraction,
    )
    return response.choices[0].message.parsed
```
Now each call has a bit more resilience. Though, in practice you might fallback to a different provider because retrying against an overloaded service returning 529s is a recipe for… another 529 error.
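That retry-then-fallback behavior is only a few lines itself; the provider callables below are stand-ins for real SDK calls, invented for illustration:

```python
def call_with_fallback(providers, prompt: str, attempts_per_provider: int = 2) -> str:
    """Try each provider in order, a few times each, instead of hammering one."""
    last_error = None
    for call in providers:
        for _ in range(attempts_per_provider):
            try:
                return call(prompt)
            except Exception as e:  # real code should catch transient errors (429/529) only
                last_error = e
    raise RuntimeError("all providers failed") from last_error

def overloaded(prompt: str) -> str:
    raise TimeoutError("529: overloaded")

def healthy(prompt: str) -> str:
    return f"ok: {prompt}"

print(call_with_fallback([overloaded, healthy], "extract Acme"))
```

The catch-all `except` is deliberately too broad for production; distinguishing transient from permanent failures is exactly the kind of detail that accretes in these wrappers.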
Eventually, you might start parsing more esoteric company names, and the model might not be good enough to recognize an entity as a company name. So you want to add RAG against known company information to help improve the extraction:
```
from openai import OpenAI
from myapp.config import get_prompt

client = OpenAI()

def extract_company_with_context(text: str) -> CompanyExtraction:
    # Step 1: Write a RAG query
    query_prompt_template = get_prompt("extract_company_query_writer")
    query_prompt = query_prompt_template.format(text=text)
    query_response = client.chat.completions.create(
        model="gpt-5.2",
        messages=[{"role": "user", "content": query_prompt}]
    )
    query = query_response.choices[0].message.content
    query_embedding = embed(query)
    docs = vector_db.search(query_embedding, top_k=5)
    context = "\n".join([d.content for d in docs])

    # Step 2: Extract with context
    prompt_template = get_prompt("extract_company_with_rag")
    prompt = prompt_template.format(text=text, context=context)
    response = client.chat.completions.parse(
        model="gpt-5.2",
        messages=[{"role": "user", "content": prompt}],
        response_format=CompanyExtraction,
    )
    return response.choices[0].message.parsed
```
Now we have two prompts: one to write the query and one to parse out the company. And we have also introduced other parameters like k. It’s worth noting that not all of these parameters are independent. Since the retrieved documents feed into the final prompt, any changes here affect the overall performance.
You keep changing both prompts, the embedding model, k, and any parameter you can get your hands on to fix bugs as they are reported. But you’re never quite sure if your change completely fixed the issue. And you’re never quite sure if your changes broke something else. So you finally realize you need those “evals” everyone is talking about:
```
def evaluate(dataset: list[dict]) -> dict:
    results = []
    for example in dataset:
        prediction = extract_company_with_context(example["text"])
        results.append({
            "correct": prediction.company_name == example["expected"],
            "confidence": prediction.confidence,
        })
    return {
        "accuracy": sum(r["correct"] for r in results) / len(results),
        "avg_confidence": sum(r["confidence"] for r in results) / len(results),
    }
```
Data extraction tasks are amongst the easiest to evaluate because there’s a known “right” answer. But even here, we can imagine some of the complexity. First, we need to make sure that the dataset passed in is always representative of our real data. And generally: your data will shift over time as you get new users and those users start using your platform more completely. Keeping this dataset up to date is a key maintenance challenge of evals: making sure the eval measures something you actually (and still) care about.
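Even here, exact string match usually needs normalization before it measures what you actually care about: "Acme Corp." and "acme corp" should count as the same answer. A small sketch of a more forgiving correctness check (the suffix list is illustrative, not exhaustive):

```python
import re

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop common legal suffixes before comparing."""
    name = re.sub(r"[^\w\s]", "", name.lower()).strip()
    return re.sub(r"\s+(inc|corp|llc|ltd)$", "", name).strip()

def is_correct(predicted: str, expected: str) -> bool:
    return normalize(predicted) == normalize(expected)

print(is_correct("Acme Corp.", "acme corp"))  # True
print(is_correct("Acme", "Initech"))          # False
```

Swap this in for the `==` in the eval loop and the accuracy number stops punishing cosmetic differences.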
Inevitably, some company will release an exciting new model that someone will want to try. Let’s say Anthropic releases a new model. Unfortunately, your code is full of openai.chat.completions.create, which won’t exactly work for Anthropic. Your prompts might not even work well with the new model.
So you decide you need to refactor everything:
```
class LLMModule:
    def __init__(self, signature: type[BaseModel], prompt_key: str):
        self.signature = signature
        self.prompt_key = prompt_key

    def forward(self, **kwargs) -> BaseModel:
        prompt = get_prompt(self.prompt_key).format(**kwargs)
        return self._call_llm(prompt)

    def _call_llm(self, prompt: str) -> BaseModel:
        # Model-agnostic, with retries, parsing, validation
        ...

extract_company = LLMModule(
    signature=CompanyExtraction,
    prompt_key="extract_company_v3",
)
result = extract_company.forward(text="...")
```
You now have typed signatures, composable modules, swappable backends, centralized retry logic, and prompt management separated from application code.
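The payoff of that shape is testability: swap the backend for a stub and your unit tests never touch an API. A sketch along the lines of the module above (details are illustrative, using a plain dataclass to stay self-contained):

```python
from dataclasses import dataclass

@dataclass
class CompanyExtraction:
    company_name: str
    confidence: float

class LLMModule:
    """Minimal stand-in for the refactored module; _call_llm is the seam to mock."""
    def __init__(self, prompt_template: str):
        self.prompt_template = prompt_template

    def forward(self, **kwargs) -> CompanyExtraction:
        return self._call_llm(self.prompt_template.format(**kwargs))

    def _call_llm(self, prompt: str) -> CompanyExtraction:
        raise NotImplementedError("the real version calls the provider")

class FakeModule(LLMModule):
    def _call_llm(self, prompt: str) -> CompanyExtraction:
        # Deterministic stub: no network, no flakiness, runs in CI.
        return CompanyExtraction(company_name="Acme Corp", confidence=1.0)

module = FakeModule("Extract the company from: {text}")
result = module.forward(text="Acme Corp filed an S-1.")
print(result.company_name)  # Acme Corp
```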
Congrats! You just built a worse version of DSPy.
DSPy packages the important patterns every serious AI system ends up needing:
Typed inputs and outputs. What goes in, what comes out, with a schema.
Composable units you can chain, swap, and test independently.
Logic that improves prompts, separated from the logic that runs them.
These are just software engineering fundamentals. Separation of concerns. Composability. Declarative interfaces. But for some reason, many good engineers either forget about these or struggle to apply them to AI systems.
🔄 You can't step through a prompt. The output is probabilistic. When it finally works, you don't want to touch it.

🚀 Getting an LLM to work feels like an accomplishment. Clean architecture feels like a luxury for later.

❓ Where do you draw the boundaries? Your prompts are both code and data. Nothing is familiar.
So engineers do what works in the moment. Inline prompts. Copy-paste with tweaks. One-off solutions that become permanent.
But 6 months later, they are drowning in the complexity of their half-baked abstractions.
DSPy forces you to think about these abstractions upfront. That's why the learning curve feels steep. The alternative is discovering the patterns through pain.
Accept the learning curve. Read the docs. Build a few toy projects until the abstractions click. Then use it for real work.
Don't use DSPy, but build with its patterns from day one. See below.
✓ Typed I/O for every LLM call. Use Pydantic. Define what goes in and out.
✓ Separate prompts from code. Forces you to think about prompts as distinct things.
✓ Composable units. Every LLM call should be testable, mockable, chainable.
✓ Eval infrastructure early. Day one. How will you know if a change helped?
✓ Abstract model calls. Make swapping GPT-4 for Claude a one-line change.
DSPy has adoption problems because it asks you to think differently before you’ve actually felt the pain of thinking the same way everyone else does.
The patterns DSPy embodies aren’t optional. If your AI system gets complex enough, you will reinvent them. The only question is whether you do it deliberately or accidentally.
You don’t have to use DSPy. But you should build like someone who understands why it exists.
If this resonated, let’s continue the conversation on X or LinkedIn!