Skip to content

The Stack AI DevOps Business

10 Open-Source AI Tools That Feel ILLEGAL To Know About

A countdown of ten open source AI tools, ranked from 10 to 1 by how much pain each one removes from a working LLM stack. It covers RAG ingestion (Chonkie, Marker, Crawl4AI), storage and retrieval (Qdrant), local inference (Ollama), observability (Langfuse), prompt optimization (DSPy), provider routing (LiteLLM), and structured output (Outlines and Instructor). The framing argument is that the model layer is now a commodity, so the glue code around it is where engineering time is won or lost. Each tool gets an honest fork: when to reach for it and when a simpler option wins.

Published Jun 22, 2026 16:26 video 22 min read Added Jul 5, 2026 Open on YouTube →

At a glance

This is a countdown of ten open source AI tools, ranked from 10 to 1 by a single yardstick: how much pain each one deletes from your stack. None of them are secret. They carry tens of thousands of GitHub stars, millions of downloads a month, and real teams already run them in production. They just never went viral, and that is the whole problem, because right now you are probably rebuilding something one of these repos already solved perfectly.

The through line is an argument about where the value moved. The model layer is now a commodity. The glue code around it, the chunking, the PDF parsing, the retries, the provider switching, the observability, is where you actually win or lose. Every tool here is a piece of that glue, and the whole point of the video is to stop you from writing another version of code that is already open, battle tested, and free. The countdown also threads a running contrast: two tools (Outlines and Instructor) that solve the same problem from opposite ends, which is why they land at 3 and 1.

Rank	Tool	What it does	Category
10	Chonkie	Real chunking strategies for splitting documents before retrieval	RAG ingestion
9	Marker	Converts PDFs and office docs into clean, layout aware markdown	RAG ingestion
8	Langfuse	Traces, evals, and prompt management for LLM apps	Observability
7	Qdrant	Rust vector database for billion scale similarity search	Storage and retrieval
6	Ollama	Runs open weight models locally with an OpenAI compatible API	Local inference
5	DSPy	Programs and auto optimizes prompts against a metric	Prompt optimization
4	Crawl4AI	AI native web crawler that outputs clean markdown, no paywall	Web to LLM
3	Outlines	Constrains generation so invalid output cannot be produced	Structured output
2	LiteLLM	One OpenAI shaped interface routing to 100 plus providers	Gateway and routing
1	Instructor	Returns validated Python objects from any LLM via Pydantic	Structured output

Figure 1. The full countdown, ranked by how much pain each deletes. Six of the ten are pieces of a single retrieval pipeline; the rest are cross cutting layers for serving, routing, and reliability.

Most of these tools are not rivals. They are stages of the same pipeline. The diagram below places each one where it actually sits, from raw documents on the left to structured output on the right, with observability watching every step.

MarkerPDFs to markdown Crawl4AIweb to markdown

Chonkiesplit for retrieval

Qdrantvector database

Ollamaserve local models LiteLLMroute to providers DSPyoptimize prompts

Outlinesduring generation Instructorafter generation

OBSERVE EVERY STEP Langfuse: trace every call, score outputs, version prompts across the whole pipeline

Figure 2. The ten tools mapped onto one retrieval augmented generation stack. Amber is the data path (documents in), blue is the model path (tokens out). Langfuse spans the bottom because observability is not a stage, it watches all of them.

10. Chonkie, chunking that is not a naive split

Chonkie opens the list because it fixes a quality leak most people never notice. If you build anything with retrieval, a RAG pipeline where the model looks up relevant documents before it answers, you first have to chop those documents into chunks. It sounds trivial and it is not. The way you split the text decides what the retriever can even find. Split mid sentence and you hand the model garbage context. Split too coarse and you bury the one paragraph that mattered inside a wall of noise. Most people write a text.split on every 500 characters, call it done, then wonder why their answers are mediocre.

Chonkie is a tiny, fast library that gives you real chunking strategies instead of that naive split. It ships token chunking, sentence chunking, recursive chunking that respects document structure, semantic chunking that groups text by meaning, and late chunking where you embed the whole document first and then split so each chunk keeps the context of the words around it. The core insight is that there is no single right chunk size. A legal contract and a Slack export want completely different strategies, and Chonkie lets you swap between them in a line instead of rewriting your entire ingestion code.

It is lightweight on purpose: no giant dependency tree, fast enough to run over a big corpus without becoming the slow part of your pipeline. The honest caveat is that it is a small, mostly single maintainer project (since the video, the team rebranded under Feyn), so do not bet core infrastructure on it without reading the code first. Reach for it the moment your retrieval quality plateaus and you suspect the chunks are the reason, because it saves you the day you would otherwise spend hand tuning split logic and rerunning evals.

9. Marker, PDFs into clean markdown

Marker sits at roughly 18,000 stars, and it exists because the real world ships documents as PDFs. Your knowledge lives in PDFs, EPUBs, Word files, scanned reports, research papers, and manuals with two column layouts, tables, equations, and footnotes. To feed any of that to an LLM you need clean text, and PDF is one of the most hostile formats to extract cleanly. Pull text out with a basic library and you get scrambled column order, tables flattened into nonsense, and headers interleaved with body text. The model then reasons over corrupted input, and you blame the model.

Marker converts PDFs and other documents into clean markdown using machine learning models that actually understand page layout. It figures out reading order, keeps tables as tables, handles math, and strips the junk, and the output is structured markdown that drops straight into a RAG pipeline or a long context prompt. On most benchmarks it beats Nougat, the older Meta model people used to reach for, and it is faster too.

The trade off is that it is heavier than a plain text extractor because it runs ML under the hood, so for a stack of simple, well behaved PDFs it is overkill. But once your documents have any real layout complexity, tables, columns, or scans, Marker is the difference between a pipeline that works and one that quietly poisons every answer. If you are ingesting a corpus of real world documents, this is the front door.

8. Langfuse, seeing inside your LLM app

Langfuse is the open source observability layer for LLM apps, backed by Y Combinator and sitting around 7,000 stars. Once your app is more than one prompt, you go blind. A user reports a bad answer and you have no idea which step failed. Was it retrieval, the prompt, the model, or a tool called three layers deep in some agent? You are grepping logs and guessing.

Langfuse fixes that by tracing every LLM call as a structured timeline. Every prompt, every response, every tool invocation, latency, and token cost is captured, so you can replay exactly what happened on any request. On top of tracing it does evals so you can score output systematically, and prompt management so your prompts live in one versioned place instead of scattered across your codebase.

The fork here matters. Langfuse is positioned as the open source, self hostable answer to LangSmith, which is LangChain's commercial observability product. Go Langfuse if you have data residency requirements, if traces of user prompts legally cannot leave your infrastructure, or if you simply want to own the stack and have the DevOps muscle, because self hosting it means running Postgres and ClickHouse, which is real operational overhead. Pick LangSmith if you want the polished hosted experience and your org does not care where the data sits, because its UX is ahead. The open source argument wins on control and compliance, not on convenience.

7. Qdrant, the vector store under the hood

Qdrant is a vector database written in Rust, north of 20,000 stars. Embeddings turn text into vectors, long lists of numbers where similar meaning lands close together in space. To do retrieval at scale you need somewhere to store millions or billions of those vectors and find the nearest ones to a query in milliseconds. That is a vector database, and Qdrant is one of the strongest open source ones going.

The Rust choice is not a vanity detail. It means tight memory control and serious throughput, which is why Qdrant handles billion scale similarity search without falling over. You can self host it or use their managed cloud, and it does the things production actually needs: filtered search by metadata (give me the nearest vectors but only from this user's documents), payload storage, and horizontal scaling. It is the vector store under a ton of RAG systems people use every day without knowing the name.

When do you reach for a dedicated database like this versus keeping vectors in Postgres with pgvector? Pick pgvector if your data is small, already lives in Postgres, and you want one less moving part. Pick Qdrant the moment scale, filtering, or query latency becomes the bottleneck, which it will if you serve real traffic over a big corpus. It is the upgrade you make when your prototype vector store starts to choke.

6. Ollama, run open models on your own machine

Ollama crossed 80,000 stars by mid 2025, one of the fastest growing AI repos ever. It makes running an open weight model on your own machine a one command affair. Install it, type ollama run llama3, and you have a local model with an OpenAI compatible API on localhost. That compatibility is the clever part: any code you wrote against OpenAI mostly just works by pointing it at your local endpoint instead.

Its model library exploded through 2024 and 2025. Llama 3.1, 3.2, and 3.3, Mistral NeMo, Gemma 2, Phi 3 and 3.5, DeepSeek R1, and Qwen 2.5. Basically any open weight model you would want is one install away.

Here the video calls out its own hype honestly. The run local and save money pitch is real for some cases and nonsense for others. For private data that legally cannot leave your network, for offline work, for cheap experimentation, and for building desktop apps that ship a model to the user, Ollama is genuinely excellent. But the idea that a developer's MacBook running Llama 3 70B replaces a cloud API in production mostly does not hold. It is slower, less reliable, and a hosted call at fractions of a cent per thousand tokens beats it on cost and uptime once you have real traffic. Critics call the local everything fantasy developer cosplay, and for most production workloads they are right. The verdict: Ollama is a fantastic development and privacy tool, not a free production backend. Prototype with it, keep sensitive data in house, run offline, but do not use it as an excuse to skip a real inference setup when you scale.

5. DSPy, program your LLM instead of prompting it

DSPy, out of Stanford's NLP lab and north of 20,000 stars, attacks the thing every builder secretly hates: prompt engineering. You handwrite a prompt, tune it for hours, and it works. Then the model version changes and your carefully crafted wording breaks, because it was tuned to quirks of the old model. Your whole pipeline is a stack of brittle strings held together by vibes.

DSPy's argument is that you should program your LLM, not prompt it. You define modules with typed inputs and outputs, the logic of what you want, and DSPy's optimizer writes and rewrites the actual prompt text for you automatically against a metric you give it. The optimizer in DSPy 2.0 is called MIPROv2, and it can tune multi step, multimetric pipelines, so it scales past toy single task examples into real agent systems. Teams like JetBlue and Replit have run it in production.

The concrete win is self improving pipelines. Instead of a human babysitting prompt strings forever, you specify the behavior and a metric and the system tunes itself. When the model changes, you just rerun the optimizer instead of rewriting prompts by hand. The honest catch, and the critics have a point, is that the optimizer is a black box on top of a black box. It changes your prompts under the hood, so when something goes wrong it is harder to debug, and some teams prefer explicit, version controlled prompt text they can read. Reach for DSPy when you have a complex pipeline, a clear metric to optimize against, and you are tired of manual prompt churn. Stick with handwritten prompts when the task is simple and you value reading exactly what is sent to the model. It is a power tool that rewards people who already understand the problem it automates.

4. Crawl4AI, the web crawler built in turbo anger mode

Crawl4AI is the most starred open source crawler on GitHub, which tells you how badly people needed it. The origin story is the value proposition. The creator, who goes by Uncle Code, got fed up with paywalled, gated scraping services charging him to pull public web data into AI pipelines. In his words he went turbo anger mode, built Crawl4AI in days, and it went viral. No API keys forced on you, no paywall.

What makes it AI native instead of just another scraper is the output. Most scrapers hand you raw HTML and then you spend an afternoon stripping tags, navbars, ads, and scripts before the text is usable. Crawl4AI outputs clean markdown designed for RAG and LLM ingestion. It also does structured extraction by CSS selector, XPath, or by handing a schema to an LLM, plus parallel crawling, stealth mode to dodge bot detection, proxy support, and session reuse so you can crawl behind a login. It has gone enterprise grade too, hitting the v0.9 line with a partnership claiming 99.9% uptime.

The thing to watch is sustainability. This started as a single maintainer's fury project and the creator is now actively seeking enterprise sponsors, which is the signal that volunteer maintenance does not survive production grade load. Do not read that as a reason to avoid it, read it as a reason to pin your version and watch the project's health. For getting web content into an LLM pipeline as clean markdown with no gatekeeper between you and the data, nothing open source does it better right now.

3. Outlines, make invalid output impossible

Outlines, from dottxt, changes how you think about reliable output entirely, and it sets up number two, so pay attention. When you need an LLM to return valid JSON or match an exact format, the normal approach is to ask nicely, check the result, and retry if it is broken. Outlines refuses to play that game. It constrains generation at the token level during generation, so an invalid token literally cannot be produced.

Here is the mechanism. The model picks the next token from a probability distribution over its whole vocabulary. Outlines masks out every token that would violate your schema before the model chooses, so at every single step the only options left are valid ones. The result is mathematically guaranteed valid JSON, a regex match, or one of your allowed enum values, not fixed after the fact but impossible to get wrong in the first place. That guarantee comes at effectively zero latency cost, because you are not running retry loops.

" { 4 9 foo } 2 null

kept: a valid next token masked: would break the schema

The model samples only from the amber set. Invalid JSON is impossible, not repaired later.

Figure 3. How Outlines works. At a position where the schema demands a number, every non digit token is masked out of the vocabulary before sampling. The guarantee is structural, which is why it costs no retry loops.

That is why it has been adopted where it counts. vLLM, Hugging Face's Text Generation Inference, and SGLang, the three dominant open source inference servers, all integrate Outlines natively. Constrained generation is now a first class feature across hundreds of organizations serving infrastructure, not a niche plugin you bolt on.

The one hard limit defines exactly when you use it. Outlines works by reaching into the token probabilities, which means you need a model you serve yourself, an open weight model behind vLLM or TGI. You cannot do this to OpenAI's GPT 4o or Claude through their APIs, because you do not control their token sampling. That limit is the whole reason number two, and ultimately number one, exist.

2. LiteLLM, one gateway to a hundred providers

LiteLLM is the unified gateway that ends provider lock in. Every provider has its own SDK, its own request shape, its own quirks. Write your app against OpenAI, then your boss says move to Claude for cost or to Bedrock for compliance, and now you are rewriting integration code all across your codebase. The founders at BerriAI built LiteLLM after watching enterprise teams burn weeks on exactly that switching logic.

It gives you one OpenAI compatible interface that routes to over 100 LLM APIs: OpenAI, Anthropic, Bedrock, Azure, Vertex, Cohere, Hugging Face, Nvidia NIM, and the long tail, all through one shape. Swapping providers becomes a config change, not a code rewrite.

OpenAI Anthropic Bedrock Azure Vertex Cohere Hugging Face Nvidia NIM

Figure 4. LiteLLM as a single choke point. Your app speaks one OpenAI shaped dialect; LiteLLM translates it to whichever of 100 plus providers you point at, so a vendor switch is a config edit rather than a rewrite.

It comes in two forms and picking the right one matters. There is the Python SDK you import directly into your app, and there is the proxy server, an AI gateway that runs as a central service every team in your company calls. The proxy adds cost tracking, guardrails, load balancing, and logging across providers in one place, and it covers basically every endpoint in production use: chat completions, the responses API, embeddings, images, audio, batches, rerank, and even the new agent to agent endpoint that tracks the emerging A2A protocol for routing traffic between agents, not just to models.

Be careful with the proxy. Running it as a centralized gateway introduces a single point of failure, and teams have hit rate limit handling bugs and inconsistent streaming across providers under heavy load. BerriAI added Redis backed rate limiting and active health checks in response, but the criticism persists at high throughput. Use the SDK inside a single service for simple provider flexibility with no extra infrastructure to babysit. Stand up the proxy when you have many teams, many providers, and you need centralized cost and policy control, and when you do, give it the redundancy any single point of failure demands. Either way, this is the repo that keeps you from marrying one model vendor.

1. Instructor, validated data from any model

Instructor takes number one with over 11,000 stars, more than 3 million downloads a month, and over 100 contributors, and it got there on pure word of mouth among ML engineers with basically no marketing. It is number one because it deletes the single most universal piece of boilerplate in the entire LLM stack.

You ask a model for structured data. It hands you back a string. Now you parse that string into JSON, validate the fields, handle the case where it wrapped the JSON in prose, handle the missing field, handle the wrong type, and write a retry for when it is malformed. Everyone writes this, and everyone writes it again on the next project. Jason Liu, a former Stitch Fix ML engineer, got sick of rewriting it and built Instructor so he would never have to again.

Here is how it kills the boilerplate. You define a Pydantic model, a Python class describing the shape you want (names, types, constraints). You pass it as response_model to your LLM call and you get back a validated Python object. No JSON parsing, no error handling, no manual retries, because validation and automatic retries are built in. If the model returns something that does not fit your schema, Instructor catches it and retries with the validation error fed back to the model until it conforms. It is built on Pydantic v2, whose validation core was rewritten in Rust for roughly a 17 times speed up, so the checking is fast, and it is not just Python either: there are ports for TypeScript, Go, Ruby, Elixir, and Rust.

This is the other side of the fork from Outlines, and now the whole picture snaps into focus. Instructor fixes outputs after generation with retries, which means it works against any model through any API, including the closed ones like GPT 4o and Claude where you cannot touch the token sampling. Outlines prevents bad outputs during generation with a hard guarantee, but only on open weight models you serve yourself. The decision is clean: calling a hosted API, use Instructor, because post hoc validation with retries is the only option you have and it covers 99% of production cases. Serving your own open model on vLLM and wanting a mathematical guarantee at zero latency cost, use Outlines. Most builders are calling an API, which is exactly why Instructor is number one. It is the highest leverage import you can add to an LLM project today.

Dimension	Outlines (No. 3)	Instructor (No. 1)
When it acts	During generation	After generation
Mechanism	Masks invalid tokens before each sampling step	Validates the result, feeds errors back, retries
Guarantee	Mathematically valid, cannot be wrong	Valid after retries, corrected post hoc
Works with	Open weight models you serve (vLLM, TGI)	Any model through any API, including GPT and Claude
Latency cost	Effectively zero, no retry loops	Extra calls when a retry is needed
Best for	Self hosted inference wanting hard guarantees	Hosted API calls, roughly 99% of production

Figure 5. The two ends of the same fork. Outlines makes bad output impossible but needs the token stream; Instructor cleans it up afterward but works everywhere. Which one you can use is decided by whether you control the model's sampling.

One clarification that tripped people up in late 2024: Instructor moved to the 567 Labs organization on GitHub and now draws a clear line between itself and Pydantic AI. Instructor is for schema first extraction, pulling structured data out of a model. Pydantic AI is for building agents. If you just need clean, validated data back from an LLM call, Instructor is the one you want.

Key takeaways

The model layer is now a commodity. The glue code around it, ingestion, storage, routing, structure, and observability, is where you win, and these ten repos are that glue.
Six of the ten form one RAG pipeline: Marker and Crawl4AI bring content in, Chonkie chunks it, Qdrant stores and searches it, and Outlines or Instructor shape the output.
Chunking quality quietly caps retrieval quality. A naive 500 character split is why many RAG answers are mediocre.
Structured output has two opposite solutions. Outlines guarantees validity during generation on models you serve; Instructor guarantees it after generation on any API. What you control decides which you use.
LiteLLM turns provider switching into a config change, but its central proxy is a single point of failure to design around.
Several standouts (Chonkie, Crawl4AI) are small or single maintainer projects. Pin your versions and watch project health before betting production on them.
Honesty about hype is a feature here: Ollama is a development and privacy tool, not a free production backend.

Chapters

0:00 Intro, why these tools never went viral 0:22 Number 10, Chonkie, smarter chunking for RAG 1:53 Number 9, Marker, PDFs into clean markdown 3:15 Number 8, Langfuse, observability for LLM apps 4:37 Number 7, Qdrant, vector database in Rust 5:45 Number 6, Ollama, run open models locally 7:16 Number 5, DSPy, program your LLM 8:53 Number 4, Crawl4AI, the AI native web crawler 10:15 Number 3, Outlines, constrained generation 11:47 Number 2, LiteLLM, the unified gateway 13:40 Number 1, Instructor, validated data from any model

Notable quotes

"Right now, you're probably rebuilding something that one of these repos already solved perfectly." (0:11)

"I'm ranking them by how much pain each one just deletes from your stack." (0:18)

"Most people just write a text.split on every 500 characters and call it done. And then they sit there wondering why their answers are kind of mediocre." (0:37)

"Marker is the difference between a pipeline that works and one that just quietly poisons every answer." (2:55)

"The open source argument, it wins on control and compliance, not on convenience." (4:26)

"Critics call the local everything fantasy developer cosplay. And yeah, for most production workloads, they're right." (6:52)

"You should program your LLM, not prompt it." (7:44)

"He went turbo anger mode, built Crawl4AI in days and it went viral." (9:05)

"An invalid token literally cannot be produced." (10:37)

"The model layer, it's a commodity now. The glue code around it is where you actually win. And these repos, they're the glue. Stop rebuilding it." (15:55)

Resources mentioned

Chonkie, lightweight chunking library for RAG (now under Feyn)
Marker, PDF and document to markdown converter by Datalab
Nougat, the older Meta document model Marker beats on benchmarks
Langfuse, open source LLM observability, backed by Y Combinator
LangSmith and LangChain, the commercial observability alternative
Postgres and ClickHouse, the datastores Langfuse self hosting runs on
Qdrant, Rust vector database
pgvector, the Postgres extension alternative for smaller scale
Ollama, local open weight model runner
Models in Ollama's library: Llama 3, Mistral NeMo, Gemma 2, Phi 3 and 3.5, DeepSeek R1, Qwen 2.5
DSPy, prompt programming and optimization from Stanford NLP, with the MIPROv2 optimizer
JetBlue and Replit, cited as DSPy production users
Crawl4AI by Uncle Code, AI native web crawler
Outlines by dottxt, constrained generation for structured output
vLLM, Text Generation Inference, and SGLang, inference servers that integrate Outlines
LiteLLM by BerriAI, unified gateway to 100 plus providers
Providers reachable through LiteLLM: OpenAI, Anthropic, Bedrock, Azure OpenAI, Vertex AI, Cohere, Hugging Face, Nvidia NIM
A2A protocol and Redis, referenced in the LiteLLM proxy discussion
Instructor by Jason Liu, now under 567 Labs, schema first extraction
Pydantic and Pydantic AI, the validation core and the separate agent framework

Full transcript

These tools, okay, they're not new. They've got tens of thousands of GitHub stars, millions of downloads, big teams already running them in production. They just, I don't know, they never went viral. And that's kind of the whole problem, right? Because right now, you're probably rebuilding something that one of these repos already solved perfectly. So here's the countdown. 10 down to one. And I'm ranking them by how much pain each one just deletes from your stack. Number 10 is Chonkie, and it solves a problem that honestly most people don't even realize is costing them quality. So if you're building anything with retrieval, you know, a RAG pipeline where the model looks up relevant documents before it answers, you got to chop those documents into chunks first. Sounds trivial. It isn't. The way you split that text, it decides what the retriever can actually find. Split mid sentence and yeah, you just handed the model garbage context. Split too coarse and you bury the one paragraph that mattered inside this wall of noise. And most people just write a text.split on every 500 characters and call it done. And then they sit there wondering why their answers are kind of mediocre. Chonkie is a tiny fast little library that gives you actual chunking strategies instead of that naive split. So you got token chunking, sentence chunking, recursive chunking that respects document structure, semantic chunking that groups text by meaning, and late chunking where you embed the whole document first and then split, so each chunk keeps the context of the words around it. And the point is there's no single right chunk size, right? A legal contract and a Slack export, those want completely different strategies. And Chonkie lets you swap between them in a line instead of rewriting your whole ingestion code. It's lightweight on purpose. No giant dependency tree, fast enough to run over a big corpus without it becoming the slow part of your pipeline. The honest caveat though, it's a small, mostly single maintainer project. So, you know, don't bet your company's core infra on it without reading the code first. But for the thing it does, it'll save you the day you'd otherwise spend hand tuning split logic and rerunning evals. Use it the moment your retrieval quality plateaus and you kind of suspect the chunks are the reason. Number 9, Marker. Around 18,000 stars and it exists because, well, the real world ships documents as PDFs. So here's the actual problem. Your knowledge lives in PDFs, EPUBs, Word files, scanned reports, research papers, manuals with these two column layouts, tables, equations, footnotes. And to feed any of that to an LLM, you need clean text. And PDF is one of the most hostile formats to extract cleanly. Pull text out with a basic library and you get scrambled column order, tables flattened into nonsense, headers interleaved with body text, and then the model reasons over corrupted input, and surprise, you blame the model. Marker converts PDFs and other documents into clean markdown using machine learning models that actually understand page layout. It figures out reading order, keeps tables as tables, handles math, strips the junk, and the output's structured markdown that drops straight into a RAG pipeline or a long context prompt. On most benchmarks, it beats Nougat. That's the older Meta model people used to reach for, and it's faster, too. The trade off, I mean, it's heavier than a plain text extractor because it's running ML under the hood. So for a stack of simple, well behaved PDFs, yeah, it's overkill. But once your documents have any real layout complexity, tables, columns, scans, Marker is the difference between a pipeline that works and one that just quietly poisons every answer. If you're ingesting a corpus of real world documents, this is the front door. Number 8, Langfuse. It's the open source observability layer for LLM apps backed by Y Combinator, sitting around 7,000 stars. So once your app is more than one prompt, you kind of go blind. A user reports a bad answer and you have no idea which step failed. Was it retrieval, the prompt, the model, a tool called three layers deep in some agent? You're just grepping logs and guessing. Langfuse fixes that by tracing every LLM call as a structured timeline. Every prompt, every response, every tool invocation, latency, token cost, all of it captured, so you can replay exactly what happened on any request. And on top of the tracing, it does evals so you can score output systematically, and prompt management. So your prompts live in one versioned place instead of, you know, scattered all over your codebase. The fork here matters though. Langfuse is positioned as the open source self hostable answer to LangSmith, which is LangChain's commercial observability product. So which do you pick? Go Langfuse if you've got data residency requirements, if traces of user prompts legally can't leave your infrastructure, or honestly you just want to own the stack and you've got the DevOps muscle, because self hosting it means running Postgres and ClickHouse and that's, that's real operational overhead. Pick LangSmith if you want the polished hosted experience and your org doesn't care where the data sits because honestly its UX is ahead. So the open source argument, it wins on control and compliance, not on convenience. Just know which one you're actually optimizing for. Number 7, Qdrant, a vector database written in Rust, north of 20,000 stars. So embeddings turn text into vectors, which are just long lists of numbers where similar meaning lands close together in space. And to do retrieval at scale, you need somewhere to store millions or billions of those vectors and find the nearest ones to a query in milliseconds. That's a vector database. And Qdrant's one of the strongest open source ones going. And the Rust thing isn't a vanity detail, right? It means tight memory control and serious throughput, which is why it handles billion scale similarity search without just falling over. You can self host it or use their managed cloud. And it does the stuff production actually needs, filtering search by metadata. So you can say give me the nearest vectors but only from this user's documents, payload storage, horizontal scaling. It's the vector store under a ton of RAG systems people use every day without even knowing the name. So when do you reach for a dedicated database like this versus just keeping vectors in Postgres with pgvector? Pick pgvector if your data is small, already lives in Postgres, and you want one less moving part. Pick Qdrant the moment scale or filtering or query latency becomes the bottleneck, which it will if you're serving real traffic over a big corpus. It's the upgrade you make when your prototype vector store starts to choke. Number 6 is Ollama. Somewhere around over 80,000 stars by mid 2025. One of the fastest growing AI repos ever. Ollama makes running an open weight model on your own machine a one command affair. Install it, type run llama 3, and boom, you've got a local model with an OpenAI compatible API on localhost. And that compatibility, that's the clever part. Any code you wrote against OpenAI, mostly it just works by pointing it at your local endpoint instead. Its model library kind of exploded through 2024 and 2025. Llama 3.1, 3.2, 3.3, Mistral NeMo, Gemma 2, Phi 3 and 3.5, DeepSeek R1, Qwen 2.5. Basically any open weight model you'd want, one install away. Now, I got to be straight with you because there's hype here worth calling out. The run local and save money pitch. It's real for some cases and it's nonsense for others. For private data that legally cannot leave your network, for offline work, for cheap experimentation, for building desktop apps that ship a model to the user, Ollama's genuinely excellent. But the idea that a developer's MacBook running Llama 3 70B replaces a cloud API in production, that mostly doesn't hold. It's slower, less reliable, and a hosted call at fractions of a cent per thousand tokens beats it on both cost and uptime once you've got real traffic. Critics call the local everything fantasy developer cosplay. And yeah, for most production workloads, they're right. So the verdict. Ollama is a fantastic development and privacy tool, not a free production backend. Use it to prototype, to keep sensitive data in house, to run offline. Just don't use it as your excuse to skip a real inference setup when you go to scale. Number 5, DSPy, out of Stanford's NLP lab, north of 20,000 stars. And it attacks the thing every builder secretly hates, prompt engineering. So here's the pain. You handwrite a prompt, you tune it for hours, it works. Then the model version changes and your carefully crafted wording just breaks because it was tuned to quirks of the old model. Your whole pipeline's a stack of brittle strings held together by, I don't know, vibes. DSPy's argument is that you should program your LLM, not prompt it. So you define modules with typed inputs and outputs, the logic of what you want, and then DSPy's optimizer writes and rewrites the actual prompt text for you automatically against a metric you give it. The optimizer in DSPy 2.0, it's called MIPROv2. It can tune multi step, multimetric pipelines. So this scales past toy single task examples into real agent systems. Teams like JetBlue and Replit have run it in production. And the concrete win is self improving pipelines. Instead of a human babysitting prompt strings forever, you specify the behavior and a metric and the system tunes itself. When the model changes, you just rerun the optimizer instead of rewriting prompts by hand. The honest catch though, and the critics have a point here, is that the optimizer is a black box on top of a black box. It changes your prompts under the hood. So when something goes wrong, it's harder to debug. And some teams genuinely prefer explicit version controlled prompt text they can just read. So fork it like this. Reach for DSPy when you've got a complex pipeline, a clear metric to optimize against, and you're just tired of manual prompt churn. Stick with handwritten prompts when the task's simple, and you value being able to read exactly what's sent to the model. It's a power tool, and you know, like any power tool, it rewards people who already understand the problem it's automating. Number 4 is Crawl4AI and it's the most starred open source crawler on GitHub which kind of tells you how badly people needed it. So the origin story is the value prop. The creator who goes by Uncle Code, he got fed up with paywalled gated scraping services charging him to pull public web data into AI pipelines. In his words, he went turbo anger mode, built Crawl4AI in days and it went viral. No API keys forced on you, no paywall. And what makes it AI native instead of just another scraper? It's the output. Most scrapers hand you raw HTML and then you spend an afternoon stripping tags, navbars, ads, scripts before the text is even usable. Crawl4AI outputs clean markdown designed for RAG and LLM ingestion. It also does structured extraction by CSS selector, XPath, or by handing a schema to an LLM, plus parallel crawling, stealth mode to dodge bot detection, proxy support, and session reuse so you can crawl behind a login. It's gone enterprise grade, too, hitting the v0.9 line with a partnership claiming 99.9% uptime. The thing to watch is sustainability. This started as a single maintainer's fury project and the creator is now actively seeking enterprise sponsors, which is honestly the signal that volunteer maintenance doesn't survive production grade load. Don't read that as a reason to avoid it, though. Read it as a reason to pin your version and watch the project's health. For getting web content into an LLM pipeline as clean markdown with no gatekeeper between you and the data, nothing open source does it better right now. Number 3, Outlines from dottxt. And this one, it kind of changes how you think about reliable output entirely. So here's the setup and it leads straight into number two. So pay attention. When you need an LLM to return valid JSON or match an exact format, the normal approach is you ask nicely, check the result, and retry if it's broken. Outlines just refuses to play that game. It constrains generation at the token level during generation. So an invalid token literally cannot be produced. The model picks the next token from a probability distribution over its whole vocabulary. Right? Outlines masks out every token that would violate your schema before the model chooses. So at every single step, the only options left are valid ones. And the result is mathematically guaranteed valid JSON or a regex match or one of your allowed enum values, not fixed after the fact, impossible to get wrong in the first place. And that guarantee comes at effectively zero latency cost because you're not running retry loops. And that's why it's been adopted where it counts. vLLM, Hugging Face's Text Generation Inference, SGLang. The three dominant open source inference servers, they all integrate Outlines natively. So that means constrained generation is now a first class feature across hundreds of organizations serving infrastructure, not some niche plug in you bolt on. The one hard limit, it defines exactly when you use it. Outlines works by reaching into the token probabilities, which means you need a model you serve yourself, an open weight model behind vLLM or TGI. You cannot do this to GPT 4o or Claude through their APIs because you don't control their token sampling. So the fork kind of writes itself and it's the whole reason number two exists. Number 2 is LiteLLM, the unified gateway that ends provider lock in. So every provider has its own SDK, its own request shape, its own quirks. Write your app against OpenAI, then your boss says move to Claude for cost or to Bedrock for compliance. And now you're rewriting integration code all across your codebase. The founders at BerriAI built LiteLLM after watching enterprise teams burn weeks on exactly that switching logic. It gives you one OpenAI compatible interface that routes to over a 100 LLM APIs. OpenAI, Anthropic, Bedrock, Azure, Vertex, Cohere, HuggingFace, Nvidia NIM, the long tail, all of it through one shape. So swapping providers becomes a config change, not a code rewrite. It comes in two forms, and picking the right one matters. There's the Python SDK you import directly into your app. And there's the proxy server, the AI gateway that runs as a central service every team in your company calls. The proxy adds cost tracking, guardrails, load balancing, and logging across providers in one place. And it covers basically every endpoint in production use. Chat completions, the responses API, embeddings, images, audio, batches, rerank, even the new agent to agent endpoint that tracks the emerging A2A protocol for routing traffic between agents, not just to models. Now be careful with the proxy running it as a centralized gateway that introduces a single point of failure and teams have hit rate limit handling bugs and inconsistent streaming across providers under heavy load. BerriAI added Redis backed rate limiting and active health checks in response but the criticism it persists at high throughput. So fork it, use the SDK inside a single service for simple provider flexibility with no extra infrastructure to babysit. Stand up the proxy when you've got many teams, many providers, and you need centralized cost and policy control. And when you do, give it the redundancy any single point of failure demands. Either way, this is the repo that keeps you from, you know, marrying one model vendor. Number 1 is Instructor. Over 11,000 stars, more than 3 million downloads a month, over a 100 contributors, and it got there on pure word of mouth among ML engineers with basically no marketing. It's number one because it deletes the single most universal piece of boilerplate in the entire LLM stack. So you ask a model for structured data. It hands you back a string. Now you parse that string into JSON, validate the fields, handle the case where it wrapped the JSON in prose, handle the missing field, handle the wrong type, and write a retry for when it's malformed. Everyone writes this. Everyone writes it again on the next project. Jason Liu, a former Stitch Fix ML engineer. He got sick of rewriting it and built Instructor so he'd never have to again. And here's how it kills the boilerplate. You define a Pydantic model, which is just a Python class describing the shape you want, names, types, constraints. You pass it as response_model to your LLM call, and you get back a validated Python object. No JSON parsing, no error handling, no manual retries, because validation and automatic retries are built in. If the model returns something that doesn't fit your schema, Instructor catches it and retries with the validation error fed back to the model until it conforms. It's built on Pydantic v2 whose validation core was rewritten in Rust for roughly 17 times speed up. So the checking's fast and it's not just Python either. There are ports for TypeScript, Go, Ruby, Elixir, and Rust. This is also the other side of the fork from Outlines. And now, now you can see the whole picture. Instructor fixes outputs after generation with retries, which means it works against any model through any API, including GPT 4o and Claude, the closed ones where you can't touch the token sampling. Outlines prevents bad outputs during generation with a hard guarantee, but only on open weight models you serve yourself. So the decision's clean. Calling a hosted API like OpenAI or Anthropic, use Instructor because post hoc validation with retries is the only option you've got and it covers 99% of production cases. Serving your own open model on vLLM and you want a mathematical guarantee at zero latency cost, use Outlines. Most builders, they're calling an API, which is exactly why Instructor's number one. It's the highest leverage import you can add to an LLM project today. One clarification because it tripped people up in late 2024. Instructor moved to the 567 Labs organization on GitHub and now draws a clear line between itself and Pydantic AI. Instructor is for schema first extraction, pulling structured data out of a model. Pydantic AI is for building agents. So if you just need clean, validated data back from an LLM call, Instructor's the one you want. So that's the 10. None of them are secret. They just never got the hype they earned. And if you take one thing away, it's this. Before you write another JSON parser, another provider switch, another retry loop, another chunking function, just check whether one of these already solved it. Because the model layer, it's a commodity now. The glue code around it is where you actually win. And these repos, they're the glue. Stop rebuilding it.