11th April 2026 • 12 min read • Tutorial • Kevin Collins

Karpathy's LLM Wiki: How We Built a 2,188-Document Knowledge Base from AI Research

I wanted to build a personal wiki. Then I remembered I had over a year of deep research sitting in Manus AI — 1,414 tasks worth of market analysis, competitive intelligence, and technical specs. So I pulled it all out, ran every deliverable through an LLM for categorisation, and built a compounding knowledge base in Obsidian. Here is the full story, what it cost, and how to do it yourself.

[Illustration: unorganised documents flowing through an AI processor into a neatly categorised knowledge base, representing Karpathy's LLM Wiki pattern]

What is Karpathy's LLM Wiki?

In April 2026, Andrej Karpathy published a gist describing a pattern for building personal knowledge bases with LLMs. Instead of traditional RAG — where the LLM re-derives answers from scratch on every query — the LLM incrementally builds and maintains a persistent wiki. Knowledge is compiled once and kept current, not re-discovered every time you ask a question. The wiki is a compounding artifact: cross-references are already there, contradictions are flagged, and synthesis reflects everything you have ever read.

TL;DR

  • The pattern: Karpathy's LLM Wiki uses three layers — raw sources (immutable), a wiki (LLM-generated markdown), and a schema (conventions for the LLM). Knowledge compounds instead of being re-derived.
  • The extraction: We pulled 1,414 tasks and 28,864 messages from Manus AI via their REST API. Read-only calls cost zero credits.
  • The categorisation: 2,188 markdown deliverables processed through an LLM with quality scoring, auto-discovered categories, entity extraction, and YAML frontmatter.
  • The result: 10 primary categories, 1,550 subcategories, every document scored and searchable in Obsidian. Total cost: under $50.
  • Ongoing: A drop zone where any file (.md, .pdf, .docx) gets automatically categorised and filed into the wiki.

01. I Wanted to Build a Personal Wiki

I have been using Manus AI heavily for the past year. If you have not used it, the depth of research it produces is genuinely remarkable — it does not just answer your question, it goes off, browses the web, reads documents, cross-references sources, and comes back with structured deliverables that would take a human researcher days or weeks. Market sizing reports with segmented TAM/SAM/SOM estimates. Competitive intelligence dossiers with executive profiles, acquisition histories, and financials. Technical architecture documents. Full business plans.

The problem was: I had done over 1,400 of these research tasks across a full year and all of that knowledge was locked inside Manus's conversation history. I knew I had asked it to research the Irish med-tech market at least three different times from three different angles. I knew there were competitive analyses, go-to-market strategies, paludiculture research, AI agent architecture docs — all sitting there, inaccessible and disconnected from each other.

Then Karpathy published his LLM Wiki gist and I thought: that is exactly what I need. Not just a way to file my research, but a way to turn it into a compounding knowledge base where everything is categorised, cross-referenced, and searchable. Where new research builds on top of old research instead of existing in isolation.

This is the story of how I did it — from extracting everything out of Manus to building a fully categorised, quality-scored knowledge base in Obsidian. The whole thing took an evening and cost under $50.

02. Karpathy's Three-Layer Architecture

Karpathy's LLM Wiki gist proposes three layers:

Raw Sources (Immutable)

Your curated collection of source documents. Articles, papers, research outputs. The LLM reads from them but never modifies them. This is your source of truth.

The Wiki (LLM-Generated)

A directory of LLM-generated markdown files — summaries, entity pages, concept pages, comparisons. The LLM owns this layer entirely. It creates pages, updates them, maintains cross-references. You read it; the LLM writes it.

The Schema (Conventions)

A configuration document that tells the LLM how the wiki is structured, what conventions to follow, and what workflows to run. This is what makes the LLM a disciplined wiki maintainer rather than a generic chatbot.

The key insight is deceptively simple: humans abandon wikis because the maintenance burden grows faster than the value. Updating cross-references, keeping summaries current, noting when new data contradicts old claims — it is tedious and relentless. LLMs do not get bored. They can touch fifteen files in one pass and never forget to update a cross-reference.

Karpathy describes three core operations: Ingest (process a new source, update multiple wiki pages), Query (answer questions by synthesising existing pages), and Lint (periodic health checks for contradictions and gaps). Our implementation focuses on the first — building the initial wiki from a year's worth of AI research.

03. Extracting 1,414 Tasks from Manus AI

The first challenge was getting the data out. Manus AI exposes a REST API at api.manus.ai with cursor-based pagination. The critical endpoints: task.list returns your tasks 100 at a time, and task.listMessages returns the full conversation history for each task including file attachments with signed download URLs.

Key Finding: Read Operations Are Free

Manus credits are consumed when tasks run (LLM inference, virtual machines, browser automation). The list and read endpoints are pure GET requests against stored data — they consume zero credits. We verified this empirically: starting balance 14,107 credits, ending balance after extracting all 1,414 tasks: 14,107 credits.

We built a Node.js extraction script with zero external dependencies — just native fetch in Node 24. The critical design decision was resumability: a progress file tracks which tasks have been downloaded. If the process is interrupted by a network issue, rate limit, or manual stop, re-running picks up exactly where it left off.

The full extraction of 1,414 tasks with 28,864 messages completed in about eight minutes. We selectively downloaded only .md and .txt attachments — the actual research deliverables — skipping images, zip files, and other media. The result: 2,188 documents totalling 73MB of structured research.
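
The pagination and resumability logic can be sketched roughly like this. The task.list endpoint name comes from above, but the response fields (tasks, nextCursor) and the cursor query parameter are hypothetical placeholders; check the real Manus API response shape before relying on them.

```javascript
// Cursor-paginated listing with an injectable fetch, so the loop is testable.
// Response shape ({ tasks, nextCursor }) is an assumption, not the documented API.
async function listAllTasks(fetchImpl, baseUrl, apiKey) {
  const tasks = [];
  let cursor = null;
  do {
    const url = new URL(`${baseUrl}/task.list`);
    if (cursor) url.searchParams.set("cursor", cursor);
    const res = await fetchImpl(url, {
      headers: { Authorization: `Bearer ${apiKey}` },
    });
    const page = await res.json();
    tasks.push(...page.tasks);
    cursor = page.nextCursor ?? null; // null when the last page is reached
  } while (cursor);
  return tasks;
}

// Resumability: given the IDs recorded in the progress file, return only
// the tasks that still need downloading.
function pendingTasks(allTasks, progress) {
  const done = new Set(progress);
  return allTasks.filter((t) => !done.has(t.id));
}
```

Because fetch is injected, an interrupted run just re-reads the progress file and calls pendingTasks before downloading anything.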

1,414 tasks extracted • 28,864 messages • 2,188 documents • $0 Manus credits

04. Categorisation: Letting the AI Discover the Taxonomy

Filing 2,188 documents into folders without assessment creates a filing cabinet, not a knowledge base. The value of the wiki layer is that every document has been assessed, categorised, summarised, and enriched with metadata — so both humans and LLMs can navigate the collection intelligently.

I made a deliberate choice: do not predefine the category taxonomy. Let the model discover categories from the content. This avoids two failure modes. First, premature categorisation — defining categories before seeing the data leads to overloaded catch-all buckets and empty specific ones. Second, human bias — the person creating categories reflects their mental model of what should be there, not what is actually there.

You can use any capable LLM for this step — OpenAI, Claude, Gemini, or even a strong open-source model. The key is using the best model available to you, not the cheapest. Categorisation quality determines the usefulness of the entire knowledge base, so this is not the place to optimise for cost. I used OpenAI for this pass. Each document is sent with a structured prompt, and the model returns JSON:

{
  "title": "Marketing Strategy for Entering the Irish GP EHR Market",
  "primary_category": "Business Strategy",
  "subcategory": "Healthcare SaaS Go-to-Market Strategy",
  "tags": ["marketing", "EHR", "Ireland", "healthcare IT"],
  "entities": ["Clanwilliam Group", "HSE", "ICGP"],
  "summary": "Go-to-market strategy for an EHR company targeting...",
  "quality": 4,
  "content_type": "strategy_document"
}

Quality scoring uses a strict 1–5 rubric. A 5 means comprehensive analysis with actionable insights. A 1 means trivial, empty, or broken. The model was remarkably consistent — the same 10 primary categories emerged naturally across all 2,188 documents without us prescribing them.
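
Before filing anything, it is worth validating the model's JSON against the rubric, since a malformed response would poison the wiki. A minimal sketch, using the field names from the example above:

```javascript
// Validate a categorisation response before it is written into the wiki.
// Field names mirror the example JSON above; adjust to your prompt's schema.
function validateCategorisation(raw) {
  const doc = typeof raw === "string" ? JSON.parse(raw) : raw;
  const required = ["title", "primary_category", "subcategory", "summary", "quality"];
  for (const field of required) {
    if (doc[field] === undefined) throw new Error(`missing field: ${field}`);
  }
  // Enforce the strict 1-5 quality rubric.
  if (!Number.isInteger(doc.quality) || doc.quality < 1 || doc.quality > 5) {
    throw new Error(`quality must be an integer 1-5, got ${doc.quality}`);
  }
  // Optional list fields default to empty.
  doc.tags ??= [];
  doc.entities ??= [];
  return doc;
}
```

A document that fails validation can simply be re-sent to the model rather than filed with broken metadata.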

Categories That Emerged

| Category | Documents | Example Subcategories |
| --- | --- | --- |
| Business Strategy | 698 | Healthcare SaaS GTM, EdTech Execution, Competitive Intelligence |
| Education | 484 | AI Assessment Systems, Irish Exam Marking, Course Development |
| AI & Technology | 431 | Agent Frameworks, LLM Orchestration, Desktop App Architecture |
| Healthcare | 307 | Irish Health Data Analytics, GP Communication, Digital Health |
| Environment | 133 | Paludiculture, Carbon Credits, Peatland Restoration |
| Legal & Compliance | 51 | GDPR, ISO 27001, IP Strategy |
| Cybersecurity | 33 | Penetration Testing, Threat Intelligence, Bug Bounty |
| Creative | 24 | Content Strategy, Video Production, Brand Design |
| Finance | 16 | VC Landscape, Financial Projections, Investment Analysis |
| Personal Development | 11 | Health & Wellness, Stress Management, Career Planning |

05. The Wiki: YAML Frontmatter + Obsidian

Each categorised document is written to an Obsidian vault with YAML frontmatter at the top — machine-readable metadata that powers structured queries — followed by a summary block and then the full original document, unmodified.

---
title: "Marketing Strategy for Entering the Irish GP EHR Market"
date: 2025-04-04
quality: 4
content_type: strategy_document
primary_category: "Business Strategy"
subcategory: "Healthcare SaaS Go-to-Market Strategy"
tags: ["marketing", "EHR", "Ireland", "healthcare IT", "GP"]
entities: ["Clanwilliam Group", "HSE", "ICGP", "CompleteGP"]
source: manus
---

> **Summary:** Go-to-market strategy for an EHR company targeting
> GP practices in Ireland, covering competitive landscape,
> positioning, channel strategy, and implementation roadmap.

# Comprehensive Marketing Strategy for GP EHR Company in Ireland
...
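
Assembling the wiki page is plain string work. A minimal sketch that mirrors the layout above (JSON-style quoting is used for the YAML values, which YAML accepts; the exact helper is an illustration, not the production script):

```javascript
// Build a wiki page: YAML frontmatter, blockquoted summary, then the
// original document content, unmodified.
function buildWikiPage(meta, content) {
  const yamlList = (xs) => `[${xs.map((x) => JSON.stringify(x)).join(", ")}]`;
  const frontmatter = [
    "---",
    `title: ${JSON.stringify(meta.title)}`,
    `date: ${meta.date}`,
    `quality: ${meta.quality}`,
    `content_type: ${meta.content_type}`,
    `primary_category: ${JSON.stringify(meta.primary_category)}`,
    `subcategory: ${JSON.stringify(meta.subcategory)}`,
    `tags: ${yamlList(meta.tags)}`,
    `entities: ${yamlList(meta.entities)}`,
    "source: manus",
    "---",
  ].join("\n");
  const summary = `> **Summary:** ${meta.summary}`;
  return `${frontmatter}\n\n${summary}\n\n${content}\n`;
}
```

The original content goes in verbatim; only metadata is prepended, so the raw-sources layer stays recoverable from the wiki itself.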

The folder structure follows the categorisation naturally:

wiki/
├── Business Strategy/
│   ├── Healthcare SaaS Go-to-Market Strategy/
│   │   ├── Marketing Strategy for Entering the Irish GP EHR Market.md
│   │   └── GP EHR Customer Acquisition Strategy for Ireland.md
│   ├── EdTech Startup Execution Plan/
│   └── Competitive Intelligence/
├── Healthcare/
├── AI & Technology/
│   ├── Agent Frameworks/
│   └── LLM-Powered Knowledge Management/
├── Environment/
│   └── Paludiculture/
└── Index.md

Why Obsidian? Three reasons. It is local-first — all data stays on disk with no cloud dependency or vendor lock-in. It has a graph view that visualises connections between documents based on tags and wikilinks. And the Dataview plugin runs structured queries against YAML frontmatter, turning a folder of markdown files into a queryable database.

As Karpathy puts it: Obsidian is the IDE, the LLM is the programmer, the wiki is the codebase.

06. The Ingest Pipeline: Drop, Categorise, File

A knowledge base that only works with one data source is a dead archive. Karpathy's pattern is designed for ongoing accumulation. We built a universal ingest pipeline: a drop zone where you place any file and it gets automatically read, categorised, enhanced, and filed into the wiki.

1. raw/inbox/: Drop any file here — .md, .pdf, .docx, .txt, .html, .json, .csv
2. LLM categorisation: Your AI model of choice reads, assesses quality, discovers categories, extracts entities, and generates YAML frontmatter
3. raw/processed/: Originals archived with date prefix — nothing is ever deleted, always recoverable
4. wiki/{category}/{subcategory}/: Clean .md file with YAML frontmatter, summary, and full content — filed into the right folder automatically
5. log.md: Chronological record — "Ingested Marketing Strategy on 2026-04-11" — so you always know what was added and when

The ingest script handles multiple formats through a fallback chain. Markdown and text files are read directly. PDFs go through pdftotext first, falling back to Python's pdfminer for complex layouts. Word documents use python-docx for paragraph-level extraction. HTML, JSON, and CSV are read as-is — the LLM handles the interpretation.

For non-markdown inputs, the model also returns a cleaned_content field — the document properly formatted as markdown with headers, lists, and structure restored from the messy extraction. The instruction is explicit: preserve all original content, only fix formatting. The wiki page should contain the full source material, not a compression of it.

Each ingested document gets the same frontmatter treatment as the Manus exports — categories, quality scores, entity extraction, tags. It slots into the existing folder structure alongside related documents. The log records what was ingested and when, giving you a chronological timeline of the wiki's growth.
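
The filing and logging steps reduce to a couple of pure functions. A sketch following the folder layout above; the slug-cleaning rule and log-line format are assumptions for illustration:

```javascript
// Destination path follows the wiki/{category}/{subcategory}/ layout.
// Characters that would break a path are replaced (assumed cleaning rule).
function wikiDestination(meta) {
  const clean = (s) => s.replace(/[\/\\:]/g, "-").trim();
  return `wiki/${clean(meta.primary_category)}/${clean(meta.subcategory)}/${clean(meta.title)}.md`;
}

// Originals are archived with a date prefix so nothing is ever overwritten.
function archiveName(originalName, date) {
  return `raw/processed/${date}-${originalName}`;
}

// One chronological line per ingested document, appended to log.md.
function logEntry(meta, date) {
  return `- ${date}: Ingested "${meta.title}" into ${meta.primary_category} / ${meta.subcategory}`;
}
```

Keeping these pure makes the actual ingest script a thin wrapper: compute paths, move the file, append the log line.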

07. Why This Beats RAG

The conventional approach to making AI work with your documents is RAG — Retrieval-Augmented Generation. You chunk your documents, embed them into vectors, store them in a vector database, and at query time the LLM retrieves the most similar chunks to answer your question. NotebookLM, ChatGPT file uploads, and most enterprise AI tools work this way.

RAG works. But it has a structural limitation: nothing compounds. Every query starts from scratch. The LLM does not remember that it synthesised five documents about the Irish healthcare market last week. It does not know that a newer source contradicts an older claim. It cannot tell you what the collective finding is across all your research — it can only find relevant chunks and generate a one-off answer.

The LLM Wiki inverts the architecture. Instead of doing work at query time, you do work at ingest time. When a new document enters the system, the LLM reads it, assesses it, categorises it, extracts entities, scores its quality, and files it with metadata. The cost is paid once. Every future interaction — whether a human browsing Obsidian or an LLM reading the index — benefits from that upfront investment.

| Dimension | RAG | LLM Wiki |
| --- | --- | --- |
| When work happens | Query time | Ingest time |
| Knowledge compounds | No | Yes |
| Cross-references | Re-derived each time | Pre-built and persistent |
| Contradictions | Not detected | Flagged at ingest |
| Human browsable | Not really | Fully — it is just markdown |
| Infrastructure | Vector DB, embeddings, retrieval pipeline | A folder of .md files |

08. What It Cost

The entire knowledge base — from raw API extraction to fully categorised wiki — was built for under $50 in API costs and zero platform credits.

| Phase | Model | Tokens | Cost |
| --- | --- | --- | --- |
| Manus extraction | N/A | N/A | $0 |
| Document categorisation | Any LLM API | 7.2M in / 639K out | ~$40-50 |
| Ongoing ingest (per doc) | Any LLM API | ~2-5K in / ~300 out | ~$0.01-0.03 |

The categorisation is a one-time cost. Once a document has its frontmatter, it does not need re-processing. Only new documents ingested through the inbox incur additional API costs — roughly one to three cents each depending on length.
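
You can sanity-check these numbers with simple arithmetic. The per-million-token prices below are hypothetical placeholders, not quotes from any provider, so plug in your own rates:

```javascript
// Back-of-envelope API cost: tokens in/out times price per million tokens.
function apiCost(tokensIn, tokensOut, pricePerMIn, pricePerMOut) {
  return (tokensIn / 1e6) * pricePerMIn + (tokensOut / 1e6) * pricePerMOut;
}

// Assuming (hypothetically) $5/M input and $15/M output:
// 7.2M in + 639K out -> 36 + 9.585 = ~$45.59, inside the ~$40-50 range above.
```

The same function with a single document's ~2-5K input tokens lands in the one-to-three-cent range the article quotes.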

09. What's Next

This implementation covers the Ingest layer of Karpathy's pattern. The foundation is solid — 2,188 documents, categorised, scored, searchable, with an ongoing ingest pipeline. But the full vision includes more:

  • Synthesis pages: Cross-document topic summaries that combine insights from multiple sources into unified wiki pages. When five documents discuss the Irish healthcare market, one authoritative overview page should exist.
  • Web research ingest: Using an LLM with web search to generate research documents on demand, filed directly into the wiki. Ask a question, the LLM researches it, and the findings become a permanent wiki page.
  • Lint operations: Periodic health checks for contradictions between pages, stale claims that newer sources have superseded, orphan pages with no connections, and knowledge gaps worth investigating.
  • Entity relationship graphs: Mapping how companies, technologies, and concepts connect across the entire knowledge base — visible through Obsidian's graph view.

The code for the extraction, categorisation, and ingest pipeline is straightforward Node.js — no frameworks, no dependencies, just native fetch and any LLM API. The entire system is a folder of markdown files and a few scripts. That is the beauty of the pattern: the infrastructure is almost trivially simple. The LLM does the hard part.

Karpathy's gist is intentionally abstract — it describes the pattern, not a specific implementation. This is mine. A year of AI research, extracted from a platform, categorised by a frontier model, and assembled into a persistent knowledge base that will compound with every document I add.

The human curates sources, directs analysis, asks good questions, and thinks about what it all means. The LLM does everything else. That division of labour is the entire point.

Written by Kevin Collins

Founder at Echofold · Claude Ambassador · Manus Fellow

I've won hackathons, trained enterprise teams, and built startups with AI agents. I write about all of it — the wins, the failures, and the stuff nobody talks about.
