Architecture overview
DocAsk is organized around a simple idea: all sources are converted into the same internal document format before retrieval.
The initial use case is MMORE, but the core pipeline is designed to remain project-agnostic. Project-specific behavior is isolated in optional project profiles.
High-level flow
    Target project repository
            |
            |  Markdown / RST docs
            |  Python source files
            |  YAML config files
            |  repository tree
            v
    DocAsk loaders and extractors
            v
    DocumentRecord objects
            v
    corpus.jsonl
            v
    retrieval backend
            |-------------------------------|
            |                               |
            v                               v
    simple retriever                  MMORE retriever
    (local / dynamic corpus)           (MMORE index)
            |                               |
            |-------------------------------|
            v
    retrieved sources
            v
    project profile
            |
            |  optional direct answer
            |  optional query expansion / filtering / reranking
            v
    RAG prompt construction
            v
    LLM or extractive answer
            v
    answer with cited sources
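The lower half of the flow (retrieval, optional profile hooks, prompt construction, answer) can be sketched roughly as follows. This is a minimal illustration, not DocAsk's actual code: `ProjectProfile`, its hook names, the `answer` function, the prompt wording, and the `text` / `source_path` record keys are all hypothetical. Running query expansion before retrieval is likewise an assumption about ordering.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ProjectProfile:
    # Hypothetical hooks; None means "skip this stage".
    direct_answer: Optional[Callable[[str], Optional[str]]] = None
    expand_query: Optional[Callable[[str], str]] = None
    rerank: Optional[Callable[[str, list], list]] = None


def answer(query, retrieve, generate, profile=None, top_k=3):
    """Run retrieval, apply optional profile hooks, then build a prompt and answer."""
    profile = profile or ProjectProfile()

    # Optional direct answer: short-circuit before any retrieval happens.
    if profile.direct_answer is not None:
        direct = profile.direct_answer(query)
        if direct is not None:
            return {"answer": direct, "sources": []}

    # Optional query expansion (assumed here to run before retrieval).
    search_query = profile.expand_query(query) if profile.expand_query else query
    sources = retrieve(search_query)

    # Optional reranking of whatever the backend returned.
    if profile.rerank is not None:
        sources = profile.rerank(query, sources)
    sources = sources[:top_k]

    # Number the sources so the generated answer can cite them.
    context = "\n\n".join(
        f"[{i}] {source['text']}" for i, source in enumerate(sources, start=1)
    )
    prompt = (
        "Answer using only the numbered sources.\n\n"
        f"{context}\n\nQuestion: {query}"
    )
    return {
        "answer": generate(prompt),
        "sources": [source["source_path"] for source in sources],
    }
```

Because `retrieve` and `generate` are passed in as plain callables, the same function works with either backend and with either an LLM or an extractive answerer.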
Main design choices
DocAsk separates the pipeline into clear blocks:
| Block | Role |
|---|---|
| | Load source files that are already documentation-like. |
| | Extract documentation from source code. |
| | Combine all sources into one corpus. |
| | Export and index the corpus with MMORE. |
| | Retrieve relevant documents. |
| | Hold optional project-specific query expansion, filtering, reranking, and direct answers. |
| | Build prompts and generate answers. |
| | Manage selected projects, generated project configs, and persisted app state. |
| | Streamlit user interface. |
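As an illustration of the "extract documentation from source code" role, the sketch below pulls docstrings out of a Python module with the standard `ast` module. It is one plausible approach, not a description of DocAsk's extractor. The function name `extract_docstrings` and the `(name, docstring)` output shape are assumptions.

```python
import ast


def extract_docstrings(source, module_name="<module>"):
    """Return (name, docstring) pairs for a module and its classes and functions.

    Names are prefixed with the module name only, so a method appears as
    "module.method" rather than "module.Class.method".
    """
    tree = ast.parse(source)
    found = []

    module_doc = ast.get_docstring(tree)
    if module_doc:
        found.append((module_name, module_doc))

    # ast.walk visits every node, including nested functions and methods.
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            doc = ast.get_docstring(node)
            if doc:
                found.append((f"{module_name}.{node.name}", doc))
    return found
```

Parsing with `ast` avoids importing the target project, so extraction does not execute any of its code.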
Why keep a DocAsk format?
DocAsk uses its own DocumentRecord format instead of exposing MMORE everywhere.
This keeps the project modular:
- the corpus can be inspected before indexing;
- the simple retriever can run without MMORE;
- MMORE can be replaced or updated without rewriting loaders;
- retrieved sources keep consistent metadata for citations;
- Streamlit can work with project-specific corpora before MMORE indexing is available.
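To make the idea of a shared, inspectable record concrete, here is a minimal sketch of what a `DocumentRecord` and its `corpus.jsonl` round trip could look like. The field names (`doc_id`, `text`, `source_path`, `source_type`, `metadata`) and the helpers `write_corpus` / `read_corpus` are illustrative assumptions; the real schema may differ.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class DocumentRecord:
    # Field names are assumptions for illustration, not DocAsk's real schema.
    doc_id: str
    text: str
    source_path: str
    source_type: str  # e.g. "markdown", "python", "yaml", "tree"
    metadata: dict = field(default_factory=dict)


def write_corpus(records, path):
    """Write one JSON object per line so the corpus can be inspected by hand."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")


def read_corpus(path):
    """Load records back from a JSONL file, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [DocumentRecord(**json.loads(line)) for line in handle if line.strip()]
```

With one record per line, the corpus can be checked with ordinary text tools before any index is built, and every retriever sees the same metadata.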
Simple backend vs MMORE backend
The simple backend reads a selected corpus.jsonl directly. It is useful for:
- local development;
- newly built project corpora;
- debugging retrieval quality;
- avoiding MMORE indexing.
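A retriever that reads `corpus.jsonl` directly can be very small. The sketch below ranks records by counting query-token occurrences. This scoring scheme is a stand-in chosen for brevity, not necessarily what the simple backend does. The function names and the `text` key are assumptions.

```python
import json
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def load_jsonl(path):
    """Read one JSON object per non-blank line."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def simple_retrieve(query, records, top_k=3):
    """Rank records by how often the distinct query tokens occur in their text."""
    query_tokens = set(tokenize(query))
    scored = []
    for record in records:
        counts = Counter(tokenize(record["text"]))  # assumes a "text" key
        score = sum(counts[token] for token in query_tokens)
        if score > 0:
            scored.append((score, record))
    # Stable sort: ties keep their original corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:top_k]]
```

Because nothing is indexed ahead of time, a freshly rebuilt corpus is searchable immediately, which is what makes this style of backend convenient for debugging.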
The mmore backend retrieves from an MMORE index. It is the target backend, but the index must be built separately.
For a newly selected project, the safe first workflow is:
Build corpus → use backend simple
The full MMORE workflow is:
Build corpus → export MMORE corpus → build MMORE index → use backend mmore
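The two workflows suggest a simple fallback rule: use `mmore` only once an index exists, otherwise stay on `simple`. The sketch below encodes that rule. The function `choose_backend` is hypothetical, and so is the check itself: treating any non-empty directory as a built index is an assumption, not how MMORE reports index availability.

```python
from pathlib import Path


def choose_backend(corpus_path, index_dir=None):
    """Pick "mmore" only when an index directory looks built, else "simple"."""
    corpus = Path(corpus_path)
    if not corpus.is_file():
        # Both workflows start with "Build corpus".
        raise FileNotFoundError(f"Build the corpus first: {corpus} is missing")

    if index_dir is not None:
        index = Path(index_dir)
        # Assumption: a non-empty directory stands in for "index was built".
        if index.is_dir() and any(index.iterdir()):
            return "mmore"
    return "simple"
```

A newly selected project therefore starts on the simple backend and moves to `mmore` only after the separate export and indexing steps have produced output.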