Architecture overview

DocAsk is organized around a simple idea: all sources are converted into the same internal document format before retrieval.

The initial use case is MMORE, but the core pipeline is designed to remain project-agnostic. Project-specific behavior is isolated in optional project profiles.

High-level flow

Target project repository
        |
        |  Markdown / RST docs
        |  Python source files
        |  YAML config files
        |  repository tree
        v
DocAsk loaders and extractors
        v
DocumentRecord objects
        v
corpus.jsonl
        v
retrieval backend
        |-------------------------------|
        |                               |
        v                               v
simple retriever                MMORE retriever
(local / dynamic corpus)        (MMORE index)
        |                               |
        |-------------------------------|
        v
retrieved sources
        v
project profile
        |
        |  optional direct answer
        |  optional query expansion / filtering / reranking
        v
RAG prompt construction
        v
LLM or extractive answer
        v
answer with cited sources
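
The first half of this flow (loaders and extractors, then DocumentRecord objects, then corpus.jsonl) can be sketched roughly as follows. This is a minimal illustration, not DocAsk's actual code: the field names (doc_id, text, source_path, source_type, metadata) and the helper functions write_corpus and read_corpus are assumptions made for the example, and the real DocumentRecord may carry different fields.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Iterable


@dataclass
class DocumentRecord:
    # Hypothetical fields chosen for illustration; the real class may differ.
    doc_id: str
    text: str
    source_path: str
    source_type: str  # e.g. "markdown", "python", "yaml", "tree"
    metadata: dict = field(default_factory=dict)


def write_corpus(records: Iterable[DocumentRecord], path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    count = 0
    with path.open("w", encoding="utf-8") as fh:
        for record in records:
            fh.write(json.dumps(asdict(record), ensure_ascii=False) + "\n")
            count += 1
    return count


def read_corpus(path: Path) -> list[DocumentRecord]:
    """Load every non-empty line of a JSONL corpus back into records."""
    with path.open(encoding="utf-8") as fh:
        return [DocumentRecord(**json.loads(line)) for line in fh if line.strip()]
```

Because every loader and extractor would emit the same record type, the retrieval stage never needs to know whether a record came from a Markdown page, a Python docstring, or a YAML file.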

Main design choices

DocAsk separates the pipeline into clear blocks:

Block               Role
loaders/            Load source files that are already documentation-like.
extractors/         Extract documentation from source code.
corpus/             Combine all sources into one corpus.
indexing/           Export and index the corpus with MMORE.
retrieval/          Retrieve relevant documents.
project_profiles/   Hold optional project-specific query expansion, filtering, reranking, and direct answers.
rag/                Build prompts and generate answers.
projects/           Manage selected projects, generated project configs, and persisted app state.
app/                Streamlit user interface.
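
One way to keep project-specific behavior optional, as project_profiles/ is described above, is a base profile whose hooks all do nothing, with project profiles overriding only what they need. The sketch below is an assumption about how such hooks could look: the class names, method names, and the source_type key are hypothetical and not taken from DocAsk.

```python
from typing import Optional


class ProjectProfile:
    """Hypothetical base profile: every hook is a no-op by default."""

    def direct_answer(self, query: str) -> Optional[str]:
        # None means "no shortcut answer"; the pipeline continues to RAG.
        return None

    def expand_query(self, query: str) -> str:
        return query

    def rerank(self, query: str, sources: list[dict]) -> list[dict]:
        return sources


class ConfigFirstProfile(ProjectProfile):
    """Illustrative profile that favors YAML sources for config questions."""

    def expand_query(self, query: str) -> str:
        if "config" in query.lower():
            return query + " yaml"
        return query

    def rerank(self, query: str, sources: list[dict]) -> list[dict]:
        if "config" not in query.lower():
            return sources
        # sorted() is stable, so sources keep their order within each group;
        # the key is False for YAML sources, which therefore sort first.
        return sorted(sources, key=lambda s: s.get("source_type") != "yaml")
```

With this shape, the core pipeline can always call the same hooks, and a project without a dedicated profile simply uses the base class.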

Why keep a DocAsk format?

DocAsk uses its own DocumentRecord format instead of tying every module to MMORE.

This keeps the project modular:

  • the corpus can be inspected before indexing;

  • the simple retriever can run without MMORE;

  • MMORE can be replaced or updated without rewriting loaders;

  • retrieved sources keep consistent metadata for citations;

  • Streamlit can work with project-specific corpora before MMORE indexing is available.
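
As an example of inspecting the corpus before indexing, the sketch below counts records per source type in a corpus.jsonl file. It assumes each line is a JSON object and that records carry a source_type key; that key name and the function summarize_corpus are hypothetical, so adjust them to the actual record fields.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_corpus(path: Path, key: str = "source_type") -> Counter:
    """Count corpus records by the value of `key` (assumed field name)."""
    counts: Counter = Counter()
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            if line.strip():  # skip blank lines
                record = json.loads(line)
                counts[record.get(key, "unknown")] += 1
    return counts
```

A quick summary like this can reveal an empty or lopsided corpus (for instance, no Python records at all) before any time is spent on indexing.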

Simple backend vs MMORE backend

The simple backend reads a selected corpus.jsonl directly. It is useful for:

  • local development;

  • newly built project corpora;

  • debugging retrieval quality;

  • working without building an MMORE index.
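
To make the idea of reading corpus.jsonl directly concrete, here is a minimal keyword-overlap retriever. It is a stand-in written for illustration, not DocAsk's actual simple retriever: the scoring rule (summed term frequency of query words), the text key, and the function names are all assumptions.

```python
import json
import re
from collections import Counter
from pathlib import Path


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into alphanumeric/underscore tokens."""
    return re.findall(r"[a-z0-9_]+", text.lower())


def load_corpus(path: Path) -> list[dict]:
    """Read every non-empty JSONL line as a dict."""
    with path.open(encoding="utf-8") as fh:
        return [json.loads(line) for line in fh if line.strip()]


def retrieve(query: str, records: list[dict], top_k: int = 3) -> list[dict]:
    """Rank records by how often the query terms occur in their text."""
    query_terms = set(tokenize(query))
    scored = []
    for record in records:
        counts = Counter(tokenize(record.get("text", "")))
        score = sum(counts[term] for term in query_terms)
        if score > 0:  # drop records that share no term with the query
            scored.append((score, record))
    # Stable sort: records with equal scores keep their corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record for _, record in scored[:top_k]]
```

A retriever this small needs no index, which is what makes this kind of backend convenient for local development and for debugging retrieval quality on a freshly built corpus.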

The mmore backend retrieves from an MMORE index. It is the intended primary backend, but its index must be built in a separate step before the backend can be used.

For a newly selected project, the safe first workflow is:

Build corpus → use backend simple

The full MMORE workflow is:

Build corpus → export MMORE corpus → build MMORE index → use backend mmore
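
The two workflows can be written down as ordered step lists, which makes it easy to see what is still missing before a given backend is usable. The step labels below mirror the text above; the WORKFLOWS table and the remaining_steps helper are hypothetical and only illustrate the ordering.

```python
# Prerequisite steps for each backend, in the order described above.
WORKFLOWS = {
    "simple": ["build corpus"],
    "mmore": ["build corpus", "export MMORE corpus", "build MMORE index"],
}


def remaining_steps(backend: str, completed: set[str]) -> list[str]:
    """Return the steps still required before `backend` can be used."""
    if backend not in WORKFLOWS:
        raise ValueError(f"unknown backend: {backend!r}")
    return [step for step in WORKFLOWS[backend] if step not in completed]
```

For a project whose corpus has just been built, the simple backend has no remaining steps, while the mmore backend still needs the export and index steps.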