# Architecture overview

DocAsk is organized around a simple idea: all sources are converted into the same internal document format before retrieval.

The initial use case is MMORE, but the core pipeline is designed to remain project-agnostic. Project-specific behavior is isolated in optional project profiles.

## High-level flow

```text
Target project repository
        |
        |  Markdown / RST docs
        |  Python source files
        |  YAML config files
        |  repository tree
        v
DocAsk loaders and extractors
        v
DocumentRecord objects
        v
corpus.jsonl
        v
retrieval backend
        |-------------------------------|
        |                               |
        v                               v
simple retriever                MMORE retriever
(local / dynamic corpus)        (MMORE index)
        |                               |
        |-------------------------------|
        v
retrieved sources
        v
project profile
        |
        |  optional direct answer
        |  optional query expansion / filtering / reranking
        v
RAG prompt construction
        v
LLM or extractive answer
        v
answer with cited sources
```

## Main design choices

DocAsk separates the pipeline into clear blocks:

| Block | Role |
|---|---|
| `loaders/` | Load source files that are already documentation-like. |
| `extractors/` | Extract documentation from source code. |
| `corpus/` | Combine all sources into one corpus. |
| `indexing/` | Export and index the corpus with MMORE. |
| `retrieval/` | Retrieve relevant documents. |
| `project_profiles/` | Hold optional project-specific query expansion, filtering, reranking, and direct answers. |
| `rag/` | Build prompts and generate answers. |
| `projects/` | Manage selected projects, generated project configs, and persisted app state. |
| `app/` | Streamlit user interface. |

## Why keep a DocAsk format?

DocAsk uses its own `DocumentRecord` format instead of exposing MMORE everywhere.

This keeps the project modular:

- the corpus can be inspected before indexing;
- the simple retriever can run without MMORE;
- MMORE can be replaced or updated without rewriting loaders;
- retrieved sources keep consistent metadata for citations;
- Streamlit can work with project-specific corpora before MMORE indexing is available.

## Simple backend vs MMORE backend

The `simple` backend reads a selected `corpus.jsonl` directly. It is useful for:

- local development;
- newly built project corpora;
- debugging retrieval quality;
- avoiding MMORE indexing.

The `mmore` backend retrieves from an MMORE index. It is the target backend, but the index must be built separately.

For a newly selected project, the safe first workflow is:

```text
Build corpus → use backend simple
```

The full MMORE workflow is:

```text
Build corpus → export MMORE corpus → build MMORE index → use backend mmore
```