# Data flow

This page explains how information moves through DocAsk.

## Step 1: project selection

DocAsk starts from a target project repository.

The Streamlit interface lets a user select a local project path. For example, the MMORE repository may be located at:

```text
/path/to/mmore
```

DocAsk then creates a project-specific working folder:

```text
data/projects/<project_name>/
```

For MMORE:

```text
data/projects/mmore/
```

## Step 2: project configuration

DocAsk generates or reads a project configuration file.

For a Streamlit-selected project, this is usually:

```text
data/projects/<project_name>/project_config.yaml
```

For older command-line workflows, the default config remains:

```text
configs/project_config.yaml
```

The project config points to:

```text
docs_path          -> Markdown and RST documentation
code_path          -> Python source code
yaml_config_paths  -> YAML example or production configs
repo_path          -> repository structure
```

## Step 3: normalized records

Each source is converted into a `DocumentRecord`.

A `DocumentRecord` contains:

- a unique `doc_id`;
- textual `content`;
- a `source_type`;
- optional source metadata such as path, title, module, symbol, and signature.

Example source types:

```text
markdown_section
python_module
python_class
python_function
python_method
example_config
production_config
repo_structure
```

## Step 4: corpus file

All records are saved to JSONL.

Default command-line corpus:

```text
data/processed/corpus.jsonl
```

Project-specific corpus:

```text
data/projects/<project_name>/corpus.jsonl
```

Each line is one serialized `DocumentRecord`.

## Step 5: retrieval

There are two retrieval paths.

### Simple retrieval

The simple retriever reads the selected `corpus.jsonl` directly.

It is useful for:

- debugging corpus quality;
- testing queries quickly;
- using newly built project corpora;
- avoiding MMORE indexing during development.

### MMORE retrieval

For MMORE, the corpus is first converted to:

```text
data/projects/<project_name>/mmore_corpus.jsonl
```

or, in the default workflow:

```text
data/processed/mmore_corpus.jsonl
```

Then MMORE indexes it and retrieves from the generated index.

The MMORE backend does not automatically read a newly built `corpus.jsonl`. The index must be rebuilt when the target corpus changes.

## Step 6: project profile

After retrieval, DocAsk can apply a project profile.

A project profile can:

- expand queries with project-specific terms;
- filter low-value or irrelevant results;
- rerank results for known intents;
- answer some structured questions directly without calling the LLM.

The MMORE profile, for example, can answer some Milvus parameter questions deterministically to avoid unrelated configuration fields.

## Step 7: prompt and answer

If no direct answer is returned by the project profile, retrieved sources are formatted into a prompt.

The LLM is instructed to:

- answer only from the provided sources;
- cite every factual statement;
- avoid inventing commands, paths, APIs, or configuration keys;
- say clearly when the available sources are insufficient.

The final output is an answer with cited sources.