# Indexing

DocAsk has its own corpus format, but MMORE expects a different JSONL format.

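The indexing layer bridges the two: it exports DocAsk records to the MMORE format and then builds the MMORE index from the exported file.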

## Relevant files

```text
src/docask/indexing/mmore_format.py
src/docask/indexing/mmore_indexer.py
scripts/export_mmore_corpus.py
scripts/build_index.py
```

## Step 1: export to MMORE format

Default command:

```bash
PYTHONPATH=src python scripts/export_mmore_corpus.py
```

Default input:

```text
data/processed/corpus.jsonl
```

Default output:

```text
data/processed/mmore_corpus.jsonl
```

Project-specific command:

```bash
PYTHONPATH=src python scripts/export_mmore_corpus.py \
  --corpus-path data/projects/mmore/corpus.jsonl \
  --output-path data/projects/mmore/mmore_corpus.jsonl
```

MMORE-compatible records look like:

```json
{
  "text": "...",
  "modalities": [],
  "metadata": {}
}
```

DocAsk adds a short source header inside the text field before indexing. This makes it possible to reconstruct source information after MMORE retrieval.
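As a rough illustration of the header idea, the sketch below wraps a record in the MMORE layout and splits the header off again after retrieval. The input field names (`source`, `title`, `text`) and the bracketed header format are assumptions made for this example; the actual conversion lives in `src/docask/indexing/mmore_format.py` and may differ.

```python
import json


def to_mmore_record(doc: dict) -> dict:
    """Wrap a DocAsk-style record in the MMORE layout shown above."""
    # Hypothetical header format, chosen only for this sketch.
    header = f"[source: {doc['source']} | title: {doc['title']}]"
    return {
        "text": f"{header}\n{doc['text']}",
        "modalities": [],
        "metadata": {},
    }


def split_header(text: str) -> tuple:
    """Split a retrieved text field into (header line, body)."""
    header, _, body = text.partition("\n")
    return header, body


if __name__ == "__main__":
    # Field names here are assumed, not taken from the DocAsk corpus schema.
    doc = {"source": "docs/intro.md", "title": "Intro", "text": "Some passage."}
    record = to_mmore_record(doc)
    print(json.dumps(record))
    print(split_header(record["text"]))
```

Keeping the header on its own first line means it can be recovered with a single `partition` call, even if the retriever returns only the text field.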

## Step 2: build the MMORE index

Default command:

```bash
PYTHONPATH=src python scripts/build_index.py
```

Project-specific command:

```bash
PYTHONPATH=src python scripts/build_index.py \
  --documents-path data/projects/mmore/mmore_corpus.jsonl \
  --collection-name mmore_docs
```

This uses:

```text
configs/mmore_index_config.yaml
```

and stores the index under:

```text
data/indexes/mmore/
```
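The two project-specific commands above can be chained from Python. This is a minimal sketch that only reuses the flags shown in this document; the helper names `mmore_commands` and `run_all` are hypothetical and not part of DocAsk.

```python
import os
import subprocess
import sys


def mmore_commands(project_dir: str, collection_name: str) -> list:
    """Build the export and index commands for one project directory."""
    corpus = f"{project_dir}/corpus.jsonl"
    exported = f"{project_dir}/mmore_corpus.jsonl"
    return [
        [sys.executable, "scripts/export_mmore_corpus.py",
         "--corpus-path", corpus, "--output-path", exported],
        [sys.executable, "scripts/build_index.py",
         "--documents-path", exported, "--collection-name", collection_name],
    ]


def run_all(commands: list) -> None:
    """Run each command in order, stopping at the first failure."""
    # Mirrors the PYTHONPATH=src prefix used in the shell commands.
    env = {**os.environ, "PYTHONPATH": "src"}
    for command in commands:
        subprocess.run(command, check=True, env=env)
```

Calling `run_all(mmore_commands("data/projects/mmore", "mmore_docs"))` from the repository root would run the export first and the index build second, so the index is never built from a missing export.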

## Why keep indexing separate?

The corpus can be built and inspected before MMORE is involved.

This makes debugging easier:

1. build `corpus.jsonl`;
2. preview the records;
3. test simple retrieval;
4. only then export and index with MMORE.
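
The preview step in the list above can be done with a few lines of Python. This sketch assumes only that each non-empty line of `corpus.jsonl` is a JSON object; the helper name `preview_jsonl` is hypothetical.

```python
import json
from itertools import islice
from pathlib import Path


def preview_jsonl(path, limit=3, width=80):
    """Return truncated views of the first `limit` records in a JSONL file."""
    previews = []
    with Path(path).open(encoding="utf-8") as handle:
        # Skip blank lines so a trailing newline does not break parsing.
        lines = (line for line in handle if line.strip())
        for line in islice(lines, limit):
            record = json.loads(line)
            previews.append({key: str(value)[:width] for key, value in record.items()})
    return previews


if __name__ == "__main__":
    corpus = Path("data/processed/corpus.jsonl")
    if corpus.exists():
        for item in preview_jsonl(corpus):
            print(item)
```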

## Important distinction

Building a corpus does not automatically rebuild the MMORE index.

For a newly selected project in Streamlit, building the corpus is enough to use the `simple` backend:

```text
Build corpus → backend simple
```

MMORE retrieval needs the export and index steps as well before the `mmore` backend can be used:

```text
Build corpus → export MMORE corpus → build MMORE index → backend mmore
```
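
Because the index is not rebuilt automatically, it can help to check whether the exported file is out of date before switching to the `mmore` backend. The sketch below compares file modification times; it is an assumption-laden convenience check, not something DocAsk provides, and the helper name `is_stale` is hypothetical.

```python
from pathlib import Path


def is_stale(source, derived) -> bool:
    """Return True if `derived` is missing or older than `source`."""
    source, derived = Path(source), Path(derived)
    if not source.exists():
        raise FileNotFoundError(source)
    if not derived.exists():
        return True
    return derived.stat().st_mtime < source.stat().st_mtime


if __name__ == "__main__":
    corpus = Path("data/processed/corpus.jsonl")
    exported = Path("data/processed/mmore_corpus.jsonl")
    if corpus.exists() and is_stale(corpus, exported):
        print("Corpus is newer than the MMORE export: re-export and rebuild the index.")
```

Modification times are only a heuristic: they say nothing about whether the index under `data/indexes/mmore/` was built from the current export, so a stale result is a prompt to rerun both steps rather than proof of a problem.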