Indexing
DocAsk has its own corpus format, but MMORE expects a different JSONL format.
The indexing layer bridges the two.
Relevant files
src/docask/indexing/mmore_format.py
src/docask/indexing/mmore_indexer.py
scripts/export_mmore_corpus.py
scripts/build_index.py
Step 1: export to MMORE format
Default command:
PYTHONPATH=src python scripts/export_mmore_corpus.py
Default input:
data/processed/corpus.jsonl
Default output:
data/processed/mmore_corpus.jsonl
Project-specific command:
PYTHONPATH=src python scripts/export_mmore_corpus.py \
  --corpus-path data/projects/mmore/corpus.jsonl \
  --output-path data/projects/mmore/mmore_corpus.jsonl
MMORE-compatible records look like:
{
  "text": "...",
  "modalities": [],
  "metadata": {}
}
DocAsk adds a short source header inside the text field before indexing. This makes it possible to reconstruct source information after MMORE retrieval.
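The header idea can be illustrated with a minimal sketch. Only the record shape (`text`, `modalities`, `metadata`) comes from this page; the `[source: ...]` header layout and the function names `to_mmore_record` and `split_source_header` are assumptions made for illustration and may not match what `mmore_format.py` actually does.

```python
import json
from typing import Optional, Tuple

# Hypothetical header layout; DocAsk's real header may look different.
HEADER_PREFIX = "[source: "
HEADER_SUFFIX = "]\n"


def to_mmore_record(text: str, source: str) -> dict:
    """Build an MMORE-style record with the source prepended to the text."""
    return {
        "text": f"{HEADER_PREFIX}{source}{HEADER_SUFFIX}{text}",
        "modalities": [],
        "metadata": {},
    }


def split_source_header(text: str) -> Tuple[Optional[str], str]:
    """Recover (source, body) from retrieved text; source is None without a header."""
    if not text.startswith(HEADER_PREFIX):
        return None, text
    end = text.find(HEADER_SUFFIX)
    if end == -1:
        return None, text
    source = text[len(HEADER_PREFIX):end]
    body = text[end + len(HEADER_SUFFIX):]
    return source, body


def to_jsonl_line(record: dict) -> str:
    """Serialize one record as a single JSONL line (no trailing newline)."""
    return json.dumps(record, ensure_ascii=False)
```

Because the source travels inside `text`, it survives even if the retriever returns only the text of a hit; `split_source_header` then separates it from the body again.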
Step 2: build the MMORE index
Default command:
PYTHONPATH=src python scripts/build_index.py
Project-specific command:
PYTHONPATH=src python scripts/build_index.py \
  --documents-path data/projects/mmore/mmore_corpus.jsonl \
  --collection-name mmore_docs
This uses:
configs/mmore_index_config.yaml
and stores the index under:
data/indexes/mmore/
Why keep indexing separate?
The corpus can be built and inspected before MMORE is involved.
This makes debugging easier:
build corpus.jsonl;
preview the records;
test simple retrieval;
only then export and index with MMORE.
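Previewing the records needs nothing beyond the standard library. The sketch below is one possible way to do it, not a DocAsk utility; the helper name `preview_jsonl` is hypothetical, and it assumes only that the corpus file holds one JSON object per line.

```python
import json
from itertools import islice
from pathlib import Path
from typing import List


def preview_jsonl(path: str, limit: int = 3) -> List[dict]:
    """Return the first `limit` records of a JSONL file, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        non_empty = (line for line in handle if line.strip())
        for line in islice(non_empty, limit):
            records.append(json.loads(line))
    return records
```

For example, `preview_jsonl("data/processed/corpus.jsonl")` would load the first three records of the default corpus so they can be inspected before any export or indexing.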
Important distinction
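Building a corpus does not automatically rebuild the MMORE index.
For a newly selected project in Streamlit, building the corpus is enough to use the simple backend:
Build corpus → backend simple
MMORE retrieval additionally requires the export and index steps: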
Build corpus → export MMORE corpus → build MMORE index → backend mmore
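The export and index steps for one project could be chained as sketched below. The script paths and flags are the ones shown on this page; the helper names `mmore_commands` and `run_pipeline`, and the assumption that every project keeps its files under a directory like `data/projects/mmore/`, are illustrative only.

```python
import os
import subprocess
import sys
from pathlib import PurePosixPath
from typing import List


def mmore_commands(project_dir: str, collection_name: str) -> List[List[str]]:
    """Build the export and index commands for one project directory."""
    project = PurePosixPath(project_dir)
    corpus = str(project / "corpus.jsonl")
    mmore_corpus = str(project / "mmore_corpus.jsonl")
    export_cmd = [
        sys.executable, "scripts/export_mmore_corpus.py",
        "--corpus-path", corpus,
        "--output-path", mmore_corpus,
    ]
    index_cmd = [
        sys.executable, "scripts/build_index.py",
        "--documents-path", mmore_corpus,
        "--collection-name", collection_name,
    ]
    return [export_cmd, index_cmd]


def run_pipeline(project_dir: str, collection_name: str) -> None:
    """Run export, then indexing; stop at the first failing step."""
    env = dict(os.environ, PYTHONPATH="src")  # mirrors PYTHONPATH=src in the commands
    for cmd in mmore_commands(project_dir, collection_name):
        subprocess.run(cmd, check=True, env=env)
```

Calling `run_pipeline("data/projects/mmore", "mmore_docs")` from the repository root would run the two project-specific commands in order. The corpus itself still has to be built first, and the backend still has to be switched to mmore afterwards.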