Indexing
GitHelp has its own corpus format, but MMORE expects a different JSONL format.
The indexing layer bridges the two.
Relevant files
src/githelp/indexing/mmore_format.py
src/githelp/indexing/mmore_indexer.py
scripts/export_mmore_corpus.py
scripts/build_index.py
Step 1: export to MMORE format
Default command:
python scripts/export_mmore_corpus.py
Default input:
data/processed/corpus.jsonl
Default output:
data/processed/mmore_corpus.jsonl
Project-specific command:
python scripts/export_mmore_corpus.py \
--corpus-path data/projects/mmore/corpus.jsonl \
--output-path data/projects/mmore/mmore_corpus.jsonl
MMORE-compatible records look like:
{
"text": "...",
"modalities": [],
"metadata": {}
}
GitHelp adds a short source header inside the text field before indexing. This makes it possible to reconstruct source information after MMORE retrieval.
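The header round trip can be sketched as follows. This is a minimal illustration, not GitHelp's actual code: the `[source: ...]` header layout, the `source` field name, and both function names are assumptions made for the example. Only the output shape (`text`, `modalities`, `metadata`) comes from the record shown above.

```python
import json

# Hypothetical header marker; the real header format may differ.
HEADER_PREFIX = "[source: "


def to_mmore_record(record: dict) -> dict:
    """Wrap a corpus record in the MMORE JSONL shape, prefixing a source header."""
    source = record.get("source", "unknown")  # "source" is an assumed field name
    header = f"{HEADER_PREFIX}{source}]\n"
    return {"text": header + record["text"], "modalities": [], "metadata": {}}


def split_source_header(text: str) -> tuple[str, str]:
    """Return (source, body); source is "" when the text carries no header."""
    first_line, sep, body = text.partition("\n")
    if sep and first_line.startswith(HEADER_PREFIX) and first_line.endswith("]"):
        return first_line[len(HEADER_PREFIX):-1], body
    return "", text


if __name__ == "__main__":
    exported = to_mmore_record({"source": "docs/index.md", "text": "Hello"})
    print(json.dumps(exported))  # one line of the exported JSONL file
    print(split_source_header(exported["text"]))
```

Keeping the source inside `text` means it survives even if the retrieval layer returns only the text of each hit.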
Step 2: build the MMORE index
Default command:
python scripts/build_index.py
Project-specific command:
python scripts/build_index.py \
--documents-path data/projects/mmore/mmore_corpus.jsonl \
--collection-name mmore_docs
This uses:
configs/mmore_index_config.yaml
and stores the index under:
data/indexes/mmore/
If answering with backend mmore fails with KeyError: 'model_name', restart
Streamlit after rebuilding the index. GitHelp can recover from missing Milvus
model metadata by reading model names from configs/mmore_index_config.yaml,
but an already-running Streamlit process may still be using older adapter code.
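The recovery described above might look roughly like this. It is a sketch under stated assumptions: the function name, the shape of the index metadata, and the `embedding.model_name` config layout are all hypothetical, and the real structure of configs/mmore_index_config.yaml may differ.

```python
def resolve_model_name(index_meta: dict, index_config: dict) -> str:
    """Prefer the model name stored with the index; fall back to the config."""
    name = index_meta.get("model_name")
    if name:
        return name
    # Hypothetical config layout: {"embedding": {"model_name": "..."}}
    try:
        return index_config["embedding"]["model_name"]
    except KeyError as exc:
        raise KeyError(
            "model_name missing from both index metadata and config"
        ) from exc
```

A Streamlit process started before this fallback existed would still raise the original KeyError, which is why a restart is needed.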
In local environments where native MMORE/Milvus retrieval crashes, GitHelp runs
native retrieval in an isolated child process. If that process fails, the
mmore backend falls back to the exported mmore_corpus.jsonl so Streamlit can
still answer from the MMORE-formatted corpus.
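The isolate-then-fall-back pattern can be sketched with the standard library alone. The names below (`retrieve_isolated`, `fallback_search`, `answer_context`) are illustrative, the child is assumed to print a JSON list of hits to stdout, and the substring fallback is a stand-in rather than GitHelp's actual fallback logic.

```python
import json
import subprocess
import sys


def retrieve_isolated(child_code: str, query: str, timeout: float = 30.0):
    """Run retrieval in a child interpreter; return parsed hits, or None on failure."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", child_code, query],
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return None
    if proc.returncode != 0:  # a native crash also yields a non-zero return code
        return None
    try:
        return json.loads(proc.stdout)
    except json.JSONDecodeError:
        return None


def fallback_search(jsonl_lines, query: str, limit: int = 3):
    """Naive substring match over exported MMORE records (illustrative only)."""
    needle = query.lower()
    hits = []
    for line in jsonl_lines:
        if not line.strip():
            continue
        text = json.loads(line)["text"]
        if needle in text.lower():
            hits.append(text)
    return hits[:limit]


def answer_context(child_code: str, jsonl_lines, query: str):
    """Try isolated native retrieval first, then the exported corpus."""
    hits = retrieve_isolated(child_code, query)
    if hits is None:
        hits = fallback_search(jsonl_lines, query)
    return hits
```

Because the crash happens in the child, the parent process (here, Streamlit) only observes a failed return code and can continue with the fallback.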
Why keep indexing separate?
The corpus can be built and inspected before MMORE is involved.
This makes debugging easier:
build corpus.jsonl;
preview the records;
test simple retrieval;
only then export and index with MMORE.
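The inspection step above could be as small as the check below. It is a sketch, not a GitHelp utility: the function name is hypothetical, and it assumes each corpus line is a JSON object with a `text` field, which may not match GitHelp's actual corpus format.

```python
import json


def inspect_corpus(lines, text_field: str = "text"):
    """Return (valid_count, problems); problems holds (line_no, reason) pairs."""
    valid, problems = 0, []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # skip blank lines
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            problems.append((line_no, "invalid JSON"))
            continue
        text = record.get(text_field) if isinstance(record, dict) else None
        if not isinstance(text, str) or not text.strip():
            problems.append((line_no, "missing or empty text"))
            continue
        valid += 1
    return valid, problems


if __name__ == "__main__":
    with open("data/processed/corpus.jsonl", encoding="utf-8") as handle:
        count, issues = inspect_corpus(handle)
    print(f"{count} valid records, {len(issues)} problems")
```

Catching malformed records at this stage keeps corpus problems from surfacing later as indexing or retrieval failures.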
Important distinction
Building a corpus does not automatically rebuild the MMORE index.
For a newly selected project in Streamlit:
Build corpus → backend simple
For MMORE retrieval:
Build corpus → export MMORE corpus → build MMORE index → backend mmore
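Since the export and index steps must be run explicitly, a small helper can assemble both commands for a project. The sketch below uses only the flags shown earlier on this page; the helper name is hypothetical, and the `<project>_docs` collection naming is an assumption inferred from the single `mmore_docs` example.

```python
from pathlib import PurePosixPath


def mmore_pipeline_commands(project: str) -> list[list[str]]:
    """Build the export and index commands for one project, in run order."""
    base = PurePosixPath("data/projects") / project
    corpus = str(base / "corpus.jsonl")
    exported = str(base / "mmore_corpus.jsonl")
    return [
        [
            "python", "scripts/export_mmore_corpus.py",
            "--corpus-path", corpus,
            "--output-path", exported,
        ],
        [
            "python", "scripts/build_index.py",
            "--documents-path", exported,
            # Assumed naming convention: "<project>_docs"
            "--collection-name", f"{project}_docs",
        ],
    ]


if __name__ == "__main__":
    for command in mmore_pipeline_commands("mmore"):
        print(" ".join(command))
```

Running each command with `subprocess.run(command, check=True)` would stop the sequence if the export fails, so the index is never built from a stale or missing export.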