Indexing

GitHelp has its own corpus format, but MMORE expects a different JSONL format.

The indexing layer bridges the two.

Relevant files

src/githelp/indexing/mmore_format.py
src/githelp/indexing/mmore_indexer.py
scripts/export_mmore_corpus.py
scripts/build_index.py

Step 1: export to MMORE format

Default command:

python scripts/export_mmore_corpus.py

Default input:

data/processed/corpus.jsonl

Default output:

data/processed/mmore_corpus.jsonl

Project-specific command:

python scripts/export_mmore_corpus.py \
  --corpus-path data/projects/mmore/corpus.jsonl \
  --output-path data/projects/mmore/mmore_corpus.jsonl

MMORE-compatible records look like:

{
  "text": "...",
  "modalities": [],
  "metadata": {}
}

GitHelp adds a short source header inside the text field before indexing. This makes it possible to reconstruct source information after MMORE retrieval.

Step 2: build the MMORE index

Default command:

python scripts/build_index.py

Project-specific command:

python scripts/build_index.py \
  --documents-path data/projects/mmore/mmore_corpus.jsonl \
  --collection-name mmore_docs

This uses:

configs/mmore_index_config.yaml

and stores the index under:

data/indexes/mmore/

If answering with backend mmore fails with KeyError: 'model_name', restart Streamlit after rebuilding the index. GitHelp can recover from missing Milvus model metadata by reading model names from configs/mmore_index_config.yaml, but an already-running Streamlit process may still be using older adapter code.

In local environments where native MMORE/Milvus retrieval crashes, GitHelp runs native retrieval in an isolated child process. If that process fails, the mmore backend falls back to the exported mmore_corpus.jsonl so Streamlit can still answer from the MMORE-formatted corpus.

Why keep indexing separate?

The corpus can be built and inspected before MMORE is involved.

This makes debugging easier:

  1. build corpus.jsonl;

  2. preview the records;

  3. test simple retrieval;

  4. only then export and index with MMORE.

Important distinction

Building a corpus does not automatically rebuild the MMORE index.

For a newly selected project in Streamlit:

Build corpus → backend simple

For MMORE retrieval:

Build corpus → export MMORE corpus → build MMORE index → backend mmore