Retrieval

Retrieval is the step that finds relevant documents for a user question.

DocAsk currently supports two retrieval backends:

simple
mmore

Shared result format

File:

src/docask/retrieval/base.py

The shared result type is:

RetrievalResult

It contains:

  • the retrieved DocumentRecord;

  • a retrieval score.

Simple retriever

File:

src/docask/retrieval/simple_retriever.py

The simple retriever is a local debugging and dynamic-project backend.

It:

  • reads a selected corpus.jsonl;

  • tokenizes the query and documents;

  • ranks documents with overlap-based heuristics;

  • adds boosts for exact symbols, signatures, modules, titles, and user-facing documentation.

It does not use embeddings or MMORE.

Command:

PYTHONPATH=src python scripts/debug_retrieval.py \
  "How do I configure indexing?" \
  --corpus-path data/projects/mmore/corpus.jsonl

MMORE retriever

File:

src/docask/retrieval/mmore_retriever.py

The MMORE retriever:

  • loads a MMORE retriever from config;

  • calls retriever.retrieve(...);

  • parses DocAsk metadata from retrieved text;

  • converts raw MMORE results back into RetrievalResult objects.

Command example:

PYTHONPATH=src python scripts/prepare_answer.py \
  "How do I configure indexing?" \
  --backend mmore

Retriever factory

File:

src/docask/retrieval/retriever_factory.py

This file gives the rest of DocAsk one entry point:

retrieve_documents(...)

It chooses the backend based on:

simple
mmore

Project profiles

Retrieval results can be refined by a project profile.

Project profiles live in:

src/docask/project_profiles/

They can:

  • expand queries;

  • filter irrelevant sources;

  • rerank retrieved results;

  • answer some structured questions directly.

This keeps the core DocAsk retrieval pipeline generic while allowing MMORE-specific improvements to remain isolated.

Backend choice

For a project corpus built from the Streamlit interface, use:

backend simple

unless the corresponding MMORE index has also been rebuilt.

The mmore backend retrieves from the configured MMORE index, not directly from the selected JSONL corpus.