Debugging

This page lists useful checks during development.

Run all tests

PYTHONPATH=src pytest -q

This is the main check before committing changes.

Compile all Python files

python -m compileall src scripts app

This catches syntax errors. It only byte-compiles the files and does not execute them, so imports that fail at runtime are not detected.
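
To check imports directly, one option is to import every module of the package in turn. The sketch below is only an illustration: it assumes the package is named docask (inferred from the grep pattern and the egg-info path elsewhere on this page) and that it is run with PYTHONPATH=src. The helper find_import_failures is hypothetical and not part of the repository.

```python
import importlib
import pkgutil


def find_import_failures(package_name):
    """Import a package and each of its submodules.

    Returns a dict mapping module name to the repr of the exception raised.
    """
    failures = {}
    try:
        package = importlib.import_module(package_name)
    except Exception as exc:
        return {package_name: repr(exc)}

    search_path = getattr(package, "__path__", None)
    if search_path is None:
        return failures  # a plain module has no submodules to walk

    # walk_packages imports subpackages itself in order to descend into them;
    # those errors are ignored here because the loop body records them.
    walker = pkgutil.walk_packages(
        search_path, prefix=package_name + ".", onerror=lambda name: None
    )
    for info in walker:
        try:
            importlib.import_module(info.name)
        except Exception as exc:
            failures[info.name] = repr(exc)
    return failures


if __name__ == "__main__":
    # "docask" is an assumed package name.
    for name, error in find_import_failures("docask").items():
        print(f"{name}: {error}")
```

Importing a module runs its top-level code, so this is best used on packages whose modules have no side effects at import time.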

Test default corpus build

PYTHONPATH=src python scripts/build_corpus.py

This should write:

data/processed/corpus.jsonl
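
Beyond checking that the file exists, it can help to confirm that every line parses. The sketch below assumes each line of the corpus is one JSON object, which is the usual JSONL convention but is not confirmed by this page. check_jsonl is a hypothetical helper.

```python
import json
from pathlib import Path


def check_jsonl(path):
    """Return (record_count, problems) for a JSONL file.

    problems is a list of (line_number, message) tuples.
    """
    problems = []
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((line_number, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                # Assumption: records are JSON objects, not lists or scalars.
                problems.append((line_number, "record is not a JSON object"))
                continue
            count += 1
    return count, problems


if __name__ == "__main__":
    corpus = Path("data/processed/corpus.jsonl")
    if corpus.exists():
        count, problems = check_jsonl(corpus)
        print(f"{count} valid records, {len(problems)} problem lines")
        for line_number, message in problems[:10]:
            print(f"  line {line_number}: {message}")
    else:
        print(f"{corpus} not found; run the build first")
```

The same function can be pointed at data/projects/mmore/corpus.jsonl after a dynamic build.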

Test dynamic corpus build

PYTHONPATH=src python scripts/build_corpus.py \
  --config configs/project_config.yaml \
  --output-path data/projects/mmore/corpus.jsonl

This should write:

data/projects/mmore/corpus.jsonl

Test answering on a project-specific corpus

PYTHONPATH=src python scripts/answer_question.py \
  "Which Milvus parameters are used in the ColPali config?" \
  --llm \
  --backend simple \
  --corpus-path data/projects/mmore/corpus.jsonl \
  --config-path configs/app_config.yaml

For MMORE, some structured questions may be answered directly by the project profile without loading the LLM.

Debug Streamlit state

The app stores local state in:

data/app_state.json

If the interface restores an old project or the wrong corpus path, remove this file:

rm -f data/app_state.json

Then restart Streamlit.
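
Before removing the file, it can be useful to see which stored paths no longer exist. The layout of data/app_state.json is not described on this page, so the sketch below makes no assumption about key names: it walks whatever JSON it finds and reports string values that look like relative or absolute paths (a rough heuristic: they contain a slash and are not URLs) but are missing on disk. Both helpers are hypothetical.

```python
import json
from pathlib import Path


def iter_strings(value, prefix=""):
    """Yield (key_path, string) pairs from nested JSON data."""
    if isinstance(value, dict):
        for key, item in value.items():
            yield from iter_strings(item, f"{prefix}.{key}" if prefix else str(key))
    elif isinstance(value, list):
        for index, item in enumerate(value):
            yield from iter_strings(item, f"{prefix}[{index}]")
    elif isinstance(value, str):
        yield prefix, value


def stale_paths(state, root="."):
    """Return (key_path, value) for path-like strings that do not exist."""
    stale = []
    for key_path, text in iter_strings(state):
        is_url = text.startswith(("http://", "https://"))
        looks_like_path = "/" in text and not is_url  # heuristic only
        if looks_like_path and not (Path(root) / text).exists():
            stale.append((key_path, text))
    return stale


if __name__ == "__main__":
    state_file = Path("data/app_state.json")
    if state_file.exists():
        state = json.loads(state_file.read_text(encoding="utf-8"))
        for key_path, text in stale_paths(state):
            print(f"{key_path}: {text} does not exist")
    else:
        print(f"{state_file} not found")
```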

Clear generated project corpora

rm -rf data/projects/

Then rebuild the project corpus from the Streamlit interface or from the command line.

Check for old imports after refactoring

grep -R "docask.retrieval.answering\|docask.retrieval.prompting\|docask.retrieval.extractive_answerer\|docask.retrieval.mmore_format\|docask.retrieval.mmore_indexer" -n src scripts app docs

This should return nothing after moving files into rag/ and indexing/.

Remove generated cache files

find . -type d -name "__pycache__" -prune -exec rm -rf {} +
rm -rf src/docask.egg-info

Avoid grepping inside the virtual environment

If the virtual environment is inside the repo, a recursive grep from the repo root can return unrelated matches from installed dependencies.

Prefer:

grep -R "test_prompting\|test_retrieval" -n README.md scripts src docs tests

instead of:

grep -R "test_prompting\|test_retrieval" -n .
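
When the list of directories to search is not known in advance, the same idea can be sketched in Python by pruning unwanted directories during the walk. The directory names in SKIP_DIRS and the file suffixes are assumptions about a typical layout, not something this repository defines, and search_tree is a hypothetical helper.

```python
import os
import re

# Assumed names for directories that should never be searched.
SKIP_DIRS = {".git", ".venv", "venv", "env", "__pycache__", "node_modules"}


def search_tree(pattern, root=".", suffixes=(".py", ".md", ".yaml", ".toml")):
    """Return (path, line_number, line) for regex matches under root."""
    regex = re.compile(pattern)
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Editing dirnames in place stops os.walk from descending into them.
        dirnames[:] = sorted(
            d for d in dirnames
            if d not in SKIP_DIRS and not d.endswith(".egg-info")
        )
        for filename in sorted(filenames):
            if not filename.endswith(suffixes):
                continue
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8") as handle:
                    for line_number, line in enumerate(handle, start=1):
                        if regex.search(line):
                            hits.append((path, line_number, line.rstrip("\n")))
            except (OSError, UnicodeDecodeError):
                continue  # skip unreadable or non-text files
    return hits


if __name__ == "__main__":
    for path, line_number, line in search_tree(r"test_prompting|test_retrieval"):
        print(f"{path}:{line_number}:{line}")
```

Note that the pattern uses Python regex syntax, so alternatives are separated by a plain | rather than the \| used in the grep commands above.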

Debug source extraction

Build the full corpus:

PYTHONPATH=src python scripts/build_corpus.py

Preview specific source types:

PYTHONPATH=src python scripts/preview_corpus.py --source-type python_function --limit 3
PYTHONPATH=src python scripts/preview_corpus.py --source-type example_config --limit 3
PYTHONPATH=src python scripts/preview_corpus.py --source-type repo_structure --limit 1

Debug retrieval quality

PYTHONPATH=src python scripts/debug_retrieval.py "How do I configure indexing?"

If results are poor, check:

  • whether the expected source exists in the corpus;

  • whether the source type is correct;

  • whether titles and metadata are informative;

  • whether the selected backend reads the expected corpus or index;

  • whether the query is too vague;

  • whether a project profile should expand, filter, or rerank that intent.
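
The first two checks in the list can be approximated by inspecting the corpus file directly. The sketch below assumes the corpus is JSONL with one object per line and that the type is stored under a source_type field (a guess based on the --source-type flag of preview_corpus.py); adjust type_field if the actual key differs. Both helpers are hypothetical.

```python
import json
from collections import Counter
from pathlib import Path


def summarize_corpus(path, type_field="source_type"):
    """Count records per source type; the field name is an assumption."""
    counts = Counter()
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                record = json.loads(line)
                counts[str(record.get(type_field, "<missing>"))] += 1
    return counts


def find_records(path, needle):
    """Return records whose raw JSON line contains needle (case-insensitive)."""
    needle = needle.lower()
    matches = []
    with Path(path).open(encoding="utf-8") as handle:
        for line in handle:
            if needle in line.lower():
                matches.append(json.loads(line))
    return matches


if __name__ == "__main__":
    corpus = Path("data/processed/corpus.jsonl")
    if corpus.exists():
        for source_type, count in summarize_corpus(corpus).most_common():
            print(f"{count:6d}  {source_type}")
        print(len(find_records(corpus, "indexing")), "records mention 'indexing'")
    else:
        print(f"{corpus} not found; run the build first")
```

Searching the raw line avoids guessing which field holds the title or path, at the cost of also matching text inside the record body.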

Backend mismatch

A common issue is using:

backend mmore

after building a new corpus with Streamlit.

Building a corpus does not update the MMORE index. For a newly built project corpus, use:

backend simple

unless the MMORE export and index have also been rebuilt.