Corpus building

Corpus building is the first main step of DocAsk.

It is orchestrated by:

scripts/build_corpus.py

and implemented mainly in:

src/docask/corpus/builder.py

What build_corpus.py does

build_corpus.py reads a project configuration and builds a unified corpus.

Depending on the project config, it can include:

  • Markdown and reStructuredText documentation;

  • Python files for docstrings and signatures through ast;

  • YAML configuration files;

  • repository structure.

Default command

PYTHONPATH=src python scripts/build_corpus.py

This reads:

configs/project_config.yaml

and writes:

data/processed/corpus.jsonl

Dynamic project command

The script can also build a corpus for a project-specific config:

PYTHONPATH=src python scripts/build_corpus.py \
  --config data/projects/mmore/project_config.yaml \
  --output-path data/projects/mmore/corpus.jsonl

This is the mechanism used by the Streamlit app when a user selects a local project and clicks Build corpus.

Output

Default output:

data/processed/corpus.jsonl

Project-specific output:

data/projects/<project_name>/corpus.jsonl

Source breakdown

A successful run prints a breakdown by source_type, for example:

Built corpus with 545 documents
Breakdown by source_type:
  - markdown_section: 246
  - python_function: 58
  - python_module: 10
  - python_class: 49
  - python_method: 129
  - example_config: 46
  - production_config: 6
  - repo_structure: 1

The exact numbers may change when the indexed repository changes.

Project-specific generated config

The Streamlit project setup flow generates:

data/projects/<project_name>/project_config.yaml

This file contains absolute paths to the selected local project.

Why JSONL?

JSONL is simple and useful here because:

  • each document is independent;

  • it can be inspected line by line;

  • it can be streamed by later indexing pipelines;

  • it is easy to convert to other formats.