Corpus building

Corpus building is the first main step of DocAsk.

It is orchestrated by:

scripts/build_corpus.py

and implemented mainly in:

src/docask/corpus/builder.py

What `build_corpus.py` does

build_corpus.py reads a project configuration and builds a unified corpus.

Depending on the project config, it can include:

Markdown and reStructuredText documentation;
Python files, parsed with ast to extract docstrings and signatures;
YAML configuration files;
repository structure.
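
As an illustration of the Python source type, the sketch below shows one way docstrings and signatures can be pulled out of a file with the standard-library ast module. It is a minimal example, not the code in src/docask/corpus/builder.py: the function name extract_python_docs and the record keys (kind, signature, docstring) are assumptions made for this example.

```python
import ast


def extract_python_docs(source: str) -> list[dict]:
    """Collect a signature and docstring for each function and class.

    Hypothetical helper: the record layout is an assumption, not DocAsk's schema.
    """
    tree = ast.parse(source)
    records = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # ast.unparse (Python 3.9+) renders the arguments node as source text.
            signature = f"{node.name}({ast.unparse(node.args)})"
            kind = "function"
        elif isinstance(node, ast.ClassDef):
            signature = node.name
            kind = "class"
        else:
            continue
        records.append(
            {
                "kind": kind,
                "signature": signature,
                # get_docstring returns None when there is no docstring.
                "docstring": ast.get_docstring(node) or "",
            }
        )
    return records


sample = '''
def add(a, b=1):
    """Return the sum of a and b."""
    return a + b
'''

records = extract_python_docs(sample)
print(records)
```

Because the file is parsed rather than imported, no project code runs during extraction, so files with heavy or missing dependencies can still be indexed.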

Default command

PYTHONPATH=src python scripts/build_corpus.py

This reads:

configs/project_config.yaml

and writes:

data/processed/corpus.jsonl

Dynamic project command

The script can also build a corpus for a project-specific config:

PYTHONPATH=src python scripts/build_corpus.py \
  --config data/projects/mmore/project_config.yaml \
  --output-path data/projects/mmore/corpus.jsonl

This is the mechanism used by the Streamlit app when a user selects a local project and clicks Build corpus.
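
A caller such as a UI button handler could launch the same command with subprocess. The sketch below is one possible way to do that and is not taken from the Streamlit app; build_corpus_command and run_build are hypothetical names, and only the script path and the --config and --output-path flags come from the command shown above.

```python
import os
import subprocess
import sys
from pathlib import Path


def build_corpus_command(project_dir: Path) -> list[str]:
    """Assemble the documented CLI call for one project directory (hypothetical helper)."""
    return [
        sys.executable,
        "scripts/build_corpus.py",
        "--config",
        str(project_dir / "project_config.yaml"),
        "--output-path",
        str(project_dir / "corpus.jsonl"),
    ]


def run_build(project_dir: Path) -> subprocess.CompletedProcess:
    """Run the build from the repository root and capture its output."""
    # PYTHONPATH=src mirrors the shell command; the rest of the environment is inherited.
    env = {**os.environ, "PYTHONPATH": "src"}
    return subprocess.run(
        build_corpus_command(project_dir),
        env=env,
        capture_output=True,
        text=True,
        check=False,  # the caller inspects returncode instead of catching an exception
    )


command = build_corpus_command(Path("data/projects/mmore"))
print(" ".join(command))
```

Passing the arguments as a list avoids shell quoting problems when a project path contains spaces.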

Output

Default output:

data/processed/corpus.jsonl

Project-specific output:

data/projects/<project_name>/corpus.jsonl

Source breakdown

A successful run prints a breakdown by source_type, for example:

Built corpus with 545 documents
Breakdown by source_type:
  - markdown_section: 246
  - python_function: 58
  - python_module: 10
  - python_class: 49
  - python_method: 129
  - example_config: 46
  - production_config: 6
  - repo_structure: 1

The exact numbers may change when the indexed repository changes.
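
A similar breakdown can be recomputed from an existing corpus file. The sketch below assumes that every JSONL record carries a source_type field, which the printed summary suggests but this page does not state; the function name source_type_breakdown and the text field in the sample records are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def source_type_breakdown(lines: Iterable[str]) -> Counter:
    """Count records per source_type (field name assumed, not confirmed)."""
    counts: Counter = Counter()
    for line in lines:
        if line.strip():  # tolerate blank lines
            record = json.loads(line)
            counts[record.get("source_type", "unknown")] += 1
    return counts


# Illustrative records only; a real run would pass an open corpus.jsonl file.
sample_lines = [
    '{"source_type": "markdown_section", "text": "Install"}',
    '{"source_type": "python_function", "text": "def add(a, b)"}',
    '{"source_type": "markdown_section", "text": "Usage"}',
]

breakdown = source_type_breakdown(sample_lines)
print(f"Built corpus with {sum(breakdown.values())} documents")
print("Breakdown by source_type:")
for source_type, count in breakdown.items():
    print(f"  - {source_type}: {count}")
```

Since an open file is an iterable of lines, the same function works with `open("data/processed/corpus.jsonl", encoding="utf-8")` without loading the whole corpus into memory.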

Project-specific generated config

The Streamlit project setup flow generates:

data/projects/<project_name>/project_config.yaml

This file contains absolute paths to the selected local project.

Why JSONL?

JSONL is simple and useful here because:

each document is independent;
it can be inspected line by line;
it can be streamed by later indexing pipelines;
it is easy to convert to other formats.
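
The properties listed above can be seen in a small round trip. This is a minimal sketch using only the standard library; write_jsonl and iter_jsonl are hypothetical helpers, and the record fields are placeholders rather than the actual corpus schema.

```python
import json
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def write_jsonl(path: Path, records: Iterable[dict]) -> None:
    """Write one JSON object per line."""
    with path.open("w", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")


def iter_jsonl(path: Path) -> Iterator[dict]:
    """Yield records one at a time, so the file never has to fit in memory."""
    with path.open(encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)


# Placeholder records; the real corpus fields may differ.
documents = [
    {"id": 1, "source_type": "markdown_section", "text": "Install"},
    {"id": 2, "source_type": "python_function", "text": "def add(a, b)"},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    corpus_path = Path(tmp_dir) / "corpus.jsonl"
    write_jsonl(corpus_path, documents)
    loaded = list(iter_jsonl(corpus_path))

print(loaded)
```

Each line parses on its own, so a damaged record affects only that line, and a later indexing step can consume the generator without reading the full file first.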