Configuration

This section contains auto-generated documentation for configuration modules.

Config Module

Pydantic models for service configuration.

class routir.config.config.ServiceConfig(*, name, engine, config, processor='BatchQueryProcessor', cache=-1, batch_size=32, cache_ttl=600, max_wait_time=0.05, cache_key_fields=<factory>, cache_redis_url=None, cache_redis_kwargs=<factory>, scoring_disabled=False)[source]

Bases: BaseModel

Configuration for a single search or ranking service.

One ServiceConfig entry in the "services" list corresponds to one loaded engine, one search processor, and optionally one scoring processor registered in ProcessorRegistry.

Example JSON:

{
    "name": "my-retriever",
    "engine": "Qwen3",
    "config": {
        "index_path": "/data/qwen3-index",
        "embedding_model_name": "Qwen/Qwen3-Embedding-8B"
    },
    "cache": 4096,
    "cache_ttl": 600,
    "batch_size": 16,
    "max_wait_time": 0.05
}
name

Service identifier used in API requests (the "service" field) and as the key in ProcessorRegistry.

Type:

str

engine

Class name of the Engine subclass to instantiate. Must be importable at startup — either a built-in engine or one loaded via file_imports.

Type:

str

config

Engine-specific parameters passed as config= to the engine constructor. Content varies by engine; see the engine’s __init__ for accepted keys. When the value of the special key "index_path" starts with the hfds:<repo> prefix, the index is downloaded automatically from Hugging Face Datasets.

Type:

dict

processor

Class name of the BatchProcessor subclass for the search role. Default "BatchQueryProcessor" works for most engines. Override only if you need custom batching or request logic.

Type:

str

cache

In-memory LRU cache size for search results (number of entries). -1 (default) disables the cache entirely. Ignored when cache_redis_url is set.

Type:

int

cache_ttl

Cache entry time-to-live in seconds (default 600). Applies to both LRU and Redis caches.

Type:

int
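
The interaction between cache (capacity) and cache_ttl (expiry) can be pictured with a minimal sketch. The class below is hypothetical and is not RoutIR's cache implementation; it assumes that any non-positive capacity disables caching, whereas the text above only states this for -1.

```python
import time
from collections import OrderedDict


class TTLLRUCache:
    """Tiny LRU cache whose entries also expire after `ttl` seconds (illustrative only)."""

    def __init__(self, capacity, ttl, clock=time.monotonic):
        self.capacity = capacity    # plays the role of `cache`
        self.ttl = ttl              # plays the role of `cache_ttl`
        self.clock = clock          # injectable so expiry can be tested without sleeping
        self._data = OrderedDict()  # key -> (expires_at, value), oldest first

    def get(self, key):
        if self.capacity <= 0:      # assumption: non-positive capacity means "off"
            return None
        item = self._data.get(key)
        if item is None:
            return None
        expires_at, value = item
        if self.clock() >= expires_at:
            del self._data[key]     # expired entries are dropped lazily on read
            return None
        self._data.move_to_end(key)  # mark as most recently used
        return value

    def put(self, key, value):
        if self.capacity <= 0:
            return
        self._data[key] = (self.clock() + self.ttl, value)
        self._data.move_to_end(key)
        while len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the least recently used entry
```

With capacity 4096 and ttl 600, as in the example JSON, an entry survives until it is either pushed out by 4096 more recently used entries or becomes 600 seconds old, whichever happens first.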

batch_size

Maximum requests accumulated into one batch before the engine processes them (default 32).

Type:

int

max_wait_time

Maximum seconds to wait for a batch to fill before processing a partial batch (default 0.05 s). Lower values reduce latency; higher values improve GPU utilisation.

Type:

float
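
To make the trade-off between batch_size and max_wait_time concrete, here is a minimal asyncio sketch of the general batching idea. It is an assumption about how such a collector could look, not the code of BatchQueryProcessor; collect_batch and demo are hypothetical names.

```python
import asyncio


async def collect_batch(queue, batch_size=32, max_wait_time=0.05):
    """Wait for one item, then gather more until the batch is full or time runs out."""
    batch = [await queue.get()]
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_time
    while len(batch) < batch_size:
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            batch.append(await asyncio.wait_for(queue.get(), timeout=remaining))
        except asyncio.TimeoutError:
            break  # deadline reached: process a partial batch
    return batch


async def demo():
    queue = asyncio.Queue()
    for i in range(5):
        queue.put_nowait(f"query-{i}")
    # The first batch fills up immediately; the second one times out while partial.
    full = await collect_batch(queue, batch_size=3, max_wait_time=0.05)
    partial = await collect_batch(queue, batch_size=3, max_wait_time=0.05)
    return full, partial
```

In this sketch a request never waits longer than max_wait_time after the first item of its batch arrives, which is why lowering the value trades throughput for latency.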

cache_key_fields

Request fields included in the cache key. Default ["query", "limit"]. Add extra fields (e.g. "subset") whenever they affect the results.

Type:

list[str]
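
The following sketch shows why a field that affects results has to be part of the cache key. The helper make_cache_key is hypothetical, and the choice of JSON plus SHA-256 is an assumption made for illustration; RoutIR may derive its keys differently.

```python
import hashlib
import json


def make_cache_key(request, cache_key_fields=("query", "limit")):
    """Hash only the selected request fields into a stable key."""
    selected = {field: request.get(field) for field in cache_key_fields}
    payload = json.dumps(selected, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


news = {"query": "solar power", "limit": 10, "subset": "news"}
wiki = {"query": "solar power", "limit": 10, "subset": "wiki"}

# With the default fields both requests collide on the same key ...
collides = make_cache_key(news) == make_cache_key(wiki)
# ... and adding "subset" separates them.
fields = ("query", "limit", "subset")
separated = make_cache_key(news, fields) != make_cache_key(wiki, fields)
```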

cache_redis_url

Redis connection URL for distributed caching (e.g. "redis://localhost:6379"). When set, Redis replaces the in-memory LRU cache.

Type:

str, optional

cache_redis_kwargs

Additional keyword arguments forwarded to the Redis client (e.g. {"password": "…", "db": 1}).

Type:

dict, optional

scoring_disabled

When True, the scoring/reranking processor for this service is not registered even if the engine implements score_batch. Useful when you want search-only access to an engine that also supports reranking.

Type:

bool

name: str
engine: str
config: Dict[str, Any]
processor: str
cache: int
batch_size: int
cache_ttl: int
max_wait_time: float
cache_key_fields: List[str]
cache_redis_url: str | None
cache_redis_kwargs: Dict[str, Any] | None
scoring_disabled: bool
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.TextJsonlSource(*, source='text_jsonl', doc_path, id_field='id', content_fields='text', sep='\n', offset_source='offsetfile', cache_dir=None)[source]

Bases: BaseModel

Text-from-JSONL view source: read content fields from a JSONL line.

Concatenates one or more JSON fields with sep to form the text payload.

source

Discriminator literal "text_jsonl".

Type:

Literal['text_jsonl']

doc_path

Path to the JSONL document file, optionally gzip-compressed. Each line must be a JSON object. May be a local path or an hfds:<repo> URL for Hugging Face Datasets.

Type:

str

id_field

JSON key whose value is the document ID (default "id").

Type:

str

content_fields

JSON key(s) whose values are concatenated with sep to form the text payload. Accepts a single string or a list for multi-field concatenation (e.g. ["title", "body"]). Always normalized to a list after validation.

Type:

str or list[str]

sep

Separator inserted between concatenated content fields (default "\n").

Type:

str
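
As a rough illustration of how id_field, content_fields, and sep combine, the sketch below builds the text payload for one JSONL line. extract_text is a hypothetical helper; how RoutIR treats missing or non-string fields is not specified above, so this version simply raises KeyError and stringifies values.

```python
import json


def extract_text(jsonl_line, id_field="id", content_fields="text", sep="\n"):
    """Return (doc_id, text) for one JSONL line."""
    if isinstance(content_fields, str):  # normalize a single key to a list
        content_fields = [content_fields]
    record = json.loads(jsonl_line)
    text = sep.join(str(record[field]) for field in content_fields)
    return str(record[id_field]), text


line = '{"docid": "d1", "title": "Hello", "body": "World"}'
doc_id, text = extract_text(line, id_field="docid", content_fields=["title", "body"])
```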

offset_source

Strategy for random document access:

  • "offsetfile" (default) - builds a byte-offset map (.offsetmap sidecar file) for fast O(1) lookup in a JSONL file.

  • "msmarco_seg" - reads from sharded gzipped files in the MSMARCO v2.1 segmented document format, using embedded byte offsets from the document ID.

Type:

str
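
The "offsetfile" strategy can be sketched as follows: scan the file once, remember the byte offset of every line, and later seek directly to a document. This is a simplified, in-memory illustration for an uncompressed file; the on-disk .offsetmap format is not described here and is not reproduced. All function names are hypothetical.

```python
import json
import os
import tempfile


def build_offset_map(path, id_field="id"):
    """Scan the file once and record the byte offset where each document starts."""
    offsets = {}
    with open(path, "rb") as f:
        while True:
            pos = f.tell()
            line = f.readline()
            if not line:
                break
            if line.strip():
                offsets[str(json.loads(line)[id_field])] = pos
    return offsets


def read_doc(path, offsets, doc_id):
    """Seek straight to the stored offset and parse a single line."""
    with open(path, "rb") as f:
        f.seek(offsets[doc_id])
        return json.loads(f.readline())


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "corpus.jsonl")
        with open(path, "w", encoding="utf-8") as f:
            f.write('{"id": "a", "text": "first"}\n')
            f.write('{"id": "b", "text": "second"}\n')
        offsets = build_offset_map(path)
        return offsets, read_doc(path, offsets, "b")
```

Building the map costs one pass over the corpus; afterwards each lookup is a dictionary access plus one seek, independent of corpus size.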

cache_dir

Directory for the .offsetmap sidecar. When the dataset mount is read-only (common on shared compute filesystems), set this to a writable scratch path. If unset, RoutIR tries to write next to the source file; on PermissionError it falls back to ${XDG_CACHE_HOME:-~/.cache}/routir/offsetmap/. Inside the directory, the sidecar filename is <basename>.<hash16>.offsetmap so multiple corpora sharing one cache dir never collide.

Type:

str, optional
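
The fallback location and sidecar naming described above can be approximated with the sketch below. Both helpers are hypothetical. In particular, the text does not say what <hash16> is computed from; hashing the absolute source path with SHA-256 is purely an assumption made so that the example runs.

```python
import hashlib
import os


def fallback_cache_dir(kind="offsetmap", env=None):
    """Expand ${XDG_CACHE_HOME:-~/.cache}/routir/<kind>/ (unset or empty uses ~/.cache)."""
    env = os.environ if env is None else env
    root = env.get("XDG_CACHE_HOME") or os.path.join(os.path.expanduser("~"), ".cache")
    return os.path.join(root, "routir", kind)


def sidecar_name(source_path, suffix="offsetmap"):
    """Build <basename>.<hash16>.<suffix>; the hash input is an assumption of this sketch."""
    digest = hashlib.sha256(os.path.abspath(source_path).encode("utf-8")).hexdigest()[:16]
    return f"{os.path.basename(source_path)}.{digest}.{suffix}"
```

Because the hash depends on the full path in this sketch, two corpora that share a basename, such as /data/a/corpus.jsonl and /data/b/corpus.jsonl, get different sidecar names inside one shared cache directory.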

source: Literal['text_jsonl']
doc_path: str
id_field: str
content_fields: str | List[str]
sep: str
offset_source: Literal['msmarco_seg', 'offsetfile']
cache_dir: str | None
model_post_init(_TextJsonlSource__context)[source]

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.LocalPathSource(*, source='local_path', path_template=None, path_glob=None, base_dir=None, mime=None)[source]

Bases: BaseModel

Bytes-from-filesystem view source: one file per id (or many via glob).

Two modes:
  • Single file per id — set path_template (e.g. /data/kf/{id}.jpg).

  • Multiple files per id — set path_glob (e.g. /data/kf/{id}_*.jpg). All matching files are returned, sorted by filename.

Exactly one of path_template / path_glob must be set.

source

Discriminator literal "local_path".

Type:

Literal['local_path']

path_template

Format string producing exactly one path per id. Resolved via str.format(id=...).

Type:

str, optional

path_glob

Format string producing a glob expanded against the filesystem. Resolved via str.format(id=...); the result is passed to glob.glob(). Use this when one id maps to many files (multi-frame keyframes, multi-segment audio).

Type:

str, optional

base_dir

Restrict all resolved paths to live under this directory (resolved against os.path.realpath). Resolved paths outside this root are rejected with a clear error so that user-supplied ids cannot escape the configured tree via ../.

Type:

str, optional
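
A minimal sketch of the containment check that base_dir describes, under the assumption that both the root and the candidate are canonicalized with os.path.realpath before comparison. resolve_path and demo are hypothetical names, and the error type RoutIR raises is not stated above; ValueError is used here for illustration.

```python
import os
import tempfile


def resolve_path(doc_id, path_template, base_dir=None):
    """Format the template and refuse results that land outside base_dir."""
    candidate = os.path.realpath(path_template.format(id=doc_id))
    if base_dir is not None:
        root = os.path.realpath(base_dir)
        # commonpath compares whole path components, so /data/kf2 is not "under" /data/kf.
        if os.path.commonpath([root, candidate]) != root:
            raise ValueError(f"resolved path {candidate!r} is outside {root!r}")
    return candidate


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        base = os.path.join(tmp, "kf")
        template = os.path.join(base, "{id}.jpg")
        inside = resolve_path("frame1", template, base_dir=base)
        try:
            resolve_path("../../etc/passwd", template, base_dir=base)
            escaped = True
        except ValueError:
            escaped = False
        return inside, os.path.realpath(base), escaped
```

Comparing path components rather than string prefixes matters: a plain startswith check would wrongly accept a sibling directory whose name merely begins with the base directory's name.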

mime

Content type hint forwarded to engines and clients (e.g. "image/jpeg", "audio/wav"). Per-view, applies to every match.

Type:

str, optional

source: Literal['local_path']
path_template: str | None
path_glob: str | None
base_dir: str | None
mime: str | None
model_post_init(_LocalPathSource__context)[source]

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.GlobMatcher(*, kind='glob', pattern)[source]

Bases: BaseModel

Anchored glob match for tar member names.

pattern contains {id}, which is substituted (re.escape’d) at lookup time. fnmatch metacharacters in the surrounding text (*, ?, and [...] character classes) are translated to a regex via fnmatch.translate(); the resulting regex is anchored at both ends so a partial substring match is impossible.
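
The effect of an anchored glob with a literal id can be sketched as below. Note the simplification: instead of splicing a re.escape'd id into the translated regex as described above, this hypothetical helper escapes the id for fnmatch with glob.escape and translates the whole pattern in one step. The matching behavior for the cases shown is the same, but it is not the library's code path.

```python
import fnmatch
import glob
import re


def compile_glob_matcher(pattern, doc_id):
    """Compile an anchored regex for `pattern` with {id} replaced by a literal id."""
    # glob.escape wraps *, ? and [ in brackets so they match only themselves.
    literal_id = glob.escape(doc_id)
    regex = fnmatch.translate(pattern.replace("{id}", literal_id))
    # translate() anchors the end of the regex; calling .match() anchors the start.
    return re.compile(regex)


matcher = compile_glob_matcher("frames/{id}_*.jpg", "vid*1")
```

Here matcher.match("frames/vid*1_0001.jpg") succeeds, while a name with extra leading or trailing text, or one where the id's "*" would have to act as a wildcard, does not match.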

kind: Literal['glob']
pattern: str
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.RegexMatcher(*, kind='regex', pattern)[source]

Bases: BaseModel

Anchored regex match for tar member names.

pattern contains {id} which is substituted (re.escape’d) at lookup time. The pattern is wrapped with ^...$ so a partial substring match is impossible. Use this when fnmatch isn’t expressive enough (e.g. you need to anchor a digit count or alternate extensions).
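
A minimal sketch of the substitution and anchoring described above. compile_regex_matcher is a hypothetical helper; using str.replace rather than str.format is a choice made here so that regex quantifiers such as \d{3} are not mistaken for format fields.

```python
import re


def compile_regex_matcher(pattern, doc_id):
    """Substitute an escaped id for {id} and anchor the pattern at both ends."""
    body = pattern.replace("{id}", re.escape(doc_id))
    # The non-capturing group keeps top-level alternations inside the anchors.
    return re.compile("^(?:" + body + ")$")


# Exactly three digits and one of two extensions: hard to express with fnmatch.
matcher = compile_regex_matcher(r"{id}_\d{3}\.(?:jpg|png)", "a.b")
```

Because the id is escaped, the "." in "a.b" matches only a literal dot, so a member named "axb_001.jpg" is rejected.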

kind: Literal['regex']
pattern: str
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.ShardManifest(*, kind='manifest', path, id_column='id', shard_column='shard')[source]

Bases: BaseModel

CSV/TSV file mapping id -> shard token.

Loaded once at backend init; cached in a process-wide singleton keyed by (path, id_column, shard_column). Shard column values are str or int — both work with str.format(shard=...) (int allows {shard:06d}).
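
The load-once behavior can be imitated with functools.lru_cache, as in the sketch below. Everything here is hypothetical: choosing the delimiter from the file extension and converting digit-only shard values to int are assumptions made so that both {shard} and {shard:06d} templates work in the example.

```python
import csv
import functools
import os
import tempfile


@functools.lru_cache(maxsize=None)
def load_manifest(path, id_column="id", shard_column="shard"):
    """Read the manifest once per (path, id_column, shard_column) and keep it in memory."""
    delimiter = "\t" if path.endswith(".tsv") else ","  # assumption of this sketch
    mapping = {}
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f, delimiter=delimiter):
            token = row[shard_column]
            # Digit-only tokens become int so templates like {shard:06d} work.
            mapping[row[id_column]] = int(token) if token.isdigit() else token
    return mapping


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "manifest.csv")
        with open(path, "w", newline="", encoding="utf-8") as f:
            f.write("id,shard\nvid1,7\nvid2,alpha\n")
        first = load_manifest(path)
        second = load_manifest(path)  # served from the cache, file is not re-read
        return first, first is second
```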

kind: Literal['manifest']
path: str
id_column: str
shard_column: str
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.ShardModulo(*, kind='modulo', n, width=4)[source]

Bases: BaseModel

Derive shard from a deterministic hash of the id mod n.

Uses sha256(id.encode())[:8] interpreted as a big-endian uint64 — deterministic across processes and Python versions (Python’s built-in hash() is salted per interpreter and would not be safe here).
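
The hashing rule above translates almost directly into Python. modulo_shard is a hypothetical helper; treating width as a zero-padding width for the resulting shard number is an assumption, since the role of width is not spelled out here.

```python
import hashlib


def modulo_shard(doc_id, n, width=4):
    """Stable shard number from the first 8 bytes of sha256(id), zero-padded."""
    digest = hashlib.sha256(doc_id.encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % n  # big-endian uint64 mod n
    return f"{bucket:0{width}d}"  # zero-padding via `width` is an assumption


shard = modulo_shard("doc-42", n=16)
```

Unlike the built-in hash(), this value is the same in every process and on every machine, so a writer that shards the corpus and a reader that looks documents up will always agree on the bucket.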

kind: Literal['modulo']
n: int
width: int
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.ShardSubstring(*, kind='substring', start, end)[source]

Bases: BaseModel

Slice an id literally to derive its shard token.

id[start:end] is the shard token (str). Use this for layouts where a fixed prefix of the id is the bucket key.
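
As a small illustration, the sketch below slices an id and feeds the token into a TarSource-style tar_template. The helper names, ids, and paths are made up for the example.

```python
def substring_shard(doc_id, start, end):
    """The shard token is simply a slice of the id."""
    return doc_id[start:end]


def tar_path(tar_template, shard):
    """Fill the {shard} placeholder, honoring any format spec in the template."""
    return tar_template.format(shard=shard)


token = substring_shard("ab12345", 0, 2)
by_prefix = tar_path("/data/shards/{shard}.tar", token)
by_number = tar_path("/data/shards/{shard:06d}.tar", 42)  # int tokens allow padding specs
```

A substring token is always a str, so it works with a plain {shard} placeholder but not with an integer format spec such as {shard:06d}.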

kind: Literal['substring']
start: int
end: int
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.TarSource(*, source='tar', tar_template, shard_resolver=None, matcher, mime=None, cache_dir=None)[source]

Bases: BaseModel

Bytes-from-tar view source: lookup members in plain (uncompressed) tar shards.

Only plain .tar archives are supported as of PR5b; .tar.gz support requires indexed_gzip and lands in PR6. Random access is via a sidecar .taridx file (member-name -> (offset, size)) built on first open, stamped with (size, mtime, head_sha256) of the tar so stale indexes rebuild automatically. Reads use os.pread so concurrent fetches don’t need a per-shard lock.
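
The index-then-pread idea can be sketched with the standard library alone. This is an in-memory illustration, not the .taridx file format, and it omits the (size, mtime, head_sha256) staleness stamp. The function names are hypothetical, and os.pread is only available on Unix-like systems.

```python
import io
import os
import tarfile
import tempfile


def build_tar_index(tar_path):
    """Map each regular member name to (data offset, size) in an uncompressed tar."""
    index = {}
    with tarfile.open(tar_path, "r:") as tar:  # "r:" means no compression
        for member in tar:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(fd, index, name):
    """Positional read: no shared file offset, so readers can share one descriptor."""
    offset, size = index[name]
    return os.pread(fd, size, offset)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "shard.tar")
        with tarfile.open(path, "w") as tar:
            for name, payload in [("a.jpg", b"first"), ("b.jpg", b"second!")]:
                info = tarfile.TarInfo(name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        index = build_tar_index(path)
        fd = os.open(path, os.O_RDONLY)
        try:
            return index, read_member(fd, index, "b.jpg")
        finally:
            os.close(fd)
```

Since pread takes the offset as an argument instead of moving the descriptor's position, two concurrent fetches from the same shard cannot interfere with each other, which is what removes the need for a lock.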

source

Discriminator literal "tar".

Type:

Literal['tar']

tar_template

str.format(shard=...) template. May reference {shard} with an optional format spec (e.g. {shard:06d}).

Type:

str

shard_resolver

How to derive the shard token from an id. Omit when tar_template references no {shard} placeholder (single-tar collections).

Type:

Union[ShardManifest, ShardModulo, ShardSubstring], optional

matcher

Member-name pattern. Substitutes {id} (re.escape’d) at lookup time, anchored.

Type:

Union[GlobMatcher, RegexMatcher]

mime

Content-type hint forwarded to engines / clients.

Type:

str, optional

cache_dir

Directory for the .taridx sidecar. When the dataset mount is read-only (common on shared compute filesystems), set this to a writable scratch path. If unset, RoutIR tries to write next to the tar; on PermissionError it falls back to ${XDG_CACHE_HOME:-~/.cache}/routir/taridx/. Inside the directory, the sidecar filename is <basename>.<hash16>.taridx so multiple shards sharing one cache dir never collide.

Type:

str, optional

source: Literal['tar']
tar_template: str
shard_resolver: ShardManifest | ShardModulo | ShardSubstring | None
matcher: GlobMatcher | RegexMatcher
mime: str | None
cache_dir: str | None
model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.ViewSpec(*, kind='text', source)[source]

Bases: BaseModel

One named view of a collection.

A view describes how to materialize one modality of a document (e.g. OCR text, ASR transcript, keyframes). Selected at pipeline-call time via the DSL @view suffix on a stage (PR3); collection swap then preserves the pipeline as long as the target collection exposes the same view names.

kind

"text" or "bytes" - declares the payload modality.

Type:

Literal['text', 'bytes']

source

The source/backend configuration for this view. Discriminated on the inner "source" field: "text_jsonl" -> TextJsonlSource (text view); "local_path" -> LocalPathSource (bytes view); "tar" -> TarSource (bytes view).

Type:

routir.config.config.TextJsonlSource | routir.config.config.LocalPathSource | routir.config.config.TarSource

kind: Literal['text', 'bytes']
source: TextJsonlSource | LocalPathSource | TarSource
model_post_init(_ViewSpec__context)[source]

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.CollectionConfig(*, name, processor='ContentProcessor', views=<factory>, default_view=None, id_to_lang_mapping=None, force_load_all_documents=False, cache=256, cache_ttl=600, doc_path=None, id_field=None, content_field=None, offset_source=None, cache_path=None)[source]

Bases: BaseModel

Configuration for a document collection.

Collections expose the /content endpoint and are used by reranking pipeline stages to fetch document text by ID.

A collection is a bundle of one or more views, each addressing a distinct modality or projection of the documents. Most collections have a single text view; multi-view collections support per-stage view selection via the pipeline DSL (PR3).

Example JSON (single-view, modern form):

{
    "name": "my-corpus",
    "views": {
        "text": {
            "kind": "text",
            "source": {
                "source": "text_jsonl",
                "doc_path": "/data/corpus.jsonl",
                "id_field": "docid",
                "content_fields": "text"
            }
        }
    }
}

Example JSON (multi-view):

{
    "name": "my-corpus",
    "default_view": "ocr",
    "views": {
        "ocr": {"kind": "text", "source": {"source": "text_jsonl",
                "doc_path": "/data/corpus.jsonl",
                "id_field": "docid", "content_fields": "ocr"}},
        "asr": {"kind": "text", "source": {"source": "text_jsonl",
                "doc_path": "/data/corpus.jsonl",
                "id_field": "docid", "content_fields": "asr"}}
    }
}

Example JSON (legacy single-view, still accepted):

{
    "name": "my-corpus",
    "doc_path": "/data/corpus.jsonl",
    "id_field": "docid",
    "content_field": "text"
}
name

Collection identifier used in API requests (the "collection" field) and as the key in ProcessorRegistry.

Type:

str

processor

Class name of the content-processor to use. Default "ContentProcessor" dispatches to the view backends. Use "IRDSProcessor" to load from an ir_datasets dataset ID.

Type:

str

views

Named views of this collection. Each entry maps a view name to its ViewSpec (kind + source). If left empty and the legacy doc_path field is set, a single text view named "text" is synthesized from the deprecated doc_path / id_field / content_field / offset_source / cache_path fields.

Type:

dict[str, ViewSpec]
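
The back-compat synthesis can be pictured as a plain dictionary transformation. The sketch below works on raw JSON-style dicts rather than the Pydantic models, and synthesize_views is a hypothetical name. How the legacy cache_path maps onto the newer cache_dir is not stated here, so that field is deliberately left out.

```python
def synthesize_views(collection):
    """Return a views dict, building a single "text" view from legacy fields if needed."""
    if collection.get("views"):
        return collection["views"]
    if not collection.get("doc_path"):
        return {}
    source = {"source": "text_jsonl", "doc_path": collection["doc_path"]}
    # Legacy name -> TextJsonlSource name (cache_path is omitted in this sketch).
    renames = {
        "id_field": "id_field",
        "content_field": "content_fields",
        "offset_source": "offset_source",
    }
    for old, new in renames.items():
        if collection.get(old) is not None:
            source[new] = collection[old]
    return {"text": {"kind": "text", "source": source}}


legacy = {
    "name": "my-corpus",
    "doc_path": "/data/corpus.jsonl",
    "id_field": "docid",
    "content_field": "text",
}
views = synthesize_views(legacy)
```

Applied to the legacy example JSON, this produces the same structure as the single-view modern form shown earlier in this class's documentation.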

default_view

Name of the view used when a request does not specify one. Auto-elected when exactly one view exists; required to be set explicitly when multiple views are defined.

Type:

str, optional
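
The election rule for default_view is small enough to state as code. The function below is a hypothetical sketch of that rule; the exact exceptions and messages RoutIR uses are not documented here.

```python
def elect_default_view(views, default_view=None):
    """Pick the view used when a request names none."""
    if default_view is not None:
        if default_view not in views:
            raise ValueError(f"default_view {default_view!r} is not a defined view")
        return default_view
    if len(views) == 1:
        return next(iter(views))  # a single view is elected automatically
    if len(views) > 1:
        raise ValueError("default_view must be set when multiple views are defined")
    return None


single = elect_default_view({"text": {}})
explicit = elect_default_view({"ocr": {}, "asr": {}}, default_view="ocr")
```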

id_to_lang_mapping

Path to a pickle file mapping document IDs to language codes. Used by processors that serve multilingual corpora.

Type:

str, optional

force_load_all_documents

When True, all documents are loaded into memory at startup for maximum throughput. Only suitable for small corpora; default False uses on-demand offset-based access.

Type:

bool

Deprecated attributes (kept for back-compat synthesis into views; new configs should use views directly):

  • doc_path (str, optional) – Legacy: path to the JSONL document file. Synthesizes a single text view named "text" when views is empty.

  • id_field (str, optional) – Legacy: JSON id field name.

  • content_field (str or list[str], optional) – Legacy: JSON content field(s).

  • offset_source (str, optional) – Legacy: random-access strategy.

  • cache_path (str, optional) – Legacy: path to the .offsetmap sidecar.

name: str
processor: str
views: Dict[str, ViewSpec]
default_view: str | None
id_to_lang_mapping: str | None
force_load_all_documents: bool
cache: int
cache_ttl: int
doc_path: str | None
id_field: str | None
content_field: str | List[str] | None
offset_source: Literal['msmarco_seg', 'offsetfile'] | None
cache_path: str | None
model_post_init(_CollectionConfig__context)[source]

Override this method to perform additional initialization after __init__ and model_construct. This is useful if you want to do some validation that requires the entire model to be initialized.

model_config = {}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

class routir.config.config.Config(*, services=<factory>, collections=<factory>, server_imports=<factory>, file_imports=<factory>, dynamic_pipeline=True, pipeline_aliases=<factory>, pipeline_cache=-1, pipeline_cache_ttl=600, pipeline_cache_redis_url=None, pipeline_cache_redis_kwargs=<factory>, relay_content_cache=-1, relay_content_cache_ttl=600, relay_content_cache_redis_url=None, relay_content_cache_redis_kwargs=<factory>, bytes_content_cache_max_bytes=None)[source]

Bases: BaseModel

Top-level configuration for the RoutIR service.

Passed as a JSON file (or JSON string) to routir <config.json>. Parsed by load_config(), which initialises all collections and services and registers them with ProcessorRegistry.

Example JSON skeleton:

{
    "file_imports": ["./my_engine.py"],
    "collections": [
        {
            "name": "my-corpus",
            "doc_path": "/data/corpus.jsonl"
        }
    ],
    "services": [
        {
            "name": "my-retriever",
            "engine": "MyEngine",
            "config": {"index_path": "/data/index"}
        }
    ],
    "server_imports": ["http://other-host:5000"]
}
services

Search/ranking services to load and register. Each entry loads one engine and creates its processors.

Type:

list[ServiceConfig]

collections

Document collections to register as content services. Required for reranking pipeline stages.

Type:

list[CollectionConfig]

server_imports

URLs of remote RoutIR servers whose services are proxied locally via Relay. Discovered automatically from each server’s /avail endpoint at startup.

Type:

list[str | dict[str, Any]]

file_imports

Paths to Python files loaded before any service is initialised. Use this to register custom Engine subclasses.

Type:

list[str]

dynamic_pipeline

When True (default), the /pipeline endpoint accepts arbitrary pipeline DSL strings at request time. Set to False to restrict the server to pre-defined services only.

Type:

bool

pipeline_aliases

Named shortcuts for pipeline DSL fragments. An alias is a single identifier that expands to a (possibly multi-stage) pipeline at request time. Aliases may reference other aliases; cycles raise an error at startup. Each alias name must not collide with any service or collection name.

Example:

{
    "ragtime2":    "{zho%100, rus%100, spa%100, eng%100}ScoreFusion",
    "ragtime2-rr": "ragtime2%100 >> nemotron%20"
}

With these aliases, ragtime2%20 expands to {zho%100, rus%100, spa%100, eng%100}ScoreFusion%20 (the call-site limit is applied to the last stage of the alias body).

Type:

dict[str, str]
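
The startup cycle check can be sketched as a depth-first search over alias references. This is an illustration only: the identifier regex below is an assumed approximation of the DSL's naming rules, and find_alias_cycle is a hypothetical helper rather than RoutIR's parser.

```python
import re

# Assumed identifier grammar for this sketch; the real DSL tokenizer may differ.
_IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def find_alias_cycle(aliases):
    """Return one cycle as a list of alias names, or None if the aliases are acyclic."""
    graph = {
        name: [tok for tok in _IDENT.findall(body) if tok in aliases]
        for name, body in aliases.items()
    }
    state = {}  # name -> "visiting" while on the DFS stack, "done" afterwards

    def visit(name, path):
        if state.get(name) == "done":
            return None
        if state.get(name) == "visiting":
            return path[path.index(name):] + [name]  # back edge closes a cycle
        state[name] = "visiting"
        for ref in graph[name]:
            cycle = visit(ref, path + [name])
            if cycle:
                return cycle
        state[name] = "done"
        return None

    for name in aliases:
        cycle = visit(name, [])
        if cycle:
            return cycle
    return None


ok = {
    "ragtime2": "{zho%100, rus%100, spa%100, eng%100}ScoreFusion",
    "ragtime2-rr": "ragtime2%100 >> nemotron%20",
}
bad = {"a": "b%10 >> x%5", "b": "a%10"}
```

In the acyclic example, ragtime2-rr refers to ragtime2, which refers only to non-alias names, so no cycle is reported; in the second mapping the two aliases refer to each other.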

pipeline_cache

Capacity for the pipeline-level result cache (number of entries). -1 (default) or 0 disables it. When enabled, the /pipeline endpoint reuses the final response for identical requests, keyed on the canonical AST of the pipeline (after alias expansion), the query, the collection, and the subset of runtime_kwargs that refer to aliases actually used. Ignored when pipeline_cache_redis_url is set.

Type:

int

pipeline_cache_ttl

Pipeline-cache entry TTL in seconds (default 600).

Type:

int

pipeline_cache_redis_url

Redis URL for the pipeline cache. When set, Redis replaces the in-memory LRU cache.

Type:

str, optional

pipeline_cache_redis_kwargs

Extra kwargs forwarded to the Redis client for the pipeline cache.

Type:

dict, optional

relay_content_cache

Capacity for the local cache that fronts remote content services registered via server_imports. -1 (default) or 0 disables it. Without this cache, reranking against a remote collection re-fetches every document text on every query (the per-run doc_content_cache on SearchPipeline only dedupes within a single request).

Type:

int

relay_content_cache_ttl

TTL in seconds for the relay-content cache (default 600).

Type:

int

relay_content_cache_redis_url

Redis URL for the relay-content cache.

Type:

str, optional

relay_content_cache_redis_kwargs

Extra kwargs forwarded to the Redis client for the relay-content cache.

Type:

dict, optional

bytes_content_cache_max_bytes

Per-request cap on the size of the bytes-view payloads cached inside one SearchPipeline run (doc_content_cache). When set, the cache evicts in insertion order (FIFO) once the inserted bytes exceed the cap. Only bytes-view payloads count toward the cap; text-view entries stay free. None (default) disables eviction.

Type:

int, optional
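
A minimal sketch of the FIFO byte cap described above. DocContentCache is a hypothetical class, not the doc_content_cache on SearchPipeline, and it assumes that text payloads are never evicted, which is one reading of "stay free".

```python
from collections import OrderedDict


class DocContentCache:
    """Per-request cache; bytes payloads are evicted FIFO once they exceed max_bytes."""

    def __init__(self, max_bytes=None):
        self.max_bytes = max_bytes
        self._text = {}               # text payloads never count toward the cap
        self._bytes = OrderedDict()   # insertion order doubles as eviction order
        self._used = 0

    def put(self, key, payload):
        if isinstance(payload, str):
            self._text[key] = payload
            return
        if key in self._bytes:        # replacing an entry frees its old size first
            self._used -= len(self._bytes.pop(key))
        self._bytes[key] = payload
        self._used += len(payload)
        while self.max_bytes is not None and self._used > self.max_bytes and self._bytes:
            _, evicted = self._bytes.popitem(last=False)  # oldest insertion goes first
            self._used -= len(evicted)

    def get(self, key):
        if key in self._text:
            return self._text[key]
        return self._bytes.get(key)
```

FIFO is cheaper than LRU here because a read never has to reorder entries; within a single pipeline run that is usually an acceptable trade.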

model_config = {'extra': 'forbid'}

Configuration for the model; should be a dictionary conforming to pydantic.config.ConfigDict.

services: List[ServiceConfig]
collections: List[CollectionConfig]
server_imports: List[str | Dict[str, Any]]
file_imports: List[str]
dynamic_pipeline: bool
pipeline_aliases: Dict[str, str]
pipeline_cache: int
pipeline_cache_ttl: int
pipeline_cache_redis_url: str | None
pipeline_cache_redis_kwargs: Dict[str, Any]
relay_content_cache: int
relay_content_cache_ttl: int
relay_content_cache_redis_url: str | None
relay_content_cache_redis_kwargs: Dict[str, Any]
bytes_content_cache_max_bytes: int | None

Config Loader

async routir.config.load.auto_add_relay_services(servers, content_cache_settings=None)[source]

Discover and register services from remote RoutIR servers as local proxies.

Queries each server’s /avail endpoint (via AsyncClient) to list its available services, then creates Relay-backed processors for every service not already registered locally. This lets the local server transparently forward requests to the remote server.

"search", "score", and "content" service types are proxied. Services already registered locally take precedence (remote services with the same name are skipped).

Parameters:
  • servers (str | List[str | Dict[str, str]]) – Either a single server entry or a list of entries. Each entry may be a string (REST base URL, e.g. "http://host:5000") or a dict with at least "endpoint" and optionally "grpc_endpoint", "api_key", etc. These extra fields are forwarded into the created Relay config (and into RelayContentProcessor for content) so the data plane can use gRPC even though discovery always uses REST.

  • content_cache_settings (Dict[str, Any] | None) – Optional dict with keys cache_size, cache_ttl, redis_url, redis_kwargs applied to every RelayContentProcessor registered from this call. Sourced from the relay_content_cache* fields on Config.
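
The accepted shapes of servers can be normalized as in the sketch below. normalize_servers is a hypothetical helper; accepting a bare dict in addition to a bare string follows the "single server entry" wording above and is an assumption, as is raising ValueError for a dict without "endpoint".

```python
def normalize_servers(servers):
    """Turn a str, dict, or mixed list into a list of dicts that each have "endpoint"."""
    if isinstance(servers, (str, dict)):
        servers = [servers]
    normalized = []
    for entry in servers:
        if isinstance(entry, str):
            entry = {"endpoint": entry}  # a plain string is the REST base URL
        elif "endpoint" not in entry:
            raise ValueError(f"server entry is missing 'endpoint': {entry!r}")
        normalized.append(dict(entry))   # copy so callers' dicts are not mutated
    return normalized


entries = normalize_servers([
    "http://host:5000",
    {"endpoint": "http://other:5000", "grpc_endpoint": "other:50051"},
])
```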

routir.config.load.load_index_from_hfds(repo_id)[source]

Download an index from Hugging Face Datasets.

Parameters:

repo_id (str) – Repository ID (with optional "hfds:" prefix)

Returns:

Path to the downloaded index directory

async routir.config.load.load_config(config, *, tar_gz_decompress_cache=None)[source]

Parse the service configuration and register all collections and services.

This is the main initialization entry point called by the server at startup. It performs the following steps in order:

  1. Parse the JSON config (file path or raw string) into a Config model.

  2. Load any Python files listed in file_imports (custom engine classes).

  3. For each collection, create and register a content processor.

  4. For each service, instantiate the engine and register search (and optionally score) processors. Index paths with the hfds: prefix are downloaded from Hugging Face Datasets first.

  5. Discover and proxy services from remote servers listed in server_imports.

Parameters:
  • config (str) – Either a file path to a JSON config file or a raw JSON string. File paths are read and parsed automatically.

  • tar_gz_decompress_cache (str, optional) – Directory into which any .tar.gz tar_template references should be decompressed once at startup. When None (the default), .tar.gz references raise at startup — native gzip random access (via indexed_gzip) is not yet wired into the runtime read path.
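
The "file path or raw JSON string" behavior of the config parameter can be sketched as follows. read_config_json is a hypothetical helper that only covers reading the JSON (step 1 above, before model validation); how load_config actually distinguishes the two cases is an assumption.

```python
import json
import os
import tempfile


def read_config_json(config):
    """Treat `config` as a file path when such a file exists, otherwise as raw JSON."""
    if os.path.isfile(config):
        with open(config, encoding="utf-8") as f:
            return json.load(f)
    return json.loads(config)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "config.json")
        with open(path, "w", encoding="utf-8") as f:
            json.dump({"services": [], "collections": []}, f)
        from_file = read_config_json(path)
    from_string = read_config_json('{"services": [], "collections": []}')
    return from_file, from_string
```

Both calls yield the same dictionary, which would then be validated into a Config model.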

Note

This function modifies the global ProcessorRegistry singleton in place. It is not safe to call concurrently.