rtb-app regex_util::compile_bounded

Status: APPROVED — closes an RTB documented-but-unimplemented security contract surfaced by the GTB v0.22.0 parity audit (Phase 2, §B). The API shape is already mandated by CLAUDE.md (§Regex Compilation) and by a live reference in crates/rtb-cli-bin/src/validate.rs:69; this spec only records the contract and its tests so it can be implemented TDD. No design decisions were open — resolutions in §6 are folded from the existing standard.

Honest scoping note: there is no current call site that compiles an externally-sourced pattern (every regex in the tree today is a build-time literal — validate.rs, rtb-redact — which correctly uses Regex::new directly). This helper is therefore preventive contract completion: it must exist before the first external-pattern consumer is written, so that consumer has a compliant compiler to call. It is not patching an active hole.

1. Motivation

CLAUDE.md §Regex Compilation states that any regex::Regex::new / RegexBuilder call whose pattern originates outside the binary (config file, CLI flag, TUI input, HTTP payload, message queue) must route through rtb_app::regex_util::compile_bounded. The helper does not exist. Until it does, the standard cannot be honoured and the first external-pattern feature would have to invent its own bound (or, worse, omit one). GTB's analogue is pkg/regexutil (length cap + compile timeout).

2. Surface

New module crates/rtb-app/src/regex_util.rs, pub mod regex_util; in lib.rs, re-exported from the prelude.

/// Maximum byte length of an externally-sourced pattern.
pub const MAX_PATTERN_LEN: usize = 1024;        // 1 KiB
/// `RegexBuilder::size_limit` — compiled-program memory cap.
pub const SIZE_LIMIT: usize = 1 << 20;          // 1 MiB
/// `RegexBuilder::dfa_size_limit` — lazy-DFA cache cap.
pub const DFA_SIZE_LIMIT: usize = 8 << 20;      // 8 MiB

/// Compile an untrusted pattern under fixed memory bounds.
pub fn compile_bounded(pattern: &str) -> Result<regex::Regex, RegexCompileError>;

No timeout argument: Rust's regex crate is automata-based (Thompson NFA) with linear-time matching, so there is no catastrophic-backtracking class to defend against, unlike backtracking engines. The caps bound memory (the compiled program and the lazy-DFA cache), the only remaining DoS vector.

3. Error type

#[derive(Debug, thiserror::Error, miette::Diagnostic)]
pub enum RegexCompileError {
    #[error("regex pattern is {len} bytes; limit is {MAX_PATTERN_LEN} bytes")]
    #[diagnostic(code(rtb_app::regex::too_long))]
    TooLong { len: usize },

    #[error("regex failed to compile within memory bounds")]
    #[diagnostic(code(rtb_app::regex::compile), help("simplify the pattern or reduce repetition counts"))]
    Compile(#[source] regex::Error),
}

thiserror + miette::Diagnostic per engineering-standards §error handling (no anyhow in framework crates).

4. Mechanism

  1. Reject pattern.len() > MAX_PATTERN_LEN → TooLong (checked before handing anything to the regex engine: cheap guard first).
  2. RegexBuilder::new(pattern).size_limit(SIZE_LIMIT).dfa_size_limit(DFA_SIZE_LIMIT).build(), mapping regex::Error → Compile.
  3. Return the compiled Regex.

Add a regex dependency to crates/rtb-app/Cargo.toml (workspace already transitively depends on it; pin via [workspace.dependencies]).

5. Non-goals / boundaries

  • Literal, build-time patterns stay on Regex::new (optionally once_cell::Lazy<Regex>), exactly as CLAUDE.md says — validate.rs and rtb-redact are correct as-is and must not be migrated.
  • No match-time timeout (linear engine — unnecessary).
  • Not a general regex-cache; callers own their compiled Regex.

6. Resolutions (folded — no open questions)

  • [R-1] Host crate → rtb-app (CLAUDE.md mandates the exact path rtb_app::regex_util; validate.rs:69 already points there).
  • [R-2] Limits → the CLAUDE.md values (1 KiB len, 1 MiB size, 8 MiB dfa). Exposed as pub const so call sites and tests share them.
  • [R-3] No timeout (linear-time engine — documented rationale).
  • [R-4] Length cap is on bytes (str::len), matching "1 KiB".
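
[R-4] only makes a visible difference for non-ASCII patterns, where byte length and character count diverge. The sketch below illustrates this with the standard library alone; over_cap and byte_and_char_counts are hypothetical helpers introduced for the illustration, not part of the specified surface.

```rust
const MAX_PATTERN_LEN: usize = 1024; // 1 KiB, as in section 2

/// Hypothetical mirror of the length guard: true when the pattern is over the cap.
fn over_cap(pattern: &str) -> bool {
    // str::len returns the UTF-8 byte length, not the number of chars.
    pattern.len() > MAX_PATTERN_LEN
}

/// Returns (byte length, char count) so the two measures can be compared.
fn byte_and_char_counts(pattern: &str) -> (usize, usize) {
    (pattern.len(), pattern.chars().count())
}
```

For example, "\u{e9}".repeat(600) is 600 characters but 1200 bytes, so it is rejected, while 1024 ASCII characters sit exactly at the cap and pass.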

7. Testing (TDD, ≥90% per CLAUDE.md)

  • A pattern of MAX_PATTERN_LEN + 1 bytes → TooLong { len }.
  • A pattern at exactly MAX_PATTERN_LEN is allowed through to compilation.
  • A valid small pattern compiles and the returned Regex matches/does-not-match correctly.
  • A pattern engineered to exceed size_limit (e.g. a large bounded repetition a{1000}{1000}…) → Compile, not a panic, and does not blow memory (the cap fires).
  • An invalid pattern ("(") → Compile.
  • MAX_PATTERN_LEN/SIZE_LIMIT/DFA_SIZE_LIMIT are the documented values (guards against silent drift from CLAUDE.md).

8. Out of scope

  • Migrating existing literal-pattern call sites (they are compliant).
  • A match-time deadline or per-call configurable limits (fixed bounds only).