rtb-app regex_util::compile_bounded
Status: APPROVED — closes an RTB documented-but-unimplemented security
contract surfaced by the GTB v0.22.0 parity audit (Phase 2, §B). The API
shape is already mandated by CLAUDE.md (§Regex Compilation) and by a
live reference in crates/rtb-cli-bin/src/validate.rs:69; this spec only
records the contract and its tests so it can be implemented TDD. No design
decisions were open — resolutions in §6 are folded from the existing
standard.
Honest scoping note: there is no current call site that compiles an
externally-sourced pattern. Every regex in the tree today is a build-time
literal (validate.rs, rtb-redact), and those sites correctly use
Regex::new directly. This helper is therefore preventive contract
completion: it must exist before the first external-pattern consumer is
written, so that consumer has a compliant compiler to call. It is not
patching an active hole.
1. Motivation
CLAUDE.md §Regex Compilation states that any regex::Regex::new /
RegexBuilder call whose pattern originates outside the binary (config
file, CLI flag, TUI input, HTTP payload, message queue) must route
through rtb_app::regex_util::compile_bounded. The helper does not exist.
Until it does, the standard cannot be honoured and the first external-
pattern feature would have to invent its own bound (or, worse, omit one).
GTB's analogue is pkg/regexutil (length cap + compile timeout).
2. Surface
New module crates/rtb-app/src/regex_util.rs, declared as pub mod regex_util; in
lib.rs and re-exported from the prelude.
/// Maximum byte length of an externally-sourced pattern.
pub const MAX_PATTERN_LEN: usize = 1024; // 1 KiB
/// `RegexBuilder::size_limit` — compiled-program memory cap.
pub const SIZE_LIMIT: usize = 1 << 20; // 1 MiB
/// `RegexBuilder::dfa_size_limit` — lazy-DFA cache cap.
pub const DFA_SIZE_LIMIT: usize = 8 << 20; // 8 MiB
/// Compile an untrusted pattern under fixed memory bounds.
pub fn compile_bounded(pattern: &str) -> Result<regex::Regex, RegexCompileError>;
No timeout argument: Rust's regex crate matches with finite automata and
guarantees worst-case match time linear in the length of the input, so there
is no catastrophic-backtracking class to defend against. The caps bound
memory instead (size_limit for the compiled program, dfa_size_limit for the
lazy-DFA cache), which is the only remaining DoS vector.
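The three limits are plain powers of two, so a drift check can be written with the standard library alone. The following is a minimal sketch; `KIB`, `MIB` and `limits_match_documented` are hypothetical names introduced for illustration and are not part of the proposed surface.

```rust
// Values copied from the surface definition in this section.
pub const MAX_PATTERN_LEN: usize = 1024; // 1 KiB
pub const SIZE_LIMIT: usize = 1 << 20; // 1 MiB
pub const DFA_SIZE_LIMIT: usize = 8 << 20; // 8 MiB

// Hypothetical unit helpers, used only to spell the documented sizes.
pub const KIB: usize = 1024;
pub const MIB: usize = 1024 * KIB;

/// Hypothetical drift check: true when each shift expression equals the
/// size stated in its comment.
pub fn limits_match_documented() -> bool {
    MAX_PATTERN_LEN == KIB && SIZE_LIMIT == MIB && DFA_SIZE_LIMIT == 8 * MIB
}
```

A check of this shape could back the "documented values" test case listed in §7.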
3. Error type
#[derive(Debug, thiserror::Error, miette::Diagnostic)]
pub enum RegexCompileError {
    #[error("regex pattern is {len} bytes; limit is {MAX_PATTERN_LEN} bytes")]
    #[diagnostic(code(rtb_app::regex::too_long))]
    TooLong { len: usize },
    #[error("regex failed to compile within memory bounds")]
    #[diagnostic(
        code(rtb_app::regex::compile),
        help("simplify the pattern or reduce repetition counts")
    )]
    Compile(#[source] regex::Error),
}
thiserror + miette::Diagnostic per engineering-standards §error
handling (no anyhow in framework crates).
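To make the intended messages concrete, here is a std-only sketch of roughly what the derives are expected to produce for `Display` and `source()`. It is not the proposed implementation: `RegexCompileErrorSketch` is a hypothetical name, the miette diagnostic codes are omitted, and `Compile` holds a boxed error only because `regex::Error` is unavailable without the crate.

```rust
use std::error::Error;
use std::fmt;

pub const MAX_PATTERN_LEN: usize = 1024;

/// Hypothetical std-only stand-in for the spec's `RegexCompileError`.
#[derive(Debug)]
pub enum RegexCompileErrorSketch {
    TooLong { len: usize },
    // Assumption: a boxed error replaces `regex::Error` in this sketch.
    Compile(Box<dyn Error + Send + Sync>),
}

impl fmt::Display for RegexCompileErrorSketch {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            Self::TooLong { len } => write!(
                f,
                "regex pattern is {} bytes; limit is {} bytes",
                len, MAX_PATTERN_LEN
            ),
            Self::Compile(_) => {
                write!(f, "regex failed to compile within memory bounds")
            }
        }
    }
}

impl Error for RegexCompileErrorSketch {
    /// Only `Compile` carries an underlying cause.
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        match self {
            Self::TooLong { .. } => None,
            Self::Compile(inner) => {
                let src: &(dyn Error + 'static) = &**inner;
                Some(src)
            }
        }
    }
}
```

Under these assumptions, `TooLong` reports both the offending length and the limit, while `Compile` keeps the engine's error reachable through the source chain.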
4. Mechanism
- Reject pattern.len() > MAX_PATTERN_LEN → TooLong (checked before handing anything to the regex engine; cheap guard first).
- RegexBuilder::new(pattern).size_limit(SIZE_LIMIT).dfa_size_limit(DFA_SIZE_LIMIT).build(), mapping regex::Error → Compile.
- Return the compiled Regex.
Add a regex dependency to crates/rtb-app/Cargo.toml (workspace already
transitively depends on it; pin via [workspace.dependencies]).
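The ordering of the steps in this section (length guard first, engine second, error mapping last) can be sketched without the regex crate. In the sketch below, `compile_bounded_with` and `GuardError` are hypothetical names, and the `engine` closure stands in for the RegexBuilder chain. The real helper would call the builder directly rather than accept a closure.

```rust
pub const MAX_PATTERN_LEN: usize = 1024;

/// Hypothetical generic mirror of the spec's error enum.
#[derive(Debug, PartialEq)]
pub enum GuardError<E> {
    TooLong { len: usize },
    Compile(E),
}

/// Hypothetical std-only sketch of the mechanism. `engine` plays the role of
/// the bounded builder call and is never invoked for over-long patterns.
pub fn compile_bounded_with<T, E>(
    pattern: &str,
    engine: impl FnOnce(&str) -> Result<T, E>,
) -> Result<T, GuardError<E>> {
    // Step 1: cheap byte-length guard before the engine sees the pattern.
    let len = pattern.len();
    if len > MAX_PATTERN_LEN {
        return Err(GuardError::TooLong { len });
    }
    // Steps 2 and 3: run the engine, map its error, return the result.
    engine(pattern).map_err(GuardError::Compile)
}
```

Passing the engine as a closure is only a device to keep the sketch dependency-free. It also makes the "guard fires before the engine" property easy to observe.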
5. Non-goals / boundaries
- Literal, build-time patterns stay on Regex::new (optionally once_cell::Lazy<Regex>), exactly as CLAUDE.md says; validate.rs and rtb-redact are correct as-is and must not be migrated.
- No match-time timeout (the engine is linear-time, so it is unnecessary).
- Not a general regex-cache; callers own their compiled Regex.
6. Resolutions (folded, no open questions)
- [R-1] Host crate → rtb-app (CLAUDE.md mandates the exact path rtb_app::regex_util; validate.rs:69 already points there).
- [R-2] Limits → the CLAUDE.md values (1 KiB len, 1 MiB size, 8 MiB dfa). Exposed as pub const so call sites and tests share them.
- [R-3] No timeout (linear-time engine; rationale documented in §2).
- [R-4] Length cap is on bytes (str::len), matching "1 KiB".
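R-4 matters for non-ASCII patterns, because str::len counts UTF-8 bytes rather than characters. A minimal sketch of the guard in isolation follows; `exceeds_byte_cap` is a hypothetical helper name.

```rust
pub const MAX_PATTERN_LEN: usize = 1024;

/// Hypothetical helper: the byte-length guard on its own.
/// `str::len` returns the UTF-8 byte count, not the character count.
pub fn exceeds_byte_cap(pattern: &str) -> bool {
    pattern.len() > MAX_PATTERN_LEN
}
```

For example, a pattern of 513 copies of "é" (two bytes each in UTF-8) is 1026 bytes long and would be rejected, even though it contains fewer than 1024 characters.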
7. Testing (TDD, ≥90% per CLAUDE.md)
- A pattern of MAX_PATTERN_LEN + 1 bytes → TooLong { len }.
- A pattern at exactly MAX_PATTERN_LEN is allowed through to compilation.
- A valid small pattern compiles and the returned Regex matches/does-not-match correctly.
- A pattern engineered to exceed size_limit (e.g. a large bounded repetition a{1000}{1000}…) → Compile, not a panic, and does not blow memory (the cap fires).
- An invalid pattern ("(") → Compile.
- MAX_PATTERN_LEN / SIZE_LIMIT / DFA_SIZE_LIMIT are the documented values (guards against silent drift from CLAUDE.md).
8. Out of scope
- Migrating existing literal-pattern call sites (they are compliant).
- A match-time deadline or per-call configurable limits (fixed bounds only).