Custom Analysis Rules

Define organization-specific policies in YAML. Custom rules produce structured signals with rule IDs, source spans, and matching evidence — same shape as built-ins.

Rules are pattern-matching only — no code execution, no external calls. The engine validates each rule's shape and predicate at load time; the generated JSON Schema drives offline IDE autocomplete against /schemas/v1/custom_rules.schema.json.

Basic Example

# company_rules.yaml
# yaml-language-server: $schema=https://lexega.com/schemas/v1/custom_rules.schema.json

rules:
  - id: ACME-STG-ENC
    risk_level: critical
    message: "Stage created without encryption."
    triggers:
      all_of:
        - kind: { in: [create_stage, alter_stage] }
        - ddl.stage.encryption_disabled: true

  - id: ACME-MASK-DROP
    risk_level: high
    message: "Masking policy {policy.target.name.canonical} removed."
    triggers:
      kind: drop_masking_policy

The first rule fires on CREATE STAGE or ALTER STAGE statements that turn encryption off. The second fires on any DROP MASKING POLICY, and its message interpolates a field from the matched statement's facts.

Running Custom Rules

# Analyze with custom rules
lexega-sql analyze query.sql --custom-rules company_rules.yaml

# Combine with catalog
lexega-sql analyze query.sql --custom-rules rules.yaml --catalog prod.json

# Multiple rule files
lexega-sql analyze query.sql --custom-rules rules1.yaml --custom-rules rules2.yaml

Cloud Storage URIs

For larger or shared deployments, rules, policies, exceptions, and output artifacts can be read from or written to cloud storage. This enables separation of concerns: DevOps maintains rules, Security maintains policies, artifacts go to shared storage.

Scheme	Provider	CLI Required
`s3://bucket/key`	AWS S3	`aws`
`gs://bucket/key`	Google Cloud Storage	`gsutil`
`az://container/blob`	Azure Blob Storage	`az`
`https://...`	Any HTTPS endpoint (read-only)	`curl`

Reading from cloud storage:

# AWS S3
lexega-sql analyze query.sql --custom-rules s3://mybucket/governance/rules.yaml

# Google Cloud Storage
lexega-sql analyze query.sql --policy gs://mybucket/security/policy.yaml

# Azure Blob Storage
lexega-sql analyze query.sql --exceptions az://governance/exceptions.yaml

# Pre-signed URLs
lexega-sql analyze query.sql --custom-rules "https://mybucket.s3.amazonaws.com/rules.yaml?sig=..."

Writing to cloud storage:

# Write decision artifact to S3
lexega-sql review main..HEAD models/ \
  --policy policy.yaml \
  --env prod \
  --decision-out s3://mybucket/ci/decisions/$GITHUB_RUN_ID/

# Write risk report to GCS
lexega-sql analyze query.sql --report-out gs://mybucket/reports/risk.json

Severity vs Enforcement

Signal severity (risk_level): Each signal carries a risk_level (critical, high, medium, low, info). Built-in signals set this in the YAML rule; custom rules set it the same way.
Policy enforcement (--policy): The policy layer is the only way to block. A policy bundle can allow / warn / block by referencing rule IDs (e.g., block rule GRT-TO-PUBLIC in prod). Policies are an enforcement layer on top of signals — they reference rules, not define detection logic.

Rule Structure

Field	Type	Required	Description
`id`	string	yes	Stable identifier (e.g. `ACME-STG-ENC`). Avoid the prefixes used by built-ins (`GRT-`, `MASK-`, `SNW-`, `Q-`, `DML-`, …) so customer rules can be told apart at a glance — unless you're authoring a partial override, in which case `id` must match a built-in.
`risk_level`	enum	yes	One of `critical`, `high`, `medium`, `low`, `info` (lowercase).
`message`	string	yes	Signal description. Supports `{path.to.field}` interpolation against the matched statement's facts.
`triggers`	object	yes for full rules; omit for partial overrides	The predicate that decides whether the rule fires on a given statement. See below. Omitting it turns the entry into a partial overlay on a built-in (see Overriding Built-in Rules).
`enabled`	bool	no	Default `true`. Set `false` to keep the rule in source but skip evaluation.
`emission`	enum	no	`once` (one signal per statement, default) or `per_witness` (one signal per matched element). Auto-set to `per_witness` whenever the predicate contains an `each:` quantifier. Cannot be set on a partial override.
`per_statement`	bool	no	Default `false`. When `true`, the deduplication pass keeps each emitted signal as its own entry instead of merging by `(rule_id, risk_level)`. Use for compliance-review rules where every occurrence is a distinct artifact. Cannot be set on a partial override.

Trigger Predicates

The triggers: key is a single predicate object that matches against StatementFacts — the typed bag of facts the analyzer derives for one SQL statement.

Combinators

Form	Meaning
`all_of: [<pred>, ...]`	Boolean AND. Implicit `all_of` is also formed when a predicate object lists multiple sibling path-matches without a combinator.
`any_of: [<pred>, ...]`	Boolean OR.
`not: <pred>`	Boolean NOT.

Selecting the statement type

kind: is the most common top-level matcher. Values are snake-cased StatementKind variants — select, insert, grant, create_table, alter_stage, drop_masking_policy, etc. See the Builtin Rule Reference for the full taxonomy.

triggers:
  kind: grant                                   # one statement kind

triggers:
  kind:                                         # any of several kinds
    in: [insert, update, delete, merge, multi_insert, truncate]

Matching fact fields

Every field on StatementFacts is addressable by dotted path. The schema generator walks the fact tree and emits an explicit property for each legal path, so IDE auto-complete works offline:

triggers:
  all_of:
    - kind: grant
    - privilege.with_grant_option: true         # eq sugar — bare value means `eq`
    - privilege.target.kind:                    # the *object* the grant is on
        in: [database, schema, catalog]

target is the object, not the grantee. privilege.target is the object the privilege is being granted on (GRANT … ON <target> TO <grantees>). To predicate on who receives the grant, use the privilege.grantees relational path (see the next section).

Scalar operators

For scalar fields (bool, string, number, enum, Option<T>):

Op	Example
(bare value)	`kind: grant` (sugar for `{ eq: grant }`)
`eq` / `neq`	`privilege.all_privileges: { neq: true }`
`gt` / `lt` / `gte` / `lte`	`privilege.role_grant_impact.inherited_privileges: { gt: 0 }`
`matches: <glob>`	`ddl.stage.url_literal: { matches: "://:@" }`
`in: [...]` / `not_in: [...]`	`kind: { in: [create_stage, alter_stage] }`
`exists: <bool>`	`privilege.role_grant_impact: { exists: true }` (Option-presence test)
`is_null: <bool>`	same shape, alternate spelling
`range: { low, high, low_inclusive, high_inclusive }`	numeric ranges; `*_inclusive` default `true`

About matches:. matches: is glob, not regex. * is the only wildcard (matches any sequence of characters, including empty); every other character — including ., ?, ^, $, [, ], \ — is matched literally. The pattern "PROD.*" matches strings beginning with the literal four characters PROD. followed by anything (so it works against dot-separated canonical names like PROD.SCHEMA.TABLE); it is not equivalent to the regex PROD\..*. For fixed enumerations use in: / not_in:; for OR-of-patterns, combine matches: predicates under any_of:. Anchored, character-class, and capture-group matching are not supported.

For Vec<scalar> fields (e.g. privilege.privileges):

Op	Example
`contains: <value>`	`privilege.privileges: { contains: ownership }`
`contains_any: [...]`	`privilege.privileges: { contains_any: [select, insert] }`
`contains_all: [...]`	every listed value must be present

Relational quantifiers

For Vec<struct> fields (e.g. privilege.grantees, query.scopes), nest a sub-predicate inside a quantifier:

Quantifier	Meaning
`exists: <pred>`	∃ element. P(element).
`all: <pred>`	∀ element. P(element).
`none: <pred>`	¬∃ element. P(element).
`each: <pred>`	Per-witness driver — one signal fires per matching element. Implies `emission: per_witness`.
`count: <numeric-op>`	Match on the size of the collection. Bare integer is sugar for `eq`.

triggers:
  all_of:
    - kind: grant
    - privilege.grantees:
        exists:
          name.normalized: PUBLIC               # ∃ grantee named PUBLIC
    - privilege.privileges:
        count: { gte: 5 }                       # at least 5 privileges granted

# Per-witness emission: one signal per OR-tautology found in the query.
triggers:
  all_of:
    - kind: { in: [select, set_select] }
    - query.or_tautologies:
        each: {}                                # vacuous body — every element witnesses

Identifier matching

Identifier fields (IdentName) carry both raw (the source spelling) and normalized (dialect-aware case fold — UPPERCASE on Snowflake/BigQuery/MSSQL, lowercase on PostgreSQL/Databricks/Redshift):

- privilege.grantees:
    exists:
      name.normalized:
        in: [ACCOUNTADMIN, SECURITYADMIN]       # matches regardless of source casing

TableRef adds a canonical field (database.schema.table joined from normalized components) for cross-dialect-stable identity:

- query.reads_table:
    exists:
      table.canonical: { matches: "PROD.SENSITIVE.*" }

Message Interpolation

{path.to.field} slots in message resolve against the matched statement's facts at emission time:

- id: ACME-PROD-WRITE
  risk_level: high
  message: "Write to {query.writes_table.0.table.canonical} requires change ticket."
  triggers:
    all_of:
      - kind: { in: [insert, update, delete, merge] }
      - query.writes_table: { count: { gt: 0 } }

Paths that don't resolve render as the empty string — author messages defensively against optional fields. For relational paths (Vec<T>), index in with a numeric segment as above (query.writes_table.0.table.canonical) or use {witness.<sub-path>} from a per_witness rule.

Overriding Built-in Rules

A rule entry that omits triggers: becomes a partial override on the built-in rule with the same id. The built-in's predicate (triggers, emission, per_statement) is inherited verbatim; only the fields you set replace the built-in's.

# Downgrade the built-in SCHEMA-DROP from critical to low and reword the message.
# No `triggers:` — the built-in's `kind: drop_schema` predicate is inherited.
rules:
  - id: SCHEMA-DROP
    risk_level: low
    message: "Schema dropped (downgraded per org policy — swap-and-drop is expected)."

Partial overrides can only set risk_level, message, and enabled — setting emission or per_statement without triggers: raises DisallowedFieldOnPartialOverride at load time. To change the predicate too, write a full override instead: include triggers: and the entire rule (still using the built-in's id) replaces the built-in wholesale.

# Full override — restates the built-in's id with a different predicate.
# The built-in is replaced entirely; nothing is inherited.
rules:
  - id: SCHEMA-DROP
    risk_level: high
    message: "Schema drop in a non-ephemeral namespace."
    triggers:
      all_of:
        - kind: drop_schema
        - ddl.target.name.schema.normalized:
            not_in: [EPHEMERAL, SANDBOX, SCRATCH]

Loader behaviour:

Unknown id on a partial. UnresolvedPartialOverride — fix the id, or convert the entry to a full rule by adding triggers:. There is no "silent overlay against nothing" fallback.
Partial override under --no-builtin. PartialOverrideWithoutBuiltins — there's no built-in corpus to inherit from. Either drop --no-builtin or restate the rule as a full entry.
Partial with no override-able fields set. EmptyPartialOverride — at least one of risk_level / message / enabled is required so the entry does something.

Common Mistakes

triggers: as an array. Legacy shape. v1 triggers: is one predicate object. Use all_of: [...] / any_of: [...] for multiple conditions.
statement_type: GrantStatement. Legacy shape. v1 uses kind: grant (snake_case, no Statement suffix).
categorical_signal: { category: SECURITY, ... }. Removed. v1 signals carry rule_id directly; the legacy (category, surface, condition) triple is gone. Predicate against typed fact fields instead.
risk_level: High. Lowercase only — high. Same for the others.
Top-level metadata: / name: / schema_version:. Not part of the v1 rule-file shape. Only rules: is recognized at the root.

Debugging

--explain-signals lists, per statement, which rules matched; --verbose adds a per-rule breakdown of the paths that matched and (for rules that almost matched) the paths that didn't:

lexega-sql analyze query.sql --custom-rules rules.yaml --explain-signals

Execution Order: Built-in rules and custom rules run through the same engine and produce the same signal shape. Customer entries with a built-in's id either fully replace the built-in (triggers: present) or partially overlay it (triggers: absent — see Overriding Built-in Rules); in both cases only one rule fires per id. After all rules fire, signals are deduplicated by (rule_id, risk_level) unless the rule sets per_statement: true. The policy layer is applied last to map signals to allow / warn / block decisions.