Error Contracts & Resilience Mapping
In distributed API systems, error handling is not an operational afterthought; it is a first-class architectural surface. Contract-first error design decouples transport semantics from domain failures, enabling predictable client behavior, automated SDK generation, and deterministic resilience workflows. By treating error payloads as versioned, schema-validated artifacts, platform teams can enforce consistent failure modes across microservices, gateways, and client ecosystems.
Defining the Error Contract Boundary
A robust error contract establishes a strict boundary between HTTP transport semantics and application-level domain failures. Rather than relying on opaque string messages or ad-hoc JSON shapes, teams should standardize on a machine-readable baseline. The RFC 7807 Problem+JSON Implementation serves as the foundational schema, providing a predictable structure for type, title, status, detail, and instance fields while allowing extensible domain properties.
OpenAPI 3.1 Specification & JSON Schema
openapi: 3.1.0
info:
  title: Resilient Service API
  version: 2.1.0
paths:
  /orders:
    post:
      responses:
        '400':
          description: Validation failure
          content:
            application/problem+json:
              schema:
                $ref: '#/components/schemas/ProblemDetail'
        '500':
          description: Internal server error
          content:
            application/problem+json:
              schema:
                $ref: '#/components/schemas/ProblemDetail'
components:
  schemas:
    ProblemDetail:
      type: object
      required: [type, title, status]
      properties:
        type:
          type: string
          format: uri
          description: URI reference identifying the error class
        title:
          type: string
          description: Short, human-readable summary
        status:
          type: integer
          minimum: 100
          maximum: 599
          description: HTTP status code
        detail:
          type: string
          description: Detailed explanation of the specific occurrence
        instance:
          type: string
          format: uri
          description: URI identifying the specific request instance
        trace_id:
          type: string
          format: uuid
          description: Correlation ID for distributed tracing
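To make the schema above concrete, here is a minimal Python sketch of how a service might assemble a conforming payload and how a test might check the three required members. The `make_problem` and `missing_required` helpers, the `https://errors.example.com/validation` type URI, and the `/orders/123` instance path are hypothetical names invented for this illustration; they are not part of the contract.

```python
import json
import uuid

REQUIRED_FIELDS = ("type", "title", "status")


def make_problem(type_uri, title, status, detail=None, instance=None, trace_id=None):
    """Build a dict shaped like the ProblemDetail schema (hypothetical helper)."""
    if not 100 <= status <= 599:
        raise ValueError(f"status {status} is outside the 100-599 range")
    problem = {
        "type": type_uri,
        "title": title,
        "status": status,
        # Generate a correlation ID when the caller does not supply one.
        "trace_id": trace_id or str(uuid.uuid4()),
    }
    # Optional members are omitted rather than serialized as null.
    if detail is not None:
        problem["detail"] = detail
    if instance is not None:
        problem["instance"] = instance
    return problem


def missing_required(payload):
    """Return the required members that are absent from a decoded payload."""
    return [name for name in REQUIRED_FIELDS if name not in payload]


body = json.dumps(
    make_problem(
        "https://errors.example.com/validation",  # assumed type URI
        "Validation failure",
        400,
        detail="quantity must be a positive integer",
        instance="/orders/123",  # assumed instance path
    )
)
decoded = json.loads(body)
print(missing_required(decoded))
```

A production service would more likely validate against the published JSON Schema itself; the hand-rolled check here only keeps the sketch free of third-party dependencies.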
CI/CD Contract Validation
Enforce schema compliance at the pipeline level using contract testing tools:
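name: Validate Error Contracts
on: [pull_request]
jobs:
  contract-check:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Install Schemathesis
        # The flags below follow the Schemathesis 3.x CLI, so pin below 4.0.
        run: pip install "schemathesis<4"
      - name: Run Error Payload Assertions
        run: |
          # A file-based schema needs --base-url; the staging host is a placeholder.
          schemathesis run ./openapi.yaml \
            --base-url https://staging-api.example.com \
            --checks all \
            --validate-schema=true \
            --header "Accept: application/problem+json" \
            --hypothesis-phases explicit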
Transport Layer Alignment & Code Mapping
Predictable client routing requires strict alignment between HTTP status codes and error categories. Misaligned status codes break API gateway routing rules, load balancer health checks, and client-side exception dispatchers. The HTTP Status Code Mapping establishes deterministic mappings that separate infrastructure-level failures (for example 502, 503, and 504) from business-logic constraints (for example 409 and 422), ensuring consistent cross-service alignment.
Vendor Extension for SDK Generation
Extend OpenAPI definitions with x-error-classification to drive automated code generation:
responses:
  '429':
    description: Rate limit exceeded
    x-error-classification: transient
    content:
      application/problem+json:
        schema:
          $ref: '#/components/schemas/RateLimitProblem'
  '409':
    description: Resource conflict
    x-error-classification: permanent
    content:
      application/problem+json:
        schema:
          $ref: '#/components/schemas/ConflictProblem'
Tooling that generates gateway configuration (for Envoy, Kong, or AWS API Gateway) can read these extensions to set up routing, throttling, or fallback pools, so the gateway does not have to inspect payload bodies at request time.
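As an illustration of how a code generator or config builder might consume the extension, the sketch below walks a parsed OpenAPI document and collects every `x-error-classification` value into a lookup table. The `classification_map` function and the inline `spec` dict are assumptions made for this example; in practice the dict would come from loading the YAML file.

```python
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options", "trace"}


def classification_map(spec):
    """Map (METHOD, path, status) to its x-error-classification value, if declared."""
    result = {}
    for path, item in spec.get("paths", {}).items():
        for method, operation in item.items():
            if method not in HTTP_METHODS:
                continue  # skip path-level keys such as "parameters"
            for status, response in operation.get("responses", {}).items():
                label = response.get("x-error-classification")
                if label is not None:
                    result[(method.upper(), path, str(status))] = label
    return result


# A parsed spec fragment, written as a dict to keep the sketch dependency-free.
spec = {
    "paths": {
        "/orders": {
            "post": {
                "responses": {
                    "201": {"description": "Created"},
                    "409": {
                        "description": "Resource conflict",
                        "x-error-classification": "permanent",
                    },
                    "429": {
                        "description": "Rate limit exceeded",
                        "x-error-classification": "transient",
                    },
                }
            }
        }
    }
}

lookup = classification_map(spec)
for key, label in sorted(lookup.items()):
    print(key, label)
```

Responses without the extension are simply left out of the table, so a consumer can fall back to a status-code default for them.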
Error Classification & Automation Triggers
Error contracts must encode actionable metadata to drive automated recovery workflows. Classifying failures into transient, permanent, and business-logic categories enables middleware to make deterministic routing decisions without human intervention. The Retryable vs Non-Retryable Errors taxonomy dictates when clients should back off, when they should fail fast, and when they should escalate to alternative execution paths.
| Classification | HTTP Status Codes | SDK Action | Middleware Behavior |
|---|---|---|---|
| Transient | 429, 502, 503, 504 | Exponential backoff + jitter | Circuit open → half-open → closed |
| Permanent | 400, 401, 403, 404, 410 | Fail fast + log | Drop request, return cached/default |
| Business | 409, 422, 451 | Surface to UI, halt workflow | Route to domain-specific handler |
Embedding a retry_after field in transient payloads tells clients how long to wait before retrying, and echoing the request's idempotency_key lets them retry with the same key so the operation resumes without duplicating side effects.
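The transient row of the table can be sketched as a small retry loop. This is one possible reading, not a reference implementation: it assumes the payload carries `retry_after` in seconds, uses "full jitter" backoff when no hint is present, and reuses a single idempotency key across attempts. The `send` callable, `backoff_delay`, and `call_with_retries` are hypothetical names, and the scripted transport stands in for a real HTTP client.

```python
import random
import time
import uuid

TRANSIENT_STATUSES = {429, 502, 503, 504}


def backoff_delay(attempt, problem=None, base=0.5, cap=30.0, rng=random.random):
    """Seconds to wait before the next try; `attempt` is 0-based."""
    retry_after = (problem or {}).get("retry_after")
    if retry_after is not None:
        return float(retry_after)  # a server hint wins over computed backoff
    ceiling = min(cap, base * (2 ** attempt))
    return rng() * ceiling  # full jitter: uniform in [0, ceiling)


def call_with_retries(send, max_retries=3, sleep=time.sleep):
    """Call send(idempotency_key) until it stops failing transiently."""
    key = str(uuid.uuid4())  # the same key on every attempt makes replays safe
    for attempt in range(max_retries + 1):
        status, body = send(key)
        if status not in TRANSIENT_STATUSES or attempt == max_retries:
            return status, body
        sleep(backoff_delay(attempt, body))


# Demo with a scripted transport so the sketch runs without a network.
scripted = iter([(503, {"retry_after": 2}), (429, None), (201, {"id": "order-1"})])
keys_seen, delays = [], []


def fake_send(idempotency_key):
    keys_seen.append(idempotency_key)
    return next(scripted)


status, body = call_with_retries(fake_send, sleep=delays.append)
```

Passing `sleep` and `rng` as parameters keeps the timing logic testable; a real client would leave them at their defaults.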
Client-Side Resilience & Fallback Execution
Generated SDKs and full-stack consumers must translate contract signals into runtime resilience patterns. Rather than scattering try/catch blocks across business logic, platform teams should centralize error handling in interceptors, middleware, or generated exception hierarchies. The Client Fallback Strategies guide outlines how to implement circuit breakers, graceful degradation, and cache fallbacks based on explicit contract metadata.
TypeScript / Axios Interceptor
import axios, { AxiosError } from 'axios';

interface ProblemDetail {
  type: string;
  title: string;
  status: number;
  detail?: string;
  classification?: 'transient' | 'permanent' | 'business';
}

export class ApiError extends Error {
  constructor(public readonly problem: ProblemDetail) {
    super(problem.title);
    this.name = 'ApiError';
  }
}

// Raised for failures the contract marks as safe to retry.
export class TransientError extends ApiError {
  constructor(problem: ProblemDetail) {
    super(problem);
    this.name = 'TransientError';
  }
}

const resilientClient = axios.create();

resilientClient.interceptors.response.use(
  (res) => res,
  (error: AxiosError<ProblemDetail>) => {
    const problem = error.response?.data;
    if (problem?.classification === 'transient') {
      throw new TransientError(problem);
    }
    // Network errors carry no response body, so fall back to a placeholder.
    throw new ApiError(problem ?? { type: 'unknown', title: 'Unknown Error', status: 0 });
  }
);
Go HTTP Client with Retry Injection
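package client

import (
	"fmt"
	"net/http"
	"time"
)

// RetryPolicy bounds how many times, and how quickly, a request is retried.
type RetryPolicy struct {
	MaxRetries int
	Backoff    time.Duration
}

// ShouldRetry treats transport errors, 429, and 5xx responses as transient.
func (rp *RetryPolicy) ShouldRetry(resp *http.Response, err error) bool {
	if err != nil {
		return true
	}
	return resp.StatusCode == http.StatusTooManyRequests ||
		resp.StatusCode >= http.StatusInternalServerError
}

// Client is assumed to wrap a standard *http.Client.
type Client struct{ HTTPClient *http.Client }

// DoWithResilience wraps transient failures in an error the caller can inspect.
func (c *Client) DoWithResilience(req *http.Request) (*http.Response, error) {
	resp, err := c.HTTPClient.Do(req)
	if err != nil {
		return nil, fmt.Errorf("transient failure: %w", err)
	}
	if resp.StatusCode >= http.StatusInternalServerError {
		resp.Body.Close() // release the connection before discarding the response
		return nil, fmt.Errorf("transient failure: status %d", resp.StatusCode)
	}
	return resp, nil
}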
Python Requests + Circuit Breaker Adapter
import requests
from pybreaker import CircuitBreaker

# Open the circuit after 5 consecutive failures; allow a trial call after 30 seconds.
breaker = CircuitBreaker(fail_max=5, reset_timeout=30)


class ResilientSession:
    @breaker
    def request(self, method, url, **kwargs):
        resp = requests.request(method, url, **kwargs)
        # Rate limiting and server errors count as failures toward the breaker.
        if resp.status_code == 429 or resp.status_code >= 500:
            raise requests.exceptions.RetryError(resp.text)
        return resp
OpenAPI Generator Configuration
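# errorHandling is not a documented typescript-axios option; it only has an
# effect if custom templates read it. useSingleRequestParameter is built in.
openapi-generator-cli generate \
  -i openapi.yaml \
  -g typescript-axios \
  --additional-properties=errorHandling=strict,useSingleRequestParameter=true \
  -o ./generated-sdk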
Observability, Auditing & Continuous Enforcement
Error contracts degrade silently without continuous validation. Platform teams must monitor schema drift, validate error payloads in staging environments, and trace failure propagation across service boundaries. Integrating structured error logging with distributed tracing (OpenTelemetry) ensures that instance and trace_id fields correlate directly with span telemetry. The Production Debugging & Performance Audits framework details how to validate contract compliance in CI, detect drift in production, and audit fallback execution paths.
CI/CD Drift Detection Pipeline
- name: Detect Error Schema Drift
  run: |
    # OpenAPITools/openapi-diff is distributed as a JAR and Docker image.
    docker run --rm -v "$PWD:/specs:ro" openapitools/openapi-diff:latest \
      /specs/baseline.yaml /specs/current.yaml \
      --fail-on-incompatible
- name: Validate Staging Payloads
  run: |
    curl -s https://staging-api.example.com/health \
      | jq -e '.error_schema_version == "2.1.0"'
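For teams that want a drift check scoped to the error schema alone, the comparison can be approximated in a few lines. The sketch below is an assumption-laden simplification: `schema_drift` is a hypothetical helper that only looks at top-level `properties` and `required`, and it treats removals, type changes, and relaxed requirements as breaking for clients that read error payloads.

```python
def schema_drift(baseline, current):
    """List changes that could break clients reading error payloads."""
    findings = []
    base_props = baseline.get("properties", {})
    curr_props = current.get("properties", {})
    for name, spec in base_props.items():
        if name not in curr_props:
            findings.append(f"property removed: {name}")
        elif spec.get("type") != curr_props[name].get("type"):
            findings.append(
                f"type changed: {name} ({spec.get('type')} -> {curr_props[name].get('type')})"
            )
    # A member that stops being required may disappear from responses.
    relaxed = set(baseline.get("required", [])) - set(current.get("required", []))
    for name in sorted(relaxed):
        findings.append(f"no longer required: {name}")
    return findings


baseline = {
    "required": ["type", "title", "status"],
    "properties": {
        "type": {"type": "string"},
        "title": {"type": "string"},
        "status": {"type": "integer"},
        "trace_id": {"type": "string"},
    },
}
current = {
    "required": ["type", "status"],
    "properties": {
        "type": {"type": "string"},
        "title": {"type": "string"},
        "status": {"type": "string"},
    },
}

for finding in schema_drift(baseline, current):
    print(finding)
```

Newly added optional properties are deliberately not reported, since additive changes should not break existing exception handlers.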
Common Pitfalls & Anti-Patterns
- Generic 500 for business failures: Masking domain constraints behind 500 Internal Server Error prevents clients from implementing targeted recovery logic and inflates false-positive alerting.
- Inconsistent error schemas across microservices: Divergent payload shapes break unified SDK generation, forcing consumers to write brittle, service-specific parsers.
- Missing idempotency keys or retry tokens: Transient errors without replay-safe identifiers lead to duplicate mutations and data corruption during automated retries.
- Coupling transport codes to domain exceptions: Directly mapping 404 to UserNotFound or 403 to BillingExpired creates rigid hierarchies that break when routing infrastructure changes.
- Neglecting independent error contract versioning: Tying error schema evolution to endpoint signatures forces unnecessary major version bumps and breaks existing exception handlers during minor infrastructure updates.
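One way to sidestep the coupling pitfall in the list above is to dispatch on the problem `type` URI and ignore the status code when choosing an exception class. The following sketch assumes hypothetical type URIs under `https://errors.example.com/` and a hypothetical `to_exception` helper; the class names are borrowed from the pitfall description.

```python
class ApiProblem(Exception):
    """Base class carrying the decoded problem document."""

    def __init__(self, problem):
        super().__init__(problem.get("title", "Unknown error"))
        self.problem = problem


class UserNotFound(ApiProblem):
    pass


class BillingExpired(ApiProblem):
    pass


# Registry keyed by problem type URI (assumed values), not by HTTP status.
PROBLEM_TYPES = {
    "https://errors.example.com/user-not-found": UserNotFound,
    "https://errors.example.com/billing-expired": BillingExpired,
}


def to_exception(problem):
    """Pick the exception class from the type URI, falling back to the base class."""
    cls = PROBLEM_TYPES.get(problem.get("type"), ApiProblem)
    return cls(problem)


# The same domain failure maps to the same class even if the status code changes.
via_403 = to_exception(
    {"type": "https://errors.example.com/billing-expired", "title": "Billing expired", "status": 403}
)
via_402 = to_exception(
    {"type": "https://errors.example.com/billing-expired", "title": "Billing expired", "status": 402}
)
unknown = to_exception({"type": "https://errors.example.com/other", "title": "Other", "status": 400})
```

With this shape, moving a failure from 403 to 402 at the gateway leaves client exception handlers untouched.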
Frequently Asked Questions
How do error contracts impact automated client SDK generation?
Structured error schemas enable type-safe exception mapping in generated clients, reducing boilerplate and preventing runtime parsing failures during contract updates. Code generators can emit dedicated exception classes, typed retry policies, and fallback hooks directly from OpenAPI definitions.
What is the difference between a resilience map and a standard error handling guide?
A resilience map explicitly ties error classifications to automated recovery strategies (retries, fallbacks, circuit breakers) rather than just logging or UI messaging. It defines machine-readable triggers that dictate how infrastructure and clients should behave under specific failure conditions.
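In its simplest form, a resilience map is just data. The sketch below encodes the classification table from earlier in this document as a lookup; the action names (`retry_with_backoff`, `fail_fast`, `surface_to_user`), the `plan_for` function, and the fail-fast default for unlisted statuses are assumptions made for illustration.

```python
RESILIENCE_MAP = {
    "transient": {"statuses": {429, 502, 503, 504}, "action": "retry_with_backoff"},
    "permanent": {"statuses": {400, 401, 403, 404, 410}, "action": "fail_fast"},
    "business": {"statuses": {409, 422, 451}, "action": "surface_to_user"},
}


def plan_for(status, declared=None):
    """Pick a recovery action, preferring the classification declared in the payload."""
    if declared in RESILIENCE_MAP:
        return RESILIENCE_MAP[declared]["action"]
    for entry in RESILIENCE_MAP.values():
        if status in entry["statuses"]:
            return entry["action"]
    # Assumption: statuses the map does not list are never retried automatically.
    return "fail_fast"


print(plan_for(503))
print(plan_for(500, declared="transient"))
```

Keeping the map as plain data means the same table can drive SDK code generation and middleware configuration.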
Should error contracts be versioned independently of API endpoints?
Yes, error schemas should follow semantic versioning to prevent breaking client exception handlers during minor API updates or infrastructure changes. Independent versioning allows platform teams to evolve failure semantics without forcing endpoint deprecations or client rewrites.
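A client could check compatibility at startup along these lines. This is a sketch under one interpretation of semantic versioning: the major version must match, and the server's minor version must be at least the one the client was built against. The function names are hypothetical, and the version strings would come from wherever the service publishes its error schema version.

```python
def parse_version(text):
    """Split 'MAJOR.MINOR.PATCH' into a tuple of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def is_compatible(client_built_against, server_reports):
    """True when the server's error schema can be handled by this client."""
    client = parse_version(client_built_against)
    server = parse_version(server_reports)
    same_major = client[0] == server[0]
    # Minor releases are additive, so a newer server minor is acceptable.
    return same_major and server[1] >= client[1]


print(is_compatible("2.1.0", "2.3.4"))
print(is_compatible("2.1.0", "3.0.0"))
```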