Baseline Management for Tile Servers

A tile server is not a static asset host — it is a distributed, stateful rendering engine that emits raster or vector output across dozens of zoom levels, projections, and style revisions. That makes baseline management the single point where a map visual regression suite either earns trust or loses it. Without disciplined baselines, GPU-accelerated WebGL compositing, OS-level font hinting, anti-aliasing on tile seams, and non-deterministic label-collision passes all leak pixel drift into the diff. Multiplied across the hundreds of tiles in a viewport, that drift swamps the signal and trains engineers to ignore the suite. This page covers how to capture, store, version, and compare tile baselines so the only diffs that surface are the ones a cartographer would care about.

This page extends the foundational principles set out in Web Map Visual Testing Fundamentals & Toolchains, translating the determinism and versioning ideas there into a production pipeline scoped specifically to tile servers and tile-pyramid output.

What a Tile-Server Baseline Actually Is

A baseline, in this context, is the blessed rendered output for a known coordinate, at a known zoom, under a known style revision, captured in a known rendering environment. The unit of comparison is not “the page” but the individual tile (or a fixed grid block of tiles), addressed by its z/x/y coordinate. Anchoring the baseline to a tile address rather than a full-page screenshot is what makes the system align with the OGC Two Dimensional Tile Matrix Set — every baseline maps to a deterministic cell in the tile matrix, so a regression can be reported as “tile 12/2048/1536 drifted” rather than “the map looks different somewhere.”

A complete baseline record therefore has two parts:

The image artifact — the rendered PNG/WebP for the tile or grid block.
The provenance manifest — structured metadata describing exactly what produced it: tile coordinate, zoom, style hash, rendering engine, OS, font stack, and device pixel ratio (DPR).

The manifest is the load-bearing piece. Storing pixels without provenance produces a graveyard of images nobody can re-bless safely; storing provenance makes every baseline reproducible and lets the pipeline route diffs only to baselines that share the same rendering environment. This separation also keeps the system aligned with how deterministic tile capture and synchronization are handled upstream of comparison.

Architecture: Storage Model, Naming, and Branching

Storing hundreds of megabytes of tile imagery directly in Git bloats the repository, slows clones, and breaks line-based diff tooling. The recommended pattern keeps only the lightweight manifests in version control and pushes the binary artifacts to object storage (S3, GCS, Azure Blob) or Git LFS. The full mechanics of this split — content-addressed keys, sidecar manifests, and promotion workflows — are detailed in Setting up baseline image versioning for web maps.

Content-addressed keys. Derive each artifact's key from its provenance (style hash, engine, tile coordinate) rather than from a mutable, hand-chosen path. A key such as baselines/<style_hash>/<engine>/<z>/<x>/<y>.png makes the storage layout self-documenting and lets the pipeline fetch only the subset of tiles a change could have touched.
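
A minimal sketch of that key layout, assuming the manifest field names used in this section. The helper names `baselineKey` and `provenanceDigest` are hypothetical, and the SHA-256 digest is only one possible way to fingerprint a rendering environment.

```javascript
const { createHash } = require("node:crypto");

// Hypothetical helper: build the storage key from provenance fields,
// following the baselines/<style_hash>/<engine>/<z>/<x>/<y>.png layout.
function baselineKey({ styleHash, engine, tile }) {
  return `baselines/${styleHash}/${engine}/${tile.z}/${tile.x}/${tile.y}.png`;
}

// Hypothetical helper: digest the provenance fields in a fixed order so the
// same tile and environment always produce the same fingerprint.
// capturedAt is deliberately excluded because it changes on every run.
function provenanceDigest(m) {
  const canonical = JSON.stringify([
    m.tile.z, m.tile.x, m.tile.y,
    m.styleHash, m.engine, m.os, m.fontStack, m.dpr,
  ]);
  return createHash("sha256").update(canonical).digest("hex");
}
```

Because the key is a pure function of provenance, two runners that share a style revision and engine resolve to the same object without coordinating.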

Sidecar manifests. Each artifact gets a JSON sidecar that the CI job reads before diffing:

{
  "tile": { "z": 12, "x": 2048, "y": 1536 },
  "styleHash": "mapbox-gl-style-v4.2.1-sha256-9f3c…",
  "engine": "chromium-118",
  "os": "ubuntu-22.04",
  "fontStack": "noto-sans-2.1",
  "dpr": 1,
  "capturedAt": "2026-06-26T09:00:00Z"
}
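
Before diffing, a CI job could reject sidecars that are missing provenance. The sketch below assumes the field names from the example manifest; the function name `manifestErrors` and the specific validation rules (including the power-of-two matrix bound) are assumptions, not part of any published schema.

```javascript
const REQUIRED_STRINGS = ["styleHash", "engine", "os", "fontStack", "capturedAt"];

// Returns a list of human-readable problems; an empty list means the
// sidecar carries every field the diff step relies on.
function manifestErrors(manifest) {
  const errors = [];
  const tile = manifest.tile ?? {};
  for (const axis of ["z", "x", "y"]) {
    if (!Number.isInteger(tile[axis]) || tile[axis] < 0) {
      errors.push(`tile.${axis} must be a non-negative integer`);
    }
  }
  // Assumes a square power-of-two matrix: 2^z columns and rows at zoom z.
  if (errors.length === 0 && (tile.x >= 2 ** tile.z || tile.y >= 2 ** tile.z)) {
    errors.push("tile.x or tile.y lies outside the 2^z matrix");
  }
  for (const key of REQUIRED_STRINGS) {
    if (typeof manifest[key] !== "string" || manifest[key] === "") {
      errors.push(`${key} must be a non-empty string`);
    }
  }
  if (typeof manifest.dpr !== "number" || !(manifest.dpr > 0)) {
    errors.push("dpr must be a positive number");
  }
  return errors;
}
```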

Branching strategy. Treat baselines as code that travels with the style that produced them. When a pull request changes a style or data source, regenerate only the affected tiles and bless them on the same branch, so the new expected pixels and the change that caused them merge together. Never re-bless against main out of band — that silently rewrites history for tiles unrelated to the change.
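
One way to scope "only the affected tiles" is to convert the geographic extent of a change into a tile range. This is a sketch under the assumption that the server uses Web Mercator XYZ addressing with the origin at the top-left; the function names are hypothetical, and bounding boxes that cross the antimeridian are not handled.

```javascript
// Standard Web Mercator lon/lat to XYZ tile conversion (assumed scheme).
function lonLatToTile(lon, lat, z) {
  const n = 2 ** z;
  const latRad = (lat * Math.PI) / 180;
  const x = Math.floor(((lon + 180) / 360) * n);
  const y = Math.floor(((1 - Math.asinh(Math.tan(latRad)) / Math.PI) / 2) * n);
  // Clamp so points on the east or south edge stay inside the matrix.
  const clamp = (v) => Math.min(n - 1, Math.max(0, v));
  return { z, x: clamp(x), y: clamp(y) };
}

// Every tile at zoom z that intersects the bounding box, column by column.
function tilesForBbox({ west, south, east, north }, z) {
  const topLeft = lonLatToTile(west, north, z);
  const bottomRight = lonLatToTile(east, south, z);
  const tiles = [];
  for (let x = topLeft.x; x <= bottomRight.x; x++) {
    for (let y = topLeft.y; y <= bottomRight.y; y++) {
      tiles.push({ z, x, y });
    }
  }
  return tiles;
}
```

Running `tilesForBbox` for each zoom level a style change touches gives the list of baselines to regenerate on the branch, leaving every other tile untouched.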

Engine isolation. Maintain a distinct baseline set per rendering engine rather than forcing one “canonical” image across browsers. A Chromium-on-Ubuntu render will rarely match WebKit-on-macOS at the pixel level, and a single canonical baseline guarantees permanent false positives on every engine but one.

Step-by-Step: A Deterministic Capture Pipeline

Effective capture bypasses network variability and external data dependencies entirely. The procedure below produces reproducible tile baselines on any runner.

Lock the viewport and device pixel ratio. Force viewport: { width: 1024, height: 768 } and deviceScaleFactor: 1 at browser-context creation to eliminate fractional pixel interpolation. The exact numbers matter less than freezing them across every runner and developer machine.
```
const context = await browser.newContext({
  viewport: { width: 1024, height: 768 },
  deviceScaleFactor: 1,
});
```
Intercept tile requests and serve fixtures. Route fetch/XHR tile requests to local fixtures or a pre-warmed cache so external provider rate limits, CDN cache misses, and render-timing jitter cannot alter the output.
```
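// Hypothetical helper: reduce ".../tiles/{z}/{x}/{y}.pbf" to "z/x/y".
// Adjust the pattern to the tile URL scheme actually in use.
const tileKey = (url) => new URL(url).pathname.match(/(\d+)\/(\d+)\/(\d+)/).slice(1, 4).join("/");

await page.route("**/tiles/**", (route) =>
  route.fulfill({ path: `fixtures/${tileKey(route.request().url())}.pbf` })
);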
```
Iterate the tile grid programmatically. Request tiles using a standardized z/x/y matrix rather than panning the camera by hand, so every run visits the same cells in the same order.
Await an idle-then-requestAnimationFrame handshake. Wait for the map engine’s idle event, then one animation frame, before capturing. This is the same synchronization contract covered under handling async tile loading, and it is what prevents the suite from photographing a half-loaded pyramid.
```
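// Assumes the page exposes the map instance as a global `map`.
await page.evaluate(() => new Promise((resolve) => {
  // Resolve one frame after the engine reports idle so the final paint has landed.
  map.once("idle", () => requestAnimationFrame(resolve));
}));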
```
Capture per-tile or per-grid-block. Snapshot individual tiles or fixed grid blocks instead of the full viewport. This enables parallel execution, caps memory during large regression sweeps, and keeps each artifact addressable by its tile coordinate.
Hash, write the manifest, and upload. Compute a content hash of the captured image, emit the JSON sidecar described under "Sidecar manifests" in the architecture section, and push both the image and the sidecar to object storage under the content-addressed key.
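
The final step can be sketched as a pure function that prepares the record, leaving the actual upload to whichever storage client the pipeline uses. The `buildBaselineRecord` name, the `env` parameter shape, and the extra `contentHash` manifest field are assumptions for illustration.

```javascript
const { createHash } = require("node:crypto");

// `env` is assumed to carry the runner's provenance fields:
// styleHash, engine, os, fontStack, dpr.
function buildBaselineRecord(pngBytes, tile, env, now = new Date()) {
  // Hash the image bytes so a changed render is detectable without a pixel diff.
  const contentHash = createHash("sha256").update(pngBytes).digest("hex");
  const manifest = { tile, ...env, contentHash, capturedAt: now.toISOString() };
  const base = `baselines/${env.styleHash}/${env.engine}/${tile.z}/${tile.x}/${tile.y}`;
  return { imageKey: `${base}.png`, manifestKey: `${base}.json`, manifest };
}
```

The caller would then write `pngBytes` to `imageKey` and the serialized manifest to `manifestKey`, committing only the manifest to version control.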

This grid-based approach makes baseline generation reproducible across CI runs and removes the timing-based flakiness that dynamic map initialization otherwise introduces.
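
To illustrate the grid iteration step, the sketch below walks a fixed tile matrix in a stable order. The `GRID` fixture format (one inclusive x/y range per zoom level) and the `iterateGrid` name are hypothetical; the coordinates are placeholders.

```javascript
// Hypothetical fixture, kept in version control alongside the manifests.
const GRID = [
  { z: 12, x: [2048, 2049], y: [1536, 1537] },
  { z: 10, x: [512, 512], y: [384, 384] },
];

// Yields "z/x/y" addresses in ascending zoom, then x, then y, so every run
// visits the same cells in the same order regardless of fixture ordering.
function* iterateGrid(grid) {
  const levels = [...grid].sort((a, b) => a.z - b.z);
  for (const { z, x, y } of levels) {
    for (let tx = x[0]; tx <= x[1]; tx++) {
      for (let ty = y[0]; ty <= y[1]; ty++) {
        yield `${z}/${tx}/${ty}`;
      }
    }
  }
}
```

A capture script can loop over `iterateGrid(GRID)` and shard the resulting list across parallel workers without changing which tiles are covered.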

Cross-Browser and Cross-Environment Considerations

Rendering engines diverge in WebGL context initialization, text rasterization, and compositing order, so a baseline captured in one environment is only valid as a target for that same environment. Concrete controls:

Containerized runners with pinned system libraries. Standardize the OS image, font installation, and GPU stack in Docker, pinning exact versions of libgl1, mesa-utils, and fontconfig. Unpinned packages are the most common cause of a baseline that “passed last week” failing today with no code change.
Software rasterization for stability. Force --use-gl=swiftshader (or --use-gl=angle --use-angle=swiftshader) so rasterization is identical across machines regardless of the physical GPU. Hardware acceleration introduces driver-specific dithering that no threshold can reliably absorb.
preserveDrawingBuffer for WebGL capture. Set preserveDrawingBuffer: true on the WebGL context so the framebuffer survives long enough to be read back during snapshot, rather than being cleared between frames.
Engine-aware diff routing. Route each candidate render against the baseline that shares its engine, OS, and font stack — never across the matrix. This isolates genuine cartographic defects from expected engine divergence and complements dynamic element masking and UI stability for non-deterministic overlays.
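
As a configuration sketch for the rasterization and WebGL controls above: the flags are the ones named in this section, passed through Playwright's `args` launch option. The `mapOptions` object assumes a Mapbox GL JS style constructor that accepts `preserveDrawingBuffer`; other engines or versions may nest the setting differently.

```javascript
// Passed to chromium.launch(launchOptions) in a Playwright-based runner.
const launchOptions = {
  args: ["--use-gl=angle", "--use-angle=swiftshader"],
};

// Assumed map constructor options; `container` is a placeholder element id.
const mapOptions = {
  container: "map",
  preserveDrawingBuffer: true, // keep the framebuffer readable for the snapshot
};
```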

Treating each browser/OS/DPR combination as a separate rendering target is the difference between a matrix of trustworthy baselines and a single fragile one that is wrong everywhere except where it was born.
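
Engine-aware routing can be sketched as a strict lookup that never falls back across environments. The field names follow the sidecar manifest; `routeBaseline` and `envKey` are hypothetical names.

```javascript
const ENV_FIELDS = ["engine", "os", "fontStack", "dpr"];
const envKey = (m) => ENV_FIELDS.map((f) => String(m[f])).join("|");

// Returns the baseline manifest for the same tile and the same rendering
// environment, or null. A null result means "no baseline yet", never
// "compare against another engine".
function routeBaseline(candidate, baselines) {
  const match = baselines.find((b) =>
    b.tile.z === candidate.tile.z &&
    b.tile.x === candidate.tile.x &&
    b.tile.y === candidate.tile.y &&
    envKey(b) === envKey(candidate)
  );
  return match ?? null;
}
```

Treating a missing match as a distinct outcome lets the pipeline flag an uncovered environment instead of reporting a misleading diff.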

Threshold and Parameter Reference

Generic UI diff settings fail on tile imagery because they weight every pixel equally. Cartographic content needs perceptual and structural tolerance, tuned per content class. The deeper algorithmic treatment lives in Diff Algorithm Tuning for Cartography; the table below is the operational quick reference for baseline comparison.

| Parameter | Recommended value | Rationale |
| --- | --- | --- |
| Viewport | `1024 × 768` (frozen) | Eliminates layout-driven tile reflow between runs |
| Device pixel ratio | `1` (or one fixed value per matrix) | Removes subpixel interpolation differences |
| Vector / typography tolerance | `0.1%–0.5%` pixel change | Catches dropped labels and shifted road geometry |
| Raster hillshade / satellite tolerance | `2%–3%` pixel change | Absorbs benign resampling and JPEG-band noise |
| Anti-aliasing handling | Ignore AA pixels | Stops tile-seam AA from registering as drift |
| pHash Hamming distance gate | `≤ 5 / 64` for “identical” | Cheap pre-filter before expensive pixel diff |
| Idle settle timeout | `10s` hard cap, fail closed | Prevents half-loaded pyramids being blessed |
| Baseline retention | `90 days` rolling + tagged regressions | Bounds storage without losing reference history |
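
The per-class tolerances can be applied as a simple fraction check. This sketch uses the upper bounds from the table; the content-class names and the `exceedsTolerance` function are hypothetical, and the changed-pixel count is assumed to already exclude anti-aliased pixels.

```javascript
// Upper bounds from the table, expressed as fractions of the tile's pixels.
const MAX_CHANGED_FRACTION = { vector: 0.005, raster: 0.03 };

// True when the share of changed pixels is above the class tolerance.
function exceedsTolerance(changedPixels, totalPixels, contentClass) {
  const limit = MAX_CHANGED_FRACTION[contentClass];
  if (limit === undefined) {
    throw new Error(`unknown content class: ${contentClass}`);
  }
  return changedPixels / totalPixels > limit;
}
```

For a 256 × 256 tile (65,536 pixels), 400 changed pixels is about 0.61%, which fails a vector tile but passes a raster hillshade tile.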

For coarse deduplication and a fast pre-filter, perceptual hashing (pHash) reduces each tile to a compact fingerprint and compares two tiles by their normalized Hamming distance:

$$d(a, b) = \frac{1}{N} \sum_{i=1}^{N} [a_{i} \neq b_{i}]$$

where $a$ and $b$ are the $N$-bit hashes of the baseline and candidate tiles and $[\cdot]$ is the Iverson bracket. Tiles below the gate skip the pixel diff entirely; only tiles above it pay for a full perceptual or structural comparison.
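
The gate can be sketched directly from the distance definition. This assumes 64-bit hashes supplied as BigInt values; computing the perceptual hash itself is out of scope, and the function names are hypothetical.

```javascript
// Normalized Hamming distance: the fraction of bit positions that differ.
function normalizedHamming(a, b, bits = 64) {
  let diff = BigInt.asUintN(bits, a ^ b);
  let count = 0;
  while (diff > 0n) {
    count += Number(diff & 1n);
    diff >>= 1n;
  }
  return count / bits;
}

// Gate from the parameter table: at most 5 differing bits out of 64
// counts as "identical" and skips the pixel diff.
const GATE = 5 / 64;
const needsPixelDiff = (a, b) => normalizedHamming(a, b) > GATE;
```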

Common Pitfalls

Repository bloat from committed imagery. Root cause: pushing PNGs into Git. Fix: keep manifests in Git, artifacts in object storage or LFS, keyed by content hash.
The canonical-baseline trap. Root cause: one blessed image shared across all engines. Fix: per-engine baseline sets with engine-aware diff routing.
Silent style drift. Root cause: re-blessing baselines without tying them to the style revision that produced them. Fix: embed styleHash in every manifest and regenerate affected tiles on the same branch as the style change.
Half-loaded capture. Root cause: snapshotting before the tile pyramid settles. Fix: the idle + requestAnimationFrame handshake with a fail-closed timeout.
GPU-driver false positives. Root cause: hardware rasterization varying per runner. Fix: force SwiftShader software rendering and pin libgl1/mesa versions in the container image.
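
The fail-closed timeout named in the half-loaded capture pitfall can be sketched as a wrapper around whatever promise signals that the map has settled. The `failClosed` name is hypothetical; the 10-second default mirrors the parameter table.

```javascript
// Rejects if `settled` has not resolved within `ms`, so a stalled map
// fails the run instead of producing a baseline from a partial render.
function failClosed(settled, ms = 10_000) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(
      () => reject(new Error(`map did not settle within ${ms} ms`)),
      ms
    );
  });
  // Clear the timer either way so it cannot keep the process alive.
  return Promise.race([settled, timeout]).finally(() => clearTimeout(timer));
}
```

Wrapping the idle handshake in `failClosed` means a timeout surfaces as a test error rather than as a blessed but incomplete image.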

Beyond Deterministic Baselines: Triage at Scale

As baseline repositories grow, manual triage becomes the bottleneck. Lightweight classification can augment — never replace — deterministic capture by segmenting diff regions into categories (roads, labels, water, POIs, UI chrome), auto-dismissing diffs that fall inside known anti-aliasing noise bands, and confidence-scoring the rest so high-certainty regressions go straight to PR checks while ambiguous ones reach a human with bounding boxes pre-highlighted. Trained on the organization’s own cartographic style, this stays domain-aware rather than generic. Teams building this on open tooling typically wire it into the pipelines described in Open-Source Visual Testing Stacks, while those weighing managed options compare diff routing and threshold handling in Percy vs Chromatic for Maps.
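
The routing described above might look like the following sketch. Every threshold, field name, and route label here is an illustrative assumption; a real classifier would supply the category, confidence, and anti-aliasing flag.

```javascript
// `region` is assumed to look like:
// { category: "labels", confidence: 0.95, insideAaBand: false, bbox: [...] }
function triage(region, highConfidence = 0.9) {
  // Diffs inside a known anti-aliasing noise band are dismissed outright.
  if (region.insideAaBand) {
    return { ...region, route: "auto-dismiss" };
  }
  // High-certainty regressions go to PR checks; the rest reach a human.
  const route = region.confidence >= highConfidence ? "pr-check" : "human-review";
  return { ...region, route };
}
```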

Production Readiness Checklist

Capture scripts enforce a frozen viewport, fixed DPR, and an explicit WebGL context configuration.
Tile requests are intercepted to fixtures or a pre-warmed cache so no external variability reaches the render.
Artifacts live in object storage under content-addressed keys with JSON sidecar manifests in Git.
Diff comparison uses a pHash pre-filter plus cartography-calibrated pixel/structural thresholds.
CI runners are containerized with pinned OS, font, and GPU/rasterizer versions.
Engine-specific baseline sets and engine-aware diff routing prevent cross-environment contamination.
Any automated triage layer is trained on domain-specific diffs and keeps a human-in-the-loop escalation path.

Frequently Asked Questions

Should tile baselines be stored in Git or object storage?

Store the binary tile imagery in object storage or Git LFS and keep only the JSON provenance manifests in Git. Committing PNGs directly bloats the repository, slows clones, and breaks line-based diffing. Keying artifacts by a content hash of their provenance lets CI fetch only the subset of tiles a given change could have affected.

Why does the same tile render differently in CI than on my laptop?

The CI runner’s GPU backend, font packages, locale, or system libraries differ from your machine. Pin all of them in the container image and force software rasterization with --use-gl=swiftshader so the framebuffer is identical everywhere. If two runs of the same commit still differ, the problem is in capture determinism, not in the map code.

How many tiles should a baseline set cover?

Cover a representative grid at each zoom level you ship, concentrating on tiles with high feature density and tile-boundary seams where rendering artifacts cluster. Capturing every tile in the pyramid is wasteful; capturing a fixed, version-controlled z/x/y matrix gives reproducible coverage without unbounded storage.

When should I re-bless a baseline?

Re-bless only when a deliberate cartographic change — a style revision, new data source, or symbology update — alters the expected output, and do it on the same branch as that change so the new pixels and their cause merge together. Tie every baseline to the styleHash that produced it so re-blessing is scoped to the affected layer rather than the whole grid.

Related

Web Map Visual Testing Fundamentals & Toolchains — the parent reference for deterministic rendering, versioned baselines, and CI gating.
Setting up baseline image versioning for web maps — the content-addressed storage and promotion workflow in depth.
Diff Algorithm Tuning for Cartography — SSIM, pHash, masking, and zoom-aware thresholds.
Screenshot Capture, Sync & Comparison Logic — the synchronization handshake upstream of baseline comparison.
Open-Source Visual Testing Stacks — building the capture-and-diff pipeline on open tooling.
