168 Vector Embeddings · External Validation Report

Ether Data · External validation

External validation of census-derived spatial embeddings.

This report documents the validation methodology and results for Ether Data's spatial embeddings: 256-dimensional self-supervised representations of place, one vector per hexagonal grid cell ($\approx 0.74\ \text{km}^2$), trained on federal census records. Current evaluated coverage: 10 northeastern US states, 485,701 cells. The report is written for technical reviewers; all numbers are from held-out evaluation under the protocol described in Section 1, and every result is reproducible from a versioned manifest.

Nighttime lights: $R^2$ 0.78 (blocked holdout)
Wealth axis: partial $r$ 0.78
Cross-city transfer: Spearman 0.60
Matched null: $R^2 \approx 0$ (calibrated)

485,701 cells · 10 states · all results held-out under spatially blocked evaluation · every number manifest-tracked and reproducible

Section 01

Methodology

Frozen-embedding linear probes

A conservative lower bound

Embeddings are held fixed. Each external target is regressed on the embedding dimensions (ridge regression, fixed regularization), and performance is reported as $R^2$ on held-out cells. Linear probes are a conservative lower bound on information content: they cannot exploit nonlinear structure.
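A frozen-embedding probe of this kind can be sketched in a few lines. The example below is a minimal illustration under stated assumptions, not the report's evaluation code: the data are synthetic, the embedding has 4 dimensions instead of 256, the regularization strength `alpha=1.0` is arbitrary, and the helper names (`solve`, `fit_ridge`, `predict`, `r2_score`, `demo`) are hypothetical.

```python
import random


def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def fit_ridge(X, y, alpha):
    """Ridge fit with an unpenalized intercept; returns (weights, intercept)."""
    n, d = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(d)]
    y_mean = sum(y) / n
    Xc = [[row[j] - x_mean[j] for j in range(d)] for row in X]
    yc = [v - y_mean for v in y]
    # Normal equations: (Xc^T Xc + alpha * I) w = Xc^T yc
    gram = [[sum(r[i] * r[j] for r in Xc) + (alpha if i == j else 0.0)
             for j in range(d)] for i in range(d)]
    rhs = [sum(r[i] * t for r, t in zip(Xc, yc)) for i in range(d)]
    w = solve(gram, rhs)
    b = y_mean - sum(wi * mi for wi, mi in zip(w, x_mean))
    return w, b


def predict(X, w, b):
    return [sum(wi * xi for wi, xi in zip(w, row)) + b for row in X]


def r2_score(y_true, y_pred):
    """Coefficient of determination; negative when worse than predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def demo(seed=0):
    """Fit on synthetic 'training cells', score on synthetic 'held-out cells'."""
    rng = random.Random(seed)

    def make(n):
        X = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(n)]
        y = [2.0 * r[0] - r[1] + rng.gauss(0, 0.1) for r in X]
        return X, y

    X_train, y_train = make(200)
    X_test, y_test = make(100)
    w, b = fit_ridge(X_train, y_train, alpha=1.0)  # embeddings stay fixed; only w, b are fit
    return r2_score(y_test, predict(X_test, w, b))
```

Because the synthetic target is linear in the features, the probe recovers it almost exactly; a target that depends on the embedding only through nonlinear structure would score lower under the same probe, which is the sense in which a linear probe is a lower bound.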

Spatially blocked holdout

No neighbor leakage

Random cell-level splits leak spatial autocorrelation: a held-out cell's near-identical neighbors sit in the training set, and the probe interpolates rather than generalizes. All results here use contiguous-block holdout: 1,643 blocks of roughly $10\ \text{km}^2$, with 19.9% of cells held out by block, so no held-out cell has a training neighbor inside its block. On this dataset, blocked-split and random-split results agree to within 0.01 $R^2$; this agreement was verified rather than assumed.
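One way to build a block-level split is sketched below. It is an assumption-laden illustration rather than the report's construction: it treats blocks as squares on projected cell-centroid coordinates in kilometres, and it decides each block's fate by hashing the block index with a hypothetical split-version salt so that the assignment is deterministic. The names `block_id`, `is_held_out`, and `blocked_split` are hypothetical.

```python
import hashlib


def block_id(x_km, y_km, block_side_km):
    """Index of the square block containing a cell centroid (projected coordinates, km)."""
    return (int(x_km // block_side_km), int(y_km // block_side_km))


def is_held_out(block, holdout_fraction, salt="split-v1"):
    """Deterministic pseudo-random decision per block, keyed by a split-version salt."""
    digest = hashlib.sha256(f"{salt}:{block[0]}:{block[1]}".encode()).digest()
    u = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return u < holdout_fraction


def blocked_split(cells, block_side_km, holdout_fraction):
    """cells: iterable of (cell_id, x_km, y_km). Returns (train_ids, test_ids).

    Every cell in a block lands on the same side of the split, so a held-out
    cell never has a training cell inside its own block.
    """
    train, test = [], []
    for cell_id, x_km, y_km in cells:
        block = block_id(x_km, y_km, block_side_km)
        (test if is_held_out(block, holdout_fraction) else train).append(cell_id)
    return train, test
```

The held-out share of cells only approximates `holdout_fraction`, since whole blocks are assigned at once and blocks differ in how many cells they contain.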

Null calibration

Signal, not smoothness

The embedding was also probed against synthetic random fields generated to match the spatial autocorrelation of the real targets. If a representation "predicts" matched noise, its real-target scores are partly artifact. Result: $\lvert R^2 \rvert \le 0.01$ across all null configurations, on both blocked and random splits. The scores in Section 2 are signal, not spatial smoothness.

Circularity exclusion

Every target is external

No quantity derived from the embedding's own training source is used as a validation target. Predicting one's training data through a held-out split measures shared-source agreement, not generalization. All targets below are external: satellite radiance, crowdsourced telecom infrastructure, place data, road networks, employment records, municipal service requests, and property assessment records.

Reproducibility

Every reported number has a manifest entry recording the model artifact, evaluation query hash, split version, cell counts, and regularization. Certified results are re-run under a regression guard: a shift greater than 0.05 $R^2$ between runs blocks publication.
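The guard logic can be illustrated with a small sketch. The record layout below is hypothetical: the field names mirror the items the manifest is said to record, but the actual schema is not given in this report, and all example values are placeholders.

```python
from dataclasses import dataclass, replace

MAX_R2_SHIFT = 0.05  # threshold stated in the report


@dataclass(frozen=True)
class ManifestEntry:
    """Hypothetical manifest record; field names are assumptions."""
    target: str
    model_artifact: str
    query_hash: str
    split_version: str
    n_train_cells: int
    n_test_cells: int
    regularization: float
    r2: float


def blocks_publication(certified, rerun, max_shift=MAX_R2_SHIFT):
    """True when the re-run R^2 moved by more than the allowed shift."""
    return abs(certified.r2 - rerun.r2) > max_shift


# Placeholder values for illustration only.
certified = ManifestEntry(
    target="example-target",
    model_artifact="example-artifact",
    query_hash="example-hash",
    split_version="split-v1",
    n_train_cells=800,
    n_test_cells=200,
    regularization=1.0,
    r2=0.78,
)
stable_rerun = replace(certified, r2=0.76)   # shift 0.02: passes
drifted_rerun = replace(certified, r2=0.70)  # shift 0.08: blocked
```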

Section 02

External validation battery

Held-out $R^2$, blocked split, 10-state coverage, dense targets zero-filled:

External target	Source type	Held-out $R^2$
Nighttime lights, 2023 annual (log radiance)	VIIRS satellite	0.78
Place density (log count)	commercial and open place data	0.67
Cell tower density (log count)	OpenCellID, crowdsourced	0.62
Daytime workforce (log jobs)	federal employment records, block level	0.52
Road network length (log meters)	open road network data	0.30

Construction note for the daytime target: workforce counts were assigned to grid cells by census-block centroid, with block-level population from federal decennial counts as the comparison surface. Both sides reconcile exactly to published state totals (10 states, 63,413,961 population, $0.0000\%$ error), and the assignment shares no geometry with the embedding's training pipeline.

Section 03

Daytime divergence: a measured limitation

A residence-based representation encodes where people live, not where they are during the day. We quantified this rather than asserting it. Cells were stratified by their daytime-to-residential ratio; held-out prediction error for the daytime-workforce target was measured per stratum, with bootstrap confidence intervals (1,000 draws), across all ten states (57,067 held-out cells):

Cell type (daytime / residential ratio)	Mean log under-prediction	$90\%$ CI
Residential (ratio $< 0.5$)	$-0.33$	$[-0.34,\ -0.32]$
Balanced (0.5–2)	$+2.13$	$[+2.11,\ +2.16]$
Workplace-leaning (2–10)	$+3.05$	$[+3.00,\ +3.09]$
Extreme daytime ($\ge 10$; business districts, airports)	$+3.45$	$[+3.36,\ +3.53]$

Under-prediction increases monotonically with workplace dominance. The pattern holds individually in 9 of 10 states; the tenth has too few extreme-daytime cells for the stratum to be adequately powered. The same blindness is localized independently by a second physical source: cell-tower density is under-predicted in the same workplace-dominated strata.
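The stratified measurement can be sketched as a percentile bootstrap over per-cell log residuals. This is a minimal illustration with several assumptions: under-prediction is taken as log actual minus log predicted (positive means the probe predicts too little), stratum boundaries are treated as half-open intervals, and the percentile method stands in for whichever bootstrap variant was actually used. The function names are hypothetical.

```python
import random
from statistics import mean


def stratum(ratio):
    """Map a daytime/residential ratio to a stratum (half-open bins are an assumption)."""
    if ratio < 0.5:
        return "residential"
    if ratio < 2:
        return "balanced"
    if ratio < 10:
        return "workplace-leaning"
    return "extreme daytime"


def bootstrap_ci(values, n_draws=1000, level=0.90, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    n = len(values)
    means = sorted(mean(rng.choices(values, k=n)) for _ in range(n_draws))
    lo_idx = int((1 - level) / 2 * n_draws)
    hi_idx = int((1 + level) / 2 * n_draws) - 1
    return means[lo_idx], means[hi_idx]


def under_prediction_by_stratum(rows):
    """rows: iterable of (ratio, log_actual, log_predicted) for held-out cells.

    Returns {stratum: (mean under-prediction, CI low, CI high)}.
    """
    groups = {}
    for ratio, log_actual, log_predicted in rows:
        groups.setdefault(stratum(ratio), []).append(log_actual - log_predicted)
    return {name: (mean(vals), *bootstrap_ci(vals)) for name, vals in groups.items()}
```

A stratum with very few cells produces an interval that is narrow or wide for reasons unrelated to the model, which is why an under-powered stratum is better flagged than reported.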

Pre-registered falsification test

This measurement defines a falsification test for the roadmap: a daytime-workforce extension layer is required to raise daytime $R^2$ with the improvement concentrated in these divergence cells, or it does not ship. Every future layer faces an equivalent pre-registered test.

Section 04

Independence from population

The first question asked of any place representation is whether it is a population proxy. Three tests address this.

4.1 · Wealth axis

Target: statewide assessed property values from municipal assessor records, 2.56 million parcels aggregated to grid cells. The embedding alone reaches $R^2$ 0.78 (population alone: 0.40). After residualizing both the target and the prediction on log population (two-stage partialling), the held-out partial correlation is 0.78. The embedding carries a wealth dimension nearly orthogonal to population density. This is also the first dollar-denominated quantity the representation has been scored against.
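Two-stage partialling can be sketched as follows. The example is illustrative only: the data are synthetic, the held-out split is omitted for brevity, and the names `residualize`, `pearson`, `partial_corr`, and `demo` are hypothetical. It contrasts a prediction that carries a population-independent signal with one that is a pure population proxy.

```python
import random
from math import sqrt


def residualize(y, x):
    """Residuals of y after a least-squares fit on x with an intercept."""
    n = len(y)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return [b - my - slope * (a - mx) for a, b in zip(x, y)]


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(va * vb)


def partial_corr(target, prediction, control):
    """Correlation between target and prediction after removing the control from both."""
    return pearson(residualize(target, control), residualize(prediction, control))


def demo(seed=0, n=2000):
    """Returns (partial r of an informed prediction, partial r of a population proxy)."""
    rng = random.Random(seed)
    log_pop = [rng.gauss(0, 1) for _ in range(n)]
    wealth = [rng.gauss(0, 1) for _ in range(n)]  # independent of population
    target = [p + w for p, w in zip(log_pop, wealth)]
    informed = [p + w + rng.gauss(0, 0.1) for p, w in zip(log_pop, wealth)]
    proxy = [p + rng.gauss(0, 0.1) for p in log_pop]
    return partial_corr(target, informed, log_pop), partial_corr(target, proxy, log_pop)
```

In this synthetic setup the informed prediction keeps a partial correlation near 1 while the proxy drops to roughly zero, so a representation that only re-encodes population cannot score well on this test.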

4.2 · Cross-city transfer

A probe was trained on New York City 311 service-request density (2021–2022) and applied unchanged to Boston — no retraining, no Boston labels. Held-out rank correlation on Boston cells: Spearman 0.60. Raw-magnitude $R^2$ is negative under cold transfer, as expected: the two cities differ in base call volume, so levels shift while rank structure carries. The substantive claim is rank transfer — the property required to deploy a representation where no labels exist. Caveat: 311 volume reflects reporting propensity as well as underlying conditions; we treat this as a transfer demonstration, not a measure of municipal need.
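The difference between rank transfer and magnitude transfer is easy to see on a toy example. The sketch below uses made-up numbers, not the 311 data: predictions that preserve ordering but sit at the wrong level give a perfect Spearman correlation and a strongly negative $R^2$. Function names are hypothetical.

```python
from math import sqrt


def ranks(values):
    """Average ranks (1-based), with ties sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / sqrt(va * vb)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(a), ranks(b))


def r2_score(y_true, y_pred):
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Toy values: the target city has a lower base level than the source city,
# so an unchanged probe predicts the right ordering at the wrong level.
actual = [1.0, 2.0, 3.0, 4.0, 5.0]
predicted = [11.0, 12.0, 13.0, 14.0, 15.0]

rank_agreement = spearman(actual, predicted)  # ordering fully preserved
level_fit = r2_score(actual, predicted)       # far below zero
```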

4.3 · Representation versus raw inputs

For every target available on the same cells, three probes were trained on identical splits: (a) the full set of raw census input variables, (b) those variables compressed to the embedding's effective dimensionality (the capacity-matched comparison), and (c) the embedding. Lift of the embedding over the capacity-matched baseline:

Target	Embedding $R^2$	Lift vs capacity-matched raw
Assessed property value	0.78	$+0.25$
Nighttime lights	0.73	$+0.20$
Daytime workforce	0.42	$+0.17$
Place density	0.50	$+0.14$
Cell towers	0.34	$+0.03$
Road length	0.04	$-0.05$

Mean lift is $+0.12\ R^2$; the embedding also outperforms the full uncompressed input set on average ($+0.10$), even though that set has roughly twenty times the effective dimensionality. Road length is a loss and is reported as one: physical infrastructure is read better directly from raw variables, and the gap motivates a planned extension layer rather than supporting a claim.
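The mean lift can be checked directly from the table. The short script below copies the six rows as published and also derives the capacity-matched baseline each row implies (embedding $R^2$ minus lift); the variable and function names are hypothetical.

```python
from statistics import mean

# (embedding R^2, lift vs capacity-matched raw), copied from the table in Section 4.3
RESULTS = {
    "Assessed property value": (0.78, 0.25),
    "Nighttime lights": (0.73, 0.20),
    "Daytime workforce": (0.42, 0.17),
    "Place density": (0.50, 0.14),
    "Cell towers": (0.34, 0.03),
    "Road length": (0.04, -0.05),
}


def mean_lift(results):
    """Unweighted mean of the per-target lifts."""
    return mean(lift for _, lift in results.values())


def implied_baseline(results):
    """Capacity-matched baseline R^2 implied by each row: embedding R^2 minus lift."""
    return {name: round(r2 - lift, 2) for name, (r2, lift) in results.items()}


print(f"mean lift: {mean_lift(RESULTS):+.2f}")
for name, baseline in implied_baseline(RESULTS).items():
    print(f"{name}: capacity-matched baseline R^2 = {baseline:.2f}")
```

The lifts sum to 0.74 over six targets, which rounds to the reported $+0.12$.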

Section 05

Known limitations

The representation is static. One vector per cell, no time axis. Temporally varying phenomena (diurnal cycles, seasonality, events) are out of scope for this version and are excluded from claims.
Seasonal population is a known blind spot. Resort and vacation-home geographies swing several-fold seasonally; a residence-based static representation does not capture this. It is on the measured-limitations list, not hidden inside an average.
Road network signal is weak (Section 4.3) — the one target where raw inputs beat the representation.
OpenCellID carries crowdsourcing bias (collection density follows contributor activity). It is used as one of several triangulating sources, never as sole ground truth.
Small-sample results are quarantined. Strata or test cells below adequate power (e.g., one state's extreme-daytime stratum; sub-20-cell evaluation folds) are flagged in the underlying reports and excluded from the claims above.

Section 06

Replication

The evaluation sample (one full state: 27,837 cells, all 256 dimensions, plus the exact holdout fold assignments) supports independent replication of every result in this report, including the blocked-split construction. Validation targets used here are public datasets (VIIRS, OpenCellID, municipal 311, municipal assessment rolls). We invite re-scoring on the provided folds, on independent splits, and against reviewers' own targets — with the one methodological request that census-derived targets be treated as circular and scored accordingly.

Invitation to re-score

Re-run on the provided folds. Bring your own targets. The only ask: treat census-derived quantities as circular and score them accordingly. Every number in this report is manifest-tracked and reproducible.

Ether Data, June 2026. Contact: nate@etherdata.ai.