Spatial Interpolation for Yield Data in Python

Spatial interpolation for yield data converts discrete, noisy harvest monitor point logs into continuous agronomic raster surfaces — the output that makes management zone classification algorithms and variable-rate prescription generation possible. Raw yield points are inherently irregular: they are logged at 1–2 Hz, offset from the actual crop position by GPS latency and header-width geometry, and contaminated by combine speed transients, unloading pauses, and headland turns. Without a rigorous interpolation step that first cleans these artefacts and then fits a spatially coherent model, every downstream prescription inherits systematic bias.

This page covers the full engineering path from a raw yield CSV to a validated, field-masked GeoTIFF, including geostatistical theory, production Python code with Ordinary Kriging and IDW fallback, a parameter reference table, failure-mode analysis, and cross-validation quality gates. The result feeds directly into the Yield Mapping & Variable Rate Prescription Generation pipeline as the foundational continuous layer.

Figure: Yield interpolation pipeline. Five sequential stages take raw harvest monitor points to a field-masked GeoTIFF raster ready for zone classification: (1) raw harvest monitor points (CSV / GeoJSON); (2) clean and thin (speed / moisture filters, reproject to UTM); (3) variogram fit and model selection (Spherical / Exponential / Gaussian); (4) kriging grid or IDW fallback at 5–10 m resolution; (5) masked COG GeoTIFF plus LOOCV, zone-ready output.

Prerequisites

Requirement	Detail
Python	3.10+
`geopandas`	≥ 1.0
`rasterio`	≥ 1.3
`pykrige`	≥ 1.7
`scipy`	≥ 1.12
`numpy`	≥ 1.26
`scikit-learn`	≥ 1.4
`shapely`	≥ 2.0

TEXT

pip install "geopandas>=1.0" "rasterio>=1.3" "pykrige>=1.7" \
    "scipy>=1.12" "numpy>=1.26" "scikit-learn>=1.4" "shapely>=2.0"

Input data format: A yield monitor CSV or GeoJSON containing at minimum: geometry (point), yield_tha (t/ha or bu/ac — state units explicitly), moisture_pct, speed_kmh, and timestamp. All geometries must share a single CRS; a projected system with metric units (e.g., EPSG:32614, EPSG:32630) is required before any distance calculation. An accompanying field boundary polygon GeoDataFrame is needed for masking.
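Before any filtering, it can help to confirm that the file matches the input format described above. The following is a minimal sketch, assuming the CSV layout used later on this page (`lon`, `lat`, `yield_tha`, `moisture_pct`, `speed_kmh`); the helper name `check_yield_schema` is hypothetical and not part of any library.

```python
import pandas as pd

# Column names follow the CSV layout used in Step 1; adjust them to your monitor export.
REQUIRED_COLUMNS = ["lon", "lat", "yield_tha", "moisture_pct", "speed_kmh"]

def check_yield_schema(df: pd.DataFrame, required=REQUIRED_COLUMNS) -> list[str]:
    """Return a list of human-readable problems; an empty list means the frame looks usable."""
    problems = []
    missing = [c for c in required if c not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
    non_numeric = [
        c for c in required
        if c in df.columns and not pd.api.types.is_numeric_dtype(df[c])
    ]
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    # Range check only when both coordinate columns are present and numeric.
    coords_ok = all(c in df.columns and c not in non_numeric for c in ("lon", "lat"))
    if coords_ok:
        lon, lat = df["lon"].dropna(), df["lat"].dropna()
        if not lon.between(-180, 180).all() or not lat.between(-90, 90).all():
            problems.append("lon/lat outside geographic range; is the file already projected?")
    return problems
```

Calling `check_yield_schema(pd.read_csv(RAW_CSV))` before `load_and_clean` surfaces schema problems as messages rather than as a `KeyError` deep inside the filter step.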

Hardware assumptions: RTK GPS accuracy of ≤ 30 cm is assumed. Sub-metre horizontal accuracy is sufficient for 5–10 m cell sizes; if you are using WAAS/SBAS-corrected data only (±1–3 m), consider 10 m cells to avoid under-sampling the footprint.

1. Concept & Algorithm

Why spatial interpolation is not optional

Harvest monitors log a yield estimate roughly every 1–2 seconds at forward speeds of 5–8 km/h. At 6 km/h (about 1.67 m/s) and one reading per second, that produces a point every 1.7 m along the pass, far denser than the scale of agronomic variation in most fields. Yet the points are clustered along parallel swath lines, and yield at any location between the logged positions is never measured directly. Spatial interpolation bridges those gaps by estimating yield at unsampled locations using statistical relationships derived from the sampled points.

Ordinary Kriging

Ordinary Kriging (OK) is the standard geostatistical interpolator for yield data. It is a best linear unbiased estimator (BLUE): among all unbiased linear combinations of the observed values, it minimises the prediction variance. The key insight is that OK derives its weights from the variogram, a function describing how dissimilarity grows with separation distance, rather than from an arbitrary distance decay. This gives OK two advantages over simpler methods:

Quantified uncertainty. OK produces a kriging variance map alongside the estimated yield surface. High-variance areas (sparse coverage, headlands) can be flagged and excluded from prescription calculations.
Noise handling. A nugget term in the variogram absorbs GPS and sensor measurement error, preventing the interpolator from chasing noise.

Fitting the experimental variogram requires binning point pairs by lag distance and computing the semivariance (half the mean squared difference) in each bin. A theoretical model (Spherical, Exponential, or Gaussian) is then fitted by nonlinear least squares to recover three parameters:

Range: the lag distance beyond which spatial autocorrelation vanishes (typically 50–300 m for field-scale yield variation).
Sill: the total variance of the process, i.e. the plateau the curve reaches at the range (partial sill + nugget).
Nugget: the y-intercept of the fitted curve; represents variance at zero lag (measurement noise, GPS error).
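The binning and fitting steps described above can be sketched with NumPy and SciPy alone. This is an illustrative sketch, not the routine PyKrige runs internally: the function names, the choice of half the maximum distance as the lag cutoff, and the pair-count weighting are all assumptions made for the example.

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.spatial.distance import pdist

def experimental_variogram(xy, z, n_lags=15, max_lag=None):
    """Bin point pairs by separation; return (mean lag, semivariance, pair count) per bin."""
    xy, z = np.asarray(xy, dtype=float), np.asarray(z, dtype=float)
    d = pdist(xy)                                       # pairwise separation distances
    g = 0.5 * pdist(z.reshape(-1, 1), "sqeuclidean")    # half squared differences, same pair order
    max_lag = d.max() / 2 if max_lag is None else max_lag
    edges = np.linspace(0.0, max_lag, n_lags + 1)
    which = np.digitize(d, edges) - 1                   # bin index; pairs beyond max_lag are ignored
    lags, gamma, counts = [], [], []
    for b in range(n_lags):
        sel = which == b
        if sel.any():
            lags.append(d[sel].mean())
            gamma.append(g[sel].mean())
            counts.append(sel.sum())
    return np.array(lags), np.array(gamma), np.array(counts)

def spherical(h, nugget, psill, a_range):
    """Spherical model: rises from the nugget and is flat at nugget + psill beyond a_range."""
    h = np.asarray(h, dtype=float)
    rising = nugget + psill * (1.5 * h / a_range - 0.5 * (h / a_range) ** 3)
    return np.where(h < a_range, rising, nugget + psill)

def fit_spherical(lags, gamma, counts):
    """Nonlinear least squares fit; bins with more pairs get more influence."""
    p0 = [gamma.min(), max(gamma.max() - gamma.min(), 1e-6), lags.max() / 2]
    bounds = ([0.0, 0.0, 1e-6], [np.inf, np.inf, np.inf])
    (nugget, psill, a_range), _ = curve_fit(
        spherical, lags, gamma, p0=p0, bounds=bounds, sigma=1.0 / np.sqrt(counts)
    )
    return {"nugget": nugget, "sill": nugget + psill, "range": a_range}
```

The returned dictionary reports the sill as nugget plus partial sill, matching the definition above, so a nugget-to-sill ratio can be read directly from it.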

When kriging fails — too few points, negative eigenvalues in the covariance matrix, or a flat variogram — an Inverse Distance Weighting (IDW) fallback with power=2 provides a deterministic, numerically stable alternative at the cost of losing uncertainty estimation.

Choosing cell size

Five to ten metres is the agronomic sweet spot for field-scale prescription work. Finer cells (1–3 m) oversample the interpolation and introduce apparent spatial precision that the variogram cannot support. Coarser cells (≥ 20 m) may smooth out intra-field yield variation that is agronomically meaningful. Match cell size to the combine swath width (typically 6–12 m) when in doubt.

2. Step-by-Step Implementation

Step 1 — Ingest and filter harvest monitor data

PYTHON

import logging
import numpy as np
import pandas as pd
import geopandas as gpd
from shapely.geometry import Point

logging.basicConfig(level=logging.INFO, format="%(levelname)s: %(message)s")

RAW_CSV = "harvest_2024.csv"        # columns: lon, lat, yield_tha, moisture_pct, speed_kmh
BOUNDARY_GPKG = "field_boundary.gpkg"
TARGET_EPSG = 32614                  # UTM zone 14 N — adjust to your field

def load_and_clean(csv_path: str, target_epsg: int) -> gpd.GeoDataFrame:
    df = pd.read_csv(csv_path)

    # Drop rows missing required columns
    required = ["lon", "lat", "yield_tha", "moisture_pct", "speed_kmh"]
    df = df.dropna(subset=required)

    # Agronomic filters (corn; adjust per crop)
    before = len(df)
    df = df[
        (df["speed_kmh"].between(3.0, 15.0)) &
        (df["moisture_pct"].between(12.0, 35.0)) &
        (df["yield_tha"] > 0)
    ]
    p1, p99 = df["yield_tha"].quantile([0.01, 0.99])
    df = df[df["yield_tha"].between(p1, p99)]
    logging.info(f"Filtered {before - len(df)} rows; {len(df)} remain.")

    gdf = gpd.GeoDataFrame(
        df,
        geometry=[Point(xy) for xy in zip(df["lon"], df["lat"])],
        crs="EPSG:4326"
    )
    # Reproject to projected CRS — mandatory before distance operations
    gdf = gdf.to_crs(epsg=target_epsg)
    assert gdf.crs.is_projected, "CRS must be projected before interpolation."
    return gdf

yield_gdf = load_and_clean(RAW_CSV, TARGET_EPSG)
boundary = gpd.read_file(BOUNDARY_GPKG).to_crs(epsg=TARGET_EPSG)

# Sanity check
assert len(yield_gdf) > 50, "Too few points after filtering — check agronomic thresholds."
assert yield_gdf.crs.equals(boundary.crs), "CRS mismatch between points and boundary."

Step 2 — Spatial thinning to prevent kriging singularity

High-frequency logging creates spatially clustered points that inflate the kriging covariance matrix condition number. Grid-based thinning reduces density to approximately one point per target cell:

PYTHON

def grid_thin(gdf: gpd.GeoDataFrame, cell_size: float) -> gpd.GeoDataFrame:
    """Keep one point per grid cell: the observation closest to that cell's median yield."""
    gdf = gdf.reset_index(drop=True)          # unique labels so idxmin/loc are unambiguous
    x = gdf.geometry.x.values
    y = gdf.geometry.y.values
    col_idx = ((x - x.min()) / cell_size).astype(int)
    row_idx = ((y - y.min()) / cell_size).astype(int)
    # Single integer key per cell (avoids grouping on tuples)
    cell_id = pd.Series(row_idx * (col_idx.max() + 1) + col_idx, index=gdf.index)
    # Absolute deviation of each point from its cell median; keep the smallest per cell
    cell_median = gdf["yield_tha"].groupby(cell_id).transform("median")
    deviation = (gdf["yield_tha"] - cell_median).abs()
    keep = deviation.groupby(cell_id).idxmin()
    thinned = gdf.loc[keep.values]
    logging.info(f"Thinned {len(gdf)} → {len(thinned)} points (cell_size={cell_size} m).")
    return thinned.reset_index(drop=True)

CELL_SIZE = 6.0   # metres — match combine swath width
yield_thinned = grid_thin(yield_gdf, CELL_SIZE)

Step 3 — Fit variogram and run Ordinary Kriging with IDW fallback

PYTHON

from pykrige.ok import OrdinaryKriging
from rasterio.transform import from_origin

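def build_interpolation_grid(boundary_gdf: gpd.GeoDataFrame, cell_size: float):
    x_min, y_min, x_max, y_max = boundary_gdf.total_bounds
    n_cols = int(np.ceil((x_max - x_min) / cell_size))
    n_rows = int(np.ceil((y_max - y_min) / cell_size))
    # Cell-centre coordinates. grid_y runs north to south so that row 0 of the
    # kriged array is the top row, matching a transform anchored at (x_min, y_max).
    grid_x = x_min + (np.arange(n_cols) + 0.5) * cell_size
    grid_y = y_max - (np.arange(n_rows) + 0.5) * cell_size
    transform = from_origin(x_min, y_max, cell_size, cell_size)
    return grid_x, grid_y, transform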

def run_kriging(
    gdf: gpd.GeoDataFrame,
    grid_x: np.ndarray,
    grid_y: np.ndarray,
    variogram_model: str = "spherical",
) -> tuple[np.ndarray, np.ndarray]:
    x = gdf.geometry.x.values
    y = gdf.geometry.y.values
    z = gdf["yield_tha"].values.astype(np.float64)

    try:
        ok = OrdinaryKriging(
            x, y, z,
            variogram_model=variogram_model,
            nlags=20,
            weight=True,
            verbose=False,
            enable_plotting=False,
        )
        z_pred, z_var = ok.execute("grid", grid_x, grid_y)
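        # For spherical/exponential/gaussian models PyKrige stores the fitted
        # parameters as [partial sill, range, nugget]; full sill = psill + nugget.
        psill, _, nugget = ok.variogram_model_parameters
        logging.info(f"OK completed. Nugget/Sill ratio: {nugget / (psill + nugget):.3f}")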
        return np.array(z_pred), np.array(z_var)

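    except Exception as exc:
        logging.warning(f"Ordinary Kriging failed ({exc}); falling back to IDW (power=2).")
        from scipy.spatial import cKDTree
        gx, gy = np.meshgrid(grid_x, grid_y)
        k = min(12, len(z))  # nearest samples used per cell (assumed value; tune per field)
        dist, idx = cKDTree(np.column_stack([x, y])).query(
            np.column_stack([gx.ravel(), gy.ravel()]), k=k
        )
        dist, idx = dist.reshape(-1, k), idx.reshape(-1, k)
        weights = 1.0 / np.maximum(dist, 1e-9) ** 2  # inverse distance, power = 2
        z_idw = (weights * z[idx]).sum(axis=1) / weights.sum(axis=1)
        z_idw = z_idw.reshape(gx.shape)
        # IDW yields no prediction variance, so return NaN in its place
        return z_idw, np.full_like(z_idw, np.nan)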

grid_x, grid_y, transform = build_interpolation_grid(boundary, CELL_SIZE)
z_grid, z_var = run_kriging(yield_thinned, grid_x, grid_y)

assert not np.all(np.isnan(z_grid)), "Interpolation returned all NaN — check point count."
logging.info(f"Grid shape: {z_grid.shape}, range: "
             f"{np.nanmin(z_grid):.2f}–{np.nanmax(z_grid):.2f} t/ha")

Step 4 — Mask to field boundary and export as Cloud-Optimised GeoTIFF

PYTHON

import rasterio
from rasterio.features import rasterize

def mask_to_boundary(
    grid: np.ndarray,
    boundary_gdf: gpd.GeoDataFrame,
    transform,
) -> np.ndarray:
    rows, cols = grid.shape
    shapes = [(geom, 1) for geom in boundary_gdf.geometry]
    mask = rasterize(shapes, out_shape=(rows, cols), transform=transform, fill=0, dtype="uint8")
    return np.where(mask == 1, grid, np.nan)

z_masked = mask_to_boundary(z_grid, boundary, transform)

def export_cog(
    grid: np.ndarray,
    transform,
    crs,
    output_path: str,
    nodata: float = -9999.0,
) -> None:
    rows, cols = grid.shape
    data = np.where(np.isnan(grid), nodata, grid).astype(np.float32)
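    # Tiled, DEFLATE-compressed GeoTIFF. COG readers typically also expect internal
    # overviews on larger rasters; GDAL's dedicated "COG" driver (GDAL >= 3.1) adds them.
    profile = {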
        "driver": "GTiff",
        "height": rows,
        "width": cols,
        "count": 1,
        "dtype": "float32",
        "crs": crs,
        "transform": transform,
        "nodata": nodata,
        "compress": "deflate",
        "tiled": True,
        "blockxsize": 256,
        "blockysize": 256,
        "interleave": "band",
    }
    with rasterio.open(output_path, "w", **profile) as dst:
        dst.write(data, 1)
        dst.update_tags(1, description="Interpolated yield (t/ha)")
    logging.info(f"Exported COG GeoTIFF → {output_path}")

export_cog(z_masked, transform, boundary.crs, "yield_interpolated_2024.tif")

3. Key Parameters & Tuning

Parameter	Type	Default	Agronomic Effect
`cell_size`	`float`	`6.0` m	Controls prescription resolution; match to combine swath width. Finer than 3 m introduces false precision.
`variogram_model`	`str`	`"spherical"`	Spherical suits gradual field-scale yield transitions; Exponential handles sharper boundaries (soil texture change).
`nlags`	`int`	`20`	Number of lag bins for experimental variogram. Increase to 30–40 for large fields (>200 ha) with long-range autocorrelation.
`weight`	`bool`	`True`	Weights lag bins by pair count — reduces influence of poorly sampled lag distances at field edges.
`speed_kmh` lower bound	`float`	`3.0`	Below this, combine is turning or stopped. Raising to 4.0 km/h removes more headland data but reduces header-fill artefacts.
`moisture_pct` bounds	`tuple`	`(12, 35)`	Crop-specific. Soybean: 10–25 %. Wheat: 10–20 %. Out-of-range readings indicate sensor fault or green-crop error.
IDW `power`	`float`	`2.0`	Higher power (3–4) produces more localised estimates useful for high-density RTK data.

4. Handling Edge Cases & Failure Modes

Singular covariance matrix during kriging This occurs when duplicate or near-duplicate coordinates exist after thinning, or when the point cloud is too small (< 30 points). Fix: add a tiny jitter to exact duplicates (coords += np.random.uniform(-0.05, 0.05, coords.shape)) and verify that grid_thin ran before calling OrdinaryKriging.
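The inline jitter above perturbs every coordinate. A slightly more conservative variant, sketched below under the assumption that coordinates are held in an (n, 2) NumPy array of projected metres, nudges only the repeated occurrences of a location and leaves the first occurrence untouched. The function name `jitter_duplicates` is hypothetical.

```python
import numpy as np

def jitter_duplicates(coords, amplitude: float = 0.05, seed: int = 0) -> np.ndarray:
    """Return a copy of `coords` with repeats of a location moved by up to `amplitude` metres."""
    coords = np.asarray(coords, dtype=float).copy()
    # first_idx holds the row index of the first occurrence of each unique location
    _, first_idx = np.unique(coords, axis=0, return_index=True)
    is_repeat = np.ones(len(coords), dtype=bool)
    is_repeat[first_idx] = False
    rng = np.random.default_rng(seed)
    coords[is_repeat] += rng.uniform(-amplitude, amplitude, size=(int(is_repeat.sum()), 2))
    return coords
```

A 5 cm amplitude stays well below the 5–10 m cell size, so the nudge should not change which grid cell a point falls in except at cell edges.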

Flat experimental variogram (nugget ≈ sill) All spatial structure has been removed, usually because the yield signal is dominated by measurement noise or the swath-level thinning was too aggressive. Check: plot ok.lags vs ok.semivariance; if the curve is horizontal from lag 0, reduce the thinning cell size and re-run. Also check that the moisture_pct and speed_kmh filters are not removing the highest-yield passes disproportionately.
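The visual check for a flat variogram can also be automated. The sketch below is one possible heuristic, not an established test: it compares the semivariance of the shortest lag bin with the mean over the longer half of the lags, and the 0.9 tolerance is an assumed value to tune per dataset.

```python
import numpy as np

def variogram_is_flat(lags, semivariance, tol: float = 0.9) -> bool:
    """True when the shortest-lag semivariance already sits at the long-lag plateau."""
    lags = np.asarray(lags, dtype=float)
    sv = np.asarray(semivariance, dtype=float)[np.argsort(lags)]
    plateau = sv[len(sv) // 2:].mean()   # mean semivariance over the longer half of the lags
    return bool(sv[0] >= tol * plateau)
```

With a fitted PyKrige model this could be called as `variogram_is_flat(ok.lags, ok.semivariance)`, using the two attributes mentioned in the paragraph above.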

CRS mismatch with the field boundary A CRS or datum mismatch misregisters the layers: a datum difference alone can shift positions by metres to tens of metres, and mixing geographic degrees with projected metres misplaces the data entirely, causing the boundary mask to clip real yield data. Always assert yield_gdf.crs.equals(boundary.crs) before calling mask_to_boundary. As covered in Understanding CRS in Precision Agriculture, the reliable approach is to nominate a single target EPSG at ingest time and reproject every layer to it immediately.

UTM zone boundary crossings Fields that straddle a UTM zone boundary (e.g., zone 14/15 at 96° W) must be processed in a single projected CRS. Pick the zone covering the majority of the field and reproject any points that fall in the adjacent zone. Splitting the processing causes a visible seam in the interpolated surface.
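Picking the zone can be done from the field centroid in geographic coordinates. The sketch below uses the standard 6-degree zone arithmetic and the EPSG 326xx (north) / 327xx (south) numbering for WGS84 UTM; it ignores the special-case zones around Norway and Svalbard, and the function name is hypothetical.

```python
import math

def utm_epsg_from_lonlat(lon: float, lat: float) -> int:
    """EPSG code of the WGS84 UTM zone containing a longitude/latitude in degrees."""
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    zone = min(max(zone, 1), 60)          # lon = 180 would otherwise give zone 61
    return (32600 if lat >= 0 else 32700) + zone

# A centroid at 97.5° W, 40° N falls in zone 14 N.
print(utm_epsg_from_lonlat(-97.5, 40.0))
```

For a field straddling 96° W, computing the code once from the centroid and applying it to every layer keeps the whole field in one CRS.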

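Memory overflow on large fields At 1 m resolution a 500 ha field (5,000,000 m²) produces a grid of at least 5 M cells, more once the bounding box extends beyond the boundary, and kriging has to relate every cell to the sample points. Use cell_size ≥ 5 for production or process the field in spatial tiles using rasterio.windows. After tiling, mosaic the results with rasterio.merge.merge before masking to avoid boundary artefacts.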
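A quick estimate before running the interpolation shows how cell size drives memory. The sketch below is back-of-envelope arithmetic only; it assumes 8-byte floats and, as a worst case, that the solver materialises a full cell-by-sample distance matrix, which a vectorised implementation may do. The function name and returned keys are hypothetical.

```python
import math

def estimate_grid(width_m: float, height_m: float, cell_size: float,
                  n_points: int, bytes_per_value: int = 8) -> dict:
    """Rough grid dimensions and memory footprint for a bounding box of the given size."""
    n_cols = math.ceil(width_m / cell_size)
    n_rows = math.ceil(height_m / cell_size)
    n_cells = n_rows * n_cols
    return {
        "shape": (n_rows, n_cols),
        "cells": n_cells,
        "raster_mb": n_cells * bytes_per_value / 1e6,
        # Worst case: one distance per (cell, sample) pair held in memory at once
        "cell_to_point_gb": n_cells * n_points * bytes_per_value / 1e9,
    }

# 2500 m x 2000 m bounding box (500 ha) with 10,000 thinned points
print(estimate_grid(2500, 2000, 1.0, 10_000))
print(estimate_grid(2500, 2000, 5.0, 10_000))
```

Moving from 1 m to 5 m cells cuts the cell count by a factor of 25, which is usually the cheapest fix before resorting to tiling.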

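Yield monitor time offset artefacts Harvest monitors log the GPS position of the combine at the moment grain reaches the sensor, not the point where the crop was cut. At 6 km/h (about 1.67 m/s) with a 4 s sensor lag, points are displaced ~7 m forward along the swath. For sub-5 m cell sizes this introduces a systematic positional bias. After projecting, shift each point backward along its own direction of travel by speed_ms * lag_seconds, where speed_ms is speed_kmh / 3.6, the heading is estimated from consecutive positions within a pass, and lag_seconds is obtained from the monitor calibration sheet. A single gdf.geometry.translate(-d, 0) call is not sufficient: it moves points along the x axis only, which is correct only for passes heading due east.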
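One way to apply the backward shift is sketched below with plain NumPy arrays. It assumes the points are in projected metres, ordered by time, and belong to a single pass; headings are estimated from neighbouring positions with `np.gradient`, which will be wrong across headland turns, so split the log by pass first. The function name is hypothetical.

```python
import numpy as np

def shift_against_heading(x, y, speed_ms, lag_seconds: float):
    """Move each point backward along its estimated heading by speed * lag (metres)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    speed = np.asarray(speed_ms, dtype=float)
    dx, dy = np.gradient(x), np.gradient(y)      # local direction of travel
    norm = np.hypot(dx, dy)
    moving = norm > 0
    # Unit heading vectors; stationary points get a zero vector and are left in place
    ux = np.divide(dx, norm, out=np.zeros_like(dx), where=moving)
    uy = np.divide(dy, norm, out=np.zeros_like(dy), where=moving)
    offset = speed * lag_seconds
    return x - ux * offset, y - uy * offset
```

The corrected arrays can be written back with `gpd.points_from_xy(new_x, new_y)` before thinning, so that the median-per-cell selection works on the corrected positions.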

5. Verification & Output Validation

Leave-one-out cross-validation

PYTHON

from sklearn.model_selection import LeaveOneOut
from sklearn.metrics import mean_squared_error, r2_score

def loocv_kriging(gdf: gpd.GeoDataFrame, variogram_model: str = "spherical") -> dict:
    x = gdf.geometry.x.values
    y = gdf.geometry.y.values
    z = gdf["yield_tha"].values.astype(np.float64)

    preds = np.empty_like(z)
    loo = LeaveOneOut()
    for train_idx, test_idx in loo.split(z):
        ok = OrdinaryKriging(
            x[train_idx], y[train_idx], z[train_idx],
            variogram_model=variogram_model,
            nlags=20,       # same variogram settings as the production run in Step 3
            weight=True,
            verbose=False,
            enable_plotting=False,
        )
        pred, _ = ok.execute("points", x[test_idx], y[test_idx])
        preds[test_idx] = pred

    rmse = np.sqrt(mean_squared_error(z, preds))
    r2 = r2_score(z, preds)
    logging.info(f"LOOCV — RMSE: {rmse:.3f} t/ha, R²: {r2:.3f}")
    return {"rmse_tha": rmse, "r2": r2}

# Run on thinned dataset (LOOCV on the full raw dataset is prohibitively slow)
metrics = loocv_kriging(yield_thinned)
assert metrics["r2"] > 0.5, (
    f"R² = {metrics['r2']:.3f} — interpolation quality is too low for prescription use. "
    "Check variogram model choice and filter settings."
)

Visual spot-check checklist

Open yield_interpolated_2024.tif in QGIS with a diverging colour ramp centred on the field median yield. Gradient transitions should align with known soil boundaries, slope, or historical management zones — not with swath direction.
Verify that the raster extent exactly matches the boundary polygon (no extrapolated halo outside the field).
Inspect the kriging variance raster (z_var): high variance areas (> 2× the mean) indicate sparse sampling and should be excluded from prescription zone boundaries.
Overlay the raw point cloud and verify that the interpolated surface peaks align with dense, consistent yield readings, not isolated outliers.
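The variance check in the list above can be turned into a boolean mask. This is a minimal sketch assuming `z_var` is the 2-D kriging variance array returned in Step 3; the function name is hypothetical, and the factor of 2 is the threshold quoted in the checklist.

```python
import numpy as np

def flag_high_variance(z_var, factor: float = 2.0) -> np.ndarray:
    """Boolean mask of cells whose kriging variance exceeds `factor` times the mean variance."""
    z_var = np.asarray(z_var, dtype=float)
    flagged = np.zeros(z_var.shape, dtype=bool)
    valid = ~np.isnan(z_var)
    if not valid.any():
        return flagged            # e.g. IDW fallback, where no variance is available
    threshold = factor * z_var[valid].mean()
    flagged[valid] = z_var[valid] > threshold
    return flagged
```

Cells where the mask is true can be set to nodata in a copy of the yield raster before zone delineation, leaving the exported surface itself unchanged.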

6. Integration with the Broader Pipeline

The validated yield raster is the primary input to two downstream stages:

Management zone delineation. The management zone classification algorithms page covers stacking this yield raster with soil apparent electrical conductivity (ECa), digital elevation, and multi-temporal NDVI before running K-Means or Gaussian Mixture zone fitting. Ensure the yield raster is normalised (z-score or min-max per field) before stacking so that scale differences between layers do not dominate the clustering.
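The per-field z-score normalisation mentioned above could look like the following. It is a sketch assuming the masked yield grid uses NaN for cells outside the field, as `mask_to_boundary` produces; the function name is hypothetical.

```python
import numpy as np

def zscore_raster(grid) -> np.ndarray:
    """Standardise a raster to zero mean and unit variance, ignoring NaN (masked) cells."""
    grid = np.asarray(grid, dtype=float)
    mean = np.nanmean(grid)
    std = np.nanstd(grid)
    if not np.isfinite(std) or std == 0:
        raise ValueError("Raster has no valid variation to normalise.")
    return (grid - mean) / std    # NaN cells stay NaN
```

Applying the same function to each layer in the stack puts yield, ECa, and NDVI on a comparable scale before clustering.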

Prescription generation and ISOXML export. Once zones are defined, application rates are assigned per zone and the result is serialised for the cab controller. Before that export, field boundaries must pass shapefile validation for farm equipment to catch geometry violations (self-intersections, invalid ring orientation) that cause silent truncation in ISO 11783 terminals. The variable rate export to ISOXML workflow then handles encoding the prescription as a TaskData element.

The deep dive on fitting variogram models to real-world sparse datasets with high measurement noise is in interpolating sparse yield monitor data with kriging, which extends this workflow with universal kriging, directional variograms, and multi-year kriging with external drift.

This page is part of the Yield Mapping & Variable Rate Prescription Generation guide — see there for the complete pipeline context including data ingestion, quality gating, and ISOXML export.

Frequently Asked Questions

Why does my kriging produce artefacts at field edges?

Ordinary Kriging extrapolates into areas with no nearby sample support. Always clip the interpolation grid to the field boundary polygon before exporting. Use rasterio.features.rasterize with the boundary as a boolean mask and set grid cells outside the polygon to nodata.

Can I run spatial interpolation on WGS84 coordinates?

No. Distance-based methods (kriging, IDW) require Euclidean space. Running them on geographic coordinates (EPSG:4326) distorts lag distances because one degree of longitude shrinks toward the poles. Reproject to the appropriate UTM zone (e.g., EPSG:32614 for central North America) before fitting any variogram.
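The distortion is easy to quantify. The sketch below uses a spherical-Earth approximation of about 111.32 km per degree, which is an assumed round figure rather than a geodetic constant, to show how the ground length of one degree of longitude shrinks with latitude while a degree of latitude stays roughly constant.

```python
import math

METRES_PER_DEGREE = 111_320.0   # approximate, spherical Earth

def degree_lengths(lat_deg: float) -> tuple[float, float]:
    """Approximate ground length in metres of one degree of (longitude, latitude)."""
    lon_m = METRES_PER_DEGREE * math.cos(math.radians(lat_deg))
    return lon_m, METRES_PER_DEGREE

for lat in (0, 40, 60):
    lon_m, lat_m = degree_lengths(lat)
    print(f"lat {lat:>2}°: 1° lon ≈ {lon_m / 1000:.1f} km, 1° lat ≈ {lat_m / 1000:.1f} km")
```

At 40° N an east-west lag measured in degrees covers only about three quarters of the ground distance of the same lag measured north-south, so a variogram fitted in degrees would show artificial anisotropy.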

How many points do I need for reliable Ordinary Kriging?

A practical lower bound is around 50–100 spatially distributed points per field. Below that, the experimental variogram has too few lag pairs to fit a stable model and nugget/sill estimates become unreliable. For sparse datasets use IDW with power=2 and tighten the search radius to the swath width.
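For the sparse-data case, a radius-limited IDW can be sketched with SciPy's `cKDTree`. This is an illustrative implementation under stated assumptions, not a library routine: the function name, the default of 8 neighbours, and returning NaN where no sample lies within the radius are choices made for the example.

```python
import numpy as np
from scipy.spatial import cKDTree

def idw_predict(xy_obs, z_obs, xy_new, power: float = 2.0, radius=None, k: int = 8):
    """Inverse distance weighted estimate at xy_new from the k nearest samples within radius."""
    xy_obs = np.asarray(xy_obs, dtype=float)
    z_obs = np.asarray(z_obs, dtype=float)
    xy_new = np.asarray(xy_new, dtype=float)
    k = min(k, len(z_obs))
    upper = np.inf if radius is None else radius
    dist, idx = cKDTree(xy_obs).query(xy_new, k=k, distance_upper_bound=upper)
    dist = dist.reshape(len(xy_new), k)
    idx = idx.reshape(len(xy_new), k)
    found = np.isfinite(dist)              # missing neighbours are reported with dist = inf
    idx = np.where(found, idx, 0)          # placeholder index; its weight is zero below
    w = np.where(found, 1.0 / np.maximum(dist, 1e-9) ** power, 0.0)
    wsum = w.sum(axis=1)
    out = np.full(len(xy_new), np.nan)     # NaN where no sample lies within the radius
    has_support = wsum > 0
    out[has_support] = (w[has_support] * z_obs[idx[has_support]]).sum(axis=1) / wsum[has_support]
    return out
```

Setting `radius` to the swath width, as suggested above, leaves cells without nearby samples as NaN rather than filling them from distant passes, which makes unsupported areas visible in the output.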

Interpolating Sparse Yield Monitor Data with Kriging — directional variograms, universal kriging with elevation drift, and multi-year analysis
Management Zone Classification Algorithms — stack the yield raster with ECa and NDVI for K-Means zone fitting
Shapefile Validation for Farm Equipment — geometry QA gates before prescription export
Variable Rate Export to ISOXML — encode prescription maps for ISO 11783 cab controllers
Understanding CRS in Precision Agriculture — UTM zone selection, projection validation, and reprojection patterns
