AerialExtreMatch: A Benchmark for Extreme-View Image Matching and Localization

2025


Rouwan Wu1   Zhe Huang2   Xingyi He2   Yan Liu3   Shen Yan1   Sida Peng2   Maojun Zhang1†   Xiaowei Zhou2†  

1National University of Defense Technology    2State Key Lab of CAD&CG, Zhejiang University    3Huazhong University of Science and Technology

TL;DR


We introduce AerialExtreMatch*, a large-scale, high-fidelity dataset tailored for extreme-view image matching and UAV localization. It consists of:
(a) Train Pair: a large-scale synthetic dataset containing 1.5 million RGB–depth image pairs rendered from high-quality 3D scenes, simulating diverse UAV and satellite viewpoints to enable robust training of image matching models.
(b) Evaluation Pair: a 32-level benchmark of ~30,000 pairs graded by overlap, scale, and pitch to enable fine-grained evaluation of image matching models.
(c) Localization: a real-world UAV localization set using DJI M300 RTK+H20T queries matched against both high-quality UAV-derived orthomosaic/DSM and lower-quality satellite maps.
* All code and datasets are available on GitHub.

Method


Pipeline for collecting RGB-depth pairs and estimating the co-visibility mask.

  • Left: Training data generation. Cesium for Unreal provides high-quality 3D models, enabling the rendering of RGB and depth images from diverse aerial viewpoints by sampling camera poses with varying altitudes and angles.
  • Right: Co-visibility estimation. Given intrinsics K, extrinsics P, and RGB-depth image pairs (I1, D1) and (I2, D2), the co-visible masks C12 and C21 are computed by warping 3D points from one view to the other via geometric reprojection.
The RGB–depth image pairs generated by the above pipeline are referred to as the Train Pair and are used to train image matching models. The Evaluation Pair is derived from the same rendering pipeline and is disjoint from the Train Pair.
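The co-visibility estimation step above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes 4×4 world-to-camera extrinsics, same-resolution depth maps, and a relative depth tolerance `depth_tol` for the occlusion check (all our assumptions).

```python
import numpy as np

def covisibility_mask(K1, K2, P1, P2, D1, D2, depth_tol=0.05):
    """Estimate C12: which pixels of view 1 are co-visible in view 2.

    K1, K2: 3x3 intrinsics; P1, P2: 4x4 world-to-camera extrinsics;
    D1, D2: HxW depth maps (assumed the same resolution).
    depth_tol is an assumed relative threshold for the occlusion test.
    """
    H, W = D1.shape
    # Back-project every view-1 pixel to 3D in camera-1 coordinates.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # 3xN
    cam1 = (np.linalg.inv(K1) @ pix) * D1.reshape(1, -1)               # 3xN
    # Transform into the camera-2 frame: P2 * P1^-1.
    cam1_h = np.vstack([cam1, np.ones((1, cam1.shape[1]))])
    cam2 = (P2 @ np.linalg.inv(P1) @ cam1_h)[:3]
    # Project into view 2 and keep points in front of the camera and in frame.
    z2 = cam2[2]
    proj = K2 @ cam2
    uv2 = proj[:2] / np.clip(proj[2:3], 1e-8, None)
    mask = (z2 > 0)
    mask &= (uv2[0] >= 0) & (uv2[0] < W) & (uv2[1] >= 0) & (uv2[1] < H)
    # Occlusion check: the reprojected depth must agree with D2.
    ui = np.clip(np.round(uv2[0]).astype(int), 0, W - 1)
    vi = np.clip(np.round(uv2[1]).astype(int), 0, H - 1)
    sampled = D2[vi, ui]
    mask &= np.abs(sampled - z2) < depth_tol * np.maximum(sampled, 1e-8)
    return mask.reshape(H, W)
```

Swapping the roles of the two views yields C21; the symmetric overlap ratio can then be read off as the fraction of True pixels in each mask.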


Evaluation Pair. We construct the Evaluation Pair by categorizing image pairs into discrete difficulty levels. For each image pair, we compute the corresponding metrics and discretize them into the following bins:

  • Overlap Ratio (four bins): <20%, 20–40%, 40–60%, and >60%;
  • Pitch Difference (four bins): 50–55°, 55–60°, 60–65°, and 65–70°;
  • Scale Variation (two bins): 1–2×, and >2×.
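Crossing these bins gives 4 × 4 × 2 = 32 discrete difficulty levels, matching the 32-level benchmark. A minimal sketch of the binning, assuming overlap is expressed in percent and assuming our own level-index ordering (the paper does not specify one):

```python
import numpy as np

# Bin edges follow the benchmark's difficulty definition.
OVERLAP_EDGES = [20.0, 40.0, 60.0]  # four bins: <20%, 20-40%, 40-60%, >60%
PITCH_EDGES = [55.0, 60.0, 65.0]    # four bins over 50-70 degrees
SCALE_EDGES = [2.0]                 # two bins: 1-2x, >2x

def difficulty_level(overlap_pct, pitch_diff_deg, scale):
    """Map an image pair's metrics to one of 4 * 4 * 2 = 32 levels (0..31).

    The flat index ordering (overlap-major) is our assumption for illustration.
    """
    o = int(np.digitize(overlap_pct, OVERLAP_EDGES))   # 0..3
    p = int(np.digitize(pitch_diff_deg, PITCH_EDGES))  # 0..3
    s = int(np.digitize(scale, SCALE_EDGES))           # 0..1
    return o * 8 + p * 2 + s
```

For example, a pair with 10% overlap, a 52° pitch difference, and 1.5× scale change lands in the first level, while 70% overlap, 68° pitch difference, and 3× scale lands in the last.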


Localization.

  • Query Image Collection: Captured with a DJI M300 RTK drone and a DJI H20T camera. Both the flight altitude and the camera pitch angle are carefully controlled during data acquisition. Accurate camera poses are estimated using Render2Loc.
  • Reference Data Preparation: Captured with a DJI M300 RTK drone and a SHARE PSDK 102S camera. The aerial imagery is processed with 3D reconstruction to generate a digital orthophoto map (DOM) and a digital surface model (DSM). In addition, satellite-derived DOM and DSM data covering the same geographic region are acquired from commercial providers and spatially aligned.



Citation


WIP