Research note · Research

CosFly-Track

A Large-Scale Multi-Modal Dataset for UAV Visual Tracking via Multi-Constraint Trajectory Optimization

Xiangyue Wang, Hanxuan Chen, Songsheng Cheng, Ruilong Ren, Jie Zheng, Shuai Yuan, Tianle Zeng, Hanzhong Guo, Kangli Wang, and Ji Pei

Read CosFly-Track on arXiv

Most aerial vision-language datasets ask a drone to go somewhere. CosFly-Track starts from the messier thing people often want in the field: staying with a moving subject without flying into anything or losing the shot.

A route can end at the right coordinate and still be useless for tracking if the target slipped behind a wall ten seconds earlier. In a follow task, the drone has to manage distance, line of sight, yaw, pitch, collision risk, and smooth motion throughout the clip.

CosFly-Track is a dataset paper with a strong production recipe: the authors build drone paths with tracking constraints already inside the optimizer.

Quick Read

CosFly-Track contains about 12,000 expert and perturbed UAV trajectories generated from roughly 6,000 pedestrian paths. The full dataset covers 2.4 million timesteps, about 334 hours of tracking, across 16 CARLA town variants and multiple weather or lighting conditions.
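A quick arithmetic check shows how those rounded figures relate to one another. The sketch below uses only the numbers quoted above. The implied sampling rate and per-trajectory length are inferences from those roundings, not values stated in this note.

```python
# Back-of-the-envelope check using only the rounded figures quoted above.
trajectories = 12_000
pedestrian_paths = 6_000
timesteps = 2_400_000
hours = 334

# Two trajectories per pedestrian path (expert plus perturbed).
trajectories_per_path = trajectories / pedestrian_paths

# Average number of timesteps in one trajectory.
steps_per_trajectory = timesteps / trajectories

# Implied sampling rate (an inference, not a stated spec).
seconds_total = hours * 3600
implied_rate_hz = timesteps / seconds_total

# Average trajectory duration in seconds.
seconds_per_trajectory = seconds_total / trajectories

print(trajectories_per_path, steps_per_trajectory)
print(round(implied_rate_hz, 2), round(seconds_per_trajectory, 1))
```

Under these roundings, each pedestrian path yields two trajectories of roughly 200 timesteps each, which works out to roughly 100 seconds per clip at about 2 samples per second.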

Each trajectory has seven aligned channels: RGB, metric depth, semantic segmentation, 6-DoF drone pose, target state with visibility flag, bilingual Chinese-English instructions, and trajectory-pair metadata. So the release is useful beyond rendered video. It gives a model camera data, drone state, target visibility, and the motion the drone should have taken.
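To make the seven channels concrete, here is a minimal sketch of what one aligned timestep might look like in code. Every field name, type, and file layout here is hypothetical. The actual release schema may differ, so treat this as a mental model and not a loader.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TrackingSample:
    """One aligned timestep. Field names are hypothetical, not the release schema."""
    rgb_path: str                 # channel 1: RGB frame
    depth_path: str               # channel 2: metric depth map
    semantic_path: str            # channel 3: semantic segmentation mask
    drone_pose: Tuple[float, float, float, float, float, float]  # channel 4: x, y, z, roll, pitch, yaw
    target_position: Tuple[float, float, float]                  # channel 5: target state...
    target_visible: bool                                         # ...with its visibility flag
    instruction_zh: str           # channel 6: bilingual instruction (Chinese)
    instruction_en: str           # channel 6: bilingual instruction (English)
    pair_id: str                  # channel 7: links an expert and a perturbed trajectory
    is_expert: bool               # channel 7: which side of the pair this sample belongs to


def visible_fraction(samples: List[TrackingSample]) -> float:
    """Share of timesteps in which the target is flagged visible."""
    if not samples:
        return 0.0
    return sum(s.target_visible for s in samples) / len(samples)
```

A per-trajectory visibility fraction like this is the kind of quantity the visibility flag makes cheap to compute without touching the images.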

Zero-shot VLMs mostly fail at the control task. After supervised fine-tuning on 200K CosFly-Track samples, seven vision-language models reach 78.3% to 95.6% SR@1m, a 53 to 69 percentage-point improvement over zero-shot baselines.

That result does not settle real-world tracking. It does show that the dataset contains learnable structure for a task ordinary aerial navigation data tends to miss.

Why The Data Has To Be Different

A good tracking path is judged at every frame. Is the target visible? Is the viewing distance reasonable? Is the pitch angle useful? Is the drone path smooth enough to fly? Is the route safe?

Grid planners are awkward here. They can find a collision-free path, and smoothing can make the route look better, but the smoothed result may still inherit bad visibility or strange motion from the original grid search.

CosFly-Track uses MuCO, a multi-constraint optimizer that works directly in continuous 3D space. Its objective balances tracking distance, smoothness, jerk, safety, visibility, viewpoint quality, pitch, altitude, and path length. BVH queries keep collision and visibility checks fast. Unsafe waypoints are handled with soft costs, geometric projection, and velocity repair.
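The idea of a single objective that trades off several tracking terms can be sketched as a weighted sum. This is not MuCO itself. The term definitions, the weights, the `desired_distance` default, and the `occluded` callback (standing in for a BVH-backed line-of-sight query) are all assumptions for illustration, and only a subset of the listed terms is shown.

```python
import math
from typing import Callable, Dict, List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _norm(a: Vec3) -> float:
    return math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)


def _diff(points: Sequence[Vec3]) -> List[Vec3]:
    """Finite differences between consecutive points."""
    return [_sub(q, p) for p, q in zip(points, points[1:])]


def cost_terms(
    drone: Sequence[Vec3],
    target: Sequence[Vec3],
    desired_distance: float = 8.0,  # assumed standoff distance in meters
    occluded: Optional[Callable[[Vec3, Vec3], bool]] = None,  # hypothetical line-of-sight check
) -> Dict[str, float]:
    """Compute a few illustrative per-trajectory cost terms."""
    velocity = _diff(drone)
    acceleration = _diff(velocity)
    jerk = _diff(acceleration)
    return {
        # Squared deviation from the desired tracking distance.
        "distance": sum((_norm(_sub(d, t)) - desired_distance) ** 2
                        for d, t in zip(drone, target)),
        # Squared second and third differences penalize rough motion.
        "smoothness": sum(_norm(a) ** 2 for a in acceleration),
        "jerk": sum(_norm(j) ** 2 for j in jerk),
        # Total path length.
        "length": sum(_norm(v) for v in velocity),
        # Count of timesteps where the target is occluded.
        "visibility": float(sum(1 for d, t in zip(drone, target) if occluded(d, t)))
        if occluded else 0.0,
    }


def trajectory_cost(drone, target, weights: Dict[str, float], **kwargs) -> float:
    """Weighted sum of the terms; terms without a weight contribute nothing."""
    terms = cost_terms(drone, target, **kwargs)
    return sum(weights.get(name, 0.0) * value for name, value in terms.items())
```

A drone that holds a constant offset from a target walking in a straight line scores zero on the distance, smoothness, and jerk terms, so only path length (and visibility, if a check is supplied) contributes. An optimizer would adjust waypoints to lower this scalar. Safety projection and velocity repair are separate steps that this sketch does not attempt.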

In the 20-path comparison, a strong A* baseline gets better visibility, but MuCO runs about 22x faster, produces paths about 13% shorter, and keeps visibility above 0.90 on 16 of the 20 trajectories. When thousands of paths have to be generated, that speed is part of the contribution.

The Paired-Trajectory Trick

The most useful design choice may be the paired data. Every pedestrian path produces an expert drone trajectory plus a perturbed version. The perturbations are small, ordinary mistakes. The drone or pedestrian shifts by a few meters, the viewing angle drifts, or both happen together.
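As a rough illustration of the pairing idea, the sketch below derives a perturbed trajectory from an expert one by applying a small position shift and a yaw drift. The magnitudes, the uniform sampling, and the choice to apply one constant offset to the whole clip are assumptions. The note does not describe the paper's perturbation procedure at this level of detail.

```python
import random
from typing import Dict, List, Tuple

Pose = Tuple[float, float, float, float]  # x, y, z, yaw in radians (simplified pose)


def perturb(
    expert: List[Pose],
    max_shift_m: float = 3.0,   # assumed "a few meters"
    max_yaw_rad: float = 0.3,   # assumed viewing-angle drift
    seed: int = 0,
    shift: bool = True,
    drift: bool = True,
) -> List[Pose]:
    """Return a copy of the expert path with a constant offset and yaw drift."""
    rng = random.Random(seed)
    # Always draw all three values so results stay reproducible across flag settings.
    dx = rng.uniform(-max_shift_m, max_shift_m)
    dy = rng.uniform(-max_shift_m, max_shift_m)
    dyaw = rng.uniform(-max_yaw_rad, max_yaw_rad)
    if not shift:
        dx, dy = 0.0, 0.0
    if not drift:
        dyaw = 0.0
    return [(x + dx, y + dy, z, yaw + dyaw) for x, y, z, yaw in expert]


def make_pair(expert: List[Pose], seed: int = 0) -> Dict[str, List[Pose]]:
    """Bundle a perturbed input with its expert target for denoising-style training."""
    return {"input": perturb(expert, seed=seed), "target": list(expert)}
```

Setting `shift` or `drift` to `False` covers the three cases described above: position error only, viewing-angle error only, or both together.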

That gives researchers a few ways to train. A model can imitate expert waypoints, recover from noisy history, compare clean and degraded tracking, or use the pairs for DAgger-style correction. This is closer to a real drone system, where the aircraft will not always sit exactly on the expert trajectory.

The ablation supports that choice. Denoising from perturbed input to expert target gives the best FDE and SR@1m in the reported setup. Expert-only training hurts yaw prediction, which is exactly the kind of failure that shows up when a camera drone has to correct its view instead of merely continuing a clean path.

What The Benchmark Says

The model table is most useful as a warning about inputs.

After fine-tuning, Qwen3.5-9B reaches 95.60% SR@1m, GLM-4.6V-Flash reaches 95.48%, and Qwen3-VL-8B reaches 95.22%. Gemma-4-E4B is lower at 78.34%.

Pose history does most of the work in this benchmark. Removing it makes final displacement error jump from about 1.25 m to more than 3.8 m, and SR@1m drops from 77.6% to roughly 16-18%. Bounding boxes matter for target prediction. RGB adds only a small gain once pose and bbox history are already present.
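For readers who want to reproduce numbers of this kind on their own predictions, the two metrics can be sketched as follows. The definitions are assumptions based on common usage: FDE as the Euclidean distance between the last predicted and last ground-truth waypoint, and SR@1m as the fraction of samples whose FDE is at most 1 m. The paper's exact definitions may differ.

```python
import math
from typing import List, Sequence, Tuple

Waypoint = Tuple[float, float, float]


def fde(pred: Sequence[Waypoint], truth: Sequence[Waypoint]) -> float:
    """Final displacement error: distance between the last waypoints (assumed definition)."""
    return math.dist(pred[-1], truth[-1])


def success_rate(
    preds: List[Sequence[Waypoint]],
    truths: List[Sequence[Waypoint]],
    threshold_m: float = 1.0,
) -> float:
    """Fraction of samples with FDE within the threshold (assumed SR@1m definition)."""
    errors = [fde(p, t) for p, t in zip(preds, truths)]
    return sum(e <= threshold_m for e in errors) / len(errors)


# Two toy samples: one ends 0.5 m from the truth, the other 2.0 m away.
truth = [(0.0, 0.0, 10.0), (1.0, 0.0, 10.0)]
close = [(0.0, 0.0, 10.0), (1.5, 0.0, 10.0)]
far = [(0.0, 0.0, 10.0), (3.0, 0.0, 10.0)]
print(success_rate([close, far], [truth, truth]))
```

Because SR@1m is a thresholded version of FDE under this reading, a moderate rise in average error can push many samples past the 1 m line at once, which is consistent with the sharp drop described above.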

I would not read that as a claim that vision is optional. The benchmark still leans heavily on structured state. In rougher field conditions, visual cues may carry more of the burden once occlusion, pedestrian motion, traffic, glare, weather, and camera artifacts come into play.

Limits

The data comes from CARLA, so sim-to-real transfer remains the big open question. The authors say real-world data is being collected for a future release. This version should be treated as a training and benchmarking resource, not proof that a model can follow people outside.

The generation pipeline is not fully open-source because of company policy, although the paper gives algorithm detail for reimplementing the optimizer. The current release is an initial subset of about 100K multi-modal frames, with expansion planned.

There is also the obvious dual-use issue. UAV tracking is useful for search and rescue, filming, sports analysis, wildlife monitoring, and inspection. The same capability can be misused for surveillance. The dataset license restricts unauthorized surveillance and military targeting.

Where this fits with Nimbus and Droneforge

For Droneforge builders, CosFly-Track points toward a useful training primitive: target-following data in which motion, view quality, and recovery behavior are tied together.

Nimbus already exposes the pieces a tracking agent needs to inspect, including video, telemetry, route replay, object tracking, and command output. A model trained with CosFly-style data would sit between perception and motion generation, turning recent observations into short-horizon tracking waypoints.

The practical product question is recovery. Can a model learn how to regain the right distance, yaw, and viewing angle after the drone has drifted away from the expert path? That is more valuable than copying a perfect route.

Safety layers, operator control, geofencing, and command limits still have to wrap anything learned. CosFly-Track is interesting because it gives the learning layer a better starting point.
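One way to picture such a wrapper is a small filter that every learned waypoint passes through before it becomes a command. The sketch below is generic Python and does not use or represent the Nimbus API. The function name, the per-step distance limit, and the axis-aligned geofence box are all hypothetical.

```python
import math
from typing import Tuple

Vec3 = Tuple[float, float, float]


def clamp_waypoint(
    current: Vec3,
    proposed: Vec3,
    max_step_m: float,   # hypothetical command limit per waypoint
    fence_min: Vec3,     # hypothetical geofence box, lower corner
    fence_max: Vec3,     # hypothetical geofence box, upper corner
) -> Vec3:
    """Limit the step length, then clamp the result into the geofence box."""
    delta = [p - c for p, c in zip(proposed, current)]
    dist = math.sqrt(sum(d * d for d in delta))
    if dist > max_step_m:
        # Shrink the step along the same direction.
        scale = max_step_m / dist
        delta = [d * scale for d in delta]
    stepped = [c + d for c, d in zip(current, delta)]
    # Clamp each axis into the allowed box.
    clamped = [min(max(v, lo), hi) for v, lo, hi in zip(stepped, fence_min, fence_max)]
    return (clamped[0], clamped[1], clamped[2])
```

A model that proposes a 10 m jump with a 2 m limit would have the command shortened to a 2 m step in the same direction, and a waypoint above the fence ceiling would be pulled back down to it. A real system would add collision checks and operator override on top of this.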

Bottom line

CosFly-Track is worth reading because it treats UAV tracking as its own training problem. The dataset construction loop is the strongest idea: generate moving-target routes with visibility, viewpoint, safety, and kinematics baked into the trajectory, then pair expert behavior with realistic perturbations.

For drone autonomy, that is the right direction. A useful tracking drone has to keep seeing, keep flying, and know how to recover when the clean path is gone.

Research context

The Droneforge research section collects practical notes for builders who want to connect drone autonomy ideas to real hardware. Topics may include perception, tracking, mission planning, route replay, benchmarks, datasets, and lessons from operating Nimbus with DF1 in repeatable field workflows.

These notes are written for developers who need more than abstract robotics theory. The goal is to connect papers, experiments, and field observations to concrete Nimbus App and Python Library workflows that can be tested with video, telemetry, commands, and route planning tools.

As this section grows, each research entry will point builders toward the assumptions, constraints, and practical tradeoffs behind real autonomy experiments. That context helps teams decide what to prototype, what to measure, and how to evaluate progress.
