AI Summary • Published on Feb 18, 2026
First Responders (FRs) face increasing challenges during disasters, and artificial intelligence and robotics, specifically Unmanned Ground Vehicles (UGVs), could assist them. However, current Human-Robot Interaction (HRI) methods, such as remote controllers, can be cumbersome and divert FRs' attention from critical tasks. This motivates a more intuitive, non-intrusive HRI modality such as visual hand gesture recognition (VHGR), which would let FRs control robots efficiently without interrupting their work. Yet there is a lack of specialized datasets for gesture-based UGV control tailored to FR operations: existing datasets often cover only a small number of commands, target other domains such as UAVs, or are not publicly available.
The authors developed FR-GESTURE, an RGBD dataset for gesture-based UGV control by First Responders. They defined a set of 12 distinct gestures, drawing inspiration from existing FR and tactical hand signals and refining them based on feedback from experienced FRs. Each gesture is mapped to a specific UGV command (e.g., "Come to me," "Freeze," "Evacuate the area," "Fetch a shovel"). The dataset was collected from 7 participants, each performing the 12 gestures at six to seven distances (1 to 7 meters) in three diverse environments (two indoor, one outdoor). Data was captured using two Intel RealSense D415 cameras from different viewpoints, yielding 3312 RGBD pairs at 480×640 resolution; partially occluded samples and frames with motion blur were deliberately included to enhance robustness. For baseline experiments, the authors adopted established image classifiers (ResNet-18, ResNet-50, ResNeXt-50, and EfficientNet), pretraining them on the larger HaGRID dataset to mitigate overfitting to FR-GESTURE's limited size. Models were trained for 100 epochs with the AdamW optimizer and evaluated with the F1-Score. Two evaluation protocols were defined: a uniform split and a subject-independent split that assesses generalization to unseen individuals.
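To make the baseline setup concrete, here is a minimal fine-tuning sketch in PyTorch. It is not the authors' code: it assumes EfficientNet-B0 from torchvision (the summary does not name the EfficientNet variant), a hypothetical local checkpoint `hagrid_pretrained.pth` from the HaGRID pretraining stage, an assumed learning rate, and that the RGB stream alone feeds the classifier (how the baselines consume the depth channel is not detailed here).

```python
import torch
import torch.nn as nn
from torch.optim import AdamW
from torchvision import models

NUM_GESTURES = 12  # the 12 FR-GESTURE command classes

# Build the backbone and load HaGRID-pretrained weights.
# "hagrid_pretrained.pth" is a hypothetical checkpoint path; strict=False
# tolerates the checkpoint's classifier head having HaGRID's class count.
model = models.efficientnet_b0(weights=None)
state = torch.load("hagrid_pretrained.pth", map_location="cpu")
model.load_state_dict(state, strict=False)

# Swap the classification head for the 12 FR-GESTURE classes.
in_features = model.classifier[1].in_features
model.classifier[1] = nn.Linear(in_features, NUM_GESTURES)

# Optimizer is created after the head swap so the new head is optimized too.
optimizer = AdamW(model.parameters(), lr=1e-4)  # learning rate is an assumption
criterion = nn.CrossEntropyLoss()

def train(model, loader, epochs=100, device="cuda"):
    """Plain supervised fine-tuning, 100 epochs as in the baseline setup."""
    model.to(device).train()
    for _ in range(epochs):
        for images, labels in loader:  # RGB frames and gesture-class labels
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
```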
Baseline experiments were performed with ResNet-18, ResNet-50, ResNeXt-50, and EfficientNet. Under the uniform split protocol, where the data was randomly divided into training, validation, and test sets with a uniform class distribution, pretraining on HaGRID proved crucial for preventing overfitting. EfficientNet performed best, likely because its smaller size makes it less susceptible to overfitting and better at generalizing on a dataset of this scale. Under the subject-independent protocol, designed to evaluate robustness to unseen signers, the F1-Score dropped significantly for all models compared to the uniform protocol, indicating that the limited number of subjects in the dataset hinders generalization to new individuals. Despite this drop, EfficientNet maintained the best performance in this more challenging setting.
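The summary does not describe how the subject-independent split was constructed. A common way to realize such a protocol is to group frames by participant ID so that no signer appears in both training and test sets; the sketch below does this with scikit-learn's GroupShuffleSplit, using hypothetical subject labels and placeholder predictions, and assumes macro-averaged F1 (the summary says only "F1-Score").

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.model_selection import GroupShuffleSplit

# One entry per frame; subject_ids are hypothetical stand-ins for the
# participant who performed each of the 3312 samples.
rng = np.random.default_rng(0)
frames = np.arange(3312)
subject_ids = rng.integers(0, 7, size=3312)

# Hold out roughly 2 of the 7 participants: all of a participant's frames
# land on the same side of the split, so test subjects are truly unseen.
splitter = GroupShuffleSplit(n_splits=1, test_size=2 / 7, random_state=0)
train_idx, test_idx = next(splitter.split(frames, groups=subject_ids))
assert set(subject_ids[train_idx]).isdisjoint(subject_ids[test_idx])

# After training, score predictions with the F1 metric; macro averaging
# over the 12 gesture classes is an assumption.
y_true = rng.integers(0, 12, size=len(test_idx))  # placeholder ground truth
y_pred = rng.integers(0, 12, size=len(test_idx))  # placeholder predictions
print(f1_score(y_true, y_pred, average="macro"))
```

Under this kind of split, the gap between uniform and subject-independent scores directly measures how much the models rely on signer-specific cues rather than the gestures themselves.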
The FR-GESTURE dataset, while a pioneering effort, has several limitations that suggest avenues for future research. First, the samples were collected in a university setting from students wearing casual clothing, which may not reflect real-world first responder environments or the use of personal protective equipment (e.g., gloves, helmets). Future work could address this by increasing the dataset's size and diversity, potentially using diffusion models for data augmentation, so that models learn to focus on discriminative hand shapes rather than background or attire. Second, the dataset is imbalanced in gender and race: only 2 of the 7 participants are women, and all participants are white. This lack of diversity could impede accurate recognition in real-world scenarios, highlighting the need for more inclusive data collection in future iterations to ensure robustness across diverse user populations.