Abstract: To address the challenges of insufficient detail depiction, blurred boundaries, and mis-segmentation in UAV-based urban scene segmentation, we propose a semantic segmentation framework, ETP-HRNet. The method incorporates an Edge-guided Dual-domain Feature Fusion Structure (EDFS), which integrates spatial- and frequency-domain information to enhance edge detail perception and thereby improve segmentation accuracy in boundary regions. A Tri-Dimensional Dynamic Residual Module (TDRM) is designed to expand the receptive field and enrich semantic representations through multidimensional convolution and a dynamic selection mechanism, mitigating intra-class feature inconsistency. In addition, a Pyramid Feature Aggregation Structure (PFAS) facilitates efficient cross-scale feature integration, strengthening the model’s ability to capture multi-scale contextual information. Experimental results show that ETP-HRNet achieves a mIoU of 70.04% on the UAVid dataset and 74.61% on the UDD dataset. Notably, it improves the static car category by 8.03% on UAVid and the road category by 5.39% on UDD, demonstrating its effectiveness in fine-grained segmentation and semantic consistency.
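To make the dual-domain idea concrete, the sketch below illustrates one generic way spatial and frequency features can be fused in PyTorch. The class name DualDomainFusion, the channel sizes, and the FFT-magnitude branch are illustrative assumptions only, not the authors' EDFS implementation.

    # Minimal, hypothetical sketch of spatial/frequency feature fusion;
    # module names and layer choices are assumptions for illustration.
    import torch
    import torch.nn as nn

    class DualDomainFusion(nn.Module):
        def __init__(self, channels: int):
            super().__init__()
            # Spatial branch: plain 3x3 convolution over the feature map.
            self.spatial = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
            # Frequency branch: 1x1 convolution on the FFT magnitude,
            # which emphasizes high-frequency (edge-like) content.
            self.freq = nn.Conv2d(channels, channels, kernel_size=1)
            # Fuse the two domains back to the original channel count.
            self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=1)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            spatial_feat = self.spatial(x)
            # 2-D FFT over the spatial dimensions; keep only the magnitude.
            freq_feat = self.freq(torch.abs(torch.fft.fft2(x, norm="ortho")))
            return self.fuse(torch.cat([spatial_feat, freq_feat], dim=1))

    # Example: fuse a 64-channel feature map from a UAV image crop.
    feats = torch.randn(1, 64, 128, 128)
    out = DualDomainFusion(64)(feats)   # shape: (1, 64, 128, 128)

Under these assumptions, the spatial branch preserves local texture while the frequency branch highlights boundary-related detail, and a 1x1 convolution merges the two; the paper's actual EDFS design may differ in both structure and fusion strategy.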