We address the challenging task of predicting pedestrian occupancy as an extension of multi-view pedestrian detection in urban traffic scenes. To support this, we introduce MVP-Occ, a new synthetic dataset designed for dense pedestrian scenarios in large-scale scenes. The dataset represents pedestrians in detail with voxel structures and is accompanied by rich semantic scene labels, supporting visual navigation and analysis of pedestrian spatial information.
Furthermore, we present a strong baseline model, OmniOcc, which predicts both the voxel occupancy state and panoptic labels of the entire scene from multi-view images. Through in-depth analysis, we identify the key components of the model and evaluate their individual contributions.
A unified occupancy prediction model designed for multi-view dense pedestrian environments, notable for its scalability and its ability to handle varying camera configurations and scene dimensions at both training and test time.
In addition, the model predicts 2D pedestrian occupancy on the ground plane while simultaneously predicting 3D semantic occupancy for the entire scene. Using pedestrian centers as instance anchors, it further groups the semantic occupancy into instance and panoptic occupancy.
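As a rough illustration of this grouping step (not the paper's actual implementation), the sketch below assigns every pedestrian-class voxel to its nearest ground-plane center and fuses the result with the semantic map into a panoptic map. The class id, the instance-encoding scheme, and the array shapes are all assumptions made for the example.

```python
# Minimal sketch: nearest-center grouping of pedestrian voxels into instances.
# Class ids, encoding (class * 1000 + instance id), and shapes are assumed.
import numpy as np

PEDESTRIAN = 1  # assumed integer id of the pedestrian class


def panoptic_from_semantic(semantic, centers, voxel_size=0.1):
    """semantic: (X, Y, Z) int array of per-voxel class ids.
    centers: (N, 2) array of pedestrian centers in ground-plane coordinates."""
    panoptic = semantic.astype(np.int64).copy()
    px, py, pz = np.nonzero(semantic == PEDESTRIAN)
    if len(px) == 0 or len(centers) == 0:
        return panoptic
    # Project pedestrian voxels onto the ground plane (world coordinates).
    voxel_xy = np.stack([px, py], axis=1) * voxel_size
    # Distance from every pedestrian voxel to every instance center.
    dists = np.linalg.norm(voxel_xy[:, None, :] - centers[None, :, :], axis=2)
    inst = np.argmin(dists, axis=1)  # nearest-center instance assignment
    # Write instance ids on top of the semantic label.
    panoptic[px, py, pz] = PEDESTRIAN * 1000 + inst
    return panoptic
```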
We propose a novel synthetic Multi-View Pedestrian Occupancy dataset, comprising five large-scale scenes, designed to mimic real-world environments.
The entire scene is represented by voxels, and each voxel is annotated with one of five classes, indicating whether it belongs to a pedestrian, the background environment (ground, walls, others), or is empty.
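For concreteness, a minimal sketch of inspecting such a voxel grid is shown below, assuming a NumPy export of the per-scene labels; the file name and the integer ids are illustrative placeholders, not the dataset's actual convention.

```python
# Minimal sketch: count voxels per class in an assumed NumPy label export.
import numpy as np

# Hypothetical id-to-name mapping for the five classes described above.
CLASS_NAMES = {0: "empty", 1: "pedestrian", 2: "ground", 3: "wall", 4: "other"}

voxels = np.load("scene_voxels.npy")  # (X, Y, Z) int array, hypothetical file name
ids, counts = np.unique(voxels, return_counts=True)
for i, c in zip(ids, counts):
    print(f"{CLASS_NAMES.get(int(i), 'unknown'):10s}: {c} voxels")
```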
We also provide the scene point cloud data without any pedestrians.
We also provide pedestrians' 3D poses and 2D ground-plane locations for pose estimation and multi-view detection tasks.
In addition to scene-level 3D annotations, we provide multi-view pixel-level annotations, including RGB images, depth maps, and semantic and instance masks.
Each scene contains 2,500 frames generated at 10 FPS with an image resolution of 1920x1080. The number of cameras varies with the scene characteristics, with the goal of maximizing the coverage of each camera to mimic real-world CCTV setups.
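A minimal loading sketch is given below under an assumed per-scene, per-camera directory layout; the folder names, file names, and camera count are hypothetical placeholders and not the dataset's documented structure.

```python
# Minimal sketch: load the multi-view RGB images of one frame.
# All paths below are hypothetical placeholders.
from pathlib import Path

import cv2  # opencv-python


def load_frame(scene_dir, frame_id, num_cams):
    """Return the multi-view RGB images for one frame as a list of arrays."""
    views = []
    for cam in range(num_cams):
        path = Path(scene_dir) / f"cam{cam}" / "rgb" / f"{frame_id:04d}.png"
        views.append(cv2.imread(str(path)))  # (1080, 1920, 3) BGR image
    return views


views = load_frame("MVP-Occ/Park", frame_id=0, num_cams=6)
```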
The dataset includes three main folders:
Table. Comparison with previous multi-view pedestrian detection methods.
Table. 3D occupancy prediction for all scenes in MVP-Occ dataset.
Figure. 2D and 3D occupancy prediction under same-scene evaluation on the Park scene.
Figure. 3D occupancy prediction under same-scene evaluation for all scenes in MVP-Occ dataset.
Figure. 3D occupancy prediction under synthetic-to-real evaluation (Facade to WildTrack).
Figure. 3D occupancy prediction and visualizations of the rendered segmentation masks under synthetic-to-real evaluation (Each scene to WildTrack).
Figure. Comparison between ground-truth and rendered segmentation masks for the WildTrack scene.
@inproceedings{aung2024mvpocc,
title={Multi-View Pedestrian Occupancy Prediction with a Novel Synthetic Dataset},
author={Aung, Sithu and Sagong, Min-cheol and Cho, Junghyun},
booktitle={The 39th Annual AAAI Conference on Artificial Intelligence},
year={2025},
}