MVP-Occ: Multi-view Pedestrian Occupancy Prediction with a Novel Synthetic Dataset (AAAI'25)

¹Korea Institute of Science and Technology ²AI-Robotics, KIST School, University of Science and Technology, Korea ³Yonsei-KIST Convergence Research Institute, Yonsei University

A comprehensive dataset with a focus on dense pedestrian crowds in surveillance scenes.

Abstract

We address an advanced challenge of predicting pedestrian occupancy as an extension of multi-view pedestrian detection in urban traffic. To support this, we have created a new synthetic dataset called MVP-Occ, designed for dense pedestrian scenarios in large-scale scenes. Our dataset provides detailed representations of pedestrians using voxel structures, accompanied by rich semantic scene understanding labels, facilitating visual navigation and insights into pedestrian spatial information.

Furthermore, we present a robust baseline model, termed OmniOcc, capable of predicting both the voxel occupancy state and panoptic labels for the entire scene from multi-view images. Through in-depth analysis, we identify and evaluate the key elements of our proposed model, highlighting their specific contributions and importance.

OmniOcc Model


A unified occupancy prediction model designed for multi-view dense pedestrian environments, characterized by its expandability and its ability to handle varying combinations of camera configurations and scene dimensions at both training and test time.

Additionally, the model is designed to predict 2D pedestrian occupancy in the ground plane while simultaneously predicting 3D semantic occupancy for the entire scene. Using pedestrian instances as center locations, our model can further group semantic occupancy into instance and panoptic occupancies.
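The grouping step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a semantic voxel grid with a dedicated pedestrian class and predicted pedestrian centers in voxel coordinates, and assigns every pedestrian voxel to its nearest center to form instance occupancy.

```python
import numpy as np

# Hypothetical class encoding; the released model may use different ids.
PED_CLASS = 1

def group_instances(semantic, centers):
    """Assign each pedestrian voxel to its nearest center,
    producing an instance grid (0 = no instance)."""
    instance = np.zeros_like(semantic, dtype=np.int32)
    ped_voxels = np.argwhere(semantic == PED_CLASS)          # (N, 3) voxel indices
    if len(ped_voxels) == 0 or len(centers) == 0:
        return instance
    # Distance from every pedestrian voxel to every predicted center.
    d = np.linalg.norm(ped_voxels[:, None, :] - centers[None, :, :], axis=-1)
    ids = d.argmin(axis=1) + 1                               # instance ids start at 1
    instance[tuple(ped_voxels.T)] = ids
    return instance

# Toy example: two pedestrians in a small grid.
sem = np.zeros((8, 8, 4), dtype=np.int32)
sem[1, 1, :2] = PED_CLASS
sem[6, 6, :2] = PED_CLASS
centers = np.array([[1.0, 1.0, 0.0], [6.0, 6.0, 0.0]])
inst = group_instances(sem, centers)
```

Combining the resulting instance grid with the semantic labels of the non-pedestrian voxels yields a panoptic occupancy of the whole scene.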

MVP-Occ Dataset


We propose a novel synthetic Multi-View Pedestrian Occupancy dataset, comprising five large-scale scenes, designed to mimic real-world environments.

Occupancy Labels


The entire scene is represented by voxels, and each voxel is annotated with one of five classes, indicating whether it belongs to a pedestrian, the background environment (ground, walls, others), or is empty.
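A minimal sketch of working with such a label grid, assuming a hypothetical integer encoding of the five classes (the exact ids in the released files may differ):

```python
import numpy as np

# Assumed class encoding for illustration only.
CLASSES = {0: "empty", 1: "pedestrian", 2: "ground", 3: "wall", 4: "others"}

def class_histogram(voxels):
    """Count voxels per semantic class in a label grid."""
    ids, counts = np.unique(voxels, return_counts=True)
    return {CLASSES[i]: int(c) for i, c in zip(ids, counts)}

# Toy 4x4x2 grid: one pedestrian voxel and a layer of ground.
grid = np.zeros((4, 4, 2), dtype=np.int64)
grid[0, 0, 0] = 1
grid[:, :, 1] = 2
hist = class_histogram(grid)
```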

Scene Point Cloud


We also provide the scene point cloud data without any pedestrians.

Pedestrians Labels


We also offer 3D poses and 2D ground-plane locations of pedestrians, supporting pose estimation and multi-view detection tasks.
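For multi-view detection evaluation, ground-plane locations are typically rasterized onto a discrete occupancy grid. A minimal sketch, with illustrative grid extent and cell size (not the dataset's actual parameters):

```python
import numpy as np

def to_occupancy_map(locations, grid_size=(10, 10), cell=0.5, origin=(0.0, 0.0)):
    """Rasterize 2D ground-plane locations (in meters) onto a boolean grid."""
    occ = np.zeros(grid_size, dtype=bool)
    for x, y in locations:
        i = int((x - origin[0]) / cell)
        j = int((y - origin[1]) / cell)
        if 0 <= i < grid_size[0] and 0 <= j < grid_size[1]:
            occ[i, j] = True
    return occ

# Two pedestrians at (1.0, 2.0) m and (3.5, 0.5) m.
occ = to_occupancy_map([(1.0, 2.0), (3.5, 0.5)])
```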

Pixel-level Annotations


In addition to 3D annotations at the scene level, we also provide multi-view pixel-level annotations, including RGB, depth, semantic, and instance masks.

Dataset Details


Each scene contains 2500 frames generated at 10 FPS, with an image resolution of 1920x1080. The number of cameras varies depending on scene characteristics, with the goal of maximizing each camera's scene coverage to mimic real-world CCTV camera setups.

Folder Structure


The dataset includes three main folders:

  1. "images": Pixel-level annotations for the rendered simulation with pedestrians.
  2. "scene": Pixel-level annotations for a scene without pedestrians.
  3. "annotations": 3D human poses and ground plane locations with semantic and instance occupancy labels.

Camera calibration parameters and poses are provided in "poses.json".
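The calibration parameters can be used to project scene geometry into each view with a standard pinhole model. A minimal sketch; the key names ("K", "R", "t") are illustrative assumptions, and the released "poses.json" may use a different schema:

```python
import numpy as np

def project(point_w, K, R, t):
    """Pinhole projection: transform a world point into the camera frame
    with [R | t], apply intrinsics K, then divide by depth."""
    p_cam = R @ point_w + t
    uvw = K @ p_cam
    return uvw[:2] / uvw[2]

# Example camera: focal length 1000 px, principal point at the center
# of a 1920x1080 image, looking down +z from the world origin.
K = np.array([[1000.0,    0.0, 960.0],
              [   0.0, 1000.0, 540.0],
              [   0.0,    0.0,   1.0]])
R = np.eye(3)
t = np.zeros(3)
uv = project(np.array([0.0, 0.0, 5.0]), K, R, t)  # point 5 m in front
```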

Download the dataset

To download the dataset, please sign this form and send it to jhcho@kist.re.kr.

Quantitative Results

Same-scene Evaluation


Table. Comparison with previous multi-view pedestrian detection methods.


Table. 3D occupancy prediction for all scenes in MVP-Occ dataset.

Synthetic-to-real Evaluation


Table. Comparison with previous multi-view pedestrian detection methods.


Table. 3D occupancy prediction for all scenes in MVP-Occ dataset.

Qualitative Results

Same-scene Evaluation


Figure. 2D and 3D occupancy prediction under same-scene evaluation of the Park scene.




Figure. 3D occupancy prediction under same-scene evaluation for all scenes in MVP-Occ dataset.

Synthetic-to-real Evaluation


Figure. 3D occupancy prediction under synthetic-to-real evaluation (Facade to WildTrack).


Figure. 3D occupancy prediction and visualizations of the rendered segmentation masks under synthetic-to-real evaluation (Each scene to WildTrack).


Figure. Comparison between ground-truth and rendered segmentation masks for the WildTrack scene.

BibTeX

@inproceedings{aung2024mvpocc,
      title={Multi-View Pedestrian Occupancy Prediction with a Novel Synthetic Dataset},
      author={Aung, Sithu and Sagong, Min-cheol and Cho, Junghyun},
      booktitle={The 39th Annual AAAI Conference on Artificial Intelligence},
      year={2025},
}