
PhysicsGen: Can Generative Models Learn from Images to Predict Complex Physical Relations?

Abstract

The image-to-image translation abilities of generative learning models have recently made significant progress in the estimation of complex (steered) mappings between image distributions. While appearance-based tasks like image in-painting or style transfer have been studied at length, we propose to investigate the potential of generative models in the context of physical simulations. Providing a dataset of 300k image pairs and baseline evaluations for three different physical simulation tasks, we propose a benchmark to investigate the following research questions: i) are generative models able to learn complex physical relations from input-output image pairs? ii) what speedups can be achieved by replacing differential-equation-based simulations? While baseline evaluations of different current models show the potential for high speedups (ii), these results also show strong limitations with respect to physical correctness (i). This underlines the need for new methods to enforce physical correctness.

Download Datasets & Hugging Face Datasets

The dataset used for evaluation, which contains 300k image pairs across three diverse simulation scenarios, is publicly available via a Zenodo DOI and through Hugging Face Datasets.

Variants

Data is divided into train, test, and eval splits. For efficient storage and faster uploads, the data is stored as Parquet files, with the images encoded as binary blobs.
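
If you only need a single split, it can be requested directly from the Hugging Face datasets library. A minimal sketch, assuming the split names follow the train/test/eval naming above and using the sound_combined variant shown in the Usage section below:

from datasets import load_dataset

# Load only the test split of the "sound_combined" variant
# (split names assumed to match the train/test/eval naming described above).
test_set = load_dataset(
    "mspitzna/physicsgen",
    name="sound_combined",
    split="test",
    trust_remote_code=True,
)
print(len(test_set))        # number of samples in the test split
print(test_set[0].keys())   # available fields for one sample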

Usage

You can load and use the dataset with the Hugging Face datasets library. For example, to load the sound_combined variant:

from datasets import load_dataset
import matplotlib.pyplot as plt

dataset = load_dataset("mspitzna/physicsgen", name="sound_combined", trust_remote_code=True)

# Access a sample from the training split.
sample = dataset["train"][0]
input_img = sample["osm"]
target_img = sample["soundmap_512"]

# Plot input vs. target image for a single sample.
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 5))
ax1.imshow(input_img)
ax2.imshow(target_img)
plt.show()

Code for baseline experiments

The code for the baseline experiments is available in the GitHub repository.

Project Structure:

project_root/
│
├── data/                         # Dedicated data folder
│   └── urban_sound_25k_baseline/ # Download this via provided DOI
│       ├── test/
│       │   ├── test.csv
│       │   ├── soundmaps/
│       │   └── buildings/
│       │
│       └── pred/                 # Your predictions
│           ├── y_0.png
│           └── ...
│
└── eval_scripts/
    ├── lens_metrics.py
    └── sound_metrics.py

The indexing system for predicted sound propagation images in the pred folder aligns directly with the test.csv dataframe rows. Each predicted image file, named as y_{index}.png, corresponds to the test data’s row at the same index, with index 0 referring to the dataframe’s first row.
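
A minimal sketch of producing predictions in this format, assuming the urban_sound_25k_baseline layout shown above; predict_soundmap is a placeholder for your own model inference:

import os

import pandas as pd
from PIL import Image

# Save one prediction per row of test.csv, named y_{index}.png,
# so the evaluation scripts can match predictions to ground-truth rows.
test_df = pd.read_csv("data/urban_sound_25k_baseline/test/test.csv")
pred_dir = "data/urban_sound_25k_baseline/pred"
os.makedirs(pred_dir, exist_ok=True)

for index, row in test_df.iterrows():
    prediction = predict_soundmap(row)  # placeholder: your model's output as a uint8 array
    Image.fromarray(prediction).save(os.path.join(pred_dir, f"y_{index}.png"))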

Sound Propagation Evaluation Script

Description: Evaluates sound propagation predictions by comparing them to ground truth noise maps, including Line-of-Sight (LoS) and Non-Line-of-Sight (NLoS) errors.

Usage:

python sound_metrics.py --data_dir data/true --pred_dir data/pred --output evaluation.csv

Arguments:

- --data_dir: Directory containing the ground-truth test data (e.g. data/true).
- --pred_dir: Directory containing the predicted sound maps (e.g. data/pred).
- --output: Path of the CSV file the evaluation results are written to.

Lens Evaluation Script

Description: Evaluates the accuracy of facial landmark predictions by comparing them to ground truth images.

Usage:

python lens_metrics.py --data_dir data/true --pred_dir data/pred --output results/

Arguments:

- --data_dir: Directory containing the ground-truth images (e.g. data/true).
- --pred_dir: Directory containing the predicted images (e.g. data/pred).
- --output: Directory where the evaluation results are written (e.g. results/).

Results

Urban Sound Propagation

The table below presents baseline performance metrics for the evaluated architectures: mean absolute error (MAE) and weighted mean absolute percentage error (wMAPE), each reported separately for line-of-sight (LoS) and non-line-of-sight (NLoS) regions, together with the runtime per sample.

| Condition | Architecture | LoS MAE | NLoS MAE | LoS wMAPE | NLoS wMAPE | Runtime / Sample (ms) |
|---|---|---|---|---|---|---|
| Baseline | Simulation | 0.00 | 0.00 | 0.00 | 0.00 | 204700 |
| Baseline | convAE | 3.67 | 2.74 | 20.24 | 67.13 | 0.128 |
| Baseline | VAE | 3.92 | 2.84 | 21.33 | 75.58 | 0.124 |
| Baseline | UNet | 2.29 | 1.73 | 12.91 | 37.57 | 0.138 |
| Baseline | Pix2Pix | 1.73 | 1.19 | 9.36 | 6.75 | 0.138 |
| Baseline | DDPM | 2.42 | 3.26 | 15.57 | 51.08 | 3986.353 |
| Baseline | SD(w.CA) | 3.76 | 3.34 | 17.42 | 35.18 | 2961.027 |
| Baseline | SD | 2.12 | 1.08 | 13.23 | 32.46 | 2970.86 |
| Baseline | DDBM | 1.61 | 2.17 | 17.50 | 65.24 | 3732.21 |
| Diffraction | Simulation | 0.00 | 0.00 | 0.00 | 0.00 | 206000 |
| Diffraction | convAE | 3.59 | 8.04 | 13.77 | 32.09 | 0.128 |
| Diffraction | VAE | 3.92 | 8.22 | 14.46 | 32.57 | 0.124 |
| Diffraction | UNet | 0.94 | 3.27 | 4.22 | 22.36 | 0.138 |
| Diffraction | Pix2Pix | 0.91 | 3.36 | 3.51 | 18.06 | 0.138 |
| Diffraction | DDPM | 1.59 | 3.27 | 8.25 | 20.30 | 3986.353 |
| Diffraction | SD(w.CA) | 2.46 | 7.72 | 10.14 | 31.23 | 2961.027 |
| Diffraction | SD | 1.33 | 5.07 | 8.15 | 24.45 | 2970.86 |
| Diffraction | DDBM | 1.35 | 3.35 | 11.22 | 23.56 | 3732.21 |
| Reflection | Simulation | 0.00 | 0.00 | 0.00 | 0.00 | 251000 |
| Reflection | convAE | 3.83 | 6.56 | 20.67 | 93.54 | 0.128 |
| Reflection | VAE | 4.15 | 6.32 | 21.57 | 92.47 | 0.124 |
| Reflection | UNet | 2.29 | 5.72 | 12.75 | 80.46 | 0.138 |
| Reflection | Pix2Pix | 2.14 | 4.79 | 11.30 | 30.67 | 0.138 |
| Reflection | DDPM | 2.74 | 7.93 | 17.85 | 80.38 | 3986.353 |
| Reflection | SD(w.CA) | 3.81 | 6.82 | 19.78 | 81.61 | 2961.027 |
| Reflection | SD | 2.53 | 5.26 | 15.04 | 55.27 | 2970.86 |
| Reflection | DDBM | 1.93 | 6.38 | 18.34 | 79.13 | 3732.21 |
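
For reference, below is a minimal sketch of per-sample MAE and wMAPE using their common definitions; the authoritative implementation, including the LoS/NLoS masking, is sound_metrics.py in the repository, and the eps guard and function names here are assumptions:

import numpy as np

def mae(pred, target):
    # Mean absolute error over all pixels.
    return np.mean(np.abs(pred - target))

def wmape(pred, target, eps=1e-6):
    # Weighted MAPE (common definition): total absolute error normalized by the
    # total ground-truth magnitude, reported in percent; eps guards against
    # division by zero for all-silent ground truth.
    return 100.0 * np.sum(np.abs(pred - target)) / (np.sum(np.abs(target)) + eps)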

Lens Distortion

The table presents a comparative analysis of the models' performance in accurately predicting facial landmarks under varying lens distortion settings, represented by the coefficients $p_1$ and $p_2$. It details the combined error, X error, Y error, and shift for each model, highlighting how each model copes with horizontal and vertical distortion impacts separately.

| Model | Comb. | X Err. | Y Err. | Shift | Runtime / Sample (ms) |
|---|---|---|---|---|---|
| $p_1 \neq 0, p_2 = 0$ | | | | | |
| Simulation | 0.00 | 0.00 | 0.00 | 0.00 | 153.205 |
| convAE | 11.93 | 6.75 | 8.13 | 1.38 | 0.110 |
| VAE | 11.53 | 6.55 | 7.83 | 1.28 | 0.122 |
| UNet | 2.82 | 1.28 | 2.15 | 0.87 | 0.118 |
| Pix2Pix | 2.00 | 0.99 | 1.43 | 0.44 | 0.122 |
| DDPM | 1.93 | 0.94 | 1.39 | 0.45 | 3970.603 |
| SD(w.CA) | 3.09 | 1.59 | 2.21 | 0.62 | 2991.678 |
| SD | 2.79 | 1.41 | 2.01 | 0.60 | 2997.576 |
| $p_1 = 0, p_2 \neq 0$ | | | | | |
| Simulation | 0.00 | 0.00 | 0.00 | 0.00 | 153.205 |
| convAE | 10.56 | 8.35 | 4.77 | 2.21 | 0.110 |
| VAE | 10.40 | 8.26 | 4.62 | 3.64 | 0.122 |
| UNet | 2.36 | 1.33 | 1.60 | 0.27 | 0.117 |
| Pix2Pix | 1.77 | 1.02 | 1.14 | 0.13 | 0.123 |
| DDPM | 2.13 | 1.39 | 1.23 | 0.16 | 3970.603 |
| SD(w.CA) | 2.85 | 1.60 | 1.94 | 0.34 | 2991.678 |
| SD | 2.44 | 1.38 | 1.64 | 0.26 | 2997.576 |
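
For context, $p_1$ and $p_2$ are commonly the tangential distortion coefficients of the Brown-Conrady lens model; assuming that convention (the exact simulation setup is described in the paper), a normalized image point $(x, y)$ with $r^2 = x^2 + y^2$ is displaced as

$$
x_\mathrm{dist} = x + 2 p_1 x y + p_2 (r^2 + 2 x^2), \qquad
y_\mathrm{dist} = y + p_1 (r^2 + 2 y^2) + 2 p_2 x y.
$$

Roughly speaking, a non-zero $p_1$ then produces a predominantly vertical shift and a non-zero $p_2$ a predominantly horizontal one, which is why the two blocks of the table are reported separately.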

Dynamics of Rolling and Bouncing Movements

The table evaluates the generative models on four key error metrics: Position X, Position Y, Rotation, and Roundness. These metrics assess each model's ability to accurately predict the ball's position, rotation, and shape in a controlled simulation environment, highlighting its precision in handling geometric distortions.

| Model | Position X | Position Y | Rotation | Roundness | Error |
|---|---|---|---|---|---|
| convAE | 4.24 $\pm$ 3.9 | 6.08 $\pm$ 5.9 | 12.2 $\pm$ 8.6 | 1.06 $\pm$ 0.0 | 99% |
| VAE | 4.69 $\pm$ 6.1 | 6.25 $\pm$ 6.9 | 31.0 $\pm$ 40 | 0.90 $\pm$ 0.1 | 95% |
| UNet | 5.53 $\pm$ 7.5 | 10.8 $\pm$ 12 | 15.2 $\pm$ 23 | 0.74 $\pm$ 0.2 | 28% |
| Pix2Pix | 6.28 $\pm$ 8.0 | 11.7 $\pm$ 13 | 17.2 $\pm$ 21 | 0.56 $\pm$ 0.1 | 11% |
| DDPM | 7.91 $\pm$ 9.0 | 15.5 $\pm$ 14 | 32.9 $\pm$ 34 | 0.61 $\pm$ 0.2 | 5.7% |
| SD(w.CA) | 40.0 $\pm$ 49 | 24.8 $\pm$ 23 | 61.1 $\pm$ 52 | 0.53 $\pm$ 0.2 | 7.3% |
| SD | 8.55 $\pm$ 12 | 16.2 $\pm$ 14 | 34.2 $\pm$ 38 | 0.47 $\pm$ 0.1 | 2% |

Contribute to the Leaderboard

We welcome contributions from the research community! If you have conducted research on sound propagation prediction and have results that outperform those listed in our leaderboard, or if you’ve developed a new architectural approach that shows promise, we invite you to share your findings with us.

Please submit your results, along with a link to your publication, to martin.spitznagel@hs-offenburg.de. Submissions should include detailed performance metrics and a description of the methodology used. Accepted contributions will be updated on the leaderboard.

Paper

Accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025.

Preprint is available here: https://arxiv.org/abs/2503.05333

License

This dataset is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License (CC BY-NC-ND 4.0).

Funding Acknowledgement

We express our gratitude for the financial support provided by the German Federal Ministry of Education and Research (BMBF). This project is part of the “Forschung an Fachhochschulen in Kooperation mit Unternehmen (FH-Kooperativ)” program, within the joint project KI-Bohrer, and is funded under the grant number 13FH525KX1.
