publications
See Google Scholar for the most up-to-date information.
2026
- CMAME. Generative emulation of chaotic dynamics with coherent prior. Juan Nathaniel and Pierre Gentine. Computer Methods in Applied Mechanics and Engineering, 2026
Data-driven emulation of nonlinear dynamics is challenging due to long-range skill decay that often produces physically unrealistic outputs. Recent advances in generative modeling aim to address these issues by providing uncertainty quantification and correction. However, the quality of the generated simulations remains heavily dependent on the choice of conditioning prior. In this work, we present Cohesion, an efficient generative framework for nonlinear dynamics emulation that connects principles of turbulence with diffusion-based modeling. Our method estimates the large-scale coherent structure of the underlying dynamics and uses it as guidance during the denoising process, where small-scale fluctuations in the flow are then resolved. These coherent priors are efficiently approximated using reduced-order models, such as deep Koopman operators, that allow for rapid generation of long prior sequences while maintaining stability over extended forecasting horizons. With this gain, we can reframe forecasting as trajectory planning, a common task in reinforcement learning, where conditional denoising is performed once over entire sequences, minimizing the computational cost of autoregressive-based generative methods. Numerical evaluations on chaotic systems of increasing complexity, including Kolmogorov flow, shallow water equations, and subseasonal-to-seasonal climate dynamics, demonstrate Cohesion's superior long-range forecasting skill and its ability to efficiently generate physically consistent simulations, even in the presence of partially observed guidance.
2025
- Nat Comms Phys. Deep Koopman operators for causal discovery. Juan Nathaniel*, Carla Roesch*, Jatan Buch, and 4 more authors. Communications Physics, 2025
Causal discovery aims to identify cause-effect mechanisms for better scientific understanding, explainable decision-making, and more accurate modeling. Standard statistical frameworks, such as Granger causality, lack the ability to quantify causal relationships in nonlinear dynamics due to the presence of complex feedback mechanisms, timescale mixing, and nonstationarity. Thus, applying these methods to study causal dynamics in real-world systems, such as the Earth, is a major challenge. Addressing this shortcoming, we leverage deep learning and a Koopman operator-theoretic formalism to present a class of causal discovery algorithms. Kausal uses deep Koopman operator methods to approximate nonlinear dynamics in a linearized vector space in which traditional causal inference methods such as Granger causality can be more easily applied. Our idealized experiments demonstrate Kausal’s superior ability in discovering and characterizing causal signals compared to existing deep learning and non-deep learning state-of-the-art approaches. Finally, the successful identification of major El Niño and La Niña events in observations showcases Kausal’s skill to handle real-world applications.
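The idea of applying a Granger-style test in a lifted, linearized space can be illustrated with a toy sketch. This is not the Kausal implementation: a hand-picked degree-2 polynomial dictionary stands in for learned deep Koopman observables, the operator is fitted by ordinary least squares, and the coupled system, the function names (`simulate`, `lift`, `one_step_error`, `causal_score`), and the score definition are all assumptions made for illustration.

```python
import numpy as np


def simulate(n_steps=2000, seed=0):
    """Toy system (assumed for illustration): x drives y, y does not drive x."""
    rng = np.random.default_rng(seed)
    x = np.empty(n_steps)
    y = np.empty(n_steps)
    x[0], y[0] = 0.4, 0.1
    for t in range(n_steps - 1):
        # x follows a chaotic logistic map and never sees y.
        x[t + 1] = 3.8 * x[t] * (1.0 - x[t])
        # y is a damped process forced nonlinearly by x, plus small noise.
        y[t + 1] = 0.5 * y[t] + 0.4 * x[t] ** 2 + 0.01 * rng.standard_normal()
    return x, y


def lift(*series):
    """Degree-2 polynomial dictionary: constant, linear terms, pairwise products."""
    cols = [np.ones_like(series[0])]
    cols.extend(series)
    for i, a in enumerate(series):
        for b in series[i:]:
            cols.append(a * b)
    return np.column_stack(cols)


def one_step_error(features, target):
    """Fit a linear map from lifted state at t to target at t+1; return in-sample MSE."""
    K, *_ = np.linalg.lstsq(features[:-1], target[1:], rcond=None)
    residual = target[1:] - features[:-1] @ K
    return float(np.mean(residual ** 2))


def causal_score(cause, effect):
    """Granger-style score: error reduction from adding the candidate cause."""
    marginal = one_step_error(lift(effect), effect)       # effect's own past only
    joint = one_step_error(lift(effect, cause), effect)   # effect and cause together
    return marginal - joint


x, y = simulate()
score_xy = causal_score(x, y)  # candidate direction x -> y
score_yx = causal_score(y, x)  # candidate direction y -> x
print(f"x -> y: {score_xy:.3e}")
print(f"y -> x: {score_yx:.3e}")
```

In this toy setting the x -> y score should be clearly positive, while the y -> x score should sit near zero because the logistic map is already captured exactly by the quadratic dictionary on x alone. A learned dictionary would replace `lift` when no suitable basis is known in advance.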
- CausalDynamics: A large-scale benchmark for structural discovery of dynamical causal models. Benjamin Herdeanu*, Juan Nathaniel*, Carla Roesch*, and 4 more authors. In Advances in Neural Information Processing Systems 38 (NeurIPS), 2025
Causal discovery for dynamical systems poses a major challenge in fields where active interventions are infeasible. Most methods used to investigate these systems and their associated benchmarks are tailored to deterministic, low-dimensional and weakly nonlinear time-series data. To address these limitations, we present CausalDynamics, a large-scale benchmark and extensible data generation framework to advance the structural discovery of dynamical causal models. Our benchmark consists of true causal graphs derived from thousands of both linearly and nonlinearly coupled ordinary and stochastic differential equations as well as two idealized climate models. We perform a comprehensive evaluation of state-of-the-art causal discovery algorithms for graph reconstruction on systems with noisy, confounded, and lagged dynamics. CausalDynamics consists of a plug-and-play, build-your-own coupling workflow that enables the construction of a hierarchy of physical systems. We anticipate that our framework will facilitate the development of robust causal discovery algorithms that are broadly applicable across domains while addressing their unique challenges.
2024
- NeurIPS Oral. ChaosBench: A multi-channel, physics-based benchmark for subseasonal-to-seasonal climate prediction. Juan Nathaniel, Yongquan Qu, Tung Nguyen, and 4 more authors. In Advances in Neural Information Processing Systems 37 (NeurIPS), 2024
Accurate prediction of climate at the subseasonal-to-seasonal (S2S) scale is crucial for disaster preparedness and robust decision making amidst climate change. Yet, forecasting beyond the weather timescale is challenging because it deals with problems beyond initial conditions, including boundary interaction, the butterfly effect, and our inherent lack of physical understanding. At present, existing benchmarks tend to have a shorter forecasting range of up to 15 days, do not include a wide range of operational baselines, and lack physics-based constraints for explainability. Thus, we propose ChaosBench, a challenging benchmark to extend the predictability range of data-driven weather emulators to the S2S timescale. First, ChaosBench comprises variables beyond the typical surface-atmospheric ERA5 to also include ocean, ice, and land reanalysis products that span over 45 years, allowing for full Earth system emulation that respects boundary conditions. We also propose physics-based metrics, in addition to deterministic and probabilistic ones, to ensure a physically consistent ensemble that accounts for the butterfly effect. Furthermore, we evaluate a diverse set of physics-based forecasts from four national weather agencies as baselines for data-driven counterparts such as ViT/ClimaX, PanguWeather, GraphCast, and FourCastNetV2. Overall, we find that methods originally developed for weather-scale applications fail on the S2S task: their performance simply collapses to an unskilled climatology. Nonetheless, we outline and demonstrate several strategies that can extend the predictability range of existing weather emulators, including the use of ensembles, robust control of error propagation, and the use of physics-informed models. Our benchmark, datasets, and instructions are available at https://leap-stc.github.io/ChaosBench.
- npj Clean Water. Inferring failure risk of on-site wastewater systems from physical and social factors. Juan Nathaniel, Sara Schwetschenau, and Upmanu Lall. npj Clean Water, 2024
Aging infrastructure and climate change present emerging challenges for clean water supply and reliable wastewater services for communities in the United States (US). In Georgia, for example, the failure rates of on-site wastewater systems (OWTS) have increased from 10% to 35% in the last two decades as the systems age. In this work, we develop a hierarchical Bayesian model to understand the different contributions of physical and social factors driving OWTS failures using a long-term collection of 201,000 OWTS inspection records from Georgia. The out-of-sample validation accuracy of our hierarchical Bayesian model is 70% within Georgia, outperforming other machine learning models that do not consider the multiscale nature of the problem. Overall, we find that counties that experience more extreme precipitation and are situated in steeper-sloped regions are significantly associated with increased failure risks. Uncertainties, meanwhile, are largely associated with counties that experience more precipitation and have lower median housing values.
- Sci Data. Spatiotemporal upscaling of sparse air-sea pCO2 data via physics-informed transfer learning. Siyeon Kim*, Juan Nathaniel*, Zhewen Hou, and 2 more authors. Scientific Data, 2024
Global measurements of ocean pCO2 are critical to monitor and understand changes in the global carbon cycle. However, pCO2 observations remain sparse as they are mostly collected on opportunistic ship tracks. Several approaches, especially based on direct learning, have been used to upscale and extrapolate sparse point data to dense estimates using globally available input features. However, these estimates tend to exhibit spatially heterogeneous performance. To address this, we propose a physics-informed transfer learning workflow to generate dense pCO2 estimates that are grounded in real-world measurements and remain physically consistent. The models are initially trained on dense input predictors against pCO2 estimates from Earth system model simulation, and then fine-tuned to sparse SOCAT observational data. Compared to the benchmark direct learning approach, our transfer learning framework shows major improvements of up to 56-92%. Furthermore, we demonstrate that using models that explicitly account for spatiotemporal structures in the data yields better validation performance by 50-68%. Our strategy thus presents a new monthly global pCO2 estimate that spans 35 years, from 1982 to 2017.
- Workshop, Best Student Paper. Deep generative data assimilation in multimodal setting. Yongquan Qu*, Juan Nathaniel*, Shuolin Li, and 1 more author. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024. Best Student Paper Award @ CVPR EarthVision Workshop 2024
Robust integration of physical knowledge and data is key to improving computational simulations such as Earth system models. Data assimilation is crucial for achieving this goal because it provides a systematic framework to calibrate model outputs with observations, which can include remote sensing imagery and ground station measurements, with uncertainty quantification. Conventional methods, including Kalman filters and variational approaches, inherently rely on simplifying linear and Gaussian assumptions and can be computationally expensive. Nevertheless, with the rapid adoption of data-driven methods in many areas of computational sciences, we see the potential of emulating traditional data assimilation with deep learning, especially generative models. In particular, the diffusion-based probabilistic framework has large overlaps with data assimilation principles: both allow for conditional generation of samples with a Bayesian inverse framework. These models have shown remarkable success in text-conditioned image generation and image-controlled video synthesis. Likewise, one can frame data assimilation as observation-conditioned state calibration. In this work, we propose SLAMS: Score-based Latent Assimilation in Multimodal Setting. Specifically, we assimilate in-situ weather station data and ex-situ satellite imagery to calibrate the vertical temperature profiles globally. Through extensive ablation, we demonstrate that SLAMS is robust even in low-resolution, noisy, and sparse data settings. To our knowledge, our work is the first to apply a deep generative framework for multimodal data assimilation using real-world datasets, an important step for building robust computational simulators, including the next-generation Earth system models.
2023
- Sci Data. MetaFlux: Meta-learning global carbon fluxes from sparse spatiotemporal observations. Juan Nathaniel, Jiangong Liu, and Pierre Gentine. Scientific Data, 2023
We provide a global, long-term carbon flux dataset of gross primary production and ecosystem respiration generated using meta-learning, called MetaFlux. The idea behind meta-learning stems from the need to learn efficiently given sparse data by learning how to learn broad features across tasks to better infer other poorly sampled ones. Using a meta-trained ensemble of deep models, we generate global carbon products on daily and monthly timescales at a 0.25-degree spatial resolution from 2001 to 2021, through a combination of reanalysis and remote-sensing products. Site-level validation finds that MetaFlux ensembles have lower validation error by 5-7% compared to their non-meta-trained counterparts. In addition, they are more robust to extreme observations, with 4-24% lower errors. We also checked the seasonality, interannual variability, and correlation to solar-induced fluorescence of the upscaled product and found that MetaFlux outperformed other machine-learning-based carbon products by 10-40%, especially in the tropics and semi-arid regions. Overall, MetaFlux can be used to study a wide range of biogeochemical processes.