Continuous benchmarking: Keeping pace with an evolving ecosystem of models and technologies
This paper introduces CI-beNNch, an automated continuous benchmarking framework for high-performance computing (HPC) applications, drawing on continuous integration principles. It addresses critical issues of reproducibility and usability in scientific software development by abstracting configurations and streamlining execution across diverse computing environments.
Enhancing AI and Dynamical Subseasonal Forecasts with Probabilistic Bias Correction
This paper introduces Probabilistic Bias Correction (PBC), a machine learning framework designed to significantly improve subseasonal weather forecasts (2-6 weeks ahead) by correcting systematic errors in existing dynamical and AI models. PBC has demonstrated superior performance in real-time forecasting competitions, enhancing predictions for temperature, pressure, and precipitation, and improving the accuracy of extreme event warnings.
Dual-Modal Lung Cancer AI: Interpretable Radiology and Microscopy with Clinical Risk Integration
This study introduces a dual-modal AI framework that combines CT radiology and H&E microscopy with clinical data to improve lung cancer diagnosis and subtype classification. The system demonstrates high accuracy and interpretability, offering a more robust and transparent alternative to single-modality diagnostic methods.
Serving Chain-structured Jobs with Large Memory Footprints with Application to Large Foundation Model Serving
The paper introduces a formal resource-allocation framework for serving large transformer-based models with pipeline parallelism, proposing greedy placement, cache allocation, and load-balancing algorithms that drastically cut inference latency.
MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation
MM-WebAgent is a hierarchical agent that jointly plans layout and multimodal assets, then iteratively refines them, achieving state-of-the-art results on a new multimodal webpage generation benchmark.