08.01.2025 12:15 Hannah Laus (TUM) : Non-Asymptotic Uncertainty Quantification in High-Dimensional Learning
Uncertainty quantification (UQ) is a crucial but challenging task in many high-dimensional regression or learning problems, where it is needed to increase confidence in a given predictor. In this talk we discuss a new data-driven approach for UQ in regression that applies both to classical regression approaches such as the LASSO and to neural networks. One of the most notable UQ techniques is the debiased LASSO, which modifies the LASSO to allow for the construction of asymptotic confidence intervals by decomposing the estimation error into a Gaussian component and an asymptotically vanishing bias component. However, in real-world problems with finite-dimensional data, the bias term is often too significant to be neglected, resulting in overly narrow confidence intervals. In this talk we address this issue and derive a data-driven adjustment that corrects the confidence intervals for a large class of predictors by estimating the means and variances of the bias terms from training data, exploiting high-dimensional concentration phenomena. This gives rise to non-asymptotic confidence intervals, which can help avoid overestimating certainty in critical applications such as MRI diagnosis. Importantly, this analysis extends beyond sparse regression to data-driven predictors like neural networks, enhancing the reliability of model-based deep learning. Our findings bridge the gap between established theory and the practical applicability of such debiased methods. This talk is based on joint work with Frederik Hoppe, Claudio Mayrink Verdun, Felix Krahmer and Holger Rauhut.
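For orientation, the error decomposition mentioned in the abstract can be written out in the standard debiased-LASSO formulation. This is a sketch in our own notation, not necessarily the one used in the talk: it assumes a linear model y = X\beta + \varepsilon with n samples, a LASSO estimate \hat\beta, the empirical covariance \hat\Sigma = X^\top X / n, and a matrix M that approximately inverts \hat\Sigma.

```latex
% Debiased estimator and its error decomposition (notation is an assumption).
\[
\hat\beta^{u} = \hat\beta + \tfrac{1}{n}\, M X^\top (y - X\hat\beta),
\qquad
\sqrt{n}\,(\hat\beta^{u} - \beta)
  = \underbrace{\tfrac{1}{\sqrt{n}}\, M X^\top \varepsilon}_{W}
  + \underbrace{\sqrt{n}\,(M\hat\Sigma - I)(\beta - \hat\beta)}_{R}.
\]
% W is Gaussian given X when the noise \varepsilon is Gaussian;
% R is the bias (remainder) term that vanishes only asymptotically.
```

Asymptotic confidence intervals use only W and drop R; the abstract's point is that for finite n the term R can be too large to ignore.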
15.01.2025 11:15 Elisabeth Maria Griesbauer (Institute of Basic Medical Sciences, Oslo, NO): Synthetic data generation balancing privacy and utility, using vine copulas
The availability of diverse, high-quality data, analysed by means of statistical and machine learning (ML) methods, has led to tremendous advances in science, technology and society at large. However, real-world data in many cases cannot be made public to the research community due to privacy restrictions, which obstructs progress, especially in bio-medical research. Synthetic data can substitute for the sensitive real data as long as they do not disclose private aspects. This approach has proven successful in training downstream ML applications.
We propose TVineSynth, a vine copula based synthetic tabular data generator designed to balance privacy and utility, using the vine tree structure and its truncation to control the trade-off. In contrast to synthetic data generators that achieve differential privacy (DP) by globally adding noise, TVineSynth performs a controlled approximation of the estimated data generating distribution, so that the resulting synthetic data do not suffer from poor utility in downstream prediction tasks. TVineSynth introduces a targeted bias into the vine copula model that, combined with the specific tree structure of the vine, causes the model to zero out privacy-leaking dependencies while relying on those that are beneficial for utility. Privacy is measured here with membership inference attacks (MIA) and attribute inference attacks (AIA). Further, we theoretically justify how the construction of TVineSynth ensures AIA privacy under a natural privacy measure for continuous sensitive attributes. When compared to competitor models, with and without DP, on simulated and on real-world data, TVineSynth achieves a superior privacy-utility balance.
15.01.2025 12:15 David Huk (University of Warwick, Coventry, UK): Quasi-Bayes meets Vines
Recently developed quasi-Bayesian (QB) methods proposed a stimulating change of paradigm in Bayesian computation by directly constructing the Bayesian predictive distribution through recursion, removing the need for the expensive computations involved in sampling the Bayesian posterior distribution. This has proved to be data-efficient for univariate predictions. However, existing constructions for higher-dimensional densities are only possible by relying on restrictive assumptions on the model's multivariate structure. In this talk, we discuss a wholly different approach to extending quasi-Bayesian prediction to high dimensions through the use of Sklar's theorem, by decomposing the predictive distribution into one-dimensional predictive marginals and a high-dimensional copula. We use the efficient recursive QB construction for the one-dimensional marginals and model the dependence using highly expressive vine copulas. Further, we tune hyperparameters using robust divergences (e.g. the energy score) and show that our proposed Quasi-Bayesian Vine (QB-Vine) is a fully non-parametric density estimator with an analytical form and, in some situations, a convergence rate independent of the dimension of the data. Our experiments illustrate that the QB-Vine is appropriate for high-dimensional distributions (64), needs very few samples to train (200), and outperforms state-of-the-art methods with analytical forms for density estimation and supervised tasks by a considerable margin.
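The decomposition via Sklar's theorem referred to above can be stated as follows. This is a sketch assuming continuous marginals; the symbols p_n, P_{n,j} and c_n are our own notation for the predictive density after n observations, its j-th marginal distribution function and the copula density, and are not taken from the talk.

```latex
% Sklar's theorem in density form for a d-dimensional predictive density p_n
% with marginal densities p_{n,j} and marginal CDFs P_{n,j} (notation assumed).
\[
p_n(x_1, \dots, x_d)
  = c_n\bigl(P_{n,1}(x_1), \dots, P_{n,d}(x_d)\bigr)
    \prod_{j=1}^{d} p_{n,j}(x_j).
\]
```

In the approach described in the abstract, each one-dimensional marginal is handled by the recursive QB construction, while the copula density c_n is modelled by a vine copula.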
16.01.2025 16:30 Daniel Frischemeier (Universität Münster): Fostering statistical thinking in primary school mathematics: from research questions and data cards to the use of digital tools
To prepare pupils early on for the competent handling of data, fostering early statistical thinking is of fundamental importance. It is essential to take central principles into account in primary school mathematics teaching, such as embedding the work in a data analysis cycle, formulating well-founded statistical questions (research questions), and working with real and multivariate data.
Even in introductory stochastics lessons, data cards allow pupils to make independent, hands-on discoveries with real and multivariate data by rearranging and sorting the cards. In this way, pupils can develop their own forms of representing data. Building on these experiences, digital tools such as the software TinkerPlots can be used to explore larger data sets and to carry out statistical projects.
The talk gives an overview of central classroom-oriented approaches and ideas for fostering early statistical thinking in primary school mathematics teaching. In addition, selected results from accompanying empirical studies are presented.
______________________
Invited by Prof. Karin Binder.
20.01.2025 13:45 Yuhao Wang (Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, CN; Shanghai Qi Zhi Institute, CN): Residual permutation test for regression coefficient testing
We consider the problem of testing whether a single coefficient is equal to zero in linear models when the dimension of covariates p can be up to a constant fraction of the sample size n. In this regime, an important topic is to propose tests with finite-population valid size control without requiring the noise to follow strong distributional assumptions. In this paper, we propose a new method, called the residual permutation test (RPT), which is constructed by projecting the regression residuals onto the space orthogonal to the union of the column spaces of the original and permuted design matrices. RPT can be proved to achieve finite-population size validity under fixed design with just exchangeable noises, whenever p < n/2. Moreover, RPT is shown to be asymptotically powerful for heavy-tailed noises with bounded (1+t)-th order moment when the true coefficient is at least of order n^{-t/(1+t)} for t \in [0, 1]. We further prove that this signal size requirement is essentially rate-optimal in the minimax sense. Numerical studies confirm that RPT performs well in a wide range of simulation settings with normal and heavy-tailed noise distributions.
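The projection step described in the abstract can be illustrated with a short NumPy sketch. This shows only the projection onto the space orthogonal to the original and permuted design columns, not the full test: the test statistic and p-value construction of RPT are not reproduced here. The function name `orthogonal_complement_basis`, the variable names and the simulated data are all hypothetical and chosen for illustration.

```python
import numpy as np


def orthogonal_complement_basis(Z, perm):
    """Orthonormal basis of the space orthogonal to col(Z) and col(Z[perm]).

    Z is an n x p design matrix and perm a permutation of range(n).
    Hypothetical helper for illustration; not the authors' implementation.
    """
    stacked = np.hstack([Z, Z[perm]])  # n x 2p: original and row-permuted design
    # Full SVD: left singular vectors beyond the numerical rank span the complement.
    U, s, _ = np.linalg.svd(stacked, full_matrices=True)
    tol = max(stacked.shape) * np.finfo(float).eps * (s[0] if s.size else 1.0)
    rank = int(np.sum(s > tol))
    return U[:, rank:]  # n x (n - rank)


# Simulated example with p < n/2, so the complement is non-trivial.
rng = np.random.default_rng(0)
n, p = 40, 5
Z = rng.standard_normal((n, p))
perm = rng.permutation(n)

V = orthogonal_complement_basis(Z, perm)
y = rng.standard_normal(n)
projected = V.T @ y  # coordinates of y in the complement; free of Z and Z[perm]
```

The condition p < n/2 from the abstract is visible here: the stacked matrix has at most 2p columns, so the orthogonal complement has dimension at least n - 2p and is non-empty only when 2p < n.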