Scientific seminar of the Millennium Nucleus Center for the Discovery of Structures in Complex Data (MiDaS). More information at: midas.mat.uc.cl

2022-03-25 14:00hrs.

John Staudenmayer. University of Massachusetts, Amherst. Naive Penalized Spline Estimators of Derivatives Achieve Optimal Rates of Convergence. Registration until Thursday, March 24. Abstract: Given data $\{x_i, y_i\}_{i=1}^n$, sampled from $y_i = f(x_i) + e_i$ where $f$ is an unknown function with $p$ continuous derivatives, and $e_i$ are iid with mean zero and constant variance, we are interested in estimating a derivative of $f$. While it is straightforward to compute nonparametric estimates of derivative functions, the challenge is that those estimates also require some sort of regularization to balance estimation bias and overfitting, and methods to choose that regularization are usually designed for estimating the function itself, not derivatives. In this talk we review a few methods to address that problem and what is known about their asymptotic properties. We also present a new asymptotic result about penalized splines and show that choosing a smoothing parameter to estimate the function itself and simply differentiating the estimate achieves the optimal $L_2$ rate of convergence. This is joint work with Bright Antwi Boasiako, a graduate student at the University of Massachusetts, Amherst.

2021-12-17 14:00hrs.

Vanda Inacio de Carvalho. University of Edinburgh. Density regression via dependent Dirichlet process mixtures and penalised splines. Registration until Friday, December 17 at 12:00hrs: https://forms.gle/ukMeSsuEdkqhmtZb7 Abstract: In many real-life applications, it is of interest to study how the distribution of a (continuous) response variable changes with covariates. Dependent Dirichlet process (DDP) mixture of normals models, a Bayesian nonparametric method, successfully address this goal. The approach of considering covariate independent mixture weights, also known as the single-weights dependent Dirichlet process mixture model, is very popular due to its computational convenience but can have limited flexibility in practice. To overcome the lack of flexibility while retaining computational tractability, this work develops a single-weights DDP mixture of normals model, where the components’ means are modelled using Bayesian penalised splines (P-splines). We call our approach psDDP. A practically important feature of psDDP models is that all parameters have conjugate full conditional distributions, thus leading to straightforward Gibbs sampling. In addition, they allow the effect associated with each covariate to be learned automatically from the data. The validity of our approach is supported by simulations, and the approach is applied to two real datasets, one concerning the study of the association of a toxic metabolite with preterm birth, and the other in the context of field trial experiments.

2021-12-03 14:00hrs.

Sayed Mostafa. North Carolina A&T State University. Nonparametric Model-Assisted Estimation Using Scrambled Responses from Complex Surveys. Registration until Friday, December 3 at 12:00hrs: https://forms.gle/qnnTjEbhDJ7azuUFA Abstract: The randomized response technique offers an effective way of reducing potential bias resulting from nonresponse and untruthful responses when asking questions about sensitive behaviors or beliefs. The technique is also used for conducting statistical disclosure control of public use data files, as is commonly done by the U.S. Census Bureau. In both cases, the technique works by scrambling the actual survey responses using some known randomization model. In the case of asking sensitive survey questions, the scrambling of responses is done by survey respondents and only scrambled responses are collected, whereas in the case of disclosure control, the survey agency implements the randomization of responses after collecting the survey data and prior to releasing it for public use. In this talk, we will consider the problem of estimating the finite population mean of a study variable for which only scrambled responses are available from a complex sample survey. In addition to these scrambled responses, we assume that non-scrambled data about non-sensitive auxiliary variables are available for all population units from administrative records or any other source. We define and study a class of nonparametric model-assisted estimators that make efficient use of available auxiliary information and account for the complex survey design. Asymptotic properties of the proposed estimators are derived, and the finite sample performance of these estimators is studied via extensive simulations accounting for a wide range of forms for the relationship between the study and auxiliary variables. The empirical results support the theoretical analyses and suggest that our proposed estimators are superior to existing estimators in most cases.
We also discuss the problem of variance estimation and construction of confidence intervals. The proposed methods are illustrated using data from the U.S. Consumer Expenditure Survey.
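As a rough illustration of the scrambling idea described in this abstract, the sketch below assumes a simple additive randomization model: each respondent reports the true value plus noise drawn from a distribution whose mean is known to the analyst. This model, the function names, and the simulated data are assumptions for illustration only; the sketch ignores the complex survey design and the auxiliary variables that the talk's model-assisted estimators account for.

```python
import random
import statistics

def scramble(y, rng, noise_mean=10.0, noise_sd=2.0):
    """Additive scrambling (assumed model): report y plus noise with a known mean."""
    return y + rng.gauss(noise_mean, noise_sd)

def estimate_mean(scrambled, noise_mean=10.0):
    """Estimate the mean of the unscrambled values by subtracting the known noise mean."""
    return statistics.fmean(scrambled) - noise_mean

rng = random.Random(0)
# Hypothetical sensitive variable; only the scrambled values are "collected".
true_values = [rng.gauss(50.0, 5.0) for _ in range(20000)]
reported = [scramble(y, rng) for y in true_values]

estimate = estimate_mean(reported)
print(round(estimate, 2), round(statistics.fmean(true_values), 2))
```

Individual reported values reveal little about any one respondent, yet the mean is still recoverable because the randomization model is known.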

2021-10-08 14:00hrs.

Sally Paganin. Harvard School of Public Health. Centered Partition Processes: Informative Priors for Clustering. Registration until Thursday, October 7: https://forms.gle/wT3rvC63u3uAeN5A6 Abstract:

There is a very rich literature proposing Bayesian approaches for clustering starting with a prior probability distribution on partitions. Most approaches assume exchangeability, leading to simple representations in terms of Exchangeable Partition Probability Functions (EPPF). Gibbs-type priors encompass a broad class of such cases, including Dirichlet and Pitman-Yor processes. Even though there have been some proposals to relax the exchangeability assumption, allowing covariate-dependence and partial exchangeability, limited consideration has been given to how to include concrete prior knowledge on the partition. For example, we are motivated by an application in birth defect epidemiology, in which we wish to cluster birth defects into groups and we have prior knowledge of an initial clustering provided by experts. As a general approach for including such prior knowledge, we propose a Centered Partition process that modifies the EPPF to favor partitions close to an initial one. We use our method to analyze data from the National Birth Defect Prevention Study (NBDPS), relating different exposures to the risk of developing a birth defect, focusing on the class of Congenital Heart Defects.

This is joint work with Amy Herring (Duke University), Andrew Olshan (UNC at Chapel Hill) and David Dunson (Duke University).

2021-09-24 14:00hrs.

Kelly C. M. Gonçalves. Universidade Federal do Rio de Janeiro. Bayesian dynamic quantile linear models and some extensions. Registration until Thursday, September 23 (https://forms.gle/SPCjQ3aSc3nnQi4k7) Abstract: The main aim of this talk is to present a new class of models, named dynamic quantile linear models. It combines dynamic linear models with distribution-free quantile regression, producing a robust statistical method. This class of models provides richer information on the effects of the predictors than does the traditional mean regression and it is very insensitive to heteroscedasticity and outliers, accommodating the non-normal errors often encountered in practical applications. Bayesian inference for quantile regression proceeds by forming the likelihood function based on the asymmetric Laplace distribution, and a location-scale mixture representation of it yields analytical expressions for the conditional posterior densities of the model. Thus, Bayesian inference for dynamic quantile linear models can be performed using an efficient Markov chain Monte Carlo algorithm or a fast sequential procedure suited for high-dimensional predictive modeling applications with massive data. Finally, a hierarchical extension, useful to account for structural features in the dataset, will also be presented.
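To make the link between the asymmetric Laplace likelihood and quantiles concrete, here is a minimal static sketch (not the dynamic model of the talk). With the scale held fixed, the asymmetric Laplace log-likelihood in its location parameter equals, up to a constant, minus the sum of check losses $\rho_\tau(u) = u(\tau - 1\{u < 0\})$, so maximizing it over a constant location recovers a sample quantile. The data and function names are hypothetical.

```python
def check_loss(u, tau):
    """Quantile check loss: rho_tau(u) = u * (tau - 1{u < 0})."""
    return u * (tau - (1.0 if u < 0 else 0.0))

def total_loss(q, data, tau):
    """Sum of check losses; minus the asymmetric Laplace log-likelihood up to a constant."""
    return sum(check_loss(y - q, tau) for y in data)

def quantile_by_check_loss(data, tau):
    """Search the observed values for the location that minimizes the total check loss."""
    return min(data, key=lambda q: total_loss(q, data, tau))

# Hypothetical sample with one gross outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]

median = quantile_by_check_loss(data, 0.5)
lower_quartile = quantile_by_check_loss(data, 0.25)
print(median, lower_quartile)
```

The outlier at 100 shifts the sample mean to 22 but leaves both fitted quantiles among the bulk of the data, which is the robustness the abstract refers to.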

2021-08-13 14:30hrs.

Mark Handcock. University of California, Los Angeles. Two applications in response to COVID-19: Modeling transmission using contact tracing data and modeling associated excess deaths. Registration until Thursday, August 12: https://forms.gle/2njbS1t6YyBBgE5eA Abstract: In this talk we will cover two separate applications of statistical modeling, each aimed at epidemic modeling in response to the COVID-19 pandemic. In the first we quantify the transmission potential of asymptomatic, presymptomatic and symptomatic cases using surveillance data from an outbreak in Ho Chi Minh City, Vietnam. We develop a transmission model and use a Bayesian framework to estimate the proportions of asymptomatic, presymptomatic and symptomatic cases and transmissions. We map chains of transmission and estimate the basic reproduction number ($R_0$). In the second application we developed statistical models and methodology to understand the historical patterns of all-cause mortality data at the country/regional level and to relate this to the level of all-cause mortality during the COVID-19 pandemic. The focus here is providing private and flexible tools for public health epidemiologists to gauge current all-cause mortality against the historical patterns. We have developed and published an on/off-line open-source Shiny app for the private analysis of all-cause mortality data. It presents various visualizations of the expected all-cause deaths and excess deaths.

2021-07-23 14:30hrs.

Leontine Alkema. University of Massachusetts Amherst. Model-Based Estimates in Demography and Global Health: Quantifying the Contribution of Population-Period-Specific Information. Registration until Thursday, July 22: https://forms.gle/YDRmXDaZ6JyVjqBcA Abstract:

Sophisticated statistical models are used to produce estimates for demographic and health indicators even when data are limited, very uncertain or lacking. To facilitate interpretation and use of model-based estimates, we aim to provide a standardized approach to answer the question: To what extent is a model-based estimate of an indicator of interest informed by data for the relevant population-period as opposed to information supplied by other periods and populations and model assumptions? We propose a data weight measure to calculate the weight associated with population-period data set y relative to the model-based prior estimate obtained by fitting the model to all data excluding y. In addition, we propose a data-model accordance measure which quantifies how extreme the population-period data are relative to the prior model-based prediction.

2021-07-02 14:30hrs.

Andres Felipe Barrientos. Florida State University. Differentially private methods for Bayesian model uncertainty in linear regression models. Registration until Thursday, July 1: https://forms.gle/7wXgzuSy9ozLPJAH9 Abstract: Statistical methods for confidential data are in high demand, for reasons ranging from recent trends in privacy law to ethical considerations. Currently, differential privacy is the most widely adopted formalization of privacy of randomized algorithms in the literature. This article provides differentially private methods for handling model uncertainty in normal linear regression models. More precisely, we introduce techniques that allow us to provide differentially private Bayes factors, posterior probabilities, and model-averaged estimates. Our methods are conceptually simple and easy to run with existing implementations of non-private methods.

2021-06-04 14:30hrs.

David Haziza. University of Ottawa. A general multiply robust framework for combining probability and non-probability samples in surveys. Registration until Thursday, June 3: https://forms.gle/Q5nZdPjq94shnsAL6 Abstract: In recent years, there has been an increased interest in combining probability and non-probability samples. Non-probability samples are cheaper and quicker to conduct but the resulting estimators are vulnerable to bias as the participation probabilities are unknown. To adjust for the potential bias, estimation procedures based on parametric or nonparametric models have been discussed in the literature. However, the validity of the resulting estimators relies heavily on the validity of the underlying models. We propose a data integration approach by combining multiple outcome regression models and propensity score models. The proposed approach can be used for estimating general parameters including totals, means, distribution functions and percentiles. The resulting estimators are multiply robust in the sense that they remain consistent if all but one model are misspecified. I will present the results from a simulation study that show the benefits of the proposed method in terms of bias and efficiency.

2021-05-07 14:30hrs.

Krista J Gile. University of Massachusetts Amherst. Policy Questions, Messy Data: Three approaches to turning messy data into information for public policy. Registration until Thursday, May 6: https://forms.gle/1H2vXGRLEjaAAQgb8 Abstract:

This talk describes 3 projects demonstrating the development of statistical methods to use messy data on people to draw conclusions for public health or policy. It concludes with some comments on the contribution of statistics to such settings.

For the first part consider: How do you sample populations such as people who inject drugs? You find a few, then get them to recruit their friends. But how do you make inference from the resulting sample? Respondent-Driven Sampling attempts to allow for statistical inference in this challenging data setting. In particular, we address the question of how to cluster the network tree-structured data collected using RDS based on covariates and partial network observation.

In the second part, we address the challenge of estimating the number of killings in the Syrian conflict. The challenge would be easy if there were a list of killings, and several groups are working on creating such lists. Unfortunately, none of the lists are complete. We introduce a method to use hierarchical clustering to characterize the partial overlap between 4 separately-collected lists of killings to estimate the total killings. This is an extension of the classical capture-recapture or multiple systems estimation methods.

Finally, we discuss a difficult issue in analysis of network data: In many cases, networks are undirected (if I had lunch with you, you also had lunch with me), but two parties reporting on the same relation may give conflicting reports. We address this problem in the context of a longitudinal study of (several types of) social relations and health behaviors among middle school students. Leveraging the multiple networks and time points of response observed for each student, we estimate the false reporting rates of each student to infer the true network structure.
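The second project above extends classical capture-recapture estimation. For orientation only, the sketch below shows the textbook two-list Lincoln-Petersen estimator, which assumes the two lists are independent samples from the same closed population and that records can be matched exactly. It is not the hierarchical-clustering method of the talk, and the record identifiers are hypothetical.

```python
def lincoln_petersen(list_a, list_b):
    """Two-list capture-recapture estimate of total population size.

    N_hat = |A| * |B| / |A and B|, assuming independent lists and exact matching.
    """
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    if overlap == 0:
        raise ValueError("lists do not overlap; the estimator is undefined")
    return len(a) * len(b) / overlap

# Hypothetical record identifiers: 200 on list A, 150 on list B, 30 on both.
list_a = range(0, 200)
list_b = range(170, 320)

estimate = lincoln_petersen(list_a, list_b)
print(estimate)  # 200 * 150 / 30 = 1000.0
```

The smaller the overlap relative to the list sizes, the larger the estimated number of records that appear on neither list.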

2021-04-23 14:30hrs.

Karl Rohe. UW-Madison. Vintage Factor Analysis with Varimax Performs Statistical Inference. Registration until Thursday, April 22: https://forms.gle/bwr26TUbvQ9ZL9Pe6 Abstract: Psychologists developed Multiple Factor Analysis to decompose multivariate data into a small number of interpretable factors without any a priori knowledge about those factors. In this form of factor analysis, the Varimax "factor rotation" is a key step to make the factors interpretable. Charles Spearman and many others objected to factor rotations because the factors seem to be rotationally invariant. This is an historical enigma because factor rotations have survived and are widely popular because, empirically, they often make factors easier to interpret. We argue that the rotation makes the factors easier to interpret because, in fact, the Varimax factor rotation performs statistical inference. We show that Principal Components Analysis (PCA) with the Varimax rotation provides a unified spectral estimation strategy for a broad class of modern factor models, including the Stochastic Blockmodel and a natural variation of Latent Dirichlet Allocation (i.e., "topic modeling"). In addition, we show that Thurstone's widely employed sparsity diagnostics implicitly assess a key "leptokurtic" condition that makes the rotation statistically identifiable in these models. Taken together, this shows that the know-how of Vintage Factor Analysis performs statistical inference, reversing nearly a century of statistical thinking on the topic. With a sparse eigensolver, PCA with Varimax is both fast and stable. Combined with Thurstone's straightforward diagnostics, this vintage approach is suitable for a wide array of modern applications. https://arxiv.org/abs/2004.05387

2020-06-17 15:00hrs.

Miguel de Carvalho. University of Edinburgh. Elements of Bayesian geometry. Zoom (request the link from Luis Gutiérrez). Abstract:

In this talk, I will discuss a geometric interpretation of Bayesian inference that will yield a natural measure of the level of agreement between priors, likelihoods, and posteriors. The starting point for the construction of the proposed geometry is the observation that the marginal likelihood can be regarded as an inner product between the prior and the likelihood. A key concept in our geometry is that of compatibility, a measure which is based on the same construction principles as Pearson correlation, but which can be used to assess how much the prior agrees with the likelihood, to gauge the sensitivity of the posterior to the prior, and to quantify the coherency of the opinions of two experts. Estimators for all the quantities involved in our geometric setup are discussed, which can be directly computed from the posterior simulation output. Some examples are used to illustrate our methods, including data related to on-the-job drug usage, midge wing length, and prostate cancer. Joint work with G. L. Page and with B. J. Barney.
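The inner-product view in this abstract can be sketched numerically. The code below assumes that compatibility is the normalized inner product $\langle f, g \rangle / (\|f\| \, \|g\|)$ with $\langle f, g \rangle = \int f(\theta) g(\theta) \, d\theta$, by analogy with Pearson correlation; that formula, the Beta(2, 2) prior, the binomial likelihood, and the grid approximation are illustrative assumptions rather than the estimators discussed in the talk.

```python
import math

def inner(f, g, grid):
    """Midpoint-rule approximation of the integral of f * g over [0, 1]."""
    h = 1.0 / len(grid)
    return sum(f(t) * g(t) for t in grid) * h

def compatibility(f, g, grid):
    """Normalized inner product (assumed definition): <f, g> / (||f|| * ||g||)."""
    return inner(f, g, grid) / math.sqrt(inner(f, f, grid) * inner(g, g, grid))

n_grid = 2000
grid = [(i + 0.5) / n_grid for i in range(n_grid)]

def prior(t):
    """Beta(2, 2) density on the success probability t."""
    return 6.0 * t * (1.0 - t)

def likelihood(t):
    """Binomial likelihood for 7 successes in 10 trials, constant factor dropped."""
    return t ** 7 * (1.0 - t) ** 3

# Up to the dropped binomial constant, this inner product is the marginal likelihood.
marginal = inner(prior, likelihood, grid)
kappa = compatibility(prior, likelihood, grid)
print(kappa)
```

Because of the normalization, rescaling the likelihood by any positive constant leaves the value unchanged, and for nonnegative functions it lies between 0 and 1, with 1 attained when the two functions are proportional.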

2020-05-27 15:00hrs.

Nicolás Kuschinski. Pontificia Universidad Católica de Chile. Grid-Uniform Copulas and Rectangle Exchanges: Model and Bayesian Inference Method for a Rich Class of Copula Functions. Zoom (request the link from Luis Gutiérrez). Abstract: We introduce a new class of copulas which we call Grid-Uniform Copulas. We show the richness of this class of copulas by proving that for any copula $C$ and any $\epsilon>0$ there is a Grid-Uniform Copula that approximates it within Hellinger distance $\epsilon$. We then proceed to show how Grid-Uniform Copulas can be used to create semiparametric models for multivariate data, and show an elegant way to perform MCMC sampling for these models.

2020-05-20 15:00hrs.

Mauricio Castro. Pontificia Universidad Católica de Chile. Automated learning of t factor analysis models with complete and incomplete data. Zoom (request the link from Luis Gutiérrez). Abstract: The t factor analysis (tFA) model is a promising tool for robust reduction of high-dimensional data in the presence of heavy-tailed noise. When determining the number of factors of the tFA model, a two-stage procedure is commonly performed in which parameter estimation is carried out for a number of candidate models, and then the best model is chosen according to certain penalized likelihood indices such as the Bayesian information criterion. However, the computational burden of such a procedure could be extremely high to achieve the optimal performance, particularly for extremely large data sets. In this paper, we develop a novel automated learning method in which parameter estimation and model selection are seamlessly integrated into a one-stage algorithm. This new scheme is called the automated tFA (AtFA) algorithm, and it is also workable when values are missing. In addition, we derive the Fisher information matrix to approximate the asymptotic covariance matrix associated with the ML estimators of tFA models. Experiments on real and simulated data sets reveal that the AtFA algorithm not only provides identical fitting results, as compared to traditional two-stage procedures, but also runs much faster, especially when values are missing.

2020-05-13 15:00hrs.

Freddy Palma Mancilla. Universidad Nacional Autónoma de México. Intertwinings for Markov branching processes. Zoom (request the link from Luis Gutiérrez). Abstract: Using a stochastic filtering framework we devise some intertwining relationships in the setting of Markov branching processes. One of our results turns out to be the basis of an exact simulation method for this kind of process. Also, the population dynamic scheme inherent in the model helps to study the behavior of prolific individuals by observing the total size of the population. Moreover, we study a population with two types of immigration, where only the total immigration is observed, and our objective is to study each immigration separately. This result allows us to link continuous-time Markov chains with continuous-state branching (CB) processes.

2020-05-06 15:00hrs.

Luis Gutiérrez. Pontificia Universidad Católica de Chile. Bayesian nonparametric hypothesis testing procedures. Zoom (request the link from Luis Gutiérrez). Abstract: Scientific knowledge is firmly based on the use of statistical hypothesis testing procedures. A scientific hypothesis can be established by performing one or many statistical tests based on the evidence provided by the data. Given the importance of hypothesis testing in science, these procedures are an essential part of statistics. The literature on hypothesis testing is vast and covers a wide range of practical problems. However, most of the methods are based on restrictive parametric assumptions. In this talk, we will discuss Bayesian nonparametric approaches to construct hypothesis tests in different contexts. Our proposal resorts to the literature on model selection to define Bayesian tests for multiple samples, paired samples, and longitudinal data analysis. Applications with real-life datasets and illustrations with simulated data will be discussed.

2020-04-29 15:00hrs.

Inés Varas. Pontificia Universidad Católica de Chile. Linking measurements: a Bayesian nonparametric approach. Zoom (request the link from Luis Gutiérrez). Abstract: Equating methods are a family of statistical models and methods used to adjust scores on different test forms so that scores can be comparable and used interchangeably. These methods rely on functions to transform scores on two or more versions of a test. Most of the proposed approaches for the estimation of these functions are based on continuous approximations of the score distributions, since these are, most of the time, discrete functions. Considering scores as ordinal random variables, we propose a flexible dependent Bayesian nonparametric model for test equating. The new approach avoids continuous assumptions of the score distributions, in contrast to current equating methods. Additionally, it allows the use of covariates in the estimation of the score distribution functions, an approach not explored at all in the equating literature. Applications of the proposed model to real and simulated data under different sampling designs are discussed. Several methods are considered to evaluate the performance of our method and to compare it with current methods of equating. With respect to discrete versions of equated scores obtained from traditional equating methods, results show that the proposed method has better performance.

2020-04-22 15:00hrs.

Diego Morales Navarrete. Pontificia Universidad Católica de Chile. On modeling and estimating geo-referenced count spatial data. Zoom (request the link from Luis Gutiérrez). Abstract:

Modeling spatial data is a challenging task in statistics. In many applications, the observed data can be modeled using Gaussian, skew-Gaussian or even restricted random field models. However, in several fields, such as population genetics, epidemiology and aquaculture, the data of interest are often count data, and therefore the mentioned models are not suitable for their analysis. Consequently, there is a need for spatial models that are able to properly describe data coming from counting processes. Commonly, three approaches are used to model this type of data: GLMMs with Gaussian random field (GRF) effects, hierarchical models, and copula models. Unfortunately, these approaches do not give an explicit characterization of the count random field, such as its q-dimensional distribution or correlation function. It is important to stress that GLMMs and hierarchical models induce a discontinuity in the path. Therefore, samples located nearby are more dissimilar in value than in the case when the correlation function is continuous at the origin. Moreover, there are cases in which the copula representation for discrete distributions is not unique, so it is unidentifiable. To deal with this, we propose a novel approach to model spatial count data in an efficient and accurate manner. Briefly, starting from independent copies of a “parent” Gaussian random field, a set of transformations can be applied, and the result is a non-Gaussian random field. This approach is based on the characterization of count random fields that inherit the well-known geometric properties from Gaussian random fields.

2020-01-29 12:00 hrs.

José Quinlan. Pontificia Universidad Católica de Chile. On the Support of Yao-based Random Ordered Partitions for Change-Point Analysis. Room 1, Facultad de Matemáticas. Abstract:

In Bayesian change-point analysis for univariate time series, prior distributions on the set of ordered partitions play a key role for change-point detection. In this context, mixtures of product partition models based on Yao's cohesion are very popular due to their tractability and simplicity. However, how flexible are these prior processes to describe different beliefs about the number and locations of change-points? In this talk I will address the previous question in terms of its weak support.

2020-01-22 12:00 hrs.

Miles Ott. Smith College. Respondent-Driven Sampling: Challenges and Opportunities. Room 1, Facultad de Matemáticas. Abstract: Respondent-driven sampling leverages social networks to sample hard-to-reach human populations, including people who inject drugs and sexual minority, sex worker, and migrant populations. As with other link-tracing sampling strategies, sampling involves recruiting a small convenience sample, who invite their contacts into the sample, who in turn invite their contacts until the desired sample size is reached. Typically, the sample is used to estimate prevalence, though multivariable analyses of data collected through respondent-driven sampling are becoming more common. Although respondent-driven sampling may allow for quickly attaining large and varied samples, its reliance on social network contacts, participant recruitment decisions, and self-report of ego-network size makes it subject to several concerns for statistical inference. After introducing respondent-driven sampling I will discuss how these data are actually being collected and analyzed, and opportunities for statisticians to improve upon this widely-adopted method.