Statistics Seminars 2006-07


Elisabeth Tiller at Waterloo, Thursday, December 7, 2006

TOPIC
Wanted: An evolutionary model for insertions and deletions in biological sequences
SPEAKER
Dr. Elisabeth Tiller, University of Toronto
DATE
Thursday, December 07, 2006
TIME
4:00pm
PLACE
University of Waterloo, Department of Statistics and Actuarial Science, MC 5158 (Coffee and cookies are available before the talk in MC5158)
Abstract
Analyses of evolutionary changes at the protein, RNA and DNA levels have generally been limited to considering the amino acid substitution process. This talk will first review these models. However, there have been few empirical and statistical studies of the insertion and deletion (indel) processes for applications to alignment and phylogenetic analysis. Using protein alignment databases, we are developing an empirical model for indel evolution. We consider the observed frequency and distribution of indels between sequences against increasing observed frequency of amino acid substitution. We also intend to analyze databases of structural RNA sequence alignments to study the evolution of the indel distribution in these contexts.
We propose to develop more general parametric models for the indel distribution for variable divergence times. We are looking to collaborate with mathematicians on this project. In our model, all insertions and deletions would occur with an instantaneous rate that depends on their length and on the time of divergence from the reference sequence. The expected distribution of all indel lengths could then be determined from this model. The empirical model derived from the analysis of the alignment databases will then be used to determine suitable values for the parameters of a parametric model for the indel process. Currently we are using Monte Carlo simulations of the indel process to generate artificial datasets. Since the simulation parameters are known and the resulting sequences can be analyzed, we hope these simulations will help us derive a mathematical parametric model that recapitulates the observed distribution of indels.
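A Monte Carlo indel simulation of the kind described can be sketched as follows. The truncated power-law length distribution, the rate constant, and the fixed amino-acid alphabet are illustrative assumptions, not the speakers' fitted model:

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def sample_poisson(lam, rng):
    # Knuth's multiplicative method; adequate for the small rates here
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def simulate_indels(seq, t, rate=0.02, max_len=10, a=1.7, seed=1):
    """Evolve seq over divergence time t under an indel process whose
    event lengths follow a truncated power law P(L) ~ L**(-a)
    (an assumed form; the empirical length distribution would come
    from the alignment databases)."""
    rng = random.Random(seed)
    lengths = list(range(1, max_len + 1))
    weights = [L ** (-a) for L in lengths]
    s = list(seq)
    # total number of indel events is Poisson with rate proportional
    # to sequence length and divergence time
    n_events = sample_poisson(rate * t * len(s), rng)
    for _ in range(n_events):
        L = rng.choices(lengths, weights=weights)[0]
        pos = rng.randrange(len(s) + 1)
        if rng.random() < 0.5:                       # insertion
            s[pos:pos] = rng.choices(AMINO_ACIDS, k=L)
        elif len(s) > L:                             # deletion
            del s[pos:pos + L]
    return "".join(s)
```

Comparing the simulated sequences back against the reference gives an artificial dataset whose true indel parameters are known, which is the point of the exercise described in the abstract.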

David Stanford at U. of T., Thursday, December 7, 2006

Title
"Transplant Queues"
Speaker
Professor David Stanford, Statistical & Actuarial Sciences, UWO
Location
Sidney Smith Hall 1074
Date
Thursday, 7 December 2006 at 4:00PM (Cookies and Juice will be served in the DeLury Lounge at 3:30 p.m. in SS6004)
Abstract
This talk will comprise three parts. Following a general introduction to the problem, we will present the results of a summer USRA project which performed a number of statistical analyses on the waiting lists for liver transplants in Canada. The second aspect will identify how the transplant waiting lists can be modelled as queues --- albeit complicated ones. The final aspect will focus on how key queueing principles can be used to infer qualitative consequences of select types of changes to current allocation policy.
This is joint work with Elizabeth Renouf, USRA student, Statistical & Actuarial Sciences, UWO and Vivian McAlister, MD, LHSC & UWO.

Rong Zhu at York, Wednesday, December 6, 2006

Speaker
Rong Zhu, McMaster University
Title
A New Property of Generalized Poisson and Comparison with Negative Binomial
Date
Wednesday, December 6, 2006
Time
2:30 pm
Where
N638 Ross
SUMMARY
We prove that the generalized Poisson distribution GP(theta, eta) (eta >= 0) is a mixture of Poisson distributions; previously it was only known to be overdispersed relative to the Poisson, like the negative binomial distribution. We compare the probability mass functions and skewnesses of the generalized Poisson and the widely used negative binomial distributions with the first two moments fixed. We find that the generalized Poisson and negative binomial distributions with means and variances fixed differ only slightly in many situations, but their zero-inflated versions, with masses at zero, means and variances fixed, can differ substantially. These probabilistic comparisons are helpful in selecting the better-fitting distribution for modelling count data with heavy right tails. Through a real example of count data with a large zero fraction, we illustrate how the generalized Poisson and negative binomial distributions, as well as their zero-inflated versions, can be discriminated. (This is joint work with Dr. Harry Joe.)
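The moment-matched comparison can be sketched directly from the standard formulas: GP(theta, eta) has mean theta/(1-eta) and variance theta/(1-eta)^3, and the negative binomial with size r and success probability p has mean r(1-p)/p and variance r(1-p)/p^2. The target moments below are arbitrary example values, not from the talk:

```python
import math

def gp_pmf(x, theta, eta):
    """Generalized Poisson GP(theta, eta) probability mass at x."""
    lam = theta + eta * x
    return theta * lam ** (x - 1) * math.exp(-lam) / math.factorial(x)

def nb_pmf(x, r, p):
    """Negative binomial pmf with size r and success probability p."""
    logc = math.lgamma(x + r) - math.lgamma(r) - math.lgamma(x + 1)
    return math.exp(logc + r * math.log(p) + x * math.log(1.0 - p))

def matched_params(m, v):
    """Moment-match both families to mean m and variance v (v > m)."""
    eta = 1.0 - math.sqrt(m / v)     # GP: mean theta/(1-eta), var theta/(1-eta)^3
    theta = m * math.sqrt(m / v)
    p = m / v                        # NB: mean r(1-p)/p, var r(1-p)/p^2
    r = m * m / (v - m)
    return (theta, eta), (r, p)

# Compare the two pmfs with mean 4 and variance 8 held fixed
(theta, eta), (r, p) = matched_params(4.0, 8.0)
for x in range(6):
    print(x, round(gp_pmf(x, theta, eta), 4), round(nb_pmf(x, r, p), 4))
```

With the first two moments fixed the printed columns track each other closely, which is the "slight differences" phenomenon the abstract describes for the non-inflated case.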

Dimitris Cheliotis at the Fields Institute: December 6, 2006

Speaker
Dimitris Cheliotis (University of Toronto)
Title
Patterns for the 1-dimensional random walk in the random environment - a functional LIL
When
Wednesday, December 6, 2006, 3:10 pm
Where
Fields Institute, Large Room
Abstract
We start with a one dimensional random walk (or diffusion) in a Wiener-like environment. We look at its graph at different, increasing scales natural for it. What are the patterns that appear repeatedly? We characterize them through a functional law of the iterated logarithm analogous to Strassen's result for Brownian motion and simple random walk. The talk is based on joint work with Balint Virag.

Daniel Ashlock at York: Friday, December 1, 2006

Speaker
Prof. Daniel Ashlock, University of Guelph
Title
Multiclustering: Avoiding the Natural Shape of Underlying Metrics.
Time
10:30-11:30
Date
December 1, 2006
Place
N638 Ross Building
Abstract
This talk introduces a novel clustering technique that exploits the variability of k-means clustering to grant immunity to cluster shapes that are artifacts of the distance measure used rather than a feature of the data. The talk assumes that the listener is a nonspecialist with potential applications for clustering algorithms. In addition to clustering data, multiclustering gives an advisory as to the natural number of clusters in the data (or an indication that no such natural clusters exist).
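This is not the speaker's algorithm, but a minimal sketch of the underlying idea of exploiting run-to-run variability: pairs of points that co-cluster across many random k-means restarts reflect stable structure, while unstable pairs are more likely artifacts of the initialization or the metric:

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two coordinate tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, rng, iters=25):
    """One run of Lloyd's algorithm with randomly sampled initial centres."""
    centres = rng.sample(points, k)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: dist2(p, centres[j]))
                  for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centres[j] = tuple(sum(c) / len(members)
                                   for c in zip(*members))
    return labels

def coassociation(points, k, runs=40, seed=0):
    """Fraction of runs in which each pair of points shares a cluster.
    Entries near 1 suggest pairings that survive every restart."""
    rng = random.Random(seed)
    n = len(points)
    co = [[0.0] * n for _ in range(n)]
    for _ in range(runs):
        labels = kmeans(points, k, rng)
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    co[i][j] += 1.0 / runs
    return co
```

Thresholding the co-association matrix then yields clusters whose shape is no longer tied to the spherical bias of a single k-means run.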

Huaiping Zhu at York, December 1, 2006

MITACS Seminar at LAMPS and LIAM
Title
West Nile virus: Modelling, Surveillance and Data Mining
Title
Threshold Conditions for assessing control of West Nile virus
Speaker
Huaiping Zhu, York University
When
Friday Dec. 01 2:00-4:00,
Where
Ross N638
Abstract
By using a system of differential equations, we investigate threshold conditions for assessing control strategies for West Nile virus. It is usually believed that the epidemic can be controlled if the basic reproduction number is smaller than one. From the existence and classification of the multiple equilibria, as well as a discussion of the stability of these equilibria, we develop explicit threshold conditions in terms of controlling parameters beyond the basic reproduction number. I will also discuss the saddle-node bifurcation of the system in the case when the basic reproduction number is smaller than one. The local stability of the equilibria will be proved by the theory of K-competitive dynamical systems and the index theory of dynamical systems on a surface. The results of this study suggest that the basic reproduction number itself is not enough to indicate whether West Nile virus will prevail, and that we must pay more attention to the initial sizes of the infected bird and mosquito populations. The results also partially explain the mechanism behind the recurrence of small-scale outbreaks of the virus in North America.

Venkata Duvvuri at York, December 1, 2006

Speaker
Venkata Duvvuri, York University
Title
West Nile virus: Modelling, Surveillance and Data Mining
Title
Data-mining approaches towards the management of mosquito borne diseases
Abstract
Mosquito-borne disease management requires a clear understanding of all the variables that directly or indirectly aid mosquito proliferation, pathogen sustainability and disease epidemics. This talk focuses on the use of data mining methodologies, i.e., Classification and Regression Trees (CART) and Bayesian Networks, for prioritizing the vector/disease influencing variables, generating decision rules and forecasting probable epidemics and threats, which in turn guide health officials and policy makers in establishing or modifying existing management procedures effectively. Both approaches were applied to the management of Lymphatic Filariasis and Japanese encephalitis, two serious mosquito-borne infectious diseases in India.

Richard A. Harshman at York: Friday, November 24, 2006

Speaker
Richard A. Harshman (Psychology Dept., University of Western Ontario) harshman@uwo.ca http://publish.uwo.ca/~harshman
Title
Introducing new 'ways' into data analysis: Making linear models multilinear gives them important new properties
Date
Friday, November 24, 2006, 10:30-11:30
Place
N638 Ross
Abstract
In 1970, I developed PARAFAC (PARAllel FACtor analysis), a generalization of factor/component analysis from matrices to three-way arrays of data (e.g., to measurements of n cases on m variables on each of p occasions, or to correlations of n variables with the same n variables in each of p different circumstances). The motivation was to enhance validity: by parallel factoring of multiple non-identical mixtures of the same patterns, the three-way model could often overcome the rotational ambiguity of standard factor/component analysis and uniquely recover the source patterns that originally generated the mixtures. In the last 10 years there has been a rapid growth of important PARAFAC applications in diverse fields, ranging from chemistry and physics (e.g., E-E fluorescence and XES x-ray spectroscopy), to signal engineering (e.g., cell-phone signals, noisy radar), to neuroscience (EEG and fMRI brain signals), etc. A Google search now returns over 50,000 hits. Quite recently, I have been developing similar generalizations of other common methods of data analysis, which I hope will also have wide application and value.
In this talk I will explain how, by extending standard statistical models from linear to multilinear, we can substantially increase their power and give them important new properties. The idea can be briefly explained as follows: while traditional methods find an optimal linear combination across one index of a two-way data array (combining columns of data), the generalized methods find jointly-optimal linear combinations across two (or more) indices of a three- (or higher)-way array. The figure below shows how a standard canonical correlation for the General Linear Model (GLM) is modified for a "level 1" multilinear generalization. The canonical weight vectors (columns of W on both sides) are chosen so that the correlation between the left and right canonical variates (columns of C) is maximal. Note that the data sources on the two sides do not need to have the same number of 'ways', so either side can be a matrix or a four-way array, etc.
By introducing multilinear generalizations into the General Linear Model, this approach implicitly also generalizes its many special cases, such as Discriminant Analysis, (M)ANOVA/(M)ANCOVA, etc. In many of these applications, one side of the canonical relation would be a 'design matrix' or 'design array'. Statistical tests could be based on distribution-free compute-intensive methods such as randomization tests or bootstrapping.
A further kind of generalization will also be described, called "level 2 multilinearity". Here, the patterns themselves are multilinear, and take the form of matrices or arrays with low-rank outer-product structure. For example, in the level 2 GLM, the canonical variates become tensors of order 2 or higher. Patterns with such added structure can convey "deeper" or "higher order" information about the data generating processes, including how specific latent properties in one 'way' of the array 'interact' or act jointly with specific latent properties in another.

Rohit Deo at Waterloo, Thursday, November 23, 2006

UNIVERSITY OF WATERLOO Department of Statistics and Actuarial Science
SPEAKER
Dr. Rohit Deo (Stern School of Business, New York University)
TOPIC
Bias Reduction and Likelihood Based Almost Exactly Sized Hypothesis Testing in Predictive Regressions using the Restricted Likelihood
DATE
Thursday, November 23, 2006
TIME
4:00pm
PLACE
MC 5158 (Coffee and cookies are available before the talk in MC5158)
Abstract

The question of whether the time series of one variable, such as stock returns, can be predicted by the lagged values of the time series of another variable, such as lagged dividend-price ratios or lagged book-to-market ratios, is of interest in financial econometrics. Predictive regression models are generally used for such analyses, but hypothesis testing in such models tends to be problematic due to persistence in the regressor series, leading to biased slope coefficient estimates. We address the problem of estimation and hypothesis testing of the slope coefficient using the restricted likelihood. This likelihood is shown to yield estimates that have much less bias than the usual least squares estimates. Furthermore, we show that the likelihood ratio test based on the restricted likelihood provides accurately sized tests, due to its small curvature. The procedure is also extended to the case of multiple regressor series.

Keith Worsley at York: Tuesday, November 21, 2006

Department of Mathematics and Statistics Colloquium
Speaker
Professor Keith Worsley (McGill University)
Title
Detecting Connectivity Between Images by Thresholding Random Fields: MS Lesions, Cortical Thickness, and the "Bubbles" Task in an fMRI Experiment
Date
Tuesday, November 21, 2006 (Refreshments will be served in N620 Ross Building at 3:30 p.m.)
Time
4:00 p.m.
Place
N638 Ross Building
Abstract
We are interested in the general problem of detecting connectivity, or high correlation, between pairs of pixels or voxels in two sets of images. To do this, we set a threshold on the correlations that controls the false positive rate, which we approximate by the expected Euler characteristic of the excursion set. An exact expression for this is found using new results in random field theory involving Lipschitz-Killing curvatures and Jonathan Taylor's Gaussian Kinematic Formula. The first example is a data set on 425 multiple sclerosis patients. Lesion density was measured at each voxel in white matter, and cortical thickness was measured at each point on the cortical surface. The hypothesis is that increased lesion density interrupts neuronal activity, provoking cortical thinning in those grey matter regions connected through the affected white matter regions. The second example is an fMRI experiment using the "bubbles" task. In this experiment, the subject is asked to discriminate between images that are revealed only through a random set of small windows or "bubbles". We are interested in which parts of the image are used in successful discrimination, and which parts of the brain are involved in this task.

Hans Tuenter at the University of Toronto, Thursday, November 16, 2006

University of Toronto Department of Statistics Seminar
Title
"Stochastic Modeling of Wind Speed Time Series"
Speaker
Hans Tuenter (Ontario Power Generation)
Date
Thursday, 16 November 2006 at 4:00PM (Cookies and juice will be served in the DeLury Lounge SS6004 at 3:30 p.m.)
Location
Sidney Smith Hall 1074
Abstract
We present a stochastic model for wind speeds that captures their short-term autocorrelations and long-term stationary properties. In addition, the model allows diurnal and seasonal components to be incorporated. One application of the model is in Monte Carlo simulations of wind parks, whose economic viability depends heavily not only upon the wind speed distribution but also upon its diurnal and seasonal patterns.
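A minimal sketch of this kind of model, assuming an AR(1) deviation process added to a sinusoidal diurnal cycle; the functional form and all parameter values are illustrative, not the speaker's fitted model:

```python
import math
import random

def simulate_wind(hours, mean=8.0, diurnal_amp=2.0, phi=0.9,
                  sigma=1.0, seed=42):
    """Hourly wind speeds (m/s) as a diurnal mean cycle plus an AR(1)
    deviation process, truncated at zero. phi controls the short-term
    autocorrelation; the stationary deviation variance is
    sigma^2 / (1 - phi^2)."""
    rng = random.Random(seed)
    x = 0.0                          # AR(1) deviation from the diurnal mean
    speeds = []
    for t in range(hours):
        x = phi * x + rng.gauss(0.0, sigma)
        diurnal = diurnal_amp * math.sin(2.0 * math.pi * (t % 24) / 24.0)
        speeds.append(max(0.0, mean + diurnal + x))
    return speeds
```

Feeding such simulated series through a turbine power curve is the standard way such a model enters a Monte Carlo assessment of wind park economics.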

Elena A. Erosheva at Waterloo: Thursday, November 16, 2006

UNIVERSITY OF WATERLOO Department of Statistics and Actuarial Science
SPEAKER
Dr. Elena A. Erosheva (University of Washington)
TOPIC
A Bayesian analysis of multivariate binary response data using basic and compartmental Grade of Membership models
DATE
Thursday, November 16, 2006
TIME
4:00pm
Abstract
This talk presents an analysis of functional disability data from the National Long Term Care Survey (NLTCS). Functional disability reflects difficulties in performing activities that are considered normal for everyday living such as dressing or grocery shopping. We employ a Bayesian framework to determine characteristics and the number of functional disability profiles in the data with the basic GoM model. We then extend the GoM model hierarchy to incorporate a deterministically healthy compartment. With the compartmental GoM model, we estimate weights of the healthy and partially disabled components and examine the impact of this extension on the interpretation of disability profiles. We assess the choice of the optimal number of disability profiles with several approaches including a Deviance Information Criterion (DIC) and an approximation to the Bayesian Information Criterion (BIC). We find that the functional disability data are described best by eight disability profiles and a deterministic healthy compartment.

Jason Roy at Waterloo, November 9, 2006

UNIVERSITY OF WATERLOO Department of Statistics and Actuarial Science
TOPIC
Causal comparisons in randomized trials of two active treatments: The effect of supervised exercise to promote smoking cessation.
SPEAKER
Dr. Jason Roy, University of Rochester
DATE
Thursday, November 9, 2006
TIME
4:00pm
PLACE
MC 5158 (Coffee and cookies are available before the talk in MC5158)
Abstract
In behavioral medicine trials, such as smoking cessation trials, two or more active treatments are often compared. Noncompliance by some subjects with their assigned treatment poses a challenge to the data analyst. Causal parameters of interest might include those defined by subpopulations based on their potential compliance status under each assignment, using the principal stratification framework (e.g., causal effect of new therapy compared to standard therapy among subjects that would comply with either intervention). Even if subjects in one arm do not have access to the other treatment(s), the causal effect of each treatment typically can only be identified from the outcome, randomization and compliance data within certain bounds. We propose to use additional information -- compliance-predictive covariates -- to help identify the causal effects. Our approach is to specify marginal compliance models conditional on covariates within each arm of the study. Parameters from these models can be identified from the data. We then link the two compliance models through an association model that depends on a parameter that is not identifiable, but has a meaningful interpretation; this parameter forms the basis for a sensitivity analysis. We demonstrate the benefit of utilizing covariate information in both a simulation study and in an analysis of data from a smoking cessation trial.

Chen Li at York: October 26, 2006

Database Seminar at the School of Information Technology
Title
Answering Approximate Queries Efficiently
Speaker
Chen Li, UC Irvine
Time
Thursday, Oct. 26, 12:00pm
Room
TEL3009
Abstract
Many database applications have the emerging need to answer approximate queries efficiently. Such a query can ask for strings that are similar to a given string, such as "names similar to Schwarzenegger" and "telephone numbers similar to 412-0964," where "similar to" uses a predefined, domain-specific function to specify the similarity between strings, such as edit distance. There are many reasons to support such queries. To name a few: (1) The user might not remember exactly the name or the telephone number when issuing the query. (2) There could be typos in the query. (3) There could be errors or inconsistencies even in the database, especially in applications such as data cleaning.
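The edit distance used in such similarity predicates is the classic Levenshtein dynamic program, which can be sketched in a few lines:

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character
    insertions, deletions, and substitutions turning a into b.
    Uses a rolling row so memory is O(len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]
```

A predicate like "names similar to Schwarzenegger" then amounts to selecting strings whose distance to the query string falls below a threshold; the selectivity-estimation problem the talk addresses is predicting how many such strings exist without scanning them all.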
In this talk we will present some of our recent results on answering approximate queries efficiently. One problem related to optimizing such queries is to estimate the selectivity of a fuzzy string predicate, i.e., estimating how many strings in the database satisfy the predicate. We develop a novel technique, called SEPIA, to solve the problem. We will present the details of this technique using the edit distance function. We study challenges in adopting this technique, including how to construct its histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. We show the results of our extensive experiments.
Time permitting, we will also briefly report our other related results. One is on supporting fuzzy queries with both predicates on numeric attributes (e.g., salary > 50K) and predicates on string attributes (e.g., telephone numbers similar to 412-0964). Another one is on how to relax conditions in an SQL query that returns an empty answer. These results are based on three recent papers in VLDB'2005 and VLDB'2006.
Biography
Chen Li is an assistant professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003. He is currently a part-time Visiting Research Scientist at Google, Santa Monica. His research interests are in the fields of database and information systems, including data integration and sharing, data cleansing, data warehousing, and data privacy. More information is available at: http://www.ics.uci.edu/~chenli/

Guangzhe Fan at York: October 20, 2006

Title
Kernel-Induced Classification Trees and Random Forests
Speaker
Guangzhe Fan
Department of Statistics and Actuarial Science
University of Waterloo
When
Friday, October 20, 2006, 2:30
Where
York University: N638 Ross Building
Abstract
A recursive-partitioning procedure using kernel functions is proposed for classification problems. We call it KICT: kernel-induced classification trees.
Essentially, KICT uses kernel functions to construct CART models. The resulting model can perform significantly better in classification than the original CART model in many situations, especially when the pattern of the data is non-linear. We also introduce KIRF: kernel-induced random forests. KIRF compares favorably to random forests and SVM in many situations. KICT and KIRF also largely retain the computational advantage of CART and random forests, respectively, in contrast to SVM. We use simulated and real-world data to illustrate their performance. We conclude that the proposed methods are useful alternatives and competitors to CART, random forests, and SVM.

Fei Zou at York: October 20, 2006

Title
Mixture Models in Quantitative Trait Loci (QTL) Mapping
Speaker
Fei Zou
Department of Biostatistics
University of North Carolina - Chapel Hill
When
Friday, October 20, 2006, 10:30
Where
York University: N638 Ross Building
Abstract 
In a QTL study, the putative QTL position is often unknown, and hence so are the QTL genotypes. As a result, under standard interval mapping procedures the phenotype data arise from mixtures of distributions. Previous approaches to estimation involve modeling the distributions parametrically. In this talk, we will introduce several semi-parametric and non-parametric QTL mapping methods. Further, accurately estimating the QTL position is one of the major goals of any QTL study.
Traditionally, the position corresponding to the peak of the profile LOD score from interval mapping is used to estimate the QTL position and is often referred to as the MLE of the QTL position. Is the MLE truly optimal? Several alternative estimates will help us answer this question.
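The mixture structure behind interval mapping can be sketched as follows for a simple two-genotype design; the normal phenotype model and the single-normal null in the LOD computation are illustrative assumptions:

```python
import math

def mixture_loglik(phenos, probs, mu0, mu1, sigma):
    """Log-likelihood of phenotypes under a two-component normal
    mixture; probs[i] is P(QTL genotype = 1 | flanking markers) for
    subject i, as supplied by interval mapping at a putative position."""
    norm = sigma * math.sqrt(2.0 * math.pi)
    ll = 0.0
    for y, p in zip(phenos, probs):
        f0 = math.exp(-0.5 * ((y - mu0) / sigma) ** 2) / norm
        f1 = math.exp(-0.5 * ((y - mu1) / sigma) ** 2) / norm
        ll += math.log((1.0 - p) * f0 + p * f1)
    return ll

def lod(phenos, probs, mu0, mu1, sigma):
    """LOD score at one putative position: two-component mixture
    versus a single normal at the pooled mean."""
    mu = sum(phenos) / len(phenos)
    null = mixture_loglik(phenos, [0.0] * len(phenos), mu, mu, sigma)
    alt = mixture_loglik(phenos, probs, mu0, mu1, sigma)
    return (alt - null) / math.log(10.0)
```

Scanning `lod` over a grid of putative positions (each position contributing its own genotype probabilities) produces the profile LOD curve whose peak is the traditional position estimate discussed above.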

Tim Ramsay at York: September 29, 2006

Title
Concurvity -- the bias that time forgot.
Speaker
Dr. Tim Ramsay
Assistant Professor/Associate Scientist
University of Ottawa/Ottawa Health Research Institute
Where
York University: N638 Ross Building
When
Friday, September 29, 2006, 10:30
Abstract
A special case of the generalized additive model in which one covariate is modeled linearly, the semiparametric additive model is becoming an increasingly popular statistical tool. Its appeal lies in the fact that it allows the analyst to flexibly control for a variety of confounders, without making any parametric assumptions about their effects, while still producing an easy-to-interpret linear effect for the covariate of interest.
In 2002, however, it became painfully obvious to environmental epidemiologists that the linear effect could be seriously biased. Two interesting aspects of this discovery are that
(1) serious as it is, this bias problem remains largely unknown in the larger statistical community and
(2) the problem was first discovered twenty years ago but appears to have been completely forgotten.
This talk will illustrate through an embarrassing example how concurvity, the nonparametric analogue of collinearity, leads to bias. The mechanism behind this bias will be discussed, together with ways to diagnose and avoid it. The fact that it was discovered twenty years ago, and again seven years ago, by theoreticians suggests a dangerous gap between theory and practice in the discipline of statistics.