# Statistics Seminars 2006-07

### From MathWiki


## Elisabeth Tiller at Waterloo, Thursday, December 7, 2006

- TOPIC
- Wanted: An evolutionary model for insertions and deletions in biological sequences
- SPEAKER
- Dr. Elisabeth Tiller, University of Toronto
- DATE
- Thursday, December 07, 2006
- TIME
- 4:00pm
- PLACE
- University of Waterloo, Department of Statistics and Actuarial Science, MC 5158 (Coffee and cookies are available before the talk in MC5158)
- Abstract
- Analyses of evolutionary changes at the protein, RNA and DNA levels have generally been limited to the amino acid substitution process. This talk will first review these models. However, there have been few empirical and statistical studies of the insertion and deletion (indel) processes for applications to alignment and phylogenetic analysis. Using protein alignment databases, we are developing an empirical model for indel evolution. We consider the observed frequency and distribution of indels between sequences against increasing observed frequency of amino acid substitution. We also intend to analyze databases of structural RNA sequence alignments to study the evolution of the indel distribution in these contexts.

- We propose to develop more general parametric models for the indel distribution at variable divergence times, and we are looking to collaborate with mathematicians on this project. In our model, all insertions and deletions would occur with an instantaneous rate that depends on their length and on the time of divergence from the reference sequence. The expected distribution of indel lengths could then be determined from this model. The empirical model derived from the analysis of the alignment databases will then be used to determine suitable values for the parameters of a parametric model for the indel process. Currently we are using Monte Carlo simulations of the indel process to generate artificial datasets. Since the simulation parameters are known and the resulting sequences can be analyzed, we hope these simulations will help us derive a mathematical parametric model that recapitulates the observed distribution of indels.

## David Stanford at U. of T., Thursday, December 7, 2006

- Title
- "Transplant Queues"
- Speaker
- Professor David Stanford, Statistical & Actuarial Sciences, UWO
- Location
- Sidney Smith Hall **1074**
- Date
- Thursday, 7 December 2006 at 4:00PM (Cookies and juice will be served in the DeLury Lounge, SS6004, at 3:30 p.m.)
- Abstract
- This talk will comprise three parts. Following a general introduction to the problem, we will present the results of a summer USRA project which performed a number of statistical analyses on the waiting lists for liver transplants in Canada. The second part will identify how the transplant waiting lists can be modelled as queues --- albeit complicated ones. The final part will focus on how key queueing principles can be used to infer qualitative consequences of select types of changes to current allocation policy.
- This is joint work with Elizabeth Renouf, USRA student, Statistical & Actuarial Sciences, UWO and Vivian McAlister, MD, LHSC & UWO.

## Rong Zhu at York, Wednesday, December 6, 2006

- Speaker
- Rong Zhu, McMaster University
- Title
- A New Property of Generalized Poisson and Comparison with Negative Binomial
- Date
- Wednesday, December 6, 2006
- Time
- 2:30 pm
- Where
- N638 Ross
- SUMMARY
- We prove that the generalized Poisson distribution GP(theta, eta) (eta >= 0) is a mixture of Poisson distributions; previously it was only known to be overdispersed relative to the Poisson, like the negative binomial distribution. We compare the probability mass functions and skewnesses of the generalized Poisson and the widely used negative binomial distributions with the first two moments fixed. We find that the generalized Poisson and negative binomial distributions with means and variances fixed differ only slightly in many situations, but their zero-inflated versions, with the mass at zero, mean and variance fixed, can differ substantially. These probabilistic comparisons are helpful in selecting the better-fitting distribution for modelling count data with heavy right tails. Through a real example of count data with a large zero fraction, we illustrate how the generalized Poisson and negative binomial distributions, as well as their zero-inflated versions, can be discriminated. (This is joint work with Dr. Harry Joe.)
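The moment-matched comparison described in the abstract can be sketched numerically. Below is a minimal Python illustration (not the authors' code; the parameter values are arbitrary): it evaluates the GP(theta, eta) pmf in log space and a negative binomial pmf with the same mean and variance.

```python
import math

def gp_pmf(k, theta, eta):
    # Generalized Poisson pmf, computed in log space for stability:
    # P(X = k) = theta * (theta + k*eta)^(k-1) * exp(-theta - k*eta) / k!
    return math.exp(math.log(theta) + (k - 1) * math.log(theta + k * eta)
                    - theta - k * eta - math.lgamma(k + 1))

def nb_pmf(k, r, p):
    # Negative binomial pmf with real-valued shape r:
    # Gamma(k+r) / (Gamma(r) k!) * p^r * (1-p)^k
    return math.exp(math.lgamma(k + r) - math.lgamma(r) - math.lgamma(k + 1)
                    + r * math.log(p) + k * math.log(1 - p))

# GP(theta, eta) has mean theta/(1-eta) and variance theta/(1-eta)^3.
theta, eta = 2.0, 0.3
mean = theta / (1 - eta)
var = theta / (1 - eta) ** 3

# Moment-matched negative binomial: p = mean/var, r = mean^2/(var - mean).
p = mean / var
r = mean ** 2 / (var - mean)

for k in range(6):
    print(k, round(gp_pmf(k, theta, eta), 4), round(nb_pmf(k, r, p), 4))
```

With the first two moments matched this way, the two pmfs track each other closely near the mode, which is the phenomenon the talk quantifies.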

## Dimitris Cheliotis at the Fields Institute: December 6, 2006

- Speaker
- Dimitris Cheliotis (University of Toronto)
- Title
- Patterns for the 1-dimensional random walk in the random environment - a functional LIL
- When
- Wednesday, December 6, 2006, 3:10 pm
- Where
- Fields Institute, Large Room
- Abstract
- We start with a one dimensional random walk (or diffusion) in a Wiener-like environment. We look at its graph at different, increasing scales natural for it. What are the patterns that appear repeatedly? We characterize them through a functional law of the iterated logarithm analogous to Strassen's result for Brownian motion and simple random walk. The talk is based on joint work with Balint Virag.
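For readers unfamiliar with the model, a one-dimensional random walk in random environment is easy to simulate. The sketch below is illustrative only (arbitrary parameters, not the speaker's construction): each site gets a random right-step probability, drawn so that log((1-p)/p) has mean zero, which is the recurrent "Sinai" regime in which the walk localizes on the (log t)^2 scale rather than the sqrt(t) scale of the simple random walk.

```python
import random

random.seed(1)

# Environment: site i sends the walker right with probability env[i],
# drawn once (lazily, on first visit) and then fixed for all later visits.
env = {}

def walk(steps):
    x, path = 0, [0]
    for _ in range(steps):
        pr = env.setdefault(x, random.uniform(0.2, 0.8))
        x += 1 if random.random() < pr else -1
        path.append(x)
    return path

path = walk(20000)
# Sinai's theorem: the walk is trapped in deep valleys of the potential,
# so the visited range grows only like (log t)^2.
print("range visited:", min(path), "to", max(path))
```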

## Daniel Ashlock at York: Friday, December 1, 2006

- Speaker
- Prof. Daniel Ashlock, University of Guelph
- Title
- Multiclustering: Avoiding the Natural Shape of Underlying Metrics.
- Time
- 10:30-11:30
- Date
- December 1, 2006
- Place
- N638 Ross Building
- Abstract
- This talk introduces a novel clustering technique that exploits the variability of k-means clustering to grant immunity to cluster shapes that are artifacts of the distance measure used rather than features of the data. The talk assumes that the listener is a nonspecialist with potential applications for clustering algorithms. In addition to clustering data, multiclustering gives an advisory as to the natural number of clusters in the data (or an indication that no such natural clusters exist).
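The idea of pooling many unstable k-means runs can be sketched in a few lines. The toy below is not Prof. Ashlock's implementation (the data and parameters are made up): it accumulates a co-association score, the fraction of runs, over random initializations and several values of k, in which two points share a cluster.

```python
import random

random.seed(0)

# Toy data: two well-separated 1-D clusters.
data = [random.gauss(0, 0.5) for _ in range(20)] + [random.gauss(5, 0.5) for _ in range(20)]
n = len(data)

def kmeans(xs, k, iters=20):
    centers = random.sample(xs, k)   # random initialization
    labels = [0] * len(xs)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: (x - centers[j]) ** 2) for x in xs]
        for j in range(k):
            members = [x for x, lab in zip(xs, labels) if lab == j]
            if members:
                centers[j] = sum(members) / len(members)
    return labels

# Co-association score: fraction of runs (random inits, k drawn from
# {2, 3, 4}) in which two points land in the same cluster.  Averaging
# over many unstable runs washes out the spherical-cluster bias of any
# single squared-distance fit.
runs = 40
co = [[0.0] * n for _ in range(n)]
for _ in range(runs):
    labels = kmeans(data, random.choice([2, 3, 4]))
    for a in range(n):
        for b in range(n):
            if labels[a] == labels[b]:
                co[a][b] += 1.0 / runs

print("same-cluster score for points 0,1:", round(co[0][1], 2))
print("cross-cluster score for points 0,20:", round(co[0][20], 2))
```

Thresholding the co-association matrix then yields clusters, and the stability of those clusters across thresholds hints at the natural number of clusters.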

## Huaiping Zhu at York, December 1, 2006

- MITACS Seminar at LAMPS and LIAM
- Series
- West Nile virus: Modelling, Surveillance and Data Mining
- Title
- Threshold Conditions for assessing control of West Nile virus
- Speaker
- Huaiping Zhu, York University
- When
- Friday Dec. 01 2:00-4:00,
- Where
- Ross N638
- Abstract
- Using a system of differential equations, we investigate threshold conditions for assessing control strategies for West Nile virus. It is usually believed that the epidemic can be controlled if the basic reproduction number is smaller than one. From the existence and classification of the multiple equilibria, as well as a discussion of their stability, we develop explicit threshold conditions in terms of controlling parameters beyond the basic reproduction number. I will also discuss the saddle-node bifurcation of the system in the case when the basic reproduction number is smaller than one. The local stability of the equilibria will be proved using the theory of K-competitive dynamical systems and the index theory of dynamical systems on a surface. The results of this study suggest that the basic reproduction number by itself is not enough to indicate whether West Nile virus will prevail, and that we must pay more attention to the initial sizes of the infected bird and mosquito populations. The results also partially explain the mechanism behind the recurrence of small-scale outbreaks of the virus in North America.
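A minimal bird-mosquito cross-infection system conveys the flavour of such models. The sketch below is NOT the speaker's model; the compartments, functional forms and all rate values are hypothetical illustration choices, integrated by Euler's method on population proportions.

```python
beta_mb = 0.4   # mosquito-to-bird transmission rate (hypothetical)
beta_bm = 0.3   # bird-to-mosquito transmission rate (hypothetical)
mu_m    = 0.1   # mosquito turnover rate; dead mosquitoes replaced susceptible
gamma_b = 0.2   # bird recovery/removal rate (hypothetical)

def simulate(Ib0, Im0, T=200.0, dt=0.01):
    Sb, Ib = 1.0 - Ib0, Ib0          # susceptible / infected birds
    Sm, Im = 1.0 - Im0, Im0          # susceptible / infected mosquitoes
    peak_Ib = Ib
    for _ in range(int(T / dt)):
        new_b = beta_mb * Sb * Im    # new bird infections (mass action)
        new_m = beta_bm * Sm * Ib    # new mosquito infections
        Sb += dt * (-new_b)
        Ib += dt * (new_b - gamma_b * Ib)
        Sm += dt * (mu_m * Im - new_m)
        Im += dt * (new_m - mu_m * Im)
        peak_Ib = max(peak_Ib, Ib)
    return Ib, Im, peak_Ib

# For this toy model R0^2 = (beta_mb / gamma_b) * (beta_bm / mu_m) = 6,
# so a small introduction in the bird population grows into an outbreak.
Ib, Im, peak = simulate(0.01, 0.0)
print("final Ib, Im:", round(Ib, 4), round(Im, 4), " peak Ib:", round(peak, 3))
```

The talk's point is precisely that for the full model such a single R0 threshold is not the whole story: outcomes can also depend on the initial infected populations.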

## Venkata Duvvuri at York, December 1, 2006

- Speaker
- Venkata Duvvuri, York University
- Series
- West Nile virus: Modelling, Surveillance and Data Mining
- Title
- Data-mining approaches towards the management of mosquito borne diseases
- Abstract
- Mosquito-borne disease management requires a clear understanding of all the variables that directly or indirectly aid mosquito proliferation, pathogen sustainability and disease epidemics. This talk focuses on using data-mining methodologies, i.e., Classification and Regression Trees (CART) and Bayesian networks, to prioritize the variables influencing vectors and disease, generate decision rules and forecast probable epidemics and threats, which in turn guide health officials and policy makers in establishing or modifying existing management procedures effectively. Both applications were demonstrated on the management of lymphatic filariasis and Japanese encephalitis, serious mosquito-borne infectious diseases in India.

## Richard A. Harshman at York: Friday, November 24, 2006

- Speaker
- Richard A. Harshman (Psychology Dept., University of Western Ontario) harshman@uwo.ca http://publish.uwo.ca/~harshman
- Title
- Introducing new 'ways' into data analysis: Making linear models multilinear gives them important new properties
- Date
- Friday, November 24, 2006, 10:30-11:30
- Place
- N638 Ross
- Abstract
- In 1970, I developed PARAFAC (PARAllel FACtor analysis), a generalization of factor/component analysis from matrices to three-way arrays of data (e.g., to measurements of n cases on m variables on each of p occasions, or to correlations of n variables with the same n variables in each of p different circumstances). The motivation was to enhance validity: by parallel factoring of multiple non-identical mixtures of the same patterns, the three-way model could often overcome the rotational ambiguity of standard factor/component analysis and uniquely recover the source patterns that originally generated the mixtures. In the last 10 years there has been a rapid growth of important PARAFAC applications in diverse fields, ranging from chemistry and physics (e.g., E-E (excitation-emission) fluorescence and XES x-ray spectroscopy), to signal engineering (e.g., cell-phone signals, noisy radar), to neuroscience (EEG and fMRI brain signals), etc. A Google search now returns over 50,000 hits. Quite recently, I have been developing similar generalizations of other common methods of data analysis, which I hope will also have wide application and value.

- In this talk I will explain how, by extending standard statistical models from linear to multilinear, we can substantially increase their power and give them important new properties. The idea can be briefly explained as follows: while traditional methods find an optimal linear combination across one index of a two-way data array (combining columns of data), the generalized methods find jointly-optimal linear combinations across two (or more) indices of a three- (or higher)-way array. The figure below shows how a standard canonical correlation for the General Linear Model (GLM) is modified for a "level 1" multilinear generalization. The canonical weight vectors (columns of W on both sides) are chosen so that the correlation between the left and right canonical variates (columns of C) is maximal. Note that the data sources on the two sides do not need to have the same number of 'ways', so either side can be a matrix or a four-way array, etc.

- By introducing multilinear generalizations into the General Linear Model, this approach implicitly also generalizes its many special cases, such as Discriminant Analysis, (M)ANOVA/(M)ANCOVA, etc. In many of these applications, one side of the canonical relation would be a 'design matrix' or 'design array'. Statistical tests could be based on distribution-free compute-intensive methods such as randomization tests or bootstrapping.

- A further kind of generalization will also be described, called "level 2 multilinearity". Here, the patterns themselves are multilinear, and take the form of matrices or arrays with low-rank outer-product structure. For example, in the level 2 GLM, the canonical variates become tensors of order 2 or higher. Patterns with such added structure can convey "deeper" or "higher order" information about the data generating processes, including how specific latent properties in one 'way' of the array 'interact' or act jointly with specific latent properties in another.
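The trilinear fit behind PARAFAC, T[i][j][k] ≈ sum over r of a[i][r]*b[j][r]*c[k][r], reduces for a single component to alternating least-squares updates that fit in a few lines of plain Python. This is a sketch of the rank-one case on synthetic data, not Harshman's software:

```python
import random

random.seed(3)

# Build a 4x5x6 rank-one tensor T[i][j][k] = a[i] * b[j] * c[k].
a = [random.uniform(1, 2) for _ in range(4)]
b = [random.uniform(1, 2) for _ in range(5)]
c = [random.uniform(1, 2) for _ in range(6)]
T = [[[a[i] * b[j] * c[k] for k in range(6)] for j in range(5)] for i in range(4)]

def als_rank1(T, iters=30):
    I, J, K = len(T), len(T[0]), len(T[0][0])
    u, v, w = [1.0] * I, [1.0] * J, [1.0] * K
    for _ in range(iters):
        # Each update is an exact least-squares solve for one factor
        # with the other two held fixed -- the ALS step.
        u = [sum(T[i][j][k] * v[j] * w[k] for j in range(J) for k in range(K))
             / sum((v[j] * w[k]) ** 2 for j in range(J) for k in range(K)) for i in range(I)]
        v = [sum(T[i][j][k] * u[i] * w[k] for i in range(I) for k in range(K))
             / sum((u[i] * w[k]) ** 2 for i in range(I) for k in range(K)) for j in range(J)]
        w = [sum(T[i][j][k] * u[i] * v[j] for i in range(I) for j in range(J))
             / sum((u[i] * v[j]) ** 2 for i in range(I) for j in range(J)) for k in range(K)]
    return u, v, w

u, v, w = als_rank1(T)
err = max(abs(T[i][j][k] - u[i] * v[j] * w[k])
          for i in range(4) for j in range(5) for k in range(6))
print("max reconstruction error:", err)
```

The uniqueness result the abstract mentions shows up here as the factors being recovered up to scaling, with no rotational freedom, unlike two-way factor analysis.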

## Rohit Deo at Waterloo, Thursday, November 23, 2006

- UNIVERSITY OF WATERLOO Department of Statistics and Actuarial Science
- SPEAKER
- Dr. Rohit Deo (Stern School of Business, New York University)
- TOPIC
- Bias Reduction and Likelihood Based Almost Exactly Sized Hypothesis Testing in Predictive Regressions using the Restricted Likelihood
- DATE
- Thursday, November 23, 2006
- TIME
- 4:00pm
- PLACE
- MC 5158 (Coffee and cookies are available before the talk in MC5158)
- Abstract

The question of whether the time series of one variable, such as stock returns, can be predicted by lagged values of the time series of another variable, such as lagged dividend-price ratios or lagged book-to-market ratios, is of interest in financial econometrics. Predictive regression models are generally used for such analyses, but hypothesis testing in such models tends to be problematic due to persistence in the regressor series, which leads to biased slope coefficient estimates. We address the problem of estimation and hypothesis testing for the slope coefficient using the restricted likelihood. This likelihood is shown to yield estimates with much less bias than the usual least squares estimates. Furthermore, we show that the likelihood ratio test based on the restricted likelihood provides accurately sized tests, owing to its small curvature. The procedure is also extended to the case of multiple regressor series.
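The bias the abstract refers to is easy to reproduce by simulation. The sketch below is illustrative only (it shows the least-squares bias, not the restricted-likelihood estimator): OLS fitted to a persistent AR(1) is biased downward, roughly by -(1 + 3*rho)/n (Kendall's approximation).

```python
import random

random.seed(42)

rho, n, reps = 0.9, 50, 2000   # persistent series, short sample

def ols_slope(y):
    # OLS slope of y_t on y_{t-1}.
    x, z = y[:-1], y[1:]
    mx, mz = sum(x) / len(x), sum(z) / len(z)
    sxx = sum((a - mx) ** 2 for a in x)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    return sxz / sxx

est = []
for _ in range(reps):
    y = [0.0]
    for _ in range(n):
        y.append(rho * y[-1] + random.gauss(0, 1))
    est.append(ols_slope(y))

mean_est = sum(est) / reps
# Kendall's approximation predicts a mean near rho - (1 + 3*rho)/n ~ 0.826.
print("true rho:", rho, " mean OLS estimate:", round(mean_est, 3))
```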

## Keith Worsley at York: Tuesday, November 21, 2006

- Department of Mathematics and Statistics Colloquium
- Speaker
- Professor Keith Worsley (McGill University)
- Title
- Detecting Connectivity Between Images by Thresholding Random Fields: MS Lesions, Cortical Thickness, and the "Bubbles" Task in an fMRI Experiment
- Date
- Tuesday, November 21, 2006 (Refreshments will be served in N620 Ross Building at 3:30 p.m.)
- Time
- 4:00 p.m.
- Place
- N638 Ross Building
- Abstract
- We are interested in the general problem of detecting connectivity, or high correlation, between pairs of pixels or voxels in two sets of images. To do this, we set a threshold on the correlations that controls the false positive rate, which we approximate by the expected Euler characteristic of the excursion set. An exact expression for this is found using new results in random field theory involving Lipschitz-Killing curvatures and Jonathan Taylor's Gaussian Kinematic Formula. The first example is a data set on 425 multiple sclerosis patients. Lesion density was measured at each voxel in white matter, and cortical thickness was measured at each point on the cortical surface. The hypothesis is that increased lesion density interrupts neuronal activity, provoking cortical thinning in those grey matter regions connected through the affected white matter regions. The second example is an fMRI experiment using the "bubbles" task. In this experiment, the subject is asked to discriminate between images that are revealed only through a random set of small windows or "bubbles". We are interested in which parts of the image are used in successful discrimination, and which parts of the brain are involved in this task.

## Hans Tuenter at the University of Toronto, Thursday, November 16, 2006

- University of Toronto Department of Statistics Seminar
- Title
- "Stochastic Modeling of Wind Speed Time Series"
- Speaker
- Hans Tuenter (Ontario Power Generation)
- Date
- Thursday, 16 November 2006 at 4:00PM (Cookies and juice will be served in the DeLury Lounge SS6004 at 3:30 p.m.)
- Location
- Sidney Smith Hall **1074**
- Abstract
- We present a stochastic model for wind speeds that captures their short-term autocorrelation and long-term stationary properties. In addition, the model allows diurnal and seasonal components to be incorporated. One application of the model is Monte Carlo simulation of wind parks, whose economic viability depends heavily not only upon the wind speed distribution but also upon its diurnal and seasonal patterns.
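A toy version of such a model can be written in a few lines. The functional form and parameters below are invented for illustration (they are not Dr. Tuenter's model): a diurnal mean cycle plus an AR(1) fluctuation, squared to keep speeds non-negative and right-skewed.

```python
import math
import random

random.seed(7)

phi, sigma = 0.8, 0.5              # AR(1) persistence and innovation sd

def simulate(hours):
    z, speeds = 0.0, []
    for t in range(hours):
        diurnal = 2.0 + 0.5 * math.sin(2 * math.pi * (t % 24) / 24)
        z = phi * z + random.gauss(0, sigma)      # short-term autocorrelation
        speeds.append((diurnal + z) ** 2 / 2)     # skewed, non-negative scale
    return speeds

speeds = simulate(24 * 365)

# The lag-1 autocorrelation should be strong and positive, reflecting
# both the AR(1) term and the smooth diurnal cycle.
m = sum(speeds) / len(speeds)
num = sum((speeds[t] - m) * (speeds[t + 1] - m) for t in range(len(speeds) - 1))
den = sum((s - m) ** 2 for s in speeds)
print("mean speed:", round(m, 2), " lag-1 autocorrelation:", round(num / den, 2))
```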

## Elena A. Erosheva at Waterloo: Thursday, November 16, 2006

- UNIVERSITY OF WATERLOO Department of Statistics and Actuarial Science
- SPEAKER
- Dr. Elena A. Erosheva (University of Washington)
- TOPIC
- A Bayesian analysis of multivariate binary response data using basic and compartmental Grade of Membership models
- DATE
- Thursday, November 16, 2006
- TIME
- 4:00pm
- Abstract
- This talk presents an analysis of functional disability data from the National Long Term Care Survey (NLTCS). Functional disability reflects difficulties in performing activities that are considered normal for everyday living, such as dressing or grocery shopping. We employ a Bayesian framework to determine the characteristics and number of functional disability profiles in the data with the basic Grade of Membership (GoM) model. We then extend the GoM model hierarchy to incorporate a deterministically healthy compartment. With the compartmental GoM model, we estimate weights of the healthy and partially disabled components and examine the impact of this extension on the interpretation of disability profiles. We assess the choice of the optimal number of disability profiles with several approaches, including a Deviance Information Criterion (DIC) and an approximation to the Bayesian Information Criterion (BIC). We find that the functional disability data are described best by eight disability profiles and a deterministic healthy compartment.

## Jason Roy at Waterloo, November 9, 2006

- UNIVERSITY OF WATERLOO Department of Statistics and Actuarial Science
- TOPIC
- Causal comparisons in randomized trials of two active treatments: The effect of supervised exercise to promote smoking cessation.
- SPEAKER
- Dr. Jason Roy, University of Rochester
- DATE
- Thursday, November 9, 2006
- TIME
- 4:00pm
- PLACE
- MC 5158 (Coffee and cookies are available before the talk in MC5158)
- Abstract
- In behavioral medicine trials, such as smoking cessation trials, two or more active treatments are often compared. Noncompliance by some subjects with their assigned treatment poses a challenge to the data analyst. Causal parameters of interest might include those defined by subpopulations based on their potential compliance status under each assignment, using the principal stratification framework (e.g., causal effect of new therapy compared to standard therapy among subjects that would comply with either intervention). Even if subjects in one arm do not have access to the other treatment(s), the causal effect of each treatment typically can only be identified from the outcome, randomization and compliance data within certain bounds. We propose to use additional information -- compliance-predictive covariates -- to help identify the causal effects. Our approach is to specify marginal compliance models conditional on covariates within each arm of the study. Parameters from these models can be identified from the data. We then link the two compliance models through an association model that depends on a parameter that is not identifiable, but has a meaningful interpretation; this parameter forms the basis for a sensitivity analysis. We demonstrate the benefit of utilizing covariate information in both a simulation study and in an analysis of data from a smoking cessation trial.

## Chen Li at York: October 26, 2006

- Database Seminar at the School of Information Technology
- Title
- Answering Approximate Queries Efficiently
- Speaker
- Chen Li, UC Irvine
- Time
- Thursday, Oct. 26, 12:00pm
- Room
- TEL3009

- Abstract
- Many database applications have the emerging need to answer approximate queries efficiently. Such a query can ask for strings that are similar to a given string, such as "names similar to Schwarzenegger" and "telephone numbers similar to 412-0964," where "similar to" uses a predefined, domain-specific function to specify the similarity between strings, such as edit distance. There are many reasons to support such queries. To name a few: (1) The user might not remember exactly the name or the telephone number when issuing the query. (2) There could be typos in the query. (3) There could be errors or inconsistencies even in the database, especially in applications such as data cleaning.

- In this talk we will present some of our recent results on answering approximate queries efficiently. One problem related to optimizing such queries is to estimate the selectivity of a fuzzy string predicate, i.e., estimating how many strings in the database satisfy the predicate. We develop a novel technique, called SEPIA, to solve the problem. We will present the details of this technique using the edit distance function. We study challenges in adopting this technique, including how to construct its histogram structures, how to use them to do selectivity estimation, and how to alleviate the effect of non-uniform errors in the estimation. We discuss how to extend the techniques to other similarity functions. We show the results of our extensive experiments.

- Time permitting, we will also briefly report our other related results. One is on supporting fuzzy queries with both predicates on numeric attributes (e.g., salary > 50K) and predicates on string attributes (e.g., telephone numbers similar to 412-0964). Another one is on how to relax conditions in an SQL query that returns an empty answer. These results are based on three recent papers in VLDB'2005 and VLDB'2006.
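The edit-distance predicate at the heart of such queries is a small dynamic program. Here is a naive scan-based matcher, exactly the full-table scan that index structures and selectivity estimators like SEPIA are designed to avoid (illustrative sketch, not the speaker's code):

```python
def edit_distance(s, t):
    # Classic dynamic program: d[i][j] = edits to turn s[:i] into t[:j].
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(len(s) + 1):
        d[i][0] = i
    for j in range(len(t) + 1):
        d[0][j] = j
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(s)][len(t)]

def similar_to(query, names, k):
    # Naive approximate query: scan every string and keep those within
    # edit distance k of the query.
    return [n for n in names if edit_distance(query.lower(), n.lower()) <= k]

names = ["Schwarzenegger", "Schwartzenegger", "Schwarzeneger", "Stallone"]
print(similar_to("Schwarzenegger", names, 2))
```

Estimating in advance how many rows such a predicate will return, without running the scan, is the selectivity-estimation problem the talk addresses.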

- Biography
- Chen Li is an assistant professor in the Department of Computer Science at the University of California, Irvine. He received his Ph.D. degree in Computer Science from Stanford University in 2001, and his M.S. and B.S. in Computer Science from Tsinghua University, China, in 1996 and 1994, respectively. He received a National Science Foundation CAREER Award in 2003. He is currently a part-time Visiting Research Scientist at Google, Santa Monica. His research interests are in the fields of database and information systems, including data integration and sharing, data cleansing, data warehousing, and data privacy. More information is available at: http://www.ics.uci.edu/~chenli/

## Guangzhe Fan at York: October 20, 2006

- Title
- Kernel-Induced Classification Trees and Random Forests

- Speaker
- Guangzhe Fan
- Department of Statistics and Actuarial Science
- University of Waterloo

- When
- Friday, October 20, 2006, 2:30

- Where
- York University: N638 Ross Building

- Abstract

- A recursive-partitioning procedure using kernel functions is proposed for classification problems. We call it KICT: kernel-induced classification trees. Essentially, KICT uses kernel functions to construct CART models. The resulting model can perform significantly better in classification than the original CART model in many situations, especially when the pattern of the data is non-linear. We also introduce KIRF: kernel-induced random forests. KIRF compares favorably to random forests and SVM in many situations. KICT and KIRF also largely retain the computational advantages of CART and random forests, respectively, in contrast to SVM. We use simulated and real-world data to illustrate their performance. We conclude that the proposed methods are useful alternatives and competitors to CART, random forests, and SVM.
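To see why a kernel feature can help a tree, consider a one-dimensional sketch (invented toy data, not the authors' KICT code): a single CART-style split on the raw coordinate cannot separate an interval-shaped class, but one split on an RBF kernel feature can.

```python
import math
import random

random.seed(5)

# Non-linear toy problem: label 1 iff |x| < 1.
xs = [random.uniform(-2, 2) for _ in range(400)]
ys = [1 if abs(x) < 1 else 0 for x in xs]

def best_stump_accuracy(feature, ys):
    # Best single-threshold split, trying both orientations.
    vals = sorted(set(feature))
    best = 0.0
    for i in range(len(vals) - 1):
        thr = (vals[i] + vals[i + 1]) / 2
        acc = sum((f > thr) == (y == 1) for f, y in zip(feature, ys)) / len(ys)
        best = max(best, acc, 1 - acc)
    return best

# Raw coordinate: one split can capture at most one side of the interval.
raw_acc = best_stump_accuracy(xs, ys)
# RBF kernel feature K(x, 0) = exp(-x^2): |x| < 1 becomes a half-line.
kern_acc = best_stump_accuracy([math.exp(-x * x) for x in xs], ys)
print("raw-feature stump accuracy:", round(raw_acc, 2))
print("kernel-feature stump accuracy:", round(kern_acc, 2))
```

A full kernel-induced tree would offer such kernel features (with data points as centers) as candidate split variables at every node.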

## Fei Zou at York: October 20, 2006

- Title
- Mixture Models in Quantitative Trait Loci (QTL) Mapping

- Speaker
- Fei Zou
- Department of Biostatistics
- University of North Carolina - Chapel Hill

- When
- Friday, October 20, 2006, 10:30
- Where
- York University: N638 Ross Building

- Abstract
- In a QTL study, the putative QTL position is often unknown, and hence so are the QTL genotypes. Because the QTL genotypes are unknown, the phenotype data arise from mixtures of distributions under standard interval mapping procedures. Previous approaches to estimation involve modeling the distributions parametrically. In this talk, we will introduce several semi-parametric and non-parametric QTL mapping methods. Further, accurately estimating the QTL position is one of the major goals of any QTL study.
- Traditionally, the position corresponding to the peak of the profile LOD score from interval mapping is used to estimate the QTL position and is often referred to as the MLE of the QTL position. Is this estimate truly optimal? Several alternative estimates will help us answer this question.
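The mixture structure described above can be made concrete with a small EM fit. The sketch below is a generic two-component normal mixture, not the speaker's semiparametric methods, and all values are synthetic: the unobserved QTL genotype plays the role of the latent class.

```python
import math
import random

random.seed(9)

# Phenotypes: each drawn from N(mu0, 1) or N(mu1, 1) according to the
# unobserved genotype, with mixing weight w (all values synthetic).
mu0, mu1, w = 0.0, 2.0, 0.5
data = [random.gauss(mu1 if random.random() < w else mu0, 1.0) for _ in range(2000)]

def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

def em(data, iters=50):
    m0, m1, pi = min(data), max(data), 0.5
    for _ in range(iters):
        # E-step: posterior probability each point came from component 1.
        g = [pi * normal_pdf(x, m1) /
             (pi * normal_pdf(x, m1) + (1 - pi) * normal_pdf(x, m0))
             for x in data]
        # M-step: weighted means and mixing proportion.
        pi = sum(g) / len(g)
        m1 = sum(gi * x for gi, x in zip(g, data)) / sum(g)
        m0 = sum((1 - gi) * x for gi, x in zip(g, data)) / (len(g) - sum(g))
    return m0, m1, pi

m0, m1, pi = em(data)
print("estimated means:", round(m0, 2), round(m1, 2), " weight:", round(pi, 2))
```

In actual interval mapping the mixing weights are not free parameters but are determined by recombination fractions at each candidate position, which is what produces the profile LOD score.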

## Tim Ramsay at York: September 29, 2006

- Title
- Concurvity -- the bias that time forgot.

- Speaker
- Dr. Tim Ramsay
- Assistant Professor/Associate Scientist
- University of Ottawa/Ottawa Health Research Institute

- Where
- York University: N638 Ross Building
- When
- Friday, September 29, 2006, 10:30

- Abstract
- A special case of the generalized additive model in which one covariate is modeled linearly, the semiparametric additive model is becoming an increasingly popular statistical tool. Its appeal lies in the fact that it allows the analyst to control flexibly for a variety of confounders, without making any parametric assumptions about their effects, while still producing an easy-to-interpret linear effect for the covariate of interest.
- In 2002, however, it became painfully obvious to environmental epidemiologists that the linear effect could be seriously biased. Two interesting aspects of this discovery are that
- (1) serious as it is, this bias problem remains largely unknown in the larger statistical community and
- (2) the problem was first discovered twenty years ago but appears to have been completely forgotten.

- This talk will illustrate through an embarrassing example how concurvity, the nonparametric analogue of collinearity, leads to bias. The mechanism behind this bias will be discussed, together with ways to diagnose and avoid it. The fact that it was discovered twenty years ago, and again seven years ago, by theoreticians suggests a dangerous gap between theory and practice in the discipline of statistics.