Speaker Abstracts

Promoting an Authentic Experience of Statistical Practices in Statistics Education

Layla Guyot (Texas State University)

The demand for analytical skills is growing in many fields and sectors, and the employment of statisticians is expected to increase drastically. However, there seems to be a misalignment between the skills acquired through statistics education and the skills that employers require of statisticians (Gibbons & MacGillivray, 2014). Indeed, the transition from learning statistics to practicing statistics is challenging because professional statisticians develop reasoning and practices that grow out of experience (Pfannkuch & Wild, 2000; Kent, Bakker, Hoyles, & Noss, 2005). The misalignment between theory and practice can be addressed by including an authentic experience of statistical practices throughout statistics education (Anderson & Loynes, 1987; Hahn, 2014; Bakker & Akkerman, 2014).
This presentation will enable statisticians to reflect on their experience, identify prevailing practices and add to the existing literature since statistical practices are constantly evolving.

A goodness-of-fit test based on empirical distribution function

Sunil Mathur (Texas A&M University-Corpus Christi)

In this paper, we propose a new goodness-of-fit test for a parametric distribution function based on the empirical distribution function. The logarithm of the ratio of the empirical distribution function to the null distribution function is used to construct the proposed test statistic. We compute the expectation and variance of the proposed test statistic and show that it is distribution-free. The performance of the proposed test is compared with that of some of its competitors in a Monte Carlo simulation study.
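The abstract does not give the test statistic itself, so the following Python sketch uses a hypothetical stand-in (the mean absolute log-ratio of the empirical distribution function to the null distribution function, evaluated at the order statistics) to illustrate two ideas mentioned above: distribution-freeness under a fully specified continuous null, and Monte Carlo calibration. The names `log_ratio_statistic` and `null_critical_value` are assumptions made for illustration, not the author's definitions.

```python
import math
import random


def log_ratio_statistic(sample, null_cdf):
    """Hypothetical statistic: mean |log(F_n(x_(i)) / F_0(x_(i)))| over the order statistics."""
    xs = sorted(sample)
    n = len(xs)
    # The EDF at the i-th order statistic is i / n, so it is never zero.
    return sum(abs(math.log((i / n) / null_cdf(x))) for i, x in enumerate(xs, start=1)) / n


def exp_cdf(x):
    """CDF of the standard exponential distribution."""
    return -math.expm1(-x)


def unif_cdf(u):
    """CDF of the uniform distribution on (0, 1)."""
    return u


# The same uniform draws, viewed on two scales via the inverse exponential CDF.
rng = random.Random(0)
uniforms = [rng.random() for _ in range(50)]
exp_sample = [-math.log1p(-u) for u in uniforms]

stat_unif = log_ratio_statistic(uniforms, unif_cdf)
stat_exp = log_ratio_statistic(exp_sample, exp_cdf)


def null_critical_value(n, level=0.05, reps=2000, seed=1):
    """Monte Carlo upper critical value, simulated from uniform samples of size n."""
    sim_rng = random.Random(seed)
    stats = sorted(
        log_ratio_statistic([sim_rng.random() for _ in range(n)], unif_cdf)
        for _ in range(reps)
    )
    return stats[int((1 - level) * reps)]


critical = null_critical_value(50)
reject = stat_exp > critical  # decision for the simulated exponential sample
```

Because this stand-in depends on the data only through the null distribution function evaluated at the observations, `stat_unif` and `stat_exp` agree up to rounding, which is why the null distribution can be simulated once from uniform samples.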

Gamma Degradation Model and Related Applications: From Light Intensity to Drug Dissolution

Hon Keung Tony Ng (Southern Methodist University)

In engineering and the sciences, a process in which the performance, reliability or life span of a system or asset declines gradually and irreversibly is known as a degradation process. Degradation measurements are recorded over time for prognostics and health management purposes. The gamma degradation model has been used to characterize the evolution of degradation measurements. In this talk, I will first provide an introduction to the gamma degradation model. Then, I will introduce a two-phase degradation model and discuss the likelihood and Bayesian inference for this model. The gamma model and the inferential methods are applied to analyze a real data set of light emitting diodes (LEDs). In addition to engineering applications, I will also discuss an application of the gamma degradation model in biopharmaceutical statistics. The gamma degradation model is used for assessing the similarity of two drug dissolution profiles and its merits are discussed.
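As a rough illustration of the basic (single-phase) model, a gamma process has independent, nonnegative increments, so simulated degradation paths never decrease. The Python sketch below assumes a stationary gamma process whose increment over a time step `dt` follows a gamma distribution with shape `shape_rate * dt` and a common scale. The parameter values and the failure threshold are made up for illustration and are not taken from the LED or dissolution data.

```python
import random


def simulate_gamma_path(shape_rate, scale, dt, steps, rng):
    """Simulate one degradation path observed every dt time units, starting at 0."""
    level = 0.0
    path = [level]
    for _ in range(steps):
        # Independent gamma increment: shape grows with elapsed time, scale is fixed.
        level += rng.gammavariate(shape_rate * dt, scale)
        path.append(level)
    return path


def first_passage_time(path, dt, threshold):
    """First observation time at which the path reaches the threshold, or None."""
    for step, level in enumerate(path):
        if level >= threshold:
            return step * dt
    return None


# Hypothetical settings: 500 units, each observed 50 times at spacing 0.2.
rng = random.Random(42)
paths = [simulate_gamma_path(2.0, 0.5, 0.2, 50, rng) for _ in range(500)]

# Under this model the mean degradation at time t is shape_rate * t * scale.
mean_final = sum(path[-1] for path in paths) / len(paths)
failure_times = [first_passage_time(path, 0.2, 5.0) for path in paths]
```

With these settings the expected degradation at time 10 is 2.0 × 10 × 0.5 = 10, so `mean_final` should land close to that value, and `failure_times` gives a simulated lifetime for each unit that crosses the threshold within the observation window.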

Classification of Protein Binding Ligands Using Their Structural Information

Leif Ellingson (Texas Tech University)

It is known that a protein's biological function is in some way related to its physical structure. Many researchers have studied this relationship both for the entire backbone structures of proteins and for their binding sites, which are where binding activity occurs. However, despite this research, it remains an open challenge to predict a protein's function from its structure. The main purpose of this research is to gain a better understanding of how structure relates to binding activity and to classify proteins according to function via structural information. First, we performed the classification of binding sites for the data set arising from Ellingson and Zhang (2012) through the use of logistic regression. Then we approached the problem using the data set compiled by Kahraman et al. (2007). We calculated the covariance matrices of the binding sites' coordinates, which use the distance of each atom to the center of mass, and calculated the distance from an atom to the 1st, 2nd and 3rd principal axes. Then we obtained covariance matrices of these distances to serve as our data objects. Finally, we performed classification on these matrices using a variety of techniques, including nearest neighbor and logistic regression.

Practical Statistics: Mitigating Data Quality Issues

Sue Stankus (University of Texas at San Antonio)

Results of statistical analyses are dependent upon the data used. In industry, statisticians can spend a large percentage of their time being data detectives, tracking down issues with inputs to their analyses. However, the validation of data is limited to a few sentences in many texts. There have been several methods proposed for categorizing data quality in the field of management information systems. A common element is data accuracy. In this presentation, common data accuracy issues are discussed and several examples of these issues are given to demonstrate how data issues can impact analyses. The examples include the issue discovery method, issue resolution, and impact of the issue on analyses.

Relational Novelty Detection: attribute-wise learning approach

Daijin Ko (The University of Texas at San Antonio)

Novelty detection methods identify instances that appear markedly different from the “normal” training data. In this paper, we propose a new method, “Relational Novelty Detection” (RENOD), that reformulates novelty detection as a set of supervised learning problems. For each nominal attribute, we learn a supervised classification model that predicts the factor values of that attribute from the values of all other attributes. We assign a relational novelty score based on the prediction probability from the learned model and the observed nominal attribute. For numerical attributes, we compute the deviation between the prediction from a supervised regression model and the observed attribute value, and assign a relational novelty score from the probability distribution of the deviation derived from the “normal” training data. In addition, we derive a marginal novelty score from the distribution of the univariate feature in the “normal” training data and combine it with the relational novelty score to generate the attribute's novelty score. The overall novelty score is derived by combining the novelty scores for all attributes. A suitable threshold for testing novelty is derived from the overall novelty scores of the “normal” training data. We demonstrate the method with a well-known intrusion detection data set for cyber-security and show that this approach is highly powerful in detecting novelty in high-dimensional data with mixed attributes and many irrelevant attributes. The method is applicable in broad areas such as security, malware classification, fraud detection for credit cards, and insurance or health care data. In addition to the analysis, we discuss how to interpret the detected novelty using the profiles of novelty scores.
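The attribute-wise idea for nominal attributes can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the RENOD implementation: the "classifier" for each attribute is just a smoothed conditional frequency table given all other attributes, the per-attribute score is the negative log of the predicted probability of the observed value, scores are combined by summation, and the threshold is the largest training score. Numerical attributes and marginal scores are omitted, and the toy records are invented.

```python
import math
from collections import Counter, defaultdict


def fit_attribute_models(rows):
    """For each attribute, tabulate its values given the values of all other attributes."""
    models = []
    for j in range(len(rows[0])):
        levels = {row[j] for row in rows}
        table = defaultdict(Counter)
        for row in rows:
            context = row[:j] + row[j + 1:]
            table[context][row[j]] += 1
        models.append((levels, table))
    return models


def novelty_score(row, models):
    """Sum over attributes of -log P(observed value | other attributes)."""
    score = 0.0
    for j, (levels, table) in enumerate(models):
        context = row[:j] + row[j + 1:]
        counts = table.get(context, Counter())
        # Add-one smoothing, with one extra slot reserved for unseen values.
        prob = (counts[row[j]] + 1) / (sum(counts.values()) + len(levels) + 1)
        score -= math.log(prob)
    return score


# Invented "normal" training data: (protocol, service) pairs.
training = [("tcp", "http")] * 20 + [("udp", "dns")] * 20
models = fit_attribute_models(training)
threshold = max(novelty_score(row, models) for row in training)

# Each value is individually common, but the combination was never observed.
candidate = ("udp", "http")
candidate_score = novelty_score(candidate, models)
is_novel = candidate_score > threshold
```

The candidate record is flagged because neither attribute is well predicted from the other, even though both of its values look ordinary on their own.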

Using an Oracle to Improve the Quality of a Prediction

Cabral Balreira (Trinity University)

We will discuss the main models used for ranking teams in the growing field of sports analytics. The basic problem will be framed as a question in network theory, where the outcomes of games determine an underlying directed network. Using linear algebra, we will determine the rank of individual teams and predict the outcomes of future games. We will introduce a new ranking method, the Oracle Ranking, a customizable network ranking method that has shown good predictive results in the NFL and NBA. Using the idea of a home advantage, we will show how to improve and quantify the quality of predictions.
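One common way to turn game results into a ranking with linear algebra is a PageRank-style random walk on the directed network, where each loser points to the team that beat it. The Python sketch below takes that approach and, as an assumption about what an "oracle" could look like, adds one extra node linked to and from every team so that the walk never gets stuck. The abstract does not spell out the Oracle Ranking, so the function `oracle_style_ranking` and the three-game schedule are hypothetical illustrations only.

```python
def oracle_style_ranking(games, iterations=200):
    """Rate teams by the stationary distribution of a random walk on the game network.

    games is a list of (winner, loser) pairs. An extra node (assumed here) is
    connected to and from every team, which keeps the network strongly connected.
    """
    teams = sorted({team for game in games for team in game})
    oracle = "__oracle__"
    nodes = teams + [oracle]

    out_edges = {node: [] for node in nodes}
    for winner, loser in games:
        out_edges[loser].append(winner)  # the loser "votes" for the winner
    for team in teams:
        out_edges[team].append(oracle)
        out_edges[oracle].append(team)

    # Power iteration: repeatedly push each node's score along its outgoing edges.
    score = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        updated = {node: 0.0 for node in nodes}
        for node, targets in out_edges.items():
            share = score[node] / len(targets)
            for target in targets:
                updated[target] += share
        score = updated

    # Report ratings for the real teams only, rescaled to sum to 1.
    team_total = sum(score[team] for team in teams)
    return {team: score[team] / team_total for team in teams}


# Hypothetical mini-season: A beats B, A beats C, B beats C.
ratings = oracle_style_ranking([("A", "B"), ("A", "C"), ("B", "C")])
ranking = sorted(ratings, key=ratings.get, reverse=True)
```

For this toy schedule the ranking comes out as A, B, C, matching the win-loss records. A home advantage could be explored by weighting home and away wins differently, which this sketch does not attempt.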

Similarity Scores for Mixed Types of Data with Applications in Home Properties

Keying Ye (University of Texas at San Antonio)

Similarity or “distance” scores between two objects described by many variables have been studied for continuous, categorical and ordinal variables. In this talk, a robust method of developing similarity or “distance” scores between different residential home properties with mixed types of data will be discussed. Such scores can be used to find “similar” or “dissimilar” properties for comparisons and, more importantly, to select comparable properties (“comps”) in home property appraisals.
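A standard baseline for mixed data is Gower's similarity coefficient, which averages a per-variable similarity: a range-scaled difference for numeric variables, exact match for categorical ones, and a rank difference for ordinal ones. The Python sketch below implements that baseline, not the robust method of the talk, which the abstract does not specify. The property records, field names and condition scale are invented for illustration.

```python
CONDITION_LEVELS = ["poor", "fair", "good", "excellent"]  # assumed ordinal scale

NUMERIC = ["sqft", "bedrooms"]
CATEGORICAL = ["style"]
ORDINAL = {"condition": CONDITION_LEVELS}

# Invented example properties.
homes = [
    {"sqft": 1500, "bedrooms": 3, "style": "ranch", "condition": "good"},
    {"sqft": 1600, "bedrooms": 3, "style": "ranch", "condition": "fair"},
    {"sqft": 3000, "bedrooms": 5, "style": "colonial", "condition": "excellent"},
]


def gower_similarity(a, b, data):
    """Average per-variable similarity in [0, 1]; 1 means identical on every variable."""
    parts = []
    for key in NUMERIC:
        values = [row[key] for row in data]
        spread = max(values) - min(values)
        # Scale the absolute difference by the observed range of the variable.
        parts.append(1.0 if spread == 0 else 1.0 - abs(a[key] - b[key]) / spread)
    for key in CATEGORICAL:
        parts.append(1.0 if a[key] == b[key] else 0.0)
    for key, levels in ORDINAL.items():
        gap = abs(levels.index(a[key]) - levels.index(b[key]))
        parts.append(1.0 - gap / (len(levels) - 1))
    return sum(parts) / len(parts)


sim_01 = gower_similarity(homes[0], homes[1], homes)  # two similar ranch homes
sim_02 = gower_similarity(homes[0], homes[2], homes)  # ranch versus large colonial
best_comp = max((1, 2), key=lambda i: gower_similarity(homes[0], homes[i], homes))
```

Here the first two homes score 0.9 and the first and third score about 0.17, so the second home would be picked as the comparable for the first. Weighting variables unequally or replacing the range with a robust scale are natural variations.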

Prediction Intervals for Integrals of Some Types of Non-Gaussian Random Fields: A Semiparametric Bootstrap Approach

Bazoumana Kone (University of Texas at San Antonio)

This work proposes a method to construct prediction intervals for integrals of non-Gaussian random fields over bounded regions (called block averages in the geostatistical literature). The method uses a semiparametric approach that does not require distributional assumptions, but only parametric assumptions about the mean and covariance functions of the random field. The resulting semiparametric bootstrap prediction interval overcomes some drawbacks of the commonly used plug-in block kriging prediction interval: the former has better coverage probability properties than the latter, since it accounts for the uncertainty from parameter estimation and does not rely on the assumption of Gaussianity. The method is illustrated in the prediction of block averages of cadmium traces in a potentially contaminated region in Switzerland.