Model Based Clustering using multilocus data with loci selection
Elisabeth Gassiat and Wilson Toussile
Advances in Data Analysis and Classification 2008.

## Abstract

A long-standing issue in population genetics is the identification of genetically homogeneous populations. The most widely used measures of population structure are Wright's F-statistics (Wright 1931). However, any inference based on these statistics first requires a definition of the populations, and this definition is typically subjective (based on linguistic, cultural, or physical characters, or on geographical location). Moreover, population structure may be difficult to detect using visible characters. We propose a Model-Based Clustering (MBC) method combined with loci selection using multilocus data. Loci selection is treated as a model selection problem, and the competing models are compared using the Bayesian Information Criterion (BIC). The resulting procedure selects the subset $\widehat{S}_n$ of clustering variables and the number $\widehat{K}_n$ of clusters, and estimates the proportion of each population and the allelic frequencies within each cluster. We prove that the selected model $\left(\widehat{K}_n, \widehat{S}_n\right)$ converges in probability to the true model $\left(K_0, S_0\right)$ under a single realistic assumption as the number $n$ of individuals tends to infinity. The proposed algorithm, named **MixMoGenD** ('Mixture Model for Genetic Data'), has been implemented in C++ and C, with an interface to **R**. Numerical experiments on simulated data sets were conducted to highlight the benefit of the proposed loci selection procedure.
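
To make the comparison of competing models $(K, S)$ concrete, the following minimal sketch counts the free parameters of one candidate model and turns a maximized log-likelihood into a BIC score. It is not the MixMoGenD implementation. It assumes, beyond what is stated above, that allele frequencies at the selected loci are estimated separately in each cluster while the remaining loci share a single set of frequencies, and it uses the common $-2\ln L + D\ln n$ form of BIC. The names `model_dimension` and `bic` are hypothetical.

```python
import math


def model_dimension(n_alleles, selected, k):
    """Count free parameters of a candidate model (K, S).

    n_alleles: number of distinct alleles observed at each locus.
    selected:  indices of the loci in S (the clustering loci).
    k:         number of clusters K.

    Assumption (not taken from the abstract): loci in S carry one set of
    allele frequencies per cluster; loci outside S share a single set.
    """
    selected = set(selected)
    dim = k - 1  # mixing proportions sum to one
    for locus, a in enumerate(n_alleles):
        copies = k if locus in selected else 1
        dim += copies * (a - 1)  # frequencies at a locus sum to one
    return dim


def bic(log_likelihood, dim, n):
    """BIC in the -2 ln L + D ln n form; smaller values are preferred."""
    return -2.0 * log_likelihood + dim * math.log(n)
```

For example, with three loci carrying 3, 4, and 2 alleles, $K = 2$ and $S$ made of the first two loci, `model_dimension([3, 4, 2], {0, 1}, 2)` counts 1 mixing proportion, 4 and 6 cluster-specific frequencies, and 1 shared frequency. A selection procedure of this kind would evaluate `bic` over the candidate pairs $(K, S)$ and keep the minimizer.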