Glossary
Term | Definition |
---|---|
Accuracy | It corresponds to the degree of agreement between model predictions and observed data points. This statistic quantifies how well the model accounts for variability in the data. High accuracy indicates a better fit and a more reliable representation of the underlying relationships in the mixed linear model. |
Akaike information criterion (AIC) | Statistical criterion used for model selection in the context of data analysis and statistics. It measures the trade-off between the goodness of fit of a model and its complexity, aiming to find the model that best represents the data while penalizing overly complex models. Lower AIC values indicate better-fitting and more parsimonious models. |
Aliasing | In the context of mixed linear models, it refers to a situation in which two or more predictor variables (or factors) are highly correlated, making it difficult to isolate each variable’s influence on the response and potentially leading to inaccurate model interpretations. Researchers often need to address aliasing by simplifying the model. |
Analysis of variance (ANOVA) | Statistical technique used to evaluate the significance of different components in a model. It assesses whether the variation in a dependent variable can be attributed to the effects of one or more independent variables (or factors). ANOVA calculates the F-statistic to compare the variance between groups to the variance within groups. |
Animal model | Statistical model that assesses the genetic influence on traits within populations. It considers genetic relationships among individuals to estimate narrow-sense heritability, genetic correlations, and breeding values, among others. Animal models are used to assist in breeding and selection decisions. |
Autoregressive errors | Specific pattern of errors in statistical modeling where the current residual in a dataset is influenced by its past residuals. In the context of time series analysis, residuals at one time point are related to residuals at previous time points, reflecting a temporal correlation pattern. |
Breeding value | Genetic worth of an individual organism in terms of its potential to pass on desirable traits to its offspring. It is a quantifiable measure used to predict an individual’s genetic contribution to the next generation’s performance or characteristics. |
Binomial distribution | It corresponds to a discrete probability distribution that models the outcomes of a random experiment, with a fixed number of independent and identical trials, where there are only two possible outcomes (binary). These outcomes are mutually exclusive and collectively exhaustive, meaning that only one of the two outcomes can occur in a single trial. |
Broad-sense heritability (H²) | Measure that quantifies the proportion of the total phenotypic variance in a population that is due to genetic differences among individuals. It considers the contributions of all genetic factors, including additive, dominance, and gene interaction effects (epistasis), to the overall phenotypic variability. |
Clonal model | Statistical model used to analyze data from clones. Traditional statistical models may assume independence among data points, but in reality clones share genetic similarities. To account for this, specialized statistical techniques such as hierarchical or mixed-effects models incorporate genetic relatedness as a factor, improving model accuracy when analyzing data from genetically similar individuals, as in plant breeding. |
Competition effect | This effect emerges when multiple individuals within a community compete for access to limited resources like food, habitat, or mates. This competition can significantly influence population dynamics and resource allocation and even lead to evolutionary adaptations as organisms strive to gain an advantage in resource acquisition within their system. |
Conditional sum of squares | In the context of analysis of variance, it is an approach to decompose the total sum of squares into various components. Here, the order in which predictor variables (independent variables) are added to the model matters. It calculates the sum of squares attributable to each variable while controlling (i.e., conditioning) for the effects of variables that were entered in the model earlier. It is also known as the Type 1 sum of squares. |
Contrasts | In statistics, they refer to comparisons made between different groups or levels of a factor to test hypotheses or evaluate differences. Contrasts are used in various statistical analyses, including analysis of variance (ANOVA) and regression analysis. Different types of contrasts, such as simple, polynomial, and planned contrasts, serve specific analytical purposes in statistical modeling and hypothesis testing. |
Convergence | Specific property or condition in the context of statistical modeling where an iterative algorithm is used for estimating model parameters. Convergence in this context means that the algorithm has reached a stable and acceptable solution. |
Correlated effects | Effects of multiple variables that are related to one another rather than independent. Correlated effects are often explored in data analysis and regression modeling to understand how variables interact and influence each other. |
Correlated errors | Describe a situation in statistical modeling where residual variations in data points are not random but display a discernible pattern or relationship with each other. This pattern deviates from the assumption of independent and random errors, requiring specific modeling approaches. |
Covariate | Variable that is not the primary focus of interest in modeling but is included because it may influence the other variables being studied. Covariates are used to control for potential confounding factors in statistical analyses, allowing researchers to better understand the relationship between the primary variables of interest and the response variable. |
Completely Randomized Design (CRD) | In a CRD, subjects or experimental units are randomly assigned to different treatment groups or conditions. This design is suitable when experimental units are homogeneous, and the randomization process ensures that each unit has an equal chance of receiving any treatment. |
Cubic spline | Mathematical function that consists of multiple cubic polynomials joined together in a smooth and continuous manner. These cubic polynomials are typically defined by control points or knots. In linear models, a cubic spline is used to model the relationship between a predictor variable and a response variable in a flexible way, allowing for more complex relationships to be captured. |
Design factors | In the context of experimental design, these factors are deliberately manipulated or controlled to investigate their effects on a dependent variable or outcome. Examples include blocks and whole plots. |
Design layout | In statistics, it refers to the structured arrangement of experimental units in a research study. It encompasses the systematic placement of treatments, controls, and randomization processes to ensure the study’s objectives are met efficiently and without bias. |
Diagnostics | It corresponds to the process of assessing and evaluating the quality, validity, and assumptions of statistical models. Diagnostic procedures help identify potential data issues, outliers, and violations of statistical assumptions that can affect the reliability of results. Common diagnostic techniques in statistics include residual analysis, normality tests, homoscedasticity tests, outlier detection, multicollinearity checks, goodness of fit tests, etc. |
Direct product | In linear mixed models and matrix algebra, the “direct product” typically refers to the Kronecker or tensor product, and is denoted by ⊗. It combines two or more matrices or vectors to create a larger matrix or vector by taking all possible pairwise products of the elements of the original matrices or vectors. |
Direct sum | In linear mixed models and matrix algebra, the direct sum, denoted ⊕, combines two or more algebraic structures into a larger structure while preserving their individual properties. For instance, the direct sum of matrices is a block diagonal matrix with the original matrices on the diagonal and zeros everywhere else, which results in independent components. |
Distribution | A distribution, or probability distribution, refers to a set of possible values or outcomes of a random variable and the probability associated with each of those outcomes. It describes how the values or observations are spread or distributed in a data set. Common types of distributions include the Normal distribution, binomial distribution, and Poisson distribution. |
Doubled haploid (DH) | Type of organism that is homozygous at all of its genetic loci, meaning it has two identical alleles for each gene. Doubled haploids are often used in plant breeding to accelerate the process of developing lines. |
Dominance | Dominance pertains to the interaction between alleles at a specific gene locus. Complete dominance occurs when one allele completely masks the expression of another. In incomplete dominance, the heterozygous individual displays an intermediate phenotype. Codominance involves both alleles being expressed simultaneously, often resulting in a mixed phenotype. |
Epistasis | Genetic phenomenon in which the effect of one gene (locus) masks or modifies the effect of another gene at a different locus. It involves the interaction between multiple genes to determine an organism’s phenotype. |
Epistatic ratio | It corresponds to the portion of the genetic variance due to the epistatic component. It is expressed against the total phenotypic variance. |
Error structure | Variance-covariance structure that describes how the errors, or residuals, in a statistical model are expected to behave. The error structure provides information about the assumptions made regarding the distribution and correlation of these errors. |
Estimable | Parameter or coefficient of a linear model that can be estimated with a unique and stable solution based on the available data. |
Experimental design/layout | In statistics, it refers to the structured arrangement of experimental units in a research study. It encompasses the systematic placement of treatments, controls, and randomization processes to ensure the study’s objectives are met efficiently and without bias. |
Experimental unit | It represents the smallest level at which treatments or interventions are applied. Often, these are also the units on which data are collected. |
F-test | Statistical test used to compare variances, or ratios of variances, between two or more groups or populations. It is often used in analysis of variance (ANOVA) and regression analysis. The F-test helps determine whether there are statistically significant differences among the levels of a factor or among variables. |
Factor levels | The different values or categories that a factor (independent variable) can take on in the context of linear models. Factors are variables that are manipulated in an experiment to observe their effect on a response variable (dependent variable). |
Full-sib family | Group of individuals who share both biological parents. Each full-sib pair typically shares, on average, 50% of their genetic material inherited from their common parents. Full-sib families are often used in genetic studies, breeding programs, and research to explore genetic inheritance patterns, variation, and heritability of traits. |
G structure | In the context of mixed-effects models, it refers to the variance-covariance structure that describes the assumptions made about the random effects. |
Genomic best linear unbiased prediction (GBLUP) | It is a statistical method used in genetics to estimate the genetic merits or breeding values of individuals based on their genomic data, such as SNP marker data. |
Genomic-estimated breeding value (GEBV) | Represents the estimated genetic value or merit of an individual based on a genomic prediction model, such as GBLUP. |
Generalized linear model (GLM) | Statistical framework that extends traditional linear regression to model situations where the response is not normally distributed. GLMs allow for various probability distributions (e.g., binomial, Poisson, gamma, and exponential) and use a link function to relate the predictors to the expected value of the response. |
Generalized linear mixed model (GLMM) | Combines aspects of generalized linear models and mixed-effects models. GLMMs are used to analyze data with non-normal distributions and complex correlated structures. Common types of distributions that can be accommodated are binomial, Poisson, gamma, and exponential. |
Genomic relationship matrix (GRM) | Matrix used in quantitative genetics and genomics to quantify the genetic relatedness between individuals based on genomic marker data, such as SNPs. It is used to estimate the genetic merit or breeding value of individuals for specific traits without relying on traditional pedigree information. |
Heritability | Statistic that measures the proportion of the total phenotypic variation in a trait within a population that can be attributed to genetic factors. It’s a number between 0 and 1, where 0 means genes have no impact, and 1 means genes are the sole factor determining the trait. There are two main types of heritability: narrow-sense heritability (h²) and broad-sense heritability (H²). The former represents the portion of the total phenotypic variation that is due solely to the additive genetic effect, while the latter also considers the dominance and epistatic effects. |
Hierarchical model | Analytical framework that incorporates multiple levels or layers of data, capturing dependencies and variations within a dataset. Hierarchical models are a type of linear mixed model and they are valuable for modeling data with nested structures, such as students within schools, patients within hospitals, or measurements within blocks. |
Heteroscedasticity / Heterogeneity | In the context of linear models, it occurs when the variability of errors (residuals) is not consistent across various levels or values of the independent variable(s). In simpler terms, it means that the spread or dispersion of residuals systematically varies as you move along the range of predictor variables. This phenomenon violates one of the key assumptions of linear regression, which assumes a constant error variance (homoscedasticity). |
Hypothesis testing | Method to assess whether observed data support a particular claim or hypothesis about a group or a population. It involves formulating a null and an alternative hypothesis, collecting data, choosing a significance level, and performing a statistical test. The outcome determines whether to reject or fail to reject the null hypothesis, enabling researchers to draw conclusions and make inferences about the group or population from which the sample was taken. |
Inbreeding | Genetic phenomenon where closely related individuals within a population intermate, increasing the chances of offspring inheriting identical alleles (homozygous) from both parents. This can lead to a higher chance of expression of recessive alleles and reduced genetic diversity within the population, potentially impacting its long-term development and adaptability. |
Inbreeding coefficients | Numerical measure representing the probability that two alleles at a specific gene locus in an individual are identical by descent. It quantifies the degree of inbreeding within a population. Higher coefficients indicate greater relatedness. |
Incomplete block design | Experimental design where experimental subjects or items are divided into blocks, but not every treatment is applied to each block. It helps control sources of variation and efficiently compare treatments. This design is useful when complete randomization is not feasible. |
Incremental sum of squares | In the context of analysis of variance, it is an approach to decompose the total sum of squares into various components. In this approach, the order in which predictor variables are added to the model matters. It calculates the sum of squares attributable to each variable while accounting for the effects of the variables already in the model. |
Information matrix | In statistics, it is a square matrix used in maximum likelihood estimation. It quantifies the curvature of the likelihood function at the estimated parameter values, providing information about parameter precision. The inverse of the information matrix is used to estimate the variance-covariance matrix of parameter estimates. The diagonal elements represent the precision of parameter estimates, while off-diagonal elements capture parameter covariances. |
Initial values | In linear mixed models, they are the starting estimates or guesses for the variance components that are used as the basis for iterative optimization algorithms. Choosing appropriate initial values can improve the efficiency of parameter estimation and help avoid convergence to a local optimum. |
Interaction effect | Situations where the combined influence of two or more factors is not simply the sum of their individual effects. Instead, these factors interact in a way that produces a different and unexpected response. Identifying and analyzing interaction effects is crucial for understanding complex relationships in data. |
Intercept | The intercept provides valuable information about the baseline or starting point of the relationship between the dependent and independent variables. In regression analysis, it represents the value of the dependent variable when all independent variables are set to zero. In linear modeling, it is the parameter representing the mean of the population. |
Lagrange multipliers | Mathematical technique used to solve constrained optimization problems in calculus. These multipliers help find the maximum or minimum of a function subject to one or more equality constraints. The approach involves introducing additional variables (Lagrange multipliers) to form a modified objective function called the Lagrangian, which is then optimized. |
Leverage | It quantifies the impact of an observation on the estimation of a statistical model’s parameters and can help identify influential data points. More specifically, it quantifies how far an observation’s predictor variables are from the mean of those predictors. |
Linear combination | Mathematical operation where several variables are combined with specified coefficients. This combination creates a new variable or expression. For example, in a simple linear regression model, the dependent variable Y can be expressed as a linear combination of one independent variable and an error term. |
Longitudinal analysis | Statistical technique used to analyze data collected from the same individuals, objects, or entities over multiple time points. It allows researchers to examine how variables change over time, assess trends, and investigate relationships between variables within the same subjects. |
Marginal | In statistics, this term is used to describe the distribution or properties of a single variable or subset of variables within a multivariate context, while keeping other variables constant or averaging over them. It is used in various statistical concepts and techniques, including marginal distributions, marginal means, and marginal likelihoods. |
Maternal effect | In breeding and genetics, it is a phenomenon where the phenotype or characteristics of offspring are influenced not only by their genotype but also by the maternal genotype or the environment provided by the mother during pregnancy, incubation, or early development. Maternal effects can play a significant role in the expression of traits in offspring. |
Meta-analysis | Statistical technique used to systematically combine and analyze the results of multiple statistical models or individual studies that have investigated a similar problem. The primary goal of meta-analysis is to provide a statistically sound quantitative summary or synthesis of the findings from these separate models or studies, allowing researchers to draw more robust and generalizable conclusions. |
Missing value | Occurs when no information is available or recorded for a particular observation or variable in a dataset. |
Moving average | Statistical technique used to smooth out fluctuations and trends in a time series dataset by averaging values over a specific interval. Moving averages help identify underlying patterns by reducing noise and making it easier to visualize and analyze trends. |
Multiple environment trial (MET) | Also known as a Multi-Environment Trial, it is a statistical analysis used in plant science. METs are conducted to evaluate the performance of plant populations, varieties, genotypes, clones, or hybrids across multiple environments or locations, which may have varying soil types, climate conditions, and management practices. |
Multi-trait analysis | Statistical method where multiple dependent variables (traits) are analyzed simultaneously within a single analytical framework. This approach is particularly useful when there is a relationship or correlation between the different traits. |
Multivariate modeling | Approach that involves the simultaneous analysis of multiple dependent variables in a single statistical model. This approach is used to understand the relationships between these variables and how they collectively respond to changes in one or more independent variables. |
Narrow-sense heritability (h²) | Statistic used in quantitative genetics to describe the proportion of the total phenotypic variation observed in a population for a specific trait that is attributed to the additive genetic variation/effect. |
Negative binomial | It is a probability distribution used in statistics to model the number of trials required for a specific number of successes in a sequence of independent, identical Bernoulli trials. It is characterized by two parameters: the probability of success on a single trial (p) and the desired number of successes (r). This distribution is useful for predicting the number of trials required for a specified number of successes to occur. |
Nested design | Experimental design where one factor or grouping structure is entirely contained within another. This nesting creates a hierarchical relationship between the factors, where the levels of the nested factor are unique to each level of the higher-level factor. This approach is often used when there are inherent hierarchical relationships or subgroups in the data, such as students within classrooms or plants within specific geographic regions. |
Nested effect | Statistical effects or sources of variation where one factor or grouping structure is entirely contained within another. There are typically two types of effects: the main effects (associated with the higher-level factor) and the nested effects (associated with the lower-level factor nested within the higher-level factor). |
Non-singular | Statistical term used to describe a matrix or system of equations that is not singular, meaning it has a unique solution. In non-singular cases, there is no redundancy or inconsistency in the data or equations, allowing for a well-defined and solvable problem. This also means that, for a square matrix, there exists an inverse. |
Normal distribution | Also known as the Gaussian distribution, it is a continuous probability distribution in which values are symmetrically distributed around a central mean, with the standard deviation controlling the spread or dispersion. Many real-world phenomena approximate a normal distribution. The normal distribution is essential in statistical analysis because it allows for the application of various statistical methods, including hypothesis testing and confidence interval estimation. |
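
As a small illustration of the direct product and direct sum entries above, the following sketch builds both from two 2 × 2 matrices in plain Python. The function names `direct_product` and `direct_sum` and the matrices `A` and `B` are assumptions made only for this example; they do not come from any particular mixed-model software.

```python
def direct_product(a, b):
    """Kronecker (direct) product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in a_row for b_kl in b_row]
        for a_row in a
        for b_row in b
    ]


def direct_sum(a, b):
    """Direct sum: block diagonal matrix with a and b on the diagonal."""
    a_cols, b_cols = len(a[0]), len(b[0])
    top = [list(row) + [0.0] * b_cols for row in a]
    bottom = [[0.0] * a_cols + list(row) for row in b]
    return top + bottom


# Hypothetical matrices, chosen only for illustration
A = [[1.0, 0.5],
     [0.5, 1.0]]
B = [[2.0, 0.0],
     [0.0, 3.0]]

kron_AB = direct_product(A, B)  # 4 x 4: every element of A times every element of B
sum_AB = direct_sum(A, B)       # 4 x 4: A and B as independent diagonal blocks

for row in kron_AB:
    print(row)
for row in sum_AB:
    print(row)
```

In the direct sum, the off-diagonal blocks are zero, so the two components do not covary; in the direct product, each element of `A` scales a full copy of `B`.
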
Disclaimer
Part of the content in this document has been generated with the assistance of the large language model GPT-3.5 (OpenAI 2022). However, all text has been fully revised and curated by our specialists for accuracy and veracity.