ASR011. LMM with log transformation of the response and pedigree - Tawny dragon lizards

The complete script for this example can be downloaded here:

Dataset

The dataset we will use in this example is D007, and the first few rows are presented below:

idsort indiv dam sire hatch_date cohort generation sex morph p_orange p_yellow p_grey
1 203 0 0 NA 1 1 F OY 13.96 9.69 76.35
2 205 0 0 NA 1 1 F G 0.40 0.24 99.37
3 206 0 0 NA 1 1 F Y 1.13 34.73 64.14
4 209 0 0 NA 1 1 F O 57.11 4.20 38.69
5 211 0 0 NA 1 1 F OY 2.66 6.23 91.11
6 218 0 0 NA 1 1 F O 35.78 0.03 64.19

The pedigree information for this analysis is contained within the above data frame, corresponding to columns 2, 3, and 4, with indiv representing the individuals, and dam and sire the female and male parents, respectively.


Model

In this example will be fitting a LMM with pedigree information using a log-transformed response variable. The specification of the model is:

\[ y = \mu + cohort + sex + gen + e \]

where:

    \(y\) is the proportion of yellow,

    \(\mu\) is the overall mean,

    \(cohort\) is the fixed effect of cohort,

    \(sex\) is the fixed effect of sex,

    \(gen\) is the random effect of genotype, with \(gen \sim \mathcal{N}(0,\,\sigma^{2}_{g})\),

    \(e\) is the random residual effect, with \(e \sim \mathcal{N}(0,\,\sigma^{2}_{e})\).

The factor \(gen\) is associated with the numerator relationship matrix (NRM) \(\mathbf{A}\).


First, let’s fit the model with ASReml-R using the original (un-transformed) response variable. Note that before fitting the model, the inverse of NRM needs to be obtained using the pedigree information with the function ainverse() (see ASR013 for more details), and indiv, cohort, and sex need to be set as factors.

asr011 <- asreml(
 fixed = p_yellow ~ cohort + sex,
 random = ~vm(indiv, ainv),
 residual = ~units,
 data = d007
)

Note that the above model is including the inverse of the NRM by the use of the expression vm() that indicates variance model and that links a factor to the inverse matrix.


The visual diagnostics of the residuals can be assessed with:

plot(asr011)

These plots look relatively reasonable; however, several issues are visible: the distribution of the residuals is skewed to the left; there is a trend on the top-left plot, indicating that the assumption of linearity might not be entirely valid; and several potential outliers are identified.

Therefore, it is possible to consider a transformation of the response variable to improve the model fit. The logarithmic transformation will be used here according to the following expression:

\[ \normalsize y = { log(pyellow+1)\over100-pyellow+1} \]
where,
    \(y\) is the transformed response variable, and \(pyellow\) is the proportion of yellow.

This transformation is implemented in R using the following code:

(d007$p_yellow_log <- log((d007$p_yellow + 1)/(100 - d007$p_yellow + 1)))
 [1] -2.14495159 -4.39763007 -0.61774663 -2.92398837 -2.57321387 -4.58526464
 [7] -4.61512052 -2.58218016 -2.73599740 -1.83993632 -2.65972470 -4.56583518
[13] -4.50967080 -4.29663968 -2.64696251 -2.11296423 -2.26551431 -0.49954731
[19] -3.53497384 -4.50967080 -0.54992266 -3.10642181 -1.64226674 -0.16665919
[25] -0.63892230  1.10227578 -2.50385695 -0.69933004 -4.56583518 -3.94834994
[31] -4.31154456 -4.20469262  0.54654371 -0.69314718 -3.72143501 -0.05805552
[37] -4.59511985 -4.54676856 -0.91835046 -4.61512052 -0.60399002 -2.68732407
[43] -2.71224088 -4.55625737 -4.54676856 -4.61512052 -4.58526464 -2.56579431
[49] -1.10752103 -4.58526464 -3.19348147 -0.90397025 -2.38512914 -4.47387232
[55] -4.23909156 -2.65492228 -2.04354862 -4.30406509 -4.52805133 -4.38949865
[61] -1.94054281 -3.07577498 -3.17295958 -1.13023548 -0.43373294 -4.61512052
[67] -3.89683435 -2.89597401 -4.42241878 -4.61512052 -4.61512052 -3.99700334
[73] -1.80555279 -2.04065552 -4.59511985 -4.15838535 -4.61512052 -4.61512052
[79] -4.60507117 -2.49546844 -4.61512052 -4.51881975 -4.48270515 -4.61512052
[85] -2.17776871 -3.22501081 -4.61512052 -4.61512052 -4.55625737 -4.55625737
[91] -2.99295635 -4.61512052 -2.97371752 -4.31154456 -4.23909156 -4.60507117
[97] -4.52805133 -2.62030287


Let’s now re-fit the model using the transformed response variable with:

asr011_transfo <- asreml(
  fixed = p_yellow_log ~ cohort + sex,
  random = ~vm(indiv, ainv),
  residual = ~idv(units),
  data=d007
)


Now, the diagnostic residuals plots (below) look better: they are more closely normally distributed and there seems to be no visible outliers.


Exploring output

The statistical significance of fixed effects can be evaluated with:

wald(asr011_transfo, denDF = 'numeric')$Wald
            Df denDF   F.inc           Pr
(Intercept)  1  30.5 204.900 4.440892e-15
cohort       2  47.7  11.840 6.670866e-05
sex          2  87.0   2.534 8.519154e-02

Here we can see that both fixed effect factors are significant.


The variance components estimated by the model are:

summary(asr011_transfo)$varcomp
                component std.error  z.ratio bound %ch
vm(indiv, ainv) 1.3436165 0.5465975 2.458146     P   0
units!units     0.8116911 0.3531848 2.298205     P   0
units!R         1.0000000        NA       NA     F   0


And, the heritability for this analysis can be obtained as:

vpredict(asr011_transfo, h2 ~ V1/(V1+V2))
   Estimate        SE
h2 0.623399 0.1852652

These results indicate that we have a heritability estimate of 0.62, indicating that 62% of the phenotypic variability of the throat color expression is under additive genetic control. The standard error is quite high, but this is expected given that the small dataset.


The additive genetic effects (BLUP/breeding values) are:

BLUP <- summary(asr011_transfo, coef = TRUE)$coef.random
head(BLUP)
                      solution std.error    z.ratio
vm(indiv, ainv)_203  0.5830478 0.6491449  0.8981781
vm(indiv, ainv)_205 -0.9781981 0.6704375 -1.4590443
vm(indiv, ainv)_206  1.3325487 0.6190151  2.1526915
vm(indiv, ainv)_209 -0.1563091 0.6718978 -0.2326382
vm(indiv, ainv)_211  0.4684198 0.6579716  0.7119149
vm(indiv, ainv)_218 -1.0958538 0.7124058 -1.5382438


The predictions (means) for each individual can be calculated as shown below, but note that here the factors cohort and sex are ignored. Hence, in these predictions we only considered \(\hat{\mu} + \hat{BLUP}\).

BLUP$Mean <- mean(d007$p_yellow_log)
BLUP$Preds <- BLUP$Mean + BLUP$solution
head(BLUP)
                      solution std.error    z.ratio      Mean     Preds
vm(indiv, ainv)_203  0.5830478 0.6491449  0.8981781 -3.170818 -2.587770
vm(indiv, ainv)_205 -0.9781981 0.6704375 -1.4590443 -3.170818 -4.149016
vm(indiv, ainv)_206  1.3325487 0.6190151  2.1526915 -3.170818 -1.838269
vm(indiv, ainv)_209 -0.1563091 0.6718978 -0.2326382 -3.170818 -3.327127
vm(indiv, ainv)_211  0.4684198 0.6579716  0.7119149 -3.170818 -2.702398
vm(indiv, ainv)_218 -1.0958538 0.7124058 -1.5382438 -3.170818 -4.266672


In our example, these predictions are based on the log-transformed response, which are difficult to interpret. However, we can use the following equation to do the back-transformation and get the predictions on their original scale:

\(\normalsize y = {{(100 + 1)e^P -1}\over {1 + 100 - P}}\)
where,
    \(y\) is the back transformed prediction, and \(P\) is the prediction of the transformed data.

This back-transformation is implemented using the following code:

BLUP$BT_Preds <- ((100 + 1)*exp(BLUP$Preds) - 1)/(1 + exp(BLUP$Preds))
head(BLUP)
tail(BLUP)
                      solution std.error    z.ratio      Mean     Preds   BT_Preds
vm(indiv, ainv)_203  0.5830478 0.6491449  0.8981781 -3.170818 -2.587770  6.1328250
vm(indiv, ainv)_205 -0.9781981 0.6704375 -1.4590443 -3.170818 -4.149016  0.5845491
vm(indiv, ainv)_206  1.3325487 0.6190151  2.1526915 -3.170818 -1.838269 13.0001208
vm(indiv, ainv)_209 -0.1563091 0.6718978 -0.2326382 -3.170818 -3.327127  2.5345243
vm(indiv, ainv)_211  0.4684198 0.6579716  0.7119149 -3.170818 -2.702398  5.4088621
vm(indiv, ainv)_218 -1.0958538 0.7124058 -1.5382438 -3.170818 -4.266672  0.4111005
                           solution std.error     z.ratio      Mean     Preds  BT_Preds
vm(indiv, ainv)_251.5   -0.01168204 0.7010365 -0.01666396 -3.170818 -3.182500 3.0626201
vm(indiv, ainv)_251.6   -0.83804347 0.6974068 -1.20165655 -3.170818 -4.008862 0.8186964
vm(indiv, ainv)_251.7   -0.80524185 0.6974068 -1.15462286 -3.170818 -3.976060 0.8782250
vm(indiv, ainv)_261.3   -0.81711852 0.6857410 -1.19158477 -3.170818 -3.987937 0.8564535
vm(indiv, ainv)_261.6   -0.08687452 0.7030779 -0.12356315 -3.170818 -3.257693 2.7792473
vm(indiv, ainv)_206.1.1  0.38657605 0.7050686  0.54828145 -3.170818 -2.784242 4.9345300

These results can be used to rank individuals and perform some selections.