ASR017. Exploring phenotypic data at each site individually before fitting an MET model - Pine clones

The complete script for this example can be downloaded here:

Dataset

In this example, we will use the D021 and the D021ped datasets, which correspond to phenotypic and pedigree data of pine clones from a multi-environmental trial. A few lines of each data frame are shown below:

d021

clone location block height_8 dia_8
18 Hill 2 35.9 7.00
18 Hill 4 38.3 6.60
18 Hill 6 27.7 5.85
18 Hill 8 37.0 6.40
18 Orchard 2 35.7 5.40
18 Orchard 3 37.1 6.05

d021ped

clone dam sire
1 0 0
2 0 0
3 0 0
4 0 0
5 0 0
6 0 0


As these are multi-sites data, we need to select one site at the time to the perform the analyses.

levels(d021$location)
[1] "Hill"    "Orchard" "Rail"   
d021_r <- d021[d021$location == 'Rail', ]

Here we will work on the location Rail, but the same can be performed for the two other locations.


We can now take a look at the summary:

summary(d021_r)
     clone        location   block      height_8         dia_8      
 21     :  8   Hill   :  0   1:101   Min.   :24.20   Min.   :3.250  
 62     :  7   Orchard:  0   2:100   1st Qu.:34.30   1st Qu.:5.050  
 146    :  7   Rail   :393   3: 98   Median :37.40   Median :5.850  
 54     :  6                 4: 94   Mean   :37.16   Mean   :5.739  
 71     :  6                         3rd Qu.:40.00   3rd Qu.:6.400  
 81     :  6                         Max.   :55.70   Max.   :8.650  
 (Other):353                                         NA's   :1      

In this case everything looks fine, but we notice that there is a missing value in the column (dia_8).


We can also inspect the distribution of the clones across the blocks:

head(table(d021_r$block, d021_r$clone))
   
    19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
  1  0  1  2  1  1  1  1  1  1  1  1  1  1  1  0  1  1  1  1  1  1  1  1  1  1
  2  0  1  2  1  1  0  1  1  1  1  1  1  1  1  0  1  1  1  1  1  1  1  1  1  0
  3  0  1  2  1  1  0  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  1  0
  4  1  0  2  1  1  0  1  1  1  1  1  1  1  0  0  1  0  1  1  1  1  1  1  1  1
   
    44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 64 65 66 67 69 70 71
  1  0  1  1  0  1  1  1  1  1  1  1  1  1  0  1  1  1  1  1  1  1  0  1  0  1
  2  1  1  1  1  1  1  0  1  1  1  2  1  0  0  0  1  1  2  1  0  1  1  1  1  2
  3  1  0  0  0  1  1  1  1  0  1  2  1  0  1  0  1  1  2  1  0  0  1  1  1  1
  4  1  0  1  0  1  1  1  1  1  1  1  1  0  0  0  1  1  2  1  1  0  1  1  1  2
   
    72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 91 93 94 95 96 97 98
  1  1  1  1  1  1  1  1  1  1  1  1  0  0  1  1  1  0  1  1  1  0  1  0  1  1
  2  1  0  1  1  1  1  1  0  1  2  1  1  1  0  0  1  0  1  1  1  1  0  0  2  1
  3  0  1  1  0  1  1  1  1  1  2  1  1  0  1  0  1  1  0  0  1  1  0  1  1  1
  4  0  1  1  1  1  1  0  1  1  1  2  1  1  1  1  1  0  1  1  0  1  1  0  1  1
   
    99 101 102 104 105 106 107 108 109 111 112 113 115 116 117 118 119 120 121
  1  1   0   1   1   0   1   1   1   1   1   1   0   0   1   1   1   1   1   0
  2  1   1   1   0   0   1   1   0   1   1   1   1   1   1   1   0   0   1   0
  3  1   0   1   0   1   1   1   1   1   1   1   1   1   1   1   0   0   1   0
  4  0   0   1   0   0   1   1   0   1   1   1   0   0   1   0   1   0   1   1
   
    122 123 124 125 126 127 128 129 130 131 132 133 135 136 137 138 139 140 141
  1   1   1   1   1   1   1   1   1   1   1   0   0   1   1   0   0   1   1   1
  2   0   1   0   1   1   1   1   1   1   1   1   1   1   0   1   0   0   0   1
  3   0   1   0   1   1   1   1   1   1   1   1   0   0   0   1   1   1   1   1
  4   1   1   0   1   1   1   1   1   1   1   0   0   1   1   1   0   1   0   0
   
    142 143 144 145 146 147 148 149 150 152 153
  1   1   1   1   0   2   0   1   0   1   0   1
  2   1   0   1   1   1   0   1   1   1   1   1
  3   1   0   1   0   2   1   1   1   1   0   1
  4   1   1   1   0   2   0   1   1   1   0   0

According to this table, we can see that we have incomplete blocks, as the blocks only contain a portion of the clones.


Finally, we can examine the distribution of the response variable (here height_8).

hist(d021_r$height_8)

This histogram indicates that height data seem to be normally distributed, and this implies that it is likely that the residuals will have also an approximate Normal distribution.


Model

In this example we will also investigate the residuals plots and the presence of outliers in our data. For that, we will use a single-site model. The specification of this model is:

\[ y = \mu + block + clone + e\ \]

where,

    \(y\) is the height at age 8 of a given tree,

    \(\mu\) is the overall mean,

    \(block\) is the random effect of block, with \(block \sim \mathcal{N}(0,\,\sigma^{2}_{i})\),

    \(clone\) is the random effect of clone, with \(clone \sim \mathcal{N}(0,\,\sigma^{2}_{c})\),

    \(e\) is the random residual effect, with \(e \sim \mathcal{N}(0,\,\sigma^{2}_{e})\).

The factor \(clone\) is associated with the numerator relationship matrix (NRM) \(\mathbf{A}\).


Now, let’s take a look at how to write the model with ASReml-R. Note that before fitting the model gen, and clone need to be set as factors and the inverse of the NRM needs to be created using the pedigree information (see ASR013 for more details).

ainv <- ainverse(d021ped)
asr017 <- asreml(
  fixed = height_8 ~ 1,
  random = ~block + vm(clone, ainv),
  data = d021_r
)

Note that the above model is including the inverse of the NRM by the use of the expression vm() that indicates variance model and that links a factor to the inverse matrix.


Exploring output

We can inspect the residual diagnostic plots and check for outlier using the plot() function:

plot(asr017)

The plots are looking good, the residuals seem to follow an approximate Normal distribution and homoscedastic. However, we can observe a few potential outliers.


If we decide to remove the height data of the highest and the lowest residual values, we need to identify them. One way to do this is:

res <- residuals(asr017)
which.max(res)
[1] 187
which.min(res)
[1] 311

And then make the corresponding records missing.

d021_r$height_8[d021_r$height_8 == 187] <- NA
d021_r$height_8[d021_r$height_8 == 311] <- NA

Note that as this dataset has more than one response variable (i.e., height_8 and dia_8), and we were only looking at the column height_8, so in this case we chose to replace these specific height values by NA rather than removing the whole row.


More information

The multi-environmental model associated with these data is available in ASR018.