levels(d021$location)
[1] "Hill" "Orchard" "Rail"
<- d021[d021$location == 'Rail', ] d021_r
The complete script for this example can be downloaded here:
In this example, we will use the D021 and the D021ped datasets, which correspond to phenotypic and pedigree data of pine clones from a multi-environmental trial. A few lines of each data frame are shown below:
clone | location | block | height_8 | dia_8 |
---|---|---|---|---|
18 | Hill | 2 | 35.9 | 7.00 |
18 | Hill | 4 | 38.3 | 6.60 |
18 | Hill | 6 | 27.7 | 5.85 |
18 | Hill | 8 | 37.0 | 6.40 |
18 | Orchard | 2 | 35.7 | 5.40 |
18 | Orchard | 3 | 37.1 | 6.05 |
clone | dam | sire |
---|---|---|
1 | 0 | 0 |
2 | 0 | 0 |
3 | 0 | 0 |
4 | 0 | 0 |
5 | 0 | 0 |
6 | 0 | 0 |
As these are multi-sites data, we need to select one site at the time to the perform the analyses.
levels(d021$location)
[1] "Hill" "Orchard" "Rail"
<- d021[d021$location == 'Rail', ] d021_r
Here we will work on the location Rail
, but the same can be performed for the two other locations.
We can now take a look at the summary:
summary(d021_r)
clone location block height_8 dia_8
21 : 8 Hill : 0 1:101 Min. :24.20 Min. :3.250
62 : 7 Orchard: 0 2:100 1st Qu.:34.30 1st Qu.:5.050
146 : 7 Rail :393 3: 98 Median :37.40 Median :5.850
54 : 6 4: 94 Mean :37.16 Mean :5.739
71 : 6 3rd Qu.:40.00 3rd Qu.:6.400
81 : 6 Max. :55.70 Max. :8.650
(Other):353 NA's :1
In this case everything looks fine, but we notice that there is a missing value in the column (dia_8
).
We can also inspect the distribution of the clones across the blocks:
head(table(d021_r$block, d021_r$clone))
19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43
1 0 1 2 1 1 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 1
2 0 1 2 1 1 0 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 1 1 0
3 0 1 2 1 1 0 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
4 1 0 2 1 1 0 1 1 1 1 1 1 1 0 0 1 0 1 1 1 1 1 1 1 1
44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 62 64 65 66 67 69 70 71
1 0 1 1 0 1 1 1 1 1 1 1 1 1 0 1 1 1 1 1 1 1 0 1 0 1
2 1 1 1 1 1 1 0 1 1 1 2 1 0 0 0 1 1 2 1 0 1 1 1 1 2
3 1 0 0 0 1 1 1 1 0 1 2 1 0 1 0 1 1 2 1 0 0 1 1 1 1
4 1 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 2 1 1 0 1 1 1 2
72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 91 93 94 95 96 97 98
1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 1 0 1 1 1 0 1 0 1 1
2 1 0 1 1 1 1 1 0 1 2 1 1 1 0 0 1 0 1 1 1 1 0 0 2 1
3 0 1 1 0 1 1 1 1 1 2 1 1 0 1 0 1 1 0 0 1 1 0 1 1 1
4 0 1 1 1 1 1 0 1 1 1 2 1 1 1 1 1 0 1 1 0 1 1 0 1 1
99 101 102 104 105 106 107 108 109 111 112 113 115 116 117 118 119 120 121
1 1 0 1 1 0 1 1 1 1 1 1 0 0 1 1 1 1 1 0
2 1 1 1 0 0 1 1 0 1 1 1 1 1 1 1 0 0 1 0
3 1 0 1 0 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0
4 0 0 1 0 0 1 1 0 1 1 1 0 0 1 0 1 0 1 1
122 123 124 125 126 127 128 129 130 131 132 133 135 136 137 138 139 140 141
1 1 1 1 1 1 1 1 1 1 1 0 0 1 1 0 0 1 1 1
2 0 1 0 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 1
3 0 1 0 1 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1
4 1 1 0 1 1 1 1 1 1 1 0 0 1 1 1 0 1 0 0
142 143 144 145 146 147 148 149 150 152 153
1 1 1 1 0 2 0 1 0 1 0 1
2 1 0 1 1 1 0 1 1 1 1 1
3 1 0 1 0 2 1 1 1 1 0 1
4 1 1 1 0 2 0 1 1 1 0 0
According to this table, we can see that we have incomplete blocks, as the blocks only contain a portion of the clones.
Finally, we can examine the distribution of the response variable (here height_8
).
hist(d021_r$height_8)
This histogram indicates that height data seem to be normally distributed, and this implies that it is likely that the residuals will have also an approximate Normal distribution.
In this example we will also investigate the residuals plots and the presence of outliers in our data. For that, we will use a single-site model. The specification of this model is:
\[ y = \mu + block + clone + e\ \]
where,\(y\) is the height at age 8 of a given tree,
\(\mu\) is the overall mean,
\(block\) is the random effect of block, with \(block \sim \mathcal{N}(0,\,\sigma^{2}_{i})\),
\(clone\) is the random effect of clone, with \(clone \sim \mathcal{N}(0,\,\sigma^{2}_{c})\),
\(e\) is the random residual effect, with \(e \sim \mathcal{N}(0,\,\sigma^{2}_{e})\).The factor \(clone\) is associated with the numerator relationship matrix (NRM) \(\mathbf{A}\).
Now, let’s take a look at how to write the model with ASReml-R. Note that before fitting the model gen
, and clone
need to be set as factors and the inverse of the NRM needs to be created using the pedigree information (see ASR013 for more details).
<- ainverse(d021ped) ainv
<- asreml(
asr017 fixed = height_8 ~ 1,
random = ~block + vm(clone, ainv),
data = d021_r
)
Note that the above model is including the inverse of the NRM by the use of the expression vm()
that indicates variance model and that links a factor to the inverse matrix.
We can inspect the residual diagnostic plots and check for outlier using the plot()
function:
plot(asr017)
The plots are looking good, the residuals seem to follow an approximate Normal distribution and homoscedastic. However, we can observe a few potential outliers.
If we decide to remove the height data of the highest and the lowest residual values, we need to identify them. One way to do this is:
<- residuals(asr017)
res which.max(res)
[1] 187
which.min(res)
[1] 311
And then make the corresponding records missing.
$height_8[d021_r$height_8 == 187] <- NA
d021_r$height_8[d021_r$height_8 == 311] <- NA d021_r
Note that as this dataset has more than one response variable (i.e., height_8
and dia_8
), and we were only looking at the column height_8
, so in this case we chose to replace these specific height values by NA rather than removing the whole row.
The multi-environmental model associated with these data is available in ASR018.