ASR027. Run a quality control check on a genomic relationship matrix - Spring Barley
The complete script for this example can be downloaded here:
Dataset
The example presented here is based on the D022genomic dataset. The first few rows and columns are shown below.

lines | i_11_10006 | i_11_10030 | i_11_10041 | i_11_10043 | i_11_10075 | i_11_10176 | i_11_10186 |
---|---|---|---|---|---|---|---|
201157_A | 0 | 0 | 0 | 2 | 2 | 2 | 2 |
AAPO | 2 | 0 | 2 | 0 | 0 | 0 | 0 |
ABACUS | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
ABAVA | 2 | 0 | 2 | 0 | 2 | 2 | 0 |
ACAPELLA | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
ACROBAT | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
The kinship.diagnostics() function
In example ASR026 we showed how to obtain a \(\mathbf{G}\) matrix from genomic SNP data. In this example, we take a closer look at that matrix using the kinship.diagnostics() function.
First, here is the code we used to obtain the \(\mathbf{G}\) matrix:
d022genomic_matrix <- data.matrix(d022genomic[, c(2:3490)])
row.names(d022genomic_matrix) <- d022genomic$lines
d022genomic_matrix <- qc.filtering(M = d022genomic_matrix, maf = 0.05, marker.callrate = 0.2,
                                   ind.callrate = 0.20, impute = FALSE, plots = FALSE)
Initial marker matrix M contains 478 individuals and 3489 markers.
A total of 0 markers were removed because their proportion of missing values was equal or larger than 0.2.
A total of 0 individuals were removed because their proportion of missing values was equal or larger than 0.2.
A total of 0 markers were removed because their MAF was smaller than 0.05.
A total of 0 markers were removed because their heterozygosity was larger than 1.
A total of 0 markers were removed because their |F| was larger than 1.
Final cleaned marker matrix M contains 0% of missing SNPs.
Final cleaned marker matrix M contains 478 individuals and 3489 markers.
d022_Gmatrix <- G.matrix(M = d022genomic_matrix$M.clean, method = "VanRaden")$G
And here are the first few rows and columns of that matrix:
d022_Gmatrix[1:6, 1:6]
201157_A AAPO ABACUS ABAVA ACAPELLA ACROBAT
201157_A 1.8740621 -0.15263241 0.15041992 -0.18105962 0.11010941 0.1319905
AAPO -0.1526324 2.09655608 0.09997384 0.71769243 -0.18287625 -0.4694097
ABACUS 0.1504199 0.09997384 2.09481462 0.04459779 -0.05349424 -0.1244370
ABAVA -0.1810596 0.71769243 0.04459779 2.05167893 -0.03463882 -0.2013997
ACAPELLA 0.1101094 -0.18287625 -0.05349424 -0.03463882 1.91538118 0.0029342
ACROBAT 0.1319905 -0.46940973 -0.12443698 -0.20139966 0.00293420 1.8453593
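As a rough illustration of where values like these come from, the sketch below computes a VanRaden-type matrix in base R on a small made-up marker matrix. It uses the textbook formula \(\mathbf{G} = \mathbf{Z}\mathbf{Z}^\top / (2 \sum_j p_j (1 - p_j))\), where \(\mathbf{Z}\) is the marker matrix centered by twice the allele frequencies. Whether G.matrix() applies exactly these steps is an assumption, and the names M, vanraden_G and G_toy are hypothetical.

```r
# Toy marker matrix: 4 individuals x 5 SNPs, coded 0/1/2 (copies of one allele).
# These values are made up for illustration only.
M <- matrix(c(0, 2, 2, 0, 2,
              2, 2, 0, 0, 2,
              0, 0, 2, 2, 0,
              2, 0, 0, 2, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("line", 1:4), paste0("snp", 1:5)))

vanraden_G <- function(M) {
  p <- colMeans(M) / 2                    # allele frequency per SNP
  Z <- sweep(M, 2, 2 * p)                 # center each SNP column by 2p
  tcrossprod(Z) / (2 * sum(p * (1 - p)))  # ZZ' scaled by 2 * sum(p(1 - p))
}

G_toy <- vanraden_G(M)
round(G_toy, 3)
mean(diag(G_toy))
```

In this toy matrix every genotype is homozygous (0 or 2), as in the rows of the dataset shown earlier, and under this formula the mean of the diagonal is then 2.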
We can now run a quality control check on this matrix:
qc <- kinship.diagnostics(K = d022_Gmatrix)
Matrix dimension is: 478x478
Range diagonal values: 1.62661 to 3.23991
Mean diagonal values: 2
Range off-diagonal values: -0.65452 to 2.71143
Mean off-diagonal values: -0.00419
There are 478 extreme diagonal values, outside < 0.8 and > 1.2
There are 5 records of possible duplicates, based on: k(i,j)/sqrt[k(i,i)*k(j,j)] > 0.95
According to these messages, both the diagonal and the off-diagonal values reach suspiciously high levels. Note that this population contains a large proportion of inbred individuals, so we expect diagonal values between 1 and 2; this also explains why all 478 diagonal values fall outside the 0.8 to 1.2 interval used in the check. Nevertheless, some values reach as high as 3.24! In addition, the report flags a group of individuals that could be duplicates.
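To make the reported quantities concrete, here is a minimal base R sketch that reproduces the same kind of summaries on a made-up 4 x 4 matrix. It is not the code of kinship.diagnostics(); the thresholds (0.8, 1.2 and 0.95) are taken from the messages above, and the objects K, corr, extreme and dups are hypothetical names.

```r
# Toy 4 x 4 relationship matrix (made-up values, for illustration only).
ids <- c("A", "B", "C", "D")
K <- matrix(c( 2.00,  1.95, -0.10,  0.05,
               1.95,  2.00, -0.12,  0.02,
              -0.10, -0.12,  1.90, -0.30,
               0.05,  0.02, -0.30,  3.20),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# Summaries of the diagonal and of the off-diagonal (upper triangle) values.
diag_vals <- diag(K)
off_vals  <- K[upper.tri(K)]
range(diag_vals); mean(diag_vals)
range(off_vals);  mean(off_vals)

# Diagonal values outside the interval [0.8, 1.2].
extreme <- diag_vals[diag_vals < 0.8 | diag_vals > 1.2]

# Scale K to a correlation-like matrix: k(i,j) / sqrt(k(i,i) * k(j,j)).
corr <- K / sqrt(outer(diag_vals, diag_vals))

# Pairs (i < j) whose scaled value exceeds 0.95.
idx  <- which(corr > 0.95 & upper.tri(corr), arr.ind = TRUE)
dups <- data.frame(Indiv.A = ids[idx[, "row"]],
                   Indiv.B = ids[idx[, "col"]],
                   Value   = K[idx],
                   Corr    = corr[idx])
dups
```

In this toy matrix only the pair A and B is flagged, and every diagonal value counts as extreme because all of them lie above 1.2, which mirrors what happens with an inbred population.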
It is possible to look at these results in more detail.
- Check the extreme diagonal values:
head(qc$list.diagonal, 6)
value
STELLA 3.239909
KESTREL 2.719800
BULBUL_89 2.715966
TARM92 2.715878
AKKA 2.710266
ORZA 2.707058
In this instance, we could consider removing the line STELLA from the data.
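One simple way to drop a line from a relationship matrix is to subset its rows and columns by name. The sketch below does this on a made-up 3 x 3 matrix; the object names K, drop_ids and K_reduced are hypothetical, and this is only one possible approach.

```r
# Toy relationship matrix with made-up values; "STELLA" plays the flagged line.
ids <- c("STELLA", "line2", "line3")
K <- matrix(c(3.2, 0.1, 0.0,
              0.1, 2.0, 0.3,
              0.0, 0.3, 1.9),
            nrow = 3, byrow = TRUE, dimnames = list(ids, ids))

drop_ids <- "STELLA"
keep     <- setdiff(rownames(K), drop_ids)

# drop = FALSE keeps the result a matrix even if only one line remains.
K_reduced <- K[keep, keep, drop = FALSE]
dim(K_reduced)
```

Alternatively, the line could be removed from the marker matrix before the relationship matrix is computed, so that the allele frequencies are also recalculated without it.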
- Check for possible duplicates:
qc$list.duplicate
Indiv.A Indiv.B Value Corr
1 TARM92 BULBUL_89 2.711431 0.9983462
2 TARM92 ORZA 2.690508 0.9922711
3 ORZA BULBUL_89 2.686060 0.9906148
4 PALLAS BONUS 1.986324 0.9866252
5 VALTICKY DIAMANT 1.987351 0.9623336
Here, we could also consider removing some of these lines from the data entirely if we cannot verify that they are truly identical.
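If the decision is to keep one representative from each group of suspected duplicates, the list of lines to drop can be derived from the pairs. The sketch below uses the names from the duplicates table above and a simple greedy rule; the rule itself is an assumption made for illustration, not something the package prescribes, and to_drop is a hypothetical name.

```r
# Pairs copied from the duplicates table above (names only).
dups <- data.frame(
  Indiv.A = c("TARM92", "TARM92", "ORZA", "PALLAS", "VALTICKY"),
  Indiv.B = c("BULBUL_89", "ORZA", "BULBUL_89", "BONUS", "DIAMANT"),
  stringsAsFactors = FALSE
)

# Greedy rule (an assumption, not a package feature): walk through the pairs
# and drop Indiv.B unless one member of the pair has already been dropped.
to_drop <- character(0)
for (i in seq_len(nrow(dups))) {
  pair <- c(dups$Indiv.A[i], dups$Indiv.B[i])
  if (!any(pair %in% to_drop)) to_drop <- c(to_drop, pair[2])
}
to_drop
```

With this rule, at least one member of every flagged pair ends up in to_drop. Which member to keep is a judgment call that would normally depend on data quality or on the records available for each line.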
- Check the diagonal and off-diagonal values distribution plots:
According to the above plots, some values seem excessively high, but the large majority appear to be within a reasonable range. If some tune-up, such as bending, blending or alignment, is applied, it is very likely that some of the issues with this matrix will be greatly reduced or eliminated.
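For readers unfamiliar with these terms, the sketch below shows one simple form of blending and of bending in base R on a made-up matrix. It is only meant to convey the idea; a package tune-up routine may use different rules, and the names p_blend, bend, G_blend and G_bend as well as the tolerance are hypothetical choices.

```r
# Toy symmetric matrix that is not positive definite (made-up values).
G <- matrix(c(1.0, 0.9, 0.2,
              0.9, 1.0, 0.9,
              0.2, 0.9, 1.0), nrow = 3, byrow = TRUE)
min(eigen(G, symmetric = TRUE)$values)   # negative: G is not positive definite

# Blending: shrink G toward a reference matrix (the identity here; a pedigree
# matrix A is another common choice). p_blend is a hypothetical weight.
p_blend <- 0.02
G_blend <- (1 - p_blend) * G + p_blend * diag(nrow(G))

# Bending: raise eigenvalues below a small tolerance and rebuild the matrix.
bend <- function(G, tol = 1e-6) {
  e <- eigen(G, symmetric = TRUE)
  e$vectors %*% diag(pmax(e$values, tol)) %*% t(e$vectors)
}
G_bend <- bend(G)
min(eigen(G_bend, symmetric = TRUE)$values)
```

Blending pulls all values slightly toward the reference matrix, while bending changes only the part of the matrix associated with the offending eigenvalues.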