ASR027. Run a quality control check on a genomic relationship matrix - Spring Barley

The complete script for this example can be downloaded here:

Dataset

This example is based on the D022genomic dataset. The first few rows and columns are shown below.

lines i_11_10006 i_11_10030 i_11_10041 i_11_10043 i_11_10075 i_11_10176 i_11_10186
201157_A 0 0 0 2 2 2 2
AAPO 2 0 2 0 0 0 0
ABACUS 2 2 2 2 2 2 0
ABAVA 2 0 2 0 2 2 0
ACAPELLA 2 2 2 2 2 2 0
ACROBAT 2 2 2 2 2 2 2


The kinship.diagnostics() function

In example ASR026 we showed how to obtain a \(\mathbf{G}\) matrix from genomic SNP data. In this example, we take a closer look at that matrix using the kinship.diagnostics() function.


First, here is the code we used to obtain the \(\mathbf{G}\) matrix:

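# Columns 2 to 3490 hold the SNP markers; column 1 holds the line names
d022genomic_matrix <- data.matrix(d022genomic[, 2:3490])
row.names(d022genomic_matrix) <- d022genomic$lines
# Filter markers and individuals on call rate and minor allele frequency
d022genomic_matrix <- qc.filtering(M = d022genomic_matrix, maf = 0.05, marker.callrate = 0.2,
                                   ind.callrate = 0.20, impute = FALSE, plots = FALSE)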
Initial marker matrix M contains 478 individuals and 3489 markers.
A total of 0 markers were removed because their proportion of missing values was equal or larger than 0.2.
A total of 0 individuals were removed because their proportion of missing values was equal or larger than 0.2.
A total of 0 markers were removed because their MAF was smaller than 0.05.
A total of 0 markers were removed because their heterozygosity was larger than 1.
A total of 0 markers were removed because their |F| was larger than 1.
Final cleaned marker matrix M contains 0% of missing SNPs.
Final cleaned marker matrix M contains 478 individuals and 3489 markers.
d022_Gmatrix <- G.matrix(M = d022genomic_matrix$M.clean, method = "VanRaden")$G


And here are the first few rows and columns of that matrix:

d022_Gmatrix[1:6,1:6]
           201157_A        AAPO      ABACUS       ABAVA    ACAPELLA    ACROBAT
201157_A  1.8740621 -0.15263241  0.15041992 -0.18105962  0.11010941  0.1319905
AAPO     -0.1526324  2.09655608  0.09997384  0.71769243 -0.18287625 -0.4694097
ABACUS    0.1504199  0.09997384  2.09481462  0.04459779 -0.05349424 -0.1244370
ABAVA    -0.1810596  0.71769243  0.04459779  2.05167893 -0.03463882 -0.2013997
ACAPELLA  0.1101094 -0.18287625 -0.05349424 -0.03463882  1.91538118  0.0029342
ACROBAT   0.1319905 -0.46940973 -0.12443698 -0.20139966  0.00293420  1.8453593


With the matrix in hand, we can run a quality control check on it as follows:

qc <- kinship.diagnostics(K = d022_Gmatrix)
Matrix dimension is: 478x478
Range diagonal values: 1.62661 to 3.23991
Mean diagonal values: 2
Range off-diagonal values: -0.65452 to 2.71143
Mean off-diagonal values: -0.00419
There are 478 extreme diagonal values, outside < 0.8 and > 1.2
There are 5 records of possible duplicates, based on: k(i,j)/sqrt[k(i,i)*k(j,j)] >  0.95

According to the reported messages, both the diagonal and the off-diagonal values reach suspiciously high maxima. Note that this population contains a good proportion of inbred individuals, so we expect diagonal values between 1 and 2. This also explains why all 478 diagonal values fall outside the reported 0.8 to 1.2 window. Nevertheless, we see diagonal values as high as 3.24. Additionally, the report flags five pairs of individuals as possible duplicates.
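
To see why diagonal values near 2 are expected for inbred lines, here is a minimal base R sketch. It assumes the standard VanRaden (2008) method 1 formula, which may differ in detail from what G.matrix() does internally, and it uses a made-up marker matrix in which every genotype is homozygous (coded 0 or 2).

```r
# Toy marker matrix: 4 fully inbred lines, 6 SNPs, genotypes coded 0 or 2.
# All values are invented for illustration.
M <- matrix(c(2, 2, 2, 0, 2, 0,
              0, 0, 2, 2, 0, 0,
              0, 2, 0, 0, 2, 2,
              0, 0, 0, 2, 0, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("line", 1:4), paste0("snp", 1:6)))

p <- colMeans(M) / 2                          # allele frequency per marker
Z <- sweep(M, 2, 2 * p)                       # centre each marker by 2p
G <- tcrossprod(Z) / (2 * sum(p * (1 - p)))   # assumed VanRaden method 1 scaling

diag(G)        # individual self-relationships vary around the mean
mean(diag(G))  # the mean is 2 when all genotypes are homozygous
```

Individual diagonal values can still sit above or below 2, for instance when a line carries rare alleles, which is why the full range is worth inspecting and not only the mean.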


It is possible to look at these results in more detail.

  • Check the extreme diagonal values:
head(qc$list.diagonal, 6)
             value
STELLA    3.239909
KESTREL   2.719800
BULBUL_89 2.715966
TARM92    2.715878
AKKA      2.710266
ORZA      2.707058

In this instance, we could consider removing the line STELLA from the data, since its diagonal value (3.24) is well separated from the next largest value (2.72).
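
If a line is to be dropped, the same rows and columns have to be removed from the matrix. The following is a minimal base R sketch on a toy matrix; the line names, values and the cut-off `upper` are hypothetical and are not package defaults.

```r
# Toy kinship matrix with hypothetical line names and made-up values.
K <- matrix(c(2.0, 0.1, 0.0,
              0.1, 3.2, 0.2,
              0.0, 0.2, 1.9),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

upper <- 2.8                                # illustrative cut-off only
extreme <- rownames(K)[diag(K) > upper]     # lines with an unusually large diagonal
keep <- setdiff(rownames(K), extreme)
K_clean <- K[keep, keep, drop = FALSE]      # drop matching rows and columns together

extreme
dim(K_clean)
```

In practice the same lines would also need to be removed from the marker and phenotypic data so that all objects stay aligned.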


  • Check for possible duplicates:
qc$list.duplicate
   Indiv.A   Indiv.B    Value      Corr
1   TARM92 BULBUL_89 2.711431 0.9983462
2   TARM92      ORZA 2.690508 0.9922711
3     ORZA BULBUL_89 2.686060 0.9906148
4   PALLAS     BONUS 1.986324 0.9866252
5 VALTICKY   DIAMANT 1.987351 0.9623336

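Here, we can also consider eliminating some of these lines from the data if we cannot verify that they are truly identical. Note that the first three records involve the same three lines (TARM92, BULBUL_89 and ORZA), which suggests a single group of near-identical genotypes.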
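
The duplicate criterion quoted in the report, k(i,j)/sqrt[k(i,i)*k(j,j)] > 0.95, can be reproduced by hand. The sketch below uses base R only, with `cov2cor()` applying exactly that scaling; the matrix and line names are made up, and this is not necessarily how kinship.diagnostics() computes it internally.

```r
# Toy kinship matrix with hypothetical line names and made-up values.
K <- matrix(c(2.00, 1.98, 0.10,
              1.98, 2.00, 0.05,
              0.10, 0.05, 1.90),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

corr <- cov2cor(K)   # k(i,j) / sqrt(k(i,i) * k(j,j))

# Look only at the upper triangle so each pair is reported once
idx <- which(corr > 0.95 & upper.tri(corr), arr.ind = TRUE)
dups <- data.frame(Indiv.A = rownames(K)[idx[, "row"]],
                   Indiv.B = colnames(K)[idx[, "col"]],
                   Value   = K[idx],
                   Corr    = corr[idx])
dups
```

Scaling by the diagonals matters here: a raw off-diagonal value of about 2 indicates a duplicate only when the two self-relationships are also about 2.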


  • Check the diagonal and off-diagonal values distribution plots:

According to these plots, some values seem excessively high, but the large majority appear to be within a reasonable range. If the matrix is tuned, for example by bending, blending or alignment, it is very likely that some of its issues will be greatly reduced or eliminated.
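
To give a rough idea of what bending and blending do, here is a minimal base R sketch. It is an illustration under simple assumptions (eigenvalues are floored at a small tolerance for bending, and blending uses an identity matrix); the tolerance `tol`, the proportion `pblend` and the toy matrix are hypothetical, and the package's own tuning routine may work differently.

```r
# Toy symmetric matrix that is not positive definite (made-up values).
K <- matrix(c(1.0, 0.9, 0.2,
              0.9, 1.0, 0.9,
              0.2, 0.9, 1.0),
            nrow = 3, byrow = TRUE)

e <- eigen(K, symmetric = TRUE)
min(e$values)                       # negative, so K cannot be inverted safely

# Bending (sketch): raise eigenvalues below a small tolerance, then rebuild
tol <- 1e-4
vals <- pmax(e$values, tol)
K_bent <- e$vectors %*% diag(vals) %*% t(e$vectors)

# Blending (sketch): shrink towards the identity matrix by a small proportion
pblend <- 0.02
K_tuned <- (1 - pblend) * K_bent + pblend * diag(nrow(K_bent))

min(eigen(K_tuned, symmetric = TRUE)$values)   # now strictly positive
```

Either step changes the matrix, so it is worth re-running the diagnostics afterwards to confirm that the adjustment was small.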