# ASR027. Run a quality control check on a genomic relationship matrix - Spring Barley

The complete script for this example can be downloaded here:

### Dataset

The example that we will present here is based on the D022genomic dataset. The first few rows and columns are presented below.

lines | i_11_10006 | i_11_10030 | i_11_10041 | i_11_10043 | i_11_10075 | i_11_10176 | i_11_10186 |
---|---|---|---|---|---|---|---|
201157_A | 0 | 0 | 0 | 2 | 2 | 2 | 2 |
AAPO | 2 | 0 | 2 | 0 | 0 | 0 | 0 |
ABACUS | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
ABAVA | 2 | 0 | 2 | 0 | 2 | 2 | 0 |
ACAPELLA | 2 | 2 | 2 | 2 | 2 | 2 | 0 |
ACROBAT | 2 | 2 | 2 | 2 | 2 | 2 | 2 |

### The `kinship.diagnostics()` function

In the example asr026 we showed how to obtain a \(\mathbf{G}\) matrix from genomic SNP data. In this example, we take a closer look at this matrix using the `kinship.diagnostics()` function.

First, here is the code we used to obtain the \(\mathbf{G}\) matrix:

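```
library(ASRgenomics)  # provides qc.filtering(), G.matrix() and kinship.diagnostics()

# Build a numeric marker matrix (columns 2 to 3490 hold the SNPs) named by line
d022genomic_matrix <- data.matrix(d022genomic[, c(2:3490)])
row.names(d022genomic_matrix) <- d022genomic$lines

# Filter markers and individuals on call rate and minor allele frequency
d022genomic_matrix <- qc.filtering(M = d022genomic_matrix, maf = 0.05, marker.callrate = 0.2,
                                   ind.callrate = 0.20, impute = FALSE, plots = FALSE)
```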

`Initial marker matrix M contains 478 individuals and 3489 markers.`

`A total of 0 markers were removed because their proportion of missing values was equal or larger than 0.2.`

`A total of 0 individuals were removed because their proportion of missing values was equal or larger than 0.2.`

`A total of 0 markers were removed because their MAF was smaller than 0.05.`

`A total of 0 markers were removed because their heterozygosity was larger than 1.`

`A total of 0 markers were removed because their |F| was larger than 1.`

`Final cleaned marker matrix M contains 0% of missing SNPs.`

`Final cleaned marker matrix M contains 478 individuals and 3489 markers.`
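To make the filtering messages above more concrete, here is a minimal base R sketch of call-rate and MAF filtering on a toy 0/1/2 marker matrix. It is not the `qc.filtering()` implementation: the toy data, the object names, and the order of the steps (taken from the order of the messages) are assumptions made for illustration.

```r
# Toy marker matrix coded 0/1/2; NA marks a missing call.
# Individuals and markers are made up for illustration.
M <- matrix(c(0, 2, 2, NA,
              2, 2, 0, NA,
              2, 2, 2, 0,
              0, 2, 0, 0,
              2, 2, 2, 2),
            nrow = 5, byrow = TRUE,
            dimnames = list(paste0("ind", 1:5), paste0("snp", 1:4)))

# Step 1: drop markers whose proportion of missing values is >= 0.2
marker_missing <- colMeans(is.na(M))
M1 <- M[, marker_missing < 0.2, drop = FALSE]

# Step 2: drop individuals whose proportion of missing values is >= 0.2
ind_missing <- rowMeans(is.na(M1))
M2 <- M1[ind_missing < 0.2, , drop = FALSE]

# Step 3: drop markers whose minor allele frequency is < 0.05
p   <- colMeans(M2, na.rm = TRUE) / 2   # frequency of the counted allele
maf <- pmin(p, 1 - p)
M_clean <- M2[, maf >= 0.05, drop = FALSE]

dim(M_clean)
```

In this toy case `snp4` is dropped for missingness and `snp2` because it is monomorphic, while all five individuals are kept.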

`d022_Gmatrix <- G.matrix(M = d022genomic_matrix$M.clean, method = "VanRaden")$G`

And here are the first few rows and columns of that matrix:

`d022_Gmatrix[1:6, 1:6]`

```
201157_A AAPO ABACUS ABAVA ACAPELLA ACROBAT
201157_A 1.8740621 -0.15263241 0.15041992 -0.18105962 0.11010941 0.1319905
AAPO -0.1526324 2.09655608 0.09997384 0.71769243 -0.18287625 -0.4694097
ABACUS 0.1504199 0.09997384 2.09481462 0.04459779 -0.05349424 -0.1244370
ABAVA -0.1810596 0.71769243 0.04459779 2.05167893 -0.03463882 -0.2013997
ACAPELLA 0.1101094 -0.18287625 -0.05349424 -0.03463882 1.91538118 0.0029342
ACROBAT 0.1319905 -0.46940973 -0.12443698 -0.20139966 0.00293420 1.8453593
```
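For readers who want to see what a VanRaden-type matrix looks like numerically, the following base R sketch computes one by hand on toy data. It assumes the commonly cited form (markers centred by twice the allele frequency, then scaled by \(2\sum p(1-p)\)) and is not a claim about the exact internals of `G.matrix()`. The toy matrix and object names are hypothetical.

```r
# Toy, complete marker matrix coded 0/1/2; names are made up for illustration.
M <- matrix(c(0, 2, 2, 0,
              2, 2, 0, 0,
              2, 0, 2, 2,
              0, 0, 0, 2),
            nrow = 4, byrow = TRUE,
            dimnames = list(paste0("ind", 1:4), paste0("snp", 1:4)))

p <- colMeans(M) / 2                          # frequency of the counted allele
Z <- sweep(M, 2, 2 * p)                       # centre each marker by 2p
G <- tcrossprod(Z) / (2 * sum(p * (1 - p)))   # scale the cross-product Z Z'

round(G, 3)
```

Every toy individual is fully homozygous and every allele frequency is 0.5, so each diagonal element of this `G` equals 2.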

We can now run a quality control check on this matrix:

`qc <- kinship.diagnostics(K = d022_Gmatrix)`

`Matrix dimension is: 478x478`

`Range diagonal values: 1.62661 to 3.23991`

`Mean diagonal values: 2`

`Range off-diagonal values: -0.65452 to 2.71143`

`Mean off-diagonal values: -0.00419`

`There are 478 extreme diagonal values, outside < 0.8 and > 1.2`

`There are 5 records of possible duplicates, based on: k(i,j)/sqrt[k(i,i)*k(j,j)] > 0.95`
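The checks reported above can be sketched in base R as follows. This is an illustration rather than the `kinship.diagnostics()` code: the toy matrix and object names are hypothetical, and only the thresholds (0.8, 1.2 and 0.95) and the scaled statistic are taken from the messages.

```r
# Toy relationship matrix; values and names are made up for illustration.
K <- matrix(c(2.00, 1.98, 0.10,
              1.98, 2.00, 0.05,
              0.10, 0.05, 1.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

d   <- diag(K)              # diagonal values
off <- K[upper.tri(K)]      # each off-diagonal pair counted once

range(d); mean(d)
range(off); mean(off)

# Diagonal values outside the 0.8 to 1.2 band
extreme_diag <- d[d < 0.8 | d > 1.2]

# Scaled statistic k(i,j) / sqrt(k(i,i) * k(j,j)) for every pair
corr <- K / sqrt(outer(d, d))
idx  <- which(corr > 0.95 & upper.tri(corr), arr.ind = TRUE)
dups <- data.frame(Indiv.A = rownames(K)[idx[, "row"]],
                   Indiv.B = colnames(K)[idx[, "col"]],
                   Value   = K[idx],
                   Corr    = corr[idx],
                   stringsAsFactors = FALSE)
dups
```

Scaling by the two diagonal elements turns each off-diagonal value into a correlation-like quantity, so a pair of highly inbred lines is not flagged simply because its raw relationship value is large.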

According to the reported messages, the diagonal and off-diagonal values reach suspiciously high levels. Note, however, that this population has a good portion of inbred individuals, so we expect diagonal values in the range of 1 to 2. Nevertheless, we see values as high as 3.24. Additionally, the report indicates that we have a group of individuals that could be duplicates.
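The expected range of 1 to 2 can be motivated as follows, assuming the usual scaling of a VanRaden-type matrix. The symbol \(F_i\), the inbreeding coefficient of individual \(i\), is introduced here for illustration and does not appear in the function output:

\[
\mathrm{E}(G_{ii}) \approx 1 + F_i, \qquad 0 \le F_i \le 1
\]

Under this approximation a non-inbred individual has an expected diagonal value near 1 and a fully inbred line has one near 2.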

It is possible to look at these results in more detail.

- Check the extreme diagonal values:

`head(qc$list.diagonal, 6)`

```
value
STELLA 3.239909
KESTREL 2.719800
BULBUL_89 2.715966
TARM92 2.715878
AKKA 2.710266
ORZA 2.707058
```

In this instance, we could consider removing the line `STELLA` from the data.

- Check for possible duplicates:

`qc$list.duplicate`

```
Indiv.A Indiv.B Value Corr
1 TARM92 BULBUL_89 2.711431 0.9983462
2 TARM92 ORZA 2.690508 0.9922711
3 ORZA BULBUL_89 2.686060 0.9906148
4 PALLAS BONUS 1.986324 0.9866252
5 VALTICKY DIAMANT 1.987351 0.9623336
```

Here, we could also consider removing some of these lines from the data entirely if we cannot verify that they are truly identical.
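As a minimal sketch of how flagged lines might be dropped, the base R snippet below removes named individuals from a toy relationship matrix. The matrix and the `drop_ids` vector are hypothetical. In a real analysis it may be preferable to remove the lines from the marker matrix and recompute \(\mathbf{G}\), since the allele frequencies change once individuals are removed.

```r
# Toy relationship matrix; values and names are made up for illustration.
K <- matrix(c(2.00, 1.98, 0.10,
              1.98, 2.00, 0.05,
              0.10, 0.05, 1.00),
            nrow = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("A", "B", "C")))

drop_ids <- c("B")                        # hypothetical lines to remove
keep     <- !(rownames(K) %in% drop_ids)  # TRUE for the lines that stay

# Subset rows and columns together so the matrix stays square and symmetric
K_clean <- K[keep, keep, drop = FALSE]
K_clean
```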

- Check the diagonal and off-diagonal values distribution plots:

According to the above plots, some values seem excessively high, but the large majority appear to be within a reasonable range. If some tuning, such as bending, blending, or alignment, is applied, it is very likely that some of the issues with this matrix will be greatly reduced or eliminated.
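To give a rough idea of what two of these adjustments do, here is a base R sketch of bending (flooring small or negative eigenvalues) and blending (taking a weighted average with a second matrix). It is a simplified illustration, not the package's tune-up routine. The toy matrix, the eigenvalue floor `eig_floor`, the weight `w`, and the use of an identity matrix as the second matrix are all assumptions.

```r
# Toy symmetric matrix with one negative eigenvalue (not positive definite)
K <- matrix(c(1.0, 0.9, 0.2,
              0.9, 1.0, 0.9,
              0.2, 0.9, 1.0),
            nrow = 3, byrow = TRUE)

# Bending: rebuild the matrix after raising small eigenvalues to a floor
e         <- eigen(K, symmetric = TRUE)
eig_floor <- 1e-4                                   # hypothetical floor
vals      <- pmax(e$values, eig_floor)
K_bent    <- e$vectors %*% diag(vals) %*% t(e$vectors)

# Blending: weighted average of K and a second matrix (identity used here)
w       <- 0.02                                     # hypothetical weight
K_blend <- (1 - w) * K + w * diag(nrow(K))

min(e$values)                                       # negative before bending
min(eigen(K_bent, symmetric = TRUE)$values)         # positive after bending
```

In practice the second matrix in blending is often a pedigree-based relationship matrix rather than the identity; the identity is used here only to keep the sketch self-contained.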