Can NGS detect large deletions?

Question

Subhash Kerre · Answer

Overall, large rearrangements (LRs) accounted for 7.2% of all pathogenic variants detected across all 28 panel genes (Fig. 2). However, the prevalence of LRs varied by gene. LRs were most prevalent in STK11 (60.7% of all pathogenic variants identified in the gene), MSH2 (27.9%), PMS2 (25.6%), BMPR1A (26.9%), RAD51C (21.1%), and CDKN2A p14ARF (16.7%) (Fig. 2). Of the 2336 individuals found to carry pathogenic LRs, 164 (7.0%) had at least one other germline pathogenic variant (PV). Among these, 157 individuals had sequence variants, and seven carried an additional LR in a separate, non-contiguous gene.

Nearly all (95.5%) of the deletions included here were pathogenic (Table 1). Only 4.5% of deletions were classified as VUS (Table 1), including those found in GREM1 and CDK4, where no pathogenic deletions were observed. Deletions represented 86.1% of pathogenic LRs detected and were seen in almost all tested genes (Table 2).

A total of 41 unique pathogenic and VUS partial deletions were detected here, present in 79 individuals. Figure 3 illustrates one partial deletion in BRCA2. NGS initially showed that amplicons covering a sub-region of exon 11 (spanning ~1000 bp) were present in only one copy (Fig. 3a). This partial deletion was also investigated by targeted microarray-CGH, as directed by the process flow detailed in Fig. 1, and a decreased relative amplitude of probe clusters in this region confirmed the NGS finding (Fig. 3b). MLPA did not detect this LR because the MLPA probe binding sites for this exon did not cover the region deleted in this individual (data not shown). Targeted PCR was employed to characterize the precise breakpoints and evaluate the pathogenicity of the partial deletion. The analysis revealed an in-frame deletion of 711 bp within a non-critical region. As a result, this LR was classified as VUS.

Duplications were detected in each of the 26 genes evaluated for LRs; however, pathogenic duplications were identified in only 16 genes (Table 2). Overall, duplications accounted for 11.3% of all detected pathogenic LRs, and 79.1% of all duplications were classified as VUS (Table 1). Five unique partial duplications (pathogenic variants and VUS) were detected in seven individuals. One partial duplication in BRCA2 is depicted in Fig. 4. NGS showed that exons 5–10 were elevated to allele counts of approximately 3, which is consistent with duplication (Fig. 4a). Confirmatory microarray-CGH refined this duplication further and demonstrated that a portion of exon 11 was included in the duplication as well (Fig. 4b). Follow-up targeted PCR established that the duplicated segment occurred in tandem with the native gene, in a head-to-tail orientation (Fig. 4c). Such a configuration is predicted to result in abnormal protein production and/or function, and the variant was classified as DM.

Amaresh Janaki · Answer

To detect CNVs in a target region of a query sample, our pipeline (Fig. 2) compares the coverage depth in that region of the query sample with the average depth in the same region across normal samples whose coverage depth is similar to that of the query sample. The normal samples are provided to the analysis, and the pipeline creates pools of normal samples, where each pool contains normal samples with similar coverage depth. These pools are called static pools and can be reused for CNV detection in any query sample whose coverage depth is similar to the average coverage depth of the pool.

To increase resolution, each target region is divided into overlapping sub-regions in a sliding window approach as shown in Fig. 3, forming the template for a window-based representation of each target region. This approach is called the Target Region based Sliding Windows (TRSW) approach, or just sliding windows. It also helps in detecting CNVs occurring in smaller sub-regions, e.g., part of an exon. The window size is selected based on the length of the sequencing reads and the required resolution of CNV predictions. The sliding length between two adjacent overlapping windows is the same across all regions and is kept relatively small compared to the window size. This helps in locating the start and end points of CNVs more accurately, up to the resolution of the sliding length. At our diagnostic lab the standard sequencing read length is 150 nt (2 × 150 nt paired-end reads). Hence a window size of 75 nt, i.e., half of the read length, along with a sliding length of 10 nt has been chosen for validation samples and for standard routine CNV detection in NGS runs. This gives an overlap of 65 nt between two consecutive windows. This selection of window size and sliding length gives a good tradeoff between computational complexity and resolution.

Equation 1a is used for calculating NSW, the number of sliding windows for a target region of length LTR, where sliding window length is LSW and sliding length is LSL.

Window traversal for a region starts by aligning the first window at start of the region and sliding forward (with sliding length) until end of region. If for the last slide the remaining length of the region is less than sliding length, then the remaining length is added as an additional length to the last window. Hence the size of the last window in a region can be bigger than the chosen window size. Equations 1b and 1c are used for calculation of this additional length LADD and length of the last sliding window LLAST.SW, respectively.

Once window traversal ends for a target region, the next window starts at the beginning of the next target region. If the length of a target region is smaller than the chosen window size, then there will not be any splitting of that region into windows and there will only be one window for that region, of the same size as the region.
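The equations referred to above are not reproduced in this answer. Read literally, the description suggests NSW = floor((LTR − LSW) / LSL) + 1, LADD = (LTR − LSW) mod LSL, and LLAST.SW = LSW + LADD. The sketch below follows that reading and is not the authors' code. The function name `sliding_windows` and the half-open coordinate convention are assumptions made for illustration.

```python
def sliding_windows(region_start, region_end, window=75, slide=10):
    """Split a target region into overlapping windows (hypothetical helper).

    Coordinates are half-open: [region_start, region_end).
    Defaults follow the 75 nt window and 10 nt sliding length in the text.
    """
    length = region_end - region_start
    if length <= 0:
        raise ValueError("region must have positive length")
    # A region no longer than one window is kept as a single window.
    if length <= window:
        return [(region_start, region_end)]

    # Assumed reading of Equation 1a: number of full slides plus the first window.
    n_windows = (length - window) // slide + 1
    # Assumed reading of Equation 1b: leftover shorter than one slide.
    extra = (length - window) % slide

    windows = [
        (region_start + i * slide, region_start + i * slide + window)
        for i in range(n_windows)
    ]
    # Assumed reading of Equation 1c: the leftover extends the last window.
    last_start, last_end = windows[-1]
    windows[-1] = (last_start, last_end + extra)
    return windows
```

For a 200 nt region with the default settings, this yields 13 windows, the last of which is 80 nt long because the 5 nt remainder is appended to it.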

In the first part of the pipeline, static pools are created from normal samples with no CNVs, sorted according to coverage depth. The pipeline can then select a pool of samples that matches the coverage depth of the query sample and use this pool to estimate the expected coverage depth (without any CNVs) for a region of interest. Figure 4 shows the workflow of static pool creation.

Targeted capture kits always show batch effects in capture quality because of differences between the batches or lots of kits provided by the vendor. This is a common issue with sequencing of targeted panels. Using samples from the same kit batch or lot reduces the level of noise by reducing batch effects in the CNV analysis. Therefore, normal samples used to create static pools for a CNV analysis should be sequenced using the same batch of target capture kit as was used for the query samples.

Results from several NGS runs are used as input data in pool creation. The pipeline extracts normal samples (with depth of coverage higher than the assigned cutoff) from the provided runs and lists them in increasing order of coverage depth (Step 1 in Fig. 4).

To increase the resolution of CNV results the sliding windows approach (TRSWs, see above) is used. For each normal sample, coverage for all sliding windows is calculated (Step 2 in Fig. 4).

This list of samples is used for creating the static pools. Equation 2 is used for calculating M, the total number of pools generated from these samples given N, the number of normal samples, and K, the pool size.

Given a pool size of K, the first K samples of the list are used to create the 1st static pool of normal samples, the 2nd pool skips the first sample and uses the next K samples (2nd through (K + 1)th sample), and the same pattern follows for the remaining pools. The Mth (last) pool uses the last K samples ((N − K + 1)th through Nth sample) from the list (Step 3 in Fig. 4).
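Equation 2 is not shown in this answer, but the description of the first and last pools implies M = N − K + 1. The following is a minimal sketch of that pooling scheme, assuming each normal sample is represented as a `(name, mean_depth)` tuple. The function name `make_static_pools` and the `min_depth` parameter are hypothetical and stand in for the pipeline's own depth cutoff.

```python
def make_static_pools(samples, pool_size, min_depth=0.0):
    """Build overlapping pools of normal samples ordered by coverage depth.

    samples: list of (name, mean_depth) tuples (assumed representation).
    pool_size: K, the number of samples per pool.
    min_depth: hypothetical cutoff; samples at or below it are dropped.
    """
    # Step 1: keep samples above the cutoff, sorted by increasing depth.
    ordered = sorted(
        (s for s in samples if s[1] > min_depth), key=lambda s: s[1]
    )
    n = len(ordered)
    if pool_size < 1 or pool_size > n:
        raise ValueError("pool_size must be between 1 and the sample count")
    # Step 3: pool i holds samples i .. i + K - 1, giving M = N - K + 1 pools.
    return [ordered[i:i + pool_size] for i in range(n - pool_size + 1)]
```

With N = 5 samples and K = 3, this produces 3 pools, and each pool differs from its neighbor by a single sample.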

For each sliding window in the panel the mean coverage depth over all samples in each pool is calculated (Step 4 in Fig. 4). This list of mean coverage depth of each sliding window (mean_TRSW) of a pool is stored and used for CNV score calculations.
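As a rough illustration of Step 4, the per-window pool mean can be computed as below. This is a sketch that assumes each sample's coverage is already summarized as one value per sliding window, in the same window order for every sample. The name `pool_window_means` is hypothetical.

```python
def pool_window_means(pool_coverages):
    """Mean coverage per sliding window across the samples of one pool.

    pool_coverages: list of per-sample lists, one coverage value per window,
    with windows in the same order for every sample (assumed layout).
    """
    if not pool_coverages:
        raise ValueError("pool must contain at least one sample")
    n_windows = len(pool_coverages[0])
    if any(len(sample) != n_windows for sample in pool_coverages):
        raise ValueError("all samples must cover the same windows")
    n_samples = len(pool_coverages)
    # zip(*...) groups the values of window w across all samples.
    return [sum(values) / n_samples for values in zip(*pool_coverages)]
```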

As all regions in the target panel are split into smaller sliding windows (TRSWs) to increase the resolution of results, CNV score is calculated for each window. Figure 2 illustrates the CNV calculation workflow.

For a given query sample the coverage depth is first calculated for each sliding window. A static pool is then chosen from the set of static pools where mean coverage depth of the selected pool is closest to coverage depth of the sample. The coverage depth for each window of the query sample is compared against mean coverage depth of each corresponding window of the selected pool. This ratio is converted to log2 scale to calculate the final CNV score, i.e., log copy number ratio score (logCNR score) for that window. Equation 3 is used for calculating the logCNRscore for a window, where LSW is sliding window length, NDi is nucleotide depth at ith position of query sample, NDij is nucleotide depth at ith position of jth sample in the static pool, and n is the number of samples in the selected static pool.
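Equation 3 itself is missing from this answer. From the symbol definitions, a plausible reading is logCNRscore = log2 of (mean of NDi over the LSW positions of the window) divided by (mean over the n pool samples of the mean of NDij over the same positions). The sketch below implements that reading together with the pool selection step. It is an assumption based on the prose, not the published formula, and the function names are hypothetical.

```python
import math


def select_pool(query_mean_depth, pool_mean_depths):
    """Index of the static pool whose mean depth is closest to the query's."""
    if not pool_mean_depths:
        raise ValueError("no static pools available")
    return min(
        range(len(pool_mean_depths)),
        key=lambda i: abs(pool_mean_depths[i] - query_mean_depth),
    )


def log_cnr(query_depths, pool_depths):
    """log2 copy number ratio for one sliding window (assumed form of Eq. 3).

    query_depths: nucleotide depths ND_i of the query sample in the window.
    pool_depths: one list of nucleotide depths ND_ij per pool sample.
    """
    l_sw = len(query_depths)
    query_mean = sum(query_depths) / l_sw
    pool_mean = sum(sum(d) / l_sw for d in pool_depths) / len(pool_depths)
    if pool_mean == 0:
        raise ValueError("pool has no coverage in this window")
    if query_mean == 0:
        # No reads in the query window: the ratio is 0 and log2 is undefined.
        return float("-inf")
    return math.log2(query_mean / pool_mean)
```

A window where the query has half the pool's depth scores −1.0 under this reading, which matches the value expected for a heterozygous deletion.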

Theoretical values of logCNRscore are 0.0 for 2 alleles (normal), −1.0 for 1 allele (deletion), and +0.58 for 3 alleles (duplication). The logCNRscore for each sliding window is stored as CNV results of the query sample.
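These theoretical values follow directly from the log2 ratio, assuming coverage is proportional to copy number and the pool represents the normal two-copy state. For a window with c copies in the query sample:

```latex
\log_2\!\left(\frac{c}{2}\right):\qquad
\log_2\!\left(\frac{1}{2}\right) = -1,\qquad
\log_2\!\left(\frac{2}{2}\right) = 0,\qquad
\log_2\!\left(\frac{3}{2}\right) \approx 0.585
```

Real data will scatter around these values, so they are best read as reference points rather than exact thresholds.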

The quality of the pools relative to the query sample is important for the performance of our approach, and quality control of query and pools is therefore an important step for reducing noise in the analysis. Three quality checks are used: first, comparing the coverage depth of the query sample to the average depth of the selected pool; second, checking the uniformity of coverage depth among samples in the selected static pool; and third, comparing CNV results generated using static pools to results generated with run-wise pools (see below).

The quality of CNV results depends on the query sample and the selected static pool having similar coverage depth. Hence, for all query samples, the percentage deviation of the mean depth of the query sample relative to the mean depth of the selected pool is checked. If this percentage deviation is larger than a cutoff (set by the lab, for example 5%), then the query sample is re-analyzed with a larger (updated) list of static pools. If the deviation is still too large, then re-sequencing or an MLPA test is used, depending on the number of genes requested for analysis.
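The first quality check can be sketched as follows. The text does not give the exact formula, so this assumes the deviation is the absolute difference between query and pool mean depth, expressed as a percentage of the pool mean. The function names are hypothetical, and the 5% default mirrors the example cutoff above.

```python
def depth_deviation_percent(query_mean_depth, pool_mean_depth):
    """Absolute deviation of query depth from pool depth, in percent."""
    if pool_mean_depth <= 0:
        raise ValueError("pool mean depth must be positive")
    return abs(query_mean_depth - pool_mean_depth) / pool_mean_depth * 100.0


def passes_depth_qc(query_mean_depth, pool_mean_depth, cutoff_percent=5.0):
    """True when the deviation does not exceed the lab-defined cutoff."""
    deviation = depth_deviation_percent(query_mean_depth, pool_mean_depth)
    return deviation <= cutoff_percent
```

A sample that fails this check would, per the text, be re-analyzed against an updated list of static pools before falling back to re-sequencing or MLPA.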

The quality of the selected static pool can also affect the CNV results. Even when the percentage deviation of the coverage depth of the query sample compared to mean depth of the selected pool is lower than cutoff, differences in depth of normal samples used in making of selected pools can introduce noise. Hence only good quality pools (i.e., samples with uniform coverage depth) should be used for CNV detection. Additionally, run-wise pools (created by using all samples from the same NGS run of the query sample) can also be used to check quality of the static pool in case of noisy results.

For each gene in the target panel, the logCNR scores of windows belonging to that gene are plotted. These plots are checked for an initial assessment. Once potential signals are identified, gene-specific regions are looked up in the table of logCNR scores. As an example of a deletion event, Fig. 5 shows a plot of the logCNR scores of all sliding windows of the BRCA2 gene in a control sample (CS_12), depicting a signal of deletion of exon 3, and the table in Fig. 5 lists the logCNR scores of all sliding windows of exon 3 and its adjacent exons 2 and 4. In some cases, to get the best possible resolution (i.e., to locate the exact breakpoint), nucleotide-level coverage files are also checked. In our lab's diagnostic practice, we also generate merged plots for the same gene across all the samples sequenced in the same run (without naming the samples, to avoid incidental findings), which helps in distinguishing noise from signal. We also generate merged plots of run-wise versus static pooling results for all genes over all samples, which helps us identify any noise associated with static pools (see Quality control).

Once CNV signals have been confirmed in the logCNR score table, MLPA-based validation is performed on the sample. For genes where no MLPA test is available, RNA sequencing or long-range PCR is performed for CNV verification.

Selection of control samples for validation has been based on availability of known CNV positive samples, previously detected through MLPA. These samples were collected from the genetic diagnostic laboratories at Haukeland University Hospital (Bergen, Norway), University Hospital of North Norway (Tromsø, Norway), and St. Olavs Hospital (Trondheim, Norway). In total 36 positive control samples were used for validation of the CNV detection pipeline, where only genes with known CNVs were checked to reduce the risk of incidental findings. Additionally, 11 routine samples were chosen for calculating the specificity of the pipeline, where all the genes in the panel were checked for CNVs. These samples were collected at Department of Medical Genetics, St. Olavs Hospital, Trondheim, Norway. Both the 36 positive control samples and the 11 routine samples were germline samples where DNA had been extracted from blood.

The target gene panel consisted of 126 genes. For all genes, only exons, UTR regions and approximately ± 25 nucleotides in intronic regions were captured. These 126 genes are mainly cancer associated genes. Additional file 1 lists target regions and capturing probes.

Illumina’s Nextera Rapid Capture Custom Enrichment kit was used for capturing the target sequences. Illumina MiSeq and Illumina NextSeq 500 sequencers were used for sequencing the samples.

Mala Subbaya · Answer

NGS dosage analysis detects unbalanced LRs such as deletions and duplications involving one or more exons. One limitation of the study is that analyses of GREM1 and of MSH2 inversion elements were not included on the panel for the full time period included in this analysis.
