Van Loo lab

Cancer Genomics Laboratory: Software

Introduction

Allele-Specific Copy number Analysis of Tumors (ASCAT)

ASCAT is a method to derive copy number profiles of tumor cells, accounting for normal cell admixture and tumor aneuploidy. ASCAT infers tumor purity (the fraction of tumor cells) and ploidy (the amount of DNA per tumor cell, expressed as multiples of haploid genomes) from SNP array or massively parallel sequencing data, and calculates whole-genome allele-specific copy number profiles (the number of copies of both parental alleles for all SNP loci across the genome).
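
To make the roles of purity and ploidy concrete, the R sketch below computes the LogR and BAF expected at a single SNP locus. It assumes the signal model from the original ASCAT publication; the function expected_signal and its argument names are illustrative and are not part of the ASCAT package.

```r
# Expected LogR and BAF at one locus, under the assumed ASCAT signal model.
#   nA, nB : copy numbers of the two parental alleles in tumor cells
#   rho    : tumor purity (fraction of tumor cells)
#   psi_t  : tumor ploidy (average copy number per tumor cell)
#   gamma  : platform-dependent LogR scaling (1 for sequencing data)
expected_signal <- function(nA, nB, rho, psi_t, gamma = 1) {
  # Average copies at this locus across tumor and normal (diploid) cells
  locus_copies <- 2 * (1 - rho) + rho * (nA + nB)
  # Average ploidy of the whole (admixed) sample
  sample_ploidy <- 2 * (1 - rho) + rho * psi_t
  list(
    logR = gamma * log2(locus_copies / sample_ploidy),
    BAF  = (1 - rho + rho * nB) / locus_copies
  )
}

# A hemizygous loss (nA = 1, nB = 0) in a diploid tumor at 100% and 50% purity
pure  <- expected_signal(nA = 1, nB = 0, rho = 1.0, psi_t = 2)
mixed <- expected_signal(nA = 1, nB = 0, rho = 0.5, psi_t = 2)
print(c(pure$logR, mixed$logR))
print(c(pure$BAF, mixed$BAF))
```

In this toy case, normal cell admixture pulls both signals toward the diploid baseline (LogR toward 0, BAF toward 0.5), which illustrates why purity and ploidy have to be estimated together with the copy numbers.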

The latest ASCAT version is available as an R package on GitHub, at: https://github.com/VanLoo-lab/ascat. Instructions to install and try the software are provided on the GitHub page.

Running ASCAT

In its simplest form (with matched normal data available, without GC wave correction and all samples female), ASCAT can be run as follows:

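library(ASCAT)
# Load tumor and matched germline LogR/BAF data
ascat.bc = ascat.loadData("Tumor_LogR.txt", "Tumor_BAF.txt", "Germline_LogR.txt", "Germline_BAF.txt")
ascat.plotRawData(ascat.bc)
# Segment the data with ASPCF
ascat.bc = ascat.aspcf(ascat.bc)
ascat.plotSegmentedData(ascat.bc)
# Fit purity and ploidy and derive allele-specific copy number profiles
ascat.output = ascat.runAscat(ascat.bc)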

The ascat.loadData function by default assumes all samples are female. An extra optional parameter (gender = …) allows setting the gender of samples (in vector format, using "XX" for females and "XY" for males).
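
As a minimal sketch of how such a gender vector could be built, the R snippet below maps a sample annotation table to "XX"/"XY" codes. The table and its column names are hypothetical, and the snippet assumes the vector should hold one entry per sample, in the same order as the samples in the input files.

```r
# Hypothetical sample annotation; the column names are assumptions for illustration.
samples <- data.frame(
  sample = c("S1", "S2", "S3"),
  sex    = c("female", "male", "female"),
  stringsAsFactors = FALSE
)

# Map the recorded sex to the "XX"/"XY" codes used by the gender parameter.
sex_to_gender <- c(female = "XX", male = "XY")
gender <- unname(sex_to_gender[samples$sex])
stopifnot(!anyNA(gender))  # fail early on unrecognised sex labels
print(gender)

# The vector would then be passed alongside the four input files, e.g.:
# ascat.bc = ascat.loadData("Tumor_LogR.txt", "Tumor_BAF.txt",
#                           "Germline_LogR.txt", "Germline_BAF.txt",
#                           gender = gender)
```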
 
ASCAT can be run in different modes: without matched normal data, with logR correction (for GC content and replication timing), with multi-sample segmentation, and on high-throughput sequencing (HTS) data. Examples of running ASCAT in these modes can be found here.

Input data formats and supported platforms

1) SNP arrays

ASCAT is platform- and species-independent and works for both Illumina and Affymetrix SNP arrays. The required input consists of matrices of LogR and B Allele Frequency (BAF) data (rows are probes or SNP loci and columns are samples). ASCAT requires identically formatted LogR and BAF files for both tumor and germline data (with probes and samples in the same order in all four files). For examples of the precise data format, see our simulated example data (7.62 MB, zip).
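
The requirement that all four inputs share the same layout can be checked before running ASCAT. The following R sketch does so on small in-memory matrices; the helper same_layout and the toy values are illustrative only, and the real input files may carry additional annotation columns, as shown in the example data.

```r
# Toy matrices in the layout described above: probes in rows, samples in columns.
probes  <- c("SNP1", "SNP2", "SNP3")
samples <- c("S1", "S2")
make_matrix <- function(values) {
  matrix(values, nrow = length(probes), dimnames = list(probes, samples))
}
tumor_logr    <- make_matrix(c(0.1, -0.4, 0.0, 0.2, 0.3, -0.1))
tumor_baf     <- make_matrix(c(0.5, 0.2, 1.0, 0.0, 0.6, 0.5))
germline_logr <- make_matrix(rep(0, 6))
germline_baf  <- make_matrix(c(0.5, 0.5, 1.0, 0.0, 0.5, 0.5))

# Returns TRUE only if all matrices share identical probe and sample labels,
# in the same order.
same_layout <- function(...) {
  mats <- list(...)
  ref <- dimnames(mats[[1]])
  all(vapply(mats, function(m) identical(dimnames(m), ref), logical(1)))
}

print(same_layout(tumor_logr, tumor_baf, germline_logr, germline_baf))

# Reordering the probes of one matrix is detected as a mismatch.
shuffled <- germline_baf[c(2, 1, 3), ]
print(same_layout(tumor_logr, tumor_baf, germline_logr, shuffled))
```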

Input data for ASCAT can be obtained directly from Illumina GenomeStudio or can be derived from Affymetrix CEL files, e.g. through the PennCNV libraries. The pipeline we use (and recommend) for Affymetrix SNP 6.0 arrays can be found within the R package on GitHub.

Please note that you need two adapted files for this pipeline, one containing the SNP locations for the AffySNP6 platform (12.68 MB, zip) and a genotype cluster file (33.25 MB, zip) that was compiled from a series of about 5,000 verified normal samples.

2) HTS data

For HTS data, ASCAT requires BAM files as well as reference files (listed on the GitHub page) so it can get allele counts and derive logR and BAF values. After logR/BAF files are generated (ascat.prepareHTS), one can use the other ASCAT functions to perform all of the standard steps (loading logR/BAF, correcting logR for covariates, segmenting tracks and getting CNA profiles).
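
To illustrate the idea of deriving logR and BAF from allele counts, the R sketch below works through a deliberately simplified version on made-up counts. It is not the implementation used by ascat.prepareHTS; the data frame, its column names and the helper derive_logr_baf are assumptions for illustration.

```r
# Hypothetical allele counts at four SNP loci (not produced by ASCAT).
counts <- data.frame(
  tumor_ref  = c(10, 20, 30, 15),
  tumor_alt  = c(10,  0, 30,  5),
  normal_ref = c(10, 10, 10, 10),
  normal_alt = c(10, 10, 10, 10)
)

derive_logr_baf <- function(counts) {
  tumor_depth  <- counts$tumor_ref + counts$tumor_alt
  normal_depth <- counts$normal_ref + counts$normal_alt
  # BAF: fraction of reads supporting the alternate (B) allele
  tumor_baf  <- counts$tumor_alt / tumor_depth
  normal_baf <- counts$normal_alt / normal_depth
  # LogR: tumor/normal depth ratio, scaled so the mean ratio is 1, on a log2 scale
  ratio <- tumor_depth / normal_depth
  tumor_logr <- log2(ratio / mean(ratio))
  data.frame(tumor_logr, tumor_baf, normal_baf)
}

signals <- derive_logr_baf(counts)
print(signals)
```

Dividing by the matched normal depth cancels locus-specific coverage biases shared by both samples, and scaling by the mean ratio removes the overall difference in sequencing depth between tumor and normal.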

For targeted sequencing data, we have implemented a bespoke function that identifies high-quality SNPs to investigate (ascat.prepareTargetedSeq). This step must be done on a batch of normal samples (no tumor samples) and prior to generating logR and BAF values. More information on how to get CNA profiles for HTS data can be found on our GitHub page.

3) Additional information

ASCAT can also be run on data from other species, for example, SNP arrays from canine breast cancers or exomes from zebrafish melanomas. As the method leverages SNP loci, it will however not work on haploid or homozygous (inbred) species (e.g. inbred mouse strains).

Samples profiled through SNP arrays or massively parallel sequencing are often affected by 'wave artifacts' that are in part correlated with the GC content of the surrounding region (e.g. this paper by Diskin et al.). We have implemented a GC wave correction in ASCAT, and recommend adding that step to the pipeline if the logR of the input data has not already been corrected by another method. Our original GC correction method (ASCAT 2.2) is based on the one initially implemented by Cheng et al., Genome Biology 12:R80, 2011. As of version 3.0, this correction has been extended to cover both GC content and replication timing.
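
The general principle behind such a correction can be sketched in R on simulated data: LogR is modelled as a smooth function of local GC content and the residuals are kept. This is a simplified illustration, not the correction implemented in ASCAT; the simulated values, the spline degrees of freedom and all variable names are assumptions.

```r
library(splines)  # shipped with base R

set.seed(1)
n  <- 2000
gc <- runif(n, min = 0.3, max = 0.6)        # hypothetical GC fraction around each probe
logr_true <- rep(c(0, -0.5), each = n / 2)  # a copy number loss in the second half
# Add a GC-driven wave and measurement noise to the true signal
logr_raw <- logr_true + 1.5 * (gc - 0.45) + rnorm(n, sd = 0.05)

# Model LogR as a smooth function of GC content and keep the residuals.
fit <- lm(logr_raw ~ ns(gc, df = 3))
logr_corrected <- residuals(fit)

print(cor(logr_raw, gc))        # clear correlation before correction
print(cor(logr_corrected, gc))  # close to zero afterwards
```

Because the fit only uses GC content as a covariate, the copy number step between the two halves of the simulated data is preserved while the GC-correlated component is removed.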

An important platform- and normalization-specific parameter is gamma, the normalization parameter of the function ascat.runAscat. This parameter represents the drop in LogR for a change from two copies to one copy in 100% of cells. For massively parallel sequencing data, gamma should always be set to 1. For array data, due to array background signal and bespoke array normalization procedures, gamma is often significantly lower in practice. Its default setting of 0.55 works for many, but not all, SNP arrays (it is suitable for, e.g., Illumina 109k arrays as processed through BeadStudio/GenomeStudio and Affymetrix SNP 6.0 arrays processed through the PennCNV libraries). For other SNP array platforms (and normalization procedures), we recommend checking the value of gamma by comparing a male and a female germline sample (evaluating the difference in LogR values of the X chromosome probes between the two, relative to the rest of the genome), or through an X chromosome titration series.
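
The male/female comparison can be sketched in R as follows. The data are simulated with a known gamma so that the estimate can be checked; the helper names and noise level are assumptions. On real data, probes in the pseudoautosomal regions of the X chromosome would need to be excluded, since males carry two copies there.

```r
set.seed(42)
true_gamma <- 0.55  # used only to simulate array-like data

# Simulate germline LogR: autosomes at 2 copies in both sexes,
# X at 2 copies in the female sample and 1 copy in the male sample.
simulate_logr <- function(n, copies) {
  true_gamma * log2(copies / 2) + rnorm(n, sd = 0.1)
}
female <- list(autosomes = simulate_logr(5000, 2), chrX = simulate_logr(500, 2))
male   <- list(autosomes = simulate_logr(5000, 2), chrX = simulate_logr(500, 1))

# X-chromosome LogR relative to the rest of the genome, per sample
relative_x <- function(s) median(s$chrX) - median(s$autosomes)

# Going from two copies (female) to one copy (male) should lower LogR by gamma
gamma_estimate <- relative_x(female) - relative_x(male)
print(round(gamma_estimate, 2))

# The estimate could then be supplied to ASCAT, e.g.:
# ascat.output = ascat.runAscat(ascat.bc, gamma = gamma_estimate)
```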

ASCAT outputs

The output of ASCAT, and how to interpret it, is described in this book chapter.

Legacy versions and data

Historic versions of ASCAT are available as part of our GitHub repository. We recommend always using the latest version; the historic versions are provided for legacy reasons only.

Major changes to ASCAT over the original version 1.0 are:

  • Availability as an easy-to-use and coherent R software suite (2.0)
  • Major improvements in computational speed (2.0)
  • Platform-independence (2.0)
  • Update of the core algorithm for better performance and results (2.0 and 2.2)
  • Addition of germline genotype prediction and thereby extension to unmatched tumor samples (2.0)
  • Adaptations to the ASPCF segmentation algorithm to increase sensitivity in samples with low noise and to increase robustness in more noisy samples (2.1)
  • Addition of a gender parameter, allowing correct handling of copy number aberrations on the X chromosome in male samples (2.2)
  • Addition of GC correction code (2.2)
  • Adaptations to allow manual refitting of samples (2.3)
  • Adaptations and additions to output data structures (2.3)
  • Availability as R package (2.4)
  • Addition of a multi-sample segmentation for samples that are expected to share breakpoints (2.5)
  • Addition of a bespoke methodology for generating logR and BAF from HTS data (3.0)
  • Addition of a pre-processing step for targeted sequencing data (3.1)

Breast carcinoma SNP array data from our original ASCAT publication are also available. The data consist of the LogR and BAF values for both the tumor and germline SNP array data. We also include tumor LogR data after adjustment for GC bias using the method described in Diskin et al., Nucleic Acids Research, 36:e126, 2008. Due to privacy regulations, the data are password protected. Please contact us to obtain access.

A script used to analyze these Illumina 109k breast carcinoma SNP array data using ASCAT 1.0 is available on GitHub.

Subclonal copy number analysis: the Battenberg algorithm

To assay subclonal copy number changes in massively parallel sequencing data, we created the Battenberg algorithm, based on the underlying ASCAT principles and equations and on haplotype phasing of 1000 Genomes SNP loci. The Battenberg algorithm was originally described here and is now available on GitHub.

Frequently asked questions