Cohort Integration

Cohort Integration (CI) identifies cancer driver genes that are under positive selection during tumor development. CI takes as input a Mutation Annotation Format (MAF) file, which aggregates mutation information from VCF files. To identify genes under positive selection, CI compares the fitness effects of observed missense and nonsense coding variants with those of simulated random nucleotide changes, using the two-sample, one-sided Kolmogorov–Smirnov test. This test determines whether the test set and the reference set differ significantly, based on the deviation between the cumulative distribution functions of the two datasets. The fitness effects of coding variants are evaluated with Evolutionary Action (EA). For frameshift INDELs and other small somatic point mutations, CI uses the mutation frequency to evaluate their fitness effect on the given gene. Here CI applies the cumulative binomial probability test, assuming an equal mutation rate at each nucleotide position throughout the genome. Finally, an overall fitness effect for each gene is calculated by combining the p-value for missense and nonsense mutations with the p-value for frameshift INDEL and stop-loss mutations using Fisher’s combined probability test. The combined p-values are then corrected for multiple-hypothesis testing by calculating the false discovery rate (q-value).
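
The statistical steps described above can be illustrated with a minimal, standard-library-only Python sketch. This is not CI's actual implementation. All function names, parameter names, and toy numbers are hypothetical. The sketch also makes assumptions the description does not state: the one-sided test checks whether observed scores are shifted toward higher values than the simulated ones, the Kolmogorov–Smirnov p-value uses the large-sample approximation, and q-values are computed with the Benjamini-Hochberg procedure.

```python
import math
from bisect import bisect_right


def ks_one_sided_p(observed, reference):
    """Two-sample one-sided KS test (asymptotic p-value).

    Assumption: the alternative is that `observed` scores tend to be
    higher than `reference` scores, i.e. the observed CDF lies below
    the reference CDF.
    """
    obs, ref = sorted(observed), sorted(reference)
    n, m = len(obs), len(ref)
    # D+ = max over x of F_ref(x) - F_obs(x), checked at every data point.
    d_plus = max(
        bisect_right(ref, x) / m - bisect_right(obs, x) / n for x in obs + ref
    )
    d_plus = max(d_plus, 0.0)
    # Large-sample approximation of P(D+ >= d_plus) under the null.
    return math.exp(-2.0 * d_plus**2 * n * m / (n + m))


def binomial_upper_tail(k, n_trials, rate):
    """Cumulative binomial probability P(X >= k), X ~ Binomial(n_trials, rate).

    `rate` is a hypothetical uniform per-trial mutation probability (0 < rate < 1).
    Terms are computed in log space to avoid overflow.
    """
    total = 0.0
    for i in range(k, n_trials + 1):
        log_term = (
            math.lgamma(n_trials + 1) - math.lgamma(i + 1)
            - math.lgamma(n_trials - i + 1)
            + i * math.log(rate) + (n_trials - i) * math.log1p(-rate)
        )
        total += math.exp(log_term)
    return min(total, 1.0)


def fisher_combine(pvalues):
    """Fisher's combined probability test for independent p-values."""
    k = len(pvalues)
    # Clamp to avoid log(0); half_x is X/2 where X = -2 * sum(ln p).
    half_x = -sum(math.log(max(p, 1e-300)) for p in pvalues)
    # Chi-square survival function with 2k degrees of freedom (closed form).
    return math.exp(-half_x) * sum(half_x**i / math.factorial(i) for i in range(k))


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        q[idx] = running_min
    return q


# Toy, made-up inputs for one gene (not real data).
observed_ea = [80.0, 90.0, 95.0]          # scores of observed variants
simulated_ea = [10.0, 20.0, 30.0, 40.0]   # scores of simulated random changes
p_coding = ks_one_sided_p(observed_ea, simulated_ea)
p_frequency = binomial_upper_tail(k=3, n_trials=1000, rate=0.001)
p_gene = fisher_combine([p_coding, p_frequency])
# With many genes, collect every p_gene and adjust them together.
q_values = benjamini_hochberg([p_gene, 0.2, 0.7])
```

In practice, an established statistics library would be a more robust choice than these hand-written routines, particularly for the exact Kolmogorov–Smirnov p-value on small samples.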