The game is to actually get a lot of patients. Memorial Sloan Kettering, Foundation One, Broad Institute, Venter are the biggest data-gatherers I'm aware of right now, with the DoD starting to get in the game. But who really wins will be the platforms that do the bioinformatics analysis: Google Genomics, Illumina (BaseSpace), etc.
And the ethics questions and "we don't know about the environment" questions aren't going to get answered until the data is collected. Wait till the EMRs are tied into the big data pipelines. Oh, nellie.
Hate to break it to everyone breathless over yet another press clipping, but this startup is dead in the water. Another $45M down the tubes.
If they get some limited access to data, develop the machine learning software better than anyone, what's stopping one of the big pharma companies from shelling out $1B to acquire the company? Getting access to a tech team + IP could be very valuable. Pharma companies aren't known for their software teams either and this seems to be the core of their business offering here. Not the tests that everyone's already doing... I mean I just reread the article and they aren't claiming this is their core business proposition at all (as the OP seems to imply).
Even Illumina makes $2B in revenue currently. There's lots of money in the pharma industry and they always converge on a few big firms. They don't need to be original to provide real technical value here.
Unlike the pharma industry where you shield your IP with lawyers for 10 yrs, being first means nothing in the software world. It's about who can do it best.
The founder's comments here also clarify that they aren't going after new interesting areas because they are focusing on commercializing something that works right now and advancing the data science aspect of it, instead of making a future play on some original R&D.
Which would require getting IRBs on board. Which requires working with IRB systems vendors, like iRIS. (google for "iRIS IRB system")
Also, beating iRIS would be nice.
Also, marrying CROs to investigators and vice versa would be super helpful.
We are still unsure about the nature/nurture problem. What if cancer development is more attributable to the environment?
"Color has developed a next-generation sequencing based test for hereditary cancer. This test analyzes 30 genes associated with increased risk to develop breast, ovarian, colorectal, melanoma, pancreatic, prostate, stomach, and uterine cancers... The assay has a high degree of analytical validity for the detection of single nucleotide variants, small insertions and deletions (indels), and larger deletions and duplications (copy number variants, or CNVs)."
So not microarrays like 23andMe that test for SNPs ($99?), and not full genome sequencing either (~$1,000?), but targeted sequencing of the genomic regions of those 30 genes.
Wet Lab sequencing method: "Specifically, it includes target enrichment by Agilent’s SureSelect method (v1.7) and sequencing by Illumina’s NextSeq 500 (paired-end 150bp, High Output kit)"; Unanswered question, are they doing the sequencing in-house or using a facility somewhere else?
Computational method: "The bioinformatics pipeline was built using well-established algorithms such as BWA-MEM, SAMtools, Picard and GATK. CNVs are detected using dedicated internally developed algorithms for read depth analysis and split-read alignment detection."
So basically: perform the standard genome assembly, align your partial assembly to the human reference genome, and then identify which variants of these 30 genes the patient sample has; plus a special sauce for counting the number of specific bp repeats due to indel events. This is not something I am too familiar with, but presumably the number of specific k-mer repeats you have in these genes of interest might correlate with a specific type of cancer? (I would love to hear the opinion of someone who is an expert in this field.)
"These [30] genes are APC, ATM, BAP1, BARD1, BMPR1A, BRCA1, BRCA2, BRIP1, CDH1, CDK4, CDKN2A (p14ARF and p16INK4a), CHEK2, EPCAM, GREM1, MITF, MLH1, MSH2, MSH6, MUTYH, NBN, PALB2, PMS2, POLD1, POLE, PTEN, RAD51C, RAD51D, SMAD4, STK11, and TP53". (You can follow up by searching them here, e.g., http://www.genecards.org/cgi-bin/carddisp.pl?gene=BRCA2&keyw...).
Also interesting to note, since it's clinical, each of their tests has to be verified by a certified "genetics counselor" and also meet lots of clinical standards.
I am all for them trying though. I just don't think we are at a point where we can make a good diagnosis/conclusion yet.
I found this whitepaper on their website, which provides some level of detail...
https://s3.amazonaws.com/color-static-prod/pdfs/validationWh...
However, there were a good many asterisks and caveats about not testing every position along these genes (some of which are quite large).
While I'm not aware of any other companies that are doing this type of direct to consumer testing, companies like Myriad have offered targeted panels on some of these gene targets for some time.
Color is focused on testing for characterized genes (e.g. BRCA1, BRCA2, PTEN, etc.) which have an impact on an individual's risk of developing cancer. Environmental and other factors of course play a role, and most cancers are not caused by these genes. However, knowing that you are at high risk of developing cancer is something a patient can work on with their physician to develop a personalized screening and prevention plan. For example, national guidelines from NCCN suggest that women with a BRCA1 mutation get more frequent mammograms. See also cancer.gov risks for having a BRCA1 or BRCA2 mutation: https://www.cancer.gov/about-cancer/causes-prevention/geneti...
Color was developed working closely with some of the leading cancer researchers including Dr. Mary-Claire King, who is credited with discovering BRCA1, and Dr. Laura Esserman and Dr. Laura van 't Veer at UCSF. Our team includes people with backgrounds in genetics, medicine, and clinical pathology as well as machine learning, big data, and systems engineering.
This unique combination of skills is really crucial to pushing this area forward. For example, the Komen Foundation (one of the world's biggest breast cancer foundations) held a conference on big data for breast cancer at Rockefeller University last year, and I was part of the planning committee. http://ww5.komen.org/BD4BC.html
Marrying data science to medicine is a way to drive cancer research forward.
As an aside, one of Color's founders is a BRCA carrier whose mother had breast cancer twice, and whose grandmother died of the disease. So, it is a bit sad to me that the default assumption is we are "academic imperialists" versus people trying to do something good for the world.
Thanks for reading :)
First off I want to say that I appreciate the drive to work on problems like this. I personally think it's a much better use of money than funding yet another marketing tool to 'revolutionize push notification blah blah'.
I am wondering if you can answer a question (probably naive but curious nonetheless). Why not sequence the full genome on a 30x coverage? Why find just the mutations on these genes rather than across the board?
The reason I'm asking is that it feels somewhat limiting to focus on only the currently known relationships of mutations rather than collecting the full data set. There are other initiatives, such as SB Genomics, that are doing very interesting work on the Cancer Cloud by utilizing new graphing techniques in data science to understand large scale pattern interactions, but they typically are utilizing the full data set.
We don't know what we don't know. Yes, we know that certain genes can contribute towards cancer. However, that doesn't mean you will develop it. You could have a torrent of SNPs underneath BRCA1 but never develop breast cancer.
The last comment I'll make (not to be a naysayer) is that it's naive to approach the solution as if it's black and white. Biology is super messy. The data is often not clean. Sequencers do get it wrong (albeit not often) and the bioinformatics is not a perfect science. People looking at the field from the outside should approach it with healthy caution.
It's somewhat the equivalent of saying that, in theory it's easy to launch a rocket into orbit. You have a concentration of reaction mass, you ignite it, it propels you to space. In reality, it's obviously infinitely more complex, much like genetics.
I think you'll see more people with these hybrid skills over time. They take years to develop, though.
If we want advancement in medicine, can we really do it without deep collaboration between biology and computer science? If so, isn't Color Genomics a model of particularly deep cross-disciplinary work, not academic imperialism?
There's an entire field for that already, it's called bioinformatics.
I think that medicine tends to treat us as patients instead of partners. Making testing much more affordable historically has done more good than bad. I think reducing the price of this testing by one or two orders of magnitude is a wonderful thing.
Cancer has genetic predisposition, environmental causes, and random chance all in enough quantity that none of these influences can be ignored.
That said, I'm extremely skeptical that this is a useful consumer-driven test. Far better to be driven by your physician who can decide if there's enough of a familial cancer history to drive the test, to potentially drive other precautions.
Germline genetic testing is useful for entertainment, or for when you have some phenotype that you're interested in investigating, or perhaps for deciding about having children.
Genomes aren't blueprints for an entire life, they're just the parts book. So germline genetic testing is not some sort of Gattaca-like predictor of a person, even if we knew what most of the genome does.
I don't think that it's a good idea if we end up in the scenario you just described, but what if they find something useful? I think that putting money into cancer research, no matter on what approach, is usually a good idea.
A lot of breakthroughs in the history of science came from unrelated areas of research such as this, where computer people are working in an (at first) non-computer field.
For (ii), see the debates about PSA tests. You may diagnose potential for cancer, but if the treatment causes more health problems than the disease, you've just created useless worry and costs with your test. The optimists see two things: (1) better tests that could tell whether the prostate cancer is problematic and (2) as lifespans increase, the likelihood of you living long enough for the cancer to be a problem increases, so if we are trying to live forever, then the PSA test results will matter eventually.
Back to (i), the false positives. As a disease becomes more rare, the more your false positive rate matters. If a disease affects 1 in 10k people, your 0.1% FPR will diagnose 10 times as many people wrongly as it does correctly. This is pretty damn common with genetic tests.
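To make the arithmetic concrete, here is a minimal sketch of the base-rate calculation using the numbers from the comment above (1 in 10k prevalence, 0.1% FPR). The 100% sensitivity is an assumption added for illustration; the comment does not state one, and the function names are made up.

```python
def positive_predictive_value(prevalence, sensitivity, false_positive_rate):
    """Fraction of positive results that are true positives (Bayes' rule)."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return true_pos / (true_pos + false_pos)


def false_to_true_ratio(prevalence, sensitivity, false_positive_rate):
    """Number of false positives expected for every true positive."""
    true_pos = prevalence * sensitivity
    false_pos = (1.0 - prevalence) * false_positive_rate
    return false_pos / true_pos


prevalence = 1 / 10_000   # disease affects 1 in 10k people (from the comment)
fpr = 0.001               # 0.1% false positive rate (from the comment)
sensitivity = 1.0         # ASSUMPTION: the test never misses a true case

ratio = false_to_true_ratio(prevalence, sensitivity, fpr)
ppv = positive_predictive_value(prevalence, sensitivity, fpr)

print(f"false positives per true positive: {ratio:.1f}")
print(f"chance a positive result is real:  {ppv:.1%}")
```

Under those assumptions roughly ten people are flagged wrongly for each one flagged correctly, so a positive result is real only about 9% of the time. A lower sensitivity would make the ratio worse.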
I think you can get somewhere data mining whole genome sequences, particularly if you sequence from healthy and diseased tissues in individuals and do exome sequencing as well, but that doesn't appear to be what Color Genomics actually does.
Far more interesting is the data that Color will collect if they grow a customer base, which is not something that many researchers have access to. However, if they're only collecting features and not the labels (cancer incidence) then it's pretty boring data.
Sure, they are concentrating on one of the many aspects of human biology but as long as their findings are mere recommendations to get tested more often etc based on the potential genetic predisposition I don't see anything wrong with that.
As far as the article is concerned - I'd be very cautious going into the "share your dna report with your employer" (or anyone else for that matter) territory for a variety of reasons.
Additionally, is this company performing research or just doing testing on known factors? If it's the latter then I am not clear about why this is so objectionable. How is this company different from the others that provide genetic testing through your healthcare provider?
I.e., 98% of breast cancer patients and 90% of breast cancer patients under 40 do not have cancer because of this genetic susceptibility.
This is by far the most common germline cancer mutation, and it is a severe, DNA damage repair mutation. There are few comparable mutations that will indicate such a severe risk. If you're NF1 mutant or have Li-Fraumeni (other germline cancer-causing mutations) you will have other disease syndromes before you develop cancer.
In the vast majority of cases, testing cancer susceptibility genetically will either (1) miss the largest non-genetic component of cancer susceptibility or (2) incorrectly indicate risk (i.e., a positive test does not mean as much).
"Copy number variant" refers to larger deletions and duplications that can occur in the genome. There isn't some specific cutoff for size, but some examples in these kinds of genes would be an entire exon or gene. There are countless studies that find correlations between specific variants or CNVs and risk of cancers.
Standard variant detection is pretty straightforward. CNVs are harder because they are longer (several hundred to several thousand base pairs) than the raw reads (150 to 250 bp for Illumina), so you don't get single reads that span the entire variant. You have to normalize the coverage and then look for differences in it, or look for split reads (where the read is aligned on the border of one of these CNVs).
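As a rough illustration of the "normalize, then look for differences in coverage" idea, here is a minimal read-depth sketch. It is not Color's internal algorithm (that is not public in this thread); the region names, depth values, median normalization and the 0.5-copy tolerance are all assumptions chosen for the example.

```python
from statistics import median


def copy_number_estimates(sample_depth, reference_depth, ploidy=2):
    """Estimate copy number per region from mean read depth.

    sample_depth / reference_depth: dict of region name -> mean depth,
    where reference_depth comes from a sample assumed to be copy-neutral.
    """
    # Divide each sample by its own median depth so that differences in
    # total sequencing yield between the two runs cancel out.
    sample_median = median(sample_depth.values())
    reference_median = median(reference_depth.values())
    estimates = {}
    for region, depth in sample_depth.items():
        ratio = (depth / sample_median) / (reference_depth[region] / reference_median)
        estimates[region] = ploidy * ratio
    return estimates


def call_cnvs(estimates, ploidy=2, tolerance=0.5):
    """Label regions whose estimate is far from the expected ploidy.

    The tolerance is a hypothetical threshold; real pipelines have to model
    noise and bias (GC content, capture efficiency) much more carefully.
    """
    calls = {}
    for region, copies in estimates.items():
        if copies < ploidy - tolerance:
            calls[region] = "loss"
        elif copies > ploidy + tolerance:
            calls[region] = "gain"
    return calls


# Made-up depths for five exons; exon3 has about half the expected coverage.
sample = {"exon1": 200, "exon2": 210, "exon3": 100, "exon4": 190, "exon5": 205}
reference = {"exon1": 100, "exon2": 105, "exon3": 100, "exon4": 95, "exon5": 100}

estimates = copy_number_estimates(sample, reference)
calls = call_cnvs(estimates)
print(estimates)
print(calls)
```

With these toy numbers only exon3 is flagged, with an estimate near one copy, which would be consistent with a heterozygous deletion of that exon.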
This kind of funding baffles me because they don't seem to be proposing anything new at all (maybe slightly better CNV detection?) and there are already lots of labs/companies doing this kind of testing. Maybe they are working on being very efficient to offer a better price.
Just out of curiosity and to follow up, presumably this is an example of the list of detected CNVs in a TCGA Breast Cancer data-set you're referring to: http://cancer.sanger.ac.uk/cosmic/gene/analysis?ln=BRCA1#cnv...
According to Sanger (or maybe TCGA?), a gain is when a genomic region (for a diploid) has more than five absolute copies of this region and a loss is when the genomic region has no reads (http://cancer.sanger.ac.uk/cosmic/help/cnv/overview), where the copy number is perhaps determined by that normalized distribution of read coverage across the reference genome?
(http://bmcbioinformatics.biomedcentral.com/articles/10.1186/...). This is for CNVs that are longer than the 150-200bp Illumina fragments (Fig1c. Read Depth method, e.g., exome#3 looks like it has two absolute copies vs exome #1 and #2)
Then for small CNVs that perhaps span that 150-200bp fragment, we use the split read method to filter for incompletely mapped reads that are only aligned on the edges to the reference. This implies that there was a duplication event that expanded that sequence? (Fig 1b. Split Read method).
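A minimal sketch of the split-read idea, assuming alignments are described by a 1-based leftmost position plus a SAM-style CIGAR string, and that a long soft clip (the "S" operation) marks a read that only partially maps. The read tuples, the 20 bp minimum clip and the two-read support threshold are hypothetical values for illustration.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REF_CONSUMING = set("MDN=X")  # CIGAR operations that advance along the reference


def clip_breakpoints(pos, cigar, min_clip=20):
    """Return reference positions where a read's alignment ends in a soft clip."""
    # Hard clips (H) are dropped so that a soft clip can be first or last.
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar) if op != "H"]
    breakpoints = []
    if ops and ops[0][1] == "S" and ops[0][0] >= min_clip:
        breakpoints.append(pos)  # first aligned reference base
    if ops and ops[-1][1] == "S" and ops[-1][0] >= min_clip:
        ref_len = sum(n for n, op in ops if op in REF_CONSUMING)
        breakpoints.append(pos + ref_len - 1)  # last aligned reference base
    return breakpoints


def candidate_edges(reads, min_support=2):
    """Positions supported by at least min_support clipped reads."""
    support = Counter()
    for pos, cigar in reads:
        support.update(clip_breakpoints(pos, cigar))
    return sorted(p for p, count in support.items() if count >= min_support)


# Made-up 150 bp reads: (leftmost position, CIGAR)
reads = [
    (1000, "100M50S"),  # aligned part ends at 1099
    (1020, "80M70S"),   # aligned part also ends at 1099
    (1500, "60S90M"),   # aligned part starts at 1500
    (1500, "45S105M"),  # aligned part starts at 1500
    (1200, "150M"),     # fully aligned, no evidence
    (1300, "5S145M"),   # clip too short to count
]

edges = candidate_edges(reads)
print(edges)
```

Positions where several reads are clipped at the same base are candidate CNV edges. Whether the event between two edges is a deletion or a duplication still has to be worked out from where the clipped portions map and from the read depth.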
Presumably, the pipeline would determine the CNV sites in a specific patient sample, then cross-reference them with the TCGA CNV data-set and come up with a correlation score of how well those CNV sites match the consensus CNVs in the cancer data-set? Thanks again for your detailed breakdown.
Whole genome is not controlled enough to meet clinical standards.
Can you explain this idea in more detail? Many CROs are already establishing preferred-provider relationships with investigators.
[Cost] In order to provide clinical results in a CAP/CLIA environment we need to ensure sufficient coverage to call all SNV/indels and all CNVs. This means we need much higher average coverage than 30X.
Since your exome is just 1-2% of your genome, and your exome encodes 20,000 to 25,000 genes, this means that 30 genes is <1% of your exome which itself is just <2% of your genome. So the Illumina sequencing costs of 30 genes alone are on the order of a fraction of a percent of the cost of doing whole genome (assuming you are using the same machines and the same coverage).
Now, there are some caveats to this, e.g.:
- You would use a slower, higher-output, more expensive machine like an X10 to do a whole genome at scale.
- You can save some costs by doing whole genome versus a targeted panel, as the pulldown step of the panel adds additional unique costs of its own.
- This does not include fixed costs of sample collection, or secondary confirmation, or other costs that increase the price per test. This does not include labor costs, bioinformatics, or other items where sometimes dealing with a whole genome is cheaper per bp of DNA than doing a smaller panel of genes.
- I think the "$1000 genome" isn't really here yet.
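The back-of-the-envelope estimate above can be written out as a few lines of arithmetic. This sketch uses the figures quoted in the comment (30 genes, about 20,000 genes in the exome, exome at most 2% of the genome) and assumes, purely for illustration, that all genes are the same size. The 300X panel coverage is a hypothetical value, not a number given by Color, and the caveats listed above are ignored.

```python
genes_in_panel = 30
genes_in_exome = 20_000          # low end of the 20,000 to 25,000 range quoted
exome_fraction_of_genome = 0.02  # high end of the 1-2% range quoted

# ASSUMPTION: every gene is the same size, so gene count stands in for bases.
panel_fraction_of_exome = genes_in_panel / genes_in_exome
panel_fraction_of_genome = panel_fraction_of_exome * exome_fraction_of_genome


def relative_sequencing_cost(fraction_of_genome, panel_coverage, genome_coverage):
    """Bases sequenced for the panel relative to a whole genome run."""
    return fraction_of_genome * (panel_coverage / genome_coverage)


same_coverage = relative_sequencing_cost(panel_fraction_of_genome, 30, 30)
# HYPOTHETICAL: a panel sequenced ten times deeper than a 30X genome.
deeper_panel = relative_sequencing_cost(panel_fraction_of_genome, 300, 30)

print(f"panel as share of exome:  {panel_fraction_of_exome:.2%}")
print(f"panel as share of genome: {panel_fraction_of_genome:.4%}")
print(f"relative cost at equal coverage: {same_coverage:.4%}")
print(f"relative cost at 10x the depth:  {deeper_panel:.3%}")
```

Even at ten times the depth, the raw sequencing for the panel stays a small fraction of a percent of a whole genome under these assumptions, which is why the fixed and per-sample costs in the caveats dominate the price.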
In a few years, the cost of sequencing the whole genome will be a few hundred dollars, at which point I think it makes sense to do the whole thing.
[Usefulness] It is important to note, however, that most of the genome is not very actionable right now. At Color our focus is on providing you with information you and your doctor can use, and most of your genome is not characterized well enough to be clinically useful. At this point, depending on who is doing the estimate, only 30-60% of the 20,000+ genes you have are ascribed to a function, and even then it is often unclear how impactful a mutation in those genes is....
I can sympathize with many of the regulatory and cost hurdles, especially for dealing with humans. We're in the purchasing process for a MiSeqDX and starting out specifically only on bacterial and viral sequences with an eventual path towards humans once we accomplish CLIA compliance. Long and costly effort...
Agree on the "$1000 genome" comment. Our average prep kit is ~$700 and, like you mentioned, once you factor in time, labor, computing costs, etc. the cost is well into the thousands.
I've had 4 relatives pass from various cancers which is what got me interested in the field to begin with. Truly hoping we can make some breakthroughs and it is encouraging to see startups such as yours pushing to make that happen. Best of luck!
The figure you linked is a good explanation. The split read method is helpful for finding the edges of the CNV, while the number of reads (relative to other regions that were tested) can give an idea of the number of copies. The problem is that these methods all have their own unique biases/noise that makes it non-trivial to figure out the absolute copy number change.
Ideally they would find a similar CNV that has some clinical association.
The DGV has a lot of reference CNVs. Here are some in BRCA1: http://dgv.tcag.ca/gb2/gbrowse/dgv2_hg19/?name=id:3087443;db...