As a high school student in Palo Alto, Chris Burge had two academic passions. He was fascinated by the life sciences. He was also a math whiz, a talent that earned him a lead role on his school’s prize-winning Math Olympiad team.
Later, having earned a B.S. in biology at Stanford, he needed to choose a field for graduate study. His decision? To combine his passions. “I wasn’t sure it was a good career move,” he says, “but it was the right match for my interests and skills.”
Happily for Burge, now an assistant biology professor with a Stanford doctorate in mathematical biology, his choice came at a time when the fields were intersecting as never before. The result is the rapid growth of what’s called either computational biology or bioinformatics.
The hybrid field’s emergence reflects the fact that a lot of biology involves information processing. Our genomes, for example, are similar to databases. Developmental or environmental cues tell a cell to make a protein (insulin, say, or hemoglobin). Chemical signals activate the relevant gene. The information harbored in that gene’s DNA bases is conveyed to the cell’s protein-making machinery. The protein comes off the cell’s assembly lines, its structure precisely reflecting the gene’s data.
When the data’s bad, of course, it can be trouble. Mutations in a gene dubbed p53 play a role in fully half of human cancers. So finding the genes (estimates of the total range from 25,000 to 100,000-plus) is vital.
But it’s a notoriously tough problem. Our genome, for one thing, has about 3 billion DNA bases. (How many is that? The writer Ben Bova notes that if you assigned a letter to each base, put 4,000 letters on a page and assembled the pages into books, they’d take up 75 feet of shelf space.) A further problem: the DNA that actually works exists in small stretches along that 3-billion-base expanse.
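For readers who want to check the scale, the figures quoted above can be run through a few lines of Python. This is only back-of-the-envelope arithmetic on the numbers in the text; the pages-per-foot value is implied by those numbers, not stated by Bova.

```python
# Figures quoted in the text.
GENOME_BASES = 3_000_000_000   # about 3 billion DNA bases
LETTERS_PER_PAGE = 4_000       # one letter per base, 4,000 letters per page
SHELF_FEET = 75                # shelf space quoted in the text

# Number of pages needed to print one letter per base.
pages = GENOME_BASES // LETTERS_PER_PAGE

# Page density that the 75-foot figure implies (derived, not quoted).
pages_per_foot = pages / SHELF_FEET

print(f"{pages:,} pages, or about {pages_per_foot:,.0f} pages per foot of shelf")
```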
What’s more, as MIT Nobelist Phillip Sharp showed, genes themselves are segmented. They include working stretches of DNA called exons and longer, largely inactive chunks called introns.
“A typical human gene has about 10 exons,” notes Burge, “and they’re usually small, 100 to 200 bases long. Introns, on the other hand, are often much larger, up to 100,000 bases or more.”
So while sequencing the human genome was a great feat, it basically means we now know all the letters in our genetic database. Identifying which letters make up meaningful sentences and which are simply genetic gibberish is at best half done.
Burge relies on the fact that the roughly 12,000 genes already found have yielded secrets about genes in general. Certain six-base sequences, for example, turn up in genes several times as often as in the gibberish. Using these and other clues, Burge creates computer programs that comb through DNA sequences searching for genes. The search can take a lot of computer firepower, as when Burge and his co-workers matched information from a database of some half-million protein sequences against all 3 billion DNA bases in the human genome. “The computer cluster we used was the equivalent of about 200 desktop machines,” he notes, “and it took a week to do the whole run.” But such efforts can also further the quest for genes.
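The six-base clue can be illustrated with a toy example. The Python sketch below is not Genscan and not Burge's method, which the article does not describe in detail. It simply assumes we have some known gene sequences and some non-gene "background" sequences, counts every six-base sequence in each, and scores a stretch of DNA by how gene-like its six-base sequences are. All function names, the window sizes, and the pseudocount are hypothetical choices made for illustration.

```python
import math
from collections import Counter
from itertools import product

K = 6  # the "six-base sequences" mentioned in the text


def kmer_counts(seqs, k=K):
    """Count every overlapping k-base sequence in a list of DNA strings."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1
    return counts


def log_odds_table(gene_seqs, background_seqs, k=K, pseudocount=1.0):
    """Log of (frequency in genes / frequency in background) for each k-mer.

    The pseudocount (an assumption of this sketch) keeps unseen k-mers
    from producing a division by zero.
    """
    gene = kmer_counts(gene_seqs, k)
    background = kmer_counts(background_seqs, k)
    all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    gene_total = sum(gene[m] for m in all_kmers) + pseudocount * len(all_kmers)
    bg_total = sum(background[m] for m in all_kmers) + pseudocount * len(all_kmers)
    return {
        m: math.log(((gene[m] + pseudocount) / gene_total)
                    / ((background[m] + pseudocount) / bg_total))
        for m in all_kmers
    }


def window_score(window, table, k=K):
    """Sum the log-odds of every k-mer in a window; higher means more gene-like.

    K-mers containing letters other than A, C, G, T contribute nothing.
    """
    window = window.upper()
    return sum(table.get(window[i:i + k], 0.0)
               for i in range(len(window) - k + 1))


def scan(dna, table, window=60, step=30, k=K):
    """Slide a window along the DNA and return (start, score) pairs."""
    return [(start, window_score(dna[start:start + window], table, k))
            for start in range(0, len(dna) - window + 1, step)]


# Made-up toy data: the "gene" and "background" strings are not real sequences.
toy_table = log_odds_table(["ATGGCC" * 10], ["TTTAAA" * 10])
toy_dna = "TTTAAA" * 10 + "ATGGCC" * 10
toy_scan = scan(toy_dna, toy_table)
best_start = max(toy_scan, key=lambda pair: pair[1])[0]
```

On the toy data, the highest-scoring window starts where the gene-like half of the string begins. Real gene finders combine many more signals than this, as the text notes with "these and other clues."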
Thus, on one well-studied human chromosome, Burge’s system, called Genscan, predicted almost all of the 500 genes that another group identified using traditional molecular methods. A gratifying result, except Burge’s program also predicted 300 additional genes. “Does our program predict genes that aren’t really genes?” he asks. “Or, are there a lot more genes we didn’t know about?”
With collaborators, Burge showed that many of those 300 genes were real. Of course, this was just a step on a journey that’s sure to be long, but that’s bound to have far-reaching effects. “To do something about diseases with genetic roots,” notes Burge, “you first have to label what is a gene and what isn’t.”