Talk to neuroscientist Tomaso Poggio for any length of time, and you’re likely to learn more than one unexpected fact about brains, minds, or machines. Like, for example, the fact that the size of a fruit fly’s brain—when the number of neurons is plotted logarithmically—lies almost exactly halfway between the human brain and no brain at all. “When I started my scientific career, I studied the brain of the fly,” says Poggio. Nowadays, investigating that space between “brain” and “no brain” is what drives Poggio, the Eugene McDermott Professor of Brain and Cognitive Sciences, as he directs the Center for Brains, Minds, and Machines (CBMM), a multi-institutional collaboration headquartered at MIT’s McGovern Institute for Brain Research.
CBMM’s mission, in Poggio’s words, is no less than “understanding the problem of intelligence—not only the engineering problem of building an intelligent machine, but the scientific problem of what intelligence is, how our brain works and how it creates the mind.” To Poggio, whose multidisciplinary background also includes physics, mathematics, and computer science, the question of how intelligence mysteriously arises out of certain arrangements of matter and not others “is not only one of the great problems in science, like the origin of the universe—it’s actually the greatest of all, because it means understanding the very tool we use to understand everything else: our mind.”
One of Poggio’s primary fascinations is the behavior of so-called “deep-learning” neural networks. These computer systems are very roughly modeled on the arrangement of neurons in certain regions of biological brains. A neural network is termed “deep” when it passes information among multiple layers of digital connections in between the input and the output. These hidden layers may number anywhere from the dozens to the thousands, and their unusual pattern-matching capabilities power many of today’s “artificial intelligence” applications—from the speech recognition algorithms in smartphones to the software that helps guide self-driving cars. “It’s intriguing to me that these software models, which are based on such a coarse approximation of neurons and have very few biologically based constraints, not only perform well in a number of difficult pattern-recognition problems—they also seem to be predicting some of the properties of actual neurons in the visual cortex of monkeys,” Poggio explains. The question is: why?
The truth is, nobody knows—even as the technology of deep learning accelerates at an ever-quickening pace. “The theoretical understanding of these systems is lagging behind the application,” says Lorenzo Rosasco, a machine learning researcher who collaborates with Poggio at CBMM. To Poggio, this gap in fundamental theory is “pretty typical” for doing groundbreaking science. “People didn’t really understand at first why a battery works or what electricity is—it was just experimentally found,” he explains. “Then from studying it, there is a theory that develops, and this is what is important for further progress.”
What Coulomb and Ohm did for electricity, Poggio wants to do for deep neural networks: to begin defining a theory. He, Rosasco, and a dozen other CBMM collaborators recently published a set of three papers that does just that. The field of machine learning already has several decades’ worth of theoretical understanding applied to what Poggio calls “shallow” neural networks—generally, systems with only one layer in between the input and output. But deep-learning networks are much more powerful (as the latest tech-industry headlines readily confirm). “Basically there is no good theory for why deep networks work better than these one-layer networks,” Poggio says. Each of his three papers addresses one piece of that theoretical puzzle—from the technical details all the way up to their (in Poggio’s words) “philosophical” implications.
Breaking the Curse
The first paper in the trio has a disarmingly layman-friendly title: “Why and When Can Deep—but Not Shallow—Networks Avoid the Curse of Dimensionality: A Review.” This “curse” may sound like something J. K. Rowling might dream up if she were writing a physics textbook. But it’s actually a well-known mathematical thorn in the side of any researcher who’s had to tangle with large, complex sets of data—precisely the kind of so-called big data that deep-learning networks are increasingly being used to make sense of in science and industry.
“Dimensionality” refers to the number of parameters that a data point contains. A point in physical space, for example, exists in three dimensions defined by length, height, and depth. Many phenomena of interest to science, however—for example, gene expression in an organism or ecological interactions in an environment—generate data with thousands (or more) parameters for every point. “These parameters are like knobs or dials” on a complicated machine, says Poggio. To model these “high-dimensional” systems mathematically, equations are needed that can specify every possible state of every available “knob.” Mathematicians have proven that a one-layer neural network can—in theory—model any kind of system to any degree of accuracy, no matter how many of these dimensions (or “knobs”) it contains. There’s just one problem: “it will take an enormous amount of resources” in time and computing power, Poggio says.
Deep neural networks, however, seem to be able to escape this “curse of dimensionality” under certain conditions. Take image-classifying software, for example. A deep neural network trained to detect the image of a school bus in a 32-by-32 grid of pixels would be considered primitive by contemporary standards—after all, smartphone apps can routinely recognize faces in photos containing millions of pixels. And yet the number of parameters, or “knobs,” in even that 32-by-32 pixel grid is astronomical: “a one followed by a thousand zeros,” says Poggio. Why can deep-learning networks handle such seemingly intractable tasks with aplomb?
To Poggio and Rosasco (who co-authored the first paper with colleagues from the California Institute of Technology and Claremont Graduate University), the answer may reside in a special set of mathematical relationships called compositional functions.
A function is any equation that transforms an input to an output: for example, f(x) = 2x means “for any number given as an input, the output will be double that number.” A compositional function behaves the same way, except that instead of using numbers as inputs, it uses other functions—creating a structure that resembles a tree, with functions composed from other functions, and so on.
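As a rough illustration of the tree structure described above, here is a minimal Python sketch. The function `double` is the article's f(x) = 2x; the two-input function `h` and the helper `compositional` are hypothetical names chosen only for this example, and the sketch assumes the number of inputs is a power of two so that the tree is balanced.

```python
def double(x):
    """The article's example, f(x) = 2x."""
    return 2 * x


def h(a, b):
    """Hypothetical two-input constituent function (chosen only for illustration)."""
    return a * b + 1


def compositional(inputs):
    """Combine inputs pairwise, level by level, like a binary tree.

    Assumes the number of inputs is a power of two.
    """
    level = list(inputs)
    while len(level) > 1:
        # Each pass builds the next level of the tree from adjacent pairs.
        level = [h(level[i], level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]


print(double(double(3)))            # f(f(3)) = 12
print(compositional([1, 2, 3, 4]))  # h(h(1, 2), h(3, 4)) = h(3, 13) = 40
```

Each constituent function only ever sees two values at a time, which is the property that a layered network can mirror one level per layer.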
The mathematics of this tree can become incredibly complicated. But, significantly, the hierarchical structure of compositional functions mirrors the architecture of deep neural networks—a dense web of layered connections. And it just so happens that computational tasks that involve classifying patterns composed of constituent parts—like recognizing the features of a school bus or a face in an array of pixels—are described by compositional functions, too. Something about this hand-in-glove “fit” among the structures of deep neural networks, compositional functions, and pattern-recognition tasks causes the curse of dimensionality to disappear.
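To get a feel for the size of the gap, the sketch below compares two growth rates. The article itself does not state any formulas; the expressions used here (roughly eps^(-n/m) units for a shallow network versus roughly (n - 1) * eps^(-2/m) for a deep network matched to a binary-tree compositional function) are the form in which the paper's bounds are commonly summarized, so treat the exact exponents as an assumption. Here `n` is the number of inputs, `m` a smoothness parameter, and `eps` the target accuracy; constant factors are ignored, and the function names are hypothetical. The computation works in base-10 logarithms to avoid overflow.

```python
import math


def log10_shallow_units(n, m, eps):
    """log10 of eps ** (-n / m): assumed growth rate for a one-layer network."""
    return (n / m) * math.log10(1 / eps)


def log10_deep_units(n, m, eps):
    """log10 of (n - 1) * eps ** (-2 / m): assumed growth rate for a deep,
    binary-tree-structured network."""
    return math.log10(n - 1) + (2 / m) * math.log10(1 / eps)


n = 32 * 32   # one input per pixel of the 32-by-32 grid
m = 1         # assumed smoothness
eps = 0.1     # assumed target accuracy

print(log10_shallow_units(n, m, eps))  # number of zeros after the leading digit, shallow case
print(log10_deep_units(n, m, eps))     # same quantity for the deep case
```

Under these assumed values the shallow figure has on the order of a thousand zeros, in line with Poggio's description of the 32-by-32 grid, while the deep figure is around a hundred thousand.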
Not only does Poggio’s theory provide a roadmap for what kinds of problems deep-learning networks are ideally equipped to solve—it also sheds light on what kinds of tasks these networks probably won’t handle especially well. In an age when “artificial intelligence” is often hyped as a technological panacea, Poggio’s work demystifies neural networks. “There’s often a suggestion that there is something ‘magical’ in the way deep-learning systems can learn,” says Rosasco. “This paper is basically saying, ‘Okay, there are also some other theoretical considerations that actually seem to be able to, at least qualitatively, make sense of this.’” In other words, if a complicated task or problem can be described using compositional functions, a deep neural network may be the best computational tool to approach it with. But if the problem’s complexity doesn’t match the language of compositional functions, neural networks won’t “magically” handle it any better than other computer architectures will.
Poggio’s other two theoretical papers also use clever mathematics to attempt to bring some other “magical”-seeming features of deep neural networks down to earth. The second paper uses an algebra concept called Bézout’s theorem to explain how these networks can be successfully trained (or “optimized”) using what conventional statistics practices would deem to be low-quality data; the third explains why deep-learning systems, once trained, are able to make relatively accurate predictions about data they haven’t been exposed to before using a method that Poggio likens to a machine-learning version of Occam’s razor (the philosophical principle that states that simpler explanations for a phenomenon are more likely to be true than complicated ones).
For Poggio, the implications of these theories raise “some interesting philosophical questions” about the similarities between our own brains and the deep neural networks that “crudely” (in his words) model them. The fact that both deep-learning networks and our own cognitive machinery seem to “prefer” processing compositional functions, for example, strikes Poggio as more than mere coincidence. “For certain problems like vision, it’s kind of obvious that you can recognize objects and then put them together in a scene,” he says. “Text and speech have this structure, too. You have letters, you have words, then you compose words in sentences, sentences in paragraphs, and so on. Compositionality is what language is.” If deep neural networks and our own brains are “wired up” in similar ways, Poggio says, “then you would expect our brains to do well with problems that are compositional”—just as deep-learning systems do.
Can a working theory of deep neural networks begin to crack the puzzle of intelligence itself? “Success stories in this area are not that many,” admits Rosasco. “But Tommy [Poggio] is older and braver than me, so I decided, ‘Yeah, I’ll follow him into it.’” Speaking for himself, Poggio certainly sounds like an enthusiastic pioneer. “You want a theory for two reasons,” he asserts. “One is basic curiosity: Why does it work? The second reason is hope: that it can tell you where to go next.”
Sounds like Professor Poggio has an interesting day job! 🙂 Terrific report on a fascinating topic. Very glad to get this kind of thing in my email from MIT!
I would question, though, whether it’s truly obvious that “you recognize objects and then put them together in a scene”. It seems equally true that things are known by the context a scene gives them, like blinding white sand on a tropical beach vs. snow on a mountain. The context defines interpretation of the sensory data.
It brings to mind Plato’s famous Theory of Forms, which every artist instinctively channels. Things are recognized by their wholeness (and in a context) less than by the summation of their parts. A portrait painter sees a set of ideas called a “face”, not a million pixels. I wonder (naively to be sure) if the method of compositional functions somehow shapes or constrains interpretation of raw stimulus into some sort of lucid but hidden formal construction that becomes a pathway for perception. Frankly I have no idea. I’m only now learning MATLAB and digging into some of the beginner Deep Learning tutorials. So apologies if my comment is beyond naïve.
But it seems to me human perception and understanding works backwards (from context to object) as well as forwards (summing objects into a context). Clearly the context has to be formed at some point and in some way. A baby may arguably do it primarily from objects to context — but there is something about intelligence that seems to exist beyond the brain. As if the brain itself is like a radio and intelligence is a form of n-dimensional wave the radio plays. That’s a very Platonic conception, I admit. But intelligence is all across nature — from animals to even plants. They all do fantastically difficult, complex things. The human brain is one radio, but there are other radios. And they all may tune in and play some sort of common natural phenomenon called “intelligence”. And intelligence may not be essentially a project of the assemblage of parts, but equally a perception of wholes that give meaning to the parts, and where the recognition of wholes is a property of the waves played by the radios from some source beyond brain and mind. Of course, this line of thinking is 2,400 years old . . . . I didn’t invent it!
I am by no means a deep learning expert, just love great articles such as this one about fascinating work such as Poggio’s. When I read or think about this (and other subjects exhibiting similar compositional function aspects) I am always reminded of D. Hofstadter’s concept of “strange loops”: the fact that as long as things are somehow hierarchical, tree-shaped, “composable” from other things, we can get our hands, eyes, ears, and senses around them but as soon as “strange loops” get in the game, hierarchical deduction no longer works — and deep learning networks may become clueless.
Could that have something to do with computability? Some problems may be incredibly hard to crack but given exponential time and computing power they remain at least theoretically crackable even if they are not P-polynomial — if only by brute force — whereas other problems exhibiting some “circular” aspects are no longer deducible and thus intrinsically uncrackable, beyond NP, exponential, and unlimited time and computing budgets because of potentially endless “strange loops”?
This may just be a naive random rumination but it keeps nagging me when exploring (superficially) so many topics that all seem to bump against some circularity frontier.
Each layer in a deep neural network is a linear mapping followed by distance-dependent pulling in the directions of the various attractor states. You have decision regions defined by the attractor states and the boundaries between those. That also results in information loss by partial quantization just as rounding 3.142 to 3 does.
For deep networks, then, you have to maintain contact with the input (e.g., ResNets) to avoid discarding relevant information before it can be used. It also ties in directly with the ODE viewpoint on deep networks.
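A minimal sketch of what "maintaining contact with the input" can look like, assuming a toy layer transformation `f` (a fixed scaling followed by tanh) that is purely hypothetical. A residual layer adds its input back to the transformed value, and the resulting update x + step * f(x) has the same form as one Euler step for the equation dx/dt = f(x), which is one common way to read the ODE viewpoint.

```python
import math


def f(x):
    """Hypothetical layer transformation: a fixed scaling followed by tanh."""
    return [math.tanh(0.5 * v) for v in x]


def plain_layer(x):
    """A plain layer passes on only the transformed values."""
    return f(x)


def residual_layer(x, step=1.0):
    """A residual layer carries the input forward: x + step * f(x).

    This is also the form of a single Euler step for dx/dt = f(x).
    """
    return [v + step * fv for v, fv in zip(x, f(x))]


x = [2.0, -4.0]
print(plain_layer(x))     # values squashed into (-1, 1)
print(residual_layer(x))  # original values shifted by the squashed amounts
```

In the plain layer, large inputs all end up close to 1 or -1 and become hard to tell apart; in the residual version the original magnitudes remain visible to later layers.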
Maybe there are more efficient algorithms to be found.
i like this article, it has a lot of info i need to know, thanks for sharing