Digital People: From Bionic Humans to Androids (2004)


7
The Five Senses, and Beyond

We apprehend the world and each other through our senses; without them, we could think, perhaps, but we could not deal with physical reality or engage one another. Similarly, an artificial being needs more than a silicon brain, more than metal limbs and plastic muscles. As a creature in motion, it must understand its environment in order to move freely and intelligently. To deal with humans, it must respond to their presence and communicate with them. These functions require sensory apparatus, backed up by cognitive facilities that interpret what is sensed and make intelligent decisions about interacting with the world.

Humans make such decisions based on vision, hearing, touch, taste, and smell. (Broadly defined, touch includes the tactile sense of pressure, along with sensitivity to heat, cold, and pain, as well as the kinesthetic senses that track the position of the limbs, bodily posture, and balance. These are often clustered together as the haptic senses, from a Greek root meaning “to touch.”) Each of these human senses has an artificial counterpart, but a digital creature can be effective without the full set, although a true android would need all five. On the other hand, artificial beings might employ senses humans lack, such as batlike sonar “vision” and sensitivity to radio waves.


We can hardly imagine an artificial being without some form of vision, which is deeply embedded in us. Much of the human cortex is devoted to visual cognition, far more than to any other sensory mode. Vision is our most effective means of exploring our surroundings, from detailed closeups to distant panoramas, and, through our superb ability to recognize faces and their expressions, it is critical for social interaction. (People who suffer from the neurological condition called prosopagnosia, the inability to recognize faces, lead difficult lives. One sufferer tells of failing to identify his own mother, who never forgave him.) At a more abstract level, vision is an element in creating mental imagery, because the “mind’s eye” uses some of the same mental facilities that carry out visual cognition.

We consider hearing to be our second most important sense. Like vision, it provides us with information about our surroundings, although to a lesser extent than in many animals. Working hand-in-glove with the power of speech, it is an important part of human communication, and although many animals use sound to communicate, language is a preeminent human ability—along with vision, one of our highest mental functions. Just as the act of seeing goes beyond the mere reception of light waves and attaches meaning to the images the waves form, meaningful speaking and listening go beyond the mere production and reception of sound waves.

Touch, taste, and smell require less mental processing than vision and hearing, and they engage the world more directly. With vision and hearing, we receive only energy; nothing material enters the body. Taste and smell, however, are the chemical senses that react to molecules actually penetrating the body. Tactile sensors in the skin also physically contact reality, determining what is hard or soft, hot or cold, enabling the hands to actively grasp and shape objects, and providing the emotional warmth of the human touch.

Emulating vision, hearing and speech, and haptic abilities would go far toward producing an effective artificial creature—possibly one that could develop further through its embodiment, as Rodney Brooks has proposed. This program omits smell and taste, which are essential for many living beings, as in the exquisite sense of smell in dogs, or
the constant sampling of water by certain fish whose skin is covered by taste buds. Many animal species use pheromones, substances that transmit information from one creature to another by odor. Smell and taste do not play similarly important roles for humans, so these senses might seem like mere frills for artificial beings.

Nevertheless, artificial smell is important for uses such as detecting contaminants in air or water and can take on additional meaning because the sense of smell is linked to the fabric of thought. The human olfactory system has complex neural pathways, some going to the limbic system of the brain. This is a collection of interacting parts that appeared early in the evolution of the mammalian brain and is strongly tied to instincts and feelings. Odors can be powerfully evocative because they speak directly to this ancient core. This might seem irrelevant to machine thought, which we tend to characterize as rational rather than emotional. But a variety of evidence shows that reason and emotion are connected in our own brains and minds, as I will discuss in Chapter 8. True artificial thought might also require both and might be enriched by a layer of nonrational but valid meaning entering the brain through the sense of smell.

For now, though, artificial taste and smell are at an early stage where sensors are still being developed. This is also partly true for touch. However, we already have digital hardware that can detect and manipulate light, and sense and produce sound. Progress in artificial hearing, speech, and vision focuses on the cognitive abilities that support these three vital functions.

SEEING INTO KNOWING

Creating synthetic vision as powerful as the natural version is not easy, partly because the human eye is a remarkable optical instrument, with high resolution, the ability to distinguish millions of colors, and a variable focal length. But these features are enormously enhanced by the mind. Under mental control (largely at an unconscious level) your eyes automatically refocus to provide clear vision from near to far, and they constantly move, to ensure that the portion of the retina with the highest resolution points at the most significant part of a scene.


It takes further mental effort to interpret the information these actions bring into the brain. The brain must learn to see, a complex process that begins early in life. How difficult this is even for the powerful visual cortex is illustrated by a real-life case related by the neurologist and writer Oliver Sacks—the tale of a middle-aged man who miraculously regained his sight after decades of blindness, but who found that eyesight alone was not enough; he also needed a brain that had learned to understand visual information. Although he struggled hard to comprehend the world visually, it was too late for him to master this ability.

Given the enormous demands vision places on the brain, it is not surprising that it takes massive computing capacity for a machine to match human vision. Hans Moravec notes that early AI researchers were ready to believe that given the right software, machine minds could be made fully intelligent. “Computer vision convinced me otherwise,” he now writes, adding,

Each robot’s-eye glimpse results in a million-point mosaic. Touching every point took our computer seconds, finding a few extended patterns consumed minutes, and full stereoscopic matching of the view from two eyes needed hours. Human vision does vastly more every tenth of a second.

Performing the equivalent of human vision in real time typically requires a computer executing billions of instructions per second. Early computers were incapable of handling streams of visual data and interpreting them on reasonable time scales; in the late 1960s and early 1970s, it took hours for the pioneering robot Shakey to calculate its actions as it scanned its surroundings.

Now cheap, readily available microprocessors can handle visual information at high speeds, and a laptop computer can perform aspects of visual cognition in real time. Larry Matthies, who runs the Machine Vision group at the Jet Propulsion Laboratory, says that computers are now so fast that even complex programs for machine vision can be rapidly executed. Philosophical differences about top-down versus bottom-up or other approaches to artificial vision, he adds, have “very quickly become outdated. Because we’ve got fast enough machines you can do better vision, more reasoning—and that’s the solution.”


Video cameras are the eyes of these fast processors, capturing images in digital form; that is, as streams of bits representing the position and color of each picture element or “pixel” in a video frame. A pixel is the smallest unit in an electronic display. It takes about a million pixels to form an image on a computer screen, just as a myriad of individual colored tiles forms a wall mosaic. (To be exact, computer monitors typically display 1,024 × 768 or 1,280 × 1,024 pixels, counted horizontally by vertically.) Even fewer pixels per frame are adequate for many uses, and that lower resolution is easily achieved with inexpensive Web cameras that routinely send video over the Internet.
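To put numbers to “about a million pixels,” here is the simple arithmetic for the two monitor resolutions just mentioned (a trivial check of my own, not anything drawn from camera hardware):

```python
# Pixel counts for the two common monitor resolutions cited above.
print(1_024 * 768)      # 786,432 pixels
print(1_280 * 1_024)    # 1,310,720 pixels
```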

Other approaches work differently from the eyes; they examine the environment actively rather than passively. One method employs low-power infrared lasers mounted on the artificial being. When the laser beams strike an object, they are reflected back to sensors mounted on the being, where their time of flight is analyzed to find the object’s range and bearing. Another approach emulates the echolocation used by bats and porpoises. These creatures generate high-frequency (ultrasonic) sound waves and listen for the echoes, which their brains analyze to characterize their surroundings. A similar process operates in sonar (sound navigation and ranging) as used by nuclear submarines, and some robots use sonar as well.
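The arithmetic behind both the laser and the sonar approach is the same time-of-flight calculation. Here is a minimal sketch in Python; the timings are invented for illustration, and no particular robot's hardware is assumed:

```python
# A minimal sketch of time-of-flight ranging: a pulse travels out, bounces
# back, and the round-trip time reveals the distance. Values are illustrative.

SPEED_OF_LIGHT = 299_792_458.0   # meters per second (laser ranging)
SPEED_OF_SOUND = 343.0           # meters per second in air (sonar)

def range_from_echo(round_trip_seconds: float, wave_speed: float) -> float:
    """Distance to the object: the pulse goes out and back, so halve the trip."""
    return wave_speed * round_trip_seconds / 2.0

# An ultrasonic echo returning after 30 milliseconds ...
print(range_from_echo(0.030, SPEED_OF_SOUND))   # about 5.1 meters
# ... versus a laser pulse returning after 30 nanoseconds.
print(range_from_echo(30e-9, SPEED_OF_LIGHT))   # about 4.5 meters
```

Halving the round trip gives the range; the bearing comes from knowing where the emitter was pointed when the echo returned.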

There are also new ways to interpret sensory data, such as the promising approach called probabilistic robotics. According to Sebastian Thrun (then at Carnegie Mellon University, and now at Stanford), it uses the fact that “robots are inherently uncertain about the state of their environments,” because of limitations in their sensors, random noise, and the unpredictability of the environments themselves, caused by, for example, the movement of people within the creature’s visual field. Instead of calculating exactly what to do next, the being accommodates its uncertainty by determining a range of possibilities. As its sensors gather more data, the being’s calculations converge to a high level of confidence about its physical location and other quantities. This method takes more computer time than direct approaches, but today’s computers are up to the task. As Thrun notes, the payoff is that a probabilistic robot can


gracefully recover from errors, handle ambiguities, and integrate sensor data in a consistent way. Moreover, a probabilistic robot knows about its own ignorance—a key prerequisite of truly autonomous robots.

These sterling qualities sound like a working definition of mature human wisdom, and could provide a superior basis for a high level of robotic intelligence.
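A toy example makes the probabilistic idea concrete. The sketch below is my own illustration, not Thrun's software: a robot tracks its belief about where it stands in a five-cell corridor, and each noisy sensor reading reshapes the probabilities rather than forcing a single guess.

```python
# A toy Bayes-filter localization: keep a probability for every possible
# position and sharpen it as sensor readings arrive. The corridor map,
# sensor reliabilities, and motion are all assumed for illustration.

world = ['door', 'wall', 'door', 'wall', 'wall']   # hypothetical corridor map
belief = [1.0 / len(world)] * len(world)           # start fully uncertain

def sense(belief, measurement, p_hit=0.8, p_miss=0.2):
    """Bayes update: weight each position by how well it explains the reading."""
    weighted = [b * (p_hit if cell == measurement else p_miss)
                for b, cell in zip(belief, world)]
    total = sum(weighted)
    return [w / total for w in weighted]

def move(belief, steps):
    """Shift the belief along with the robot (a perfect cyclic move, for brevity)."""
    n = len(belief)
    return [belief[(i - steps) % n] for i in range(n)]

belief = sense(belief, 'door')   # the robot sees a door
belief = move(belief, 1)         # it rolls one cell to the right
belief = sense(belief, 'wall')   # now it sees a wall
print([round(b, 2) for b in belief])
# belief now peaks at the two cells consistent with both readings
```

Notice that the robot ends up not with one answer but with two plausible cells and explicit odds for each; that is exactly the "knowing about its own ignorance" Thrun describes.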

Despite these and other advances, no artificial being so far displays general visual comprehension at the human level, but artificial vision works well within certain categories essential for beings that are mobile or meant to interact with people.

FROM HERE TO THERE

To move from one location to another, an artificial being must know its starting position, plan a route, and make the journey without hitting anyone or anything—hence localization, mapping, and obstacle avoidance form a basic set of visual abilities. Using these abilities, more or less autonomous mobile digital beings are becoming almost common sights in a variety of arenas—the home, hospitals, museums, the battlefield, and on distant planets as part of NASA’s exploration of space.

NASA cannot yet send astronauts to other planets, so the agency has pioneered in developing mobile robotic stand-ins for human explorers. The Robonaut unit described earlier is not one of these stand-ins, because the focus is on moving its arm and hand rather than its whole body, and its visual cognition comes from a human operator. But the Sojourner rover, a small, wheeled unit delivered to Mars by the Pathfinder mission, began examining Martian rocks on July 4, 1997, and was the first in a series of mobile exploration robots with visual abilities.

The latest NASA mission to Mars began with two spacecraft launched in June and July, 2003, each carrying a new rover. In January 2004, the spacecraft delivered these nearly identical robots—dubbed Spirit and Opportunity—to two widely separated areas of the planet, carefully chosen because they show signs that liquid water might have flowed there in the ancient past. If the robots determine that liquid
water once existed on Mars, they will have found an important indicator for the existence of past Martian life.

Like Sojourner, Spirit and Opportunity carry instruments to examine rocks and soil, in the hope of finding detailed geological evidence for the past presence of water. However, the new rovers travel much faster than Sojourner did, covering in three Martian days the same 100 meters (330 feet) that Sojourner took 12 weeks to cover. An Earth-based controller can send a radio message to a rover telling it what to examine, but even at the speed of light, radio waves from Earth take minutes to reach Mars, making it impossible to drive the robot in real time. Thus an exploring rover is on its own and must see well enough to safely reach a specified site over rough terrain.

A rover does this by first determining its present location. It could do so by tracking every turn of its wheels since leaving its landing site, like an automobile odometer. However, wheels tend to slip on rocks and sand so instead the rover uses what Larry Matthies calls “visual odometry.” Seeing the world in three dimensions through two video cameras, as we do through our eyes, it maps the peaks and valleys, the rough and smooth areas of its neighborhood. Then it selects a prominent benchmark feature, perhaps a tall rock with a distinctive shape that it can recognize from varied distances and angles. Referring to this landmark, the unit can determine where it is to within 1 percent of the distance it has traveled. After establishing its location, the rover plans its trek to the target area. Like a human mountain climber scanning the terrain ahead for the best route, it examines its three-dimensional map to determine surface roughness, grade steepness, and obstacles, and selects the best path.
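The depth perception that the two cameras provide follows from a standard bit of stereo geometry: nearby features shift more between the left and right images than distant ones. Here is a minimal sketch, with a made-up focal length and camera spacing rather than the rovers' real calibration:

```python
# Pinhole-stereo range finding: range = focal_length * baseline / disparity.
# The focal length, baseline, and disparities below are illustrative numbers.

def depth_from_disparity(focal_length_px: float,
                         baseline_m: float,
                         disparity_px: float) -> float:
    """Distance to a feature from its pixel shift between the two cameras."""
    return focal_length_px * baseline_m / disparity_px

f = 600.0    # focal length in pixels (assumed)
B = 0.20     # spacing between the two cameras in meters (assumed)

# A rock whose image shifts 24 pixels between the left and right views ...
print(depth_from_disparity(f, B, 24.0))   # 5.0 meters away
# ... while a distant hill shifting only 2 pixels is much farther.
print(depth_from_disparity(f, B, 2.0))    # 60.0 meters away
```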

If all goes as planned, this version of autonomous robot vision will play a central part in a mission costing $800 million. NASA sees the current mission as a prelude to a 2009 one, where an even more capable rover will move to selected rocks, pick them up, and carry them back to a spacecraft that will return those pieces of Mars to Earth.

Selecting visual landmarks for navigation also works on Earth. Paolo Pirjanian, Chief Scientist of California-based Evolution Robotics, Inc., sees the method as a boon for everyday use. Although robots are now used in applications such as delivering hospital supplies, they require training to familiarize them with their particular environment, possibly relying on effective but expensive laser rangefinders. Pirjanian and his colleagues propose an alternative they call visual simultaneous localization and mapping (VSLAM), which might be suitable for consumer products because it uses inexpensive video cameras.

A VSLAM robot gets its bearings in bootstrap fashion. It begins by taking pictures of recognizable features like furniture, and holds them in a database. Initially the robot estimates the landmarks’ locations and its own through wheel odometry. As it continues mapping, it compares whatever its camera registers to its database. When a match occurs, the unit uses probabilistic methods to recalculate the landmarks’ positions and its own. The interplay between these updates steadily refines the robot’s knowledge, leading to a final accuracy of about 10 centimeters (4 inches) in its position, and 5 degrees in its direction of motion. Unlike robots that find their way by means of a fixed internal map, VSLAM can also deal with change: If there is enough alteration in its surroundings that no landmarks are recognized, the robot finds new ones and updates its map.
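In skeleton form, the loop alternates between letting odometry grow the uncertainty and letting a recognized landmark shrink it again. The one-dimensional sketch below only illustrates that interplay; the noise figures are assumptions, and Evolution Robotics' actual algorithm is far more elaborate.

```python
# A one-dimensional caricature of the VSLAM cycle: predict position from
# wheel odometry (uncertainty grows), then correct it whenever a stored
# visual landmark is re-recognized (uncertainty shrinks).

import random

true_pos = 0.0
est_pos, est_var = 0.0, 0.01             # belief: estimate and its uncertainty
landmark_pos, landmark_var = 5.0, 0.04   # a landmark already held in the map

for step in range(10):
    # Predict: odometry says "moved 1 meter," but the wheels slip a little.
    true_pos += 1.0 + random.gauss(0.0, 0.05)
    est_pos += 1.0
    est_var += 0.05 ** 2                 # uncertainty grows with every blind step

    # Correct: the camera recognizes the landmark, so fuse the two estimates.
    observed_range = (landmark_pos - true_pos) + random.gauss(0.0, 0.1)
    measured_pos = landmark_pos - observed_range
    gain = est_var / (est_var + landmark_var + 0.1 ** 2)
    est_pos += gain * (measured_pos - est_pos)
    est_var *= (1.0 - gain)              # uncertainty shrinks after each match

print(round(true_pos, 2), round(est_pos, 2))  # the estimate tracks the true position
```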

Artificial vision has become so fine-tuned that it can be trusted at high speeds and when lives are at stake. The small robot cave explorers deployed in the 2001–2002 U.S. campaign in Afghanistan show the military potential, and the Department of Defense (DoD) foresees more demanding applications. Through DARPA, the DoD is offering $1 million to anyone who can create a self-guided unit for desert warfare. The prize will be awarded in 2004 for a vehicle that can trek through the Mojave Desert from Barstow, California to a location near Las Vegas on its own. To cover a distance of about 320 kilometers (200 miles) within the allotted time of 10 hours, the unit must maintain an average speed of at least 32 kilometers per hour (20 miles per hour).

Other high-speed applications aimed at improving automobile safety through the use of intelligent artificial vision have been under development at Carnegie Mellon University and elsewhere. In one
current effort, the DaimlerChrysler Corporation is working on machine vision for its vehicles that would supplement and even override human judgment. Using video input, a fast computer in the vehicle keeps track of nearby objects in real time. “If a child suddenly appears between parked vehicles,” says the corporation,

the computer registers the danger within 80 milliseconds … and, if necessary, initiates the braking procedure. In this time the driver’s visual center would only just have received the visual information … without the brain having been able to initiate any reaction at all.

This application reminds us that when artificial vision is not being used to examine other planets, it is operating in environments that include people. Whether to sense a child in traffic, or to enhance human–robot interactions in general, the ability to differentiate people from things is the next important level of artificial vision.

FACES IN THE CROWD

It’s hard to imagine a more commonplace activity than recognizing a friend, but there is nothing simple about the action. His or her face must be detected as a face among many objects in the visual field, then recognized as belonging to a particular person. After that, we might also perceive the mood it is expressing. Human visual cognition is remarkably competent at all this, even with wide variations in lighting and in the angle at which we see the face, even if it is partly obscured or we have not seen it for a long time—so competent, in fact, that we sometimes see faces where none exist, as on the surface of the moon.

The realities of today’s world provide strong motives to find ways of artificially replicating these abilities. With terrorism as a serious threat, with identity theft and transactional types of fraud growing, governments, law-enforcement agencies, and commercial enterprises seek secure and rapid means to verify personal identity. Computer methods can provide this service, within the area called biometrics—the identification and recognition of people through physiological or behavioral traits, which also includes fingerprinting, retinal scans, and voice recognition.


The same biometric capabilities that enhance security can also improve the interactions between artificial beings and humans. The first step is detecting that a face is present. One research group, led by Takeo Kanade of Carnegie Mellon University, has in the last several years found accurate ways to pick out faces from complex cluttered backgrounds, using probabilistic methods and also a neural net. As presented earlier, a neural net is a set of interconnected processors that can be trained to acquire and store knowledge—in this case, how to decide whether a given image contains a face. In the approach Kanade’s group devised, the network examines a still image in small pieces, some chosen to filter for facelike features; for instance, one piece consists of horizontal stripes 20 pixels wide by five pixels high, a configuration that tends to pick out a mouth or pair of eyes in a face presented in full frontal view.

The researchers trained the network with more than a thousand assorted images of faces, and also with images deliberately chosen not to contain faces. As we ourselves do, the network sometimes incorrectly found faces where there were none. These erroneous choices became examples of what not to identify as a face, thereby sharpening the network’s judgment. Once trained, the system was tested on hundreds of new images including photographs of individuals and groups, the Mona Lisa, and the face cards from a deck of playing cards. The network found up to 90 percent of the faces, depending on the trade-off between making the identification highly certain and allowing a few incorrect identifications to slip through. The approach using probabilistic methods was even more effective, in that it also worked well for faces seen in profile and in three-quarters view. (You can try both approaches at Web sites maintained by Kanade’s group, where anyone can submit test images. Each face that the algorithms find is returned neatly surrounded by a green outline, leaving no doubt of the effectiveness of the methods.)
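In outline, a detector of this kind slides a small window across the image and asks the trained network to score each patch; raising the acceptance threshold trades missed faces for fewer false alarms, which is the trade-off just described. A schematic sketch follows, in which the scoring function is a stub standing in for the trained network rather than the Carnegie Mellon system itself:

```python
# Sliding-window face detection, in schematic form: scan patches across the
# image and keep those the classifier scores above a confidence threshold.

from typing import Callable, List, Tuple

def detect_faces(image: List[List[float]],
                 score_window: Callable[[List[List[float]]], float],
                 window: int = 20,
                 step: int = 4,
                 threshold: float = 0.9) -> List[Tuple[int, int]]:
    """Return the top-left corners of patches the classifier accepts as faces."""
    hits = []
    rows, cols = len(image), len(image[0])
    for top in range(0, rows - window + 1, step):
        for left in range(0, cols - window + 1, step):
            patch = [row[left:left + window] for row in image[top:top + window]]
            # A higher threshold means fewer false faces but more missed ones.
            if score_window(patch) >= threshold:
                hits.append((top, left))
    return hits

# Stand-in scorer for illustration only; a real system would call the trained
# neural network (or probabilistic model) here.
dummy_scorer = lambda patch: sum(sum(row) for row in patch) / (len(patch) * len(patch[0]))
```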

This kind of face detection can also be carried out in real time, a requirement for robotic applications. One example of software for real-time detection, developed by the Germany-based Fraunhofer Institute for Integrated Circuits, can be downloaded from its Web site. Used on images generated by an inexpensive video camera connected
to my desktop computer, this algorithm found a variety of real, photographed, and hand-drawn faces within a second of their appearance in the field of view, as long as the face was seen full on. The system could also track a face as it moved, if the movement was not too rapid.

The next step after detection, face recognition, is also reaching maturity, driven by pressing needs for identification and verification. In identification, an unknown face is compared to a dataset of known faces, such as a security watchlist; in verification, the claimant’s face is compared to a stored image of the person he or she claims to be. Like face detection, recognition is susceptible to a variety of approaches, such as one developed by the MIT Media Lab’s Alexander Pentland, who categorizes faces based on a set of visual building blocks he has developed; for instance, the appearance of the upper lip and the forehead. Computer software uses these fundamental elements to identify faces, with sufficient success that Pentland’s method has earned the trust of banks and security agencies.

A recent series of tests of computerized face recognition systems, sponsored by the FBI, the Secret Service, and other government agencies, showed that commercially available algorithms had improved significantly in just two years. Automatic verification software approved 90 percent of legitimate subjects and only 1 percent of imposters, and when an unknown face was compared against a base set of more than 37,000 faces, it was correctly identified, with virtual certainty or very high probability, more than 80 percent of the time.

Despite this impressive performance, the government tests showed that there are still kinks. Success rates dropped substantially when the subject was seen under some types of lighting. The rate of correct identification has also been low for faces not seen full on, but this problem has recently been largely alleviated by the “morphable model,” in which the software generates a three-dimensional model of what the camera sees. This virtual face is then changed and rotated to show how the subject would look if facing forward, and the result is fed into the face recognition routine. In one example, a poor identification rate of 15 percent for subjects looking right or left jumped to 77 percent when the morphable model was employed. This improvement suggests that better software, coupled with increased computing capacity, will solve many if not all the remaining problems with face recognition technology.

If artificial beings are to “read” people, that is, to read their emotions through their facial expressions, further advances are needed. The human face has more muscles than does the visage of any other living creature. These muscles can wrest the face into thousands of expressions, some differing only subtly but carrying serious differences in meaning. Since early studies made by the nineteenth-century anatomist Guillaume-Benjamin-Amand Duchenne, for instance, it has been known that the difference between a false smile of seeming happiness and a true smile of real joy is that in a true smile, the corners of the mouth are raised and the skin crinkles at the corners of the eyes.

Machine vision can already distinguish among emotions that produce widely different expressions. In one example, Gwen Littlewort and her colleagues, at the Machine Perception Laboratory of the University of California, San Diego, have developed a system that automatically detects a face as seen in a video image, and decides in which of seven categories its expression belongs: anger, disgust, fear, joy, sadness, surprise, or neutrality. Although relatively crude, this level of emotional identification is sufficient to enhance rapport between humans and artificial beings, allowing the latter to respond differently to an angry person, say, than to a surprised one.

But a digital being that cannot tell a false smile from a real one might remain naïve about humans, like the android Commander Data in Star Trek. Fortunately, in 1982, Paul Ekman, a psychologist at the University of California, San Francisco, who specializes in facial expressions, and his colleague Wallace Friesen developed a method to classify everything a face can do. The Facial Action Coding System uses anatomical knowledge to define more than 30 action units (AUs) corresponding to contractions of specific muscles in the upper and lower face. These AUs are sufficient to fully describe the thousands of possible facial expressions.

In 2001, Takeo Kanade’s group at Carnegie Mellon drew on this work to develop a neural network that breaks down any facial expression it sees into discrete AUs, with a recognition rate exceeding 96 percent. This means only that the system can detect subtle differences in expressions, not necessarily the emotions behind them, but psychologists are working on associating specific emotions with specific combinations of AUs, so there is potential for artificial beings to perceive the fine points of human feelings.
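To see how detected AUs might connect to an emotional reading, consider the sketch below. The AU numbers follow the Facial Action Coding System (AU6 is the cheek raiser, AU12 the lip-corner puller); the emotion labels are the kind of associations psychologists are studying, offered only as an illustration rather than a validated table.

```python
# A hedged sketch: translate a set of detected action units into a tentative
# emotional reading. The specific mappings below are illustrative assumptions.

def read_expression(active_aus: set) -> str:
    """Map detected FACS action units to a tentative description."""
    if {6, 12} <= active_aus:
        return "genuine (Duchenne) smile: lip corners pulled up and eyes crinkled"
    if 12 in active_aus:
        return "social smile: lip corners pulled up, no crinkling at the eyes"
    if {1, 2, 5} <= active_aus:
        return "possible surprise: inner and outer brows and upper lids raised"
    return "expression detected, emotion not classified"

print(read_expression({6, 12}))   # genuine smile
print(read_expression({12}))      # polite smile only
```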

The techniques that work for detecting and recognizing faces, such as the probabilistic approach, can also be applied to objects like automobiles and paper money, so machine vision will grow in capability. What has been achieved so far is only a part of general human visual cognitive ability.

GETTING THE WORD OUT

Despite the long way left to go, though, recognizing people and reading their faces represents a landmark in the development of synthetic creatures. But to achieve a comfortable relation with people, an artificial being also requires intelligent hearing and speech. Both the virtual and the real histories of artificial beings recognize the power of meaningful discourse—from the brass talking head supposedly made by Albertus Magnus in the thirteenth century, to the Turing test. As noted earlier, in 1637 René Descartes asserted that it might be possible to construct a machine that uttered words. But he went on to say,

It is not conceivable that such a machine should produce different arrangements of words so as to give an appropriately meaningful answer to whatever is said in its presence, as even the dullest of men can do.

We do not yet have machines that converse as well as “even the dullest of men,” but we do have transducers that change sound waves into computer bits, and vice versa. This is a start, and researchers have created systems that hear what is said to them and give appropriate spoken responses, but only within limited arenas. However, these machines also display qualities that Descartes might never have considered: They sound human, and in addition to grasping the meaning of the words, they grasp how the words are said and the qualities of the voice that says them.

It’s easy to experience machine hearing and speech, at least at a
rudimentary level. My I-Cybie robot dog has a microphone in each plastic ear to triangulate the source of a sound. When I clap my hands, the dog turns its head toward me. If I clap in a certain sequence or say one of a small vocabulary of command words, it does a trick, like any well-trained natural dog. Also like a real dog, it learns to respond to a name and it speaks dog language—barking to show that it understands a given command and whimpering when it is not getting enough attention.

Another level of speech interaction is found in computer dictation programs, where what you say into a microphone is turned into written words on the screen. To get a true sense of machine conversation, though, pick up the telephone and dial airline reservations or your bank. There is a good chance you’ll hear a synthesized voice welcome you, and ask what you need. You respond verbally, and a dialogue ensues. The conversation might well have its moments of frustration when you and the machine misunderstand each other. Still, according to Julia Hirschberg, a computational linguist at Columbia University, such conversations represent significant progress since the late 1980s. Computers are now fast enough to hear and respond in real time, and although the process is not perfect, Hirschberg notes that “Speech recognition and understanding is ‘good enough’ for limited, goal-directed interactions.” (Italics in the original.)

To be judged good enough or better, a machine must pass three tests: It must recognize the words you say, regardless of accent and personal speaking style; it must generate words that you recognize, without machinelike overtones; and it must give sensible responses to your conversation. This last requirement is basically the Turing test, only with speech instead of written messages. If the machine converses so well on any conceivable subject that it cannot be distinguished from a person, it passes Turing’s criterion for artificial intelligence. Even in limited conversations, however, the computer must be able to recognize words spoken by people, and to form its own words.

Speech recognition systems work by matching what a person says against a corpus; that is, a dataset of natural speech stored in the computer. The bigger the corpus, the better the system can recognize a range of utterances. Each speech sound in the corpus is broken down into a soundprint or acoustic spectrum—a list of the frequencies that make up the sound and their strengths. When the system hears a voice, that, too, is analyzed, in real time. By comparing the incoming soundprints with the stored ones, the computer assigns a probability that each sound has been correctly recognized. Further information comes from knowing the probabilities of the myriad other sounds that might follow the recognized one. The system also uses a “dictionary,” a set of soundprints for words in the language, and a “grammar,” which tells it the probability of finding a particular word once the preceding word is known. Then all these factors are manipulated by extremely sophisticated statistics, resulting in highly accurate word recognition. Compared to this complex process, speech synthesis is relatively simple.
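A toy calculation shows how the acoustic match and the "grammar" pull together. The numbers below are invented; a real recognizer weighs thousands of candidate sounds and words this way, many times per second.

```python
# Combining an acoustic score with a word-pair ("grammar") probability to pick
# the most likely word. All probabilities here are made up for illustration.

acoustic_score = {            # how well each candidate matches the soundprint
    "nineteen": 0.55,
    "ninety":   0.45,
}
bigram = {                    # probability of the word given the previous word
    ("platform", "nineteen"): 0.020,
    ("platform", "ninety"):   0.001,
}

previous_word = "platform"
best = max(acoustic_score,
           key=lambda w: acoustic_score[w] * bigram[(previous_word, w)])
print(best)   # "nineteen": context rescues a nearly ambiguous sound
```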

But merely recognizing and saying words is not enough. As researcher Sylvie Mozziconacci of Leiden University writes,

Communication is not merely an exchange of words … variations in pitch, intensity, speech rate, rhythm and voice quality are available to speaker and listener in order to encode and decode the full spoken message.

Recognizing words is one thing. Interpreting them, or speaking them with natural meaning and delivery, is something else.

To make a synthetic voice sound better than the mechanical monotone of a movie robot requires prosody. To poets, prosody means the study of meter, alliteration, and rhyme scheme that contribute to the flow and impact of a poem. For those who design machines that speak and listen, prosody means the differences in intonation that people use in speech, adding meaning or emotion to the literal significance of the words, or, as Elizabeth Shriberg and Andreas Stolcke of SRI International write, it is “the rhythm and melody of speech.” These intonational variations are put into synthesized voices by careful adjustment of pitch, pacing, and so on to copy the natural sound of people talking.

The other side of the prosody coin is the problem of ensuring that an artificial being can fully interpret what humans say. That helps
to reduce the ambiguity in human language, a major barrier to full machine understanding of speech, and to sense emotions and physiological states expressed in the human voice. As in facial recognition, the aim of sensing emotions and physiological states is being driven by the war on terrorism because it is important to detect stress in intercepted voice communications or “chatter.” Corporations are interested as well; they want to know when customers on the telephone are angry so that they can be mollified by appropriate responses (which might be hopeless if a customer’s anger was elicited by the frustration of talking to a machine with poor verbal skills).

But it is not easy to quantify exactly what it is we sense in prosody, or to put that knowledge into artificial speech systems. Reviewing how humans recognize emotion, Ralph Adolphs, a neurologist at the University of Iowa, says,

In general, recognizing emotions from prosody alone is more difficult than recognizing emotions from facial expressions. Certain emotions, such as disgust, can be recognized only very poorly from prosody.

Even human evaluators might disagree about how to classify the emotions expressed in a voice, especially for short utterances. Without a reliable database of classifications, it is difficult to determine exactly what a machine system should listen for to determine a person’s state of mind. But progress is being made in this relatively new area, especially in its pragmatic aspects: for instance, it seems that when correcting an error made by an artificial speech system a human tends to hyperarticulate—that is, speak slower and louder, and at a higher pitch—a clue that is useful in helping the system to respond appropriately.

Today’s artificial speech systems show the level at which recognition, synthesis, and conversational ability come together. Speech Experts, a German firm, recently announced a washing machine that obeys voice commands. This might seem an odd choice for advanced speech capabilities, but a company spokesman claims that, “Electronic appliances have become so complicated … that consumers are put off by them. Speech recognition would help people.” The machine is said to be able to follow complex instructions, such as “Prewash, then hot
wash at 95 degrees, then spin at 1,400 revolutions and start in half an hour.” It currently responds to a few hundred German words but is expected eventually to handle several thousand, in other languages as well.

Another example that shows how conversational machines function in practice is a telephone-based system for booking train travel, used since 2001 by Amtrak, the U.S. passenger rail system. Dial the Amtrak number, and a pleasantly crisp female voice says “Hi. This is Amtrak. I’m Julie.” Speaking in the first person and using casual speech such as “Here goes” and “No problem,” Julie offers schedules, ticket reservations, and train status. At each juncture where the caller must make a choice, the questions are crafted so that a yes or no will do, or Julie announces the words the customer can use and be understood, such as “Book that one” or “Change itinerary.”

Within the constraint of a limited vocabulary, Julie does well in recognizing words and responding suitably, as I found when I decided to test Julie by making a reservation. In several conversations, it never missed “New Orleans,” which has a variety of pronunciations. It misunderstood only when I departed from the list of approved words, and once when it interpreted my “19” as “90”—an understandable error that humans make too—and the system let me correct the error with little fuss. Surveys show that customers are substantially happier with Julie than with the touch-tone method Amtrak used previously—but the same surveys also show that many customers still hang up before completing the reservation process. Certainly no one yet has full confidence in Julie, competent as it sounds; the caller can always say “agent” to get connected to a human.

Other voice-based systems include a mock air-travel planning service based at Carnegie Mellon University that was designed as a test bed for the DARPA Communicator project. This ambitious effort had the goal of developing speech-based interfaces for battlefield use that would “support complex conversational interaction, where both user and the system can initiate interaction, provide information, ask for clarification, signal nonunderstanding, or interrupt the other participant.” When you dial the phone number, you are greeted by a
male voice. Its timbre is pleasant, but its delivery is a touch robotic, so although this system delivers the same kind of information Julie does, it is a less engaging chat partner.

The limitations of current voice systems are clear, and their variations emphasize that there is as yet no single optimum approach, although some methods are perfectly adequate for closely constrained dialogue. To achieve a higher standard of machine listening, understanding, and speaking that approaches human levels, deeper aspects of artificial intelligence must come into play, as Alan Turing understood. But even at the lower levels we have achieved so far, there is undeniable power in hearing a humanlike voice respond to your words—or perhaps hearing a digital being greet you by name after recognizing your face, while extending a hand to shake yours.

REACH OUT AND TOUCH

Appealing as the idea of shaking hands with a humanoid creature can be, you might want to think twice about actually doing it. Entrusting your fingers to a motor-driven mechanical hand could lead to pain or worse. An artificial being might know enough to begin grasping your hand in a socially acceptable way, but not when to stop. When you shake hands with another person, you each feel the pressure the other is exerting. Unless your intention is the hostile one of squeezing as hard as you can, you modulate your grip to more or less match what you feel from the other hand.

Artificial beings need a similar kind of sensing, and not only to keep from hurting humans. If a being can track the forces that it develops as it interacts with its environment, it can precisely calibrate how to grasp things, and it can adjust the forces it exerts so that its appendages “give” when they encounter an obstacle. This force feedback is one essential for artificial touch. Another is the kinesthetic sense that gives information about the location of a being’s limbs; otherwise, it could not guide its own hand toward an object. A third is a true tactile sense, allowing the being to perceive the surface properties of whatever it manipulates.

These haptic modes are found in varying degrees in artificial creatures. Kinesthetic sensing is a necessity for the walking robots in Chapter 6, and units with hands need broader abilities. Two examples with cyborglike elements have been developed for use in space: NASA’s Robonaut, and the four-fingered DLR Hand II, developed at the Deutsches Zentrum für Luft- und Raumfahrt (DLR), the German Aerospace Center. In both Robonaut and the DLR Hand II, human operators remotely perform manual tasks using video feeds that display what the robotic hand is doing. But the humans do better when the forces and textures felt by the robotic hand are fed back to their hands, via data gloves.

The transmission of sensory data from a robotic hand to a real one requires ingenious and extensive hardware. The force sensors developed for the DLR Hand II are tiny enough to fit into its fingertips, and according to Robert Ambrose, who heads the Robonaut group, the unit has more than 150 sensors in its arm and hand, although not all are involved in providing feedback. But even this many sensors is not enough to match the full power of human tactility. Our fingertips and tongue-tip are highly sensitive because touch sensors are densely concentrated there. We do not fully understand this network, and some researchers think its complexity rivals that of the visual system. In any case, it takes clever engineering to make sensors small and numerous enough to be installed at similar high densities.

The engineering challenge is being addressed, however, because of the role artificial touch can play in robotic surgery, a technique that is now commercially available, for instance, in Intuitive Surgical’s da Vinci system. Like the NASA and DLR robots, surgical robots are cyborglike rather than autonomous; that is, a trained human surgeon manipulates controls to operate a remote set of surgical instruments. One day, surgeons might be able to operate remotely at accident or battlefield sites anywhere in the world. Another application is already realized—minimally invasive surgery, performed through small bodily incisions typically a centimeter in size. The surgeon sees by way of a tiny video camera called an endoscope, and wields miniature tools, all inserted through the incisions. With the intervention of suitable hardware and a computer, the surgeon’s hand movements are appropriately scaled down, and any hand tremors are removed.


The technique is being used for a variety of procedures, from gallbladder removal to heart valve repair. It offers patients reduced pain and blood loss, minimal muscular damage, and shorter recovery times. However, one drawback is the surgeon’s inability to directly feel internal organs and their resistance to the scalpel. To remedy this problem, force feedback and tactile sensing are being added to surgical robots, with encouraging results. At the Harvard BioRobotics Laboratory, Robert Howe and his colleagues monitored medical students as they used a telerobotic system to expose a simulated artery, a common type of surgical task. Adding force feedback to visual feedback did not improve the speed or precision of the operations, but it did enable the students to perform the procedures less forcefully. This reduced the rate of inadvertent damage or “nicking” of the artery by some 75 percent compared to remote surgery using visual feedback alone.

The Harvard group is also finding ways to help surgeons remotely search for internal lumps, not easy to do through a small incision. Howe likens it to “trying to find a pea inside a bowl of jello using chopsticks.” The solution is a robotic fingertip consisting of 64 pressure sensors in a square array, inserted in the body. Each sensor is connected to a motorized pin outside the body, and the surgeon’s finger rests against this array of pins. As the robot fingertip moves within the body and encounters a lump, the pressure readings on the sensors change and the corresponding pins move in proportion. The end result is that the array of external pins maps the shape of the lump, which can then be felt by the surgeon’s finger resting on the pins.
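The mapping from sensor grid to pin array is essentially proportional. Here is a schematic sketch; the 8 × 8 layout follows the description, while the pressure scale and pin travel are assumed numbers rather than the Harvard hardware's actual values.

```python
# Map an 8 x 8 grid of internal pressure readings to the heights of 64 external
# pins, so a fingertip resting on the pins feels a relief map of the lump.

GRID = 8                          # 64 sensors in a square array
MAX_PIN_TRAVEL_MM = 3.0           # assumed mechanical travel of each pin
MAX_PRESSURE = 100.0              # assumed full-scale sensor reading

def pins_from_pressure(readings):
    """Raise each pin in proportion to the pressure on its matching sensor."""
    return [[MAX_PIN_TRAVEL_MM * min(p, MAX_PRESSURE) / MAX_PRESSURE
             for p in row] for row in readings]

# A pressure bump in the middle of the grid (the "lump") raises the central pins.
readings = [[60.0 if 3 <= r <= 4 and 3 <= c <= 4 else 5.0
             for c in range(GRID)] for r in range(GRID)]
pins = pins_from_pressure(readings)
print(pins[4][4], pins[0][0])     # 1.8 mm under the lump versus 0.15 mm elsewhere
```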

Other approaches could eliminate separate sensors to yield artificial skin or muscles with built-in haptic senses. For example, researchers at the STMicroelectronics Corporation and the University of Bologna have mounted a grid of fine electrically conducting wires in a soft substrate. Pressure on the material changes the electrical interactions among the wires. This information is turned into a map that gives the shape of the object causing the deformation. And at the Polytechnic University of Cartagena, Toribio Otero and Maria Cortés have used a plastic called polypyrrole to make a touch-sensitive muscle. Like other smart materials used for artificial muscles, theirs alters its
electrical properties in response to pressure and changes shape when an electrical current is applied. The interaction between these behaviors provides feedback that adjusts the force the material exerts according to the resistance it encounters, as we humans do.

Sensitive artificial touch is an engineering challenge because it requires many sensors that are densely distributed over an area; synthetic smell and taste are difficult to implement because of the sheer variety of what they sample. Nevertheless, concerns about security and crime are motivating researchers to develop artificial smell. A sensitive nose, natural or artificial, can detect explosives, buried land mines, and smoke from fires, as well as hidden drugs. Although the sense of smell is not fully understood, we know that humans identify smells by means of about a thousand special proteins in the nose, each of which reacts to a particular group of molecules, typically of an organic substance. Most odors do not come from just one chemical element or compound. When we recognize a smell as “coffee” or “vanilla,” we are identifying a set of molecules that has activated a particular pattern of proteins, which means we can recognize many millions of odors.

An artificial nose, therefore, must first react to specific chemicals, and then register the different compounds in a given odor. Moreover, to become a useful digital technique, it must change chemical reactions into electronic impulses. The Cyrano 320, an electronic nose made by Cyrano Sciences of Pasadena, California, uses a small chip with 32 receptors. Each receptor consists of a specific polymer mixed with some carbon black, a form of carbon that conducts electricity. When exposed to a vapor, each polymer expands by an amount determined by the molecules making up the vapor. This expansion changes the electrical resistance of each polymer and hence of the entire chip, producing a composite fingerprint reflecting all the molecules the chip has detected. Although 32 receptors is not many compared to the thousand proteins in the human nose, it is still enough to identify a lot of odors.
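Recognizing an odor from the 32 resistance changes is then a pattern-matching problem. Here is a hedged sketch of the idea; the fingerprints below are random stand-ins, not Cyrano data, and a real instrument would rely on calibrated patterns and more careful statistics.

```python
# Match a fresh 32-receptor resistance fingerprint against a small library of
# known odors by finding the nearest stored pattern. All data are invented.

import math
import random

def distance(a, b):
    """Euclidean distance between two 32-element resistance fingerprints."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

random.seed(1)
library = {   # fractional resistance change of each of the 32 receptors
    "coffee":  [random.random() for _ in range(32)],
    "vanilla": [random.random() for _ in range(32)],
    "solvent": [random.random() for _ in range(32)],
}

# A new sniff: the coffee pattern plus a little sensor noise.
sample = [x + random.gauss(0.0, 0.02) for x in library["coffee"]]
best_match = min(library, key=lambda name: distance(sample, library[name]))
print(best_match)   # "coffee"
```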

An artificial tongue can operate in a similar way, because all the flavors we experience, from ice cream to sushi, arise when our taste buds respond to a basic palette: the traditional bitter, sour, sweet, and
salty, with umami (the taste that comes with monosodium glutamate or MSG) recently added by many experts. The food and beverage industries have developed devices more sensitive than the human tongue to detect flavors, such as bitterness and sweetness, essential for their products. Researchers at the University of Texas and University of Connecticut have gone further, developing electronic methods to test for the presence of all the basic tastes except umami, although these methods have not yet yielded a commercial product.

MORE THAN HUMAN

Sight, hearing, touch, taste, and smell—for each, there are ways the artificial versions fall short of nature, but other ways they can improve on it. They can be extended beyond human norms, or supplemented by sensory modes without human analogues, such as active probing by sonar or laser beams, which work even in the dark, determine the distance and direction to an object, and distinguish between different types of obstacles.

Other advantages are realized by extending artificial vision further into the electromagnetic spectrum. Humans can see light from 400 to 750 nanometers in wavelength, from violet to red, with the other rainbow colors in between. This is only a tiny portion of the range for electromagnetic radiation, from X-rays and gamma rays with ultrashort wavelengths, to radio waves many meters in wavelength. Within this range lies invisible infrared radiation, which begins at wavelengths beyond 750 nanometers and is generally produced by objects hotter than room temperature. Hold your hand above a hot electric heating coil, or stand in bright sunlight; the warmth you feel is delivered by infrared waves.

The connection between heat and infrared radiation gives another way to see in the dark; that is, to discern warm or hot entities like human bodies and internal combustion engines. This is the principle behind one kind of night-vision goggle, and appropriate sensors provide the same capability to robots. The advantages for military, police, and rescue operations are obvious, and if nursebots or doctorbots ever become realities, their medical diagnoses could be
aided by infrared vision. It can detect tumors, which are warmer than their bodily surroundings, and can remotely measure body temperature. This capability became important during the outbreak of severe acute respiratory syndrome (SARS) in 2003, when international travelers were screened for above-normal temperatures that might indicate the high fever characteristic of the disease.

Add radio waves to the suite of electromagnetic wavelengths that digital beings could sense, and you get another extrahuman mode. One result could be beings that always know exactly where they are. While a robot on Mars needs extraordinary means to determine its location, a unit on Earth could simply incorporate a global positioner—the small electronic device that uses radio signals from orbiting artificial satellites to determine where on the planet it sits, to an accuracy of a few meters. Artificial beings could also have complete access to the resources of the Internet, through high-speed wireless connections, giving them the ability to tap into a world of databases, factual information, news, and much more for their own use or to answer questions from humans.

With radio, artificial beings could also engage in artificial telepathy, silently communicating among themselves even when far apart. Recall the brutal worldwide uprising of robots in the play R.U.R., or their sinister swarms in the story “With Folded Hands.” It takes only a touch of paranoia to see robot telepathy as a threat, but the applications thus far have been benign. The best known such application is competitive soccer played by teams of wirelessly linked AIBO robot dogs. Robotic soccer has taught researchers a lot about coordinated robotic behavior, and it has also evolved into an annual World RoboCup event where crowds cheer on their teams, and wait for a player to score a goal and perform a victory dance. Similarly, a big hit of the ROBODEX 2003 exposition in Japan was a robotic ballet. The principal dancers and corps de ballet consisted of tiny inch-tall units, made by the Seiko Epson Corporation. Controlled by a wireless linkage, they gracefully twirled, blinked their LED eyes, and formed perfectly aligned patterns to the strains of romantic music, as audiences watched enthralled.


CHIP VISION

Artificial senses can also open up a whole world of new human capabilities; as bionic implants, they can not only replace but even extend the natural senses. There is enormous interest in doing for the blind what cochlear implants have done for the deaf, as well as in other possibilities for bionic enhancement or replacement of human sensory organs. A limited experiment in direct human access to wireless communication was carried out in 1998 by Kevin Warwick, at the University of Reading in the United Kingdom, who had implanted into his arm a chip that emitted an identifying radio signal. The signal triggered functions such as turning on lights when he entered a room. However, the chip was not connected to his nervous system and did not carry out any functions of greater complexity.

Now under way are substantial efforts to restore sight to the blind through implants in the brain or retina. Most blindness is caused by a loss in the retina’s sensitivity to light, although both the optic nerve, which transmits visual impulses to the visual cortex, and the visual cortex itself remain perfectly functional. This is what happens to people with the disease called retinitis pigmentosa, and to those with macular degeneration—the age-related condition that is the most common cause of blindness in the United States, responsible for loss of sight in 200,000 eyes per year. In these cases, retinal implants show promise for restoring sight.

When a nonworking retina is electrically stimulated, the brain perceives flashes of light called phosphenes. Nanoelectronic techniques have made it possible to embed a minute set of electrodes, a fraction of a centimeter across, in the eye atop the retina. In one recent example, a group led by Mark Humayun and Eugene de Juan, at the University of Southern California in Los Angeles, implanted such an array connected to a video camera worn by the blind person. The camera activates the electrodes, stimulating neurons to create phosphenes that are related to the image registered by the camera. At the Illinois-based Optobionics Corporation, its founders Vincent and Alan Chow have eliminated the camera by implanting chips containing
silicon light sensors directly into the eyes of test subjects. The sensors convert light into electrical impulses that activate nerve cells.

A more radical method brings visual information directly into the brain, which means the technique could cure blindness due to a damaged eye or optic nerve, as well as blindness arising from retinal problems. William Dobelle, an independent scientist who operates his own laboratories in the United States and Portugal, has developed an electrode array that is implanted on the surface of the brain, where it stimulates the visual cortex. The array is connected to an electrical socket mounted on the outer surface of the skull, into which is plugged a video camera.

None of the methods described above is a complete bionic cure for blindness. Many questions remain, such as how well the body accepts the implants. However, these initial efforts are providing glimmerings of vision to the blind—in one case, apparently sufficient to allow the implantee to drive a car under controlled conditions—although not yet anything close to full restoration of sight. One problem is low resolution, because the number of electrodes or sensors in each implant is minuscule compared to the millions of rods and cones in the natural retina. Advances in nanoelectronics will undoubtedly improve the resolution, but a more fundamental difficulty remains. The retina contains a complex multilayered system of neurons that respond to the impulses from the rods and cones and thereby analyze visual information even before it reaches the brain. This retinal processing tracks movement and the edges of objects, both significant elements in any visual scene. None of the implant schemes tested so far performs this essential first step in visual thinking, but this important point is being addressed by researchers working on “biomorphic” or “neuromorphic” chips that copy biological functioning.

Kwabena Boahen, at the University of Pennsylvania, has gone beyond merely simulating the retina to actually copying it. Using transistors etched in a silicon chip, which are interconnected and made to operate in a way that mimics the layered retinal neurons, he has reproduced the edge- and motion-detection carried out by a natural retina. There is still a long path ahead, however, before this chip is ready to
be tested in a human subject. Indeed, there is a long path ahead until any of the retinal or brain implants can gain FDA approval, but the path might eventually lead beyond replacement to enhancement. Implants that use a video camera could draw on the advantages of telephoto, wide angle, and zoom lenses to enhance bionic vision. The camera could also be made sensitive to infrared light, giving the wearer night vision, which could also be built into implants that use light sensors in the eye rather than a camera.

Apart from implants, approaches like laser surgery combined with adaptive optics—the technique used in ground-based astronomical telescopes to correct light distortions caused by atmospheric turbulence—could bring us supernormal vision. The method relies on a wave-front sensor to examine the light waves; if they are not in perfect step, the deviations are corrected by changing the shape of a mirror as the light reflects from it, producing an undistorted image. David Williams and Junzhong Liang, of the University of Rochester, have pioneered the use of wave-front sensors to map all the optical aberrations in a person’s eye. This technique provides guidance for an advanced form of laser surgery, where the surgeon sculpts the cornea with tiny compensating corrections. In principle, all vision problems including astigmatism can be fully eliminated to give the fortunate patient 20/10 or 20/8 vision—the absolute best the human eye can do, given its density of rods and cones. Clinical trials have shown the effectiveness of the technique, which has given some people 20/16 vision.

In both natural and artificial beings, the senses are bridges between the physical operations of a body and the higher operations of a brain or a mind. These sensory bridges carry us into the mental make-up of a digital being: its intelligence, its rational thought, its feelings if any, and—if any—its consciousness.
