Smart Spaces Conference Paper
This paper was presented at the 1998 Joint DARPA/NIST Smart Spaces Technology Workshop, 30-31 July 1998, National Institute of Standards and Technology, Gaithersburg, MD. At that time, Telcordia Technologies was known as Bellcore.
The system, much improved since this paper was written, is now sold by Foveal Systems. See the AutoAuditorium Home Page: www.AutoAuditorium.com .
Bellcore Applied Research
Morristown, NJ 07960
Bellcore's AutoAuditorium (TM) System is a practical application of a Smart Space, turning an ordinary auditorium into one that can automatically make broadcasts and recordings. The system is permanently installed in the room and uses optical and acoustic sensors (television cameras and microphones) to be ``aware'' of what is happening in the room. It uses this awareness to televise the sound and images of the most common form of auditorium talk, a single person on a stage, speaking with projected visual aids to a local audience.
Once turned on, the system is completely automatic. The person on stage and the people in the local audience may not even be aware that it is on. To remote audiences, the program is usually as watchable as one produced by a one-person crew running the system by hand.
This paper describes the system, some of our experiences using it, and planned enhancements and research.
The AutoAuditorium Tracking Camera follows a person on the stage, panning, tilting, zooming and focusing in response to her movements.
The AutoAuditorium Director controls the video mixer, selecting among the four cameras and a combination shot (slide screen + presenter) using heuristics that produce quite watchable programs from most presentations.
The AutoAuditorium Sound mixes sound from an optional wireless microphone, microphones installed above the stage, and microphones installed above the audience seating area. The stage microphones provide adequate audio coverage if the wireless microphone is not used or fails, and they also feed the room's public address system. The Sound subsystem gives preference to voices originating from the stage, but also listens for audience questions.
The outputs of these subsystems create a television program that is then distributed via various mechanisms: video cassette recording, video network, and computer-encoded recording and transmission.
In the current system, each of the subsystems operates independently, although the Director changes parameter settings in the Tracking Camera algorithm for some shot selections. We plan to add more cross-subsystem awareness.
The Tracking Camera's own image is not analyzed. Instead, a ``Spotting Camera'', mounted close to the Tracking Camera, is pointed at the stage area, and its signal goes to one of the frame grabbers in the computer. A Search Area, where the person on the stage will be walking in the Spotting Camera image, is defined during installation, along with a map that relates points in the Spotting Camera image to pan, tilt, and zoom positions of the Tracking Camera. The Tracking Camera software detects any motion in the Search Area and drives the Tracking Camera to the appropriate pan, tilt, and zoom position. The Search Area also keeps motion from the seated (and occasionally standing) audience from influencing the Tracking Camera. See Figure 1.
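As an illustration of this pipeline, the following minimal sketch, in Python with NumPy, detects motion in the Search Area of a grayscale Spotting Camera frame and drives the Tracking Camera toward it. Everything here, the search-area bounds, the thresholds, the smoothing factor, the ptz_map calibration function, and the move_camera command, is an invented stand-in, not the actual AutoAuditorium code.

```python
import numpy as np

SEARCH_AREA = (slice(40, 120), slice(0, 320))  # rows, cols of the stage in the image
MOTION_THRESHOLD = 25                          # per-pixel difference that counts as motion
SMOOTHING = 0.2                                # low-pass factor that steadies the camera

def move_camera(pan, tilt, zoom):
    # Stand-in for the command that drives the pan/tilt/zoom head.
    print(f"PTZ -> pan={pan:.1f} tilt={tilt:.1f} zoom={zoom:.1f}")

def motion_centroid(prev_frame, frame):
    """Centroid (row, col) of motion inside the Search Area, or None."""
    diff = np.abs(frame[SEARCH_AREA].astype(int) - prev_frame[SEARCH_AREA].astype(int))
    moving = diff > MOTION_THRESHOLD
    if moving.sum() < 20:                      # too few pixels: treat as noise
        return None
    rows, cols = np.nonzero(moving)
    return rows.mean(), cols.mean()

def track(prev_frame, frame, current_ptz, ptz_map):
    """One tracking step: find motion, map it to pan/tilt/zoom, move smoothly."""
    centroid = motion_centroid(prev_frame, frame)
    if centroid is None:
        return current_ptz                     # no motion: hold the last position
    target = ptz_map(centroid)                 # installation-time calibration map
    smoothed = tuple(c + SMOOTHING * (t - c)
                     for c, t in zip(current_ptz, target))
    move_camera(*smoothed)
    return smoothed
```

The smoothing step keeps small, noisy centroid shifts from jittering the camera, which mirrors the role of the smoothing parameters mentioned below.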
Several parameters, set during system installation, tune the various tracking and smoothing algorithms.
The Director analyzes the Slide Camera image to determine if the projection screen is blank. If so, it directs the video mixer to show the Tracking Camera, following the speaker as he moves around the stage and talks to his audiences. See Figure 2.
Should a slide be projected, the Director sees that the Slide Camera image is no longer blank and quickly directs the video mixer to show it. See Figure 3.
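The paper does not spell out how blankness is detected, but one plausible test, offered here purely as an assumption, is that a blank screen shows very little pixel-to-pixel contrast in the Slide Camera image:

```python
import numpy as np

BLANK_STDDEV = 8.0  # contrast threshold, tuned at installation for the room

def screen_is_blank(slide_frame):
    """slide_frame: grayscale Slide Camera image as a NumPy array."""
    # A blank screen is nearly uniform, so its standard deviation is low.
    return float(np.std(slide_frame)) < BLANK_STDDEV
```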
Since it is not yet possible to determine automatically whether the more important image is the speaker or the screen, a ``combination shot'' is constructed, with the speaker placed in a picture-in-picture box in the lower corner of the Slide Camera image. See Figure 4.
The picture-in-picture appears after a brief delay, since the Tracking Camera algorithm needs time to adjust to the new parameters that the Director sends it.
If the screen goes blank (Figure 5), or if the slide remains unchanged for a long time, the Director selects a ``covering shot'' from one of the other two fixed cameras while the Tracking Camera algorithm is reset to track the person in the center of the image. The covering shot is then replaced with the Tracking Camera shot (Figure 6).
Should there be motion on the projection screen, or should the slide remain unchanged for an even longer time, the Director reconstructs the combination shot.
Because the slide image is quickly recalled to the program if there is motion within it, the Director often selects that shot just as the speaker is making a point about, and pointing at, the slide.
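Taken together, these heuristics amount to a small state machine. The sketch below condenses them into a single transition function; the state names, timings, and input flags are illustrative assumptions, not the actual Director implementation.

```python
TRACKING, SLIDE, COMBO, COVERING = "tracking", "slide", "combo", "covering"
PIP_DELAY = 2.0      # seconds for the Tracking Camera to settle before the PiP
STALE_SLIDE = 120.0  # seconds an unchanged slide stays in the program
COVER_DWELL = 5.0    # seconds on the covering shot while tracking resets

def next_shot(state, entered, now, blank, slide_changed, last_slide_change):
    """One step of the shot-selection loop; returns (state, time_entered)."""
    if state == TRACKING:
        if slide_changed:
            return SLIDE, now        # a slide appeared: cut to it at once
    elif state == SLIDE:
        if now - entered > PIP_DELAY:
            return COMBO, now        # Tracking Camera has settled: add the PiP
    elif state == COMBO:
        if blank or now - last_slide_change > STALE_SLIDE:
            return COVERING, now     # blank or stale screen: take a covering shot
    elif state == COVERING:
        if slide_changed:
            return COMBO, now        # motion on the slide: recall the combination
        if now - entered > COVER_DWELL:
            return TRACKING, now     # hand off to the re-aimed Tracking Camera
    return state, entered
```

Cutting to the slide immediately but delaying the picture-in-picture matches the behavior described above: the remote audience sees the new slide at once, while the Tracking Camera gets time to re-frame the speaker.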
In the Morristown Auditorium, the ceiling over the stage is low enough that six carefully placed microphones provide adequate audio coverage of anyone standing on or near the stage. An automatic microphone mixer combines them with the signals from the wireless microphone receiver and a microphone built into the lectern. It is so effective at selecting the best sound source that we simply leave the inputs at standard settings. The output from this mixer is used both for the room's public address (PA) system and as part of the AutoAuditorium Sound feed. See Figure 9.
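The sketch below illustrates priority-gated mixing in the same spirit; the actual system uses a dedicated automatic microphone mixer, so the input kinds, priority weights, and ducking gain here are all illustrative assumptions.

```python
import numpy as np

# Stage-side inputs get a priority boost so a voice from the stage wins over
# comparable sound picked up by the audience microphones.
PRIORITY = {"wireless": 2.0, "lectern": 1.5, "stage": 1.2, "audience": 1.0}

def weighted_level(kind, samples):
    """Priority-weighted RMS level of one block of samples."""
    s = samples.astype(float)
    return PRIORITY[kind] * np.sqrt(np.mean(s * s))

def mix_block(blocks):
    """blocks: list of (kind, samples) pairs, one per microphone input."""
    strongest = max(range(len(blocks)),
                    key=lambda i: weighted_level(*blocks[i]))
    out = np.zeros(len(blocks[0][1]), dtype=float)
    for i, (kind, samples) in enumerate(blocks):
        # Open the strongest input fully and keep the others ducked, which
        # approximates the gating behavior of an automatic mixer.
        out += (1.0 if i == strongest else 0.15) * samples.astype(float)
    return out
```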
But the auditoriums can get very busy, with two or even three separate events in a single day. Operators stuck at the control console all day became bored and tired, and made mistakes; they also had other duties and were sometimes difficult to schedule.
As computer vision systems became more capable, experiments in using vision analysis to drive a tracking camera and a video mixer showed promise. By 1994, the first version of a research prototype AutoAuditorium System became operational in our Morristown, NJ auditorium. Weekly work-in-progress talks were sent live over our experimental desktop video teleconferencing system, called Cruiser/Touring Machine {CTM}, and also recorded for Cruiser's on-demand playback service. These weekly tests led to more refined algorithms and tuned parameters. Eventually, many people watching programs produced by the AutoAuditorium System could not tell the difference between them and manually produced programs. In fact, the AutoAuditorium programs were sometimes superior to those produced by hand, because operators would sometimes day-dream; producing a program can get very tedious.
Recently, the prototype system was ported from a locally written real-time operating system, running on a single-board computer in a VME card cage with VME frame grabbers, to the production system: an IBM-compatible PC running Linux with PCI-bus frame grabbers.
While the system works well, it cannot fix badly prepared or badly presented talks. For example, visuals that cannot be read easily from the back of the room are also difficult to see on television. A human operator can sometimes improve the situation by taking close-ups of portions of the projection screen, illustrating the points the speaker is making. Such a capability does not yet exist in AutoAuditorium.
The production system has considerably more processing power than the prototype, so it should be possible to identify multiple people in the Search Area, especially when they are well separated. That would help the Tracking Camera to stay with the original target, or to decide to zoom out to cover both targets until one or the other left the scene.
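One way to realize this, sketched below under the same assumptions as the earlier tracking sketch, is to label connected regions of the motion mask and widen the framing when two sizable regions separate; scipy.ndimage.label does the grouping, and the thresholds are invented.

```python
import numpy as np
from scipy import ndimage

MIN_BLOB = 50        # pixels of motion needed to count as a person
SPLIT_DISTANCE = 80  # centroid separation that forces a wider shot

def analyze_movers(moving_mask):
    """Return the centroids of distinct movers in the Search Area mask."""
    labels, count = ndimage.label(moving_mask)
    centroids = []
    for i in range(1, count + 1):
        rows, cols = np.nonzero(labels == i)
        if rows.size >= MIN_BLOB:
            centroids.append((rows.mean(), cols.mean()))
    return centroids

def choose_framing(centroids):
    """Stay tight on one target, or widen to cover two well-separated ones."""
    if len(centroids) <= 1:
        return "tight"
    spread = max(abs(a[1] - b[1]) for a in centroids for b in centroids)
    return "wide" if spread > SPLIT_DISTANCE else "tight"
```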
Alternatively, one tracking algorithm could drive multiple Tracking Cameras, say with very different viewpoints. When only one person was on stage, the ability to change camera angles could help provide variety to the program. When more than one person was on the stage, separate cameras could be assigned to separate people.
The Director itself could also be made more aware. For one, it could notice when the Tracking Camera has not moved for a long time. Some speakers place themselves behind or next to the lectern and stay there. If the Director were aware of that, it could decide to take other shots, say of the whole front of the room or of the audience, just to provide some variety.
Another possibility, given the enhancement to track more than one person on stage, could be to use the whole-stage fixed camera shot when more than one person occupies the stage, especially if the whole-stage shot covers a wider area than the Tracking Camera can.
Multiple microphones over the stage area should make it possible to know approximately where sound is coming from. Again, given the enhancement where the Tracking Camera can identify several people on stage, that information could help the Director and/or Tracking Camera decide which person to show to the remote audiences.
Rutgers University has Array Microphone technology, sometimes referred to as Speaker Seeker {SS1} {SS2}, that can stereo-locate the position of a sound source. We have an early version of Speaker Seeker installed in the Morristown Auditorium, but it remains to be integrated with the AutoAuditorium System. When a person in the audience speaks, Speaker Seeker can usually point a camera at her. If that image, along with Speaker Seeker's confidence measure indicating the likelihood that the image is good, were made available to the AutoAuditorium System, the Director could decide to include the image of the questioner along with the sound of her voice.
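For concreteness, the basic ingredient of such localization is a time-difference-of-arrival estimate between a microphone pair, as in the textbook sketch below; this is generic cross-correlation, not the Speaker Seeker algorithm, and the microphone spacing and sample rate are assumed.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature
SAMPLE_RATE = 16000     # Hz, assumed
MIC_SPACING = 0.5       # meters between the pair, assumed

def bearing_from_pair(left, right):
    """Estimate a source bearing (radians from broadside) from two
    synchronized sample arrays using cross-correlation."""
    corr = np.correlate(left, right, mode="full")
    lag = int(corr.argmax()) - (len(right) - 1)  # delay in samples
    delay = lag / SAMPLE_RATE                    # delay in seconds
    # Far-field geometry: path difference = spacing * sin(bearing).
    sin_bearing = np.clip(delay * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.arcsin(sin_bearing))
```

Combining such bearings from several pairs over the stage and audience would give the approximate source position the Director could act on.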
Our own experience shows that having an AutoAuditorium System allows us to record and broadcast programs that otherwise would not have been captured.