Smart Spaces Conference Paper
This paper was presented at the 1998 Joint DARPA/NIST Smart Spaces Technology Workshop, 30-31 July 1998, National Institute of Standards and Technology, Gaithersburg, MD. At that time, Telcordia Technologies was known as Bellcore.
The system, much improved since this paper was written, is now sold by Foveal Systems. See the AutoAuditorium Home Page: www.AutoAuditorium.com .
Bellcore Applied Research
Morristown, NJ 07960
Bellcore's AutoAuditorium (TM) System is a practical application of a Smart Space, turning an ordinary auditorium into one that can automatically make broadcasts and recordings. The system is permanently installed in the room and uses optical and acoustic sensors (television cameras and microphones) to be ``aware'' of what is happening in the room. It uses this awareness to televise the sound and images of the most common form of auditorium talk, a single person on a stage, speaking with projected visual aids to a local audience.
Once turned on, the system is completely automatic. The person on stage and the people in the local audience may not even be aware that it is on. To remote audiences, the program is usually as watchable as one produced by a one-person crew running the system by hand.
This paper describes the system, some of our experiences using it, and planned enhancements and research.
The AutoAuditorium Tracking Camera follows a person on the stage, panning, tilting, zooming and focusing in response to her movements.
The AutoAuditorium Director controls the video mixer, selecting among the four cameras and a combination shot (slide screen + presenter) using heuristics that produce quite watchable programs from most presentations.
The AutoAuditorium Sound mixes sound from an optional wireless microphone, microphones installed above the stage, and microphones installed above the audience seating area. The stage microphones provide adequate audio coverage if the wireless microphone is not used or fails, and they also feed the room's public address system. The Sound subsystem gives preference to voices originating from the stage, but also listens for audience questions.
The outputs of these subsystems create a television program that is then distributed via various mechanisms: video cassette recording, video network, and computer-encoded recording and transmission.
In the current system, each of the subsystems operates independently, although the Director changes parameter settings in the Tracking Camera algorithm for some shot selections. We plan to add more cross-subsystem awareness.
The Tracking Camera's own image is not analyzed. Instead, a ``Spotting Camera'', mounted close to the Tracking Camera, is pointed at the stage area, and its signal goes to one of the frame grabbers in the computer. A Search Area, where the person on the stage will be walking in the Spotting Camera image, is defined during installation, along with a map that relates points in the Spotting Camera image to pan, tilt, and zoom positions of the Tracking Camera. The Tracking Camera software detects any motion in the Search Area and drives the Tracking Camera to the appropriate pan, tilt, and zoom position. The Search Area also keeps motion from the seated (and occasionally standing) audience from influencing the Tracking Camera. See Figure 1.
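As an illustration of this pipeline, the following minimal sketch, in Python with NumPy, detects motion in the Search Area of a grayscale Spotting Camera frame and drives the Tracking Camera toward it. Everything here, the search-area bounds, the thresholds, the smoothing factor, the ptz_map calibration function, and the move_camera command, is an invented stand-in, not the actual AutoAuditorium code.

```python
import numpy as np

SEARCH_AREA = (slice(40, 120), slice(0, 320))  # rows, cols of the stage in the image
MOTION_THRESHOLD = 25                          # per-pixel difference that counts as motion
SMOOTHING = 0.2                                # low-pass factor that steadies the camera

def move_camera(pan, tilt, zoom):
    # Stand-in for the command that drives the pan/tilt/zoom head.
    print(f"PTZ -> pan={pan:.1f} tilt={tilt:.1f} zoom={zoom:.1f}")

def motion_centroid(prev_frame, frame):
    """Centroid (row, col) of motion inside the Search Area, or None."""
    diff = np.abs(frame[SEARCH_AREA].astype(int) - prev_frame[SEARCH_AREA].astype(int))
    moving = diff > MOTION_THRESHOLD
    if moving.sum() < 20:                      # too few pixels: treat as noise
        return None
    rows, cols = np.nonzero(moving)
    return rows.mean(), cols.mean()

def track(prev_frame, frame, current_ptz, ptz_map):
    """One tracking step: find motion, map it to pan/tilt/zoom, move smoothly."""
    centroid = motion_centroid(prev_frame, frame)
    if centroid is None:
        return current_ptz                     # no motion: hold the last position
    target = ptz_map(centroid)                 # installation-time calibration map
    smoothed = tuple(c + SMOOTHING * (t - c)
                     for c, t in zip(current_ptz, target))
    move_camera(*smoothed)
    return smoothed
```

The smoothing step keeps small, noisy centroid shifts from jittering the camera, which mirrors the role of the smoothing parameters mentioned below.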
Several parameters, set during system installation, tune the various tracking and smoothing algorithms.
The Director analyzes the Slide Camera image to determine if the projection screen is blank. If so, it directs the video mixer to show the Tracking Camera, following the speaker as he moves around the stage and talks to his audiences. See Figure 2.
Should a slide be projected, the Director sees that the Slide Camera image is no longer blank and quickly directs the video mixer to show it. See Figure 3.
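The paper does not spell out how blankness is detected, but one plausible test, offered here purely as an assumption, is that a blank screen shows very little pixel-to-pixel contrast in the Slide Camera image:

```python
import numpy as np

BLANK_STDDEV = 8.0  # contrast threshold, tuned at installation for the room

def screen_is_blank(slide_frame):
    """slide_frame: grayscale Slide Camera image as a NumPy array."""
    # A blank screen is nearly uniform, so its standard deviation is low.
    return float(np.std(slide_frame)) < BLANK_STDDEV
```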
Since it is not yet possible to determine automatically whether the more important image is the speaker or the screen, a ``combination shot'' is constructed, with the speaker placed in a picture-in-picture box in the lower corner of the Slide Camera image. See Figure 4.
The picture-in-picture appears after a brief delay, since the Tracking Camera algorithm needs time to adjust to the new parameters that the Director sends it.
If the screen goes blank (Figure 5), or if the slide remains unchanged for a long time, the Director selects a ``covering shot'' from one of the other two fixed cameras while the Tracking Camera algorithm is reset to track the person in the center of the image. The covering shot is then replaced with the Tracking Camera shot (Figure 6).
Should there be motion on the projection screen, or should the slide remain unchanged for an even longer time, the Director reconstructs the combination shot.
Because the slide image is quickly recalled to the program if there is motion within it, the Director often selects that shot just as the speaker is making a point about, and pointing at, the slide.
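Taken together, these heuristics amount to a small state machine. The sketch below condenses them into a single transition function; the state names, timings, and input flags are illustrative assumptions, not the actual Director implementation.

```python
TRACKING, SLIDE, COMBO, COVERING = "tracking", "slide", "combo", "covering"
PIP_DELAY = 2.0      # seconds for the Tracking Camera to settle before the PiP
STALE_SLIDE = 120.0  # seconds an unchanged slide stays in the program
COVER_DWELL = 5.0    # seconds on the covering shot while tracking resets

def next_shot(state, entered, now, blank, slide_changed, last_slide_change):
    """One step of the shot-selection loop; returns (state, time_entered)."""
    if state == TRACKING:
        if slide_changed:
            return SLIDE, now        # a slide appeared: cut to it at once
    elif state == SLIDE:
        if now - entered > PIP_DELAY:
            return COMBO, now        # Tracking Camera has settled: add the PiP
    elif state == COMBO:
        if blank or now - last_slide_change > STALE_SLIDE:
            return COVERING, now     # blank or stale screen: take a covering shot
    elif state == COVERING:
        if slide_changed:
            return COMBO, now        # motion on the slide: recall the combination
        if now - entered > COVER_DWELL:
            return TRACKING, now     # hand off to the re-aimed Tracking Camera
    return state, entered
```

Cutting to the slide immediately but delaying the picture-in-picture matches the behavior described above: the remote audience sees the new slide at once, while the Tracking Camera gets time to re-frame the speaker.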
In the Morristown Auditorium, the ceiling over the stage is low enough that six carefully placed microphones provide adequate audio coverage of anyone standing on or near the stage. An automatic microphone mixer combines them with the signals from the wireless microphone receiver and a microphone built into the lectern. It is so effective at selecting the best sound source that we simply leave the inputs at standard settings. The output from this mixer is used both for the room's public address (PA) system and as part of the AutoAuditorium Sound feed. See Figure 9.
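The sketch below illustrates priority-gated mixing in the same spirit; the actual system uses a dedicated automatic microphone mixer, so the input kinds, priority weights, and ducking gain here are all illustrative assumptions.

```python
import numpy as np

# Stage-side inputs get a priority boost so a voice from the stage wins over
# comparable sound picked up by the audience microphones.
PRIORITY = {"wireless": 2.0, "lectern": 1.5, "stage": 1.2, "audience": 1.0}

def weighted_level(kind, samples):
    """Priority-weighted RMS level of one block of samples."""
    s = samples.astype(float)
    return PRIORITY[kind] * np.sqrt(np.mean(s * s))

def mix_block(blocks):
    """blocks: list of (kind, samples) pairs, one per microphone input."""
    strongest = max(range(len(blocks)),
                    key=lambda i: weighted_level(*blocks[i]))
    out = np.zeros(len(blocks[0][1]), dtype=float)
    for i, (kind, samples) in enumerate(blocks):
        # Open the strongest input fully and keep the others ducked, which
        # approximates the gating behavior of an automatic mixer.
        out += (1.0 if i == strongest else 0.15) * samples.astype(float)
    return out
```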
But the auditoriums can get very busy, with two or even three separate events in a single day. Operators stuck at the control console all day became bored and tired, and made mistakes; they also had other duties and were sometimes difficult to schedule.
As computer vision systems became more capable, experiments in using vision analysis to drive a tracking camera and a video mixer showed promise. By 1994, the first version of a research prototype AutoAuditorium System became operational in our Morristown, NJ auditorium. Weekly work-in-progress talks were sent live over our experimental desktop video teleconferencing system, called Cruiser/Touring Machine {CTM}, and also recorded for Cruiser's on-demand playback service. These weekly tests led to more refined algorithms and tuned parameters. Eventually, many people watching programs produced by the AutoAuditorium System could not tell the difference between them and manually produced programs. In fact, the AutoAuditorium programs were sometimes superior to those produced by hand, because operators would sometimes day-dream; producing a program can get very tedious.
Recently, the prototype system was ported from a locally written real-time operating system, running on a single-board computer in a VME card cage with VME frame grabbers, to the production system: an IBM-compatible PC running Linux with PCI-bus frame grabbers.
While the system works well, it cannot fix badly prepared or badly presented talks. For example, visuals that cannot be read easily from the back of the room are also difficult to see on television. A human operator can sometimes improve the situation by taking close-ups of portions of the projection screen, illustrating the points the speaker is making. Such a capability does not yet exist in AutoAuditorium.
The production system has considerably more processing power than the prototype, so it should be possible to identify multiple people in the Search Area, especially when they are well separated. That would help the Tracking Camera to stay with the original target, or to decide to zoom out to cover both targets until one or the other left the scene.
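One way to realize this, sketched below under the same assumptions as the earlier tracking sketch, is to label connected regions of the motion mask and widen the framing when two sizable regions separate; scipy.ndimage.label does the grouping, and the thresholds are invented.

```python
import numpy as np
from scipy import ndimage

MIN_BLOB = 50        # pixels of motion needed to count as a person
SPLIT_DISTANCE = 80  # centroid separation that forces a wider shot

def analyze_movers(moving_mask):
    """Return the centroids of distinct movers in the Search Area mask."""
    labels, count = ndimage.label(moving_mask)
    centroids = []
    for i in range(1, count + 1):
        rows, cols = np.nonzero(labels == i)
        if rows.size >= MIN_BLOB:
            centroids.append((rows.mean(), cols.mean()))
    return centroids

def choose_framing(centroids):
    """Stay tight on one target, or widen to cover two well-separated ones."""
    if len(centroids) <= 1:
        return "tight"
    spread = max(abs(a[1] - b[1]) for a in centroids for b in centroids)
    return "wide" if spread > SPLIT_DISTANCE else "tight"
```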
Alternatively, one tracking algorithm could drive multiple Tracking Cameras, say with very different viewpoints. When only one person was on stage, the ability to change camera angles could help provide variety to the program. When more than one person was on the stage, separate cameras could be assigned to separate people.
The Director itself could also be made more aware. For one, it could notice when the Tracking Camera has not moved for a long time. Some speakers place themselves behind or next to the lectern and stay there. If the Director were aware of that, it could decide to take other shots, say of the whole front of the room or of the audience, just to provide some variety.
Another possibility, given the enhancement to track more than one person on stage, could be to use the whole-stage fixed camera shot when more than one person occupies the stage, especially if the whole-stage shot covers a wider area than the Tracking Camera can.
Multiple microphones over the stage area should make it possible to know approximately where sound is coming from. Again, given the enhancement where the Tracking Camera can identify several people on stage, that information could help the Director and/or Tracking Camera decide which person to show to the remote audiences.
Rutgers University has Array Microphone technology, sometimes referred to as Speaker Seeker {SS1} {SS2}, that can stereo-locate the position of a sound source. We have an early version of Speaker Seeker installed in the Morristown Auditorium, but it remains to be integrated with the AutoAuditorium System. When a person in the audience speaks, Speaker Seeker can usually point a camera at her. If that image, along with Speaker Seeker's confidence measure indicating the likelihood that the image is good, were made available to the AutoAuditorium System, the Director could decide to include the image of the questioner along with the sound of her voice.
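For concreteness, the basic ingredient of such localization is a time-difference-of-arrival estimate between a microphone pair, as in the textbook sketch below; this is generic cross-correlation, not the Speaker Seeker algorithm, and the microphone spacing and sample rate are assumed.

```python
import numpy as np

SPEED_OF_SOUND = 343.0  # m/s at room temperature
SAMPLE_RATE = 16000     # Hz, assumed
MIC_SPACING = 0.5       # meters between the pair, assumed

def bearing_from_pair(left, right):
    """Estimate a source bearing (radians from broadside) from two
    synchronized sample arrays using cross-correlation."""
    corr = np.correlate(left, right, mode="full")
    lag = int(corr.argmax()) - (len(right) - 1)  # delay in samples
    delay = lag / SAMPLE_RATE                    # delay in seconds
    # Far-field geometry: path difference = spacing * sin(bearing).
    sin_bearing = np.clip(delay * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.arcsin(sin_bearing))
```

Combining such bearings from several pairs over the stage and audience would give the approximate source position the Director could act on.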
Our own experience shows that having an AutoAuditorium System allows us to record and broadcast programs that otherwise would not have been captured.