VIDEO PROCESSING
This invention relates to teleconferencing and, in particular, to systems enabling videoconferencing between three or more locations. Videoconferencing can be regarded as a technological substitute for face-to-face meetings. For meetings between two locations, current technology allows one set of participants to see the other set of participants. Where more than two locations are interconnected (so-called multipoint videoconferencing), current systems generally provide a view of only one other location at a time, owing to cost and technology constraints.
A number of standards relating to the field of videoconferencing have been adopted, in particular ITU-T Recommendation H.261 "Video codec for audio-visual services at p × 64 kbit/s". H.261 proposed a common intermediate format (CIF). CIF is based on 288 non-interlaced lines per picture at 30 pictures per second. This format was found to solve the compatibility problems between the traditional formats used in Japan and North America and those used in Europe, and to provide a good quality picture for use in videoconferencing. A second picture format was also included having one half of the resolution of CIF in two dimensions. This format is known as quarter CIF (QCIF). Other relevant international standards are those set by the Moving Pictures Expert Group (MPEG), both ISO/IEC 11172-1 (commonly known as MPEG-1) and ISO/IEC 13818 (commonly known as MPEG-2). Both of these standards also utilise the Common Intermediate Format, and the individual pictures can be any size within the 352-pixel by 288-line picture. A multipoint videoconference is generally controlled by a multipoint control unit (MCU) which processes the audio and video signals from each location separately. The MCU is usually provided as a separate piece of equipment but may form an integral part of one of the participating terminals. The MCU generally provides an open audio-mixing system, where all participants are able to hear all other participants but not themselves. However, each terminal is only able to see one other participating terminal, the MCU switching the video from selected terminals to be seen at the other terminals. Various methods for selecting who is seen at a particular terminal are known. Two of the most popular involve selecting
the picture automatically from the terminal where someone is speaking, or having a chairperson control which picture is seen by whom.
European Patent Application No. 523629 relates to such a multipoint teleconferencing system. A chairperson is located at one of the terminals to control which pictures are viewed by the participants. Each participant receives the same video signal as the other participants for display. European Patent Application No. 642271 describes videoconferencing apparatus in which a multipoint control unit selects every nth field of the incoming video signals to derive a single output signal which is sent to the participants. Again, all participants receive the same video signal.
These current systems suffer from the intrusion of the picture-switching process, and the feeling of presence is lost since not all participants can be seen at any one time. An example of "loss of presence" occurs when a participant is particularly quiet or is merely listening; it is easy to forget that this participant is present in the teleconference.
A more desirable approach to multipoint videoconferencing would be to enable participants to be seen and heard at all times during the conference, making a videoconference closer to a real face-to-face meeting.
In accordance with the invention, image processing apparatus comprises input means for receiving input signals from n terminals, where n is an integer greater than or equal to 3, each input signal representing frames of a video signal, processing means for forming n composite signals each representing different combinations of at least two of the input signals, and means for transmitting the composite signals to the relevant terminal. Preferably the processing means comprises means for identifying control data in each input signal, means for redefining the control data for inclusion in the composite signals and means for inserting video data from the input signals into the composite signals.
Since the video data itself is not processed, the propagation delays through the apparatus are relatively low, so providing an acceptable degree of service to users.
Preferably, the frame rate of the composite signals may be equal to the highest frame rate of the input signals or equal to a predetermined fixed rate.
Preferably the input signals conform to the quarter Common Intermediate Format and the composite signals conform to the Common Intermediate Format.
In accordance with a further aspect of the invention a method of processing image data from a plurality of terminals comprises receiving the input signals from n terminals, where n is an integer greater than or equal to 3, processing the input signals to form n composite signals representing combinations of at least two input signals, each composite signal being different, and transmitting the composite signals to the relevant terminals.
When n is greater than 5, the composite signals may represent combinations of four input signals, the input signals preferably being selected on the basis of at which terminal the most recent speakers are located.
The method preferably includes identifying control data in each input signal, redefining the control data for inclusion in the composite signals and inserting video data from the input signals into the composite signals. The invention will now be described by way of example only with reference to the accompanying drawings in which:
Figure 1 shows schematically a multipoint videoconference;
Figure 2 shows an area of a video image divided into blocks;
Figure 3a shows a macro block consisting of four luminance and two chrominance blocks;
Figure 3b shows a group of blocks (GOB);
Figure 3c shows the structure of a whole image consisting of twelve groups of blocks according to the Common Intermediate Format and three groups of blocks according to quarter CIF;
Figure 4 shows the framing structure for an H.261 encoded picture;
Figure 5 shows the functional elements of apparatus according to the invention;
Figure 6 shows schematically a CIF picture formed from four QCIF pictures, according to the invention;
Figure 7 shows an example of a look-up table defining the new GOB numbering of video data for each output; and
Figure 8 shows the functional elements of an alternative embodiment of apparatus according to the invention.
As shown in Figure 1, a multipoint videoconference involves at least three locations, a videoconferencing terminal 12 being provided at each location. The locations might be in the same country or spread over a number of countries. In the embodiment shown in Figure 1, a multipoint control unit (MCU) 14 controls the videoconference and performs all the required audio and video mixing and switching and the control signalling. Each terminal 12 is connected to the MCU 14 via broadband digital links such as Integrated Services Digital Network (ISDN) B-channels. In the UK each B-channel has a capacity of 64 kbit/s.
Each terminal 12 conforms to the H.261 standard and is capable of transmitting CIF or QCIF pictures. On commencement of a videoconference all participating terminals signal their capabilities to the MCU, which then signals to the terminals to request the data in QCIF format.
According to the H.261 standard, images are divided into blocks 22 as shown in Figure 2 for subsequent processing. The smallest block size is an 8×8 pixel block, but other sized blocks may be employed. A group of four such luminance (Y) blocks, and the two corresponding chrominance (Cb and Cr) blocks that cover the same area at half the luminance resolution, are collectively called a macro block (MB), as shown in Figure 3a. Thirty-three macro blocks, grouped and numbered as shown in Figure 3b, are known as a group of blocks (GOB). The GOBs, grouped and numbered as shown in Figure 3c, form a full CIF or QCIF picture.
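The block hierarchy described above can be sketched as follows (a minimal illustration of our own, not taken from the patent; the constants follow the H.261 dimensions given in the text):

```python
# Derive H.261 block-hierarchy counts from the CIF/QCIF picture sizes.
MB = 16        # a macroblock covers 16x16 luminance pixels (four 8x8 Y blocks)
GOB_MBS = 33   # macroblocks per group of blocks (arranged 11 wide by 3 high)

def gobs_per_picture(width, height):
    """Number of GOBs in a picture; each GOB spans 176x48 luminance pixels."""
    macroblocks = (width // MB) * (height // MB)
    return macroblocks // GOB_MBS

print(gobs_per_picture(352, 288))  # CIF  -> 12 GOBs
print(gobs_per_picture(176, 144))  # QCIF -> 3 GOBs
```

This reproduces the twelve GOBs of a CIF picture and the three GOBs of a QCIF picture shown in Figure 3c.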
The framing structure for a frame of H.261 encoded data is shown in Figure 4. The structure is organised in a series of layers, each containing information relevant to the succeeding layers. The layers are arranged as follows: Picture layer 401; GOB layer 403; MB layer 405; and Block layer 407. Each of the layers has a header. The picture header 402 includes information relating to the picture number of the encoded picture, the type of picture (e.g. whether the picture is intraframe coded or interframe coded) and Forward Error Correction (FEC) codes. The GOB header 404 includes information relating to the GOB number within the frame and the quantising step size used to code the GOB. The MB header 406 includes information relating to the MB number and the type of the MB (i.e. intra/inter, forward/backward predicted, luminance/chrominance etc.).
Figure 5 shows apparatus according to the invention for combining four
QCIF coded pictures into a single full CIF picture. Such apparatus is provided within the MCU 14. Each individual terminal 12 participating in the videoconference supplies QCIF H.261 formatted video data to the MCU 14. The apparatus shown in Figure 5 receives five QCIF pictures from the participating terminals and produces CIF signals representing each combination of four QCIF pictures into a 2×2 array of QCIF coded pictures. The resulting CIF signals are then transmitted to the appropriate participating terminals 12 for display on a display capable of displaying CIF resolution pictures. The apparatus shown only operates upon the video signals from the terminals 12: the audio, user data information and signalling are controlled in a conventional manner by the host
MCU 14 in which the apparatus is located.
The apparatus comprises five inputs 51a-e for receiving QCIF format signals from five participating terminals 12. Each input signal is input to a forward error correction (FEC) decoder 52a-e which decodes the FEC codes contained in the picture header 402 of each signal, error corrects the video data of the signal in a conventional manner and establishes framing locks on each input signal. Once framing is established for a particular signal, each FEC decoder 52 signals this to a control means 54. The control means 54 may be provided by a microprocessor. The error corrected QCIF signals are then input to first-in-first-out (FIFO) input buffers 53a-e.
The control means 54 then searches each contributing error-corrected QCIF signal to identify header codewords (such as the GOB header 404 and the MB header 406). This is achieved by a device 55 which decodes the attributed data in the FEC-corrected QCIF signals output from the input buffers 53. The device 55 comprises a series of comparators (not shown) and a shift register (not shown) of sufficient length to hold the longest code word. The comparators compare the data as it enters the shift register and, when a code word is identified, it is forwarded to the control means 54 via a bus 55a. The shift registers then perform a serial to parallel conversion to organise the input video data into bytes for output via bus 55b and convenient storage in random access memory (RAM) 56. A suitable device 55 to perform these operations is a Field Programmable Gate Array (FPGA) such as a Xilinx device.
Each GOB will thus be reorganised into a number of words (16-bit or 32-bit) having newly assigned byte boundaries, since H.261 signals are not originally organised in bytes. Thus the bytes of data allocated to a particular GOB may inevitably contain data not relevant to that GOB; this data will form part of the first and last bytes of the GOBs concerned. These first and last bytes are marked to state the number of valid bits that they contain. The control means 54 monitors the status of the data content of the individual input buffers 53a-e via an input control device 60 (such as an FPGA) to ensure that there is no overflow or underflow of data in the buffers. The video data of each GOB is allocated to a portion of random access memory (RAM) 56. Since intra- or inter-frame coding may be used in H.261, the amount of video data within a GOB may vary significantly. The video data of each
GOB is therefore allocated to a portion of RAM of sufficient capacity to hold the largest possible GOB allowed under H.261. The GOBs for a particular QCIF picture (which contains three GOBs) are logically grouped together in the RAM.
Together with the video data, various codes associated with each GOB are also stored in the RAM. These codes relate to: the source of the data (i.e. from which terminal 12 the video originated); the picture number (PIC) of the current picture held in RAM from a particular source; the original group number (OGN) (1, 2, 3) of the GOB in a particular PIC; the number of bytes (Nbyte) in the GOB; the valid data content (VFByte) of the first byte in a GOB; and the valid data content (VLByte) of the last byte in a GOB.
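The per-GOB record described above might be represented as follows (a sketch of our own; the field names mirror the codes listed in the text but are not the patent's implementation):

```python
# Per-GOB bookkeeping record stored in RAM alongside the GOB's video data.
from dataclasses import dataclass

@dataclass
class GobRecord:
    source: int   # terminal 12 from which the video originated
    pic: int      # picture number (PIC) of the current picture from that source
    ogn: int      # original group number (1, 2 or 3) within the QCIF picture
    nbyte: int    # number of bytes (Nbyte) of video data in the GOB
    vfbyte: int   # valid bits in the first byte (data is not byte-aligned)
    vlbyte: int   # valid bits in the last byte

rec = GobRecord(source=1, pic=7, ogn=2, nbyte=310, vfbyte=5, vlbyte=3)
print(rec.ogn)
```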
Also associated with each GOB is a number of pointers to locate the position of headers within the frame. These are used, for example, to locate the OGN codeword position for editing purposes prior to compilation of the video data to form a CIF format signal.
The following processes then take place to compose each new CIF picture-data sequence from the original individual constituent QCIF picture data stored in the RAM 56:
• Assign an appropriate CIF Picture Header for the output CIF Frame; this is output ahead of the GOBs of data.
• Edit each GOB Header code to conform to the new positions of each GOB in the CIF structure required for the given output to which the data is to be sent.
• Transfer the needed GOB data from each constituent QCIF picture (held in RAM 56) in the correct sequence after the CIF Picture Header, to form the output CIF frame data sequence required for each output. An example of a required sequence is depicted in Figure 6. For example Output 3', which is the H.261 CIF sequence required for output 3, will require GOB data (after the new CIF picture header) from all of the other pictures (except input 3) in the following sequence:
<Pic 1, GOB 1> <Pic 2, GOB 1> <Pic 1, GOB 2> <Pic 2, GOB 2> <Pic 1, GOB 3> <Pic 2, GOB 3> <Pic 4, GOB 1> <Pic 5, GOB 1> <Pic 4, GOB 2> <Pic 5, GOB 2> <Pic 4, GOB 3> <Pic 5, GOB 3> where <Pic x, GOB y> represents GOB number y from input number x. A look-up table of the required header editing (as shown in Figure 7) is used to guide the control means 54.
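The renumbering behind this sequence can be sketched as follows (our own illustration of the look-up table of Figure 7, assuming the standard H.261 CIF GOB layout in which the left quadrants hold odd-numbered GOBs and the right quadrants even-numbered GOBs):

```python
# CIF GOB numbers occupied by each quadrant of the 2x2 array of QCIF pictures.
QUADRANT_GOBS = {
    "top_left": (1, 3, 5),
    "top_right": (2, 4, 6),
    "bottom_left": (7, 9, 11),
    "bottom_right": (8, 10, 12),
}
QUADRANT_ORDER = ("top_left", "top_right", "bottom_left", "bottom_right")

def cif_sequence(inputs):
    """inputs: four source picture numbers in (TL, TR, BL, BR) order.
    Returns (source, original_gob, new_gob) tuples in CIF transmission
    order, i.e. new GOB numbers 1..12."""
    placed = {}
    for quadrant, src in zip(QUADRANT_ORDER, inputs):
        for ogn, new_gn in enumerate(QUADRANT_GOBS[quadrant], start=1):
            placed[new_gn] = (src, ogn, new_gn)
    return [placed[gn] for gn in range(1, 13)]

# The sequence for Output 3' (built from pictures 1, 2, 4 and 5):
seq = cif_sequence([1, 2, 4, 5])
print(seq[0], seq[1])  # (1, 1, 1) (2, 1, 2) -> <Pic 1, GOB 1>, <Pic 2, GOB 1>
```

Running this yields exactly the <Pic x, GOB y> interleaving listed above.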
The contents of each portion of the RAM are polled by the control means 54 at the highest allowed H.261 picture rate, approximately 30 Hz. When a complete frame of data for an individual QCIF signal from a terminal 12 is available, it is transferred to an output data FIFO 57. If the required data for any QCIF segment of the CIF frame is not yet available from the RAM, then an empty GOB of data (i.e. just a header) is transferred instead. This allows the destination terminal to display an image until a new frame is ready to be sent by the MCU. The control means 54 monitors the status of the individual areas of the RAM to ensure that the above procedure is followed.
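The empty-GOB substitution might look as follows (a sketch under our own assumed data model, not the patent's implementation):

```python
# When a source has no complete new QCIF frame in RAM, emit a header-only
# GOB so the destination terminal keeps displaying its previous image.

def gob_for_output(ram, source, ogn, new_gn):
    """Return the GOB for one CIF slot: real data if a complete frame is
    available from `source`, otherwise an empty (header-only) GOB."""
    frame = ram.get(source)  # dict mapping original GOB number -> payload
    if frame is None:
        return {"gn": new_gn, "data": b""}  # empty GOB: just a header
    return {"gn": new_gn, "data": frame[ogn]}

ram = {1: {1: b"\x01", 2: b"\x02", 3: b"\x03"}}  # only source 1 is ready
print(gob_for_output(ram, 1, 2, 3)["data"])  # real payload
print(gob_for_output(ram, 2, 1, 2)["data"])  # empty: source 2 not ready
```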
All used outputs are loaded with data in a polled sequential cycle: i.e. each CIF frame output is built up by transferring one GOB at a time to each output buffer 57 in turn before returning to the first to start again. As can be seen from Figure 6, data for several outputs tend to require the same input picture data at any one time in the CIF compilation sequence, allowing a large degree of parallelism to be employed in the data transfer.
The RAM 56 is of sufficient capacity to store a sequence of several QCIF frames of data from any single source if required, although in normal operation on
average only two QCIF frames of data are required. Once an area of RAM has been transferred to all of the required output buffers 57a-e, the area is made available for storing a new QCIF frame. New MB address Stuffing Codes are omitted or inserted to control the output data-rate to comply with H.261 for a CIF picture.
The output buffers 57 buffer the data being assembled from the original
QCIF data GOBs prior to forward error correction coding. Once sufficient data to form a full FEC frame of data (492 bits) has been loaded into an output FIFO 57, the data is fed to a following FEC encoder 58a-e for forward error correction framing.
The output buffers 57 are of sufficient capacity to allow loading of data to take place without overflow, whilst providing the FEC encoders 58 with data when requested without underflow. The flow of data into the buffers 57 and out of the buffers 59, and the FEC, is controlled by an output control 62, which may also be an FPGA device. The forward error corrected signal output from the encoders 58 is input to CIF output buffers 59a-e which buffer the CIF signals for transmission to the relevant participating terminal 12.
CIF Output Frame Rate
Each of the individual terminals 12 participating in a conference is autonomous. This means that there will tend to be different and varying amounts of information within each individual QCIF-coded picture; each terminal 12 will be operating at slightly different picture-rate tolerances (±50 ppm); and each terminal 12 can produce a different picture-rate (through picture dropping). The last item potentially creates the biggest problem. The possible options and alternatives available when combining pictures at different frame-rates into a larger picture of one frame-rate are discussed below.
The combined CIF picture is compiled from a maximum of four contributing QCIF pictures. If different picture rates are used by the different QCIF picture feeds, then the combined CIF picture may be formed for instance by either using the highest QCIF picture rate present or using a fixed pre-determined rate.
If the QCIF source with the highest picture-rate is used to determine the CIF output frame-rate, this rate may vary dynamically with the changing scene contents which are encoded by each participating terminal 12. It is possible to
keep track of the highest current picture-rate and to modify the CIF output frame-rate accordingly.
Alternatively, the highest picture-rate possible (29.97 Hz), or some other pre-determined rate, is used to set the CIF output frame-rate. In this case, individual QCIF data picture-rates would not be used to determine the output rate. This option is slightly more wasteful of data-capacity than the previous option, requiring a larger 'overhead', but simplifies the operation of the apparatus and potentially allows for the use of each individual Temporal Reference (TR) code of an H.261 format signal. The TR code can be used to determine the relative temporal position of each QCIF picture within a sequence of CIF frames, possibly leading to enhanced rendition of motion when displayed. It may well be that one or more of the terminals 12 can only receive pictures at a particular lower rate. In this case that lower rate will set the limit on the maximum allowable pre-determined CIF picture-rate for all participants, the controlling MCU 14 signalling this to all the participating terminals. The MCU can impose a maximum picture rate on the contributing incoming feeds if necessary.
The newly formed CIF format signals have a mean data-rate that is the sum of the data rates of the constituent QCIF pictures, plus an additional 'overhead' capacity to cater for combining pictures with the different picture-rates (as discussed above). Each CIF frame must contain all of the constituent GOB headers, even for any omitted data. A proportionally higher data rate will be required on the output CIF channel, depending upon the picture-rate disparities between the incoming QCIF feeds. The following is an estimation of a 'worst-case' scenario, to determine the overhead required.
Worst Case Scenario
Say one QCIF source picture-rate is 30 Hz, whilst the other three are 1 Hz. This means that there will be 29 inserted pictures in every 30 where additional GOB headers without associated data need to be inserted to form the CIF output. Say 26 bits are allocated to each GOB header. Therefore the total number of additional GOB header bits for 3 QCIF pictures (each containing 3 GOBs) is: 3 × 3 × 26 = 234 bits/CIF frame. These extra bits are added for 29 frames out of 30 in every second:
29 × 234 = 6,786 extra overhead bits/sec
Thus a constant 'overhead' of 6.786 kbit/s is required. This quantity will be a greater fraction of the overall data rate for the lower data-rates.
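The worst-case arithmetic above can be checked with a short sketch (our own illustration; the constants are those stated in the text):

```python
# Worst-case empty-GOB overhead: three slow QCIF sources each contribute
# three 26-bit GOB headers to every inserted (data-less) CIF frame.
GOB_HEADER_BITS = 26
GOBS_PER_QCIF = 3

def overhead_bits_per_sec(cif_rate_hz, slow_sources, slow_rate_hz):
    inserted_frames = cif_rate_hz - slow_rate_hz                 # 29 of every 30
    bits_per_frame = slow_sources * GOBS_PER_QCIF * GOB_HEADER_BITS  # 234 bits
    return inserted_frames * bits_per_frame

print(overhead_bits_per_sec(30, 3, 1))  # -> 6786 bits/s, i.e. 6.786 kbit/s
```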
Each of the terminals 12 may allocate a different channel capacity (R) to video data for transmission to the MCU. The image processor of the invention in the MCU produces a combined CIF coded video signal for transmission at the highest allowed picture-rate for the call. If no constraints are set, this will be 30 Hz (in fact 29.97 Hz ± 50 ppm); constraints can be sent from the MCU 14 (using, for example, H.221 format signalling) to lower this to say 15, 10 or 7.5 Hz if desired or required. This allows the image processor of the invention to handle all incoming QCIF rates, allowing empty GOBs to be transmitted when there is insufficient video data from any source.
When empty GOBs are transmitted, additional information is required for the GOB header data, leading to an additional 'overhead' (discussed earlier) of data capacity required for the output to each terminal 12. Under 'worst case' conditions (one QCIF source at 30 Hz with the other three at 1 Hz to be combined into 30 Hz CIF frames) this overhead will be approximately an additional 6.8 kbit/s, independent of the overall channel capacities involved. When viewed on an H.221 time-slot basis, this overhead works out to be about 68 bits in every B-channel of 8 × 80 bits: the overhead will thus fit within a single 8 kbit/s Sub-Channel (80 bits).
The down-link (from MCU to terminal 12) channel capacities required are therefore the sum of the four QCIF capacities which will go to form the new CIF pictures, plus the overhead, audio, data, a frame alignment signal and a bit allocation signal.
Data Header Modification
As described earlier, modifications are made to the data header information associated with each new CIF frame which is to be compiled out of the original contributing QCIF data. These modifications are performed on the data which is held in the RAM prior to its sequenced transfer to the output buffers 57a-e.
As outlined earlier, each incoming H.261 coded QCIF picture is autonomous with its own unique data structure. The internal structure is
organised in a series of layers, as shown in Figure 4, each containing information relevant to the succeeding layers. The modifications made to these layers to compile a CIF-format frame are outlined below:
Picture Layer: The individual constituent QCIF Macroblocks are assigned a location in the new CIF array of Macroblocks. A new Picture Layer Picture Start Code (PSC) is assigned to conform to the new CIF format, and a flag is set which defines the source format (0: QCIF, 1: CIF) to declare CIF for the coded pictures output. The Temporal Reference (TR) code could be taken as that of one of the contributing QCIF pictures, 'averaged' from all of the contributions, or used to temporally locate each QCIF segment of data into the new CIF frame.
GOB Layer:
Each individual QCIF GOB header Group Number (GN) (a 4-bit positional locator number code) is edited to be redefined for the new CIF structure, as shown in the table of Figure 7.
MB Layer:
A Macro-Block stuffing (MBA stuffing) code word is available and may be employed for 'padding out' the data content if desired.
Figure 6 shows the resulting CIF pictures for a videoconference including five terminals. Each CIF picture is formed from four QCIF pictures. The last CIF picture in Figure 6 represents a combination of the QCIF signals from terminals number 1, 2, 3 and 4 and is transmitted from the MCU to terminal number 5. Thus terminal number 5 will display a composite image composed of images from all four other participating terminals 12. The image processor of the invention can produce a CIF picture from one, two, three or four QCIF pictures. This method could also be used to combine CIF formatted pictures into "Multiple-CIF" formats (e.g. to combine four CIF images into one composite signal) to produce higher resolution pictures. It could similarly also be used, with only minor changes, to combine MPEG (H.262) pictures into multiple pictures.
The location information contained in the H.261 data headers may be edited to position individual picture segments anywhere within the available display field as desired. This can be used to produce a subjectively more pleasing
arrangement of contributing QCIF pictures when there are fewer than four participants being displayed. For example, if the final CIF picture is compiled from only two contributing QCIF pictures, such as would be the case in a three-way conference, then it may be subjectively better to arrange the two pictures say side by side in the middle of the screen, rather than in any of the corners. This can easily be achieved by re-numbering the constituent GOBs for each QCIF picture to occupy, for example, positions 3, 5, 7 and 4, 6, 8 in the CIF array. Alternatively, the images may be placed on top of each other, at the top of the display, etc.
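The side-by-side re-numbering above can be sketched as follows (our own helper names, not the patent's; the positions are the example GOB numbers 3, 5, 7 and 4, 6, 8 given in the text):

```python
# Re-number two QCIF pictures' GOBs so they sit side by side in the
# middle of the CIF frame rather than in the corners.
SIDE_BY_SIDE = {"left": (3, 5, 7), "right": (4, 6, 8)}

def renumber(side, ogn):
    """Map a QCIF original group number (1-3) to its new CIF GOB number."""
    return SIDE_BY_SIDE[side][ogn - 1]

print([renumber("left", g) for g in (1, 2, 3)])   # [3, 5, 7]
print([renumber("right", g) for g in (1, 2, 3)])  # [4, 6, 8]
```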
Although the above specific description has focussed on video signals conforming to the H.261 standard, there is no intention to limit the scope of the invention to video signals of this type. For instance, the invention is also applicable to video signals conforming to one of the MPEG standards. In this case, since the pictures are not confined to QCIF and CIF pictures, a composite signal may be generated which represents more than four QCIF pictures. For instance, say the resolution of a user's screen is 352 pixels by 288 lines and each participant terminal transmits to a central image processing apparatus according to the invention a full resolution (i.e. 352 × 288) picture. If the image processing apparatus is arranged to display four images, a pre-processor 80 (as shown in Figure 8) then pre-processes each incoming signal to reduce its resolution by 50% in each dimension. (In Figure 8, like elements are indicated by the same reference numerals as in Figure 5.)
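The pre-processor's 50% reduction in each dimension can be sketched as simple pixel decimation (an illustration only; a real pre-processor would typically low-pass filter before subsampling to avoid aliasing):

```python
# Halve a frame's resolution in each dimension by keeping every other
# pixel of every other row: 352x288 in -> 176x144 (QCIF-sized) out.

def halve(frame):
    """frame: list of rows of pixel values."""
    return [row[::2] for row in frame[::2]]

frame = [[0] * 352 for _ in range(288)]   # a full-resolution 352x288 picture
small = halve(frame)
print(len(small), len(small[0]))  # 144 176
```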