JP2007088803A

JP2007088803A - Information processing device

Info

Publication number: JP2007088803A
Application number: JP2005274885A
Authority: JP
Inventors: Masato Togami; 真人戸上; Akio Amano; 明雄天野; Hiroshi Shinjo; 広新庄; Atsushi Ishibashi; 厚石橋
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2005-09-22
Filing date: 2005-09-22
Publication date: 2007-04-05

Abstract

[Problem]
When identifying the scenes of a television program (terrestrial, satellite, or other broadcast) in which a user is interested, affective information related to facial expression, such as gaze, pupils, and blinking, is not sufficient on its own to extract those scenes.
[Solution]
In the media processing device of the present invention, user detection is performed based on image information acquired by an image acquisition unit, and the position of the user's face is estimated. Voice sections are detected from the sound data recorded by a microphone array. When the sound source direction of a detected voice section matches the face position, the target of the utterance is regarded as the media processing device. The utterance in that voice section is then judged to have been made because the user was interested in the video displayed on the media processing device, and the video section at the same time as the voice section is judged to be a video section in which the user was interested.
[Selected drawing] FIG. 2

Description

The present invention belongs to the field of media processing technology for identifying scenes in which a user is particularly interested from video information such as television programs.

Conventionally, techniques for acquiring human affective information include one that extracts affective information related to facial expression, such as gaze, pupils, and blinking, from video of a subject (see, for example, Patent Document 1). This technique gauges the user's interest in a presented video from the affective information, and then searches for and presents videos likely to interest the user based on the videos judged to be of interest.
There is also a technique that has a direction detection unit for detecting the sound source direction of input speech and a human detection unit for extracting a person from an input image, and that judges a sound to be an utterance by a speaker when a person is extracted in the direction of the sound source (see, for example, Patent Document 2).

Japanese Patent Application Laid-Open No. 11-282921

Japanese Patent Application Laid-Open No. 2003-189273

In realizing a system that searches television programs for ones likely to interest a user, or that infers what kinds of television programs a particular person is interested in and automatically selects such programs, the conventional techniques above determine the user's interest using gaze and similar cues. However, information such as gaze alone can be insufficient for grasping what the user really feels.

In this regard, when people watch a television program and find its content interesting, they often make sounds, for example exclaiming in admiration or laughing. The conventional techniques above, however, do not take this human characteristic into account. The present application therefore aims to disclose a program presentation device in which the system can identify, for example, television programs funny enough to make the user laugh out loud and television programs engaging enough to make a person cry out in surprise.

In the media processing device of the present invention, user detection is performed based on image information acquired by an image acquisition unit, and the position of the user's face is estimated. Voice sections are detected from the sound data recorded by a microphone array. When the sound source direction of a detected voice section matches the face position, the target of the utterance is regarded as the media processing device. The utterance in that voice section is then judged to have been made because the user was interested in the video displayed on the media processing device, and the video section at the same time as the voice section is judged to be a video section in which the user was interested.

According to the configuration of the present invention, it is possible to identify scenes in which the user is facing a media processing device such as a television and is speaking or laughing, that is, scenes in which the user is interested.

Representative embodiments of the present invention are described below with reference to the drawings.

FIG. 1 is a basic configuration diagram of the present invention. The media processing device 1 holds a camera 2 and a microphone array 10 in its housing. The media processing device 1 has a display device that presents content such as television programs to the user.

This embodiment uses a microphone array, but any sound acquisition device that captures sound, such as a single microphone, may be used instead. A microphone array provides information that a single microphone cannot, namely the direction of arrival of sound, which makes it possible to improve the performance of the utterance target determination unit 7 of the present invention.

FIG. 2 is a block diagram of this embodiment. In FIG. 1 these processing units are integrated with the display unit, but the processing units and the display unit may also be separate and connected by a wired or wireless link. An image captured by the camera 2 is sent to the image acquisition unit 4 and converted into digital image data, which is then sent to the user detection unit 5. The user detection unit 5 applies face image recognition or gaze recognition to the image data to detect a user's face that is facing forward. Known techniques can be used for the face image recognition and gaze recognition. The face position estimation unit 11, which is connected to the user detection unit 5, estimates the position of the user's face in real space from the face image recognition result of the user detection unit 5. The microphone array 10 captures multi-channel audio signals. These signals are sent to the audio acquisition unit 6 and converted into multi-channel digital data.

The digital data is then sent to the voice detection unit 12. The voice detection unit 12 applies power-based voice section detection to one of the time series in the multi-channel digital data and detects voice sections. Besides speech, the signal entering the microphones contains noise such as footsteps. Speech tends to continue for several seconds, whereas much noise dies out within a relatively short time. Voice section detection therefore makes it possible to remove such short-lived noise and extract only speech that continues for several seconds. Two kinds of voice section detection are used. Single-threshold voice section detection cuts out, as a voice section, the span from the frame at which the frame power of the sound rises to or above a predetermined threshold (the voice start) to the frame at which the frame power falls to or below the predetermined threshold (the voice end). Two-threshold voice section detection is the same except that it uses different thresholds for detecting the voice start and for detecting the voice end. Because the power of speech is greater at the end of a voice section than at its start, setting the power threshold for the end lower than that for the start makes it possible to prevent the start from being dropped from the detected voice section. Pause-tolerant voice section detection, which allows pauses no longer than a predetermined length inside a voice section, can also be applied to either the single-threshold or the two-threshold processing. Speech does not always stay above a given power throughout a voice section, and pauses of several hundred milliseconds can occur; pause-tolerant detection makes it possible to detect the voice section appropriately even in such cases.
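The threshold-based detection described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the per-frame power has already been computed, and the names `detect_voice_sections`, `start_threshold`, `end_threshold`, and `max_pause_frames` are hypothetical. Passing the same value for both thresholds gives the single-threshold variant, and `max_pause_frames` models the pause-tolerant variant.

```python
from typing import List, Optional, Tuple


def detect_voice_sections(
    frame_power: List[float],
    start_threshold: float,
    end_threshold: float,
    max_pause_frames: int = 0,
) -> List[Tuple[int, int]]:
    """Return voice sections as (start, end) frame indices, end exclusive.

    A section opens at the first frame whose power is >= start_threshold.
    It closes once the power has stayed <= end_threshold for more than
    max_pause_frames consecutive frames; shorter pauses stay inside it.
    """
    sections: List[Tuple[int, int]] = []
    start: Optional[int] = None  # first frame of the open section, if any
    quiet = 0                    # consecutive low-power frames seen so far
    for i, power in enumerate(frame_power):
        if start is None:
            if power >= start_threshold:
                start, quiet = i, 0
        elif power <= end_threshold:
            quiet += 1
            if quiet > max_pause_frames:
                # Pause too long: close the section just before the pause.
                sections.append((start, i - quiet + 1))
                start, quiet = None, 0
        else:
            quiet = 0
    if start is not None:
        # Input ended with a section open; drop trailing low-power frames.
        sections.append((start, len(frame_power) - quiet))
    return sections
```

With `max_pause_frames=0` every dip below the end threshold closes the section, so a single utterance containing a short pause is split in two; allowing a one-frame pause keeps it as one section.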

The sound source position detection unit 13 estimates the direction of the most dominant sound source within the voice section detected by the voice detection unit 12. Known techniques can be used to estimate the sound source direction. The utterance target determination unit 7 determines whether the sound source direction output by the sound source position detection unit 13 matches the position of the user's face output by the face position estimation unit 11. The face position and the sound source direction need not match exactly: an allowable error is set in advance, and the two are judged to match if they agree within that range. In this way the utterance target can be estimated correctly even if there is some error in the estimated face position or sound source position. When the sound source direction and the face position match, the utterance target of that voice section is determined to be the media processing device 1. When they do not match, the utterance target is not the media processing device 1, and it is likely that the utterance was not prompted by interest in the video; an example is several people talking while looking at one another. Accordingly, by treating the media processing device 1 as the utterance target, and the utterance as prompted by interest in the video, only when the sound source direction and the face position match, the sections of video in which the user was interested can be extracted with high accuracy. The display content identification unit 14 detects the name and channel of the program being shown at the time the utterance in the voice section detected by the voice detection unit 12 was made. It also cuts out a video section that includes the time of the utterance. The video section may be cut out in units of programs; it may be cut out by adding video of a predetermined length before and after the video at the time of the utterance; or commercial (CM) detection may be performed and the video section that contains the time of the utterance and is bounded before and after by commercials may be cut out. The impression a user has of a video is thought to change relatively slowly rather than frequently over short periods. By contrast, the time during which a user, interested in the video, actually makes a sound is thought to be relatively short. In other words, when a user makes a sound out of interest, the interest is probably not limited to the video at that instant but extends to the video preceding it. And because the user's interest changes relatively slowly, the user is also likely to be interested in the video that follows. Cutting out a video section that includes the video before and after the utterance, and not only the video at the moment of the utterance, therefore yields a section that reflects more accurately the video in which the user was interested.
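The tolerance-based match and the padded cut-out described above can be sketched as follows. The sketch rests on several assumptions that are not in the source: the face position is given as horizontal coordinates in metres relative to the device, the sound source direction is an azimuth in degrees in the same frame, and the default tolerance and padding values are arbitrary. All function and parameter names are hypothetical.

```python
import math
from typing import Tuple


def face_azimuth_deg(face_x_m: float, face_z_m: float) -> float:
    """Azimuth of the face seen from the device.

    0 degrees is straight ahead (along z); positive angles are toward +x.
    """
    return math.degrees(math.atan2(face_x_m, face_z_m))


def directions_match(
    source_deg: float, face_deg: float, tolerance_deg: float = 10.0
) -> bool:
    """True when the two directions differ by at most the allowed error."""
    # Wrap the difference into [-180, 180) so that 179 and -179 are 2 apart.
    diff = (source_deg - face_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= tolerance_deg


def padded_video_section(
    voice_start_s: float, voice_end_s: float, pad_s: float = 30.0
) -> Tuple[float, float]:
    """Extend the utterance interval by pad_s seconds on both sides.

    The start is clamped at 0 so the section never precedes the recording.
    """
    return (max(0.0, voice_start_s - pad_s), voice_end_s + pad_s)
```

For example, a face at (1 m, 1 m) lies at 45 degrees, so a sound source estimated at 40 degrees matches under a 10-degree tolerance, and an utterance from 100 s to 105 s yields the video section from 70 s to 135 s.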

The labeling unit 8 labels each video section output by the display content identification unit 14 with a label that distinguishes the case where the utterance target of the voice section is the media processing device 1 from the case where it is not. The start time and end time of the video section, the program name, the channel information, and the label information are then stored in the storage unit 9. People laugh out loud when they watch a television program and find it funny. When they are absorbed in a program they may also cry out in surprise without realizing it. People watching television with interest thus often make sounds. On the other hand, several people watching television together may be talking about something unrelated to the program. The configuration of this embodiment can identify the scenes in which a user is looking at the display screen and also making a sound, so video sections of the displayed program in which the user was interested can be separated from those in which the user was not.

FIG. 3 shows an example of the data structure, stored in the storage unit 9, for video sections in which a user was interested. The data structure consists of a user name or ID, the channel of the program of interest, and the start time and end time of the video section of interest. When the display content identification unit 14 cuts out video sections in units of programs, information identifying the program, such as the program name, may be added as an item of the data structure. When there are multiple users, it is desirable to record each entry in association with a user identifier, as in FIG. 3. For user identification, a user interface such as a mouse, keyboard, or touch panel may be attached to the invention, the names of all users of the television registered at initial setup, and a user name selected from those displayed on the screen of the media processing device 1 when watching television; alternatively, the invention may be combined with a device that authenticates users automatically using face image authentication, voice authentication, or a similar technique. An embodiment that uses face image authentication for user authentication is described later as the second embodiment of the present invention.
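One possible in-memory form of a record in this data structure is sketched below. The items (user name or ID, channel, start and end time, optional program name, and the label from the labeling unit 8) come from the description; the class name, field names, and types are illustrative assumptions, since FIG. 3 itself is not reproduced here.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class InterestRecord:
    """One stored video section (field names are hypothetical)."""

    user_id: str                        # user name or ID
    channel: str                        # channel of the program being shown
    start: datetime                     # start time of the video section
    end: datetime                       # end time of the video section
    addressed_to_device: bool           # label: was the utterance target the device?
    program_name: Optional[str] = None  # optional, for per-program sections

    def __post_init__(self) -> None:
        # Reject sections whose end precedes their start.
        if self.end < self.start:
            raise ValueError("end must not precede start")

    @property
    def duration_s(self) -> float:
        """Length of the video section in seconds."""
        return (self.end - self.start).total_seconds()
```

Making the record immutable is a design choice of this sketch: a stored section is a fact about a past viewing session, so later stages such as search only need to read it.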

The voice detection unit 12 may also use word spotting, a speech recognition technique, so that it responds only to specific words thought to express the user's interest (for example, "interesting"). Responding only to specific words makes the unit less likely to react to sounds such as throat clearing. In addition, words indicating that the user had a good impression of the video section, such as "interesting" and "intriguing", and words indicating a bad impression, such as "boring" and "not interesting", may be registered in the speech recognition dictionary. An item for the user's impression is then added to the data for each video section in FIG. 3, holding a label such as "good impression" or "bad impression" that shows whether the user was positive or negative. With this configuration it is possible to search not only for video sections of which the user had a good impression but also for those of which the user had a bad impression. Beyond the two values "good impression" and "bad impression", labels such as "moving", "sad", and "fun" may also be added to the data for the video section. In that case, the words corresponding to each label are registered in the speech recognition dictionary in advance, and the dictionary holds label information alongside information such as each word's reading, so that the label to which each word corresponds can be determined. With this configuration, video sections can be classified more finely according to the user's impression.
FIG. 4 is a block diagram of the second embodiment of the present invention. This embodiment differs from the embodiment shown in FIG. 2 in that a user authentication unit 15 is provided. Description of elements that are the same as in the configuration above is omitted. The user authentication unit 15 compares the face information of users registered in advance with the face image output by the user detection unit 5 and determines whose face it is. Because this embodiment incorporates user authentication, users can be authenticated automatically, and the video sections of interest can be classified by user.
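Returning to the word-spotting variant of the voice detection unit 12 described above, the mapping from registered words to impression labels can be sketched as a dictionary lookup. The sketch assumes that a word spotter (not shown) has already produced a list of recognized words; the dictionary contents and the names `IMPRESSION_DICTIONARY` and `impression_label` are illustrative only.

```python
from typing import Dict, List, Optional

# Hypothetical recognition dictionary: registered word -> impression label.
IMPRESSION_DICTIONARY: Dict[str, str] = {
    "interesting": "good impression",
    "intriguing": "good impression",
    "boring": "bad impression",
    "not interesting": "bad impression",
}


def impression_label(
    spotted_words: List[str],
    dictionary: Dict[str, str] = IMPRESSION_DICTIONARY,
) -> Optional[str]:
    """Return the label of the first registered word, or None.

    Words that are not registered (a cough transcribed as "ahem", say)
    are ignored, which is what keeps the detector from reacting to them.
    """
    for word in spotted_words:
        label = dictionary.get(word.lower())
        if label is not None:
            return label
    return None
```

Finer-grained labels such as "moving" or "sad" would only require more entries in the dictionary; the lookup itself stays the same.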

FIG. 5 is a configuration diagram of the third embodiment of the present invention. In this embodiment a robot 16, which is a second processing device, is responsible for controlling the media processing device 1. The microphone array 10 and a camera 2-1 are attached to the housing of the robot 16. As noted above, the microphone array 10 may be replaced by a single microphone. A camera 2-2 is attached to the media processing device 1. A computer 17 controls the robot 16 and the media processing device 1 and can perform signal processing on the input signals from the camera 2-1, the camera 2-2, and the microphone array 10.

FIG. 6 is a block diagram of the third embodiment of the present invention. In this embodiment the media processing device 1 hosts the labeling unit 8, the storage unit 9, the display content identification unit 14, the image acquisition unit 4, the user detection unit 5, the audio acquisition unit 6, the utterance target determination unit 7, the voice detection unit 12, the sound source position estimation unit 13, the user authentication unit 15, the face position estimation unit 11, and the voice analysis unit 18. Some of these units may instead be handled by a processing unit of the robot 16. In the following, description of elements that are the same as in the embodiments above is omitted.

The image data captured by the camera 2-1 and the camera 2-2 are each sent to the image acquisition unit 4 and thereafter processed separately. The utterance target determination unit 7 determines, for each face in the image data captured by the camera 2-1 and the camera 2-2, whether the sound source direction output by the sound source position detection unit 13 matches the face position output by the face position estimation unit 11. When the sound source direction and the face position agree within a predetermined range, the unit returns the result "the utterance target is the robot 16" for a face seen by the camera 2-1 and the result "the utterance target is the media processing device 1" for a face seen by the camera 2-2. When the sound source direction and the face position are not within the predetermined range, nothing is returned. Based on the result from the utterance target determination unit 7, the labeling unit 8 labels the displayed content so as to distinguish the case where the utterance target of the voice section is the media processing device 1 from the case where it is not. A video section labeled as having the media processing device 1 as the utterance target is judged to be one in which the user was interested, and is saved in the per-user video section database in the storage unit 9.
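The per-camera decision described above can be sketched as follows. The sketch assumes that each detected frontal face has already been converted to an azimuth in degrees in the same frame as the sound source direction; that conversion, the tolerance value, and the names `CAMERA_TARGETS` and `utterance_target` are assumptions for illustration.

```python
from typing import Dict, List, Optional

# A frontal face in a camera's image means the user is facing that camera's host.
CAMERA_TARGETS: Dict[str, str] = {
    "camera 2-1": "robot 16",
    "camera 2-2": "media processing device 1",
}


def utterance_target(
    source_deg: float,
    face_deg_by_camera: Dict[str, List[float]],
    tolerance_deg: float = 10.0,
) -> Optional[str]:
    """Return the utterance target, or None when no face matches.

    Angles are assumed to lie in a range where plain subtraction is
    meaningful (no wrap-around handling in this sketch).
    """
    for camera, target in CAMERA_TARGETS.items():
        for face_deg in face_deg_by_camera.get(camera, []):
            if abs(source_deg - face_deg) <= tolerance_deg:
                return target
    return None
```

Returning `None` corresponds to the case in which the determination unit returns nothing, for instance when the sound comes from a direction in which no frontal face was detected.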

When the utterance target returned by the utterance target determination unit 7 is the robot 16, the labeling unit 8 passes the voice section data to the voice analysis unit 18. The voice analysis unit 18 analyzes the input voice section data using known speech recognition techniques and converts it into a character string representing the content of the utterance. When the utterance target is the robot 16, the utterance is considered to be a command to the robot 16, so the utterance must be recognized, the command identified, and the command acted on. As described above, the voice detection unit 12 may analyze the content of an utterance using word spotting. In that case the voice detection unit 12 and the voice analysis unit 18 use different speech recognition dictionaries: the voice detection unit 12 uses a dictionary listing words that express the user's impression or evaluation of a program, such as "interesting", whereas the voice analysis unit 18 uses a dictionary describing predetermined commands to the robot. The robot 16 and the media processing device 1 switch their operation according to the content of the utterance. For example, if the utterance is "change the TV", the program shown on the display device attached to the media processing device 1 is switched, and if the utterance is "look this way", the head of the robot 16 is turned toward the user. How the media processing device 1 and the robot 16 switch operation according to the utterance content is defined in advance by associating it with each word in the speech recognition dictionary.
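The routing of recognized utterances to actions can be sketched as a second dictionary, kept separate from the impression dictionary. The two example commands are the ones given in the description; the action strings, the English phrasing of the commands, and the names `COMMAND_DICTIONARY` and `handle_utterance` are hypothetical, and speech recognition itself is assumed to have already produced a transcript.

```python
from typing import Dict, Optional, Tuple

# Hypothetical command dictionary: phrase -> (device that acts, action).
COMMAND_DICTIONARY: Dict[str, Tuple[str, str]] = {
    "change the tv": ("media processing device 1", "switch program"),
    "look this way": ("robot 16", "turn head toward user"),
}


def handle_utterance(
    target: Optional[str], transcript: str
) -> Optional[Tuple[str, str]]:
    """Interpret a transcript as a command only if it was addressed to the robot.

    Utterances addressed to the media processing device are reactions to
    the video, so they are never looked up in the command dictionary.
    """
    if target != "robot 16":
        return None
    return COMMAND_DICTIONARY.get(transcript.strip().lower())
```

Checking the utterance target before the dictionary lookup is what prevents a laugh or remark aimed at the television from being executed as a robot command.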

FIG. 7 shows a usage example of the third embodiment of the present invention. The face of user 19-1 is turned toward the robot 16, but the user is not making any sound. The face of user 19-2 is turned toward the media processing device 1, and the user is laughing out loud. The camera 2-1 captures the frontal face of user 19-1, and the camera 2-2 captures the frontal face of user 19-2. Because it is user 19-2 who is making the sound, the utterance target determination unit 7 determines that the user's utterance target is the media processing device. The labeling unit 8 judges that the user was interested in the video section during which the face was turned squarely toward the media processing device, and saves only the video sections of interest in the video section database of user 19-2 in the storage unit 9. This embodiment distinguishes, by utterance target, whether a user utterance is a voice command to the robot or a sound made out of interest while watching the video. Utterances judged to be sounds made out of interest in the video are not passed to speech recognition as robot voice commands. The robot is therefore unlikely to recognize such a sound as a voice command and react to it by mistake. Likewise, an utterance made as a voice command to the robot is unlikely to be taken for a sound made out of interest in the video, so incorrect labels are unlikely to be attached to video sections.

FIG. 8 is a flowchart of a system that uses the data structure of the present application stored in the storage unit 9 to search for video sections in which a user was interested. The user enters a user name and a date through an instruction input unit of the search system, such as a GUI. The search system searches the storage unit 9 for video sections matching the entered user name and date and displays the results as a list in the GUI. The user can watch a video section by selecting it from the displayed list. The search system described here takes a date, but a television channel name, a program name, or the like may be specified as well. Users often want to look again at the enjoyable scenes of programs they watched in the past, not only of programs currently on air; with this search system they can easily extract and review the scenes that interested them in the past.
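The search step of this flow can be sketched as a filter over stored records. The sketch keeps the records in a plain list and matches the date against the start time of each section; the storage back end, the matching rule, and the names `StoredSection` and `search_sections` are assumptions, since the flowchart of FIG. 8 is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class StoredSection:
    """A stored video section of interest (field names are hypothetical)."""

    user_name: str
    channel: str
    start: datetime
    end: datetime


def search_sections(
    records: Iterable[StoredSection],
    user_name: str,
    day: Optional[date] = None,
    channel: Optional[str] = None,
) -> List[StoredSection]:
    """Return the user's sections, optionally filtered by date and channel.

    Results are sorted by start time so they can be shown as a list.
    """
    hits = [
        r
        for r in records
        if r.user_name == user_name
        and (day is None or r.start.date() == day)
        and (channel is None or r.channel == channel)
    ]
    return sorted(hits, key=lambda r: r.start)
```

Leaving `day` or `channel` as `None` skips that filter, which mirrors the remark that criteria other than the date may be specified.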

The embodiments disclosed in the present application are carried out by having a computer load a program that causes it to execute the present invention. Alternatively, they may be carried out partly in cooperation with hardware.

FIG. 1: Basic configuration diagram of a representative embodiment of the present invention
FIG. 2: Example block diagram of Embodiment 1
FIG. 3: Example data structure for video sections in which the user was interested
FIG. 4: Example block diagram of Embodiment 2
FIG. 5: Example configuration diagram of Embodiment 3
FIG. 6: Example block diagram of Embodiment 3
FIG. 7: Usage example of Embodiment 3
FIG. 8: Example flowchart of the search system for video sections in which the user was interested

Explanation of symbols

1: media processing device, 2: camera, 3: microphone, 4: image acquisition unit, 5: user detection unit, 6: audio acquisition unit, 7: utterance target determination unit, 8: labeling unit, 9: storage unit, 10: microphone array, 11: face position estimation unit, 12: voice detection unit, 13: sound source position estimation unit, 14: display content identification unit, 15: user authentication unit, 16: robot, 17: computer, 18: voice analysis unit, 19: user

Claims

An information processing apparatus comprising:
an image acquisition unit that acquires an image via a camera;
an audio acquisition unit that acquires sound via a microphone;
a face detection unit that detects a front face of a user from the image;
a face position estimation unit that estimates the position of the front face;
a voice detection unit that detects an utterance of the user from the acquired sound;
a voice position estimation unit that estimates a sound source direction of the detected sound;
a labeling unit that determines whether the estimated face position and the sound source direction are within a predetermined range, and labels the content displayed on a display unit at the time of the face or sound detection with the determination result that they are within the predetermined range; and
a recording unit that records the labeling result.

The information processing apparatus according to claim 1, wherein the labeling is performed on sections of the content cut out between one commercial and the next.

The information processing apparatus according to claim 1, further comprising a user identification unit,
wherein the labeling is performed together with identification information of the user identified by the user identification unit.

The information processing apparatus according to claim 3, wherein the user identification unit performs the user identification by comparing a face image acquired via the image acquisition unit with a registered image.

The information processing apparatus according to claim 1, wherein the voice detection unit performs voice recognition on the detected voice and determines, from the result of the voice recognition, whether the voice is positive or negative toward the content, and
the labeling unit also labels the result of that determination.

An information processing apparatus connected to a display unit that displays content and to a terminal that controls content display on the display unit, the display unit and the terminal each having a camera and a microphone, the apparatus comprising:
an image acquisition unit that acquires video via each of the cameras;
an audio acquisition unit that acquires sound via each of the microphones;
a face detection unit that detects a front face of a user from the video;
a face position estimation unit that estimates the position of the front face;
a voice detection unit that detects an utterance of the user from the acquired sound;
a voice position estimation unit that estimates a sound source direction of the detected sound;
a labeling unit that determines whether the estimated face position and the sound source direction are within a predetermined range and, for the information acquired from the display unit, labels the content displayed on the display unit at the time of the face or sound detection with the determination result;
a recording unit that records the labeling result; and
a voice analysis unit that performs voice recognition on the sound acquired from the terminal for which the face position and the sound source direction are determined to be within the predetermined range, and gives a command instruction to the terminal based on the recognition result.

The information processing apparatus according to any one of the preceding claims, further comprising an instruction input unit,
wherein a designation of a user, a time, a program channel name, or a program name is received via the instruction input unit, and
the video of a labeled section associated with the designated user, time, program channel name, or program name is displayed on the display unit.