JP2025075999A

JP2025075999A - COMMUNICATION SUPPORT METHOD, COMMUNICATION SUPPORT DEVICE, AND PROGRAM

Info

Publication number: JP2025075999A
Application number: JP2023187579A
Authority: JP
Inventors: 圭吾松原; Keigo Matsubara; 智鳴子; Satoshi Naruko; 博康高木; Hiroyasu Takagi; 高志梶山; Takashi Kajiyama; 聡杉浦; Satoshi Sugiura
Original assignee: Yamaha Corp
Current assignee: Yamaha Corp
Priority date: 2023-11-01
Filing date: 2023-11-01
Publication date: 2025-05-15

Abstract

PROBLEM TO BE SOLVED: To provide a communication support method capable of making it known that a participant on the far-end side or the near-end side intends to speak. SOLUTION: A communication support method acquires images of all participants in a remote conference, on both the near-end side and the far-end side, from video data captured by cameras; detects, from the images of all the participants, intention expression information, which is an index for determining whether a participant intends to speak; estimates, on the basis of the intention expression information, a person among all the participants who intends to speak; and displays information about that person on a display. SELECTED DRAWING: Figure 3

Description

One embodiment of the present invention relates to a communication support method, a communication support device, and a program.

Patent Document 1 discloses speech action detection for web conferences. The disclosed technique detects speech from input audio and detects preparatory actions (backchannel responses, up-and-down hand movements, nodding, and body movements) from video and audio. It assigns and holds a predetermined number of points for each action, and shares the presence or absence of speech and the points by transmitting them to the other terminals. Each terminal then selects the user who is speaking and superimposes a frame on that user's display area. Each terminal also sums the points for preparatory actions within a predetermined validity period to obtain a speech possibility score, selects the user with the highest score, and superimposes a blinking frame on that user's display area.

Patent Document 1: JP 2011-118632 A

For example, when multiple near-end participants are conversing, it is difficult for them to notice that a far-end participant wishes to speak. It is therefore difficult for the far-end participant to join the conversation.

One of the embodiments aims to provide a communication support method that can make it known that a far-end or near-end participant wishes to speak.

The communication support method acquires images of all participants in a remote conference, including those on the near-end side and the far-end side, from video data captured by a camera; detects, from the images of all the participants, intention expression information, which is an index for determining whether a participant intends to speak; estimates, based on the intention expression information, a person among all the participants who intends to speak; and displays information about that person on a display.

According to one embodiment of the present invention, it can be made known that a far-end or near-end participant wishes to speak.

FIG. 1 is a block diagram of a communication support system.
FIG. 2 is a schematic elevation view of the room at the first location 1 in which the communication support system is installed.
FIG. 3 is a schematic elevation view of the room at the second location 2 in which the communication support system is installed.
FIG. 4 is a flowchart showing the operation of the communication support method.
FIGS. 5 and 6 are examples of a screen of a remote conference application program displayed on the display of the PC 30 at the first location 1.
FIGS. 7 to 12 are examples of a screen of a remote conference application program displayed on the display of the PC 30 at the second location 2.

Figure 1 is a block diagram of a communication support system. The communication support system includes a microphone 10, a camera 20, and a personal computer (PC) 30 installed at each of a first location 1 and a second location 2. The microphone 10 and the camera 20 are connected to the PC 30. The PC 30 at the first location 1 and the PC 30 at the second location 2 are connected via a network.

Figure 2 is a schematic elevation view of the room at the first location 1. As an example, the microphone 10 is installed on the ceiling of the room. The microphone 10 has a thin, rectangular parallelepiped housing. The camera 20 is installed on a desk. The desk is placed directly below the housing of the microphone 10. In the example of Figure 2, multiple participants (participants u1 and u2) are located around the desk.

Figure 3 is a schematic elevation view of the room at the second location 2. As an example, the microphone 10 is installed on a desk. The microphone 10 has a thin, rectangular parallelepiped housing. The camera 20 is installed on the PC 30. In the example of Figure 3, participant u3 is located at the desk, in front of the PC 30.

In this embodiment, the microphones 10, cameras 20, and PCs 30 installed at the first location 1 and the second location 2 have the same configuration and the same functions.

The camera 20 captures images of the participants. The camera 20 performs predetermined signal processing on the video signal of the captured images and transmits the processed video signal to the PC 30. The camera 20 performs framing processing by, for example, panning, tilting, or zooming.

The microphone 10 acquires the participants' voices. The microphone 10 performs predetermined signal processing on the acquired sound signal. The microphone 10 is, for example, an array microphone having multiple microphone units. The microphone 10 applies, for example, beamforming directivity processing to the acquired sound signal. Beamforming is, for example, a process that aligns phases toward the direction of the speaker by delay-and-sum processing and thereby forms a sound collection beam with increased sensitivity in that direction. The microphone 10 may obtain direction information for the speaker's voice and steer the sound collection beam toward the speaker. The microphone 10 analyzes the sound signals acquired from the multiple microphone units to estimate the direction of arrival of the voice. Any analysis method may be used, such as the cross-correlation method, the delay-and-sum method, or the MUSIC (Multiple Signal Classification) method. In the cross-correlation method, the microphone 10 calculates, for example, the cross-correlation between the sound signals of multiple microphone units. For example, the microphone 10 finds the peak of the cross-correlation between the sound signals of one pair of microphone units, and then finds the peak of the cross-correlation between the sound signals of another pair. Based on the multiple cross-correlation peaks calculated in this way, the microphone 10 estimates the direction of arrival of the voice. The microphone 10 can thereby obtain direction information for the speaker.
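
The cross-correlation step described above can be illustrated with a minimal sketch for a single pair of microphone units. This is not the implementation of the microphone 10: the function names, the far-field assumption, the 16 kHz sample rate, and the 0.1 m unit spacing are all hypothetical values chosen for illustration. The description itself combines the peaks from several pairs.

```python
import math

def cross_correlation_peak_lag(a, b, max_lag):
    """Return the lag (in samples) at which the cross-correlation of a and b peaks.

    A positive lag means that b is a delayed copy of a.
    """
    best_lag, best_value = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        value = 0.0
        for i, sample in enumerate(a):
            j = i + lag
            if 0 <= j < len(b):
                value += sample * b[j]
        if value > best_value:
            best_lag, best_value = lag, value
    return best_lag

def arrival_angle_deg(lag, sample_rate, mic_spacing_m, speed_of_sound=343.0):
    """Convert a peak lag into an arrival angle for one pair (far-field assumption)."""
    delay_s = lag / sample_rate
    # Clamp to [-1, 1] so that rounding never pushes asin outside its domain.
    ratio = max(-1.0, min(1.0, delay_s * speed_of_sound / mic_spacing_m))
    return math.degrees(math.asin(ratio))

# Synthetic example: an impulse reaches the second unit 3 samples after the first.
mic_a = [0.0] * 32
mic_b = [0.0] * 32
mic_a[10] = 1.0
mic_b[13] = 1.0

lag = cross_correlation_peak_lag(mic_a, mic_b, max_lag=8)   # 3
angle = arrival_angle_deg(lag, sample_rate=16000, mic_spacing_m=0.1)  # roughly 40 degrees
```

A single pair only resolves the angle relative to the axis joining the two units, which is one reason to combine peaks from two or more pairs as the description suggests.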

The microphone 10 transmits the processed sound signal to the PC 30. The PC 30 transmits the sound signal received from the microphone 10 and the video signal received from the camera 20 to the PC 30 on the far-end side. The PC 30 may display the video signal received from the camera 20 on its own display (not shown).

The PC 30 also receives a video signal and a sound signal from the PC 30 on the far-end side. The PC 30 outputs the received video signal to a display (not shown) and outputs the received sound signal to a speaker (not shown). In this way, the communication support system functions as a component of a remote conference system for holding conferences with remote locations.

The PC 30 is a general-purpose information processing device equipped with a processor that performs various processes. The PC 30 also functions as a remote controller for operating the microphone 10 or the camera 20. The processor of the PC 30 functions as the control unit of the communication support system by reading out an operating program stored in a storage device such as a built-in flash memory. Note that the program does not need to be stored in the flash memory of the PC 30 itself. The PC 30 may, for example, download the program from a server or the like each time.

Figure 4 is a flowchart showing the operation of the communication support method. First, the PC 30 acquires images of all participants in the remote conference, including those on the near-end side and the far-end side, from video data captured by the cameras (S11). For example, the PC 30 at the first location 1 acquires video data captured by the camera 20 at the first location 1 (hereinafter referred to as first video data), and receives video data captured by the camera 20 at the second location 2 (hereinafter referred to as second video data). The PC 30 performs processing to recognize the speakers' faces in each of the first video data and the second video data using a predetermined model, for example one based on a neural network, and thereby acquires images of all the participants.

Next, the PC 30 detects intention expression information, which is an index for determining whether a participant intends to speak, from the acquired images of all the participants (S12). For example, the PC 30 prepares a predetermined trained model in which the relationship between participant video and intention expression information has been learned using a DNN (Deep Neural Network) or the like. The PC 30 acquires, as training data, past video of multiple participants immediately before they started speaking and at the moment they started speaking, and trains the predetermined model on the relationship between the video immediately before the start of speech and the intention expression information. The PC 30 inputs the video of each participant into the trained model to obtain the intention expression information of each participant. In this case, the intention expression information is obtained as a probability value of, for example, 0 to 100% (a conference participation level).

Next, the PC 30 estimates, based on the intention expression information, a person among all the participants who intends to speak (S13). For example, when the conference participation level in the intention expression information exceeds a predetermined threshold, the PC 30 determines that the corresponding participant is a person who intends to speak.
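
As a minimal sketch of the threshold decision in S13, the following assumes that the conference participation level of each participant is already available as a value between 0.0 and 1.0. The function name, the example values, and the threshold of 0.7 are hypothetical and are not taken from the embodiment.

```python
def estimate_intention_holders(participation, threshold=0.7):
    """Return the IDs of participants whose participation level exceeds the threshold."""
    return [pid for pid, level in participation.items() if level > threshold]

# Hypothetical model outputs (0.0 to 1.0, corresponding to 0 to 100%).
levels = {"u1": 0.35, "u2": 0.82, "u3": 0.70}
holders = estimate_intention_holders(levels)   # ["u2"]
```

Because the comparison is strict, a participant whose level is exactly equal to the threshold (u3 here) is not selected, which matches the wording "exceeds a predetermined threshold".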

Then, the PC 30 displays information about the person who intends to speak on the display (S14). Figures 5 and 6 are examples of screens of the remote conference application program displayed on the display of the PC 30 at the first location 1. Figures 7 and 8 are examples of screens of the remote conference application program displayed on the display of the PC 30 at the second location 2. Figures 5 and 7 show the screens when no one intends to speak, and Figures 6 and 8 show the screens when participant u1 is determined to be a person who intends to speak.

As an example of the information about the person who intends to speak, the PC 30 superimposes a rectangular image surrounding participant u2, who intends to speak, at the position of participant u2. This allows all participants to know that there is a participant who wishes to speak.

Alternatively, as an example of the information about the person who intends to speak, the PC 30 may change the video so that participant u2, who intends to speak, stands out, as shown in Figure 9. Figure 9 is an example of a screen of the remote conference application program displayed on the display of the PC 30 at the second location 2. In the example of Figure 9, the PC 30 performs framing processing that enlarges the video of participant u2, who intends to speak, so that it occupies a predetermined proportion of the screen (e.g., 50%). Alternatively, the PC 30 may perform framing processing that enlarges the video of the current speaker (e.g., participant u1) and participant u2, who intends to speak, so that together they occupy a predetermined proportion of the screen (e.g., 50%). The PC 30 may also display an enlarged video of the current speaker (e.g., participant u1) and an enlarged video of participant u2, who intends to speak, side by side in a split screen. The PC 30 may also switch between the enlarged video of the current speaker (e.g., participant u1) and the enlarged video of participant u2, who intends to speak, at predetermined time intervals. In this case, the PC 30 may display the enlarged video of the current speaker for a longer time and the enlarged video of participant u2 for a shorter time. Alternatively, when the PC 30 has estimated multiple persons who intend to speak, it may display the video of a participant with a higher conference participation level for a relatively longer time.
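
The framing processing that targets a predetermined screen occupancy can be sketched as a crop-rectangle calculation. This sketch assumes that "occupancy" means the ratio of the participant's bounding-box area to the displayed area and that the crop keeps the aspect ratio of the source frame. The function name, the bounding-box format (x, y, width, height), and the example values are hypothetical.

```python
import math

def framing_crop(box, frame_w, frame_h, occupancy=0.5):
    """Return (left, top, width, height) of a crop in which the box fills
    roughly `occupancy` of the area, keeping the frame's aspect ratio."""
    x, y, w, h = box
    aspect = frame_w / frame_h
    crop_area = (w * h) / occupancy
    crop_w = math.sqrt(crop_area * aspect)
    crop_h = crop_w / aspect
    # Grow the crop if necessary so that the whole box stays visible.
    scale = max(1.0, w / crop_w, h / crop_h)
    crop_w, crop_h = crop_w * scale, crop_h * scale
    # Never let the crop exceed the source frame.
    shrink = min(1.0, frame_w / crop_w, frame_h / crop_h)
    crop_w, crop_h = crop_w * shrink, crop_h * shrink
    # Centre the crop on the box, then clamp it to the frame.
    cx, cy = x + w / 2, y + h / 2
    left = min(max(cx - crop_w / 2, 0.0), frame_w - crop_w)
    top = min(max(cy - crop_h / 2, 0.0), frame_h - crop_h)
    return left, top, crop_w, crop_h

# Hypothetical bounding box of participant u2 in a 1920x1080 frame.
crop = framing_crop((800, 300, 320, 360), 1920, 1080)
# The crop is approximately (640, 300, 640, 360), so the box fills about 50% of it.
```

When the box is much taller or wider than the frame's aspect ratio, the crop is enlarged to keep the participant fully visible, so the resulting occupancy can fall below the target.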

Because far-end participants appear only on a display, it is difficult for near-end participants to grasp their intentions. In particular, when multiple near-end participants are conversing, it is very difficult for a far-end participant to join the conversation. With the communication support method of this embodiment, however, all participants on the far-end and near-end sides can be informed that someone wishes to speak. This gives participants a new customer experience in which the psychological burden of starting to speak is reduced. In addition, because a far-end participant's intention to speak is conveyed to the multiple near-end participants, even a remote participant can more easily join the near-end conversation, which is also a new customer experience. In other words, the communication support method of this embodiment can create, for far-end participants, a situation close to that of actually meeting and talking in person at the near-end side.

(Variation 1)
The PC 30 may detect gaze information of the participants as the intention expression information from the acquired video of the far-end and near-end participants.

As the gaze information, the PC 30 calculates the gaze direction of each participant and determines whom each participant is looking at. In other words, the gaze information is expressed as a directed graph whose edges run from the participant who is looking to the participant being looked at. The PC 30 estimates, based on the gaze information, a person among all the participants who intends to speak. For example, the PC 30 obtains the number of gazes received by each participant from the directed graph. A participant who receives many gazes is one to whom multiple other participants are paying attention. That is, a high number of received gazes corresponds to a participant with a high conference participation level, such as the current speaker. Therefore, when the number of received gazes exceeds a predetermined threshold, the PC 30 determines that the corresponding participant is a person who intends to speak. Alternatively, the PC 30 may determine that the participant at whom the current speaker is looking is a person who intends to speak. In this way, the communication support method of Variation 1 can estimate the participant whom the current speaker wants to speak next.
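
The directed-graph representation of gaze information can be sketched as a list of (source, target) edges, with the number of received gazes computed as the in-degree of each participant. The function names, the edge format, and the threshold value are hypothetical; how the gaze target of each participant is obtained from video is outside the scope of this sketch.

```python
from collections import Counter

def received_gaze_counts(gaze_edges):
    """Count, for each participant, how many other participants are looking at them."""
    return Counter(target for source, target in gaze_edges if source != target)

def holders_by_gaze(gaze_edges, threshold):
    """Participants whose number of received gazes exceeds the threshold."""
    counts = received_gaze_counts(gaze_edges)
    return sorted(pid for pid, n in counts.items() if n > threshold)

def gaze_target_of(gaze_edges, speaker):
    """Participant the given speaker is looking at, or None if unknown."""
    for source, target in gaze_edges:
        if source == speaker:
            return target
    return None

# Hypothetical gaze graph: u1 and u2 look at u3, and u3 looks at u1.
edges = [("u1", "u3"), ("u2", "u3"), ("u3", "u1")]
by_count = holders_by_gaze(edges, threshold=1)   # ["u3"]
by_speaker = gaze_target_of(edges, "u1")         # "u3"
```

The two functions correspond to the two alternatives in the text: selecting by the number of received gazes, and selecting the participant at whom the current speaker is looking.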

(Variation 2)
The PC 30 may detect, as the intention expression information, motion information indicating a specific motion of a participant from the acquired video of the far-end and near-end participants. The specific motion is, for example, nodding or moving a hand. Such motions are likely to indicate an attempt to join the conversation.

Therefore, the PC 30 may detect the above specific motion information as the intention expression information (S12), and estimate, based on the specific motion information, a person among all the participants who intends to speak (S13). For example, the PC 30 determines that a participant performing the specific motion is a person who intends to speak.

As described above, because far-end participants appear only on a display, it is difficult for near-end participants to notice a specific motion on the far-end side. With the communication support method of Variation 2, however, all participants on the far-end and near-end sides can be informed that someone is performing a specific motion related to speaking.

(Variation 3)
The PC 30 may detect the intention expression information from voice information included in the audio signal (S12), and estimate, based on the voice information, a person among all the participants who intends to speak (S13). For example, when the PC 30 recognizes a specific word (such as a participant's name or the name of the far-end location), it determines that the corresponding participant is a person who intends to speak.

In addition, when, for example, multiple people speak at the same time, the PC 30 may estimate that the participant who subsequently stops speaking (the participant who yields the floor) is a person who intends to speak.
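
The yielding case can be sketched on top of speech segments that have already been attributed to participants. This sketch assumes that each segment is a tuple of (participant ID, start time, end time) in seconds and treats any temporal overlap between two participants as simultaneous speech. The function name and the segment format are hypothetical, and how the segments are obtained (for example from the speaker direction information) is not covered.

```python
def yielding_participants(segments):
    """Return participants who stopped first when their speech overlapped another's.

    segments: list of (participant_id, start_s, end_s).
    """
    yielded = []
    for i, (pid_a, start_a, end_a) in enumerate(segments):
        for pid_b, start_b, end_b in segments[i + 1:]:
            if pid_a == pid_b:
                continue
            overlap = min(end_a, end_b) - max(start_a, start_b)
            if overlap <= 0 or end_a == end_b:
                continue  # no simultaneous speech, or no clear yielder
            # The participant whose segment ends first is taken to have yielded.
            yielder = pid_a if end_a < end_b else pid_b
            if yielder not in yielded:
                yielded.append(yielder)
    return yielded

# Hypothetical segments: u3 starts while u1 is talking and quickly stops.
segments = [("u1", 0.0, 6.0), ("u3", 0.2, 0.9), ("u2", 7.0, 9.0)]
yielded = yielding_participants(segments)   # ["u3"]
```

A practical system would probably also require the two start times to be close together before treating the overlap as a simultaneous start; that refinement is omitted here.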

Alternatively, the PC 30 may record the conversation history and store, in a database, the relationship between each participant and the participant who speaks next. Alternatively, the PC 30 may estimate that the participant who spoke immediately before is the next person who intends to speak.

The PC 30 may also use a trained model that has learned the relationship between conversation content and speech to estimate the next person who intends to speak given the current conversation content. Alternatively, the PC 30 may record, as information for each participant, items such as role (facilitator, note-taker, observer, etc.), knowledge, or skills, and estimate that a participant whose role, knowledge, or skills correspond to the current conversation content is a person who intends to speak.

As described above, the communication support method of Variation 3 detects the intention expression information from voice information. Because the communication support method estimates the person who intends to speak rather than the speaker, a participant who utters only a brief remark that does not affect the conference participation level, such as "Yes" or "That's right," is not estimated to be a person who intends to speak. Therefore, the communication support method of Variation 3 does not highlight or frame such a speaker.

(Variation 4)
The PC 30 may score the conversation participation level of the participants. The PC 30 may estimate, based on the scores, a person among all the participants who intends to speak (S13).

For example, the PC 30 obtains the number of gazes received by each participant from the directed graph described in Variation 1. A participant who receives many gazes is one to whom multiple other participants are paying attention. Therefore, the PC 30 uses the number of gazes received by each participant as the score. The PC 30 may also use, as the score, the result of multiplying each participant's received gazes by a predetermined weighting coefficient. The weighting coefficient, for example, gives a higher weight to a gaze from a participant who receives many gazes and a lower weight to a gaze from a participant who receives few gazes. The participant being looked at by a participant who receives many gazes is likely to speak next, and is thus given a high score.

In addition, the PC 30 may add points for participants performing a specific motion, as motion information, to the gaze-based score. The specific motion is, for example, nodding or moving a hand. Such motions are likely to indicate an attempt to join the conversation. Therefore, the PC 30 adds points, as motion information, to participants performing the specific motion.

When the score exceeds a predetermined threshold, the PC 30 determines that the corresponding participant is a person who intends to speak. By adding the motion-based points to the gaze-based score as described above, the PC 30 can estimate the person who intends to speak with higher accuracy.
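
The scoring in Variation 4 can be sketched as a weighted sum over the gaze graph plus a bonus for specific motions. The weighting rule used here (a gaze counts as 1 plus the number of gazes its source receives), the motion bonus of 1.0, and the threshold of 3.5 are hypothetical choices for illustration; the embodiment only states that gazes from much-watched participants are weighted more heavily.

```python
from collections import Counter

def participation_scores(gaze_edges, acting, motion_bonus=1.0):
    """Score each participant from weighted received gazes plus a motion bonus.

    gaze_edges: list of (source, target) pairs.
    acting: set of participants currently performing a specific motion.
    """
    received = Counter(target for _, target in gaze_edges)
    scores = Counter()
    for source, target in gaze_edges:
        # A gaze from a participant who is watched by many others weighs more.
        scores[target] += 1.0 + received[source]
    for pid in acting:
        scores[pid] += motion_bonus
    return dict(scores)

# Hypothetical situation: u2 and u3 look at u1, u1 looks at u3, and u3 is nodding.
edges = [("u2", "u1"), ("u3", "u1"), ("u1", "u3")]
scores = participation_scores(edges, acting={"u3"})   # {"u1": 3.0, "u3": 4.0}
holders = sorted(pid for pid, s in scores.items() if s > 3.5)   # ["u3"]
```

In this example u3 receives only one gaze, but it comes from the most-watched participant and u3 is also nodding, so u3 ends up with the highest score.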

The PC 30 may also display the scoring results as the information about the person who intends to speak. Figure 10 is an example of a screen of the remote conference application program displayed on the display of the PC 30 at the second location 2. In the example of Figure 10, the PC 30 displays the scoring result around the video of each participant whose score exceeds the predetermined threshold and who is therefore determined to be a person who intends to speak. This allows all participants on the far-end and near-end sides to intuitively grasp how strongly the other participants intend to speak.

The PC 30 may also highlight multiple persons who intend to speak and who have high scores. Figure 11 is an example of a screen of the remote conference application program displayed on the display of the PC 30 at the second location 2. In the example of Figure 11, the PC 30 superimposes a rectangular image surrounding the video of each participant whose score exceeds the predetermined threshold and who is therefore determined to be a person who intends to speak. This allows all participants on the far-end and near-end sides to see at a glance that there are participants who wish to speak.

(Variation 5)
The PC 30 may estimate participants who have no intention to speak based on the intention expression information. The PC 30 may refrain from displaying information about participants who have no intention to speak. The PC 30 may also perform processing to mask the video of participants who have no intention to speak.

For example, when the PC 30 detects a specific motion or posture, such as looking down, it estimates that the participant has no intention to speak. The PC 30 may also estimate that a participant who receives few gazes has no intention to speak. Alternatively, the PC 30 may estimate that a participant whose gaze changes little has no intention to speak.

The PC 30 may also estimate participants who have no intention to speak based on voice information. For example, the PC 30 may estimate that a participant who has not spoken for a predetermined time or longer has no intention to speak. The PC 30 may also estimate that, for example, a participant whose name has not been called has no intention to speak.

(Variation 6)
The PC 30 may identify the speaker from among all the participants based on the audio signal picked up by the microphone, and display information about both the speaker and the person who intends to speak on the display.

Figure 12 is an example of a screen of the remote conference application program displayed on the display of the PC 30 at the second location 2. The PC 30 receives the speaker direction information estimated by the microphone 10. In the example of Figure 12, the PC 30 superimposes a rectangular image surrounding the video of the participant corresponding to the received speaker direction information. Also in the example of Figure 12, the PC 30 superimposes a rectangular image on the video of the participant corresponding to the person who intends to speak. However, the PC 30 preferably displays the speaker information and the information about the person who intends to speak in different styles. For example, in Figure 12, the PC 30 superimposes a dashed rectangular image on the participant corresponding to the person who intends to speak.

This allows all participants to see at a glance both the current speaker and the participants who wish to speak.
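The display logic of Variation 6 can be sketched as follows. This is only an illustration under stated assumptions: the `Tile` record, the per-participant angle, the 15-degree tolerance, the "solid" style name for the speaker, and the rule that the speaker style takes precedence are all hypothetical. The source states only that the two kinds of information are preferably shown differently and that the person with the intention to speak gets a dashed rectangle in Figure 12.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tile:
    """Hypothetical on-screen participant: an ID plus the horizontal angle
    (in degrees) at which the participant sits as seen from the microphone."""
    participant_id: str
    angle_deg: float


def participant_for_direction(tiles, direction_deg, tolerance_deg=15.0) -> Optional[str]:
    """Pick the participant whose angle is closest to the speaker direction.

    Assumption: angles lie in a range without wrap-around (for example
    -90 to 90 degrees), so a plain absolute difference is sufficient.
    """
    if not tiles:
        return None
    nearest = min(tiles, key=lambda t: abs(t.angle_deg - direction_deg))
    if abs(nearest.angle_deg - direction_deg) > tolerance_deg:
        return None  # nobody is close enough to the reported direction
    return nearest.participant_id


def frame_styles(tiles, speaker_direction_deg, intention_holders):
    """Return {participant_id: style} for the rectangles to superimpose."""
    styles = {}
    for pid in intention_holders:
        styles[pid] = "dashed"     # person with the intention to speak
    speaker = participant_for_direction(tiles, speaker_direction_deg)
    if speaker is not None:
        styles[speaker] = "solid"  # assumption: the speaker style wins
    return styles


tiles = [Tile("A", -30.0), Tile("B", 0.0), Tile("C", 35.0)]
print(frame_styles(tiles, 4.0, {"C"}))  # {'C': 'dashed', 'B': 'solid'}
```

A rendering layer would then draw one rectangle per entry in the returned mapping, using the style to choose between a continuous and a dashed outline.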

The description of the present embodiment should be considered illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims, not by the embodiments described above. Furthermore, the scope of the present invention includes the range of equivalents of the claims.

10: Microphone, 20: Camera, 30: PC

Claims

A communication support method comprising:
acquiring, from video data captured by a camera, images of all participants in a remote conference, including those on the near-end side and the far-end side;
detecting, from the images of all the participants, intention expression information, which is an index for determining whether a participant has an intention to speak;
estimating, from among all the participants, a speech intention holder who has an intention to speak, based on the intention expression information; and
displaying information about the speech intention holder on a display device.

The communication support method according to claim 1, wherein gaze information, or motion information indicating a specific movement of a participant, is detected from the images as the intention expression information.

The communication support method according to claim 1 or 2, wherein the intention expression information is detected from audio information included in an audio signal.

The communication support method according to claim 1 or 2, further comprising:
scoring the participants' participation in the conversation; and
displaying a result of the scoring as the information.

The communication support method according to claim 4, further comprising highlighting and displaying a plurality of the speech intention holders who have high scores in the scoring.

The communication support method according to claim 1 or 2, further comprising:
estimating, from the intention expression information, participants who have no intention to speak; and
not displaying information about the participants who have no intention to speak.

The communication support method according to claim 1 or 2, further comprising:
identifying a speaker from among all the participants based on an audio signal picked up by a microphone; and
displaying information about the speaker and the speech intention holder on the display device.

A communication support device comprising a processor configured to:
acquire, from video data captured by a camera, images of all participants in a remote conference, including those on the near-end side and the far-end side;
detect, from the images of all the participants, intention expression information, which is an index for determining whether a participant has an intention to speak;
estimate, from among all the participants, a speech intention holder who has an intention to speak, based on the intention expression information; and
display information about the speech intention holder on a display device.

The communication support device according to claim 8, wherein the processor detects, from the images, gaze information or motion information indicating a specific movement of a participant as the intention expression information.

The communication support device according to claim 8 or 9, wherein the processor detects the intention expression information from audio information included in an audio signal.

The communication support device according to claim 8 or 9, wherein the processor:
scores the participants' participation in the conversation; and
displays a result of the scoring as the information on the display device.

The communication support device according to claim 11, wherein the processor highlights a plurality of the speech intention holders who have high scores in the scoring and displays them on the display device.

The communication support device according to claim 8 or 9, wherein the processor:
estimates, from the intention expression information, participants who have no intention to speak; and
does not display information about the participants who have no intention to speak.

The communication support device according to claim 8 or 9, wherein the processor:
identifies a speaker from among all the participants based on an audio signal picked up by a microphone; and
displays information about the speaker and the speech intention holder on the display device.

A program that causes an information processing device to execute processing comprising:
acquiring, from video data captured by a camera, images of all participants in a remote conference, including those on the near-end side and the far-end side;
detecting, from the images of all the participants, intention expression information, which is an index for determining whether a participant has an intention to speak;
estimating, from among all the participants, a speech intention holder who has an intention to speak, based on the intention expression information; and
displaying information about the speech intention holder on a display device.