
GB2627505A - Augmented voice communication system and method - Google Patents

Augmented voice communication system and method

Info

Publication number
GB2627505A
GB2627505A GB2302707.1A GB202302707A GB2627505A GB 2627505 A GB2627505 A GB 2627505A GB 202302707 A GB202302707 A GB 202302707A GB 2627505 A GB2627505 A GB 2627505A
Authority
GB
United Kingdom
Prior art keywords
user
audio
engagement
elements
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB2302707.1A
Other versions
GB202302707D0 (en)
Inventor
Shapiro Kalila
Leyton Pedro
Ryan Nicholas
Visciglia Aron
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sony Interactive Entertainment Inc
Original Assignee
Sony Interactive Entertainment Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sony Interactive Entertainment Inc filed Critical Sony Interactive Entertainment Inc
Priority to GB2302707.1A priority Critical patent/GB2627505A/en
Publication of GB202302707D0 publication Critical patent/GB202302707D0/en
Publication of GB2627505A publication Critical patent/GB2627505A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/165Management of the audio stream, e.g. setting of volume, audio stream path
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

A system for providing modified audio to a user of a display device. The system comprises a display device configured to display images comprising elements representing other user(s) and to provide corresponding audio output 200. An engagement determination unit identifies an element corresponding to user attention 210 and determines the level of engagement with that element. An audio modification unit modifies the audio output 220 based on the determined level of engagement of the user with that element. Audibility of the audio corresponding to the identified element is increased relative to audio not corresponding to the identified element. The increase in audibility is proportional to the level of engagement. Optionally, audibility of an identified element is improved by increasing its volume and/or by decreasing the volume of other elements. A user's attention may be determined via their biometric information such as gaze direction, gestures, relative body positioning and speech properties.

Description

AUGMENTED VOICE COMMUNICATION SYSTEM AND METHOD
BACKGROUND OF THE INVENTION
Field of the invention
This disclosure relates to an augmented voice communication system and method.
Description of the Prior Art
The "background" description provided herein is for the purpose of generally presenting the context of the disclosure. Work of the presently named inventors, to the extent it is described in this background section, as well as aspects of the description which may not otherwise qualify as prior art at the time of filing, are neither expressly nor impliedly admitted as prior art against the present invention.
In recent years there has been an increase in the number of interactions that are conducted remotely rather than in-person; this is a trend which has been particularly accelerated in recent years due to restrictions upon in-person meet-ups for many people. Examples of types of remote interactions include video calling, multiplayer computer games, and social virtual environments (such as a virtual reality, VR, environment for users to socialise in).
While the availability of such interactions is advantageous in that it can enable a more natural communication between users (for instance, compared to a text-based communication or a phone call), these interactions may still suffer from a number of drawbacks which can influence the quality of the communication between users. For instance, without a physical aspect to the interactions, users may not be able to pick up on social cues as effectively - this can lead to users talking over one another, for example. Similarly, without a physical space to enable users to arrange themselves into groups, the issue of cross-talk between different groups of users may become apparent.
While some online platforms have implemented features to address such issues, such as the ability of specific users to be able to mute other users or assign users to break-out rooms, these may be unsatisfactory for a number of reasons. For instance, the muting of users can be a laborious process during a fast-moving conversation, and can lead to a break in immersion for a user when implemented in a virtual environment setting (rather than specifically a video call). Similarly, the use of break-out rooms can be problematic in that it divides up the group of users in a fashion that makes it difficult for the groups to mix or communicate - thereby reducing the scope for interaction between users.
It is therefore considered that improved methods for enabling voice-based communications may be desired, particularly in the case of interactions within a virtual environment in which it is considered important to maintain a sense of immersion for a user.
It is in the context of the above discussion that the present disclosure arises.
SUMMARY OF THE INVENTION
This disclosure is defined by claim 1. Further respective aspects and features of the disclosure are defined in the appended claims.
It is to be understood that both the foregoing general description of the invention and the following detailed description are exemplary, but are not restrictive, of the invention.
BRIEF DESCRIPTION OF THE DRAWINGS
A more complete appreciation of the disclosure and many of the attendant advantages thereof will be readily obtained as the same becomes better understood by reference to the following detailed description when considered in connection with the accompanying drawings, wherein:
Figure 1 schematically illustrates an exemplary entertainment system;
Figure 2 schematically illustrates a method for generating and providing modified audio to a user of a display device;
Figure 3 schematically illustrates an exemplary environment to which an audio modification method may be applied;
Figure 4 schematically illustrates a method of identifying an element with which a user is engaged;
Figure 5 schematically illustrates a summary method; and
Figure 6 schematically illustrates a system for generating and providing modified audio to a user of a display device.
DESCRIPTION OF THE EMBODIMENTS
Referring now to the drawings, wherein like reference numerals designate identical or corresponding parts throughout the several views, embodiments of the present disclosure are described.
Referring to Figure 1, an example of an entertainment system 10 is a computer or console such as the Sony® PlayStation 5® (PS5).
The entertainment system 10 comprises a central processor 20. This may be a single or multi core processor, for example comprising eight cores as in the PS5. The entertainment system also comprises a graphical processing unit or GPU 30. The GPU can be physically separate to the CPU, or integrated with the CPU as a system on a chip (SoC) as in the PS5.
The entertainment device also comprises RAM 40, and may either have separate RAM for each of the CPU and GPU, or shared RAM as in the PS5. The or each RAM can be physically separate, or integrated as part of an SoC as in the PS5. Further storage is provided by a disk 50, either as an external or internal hard drive, or as an external solid state drive, or an internal solid state drive as in the PS5.
The entertainment device may transmit or receive data via one or more data ports 60, such as a USB port, Ethernet® port, Wi-Fi ® port, Bluetooth® port or similar, as appropriate. It may also optionally receive data via an optical drive 70.
Audio/visual outputs from the entertainment device are typically provided through one or more A/V ports 90, or through one or more of the wired or wireless data ports 60.
An example of a device for displaying images output by the entertainment system is a head mounted display 'HMD' 120, such as the PlayStation VR 2 'PSVR2', worn by a user 1.
Where components are not integrated, they may be connected as appropriate either by a dedicated data link or via a bus 100.
Interaction with the system is typically provided using one or more handheld controllers (130, 130A), such as the DualSense® controller (130) in the case of the PS5, and/or one or more VR controllers (130A-L,R) in the case of the HMD.
Embodiments of the present disclosure are directed towards systems and methods for improving audio-based interactions between users of a virtual environment. While discussed here in the context of a virtual environment, techniques may be adapted in any suitable manner so as to enable their use in other settings such as during a more conventional video call, for instance, or in a real-world environment in which audio can be modified before being provided to a user (such as through noise cancelling headphones associated with a microphone to capture external audio to be selectively passed through). Examples of each of these are discussed below.
Embodiments of the present disclosure relate to the concept of user engagement with an element; this element may be an element in a virtual environment (such as another player's avatar), another user (or a representation of that user, such as a stream of their video in a video call), or any other feature within a social setting. Engagement may be a measure of focus or interaction with the element - this can indicate a user's interest in a conversation, whether or not they are taking part. The engagement can be determined in a number of different manners individually or in combination; examples include using gaze tracking, gestures, biometrics, patterns of speech, relative locations in an environment (real or virtual), body positioning, and properties of the audio output by different users/elements. Examples of the use of these are discussed below.
Figure 2 schematically illustrates a method for generating and providing modified audio to a user of a display device. In particular, this method provides a modified audio output which is dependent upon an indication of a user engagement with an element such that the audibility of sounds associated with that element is improved. The improvement of audibility may be realised through one or both of an increase in the volume of those sounds and a decrease in the volume of other sounds; of course, other techniques (such as modifying the apparent location of a sound source, or applying modifications such as a frequency or speed change) may be used to adapt the audibility where appropriate. Further modifications and variations of this method are discussed below with reference to the system of Figure 6, for instance.
A step 200 comprises displaying images to the user of the display device, the images comprising elements representing one or more other users, and providing a corresponding audio output to the user. This may be a virtual environment provided as a part of a social application or a video game, for example, in which each of a number of players (and/or non-player characters) is represented in a setting by an avatar or the like. The corresponding audio output may include any sounds associated with the virtual environment - this may include environmental sounds (such as a flowing river), utterances by one or more player or non-player characters, and/or sounds associated with events (such as a character kicking a ball) within the environment. In some cases it may be useful to consider a group of elements as a single element for the purposes of ascertaining user engagement - for instance, all (or at least a subset of) participants in a single conversation may be regarded as a single group for the purpose of determining user engagement, or all the players of a sport and the related equipment (such as a ball and goalposts) may be considered a single element.
A step 210 comprises identifying an element corresponding to user engagement and determining a level of user engagement with that element. The elements may include any sound source within the environment; while in many cases this may be an avatar associated with another user, it may also include other sound sources such as virtual display devices or virtual audio output units, objects being interacted with, or even sound sources which have no representative element within the environment (such as a voiceover or narration provided within the environment).
This identification may be implemented in a number of different manners, which can be implemented in isolation or in combination with one another. For example, in some cases one or more inputs by a user or measurements of a state of that user may be used to identify user engagement with a specific element - for instance, if a user moves their avatar then an element may be identified as being engaged with based upon this input. Alternatively, or in addition, a level of user engagement may be monitored for each (or at least a subset of) the elements within the virtual environment - this may comprise the calculation of an engagement score for each element that is updated with an appropriate frequency.
Examples of inputs or measurements that may be considered for the determination of user engagement include relative avatar positions/motions within the virtual environment, semantic analysis of voice chat, volume of voice chat, biometric data (such as heart rate, or galvanic skin response), speed of voice chat (for instance, average word length or number of words per minute), gaze direction, controller inputs by a user, characterisation of those inputs (such as the pressure associated with a button press), and/or speech cadence of participants within the voice chat (such as determining whether speech from users overlap).
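By way of illustration only, one way in which such inputs could be combined into a per-element engagement score is sketched below in Python. The signal names, normalisation and weights are assumptions made for the purposes of the example rather than features required by the present disclosure.

    # Minimal sketch: combining example engagement signals into a single score.
    # Signal names and weights are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class EngagementSignals:
        gaze_overlap: float        # fraction of recent gaze samples falling on the element (0-1)
        proximity: float           # closeness of the user/avatar to the element, normalised to 0-1
        speech_activity: float     # proportion of recent time spent in mutual voice chat (0-1)
        biometric_response: float  # e.g. normalised heart-rate change correlated with the element (0-1)

    WEIGHTS = {"gaze_overlap": 0.4, "proximity": 0.2,
               "speech_activity": 0.3, "biometric_response": 0.1}

    def engagement_score(signals: EngagementSignals) -> float:
        """Return a 0-1 engagement level as a weighted sum of the example signals."""
        return (WEIGHTS["gaze_overlap"] * signals.gaze_overlap
                + WEIGHTS["proximity"] * signals.proximity
                + WEIGHTS["speech_activity"] * signals.speech_activity
                + WEIGHTS["biometric_response"] * signals.biometric_response)

Such a score could be recomputed for each element at whatever update frequency is appropriate for a given implementation.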
While this step refers to identifying an element corresponding to user engagement, implementations of this method may of course be used to identify a plurality of elements with which the user is engaged.
For instance, this can include the identification of a number of participants in a group conversation which the user is a part of, or the identification of a user taking part in a conversation and another element which generates audio which the user is engaged with - such as talking while watching a video, in which both the user being spoken to and the video (or a real/virtual display associated with the video) are identified as elements being engaged with.
A step 220 comprises modifying the audio output, in dependence upon the identification of the element with which the user is engaged and the determined level of engagement, so as to increase the audibility of audio corresponding to the identified element relative to audio not corresponding to the identified element. This may include one or both of increasing the volume of audio associated with the identified element and reducing the volume of audio associated with elements other than the identified element.
Other modifications to the audio may also (or instead) be applied so as to vary the audibility - this may include changing the apparent location of a sound source or other properties such as a frequency or speed of the audio.
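As a purely illustrative sketch of a modification in which the change in audibility is proportional to the level of engagement, a per-element volume gain could be computed as follows; the maximum boost and attenuation values are assumptions for the example.

    # Minimal sketch: linear volume gain per element, proportional to engagement.
    def element_gain(engagement: float, engaged: bool,
                     max_boost: float = 2.0, max_cut: float = 0.5) -> float:
        """Return a linear gain multiplier for an element's audio.

        engagement is a 0-1 level; engaged indicates whether this is the
        identified element. Non-engaged elements are attenuated more strongly
        the more the user is engaged elsewhere.
        """
        engagement = max(0.0, min(1.0, engagement))
        if engaged:
            return 1.0 + (max_boost - 1.0) * engagement   # ranges from 1.0 up to max_boost
        return 1.0 - (1.0 - max_cut) * engagement         # ranges from 1.0 down to max_cut

For example, at an engagement level of 0.8 the identified element would be played at 1.8 times its original volume while other elements are reduced to 0.6 times theirs.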
In accordance with the method of Figure 2, it is possible to provide an interaction in which a user is able to improve the quality of the audio aspect of that interaction. An example of this is providing a conversation experience for a user in a crowded virtual environment in which the audio associated with their conversation partners is enhanced relative to that of other users in the environment - thereby enabling an improved communication between users. It is also considered that such a method may be advantageous in that the immersion of the user may be maintained despite the audio modification, particularly when compared to alternatives in which other users are muted or moved to a different environment (for example).
Figure 3 schematically illustrates an exemplary environment to which such a method may be applied so as to provide improved audio to a user. The scene 300 may be an image that is displayed by a display device (such as an HMD or television), or may be representative of the point of view of a user in a real environment. In some cases, the scene 300 may be an image of the real environment in which that user is located, captured by a camera and displayed via an HMD; for instance, as a part of an augmented or mixed reality implementation. It is of course considered that there is no requirement for a head-mountable display of any type to be provided to the user in any context, as advantages of the disclosed methods may be obtained independent of the display device. For instance, a user viewing content via a television may still benefit from the improved audio engagement. In some embodiments, such as those in which elements within a real-world environment are to be identified, it is not required that any display be provided at all.
A number of people 310 (or avatars representing people in a virtual environment) are present in the environment, each of which may be regarded as an element with which a user may be engaged. In some cases, a group of users may be considered as a single element - such as the left pair of people 320 being regarded as a first element and the right pair of people 330 being regarded as a second element. This grouping may be based upon physical proximity of the people 310 to one another, or other considerations such as identifying that the people 310 are engaged in the same or different conversation topics. In a virtual environment, the people 310 may be grouped based upon whether they are in a group or party with one another.
Similarly, the environment includes a display 340 and speakers 350a and 350b which can also be considered as elements with which a user may be engaged. Information about these may be obtained in any suitable manner - in the case in which these devices are real devices within a real environment, these devices may be configured to communicate with the user's device (that is, the device providing audio to the user) via a wireless communication which provides data about the audio independently of the audio emission by the respective devices. In some cases the speakers 350a and 350b may be considered a single element (for instance, in a case in which they provide the same audio output), while in others they may be regarded as separate elements (even if they provide the same audio output) due to the directionality of the respective speakers with respect to the user - the directionality may mean that one of the speakers has a greater impact upon the user's ability to hear audio than the other, for instance.
In this example, the engagement of a user with any of the elements in the scene 300 may be determined with respect to any suitable parameters. For instance, the proximity of the user to any of the elements may be considered, as well as their interaction with those elements. For example, a user who engages one of the pairs of people 320 or 330 in conversation may be considered to have a high engagement with that pair. Similarly, a user who gazes at the display 340 and/or reacts to content shown on the display 340 may be considered to have a significant level of engagement with the display 340. A user who discusses the content shown on the display 340 or output by the speakers 350a or 350b (for instance, based upon content recognition and identifying corresponding keywords in a user's speech) can also be considered to have an engagement with the respective element or elements.
In some cases, grouping of elements may be extended as appropriate - for instance, if the pair of people 320 are determined to be discussing the content on the display 340 (based upon their actions, such as gazing at the display, or based upon the contents of their conversation) then a single element may be defined which includes both the pair of people 320 and the display 340.
Of course, the selection of elements shown in the scene 300 is considered to be exemplary only; implementations in accordance with the present disclosure may be provided in any suitable scene, with elements corresponding to any passive or active element as appropriate. An element may be any object or person associated with an audio output, whether that is an output generated by the element (such as a device with a speaker) or an output generated by interaction with that element (such as a ball that is struck in a sport).
While discussed in the context of virtual reality, it is also considered that the methods described may be applied to a real-world environment. For instance, a user may be provided with an augmented reality headset (comprising a display and an audio output element) or headphones (or any other audio providing hardware); in either case, the hardware should also be provided with a microphone so as to enable the capture of audio from the environment. Modifications to the captured audio can then be performed in accordance with the methods described in this disclosure, with the modified audio being provided to the user in addition to (or instead of, if the user is unable to hear well - such as due to noise cancelling arrangements) the unmodified audio. The hardware arrangement provided to a user may be further equipped with a camera for capturing images of the real-world environment and the elements within it.
In such a case, the elements for which user engagement is determined may include any real-world objects or people. For instance, in an example in which a user is playing a board game in a public setting, elements may comprise the game, game pieces, people, speakers providing music in the environment, and any other features. In such an example, the user may wish to emphasise the audio associated with the game and their fellow players, whilst reducing the impact of the audio associated with other elements in the real-world environment. To enable this, the user may be provided with a headset comprising at least an audio input and audio output function - for instance, a microphone and speakers, or functionality (such as wireless connectivity) which enables a sound signal to be received from another device. This headset may also comprise elements which dampen the sound of the environment, such as a noise-cancelling headphone arrangement, to afford any audio output from the headset a much greater influence over what the user hears.
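A minimal sketch of such a capture-modify-output loop is given below, assuming the third-party python-sounddevice package and a headset exposing a microphone input and headphone output; apply_element_gains is a hypothetical placeholder for the element-wise processing described in this disclosure.

    # Minimal sketch of a real-world audio pass-through pipeline (assumes the
    # python-sounddevice package; apply_element_gains is a hypothetical placeholder).
    import numpy as np
    import sounddevice as sd

    def apply_element_gains(block: np.ndarray) -> np.ndarray:
        # Placeholder for the element-wise audibility modification of the captured block.
        return block

    def callback(indata, outdata, frames, time, status):
        if status:
            print(status)
        outdata[:] = apply_element_gains(indata)

    # Capture environment audio, modify it, and play it back over the headset.
    with sd.Stream(samplerate=48000, channels=1, callback=callback):
        sd.sleep(10_000)  # run the pass-through loop for ten seconds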
The metrics used to determine a level and/or target of engagement by a user may be modified so as to be more appropriate for a real-world environment. For instance, where an implementation for a virtual environment would consider the relative locations of a user's avatar and other elements within the virtual environment, a real-world environment implementation would instead consider a real-world location or proximity for the user and respective elements.
Figure 4 schematically illustrates a method of identifying an element with which a user is engaged in accordance with step 210 of Figure 2 as discussed above. The steps shown in Figure 4 may be performed in any suitable order, rather than being limited to the order that is shown. For instance, steps 400, 410, and 420 may be performed in any order or substantially simultaneously.
A step 400 comprises obtaining user data; this may include data from any sensors which can detect properties of the user, as well as information about their audio output, attention, and location within an environment (real or virtual). The user data that is obtained may be any data which can be indicative of engagement of the user with an element, such as data which can indicate a focus upon, interaction with, or reaction to an element.
A step 410 comprises detecting and characterising one or more elements within the environment. The detection of elements may be based upon game data or the like which indicates the presence of elements within a virtual environment, or it may comprise an analysis of sensor data (such as captured images) which represent a real-world environment. The characterisation of the elements may include an identification of an element, and/or the determination of any other properties of the element. For instance, the characterisation may include a determination of a type of object (such as person, animal, electronic device), whether it is an active or passive audio element, and/or whether the element belongs to a group.
A step 420 comprises identifying audio within the environment. The detection of audio may be based upon game data or the like which indicates the presence of audio within a virtual environment, or it may comprise an analysis of sensor data (such as audio captured by a microphone) which represent a real-world environment. The identification may further comprise a characterisation of the audio, such as identifying one or more properties of the sound, the location of the source of the sound, the content of the sound (such as identifying words in speech), or what it is the sound represents (such as identifying that a ball has been struck based upon the detection of corresponding audio).
A step 430 comprises matching detected elements with identified audio based upon the respective detections and/or characterisations of steps 410 and 420. This step may include any association between an element and corresponding audio, such as the generation of metadata indicating an association or a labelling of one or both of the element and audio with an indicator of the other of the element and audio.
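One simple, purely illustrative realisation of such a matching, assuming that both the identified sounds and the detected elements carry estimated positions, is to associate each sound with its nearest element; the data structures below are assumptions made for the example.

    # Minimal sketch: associate each identified sound with the nearest detected element.
    import math

    def match_audio_to_elements(sounds, elements):
        """sounds: list of dicts, each with an estimated 'position' (x, y, z) tuple;
        elements: list of dicts, each with an 'id' and a 'position' tuple.
        Returns a mapping from sound index to the id of the closest element."""
        matches = {}
        for i, sound in enumerate(sounds):
            closest = min(elements, key=lambda e: math.dist(sound["position"], e["position"]))
            matches[i] = closest["id"]
        return matches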
A step 440 comprises determining one or more audio modifications that are to be applied to one or more portions of the identified audio. This step comprises the determination of user engagement with an element (if this is not performed as a part of the step 400 or 410), and a determination of a modification to be applied to the audio associated with that element. This may be performed in any of a number of ways as appropriate for a given implementation - the specific implementation of this step may be selected in dependence upon user preferences, the environment, and/or the presence of particular elements in the environment.
In a first example, the elements in the environment are ranked in dependence upon the level of user engagement with the respective elements. The top N elements, or N% of elements, or any other subset of elements may be selected to have their audio modified so as to improve the audibility. Meanwhile, the bottom M or M% of elements (where M may be the same value as N, or a different value) may be similarly selected to have their audio modified so as to reduce the audibility.
In a second example, elements with an above-threshold level of user engagement may be selected for audio modification so as to improve the audibility. Optionally, elements with a below-threshold (this threshold may be the same, or may be a second threshold that is separately defined) level of user engagement may be selected for audio modification so as to reduce the audibility.
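The two examples above could be realised, purely by way of illustration, as follows; the values of N, M and the thresholds are assumptions rather than preferred values.

    # Minimal sketch: ranking-based and threshold-based selection of elements
    # whose audio is to be boosted or attenuated.
    def select_by_rank(scores: dict, n: int = 2, m: int = 2):
        """scores maps element id -> engagement level (0-1).
        Returns (boost_ids, attenuate_ids): the top-n elements to make more
        audible and the bottom-m elements to make less audible."""
        ranked = sorted(scores, key=scores.get, reverse=True)
        boost = ranked[:n]
        attenuate = [e for e in ranked[-m:] if e not in boost]
        return boost, attenuate

    def select_by_threshold(scores: dict, high: float = 0.7, low: float = 0.3):
        """Boost elements at or above the high threshold, attenuate those at or below the low one."""
        boost = [e for e, s in scores.items() if s >= high]
        attenuate = [e for e, s in scores.items() if s <= low]
        return boost, attenuate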
In some cases, a variation of the audibility of audio associated with an element is proportional to the level of user engagement with that element. For instance, if a user is particularly focused on a conversation (that is, engaged with the conversation and/or participants of that conversation) then a greater modification may be applied to increase the audibility of that conversation than if the user were only paying slight attention (such as if they were more focused on a sports game being shown in the environment).
Alternatively, or in addition, a determination may be made in dependence upon the relative characteristics of the audio. For instance, audio with a similar frequency to audio associated with a higher level of user engagement may be preferentially identified for modification to reduce audibility. A similar process may be performed in dependence upon the relative locations of elements and the user, such that elements which are in the same general direction as a user-engaged element may be preferentially identified for modification to reduce audibility. In this manner, those sounds which are most likely to interfere with a user's ability to clearly hear and understand audio may be preferentially reduced in audio impact.
The modifications may be defined in any suitable manner. For instance, a target volume may be defined (such as a predefined 'low audio level' or 'high audio level' defined in a user profile or the like) in the case of a modification of the audio, or an absolute change (such as a number of decibels or hertz for a volume or frequency change) or scaling of a value may be defined. The modifications may be determined independently of the environment, or may be based upon an average volume of sounds in the environment or a standard deviation of the volume of sounds within the environment, for example. In other words, the modifications may be determined based upon any parameter associated with the environment, audio within the environment, and/or the user.
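For instance, and purely as an illustrative sketch, a target level could be derived from the average loudness of the environment and converted into a linear gain; the 6 dB offsets below are assumptions rather than preferred values.

    # Minimal sketch: derive a target level from the environment's mean loudness
    # and convert the required decibel change into a linear gain multiplier.
    def target_level_db(environment_levels_db, engaged: bool,
                        boost_db: float = 6.0, cut_db: float = -6.0) -> float:
        """Return the environment's mean level plus a fixed offset, raised for
        the engaged element and lowered for other elements."""
        mean_db = sum(environment_levels_db) / len(environment_levels_db)
        return mean_db + (boost_db if engaged else cut_db)

    def gain_for(current_db: float, target_db: float) -> float:
        """Linear amplitude gain needed to move audio from current_db to target_db."""
        return 10 ** ((target_db - current_db) / 20.0)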
Figure 5 schematically illustrates a summary method, which may be adapted in accordance with any of the features described throughout this disclosure. The steps shown in Figure 5 may be performed in any suitable order, rather than being limited to the order that is shown. For instance, steps 500 and 510 may be performed in any order or simultaneously. It is further considered that the steps may be performed at different rates, for instance if the user data is sampled more frequently than the environment data.
A step 500 comprises obtaining environment data; that is, any data about the environment (real or virtual) and/or elements within that environment. This may be performed in any suitable manner as appropriate for a given environment - for instance, from data representing a virtual environment, or images of a real environment.
A step 510 comprises obtaining user data; that is, any data about a user that may be indicative of their engagement with one or more of the elements within the environment. This may include information about a user's behaviour (such as gaze tracking data or movement data), interactions (such as voice data when engaged in a conversation), physical state (such as biometric data), interests (such as a user profile indicating elements or particular subjects of interest), and/or location (either absolute, or relative to one or more of the elements) to provide a number of examples.
A step 520 comprises determining engagement of a user with any or each of the elements in the environment. This may be based on any suitable data obtained from steps 500 and/or 510 so as to determine both a target of the user's engagement and the level of that engagement. This may be performed for each element in the environment, or a subset of the elements representing a particular element type (such as other people, or elements associated with a particular activity) or other subset of the elements.
A step 530 comprises modifying audio associated with one or more of the elements in the environment.
This modification may include modifying audio associated with one or more elements to have an increased audibility, and/or modifying audio associated with one or more elements to have a decreased audibility. This may be performed in dependence upon any suitable parameters, such as the relative locations of elements corresponding to audio, in addition to a consideration of the user's level of engagement with respect to particular elements. In some cases, the modification to be applied to audio may be determined such that the increase or decrease in audibility is proportional to the level of engagement of the user with a corresponding element.
A step 540 comprises outputting at least the modified audio to the user; this may be performed in conjunction with the display of a corresponding image of the environment in some embodiments.
Figure 6 schematically illustrates a system for generating and providing modified audio to a user of a display device. This system comprises a display device 600, an engagement determination unit 610, an audio modification unit 620, and an optional audio monitoring unit 630. This system is considered to be exemplary, in that modifications can be made as appropriate for a particular implementation. For instance, in some implementations (as described above) there may be no display of images comprising elements to a user - for instance when using a see-through display, or an audio headset with no display at all.
The functionality of these units may be distributed amongst any suitable number of devices as appropriate; for instance, processing units comprised within the HMD or other head-mountable unit worn by a user may be configured to provide the functionality of any or all of the units 610, 620, and 630. Alternatively, or in addition, other devices such as a server or computing device (such as a games console) may be provided which provide at least a portion of that functionality.
The display device 600 is configured to display images to a user, the images comprising elements representing one or more other users, and to provide a corresponding audio output to the user. This may be any suitable display device, including an HMD (full-immersion or see-through) or a television. The displayed images may include images of a virtual environment, for example, or captured images of a real environment in which the user is present.
The engagement determination unit 610 is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element. This may be based upon any suitable data which characterises the user, the environment, or elements within that environment (including other users corresponding to those elements). For instance, this may be based upon data about a virtual environment, such as game context or data, or other inputs such as user profile information and real-world sensor data. While here reference is made to 'an element', it should be considered that a plurality of elements may be identified and/or that an element actually corresponds to a number of distinct elements (such as an element being identified which corresponds to all participants in a conversation).
In some examples, the engagement determination unit 610 is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon the user's gaze direction and/or gaze information for at least one of the one or more other users. This data may be inferred from a user profile which predicts user gaze direction, for instance, or from the use of gaze tracking cameras or the like associated with a user.
In some examples, the engagement determination unit 610 is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon biometric information of the user. This biometric information may include a heart rate, for instance, or a galvanic skin response. These measurements may be correlated with the actions of the elements or interactions with them to determine engagement - for instance, a varying heart rate may only be considered to be indicative of engagement if it correlates with developments in the interaction with an element such as another user speaking, or a goal being scored in a viewed sports game.
In some examples, the engagement determination unit 610 is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon the location of the user relative to elements representing other users. In other words, a user may be considered to have a higher level of engagement with an element if they are closer to the element. One example of this is a user moving closer to people that they are engaging with in a conversation, as this is a natural method for improving their ability to hear others.
In some examples, the engagement determination unit 610 is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon gestures performed by the user and/or gestures performed by elements associated with at least one of the one or more other users. For example, pointing at an element may be suggestive of engagement with that element, or gestures (such as a cheering gesture) in response to actions by an element (such as a declaration of success by another person).
In some examples, the engagement determination unit 610 is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon an analysis of the audio corresponding to the user and/or at least one of the one or more other users. In some implementations, the analysis may comprise analysing the content of the audio and/or the cadence of speech in the audio to determine whether the user and the at least one of the one or more other users are engaged in a conversation with one another. In other words, processing may be performed to determine whether users are simply talking in the same space (but not to each other) or are actually talking to one another.
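As a purely illustrative sketch of such processing, the degree of overlap between two speakers' detected speech intervals could be used as a crude indicator of whether they are conversing; the interval representation and the 20% overlap threshold are assumptions made for the example.

    # Minimal sketch: speakers whose speech rarely overlaps are more likely to be
    # taking turns in a conversation than simply talking in the same space.
    def in_conversation(speech_a, speech_b, max_overlap_fraction: float = 0.2) -> bool:
        """speech_a / speech_b: lists of (start, end) times in seconds of detected speech."""
        total_a = sum(end - start for start, end in speech_a)
        overlap = 0.0
        for a_start, a_end in speech_a:
            for b_start, b_end in speech_b:
                overlap += max(0.0, min(a_end, b_end) - max(a_start, b_start))
        return total_a > 0 and (overlap / total_a) < max_overlap_fraction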
The audio modification unit 620 is configured to modify the audio output, in dependence upon the identification of the element with which the user is engaged and the determined level of engagement, so as to increase the audibility of audio corresponding to the identified element relative to audio not corresponding to the identified element, wherein the increase in audibility is proportional to the level of engagement. The increase in audibility may be achieved by modifying the volume of respective audio for elements, for instance, or by changing other characteristics such as an audio speed or frequency.
In particular, the audio modification unit 620 may be configured to increase the volume of the audio corresponding to the identified element and/or decrease the volume of audio not corresponding to the identified element. The audio modification unit 620 may be configured to increase the volume of audio associated with the user in response to the identification of an element as the subject of user engagement.
The audio modification may be implemented in any suitable manner for a given implementation. In a virtual environment, it may be possible to separate audio from different elements (such as different users communicating on respective voice channels) and simply modify the audio for the respective elements on an individual basis. In other cases, it may be suitable to capture audio of the environment or obtain an audio output and to perform a sound separation process prior to performing an element-wise modification of the audio. Such a separation is not considered necessary however, as by identifying characteristics of desired or undesired audio it may be possible to perform a modification without separation. For instance, sounds with particular characteristic frequencies (that is, those associated with an identified element's audio output) may be increased in volume while sounds with other frequencies are decreased in volume.
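A minimal, purely illustrative sketch of such a frequency-selective modification applied to a mixed signal (without any source separation) is given below using NumPy; the band edges and gain values are assumptions made for the example.

    # Minimal sketch: boost the characteristic frequency band of the identified
    # element and attenuate everything else, directly on the mixed signal.
    import numpy as np

    def emphasise_band(signal: np.ndarray, sample_rate: int,
                       low_hz: float, high_hz: float,
                       boost: float = 2.0, cut: float = 0.5) -> np.ndarray:
        """Scale FFT bins inside [low_hz, high_hz] by `boost` and all others by
        `cut`, then return the modified time-domain signal."""
        spectrum = np.fft.rfft(signal)
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
        in_band = (freqs >= low_hz) & (freqs <= high_hz)
        spectrum[in_band] *= boost
        spectrum[~in_band] *= cut
        return np.fft.irfft(spectrum, n=len(signal))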
In some implementations an audio monitoring unit 630 may be provided which is configured to monitor audio not corresponding to the identified element and to generate a real-time visual representation of this audio for display to the user. For instance, this could be in the form of goal updates for a game being shown on a display in an environment while the user is engaged with other elements (such as talking to other people). This representation may include text or image-based updates, and may have any suitable level of detail. Other examples can include summaries of other conversations in an environment, or the highlighting of key words. This may be performed in dependence upon a user profile, for example, which can indicate particular audio of interest which should be used to generate relevant representations. For instance, based upon a user's particular interests certain keywords may be defined - such as a user known to be interested in football having game-related keywords such as 'goal' and 'tackle' or team-specific keywords such as club or player names being highlighted in the representation as an indication that other people are discussing this topic. The representation may also indicate the related elements - such as identifying a particular element as the source of the audio.
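A minimal illustrative sketch of generating such keyword-based updates from transcribed non-engaged audio is shown below; the transcript source and keyword list are assumptions made for the example.

    # Minimal sketch: produce short text updates when profile keywords appear in
    # transcripts of audio the user is not currently engaged with.
    def keyword_updates(transcripts, keywords):
        """transcripts: iterable of (element_id, text) pairs for non-engaged audio.
        Returns short update strings for any transcript mentioning a keyword."""
        updates = []
        for element_id, text in transcripts:
            hits = [k for k in keywords if k.lower() in text.lower()]
            if hits:
                updates.append(f"{element_id} is discussing: {', '.join(hits)}")
        return updates

For example, keyword_updates([("display_340", "What a goal!")], ["goal", "tackle"]) would produce a single update indicating that the display is showing content related to 'goal'.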
In some embodiments, the audio monitoring unit 630 is configured to evaluate the relevance of the audio not corresponding to the identified element to the audio corresponding to the identified element, and to generate a real-time visual representation of the relevance for display to the user. This may be in addition to the real-time visual representation of the audio, or as an alternative. In either case, this can be an indicator to the user as to whether other audio in the environment would be of interest to them despite the lack of engagement - this can enable a user to be informed as to the possibility of engaging with different elements if the associated audio is of particular relevance, for example.
Of course, rather than a real-time representation in some cases it may be preferable that a summary of the other audio (that is, the audio not corresponding to the identified element) is provided to the user at an appropriate time, such as the end of a social session. This can enable a user to be informed as to which audio they may have missed due to the audio modification that is performed. This summary may include the identification of elements as well as the associated audio - for instance, identifying other people in an environment and their topics of conversation.
In some implementations, the engagement determination unit 610 may be configured to identify a plurality of elements which are each subjects of the user's engagement, while the audio modification unit 620 may be configured to modify the audio output to increase the audibility of the audio corresponding to each of these identified elements. This may be performed by considering a number of elements separately, or by defining a single element which corresponds to a number of distinct elements within the environment - such as defining all participants of a conversation or a game as being a single element for engagement determination purposes.
In some implementations the engagement determination unit 610 may be configured to identify one or more additional elements that are being engaged with by the identified element, while the audio modification unit 620 may be configured to modify the audio output so as to also increase the audibility of audio corresponding to the additional elements. This may be advantageous in social settings in that often a conversation may have some people contribute more than others - those who contribute less would likely be determined to have a lower engagement with the user in many cases, but nonetheless should be regarded as being engaged with due to the fact they are participating in the same conversation. This identification may be performed based upon interaction history, for example, information such as relative location or engagement of elements, or other data such as party information for a virtual environment.
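A purely illustrative sketch of extending the identified set in this way, assuming some mapping from each element to the other members of its conversation or party, is as follows.

    # Minimal sketch: extend the audibility boost to elements that the engaged
    # element is itself interacting with (e.g. quieter conversation participants).
    def expand_engaged_set(identified, conversation_members):
        """identified: iterable of element ids the user is engaged with.
        conversation_members: mapping of element id -> set of ids in its conversation/party.
        Returns the identified elements together with their fellow participants."""
        expanded = set(identified)
        for element in identified:
            expanded |= conversation_members.get(element, set())
        return expanded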
The arrangement of Figure 6 is an example of a processor (for example, a GPU and/or CPU located in a games console or any other computing device, such as the device 10 of Figure 1) that is operable to generate and provide modified audio to a user of a display device, and in particular is operable to: display images to the user of the display device, the images comprising elements representing one or more other users, and providing a corresponding audio output to the user; identify an element corresponding to user engagement and determine a level of user engagement with that element; and modify the audio output, in dependence upon the identification of the element with which the user is engaged and the determined level of engagement, so as to increase the audibility of audio corresponding to the identified element relative to audio not corresponding to the identified element.
The techniques described above may be implemented in hardware, software or combinations of the two. In the case that a software-controlled data processing apparatus is employed to implement one or more features of the embodiments, it will be appreciated that such software, and a storage or transmission medium such as a non-transitory machine-readable storage medium by which such software is provided, are also considered as embodiments of the disclosure.
Thus, the foregoing discussion discloses and describes merely exemplary embodiments of the present invention. As will be understood by those skilled in the art, the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. Accordingly, the disclosure of the present invention is intended to be illustrative, but not limiting of the scope of the invention, as well as other claims. The disclosure, including any readily discernible variants of the teachings herein, defines, in part, the scope of the foregoing claim terminology such that no inventive subject matter is dedicated to the public.

Claims (15)

  1. A system for generating and providing modified audio to a user of a display device, the system comprising: the display device configured to display images to a user, the images comprising elements representing one or more other users, and to provide a corresponding audio output to the user; an engagement determination unit configured to identify an element corresponding to user engagement and determine a level of user engagement with that element; and an audio modification unit configured to modify the audio output, in dependence upon the identification of the element with which the user is engaged and the determined level of engagement, so as to increase the audibility of audio corresponding to the identified element relative to audio not corresponding to the identified element, wherein the increase in audibility is proportional to the level of engagement.
  2. A system according to claim 1, wherein the audio modification unit is configured to increase the volume of the audio corresponding to the identified element and/or decrease the volume of audio not corresponding to the identified element.
  3. A system according to any preceding claim, wherein the audio modification unit is configured to increase the volume of audio associated with the user in response to the identification of an element as the subject of user engagement.
  4. A system according to any preceding claim, wherein the engagement determination unit is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon the user's gaze direction and/or gaze information for at least one of the one or more other users.
  5. A system according to any preceding claim, wherein the engagement determination unit is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon biometric information of the user.
  6. A system according to any preceding claim, wherein the engagement determination unit is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon the location of the user relative to elements representing other users.
  7. A system according to any preceding claim, wherein the engagement determination unit is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon gestures performed by the user and/or gestures performed by elements associated with at least one of the one or more other users.
  8. A system according to any preceding claim, wherein the engagement determination unit is configured to identify an element corresponding to user engagement and determine a level of user engagement with that element in dependence upon an analysis of the audio corresponding to the user and/or at least one of the one or more other users.
  9. A system according to claim 8, wherein the analysis comprises analysing the content of the audio and/or the cadence of speech in the audio to determine whether the user and the at least one of the one or more other users are engaged in a conversation.
  10. A system according to any preceding claim, wherein: the engagement determination unit is configured to identify a plurality of elements which are each subjects of the user's engagement, and the audio modification unit is configured to modify the audio output to increase the audibility of the audio corresponding to each of these identified elements.
  11. A system according to any preceding claim, comprising an audio monitoring unit configured to monitor audio not corresponding to the identified element and to generate a real-time visual representation of this audio for display to the user.
  12. A system according to claim 11, wherein the audio monitoring unit is configured to evaluate the relevance of the audio not corresponding to the identified element to the audio corresponding to the identified element, and to generate a real-time visual representation of the relevance for display to the user.
  13. A system according to any preceding claim, wherein: the engagement determination unit is configured to identify one or more additional elements that are being engaged with by the identified element, and the audio modification unit is configured to modify the audio output so as to also increase the audibility of audio corresponding to the additional elements.
  14. A method for generating and providing modified audio to a user of a display device, the method comprising: displaying images to the user of the display device, the images comprising elements representing one or more other users, and providing a corresponding audio output to the user; identifying an element corresponding to user engagement and determining a level of user engagement with that element; and modifying the audio output, in dependence upon the identification of the element with which the user is engaged and the determined level of engagement, so as to increase the audibility of audio corresponding to the identified element relative to audio not corresponding to the identified element.
  15. A non-transitory machine-readable storage medium which stores computer software which, when executed by a computer, causes the computer to carry out the method of claim 14.
GB2302707.1A 2023-02-24 2023-02-24 Augmented voice communication system and method Pending GB2627505A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB2302707.1A GB2627505A (en) 2023-02-24 2023-02-24 Augmented voice communication system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
GB2302707.1A GB2627505A (en) 2023-02-24 2023-02-24 Augmented voice communication system and method

Publications (2)

Publication Number Publication Date
GB202302707D0 GB202302707D0 (en) 2023-04-12
GB2627505A true GB2627505A (en) 2024-08-28

Family

ID=85793965

Family Applications (1)

Application Number Title Priority Date Filing Date
GB2302707.1A Pending GB2627505A (en) 2023-02-24 2023-02-24 Augmented voice communication system and method

Country Status (1)

Country Link
GB (1) GB2627505A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100315482A1 (en) * 2009-06-15 2010-12-16 Microsoft Corporation Interest Determination For Auditory Enhancement
US20170171261A1 (en) * 2015-12-10 2017-06-15 Google Inc. Directing communications using gaze interaction
US20190281255A1 (en) * 2012-10-18 2019-09-12 Altia Systems, Inc. Panoramic streaming of video with user selected audio
US10581625B1 (en) * 2018-11-20 2020-03-03 International Business Machines Corporation Automatically altering the audio of an object during video conferences
WO2020204934A1 (en) * 2019-04-05 2020-10-08 Hewlett-Packard Development Company, L.P. Modify audio based on physiological observations
WO2022178194A1 (en) * 2021-02-18 2022-08-25 Dathomir Laboratories Llc Decorrelating objects based on attention


Also Published As

Publication number Publication date
GB202302707D0 (en) 2023-04-12

Similar Documents

Publication Publication Date Title
US12239907B2 (en) Methods, systems and devices for providing portions of recorded game content in response to an audio trigger
US7785197B2 (en) Voice-to-text chat conversion for remote video game play
JP5339900B2 (en) Selective sound source listening by computer interactive processing
JP2004267433A (en) Information processing apparatus, server, program, and recording medium for providing voice chat function
JP5458027B2 (en) Next speaker guidance device, next speaker guidance method, and next speaker guidance program
US20240221714A1 (en) Transfer function generation system and method
WO2023032736A1 (en) Communication assistance system, communication assistance method, and communication assistance program
GB2627505A (en) Augmented voice communication system and method
US20240108262A1 (en) Harassment detection apparatus and method
US12100380B2 (en) Audio cancellation system and method
US20220405047A1 (en) Audio cancellation system and method
EP4358084A1 (en) Audio cancellation system and method
US12406032B2 (en) Data processing apparatus and method
US20250269281A1 (en) Apparatus, systems and methods for video games
GB2621873A (en) Content display system and method
GB2634274A (en) Virtual environment augmentation methods and systems