
WO2024191212A1 - Method and electronic device for handling sound source in media - Google Patents

Method and electronic device for handling sound source in media

Info

Publication number
WO2024191212A1
Authority
WO
WIPO (PCT)
Prior art keywords
sound source
subject
media
primary
electronic device
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
PCT/KR2024/003269
Other languages
French (fr)
Inventor
Ranjan Kumar SAMAL
Praveen Kumar Guvvakallu Sivamoorthy
Biju Mathew Neyyan
Somesh Nanda
Arshed V Hakeem
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Samsung Electronics Co Ltd
Original Assignee
Samsung Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Samsung Electronics Co Ltd filed Critical Samsung Electronics Co Ltd
Priority to US18/734,468 (published as US20240321287A1)
Publication of WO2024191212A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering

Definitions

  • the disclosure relates to content processing (e.g., audio processing, video processing or the like), and for example, to a method and an electronic device for handling acoustic sound sources in audio signals (e.g., suppressing acoustic sound sources in the audio signals, modifying acoustic sound sources in the audio signals, optimizing the acoustic sound sources in the audio signals or the like).
  • a sound recording can include ambient/background noises, especially when the sound recording is done in a noisy environment (such as a party, a sports event, a school function, a college function, or the like).
  • the recorded audio signals (or recorded sound signals) often carry environment sounds along with an intended sound source. For example, an audio conversation recorded in a restaurant will contain ambient music, speech, and crowd babble noise.
  • noise suppression aims at reducing the unwanted sound(s) from the recorded audio signal.
  • noise suppression schemes can be either a Digital signal processing (DSP) based or Deep Neural Network (DNN) model based scheme.
  • the DSP based schemes are lightweight and low latency, but are typically trained to eliminate only a few noise types.
  • the DNN model based schemes are powerful in eliminating a variety of noise types, but are relatively larger in size and need more computational resources.
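  • For illustration only, a minimal sketch of a conventional DSP-style noise suppressor (spectral subtraction) is given below; the frame sizes, the assumption that the leading frames are noise-only, and the over-subtraction factor are assumptions made for this example and are not part of the disclosure.

```python
# Minimal spectral-subtraction noise suppressor (illustrative sketch only).
# Assumes the first few frames of the signal are noise-only and len(x) >= frame_len.
import numpy as np

def spectral_subtraction(x, frame_len=1024, hop=512, noise_frames=10, alpha=2.0):
    window = np.hanning(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    # Frame and window the signal, then move to the frequency domain
    frames = np.stack([x[i*hop:i*hop+frame_len] * window for i in range(n_frames)])
    spec = np.fft.rfft(frames, axis=1)
    mag, phase = np.abs(spec), np.angle(spec)
    # Estimate the noise magnitude from the leading (assumed noise-only) frames
    noise_mag = mag[:noise_frames].mean(axis=0)
    # Over-subtract the noise estimate and floor the result to avoid musical noise
    clean_mag = np.maximum(mag - alpha * noise_mag, 0.05 * mag)
    clean = np.fft.irfft(clean_mag * np.exp(1j * phase), n=frame_len, axis=1)
    # Overlap-add reconstruction
    out = np.zeros(len(x))
    for i in range(n_frames):
        out[i*hop:i*hop+frame_len] += clean[i]
    return out
```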
  • Embodiments of the disclosure may provide methods and systems (e.g., electronic device or the like) to contextually suppress acoustic sound sources in audio signals, where the context in which media (which comprises audio signals and visual information) was captured is detected with the sound types, the irrelevant sound sources are identified, and the irrelevant sound sources are suppressed.
  • Embodiments of the disclosure may provide systems and methods to determine the relevant sound sources as primary subject, secondary subject, and/or non-subject sound sources, based on the relevancy of sound set by the user from several intelligent data driven modes (e.g., AI modes or the like), and automatically determines to either partially or completely suppress the non-subject sound sources.
  • Embodiments of the disclosure may provide systems and methods to choose to partially or completely eliminate secondary sound sources, based on a correlation between the primary sound sources and secondary sound sources.
  • Embodiments of the disclosure may provide systems and methods to determine the orientation and movement of subject (e.g., human or the like) in visual scene and position of a recording microphone to adaptively tune the subject sound source.
  • Embodiments of the disclosure may provide systems and methods that use predefined AI modes to regulate the proportionate mixing of the primary sound sources, the secondary sound sources, the non-subject sound sources, and irrelevant sound sources based on context.
  • Embodiments of the disclosure may provide systems and methods to predefine the AI modes to automatically tune the audio content based on relevance to the contextual situation, the capability of a target device on which the content is to be played, and the user's hearing profile.
  • Embodiments of the disclosure may provide systems and methods in which the AI modes automatically tune the primary subject sound sources, the secondary subject sound sources, the non-subject sound sources, and the irrelevant sound sources based on target device capabilities, the user's hearing profile, the user's intention to play the video/audio, and the contextual situation while the signals were recorded.
  • various example embodiments herein provide methods for handling a sound source in a media (including audio signals and visual information).
  • the method includes: determining, by an electronic device, a context in which the media is captured; determining and classifying, by the electronic device, a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context; and generating an output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model based on the determination and classification.
  • the method includes: generating the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification.
  • the method includes: generating the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
  • the method includes: detecting, by the electronic device, at least one event, wherein the at least one event includes at least one of: a change in a sound source parameter associated with the media, a change in correlation between a visual subject associated with the media in focus to the sound source occurring at a specified interval, a change in correlation between the primary sound source and the secondary sound source in the media at a specified interval, and a change in an orientation and movement of a subject in a visual scene and position of a recording media associated with the media.
  • the method may further include: determining, by the electronic device, the context of the media based on the at least one detected event; determining, by the electronic device, a second relevant sound source in the media as a second primary sound source, a second secondary sound source, and a second non-subject sound source based on the determined context; generating a second output sound by completely suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification.
  • the method includes generating a second output sound by partially suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification.
  • completely suppressing at least one of the secondary sound source and the non-subject sound source in the media includes identifying at least one of the secondary sound source and the non-subject sound source in the media as irrelevant sound source, and completely suppressing at least one of the secondary sound source and the non-subject sound source in the media based on identifying at least one of the secondary sound source and the non-subject sound source in the media as irrelevant sound source.
  • partially suppressing at least one of the secondary sound source and the non-subject sound source in the media includes determining a correlation between the primary sound source and at least one of the secondary sound source and the non-subject sound source, and partially suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the determined correlation.
  • determining and categorizing, by the electronic device, the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source includes obtaining, by the electronic device, at least one of an environmental context, a scene classification information, a device context, and a hearing profile, and determining, by the electronic device, the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source based on at least one of the environmental context, the scene classification information, the device context, and the hearing profile.
  • the method includes selectively monitoring, by the electronic device, each relevant sound source based on at least one of the environmental context, the scene classification information, and the device context.
  • generating the output sound source by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model includes determining a relative orientation of a recording media and the primary sound source, adjusting a proportion of the secondary sound source and a proportion of the non-subject sound source upon determining a relative orientation of the recording media and the primary sound source, and generating the output sound source by adjusting relative orientation of the recording media and the primary sound source, the proportion of the secondary sound source, and the proportion of the non-subject sound source.
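  • For illustration, a minimal, hypothetical sketch of the flow described above is given below, assuming the audio has already been separated into per-source stems; the class names, the toy classification rules, and the per-class gain values are assumptions for this example and are not prescribed by the disclosure.

```python
# Hypothetical sketch of the described flow: classify separated sound sources
# by context, then suppress/keep them with per-class gains. Names, rules and
# gain values are illustrative assumptions only.
from dataclasses import dataclass
import numpy as np

@dataclass
class SourceStem:
    name: str                    # e.g. "Speech1", "Barking"
    audio: np.ndarray            # separated mono stem
    label: str = "non_subject"   # "primary" | "secondary" | "non_subject" | "irrelevant"

def classify_sources(stems, context):
    """Toy stand-in for the data-driven classifier described in the disclosure."""
    for stem in stems:
        if stem.name == context.get("in_focus"):
            stem.label = "primary"
        elif stem.name in context.get("related", []):
            stem.label = "secondary"
        elif stem.name in context.get("scene_sounds", []):
            stem.label = "non_subject"
        else:
            stem.label = "irrelevant"
    return stems

def generate_output(stems, gains=None):
    """Mix stems with per-class gains: full suppression of irrelevant sources,
    partial suppression of secondary/non-subject sources, primary kept."""
    gains = gains or {"primary": 1.0, "secondary": 0.5, "non_subject": 0.2, "irrelevant": 0.0}
    return sum(gains[s.label] * s.audio for s in stems)
```

  • For example, setting the "secondary" gain to 0.0 instead of 0.5 corresponds to complete rather than partial suppression of the secondary sound source.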
  • example embodiments herein provide methods for handling a sound source in a media.
  • the method includes: identifying, by an electronic device, at least one subject that is source of sound in each scene in a media; identifying, by the electronic device, at least one of a context of each scene, a context of the electronic device from which the media is captured and a context of an environment of the scene; classifying, by the electronic device, each subject in each scene as at least one of: a primary subject and at least one non-primary subject based on the identification; determining, by the electronic device, a relationship between the primary subject and the at least one non-primary subject in each scene based on the classification; and combining, by the electronic device, a sound from the primary subject and the non-primary subject in specified proportion in response to the determined relationship between the primary subject and the at least one non-primary subject.
  • the method includes partially or completely eliminating, by the electronic device, the sound from the at least one non-primary subject upon determining the relationship between the primary subject and the at least one non-primary subject.
  • the method includes: determining, by the electronic device, a relevancy of the at least one non-primary subject with respect to the context based on a data driven model; and partially or completely suppressing, by the electronic device, the sound from the at least one non-primary subjects based on the determination.
  • the method includes: identifying, by the electronic device, the at least one non-primary subject as irrelevant sound source; and completely suppressing, by the electronic device, the sound from the at least one non-primary subject.
  • the method includes determining, by the electronic device, at least one of an orientation and a movement of the subject in each scene and position of a recording media to adaptively tune the sound from the subject in the scene.
  • example embodiments herein provide an electronic device including a sound source controller coupled with at least one processor, comprising processing circuitry, and a memory.
  • the sound source controller is configured to: determine a context in which a media is captured; determine and classify a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context; generate an output sound by completely suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model based on the determination and classification.
  • the sound source controller is configured to generate the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification.
  • the sound source controller is configured to generate the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
  • example embodiments herein provide an electronic device including a sound source controller coupled with at least one processor, comprising processing circuitry, and a memory.
  • the sound source controller is configured to: identify at least one subject that is a source of sound in each scene in a media; identify at least one of a context of each scene, a context of the electronic device from which the media is captured and a context of an environment of the scene; classify each subject in each scene as at least one of: a primary subject and at least one non-primary subject based on the identification; determine a relationship between the primary subject and the at least one non-primary subject in each scene based on the classification; and combine a sound from the primary subject and the non-primary subject in specified proportion in response to the determined relationship between the primary subject and the at least one non-primary subject.
  • FIG. 1 is a block diagram illustrating an example configuration of an electronic device, according to various embodiments
  • FIG. 2 is a diagram illustrating an example environment in which various hardware components of a sound source controller included in the electronic device are depicted, according to various embodiments;
  • FIG. 3 is a diagram illustrating an example scenario in which operations of an audio video (AV) subject pair generator included in the sound source controller are explained, according to various embodiments;
  • FIG. 4 is a diagram illustrating an example scenario in which operations of a context creator included in the sound source controller are explained, according to various embodiments;
  • FIG. 5 is a diagram illustrating an example scenario in which operations of device context generation are explained, according to various embodiments;
  • FIG. 6 is a diagram illustrating an example in which operations of a voice noise classifier included in the sound source controller are explained, according to various embodiments;
  • FIG. 7 is a diagram illustrating an example in which operations of the voice noise mixer included in the sound source controller are explained, according to various embodiments;
  • FIG. 8 and FIG. 9 are flowcharts illustrating an example method for handling a sound source in a media, according to various embodiments
  • FIGS. 10A, 10B, 10C, 10D, 11A and FIG. 11B are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on a video, according to various embodiments;
  • FIGS. 12A, 12B, 12C and FIG. 12D are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on a video, according to various embodiments;
  • FIGS. 13A, 13B, 13C, 13D and FIG. 13E are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on a video, according to various embodiments.
  • FIG. 14 is a diagram illustrating an example scenario in which speech clarity is improved based on subject orientation, according to various embodiments.
  • Embodiments herein may be described and illustrated in terms of blocks which carry out a described function or functions. These blocks, which may be referred to herein as managers, units, modules, hardware components or the like, are physically implemented by analog and/or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, hardwired circuits and the like, and may optionally be driven by a firmware.
  • the circuits may, for example, be embodied in one or more semiconductor chips, or on substrate supports such as printed circuit boards and the like.
  • circuits of a block may be implemented by dedicated hardware, or by a processor (e.g., one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware to perform some functions of the block and a processor to perform other functions of the block.
  • Each block of the embodiments may be physically separated into two or more interacting and discrete blocks without departing from the scope of the disclosure.
  • the blocks of the embodiments may be physically combined into more complex blocks without departing from the scope of the disclosure.
  • the embodiments herein describe various example methods for handling a sound source in a media (including audio signals and visual information).
  • the method includes determining, by an electronic device, a context in which the media is captured. Further, the method includes determining and classifying, by the electronic device, a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context. In an embodiment, the method includes generating an output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model based on the determination and classification.
  • the method includes generating the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification. In an embodiment, the method includes generating the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
  • the disclosed method can be used to contextually suppress the acoustic sound sources in the audio signals, so as to improve the user experience.
  • the user has thrown a house party for close friends. At the party, many things are happening around, such as background music, a pet dog howling, guests clapping, laughing, giggling, etc.
  • the user navigates through the party to record his friends and pets.
  • the smart phone including a video camera helps in prioritizing the sound sources based on the visual focus context.
  • the generated video will contain audio sounds relevant to the visual scene, thus improving the user experience.
  • the disclosed method uses the environmental context, scene classification, and device context to selectively focus on each of the relevant sound sources. Further, the disclosed method categorizes the sound sources into the primary sound sources, the secondary sound sources, the non-subject sound sources, and irrelevant sound sources based on the environmental context, the device context, and the scene context. The disclosed method captures how the acoustic parameters of the sound sources change. The disclosed method generates different acoustic signals with various combinations of the sound sources in different contexts.
  • the disclosed method uses pre-defined (e.g., specified) data driven model (e.g., AI modes, ML modes or the like) to automatically adjust the sound source proportionate mixing.
  • the disclosed method determines the relative orientation of the mic and the primary sound source, and uses the same to adjust the other sound source proportions to generate an acoustic signal similar to the ideal situation.
  • the disclosed method uses the device context information to determine what is voice and what is noise for the same video recording in different context.
  • the disclosed method can be used to mix the noise content at different levels intelligently to retain the naturalness of the video while minimizing or reducing user effort and simplifying the recording/editing time for the user.
  • the electronic device correlates the context in which the video is captured with the sound types captured as part of the recordings. From the correlation, the electronic device determines irrelevant sound sources and completely suppresses them. The electronic device further establishes a correlation between the visual subjects in focus and the sound sources occurring at that time point. The electronic device categorizes the relevant sound sources as primary subject, secondary subject, and/or non-subject sound sources. Further, based on the modes set by the user from several intelligent AI modes or ML modes, the electronic device automatically determines to either partially or completely suppress the non-subject sound sources. The electronic device further uses the visual and audio information to establish a correlation between the primary and secondary sound sources. Based on the correlation, the electronic device chooses to partially or completely eliminate the secondary sound sources. Further, the electronic device also determines the orientation and movement of the subject in the visual scene and the position of the recording microphone to adaptively tune the subject sound source parameters to have a constant volume level from the source.
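  • For illustration, one way the audio-side correlation between the primary and a secondary sound source could drive partial versus complete suppression is sketched below; the envelope-correlation heuristic and the thresholds are assumptions for this example, not the disclosed algorithm.

```python
# Illustrative sketch: decide partial vs. complete suppression of a secondary
# stem from its correlation with the primary stem. The envelope-correlation
# heuristic and the thresholds are assumptions, not the disclosed method.
import numpy as np

def envelope(x, win=2048):
    """Coarse loudness envelope via frame-wise RMS."""
    n = len(x) // win
    return np.array([np.sqrt(np.mean(x[i*win:(i+1)*win] ** 2)) for i in range(n)])

def suppression_gain(primary, secondary, lo=0.2, hi=0.6):
    e_p, e_s = envelope(primary), envelope(secondary)
    m = min(len(e_p), len(e_s))
    corr = np.corrcoef(e_p[:m], e_s[:m])[0, 1]
    if np.isnan(corr) or corr < lo:
        return 0.0   # uncorrelated with the primary: suppress completely
    if corr > hi:
        return 0.6   # strongly correlated: retain most of it
    return 0.3       # weakly correlated: partial suppression
```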
  • Referring to FIGS. 1 through 14, where similar reference characters denote corresponding features consistently throughout the figures, various example embodiments are shown.
  • FIG. 1 is a block diagram illustrating an example configuration of the electronic device (100), according to various embodiments.
  • the electronic device (100) can be, for example, but not limited to a laptop, a smart phone, a desktop computer, a notebook, a Device-to-Device (D2D) device, a vehicle to everything (V2X) device, a foldable phone, a smart TV, a tablet, an immersive device, a virtual reality (VR) device, a mixed reality device, an augmented reality device, a virtual headset, and an internet of things (IoT) device.
  • the electronic device (100) includes a processor (e.g., including processing circuitry) (110), a communicator (e.g., including communication circuitry) (120), a memory (130), a sound source controller (e.g., including various circuitry) (140) and a data driven controller (e.g., including various circuitry) (150).
  • the processor (110) is coupled with the communicator (120), the memory (130), the sound source controller (140) and the data driven controller (150).
  • the sound source controller (140) determines a context in which a media is captured.
  • the media includes audio signals and visual information.
  • the media can be, for example, but not limited to a video, a multimedia content, an animation, shorts or the like.
  • the sound source controller (140) determines and classifies a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context.
  • the primary sound source can be ambient music
  • the secondary sound source can be the guitar sound
  • the non-subject sound source can be the dog howling and laughing sound.
  • the sound source controller (140) obtains an environmental context (e.g., weather context, location context or the like), a scene classification information, a device context (e.g., CPU usage context, application usage context or the like), and a hearing profile. Based on the environmental context, the scene classification information, the device context, and the hearing profile, the sound source controller (140) determines the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source.
  • the sound source controller (140) determines the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source.
  • Based on the determination and classification, the sound source controller (140) generates an output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model (e.g., AI model, ML model, or the like).
  • the sound source controller (140) generates the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification.
  • At least one of the secondary sound source and the non-subject sound source in the media are completely suppressed by determining a correlation between the primary sound source and at least one of the secondary sound source and the non-subject sound source, and completely suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the determined correlation.
  • At least one of the secondary sound source and the non-subject sound source in the media is completely suppressed by identifying at least one of the secondary sound source and the non-subject sound source in the media as irrelevant sound source, and completely suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the identification.
  • At least one of the secondary sound source and the non-subject sound source in the media is partially suppressed by determining the correlation between the primary sound source and at least one of the secondary sound source and the non-subject sound source, and partially suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the determined correlation.
  • the sound source controller (140) generates the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
  • the sound source controller (140) determines a relative orientation of a recording media (e.g., speaker, mic, or the like) and the primary sound source. Further, the sound source controller (140) may adjust a proportion of the secondary sound source and a proportion of the non-subject sound source upon determining a relative orientation of the recording media and the primary sound source. Further, the sound source controller (140) generates the output sound source by adjusting relative orientation of the recording media and the primary sound source, the proportion of the secondary sound source, and the proportion of the non-subject sound source.
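  • For illustration, a sketch of how the relative orientation between the recording mic and the primary sound source might adjust the mixing proportions is given below; the cosine model, the gain cap, and the floor value are assumptions for this example and are not specified by the disclosure.

```python
# Illustrative sketch: compensate the primary source and rebalance the other
# stems from the mic-to-primary-source orientation. The cosine model and the
# gain limits are assumptions made for this example.
import numpy as np

def orientation_gains(angle_deg, base=None):
    """angle_deg: angle between the primary subject's facing direction and the
    recording mic (0 = facing the mic, 180 = facing away)."""
    base = base or {"primary": 1.0, "secondary": 0.5, "non_subject": 0.2}
    facing = max(np.cos(np.radians(angle_deg)), 0.1)        # fraction of voice energy reaching the mic
    gains = dict(base)
    gains["primary"] = min(base["primary"] / facing, 4.0)   # make-up gain to restore a constant level, capped
    gains["secondary"] = base["secondary"] * facing         # keep background from dominating the boosted primary
    gains["non_subject"] = base["non_subject"] * facing
    return gains
```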
  • the sound source controller (140) detects an event.
  • the event can be, for example, but not limited to, a change in a sound source parameter associated with the media, a change in correlation between a visual subject associated with the media in focus to the sound source occurring at a specified interval, a change in correlation between the primary sound source and the secondary sound source in the media at a specified interval, or a change in an orientation and movement of a subject (e.g., a targeted human, or the like) in a visual scene and position of a recording media associated with the media.
  • the sound source controller (140) determines the context of the media based on the detected event. Further, the sound source controller (140) determines a second relevant sound source in the media as a second primary sound source, a second secondary sound source, and a second non-subject sound source based on the determined context.
  • the sound source controller (140) generates a second output sound by completely suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification.
  • the sound source controller (140) generates the second output sound by partially suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification. In an embodiment, the sound source controller (140) generates the second output sound by automatically adjusting at least one of the second primary sound source, the second secondary sound source, and the second non-subject sound source using the data driven model based on the determination and classification.
  • the sound source controller (140) selectively monitors each relevant sound source based on at least one of the environmental context, the scene classification information, and the device context.
  • the sound source controller (140) identifies at least one subject that is a source of sound in each scene in the media. Further, the sound source controller (140) identifies at least one of the context of each scene, the context of the electronic device (100) from which the media is captured, and the context of the environment of the scene. Further, the sound source controller (140) classifies each subject in each scene as at least one of: the primary subject and at least one non-primary subject based on the identification. Further, the sound source controller (140) determines the relationship between the primary subject and the at least one non-primary subject in each scene based on the classification. Further, the sound source controller (140) combines the sound from the primary subject and the non-primary subject in a pre-defined proportion in response to the determined relationship between the primary subject and the non-primary subject.
  • the sound source controller (140) partially or completely eliminates the sound from the non-primary subject upon determining the relationship between the primary subject and the non-primary subject. In an embodiment, the sound source controller (140) determines the relevancy of the non-primary subject with respect to the context based on the data driven model. Further, the sound source controller (140) partially or completely suppresses the sound from the non-primary subjects based on the determination.
  • the sound source controller (140) identifies the non-primary subject as irrelevant sound source. Further, the sound source controller (140) completely suppresses the sound from the non-primary subject. In an embodiment, the sound source controller (140) determines at least one of an orientation and a movement of the subject in each scene and position of a recording media to adaptively tune the sound from the subject in the scene.
  • the sound source controller (140) may, for example, be implemented by analog and/or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, hardwired circuits and the like, and may optionally be driven by firmware.
  • the processor (110) may include one or a plurality of processors.
  • the one or the plurality of processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
  • the processor (110) may include multiple cores and is configured to execute the instructions stored in the memory (130).
  • the processor 110 according to an embodiment of the disclosure may include various processing circuitry and/or multiple processors.
  • processor may include various processing circuitry, including at least one processor, wherein one or more of at least one processor, individually and/or collectively in a distributed manner, may be configured to perform various functions described herein.
  • when a "processor," "at least one processor," and "one or more processors" are described as being configured to perform numerous functions, these terms cover situations, for example and without limitation, in which one processor performs some of the recited functions and another processor(s) performs others of the recited functions, and also situations in which a single processor may perform all of the recited functions.
  • the at least one processor may include a combination of processors performing various of the recited /disclosed functions, e.g., in a distributed manner.
  • At least one processor may execute program instructions to achieve or perform various functions.
  • the processor (110) is configured to execute instructions stored in the memory (130) and to perform various processes.
  • the communicator (120) may include various communication circuitry and is configured for communicating internally between internal hardware components and with external devices via one or more networks.
  • the memory (130) also stores instructions to be executed by the processor (110).
  • the memory (130) may include non-volatile storage elements. Examples of such non-volatile storage elements may include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable (EEPROM) memories.
  • the memory (130) may, in some examples, be considered a non-transitory storage medium.
  • non-transitory may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that the memory (130) is non-movable.
  • a non-transitory storage medium may store data that can, over time, change (e.g., in Random Access Memory (RAM) or cache).
  • the communicator (120) may include an electronic circuit specific to a standard that enables wired or wireless communication.
  • the communicator (120) is configured to communicate internally between internal hardware components of the electronic device (100) and with external devices via one or more networks.
  • At least one of the pluralities of modules/controller may be implemented through an Artificial intelligence (AI) model using the data driven controller (150).
  • the data driven controller (150) can be a machine learning (ML) model based controller and AI model based controller.
  • a function associated with the AI model may be performed through the non-volatile memory, the volatile memory, and the processor (110).
  • the processor (110) may include one or a plurality of processors.
  • processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
  • the one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or AI model stored in the non-volatile memory and the volatile memory.
  • the predefined operating rule or artificial intelligence model is provided through training or learning.
  • being provided through learning may refer, for example, to a predefined operating rule or AI model of a desired characteristic being made by applying a learning algorithm to a plurality of learning data.
  • the learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
  • the AI model may include a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation using the calculation result of a previous layer and the plurality of weights.
  • Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann Machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
  • the learning algorithm may refer, for example, to a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction.
  • Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
  • while FIG. 1 shows various hardware components of the electronic device (100), it is to be understood that the disclosure is not limited thereto.
  • the electronic device (100) may include a smaller or larger number of components.
  • the labels or names of the components are used only for illustrative purposes and do not limit the scope of the disclosure.
  • One or more components can be combined to perform the same or a substantially similar function in the electronic device (100).
  • FIG. 2 is a diagram illustrating an example environment in which various hardware components of the sound source controller (140) included in the electronic device (100) are depicted, according to various embodiments.
  • At step 1, the video is provided, wherein the video can be pre-recorded or is being recorded in real time.
  • At step 2, all the sound sources present in the audio stream for a given time span are separated.
  • At step 3, the visual subjects within the given time frame are extracted.
  • At step 4, the environment information/context while the video was recorded, or from the video scene, is determined.
  • At step 5, the information about the visual and acoustic scene generated from the video is determined.
  • At step 6, the information from the device application on which the video is being processed is gathered.
  • At step 7, the relation information about the visual and acoustic subjects is carried and the relation is generated (a skeletal sketch of these steps wired together is given below).
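  • For illustration, a skeletal sketch of steps 1 to 7 is given below; every stage is a trivial placeholder, since the disclosure leaves the actual source separation, subject detection, scene classification, and context models to data-driven implementations.

```python
# Skeletal pipeline mirroring steps 1-7 above. Each stage is a trivial
# placeholder; in practice these would be data-driven models as described
# in the disclosure (source separation, subject detection, scene models).
def separate_sound_sources(audio):                        # step 2
    return {"full_mix": audio}

def extract_visual_subjects(frames):                      # step 3
    return ["person_in_focus"]

def detect_environment(frames, audio):                    # step 4
    return {"location": "indoor"}

def classify_scene(frames, audio):                        # step 5
    return {"scene": "party"}

def read_device_context(device_info):                     # step 6
    return {"app": device_info.get("app", "camera")}

def pair_audio_visual_subjects(stems, visual_subjects):   # step 7
    return [(v, s) for v in visual_subjects for s in stems]

def process_clip(video_frames, audio, device_info):       # step 1: the clip is provided
    stems = separate_sound_sources(audio)
    subjects = extract_visual_subjects(video_frames)
    context = {**detect_environment(video_frames, audio),
               **classify_scene(video_frames, audio),
               **read_device_context(device_info)}
    av_pairs = pair_audio_visual_subjects(stems, subjects)
    return stems, av_pairs, context
```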
  • Embodiments herein disclose an audio visual (AV) subject pair generator (210), wherein the AV subject pair generator (210) can correlate one or more visual subjects with one or more acoustic subjects.
  • the operations of the AV subject pair generator (210) are explained in FIG. 3.
  • a weight is assigned to context elements by the context creator (220). The weight is set by the electronic device (100) or the user of the electronic device (100).
  • the operations of the context creator (220) are explained in FIG. 4.
  • a Voice Noise (VN) classifier uses the device context information to categorize each of the context elements to be the primary subjects, the secondary subjects, the non-subject subjects, and the irrelevant subjects.
  • the VN classifier (230) can categorize the acoustic sound sources to be the primary subjects, the secondary subjects, the non-subject subjects and the irrelevant subjects using information received from the AV subject pair generator (210) along with the context information.
  • the operations of the voice noise classifier (230) are explained in FIG. 6.
  • pre-defined AI modes are provided, wherein a selection can be made from the AI modes repository (240), and the selected AI mode helps a VN estimator (not shown) to classify what is voice and what is noise.
  • the VN mixer (250) proportionately mixes the voice and noise sound sources.
  • the VN mixer (250) can also suitably adjust the intensity of the primary subject source, while keeping all other subject sources intensities at a constant level.
  • the operations of the voice noise mixer (250) are explained in greater detail below with reference to FIG. 7. Further, the VN mixer (250) proportionately mixes the voice and noise sound sources, and the result is then provided to the video multiplexer (muxer) (260).
  • while FIG. 2 shows various hardware components of the sound source controller (140), it should be understood that the disclosure is not limited thereto.
  • the sound source controller (140) may include a smaller or larger number of components.
  • the labels or names of the components are used only for illustrative purposes and do not limit the scope of the disclosure.
  • One or more components can be combined to perform the same or a substantially similar function in the sound source controller (140).
  • FIG. 3 is a diagram illustrating an example scenario (300) in which operations of the AV subject pair generator (210) included in the sound source controller (140) are explained, according to various embodiments.
  • the AV subject pair generator (210) is responsible for linking the visual subjects to the acoustic signal subjects.
  • the AV subject pair generator (210) uses the pre-linked audio-visual information to link the incoming acoustic and the visual subjects.
  • the AV subject pair generator (210) uses corresponding subject characteristics such as Speaker Embeddings, Gender, Type, etc. to disambiguate similar acoustic subjects.
  • the AV subject pair generator (210) uses the deep learning technique/machine learning technique /generative model technique which is pre-trained on several audio-visual subjects to generate the relation.
  • Table 3 (the AV subject pair generation) is obtained from Table 1 (audio source separation) and Table 2 (visual subject extraction).
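  • For illustration, a minimal sketch of pairing visual and acoustic subjects by embedding similarity is given below; the greedy cosine-similarity matching and the assumption of pre-computed embeddings are illustrative only, since the disclosure states that a pre-trained model generates the relation.

```python
# Illustrative sketch: link each visual subject to the most similar acoustic
# subject using cosine similarity of (assumed) pre-computed embeddings. The
# greedy matching is an assumption, not the disclosed model.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def pair_av_subjects(visual, acoustic):
    """visual, acoustic: dicts of name -> embedding vector (np.ndarray)."""
    pairs = []
    for v_name, v_emb in visual.items():
        best_name, best_emb = max(acoustic.items(), key=lambda kv: cosine(v_emb, kv[1]))
        pairs.append((v_name, best_name, cosine(v_emb, best_emb)))
    return pairs  # e.g. [("Nikita", "Speech1", 0.83), ...]
```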
  • FIG. 4 is a diagram illustrating an example scenario (400) in which operations of the context creator (220) included in the sound source controller (140) are explained, according to various embodiments.
  • the context creator (220) creates a knowledge graph dynamically from the sequence of audio visual frames.
  • the context creator (220) correlates the audio-visual subjects to the target scene.
  • the context creator (220) uses the pre-trained DL technique/ML technique /generative model technique with information to associate the AV subjects to the scene.
  • the context creator (220) assigns each of the subjects in the AV frame to one or more of the detected scene.
  • the device context is responsible for determining the usage context of the solution on the target device.
  • Table 7 is the output of the context creator, obtained from Table 4 (environmental context), Table 5 (scene classification), and Table 6 (device context).
        Subject | Type       | Gender | Name      | Acoustic Subject | Scene
        Person  | Human      | Female | Nikita    | Speech1          | Birthday
        Person  | Human      | Female | Emanuella | Speech2          | Birthday
        Dog     | Animal     | -      | -         | Barking          | Birthday, General
        Beach   | Sea        | -      | Dead Sea  | Water Gush       | General
        Beach   | Sea        | -      | Dead Sea  | Wind             | General
        Boat    | Commercial | -      | Titapic   | Boat Horn        | -
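  • For illustration, a minimal sketch of a context creator that assigns each AV subject pair to one or more detected scenes and attaches a weight (producing rows like Table 7 above) is given below; the keyword matching and the default weight are assumptions for this example, since the disclosure uses a pre-trained model for this association.

```python
# Illustrative sketch of a context creator: assign each AV subject pair to one
# or more detected scenes and attach a weight. The keyword matching and the
# default weight are assumptions; the disclosure uses a pre-trained model.
def build_context(av_pairs, detected_scenes, scene_keywords, default_weight=0.5):
    """av_pairs: [(visual_subject, acoustic_subject), ...]
    detected_scenes: e.g. ["Birthday", "General"]
    scene_keywords: e.g. {"Birthday": {"Speech1", "Speech2"}}"""
    context = []
    for visual, acoustic in av_pairs:
        scenes = [s for s in detected_scenes if acoustic in scene_keywords.get(s, set())]
        context.append({
            "visual": visual,
            "acoustic": acoustic,
            "scenes": scenes or ["General"],
            "weight": default_weight,   # may be overridden by the device or the user
        })
    return context
```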
  • FIG. 5 is a diagram illustrating an example scenario in which operations of the device context generation are explained, according to various embodiments.
  • the device context can gather context information not only from a messenger application, but also from a voice call, a video call, a media sharing application, etc.
  • the intent is to preserve different sound sources as part of different sharing process.
  • John shares his happy moments with Neeta and, at 504, John shares a video segment with the fire department, which contained the fire alarm sound as evidence.
  • FIG. 6 is a diagram illustrating an example scenario (600) in which operations of the voice noise classifier (230) included in the sound source controller (140) are explained, according to various embodiments.
  • the voice noise classifier (230) categorizes the acoustic subject to be either voice or noise in a specific context. For the voice noise classifier (230), the voice may refer, for example, to sound sources which need to be retained, and noise may refer, for example, to sound sources which will be partially or completely eliminated.
  • the voice noise classifier (230) takes the information from the context creator (220) and the AV subject pair generator (210) and generates the classification labels.
  • Table 9 is an output of the voice noise classifier (230).
        Subject | Type       | Gender | Name      | Acoustic Subject | Scene             | Voice noise classifier
        Person  | Human      | Female | Nikita    | Speech1          | Birthday          | Primary
        Person  | Human      | Female | Emanuella | Speech2          | Birthday          | Secondary
        Dog     | Animal     | -      | -         | Barking          | Birthday, General | Irrelevant
        Beach   | Sea        | -      | Dead Sea  | Water Gush       | General           | Non-Subject
        Beach   | Sea        | -      | Dead Sea  | Wind             | General           | Non-Subject
        Boat    | Commercial | -      | Titapic   | Boat Horn        | General           | Irrelevant
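  • For illustration, a rule-based stand-in for the voice noise classifier (230) is sketched below; it labels each context entry as Primary, Secondary, Non-Subject, or Irrelevant, as in Table 9 above, but the rules themselves are assumptions, since the disclosure leaves the classification to a data-driven model.

```python
# Illustrative rule-based stand-in for the voice noise classifier: label each
# context entry as Primary / Secondary / Non-Subject / Irrelevant (as in
# Table 9 above). The rules below are assumptions, not the disclosed model.
def classify_voice_noise(context_entries, current_scene, in_focus_subject):
    labels = {}
    for entry in context_entries:
        if current_scene not in entry["scenes"]:
            labels[entry["acoustic"]] = "Irrelevant"    # not tied to the current scene
        elif entry["visual"] == in_focus_subject:
            labels[entry["acoustic"]] = "Primary"       # the subject in visual focus
        elif entry.get("visual"):
            labels[entry["acoustic"]] = "Secondary"     # other on-screen subjects
        else:
            labels[entry["acoustic"]] = "Non-Subject"   # ambient scene sounds
    return labels
```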
  • FIG. 7 is a diagram illustrating an example scenario in which operations of the voice noise mixer (250) included in the sound source controller (140) are explained, according to various embodiments.
  • the voice noise mixer (250) is responsible for taking in one of the AI modes and altering the sound sources in the audio signal proportionately.
  • Several artificial intelligence modes, such as speech enhancement, vlog, visual AI, music AI, etc., are capable of altering the proportion of sound sources automatically.
  • These artificial intelligence modes dictate how the various sound sources will be altered based on the context.
  • the voice noise mixer (250) takes in the VN classifier (230) results and generates the mixing proportions, as shown in Table 10 below.
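  • For illustration, a minimal sketch of the voice noise mixer is given below; the mode names and the mixing proportions are assumptions for this example and do not reproduce the actual values of Table 10.

```python
# Illustrative sketch of the voice noise mixer: a selected AI mode maps each
# classification label to a mixing proportion, and the separated stems are
# mixed accordingly. The mode names and proportions are assumptions.
import numpy as np

MODE_PROPORTIONS = {
    "speech_enhancement": {"Primary": 1.0, "Secondary": 0.3, "Non-Subject": 0.1, "Irrelevant": 0.0},
    "vlog":               {"Primary": 1.0, "Secondary": 0.6, "Non-Subject": 0.4, "Irrelevant": 0.0},
}

def mix_voice_noise(stems, labels, mode="vlog"):
    """stems: dict acoustic_subject -> np.ndarray; labels: dict acoustic_subject -> label."""
    proportions = MODE_PROPORTIONS[mode]
    length = max(len(s) for s in stems.values())
    out = np.zeros(length)
    for name, stem in stems.items():
        gain = proportions.get(labels.get(name, "Irrelevant"), 0.0)
        out[:len(stem)] += gain * stem   # irrelevant sources get zero gain (complete suppression)
    return out
```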
  • FIG. 8 and FIG. 9 are flowcharts (800 and 900) illustrating an example method for handling the sound source in the media, according to various embodiments.
  • the operations (802-810) are handled by the sound source controller (140).
  • the method includes determining the context in which the media is captured.
  • the method includes determining and classifying the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source based on the determined context.
  • the method includes generating the output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification.
  • the method includes generating the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification. In an embodiment, at 810, the method includes generating the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
  • the operations (902-910) are handled by the sound source controller (140).
  • the method includes identifying at least one subject that is a source of sound in each scene in the media.
  • the method includes identifying at least one of the context of each scene, the context of the electronic device (100) from which the media is captured and the context of the environment of the scene.
  • the method includes classifying each subject in each scene as at least one of: a primary subject and at least one non-primary subject based on the identification.
  • the method includes determining a relationship between the primary subject and the at least one non-primary subject in each scene based on the classification.
  • the method includes combining a sound from the primary subject and the non-primary subject in pre-defined proportion in response to the determined relationship between the primary subject and the at least one non-primary subject.
  • the disclosed method can be used to determine the sound sources present in the audio signal.
  • the disclosed method can be used to classify the sound sources to be relevant, or irrelevant based on the visual scene and environmental context and completely eliminate irrelevant sound sources.
  • the disclosed method can be used to classify the relevant sound sources to be primary, secondary, or non-subject sound sources.
  • the disclosed method can be used to partially or completely suppress the non-subject noises with relevance to the context.
  • the disclosed method can be used to dynamically host several AI based modes for contextual and intelligent handling.
  • the disclosed method can be used to contextually suppress the acoustic sound sources in the audio signals, so as to improve the user experience.
  • the disclosed method can be used to mix the noise content at different levels intelligently to retain the naturalness of the video while minimizing or reducing user effort and simplifying the recording/editing time for the user.
  • FIGS. 10A, 10B, 10C, 10D, 11A and FIG. 11B are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on the video, according to various embodiments.
  • the user has thrown a house party for his close friends. At the party, many things are happening around, such as background music, barking dogs, guests clapping, laughing, giggling, etc. The user navigates through the party to make a video recording of his friends and pets.
  • Embodiments herein can help in prioritizing the sound sources based on the visual focus context.
  • Table 11 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
  • Table 12 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
  • Table 13 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
  • Table 14 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
  • FIG. 11A and FIG. 11B are diagrams illustrating examples of corresponding data flow of FIGS. 10A to 10D, wherein the AV subject pair generator (210) determines the visual subjects present in the video.
  • Table 15 depicts the determined visual subjects for the current example (depicted in FIG. 11A and FIG. 11B).
  • the context creator (220) can receive the environmental context (as depicted in table 16), the scene classification (as depicted in table 17), and device context (as depicted in table 18), using which the context creator (220) can generate the context.
  • the VN classifier (230) can classify the visual and acoustic subjects as the primary subjects, the secondary subjects, the non-subject subjects, or the irrelevant subjects, in the current scenario.
  • Table 19 depicts an example classification.
  • the VN mixer (250) proportionately mixes the voice and noise sound sources, which is then provided to the video multiplexer (muxer (260)).
  • Table 20 depicts an example scenario.
  • FIGS. 12A, 12B, 12C and 12D are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on the video, according to various embodiments.
  • FIGS. 13A, 13B, 13C, 13D and 13E are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on the video, according to various embodiments.
  • Embodiments herein can help in prioritizing the sound sources based on the visual focus context.
  • table 21 is an example table depicting the various sound sources at the instance (when John has started recording video), their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
  • Table 22 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
  • Table 23 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
  • Table 24 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)), for the video that John is going to share with his friends from the party.
  • Table 25 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
  • FIG. 14 is a diagram illustrating an example scenario (1400) in which speech clarity is improved based on the subject orientation, according to various embodiments.
  • the user (e.g., Edward)
  • the user speaks spontaneously, alternately looking at the camera and at the food item being prepared.
  • the background noise is almost constant, whereas the user's voice diminishes based on his head orientation.
  • Edward from Spain is reviewing Indian street food. His associate is video recording the review using the camera.
  • Edward, while explaining the dish, rotates his head to look at the food item and describe it further.
  • the speech parameters drop when the user rotates his head away from the mic, as indicated in the table.
  • Edward turns back towards the camera to speak and explain further about the street food.
  • Embodiments herein are explained with respect to scenarios wherein the user is capturing video or using recorded videos. However, it may be apparent to a person of ordinary skill in the art that embodiments herein may be applicable to any scenario wherein sound is being captured, such as, but not limited to, a sound recording, a call recording, and so on.
  • the embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the elements.
  • the elements can be at least one of a hardware device, or a combination of hardware device and software module.

Abstract

Embodiments herein disclose methods for handling a sound source in a media by an electronic device. The method includes: determining and classifying a relevant sound source in a media as a primary sound source, a secondary sound source, and a non-subject sound source based on a determined context; generating an output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model based on the determination and classification; or generating the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification.

Description

METHOD AND ELECTRONIC DEVICE FOR HANDLING SOUND SOURCE IN MEDIA
The disclosure relates to content processing (e.g., audio processing, video processing or the like), and for example, to a method and an electronic device for handling acoustic sound sources in audio signals (e.g., suppressing acoustic sound sources in the audio signals, modifying acoustic sound sources in the audio signals, optimizing the acoustic sound sources in the audio signals or the like).
A sound recording can include ambient/background noises, especially when the sound recording is done in a noisy environment (such as, party, sports events, school function, college function or the like). The recorded audio signals (or recorded sound signals) often carry environment sounds along with an intended sound source. For example, audio conversation recorded in a restaurant will contain ambient music, speech, and crowd babble noise.
Consider an example scenario, where a user has thrown a house party for close friends. In the party, there are many things happening around such as background music, barking dogs, noisy guests, etc. The user navigates through the party to record his friends and pets. However, the sound recording contains all the background noise, which adversely affects the quality of the sound recording.
In existing methods and systems, noise suppression aims at reducing the unwanted sound(s) from the recorded audio signal. Noise suppression schemes are typically either Digital Signal Processing (DSP) based or Deep Neural Network (DNN) model based. The DSP based schemes are lightweight, have low latency, and are trained to eliminate only a few noise types. The DNN model based schemes are powerful in eliminating a variety of noise types, but are relatively larger in size and need more computational resources.
Current noise suppression solutions have no intelligence to decide on the noise proportion of each sound source present in the audio signal. In an example scenario, consider that a group of friends are celebrating a birthday party at a home, and a video recording of the party is being made. During the celebrations, there is music being played, friends chattering, and laughter. Suddenly, the home fire alarm goes off due to a circuit malfunction, and the alarm sound is captured in the video. Current recording systems are not intelligent enough to diminish the alarm sound when the event being captured in the video is a birthday party.
Further, the orientation of a sound source often changes while the recording is in progress (for example, a speaker turning away from the microphone). Due to this, the sound intensity from the source decreases, whereas the environmental noise remains constant at the level it had before the rotation.
The above information is presented as background information only to help the reader to understand the disclosure. No determination or assertion as to whether any of the above might be applicable as prior art with regard to the present application is made.
Embodiments of the disclosure may provide methods and systems (e.g., an electronic device or the like) to contextually suppress acoustic sound sources in audio signals, where the context in which the media (which comprises audio signals and visual information) was captured is detected along with the sound types, the irrelevant sound sources are identified, and the irrelevant sound sources are suppressed.
Embodiments of the disclosure may provide systems and methods to determine the relevant sound sources as primary subject, secondary subject, and/or non-subject sound sources based on the relevancy of the sound and the mode set by the user from among several intelligent data driven modes (e.g., AI modes or the like), and to automatically determine whether to partially or completely suppress the non-subject sound sources.
Embodiments of the disclosure may provide systems and methods to choose to partially or completely eliminate secondary sound sources, based on a correlation between the primary sound sources and secondary sound sources.
Embodiments of the disclosure may provide systems and methods to determine the orientation and movement of a subject (e.g., a human or the like) in the visual scene and the position of a recording microphone, and to adaptively tune the subject sound source.
Embodiments of the disclosure may provide systems and methods that use predefined AI modes to regulate the proportionate mixing of the primary sound sources, the secondary sound sources, the non-subject sound sources, and irrelevant sound sources based on context.
Embodiments of the disclosure may provide systems and methods to predefine the AI modes to automatically tune the audio content based on relevance to the contextual situation, the capability of the target device on which the content is to be played, and the user's hearing profile.
Embodiments of the disclosure may provide systems and methods in which the AI modes automatically tune the primary subject sound sources, the secondary subject sound sources, the non-subject sound sources, and the irrelevant subject sound sources based on target device capabilities, the user's hearing profile, the user's intention to play the video/audio, and the contextual situation in which the signals were recorded.
Accordingly, various example embodiments herein provide methods for handling a sound source in a media (including audio signals and visual information). The method includes: determining, by an electronic device, a context in which the media is captured; determining and classifying, by the electronic device, a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context; and generating an output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model based on the determination and classification.
In an example embodiment, the method includes: generating the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification.
In an example embodiment, the method includes: generating the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
In an example embodiment, the method includes: detecting, by the electronic device, at least one event, wherein the at least one event includes at least one of: a change in a sound source parameter associated with the media, a change in correlation between a visual subject associated with the media in focus and the sound source occurring at a specified interval, a change in correlation between the primary sound source and the secondary sound source in the media at a specified interval, and a change in an orientation and movement of a subject in a visual scene and a position of a recording media associated with the media. The method may further include: determining, by the electronic device, the context of the media based on the at least one detected event; determining, by the electronic device, a second relevant sound source in the media as a second primary sound source, a second secondary sound source, and a second non-subject sound source based on the determined context; and generating a second output sound by completely suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification.
In an example embodiment, the method includes generating a second output sound by partially suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification.
In an example embodiment, the method includes generating a second output sound by automatically adjusting at least one of the second primary sound source, the second secondary sound source, and the second non-subject sound source using the data driven model based on the determination and classification.
In an example embodiment, completely suppressing at least one of the secondary sound source and the non-subject sound source in the media includes determining a correlation between the primary sound source and at least one of the secondary sound source and the non-subject sound source, and completely suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the determined correlation.
In an example embodiment, completely suppressing at least one of the secondary sound source and the non-subject sound source in the media includes identifying at least one of the secondary sound source and the non-subject sound source in the media as irrelevant sound source, and completely suppressing at least one of the secondary sound source and the non-subject sound source in the media based on identifying at least one of the secondary sound source and the non-subject sound source in the media as irrelevant sound source.
In an example embodiment, partially suppressing at least one of the secondary sound source and the non-subject sound source in the media includes determining a correlation between the primary sound source and at least one of the secondary sound source and the non-subject sound source, and partially suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the determined correlation.
In an example embodiment, determining and categorizing, by the electronic device, the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source includes obtaining, by the electronic device, at least one of an environmental context, a scene classification information, a device context, and a hearing profile, and determining, by the electronic device, the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source based on at least one of the environmental context, the scene classification information, the device context, and the hearing profile.
In an example embodiment, the method includes selectively monitoring, by the electronic device, each relevant sound source based on at least one of the environmental context, the scene classification information, and the device context.
In an example embodiment, generating the output sound source by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model includes determining a relative orientation of a recording media and the primary sound source, adjusting a proportion of the secondary sound source and a proportion of the non-subject sound source upon determining a relative orientation of the recording media and the primary sound source, and generating the output sound source by adjusting relative orientation of the recording media and the primary sound source, the proportion of the secondary sound source, and the proportion of the non-subject sound source.
Accordingly, example embodiments herein provide methods for handling a sound source in a media. The method includes: identifying, by an electronic device, at least one subject that is a source of sound in each scene in a media; identifying, by the electronic device, at least one of a context of each scene, a context of the electronic device from which the media is captured and a context of an environment of the scene; classifying, by the electronic device, each subject in each scene as at least one of: a primary subject and at least one non-primary subject based on the identification; determining, by the electronic device, a relationship between the primary subject and the at least one non-primary subject in each scene based on the classification; and combining, by the electronic device, a sound from the primary subject and the non-primary subject in specified proportion in response to the determined relationship between the primary subject and the at least one non-primary subject.
In an example embodiment, the method includes partially or completely eliminating, by the electronic device, the sound from the at least one non-primary subject upon determining the relationship between the primary subject and the at least one non-primary subject.
In an example embodiment, the method includes: determining, by the electronic device, a relevancy of the at least one non-primary subject with respect to the context based on a data driven model; and partially or completely suppressing, by the electronic device, the sound from the at least one non-primary subjects based on the determination.
In an example embodiment, the method includes: identifying, by the electronic device, the at least one non-primary subject as irrelevant sound source; and completely suppressing, by the electronic device, the sound from the at least one non-primary subject.
In an example embodiment, the method includes determining, by the electronic device, at least one of an orientation and a movement of the subject in each scene and position of a recording media to adaptively tune the sound from the subject in the scene.
Accordingly, example embodiments herein provide an electronic device including a sound source controller coupled with at least one processor, comprising processing circuitry, and a memory. The sound source controller is configured to: determine a context in which a media is captured; determine and classify a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context; generate an output sound by completely suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model based on the determination and classification.
In an example embodiment, the sound source controller is configured to generate the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification.
In an example embodiment, the sound source controller is configured to generate the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
Accordingly, example embodiments herein provide an electronic device including a sound source controller coupled with at least one processor, comprising processing circuitry, and a memory. The sound source controller is configured to: identify at least one subject that is a source of sound in each scene in a media; identify at least one of a context of each scene, a context of the electronic device from which the media is captured and a context of an environment of the scene; classify each subject in each scene as at least one of: a primary subject and at least one non-primary subject based on the identification; determine a relationship between the primary subject and the at least one non-primary subject in each scene based on the classification; and combine a sound from the primary subject and the non-primary subject in specified proportion in response to the determined relationship between the primary subject and the at least one non-primary subject.
These and other aspects of various example embodiments herein will be better appreciated and understood when considered in conjunction with the following description and the accompanying drawings. It should be understood, however, that the following descriptions, while indicating various example embodiments and numerous specific details thereof, are given by way of illustration and not of limitation. Many changes and modifications may be made within the scope of the disclosure herein without departing from the spirit thereof.
Various example embodiments disclosed herein are illustrated in the accompanying drawings, throughout which like reference letters indicate corresponding parts in the various figures. The above and other aspects, features and advantages of certain embodiments of the present disclosure will be more apparent from the following detailed description, taken in conjunction with the accompanying drawings, in which:
FIG. 1 is a block diagram illustrating an example configuration of an electronic device, according to various embodiments;
FIG. 2 is a diagram illustrating an example environment in which various hardware components of a sound source controller included in the electronic device are depicted, according to various embodiments;
FIG. 3 is a diagram illustrating an example scenario in which operations of an audio video (AV) subject pair generator included in the sound source controller are explained, according to various embodiments;
FIG. 4 is a diagram illustrating an example scenario in which operations of a context creator included in the sound source controller are explained, according to various embodiments;
FIG. 5 is a diagram illustrating an example scenario in which operations of device context generation are explained, according to various embodiments;
FIG. 6 is a diagram illustrating an example in which operations of a voice noise classifier included in the sound source controller are explained, according to various embodiments;
FIG. 7 is a diagram illustrating an example in which operations of the voice noise mixer included in the sound source controller are explained, according to various embodiments;
FIG. 8 and FIG. 9 are flowcharts illustrating an example method for handling a sound source in a media, according to various embodiments;
FIGS. 10A, 10B, 10C, 10D, 11A and FIG. 11B are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on a video, according to various embodiments;
FIGS. 12A, 12B, 12C and FIG. 12D are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on a video, according to various embodiments;
FIGS. 13A, 13B, 13C, 13D and FIG. 13E are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on a video, according to various embodiments; and
FIG. 14 is a diagram illustrating an example scenario in which speech clarity is improved based on subject orientation, according to various embodiments.
The various example embodiments herein and the various features and advantageous details thereof are explained more fully with reference to the various non-limiting example embodiments that are illustrated in the accompanying drawings and detailed in the following description. Descriptions of well-known components and processing techniques may be omitted so as to not unnecessarily obscure the embodiments herein. The examples used herein are intended merely to facilitate an understanding of ways in which the embodiments herein can be practiced and to further enable those of skill in the art to practice the embodiments herein. Accordingly, the examples should not be construed as limiting the scope of the disclosure.
For the purposes of interpreting this disclosure, the definitions (as defined herein) will apply and whenever appropriate the terms used in singular will also include the plural and vice versa. It is to be understood that the terminology used herein is for the purposes of describing various embodiments only and is not intended to be limiting. The terms "comprising", "having", and "including" are to be construed as open-ended terms unless otherwise noted.
The words/phrases "exemplary", "example", "illustration", "in an instance" "and so on", "etc.", "etcetera", "e.g..,", "i.e.," are merely used herein to refer, for example, to "serving as an example, instance, or illustration." Any embodiment or implementation of the present subject matter described herein using the words/phrases "exemplary", "example", "illustration", "in an instance" "and so on", "etc.", "etcetera", "e.g..,", "i.e.," is not necessarily to be construed as preferred or advantageous over other embodiments.
Embodiments herein may be described and illustrated in terms of blocks which carry out a described function or functions. These blocks, which may be referred to herein as managers, units, modules, hardware components or the like, are physically implemented by analog and/or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, hardwired circuits and the like, and may optionally be driven by a firmware. The circuits may, for example, be embodied in one or more semiconductor chips, or on substrate supports such as printed circuit boards and the like. The circuits of a block may be implemented by dedicated hardware, or by a processor (e.g., one or more programmed microprocessors and associated circuitry), or by a combination of dedicated hardware to perform some functions of the block and a processor to perform other functions of the block. Each block of the embodiments may be physically separated into two or more interacting and discrete blocks without departing from the scope of the disclosure. Likewise, the blocks of the embodiments may be physically combined into more complex blocks without departing from the scope of the disclosure.
It should be noted that elements in the drawings are illustrated for the purposes of this description and ease of understanding and may not have necessarily been drawn to scale. For example, the flowcharts/sequence diagrams illustrate the method in terms of the steps required for understanding of aspects of the embodiments as disclosed herein. Furthermore, in terms of the construction of the device, one or more components of the device may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the present embodiments so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein. Furthermore, in terms of the system, one or more components/modules which comprise the system may have been represented in the drawings by conventional symbols, and the drawings may show only those specific details that are pertinent to understanding the various example embodiments so as not to obscure the drawings with details that will be readily apparent to those of ordinary skill in the art having the benefit of the description herein.
The accompanying drawings are used to help easily understand various technical features and it should be understood that the embodiments presented herein are not limited by the accompanying drawings. As such, the present disclosure should be construed to extend to any modifications, equivalents, and substitutes in addition to those which are particularly set out in the accompanying drawings and the corresponding description. Usage of words such as first, second, third etc., to describe components/elements/steps is for the purposes of this description and should not be construed as sequential ordering/placement/occurrence unless specified otherwise.
The terms "audio" and "sound" are used interchangeably in the patent disclosure.
The embodiments herein describe various example methods for handling a sound source in a media (including audio signals and visual information). The method includes determining, by an electronic device, a context in which the media is captured. Further, the method includes determining and classifying, by the electronic device, a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context. In an embodiment, the method includes generating an output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model based on the determination and classification. In an embodiment, the method includes generating the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification. In an embodiment, the method includes generating the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
The disclosed method can be used to contextually suppress the acoustic sound sources in the audio signals, so as to improve the user experience. In an example, the user has thrown a house party for close friends. At the party, there are many things happening around, such as background music, the pet dog howling, and guests clapping, laughing, and giggling. The user navigates through the party to record his friends and pets. Based on the disclosed method, the smart phone including a video camera helps in prioritizing the sound sources based on the visual focus context. The generated video will contain audio sounds relevant to the visual scene, thus improving the user experience.
The disclosed method uses the environmental context, scene classification, and device context to selectively focus on each of the relevant sound sources. Further, the disclosed method categorizes the sound sources into the primary sound sources, the secondary sound sources, the non-subject sound sources, and irrelevant sound sources based on the environment context, the device context, and the scene context. The disclosed method tracks how the acoustic parameters of the sound sources change. The disclosed method generates different acoustic signals with various combinations of the sound sources in different contexts.
The disclosed method uses pre-defined (e.g., specified) data driven models (e.g., AI modes, ML modes, or the like) to automatically adjust the proportionate mixing of the sound sources. The disclosed method determines the relative orientation of the mic and the primary sound source, and uses it to adjust the proportions of the other sound sources to generate an acoustic signal similar to the ideal situation. The disclosed method uses the device context information to determine what is voice and what is noise for the same video recording in different contexts.
The disclosed method can be used to mix the noise content at different levels intelligently to retain the naturalness of the video, while minimizing or reducing user effort and simplifying the recording/editing time for the user.
Based on the disclosed method, the electronic device correlates the context in which the video is captured with the sound types captured as part of the recording. From the correlation, the electronic device determines irrelevant sound sources and completely suppresses them. The electronic device further establishes a correlation between the visual subjects in focus and the sound sources occurring at that time point. The electronic device categorizes the relevant sound sources as primary subject, secondary subject, and/or non-subject sound sources. Further, based on the mode set by the user from several intelligent AI modes or ML modes, the electronic device automatically determines whether to partially or completely suppress the non-subject sound sources. The electronic device further uses the visual and audio information to establish a correlation between the primary and secondary sound sources. Based on the correlation, the electronic device chooses to partially or completely eliminate the secondary sound sources. Further, the electronic device also determines the orientation and movement of a subject in the visual scene and the position of the recording microphone to adaptively tune the subject sound source parameters, so as to maintain a constant volume level from the source.
Referring now to the drawings, and more particularly to FIGS. 1 through 14, where similar reference characters denote corresponding features consistently throughout the figures, there are shown various example embodiments.
FIG. 1 is a block diagram illustrating an example configuration of the electronic device (100), according to various embodiments. The electronic device (100) can be, for example, but not limited to a laptop, a smart phone, a desktop computer, a notebook, a Device-to-Device (D2D) device, a vehicle to everything (V2X) device, a foldable phone, a smart TV, a tablet, an immersive device, a virtual reality (VR) device, a mixed reality device, an augmented reality device, a virtual headset, and an internet of things (IoT) device.
In an embodiment, the electronic device (100) includes a processor (e.g., including processing circuitry) (110), a communicator (e.g., including communication circuitry) (120), a memory (130), a sound source controller (e.g., including various circuitry) (140) and a data driven controller (e.g., including various circuitry) (150). The processor (110) is coupled with the communicator (120), the memory (130), the sound source controller (140) and the data driven controller (150).
The sound source controller (140) determines a context in which a media is captured. The media includes audio signals and visual information. The media can be, for example, but not limited to, a video, multimedia content, an animation, shorts, or the like. Further, the sound source controller (140) determines and classifies a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context. For example, at a music event, the primary sound source can be the ambient music, the secondary sound source can be a guitar sound, and the non-subject sound sources can be dog howling and laughing sounds. In an embodiment, the sound source controller (140) obtains an environmental context (e.g., weather context, location context or the like), scene classification information, a device context (e.g., CPU usage context, application usage context or the like), and a hearing profile. Based on the environmental context, the scene classification information, the device context, and the hearing profile, the sound source controller (140) determines the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source.
In an embodiment, based on the determination and classification, the sound source controller (140) generates an output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model (e.g., AI model, ML model or the like).
In an embodiment, the sound source controller (140) generates the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification.
In an embodiment, at least one of the secondary sound source and the non-subject sound source in the media are completely suppressed by determining a correlation between the primary sound source and at least one of the secondary sound source and the non-subject sound source, and completely suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the determined correlation.
In an embodiment, at least one of the secondary sound source and the non-subject sound source in the media is completely suppressed by identifying at least one of the secondary sound source and the non-subject sound source in the media as irrelevant sound source, and completely suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the identification.
In an embodiment, at least one of the secondary sound source and the non-subject sound source in the media is partially suppressed by determining the correlation between the primary sound source and at least one of the secondary sound source and the non-subject sound source, and partially suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the determined correlation.
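By way of a non-limiting illustration, the following sketch (in Python) shows how such a correlation score could drive the choice between complete suppression, partial suppression, and retention of a non-primary source. The threshold values, the linear attenuation rule, and the function name are assumptions introduced here for illustration only; the disclosure itself leaves this decision to the data driven model.

```python
# Hypothetical sketch: decide full vs. partial suppression of a non-primary
# sound source from its correlation with the primary sound source.
# Thresholds and the linear interpolation rule are illustrative assumptions.

def suppression_gain(correlation: float,
                     full_suppress_below: float = 0.2,
                     keep_above: float = 0.8) -> float:
    """Return a gain in [0, 1] to apply to a secondary/non-subject source.

    correlation -- assumed similarity score in [0, 1] between the primary
                   sound source and the candidate source.
    """
    if correlation <= full_suppress_below:
        return 0.0                      # completely suppress (unrelated to primary)
    if correlation >= keep_above:
        return 1.0                      # fully retain (strongly tied to primary)
    # partially suppress: interpolate between the two thresholds
    return (correlation - full_suppress_below) / (keep_above - full_suppress_below)


# Example: a secondary source moderately correlated with the primary source
print(suppression_gain(0.5))   # -> 0.5, i.e. partial suppression
```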
In an embodiment, the sound source controller (140) generates the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification. In an embodiment, the sound source controller (140) determines a relative orientation of a recording media (e.g., speaker, mic, or the like) and the primary sound source. Further, the sound source controller (140) may adjust a proportion of the secondary sound source and a proportion of the non-subject sound source upon determining a relative orientation of the recording media and the primary sound source. Further, the sound source controller (140) generates the output sound source by adjusting relative orientation of the recording media and the primary sound source, the proportion of the secondary sound source, and the proportion of the non-subject sound source.
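As a further non-limiting illustration, the sketch below assumes that the angle between the primary subject's facing direction and the recording microphone is available (e.g., from the visual scene analysis) and derives a compensation gain for the primary sound source while the other source proportions are left unchanged. The cosine attenuation model, the gain cap, and the function name are assumptions made only for illustration.

```python
import math

# Hypothetical sketch: compensate the primary source level when the subject
# turns away from the recording microphone. The cosine-based attenuation model
# and the gain cap are illustrative assumptions.

def primary_gain(angle_deg: float, max_boost: float = 2.0) -> float:
    """Estimate a compensation gain for the primary source.

    angle_deg -- assumed angle between the subject's facing direction and
                 the microphone (0 = facing the microphone directly).
    """
    attenuation = max(math.cos(math.radians(angle_deg)), 0.1)  # floor avoids huge gains
    return min(1.0 / attenuation, max_boost)

# Example: the subject rotates 60 degrees away from the microphone
print(round(primary_gain(60.0), 2))  # -> 2.0 (capped boost restores the voice level)
```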
Further, the sound source controller (140) detects an event. The event can be, for example, but not limited to: a change in a sound source parameter associated with the media, a change in correlation between a visual subject associated with the media in focus and the sound source occurring at a specified interval, a change in correlation between the primary sound source and the secondary sound source in the media at a specified interval, or a change in an orientation and movement of a subject (e.g., a targeted human, or the like) in a visual scene and a position of a recording media associated with the media.
Further, the sound source controller (140) determines the context of the media based on the detected event. Further, the sound source controller (140) determines a second relevant sound source in the media as a second primary sound source, a second secondary sound source, and a second non-subject sound source based on the determined context.
In an embodiment, the sound source controller (140) generates a second output sound by completely suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification.
In an embodiment, the sound source controller (140) generates the second output sound by partially suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification. In an embodiment, the sound source controller (140) generates the second output sound by automatically adjusting at least one of the second primary sound source, the second secondary sound source, and the second non-subject sound source using the data driven model based on the determination and classification.
Further, the sound source controller (140) selectively monitors each relevant sound source based on at least one of the environmental context, the scene classification information, and the device context.
In an embodiment, the sound source controller (140) identifies at least one subject that is a source of sound in each scene in the media. Further, the sound source controller (140) identifies at least one of the context of each scene, the context of the electronic device (100) from which the media is captured, and the context of the environment of the scene. Further, the sound source controller (140) classifies each subject in each scene as at least one of: the primary subject and at least one non-primary subject based on the identification. Further, the sound source controller (140) determines the relationship between the primary subject and the at least one non-primary subject in each scene based on the classification. Further, the sound source controller (140) combines the sound from the primary subject and the non-primary subject in a pre-defined proportion in response to the determined relationship between the primary subject and the non-primary subject.
In an embodiment, the sound source controller (140) partially or completely eliminates the sound from the non-primary subject upon determining the relationship between the primary subject and the non-primary subject. In an embodiment, the sound source controller (140) determines the relevancy of the non-primary subject with respect to the context based on the data driven model. Further, the sound source controller (140) partially or completely suppresses the sound from the non-primary subjects based on the determination.
In an embodiment, the sound source controller (140) identifies the non-primary subject as irrelevant sound source. Further, the sound source controller (140) completely suppresses the sound from the non-primary subject. In an embodiment, the sound source controller (140) determines at least one of an orientation and a movement of the subject in each scene and position of a recording media to adaptively tune the sound from the subject in the scene.
The sound source controller (140) may, for example, be implemented by analog and/or digital circuits such as logic gates, integrated circuits, microprocessors, microcontrollers, memory circuits, passive electronic components, active electronic components, optical components, hardwired circuits and the like, and may optionally be driven by firmware.
The processor (110) may include one or a plurality of processors. The one or the plurality of processors may be a general-purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU). The processor (110) may include multiple cores and is configured to execute the instructions stored in the memory (130). The processor (110) according to an embodiment of the disclosure may include various processing circuitry and/or multiple processors. For example, as used herein, including the claims, the term "processor" may include various processing circuitry, including at least one processor, wherein one or more of the at least one processor, individually and/or collectively in a distributed manner, may be configured to perform various functions described herein. As used herein, when a "processor", "at least one processor", and "one or more processors" are described as being configured to perform numerous functions, these terms cover situations, for example and without limitation, in which one processor performs some of the recited functions and another processor(s) performs others of the recited functions, and also situations in which a single processor may perform all of the recited functions. Additionally, the at least one processor may include a combination of processors performing various of the recited/disclosed functions, e.g., in a distributed manner. At least one processor may execute program instructions to achieve or perform various functions.
Further, the processor (110) is configured to execute instructions stored in the memory (130) and to perform various processes. The communicator (120) may include various communication circuitry and is configured for communicating internally between internal hardware components and with external devices via one or more networks. The memory (130) also stores instructions to be executed by the processor (110). The memory (130) may include non-volatile storage elements. Examples of such non-volatile storage elements may include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memories (EPROM) or electrically erasable and programmable (EEPROM) memories. In addition, the memory (130) may, in some examples, be considered a non-transitory storage medium. The term "non-transitory" may indicate that the storage medium is not embodied in a carrier wave or a propagated signal. However, the term "non-transitory" should not be interpreted to mean that the memory (130) is non-movable. In certain examples, a non-transitory storage medium may store data that can, over time, change (e.g., in Random Access Memory (RAM) or cache).
In an embodiment, the communicator (120) may include an electronic circuit specific to a standard that enables wired or wireless communication. The communicator (120) is configured to communicate internally between internal hardware components of the electronic device (100) and with external devices via one or more networks.
Further, at least one of the pluralities of modules/controller may be implemented through an Artificial intelligence (AI) model using the data driven controller (150). The data driven controller (150) can be a machine learning (ML) model based controller and AI model based controller. A function associated with the AI model may be performed through the non-volatile memory, the volatile memory, and the processor (110). The processor (110) may include one or a plurality of processors. At this time, one or a plurality of processors may be a general purpose processor, such as a central processing unit (CPU), an application processor (AP), or the like, a graphics-only processing unit such as a graphics processing unit (GPU), a visual processing unit (VPU), and/or an AI-dedicated processor such as a neural processing unit (NPU).
The one or a plurality of processors control the processing of the input data in accordance with a predefined operating rule or AI model stored in the non-volatile memory and the volatile memory. The predefined operating rule or artificial intelligence model is provided through training or learning.
Here, being provided through learning may refer, for example, to a predefined operating rule or AI model of a desired characteristic being made by applying a learning algorithm to a plurality of learning data. The learning may be performed in a device itself in which AI according to an embodiment is performed, and/or may be implemented through a separate server/system.
The AI model may include a plurality of neural network layers. Each layer has a plurality of weight values, and performs a layer operation using the output of a previous layer and its plurality of weights. Examples of neural networks include, but are not limited to, convolutional neural network (CNN), deep neural network (DNN), recurrent neural network (RNN), restricted Boltzmann machine (RBM), deep belief network (DBN), bidirectional recurrent deep neural network (BRDNN), generative adversarial networks (GAN), and deep Q-networks.
The learning algorithm may refer, for example, to a method for training a predetermined target device (for example, a robot) using a plurality of learning data to cause, allow, or control the target device to make a determination or prediction. Examples of learning algorithms include, but are not limited to, supervised learning, unsupervised learning, semi-supervised learning, or reinforcement learning.
Although FIG. 1 shows various hardware components of the electronic device (100), it is to be understood that the disclosure is not limited thereto. In various embodiments, the electronic device (100) may include fewer or more components. Further, the labels or names of the components are used only for illustrative purposes and do not limit the scope of the disclosure. One or more components can be combined together to perform the same or a substantially similar function in the electronic device (100).
FIG. 2 is a diagram illustrating an example environment in which various hardware components of the sound source controller (140) included in the electronic device (100) are depicted, according to various embodiments.
In step 1, the video is provided, wherein the video can be pre-recorded or being recorded in real-time. In step 2, all the sound sources present in the audio stream for a given time span are separated. In step 3, the visual subjects within the given time frame are extracted. In step 4, the environment information/context while the video was recorded, or from the video scene, is determined. In step 5, the information about the visual and acoustic scene generated from the video is determined. In step 6, the information from the device application on which the video is being processed is gathered. In step 7, the relation information about the visual and acoustic subjects is carried and the relation is generated. Embodiments herein disclose an audio visual (AV) subject pair generator (210), wherein the AV subject pair generator (210) can correlate one or more visual subjects with one or more acoustic subjects. The operations of the AV subject pair generator (210) are explained in FIG. 3. In step 8, based on the device context for the video, a weight is assigned to the context elements by the context creator (220). The weight is set by the electronic device (100) or the user of the electronic device (100). The operations of the context creator (220) are explained in FIG. 4. In step 9, a Voice Noise (VN) classifier (230) uses the device context information to categorize each of the context elements as the primary subjects, the secondary subjects, the non-subject subjects, or the irrelevant subjects. In an embodiment, the VN classifier (230) can categorize the acoustic sound sources as the primary subjects, the secondary subjects, the non-subject subjects, and the irrelevant subjects using information received from the AV subject pair generator (210) along with the context information. The operations of the VN classifier (230) are explained in FIG. 6. In step 10, pre-defined AI modes are provided, wherein a selection can be made from the AI modes repository (240), and the selected AI mode helps a VN estimator (not shown) to classify what is voice and what is noise. In step 11, based on the AI modes, the VN mixer (250) proportionately mixes the voice and noise sound sources. The VN mixer (250) can also suitably adjust the intensity of the primary subject source, while keeping all other subject source intensities at a constant level. The operations of the VN mixer (250) are explained in greater detail below with reference to FIG. 7. Further, the VN mixer (250) proportionately mixes the voice and noise sound sources, which are then provided to the video multiplexer (muxer) (260). A minimal sketch of this overall data flow is provided below.
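The following sketch (in Python) illustrates only the ordering of the steps described above; every stage is a placeholder that returns example values similar to the tables below, whereas the actual components (210) to (260) are data driven models and signal-processing blocks.

```python
# Illustrative sketch of the processing order described above (steps 1-11).
# Each stage is a placeholder that fills in example values; the real modules
# (210)-(260) are learned models and signal-processing blocks.

def separate_sources(state):
    state["sources"] = ["Speech1", "Speech2", "Barking", "Wind"]        # step 2
    return state

def extract_visual_subjects(state):
    state["visuals"] = ["Nikita", "Emanuella", "Sam", "Beach"]          # step 3
    return state

def pair_av_subjects(state):                                            # (210), step 7
    state["pairs"] = list(zip(state["visuals"], state["sources"]))
    return state

def create_context(state):                                              # (220), steps 4-6, 8
    state["context"] = {"scene": "Birthday", "app": "WhatsApp", "weather": "Sunny"}
    return state

def classify_voice_noise(state):                                        # (230), step 9
    state["labels"] = {"Speech1": "Primary", "Speech2": "Secondary",
                       "Barking": "Irrelevant", "Wind": "Non-Subject"}
    return state

def mix_voice_noise(state):                                             # (250), steps 10-11
    gains = {"Primary": 0.9, "Secondary": 0.9, "Non-Subject": 0.1, "Irrelevant": 0.0}
    state["mix"] = {src: gains[label] for src, label in state["labels"].items()}
    return state

pipeline = [separate_sources, extract_visual_subjects, pair_av_subjects,
            create_context, classify_voice_noise, mix_voice_noise]

state = {}
for stage in pipeline:
    state = stage(state)
print(state["mix"])   # {'Speech1': 0.9, 'Speech2': 0.9, 'Barking': 0.0, 'Wind': 0.1}
```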
Although FIG. 2 shows various hardware components of the sound source controller (140), it should be understood that the disclosure is not limited thereto. In various embodiments, the sound source controller (140) may include fewer or more components. Further, the labels or names of the components are used only for illustrative purposes and do not limit the scope of the disclosure. One or more components can be combined together to perform the same or a substantially similar function in the sound source controller (140).
FIG. 3 is a diagram illustrating an example scenario (300) in which operations of the AV subject pair generator (210) included in the sound source controller (140) are explained, according to various embodiments.
The AV subject pair generator (210) is responsible for linking the visual subjects to the acoustic signal subjects. The AV subject pair generator (210) uses the pre-linked audio-visual information to link the incoming acoustic and visual subjects. The AV subject pair generator (210) uses corresponding subject characteristics, such as Speaker Embeddings, Gender, Type, etc., to disambiguate similar acoustic subjects. The AV subject pair generator (210) uses a deep learning technique/machine learning technique/generative model technique, which is pre-trained on several audio-visual subjects, to generate the relation. In the example of FIG. 3, Table 3 (AV subject pair generation) is obtained from Table 1 (audio source separation) and Table 2 (visual subject extraction). An illustrative pairing sketch follows Table 3.
Subject | Type | Gender | Speaker Embeddings | Acoustic Activity
Speech1 | Human | Female | Nikita Voice Embedding | Speaking
Speech 2 | Human | Female | Emanuella Voice Embedding | Chewing
Barking | Dog | - | - | Barking
Water Gush | Sea | - | - | Calm
Wind | Beach | - | - | Breezy
Table 1: Audio source separation
Subject | Type | Gender | Name | Activity
Person | Human | Female | Nikita | Smiling & Speaking
Person | Human | Female | Emanuella | Eating
Dog | Animal | - | Sam | Barking
Beach | Sea | - | Dead Sea | Calm
Table 2: Visual subject extraction
Subject | Type | Gender | Name | Acoustic Subject | Activity
Person | Human | Female | Nikita | Speech1 | Smiling while Speaking
Person | Human | Female | Emanuella | Speech2 | Speaking while Eating
Dog | Animal | - | Sam | Barking | Barking
Beach | Sea | - | Dead Sea | Water Gush | Calm
Beach | Sea | - | Dead Sea | Wind | Breezy
Table 3: AV subject pair generation
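By way of a non-limiting illustration, the sketch below links acoustic subjects to visual subjects by comparing embeddings, assuming (purely for illustration) that the voice embeddings and the visual identity embeddings live in a shared space so that cosine similarity is meaningful. The embedding values and names are invented for this example.

```python
import numpy as np

# Hypothetical sketch of linking acoustic subjects to visual subjects by
# comparing embeddings, as in Table 3. The real pair generator (210) uses a
# pre-trained audio-visual model; the vectors below are illustrative only.

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

acoustic_subjects = {
    "Speech1": np.array([0.9, 0.1, 0.0]),    # assumed voice embedding
    "Speech2": np.array([0.1, 0.9, 0.0]),
}
visual_subjects = {
    "Nikita":    np.array([0.8, 0.2, 0.0]),  # assumed visual identity embedding
    "Emanuella": np.array([0.2, 0.8, 0.1]),
}

pairs = {}
for a_name, a_emb in acoustic_subjects.items():
    # pick the visual subject whose embedding is most similar to the voice
    best = max(visual_subjects, key=lambda v: cosine(a_emb, visual_subjects[v]))
    pairs[a_name] = best

print(pairs)  # {'Speech1': 'Nikita', 'Speech2': 'Emanuella'}
```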
FIG. 4 is a diagram illustrating an example scenario (400) in which operations of the context creator (220) included in the sound source controller (140) are explained, according to various embodiments.
The context creator (220) creates a knowledge graph dynamically from the sequence of audio-visual frames. The context creator (220) correlates the audio-visual subjects to the target scene. The context creator (220) uses a pre-trained DL technique/ML technique/generative model technique with information to associate the AV subjects to the scene. The context creator (220) assigns each of the subjects in the AV frame to one or more of the detected scenes. The device context is responsible for determining the usage context of the solution on the target device. In the example of FIG. 4, Table 7 is the output of the context creator, obtained from Table 4 (environmental context), Table 5 (scene classification), and Table 6 (device context). An illustrative sketch follows Table 7.
Subject | Type
Beach | Dead Sea
Weather | Sunny
Boat | Red
Table 4: Environmental context
Subject | Type
Scene | Birthday Celebrations
Table 5: Scene classification
Subject | Type
Channels | 5.1, Stereo, Mono
GPU | Available
App | WhatsApp
Conversation | Sharing video
Table 6: Device context
Subject | Type | Gender | Name | Acoustic Subject | Scene
Person | Human | Female | Nikita | Speech1 | Birthday
Person | Human | Female | Emanuella | Speech2 | Birthday
Dog | Animal | - | - | Barking | Birthday, General
Beach | Sea | - | Dead Sea | Water Gush | General
Beach | Sea | - | Dead Sea | Wind | General
Boat | Commercial | - | Titapic | Boat Horn | General
Table 7: Context creator
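As a simplified, non-limiting illustration, the sketch below merges the environmental context, the scene classification, and the device context (Tables 4 to 6) and tags each AV subject pair with one or more scenes, reproducing the structure of Table 7. The tagging rule is an assumption used only for this example; the actual context creator (220) is a pre-trained model.

```python
# Illustrative stand-in for the context creator (220): merge the environmental,
# scene, and device context, and tag each AV subject pair with its scene(s).

environmental_context = {"Beach": "Dead Sea", "Weather": "Sunny", "Boat": "Red"}
scene_classification = "Birthday Celebrations"
device_context = {"App": "WhatsApp", "Conversation": "Sharing video"}

av_pairs = [
    {"subject": "Nikita",    "type": "Human",      "acoustic": "Speech1"},
    {"subject": "Emanuella", "type": "Human",      "acoustic": "Speech2"},
    {"subject": "Dog",       "type": "Animal",     "acoustic": "Barking"},
    {"subject": "Beach",     "type": "Sea",        "acoustic": "Water Gush"},
    {"subject": "Boat",      "type": "Commercial", "acoustic": "Boat Horn"},
]

context = []
for pair in av_pairs:
    if pair["type"] == "Human":
        scenes = ["Birthday"]                 # participants of the detected event
    elif pair["type"] == "Animal":
        scenes = ["Birthday", "General"]      # present in the event and in general
    else:
        scenes = ["General"]                  # environmental / background subjects
    context.append({**pair, "scenes": scenes,
                    "environment": environmental_context,
                    "device": device_context})

print([(c["acoustic"], c["scenes"]) for c in context])
```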
FIG. 5 is a diagram illustrating an example scenario in which operations of device context generation are explained, according to various embodiments. The device context can gather context information not only from a messenger application, but also from a voice call, a video call, a media sharing application, etc. For example, in the two conversations shown in FIG. 5, the intent is to preserve different sound sources as part of different sharing processes. At 502, John shares his happy moments with Neeta, and at 504, John shares a video segment, which contains the fire alarm sound as evidence, with the fire department.
Subject | Type
Channels | 5.1, Stereo, Mono
GPU | Available
App | WhatsApp
Conversation | Sharing video
Conversation Context | Birthday Celebration/Fire-alarm Malfunction
Table 8: Device context
FIG. 6 is a diagram illustrating an example scenario (600) in which operations of the voice noise classifier (230) included in the sound source controller (140) are explained, according to various embodiments. The voice noise classifier (230) categorizes each acoustic subject as either voice or noise in a specific context. For the voice noise classifier (230), voice may refer, for example, to sound sources which need to be retained, and noise may refer, for example, to sound sources which will be partially or completely eliminated. The voice noise classifier (230) takes the information from the context creator (220) and the AV subject pair generator (210) and generates the classification labels. Table 9 is an example output of the voice noise classifier (230); an illustrative rule-based sketch follows the table.
Subject | Type | Gender | Name | Acoustic Subject | Scene | Voice Noise Classifier
Person | Human | Female | Nikita | Speech1 | Birthday | Primary
Person | Human | Female | Emanuella | Speech2 | Birthday | Secondary
Dog | Animal | - | - | Barking | Birthday, General | Irrelevant
Beach | Sea | - | Dead Sea | Water Gush | General | Non-Subject
Beach | Sea | - | Dead Sea | Wind | General | Non-Subject
Boat | Commercial | - | Titapic | Boat Horn | General | Irrelevant
Table 9: Voice noise classifier
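As a non-limiting illustration, the sketch below uses hand-written rules in place of the data driven VN classifier (230) to show how the subject type and the detected scene can move a sound source between the primary, secondary, non-subject, and irrelevant classes; the rules are chosen only so that the output matches Table 9 for this example.

```python
# Illustrative rule-based stand-in for the voice noise classifier (230).
# The real classifier is data driven; these rules only reproduce Table 9.

def classify(subject, scene, conversation_keep=()):
    if subject["acoustic"] in conversation_keep:
        return "Primary"                       # e.g. a fire alarm kept as evidence
    if subject["type"] == "Human" and scene in subject["scenes"]:
        return "Primary" if subject.get("in_focus") else "Secondary"
    if subject["type"] in ("Sea", "Beach", "Weather"):
        return "Non-Subject"                   # ambience that may be partially kept
    return "Irrelevant"                        # e.g. barking, boat horn

subjects = [
    {"name": "Nikita",    "type": "Human",      "acoustic": "Speech1",    "scenes": ["Birthday"], "in_focus": True},
    {"name": "Emanuella", "type": "Human",      "acoustic": "Speech2",    "scenes": ["Birthday"]},
    {"name": "Sam",       "type": "Animal",     "acoustic": "Barking",    "scenes": ["Birthday", "General"]},
    {"name": "Dead Sea",  "type": "Sea",        "acoustic": "Water Gush", "scenes": ["General"]},
    {"name": "Titapic",   "type": "Commercial", "acoustic": "Boat Horn",  "scenes": ["General"]},
]

for s in subjects:
    print(s["acoustic"], "->", classify(s, "Birthday"))
```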
FIG. 7 is a diagram illustrating an example scenario in which operations of the voice noise mixer (250) included in the sound source controller (140) are explained, according to various embodiments. The voice noise mixer (250) is responsible for taking in one of the AI modes and altering the sound sources in the audio signal proportionately. Several artificial intelligence modes, such as speech enhancement, vlog, visual AI, music AI, etc., are capable of altering the proportion of the sound sources automatically, and dictate how the various sound sources will be altered based on the context. The voice noise mixer (250) takes in the VN classifier (230) results and generates the mixing proportions, as shown in Table 10 below; an illustrative mixing sketch follows the table.
Subject | Type | Gender | Name | Acoustic Subject | Scene | VN Classifier | VN Mixer
Person | Human | Female | Nikita | Speech1 | Birthday | Primary | 90%
Person | Human | Female | Emanuella | Speech2 | Birthday | Secondary | 90%
Dog | Animal | - | - | Barking | Birthday, General | Irrelevant | 0%
Beach | Sea | - | Dead Sea | Water Gush | General | Non-Subject | 10%
Beach | Sea | - | Dead Sea | Wind | General | Non-Subject | 10%
Boat | Commercial | - | Titapic | Boat Horn | General | Irrelevant | 0%
Table 10: Voice noise mixer
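By way of example only, the following sketch illustrates a voice noise mixer that maps classifier labels to mixing proportions for a selected AI mode, as in the table 10. The per-mode gain values are assumptions for illustration.

```python
# Hypothetical sketch of a voice-noise mixer: map classifier labels to mixing
# proportions for a given AI mode (cf. Table 10). The mode tables are illustrative.
MODE_GAINS = {
    "speech_enhancement": {"Primary": 0.90, "Secondary": 0.90, "Non-Subject": 0.10, "Irrelevant": 0.0},
    "vlog":               {"Primary": 0.90, "Secondary": 0.50, "Non-Subject": 0.30, "Irrelevant": 0.0},
}

def mixing_proportions(labels, mode="speech_enhancement"):
    """labels: dict mapping each acoustic subject to its VN classifier label."""
    gains = MODE_GAINS[mode]
    return {acoustic: gains[label] for acoustic, label in labels.items()}

labels = {"Speech1": "Primary", "Speech2": "Secondary", "Barking": "Irrelevant",
          "Water Gush": "Non-Subject", "Wind": "Non-Subject", "Boat Horn": "Irrelevant"}
print(mixing_proportions(labels))   # e.g. {'Speech1': 0.9, 'Speech2': 0.9, 'Barking': 0.0, ...}
```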
FIG. 8 and FIG. 9 are flowcharts (800 and 900) illustrating an example method for handling the sound source in the media, according to various embodiments.
As shown in FIG. 8, the operations (802-810) are handled by the sound source controller (140). At 802, the method includes determining the context in which the media is captured. At 804, the method includes determining and classifying the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source based on the determined context. In an embodiment, at 806, the method includes generating the output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification. In an embodiment, at 808, the method includes generating the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification. In an embodiment, at 810, the method includes generating the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
As shown in FIG. 9, the operations (902-910) are handled by the sound source controller (140). At 902, the method includes identifying at least one subject that is a source of sound in each scene in the media. At 904, the method includes identifying at least one of the context of each scene, the context of the electronic device (100) from which the media is captured and the context of the environment of the scene. At 906, the method includes classifying each subject in each scene as at least one of: a primary subject and at least one non-primary subject based on the identification. At 908, the method includes determining a relationship between the primary subject and the at least one non-primary subject in each scene based on the classification. At 910, the method includes combining a sound from the primary subject and the non-primary subject in pre-defined proportion in response to the determined relationship between the primary subject and the at least one non-primary subject.
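By way of example only, the following sketch ties the operations of FIGS. 8 and 9 together: given separated per-source waveforms and per-source proportions, the output sound is generated by weighted mixing, with a proportion of 0.0 corresponding to complete suppression and an intermediate proportion corresponding to partial suppression. The separation step and all names are assumptions for illustration, not the claimed implementation.

```python
# Hypothetical end-to-end sketch of the flow in FIGS. 8 and 9: classify each
# separated sound source, then generate the output sound by weighted mixing.
import numpy as np

def generate_output(separated, gains, sample_count):
    """separated: dict name -> mono waveform; gains: dict name -> proportion in [0, 1]."""
    out = np.zeros(sample_count, dtype=np.float32)
    for name, wave in separated.items():
        g = gains.get(name, 0.0)            # 0.0 fully suppresses an irrelevant source
        out[: len(wave)] += g * wave[:sample_count]
    return np.clip(out, -1.0, 1.0)          # keep the mix in a valid amplitude range

if __name__ == "__main__":
    sr, n = 16000, 16000
    t = np.arange(n) / sr
    separated = {
        "Speech1": 0.5 * np.sin(2 * np.pi * 220 * t),   # primary: kept at 90%
        "Barking": 0.5 * np.sin(2 * np.pi * 500 * t),   # irrelevant: suppressed
        "Wind":    0.1 * np.random.randn(n),            # non-subject: kept at 10%
    }
    gains = {"Speech1": 0.9, "Barking": 0.0, "Wind": 0.1}
    mixed = generate_output(separated, gains, n)
    print(mixed.shape, float(abs(mixed).max()))
```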
The disclosed method can be used to determine the sound sources present in the audio signal. The disclosed method can be used to classify the sound sources as relevant or irrelevant based on the visual scene and environmental context, and to completely eliminate irrelevant sound sources. The disclosed method can be used to classify the relevant sound sources as primary, secondary, or non-subject sound sources. The disclosed method can be used to partially or completely suppress the non-subject noises with relevance to the context. The disclosed method can be used to dynamically host several AI-based modes for contextual and intelligent handling. Thus, the disclosed method can be used to contextually suppress the acoustic sound sources in the audio signals, so as to improve the user experience. The disclosed method can be used to mix the noise content at different levels intelligently to retain the naturalness of the video while minimizing or reducing user effort or simplifying the recording/editing time for the user.
FIGS. 10A, 10B, 10C, 10D, 11A and FIG. 11B are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on the video, according to various embodiments. Consider that the user has thrown a house party for his close friends. In his party, there are many things happening around such as background music, barking dogs, guests clapping, laughing, giggling, etc. The user navigates through the party to make a video recording of his friends and pets. Embodiments herein can help in prioritizing the sound sources based on the visual focus context.
At 1002, consider that John has just landed a job in a great company and is throwing a house party. He wanted to take videos of the party and share the videos on the social media (e.g., Facebook®, Instagram® or the like). He starts capturing video of the party. Table 11 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
Sound Sources VN Classifier VN Mixer Mic-Geo Pos
Background Music Primary 80% 2 - Ambient
Dog Howling Irrelevant 0%
Guitar Secondary 10 %
Speech Non-Subject 5%
Human Clap & Laugh Non-Subject 5%
At 1004, John then focusses the camera on his friend, who is playing a song on a guitar, as his friends cheer him on. Table 12 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
Sound Sources VN Classifier VN Mixer Mic-Geo Pos
Background Music Non-Subject 0%
Dog Howling Irrelevant 0%
Guitar Primary 90 % 1, 2 - Front
Speech Non-Subject 5%
Human Clap & Laugh Non-Subject 5%
At 1006, then, his pet starts howling to the tune being played by his friend, so, John turns to focus the camera on the dog. Table 13 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
Sound Sources VN Classifier VN Mixer Mic-Geo Pos
Background Music Non-Subject 20%
Dog Howling Primary 70% 1,2 - Front
Guitar Irrelevant 0 %
Speech Non-Subject 10%
Human Clap & Laugh Non-Subject 10%
At 1008, as his friend finishes the song, John moves his camera around the room to show his friends clapping and cheering. Table 14 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
Sound Sources VN Classifier VN Mixer Mic-Geo Pos
Background Music Non-Subject 40%
Dog Howling Irrelevant 0%
Guitar Irrelevant 0 %
Speech Primary 90% 1 - Left
Human Clap & Laugh Secondary 90%
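By way of example only, the following sketch captures the focus-driven re-weighting of the tables 11 to 14: the same set of sound sources receives different classifications and mixing proportions as the visual focus of the camera changes. The profile values mirror the tables above and are illustrative only.

```python
# Hypothetical sketch of the focus-driven re-weighting in Tables 11-14: the same
# sound sources receive different gains as the camera focus changes.
FOCUS_PROFILES = {                      # illustrative values taken from Tables 11-14
    "room":   {"Background Music": 0.8, "Dog Howling": 0.0, "Guitar": 0.1,
               "Speech": 0.05, "Clap & Laugh": 0.05},
    "guitar": {"Background Music": 0.0, "Dog Howling": 0.0, "Guitar": 0.9,
               "Speech": 0.05, "Clap & Laugh": 0.05},
    "dog":    {"Background Music": 0.2, "Dog Howling": 0.7, "Guitar": 0.0,
               "Speech": 0.1,  "Clap & Laugh": 0.1},
    "crowd":  {"Background Music": 0.4, "Dog Howling": 0.0, "Guitar": 0.0,
               "Speech": 0.9,  "Clap & Laugh": 0.9},
}

def gains_for_focus(focus):
    """Return the per-source mixing proportions for the current visual focus."""
    return FOCUS_PROFILES[focus]

for focus in ("room", "guitar", "dog", "crowd"):
    print(focus, gains_for_focus(focus))
```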
FIG. 11A and FIG. 11B are diagrams illustrating examples of corresponding data flow of FIGS. 10A to 10D, wherein the AV subject pair generator (210) determines the visual subjects present in the video. Table 15 depicts the determined visual subjects for the current example (depicted in FIG. 11A and FIG. 11B).
Visual Subject Acoustic Subject
- <Dog Howling Sound>
- Guitar Sound
- Fire Alarm Sound
- Babble Sound
- Laughing Sound
Table -
Food -
- Ambient Music
The context creator (220) can receive the environmental context (as depicted in table 16), the scene classification (as depicted in table 17), and device context (as depicted in table 18), using which the context creator (220) can generate the context.
Environment Attributes
Indoor Home
Guitar Playing
Dog Howling
People Random
Music Hindi Song
Scene Attributes
Party Type Social
Occasion New Job
Context Attributes
Camera Video Recording
Based on the context, the VN classifier (230) can classify the visual and acoustic subjects as primary, secondary, non-subject, or irrelevant in the current scenario. Table 19 depicts an example classification.
Visual Subject Acoustic Subject VN Classification
- Dog Howling Irrelevant
- Guitar Sound Secondary
- Fire Alarm Sound Irrelevant
- Babble Sound Non-Subject
- Laughing Sound Non-Subject
Table - -
Food - -
- Ambient Music Primary
Based on the classification and the relevant AI mode (e.g., visual AI mode or the like), the VN mixer (250) proportionately mixes the voice and noise sound sources, which is then provided to the video multiplexer (muxer (260)). Table 20 depicts an example scenario.
Sound Sources VN Mixer
Background Music 80%
Dog Howling 0%
Guitar 10%
Speech 5%
Human Clap & Laugh 5%
Fire Alarm 0%
FIGS. 12A, 12B, 12C and 12D are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on the video, according to various embodiments. FIGS. 13A, 13B, 13C, 13D and 13E are diagrams illustrating example scenarios, wherein contextual sound source control is being performed on the video, according to various embodiments.
The operations 1202-1208 and 1304-1310 are similar to the operations 1002-1008. For the sake of brevity, the repeated description is not provided here.
Consider that the user (e.g., John) has thrown the house party for his close friends. In his party, there are many things happening around such as background music, barking dogs, guests clapping, laughing, giggling, etc. The user navigates through the party to make the video recording of his friends and pets. Embodiments herein can help in prioritizing the sound sources based on the visual focus context.
Consider that John has just landed the job in the great company and is throwing a house party. He wanted to take videos of the party and share the videos on the social media. He starts capturing video of the party. While recording, the house fire alarm gets turned on momentarily due to a malfunction, and the noise of the fire alarm gets recorded in the video. The user shares the video with his friends sharing the details of the party, with the noise of the fire alarm being considered irrelevant and hence suppressed in this video (as depicted in FIG. 12A to FIG. 12D (hereafter FIG. 12)). Thereafter, the user shares the video to the fire department executive/service person as evidence of the malfunctioning fire alarm noise, with the noise of the party being considered irrelevant and hence suppressed in this video (as depicted in FIG. 13A to FIG. 13E (hereafter FIG. 13)).
As depicted in FIG. 12, table 21 is an example table depicting the various sound sources at the instance (when John has started recording video), their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
Sound Sources VN Classifier VN Mixer
Background Music Primary 80%
Dog Howling Irrelevant 0%
Guitar Secondary 10 %
Speech Non-Subject 5%
Human Clap & Laugh Non-Subject 5%
Fire Alarm Irrelevant 0%
John then focusses the camera on his friend, who is playing a song on the guitar, as his friends cheer him on. Table 22 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
Sound Sources VN Classifier VN Mixer
Background Music Non-Subject 0%
Dog Howling Irrelevant 0%
Guitar Primary 90 %
Speech Non-Subject 5%
Human Clap & Laugh Non-Subject 5%
Fire Alarm Irrelevant 0%
Then, his pet starts howling to the tune being played by his friend; so, John turns to focus the camera on the dog. Table 23 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
Sound Sources VN Classifier VN Mixer
Background Music Non-Subject 20%
Dog Howling Primary 70%
Guitar Irrelevant 0 %
Speech Non-Subject 10%
Human Clap & Laugh Non-Subject 10%
Fire Alarm Irrelevant 0%
The fire alarm gets turned on momentarily at this point in time. As his friend finishes the song, John moves his camera around the room to show his friends clapping and cheering. Table 24 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)), for the video that John is going to share with his friends of the party.
Sound Sources VN Classifier VN Mixer
Background Music Non-Subject 40%
Dog Howling Irrelevant 0%
Guitar Irrelevant 0 %
Speech Primary 90%
Human Clap & Laugh Secondary 90%
Fire Alarm Irrelevant 0%
John shares the video segment with the fire department or a maintenance person, which contains the sound of the fire alarm as an evidence of the malfunctioning fire alarm, wherein this video segment has only the sound of the fire alarm and the sound of the party is suppressed in this video segment (as depicted in FIG. 13). Table 25 is an example table depicting the various sound sources at this instance, their respective classifications (as performed by the VN classifier (230)) and their respective determined relevance (a weightage (in terms of percentage), as determined by the VN mixer (250)).
Sound Sources VN Classifier VN Mixer
Background Music Irrelevant 0%
Dog Howling Irrelevant 0%
Guitar Irrelevant 0%
Speech Irrelevant 0%
Human Clap & Laugh Irrelevant 0%
Fire Alarm Primary 100%
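By way of example only, the following sketch illustrates the recipient-dependent mixing of the tables 24 and 25: the same recorded sources are re-weighted according to the sharing (conversation) context, preserving the party sounds for the friends and preserving only the fire alarm for the fire department. The profile values mirror the tables above and are illustrative only.

```python
# Hypothetical sketch of recipient-dependent mixing (cf. Tables 24 and 25): the same
# recorded sources are re-classified according to the sharing context.
SHARING_PROFILES = {
    "Birthday Celebration":   {"Speech": 0.9, "Clap & Laugh": 0.9, "Background Music": 0.4,
                               "Fire Alarm": 0.0, "Dog Howling": 0.0, "Guitar": 0.0},
    "Fire-alarm Malfunction": {"Speech": 0.0, "Clap & Laugh": 0.0, "Background Music": 0.0,
                               "Fire Alarm": 1.0, "Dog Howling": 0.0, "Guitar": 0.0},
}

def gains_for_sharing(conversation_context):
    """Return per-source mixing proportions for the detected conversation context."""
    return SHARING_PROFILES.get(conversation_context, {})

print(gains_for_sharing("Birthday Celebration"))     # party video shared with friends
print(gains_for_sharing("Fire-alarm Malfunction"))   # evidence video for the fire department
```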
FIG. 14 is a diagram illustrating an example scenario (1400) in which speech clarity is improved based on the subject orientation, according to various embodiments. Consider an example in which the user (e.g., Edward) is reviewing Indian street food. The user speaks spontaneously while alternately looking at the camera and at the food item being prepared. The background noise is almost constant, whereas the user's voice diminishes based on his head orientation.
At 1402, Edward from Spain is reviewing Indian street food. His associate is video recording the review using the camera. At 1404, Edward, while explaining the dish, rotates his head to look at the food item to describe it further. The speech parameters drop when the user rotates his head away from the mic, as indicated in the table. At 1406, Edward turns back towards the camera to speak and explain further about the street food.
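By way of example only, the following sketch illustrates one way the speech level may be compensated as the speaker's head rotates away from the microphone, as in FIG. 14. The cosine attenuation model, the gain cap and the function name are assumptions for illustration, not the claimed method.

```python
# Hypothetical sketch of orientation-aware speech compensation (cf. FIG. 14): as the
# speaker turns away from the microphone, boost the separated speech track so the
# perceived level stays roughly constant. The attenuation model is an assumption.
import math

def compensation_gain(head_angle_deg, max_boost=4.0):
    """head_angle_deg: 0 when facing the microphone, 90 when turned fully away."""
    # Toy model: the captured level falls with the cosine of the head angle.
    captured = max(math.cos(math.radians(min(abs(head_angle_deg), 89.0))), 1e-3)
    return min(1.0 / captured, max_boost)   # cap the boost to avoid pumping artifacts

for angle in (0, 30, 60, 85):
    print(angle, round(compensation_gain(angle), 2))
```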
Embodiments herein are explained with respect to scenarios, wherein the user is capturing video or recorded videos. However, it may be apparent to a person of ordinary skill in the art that embodiments herein may be applicable to any scenario, wherein sound is being captured; such as, but not limited to, a sound recording, a call recording, and so on.
The various actions, acts, blocks, steps, or the like in the flow charts (800 and 900) may be performed in the order presented, in a different order or simultaneously. Further, in various embodiments, some of the actions, acts, blocks, steps, or the like may be omitted, added, modified, skipped, or the like without departing from the scope of the disclosure.
The embodiments disclosed herein can be implemented through at least one software program running on at least one hardware device and performing network management functions to control the elements. The elements can be at least one of a hardware device, or a combination of hardware device and software module.
While the disclosure has been illustrated and described with reference to various example embodiments, it will be understood that the various example embodiments are intended to be illustrative, not limiting. It will be further understood by those skilled in the art that various changes in form and detail may be made without departing from the true spirit and full scope of the disclosure, including the appended claims and their equivalents. It will also be understood that any of the embodiment(s) described herein may be used in conjunction with any other embodiment(s) described herein.

Claims (15)

  1. A method for handling a sound source in a media, comprising:
    determining, by an electronic device, a context in which the media is captured;
    determining and classifying, by the electronic device, a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context; and
    performing, by the electronic device, at least one of:
    generating an output sound by suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model based on the determination and classification,
    generating the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification, and
    generating the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
  2. The method as claimed in claim 1, wherein the method comprises:
    detecting, by the electronic device, at least one event, wherein the at least one event comprises at least one of: a change in a sound source parameter associated with the media, a change in correlation between a visual subject associated with the media in focus to the sound source occurring at a specified interval, a change in correlation between the primary sound source and the secondary sound source in the media at a specified interval, a change in an orientation and movement of a subject in a visual scene and position of a recording media associated with the media;
    determining, by the electronic device, the context of the media based on the at least one detected event;
    determining, by the electronic device, a second relevant sound source in the media as a second primary sound source, a second secondary sound source, and a second non-subject sound source based on the determined context; and
    performing, by the electronic device, at least one of:
    generating a second output sound by completely suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification,
    generating a second output sound by partially suppressing at least one of the second secondary sound source and the second non-subject sound source in the media and optimizing the second primary sound source in the media using the data driven model based on the determination and classification, and
    generating a second output sound by automatically adjusting at least one of the second primary sound source, the second secondary sound source, and the second non-subject sound source using the data driven model based on the determination and classification.
  3. The method as claimed in claim 1, wherein completely suppressing at least one of the secondary sound source and the non-subject sound source in the media comprises:
    determining a correlation between the primary sound source and at least one of the secondary sound source and the non-subject sound source; and
    completely suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the determined correlation.
  4. The method as claimed in claim 1, wherein completely suppressing at least one of the secondary sound source and the non-subject sound source in the media comprises:
    identifying at least one of the secondary sound source and the non-subject sound source in the media as an irrelevant sound source; and
    completely suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the identification.
  5. The method as claimed in claim 1, wherein partially suppressing at least one of the secondary sound source and the non-subject sound source in the media comprises:
    determining a correlation between the primary sound source and at least one of the secondary sound source and the non-subject sound source; and
    partially suppressing at least one of the secondary sound source and the non-subject sound source in the media based on the determined correlation.
  6. The method as claimed in claim 1, wherein determining and categorizing, by the electronic device, the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source comprises:
    obtaining, by the electronic device, at least one of an environmental context, a scene classification information, a device context, and a hearing profile; and
    determining, by the electronic device, the relevant sound source in the media as the primary sound source, the secondary sound source, and the non-subject sound source based on at least one of the environmental context, the scene classification information, the device context, and the hearing profile.
  7. The method as claimed in claim 6, wherein the method comprises:
    selectively monitoring, by the electronic device, on each relevant sound source based on at least one of the environmental context, the scene classification information, and the device context.
  8. The method as claimed in claim 1, wherein generating the output sound source by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model comprises:
    determining a relative orientation of a recording media and the primary sound source;
    adjusting a proportion of the secondary sound source and a proportion of the non-subject sound source upon determining a relative orientation of the recording media and the primary sound source; and
    generating the output sound source by adjusting relative orientation of the recording media and the primary sound source, the proportion of the secondary sound source, and the proportion of the non-subject sound source.
  9. A method for handling a sound source in a media, comprising:
    identifying, by an electronic device, at least one subject comprising a source of sound in each scene in a media;
    identifying, by the electronic device, at least one of a context of each scene, a context of the electronic device from which the media is captured and a context of an environment of the scene;
    classifying, by the electronic device, each subject in each scene as at least one of: a primary subject and at least one non-primary subject based on the identification;
    determining, by the electronic device, a relationship between the primary subject and the at least one non-primary subject in each scene based on the classification; and
    combining, by the electronic device, a sound from the primary subject and the non-primary subject in specified proportion in response to the determined relationship between the primary subject and the at least one non-primary subject.
  10. The method as claimed in claim 9, wherein the method comprises:
    partially or completely eliminating, by the electronic device, the sound from the at least one non-primary subject upon determining the relationship between the primary subject and the at least one non-primary subject.
  11. The method as claimed in claim 9, wherein the method comprises:
    determining, by the electronic device, a relevancy of the at least one non-primary subject with respect to the context based on a data driven model; and
    partially or completely suppressing, by the electronic device, the sound from the at least one non-primary subject based on the determination.
  12. The method as claimed in claim 9, wherein the method comprises:
    identifying, by the electronic device, the at least one non-primary subject as an irrelevant sound source; and
    completely suppressing, by the electronic device, the sound from the at least one non-primary subject.
  13. The method as claimed in claim 9, wherein the method comprises:
    determining, by the electronic device, at least one of an orientation and a movement of the subject in each scene and position of a recording media to adaptively tune the sound from the subject in the scene.
  14. An electronic device, comprising:
    at least one processor comprising processing circuitry;
    a memory; and
    a sound source controller, coupled with the processor and the memory, configured to:
    determine a context in which a media is captured;
    determine and classify a relevant sound source in the media as a primary sound source, a secondary sound source, and a non-subject sound source based on the determined context; and
    perform at least one of:
    generate an output sound by completely suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using a data driven model based on the determination and classification,
    generate the output sound by partially suppressing at least one of the secondary sound source and the non-subject sound source in the media and optimizing the primary sound source in the media using the data driven model based on the determination and classification, and
    generate the output sound by automatically adjusting at least one of the primary sound source, the secondary sound source, and the non-subject sound source using the data driven model based on the determination and classification.
  15. An electronic device, comprising:
    at least one processor comprising processing circuitry;
    a memory; and
    a sound source controller, coupled with the processor and the memory, configured to:
    identify at least one subject that is a source of sound in each scene in a media;
    identify at least one of a context of each scene, a context of the electronic device from which the media is captured and a context of an environment of the scene;
    classify each subject in each scene as at least one of: a primary subject and at least one non-primary subject based on the identification;
    determine a relationship between the primary subject and the at least one non-primary subject in each scene based on the classification; and
    combine a sound from the primary subject and the non-primary subject in specified proportion in response to the determined relationship between the primary subject and the at least one non-primary subject.