Detailed Description
The following detailed description refers to the accompanying drawings. Wherever possible, the same reference numbers are used in the drawings and the following description to refer to the same or like parts. Although several illustrative embodiments are described herein, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the components shown in the figures, and the illustrative methods described herein may be modified by substituting, reordering, removing, or adding steps to the disclosed methods. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples; instead, the proper scope is defined by the appended claims.
FIG. 1A shows a user 100 wearing an apparatus 110 that is physically connected (or integrated) to glasses 130, consistent with the disclosed embodiments. Glasses 130 may be prescription glasses, magnifying glasses, non-prescription glasses, safety glasses, sunglasses, and the like. Additionally, in some embodiments, glasses 130 may include a frame and portions such as an earpiece and a nose piece, with one lens or no lenses. Accordingly, in some embodiments, glasses 130 may serve primarily to support apparatus 110 and/or an augmented reality display device or other optical display device. In some embodiments, apparatus 110 may include an image sensor (not shown in FIG. 1A) for capturing real-time image data of the field of view of user 100. The term "image data" includes any form of data retrieved from optical signals in the near-infrared, infrared, visible, and ultraviolet spectrums. Image data may include video clips and/or photographs.
In some embodiments, apparatus 110 may communicate wirelessly or via a wire with a computing device 120. In some embodiments, computing device 120 may include, for example, a smartphone, a tablet, or a dedicated processing unit, which may be portable (e.g., may be carried in a pocket of user 100). Although shown as an external device in FIG. 1A, in some embodiments, computing device 120 may be provided as part of wearable apparatus 110 or glasses 130, whether integrated therewith or mounted thereon. In some embodiments, computing device 120 may be included in an augmented reality display device or an optical head-mounted display provided integrally with or mounted on glasses 130. In other embodiments, computing device 120 may be provided as part of another wearable or portable apparatus of user 100, including a wristband, a multifunctional watch, a button, a clip-on, and the like. In other embodiments, computing device 120 may be provided as part of another system, such as an onboard automobile computing or navigation system. A person skilled in the art can appreciate that different types of computing devices and arrangements of devices may implement the functionality of the disclosed embodiments. Accordingly, in other implementations, computing device 120 may include a personal computer (PC), a laptop, an Internet server, and the like.
FIG. 1B shows a user 100 wearing an apparatus 110 that is physically connected to a necklace 140, consistent with the disclosed embodiments. This configuration of apparatus 110 may be suitable for users who do not wear glasses some or all of the time. In this embodiment, user 100 can easily put on apparatus 110 and take it off.
FIG. 1C shows a user 100 wearing an apparatus 110 that is physically connected to a belt 150, consistent with the disclosed embodiments. This configuration of apparatus 110 may be designed as a belt buckle. Alternatively, apparatus 110 may include a clip for attaching to various items of clothing, such as belt 150, or a vest, a pocket, a collar, a hat or cap, or another portion of an item of clothing.
FIG. 1D shows a user 100 wearing an apparatus 110 that is physically connected to a wristband 160, consistent with the disclosed embodiments. Although the aiming direction of apparatus 110 according to this embodiment may not match the field of view of user 100, apparatus 110 may include the capability of identifying a hand-related trigger based on tracked eye movements of user 100 indicating that user 100 is looking in the direction of wristband 160. Wristband 160 may also include an accelerometer, a gyroscope, or other sensors for determining the movement or orientation of the hand of user 100 in order to identify the hand-related trigger.
FIG. 2 is a schematic illustration of an exemplary system 200 consistent with the disclosed embodiments, including a wearable apparatus 110 worn by user 100, an optional computing device 120, and/or a server 250 capable of communicating with apparatus 110 via a network 240. In some embodiments, apparatus 110 may capture and analyze image data, identify a hand-related trigger present in the image data, and perform an action and/or provide feedback to user 100 based at least in part on the identification of the hand-related trigger. In some embodiments, optional computing device 120 and/or server 250 may provide additional functionality to enhance the interaction of user 100 with his or her environment, as described in further detail below.
According to the disclosed embodiments, apparatus 110 may include an image sensor system 220 for capturing real-time image data of the field of view of user 100. In some embodiments, apparatus 110 may also include a processing unit 210 for controlling and performing the disclosed functionality of apparatus 110, such as controlling the capture of image data, analyzing the image data, and performing an action and/or outputting feedback based on a hand-related trigger identified in the image data. According to the disclosed embodiments, a hand-related trigger may include a gesture performed by user 100 involving a portion of a hand of user 100. Further, consistent with some embodiments, a hand-related trigger may include a wrist-related trigger. Additionally, in some embodiments, apparatus 110 may include a feedback output unit 230 for producing an output of information to user 100.
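As a purely illustrative sketch (not part of the disclosed embodiments), the capture-analyze-respond loop described above could be organized as follows; all function names (`capture_frame`, `detect_hand_trigger`, `emit_feedback`) and the toy brightness heuristic are hypothetical placeholders, since the description does not specify a recognition algorithm:

```python
# Minimal sketch of one pass of the processing loop of apparatus 110:
# capture image data (image sensor 220), look for a hand-related
# trigger (processing unit 210), and act on it (feedback unit 230).

def detect_hand_trigger(frame):
    """Return a trigger label or None.

    `frame` is a 2-D list of 0-255 pixel intensities. As a stand-in
    for real gesture recognition, treat any bright pixel in the lower
    half of the frame as a raised hand.
    """
    lower_half = frame[len(frame) // 2:]
    if any(pixel > 200 for row in lower_half for pixel in row):
        return "hand_raised"
    return None

def processing_loop(capture_frame, emit_feedback):
    """One iteration: capture, analyze, and provide feedback."""
    frame = capture_frame()
    trigger = detect_hand_trigger(frame)
    if trigger is not None:
        emit_feedback(f"trigger detected: {trigger}")
    return trigger
```

In a real device the loop would run continuously and `emit_feedback` would drive an audible or visible output rather than a string.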
As discussed above, apparatus 110 may include an image sensor 220 for capturing image data. The term "image sensor" refers to a device capable of detecting optical signals in the near-infrared, infrared, visible, and ultraviolet spectrums and converting them into electrical signals. The electrical signals may be used to form an image or a video stream (i.e., image data) based on the detected signals. The term "image data" includes any form of data retrieved from optical signals in the near-infrared, infrared, visible, and ultraviolet spectrums. Examples of image sensors may include semiconductor charge-coupled devices (CCD), active pixel sensors in complementary metal-oxide-semiconductor (CMOS), or N-type metal-oxide-semiconductor (NMOS, Live MOS). In some cases, image sensor 220 may be part of a camera included in apparatus 110.
According to the disclosed embodiments, apparatus 110 may also include a processor 210 for controlling image sensor 220 to capture image data and for analyzing the image data. As discussed in further detail below with respect to FIG. 5A, processor 210 may include a "processing device" for performing logical operations on one or more inputs of image data and other data according to stored or accessible software instructions providing the desired functionality. In some embodiments, processor 210 may also control feedback output unit 230 to provide feedback to user 100 including information based on the analyzed image data and the stored software instructions. As the term is used herein, a "processing device" may access a memory in which executable instructions are stored, or, in some embodiments, a "processing device" may itself include executable instructions (e.g., stored in a memory included in the processing device).
In some embodiments, the information or feedback provided to user 100 may include time information. The time information may include any information related to the current time of day and, as described further below, may be presented in any sensory-perceptible manner. In some embodiments, the time information may include the current time of day in a preconfigured format (e.g., 2:30 pm or 14:30). The time information may include the time in the user's current time zone (e.g., based on a determined location of user 100), as well as an indication of the time zone and/or the time of day in another desired location. In some embodiments, the time information may include a number of hours or minutes relative to one or more predetermined times of day. For example, in some embodiments, the time information may include an indication that three hours and fifteen minutes remain until a particular hour (e.g., until 6:00 pm) or some other predetermined time. The time information may also include a duration elapsed since the beginning of a particular activity, such as the start of a meeting, the start of a jog, or any other activity. In some embodiments, the activity may be determined based on the analyzed image data. In other embodiments, the time information may also include additional information related to the current time and one or more other routines, periods, or scheduled events.
For example, as discussed in further detail below, the time information may include an indication of the number of minutes remaining until the next scheduled event, which may be determined from a calendar function or other information retrieved from computing device 120 or server 250.
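As an illustrative sketch only, the two kinds of time information just described (the current time of day in a preconfigured format, and the minutes remaining until the next scheduled event) could be computed as follows; the function names and the representation of the calendar source as a plain list of timestamps are assumptions, since the description leaves the calendar interface to computing device 120 or server 250:

```python
from datetime import datetime

def format_time_info(now):
    """Current time of day in a preconfigured 12-hour format,
    e.g. '2:30 PM'."""
    return now.strftime("%I:%M %p").lstrip("0")

def minutes_until_next_event(now, scheduled_events):
    """Whole minutes from `now` until the next scheduled event,
    or None if no future event exists.

    `scheduled_events` stands in for entries retrieved from a
    calendar function on computing device 120 or server 250.
    """
    future = [t for t in scheduled_events if t > now]
    if not future:
        return None
    return int((min(future) - now).total_seconds() // 60)
```

The result could then be rendered audibly or visually by feedback output unit 230.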
Feedback output unit 230 may include one or more feedback systems for providing an output of information to user 100. In the disclosed embodiments, audible or visible feedback may be provided via any type of connected audible or visible system, or both. Information feedback according to the disclosed embodiments may include audible feedback to user 100 (e.g., using speakers connected via Bluetooth or another wired or wireless connection, or bone conduction headphones). Feedback output unit 230 of some embodiments may additionally or alternatively produce a visible output of information to user 100, for example, as part of an augmented reality display projected onto a lens of glasses 130 or provided via a separate heads-up display in communication with apparatus 110, such as a display 260 provided as part of computing device 120, which may include an onboard automobile heads-up display, an augmented reality device, a virtual reality device, a smartphone, a PC, a tablet, or the like.
The term "computing device" refers to a device including a processing unit and having computing capabilities. Some examples of computing device 120 include a PC, a laptop, a tablet, or another computing system such as an onboard computing system of an automobile, for example, each configured to communicate directly with apparatus 110 or with server 250 over network 240. Another example of computing device 120 includes a smartphone having a display 260. In some embodiments, computing device 120 may be a computing system configured particularly for apparatus 110, and may be provided integrally with apparatus 110 or connected to apparatus 110. Apparatus 110 may also connect to computing device 120 over network 240 via any known wireless standard (e.g., Wi-Fi, Bluetooth, etc.), as well as near-field capacitive coupling and other short-range wireless techniques, or via a wired connection. In an embodiment in which computing device 120 is a smartphone, computing device 120 may have a dedicated application installed therein. For example, user 100 may view, on display 260, data (e.g., images, video clips, extracted information, feedback information, etc.) that originates from or is triggered by apparatus 110. In addition, user 100 may select part of the data for storage in server 250.
Network 240 may be a shared, public, or private network, may encompass a wide area or a local area, and may be implemented through any suitable combination of wired and/or wireless communication networks. Network 240 may further comprise an intranet or the Internet. In some embodiments, network 240 may include a short-range or near-field wireless communication system for enabling communication between apparatus 110 and computing device 120 provided in close proximity to each other (e.g., on or near the person of the user). Apparatus 110 may autonomously establish a connection to network 240, for example, using a wireless module (e.g., Wi-Fi, cellular). In some embodiments, apparatus 110 may use the wireless module when connected to an external power source, in order to prolong battery life. Further, communication between apparatus 110 and server 250 may be accomplished through any suitable communication channels, such as a telephone network, an extranet, an intranet, the Internet, satellite communications, off-line communications, wireless communications, transponder communications, a local area network (LAN), a wide area network (WAN), and a virtual private network (VPN).
As shown in FIG. 2, apparatus 110 may transmit data to and receive data from server 250 via network 240. In the disclosed embodiments, the data received from server 250 and/or computing device 120 may include many different types of information based on the analyzed image data, including information related to a commercial product or a person's identity, an identified landmark, and any other information capable of being stored in or accessed by server 250. In some embodiments, data may be received and transmitted via computing device 120. Server 250 and/or computing device 120 may retrieve information from different data sources (e.g., a user-specific database, a user's social network account or other accounts, the Internet, and other managed or accessible databases), and may provide information related to the analyzed image data and a recognized trigger to apparatus 110, according to the disclosed embodiments. In some embodiments, calendar-related information retrieved from the different data sources may be analyzed to provide certain time information, or a time-based context, for providing certain information based on the analyzed image data.
An example of wearable apparatus 110 incorporated with glasses 130 according to some embodiments (as discussed in connection with FIG. 1A) is shown in greater detail in FIG. 3A. In some embodiments, apparatus 110 may be associated with a structure (not shown in FIG. 3A) that enables easy detaching and reattaching of apparatus 110 to glasses 130. In some embodiments, when apparatus 110 is attached to glasses 130, image sensor 220 acquires a set aiming direction without the need for directional calibration. The set aiming direction of image sensor 220 may substantially coincide with the field of view of user 100. For example, a camera associated with image sensor 220 may be installed within apparatus 110 in a slightly downward position at a predetermined angle (e.g., 5-15 degrees from the horizon). Accordingly, the set aiming direction of image sensor 220 may substantially match the field of view of user 100.
FIG. 3B is an exploded view of the components of the embodiment discussed regarding FIG. 3A. Attaching apparatus 110 to glasses 130 may take place in the following way. Initially, a bracket 310 may be mounted on glasses 130 using screws 320 on the side of bracket 310. Then, apparatus 110 may be clipped onto bracket 310 such that it is aligned with the field of view of user 100. The term "bracket" includes any device or structure that enables detaching and reattaching of a device including a camera to a pair of glasses or to another object (e.g., a helmet). Bracket 310 may be made from plastic (e.g., polycarbonate), metal (e.g., aluminum), or a combination of plastic and metal (e.g., carbon fiber graphite). Bracket 310 may be mounted on any kind of glasses (e.g., eyeglasses, sunglasses, 3D glasses, safety glasses, etc.) using screws, bolts, snaps, or any fastening means used in the art.
In some embodiments, bracket 310 may include a quick-release mechanism for disengaging and re-engaging apparatus 110. For example, bracket 310 and apparatus 110 may include magnetic elements. As an alternative example, bracket 310 may include a male latch member and apparatus 110 may include a female receptacle. In other embodiments, bracket 310 may be an integral part of a pair of glasses, or may be sold separately and installed by an optometrist. For example, bracket 310 may be configured for mounting on the arm of glasses 130 near the front of the frame, but before the hinge. Alternatively, bracket 310 may be configured for mounting on the bridge of glasses 130.
In some embodiments, apparatus 110 may be provided as part of a glasses frame 130, with or without lenses. Additionally, in some embodiments, apparatus 110 may be configured to provide an augmented reality display projected onto a lens of glasses 130 (if provided), or alternatively, may include a display for projecting time information, for example, according to the disclosed embodiments. Apparatus 110 may include the additional display or, alternatively, may be in communication with a separately provided display system that may or may not be attached to glasses 130.
In some embodiments, apparatus 110 may be implemented in a form other than wearable glasses, as described above, for example, with respect to FIGS. 1B-1D. FIG. 4A is a schematic illustration of an example of an additional embodiment of apparatus 110 from a front viewpoint of apparatus 110. Apparatus 110 includes an image sensor 220, a clip (not shown), a function button (not shown), and a hanging ring 410 for attaching apparatus 110 to, for example, necklace 140, as shown in FIG. 1B. When apparatus 110 hangs on necklace 140, the aiming direction of image sensor 220 may not fully coincide with the field of view of user 100, but the aiming direction would still correlate with the field of view of user 100.
FIG. 4B is a schematic illustration of an example of a second embodiment of apparatus 110 from a side orientation of apparatus 110. In addition to hanging ring 410, apparatus 110 may include a clip 420, as shown in FIG. 4B. User 100 may use clip 420 to attach apparatus 110 to a shirt or belt 150, as illustrated in FIG. 1C. Clip 420 may provide an easy mechanism for disengaging and re-engaging apparatus 110 from different articles of clothing. In other embodiments, apparatus 110 may include a female receptacle for connecting with a male latch of a car mount or a universal stand.
In some embodiments, apparatus 110 includes a function button 430 for enabling user 100 to provide input to apparatus 110. Function button 430 may accept different types of tactile input (e.g., a tap, a click, a double-click, a long press, a right-to-left slide, a left-to-right slide). In some embodiments, each type of input may be associated with a different action. For example, a tap may be associated with the function of taking a picture, while a right-to-left slide may be associated with the function of recording a video.
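As an illustrative sketch, the association between tactile input types on function button 430 and actions could be implemented as a simple lookup table; the two bindings mirror the examples given in the text, while the action identifiers themselves and the `double_tap` entry are hypothetical:

```python
# Hypothetical mapping of tactile inputs on function button 430 to
# actions. Only the first two bindings come from the text; the rest
# of the vocabulary is illustrative.
INPUT_ACTIONS = {
    "tap": "take_picture",            # per the text: tap -> take a picture
    "right_to_left_slide": "record_video",  # per the text: slide -> record video
    "double_click": "toggle_feedback",      # illustrative extra binding
}

def handle_input(input_type):
    """Return the action for a tactile input, or None if unbound."""
    return INPUT_ACTIONS.get(input_type)
```

A dispatch table like this keeps the input vocabulary reconfigurable, which fits the statement that each type of input "may be associated with" a different action.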
As shown in FIG. 4C, apparatus 110 may be attached to an article of clothing (e.g., a shirt, a belt, pants, etc.) of user 100 at an edge of the clothing using a clip 431. For example, the body of apparatus 110 may reside adjacent to the inside surface of the clothing, with clip 431 engaging the outside surface of the clothing. In such an embodiment, as shown in FIG. 4C, image sensor 220 (e.g., a camera for visible light) may protrude beyond the edge of the clothing. Alternatively, clip 431 may engage the inside surface of the clothing, with the body of apparatus 110 residing adjacent to the outside of the clothing. In various embodiments, the clothing may be positioned between clip 431 and the body of apparatus 110.
An example embodiment of apparatus 110 is shown in FIG. 4D. Apparatus 110 includes clip 431, which may include points (e.g., 432A and 432B) in close proximity to a front surface 434 of a body 435 of apparatus 110. In an example embodiment, the distance between points 432A, 432B and front surface 434 may be less than a typical thickness of a fabric of the clothing of user 100. For example, the distance between points 432A, 432B and surface 434 may be less than a thickness of a t-shirt, e.g., less than 1 millimeter, less than 2 millimeters, less than 3 millimeters, etc., or, in some cases, points 432A, 432B of clip 431 may touch surface 434. In various embodiments, clip 431 may include a point 433 that does not touch surface 434, allowing the clothing to be inserted between clip 431 and surface 434.
FIG. 4D schematically shows different views of apparatus 110, defined as a front view (F view), a rear view (R view), a top view (T view), a side view (S view), and a bottom view (B view). These views will be referred to when describing apparatus 110 in subsequent figures. FIG. 4D shows an example embodiment in which clip 431 is positioned on the same side of apparatus 110 as sensor 220 (e.g., the front side of apparatus 110). Alternatively, clip 431 may be positioned on an opposite side of apparatus 110 (e.g., the rear side of apparatus 110) relative to sensor 220. In various embodiments, apparatus 110 may include function button 430, as shown in FIG. 4D.
Various views of apparatus 110 are illustrated in FIGS. 4E through 4K. For example, FIG. 4E shows a view of apparatus 110 with an electrical connection 441. Electrical connection 441 may be, for example, a USB port that may be used to transfer data to and from apparatus 110 and to provide electrical power to apparatus 110. In an example embodiment, connection 441 may be used to charge a battery 442, schematically shown in FIG. 4E. FIG. 4F shows an F view of apparatus 110, including sensor 220 and one or more microphones 443. In some embodiments, apparatus 110 may include several microphones 443 facing outward, wherein microphones 443 are configured to obtain environmental sounds and the sounds of various speakers communicating with user 100. FIG. 4G shows an R view of apparatus 110. In some embodiments, a microphone 444 may be positioned at the rear side of apparatus 110, as shown in FIG. 4G. Microphone 444 may be used to detect an audio signal from user 100. It should be noted that apparatus 110 may have microphones placed at any side of apparatus 110 (e.g., a front side, a rear side, a left side, a right side, a top side, or a bottom side). In various embodiments, some microphones may be at a first side (e.g., microphones 443 may be at the front of apparatus 110) and other microphones may be at a second side (e.g., microphone 444 may be at the rear side of apparatus 110).
FIGS. 4H and 4I show different sides of apparatus 110 (i.e., S views of apparatus 110) consistent with the disclosed embodiments. For example, FIG. 4H shows the location of sensor 220 and an example shape of clip 431. FIG. 4J shows a T view of apparatus 110, including function button 430, and FIG. 4K shows a B view of apparatus 110 with electrical connection 441.
The example embodiments discussed above with respect to FIGS. 3A, 3B, 4A, and 4B are not limiting. In some embodiments, apparatus 110 may be implemented in any suitable configuration for performing the disclosed methods. For example, referring back to FIG. 2, the disclosed embodiments may implement an apparatus 110 according to any configuration including an image sensor 220 and a processor unit 210 to perform image analysis and to communicate with a feedback unit 230.
FIG. 5A is a block diagram illustrating the components of apparatus 110 according to an example embodiment. As shown in FIG. 5A, and as similarly discussed above, apparatus 110 includes an image sensor 220, a memory 550, a processor 210, a feedback output unit 230, a wireless transceiver 530, and a mobile power source 520. In other embodiments, apparatus 110 may also include buttons, other sensors such as a microphone, and inertial measurement devices such as accelerometers, gyroscopes, magnetometers, temperature sensors, color sensors, light sensors, etc. Apparatus 110 may further include a data port 570 and a power connection 510 with a suitable interface for connecting with an external power source or an external device (not shown).
Processor 210, depicted in FIG. 5A, may include any suitable processing device. The term "processing device" includes any physical device having an electric circuit that performs a logic operation on an input. For example, a processing device may include one or more integrated circuits, microchips, microcontrollers, microprocessors, all or part of a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a field-programmable gate array (FPGA), or other circuits suitable for executing instructions or performing logic operations. The instructions executed by the processing device may, for example, be pre-loaded into a memory integrated with or embedded into the processing device, or may be stored in a separate memory (e.g., memory 550). Memory 550 may comprise a random access memory (RAM), a read-only memory (ROM), a hard disk, an optical disk, a magnetic medium, a flash memory, other permanent, fixed, or volatile memory, or any other mechanism capable of storing instructions.
尽管在图5A所示的实施例中,装置110包括一个处理设备(例如,处理器210),但装置110可以包括一个以上的处理设备。每个处理设备可以具有相似的结构,或者处理设备可以具有彼此电连接或断开的不同结构。例如,处理设备可以是单独的电路或集成在单个电路中。当使用一个以上的处理设备时,处理设备可以被配置为独立地或协作地操作。处理设备可以电地、磁地、光学地、声学地、机械地或通过允许它们相互作用的其他方式耦合。Although in the embodiment shown in FIG. 5A, apparatus 110 includes one processing device (eg, processor 210), apparatus 110 may include more than one processing device. Each processing device may have a similar structure, or the processing devices may have different structures that are electrically connected or disconnected from each other. For example, the processing device may be a separate circuit or integrated in a single circuit. When more than one processing device is used, the processing devices may be configured to operate independently or cooperatively. The processing devices may be coupled electrically, magnetically, optically, acoustically, mechanically, or by other means that allow them to interact.
在一些实施例中,处理器210可以处理从用户100的环境捕捉的多个图像,以确定与捕捉后续图像有关的不同参数。例如,处理器210可以基于从捕捉的图像数据导出的信息来确定以下至少一个的值:图像分辨率、压缩率、裁剪参数、帧速率、焦点、曝光时间、光圈大小和光敏度。所确定的值可以用于捕捉至少一个后续图像。另外,处理器210可以检测包括用户环境中的至少一个手相关触发的图像,并经由反馈输出单元230执行动作和/或向用户提供信息输出。In some embodiments, processor 210 may process multiple images captured from the environment of user 100 to determine different parameters related to capturing subsequent images. For example, based on information derived from captured image data, processor 210 may determine a value for at least one of the following: image resolution, compression ratio, cropping parameters, frame rate, focus, exposure time, aperture size, and light sensitivity. The determined value may then be used to capture at least one subsequent image. Additionally, processor 210 may detect images that include at least one hand-related trigger in the user's environment, and, via feedback output unit 230, perform an action and/or provide an information output to the user.
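The parameter-determination behaviour described above can be sketched as follows. This is an illustrative example only, not part of the disclosure: the function name, the thresholds, and the returned fields are all assumptions.

```python
# Hypothetical sketch: deriving capture parameters for subsequent images
# from already-captured image data, as a processor such as processor 210 might.
# All names and threshold values below are invented for illustration.

def choose_capture_params(mean_brightness, available_storage_mb):
    """Pick exposure time and compression ratio for the next capture.

    mean_brightness: average pixel value of the last frame, 0-255.
    available_storage_mb: free space remaining for new image data.
    """
    # Darker scenes get a longer exposure (example values only).
    exposure_ms = 30 if mean_brightness < 80 else 10
    # Low remaining storage forces heavier compression of subsequent images.
    compression_ratio = 0.3 if available_storage_mb < 100 else 0.8
    return {"exposure_ms": exposure_ms, "compression_ratio": compression_ratio}
```

A dim scene with little free storage would thus yield a long exposure and a small compression ratio, while a bright scene with ample storage yields the opposite.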
在另一实施例中,处理器210可以改变图像传感器220的瞄准方向。例如,当装置110附接有夹子420时,图像传感器220的瞄准方向可能与用户100的视场不一致。处理器210可以从分析的图像数据中识别某些情况,并调整图像传感器220的瞄准方向以捕捉相关图像数据。例如,在一个实施例中,处理器210可以检测与另一个体的交互,并且感测到该个体没有完全在视野中,因为图像传感器220向下倾斜。响应于此,处理器210可以调整图像传感器220的瞄准方向以捕捉个体的图像数据。还设想了其他场景,其中处理器210可以识别调整图像传感器220的瞄准方向的需要。In another embodiment, processor 210 may change the aiming direction of image sensor 220. For example, when apparatus 110 is attached using clip 420, the aiming direction of image sensor 220 may not coincide with the field of view of user 100. Processor 210 may identify certain situations from the analyzed image data and adjust the aiming direction of image sensor 220 to capture relevant image data. For example, in one embodiment, processor 210 may detect an interaction with another individual and sense that the individual is not fully in view because image sensor 220 is tilted down. In response, processor 210 may adjust the aiming direction of image sensor 220 to capture image data of the individual. Other scenarios are also contemplated in which processor 210 may identify a need to adjust the aiming direction of image sensor 220.
在一些实施例中,处理器210可以将数据传送到反馈输出单元230,反馈输出单元230可以包括被配置为向用户100提供信息的任何设备。反馈输出单元230可以作为装置110的一部分来提供(如所示),或者可以被提供在装置110的外部并通信地耦合到装置110。反馈输出单元230可以被配置为基于从处理器210接收的信号输出可视或非可视反馈,诸如当处理器210识别出所分析的图像数据中的手相关触发时。In some embodiments, processor 210 may communicate data to feedback output unit 230 , which may include any device configured to provide information to user 100 . Feedback output unit 230 may be provided as part of device 110 (as shown), or may be provided external to and communicatively coupled to device 110 . Feedback output unit 230 may be configured to output visual or non-visual feedback based on signals received from processor 210, such as when processor 210 identifies a hand-related trigger in the analyzed image data.
术语“反馈”是指响应于处理环境中的至少一个图像而提供的任何输出或信息。在一些实施例中,如上面类似地描述的,反馈可以包括时间信息的可听或可见指示、检测到的文本或数字、货币价值、品牌产品、人的身份、地标的身份或包括十字路口处的街道名称或交通灯的颜色等的其他环境情况或条件,以及与这些信息中的每一个相关联的其他信息。例如,在一些实施例中,反馈可以包括关于完成交易仍然需要的货币量的附加信息、关于识别出的人的信息、历史信息或检测到的地标的时间和入场价格等。在一些实施例中,反馈可以包括可听音调、触觉响应和/或用户100先前记录的信息。反馈输出单元230可以包括用于输出声学和触觉反馈的适当组件。例如,反馈输出单元230可以包括音频耳机、助听器类型设备、扬声器、骨传导耳机、提供触觉线索的接口、振动触觉刺激器等。在一些实施例中,处理器210可以经由无线收发器530、有线连接或某个其他通信接口与外部反馈输出单元230通信信号。在一些实施例中,反馈输出单元230还可以包括用于向用户100可视地显示信息的任何合适的显示设备。The term "feedback" refers to any output or information provided in response to processing at least one image of the environment. In some embodiments, as similarly described above, feedback may include an audible or visible indication of time information, detected text or numerals, monetary values, branded products, a person's identity, the identity of a landmark, or other environmental situations or conditions, including a street name at an intersection or the color of a traffic light, etc., as well as other information associated with each of these. For example, in some embodiments, feedback may include additional information about the amount of money still needed to complete a transaction, information about an identified person, historical information, or the hours and admission price of a detected landmark, and the like. In some embodiments, feedback may include an audible tone, a tactile response, and/or information previously recorded by user 100. Feedback output unit 230 may include appropriate components for outputting acoustic and tactile feedback. For example, feedback output unit 230 may include audio headphones, a hearing aid type device, a speaker, bone conduction headphones, an interface that provides tactile cues, a vibrotactile stimulator, etc. In some embodiments, processor 210 may communicate signals with an external feedback output unit 230 via wireless transceiver 530, a wired connection, or some other communication interface. In some embodiments, feedback output unit 230 may also include any suitable display device for visually displaying information to user 100.
如图5A所示,装置110包括存储器550。存储器550可以包括处理器210可访问的用于执行所公开的方法的一组或多组指令集,包括用于识别图像数据中的手相关触发的指令。在一些实施例中,存储器550可以存储从用户100的环境捕捉的图像数据(例如,图像、视频)。此外,存储器550可以存储特定于用户100的信息,诸如已知个体的图像表示、喜爱的产品、个体物品、以及日历或约会信息等。在一些实施例中,处理器210可以基于存储器550中的可用存储空间来确定例如要存储哪种类型的图像数据。在另一实施例中,处理器210可以从存储在存储器550中的图像数据中提取信息。As shown in FIG. 5A , device 110 includes memory 550 . Memory 550 may include one or more sets of instructions accessible to processor 210 for performing the disclosed methods, including instructions for identifying hand-related triggers in image data. In some embodiments, memory 550 may store image data (eg, images, videos) captured from the environment of user 100 . In addition, memory 550 may store information specific to user 100, such as image representations of known individuals, favorite products, individual items, and calendar or appointment information, among others. In some embodiments, the processor 210 may determine, for example, which type of image data to store based on available storage space in the memory 550 . In another embodiment, the processor 210 may extract information from image data stored in the memory 550 .
如图5A进一步所示,装置110包括移动电源520。术语“移动电源”包括能够提供电力的任何设备,其可以容易地用手携带(例如,移动电源520可以重量小于一磅)。电源的移动性使得用户100能够在各种情况下使用装置110。在一些实施例中,移动电源520可以包括一个或多个电池(例如,镍镉电池、镍金属氢化物电池和锂离子电池)或任何其他类型的电源。在其他实施例中,移动电源520可以是可充电的并且包含在容纳装置110的外壳内。在其他实施例中,移动电源520可以包括一个或多个用于将环境能量转换为电能的能量收集设备(例如,便携式太阳能单元、人体振动单元等)。As further shown in FIG. 5A, apparatus 110 includes power bank 520. The term "power bank" includes any device capable of providing electrical power that can be easily carried by hand (eg, power bank 520 may weigh less than a pound). The mobility of the power source enables user 100 to use apparatus 110 in a variety of situations. In some embodiments, power bank 520 may include one or more batteries (eg, nickel-cadmium batteries, nickel-metal hydride batteries, and lithium-ion batteries) or any other type of power source. In other embodiments, power bank 520 may be rechargeable and contained within a casing that houses apparatus 110. In yet other embodiments, power bank 520 may include one or more energy harvesting devices for converting ambient energy into electrical energy (eg, portable solar units, human body vibration units, etc.).
移动电源520可以为一个或多个无线收发器(例如,图5A中的无线收发器530)供电。术语“无线收发器”是指被配置为通过使用射频、红外频率、磁场或电场在空中接口上交换传输的任何设备。无线收发器530可以使用任何已知标准来发送和/或接收数据(例如,WiFi、蓝牙、蓝牙智能、802.15.4或ZigBee)。在一些实施例中,无线收发器530可以将数据(例如,原始图像数据、经处理的图像数据、提取的信息)从装置110发送到计算设备120和/或服务器250。无线收发器530还可以从计算设备120和/或服务器250接收数据。在其他实施例中,无线收发器530可以将数据和指令发送到外部反馈输出单元230。Power bank 520 may power one or more wireless transceivers (eg, wireless transceiver 530 in FIG. 5A). The term "wireless transceiver" refers to any device configured to exchange transmissions over an air interface using radio, infrared, magnetic, or electric fields. Wireless transceiver 530 may transmit and/or receive data using any known standard (eg, WiFi, Bluetooth, Bluetooth Smart, 802.15.4, or ZigBee). In some embodiments, wireless transceiver 530 may transmit data (eg, raw image data, processed image data, extracted information) from apparatus 110 to computing device 120 and/or server 250. Wireless transceiver 530 may also receive data from computing device 120 and/or server 250. In other embodiments, wireless transceiver 530 may send data and instructions to an external feedback output unit 230.
图5B是示出根据另一示例实施例的装置110的组件的框图。在一些实施例中,装置110包括第一图像传感器220a、第二图像传感器220b、存储器550、第一处理器210a、第二处理器210b、反馈输出单元230、无线收发器530、移动电源520和电源连接器510。在图5B所示的布置中,每个图像传感器可以提供不同图像分辨率的图像,或者面向不同方向的图像。可替代地,每个图像传感器可以与不同的相机(例如,广角相机、窄角相机、IR相机等)相关联。在一些实施例中,装置110可以基于各种因素来选择使用哪个图像传感器。例如,处理器210a可以基于存储器550中的可用存储空间来确定以某个分辨率来捕捉后续图像。FIG. 5B is a block diagram illustrating components of apparatus 110 according to another example embodiment. In some embodiments, apparatus 110 includes a first image sensor 220a, a second image sensor 220b, a memory 550, a first processor 210a, a second processor 210b, a feedback output unit 230, a wireless transceiver 530, a power bank 520, and a power connector 510. In the arrangement shown in FIG. 5B, each image sensor may provide images of a different image resolution, or images facing in a different direction. Alternatively, each image sensor may be associated with a different camera (eg, a wide-angle camera, a narrow-angle camera, an IR camera, etc.). In some embodiments, apparatus 110 may select which image sensor to use based on various factors. For example, processor 210a may determine, based on the available storage space in memory 550, to capture subsequent images at a certain resolution.
装置110可以在第一处理模式和第二处理模式下操作,使得第一处理模式可以比第二处理模式消耗更少的功率。例如,在第一处理模式下,装置110可以捕捉图像并处理所捕捉的图像,以例如基于识别手相关触发来做出实时决策。在第二处理模式下,装置110可以从存储器550中存储的图像中提取信息,并从存储器550中删除图像。在一些实施例中,移动电源520可以在第一处理模式下提供超过十五小时的处理,在第二处理模式下提供大约三小时的处理。因此,不同的处理模式可以允许移动电源520在不同的时间段(例如,超过两小时、超过四小时、超过十小时等)产生足够的电力来为装置110供电。The apparatus 110 may operate in a first processing mode and a second processing mode such that the first processing mode may consume less power than the second processing mode. For example, in the first processing mode, the device 110 may capture images and process the captured images to make real-time decisions, eg, based on recognizing hand-related triggers. In the second processing mode, apparatus 110 may extract information from images stored in memory 550 and delete images from memory 550 . In some embodiments, the power bank 520 can provide more than fifteen hours of processing in the first processing mode and approximately three hours of processing in the second processing mode. Thus, different processing modes may allow power bank 520 to generate enough power to power device 110 for different time periods (eg, over two hours, over four hours, over ten hours, etc.).
在一些实施例中,当由移动电源520供电时,装置110可以在第一处理模式中使用第一处理器210a,当由可经由电源连接器510连接的外部电源580供电时,可以在第二处理模式中使用第二处理器210b。在其他实施例中,装置110可以基于预定义条件来确定使用哪些处理器或哪些处理模式。即使当装置110不由外部电源580供电时,装置110也可以在第二处理模式下操作。例如,如果存储器550中用于存储新图像数据的可用存储空间低于预定义阈值,则装置110可以确定当装置110不由外部电源580供电时,装置110应该在第二处理模式下操作。In some embodiments, apparatus 110 may use first processor 210a in the first processing mode when powered by power bank 520, and may use second processor 210b in the second processing mode when powered by external power source 580, which may be connected via power connector 510. In other embodiments, apparatus 110 may determine, based on predefined conditions, which processors or which processing modes to use. Apparatus 110 may operate in the second processing mode even when apparatus 110 is not powered by external power source 580. For example, if the available storage space in memory 550 for storing new image data is below a predefined threshold, apparatus 110 may determine that it should operate in the second processing mode when it is not powered by external power source 580.
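The mode-selection behaviour described in the two paragraphs above can be summarized in a short sketch. The threshold value, the mode labels, and the function name below are hypothetical, not taken from the disclosure:

```python
# Illustrative sketch of the processing-mode selection described above:
# the first (low-power, real-time) mode on the power bank, the second
# (extract-and-delete) mode on external power, with a low-storage condition
# that can force the second mode even on battery. Values are assumptions.

LOW_STORAGE_THRESHOLD_MB = 50  # assumed "predefined threshold"

def select_processing_mode(on_external_power, free_storage_mb):
    if on_external_power:
        return "second"  # full processing, e.g. on second processor 210b
    if free_storage_mb < LOW_STORAGE_THRESHOLD_MB:
        return "second"  # extract information and delete images to free space
    return "first"       # low-power real-time mode, e.g. on processor 210a
```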
尽管在图5B中描绘了一个无线收发器,但装置110可以包括一个以上的无线收发器(例如,两个无线收发器)。在具有一个以上无线收发器的装置中,每个无线收发器可以使用不同的标准来发送和/或接收数据。在一些实施例中,第一无线收发器可以使用蜂窝标准(例如,LTE或GSM)与服务器250或计算设备120通信,第二无线收发器可以使用短程标准(例如,WiFi或蓝牙)与服务器250或计算设备120通信。在一些实施例中,当可穿戴装置由包括在可穿戴装置中的移动电源供电时,装置110可以使用第一无线收发器,而当可穿戴装置由外部电源供电时,装置110可以使用第二无线收发器。Although one wireless transceiver is depicted in FIG. 5B, apparatus 110 may include more than one wireless transceiver (eg, two wireless transceivers). In an arrangement with more than one wireless transceiver, each wireless transceiver may use a different standard to transmit and/or receive data. In some embodiments, a first wireless transceiver may communicate with server 250 or computing device 120 using a cellular standard (eg, LTE or GSM), and a second wireless transceiver may communicate with server 250 or computing device 120 using a short-range standard (eg, WiFi or Bluetooth). In some embodiments, apparatus 110 may use the first wireless transceiver when the wearable apparatus is powered by a power bank included in the wearable apparatus, and may use the second wireless transceiver when the wearable apparatus is powered by an external power source.
图5C是示出根据另一示例实施例的包括计算设备120的装置110的组件的框图。在本实施例中,装置110包括图像传感器220、存储器550a、第一处理器210、反馈输出单元230、无线收发器530a、移动电源520和电源连接器510。如图5C进一步所示,计算设备120包括处理器540、反馈输出单元545、存储器550b、无线收发器530b和显示器260。计算设备120的一个示例是其中安装有专用应用程序的智能手机或平板电脑。在其他实施例中,计算设备120可以包括诸如车载汽车计算系统、PC、膝上型计算机以及符合所公开实施例的任何其他系统的任何配置。在该示例中,用户100可以响应于显示器260上的手相关触发的识别而查看反馈输出。另外,用户100可以在显示器260上查看其他数据(例如,图像、视频剪辑、对象信息、时间表信息、提取的信息等)。另外,用户100可以经由计算设备120与服务器250通信。5C is a block diagram illustrating components of apparatus 110 including computing device 120, according to another example embodiment. In this embodiment, the device 110 includes an image sensor 220 , a memory 550 a , a first processor 210 , a feedback output unit 230 , a wireless transceiver 530 a , a mobile power source 520 and a power connector 510 . As further shown in FIG. 5C , computing device 120 includes processor 540 , feedback output unit 545 , memory 550b , wireless transceiver 530b , and display 260 . An example of computing device 120 is a smartphone or tablet with dedicated applications installed therein. In other embodiments, computing device 120 may include any configuration such as an in-vehicle automotive computing system, a PC, a laptop computer, and any other system consistent with the disclosed embodiments. In this example, user 100 may view feedback output in response to identification of hand-related triggers on display 260 . Additionally, user 100 may view other data (eg, images, video clips, object information, schedule information, extracted information, etc.) on display 260 . Additionally, user 100 may communicate with server 250 via computing device 120 .
在一些实施例中,处理器210和处理器540被配置为从捕捉的图像数据中提取信息。术语“提取信息”包括通过本领域普通技术人员已知的任何手段在捕捉的图像数据中识别与对象、个体、位置、事件等相关联的信息的任何过程。在一些实施例中,装置110可以使用提取的信息向反馈输出单元230或向计算设备120发送反馈或其他实时指示。在一些实施例中,处理器210可以在图像数据中识别站在用户100前面的个体,并向计算设备120发送该个体的姓名和用户100最后遇到该个体的时间。在另一实施例中,处理器210可以在图像数据中识别一个或多个可见触发,包括手相关触发,并确定该触发是否与可穿戴装置的用户以外的人相关联,以选择性地确定是否执行与该触发相关联的动作。一个这样的动作可以是经由作为装置110的一部分(或与装置110通信)提供的反馈输出单元230或经由作为计算设备120的一部分提供的反馈单元545向用户100提供反馈。例如,反馈输出单元545可以与显示器260通信以使显示器260可见地输出信息。在一些实施例中,处理器210可以在图像数据中识别手相关触发并向计算设备120发送该触发的指示。然后,处理器540可以处理接收到的触发信息,并基于手相关触发经由反馈输出单元545或显示器260提供输出。在其他实施例中,处理器540可以基于从设备110接收的图像数据来确定手相关触发并提供类似于上述的适当反馈。在一些实施例中,处理器540可以基于识别出的手相关触发来向设备110提供指令或其他信息(诸如环境信息)。In some embodiments, processor 210 and processor 540 are configured to extract information from captured image data. The term "extracting information" includes any process by which information associated with objects, individuals, locations, events, etc. is identified in the captured image data by any means known to those of ordinary skill in the art. In some embodiments, apparatus 110 may use the extracted information to send feedback or another real-time indication to feedback output unit 230 or to computing device 120. In some embodiments, processor 210 may identify in the image data an individual standing in front of user 100, and send computing device 120 the individual's name and the last time user 100 encountered the individual. In another embodiment, processor 210 may identify one or more visible triggers in the image data, including a hand-related trigger, and determine whether the trigger is associated with a person other than the user of the wearable apparatus, to selectively determine whether to perform an action associated with the trigger. One such action may be to provide feedback to user 100 via feedback output unit 230 provided as part of (or in communication with) apparatus 110, or via feedback unit 545 provided as part of computing device 120. For example, feedback output unit 545 may communicate with display 260 to cause display 260 to visibly output information. In some embodiments, processor 210 may identify a hand-related trigger in the image data and send an indication of the trigger to computing device 120. Processor 540 may then process the received trigger information and provide output via feedback output unit 545 or display 260 based on the hand-related trigger. In other embodiments, processor 540 may determine a hand-related trigger based on image data received from apparatus 110 and provide appropriate feedback similar to that described above. In some embodiments, processor 540 may provide apparatus 110 with instructions or other information (such as environmental information) based on an identified hand-related trigger.
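A hedged sketch of the trigger-handling decision described above: a hand-related trigger is identified in the analyzed image data and the associated action is performed only when the trigger is associated with the wearer. The function names, the trigger label, and the action mapping are illustrative assumptions:

```python
# Hypothetical sketch only. `trigger` is a label a processor such as 210 or 540
# might assign to a detected hand-related trigger; `belongs_to_user` is the
# result of the selective determination described above.

def handle_trigger(trigger, belongs_to_user, actions):
    """Perform the action mapped to `trigger` only if the wearer produced it."""
    if belongs_to_user and trigger in actions:
        return actions[trigger]()
    return None

acts = {"pointing": lambda: "announce_object"}
result = handle_trigger("pointing", True, acts)
```

When the trigger comes from a person other than the user, or is unrecognized, no action is performed.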
在一些实施例中,处理器210可以识别所分析图像中的其他环境信息,诸如站在用户100前面的个体,并向计算设备120发送与所分析信息有关的信息,例如个体的姓名和用户100最后遇到该个体的时间。在不同的实施例中,处理器540可以从捕捉的图像数据中提取统计信息并将统计信息转发到服务器250。例如,关于用户购买的项目类型或用户光顾特定商家的频率等的某些信息可由处理器540来确定。基于该信息,服务器250可以向计算设备120发送与用户的偏好相关联的优惠券和折扣。In some embodiments, processor 210 may identify other environmental information in the analyzed images, such as an individual standing in front of user 100, and send computing device 120 information related to the analyzed information, such as the individual's name and the last time user 100 encountered the individual. In a different embodiment, processor 540 may extract statistical information from the captured image data and forward the statistical information to server 250. For example, certain information regarding the types of items a user purchases, or how often a user visits a particular merchant, may be determined by processor 540. Based on this information, server 250 may send computing device 120 coupons and discounts associated with the user's preferences.
当装置110连接或无线连接到计算设备120时,装置110可以发送存储在存储器550a中的图像数据的至少一部分以存储在存储器550b中。在一些实施例中,在计算设备120确认传送该部分图像数据成功之后,处理器540可以删除该部分图像数据。术语“删除”意味着图像被标记为“已删除”,并且可以代替它存储其他图像数据,但不一定意味着图像数据被物理地从存储器中删除。When apparatus 110 is connected or wirelessly connected to computing device 120, apparatus 110 may transmit at least a portion of the image data stored in memory 550a for storage in memory 550b. In some embodiments, the processor 540 may delete the portion of the image data after the computing device 120 confirms that the portion of the image data was successfully transmitted. The term "deleted" means that the image is marked as "deleted" and other image data may be stored in its place, but does not necessarily mean that the image data is physically removed from memory.
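The "delete" semantics just described, in which an image is only marked deleted so that its slot can be reused, can be sketched as follows. The class, method names, and the transfer callback are assumptions for illustration, not part of the disclosure:

```python
# Toy sketch of mark-as-deleted storage: after the receiving device confirms a
# successful transfer, the image is flagged deleted so other image data can be
# stored in its place, but the bytes are not necessarily physically erased.

class ImageStore:
    def __init__(self):
        self._slots = {}  # image_id -> [data, deleted_flag]

    def save(self, image_id, data):
        self._slots[image_id] = [data, False]

    def mark_deleted(self, image_id):
        # Data remains in place until new image data overwrites the slot.
        self._slots[image_id][1] = True

    def reusable_slots(self):
        return [i for i, (_, deleted) in self._slots.items() if deleted]

def transfer_and_mark(store, image_id, send):
    """Send one image; mark it deleted only after the receiver confirms."""
    data, deleted = store._slots[image_id]
    if not deleted and send(data):
        store.mark_deleted(image_id)
        return True
    return False
```

If the confirmation never arrives, the image stays undeleted and can be retransmitted later.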
如受益于本公开的本领域技术人员将理解的,可以对所公开的实施例进行很多变化和/或修改。并非所有组件对于装置110的操作都是必要的。任何组件可以位于任何适当的装置中,并且组件可以被重新布置成各种配置,同时提供所公开的实施例的功能。例如,在一些实施例中,装置110可以包括相机、处理器和用于向另一设备发送数据的无线收发器。因此,前述配置是示例,并且无论上面讨论的配置如何,装置110都可以捕捉、存储和/或处理图像。Many variations and/or modifications of the disclosed embodiments are possible, as will be understood by those skilled in the art having the benefit of this disclosure. Not all components are necessary for the operation of device 110 . Any of the components may be located in any suitable apparatus and the components may be rearranged into various configurations while providing the functionality of the disclosed embodiments. For example, in some embodiments, apparatus 110 may include a camera, a processor, and a wireless transceiver for transmitting data to another device. Accordingly, the foregoing configurations are examples and apparatus 110 may capture, store, and/or process images regardless of the configurations discussed above.
此外,前面和以下的描述涉及存储和/或处理图像或图像数据。在本文公开的实施例中,存储和/或处理的图像或图像数据可以包括由图像传感器220捕捉的一个或多个图像的表示。如本文所使用的术语,图像(或图像数据)的“表示”可以包括整个图像或图像的一部分。图像(或图像数据)的表示可以具有与图像(或图像数据)相同的分辨率或更低的分辨率,和/或图像(或图像数据)的表示可以在一些方面被改变(例如,被压缩、具有更低的分辨率、具有被改变的一种或多种颜色等)。Furthermore, the preceding and following descriptions relate to storing and/or processing images or image data. In embodiments disclosed herein, stored and/or processed images or image data may include representations of one or more images captured by image sensor 220 . As the term is used herein, a "representation" of an image (or image data) may include an entire image or a portion of an image. The representation of the image (or image data) may have the same resolution as the image (or image data) or a lower resolution, and/or the representation of the image (or image data) may be altered in some way (eg, compressed , have a lower resolution, have one or more colors changed, etc.).
例如,装置110可以捕捉图像并存储被压缩为JPG文件的图像的表示。作为另一示例,装置110可以捕捉彩色图像,但存储彩色图像的黑白表示。作为又一示例,装置110可以捕捉图像并存储图像的不同表示(例如,图像的一部分)。例如,装置110可以存储图像的一部分,该部分包括出现在图像中的人的脸,但基本上不包括围绕该人的环境。类似地,装置110例如可以存储图像的一部分,该部分包括出现在图像中的产品,但基本上不包括围绕该产品的环境。作为又一示例,装置110可以以降低的分辨率(即,以比捕捉的图像的分辨率低的分辨率)存储图像的表示。存储图像的表示可以允许装置110节省存储器550中的存储空间。此外,处理图像的表示可以允许装置110提高处理效率和/或帮助维持电池寿命。For example, device 110 may capture an image and store a representation of the image compressed as a JPG file. As another example, device 110 may capture a color image, but store a black and white representation of the color image. As yet another example, device 110 may capture an image and store different representations of the image (eg, a portion of the image). For example, device 110 may store a portion of an image that includes the face of a person appearing in the image, but does not substantially include the environment surrounding the person. Similarly, device 110 may, for example, store a portion of an image that includes the product that appears in the image, but does not substantially include the environment surrounding the product. As yet another example, the apparatus 110 may store the representation of the image at a reduced resolution (ie, at a lower resolution than the captured image). Storing the representation of the image may allow device 110 to conserve storage space in memory 550 . Additionally, processing the representation of the image may allow device 110 to increase processing efficiency and/or help maintain battery life.
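One concrete way to produce a reduced-resolution "representation" of a captured image is block averaging. This is only an illustration under assumed conventions (a grayscale frame as a nested list of pixel values); the apparatus may instead store a JPG, a crop, or a black-and-white version, as described above:

```python
# Illustrative sketch: halve the resolution of a grayscale image by averaging
# each 2x2 block of pixels. Storing the smaller result conserves memory 550
# and reduces later processing cost, as the paragraph above describes.

def downsample_2x2(pixels):
    """Return a half-resolution representation of a grayscale image."""
    out = []
    for r in range(0, len(pixels) - 1, 2):
        row = []
        for c in range(0, len(pixels[0]) - 1, 2):
            block_sum = (pixels[r][c] + pixels[r][c + 1] +
                         pixels[r + 1][c] + pixels[r + 1][c + 1])
            row.append(block_sum // 4)  # average of the 2x2 block
        out.append(row)
    return out
```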
除了上述之外,在一些实施例中,装置110或计算设备120中的任何一个可以经由处理器210或处理器540来进一步处理所捕捉的图像数据以提供识别所捕捉的图像数据中的对象和/或手势和/或其他信息的附加功能。在一些实施例中,可以基于识别出的对象、手势或其他信息来采取动作。在一些实施例中,处理器210或处理器540可以在图像数据中识别一个或多个可见触发,包括手相关触发,并确定该触发是否与用户以外的人相关联,以确定是否执行与该触发相关联的动作。In addition to the above, in some embodiments, any of apparatus 110 or computing device 120 may further process the captured image data via processor 210 or processor 540 to provide for identifying objects in the captured image data and /or additional functions of gestures and/or other information. In some embodiments, actions may be taken based on recognized objects, gestures, or other information. In some embodiments, processor 210 or processor 540 may identify one or more visible triggers in the image data, including hand-related triggers, and determine whether the trigger is associated with a person other than the user, to determine whether to perform a Trigger the associated action.
本公开的一些实施例可以包括可固定到用户的衣物上的装置。这种装置可以包括可由连接器连接的两个部分。一种捕捉单元可以被设计成穿戴在用户衣服的外面,并且可以包括用于捕捉用户环境的图像的图像传感器。捕捉单元可以连接到或可连接到供电单元,供电单元可以被配置为容纳电源和处理设备。捕捉单元可以是包括相机或用于捕捉图像的其他设备的小型设备。捕捉单元可以被设计成不显眼且不引人注目的,并且可以被配置成与被用户衣服隐藏的供电单元通信。供电单元可以包括系统的较大方面,诸如收发器天线、至少一个电池、处理设备等。在一些实施例中,捕捉单元和供电单元之间的通信可以通过包括在连接器中的数据电缆提供,而在其他实施例中,捕捉单元和供电单元之间的通信可以无线地实现。一些实施例可以允许改变捕捉单元的图像传感器的朝向,例如以更好地捕捉感兴趣的图像。Some embodiments of the present disclosure may include devices that may be secured to a user's clothing. Such a device may comprise two parts connectable by a connector. A capture unit may be designed to be worn on the outside of a user's clothing and may include an image sensor for capturing images of the user's environment. The capture unit may be connected or connectable to a power supply unit, which may be configured to house a power supply and processing equipment. The capture unit may be a small device including a camera or other device for capturing images. The capture unit may be designed to be unobtrusive and unobtrusive, and may be configured to communicate with a power supply unit hidden by the user's clothing. The power supply unit may include larger aspects of the system, such as transceiver antennas, at least one battery, processing devices, and the like. In some embodiments, the communication between the capture unit and the power supply unit may be provided through a data cable included in the connector, while in other embodiments, the communication between the capture unit and the power supply unit may be achieved wirelessly. Some embodiments may allow changing the orientation of the image sensor of the capture unit, eg, to better capture the image of interest.
图6示出了包含符合本公开的软件模块的存储器的示例性实施例。存储器550中包括朝向识别模块601、朝向调整模块602和运动跟踪模块603。模块601、602、603可以包含用于由包括在可穿戴装置中的至少一个处理设备(例如处理器210)执行的软件指令。朝向识别模块601、朝向调整模块602和运动跟踪模块603可以协作以为并入无线装置110的捕捉单元提供朝向调整。6 illustrates an exemplary embodiment of a memory containing software modules consistent with the present disclosure. The memory 550 includes an orientation recognition module 601 , an orientation adjustment module 602 and a motion tracking module 603 . Modules 601, 602, 603 may contain software instructions for execution by at least one processing device (eg, processor 210) included in the wearable device. Orientation identification module 601 , orientation adjustment module 602 , and motion tracking module 603 may cooperate to provide orientation adjustment for a capture unit incorporated into wireless device 110 .
图7示出包括朝向调整单元705的示例性捕捉单元710。朝向调整单元705可以被配置为允许图像传感器220的调整。如图7所示,朝向调整单元705可以包括眼球型调整机件。在替代实施例中,朝向调整单元705可以包括万向架、可调节杆、可枢转安装件以及用于调整图像传感器220的朝向的任何其他合适单元。FIG. 7 shows an exemplary capture unit 710 including an orientation adjustment unit 705 . The orientation adjustment unit 705 may be configured to allow adjustment of the image sensor 220 . As shown in FIG. 7 , the orientation adjustment unit 705 may include an eyeball type adjustment mechanism. In alternative embodiments, the orientation adjustment unit 705 may include a gimbal, an adjustable rod, a pivotable mount, and any other suitable unit for adjusting the orientation of the image sensor 220 .
图像传感器220可以被配置成以使得图像传感器220的瞄准方向基本上与用户100的视场重合的方式来随用户100的头部移动。例如,如上所述,根据捕捉单元710的预期位置,与图像传感器220相关联的相机可以在稍微朝上或朝下的位置以预定角度被安装在捕捉单元710内。因此,图像传感器220的设定瞄准方向可以匹配用户100的视场。在一些实施例中,处理器210可以使用从图像传感器220提供的图像数据来改变图像传感器220的朝向。例如,处理器210可以识别用户正在阅读书籍,并且确定图像传感器220的瞄准方向偏离文本。也就是,由于在文本的每一行的开始处的词语没有完全在视野范围内,处理器210可以确定图像传感器220在错误的方向上倾斜。响应于此,处理器210可以调整图像传感器220的瞄准方向。The image sensor 220 may be configured to move with the head of the user 100 in a manner such that the aiming direction of the image sensor 220 substantially coincides with the field of view of the user 100 . For example, as described above, a camera associated with the image sensor 220 may be mounted within the capture unit 710 at a predetermined angle in a slightly upward or downward facing position, depending on the intended position of the capture unit 710 . Therefore, the set aiming direction of the image sensor 220 may match the field of view of the user 100 . In some embodiments, processor 210 may use image data provided from image sensor 220 to change the orientation of image sensor 220 . For example, the processor 210 may recognize that the user is reading a book and determine that the aiming direction of the image sensor 220 is offset from the text. That is, the processor 210 may determine that the image sensor 220 is tilted in the wrong direction because the words at the beginning of each line of text are not completely within the field of view. In response to this, the processor 210 may adjust the aiming direction of the image sensor 220 .
朝向识别模块601可以被配置为识别捕捉单元710的图像传感器220的朝向。例如,图像传感器220的朝向可以通过分析由捕捉单元710的图像传感器220捕捉的图像、通过捕捉单元710内的倾斜或姿态感测设备、以及通过测量朝向调整单元705相对于捕捉单元710的其余部分的相对方向来识别。Orientation identification module 601 may be configured to identify the orientation of image sensor 220 of capture unit 710. For example, the orientation of image sensor 220 may be identified by analyzing images captured by image sensor 220 of capture unit 710, by a tilt or attitude sensing device within capture unit 710, or by measuring the relative orientation of orientation adjustment unit 705 with respect to the remainder of capture unit 710.
朝向调整模块602可以被配置为调整捕捉单元710的图像传感器220的朝向。如上所讨论的,图像传感器220可以被安装在配置为用于移动的朝向调整单元705上。朝向调整单元705可以被配置为响应于来自朝向调整模块602的命令进行旋转和/或横向移动。在一些实施例中,朝向调整单元705可以经由马达、电磁铁、永久磁铁和/或其任何适当组合来调整图像传感器220的朝向。The orientation adjustment module 602 may be configured to adjust the orientation of the image sensor 220 of the capture unit 710 . As discussed above, the image sensor 220 may be mounted on the orientation adjustment unit 705 configured for movement. The orientation adjustment unit 705 may be configured to rotate and/or laterally move in response to commands from the orientation adjustment module 602 . In some embodiments, the orientation adjustment unit 705 may adjust the orientation of the image sensor 220 via a motor, an electromagnet, a permanent magnet, and/or any suitable combination thereof.
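The cooperation between orientation identification (module 601) and orientation adjustment (module 602) can be sketched as a simple feedback step: estimate how far the target lies from the image center, then command the adjustment unit to move part of the way toward it. The angle convention and the proportional step factor are assumptions, not from the disclosure:

```python
# Hypothetical sketch of modules 601/602 cooperating. A real implementation
# would estimate the offset from image analysis or a tilt sensor; here the
# offset is passed in directly for illustration.

def identify_tilt(offset_of_target_deg):
    """Module 601 (sketch): report how far the target lies from image center."""
    return offset_of_target_deg

def adjust_orientation(current_angle_deg, tilt_deg, step=0.5):
    """Module 602 (sketch): move a fraction of the measured error per command."""
    return current_angle_deg + step * tilt_deg

angle = 0.0
tilt = identify_tilt(10.0)          # target appears 10 degrees above center
angle = adjust_orientation(angle, tilt)
```

Repeating this loop converges the sensor toward the target, which is one plausible reason the adjustment unit accepts incremental rotate/translate commands rather than absolute positions.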
In some embodiments, a monitoring module 603 may be provided for continuous monitoring. Such continuous monitoring may include tracking movement of at least a portion of an object included in one or more images captured by the image sensor. For example, in one embodiment, apparatus 110 may track an object as long as the object remains substantially within the field of view of image sensor 220. In additional embodiments, monitoring module 603 may engage orientation adjustment module 602 to instruct orientation adjustment unit 705 to continuously orient image sensor 220 toward an object of interest. For example, in one embodiment, monitoring module 603 may cause image sensor 220 to adjust its orientation to ensure that a particular designated object, such as the face of a particular person, remains within the field of view of image sensor 220 even as the designated object moves about. In another embodiment, monitoring module 603 may continuously monitor a region of interest included in one or more images captured by the image sensor. For example, a user may be occupied with a particular task, such as typing on a laptop computer, while image sensor 220 remains oriented in a particular direction and continuously monitors a portion of each image from a series of images to detect a trigger or other event. For example, image sensor 220 may be oriented toward a piece of laboratory equipment, and monitoring module 603 may be configured to monitor a status light on the laboratory equipment for a status change while the user's attention is otherwise occupied.
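The status-light monitoring example can be sketched as a simple brightness check over a fixed region of interest across successive frames. The frame representation (a 2-D list of grayscale values) and the change threshold are illustrative assumptions, not details from the disclosure.

```python
def roi_mean_brightness(frame, roi):
    """frame: 2-D list of grayscale pixel values; roi: (top, left, height, width)."""
    top, left, h, w = roi
    total = sum(frame[r][c]
                for r in range(top, top + h)
                for c in range(left, left + w))
    return total / (h * w)


def detect_status_change(frames, roi, threshold=40.0):
    """Return the index of each frame whose ROI brightness jumps relative to
    the previous frame -- a stand-in for a status light turning on or off."""
    events = []
    prev = None
    for i, frame in enumerate(frames):
        level = roi_mean_brightness(frame, roi)
        if prev is not None and abs(level - prev) > threshold:
            events.append(i)
        prev = level
    return events


# Two synthetic 4x4 frames: the 2x2 ROI at (0, 0) goes dark -> bright.
dark = [[10] * 4 for _ in range(4)]
bright = [[10] * 4 for _ in range(4)]
for r in range(2):
    for c in range(2):
        bright[r][c] = 200
print(detect_status_change([dark, dark, bright], roi=(0, 0, 2, 2)))  # [2]
```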
In some embodiments consistent with the present disclosure, capture unit 710 may include a plurality of image sensors 220. The plurality of image sensors 220 may each be configured to capture different image data. For example, when a plurality of image sensors 220 are provided, the image sensors 220 may capture images with different resolutions, may capture wider or narrower fields of view, and may have different levels of magnification. Image sensors 220 may be provided with different lenses to allow for these different configurations. In some embodiments, the plurality of image sensors 220 may include image sensors 220 having different orientations; thus, each of the plurality of image sensors 220 may be pointed in a different direction to capture different images. In some embodiments, the fields of view of image sensors 220 may overlap. The plurality of image sensors 220 may be configured for orientation adjustment, e.g., by pairing each with an orientation adjustment unit 705. In some embodiments, monitoring module 603, or another module associated with memory 550, may be configured to individually adjust the orientations of the plurality of image sensors 220 and to turn each of the plurality of image sensors 220 on or off as needed. In some embodiments, monitoring an object or person captured by an image sensor 220 may include tracking the movement of that object across the fields of view of the plurality of image sensors 220.
Embodiments consistent with the present disclosure may include a connector configured to connect a capture unit and a power supply unit of a wearable apparatus. A capture unit consistent with the present disclosure may include at least one image sensor configured to capture images of a user's environment. A power supply unit consistent with the present disclosure may be configured to house a power source and/or at least one processing device. A connector consistent with the present disclosure may be configured to connect the capture unit and the power supply unit, and may be configured to secure the apparatus to an article of clothing such that the capture unit is positioned over an outer surface of the clothing and the power supply unit is positioned under an inner surface of the clothing. Exemplary embodiments of capture units, connectors, and power supply units consistent with the present disclosure are discussed in further detail with respect to FIGS. 8-14.
FIG. 8 is a schematic illustration of an embodiment of wearable apparatus 110 securable to clothing consistent with the present disclosure. As shown in FIG. 8, capture unit 710 and power supply unit 720 may be connected by a connector 730 such that capture unit 710 is positioned on one side of a garment 750 and power supply unit 720 is positioned on the opposite side of garment 750. In some embodiments, capture unit 710 may be positioned over an outer surface of garment 750 and power supply unit 720 may be positioned under an inner surface of garment 750. Power supply unit 720 may be configured to be placed against the user's skin.
Capture unit 710 may include image sensor 220 and orientation adjustment unit 705 (as shown in FIG. 7). Power supply unit 720 may include mobile power source 520 and processor 210. Power supply unit 720 may further include any combination of the previously discussed elements that may be part of wearable apparatus 110, including, but not limited to, wireless transceiver 530, feedback output unit 230, memory 550, and data port 570.
Connector 730 may include a clip 715 or other mechanical connection designed to clip or attach capture unit 710 and power supply unit 720 to garment 750, as shown in FIG. 8. As illustrated, clip 715 may connect to each of capture unit 710 and power supply unit 720 at their perimeters and may wrap around an edge of garment 750 to secure capture unit 710 and power supply unit 720 in place. Connector 730 may further include a power cable 760 and a data cable 770. Power cable 760 may be capable of conveying power from mobile power source 520 to image sensor 220 of capture unit 710. Power cable 760 may also be configured to provide power to any other elements of capture unit 710 (e.g., orientation adjustment unit 705). Data cable 770 may be capable of conveying captured image data from image sensor 220 in capture unit 710 to processor 800 in power supply unit 720. Data cable 770 may further be capable of conveying additional data between capture unit 710 and processor 800, e.g., control instructions for orientation adjustment unit 705.
FIG. 9 is a schematic illustration of user 100 wearing wearable apparatus 110 consistent with an embodiment of the present disclosure. As shown in FIG. 9, capture unit 710 is located on an outer surface of clothing 750 of user 100. Capture unit 710 is connected to power supply unit 720 (not visible in this illustration) via connector 730, which wraps around an edge of clothing 750.
In some embodiments, connector 730 may include a flexible printed circuit board (PCB). FIG. 10 illustrates an exemplary embodiment in which connector 730 includes a flexible printed circuit board 765. Flexible printed circuit board 765 may include data connections and power connections between capture unit 710 and power supply unit 720. Thus, in some embodiments, flexible printed circuit board 765 may serve in place of power cable 760 and data cable 770. In alternative embodiments, flexible printed circuit board 765 may be included in addition to at least one of power cable 760 and data cable 770. That is, in the various embodiments discussed herein, flexible printed circuit board 765 may substitute for, or be included in addition to, power cable 760 and data cable 770.
FIG. 11 is a schematic illustration of another embodiment of a wearable apparatus securable to clothing consistent with the present disclosure. As shown in FIG. 11, connector 730 may be centrally located with respect to capture unit 710 and power supply unit 720. The central position of connector 730 may facilitate securing apparatus 110 to clothing 750 through a hole in clothing 750, such as a buttonhole in an existing article of clothing 750 or a dedicated hole in clothing 750 designed to receive wearable apparatus 110.
FIG. 12 is a schematic illustration of yet another embodiment of wearable apparatus 110 securable to clothing. As shown in FIG. 12, connector 730 may include a first magnet 731 and a second magnet 732. First magnet 731 and second magnet 732 may secure capture unit 710 to power supply unit 720 with the clothing positioned between first magnet 731 and second magnet 732. In embodiments including first magnet 731 and second magnet 732, power cable 760 and data cable 770 may also be included. In these embodiments, power cable 760 and data cable 770 may be of any length and may provide a flexible power and data connection between capture unit 710 and power supply unit 720. Embodiments including first magnet 731 and second magnet 732 may further include a flexible PCB 765 connection in addition to, or in place of, power cable 760 and/or data cable 770. In some embodiments, first magnet 731 or second magnet 732 may be replaced by an object comprising a metallic material.
FIG. 13 is a schematic illustration of still another embodiment of wearable apparatus 110 securable to clothing. FIG. 13 illustrates an embodiment in which power and data may be wirelessly transferred between capture unit 710 and power supply unit 720. As shown in FIG. 13, first magnet 731 and second magnet 732 may be provided as connector 730 to secure capture unit 710 and power supply unit 720 to clothing 750. Power and/or data may be transferred between capture unit 710 and power supply unit 720 via any suitable wireless technology, e.g., magnetic coupling and/or capacitive coupling, near-field communication technologies, radio-frequency transfer, or any other wireless technology suitable for transferring data and/or power across short distances.
FIG. 14 illustrates still another embodiment of wearable apparatus 110 securable to clothing 750 of a user. As shown in FIG. 14, connector 730 may include features designed for a contact fit. For example, capture unit 710 may include a ring 733 with a hollow center having a diameter slightly larger than a disk-shaped protrusion 734 located on power supply unit 720. When pressed together with the fabric of clothing 750 between them, disk-shaped protrusion 734 may fit snugly within ring 733, securing capture unit 710 to power supply unit 720. FIG. 14 illustrates an embodiment that does not include any cables or other physical connections between capture unit 710 and power supply unit 720. In this embodiment, capture unit 710 and power supply unit 720 may transfer power and data wirelessly. In alternative embodiments, capture unit 710 and power supply unit 720 may transfer power and data via at least one of power cable 760, data cable 770, and flexible printed circuit board 765.
FIG. 15 illustrates another aspect of power supply unit 720 consistent with embodiments described herein. Power supply unit 720 may be configured to be positioned directly against the user's skin. To facilitate such positioning, power supply unit 720 may further include at least one surface coated with a biocompatible material 740. Biocompatible material 740 may include materials that do not react adversely with the user's skin when held against the skin for extended periods of time. Such materials may include, for example, silicone, PTFE, polyimide tape, polyimide, titanium, nitinol, platinum, and the like. As also shown in FIG. 15, power supply unit 720 may be sized such that its internal volume is substantially filled by mobile power source 520. That is, in some embodiments, the internal volume of power supply unit 720 may be such that it does not accommodate any additional components beyond mobile power source 520. In some embodiments, mobile power source 520 may take advantage of its proximity to the user's skin. For example, mobile power source 520 may use the Peltier effect to produce power and/or charge the power source.
In other embodiments, an apparatus securable to clothing may further include protection circuitry associated with power source 520 housed in power supply unit 720. FIG. 16 illustrates an exemplary embodiment including protection circuitry 775. As shown in FIG. 16, protection circuitry 775 may be located remotely with respect to power supply unit 720. In alternative embodiments, protection circuitry 775 may instead be located in capture unit 710, on flexible printed circuit board 765, or in power supply unit 720.
Protection circuitry 775 may be configured to protect image sensor 220 and/or other elements of capture unit 710 from potentially dangerous currents and/or voltages produced by mobile power source 520. Protection circuitry 775 may include passive components (such as capacitors, resistors, diodes, inductors, etc.) to provide protection to the elements of capture unit 710. In some embodiments, protection circuitry 775 may also include active components (such as transistors) to provide protection to the elements of capture unit 710. For example, in some embodiments, protection circuitry 775 may include one or more resistors that function as fuses. Each fuse may include a wire or strip of metal that melts (thereby breaking the connection between the circuitry of image capture unit 710 and the circuitry of power supply unit 720) when the current flowing through the fuse exceeds a predetermined limit (e.g., 500 milliamps, 900 milliamps, 1 amp, 1.1 amps, 2 amps, 2.1 amps, 3 amps, etc.). Any or all of the previously described embodiments may incorporate protection circuitry 775.
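As a loose illustration of how the example fuse limits above might be applied in a design, the sketch below picks the smallest listed rating that leaves headroom above the capture unit's peak current draw. The 25% margin, the function names, and the selection policy are hypothetical design choices, not values from the disclosure.

```python
# Example fuse ratings drawn from the limits listed in the text (amps).
STANDARD_RATINGS_A = [0.5, 0.9, 1.0, 1.1, 2.0, 2.1, 3.0]


def pick_fuse_rating(peak_load_a, margin=1.25):
    """Return the smallest listed rating at least `margin` times the peak
    load current, so normal operation never opens the fuse. The 25% margin
    is an illustrative assumption."""
    required = peak_load_a * margin
    for rating in STANDARD_RATINGS_A:
        if rating >= required:
            return rating
    raise ValueError("peak load exceeds all listed fuse ratings")


print(pick_fuse_rating(0.6))  # 0.9
print(pick_fuse_rating(1.5))  # 2.0
```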
In some embodiments, the wearable apparatus may send data to a computing device (e.g., a smartphone, tablet, watch, computer, etc.) over one or more networks via any known wireless standard (e.g., cellular, Wi-Fi, Bluetooth™, etc.), via near-field capacitive coupling or other short-range wireless techniques, or via a wired connection. Similarly, the wearable apparatus may receive data from a computing device over one or more networks via any known wireless standard (e.g., cellular, Wi-Fi, Bluetooth™, etc.), via near-field capacitive coupling or other short-range wireless techniques, or via a wired connection. The data sent to and/or received by the wearable apparatus may include images, portions of images, identifiers related to information appearing in analyzed images or associated with analyzed audio, or any other data representing image and/or audio data. For example, an image may be analyzed, and an identifier related to an activity occurring in the image may be sent to the computing device (e.g., a "paired device"). In the embodiments described herein, the wearable apparatus may process images and/or audio locally (on the wearable apparatus) and/or remotely (via a computing device). Furthermore, in the embodiments described herein, the wearable apparatus may send data related to the analysis of images and/or audio to a computing device for further analysis, for display, and/or for transmission to another device (e.g., a paired device). Additionally, the paired device may execute one or more applications (apps) to process, display, and/or analyze the data (e.g., identifiers, text, images, audio, etc.) received from the wearable apparatus.
Some of the disclosed embodiments may involve systems, devices, methods, and software products for determining at least one keyword. For example, at least one keyword may be determined based on data collected by apparatus 110. At least one search query may be determined based on the at least one keyword, and the at least one search query may be sent to a search engine.
In some embodiments, at least one keyword may be determined based on at least one or more images captured by image sensor 220. In some cases, the at least one keyword may be selected from a pool of keywords stored in memory. In some cases, optical character recognition (OCR) may be performed on at least one image captured by image sensor 220, and the at least one keyword may be determined based on the OCR results. In some cases, at least one image captured by image sensor 220 may be analyzed to recognize a person, an object, a location, a scene, and so forth. Furthermore, the at least one keyword may be determined based on the recognized person, object, location, scene, etc. For example, the at least one keyword may include a person's name, an object's name, a place name, a date, a sports team's name, a movie title, a book title, and so forth.
In some embodiments, the at least one keyword may be determined based on the user's behavior. The user's behavior may be determined based on an analysis of one or more images captured by image sensor 220. In some embodiments, the at least one keyword may be determined based on activities of the user and/or other persons. One or more images captured by image sensor 220 may be analyzed to identify the activities of the user and/or of other persons appearing in the one or more images captured by image sensor 220. In some embodiments, the at least one keyword may be determined based on at least one or more audio segments captured by apparatus 110. In some embodiments, the at least one keyword may be determined based on at least GPS information associated with the user. In some embodiments, the at least one keyword may be determined based on at least the current time and/or date.
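One way to sketch the keyword-pool selection described above, e.g., matching OCR output against keywords stored in memory, is shown below. The pool contents, normalization, and function name are illustrative assumptions; the disclosure does not prescribe a matching algorithm.

```python
# Hypothetical stored keyword pool (contents invented for illustration).
KEYWORD_POOL = {"quinoa", "restaurant", "museum", "stadium"}


def keywords_from_ocr_text(ocr_text, pool=KEYWORD_POOL):
    """Return pool keywords that appear in the OCR output, preserving the
    order in which they first occur in the text."""
    seen = []
    for token in ocr_text.lower().split():
        word = token.strip(".,;:!?\"'()")  # trim simple punctuation
        if word in pool and word not in seen:
            seen.append(word)
    return seen


print(keywords_from_ocr_text("Fresh QUINOA bowls - best restaurant in town!"))
# ['quinoa', 'restaurant']
```

A production system would likely combine several such signals (OCR, recognized objects, GPS, time of day) before selecting keywords.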
In some embodiments, at least one search query may be determined based on the at least one keyword. In some cases, the at least one search query may include the at least one keyword. In some cases, the at least one search query may include the at least one keyword together with additional keywords provided by the user. In some cases, the at least one search query may include the at least one keyword together with one or more images, such as images captured by image sensor 220. In some cases, the at least one search query may include the at least one keyword together with one or more audio segments, such as audio segments captured by apparatus 110.
In some embodiments, the at least one search query may be sent to a search engine. In some embodiments, search results provided by the search engine in response to the at least one search query may be provided to the user. In some embodiments, the at least one search query may be used to access a database.
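A minimal sketch of forming a search query from determined keywords and user-provided terms, then composing a request URL for a search engine, might look like the following. The endpoint URL and the `q` parameter name are placeholders, not an API named in the disclosure.

```python
from urllib.parse import urlencode

# Placeholder endpoint; a real system would target an actual search engine
# or database API.
SEARCH_ENDPOINT = "https://search.example.com/search"


def build_search_query(keywords, user_terms=()):
    """Combine determined keywords with optional user-provided terms."""
    return " ".join(list(keywords) + list(user_terms))


def build_search_url(query):
    """URL-encode the query string for submission to the search engine."""
    return SEARCH_ENDPOINT + "?" + urlencode({"q": query})


query = build_search_query(["quinoa"], user_terms=["nutrition"])
print(build_search_url(query))
# https://search.example.com/search?q=quinoa+nutrition
```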
For example, in one embodiment, a keyword may include the name of a type of food (such as quinoa) or the brand name of a food product, and the search may output information related to recommended consumption amounts, facts about its nutritional profile, and so forth. In another example, in one embodiment, a keyword may include the name of a restaurant, and the search may output information related to the restaurant, such as a menu, opening hours, reviews, and so forth; the name of the restaurant may be obtained using OCR on an image of signage, using GPS information, etc. In another example, in one embodiment, a keyword may include a person's name, and the search may provide information from the person's social network profile; the person's name may be obtained using OCR on a name tag attached to the person's shirt, using a facial recognition algorithm, etc. In another example, in one embodiment, a keyword may include the title of a book, and the search may output information related to the book, such as reviews, sales statistics, information about the book's author, and so forth. In another example, in one embodiment, a keyword may include the title of a movie, and the search may output information related to the movie, such as reviews, box office statistics, information about the movie's cast, show times, and so forth. In another example, in one embodiment, a keyword may include the name of a sports team, and the search may output information related to the sports team, such as statistics, recent results, future schedules, information about the team's players, and so forth; the name of the sports team may be obtained, for example, using an audio recognition algorithm.
Camera-based directional hearing aids
As described previously, the disclosed embodiments may include providing feedback (such as acoustic and tactile feedback) to one or more auxiliary devices in response to processing at least one image in an environment. In some embodiments, the auxiliary device may be an earpiece or other device used to provide auditory feedback to the user, such as a hearing aid. Traditional hearing aids often use microphones to amplify sounds in the user's environment. These traditional systems, however, often fail to distinguish between sounds that may be of particular importance to the wearer of the device, or may do so only on a limited basis. The systems and methods of the disclosed embodiments, described in detail below, provide various improvements over traditional hearing aids.
In one embodiment, a camera-based directional hearing aid may be provided for selectively amplifying sounds based on a gaze direction of a user. The hearing aid may communicate with an image capture device, such as apparatus 110, to determine the gaze direction of the user. This gaze direction may be used to isolate and/or selectively amplify sounds received from that direction (e.g., sounds from individuals in the user's gaze direction, etc.). Sounds received from directions other than the user's gaze direction may be suppressed, attenuated, filtered, or the like.
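As an illustration of direction-based selective amplification, the sketch below applies full gain to sound arriving from the gaze direction and attenuates sound from other directions. The raised-cosine gain curve, beam width, and attenuation floor are assumptions for illustration only; the disclosure does not specify a particular gain function.

```python
import math


def directional_gain(source_angle_deg, gaze_angle_deg,
                     beam_width_deg=60.0, floor=0.1):
    """Return a gain in [floor, 1.0]: 1.0 on-axis, tapering to `floor`
    at the beam edge and beyond (i.e., attenuating off-axis sound)."""
    # Smallest angular offset between source and gaze, handling wrap-around.
    offset = abs((source_angle_deg - gaze_angle_deg + 180.0) % 360.0 - 180.0)
    if offset >= beam_width_deg:
        return floor
    # Raised-cosine taper from 1.0 at the center to `floor` at the edge.
    taper = 0.5 * (1.0 + math.cos(math.pi * offset / beam_width_deg))
    return floor + (1.0 - floor) * taper


print(round(directional_gain(0.0, 0.0), 3))   # 1.0 (on-axis: full gain)
print(round(directional_gain(90.0, 0.0), 3))  # 0.1 (off-axis: attenuated)
```

In a real hearing aid, this gain would modulate per-direction audio streams obtained, for example, from a beamforming microphone array.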
FIG. 17A is a schematic illustration of an example of user 100 wearing apparatus 110 together with a camera-based auditory interface device 1710 according to the disclosed embodiments. As shown, user 100 may wear apparatus 110, which is physically connected to a shirt or other article of clothing of user 100. Consistent with the disclosed embodiments, and as described previously, apparatus 110 may be positioned in other locations. For example, apparatus 110 may be physically connected to a necklace, a belt, glasses, a wristband, a button, etc. Apparatus 110 may be configured to communicate with an auditory interface device, such as auditory interface device 1710. Such communication may occur through a wired connection or may be made wirelessly (e.g., using Bluetooth™, NFC, or another form of wireless communication). In some embodiments, one or more additional devices, such as computing device 120, may also be included. Accordingly, one or more of the processes or functions described herein with respect to apparatus 110 or processor 210 may be performed by computing device 120 and/or processor 540.
Auditory interface device 1710 may be any device configured to provide auditory feedback to user 100. Auditory interface device 1710 may correspond to feedback output unit 230, as described above, and therefore any description of feedback output unit 230 may also apply to auditory interface device 1710. In some embodiments, auditory interface device 1710 may be separate from feedback output unit 230 and may be configured to receive signals from feedback output unit 230. As shown in FIG. 17A, auditory interface device 1710 may be placed in one or both ears of user 100, similar to traditional auditory interface devices. Auditory interface device 1710 may be of various styles, including in-the-canal, completely-in-canal, in-the-ear, behind-the-ear, on-the-ear, receiver-in-canal, open fit, or various other styles. Auditory interface device 1710 may include one or more speakers for providing auditory feedback to user 100, a microphone for detecting sounds in the environment of user 100, internal electronics, a processor, memory, etc. In some embodiments, in addition to or in place of a microphone, auditory interface device 1710 may include one or more communication units, in particular one or more receivers, for receiving signals from apparatus 110 and transferring them to user 100.
Auditory interface device 1710 may have various other configurations or placements. In some embodiments, as shown in FIG. 17A, auditory interface device 1710 may comprise a bone conduction headphone 1711. Bone conduction headphone 1711 may be surgically implanted and may provide audible feedback to user 100 through bone conduction of sound vibrations to the inner ear. Auditory interface device 1710 may also comprise one or more headphones (e.g., wireless headphones, over-ear headphones, etc.) or a portable speaker carried or worn by user 100. In some embodiments, auditory interface device 1710 may be integrated into other devices, such as a Bluetooth™ headset of the user, glasses, a helmet (e.g., a motorcycle helmet, bicycle helmet, etc.), a hat, etc.
装置110可以被配置为确定用户100的用户视线方向1750。在一些实施例中,可以通过监视用户100的下巴、或另一身体部分或面部部分相对于相机传感器1751的光轴的方向来跟踪用户视线方向1750。装置110可以被配置为例如使用图像传感器220来捕捉用户周围环境的一个或多个图像。所捕捉的图像可以包括用户100的下巴的表示,该表示可用于确定用户视线方向1750。处理器210(和/或处理器210a和210b)可以被配置为使用各种图像检测或处理算法(例如,使用卷积神经网络(CNN)、尺度不变特征变换(SIFT)、定向梯度直方图(HOG)特征或其他技术)来分析捕捉的图像并检测用户100的下巴或另一部分。基于检测到的用户100的下巴的表示,可以确定视线方向1750。可以部分地通过将检测到的用户100的下巴的表示与相机传感器1751的光轴进行比较来确定视线方向1750。例如,光轴1751在每个图像中可以是已知的或固定的,并且处理器210可以通过将用户100的下巴的代表性角度与光轴1751的方向进行比较来确定视线方向1750。虽然使用用户100的下巴的表示来描述该过程,但是可以检测各种其他特征以确定用户的视线方向1750,包括用户的脸、鼻子、眼睛、手等。The apparatus 110 may be configured to determine the user's gaze direction 1750 of the user 100 . In some embodiments, the user's gaze direction 1750 may be tracked by monitoring the orientation of the user's 100 jaw, or another body part or face part, relative to the optical axis of the camera sensor 1751 . Device 110 may be configured to capture one or more images of the user's surroundings, eg, using image sensor 220 . The captured image may include a representation of the user's 100 jaw, which may be used to determine the user's gaze direction 1750 . Processor 210 (and/or processors 210a and 210b) may be configured to use various image detection or processing algorithms (eg, using convolutional neural networks (CNN), scale-invariant feature transforms (SIFT), histograms of oriented gradients) (HOG feature or other techniques) to analyze the captured image and detect the jaw or another part of the user 100 . Based on the detected representation of the jaw of the user 100, the gaze direction 1750 can be determined. The gaze direction 1750 may be determined in part by comparing the detected representation of the jaw of the user 100 to the optical axis of the camera sensor 1751 . For example, the optical axis 1751 may be known or fixed in each image, and the processor 210 may determine the gaze direction 1750 by comparing a representative angle of the chin of the user 100 to the direction of the optical axis 1751 . 
While the process is described using a representation of the chin of the user 100, various other features may be detected to determine the user gaze direction 1750, including the user's face, nose, eyes, hands, and the like.
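As a rough sketch of the comparison described above, the gaze direction can be expressed as the angular offset between the detected chin orientation and the camera's optical axis. The function name and angle convention below are illustrative assumptions, not part of the disclosed apparatus; `chin_angle_deg` stands in for the output of a hypothetical upstream chin detector.

```python
def estimate_gaze_direction(chin_angle_deg, optical_axis_deg=0.0):
    """Express the user gaze direction 1750 as an angular offset from
    the camera's optical axis 1751, which is fixed per image."""
    # Normalize the signed difference to (-180, 180] degrees.
    offset = (chin_angle_deg - optical_axis_deg + 180.0) % 360.0 - 180.0
    return offset
```

Under this convention, a chin detected 25 degrees to the right of the optical axis yields a gaze direction of +25 degrees relative to the camera.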
在其他实施例中,用户视线方向1750可以与光轴1751更紧密地对准。例如,如上所述,装置110可以被固定到用户100的一副眼镜上,如图1A所示。在该实施例中,用户视线方向1750可以与光轴1751的方向相同或接近。因此,用户视线方向1750可以基于图像传感器220的视野来确定或粗略估计。In other embodiments, the user gaze direction 1750 may be more closely aligned with the optical axis 1751. For example, as described above, the apparatus 110 may be affixed to a pair of glasses of the user 100, as shown in FIG. 1A. In this embodiment, the user gaze direction 1750 may be the same as, or close to, the direction of the optical axis 1751. Accordingly, the user gaze direction 1750 may be determined or approximated based on the field of view of the image sensor 220.
图17B是符合本公开的可固定到衣物上的装置的实施例的示意图。如图17B所示,装置110可以固定到一件衣服上,诸如用户100的衬衫。如上所述,装置110可以固定到其他衣物上,诸如用户100的腰带或裤子。装置110可以具有一个或多个相机1730,它们可以对应于图像传感器220。相机1730可以被配置为捕捉用户100的周围环境的图像。在一些实施例中,相机1730可以被配置为检测捕捉用户周围环境的相同图像中用户下巴的表示,该图像可用于本公开中描述的其他功能。在其他实施例中,相机1730可以是专用于确定用户视线方向1750的辅助或单独相机。FIG. 17B is a schematic diagram of an embodiment of an apparatus securable to clothing consistent with the present disclosure. As shown in FIG. 17B, the apparatus 110 may be secured to an article of clothing, such as the shirt of the user 100. As described above, the apparatus 110 may be secured to other articles of clothing, such as a belt or pants of the user 100. The apparatus 110 may have one or more cameras 1730, which may correspond to the image sensor 220. The camera 1730 may be configured to capture images of the surroundings of the user 100. In some embodiments, the camera 1730 may be configured to detect a representation of the user's chin in the same images that capture the user's surroundings, which images may also be used for other functions described in this disclosure. In other embodiments, the camera 1730 may be an auxiliary or separate camera dedicated to determining the user gaze direction 1750.
装置110还可以包括一个或多个麦克风1720,用于从用户100的环境捕捉声音。麦克风1720还可以被配置为确定用户100的环境中声音的方向性。例如,麦克风1720可以包括一个或多个定向麦克风,它们可能对拾取某些方向上的声音更敏感。例如,麦克风1720可以包括单向麦克风,其被设计成从单个方向或小范围的方向拾取声音。麦克风1720还可以包括心形麦克风,它可能对来自前面和侧面的声音敏感。麦克风1720还可以包括麦克风阵列,其可以包括附加的麦克风,诸如在装置110前面的麦克风1721,或放置在装置110侧面的麦克风1722。在一些实施例中,麦克风1720可以是用于捕捉多个音频信号的多端口麦克风。图17B中所示的麦克风仅作为示例,并且可以使用任何适当数量、配置或位置的麦克风。处理器210可以被配置为区分用户100的环境内的声音并且确定每个声音的近似方向性。例如,使用麦克风阵列1720,处理器210可以对麦克风1720之间个体声音的相对定时或振幅进行比较,以确定相对于装置110的方向性。The apparatus 110 may also include one or more microphones 1720 for capturing sounds from the environment of the user 100. The microphone 1720 may also be configured to determine the directionality of sounds in the environment of the user 100. For example, the microphone 1720 may comprise one or more directional microphones, which may be more sensitive to picking up sounds in certain directions. For example, the microphone 1720 may comprise a unidirectional microphone, designed to pick up sound from a single direction or a small range of directions. The microphone 1720 may also comprise a cardioid microphone, which may be sensitive to sounds from the front and sides. The microphone 1720 may also include a microphone array, which may include additional microphones, such as a microphone 1721 on the front of the apparatus 110, or a microphone 1722 placed on the side of the apparatus 110. In some embodiments, the microphone 1720 may be a multi-port microphone for capturing multiple audio signals. The microphones shown in FIG. 17B are by way of example only, and any suitable number, configuration, or location of microphones may be used. The processor 210 may be configured to distinguish sounds within the environment of the user 100 and determine an approximate directionality of each sound. For example, using the microphone array 1720, the processor 210 may compare the relative timing or amplitude of an individual sound among the microphones 1720 to determine a directionality relative to the apparatus 110.
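The timing comparison between microphones can be sketched with a classic two-microphone time-difference-of-arrival estimate. The two-microphone geometry, spacing value, and function name below are assumptions for illustration; the disclosed microphone array 1720 is not limited to this arrangement, and a real system would use more microphones and beamforming.

```python
import math

SPEED_OF_SOUND = 343.0  # m/s in air at roughly 20 °C

def direction_from_delay(delay_s, mic_spacing_m):
    """Estimate a sound's angle of arrival, in degrees, relative to the
    direction perpendicular to the axis joining two microphones (e.g.,
    a front microphone 1721 and a side microphone 1722), from the
    arrival-time difference between them."""
    ratio = SPEED_OF_SOUND * delay_s / mic_spacing_m
    ratio = max(-1.0, min(1.0, ratio))  # clamp numerical noise
    return math.degrees(math.asin(ratio))
```

A zero delay means the sound arrives broadside (0 degrees); a delay equal to the spacing divided by the speed of sound means it arrives along the microphone axis (90 degrees).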
作为在其他音频分析操作之前的初步步骤,可以使用任何音频分类技术对从用户的环境捕捉的声音进行分类。例如,声音可以被分类为包含音乐、音调、笑声、尖叫等的片段。各个片段的指示可以记录在数据库中,并且可以证明对于生活记录应用非常有用。作为一个示例,所记录的信息可以使系统能够检索和/或确定当用户遇到另一个体时的心情。另外,这样的处理相对快速和有效,并且不需要大量的计算资源,并且将信息发送到目的地不需要大量的带宽。此外,一旦音频的某些部分被分类为非语音,更多的计算资源可用于处理其他片段。As a preliminary step before other audio analysis operations, sounds captured from the user's environment may be classified using any audio classification technique. For example, sounds may be classified into segments containing music, tones, laughter, screams, and the like. Indications of the respective segments may be logged in a database and may prove highly useful for life-logging applications. As one example, the logged information may enable the system to retrieve and/or determine the user's mood when encountering another individual. Additionally, such processing is relatively fast and efficient, requires no significant computing resources, and transmitting the information to a destination does not require significant bandwidth. Moreover, once certain parts of the audio are classified as non-speech, more computing resources may be made available for processing other segments.
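A minimal illustration of such a preliminary classification step, using a zero-crossing-rate heuristic to separate tone-like segments from noise-like ones. The threshold and the labels are illustrative assumptions; the disclosure permits any audio classification technique here.

```python
import math

def zero_crossing_rate(samples):
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return crossings / max(len(samples) - 1, 1)

def label_segment(samples, zcr_threshold=0.25):
    """Coarse screen: a sustained tone crosses zero at a steady, low
    rate, while broadband noise crosses erratically and often."""
    rate = zero_crossing_rate(samples)
    return "tone-like" if rate < zcr_threshold else "noise-like"

# A low-frequency sine segment screens as tone-like; a rapidly
# alternating signal screens as noise-like.
tone = [math.sin(2 * math.pi * 5 * n / 1000) for n in range(1000)]
noise = [1.0 if n % 2 == 0 else -1.0 for n in range(1000)]
```

The per-segment labels could then be logged to a database as the text describes, leaving heavier analysis for the segments that matter.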
基于确定的用户视线方向1750,处理器210可选择性地调节或放大来自与用户视线方向1750相关联的区域的声音。图18是示出符合本公开的使用基于相机的助听器的示例性环境的示意图。麦克风1720可以检测用户100的环境内的一个或多个声音1820、1821和1822。基于由处理器210确定的用户视线方向1750,可以确定与用户视线方向1750相关联的区域1830。如图18所示,区域1830可以基于用户视线方向1750由锥体或方向范围来定义。如图18所示,角度范围可以由角度θ来定义。角度θ可以是用于定义调节用户100的环境内的声音的范围的任何合适的角度(例如,10度、20度、45度)。Based on the determined user gaze direction 1750, the processor 210 may selectively condition or amplify sounds from a region associated with the user gaze direction 1750. FIG. 18 is a schematic diagram illustrating an exemplary environment for use of a camera-based hearing aid consistent with the present disclosure. The microphone 1720 may detect one or more sounds 1820, 1821, and 1822 within the environment of the user 100. Based on the user gaze direction 1750 determined by the processor 210, a region 1830 associated with the user gaze direction 1750 may be determined. As shown in FIG. 18, the region 1830 may be defined by a cone or range of directions based on the user gaze direction 1750. As shown in FIG. 18, the range of angles may be defined by an angle θ. The angle θ may be any suitable angle for defining the range within which sounds in the environment of the user 100 are conditioned (e.g., 10 degrees, 20 degrees, 45 degrees).
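The membership test for region 1830 reduces to an angular comparison against the cone's half-width. The angle convention and the default value of θ below are illustrative assumptions.

```python
def in_gaze_region(sound_angle_deg, gaze_angle_deg, theta_deg=20.0):
    """Return True when a sound's direction of arrival falls within the
    cone of full angular width theta_deg centered on the user gaze
    direction 1750 (region 1830 in FIG. 18)."""
    # Smallest signed difference between the two directions.
    diff = (sound_angle_deg - gaze_angle_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= theta_deg / 2.0
```

The modular arithmetic handles directions that wrap past 0/360 degrees, so a sound at 355 degrees still counts as inside a 20-degree cone centered at 0 degrees.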
处理器210可以被配置为基于区域1830对用户100的环境中的声音进行选择性调节。经调节的音频信号可以被发送到听觉接口设备1710,并且因此可以向用户100提供对应于用户的视线方向的听觉反馈。例如,处理器210可以确定声音1820(其可以对应于个体1810的语音,或者例如对应于噪声)处于区域1830内。处理器210然后可以对从麦克风1720接收的音频信号执行各种调节技术。调节可以包括相对于其他音频信号放大被确定为对应于声音1820的音频信号。放大可以例如通过相对于其他信号处理与声音1820相关联的音频信号来数字化地实现。放大还可以通过改变麦克风1720的一个或多个参数来实现,以聚焦于从与用户视线方向1750相关联的区域1830(例如,感兴趣的区域)发出的音频声音。例如,麦克风1720可以是定向麦克风,处理器210可以执行将麦克风1720聚焦在声音1820或区域1830内的其他声音上的操作。可以使用用于放大声音1820的各种其他技术,诸如使用波束成形麦克风阵列、声学望远镜技术等。The processor 210 may be configured to selectively condition sounds in the environment of the user 100 based on the region 1830. The conditioned audio signal may be transmitted to the auditory interface device 1710, and thus may provide the user 100 with auditory feedback corresponding to the user's gaze direction. For example, the processor 210 may determine that a sound 1820 (which may correspond to the voice of an individual 1810, or, for example, to noise) is within the region 1830. The processor 210 may then perform various conditioning techniques on the audio signals received from the microphone 1720. The conditioning may include amplifying the audio signal determined to correspond to the sound 1820 relative to other audio signals. Amplification may be accomplished digitally, for example, by processing the audio signal associated with the sound 1820 relative to the other signals. Amplification may also be accomplished by changing one or more parameters of the microphone 1720 to focus on audio sounds emanating from the region 1830 (e.g., a region of interest) associated with the user gaze direction 1750. For example, the microphone 1720 may be a directional microphone, and the processor 210 may perform an operation to focus the microphone 1720 on the sound 1820 or other sounds within the region 1830. Various other techniques for amplifying the sound 1820 may be used, such as using a beamforming microphone array, acoustic telescope techniques, and the like.
调节还可以包括衰减或抑制从区域1830之外的方向接收的一个或多个音频信号。例如,处理器210可以衰减声音1821和1822。类似于声音1820的放大,声音的衰减可以通过处理音频信号来发生,或者通过改变与一个或多个麦克风1720相关联的一个或多个参数来引导焦点远离从区域1830之外发出的声音。The conditioning may also include attenuation or suppression of one or more audio signals received from directions outside the region 1830. For example, the processor 210 may attenuate the sounds 1821 and 1822. Similar to the amplification of the sound 1820, attenuation of sounds may occur through processing the audio signals, or by varying one or more parameters associated with one or more of the microphones 1720 to direct focus away from sounds emanating from outside the region 1830.
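Taken together, the digital side of this selective conditioning amounts to scaling in-region sources up and out-of-region sources down before mixing. The gain values and the per-source signal representation below are illustrative assumptions, not the disclosed processing chain.

```python
def selectively_condition(sources, in_region_ids, gain=4.0, attenuation=0.25):
    """sources: mapping of source id -> list of samples; in_region_ids:
    ids of sources determined to lie within region 1830. In-region
    signals (e.g., sound 1820) are amplified, the rest (e.g., sounds
    1821 and 1822) attenuated, then all are mixed sample-by-sample."""
    mixed = None
    for source_id, samples in sources.items():
        factor = gain if source_id in in_region_ids else attenuation
        scaled = [factor * s for s in samples]
        mixed = scaled if mixed is None else [m + s for m, s in zip(mixed, scaled)]
    return mixed
```

In practice the per-source separation would come from the beamforming or directional-microphone techniques described above; this sketch only shows the relative-gain step.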
在一些实施例中,调节还可以包括改变对应于声音1820的音频信号的音调,以使声音1820对于用户100更易感知。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以调整声音1820的音高以使其对于用户100更易感知。例如,用户100可能经历10kHz以上的频率中的听觉损失。因此,处理器210可以将更高的频率(例如,在15kHz处)重新映射到10kHz。在一些实施例中,处理器210可以被配置为改变与一个或多个音频信号相关联的语速。因此,处理器210可以被配置为例如使用语音活动检测(VAD)算法或技术来检测由麦克风1720接收的一个或多个音频信号内的语音。如果确定声音1820对应于例如来自个体1810的语音或讲话,则处理器210可以被配置为改变声音1820的回放速率。例如,可以降低个体1810的语速以使检测到的语音对于用户100更易感知。可以执行各种其他处理(诸如修改声音1820的音调),以维持与原始音频信号相同的音高,或者降低音频信号内的噪声。In some embodiments, the conditioning may further include changing a tone of the audio signal corresponding to the sound 1820 to make the sound 1820 more perceptible to the user 100. For example, the user 100 may have lower sensitivity to tones within a certain range, and conditioning of the audio signal may adjust the pitch of the sound 1820 to make it more perceptible to the user 100. For example, the user 100 may experience hearing loss in frequencies above 10 kHz. Accordingly, the processor 210 may remap higher frequencies (e.g., at 15 kHz) to 10 kHz. In some embodiments, the processor 210 may be configured to vary a rate of speech associated with one or more audio signals. Accordingly, the processor 210 may be configured to detect speech within one or more audio signals received by the microphone 1720, for example, using a voice activity detection (VAD) algorithm or technique. If it is determined that the sound 1820 corresponds to voice or speech, for example from the individual 1810, the processor 210 may be configured to vary the playback rate of the sound 1820. For example, the rate of speech of the individual 1810 may be decreased to make the detected speech more perceptible to the user 100. Various other processing may be performed, such as modifying the tone of the sound 1820 to maintain the same pitch as the original audio signal, or to reduce noise within the audio signal.
If speech recognition has been performed on the audio signal associated with sound 1820, adjusting may also include modifying the audio signal based on the detected speech. For example, processor 210 may introduce pauses or increase the duration of pauses between words and/or sentences, which may make speech easier to understand.
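The frequency remapping described above (e.g., moving content at 15 kHz down below a 10 kHz hearing limit) can be sketched as a mapping on frequency values. A real implementation would operate on the signal's spectrum, for example via a short-time Fourier transform; the width of the target band below is an assumption for illustration.

```python
def remap_frequency(freq_hz, hearing_limit_hz=10000.0, top_hz=20000.0,
                    target_band_hz=2000.0):
    """Map frequencies the user cannot perceive (above hearing_limit_hz)
    into a target_band_hz-wide band just below the limit; audible
    frequencies pass through unchanged."""
    if freq_hz <= hearing_limit_hz:
        return freq_hz
    # Compress [hearing_limit_hz, top_hz] linearly into
    # [hearing_limit_hz - target_band_hz, hearing_limit_hz].
    fraction = (min(freq_hz, top_hz) - hearing_limit_hz) / (top_hz - hearing_limit_hz)
    return hearing_limit_hz - target_band_hz + target_band_hz * fraction
```

With these defaults, 15 kHz content lands at 9 kHz and the top of the band lands exactly at the 10 kHz limit, while anything already audible is untouched.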
然后可以将经调节的音频信号发送到听觉接口设备1710,并为用户100产生音频信号。因此,在经调节的音频信号中,声音1820可以更容易被用户100听到,比声音1821和1822更响亮和/或更容易区分,声音1821和1822可以表示环境内的背景噪声。The conditioned audio signal may then be sent to auditory interface device 1710 and an audio signal generated for user 100 . Thus, in the conditioned audio signal, sound 1820 may be more easily heard by user 100, louder and/or more distinguishable than sounds 1821 and 1822, which may represent background noise within the environment.
图19是示出符合所公开实施例的用于选择性地放大从检测到的用户的视线方向发出的声音的示例性过程1900的流程图。过程1900可以由与装置110相关联的一个或多个处理器(诸如处理器210)来执行。在一些实施例中,过程1900的一些或全部可以在装置110外部的处理器上执行。换句话说,执行过程1900的处理器可以与麦克风1720和相机1730一起包括在公共外壳中,或者可以包括在第二外壳中。例如,过程1900的一个或多个部分可以由听觉接口设备1710中的处理器或诸如计算设备120的辅助设备来执行。FIG. 19 is a flowchart illustrating an exemplary process 1900 for selectively amplifying sounds emanating from a detected gaze direction of a user, consistent with the disclosed embodiments. The process 1900 may be performed by one or more processors associated with the apparatus 110, such as the processor 210. In some embodiments, some or all of the process 1900 may be performed on processors external to the apparatus 110. In other words, the processor performing the process 1900 may be included in a common housing with the microphone 1720 and the camera 1730, or may be included in a second housing. For example, one or more portions of the process 1900 may be performed by a processor in the auditory interface device 1710, or by an auxiliary device, such as the computing device 120.
在步骤1910中,过程1900可以包括从用户的环境接收由相机捕捉的多个图像。相机可以是诸如装置110的相机1730的可穿戴相机。在步骤1912中,过程1900可以包括接收表示由至少一个麦克风接收的声音的音频信号。麦克风可以被配置为从用户的环境捕捉声音。例如,如上所述,麦克风可以是麦克风1720。因此,麦克风可以包括定向麦克风、麦克风阵列、多端口麦克风或各种其他类型的麦克风。在一些实施例中,麦克风和可穿戴相机可以包括在公共外壳(诸如装置110的外壳)中。执行过程1900的一个或多个处理器也可以包括在该外壳中,或者可以包括在第二外壳中。在这样的实施例中,处理器可以被配置为经由无线链路(例如,蓝牙TM、NFC等)从公共外壳接收图像和/或音频信号。因此,公共外壳(例如,装置110)和第二外壳(例如,计算设备120)还可以包括发送器或各种其他通信组件。In step 1910, the process 1900 may include receiving a plurality of images captured by a camera from an environment of a user. The camera may be a wearable camera, such as the camera 1730 of the apparatus 110. In step 1912, the process 1900 may include receiving audio signals representative of sounds received by at least one microphone. The microphone may be configured to capture sounds from the environment of the user. For example, the microphone may be the microphone 1720, as described above. Accordingly, the microphone may include a directional microphone, a microphone array, a multi-port microphone, or various other types of microphones. In some embodiments, the microphone and the wearable camera may be included in a common housing, such as the housing of the apparatus 110. The one or more processors performing the process 1900 may also be included in the housing, or may be included in a second housing. In such embodiments, the processor may be configured to receive the images and/or audio signals from the common housing via a wireless link (e.g., Bluetooth™, NFC, etc.). Accordingly, the common housing (e.g., the apparatus 110) and the second housing (e.g., the computing device 120) may further comprise transmitters or various other communication components.
在步骤1914中,过程1900可以包括基于对多个图像中的至少一个的分析来确定用户的视线方向。如上所述,可以使用各种技术来确定用户视线方向。在一些实施例中,可以至少部分地基于在一个或多个图像中检测到的用户下巴的表示来确定视线方向。如上所述,可以处理图像以确定下巴相对于可穿戴相机光轴的指向方向。In step 1914, the process 1900 may include determining a gaze direction of the user based on analysis of at least one of the plurality of images. As described above, various techniques may be used to determine the user gaze direction. In some embodiments, the gaze direction may be determined based, at least in part, on a representation of the user's chin detected in one or more of the images. As described above, the images may be processed to determine a pointing direction of the chin relative to the optical axis of the wearable camera.
在步骤1916中,过程1900可以包括对由至少一个麦克风从与用户的视线方向相关联的区域接收的至少一个音频信号进行选择性调节。如上所述,可以基于在步骤1914中确定的用户视线方向来确定区域。该范围可以与关于视线方向的角宽度(例如,10度、20度、45度等)相关联。如上所述,可以对音频信号执行各种形式的调节。在一些实施例中,调节可以包括改变音频信号的音调或重放速度。例如,调节可以包括改变与音频信号相关联的语速。在一些实施例中,调节可以包括相对于从与用户的视线方向相关联的区域之外接收的其他音频信号对该音频信号进行放大。可以通过各种手段来执行放大,诸如操作配置为聚焦于从该区域发出的音频声音的定向麦克风,或者改变与麦克风相关联的一个或多个参数以使该麦克风聚焦于从该区域发出的音频声音。放大可以包括衰减或抑制由麦克风从与用户100的视线方向相关联的区域之外的方向接收的一个或多个音频信号。In step 1916, the process 1900 may include causing selective conditioning of at least one audio signal received by the at least one microphone from a region associated with the gaze direction of the user. As described above, the region may be determined based on the user gaze direction determined in step 1914. The range may be associated with an angular width about the gaze direction (e.g., 10 degrees, 20 degrees, 45 degrees, etc.). As described above, various forms of conditioning may be performed on the audio signal. In some embodiments, the conditioning may include changing a tone or playback speed of the audio signal. For example, the conditioning may include changing a rate of speech associated with the audio signal. In some embodiments, the conditioning may include amplifying the audio signal relative to other audio signals received from outside the region associated with the gaze direction of the user. Amplification may be performed by various means, such as operation of a directional microphone configured to focus on audio sounds emanating from the region, or varying one or more parameters associated with the microphone to cause the microphone to focus on audio sounds emanating from the region. The amplification may include attenuation or suppression of one or more audio signals received by the microphone from directions outside the region associated with the gaze direction of the user 100.
在步骤1918中,过程1900可以包括使至少一个经调节的音频信号传输到被配置为向用户的耳朵提供声音的听觉接口设备。例如,经调节的音频信号可以被发送到听觉接口设备1710,其可向用户100提供对应于该音频信号的声音。执行过程1900的处理器还可以被配置为使表示背景噪声(background noise)的一个或多个音频信号被传输到听觉接口设备,该背景噪声可以相对于至少一个经调节的音频信号被衰减。例如,处理器210可以被配置为发送对应于声音1820、1821和1822的音频信号。然而,基于声音1820处于区域1830内的确定,与声音1820相关联的信号可以以不同于声音1821和声音1822的方式被修改(例如被放大)。在一些实施例中,听觉接口设备1710可以包括与听筒相关联的扬声器。例如,听觉接口设备可以至少部分地插入用户的耳朵中,用于向用户提供音频。听觉接口设备也可以在耳朵外部,诸如耳后听觉设备、一个或多个耳机、小型便携式扬声器等。在一些实施例中,听觉接口设备可以包括骨传导麦克风,其被配置为通过用户头骨的振动向用户提供音频信号。In step 1918, the process 1900 may include causing transmission of the at least one conditioned audio signal to an auditory interface device configured to provide sound to an ear of the user. For example, the conditioned audio signal may be transmitted to the auditory interface device 1710, which may provide the user 100 with sounds corresponding to the audio signal. The processor performing the process 1900 may further be configured to cause transmission to the auditory interface device of one or more audio signals representative of background noise, which may be attenuated relative to the at least one conditioned audio signal. For example, the processor 210 may be configured to transmit audio signals corresponding to the sounds 1820, 1821, and 1822. However, based on the determination that the sound 1820 is within the region 1830, the signal associated with the sound 1820 may be modified (e.g., amplified) differently from the signals associated with the sounds 1821 and 1822. In some embodiments, the auditory interface device 1710 may include a speaker associated with an earpiece. For example, the auditory interface device may be inserted at least partially into the ear of the user for providing audio to the user. The auditory interface device may also be external to the ear, such as a behind-the-ear hearing device, one or more headphones, a small portable speaker, and the like. In some embodiments, the auditory interface device may include a bone conduction microphone, configured to provide an audio signal to the user through vibrations of the user's skull.
Such devices may be placed in external contact with the user's skin, or may be surgically implanted and attached to the user's bone.
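The four steps of process 1900 can be summarized as a pipeline in which each stage is a pluggable callable. All the function parameters below are hypothetical stand-ins for the components described above (camera 1730, microphone 1720, processor 210, and auditory interface device 1710); this is a structural sketch, not the disclosed implementation.

```python
def process_1900(capture_images, capture_audio, determine_gaze,
                 condition_audio, transmit):
    """Run the four stages of process 1900 in order and return the
    conditioned audio that was transmitted."""
    images = capture_images()                    # step 1910: receive images
    audio = capture_audio()                      # step 1912: receive audio
    gaze = determine_gaze(images)                # step 1914: gaze direction
    conditioned = condition_audio(audio, gaze)   # step 1916: conditioning
    transmit(conditioned)                        # step 1918: to device 1710
    return conditioned
```

Injecting the stages as callables mirrors the text's point that any subset of the steps may run on a processor in a second housing, such as the computing device 120.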
使用语音和/或图像识别的助听器Hearing aids that use speech and/or image recognition
与所公开的实施例一致,助听器可以选择性地放大与辨识出的个体的语音相关联的音频信号。助听器系统可以存储识别出的人的语音特征和/或面部特征以帮助识别和选择性放大。例如,当个体进入装置110的视场时,该个体可以被识别为已经被介绍给该设备的个体,或者在过去可能与用户100交互过的个体(例如,朋友、同事、亲戚、先前的熟人等)。因此,相对于用户环境中的其他声音,可以隔离和/或选择性地放大与辨识出的个体的语音相关联的音频信号。与从个体方向以外的方向接收的声音相关联的音频信号可以被抑制、衰减、滤波等。Consistent with the disclosed embodiments, a hearing aid may selectively amplify audio signals associated with the voice of a recognized individual. The hearing aid system may store voice characteristics and/or facial features of a recognized person to aid in recognition and selective amplification. For example, when an individual enters the field of view of the apparatus 110, the individual may be identified as an individual that has been introduced to the device, or that may have interacted with the user 100 in the past (e.g., a friend, colleague, relative, prior acquaintance, etc.). Accordingly, the audio signal associated with the voice of the recognized individual may be isolated and/or selectively amplified relative to other sounds in the environment of the user. Audio signals associated with sounds received from directions other than the direction of the individual may be suppressed, attenuated, filtered, or the like.
用户100可以佩戴类似于上述基于相机的助听器设备的助听器设备。例如,助听器设备可以是如图17A所示的听觉接口设备1710。听觉接口设备1710可以是被配置为向用户100提供听觉反馈的任何设备。听觉接口设备1710可以被放置在用户100的一个或两个耳朵中,类似于传统的听觉接口设备。如上所述,听觉接口设备1710可以是各种样式的,包括耳道内、完全耳道内、耳内、耳后、耳上、耳道内接收器、开放安装或各种其他样式。听觉接口设备1710可以包括用于向用户100提供听觉反馈的一个或多个扬声器、用于从另一系统(诸如装置110)接收信号的通信单元、用于检测用户100的环境中的声音的麦克风、内部电子设备、处理器、存储器等。听觉接口设备1710可以对应于反馈输出单元230,或者可以与反馈输出单元230分开,并且可以被配置为从反馈输出单元230接收信号。The user 100 may wear a hearing aid device similar to the camera-based hearing aid device described above. For example, the hearing aid device may be the auditory interface device 1710, as shown in FIG. 17A. The auditory interface device 1710 may be any device configured to provide auditory feedback to the user 100. The auditory interface device 1710 may be placed in one or both ears of the user 100, similar to traditional auditory interface devices. As described above, the auditory interface device 1710 may be of various styles, including in-the-canal, completely-in-canal, in-the-ear, behind-the-ear, on-the-ear, receiver-in-canal, open fit, or various other styles. The auditory interface device 1710 may include one or more speakers for providing auditory feedback to the user 100, a communication unit for receiving signals from another system, such as the apparatus 110, a microphone for detecting sounds in the environment of the user 100, internal electronics, a processor, a memory, and the like. The auditory interface device 1710 may correspond to the feedback output unit 230, or may be separate from the feedback output unit 230 and may be configured to receive signals from the feedback output unit 230.
在一些实施例中,如图17A所示,听觉接口设备1710可以包括骨传导耳机1711。骨传导耳机1711可以通过外科手术植入,并且可以通过声音振动到内耳的骨传导来向用户100提供可听反馈。听觉接口设备1710还可以包括一个或多个耳机(例如,无线耳机、过耳耳机等)或由用户100携带或佩戴的便携式扬声器。在一些实施例中,听觉接口设备1710可以集成到其他设备中,诸如用户的蓝牙TM耳机、眼镜、头盔(例如,摩托车头盔、自行车头盔等)、帽子等。In some embodiments, as shown in FIG. 17A, the auditory interface device 1710 may include a bone conduction headphone 1711. The bone conduction headphone 1711 may be surgically implanted, and may provide audible feedback to the user 100 through bone conduction of sound vibrations to the inner ear. The auditory interface device 1710 may also include one or more headphones (e.g., wireless headphones, over-ear headphones, etc.) or a portable speaker carried or worn by the user 100. In some embodiments, the auditory interface device 1710 may be integrated into other devices, such as the user's Bluetooth™ headset, glasses, a helmet (e.g., a motorcycle helmet, bicycle helmet, etc.), a hat, and the like.
听觉接口设备1710可以被配置为与诸如装置110的相机设备进行通信。这种通信可以通过有线连接,或者可以无线地进行(例如,使用蓝牙TM、NFC或无线通信形式)。如上所述,装置110可以由用户100以各种配置来佩戴,包括物理地连接到衬衫、项链、腰带、眼镜、腕带、纽扣或与用户100相关联的其他物品。在一些实施例中,还可以包括诸如计算设备120的一个或多个附加设备。因此,本文关于装置110或处理器210描述的一个或多个过程或功能可以由计算设备120和/或处理器540执行。The auditory interface device 1710 may be configured to communicate with a camera device, such as the apparatus 110. Such communication may be through a wired connection, or may be made wirelessly (e.g., using Bluetooth™, NFC, or another form of wireless communication). As described above, the apparatus 110 may be worn by the user 100 in various configurations, including being physically connected to a shirt, necklace, belt, glasses, a wrist strap, a button, or other articles associated with the user 100. In some embodiments, one or more additional devices may also be included, such as the computing device 120. Accordingly, one or more of the processes or functions described herein with respect to the apparatus 110 or the processor 210 may be performed by the computing device 120 and/or the processor 540.
如上所述,装置110可以包括至少一个麦克风和至少一个图像捕捉设备。如关于图17B所描述的,装置110可以包括麦克风1720。麦克风1720可以被配置为确定用户100的环境中声音的方向性。例如,麦克风1720可以包括一个或多个定向麦克风、麦克风阵列、多端口麦克风等。图17B中所示的麦克风仅作为示例,并且可以使用任何适当数量、配置或位置的麦克风。处理器210可以被配置为区分用户100的环境内的声音并且确定每个声音的近似方向性。例如,使用麦克风阵列1720,处理器210可以对麦克风1720之间个体声音的相对定时或振幅进行比较,以确定相对于装置110的方向性。装置110可以包括诸如相机1730的一个或多个相机,它们可以对应于图像传感器220。相机1730可以被配置为捕捉用户100的周围环境的图像。As described above, the apparatus 110 may comprise at least one microphone and at least one image capture device. The apparatus 110 may comprise the microphone 1720, as described with respect to FIG. 17B. The microphone 1720 may be configured to determine the directionality of sounds in the environment of the user 100. For example, the microphone 1720 may comprise one or more directional microphones, a microphone array, a multi-port microphone, or the like. The microphones shown in FIG. 17B are by way of example only, and any suitable number, configuration, or location of microphones may be used. The processor 210 may be configured to distinguish sounds within the environment of the user 100 and determine an approximate directionality of each sound. For example, using the microphone array 1720, the processor 210 may compare the relative timing or amplitude of an individual sound among the microphones 1720 to determine a directionality relative to the apparatus 110. The apparatus 110 may comprise one or more cameras, such as the camera 1730, which may correspond to the image sensor 220. The camera 1730 may be configured to capture images of the surroundings of the user 100.
装置110可以被配置为识别用户100的环境中的个体。图20A是示出符合本公开的使用具有语音和/或图像识别的助听器的示例性环境的示意图。装置110可以被配置为识别与用户100的环境内的个体2010相关联的面部2011或语音2012。例如,装置110可以被配置为使用相机1730来捕捉用户100的周围环境的一个或多个图像。所捕捉的图像可以包括辨识出的个体2010的表示,该个体2010可以是用户100的朋友、同事、亲戚或先前的熟人。处理器210(和/或处理器210a和210b)可以被配置为使用各种面部识别技术来分析捕捉的图像并检测识别出的用户,如元素2011所表示的。因此,装置110,或具体地存储器550,可以包括一个或多个面部或语音识别组件。The apparatus 110 may be configured to recognize an individual in the environment of the user 100. FIG. 20A is a schematic diagram illustrating an exemplary environment for use of a hearing aid with voice and/or image recognition consistent with the present disclosure. The apparatus 110 may be configured to recognize a face 2011 or a voice 2012 associated with an individual 2010 within the environment of the user 100. For example, the apparatus 110 may be configured to capture one or more images of the surroundings of the user 100 using the camera 1730. The captured images may include a representation of a recognized individual 2010, which may be a friend, colleague, relative, or prior acquaintance of the user 100. The processor 210 (and/or processors 210a and 210b) may be configured to analyze the captured images and detect a recognized user using various facial recognition techniques, as represented by element 2011. Accordingly, the apparatus 110, or specifically the memory 550, may comprise one or more facial or voice recognition components.
图20B示出了包括符合本公开的面部和语音识别组件的装置110的示例性实施例。装置110在图20B中以简化形式示出,并且装置110可以包含附加元件或可以具有替代配置,例如,如图5A-5C所示。存储器550(或550a或550b)可以包括面部识别组件2040和语音识别组件2041。这些组件可以是如图6所示的朝向识别模块601、朝向调整模块602和运动跟踪模块603的替代或补充。组件2040和2041可以包含用于由包括在可穿戴装置中的至少一个处理设备(例如处理器210)执行的软件指令。组件2040和2041仅作为示例被示出在存储器550内,并且可以位于系统内的其他位置。例如,组件2040和2041可以位于听觉接口设备1710中、计算设备120中、远程服务器上或另一关联设备中。FIG. 20B illustrates an exemplary embodiment of the apparatus 110 comprising facial and voice recognition components consistent with the present disclosure. The apparatus 110 is shown in FIG. 20B in a simplified form, and the apparatus 110 may contain additional elements or may have alternative configurations, for example, as shown in FIGS. 5A-5C. The memory 550 (or 550a or 550b) may comprise a facial recognition component 2040 and a voice recognition component 2041. These components may be instead of, or in addition to, the orientation identification module 601, the orientation adjustment module 602, and the motion tracking module 603 as shown in FIG. 6. The components 2040 and 2041 may contain software instructions for execution by at least one processing device (e.g., the processor 210) included in the wearable apparatus. The components 2040 and 2041 are shown within the memory 550 by way of example only, and may be located in other locations within the system. For example, the components 2040 and 2041 may be located in the auditory interface device 1710, in the computing device 120, on a remote server, or in another associated device.
面部识别组件2040可以被配置为识别用户100的环境内的一个或多个面部。例如,面部识别组件2040可以识别个体2010的面部2011上的面部特征,例如眼睛、鼻子、颧骨、下巴或其他特征。面部识别组件2040然后可以分析这些特征的相对大小和位置以识别用户。面部识别组件2040可以利用一种或多种算法来分析检测到的特征,诸如主分量分析(例如,使用本征脸)、线性判别分析、弹性束图匹配(例如,使用Fisher脸)、局部二进制模式直方图(LBPH)、尺度不变特征变换(SIFT)、加速鲁棒特征(SURF)等。还可以使用诸如三维识别、皮肤纹理分析和/或热成像的其他面部识别技术来识别个体。除了面部特征之外的其他特征也可以用于识别,诸如身高、体型或个体2010的其他区别特征。The facial recognition component 2040 may be configured to identify one or more faces within the environment of the user 100. For example, the facial recognition component 2040 may identify facial features on the face 2011 of the individual 2010, such as the eyes, nose, cheekbones, jaw, or other features. The facial recognition component 2040 may then analyze the relative size and position of these features to identify the user. The facial recognition component 2040 may utilize one or more algorithms for analyzing the detected features, such as principal component analysis (e.g., using eigenfaces), linear discriminant analysis, elastic bunch graph matching (e.g., using Fisherfaces), local binary patterns histograms (LBPH), scale-invariant feature transform (SIFT), speeded up robust features (SURF), or the like. Other facial recognition techniques, such as three-dimensional recognition, skin texture analysis, and/or thermal imaging, may also be used to identify individuals. Features other than facial features may also be used for identification, such as the height, body shape, or other distinguishing features of the individual 2010.
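Whichever feature algorithm is used, the comparison against stored data typically reduces to a nearest-descriptor search under a distance threshold. The fixed-length descriptors and the threshold value below are illustrative assumptions, not the disclosed data format of database 2050.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length feature descriptors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def match_face(probe, known_faces, threshold=0.6):
    """Compare a probe face descriptor against stored descriptors (a
    stand-in for database 2050 entries) and return the best-matching
    name, or None if no stored face is close enough."""
    best_name, best_dist = None, float("inf")
    for name, descriptor in known_faces.items():
        dist = euclidean(probe, descriptor)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= threshold else None
```

A descriptor produced by any of the methods above (eigenfaces, LBPH, SIFT, and so on) could be plugged into this comparison; the threshold would be tuned per method.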
面部识别组件2040可以访问与用户100相关联的数据库或数据,以确定检测到的面部特征是否对应于辨识出的个体。例如,处理器210可以访问数据库2050,数据库2050包含关于用户100已知的个体的信息和表示相关联的面部特征或其他识别特征的数据。这样的数据可以包括个体的一个或多个图像,或者表示可用于通过面部识别进行的识别的用户面部的数据。数据库2050可以是能够存储关于一个或多个个体的信息的任何设备,并且可以包括硬盘驱动、固态驱动、网络存储平台、远程服务器等。数据库2050可以位于装置110内(例如,存储器550内)或装置110的外部,如图20B所示。在一些实施例中,数据库2050可以与社交网络平台相关联,例如FacebookTM、LinkedInTM、InstagramTM等。面部识别组件2040还可以访问用户100的联系人列表,诸如用户电话上的联系人列表、基于网络的联系人列表(例如,通过OutlookTM、SkypeTM、GoogleTM、SalesforceTM等)或与听觉接口设备1710相关联的专用联系人列表。The facial recognition component 2040 may access a database or data associated with the user 100 to determine whether the detected facial features correspond to a recognized individual. For example, the processor 210 may access a database 2050 containing information about individuals known to the user 100 and data representing associated facial features or other identifying features. Such data may comprise one or more images of the individuals, or data representative of a face of a user that may be used for identification through facial recognition. The database 2050 may be any device capable of storing information about one or more individuals, and may comprise a hard drive, a solid-state drive, a network storage platform, a remote server, or the like. The database 2050 may be located within the apparatus 110 (e.g., within the memory 550) or external to the apparatus 110, as shown in FIG. 20B. In some embodiments, the database 2050 may be associated with a social network platform, such as Facebook™, LinkedIn™, Instagram™, etc. The facial recognition component 2040 may also access a contact list of the user 100, such as a contact list on the user's phone, a web-based contact list (e.g., through Outlook™, Skype™, Google™, Salesforce™, etc.), or a dedicated contact list associated with the auditory interface device 1710.
In some embodiments, database 2050 may be compiled by device 110 from previous facial recognition analysis. For example, processor 210 may be configured to store data associated with one or more faces identified in images captured by device 110 in database 2050 . Each time a face is detected in an image, the detected facial features or other data may be compared to previously identified faces in database 2050. Facial recognition component 2040 can determine whether the individual is an identified individual of user 100, whether the individual has been previously identified by the system in instances exceeding a certain threshold, whether the individual has been explicitly introduced to device 110, and the like.
在一些实施例中,用户100可以诸如通过web界面、移动设备上的应用程序或通过装置110或相关联的设备访问数据库2050。例如,用户100可以能够选择哪些联系人可由装置110识别,和/或手动删除或添加某些联系人。在一些实施例中,用户或管理员可以能够训练面部识别组件2040。例如,用户100可以具有确认或拒绝由面部识别组件2040做出的识别的选项,这可以提高系统的准确性。随着个体2010正在被识别,这种训练可能会实时发生,或者在以后的某个时候发生。In some embodiments, user 100 may access database 2050, such as through a web interface, an application on a mobile device, or through apparatus 110 or an associated device. For example, user 100 may be able to select which contacts are recognized by device 110, and/or manually delete or add certain contacts. In some embodiments, a user or administrator may be able to train facial recognition component 2040. For example, user 100 may have the option of confirming or rejecting identifications made by facial recognition component 2040, which may improve the accuracy of the system. This training may occur in real-time as the individual 2010 is being identified, or at some point in the future.
其他数据或信息也可以通知面部识别过程。在一些实施例中,如下文进一步详细描述的,处理器210可以使用各种技术来识别个体2010的语音。识别出的语音模式和检测到的面部特征可以单独或组合使用,以确定个体2010被装置110所识别。处理器210还可以确定如上所述的用户视线方向1750,其可以被用于验证个体2010的身份。例如,如果用户100正看向个体2010的方向(特别是长时间的),这可以指示个体2010被用户100所识别,这可用于增加面部识别组件2040或其他识别手段的置信度。Other data or information can also inform the facial recognition process. In some embodiments, the processor 210 may recognize the speech of the individual 2010 using various techniques, as described in further detail below. The recognized speech patterns and detected facial features may be used alone or in combination to determine that the individual 2010 is recognized by the device 110 . The processor 210 can also determine the user's gaze direction 1750 as described above, which can be used to verify the identity of the individual 2010. For example, if user 100 is looking in the direction of individual 2010 (especially for an extended period of time), this may indicate that individual 2010 is recognized by user 100, which may be used to increase confidence in facial recognition component 2040 or other identification means.
处理器210还可以被配置为基于与个体2010的语音相关联的声音的一个或多个检测到的音频特征来确定个体2010是否被用户100所识别。返回到图20A,处理器210可以确定声音2020对应于个体2010的语音2012。处理器210可以对表示由麦克风1720捕捉的声音2020的音频信号进行分析,以确定个体2010是否被用户100所识别。这可以使用语音识别组件2041(图20B)来执行,并且可以包括一个或多个语音识别算法,例如隐式马尔可夫模型、动态时间规整、神经网络或其他技术。语音识别组件和/或处理器210可以访问数据库2050,数据库2050还可以包括一个或多个个体的声纹。语音识别组件2041可以对表示声音2020的音频信号进行分析以确定语音2012是否与数据库2050中的个体的声纹匹配。因此,数据库2050可以包含与多个个体相关联的声纹数据,类似于上述存储的面部识别数据。在确定匹配之后,可以将个体2010确定为用户100的辨识出的个体。该过程可以单独使用,或者与上述面部识别技术结合使用。The processor 210 may also be configured to determine whether the individual 2010 is recognized by the user 100 based on one or more detected audio characteristics of sounds associated with the individual 2010's speech. Returning to FIG. 20A, processor 210 may determine that sound 2020 corresponds to speech 2012 of individual 2010. The processor 210 may analyze the audio signal representing the sound 2020 captured by the microphone 1720 to determine whether the individual 2010 is recognized by the user 100. This can be performed using speech recognition component 2041 (FIG. 20B), and can include one or more speech recognition algorithms, such as hidden Markov models, dynamic time warping, neural networks, or other techniques. The speech recognition component and/or processor 210 may access database 2050, which may also include the voiceprints of one or more individuals. Speech recognition component 2041 can analyze the audio signal representing sound 2020 to determine whether speech 2012 matches the voiceprint of an individual in database 2050. Thus, database 2050 may contain voiceprint data associated with multiple individuals, similar to the stored facial recognition data described above. After a match is determined, individual 2010 may be determined to be an identified individual of user 100. This process can be used alone or in conjunction with the facial recognition technology described above.
For example, facial recognition component 2040 may be used to identify individual 2010, and speech recognition component 2041 may be used to verify individual 2010, or vice versa.
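The voiceprint match described above can be made concrete with a minimal sketch. Here a voiceprint is assumed to be a fixed-length vector, compared by cosine similarity against enrolled voiceprints; the vectors and the 0.8 threshold are illustrative assumptions, not values from the disclosure.

```python
import math

def cosine_similarity(a, b):
    # Cosine similarity between two voiceprint vectors
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_voiceprint(sample, voiceprints, threshold=0.8):
    """Return the name of the closest enrolled speaker above threshold, or None."""
    best_name, best_score = None, threshold
    for name, enrolled_vec in voiceprints.items():
        score = cosine_similarity(sample, enrolled_vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name

# Hypothetical enrolled voiceprints (e.g., from database 2050)
enrolled = {"individual_2010": [1.0, 0.0, 1.0], "other": [0.0, 1.0, 0.0]}
print(match_voiceprint([0.9, 0.1, 1.1], enrolled))  # close to individual_2010
print(match_voiceprint([0.0, 0.0, 1.0], enrolled))  # no sufficiently close match
```

Real speaker verification systems derive such vectors from trained models rather than raw numbers, but the match-against-database step follows the same pattern.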
在一些实施例中,装置110可以检测不在装置110的视场内的个体的语音。例如,语音可以通过免提电话、从后座或类似的地方听到。在这样的实施例中,在视场中没有说话者的情况下,个体的识别可以仅基于个体的语音。处理器210可以如上所述分析个体的语音,例如,通过确定检测到的声音是否与数据库2050中的个体的声纹匹配。In some embodiments, device 110 may detect the speech of an individual who is not within the field of view of device 110. For example, the speech may be heard over a speakerphone, from the back seat, or the like. In such embodiments, in the absence of a speaker in the field of view, identification of the individual may be based solely on the individual's speech. The processor 210 may analyze the individual's speech as described above, e.g., by determining whether the detected sound matches the individual's voiceprint in the database 2050.
在确定个体2010是用户100的辨识出的个体之后,处理器210可以对与辨识出的个体相关联的音频进行选择性调节。经调节的音频信号可以被发送到听觉接口设备1710,并且因此可以向用户100提供基于辨识出的个体的经调节音频。例如,调节可以包括相对于其他音频信号放大被确定为对应于声音2020(其可对应于个体2010的语音2012)的音频信号。在一些实施例中,放大可以例如通过相对于其他信号处理与声音2020相关联的音频信号来数字化地实现。另外地或者可替代地,可以通过改变麦克风1720的一个或多个参数来实现放大,以聚焦于与个体2010相关联的音频声音。例如,麦克风1720可以是定向麦克风,处理器210可以执行将麦克风1720聚焦在声音2020上的操作。可以使用用于放大声音2020的各种其他技术,诸如使用波束成形麦克风阵列、声学望远镜技术等。After determining that individual 2010 is an identified individual of user 100, processor 210 may selectively adjust audio associated with the identified individual. The adjusted audio signal may be sent to auditory interface device 1710, and thus adjusted audio based on the identified individual may be provided to user 100. For example, adjusting may include amplifying an audio signal determined to correspond to sound 2020 (which may correspond to speech 2012 of individual 2010 ) relative to other audio signals. In some embodiments, the amplification may be accomplished digitally, for example, by processing the audio signal associated with the sound 2020 relative to other signals. Additionally or alternatively, amplification may be achieved by changing one or more parameters of the microphone 1720 to focus on audio sounds associated with the individual 2010 . For example, the microphone 1720 may be a directional microphone, and the processor 210 may perform operations to focus the microphone 1720 on the sound 2020. Various other techniques for amplifying sound 2020 may be used, such as the use of beamforming microphone arrays, acoustic telescope techniques, and the like.
在一些实施例中,选择性调节可以包括衰减或抑制从与个体2010不相关联的方向接收的一个或多个音频信号。例如,处理器210可以衰减声音2021和/或2022。类似于声音2020的放大,声音的衰减可以通过处理音频信号或通过改变与麦克风1720相关联的一个或多个参数来发生,以指引焦点离开与个体2010不相关联的声音。In some embodiments, the selective adjustment may include attenuating or suppressing one or more audio signals received from directions not associated with the individual 2010 . For example, processor 210 may attenuate sound 2021 and/or 2022. Similar to the amplification of sound 2020, attenuation of sound may occur by processing the audio signal or by changing one or more parameters associated with microphone 1720 to direct focus away from sounds not associated with individual 2010.
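The amplify/attenuate behavior described in the last two paragraphs can be sketched digitally. This is a toy illustration under stated assumptions: each source has already been separated into its own named sample stream, and fixed gain factors stand in for whatever conditioning the processor applies.

```python
# Illustrative gain factors (assumptions, not disclosed values)
AMPLIFY = 2.0     # applied to the recognized individual's sound
ATTENUATE = 0.25  # applied to all other sounds

def condition(signals, target_id):
    """Scale the target signal up and every other signal down."""
    out = {}
    for name, samples in signals.items():
        gain = AMPLIFY if name == target_id else ATTENUATE
        out[name] = [s * gain for s in samples]
    return out

mixed = {"sound_2020": [0.1, -0.2, 0.3],   # recognized individual's speech
         "sound_2021": [0.4, 0.4, 0.4],    # background
         "sound_2022": [-0.1, 0.0, 0.1]}   # background
conditioned = condition(mixed, "sound_2020")
print(conditioned["sound_2020"])  # [0.2, -0.4, 0.6]
```

In practice the amplification could instead be achieved acoustically (directional microphones, beamforming), as the text notes; the digital scaling above corresponds to the "processing the audio signal" alternative.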
选择性调节还可以包括确定个体2010是否正在讲话。例如,处理器210可以被配置为分析包含个体2010的表示的图像或视频,以例如基于辨识出的个体的唇部的被检测到的运动来确定个体2010何时在说话。这也可以通过分析由麦克风1720接收的音频信号来确定,例如通过检测个体2010的语音2012来确定。在一些实施例中,可以基于辨识出的个体是否在说话而动态地发生(启动和/或终止)选择性调节。The selective adjustment may also include determining whether the individual 2010 is speaking. For example, the processor 210 may be configured to analyze images or video containing a representation of the individual 2010 to determine when the individual 2010 is speaking, e.g., based on detected movement of the identified individual's lips. This may also be determined by analyzing the audio signal received by the microphone 1720, for example by detecting the speech 2012 of the individual 2010. In some embodiments, the selective adjustment may occur dynamically (initiated and/or terminated) based on whether or not the identified individual is speaking.
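The dynamic start/stop behavior above can be sketched as a per-frame schedule. The lip-motion score, its threshold, and the short hold-over (to avoid rapid toggling between words) are all illustrative assumptions.

```python
# Assumed parameters for this sketch
LIP_MOTION_THRESHOLD = 0.5
HOLD_FRAMES = 2  # keep conditioning active briefly after lips stop moving

def conditioning_schedule(lip_motion_scores):
    """Return, per frame, whether selective conditioning should be active."""
    active, hold = [], 0
    for score in lip_motion_scores:
        if score > LIP_MOTION_THRESHOLD:
            hold = HOLD_FRAMES          # speaking detected: (re)arm the hold
        active.append(hold > 0)
        hold = max(0, hold - 1)
    return active

scores = [0.1, 0.7, 0.8, 0.2, 0.1, 0.0]
print(conditioning_schedule(scores))
```

An audio-based voice activity detector could replace or supplement the lip-motion score, per the paragraph above, without changing the gating structure.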
在一些实施例中,调节还可以包括改变对应于声音2020的一个或多个音频信号的音调,以使该声音对于用户100更易感知。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以调整声音2020的音高。在一些实施例中,处理器210可以被配置为改变与一个或多个音频信号相关联的语速。例如,声音2020可以被确定为对应于个体2010的语音2012。处理器210可以被配置为改变个体2010的语速,以使检测到的语音对于用户100更易感知。可以执行各种其他处理,诸如修改声音2020的语速以维持与原始音频信号相同的音高,或者降低音频信号内的噪声。In some embodiments, the adjustment may also include changing the pitch of one or more audio signals corresponding to sound 2020 to make the sound more perceptible to user 100. For example, user 100 may have lower sensitivity to tones within a certain range, and the adjustment of the audio signal may shift the pitch of sound 2020. In some embodiments, the processor 210 may be configured to change the rate of speech associated with one or more audio signals. For example, sound 2020 may be determined to correspond to speech 2012 of individual 2010. The processor 210 may be configured to change the rate of speech of individual 2010 to make the detected speech more perceptible to user 100. Various other processing may be performed, such as modifying the rate of sound 2020 while maintaining the same pitch as the original audio signal, or reducing noise within the audio signal.
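Rate adjustment can be made concrete with a deliberately naive sketch. Note the caveat: simple resampling, shown here, changes duration and pitch together, whereas the paragraph above contemplates changing rate while preserving pitch, which in practice requires a pitch-preserving time-stretch (e.g., phase-vocoder-style processing). The sketch only illustrates the idea of slowing playback.

```python
def resample(samples, rate):
    """Linear-interpolation resampling: rate < 1.0 slows playback,
    rate > 1.0 speeds it up (pitch shifts along with duration)."""
    out, pos = [], 0.0
    while pos < len(samples) - 1:
        i = int(pos)
        frac = pos - i
        # Interpolate between neighboring samples at fractional position
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
        pos += rate
    return out

speech = [0.0, 1.0, 0.0, -1.0, 0.0]
slowed = resample(speech, 0.5)   # roughly twice as long
print(len(speech), len(slowed))
```

A hearing-assistance system would apply such processing per short frame of the separated speech signal rather than to a whole recording.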
在一些实施例中,处理器210可以确定与个体2010相关联的区域2030。区域2030可以与个体2010相对于装置110或用户100的方向相关联。个体2010的方向可以使用上述方法使用相机1730和/或麦克风1720来确定。如图20A所示,区域2030可以基于所确定的个体2010的方向由方向的锥体或范围来定义。如图20A所示,角度范围可以由角度θ来定义。角度θ可以是用于定义调节用户100的环境内的声音的范围的任何合适的角度(例如,10度、20度、45度)。区域2030可以随着个体2010的位置相对于装置110的改变而动态计算。例如,当用户100转向时,或者如果个体2010在环境内移动,处理器210可以被配置为跟踪环境内的个体2010并动态更新区域2030。区域2030可以用于选择性调节,例如通过放大与区域2030相关联的声音和/或衰减被确定为从区域2030之外发出的声音。In some embodiments, the processor 210 may determine the area 2030 associated with the individual 2010. Area 2030 may be associated with the direction of individual 2010 relative to device 110 or user 100. The direction of the individual 2010 may be determined using the camera 1730 and/or the microphone 1720 using the methods described above. As shown in FIG. 20A, region 2030 may be defined by a cone or range of directions based on the determined direction of the individual 2010. As shown in FIG. 20A, the angular range may be defined by an angle θ. The angle θ may be any suitable angle (e.g., 10 degrees, 20 degrees, 45 degrees) for defining the range within which sounds in the environment of user 100 are adjusted. The area 2030 may be calculated dynamically as the location of the individual 2010 relative to the device 110 changes. For example, as the user 100 turns, or if the individual 2010 moves within the environment, the processor 210 may be configured to track the individual 2010 within the environment and update the area 2030 dynamically. Region 2030 may be used for selective adjustment, for example by amplifying sounds associated with region 2030 and/or attenuating sounds determined to be emanating from outside region 2030.
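The region-2030 membership test above reduces to an angular comparison, sketched below. Bearings in degrees and the 20-degree cone width are illustrative assumptions; the modular arithmetic handles wrap-around at 0°/360° as the user turns.

```python
def in_region(sound_bearing_deg, individual_bearing_deg, theta_deg=20.0):
    """True if a sound arrives within theta/2 of the individual's direction."""
    # Signed smallest angular difference in (-180, 180]
    diff = (sound_bearing_deg - individual_bearing_deg + 180.0) % 360.0 - 180.0
    return abs(diff) <= theta_deg / 2.0

print(in_region(5.0, 0.0))    # inside a 20-degree cone centered on the individual
print(in_region(45.0, 0.0))   # outside the cone
print(in_region(355.0, 0.0))  # wrap-around near 0 degrees is handled
```

Recomputing `individual_bearing_deg` each frame (from the camera and/or microphone array) gives the dynamic tracking behavior the paragraph describes.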
然后可以将经调节的音频信号发送到听觉接口设备1710,并为用户100产生音频信号。因此,在经调节的音频信号中,声音2020(特别是语音2012)可以比声音2021和2022更响亮和/或更容易区分,声音2021和2022可以表示环境内的背景噪声。The conditioned audio signal may then be sent to auditory interface device 1710 and an audio signal generated for user 100 . Thus, in the conditioned audio signal, sound 2020 (particularly speech 2012) may be louder and/or more distinguishable than sounds 2021 and 2022, which may represent background noise within the environment.
在一些实施例中,处理器210可以基于捕捉的图像或视频执行进一步的分析,以确定如何选择性地调节与辨识出的个体相关联的音频信号。在一些实施例中,处理器210可以分析所捕捉的图像以相对于其他选择性地调节与该个体相关联的音频。例如,处理器210可以基于图像来确定辨识出的个体相对于用户的方向,并且可以基于该方向来确定如何选择性地调节与该个体相关联的音频信号。如果辨识出的个体站在用户前面,则与该个体相关联的音频可以相对于与站在用户一侧的个体相关联的音频被放大。类似地,处理器210可以基于与用户的接近度来选择性地调节与该个体相关联的音频信号。处理器210可以基于捕捉的图像来确定从用户到每个个体的距离,并且可以基于该距离来选择性地调节与这些个体相关联的音频信号。例如,离用户较近的个体可能比离用户较远的个体优先级更高。在一些实施例中,也可以考虑用户的观看方向与个体之间的角度。In some embodiments, the processor 210 may perform further analysis based on the captured images or video to determine how to selectively adjust the audio signal associated with the identified individual. In some embodiments, the processor 210 may analyze the captured images to selectively adjust audio associated with the individual relative to others. For example, the processor 210 may determine the orientation of the identified individual relative to the user based on the images, and may determine how to selectively adjust the audio signal associated with the individual based on the orientation. If the identified individual is standing in front of the user, the audio associated with that individual may be amplified (or otherwise selectively adjusted) relative to the audio associated with an individual standing at the user's side. Similarly, the processor 210 may selectively adjust the audio signal associated with the individual based on proximity to the user. The processor 210 may determine the distance from the user to each individual based on the captured images, and may selectively adjust audio signals associated with these individuals based on the distance. For example, individuals closer to the user may have higher priority than individuals further away from the user. In some embodiments, the angle between the user's viewing direction and the individual may also be considered.
For example, individuals positioned at a smaller angle relative to the user's line of sight may be prioritized over individuals positioned at a larger angle to the user's line of sight.
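The distance-and-angle prioritization above can be sketched as a scoring function. The particular weighting is an illustrative assumption; the disclosure only requires that nearer individuals and individuals closer to the line of sight rank higher.

```python
def priority(distance_m, angle_deg):
    """Higher score = higher conditioning priority.
    Score decays with distance and with angle off the line of sight."""
    return 1.0 / (1.0 + distance_m) + 1.0 / (1.0 + abs(angle_deg) / 10.0)

# Hypothetical individuals: (distance in meters, angle from gaze in degrees)
individuals = {"near_center": (1.0, 5.0),
               "near_side":   (1.0, 60.0),
               "far_center":  (6.0, 5.0)}
ranked = sorted(individuals, key=lambda k: priority(*individuals[k]), reverse=True)
print(ranked)  # near_center ranks first
```

The top-ranked individual would then receive the strongest amplification (or be the sole target of selective conditioning).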
在一些实施例中,与辨识出的个体相关联的音频信号的选择性调节可以基于用户环境中的个体的身份。例如,在图像中检测到多个个体的情况下,处理器210可以如上所述使用一种或多种面部识别技术来识别个体。与用户100已知的个体相关联的音频信号可以被选择性地放大或以其他方式调节以具有相对于未知个体的优先权。例如,处理器210可以被配置为衰减或静音与用户环境中的旁观者(诸如嘈杂的办公室同事等)相关联的音频信号。在一些实施例中,处理器210还可以确定个体的层次结构并基于个体的相对状态给予优先权。该层次结构可以基于个体在家庭或组织(例如,公司、运动队、俱乐部等)中相对于用户的位置。例如,用户的老板可能比同事或维护团队的成员排名更高,因此可能在选择性调节过程中具有优先权。在一些实施例中,可以基于列表或数据库来确定层次结构。被系统辨识出的个体可以被单独排序或分组为几层优先级。该数据库可以专门为此目的而维护,也可以从外部访问。In some embodiments, the selective adjustment of the audio signal associated with the identified individual may be based on the identity of the individual in the user's environment. For example, where multiple individuals are detected in the images, the processor 210 may identify the individuals using one or more facial recognition techniques as described above. Audio signals associated with individuals known to user 100 may be selectively amplified or otherwise adjusted to have priority over unknown individuals. For example, the processor 210 may be configured to attenuate or mute audio signals associated with bystanders in the user's environment (such as noisy office colleagues, etc.). In some embodiments, the processor 210 may also determine a hierarchy of individuals and give priority based on the relative status of the individuals. The hierarchy may be based on the individual's position relative to the user within a family or organization (e.g., company, sports team, club, etc.). For example, a user's boss may be ranked higher than a co-worker or a member of the maintenance team and thus may have priority in the selective adjustment process. In some embodiments, the hierarchy may be determined based on a list or database. Individuals recognized by the system may be ranked individually or grouped into tiers of priority. The database may be maintained specifically for this purpose, or may be accessed externally.
For example, the database may be associated with a user's social network (e.g., Facebook™, LinkedIn™, etc.), and individuals may be prioritized based on their groupings or relationship to the user. For example, individuals identified as "close friends" or family members may take precedence over the user's acquaintances.
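The tiered prioritization by relationship described above can be sketched as a lookup table. The tier names, their ordering, and the contact data are hypothetical; a real system might populate them from a contact list or social network grouping, as the text suggests.

```python
# Assumed priority tiers: lower number = higher priority
TIERS = {"family": 0, "close_friend": 1, "colleague": 2, "acquaintance": 3}

def priority_order(people):
    """Sort (name, relationship) pairs, highest-priority tier first.
    Unknown relationships sort last."""
    return sorted(people, key=lambda p: TIERS.get(p[1], len(TIERS)))

people = [("Dana", "acquaintance"), ("Avi", "family"),
          ("Noa", "close_friend"), ("Unknown", "stranger")]
print(priority_order(people))
```

Combined with the distance/angle scoring sketched earlier, the tier could act as a primary sort key with geometry as a tiebreaker, though the disclosure leaves the exact combination open.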
选择性调节可以基于根据所捕捉的图像确定的一个或多个个体的确定的行为。在一些实施例中,处理器210可以被配置为确定图像中的个体的视线方向。因此,选择性调节可以基于其他个体对辨识出的个体的行为。例如,处理器210可以选择性地调节与一个或多个其他个体正在看着的第一个体相关联的音频。如果个体的注意力转移到第二个体,则处理器210可以随后切换到选择性地调节与第二个体相关联的音频。在一些实施例中,处理器210可以被配置为基于辨识出的个体是对用户说话还是对另一个体说话来选择性地调节音频。例如,当辨识出的个体正在对用户说话时,选择性调节可以包括相对于从与辨识出的个体相关联的区域之外的方向接收的其他音频信号放大与辨识出的个体相关联的音频信号。当辨识出的个体正在对另一个体说话时,选择性调节可以包括相对于从与辨识出的个体相关联的区域之外的方向接收的其他音频信号衰减该音频信号。Selective adjustment may be based on the determined behavior of one or more individuals determined from the captured images. In some embodiments, the processor 210 may be configured to determine the gaze direction of an individual in the images. Thus, selective conditioning can be based on the behavior of other individuals toward the identified individual. For example, the processor 210 may selectively adjust audio associated with a first individual at whom one or more other individuals are looking. If the individuals' attention is diverted to a second individual, the processor 210 may then switch to selectively adjusting the audio associated with the second individual. In some embodiments, the processor 210 may be configured to selectively adjust the audio based on whether the identified individual is speaking to the user or to another individual. For example, when the identified individual is speaking to the user, the selective adjustment may include amplifying the audio signal associated with the identified individual relative to other audio signals received from directions outside the area associated with the identified individual. When the identified individual is speaking to another individual, the selective adjustment may include attenuating the audio signal relative to other audio signals received from directions outside the area associated with the identified individual.
在一些实施例中,处理器210可以访问个体的一个或多个声纹,这可以促进个体2010的语音2012相对于其他声音或语音的选择性调节。有了说话者的声纹,特别是高质量的声纹,可以提供快速和有效的说话者分离。例如,当用户单独说话时,优选地在安静的环境中,可以收集高质量的声纹。通过具有一个或多个说话者的声纹,可以使用滑动时间窗口几乎实时地(例如以最小延迟)分离正在进行的语音信号。延迟可以是例如10毫秒、20毫秒、30毫秒、50毫秒、100毫秒等。根据声纹的质量、捕捉的音频的质量、说话者与其他说话者之间的特性差异、可用的处理资源、所需的分离质量等,可以选择不同的时间窗口。在一些实施例中,可以从个体单独说话的对话的片段中提取声纹,然后用于稍后在对话中分离个体的语音,无论个体的声音是否被识别出。In some embodiments, the processor 210 may access one or more voiceprints of individuals, which may facilitate selective adjustment of speech 2012 of individual 2010 relative to other sounds or voices. Having a speaker's voiceprint, and a high-quality voiceprint in particular, may provide fast and efficient speaker separation. A high-quality voiceprint may be collected, for example, when the user speaks alone, preferably in a quiet environment. By having the voiceprints of one or more speakers, an ongoing speech signal may be separated in near real time (e.g., with minimal delay) using a sliding time window. The delay may be, for example, 10 milliseconds, 20 milliseconds, 30 milliseconds, 50 milliseconds, 100 milliseconds, and the like. Different time windows may be selected depending on the quality of the voiceprint, the quality of the captured audio, the difference in characteristics between the speaker and other speakers, the available processing resources, the required separation quality, and so on. In some embodiments, a voiceprint may be extracted from a segment of a conversation in which an individual speaks alone, and then used to separate the individual's speech later in the conversation, whether or not the individual's voice is recognized.
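The sliding time window mentioned above can be sketched as overlapping frames over the incoming signal; each window is processed as soon as it is full, which is what bounds the latency. Window and hop sizes here are illustrative, chosen only to match the millisecond scale the text mentions.

```python
def sliding_windows(samples, window, hop):
    """Return successive overlapping windows over the signal."""
    out = []
    for start in range(0, len(samples) - window + 1, hop):
        out.append(samples[start:start + window])
    return out

# Toy example: a 30 ms window with a 10 ms hop at a 1 kHz toy sample rate,
# so 100 samples represent 100 ms of audio.
signal = list(range(100))
windows = sliding_windows(signal, window=30, hop=10)
print(len(windows), windows[0][:3], windows[1][:3])
```

At a real sample rate (e.g., 16 kHz) the same structure applies with proportionally larger window and hop lengths; the hop size sets how often separated output is emitted.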
可以如下执行分离语音:可以从单个说话者的干净音频中提取频谱特征(也称为频谱属性、频谱包络或频谱图),并将其馈送到预先训练的第一神经网络中,该第一神经网络基于提取的特征来生成或更新说话者的语音的签名。该音频可以是例如一秒钟的干净的语音。输出签名可以是表示说话者的语音的矢量,使得该矢量与从同一说话者的语音中提取的另一矢量之间的距离通常小于该矢量与从另一说话者的声音中提取的矢量之间的距离。说话者的模型可以从捕捉的音频中预先生成。可替代地或附加地,该模型可以在其中只有说话者在说话的音频片段之后生成,在该片段之后是在其中听到该说话者和另一说话者(或背景噪声)并且需要分离的另一个片段。Separating speech may be performed as follows: spectral features (also referred to as spectral attributes, spectral envelope, or spectrogram) may be extracted from clean audio of a single speaker and fed into a pre-trained first neural network, which generates or updates a signature of the speaker's speech based on the extracted features. The audio may be, for example, one second of clean speech. The output signature may be a vector representing the speaker's speech, such that the distance between this vector and another vector extracted from the same speaker's speech is typically smaller than the distance between this vector and a vector extracted from a different speaker's speech. The speaker's model may be pre-generated from captured audio. Alternatively or additionally, the model may be generated from an audio segment in which only the speaker is speaking, followed by another segment in which the speaker and a different speaker (or background noise) are heard and need to be separated.
然后,为了从噪声音频中的另外说话者或背景噪声中分离出说话者的语音,第二预训练神经网络可以接收噪声音频和说话者的签名,并输出从噪声音频中提取出的说话者的语音的音频(也可以表示为频谱属性),该音频与其他语音或背景噪声分离。将理解的是,相同的或附加的神经网络可以用于分离多个说话者的语音。例如,如果有两个可能的说话者,可以激活两个神经网络,每个神经网络具有相同的噪声音频输入和两个说话者中的一个的模型。可替代地,神经网络可以接收两个或多个说话者的语音签名,并且分别输出每个说话者的语音。因此,该系统可以生成两个或更多个不同的音频输出,每个音频输出包括相应说话者的语音。在一些实施例中,如果分离是不可能的,则可以只从背景噪声中清除输入语音。Then, to separate the speaker's speech from additional speakers or background noise in noisy audio, a second pre-trained neural network may receive the noisy audio and the speaker's signature, and output the audio of the speaker's speech extracted from the noisy audio (which may also be represented as spectral attributes), separated from other speech or background noise. It will be appreciated that the same or additional neural networks may be used to separate the speech of multiple speakers. For example, if there are two possible speakers, two neural networks may be activated, each receiving the same noisy audio and a model of one of the two speakers. Alternatively, a neural network may receive the voice signatures of two or more speakers and output the speech of each speaker separately. Thus, the system may generate two or more different audio outputs, each including the speech of a corresponding speaker. In some embodiments, if separation is not possible, the input speech may simply be cleaned of background noise.
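The two-stage data flow above (signature extraction, then signature-conditioned separation) can be mirrored with a deliberately simplified stand-in. Everything below is an assumption-laden toy: the "signature" is just an average spectral frame, and "separation" keeps noisy frames resembling that signature. Real systems use trained neural networks for both stages; only the shape of the pipeline is faithful here.

```python
def build_signature(clean_frames):
    """Stage 1 stand-in: average the clean speaker's spectral frames."""
    n = len(clean_frames)
    return [sum(f[i] for f in clean_frames) / n
            for i in range(len(clean_frames[0]))]

def similarity(frame, signature):
    # Cosine similarity between a spectral frame and the signature
    dot = sum(a * b for a, b in zip(frame, signature))
    norm = sum(a * a for a in frame) ** 0.5 * sum(b * b for b in signature) ** 0.5
    return dot / norm if norm else 0.0

def separate(noisy_frames, signature, threshold=0.9):
    """Stage 2 stand-in: keep frames matching the signature, zero the rest."""
    return [f if similarity(f, signature) >= threshold else [0.0] * len(f)
            for f in noisy_frames]

clean = [[1.0, 0.0], [0.9, 0.1]]              # target speaker alone
sig = build_signature(clean)
noisy = [[1.0, 0.1], [0.0, 1.0], [0.8, 0.0]]  # target, other speaker, target
print(separate(noisy, sig))                   # middle frame is suppressed
```

Running two such separators, each conditioned on a different speaker's signature over the same noisy input, corresponds to the two-speaker arrangement described above.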
图21是示出符合所公开实施例的用于选择性地放大与辨识出的个体的语音相关联的音频信号的示例性过程2100的流程图。过程2100可以由与装置110相关联的一个或多个处理器(诸如处理器210)来执行。在一些实施例中,过程2100的一些或全部可以在装置110外部的处理器上执行。换句话说,执行过程2100的处理器可以与麦克风1720和相机1730一起包括在相同公共外壳中,或者可以包括在第二外壳中。例如,过程2100的一个或多个部分可以由听觉接口设备1710或诸如计算设备120的辅助设备中的处理器来执行。21 is a flowchart illustrating an exemplary process 2100 for selectively amplifying an audio signal associated with a recognized individual's speech, consistent with the disclosed embodiments. Process 2100 may be performed by one or more processors associated with apparatus 110, such as processor 210. In some embodiments, some or all of process 2100 may be performed on a processor external to device 110 . In other words, the processor performing process 2100 may be included in the same common housing as microphone 1720 and camera 1730, or may be included in a second housing. For example, one or more portions of process 2100 may be performed by a processor in auditory interface device 1710 or an auxiliary device such as computing device 120 .
在步骤2110中,过程2100可以包括从用户的环境接收由相机捕捉的多个图像。图像可以由诸如装置110的相机1730的可穿戴相机捕捉。在步骤2112中,过程2100可以包括识别在多个图像中的至少一个中的被辨识出的个体的表示。如上所述,个体2010可以由处理器210使用面部识别组件2040来识别。例如,个体2010可能是用户的朋友、同事、亲戚或以前的熟人。处理器210可以基于与个体相关联的一个或多个检测到的面部特征来确定在多个图像中的至少一个中表示的个体是否是被辨识出的个体。如上所述,处理器210还可以基于被确定为与个体的语音相关联的声音的一个或多个检测到的音频特征来确定是否识别出该个体。In step 2110, the process 2100 may include receiving a plurality of images captured by the camera from the user's environment. The images may be captured by a wearable camera such as camera 1730 of device 110 . In step 2112, the process 2100 can include identifying a representation of the identified individual in at least one of the plurality of images. As described above, individual 2010 can be identified by processor 210 using facial recognition component 2040. For example, individual 2010 may be a friend, colleague, relative, or former acquaintance of the user. Processor 210 may determine whether an individual represented in at least one of the plurality of images is an identified individual based on one or more detected facial features associated with the individual. As described above, the processor 210 may also determine whether to identify an individual based on one or more detected audio characteristics of sounds determined to be associated with the individual's speech.
在步骤2114中,过程2100可以包括接收表示由麦克风捕捉的声音的音频信号。例如,装置110可以接收表示由麦克风1720捕捉的声音2020、2021和2022的音频信号。因此,如上所述,麦克风可以包括定向麦克风、麦克风阵列、多端口麦克风或各种其他类型的麦克风。在一些实施例中,麦克风和可穿戴相机可以包括在公共外壳(诸如装置110的外壳)中。执行过程2100的一个或多个处理器也可以包括在该外壳(例如,处理器210)中,或者可以包括在第二外壳中。在使用第二外壳的情况下,处理器可以被配置为经由无线链路(例如,蓝牙TM、NFC等)从公共外壳接收图像和/或音频信号。因此,公共外壳(例如,装置110)和第二外壳(例如,计算设备120)还可以包括发送器、接收器和/或各种其他通信组件。In step 2114, process 2100 can include receiving an audio signal representing the sound captured by the microphone. For example, device 110 may receive audio signals representing sounds 2020, 2021, and 2022 captured by microphone 1720. Thus, as described above, the microphones may include directional microphones, microphone arrays, multi-port microphones, or various other types of microphones. In some embodiments, the microphone and wearable camera may be included in a common housing, such as that of device 110. One or more processors that perform process 2100 may also be included in that housing (e.g., processor 210), or may be included in a second housing. Where a second housing is used, the processor may be configured to receive image and/or audio signals from the common housing via a wireless link (e.g., Bluetooth™, NFC, etc.). Thus, the common housing (e.g., apparatus 110) and the second housing (e.g., computing device 120) may also include transmitters, receivers, and/or various other communication components.
在步骤2116中,过程2100可以包括对由至少一个麦克风从与至少一个辨识出的个体相关联的区域接收的至少一个音频信号进行选择性调节。如上所述,该区域可以基于根据多个图像或音频信号中的一个或多个所确定的辨识出的个体的方向来确定。该范围可以与关于辨识出的个体的方向的角宽度(例如,10度、20度、45度等)相关联。In step 2116, process 2100 can include selectively adjusting at least one audio signal received by at least one microphone from a region associated with at least one identified individual. As described above, the region may be determined based on the direction of the identified individual determined from one or more of the plurality of images or audio signals. The range may be associated with an angular width (e.g., 10 degrees, 20 degrees, 45 degrees, etc.) with respect to the direction of the identified individual.
如上所述,可以对音频信号执行各种形式的调节。在一些实施例中,调节可以包括改变音频信号的音调或重放速度。例如,调节可以包括改变与音频信号相关联的语速。在一些实施例中,调节可以包括相对于从与辨识出的个体相关联的区域之外接收的其他音频信号对该音频信号进行放大。可以通过各种手段来执行放大,诸如操作配置为聚焦于从该区域发出的音频声音的定向麦克风,或者改变与麦克风相关联的一个或多个参数以使该麦克风聚焦于从该区域发出的音频声音。放大可以包括衰减或抑制由麦克风从该区域之外的方向接收的一个或多个音频信号。在一些实施例中,步骤2116还可以包括基于对多个图像的分析来确定辨识出的个体正在说话,并基于辨识出的个体正在说话的确定来触发选择性调节。例如,可以基于辨识出的个体的唇部的检测到的运动来确定辨识出的个体正在说话。在一些实施例中,选择性调节可以基于如上所述的捕捉图像的进一步分析,例如,基于辨识出的个体的方向或邻近性、辨识出的个体的身份、其他个体的行为等。As mentioned above, various forms of conditioning can be performed on the audio signal. In some embodiments, adjusting may include changing the pitch or playback speed of the audio signal. For example, adjusting may include changing the rate of speech associated with the audio signal. In some embodiments, conditioning may include amplifying the audio signal relative to other audio signals received from outside the region associated with the identified individual. Amplification may be performed by various means, such as operating a directional microphone configured to focus on audio sounds emanating from the area, or changing one or more parameters associated with the microphone to focus the microphone on audio emanating from the area sound. Amplifying may include attenuating or suppressing one or more audio signals received by the microphone from directions outside the area. In some embodiments, step 2116 may also include determining that the identified individual is speaking based on the analysis of the plurality of images, and triggering selective adjustments based on the determination that the identified individual is speaking. For example, it may be determined that the identified individual is speaking based on the detected movement of the identified individual's lips. In some embodiments, selective adjustments may be based on further analysis of the captured images as described above, eg, based on the orientation or proximity of the identified individual, the identity of the identified individual, the behavior of other individuals, and the like.
在步骤2118中,过程2100可以包括使至少一个经调节的音频信号传输到被配置为向用户的耳朵提供声音的听觉接口设备。例如,经调节的音频信号可以被发送到听觉接口设备1710,其可向用户100提供对应于该音频信号的声音。执行过程2100的处理器还可以被配置为使表示背景噪声的一个或多个音频信号被传输到听觉接口设备,该背景噪声可以相对于至少一个经调节的音频信号被衰减。例如,处理器210可以被配置为发送对应于声音2020、2021和2022的音频信号。然而,基于声音2020处于区域2030内的确定,与2020相关联的信号可以相对于声音2021和2022被放大。在一些实施例中,听觉接口设备1710可以包括与听筒相关联的扬声器。例如,听觉接口设备1710可以至少部分地插入用户的耳朵中,用于向用户提供音频。听觉接口设备也可以在耳朵外部,诸如耳后听觉设备、一个或多个耳机、小型便携式扬声器等。在一些实施例中,听觉接口设备可以包括骨传导麦克风,其被配置为通过用户头骨的振动向用户提供音频信号。这样的设备可以与使用者的皮肤外部接触放置,或者可以通过外科手术植入并附接到使用者的骨骼上。In step 2118, the process 2100 may include transmitting the at least one conditioned audio signal to an auditory interface device configured to provide sound to a user's ear. For example, the conditioned audio signal may be sent to auditory interface device 1710, which may provide user 100 with sounds corresponding to the audio signal. The processor performing process 2100 may also be configured to cause one or more audio signals representing background noise, which may be attenuated relative to the at least one conditioned audio signal, to be transmitted to the auditory interface device. For example, processor 210 may be configured to transmit audio signals corresponding to sounds 2020 , 2021 and 2022 . However, based on the determination that sound 2020 is within region 2030, the signal associated with sound 2020 may be amplified relative to sounds 2021 and 2022. In some embodiments, auditory interface device 1710 may include a speaker associated with the earpiece. For example, auditory interface device 1710 may be inserted at least partially into a user's ear for providing audio to the user. The auditory interface device may also be external to the ear, such as a behind-the-ear hearing device, one or more earphones, small portable speakers, and the like. In some embodiments, the auditory interface device may include a bone conduction microphone configured to provide audio signals to the user through vibrations of the user's skull. 
Such devices may be placed in external contact with the user's skin, or may be surgically implanted and attached to the user's bone.
除了识别对用户100说话的个体的语音之外,上述系统和方法还可以用于识别用户100的语音。例如,语音识别单元2041可以被配置为分析表示从用户的环境收集的声音的音频信号,以识别用户100的语音。类似于对辨识出的个体的语音的选择性调节,用户100的语音可以被选择性地调节。例如,声音可以由麦克风1720或由诸如移动电话(或链接到移动电话的设备)的另一设备的麦克风来收集。例如,通过放大用户100的语音和/或衰减或消除用户语音以外的全部声音,对应于用户100的语音的音频信号可以被选择性地发送到远程设备。因此,可以收集和/或存储装置110的一个或多个用户的声纹,以促进如上文进一步详细描述的用户的语音的检测和/或隔离。In addition to recognizing the speech of an individual speaking to the user 100, the systems and methods described above may also be used to recognize the speech of the user 100. For example, the speech recognition unit 2041 may be configured to analyze audio signals representing sounds collected from the user's environment to recognize the speech of the user 100 . Similar to the selective adjustment of the recognized individual's speech, the user's 100 speech may be selectively adjusted. For example, the sound may be collected by the microphone 1720 or by the microphone of another device such as a mobile phone (or a device linked to a mobile phone). For example, by amplifying the voice of the user 100 and/or attenuating or eliminating all sounds other than the user's voice, an audio signal corresponding to the voice of the user 100 may be selectively sent to the remote device. Accordingly, the voiceprints of one or more users of device 110 may be collected and/or stored to facilitate detection and/or isolation of the user's voice as described in further detail above.
图22是示出符合所公开实施例的用于选择性地发送与识别出的用户的语音相关联的音频信号的示例性过程2200的流程图。过程2200可以由与装置110相关联的一个或多个处理器(诸如处理器210)来执行。FIG. 22 is a flow diagram illustrating an exemplary process 2200 for selectively transmitting audio signals associated with recognized speech of a user, consistent with the disclosed embodiments. Process 2200 may be performed by one or more processors associated with apparatus 110, such as processor 210.
在步骤2210中,过程2200可以包括接收表示由麦克风捕捉的声音的音频信号。例如,装置110可以接收表示由麦克风1720捕捉的声音2020、2021和2022的音频信号。因此,如上所述,麦克风可以包括定向麦克风、麦克风阵列、多端口麦克风或各种其他类型的麦克风。在步骤2212中,过程2200可以包括基于对接收到的音频信号的分析,识别表示识别出的用户的语音的一个或多个语音音频信号。例如,可以基于与用户相关联的声纹来识别用户的语音,该声纹可以存储在存储器550、数据库2050或其他合适的位置中。处理器210可以例如使用语音识别组件2041来识别用户的语音。处理器210可以使用滑动时间窗几乎实时地(例如以最小延迟)分离与用户相关联的正在进行的语音信号。可以通过根据上述方法提取音频信号的频谱特征来分离语音。In step 2210, process 2200 can include receiving an audio signal representing sound captured by the microphone. For example, device 110 may receive audio signals representing sounds 2020, 2021, and 2022 captured by microphone 1720. Thus, as described above, the microphones may include directional microphones, microphone arrays, multi-port microphones, or various other types of microphones. In step 2212, process 2200 may include identifying one or more speech audio signals representing the recognized speech of the user based on analysis of the received audio signals. For example, the user's speech may be identified based on a voiceprint associated with the user, which may be stored in memory 550, database 2050, or another suitable location. The processor 210 may recognize the user's speech, e.g., using the speech recognition component 2041. The processor 210 may separate the ongoing speech signal associated with the user in near real time (e.g., with minimal delay) using a sliding time window. The speech may be separated by extracting spectral features of the audio signal according to the methods described above.
在步骤2214中,过程2200可以包括使表示用户的识别出的语音的一个或多个语音音频信号传输到远程设备。位于远程的设备可以是被配置为通过有线或无线通信形式远程接收音频信号的任何设备。在一些实施例中,位于远程的设备可以是用户的另一设备,诸如移动电话、音频接口设备或另一形式的计算设备。在一些实施例中,语音音频信号可以由位于远程的设备来处理和/或进一步发送。在步骤2216中,过程2200可以包括防止至少一个背景噪声音频信号向位于远程的设备的传输,该至少一个背景噪声音频信号不同于表示用户的识别出的语音的一个或多个语音音频信号。例如,处理器210可以衰减和/或消除与声音2020、2021或2023相关联的音频信号,它们可以表示背景噪声。可以使用上述音频处理技术将用户的语音与其他噪声分离。In step 2214, the process 2200 may include transmitting one or more speech audio signals representing the recognized speech of the user to the remote device. The remotely located device may be any device configured to receive audio signals remotely via wired or wireless communication. In some embodiments, the remotely located device may be another device of the user, such as a mobile phone, audio interface device, or another form of computing device. In some embodiments, the speech audio signal may be processed and/or further transmitted by a remotely located device. In step 2216, process 2200 can include preventing transmission to the remotely located device of at least one background noise audio signal that is different from one or more speech audio signals representing the user's recognized speech. For example, processor 210 may attenuate and/or cancel audio signals associated with sounds 2020, 2021 or 2023, which may represent background noise. The user's speech can be separated from other noise using the audio processing techniques described above.
在示例性说明中,语音音频信号可以由用户佩戴的耳机或其他设备来捕捉。用户的语音可以被识别并与用户环境中的背景噪声隔离。耳机可以将用户语音的经调节的音频信号发送到用户的移动电话。例如,用户可以处在电话呼叫中,并且经调节的音频信号可以由移动电话发送到呼叫的接收者。用户的语音也可以由位于远程的设备来记录。例如,音频信号可以存储在远程服务器或其他计算设备上。在一些实施例中,位于远程的设备可以处理接收到的音频信号,例如,以将识别出的用户的语音转换为文本。In an exemplary illustration, the speech audio signal may be captured by a headset or other device worn by the user. The user's speech can be recognized and isolated from background noise in the user's environment. The headset may transmit the conditioned audio signal of the user's speech to the user's mobile phone. For example, the user may be on a phone call and the conditioned audio signal may be sent by the mobile phone to the recipient of the call. The user's voice may also be recorded by a remotely located device. For example, the audio signal may be stored on a remote server or other computing device. In some embodiments, the remotely located device may process the received audio signal, eg, to convert the recognized speech of the user into text.
唇部跟踪助听器lip tracking hearing aids
与所公开的实施例一致,助听器系统可以基于跟踪的唇部移动来选择性地放大音频信号。助听器系统分析用户环境的捕捉图像以检测个体的唇部并跟踪个体的唇部的运动。所跟踪的唇部移动可以用作选择性地放大由助听器系统接收的音频的提示。例如,确定为与所跟踪的唇部移动同步或与所跟踪的唇部移动一致的语音信号可以被选择性地放大或以其他方式调节。与检测到的唇部移动不相关联的音频信号可以被抑制、衰减、滤波等。Consistent with the disclosed embodiments, the hearing aid system may selectively amplify the audio signal based on the tracked lip movement. The hearing aid system analyzes the captured images of the user's environment to detect the individual's lips and track the movement of the individual's lips. The tracked lip movement can be used as a cue to selectively amplify the audio received by the hearing aid system. For example, speech signals determined to be synchronized with or consistent with tracked lip movement may be selectively amplified or otherwise conditioned. Audio signals not associated with detected lip movements may be suppressed, attenuated, filtered, etc.
用户100可以佩戴符合上述基于相机的助听器设备的助听器设备。例如,助听器设备可以是如图17A所示的听觉接口设备1710。听觉接口设备1710可以是被配置为向用户100提供听觉反馈的任何设备。听觉接口设备1710可以被放置在用户100的一个或两个耳朵中,类似于传统的听觉接口设备。如上所述,听觉接口设备1710可以是各种样式的,包括耳道内、完全耳道内、耳内、耳后、耳上、耳道内接收器、开放安装或各种其他样式。听觉接口设备1710可以包括用于向用户100提供听觉反馈的一个或多个扬声器、用于检测用户100的环境中的声音的麦克风、内部电子设备、处理器、存储器等。在一些实施例中,除了麦克风之外或替代麦克风,听觉接口设备1710可以包括一个或多个通信单元以及一个或多个接收器,用于从装置110接收信号并将信号传送到用户100。听觉接口设备1710可以对应于反馈输出单元230,或者可以与反馈输出单元230分开,并且可以被配置为从反馈输出单元230接收信号。User 100 may wear a hearing aid device consistent with the camera-based hearing aid devices described above. For example, the hearing aid device may be an auditory interface device 1710 as shown in FIG. 17A. Auditory interface device 1710 may be any device configured to provide auditory feedback to user 100. Auditory interface device 1710 may be placed in one or both ears of user 100, similar to conventional hearing interface devices. As described above, auditory interface device 1710 may be of various styles, including in-canal, completely-in-canal, in-ear, behind-the-ear, over-the-ear, receiver-in-canal, open fit, or various other styles. Auditory interface device 1710 may include one or more speakers for providing auditory feedback to user 100, a microphone for detecting sounds in the environment of user 100, internal electronics, a processor, memory, and the like. In some embodiments, in addition to or instead of a microphone, auditory interface device 1710 may include one or more communication units and one or more receivers for receiving signals from apparatus 110 and transmitting the signals to user 100. Auditory interface device 1710 may correspond to feedback output unit 230 or may be separate from feedback output unit 230 and may be configured to receive signals from feedback output unit 230.
在一些实施例中,如图17A所示,听觉接口设备1710可以包括骨传导耳机1711。骨传导耳机1711可以通过外科手术植入,并且可以通过声音振动到内耳的骨传导来向用户100提供可听反馈。听觉接口设备1710还可以包括一个或多个耳机(例如,无线耳机、过耳耳机等)或由用户100携带或佩戴的便携式扬声器。在一些实施例中,听觉接口设备1710可以集成到其他设备中,诸如用户的蓝牙TM耳机、眼镜、头盔(例如,摩托车头盔、自行车头盔等)、帽子等。In some embodiments, the auditory interface device 1710 may include a bone conduction headset 1711, as shown in FIG. 17A. Bone conduction earphones 1711 may be surgically implanted and may provide audible feedback to user 100 through bone conduction of sound vibrations to the inner ear. The auditory interface device 1710 may also include one or more earphones (eg, wireless earphones, over-ear earphones, etc.) or portable speakers carried or worn by the user 100 . In some embodiments, the auditory interface device 1710 may be integrated into other devices, such as the user's Bluetooth ™ headset, glasses, helmets (eg, motorcycle helmets, bicycle helmets, etc.), hats, and the like.
听觉接口设备1710可以被配置为与诸如装置110的相机设备进行通信。这种通信可以通过有线连接,或者可以无线地进行(例如,使用蓝牙TM、NFC或无线通信形式)。如上所述,装置110可以由用户100以各种配置来佩戴,包括物理地连接到衬衫、项链、腰带、眼镜、腕带、纽扣或与用户100相关联的其他物品。在一些实施例中,还可以包括诸如计算设备120的一个或多个附加设备。因此,本文关于装置110或处理器210描述的一个或多个过程或功能可以由计算设备120和/或处理器540执行。Auditory interface device 1710 may be configured to communicate with a camera device such as apparatus 110 . Such communication may be via a wired connection, or may be performed wirelessly (eg, using Bluetooth ™ , NFC or wireless forms of communication). As described above, device 110 may be worn by user 100 in various configurations, including physically attached to a shirt, necklace, belt, eyeglasses, wristband, button, or other item associated with user 100 . In some embodiments, one or more additional devices such as computing device 120 may also be included. Accordingly, one or more of the processes or functions described herein with respect to apparatus 110 or processor 210 may be performed by computing device 120 and/or processor 540 .
如上所述,装置110可以包括至少一个麦克风和至少一个图像捕捉设备。如关于图17B所描述的,装置110可以包括麦克风1720。麦克风1720可以被配置为确定用户100的环境中声音的方向性。例如,麦克风1720可以包括一个或多个定向麦克风、麦克风阵列、多端口麦克风等。处理器210可以被配置为区分用户100的环境内的声音并且确定每个声音的近似方向性。例如,使用麦克风阵列1720,处理器210可以对麦克风1720之间个体声音的相对定时或振幅进行比较,以确定相对于装置110的方向性。装置110可以包括诸如相机1730的一个或多个相机,它们可以对应于图像传感器220。相机1730可以被配置为捕捉用户100的周围环境的图像。装置110还可以使用听觉接口设备1710的一个或多个麦克风,并且因此,本文使用的对麦克风1720的引用也可以是指听觉接口设备1710上的麦克风。As described above, apparatus 110 may include at least one microphone and at least one image capture device. As described with respect to FIG. 17B, apparatus 110 may include microphone 1720. Microphone 1720 may be configured to determine the directionality of sounds in the environment of user 100. For example, microphone 1720 may comprise one or more directional microphones, a microphone array, a multi-port microphone, or the like. Processor 210 may be configured to distinguish sounds within the environment of user 100 and determine an approximate directionality of each sound. For example, using microphone array 1720, processor 210 may compare the relative timing or amplitude of an individual sound among the microphones 1720 to determine a directionality relative to apparatus 110. Apparatus 110 may include one or more cameras, such as camera 1730, which may correspond to image sensor 220. Camera 1730 may be configured to capture images of the surrounding environment of user 100. Apparatus 110 may also use one or more microphones of auditory interface device 1710, and accordingly, references to microphone 1720 used herein may also refer to a microphone on auditory interface device 1710.
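The relative-timing comparison described above can be illustrated with a minimal cross-correlation sketch. The two-microphone setup, the 5 cm spacing, and the function names below are illustrative assumptions, not details from the disclosure.

```python
import numpy as np

def estimate_angle(sig_a, sig_b, fs, mic_spacing, speed_of_sound=343.0):
    """Estimate the arrival angle from the delay between two microphone channels."""
    corr = np.correlate(np.asarray(sig_a, float), np.asarray(sig_b, float), mode="full")
    lag = int(np.argmax(corr)) - (len(sig_b) - 1)   # inter-channel delay in samples
    delay = lag / fs                                # delay in seconds
    # clip to the physically possible range before taking the arcsine
    ratio = np.clip(delay * speed_of_sound / mic_spacing, -1.0, 1.0)
    return float(np.degrees(np.arcsin(ratio)))

fs = 48000
pulse = np.zeros(256)
pulse[100] = 1.0                  # impulse reaching microphone A
delayed = np.roll(pulse, 3)       # the same impulse 3 samples later at microphone B
angle = estimate_angle(pulse, delayed, fs, mic_spacing=0.05)
```

A delay of three samples at 48 kHz over a 5 cm baseline corresponds to an off-axis angle of roughly 25 degrees; a zero delay corresponds to a source directly on the array's broadside axis.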
处理器210(和/或处理器210a和210b)可以被配置为检测与用户100的环境内的个体相关联的嘴和/或唇部。图23A和图23B示出了可以在符合本公开的用户环境中由相机1730捕捉的示例性个体2310。如图23A所示,个体2310可以物理地存在于用户100的环境中。处理器210可以被配置为分析由相机1730捕捉的图像,以检测图像中个体2310的表示。处理器210可以使用面部识别组件,诸如如上所述的面部识别组件2040,来检测和识别用户100的环境中的个体。处理器210可以被配置为检测个体2310的一个或多个面部特征,包括个体2310的唇部2311。因此,处理器210可以使用一种或多种面部识别和/或特征识别技术,如下文进一步描述的。Processor 210 (and/or processors 210a and 210b) may be configured to detect a mouth and/or lips associated with an individual within the environment of user 100. FIGS. 23A and 23B illustrate an exemplary individual 2310 that may be captured by camera 1730 in the environment of a user consistent with the present disclosure. As shown in FIG. 23A, individual 2310 may be physically present in the environment of user 100. Processor 210 may be configured to analyze images captured by camera 1730 to detect a representation of individual 2310 in the images. Processor 210 may use a facial recognition component, such as facial recognition component 2040 described above, to detect and identify individuals in the environment of user 100. Processor 210 may be configured to detect one or more facial features of individual 2310, including lips 2311 of individual 2310. Accordingly, processor 210 may use one or more facial recognition and/or feature recognition techniques, as described further below.
在一些实施例中,处理器210可以从用户100的环境中检测个体2310的可视表示,诸如个体2310的视频。如图23B所示,可以在显示设备2301的显示器上检测到个体2310。显示设备2301可以是能够显示个体的可视表示的任何设备。例如,显示设备可以是个人计算机、膝上型计算机、移动电话、平板电脑、电视、电影屏幕、手持游戏设备、视频会议设备(例如,Facebook Portal™等)、婴儿监视器等。个体2310的可视表示可以是个体2310的实时视频馈送,诸如视频呼叫、会议呼叫、监视视频等。在其他实施例中,个体2310的可视表示可以是预录制的视频或图像,诸如视频消息、电视节目或电影。处理器210可以基于个体2310的可视表示来检测一个或多个面部特征,包括个体2310的嘴2311。In some embodiments, processor 210 may detect a visual representation of individual 2310 from the environment of user 100, such as a video of individual 2310. As shown in FIG. 23B, individual 2310 may be detected on the display of a display device 2301. Display device 2301 may be any device capable of displaying a visual representation of an individual. For example, the display device may be a personal computer, a laptop computer, a mobile phone, a tablet, a television, a movie screen, a handheld gaming device, a video conferencing device (e.g., Facebook Portal™, etc.), a baby monitor, or the like. The visual representation of individual 2310 may be a live video feed of individual 2310, such as a video call, a conference call, a surveillance video, or the like. In other embodiments, the visual representation of individual 2310 may be a prerecorded video or image, such as a video message, a television show, or a movie. Processor 210 may detect one or more facial features based on the visual representation of individual 2310, including mouth 2311 of individual 2310.
图23C示出了符合所公开实施例的示例性唇部跟踪系统。处理器210可以被配置为检测个体2310的一个或多个面部特征,其可以包括但不限于个体的嘴2311。因此,处理器210可以使用一种或多种图像处理技术来识别个体的面部特征,诸如卷积神经网络(CNN)、尺度不变特征变换(SIFT)、定向梯度直方图(HOG)特征或其他技术。在一些实施例中,处理器210可以被配置为检测与个体2310的嘴2311相关联的一个或多个点2320。点2320可以表示个体的嘴的一个或多个特征点,诸如沿着个体的唇部或个体的嘴角的一个或多个点。图23C中所示的点仅用于说明目的,并且应理解的是,可以经由一种或多种图像处理技术来确定或识别用于跟踪个体的唇部的任何点。还可以在各种其他位置检测点2320,包括与个体的牙齿、舌头、脸颊、下巴、眼睛等相关联的点。处理器210可以基于点2320或基于所捕捉的图像来确定嘴2311的一个或多个轮廓(例如,由线或多边形表示)。该轮廓可以表示整个嘴2311,或者可以包括多个轮廓,例如包括表示上唇的轮廓和表示下唇的轮廓。每个唇还可以由多个轮廓来表示,诸如每个唇的上边缘的轮廓和每个唇的下边缘的轮廓。处理器210还可以使用各种其他技术或特性,诸如颜色、边缘、形状或运动检测算法来识别个体2310的唇部。可以在多个帧或图像上跟踪识别出的唇部。处理器210可以使用一种或多种视频跟踪算法,诸如均值漂移跟踪、轮廓跟踪(例如,condensation算法)或各种其他技术。因此,处理器210可以被配置为实时跟踪个体2310的唇部的运动。FIG. 23C illustrates an exemplary lip tracking system consistent with the disclosed embodiments. Processor 210 may be configured to detect one or more facial features of individual 2310, which may include, but are not limited to, the individual's mouth 2311. Accordingly, processor 210 may identify the individual's facial features using one or more image processing techniques, such as convolutional neural networks (CNN), scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG) features, or other techniques. In some embodiments, processor 210 may be configured to detect one or more points 2320 associated with mouth 2311 of individual 2310. Points 2320 may represent one or more characteristic points of the individual's mouth, such as one or more points along the individual's lips or at the corners of the individual's mouth. The points shown in FIG. 23C are for illustration purposes only, and it should be understood that any points for tracking the individual's lips may be determined or identified via one or more image processing techniques. Points 2320 may also be detected at various other locations, including points associated with the individual's teeth, tongue, cheeks, chin, eyes, and the like. Processor 210 may determine one or more contours of mouth 2311 (e.g., represented by lines or polygons) based on points 2320 or based on the captured images. A contour may represent the entire mouth 2311, or multiple contours may be used, including, for example, a contour representing the upper lip and a contour representing the lower lip. Each lip may also be represented by a plurality of contours, such as a contour of the upper edge and a contour of the lower edge of each lip. Processor 210 may also identify the lips of individual 2310 using various other techniques or characteristics, such as color, edge, shape, or motion detection algorithms. The identified lips may be tracked over multiple frames or images. Processor 210 may use one or more video tracking algorithms, such as mean-shift tracking, contour tracking (e.g., condensation algorithms), or various other techniques. Accordingly, processor 210 may be configured to track the movement of the lips of individual 2310 in real time.
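As a rough sketch of the landmark-based tracking idea above, the snippet below assumes per-frame mouth landmark points are already available (e.g., from a CNN landmark detector, which is not implemented here) and derives a per-frame lip-opening signal from them; the names and the simple upper/lower landmark split are illustrative assumptions.

```python
def mouth_opening(points):
    """Vertical gap between the upper-lip and lower-lip landmark centers.

    `points` lists (x, y) landmarks, upper-lip points first, lower-lip points last.
    """
    upper = points[:len(points) // 2]
    lower = points[len(points) // 2:]
    return sum(p[1] for p in lower) / len(lower) - sum(p[1] for p in upper) / len(upper)

def track_openings(frames):
    """Per-frame lip opening: the time series used to follow lip movement."""
    return [mouth_opening(f) for f in frames]

# synthetic landmarks for three video frames: closed, open, closed
closed = [(0, 10), (2, 10), (4, 10), (0, 11), (2, 11), (4, 11)]
opened = [(0, 8), (2, 8), (4, 8), (0, 14), (2, 14), (4, 14)]
openings = track_openings([closed, opened, closed])
```

The resulting opening series rises when the mouth opens and falls when it closes, which is the movement signal used in the correlation steps discussed later in the text.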
如果需要,可以使用跟踪的个体2310的唇部移动来分离,并选择性地调节用户100的环境中的一个或多个声音。图24是示出符合本公开的使用唇部跟踪助听器的示例性环境2400的示意图。用户100佩戴的装置110可以被配置为识别环境2400内的一个或多个个体。例如,装置110可以被配置为使用相机1730来捕捉周围环境2400的一个或多个图像。所捕捉的图像可以包括个体2310和2410的表示,他们可以存在于环境2400中。处理器210可以被配置为使用上述方法来检测个体2310和2410的嘴并跟踪他们各自的唇部移动。在一些实施例中,处理器210还可以被配置为例如如前面所讨论的通过检测个体2310和2410的面部特征并将其与数据库进行比较来识别个体2310和2410。If desired, the tracked lip movement of the individual 2310 can be used to isolate and selectively adjust one or more sounds in the user's 100 environment. 24 is a schematic diagram illustrating an exemplary environment 2400 for using a lip tracking hearing aid consistent with the present disclosure. Device 110 worn by user 100 may be configured to identify one or more individuals within environment 2400 . For example, device 110 may be configured to capture one or more images of surrounding environment 2400 using camera 1730 . The captured images may include representations of individuals 2310 and 2410 , which may exist in environment 2400 . Processor 210 may be configured to detect the mouths of individuals 2310 and 2410 and track their respective lip movements using the methods described above. In some embodiments, processor 210 may also be configured to identify individuals 2310 and 2410 by detecting facial features of individuals 2310 and 2410 and comparing them to a database, for example, as previously discussed.
除了检测图像之外,装置110可以被配置为检测用户100的环境中的一个或多个声音。例如,麦克风1720可以检测环境2400内的一个或多个声音2421、2422和2423。在一些实施例中,声音可以表示各种个体的语音。例如,如图24所示,声音2421可以表示个体2310的语音,并且声音2422可以表示个体2410的语音。声音2423可以表示环境2400内的附加语音和/或背景噪声。处理器210可以被配置为分析声音2421、2422和2423,以分离并识别与语音相关联的音频信号。例如,处理器210可以使用一种或多种语音或语音活动检测(VAD)算法和/或上述语音分离技术。当在环境中检测到多个声音时,处理器210可以隔离与每个语音相关联的音频信号。在一些实施例中,处理器210可以对与检测到的语音活动相关联的音频信号执行进一步分析,以识别个体的语音。例如,处理器210可以使用一种或多种语音识别算法(例如,隐马尔可夫模型、动态时间规整、神经网络或其他技术)来识别个体的语音。处理器210还可以被配置为使用各种语音到文本算法来识别个体2310所说的词语。在一些实施例中,替代使用麦克风1720,装置110可以通过诸如无线收发器530的通信组件从另一设备接收音频信号。例如,如果用户100正在进行视频呼叫,则装置110可以从显示设备2301或另一辅助设备接收表示个体2310的语音的音频信号。In addition to detecting images, apparatus 110 may be configured to detect one or more sounds in the environment of user 100. For example, microphone 1720 may detect one or more sounds 2421, 2422, and 2423 within environment 2400. In some embodiments, the sounds may represent the speech of various individuals. For example, as shown in FIG. 24, sound 2421 may represent the speech of individual 2310, and sound 2422 may represent the speech of individual 2410. Sound 2423 may represent additional speech and/or background noise within environment 2400. Processor 210 may be configured to analyze sounds 2421, 2422, and 2423 to separate and identify audio signals associated with speech. For example, processor 210 may use one or more speech or voice activity detection (VAD) algorithms and/or the speech separation techniques described above. When multiple sounds are detected in the environment, processor 210 may isolate the audio signal associated with each voice. In some embodiments, processor 210 may perform further analysis on the audio signals associated with detected voice activity to recognize the speech of the individuals. For example, processor 210 may use one or more voice recognition algorithms (e.g., hidden Markov models, dynamic time warping, neural networks, or other techniques) to recognize the voice of an individual. Processor 210 may also be configured to recognize the words spoken by individual 2310 using various speech-to-text algorithms. In some embodiments, instead of using microphone 1720, apparatus 110 may receive audio signals from another device through a communication component, such as wireless transceiver 530. For example, if user 100 is on a video call, apparatus 110 may receive an audio signal representing the voice of individual 2310 from display device 2301 or another auxiliary device.
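One simple instance of the VAD algorithms mentioned above is a frame-energy detector; real detectors use much richer features, so this sketch (including the frame length and threshold values) is illustrative only.

```python
import numpy as np

def frame_energy_vad(signal, frame_len=160, threshold=0.01):
    """Return a per-frame True/False voice activity decision based on energy."""
    sig = np.asarray(signal, float)
    n_frames = len(sig) // frame_len
    frames = sig[:n_frames * frame_len].reshape(n_frames, frame_len)
    return (frames ** 2).mean(axis=1) > threshold

fs = 8000
t = np.arange(fs) / fs
speech = 0.5 * np.sin(2 * np.pi * 200 * t)   # stand-in for one second of voiced audio
silence = np.zeros(fs)                       # one second of silence
decisions = frame_energy_vad(np.concatenate([silence, speech]))
```

The first half of the decision vector (silence) comes back inactive, the second half (tone) active.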
处理器210可以基于唇部移动和检测到的声音来确定环境2400中的哪些个体正在说话。例如,处理器210可以跟踪与嘴2311相关联的唇部移动,以确定个体2310正在说话。可以在检测到的唇部移动和接收到的音频信号之间进行比较分析。在一些实施例中,处理器210可以基于在检测到声音2421的同时嘴2311正在运动的确定来确定个体2310正在说话。例如,当个体2310的唇部停止运动时,这可对应于与声音2421相关联的音频信号中的静默或音量减小的时段。在一些实施例中,处理器210可以被配置为确定嘴2311的特定运动是否对应于接收到的音频信号。例如,处理器210可以分析接收到的音频信号以识别接收到的音频信号中的特定音素、音素组合或词语。处理器210可以识别嘴2311的特定唇部移动是否对应于识别出的词语或音素。可以实现各种机器学习或深度学习技术来将预期的唇部移动与检测到的音频相关联。例如,可以将已知声音和对应的唇部移动的训练数据集馈送到机器学习算法,以开发用于将检测到的声音与预期的唇部移动相关联的模型。与装置110相关联的其他数据还可以结合检测到的唇部移动来确定和/或验证个体2310是否在说话,诸如用户100或个体2310的视线方向、检测到的个体2310的身份、识别出的个体2310的声纹等。Processor 210 may determine which individuals in environment 2400 are speaking based on the lip movements and the detected sounds. For example, processor 210 may track the lip movements associated with mouth 2311 to determine that individual 2310 is speaking. A comparative analysis may be performed between the detected lip movements and the received audio signals. In some embodiments, processor 210 may determine that individual 2310 is speaking based on a determination that mouth 2311 is moving while sound 2421 is detected. For example, when the lips of individual 2310 stop moving, this may correspond to a period of silence or reduced volume in the audio signal associated with sound 2421. In some embodiments, processor 210 may be configured to determine whether particular movements of mouth 2311 correspond to the received audio signal. For example, processor 210 may analyze the received audio signal to identify particular phonemes, phoneme combinations, or words in the received audio signal. Processor 210 may identify whether particular lip movements of mouth 2311 correspond to the recognized words or phonemes. Various machine learning or deep learning techniques may be implemented to correlate expected lip movements with detected audio. For example, a training data set of known sounds and corresponding lip movements may be fed to a machine learning algorithm to develop a model for correlating detected sounds with expected lip movements. Other data associated with apparatus 110 may also be used in conjunction with the detected lip movements to determine and/or verify whether individual 2310 is speaking, such as a gaze direction of user 100 or individual 2310, a detected identity of individual 2310, a recognized voiceprint of individual 2310, and the like.
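The comparative analysis between lip movement and audio can be sketched as a correlation between a lip-opening time series and the audio energy envelope. The disclosure does not prescribe this particular measure; Pearson correlation and all the names below are illustrative assumptions.

```python
import numpy as np

def movement_audio_correlation(lip_openings, audio_envelope):
    """Pearson correlation between a lip-opening series and audio energy."""
    return float(np.corrcoef(lip_openings, audio_envelope)[0, 1])

lips = [0.1, 0.8, 0.9, 0.2, 0.7, 0.1]       # lip opening per video frame
env_match = [0.0, 0.9, 1.0, 0.1, 0.8, 0.0]  # loudness tracking the lip motion
env_other = [0.9, 0.1, 0.1, 0.9, 0.1, 0.9]  # loudness of an unrelated source
speaking_score = movement_audio_correlation(lips, env_match)
unrelated_score = movement_audio_correlation(lips, env_other)
```

A high score suggests the tracked individual is the source of the audio; a low or negative score suggests the audio comes from elsewhere.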
基于检测到的唇部移动,处理器210可以引起对与个体2310相关联的音频的选择性调节。调节可以包括相对于其他音频信号放大被确定为对应于声音2421(其可对应于个体2310的语音)的音频信号。在一些实施例中,放大可以例如通过相对于其他信号处理与声音2421相关联的音频信号来数字化地实现。另外地或者可替代地,可以通过改变麦克风1720的一个或多个参数来实现放大,以聚焦于与个体2310相关联的音频声音。例如,麦克风1720可以是定向麦克风,处理器210可以执行将麦克风1720聚焦在声音2421上的操作。可以使用用于放大声音2421的各种其他技术,诸如使用波束成形麦克风阵列、声学望远镜技术等。经调节的音频信号可以被发送到听觉接口设备1710,并且因此可以向用户100提供基于正在说话的个体而调节的音频。Based on the detected lip movement, the processor 210 may cause selective adjustments to the audio associated with the individual 2310. Adjusting may include amplifying the audio signal determined to correspond to sound 2421 (which may correspond to the speech of individual 2310 ) relative to other audio signals. In some embodiments, amplification may be accomplished digitally, for example, by processing the audio signal associated with sound 2421 relative to other signals. Additionally or alternatively, amplification may be achieved by changing one or more parameters of the microphone 1720 to focus on audio sounds associated with the individual 2310. For example, the microphone 1720 may be a directional microphone, and the processor 210 may perform an operation to focus the microphone 1720 on the sound 2421. Various other techniques for amplifying sound 2421 may be used, such as the use of beamforming microphone arrays, acoustic telescope techniques, and the like. The adjusted audio signal may be sent to auditory interface device 1710, and thus may provide user 100 with audio adjusted based on the individual speaking.
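A minimal digital version of this selective conditioning could apply a larger gain to the separated stream attributed to the speaking individual before mixing for the earpiece. The per-source streams, names, and gain values below are assumptions for illustration, not the device's actual processing.

```python
import numpy as np

def condition_streams(streams, active, active_gain=4.0, other_gain=0.25):
    """Mix separated per-source audio streams, emphasizing the active speaker."""
    length = len(next(iter(streams.values())))
    mix = np.zeros(length)
    for name, sig in streams.items():
        gain = active_gain if name == active else other_gain
        mix += gain * np.asarray(sig, float)
    return mix

streams = {
    "sound_2421": np.ones(4),   # stand-in for the speaking individual's audio
    "sound_2422": np.ones(4),   # other speech
    "sound_2423": np.ones(4),   # background noise
}
output = condition_streams(streams, active="sound_2421")
```

In the mixed output, the active stream is boosted while the remaining streams are attenuated, mirroring the amplification/suppression described above.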
在一些实施例中,选择性调节可以包括衰减或抑制与个体2310不相关联的一个或多个音频信号,诸如声音2422和2423。类似于声音2421的放大,声音的衰减可以通过处理音频信号或通过改变与麦克风1720相关联的一个或多个参数来发生,以指引焦点离开与个体2310不相关联的声音。In some embodiments, the selective adjustment may include attenuating or suppressing one or more audio signals, such as sounds 2422 and 2423, not associated with the individual 2310. Similar to the amplification of sound 2421, attenuation of sound may occur by processing the audio signal or by changing one or more parameters associated with microphone 1720 to direct focus away from sounds not associated with individual 2310.
在一些实施例中,调节还可以包括改变对应于声音2421的一个或多个音频信号的音调,以使该声音对于用户100更易感知。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以调整声音2421的音高。例如,用户100可能经历10kHz以上的频率中的听觉损失,并且处理器210可以将更高的频率(例如,在15kHz处)重新映射到10kHz。在一些实施例中,处理器210可以被配置为改变与一个或多个音频信号相关联的语速。处理器210可以被配置为改变个体2310的语速,以使检测到的语音对于用户100更易感知。如果已经对与声音2421相关联的音频信号执行了语音识别,则调节还可以包括基于检测到的语音来修改音频信号。例如,处理器210可以在词语和/或句子之间引入停顿或增加停顿的持续时间,这可以使语音更容易理解。可以执行各种其他处理(诸如修改声音2421的音调),以维持与原始音频信号相同的音高,或者降低音频信号内的噪声。In some embodiments, adjusting may also include changing the pitch of one or more audio signals corresponding to sound 2421 to make the sound more perceptible to user 100 . For example, the user 100 may have less sensitivity to tones within a certain range, and the adjustment of the audio signal may adjust the pitch of the sound 2421. For example, user 100 may experience hearing loss in frequencies above 10 kHz, and processor 210 may remap higher frequencies (eg, at 15 kHz) to 10 kHz. In some embodiments, the processor 210 may be configured to vary the speech rate associated with one or more audio signals. The processor 210 may be configured to alter the speech rate of the individual 2310 to make the detected speech more perceptible to the user 100 . If speech recognition has been performed on the audio signal associated with the sound 2421, the conditioning may also include modifying the audio signal based on the detected speech. For example, processor 210 may introduce pauses or increase the duration of pauses between words and/or sentences, which may make speech easier to understand. Various other processing may be performed, such as modifying the pitch of the sound 2421, to maintain the same pitch as the original audio signal, or to reduce noise within the audio signal.
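The frequency-remapping idea (moving content above a cutoff such as 10 kHz into the audible range) can be sketched naively in the frequency domain. Commercial frequency-lowering algorithms are considerably more sophisticated; the fold-down rule, compression factor, and names here are arbitrary illustrative choices.

```python
import numpy as np

def lower_frequencies(signal, fs, cutoff_hz, target_hz, compression=0.1):
    """Fold spectral content above cutoff_hz down to near target_hz."""
    spec = np.fft.rfft(np.asarray(signal, float))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
    out = spec.copy()
    for i, f in enumerate(freqs):
        if f > cutoff_hz:
            # destination bin: compressed offset above the target frequency
            j = int(np.argmin(np.abs(freqs - (target_hz + (f - cutoff_hz) * compression))))
            out[i] = 0.0          # silence the inaudible bin...
            out[j] += spec[i]     # ...and add its content lower down
    return np.fft.irfft(out, n=len(signal))

fs = 44100
t = np.arange(fs // 10) / fs                 # 0.1 s of audio
tone = np.sin(2 * np.pi * 15000 * t)         # 15 kHz tone, above a 10 kHz cutoff
shifted = lower_frequencies(tone, fs, cutoff_hz=10000, target_hz=10000)
```

With this compression factor, a 15 kHz tone lands near 10.5 kHz, inside the assumed audible range.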
然后可以将经调节的音频信号发送到听觉接口设备1710,该设备随后为用户100产生声音。因此,在经调节的音频信号中,声音2421可以比声音2422和2423更响亮和/或更容易区分。The conditioned audio signal may then be transmitted to auditory interface device 1710, which in turn produces sound for user 100. Thus, in the conditioned audio signal, sound 2421 may be louder and/or easier to distinguish than sounds 2422 and 2423.
处理器210可以被配置为基于与音频信号相关联的哪些个体当前正在说话来选择性地调节多个音频信号。例如,个体2310和个体2410可以参与环境2400内的对话,并且处理器210可以被配置为基于个体2310和个体2410各自的唇部移动,从对与声音2421相关联的音频信号的调节转换到对与声音2422相关联的音频信号的调节。例如,个体2310的唇部移动可以指示个体2310已经停止说话,或者与个体2410相关联的唇部移动可以指示个体2410已经开始说话。因此,处理器210可以在选择性地调节与声音2421相关联的音频信号和选择性地调节与声音2422相关联的音频信号之间转换。在一些实施例中,处理器210可以被配置为同时处理和/或调节两个音频信号,但仅基于哪个个体正在说话而选择性地将经调节的音频发送到听觉接口设备1710。在实现语音识别的情况下,处理器210可以基于语音的背景来确定和/或预期说话者之间的转换。例如,处理器210可以分析与声音2421相关联的音频信号,以确定个体2310已经到达句子的结尾或已经问了一个问题,这可以指示个体2310已经结束或即将结束说话。Processor 210 may be configured to selectively condition multiple audio signals based on which individuals associated with those audio signals are currently speaking. For example, individual 2310 and individual 2410 may be engaged in a conversation within environment 2400, and processor 210 may be configured to transition from conditioning of the audio signal associated with sound 2421 to conditioning of the audio signal associated with sound 2422 based on the respective lip movements of individual 2310 and individual 2410. For example, lip movements of individual 2310 may indicate that individual 2310 has stopped speaking, or lip movements associated with individual 2410 may indicate that individual 2410 has begun speaking. Accordingly, processor 210 may transition between selectively conditioning the audio signal associated with sound 2421 and selectively conditioning the audio signal associated with sound 2422. In some embodiments, processor 210 may be configured to process and/or condition both audio signals concurrently but selectively transmit the conditioned audio to auditory interface device 1710 based on which individual is speaking. Where speech recognition is implemented, processor 210 may determine and/or anticipate transitions between speakers based on the context of the speech. For example, processor 210 may analyze the audio signal associated with sound 2421 to determine that individual 2310 has reached the end of a sentence or has asked a question, which may indicate that individual 2310 has finished or is about to finish speaking.
在一些实施例中,处理器210可以被配置为在多个活跃说话者之间进行选择,以选择性地调节音频信号。例如,个体2310和2410可能同时都在说话,或者他们的讲话可能在对话期间重叠。处理器210可以相对于其他说话者选择性地调节与一个说话个体相关联的音频。这可以包括给予这样的说话者优先级:该说话者已经开始说出一个词语或句子,但在另一说话者开始讲话时尚未说完。如上所述,该确定还可以由语音的背景驱动。In some embodiments, processor 210 may be configured to select among multiple active speakers to selectively condition audio signals. For example, individuals 2310 and 2410 may both be speaking at the same time, or their speech may overlap during a conversation. Processor 210 may selectively condition the audio associated with one speaking individual relative to the others. This may include giving priority to a speaker who has begun a word or sentence but has not yet finished it when another speaker begins to speak. As described above, this determination may also be driven by the context of the speech.
在选择活跃的说话者时,还可以考虑各种其他因素。例如,可以确定用户的视线方向,并且可以在活跃的说话者中给予用户视线方向上的个体更高的优先级。还可以基于说话者的视线方向来分配优先级。例如,如果个体2310正在看着用户100,而个体2410正在看着其他地方,则可以选择性地调节与个体2310相关联的音频信号。在一些实施例中,可以基于环境2400中其他个体的相对行为来分配优先级。例如,如果个体2310和个体2410都在说话,并且看着个体2410的其他个体比看着个体2310的更多,则与个体2410相关联的音频信号可以优先于与个体2310相关联的音频信号被选择性地调节。在确定个体的身份的实施例中,如前面更详细地讨论的,可以基于说话者的相对状态来分配优先级。用户100还可以通过预定义设置或通过主动选择要聚焦于哪个说话者来提供对哪些说话者被优先的输入。Various other factors can also be considered when selecting active speakers. For example, the user's gaze direction may be determined, and individuals in the user's gaze direction may be given higher priority among active speakers. Priority can also be assigned based on the speaker's gaze direction. For example, if individual 2310 is looking at user 100 and individual 2410 is looking elsewhere, the audio signal associated with individual 2310 may be selectively adjusted. In some embodiments, priorities may be assigned based on the relative behavior of other individuals in the environment 2400 . For example, if both individual 2310 and individual 2410 are talking, and more other individuals are looking at individual 2410 than at individual 2310, the audio signal associated with individual 2410 may be prioritized over the audio signal associated with individual 2310 Selectively adjust. In embodiments where the identity of an individual is determined, as discussed in greater detail above, priorities may be assigned based on the relative status of the speakers. User 100 may also provide input on which speakers are to be prioritized through predefined settings or by actively selecting which speakers to focus on.
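The prioritization factors listed above could be combined into a single score per active speaker. The weights and the function below are purely illustrative assumptions, not part of the disclosure.

```python
def speaker_priority(in_user_gaze, looking_at_user, audience_watchers, user_preferred):
    """Weighted combination of the speaker prioritization cues described above."""
    score = 0.0
    score += 2.0 if in_user_gaze else 0.0     # the user is looking at this speaker
    score += 1.0 if looking_at_user else 0.0  # the speaker is looking at the user
    score += 0.5 * audience_watchers          # others in the room watch this speaker
    score += 3.0 if user_preferred else 0.0   # an explicit user selection dominates
    return score

priority_2310 = speaker_priority(True, True, audience_watchers=1, user_preferred=False)
priority_2410 = speaker_priority(False, False, audience_watchers=2, user_preferred=False)
```

The speaker with the highest score would be selected for conditioning; an explicit user preference outweighs the gaze-based cues in this toy weighting.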
处理器210还可以基于如何检测个体2310的表示来分配优先级。尽管个体2310和个体2410被示出为物理地存在于环境2400中,但如图23B所示,一个或多个个体可以被检测为个体的可视表示(例如,在显示设备上)。处理器210可以基于说话者是否物理地存在于环境2400中来对其进行优先级排序。例如,处理器210可以将物理上存在的说话者优先于显示器上的说话者。可替代地,例如,如果用户100在视频会议上,或者如果用户100在观看电影,则处理器210可以将视频优先于房间中的说话者。用户100还可以使用与装置110相关联的用户界面来指示优先化的说话者或说话者类型(例如存在或不存在)。The processor 210 may also assign priorities based on how the representation of the individual 2310 is detected. Although individuals 2310 and 2410 are shown physically present in environment 2400, as shown in Figure 23B, one or more individuals may be detected as visual representations of the individuals (eg, on a display device). Processor 210 may prioritize speakers based on whether they are physically present in environment 2400 . For example, the processor 210 may prioritize physically present speakers over speakers on the display. Alternatively, for example, if the user 100 is on a video conference, or if the user 100 is watching a movie, the processor 210 may prioritize the video over the speakers in the room. User 100 may also indicate a prioritized speaker or speaker type (eg, presence or absence) using a user interface associated with device 110 .
图25是示出符合所公开实施例的用于基于跟踪的唇部移动来选择性地放大音频信号的示例性过程2500的流程图。过程2500可以由与装置110相关联的一个或多个处理器(诸如处理器210)来执行。处理器可以包括在与也可以用于过程2500的麦克风1720和相机1730相同的公共外壳中。在一些实施例中,过程2500的一些或全部可以在装置110外部的处理器上执行,它们可以包括在第二外壳中。例如,过程2500的一个或多个部分可以由听觉接口设备1710或诸如计算设备120或显示设备2301的辅助设备中的处理器来执行。在这样的实施例中,处理器可以被配置为经由公共外壳中的发送器与第二外壳中的接收器之间的无线链路接收所捕捉的图像。25 is a flowchart illustrating an exemplary process 2500 for selectively amplifying an audio signal based on tracked lip movement, consistent with the disclosed embodiments. Process 2500 may be performed by one or more processors associated with apparatus 110, such as processor 210. The processor may be included in the same common housing as the microphone 1720 and camera 1730 that may also be used in process 2500 . In some embodiments, some or all of process 2500 may be performed on a processor external to device 110, which may be included in the second housing. For example, one or more portions of process 2500 may be performed by a processor in auditory interface device 1710 or an auxiliary device such as computing device 120 or display device 2301 . In such an embodiment, the processor may be configured to receive the captured image via a wireless link between the transmitter in the common housing and the receiver in the second housing.
在步骤2510中,过程2500可以包括接收由可穿戴相机从用户的环境捕捉的多个图像。图像可以由诸如装置110的相机1730的可穿戴相机捕捉。在步骤2520中,过程2500可以包括识别在多个图像中的至少一个中的至少一个个体的表示。可以使用各种图像检测算法来识别个体,诸如Haar级联、定向梯度直方图(HOG)、深度卷积神经网络(CNN)、尺度不变特征变换(SIFT)等。在一些实施例中,处理器210可以被配置为例如如图23B所示从显示设备检测个体的可视表示。In step 2510, the process 2500 can include receiving a plurality of images captured by the wearable camera from the user's environment. The images may be captured by a wearable camera such as camera 1730 of device 110 . In step 2520, process 2500 can include identifying a representation of at least one individual in at least one of the plurality of images. Various image detection algorithms can be used to identify individuals, such as Haar cascade, Histogram of Oriented Gradients (HOG), Deep Convolutional Neural Networks (CNN), Scale Invariant Feature Transform (SIFT), and others. In some embodiments, processor 210 may be configured to detect a visual representation of an individual from a display device, eg, as shown in Figure 23B.
在步骤2530中,过程2500可以包括基于对多个图像的分析来识别与个体的嘴相关联的至少一个唇部移动或唇部位置。处理器210可以被配置为识别与个体的嘴相关联的一个或多个点。在一些实施例中,处理器210可以开发与个体的嘴相关联的轮廓,该轮廓可以定义与个体的嘴或唇部相关联的边界。可以在多个帧或图像上跟踪在图像中识别出的唇部,以识别唇部移动。因此,处理器210可以使用如上所述的各种视频跟踪算法。In step 2530, the process 2500 can include identifying at least one lip movement or lip position associated with the individual's mouth based on the analysis of the plurality of images. The processor 210 may be configured to identify one or more points associated with the individual's mouth. In some embodiments, the processor 210 may develop a profile associated with the individual's mouth, which may define a boundary associated with the individual's mouth or lips. Lips identified in an image can be tracked over multiple frames or images to identify lip movement. Accordingly, the processor 210 may use various video tracking algorithms as described above.
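The point/contour tracking of step 2530 can be illustrated with a minimal sketch. The helpers below are hypothetical (not part of the disclosed apparatus): the mouth bounding box is assumed to come from a separate face-landmark detector, and lip movement is scored simply as the mean inter-frame change inside that region.

```python
import numpy as np

def lip_motion_scores(frames, mouth_box):
    """Mean absolute inter-frame difference inside the mouth region.

    frames: sequence of 2-D grayscale arrays.
    mouth_box: (top, bottom, left, right) bounding box for the mouth,
    assumed to come from a face-landmark detector (hypothetical input).
    """
    t, b, l, r = mouth_box
    rois = [np.asarray(f)[t:b, l:r].astype(np.float32) for f in frames]
    return np.array([np.abs(cur - prev).mean()
                     for prev, cur in zip(rois, rois[1:])])

def lips_are_moving(frames, mouth_box, threshold=5.0):
    # Threshold is illustrative; a real system would calibrate it.
    return bool(lip_motion_scores(frames, mouth_box).mean() > threshold)
```

A production tracker would of course use the landmark-point or contour tracking named in the text rather than raw frame differencing; the sketch only shows the "track across multiple frames, then threshold" structure.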
在步骤2540中,过程2500可以包括接收表示由麦克风从用户的环境捕捉的声音的音频信号。例如,装置110可以接收表示由麦克风1720捕捉的声音2421、2422和2423的音频信号。在步骤2550中,过程2500可以包括基于对麦克风捕捉的声音的分析,识别与第一语音相关联的第一音频信号和与不同于第一语音的第二语音相关联的第二音频信号。例如,处理器210可以识别与分别表示个体2310和2410的语音的声音2421和2422相关联的音频信号。处理器210可以使用任何当前已知或未来开发的技术或算法来分析从麦克风1720接收的声音以分离第一和第二语音。步骤2550还可以包括识别附加声音(诸如声音2423),其可以包括用户环境中的附加声音或背景噪声。在一些实施例中,处理器210可以对第一和第二音频信号执行进一步的分析,例如,通过使用个体2310和2410的可用声纹来确定他们的身份。可替代地或者另外地,处理器210可以使用语音识别工具或算法来识别个体的语音。In step 2540, the process 2500 may include receiving an audio signal representing sound captured by the microphone from the user's environment. For example, device 110 may receive audio signals representing sounds 2421 , 2422 , and 2423 captured by microphone 1720 . In step 2550, process 2500 may include identifying a first audio signal associated with the first speech and a second audio signal associated with a second speech different from the first speech based on the analysis of the sound captured by the microphone. For example, processor 210 may identify audio signals associated with sounds 2421 and 2422 representing the speech of individuals 2310 and 2410, respectively. The processor 210 may use any currently known or future developed technique or algorithm to analyze the sound received from the microphone 1720 to separate the first and second speech. Step 2550 may also include identifying additional sounds (such as sounds 2423), which may include additional sounds or background noise in the user's environment. In some embodiments, the processor 210 may perform further analysis on the first and second audio signals, eg, by using the available voiceprints of the individuals 2310 and 2410 to determine their identities. Alternatively or additionally, the processor 210 may use speech recognition tools or algorithms to recognize the individual's speech.
在步骤2560中,过程2500可以包括基于确定第一音频信号与识别出的与个体的嘴相关联的唇部移动相关联来对第一音频信号进行选择性调节。处理器210可以将识别出的唇部移动与在步骤2550中识别出的第一和第二音频信号进行比较。例如,处理器210可以将检测到的唇部移动的定时与音频信号中的语音模式的定时进行比较。在检测到语音的实施例中,如上所述,处理器210还可以将特定唇部移动与在音频信号中检测到的音素或其他特征进行比较。因此,处理器210可以确定第一音频信号与检测到的唇部移动相关联,并且因此与正在说话的个体相关联。In step 2560, process 2500 may include selectively adjusting the first audio signal based on determining that the first audio signal is associated with the identified lip movement associated with the individual's mouth. The processor 210 may compare the identified lip movement to the first and second audio signals identified in step 2550. For example, the processor 210 may compare the timing of the detected lip movement to the timing of speech patterns in the audio signal. In embodiments where speech is detected, the processor 210 may also compare certain lip movements to phonemes or other features detected in the audio signal, as described above. Accordingly, the processor 210 may determine that the first audio signal is associated with the detected lip movement, and thus with the speaking individual.
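The timing comparison of step 2560 amounts to asking which separated audio signal co-varies with the observed lip motion. A hedged sketch (helper names are illustrative; the loudness envelopes are assumed to have been resampled to the video frame rate):

```python
import numpy as np

def activity_correlation(lip_activity, audio_envelope):
    """Pearson correlation between a per-frame lip-motion signal and an
    audio loudness envelope sampled at the same frame rate."""
    lip = np.asarray(lip_activity, dtype=float)
    env = np.asarray(audio_envelope, dtype=float)
    lip = (lip - lip.mean()) / (lip.std() + 1e-9)
    env = (env - env.mean()) / (env.std() + 1e-9)
    return float(np.mean(lip * env))

def match_signal_to_lips(lip_activity, envelopes):
    # Return the index of the audio signal whose envelope best tracks
    # the detected lip movement; that signal is attributed to the speaker.
    scores = [activity_correlation(lip_activity, e) for e in envelopes]
    return int(np.argmax(scores))
```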
如上所述,可以执行各种形式的选择性调节。在一些实施例中,调节可以包括改变音频信号的音调或重放速度。例如,调节可以包括重新映射音频频率或改变与音频信号相关联的语速。在一些实施例中,调节可以包括相对于其他音频信号放大第一音频信号。放大可以通过各种手段来执行,诸如方向性麦克风的操作、改变与麦克风相关联的一个或多个参数、或数字化处理音频信号。调节可以包括衰减或抑制与检测到的唇部移动不相关联的一个或多个音频信号。衰减的音频信号可以包括与在用户的环境中检测到的其他声音(包括诸如第二音频信号的其他语音)相关联的音频信号。例如,处理器210可以基于确定第二音频信号与识别出的与个体的嘴相关联的唇部移动不相关联来选择性地衰减第二音频信号。在一些实施例中,处理器可以被配置为当识别出的第一个体的唇部移动指示第一个体已经完成句子或已经完成说话时,从与第一个体相关联的音频信号的调节转换到与第二个体相关联的音频信号的调节。As mentioned above, various forms of selective conditioning may be performed. In some embodiments, conditioning may include changing the pitch or playback speed of the audio signal. For example, conditioning may include remapping audio frequencies or changing a rate of speech associated with the audio signal. In some embodiments, conditioning may include amplifying the first audio signal relative to other audio signals. Amplification may be performed by various means, such as operation of a directional microphone, changing one or more parameters associated with the microphone, or digitally processing the audio signal. Conditioning may include attenuating or suppressing one or more audio signals not associated with the detected lip movement. The attenuated audio signals may include audio signals associated with other sounds detected in the user's environment, including other speech such as the second audio signal. For example, processor 210 may selectively attenuate the second audio signal based on a determination that the second audio signal is not associated with the identified lip movement associated with the individual's mouth. In some embodiments, the processor may be configured to transition from conditioning of the audio signal associated with the first individual to conditioning of the audio signal associated with the second individual when the identified lip movement of the first individual indicates that the first individual has completed a sentence or has finished speaking.
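The amplify/attenuate branch of the selective conditioning described above can be sketched digitally as per-source gains followed by a mixdown. The gain values below are illustrative placeholders, not disclosed parameters:

```python
import numpy as np

def selectively_condition(signals, speaker_idx, gain=4.0, attenuation=0.25):
    """Amplify the signal attributed to the detected lip movement and
    attenuate the remaining signals, then mix down to one channel.

    signals: list of equal-length sample arrays (already separated).
    speaker_idx: index of the signal matched to the moving lips.
    """
    out = np.zeros_like(np.asarray(signals[0], dtype=float))
    for i, s in enumerate(signals):
        g = gain if i == speaker_idx else attenuation
        out += g * np.asarray(s, dtype=float)
    # Normalize only if needed, to avoid clipping at the auditory
    # interface device.
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out
```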
在步骤2570中,过程2500可以包括使经选择性调节的第一音频信号向被配置为向用户的耳朵提供声音的听觉接口设备的传输。例如,经调节的音频信号可以被发送到听觉接口设备1710,其可向用户100提供对应于第一音频信号的声音。还可以发送诸如第二音频信号的附加声音。例如,处理器210可以被配置为发送对应于声音2421、2422和2423的音频信号。然而,可能与检测到的个体2310的唇部移动相关联的第一音频信号可以如上所述相对于声音2422和2423被放大。在一些实施例中,听觉接口设备1710可以包括与听筒相关联的扬声器。例如,听觉接口设备可以至少部分地插入用户的耳朵中,用于向用户提供音频。听觉接口设备也可以在耳朵外部,诸如耳后听觉设备、一个或多个耳机、小型便携式扬声器等。在一些实施例中,听觉接口设备可以包括骨传导麦克风,其被配置为通过用户头骨的振动向用户提供音频信号。这样的设备可以与使用者的皮肤外部接触放置,或者可以通过外科手术植入并附接到使用者的骨骼上。In step 2570, process 2500 may include causing transmission of the selectively conditioned first audio signal to an auditory interface device configured to provide sound to the user's ear. For example, the conditioned audio signal may be sent to auditory interface device 1710, which may provide user 100 with a sound corresponding to the first audio signal. Additional sounds such as a second audio signal may also be sent. For example, processor 210 may be configured to transmit audio signals corresponding to sounds 2421 , 2422 and 2423 . However, the first audio signal that may be associated with the detected lip movement of the individual 2310 may be amplified relative to the sounds 2422 and 2423 as described above. In some embodiments, auditory interface device 1710 may include a speaker associated with the earpiece. For example, an auditory interface device may be inserted at least partially into a user's ear for providing audio to the user. The auditory interface device may also be external to the ear, such as a behind-the-ear hearing device, one or more earphones, small portable speakers, and the like. In some embodiments, the auditory interface device may include a bone conduction microphone configured to provide audio signals to the user through vibrations of the user's skull. Such devices may be placed in external contact with the user's skin, or may be surgically implanted and attached to the user's bone.
多模式助听器Multimodal Hearing Aid
根据本公开的实施例,助听器系统可以包括配置为从用户的环境捕捉多个图像的可穿戴相机。在各种实施例中,助听器系统可以包括被配置为从用户的环境捕捉声音的至少一个麦克风。在一些实施例中,助听器系统可以包括多于一个麦克风。在示例实施例中,助听器系统可以包括用于捕捉第一波长范围内的音频信号的第一麦克风和用于捕捉第二波长范围内的音频信号的第二麦克风。在示例实施例中,助听器系统可以包括多个相机和/或多个麦克风。例如,图26示出了可以佩戴装置110的用户2601,装置110可以包括相机2617A和2617B以及麦克风2613。如本文所述,装置110可以在不同位置附接到用户2601。例如,装置110可以物理地连接到衬衫、项链、腰带、眼镜、腕带、纽扣等。According to embodiments of the present disclosure, a hearing aid system may include a wearable camera configured to capture a plurality of images from the user's environment. In various embodiments, the hearing aid system may include at least one microphone configured to capture sounds from the user's environment. In some embodiments, the hearing aid system may include more than one microphone. In an example embodiment, the hearing aid system may include a first microphone for capturing audio signals in a first wavelength range and a second microphone for capturing audio signals in a second wavelength range. In example embodiments, the hearing aid system may include multiple cameras and/or multiple microphones. For example, FIG. 26 shows a user 2601 who may wear apparatus 110, which may include cameras 2617A and 2617B and a microphone 2613. As described herein, apparatus 110 may be attached to user 2601 at various locations. For example, apparatus 110 may be physically attached to a shirt, necklace, belt, glasses, wristband, button, and the like.
装置110可以被配置为与诸如如图26所示的听觉接口设备2615的听觉接口设备进行通信。在一个示例实施例中,听觉接口设备2615、装置110以及各种相机和麦克风形成助听器系统。在一些实施例中,装置110可以分别从其他相机和麦克风接收视频和音频数据。相机2617A可以指向第一方向(例如,前方),并且相机2617B可以指向前方或侧向。应当理解,相机2617A和2617B的特定朝向仅是说明性的,并且可以使用这些相机的任何其他合适朝向。虽然听觉接口设备2615被示出为附接到用户2601的耳朵之一,但在一些实施例中,听觉接口设备2615可以具有配置为附接到左耳的左部分(示出)和配置为附接到右耳的右部分(未示出)。Apparatus 110 may be configured to communicate with an auditory interface device, such as auditory interface device 2615 shown in FIG. 26. In one example embodiment, auditory interface device 2615, apparatus 110, and the various cameras and microphones form a hearing aid system. In some embodiments, apparatus 110 may receive video and audio data from the other cameras and microphones, respectively. Camera 2617A may point in a first direction (e.g., forward), and camera 2617B may point forward or sideways. It should be understood that the particular orientations of cameras 2617A and 2617B are illustrative only, and any other suitable orientations of these cameras may be used. While auditory interface device 2615 is shown attached to one of user 2601's ears, in some embodiments auditory interface device 2615 may have a left portion (shown) configured to attach to the left ear and a right portion (not shown) configured to attach to the right ear.
应当理解,相机2617A-2617B可以是具有任何合适光学元件的任何合适相机。例如,相机2617A可以具有第一分辨率,并且相机2617B可以具有第二(例如,更高)分辨率。相机可以被配置为经由能够检测任何合适波长光谱(例如,近红外、红外、可见和紫外光谱)中的任何合适光学信号的图像传感器来捕捉图像数据。在一些情况下,相机2617A-2617B可以被配置为捕捉图像,并且在其他情况下,相机2617A-2617B可以被配置为捕捉视频数据。相机2617A-2617B可以包括光学镜头(例如,用于创建宽全景或半球形图像或视频的鱼眼超广角镜头)。在一些情况下,诸如潜望镜镜头的变焦镜头可用于变焦到用户2601的环境中的不同对象。在示例实施例中,相机2617A-2617B可以被配置为朝着麦克风2613检测到的音频信号的方向变焦。在一些情况下,相机2617A-2617B可以具有框架系统以消除振动和/或保持相机的某些方向。如上所述,相机2617A-2617B可以在红外光谱中工作,特别是在黑暗环境中。这样的相机可以包括红外手电筒,并且可以被配置为检测周围人的皮肤温度(例如,皮肤温度可以用于检测黑暗环境中附近的说话者)。It should be understood that cameras 2617A-2617B may be any suitable cameras with any suitable optics. For example, camera 2617A may have a first resolution, and camera 2617B may have a second (e.g., higher) resolution. The cameras may be configured to capture image data via image sensors capable of detecting any suitable optical signal in any suitable wavelength spectrum (e.g., the near-infrared, infrared, visible, and ultraviolet spectrums). In some cases, cameras 2617A-2617B may be configured to capture still images, and in other cases, cameras 2617A-2617B may be configured to capture video data. Cameras 2617A-2617B may include optical lenses (e.g., a fisheye ultra-wide-angle lens for creating wide panoramic or hemispherical images or video). In some cases, a zoom lens such as a periscope lens may be used to zoom onto different objects in user 2601's environment. In an example embodiment, cameras 2617A-2617B may be configured to zoom in the direction of an audio signal detected by microphone 2613. In some cases, cameras 2617A-2617B may have a frame system to eliminate vibration and/or maintain certain orientations of the cameras. As noted above, cameras 2617A-2617B may operate in the infrared spectrum, particularly in dark environments. Such cameras may include an infrared flashlight and may be configured to detect the skin temperature of surrounding people (e.g., skin temperature may be used to detect nearby speakers in dark environments).
装置110可以包括至少一个处理器2641,其被编程为接收由可穿戴相机(例如,由相机2617A)捕捉的多个图像。处理器2641可以被配置为使用基于计算机的模型通过分析由可穿戴相机(例如,相机2617B)收集的图像来分析用户2601的环境。在示例实施例中,相机2617B可以检测可产生音频信号的儿童2602的存在,并且可以检测用户2601的环境中可产生音频信号的其他对象(例如,相机2617B可以检测可产生音频信号的猫2603、可以经由会议软件2618产生音频信号的计算机2619的存在等等)。在各种实施例中,处理器2641可以执行被配置为如本文所讨论的分析和识别图像数据(或视频数据)内的对象、人或动物的合适的软件应用程序。除了处理器2641是装置110的一部分之外,听觉接口设备2615还可以包括被配置为修改各种音频信号并将修改后的音频信号提供给用户2601的耳朵的处理器。在一些情况下,与听觉接口设备2615相关联的处理器可以执行由处理器2641执行的一些(或全部)功能。Apparatus 110 may include at least one processor 2641 programmed to receive a plurality of images captured by a wearable camera (e.g., by camera 2617A). Processor 2641 may be configured to analyze user 2601's environment using a computer-based model by analyzing images collected by a wearable camera (e.g., camera 2617B). In an example embodiment, camera 2617B may detect the presence of child 2602, who may produce audio signals, and may detect other objects in user 2601's environment that may produce audio signals (e.g., camera 2617B may detect cat 2603, which may produce audio signals, the presence of computer 2619, which may generate audio signals via conferencing software 2618, and so forth). In various embodiments, processor 2641 may execute a suitable software application configured to analyze and identify objects, people, or animals within image data (or video data) as discussed herein. In addition to processor 2641 being part of apparatus 110, auditory interface device 2615 may also include a processor configured to modify various audio signals and provide the modified audio signals to user 2601's ear. In some cases, the processor associated with auditory interface device 2615 may perform some (or all) of the functions performed by processor 2641.
在示例实施例中,装置110的处理器2641可以分析一个或多个捕捉的图像。例如,处理器2641可以被配置为接收由相机2617A-2617B捕捉的人的各种图像或特性。此外,装置110可以被配置为与可以存储各种对象和/或人的图像的服务器进行通信。在示例实施例中,装置110可以从服务器上传或下载图像。此外,装置110可以执行对存储在服务器处的图像(或视频)的搜索。类似于上面讨论的实施例,处理器2641可以使用基于计算机的模型来分析和识别对象。在示例实施例中,基于计算机的模型可以包括训练过的神经网络(诸如卷积神经网络(CNN))。在一些情况下,面部特征可以通过合适的基于计算机的模型来分析。例如,可以使用诸如CNN的基于计算机的模型来分析图像,并将在捕捉的图像中识别出的人的面部特征或面部特征之间的关系与在服务器的数据库中存储的图像中发现的人的面部特征或两者之间的关系进行比较。在一些实施例中,可以将人的面部动态运动的视频与从数据库获得的各种人的视频数据记录进行比较,以便确定在视频中捕捉的人是辨识出的个体。In an example embodiment, processor 2641 of apparatus 110 may analyze one or more captured images. For example, processor 2641 may be configured to receive various images or characteristics of a person captured by cameras 2617A-2617B. Additionally, apparatus 110 may be configured to communicate with a server that may store images of various objects and/or people. In an example embodiment, apparatus 110 may upload images to or download images from the server. Furthermore, apparatus 110 may perform a search over images (or videos) stored at the server. Similar to the embodiments discussed above, processor 2641 may use computer-based models to analyze and identify objects. In an example embodiment, the computer-based model may include a trained neural network, such as a convolutional neural network (CNN). In some cases, facial features may be analyzed by a suitable computer-based model. For example, a computer-based model such as a CNN may be used to analyze images and compare facial features, or relationships between facial features, of a person identified in the captured images with facial features, or relationships between them, of people found in images stored in the server's database. In some embodiments, a video of the dynamic movement of a person's face may be compared with video data records of various people obtained from a database in order to determine whether the person captured in the video is a recognized individual.
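One common way to realize the CNN-based comparison described above is to reduce each face image to an embedding vector and compare embeddings by cosine similarity. The sketch below assumes such embeddings already exist (the CNN itself is omitted); the names, vectors, and threshold are illustrative, not disclosed values:

```python
import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def identify(query_embedding, database, threshold=0.8):
    """Match a face embedding (e.g., a CNN output) against stored embeddings.

    database: dict mapping person name -> embedding vector.
    Returns the best-matching name, or None if no score exceeds threshold.
    """
    best_name, best_score = None, threshold
    for name, emb in database.items():
        score = cosine_similarity(query_embedding, emb)
        if score > best_score:
            best_name, best_score = name, score
    return best_name
```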
如本文所述,听觉接口设备2615可以是被配置为向用户2601提供听觉反馈的任何设备。听觉接口设备2615可以被放置在用户2601的一个或两个耳朵中,类似于传统的听觉接口设备。听觉接口设备2615可以是各种样式的,包括耳道内、完全耳道内、耳内、耳后、耳上、耳道内接收器、开放安装或各种其他样式。听觉接口设备2615可以包括用于向用户2601提供听觉反馈的一个或多个扬声器、用于检测用户2601的环境中的声音的麦克风、内部电子设备、处理器、存储器等。听觉接口设备2615可以包括用于手动调整由听觉接口设备发送的音频信号的音频信号参数(例如,响度、音高等)的听觉接口(例如,按钮)。在一些情况下,设备2615可以包括用于执行音频信号操纵的处理器、电源(例如,可充电电池)、可选的无线通信设备(其可以包括天线)和一组麦克风。As described herein, auditory interface device 2615 may be any device configured to provide auditory feedback to user 2601. The auditory interface device 2615 may be placed in one or both ears of the user 2601, similar to conventional auditory interface devices. The auditory interface device 2615 may be of various styles, including in-canal, fully in-canal, in-ear, behind-the-ear, supra-ear, in-canal receiver, open mount, or various other styles. Auditory interface device 2615 may include one or more speakers for providing auditory feedback to user 2601, a microphone for detecting sounds in the environment of user 2601, internal electronics, a processor, memory, and the like. The auditory interface device 2615 may include an auditory interface (eg, buttons) for manually adjusting audio signal parameters (eg, loudness, pitch, etc.) of the audio signal sent by the auditory interface device. In some cases, device 2615 may include a processor for performing audio signal manipulation, a power source (eg, a rechargeable battery), an optional wireless communication device (which may include an antenna), and a set of microphones.
在各种实施例中,处理器2641被配置为接收表示由至少一个麦克风从用户2601的环境捕捉的声音的多个音频信号。这样的信号可以是由用户2601的环境中的各种实体产生的声音的组合。例如,声音可以包括儿童2602在说话,同时猫2603在喵喵叫,或/和人的群组正试图经由计算机2619(例如,通过会议软件2618)与用户2601通信。在各种实施例中,音频信号可能重叠,导致声音的杂音。具有多个声音的这样的音频环境可以被称为嘈杂环境,并且这样的环境可以显著不同于相对无噪声(在本文也被称为平静)的环境。如图26所示的嘈杂环境是这种环境的一个示例。嘈杂环境的其他示例可以包括具有多个个体讲话的聚会、具有多个个体在背景上讲话的电视节目(背景可以包括街道声音、音乐等)、戏剧、会议、晚餐、演讲、基于计算机的会议、酒吧或餐馆中的对话、背景上的对话(例如,在繁忙的道路或建筑工地旁边的对话)、公共交通工具(例如,公共汽车、火车、船或飞机)等。平静环境的示例可以包括私人办公室、图书馆、两个体在安静的地方对话,等等。In various embodiments, processor 2641 is configured to receive a plurality of audio signals representing sounds captured by the at least one microphone from user 2601's environment. Such signals may be a combination of sounds produced by various entities in user 2601's environment. For example, the sounds may include child 2602 talking while cat 2603 meows, and/or a group of people attempting to communicate with user 2601 via computer 2619 (e.g., through conferencing software 2618). In various embodiments, the audio signals may overlap, resulting in a cacophony of sounds. Such an audio environment with multiple sounds may be referred to as a noisy environment, and such an environment may differ significantly from a relatively noise-free (also referred to herein as calm) environment. The noisy environment shown in FIG. 26 is one example of such an environment. Other examples of noisy environments may include parties with multiple individuals speaking, television programs with multiple individuals speaking over a background (the background may include street sounds, music, etc.), plays, conferences, dinners, lectures, computer-based conferences, conversations in bars or restaurants, conversations against background noise (e.g., a conversation next to a busy road or a construction site), public transportation (e.g., buses, trains, boats, or planes), and so on. Examples of calm environments may include a private office, a library, two people conversing in a quiet place, and so on.
在各种实施例中,装置110的处理器2641可以执行被配置为分析用户2601的环境的可视和音频数据并确定用户是在嘈杂环境中还是在平静环境中的软件指令。虽然可以理解,分析是由执行软件指令的处理器2641执行的,但是软件指令可以是任何合适的指令并且可以包括机器学习算法,为了简洁起见,处理器可以执行程序指令以分析各种声音并引起影响用户2601经由听觉接口设备2615接收的音频信号的声音特性的各种动作。In various embodiments, processor 2641 of apparatus 110 may execute software instructions configured to analyze visual and audio data of user 2601's environment and to determine whether the user is in a noisy environment or a calm environment. Although the analysis is performed by processor 2641 executing software instructions, and those instructions may be any suitable instructions and may include machine learning algorithms, for brevity this description simply states that the processor executes program instructions to analyze various sounds and to cause various actions affecting the sound characteristics of the audio signals that user 2601 receives via auditory interface device 2615.
取决于环境类型(例如,噪声或平静环境,或各种其他区别),处理器2641可以被配置为当从环境接收不同类型的音频信号时以不同模式操作。在示例实施例中,当处理器2641确定环境平静时,处理器可以在第一模式下操作,第一模式可以包括对多个音频信号中的至少一个音频信号(在本文称为第一音频信号)的特定选择性调节(在本文称为第一选择性调节)。在一些情况下,当环境足够平静时,处理器2641可以被配置为不提供对第一音频信号的调节,并经由听觉接口设备2615直接将第一音频信号发送到用户2601。可替代地,可以执行至少一些第一选择性调节。例如,噪声(例如,风扇的背景噪声)可以被抑制,而人的语音的声音可以被放大。在一些情况下,听觉系统2650可以被配置为确定人的语音是否部分听不见或不清楚(例如,处理器2641可以被配置为执行语音识别),并且当确定语音部分听不见或不清楚(例如,该人没有足够大声或清楚地说一些词语)时,可以修改人的语音,使得语音清晰度得到提高。在示例实施例中,可以转录人的语音,并且可以经由自然阅读语音(这种过程可称为语音渲染)将转录的文本读给用户2601,以清晰与用户2601交互的人的语音。在示例实施例中,语音渲染可以用于移除说话者的口音或将不同的口音重新应用到语音。在示例实施例中,人的语音的原始音频信号可以与渲染的语音组合(例如,变形),以便在清晰语音的同时保留说话者的一些自然特性。语音渲染可以包括改变说话的人的音高(例如,如果用户2601难以识别特定频率,则这种渲染可能是有益的)、人的语音的抑扬顿挫、人的语音的响度或人的语音的任何其他特性(例如,滤波器可以被应用于人的语音以将人的声音从男声改变为女声)。Depending on the type of environment (e.g., a noisy or calm environment, or various other distinctions), processor 2641 may be configured to operate in different modes when receiving different types of audio signals from the environment. In an example embodiment, when processor 2641 determines that the environment is calm, the processor may operate in a first mode, which may include a particular selective adjustment (referred to herein as a first selective adjustment) of at least one audio signal of the plurality of audio signals (referred to herein as a first audio signal). In some cases, when the environment is sufficiently calm, processor 2641 may be configured to provide no adjustment of the first audio signal and to send the first audio signal directly to user 2601 via auditory interface device 2615. Alternatively, at least some of the first selective adjustment may be performed. For example, noise (e.g., the background noise of a fan) may be suppressed while the sound of human speech is amplified. 
In some cases, hearing system 2650 may be configured to determine whether a person's speech is partially inaudible or unclear (e.g., processor 2641 may be configured to perform speech recognition), and when it is determined that the speech is partially inaudible or unclear (e.g., the person does not say some words loudly or clearly enough), the person's speech may be modified so that speech intelligibility is improved. In an example embodiment, the person's speech may be transcribed, and the transcribed text may be read to user 2601 via a natural reading voice (a process that may be referred to as speech rendering) to clarify the speech of the person interacting with user 2601. In an example embodiment, speech rendering may be used to remove a speaker's accent or to apply a different accent to the speech. In an example embodiment, the original audio signal of the person's speech may be combined (e.g., morphed) with the rendered speech in order to preserve some of the speaker's natural characteristics while keeping the speech clear. Speech rendering may include changing the pitch of the person speaking (e.g., such rendering may be beneficial if user 2601 has difficulty perceiving particular frequencies), the cadence of the person's speech, the loudness of the person's speech, or any other characteristic of the person's speech (e.g., a filter may be applied to the person's speech to change the voice from male to female).
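As one concrete, deliberately naive illustration of the pitch-changing step of speech rendering, resampling shifts pitch at the cost of changing the clip's duration; a production system would instead use a duration-preserving method such as a phase vocoder. The helper name and factor semantics below are assumptions for the sketch:

```python
import numpy as np

def shift_pitch(samples, factor):
    """Naive pitch shift by linear-interpolation resampling.

    factor > 1 raises pitch (and shortens the clip); factor < 1 lowers
    it (and lengthens the clip). Duration is NOT preserved.
    """
    samples = np.asarray(samples, dtype=float)
    n_out = int(len(samples) / factor)
    positions = np.arange(n_out) * factor  # fractional read positions
    return np.interp(positions, np.arange(len(samples)), samples)
```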
需要注意的是,即使在平静的环境中,也可以有几种声音来源(例如,两个体悄悄地与第三人交流,背景中播放安静音乐的声音等等)。在示例实施例中,装置110可以包括用于调整要应用的第一选择性调节的参数的接口。在示例实施例中,这样的接口可以包括基于智能手机的接口,该接口使用例如具有图形用户元素的应用程序、可以是听觉系统2650的一部分的按钮或语音接口(例如,用户2601可以被配置为经由语音命令控制装置110)。除其他外,用户2601请求的用于控制第一选择性调节的至少一些参数的示例命令可以包括对用户2601说话(或向用户2601发出任何类型的音频信号)的说话者(或其他源)的数量。作为示例,当用户2601正在参加演讲(演讲是相对平静的环境)时,他或她可能希望只听到演讲者,而不听到背景噪声或其他人的讲话。Note that even in a calm environment there may be several sources of sound (e.g., two individuals quietly conversing with a third person, the sound of quiet music playing in the background, etc.). In an example embodiment, apparatus 110 may include an interface for adjusting the parameters of the first selective adjustment to be applied. In an example embodiment, such an interface may include a smartphone-based interface using, for example, an application with graphical user elements, buttons that may be part of hearing system 2650, or a voice interface (e.g., user 2601 may control apparatus 110 via voice commands). Among other things, example commands from user 2601 for controlling at least some parameters of the first selective adjustment may include the number of speakers (or other sources) speaking to user 2601 (or emitting any type of audio signal toward user 2601). As an example, when user 2601 is attending a lecture (a lecture being a relatively calm environment), he or she may wish to hear only the speaker and not background noise or other people's speech. 
For this case, user 2601 may instruct hearing system 2650 (i.e., processor 2641, via an appropriate software application) to reduce the amplitude of ambient audio signals (i.e., signals unrelated to the speaker's speech) and, in some cases, to amplify the speaker's voice. When the speaker is far away from the user, user 2601 may instruct hearing system 2650 to improve the intelligibility of the speaker's speech. In some cases, instructions to hearing system 2650 may also include changing the speaker's accent as described above, translating the speaker's speech into a different language selected by user 2601, or applying any other form of modification. As another example, when user 2601 is at a social event and talking to several individuals, he or she may wish to hear each of those individuals as they speak. In some cases, user 2601 may or may not wish to hear himself or herself, or may prefer to hear himself or herself at a lower amplitude. In such a case, user 2601 may instruct apparatus 110 to reduce the amplitude of his or her voice. This change in the user's perceived speech may be another example of the first selective adjustment. Additionally, as described above, when other audio signals (e.g., other individuals' speech) are captured, user 2601 may wish to hear background noise at a lower amplitude (or not at all), but when one or more microphones of hearing aid system 2650 detect no other audio signals, user 2601 may wish to hear background noise at a higher amplitude.
在用户处于相对嘈杂的环境中的情况下,可以执行助听器系统的不同操作模式(在本文中,这样的不同操作模式被称为第二操作模式)。在示例实施例中,装置110的处理器2641可以在第二模式下操作,该第二模式可以包括多个音频信号中的至少一个音频信号的特定选择性调节模式(在本文称为第二选择性调节)。在一些情况下,处理器2641可以基于对多个图像或多个音频信号中的至少一个的分析,确定切换到第二模式以引起对第一音频信号的第二选择性调节,该第二选择性调节相对于第一选择性调节在至少一个方面不同。例如,当环境嘈杂时,处理器2641可以被配置为提供与第一选择性调节相比更强的音频信号调节。在示例实施例中,选择性调节的强度可以被定义为在合适度量下比较经调节的音频信号与未调节的音频信号时的音频信号的差值。合适的度量可以确定经调节音频信号与未调节音频信号的音高差,或者振幅、节奏、时间拉伸,或者可用于表征音频信号的任何其他合适参数的差值。在各种实施例中,第二选择性调节相对于第一选择性调节可在至少一个方面不同。Where the user is in a relatively noisy environment, a different operating mode of the hearing aid system may be employed (herein, such a different operating mode is referred to as a second operating mode). In an example embodiment, processor 2641 of apparatus 110 may operate in a second mode, which may include a particular selective adjustment (referred to herein as a second selective adjustment) of at least one audio signal of the plurality of audio signals. In some cases, processor 2641 may determine, based on analysis of at least one of the plurality of images or the plurality of audio signals, to switch to the second mode to cause a second selective adjustment of the first audio signal, the second selective adjustment differing in at least one respect from the first selective adjustment. For example, when the environment is noisy, processor 2641 may be configured to provide stronger adjustment of the audio signal than the first selective adjustment. In an example embodiment, the strength of a selective adjustment may be defined as the difference, under a suitable metric, between the adjusted audio signal and the unadjusted audio signal. A suitable metric may measure the difference in pitch between the adjusted and unadjusted audio signals, or the difference in amplitude, tempo, time stretch, or any other suitable parameter that can be used to characterize an audio signal. 
In various embodiments, the second selective adjustment may differ in at least one respect relative to the first selective adjustment.
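The strength metric described above can be made concrete as, for example, the root-mean-square difference between the original and the selectively adjusted signal (one of several equally valid metrics the text allows; the function name is illustrative):

```python
import numpy as np

def adjustment_strength(original, adjusted):
    """RMS difference between an unadjusted and an adjusted audio signal:
    one possible metric of how 'strong' a selective adjustment mode is."""
    a = np.asarray(original, dtype=float)
    b = np.asarray(adjusted, dtype=float)
    return float(np.sqrt(np.mean((a - b) ** 2)))
```

Under this metric, a second mode that applies heavier gain changes than the first mode yields a larger strength value, matching the "stronger adjustment" language above.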
第二选择性调节的示例可以包括在使用计算机2619经由网络会议进行通信时降低来自儿童2602的音频信号的振幅。类似地,第二选择性调节可以包括在经由网络会议进行通信时减少环境噪声(例如,来自猫2603的噪声或任何其他屋内噪声)。An example of a second selective adjustment may include reducing the amplitude of the audio signal from the child 2602 when using the computer 2619 to communicate via a web conference. Similarly, the second selective adjustment may include reducing ambient noise (eg, noise from cat 2603 or any other indoor noise) when communicating via a web conference.
在各种实施例中,使用第一选择性调节还是第二选择性调节的确定可以自动地或手动地进行(即,通过利用诸如图形界面、按钮的合适界面或声控命令经由来自用户2601的命令)。在示例实施例中,处理器2641可以通过确定用户2601的环境是平静还是嘈杂来执行自动确定。例如,处理器2641可以在其确定环境平静时切换到第一选择性调节,并且可以在其确定环境嘈杂时切换到第二选择性调节。在一些情况下,在切换操作期间,部分地维持第一选择性调节,而第二选择性调节叠加在第一选择性调节上。例如,如果用户2601在没有环境噪声干扰的情况下经由网络会议进行通信,则第一选择性调节可以包括如上所述的降低用户2601语音的振幅。然而,当儿童2602进入房间从而产生嘈杂环境时,处理器2641可以切换到第二选择性调节,并在维持第一选择性调节的同时降低所感知到的儿童声音的振幅。In various embodiments, the determination of whether to use the first selective adjustment or the second selective adjustment may be made automatically or manually (i.e., via a command from user 2601 using a suitable interface such as a graphical interface, buttons, or voice-controlled commands). In an example embodiment, processor 2641 may make the automatic determination by determining whether user 2601's environment is calm or noisy. For example, processor 2641 may switch to the first selective adjustment when it determines that the environment is calm, and may switch to the second selective adjustment when it determines that the environment is noisy. In some cases, during the switching operation, the first selective adjustment is partially maintained while the second selective adjustment is superimposed on it. For example, if user 2601 is communicating via a web conference without interference from ambient noise, the first selective adjustment may include reducing the amplitude of user 2601's voice as described above. However, when child 2602 enters the room, creating a noisy environment, processor 2641 may switch to the second selective adjustment and reduce the perceived amplitude of the child's voice while maintaining the first selective adjustment. 
In an example embodiment, image data associated with the arrival of child 2602 (or other image data, such as the arrival of cat 2603) may trigger processor 2641 to determine that user 2601 is about to be immersed in a noisy environment, and as a result, processor 2641 may determine that the second selective adjustment needs to be used when processing audio signals in user 2601's environment.
在一些情况下,用户2601可以确定一组参数(例如,经由用于装置110的合适接口),使得当观察到这些参数时,处理器2641可以确定用户2601可能处于嘈杂环境中。例如,可能参数的列表包括音频和图像/视频参数。音频参数可以包括在给定时间间隔期间音频信号的最大振幅、音频频率的最大变化、在给定时间间隔上平均的音频信号的最大振幅、作为音频频率函数的最大振幅的分布等。图像/视频参数可以包括对象在用户2601的环境中移动的速度、在用户2601的环境中捕捉的图像的图像梯度的变化率、由处理器2641的图像识别软件(或处理器2641经由在用户2601的环境中捕捉的图像数据的传输与之通信的设备的图像识别软件)识别的用户2601的环境中的对象、人或动物的存在。在一些情况下,用户2601的环境中的对象的运动与在用户2601的环境中检测到的音频信号之间的时间相关性可以用于确定用户2601的环境是嘈杂的还是平静的。例如,如果用户2601周围的对象的运动可能与发出的声音不相关,并且发出的声音的振幅低,则处理器2641可以得出用户2601的环境相对平静的结论。可替代地,如果处理器2641确定声音可能是在用户2601的环境中的对象快速移动之后(可能具有一些时间延迟)发出的,则处理器2641可以断定用户2601的环境是混乱的。在一些情况下,环境是平静还是嘈杂的确定可以基于在用户2601周围执行的动作(例如,处理器2641可以通过分析图像数据来检测电话正在被用户2601旁边的人拿起、乐队即将开始演出、用户2601正在进入繁忙的街道等)。In some cases, user 2601 may specify a set of parameters (e.g., via a suitable interface of apparatus 110) such that when these parameters are observed, processor 2641 may determine that user 2601 is likely in a noisy environment. For example, the list of possible parameters includes audio and image/video parameters. Audio parameters may include the maximum amplitude of an audio signal during a given time interval, the maximum change in audio frequency, the maximum amplitude of the audio signal averaged over a given time interval, the distribution of maximum amplitude as a function of audio frequency, and the like. Image/video parameters may include the speed at which objects move in user 2601's environment, the rate of change of image gradients in images captured in user 2601's environment, and the presence of objects, people, or animals in user 2601's environment identified by image recognition software of processor 2641 (or by image recognition software of a device with which processor 2641 communicates via transmission of image data captured in user 2601's environment). 
In some cases, a temporal correlation between motion of objects in user 2601's environment and audio signals detected in user 2601's environment can be used to determine whether user 2601's environment is noisy or calm. For example, if the motion of objects around user 2601 is not correlated with the sound produced, and the amplitude of the sound produced is low, processor 2641 may conclude that user 2601's environment is relatively calm. Alternatively, processor 2641 may conclude that user 2601's environment is chaotic if processor 2641 determines that the sound may have been produced after objects in user 2601's environment moved quickly (possibly with some time delay). In some cases, the determination of whether the environment is calm or noisy may be based on actions performed around user 2601 (e.g., processor 2641 may detect, by analyzing image data, that a phone is being picked up by a person next to user 2601, that a band is about to begin playing, that user 2601 is entering a busy street, etc.).
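The correlation test described above can be sketched in pure Python. This is a minimal illustration, assuming motion and audio are sampled as aligned numeric sequences; the function name, lag, and thresholds are hypothetical and not taken from the disclosure:

```python
def classify_environment(motion, audio, lag=1, corr_threshold=0.5, amp_threshold=0.3):
    """Classify the environment as 'chaotic' or 'calm' by checking whether
    audio amplitude tracks object motion, possibly after a time delay."""
    # Align the audio with the motion shifted by the assumed delay.
    m = motion[:len(motion) - lag]
    a = audio[lag:]
    n = len(m)
    mean_m = sum(m) / n
    mean_a = sum(a) / n
    cov = sum((x - mean_m) * (y - mean_a) for x, y in zip(m, a))
    std_m = sum((x - mean_m) ** 2 for x in m) ** 0.5
    std_a = sum((y - mean_a) ** 2 for y in a) ** 0.5
    corr = cov / (std_m * std_a) if std_m and std_a else 0.0
    # Motion-correlated, loud sound suggests a chaotic environment;
    # uncorrelated or quiet sound suggests a calm one.
    if corr > corr_threshold and max(audio) > amp_threshold:
        return "chaotic"
    return "calm"
```

A sound burst that trails object motion by one sample would classify as chaotic, while steady low-level audio with unrelated motion would classify as calm.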
在一些情况下,装置110还可以被配置为与各种其他设备交换数据,以确定用户2601是放置在嘈杂的还是平静的环境中。例如,装置110可以被配置为与智能手机(或能够与系统2650交换数据的类似电子设备)交互,智能手机具有否则装置110不必要访问的各种传感器。在示例实施例中,智能手机可以使用GPS位置来确定用户2601是否处于嘈杂环境中(例如,如果GPS指示用户2601在酒吧中,则智能手机可以(经由与装置110相关的合适软件)断定用户2601处于嘈杂环境中)。另外,如果智能手机记录了与一些其他因素(例如,过度噪声,其又可以跟随/先于/与振动一致)相结合的大声振动,则智能手机可以确定用户2601是否处于嘈杂环境中。在一些实施例中,用户日历中的诸如演讲、会议、音乐会等事件可以提供用户处于平静或嘈杂环境中的指示。在一些情况下,在由装置110或与装置110交互的智能手机收集的图像数据中识别特定个体(例如,儿童2602)可以向处理器2641指示用户2601处于嘈杂环境中。类似地,识别特定音频信号(例如,儿童2602的声音)可以向处理器2641指示用户2601处于嘈杂环境中。In some cases, apparatus 110 may also be configured to exchange data with various other devices to determine whether user 2601 is in a noisy or calm environment. For example, apparatus 110 may be configured to interact with a smartphone (or a similar electronic device capable of exchanging data with system 2650) having various sensors to which apparatus 110 would not otherwise have access. In an example embodiment, the smartphone may use a GPS location to determine whether user 2601 is in a noisy environment (e.g., if the GPS indicates that user 2601 is in a bar, the smartphone may conclude (via suitable software associated with apparatus 110) that user 2601 is in a noisy environment). Additionally, the smartphone may determine that user 2601 is in a noisy environment if the smartphone records loud vibrations combined with some other factor (e.g., excessive noise, which may follow, precede, or coincide with the vibration). In some embodiments, events in the user's calendar, such as lectures, meetings, concerts, etc., may provide an indication that the user is in a calm or noisy environment. In some cases, identifying a particular individual (e.g., child 2602) in image data collected by apparatus 110 or by a smartphone interacting with apparatus 110 may indicate to processor 2641 that user 2601 is in a noisy environment.
Similarly, identifying a particular audio signal (eg, the voice of the child 2602) can indicate to the processor 2641 that the user 2601 is in a noisy environment.
在一些实施例中,装置110的操作模式可以根据用户2601的环境自动改变,其中环境不必被分类为嘈杂或平静。例如,特定环境(例如,与用户2601相关联的事件)可能使得处理器2641(或相关联的设备,诸如智能手机)确定装置110应在特定模式下操作。例如,当用户2601正在进入特定位置(例如,演讲室)时,处理器2641可以确定切换到特定操作模式(例如,第一选择性调节),其中仅将来自一个说话者的音频信号发送到听觉接口设备2615。作为另一示例,当用户2601离开位置(例如,演讲室)时,处理器2641(和/或相关联的设备)可以确定切换到另一操作模式(例如,第二选择性调节),在该模式下当那些个体中的一个正在说话时(当几个个体同时说话时)来自多个个体的音频信号可以被发送到听觉接口设备2615,处理器2641可以基于本文讨论的各种可能因素(例如个体与用户2601的接近程度、个体是否在用户2601的前面、用户2601是否在看着该个体等)来确定放大哪个音频信号和衰减哪个音频信号。In some embodiments, the operating mode of apparatus 110 may change automatically according to the environment of user 2601, where the environment need not be classified as noisy or calm. For example, certain circumstances (e.g., events associated with user 2601) may cause processor 2641 (or an associated device, such as a smartphone) to determine that apparatus 110 should operate in a particular mode. For example, when user 2601 is entering a particular location (e.g., a lecture room), processor 2641 may determine to switch to a particular mode of operation (e.g., a first selective adjustment) in which audio signals from only one speaker are sent to auditory interface device 2615. As another example, when user 2601 leaves a location (e.g., a lecture room), processor 2641 (and/or an associated device) may determine to switch to another mode of operation (e.g., a second selective adjustment) in which audio signals from multiple individuals may be sent to auditory interface device 2615 when one of those individuals is speaking (including when several individuals are speaking at the same time), and processor 2641 may determine which audio signal to amplify and which audio signal to attenuate based on various possible factors discussed herein (e.g., the individual's proximity to user 2601, whether the individual is in front of user 2601, whether user 2601 is looking at the individual, etc.).
由处理器2641(或由诸如与处理器2641通信的智能手机之类的相关设备)确定用户2601是处于嘈杂还是平静的环境中,并在第一选择性调节与第二选择性调节之间自动切换,可以是选择音频信号调节的一种可能方式。可替代地,如上所讨论的,可以向用户2601提供用于手动确定是应用第一选择调节还是第二选择调节的合适界面。例如,用户2601可以通过操作用户界面来选择期望的操作模式,该用户界面例如显示在与装置110耦合的设备(诸如智能手机、膝上型计算机等)上。Determining, by processor 2641 (or by an associated device, such as a smartphone in communication with processor 2641), whether user 2601 is in a noisy or calm environment, and automatically switching between the first selective adjustment and the second selective adjustment, may be one possible way of selecting audio signal conditioning. Alternatively, as discussed above, user 2601 may be provided with a suitable interface for manually determining whether to apply the first selective adjustment or the second selective adjustment. For example, user 2601 may select a desired mode of operation by operating a user interface displayed, for example, on a device (such as a smartphone, laptop, etc.) coupled to apparatus 110.
如所讨论的,处理器2641(或经由合适接口的用户2601)可以被配置为区分嘈杂环境或平静环境。然而,这样的环境分类只是一个可能的示例,并且可以使用任何其他合适的分类,这可以导致装置110的操作模式的区分。As discussed, processor 2641 (or user 2601, via a suitable interface) may be configured to distinguish between a noisy environment and a calm environment. However, such an environment classification is only one possible example, and any other suitable classification that may result in a differentiation of the operating modes of apparatus 110 may be used.
图27A示意性地示出了用于确定装置110的操作模式的过程2701。例如,在过程2701的步骤2711处,处理器2641(或诸如智能手机的相关设备)可以被配置为确定用户(例如,用户2601)的环境类型。在步骤2713处,处理器2641可以评估环境类型是否对应于多个可能类型中的一个(例如,环境可以被评估为A型、B型或C型环境),并且基于环境的类型,可以选择具有对应选择性调节的对应操作模式(例如,可以选择具有对应选择性调节2715A-C的操作模式A-C)。FIG. 27A schematically illustrates a process 2701 for determining the operating mode of apparatus 110. For example, at step 2711 of process 2701, processor 2641 (or a related device, such as a smartphone) may be configured to determine the type of environment of a user (e.g., user 2601). At step 2713, processor 2641 may evaluate whether the environment type corresponds to one of a number of possible types (e.g., the environment may be evaluated as a type A, type B, or type C environment), and based on the type of environment, a corresponding operating mode with a corresponding selective adjustment may be selected (e.g., one of operating modes A-C with corresponding selective adjustments 2715A-C may be selected).
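Steps 2711-2715 of process 2701 amount to a lookup from environment type to operating mode. A minimal sketch, in which the type labels and adjustment parameters are invented purely for illustration:

```python
# Hypothetical table of operating modes keyed by environment type,
# mirroring the selection of modes A-C with adjustments 2715A-C.
ADJUSTMENTS = {
    "A": {"mode": "A", "gain_db": 6.0, "attenuate_background": False},
    "B": {"mode": "B", "gain_db": 12.0, "attenuate_background": True},
    "C": {"mode": "C", "gain_db": 0.0, "attenuate_background": True},
}

def select_operating_mode(environment_type):
    """Step 2713: map the determined environment type to an operating mode
    with its corresponding selective adjustment (default to mode A)."""
    return ADJUSTMENTS.get(environment_type, ADJUSTMENTS["A"])
```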
图27B示出了符合所公开的实施例的用于将经选择性调节的音频信号发送到听觉接口设备2615的示例过程2702。在过程2702的步骤2721处,装置110的处理器2641被配置为接收用户(例如,用户2601)的环境的图像/视频。在步骤2723处,来自用户2601的环境的音频信号也可以由处理器2641接收。在各种实施例中,来自环境的音频信号可以包括在用户2601的环境中检测到的所有音频声音,诸如在用户2601的环境中的个体的语音和环境噪声。在过程2702的步骤2725处,处理器2641可以被配置为分析在步骤2721和步骤2723中收集的音频和图像数据。在步骤2725处,处理器2641可以被配置为如上所述在第一模式下操作以引起对第一音频信号的第一选择性调节。在步骤2727处,处理器2641可以被配置为基于图像分析或音频分析来确定切换到第二模式以引起对第一音频信号的第二选择性调节,第二选择性调节在至少一个方面不同于第一选择性调节。例如,如上所述,当用户2601的环境嘈杂时,处理器2641可以被配置为提供对第一音频信号的更强调节(例如,增加第一音频信号的振幅)。FIG. 27B illustrates an example process 2702, consistent with the disclosed embodiments, for sending selectively conditioned audio signals to auditory interface device 2615. At step 2721 of process 2702, processor 2641 of apparatus 110 is configured to receive images/videos of the environment of a user (e.g., user 2601). At step 2723, audio signals from user 2601's environment may also be received by processor 2641. In various embodiments, the audio signals from the environment may include all audio sounds detected in user 2601's environment, such as the speech of individuals in user 2601's environment and ambient noise. At step 2725 of process 2702, processor 2641 may be configured to analyze the audio and image data collected in steps 2721 and 2723. At step 2725, processor 2641 may be configured to operate in the first mode, as described above, to cause a first selective conditioning of a first audio signal. At step 2727, processor 2641 may be configured to determine, based on the image analysis or the audio analysis, to switch to the second mode to cause a second selective adjustment of the first audio signal, the second selective adjustment differing from the first selective adjustment in at least one respect. For example, as described above, when user 2601's environment is noisy, processor 2641 may be configured to provide stronger conditioning of the first audio signal (e.g., increasing the amplitude of the first audio signal).
在步骤2729处,处理器2641可以被配置为将经调节的信号发送到听觉接口设备2615。注意,在过程2702的一些实施例中,仅将经调节的信号发送到设备2615。在其他实施例中,可以将经调节的信号和未经调节的其他信号两者发送到设备2615。At step 2729, the processor 2641 may be configured to send the conditioned signal to the auditory interface device 2615. Note that in some embodiments of process 2702, only the conditioned signal is sent to device 2615. In other embodiments, both the conditioned signal and other unconditioned signals may be sent to the device 2615.
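Steps 2721-2729 of process 2702 can be summarized as a small pipeline. In this sketch the image/audio analysis is reduced to a boolean noise flag, and the selective adjustments are simple amplitude gains — hypothetical simplifications of the disclosure, not its actual implementation:

```python
def process_2702(environment_is_noisy, audio_signal, first_gain=2.0, second_gain=4.0):
    """Operate in the first mode by default; switch to the second mode
    (a stronger selective adjustment) when analysis indicates noise."""
    # Steps 2725/2727: choose the mode from the analysis result.
    mode = 2 if environment_is_noisy else 1
    gain = second_gain if mode == 2 else first_gain
    # The selective adjustment here is a plain amplitude gain.
    conditioned = [sample * gain for sample in audio_signal]
    # Step 2729: return what would be sent to auditory interface device 2615.
    return mode, conditioned
```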
与所公开的实施例一致,可以通过识别在用户2601的环境中是否存在说话的个体来确定要应用的操作模式(例如,第一操作模式或第二操作模式)。例如,如果检测到该个体,则操作模式可以从第一操作模式切换到第二操作模式,对于第二操作模式,可以使用第二选择性调节,并且可以包括降低环境噪声和放大来自说话个体的音频信号。在示例性实施例中,背景(即,环境)噪声可以构成第一音频信号,并且第二选择性调节可以减少这种第一音频信号。在一些情况下,第二选择性调节可以包括改变音频信号之一(例如,第一音频信号)的音高、改变音频信号的振幅或对音频信号进行时间拉伸。Consistent with the disclosed embodiments, the mode of operation to apply (e.g., the first mode of operation or the second mode of operation) may be determined by identifying whether a speaking individual is present in user 2601's environment. For example, if such an individual is detected, the mode of operation may be switched from the first mode of operation to the second mode of operation, for which a second selective adjustment may be used, and which may include reducing ambient noise and amplifying the audio signal from the speaking individual. In an exemplary embodiment, background (i.e., ambient) noise may constitute the first audio signal, and the second selective adjustment may reduce this first audio signal. In some cases, the second selective adjustment may include changing the pitch of one of the audio signals (e.g., the first audio signal), changing the amplitude of the audio signal, or time-stretching the audio signal.
如上所述,第二选择性调节可以与第一选择性调节一起应用。例如,可以将第二选择性调节应用于第二音频信号,并且可以将第一选择性调节应用于第一音频信号。例如,如果第一音频信号是个体的语音,并且第二音频信号是背景噪声,则第一选择性调节可以包括放大语音的音量,而第二选择性调节可以包括降低背景噪声的振幅(或修改背景噪声的音高,使得其不易与第一音频信号混淆)。在一些情况下,第二选择性调节可以包括相对于第一音频信号衰减多个音频信号中的至少一个第二音频信号。另外地或可替代地,第一或第二选择性调节之一可以包括相对于第一音频信号衰减多个音频信号中的至少一个第二音频信号的音高。在示例实施例中,当用户2601的环境中的个体不说话时,装置110可以使用第一选择性调节。然而,当个体开始说话时,系统2650可以自动切换到第二选择性调节。As described above, the second selective adjustment may be applied together with the first selective adjustment. For example, the second selective adjustment may be applied to a second audio signal, and the first selective adjustment may be applied to the first audio signal. For example, if the first audio signal is an individual's speech and the second audio signal is background noise, the first selective adjustment may include amplifying the volume of the speech, and the second selective adjustment may include reducing the amplitude of the background noise (or modifying the pitch of the background noise so that it is not easily confused with the first audio signal). In some cases, the second selective adjustment may include attenuating at least one second audio signal of the plurality of audio signals relative to the first audio signal. Additionally or alternatively, one of the first or second selective adjustments may include attenuating the pitch of at least one second audio signal of the plurality of audio signals relative to the first audio signal. In an example embodiment, apparatus 110 may use the first selective adjustment when individuals in user 2601's environment are not speaking. However, when an individual begins to speak, system 2650 may automatically switch to the second selective adjustment.
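The combined use of the two adjustments — amplifying a speech signal while attenuating background noise — can be sketched as follows; the gain values are arbitrary placeholders, not parameters from the disclosure:

```python
def apply_selective_adjustments(speech, noise, speech_gain=3.0, noise_gain=0.25):
    """Apply the first selective adjustment (amplify the individual's speech)
    and the second (attenuate the background noise), then mix the result."""
    adjusted_speech = [s * speech_gain for s in speech]
    adjusted_noise = [n * noise_gain for n in noise]
    # Sum the two conditioned signals sample by sample.
    return [a + b for a, b in zip(adjusted_speech, adjusted_noise)]
```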
在一些情况下,第一音频信号可以与第一个体的语音相关联,并且第二音频信号可以与第二个体的语音相关联。另外,在一些实施例中,第一音频信号可以与个体的第一群组相关联,并且第二音频信号可以与个体的第二群组相关联。In some cases, the first audio signal may be associated with the speech of the first individual and the second audio signal may be associated with the speech of the second individual. Additionally, in some embodiments, the first audio signal may be associated with a first group of individuals and the second audio signal may be associated with a second group of individuals.
如前所述,用户2601可以通过向装置110提供指令来选择特定的选择性调节。在示例实施例中,指令可以包括选择可以从用户2601的环境中的多个音频信号生成音频信号子集的若干个个体,并请求使用特定类型的选择性调节(例如,第二选择性调节)来选择性地调节音频信号子集。在一些情况下,特定类型的选择性调节(例如,第二选择性调节)可以通过首先从多个音频信号中确定每个说话者的语音音频信号,并且对于给定时间点,确定具有小于阈值差的信号差的一对语音音频信号,来调节每个时间点在用户2601的环境中的至少一些音频信号。如前所述,可以使用合适的度量来测量信号差。As previously mentioned, user 2601 may select a particular selective adjustment by providing instructions to apparatus 110. In an example embodiment, the instructions may include selecting several individuals from whom a subset of audio signals may be generated from the multiple audio signals in user 2601's environment, and requesting that a particular type of selective adjustment (e.g., the second selective adjustment) be used to selectively condition the subset of audio signals. In some cases, a particular type of selective adjustment (e.g., the second selective adjustment) may condition at least some of the audio signals in user 2601's environment at each point in time by first determining each speaker's speech audio signal from the multiple audio signals and, for a given point in time, determining a pair of speech audio signals whose signal difference is less than a threshold difference. As previously mentioned, the signal difference may be measured using a suitable metric.
For example, the difference may be measured by measuring the difference in pitch, amplitude, rhythm, time stretch, or any other suitable parameter that can be used to characterize an audio signal. The second selective adjustment may then include amplifying the signal difference above the threshold by changing the pitch, amplitude, or duration of one of the speech audio signals from the pair of speech audio signals. An example of such a process is schematically illustrated by FIGS. 28A and 28B using corresponding processes 2801 and 2802. During process 2801, a composite audio signal 2811 (which may contain overlapping dialogue) may be decomposed into individual audio signals 2821-2825 using any suitable method described further herein. In some cases, these individual audio signals (e.g., signals 2821 and 2823) may overlap and may be similar over some time intervals indicated by time domains 2815 and 2817 (e.g., the signals may be similar enough that they might be confused by user 2601). For such cases, the selective adjustment may also include further differentiating signals 2821 and 2823, at least for time domains 2815 and 2817. As shown in process 2802 in FIG. 28B, to differentiate between signals 2821 and 2823, at least one of signals 2821 and 2823 may be altered via a computer-based application 2830 to amplify the difference measured using a suitable metric (e.g., amplify the difference so that it is higher than the threshold difference). For example, as shown in FIG. 28B, signal 2817 may be altered (e.g., time-stretched) to yield signal 2837, while signal 2815 may be time-compressed and its amplitude increased to produce signal 2835. In an example embodiment, the difference between signal 2835 and signal 2837 is above the threshold, making it easy for user 2601 to distinguish these modified signals.
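The idea of measuring a signal difference and, when it falls below a threshold, modifying one voice until the pair becomes distinguishable can be sketched with a toy amplitude metric. The metric, threshold, and gain step here are illustrative assumptions; a real system would compare pitch, rhythm, or time stretch as described above:

```python
def signal_difference(a, b):
    """Toy difference metric: mean absolute sample difference over the
    overlapping portion of two voice signals."""
    n = min(len(a), len(b))
    return sum(abs(a[i] - b[i]) for i in range(n)) / n

def differentiate_pair(a, b, threshold=0.5, gain_step=1.5):
    """If two voices are too similar (difference below the threshold),
    rescale one signal's amplitude until the pair is distinguishable."""
    for _ in range(20):  # bounded number of adjustment passes
        if signal_difference(a, b) >= threshold:
            break
        b = [s * gain_step for s in b]
    return a, b
```

A pair that is already distinct is returned unchanged; a near-identical pair has one member rescaled until the measured difference clears the threshold.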
在一些实施例中,可以选择特定操作模式以优化装置110的资源使用。例如,如果在用户2601附近只有一个体,则装置110可以被配置为切换到单一说话者操作模式(例如,单一说话者操作模式可以不需要确定说话者的群组中的活跃说话者,因此,减少装置110的功耗,否则该功耗可以与确定不同说话者之间的振幅比相关联)。可以使用各种其他操作模式来降低装置110的功耗并延长电池寿命(例如,不严重依赖于与诸如智能手机、膝上型计算机等的外围设备的无线通信的操作模式、以每分钟仅收集几个图像为特征的操作模式、不需要任何音频分析的操作模式等)。In some embodiments, a particular mode of operation may be selected to optimize the resource usage of apparatus 110. For example, if there is only one individual in the vicinity of user 2601, apparatus 110 may be configured to switch to a single-speaker mode of operation (e.g., a single-speaker mode of operation may not require determining the active speaker in a group of speakers, thereby reducing power consumption of apparatus 110 that might otherwise be associated with determining amplitude ratios between different speakers). Various other modes of operation may be used to reduce the power consumption of apparatus 110 and extend battery life (e.g., modes of operation that do not rely heavily on wireless communication with peripherals such as smartphones, laptops, etc., modes of operation characterized by collecting only a few images per minute, modes of operation that do not require any audio analysis, etc.).
助听器系统(例如,助听器系统可以包括装置110、听觉接口设备2615、相机2617A-2617B以及麦克风2613)的处理器2641可以基于对用户2601的环境中的多个图像的分析或基于对用户2601的环境中的多个音频信号的分析来确定切换到第二模式(如上所述)。例如,如果用户2601的环境中的个体正在说话,则处理器2641可以切换到第二模式。Processor 2641 of the hearing aid system (e.g., the hearing aid system may include apparatus 110, auditory interface device 2615, cameras 2617A-2617B, and microphone 2613) may determine to switch to the second mode (as described above) based on analysis of multiple images of user 2601's environment or based on analysis of multiple audio signals in user 2601's environment. For example, if an individual in user 2601's environment is speaking, processor 2641 may switch to the second mode.
在示例实施例中,处理器2641可以在第一模式下操作以引起第一选择性调节。第一选择性调节(如上所述)包括第一音频信号的放大。另外,处理器2641可以基于对多个图像或多个音频信号中的至少一个的分析,确定切换到第二模式以引起对第一音频信号的第二选择性调节,该第二选择性调节相对于第一选择性调节在至少一个方面不同。在示例实施例中,第二选择性调节可以包括相对于第一音频信号衰减多个音频信号中的至少一个第二音频信号。In an example embodiment, processor 2641 may operate in the first mode to cause the first selective adjustment. The first selective adjustment (as described above) includes amplification of the first audio signal. Additionally, processor 2641 may determine, based on analysis of at least one of the plurality of images or the plurality of audio signals, to switch to the second mode to cause a second selective adjustment of the first audio signal, the second selective adjustment differing from the first selective adjustment in at least one respect. In an example embodiment, the second selective adjustment may include attenuating at least one second audio signal of the plurality of audio signals relative to the first audio signal.
在示例实施例中,第一音频信号可以与第一个体的语音相关联,并且第二音频信号可以与第二个体的语音相关联。In an example embodiment, the first audio signal may be associated with the speech of the first individual and the second audio signal may be associated with the speech of the second individual.
在示例实施例中,处理器2641可以被配置为基于与多个图像或多个音频信号中的至少一个相关联的背景来确定切换到第二模式。In an example embodiment, the processor 2641 may be configured to determine to switch to the second mode based on a context associated with at least one of the plurality of images or the plurality of audio signals.
在示例实施例中,处理器2641可以在活动模式控制下操作。在这种模式下,处理器2641可以例如在第一模式与第二模式之间自动切换。在活动模式中,除了其他之外,处理器2641可以控制其音频被发送给用户的若干个说话者。例如,如果用户正在参加一个演讲,他或她可能只想听到演讲者而不想听到背景噪声,或者其他人的讲话,等等。然而,如果用户处在社交事件中并与许多人交谈,则用户可能希望在他们说话时听到他们中的每一个体。用户可以想要或不想要听到他自己或她自己,或者可以想要以较低的振幅听到他自己或她自己。当捕捉其他音频时,用户可能希望听到或不听到背景噪声,但当没有捕捉其他音频时,用户可能希望听到一些振幅的背景噪声,等等。在一些实施例中,处理器2641的操作模式可以根据用户选择而变化。用户可以通过操作用户界面来选择期望的操作模式,该用户界面例如显示在耦合到听觉接口设备的设备(诸如装置110、智能手机、膝上型计算机等)上。在其他实施例中,操作模式可以根据用户和助听器系统的背景(例如,环境)来自动改变。In an example embodiment, processor 2641 may operate under active mode control. In this mode, processor 2641 may, for example, switch automatically between the first mode and the second mode. In the active mode, processor 2641 may control, among other things, which speakers' audio is sent to the user. For example, if the user is attending a lecture, he or she may want to hear only the lecturer and not background noise, other people's speech, and so on. However, if the user is at a social event and talking with many people, the user may wish to hear each of them as they speak. The user may or may not want to hear himself or herself, or may want to hear himself or herself at a lower amplitude. The user may or may not wish to hear background noise while other audio is being captured, but may wish to hear background noise at some amplitude when no other audio is being captured, and so on. In some embodiments, the operating mode of processor 2641 may vary according to a user selection. The user may select a desired mode of operation by operating a user interface displayed, for example, on a device coupled to the auditory interface device (such as apparatus 110, a smartphone, laptop, etc.). In other embodiments, the operating mode may change automatically according to the context (e.g., environment) of the user and the hearing aid system.
For example, certain events may cause processor 2641 to assume a particular mode of operation (e.g., the first mode or the second mode). For example, when entering a venue identified as a lecture room, processor 2641 may switch to a mode in which only one speaker's speech is sent, and when leaving the room, processor 2641 may switch to a mode in which the speech of multiple speakers is sent according to which speaker is active. In an example embodiment, the context indicates that the user is entering the room. Alternatively, the context indicates that the user is leaving the room.
在示例实施例中,处理器2641可以被配置为基于与多个图像或多个音频信号中的至少一个相关联的背景来选择第一模式或第二模式。该背景指示用户参加演讲或指示用户参加社交事件。该背景可以指示用户进入演讲室,其中至少一个音频信号包括演讲者的语音,并且其中第一选择性调节包括对多个音频信号中的第一音频信号的放大以及衰减该多个音频信号中的至少另一个信号(诸如背景噪声)。背景可以指示用户离开演讲室,其中至少一个音频信号包括在演讲室之外的活跃说话者的语音,并且其中第一选择性调节包括对该语音的放大。在一些情况下,背景可以指示用户仅在一个体的预定距离内,并且其中至少一个音频信号包括该人的语音,并且其中第一选择性调节包括对该语音的放大。预定距离可以是任何合适的距离(例如,0.1米至5米范围内的距离)。在一些情况下,预定距离可以大于五米(例如,10米)。In an example embodiment, processor 2641 may be configured to select the first mode or the second mode based on a context associated with at least one of the plurality of images or the plurality of audio signals. The context may indicate that the user is attending a lecture or attending a social event. The context may indicate that the user is entering a lecture room, wherein at least one audio signal includes a lecturer's speech, and wherein the first selective adjustment includes amplifying a first audio signal of the plurality of audio signals and attenuating at least one other signal of the plurality of audio signals (such as background noise). The context may indicate that the user is leaving the lecture room, wherein at least one audio signal includes the speech of an active speaker outside the lecture room, and wherein the first selective adjustment includes amplification of that speech. In some cases, the context may indicate that the user is within a predetermined distance of only one individual, wherein at least one audio signal includes that person's speech, and wherein the first selective adjustment includes amplification of that speech. The predetermined distance may be any suitable distance (e.g., a distance in the range of 0.1 meters to 5 meters). In some cases, the predetermined distance may be greater than five meters (e.g., 10 meters).
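A context-to-adjustment dispatch like the one described above might look as follows; the context labels and distance threshold are hypothetical stand-ins for whatever the image/audio analysis actually produces:

```python
def choose_adjustment(context, distance_to_speaker=None, max_distance=5.0):
    """Select a selective adjustment from context cues: a lecture room
    amplifies the lecturer and attenuates everything else; being within a
    predetermined distance of a single person amplifies that person's voice."""
    if context == "entering_lecture_room":
        return {"amplify": "lecturer", "attenuate": "background"}
    if (context == "near_individual" and distance_to_speaker is not None
            and distance_to_speaker <= max_distance):
        return {"amplify": "individual", "attenuate": None}
    # No recognized context: leave the audio unadjusted.
    return {"amplify": None, "attenuate": None}
```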
在各种实施例中,如上所述,第一选择性调节包括改变至少一个音频信号的振幅。另外地或可替代地,第一选择性调节包括改变至少一个音频信号的音高。在一些情况下,如前所述,第一选择性调节包括对至少一个音频信号进行时间拉伸。In various embodiments, as described above, the first selective adjustment includes changing the amplitude of the at least one audio signal. Additionally or alternatively, the first selective adjustment includes changing the pitch of the at least one audio signal. In some cases, as previously described, the first selective adjustment includes time-stretching the at least one audio signal.
在一些实施例中,处理器2641可以被配置为基于从用户接收的选择来选择第一模式或第二模式。在示例实施例中,从诸如智能手机、膝上型计算机、智能手表、手镯或任何其他合适的可穿戴电子设备的电子设备接收该选择。In some embodiments, the processor 2641 may be configured to select the first mode or the second mode based on a selection received from a user. In an example embodiment, the selection is received from an electronic device such as a smartphone, laptop, smart watch, bracelet, or any other suitable wearable electronic device.
针对熟人的自定义过滤器Custom filter for acquaintances
根据本公开的实施例,用于有选择地调节声音的助听器系统可以包括如本文所述配置为从用户的环境捕捉多个图像的可穿戴相机(或多个可穿戴相机)。在各种实施例中,助听器系统可以包括一个或多个麦克风,它们被配置为如本文所述的捕捉来自用户环境的声音。如下文关于图26所描述的,助听器系统可以包括装置110和处理器2641。According to embodiments of the present disclosure, a hearing aid system for selectively adjusting sound may include a wearable camera (or multiple wearable cameras) configured to capture multiple images from a user's environment as described herein. In various embodiments, the hearing aid system may include one or more microphones configured to capture sound from the user's environment as described herein. As described below with respect to FIG. 26 , the hearing aid system may include apparatus 110 and processor 2641 .
在示例实施例中,处理器2641可以被配置为接收由相机(例如,如图26所示,相机2617A或2617B)捕捉的多个图像。另外,处理器2641可以被配置为如本文所述接收表示由至少一个麦克风从用户的环境捕捉的声音的多个音频信号。处理器2641可以被配置为识别由多个图像中的至少一个或由多个音频信号中的至少一个表示的至少一个辨识出的个体。在示例实施例中,装置110的处理器2641可以分析一个或多个捕捉的图像。例如,处理器2641可以被配置为接收由相机2617A或相机2617B捕捉的人的各种图像或特性。此外,装置110可以被配置为与可以存储各种对象和/或人的图像的服务器进行通信。在示例实施例中,装置110可以从服务器上传或下载图像。此外,装置110可以执行对存储在服务器处的图像(或视频)的搜索。In an example embodiment, processor 2641 may be configured to receive a plurality of images captured by a camera (e.g., camera 2617A or 2617B, as shown in FIG. 26). Additionally, processor 2641 may be configured to receive, as described herein, a plurality of audio signals representing sounds captured by at least one microphone from the user's environment. Processor 2641 may be configured to identify at least one recognized individual represented by at least one of the plurality of images or by at least one of the plurality of audio signals. In an example embodiment, processor 2641 of apparatus 110 may analyze one or more captured images. For example, processor 2641 may be configured to receive various images or characteristics of a person captured by camera 2617A or camera 2617B. Additionally, apparatus 110 may be configured to communicate with a server that may store images of various objects and/or people. In an example embodiment, apparatus 110 may upload images to or download images from the server. Furthermore, apparatus 110 may perform a search of images (or videos) stored at the server.
类似于上面讨论的实施例,处理器2641可以如本文所述使用基于计算机的模型来分析和识别对象。在一些情况下,面部特征可以通过合适的基于计算机的模型来分析。例如,可以使用基于计算机的模型来分析图像,并将在捕捉的图像中识别出的人的面部特征或面部特征之间的关系与在服务器的数据库中存储的图像中发现的人的面部特征或两者之间的关系进行比较。在一些实施例中,可以将人的面部动态运动的视频与从数据库获得的各种人的视频数据记录进行比较,以便确定在视频中捕捉的人是辨识出的个体。Similar to the embodiments discussed above, processor 2641 may analyze and identify objects using computer-based models as described herein. In some cases, facial features may be analyzed by a suitable computer-based model. For example, a computer-based model may be used to analyze images and compare the facial features, or the relationships between facial features, of a person identified in the captured images with the facial features, or the relationships between facial features, of persons found in images stored in the server's database. In some embodiments, a video of the dynamic movement of a person's face may be compared with video data records of various persons obtained from a database in order to determine that the person captured in the video is a recognized individual.
图29示出了具有装置110的用户100,装置110可以包括相机2617A和麦克风2613。在示例实施例中,用户100可以面对产生由麦克风2613检测到的相应音频信号2921和2922的个体2911和个体2912。个体2911和个体2912的图像可以由相机2617A检测到。另外,麦克风2613可以检测音频信号2923(例如,来自用户100不可见的对象或人的声音)。使用图像数据识别个体(例如,识别个体2911)可以是一种可能的方法。可替代地,可以基于在音频信号2921中检测到的语音来识别个体2911。在示例实施例中,如上所讨论的,可以检测个体2911的声纹。例如,处理器2641可以确定声音2921对应于个体2911的语音。这可以使用诸如隐式马尔可夫模型、动态时间规整、神经网络或其他技术的语音识别软件(例如,这种语音识别软件可以由处理器2641执行)来执行。在一些情况下,处理器2641可以被配置为将音频信号(例如,2921和2922)上传到服务器,并且服务器可以被配置为通过将音频信号2921与音频信号2922隔离并进一步使用个体2911的声纹来确定信号2921属于个体2911来处理这些信号。对于语音识别,服务器可以访问数据库(例如,如图20B所示的数据库2050),该数据库还可以包括一个或多个个体的声纹。在基于例如个体2911的声纹确定音频信号2921与个体2911匹配之后,助听器系统可以将个体2911识别为辨识出的个体。29 shows user 100 with device 110, which may include camera 2617A and microphone 2613. In an example embodiment, user 100 may face individual 2911 and individual 2912 that produce respective audio signals 2921 and 2922 detected by microphone 2613 . Images of individual 2911 and individual 2912 can be detected by camera 2617A. Additionally, the microphone 2613 can detect audio signals 2923 (eg, voices from objects or people not visible to the user 100). Using image data to identify individuals (eg, identify individual 2911 ) may be one possible approach. Alternatively, the individual 2911 may be identified based on speech detected in the audio signal 2921. In an example embodiment, as discussed above, the voiceprint of individual 2911 may be detected. For example, processor 2641 may determine that sound 2921 corresponds to the speech of individual 2911. This may be performed using speech recognition software such as hidden Markov models, dynamic time warping, neural networks, or other techniques (eg, such speech recognition software may be executed by processor 2641). 
In some cases, processor 2641 may be configured to upload audio signals (e.g., 2921 and 2922) to a server, and the server may be configured to process these signals by isolating audio signal 2921 from audio signal 2922 and further using the voiceprint of individual 2911 to determine that signal 2921 belongs to individual 2911. For speech recognition, the server may access a database (e.g., database 2050, as shown in FIG. 20B), which may also include the voiceprints of one or more individuals. After determining that audio signal 2921 matches individual 2911 based on, for example, the voiceprint of individual 2911, the hearing aid system may identify individual 2911 as a recognized individual.
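Matching an isolated voice against stored voiceprints is commonly done by comparing fixed-length embedding vectors. This sketch assumes the voiceprints have already been reduced to numeric vectors; the cosine-similarity threshold and database layout are illustrative assumptions, not details from the disclosure:

```python
def cosine_similarity(u, v):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sum(a * a for a in u) ** 0.5
    norm_v = sum(b * b for b in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def match_voiceprint(embedding, database, threshold=0.8):
    """Return the identity of the enrolled voiceprint most similar to the
    isolated speech embedding, or None if no match clears the threshold."""
    best_id, best_score = None, threshold
    for identity, enrolled in database.items():
        score = cosine_similarity(embedding, enrolled)
        if score >= best_score:
            best_id, best_score = identity, score
    return best_id
```

An embedding close to an enrolled voiceprint yields that identity; an embedding far from every enrolled voiceprint yields no match, leaving the individual unrecognized.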
该识别过程可以单独使用或与上述图像识别技术(例如,面部识别技术)结合使用。例如,可以使用面部识别技术来识别个体2911,并且可以使用语音识别来验证个体2911,反之亦然。在一些情况下,个体讲话的视频可以用于该个体的识别。例如,面部特征(例如,个体2911的唇部移动)与音频信号2921的同步可以被用于将个体2911识别为说话者。然后,可以使用针对该说话者提取的声纹来将与个体2911相关联的声音从音频信号2921中表示的其他声音中分离出来。在一些实施例中,可以对个体2911发出的音频执行诸如识别词语的附加处理,或者贯穿本公开描述的其他形式的处理。在各种实施例中,处理器2641可以识别一个或多个个体。例如,处理器2641可以被配置为识别个体2911和个体2912。This recognition process may be used alone or in combination with the image recognition techniques described above (e.g., facial recognition techniques). For example, facial recognition may be used to identify individual 2911, and speech recognition may be used to verify individual 2911, or vice versa. In some cases, a video of an individual speaking may be used for identification of that individual. For example, the synchronization of facial features (e.g., the lip movements of individual 2911) with audio signal 2921 may be used to identify individual 2911 as the speaker. The voiceprint extracted for that speaker may then be used to separate the voice associated with individual 2911 from other voices represented in audio signal 2921. In some embodiments, additional processing, such as recognizing words, or other forms of processing described throughout this disclosure, may be performed on the audio uttered by individual 2911. In various embodiments, processor 2641 may identify one or more individuals. For example, processor 2641 may be configured to identify individual 2911 and individual 2912.
在一些实施例中,装置110可以如图29中所示与音频信号2923一样检测不在装置110的视场内的个体的语音。例如,语音可以通过免提电话、从车辆后座或类似的地方听到。在这样的实施例中,在视场中没有说话者的情况下,个体的识别可以仅基于个体的语音。In some embodiments, device 110 may detect speech of individuals not within the field of view of device 110, as with audio signal 2923 shown in FIG. 29. For example, speech may be heard over a speakerphone, from the back seat of a vehicle, or the like. In such embodiments, in the absence of a speaker in the field of view, the identification of the individual may be based solely on the individual's speech.
助听器系统的处理器(例如,处理器2641)可以被配置为从存储器中检索与至少一个辨识出的个体相关联的调节配置文件。调节配置文件可以是可由处理器2641执行的用于选择性地调节音频信号的任何合适指令集。选择性调节可以包括音频信号的放大或衰减、从音频信号中去除噪声(例如,抑制在音频信号中识别出的一些频率)等。在一些情况下,音频信号的一些部分可以被放大(例如,对应于个体2911的语音的音频信号可以被放大),而音频信号的其他部分(例如,背景音乐或个体2912的语音)可以被抑制。在一些情况下,处理器2641可以被配置为分析个体2911的语音并识别语音内的词语。如果在音频信号2921的一部分中不能清晰地识别出词语(例如,处理器2641确定在该部分中正确识别出词语的概率较低),则处理器2641可以被配置为选择性地调节该部分(例如,放大音频信号的该部分)。在一些情况下,处理器2641可以被配置为选择性地调节音频信号2921中不能识别其中的词语的部分,然后重复试图放大以识别该部分中的词语。可以多次执行这样的迭代以优化音频信号2921的选择性调节。A processor of the hearing aid system (e.g., processor 2641) may be configured to retrieve from memory an adjustment profile associated with the at least one identified individual. The adjustment profile may be any suitable set of instructions executable by the processor 2641 for selectively conditioning the audio signal. Selective conditioning may include amplification or attenuation of the audio signal, removing noise from the audio signal (e.g., suppressing some frequencies identified in the audio signal), and the like. In some cases, some portions of the audio signal may be amplified (e.g., the audio signal corresponding to the speech of individual 2911 may be amplified), while other portions of the audio signal (e.g., background music or the speech of individual 2912) may be suppressed. In some cases, the processor 2641 may be configured to analyze the speech of the individual 2911 and recognize words within the speech. If a word cannot be clearly identified in a portion of the audio signal 2921 (e.g., the processor 2641 determines that the probability of correctly identifying a word in that portion is low), the processor 2641 may be configured to selectively condition that portion (e.g., amplify that portion of the audio signal). In some cases, processor 2641 may be configured to selectively condition portions of audio signal 2921 in which words cannot be recognized, and then repeatedly attempt amplification to recognize the words in those portions. Such iterations may be performed multiple times to optimize the selective conditioning of the audio signal 2921.
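The iterative amplify-and-recognize loop described above may be sketched as follows, under stated assumptions: the confidence function stands in for a speech recognizer's word-recognition confidence (the disclosure does not name a recognizer), and amplification is modeled as a simple per-sample gain.

```python
# Illustrative sketch: the confidence function is a hypothetical stand-in
# for a speech recognizer, and amplification is a per-sample gain.

def amplify(samples, gain):
    return [s * gain for s in samples]

def iteratively_condition(samples, confidence_fn, target=0.9,
                          gain_step=1.5, max_iterations=3):
    """Re-amplify a low-confidence portion until words become recognizable
    or the iteration budget is exhausted."""
    conditioned = samples
    for _ in range(max_iterations):
        if confidence_fn(conditioned) >= target:
            break
        conditioned = amplify(conditioned, gain_step)
    return conditioned

# Toy confidence model: louder portions are "easier to recognize".
def toy_confidence(samples):
    return min(1.0, max(abs(s) for s in samples))

result = iteratively_condition([0.2, -0.3, 0.25], toy_confidence)
```

With these toy values the loop applies the gain three times before the confidence target is reached.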
调节配置文件可以允许选择性地调节音频信号,使得例如修改个体2911的语音,从而提高语音的清晰度。在示例实施例中,语音渲染可以用于移除说话者的口音或将不同的口音重新应用到语音。在示例实施例中,人的语音的原始音频信号都可以与渲染的语音组合(例如,变形),以便在清晰语音的同时保留说话者的一些自然特性。语音渲染可以包括改变说话的人的音高(例如,如果用户100难以识别特定频率,则这种渲染可能是有益的)、人的语音的抑扬顿挫、人的语音的响度或人的语音的任何其他特性(例如,滤波器可以被应用于人的语音以将人的声音从男声改变为女声)。此外,选择性调节可以用于与个体2911的语音无关的背景声音的任何适当修改。The adjustment profile may allow the audio signal to be selectively conditioned such that, for example, the speech of the individual 2911 is modified to improve the intelligibility of the speech. In an example embodiment, speech rendering may be used to remove a speaker's accent or reapply a different accent to the speech. In an example embodiment, the original audio signal of the person's speech may be combined (e.g., morphed) with the rendered speech in order to preserve some of the speaker's natural characteristics while keeping the speech clear. The speech rendering may include changing the pitch of the person speaking (e.g., such rendering may be beneficial if the user 100 has difficulty recognizing particular frequencies), the cadence of the person's speech, the loudness of the person's speech, or any other characteristic of the person's speech (e.g., filters may be applied to the person's speech to change the voice from a male voice to a female voice). In addition, selective conditioning may be used for any suitable modification of background sounds unrelated to the speech of the individual 2911.
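One speech-rendering operation mentioned above, pitch shifting, may be sketched as a toy nearest-neighbor resampler. This is illustrative only; resampling this way also changes the signal's duration, and a real implementation would use a proper pitch-shifting algorithm.

```python
# Toy pitch shift by nearest-neighbor resampling (illustrative only; this
# also shortens or lengthens the signal, unlike a real pitch shifter).

def shift_pitch(samples, factor):
    """factor > 1 raises the pitch (shorter output); factor < 1 lowers it."""
    length = max(1, int(len(samples) / factor))
    return [samples[min(len(samples) - 1, int(i * factor))]
            for i in range(length)]

raised = shift_pitch([0.0, 0.5, 1.0, 0.5, 0.0, -0.5], factor=2.0)
lowered = shift_pitch([0.1], factor=0.5)
```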
在各种实施例中,调节配置文件可以包括用于选择性地调节音频信号的信息。在一些情况下,对应于调节配置文件的指令可以包括任何合适的逻辑元素。例如,当选择性调节受制于针对音频信号的一部分(例如,在音频信号2921内)观察到的特定特性(例如,音高或响度)时,可以在调节配置文件中使用IF逻辑子句。在执行选择性调节之后,处理器2641可以被配置为引起经调节的第一音频信号向被配置为向用户(例如,用户100)的耳朵提供声音的听觉接口设备的传输。在示例实施例中,调节配置文件可以包括预定义滤波器,用于基于音频信号2921的频率速率或振幅中的至少一个来选择性地调节音频信号(例如,音频信号2921)。In various embodiments, the conditioning profile may include information for selectively conditioning the audio signal. In some cases, the instructions corresponding to the adjustment profile may include any suitable logical elements. For example, an IF logic clause may be used in the adjustment profile when the selective adjustment is subject to a particular characteristic (eg, pitch or loudness) observed for a portion of the audio signal (eg, within audio signal 2921). After performing the selective adjustment, the processor 2641 may be configured to cause transmission of the adjusted first audio signal to an auditory interface device configured to provide sound to the ear of a user (eg, user 100). In an example embodiment, the conditioning profile may include predefined filters for selectively conditioning the audio signal (eg, the audio signal 2921 ) based on at least one of a frequency rate or an amplitude of the audio signal 2921 .
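The IF-clause structure of an adjustment profile described above may be sketched as follows; the rule representation, loudness measure, and thresholds are illustrative assumptions rather than the disclosed implementation.

```python
# Illustrative profile: a list of (condition, gain) rules; each condition is
# the "IF" clause, here keyed on average loudness (thresholds hypothetical).

def loudness(samples):
    return sum(abs(s) for s in samples) / len(samples)

def apply_profile(samples, profile):
    """Apply every rule whose condition holds for the current signal."""
    out = list(samples)
    for condition, gain in profile["rules"]:
        if condition(out):  # the IF clause of the adjustment profile
            out = [s * gain for s in out]
    return out

profile_2911 = {
    "rules": [
        (lambda s: loudness(s) < 0.1, 2.0),   # quiet -> amplify
        (lambda s: loudness(s) > 0.8, 0.5),   # very loud -> attenuate
    ]
}
quiet = [0.05, -0.04, 0.06]
conditioned = apply_profile(quiet, profile_2911)
```

A signal whose loudness triggers no condition passes through unchanged.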
在示例实施例中,存储器可以与可穿戴相机位于相同的外壳内(即,存储器可以是本地存储器)。存储器可以是任何合适的存储器(例如,固态存储器、硬盘驱动器等)。在一些情况下,调节配置文件的至少一部分可以存储在本地存储器中,而另一部分可以存储在远程存储器中(例如,存储在远程数据库中)。在一些情况下,可以基于所识别的个体来从远程数据库中选择和检索调节配置文件。In an example embodiment, the memory may be located within the same housing as the wearable camera (i.e., the memory may be local memory). The memory may be any suitable memory (e.g., solid-state memory, a hard drive, etc.). In some cases, at least a portion of the adjustment profile may be stored in local memory, while another portion may be stored in remote memory (e.g., in a remote database). In some cases, an adjustment profile may be selected and retrieved from the remote database based on the identified individual.
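The local/remote split in profile storage described above may be sketched as a simple lookup with fallback; both stores are plain dictionaries standing in for real local memory and a remote database, and the profile contents are hypothetical.

```python
# Illustrative storage split: plain dicts stand in for local memory and a
# remote database; a retrieved remote profile is cached locally.

local_memory = {"individual_2911": {"gain": 1.5}}
remote_database = {"individual_2911": {"gain": 1.5},
                   "individual_2912": {"gain": 0.8}}

def retrieve_profile(individual_id):
    """Prefer the local copy; fall back to the remote database."""
    if individual_id in local_memory:
        return local_memory[individual_id]
    profile = remote_database.get(individual_id)
    if profile is not None:
        local_memory[individual_id] = profile  # cache for next time
    return profile

profile = retrieve_profile("individual_2912")
```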
在示例实施例中,助听器系统的处理器2641还被编程为确定对与辨识出的个体(例如,个体2911)相关联的选择性调节的至少一个修改,并基于该修改来更新调节配置文件。例如,确定至少一个修改可以包括基于用户100能有多好地听到个体2911(或用户100能有多好地辨别个体2911的词语)来确定需要执行音频信号2921的放大。在示例实施例中,用户100可以使用任何合适的手段(例如,经由音频信号或经由如本文所讨论的用于助听器系统的合适接口)向处理器2641提供反馈。助听器系统的界面可以包括装置110上的按钮,或者可以是例如在耦合到助听器系统的设备(诸如智能手机、膝上型计算机等)上显示的应用程序。用户100可以向助听器系统提供特定指令(例如,用户100可以请求增加来自个体2911的音频信号2921的振幅)或者可以提供更复杂的指令(例如,提高个体2911的语音的清晰度,提高从相机2617A指向的点接收到的音频信号的清晰度,或者抑制背景声音)。这种复杂的指令可以由助听器系统解释,并且可以进行对应的修改。在一些情况下,可以从可能的修改列表中选择指令。可替代地,当经由来自用户100的语音命令提供指令时,可以通过装置110的语言处理应用来分析这样的指令,并且可以做出对应于指令的修改。在一个示例实施例中,装置110的语言处理应用可以是能够转录人类语音并从所得到的转录文本确定用于修改音频信号的指令的任何合适的软件应用。In an example embodiment, the hearing aid system's processor 2641 is also programmed to determine at least one modification to the selective adjustment associated with the identified individual (eg, individual 2911), and to update the adjustment profile based on the modification. For example, determining at least one modification may include determining that amplification of the audio signal 2921 needs to be performed based on how well the user 100 can hear the individual 2911 (or how well the user 100 can discern the words of the individual 2911). In an example embodiment, the user 100 may provide feedback to the processor 2641 using any suitable means (eg, via an audio signal or via a suitable interface for a hearing aid system as discussed herein). The interface of the hearing aid system may include buttons on the apparatus 110, or may be, for example, an application displayed on a device coupled to the hearing aid system (such as a smartphone, laptop, etc.). 
The user 100 may provide specific instructions to the hearing aid system (e.g., the user 100 may request an increase in the amplitude of the audio signal 2921 from the individual 2911) or may provide more complex instructions (e.g., improve the intelligibility of the speech of individual 2911, improve the clarity of the audio signal received from the point at which camera 2617A is directed, or suppress the background sound). Such complex instructions may be interpreted by the hearing aid system, and corresponding modifications may be made. In some cases, an instruction may be selected from a list of possible modifications. Alternatively, when instructions are provided via voice commands from user 100, such instructions may be analyzed by a language processing application of device 110, and modifications corresponding to the instructions may be made. In one example embodiment, the language processing application of device 110 may be any suitable software application capable of transcribing human speech and determining instructions for modifying audio signals from the resulting transcribed text.
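Interpreting a transcribed user command as profile modifications, as described above, may be sketched as keyword matching; the keyword table and modification names are hypothetical, since the disclosure only states that transcribed text is analyzed for instructions.

```python
# Hypothetical keyword table mapping transcribed words to modifications.

KEYWORD_MODIFICATIONS = {
    "louder": ("amplify", 1.5),
    "quieter": ("attenuate", 0.7),
    "repeat": ("replay_last_segment", None),
    "background": ("suppress_background", None),
}

def interpret_command(transcript):
    """Return the modifications implied by a transcribed voice command."""
    words = transcript.lower().split()
    return [mod for key, mod in KEYWORD_MODIFICATIONS.items() if key in words]

mods = interpret_command("Please make him louder and suppress the background")
```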
另外,由处理器2641确定修改可以基于各种环境因素(噪声的存在、背景音乐、其他说话者),并且可以在不接收来自用户100的指示的情况下自动进行。在各种情况下,可以基于所确定的修改适当地更新调节配置文件。在一些情况下,助听器系统可能需要在更新调节配置文件之前接收用户的批准。一旦确定了修改,处理器2641可以被配置为应用修改。Additionally, the modification determined by the processor 2641 may be based on various environmental factors (presence of noise, background music, other speakers), and may be done automatically without receiving an indication from the user 100 . In each case, the adjustment profile may be updated as appropriate based on the determined modifications. In some cases, the hearing aid system may need to receive approval from the user before updating the adjustment profile. Once the modifications are determined, the processor 2641 may be configured to apply the modifications.
如前所述,修改可以包括用户的调整(即,指令)。可以经由助听器系统的合适接口进行调整。如所描述的,至少一个修改可以基于来自用户的听力困难的指示来确定。该困难指示可以经由来自用户的音频信号或经由助听器系统的接口接收的来自用户的输入中的一个来确定。在示例实施例中,至少一个修改包括音频信号(例如,信号2921)的放大或修改音频信号2921的频谱中的一个。在一些情况下,选择性调节包括衰减与辨识出的个体2911不相关联的至少另一个音频信号(例如,音频信号2922或信号2923)。As previously described, the modifications may include adjustments (i.e., instructions) by the user. Adjustments may be made via a suitable interface of the hearing aid system. As described, at least one modification may be determined based on an indication of hearing difficulty from the user. The indication of difficulty may be determined via one of an audio signal from the user or an input from the user received via an interface of the hearing aid system. In an example embodiment, the at least one modification includes one of amplification of the audio signal (e.g., signal 2921) or modification of the frequency spectrum of the audio signal 2921. In some cases, the selective conditioning includes attenuating at least another audio signal (e.g., audio signal 2922 or signal 2923) not associated with the identified individual 2911.
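The selective conditioning described above, amplifying the identified individual's signal while attenuating the others, may be sketched as a gain-weighted mix; the gains and sample values are illustrative.

```python
# Illustrative gain-weighted mix: the identified individual's signal is
# amplified while the remaining signals are attenuated (gains hypothetical).

def selectively_condition(signals, target_id, gain=2.0, attenuation=0.25):
    """signals: dict of source id -> equal-length lists of samples."""
    length = len(next(iter(signals.values())))
    mixed = [0.0] * length
    for source_id, samples in signals.items():
        factor = gain if source_id == target_id else attenuation
        for i, s in enumerate(samples):
            mixed[i] += factor * s
    return mixed

signals = {"2921": [0.1, 0.2],   # identified individual's speech
           "2922": [0.4, 0.4],   # another speaker
           "2923": [0.2, 0.0]}   # out-of-view source
out = selectively_condition(signals, target_id="2921")
```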
在示例实施例中,助听器系统的处理器2641可以被编程为识别由多个图像中的至少一个或由多个音频信号中的至少一个表示的另一辨识出的个体。例如,处理器2641可以被配置为识别个体2912。在一些情况下,处理器2641可以被配置为当用户100将相机2617A指向个体2912时识别个体2912。在一些情况下,当装置110包括两个相机(例如,如图26所示的相机2617A和2617B)时,相机2617A可以被配置为识别个体2911,并且相机2617B可以被配置为识别个体2912。一旦识别个体2912(上面讨论了使用图像或音频数据的个体的识别),处理器2641可以从存储器中检索与另一辨识出的个体相关联的另一调节配置文件。例如,如果装置110包括本地存储器,则处理器2641可以从本地存储器检索与个体2912相关联的调节配置文件。在一些情况下,与个体2912相关联的调节配置文件(例如CP2912)可以与个体2911相关联的调节配置文件(例如CP2911)相同。可替代地,CP2912可以不同于CP2911。处理器2641可以使用CP2911来选择性地调节音频信号2921,并且可以使用CP2912来选择性地调节音频信号2922。在一些情况下,当信号2921和2922不能清楚地分离时,CP2911和CP2912可以应用于包含信号2921和2922的组合音频信号。In an example embodiment, the processor 2641 of the hearing aid system may be programmed to identify another identified individual represented by at least one of the plurality of images or by at least one of the plurality of audio signals. For example, processor 2641 may be configured to identify individual 2912. In some cases, the processor 2641 may be configured to identify the individual 2912 when the user 100 points the camera 2617A at the individual 2912. In some cases, when device 110 includes two cameras (e.g., cameras 2617A and 2617B as shown in FIG. 26), camera 2617A may be configured to identify individual 2911 and camera 2617B may be configured to identify individual 2912. Once the individual 2912 is identified (identification of individuals using image or audio data is discussed above), the processor 2641 may retrieve another adjustment profile associated with the other identified individual from memory. For example, if device 110 includes local memory, processor 2641 may retrieve the adjustment profile associated with individual 2912 from the local memory. In some cases, the adjustment profile associated with individual 2912 (e.g., CP2912) may be the same as the adjustment profile associated with individual 2911 (e.g., CP2911). Alternatively, CP2912 may be different from CP2911. The processor 2641 may selectively condition the audio signal 2921 using CP2911 and may selectively condition the audio signal 2922 using CP2912. In some cases, CP2911 and CP2912 may be applied to a combined audio signal comprising signals 2921 and 2922 when signals 2921 and 2922 cannot be clearly separated.
在示例实施例中,如图26所示,使用CP2911调节的音频信号2921可以被发送到听觉接口设备(例如,听觉接口设备2615),并且使用CP2912调节的音频信号2922也可以被发送到听觉接口设备2615。在示例性实施例中,处理器2641可以在时间上分离音频信号2921和2922。例如,当音频信号2921与信号2922大约同时发出时,处理器2641可以将它们分开足够的时间量,以便用户100更好地辨别每个信号。在一些情况下,当听觉接口设备2615向用户100的两只耳朵发送音频信号时,经第一选择性调节的音频信号(例如,经选择性调节的信号2921)可以被发送到左耳(或右耳),并且经第二选择性调节的音频信号(例如,经选择性调节的信号2922)可以被发送到右耳(或左耳)。In an example embodiment, as shown in FIG. 26, an audio signal 2921 conditioned using CP2911 may be sent to an auditory interface device (e.g., auditory interface device 2615), and an audio signal 2922 conditioned using CP2912 may also be sent to auditory interface device 2615. In an exemplary embodiment, processor 2641 may separate audio signals 2921 and 2922 in time. For example, when audio signal 2921 and signal 2922 are emitted at about the same time, processor 2641 may separate them by a sufficient amount of time for user 100 to better discern each signal. In some cases, when auditory interface device 2615 sends audio signals to both ears of user 100, the first selectively conditioned audio signal (e.g., selectively conditioned signal 2921) may be sent to the left ear (or right ear), and the second selectively conditioned audio signal (e.g., selectively conditioned signal 2922) may be sent to the right ear (or left ear).
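The temporal separation and per-ear routing described above may be sketched as follows; the stereo frame layout and gap length are illustrative assumptions.

```python
# Illustrative routing: two conditioned mono signals are either interleaved
# into stereo frames (one per ear) or separated in time by a silent gap.

def route_to_ears(left_signal, right_signal):
    """Interleave two mono signals into (left, right) stereo frames."""
    return list(zip(left_signal, right_signal))

def separate_in_time(first, second, gap_samples):
    """Play the first signal, a silent gap, then the second signal."""
    return first + [0.0] * gap_samples + second

stereo = route_to_ears([0.1, 0.2], [0.3, 0.4])
sequenced = separate_in_time([0.1, 0.2], [0.3, 0.4], gap_samples=2)
```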
在一些情况下,助听器系统可以被配置为识别个体群组(而不是单个辨识出的个体)并且选择性地调节来自这样群组的声音。例如,相机2617A可以捕捉人的群组的图像,处理器2641可以被配置为识别该群组。在一个示例实施例中,可以存在针对该群组的调节配置文件(与针对单个个体的调节配置文件相反)。例如,如果处理器2641识别出从教室中的学童的群组接收到音频信号,则调节配置文件可以包括降低由该群组的每个成员发出的音频信号的振幅的指令。In some cases, the hearing aid system may be configured to identify groups of individuals (rather than a single identified individual) and to selectively condition sounds from such groups. For example, camera 2617A can capture an image of a group of people, and processor 2641 can be configured to identify the group. In one example embodiment, there may be an adjustment profile for the group (as opposed to an adjustment profile for a single individual). For example, if the processor 2641 identifies that an audio signal is received from a group of school children in a classroom, the adjustment profile may include instructions to reduce the amplitude of the audio signal emitted by each member of the group.
图30A示出了选择性地调节音频信号并将音频信号发送到用户耳朵的示例过程3001。在过程3001的步骤3011处,处理器2641可以接收由与助听器系统相关联的相机(例如,相机2617A)捕捉的图像。在步骤3013处,处理器2641可以从与助听器系统相关联的麦克风(例如,麦克风2613)接收音频信号。在步骤3015处,如上所述,处理器2641可以使用任何合适的图像识别技术来识别由使用相机2617A捕捉的图像所表示的个体(例如,个体2911)。在一些实施例中,可以基于个体的声纹来识别个体。例如,可以在只有个体在说话的时间段内获得声纹,或者在可以获得个体的声纹的音频信号的其他部分中获得声纹。然后可以将该声纹与预先存储在数据库中的多个声纹进行比较,以识别该个体。在步骤3017处,处理器2641可以如上所述从存储器(例如,与助听器系统相关联的本地存储器或与合适的云计算资源相关联的远程存储器)检索调节配置文件。在步骤3019处,处理器2641可以基于检索到的调节配置文件来对从用户(例如,如图29所述的用户100)的环境接收的音频信号进行选择性调节。可以从辨识出的个体2911接收音频信号(例如,音频信号2921)。在一些情况下,用户100可以将相机2617A指向辨识出的个体2911以及麦克风2613。过程3001可以在步骤3021结束,在步骤3021处,处理器2641可以经由听觉接口设备2615(例如,在图26中示出)将使用从调节配置文件获得的指令选择性地调节的音频信号发送到用户100的耳朵。30A illustrates an example process 3001 for selectively conditioning and sending audio signals to a user's ear. At step 3011 of process 3001, processor 2641 may receive images captured by a camera (eg, camera 2617A) associated with the hearing aid system. At step 3013, the processor 2641 may receive audio signals from a microphone (eg, microphone 2613) associated with the hearing aid system. At step 3015, processor 2641 may use any suitable image recognition technique to identify the individual (eg, individual 2911) represented by the image captured using camera 2617A, as described above. In some embodiments, individuals may be identified based on their voiceprints. For example, voiceprints may be obtained during periods of time during which only the individual is speaking, or in other parts of the audio signal where the individual's voiceprint may be obtained. The voiceprint can then be compared to multiple voiceprints pre-stored in a database to identify the individual. At step 3017, the processor 2641 may retrieve the adjustment profile from memory (eg, local memory associated with the hearing aid system or remote memory associated with a suitable cloud computing resource) as described above. 
At step 3019, the processor 2641 may selectively adjust the audio signal received from the environment of the user (e.g., user 100 as described in FIG. 29) based on the retrieved adjustment profile. An audio signal (e.g., audio signal 2921) may be received from the identified individual 2911. In some cases, user 100 may point camera 2617A, as well as microphone 2613, at the identified individual 2911. Process 3001 may end at step 3021, where processor 2641 may send the audio signal, selectively adjusted using instructions obtained from the adjustment profile, to the ear of the user 100 via auditory interface device 2615 (e.g., as shown in FIG. 26).
图30B示出了选择性地调节音频信号并将音频信号发送到用户耳朵的示例过程3002。过程3002是过程3001的变型。例如,过程3002可以包括如上面结合图30A所讨论的步骤3011和3013。在步骤3025处,过程3002可以不同于过程3001。在步骤3025处,处理器2641可以不使用图像数据来识别个体(如在步骤3015中所述),而是使用从个体接收的音频数据来识别个体。例如,可以经由个体的声纹来识别音频数据。如上所述,例如,可以在只有个体说话的时间段内获得声纹。然后可以将该声纹与预先存储在数据库中的声纹进行比较,以识别该个体。在一些实施例中,应当理解,步骤3015和3025可以如上所述由处理器2641同时执行。过程3002还可以包括如上面结合图30A所讨论的步骤3017、3019和3021。30B illustrates an example process 3002 for selectively conditioning and sending audio signals to a user's ear. Process 3002 is a variation of process 3001. For example, process 3002 may include steps 3011 and 3013 as discussed above in connection with Figure 30A. At step 3025, process 3002 may differ from process 3001. At step 3025, the processor 2641 may not use the image data to identify the individual (as described in step 3015), but instead use the audio data received from the individual to identify the individual. For example, audio data may be identified via an individual's voiceprint. As mentioned above, for example, voiceprints can be obtained during periods of time when only the individual is speaking. This voiceprint can then be compared to voiceprints pre-stored in a database to identify the individual. In some embodiments, it should be understood that steps 3015 and 3025 may be performed concurrently by processor 2641 as described above. Process 3002 may also include steps 3017, 3019, and 3021 as discussed above in connection with Figure 30A.
图31示出了选择性地调节音频信号并将音频信号发送到用户耳朵的示例过程3101。过程3101是过程3001或过程3002的变型。例如,过程3101可以包括如上面结合图30A所讨论的步骤3011和3013。在步骤3115和3117处,过程3101可以不同于过程3001或3002。在步骤3115处,处理器2641可以如上所述识别由相机2617A捕捉的图像或由麦克风2613捕捉的音频表示的个体群组。在步骤3117处,处理器2641可以如上所述从存储器中检索对应于所识别的群组的调节配置文件。过程3101还可以包括如上面结合图30A所讨论的步骤3019和3021。FIG. 31 shows an example process 3101 for selectively conditioning and sending audio signals to a user's ear. Process 3101 is a variation of process 3001 or process 3002. For example, process 3101 may include steps 3011 and 3013 as discussed above in connection with FIG. 30A. At steps 3115 and 3117, process 3101 may differ from processes 3001 and 3002. At step 3115, the processor 2641 may identify a group of individuals represented by the images captured by the camera 2617A or the audio captured by the microphone 2613, as described above. At step 3117, the processor 2641 may retrieve the adjustment profile corresponding to the identified group from memory, as described above. Process 3101 may also include steps 3019 and 3021 as discussed above in connection with FIG. 30A.
基于用户提示的选择性调节Selective adjustments based on user prompts
助听器系统旨在改进和增强用户与环境的互动。用户可能依赖助听器系统来导航他们的周围环境和日常活动。然而,不同的用户可能取决于环境需要不同程度的帮助。典型的助听器系统可能不能基于用户的需要来充分地校正或调整音频信号。因此,需要用于基于来自用户的提示而自动调节用户的音频信号的装置和方法。Hearing aid systems are designed to improve and enhance the user's interaction with the environment. Users may rely on hearing aid systems to navigate their surroundings and daily activities. However, different users may require different levels of assistance depending on the circumstances. Typical hearing aid systems may not be able to adequately correct or adjust the audio signal based on the needs of the user. Accordingly, there is a need for an apparatus and method for automatically adjusting a user's audio signal based on prompts from the user.
所公开的实施例包括可以被配置为基于来自助听器用户的提示来校正或调整音频信号的助听器系统。例如,该提示可以是物理的(例如,用户将耳朵倾斜或转向声源或语音、用户将他或她的手举向耳朵等)或口头的(例如,用户可以陈述“什么?”或“重复”等),提示可以由可穿戴相机设备上的传感器和/或麦克风来收集。基于检测到的提示,系统可以识别用户听力困难(例如,理解说话的个体等),并自动校正或调整至少一个音频信号。例如,系统可以选择性地放大来自声源的声音。The disclosed embodiments include hearing aid systems that may be configured to correct or adjust audio signals based on cues from a hearing aid user. For example, the cue may be physical (e.g., the user tilting or turning an ear toward a sound source or speech, the user raising his or her hand toward an ear, etc.) or verbal (e.g., the user may state "What?" or "Repeat," etc.), and the cues may be collected by sensors and/or microphones on the wearable camera device. Based on the detected cues, the system may identify that the user is having difficulty hearing (e.g., understanding a speaking individual) and automatically correct or adjust at least one audio signal. For example, the system may selectively amplify sound from a sound source.
图32是示出符合所公开实施例的用于使用具有语音和/或图像识别的助听器的示例性环境的示意图。可穿戴相机(例如,装置110的可穿戴相机)可以被配置为从用户100的环境捕捉多个图像。可替代地或另外地,至少一个麦克风可以被配置为从用户100的环境捕捉声音。32 is a schematic diagram illustrating an exemplary environment for using a hearing aid with speech and/or image recognition consistent with the disclosed embodiments. A wearable camera (eg, the wearable camera of device 110 ) may be configured to capture multiple images from the environment of user 100 . Alternatively or additionally, the at least one microphone may be configured to capture sound from the environment of the user 100 .
例如,处理器210可以接收表示由至少一个麦克风1750从用户100的环境捕捉的一个或多个声音的至少一个音频信号3203、3205或3207。在一些实施例中,处理器210可以基于对至少一个音频信号(例如,音频信号3203)的分析来识别用户100的至少一个动作。在一些实施例中,至少一个动作可以包括用户100的讲话。例如,识别至少一个动作可以包括基于对至少一个音频信号3203的分析来检测用户100所说的词语。在一些实施例中,词语可以指示用户100没有听清楚。例如,用户100可能说了“什么?”或“您能重复您所说的吗?”或“我不明白”。在一些实施例中,处理器210可以根据检测到的词语的类型和/或频率将更大的听力困难与用户100相关联。例如,“重复”可能与比“我不明白”更大的听力困难相关联,和/或重复的词语(例如,用户100重复陈述“什么?”)可能与比不重复的词语更大的听力困难相关联。For example, the processor 210 may receive at least one audio signal 3203, 3205, or 3207 representing one or more sounds captured by the at least one microphone 1750 from the environment of the user 100. In some embodiments, processor 210 may identify at least one action of user 100 based on analysis of at least one audio signal (e.g., audio signal 3203). In some embodiments, the at least one action may include speech by the user 100. For example, identifying the at least one action may include detecting words spoken by the user 100 based on analysis of the at least one audio signal 3203. In some embodiments, the words may indicate that the user 100 did not hear clearly. For example, user 100 may say "What?" or "Can you repeat what you said?" or "I don't understand." In some embodiments, processor 210 may associate greater hearing difficulty with user 100 based on the type and/or frequency of the detected words. For example, "repeat" may be associated with greater hearing difficulty than "I don't understand," and/or repeated words (e.g., user 100 repeatedly stating "What?") may be associated with greater hearing difficulty than words that are not repeated.
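Detecting hearing-difficulty cue words in the user's own speech, with repeated cues weighted more heavily as described above, may be sketched as follows; the phrase list and weights are hypothetical.

```python
# Hypothetical cue phrases and weights; repeated cues accumulate so that
# repetition maps to greater estimated difficulty.

DIFFICULTY_WEIGHTS = {"what": 1.0, "repeat": 2.0, "i don't understand": 1.5}

def difficulty_score(transcripts):
    """Score a sequence of the user's recent utterances."""
    score = 0.0
    for text in transcripts:
        lowered = text.lower()
        for phrase, weight in DIFFICULTY_WEIGHTS.items():
            if phrase in lowered:
                score += weight
    return score

score = difficulty_score(["What?", "What?", "Can you repeat that?"])
```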
在一些实施例中,麦克风1720可以被配置为确定用户100的环境中声音的方向性。例如,麦克风1720可以包括一个或多个定向麦克风,它们可能对拾取某些方向上的声音更敏感。处理器210可以被配置为区分用户100的环境内的声音并且确定每个声音的近似方向性。例如,使用麦克风阵列1720,处理器210可以对麦克风1720之间声音的相对定时或振幅进行比较,以确定相对于装置100的方向性。In some embodiments, the microphone 1720 may be configured to determine the directionality of sound in the environment of the user 100 . For example, microphone 1720 may include one or more directional microphones, which may be more sensitive to picking up sounds in certain directions. The processor 210 may be configured to differentiate sounds within the environment of the user 100 and determine the approximate directionality of each sound. For example, using the microphone array 1720 , the processor 210 may compare the relative timing or amplitude of sounds between the microphones 1720 to determine the directionality relative to the device 100 .
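The relative-timing comparison between microphones described above may be sketched as a brute-force cross-correlation; a positive lag here means the sound reached the second microphone later, so the source lies nearer the first. The sample values and geometry are illustrative.

```python
# Illustrative time-difference estimate between two microphones via a
# brute-force cross-correlation; sample values are synthetic.

def best_lag(a, b, max_lag):
    """Delay (in samples) of signal b relative to signal a that maximizes
    their correlation; positive means b lags a."""
    def corr(lag):
        return sum(b[i] * a[i - lag]
                   for i in range(len(b)) if 0 <= i - lag < len(a))
    return max(range(-max_lag, max_lag + 1), key=corr)

# The same impulse reaches mic B two samples after mic A, so the source is
# nearer mic A.
mic_a = [0.0, 1.0, 0.5, 0.0, 0.0, 0.0]
mic_b = [0.0, 0.0, 0.0, 1.0, 0.5, 0.0]
lag = best_lag(mic_a, mic_b, max_lag=3)
```

Combined with the known microphone spacing and the sampling rate, such a lag converts into an angle of arrival.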
在一些实施例中,从用户100的环境捕捉的声音可以被分类为包含讲话、音乐、音调、笑声、尖叫等的片段。各个片段的指示可以记录在数据库2050中。In some embodiments, the sounds captured from the environment of the user 100 may be classified into segments containing speech, music, tones, laughter, screams, and the like. The indication of each segment may be recorded in database 2050.
在一些实施例中,所记录的信息可以使处理器210能够基于对至少一个音频信号3203的分析来识别用户100的至少一个动作。如前所讨论的,至少一个动作可以包括用户100的讲话。例如,识别至少一个动作可以包括基于对至少一个音频信号3203的分析来检测用户100所说的词语。词语可以指示用户100没有很好地听到声音(例如,来自另一个体3210、与音频信号3205相关联的声音、与音频信号3207相关联的声音等)。In some embodiments, the recorded information may enable processor 210 to identify at least one action of user 100 based on analysis of at least one audio signal 3203 . As previously discussed, the at least one action may include speech by the user 100 . For example, identifying the at least one action may include detecting words spoken by the user 100 based on analysis of the at least one audio signal 3203 . The words may indicate that the user 100 is not hearing the sound well (eg, from another entity 3210, the sound associated with the audio signal 3205, the sound associated with the audio signal 3207, etc.).
处理器210可以被配置为基于所识别的动作,对由至少一个麦克风接收的用户100的环境中的至少一个音频信号(例如,来自个体3210、音频信号3205、音频信号3207等)进行选择性调节。至少一个经调节的音频信号可以被发送到听觉接口设备1710,听觉接口设备1710被配置为向用户100的耳朵提供声音,并且因此可以向用户100提供对应于该至少一个音频信号的源(例如,与音频信号3205相关联的个体3210、与音频信号3207相关联的个体3210,等等)的听觉反馈。处理器210可以对从麦克风1720接收的音频信号执行各种调节技术。调节可以包括相对于其他音频信号放大被确定为对应于声音(例如,与音频信号3205相关联的个体3210、与音频信号3207相关联的个体3210,等等)的音频信号。例如,可以通过相对于其他音频信号3205或3207处理与个体3210相关联的音频信号来数字化地完成放大。Processor 210 may be configured to selectively condition, based on the identified action, at least one audio signal in the environment of user 100 received by the at least one microphone (e.g., from individual 3210, audio signal 3205, audio signal 3207, etc.). The at least one conditioned audio signal may be sent to auditory interface device 1710, which is configured to provide sound to the ear of user 100, and thus user 100 may be provided with auditory feedback corresponding to the source of the at least one audio signal (e.g., the individual 3210 associated with audio signal 3205, the individual 3210 associated with audio signal 3207, etc.). The processor 210 may perform various conditioning techniques on the audio signals received from the microphone 1720. Conditioning may include amplifying audio signals determined to correspond to a sound of interest (e.g., the individual 3210 associated with audio signal 3205, the individual 3210 associated with audio signal 3207, etc.) relative to other audio signals. For example, amplification may be accomplished digitally by processing the audio signal associated with the individual 3210 relative to other audio signals 3205 or 3207.
Amplification may also be accomplished by changing one or more parameters of the microphone 1720 to focus on audio sounds emanating from the individual 3210 associated with the user 100 (eg, a region of interest). For example, microphone 1720 may be a directional microphone, and processor 210 may perform operations to focus microphone 1720 on individual 3210 or other sounds in the environment of user 100 . Various other techniques for amplifying sound may be used, such as the use of beamforming microphone arrays, acoustic telescope techniques, and the like. In some embodiments, the selective adjustment of the at least one audio signal may be implemented based on the determined hearing difficulty of the user 100 . For example, the amplification of the audio signal may increase as the hearing difficulty associated with the user 100 increases.
调节还可以包括衰减或抑制从感兴趣的区域(例如,个体3210)之外的方向接收的一个或多个音频信号。例如,处理器210可以衰减音频信号3205和3207。类似于来自个体3210的声音的放大,声音的衰减可以通过处理音频信号来发生,或者通过改变与一个或多个麦克风1720相关联的一个或多个参数来引导焦点远离从包括个体3210的区域之外发出的声音。在一些实施例中,如果用户100已经经由对过去交互的分析确定具有预定水平的听力损失,则可以增加从感兴趣区域之外接收的音频信号的衰减。Conditioning may also include attenuating or suppressing one or more audio signals received from directions other than the region of interest (e.g., individual 3210). For example, processor 210 may attenuate audio signals 3205 and 3207. Similar to the amplification of sound from the individual 3210, attenuation of sound may occur by processing the audio signals, or by changing one or more parameters associated with the one or more microphones 1720 to direct focus away from sounds emanating from outside the region that includes the individual 3210. In some embodiments, the attenuation of audio signals received from outside the region of interest may be increased if the user 100 has been determined, via analysis of past interactions, to have a predetermined level of hearing loss.
在一些实施例中,调节还可以包括改变对应于来自感兴趣区域的声音的音频信号的音调,以使用户100更容易感知该声音。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以调节来自感兴趣区域的声音的音高以使其对于用户100更易感知。例如,用户100可能经历10kHz以上的频率中的听觉损失。因此,处理器210可以将更高的频率(例如,在15khz处)重新映射到10khz以下频率。在一些实施例中,处理器210可以被配置为改变与一个或多个音频信号相关联的语速。因此,处理器210可以被配置为例如使用语音活动检测(VAD)算法或技术来检测由麦克风1720接收的一个或多个音频信号内的语音。例如,如果确定声音对应于来自个体3210的语音或讲话,则处理器210可以被配置为改变来自个体3210的声音的回放速率。例如,可以降低个体3210的语速以使检测到的语音对于用户100更易感知。可以执行各种其他处理(诸如修改来自个体3210的声音的音调),以维持与原始音频信号相同的音高,或者降低音频信号内的噪声。如果已经对与来自个体3210的声音相关联的音频信号执行了语音识别,则调节还可以包括基于检测到的语音来修改音频信号。在一些实施例中,调节可以包括修改检测到的语速。例如,可以通过延长包括在音频信号中的词语的持续时间和减少词语之间的停顿持续时间(或反之亦然)来修改语速,这可以使讲话更容易理解。In some embodiments, adjusting may also include changing the pitch of the audio signal corresponding to the sound from the region of interest to make it easier for the user 100 to perceive the sound. For example, the user 100 may have less sensitivity to tones within a certain range, and the adjustment of the audio signal may adjust the pitch of the sound from the area of interest to make it more perceptible to the user 100 . For example, user 100 may experience hearing loss in frequencies above 10 kHz. Thus, the processor 210 may remap higher frequencies (eg, at 15khz) to frequencies below 10khz. In some embodiments, the processor 210 may be configured to vary the speech rate associated with one or more audio signals. Accordingly, processor 210 may be configured to detect speech within one or more audio signals received by microphone 1720, eg, using a voice activity detection (VAD) algorithm or technique. For example, if it is determined that the sound corresponds to speech or speech from the individual 3210, the processor 210 may be configured to change the playback rate of the sound from the individual 3210. For example, the speaking rate of the individual 3210 may be reduced to make the detected speech more perceptible to the user 100 . 
Various other processing may be performed, such as modifying the pitch of the sound from the individual 3210, to maintain the same pitch as the original audio signal, or to reduce noise within the audio signal. If speech recognition has been performed on the audio signal associated with the sound from the individual 3210, the conditioning may also include modifying the audio signal based on the detected speech. In some embodiments, adjusting may include modifying the detected rate of speech. For example, the rate of speech can be modified by increasing the duration of words included in the audio signal and reducing the duration of pauses between words (or vice versa), which can make speech easier to understand.
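The frequency remapping described above, shifting components above 10 kHz into an audible band, may be sketched on a list of (frequency, amplitude) components; the component representation and mapping rule are illustrative assumptions.

```python
# Illustrative remapping on (frequency_hz, amplitude) components: anything
# above the cutoff is compressed into a band just below it.

CUTOFF_HZ = 10_000

def remap_components(components, target_hz=8_000, compression=0.2):
    remapped = []
    for freq, amp in components:
        if freq > CUTOFF_HZ:
            freq = target_hz + (freq - CUTOFF_HZ) * compression
        remapped.append((freq, amp))
    return remapped

out = remap_components([(440.0, 1.0), (15_000.0, 0.5)])
```

In this sketch a 15 kHz component lands at 9 kHz, below the user's assumed hearing cutoff, while audible components pass through unchanged.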
然后可以将经调节的音频信号发送到听觉接口设备1710,并为用户100产生音频信号。因此,在经调节的音频信号中,来自个体3210的声音可以更容易被用户100听到,比来自音频信号3205或3207的声音更响亮和/或更容易区分,来自音频信号3205或3207的声音可以表示环境内的背景噪声。The conditioned audio signal may then be sent to auditory interface device 1710 and produced as sound for user 100. Thus, in the conditioned audio signal, the sound from the individual 3210 may be more easily heard by the user 100, being louder and/or more distinguishable than the sounds from audio signals 3205 or 3207, which may represent background noise within the environment.
在一些实施例中,至少一个麦克风1750可以捕捉在预定长度的移动时间窗口期间接收的一个或多个音频信号,并且处理器210可以被编程为引起选择性地调节和在移动时间窗口内接收的音频信号的一部分的传输。例如,处理器210可以将至少一个音频信号(例如,来自个体3210)的一部分存储在数据库2050中,其中该部分是在用户100的至少一个动作(例如,发出音频信号3203)之前接收的。至少一个音频信号的部分可以被发送到被配置为向用户100的耳朵提供声音的听觉接口设备1710,并且因此可以向用户100提供对应于至少一个音频信号的该部分的源(例如,另一用户)的听觉反馈。可以基于用户100的至少一个动作将至少一个音频信号的该部分发送到听力设备1710。例如,在用户100指示他或她难以听到一个或多个声音之后,可以将该部分发送到听力设备1710,从而复制用户100难以听到的至少一个声音。然后,由于重放先前的声音时段而错过的时段可以以增加的速率被提供给用户100,例如通过减少词语之间的静音时段,或者以任何其他合适的方式。在一些实施例中,可以基于用户执行的手势而不是用户的口头指示来执行声音的自动重放。例如,如果用户口头声明他或她理解困难,可能会提示另一个体重复这些词语。In some embodiments, at least one microphone 1750 may capture one or more audio signals received during a moving time window of predetermined length, and processor 210 may be programmed to cause selective conditioning and transmission of a portion of the audio signals received within the moving time window. For example, processor 210 may store a portion of at least one audio signal (e.g., from individual 3210) in database 2050, where the portion was received prior to at least one action by user 100 (e.g., emitting audio signal 3203). The portion of the at least one audio signal may be transmitted to auditory interface device 1710, which is configured to provide sound to an ear of user 100, and thus user 100 may be provided with auditory feedback corresponding to the source (e.g., another user) of the portion of the at least one audio signal. The portion of the at least one audio signal may be transmitted to hearing device 1710 based on at least one action of user 100. For example, after user 100 indicates that he or she has difficulty hearing one or more sounds, the portion may be transmitted to hearing device 1710, thereby reproducing at least one sound that user 100 had difficulty hearing. Periods missed due to the replay of the previous sound period may then be provided to user 100 at an increased rate, for example by reducing the silent periods between words, or in any other suitable manner.
In some embodiments, automatic replay of the sound may be performed based on a gesture performed by the user rather than a verbal instruction from the user. For example, if the user verbally states that he or she has difficulty understanding, another individual may be prompted to repeat the words.
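The moving time window described above behaves like a bounded ring buffer: only the most recent audio is retained, and that retained portion can be replayed after a user action. A minimal sketch, with an assumed frame-based representation that is not part of the disclosure:

```python
from collections import deque

class AudioWindowBuffer:
    """Retain only the most recent `window_seconds` of audio frames so that
    a portion received before a user action can be replayed on demand."""

    def __init__(self, window_seconds, frames_per_second):
        self.frames = deque(maxlen=int(window_seconds * frames_per_second))

    def push(self, frame):
        self.frames.append(frame)  # the oldest frame is dropped automatically

    def replay(self):
        return list(self.frames)  # the portion preceding the user's action

buffer = AudioWindowBuffer(window_seconds=2, frames_per_second=4)  # keeps 8 frames
for frame_id in range(10):
    buffer.push(frame_id)
```

After ten frames are pushed, only the last eight remain, mirroring how audio older than the predetermined window would not be available for replay.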
图33是具有符合所公开的实施例的助听器系统的用户100的示例性描述。在一些实施例中,可穿戴相机可以是装置110(例如,基于相机的定向助听器装置)的组件,用于基于用户100的运动3201(例如,手运动、倾斜运动、视线方向等)选择性地改变声音的放大。用户100还可以佩戴例如听觉接口设备1710。在一些实施例中,装置110可以从用户100的环境捕捉至少一个图像。在一些实施例中,处理器210可以接收由装置110捕捉的至少一个图像。处理器210可以通过基于对至少一个图像的分析来检测用户100的运动3201来识别至少一个动作。在一些实施例中,运动3201可以包括用户100的手运动或用户100的倾斜运动。例如,用户100可以将他的手环绕在耳朵周围,指示用户100在听到个体3210方面有困难。在一些实施例中,用户100可以向个体3210倾斜,指示用户100在听到个体3210方面有困难。FIG. 33 is an exemplary depiction of a user 100 having a hearing aid system consistent with the disclosed embodiments. In some embodiments, the wearable camera may be a component of apparatus 110 (e.g., a camera-based directional hearing aid apparatus) for selectively varying the amplification of sound based on a motion 3201 of user 100 (e.g., a hand motion, a tilting motion, a gaze direction, etc.). User 100 may also wear, for example, auditory interface device 1710. In some embodiments, apparatus 110 may capture at least one image from the environment of user 100. In some embodiments, processor 210 may receive the at least one image captured by apparatus 110. Processor 210 may identify at least one action by detecting motion 3201 of user 100 based on analysis of the at least one image. In some embodiments, motion 3201 may include a hand motion of user 100 or a tilting motion of user 100. For example, user 100 may cup his hand around his ear, indicating that user 100 is having difficulty hearing individual 3210. In some embodiments, user 100 may lean toward individual 3210, indicating that user 100 is having difficulty hearing individual 3210.
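Once keypoints for the hand and ear have been detected in an image (the detector itself, e.g. a CNN or HOG pipeline, is outside this sketch), the cupping-hand-to-ear gesture can be flagged by a simple proximity test. The coordinate representation and pixel threshold are illustrative assumptions:

```python
def hand_near_ear(hand_xy, ear_xy, max_distance_px=40.0):
    """Flag the hand-cupped-around-ear gesture from detected keypoints.

    `hand_xy` and `ear_xy` are (x, y) pixel coordinates of the detected
    hand and ear; the gesture is flagged when they are close together.
    """
    dx = hand_xy[0] - ear_xy[0]
    dy = hand_xy[1] - ear_xy[1]
    return (dx * dx + dy * dy) ** 0.5 <= max_distance_px
```

A production system would also check that the pose persists across several frames before treating it as an action.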
在一些实施例中,可以通过监视用户100的身体部分(例如,手、手臂等)或面部部分(例如,鼻子、眼睛、耳朵、耳朵附近的手等)相对于相机传感器的光轴的方向来跟踪用户100的运动3201。例如,装置110的可穿戴相机可以被配置为例如使用图像传感器220来捕捉用户100的周围环境的一个或多个图像。例如,所捕捉的图像可以包括用户100的身体部分或面部部分的表示,其可以用于确定用户100的手运动。处理器210(和/或处理器210a和210b)可以被配置为使用各种图像检测或处理算法(例如,使用卷积神经网络(CNN)、尺度不变特征变换(SIFT)、定向梯度直方图(HOG)特征或其他技术)来分析捕捉的图像并检测用户100的身体部分或面部部分的运动3201。基于检测到的用户100的身体部分或面部部分的表示,可以确定用户100的运动3201。In some embodiments, motion 3201 of user 100 may be tracked by monitoring the direction of a body part (e.g., a hand, an arm, etc.) or a facial part (e.g., a nose, eyes, ears, a hand near an ear, etc.) of user 100 relative to the optical axis of the camera sensor. For example, the wearable camera of apparatus 110 may be configured to capture one or more images of the surroundings of user 100, for example using image sensor 220. For example, the captured images may include representations of a body part or a facial part of user 100, which may be used to determine a hand motion of user 100. Processor 210 (and/or processors 210a and 210b) may be configured to analyze the captured images and detect motion 3201 of a body part or a facial part of user 100 using various image detection or processing algorithms (e.g., using convolutional neural networks (CNN), scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG) features, or other techniques). Based on the detected representation of the body part or facial part of user 100, motion 3201 of user 100 may be determined.
可以部分地通过将检测到的用户100的身体部分或面部部分的表示与相机传感器1751的光轴进行比较来确定运动3201。例如,光轴1751在每个图像中可以是已知的或固定的,并且处理器210可以通过将用户100的身体部分或面部部分的代表性角度与光轴1751的方向进行比较来确定运动1750。例如,所确定的运动可以包括用户100将他的手环绕在耳朵周围,指示用户100在听到个体3210方面有困难。在一些实施例中,所确定的运动可以包括用户100向个体3210倾斜,指示他们在听到个体3210方面有困难。例如,用户100朝向声音发出对象的倾斜运动可以通过用户100与该对象之间的距离的减小来识别。在一些实施例中,可以基于将该减小与预定阈值或范围(例如5-30厘米)进行比较来检测倾斜运动。该距离可以通过分析一个或多个图像、基于嵌入在装置110内的测距仪或各种其他方法来评估。Motion 3201 may be determined in part by comparing the detected representation of the body part or facial part of user 100 to the optical axis of camera sensor 1751. For example, optical axis 1751 may be known or fixed in each image, and processor 210 may determine the motion by comparing a representative angle of the body part or facial part of user 100 to the direction of optical axis 1751. For example, the determined motion may include user 100 cupping his hand around his ear, indicating that user 100 is having difficulty hearing individual 3210. In some embodiments, the determined motion may include user 100 leaning toward individual 3210, indicating that he or she is having difficulty hearing individual 3210. For example, a tilting motion of user 100 toward a sound-emitting object may be identified by a decrease in the distance between user 100 and the object. In some embodiments, the tilting motion may be detected based on comparing the decrease to a predetermined threshold or range (e.g., 5-30 cm). The distance may be estimated by analyzing one or more images, based on a rangefinder embedded within apparatus 110, or by various other methods.
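The distance-decrease test for the leaning motion can be sketched directly from the range given in the text (a decrease of roughly 5-30 cm flags a lean); the function name and parameterization are illustrative:

```python
def lean_detected(previous_distance_cm, current_distance_cm,
                  min_decrease_cm=5.0, max_decrease_cm=30.0):
    """Flag a lean toward a sound source when the user-to-source distance
    shrinks by an amount within the predetermined range (5-30 cm here,
    matching the example range in the text)."""
    decrease = previous_distance_cm - current_distance_cm
    return min_decrease_cm <= decrease <= max_decrease_cm
```

The upper bound filters out larger distance changes that are more likely to be the user walking toward the source than a lean.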
在一些实施例中,处理器210可以基于所识别的动作引起对由至少一个麦克风1720接收的至少一个音频信号(例如,来自个体3210)进行选择性调节,并引起至少一个经调节的音频信号向被配置为向用户100的耳朵提供声音的听觉接口设备1710的传输。在一些实施例中,用户110可以向特定方向倾斜(例如,向用户110的环境中的个体倾斜),并且引起对至少一个音频信号的选择性调节可以包括放大从倾斜运动的方向接收的至少一个音频信号。In some embodiments, processor 210 may, based on the identified action, cause selective conditioning of at least one audio signal (e.g., from individual 3210) received by at least one microphone 1720, and cause transmission of the at least one conditioned audio signal to auditory interface device 1710, which is configured to provide sound to an ear of user 100. In some embodiments, user 100 may lean in a particular direction (e.g., toward an individual in the environment of user 100), and causing selective conditioning of the at least one audio signal may include amplifying at least one audio signal received from the direction of the tilting motion.
至少一个经调节的音频信号可以被发送到听觉接口设备1710,听觉接口设备1710被配置为向用户100的耳朵提供声音,并且因此可以向用户100提供对应于该至少一个音频信号的源(例如,个体3210)的听觉反馈。处理器210可以对从麦克风1720接收的音频信号执行各种调节技术。调节可以包括相对于其他音频信号(例如,音频信号3205或3207)放大被确定为对应于来自个体3210的声音的音频信号。放大可以例如通过相对于其他信号数字化地处理与个体3210相关联的音频信号来数字化地实现。放大还可以通过改变麦克风1720的一个或多个参数来实现,以聚焦于从与用户100相关联的个体3210(例如,感兴趣的区域)发出的音频声音。例如,麦克风1720可以是定向麦克风,处理器210可以执行将麦克风1720聚焦在声音1820或个体3210的区域内的其他声音上的操作。可以使用用于放大来自个体3210的声音的各种其他技术,诸如使用波束成形麦克风阵列、声学望远镜技术等。The at least one conditioned audio signal may be transmitted to auditory interface device 1710, which is configured to provide sound to an ear of user 100, and thus user 100 may be provided with auditory feedback corresponding to the source (e.g., individual 3210) of the at least one audio signal. Processor 210 may perform various conditioning techniques on the audio signals received from microphone 1720. The conditioning may include amplifying the audio signal determined to correspond to the sound from individual 3210 relative to other audio signals (e.g., audio signals 3205 or 3207). Amplification may be accomplished digitally, for example by digitally processing the audio signal associated with individual 3210 relative to the other signals. Amplification may also be accomplished by varying one or more parameters of microphone 1720 to focus on audio sounds emanating from individual 3210 (e.g., a region of interest) associated with user 100. For example, microphone 1720 may be a directional microphone, and processor 210 may perform an operation to focus microphone 1720 on sound 1820 or other sounds within the region of individual 3210. Various other techniques for amplifying the sound from individual 3210 may be used, such as using a beamforming microphone array, acoustic telescopy techniques, and the like.
调节还可以包括衰减或抑制从感兴趣的区域(例如,个体3210)之外的方向接收的一个或多个音频信号。例如,处理器210可以衰减音频信号3205和3207。类似于来自个体3210的声音的放大,声音的衰减可以通过处理音频信号来发生,或者通过改变与一个或多个麦克风1720相关联的一个或多个参数来引导焦点远离从包括个体3210的区域之外发出的声音。The conditioning may also include attenuating or suppressing one or more audio signals received from directions other than that of the region of interest (e.g., individual 3210). For example, processor 210 may attenuate audio signals 3205 and 3207. Similar to the amplification of the sound from individual 3210, attenuation of sounds may occur by processing the audio signals, or by varying one or more parameters associated with the one or more microphones 1720 to direct focus away from sounds emanating from outside the region that includes individual 3210.
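The digital side of the amplify/attenuate conditioning above can be sketched as per-source gain scaling. The dictionary-of-sources representation and the gain values are illustrative assumptions; a real system would work on separated audio streams:

```python
def selectively_condition(signals, target, gain=2.0, attenuation=0.25):
    """Amplify the target source and attenuate every other source.

    `signals` maps a source label (e.g., an individual or background
    noise) to its list of samples; the target is scaled up, the rest
    are scaled down.
    """
    return {
        label: [sample * (gain if label == target else attenuation)
                for sample in samples]
        for label, samples in signals.items()
    }

conditioned = selectively_condition(
    {"individual_3210": [1.0, -1.0], "background": [0.8]},
    target="individual_3210",
)
```

In practice the gains would be applied per frame with smoothing to avoid audible clicks at gain changes.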
在一些实施例中,调节还可以包括改变对应于来自感兴趣区域的声音的音频信号的音调,以使用户100更容易感知该声音。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以调节来自感兴趣区域的声音的音高以使其对于用户100更易感知。例如,用户100可能经历10kHz以上的频率中的听觉损失。因此,处理器210可以将更高的频率(例如,在15khz处)重新映射到10khz以下频率。在一些实施例中,处理器210可以被配置为改变与一个或多个音频信号相关联的语速。因此,处理器210可以被配置为例如使用语音活动检测(VAD)算法或技术来检测由麦克风1720接收的一个或多个音频信号内的语音。例如,如果确定声音对应于来自个体3210的语音或讲话,则处理器210可以被配置为改变来自个体3210的声音的回放速率。例如,可以降低个体3210的语速以使检测到的语音对于用户100更易感知。例如,如上所述,可以通过延长或减少所讲词语持续时间和/或所讲词语之间的静默时段来修改语速。可以执行各种其他处理(诸如修改来自个体3210的声音的音调),以维持与原始音频信号相同的音高,或者降低音频信号内的噪声。如果已经对与来自个体3210的声音相关联的音频信号执行了语音识别,则调节还可以包括基于检测到的语音来修改音频信号。例如,处理器210可以在词语和/或句子之间引入停顿或增加停顿的持续时间,这可以使语音更容易理解。In some embodiments, the conditioning may also include changing the tone of the audio signal corresponding to the sound from the region of interest to make the sound easier for user 100 to perceive. For example, user 100 may have less sensitivity to tones in a certain range, and the conditioning of the audio signal may adjust the pitch of the sound from the region of interest to make it more perceptible to user 100. For example, user 100 may experience hearing loss at frequencies above 10 kHz. Accordingly, processor 210 may remap higher frequencies (e.g., at 15 kHz) to frequencies below 10 kHz. In some embodiments, processor 210 may be configured to vary the rate of speech associated with one or more audio signals. Accordingly, processor 210 may be configured to detect speech within one or more audio signals received by microphone 1720, for example using a voice activity detection (VAD) algorithm or technique. For example, if it is determined that a sound corresponds to voice or speech from individual 3210, processor 210 may be configured to vary the playback rate of the sound from individual 3210. For example, the rate of speech of individual 3210 may be reduced to make the detected speech more perceptible to user 100.
For example, as discussed above, the speech rate may be modified by lengthening or reducing the duration of spoken words and/or the period of silence between spoken words. Various other processing may be performed, such as modifying the pitch of the sound from the individual 3210, to maintain the same pitch as the original audio signal, or to reduce noise within the audio signal. If speech recognition has been performed on the audio signal associated with the sound from the individual 3210, the conditioning may also include modifying the audio signal based on the detected speech. For example, processor 210 may introduce pauses or increase the duration of pauses between words and/or sentences, which may make speech easier to understand.
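The frequency remapping described above (moving energy from above a hearing-loss cutoff to an audible band) can be sketched on a toy spectrum. Representing the spectrum as a frequency-to-magnitude mapping and the choice of an 8 kHz target bin are simplifying assumptions; a real system would operate on STFT frames of the audio signal:

```python
def remap_high_frequencies(spectrum, cutoff_hz=10_000, target_hz=8_000):
    """Move energy above the user's hearing-loss cutoff (e.g., a 15 kHz
    component) down to a bin below 10 kHz, accumulating magnitudes that
    land on the same destination bin."""
    remapped = {}
    for freq_hz, magnitude in spectrum.items():
        destination = target_hz if freq_hz > cutoff_hz else freq_hz
        remapped[destination] = remapped.get(destination, 0.0) + magnitude
    return remapped

shifted = remap_high_frequencies({500: 1.0, 15_000: 0.8})
```

Collapsing all high bins onto one target is crude; a smoother approach would compress the 10-20 kHz band proportionally into the audible range.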
然后可以将经调节的音频信号发送到听觉接口设备1710,并为用户100产生音频信号。因此,在经调节的音频信号中,来自个体3210的声音可以更容易被用户100听到,比来自音频信号3205或3207的声音更响亮和/或更容易区分,来自音频信号3205或3207的声音可以表示环境内的背景噪声。The conditioned audio signal may then be transmitted to auditory interface device 1710, and an audio signal may be produced for user 100. Thus, in the conditioned audio signal, the sound from individual 3210 may be more easily heard by user 100, being louder and/or more distinguishable than the sounds from audio signals 3205 and 3207, which may represent background noise within the environment.
在一些实施例中,至少一个麦克风1750可以捕捉在预定长度的移动时间窗口期间接收的一个或多个音频信号,并且处理器210可以被编程为引起选择性地调节和在移动时间窗口内接收的音频信号的一部分的传输。例如,处理器210可以将至少一个音频信号(例如,来自个体3210)的一部分存储在数据库2050中,其中该部分是在用户100的至少一个动作(例如,运动3203)之前接收的。至少一个音频信号的部分可以被发送到被配置为向用户100的耳朵提供声音的听觉接口设备1710,并且因此可以向用户100提供对应于至少一个音频信号的该部分的源(例如,另一用户)的听觉反馈。可以基于用户100的至少一个动作将至少一个音频信号的该部分发送到听力设备1710。例如,在用户100状态为"重复"之后,该部分可以被发送到听力设备1710,从而复制用户100难以听到的至少一个声音。In some embodiments, at least one microphone 1750 may capture one or more audio signals received during a moving time window of predetermined length, and processor 210 may be programmed to cause selective conditioning and transmission of a portion of the audio signals received within the moving time window. For example, processor 210 may store a portion of at least one audio signal (e.g., from individual 3210) in database 2050, where the portion was received prior to at least one action (e.g., motion 3203) of user 100. The portion of the at least one audio signal may be transmitted to auditory interface device 1710, which is configured to provide sound to an ear of user 100, and thus user 100 may be provided with auditory feedback corresponding to the source (e.g., another user) of the portion of the at least one audio signal. The portion of the at least one audio signal may be transmitted to hearing device 1710 based on at least one action of user 100. For example, after user 100 states "repeat," the portion may be transmitted to hearing device 1710, thereby reproducing at least one sound that user 100 had difficulty hearing.
图34是示出符合所公开实施例的用于选择性地放大声音的示例性过程3400的流程图。34 is a flowchart illustrating an exemplary process 3400 for selectively amplifying sound, consistent with the disclosed embodiments.
在步骤3401中,可穿戴相机(例如,装置110的可穿戴相机)可以从用户100的环境捕捉多个图像。在一些实施例中,可穿戴相机可以是装置110(例如,基于相机的定向助听器装置)的组件,用于基于用户100的运动3201(例如,手运动、倾斜运动、视线方向等)选择性地改变声音的放大。在一些实施例中,可穿戴相机可以从用户100的环境捕捉至少一个图像。In step 3401, a wearable camera (e.g., the wearable camera of apparatus 110) may capture a plurality of images from the environment of user 100. In some embodiments, the wearable camera may be a component of apparatus 110 (e.g., a camera-based directional hearing aid apparatus) for selectively varying the amplification of sound based on a motion 3201 of user 100 (e.g., a hand motion, a tilting motion, a gaze direction, etc.). In some embodiments, the wearable camera may capture at least one image from the environment of user 100.
在步骤3403中,至少一个麦克风可以从用户100的环境捕捉声音。在一些实施例中,麦克风1720可以被配置为确定用户100的环境中声音的方向性。例如,麦克风1720可以包括一个或多个定向麦克风,它们可能对拾取某些方向上的声音更敏感。处理器210可以被配置为区分用户100的环境内的声音并且确定每个声音的近似方向性。例如,使用麦克风阵列1720,处理器210可以对麦克风1720之间个体声音的相对定时或振幅进行比较,以确定相对于装置100的方向性。In step 3403, at least one microphone may capture sound from the user's 100 environment. In some embodiments, the microphone 1720 may be configured to determine the directionality of sound in the environment of the user 100 . For example, microphone 1720 may include one or more directional microphones, which may be more sensitive to picking up sounds in certain directions. The processor 210 may be configured to differentiate sounds within the environment of the user 100 and determine the approximate directionality of each sound. For example, using the microphone array 1720 , the processor 210 may compare the relative timing or amplitude of individual sounds between the microphones 1720 to determine the directionality relative to the device 100 .
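The relative-timing comparison between microphones described above can be sketched with the standard far-field relation delay = spacing × cos(θ) / c for a two-microphone pair. The function and constant names are illustrative; a real array would cross-correlate the channels to estimate the delay first:

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air

def bearing_from_delay(delay_s, mic_spacing_m):
    """Estimate the bearing (degrees, 0 = along the mic axis, 90 =
    broadside) of a sound source from the inter-microphone time delay."""
    cos_theta = delay_s * SPEED_OF_SOUND_M_S / mic_spacing_m
    cos_theta = max(-1.0, min(1.0, cos_theta))  # clamp numerical noise
    return math.degrees(math.acos(cos_theta))
```

A zero delay means the source is broadside to the pair (90 degrees), while the maximum delay of spacing/c places it on the microphone axis.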
在步骤3405中,处理器210可以接收由可穿戴相机捕捉的多个图像。例如,可穿戴相机可以被配置为例如使用图像传感器220来捕捉用户100的周围环境的一个或多个图像。In step 3405, the processor 210 may receive a plurality of images captured by the wearable camera. For example, the wearable camera may be configured to capture one or more images of the user's 100 surroundings, eg, using the image sensor 220 .
在步骤3407中,处理器210可以接收表示由至少一个麦克风从用户的环境捕捉的声音的至少一个音频信号。例如,处理器210可以接收表示由至少一个麦克风1750从用户100的环境捕捉的声音的至少一个音频信号3203、3205或3207。In step 3407, the processor 210 may receive at least one audio signal representing sound captured by the at least one microphone from the user's environment. For example, the processor 210 may receive at least one audio signal 3203 , 3205 or 3207 representing sound captured by the at least one microphone 1750 from the environment of the user 100 .
在步骤3409中,处理器210可以基于对多个图像中的至少一个或至少一个音频信号的分析来识别用户的至少一个动作。在一些实施例中,处理器210可以基于对至少一个音频信号(例如,音频信号3203)的分析来识别用户100的至少一个动作。在一些实施例中,至少一个动作可以包括用户100的讲话。例如,识别至少一个动作可以包括基于对至少一个音频信号3203的分析来检测用户100所说的词语。在一些实施例中,词语可以指示用户100没有听清楚。例如,用户100可能说了"什么?"、"您能重复您所说的吗?"、"我不明白"或类似的短语。在一些实施例中,处理器210可以根据检测到的词语的类型和/或频率将更大的听力困难与用户100相关联。例如,"重复"可能与比"我不明白"更大的听力困难相关联,和/或重复的词语(例如,用户100重复陈述"什么?")可能与比不重复的词语更大的听力困难相关联。In step 3409, processor 210 may identify at least one action of the user based on analysis of at least one of the plurality of images or of the at least one audio signal. In some embodiments, processor 210 may identify at least one action of user 100 based on analysis of at least one audio signal (e.g., audio signal 3203). In some embodiments, the at least one action may include speech by user 100. For example, identifying the at least one action may include detecting words spoken by user 100 based on analysis of the at least one audio signal 3203. In some embodiments, the words may indicate that user 100 did not hear clearly. For example, user 100 may have said "What?", "Can you repeat what you said?", "I don't understand," or a similar phrase. In some embodiments, processor 210 may associate a greater degree of hearing difficulty with user 100 depending on the type and/or frequency of the detected words. For example, "repeat" may be associated with greater hearing difficulty than "I don't understand," and/or repeated words (e.g., user 100 repeatedly stating "What?") may be associated with greater hearing difficulty than words that are not repeated.
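The type-and-frequency weighting described above can be sketched as a small scoring table over recognized phrases. The specific weights, phrases, and repeat multiplier are illustrative assumptions, not values from the disclosure:

```python
DIFFICULTY_WEIGHTS = {
    "what?": 2.0,                        # request to repeat: stronger signal
    "can you repeat what you said?": 2.0,
    "i don't understand": 1.0,           # weaker signal, per the text
}
REPEAT_MULTIPLIER = 1.5                  # repeated phrases weigh more

def hearing_difficulty_score(utterances):
    """Score recent user utterances; a phrase heard again counts extra,
    mirroring the idea that repetition signals greater difficulty."""
    score = 0.0
    seen = set()
    for utterance in utterances:
        phrase = utterance.lower().strip()
        weight = DIFFICULTY_WEIGHTS.get(phrase, 0.0)
        score += weight * (REPEAT_MULTIPLIER if phrase in seen else 1.0)
        seen.add(phrase)
    return score
```

The resulting score could then be compared against thresholds to decide how aggressively to condition the audio.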
在一些实施例中,可以使用任何音频分类技术对从用户10的环境捕捉的声音进行分类。处理器210可以被配置为分析声音以分离和识别音频信号的不同源。例如,处理器210可以使用一种或多种言语或语音活动检测(VAD)算法、语音分离技术和/或声音分类技术。当在用户100的环境中检测到多个声音时,处理器210可以隔离与不同声源相关联的音频信号。在一些实施例中,处理器210可以对与检测到的语音活动相关联的音频信号执行进一步分析,以识别个体的语音。例如,处理器210可以使用一种或多种语音识别算法(例如,隐式马尔可夫模型、动态时间规整、神经网络或其他技术)来识别个体的语音和/或所说的词语。例如,声音可以被分类为包含讲话、音乐、音调、笑声、尖叫等的片段。各个片段的指示可以记录在数据库2050中。In some embodiments, the sounds captured from the environment of the user 10 may be classified using any audio classification technique. The processor 210 may be configured to analyze the sound to separate and identify different sources of the audio signal. For example, the processor 210 may use one or more speech or voice activity detection (VAD) algorithms, speech separation techniques, and/or sound classification techniques. When multiple sounds are detected in the environment of the user 100, the processor 210 may isolate audio signals associated with different sound sources. In some embodiments, the processor 210 may perform further analysis on the audio signal associated with the detected voice activity to identify the individual's voice. For example, processor 210 may use one or more speech recognition algorithms (eg, hidden Markov models, dynamic time warping, neural networks, or other techniques) to recognize an individual's speech and/or spoken words. For example, sounds may be classified into segments containing speech, music, tones, laughter, screams, and the like. The indication of each segment may be recorded in database 2050.
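A minimal stand-in for the voice activity detection step named above is a frame-energy test: frames whose mean squared amplitude exceeds a threshold are labeled as speech. This is a toy sketch (real VADs use spectral features and smoothing), and the threshold is an illustrative assumption:

```python
def energy_vad(frames, energy_threshold=0.01):
    """Label each audio frame as speech (True) or non-speech (False)
    by its mean squared energy; `frames` is a list of sample lists."""
    labels = []
    for frame in frames:
        energy = sum(sample * sample for sample in frame) / len(frame)
        labels.append(energy > energy_threshold)
    return labels
```

The per-frame labels delimit speech segments, which could then be passed on to the speech-recognition stage described in the text.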
在一些实施例中,所记录的信息可以使处理器210能够基于对至少一个音频信号3203的分析来识别用户100的至少一个动作。在一些实施例中,至少一个动作可以包括用户100的讲话。例如,识别至少一个动作可以包括基于对至少一个音频信号3203的分析来检测用户100所说的词语。在一些实施例中,词语可以指示用户100没有很好地听到声音(例如,来自另一个体3210、与音频信号3205相关联的声音、与音频信号3207相关联的声音等)。In some embodiments, the recorded information may enable processor 210 to identify at least one action of user 100 based on analysis of the at least one audio signal 3203. In some embodiments, the at least one action may include speech by user 100. For example, identifying the at least one action may include detecting words spoken by user 100 based on analysis of the at least one audio signal 3203. In some embodiments, the words may indicate that user 100 did not hear a sound well (e.g., from another individual 3210, a sound associated with audio signal 3205, a sound associated with audio signal 3207, etc.).
在一些实施例中,处理器210可以接收由可穿戴相机捕捉的至少一个图像,并且可以基于对至少一个图像的分析来识别用户100的至少一个动作。处理器210可以通过基于对至少一个图像的分析来检测用户100的运动3201来识别至少一个动作。在一些实施例中,运动3201可以包括用户100的手运动或用户100的倾斜运动。例如,用户100可以将他的手环绕在耳朵周围,指示他们在听到个体3210方面有困难。在一些实施例中,用户100可以向个体3210倾斜,指示他们在听到个体3210方面有困难。In some embodiments, the processor 210 may receive at least one image captured by the wearable camera, and may identify at least one action of the user 100 based on the analysis of the at least one image. The processor 210 may identify the at least one action by detecting the motion 3201 of the user 100 based on the analysis of the at least one image. In some embodiments, motion 3201 may include hand motion of user 100 or tilt motion of user 100 . For example, the user 100 may wrap his hands around the ears, indicating that they are having difficulty hearing the individual 3210. In some embodiments, the user 100 may tilt towards the individual 3210, indicating that they are having difficulty hearing the individual 3210.
在一些实施例中,可以通过监视用户100的身体部分(例如,手、手臂等)或面部部分(例如,鼻子、眼睛、耳朵、耳朵附近的手等)相对于相机传感器的光轴的方向来跟踪用户100的运动3201。例如,所捕捉的图像可以包括用户100的身体部分或面部部分的表示,其可以用于确定用户100的手运动或倾斜运动。处理器210(和/或处理器210a和210b)可以被配置为使用各种图像检测或处理算法(例如,使用卷积神经网络(CNN)、尺度不变特征变换(SIFT)、定向梯度直方图(HOG)特征或其他技术)来分析捕捉的图像并检测用户100的身体部分或面部部分的运动3201。基于检测到的用户100的身体部分或面部部分的表示,可以确定用户100的运动3201。In some embodiments, motion 3201 of user 100 may be tracked by monitoring the direction of a body part (e.g., a hand, an arm, etc.) or a facial part (e.g., a nose, eyes, ears, a hand near an ear, etc.) of user 100 relative to the optical axis of the camera sensor. For example, the captured images may include representations of a body part or a facial part of user 100, which may be used to determine a hand motion or tilting motion of user 100. Processor 210 (and/or processors 210a and 210b) may be configured to analyze the captured images and detect motion 3201 of a body part or a facial part of user 100 using various image detection or processing algorithms (e.g., using convolutional neural networks (CNN), scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG) features, or other techniques). Based on the detected representation of the body part or facial part of user 100, motion 3201 of user 100 may be determined.
可以部分地通过将检测到的用户100的身体部分或面部部分的表示与相机传感器1751的光轴进行比较来确定运动3201。例如,光轴1751在每个图像中可以是已知的或固定的,并且处理器210可以通过将用户100的身体部分或面部部分的代表性角度与光轴1751的方向进行比较来确定运动1750。例如,所确定的运动可以包括用户100将他的手环绕在耳朵周围,指示他们在听到个体3210方面有困难。在一些实施例中,所确定的运动可以包括用户100向个体3210倾斜,指示他们在听到个体3210方面有困难。Motion 3201 may be determined in part by comparing the detected representation of the body part or facial part of user 100 to the optical axis of camera sensor 1751. For example, optical axis 1751 may be known or fixed in each image, and processor 210 may determine the motion by comparing a representative angle of the body part or facial part of user 100 to the direction of optical axis 1751. For example, the determined motion may include user 100 cupping his hand around his ear, indicating that he or she is having difficulty hearing individual 3210. In some embodiments, the determined motion may include user 100 leaning toward individual 3210, indicating that he or she is having difficulty hearing individual 3210.
在步骤3411中,处理器210可以如上更详细描述的基于识别出的动作,引起对由至少一个麦克风接收的至少一个音频信号进行选择性调节。In step 3411, the processor 210 may cause selective adjustment of the at least one audio signal received by the at least one microphone based on the identified action as described in more detail above.
在步骤3413中,处理器210可以使至少一个经调节的音频信号传输到被配置为向用户的耳朵提供声音的听觉接口设备。例如,至少一个经调节的音频信号可以被发送到听觉接口设备1710,听觉接口设备1710被配置为向用户100的耳朵提供声音,并且因此可以向用户100提供对应于该至少一个音频信号的源(例如,与音频信号3205相关联的个体3210、与音频信号3207相关联的个体3210,等等)的听觉反馈。例如,在经调节的音频信号中,来自个体3210的声音可以更容易被用户100听到,比来自音频信号3205或3207的声音更响亮和/或更容易区分,来自音频信号3205或3207的声音可以表示环境内的背景噪声。In step 3413, processor 210 may cause the at least one conditioned audio signal to be transmitted to an auditory interface device configured to provide sound to an ear of the user. For example, the at least one conditioned audio signal may be transmitted to auditory interface device 1710, which is configured to provide sound to an ear of user 100, and thus user 100 may be provided with auditory feedback corresponding to the source of the at least one audio signal (e.g., individual 3210 associated with audio signal 3205, individual 3210 associated with audio signal 3207, etc.). For example, in the conditioned audio signal, the sound from individual 3210 may be more easily heard by user 100, being louder and/or more distinguishable than the sounds from audio signals 3205 and 3207, which may represent background noise within the environment.
活跃说话者的直觉控制Intuitive control of active speakers
助听器系统旨在改进和增强用户与环境的互动。用户可能依赖助听器系统来导航他们的周围环境和日常活动。然而,不同的用户可能取决于环境需要不同程度的帮助。在一些情况下,用户可以优先听到来自他或她的环境中的源的声音,而不是从他的环境中的一个或多个附加源。例如,用户可以优先听到来自他的环境中的家庭成员的声音,而不是他的环境中的陌生人或背景噪声。典型的助听器系统可能不能基于用户的需要来充分地校正或调整音频信号。因此,需要用于基于来自用户的环境的提示而自动调节用户的音频信号的装置和方法。Hearing aid systems are designed to improve and enhance the user's interaction with the environment. Users may rely on hearing aid systems to navigate their surroundings and daily activities. However, different users may require different levels of assistance depending on the environment. In some cases, a user may hear sound from sources in his or her environment preferentially over one or more additional sources in his or her environment. For example, a user may preferentially hear voices from family members in his environment over strangers or background noise in his environment. Typical hearing aid systems may not be able to adequately correct or adjust the audio signal based on the needs of the user. Accordingly, there is a need for an apparatus and method for automatically adjusting a user's audio signal based on cues from the user's environment.
所公开的实施例包括可以被配置为基于来自助听器用户的环境的提示来校正或调整音频信号的助听器系统。提示可以包括用户与一个或多个个体之间的距离、个体相对于用户的视线方向的方向、活跃说话者的手势、个体朝向活跃说话者的视线方向或其他可视提示。例如,来自离用户更近的个体的声音可能比来自离用户更远的个体的声音具有更高的优先级。在一些实施例中,用户可以经由设备手动定义或分配优先级给不同的声源。例如,用户可以将比其他声源(例如,来自设备的声音、陌生人、背景噪声等)更高的优先级分配给他辨识出的个体(例如,家庭成员、朋友等)。在一些实施例中,助听器系统可以识别个体并相应地使用分配给个体的优先级。The disclosed embodiments include hearing aid systems that may be configured to correct or adjust audio signals based on cues from the hearing aid user's environment. The cues may include the distance between the user and one or more individuals, the direction of the individual relative to the user's gaze direction, the active speaker's gesture, the individual's gaze direction toward the active speaker, or other visual cues. For example, sounds from individuals closer to the user may have higher priority than sounds from individuals further away from the user. In some embodiments, the user may manually define or assign priorities to different sound sources via the device. For example, a user may assign a higher priority to his identified individuals (eg, family members, friends, etc.) than other sound sources (eg, sounds from the device, strangers, background noise, etc.). In some embodiments, the hearing aid system can identify the individual and use the priority assigned to the individual accordingly.
在一些实施例中,助听器系统可以通过基于声源的优先级来选择性地调节(例如,放大、衰减、静音等)音频信号来校正或调整音频信号。例如,助听器系统可以优选地放大来自具有更高优先级的声源的音频信号。在一些实施例中,助听器系统可以优选地衰减或静音来自具有较低优先级的声源的音频信号。In some embodiments, the hearing aid system may correct or adjust the audio signal by selectively conditioning (eg, amplifying, attenuating, muting, etc.) the audio signal based on the priority of the sound source. For example, the hearing aid system may preferably amplify audio signals from sound sources with higher priority. In some embodiments, the hearing aid system may preferably attenuate or mute audio signals from sound sources with lower priority.
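The priority-based conditioning described above can be sketched as a mapping from priority levels to gains, where the lowest level mutes a source and the highest amplifies it. The priority scale and gain values are illustrative assumptions:

```python
PRIORITY_GAIN = {0: 0.0, 1: 0.5, 2: 1.0, 3: 2.0}  # mute .. amplify

def condition_by_priority(sources, priorities, default_priority=2):
    """Scale each source's samples by the gain for its assigned priority;
    `sources` maps a source label to samples, `priorities` maps a label
    to a level in 0-3 (unlisted sources pass through unchanged)."""
    return {
        label: [sample * PRIORITY_GAIN[priorities.get(label, default_priority)]
                for sample in samples]
        for label, samples in sources.items()
    }

mixed = condition_by_priority(
    {"family_member": [1.0], "stranger": [1.0], "device": [1.0]},
    {"family_member": 3, "stranger": 1, "device": 0},
)
```

The priority table itself could be populated from the user's manual assignments or from the recognized-individual hierarchy described below in the text.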
图35是示出符合所公开实施例的包括具有语音和/或图像识别的助听器的示例性环境的示意图。在一些实施例中,可穿戴相机(例如,装置110的可穿戴相机)可以被配置为从用户100的环境捕捉多个图像。在一些实施例中,至少一个麦克风可以被配置为从用户100的环境捕捉声音。35 is a schematic diagram illustrating an exemplary environment including a hearing aid with speech and/or image recognition consistent with the disclosed embodiments. In some embodiments, a wearable camera (eg, the wearable camera of device 110 ) may be configured to capture multiple images from the environment of user 100 . In some embodiments, at least one microphone may be configured to capture sound from the environment of the user 100 .
例如,处理器210可以接收表示由至少一个麦克风1750从用户100的环境捕捉的声音的至少一个音频信号3511、3513或3515。在一些实施例中,处理器210可以基于对至少一个音频信号(例如,音频信号3511、3513或3515)的分析来识别与第一个体3501相关联的第一语音相关联的第一音频信号3511和与第二个体3503相关联的第二语音相关联的第二音频信号3513。For example, the processor 210 may receive at least one audio signal 3511 , 3513 or 3515 representing sound captured by the at least one microphone 1750 from the environment of the user 100 . In some embodiments, processor 210 may identify a first audio signal associated with a first speech associated with first individual 3501 based on analysis of at least one audio signal (eg, audio signal 3511, 3513, or 3515) 3511 and a second audio signal 3513 associated with the second speech associated with the second individual 3503.
助听器系统可以存储识别出的人的语音样本、图像、语音特征和/或面部特征以帮助识别和选择性放大。例如,当个体(第一个体3501或第二个体3503)进入装置110的视场时,该个体可以被识别为已经被介绍给用户110的个体,或者在过去可能与用户100交互过的个体(例如,朋友、同事、亲戚、先前的熟人等)。因此,相对于用户环境中的其他声音,可以隔离和/或选择性地放大与辨识出的个体的语音相关联的音频信号(例如,音频信号3511或音频信号3513)。与从个体方向以外的方向接收的声音相关联的音频信号(例如,音频信号3515)可以被抑制、衰减、滤波等。The hearing aid system may store speech samples, images, speech features and/or facial features of the recognized person to aid in recognition and selective amplification. For example, when an individual (either the first individual 3501 or the second individual 3503) enters the field of view of the device 110, the individual may be identified as an individual who has been introduced to the user 110, or who may have interacted with the user 100 in the past (eg, friends, colleagues, relatives, previous acquaintances, etc.). Accordingly, an audio signal (eg, audio signal 3511 or audio signal 3513 ) associated with a recognized individual's speech may be isolated and/or selectively amplified relative to other sounds in the user's environment. Audio signals (eg, audio signal 3515) associated with sound received from directions other than the individual's direction may be suppressed, attenuated, filtered, and the like.
用户100可能想要基于用户100希望接收的声音的优先级来放大音频信号。例如,处理器210可以确定个体的层次结构并基于个体的相对状态分配优先权。该层次结构可以基于个体在家庭或组织(例如,公司、运动队、俱乐部等)中相对于用户100的位置。例如,用户100可能处于工作环境中,并且可能需要在他的同事之前听到他的老板。因此,用户100的老板可以比同事或来自不同部门的人排名更高,因此可以在选择性调节过程中具有优先权。在一些实施例中,用户100可能处于具有“密友”、家人和熟人的环境中。例如,识别为密友或家人的个体可以优先于用户100的熟人,因为用户100可能希望听到密友或家人,而不是熟人。The user 100 may want to amplify the audio signal based on the priority of the sounds that the user 100 wishes to receive. For example, the processor 210 may determine a hierarchy of individuals and assign priorities based on the relative status of the individuals. The hierarchy may be based on the location of the individual relative to the user 100 within the family or organization (eg, company, sports team, club, etc.). For example, user 100 may be in a work environment and may need to hear his boss before his colleagues. Therefore, the boss of the user 100 may be ranked higher than colleagues or people from different departments, and thus may have priority in the selective adjustment process. In some embodiments, user 100 may be in an environment with "close friends," family, and acquaintances. For example, individuals identified as close friends or family members may take precedence over acquaintances of user 100 because user 100 may wish to hear from close friends or family members rather than acquaintances.
在一些实施例中,可穿戴相机可以是装置110(例如,基于相机的定向助听器设备)的组件,用于基于用户100的环境中的个体(例如,第一个体3501或第二个体3503)的识别来选择性地改变声音的放大。在一些实施例中,可穿戴相机可以使用图像传感器220从用户100的环境捕捉至少一个图像。在一些实施例中,处理器210可以接收由可穿戴相机捕捉的至少一个图像,并且可以基于对至少一个图像的分析来识别用户100的环境中的至少一个个体。处理器210(和/或处理器210a和210b)可以被配置为使用各种图像检测或处理算法(例如,使用卷积神经网络(CNN)、尺度不变特征变换(SIFT)、定向梯度直方图(HOG)特征或其他技术)来分析捕捉的图像并检测该至少一个个体的身体部分或面部部分的特征。基于检测到的至少一个个体的身体部分或面部部分的表示,可以识别该至少一个个体。在一些实施例中,如图20A或图20B中针对装置110所描述的,处理器210可以被配置为使用面部和/或语音识别组件来识别至少一个个体。In some embodiments, the wearable camera may be a component of the apparatus 110 (e.g., a camera-based directional hearing aid device) for selectively varying the amplification of sound based on the identification of individuals (e.g., the first individual 3501 or the second individual 3503) in the environment of the user 100. In some embodiments, the wearable camera may capture at least one image from the environment of the user 100 using the image sensor 220. In some embodiments, processor 210 may receive at least one image captured by the wearable camera, and may identify at least one individual in the environment of user 100 based on analysis of the at least one image. Processor 210 (and/or processors 210a and 210b) may be configured to use various image detection or processing algorithms (e.g., convolutional neural networks (CNN), scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG) features, or other techniques) to analyze the captured images and detect features of a body part or facial part of the at least one individual. The at least one individual may be identified based on the detected representation of the body part or facial part of the at least one individual. In some embodiments, processor 210 may be configured to identify at least one individual using facial and/or voice recognition components, as described for apparatus 110 in FIG. 20A or FIG. 20B.
例如,面部识别组件2040可以被配置为识别用户100的环境内的一个或多个面部。面部识别组件2040可以识别个体的面部上的面部特征,诸如眼睛、鼻子、颧骨、下巴或其他特征。面部识别组件2040可以分析这些特征的相对大小和位置以识别该个体。在一些实施例中,面部识别组件2040可以利用一种或多种算法来分析检测到的特征,诸如主分量分析(例如,使用本征脸)、线性判别分析、弹性束图匹配(例如,使用Fisher脸)、局部二进制模式直方图(LBPH)、尺度不变特征变换(SIFT)、加速鲁棒特征(SURF)等。可以使用诸如三维识别、皮肤纹理分析和/或热成像的另外的面部识别技术来识别个体。除了个体的面部特征之外的其他特征也可以用于识别,诸如身高、体型或个体的其他区别特征。For example, facial recognition component 2040 can be configured to recognize one or more faces within the environment of user 100. Facial recognition component 2040 can identify facial features on an individual's face, such as the eyes, nose, cheekbones, chin, or other features. The facial recognition component 2040 can analyze the relative size and location of these features to identify the individual. In some embodiments, the facial recognition component 2040 can utilize one or more algorithms to analyze the detected features, such as principal component analysis (e.g., using eigenfaces), linear discriminant analysis, elastic bunch graph matching (e.g., using Fisherfaces), local binary pattern histograms (LBPH), scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and the like. Additional facial recognition techniques such as three-dimensional recognition, skin texture analysis, and/or thermal imaging may be used to identify individuals. Features other than the individual's facial features may also be used for identification, such as height, body shape, or other distinguishing characteristics of the individual.
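A minimal sketch of the eigenface (principal component analysis) approach mentioned above, using only NumPy on flattened grayscale images; a production component such as facial recognition component 2040 would use far larger galleries and more robust features.

```python
import numpy as np

def eigenface_basis(faces: np.ndarray, k: int):
    """faces: (n_samples, n_pixels) flattened grayscale face images.
    Returns the mean face and the top-k principal components."""
    mean = faces.mean(axis=0)
    centered = faces - mean
    # SVD of the centered data yields the principal axes directly.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:k]

def project(face, mean, components):
    """Project a face into the low-dimensional eigenface space."""
    return components @ (face - mean)

def nearest_identity(face, gallery, mean, components):
    """gallery: dict of identity -> flattened reference face.
    Returns the identity whose projection is closest to the probe."""
    q = project(face, mean, components)
    return min(gallery,
               key=lambda ident: np.linalg.norm(
                   q - project(gallery[ident], mean, components)))
```

Recognition then reduces to a nearest-neighbor search in the projected space, which is how an eigenface gallery comparison is typically performed.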
面部识别组件2040可以访问与用户100相关联的数据库或数据,以确定检测到的面部特征是否对应于辨识出的个体。例如,处理器210可以访问数据库2050,数据库2050包含关于用户100已知的个体的信息和表示相关联的面部特征或其他识别特征的数据。这样的数据可以包括个体的一个或多个图像,或者表示可用于通过面部识别进行的识别的用户面部的数据。面部识别组件2040还可以访问用户100的联系人列表,诸如用户电话上的联系人列表、基于网络的联系人列表(例如,通过OutlookTM、SkypeTM、GoogleTM、SalesforceTM等)或与听觉接口设备1710相关联的专用联系人列表。在一些实施例中,数据库2050可以由装置110通过先前的面部识别分析来编译。例如,处理器210可以被配置为将与在由装置110捕捉的图像中识别出的一个或多个面部相关联的数据存储在数据库2050中。每次在图像中检测到面部时,可将检测到的面部特征或其他数据与数据库2050中的先前识别出的面部进行比较。面部识别组件2040可以确定个体是用户100的辨识出的个体、该个体先前是否在超过特定阈值的多个实例中被系统识别出、该个体是否已被明确地介绍给装置110等。Facial recognition component 2040 can access a database or data associated with user 100 to determine whether detected facial features correspond to a recognized individual. For example, processor 210 may access database 2050, which contains information about individuals known to user 100 and data representing associated facial features or other identifying features. Such data may include one or more images of an individual, or data representing a user's face that may be used for identification by facial recognition. The facial recognition component 2040 can also access a contact list of the user 100, such as a contact list on the user's phone, a web-based contact list (e.g., through Outlook™, Skype™, Google™, Salesforce™, etc.), or a dedicated contact list associated with auditory interface device 1710. In some embodiments, database 2050 may be compiled by device 110 through previous facial recognition analyses. For example, processor 210 may be configured to store data associated with one or more faces identified in images captured by device 110 in database 2050. Each time a face is detected in an image, the detected facial features or other data may be compared to previously identified faces in database 2050.
Facial recognition component 2040 can determine whether the individual is an identified individual of user 100, whether the individual has been previously identified by the system in instances exceeding a certain threshold, whether the individual has been explicitly introduced to device 110, and the like.
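The threshold-based recognition logic described above can be sketched as follows; the class name, default threshold, and method names are illustrative assumptions rather than disclosed structure.

```python
class RecognitionTracker:
    """Tracks how often each face has been seen; a face counts as
    recognized once its sighting count exceeds a threshold, or once
    the individual has been explicitly introduced to the device.
    The threshold value is an illustrative assumption."""

    def __init__(self, threshold: int = 3):
        self.threshold = threshold
        self.sightings = {}
        self.introduced = set()

    def record_sighting(self, face_id: str) -> None:
        """Called each time this face is detected in a captured image."""
        self.sightings[face_id] = self.sightings.get(face_id, 0) + 1

    def introduce(self, face_id: str) -> None:
        """Explicitly introduce an individual to the device."""
        self.introduced.add(face_id)

    def is_recognized(self, face_id: str) -> bool:
        return (face_id in self.introduced
                or self.sightings.get(face_id, 0) > self.threshold)
```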
装置110可以被配置为基于接收到的由可穿戴相机捕捉的多个图像来识别用户100的环境中的个体(例如,第一个体3501或第二个体3503)。例如,装置110可以被配置为识别与用户100的环境内的第一个体3501相关联的面部3521或与第二个体3503相关联的面部3523。例如,装置110可以被配置为使用相机1730来捕捉用户100的周围环境的一个或多个图像。所捕捉的图像可以包括辨识出的个体(例如,第一个体3501或第二个体3503)的表示,该个体可以是用户100的朋友、同事、亲戚或先前的熟人。处理器210(和/或处理器210a和210b)可以被配置为使用各种面部识别技术来分析捕捉的图像并检测辨识出的个体。因此,装置110,或具体地存储器550,可以包括一个或多个面部识别组件(例如,软件程序、模块、库等)。Apparatus 110 may be configured to identify an individual (e.g., first individual 3501 or second individual 3503) in the environment of user 100 based on a plurality of received images captured by the wearable camera. For example, apparatus 110 may be configured to identify face 3521 associated with first individual 3501 or face 3523 associated with second individual 3503 within the environment of user 100. For example, device 110 may be configured to capture one or more images of user 100's surroundings using camera 1730. The captured images may include a representation of a recognized individual (e.g., first individual 3501 or second individual 3503), which may be a friend, colleague, relative, or previous acquaintance of user 100. Processor 210 (and/or processors 210a and 210b) may be configured to analyze the captured images and detect recognized individuals using various facial recognition techniques. Accordingly, apparatus 110, or memory 550 in particular, may include one or more facial recognition components (e.g., software programs, modules, libraries, etc.).
在一些实施例中,处理器210可以被配置为基于处理器210确定第一音频信号与高于第二音频信号的优先级的优先级相关联而引起对音频信号(例如,音频信号3511或3513)的选择性调节。可以使用各种方法来确定声音的层次结构。例如,可以通过对两个声音或两个以上声音的比较分析来确定声音的层次结构。在一些实施例中,声源可以包括人、对象(例如,电视、汽车等)、环境(例如,流水、风等)等。例如,处理器210可以使用比较分析来确定来自人的声音相对于来自对象或环境的声音具有优先权。In some embodiments, the processor 210 may be configured to cause selective conditioning of an audio signal (e.g., audio signal 3511 or 3513) based on the processor 210 determining that the first audio signal is associated with a higher priority than the second audio signal. Various methods can be used to determine the hierarchy of sounds. For example, the hierarchy of sounds can be determined by a comparative analysis of two or more sounds. In some embodiments, sound sources may include people, objects (e.g., televisions, cars, etc.), the environment (e.g., flowing water, wind, etc.), and the like. For example, the processor 210 may use comparative analysis to determine that sound from a person has priority over sound from an object or the environment.
在一些实施例中,与辨识出的个体相关联的音频信号的选择性调节可以基于用户100的环境中的个体的身份。例如,在图像中检测到多个个体的情况下,处理器210可以如上所述使用一种或多种面部识别技术来识别个体。与用户100已知的个体相关联的音频信号可以被选择性地放大或以其他方式调节以具有相对于未知个体的优先权。例如,处理器210可以被配置为衰减或静音与用户100的环境中的旁观者(诸如嘈杂的办公室同事等)相关联的音频信号。在一些实施例中,处理器210还可以确定个体的层次结构并基于个体的相对状态分配优先权。该层次结构可以基于个体在家庭或组织(例如,公司、运动队、俱乐部等)中相对于用户100的位置。例如,用户100的老板可以比同事或来自不同部门的人排名更高,因此可以在选择性调节过程中具有优先权。在一些实施例中,可以基于列表或数据库来确定层次结构。被系统辨识出的个体可以被单独排序或分组为几层优先级。该数据库可以专门为此目的而维护或者可以从外部访问。In some embodiments, the selective conditioning of the audio signal associated with a recognized individual may be based on the identity of the individual in the environment of the user 100. For example, where multiple individuals are detected in an image, the processor 210 may identify the individuals using one or more facial recognition techniques as described above. Audio signals associated with individuals known to user 100 may be selectively amplified or otherwise conditioned to have priority over unknown individuals. For example, processor 210 may be configured to attenuate or mute audio signals associated with bystanders in the environment of user 100 (such as noisy office colleagues, etc.). In some embodiments, the processor 210 may also determine a hierarchy of individuals and assign priorities based on the relative status of the individuals. The hierarchy may be based on the individual's position relative to the user 100 within a family or organization (e.g., a company, sports team, club, etc.). For example, the boss of user 100 may be ranked higher than colleagues or people from different departments and thus may have priority in the selective conditioning process. In some embodiments, the hierarchy may be determined based on a list or database. Individuals recognized by the system may be ranked individually or grouped into tiers of priority. The database may be maintained specifically for this purpose or may be accessed externally.
For example, a database can be associated with a user's social network (e.g., Facebook™, LinkedIn™, etc.), and individuals can be prioritized based on their groupings or relationship to the user. For example, individuals identified as "close friends" or family members may take precedence over acquaintances of user 100.
在一些实施例中,处理器210可以基于个体与用户100的接近度来选择性地调节与该个体相关联的音频信号。处理器210可以基于捕捉的图像、测距仪或其他方法来确定从用户100到每个个体的距离,并且可以基于该距离来选择性地调节与这些个体相关联的音频信号。例如,与远离用户100的个体相比,物理上更接近用户100的个体可以被给予更高优先级,并且他或她的声音可以以更大的幅度被放大。在一些实施例中,处理器210可以确定个体相对于用户的视线方向的方向。在相对于视线方向的更近角度处的个体可以被给予更高优先级。In some embodiments, the processor 210 may selectively condition the audio signal associated with an individual based on the individual's proximity to the user 100. The processor 210 may determine the distance from the user 100 to each individual based on captured images, a rangefinder, or other methods, and may selectively condition the audio signals associated with those individuals based on the distance. For example, an individual who is physically closer to the user 100 may be given higher priority, and his or her voice may be amplified to a greater degree than that of an individual farther from the user 100. In some embodiments, the processor 210 may determine the direction of the individual relative to the user's gaze direction. Individuals at closer angles relative to the gaze direction may be given higher priority.
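One way to combine proximity and gaze angle into a single priority score is sketched below; the weighting scheme is an illustrative assumption, not a disclosed formula.

```python
import math

def proximity_priority(distance_m: float, angle_deg: float,
                       w_dist: float = 1.0, w_angle: float = 0.5) -> float:
    """Higher score means higher priority. Closer individuals and
    individuals nearer the user's gaze direction score higher.
    The weights w_dist and w_angle are illustrative assumptions."""
    return (w_dist / (1.0 + distance_m)
            + w_angle * math.cos(math.radians(angle_deg)))
```

With these weights, an individual 1 m away on the gaze axis outranks one 5 m away, and an individual on the gaze axis outranks one 60° off it at the same distance.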
在一些实施例中,处理器210可以通过基于对多个图像中的至少一个的分析来识别至少一个动作来确定优先级水平。例如,处理器210可以基于唇部移动和检测到的声音来确定用户100的环境中的哪些个体正在说话。例如,处理器210可以跟踪与个体3501或3503相关联的唇部移动,以确定个体3501或3503正在说话。可以在检测到的唇部移动和接收到的音频信号之间进行比较分析。例如,处理器210可以基于在检测到与音频信号3511相关联的声音的同时个体3501的嘴正在运动的确定来确定个体3501正在说话。在一些实施例中,当个体3501的唇部停止运动时,这可对应于与音频信号3511相关联的声音中的静默或减小音量的时段。In some embodiments, the processor 210 may determine the priority level by identifying at least one action based on an analysis of at least one of the plurality of images. For example, processor 210 may determine which individuals in the environment of user 100 are speaking based on lip movements and detected sounds. For example, processor 210 may track lip movements associated with individual 3501 or 3503 to determine that individual 3501 or 3503 is speaking. A comparative analysis can be performed between the detected lip movement and the received audio signal. For example, the processor 210 may determine that the individual 3501 is speaking based on a determination that the individual 3501's mouth is moving while the sound associated with the audio signal 3511 is detected. In some embodiments, when the lips of the individual 3501 cease to move, this may correspond to a period of silence or reduced volume in the sound associated with the audio signal 3511.
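The comparative analysis between detected lip movement and the received audio signal described above can be sketched as a simple correlation; a real system would align frame rates, normalize the signals, and handle noise, so the helper below is an illustrative assumption.

```python
import numpy as np

def active_speaker(lip_motion: dict, audio_envelope: np.ndarray) -> str:
    """lip_motion maps a speaker id to a per-frame mouth-opening signal;
    audio_envelope is the per-frame loudness of the detected speech.
    Returns the speaker whose lip motion best correlates with the audio."""
    def corr(a, b):
        a = (a - a.mean()) / (a.std() + 1e-9)
        b = (b - b.mean()) / (b.std() + 1e-9)
        return float(np.mean(a * b))
    return max(lip_motion,
               key=lambda s: corr(np.asarray(lip_motion[s], float),
                                  audio_envelope))
```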
在一些实施例中,与装置110相关联的数据还可以结合检测到的唇部移动被用于确定和/或验证个体是否正在说话,诸如用户100或个体3501或3503的视线方向、检测到的个体3501或3503的身份、辨识出的个体3501或3503的声纹等。In some embodiments, data associated with device 110 may also be used in conjunction with detected lip movements to determine and/or verify whether an individual is speaking, such as the gaze direction of user 100 or of individual 3501 or 3503, the detected identity of individual 3501 or 3503, a voiceprint of the recognized individual 3501 or 3503, and the like.
在一些实施例中,处理器210可以被配置为基于与音频信号相关联的哪些个体当前正在说话来选择性地调节多个音频信号。也就是,在一些实施例中,处理器210可以优先正在说话的个体而不是不在说话的个体。例如,个体3501和个体3503可以参与用户100的环境内的对话,并且处理器210可以被配置为基于个体3501和3503的相应唇部移动从放大与个体3501相关联的音频信号3511转换到放大与个体3503相关联的音频信号3513。例如,个体3501的唇部移动可以指示个体3501已经停止说话,或者与个体3503相关联的唇部移动可以指示个体3503已经开始说话。因此,处理器210可以在放大音频信号3511到音频信号3513之间转换。在一些实施例中,处理器210可以被配置为同时处理和/或调节两个音频信号,但仅基于哪个个体正在说话而选择性地将经调节的音频发送到听觉接口设备1710。在实现语音识别的情况下,处理器210可以基于语音的背景来确定和/或预期说话者之间的转换。例如,处理器210可以分析音频信号3511,以确定个体3501已经到达句子的结尾或已经问了一个问题,这可以指示个体3501已经结束或即将结束说话。In some embodiments, the processor 210 may be configured to selectively condition multiple audio signals based on which individuals associated with the audio signals are currently speaking. That is, in some embodiments, the processor 210 may prioritize individuals who are speaking over individuals who are not speaking. For example, individual 3501 and individual 3503 may engage in a conversation within the environment of user 100, and processor 210 may be configured to transition from amplifying audio signal 3511 associated with individual 3501 to amplifying audio signal 3513 associated with individual 3503 based on the respective lip movements of individuals 3501 and 3503. For example, lip movement of individual 3501 may indicate that individual 3501 has stopped speaking, or lip movement associated with individual 3503 may indicate that individual 3503 has begun to speak. Therefore, the processor 210 can switch from amplifying audio signal 3511 to amplifying audio signal 3513. In some embodiments, processor 210 may be configured to process and/or condition both audio signals simultaneously, but selectively send the conditioned audio to auditory interface device 1710 based only on which individual is speaking. Where speech recognition is implemented, the processor 210 may determine and/or anticipate transitions between speakers based on the context of the speech.
For example, processor 210 may analyze audio signal 3511 to determine that individual 3501 has reached the end of a sentence or has asked a question, which may indicate that individual 3501 has finished or is about to finish speaking.
在一些实施例中,处理器210可以被配置为在多个活跃说话者之间进行选择,以选择性地调节音频信号。例如,个体3501和3503可能同时都在说话,或者他们的讲话可能在对话期间重叠。处理器210可以将与一个说话个体相关联的音频优先于其他个体进行放大。这可以包括给予一个已经开始但没有完成一个词语或句子或者当另一个说话者开始讲话时他还没有完全完成讲话的说话者优先级。如上所述,该确定还可以由语音的背景驱动。In some embodiments, the processor 210 may be configured to select between multiple active speakers to selectively condition the audio signals. For example, individuals 3501 and 3503 may both be speaking at the same time, or their speech may overlap during the conversation. The processor 210 may amplify audio associated with one speaking individual in preference to the others. This may include giving priority to a speaker who has begun but not yet completed a word or sentence, or who had not fully finished speaking when another speaker began to speak. As noted above, this determination may also be driven by the context of the speech.
在一些实施例中,可以确定用户100或个体3501或3503的视线方向,并且可以在活跃说话者中给予视线方向所指向的个体更高的优先级。例如,如果个体3503正在看着个体3501,则与个体3501相关联的音频信号3511可以被选择性地调节(例如,放大)。在一些实施例中,可以基于用户100的环境中其他个体的相对行为来分配优先级。例如,如果个体3501和个体3503都在说话,并且附加的个体正在看着个体3501而不是个体3503,则与个体3501相关联的音频信号3511可以优先于与个体3503相关联的音频信号3513被放大。在确定个体的身份的实施例中,如前面讨论的,可以基于说话者的相对状态来分配优先级。In some embodiments, the gaze direction of the user 100 or of individual 3501 or 3503 may be determined, and the individual toward whom the gaze direction points may be given higher priority among active speakers. For example, if individual 3503 is looking at individual 3501, the audio signal 3511 associated with individual 3501 may be selectively conditioned (e.g., amplified). In some embodiments, priorities may be assigned based on the relative behavior of other individuals in the environment of user 100. For example, if both individual 3501 and individual 3503 are speaking, and an additional individual is looking at individual 3501 rather than individual 3503, the audio signal 3511 associated with individual 3501 may be amplified in preference to the audio signal 3513 associated with individual 3503. In embodiments where the identity of an individual is determined, as previously discussed, priorities may be assigned based on the relative status of the speakers.
在一些实施例中,处理器210可以被配置为基于所确定的优先级,对由至少一个麦克风接收的用户100的环境中的至少一个音频信号(例如,音频信号3511或音频信号3513)进行选择性调节。该至少一个经调节的音频信号可以被发送到听觉接口设备1710,并且听觉接口设备1710可以被配置为向用户100的耳朵提供声音。因此,听觉接口设备1710可以向用户100提供对应于至少一个音频信号的源(例如,与音频信号3511相关联的、与音频信号3513相关联的等)的听觉反馈。处理器210可以对从麦克风1720接收的音频信号执行各种调节技术。调节可以包括放大被确定为具有比其他音频信号更高优先级的音频信号。放大可以例如通过处理相对于其他信号与更高优先级相关联的音频信号来数字化地实现。放大还可以通过改变麦克风1720的一个或多个参数来实现,以聚焦于从更高优先级源发出的音频声音。例如,麦克风1720可以是定向麦克风,并且处理器210可以执行将麦克风1720聚焦在用户100的环境中的个体3501或3503或其他声音上的操作。可以使用用于放大声音的各种其他技术,诸如使用波束成形麦克风阵列、声学望远镜技术等。In some embodiments, the processor 210 may be configured to selectively condition at least one audio signal (e.g., audio signal 3511 or audio signal 3513) in the environment of the user 100 received by the at least one microphone, based on the determined priority. The at least one conditioned audio signal may be sent to auditory interface device 1710, and auditory interface device 1710 may be configured to provide sound to an ear of the user 100. Accordingly, auditory interface device 1710 may provide the user 100 with auditory feedback corresponding to the source of the at least one audio signal (e.g., associated with audio signal 3511, associated with audio signal 3513, etc.). The processor 210 may perform various conditioning techniques on the audio signals received from the microphone 1720. Conditioning may include amplifying audio signals determined to have higher priority than other audio signals. Amplification may be accomplished digitally, for example, by processing audio signals associated with a higher priority relative to the other signals. Amplification may also be achieved by changing one or more parameters of the microphone 1720 to focus on audio sounds emanating from higher priority sources. For example, microphone 1720 may be a directional microphone, and processor 210 may perform operations to focus microphone 1720 on individual 3501 or 3503 or other sounds in the environment of user 100.
Various other techniques for amplifying sound may be used, such as the use of beamforming microphone arrays, acoustic telescope techniques, and the like.
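As one concrete example of the beamforming technique mentioned above, a minimal integer-delay delay-and-sum beamformer can be sketched as follows; practical arrays use fractional delays derived from the microphone geometry, so this is an illustrative simplification.

```python
import numpy as np

def delay_and_sum(channels: np.ndarray, delays: np.ndarray) -> np.ndarray:
    """channels: (n_mics, n_samples) microphone signals.
    delays: integer sample delays that align the target source across
    microphones. Each channel is shifted to undo its delay and the
    channels are averaged, which reinforces sound from the target
    direction while averaging out sound from other directions."""
    n_mics, n_samples = channels.shape
    out = np.zeros(n_samples)
    for ch, d in zip(channels, delays):
        out += np.roll(ch, -int(d))
    return out / n_mics
```

For a source whose wavefront reaches the second microphone 3 samples late, passing delays `[0, 3]` re-aligns the channels so their average reproduces the source.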
调节还可以包括衰减或抑制从较低优先级的源接收的一个或多个音频信号。例如,处理器210可以确定个体3501具有高于个体3503的优先级,并且衰减音频信号3513和3515。类似于声音的放大,声音的衰减可以通过处理音频信号来发生,或者通过改变与一个或多个麦克风1720相关联的一个或多个参数来引导焦点远离从较低优先级源发出的声音。Conditioning may also include attenuating or suppressing one or more audio signals received from lower priority sources. For example, processor 210 may determine that individual 3501 has a higher priority than individual 3503, and attenuate audio signals 3513 and 3515. Similar to sound amplification, sound attenuation can occur by processing the audio signal, or by changing one or more parameters associated with one or more microphones 1720 to direct focus away from sound emanating from lower priority sources.
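The combined amplify/attenuate/mute conditioning above can be sketched as per-source gain mixing; the gain values and the pass-through default are illustrative assumptions.

```python
import numpy as np

def apply_gains(signals: dict, gains: dict) -> np.ndarray:
    """Mix per-source audio signals with per-source gains: higher
    priority sources get gain > 1, lower priority sources gain < 1,
    and suppressed sources gain 0. Unlisted sources pass unchanged."""
    out = None
    for src, sig in signals.items():
        term = gains.get(src, 1.0) * np.asarray(sig, dtype=float)
        out = term if out is None else out + term
    return out
```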
在一些实施例中,调节还可以包括改变对应于来自更高优先级源的声音的音频信号的音调,以使用户100更容易感知该声音。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以调节来自更高优先级源的声音的音高以使其对于用户100更易感知。例如,用户100可能经历10kHz以上的频率中的听觉损失。因此,处理器210可以将更高的频率(例如,在15khz处)重新映射到10khz以下频率。在一些实施例中,处理器210可以被配置为改变与一个或多个音频信号相关联的语速。因此,处理器210可以被配置为例如使用语音活动检测(VAD)算法或技术来检测由麦克风1720接收的一个或多个音频信号内的语音。例如,如果确定声音对应于来自个体3501的语音或讲话,则处理器210可以被配置为改变来自个体3501的声音的回放速率。例如,可以降低个体3501的语速以使检测到的语音对于用户100更易感知。In some embodiments, the conditioning may also include changing the tone of the audio signal corresponding to sound from a higher priority source to make the sound easier for the user 100 to perceive. For example, the user 100 may have less sensitivity to tones within a certain range, and conditioning of the audio signal may adjust the pitch of the sound from a higher priority source to make it more perceptible to the user 100. For example, user 100 may experience hearing loss at frequencies above 10 kHz. Thus, the processor 210 may remap higher frequencies (e.g., at 15 kHz) to frequencies below 10 kHz. In some embodiments, the processor 210 may be configured to vary the speech rate associated with one or more audio signals. Accordingly, processor 210 may be configured to detect speech within one or more audio signals received by microphone 1720, e.g., using a voice activity detection (VAD) algorithm or technique. For example, if it is determined that a sound corresponds to voice or speech from the individual 3501, the processor 210 may be configured to change the playback rate of the sound from the individual 3501. For example, the speech rate of the individual 3501 may be reduced to make the detected speech more perceptible to the user 100.
Various other processing may be performed, such as modifying the tone of the sound from the individual 3501 to maintain the same pitch as the original audio signal, or reducing noise within the audio signal. If speech recognition has been performed on the audio signal associated with the sound from the individual 3501, the conditioning may also include modifying the audio signal based on the detected speech. For example, processor 210 may introduce pauses, or increase or decrease the duration of pauses, between words and/or sentences, which may make the speech easier to understand.
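The frequency remapping described above (moving spectral content from a band the user cannot hear into an audible band) can be sketched with an FFT. The cutoff and target frequencies below are illustrative assumptions; real hearing aids use dedicated frequency-lowering algorithms rather than this raw spectral shift.

```python
import numpy as np

def remap_high_frequencies(signal, sr, cutoff_hz=10_000.0, target_hz=6_000.0):
    """Zeroes spectral content above cutoff_hz and re-inserts it lower,
    shifted down by (cutoff_hz - target_hz), so that content the user
    cannot hear lands in an audible band. Illustrative sketch only."""
    spec = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    out = spec.copy()
    high = freqs >= cutoff_hz
    out[high] = 0.0
    # Number of FFT bins corresponding to the downward shift.
    shift = int(np.searchsorted(freqs, cutoff_hz)
                - np.searchsorted(freqs, target_hz))
    hi_idx = np.nonzero(high)[0]
    dst = hi_idx - shift
    keep = dst >= 0
    out[dst[keep]] += spec[hi_idx[keep]]
    return np.fft.irfft(out, n=len(signal))
```

With a 10 kHz cutoff and 6 kHz target, a 12 kHz tone (inaudible in the hearing-loss example above) reappears at 8 kHz, inside the audible band.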
然后可以将经调节的音频信号发送到听觉接口设备1710,并为用户100产生音频信号。因此,在经调节的音频信号中,来自更高优先级源的声音可以更容易被用户100听到,比来自较低优先级源的声音更响亮和/或更容易区分,来自较低优先级源的声音可以表示环境内的背景噪声。The conditioned audio signal may then be sent to auditory interface device 1710 and an audio signal produced for user 100. Thus, in the conditioned audio signal, sounds from higher priority sources may be more easily heard by the user 100, being louder and/or easier to distinguish than sounds from lower priority sources, which may represent background noise within the environment.
图36示出了与符合所公开实施例的具有语音和/或图像识别的助听器一起使用的示例性计算设备120。在一些实施例中,用户100可以通过预定义设置或通过主动选择要聚焦于哪个说话者来提供用于优先化说话者的输入。在一些实施例中,计算设备120可以与装置110配对。例如,计算设备120(例如,移动设备)可以显示与个体(例如,个体3501)相关联的至少一个音频信号优先级接口3601(例如,图形用户界面)。在一些实施例中,用户100可以与至少接口3601交互以提交至少一个优先级设置。例如,用户100可以经由计算设备120上的接口3601输入指示个体3501具有比其他个体更高的优先级的优先级设置。FIG. 36 illustrates an exemplary computing device 120 for use with a hearing aid with speech and/or image recognition consistent with the disclosed embodiments. In some embodiments, the user 100 may provide input for prioritizing speakers through predefined settings or by actively selecting which speaker to focus on. In some embodiments, computing device 120 may be paired with apparatus 110. For example, computing device 120 (e.g., a mobile device) may display at least one audio signal priority interface 3601 (e.g., a graphical user interface) associated with an individual (e.g., individual 3501). In some embodiments, user 100 may interact with interface 3601 to submit at least one priority setting. For example, user 100 may enter a priority setting via interface 3601 on computing device 120 indicating that individual 3501 has a higher priority than other individuals.
在一些实施例中,用户100可以通过与指示经由计算设备120调整声源(例如,个体、对象、环境等)的音量的各种界面交互来输入一个或多个声源的优先级设置。例如,用户100可以通过与指示个体3501的音量增加的界面交互(例如,选择图标、移动滑块图标等)来输入个体3501的优先级设置。在一些实施例中,用户100可以通过与降低或静音一个或多个声源(例如,其他个体、对象、环境等)的音量的界面交互来输入个体3501的优先级设置。In some embodiments, user 100 may enter priority settings for one or more sound sources by interacting with various interfaces on computing device 120 that indicate volume adjustment of sound sources (e.g., individuals, objects, the environment, etc.). For example, the user 100 may enter a priority setting for individual 3501 by interacting with an interface indicating a volume increase for individual 3501 (e.g., selecting an icon, moving a slider icon, etc.). In some embodiments, user 100 may enter a priority setting for individual 3501 by interacting with an interface that reduces or mutes the volume of one or more other sound sources (e.g., other individuals, objects, the environment, etc.).
在一些实施例中,用户100可以通过经由计算设备120的接口对声源的优先级进行序号排序来为声源分配层次结构。在一些实施例中,用户100可能“喜爱”(例如,经由计算设备120的界面指派星形符号)一些他优先于其他声源的声源。In some embodiments, the user 100 may assign a hierarchy to sound sources by ordinally ranking the priority of the sound sources via the interface of the computing device 120. In some embodiments, the user 100 may "favorite" (e.g., assign a star symbol via the interface of the computing device 120) some sound sources that he prioritizes over others.
在一些实施例中,音频信号优先级界面3601可以显示由设备110从用户100的环境捕捉的图像。在一些实施例中,计算设备120可以包括多个音频信号优先级界面,其中每个音频信号优先级界面包括由装置110从用户100的环境捕捉的图像。如前所述,处理器210可以基于对多个捕捉图像中的至少一个的分析来识别至少一个动作。例如,处理器210可以基于用户100指向一个或多个声源(例如,个体)来确定声源的层次结构。在一些实施例中,用户100可以按照从最高优先级到最低优先级的顺序指向每个声源的至少一个声源。基于用户100的动作,计算设备120的音频信号优先级界面可以显示与每个声源相关联的声音的层次结构。处理器210可以基于声源的层次结构来选择性地调节与每个声源相关联的音频信号。例如,在用户100的环境中,相对于较低优先级音频信号,更高优先级音频信号可以被隔离和/或选择性地放大。在一些实施例中,低优先级音频信号可以被抑制、衰减、滤波、不变等。In some embodiments, audio signal priority interface 3601 may display images captured by device 110 from the environment of user 100. In some embodiments, computing device 120 may include a plurality of audio signal priority interfaces, where each audio signal priority interface includes an image captured by apparatus 110 from the environment of user 100. As previously described, the processor 210 may identify at least one action based on analysis of at least one of the plurality of captured images. For example, the processor 210 may determine a hierarchy of sound sources based on the user 100 pointing at one or more sound sources (e.g., individuals). In some embodiments, the user 100 may point at at least one of the sound sources in order from highest priority to lowest priority. Based on the actions of the user 100, the audio signal priority interface of the computing device 120 may display the hierarchy of sounds associated with each sound source. The processor 210 may selectively condition the audio signal associated with each sound source based on the hierarchy of sound sources. For example, in the environment of user 100, higher priority audio signals may be isolated and/or selectively amplified relative to lower priority audio signals. In some embodiments, low priority audio signals may be suppressed, attenuated, filtered, left unchanged, and the like.
在一些实施例中,用户100可以通过经由计算设备120输入条件来输入一个或多个声源的优先级设置。例如,用户100可以输入条件,使得发出关键字的声源(例如,说“小心”、“紧急情况”等的个体)优先于其他声源(例如,被放大)。在一些实施例中,用户100可以通过输入其他条件或标准来输入声源的优先级设置。例如,用户100可以基于时间(例如,一天中的时间、一周中的时间、一年中的时间等)对声源进行优先级排序,使得声音在指定的时间跨度内被调节。例如,用户100可以输入优先级设置,使得在工作时间期间从上午9:00到下午5:00的周一到周五的声音被放大。In some embodiments, user 100 may enter priority settings for one or more sound sources by entering conditions via computing device 120. For example, user 100 may enter conditions such that sound sources uttering keywords (e.g., individuals saying "be careful," "emergency," etc.) are prioritized over other sound sources (e.g., amplified). In some embodiments, user 100 may enter priority settings for sound sources by entering other conditions or criteria. For example, user 100 may prioritize sound sources based on time (e.g., time of day, time of week, time of year, etc.), such that sound is conditioned within a specified time span. For example, user 100 may enter a priority setting such that sound is amplified during business hours, Monday through Friday from 9:00 a.m. to 5:00 p.m.
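The keyword and time-of-day conditions described above can be sketched as follows; the keyword list, the 9:00-17:00 weekday window, and the score adjustments are illustrative assumptions, not disclosed values.

```python
from datetime import datetime

def keyword_boost(transcript: str, keywords=("be careful", "emergency")) -> bool:
    """True if the source uttered a priority keyword (illustrative list)."""
    text = transcript.lower()
    return any(k in text for k in keywords)

def in_active_window(now: datetime, start_hour: int = 9, end_hour: int = 17) -> bool:
    """True Monday through Friday between 9:00 and 17:00 (illustrative window)."""
    return now.weekday() < 5 and start_hour <= now.hour < end_hour

def source_priority(base: int, transcript: str, now: datetime) -> int:
    """Combine a base rank with keyword and time conditions."""
    p = base
    if keyword_boost(transcript):
        p += 10  # keyword sources override other priorities
    if not in_active_window(now):
        p -= 1   # outside the configured time span, demote the source
    return p
```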
图37是示出符合所公开实施例的用于选择性地放大声音的示例性过程3700的流程图。例如,根据过程3700,助听器系统可以被配置为通过基于声源的优先级选择性地调节(例如,放大、衰减、静音等)音频信号来校正或调整音频信号。FIG. 37 is a flowchart illustrating an exemplary process 3700 for selectively amplifying sounds, consistent with the disclosed embodiments. For example, according to process 3700, the hearing aid system may be configured to correct or adjust audio signals by selectively conditioning (e.g., amplifying, attenuating, muting, etc.) the audio signals based on the priority of the sound sources.
在步骤3701中,可穿戴相机(例如,装置110的可穿戴相机)可以从用户100的环境捕捉多个图像。在一些实施例中,可穿戴相机可以是装置110(例如,基于相机的定向助听器设备)的组件,用于基于用户100的环境中的个体(例如,第一个体3501或第二个体3503)的识别来选择性地改变声音的放大。在一些实施例中,可穿戴相机可以使用图像传感器220从用户100的环境捕捉至少一个图像。在一些实施例中,处理器210可以接收由可穿戴相机捕捉的至少一个图像。In step 3701, a wearable camera (e.g., the wearable camera of apparatus 110) may capture a plurality of images from the environment of user 100. In some embodiments, the wearable camera may be a component of the apparatus 110 (e.g., a camera-based directional hearing aid device) for selectively varying the amplification of sound based on the identification of individuals (e.g., the first individual 3501 or the second individual 3503) in the environment of the user 100. In some embodiments, the wearable camera may capture at least one image from the environment of the user 100 using the image sensor 220. In some embodiments, the processor 210 may receive at least one image captured by the wearable camera.
在步骤3703中,处理器210可以基于对至少一个图像的分析来识别用户100的环境中的至少一个个体。如上所述,处理器210(和/或处理器210a和210b)可以被配置为使用各种图像检测或处理算法(例如,使用卷积神经网络(CNN)、尺度不变特征变换(SIFT)、定向梯度直方图(HOG)特征或其他技术)来分析捕捉的图像并检测该至少一个个体的身体部分或面部部分的特征。基于检测到的至少一个个体的身体部分或面部部分的表示,可以确定个体的至少一个标识。在一些实施例中,如图20A或图20B中针对装置110所描述的,处理器210可以被配置为使用面部和/或语音识别组件来识别至少一个个体。In step 3703, the processor 210 may identify at least one individual in the environment of the user 100 based on the analysis of the at least one image. As noted above, processor 210 (and/or processors 210a and 210b) may be configured to use various image detection or processing algorithms (e.g., convolutional neural networks (CNN), scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG) features, or other techniques) to analyze the captured images and detect features of a body part or facial part of the at least one individual. Based on the detected representation of the body part or facial part of the at least one individual, at least one identity of the individual can be determined. In some embodiments, processor 210 may be configured to identify at least one individual using facial and/or voice recognition components, as described for apparatus 110 in FIG. 20A or FIG. 20B.
装置110可以被配置为基于接收到的由可穿戴相机捕捉的多个图像来识别用户100的环境中的个体(例如,第一个体3501或第二个体3503)。装置110可以被配置为识别与用户100的环境内的第一个体3501相关联的面部3521或与第二个体3503相关联的面部3523。例如,装置110可以被配置为使用相机1730来捕捉用户100的周围环境的一个或多个图像。所捕捉的图像可以包括辨识出的个体(例如,第一个体3501或第二个体3503)的表示,该个体可以是用户100的朋友、同事、亲戚或其他先前的熟人。处理器210(和/或处理器210a和210b)可以被配置为使用各种面部识别技术来分析捕捉的图像并检测辨识出的个体。因此,装置110,或具体地存储器550,可以包括一个或多个面部识别组件。Apparatus 110 may be configured to identify an individual (e.g., first individual 3501 or second individual 3503) in the environment of user 100 based on a plurality of received images captured by the wearable camera. Apparatus 110 may be configured to identify face 3521 associated with first individual 3501 or face 3523 associated with second individual 3503 within the environment of user 100. For example, device 110 may be configured to capture one or more images of user 100's surroundings using camera 1730. The captured images may include a representation of a recognized individual (e.g., first individual 3501 or second individual 3503), which may be a friend, colleague, relative, or other previous acquaintance of user 100. Processor 210 (and/or processors 210a and 210b) may be configured to analyze the captured images and detect recognized individuals using various facial recognition techniques. Accordingly, device 110, or memory 550 in particular, may include one or more facial recognition components.
例如,面部识别组件2040可以访问与用户100相关联的数据库或数据,以确定检测到的面部特征是否对应于辨识出的个体。例如,处理器210可以访问数据库2050(例如,远程地、通过网络等),数据库2050包含关于用户100已知的个体的信息和表示相关联的面部特征或其他识别特征的数据。这样的数据可以包括个体的一个或多个图像,或者表示可用于通过面部识别进行的识别的用户面部的数据。在一些实施例中,数据库2050可以由装置110通过先前的面部识别来编译。例如,处理器210可以被配置为将与在由装置110捕捉的图像中识别出的一个或多个面部相关联的数据存储在数据库2050中。每次在图像中检测到面部时,可将检测到的面部特征或其他数据与数据库2050中的先前识别出的面部进行比较。面部识别组件2040可以确定个体是用户100的辨识出的个体、该个体先前是否在超过特定阈值的多个实例中被系统识别出、该个体是否已被明确地介绍给装置110等。For example, facial recognition component 2040 can access a database or data associated with user 100 to determine whether detected facial features correspond to an identified individual. For example, processor 210 may access database 2050 (e.g., remotely, via a network, etc.) that contains information about individuals known to user 100 and data representing associated facial or other identifying features. Such data may include one or more images of an individual, or data representing a user's face that may be used for identification by facial recognition. In some embodiments, database 2050 may be compiled by device 110 through previous facial recognition. For example, processor 210 may be configured to store data associated with one or more faces identified in images captured by device 110 in database 2050. Each time a face is detected in an image, the detected facial features or other data may be compared to previously identified faces in database 2050. Facial recognition component 2040 can determine whether the individual is an identified individual of user 100, whether the individual has been previously identified by the system in instances exceeding a certain threshold, whether the individual has been explicitly introduced to device 110, and the like.
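The database comparison described above can be sketched as a nearest-match search over stored face feature vectors. This is a minimal illustration, not the disclosure's actual facial recognition component 2040; the embedding values, similarity measure, and threshold are all assumptions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def match_face(embedding, database, threshold=0.9):
    """Return the best-matching known identity, or None if no stored
    face is similar enough (an unrecognized bystander).

    `database` maps identity -> stored feature vector, analogous to the
    previously identified faces kept in database 2050.
    """
    best_id, best_score = None, threshold
    for identity, stored in database.items():
        score = cosine_similarity(embedding, stored)
        if score >= best_score:
            best_id, best_score = identity, score
    return best_id

# Hypothetical stored vectors for two known individuals.
db = {"friend_a": [0.9, 0.1, 0.2], "colleague_b": [0.1, 0.8, 0.3]}
probe = [0.88, 0.12, 0.21]     # features extracted from a newly captured face
print(match_face(probe, db))   # nearly parallel to friend_a's stored vector
```

In a real system the vectors would come from a trained face-embedding model; the lookup logic, however, follows the compare-against-previously-identified-faces flow described in the text.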
在一些实施例中,音频信号优先级界面3601可以显示由设备110从用户100的环境捕捉的图像。例如,计算设备120可以包括多个音频信号优先级界面,其中每个音频信号优先级界面包括由装置110从用户100的环境捕捉的图像(例如,个体的或其它声源的)。In some embodiments, audio signal priority interface 3601 may display images captured by device 110 from the environment of user 100. For example, computing device 120 may include a plurality of audio signal priority interfaces, where each audio signal priority interface includes an image (e.g., of an individual or other sound source) captured by apparatus 110 from the environment of user 100.
在步骤3705中,至少一个麦克风可以从用户100的环境捕捉声音。处理器210可以接收表示由至少一个麦克风1750从用户100的环境捕捉的声音的至少一个音频信号3511、3513或3515。In step 3705, at least one microphone may capture sound from the environment of the user 100. The processor 210 may receive at least one audio signal 3511, 3513, or 3515 representing sound captured by the at least one microphone 1750 from the environment of the user 100.
在步骤3707中,处理器210可以基于对至少一个音频信号(例如,音频信号3511、3513或3515)的分析来识别与第一个体3501相关联的第一语音相关联的第一音频信号3511和与第二个体3503相关联的第二语音相关联的第二音频信号3513。可以使用各种方法来确定声音的层次结构。例如,可以通过对两个或更多个声音的比较分析来确定声音的层次结构。在一些实施例中,声源可以包括人、对象(例如,电视、汽车等)、环境(例如,流水、风等)等。例如,处理器210可以使用比较分析来确定来自人的声音相对于来自对象或环境的声音具有优先权。如上所述,可以基于用户输入、默认设置等来确定声音的层次结构。In step 3707, the processor 210 may identify, based on the analysis of at least one audio signal (e.g., audio signal 3511, 3513, or 3515), a first audio signal 3511 associated with a first speech associated with the first individual 3501 and a second audio signal 3513 associated with a second speech associated with the second individual 3503. Various methods can be used to determine the hierarchy of sounds. For example, the hierarchy of sounds can be determined by comparative analysis of two or more sounds. In some embodiments, sound sources may include people, objects (e.g., televisions, cars, etc.), environments (e.g., flowing water, wind, etc.), and the like. For example, the processor 210 may use comparative analysis to determine that sound from a person has priority over sound from an object or the environment. As described above, the hierarchy of sounds may be determined based on user input, default settings, and the like.
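The source-type hierarchy just described (a person outranking an object, which outranks ambient sound) can be sketched as a simple ranking. The numeric priority values are illustrative assumptions; the text says only that the hierarchy may come from user input or default settings:

```python
# Assumed default ranks: people over objects over environmental sounds.
SOURCE_PRIORITY = {"person": 2, "object": 1, "environment": 0}

def rank_sources(sources):
    """Order detected sound sources from highest to lowest priority.

    `sources` is a list of (label, source_type) pairs produced by some
    upstream detection step.
    """
    return sorted(sources, key=lambda s: SOURCE_PRIORITY[s[1]], reverse=True)

detected = [
    ("wind", "environment"),
    ("tv", "object"),
    ("individual_3501", "person"),
]
print(rank_sources(detected)[0][0])  # the person ranks first
```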
在步骤3709中,处理器210可以被配置为基于处理器210确定第一音频信号与高于第二音频信号的优先级的优先级相关联而引起对第一音频信号和第二音频信号(例如,音频信号3511和3513)的选择性调节。In step 3709, the processor 210 may be configured to cause selective conditioning of the first audio signal and the second audio signal (e.g., audio signals 3511 and 3513) based on a determination by the processor 210 that the first audio signal is associated with a priority higher than a priority of the second audio signal.
在一些实施例中,与辨识出的个体相关联的音频信号的选择性调节可以基于用户100的环境中的个体的身份。例如,在图像中检测到多个个体的情况下,处理器210可以如上所述使用一种或多种面部识别技术来识别个体。与用户100已知的个体相关联的音频信号可以被选择性地放大或以其他方式调节以具有相对于未知个体的优先权。例如,处理器210可以被配置为衰减或静音与用户100的环境中的旁观者(诸如嘈杂的办公室同事等)相关联的音频信号。在一些实施例中,处理器210还可以确定个体的层次结构并基于个体的相对状态给予优先权。在一些实施例中,可以基于列表或数据库来确定层次结构。被系统辨识出的个体可以被单独排序或分组为几层优先级。例如,被识别为“密友”或家人的个体可以优先于用户100的熟人。In some embodiments, the selective adjustment of the audio signal associated with the identified individual may be based on the identity of the individual in the environment of the user 100 . For example, where multiple individuals are detected in the image, the processor 210 may identify the individuals using one or more facial recognition techniques as described above. Audio signals associated with individuals known to user 100 may be selectively amplified or otherwise adjusted to have priority over unknown individuals. For example, processor 210 may be configured to attenuate or mute audio signals associated with bystanders in the environment of user 100 (such as noisy office colleagues, etc.). In some embodiments, the processor 210 may also determine a hierarchy of individuals and give priority based on the relative status of the individuals. In some embodiments, the hierarchy may be determined based on a list or database. Individuals identified by the system can be sorted individually or grouped into layers of priority. For example, individuals identified as "close friends" or family members may take precedence over acquaintances of user 100 .
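The tiered treatment of known versus unknown individuals described above can be sketched as a tier-to-gain mapping. The tier names and gain values are illustrative assumptions; the text says only that "close friends" or family may take precedence over acquaintances, and that bystanders may be attenuated or muted:

```python
# Hypothetical gains per identity tier (not values from the disclosure).
TIER_GAIN = {"close_friend": 1.5, "family": 1.5, "acquaintance": 1.0, "unknown": 0.4}

def gain_for(identity_tier):
    """Unrecognized individuals fall back to the bystander gain."""
    return TIER_GAIN.get(identity_tier, TIER_GAIN["unknown"])

def condition(samples, tier):
    """Amplify or attenuate an individual's audio by their tier gain."""
    g = gain_for(tier)
    return [s * g for s in samples]

boosted = condition([0.2, -0.1], "close_friend")   # amplified
muted = condition([0.2, -0.1], "unknown")          # attenuated bystander
```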
在一些实施例中,处理器210可以基于个体与用户100的接近度来选择性地调节与一个或多个个体相关联的音频信号。处理器210可以基于捕捉的图像来确定从用户100到每个个体的距离,并且可以基于该距离来选择性地调节与这些个体相关联的音频信号。例如,离用户100较近的个体可能比离用户100较远的个体优先级更高。类似地,在更靠近用户的视线方向的角度上的个体可以比在离用户的视线方向更大角度上的个体优先级更高。In some embodiments, processor 210 may selectively adjust audio signals associated with one or more individuals based on the individual's proximity to user 100 . The processor 210 may determine the distance from the user 100 to each individual based on the captured images, and may selectively adjust audio signals associated with the individuals based on the distance. For example, individuals closer to user 100 may be given higher priority than individuals further away from user 100 . Similarly, individuals at angles closer to the user's line of sight may be given higher priority than individuals at greater angles from the user's line of sight.
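A combined distance-and-angle priority score along the lines of the paragraph above might look as follows. The equal weighting and the cutoff values are assumptions; the text only states that closer individuals, and individuals at smaller angles from the line of sight, receive higher priority:

```python
def priority_score(distance_m, angle_deg, max_distance=10.0, max_angle=90.0):
    """Score a detected individual: closer to the user and nearer the
    user's line of sight -> higher score in [0, 1].

    The 50/50 weighting of distance and angle is an illustrative choice.
    """
    d = max(0.0, 1.0 - distance_m / max_distance)
    a = max(0.0, 1.0 - abs(angle_deg) / max_angle)
    return 0.5 * d + 0.5 * a

near_centered = priority_score(1.0, 5.0)   # 1 m away, almost dead ahead
far_offaxis = priority_score(8.0, 60.0)    # distant and well off-axis
print(near_centered > far_offaxis)  # True
```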
在一些实施例中,处理器210可以通过基于对多个图像中的至少一个的分析来识别至少一个动作来确定优先级水平。例如,处理器210可以基于唇部移动和检测到的声音来确定用户100的环境中的哪些个体正在说话。例如,处理器210可以跟踪与个体3501或3503相关联的唇部移动,以确定个体3501或3503正在说话。可以在检测到的唇部移动和接收到的音频信号之间进行比较分析。例如,处理器210可以基于在检测到与音频信号3511相关联的声音的同时个体3501的嘴正在运动的确定来确定个体3501正在说话。在一些实施例中,当个体3501的唇部停止运动时,这可对应于与音频信号3511相关联的声音中的静默或减小音量的时段。In some embodiments, the processor 210 may determine the priority level by identifying at least one action based on an analysis of at least one of the plurality of images. For example, processor 210 may determine which individuals in the environment of user 100 are speaking based on lip movements and detected sounds. For example, processor 210 may track lip movements associated with individual 3501 or 3503 to determine that individual 3501 or 3503 is speaking. A comparative analysis can be performed between the detected lip movement and the received audio signal. For example, the processor 210 may determine that the individual 3501 is speaking based on a determination that the individual 3501's mouth is moving while the sound associated with the audio signal 3511 is detected. In some embodiments, when the lips of the individual 3501 cease to move, this may correspond to a period of silence or reduced volume in the sound associated with the audio signal 3511.
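The comparative analysis between detected lip movement and the received audio signal can be sketched as frame-by-frame agreement between two boolean tracks, as a rough stand-in for the correlation the text describes:

```python
def speaking_likelihood(lip_moving, audio_active):
    """Fraction of frames where lip movement and detected sound agree.

    Both inputs are per-frame booleans: `lip_moving[i]` is whether the
    tracked individual's mouth moved in frame i, and `audio_active[i]`
    is whether sound was detected during that frame. High agreement
    suggests the individual is the source of the audio signal; silence
    while the lips are still also counts as agreement.
    """
    matches = sum(1 for l, a in zip(lip_moving, audio_active) if l == a)
    return matches / len(lip_moving)

lips  = [True, True, False, True, False, False]
audio = [True, True, False, True, False, True]
print(speaking_likelihood(lips, audio))  # 5 of 6 frames agree
```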
在一些实施例中,与装置110相关联的数据还可以结合检测到的唇部移动被用于确定和/或验证个体是否正在说话,诸如用户100或个体3501或3503的视线方向、检测到的个体3501或3503的身份、辨识出的个体3501或3503的声纹等。In some embodiments, data associated with device 110 may also be used in conjunction with detected lip movements to determine and/or verify whether an individual is speaking, such as the gaze direction of user 100 or of individual 3501 or 3503, the detected identity of individual 3501 or 3503, a recognized voiceprint of individual 3501 or 3503, and the like.
在一些实施例中,可以确定用户100或个体3501或3503的视线方向,并且可以在活跃说话者中给予视线方向所指向的个体更高的优先级。例如,如果个体3503正在看着个体3501,则与个体3501相关联的音频信号3511可以被选择性地调节。在一些实施例中,可以基于用户100的环境中其他个体的相对行为来分配优先级。例如,如果个体3501和个体3503都在说话,并且看着个体3501的另外个体比看着个体3503的更多,则与个体3501相关联的音频信号3511可以优先于与个体3503相关联的音频信号被选择性地调节。在确定个体的身份的实施例中,如前面讨论的,可以基于说话者的相对状态来分配优先级。In some embodiments, the gaze direction of the user 100 or of individual 3501 or 3503 may be determined, and the individual toward whom the gaze direction points may be given higher priority among active speakers. For example, if individual 3503 is looking at individual 3501, the audio signal 3511 associated with individual 3501 may be selectively conditioned. In some embodiments, priorities may be assigned based on the relative behavior of other individuals in the environment of user 100. For example, if both individual 3501 and individual 3503 are speaking, and more other individuals are looking at individual 3501 than at individual 3503, the audio signal 3511 associated with individual 3501 may be selectively conditioned in preference to the audio signal associated with individual 3503. In embodiments where the identity of an individual is determined, as previously discussed, priorities may be assigned based on the relative status of the speakers.
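The gaze-counting rule above (more observers looking at a speaker means higher priority) can be sketched as follows; reducing each observer's gaze direction to a single named target is a simplification of the gaze-direction analysis the text describes:

```python
def rank_by_gaze(speakers, gaze_targets):
    """Rank active speakers by how many others are looking at them.

    `gaze_targets` maps each observer to the speaker their gaze points
    at (or to None / an unrelated target).
    """
    counts = {s: 0 for s in speakers}
    for target in gaze_targets.values():
        if target in counts:
            counts[target] += 1
    return sorted(speakers, key=lambda s: counts[s], reverse=True)

order = rank_by_gaze(
    ["individual_3501", "individual_3503"],
    {"user_100": "individual_3501", "individual_3503": "individual_3501"},
)
print(order[0])  # individual_3501 draws more gazes, so it ranks first
```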
在一些实施例中,处理器210可以基于对多个捕捉图像中的至少一个的分析来识别至少一个动作。例如,处理器210可以基于用户100指向一个或多个声源(例如,个体)来确定声源的层次结构。在一些实施例中,用户100可以按照从最高优先级到最低优先级的顺序指向每个声源的至少一个声源。基于用户100的动作,计算设备120的音频信号优先级界面可以显示与每个声源相关联的声音的层次结构。处理器210可以基于声源的层次结构来选择性地调节与每个声源相关联的音频信号。例如,在用户100的环境中,相对于较低优先级音频信号,更高优先级音频信号可以被隔离和/或选择性地放大。在一些实施例中,低优先级音频信号可以被抑制、衰减、滤波、不变等。In some embodiments, the processor 210 may identify at least one action based on analysis of at least one of the plurality of captured images. For example, the processor 210 may determine a hierarchy of sound sources based on the user 100 pointing at one or more sound sources (e.g., individuals). In some embodiments, the user 100 may point at each sound source in order from highest priority to lowest priority. Based on the actions of the user 100, the audio signal priority interface of the computing device 120 may display a hierarchy of sounds associated with each sound source. The processor 210 may selectively condition the audio signal associated with each sound source based on the hierarchy of sound sources. For example, higher priority audio signals in the environment of user 100 may be isolated and/or selectively amplified relative to lower priority audio signals. In some embodiments, low priority audio signals may be suppressed, attenuated, filtered, left unchanged, and the like.
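Applying such a hierarchy to the captured signals can be sketched as a per-source gain stage: the top-ranked source is boosted and the rest are attenuated. The boost/cut factors are placeholder assumptions, and real conditioning would operate on streamed audio rather than short sample lists:

```python
def condition_by_hierarchy(signals, hierarchy, boost=1.5, cut=0.3):
    """Amplify the highest-priority signal and attenuate the rest.

    `signals` maps source name -> list of samples; `hierarchy` lists
    sources from highest to lowest priority (e.g., the order in which
    the user pointed at them).
    """
    top = hierarchy[0]
    out = {}
    for src, samples in signals.items():
        g = boost if src == top else cut
        out[src] = [s * g for s in samples]
    return out

mixed = condition_by_hierarchy(
    {"voice": [0.2, -0.2], "traffic": [0.1, 0.1]},
    hierarchy=["voice", "traffic"],
)
```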
在步骤3711中,处理器210可以被配置为使经选择性调节的第一音频信号传输到听觉接口设备1710,听觉接口设备1710被配置为向用户100的耳朵提供声音。因此,在经调节的音频信号中,来自更高优先级源的声音可以更容易被用户100听到,比来自较低优先级源的声音更响亮和/或更容易区分,来自较低优先级源的声音可以表示环境内的背景噪声。In step 3711, the processor 210 may be configured to cause transmission of the selectively conditioned first audio signal to the auditory interface device 1710, which is configured to provide sound to the ear of the user 100. Thus, in the conditioned audio signal, sounds from higher priority sources may be more easily heard by the user 100, being louder and/or more distinguishable than sounds from lower priority sources, which may represent background noise within the environment.
助听器和配对的相机系统Hearing aids and paired camera systems
根据本公开的实施例,助听器系统可以选择性地放大声音。助听器系统可以包括可穿戴相机设备和助听器设备。可穿戴相机设备可以指具有图像、声音、音频和/或视频捕捉能力的设备,可穿戴相机设备可以附接到用户或用户的衣服或配件。助听器设备可以指将音频或声音输出到用户耳朵的设备。助听器设备的输出可以基于从可穿戴相机设备接收的输入或由可穿戴相机设备接收的输入来生成。助听器系统可以包括配对在一起以提供改进的功能的几个设备。例如,助听器系统可以包括用于向用户的耳朵提供声音的助听器设备,其中声音可以由助听器声学地捕捉或从诸如可穿戴相机设备的另一个源电子地接收。助听器设备可以与可穿戴相机设备配对(反之亦然),并且可穿戴相机设备可以捕捉图像和/或音频。可穿戴相机设备可以根据从助听器设备接收的指令来捕捉图像和/或音频。作为响应,助听器设备可以从可穿戴相机设备接收音频和其他信息。该系统还可以包括与可穿戴相机设备和助听器设备配对的移动设备。例如,在移动设备的显示器上显示的GUI可以使用户能够提供输入以控制声音如何在助听器设备处被处理和接收。助听器系统还可以包括诸如测距仪的其他设备的配对。According to embodiments of the present disclosure, the hearing aid system may selectively amplify sound. Hearing aid systems may include wearable camera devices and hearing aid devices. A wearable camera device may refer to a device with image, sound, audio and/or video capture capabilities that may be attached to a user or to a user's clothing or accessories. A hearing aid device may refer to a device that outputs audio or sound to a user's ear. The output of the hearing aid device may be generated based on input received from or by the wearable camera device. A hearing aid system may include several devices that are paired together to provide improved functionality. For example, a hearing aid system may include a hearing aid device for providing sound to a user's ear, where the sound may be acoustically captured by the hearing aid or received electronically from another source such as a wearable camera device. The hearing aid device can be paired with the wearable camera device (and vice versa), and the wearable camera device can capture images and/or audio. The wearable camera device may capture images and/or audio according to instructions received from the hearing aid device. In response, the hearing aid device can receive audio and other information from the wearable camera device. The system may also include a mobile device paired with the wearable camera device and the hearing aid device. 
For example, a GUI displayed on a display of a mobile device may enable a user to provide input to control how sound is processed and received at the hearing aid device. The hearing aid system may also include pairing of other devices such as rangefinders.
如上所讨论的,在一些实施例中,助听器系统可以包括移动设备。移动设备可以是移动电话,或者诸如PDA、平板电脑、可穿戴电子设备和其他类型的便携式电子设备的设备的其他示例。移动设备可以与助听器设备或可穿戴相机设备中的至少一个或两者配对。配对可以是指使两个或多个设备之间能够进行通信,并且移动设备可以与助听器设备或可穿戴相机设备无线地配对。无线配对的示例包括Wi-Fi、蓝牙(Bluetooth)、NFC和其他类似的无线通信技术。As discussed above, in some embodiments, the hearing aid system may include a mobile device. The mobile device may be a mobile phone, or other examples of devices such as PDAs, tablet computers, wearable electronic devices, and other types of portable electronic devices. The mobile device may be paired with at least one or both of a hearing aid device or a wearable camera device. Pairing can refer to enabling communication between two or more devices, and a mobile device can be wirelessly paired with a hearing aid device or a wearable camera device. Examples of wireless pairing include Wi-Fi, Bluetooth, NFC, and other similar wireless communication technologies.
在一些实施例中,助听器系统可以包括测距仪。测距仪可以指能够确定对象与其自身之间的范围(或距离)的设备。在一些实施例中,测距仪可以无线地配对到可穿戴相机设备、助听器设备和/或移动设备,这些配对设备中的一个或多个接收由测距仪生成的范围测量。在一些实施例中,测距仪可以合并到可穿戴相机设备中。In some embodiments, the hearing aid system may include a rangefinder. A rangefinder may refer to a device capable of determining the range (or distance) between an object and itself. In some embodiments, the rangefinder may be wirelessly paired to a wearable camera device, hearing aid device, and/or mobile device, one or more of these paired devices receiving range measurements generated by the rangefinder. In some embodiments, the rangefinder may be incorporated into the wearable camera device.
在一些实施例中,可穿戴相机设备可以包括配置为从用户的环境捕捉多个图像的至少一个相机。该相机可以包括一个或多个图像传感器(诸如上面讨论的图像传感器220),用于从可穿戴相机设备的用户的环境捕捉一个或多个图像(和/或视频)。在一些实施例中,可穿戴相机设备包括被配置为从用户的环境捕捉声音的至少一个麦克风。至少一个麦克风可以指能够接收声波并基于所接收的声波生成音频信号的组件或设备。在一些实施例中,助听器设备可以包括至少一个扬声器,该扬声器被配置为向用户的耳朵提供声音。至少一个扬声器可以指通常基于音频信号能够生成声音的组件或设备。In some embodiments, the wearable camera device may include at least one camera configured to capture multiple images from the user's environment. The camera may include one or more image sensors (such as image sensor 220 discussed above) for capturing one or more images (and/or video) from the environment of the user of the wearable camera device. In some embodiments, the wearable camera device includes at least one microphone configured to capture sound from the user's environment. The at least one microphone may refer to a component or device capable of receiving sound waves and generating audio signals based on the received sound waves. In some embodiments, the hearing aid device may include at least one speaker configured to provide sound to a user's ear. At least one speaker may refer to a component or device capable of generating sound, typically based on an audio signal.
举例而言,图38示出了包括可穿戴相机设备和助听器设备的助听器系统3800。助听器系统3800包括助听器设备3802和可穿戴相机设备3804。在一些实施例中,助听器设备3802和可穿戴相机设备3804配对以与移动设备3806进行通信。在一些实施例中,助听器设备3802可以与可穿戴相机设备3804配对,并且数据可以在配对的设备之间通信。在一些实施例中,可穿戴相机设备3804可以对应于图4A-图4K中示出的装置110。在一些替代实施例中,可穿戴相机设备3804可以对应于图3A-图3B中示出的装置110。在一些实施例中,助听器设备3802可以对应于听觉接口设备1710。For example, Figure 38 shows a hearing aid system 3800 including a wearable camera device and a hearing aid device. The hearing aid system 3800 includes a hearing aid device 3802 and a wearable camera device 3804 . In some embodiments, the hearing aid device 3802 and the wearable camera device 3804 are paired to communicate with the mobile device 3806. In some embodiments, the hearing aid device 3802 can be paired with the wearable camera device 3804 and data can be communicated between the paired devices. In some embodiments, the wearable camera device 3804 may correspond to the apparatus 110 shown in FIGS. 4A-4K. In some alternative embodiments, the wearable camera device 3804 may correspond to the apparatus 110 shown in Figures 3A-3B. In some embodiments, hearing aid device 3802 may correspond to auditory interface device 1710 .
如图38所示,可穿戴相机设备3804包括相机3804A。相机3804A可以被放置在可穿戴相机设备3804上,以便面朝要成像的对象或人的方向。相机3804A可以包括用于从用户的视场捕捉实时图像数据的图像传感器(例如,图像传感器220)。相机3804A可以是能够检测近红外、红外、可见光、紫外光谱或其任何组合中的光信号并将其转换为电信号的设备。电信号可以用于基于检测到的信号来形成图像或视频流(即图像数据)。术语“图像数据”包括从近红外、红外、可见光、紫外光谱或其任何组合中的光信号中检索出的任何形式的数据。图像传感器的示例可以包括半导体电荷耦合器件(CCD)、互补金属氧化物半导体(CMOS)中的有源像素传感器或N型金属氧化物半导体(NMOS,活跃MOS)。在一些实施例中,相机3804A可以具有测距特征。例如,相机3804A可以确定图像3805中的一个或多个对象的范围测量3807。As shown in Figure 38, the wearable camera device 3804 includes a camera 3804A. The camera 3804A can be placed on the wearable camera device 3804 so as to face in the direction of the object or person to be imaged. Camera 3804A may include an image sensor (eg, image sensor 220) for capturing real-time image data from a user's field of view. Camera 3804A may be a device capable of detecting and converting light signals into electrical signals in the near infrared, infrared, visible, ultraviolet spectrum, or any combination thereof. The electrical signals may be used to form images or video streams (ie, image data) based on the detected signals. The term "image data" includes any form of data retrieved from optical signals in the near infrared, infrared, visible, ultraviolet spectrum, or any combination thereof. Examples of image sensors may include semiconductor charge coupled devices (CCDs), active pixel sensors in complementary metal oxide semiconductors (CMOS), or N-type metal oxide semiconductors (NMOS, active MOS). In some embodiments, camera 3804A may have ranging features. For example, camera 3804A may determine range measurements 3807 for one or more objects in image 3805.
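One common way a single camera can produce a range measurement like 3807 is the pinhole-camera relation, range = focal length x real height / pixel height. The text does not specify which ranging method camera 3804A uses, so this is only an illustrative sketch with assumed numbers:

```python
def estimate_range(focal_px, real_height_m, pixel_height):
    """Pinhole-camera distance estimate: range = f * H / h.

    focal_px      -- focal length expressed in pixels
    real_height_m -- assumed real-world height of the object (metres)
    pixel_height  -- height of the object in the captured image (pixels)
    """
    return focal_px * real_height_m / pixel_height

# A person assumed to be 1.7 m tall, spanning 340 px in an image taken
# with an 800 px focal length:
print(estimate_range(800, 1.7, 340))  # 4.0 metres
```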
作为示例,图39示出了使用助听器系统3800的用户的示例。根据所公开的实施例,用户100可以佩戴可穿戴相机设备3804和助听器设备3802。如图所示,用户100可以穿戴物理连接到用户100的衬衫或其他衣物的可穿戴相机设备3804。与所公开的实施例一致,可穿戴相机设备3804可以被放置在诸如连接到项链、腰带、眼镜、腕带、纽扣、帽子等的其他位置。可穿戴相机设备3804可以与助听器设备3802和/或与移动设备3806无线配对。As an example, FIG. 39 shows an example of a user using a hearing aid system 3800. User 100 may wear wearable camera device 3804 and hearing aid device 3802 in accordance with the disclosed embodiments. As shown, user 100 may wear wearable camera device 3804 that is physically connected to user 100's shirt or other clothing. Consistent with the disclosed embodiments, the wearable camera device 3804 may be placed in other locations such as attached to necklaces, belts, glasses, wristbands, buttons, hats, and the like. Wearable camera device 3804 can be wirelessly paired with hearing aid device 3802 and/or with mobile device 3806.
如图39所示,助听器设备3802可以被放置在用户100的一个或两个耳朵中,类似于传统的听觉接口设备。助听器设备3802可以是各种样式的,包括耳道内、完全耳道内、耳内、耳后、耳上、耳道内接收器、开放安装或各种其他样式。助听器设备3802可以包括用于向用户100提供听觉反馈的一个或多个扬声器。在一些实施例中,助听器设备3802可以是如图17A所示的听觉接口设备1710。As shown in Figure 39, a hearing aid device 3802 may be placed in one or both ears of the user 100, similar to a conventional auditory interface device. The hearing aid device 3802 can be of various styles, including in-canal, fully in-canal, in-ear, behind-the-ear, supra-ear, in-canal receiver, open mount, or various other styles. Hearing aid device 3802 may include one or more speakers for providing auditory feedback to user 100 . In some embodiments, hearing aid device 3802 may be auditory interface device 1710 as shown in Figure 17A.
在一些实施例中,如图17A所示,助听器设备3802可以包括骨传导耳机1711。骨传导耳机1711可以通过外科手术植入,并且可以通过声音振动到内耳的骨传导来向用户100提供可听反馈。In some embodiments, as shown in FIG. 17A , the hearing aid device 3802 may include a bone conduction earphone 1711. Bone conduction earphones 1711 may be surgically implanted and may provide audible feedback to user 100 through bone conduction of sound vibrations to the inner ear.
用户的环境一般是指正在使用可穿戴相机设备3804的用户的环境。用户的环境可以包括对象和人,其中一些对象和人可以产生由位于可穿戴相机设备3804中的一个或多个麦克风(未示出)接收的声波3803。取决于相机3804A的方向和视场,相机3804A可以捕捉图像3805,图像3805表示由相机3804A沿着光轴3903看到的用户环境中的对象和人。在一些实施例中,声波3803可以由在图像3805中捕捉的对象或人生成。在一些实施例中,图像3805可以包括用户100的下巴的表示,其可以用于确定用户视线方向3901,该方向可以与用户100的视场重合。The user's environment generally refers to the environment of the user who is using the wearable camera device 3804 . The user's environment may include objects and people, some of which may generate sound waves 3803 that are received by one or more microphones (not shown) located in the wearable camera device 3804. Depending on the orientation and field of view of the camera 3804A, the camera 3804A may capture an image 3805 representing objects and people in the user's environment as seen by the camera 3804A along the optical axis 3903. In some embodiments, sound waves 3803 may be generated by objects or people captured in image 3805 . In some embodiments, image 3805 may include a representation of the chin of user 100 , which may be used to determine user's gaze direction 3901 , which may coincide with the user's 100 field of view.
相机设备3804可以包括至少一个处理器(在本公开中可称为至少一个第一处理器)。术语“处理器”包括具有对输入执行逻辑操作的电路的任何物理设备。例如,处理设备可以包括一个或多个集成电路、微芯片、微控制器、微处理器、中央处理单元(CPU)、图形处理单元(GPU)、数字信号处理器(DSP)、现场可编程门阵列(FPGA)的全部或部分、或适于执行指令或执行逻辑操作的其他电路。在一些实施例中,图5A-图5C中示出的处理器210可以是至少一个第一处理器的示例。Camera device 3804 may include at least one processor (which may be referred to in this disclosure as at least one first processor). The term "processor" includes any physical device having circuitry that performs logical operations on inputs. For example, a processing device may include all or part of one or more integrated circuits, microchips, microcontrollers, microprocessors, central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), or other circuitry suitable for executing instructions or performing logical operations. In some embodiments, the processor 210 shown in FIGS. 5A-5C may be an example of the at least one first processor.
在一些实施例中,至少一个第一处理器被编程为选择性地调节从至少一个麦克风接收的表示由该至少一个麦克风捕捉的声音的音频信号。调节可以指音频信号的编辑、改变或以其他方式处理的操作。在一些实施例中,处理器210可以基于从助听器设备3802接收的指令来调节声波3803。In some embodiments, the at least one first processor is programmed to selectively condition an audio signal received from at least one microphone representing sound captured by the at least one microphone. Conditioning may refer to editing, changing, or otherwise processing an audio signal. In some embodiments, the processor 210 may condition the sound waves 3803 based on instructions received from the hearing aid device 3802.
在一些实施例中,助听器设备可以包括至少一个第二处理器。在一些实施例中,至少一个第二处理器被编程为引起向可穿戴相机设备3804的一个或多个指令的传输。一个或多个指令可以通过助听器设备3802与可穿戴相机设备3804之间的配对所创建的信道来无线传输。In some embodiments, the hearing aid device may include at least one second processor. In some embodiments, the at least one second processor is programmed to cause the transmission of one or more instructions to the wearable camera device 3804. One or more instructions may be transmitted wirelessly through the channel created by the pairing between the hearing aid device 3802 and the wearable camera device 3804.
在一些实施例中,移动设备可以包括用于向用户提供输出的用户界面。用户界面是指能够与用户交互(诸如为显示器提供输出信息或从用户接收输入)的系统或设备。在一些实施例中,用户界面可以由移动设备3806提供。例如,移动设备3806可以包括显示器,用于显示用户界面3806A以允许用户100与助听器系统3800交互。在一些实施例中,用户界面3806A可以包括用于从用户100接收可视、音频、触觉或任何其他合适的信号或输入的界面。例如,用户界面3806A可以包括可以是移动设备3806的一部分的显示器,诸如具有可由用户手势(例如,触摸屏上的触摸手势)或由适当的物理或虚拟(即,屏幕上)设备(例如,键盘、鼠标等)操纵的GUI元素的触摸屏。在一些实施例中,用户接口3806A可以是能够接收用于调整系统3800的一个或多个参数的用户100音频输入(例如,用户100语音输入)的音频接口。例如,用户100可以经由音频命令经由用户界面3806A提供输入。音频接口可以由移动设备3806提供,诸如与移动设备3806相关联的麦克风。In some embodiments, a mobile device may include a user interface for providing output to a user. A user interface refers to a system or device capable of interacting with a user, such as providing output information to a display or receiving input from a user. In some embodiments, the user interface may be provided by the mobile device 3806. For example, mobile device 3806 may include a display for displaying user interface 3806A to allow user 100 to interact with hearing aid system 3800. In some embodiments, user interface 3806A may include an interface for receiving visual, audio, haptic, or any other suitable signals or input from user 100. For example, user interface 3806A may include a display that may be part of the mobile device 3806, such as a touch screen with GUI elements that may be manipulated by user gestures (e.g., touch gestures on the touch screen) or by a suitable physical or virtual (i.e., on-screen) device (e.g., keyboard, mouse, etc.). In some embodiments, user interface 3806A may be an audio interface capable of receiving audio input from user 100 (e.g., voice input from user 100) for adjusting one or more parameters of system 3800. For example, user 100 may provide input via audio commands through user interface 3806A. The audio interface may be provided by the mobile device 3806, such as a microphone associated with the mobile device 3806.
在一些实施例中,用户界面3806A还可以呈现与助听器系统3800有关的信息,诸如与助听器设备3802和/或可穿戴相机设备3804的操作有关的信息。这样的信息可以用于通知用户100助听器系统3800的状态,使得用户100可以控制助听器设备3802和/或可穿戴相机设备3804的参数。In some embodiments, user interface 3806A may also present information related to hearing aid system 3800 , such as information related to the operation of hearing aid device 3802 and/or wearable camera device 3804 . Such information can be used to inform the user 100 of the status of the hearing aid system 3800 so that the user 100 can control parameters of the hearing aid device 3802 and/or the wearable camera device 3804.
在一些实施例中,与助听器设备3802的操作有关的信息可以包括与移动设备3806、可穿戴相机3804和/或助听器设备3802之间的配对有关的状态信息。在一些实施例中,用户100可以通过用户界面3806A发起或终止移动设备3806、可穿戴相机3804和/或助听器设备3802之间的配对。在一些实施例中,与助听器系统3800的操作有关的信息可以包括助听器设备3802的操作状态信息(诸如电池水平和/或音量水平),并且用户100可以控制助听器设备3802的参数(诸如调整助听器设备3802的一个或多个扬声器的音量水平)。在一些实施例中,与可穿戴相机设备3804的操作有关的信息可以包括可穿戴相机设备3804的操作状态(诸如电池水平)、与已由可穿戴相机设备3804捕捉的图像有关的信息和/或音频调节操作。In some embodiments, the information related to the operation of the hearing aid device 3802 may include status information related to the pairing between the mobile device 3806 , the wearable camera 3804 , and/or the hearing aid device 3802 . In some embodiments, user 100 may initiate or terminate pairing between mobile device 3806, wearable camera 3804, and/or hearing aid device 3802 through user interface 3806A. In some embodiments, information related to the operation of the hearing aid system 3800 may include operational status information of the hearing aid device 3802 (such as battery level and/or volume level), and the user 100 may control parameters of the hearing aid device 3802 (such as adjusting the hearing aid device) the volume level of one or more speakers of the 3802). In some embodiments, the information related to the operation of the wearable camera device 3804 may include the operational status of the wearable camera device 3804 (such as battery levels), information related to images that have been captured by the wearable camera device 3804 and/or Audio adjustment operation.
在一些实施例中,可穿戴相机设备3804捕捉的图像(诸如图像3805)可以经由用户界面3806A实时地显示给用户100。图像3805可以以使得用户100可以以各种方式操纵图像3805的方式呈现。例如,用户100可以放大或缩小以最大化或最小化图像3805、裁剪图像3805、编辑图像3805和/或将图像3805保存在存储器中,以及本领域已知的其他图像操纵技术。在一些实施例中,由相机3804A确定的范围测量3807可以经由用户界面3806A呈现给用户100。范围测量可以与图像3805中表示的对象或人相关联。In some embodiments, images captured by wearable camera device 3804, such as image 3805, may be displayed to user 100 in real-time via user interface 3806A. Image 3805 may be presented in a manner that enables user 100 to manipulate image 3805 in various ways. For example, user 100 may zoom in or out to maximize or minimize image 3805, crop image 3805, edit image 3805, and/or save image 3805 in memory, as well as other image manipulation techniques known in the art. In some embodiments, range measurements 3807 determined by camera 3804A may be presented to user 100 via user interface 3806A. Range measurements may be associated with objects or people represented in image 3805.
在一些实施例中,一个或多个音频调节操作的状态可以经由用户界面3806A呈现给用户100。例如,可以向用户100呈现任何一个或多个正在进行的音频调节设置。在一些实施例中,用户界面3806A可以向用户100显示选项,以取消正在进行的音频调节操作,和/或通过修改一个或多个音频调节设置来选择不同的音频调节操作。In some embodiments, the status of one or more audio adjustment operations may be presented to user 100 via user interface 3806A. For example, the user 100 may be presented with any one or more ongoing audio adjustment settings. In some embodiments, user interface 3806A may display options to user 100 to cancel an audio adjustment operation in progress, and/or to select a different audio adjustment operation by modifying one or more audio adjustment settings.
用户界面3806A还可以能够接收来自用户100的输入。例如,基于显示器上显示的输出,用户100可能希望改变助听器系统3800的一个或多个操作。来自用户100的输入可以被处理成用于助听器系统3800的指令。在一些实施例中,至少一个第二处理器可以被配置为基于来自用户的输入来确定一个或多个指令。在一些实施例中,移动设备3806可以包括用于接收来自用户的输入的用户界面。经由用户界面3806A接收的用户输入可以从移动设备3806无线发送到助听器设备3802,在该助听器设备处第二处理器(例如,处理器210)可以将用户输入转换为用于控制助听器系统3800的一个或多个操作的指令。User interface 3806A may also be capable of receiving input from user 100. For example, based on the output displayed on the display, the user 100 may wish to change one or more operations of the hearing aid system 3800. Input from user 100 may be processed into instructions for the hearing aid system 3800. In some embodiments, the at least one second processor may be configured to determine one or more instructions based on input from the user. In some embodiments, the mobile device 3806 may include a user interface for receiving input from the user. User input received via user interface 3806A may be wirelessly transmitted from mobile device 3806 to hearing aid device 3802, where a second processor (e.g., processor 210) may convert the user input into instructions for controlling one or more operations of the hearing aid system 3800.
在一些实施例中,可穿戴相机设备3804可被配置为基于一个或多个指令来捕捉多个图像和声音。用户100可以通过用户界面3806A手动改变可穿戴相机设备3804如何捕捉图像。例如,用户100可能希望聚焦在人或对象上,因此用户输入可以得到用于缩小相机3804A的视场和/或放大一个或多个捕捉图像的指令。在另一示例中,由用户界面3806A显示的图像3805可能失焦或模糊,并且用户100可能希望改变相机3804A的焦点以改善图像质量,从而得到对相机3804A重新对焦的指令。在又一示例中,用户100可能希望改变图像3805的照明条件以补偿低/高光条件,并且因此用户输入可以相应得到改变相机3804A照明条件的指令。In some embodiments, the wearable camera device 3804 may be configured to capture multiple images and sounds based on one or more instructions. User 100 can manually change how wearable camera device 3804 captures images through user interface 3806A. For example, user 100 may wish to focus on a person or object, so user input may result in instructions for reducing the field of view of camera 3804A and/or zooming in on one or more captured images. In another example, the image 3805 displayed by the user interface 3806A may be out of focus or blurred, and the user 100 may wish to change the focus of the camera 3804A to improve image quality, thereby instructing the camera 3804A to refocus. In yet another example, the user 100 may wish to change the lighting conditions of the image 3805 to compensate for low/high light conditions, and thus the user input may result in instructions to change the lighting conditions of the camera 3804A accordingly.
在一些实施例中,至少一个第一处理器可以被编程为基于一个或多个指令来选择性地调节音频信号。例如,可穿戴相机设备3804的至少一个第一处理器(例如,处理器210)可以编辑、改变或以其他方式处理从用户的环境捕捉的声波3803。可将经调节的音频信号提供给助听器设备3802,从而可以生成声音3801以供用户100听到。声音3801可以基于由可穿戴相机设备3804输出的经调节的音频信号来生成。In some embodiments, the at least one first processor may be programmed to selectively condition the audio signal based on the one or more instructions. For example, at least one first processor (e.g., processor 210) of the wearable camera device 3804 may edit, alter, or otherwise process the sound waves 3803 captured from the user's environment. The conditioned audio signal may be provided to the hearing aid device 3802 so that a sound 3801 may be generated for the user 100 to hear. Sound 3801 may be generated based on the conditioned audio signal output by wearable camera device 3804.
在一些实施例中,选择性地调节音频信号可以包括修改声波3803的振幅、音调、音高、低音和/或其他音频效果。例如,可以在用户界面3806A上向用户100呈现类似菜单的界面(例如音频混合滑块),并且用户100可以根据需要选择一个或多个音频效果。在一些实施例中,用户100可以改变对应于声波3803的一个或多个音频信号的音调,以使声音对用户100更易感知。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以包括调整声音3803的音高。例如,用户100可能经历10kHz以上的频率中的听觉损失,并且第一处理器(例如处理器210)可以将更高的频率(例如,在15kHz处)重新映射到低于10kHz的频率。在一些实施例中,第一处理器(处理器210)可以被配置为改变与一个或多个音频信号相关联的语速。例如,第一处理器(例如,处理器210)可以被配置为例如通过使每个词语持续时间更长并相应地减少连续词语之间的静默时段来改变经调节的音频信号中的个体的语速,以使检测到的语音对用户100更易感知。In some embodiments, selectively conditioning the audio signal may include modifying the amplitude, tone, pitch, bass, and/or other audio effects of the sound waves 3803. For example, user 100 may be presented with a menu-like interface (e.g., an audio mix slider) on user interface 3806A, and user 100 may select one or more audio effects as desired. In some embodiments, user 100 may alter the tone of one or more audio signals corresponding to sound waves 3803 to make the sound more perceptible to user 100. For example, the user 100 may have less sensitivity to tones within a certain range, and the conditioning of the audio signal may include adjusting the pitch of the sound 3803. For example, user 100 may experience hearing loss in frequencies above 10 kHz, and a first processor (e.g., processor 210) may remap higher frequencies (e.g., at 15 kHz) to frequencies below 10 kHz. In some embodiments, the first processor (processor 210) may be configured to change the speech rate associated with the one or more audio signals. For example, the first processor (e.g., processor 210) may be configured to change the speech rate of an individual in the conditioned audio signal, for example by making each word last longer and correspondingly reducing the silence periods between consecutive words, so that the detected speech is more perceptible to the user 100.
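As a concrete illustration of the frequency remapping mentioned above, the sketch below linearly compresses spectral components above a hearing-loss cutoff into the band beneath it, as in the example of moving 15 kHz content below 10 kHz. The linear compression scheme and the band edges are assumptions made for the sketch; real frequency-lowering algorithms in hearing aids are considerably more sophisticated.

```python
# Sketch of remapping frequencies above a 10 kHz hearing-loss cutoff into the
# band below it. The linear compression of (cutoff, max] into
# [cutoff/2, cutoff] is an illustrative assumption, not the disclosed method.

def remap_frequency(f_hz: float, cutoff_hz: float = 10_000.0,
                    max_hz: float = 20_000.0) -> float:
    """Map frequencies above cutoff_hz into the band [cutoff_hz/2, cutoff_hz]."""
    if f_hz <= cutoff_hz:
        return f_hz  # already in the audible band, leave unchanged
    # Fraction of the way from the cutoff to the top of the input range.
    frac = (f_hz - cutoff_hz) / (max_hz - cutoff_hz)
    return cutoff_hz / 2 + frac * (cutoff_hz / 2)

print(remap_frequency(15_000.0))  # 7500.0, i.e. below the 10 kHz cutoff
```

In practice such a mapping would be applied per spectral bin after a short-time Fourier transform, not to isolated frequency values.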
在一些实施例中,选择性地调节音频信号可以包括将声音分类为不同类别的声音。例如,第一处理器(例如,处理器210)可以将声波3803分类为包含音乐、音调、笑声、讲话、尖叫、背景噪声等的片段。各个片段的指示可以记录在数据库中,并且可以证明对于生活记录应用非常有用。作为一个示例,所记录的信息可以使助听器系统3800能够检索和/或确定当用户遇到另一个体时的心情。另外,这样的处理可以相对快速和有效地发生,并且可以不使用大量的计算资源。因此,将信息发送到目的地(例如,另一设备、外部服务器等)可能不需要很大的带宽。此外,一旦音频的某些部分被分类为非语音,更多的计算资源可用于处理其他片段。在一些实施例中,用户100可以经由用户界面3806A提供输入,以将上面讨论的不同音频效果应用于声波3803的不同片段。In some embodiments, selectively conditioning the audio signal may include classifying the sounds into different categories of sounds. For example, a first processor (e.g., processor 210) may classify sound waves 3803 into segments containing music, tones, laughter, speech, screams, background noise, and the like. Indications of the individual segments can be recorded in a database and can prove very useful for life-logging applications. As one example, the recorded information may enable the hearing aid system 3800 to retrieve and/or determine the user's mood when encountering another individual. Additionally, such processing may occur relatively quickly and efficiently, and may not use significant computing resources. Therefore, sending information to a destination (e.g., another device, an external server, etc.) may not require significant bandwidth. In addition, once certain portions of the audio are classified as non-speech, more computing resources are available to process other segments. In some embodiments, user 100 may provide input via user interface 3806A to apply the different audio effects discussed above to different segments of sound waves 3803.
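A toy sketch of the per-segment classification described above is shown below, using frame energy and zero-crossing rate as stand-in features. The feature set, the thresholds, and the three coarse categories are all assumptions made to keep the illustration short; a production system would likely use a trained classifier over many more categories.

```python
import math

# Toy classifier standing in for the segment classification described above.
# Energy separates silence from sound; zero-crossing rate roughly separates
# noise-like frames from tonal ones. All thresholds are illustrative
# assumptions, not values from the disclosure.

def classify_frame(samples: list) -> str:
    energy = sum(s * s for s in samples) / len(samples)
    crossings = sum(1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0))
    zcr = crossings / len(samples)
    if energy < 1e-4:
        return "silence"
    if zcr > 0.3:
        return "noise-like"  # hiss, broadband background noise
    return "tonal"           # speech vowels, music, tones

# A low-frequency sine frame classifies as tonal; an all-zero frame as silence.
tone = [math.sin(2 * math.pi * 5 * i / 400) for i in range(400)]
print(classify_frame(tone), classify_frame([0.0] * 400))  # tonal silence
```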
在一些实施例中,选择性地调节音频信号可以包括音频信号的衰减。例如,第一处理器(例如,处理器210)可以基于来自用户100的输入将一个或多个滤波器(例如数字滤波器)应用于声波3803。在一些情况下,滤波器可以选择性地衰减音频信号(诸如声波3803)。在一些情况下,声波3803可以包括环境噪声(例如,各种背景声音,诸如音乐、来自未参与与用户100的对话的人的声音/噪声等)。用户100可以选择各种滤波选项,从而可以从经调节的音频信号中消除或衰减环境噪声。例如,用户100可能希望在具有高背景噪声水平的环境中衰减声波3803。In some embodiments, selectively conditioning the audio signal may include attenuation of the audio signal. For example, a first processor (e.g., processor 210) may apply one or more filters (e.g., digital filters) to sound waves 3803 based on input from user 100. In some cases, the filter can selectively attenuate audio signals (such as sound waves 3803). In some cases, the sound waves 3803 may include ambient noise (e.g., various background sounds such as music, sounds/noise from people not engaged in the conversation with the user 100, etc.). The user 100 can select various filtering options so that ambient noise can be removed or attenuated from the conditioned audio signal. For example, user 100 may wish to attenuate sound waves 3803 in an environment with high background noise levels.
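A minimal sketch of such attenuation is shown below, applying a user-selected reduction in decibels as a flat gain across all samples. This is the simplest possible case; the frequency-selective digital filters described above would attenuate only chosen bands rather than the whole signal.

```python
# Minimal sketch of attenuating captured audio by a user-selected number of
# decibels. A flat gain is an illustrative simplification; the disclosure
# contemplates frequency-selective digital filters.

def attenuate(samples: list, db: float) -> list:
    """Scale samples down by `db` decibels (db > 0 attenuates)."""
    gain = 10 ** (-db / 20)
    return [s * gain for s in samples]

# 20 dB of attenuation scales every sample by a factor of 0.1.
print(attenuate([1.0, -0.5, 0.25], 20.0))
```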
在一些实施例中,选择性调节可以包括音频信号的放大。例如,第一处理器(例如,处理器210)可以选择声波3803的一个或多个部分来放大。在一些实施例中,声波3803的选定部分可以对应于与用户100与另一个体的对话有关的音频,或者来自用户100感兴趣的音频源(诸如TV、收音机、扬声器等)的音频。例如,用户100可以向用户界面3806A提供输入以放大选定部分声波3803。In some embodiments, the selective conditioning may include amplification of the audio signal. For example, a first processor (e.g., processor 210) may select one or more portions of sound waves 3803 to amplify. In some embodiments, selected portions of sound waves 3803 may correspond to audio related to a conversation of user 100 with another individual, or audio from an audio source of interest to user 100 (such as a TV, radio, speakers, etc.). For example, user 100 may provide input to user interface 3806A to amplify selected portions of sound waves 3803.
在一些实施例中,选择性调节可以包括将说话者的声音与背景声音分离。可以使用任何合适的方法来执行分离,例如使用多个麦克风(诸如包括在可穿戴相机设备3804上的麦克风)。在一些情况下,至少一个麦克风可以是定向麦克风或麦克风阵列。例如,一个麦克风可以捕捉背景噪声,而另一个麦克风可以捕捉包括背景噪声以及特定人的语音的音频信号。然后可以通过从组合音频中减去背景噪声来获得语音。在一些实施例中,第一处理器(例如,处理器210)可以分析图像3805以确定声波3803的源。例如,图像3805可以帮助识别生成声波3803的对象或人。在一些实施例中,用户100可以从用户界面3806A选择要从中过滤音频的对象或人。In some embodiments, the selective conditioning may include separating the speaker's voice from background sounds. Separation may be performed using any suitable method, such as using multiple microphones (such as those included on the wearable camera device 3804). In some cases, the at least one microphone may be a directional microphone or a microphone array. For example, one microphone can capture background noise, while another microphone can capture an audio signal that includes the background noise as well as a particular person's speech. Speech can then be obtained by subtracting the background noise from the combined audio. In some embodiments, the first processor (e.g., processor 210) can analyze the image 3805 to determine the source of the sound waves 3803. For example, the image 3805 can help identify the object or person that generated the sound waves 3803. In some embodiments, user 100 may select from user interface 3806A an object or person to filter audio from.
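The two-microphone subtraction described above can be sketched as follows in the idealized case: one channel records the background noise alone, the other records speech plus the same background, and a sample-wise subtraction recovers a speech estimate. Real devices would first time-align and filter the two channels, since the microphones do not observe identical noise; that step is omitted here.

```python
# Idealized sketch of two-microphone separation: subtract the background-only
# channel from the combined channel to estimate the speech. Perfect channel
# alignment is an assumption made for the sketch.

def estimate_speech(combined: list, background: list) -> list:
    """Subtract the background-only channel from the combined channel."""
    return [c - b for c, b in zip(combined, background)]

speech = [0.2, -0.1, 0.3]
noise = [0.05, 0.05, -0.05]
mixed = [s + n for s, n in zip(speech, noise)]
print(estimate_speech(mixed, noise))  # approximately [0.2, -0.1, 0.3]
```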
在一些实施例中,用户100可以提供输入以选择性地接合或关闭各种音频处理特征。例如,用户100可以提供输入以基于上面讨论的唇读功能选择性地调节声波3803。例如,可以在图像3805中捕捉人的唇部移动,并且第一处理器(例如,处理器210)可以基于图像3805选择性地放大或衰减声波3803。在其他示例中,用户100可以提供输入以基于上面讨论的语音识别来选择性地调节声波3803。第一处理器(例如,处理器210)可以对声波3803的内容执行语音识别,并且可以基于是否从声波3803识别出词语来选择性地放大或衰减声波3803。In some embodiments, user 100 may provide input to selectively engage or disable various audio processing features. For example, the user 100 may provide input to selectively condition the sound waves 3803 based on the lip reading function discussed above. For example, a person's lip movements can be captured in image 3805, and a first processor (e.g., processor 210) can selectively amplify or attenuate sound waves 3803 based on image 3805. In other examples, the user 100 may provide input to selectively condition the sound waves 3803 based on the speech recognition discussed above. The first processor (e.g., processor 210) may perform speech recognition on the content of sound waves 3803, and may selectively amplify or attenuate the sound waves 3803 based on whether words are recognized from the sound waves 3803.
在一些实施例中,用户100可以选择用于选择性放大或衰减的特定音频信号源。例如,第一处理器(例如,处理器210)可以基于对图像3805的分析来识别声波3803的不同分量及其各自的源。在一些示例中,用户100可以与显示在用户界面3806A上的图像3805交互并选择图像3805的不同部分。基于该用户选择,第一处理器(例如,处理器210)可以选择性地放大从图像3805的选定部分发出的声波3803的部分,并且选择性地衰减从图像3805的一个或多个非选定部分发出的声波3803的其他部分。在一些实施例中,第一处理器可以放大从用户100的视场内的区域或用户100的视场的部分获得的声音。In some embodiments, user 100 may select a particular audio signal source for selective amplification or attenuation. For example, a first processor (e.g., processor 210) may identify different components of sound waves 3803 and their respective sources based on analysis of image 3805. In some examples, user 100 may interact with image 3805 displayed on user interface 3806A and select different portions of image 3805. Based on the user selection, the first processor (e.g., processor 210) may selectively amplify portions of sound waves 3803 emanating from the selected portion of image 3805, and selectively attenuate other portions of sound waves 3803 emanating from one or more non-selected portions of image 3805. In some embodiments, the first processor may amplify sound obtained from an area within the user's 100 field of view or a portion of the user's 100 field of view.
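The region-based selection described above can be sketched as follows: each detected sound source is associated with a portion of the captured image, sources inside the user-selected portion are amplified, and the rest are attenuated. The source-to-region mapping and the specific gain values are assumptions made for the illustration.

```python
# Illustrative sketch of region-based selective conditioning. Associating each
# sound source with a named image region, and the boost/cut factors, are
# assumptions for the sketch; the disclosure describes the behavior, not a
# data model.

def condition_sources(sources: dict, selected: set,
                      boost: float = 2.0, cut: float = 0.25) -> dict:
    """Scale each source's amplitude up if its image region was selected."""
    return {region: amp * (boost if region in selected else cut)
            for region, amp in sources.items()}

detected = {"person_left": 1.0, "tv_right": 0.8, "street": 0.4}
print(condition_sources(detected, selected={"person_left"}))
```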
在一些实施例中,至少一个第二处理器可以从可穿戴相机设备接收经调节的音频信号。经调节的音频信号可以从经由配对连接的可穿戴相机设备3804无线地发送到助听器设备3802。In some embodiments, the at least one second processor may receive the conditioned audio signal from the wearable camera device. The conditioned audio signal may be sent wirelessly from the wearable camera device 3804 to the hearing aid device 3802 via a pairing connection.
在一些实施例中,至少一个第二处理器可以基于经调节的音频信号使用至少一个扬声器向用户的耳朵提供声音。助听器设备3802的至少一个扬声器可以生成向用户100的耳朵的声音3801。In some embodiments, the at least one second processor may provide sound to the user's ear using the at least one speaker based on the conditioned audio signal. At least one speaker of the hearing aid device 3802 may generate sound 3801 to the ear of the user 100.
图40示出符合所公开实施例的助听器和配对的相机系统的示例性过程4000的流程图。在一些实施例中,助听器和配对的相机系统可以是图38和图39中示出的系统3800,其包括助听器设备3802、可穿戴相机设备3804和/或移动设备3806。FIG. 40 illustrates a flowchart of an exemplary process 4000 for a hearing aid and paired camera system consistent with the disclosed embodiments. In some embodiments, the hearing aid and paired camera system may be the system 3800 shown in FIGS. 38 and 39, which includes a hearing aid device 3802, a wearable camera device 3804, and/or a mobile device 3806.
在步骤4002中,助听器设备3802和可穿戴相机设备3804可以配对以彼此进行通信,和/或与移动设备3806进行通信。无线配对的示例包括Wi-Fi、蓝牙(Bluetooth)、NFC和其他类似的无线通信技术。在一些实施例中,配对可以由用户100使用移动设备3806发起。例如,当助听器设备3802和/或可穿戴相机设备3804在与移动设备和/或彼此的通信范围内时,可以在移动设备3806上向用户100呈现通知,允许用户100选择设备的配对。在一些实施例中,助听器设备3802、可穿戴相机设备3804和/或移动设备3806之间的配对可以由移动设备3806上的用户100终止。在其他实施例中,配对可以自动启动(例如,当助听器设备3802和/或可穿戴相机设备3804在与移动设备3806和/或彼此的通信范围内时)。In step 4002, the hearing aid device 3802 and the wearable camera device 3804 may be paired to communicate with each other, and/or with the mobile device 3806. Examples of wireless pairing include Wi-Fi, Bluetooth, NFC, and other similar wireless communication technologies. In some embodiments, pairing may be initiated by user 100 using mobile device 3806. For example, when the hearing aid device 3802 and/or the wearable camera device 3804 are within communication range of the mobile device and/or each other, a notification may be presented to the user 100 on the mobile device 3806 allowing the user 100 to select pairing of the devices. In some embodiments, the pairing between the hearing aid device 3802, the wearable camera device 3804, and/or the mobile device 3806 may be terminated by the user 100 on the mobile device 3806. In other embodiments, pairing may be initiated automatically (e.g., when the hearing aid device 3802 and/or the wearable camera device 3804 are within communication range of the mobile device 3806 and/or each other).
在步骤4004中,移动设备3806可以生成或显示用户界面3806A。用户界面3806A可以被配置为从用户100接收可视、音频、触觉或任何其他合适的信号。例如,用户界面3806A可以在可以是移动设备3806的一部分的显示器上显示,诸如包括可由用户手势或由适当的物理或虚拟(即,屏幕上)设备(例如,键盘、鼠标等)操纵的GUI元素的触摸屏。在一些实施例中,用户接口3806A可以是能够接收用于调整系统3800的一个或多个参数的用户100音频输入(例如,用户100语音输入)的音频接口。音频接口可以由可以包括麦克风的移动设备3806提供。In step 4004, mobile device 3806 may generate or display user interface 3806A. User interface 3806A may be configured to receive visual, audio, haptic, or any other suitable signals from user 100. For example, user interface 3806A may be displayed on a display that may be part of mobile device 3806, such as a touch screen including GUI elements that may be manipulated by user gestures or by suitable physical or virtual (i.e., on-screen) devices (e.g., a keyboard, mouse, etc.). In some embodiments, user interface 3806A may be an audio interface capable of receiving audio input from user 100 (e.g., speech input from user 100) for adjusting one or more parameters of system 3800. The audio interface may be provided by the mobile device 3806, which may include a microphone.
在一些实施例中,用户界面3806A还可以呈现与助听器系统3800有关的信息,诸如与助听器设备3802和/或可穿戴相机设备3804的操作有关的信息。这样的信息可以用于通知用户100助听器系统3800的状态,使得用户100可以控制助听器设备3802和/或可穿戴相机设备3804的参数。例如,用户100可以通过用户界面3806A发起或终止移动设备3806、可穿戴相机3804和/或助听器设备3802之间的配对。In some embodiments, user interface 3806A may also present information related to hearing aid system 3800, such as information related to the operation of hearing aid device 3802 and/or wearable camera device 3804. Such information can be used to inform the user 100 of the status of the hearing aid system 3800 so that the user 100 can control parameters of the hearing aid device 3802 and/or the wearable camera device 3804. For example, user 100 may initiate or terminate a pairing between mobile device 3806, wearable camera 3804, and/or hearing aid device 3802 through user interface 3806A.
在一些实施例中,可穿戴相机设备3804捕捉的图像(诸如图像3805)可以经由用户界面3806A实时地显示给用户100。图像3805可以以能够由用户100操纵的方式呈现。例如,用户100可以放大/缩小以最大化或最小化图像3805、裁剪图像3805、编辑图像3805和/或将图像3805保存在存储器中,以及本领域已知的其他图像操纵。在一些实施例中,由相机3804A确定的范围测量可以经由用户界面3806A呈现给用户100。范围测量可以与图像3805相关联。In some embodiments, images captured by wearable camera device 3804, such as image 3805, may be displayed to user 100 in real-time via user interface 3806A. Image 3805 may be presented in a manner that can be manipulated by user 100. For example, user 100 may zoom in/out to maximize or minimize image 3805, crop image 3805, edit image 3805, and/or save image 3805 in memory, as well as other image manipulations known in the art. In some embodiments, the range measurements determined by camera 3804A may be presented to user 100 via user interface 3806A. Range measurements may be associated with image 3805.
在步骤4006中,移动设备3806可以经由用户界面3806A从用户100接收输入。在一些实施例中,音频调节操作的状态可以经由用户界面3806A呈现给用户100。例如,可以向用户100呈现任何正在进行的音频调节设置。在一些实施例中,用户界面3806A可以向用户显示选项,以取消正在进行的音频调节操作,和/或通过修改一个或多个音频调节设置来选择不同的音频调节操作。In step 4006, mobile device 3806 may receive input from user 100 via user interface 3806A. In some embodiments, the status of the audio adjustment operation may be presented to the user 100 via the user interface 3806A. For example, the user 100 may be presented with any ongoing audio adjustment settings. In some embodiments, the user interface 3806A may display options to the user to cancel an audio adjustment operation in progress, and/or to select a different audio adjustment operation by modifying one or more audio adjustment settings.
在步骤4008中,移动设备3806可以将经由用户界面3806A接收的用户输入发送到助听器设备3802或可穿戴相机设备3804。助听器设备3802或可穿戴相机设备3804的至少一个第二处理器(例如,处理器210)可以被编程为基于用户输入确定用于系统3800的指令。然后,可以由助听器设备3802的组件和/或由可穿戴相机设备3804执行指令。In step 4008, the mobile device 3806 may transmit the user input received via the user interface 3806A to the hearing aid device 3802 or the wearable camera device 3804. At least one second processor (e.g., processor 210) of hearing aid device 3802 or wearable camera device 3804 may be programmed to determine instructions for system 3800 based on the user input. The instructions may then be executed by components of the hearing aid device 3802 and/or by the wearable camera device 3804.
在步骤4010中,可穿戴相机设备3804可以从用户的环境捕捉音频(诸如声波3803)。可穿戴相机设备3804可以使用至少一个麦克风来基于接收到的声波3803生成音频信号。In step 4010, the wearable camera device 3804 may capture audio (such as sound waves 3803) from the user's environment. The wearable camera device 3804 can use at least one microphone to generate audio signals based on the received sound waves 3803.
在步骤4012中,可穿戴相机设备3804可以捕捉多个图像(诸如图像3805)。如图38和图39所示,可穿戴相机设备3804可以使用相机3804A来捕捉用户100的视场的实时图像数据。相机3804A可以捕捉对象和人的图像,其中一些对象和人可以产生由位于可穿戴相机设备3804中的一个或多个麦克风接收的声波3803。In step 4012, wearable camera device 3804 may capture a plurality of images (such as image 3805). As shown in FIGS. 38 and 39, the wearable camera device 3804 can use the camera 3804A to capture real-time image data of the user's 100 field of view. Camera 3804A can capture images of objects and people, some of which can generate sound waves 3803 that are received by one or more microphones located in wearable camera device 3804.
在步骤4014中,可穿戴相机设备3804可以确定范围测量3807。例如,可穿戴相机设备3804可以使用测距仪来确定对象或人与可穿戴相机设备3804之间的范围(或距离)。在一些实施例中,可以确定相对于用户的视线方向的角度。In step 4014, the wearable camera device 3804 may determine a range measurement 3807. For example, the wearable camera device 3804 may use a rangefinder to determine the range (or distance) between an object or person and the wearable camera device 3804. In some embodiments, an angle relative to the user's gaze direction may be determined.
在步骤4016中,可穿戴相机设备3804可以调节声波3803。调节可以包括由至少一个第一处理器(例如,处理器210)进行的修改声波3803的音调、音高、低音和/或其他音频效果的操作;将声音分类为不同的声音类别;对声波3803进行衰减;和/或引起声波3803的放大。调节可以基于从助听器设备3802和/或移动设备3806发送的指令,这些指令又是基于来自用户100的输入而生成的。In step 4016, the wearable camera device 3804 may condition the sound waves 3803. Conditioning may include operations performed by at least one first processor (e.g., processor 210) to modify the tone, pitch, bass, and/or other audio effects of sound waves 3803; classifying sounds into different sound categories; attenuating sound waves 3803; and/or causing amplification of sound waves 3803. The conditioning may be based on instructions sent from the hearing aid device 3802 and/or the mobile device 3806, which in turn are generated based on input from the user 100.
例如,可以在用户界面3806A上向用户100呈现类似菜单的界面(例如音频混合滑块),并且用户100可以根据需要选择一个或多个音频效果。在一些实施例中,改变对应于声波3803的一个或多个音频信号的音调可以使声音对用户100更易感知。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以调节声音3803的音高。例如,用户100可能经历10kHz以上的频率中的听觉损失,并且第一处理器(例如处理器210)可以将更高的频率(例如,在15kHz处)重新映射到低于10kHz的频率。在一些实施例中,第一处理器(处理器210)可以被配置为改变与一个或多个音频信号相关联的语速。第一处理器(例如,处理器210)可以被配置为改变经调节的音频信号中的个体的语速,以使检测到的语音对用户100更易感知。For example, user 100 may be presented with a menu-like interface (e.g., an audio mix slider) on user interface 3806A, and user 100 may select one or more audio effects as desired. In some embodiments, changing the tone of one or more audio signals corresponding to sound waves 3803 may make the sound more perceptible to user 100. For example, the user 100 may have less sensitivity to tones within a certain range, and the conditioning of the audio signal may adjust the pitch of the sound 3803. For example, user 100 may experience hearing loss in frequencies above 10 kHz, and a first processor (e.g., processor 210) may remap higher frequencies (e.g., at 15 kHz) to frequencies below 10 kHz. In some embodiments, the first processor (processor 210) may be configured to change the speech rate associated with the one or more audio signals. The first processor (e.g., processor 210) may be configured to change the speech rate of an individual in the conditioned audio signal to make the detected speech more perceptible to the user 100.
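The speech-rate change described above can be sketched as follows: word segments are stretched while the silences between them are shortened, slowing the perceived speaking rate without greatly lengthening the audio. Operating on labeled (segment, duration) pairs rather than raw audio samples, and the specific stretch factors, are assumptions made to keep the illustration short.

```python
# Simplified sketch of slowing speech: stretch word segments, shrink silences.
# The segment labels and the 1.3x / 0.5x factors are illustrative assumptions.

def adjust_rate(segments: list, word_stretch: float = 1.3,
                silence_shrink: float = 0.5) -> list:
    """segments: (label, seconds) pairs with label 'word' or 'silence'."""
    return [(label, dur * (word_stretch if label == "word" else silence_shrink))
            for label, dur in segments]

detected = [("word", 0.4), ("silence", 0.3), ("word", 0.5)]
print(adjust_rate(detected))
```

A real implementation would apply a pitch-preserving time-stretch (e.g., phase-vocoder style) to the word audio itself rather than rescaling durations.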
例如,第一处理器(例如,处理器210)可以将声波3803分类为包含音乐、音调、笑声、讲话、尖叫、背景噪声等的片段。一旦音频的某些部分被分类为非语音,更多的计算资源可用于处理其他片段。在一些实施例中,用户100可以经由用户界面3806A提供输入,以将上面讨论的不同音频效果应用于声波3803的不同片段。For example, a first processor (e.g., processor 210) may classify sound waves 3803 into segments containing music, tones, laughter, speech, screams, background noise, and the like. Once certain portions of the audio are classified as non-speech, more computing resources are available to process other segments. In some embodiments, user 100 may provide input via user interface 3806A to apply the different audio effects discussed above to different segments of sound waves 3803.
例如,第一处理器(例如,处理器210)可以基于来自用户100的输入将一个或多个滤波器(例如数字滤波器)应用于声波3803。在一些情况下,滤波器可以选择性地衰减音频信号(诸如声波3803)。在一些情况下,声波3803可以包括环境噪声(例如,各种背景声音,诸如音乐、来自未参与与用户100的对话的人的声音/噪声等)。用户100可以选择各种滤波选项,从而可以从经调节的音频中消除或衰减环境噪声。例如,用户100可能希望衰减与背景噪声相关联的环境中的声波3803。For example, a first processor (e.g., processor 210) may apply one or more filters (e.g., digital filters) to sound waves 3803 based on input from user 100. In some cases, the filter can selectively attenuate audio signals (such as sound waves 3803). In some cases, the sound waves 3803 may include ambient noise (e.g., various background sounds such as music, sounds/noise from people not engaged in the conversation with the user 100, etc.). The user 100 can select various filtering options so that ambient noise can be removed or attenuated from the conditioned audio. For example, user 100 may wish to attenuate sound waves 3803 in the environment associated with background noise.
例如,第一处理器(例如,处理器210)可以选择声波3803的一个或多个部分来放大。在一些实施例中,声波3803的选定部分可以对应于与用户100与另一个体的对话有关的音频,或者来自用户100感兴趣的音频源(诸如TV、收音机、扬声器等)的音频。例如,用户100可以向用户界面3806A提供输入以放大声波3803的选定部分。可以使用任何合适的方法来执行说话者的语音从背景声音的分离,例如使用多个麦克风(诸如包括在可穿戴相机设备3804上的麦克风)。在一些情况下,至少一个麦克风可以是定向麦克风或麦克风阵列。例如,一个麦克风可以捕捉背景噪声,而另一个麦克风可以捕捉包括背景噪声以及特定人的语音的音频信号。然后可以通过从组合音频中减去背景噪声来获得语音。在一些实施例中,第一处理器(例如,处理器210)可以利用图像3805以辅助用户100确定声波3803的源。例如,可以分析图像3805以识别生成声波3803的对象或人,并且其可以继而经由用户界面3806A显示给用户100。For example, a first processor (e.g., processor 210) may select one or more portions of sound waves 3803 to amplify. In some embodiments, selected portions of sound waves 3803 may correspond to audio related to a conversation of user 100 with another individual, or audio from an audio source of interest to user 100 (such as a TV, radio, speakers, etc.). For example, user 100 may provide input to user interface 3806A to amplify selected portions of sound waves 3803. The separation of the speaker's speech from background sounds may be performed using any suitable method, such as using multiple microphones (such as those included on the wearable camera device 3804). In some cases, the at least one microphone may be a directional microphone or a microphone array. For example, one microphone can capture background noise, while another microphone can capture an audio signal that includes the background noise as well as a particular person's speech. Speech can then be obtained by subtracting the background noise from the combined audio. In some embodiments, the first processor (e.g., processor 210) may utilize image 3805 to assist user 100 in determining the source of sound waves 3803. For example, the image 3805 can be analyzed to identify the object or person generating the sound waves 3803, which can then be displayed to the user 100 via the user interface 3806A.
例如,用户100可以提供输入用于基于上面讨论的唇读功能对声波3803的选择性调节。可以在图像3805中捕捉人的唇部移动,并且第一处理器(例如,处理器210)可以基于图像3805选择性地放大或衰减声波3803。在一些其他示例中,用户100可以提供输入以基于上面讨论的语音识别来选择性地调节声波3803。第一处理器(例如,处理器210)可以对声波3803的内容执行语音识别,并且可以基于是否从声波3803识别出词语来选择性地放大或衰减声波3803。For example, user 100 may provide input for selective conditioning of sound waves 3803 based on the lip reading function discussed above. A person's lip movements can be captured in image 3805, and a first processor (e.g., processor 210) can selectively amplify or attenuate sound waves 3803 based on image 3805. In some other examples, the user 100 may provide input to selectively condition the sound waves 3803 based on the speech recognition discussed above. The first processor (e.g., processor 210) may perform speech recognition on the content of sound waves 3803, and may selectively amplify or attenuate the sound waves 3803 based on whether words are recognized from the sound waves 3803.
例如,第一处理器(例如,处理器210)可以在图像3805中被捕捉时识别声波3803的不同分量及其各自的源。用户100可以与显示在用户界面3806A上的图像3805交互并选择图像3805的不同部分。基于该用户选择,第一处理器(例如,处理器210)可以选择性地放大从图像3805的选定部分发出的声波3803的部分,并且选择性地衰减从图像3805的非选定部分发出的声波3803的其他部分。For example, a first processor (e.g., processor 210) can identify the different components of sound waves 3803 and their respective sources as they are captured in image 3805. User 100 may interact with image 3805 displayed on user interface 3806A and select different portions of image 3805. Based on the user selection, the first processor (e.g., processor 210) may selectively amplify portions of sound waves 3803 emanating from selected portions of image 3805, and selectively attenuate other portions of sound waves 3803 emanating from non-selected portions of image 3805.
在步骤4018中,可穿戴相机设备3804可以将经调节的音频信号提供给助听器设备3802。例如,经调节的音频信号可以从经由配对连接的可穿戴相机设备3804无线地发送到助听器设备3802。经调节的音频信号的传输可以包括通过一个或多个网络并使用一个或多个传输协议的传输。In step 4018 , the wearable camera device 3804 may provide the conditioned audio signal to the hearing aid device 3802 . For example, the conditioned audio signal may be sent wirelessly from the wearable camera device 3804 to the hearing aid device 3802 via a pairing connection. Transmission of the conditioned audio signal may include transmission over one or more networks and using one or more transmission protocols.
在步骤4020中,助听器设备3802的一个或多个扬声器可以生成向用户100的一个或多个耳朵的声音3801。In step 4020, one or more speakers of hearing aid device 3802 may generate sound 3801 to one or more ears of the user 100.
例如,在一些实施例中,至少一个第二处理器可以从可穿戴相机设备接收经调节的音频信号,并且可以基于经调节的音频信号使用至少一个扬声器向用户的耳朵提供声音。助听器设备3802的至少一个扬声器可以生成向用户100的一个或两个耳朵的声音3801。For example, in some embodiments, the at least one second processor may receive the conditioned audio signal from the wearable camera device, and may provide sound to the user's ear using the at least one speaker based on the conditioned audio signal. At least one speaker of the hearing aid device 3802 may generate sound 3801 to one or both ears of the user 100.
自适应捕捉速率Adaptive capture rate
在一些实施例中,助听器系统可以具有自适应捕捉速率。例如,与麦克风和/或与助听器系统相关联的相机相关联的参数可以基于特定情况或背景来调整。例如,助听器系统可以分析由相机捕捉的图像和/或由麦克风捕捉的声音,以确定应该改变特定参数。取决于助听器系统的用户的情况或背景,可以优化不同的参数。例如,当用户与语速快的个体交互时,助听器系统可以增加相机的捕捉速率(例如,每秒帧数)。这可以允许助听器系统更有效地分析说话者的捕捉图像(例如,用于检测唇部移动等)。在个体说话较慢的情况下,或者在没有活跃说话者的情况下,系统可以降低相机的捕捉速率。这可以是有益的,例如,减少相机设备的功耗、减少所使用的存储器量等等。其他传感器或信息也可以用于控制相机或麦克风的参数,诸如位置信息、检测到的光线水平、一天中的时间等。In some embodiments, the hearing aid system may have an adaptive capture rate. For example, parameters associated with a microphone and/or a camera associated with the hearing aid system may be adjusted based on a particular situation or context. For example, the hearing aid system may analyze images captured by the camera and/or sounds captured by the microphone to determine that certain parameters should be changed. Depending on the situation or context of the user of the hearing aid system, different parameters can be optimized. For example, the hearing aid system may increase the capture rate (e.g., frames per second) of the camera when the user is interacting with a fast-speaking individual. This may allow the hearing aid system to more efficiently analyze the captured images of the speaker (e.g., for detecting lip movements, etc.). In situations where the individual is speaking slowly, or when there are no active speakers, the system can reduce the capture rate of the camera. This may be beneficial, for example, to reduce the power consumption of the camera device, reduce the amount of memory used, and so on. Other sensors or information can also be used to control parameters of the camera or microphone, such as location information, detected light levels, time of day, etc.
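The adaptive capture-rate logic described above can be sketched as follows: the frame rate rises for a fast speaker (more frames to support lip reading) and falls when speech is slow or absent (saving power and memory). All thresholds and frame-rate values below are illustrative assumptions, not values taken from the disclosure.

```python
from typing import Optional

# Hedged sketch of adaptive capture-rate selection based on detected speaking
# rate. Thresholds (words per minute) and the fps values are assumptions.

def choose_capture_rate(words_per_minute: Optional[float]) -> int:
    """Return a camera frame rate (fps) for the detected speaking rate."""
    if words_per_minute is None:   # no active speaker detected
        return 5
    if words_per_minute > 160:     # fast speaker: more frames for lip reading
        return 60
    if words_per_minute > 100:     # typical conversational pace
        return 30
    return 15                      # slow speaker

print(choose_capture_rate(180), choose_capture_rate(120), choose_capture_rate(None))
```

In a fuller system, inputs such as detected light level, location, or time of day could feed the same decision, as the paragraph above notes.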
所公开助听器系统可以选择性地放大声音。在一个实施例中,助听器系统可以包括可穿戴相机、助听器和至少一个麦克风。可穿戴相机可以指具有图像、声音、音频和/或视频捕捉能力的设备,它们可以附接到用户或用户的衣服或配件上。助听器可以指将音频或声音输出到用户耳朵的设备。助听器的输出可以基于从可穿戴相机和/或至少一个麦克风接收的输入或由可穿戴相机和/或至少一个麦克风接收的输入来生成。助听器系统可以包括配对在一起以提供改进的功能的几个设备。例如,助听器系统可以包括用于向用户的耳朵提供声音的助听器,其中声音可以由助听器声学地捕捉或从诸如可穿戴相机或至少一个麦克风的另一个源电子地接收。The disclosed hearing aid system can selectively amplify sound. In one embodiment, a hearing aid system may include a wearable camera, a hearing aid, and at least one microphone. Wearable cameras can refer to devices with image, sound, audio and/or video capture capabilities that can be attached to the user or to the user's clothing or accessories. A hearing aid may refer to a device that outputs audio or sound to a user's ear. The output of the hearing aid may be generated based on input received from or by the wearable camera and/or at least one microphone. A hearing aid system may include several devices that are paired together to provide improved functionality. For example, a hearing aid system may include a hearing aid for providing sound to a user's ear, wherein the sound may be captured acoustically by the hearing aid or received electronically from another source such as a wearable camera or at least one microphone.
在一些实施例中,可穿戴相机可以被配置为从用户的环境捕捉多个图像,并且可穿戴相机可以具有图像捕捉参数。相机可以指能够接收来自人、对象和/或环境的光,并基于接收到的光形成图像或视频的组件或设备。该多个图像可以是包含多个静止图像(被称为帧)的视频剪辑。在一些实施例中,助听器系统可以包括被配置为从用户的环境捕捉声音的至少一个麦克风。麦克风可以是能够接收声波并基于所接收的声波生成音频信号的组件或设备。In some embodiments, the wearable camera may be configured to capture multiple images from the user's environment, and the wearable camera may have image capture parameters. A camera may refer to a component or device capable of receiving light from a person, object, and/or environment, and forming an image or video based on the received light. The plurality of images may be video clips containing a plurality of still images (referred to as frames). In some embodiments, the hearing aid system may include at least one microphone configured to capture sound from the user's environment. A microphone may be a component or device capable of receiving sound waves and generating audio signals based on the received sound waves.
作为示例,图41示出了包括可穿戴相机4104和助听器4102的助听器系统。在一些实施例中,助听器设备4102和可穿戴相机设备4104配对以与具有图形用户界面(GUI)的移动设备(未示出)进行通信。在一些实施例中,助听器4102可以与可穿戴相机4104配对,并且数据可以在配对的设备之间通信。在一些实施例中,可穿戴相机4104可以对应于图4A-图4K中示出的装置110。在一些替代实施例中,可穿戴相机设备4104可以对应于图3A和图3B中示出的装置110。助听器4102可以对应于听觉接口设备1710。As an example, FIG. 41 shows a hearing aid system including a wearable camera 4104 and a hearing aid 4102. In some embodiments, the hearing aid device 4102 and the wearable camera device 4104 are paired to communicate with a mobile device (not shown) having a graphical user interface (GUI). In some embodiments, the hearing aid 4102 can be paired with the wearable camera 4104, and data can be communicated between the paired devices. In some embodiments, the wearable camera 4104 may correspond to the device 110 shown in FIGS. 4A-4K. In some alternative embodiments, the wearable camera device 4104 may correspond to the apparatus 110 shown in FIGS. 3A and 3B. Hearing aid 4102 may correspond to auditory interface device 1710.
可穿戴相机4104可以包括用于捕捉基本上对应于用户的视场的实时图像数据的图像传感器(例如,图像传感器220)。例如,可穿戴相机4104可以是指能够检测近红外、红外、可见光和紫外光谱中的光信号并将其转换为电信号的设备。电信号可以用于基于检测到的信号来形成图像或视频流(即图像数据)。术语“图像数据”包括从近红外、红外、可见光和紫外光谱中的光学信号中检索到的任何形式的数据。图像传感器的示例可以包括半导体电荷耦合器件(CCD)、互补金属氧化物半导体(CMOS)中的有源像素传感器或N型金属氧化物半导体(NMOS,活跃MOS)。Wearable camera 4104 may include an image sensor (e.g., image sensor 220) for capturing real-time image data substantially corresponding to the field of view of the user. For example, wearable camera 4104 may refer to a device capable of detecting light signals in the near-infrared, infrared, visible, and ultraviolet spectrums and converting them into electrical signals. The electrical signals may be used to form images or video streams (i.e., image data) based on the detected signals. The term "image data" includes any form of data retrieved from optical signals in the near infrared, infrared, visible and ultraviolet spectrums. Examples of image sensors may include semiconductor charge coupled devices (CCDs), active pixel sensors in complementary metal oxide semiconductors (CMOS), or N-type metal oxide semiconductors (NMOS, Live MOS).
根据图41所示的示例,用户100可以佩戴可穿戴相机4104和助听器4102。如图所示,用户100可以穿戴物理连接到用户100的衬衫或其他衣物的可穿戴相机4104。与所公开的实施例一致,可穿戴相机4104可以被放置在诸如连接到项链、腰带、眼镜、腕带、纽扣等的其他位置。如图41所示,助听器4102可以被放置在用户100的一个或两个耳朵中,类似于传统的听觉接口设备。助听器4102可以是各种样式的,包括耳道内、完全耳道内、耳内、耳后、耳上、耳道内接收器、开放安装或各种其他样式。助听器4102可以包括用于向用户100提供听觉反馈的一个或多个扬声器。According to the example shown in FIG. 41, the user 100 may wear a wearable camera 4104 and a hearing aid 4102. As shown, user 100 may wear wearable camera 4104 that is physically attached to user 100's shirt or other clothing. Consistent with the disclosed embodiments, the wearable camera 4104 may be placed in other locations, such as attached to a necklace, belt, glasses, wristband, button, and the like. As shown in FIG. 41, the hearing aid 4102 may be placed in one or both ears of the user 100, similar to a conventional auditory interface device. The hearing aid 4102 may be of various styles, including in-the-canal, completely-in-canal, in-the-ear, behind-the-ear, on-the-ear, receiver-in-canal, open fit, or various other styles. Hearing aid 4102 may include one or more speakers for providing auditory feedback to user 100.
用户的环境一般是指正在使用可穿戴相机设备4104的用户的周围环境。用户的环境可以包括对象和人,其中一些对象和人可以产生由位于可穿戴相机设备4104中的一个或多个麦克风(未示出)接收的声波。根据可穿戴相机4104的方向和视场,可穿戴相机4104可以捕捉包括可穿戴相机4104沿着光轴4103所看到的用户环境中的对象和人的表示的图像。The user's environment generally refers to the surrounding environment of the user who is using the wearable camera device 4104. The user's environment may include objects and people, some of which may generate sound waves that are received by one or more microphones (not shown) located in the wearable camera device 4104. Depending on the orientation and field of view of the wearable camera 4104, the wearable camera 4104 may capture images that include representations of objects and people in the user's environment as seen by the wearable camera 4104 along the optical axis 4103.
可穿戴相机可以包括至少一个处理器。术语“处理器”包括具有对输入执行逻辑操作的电路的任何物理设备。例如,处理设备可以包括一个或多个集成电路、微芯片、微控制器、微处理器、中央处理单元(CPU)、图形处理单元(GPU)、数字信号处理器(DSP)、现场可编程门阵列(FPGA)的全部或部分、或适于执行指令或执行逻辑操作的其他电路。在一些实施例中,图5A-图5C中示出的处理器210可以是至少一个处理器的示例。The wearable camera can include at least one processor. The term "processor" includes any physical device having circuitry that performs logical operations on inputs. For example, a processing device may include all or part of one or more integrated circuits, microchips, microcontrollers, microprocessors, central processing units (CPUs), graphics processing units (GPUs), digital signal processors (DSPs), field-programmable gate arrays (FPGAs), or other circuitry suitable for executing instructions or performing logical operations. In some embodiments, the processor 210 shown in FIGS. 5A-5C may be an example of at least one processor.
在一些实施例中,至少一个处理器可以接收由可穿戴相机捕捉的多个图像。在一些实施例中,至少一个处理器可以接收表示由至少一个麦克风捕捉的声音的音频信号。音频信号可以表示从用户的环境发出的声音。In some embodiments, the at least one processor may receive a plurality of images captured by the wearable camera. In some embodiments, the at least one processor may receive audio signals representing sounds captured by the at least one microphone. The audio signal may represent sounds emanating from the user's environment.
举例而言,图42示出了从用户的环境捕捉图像和音频的助听器系统。用户100周围的区域、地区和/或空间可以构成用户100的环境。在一些实施例中,可穿戴相机4104可以具有由图42中示出的沿着光轴4103的锥体4203所定义的视场。锥体4203的宽度可以是如由其组件或设置所定义的可穿戴相机4104的属性,诸如镜头或光圈、变焦等。在锥体4203内,在可穿戴相机4104的视图内可以存在人或对象(诸如人4200)。在一些实施例中,可穿戴相机4104可以捕捉图像4214。图像4214可以包括锥体4203的视图内的人或对象的表示。在一些实施例中,图像4214是诸如人4200的人的图像,并且可以对人4200的面部成像,使得在图像4214中可以看到人4200的唇部4214a和唇部移动。By way of example, FIG. 42 shows a hearing aid system that captures images and audio from a user's environment. The area, region, and/or space around the user 100 may constitute the environment of the user 100. In some embodiments, the wearable camera 4104 may have a field of view defined by the cone 4203 shown in FIG. 42 along the optical axis 4103. The width of the cone 4203 may be a property of the wearable camera 4104 as defined by its components or settings, such as a lens or aperture, zoom, etc. Within the cone 4203, there may be a person or object (such as person 4200) within the view of the wearable camera 4104. In some embodiments, the wearable camera 4104 can capture the image 4214. Image 4214 may include representations of people or objects within the view of cone 4203. In some embodiments, image 4214 is an image of a person, such as person 4200, and the face of person 4200 can be imaged such that person 4200's lips 4214a and lip movement can be seen in image 4214.
在一些实施例中,声波可以从锥体4203内发出,诸如由人4200生成的声波4202。在一些实施例中,其他声波可以从锥体4203的外部发出,诸如声波4204a和/或声波4204b。因此,由至少一个麦克风捕捉的音频信号可以来自在图像4214中捕捉的人或对象。在一些实施例中,不是由来自锥体4203内的人或对象生成的声波可以被认为是背景声音。在一些实施例中,声波可以通过各种物理属性来表征,诸如振幅、频率、音调、音高等。In some embodiments, sound waves can be emitted from within cone 4203, such as sound waves 4202 generated by person 4200. In some embodiments, other sound waves may be emitted from outside of cone 4203, such as sound waves 4204a and/or sound waves 4204b. Thus, the audio signal captured by the at least one microphone may be from a person or object captured in image 4214. In some embodiments, sound waves not generated by persons or objects from within cone 4203 may be considered background sounds. In some embodiments, sound waves may be characterized by various physical properties, such as amplitude, frequency, tone, pitch, and the like.
处理器210可以通过执行任何合适的方法,例如通过使用多个麦克风(诸如包括在可穿戴相机4104上的麦克风),将人4200的语音与背景声音分离。在一些情况下,至少一个麦克风可以是定向麦克风或麦克风阵列。例如,一个麦克风可以捕捉背景噪声(例如,声波4204a和/或声波4204b),而另一个麦克风可以捕捉包括背景噪声(例如,声波4204a和/或声波4204b)以及特定人的语音(声波4202)的音频信号。然后可以通过从组合音频中减去背景噪声来获得语音。在一些实施例中,处理器(例如,处理器210)可以分析图像4214以确定声波4202的源。例如,声波4202可以包含由人4200说出的语音4212。语音4212可以与图像4214中看到的唇部4214a的运动相匹配。The processor 210 may separate the speech of the person 4200 from background sounds by performing any suitable method, such as by using multiple microphones, such as those included on the wearable camera 4104. In some cases, the at least one microphone may be a directional microphone or a microphone array. For example, one microphone may capture background noise (e.g., sound waves 4204a and/or sound waves 4204b), while another microphone may capture an audio signal that includes both the background noise (e.g., sound waves 4204a and/or sound waves 4204b) and a particular person's speech (sound waves 4202). The speech can then be obtained by subtracting the background noise from the combined audio. In some embodiments, a processor (e.g., processor 210) can analyze image 4214 to determine the source of sound waves 4202. For example, sound waves 4202 may contain speech 4212 spoken by person 4200. Speech 4212 may match the movement of lips 4214a seen in image 4214.
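The two-microphone subtraction described above can be sketched as a minimal time-domain operation. This is an illustrative simplification only (practical systems typically operate on spectral frames, e.g., spectral subtraction); the function name and the `alpha` scaling factor are assumptions for illustration, not part of the disclosed embodiments:

```python
def isolate_speech(mixed, background, alpha=1.0):
    """Estimate speech by subtracting a scaled background estimate.

    mixed: samples from a microphone capturing speech plus noise
    background: samples from a microphone capturing mostly noise
    alpha: assumed scaling of the background as picked up by the
           speech microphone (illustrative)
    """
    return [m - alpha * b for m, b in zip(mixed, background)]

# The combined signal minus the background leaves the speech estimate.
speech = isolate_speech([1.0, 2.0, 0.5], [0.5, 0.5, 0.5])
```

A production system would instead estimate the noise spectrum over time and subtract it per frequency bin, but the tradeoff is the same: the cleaner the background estimate, the better the isolated speech.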
在一些实施例中,助听器系统可以改变或调整可穿戴相机4104的图像捕捉参数。图像捕捉参数可以是指表征可穿戴相机4104的一个或多个操作的操作设置、参数、条件和/或其他因素。例如,调整图像捕捉可以增加可穿戴相机4104的性能特性,而降低可穿戴相机4104的性能特性可以减少可穿戴相机4104的功耗,或者可以减少所使用的存储器量等。In some embodiments, the hearing aid system may change or adjust the image capture parameters of the wearable camera 4104. Image capture parameters may refer to operational settings, parameters, conditions, and/or other factors that characterize one or more operations of the wearable camera 4104. For example, adjusting image capture may increase the performance characteristics of the wearable camera 4104, while reducing the performance characteristics of the wearable camera 4104 may reduce the power consumption of the wearable camera 4104, or may reduce the amount of memory used, or the like.
在一些实施例中,图像捕捉参数可以是相机的帧速率。帧可以指多个图像(诸如视频)中的单个图像。因此,帧速率可以指每单位时间的图像数量。在一些实施例中,可穿戴相机4104的帧速率可以指当可穿戴相机捕捉视频剪辑时,每单位时间捕捉的图像数量。例如,如果可穿戴相机4104以每秒100帧(fps)捕捉视频,则当可穿戴相机4104捕捉视频时,每秒捕捉100个静止图像。相机的帧速率可能会影响所捕捉的视频剪辑的质量。例如,与使用较慢帧速率的相机相比,使用更高帧速率的相机可以在特定时间帧期间捕捉更多的图像,并且更高数量的图像可以增加视频质量,例如,提供更多的运动细节。例如,在感兴趣的对象(诸如人4200)表现出快速或迅速运动的情况或背景中,可能更希望可穿戴相机4104以更高的帧速率操作。可替代地,在某些情况或背景中,更高的帧速率可能不是最佳的。例如,当感兴趣的对象(诸如人4200)没有表现出快速或迅速的运动时,具有较慢帧速率(或较低数量的静止图像)的视频可能不会显著损害视频质量。相反,让相机使用较慢的帧速率操作可以消耗较低的能量,并且捕捉的图像可以需要更少的存储器来存储。In some embodiments, the image capture parameter may be the frame rate of the camera. A frame may refer to a single image of multiple images, such as a video. Therefore, frame rate can refer to the number of images per unit of time. In some embodiments, the frame rate of the wearable camera 4104 may refer to the number of images captured per unit of time when the wearable camera captures a video clip. For example, if the wearable camera 4104 captures video at 100 frames per second (fps), then when the wearable camera 4104 captures video, 100 still images are captured per second. The camera's frame rate can affect the quality of the captured video clips. For example, a camera using a higher frame rate can capture more images during a given time frame than a camera using a slower frame rate, and a higher number of images can increase video quality, e.g., provide more motion detail. For example, in situations or contexts where an object of interest, such as person 4200, exhibits fast or rapid motion, it may be more desirable for the wearable camera 4104 to operate at a higher frame rate. Alternatively, higher frame rates may not be optimal in certain situations or contexts. For example, when an object of interest (such as person 4200) does not exhibit fast or rapid motion, a video with a slower frame rate (or a lower number of still images) may not significantly impair video quality. Conversely, having the camera operate at a slower frame rate can consume less power, and the captured images can require less memory to store.
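The frame-rate tradeoff above can be made concrete with a back-of-the-envelope calculation; the frame size, durations, and rates below are illustrative assumptions, not values from the disclosed embodiments:

```python
def uncompressed_video_bytes(fps, seconds, bytes_per_frame):
    """Storage for an uncompressed clip: frames per second times
    duration times the size of each frame. Halving the frame rate
    halves the storage (and, roughly, the sensor readout work)."""
    return fps * seconds * bytes_per_frame

# Assumed 640x480 8-bit grayscale frames (307,200 bytes each), 60 s clip.
fast = uncompressed_video_bytes(100, 60, 640 * 480)  # high frame rate
slow = uncompressed_video_bytes(25, 60, 640 * 480)   # power-saving rate
```

At 100 fps the clip needs four times the storage of the 25 fps clip, which is the memory/power cost the text weighs against motion detail.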
在附加或替代实施例中,图像捕捉参数可以包括捕捉图像的分辨率、捕捉图像的压缩率和/或用于优化捕捉音频信号的压缩质量的参数。In additional or alternative embodiments, the image capture parameters may include the resolution of the captured image, the compression rate of the captured image, and/or parameters for optimizing the compression quality of the captured audio signal.
在一些实施例中,可以基于检测到的语速来调整图像捕捉参数。例如,语速可以指说话或发声的步调或速度。在一些情况下,语速可以与诸如人4200的个体的唇部移动有关。例如,高语速可能暗示迅速的唇部移动,而低语速可能暗示较慢的唇部移动。在一些实施例中,助听器系统可以包括诸如唇部跟踪算法的特征,并且多个捕捉的图像可以由唇部跟踪算法使用。In some embodiments, image capture parameters may be adjusted based on the detected speech rate. For example, speech rate can refer to the pace or speed of speech or vocalization. In some cases, the speech rate may be related to the movement of the lips of an individual such as person 4200. For example, a high speech rate might imply rapid lip movement, while a low speech rate might imply slower lip movement. In some embodiments, the hearing aid system may include features such as a lip tracking algorithm, and the plurality of captured images may be used by the lip tracking algorithm.
另外地或可替代地,可以基于检测到的光线水平来调整图像捕捉参数。例如,更高的帧速率设置可能比较低的帧速率设置需要更多的光。处理器210可以基于一个或多个光传感器读数(例如,来自可穿戴相机4104)来确定在某些环境条件(例如,某些光线条件)下对于可穿戴相机4104最优的帧速率。Additionally or alternatively, image capture parameters may be adjusted based on detected light levels. For example, higher frame rate settings may require more light than lower frame rate settings. The processor 210 may determine an optimal frame rate for the wearable camera 4104 under certain environmental conditions (e.g., certain lighting conditions) based on one or more light sensor readings (e.g., from the wearable camera 4104).
另外地或可替代地,可以基于位置信息来调整图像捕捉参数。例如,可以基于用户100的位置信息(例如,GPS信息、从分析捕捉的图像和/或音频确定的位置信息、用户100提供的位置信息等)来确定用户100的环境。处理器210可以基于位置信息来确定调整可穿戴相机4104的帧速率的相关因素,诸如用户100是否位于繁忙位置、用于在该位置成像的对象的数量和类型、用于再充电的电源插座的可用性、或可以确定可穿戴相机4104的最佳性能和/或最佳电池寿命管理的其他此类因素。在一些实施例中,处理器210可以基于助听器系统的GPS坐标或来自用户100的输入来接收位置信息。在一些实施例中,位置信息可以是可由用户100选择的用户设置。例如,可穿戴相机4104可以被配置为基于位置的用户设置以预定帧速率操作,诸如在城市环境、乡村环境、拥挤位置、稀疏位置、低照明环境等。Additionally or alternatively, image capture parameters may be adjusted based on location information. For example, user 100's environment may be determined based on location information of user 100 (e.g., GPS information, location information determined from analyzing captured images and/or audio, location information provided by user 100, etc.). The processor 210 may determine, based on the location information, factors relevant to adjusting the frame rate of the wearable camera 4104, such as whether the user 100 is in a busy location, the number and type of objects to be imaged at the location, the availability of power outlets for recharging, or other such factors that may determine optimal performance of the wearable camera 4104 and/or optimal battery life management. In some embodiments, the processor 210 may receive location information based on GPS coordinates of the hearing aid system or input from the user 100. In some embodiments, the location information may be a user setting selectable by the user 100. For example, the wearable camera 4104 may be configured to operate at a predetermined frame rate based on location-based user settings, such as in urban environments, rural environments, crowded locations, sparse locations, low lighting environments, and the like.
另外地或可替代地,可以基于用户设置来调整图像捕捉参数。例如,用户100可以基于特定情况的环境根据需要增加或减少可穿戴相机4104的帧速率。在一些实施例中,用户100可以选择可穿戴相机4104的预定义设置(例如,通过口头命令或经由配对设备的用户界面),并引起对图像捕捉参数的调整。例如,可穿戴相机4104可以被配置为允许用户100在几个设置中进行选择,每个设置用可穿戴相机4104的一个帧速率编程。例如,用户100可以选择节能设置以节省功率,导致可穿戴相机4104以较慢的帧速率操作。例如,用户100可以选择高性能设置以最大限度地提高视频质量,这可以导致可穿戴相机4104以更高的帧速率操作。在一些实施例中,可穿戴相机4104可以基于说话者的身份来调整帧速率。例如,基于先前的交互,可以知道人4200具有特定的语速,并且每当人4200进入可穿戴相机4104的视野时,可穿戴相机4104可以自动选择用于可穿戴相机4104的帧速率。可替代地,用户100可以基于可穿戴相机4104先前检测到的人来选择某些帧速率设置。Additionally or alternatively, image capture parameters may be adjusted based on user settings. For example, the user 100 may increase or decrease the frame rate of the wearable camera 4104 as needed based on the circumstances of the particular situation. In some embodiments, the user 100 may select predefined settings for the wearable camera 4104 (e.g., via a verbal command or via the paired device's user interface) and cause adjustments to the image capture parameters. For example, the wearable camera 4104 may be configured to allow the user 100 to select among several settings, each programmed with a frame rate of the wearable camera 4104. For example, user 100 may select a power-saving setting to save power, causing wearable camera 4104 to operate at a slower frame rate. For example, user 100 may select a high-performance setting to maximize video quality, which may result in wearable camera 4104 operating at a higher frame rate. In some embodiments, the wearable camera 4104 can adjust the frame rate based on the identity of the speaker. For example, based on previous interactions, person 4200 may be known to have a specific speech rate, and wearable camera 4104 may automatically select a frame rate for wearable camera 4104 whenever person 4200 comes into view of wearable camera 4104. Alternatively, user 100 may select certain frame rate settings based on persons previously detected by wearable camera 4104.
图43示出了符合所公开实施例的调整可穿戴相机的捕捉参数的示例性过程的流程图。FIG. 43 shows a flowchart of an exemplary process for adjusting capture parameters of a wearable camera consistent with the disclosed embodiments.
在步骤4302中,至少一个处理器(例如,处理器210)可以接收由相机捕捉的多个图像。多个图像可以是静止图像或包含多个静止图像的视频剪辑。图像可以包含在可穿戴相机4104的视场中的对象或人。例如,可穿戴相机4104可以捕捉图像4214,其示出位于锥体4203内的人4200的脸部。在一些实施例中,图像4214可以包含人4200的唇部4214a或其运动的图像。In step 4302, at least one processor (e.g., processor 210) may receive a plurality of images captured by the camera. The multiple images may be still images or video clips containing multiple still images. The images may include objects or people in the field of view of the wearable camera 4104. For example, wearable camera 4104 may capture image 4214 showing the face of person 4200 located within cone 4203. In some embodiments, image 4214 may include an image of person 4200's lips 4214a or movement thereof.
在步骤4304中,至少一个处理器(例如,处理器210)可以接收表示由至少一个麦克风捕捉的声音的音频信号。例如,可穿戴相机4104可以捕捉从锥体4203内发出的声波,诸如由人4200生成的声波4202。可穿戴相机4104还可以捕捉从锥体4203外部发出的其他声波,诸如声波4204a和/或声波4204b。In step 4304, at least one processor (e.g., processor 210) may receive an audio signal representing sound captured by at least one microphone. For example, wearable camera 4104 may capture sound waves emanating from within cone 4203, such as sound waves 4202 generated by person 4200. Wearable camera 4104 may also capture other sound waves emanating from outside of cone 4203, such as sound waves 4204a and/or sound waves 4204b.
在步骤4306中,至少一个处理器(例如,处理器210)可以识别多个图像中的至少一个中的至少一个个体的表示。在一些实施例中,处理器210可以被配置为执行一种或多种图像分类算法以识别图像4214中是否存在人。在一些实施例中,处理器210可以执行面部识别程序或算法以识别图像4214内的一个或多个面部。例如,处理器210可以识别在图像4214中捕捉的人4200的脸上的面部特征,诸如眼睛、鼻子、颧骨、下巴、唇部或其他特征。处理器210可以使用一种或多种算法来分析检测到的特征,诸如主分量分析(例如,使用本征脸)、线性判别分析、弹性束图匹配(例如,使用Fisher脸)、局部二进制模式直方图(LBPH)、尺度不变特征变换(SIFT)、加速鲁棒特征(SURF)等。还可以使用诸如三维识别、皮肤纹理分析和/或热成像的其他面部识别技术来识别个体。除了面部特征之外的其他特征也可以用于识别,诸如身高、体型或人4200的其他区别特征。In step 4306, at least one processor (e.g., processor 210) may identify a representation of at least one individual in at least one of the plurality of images. In some embodiments, processor 210 may be configured to execute one or more image classification algorithms to identify whether a person is present in image 4214. In some embodiments, processor 210 may execute a facial recognition program or algorithm to recognize one or more faces within image 4214. For example, processor 210 may identify facial features on the face of person 4200 captured in image 4214, such as the eyes, nose, cheekbones, chin, lips, or other features. The processor 210 may analyze the detected features using one or more algorithms, such as principal component analysis (e.g., using eigenfaces), linear discriminant analysis, elastic bunch graph matching (e.g., using Fisherfaces), local binary pattern histograms (LBPH), the scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and so on. Other facial recognition techniques such as three-dimensional recognition, skin texture analysis, and/or thermal imaging may also be used to identify individuals. Features other than facial features may also be used for identification, such as the height, body shape, or other distinguishing features of the person 4200.
在步骤4308中,至少一个处理器(例如,处理器210)可以检测与该至少一个个体相关联的语速。至少一个处理器(例如,处理器210)可以基于所捕捉的多个图像或音频信号或两者来检测与至少一个个体相关联的语速。In step 4308, at least one processor (e.g., processor 210) may detect a speech rate associated with the at least one individual. The at least one processor (e.g., processor 210) may detect the speech rate associated with the at least one individual based on the captured plurality of images, the audio signals, or both.
在一些实施例中,至少一个处理器(例如,处理器210)可以基于对多个图像的分析来识别与至少一个个体的嘴相关联的至少一个唇部移动。处理器210可以基于对多个图像(例如,图像4214)的分析来识别与个体的嘴相关联的至少一个唇部移动或唇部位置。处理器210可以被配置为识别与个体的嘴相关联的一个或多个点。在一些实施例中,处理器210可以开发与个体的嘴相关联的轮廓,该轮廓可以定义与个体的嘴或唇部相关联的边界。可以在多个帧或图像上跟踪在图像中识别出的唇部,以识别唇部移动。因此,处理器210可以使用如上所述的各种视频跟踪算法。例如,处理器210可以识别图像4214内的人4200的唇部4214a,并且可分析其运动。在一些情况下,处理器210可以识别特定面部表情(诸如唇部移动)与特定声音或声音波动之间的相关性。例如,与特定唇部移动相关的面部表情可以与在第一音频信号中捕捉的对话期间可能已经说过的声音或词语相关联。在一些实施例中,通过诸如训练过的神经网络的基于计算机的模型来执行对多个图像的分析。例如,训练过的神经网络可以被训练以接收与个体的面部表情相关的图像和/或视频数据,并预测与所接收的图像和/或视频数据相关联的声音。作为另一示例,训练过的神经网络可以被训练以接收与个体的面部表情和声音有关的图像和/或视频数据,并输出该面部表情是否对应于该声音。在一些实施例中,可以在由可穿戴相机捕捉的一个或多个图像中识别其他因素,诸如个体的手势、个体的位置、个体面部的朝向等。通过将唇部移动与预测的声音或词语相关联,处理器210可以确定在一段时间内与在图像4214中捕捉的唇部移动相关联的词语或声音的数量。至少一个处理器(例如,处理器210)可以基于唇部移动来确定语速。In some embodiments, at least one processor (e.g., processor 210) may identify at least one lip movement associated with the mouth of the at least one individual based on analysis of the plurality of images. Processor 210 may identify at least one lip movement or lip position associated with the individual's mouth based on analysis of the plurality of images (e.g., image 4214). The processor 210 may be configured to identify one or more points associated with the individual's mouth. In some embodiments, the processor 210 may develop a contour associated with the individual's mouth, which may define a boundary associated with the individual's mouth or lips. Lips identified in an image can be tracked over multiple frames or images to identify lip movement. Accordingly, the processor 210 may use various video tracking algorithms as described above. For example, processor 210 may identify lips 4214a of person 4200 within image 4214 and may analyze their movement. In some cases, processor 210 may identify correlations between particular facial expressions, such as lip movements, and particular sounds or sound fluctuations. For example, a facial expression associated with a particular lip movement may be associated with sounds or words that may have been spoken during the conversation captured in the first audio signal. In some embodiments, the analysis of the plurality of images is performed by a computer-based model, such as a trained neural network. For example, a trained neural network may be trained to receive image and/or video data associated with an individual's facial expressions and to predict sounds associated with the received image and/or video data. As another example, a trained neural network may be trained to receive image and/or video data related to an individual's facial expression and a voice, and to output whether the facial expression corresponds to the voice. In some embodiments, other factors may be identified in one or more images captured by the wearable camera, such as the individual's gestures, the individual's location, the orientation of the individual's face, and the like. By associating lip movements with predicted sounds or words, processor 210 can determine the number of words or sounds associated with the lip movements captured in image 4214 over a period of time. At least one processor (e.g., processor 210) can determine the speech rate based on the lip movements.
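As a rough illustration of estimating a speech rate from tracked lip movement, the sketch below counts opening/closing cycles of a per-frame mouth-opening measurement (for example, the distance between upper- and lower-lip landmarks). The landmark extraction itself is assumed to have already happened, and the zero-crossing heuristic is an illustrative stand-in for the trained models described above:

```python
def lip_movement_rate(mouth_openings, fps):
    """Estimate speech activity from per-frame mouth openings.

    mouth_openings: one value per frame (assumed landmark distance)
    fps: frame rate of the wearable camera

    Counts zero-crossings of the mean-removed signal as a rough
    proxy for open/close cycles per second.
    """
    mean = sum(mouth_openings) / len(mouth_openings)
    centered = [o - mean for o in mouth_openings]
    crossings = sum(1 for a, b in zip(centered, centered[1:]) if a * b < 0)
    seconds = len(mouth_openings) / fps
    return (crossings / 2) / seconds  # cycles per second

# An alternating open/closed mouth over 20 frames at 20 fps.
rate = lip_movement_rate([0, 1] * 10, fps=20)
```

A higher cycle rate would then map to a higher assumed speech rate, feeding the frame-rate adjustment of step 4310.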
在一些实施例中,至少一个处理器(例如,处理器210)可以分析所接收的音频。例如,处理器210可以识别与声波4202相关联的语音4212,该语音4212包含人4200的语音。在一些实施例中,处理器210可以分析从可穿戴相机4104的麦克风接收的声音,以使用任何当前已知或未来开发的技术或算法将声波4202与声波4204a和4204b分开。例如,处理器210可以接收表示从用户100的环境中的对象发出的声音的音频信号,并且分析所接收的音频信号以获得与一个声音发出对象相关联的隔离音频流。In some embodiments, at least one processor (e.g., processor 210) can analyze the received audio. For example, the processor 210 may identify speech 4212 associated with the sound wave 4202, the speech 4212 comprising the speech of the person 4200. In some embodiments, processor 210 may analyze sound received from the microphone of wearable camera 4104 to separate sound waves 4202 from sound waves 4204a and 4204b using any currently known or future developed techniques or algorithms. For example, processor 210 may receive audio signals representing sounds emanating from objects in the environment of user 100 and analyze the received audio signals to obtain an isolated audio stream associated with a sound emitting object.
基于音频分析,至少一个处理器(例如,处理器210)可以识别个体所说的多个词语。处理器210可以被配置为识别个体4200所说的词语。例如,处理器210可以分析声波4202以识别声波4202中的特定音素、音素组合或词语。在一些实施例中,处理器210可以使用各种语音到文本算法来识别所说的词语。在一些实施例中,识别多个词语可以包括使用语音识别算法。语音识别可以指机器、处理器或程序用于接收并且解释声音、语音、口述或类似的能力。例如,语音识别算法可以由处理器210执行以识别或解释在声音中接收到的词语或命令。语音识别算法例如可以通过AI、深度学习算法、神经嵌入模型或本领域中的其他已知方法来实现。至少一个处理器(例如,处理器210)可以基于多个词语来确定语速。Based on the audio analysis, at least one processor (e.g., processor 210) can identify a plurality of words spoken by the individual. Processor 210 may be configured to identify words spoken by individual 4200. For example, the processor 210 may analyze the sound waves 4202 to identify particular phonemes, phoneme combinations, or words in the sound waves 4202. In some embodiments, processor 210 may use various speech-to-text algorithms to recognize spoken words. In some embodiments, recognizing the plurality of words may include using a speech recognition algorithm. Speech recognition may refer to the ability of a machine, processor, or program to receive and interpret sound, speech, dictation, or the like. For example, speech recognition algorithms may be executed by the processor 210 to recognize or interpret words or commands received in the voice. Speech recognition algorithms may be implemented, for example, through AI, deep learning algorithms, neural embedding models, or other methods known in the art. At least one processor (e.g., processor 210) may determine the speech rate based on the plurality of words.
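A minimal sketch of deriving a speech rate from the recognized words, assuming the speech-to-text step supplies per-word timestamps (the tuple format here is an assumption for illustration, not an interface from the disclosed embodiments):

```python
def speech_rate_wpm(word_timestamps):
    """Words per minute from (word, start_sec, end_sec) tuples
    produced by an assumed speech-to-text step."""
    if len(word_timestamps) < 2:
        return 0.0
    start = word_timestamps[0][1]   # start of the first word
    end = word_timestamps[-1][2]    # end of the last word
    return len(word_timestamps) / ((end - start) / 60.0)

# Two words spanning five seconds of speech -> roughly 24 words/minute.
rate = speech_rate_wpm([("hello", 0.0, 0.4), ("there", 4.5, 5.0)])
```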
在步骤4310中,至少一个处理器(例如,处理器210)可以基于检测到的语速来对可穿戴相机的一个或多个图像捕捉参数进行调整。在一个示例中,当处理器210检测到人4200的唇部移动量大于阈值时,处理器210可以确定高语速。处理器210可以在人4200的唇部移动量高时增加可穿戴相机4104的帧速率。这可以允许助听器系统通过确保捕捉足够数量的唇部移动图像来保持唇部跟踪算法的准确性。在另一示例中,当处理器210检测到小于阈值的语速时,处理器210可以确定已经检测到人4200的低量唇部移动。处理器210可以在人4200的唇部移动量低时减少可穿戴相机4104的帧速率。这可以减少功率使用和/或存储器存储使用,而不损害唇部跟踪算法的准确性。In step 4310, at least one processor (e.g., processor 210) may adjust one or more image capture parameters of the wearable camera based on the detected speech rate. In one example, the processor 210 may determine a high speech rate when the processor 210 detects that the amount of lip movement of the person 4200 is greater than a threshold. The processor 210 may increase the frame rate of the wearable camera 4104 when the lip movement of the person 4200 is high. This may allow the hearing aid system to maintain the accuracy of the lip tracking algorithm by ensuring that a sufficient number of lip movement images are captured. In another example, when processor 210 detects a speech rate that is less than a threshold, processor 210 may determine that a low amount of lip movement of person 4200 has been detected. The processor 210 may reduce the frame rate of the wearable camera 4104 when the lip movement of the person 4200 is low. This can reduce power usage and/or memory storage usage without compromising the accuracy of the lip tracking algorithm.
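The threshold logic of step 4310 might be sketched as follows; the threshold values, step size, and fps limits are illustrative assumptions, not values specified by the embodiments:

```python
def adjust_frame_rate(speech_rate_wpm, current_fps,
                      low=100, high=160,
                      min_fps=10, max_fps=60, step=10):
    """Raise the frame rate for fast speech, lower it for slow
    speech, and clamp to the camera's assumed supported range."""
    if speech_rate_wpm > high:
        return min(current_fps + step, max_fps)
    if speech_rate_wpm < low:
        return max(current_fps - step, min_fps)
    return current_fps

faster = adjust_frame_rate(180, current_fps=30)  # fast talker: 40 fps
slower = adjust_frame_rate(50, current_fps=30)   # slow talker: 20 fps
```

In practice the controller would be rate-limited and hysteretic so the frame rate does not oscillate between adjacent settings mid-conversation.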
至少一个麦克风可以具有用于捕捉音频信号(特别是语音)的一个或多个参数,还可以根据语速或其他参数或设置来调整。在一些实施例中,至少一个处理器还可以被编程为基于检测到的语速引起对至少一个麦克风的一个或多个音频捕捉参数的调整。音频捕捉参数的示例可以包括音调、音高、振幅、灵敏度水平、频率和/或采样速率。例如,至少一个处理器可以对到达至少一个麦克风的音频信号应用滤波器,以选择性地捕捉具有特定音调、音高或频率的音频信号。作为另一示例,该至少一个处理器可以放大或衰减到达至少一个麦克风的音频信号,以改变捕捉的音频信号的振幅,和/或改变至少一个麦克风的灵敏度。The at least one microphone may have one or more parameters for capturing audio signals (particularly speech), which may also be adjusted according to the speech rate or other parameters or settings. In some embodiments, the at least one processor may also be programmed to cause adjustments to one or more audio capture parameters of the at least one microphone based on the detected rate of speech. Examples of audio capture parameters may include tone, pitch, amplitude, sensitivity level, frequency, and/or sampling rate. For example, the at least one processor may apply a filter to the audio signal arriving at the at least one microphone to selectively capture the audio signal having a particular tone, pitch, or frequency. As another example, the at least one processor may amplify or attenuate the audio signal arriving at the at least one microphone to change the amplitude of the captured audio signal, and/or change the sensitivity of the at least one microphone.
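Filtering the incoming audio to favor certain frequencies, as described above, can be illustrated with a one-pole low-pass filter. This is a deliberately simple stand-in for whatever filter bank a real hearing aid would use; `alpha` is an assumed smoothing coefficient:

```python
def low_pass(samples, alpha):
    """One-pole low-pass filter: attenuates high-frequency content.
    alpha in (0, 1]; smaller alpha means stronger smoothing."""
    out, prev = [], 0.0
    for s in samples:
        prev = prev + alpha * (s - prev)  # move partway toward the input
        out.append(prev)
    return out

# A step input is followed gradually rather than instantly.
filtered = low_pass([1.0, 1.0, 1.0], alpha=0.5)  # [0.5, 0.75, 0.875]
```

Amplification or attenuation, the other adjustment mentioned, is simply multiplying each sample by a gain factor.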
在一些实施例中,至少一个处理器(例如,处理器210)可以基于在步骤4308中检测到的语速来调整至少一个麦克风的采样速率。采样速率可以是指设备对信号进行采样的时间频率。例如,以更高采样速率记录(即采样)的音频信号将包括比以较低采样速率记录的更多的音频信号。例如,当处理器210确定检测到的语速高于阈值时,处理器210可以使至少一个麦克风的采样速率增加,以每单位时间记录更多的声波4202。这可能是希望的,因为当语速高时,每单位时间可以说出更多的词语,并且较低的采样速率可能影响诸如词语/语音识别之类的特征。In some embodiments, at least one processor (e.g., processor 210) may adjust the sampling rate of the at least one microphone based on the speech rate detected in step 4308. The sampling rate may refer to the time frequency at which the device samples the signal. For example, an audio signal recorded (i.e., sampled) at a higher sampling rate will include more of the audio signal than one recorded at a lower sampling rate. For example, when the processor 210 determines that the detected speech rate is above a threshold, the processor 210 may increase the sampling rate of the at least one microphone to record more of sound waves 4202 per unit of time. This may be desirable because when the speech rate is high, more words can be spoken per unit time, and a lower sampling rate may affect features such as word/speech recognition.
可替代地,例如,当处理器确定检测到的语速低于阈值时,处理器可以使至少一个麦克风的采样速率降低,以每单位时间记录更少的声波4202。当语速较低时,较低的采样速率仍可以充分捕捉足够的细节以保持诸如词语和/或语音识别的特征的完整性。Alternatively, for example, when the processor determines that the detected speech rate is below a threshold, the processor may reduce the sampling rate of at least one microphone to record fewer sound waves 4202 per unit of time. When speech rates are low, the lower sampling rates can still adequately capture enough detail to preserve the integrity of features such as word and/or speech recognition.
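The sampling-rate adjustment described in the last two paragraphs reduces to a simple threshold rule; the threshold and the two rates below are illustrative assumptions (16 kHz and 48 kHz are merely common audio sampling rates, not values from the embodiments):

```python
def select_sampling_rate(speech_rate_wpm, threshold=140,
                         low_rate=16_000, high_rate=48_000):
    """Pick a microphone sampling rate (Hz) from the detected speech
    rate: fast speech gets the higher rate to preserve word/speech
    recognition, slow speech gets the power-saving lower rate."""
    return high_rate if speech_rate_wpm > threshold else low_rate

fast_talker = select_sampling_rate(180)  # 48000 Hz
slow_talker = select_sampling_rate(90)   # 16000 Hz
```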
处理可变音频质量的音频信号Handling audio signals with variable audio quality
如上所述,在将音频呈现给用户之前,可以处理从用户的环境内捕捉的音频信号。该处理可以包括各种调节或增强以改善用户的体验。例如,如上所述,来自用户正在看着的个体的语音可以被放大,而诸如背景噪声、来自其他说话者的语音等的其他音频可以被静音或衰减。因此,用户可以更容易地听到和理解来自与用户交谈的个体的语音。As described above, audio signals captured from within the user's environment may be processed prior to presentation of the audio to the user. The processing may include various adjustments or enhancements to improve the user's experience. For example, as described above, speech from the individual the user is looking at may be amplified, while other audio such as background noise, speech from other speakers, etc. may be muted or attenuated. Therefore, the user can more easily hear and understand the speech from the individual with whom the user is conversing.
该处理的质量和/或有效性可以取决于如何捕捉和/或处理音频信号的各个方面或变量。这些方面可能导致在经处理音频信号的质量和诸如音频延迟、电池寿命等其他因素之间的各种折衷。例如,许多音频处理技术使用收集的音频的缓冲器来执行处理。具体地,该系统可以分析在当前正在处理的音频样本之前和之后捕捉的样本,以收集可能改进样本处理的附加信息。例如,系统可以使用0-30秒累积音频的滑动窗口。滑动窗口可以包括在被处理的样本之前的样本,但也可以包括在被处理的样本之后的“前瞻(lookahead)”样本。更长的时间窗口提供更大量的音频样本数据,从而得到更高质量的输出。The quality and/or effectiveness of this processing may depend on various aspects or variables of how the audio signal is captured and/or processed. These aspects can lead to various tradeoffs between the quality of the processed audio signal and other factors such as audio latency, battery life, and the like. For example, many audio processing techniques use buffers of collected audio to perform processing. Specifically, the system can analyze samples captured before and after the audio sample currently being processed to gather additional information that may improve sample processing. For example, the system could use a sliding window of 0-30 seconds of accumulated audio. The sliding window may include samples preceding the processed samples, but may also include "lookahead" samples that follow the processed samples. Longer time windows provide a larger amount of audio sample data, resulting in higher quality output.
然而,更长的前瞻意味着更长的延迟。由于更长的前瞻时段,为了处理给定的样本,系统必须等待更长的时间直到收集到未来的样本,这必然会造成处理样本的延迟。例如,如果所需的前瞻时段是一秒,则通过处理当前样本而输出的音频将延迟至少一秒。在一些情况下,这种延迟可能会使用户感到不愉快或分心。例如,当与另一个个体说话时,来自该个体的讲话可能与他或她的唇部移动不匹配。此外,延迟可能会在对话中造成不舒服的停顿。因此,在处理音频的时间延迟和所执行的音频处理的质量之间存在折衷。类似的折衷也可能与音频信号的捕捉和处理的其他方面相关联。例如,音频质量可以取决于其他方面,诸如是否使用相机来帮助识别活跃讲话者、所使用的麦克风的数量等。这些变量可以为音频质量与处理延迟、音频质量与电池消耗或其他折衷产生类似的折衷。However, a longer look-ahead means a longer delay. Because of the longer look-ahead period, in order to process a given sample, the system must wait longer until the future samples are collected, which necessarily creates a delay in processing the sample. For example, if the desired look-ahead period is one second, the audio output by processing the current sample will be delayed by at least one second. In some cases, this delay may be unpleasant or distracting to the user. For example, when speaking with another individual, the speech from that individual may not match his or her lip movements. In addition, the delay may create uncomfortable pauses in the conversation. Therefore, there is a tradeoff between the time delay of processing the audio and the quality of the audio processing performed. Similar tradeoffs may also be associated with other aspects of the capture and processing of the audio signal. For example, audio quality may depend on other aspects, such as whether a camera is used to help identify an active speaker, the number of microphones used, and so on. These variables can produce similar tradeoffs of audio quality versus processing delay, audio quality versus battery consumption, or other tradeoffs.
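The relationship between look-ahead and latency can be made concrete with a toy streaming smoother: each emitted sample is processed together with the `lookahead` samples that follow it, so the output necessarily lags the input by `lookahead` samples. The moving average here is an illustrative stand-in for the real enhancement processing:

```python
from collections import deque

class LookaheadSmoother:
    """Emits each sample only after `lookahead` further samples have
    arrived, smoothing it with that future context (a toy stand-in
    for the buffered enhancement described above)."""

    def __init__(self, lookahead):
        self.lookahead = lookahead
        self.buf = deque()

    def push(self, sample):
        self.buf.append(sample)
        if len(self.buf) <= self.lookahead:
            return None  # still filling the buffer: this is the latency
        # Smooth the oldest buffered sample using its look-ahead context.
        out = sum(self.buf) / len(self.buf)
        self.buf.popleft()
        return out

s = LookaheadSmoother(lookahead=2)
outputs = [s.push(x) for x in [1.0, 2.0, 3.0, 4.0]]
# outputs: [None, None, 2.0, 3.0] -- two samples of delay, then each
# value averaged with the two samples that follow it.
```

Growing `lookahead` gives each output sample more future context (better processing) at the cost of a proportionally longer delay, which is exactly the tradeoff the text describes.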
所公开的系统和方法可以确定如何以各种方式平衡这些折衷。在一些实施例中,用户可以手动调整一个或多个设置以平衡这些折衷。不同的用户可以具有不同的偏好,因此可以应用关于音频质量的不同设置。例如,一些用户可能更能容忍较低的音频处理质量,因此可能更喜欢较短的延迟时间。其他用户可能更依赖于音频处理来听到,因此可能更能容忍延迟。此外,同一用户在不同情况下可能具有不同的偏好。例如,在具有低背景噪声的面对面会议中,可能优选较短的时间延迟。另一方面,当在嘈杂的多个说话者环境中时,用户可能偏好更高的音频处理质量。The disclosed systems and methods can determine how to balance these tradeoffs in various ways. In some embodiments, the user may manually adjust one or more settings to balance these tradeoffs. Different users may have different preferences and thus different settings regarding audio quality may be applied. For example, some users may be more tolerant of lower audio processing quality and may therefore prefer shorter delay times. Other users may rely more on audio processing to hear and therefore may be more tolerant of delays. Furthermore, the same user may have different preferences in different situations. For example, in face-to-face meetings with low background noise, a shorter time delay may be preferred. On the other hand, users may prefer higher audio processing quality when in a noisy multi-speaker environment.
在其他实施例中,系统可以自动调整一个或多个设置以实现音频质量的最佳折衷。例如,用户可以输入关于音频质量或时间延迟的反馈,并且系统可以调整一个或多个设置。在一些实施例中,系统可以根据多个方案并行地处理音频信号,并确定哪种方案提供最佳折衷。例如,系统可以执行具有短延迟的第一处理和具有更长延迟的第二处理,并且可以对所得到的经处理音频信号进行比较以确定哪种方案提供最佳结果。因此,所公开的实施例可以提供比现有技术助听器设备更好的效率、方便性和功能性。In other embodiments, the system may automatically adjust one or more settings to achieve the best compromise in audio quality. For example, the user can enter feedback on audio quality or time delay, and the system can adjust one or more settings. In some embodiments, the system may process audio signals in parallel according to multiple schemes and determine which scheme provides the best compromise. For example, the system may perform a first process with a short delay and a second process with a longer delay, and the resulting processed audio signals may be compared to determine which approach provides the best results. Thus, the disclosed embodiments may provide better efficiency, convenience and functionality than prior art hearing aid devices.
图44示出了可以根据所公开实施例进行处理的示例音频信号4410。如上所述,音频信号4410可以由可穿戴装置110的一个或多个麦克风(诸如麦克风443或444)捕捉。在一些实施例中,音频信号4410可以从多个麦克风(诸如麦克风阵列)接收。音频信号4410可以包括来自可穿戴装置110的用户的环境的声音的表示。音频信号因此可以包括来自一个或多个个体的语音、背景噪声、音乐和/或可穿戴装置110在将其呈现给用户之前处理的其他声音。例如,系统可以选择性地调节音频样本4410,以衰减背景噪声,放大来自特定源(例如,用户正在看着的对象或人)的声音,调整音频信号的音高,调整音频信号的回放速率,从信号中去除噪声或伪影,执行音频压缩,或执行其他增强以改善用户的音频质量。FIG. 44 shows an example audio signal 4410 that may be processed in accordance with the disclosed embodiments. As described above, audio signal 4410 may be captured by one or more microphones of wearable device 110, such as microphone 443 or 444. In some embodiments, audio signal 4410 may be received from multiple microphones, such as a microphone array. Audio signal 4410 may include a representation of sounds from the environment of the user of wearable device 110. The audio signal may thus include speech from one or more individuals, background noise, music, and/or other sounds that the wearable device 110 processes before presenting it to the user. For example, the system may selectively condition audio samples 4410 to attenuate background noise, amplify sound from a particular source (e.g., an object or person the user is looking at), adjust the pitch of the audio signal, adjust the playback rate of the audio signal, remove noise or artifacts from the signal, perform audio compression, or perform other enhancements to improve the user's audio quality.
如上所述,经处理音频信号的质量取决于各种因素,包括在当前正在分析的音频样本之后为收集和分析附加音频样本所允许的时间延迟。例如,如图44所示,系统可以处理包括在音频信号4410内的音频样本4412。该处理可以包括贯穿本公开的各种增强或调节中的任何一个。所公开的系统还可以分析先前和/或随后的音频样本以改进当前样本的处理。如图44所示,这可以包括一个或多个“前瞻”样本4414。增加包括在前瞻样本4414中的信息量可以得到音频处理的更高质量,因为系统可以能够更好地确定音频信号的进展。例如,给定更大的样本数据范围,系统可以能够更有效地减少来自音频信号4410的背景或其他噪声。为了处理音频样本4412,系统可以引入时间延迟(t),以允许处理前瞻样本4414的时间。因此,更大的时间延迟(t)可以允许对经处理音频信号增加的音频质量。用户经历的时间延迟可以大于图44所示的时间延迟(t)。例如,用户经历的实际时间延迟可以包括用于处理音频样本4412、将其发送到助听器设备或可能增加延迟的其他步骤的附加延迟。尽管图44示出了单个前瞻样本4414,但应当理解,这可以包括一个以上的样本,因此前瞻样本4414可以被划分为多个样本。As mentioned above, the quality of the processed audio signal depends on various factors, including the time delay allowed for additional audio samples to be collected and analyzed after the audio sample currently being analyzed. For example, as shown in FIG. 44, the system may process audio samples 4412 included within audio signal 4410. The processing may include any of the various enhancements or adjustments described throughout this disclosure. The disclosed system may also analyze previous and/or subsequent audio samples to improve processing of the current sample. As shown in FIG. 44, this may include one or more "look-ahead" samples 4414. Increasing the amount of information included in the look-ahead samples 4414 may result in higher-quality audio processing, as the system may be better able to determine the progress of the audio signal. For example, given a larger range of sample data, the system may be able to reduce background or other noise from audio signal 4410 more effectively. In order to process audio samples 4412, the system may introduce a time delay (t) to allow time for look-ahead samples 4414 to be processed. Therefore, a larger time delay (t) may allow for increased audio quality of the processed audio signal. The time delay experienced by the user may be greater than the time delay (t) shown in FIG. 44. For example, the actual time delay experienced by the user may include additional delays for processing the audio samples 4412, sending them to the hearing aid device, or other steps that may increase the delay. Although FIG. 44 shows a single look-ahead sample 4414, it should be understood that this may include more than one sample, and thus the look-ahead sample 4414 may be divided into multiple samples.
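The look-ahead arrangement described above may be illustrated with the following sketch. The sample values, window sizes, and the simple moving-average stand-in for the actual conditioning are illustrative assumptions, not part of the disclosed embodiments; the sketch only shows how a longer look-ahead window (a larger time delay t) gives the processor more future context.

```python
# Illustrative sketch only: the averaging "enhancement" is a stand-in for
# whatever conditioning the system applies, and the sample values are assumed.

def process_with_lookahead(signal, index, lookahead):
    """Process signal[index] using up to `lookahead` subsequent samples.

    A larger `lookahead` (a longer time delay t) gives the processor more
    future context, modeled here as a wider averaging window.
    """
    window = signal[index:index + 1 + lookahead]
    return sum(window) / len(window)

raw = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
# No look-ahead: each output sample depends only on the current sample.
no_delay = [process_with_lookahead(raw, i, 0) for i in range(len(raw))]
# Two samples of look-ahead: smoother output, at the cost of added delay.
delayed = [process_with_lookahead(raw, i, 2) for i in range(len(raw))]
```

In this toy model, the zero-delay output simply reproduces the input, while the look-ahead output is smoothed by the future samples it was allowed to see.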
可以以各种方式确定时间延迟与音频质量之间的适当的或期望的平衡。在一些实施例中,可以基于来自用户的输入来定义时间延迟。例如,用户可以提供指示对更高音频质量的偏好或对较短处理延迟的偏好的输入。如本文所使用的,音频质量可以指系统如何有效地处理音频信号的任何度量。在一些实施例中,音频质量可以指特定形式的调节或增强被应用的程度。例如,如果执行音频信号的选择性调节以衰减相对于个体语音的背景噪声,则音频质量可以指背景噪声衰减的程度、系统在语音和背景噪声之间区分的程度、语音的清晰度、系统能够识别特定个体的语音的程度或语音被放大的程度。音频质量还可以指得到的音频信号的更一般的属性,诸如采样速率、信号中有多少噪声等。The appropriate or desired balance between time delay and audio quality can be determined in various ways. In some embodiments, the time delay may be defined based on input from the user. For example, a user may provide input indicating a preference for higher audio quality or a preference for shorter processing delays. As used herein, audio quality can refer to any measure of how effectively a system processes an audio signal. In some embodiments, audio quality may refer to the degree to which a particular form of adjustment or enhancement is applied. For example, if selective conditioning of the audio signal is performed to attenuate background noise relative to an individual's speech, audio quality may refer to the degree to which background noise is attenuated, the degree to which the system distinguishes between speech and background noise, the clarity of the speech, the degree to which the system is able to recognize the speech of a particular individual, or the degree to which the speech is amplified. Audio quality can also refer to more general properties of the resulting audio signal, such as the sampling rate, how much noise is in the signal, and so on.
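One of the quality measures mentioned above, the degree to which background noise is attenuated, could be quantified as follows. The energy and decibel formulas are standard; their use here as the system's quality metric, and the example sample values, are assumptions for illustration only.

```python
import math

# Hypothetical quality metric: noise attenuation in decibels, computed from
# the signal energy before and after conditioning. Sample values are assumed.

def energy(samples):
    """Accumulated energy of a sampled signal (sum of squared samples)."""
    return sum(s * s for s in samples)

def attenuation_db(before, after):
    """How strongly processing attenuated the signal's energy, in dB."""
    return 10 * math.log10(energy(before) / energy(after))

noisy = [0.5, -0.5, 0.5, -0.5]        # input containing background noise
cleaned = [0.05, -0.05, 0.05, -0.05]  # same signal after conditioning
```

A larger attenuation value would indicate that more unwanted energy was removed by the conditioning step.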
图45A示出了符合所公开实施例的示例用户界面4510,通过该示例用户界面用户可以定义处理音频信号的方面。用户界面4510可以是显示在计算设备4500上的图形用户界面,如图45A所示。计算设备4500可以是与用户可以通过其提供输入的可穿戴装置110和/或听觉接口设备1710相关联的任何设备。例如,计算设备4500可以包括移动电话、平板电脑、膝上型计算机、台式计算机、电视、可穿戴设备(例如,智能手表、智能珠宝等)、家庭IoT设备等。在一些实施例中,计算设备4500可以对应于上面描述的计算设备120。在一些实施例中,用户界面4510可以呈现在可穿戴装置110上。FIG. 45A illustrates an example user interface 4510 through which a user may define aspects of processing audio signals, consistent with the disclosed embodiments. User interface 4510 may be a graphical user interface displayed on computing device 4500, as shown in FIG. 45A. Computing device 4500 may be any device associated with wearable device 110 and/or auditory interface device 1710 through which a user may provide input. For example, computing device 4500 may include mobile phones, tablets, laptops, desktop computers, televisions, wearable devices (e.g., smart watches, smart jewelry, etc.), home IoT devices, and the like. In some embodiments, computing device 4500 may correspond to computing device 120 described above. In some embodiments, user interface 4510 may be presented on wearable device 110.
用户界面4510可以包括用户可以通过其提供输入的一个或多个控件。例如,用户界面4510可以包括一个或多个滑块控件4512和4514,如图45A所示。使用滑块控件4512,用户可以指示关于音频处理延迟和音频质量之间的折衷的偏好。向左拖动滑块控件4512可以减少时间延迟,从而限制可用于处理的前瞻音频的量。可替代地,用户可以向右拖动滑块控件4512,这可以增加时间延迟以产生更高质量的经处理音频信号。当用户感到时间延迟和音频质量之间的平衡不理想时,他或她可以访问用户界面4510。例如,在用户与具有最小背景噪声的个体一对一交谈的情况下,音频质量可能不那么重要,因此,用户可能更愿意最小化时间延迟。然而,在其他环境中,诸如拥挤的餐馆,用户可能有听力困难,并且因此可以更容忍延迟以提高音频处理的质量。User interface 4510 may include one or more controls through which a user may provide input. For example, user interface 4510 may include one or more slider controls 4512 and 4514, as shown in FIG. 45A. Using the slider control 4512, the user can indicate a preference regarding the tradeoff between audio processing delay and audio quality. Dragging the slider control 4512 to the left reduces the time delay, thereby limiting the amount of look-ahead audio available for processing. Alternatively, the user can drag the slider control 4512 to the right, which may increase the time delay to produce a higher-quality processed audio signal. The user may access user interface 4510 when he or she feels that the balance between time delay and audio quality is not ideal. For example, in the case of a user talking one-on-one with an individual with minimal background noise, audio quality may not be as important, and thus the user may prefer to minimize the time delay. However, in other environments, such as a crowded restaurant, the user may have difficulty hearing and therefore may be more tolerant of delay to improve the quality of the audio processing.
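The slider behavior described above could be mapped to a delay value roughly as follows. The 0-100 slider scale and the 0-200 ms delay range are assumptions chosen for illustration; the disclosure does not specify particular values.

```python
# Hypothetical mapping from the position of slider control 4512 to a
# look-ahead delay; the 0-100 scale and 0-200 ms range are assumed.

def slider_to_delay_ms(position, min_ms=0, max_ms=200):
    """Map a slider position in [0, 100] to a delay in milliseconds.

    Dragging left (a smaller position) shortens the delay; dragging right
    lengthens it, trading latency for processing quality.
    """
    position = max(0, min(100, position))  # clamp out-of-range input
    return min_ms + (max_ms - min_ms) * position / 100
```

For example, the slider's midpoint would correspond to a 100 ms look-ahead delay under these assumed bounds.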
在一些实施例中,滑块控制器4512可以控制除了音频采样时间延迟之外如何处理音频的其他方面。例如,诸如用于捕捉音频的麦克风的数量、是否执行图像处理以增强音频的处理、或影响处理时间的任何其他变量等因素也可以通过滑块控制器4512来调整或控制。在一些实施例中,还可以为这些变量中的每一个提供单独的控制。例如,用户界面4510可以包括一个或多个复选框、单选按钮、开关或类似的控件,允许用户通过使用相机等来实现增强处理。在一些实施例中,用户界面4510可以允许用户控制音频处理的其他方面。例如,滑块控制器4514可以允许用户定义电池寿命(例如,可穿戴装置110、计算设备120或听觉接口设备1710)与音频处理质量之间的偏好。例如,使用相机来增强音频信号的处理(例如,执行唇部跟踪技术、确定活跃说话者等)可能更快地耗尽可穿戴装置110的电池,因此用户可以使用滑块控件4514来管理或定义这种折衷。在一些实施例中,这可以呈现为二进制选项,诸如进入电池节约器模式的选项。此外,虽然作为示例在图45A中示出了滑块控件4512和4514,但是可以使用各种其他形式的控件。例如,用户可以在文本字段中键入值,该值可以对应于时间延迟(例如,以秒或毫秒为单位)、定义时间延迟的范围或刻度(例如,0-5、0-100等)内的值、百分比或可以指示时间延迟偏好的任何其他值。控件还可以包括复选框、单选按钮、下拉列表、按钮、下拉按钮、切换开关、图标等。In some embodiments, slider control 4512 may control other aspects of how audio is processed in addition to the audio sample time delay. For example, factors such as the number of microphones used to capture audio, whether image processing is performed to enhance the processing of the audio, or any other variable that affects processing time may also be adjusted or controlled via slider control 4512. In some embodiments, separate controls may also be provided for each of these variables. For example, user interface 4510 may include one or more checkboxes, radio buttons, switches, or similar controls that allow the user to enable enhanced processing through the use of a camera or the like. In some embodiments, user interface 4510 may allow the user to control other aspects of audio processing. For example, slider control 4514 may allow a user to define a preference between battery life (e.g., of wearable device 110, computing device 120, or auditory interface device 1710) and audio processing quality. For example, using a camera to enhance the processing of audio signals (e.g., performing lip-tracking techniques, determining active speakers, etc.) may drain the battery of wearable device 110 faster, so the user can use the slider control 4514 to manage or define this tradeoff. In some embodiments, this may be presented as a binary option, such as an option to enter a battery saver mode. Furthermore, although slider controls 4512 and 4514 are shown in FIG. 45A as an example, various other forms of controls may be used. For example, a user may type a value in a text field, which may correspond to a time delay (e.g., in seconds or milliseconds), a value within a range or scale defining the time delay (e.g., 0-5, 0-100, etc.), a percentage, or any other value that may indicate a time delay preference. Controls may also include checkboxes, radio buttons, drop-down lists, buttons, drop-down buttons, toggle switches, icons, and more.
在一些实施例中,可以基于来自用户的反馈来指定如何处理音频的时间延迟或其他方面。不是通过用户界面4510直接控制一个或多个变量,而是用户可以提供关于经处理的音频的反馈,系统可以使用该反馈来调整影响音频质量的一个或多个方面。例如,系统可以提供提示向用户询问“声音质量如何?”,并为用户提供一个或多个响应选项(例如,选择星的数量或以其他方式选择数字评级、选择“大拇指向上”或“大拇指向下”选项、选择诸如“语音不清楚”或“音频延迟”等选项)。基于该响应,系统可以被配置为调整音频处理的一个或多个方面。可以通过各种其他方式获得来自用户的反馈。例如,系统可以检测用户的动作,其可以指示来自用户的反馈。在一些实施例中,反馈可以是明确的。例如,用户可以做出可以被系统识别的拇指向上或向下的手势。在一些实施例中,反馈也可以是隐式的。例如,系统可以基于图像或捕捉的音频来检测用户是否正向说话者倾斜、将他或她的手放在他们的耳朵周围、要求说话者重复自己、或者其他可能指示用户听力困难的动作,这可能提示系统调整音频处理的一个或多个方面。In some embodiments, the time delay or other aspects of how audio is processed may be specified based on feedback from the user. Rather than directly controlling one or more variables through the user interface 4510, the user can provide feedback on the processed audio, which the system can use to adjust one or more aspects that affect audio quality. For example, the system may provide a prompt asking the user "How is the sound quality?" and provide the user with one or more response options (e.g., selecting a number of stars or otherwise selecting a numerical rating, selecting a "thumbs up" or "thumbs down" option, or selecting options such as "Unclear Speech" or "Audio Delay"). Based on the response, the system can be configured to adjust one or more aspects of audio processing. Feedback from users can be obtained in various other ways. For example, the system can detect actions of the user, which may indicate feedback from the user. In some embodiments, the feedback may be explicit. For example, the user can make a thumbs-up or thumbs-down gesture that can be recognized by the system. In some embodiments, feedback may also be implicit. For example, the system can detect, based on images or captured audio, whether the user is leaning toward the speaker, placing his or her hands around his or her ears, asking the speaker to repeat themselves, or other actions that may indicate the user is having difficulty hearing, which may prompt the system to adjust one or more aspects of audio processing.
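One simple way the system could translate explicit feedback options such as those above into a setting change is sketched below. The specific feedback labels, the 20 ms step size, and the 200 ms ceiling are assumptions for illustration, not values from the disclosure.

```python
# Sketch of feedback-driven adjustment. The label strings, step size, and
# delay ceiling are illustrative assumptions.

def adjust_delay(current_ms, feedback, step_ms=20, max_ms=200):
    """Nudge the look-ahead delay based on a user's feedback selection.

    "Audio Delay"    -> the user finds latency bothersome: shorten the delay.
    "Unclear Speech" -> the user needs higher quality: lengthen the delay.
    Any other feedback leaves the setting unchanged.
    """
    if feedback == "Audio Delay":
        return max(0, current_ms - step_ms)
    if feedback == "Unclear Speech":
        return min(max_ms, current_ms + step_ms)
    return current_ms
```

Implicit feedback (e.g., the user leaning toward a speaker) could feed the same adjustment function once it has been classified into one of these categories.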
根据本公开的一些实施例,该系统可以被配置为自动地和动态地调整音频处理的各个方面。例如,该系统可以分析捕捉的或经处理音频信号以确定前瞻样本的适当数量或持续时间。在一些实施例中,这可以包括根据不同的设置或设置方案并行地处理捕捉的音频信号。然后,系统可以对所得到的经处理音频信号进行比较,以确定时间延迟与音频质量的最佳折衷或其他折衷。因此,系统可以在无需用户输入的情况下自动调整音频处理的一个或多个方面。According to some embodiments of the present disclosure, the system may be configured to automatically and dynamically adjust various aspects of audio processing. For example, the system may analyze the captured or processed audio signal to determine the appropriate number or duration of look-ahead samples. In some embodiments, this may include processing the captured audio signals in parallel according to different setups or setup schemes. The system can then compare the resulting processed audio signals to determine the best tradeoff or other tradeoff between time delay and audio quality. Thus, the system can automatically adjust one or more aspects of audio processing without user input.
图45B示出了符合所公开实施例的用于并行处理音频信号的示例过程。系统可以接收音频信号4540,如图45B所示。类似于音频信号4410,音频信号4540可以由可穿戴装置110的一个或多个麦克风(诸如麦克风443或444)捕捉。在一些实施例中,音频信号4540可以从多个麦克风(诸如麦克风阵列)接收。音频信号4540可以包括来自可穿戴装置110的用户的环境的声音的表示,该声音可以包括来自一个或多个个体的语音、背景噪声、音乐和/或可穿戴装置110在将其呈现给用户之前处理的其他声音。FIG. 45B illustrates an example process for parallel processing of audio signals consistent with the disclosed embodiments. The system may receive audio signal 4540, as shown in FIG. 45B. Similar to audio signal 4410, audio signal 4540 may be captured by one or more microphones of wearable device 110, such as microphone 443 or 444. In some embodiments, the audio signal 4540 may be received from multiple microphones, such as a microphone array. Audio signal 4540 may include a representation of sounds from the environment of the user of wearable device 110, which may include speech from one or more individuals, background noise, music, and/or other sounds that the wearable device 110 processes before presenting them to the user.
为了确定时间延迟时段的值或音频处理的其他方面,系统可以执行多个并行音频处理流并比较结果。如图45B所示,系统可以根据第一方案4552和第二方案4562处理音频流4540。如本文所使用的,方案可以是影响如何处理音频信号的一个或多个定义的参数、方面或变量的集合。例如,方案4552可以包括用于前瞻时间延迟时段的第一值,并且方案4562可以包括用于前瞻时间延迟的不同值。例如,方案4552可以与更长的延迟相关联,而方案4562可以与较短的延迟相关联。方案4552和4562可以定义如何处理音频信号的其他方面,诸如用于捕捉音频信号的麦克风的数量、是否使用相机来处理音频信号、或者可能影响音频质量的任何其他变量或设置。其他方面可以涉及用于通过各种方案处理音频信号的内部参数或变量。作为并行处理的结果,系统可以生成第一经处理音频信号4556和第二经处理音频信号4566。To determine the value of the time delay period or other aspects of audio processing, the system may execute multiple parallel streams of audio processing and compare the results. As shown in FIG. 45B, the system may process the audio stream 4540 according to the first scheme 4552 and the second scheme 4562. As used herein, a scheme may be a set of one or more defined parameters, aspects or variables that affect how an audio signal is processed. For example, scheme 4552 may include a first value for the look-ahead time delay period, and scheme 4562 may include a different value for the look-ahead time delay. For example, scheme 4552 may be associated with longer delays, while scheme 4562 may be associated with shorter delays. Schemes 4552 and 4562 may define other aspects of how the audio signal is processed, such as the number of microphones used to capture the audio signal, whether a camera is used to process the audio signal, or any other variables or settings that may affect audio quality. Other aspects may involve internal parameters or variables used to process the audio signal through various schemes. The system may generate a first processed audio signal 4556 and a second processed audio signal 4566 as a result of the parallel processing.
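The parallel processing under two schemes described above can be sketched as follows. The scheme fields, the sample values, and the moving-average stand-in for the real conditioning pipeline are assumptions for illustration only.

```python
# Sketch of parallel processing under two schemes (cf. 4552 and 4562). The
# conditioning here is a simple look-ahead average standing in for the real
# pipeline; scheme contents and sample values are assumed.

def condition(signal, lookahead):
    """Stand-in conditioning: average each sample with its look-ahead window."""
    out = []
    for i in range(len(signal)):
        window = signal[i:i + 1 + lookahead]
        out.append(sum(window) / len(window))
    return out

scheme_long = {"lookahead": 4}   # longer delay, more future context
scheme_short = {"lookahead": 1}  # shorter delay, less future context

audio = [0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0]
processed_long = condition(audio, scheme_long["lookahead"])
processed_short = condition(audio, scheme_short["lookahead"])
```

The two resulting signals can then be compared, as described below, to decide whether the longer delay is worth its cost.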
然后,系统可以比较经处理音频信号4556和4566,以确定哪个方案在方案4552和4562中定义的一个或多个方面与经处理音频信号4556和4566的音频质量之间提供更好的折衷。因此,该系统可以被配置为评估经处理音频信号4556和4566的结果音频质量,这可以以各种方式执行。作为一个示例,当处理音频信号4540时,可以移除某些频率以“清洁”音频信号。例如,这可以包括移除与背景噪声、其他说话者或其他音频源相关联的某些频率。该处理可能导致信号的累积能量的量的减少。因此,在并行处理之后,可以比较经处理音频信号4556的能级4554和经处理音频信号4566的能级4564。如果能级之差很小,这可以指示这两种方案在衰减信号中不想要的噪声方面同样有效。因此,由方案4552引入的附加时间延迟可能不会提供比由方案4562引入的较短时间延迟显著的优势。另一方面,如果能级4554和4564之间的差值相对较大,则这可以指示由方案4552引入的附加时间延迟显著地改善了经处理音频信号的质量,因此可以选择方案4552。因此,比较经处理音频信号4556和4566可以包括将能级4554和4564之间的差值与阈值能级差进行比较。The system may then compare processed audio signals 4556 and 4566 to determine which scheme provides a better compromise between one or more of the aspects defined in schemes 4552 and 4562 and the audio quality of processed audio signals 4556 and 4566. Accordingly, the system may be configured to evaluate the resulting audio quality of the processed audio signals 4556 and 4566, which may be performed in various ways. As one example, when processing the audio signal 4540, certain frequencies may be removed to "clean" the audio signal. For example, this may include removing certain frequencies associated with background noise, other speakers, or other audio sources. This processing may result in a reduction in the amount of accumulated energy of the signal. Thus, after parallel processing, the energy level 4554 of the processed audio signal 4556 and the energy level 4564 of the processed audio signal 4566 can be compared. If the difference in energy levels is small, this may indicate that the two schemes are equally effective at attenuating unwanted noise in the signal. Therefore, the additional time delay introduced by scheme 4552 may not provide a significant advantage over the shorter time delay introduced by scheme 4562. On the other hand, if the difference between energy levels 4554 and 4564 is relatively large, this may indicate that the additional time delay introduced by scheme 4552 significantly improves the quality of the processed audio signal, and scheme 4552 may therefore be selected. Thus, comparing processed audio signals 4556 and 4566 may include comparing the difference between energy levels 4554 and 4564 to a threshold energy level difference.
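The energy-based selection rule described above can be sketched as follows. The threshold value and the sample signals are assumptions; the rule simply keeps the longer-delay scheme only when it removes significantly more energy from the signal.

```python
# Sketch of the scheme-selection rule: compare the energies of the two
# processed outputs against a threshold difference. Threshold and sample
# values are illustrative assumptions.

def energy(samples):
    """Accumulated energy of a sampled signal (sum of squared samples)."""
    return sum(s * s for s in samples)

def select_scheme(long_delay_out, short_delay_out, threshold=0.1):
    """Return "long" if the extra delay pays off, otherwise "short".

    Lower energy after cleaning means more unwanted noise was removed, so a
    large energy gap favors the longer-delay scheme.
    """
    diff = energy(short_delay_out) - energy(long_delay_out)
    return "long" if diff > threshold else "short"
```

When the two outputs are nearly identical in energy, the rule falls back to the shorter delay, matching the tradeoff described in the text.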
可以使用经处理音频信号4556和4566之间的各种其他比较。例如,系统可以分析音频信号的采样速率、信号中的噪声量或经处理音频信号4556和4566的其他特性。在一些实施例中,可以基于来自用户的输入来评估音频质量。例如,系统可以向用户呈现经处理音频信号4556和4566,并且用户可以提供关于一个比另一个提供了多少改进或者它们在音频质量上是否相对接近的输入。Various other comparisons between processed audio signals 4556 and 4566 may be used. For example, the system may analyze the sampling rate of the audio signal, the amount of noise in the signal, or other characteristics of the processed audio signals 4556 and 4566. In some embodiments, audio quality may be assessed based on input from a user. For example, the system may present the processed audio signals 4556 and 4566 to the user, and the user may provide input as to how much improvement one provides over the other or whether they are relatively close in audio quality.
基于经处理音频信号4556和4566的比较,系统可以选择在时间延迟(或其他方面)和音频质量之间提供更好的折衷的方案。该系统还可以例如通过听觉接口设备1710向用户呈现所选择的经处理音频信号。在一些实施例中,可以时段性地执行该并行处理,从而动态地调整音频处理的时间延迟或其他方面。例如,并行处理可以作为每秒、2秒、10秒、60秒、5分钟、10分钟、小时或其他适当时段的检查来执行。在一些实施例中,用户可以例如通过选择用户界面上的校准按钮、可穿戴装置110上的物理按钮等来手动启动并行处理。在一些实施例中,可以基于由可穿戴装置110、计算设备120或听觉接口设备1710检测到的其他提示来启动并行处理。在一些实施例中,可穿戴装置110的相机可以检测用户何时进入不同环境,诸如从安静的车辆移动到嘈杂的餐馆。因此,当用户在餐馆中时,前瞻时间延迟可能变得更加重要,因为可能存在更多必须衰减的背景噪声。作为另一示例,相机可以检测用户环境中的个体是否正在与用户说话,这可以指示需要额外的前瞻时间延迟。并行处理可以基于其他传感器数据来启动,诸如用户的GPS定位的改变、传感器检测到的光线的改变、噪声水平的改变或各种其他传感器数据。在一些实施例中,系统可被配置为基于该信息改变预定时间间隔。例如,在其中调节音频信号可能更重要或可能需要更长的前瞻样本的环境中,系统可以更频繁地执行并行处理。Based on a comparison of processed audio signals 4556 and 4566, the system may select the scheme that provides the better compromise between time delay (or other aspects) and audio quality. The system may also present the selected processed audio signal to the user, e.g., through the auditory interface device 1710. In some embodiments, this parallel processing may be performed periodically to dynamically adjust the time delay or other aspects of audio processing. For example, parallel processing may be performed as a check every second, 2 seconds, 10 seconds, 60 seconds, 5 minutes, 10 minutes, hour, or other suitable time period. In some embodiments, the user may manually initiate parallel processing, e.g., by selecting a calibration button on the user interface, a physical button on the wearable device 110, or the like. In some embodiments, parallel processing may be initiated based on other cues detected by wearable device 110, computing device 120, or auditory interface device 1710. In some embodiments, the camera of the wearable device 110 can detect when the user enters a different environment, such as moving from a quiet vehicle to a noisy restaurant. Thus, when the user is in the restaurant, the look-ahead time delay may become more important because there may be more background noise that must be attenuated. As another example, the camera may detect whether an individual in the user's environment is speaking to the user, which may indicate that an additional look-ahead time delay is required. Parallel processing may be initiated based on other sensor data, such as changes in the user's GPS location, changes in light detected by the sensors, changes in noise levels, or various other sensor data. In some embodiments, the system may be configured to vary the predetermined time interval based on this information. For example, in environments where conditioning the audio signal may be more important or longer look-ahead samples may be required, the system may perform parallel processing more frequently.
虽然在图45B中示出了两个并行处理方案,但在一些实施例中,可以根据两个以上的方案来执行并行处理。因此,该系统可以比较两个以上经处理音频信号的能级(或其他音频度量)的差值。在一些实施例中,系统可以不必选择在方案之一中定义的值。例如,系统可以内插或外推跨多个经处理音频信号输出的能级,并且可以确定表示音频质量的最佳折衷的前瞻时间延迟或其他值。Although two parallel processing schemes are shown in Figure 45B, in some embodiments parallel processing may be performed according to more than two schemes. Thus, the system can compare the difference in energy levels (or other audio metrics) of two or more processed audio signals. In some embodiments, the system may not have to select a value defined in one of the schemes. For example, the system may interpolate or extrapolate the energy levels output across multiple processed audio signals, and may determine a look-ahead time delay or other value that represents the best compromise in audio quality.
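The interpolation idea described above could be sketched as follows. The linear interpolation rule, the notion of a target energy level, and the example values are all assumptions for illustration; the disclosure only states that the system may interpolate or extrapolate across the tested schemes.

```python
# Sketch of choosing a delay between two tested schemes by linearly
# interpolating their measured energy levels. The interpolation rule and
# values are illustrative assumptions.

def interpolate_delay(delay_a, energy_a, delay_b, energy_b, target_energy):
    """Estimate the delay at which the processed energy reaches target_energy."""
    if energy_a == energy_b:
        return delay_a  # flat response: the shorter tested delay suffices
    t = (target_energy - energy_a) / (energy_b - energy_a)
    t = max(0.0, min(1.0, t))  # clamp to the tested range (no extrapolation)
    return delay_a + t * (delay_b - delay_a)
```

For instance, if a 40 ms scheme leaves energy 1.0 and a 100 ms scheme leaves energy 0.5, a target residual energy of 0.75 falls halfway between them, suggesting a delay around 70 ms under these assumptions.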
在系统确定前瞻时间延迟或其他方面应该改变的情况下,系统可以以各种方式实现改变。在一些实施例中,可以立即或在并行处理之后不久实现改变。在一些实施例中,改变可能不必立即实现。例如,当延长前瞻延迟时,可以通过延长音频信号的固定片段(诸如安静时间段、相对一致噪声的时间段(例如,水流等)或音频处理设置中的改变可能不太明显的其他时间段)来延长延迟。因此,用户可能不会注意到该改变。在一些实施例中,如果在预定时间段内没有检测到静止时间段,则可以在音频信号的另一部分上执行转换。在一些实施例中,可以逐渐改变延迟以减少对用户的影响。例如,如果延迟要从30毫秒延长到100毫秒,则可以在7个循环中执行,其中延迟在每个循环上延长10毫秒。如果在一个或多个循环之后检测到音频信号的静止片段(例如,安静时间段),则延迟的其余部分可以在静止片段期间延长。Where the system determines that the look-ahead time delay or other aspects should be changed, the system may implement the change in various ways. In some embodiments, changes may be implemented immediately or shortly after parallel processing. In some embodiments, changes may not need to be implemented immediately. For example, when extending the look-ahead delay, the delay may be extended by lengthening stationary segments of the audio signal, such as quiet periods, periods of relatively consistent noise (e.g., water flow, etc.), or other periods during which changes in audio processing settings may be less noticeable. Therefore, the user may not notice the change. In some embodiments, if a stationary period is not detected within a predetermined period of time, the transition may be performed on another portion of the audio signal. In some embodiments, the delay may be changed gradually to reduce the impact on the user. For example, if the delay is to be extended from 30 milliseconds to 100 milliseconds, this may be performed over 7 cycles, with the delay extended by 10 milliseconds in each cycle. If a stationary segment of the audio signal (e.g., a quiet period) is detected after one or more cycles, the remainder of the delay may be extended during the stationary segment.
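The gradual transition described above, including the early finish during a quiet segment, can be sketched as follows. The representation of quiet segments as a set of cycle indices is an illustrative assumption; in practice the system would detect them from the audio itself.

```python
# Sketch of the gradual delay transition: extend the delay by a fixed step
# per cycle, applying the remainder at once if a quiet segment is detected.
# The quiet-segment representation is an illustrative assumption.

def transition_delay(current_ms, target_ms, step_ms, quiet_cycles):
    """Return the delay value after each cycle while moving toward target_ms.

    `quiet_cycles` is a set of cycle indices during which the audio is
    quiet; the remaining change is applied at once in such a cycle.
    """
    delays = []
    cycle = 0
    while current_ms < target_ms:
        if cycle in quiet_cycles:
            current_ms = target_ms  # jump the rest during a quiet segment
        else:
            current_ms = min(target_ms, current_ms + step_ms)
        delays.append(current_ms)
        cycle += 1
    return delays

# 30 ms -> 100 ms in 10 ms steps takes 7 cycles when nothing is quiet,
# matching the example in the text.
steps = transition_delay(30, 100, 10, quiet_cycles=set())
```

If a quiet segment occurs after the second cycle, the remaining extension is applied there in a single step instead of continuing gradually.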
在一些实施例中,尽管选择一个经处理音频信号而不是另一个,但是可以向用户呈现多个经处理音频信号。例如,当用户说话时,他或她可能想听自己说话。特别是,用户可能希望听到他或她对其他人的声音。在这样的实施例中,用户可能希望以最小的延迟听到自己。因此,可以向用户呈现基于具有最小时间延迟的方案的经处理音频信号。例如,该音频信号可以在接收音频信号的300毫秒内被发送。在一些实施例中,这可以更快(例如,在200毫秒、100毫秒、80毫秒、40毫秒等)。当可穿戴装置110检测到用户正在说话时,它可以呈现具有最小延迟的经处理音频信号。在一些实施例中,这一具有最小延迟的经处理音频信号可以与所选择的经处理音频信号一起呈现(例如,作为混合或组合的音频信号),所选择的经处理音频信号可以以更长的延迟呈现。在其他实施例中,系统可以在优选方案和具有最小延迟的方案之间来回切换。In some embodiments, although one processed audio signal is selected over another, multiple processed audio signals may be presented to the user. For example, when a user speaks, he or she may want to hear himself or herself speak. In particular, the user may want to hear his or her own voice relative to other people. In such an embodiment, the user may wish to hear himself or herself with minimal delay. Accordingly, a processed audio signal based on the scheme with the minimal time delay may be presented to the user. For example, the audio signal may be transmitted within 300 milliseconds of receiving the audio signal. In some embodiments, this may be faster (e.g., within 200 milliseconds, 100 milliseconds, 80 milliseconds, 40 milliseconds, etc.). When the wearable device 110 detects that the user is speaking, it may present the processed audio signal having the minimal delay. In some embodiments, this minimal-delay processed audio signal may be presented together with the selected processed audio signal (e.g., as a mixed or combined audio signal), and the selected processed audio signal may be presented with a longer delay. In other embodiments, the system may switch back and forth between the preferred scheme and the scheme with the minimal delay.
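The switching behavior described above, presenting the low-latency stream while the user speaks and the higher-quality, longer-delay stream otherwise, can be sketched frame by frame. The frame structure and the per-frame speaking flags are illustrative assumptions standing in for the device's voice-activity detection.

```python
# Sketch of switching between the low-latency and higher-quality streams
# based on whether the user is speaking. Frame labels and flags are assumed.

def mix_streams(low_latency, high_quality, user_speaking_flags):
    """Pick, frame by frame, which processed stream to present to the user."""
    output = []
    for fast, slow, speaking in zip(low_latency, high_quality,
                                    user_speaking_flags):
        # Prefer the minimal-delay stream while the user is speaking.
        output.append(fast if speaking else slow)
    return output

fast_stream = ["f0", "f1", "f2", "f3"]   # minimal-delay processed frames
slow_stream = ["s0", "s1", "s2", "s3"]   # selected longer-delay frames
flags = [False, True, True, False]       # voice-activity detection output
```

A mixed presentation, rather than a hard switch, could instead blend the two streams with per-frame weights.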
图46A是示出符合所公开实施例的用于选择性地放大音频信号的示例过程4600A的流程图。过程4600A可以由可穿戴装置的至少一个处理设备(诸如如上所述的处理器210)执行。在一些实施例中,过程4600A的一些或全部可以由诸如计算设备120的不同设备执行。应当理解,在贯穿本公开中,术语“处理器”用作“至少一个处理器”的简略表达。换句话说,处理器可以包括执行逻辑操作的一个或多个结构,无论这些结构是被并置、连接或分布式的。在一些实施例中,非暂时性计算机可读介质可以包含当由处理器执行时使得处理器执行过程4600A的指令。此外,过程4600A不一定限于图46A中所示的步骤,并且贯穿本公开内容描述的各种实施例的任何步骤或过程也可以包括在过程4600A中,包括上面关于图44和图45A描述的那些步骤或过程。尽管在时间延迟的背景中描述了过程4600A,但应理解,过程4600A可以被应用于处理音频信号的其他方面,包括用于捕捉音频信号的麦克风数量、是否使用相机来处理音频信号、或者可能影响音频质量的任何其他方面。FIG. 46A is a flowchart illustrating an example process 4600A for selectively amplifying an audio signal, consistent with the disclosed embodiments. Process 4600A may be performed by at least one processing device of the wearable device, such as processor 210 as described above. In some embodiments, some or all of process 4600A may be performed by a different device, such as computing device 120. It should be understood that throughout this disclosure, the term "processor" is used as shorthand for "at least one processor." In other words, a processor may include one or more structures that perform logical operations, whether those structures are collocated, connected, or distributed. In some embodiments, a non-transitory computer-readable medium may contain instructions that, when executed by a processor, cause the processor to perform process 4600A. Furthermore, process 4600A is not necessarily limited to the steps shown in FIG. 46A, and any steps or processes of the various embodiments described throughout this disclosure may also be included in process 4600A, including those described above with respect to FIGS. 44 and 45A. Although process 4600A is described in the context of time delays, it should be understood that process 4600A may be applied to other aspects of processing audio signals, including the number of microphones used to capture audio signals, whether cameras are used to process audio signals, or any other aspect that may affect audio quality.
在步骤4610中,过程4600A可以包括接收表示由至少一个麦克风从用户的环境接收的声音的音频信号。例如,麦克风443或444(或麦克风1720)可以捕捉来自用户环境的声音,并可以将它们发送到处理器210。这可以包括如上所述的音频信号4410。In step 4610, process 4600A can include receiving an audio signal representing sound received by at least one microphone from the user's environment. For example, microphone 443 or 444 (or microphone 1720 ) can capture sounds from the user's environment and can send them to processor 210 . This may include audio signal 4410 as described above.
在步骤4612中,过程4600A可以包括接收与处理音频信号相关联的时间延迟的指示。如上面更详细地描述的,可以以各种方式确定时间延迟。在一些实施例中,时间延迟可以由用户通过用户界面定义。用户界面可以被包括在与助听器系统接合的设备(诸如计算设备120)上。例如,该设备可以是移动电话、台式机、膝上型计算机或平板电脑中的至少一个。在一些实施例中,定义时间延迟的设定点可以被存储在远程存储设备上。例如,设定点可以被存储在计算设备120、助听器设备1710、远程服务器(例如,云存储平台、基于网络的服务器等)等上。因此,接收时间延迟的指示可以包括访问远程存储设备。在一些实施例中,时间延迟可以由助听器系统基于来自用户的关于先前处理的音频信号的反馈来确定。例如,用户可以提供音频质量的评级,可以提供关于与音频信号相关联的延迟的反馈,或者可以指示优选的时间延迟或音频质量设置的其他反馈。In step 4612, process 4600A may include receiving an indication of a time delay associated with processing the audio signal. As described in more detail above, the time delay may be determined in various ways. In some embodiments, the time delay may be defined by the user through a user interface. The user interface may be included on a device that interfaces with the hearing aid system, such as computing device 120. For example, the device may be at least one of a mobile phone, desktop, laptop, or tablet. In some embodiments, a set point defining the time delay may be stored on a remote storage device. For example, the set point may be stored on the computing device 120, the hearing aid device 1710, a remote server (e.g., a cloud storage platform, a web-based server, etc.), or the like. Accordingly, receiving the indication of the time delay may include accessing the remote storage device. In some embodiments, the time delay may be determined by the hearing aid system based on feedback from the user regarding previously processed audio signals. For example, the user may provide a rating of audio quality, may provide feedback on the delay associated with the audio signal, or may provide other feedback that may indicate a preferred time delay or audio quality setting.
在步骤4614中,处理器4600A可以包括在缓冲器中存储表示音频信号的部分的多个音频样本。例如,多个音频样本可以包括如上文关于图44所描述的音频样本4412和4414。缓冲器可以是至少临时存储音频样本以用于分析和/或处理的任何存储位置。在一些实施例中,如上所述,缓冲器可以对应于存储器550。In step 4614, the processor 4600A may include storing in a buffer a plurality of audio samples representing the portion of the audio signal. For example, the plurality of audio samples may include audio samples 4412 and 4414 as described above with respect to FIG. 44 . A buffer may be any storage location that stores audio samples at least temporarily for analysis and/or processing. In some embodiments, the buffer may correspond to memory 550, as described above.
在步骤4616中,过程4600A可以包括处理多个音频样本中的第一音频样本以生成经处理的第一音频样本。例如,第一音频样本可以对应于音频样本4412。处理第一音频样本可以包括分析第二音频样本,诸如如上所描述的前瞻音频样本4414。因此,如图44所示,第二音频样本可以在第一音频样本之后的音频信号中表示,并且可以具有由时间延迟定义的长度。虽然步骤4616描述单个第二音频样本,但应理解,处理第一音频样本可以包括分析多个前瞻音频样本。因此,在一些实施例中,第二音频样本可以包括多个音频样本。在一些实施例中,经处理的第一音频样本的音频质量可以取决于第二音频样本的长度。例如,更大的前瞻样本4414可以得到更多的可用于处理音频信号4412的数据,这可以得到更好的调节或增强音频信号的能力。因此,更长的第二音频样本可以与更高的音频质量相关联。In step 4616, process 4600A may include processing a first audio sample of the plurality of audio samples to generate a processed first audio sample. For example, the first audio sample may correspond to audio sample 4412. Processing the first audio sample may include analyzing a second audio sample, such as the look-ahead audio sample 4414 as described above. Thus, as shown in Figure 44, the second audio sample may be represented in the audio signal following the first audio sample, and may have a length defined by the time delay. Although step 4616 describes a single second audio sample, it should be understood that processing the first audio sample may include analyzing multiple look-ahead audio samples. Thus, in some embodiments, the second audio sample may comprise a plurality of audio samples. In some embodiments, the audio quality of the processed first audio sample may depend on the length of the second audio sample. For example, larger look-ahead samples 4414 may result in more data available for processing the audio signal 4412, which may result in a better ability to condition or enhance the audio signal. Therefore, longer second audio samples can be associated with higher audio quality.
图46B是示出符合所公开实施例的用于选择性地放大音频信号的示例过程4600B的流程图。过程4600B可以由可穿戴装置的至少一个处理设备(诸如如上所述的处理器210)执行。在一些实施例中,过程4600B的一些或全部可以由诸如计算设备120的另一设备执行。在一些实施例中,非暂时性计算机可读介质可以包含当由处理器执行时使得处理器执行过程4600B的指令。此外,过程4600B不一定限于图46B中所示的步骤,并且贯穿本公开内容描述的各种实施例的任何步骤或过程也可以包括在过程4600B中,包括上面关于图44-图46A描述的那些步骤或过程。FIG. 46B is a flowchart illustrating an example process 4600B for selectively amplifying an audio signal, consistent with the disclosed embodiments. Process 4600B may be performed by at least one processing device of the wearable device, such as processor 210 as described above. In some embodiments, some or all of process 4600B may be performed by another device, such as computing device 120. In some embodiments, a non-transitory computer-readable medium may contain instructions that, when executed by a processor, cause the processor to perform process 4600B. Furthermore, process 4600B is not necessarily limited to the steps shown in FIG. 46B, and any steps or processes of the various embodiments described throughout this disclosure may also be included in process 4600B, including those described above with respect to FIGS. 44-46A.
在步骤4640中,过程4600B可以包括接收表示由至少一个麦克风从用户的环境捕捉的声音的音频信号。例如,麦克风443或444(或麦克风1720)可以捕捉来自用户环境的声音,并可以将它们发送到处理器210。这可以包括如上所述的音频信号4540。In step 4640, process 4600B can include receiving an audio signal representing sound captured by at least one microphone from the user's environment. For example, microphone 443 or 444 (or microphone 1720) can capture sounds from the user's environment and can send them to processor 210. This may include audio signal 4540, as described above.
在步骤4642中,过程4600B可以包括使用至少一个方面的第一值处理音频信号以生成第一经处理音频信号。例如,如上文关于图45B所描述的,音频信号4540可以被处理以生成经处理音频信号4556。至少一个方面可以包括与处理音频信号相关联的任何形式的变量、设置、选项或其他参数,它们可以影响经处理音频信号的音频质量。在一些实施例中,至少一个方面可以包括用于处理音频信号的时间延迟。如上所述,时间延迟可以定义用于处理音频信号样本的前瞻样本的长度。在一些实施例中,至少一个方面可以包括用于捕捉所述音频信号的麦克风的数量。在一些实施例中,至少一个方面可以包括是否除了音频信号之外还处理图像以改进音频信号的处理。因此,过程4600B还可以包括从可穿戴相机接收从用户的环境捕捉的至少一个图像,并且至少一个方面可以包括是否处理该至少一个图像。In step 4642, process 4600B can include processing the audio signal using a first value of at least one aspect to generate a first processed audio signal. For example, audio signal 4540 may be processed to generate processed audio signal 4556, as described above with respect to FIG. 45B. The at least one aspect may include any form of variable, setting, option, or other parameter associated with processing the audio signal that may affect the audio quality of the processed audio signal. In some embodiments, the at least one aspect may include a time delay for processing the audio signal. As described above, the time delay may define the length of the look-ahead samples used to process samples of the audio signal. In some embodiments, the at least one aspect may include the number of microphones used to capture the audio signal. In some embodiments, the at least one aspect may include whether to process images in addition to the audio signal to improve processing of the audio signal. Accordingly, process 4600B can also include receiving, from the wearable camera, at least one image captured from the user's environment, and the at least one aspect can include whether to process the at least one image.
在步骤4644中,过程4600B可以包括使用至少一个方面的第二值处理音频信号以生成第二经处理音频信号。例如,音频信号4540可以被处理以生成经处理音频信号4566。因此,步骤4644的至少一部分可以与步骤4642并行执行。然而,应当理解,在一些实施例中,步骤4644可以在步骤4642之前或之后执行。在一些实施例中,第二值可以不同于第一值。例如,如果至少一个方面包括时间延迟,则第一值可以是比第二值短的时间延迟,反之亦然。在一些实施例中,第一经处理音频信号或第二经处理音频信号可以被处理以使时间延迟最小化。在一些实施例中,第一值和第二值可以根据第一和第二方案来定义。例如,如上所述,可以根据方案4552和4562来处理音频信号。In step 4644, the process 4600B can include processing the audio signal using the second value of the at least one aspect to generate a second processed audio signal. For example, audio signal 4540 may be processed to generate processed audio signal 4566. Accordingly, at least a portion of step 4644 may be performed in parallel with step 4642. It should be understood, however, that step 4644 may be performed before or after step 4642 in some embodiments. In some embodiments, the second value may be different from the first value. For example, if at least one aspect includes a time delay, the first value may be a shorter time delay than the second value, and vice versa. In some embodiments, the first processed audio signal or the second processed audio signal may be processed to minimize time delay. In some embodiments, the first value and the second value may be defined according to the first and second schemes. For example, the audio signal may be processed according to schemes 4552 and 4562, as described above.
在步骤4646中,过程4600B可以包括将第一经处理音频信号与第二经处理音频信号进行比较,以选择第一经处理音频信号或第二经处理音频信号。可以基于第一经处理音频信号和第二经处理音频信号的音频质量的至少一个方面之间的折衷来选择经处理音频信号。例如,在至少一个方面包括时间延迟的情况下,所选择的经处理音频信号可以是提供最低时间延迟而不损害音频质量的经处理音频信号。可以以各种方式执行比较。例如,如上文更详细描述的,比较第一经处理音频信号和第二经处理音频信号可以包括确定第一经处理音频信号与第二经处理音频信号之间的能级差。在一些实施例中,选择第一经处理音频信号或第二经处理音频信号包括确定能级的差值低于预定阈值。例如,如果能级的差值相对较低,则这可以指示通过第一值和第二值之间的差值实现音频质量的最小增益。在一些实施例中,可以将多个方面与延迟和音频质量一起考虑并且进行加权。例如,可以考虑处理所需的处理能力或其他资源。因此,可以选择提供最大益处(例如,更短的时间延迟和更低的电池消耗)的值。在一些实施例中,所选择的经处理音频信号的能级低于未选择的经处理音频信号的能级。例如,如果能级差值低于某一阈值,则步骤4646可以包括选择具有最低能级的经处理音频信号,该最低能级可以指示更好的音频质量。In step 4646, process 4600B may include comparing the first processed audio signal to the second processed audio signal to select either the first processed audio signal or the second processed audio signal. The processed audio signal may be selected based on a trade-off between the at least one aspect and the audio quality of the first and second processed audio signals. For example, where the at least one aspect includes a time delay, the selected processed audio signal may be the one that provides the lowest time delay without compromising audio quality. The comparison may be performed in various ways. For example, as described in greater detail above, comparing the first processed audio signal and the second processed audio signal may include determining an energy-level difference between the first processed audio signal and the second processed audio signal. In some embodiments, selecting the first processed audio signal or the second processed audio signal includes determining that the difference in energy levels is below a predetermined threshold. For example, if the difference in energy levels is relatively low, this may indicate that minimal gain in audio quality is achieved by the difference between the first value and the second value. In some embodiments, multiple aspects may be considered and weighted along with delay and audio quality. For example, the processing power or other resources required for processing may be considered. Accordingly, the value that provides the greatest benefit (e.g., a shorter time delay and lower battery consumption) may be selected. In some embodiments, the energy level of the selected processed audio signal is lower than the energy level of the unselected processed audio signal. For example, if the energy-level difference is below a certain threshold, step 4646 may include selecting the processed audio signal with the lower energy level, which may indicate better audio quality.
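One plausible reading of the energy-level comparison in step 4646 can be sketched as follows. The threshold value, the tie-breaking rule, and the assumption that the first signal is the shorter-delay variant are all illustrative, not specified by the disclosure.

```python
import numpy as np

def select_processed_signal(short_delay_sig, long_delay_sig, threshold=1e-3):
    """Pick between two conditioned outputs of the same audio signal.

    If their energies differ by less than `threshold`, the longer
    look-ahead buys little quality, so the lower-energy signal
    (taken here to indicate lower residual noise) is selected,
    which also favors the shorter time delay.
    """
    e_short = float(np.mean(np.square(short_delay_sig)))
    e_long = float(np.mean(np.square(long_delay_sig)))
    if abs(e_short - e_long) < threshold:
        # Minimal quality gain: prefer the lower-energy signal.
        return short_delay_sig if e_short <= e_long else long_delay_sig
    # Energies differ meaningfully: keep the longer-look-ahead result.
    return long_delay_sig
```

In a fuller implementation the decision could also weight additional aspects, such as processing power or battery consumption, as the paragraph above notes.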
在步骤4648中,过程4600B可以包括向用户的听觉接口设备发送所选择的经处理音频信号。例如,可穿戴装置110可以通过无线收发器530将所选择的经处理音频信号发送到听觉接口设备1710。因此,所选择的经处理音频信号可以在用户的耳朵中可听地呈现。在一些实施例中,发送所选择的经处理音频信号可以包括将至少一个方面的值转换为不同值(例如,转换为第一或第二值)。如上所述,这一改变可以在检测到的音频信号的静止时间段中、在若干个循环内逐渐发生,或者以另一种方式发生,以降低用户的可感知性。In step 4648, process 4600B may include sending the selected processed audio signal to an auditory interface device of the user. For example, wearable device 110 may transmit the selected processed audio signal to auditory interface device 1710 via wireless transceiver 530. Thus, the selected processed audio signal may be presented audibly in the ear of the user. In some embodiments, transmitting the selected processed audio signal may include converting the value of the at least one aspect to a different value (e.g., to the first or second value). As described above, this change may occur during quiet periods in the detected audio signal, gradually over several cycles, or in another manner that reduces its perceptibility to the user.
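The gradual, cycle-by-cycle change of an aspect value mentioned above could be implemented with a simple ramp. This helper is hypothetical, and the units (for example, a look-ahead delay in milliseconds) are illustrative.

```python
def ramp_parameter(current, target, cycles):
    """Return evenly spaced intermediate values that move a processing
    parameter from `current` to `target` over `cycles` update cycles,
    ending exactly at `target`, so the change is less perceptible than
    a single jump.
    """
    step = (target - current) / cycles
    return [current + step * (i + 1) for i in range(cycles)]
```

For example, moving a look-ahead delay from 40 ms to 80 ms over four cycles yields 50, 60, 70, then 80 ms; each intermediate value could be applied during a quiet period of the detected audio signal.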
如上所述,过程4600B可以作为校准过程来执行,使得至少一个方面可以在不需要用户输入的情况下自动和动态地更新。因此,过程4600B的一些或全部可以时段性地执行。例如,过程4600B还可以包括以预定时间间隔对第一经处理音频信号和第二经处理音频信号进行比较。如上所述,过程4600B的一些或全部可以基于其他提示来执行。例如,过程4600B可以基于检测用户环境的改变(例如,通过相机图像、GPS传感器数据、光传感器数据或其他传感器数据)、基于音频信号的变化、基于来自用户的输入(例如,按下校准按钮等)或各种其他指示器来执行。As described above, process 4600B can be performed as a calibration process such that the at least one aspect can be updated automatically and dynamically without requiring user input. As such, some or all of process 4600B may be performed periodically. For example, process 4600B may also include comparing the first processed audio signal and the second processed audio signal at predetermined time intervals. As noted above, some or all of process 4600B may be performed based on other cues. For example, process 4600B may be performed based on detecting a change in the user's environment (e.g., via camera images, GPS sensor data, light sensor data, or other sensor data), based on changes in the audio signal, based on input from the user (e.g., pressing a calibration button, etc.), or based on various other indicators.
用于有源声音替代的可穿戴装置Wearable device for active sound replacement
如上所述,在将音频呈现给用户之前,可以处理从用户的环境内捕捉的音频信号。该处理可以包括各种调节或增强以改善用户的体验。例如,如上所述,来自用户正在看着的个体的语音可以被放大,而诸如背景噪声、来自其他说话者的语音等的其他音频可以被静音或衰减。因此,用户可以更容易地听到和理解来自与用户交谈的个体的语音。由于调节音频所需的处理时间,用户最初可以从他或她的周围环境中听到说话者的声音,然后可以通过助听器设备听到说话者的延迟的经调节的声音。因此,由于用于调节音频的处理时间,用户可能会经历不想要的“回声”,这对用户来说可能是不希望的。As described above, audio signals captured from within the user's environment may be processed prior to presentation of the audio to the user. The processing may include various adjustments or enhancements to improve the user's experience. For example, as described above, speech from the individual the user is looking at may be amplified, while other audio, such as background noise, speech from other speakers, and the like, may be muted or attenuated. Therefore, the user can more easily hear and understand the speech from the individual with whom the user is conversing. Due to the processing time required to condition the audio, the user may initially hear the speaker's voice directly from his or her surroundings, and then hear a delayed, conditioned version of the speaker's voice through the hearing aid device. Consequently, the user may experience an unwanted "echo" due to the processing time used to condition the audio, which may be undesirable to the user.
因此,助听器系统可以被配置为至少部分地实时地取消声音,并向用户提供经调节的声音,从而减少或消除回声。例如,助听器可以实时取消说话者的声音,然后通过助听器设备向用户提供说话者声音的经调节版本。为了实时取消噪声,系统可以确定到达麦克风的声音和到达用户耳朵的声音之间的时间差。这可以使用以声速在空气中传播的个体声音与以电(例如,以光速)传播的噪声取消信号的传输时间之间的时间差来实现。确定该差值能够实时地取消用户的噪声/声音。Accordingly, the hearing aid system may be configured to cancel sound, at least in part, in real time and provide the user with the conditioned sound, thereby reducing or eliminating the echo. For example, the hearing aid may cancel the speaker's voice in real time and then provide a conditioned version of the speaker's voice to the user through the hearing aid device. To cancel noise in real time, the system may determine the time difference between the sound reaching the microphone and the sound reaching the user's ear. This can be accomplished using the time difference between the transit time of the individual's voice traveling through air at the speed of sound and the transmission time of the noise-cancellation signal traveling electrically (e.g., at the speed of light). Determining this difference enables the noise/sound to be cancelled for the user in real time.
图47是示出符合所公开实施例的用于有源声音替代的示例过程4700的框图。过程4700可用于处理音频信号4710,如图47所示。如上所述,音频信号4710可以由可穿戴装置110的一个或多个麦克风(诸如麦克风443或444)捕捉。在一些实施例中,音频信号4710可以从多个麦克风(诸如麦克风阵列)接收。音频信号4710可以包括来自可穿戴装置110的用户的环境的声音的表示。例如,音频信号因此可以包括来自一个或多个个体的语音、背景噪声、音乐和/或可穿戴装置110在将其呈现给用户之前处理的其他声音。FIG. 47 is a block diagram illustrating an example process 4700 for active sound replacement, consistent with the disclosed embodiments. Process 4700 may be used to process audio signal 4710, as shown in FIG. 47. As described above, audio signal 4710 may be captured by one or more microphones of wearable device 110, such as microphone 443 or 444. In some embodiments, audio signal 4710 may be received from multiple microphones, such as a microphone array. Audio signal 4710 may include a representation of sounds from the environment of the user of wearable device 110. For example, the audio signal may thus include speech from one or more individuals, background noise, music, and/or other sounds that wearable device 110 processes before presenting them to the user.
过程4700可以包括对音频信号4710执行音频处理4720。音频处理4720可以包括对音频信号的任何形式的调节或增强,包括贯穿本公开的任何形式的选择性调节。例如,如本申请中所公开的,系统可以选择性地调节音频信号4710,以衰减背景噪声,放大来自特定源(例如,用户正在看着的对象或人)的声音,调整音频信号的音高,调整音频信号的回放速率,从信号中去除噪声或伪影,执行音频压缩,或执行其他增强以改善用户的音频质量。如图47所示,音频处理4720可以用于生成选择性调节的音频信号4722,该音频信号可以被发送到听觉接口设备4740。Process 4700 may include performing audio processing 4720 on audio signal 4710. Audio processing 4720 may include any form of conditioning or enhancement of the audio signal, including any form of selective conditioning described throughout this disclosure. For example, as disclosed in this application, the system may selectively condition audio signal 4710 to attenuate background noise, amplify sound from a particular source (e.g., an object or person the user is looking at), adjust the pitch of the audio signal, adjust the playback rate of the audio signal, remove noise or artifacts from the signal, perform audio compression, or perform other enhancements to improve audio quality for the user. As shown in FIG. 47, audio processing 4720 may be used to generate a selectively conditioned audio signal 4722, which may be sent to auditory interface device 4740.
除了音频处理4720之外,过程4700还可以执行噪声取消4730。噪声取消4730可以是被配置为取消或衰减来自音频信号的一个或多个声音的音频信号的任何形式的处理。如图47所示,噪声取消4730可以用于生成至少一个取消音频信号4732。噪声取消音频信号4732可以是被配置为取消或抵消另一音频信号的任何音频信号。例如,噪声取消4730可以包括有源噪声控制(ANC)过程。因此,取消音频信号4732可以包括与具有不希望的声音(诸如预测到达用户耳朵的声音)的另一音频信号相异相位的“负”音频信号。取消音频信号4732可以被配置为使得当不希望声音的声波的声压高时,取消音频信号4732的声波的声压低。因此,当取消音频信号与预测音频信号组合时,两个音频信号的波可以被抵消或“取消”。与经选择性调节的音频信号4722一样,取消音频信号4732可以被发送到听觉接口设备4740。In addition to audio processing 4720, process 4700 may also perform noise cancellation 4730. Noise cancellation 4730 may be any form of processing of an audio signal configured to cancel or attenuate one or more sounds from the audio signal. As shown in FIG. 47, noise cancellation 4730 may be used to generate at least one cancellation audio signal 4732. Cancellation audio signal 4732 may be any audio signal configured to cancel or offset another audio signal. For example, noise cancellation 4730 may include an active noise control (ANC) process. Accordingly, cancellation audio signal 4732 may include a "negative" audio signal that is out of phase with another audio signal carrying an undesired sound, such as a sound predicted to reach the user's ear. Cancellation audio signal 4732 may be configured such that when the sound pressure of the undesired sound's wave is high, the sound pressure of cancellation audio signal 4732's wave is low. Thus, when the cancellation audio signal is combined with the predicted audio signal, the waves of the two audio signals may offset, or "cancel," each other. Like the selectively conditioned audio signal 4722, cancellation audio signal 4732 may be sent to auditory interface device 4740.
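The anti-phase relationship described above can be illustrated directly: inverting the predicted waveform produces a signal whose pressure is low wherever the unwanted wave's pressure is high, so the two sum to (near) silence at the ear. This is the textbook active-noise-control identity, not code from the disclosure.

```python
import numpy as np

def cancellation_signal(predicted_sound):
    """Return a phase-inverted copy of the predicted sound, the
    simplest form of a cancellation ("negative") audio signal."""
    return -np.asarray(predicted_sound, dtype=float)

# A 440 Hz tone sampled at 48 kHz stands in for the predicted sound.
t = np.linspace(0.0, 1.0, 48000, endpoint=False)
unwanted = np.sin(2.0 * np.pi * 440.0 * t)
# Combining the unwanted wave with its cancellation signal at the
# ear leaves (near) silence.
residual = unwanted + cancellation_signal(unwanted)
```

In practice the inversion must also be timed so the two waves actually overlap at the ear, which is the role of the time delay discussed below.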
取消音频信号4732可以被配置为取消音频信号4710的至少一部分。在一些实施例中,取消音频信号4732可以被配置为在所需相位处取消音频信号4710的特定部分,诸如个体的语音。因此,当该特定部分到达用户耳朵时,该特定部分可以被取消。在一些实施例中,该部分也可以包括在音频处理4720中。例如,如果音频处理4720被配置为选择性地调节来自音频信号4710的个体的语音,则取消音频信号4732可以从到达用户耳朵的声音抵消或取消个体的声音。因此,取消音频信号4732可以消除或减少当选择性地调节音频信号4722被发送到听觉接口设备4740时用户可能体验到的回声。在一些实施例中,取消音频信号4732可以被配置为取消来自用户环境的所有声音。因此,用户可以听到经选择性调节的音频信号4722,而听不到来自用户环境的任何声音或一些声音。例如,音频处理4720可以选择性地调节用户环境内的多个声音,以调整声音相对于彼此的音量或其他属性,并且用户可以在没有回声效应的情况下听到经选择性调节的声音。Cancellation audio signal 4732 may be configured to cancel at least a portion of audio signal 4710. In some embodiments, cancellation audio signal 4732 may be configured to cancel, at the required phase, a particular portion of audio signal 4710, such as an individual's speech. Thus, the particular portion may be cancelled when it reaches the user's ear. In some embodiments, that portion may also be included in audio processing 4720. For example, if audio processing 4720 is configured to selectively condition the individual's speech from audio signal 4710, cancellation audio signal 4732 may offset or cancel the individual's voice from the sound reaching the user's ear. Accordingly, cancellation audio signal 4732 may eliminate or reduce the echo the user might otherwise experience when the selectively conditioned audio signal 4722 is sent to auditory interface device 4740. In some embodiments, cancellation audio signal 4732 may be configured to cancel all sounds from the user's environment. Thus, the user may hear the selectively conditioned audio signal 4722 without hearing any (or some) of the sounds from the user's environment. For example, audio processing 4720 may selectively condition multiple sounds within the user's environment to adjust the volume or other properties of the sounds relative to each other, and the user may hear the selectively conditioned sounds without an echo effect.
为了有效地取消来自用户环境的声音,可以发送取消音频信号,使得它与正在被取消的声音同时呈现在用户的耳朵中。因此,该系统可以被配置为确定定义了当音频信号4710被装置110接收时(或者,在一些实施例中,当取消音频信号4732被生成时)与向用户呈现取消信号时之间的时间的时间延迟。因此,系统可以基于音频信号4710来预测将在用户耳朵处接收的声音,并且可以确定时间延迟,使得取消音频信号4732取消预测的声音。在一些实施例中,时间延迟可以是与可穿戴装置110相关联的预设或预定义值。例如,系统可以假设预测的声音在被捕捉为音频信号4710之后被用户听到所需的时间。可以基于来自用户的输入来调整该预定时间延迟。例如,用户可以能够通过用户界面微调延迟,或者可以能够提供正在经历回声的反馈。To effectively cancel sound from the user's environment, the cancellation audio signal may be sent such that it is presented in the user's ear at the same time as the sound being cancelled. Accordingly, the system may be configured to determine a time delay defining the time between when audio signal 4710 is received by apparatus 110 (or, in some embodiments, when cancellation audio signal 4732 is generated) and when the cancellation signal is presented to the user. The system may thus predict, based on audio signal 4710, the sound that will be received at the user's ear, and may determine the time delay such that cancellation audio signal 4732 cancels the predicted sound. In some embodiments, the time delay may be a preset or predefined value associated with wearable device 110. For example, the system may assume a time required for the predicted sound to be heard by the user after being captured as audio signal 4710. This predetermined time delay may be adjusted based on input from the user. For example, the user may be able to fine-tune the delay through a user interface, or may be able to provide feedback that an echo is being experienced.
在一些实施例中,可以基于声音将在其作为音频信号4710被捕捉的位置与用户耳朵之间传播的距离来确定时间延迟。图48A、48B和48C示出了符合所公开的实施例的用于有源声音替换的可穿戴装置110和听觉接口设备4820的示例配置。在一些实施例中,可穿戴装置110可以作为眼镜佩戴,如图48A所示。如上所述,可穿戴装置110可以被穿戴在用户身上的其他位置。例如,可穿戴装置110可以被穿戴在用户身上的腰带、衬衫、手腕或各种其他位置上。可穿戴装置110可以包括麦克风4812,其可以被配置为捕捉音频信号(诸如上文描述的音频信号4710)。麦克风4812可以是被配置为从用户的环境捕捉声音的任何设备。例如,麦克风4812可以对应于上述麦克风443、444或1720。如上所述,可穿戴装置110可以向助听器设备4740发送一个或多个信号。例如,可穿戴装置110可以使用无线收发器530将经选择性调节的音频信号4722和取消音频信号4732发送到听觉接口设备4740。听觉接口设备4740可以对应于如上所述的听觉接口设备1710。因此,上述关于听觉接口设备1710的任何细节或实施例可以应用于听觉接口设备4740。例如,虽然听觉接口设备4740被示为耳内设备,但是听觉接口设备4740可以包括另一种形式的听觉接口设备,诸如骨传导耳机、耳外设备等。In some embodiments, the time delay may be determined based on the distance the sound will travel between the location where it is captured as audio signal 4710 and the user's ear. FIGS. 48A, 48B, and 48C illustrate example configurations of wearable device 110 and auditory interface device 4820 for active sound replacement, consistent with the disclosed embodiments. In some embodiments, wearable device 110 may be worn as glasses, as shown in FIG. 48A. As described above, wearable device 110 may be worn at other locations on the user's body. For example, wearable device 110 may be worn on a belt, shirt, wrist, or at various other locations on the user's body. Wearable device 110 may include a microphone 4812, which may be configured to capture audio signals (such as audio signal 4710 described above). Microphone 4812 may be any device configured to capture sound from the user's environment. For example, microphone 4812 may correspond to microphone 443, 444, or 1720 described above. As described above, wearable device 110 may transmit one or more signals to hearing aid device 4740. For example, wearable device 110 may transmit selectively conditioned audio signal 4722 and cancellation audio signal 4732 to auditory interface device 4740 using wireless transceiver 530. Auditory interface device 4740 may correspond to auditory interface device 1710 as described above. Accordingly, any of the details or embodiments described above with respect to auditory interface device 1710 may apply to auditory interface device 4740. For example, while auditory interface device 4740 is shown as an in-ear device, auditory interface device 4740 may include another form of auditory interface device, such as a bone conduction headphone, an out-of-ear device, and the like.
所公开的方法和系统可以包括确定距离d,其表示声波4810在麦克风4812处被捕捉之前将传播的距离与声波在到达用户(或听觉接口设备4740)的耳朵之前将传播的距离之间的差值。上面讨论的时间延迟可以基于声波4810移动距离d将花费多长时间来确定(假设它以声速移动)。因此,可以垂直于声波4810的传播来确定距离d。在一些实施例中,可以考虑其他因素来确定时间延迟,诸如发送取消音频信号4732所需的时间、在听觉接口设备4740处处理和播放取消音频信号4732的时间、或可能影响向用户呈现取消音频信号的定时的各种其他因素(或其组合)。在一些实施例中,所公开的系统可能需要用于处理取消音频信号的累积音频背景。例如,该系统可以引入40-80毫秒的延迟(或其他合适的延迟)以允许处理音频信号4710。在一些实施例中,该累积的音频背景延迟可以是如上文关于图44-图46B所描述的可变的或可调整的。The disclosed methods and systems may include determining a distance d representing the difference between the distance sound wave 4810 will travel before being captured at microphone 4812 and the distance the sound wave will travel before reaching the user's ear (or auditory interface device 4740). The time delay discussed above may be determined based on how long it will take sound wave 4810 to travel the distance d (assuming it travels at the speed of sound). Thus, the distance d may be determined perpendicular to the propagation of sound wave 4810. In some embodiments, other factors may be considered in determining the time delay, such as the time required to transmit cancellation audio signal 4732, the time to process and play cancellation audio signal 4732 at auditory interface device 4740, or various other factors (or combinations thereof) that may affect the timing of presenting the cancellation audio signal to the user. In some embodiments, the disclosed system may require an accumulated audio background for processing the cancellation audio signal. For example, the system may introduce a delay of 40-80 milliseconds (or another suitable delay) to allow audio signal 4710 to be processed. In some embodiments, this accumulated audio background delay may be variable or adjustable, as described above with respect to FIGS. 44-46B.
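The conversion of the path difference d into a time budget for the cancellation signal can be sketched as follows. The speed-of-sound constant and the fixed-latency parameter are illustrative assumptions; the disclosure only states that the acoustic path travels at the speed of sound while the electrical path is effectively instantaneous.

```python
SPEED_OF_SOUND_M_PER_S = 343.0  # in air at roughly 20 °C

def cancellation_delay_s(path_difference_m, fixed_latency_s=0.0):
    """Time available before the wavefront that passed the microphone
    reaches the ear: the extra acoustic distance d divided by the
    speed of sound, minus any fixed transmit/processing latency.
    """
    return path_difference_m / SPEED_OF_SOUND_M_PER_S - fixed_latency_s
```

For a path difference of about 0.15 m (roughly a collar-worn microphone to the ear), the budget is under half a millisecond, which is why any accumulated processing delay must be accounted for separately.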
如上所述,可穿戴装置110可以被放置在用户上的各种其他位置。距离d可以至少部分地取决于可穿戴装置110的放置。图48B示出了夹在用户的衣领上的可穿戴装置110。如图48B所示,距离d可以取决于可穿戴装置110的放置位置而变化。距离d可以类似地基于可穿戴装置110的类型或其他特性而变化。例如,可穿戴装置110在实现为一副眼镜时可具有与可附接到用户衣服上的设备不同的预定时间延迟。在一些实施例中,距离d(和时间延迟)可以取决于用户对设备的放置位置。例如,同一设备可以被夹住或以其他方式固定在用户的不同位置上。因此,设备可以接收指示放置位置的数据。在一些实施例中,数据可以是指示放置位置的用户输入。例如,用户可以通过计算设备120(或其他计算设备)的用户界面提供输入。这可以包括从位置列表中选择、点击指示近似放置位置的用户图像或其他界面。用户输入可以通过其他装置接收,诸如可穿戴装置110上的物理开关或按钮。在一些实施例中,可穿戴装置110可以根据基于可穿戴装置110的图像传感器220捕捉的图像、用户的语音或其他声音的感知方向性等所感知的相机位置来推断放置位置。在一些实施例中,时间延迟可以取决于设备的大小。例如,当实现为一副眼镜时,具有更长镜脚的设备可以假设比具有较短镜脚的设备更长的时间延迟。还可以基于听觉接口设备4740的放置位置或其他属性(例如,耳内与骨传导、处理速度等)来配置时间延迟。As described above, wearable device 110 may be placed at various other locations on the user. The distance d may depend, at least in part, on the placement of wearable device 110. FIG. 48B shows wearable device 110 clipped to the user's collar. As shown in FIG. 48B, the distance d may vary depending on where wearable device 110 is placed. The distance d may similarly vary based on the type or other characteristics of wearable device 110. For example, wearable device 110, when implemented as a pair of glasses, may have a different predetermined time delay than a device that may be attached to the user's clothing. In some embodiments, the distance d (and the time delay) may depend on where the user places the device. For example, the same device may be clipped or otherwise secured at different locations on the user. Accordingly, the device may receive data indicating the placement location. In some embodiments, the data may be user input indicating the placement location. For example, the user may provide input through a user interface of computing device 120 (or another computing device). This may include selecting from a list of locations, tapping an image of a user to indicate an approximate placement location, or other interfaces. User input may also be received through other means, such as a physical switch or button on wearable device 110. In some embodiments, wearable device 110 may infer the placement location based on a camera position perceived from images captured by image sensor 220 of wearable device 110, the perceived directionality of the user's voice or other sounds, and the like. In some embodiments, the time delay may depend on the size of the device. For example, when implemented as a pair of glasses, a device with longer temples may assume a longer time delay than a device with shorter temples. The time delay may also be configured based on the placement location or other properties of auditory interface device 4740 (e.g., in-ear versus bone conduction, processing speed, etc.).
距离d和相关联的时间延迟还可以取决于声音发出对象相对于用户的位置。图48C示出了从相对于图48B的更高位置接收的声波4810,其中可穿戴装置110和听觉接口设备4740的放置位置保持相同。因为声波4810到达麦克风4812和用户耳朵的角度不同,所以距离d也不同(在本例中缩短)。因此,可穿戴装置110可以被配置为在确定时间延迟时考虑生成要取消的声波4810的声音发出对象的位置。可穿戴装置110可以以各种方式确定或估计声音发出对象的位置。在一些实施例中,可以基于声音发出对象的类型来假设该位置。例如,如果可穿戴装置110确定声波4810与个体相关联,则可穿戴装置110可以假设声音发出对象处于个体嘴的平均高度。作为另一示例,如果声波4810与诸如狗或猫之类的动物相关联,则可穿戴装置可以假设声波4810是从地面附近发出的。各种其他类型的声音发出对象可以被识别并与预定高度相关联。The distance d and the associated time delay may also depend on the location of the sound-emitting object relative to the user. FIG. 48C shows sound waves 4810 received from a higher position relative to FIG. 48B, with the placement of wearable device 110 and auditory interface device 4740 remaining the same. Because the angles at which sound wave 4810 reaches microphone 4812 and the user's ear differ, the distance d also differs (it is shortened in this example). Accordingly, wearable device 110 may be configured to account for the location of the sound-emitting object generating the sound wave 4810 to be cancelled when determining the time delay. Wearable device 110 may determine or estimate the location of the sound-emitting object in various ways. In some embodiments, the location may be assumed based on the type of sound-emitting object. For example, if wearable device 110 determines that sound wave 4810 is associated with an individual, wearable device 110 may assume that the sound-emitting object is at the average height of an individual's mouth. As another example, if sound wave 4810 is associated with an animal, such as a dog or cat, the wearable device may assume that sound wave 4810 is emanating from near the ground. Various other types of sound-emitting objects may be identified and associated with predetermined heights.
在一些实施例中,声音发出对象的位置可以基于可穿戴装置110接收的传感器数据来确定。例如,可穿戴装置110可以分析由图像传感器220捕捉的一个或多个图像以确定声音发出对象的位置。如贯穿本公开所描述的,这可以包括各种对象或特征检测或其他图像处理技术。在一些实施例中,可以基于来自麦克风4812的输入来确定声音发出对象的位置。例如,麦克风4812可以包括麦克风的多方向阵列或其他类型的麦克风,其被配置为确定捕捉声音的方向。因此,时间延迟(和距离d)可以取决于对可穿戴装置110的各种输入。如上所述,用户可以能够通过图形用户界面(例如,在计算设备120上,在可穿戴装置110上,等等),通过物理控制(例如,按钮、拨号、开关等)或各种其他输入设备来调谐或调整所确定的延迟。In some embodiments, the location of the sound-emitting object may be determined based on sensor data received by wearable device 110. For example, wearable device 110 may analyze one or more images captured by image sensor 220 to determine the location of the sound-emitting object. This may include various object or feature detection or other image processing techniques, as described throughout this disclosure. In some embodiments, the location of the sound-emitting object may be determined based on input from microphone 4812. For example, microphone 4812 may include a multidirectional array of microphones, or other types of microphones, configured to determine the direction from which sound is captured. Thus, the time delay (and the distance d) may depend on various inputs to wearable device 110. As described above, the user may be able to tune or adjust the determined delay through a graphical user interface (e.g., on computing device 120, on wearable device 110, etc.), through physical controls (e.g., buttons, dials, switches, etc.), or through various other input devices.
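Given estimated positions for the microphone, the ear, and the sound-emitting object (for example, from image analysis or a directional microphone array, as described above), the path difference d could be computed geometrically. This sketch assumes known 3-D coordinates in metres and is an illustration, not taken from the disclosure.

```python
import numpy as np

def path_difference_m(mic_pos, ear_pos, source_pos):
    """Extra distance d the wavefront travels from the microphone to
    the ear: the projection of the mic-to-ear displacement onto the
    direction of propagation (source toward ear).
    """
    # Unit vector along the direction of propagation at the ear.
    direction = np.asarray(ear_pos, float) - np.asarray(source_pos, float)
    direction /= np.linalg.norm(direction)
    # Projection of the mic->ear displacement onto that direction.
    offset = np.asarray(ear_pos, float) - np.asarray(mic_pos, float)
    return float(np.dot(offset, direction))
```

Moving the assumed source position (for example, raising it, as in FIG. 48C relative to FIG. 48B) changes the propagation direction and therefore shortens or lengthens d, which is the dependence the paragraph above describes.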
图49是示出符合所公开实施例的用于选择性地替换音频信号的示例过程4900的流程图。过程4900可以由可穿戴装置的至少一个处理设备(诸如如上所述的处理器210)执行。在一些实施例中,过程4900的一些或全部可以由诸如计算设备120或听觉接口设备4740的不同设备执行。应当理解,在贯穿本公开中,术语“处理器”用作“至少一个处理器”的简略表达。换句话说,处理器可以包括执行逻辑操作的一个或多个结构,无论这些结构是被并置、连接或分布式的。在一些实施例中,非暂时性计算机可读介质可以包含当由处理器执行时使得处理器执行过程4900的指令。此外,过程4900不一定限于图49中所示的步骤,并且贯穿本公开内容描述的各种实施例的任何步骤或过程也可以包括在过程4900中,包括上面关于图47、图48A、图48B和图48C描述的那些步骤或过程。FIG. 49 is a flowchart illustrating an example process 4900 for selectively replacing an audio signal, consistent with the disclosed embodiments. Process 4900 may be performed by at least one processing device of the wearable device, such as processor 210 as described above. In some embodiments, some or all of process 4900 may be performed by a different device, such as computing device 120 or auditory interface device 4740. It should be understood that throughout this disclosure, the term "processor" is used as a shorthand for "at least one processor." In other words, a processor may include one or more structures that perform logical operations, whether collocated, connected, or distributed. In some embodiments, a non-transitory computer-readable medium may contain instructions that, when executed by a processor, cause the processor to perform process 4900. Furthermore, process 4900 is not necessarily limited to the steps shown in FIG. 49, and any steps or processes of the various embodiments described throughout this disclosure may also be included in process 4900, including those steps or processes described above with respect to FIGS. 47, 48A, 48B, and 48C.
在步骤4910中,过程4900可以包括接收由可穿戴相机从用户的环境捕捉的多个图像。例如,步骤4910可以包括接收由图像传感器220捕捉的图像。所捕捉的图像可以包括用户环境内的个体或其他声音发出对象的表示。In step 4910, process 4900 can include receiving a plurality of images captured by the wearable camera from the user's environment. For example, step 4910 may include receiving an image captured by image sensor 220 . The captured images may include representations of individuals or other sound-emitting objects within the user's environment.
在步骤4912中,过程4900可以包括接收表示由至少一个麦克风从用户的环境捕捉的声音的音频信号。例如,麦克风443或444(或麦克风1720)可以捕捉来自用户环境的声音,并可以将它们发送到处理器210。这可以包括如上所述的音频信号4710。In step 4912, process 4900 can include receiving an audio signal representing sound captured by the at least one microphone from the user's environment. For example, microphone 443 or 444 (or microphone 1720) can capture sounds from the user's environment and can send them to processor 210. This may include audio signal 4710, as described above.
在步骤4914中,过程4900可以包括基于对多个图像或音频信号的分析,从与用户环境中的一个或多个声音发出对象相关联的多个音频信号中识别音频信号。在一些实施例中,声音发出对象可以是个体。例如,步骤4914可以包括处理多个图像以识别正在与用户说话的个体。这可以基于个体的唇部移动、用户的视线方向、声纹或者贯穿本公开描述的用于识别与个体相关联的音频信号的各种其他方法。In step 4914, process 4900 may include identifying, based on analysis of the plurality of images or the audio signals, an audio signal from among a plurality of audio signals associated with one or more sound-emitting objects in the user's environment. In some embodiments, the sound-emitting object may be an individual. For example, step 4914 may include processing the plurality of images to identify an individual who is speaking to the user. This may be based on the individual's lip movement, the user's gaze direction, a voiceprint, or various other methods described throughout this disclosure for identifying an audio signal associated with an individual.
在步骤4916中,过程4900可以包括基于多个音频信号,预测将在用户耳朵处从用户的环境接收到的声音。预测的声音可以对应于当与识别出的音频信号相关联的声波到达用户耳朵时用户将听到的声音。例如,参考图48A,预测的声音可以是一旦声波4810在由麦克风4812记录后到达用户耳朵时期望用户听到的声音。In step 4916, the process 4900 can include predicting the sound to be received at the user's ear from the user's environment based on the plurality of audio signals. The predicted sound may correspond to the sound that the user will hear when the sound waves associated with the identified audio signal reach the user's ear. For example, referring to FIG. 48A, the predicted sound may be the sound that the user is expected to hear once the sound wave 4810 reaches the user's ear after being recorded by the microphone 4812.
在步骤4918中,过程4900可以包括生成被配置为抵消用户的耳朵处的至少预测声音的取消音频信号。例如,如上所述,这可以包括通过噪声取消4730来生成取消音频信号4732。在一些实施例中,噪声取消音频信号可以被配置为抵消来自用户环境的除来自与用户说话的个体的声音之外的至少一种声音。例如,噪声取消音频信号可以被配置为抵消背景噪声、其他说话者或来自用户环境中的其他声音发出对象的声音。In step 4918, the process 4900 may include generating a cancellation audio signal configured to cancel at least the predicted sound at the user's ear. For example, as described above, this may include generating a cancellation audio signal 4732 through noise cancellation 4730. In some embodiments, the noise cancelling audio signal may be configured to cancel at least one sound from the user's environment other than the sound from the individual speaking to the user. For example, the noise-cancelling audio signal may be configured to cancel background noise, other speakers, or other sounds from the user's environment emitting the object's voice.
在步骤4920中,过程4900可以包括基于识别出的音频信号来生成经选择性调节的音频信号。例如,步骤4920可以包括通过如上所述的音频处理4720生成经选择性调节的音频信号4722。因此,生成选择性地调节的音频信号可以包括对识别出的音频信号的任何形式的调节或增强。例如,选择性调节可以包括相对于多个音频信号中的附加音频信号放大识别出的音频信号。在贯穿本公开中描述了各种其他形式的选择性调节。In step 4920, process 4900 may include generating a selectively conditioned audio signal based on the identified audio signal. For example, step 4920 may include generating a selectively conditioned audio signal 4722 by audio processing 4720 as described above. Thus, generating the selectively adjusted audio signal may include any form of adjustment or enhancement of the identified audio signal. For example, the selective adjustment may include amplifying the identified audio signal relative to additional audio signals of the plurality of audio signals. Various other forms of selective modulation are described throughout this disclosure.
在步骤4922中,过程4900可以包括将取消音频信号和经选择性调节的音频信号发送到被配置为向用户的耳朵提供声音的助听器接口设备。例如,如图47所示,经选择地调节的音频信号4722和取消音频信号4732可以被发送到听觉接口设备4740。在一些实施例中,取消音频信号和经选择性调节的音频信号可以一起被发送,尽管它们可以相对于彼此有时间移位。例如,可以组合或混合取消音频信号和经选择性调节的音频信号,使得当与预测的声音一起呈现给用户耳朵时,只听到经调节音频信号。在一些实施例中,可以分别发送取消音频信号和经选择性调节的音频信号。例如,可以在发送经选择性调节的音频信号之前发送取消音频信号。因此,预测的声音可以在用户耳朵处被取消,使得经选择性调节的音频信号不会为用户引入回声。In step 4922, the process 4900 may include sending the cancel audio signal and the selectively adjusted audio signal to a hearing aid interface device configured to provide sound to the user's ear. For example, as shown in FIG. 47 , selectively conditioned audio signal 4722 and cancel audio signal 4732 may be sent to auditory interface device 4740 . In some embodiments, the cancel audio signal and the selectively conditioned audio signal may be sent together, although they may be time-shifted relative to each other. For example, the cancelling audio signal and the selectively conditioned audio signal may be combined or mixed such that when presented to the user's ear along with the predicted sound, only the conditioned audio signal is heard. In some embodiments, the cancel audio signal and the selectively adjusted audio signal may be sent separately. For example, the cancel audio signal may be sent before the selectively adjusted audio signal is sent. Thus, the predicted sound can be canceled at the user's ear so that the selectively conditioned audio signal does not introduce echoes to the user.
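The mixing of the cancellation and conditioned signals described above can be sketched as follows. This is a minimal illustration, not the disclosed implementation: it assumes the predicted sound and the conditioned signal are already time-aligned sample arrays and that cancellation is simple phase inversion.

```python
import numpy as np

def hearing_aid_output(predicted: np.ndarray, conditioned: np.ndarray) -> np.ndarray:
    """Mix a cancellation signal (phase-inverted predicted sound) with the
    selectively conditioned audio into one stream for the interface device."""
    cancellation = -predicted  # destructively interferes with the ambient sound
    return cancellation + conditioned

# At the ear, the ambient (predicted) sound adds back in, leaving only
# the conditioned audio audible:
predicted = np.array([0.2, -0.1, 0.4])
conditioned = np.array([0.5, 0.5, 0.5])
at_ear = predicted + hearing_aid_output(predicted, conditioned)
```

When the two signals are combined this way, the predicted ambient sound and the cancellation signal sum to zero at the ear, which is the "only the conditioned audio signal is heard" behavior described in step 4922.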
如上所述,取消音频信号的呈现可以被定时以与到达用户耳朵的预测的声音一致。因此,在一些实施例中,过程4900还可以包括确定当接收到多个音频信号时与当预测的声音将在用户耳朵处接收到时之间的时间延迟。在这些实施例中,取消音频信号可以在基于时间延迟的时间处被发送。在一些实施例中,时间延迟可以至少部分地基于在空气中传播的声速来确定。例如,如上面关于图48A、图48B和图48C所描述的,时间延迟可以对应于声波4810行进距离d所需的时间。在一些实施例中,时间延迟可以至少部分地基于声音发出对象相对于至少一个麦克风和助听器接口设备的位置来确定。例如,如上面关于图48C所描述的,可以基于多个图像来确定声音发出对象的位置。声音发出对象的位置可以基于其他输入(诸如基于麦克风确定的方向性,或者其他数据)来确定。在一些实施例中,可以至少部分地基于来自用户的输入来确定时间延迟。例如,通过诸如计算设备120的外部设备的用户界面来接收输入。用户界面可以包括类似于图45A中所示的和上面描述的那些控件。在一些实施例中,可以通过诸如按钮、拨号、开关等物理控件来提供用户输入。As described above, the presentation of the cancellation audio signal may be timed to coincide with the arrival of the predicted sound at the user's ear. Accordingly, in some embodiments, process 4900 may also include determining a time delay between when the plurality of audio signals are received and when the predicted sound will be received at the user's ear. In these embodiments, the cancellation audio signal may be sent at a time based on the time delay. In some embodiments, the time delay may be determined based at least in part on the speed of sound propagating in air. For example, as described above with respect to FIGS. 48A, 48B, and 48C, the time delay may correspond to the time required for sound wave 4810 to travel the distance d. In some embodiments, the time delay may be determined based at least in part on the position of the sound-emitting object relative to the at least one microphone and the hearing aid interface device. For example, as described above with respect to FIG. 48C, the location of the sound-emitting object may be determined based on a plurality of images. The location of the sound-emitting object may also be determined based on other inputs, such as directionality determined based on the microphone, or other data. In some embodiments, the time delay may be determined based at least in part on input from a user. For example, the input may be received through a user interface of an external device, such as computing device 120. The user interface may include controls similar to those shown in FIG. 45A and described above. In some embodiments, user input may be provided through physical controls such as buttons, dials, switches, and the like.
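The time-delay computation above can be sketched as follows. This is an illustrative model only: it assumes straight-line propagation in air and a hypothetical fixed processing latency; both constants are assumptions, not values from the disclosure.

```python
SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air at 20 °C

def cancellation_send_delay(distance_m: float,
                            processing_latency_s: float = 0.002) -> float:
    """Time remaining, after processing, before the wavefront captured at the
    microphone reaches the user's ear; the cancellation signal must be
    presented within this window to coincide with the predicted sound."""
    travel_time_s = distance_m / SPEED_OF_SOUND_M_S
    return max(travel_time_s - processing_latency_s, 0.0)

# A talker 3.43 m from the user gives roughly 10 ms of acoustic travel time,
# leaving about 8 ms of headroom after a hypothetical 2 ms processing latency.
delay_s = cancellation_send_delay(3.43)
```

The `max(..., 0.0)` guard reflects that for very nearby sources the wavefront may arrive before processing completes, in which case no useful scheduling margin remains.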
声音的模拟方向性Simulated directionality of sound
与所公开的实施例一致,助听器系统可以基于声音发出对象的位置来选择性地放大声音。现有的助听器系统可能无法以足够的保真度和精确度复制声音的定时和音量,以使用户能够识别声音的来源。在一些情况下,用户可能有学习困难、脑损伤、畸形的耳朵或耳道,或损伤的影响,这些损伤削弱了用户基于声音位置来定位对象的能力。例如,用户可能在他或她的耳朵中具有不相等的听力损失,导致错误地认为对象更接近用户听力损失较小的耳朵,而不是更接近用户听力损失较大的耳朵。另外,在一些情况下,用户可能希望通过组合延迟感知和声强感知来增强声音定位能力。Consistent with the disclosed embodiments, the hearing aid system may selectively amplify sound based on the location of the object emitting the sound. Existing hearing aid systems may not be able to reproduce the timing and volume of sounds with sufficient fidelity and precision to allow users to identify the source of the sounds. In some cases, the user may have learning difficulties, brain damage, a deformed ear or ear canal, or the effects of damage that impairs the user's ability to locate objects based on sound location. For example, a user may have unequal hearing loss in his or her ears, leading to the false belief that the object is closer to the user's less hearing-impaired ear than to the user's more hearing-impaired ear. Additionally, in some cases the user may wish to enhance sound localization capabilities by combining delay perception and sound intensity perception.
因此,助听器系统可以分析用户环境的捕捉图像和声音以确定声源的位置。助听器系统然后可以向用户发送声音,使得声音在不同的时间和音量到达用户的耳朵,从而产生立体声效果。通过这样做,助听器系统可以向用户提供替代和/或增强的声音定位能力。Thus, the hearing aid system can analyze the captured images and sounds of the user's environment to determine the location of the sound source. The hearing aid system can then send sounds to the user so that the sounds reach the user's ears at different times and volumes, creating a stereo effect. By doing so, the hearing aid system may provide the user with alternative and/or enhanced sound localization capabilities.
用户100可以佩戴符合上述基于相机的助听器设备的助听器设备。例如,助听器设备可以是如图17A所示的听觉接口设备1710。听觉接口设备1710可以是被配置为向用户100提供听觉反馈的任何设备。听觉接口设备1710可以被放置在用户100的每个耳朵中,类似于传统的听觉接口设备。如上所述,听觉接口设备1710可以是各种样式的,包括耳道内、完全耳道内、耳内、耳后、耳上、耳道内接收器、开放安装或各种其他样式。听觉接口设备1710可以包括用于向用户100提供听觉反馈的一个或多个扬声器、用于检测用户100的环境中的声音的麦克风、内部电子设备、处理器、存储器等。在一些实施例中,除了麦克风之外或替代麦克风,听觉接口设备1710可以包括一个或多个通信单元,其可以是一个或多个接收器,用于从设备110接收信号并将信号传送到用户100。听觉接口设备1710可以对应于反馈输出单元230,或者可以与反馈输出单元230分开,并且可以被配置为从反馈输出单元230接收信号。The user 100 may wear a hearing aid device consistent with the camera-based hearing aid devices described above. For example, the hearing aid device may be an auditory interface device 1710 as shown in FIG. 17A. Auditory interface device 1710 may be any device configured to provide auditory feedback to user 100. An auditory interface device 1710 may be placed in each ear of the user 100, similar to conventional auditory interface devices. As mentioned above, the auditory interface device 1710 may be of various styles, including in-canal, completely in-canal, in-ear, behind-the-ear, supra-aural, receiver-in-canal, open fit, or various other styles. Auditory interface device 1710 may include one or more speakers for providing auditory feedback to user 100, a microphone for detecting sounds in the environment of user 100, internal electronics, a processor, memory, and the like. In some embodiments, in addition to or instead of a microphone, auditory interface device 1710 may include one or more communication units, which may be one or more receivers, for receiving signals from device 110 and transmitting the signals to user 100. The auditory interface device 1710 may correspond to the feedback output unit 230 or may be separate from the feedback output unit 230, and may be configured to receive signals from the feedback output unit 230.
在一些实施例中,如图17A所示,听觉接口设备1710可以包括骨传导耳机1711。骨传导耳机1711可以通过外科手术植入,并且可以通过声音振动到内耳的骨传导来向用户100提供可听反馈。听觉接口设备1710还可以包括一个或多个耳机(例如,无线耳机、过耳耳机等)或由用户100携带或佩戴的便携式扬声器。在一些实施例中,听觉接口设备1710可以集成到其他设备中,诸如用户的蓝牙TM耳机、眼镜、头盔(例如,摩托车头盔、自行车头盔等)、帽子等。在一些实施例中,可以提供两个听觉接口设备1710,每个耳朵一个。两个听觉接口设备1710可以用电线连接或者可以无线连接。此外,第一听觉接口设备可以从第二听觉接口设备接收指令或音频。另外,两个听觉接口设备1710可以从另一个源(诸如装置110或配对设备)接收音频。In some embodiments, the auditory interface device 1710 may include a bone conduction headset 1711, as shown in FIG. 17A. Bone conduction earphones 1711 may be surgically implanted and may provide audible feedback to user 100 through bone conduction of sound vibrations to the inner ear. The auditory interface device 1710 may also include one or more earphones (eg, wireless earphones, over-ear earphones, etc.) or portable speakers carried or worn by the user 100 . In some embodiments, the auditory interface device 1710 may be integrated into other devices, such as the user's Bluetooth ™ headset, glasses, helmets (eg, motorcycle helmets, bicycle helmets, etc.), hats, and the like. In some embodiments, two auditory interface devices 1710 may be provided, one for each ear. The two auditory interface devices 1710 may be wired or may be wirelessly connected. Additionally, the first auditory interface device may receive instructions or audio from the second auditory interface device. Additionally, the two auditory interface devices 1710 may receive audio from another source, such as the device 110 or a paired device.
听觉接口设备1710可以被配置为与诸如装置110的相机设备进行通信。这种通信可以通过有线连接,或者可以无线地进行(例如,使用蓝牙TM、NFC或无线通信形式)。如上所述,装置110可以由用户100以各种配置来佩戴,包括物理地连接到衬衫、项链、腰带、眼镜、腕带、纽扣或与用户100相关联的其他物品。在一些实施例中,还可以包括诸如计算设备120的一个或多个附加设备。因此,本文关于装置110或处理器210描述的一个或多个过程或功能可以由计算设备120和/或处理器540执行。Auditory interface device 1710 may be configured to communicate with a camera device such as apparatus 110 . Such communication may be via a wired connection, or may be performed wirelessly (eg, using Bluetooth ™ , NFC or wireless forms of communication). As described above, device 110 may be worn by user 100 in various configurations, including physically attached to a shirt, necklace, belt, eyeglasses, wristband, button, or other item associated with user 100 . In some embodiments, one or more additional devices such as computing device 120 may also be included. Accordingly, one or more of the processes or functions described herein with respect to apparatus 110 or processor 210 may be performed by computing device 120 and/or processor 540 .
如上所述,装置110可以包括至少一个麦克风和至少一个图像捕捉设备。如关于图17B所描述的,装置110可以包括麦克风1720。麦克风1720可以被配置为确定用户100的环境中声音的方向性。例如,麦克风1720可以包括一个或多个定向麦克风、麦克风阵列、多端口麦克风等。处理器210可以被配置为区分用户100的环境内的声音并且确定每个声音的近似方向性。例如,使用麦克风阵列1720,处理器210可以对麦克风1720之间个体声音的相对定时或振幅进行比较,以确定相对于装置110的方向性。装置110可以包括诸如相机1730的一个或多个相机,它们可以对应于诸如图像传感器220的一个或多个图像传感器。相机1730可以被配置为捕捉用户100的周围环境的图像。装置110还可以使用听觉接口设备1710的一个或多个麦克风,并且因此,本文使用的对麦克风1720的引用也可以是指听觉接口设备1710上的麦克风。As mentioned above, the apparatus 110 may include at least one microphone and at least one image capture device. Device 110 may include microphone 1720 as described with respect to FIG. 17B. Microphone 1720 may be configured to determine the directionality of sound in the environment of user 100. For example, microphone 1720 may include one or more directional microphones, microphone arrays, multi-port microphones, and the like. The processor 210 may be configured to differentiate sounds within the environment of the user 100 and determine the approximate directionality of each sound. For example, using the microphone array 1720, the processor 210 may compare the relative timing or amplitude of individual sounds between the microphones 1720 to determine the directionality relative to the device 110. Device 110 may include one or more cameras such as camera 1730, which may correspond to one or more image sensors such as image sensor 220. The camera 1730 may be configured to capture images of the user's 100 surroundings. Apparatus 110 may also use one or more microphones of auditory interface device 1710, and thus, references to microphone 1720 used herein may also refer to a microphone on auditory interface device 1710.
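The relative-timing comparison between microphones can be sketched with a cross-correlation. This is one common way to estimate the inter-microphone sample delay, offered as an illustration rather than the method actually used by processor 210; the sample rate and tone are synthetic test data.

```python
import numpy as np

def tdoa_samples(delayed: np.ndarray, reference: np.ndarray) -> int:
    """Estimate how many samples `delayed` lags `reference` by locating the
    peak of their full cross-correlation."""
    corr = np.correlate(delayed, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)

fs = 16_000
t = np.arange(0, 0.01, 1 / fs)
tone = np.sin(2 * np.pi * 1000 * t)
lag = 5  # the same wavefront reaches the far microphone 5 samples later
near_mic = np.concatenate([tone, np.zeros(lag)])
far_mic = np.concatenate([np.zeros(lag), tone])
estimated = tdoa_samples(far_mic, near_mic)
```

A positive result means the first argument lags the second, i.e., the sound reached the `near_mic` channel first; with the microphone geometry known, that lag can be converted into a direction relative to device 110.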
处理器210(和/或处理器210a和210b)可以被配置为检测用户100的环境内的声音发出对象(诸如个体)。图50A是示出示例性环境的示意图。如图50A所示,佩戴装置110的用户100可以物理地存在于环境中并且个体5002产生声音5004。因此,在图50A中呈现的场景中,个体5002是声音发出对象。尽管图50A示出了作为声音发出对象的个体,但是声音发出对象可以是环境中产生可被用户100听到或被装置110检测到的声波的任何对象。例如,声音发出对象可以是机器、动物或自然产生的声源(诸如风)。在一些情况下,声音发出对象可能在产生声音的同时移动。可替代地,声音发出对象可以在没有可观察到的运动的情况下产生声音(诸如收音机的扬声器)。Processor 210 (and/or processors 210a and 210b) may be configured to detect sound-emitting objects (such as individuals) within the environment of user 100. FIG. 50A is a schematic diagram illustrating an exemplary environment. As shown in FIG. 50A, user 100 wearing device 110 may be physically present in the environment and individual 5002 produces sound 5004. Thus, in the scene presented in FIG. 50A, the individual 5002 is the sound-emitting object. Although FIG. 50A shows an individual as the sound-emitting object, the sound-emitting object may be any object in the environment that produces sound waves that can be heard by the user 100 or detected by the device 110. For example, the sound-emitting object may be a machine, an animal, or a naturally occurring sound source such as wind. In some cases, the sound-emitting object may move while producing the sound. Alternatively, the sound-emitting object may produce sound without observable motion (such as a speaker of a radio).
可以通过检测声音在多个监听设备处的到达时间的差值和/或检测音量的差值来确定声音发出对象的位置。例如,对于人类,声音定位是通过确定声音到达人的左耳和右耳的时间差值来进行的。声音定位也是通过确定左耳和右耳处的声音音量差值来实现的。例如,在图50A中,用户100可以通过注意到声音5004到达用户100的右耳比到达用户100的左耳更早且更响亮来确定个体5002正站在用户100的前面和右边。然而,如果用户100听力受损,则用户100可能不得不依赖视力来确定个体5002的位置。如果用户100视力受损,或者看向不同方向,则用户100可能无法确定声音的位置,从而无法确定声源。The location of the sound emitting object may be determined by detecting the difference in arrival times of the sound at the plurality of listening devices and/or by detecting the difference in volume. For humans, for example, sound localization is performed by determining the time difference between the sound reaching the left and right ears of the person. Sound localization is also achieved by determining the difference in sound volume at the left and right ears. For example, in Figure 50A, user 100 may determine that individual 5002 is standing in front and to the right of user 100 by noting that sound 5004 reaches user 100's right ear earlier and louder than user 100's left ear. However, if user 100 is hearing impaired, user 100 may have to rely on vision to determine the location of individual 5002. If the user 100 is visually impaired, or looks in a different direction, the user 100 may not be able to determine the location of the sound and thus the source of the sound.
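The interaural timing computation above can be sketched with the standard far-field model ITD = (d / c) · sin(θ). The ear spacing and the model itself are illustrative assumptions, not the disclosed method.

```python
import math

MIC_SPACING_M = 0.18       # assumed spacing between the two ears/listening devices
SPEED_OF_SOUND_M_S = 343.0

def azimuth_from_itd(itd_s: float) -> float:
    """Invert the far-field model ITD = (d / c) * sin(theta); returns the
    azimuth in degrees, positive toward the ear the sound reached first."""
    s = itd_s * SPEED_OF_SOUND_M_S / MIC_SPACING_M
    s = max(-1.0, min(1.0, s))  # clamp against measurement noise
    return math.degrees(math.asin(s))

# A wavefront arriving ~0.26 ms earlier at the right ear implies a source
# roughly 30 degrees to the right of the user.
itd_example_s = MIC_SPACING_M * 0.5 / SPEED_OF_SOUND_M_S
azimuth_deg = azimuth_from_itd(itd_example_s)
```

This is the same geometric relationship the human auditory system exploits; a hearing aid system can compute it explicitly from the measured arrival-time difference.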
为了弥补这一点,本公开的某些实施例可以基于声音发出对象的确定位置提供立体声信号。在一些实施例中,装置110可以使用由相机1730捕捉的图像来确定声音发出对象的位置。例如,图50B是符合本公开的由成像捕捉设备捕捉的示例性图像的示意图。处理器210可以被配置为分析由相机1730捕捉的图像,以检测相机1730的视场5006中的声音发出对象(诸如个体5002),并确定从用户100到声音发出对象的方向。例如,处理器210可以确定个体5002在视场5006内的位置。如图所示,视场5006可以被划分为表示角度的部分。例如,视场5006可以具有与用户视线方向1750对齐的中心线5008。视场5006可以被划分为表示由线5008和5010划定的偏离中心0-5度的区域的部分;由线5010和5012划定的偏离中心5-10度的区域的部分;和由线5012和5014划定的偏离中心10-15度的区域的部分。个体5002或其他声音发出对象的位置可以通过参考视场5006的这些部分来确定。例如,如图50B所示,个体5002的嘴位于由线5010和5012划定的区域内。因此,为了确定朝向声音发出对象的方向,处理器210可以使用如下文进一步描述的运动检测和/或声音定位技术。尽管图50B示出了线5008-5014,但处理器210可以使用替代度量来确定声音发出对象相对于用户100的方向,诸如视场5006的x和/或y坐标,或视场5006的较小和/或较大部分。另外,如下面将更详细地描述的,不是或除了由相机1730捕捉的图像之外,处理器210可以使用声音到达时间信息。To compensate for this, certain embodiments of the present disclosure may provide a stereo signal based on the determined location of the sound-emitting object. In some embodiments, the device 110 may use the images captured by the camera 1730 to determine the location of the sound-emitting object. For example, FIG. 50B is a schematic diagram of an exemplary image captured by an image capture device consistent with the present disclosure. Processor 210 may be configured to analyze images captured by camera 1730 to detect sound-emitting objects, such as individual 5002, in field of view 5006 of camera 1730, and to determine the direction from user 100 to the sound-emitting object. For example, the processor 210 may determine the location of the individual 5002 within the field of view 5006. As shown, the field of view 5006 may be divided into sections representing angles. For example, the field of view 5006 may have a centerline 5008 aligned with the user's gaze direction 1750. Field of view 5006 may be divided into a portion representing the area 0-5 degrees off center, delineated by lines 5008 and 5010; a portion representing the area 5-10 degrees off center, delineated by lines 5010 and 5012; and a portion representing the area 10-15 degrees off center, delineated by lines 5012 and 5014. The location of the individual 5002 or other sound-emitting object can be determined by reference to these portions of the field of view 5006. For example, as shown in FIG. 50B, the mouth of individual 5002 is located within the area delineated by lines 5010 and 5012. Thus, to determine the direction toward the sound-emitting object, the processor 210 may use motion detection and/or sound localization techniques as described further below. Although FIG. 50B shows lines 5008-5014, the processor 210 may use alternative metrics to determine the direction of the sound-emitting object relative to the user 100, such as x and/or y coordinates of the field of view 5006, or smaller and/or larger portions of the field of view 5006. Additionally, as will be described in more detail below, the processor 210 may use sound time-of-arrival information instead of, or in addition to, the images captured by camera 1730.
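The sector mapping above can be sketched as follows, assuming a simple linear mapping from horizontal pixel offset to angle; the 1280-pixel frame width and the 15-degree half-field are illustrative assumptions.

```python
def angle_and_sector(x_px: float, width_px: int, half_fov_deg: float = 15.0,
                     sector_deg: float = 5.0):
    """Map a horizontal pixel position to a signed off-center angle and the
    5-degree sector index it falls in (0 -> 0-5°, 1 -> 5-10°, 2 -> 10-15°);
    negative angles are left of the gaze-aligned centerline 5008."""
    offset = (x_px - width_px / 2) / (width_px / 2)  # -1 .. 1 across the frame
    angle_deg = offset * half_fov_deg
    sector = int(abs(angle_deg) // sector_deg)
    return angle_deg, sector

# A mouth detected at x = 900 in a hypothetical 1280-px frame falls in the
# 5-10 degree band (between lines 5010 and 5012 of FIG. 50B):
angle, sector = angle_and_sector(900, 1280)
```

A real camera would need its calibrated intrinsics for an exact pixel-to-angle mapping; the linear model here only illustrates how the field of view can be partitioned into the angular sectors described above.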
基于检测到的声音发出对象的位置,处理器210可以引起对音频的选择性调节,以便将声音发出对象的位置传送给用户100。调节可以包括相对于其他音频信号放大被确定为对应于声音5004(其可对应于个体5002的语音)的音频信号。在一些实施例中,放大可以例如通过相对于其他信号处理与声音5004相关联的音频信号来数字化地实现。另外地或者可替代地,可以通过改变麦克风1720的一个或多个参数来实现放大,以聚焦于与个体5002相关联的音频声音。例如,麦克风1720可以是定向麦克风,处理器210可以执行将麦克风1720聚焦在声音5004上的操作。可以使用用于放大声音5004的各种其他技术,诸如使用波束成形麦克风阵列、声学望远镜技术等。经调节的音频信号可以被发送到两个听觉接口设备1710,并且因此可以向用户100提供基于声源位置的经调节音频。选择性调节还可以包括在声音被传送到第一耳朵与声音被传送到第二耳朵之间引入振幅差或延迟。以下将提供选择性调节的进一步细节。Based on the detected location of the sound emitting object, the processor 210 may cause selective adjustments to the audio to communicate the location of the sound emitting object to the user 100 . Adjusting may include amplifying the audio signal determined to correspond to sound 5004 (which may correspond to the speech of individual 5002 ) relative to other audio signals. In some embodiments, amplification may be accomplished digitally, for example, by processing the audio signal associated with sound 5004 relative to other signals. Additionally or alternatively, amplification may be achieved by changing one or more parameters of the microphone 1720 to focus on the audio sounds associated with the individual 5002. For example, microphone 1720 may be a directional microphone and processor 210 may perform operations to focus microphone 1720 on sound 5004. Various other techniques for amplifying sound 5004 may be used, such as the use of beamforming microphone arrays, acoustic telescope techniques, and the like. The adjusted audio signal may be sent to the two auditory interface devices 1710 and thus may provide adjusted audio to the user 100 based on the location of the sound source. The selective adjustment may also include introducing an amplitude difference or delay between the sound being delivered to the first ear and the sound being delivered to the second ear. Further details of selective adjustment will be provided below.
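The relative amplification described above can be sketched digitally as follows. This assumes the identified signal and the remaining audio have already been separated; the gain values are illustrative, not values from the disclosure.

```python
import numpy as np

def selectively_condition(identified: np.ndarray, remainder: np.ndarray,
                          gain: float = 4.0,
                          attenuation: float = 0.25) -> np.ndarray:
    """Amplify the audio signal identified as coming from the sound-emitting
    object of interest while attenuating everything else, then remix."""
    return gain * np.asarray(identified) + attenuation * np.asarray(remainder)

# The identified speech dominates the conditioned mix:
mixed = selectively_condition(np.array([0.1, 0.2]), np.array([0.4, 0.4]))
```

In practice the separation itself (beamforming, directional-microphone steering, etc.) is the hard part; this sketch only shows the relative-gain step applied once the target signal is isolated.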
声音的调节程度(诸如放大差或延迟的长度)可以基于相对于用户100的到声音发出对象的确定方向。例如,参照图50B,如果个体5002位于视场5006的中心线5008附近,则处理器210可以提供小程度的调节或不调节。可替代地,如果个体5002位于视场5006的右边缘附近,则处理器210可以提供更大程度的调节。此外,如果个体5002在视场5006之外,则处理器210可以基于多个麦克风处的声音到达时间来确定个体5002的位置,并引入更大程度的调节。The degree of modulation of the sound, such as the amplification difference or the length of the delay, may be based on the determined direction to the sound-emitting object relative to the user 100. For example, referring to FIG. 50B, if the individual 5002 is located near the centerline 5008 of the field of view 5006, the processor 210 may provide a small degree of adjustment or no adjustment. Alternatively, if the individual 5002 is located near the right edge of the field of view 5006, the processor 210 may provide a greater degree of adjustment. Additionally, if the individual 5002 is outside the field of view 5006, the processor 210 may determine the location of the individual 5002 based on the arrival times of sound at multiple microphones and introduce a greater degree of adjustment.
通过参考图51,可以进一步理解声音调节,图51示出了由符合本发明的助听器系统获取和重放的音频信号的示意图。根据本公开,装置110可以接收由可穿戴麦克风1720获取的音频信号,该音频信号反射由诸如个体5002的声音发出对象生成的声音。接收信号5102是表示由至少一个麦克风捕捉的声音的音频信号。接收信号5102可以是例如图50A的声音5004。一旦接收信号5102到达处理器210,处理器210可以确定对应于接收信号5102的声音发出对象的位置。处理器210可以通过例如识别如图50B所示的相机的视场中的移动对象、通过计算在多个麦克风处的接收信号的到达时间差、通过操纵定向麦克风以及其他方法来确定位置。Sound conditioning may be further understood by reference to FIG. 51, which shows a schematic diagram of audio signals acquired and played back by a hearing aid system consistent with the present invention. In accordance with the present disclosure, device 110 may receive audio signals acquired by wearable microphone 1720 that reflect sounds generated by sound-emitting objects such as individual 5002. Received signal 5102 is an audio signal representing sound captured by at least one microphone. The received signal 5102 may be, for example, the sound 5004 of FIG. 50A. Once the received signal 5102 reaches the processor 210, the processor 210 may determine the location of the sound-emitting object corresponding to the received signal 5102. The processor 210 may determine the location by, for example, identifying moving objects in the field of view of the camera as shown in FIG. 50B, by calculating the time difference of arrival of the received signals at multiple microphones, by steering a directional microphone, and by other methods.
在处理器210分析接收信号5102之后,处理器210可以引起立体声表示的传输。例如,如图51所示,处理器210可以生成第一信号5104和第二信号5106。例如,第一信号5104可以被发送到与用户100的右耳相关联的听觉接口设备1710,第二信号5106可以被发送到与用户100的左耳相关联的听觉接口设备1710。After processor 210 analyzes received signal 5102, processor 210 may cause transmission of a stereo representation. For example, as shown in FIG. 51, the processor 210 may generate the first signal 5104 and the second signal 5106. For example, first signal 5104 may be sent to auditory interface device 1710 associated with the right ear of user 100 and second signal 5106 may be sent to auditory interface device 1710 associated with user 100's left ear.
为了创建立体声表示,处理器210可以在第一信号5104的传输之后将第二信号5106的传输延迟一段时间。例如,处理器210可以在声音到达装置110后100毫秒开始第一信号5104的传输,如第一传输开始线5108所示。虽然作为示例使用100毫秒,但也可以使用其他合适的时间段。例如,时间段的范围可以是50-400毫秒,或者任何其他合适的时间段。处理器210然后可以在声音到达装置110后200毫秒发送第二信号5106,如第二传输开始线5110所示。因此,第二信号5106可以在第一信号5104之后大约100毫秒被发送,如图51的延迟5112所示。因此,用户100将在第二信号5106之前听到第一信号5104。如果第一信号5104与例如用户的右耳相关联,则用户100将感知声源朝向他的右侧。尽管在接收信号5102到达后100毫秒示出了第一传输开始,但在某些实施例中,处理器210可以在其他时间帧中(诸如在接收信号5102到达后小于100毫秒)使第一信号5104传输到听觉接口设备1710。类似地,延迟5112可以是其他持续时间,并且可以根据声音发出对象的位置而变化。To create the stereo representation, the processor 210 may delay the transmission of the second signal 5106 for a period of time after the transmission of the first signal 5104. For example, processor 210 may begin transmission of first signal 5104 100 milliseconds after the sound arrives at device 110, as indicated by first transmission start line 5108. Although 100 milliseconds is used as an example, other suitable time periods may be used. For example, the time period may be in the range of 50-400 milliseconds, or any other suitable time period. Processor 210 may then send second signal 5106 200 milliseconds after the sound reaches device 110, as indicated by second transmission start line 5110. Thus, the second signal 5106 may be sent approximately 100 milliseconds after the first signal 5104, as shown by delay 5112 in FIG. 51. Therefore, the user 100 will hear the first signal 5104 before the second signal 5106. If the first signal 5104 is associated with, for example, the user's right ear, the user 100 will perceive the sound source to be toward his right. Although the first transmission is shown to begin 100 milliseconds after the arrival of the received signal 5102, in some embodiments the processor 210 may cause the first signal 5104 to be transmitted to the auditory interface device 1710 in other time frames, such as less than 100 milliseconds after the arrival of the received signal 5102. Similarly, delay 5112 may be of other durations and may vary depending on the location of the sound-emitting object.
图51还示出了可以在每个信号中放大和减小接收信号,从而提供声音发出对象的位置的附加指示。例如,接收信号5102的最大强度接近30,000个单位。然而,第一信号5104具有接近40,000的最大强度,说明处理器210已经放大了第一信号5104。另一方面,第二信号5106具有大约20,000的最大强度,指示处理器210已经衰减了第二信号5106。因此,假设第一信号5104被发送到用户右耳中的听觉接口设备1710,并且第二信号5106被发送到用户左耳中的听觉接口设备1710,用户将在他的右耳中听到比在他或她的左耳中更大的声音,增强了用户确定声源的能力。FIG. 51 also shows that the received signal may be amplified or attenuated in each signal, providing an additional indication of the location of the sound-emitting object. For example, the maximum intensity of the received signal 5102 is approximately 30,000 units. However, the first signal 5104 has a maximum intensity close to 40,000, indicating that the processor 210 has amplified the first signal 5104. On the other hand, the second signal 5106 has a maximum intensity of approximately 20,000, indicating that the processor 210 has attenuated the second signal 5106. Thus, assuming that the first signal 5104 is sent to the auditory interface device 1710 in the user's right ear, and the second signal 5106 is sent to the auditory interface device 1710 in the user's left ear, the user will hear a louder sound in his or her right ear than in his or her left ear, enhancing the user's ability to determine the source of the sound.
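The per-ear delay and gain shaping of FIG. 51 can be sketched together as follows. The specific gains and the 16 kHz sample rate are assumptions for illustration, not values from the disclosure.

```python
import numpy as np

def binaural_pair(signal: np.ndarray, fs_hz: int, itd_s: float,
                  near_gain: float = 1.3, far_gain: float = 0.7):
    """Return (near_ear, far_ear) channels: the far ear receives the same
    signal delayed by the interaural time difference and attenuated, so the
    source is perceived toward the near ear."""
    sig = np.asarray(signal, dtype=float)
    lag = int(round(itd_s * fs_hz))
    near = near_gain * np.concatenate([sig, np.zeros(lag)])
    far = far_gain * np.concatenate([np.zeros(lag), sig])
    return near, far

# A 0.5 ms interaural delay at 16 kHz shifts the far channel by 8 samples.
near, far = binaural_pair(np.ones(4), 16_000, 0.0005)
```

Sending `near` to the right-ear interface device and `far` to the left-ear device reproduces both cues discussed above: earlier arrival and greater loudness at the ear nearer the source.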
在一些实施例中,选择性调节可以包括衰减或抑制与确定的声音发出对象不相关联的一个或多个音频信号(诸如背景噪声)。调节还可以包括改变接收信号5102的音调以使声音对于用户100更易感知。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以调整接收信号5102的音高。例如,用户100可能经历10kHz以上的频率中的听觉损失,并且处理器210可以将更高的频率(例如,在15kHz处)重新映射到低于10kHz的频率。在一些实施例中,处理器210可以被配置为改变与一个或多个音频信号相关联的语速。In some embodiments, the selective adjustment may include attenuating or suppressing one or more audio signals (such as background noise) not associated with the determined sound-emitting object. Adjusting may also include changing the pitch of the received signal 5102 to make the sound more perceptible to the user 100 . For example, the user 100 may have less sensitivity to tones within a certain range, and the adjustment of the audio signal may adjust the pitch of the received signal 5102. For example, user 100 may experience hearing loss in frequencies above 10 kHz, and processor 210 may remap higher frequencies (eg, at 15 kHz) to frequencies below 10 kHz. In some embodiments, the processor 210 may be configured to vary the speech rate associated with one or more audio signals.
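One simple way to realize the remapping example (moving a 15 kHz component below 10 kHz) is a linear compression of the inaudible band into the upper audible band. The band edges below are illustrative assumptions, not values from the disclosure.

```python
def remap_frequency(f_hz: float, limit_hz: float = 10_000.0,
                    top_hz: float = 20_000.0, floor_hz: float = 8_000.0) -> float:
    """Compress frequencies in (limit, top] linearly into (floor, limit];
    frequencies the user can already hear pass through unchanged."""
    if f_hz <= limit_hz:
        return f_hz
    frac = (f_hz - limit_hz) / (top_hz - limit_hz)
    return floor_hz + frac * (limit_hz - floor_hz)

# A 15 kHz component is moved to 9 kHz, inside the user's audible range.
```

Applying such a mapping to actual audio would be done per frequency bin of a short-time spectrum; the function above only captures the bin-remapping rule itself.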
图52是示出用于生成符合所公开实施例的立体声表示的示例性过程的流程图。过程5200可以由与装置110相关联的一个或多个处理器(诸如处理器210)来执行。处理器可以包括在与也可以用于过程5200的麦克风1720和相机1730相同的公共外壳中。例如,装置110可以包括被配置为从用户的环境捕捉声音的至少一个麦克风。装置110可以包括被配置为从用户的环境捕捉多个图像的可穿戴相机。在一些实施例中,过程5200的一些或全部可以在装置110外部的处理器上执行,它们可以包括在第二外壳中。例如,过程5200的一个或多个部分可以由听觉接口设备1710或诸如计算设备120或显示设备2301的辅助设备中的处理器来执行。在这样的实施例中,处理器可以被配置为经由公共外壳中的发送器与第二外壳中的接收器之间的无线链路接收所捕捉的图像。52 is a flowchart illustrating an exemplary process for generating a stereo representation consistent with the disclosed embodiments. Process 5200 may be performed by one or more processors associated with apparatus 110, such as processor 210. The processor may be included in the same common housing as the microphone 1720 and camera 1730 that may also be used in process 5200. For example, apparatus 110 may include at least one microphone configured to capture sound from the user's environment. Apparatus 110 may include a wearable camera configured to capture multiple images from the user's environment. In some embodiments, some or all of process 5200 may be performed on a processor external to device 110, which may be included in the second housing. For example, one or more portions of process 5200 may be performed by a processor in auditory interface device 1710 or an auxiliary device such as computing device 120 or display device 2301 . In such an embodiment, the processor may be configured to receive the captured image via a wireless link between the transmitter in the common housing and the receiver in the second housing.
在步骤5202中,过程5200可以包括接收由相机捕捉的多个图像。例如,装置110可以捕捉图像并存储被压缩为JPG文件的图像的表示。作为另一示例,装置110可以捕捉彩色图像,但存储彩色图像的黑白表示。作为又一示例,装置110可以捕捉图像并存储图像的不同表示(例如,图像的一部分)。例如,装置110可以存储图像的一部分,该部分包括出现在图像中的人的脸,但基本上不包括围绕该人的环境。类似地,装置110例如可以存储图像的一部分,该部分包括出现在图像中的对象,但基本上不包括围绕该对象的环境。作为又一示例,装置110可以以降低的分辨率(即,以比捕捉的图像的分辨率低的分辨率)存储图像的表示。存储图像的表示可以允许装置110节省存储器550中的存储空间。此外,处理图像的表示可以允许装置110提高处理效率和/或帮助维持电池寿命。In step 5202, process 5200 may include receiving a plurality of images captured by the camera. For example, device 110 may capture an image and store a representation of the image compressed as a JPG file. As another example, device 110 may capture a color image, but store a black and white representation of the color image. As yet another example, device 110 may capture an image and store different representations of the image (eg, a portion of the image). For example, device 110 may store a portion of an image that includes the face of a person appearing in the image, but does not substantially include the environment surrounding the person. Similarly, device 110 may, for example, store a portion of an image that includes an object that appears in the image, but does not substantially include the environment surrounding the object. As yet another example, the apparatus 110 may store the representation of the image at a reduced resolution (ie, at a lower resolution than the captured image). Storing the representation of the image may allow device 110 to conserve storage space in memory 550 . Additionally, processing the representation of the image may allow device 110 to increase processing efficiency and/or help maintain battery life.
步骤5202还可以包括确定一系列图像中的差异。例如,可以减去先前的图像(诸如逐像素减去)以指示在第一图像的时间与第二图像的时间之间移动的图像的部分。在一些实施例中,该差分图像可以被存储为图像的表示。Step 5202 may also include determining differences in the series of images. For example, the previous image may be subtracted (such as pixel-by-pixel subtraction) to indicate the portion of the image that moved between the time of the first image and the time of the second image. In some embodiments, the differential image may be stored as a representation of the image.
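The frame-differencing step above can be sketched as a pixel-wise subtraction with a small threshold; the threshold value is an illustrative assumption.

```python
import numpy as np

def motion_mask(frame_a: np.ndarray, frame_b: np.ndarray,
                threshold: int = 10) -> np.ndarray:
    """Pixel-wise absolute difference of two consecutive frames; True marks
    pixels that changed between the two capture times (i.e., moved)."""
    # widen to int16 first so uint8 subtraction cannot wrap around
    diff = np.abs(frame_b.astype(np.int16) - frame_a.astype(np.int16))
    return diff > threshold

a = np.array([[10, 10], [10, 10]], dtype=np.uint8)
b = np.array([[10, 200], [10, 10]], dtype=np.uint8)
mask = motion_mask(a, b)  # only the changed pixel is flagged
```

The boolean mask (or the raw difference image) can be stored as the compact representation of the image pair described above.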
在步骤5204中,过程5200可以包括接收表示由至少一个麦克风捕捉的声音的音频信号。在一些实施例中,过程5200可以从多个麦克风接收和组合多个音频信号。例如,装置110可以包括被设计为收集具有低频的声音的麦克风,以及被设计为收集具有高频的声音的麦克风。然后,步骤5204可以将声音组合成表示低频和高频两者的单个音频信号。步骤5204还可以包括确定在多个麦克风的每一个中的声音的到达时间之间的延迟。In step 5204, process 5200 can include receiving an audio signal representing sound captured by at least one microphone. In some embodiments, process 5200 may receive and combine multiple audio signals from multiple microphones. For example, the device 110 may include a microphone designed to collect sounds with low frequencies, and a microphone designed to collect sounds with high frequencies. Step 5204 may then combine the sounds into a single audio signal representing both low and high frequencies. Step 5204 may also include determining a delay between arrival times of sounds in each of the plurality of microphones.
在步骤5206中,过程5200可以包括基于对多个图像的分析和/或基于对接收到的声音的分析来确定声音发出对象的位置。该位置可以是声音发出对象相对于用户的角度,如图50B所示。另外地或可替代地,该位置可以是从用户到声音发出对象的距离。处理器210可以使用捕捉的图像中的参考对象或相机1730的镜头的焦距来确定距离。在一些实施例中,装置110可以包括提供立体图像的多个相机,它们可以使处理器210确定到对象的距离。在一些实施例中,过程5200可以包括确定声音的多普勒效应以例如通过计算声波的频率改变来识别声音发出对象的位置。另外,过程5200可以将声音与声音配置文件匹配,以帮助识别多个图像中的声音发出对象的类型。例如,处理器210可以基于对多个图像的分析来识别多个可能的声音发出对象(诸如汽车和割草机)。处理器210还可以将接收到的声音与汽车和割草机的声音配置文件进行比较,并确定割草机是声音发出对象。此外,处理器210可以使用来自惯性测量或来自装置110的测距仪的数据来确定距离。例如,处理器210可以基于惯性测量来计算装置110向左移动两英尺,而相机1730的视场中的对象向视场的中心的右侧移动15度。处理器210可以使用来自例如设置在装置110上的测距仪的测量来识别到最近对象的距离。例如,测距仪可以测量在用户前面或到用户侧面的距离,并且该距离可以用于确定位置。基于这些测量,处理器210可以确定到声音发出对象的距离。In step 5206, the process 5200 may include determining the location of the sound emitting object based on the analysis of the plurality of images and/or based on the analysis of the received sound. The position may be the angle of the sound-emitting object relative to the user, as shown in Figure 50B. Additionally or alternatively, the location may be the distance from the user to the object emitting the sound. The processor 210 may use the reference object in the captured image or the focal length of the lens of the camera 1730 to determine the distance. In some embodiments, the apparatus 110 may include a plurality of cameras that provide stereoscopic images, which may enable the processor 210 to determine the distance to the object. In some embodiments, process 5200 may include determining the Doppler effect of the sound to identify the location of the sound emitting object, eg, by calculating frequency changes of the sound waves. Additionally, process 5200 can match sounds to sound profiles to help identify types of sound-emitting objects in the plurality of images. For example, the processor 210 may identify a number of possible sound-emitting objects (such as cars and lawn mowers) based on analysis of the multiple images. 
The processor 210 may also compare the received sound to the sound profiles of the car and the lawn mower and determine that the lawn mower is the sound emitting object. Additionally, the processor 210 may use data from inertial measurements or from a rangefinder of the device 110 to determine the distance. For example, processor 210 may calculate, based on inertial measurements, that device 110 moves two feet to the left while objects in the field of view of camera 1730 move 15 degrees to the right of the center of the field of view. The processor 210 may use measurements from, for example, a rangefinder provided on the device 110 to identify the distance to the closest object. For example, a rangefinder can measure the distance in front of the user or to the side of the user, and this distance can be used to determine location. Based on these measurements, the processor 210 can determine the distance to the sound-emitting object.
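The profile comparison described above (e.g., deciding between a car and a lawn mower) might be sketched as a nearest-neighbor match over feature vectors. This is a hypothetical Python sketch; the function name, the use of Euclidean distance, and the band-energy feature vectors are assumptions, not the patent's actual matching method.

```python
def match_sound_profile(features, profiles):
    """Return the name of the stored profile whose feature vector is
    closest (Euclidean distance) to the received sound's features."""
    def distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(profiles, key=lambda name: distance(features, profiles[name]))
```

For example, with `profiles = {"car": [...], "lawn mower": [...]}`, the received sound's features would select the profile that best explains the audio.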
如果处理器210检测到多个可能的声音发出对象,则可能需要附加步骤来确定多个可能对象中的哪个与接收到的声音相关联。例如,步骤5206可以包括确定最接近用户100的对象。因此,步骤5206可以包括确定第一对象到用户的距离,以及确定第二对象到用户的距离。一旦处理器210确定从用户到第一对象和第二对象的距离,处理器210可以基于所确定的距离选择第一和第二对象中的一个作为声音发出对象。例如,处理器210可以选择最近的对象作为声音发出对象。If the processor 210 detects multiple possible sound-emitting objects, additional steps may be required to determine which of the multiple possible objects is associated with the received sound. For example, step 5206 may include determining the object closest to user 100 . Thus, step 5206 may include determining the distance of the first object to the user, and determining the distance of the second object to the user. Once the processor 210 determines the distance from the user to the first object and the second object, the processor 210 may select one of the first and second objects as the sound emitting object based on the determined distance. For example, the processor 210 may select the closest object as the sound emitting object.
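The closest-object selection in step 5206 reduces to a minimum over the determined distances. The following one-function Python sketch is illustrative only; the candidate representation as `(label, distance)` pairs is an assumption.

```python
def select_closest_object(candidates):
    """candidates: list of (label, distance_to_user_in_meters).
    Return the label of the nearest candidate, which step 5206 may
    treat as the sound-emitting object."""
    label, _distance = min(candidates, key=lambda c: c[1])
    return label
```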
在步骤5206中,过程5200还可以包括识别在多个图像中的至少一个中的声音发出对象的表示。在一个实施例中,步骤5206可以包括基于对多个图像的分析来识别与个体的嘴相关联的至少一个唇部移动或唇部位置。处理器210可以被配置为识别与个体的嘴相关联的一个或多个点。在一些实施例中,处理器210可以开发与个体的嘴相关联的轮廓,该轮廓可以定义与个体的嘴或唇部相关联的边界。可以在多个帧或图像上跟踪在图像中识别出的唇部,以识别唇部移动。处理器210还可以使用一种或多种视频跟踪算法,诸如均值漂移跟踪、轮廓跟踪(例如,压缩算法)或各种其他技术。In step 5206, the process 5200 may also include identifying a representation of the sound-emitting object in at least one of the plurality of images. In one embodiment, step 5206 may include identifying at least one lip movement or lip position associated with the individual's mouth based on analysis of the plurality of images. The processor 210 may be configured to identify one or more points associated with the individual's mouth. In some embodiments, the processor 210 may develop a contour associated with the individual's mouth, which may define a boundary associated with the individual's mouth or lips. Lips identified in an image may be tracked over multiple frames or images to identify lip movement. The processor 210 may also use one or more video tracking algorithms, such as mean-shift tracking, contour tracking (e.g., condensation algorithms), or various other techniques.
在一些情况下,声音发出对象可能与其他运动相关联。例如,声音发出对象可以是移动的汽车、在水池中溅水的人或锤子敲击钉子。因此,在一些实施例中,处理器210可以通过识别用户环境中的至少一个移动对象来识别声音发出对象的表示。步骤5206可以包括在一系列捕捉的图像之间进行背景减法以识别运动,如果用户100静止,这可以帮助运动检测。步骤5206可以使用其他减法技术来确定观察到的运动是否是时段性的,并且基于运动时段和声音时段来确定观察到的运动与接收到的音频信号相关联还是不相关联。在步骤5208中,过程5200可以包括基于声音发出对象的位置生成立体声表示,立体声表示包括第一音频信号和第二音频信号,第一音频信号在至少一个方面不同于第二音频信号以模拟该对象相对于用户的位置。例如,如图51的第一信号5104和第二信号5106所示,第一音频信号和第二音频信号可以不同。In some cases, the sound-emitting object may be associated with other movements. For example, the sound-emitting object can be a moving car, a person splashing in a pool, or a hammer hitting a nail. Thus, in some embodiments, the processor 210 may identify a representation of a sound-emitting object by identifying at least one moving object in the user's environment. Step 5206 may include background subtraction between a series of captured images to identify motion, which may aid motion detection if the user 100 is stationary. Step 5206 may use other subtraction techniques to determine whether the observed motion is periodic, and to determine whether the observed motion is associated with the received audio signal based on the motion period and the sound period. In step 5208, process 5200 may include generating a stereo representation based on the location of the sound-emitting object, the stereo representation including a first audio signal and a second audio signal, the first audio signal being different from the second audio signal in at least one respect to simulate the object's location relative to the user. For example, as shown by the first signal 5104 and the second signal 5106 of FIG. 51, the first audio signal and the second audio signal may be different.
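The period-based association of observed motion with received audio can be made concrete with a small predicate. This Python sketch is illustrative only; the 10% relative tolerance is an assumed threshold, not a value from the disclosure.

```python
def motion_matches_sound(motion_period_s, sound_period_s, tolerance=0.1):
    """Associate observed periodic motion with received audio when the
    two periods agree to within a relative tolerance (assumed 10%)."""
    if motion_period_s <= 0 or sound_period_s <= 0:
        return False
    return abs(motion_period_s - sound_period_s) / sound_period_s <= tolerance
```

For example, a hammer seen striking every half second would be associated with audio whose impulses also repeat roughly every half second.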
例如,在一些实施例中,相对于第二音频信号,与声音发出对象相关联的声音可以在第一音频信号中被衰减。处理器210可以将声音的振幅增加小于1的因子以产生第一音频信号,该第一音频信号可以具有小于原始声音的强度。可替代地或另外地,处理器210可以将声音乘以大于1的因子以产生强度大于原始声音以及大于第一音频信号的第二音频信号。在一些实施例中,放大和衰减可以由处理器210数字化地执行。放大和衰减也可以由容纳在装置110中的放大器或衰减器电路来执行。For example, in some embodiments, the sound associated with the sound-emitting object may be attenuated in the first audio signal relative to the second audio signal. The processor 210 may scale the amplitude of the sound by a factor of less than 1 to generate a first audio signal, which may have a smaller intensity than the original sound. Alternatively or additionally, the processor 210 may multiply the sound by a factor greater than 1 to generate a second audio signal having an intensity greater than the original sound and greater than the first audio signal. In some embodiments, the amplification and attenuation may be performed digitally by the processor 210. Amplification and attenuation may also be performed by amplifier or attenuator circuits housed in the device 110.
此外,衰减或放大程度可以基于声音发出对象的位置(例如,在过程5200的步骤5206处确定的位置)来确定。处理器210可以执行位置到衰减和/或放大的转换。例如,处理器210可以将衰减因子计算为相对于相机1730的视场中心线朝向声音发出对象的方向的函数。处理器210可以通过将朝向声音发出对象的角度乘以系数(例如0.5)来确定衰减因子。例如,参照图50B,声音发出对象(即,个体5002)位于距离视场5006的中心5度至10度之间。处理器210可以将来自个体5002的声音除以确定为(10度×0.5)=5的衰减因子,使得第一信号比第二信号安静五倍。作为附加示例,如果个体5002位于图50B的中心线5008,则衰减因子将为0(0度×0.5=0),使得第一信号具有与第二信号相等的音量。以这种方式,来自远离相机1730的视场中心线的声音发出对象的声音得到比来自相机1730的视场中心线附近的声音发出对象的声音的被更大程度衰减的第一信号。本示例仅用于说明目的,并不一定限制本实施例。Additionally, the degree of attenuation or amplification may be determined based on the location of the sound-emitting object (e.g., the location determined at step 5206 of process 5200). The processor 210 may perform a conversion from position to attenuation and/or amplification. For example, the processor 210 may calculate the attenuation factor as a function of the direction toward the sound-emitting object relative to the centerline of the field of view of the camera 1730. The processor 210 may determine the attenuation factor by multiplying the angle toward the sound-emitting object by a coefficient (e.g., 0.5). For example, referring to FIG. 50B, the sound-emitting object (i.e., the individual 5002) is located between 5 and 10 degrees from the center of the field of view 5006. The processor 210 may divide the sound from the individual 5002 by an attenuation factor determined as (10 degrees × 0.5) = 5, making the first signal five times quieter than the second signal. As an additional example, if the individual 5002 were located on the centerline 5008 of FIG. 50B, the attenuation factor would be 0 (0 degrees × 0.5 = 0), such that the first signal has a volume equal to the second signal. In this way, sound from a sound-emitting object far from the centerline of the field of view of the camera 1730 yields a first signal that is attenuated more than sound from a sound-emitting object near that centerline. This example is for illustrative purposes only and does not necessarily limit this embodiment.
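The angle-to-attenuation conversion in the example above (factor = angle × 0.5) can be sketched as follows. This Python sketch is illustrative only; treating factors of 1 or below as "no attenuation" is an assumed interpretation, since the example makes an on-centerline source (factor 0) leave the two signals at equal volume rather than dividing by zero.

```python
def apply_angle_attenuation(samples, angle_deg, coefficient=0.5):
    """Attenuate the first-ear signal by a factor proportional to the
    angle from the camera's field-of-view centerline, per the
    angle * 0.5 example. Factors of 1 or below are treated as no
    attenuation so an on-centerline source keeps both signals equal."""
    factor = angle_deg * coefficient
    if factor <= 1.0:
        return list(samples)
    return [s / factor for s in samples]
```

At 10 degrees the factor is 5, so the first signal comes out five times quieter than the unmodified second signal, as in the FIG. 50B example.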
在确定从用户到声音发出对象的距离的实施例中,衰减因子可以基于距离而不是方向,或者除了方向之外还基于距离。例如,如果对象远离用户100,则衰减因子可以小于靠近用户100但相对于用户100在相同方向上的对象。这可以模拟立体声并且使得用户100能够更准确地进行声音定位,因为随着与耳朵的距离增加,每个耳朵中到达时间和音量的差值减小。设想了确定衰减因子的附加方法,诸如替代转换方法。此外,处理器210可以基于声音发出对象的位置来计算放大因子。In embodiments in which the distance from the user to the sound-emitting object is determined, the attenuation factor may be based on distance instead of, or in addition to, direction. For example, if an object is far from user 100, the attenuation factor may be smaller than for an object that is close to user 100 but in the same direction relative to user 100. This can simulate stereo sound and enable more accurate sound localization by the user 100, because the differences in arrival time and volume between the two ears decrease as the distance from the ears increases. Additional methods of determining the attenuation factor are envisioned, such as alternative conversion methods. Furthermore, the processor 210 may calculate an amplification factor based on the position of the sound-emitting object.
第一音频信号也可以或可替代地通过被延迟而不同于第二音频信号。即,与声音发出对象相关联的声音可以在第一音频信号中相对于第二音频信号被延迟,如先前在图51中由延迟5112所示。处理器210可以基于声音发出对象的位置来计算延迟,类似于上面描述的衰减或放大因子的计算。例如,在一些实施例中,基于声音发出对象与用户的距离,与声音发出对象相关联的声音在第一音频信号中被延迟某个延迟持续时间。此外,与声音发出对象相关联的声音基于声音发出对象的相对于用户的方向在第一音频信号中被延迟某个延迟持续时间。如上所述,在一些实施例中,延迟可以基于声音发出对象的方向和距离两者。例如,如果声音发出对象靠近用户并且靠近用户的右侧,声音将比用户的左耳更早到达用户的右耳。到达时间的差值与声音的总传播时间相比可能很大。例如,如果声音发出对象距离用户右耳6英寸,而用户右耳距离用户左耳6英寸,那么声音到达用户左耳的时间是到达用户右耳的时间的两倍。但是,如果声音发出对象离用户100英尺,那么到达用户耳朵的时间与声音传播的总时间相比就很小了。因此,处理器210可以基于声音发出对象的方向和距离两者来确定衰减因子,以更好地实现用户的声音定位能力。The first audio signal may also or alternatively differ from the second audio signal by being delayed. That is, the sound associated with the sound-emitting object may be delayed in the first audio signal relative to the second audio signal, as previously shown by delay 5112 in FIG. 51 . The processor 210 may calculate the delay based on the position of the sound emitting object, similar to the calculation of the attenuation or amplification factor described above. For example, in some embodiments, the sound associated with the sound-emitting object is delayed in the first audio signal by a delay duration based on the distance of the sound-emitting object from the user. Furthermore, the sound associated with the sound-emitting object is delayed in the first audio signal by a delay duration based on the direction of the sound-emitting object relative to the user. As mentioned above, in some embodiments, the delay may be based on both the direction and distance of the sounding object. For example, if the sound-emitting object is close to the user and to the user's right side, the sound will reach the user's right ear earlier than the user's left ear. The difference in arrival times can be large compared to the total travel time of the sound. 
For example, if the sound originating object is 6 inches from the user's right ear, and the user's right ear is 6 inches from the user's left ear, the sound takes twice as long to reach the user's left ear as it takes to reach the user's right ear. However, if the sound emitting object is 100 feet away from the user, the time to reach the user's ear is small compared to the total time the sound travels. Therefore, the processor 210 can determine the attenuation factor based on both the direction and the distance of the sound emitting object, so as to better realize the sound localization ability of the user.
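The geometry behind the 6-inch example above can be sketched directly. This Python sketch is illustrative only; the coordinate convention (ears at ±spacing/2 on the x-axis), the default ear spacing, and the room-temperature speed of sound are assumptions.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # approximate speed of sound in air

def interaural_delay(source_x, source_y, ear_spacing=0.15):
    """Delay (seconds) between the sound reaching the right ear and the
    left ear, for a source at (x, y) meters with the ears located at
    (+/- ear_spacing / 2, 0). Positive means the left ear hears later."""
    half = ear_spacing / 2
    d_right = math.hypot(source_x - half, source_y)
    d_left = math.hypot(source_x + half, source_y)
    return (d_left - d_right) / SPEED_OF_SOUND_M_S
```

With 6-inch (0.1524 m) ear spacing and a source 6 inches from the right ear along the ear axis, the left-ear path is exactly twice the right-ear path, matching the text; a source straight ahead produces zero delay.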
在一些场景中,声音发出对象可能在相机1730的视场之外,或者可能不产生可由相机1730检测到的运动,诸如收音机。然后,处理器210可能无法基于所接收的多个图像来确定到声音的方向和/或距离。因此,为了即使当声音发出对象在相机1730的视场之外时也能够进行声音定位,装置110可以包括多个麦克风。例如,至少一个麦克风可以包括第一麦克风,助听器系统还可以包括第二麦克风。处理器210可以被配置为确定与第一麦克风处的声音发出对象相关联的声音到达时间和与第二麦克风处的声音发出对象相关联的声音到达时间之间的差值。例如,处理器210可以分析从多个麦克风接收的音频信号以确定音频信号是否匹配,诸如确定峰值强度的定时或比较其他波形特性。另外,处理器210可以基于声速和麦克风之间的距离来应用定时窗口,使得当在第一麦克风处测量声音时,仅在声音在定时窗口内到达第二麦克风时分析其相似性。如果接收的音频信号来自同一源,则处理器210可以确定到达时间的差值。然后延迟的持续时间可以基于与第一麦克风处的声音发出对象相关联的声音到达时间和与第二麦克风处的声音发出对象相关联的声音到达时间之间的差值。在一些实施例中,延迟的持续时间可以相对于到达时间的差值而减小或增加,这可以使用户能够增强声音定位能力。In some scenarios, the sound-emitting object may be outside the field of view of the camera 1730, or may not produce motion detectable by the camera 1730, such as a radio. The processor 210 may then be unable to determine the direction and/or distance to the sound based on the plurality of images received. Thus, in order to enable sound localization even when the sound emitting object is outside the field of view of the camera 1730, the device 110 may include multiple microphones. For example, the at least one microphone may include a first microphone, and the hearing aid system may also include a second microphone. The processor 210 may be configured to determine the difference between the time of arrival of the sound associated with the sound emitting object at the first microphone and the time of arrival of the sound associated with the sound emitting object at the second microphone. For example, the processor 210 may analyze audio signals received from multiple microphones to determine whether the audio signals match, such as determining the timing of peak intensities or comparing other waveform characteristics. Additionally, the processor 210 may apply a timing window based on the speed of sound and the distance between the microphones, such that when a sound is measured at the first microphone, its similarity is only analyzed when the sound reaches the second microphone within the timing window. 
If the received audio signals are from the same source, the processor 210 may determine the difference in arrival times. The duration of the delay may then be based on the difference between the sound arrival time associated with the sound emitting object at the first microphone and the sound arrival time associated with the sound emitting object at the second microphone. In some embodiments, the duration of the delay may be decreased or increased relative to the difference in arrival time, which may enable the user to enhance sound localization capabilities.
在一些实施例中,在步骤5208中生成第一音频信号和第二音频信号还可以包括选择性地调节与声音发出对象相关联的至少一个音频信号。也就是,调节可以包括改变音频信号的音调或重放速度。例如,调节可以包括重新映射音频频率或改变与音频信号相关联的语速。在一些实施例中,调节可以包括相对于其他音频信号的第一音频信号的其他放大方法,诸如方向性麦克风的操作、改变与麦克风相关联的一个或多个参数、或者数字化地处理音频信号。调节可以包括衰减或抑制与人相关联的一个或多个音频信号。衰减的音频信号可以包括与在用户的环境中检测到的其他声音(包括诸如第二音频信号的其他语音)相关联的音频信号。例如,处理器210可以基于确定第二音频信号不与人相关联来选择性地衰减第二音频信号。In some embodiments, generating the first audio signal and the second audio signal in step 5208 may further include selectively adjusting at least one audio signal associated with the sound-emitting object. That is, adjusting may include changing the pitch or playback speed of the audio signal. For example, the adjustment may include remapping the audio frequency or changing the speech rate associated with the audio signal. In some embodiments, the adjustment may include other methods of amplifying the first audio signal relative to other audio signals, such as operation of a directional microphone, changing one or more parameters associated with the microphone, or digitally processing the audio signal. Conditioning may include attenuating or suppressing one or more audio signals associated with the person. The attenuated audio signal may include audio signals associated with other sounds detected in the user's environment, including other speech such as the second audio signal. For example, the processor 210 may selectively attenuate the second audio signal based on determining that the second audio signal is not associated with a person.
在步骤5210中,过程5200可以包括使立体声表示传输到助听器接口设备,助听器接口设备被配置为将基于第一音频信号的声音提供给用户的第一耳朵,并将基于第二音频信号的声音提供给用户的第二耳朵。例如,立体声表示可以被发送到连接到用户右耳的听觉接口设备1710,以及连接到用户左耳的听觉接口设备1710,从而向用户100提供对应于接收到的音频信号的声音。在一些实施例中,听觉接口设备1710可以包括与听筒相关联的扬声器。例如,听觉接口设备可以至少部分地插入用户的耳朵中,用于向用户提供音频。听觉接口设备也可以在耳朵外部,诸如耳后听觉设备、一个或多个耳机、小型便携式扬声器等。在一些实施例中,听觉接口设备可以包括骨传导麦克风,其被配置为通过用户头骨的振动向用户提供音频信号。这样的设备可以与使用者的皮肤外部接触放置,或者可以通过外科手术植入并附接到使用者的骨骼上。In step 5210, process 5200 may include causing the stereo representation to be transmitted to a hearing aid interface device configured to provide sound based on the first audio signal to a first ear of the user and to provide sound based on the second audio signal to a second ear of the user. For example, the stereo representation may be sent to an auditory interface device 1710 connected to the user's right ear and to an auditory interface device 1710 connected to the user's left ear, thereby providing the user 100 with sound corresponding to the received audio signals. In some embodiments, the auditory interface device 1710 may include a speaker associated with an earpiece. For example, the auditory interface device may be inserted at least partially into the user's ear for providing audio to the user. The auditory interface device may also be external to the ear, such as a behind-the-ear hearing device, one or more earphones, a small portable speaker, and the like. In some embodiments, the auditory interface device may include a bone conduction microphone configured to provide audio signals to the user through vibrations of the user's skull. Such devices may be placed in external contact with the user's skin, or may be surgically implanted and attached to the user's bone.
嗓音特征的现场改变On-the-spot changes in vocal characteristics
与所公开的实施例一致,助听器系统可以基于语音特征选择性地调节声音,以使用户能够更好地理解具有语音障碍、口音或其他可能阻碍用户理解的语音特征的个体。虽然现有的助听器系统可以放大声音来克服听力损失,但这些系统可能无法消除理解障碍。例如,助听器系统的使用者除了听力损失之外,还可能有认知障碍,而传统助听器系统提供的简单扩音方法并不能解决这些认知障碍。此外,即使当用户没有表现出认知障碍时,用户也可能遇到带有口音或语音障碍的人,而扩音并不能解决这些问题,甚至会更糟。因此,本发明的助听器系统可以选择性地调节与语音相关联的音频以减少理解障碍。Consistent with the disclosed embodiments, hearing aid systems may selectively adjust sounds based on speech characteristics to enable users to better understand individuals with speech impairments, accents, or other speech characteristics that may hinder user understanding. While existing hearing aid systems can amplify sound to overcome hearing loss, these systems may not remove barriers to understanding. For example, users of hearing aid systems may have cognitive impairments in addition to hearing loss, and the simple amplification methods provided by traditional hearing aid systems do not address these cognitive impairments. Furthermore, even when the user does not exhibit cognitive impairment, the user may encounter individuals with accents or speech impediments, and amplification does not solve these problems and may even make them worse. Thus, the hearing aid system of the present invention can selectively adjust the audio associated with speech to reduce comprehension barriers.
用户100可以佩戴符合上述基于相机的助听器设备的助听器设备。例如,助听器设备可以是如图17A所示的听觉接口设备1710。听觉接口设备1710可以是被配置为向用户100提供听觉反馈的任何设备。听觉接口设备1710可以被放置在用户100的每个耳朵中,类似于传统的听觉接口设备。如上所述,听觉接口设备1710可以是各种样式的,包括耳道内、完全耳道内、耳内、耳后、耳上、耳道内接收器、开放安装或各种其他样式。听觉接口设备1710可以包括用于向用户100提供听觉反馈的一个或多个扬声器、用于检测用户100的环境中的声音的麦克风、内部电子设备、处理器、存储器等。在一些实施例中,除了麦克风之外或替代麦克风,听觉接口设备1710可以包括一个或多个通信单元,以及是一个或多个接收器,用于从设备110接收信号并将信号传送到用户100。听觉接口设备1710可以对应于反馈输出单元230,或者可以与反馈输出单元230分开,并且可以被配置为从反馈输出单元230接收信号。The user 100 may wear a hearing aid device consistent with the camera-based hearing aid devices described above. For example, the hearing aid device may be an auditory interface device 1710 as shown in FIG. 17A. Auditory interface device 1710 may be any device configured to provide auditory feedback to user 100. An auditory interface device 1710 may be placed in each ear of the user 100, similar to conventional auditory interface devices. As mentioned above, the auditory interface device 1710 may be of various styles, including in-canal, completely in-canal, in-ear, behind-the-ear, supra-aural, in-canal receiver, open mount, or various other styles. Auditory interface device 1710 may include one or more speakers for providing auditory feedback to user 100, a microphone for detecting sounds in the environment of user 100, internal electronics, a processor, memory, and the like. In some embodiments, in addition to or instead of a microphone, auditory interface device 1710 may include one or more communication units, as well as one or more receivers, for receiving signals from the apparatus 110 and transmitting the signals to the user 100. The auditory interface device 1710 may correspond to the feedback output unit 230, or may be separate from the feedback output unit 230, and may be configured to receive signals from the feedback output unit 230.
在一些实施例中,如图17A所示,听觉接口设备1710可以包括骨传导耳机1711。骨传导耳机1711可以通过外科手术植入,并且可以通过声音振动到内耳的骨传导来向用户100提供可听反馈。听觉接口设备1710还可以包括一个或多个耳机(例如,无线耳机、过耳耳机等)或由用户100携带或佩戴的便携式扬声器。在一些实施例中,听觉接口设备1710可以集成到其他设备中,诸如用户的蓝牙TM耳机、眼镜、头盔(例如,摩托车头盔、自行车头盔等)、帽子等。In some embodiments, the auditory interface device 1710 may include a bone conduction headset 1711, as shown in FIG. 17A. Bone conduction earphones 1711 may be surgically implanted and may provide audible feedback to user 100 through bone conduction of sound vibrations to the inner ear. The auditory interface device 1710 may also include one or more earphones (eg, wireless earphones, over-ear earphones, etc.) or portable speakers carried or worn by the user 100 . In some embodiments, the auditory interface device 1710 may be integrated into other devices, such as the user's Bluetooth ™ headset, glasses, helmets (eg, motorcycle helmets, bicycle helmets, etc.), hats, and the like.
听觉接口设备1710可以被配置为与诸如装置110的相机设备进行通信。这种通信可以通过有线连接,或者可以无线地进行(例如,使用蓝牙TM、NFC或无线通信形式)。如上所述,装置110可以由用户100以各种配置来佩戴,包括物理地连接到衬衫、项链、腰带、眼镜、腕带、纽扣或与用户100相关联的其他物品。在一些实施例中,还可以包括诸如计算设备120的一个或多个附加设备。因此,本文关于装置110或处理器210描述的一个或多个过程或功能可以由计算设备120和/或处理器540执行。装置110还可以使用听觉接口设备1710的一个或多个麦克风,并且因此,本文使用的对麦克风1720的引用也可以是指听觉接口设备1710上的麦克风。Auditory interface device 1710 may be configured to communicate with a camera device such as apparatus 110. Such communication may be via a wired connection, or may occur wirelessly (e.g., using Bluetooth™, NFC, or other forms of wireless communication). As described above, apparatus 110 may be worn by user 100 in various configurations, including physically attached to a shirt, necklace, belt, eyeglasses, wristband, button, or other item associated with user 100. In some embodiments, one or more additional devices such as computing device 120 may also be included. Accordingly, one or more of the processes or functions described herein with respect to apparatus 110 or processor 210 may be performed by computing device 120 and/or processor 540. Apparatus 110 may also use one or more microphones of auditory interface device 1710, and thus, references to microphone 1720 used herein may also refer to a microphone on auditory interface device 1710.
处理器210(和/或处理器210a和210b)可以被配置为检测用户100的环境内的个体。图53是示出符合本公开的用于使用提供声音特征的现场改变的助听器的示例性环境的示意图。如图53所示,佩戴装置110的用户100可以物理地存在于环境中并且个体5302产生声音5304。尽管图53将个体5302示出为正在说话,但装置110的麦克风也可以捕捉来自用户100的环境的其他声音,诸如机器、动物或自然产生的声源(诸如风)。因此,处理器210可以被配置为从由装置110的麦克风捕捉的多个声音中识别和隔离对应于语音的声音。例如,机器发出的声音通常有一个与机器运动相对应的时段,诸如运转的马达每循环就发出声音,或者手钻每敲击就发出声音。在这些情况下,处理器210可以过滤时段性声音以隔离语音。作为另一示例,处理器210可以过滤比典型语音音量范围更大或更安静的声音、具有不同于典型语音的谐波的谐波的声音或在典型语音音高之外的音高。处理器210可以采用诸如傅立叶变换的信号分析方法来隔离语音信号。Processor 210 (and/or processors 210a and 210b) may be configured to detect individuals within the environment of user 100. FIG. 53 is a schematic diagram illustrating an exemplary environment for using a hearing aid that provides on-the-spot modification of sound characteristics, consistent with the present disclosure. As shown in FIG. 53, user 100 wearing device 110 may be physically present in the environment while individual 5302 produces sound 5304. Although FIG. 53 shows the individual 5302 as speaking, the microphone of the device 110 may also capture other sounds from the user's 100 environment, such as machines, animals, or naturally occurring sound sources such as wind. Accordingly, the processor 210 may be configured to identify and isolate the sound corresponding to speech from among the plurality of sounds captured by the microphone of the device 110. For example, the sound produced by a machine usually has a period corresponding to the machine's motion, such as a running motor producing a sound each cycle, or a hand drill producing a sound each strike. In these cases, the processor 210 may filter out the periodic sounds to isolate speech. As another example, the processor 210 may filter out sounds that are louder or quieter than the typical speech volume range, sounds whose harmonics differ from those of typical speech, or pitches outside the typical speech pitch range. The processor 210 may employ signal analysis methods, such as the Fourier transform, to isolate the speech signal.
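The pitch-and-volume gating described above can be made concrete with a small filter. This Python sketch is illustrative only; the pitch and volume ranges are assumed placeholder thresholds for conversational speech, not values from the disclosure, and a real system would derive pitch and level from the Fourier analysis mentioned in the text.

```python
SPEECH_PITCH_HZ = (85.0, 255.0)   # assumed typical conversational pitch range
SPEECH_VOLUME_DB = (40.0, 70.0)   # assumed typical conversational level range

def is_speech_like(pitch_hz, volume_db):
    """Crude gate keeping only sounds inside the assumed speech ranges."""
    return (SPEECH_PITCH_HZ[0] <= pitch_hz <= SPEECH_PITCH_HZ[1]
            and SPEECH_VOLUME_DB[0] <= volume_db <= SPEECH_VOLUME_DB[1])

def isolate_speech(segments):
    """segments: list of (pitch_hz, volume_db, label). Keep only the
    speech-like segments, dropping loud machine noise, wind, etc."""
    return [seg for seg in segments if is_speech_like(seg[0], seg[1])]
```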
语音与其他声音的隔离可以使佩戴根据本公开的助听器系统的用户更好地理解与他人的对话。例如,处理器210可以通过增加与语音相关联的声音的音量以及减少甚至消除与除语音之外的源相关联的声音来选择性地调节声音。因此,助听器系统可以帮助用户专注于对话并避免分心。然而,在一些情况下,选择性地增加声音的音量可能不足以使用户理解说话者。例如,如上所述,用户可能具有抑制理解典型语音的认知障碍,诸如需要比通常更长的时间来区分词语。用户还可能具有理解语音的物理障碍,诸如在对应于人类声音范围的某些频率中的听力损失。用户还可能与说话者有文化差异,使说话者难以被理解(诸如口音)。The isolation of speech from other sounds may allow a user wearing a hearing aid system according to the present disclosure to better understand conversations with others. For example, the processor 210 may selectively adjust the sound by increasing the volume of the sound associated with speech and reducing or even eliminating the sound associated with sources other than speech. Therefore, hearing aid systems can help users focus on the conversation and avoid distractions. However, in some cases, selectively increasing the volume of the sound may not be sufficient for the user to understand the speaker. For example, as discussed above, users may have cognitive impairments that inhibit understanding typical speech, such as taking longer than usual to distinguish words. Users may also have physical barriers to understanding speech, such as hearing loss in certain frequencies corresponding to the human sound range. The user may also have cultural differences with the speaker, making the speaker difficult to understand (such as accent).
为了解决这些问题,本公开的某些实施例可以提供额外的语音的选择性调节以进一步改善用户理解。例如,图54A是由符合本公开的助听器系统获取的音频信号的示意图,而图54B是由符合本公开的助听器系统重放的音频信号的示意图。图54A和图54B可以是从装置110的麦克风导出的频谱图。图54A的获取的音频信号与图54B的重放的音频信号的比较示出了如本公开的实施例中的选择性调节的实施例。To address these issues, certain embodiments of the present disclosure may provide additional selective adjustment of speech to further improve user understanding. For example, Figure 54A is a schematic diagram of an audio signal acquired by a hearing aid system consistent with the present disclosure, and Figure 54B is a schematic diagram of an audio signal played back by a hearing aid system consistent with the present disclosure. 54A and 54B may be spectrograms derived from a microphone of device 110. FIG. A comparison of the acquired audio signal of FIG. 54A with the played back audio signal of FIG. 54B illustrates an embodiment of selective adjustment as in an embodiment of the present disclosure.
处理器210可以通过如上所述去除非语音声音,以及基于指示词语中断的静默时段将与语音相关联的声音分离成片段来将词语从对应于语音的更长的声音样本中分离出来。例如,图54A在概念上表示从语音中提取的词语。在本示例中,该词语可以是当由说话者说出时用户难以理解的词语,或者由于例如说话者的口音而容易与类似发音词语混淆的词语。例如,图54A的词语中所示的图形可以对应于说话者说出词语“hearing(听到)”,该词语在某些口音中或对于具有某些言语障碍的一些人来说,听起来可能像“earring”或者甚至“erin”。也就是,一些说话者可能会去掉“hearing”一词中的前导“h”或最后的“g”,或者可能会改变元音发音。The processor 210 may separate words from longer sound samples corresponding to speech by removing non-speech sounds as described above, and by separating the sounds associated with speech into segments based on periods of silence indicating word breaks. For example, FIG. 54A conceptually represents a word extracted from speech. In this example, the word may be a word that is difficult for a user to understand when spoken by a speaker, or a word that is easily confused with similarly pronounced words due to, for example, the speaker's accent. For example, the graph shown in FIG. 54A may correspond to a speaker saying the word "hearing," which, in certain accents or for some people with certain speech impediments, may sound like "earring" or even "erin." That is, some speakers may drop the leading "h" or the final "g" in the word "hearing," or may change the vowel pronunciation.
处理器210还可以将识别出的词语划分为音素。例如,如图54A所示,处理器210可以将词语“hearing”分成四个音素:区域5402中的“h”音素、区域5404中的“ea”音素、区域5406中的“r”音素和区域5408中的“ng”音素。处理器210可以通过将从音频导出的频谱图与描述声音的字符串相关地存储的频谱图库相匹配来识别音素。例如,处理器210可以将库存储在装置110的存储器550中,或者处理器210可以访问存储库的数据库。例如,区域5402的频谱图可以存储在库中并链接到字母“h”。此外,库可以存储应该强调音素以增强用户理解的指示。例如,如上所述,有些口音轻发“h”音,或完全丢弃“h”音,而库可以存储当检测到“h”音时,应该强调“h”音的指示。此外,库可以存储条件性强调指示,诸如“h”应该在词语的开头处强调,而不是中间的规则。库还可以存储强调规则,例如在音素之前引入延迟、增大音量、减小音量、增大持续时间或减小持续时间。The processor 210 may also divide the identified words into phonemes. For example, as shown in FIG. 54A, the processor 210 may separate the word "hearing" into four phonemes: the "h" phoneme in region 5402, the "ea" phoneme in region 5404, the "r" phoneme in region 5406, and the "ng" phoneme in region 5408. The processor 210 may identify phonemes by matching a spectrogram derived from the audio against a library of spectrograms stored in association with strings describing the sounds. For example, the processor 210 may store the library in the memory 550 of the device 110, or the processor 210 may access a database storing the library. For example, a spectrogram for region 5402 may be stored in the library and linked to the letter "h". Additionally, the library may store indications that a phoneme should be emphasized to enhance user understanding. For example, as mentioned above, some accents soften the "h" sound, or drop the "h" sound entirely, and the library may store an indication that when an "h" sound is detected, the "h" sound should be emphasized. Furthermore, the library may store conditional emphasis indications, such as a rule that "h" should be emphasized at the beginning of a word but not in the middle. The library may also store emphasis rules, such as introducing a delay before a phoneme, increasing the volume, decreasing the volume, increasing the duration, or decreasing the duration.
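The conditional emphasis lookup described above can be sketched as a small rule table. This Python sketch is illustrative only; the rule entries, field names, and action strings are assumptions standing in for whatever the library actually stores.

```python
# Hypothetical emphasis library: phoneme -> position-conditional rule.
EMPHASIS_RULES = {
    "h": {"position": "word_start", "action": "lengthen_and_amplify"},
    "g": {"position": "word_end", "action": "amplify"},
}

def emphasis_for(phoneme, index, phonemes):
    """Return the emphasis action for phonemes[index], honoring
    position-conditional rules such as 'emphasize "h" at the start of a
    word but not in the middle'. None means no emphasis applies."""
    rule = EMPHASIS_RULES.get(phoneme)
    if rule is None:
        return None
    if rule["position"] == "word_start" and index != 0:
        return None
    if rule["position"] == "word_end" and index != len(phonemes) - 1:
        return None
    return rule["action"]
```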
因此,图54B示出了处理器210基于图54A的接收声音产生的重放声音和“h”声音应该在词语的开头被拉长和放大以增强用户对词语的理解的规则。因此,图54B的区域5402中的“h”音素具有比图54A的区域5402中的声音更大的强度,从初始强度(约20,000个单位)放大到超过30,000个任意强度单位。另外,图54B中的“h”音素开始早于区域5402的开始,示出该音素具有增加的持续时间以进一步增强用户理解。在一些实施例中,处理器210还可以缩短词语中的其他音素,使得总的词语持续时间不变,从而防止例如对于阅读唇部以帮助听力理解的用户而言,可能使用户困惑的说话和听到之间的延迟。在一些实施例中,一个或多个音素可以具有增加的持续时间,而其他音素可以不变。因此,整个词语的持续时间可以增加。在一些实施例中,可以减少连续词语之间的一个或多个空格的持续时间,以避免由于增加的音素持续时间而造成的累积延迟。Thus, FIG. 54B shows the playback sound generated by the processor 210 based on the received sound of FIG. 54A, following the rule that the "h" sound should be lengthened and amplified at the beginning of a word to enhance the user's understanding of the word. Thus, the "h" phoneme in region 5402 of FIG. 54B has a greater intensity than the sound in region 5402 of FIG. 54A, amplified from an initial intensity of around 20,000 units to over 30,000 arbitrary intensity units. Additionally, the "h" phoneme in FIG. 54B begins earlier than the start of region 5402, showing that the phoneme has an increased duration to further enhance user understanding. In some embodiments, the processor 210 may also shorten other phonemes in the word so that the total word duration is unchanged, thereby preventing a delay between speaking and hearing that might confuse the user, for example a user who reads lips to aid listening comprehension. In some embodiments, one or more phonemes may have an increased duration while other phonemes are unchanged. Thus, the duration of the entire word may increase. In some embodiments, the duration of one or more spaces between consecutive words may be reduced to avoid cumulative delays due to increased phoneme durations.
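The lengthen-and-amplify-while-preserving-word-duration behavior described above can be sketched as follows. This Python sketch is illustrative only; the linear-interpolation resampler, the gain and lengthening factors, and the sample-list representation of phonemes are assumptions, and rounding means the total duration is preserved only approximately.

```python
def stretch(samples, factor):
    """Resample a list of samples to factor x its length (linear interp)."""
    n_out = max(1, round(len(samples) * factor))
    if n_out == 1 or len(samples) == 1:
        return [samples[0]] * n_out
    out = []
    for i in range(n_out):
        pos = i * (len(samples) - 1) / (n_out - 1)
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def emphasize_phoneme(phonemes, index, gain=1.5, lengthen=1.5):
    """Lengthen and amplify phonemes[index], then shrink the remaining
    phonemes so the total word duration stays roughly unchanged."""
    emphasized = [s * gain for s in stretch(phonemes[index], lengthen)]
    extra = len(emphasized) - len(phonemes[index])
    others = sum(len(p) for i, p in enumerate(phonemes) if i != index)
    shrink = (others - extra) / others
    return [emphasized if i == index else stretch(p, shrink)
            for i, p in enumerate(phonemes)]
```

Applied to a four-phoneme word with the "h" phoneme at index 0, this lengthens and amplifies the "h" while proportionally shortening the other three phonemes.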
在一些实施例中,选择性调节可以包括改变说话者的语音的音调以使声音对用户100更易感知。例如,用户100可能对特定范围内的音调具有较小的敏感度,并且音频信号的调节可以调整接收信号5102的音高。例如,用户100可能经历10kHz以上的频率中的听觉损失,并且处理器210可以将更高的频率(例如,在15kHz处)重新映射到低于10kHz的频率。在一些实施例中,处理器210可以被配置为改变与一个或多个音频信号相关联的语速。In some embodiments, the selective adjustment may include changing the pitch of the speaker's speech to make the sound more perceptible to the user 100. For example, the user 100 may have less sensitivity to tones within a certain range, and the adjustment of the audio signal may adjust the pitch of the received signal 5102. For example, the user 100 may experience hearing loss in frequencies above 10 kHz, and the processor 210 may remap higher frequencies (e.g., at 15 kHz) to frequencies below 10 kHz. In some embodiments, the processor 210 may be configured to change the speech rate associated with one or more audio signals.
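One way to realize the 15 kHz-to-below-10 kHz remapping above is a simple linear compression of the inaudible band. This Python sketch is illustrative only; the hearing limit, capture ceiling, and the choice of mapping the inaudible band into the top half of the audible band are assumptions.

```python
AUDIBLE_LIMIT_HZ = 10_000.0    # assumed upper limit of the user's hearing
CAPTURE_CEILING_HZ = 20_000.0  # assumed highest captured frequency

def remap_frequency(freq_hz):
    """Pass audible frequencies through unchanged; compress frequencies
    in (limit, ceiling] linearly into the top half of the audible band,
    so e.g. 15 kHz lands below 10 kHz where the user can hear it."""
    if freq_hz <= AUDIBLE_LIMIT_HZ:
        return freq_hz
    fraction = (freq_hz - AUDIBLE_LIMIT_HZ) / (CAPTURE_CEILING_HZ - AUDIBLE_LIMIT_HZ)
    return AUDIBLE_LIMIT_HZ / 2 + fraction * (AUDIBLE_LIMIT_HZ / 2)
```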
图55A是示出符合所公开实施例的用于选择性地调节音频信号的示例性过程的流程图。过程5500A可以由与装置110相关联的一个或多个处理器(诸如处理器210)来执行。处理器可以包括在与也可以在过程5500A期间使用的麦克风1720和相机1730相同的公共外壳中。例如,装置110可以包括被配置为从用户的环境捕捉声音的至少一个麦克风。在一些实施例中,过程5500A的一些或全部可以在装置110外部的处理器上执行,它们可以包括在第二外壳中。例如,过程5500A的一个或多个部分可以由听觉接口设备1710或诸如计算设备120或显示设备2301的辅助设备中的处理器来执行。在这样的实施例中,处理器可以被配置为经由公共外壳中的发送器与第二外壳中的接收器之间的无线链路接收所捕捉的图像和声音。FIG. 55A is a flowchart illustrating an exemplary process for selectively conditioning an audio signal, consistent with the disclosed embodiments. Process 5500A may be performed by one or more processors associated with apparatus 110, such as processor 210. The processor may be included in the same common housing as the microphone 1720 and camera 1730, which may also be used during process 5500A. For example, apparatus 110 may include at least one microphone configured to capture sounds from the user's environment. In some embodiments, some or all of process 5500A may be performed on processors external to device 110, which may be included in a second housing. For example, one or more portions of process 5500A may be performed by a processor in auditory interface device 1710, or in an auxiliary device such as computing device 120 or display device 2301. In such embodiments, the processor may be configured to receive the captured images and sounds via a wireless link between a transmitter in the common housing and a receiver in the second housing.
在步骤5502中,过程5500A可以包括接收表示由至少一个麦克风捕捉的声音的多个音频信号。在一些实施例中,过程5500A可以从多个麦克风接收和组合多个音频信号。例如,装置110可以包括被设计为收集具有低频的声音的第一麦克风,以及被设计为收集具有高频的声音的第二麦克风。然后,步骤5502可以将声音组合成表示低频和高频两者的单个音频信号。In step 5502, process 5500A can include receiving a plurality of audio signals representing sounds captured by at least one microphone. In some embodiments, process 5500A may receive and combine multiple audio signals from multiple microphones. For example, device 110 may include a first microphone designed to collect sounds having low frequencies, and a second microphone designed to collect sounds having high frequencies. Step 5502 may then combine the sounds into a single audio signal representing both low and high frequencies.
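The two-microphone combination of step 5502 might be sketched with complementary one-pole filters: keep the low band of the low-frequency microphone and the high band of the high-frequency microphone. The filter constant `alpha` and this filtering scheme are illustrative assumptions.

```python
# Sketch: combine a low-frequency mic and a high-frequency mic into one signal.
def combine_microphones(low_mic, high_mic, alpha=0.2):
    out = []
    lp = 0.0      # low-pass state for the low-frequency microphone
    lp_hi = 0.0   # low-pass state used to derive a high-pass of the other mic
    for lo, hi in zip(low_mic, high_mic):
        lp += alpha * (lo - lp)          # keep slow changes from the low mic
        lp_hi += alpha * (hi - lp_hi)
        out.append(lp + (hi - lp_hi))    # slow band of one mic + fast band of the other
    return out
```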
在步骤5504中,过程5500A可以包括识别多个音频信号中的第一音频信号,该第一音频信号与个体相关联。例如,如上所述,过程5500A可以移除多个音频信号中具有正常人类对话范围之外的频率、时段、音高和音量的音频信号。在一些情况下,用户可能在多个说话者附近。在这些情况下,过程5500A可以选择最响亮的音频信号,其可以对应于最近的说话者,该说话者可能是用户希望清楚听到的说话者。可替代地,过程5500A可以从用户接收聚焦于来自不同说话者的信号的指示。例如,过程5500A可以使听觉接口设备1710播放来自第一个体的音频,从用户接收按钮按下或其他指示,然后使来自第二个体的音频由听觉接口设备1710播放。In step 5504, process 5500A may include identifying a first audio signal of the plurality of audio signals, the first audio signal being associated with an individual. For example, as described above, process 5500A may remove audio signals from the plurality of audio signals having frequencies, durations, pitches, and volumes outside the range of normal human conversation. In some cases, the user may be in the vicinity of multiple speakers. In these cases, process 5500A may select the loudest audio signal, which may correspond to the closest speaker, likely the speaker the user wishes to hear clearly. Alternatively, process 5500A may receive an indication from the user to focus on the signal from a different speaker. For example, process 5500A may cause auditory interface device 1710 to play audio from a first individual, receive a button press or other indication from the user, and then cause audio from a second individual to be played by auditory interface device 1710.
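Selecting the loudest of several candidate signals can be sketched with a simple RMS comparison, a stand-in for whatever energy measure the processor actually uses:

```python
# Sketch: pick the signal with the highest RMS energy as the likely closest speaker.
import math

def select_loudest(signals):
    """Return the index of the signal with the highest RMS energy."""
    def rms(samples):
        return math.sqrt(sum(s * s for s in samples) / len(samples))
    return max(range(len(signals)), key=lambda i: rms(signals[i]))
```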
在步骤5506中,过程5500A可以包括处理第一音频信号以选择性地调节个体的至少一个语音特征。在过程5500A中,语音特征可以是个体语音的任何特性,其可能会抑制用户的理解。语音特征的示例可包括但不限于口音、言语障碍(如口齿不清、口吃、言语抽搐、异常压力、异常舌头运动或牙齿缺失)、诸如吹口哨的干扰声音或声音质量(诸如高音或发音困难,即声音嘶哑)。In step 5506, process 5500A may include processing the first audio signal to selectively condition at least one speech characteristic of the individual. In process 5500A, a speech characteristic may be any characteristic of an individual's speech that may inhibit the user's comprehension. Examples of speech characteristics may include, but are not limited to, an accent, a speech disorder (such as a lisp, stuttering, speech tics, abnormal stress, abnormal tongue movement, or missing teeth), disturbing sounds such as whistling, or voice quality (such as high pitch or dysphonia, i.e., a hoarse voice).
例如,在一些实施例中,至少一个语音特征可以包括个体的口音,并且选择性调节可以包括改变该口音。诸如存储器550的存储器或由装置110访问的另一数据库可以存储多个口音的特性。例如,存储器550可以存储用于引起对用户的本地口音的音素替换的英国口音的选择性调节规则。例如,一些说英国英语的人可能会用声门停顿代替“t”音素。结果,存储器550可以包括选择性调节规则,以用“t”音替换在第一音频信号中检测到的声门停顿,从而使得说英国英语的人对于说美国英语的人来说更容易理解。For example, in some embodiments, the at least one speech characteristic may include the individual's accent, and the selective conditioning may include changing the accent. A memory such as memory 550, or another database accessed by apparatus 110, may store characteristics of multiple accents. For example, memory 550 may store selective adjustment rules for a British accent that cause phoneme substitutions toward the user's local accent. For example, some speakers of British English may replace the "t" phoneme with a glottal stop. As a result, memory 550 may include a selective adjustment rule to replace a glottal stop detected in the first audio signal with a "t" sound, thereby making a speaker of British English easier to understand for a speaker of American English.
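A phoneme-substitution rule table of the kind described might look as follows. The rule set, the accent key, and the phoneme symbols (IPA "ʔ" for a glottal stop) are illustrative assumptions.

```python
# Sketch: per-accent phoneme substitution rules applied to a phoneme sequence.
# The table below is a hypothetical example, not a stored rule set from the source.
ACCENT_RULES = {"british": {"ʔ": "t"}}  # glottal stop -> "t" sound

def apply_accent_rules(phonemes, accent):
    """Return the phoneme sequence with the accent's substitutions applied."""
    rules = ACCENT_RULES.get(accent, {})
    return [rules.get(p, p) for p in phonemes]
```

For instance, "better" spoken with a glottal stop could be rewritten with a "t" phoneme before resynthesis.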
作为另一示例,在一些实施例中,至少一个语音特征可以包括个体的口齿不清,并且选择性调节包括移除该口齿不清。在这种情况下,处理器210可以用第一音频信号中的“s”声音替换“th”声音。然而,由于口齿不清的说话者可能在一些词语中适当地使用“th”声音,处理器210可以包括自然语言处理算法,以确定在第一音频信号中识别出的词语是否应该用“th”而不是“s”发音,并因此避免用“s”声音替换“th”。可替代地,处理器210可以访问词典并确定观察到的词语是否是真实词语,并且用“s”声音替换“th”声音。例如,第一音频信号可以包括说话者说“thit”的表示。处理器210可以确定“thit”不是词典中的词语。然后,处理器210可以确定“sit”是词典中的词语,并选择性地调节个体的口齿不清语音特征。As another example, in some embodiments, the at least one speech feature may include the individual's lisp, and the selective conditioning includes removing the lisp. In this case, the processor 210 may replace "th" sounds with "s" sounds in the first audio signal. However, since a speaker with a lisp may appropriately use the "th" sound in some words, the processor 210 may include a natural language processing algorithm to determine whether a word identified in the first audio signal should be pronounced with "th" rather than "s", and thus avoid replacing that "th" with an "s" sound. Alternatively, the processor 210 may access a dictionary and determine whether the observed word is a real word before replacing a "th" sound with an "s" sound. For example, the first audio signal may include a representation of the speaker saying "thit". The processor 210 may determine that "thit" is not a word in the dictionary. The processor 210 may then determine that "sit" is a word in the dictionary and selectively condition the individual's lisp accordingly.
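The dictionary-based check described above can be sketched as follows. `DICTIONARY` is a hypothetical toy subset, not a real lexicon, and the string-level substitution stands in for the audio-level replacement.

```python
# Sketch: replace "th" with "s" only when the heard word is not a real word
# but the corrected form is. DICTIONARY is an illustrative toy subset.
DICTIONARY = {"sit", "think", "thing"}

def correct_lisp(word):
    """Return the corrected word, or the original if no correction applies."""
    if word in DICTIONARY:
        return word                      # real word: leave "thing", "think" alone
    candidate = word.replace("th", "s")
    return candidate if candidate in DICTIONARY else word
```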
作为又另一个示例,至少一个语音特征可以包括词语的发音,并且选择性调节包括改变该词语的发音。例如,一个体(诸如一个孩子)可能会把词语“cookie”读错为“tootie”。此外,该个体可能没有妨碍用户理解的其他可识别的声音特征。然后,处理器210可通过每当个体发音错误时播放个体正确地说出词语“cookie”的记录来选择性地调节词语“cookie”发音错误的语音特征。换句话说,处理器210可以用正确发音的词语替换整个词语,而不是替换单个音素。此外,语音特征可以包括词语的多个发音,并且处理器210可以访问包含多个词语中的每一个的替换音频文件的存储器。可替代地,处理器210可以在检测到发音错误的词语时生成正确的发音。处理器210还可以调节所生成的发音以匹配个体的语音特性,诸如音调、音高和质量。As yet another example, the at least one speech feature may include the pronunciation of a word, and the selective conditioning includes changing the pronunciation of the word. For example, an individual (such as a child) might mispronounce the word "cookie" as "tootie". Furthermore, the individual may have no other identifiable speech features that interfere with the user's comprehension. The processor 210 may then selectively condition the mispronounced word by playing a recording of the individual correctly saying the word "cookie" each time the individual mispronounces it. In other words, the processor 210 may replace the entire word with a correctly pronounced word, rather than replacing individual phonemes. Additionally, the speech features may include the pronunciations of multiple words, and the processor 210 may access a memory containing a replacement audio file for each of the multiple words. Alternatively, the processor 210 may generate a correct pronunciation when a mispronounced word is detected. The processor 210 may also adjust the generated pronunciation to match the individual's speech characteristics, such as tone, pitch, and quality.
在一些实施例中,选择性调节还可以包括改变个体的语音以模拟第二个体的语音。例如,处理器210可以创建第一音频信号的转录并将该转录提供给具有被选择以匹配用户偏好的语音的语音合成算法。或者,处理器210可以对第一音频信号过滤分量或添加其他分量,以增强或减少某些特征,并使第一音频信号更紧密地匹配另一语音的特性。例如,处理器210可以通过改变个体语音的音高来改变个体的语音。因此,如果个体是男性,处理器210可以处理第一音频信号以增加其音高,以便更清楚地匹配女性的语音,或者可以添加特定人的语音的泛音特性。选择性调节还可以包括翻译,诸如提供转录到用户语言的机器翻译,然后使用语音合成算法大声朗读翻译文本。选择性调节还可以包括以较慢或较快的速率播放个体的语音,例如通过较快或较慢地播放一个或多个词语。在一些实施例中,这还可以包括减少或增加词语之间的静默时段的持续时间,以负责所说词语的持续时间的改变。In some embodiments, the selective conditioning may also include altering the individual's voice to simulate the voice of a second individual. For example, the processor 210 may create a transcription of the first audio signal and provide the transcription to a speech synthesis algorithm with a voice selected to match the user's preference. Alternatively, the processor 210 may filter components of the first audio signal, or add other components, to enhance or reduce certain characteristics and make the first audio signal more closely match the characteristics of another voice. For example, the processor 210 may alter the individual's voice by changing its pitch. Thus, if the individual is male, the processor 210 may process the first audio signal to increase its pitch so as to more closely match a female voice, or may add the overtone characteristics of a particular person's voice. Selective conditioning may also include translation, such as providing a machine translation of the transcription into the user's language and then reading the translated text aloud using a speech synthesis algorithm. Selective conditioning may also include playing the individual's speech at a slower or faster rate, e.g., by playing one or more words faster or slower. In some embodiments, this may also include reducing or increasing the duration of the silence periods between words to account for the changes in duration of the spoken words.
可以在处理期间识别个体的语音特征。因此,在一些实施例中,步骤5506可以包括通过例如频率分析或音素强度和持续时间分析来分析第一音频信号以确定至少一个语音特征。步骤5506还可以包括确定用户对选择性调节的偏好,诸如选择性调节英国口音或强调语音中字母“h”的指令。例如,用户偏好可以被存储在存储器550中。存储器550还可以存储处理算法和常量,诸如通过将对应于字母“h”的音频信号的识别部分放大为原始音频的1.5倍并将持续时间延长10%来强调字母“h”的算法。An individual's speech features may be identified during processing. Thus, in some embodiments, step 5506 may include analyzing the first audio signal to determine at least one speech feature by, for example, frequency analysis or analysis of phoneme strength and duration. Step 5506 may also include determining the user's preferences for selective conditioning, such as an instruction to selectively condition British accents or to emphasize the letter "h" in speech. For example, user preferences may be stored in memory 550. Memory 550 may also store processing algorithms and constants, such as an algorithm that emphasizes the letter "h" by amplifying the identified portions of the audio signal corresponding to the letter "h" to 1.5 times the original volume and extending their duration by 10%.
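The 1.5x amplification and 10% duration extension can be sketched on raw samples. The nearest-neighbour stretch below is a simplification; a real system would use proper time-stretching to avoid artifacts.

```python
# Sketch: amplify an identified segment by `gain` and lengthen it by `stretch`.
# The 1.5x / +10% constants follow the text; the resampling method is a simplification.
def emphasize_segment(samples, start, end, gain=1.5, stretch=1.10):
    segment = [s * gain for s in samples[start:end]]
    new_len = int(len(segment) * stretch)
    # Nearest-neighbour resampling to stretch the segment's duration.
    stretched = [segment[int(i * len(segment) / new_len)] for i in range(new_len)]
    return samples[:start] + stretched + samples[end:]
```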
可替代地或另外地,步骤5506可以包括识别第一音频信号中的语音签名,以及基于语音签名确定个体的身份。例如,可以通过从单个说话者的干净音频中提取频谱特征(也称为频谱属性、频谱包络或频谱图)来执行语音特征提取。音频信号可以包括与诸如背景噪声或其他声音之类的任何其他声音隔离的单个说话者的语音的短样本(例如,一秒长、两秒长等)。该干净的音频可以被输入到基于计算机的模型(诸如预先训练的神经网络)中,该模型可以基于提取的特征输出说话者的语音的签名。在一些实施例中,语音签名可以是语音障碍、发音错误、语速或口音。例如,英国口音可能具有共同的频谱特征,其可以被识别为语音签名。此外,个体可能以独特的方式读错词语(诸如常用词语),并且读错词语的频谱图可以是个体语音签名的一部分。同样,个体的速度障碍可能会导致他的言语中没有某些音素,或者可替代地,某个音素以不寻常的速率出现。这种音素的存在或不存在也可以形成语音签名。Alternatively or additionally, step 5506 may include identifying a voice signature in the first audio signal and determining the identity of the individual based on the voice signature. For example, speech feature extraction may be performed by extracting spectral features (also referred to as spectral attributes, a spectral envelope, or a spectrogram) from clean audio of a single speaker. The audio signal may include a short sample (e.g., one second long, two seconds long, etc.) of a single speaker's speech isolated from any other sounds, such as background noise. This clean audio may be input into a computer-based model (such as a pre-trained neural network), which may output a signature of the speaker's voice based on the extracted features. In some embodiments, the voice signature may be a speech disorder, a mispronunciation, a speech rate, or an accent. For example, British accents may share common spectral features that can be identified as a voice signature. Furthermore, an individual may mispronounce certain words (such as commonly used words) in a unique way, and a spectrogram of the mispronounced words may be part of the individual's voice signature. Likewise, an individual's speech disorder may cause certain phonemes to be absent from his speech or, alternatively, a certain phoneme to appear at an unusual rate. The presence or absence of such phonemes can also form a voice signature.
输出签名可以是数字的矢量。例如,对于提交给基于计算机的模型(例如,训练过的神经网络)的每个音频样本,基于计算机的模型可以输出形成矢量的数字集。可以使用任何合适的基于计算机的模型来处理由助听器系统的一个或多个麦克风捕捉的音频数据以返回输出签名。在示例实施例中,基于计算机的模型可以检测并输出所捕捉音频的各种统计特性,诸如音频的平均响度或平均音高、音频的频谱频率、音频的响度或音高的变化、音频的节奏模式等。这些参数可以用于形成包括形成向量的数字集的输出签名。The output signature may be a vector of numbers. For example, for each audio sample submitted to a computer-based model (e.g., a trained neural network), the computer-based model may output a set of numbers that form a vector. Any suitable computer-based model may be used to process the audio data captured by one or more microphones of the hearing aid system to return an output signature. In an example embodiment, the computer-based model may detect and output various statistical properties of the captured audio, such as the average loudness or average pitch of the audio, the spectral frequencies of the audio, variations in the loudness or pitch of the audio, rhythmic patterns of the audio, and so on. These parameters may be used to form an output signature comprising a set of numbers forming a vector.
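A toy version of such a signature vector, using three hand-picked statistics in place of a trained network's output, might look like this. The choice of statistics is an illustrative assumption.

```python
# Sketch: a 3-element signature vector from simple statistics of the samples.
import math

def voice_signature(samples):
    """Return [mean absolute loudness, loudness std deviation, zero-crossing rate],
    stand-ins for the statistical properties described in the text."""
    n = len(samples)
    mean_loud = sum(abs(s) for s in samples) / n
    var_loud = sum((abs(s) - mean_loud) ** 2 for s in samples) / n
    zero_crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    ) / (n - 1)
    return [mean_loud, math.sqrt(var_loud), zero_crossings]
```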
一旦建立了语音签名,步骤5506可以包括通过访问包括一个或多个个体的声纹的数据库来执行一个或多个语音识别算法,诸如隐马尔可夫模型、动态时间规整、神经网络或其他技术。因此,处理器210可以基于语音签名来确定个体的身份。另外,在确定身份之后,处理器210还可以访问存储器以确定至少一个语音特征,该至少一个语音特征与个体相关联地存储在存储器中。Once the voice signature is established, step 5506 may include executing one or more voice recognition algorithms, such as hidden Markov models, dynamic time warping, neural networks, or other techniques, by accessing a database that includes the voiceprints of one or more individuals. Accordingly, the processor 210 may determine the identity of the individual based on the voice signature. Additionally, after determining the identity, the processor 210 may also access a memory to determine at least one speech feature stored in the memory in association with the individual.
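Matching a signature against a voiceprint database can be sketched with cosine similarity, one simple alternative to the algorithms listed above; the 0.9 threshold and the database contents are assumed values.

```python
# Sketch: nearest-voiceprint lookup by cosine similarity, with a minimum threshold.
import math

def identify_speaker(signature, voiceprint_db, threshold=0.9):
    """Return the name whose stored voiceprint best matches the signature,
    or None if no match reaches the threshold."""
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    best_name, best_score = None, threshold
    for name, print_vec in voiceprint_db.items():
        score = cosine(signature, print_vec)
        if score >= best_score:
            best_name, best_score = name, score
    return best_name
```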
为了进一步说明,用户100可能具有声音低沉沙哑的朋友,其无法发音字母“l”。处理器210可以基于朋友的音高和由朋友的语音的沙哑质量产生的泛音为该朋友建立语音签名。此外,语音签名可以注意到在来自朋友的音频信号中没有识别出字母“l”。该语音签名可以存储在存储器220中。另外,用户可以指定朋友无法说出包含“l”的词语会抑制用户对朋友的理解。处理器210可以存储规则,即当朋友的语音签名匹配第一音频信号(指示用户正在与朋友交谈)时,第一音频信号的一些片段应该被适当地表示“l”声音的音频信号替换。例如,处理器210还可以使用自然语言处理方法来确定原始声音是否正确并避免插入“l”声音。处理器210还可以选择性地将语音调节到更高的音高,并去除负责朋友语音的沙哑质量的泛音。To illustrate further, user 100 may have a friend with a deep, husky voice who cannot pronounce the letter "l". The processor 210 may establish a voice signature for the friend based on the friend's pitch and the overtones produced by the husky quality of the friend's voice. Additionally, the voice signature may note that the letter "l" is not recognized in audio signals from the friend. The voice signature may be stored in memory 220. Additionally, the user may specify that the friend's inability to speak words containing "l" inhibits the user's understanding of the friend. The processor 210 may store a rule that when the friend's voice signature matches the first audio signal (indicating that the user is conversing with the friend), some segments of the first audio signal should be replaced by audio signals appropriately representing the "l" sound. For example, the processor 210 may also use natural language processing methods to determine whether the original sound was correct and avoid inserting an "l" sound where none belongs. The processor 210 may also selectively condition the voice to a higher pitch and remove the overtones responsible for the husky quality of the friend's voice.
在一些情况下,语音特征可以是一种语言的通用特征,而不是仅限于特定的说话者。例如,英语中的一些词语,被称为近同音字,除了一个小的区别(诸如对单个字母的强调更强之外),听起来很相似。“refuse(拒绝)”和“refuge(避难)”、“hiss(嘶嘶)”和“his(他的)”、“advice(建议)”和“advise(劝告)”都可以被认为是近同音字。在某些实施例中,本公开的助听器系统可以增强用户区分近同音字的能力。例如,如前所述,处理器210可以识别第一音频信号中的词语。然后,处理器210可以访问数据库以确定该词语的近同音字。例如,数据库可以存储在存储器220中,或者可以通过诸如计算设备120的移动设备来访问。数据库可以存储用户所理解的语言的近同音字的预先填充的列表。如果数据库中不存在近同音字,则处理器210可以移动到分析第一音频信号中的下一个词语。可替代地,如果该词语存在,处理器210可以将接收到的音频中的词语与音频文件或区分特性进行比较,以确定词语与近同音字之间的差异。一旦识别出差异,处理器210可以增加对应于该差异的音素的音量或持续时间中的至少一个。举例说明,用户附近的一个体可能会说“his”这个词语,在美国英语中,通常发音为“hiz”。处理器210可以确定该个体说了词语“his”,并确定词语“hiss”(在美式英语中用拉长的、柔和的“s”音发音)是近同音字。然后,处理器210可以将“his”和“hiss”之间的差异确定为结尾“s”的发音,并增加对应于结尾“s”的第一音频信号的片段的音量。这样,佩戴本公开的助听器系统的用户可以更清楚地理解个体。In some cases, a speech feature may be a general feature of a language rather than limited to a specific speaker. For example, some words in English, known as near homophones, sound similar except for a small difference, such as a stronger emphasis on a single letter. "refuse" and "refuge", "hiss" and "his", and "advice" and "advise" may all be considered near homophones. In certain embodiments, the hearing aid system of the present disclosure may enhance the user's ability to distinguish near homophones. For example, as previously described, the processor 210 may identify a word in the first audio signal. The processor 210 may then access a database to determine the near homophones of the word. For example, the database may be stored in memory 220 or accessed through a mobile device such as computing device 120. The database may store a pre-populated list of near homophones in a language understood by the user. If no near homophone exists in the database, the processor 210 may move on to analyzing the next word in the first audio signal. Alternatively, if a near homophone exists, the processor 210 may compare the word in the received audio to an audio file or to distinguishing characteristics to determine the difference between the word and the near homophone. Once the difference is identified, the processor 210 may increase at least one of the volume or the duration of the phoneme corresponding to the difference. For example, an individual near the user might say the word "his", which is commonly pronounced "hiz" in American English. The processor 210 may determine that the individual spoke the word "his" and that the word "hiss" (pronounced with an elongated, soft "s" sound in American English) is a near homophone. The processor 210 may then determine that the difference between "his" and "hiss" is the pronunciation of the ending "s" and increase the volume of the segment of the first audio signal corresponding to the ending "s". In this way, a user wearing the hearing aid system of the present disclosure can understand the individual more clearly.
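Finding the distinguishing span between a word and its near homophone can be sketched as a common-prefix comparison; `NEAR_HOMOPHONES` is a hypothetical stand-in for the pre-populated database.

```python
# Sketch: locate the trailing letters that differ between a word and its
# near homophone - the span whose volume/duration would then be boosted.
NEAR_HOMOPHONES = {"his": "hiss", "advice": "advise"}  # illustrative pairs

def distinguishing_suffix(word):
    """Return the distinguishing ending of the near homophone, or None."""
    other = NEAR_HOMOPHONES.get(word)
    if other is None:
        return None                       # no near homophone in the database
    prefix = 0
    for a, b in zip(word, other):
        if a != b:
            break
        prefix += 1
    return other[prefix:]
```

For "his"/"hiss" this yields the ending "s", matching the example in the text.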
在处理器210已经选择性调节语音特征之后,如果需要调节,则处理器210前进到过程5500A的步骤5508。在步骤5508中,过程5500A可以使经处理的第一音频信号传输到被配置为向用户的耳朵提供声音的听觉接口设备。例如,第一音频信号可以被发送到连接到用户的耳朵的听觉接口设备1710,或者被发送到连接到用户两个耳朵的两个听觉接口设备,从而向用户100提供对应于所接收的音频信号的声音。在一些实施例中,听觉接口设备1710可以包括与听筒相关联的扬声器。例如,听觉接口设备可以至少部分地插入用户的耳朵中,用于向用户提供音频。听觉接口设备1710也可以在耳朵外部,诸如耳后听觉设备、一个或多个耳机、小型便携式扬声器等。在一些实施例中,听觉接口设备可以包括骨传导扬声器,其被配置为通过用户头骨的振动向用户提供音频信号。这样的设备可以与使用者的皮肤外部接触放置,或者可以通过外科手术植入并附接到使用者的骨骼上。After the processor 210 has selectively conditioned the speech features, if conditioning is required, the processor 210 proceeds to step 5508 of process 5500A. In step 5508, process 5500A may cause the processed first audio signal to be transmitted to an auditory interface device configured to provide sound to the user's ear. For example, the first audio signal may be sent to an auditory interface device 1710 connected to the user's ear, or to two auditory interface devices connected to both of the user's ears, thereby providing the user 100 with sound corresponding to the received audio signal. In some embodiments, auditory interface device 1710 may include a speaker associated with an earpiece. For example, the auditory interface device may be inserted at least partially into the user's ear for providing audio to the user. The auditory interface device 1710 may also be external to the ear, such as a behind-the-ear hearing device, one or more headphones, a small portable speaker, and the like. In some embodiments, the auditory interface device may include a bone conduction speaker configured to provide audio signals to the user through vibrations of the user's skull. Such a device may be placed in external contact with the user's skin, or may be surgically implanted and attached to the user's bone.
除了语音签名匹配过程之外,根据本公开的助听器系统还可以依赖于视觉识别技术来识别说话的个体。图55B是示出符合所公开实施例的用于基于个体的视觉识别来确定语音特征的示例性过程的流程图。图55B中所示的过程5500B的视觉识别方法可以用于代替过程5500A的步骤5504或与过程5500A的步骤5504结合使用。In addition to the voice signature matching process, hearing aid systems according to the present disclosure may also rely on visual recognition techniques to identify a speaking individual. FIG. 55B is a flowchart illustrating an exemplary process for determining speech features based on visual recognition of an individual, consistent with the disclosed embodiments. The visual recognition method of process 5500B shown in FIG. 55B may be used in place of, or in conjunction with, step 5504 of process 5500A.
例如,装置110可以包括被配置为从用户的环境捕捉多个图像的可穿戴相机,并且处理器210可以执行过程5500B的步骤。因此,在步骤5512中,过程5500B可以包括接收由相机捕捉的多个图像。例如,装置110可以捕捉图像并存储被压缩为JPG文件的图像的表示。作为另一示例,装置110可以捕捉彩色图像,但存储彩色图像的黑白表示。作为又一示例,装置110可以捕捉图像并存储图像的不同表示(例如,图像的一部分)。例如,装置110可以存储图像的一部分,该部分包括出现在图像中的人的脸,但基本上不包括围绕该人的环境。作为又一示例,装置110可以以降低的分辨率(即,以比捕捉的图像的分辨率低的分辨率)存储图像的表示。存储图像的表示可以允许装置110节省存储器550中的存储空间。此外,处理图像的表示可以允许装置110提高处理效率和/或帮助维持电池寿命。For example, apparatus 110 may include a wearable camera configured to capture a plurality of images from the user's environment, and processor 210 may perform the steps of process 5500B. Accordingly, in step 5512, process 5500B may include receiving a plurality of images captured by the camera. For example, apparatus 110 may capture an image and store a representation of the image compressed as a JPG file. As another example, apparatus 110 may capture a color image but store a black-and-white representation of the color image. As yet another example, apparatus 110 may capture an image and store a different representation of the image (e.g., a portion of the image). For example, apparatus 110 may store a portion of an image that includes the face of a person appearing in the image but substantially excludes the environment surrounding the person. As yet another example, apparatus 110 may store a representation of the image at a reduced resolution (i.e., at a resolution lower than that of the captured image). Storing representations of images may allow apparatus 110 to conserve storage space in memory 550. Furthermore, processing representations of images may allow apparatus 110 to improve processing efficiency and/or help maintain battery life.
在步骤5514中,过程5500B可以包括识别在多个图像中的至少一个中的个体的表示。可以使用各种图像检测算法来识别个体,诸如Haar级联、定向梯度直方图(HOG)、深度卷积神经网络(CNN)、尺度不变特征变换(SIFT)等。在一些实施例中,处理器210可以被配置为例如从显示设备检测个体的可视表示。In step 5514, process 5500B may include identifying a representation of the individual in at least one of the plurality of images. Various image detection algorithms may be used to identify the individual, such as Haar cascades, histograms of oriented gradients (HOG), deep convolutional neural networks (CNNs), scale-invariant feature transform (SIFT), and others. In some embodiments, the processor 210 may be configured to detect a visual representation of the individual, e.g., from a display device.
在一些实施例中,步骤5512可以包括基于对多个图像的分析识别与该个体的嘴相关联的至少一个唇部移动或唇部位置,以帮助识别该个体在多个图像中的表示,如上文参考图23所描述的。例如,许多个体可能在相机1760的视场内,但一个体可能正在说话。因此,为了确定如何选择性地调节第一音频信号,处理器210可以从多个个体中识别出正在说话的个体。处理器210还可以使用各种其他技术或特性,诸如颜色、边缘、形状或运动检测算法来识别个体2310的面部。In some embodiments, step 5512 may include identifying at least one lip movement or lip position associated with the individual's mouth based on an analysis of the plurality of images to assist in identifying the individual's representation in the plurality of images, as described above with reference to FIG. 23. For example, many individuals may be within the field of view of the camera 1760, but only one individual may be speaking. Thus, in order to determine how to selectively condition the first audio signal, the processor 210 may identify the speaking individual from among the plurality of individuals. The processor 210 may also identify the face of the individual 2310 using various other techniques or characteristics, such as color, edge, shape, or motion detection algorithms.
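A crude lip-movement cue can be sketched as frame-to-frame change over a cropped mouth region. Frames here are flattened grayscale pixel lists; a real system would use a trained detector rather than this raw difference.

```python
# Sketch: mean absolute frame-to-frame pixel difference over a mouth crop.
# A high score suggests the individual's lips are moving, i.e., speaking.
def lip_movement_score(mouth_frames):
    diffs = []
    for prev, cur in zip(mouth_frames, mouth_frames[1:]):
        total = sum(abs(p - c) for p, c in zip(prev, cur))
        diffs.append(total / len(prev))   # average change per pixel
    return sum(diffs) / len(diffs)
```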
在步骤5516中,过程5500B可以包括基于表示来确定个体的身份。步骤5516可以包括执行个体图像的面部分析。因此,处理器210可以识别个体的面部上的面部特征,诸如眼睛、鼻子、颧骨、下巴或其他特征。处理器210可以使用一种或多种算法来分析检测到的特征,诸如主分量分析(例如,使用本征脸)、线性判别分析、弹性束图匹配(例如,使用Fisher脸)、局部二进制模式直方图(LBPH)、尺度不变特征变换(SIFT)、加速鲁棒特征(SURF)等。In step 5516, process 5500B may include determining the identity of the individual based on the representation. Step 5516 may include performing facial analysis of the images of the individual. Thus, the processor 210 may identify facial features on the individual's face, such as the eyes, nose, cheekbones, chin, or other features. The processor 210 may analyze the detected features using one or more algorithms, such as principal component analysis (e.g., using eigenfaces), linear discriminant analysis, elastic bunch graph matching (e.g., using Fisherfaces), local binary pattern histograms (LBPH), scale-invariant feature transform (SIFT), speeded-up robust features (SURF), and the like.
在步骤5518中,过程5500B可以包括访问存储器以确定至少一个语音特征,该至少一个语音特征与身份相关联地存储在存储器中。例如,处理器210可以在步骤5514中识别图像中的个体。处理器210可以在步骤5516处进一步分析图像中个体的表示,以确定该人的面部的特性。然后,处理器210可以将所确定的特性与具有与个体的身份相关联的特性集合的数据库进行比较,并获得个体的身份(诸如姓名)。在步骤5518,处理器210可以访问相同的或替代的数据库,以获得适用于辨识出的个体的语音特征的选择性调节规则。此外,如上所述,过程5500B可以与语音签名识别过程相结合,以例如在个体的面部被眼镜或面部毛发遮住,或者个体的语音被环境噪声遮住的情况下,提供个体身份的更大确定性。In step 5518, process 5500B may include accessing a memory to determine at least one speech feature stored in the memory in association with the identity. For example, the processor 210 may identify the individual in an image in step 5514. The processor 210 may further analyze the representation of the individual in the image at step 5516 to determine characteristics of the person's face. The processor 210 may then compare the determined characteristics to a database having sets of characteristics associated with the identities of individuals and obtain the individual's identity (such as a name). At step 5518, the processor 210 may access the same or an alternate database to obtain selective conditioning rules applicable to the speech features of the identified individual. Additionally, as described above, process 5500B may be combined with the voice signature recognition process to provide greater certainty of the individual's identity, for example, where the individual's face is obscured by glasses or facial hair, or where the individual's voice is obscured by ambient noise.
基于语音签名和读唇的选择性调节Selective conditioning based on voice signature and lip reading
人类有着鲜明而不同的语音。虽然有些人有很好的语音记忆力,可以很容易地认出他们的第一个小学老师,但其他人可能很难只从他们的语音认出他们最亲密的朋友,尤其是当一个环境中有几个语音时。因此,需要识别活跃说话者或确定要聚焦于多个语音中的哪个语音。例如,当用户100和他的孩子在公园时,他可能希望相对于其他附近孩子的语音放大他的孩子的语音。Humans have distinct, differing voices. While some people have excellent voice memory and can easily recognize their first elementary school teacher, others may have difficulty recognizing even their closest friends from voice alone, especially when there are several voices in an environment. Therefore, there is a need to identify the active speaker or to determine which of multiple voices to focus on. For example, when user 100 is at the park with his child, he may wish to amplify his child's voice relative to the voices of other nearby children.
所公开的助听器系统可以被配置为结合读唇使用语音签名来选择性地调节或以其他方式处理个体的语音。助听器系统可以从用户的环境接收表示由麦克风捕捉的声音的音频信号。声音可以用于确定可被存储的个体的语音签名。该语音签名可以用于识别用户环境内的个体。例如,助听器系统可以通过将检测到的语音签名与存储的语音签名进行比较来确定个体是否是用户所认识的。助听器系统还可以基于从相机接收的图像来检测个体的唇部移动,该图像还可以用于识别活跃说话者。虽然语音签名检测和读唇可以分别执行以识别个体或活跃说话者,但是这些中的每一个单独都可能导致某种程度的不确定性。当组合使用时(即,语音签名和读唇),助听器系统可以以更高效和/或有效的方式识别要被选择性地调节、转录或以其他方式处理的语音。The disclosed hearing aid systems may be configured to use voice signatures in conjunction with lip reading to selectively condition or otherwise process an individual's voice. The hearing aid system may receive audio signals representing sounds captured by a microphone from the user's environment. The sounds may be used to determine a voice signature of an individual, which may be stored. The voice signature may be used to identify individuals within the user's environment. For example, the hearing aid system may determine whether an individual is known to the user by comparing a detected voice signature to stored voice signatures. The hearing aid system may also detect the individual's lip movements based on images received from a camera, which may likewise be used to identify an active speaker. Although voice signature detection and lip reading may each be performed separately to identify an individual or an active speaker, each of these alone may lead to some degree of uncertainty. When used in combination (i.e., voice signature and lip reading), the hearing aid system may identify speech to be selectively conditioned, transcribed, or otherwise processed in a more efficient and/or effective manner.
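One simple way to fuse the two cues is a weighted score. The weights and threshold below are illustrative assumptions, not values from the disclosure; they merely show how combining cues can resolve what either cue alone leaves uncertain.

```python
# Sketch: fuse a voice-signature match score and a lip-movement score
# (both assumed normalized to [0, 1]) into one active-speaker decision.
def is_active_speaker(voice_score, lip_score, w_voice=0.6, w_lip=0.4, threshold=0.5):
    """Weighted combination of the two independent cues."""
    return w_voice * voice_score + w_lip * lip_score >= threshold
```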
图56是符合所公开实施例的用于选择性地调节声音的示例性助听器系统5600的示意图。助听器系统5600在图56中以简化形式示出,并且助听器系统5600可以包括附加元件或者可以具有替代配置,例如,如图5A-图5C所示。如图所示,助听器系统5600包括可穿戴相机5601、麦克风5602、处理器5603、收发器5604和存储器5605。FIG. 56 is a schematic diagram of an exemplary hearing aid system 5600 for selectively conditioning sound, consistent with disclosed embodiments. Hearing aid system 5600 is shown in simplified form in FIG. 56, and hearing aid system 5600 may include additional elements or may have alternative configurations, e.g., as shown in FIGS. 5A-5C. As shown, the hearing aid system 5600 includes a wearable camera 5601, a microphone 5602, a processor 5603, a transceiver 5604, and a memory 5605.
可穿戴相机5601可以被配置为从用户100的环境捕捉多个图像。例如,如上所述,可穿戴相机5601可以是相机1730。可穿戴相机5601可以具有图像捕捉速率,该图像捕捉速率可以由用户配置或基于预定设置来配置。在一些实施例中,可穿戴相机5601可以包括一个或多个相机,每个相机可以对应于图像传感器220。Wearable camera 5601 may be configured to capture multiple images from the user's 100 environment. For example, wearable camera 5601 may be camera 1730, as described above. The wearable camera 5601 may have an image capture rate that may be configured by the user or based on predetermined settings. In some embodiments, wearable camera 5601 may include one or more cameras, each camera may correspond to image sensor 220 .
麦克风5602可以被配置为从用户100的环境捕捉声音。例如,如上所述,麦克风5602可以是麦克风1720。麦克风5602可以包括一个或多个麦克风。麦克风5602可以包括定向麦克风、麦克风阵列、多端口麦克风或各种其他类型的麦克风。在一些实施例中,麦克风5602和可穿戴相机5601可以包括在公共外壳(诸如装置110的外壳)中。Microphone 5602 may be configured to capture sounds from the environment of user 100. For example, as described above, microphone 5602 may be microphone 1720. Microphone 5602 may include one or more microphones. Microphone 5602 may include a directional microphone, a microphone array, a multi-port microphone, or various other types of microphones. In some embodiments, microphone 5602 and wearable camera 5601 may be included in a common housing, such as the housing of apparatus 110.
收发器5604可以被配置为向听觉接口设备(例如,1710)发送音频信号,该听觉接口设备被配置为向用户100的耳朵提供声音。收发器5604可以包括一个或多个无线收发器。一个或多个无线收发器可以是被配置为通过使用射频、红外频率、磁场或电场在空中接口上交换传输的任何设备。一个或多个无线收发器可以使用任何已知标准来发送和/或接收数据(例如,WiFi、蓝牙、蓝牙智能、802.15.4或ZigBee)。在一些实施例中,收发器5604可以将数据(例如,原始图像数据、经处理的图像和/或音频数据、提取的信息)从助听器系统5600发送到听觉接口设备和/或服务器250。收发器5604还可以从听觉接口设备和/或服务器250接收数据。在一些实施例中,收发器5604可以将数据和指令发送到外部反馈输出单元230。The transceiver 5604 may be configured to transmit audio signals to an auditory interface device (e.g., 1710) that is configured to provide sound to the ear of the user 100. Transceiver 5604 may include one or more wireless transceivers. The one or more wireless transceivers may be any devices configured to exchange transmissions over an air interface using radio frequency, infrared frequency, magnetic fields, or electric fields. The one or more wireless transceivers may transmit and/or receive data using any known standard (e.g., Wi-Fi, Bluetooth, Bluetooth Smart, 802.15.4, or ZigBee). In some embodiments, transceiver 5604 may transmit data (e.g., raw image data, processed image and/or audio data, extracted information) from hearing aid system 5600 to the auditory interface device and/or server 250. The transceiver 5604 may also receive data from the auditory interface device and/or server 250. In some embodiments, transceiver 5604 may send data and instructions to external feedback output unit 230.
存储器5605可以包括个体信息数据库5606和声纹数据库5607。声纹数据库5607可以包括一个或多个个体的一个或多个声纹。个体信息数据库5606可以包括将存储在声纹数据库5607中的一个或多个声纹与一个或多个个体相关联的信息。将一个或多个声纹与一个或多个个体相关联的信息可以包括映射表。个体信息数据库5606还可以包括指示用户100是否已知一个或多个个体的信息。例如,映射表还可以包括指示个体与用户100的关系的信息。可选地,存储器5605还可以包括其他组件,例如如图20B所示。可选地,存储器5605还可以包括如图6所示的朝向识别模块601、朝向调整模块602和监视模块603。个体信息数据库5606和声纹数据库5607仅作为示例示出在存储器5605内,并且可以位于其他位置。例如,数据库可以位于听觉接口设备1710中、远程服务器上或另一关联设备中。个体信息数据库5606和声纹数据库5607可以在同一数据库内实现,或者可以实现为两个或更多个单独的数据库。The memory 5605 may include an individual information database 5606 and a voiceprint database 5607. Voiceprint database 5607 may include one or more voiceprints of one or more individuals. Individual information database 5606 may include information associating one or more voiceprints stored in voiceprint database 5607 with one or more individuals. The information associating one or more voiceprints with one or more individuals may include a mapping table. Individual information database 5606 may also include information indicating whether one or more individuals are known to user 100. For example, the mapping table may also include information indicating the relationship of an individual to the user 100. Optionally, the memory 5605 may also include other components, for example, as shown in FIG. 20B. Optionally, the memory 5605 may further include an orientation identification module 601, an orientation adjustment module 602, and a monitoring module 603 as shown in FIG. 6. Individual information database 5606 and voiceprint database 5607 are shown within memory 5605 by way of example only, and may be located elsewhere. For example, the databases may reside in auditory interface device 1710, on a remote server, or in another associated device. Individual information database 5606 and voiceprint database 5607 may be implemented within the same database, or may be implemented as two or more separate databases.
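The mapping between the two databases might be sketched as a keyed join; all identifiers, field names, and records below are hypothetical stand-ins for the stored structures.

```python
# Sketch: toy stand-ins for the voiceprint and individual-information databases,
# joined by a shared voiceprint identifier (all values are illustrative).
voiceprint_db = {"vp-001": [0.2, 0.9], "vp-002": [0.8, 0.1]}
individual_db = {
    "vp-001": {"name": "Alice", "known_to_user": True, "relation": "friend"},
    "vp-002": {"name": "Bob", "known_to_user": False, "relation": None},
}

def lookup_individual(voiceprint_id):
    """Join the two tables the way the described mapping table would."""
    record = individual_db.get(voiceprint_id)
    if record is None:
        return None
    return {**record, "voiceprint": voiceprint_db[voiceprint_id]}
```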
处理器5603可以包括一个或多个处理单元。处理器5603可以被编程为接收由可穿戴相机5601捕捉的多个图像。处理器5603还可以被编程为接收表示由麦克风5602捕捉的声音的多个音频信号。在一个实施例中,处理器5603可以与麦克风5602和可穿戴相机5601一起包括在相同的外壳中。在另一实施例中,麦克风5602和可穿戴相机5601可以包括在第一外壳中,处理器5603可以包括在第二外壳中。在这样的实施例中,处理器5603可以被配置为经由无线链路(例如,蓝牙TM、NFC等)从第一外壳接收多个图像和/或音频信号。因此,第一壳体和第二壳体还可以包括发送器或各种其他通信组件。处理器5603可以被编程为分析接收到的多个音频信号,以使用自组织创建或存储在存储器5605中的个体的声纹来识别用户100的环境中的个体的语音。处理器5603还可以被编程为基于对多个图像的分析,检测与个体的嘴相关联的至少一个唇部移动。处理器5603还可以被编程为基于声纹或检测到的唇部移动中的至少一个,识别多个音频信号中与个体的语音相关联的第一音频信号。处理器5603还可以被编程为引起对第一音频信号的选择性调节或其他处理。处理器5603还可以被编程为使收发器5604将经选择性调节的第一音频信号发送到被配置为向用户100的耳朵提供声音的听觉接口设备。Processor 5603 may include one or more processing units. Processor 5603 may be programmed to receive a plurality of images captured by wearable camera 5601. Processor 5603 may also be programmed to receive a plurality of audio signals representing sounds captured by microphone 5602. In one embodiment, the processor 5603 may be included in the same housing as the microphone 5602 and the wearable camera 5601. In another embodiment, the microphone 5602 and the wearable camera 5601 may be included in the first housing and the processor 5603 may be included in the second housing. In such an embodiment, the processor 5603 may be configured to receive a plurality of image and/or audio signals from the first housing via a wireless link (eg, Bluetooth ™ , NFC, etc.). Accordingly, the first and second housings may also include transmitters or various other communication components. The processor 5603 may be programmed to analyze the received plurality of audio signals to identify the speech of an individual in the environment of the user 100 using the individual's voiceprint created ad hoc or stored in the memory 5605. The processor 5603 may also be programmed to detect at least one lip movement associated with the individual's mouth based on analysis of the plurality of images. 
The processor 5603 may also be programmed to identify a first audio signal of the plurality of audio signals associated with the individual's speech based on at least one of a voiceprint or detected lip movement. The processor 5603 may also be programmed to cause selective conditioning or other processing of the first audio signal. The processor 5603 may also be programmed to cause the transceiver 5604 to transmit the selectively conditioned first audio signal to an auditory interface device configured to provide sound to the ear of the user 100 .
在一些实施例中,处理器5603可以被编程为获取个体的声纹。例如,可以基于对话中较早时收集的个体的语音(例如,当该个体单独讲话而没有其它背景噪声时)来识别声纹。如本文所使用的,“较早时”可以指同一事件中的前一段,或指在此期间已创建并存储声纹的前一经历。然后可以基于声纹和检测到的唇部移动的组合来识别第一音频信号。在一些实施例中,在引起对第一音频信号的选择性调节时,处理器5603可以被编程为相对于多个音频信号中的至少一个第二音频信号来放大第一音频信号或从第一音频信号中去除背景噪声。在一些实施例中,在引起对第一音频信号的选择性调节时,处理器5603可以被编程为相对于第一音频信号衰减多个音频信号中的至少一个第二音频信号,或滤除多个音频信号中的至少一个第二音频信号。在一些实施例中,在引起对第一音频信号的选择性调节时,处理器5603可以被编程为改变所识别语音的速率或在所识别语音的词语或句子之间引入一个或多个停顿。In some embodiments, the processor 5603 may be programmed to acquire the individual's voiceprint. For example, a voiceprint may be identified based on the individual's speech collected earlier in the conversation (e.g., when the individual was speaking alone without other background noise). As used herein, "earlier" may refer to a previous segment within the same event, or to a previous occasion during which a voiceprint was created and stored. The first audio signal may then be identified based on a combination of the voiceprint and the detected lip movement. In some embodiments, in causing the selective conditioning of the first audio signal, the processor 5603 may be programmed to amplify the first audio signal relative to at least one second audio signal of the plurality of audio signals, or to remove background noise from the first audio signal. In some embodiments, in causing the selective conditioning of the first audio signal, the processor 5603 may be programmed to attenuate at least one second audio signal of the plurality of audio signals relative to the first audio signal, or to filter out at least one second audio signal of the plurality of audio signals. In some embodiments, in causing the selective conditioning of the first audio signal, the processor 5603 may be programmed to alter the rate of the recognized speech or to introduce one or more pauses between words or sentences of the recognized speech.
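The amplification and attenuation forms of selective conditioning described above can be sketched as simple per-sample gain changes. This is a toy illustration only: the function name and gain values are assumptions, and a real hearing aid would operate on streaming audio buffers, possibly in combination with directional microphones:

```python
def selectively_condition(first, others, gain=2.0, attenuation=0.25):
    """Amplify the target (first) audio signal relative to competing
    signals by scaling the target up and the others down.
    Signals are represented as plain lists of samples."""
    conditioned_first = [s * gain for s in first]
    conditioned_others = [[s * attenuation for s in sig] for sig in others]
    return conditioned_first, conditioned_others
```

Removing background noise or filtering out a second signal entirely would replace the attenuation factor with a spectral or source-separation step, which is beyond this sketch.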
在一些实施例中,识别第一音频信号可以包括确定用户100已知该个体。确定用户100已知该个体可以包括从存储在存储器5605中的个体信息数据库5606检索信息。个体信息数据库5606可以将声纹与个体和/或个体的图像相关联。In some embodiments, identifying the first audio signal may include determining that the individual is known to the user 100 . Determining that the individual is known to the user 100 may include retrieving information from an individual information database 5606 stored in memory 5605 . Individual information database 5606 may associate voiceprints with individuals and/or images of individuals.
图57是示出符合所公开实施例的助听器系统5600的用户100的示例性环境的示意图。由用户100佩戴的助听器系统5600可以被配置为捕捉多个声音5704、5705和5706,并识别用户环境内的一个或多个个体。例如,在多个声音5704、5705和5706中,用户100可能希望关注来自个体5701的声音5704。在一些实施例中,个体5701可能是用户100的朋友、同事、亲戚或以前的熟人。在一些实施例中,用户100可能不知道个体5701。如图57所示,助听器系统5600可以被配置为检测唇部5703的一个或多个运动或识别与用户100的环境内的个体5701相关联的语音5707。57 is a schematic diagram illustrating an exemplary environment for a user 100 of a hearing aid system 5600 consistent with the disclosed embodiments. Hearing aid system 5600 worn by user 100 may be configured to capture multiple sounds 5704, 5705 and 5706 and identify one or more individuals within the user's environment. For example, among multiple voices 5704, 5705, and 5706, user 100 may wish to focus on voice 5704 from individual 5701. In some embodiments, individual 5701 may be a friend, colleague, relative, or former acquaintance of user 100 . In some embodiments, individual 5701 may not be known to user 100 . As shown in FIG. 57 , the hearing aid system 5600 may be configured to detect one or more movements of the lips 5703 or to recognize speech 5707 associated with an individual 5701 within the environment of the user 100 .
助听器系统5600可以被配置为使用麦克风5602来捕捉声音5704、5705和5706。声音5704与个体5701的语音5707相关联,并且声音5705和5706可以与用户100的环境中的附加声音或背景噪声相关联。在一些实施例中,多个声音可以包括用户100附近的一个或多个个体和/或一个或多个对象的语音或非语音声音、环境声音(例如,音乐、音调或环境噪声)等。处理器5603可以被配置为分析由麦克风5602捕捉的音频信号,以分离与个体5701的语音5707相关联的声音5704。例如,处理器5603可以使用个体5701的预先获取的声纹,该个体5701可以被确定为正在说话。如果用户100知道个体5701,则可以检索并使用先前存储的声纹。例如,处理器5603可以访问声纹数据库5607,其可以包括对应于一个或多个个体的一个或多个声纹。处理器5603可以将表示声音5704的声纹与存储在声纹数据库5607中的声纹进行比较,以确定数据库中是否存在个体5701的更好的声纹。Hearing aid system 5600 may be configured to capture sounds 5704, 5705, and 5706 using microphone 5602. Sound 5704 is associated with speech 5707 of individual 5701, and sounds 5705 and 5706 may be associated with additional sounds or background noise in the environment of user 100. In some embodiments, the plurality of sounds may include speech or non-speech sounds of one or more individuals and/or one or more objects in the vicinity of the user 100, ambient sounds (e.g., music, tones, or ambient noise), and the like. The processor 5603 may be configured to analyze the audio signals captured by the microphone 5602 to separate the sound 5704 associated with the speech 5707 of the individual 5701. For example, processor 5603 may use a pre-acquired voiceprint of individual 5701, who may be determined to be speaking. If the individual 5701 is known to the user 100, a previously stored voiceprint may be retrieved and used. For example, processor 5603 may access voiceprint database 5607, which may include one or more voiceprints corresponding to one or more individuals. Processor 5603 may compare the voiceprint representing sound 5704 with the voiceprints stored in voiceprint database 5607 to determine whether a better voiceprint for individual 5701 exists in the database.
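Comparing a voiceprint derived from sound 5704 against the voiceprints stored in voiceprint database 5607, as described above, is commonly done by measuring the similarity of fixed-length embeddings. A minimal sketch, under the assumptions (not stated in the disclosure) that voiceprints are plain embedding vectors and that cosine similarity with an arbitrary threshold decides a match:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def best_match(query, voiceprint_db, threshold=0.8):
    """Return the ID of the stored voiceprint most similar to the query
    embedding, or None if no stored voiceprint clears the threshold."""
    best_id, best_score = None, threshold
    for vp_id, embedding in voiceprint_db.items():
        score = cosine_similarity(query, embedding)
        if score > best_score:
            best_id, best_score = vp_id, score
    return best_id
```

Production speaker-verification systems derive the embeddings with trained models and calibrate the threshold empirically; the linear scan here is only for clarity.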
助听器系统5600可以被配置为使用可穿戴相机5601来捕捉个体5701的一个或多个面部图像5702。处理器5603可以被配置为分析捕捉的个体5701的面部图像5702。例如,如上文关于图23A-图23C所述,处理器5603可以被配置为使用一种或多种图像处理技术(诸如卷积神经网络(CNN)、尺度不变特征变换(SIFT)、定向梯度直方图(HOG)特征或其他技术)来检测个体5701的一个或多个面部特征,其可以包括但不限于个体5701的嘴5703。处理器5603还可以被配置为检测与个体5701的嘴5703相关联的一个或多个点,并实时跟踪个体5701的唇部的运动。基于检测到的唇部移动,处理器5603可以从多个音频信号中识别与个体5701的声音5704相关联的音频信号。例如,处理器5603可以将检测到的唇部移动的定时与接收到的音频信号中的语音模式的定时进行比较,以确定对应于唇部移动的音频信号。因此,个体5701的语音可以被识别为与其他信号分离和/或使用读唇结合预先获取的声纹进行处理。Hearing aid system 5600 may be configured to capture one or more facial images 5702 of individual 5701 using wearable camera 5601. The processor 5603 may be configured to analyze the captured facial images 5702 of the individual 5701. For example, as described above with respect to FIGS. 23A-23C, the processor 5603 may be configured to use one or more image processing techniques (such as convolutional neural networks (CNN), scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG) features, or other techniques) to detect one or more facial features of the individual 5701, which may include, but are not limited to, the mouth 5703 of the individual 5701. The processor 5603 may also be configured to detect one or more points associated with the mouth 5703 of the individual 5701 and track the movement of the lips of the individual 5701 in real time. Based on the detected lip movement, the processor 5603 may identify, from the plurality of audio signals, the audio signal associated with the voice 5704 of the individual 5701. For example, the processor 5603 may compare the timing of the detected lip movements to the timing of speech patterns in the received audio signals to determine the audio signal corresponding to the lip movements. Thus, the speech of the individual 5701 may be identified and separated from other signals and/or processed using lip reading in conjunction with a pre-acquired voiceprint.
This combined analysis can provide better results than using each technique individually.
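One way to realize the timing comparison described above — matching the detected lip movement against the voice-activity pattern of each candidate audio signal — is to score per-frame agreement and keep the best-scoring stream. This is a hedged sketch under the simplifying assumption that lip movement and voice activity have already been reduced to per-frame booleans on a shared clock; real systems must first align camera frames with audio analysis windows:

```python
def overlap_score(lip_active, audio_active):
    """Fraction of frames on which lip movement and voice activity agree.
    Both inputs are per-frame booleans of equal length."""
    agree = sum(1 for l, a in zip(lip_active, audio_active) if l == a)
    return agree / len(lip_active)

def pick_matching_signal(lip_active, activity_per_signal):
    """Index of the audio stream whose voice activity best matches the
    observed lip movement."""
    scores = [overlap_score(lip_active, a) for a in activity_per_signal]
    return max(range(len(scores)), key=scores.__getitem__)
```

A more elaborate variant could compare detected visemes against phonemes recognized in each stream, as the disclosure also contemplates.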
在一些实施例中,在发送与个体5701的声音5704相关联的音频信号之前,处理器5603可以被编程为执行用户100的声音的选择性调节。在一些实施例中,用户100的声音的选择性调节可以包括相对于来自用户环境的至少一个第二音频信号来放大用户的音频信号或从用户的音频信号中去除背景噪声。在一些实施例中,用户100的声音的选择性调节可以包括相对于用户的音频信号衰减来自用户环境的至少一个第二音频信号或滤除至少一个第二音频信号。在一些实施例中,用户100的声音的选择性调节可以包括改变用户的语速或在词语或句子之间引入一个或多个停顿。In some embodiments, the processor 5603 may be programmed to perform selective conditioning of the voice of the user 100 prior to transmitting the audio signal associated with the voice 5704 of the individual 5701. In some embodiments, the selective conditioning of the voice of the user 100 may include amplifying the user's audio signal relative to at least one second audio signal from the user's environment, or removing background noise from the user's audio signal. In some embodiments, the selective conditioning of the voice of the user 100 may include attenuating at least one second audio signal from the user's environment relative to the user's audio signal, or filtering out the at least one second audio signal. In some embodiments, the selective conditioning of the voice of the user 100 may include changing the user's speech rate or introducing one or more pauses between words or sentences.
图58是示出符合所公开实施例的用于选择性地调节或以其他方式处理助听器系统中的声音的示例性方法5800的流程图。处理器5603可以执行过程5800,以在系统5600捕捉个体5701的语音的音频信号和/或个体5701的图像之后选择性地调节来自用户100的周围环境的声音。58 is a flowchart illustrating an exemplary method 5800 for selectively conditioning or otherwise processing sound in a hearing aid system, consistent with disclosed embodiments. The processor 5603 can perform the process 5800 to selectively adjust the sound from the user's 100 surroundings after the system 5600 captures the audio signal of the individual 5701's speech and/or the image of the individual 5701.
方法5800可以包括从用户的环境接收多个图像的步骤5801。多个图像可以由可穿戴相机捕捉。例如,在步骤5801处,处理器5603可以接收由可穿戴相机5601捕捉的多个图像。在一些实施例中,该多个图像可以包括个体5701的面部图像5702。The method 5800 can include the step 5801 of receiving a plurality of images from a user's environment. Multiple images can be captured by the wearable camera. For example, at step 5801, the processor 5603 may receive a plurality of images captured by the wearable camera 5601. In some embodiments, the plurality of images may include a facial image 5702 of the individual 5701 .
方法5800可以包括基于对多个图像的分析,检测与个体的嘴相关联的至少一个唇部移动的步骤5802。例如,在步骤5802处,处理器5603可以基于对多个图像的分析,检测与个体5701的嘴5703相关联的至少一个唇部移动或唇部位置。处理器5603可以识别与个体5701的嘴5703相关联的一个或多个点。在一些实施例中,处理器5603可以开发与个体5701的嘴5703相关联的轮廓,该轮廓可以定义与个体的嘴或唇部相关联的边界。可以在多个帧或图像上跟踪在图像中识别出的唇部,以识别唇部移动。因此,处理器5603可以使用如上所述的各种视频跟踪算法。The method 5800 may include the step 5802 of detecting, based on analysis of the plurality of images, at least one lip movement associated with the individual's mouth. For example, at step 5802, the processor 5603 may detect at least one lip movement or lip position associated with mouth 5703 of individual 5701 based on analysis of the plurality of images. The processor 5603 may identify one or more points associated with the mouth 5703 of the individual 5701. In some embodiments, the processor 5603 may develop a contour associated with the mouth 5703 of the individual 5701, which may define a boundary associated with the individual's mouth or lips. Lips identified in an image may be tracked over multiple frames or images to identify lip movement. Accordingly, the processor 5603 may use various video tracking algorithms, as described above.
方法5800还可以包括接收表示由至少一个麦克风捕捉的声音的多个音频信号的步骤5803。例如,在步骤5803处,麦克风5602可以捕捉多个声音5704、5705和5706,处理器5603可以接收表示多个声音5704、5705和5706的多个音频信号。声音5704与个体5701的语音相关联,并且声音5705和5706可以是用户100的环境中的附加声音或背景噪声。在一些实施例中,声音5705和5706可以包括个体5701以外的一个或多个个体的语音或非语音声音、环境声音(例如,音乐、音调或环境噪声)等。The method 5800 may also include the step 5803 of receiving a plurality of audio signals representing sounds captured by at least one microphone. For example, at step 5803, the microphone 5602 may capture a plurality of sounds 5704, 5705, and 5706, and the processor 5603 may receive a plurality of audio signals representing the plurality of sounds 5704, 5705, and 5706. Sound 5704 is associated with the speech of individual 5701, and sounds 5705 and 5706 may be additional sounds or background noise in the environment of user 100. In some embodiments, sounds 5705 and 5706 may include speech or non-speech sounds of one or more individuals other than individual 5701, ambient sounds (e.g., music, tones, or ambient noise), and the like.
方法5800可以包括获得与用户环境内的个体相关联的声纹的步骤5804。在一些实施例中,可以基于多个图像或多个音频信号中的至少一个来识别个体。例如,步骤5804可以包括使用面部识别、语音识别或用于识别个体的其他手段。可以以各种方式获得声纹。在一些实施例中,获得声纹可以包括基于与个体的语音相关联的先前音频信号来生成声纹。例如,这可以包括检测个体5701在其中单独说话的片段,以及在该片段期间提取个体5701的声纹。The method 5800 can include the step 5804 of obtaining a voiceprint associated with an individual within the user's environment. In some embodiments, the individual may be identified based on at least one of the plurality of images or the plurality of audio signals. For example, step 5804 may include the use of facial recognition, voice recognition, or other means for identifying individuals. Voiceprints can be obtained in various ways. In some embodiments, obtaining the voiceprint may include generating the voiceprint based on previous audio signals associated with the individual's speech. For example, this may include detecting a segment in which the individual 5701 speaks alone, and extracting the individual's 5701 voiceprint during the segment.
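The step of generating a voiceprint from a segment in which the individual speaks alone can be sketched as averaging frame-level features over the solo frames. This is a toy illustration: the frame representation and the idea of averaging raw features are assumptions made here for clarity, whereas real systems derive speaker embeddings with trained models:

```python
def extract_voiceprint(frames):
    """frames: list of (num_active_speakers, feature_vector) pairs.
    Average the feature vectors over frames where exactly one speaker
    is active (the individual speaking alone); return None if no such
    solo frames exist."""
    solo = [feat for n, feat in frames if n == 1]
    if not solo:
        return None
    dim = len(solo[0])
    return [sum(f[i] for f in solo) / len(solo) for i in range(dim)]
```

Returning `None` when no solo segment is found mirrors the fallback described above, where a previously stored voiceprint would be retrieved from the database instead.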
在一些实施例中,获得声纹包括基于对多个图像中的至少一个中的说话者的识别来从数据库检索声纹。例如,可以通过将一个或多个捕捉图像中的个体5701的表示或特征与个体信息数据库5606中的条目进行比较来识别个体5701。基于该比较,可以从声纹数据库5607检索个体5701的先前声纹。提取出的声纹或检索到的声纹可用于分析接收的音频信号以分离和处理个体5701的语音。在一些实施例中,如果没有先前的声纹可用,或者如果提取出的声纹比从声纹数据库5607检索到的声纹具有更高的质量(例如,在较安静区域捕捉的音频上生成的声纹等),则新生成的声纹可以除了先前存储的声纹之外被存储在数据库中或者替代先前存储的声纹而存储。步骤5804可以使用训练的模型来确定音频信号是否包括与特定声纹相关联的语音,或者提供音频信号包括与特定声纹相关联的语音的概率。在一些实施例中,可以只发生步骤5801和5802,从而只执行读唇。在一些实施例中,可以只发生步骤5803和5804,从而只执行语音签名检测。在一些实施例中,可以发生所有步骤5801-5804,从而执行读唇和语音签名检测两者。In some embodiments, obtaining the voiceprint includes retrieving the voiceprint from a database based on the identification of the speaker in at least one of the plurality of images. For example, individual 5701 may be identified by comparing representations or characteristics of individual 5701 in one or more captured images to entries in individual information database 5606. Based on this comparison, the individual 5701's previous voiceprints can be retrieved from the voiceprint database 5607. The extracted voiceprint or the retrieved voiceprint can be used to analyze the received audio signal to separate and process the speech of the individual 5701. In some embodiments, if no previous voiceprints are available, or if the extracted voiceprints are of higher quality than those retrieved from the voiceprint database 5607 (eg, generated on audio captured in quieter areas) voiceprint, etc.), the newly generated voiceprint may be stored in the database in addition to or in place of the previously stored voiceprint. Step 5804 may use the trained model to determine whether the audio signal includes speech associated with a particular voiceprint, or to provide a probability that the audio signal includes speech associated with a particular voiceprint. In some embodiments, only steps 5801 and 5802 may occur so that only lip reading is performed. In some embodiments, only steps 5803 and 5804 may occur so that only voice signature detection is performed. 
In some embodiments, all steps 5801-5804 may occur to perform both lip reading and speech signature detection.
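Because the method may run lip reading only, voice-signature detection only, or both, the decision logic can be sketched as a simple score fusion that falls back to whichever cue is available. The weights and the linear combination are illustrative assumptions, not a disclosed formula:

```python
def fuse_evidence(lip_score=None, voiceprint_score=None, lip_weight=0.5):
    """Combine lip-reading and voiceprint evidence when both are present;
    otherwise return the single available score (or None if neither ran)."""
    if lip_score is None:
        return voiceprint_score
    if voiceprint_score is None:
        return lip_score
    return lip_weight * lip_score + (1 - lip_weight) * voiceprint_score
```

A trained model, as mentioned above, could replace this hand-weighted combination by emitting a calibrated probability directly.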
方法5800可以包括基于声纹或检测到的唇部移动中的至少一个,识别多个音频信号中与个体5701的语音相关联的第一音频信号的步骤5805。例如,这可以包括将第一音频信号与不与个体相关联的一个或多个音频信号分离。例如,在步骤5805处,处理器5603可以基于在步骤5804处创建或检索出的声纹或在步骤5802处检测到的唇部移动中的至少一个,从与声音5704、5705和5706相关联的多个音频信号中识别与个体5701的语音5707相关联的音频信号。在一些实施例中,处理器5603可以基于在步骤5804处获得的声纹和在步骤5802处检测到的唇部移动的组合,从多个音频信号中识别与个体5701的声音相关联的音频信号。如上所述,一旦分离出第一音频信号,处理器5603可以将检测到的特定唇部移动与在第一音频信号中识别出的音素或其他特征进行比较。在一些实施例中,识别第一音频信号可以包括确定用户已知该个体。The method 5800 may include the step 5805 of identifying, based on at least one of a voiceprint or a detected lip movement, a first audio signal of the plurality of audio signals associated with the speech of the individual 5701. For example, this may include separating the first audio signal from one or more audio signals not associated with the individual. For example, at step 5805, the processor 5603 may identify, based on at least one of the voiceprint created or retrieved at step 5804 or the lip movement detected at step 5802, the audio signal associated with the speech 5707 of the individual 5701 from among the plurality of audio signals associated with sounds 5704, 5705, and 5706. In some embodiments, the processor 5603 may identify the audio signal associated with the voice of the individual 5701 from the plurality of audio signals based on a combination of the voiceprint obtained at step 5804 and the lip movement detected at step 5802. As described above, once the first audio signal is isolated, the processor 5603 may compare the detected particular lip movements to phonemes or other features identified in the first audio signal. In some embodiments, identifying the first audio signal may include determining that the individual is known to the user.
Determining that the individual is known to the user may include retrieving information from a database stored in memory that associates the voiceprint with the individual. For example, the information in the database associating one or more voiceprints with one or more individuals may include a mapping table, which may also include information indicating whether the one or more individuals are known to the user 100 and their relationship to the user 100. Processor 5603 may access individual information database 5606 to retrieve individual information stored in memory 5605 and determine whether the individual is known to the user.
方法5800可以包括处理第一音频信号的步骤5806。在一些实施例中,如贯穿本公开所描述的,该处理可以包括选择性调节。例如,在步骤5806处,处理器5603可以对与个体5701的语音相关联的音频信号执行各种形式的选择性调节。在一些实施例中,处理第一音频信号可以包括相对于多个音频信号中的至少一个第二音频信号放大第一音频信号或去除第一音频信号的背景噪声。例如,处理器5603可以相对于与声音5705和5706相关联的音频信号中的至少一个放大与个体5701的语音相关联的音频信号。放大可以通过各种手段来执行,诸如方向性麦克风的操作、改变与麦克风相关联的一个或多个参数、或数字化处理音频信号。处理器5603还可以去除与个体5701的语音相关联的音频信号的背景噪声。在一些实施例中,处理第一音频信号可以包括相对于第一音频信号衰减多个音频信号中的至少一个第二音频信号或滤除多个音频信号中的至少一个第二音频信号。例如,处理器5603可以选择性地衰减与声音5705和5706相关联的音频信号中的至少一个,或滤除与声音5705和5706相关联的音频信号中的至少一个。在一些实施例中,处理第一音频信号可以包括改变识别出的语音的速率或在识别出的语音的词语或句子之间引入一个或多个停顿。例如,处理器5603可以改变与个体5701的语音相关联的音频信号相关联的识别出的语音的速率,或者在与个体5701的语音相关联的音频信号相关联的识别出的语音的词语或句子之间引入一个或多个停顿。在一些实施例中,处理第一音频信号可以包括改变与个体5701的语音相关联的音频信号的音调。在一些实施例中,处理第一音频信号可以包括转录第一音频信号。Method 5800 may include a step 5806 of processing the first audio signal. In some embodiments, the processing may include selective conditioning, as described throughout this disclosure. For example, at step 5806, the processor 5603 may perform various forms of selective conditioning on the audio signal associated with the individual's 5701 speech. In some embodiments, processing the first audio signal may include amplifying the first audio signal or removing background noise of the first audio signal relative to at least one second audio signal of the plurality of audio signals. For example, the processor 5603 may amplify the audio signal associated with the speech of the individual 5701 relative to at least one of the audio signals associated with the sounds 5705 and 5706. Amplification may be performed by various means, such as operation of a directional microphone, changing one or more parameters associated with the microphone, or digitally processing the audio signal. The processor 5603 may also remove background noise from the audio signal associated with the individual's 5701 speech. 
In some embodiments, processing the first audio signal may include attenuating at least one second audio signal of the plurality of audio signals relative to the first audio signal, or filtering out at least one second audio signal of the plurality of audio signals. For example, processor 5603 may selectively attenuate at least one of the audio signals associated with sounds 5705 and 5706, or filter out at least one of the audio signals associated with sounds 5705 and 5706. In some embodiments, processing the first audio signal may include changing the rate of the recognized speech or introducing one or more pauses between words or sentences of the recognized speech. For example, the processor 5603 may change the rate of the recognized speech associated with the audio signal associated with the speech of the individual 5701, or introduce one or more pauses between words or sentences of that recognized speech. In some embodiments, processing the first audio signal may include changing the pitch of the audio signal associated with the speech of the individual 5701. In some embodiments, processing the first audio signal may include transcribing the first audio signal.
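The pause-insertion form of processing mentioned above can be sketched as concatenating word-level segments with stretches of silence between them. The segment representation, sample rate, and pause length below are illustrative assumptions:

```python
def insert_pauses(word_segments, sample_rate=16000, pause_ms=200):
    """Join word-level sample lists, inserting pause_ms of silence
    between consecutive words to slow the speech down for the listener."""
    silence = [0.0] * (sample_rate * pause_ms // 1000)
    out = []
    for i, segment in enumerate(word_segments):
        if i:  # no leading silence before the first word
            out.extend(silence)
        out.extend(segment)
    return out
```

Changing the rate of the speech itself, rather than merely spacing the words, would additionally require a time-stretching algorithm that preserves pitch, which this sketch does not attempt.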
方法5800可以包括使经选择性调节的第一音频信号传输到听觉接口设备的步骤5807,该听觉接口设备被配置为向用户的耳朵提供声音。例如,收发器5604可以将经调节的音频信号发送到听觉接口设备(诸如听觉接口设备1710),该听觉接口设备可以向用户100提供对应于与个体5701的语音相关联的音频信号的声音。在一些实施例中,听觉接口设备可以包括与听筒相关联的扬声器。例如,听觉接口设备可以至少部分地插入用户的耳朵中,用于向用户提供音频。听觉接口设备也可以在耳朵外部,诸如耳后听觉设备、一个或多个耳机、小型便携式扬声器等。在一些实施例中,听觉接口设备可以包括骨传导耳机1711(诸如上文讨论的骨传导耳机1711),其被配置为通过用户头骨的振动向用户提供音频信号。这样的设备可以与使用者的皮肤外部接触放置,或者可以通过外科手术植入并附接到使用者的骨骼上。Method 5800 may include a step 5807 of transmitting the selectively conditioned first audio signal to an auditory interface device configured to provide sound to a user's ear. For example, transceiver 5604 may transmit the conditioned audio signal to an auditory interface device, such as auditory interface device 1710 , which may provide user 100 with sounds corresponding to the audio signal associated with individual 5701's speech. In some embodiments, the auditory interface device may include a speaker associated with the earpiece. For example, an auditory interface device may be inserted at least partially into a user's ear for providing audio to the user. The auditory interface device may also be external to the ear, such as a behind-the-ear hearing device, one or more earphones, small portable speakers, and the like. In some embodiments, the auditory interface device may include a bone conduction headset 1711 (such as the bone conduction headset 1711 discussed above) configured to provide audio signals to the user through vibrations of the user's skull. Such devices may be placed in external contact with the user's skin, or may be surgically implanted and attached to the user's bone.
在一些实施例中,存储器5605可以包括存储由处理器5603执行以执行如上所述的方法5800的程序指令的非暂时性计算机可读存储介质。In some embodiments, memory 5605 may include a non-transitory computer-readable storage medium storing program instructions for execution by processor 5603 to perform method 5800 as described above.
上述描述是为了说明的目的而提出的。它不是穷尽性的,并且不限于所公开的精确形式或实施例。从所公开的实施例的说明书和实践的考虑来看,修改和适配对于本领域的技术人员将是显而易见的。另外,尽管所公开的实施例的方面被描述为存储在存储器中,但本领域技术人员将理解,这些方面也可以存储在其他类型的计算机可读介质(诸如辅助存储设备)上,例如硬盘或CD ROM,或其他形式的RAM或ROM、USB介质、DVD、蓝光、超高清蓝光或其他光驱介质。The foregoing description has been presented for purposes of illustration. It is not exhaustive and is not limited to the precise forms or embodiments disclosed. Modifications and adaptations will be apparent to those skilled in the art from consideration of the specification and practice of the disclosed embodiments. Additionally, although aspects of the disclosed embodiments are described as being stored in memory, those skilled in the art will appreciate that these aspects may also be stored on other types of computer-readable media, such as secondary storage devices, such as hard disks or CD ROM, or other forms of RAM or ROM, USB media, DVD, Blu-ray, Ultra HD Blu-ray, or other optical drive media.
基于书面描述和公开的方法的计算机程序在有经验的开发人员的技能范围内。各种程序或程序模块可以使用本领域技术人员已知的任何技术来创建,或者可以结合现有软件来设计。例如,程序部分或程序模块可以用.NET Framework、.NET Compact Framework(以及相关语言,如Visual Basic、C等)、Java、C++、Objective-C、HTML、HTML/Ajax组合、XML或带有Java小程序的HTML来设计。Computer programs based on the written description and disclosed methods are within the skill of an experienced developer. The various programs or program modules may be created using any technique known to those skilled in the art, or may be designed in connection with existing software. For example, program sections or program modules may be designed in or by means of .NET Framework, .NET Compact Framework (and related languages, such as Visual Basic, C, etc.), Java, C++, Objective-C, HTML, HTML/Ajax combinations, XML, or HTML with included Java applets.
此外,虽然本文已经描述了说明性实施例,但是本领域技术人员基于本公开将理解具有等效元素、修改、省略、组合(例如,跨各种实施例的方面的组合)、适配和/或改变的任何和所有实施例的范围。权利要求中的限制应基于权利要求中使用的语言广义地解释,而不限于本说明书中描述的示例或在本申请的审查过程中描述的示例。这些示例应被解释为非排他性的。此外,可以以任何方式修改所公开的方法的步骤,包括通过重新排序步骤和/或插入或删除步骤。因此,本说明书和示例仅被认为是说明性的,其真正的范围和精神由所附权利要求及其等同物的全部范围来指示。Furthermore, although illustrative embodiments have been described herein, the scope of any and all embodiments having equivalent elements, modifications, omissions, combinations (e.g., of aspects across various embodiments), adaptations, and/or alterations will be appreciated by those skilled in the art based on this disclosure. The limitations in the claims are to be interpreted broadly based on the language employed in the claims, and are not limited to the examples described in the present specification or during the prosecution of the application. These examples are to be construed as non-exclusive. Furthermore, the steps of the disclosed methods may be modified in any manner, including by reordering steps and/or inserting or deleting steps. It is intended, therefore, that this specification and the examples be considered as illustrative only, with the true scope and spirit being indicated by the following claims and their full scope of equivalents.