KR20200009556A

KR20200009556A - Apparatus and Method for managing text changed from voice in call

Info

Publication number: KR20200009556A
Application number: KR1020180084074A
Authority: KR
Inventors: 안영수; 이상훈; 이은동; 신동진; 임치완; 권기재; 백민석; 전준용
Original assignee: 주식회사 케이티
Priority date: 2018-07-19
Filing date: 2018-07-19
Publication date: 2020-01-30
Anticipated expiration: 2038-07-19
Also published as: KR102136393B1

Abstract

Disclosed are an apparatus and a method for converting voice in a call into text and managing the text. According to one aspect, an apparatus for converting the call voice of a service subscriber into text and managing the converted text includes: a receiving unit that receives voice data of a call in which the service subscriber is the caller or the callee; a classification unit that classifies the received voice data into outgoing voice data and received voice data; a conversion unit that converts the classified voice data into outgoing text and received text, respectively; a call text unit that generates a call text by distinguishing the caller's outgoing text from the callee's received text and arranging them in chronological order; and a providing unit that, at the request of the service subscriber, retrieves the generated call text and provides it to the service subscriber's terminal.

Description

Apparatus and Method for managing text changed from voice in call

The present invention relates to voice call technology, and more particularly to an apparatus and a method for converting the conversational voice of a caller and a callee during a call into text and managing the converted text.

Conventionally, the only way to store the content of a call was a call recording service. A user installed a recording application on a mobile terminal and, during a mobile call, pressed the application's record button to save the call content as a voice file.

To check the recorded content, however, the user has to listen to the recorded voice file from beginning to end. If there are several recorded voice files, the user has to listen to all of them until the desired content is found.

Accordingly, with multiple voice files it is difficult to determine which voice file contains the desired content and where in that file the content is located.

Korean Patent Registration No. 10-0935524 (2009.12.28)

The present invention has been made to solve the above problem, and its object is to provide an apparatus and a method for converting the call voice between a caller and a callee into text and managing the converted text.

According to one aspect, an apparatus for converting the call voice of a service subscriber into text and managing the converted text includes: a receiving unit that receives voice data of a call in which the service subscriber is the caller or the callee; a classification unit that classifies the received voice data into outgoing voice data and received voice data; a conversion unit that converts the classified voice data into outgoing text and received text, respectively; a call text unit that generates a call text by distinguishing the caller's outgoing text from the callee's received text and arranging them in chronological order; and a providing unit that, at the request of the service subscriber, retrieves the generated call text and provides it to the service subscriber's terminal.

The apparatus further includes a DTMF unit that receives a DTMF (Dual-Tone Multi-Frequency) signal requesting text conversion from the call terminal of the service subscriber during the call, and the classification unit performs the classification in response to the received DTMF signal.

The classification unit classifies the outgoing and received voice data by referring to the source IP address and port and the destination IP address and port in a SIP (Session Initiation Protocol) message, and to the synchronization source identifier of the RTP (Real-time Transport Protocol) packets.

The apparatus further includes a silence removal unit that removes SID (Silence Indicator) packets, which correspond to silence, so that only voice packets remain among the RTP packets of the classified outgoing and received voice data, and the conversion unit converts the remaining voice packets into text using an STT (Speech To Text) engine.

The apparatus further includes a timestamp unit that calculates a voice generation time, as timestamp information for the converted text, using the time duration of an RTP packet obtained from the ptime value of the SDP (Session Description Protocol) in the SIP message.

The timestamp unit identifies the codec information of the RTP packets, determines from the sampling rate of the identified codec how much the timestamp increases per second, calculates the time duration of an SID packet from the increase of the SID packet's timestamp value relative to the immediately preceding packet, and calculates the voice generation time using the time duration of the RTP packets and the calculated time duration of the SID packet.

The call text unit stores the call text, which includes an outgoing telephone number, a receiving telephone number, a total call time, at least one set of outgoing text data and its voice generation time, and at least one set of received text data and its voice generation time.

The apparatus further includes a mixing unit that sets the start position of the later-arriving voice data (outgoing or received) by moving back, from the start position of whichever of the outgoing voice data and the received voice data arrived first, by the time duration of the RTP packets of the first-arriving voice data, and that uses the set start position to mix the voice streams of the outgoing voice data and the received voice data into a single integrated voice data stream; the providing unit provides the integrated voice data.
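The start-position rule of the mixing unit can be illustrated with a small sketch. It is only an illustration under stated assumptions, not the patented implementation: both streams are assumed to be already decoded to 16-bit PCM sample lists at the same sampling rate, and `offset_samples` and `mix` are hypothetical helper names.

```python
def offset_samples(num_packets: int, ptime_ms: int, sample_rate_hz: int) -> int:
    """Samples covered by the packets of the first-arriving stream
    that were received before the later stream started."""
    return num_packets * ptime_ms * sample_rate_hz // 1000


def mix(first: list, later: list, offset: int) -> list:
    """Mix two PCM streams, starting the later stream `offset` samples in."""
    length = max(len(first), offset + len(later))
    out = [0] * length
    for i, sample in enumerate(first):
        out[i] += sample
    for i, sample in enumerate(later):
        out[offset + i] += sample
    # Clip to the signed 16-bit range so overlapping speech cannot overflow.
    return [max(-32768, min(32767, s)) for s in out]
```

For instance, if three 20 ms packets of one side arrive before the other side starts and the audio is sampled at 16 kHz, the later stream would begin 960 samples into the mixed stream.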

The providing unit provides the call text to the service subscriber's terminal using at least one of a text message, e-mail, an SNS (Social Network Service), and a web page.

According to another aspect, a method by which an apparatus converts the call voice of a service subscriber into text and manages the converted text includes: receiving voice data of a call in which the service subscriber is the caller or the callee; classifying the received voice data into outgoing voice data and received voice data; converting the classified voice data into outgoing text and received text, respectively; generating a call text by distinguishing the caller's outgoing text from the callee's received text and arranging them in chronological order; and, at the request of the service subscriber, retrieving the generated call text and providing it to the service subscriber's terminal.

According to one aspect of the present invention, call content is managed as text, so a user can look up and search the text of the call content as needed.

When call content is converted into text, silence packets are removed and only packets carrying call voice are converted, which reduces the processing load on the STT engine.

When the call content of the caller and the callee is stored as a single integrated voice data stream, the start position of the later-arriving voice packets (from the calling or receiving side) is set back by the time duration of the packets from the side that arrived first, so that integrated voice data free of crosstalk can be stored.

The following drawings attached to this specification illustrate preferred embodiments of the present invention and, together with the detailed description below, serve to further the understanding of the technical idea of the invention; the invention should therefore not be construed as limited to what is shown in the drawings.
FIG. 1 is a schematic configuration diagram of a call management system according to an embodiment of the present invention.
FIG. 2 is a schematic configuration diagram of the text management server of FIG. 1.
FIG. 3 is an exemplary diagram in which the text management server of FIG. 2 classifies Tx and Rx voice data.
FIGS. 4A and 4B are exemplary diagrams in which the text management server of FIG. 2 classifies call voice packets into outgoing voice packets and received voice packets.
FIG. 5 is an exemplary diagram of an SID packet among the call voice packets received by the text management server of FIG. 2.
FIG. 6 is an exemplary diagram in which the text management server of FIG. 2 calculates the time of the timestamp of the SID packet of FIG. 5.
FIG. 7 is an exemplary diagram in which, according to another embodiment of the present invention, the text management server of FIG. 1 mixes outgoing voice data and received voice data into a single integrated voice stream.
FIGS. 8A and 8B are exemplary diagrams of delaying the start position of the later-arriving voice data of the calling or receiving side in the mixing of FIG. 7.
FIG. 9 is a schematic flowchart of a method for managing the text of call voice according to an embodiment of the present invention.

Hereinafter, preferred embodiments of the present invention will be described in detail with reference to the accompanying drawings. Before that, the terms and words used in this specification and the claims should not be construed as limited to their ordinary or dictionary meanings; they should be interpreted with meanings and concepts consistent with the technical idea of the invention, on the principle that an inventor may appropriately define the concepts of terms in order to describe the invention in the best way. Accordingly, the embodiments described in this specification and the configurations shown in the drawings are merely the most preferred embodiment of the invention and do not represent all of its technical idea, so it should be understood that various equivalents and modifications capable of replacing them may exist at the time of filing.

FIG. 1 is a schematic configuration diagram of a call management system 100 according to an embodiment of the present invention.

The call management system 100 according to an embodiment of the present invention includes a caller call terminal 110, a text management server 130, a TAS (Telephony Application Server) 140, a CSCF (Call Session Control Function) 150, and a callee call terminal 170.

When the caller call terminal 110 places a call to the callee call terminal 170, the call data is relayed through the carrier's servers, and when the callee call terminal 170 answers the incoming call, the call is connected and the voice call between the caller and the callee begins. The call data is divided into signaling data (e.g., SIP messages) and voice data (e.g., RTP packets), and each is processed and transmitted along its own path.

In a VoLTE environment, the carrier's servers include the TAS 140 and the CSCF 150 for processing the signaling data. The CSCF 150 relays the SIP messages of the call data from a call processing server. The TAS 140 is a server that provides call-based multimedia supplementary services (e.g., caller ID, ring-back tone, and call waiting services) to subscribers of those services. When a call service subscriber subscribes to the call management service of the present invention, the subscription information is managed by the TAS 140. The voice data of the call data is relayed to the caller call terminal 110 and the callee call terminal 170 through the carrier's call processing server.

Here, the present invention implements the call processing server as the text management server 130. While relaying the voice data of the call data between the caller call terminal 110 and the callee call terminal 170, the text management server 130 provides subscribers of the call management service with a service that converts call content into text. That is, at the request of a service subscriber, the call management service converts the content of a voice call into text and manages the converted text so that the subscriber can look it up and search it.

FIG. 2 is a schematic configuration diagram of the text management server 130 of FIG. 1.

The text management server 130 according to an embodiment of the present invention includes a receiving unit 231, a DTMF unit 232, a classification unit 233, a silence removal unit 234, a conversion unit 235, a timestamp unit 236, a call text unit 237, and a providing unit 239.

Assuming that the text management server 130 is a computer terminal composed of a memory and a processor, the components 231 to 239 may be loaded into the memory in the form of a program and executed by the processor. For example, the components 231 to 239 may be built as a text management program and then executed by the processor of the text management server 130 to provide the text management service using text converted from the voice of a call.

The receiving unit 231 receives, as RTP packets, the voice data of the caller and of the callee generated during a call. The outgoing voice data is relayed through the text management server 130 to the callee call terminal 170 as its destination, and the received voice data is relayed through the text management server 130 to the caller call terminal 110 as its destination.

The DTMF unit 232 receives, during a call of a subscriber to the call management service of the present invention, a DTMF signal from the subscriber's call terminal requesting text conversion of the call voice. That is, to request text conversion during a call, the subscriber presses a specific key (e.g., the * key) to generate a DTMF signal.

When the DTMF unit 232 detects the DTMF signal for the text management service, the classification unit 233 classifies the voice data received by the receiving unit 231 into outgoing voice data and received voice data. The classification of the voice data is described below with reference to FIGS. 3 to 4B.

The silence removal unit 234 removes any RTP packet of the outgoing or received voice data classified by the classification unit 233 that corresponds to silence, since silence does not need to be converted into text. After this removal, only RTP packets containing call voice remain. The silent RTP packets are described below with reference to FIGS. 5 and 6.

The conversion unit 235 converts the voice of outgoing RTP packets into outgoing text and the voice of received RTP packets into received text. The conversion unit 235 is implemented as an STT engine and may also be deployed as a separate STT server.

Because the conversion unit 235 receives only the call-voice RTP packets that remain after the silence removal unit 234 has removed the silent RTP packets, and performs STT-based text conversion on those alone, its data processing load is reduced.

The timestamp unit 236 converts the timestamp value of an RTP packet into a time and sets the calculated time as the timestamp information indicating when in the call the converted text was spoken. For example, the text converted from call voice carries a timestamp of its time of occurrence, just as a message does.

The timestamp unit 236 calculates the time because the clocks of the text management server 130, the caller call terminal 110, and the callee call terminal 170 differ, and because RTP packets take different paths with different transit times. Moreover, the order in which RTP packets arrive at the text management server 130 does not necessarily match the order in which the voice was produced, so the arrival time must not be used as the voice generation time. For these reasons, to obtain an accurate voice generation time, the timestamp unit 236 sets the call start time as a reference and then calculates the generation time from the timestamp value contained in each RTP packet. As a result, the voice generation time of the text converted from voice is accurate. The call start time used as the reference may be the arrival time of the first packet. The calculation is described below with reference to FIG. 6.
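The idea of anchoring every utterance to the call start time and advancing by RTP timestamps, rather than by arrival times, can be sketched as follows. This is a minimal illustration under assumptions: `utterance_time` is a hypothetical helper, `first_ts` is taken to be the RTP timestamp of the first packet of the same stream, and the clock rate is assumed to be known from the negotiated codec.

```python
from datetime import datetime, timedelta

RTP_TS_MODULUS = 2 ** 32  # the RTP timestamp field is 32 bits wide and wraps


def utterance_time(call_start: datetime, first_ts: int,
                   packet_ts: int, clock_rate_hz: int) -> datetime:
    """Voice generation time = call start + elapsed RTP ticks / clock rate."""
    ticks = (packet_ts - first_ts) % RTP_TS_MODULUS
    return call_start + timedelta(seconds=ticks / clock_rate_hz)
```

Because only timestamp differences within one stream are used, the result does not depend on the clocks of the terminals or on the order in which packets reach the server.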

The call text unit 237 generates a call text by distinguishing the caller's outgoing text from the callee's received text, both converted by the conversion unit 235, and arranging them in the chronological order calculated by the timestamp unit 236, and then stores the generated call text.

To be provided to the service subscriber, the stored call text information includes the outgoing telephone number, the receiving telephone number, the total call time, at least one set of outgoing text data and voice generation time, and at least one set of received text data and voice generation time. Each set consists of a text sentence and its voice generation time. The screen UI that displays the call text information may be laid out like a messenger screen and is not particularly limited.
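One possible shape for the stored call text is sketched below. The class and field names (`Utterance`, `CallText`, `time_ms`, and so on) are hypothetical and chosen only for illustration; the description does not prescribe a storage format.

```python
from dataclasses import dataclass, field


@dataclass
class Utterance:
    """One set: a text sentence and its voice generation time."""
    time_ms: int   # offset from call start, in milliseconds
    speaker: str   # "caller" or "callee"
    text: str


@dataclass
class CallText:
    caller_number: str
    callee_number: str
    total_call_seconds: int
    outgoing: list = field(default_factory=list)   # caller's utterances
    received: list = field(default_factory=list)   # callee's utterances

    def transcript(self) -> list:
        """Both sides merged in chronological order, messenger style."""
        return sorted(self.outgoing + self.received, key=lambda u: u.time_ms)
```

Keeping the two sides in separate lists preserves the outgoing/received distinction, while `transcript()` yields the time-ordered view that a messenger-like screen would display.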

At the request of the service subscriber, the providing unit 239 retrieves the call text managed by the call text unit 237 and provides it to the service subscriber's terminal.

The call text information may be provided to the service subscriber's terminal using at least one of a text message, e-mail, an SNS (Social Network Service), and a web page. For example, a web page that lists the call content of the caller and the callee as text, like the chat screen UI of KakaoTalk, may be provided to the service subscriber's terminal.

FIG. 3 is an exemplary diagram in which the text management server 130 of FIG. 2 classifies Tx and Rx voice data.

When service subscriber A is the caller and places a call to the callee, the Tx stream of outgoing data is the voice content spoken by A, and the Rx stream of received data is the callee's voice content that caller A hears.

The classification unit 233 of the text management server 130 classifies the RTP packets of the call data into a Tx stream of outgoing voice data and an Rx stream of received voice data by referring to the source IP address and port (src IP/Port) and the destination IP address and port (dst IP/Port) contained in the SIP (Session Initiation Protocol) messages of the call data, together with the synchronization source identifier (SSRC) of the RTP packets. The SSRC identifies the sound source of the call whose packets are linked into one stream by synchronization.

For example, when service subscriber A is the caller and places a call to the callee, RTP packets whose source IP/Port is A's call terminal and whose destination IP/Port is the text management server 130 are classified into the Tx stream of outgoing voice data spoken by A. RTP packets whose source IP/Port is the text management server 130 and whose destination IP/Port is A's call terminal are classified into the Rx stream of received voice data, which the other party speaks and A hears.
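The classification rule in this example can be sketched as follows. This is a minimal illustration, not the patented implementation: `RtpPacket`, `classify`, and the addresses used with them are hypothetical, and packets are assumed to be already parsed into source and destination (IP, port) pairs plus an SSRC.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RtpPacket:
    src: tuple          # (ip, port) the packet was sent from
    dst: tuple          # (ip, port) the packet was sent to
    ssrc: int           # synchronization source identifier
    payload: bytes = b""


def classify(packets, subscriber: tuple, server: tuple) -> dict:
    """Split packets into Tx (subscriber speaks) and Rx (subscriber hears),
    grouping each direction by SSRC so one sound source forms one stream."""
    streams = {"tx": {}, "rx": {}}
    for p in packets:
        if (p.src, p.dst) == (subscriber, server):
            direction = "tx"
        elif (p.src, p.dst) == (server, subscriber):
            direction = "rx"
        else:
            continue  # not part of this subscriber's media session
        streams[direction].setdefault(p.ssrc, []).append(p)
    return streams
```

The subscriber and server addresses would come from the SIP signaling of the call; packets that match neither direction are simply ignored in this sketch.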

FIGS. 4A and 4B are exemplary diagrams in which the text management server 130 of FIG. 2 classifies call voice packets into outgoing voice packets and received voice packets.

FIG. 4A shows the voice RTP packets of the caller and the callee arriving at the text management server 130. Assuming that RTP packet 1 is received voice and RTP packet 2 is outgoing voice, three packets of the receiving side's RTP packet 1 arrive at the text management server 130 first, followed by two packets of the calling side's RTP packet 2. As the conversation proceeds, further RTP packets continue to arrive at the text management server 130.

If each RTP packet carries 20 ms of voice data, this can be interpreted as the callee of RTP packet 1 speaking first for 60 ms, after which the caller of RTP packet 2 speaks.

Referring to FIG. 4B, the classification unit 233 separates the voice packets into the receiving side's RTP packet 1 and the calling side's RTP packet 2 by referring to the IP/Port values and SSRC values, as in FIG. 3.

FIG. 5 is an exemplary diagram of an SID packet 500 among the call voice packets received by the text management server of FIG. 2.

The RTP packets of voice data are divided into voice packets, which contain actual human speech, and silent SID packets 500. An SID packet 500 is a silent RTP packet generated when the user stops speaking and listens.

In an actual conversation, while the caller or the callee is talking, the other party listens. When the voice data is separated into calling-side Tx data and receiving-side Rx data, roughly half of each separated stream is therefore silence. That is, about half of the RTP packets of the call data may be SID packets 500 corresponding to this silence.

A silent SID packet 500 can be identified by checking the "FT" field defined in the RTP payload against the value defined for each codec. For the AMR-NB codec of FIG. 5, an "FT" value of "8" indicates that the RTP packet is an SID packet 500 in which the user's voice is silent. For the AMR-WB codec, an "FT" value of "9" indicates an SID packet 500. How the codec information is identified is described below with reference to FIG. 6. The silence removal unit 234 checks the value of the "FT" field in each RTP packet and filters out the SID packets 500, leaving only voice packets. Because only the remaining voice packets are sent to the conversion unit 235 of the STT server for text conversion, the load on the STT server is reduced.
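The FT check can be sketched as below. The sketch assumes the octet-aligned AMR payload format of RFC 4867, in which the first payload byte carries the CMR and the second byte is a table-of-contents entry laid out as F (1 bit), FT (4 bits), Q (1 bit), and 2 padding bits; the bandwidth-efficient format packs these bits differently and is not handled here. `frame_type` and `drop_sid` are hypothetical helper names.

```python
# SID frame types stated in the description: 8 for AMR-NB, 9 for AMR-WB.
SID_FRAME_TYPE = {"AMR-NB": 8, "AMR-WB": 9}


def frame_type(payload: bytes) -> int:
    """FT of the first ToC entry, assuming octet-aligned mode."""
    return (payload[1] >> 3) & 0x0F


def drop_sid(payloads, codec: str) -> list:
    """Keep only speech payloads; SID (silence) payloads are filtered out."""
    sid = SID_FRAME_TYPE[codec]
    return [p for p in payloads if frame_type(p) != sid]
```

Only the payloads returned by `drop_sid` would then be handed to the STT engine, which is where the load reduction described above comes from.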

FIG. 6 is an exemplary diagram in which the text management server 130 of FIG. 2 calculates the time of the timestamp of the SID packet 500 of FIG. 5.

Assume that, in order of speech, the (n-1)th RTP packet, the nth RTP packet 600, the (n+1)th SID packet 500, and the (n+2)th RTP packet are lined up. Each RTP packet contains a timestamp value. RTP packets that contain voice all have the same time duration, but the time duration of an SID packet 500 varies by codec and must be calculated from the codec information.

For an RTP packet that carries voice, the timestamp unit 236 of the text management server 130 defines the packet's duration as the ptime value (e.g., 20 ms) in the SDP of the SIP message.

For an RTP packet that is an SID packet 500, the timestamp unit 236 calculates the duration with reference to the codec information. The timestamp unit 236 can determine the codec information by checking the codec header inside the RTP payload or from the length of the payload. VoLTE terminals use the EVS, AMR-WB, and AMR-NB codecs. With the EVS codec, for example, the SID packet 500 is a silence packet whose payload is shorter than that of a normal voice packet. Normal voice packets are sent once per 20 ms of audio, whereas an SID packet 500 may be sent every 20 ms or at a longer interval (a multiple of 20 ms). The duration that an SID packet 500 represents therefore has to be calculated, and it can be calculated from the timestamp values of the RTP packets.

For example, if the timestamp of the nth RTP packet 600 is 1,000 and the timestamp of the (n+1)th SID packet 500 is 2,600, the timestamp unit 236 calculates the timestamp difference as 1,600. The timestamp field of the RTP packets 500 and 600 is a relative value tied to the sampling rate of the call codec. To convert RTP timestamps into an absolute duration, the sampling rate of the call codec must therefore be known. The sampling rate can be found by checking the SDP of the SIP message exchanged when the call is set up. Assuming the call is set up with the AMR-WB codec, the timestamp unit 236 reads the AMR-WB sampling rate of 16 kHz from the SDP. 16 kHz means 16,000 samples per second, i.e., 16 samples per 1 ms, so the RTP timestamp increases by 16 per 1 ms. A voice RTP packet 600 is sent every 20 ms, so the timestamp increases by 320 (16 × 20) per packet.
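As a rough sketch of how the sampling rate and ptime might be read from the SDP, the snippet below parses `a=rtpmap` and `a=ptime` attribute lines and derives the per-packet timestamp increment. The function names and the sample SDP text are illustrative assumptions, not taken from the patent.

```python
import re


def clock_rate_and_ptime(sdp: str, codec: str):
    """Return (clock rate in Hz, ptime in ms) for the named codec."""
    rate = None
    ptime_ms = None
    for raw in sdp.splitlines():
        line = raw.strip()
        m = re.match(r"a=rtpmap:\d+ ([^/\s]+)/(\d+)", line)
        if m and m.group(1).upper() == codec.upper():
            rate = int(m.group(2))
        m = re.match(r"a=ptime:(\d+)", line)
        if m:
            ptime_ms = int(m.group(1))
    if rate is None or ptime_ms is None:
        raise ValueError("SDP lacks rtpmap or ptime for " + codec)
    return rate, ptime_ms


def ticks_per_packet(rate_hz: int, ptime_ms: int) -> int:
    """Timestamp increment contributed by one voice packet."""
    return rate_hz * ptime_ms // 1000


# Hypothetical SDP fragment for an AMR-WB call.
EXAMPLE_SDP = """v=0
m=audio 49120 RTP/AVP 97
a=rtpmap:97 AMR-WB/16000/1
a=ptime:20
"""
```

With `EXAMPLE_SDP`, the rate is 16000 Hz and ptime is 20 ms, which gives the increment of 320 per voice packet used in the description.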

Since the timestamp difference between the RTP packet 600 and the SID packet 500 was calculated as 1,600, 1,600 / 320 = 5 and 5 × 20 ms = 100 ms. The timestamp unit 236 thus calculates the duration of the SID packet 500 as 100 ms. The combined duration of the RTP packet 600 and the SID packet 500 is 20 + 100 = 120 ms. Accordingly, in the sequence of the (n-1)th RTP packet, the nth RTP packet 600, the (n+1)th SID packet 500, and the (n+2)th RTP packet, if the timestamp unit 236 has set time t1 as the timestamp information for the text of the RTP packet 600, it can calculate the occurrence time t1 + 120 (ms) and set it as the timestamp information for the text of the (n+2)th RTP packet.
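The arithmetic above can be reproduced with a few lines of code. This sketch follows the description's calculation step by step; the function names are hypothetical, and the modulo handling of 32-bit timestamp wraparound is an addition not mentioned in the text.

```python
def sid_duration_ms(prev_ts: int, sid_ts: int, rate_hz: int, ptime_ms: int) -> int:
    """Duration attributed to an SID packet, following the description's arithmetic."""
    ticks_per_packet = rate_hz * ptime_ms // 1000      # 16000 Hz, 20 ms -> 320
    diff = (sid_ts - prev_ts) % (1 << 32)              # RTP timestamps are 32-bit and wrap
    return (diff // ticks_per_packet) * ptime_ms       # 1600 // 320 = 5 -> 5 * 20 ms


def next_voice_time_ms(t1_ms: int, ptime_ms: int, sid_ms: int) -> int:
    """Occurrence time of the voice packet that follows one voice packet and one SID packet."""
    return t1_ms + ptime_ms + sid_ms


sid_ms = sid_duration_ms(prev_ts=1000, sid_ts=2600, rate_hz=16000, ptime_ms=20)
offset_ms = next_voice_time_ms(0, 20, sid_ms)  # offset from t1
```

Here `sid_ms` evaluates to 100 and `offset_ms` to 120, matching the 100 ms and t1 + 120 figures in the description.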

FIG. 7 is an exemplary diagram in which the text management server 130 of FIG. 1 mixes outgoing voice data and received voice data into one integrated voice stream according to another embodiment of the present invention.

In another embodiment of the present invention, when the service subscriber, after receiving and checking the call text from the text management server 130, requests the voice content of the call, the mixing unit 238 of FIG. 2 mixes the outgoing call voice and the incoming call voice into one integrated voice so that a voice stream of the call can be provided to the service subscriber.

Here, assume that the RTP packet 1 stream 710 of the received voice and the RTP packet 2 stream 730 of the outgoing voice are to be mixed into one integrated voice stream.

If the received and outgoing RTP packets 400 and 410 are mixed with identical start positions, as in FIG. 4B, crosstalk occurs in which the outgoing voice and the received voice overlap. To prevent crosstalk and produce a mixed stream as close as possible to the actual call, the mixing unit 238 delays the later-arriving RTP packet 2 stream 730 of the outgoing voice by the duration of the three packets of RTP packet 1 of the received voice that arrived first, i.e., 3 × 20 ms = 60 ms.
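One way to picture the delayed mix is the sketch below. It assumes both streams have already been decoded to lists of 16-bit PCM samples at the same sampling rate, which the description does not specify; the function name `mix_with_delay` and its parameters are hypothetical.

```python
def mix_with_delay(first, late, delay_packets, ptime_ms=20, rate_hz=16000):
    """Mix two PCM sample lists, starting `late` after `delay_packets` packet durations."""
    # Delay expressed in samples: e.g. 3 packets * 20 ms * 16 samples/ms = 960.
    delay_samples = delay_packets * ptime_ms * rate_hz // 1000
    length = max(len(first), delay_samples + len(late))
    mixed = [0] * length
    for i, sample in enumerate(first):
        mixed[i] += sample
    for i, sample in enumerate(late):
        mixed[delay_samples + i] += sample
    # Clamp sums to the signed 16-bit range so overlapping speech does not overflow.
    return [max(-32768, min(32767, s)) for s in mixed]
```

For the example in the text, `delay_packets=3` with a 20 ms ptime shifts the later stream by 60 ms before the two are summed.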

FIGS. 8A and 8B are exemplary diagrams of delaying the start position of the later-arriving outgoing or received voice data in the mixing of FIG. 7.

Referring to FIG. 8A, the actual call between A and B is shown along the time axis. The call begins with A's voice and continues as a conversation between A and B.

Referring to FIG. 8B, if, without applying the technique of the present invention, the Tx and Rx RTP packets are given the same start position or are placed according to packet arrival time, the voices of A and B may overlap in the regions 810 and 830 where the crosstalk occurs.

To exclude crosstalk, in the present invention the mixing unit 238 places the stream of B's RTP packets at a start position delayed by the duration of A's RTP packets that arrived first, and then mixes the streams. The call voices of A and B are thus mixed into one integrated call stream free of crosstalk.

FIG. 9 is a schematic flowchart of a method for managing the text of a call voice according to an embodiment of the present invention.

When a call is initiated between the caller call terminal 110 and the receiver call terminal 170, the text management server 130 receives the outgoing and incoming call data (S910). Here, the text management server 130 receives the SIP messages of the signaling data through the TAS 140. The text management server 130 also receives the RTP packets of the voice data.

When, during the call, the service subscriber presses a key button on the call terminal to request the text conversion service of the present invention, the text management server 130 receives the DTMF signal of the corresponding service key from the subscriber's call terminal (S920).

Upon receiving the DTMF signal, the text management server 130 classifies the RTP packets of the outgoing voice data and the RTP packets of the received voice data (S930). As in FIG. 3, the packets are classified by referring to the sender-side IP and port, the receiver-side IP and port, and the SSRC value. As in FIG. 4B, they are classified into receiver-side RTP packet 1 (400) and sender-side RTP packet 2 (410), respectively.
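A simplified sketch of the classification step is shown below. It assumes each captured packet exposes its source and destination address plus its SSRC, and that the caller and receiver media endpoints are known from the SIP signaling; the `RtpPacket` type and `classify` function are hypothetical, and a real deployment with intermediate media nodes would need different address matching.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RtpPacket:
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    ssrc: int
    seq: int


def classify(packets, caller, callee):
    """Group packets into streams keyed by (direction, SSRC).

    `caller` and `callee` are (ip, port) tuples taken from the signaling.
    """
    streams = {}
    for p in packets:
        src = (p.src_ip, p.src_port)
        dst = (p.dst_ip, p.dst_port)
        if src == caller and dst == callee:
            direction = "outgoing"
        elif src == callee and dst == caller:
            direction = "received"
        else:
            continue  # not part of this call
        streams.setdefault((direction, p.ssrc), []).append(p)
    return streams
```

Keying on the SSRC as well as the direction keeps packets from different synchronization sources apart even when they share an address pair.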

Once the RTP packets of the caller and the receiver are classified, the text management server 130 removes the silent SID packets 500 from each set of classified RTP packets, leaving the RTP packets that carry voice (S940).

After the silent SID packets 500 are removed, the text management server 130 converts the RTP packets that carry voice into text using the STT engine (S950).

Here, the text management server 130 sets the reference time to the time at which the call first started and calculates the start time of each packet for the RTP packets 400 and 410 arranged in order of voice occurrence as in FIG. 4B (S951). As described above with reference to FIG. 6, the duration of an SID packet 500 is calculated with reference to the codec information.

After text conversion and time calculation are complete, the text management server 130 pairs each converted text with its calculated voice occurrence time as one set, and generates and stores a call text containing the paired sets so that it can be displayed in a preset screen UI (S960).
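The pairing of text and occurrence time can be illustrated with a small data structure. This is a sketch under assumptions: the `Utterance` type, the speaker labels, and the rendered line format are hypothetical and stand in for whatever the preset screen UI actually requires.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Utterance:
    start_ms: int   # voice occurrence time relative to the start of the call
    speaker: str    # "caller" or "receiver"
    text: str


def build_call_text(outgoing, received):
    """Merge (start_ms, text) pairs from both directions into chronological order."""
    sets = [Utterance(t, "caller", s) for t, s in outgoing]
    sets += [Utterance(t, "receiver", s) for t, s in received]
    return sorted(sets, key=lambda u: u.start_ms)


def render(call_text):
    """Format the call text one utterance per line for display."""
    return "\n".join(
        f"[{u.start_ms / 1000:.2f}s] {u.speaker}: {u.text}" for u in call_text
    )
```

Because each set keeps its speaker label, the outgoing and received text stay distinguishable after they are interleaved by time.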

Thereafter, when the service subscriber requests the call text, the text management server 130 looks up and provides the call text corresponding to the request (S970). That is, at the service subscriber's request, the text management server 130 can provide a lookup and search service for the call text converted from the call content.

Although the present invention has been described with reference to limited embodiments and drawings, it is not limited thereto, and those of ordinary skill in the art to which the present invention pertains may of course make various modifications and variations within the technical spirit of the present invention and the scope of equivalents of the claims set forth below.

100: system 110: caller call terminal
120: text management server 140: TAS
150: CSCF 170: receiver call terminal

Claims

An apparatus for converting call voice of a service subscriber into text and managing the converted text, the apparatus comprising:
a receiver configured to receive voice data of a call in which the service subscriber is a caller or a receiver;
a classification unit configured to classify the received voice data into outgoing voice data and received voice data;
a conversion unit configured to convert the classified voice data into outgoing text data and received text data, respectively;
a call text unit configured to generate a call text by distinguishing the converted outgoing text of the caller from the received text of the receiver and arranging them in chronological order; and
a providing unit configured to look up the generated call text and provide it to a terminal of the service subscriber at the request of the service subscriber.

The apparatus of claim 1,
further comprising a DTMF unit configured to receive, from a call terminal of the service subscriber during a call, a dual tone multiple frequency (DTMF) signal requesting text conversion,
wherein the classification unit performs the classification upon receipt of the DTMF signal.

The apparatus of claim 1,
wherein the classification unit
classifies the outgoing and received voice data by referring to the sender-side IP and port and the receiver-side IP and port of a Session Initiation Protocol (SIP) message, and to the synchronization source (SSRC) ID of a Real-time Transport Protocol (RTP) packet.

The apparatus of claim 1,
further comprising a silence remover configured to remove silence indicator (SID) packets corresponding to silence from the classified RTP packets of the outgoing voice data and the received voice data so that only voice packets remain,
wherein the conversion unit converts the remaining voice packets into text using a speech-to-text (STT) engine.

The apparatus of claim 1,
further comprising a timestamp unit configured to calculate a voice generation time as timestamp information of the converted text, using the duration of an RTP packet given by the ptime value of the Session Description Protocol (SDP) of the SIP message.

The apparatus of claim 5,
wherein the timestamp unit
identifies codec information of the RTP packet, determines from the sampling rate of the identified codec how much the timestamp increases per second, calculates the duration of the SID packet from the amount by which the timestamp value of the SID packet has increased relative to the previous packet, and calculates the voice generation time using the duration of the RTP packet and the calculated duration of the SID packet.

The apparatus of claim 1,
wherein the call text unit
stores the call text comprising an outgoing call number, an incoming call number, a total call time, at least one set of outgoing text data and voice generation time, and at least one set of received text data and voice generation time.

The apparatus of claim 1,
further comprising a mixing unit configured to set the start position of whichever of the outgoing voice data and the received voice data arrived later by delaying it by the duration of the RTP packets of the voice data that arrived first, and to mix the voice streams of the outgoing voice data and the received voice data into one integrated voice data stream using the set start position,
wherein the providing unit provides the integrated voice data.

The apparatus of claim 1,
wherein the providing unit
provides the call text to the terminal of the service subscriber using at least one of a text message, an email, a social network service, and a web page.

A method in which an apparatus converts call voice of a service subscriber into text and manages the converted text, the method comprising:
receiving voice data of a call in which the service subscriber is a caller or a receiver;
classifying the received voice data into outgoing voice data and received voice data;
converting the classified voice data into outgoing text data and received text data, respectively;
generating a call text by distinguishing the converted outgoing text of the caller from the received text of the receiver and arranging them in chronological order; and
looking up the generated call text and providing it to a terminal of the service subscriber at the request of the service subscriber.

The method of claim 10,
further comprising, prior to the classifying step,
receiving, from a call terminal of the service subscriber during a call, a dual tone multiple frequency (DTMF) signal requesting text conversion,
wherein the classifying step is performed upon receipt of the DTMF signal.

The method of claim 10,
wherein the classifying step
classifies the outgoing and received voice data by referring to the sender-side IP and port and the receiver-side IP and port of a Session Initiation Protocol (SIP) message, and to the synchronization source (SSRC) ID of a Real-time Transport Protocol (RTP) packet.

The method of claim 10,
further comprising, after the classifying step,
removing silence indicator (SID) packets corresponding to silence from the classified RTP packets of the outgoing voice data and the received voice data so that only voice packets remain,
wherein the converting step converts the remaining voice packets into text using a speech-to-text (STT) engine.

The method of claim 10,
further comprising, after the converting step,
calculating a voice generation time as timestamp information of the converted text, using the duration of an RTP packet given by the ptime value of the Session Description Protocol (SDP) of the SIP message.

The method of claim 10,
wherein the calculating step comprises
identifying codec information of the RTP packet, determining from the sampling rate of the identified codec how much the timestamp increases per second, calculating the duration of the SID packet from the amount by which the timestamp value of the SID packet has increased relative to the previous packet, and calculating the voice generation time using the duration of the RTP packet and the calculated duration of the SID packet.

The method of claim 10,
wherein the generating step comprises
storing the call text comprising an outgoing call number, an incoming call number, a total call time, at least one set of outgoing text data and a timestamp, and at least one set of received text data and a timestamp.

The method of claim 10,
further comprising, prior to the providing step,
setting the start position of whichever of the outgoing voice data and the received voice data arrived later by delaying it by the duration of the RTP packets of the voice data that arrived first, and mixing the voice streams of the outgoing voice data and the received voice data into one integrated voice data stream using the set start position,
wherein the providing step provides the integrated voice data.

The method of claim 10,
wherein the providing step
provides the call text to the terminal of the service subscriber using at least one of a text message, an email, a social network service, and a web page.