CN111797214A - Question screening method, device, computer equipment and medium based on FAQ database - Google Patents

Info

Publication number
CN111797214A
Authority
CN
China
Prior art keywords
question
similarity
word
questions
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010591151.4A
Other languages
Chinese (zh)
Inventor
张山
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
OneConnect Smart Technology Co Ltd
OneConnect Financial Technology Co Ltd Shanghai
Original Assignee
OneConnect Financial Technology Co Ltd Shanghai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by OneConnect Financial Technology Co Ltd Shanghai
Priority to CN202010591151.4A
Publication of CN111797214A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/338 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/194 Calculation of difference between files
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Human Computer Interaction (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the present application belong to the field of artificial intelligence and relate to a question screening method, device, computer equipment and medium based on a FAQ database. The method includes analyzing a question sentence input by a user, performing word segmentation on the question sentence, and determining the weight of each segmented word; querying candidate questions corresponding to the question sentence, and scoring and screening the candidate questions retrieved from the FAQ database according to the weight of each word; then calculating the similarity value between the question sentence and each query result according to a similarity algorithm model, and filtering out the query results whose similarity value is not within a preset range; and finally applying a classification algorithm to the filtered query results to determine a preset number of questions with the highest similarity to the question sentence. In addition, the application also involves blockchain technology: the question sentence input by the user can be stored in a blockchain. The application can improve the accuracy of question screening and recommend high-quality questions to users.

Figure 202010591151

Description

Question screening method, device, computer equipment and medium based on FAQ database

Technical field

The present application relates to the field of artificial intelligence, and in particular to a question screening method, device, computer equipment and medium based on a FAQ database.

Background

In a Frequently Asked Question (FAQ) dialogue, the FAQ system is pre-built with a question-and-answer library containing a large number of question-and-answer pairs. When a question from a user is received, the FAQ system searches the library for a question that matches the user's question, and returns the matched question and its answer to the user.

At present, question answering systems in the industry mostly rely on direct matching or direct word segmentation to look up FAQ standard questions. This approach is imprecise when searching for standard questions, similar questions and related questions, and it struggles to find semantically close questions, so the retrieved questions match poorly with the answers the user actually wants. In addition, a finite FAQ configuration cannot cover the unlimited ways in which users may describe a problem, which makes it difficult to recommend high-quality questions to customers. Traditional FAQ systems therefore need to maintain a huge knowledge base and still suffer from low precision in question screening.

Summary of the invention

The purpose of the embodiments of the present application is to propose a question screening method, device, computer equipment and medium based on a FAQ database, whose main aim is to quickly and accurately screen out, for the user, questions that match the question asked.

To solve the above technical problem, an embodiment of the present application provides a question screening method based on a FAQ database, which adopts the following technical solution:

analyzing the question sentence input by the user, and performing word segmentation on the question sentence;

counting the word frequency of each segmented word in the FAQ database, determining the weight of each word, and storing the weight of each word and the word segmentation result in the FAQ database;

querying the FAQ database for candidate questions corresponding to the question sentence, scoring the candidate questions according to the weights of the words, and selecting the candidate questions whose score is greater than or equal to a preset score as the query results;

calculating the similarity value between the question sentence and each query result according to a similarity algorithm model, and filtering out the query results whose similarity value is not within a preset range;

applying a classification algorithm to the filtered query results, and determining a preset number of questions with the highest similarity to the input question sentence.

To solve the above technical problem, an embodiment of the present application further provides a question screening device based on a FAQ database, which adopts the following technical solution:

a word segmentation module, configured to analyze the question sentence input by the user and perform word segmentation on the question sentence;

a processing module, configured to count the word frequency of each segmented word in the FAQ database, determine the weight of each word, and store the weight of each word and the word segmentation result in the FAQ database;

a query scoring module, configured to query the FAQ database for candidate questions corresponding to the question sentence, and score the candidate questions according to the weight of each word;

a screening module, configured to select the questions whose score is greater than or equal to the preset score as the query results;

a similarity calculation module, configured to calculate the similarity value between the question sentence and each query result according to a similarity algorithm model, and filter out the query results whose similarity value is not within a preset range;

a classification calculation module, configured to apply a classification algorithm to the filtered query results; and

a determination module, configured to determine a preset number of questions with the highest similarity to the input question sentence.

To solve the above technical problem, an embodiment of the present application further provides a computer device, which adopts the following technical solution:

The computer device includes a memory and a processor. The memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the FAQ database-based question screening method described above.

To solve the above technical problem, an embodiment of the present application further provides a computer-readable storage medium, which adopts the following technical solution:

The computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the FAQ database-based question screening method described above.

Compared with the prior art, the embodiments of the present application mainly have the following beneficial effects:

The present invention analyzes the question sentence input by the user and performs word segmentation on it; counts the word frequency of each segmented word in the FAQ database, determines the weight of each word, and stores the weights and the word segmentation result in the FAQ database. It then queries the FAQ database for candidate questions corresponding to the question sentence, scores them according to the word weights, and selects the candidates whose score is greater than or equal to a preset score as the query results. This preliminary screening yields relatively accurate query results. Next, the similarity value between the question sentence and each query result is calculated according to a similarity algorithm model, and the query results whose similarity value is not within a preset range are filtered out. This further similarity analysis of the preliminary results finds questions that are genuinely similar to the question input by the user. Finally, a classification algorithm is applied to the filtered query results to determine a preset number of questions with the highest similarity to the input question sentence, which further improves the accuracy of the screened questions and recommends high-quality questions to the user.

Description of drawings

To illustrate the solutions in the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; those of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is an exemplary system architecture diagram to which the present application can be applied;

FIG. 2 is a flowchart of an embodiment of the FAQ database-based question screening method according to the present application;

FIG. 3 is a flowchart of a specific implementation of step S203 in FIG. 2;

FIG. 4 is a flowchart of a specific implementation of step S204 in FIG. 2;

FIG. 5 is a flowchart of a specific implementation of step S205 in FIG. 2;

FIG. 6 is a schematic structural diagram of an embodiment of the FAQ database-based question screening device according to the present application;

FIG. 7 is a schematic structural diagram of an embodiment of the computer device according to the present application.

Detailed description

Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used in the specification are for the purpose of describing specific embodiments only and are not intended to limit the application. The terms "comprising" and "having", and any variations thereof, in the specification, the claims and the above description of the drawings are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the specification, the claims or the above drawings are used to distinguish different objects, not to describe a specific order.

Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of the phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.

To help those skilled in the art better understand the solutions of the present application, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings.

To address the low precision of FAQ systems in question screening, the present application provides a question screening method based on a FAQ database. The method involves artificial intelligence semantic analysis and can be applied in the system architecture 100 shown in FIG. 1. The system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired links, wireless communication links or fiber optic cables.

The user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104, for example to receive or send messages. Various communication client applications may be installed on the terminal devices 101, 102, 103, such as web browser applications, shopping applications, search applications, instant messaging tools, email clients and social platform software.

The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers and desktop computers.

The server 105 may be a server that provides various services, such as a background server that supports the pages displayed on the terminal devices 101, 102, 103.

It should be noted that the FAQ database-based question screening method provided by the embodiments of the present application is generally executed by a server or a terminal device; accordingly, the FAQ database-based question screening device is generally arranged in the server or the terminal device.

It should be understood that the numbers of terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers, as the implementation requires.

Referring to FIG. 2, FIG. 2 shows a flowchart of an embodiment of the FAQ database-based question screening method provided by an embodiment of the present invention. The method is described using its application on the server side of FIG. 1 as an example. The FAQ database-based question screening method includes the following steps:

Step S201, analyze the question sentence input by the user, and perform word segmentation on the question sentence.

It should be noted that the user may input the question sentence as audio or as text; this is not limited here. When the information input by the user is an audio file, speech recognition is performed on the audio file, the recognition result is converted into text data, and word segmentation is performed on the text data to obtain the corresponding word segmentation result.

In this embodiment, the question sentence input by the user is received and analyzed. Specifically, semantic analysis can be performed on the question sentence, including word segmentation, part-of-speech analysis, named entity recognition and stop-word removal. Based on the semantics, a word segmentation controller then segments the question sentence.
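Purely as an illustration of the segmentation and stop-word removal mentioned above, the following is a minimal sketch and not the word segmentation controller of the application. The function name `segment` and the stop-word list are hypothetical, and the regular-expression split only works for space-delimited languages; Chinese input would need a dedicated word segmenter.

```python
import re

# Hypothetical stop-word list; a real system would load a language-specific one.
STOP_WORDS = {"a", "an", "the", "is", "do", "i", "how", "to", "my", "of"}

def segment(question):
    """Lowercase the text, split it on non-word characters, drop stop words."""
    tokens = re.findall(r"\w+", question.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokens = segment("How do I reset my password?")
```

Part-of-speech analysis and named entity recognition are left out of this sketch because they normally rely on a trained language model.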

It should be emphasized that, to further ensure the privacy and security of the question sentence input by the user, the question sentence may also be stored in a node of a blockchain.

The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked to one another by cryptographic methods. Each data block contains the information of a batch of network transactions, which is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer and an application service layer.
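The idea of data blocks linked by cryptographic methods can be illustrated with a small hash chain. This is only a sketch under simplifying assumptions: it runs in a single process, omits distribution and consensus entirely, and the block fields (`question`, `prev_hash`, `hash`) and function names are hypothetical rather than taken from the application.

```python
import hashlib
import json

GENESIS_PREV_HASH = "0" * 64  # placeholder predecessor for the first block

def _digest(question, prev_hash):
    """SHA-256 over a canonical JSON encoding of the block payload."""
    payload = json.dumps({"question": question, "prev_hash": prev_hash}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

def make_block(question, prev_hash):
    """Create a block that stores one question and points at its predecessor."""
    return {"question": question, "prev_hash": prev_hash,
            "hash": _digest(question, prev_hash)}

def verify_chain(chain):
    """Check that every block's hash is valid and links to the previous block."""
    prev_hash = GENESIS_PREV_HASH
    for block in chain:
        if block["prev_hash"] != prev_hash:
            return False
        if block["hash"] != _digest(block["question"], block["prev_hash"]):
            return False
        prev_hash = block["hash"]
    return True

first = make_block("How do I reset my password?", GENESIS_PREV_HASH)
second = make_block("How do I close my account?", first["hash"])
chain = [first, second]
```

Because each hash covers the previous block's hash, altering a stored question invalidates that block and every block after it, which is what `verify_chain` detects.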

Step S202, count the word frequency of each segmented word in the FAQ database, determine the weight of each word, and store the weight of each word and the word segmentation result in the FAQ database.

It should be noted that the FAQ database is established in advance and a search engine is built on top of it. The query, analysis and exploration functions of the search engine are used to search the FAQ database with the question sentence input by the user, and the search returns the query results corresponding to that question sentence. The search engines in this application include but are not limited to Elasticsearch.

In this embodiment, the question-and-answer history data of each user may be collected periodically and used to update the FAQ database. A user's question-and-answer history data may include the questions the user has answered, asked, browsed and queried. Web crawlers can also be used to crawl question-and-answer pairs matching the questions raised by users from search engines such as Google, Baidu and Yahoo, in order to update the FAQ database.

The weight of each word is determined from statistical distribution features: the distribution of each word in the FAQ database is examined, and the weight of the word is set according to distribution features such as word frequency. Specifically, the segmented words are obtained, the word frequency of each word in the FAQ database is counted, and the weight of each word is: word weight = word frequency * 100.
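The weight formula can be sketched as follows. The text does not say whether "word frequency" is a raw count or a relative frequency; this sketch assumes the relative frequency of the word among all tokens in the FAQ database, and the function name `word_weights` and the toy corpus are hypothetical.

```python
from collections import Counter

def word_weights(query_tokens, faq_questions):
    """Weight = (relative frequency of the word in the FAQ corpus) * 100.

    faq_questions is a list of already segmented questions (lists of tokens).
    """
    counts = Counter(tok for question in faq_questions for tok in question)
    total = sum(counts.values())
    if total == 0:
        return {tok: 0.0 for tok in query_tokens}
    return {tok: counts[tok] / total * 100 for tok in query_tokens}

faq = [["reset", "password"], ["change", "password"], ["close", "account"]]
weights = word_weights(["reset", "password"], faq)
# "password" occurs 2 times out of 6 tokens, "reset" 1 time out of 6.
```

Note that a pure frequency weight favors common words; whether the application compensates for this elsewhere is not stated in the text.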

The weights of the words and the word segmentation result are stored in the database so that the system can be trained as users input questions: the FAQ database is continuously updated, and so are the word weights. In this embodiment, the word segmentation result can be written to the FAQ database through a trained model.

Step S203, query the FAQ database for candidate questions corresponding to the question sentence, score the candidate questions according to the weight of each word, and select the candidate questions whose score is greater than or equal to a preset score as the query results.

In this embodiment, the query function of the search engine built on the FAQ database is used to query the FAQ database for candidate questions corresponding to the question sentence input by the user. The candidate questions are then scored using functions of the search engine such as analysis, and the candidates whose score is greater than or equal to the preset score are selected as the query results.

In this embodiment, the selected query results may further be sorted by score.

In this embodiment, selecting the candidate questions whose score is greater than or equal to the preset score is a preliminary screening of the retrieved candidates, which yields relatively accurate query results.
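The preliminary screening can be sketched without a search engine. In the application the scoring is performed by the search engine; the stand-in below simply sums the weights of the query words that a candidate contains, which is an assumption, as are the function name `score_candidates` and the sample data.

```python
def score_candidates(weights, candidates, min_score):
    """Score candidates by the summed weights of the query words they contain.

    weights:    dict mapping each query word to its weight
    candidates: list of (question_text, tokens) pairs
    Returns the (question_text, score) pairs with score >= min_score,
    sorted from highest to lowest score.
    """
    kept = []
    for text, tokens in candidates:
        token_set = set(tokens)
        score = sum(w for word, w in weights.items() if word in token_set)
        if score >= min_score:
            kept.append((text, score))
    return sorted(kept, key=lambda pair: pair[1], reverse=True)

weights = {"reset": 20.0, "password": 30.0}
candidates = [
    ("How to change a password", ["change", "password"]),
    ("How to reset a password", ["reset", "password"]),
    ("How to close an account", ["close", "account"]),
]
results = score_candidates(weights, candidates, min_score=25)
```

With a preset score of 25, the account question is dropped and the two password questions are returned in descending score order.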

Step S204, calculate the similarity value between the question sentence and each query result according to the similarity algorithm model, and filter out the query results whose similarity value is not within the preset range.

In this embodiment, several similarity algorithm models are used to calculate the similarity between the query results and the question input by the user: the Jaccard similarity algorithm, the BM25 algorithm, the cosine similarity algorithm and the edit distance (Edit_distance). The Jaccard similarity algorithm calculates the similarity between samples; the larger the Jaccard coefficient, the more similar the samples. The BM25 algorithm is based on a probabilistic retrieval model and evaluates the relevance between search terms and documents. Cosine similarity measures the similarity of two vectors by the cosine of the angle between them. In information retrieval, each term is assigned its own dimension and a document is represented by a vector whose value in each dimension corresponds to the frequency of that term in the document, so cosine similarity indicates how similar two documents are in subject matter. The edit distance is the minimum number of edit operations needed to convert one string into another. It describes how close two strings are: the larger the distance, the more they differ.
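For reference, the four measures can be sketched in plain Python as below. These are textbook formulations, not necessarily the variants used in the application: the BM25 function uses the common parameters k1 = 1.2 and b = 0.75 and a Lucene-style IDF, all of which are assumptions, and inputs are assumed to be token lists (or strings for the edit distance).

```python
import math
from collections import Counter

def jaccard(a, b):
    """Size of the token-set intersection divided by the size of the union."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine of the angle between the term-frequency vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def edit_distance(s, t):
    """Levenshtein distance: minimum insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        curr = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def bm25(query, doc, corpus, k1=1.2, b=0.75):
    """BM25 relevance of doc to query, given a corpus of tokenized documents."""
    n_docs = len(corpus)
    if n_docs == 0:
        return 0.0
    avg_len = (sum(len(d) for d in corpus) / n_docs) or 1.0
    tf = Counter(doc)
    score = 0.0
    for term in set(query):
        df = sum(1 for d in corpus if term in d)
        idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1)
        denom = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
        score += idf * tf[term] * (k1 + 1) / denom
    return score
```

Jaccard and cosine return values in [0, 1], while the edit distance is an unbounded count and BM25 an unbounded score, so the latter two need rescaling before they can be mixed with the others.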

It should be noted that the four similarity models are combined with weights that differ from scenario to scenario. For example, to calculate the similarity Sim(A, B) between a first question A and a second question B in a given scenario, assume that in this scenario the weight of the Jaccard similarity is 0.3, the weight of BM25 is 0.2, the weight of the cosine similarity is 0.2 and the weight of the edit distance similarity is 0.3. Then Sim(A, B) = Jaccard * 0.3 + BM25 * 0.2 + cosine * 0.2 + Edit_distance * 0.3.
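The weighted combination and the range filter can be sketched as follows. The text does not say how the individual scores are scaled; since a raw edit distance grows as strings diverge and BM25 is unbounded, this sketch assumes every component has already been normalized to a similarity in [0, 1]. The normalization in `edit_similarity`, the function names and the sample scores are all hypothetical.

```python
# Weights taken from the example scenario in the text.
SCENARIO_WEIGHTS = {"jaccard": 0.3, "bm25": 0.2, "cosine": 0.2, "edit": 0.3}

def edit_similarity(distance, len_a, len_b):
    """Map an edit distance to [0, 1]; 1.0 means identical strings (assumed scaling)."""
    longest = max(len_a, len_b)
    return 1.0 if longest == 0 else 1.0 - distance / longest

def combined_similarity(scores, weights=SCENARIO_WEIGHTS):
    """Weighted sum of the per-model similarity scores."""
    return sum(weights[name] * scores[name] for name in weights)

def keep_in_range(results, low, high):
    """Keep only the (question, similarity) pairs whose similarity is in [low, high]."""
    return [(q, s) for q, s in results if low <= s <= high]

scores = {"jaccard": 0.5, "bm25": 0.4, "cosine": 0.6, "edit": 0.7}
sim = combined_similarity(scores)  # 0.15 + 0.08 + 0.12 + 0.21
```

Because the example weights sum to 1, the combined value stays in [0, 1] whenever each component does, which makes a fixed preset range meaningful across scenarios.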

Further, in some optional implementations of this embodiment, after the candidate questions whose score is greater than or equal to the preset score have been selected and sorted, only a preset number of top-ranked query results may be passed to the similarity calculation. This improves screening efficiency and ensures that the user's question is answered quickly.

In this embodiment, similarity is calculated for the query results of the preliminary screening, and the query results whose similarity value is not within the preset range are filtered out. This second screening of the query results further narrows them down to questions that are semantically similar to the question sentence input by the user.

Step S205: apply a classification algorithm to the filtered query results to determine a preset number of questions with the highest similarity to the question sentence.

Specifically, the classification algorithm is implemented on the filtered query results by combining two algorithms, FastText (a fast text classification algorithm) and textCNN (a convolutional neural network text classification algorithm), to obtain from the FAQ database a preset number of questions with the highest similarity to the question sentence entered by the user. It should be noted that in text classification tasks FastText can often achieve accuracy comparable to that of deep networks while training many orders of magnitude faster, and that textCNN applies a convolutional neural network to the text classification task, using kernels of multiple sizes to extract key information from a sentence.

In this embodiment, the classification algorithm is applied to the filtered query results to determine a preset number of questions with the highest similarity to the question sentence. This third screening further helps ensure that the user's question is answered effectively and accurately.

In this embodiment, the question sentence entered by the user is analyzed and segmented into words, and the weight of each resulting word is determined. Candidate questions corresponding to the question sentence are queried, and the candidate questions retrieved from the FAQ database are scored and screened according to the word weights. The similarity value between the question sentence and each screened query result is then calculated with the similarity algorithm models, and query results whose similarity values fall outside the preset range are filtered out. Finally, a classification algorithm is applied to the filtered query results to determine a preset number of questions with the highest similarity to the entered question sentence. Screening the retrieved questions several times in this way improves the precision of question screening and recommends high-quality questions to the user.

In some optional implementations of this embodiment, referring to FIG. 3, querying the FAQ database for candidate questions corresponding to the question sentence and scoring the candidate questions according to the weight of each word in step S203 specifically includes the following steps:

Step S301: extract the keywords of the question sentence according to the weight of each word.

Keywords are an important part of a question and help in understanding the intent behind it. In this embodiment, the question sentence entered by the user has already been segmented in step S202; based on the segmentation result and the weight of each word, words whose weight is greater than a preset weight value are determined to be the keywords of the question sentence.
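The keyword rule in step S301 amounts to a threshold test on word weights. The sketch below assumes the segmentation result is available as (word, weight) pairs; the function name, the weights, and the threshold value are hypothetical and chosen only for illustration.

```python
def extract_keywords(word_weights, min_weight):
    """Return the words whose weight is strictly greater than min_weight, in order."""
    return [word for word, weight in word_weights if weight > min_weight]


# Hypothetical weights for a segmented question; "的" stands in for a low-weight function word.
word_weights = [("深圳", 0.8), ("哪里", 0.6), ("最热闹", 0.9), ("的", 0.05)]
keywords = extract_keywords(word_weights, min_weight=0.5)
```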

Step S302: determine expansion words for the keywords.

In this embodiment, expansion words may be obtained from an associated-word list: words highly correlated with a keyword in the question sentence are looked up in the list, and the keyword is replaced with such a word. Expansion words may also be obtained from a knowledge graph of the FAQ database. The knowledge graph can be built in advance and used to generate expansion words automatically from the hypernyms and hyponyms or related synonymous attributes of a keyword. It should be noted that the knowledge graph can be trained continuously on data collected from user queries, so that the set of expansion words grows larger and more accurate. In this embodiment, deriving expansion words from keywords widens the search range and thereby improves the recall rate for user questions.

Step S303: generate candidate questions corresponding to the question sentence according to the keywords and the expansion words.

In this embodiment, the query, analysis, and exploration functions of a search engine are used to obtain from the FAQ database, by means of the keywords and expansion words, candidate questions corresponding to the question sentence entered by the user; the expansion words increase the number of candidate questions.

For example, suppose the question sentence entered by the user is "深圳哪里最热闹" (Where in Shenzhen is the liveliest?) and the extracted keywords are <深圳> (Shenzhen), <哪里> (where), and <最热闹> (liveliest). The candidate questions generated from the keywords and expansion words are as follows:

A. 深圳哪里最繁华 (Where in Shenzhen is the most prosperous?);

B. 深圳什么地方最繁华 (What place in Shenzhen is the most prosperous?).
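One way to picture how expansion words enlarge the set of queries is to substitute each keyword with its associated words and enumerate the combinations. The sketch below is an assumption-laden illustration: the `ASSOCIATED_WORDS` table and the `expand_queries` function are hypothetical, and a real system would pass the resulting keyword sets to a search engine rather than return them directly.

```python
from itertools import product

# Hypothetical associated-word list; a real one would be curated or derived
# from a knowledge graph of the FAQ database.
ASSOCIATED_WORDS = {
    "哪里": ["什么地方"],
    "最热闹": ["最繁华"],
}


def expand_queries(keywords, associated=ASSOCIATED_WORDS):
    """Return every keyword combination in which each keyword is either kept
    or replaced by one of its expansion words."""
    options = [[kw] + associated.get(kw, []) for kw in keywords]
    return [list(combo) for combo in product(*options)]


queries = expand_queries(["深圳", "哪里", "最热闹"])
```

With this table, the keyword sets behind candidate questions A and B both appear among the generated combinations, alongside the original keywords.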

Step S304: score the candidate questions according to the weights of the words.

In this embodiment, a candidate question may be scored by segmenting the generated candidate question and summing the weights of the resulting words, so that the score is based on the total weight of the candidate question; alternatively, the weights of the keywords and/or expansion words in the candidate question may be summed and the score based on that result. It should be noted that the present application is not limited to summing weights as the way of scoring similar questions.
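The first scoring option, summing the weights of the words in a segmented candidate question, can be sketched as below. The weight table and the treatment of unknown words (a default weight of zero) are assumptions made for the example.

```python
def score_candidate(candidate_words, word_weights, default_weight=0.0):
    """Sum the weights of the words in a segmented candidate question.

    Words missing from word_weights contribute default_weight (an assumption).
    """
    return sum(word_weights.get(word, default_weight) for word in candidate_words)


# Hypothetical word weights.
weights = {"深圳": 0.8, "哪里": 0.6, "最繁华": 0.7}
score_a = score_candidate(["深圳", "哪里", "最繁华"], weights)      # all three words weighted
score_b = score_candidate(["深圳", "什么地方", "最繁华"], weights)  # "什么地方" has no weight
```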

In this embodiment, the keywords of the question sentence entered by the user are extracted, expansion words for the keywords are determined, and a query is performed with the keywords and expansion words to generate candidate questions corresponding to the question sentence. This widens the search range and thereby improves the recall rate for user questions.

In some optional implementations, referring to FIG. 4, step S204 specifically includes the following steps:

Step S401: separately calculate the Jaccard similarity, BM25 similarity, cosine similarity, and edit distance similarity between the question sentence and the query result.

To describe the basic principle of the embodiments of the present invention more conveniently, let T1 and T2 denote the question sentence entered by the user and a query result, respectively. The similarity value S1 between the question sentence T1 and the query result T2 is calculated with the Jaccard similarity algorithm as follows:

$$S_1 = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$

It should be noted that before the Jaccard similarity algorithm is used to calculate the similarity between two texts, the texts must first be segmented into words; the segmentation is the same as in step S201.

For example, the texts of the question sentence T1 and the query result T2 are:

T1: 深圳哪里最热闹 (Where in Shenzhen is the liveliest?);

T2: 深圳哪里最繁华 (Where in Shenzhen is the most prosperous?);

The word segmentation results are:

T1 = [深圳, 哪里, 最热闹] ([Shenzhen, where, liveliest]);

T2 = [深圳, 哪里, 最繁华] ([Shenzhen, where, most prosperous]);

then

$$S_1 = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|} = \frac{2}{4} = 0.5$$
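The Jaccard calculation for this example can be checked with a short sketch. It treats each segmented text as a set of words, which is the standard definition of the Jaccard coefficient; the function name and the handling of two empty inputs are choices made for this illustration.

```python
def jaccard_similarity(words_a, words_b):
    """Size of the intersection divided by the size of the union of two word sets."""
    set_a, set_b = set(words_a), set(words_b)
    union = set_a | set_b
    if not union:
        return 0.0  # both texts empty: defined as 0.0 here by assumption
    return len(set_a & set_b) / len(union)


t1 = ["深圳", "哪里", "最热闹"]
t2 = ["深圳", "哪里", "最繁华"]
s1 = jaccard_similarity(t1, t2)  # 2 shared words / 4 distinct words = 0.5
```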

The similarity value S2 between the question sentence T1 and the query result T2 is calculated with the BM25 algorithm, using the following formula:

Figure BDA0002555588520000111

where q_i denotes a morpheme obtained by parsing T1; for Chinese, for example, the word segmentation of T1 can serve as the morpheme analysis, with each word treated as a morpheme q_i; dl is the document length of the query result T2, and avgdl is the average length of all documents; k1 and b are tuning factors, with k1 = 2 and b = 0.75; f_i is the frequency of q_i in T2; and n is the number of documents containing q_i.
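Because the formula image is not reproduced here, the sketch below uses the commonly published form of BM25 with the symbols defined above (k1 = 2, b = 0.75, f_i, dl, avgdl, n). Two things are assumptions rather than statements about the application: the total document count (called `n_docs` here) that the inverse document frequency needs, and the "+1" smoothing inside the logarithm, a widely used variant that keeps each term's contribution non-negative.

```python
import math


def bm25_score(query_words, doc_words, corpus, k1=2.0, b=0.75):
    """Score doc_words against query_words; corpus is a list of segmented
    documents used to derive the document count, n, and avgdl."""
    n_docs = len(corpus)
    if n_docs == 0:
        return 0.0
    avgdl = sum(len(doc) for doc in corpus) / n_docs
    if avgdl == 0:
        return 0.0
    dl = len(doc_words)
    score = 0.0
    for q in query_words:
        n = sum(1 for doc in corpus if q in doc)   # documents containing q
        idf = math.log((n_docs - n + 0.5) / (n + 0.5) + 1.0)
        f = doc_words.count(q)                     # frequency of q in this document
        score += idf * f * (k1 + 1) / (f + k1 * (1 - b + b * dl / avgdl))
    return score


# Hypothetical three-document corpus.
corpus = [["深圳", "哪里", "最繁华"], ["北京", "哪里", "最繁华"], ["上海", "天气"]]
query = ["深圳", "哪里", "最热闹"]
s2 = bm25_score(query, corpus[0], corpus)
```

Note that a raw BM25 score is not confined to [0, 1]; how it is scaled before being combined with the other similarities is not specified in this passage.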

The similarity value S3 between the question sentence T1 and the query result T2 is calculated with the cosine similarity algorithm, using the following formula:

$$S_3 = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^2}\,\sqrt{\sum_{i=1}^{n} y_i^2}}$$

where x_i denotes the TF-IDF weight of the i-th segmented word in the question sentence T1 and y_i denotes the TF-IDF weight of the i-th segmented word in the query result T2. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining; TF stands for term frequency and IDF for inverse document frequency. After a sentence is segmented, TF-IDF is used to calculate the weights of the words in the sentence and to select words from it. The similarity measure based on the cosine of the angle between the resulting space vectors is not affected by the scale of the indicators; the cosine value falls in the interval [0, 1], and the larger the value, the smaller the difference.
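A minimal sketch of the cosine calculation follows. It assumes the two texts have already been converted to weight vectors over a shared, identically ordered vocabulary; the TF-IDF values shown are hypothetical, and returning 0.0 for a zero vector is a choice made for this example.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length weight vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # no direction to compare; defined as 0.0 by assumption
    return dot / (norm_x * norm_y)


# Shared vocabulary order: [深圳, 哪里, 最热闹, 最繁华]; weights are hypothetical.
x = [0.5, 0.3, 0.8, 0.0]  # question sentence T1
y = [0.5, 0.3, 0.0, 0.8]  # query result T2
s3 = cosine_similarity(x, y)
```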

The similarity value S4 between the question sentence T1 and the query result T2 is calculated with the edit distance similarity, using the following formula:

Figure BDA0002555588520000113

where editsim(w1i, w2j) is the similarity between the i-th string w1i of the question sentence T1 and the j-th string w2j of the query result T2; i, j, n, and m are all positive integers with 1≤i≤n and 1≤j≤m. editsim(w1i, w2j) is calculated as follows:

Figure BDA0002555588520000114

where editdis(w1i.length, w2j.length) is the edit distance between the i-th string w1i of the question sentence T1 and the j-th string w2j of the query result T2; w1i.length is the length of the i-th string of the question sentence T1; and w2j.length is the length of the j-th string of the query result T2.
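The two formula images are not reproduced here, so the sketch below shows only one common way to turn an edit distance into a similarity between two strings: compute the Levenshtein distance and normalize it by the longer string's length. That normalization, and the function names, are assumptions; the application's own editsim formula and the way S4 aggregates the pairwise values may differ.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        curr = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            curr.append(min(prev[j] + 1,          # delete ch_a
                            curr[j - 1] + 1,      # insert ch_b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def edit_similarity(a, b):
    """1 minus the edit distance divided by the longer length (assumed normalization)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


sim = edit_similarity("最热闹", "最繁华")  # two of three characters differ
```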

Step S402: compute a weighted sum of the calculated Jaccard similarity, BM25 similarity, cosine similarity, and edit distance similarity according to their respective weight values to obtain the similarity value between the question sentence and the query result.

In this embodiment, the similarity value Sim(T1, T2) between the question sentence and the query result is calculated with the following formula:

Sim(T1, T2) = α×S1 + β×S2 + γ×S3 + ω×S4;

where α is the weight of the Jaccard similarity, β is the weight of the BM25 similarity, γ is the weight of the cosine similarity, and ω is the weight of the edit distance similarity; α+β+γ+ω=1, α≥0, β≥0, γ≥0, and ω≥0.

It should be noted that the values of α, β, γ, and ω differ from scenario to scenario and can be set according to the actual application.
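The weighted sum and its constraints can be sketched as follows. The default weights are the example values given earlier in the description (0.3, 0.2, 0.2, 0.3); the input scores are hypothetical, and the sketch assumes the four inputs are already on comparable scales.

```python
def combined_similarity(s1, s2, s3, s4, alpha=0.3, beta=0.2, gamma=0.2, omega=0.3):
    """Weighted sum of Jaccard (s1), BM25 (s2), cosine (s3), and edit distance (s4)
    similarities; weights must be non-negative and sum to 1."""
    weights = (alpha, beta, gamma, omega)
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return alpha * s1 + beta * s2 + gamma * s3 + omega * s4


# Hypothetical component scores for one query result.
sim = combined_similarity(0.5, 0.4, 0.35, 0.33)
```

Validating the weights up front makes a misconfigured scenario fail loudly instead of silently skewing the ranking.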

Step S403: filter out the query results whose similarity values fall outside the preset range.

Specifically, the calculated similarity value is compared with a preset similarity threshold; query results whose similarity value is greater than or equal to the threshold are retained, and query results whose similarity value is less than the threshold are removed.

In this embodiment, the similarity value between the question sentence entered by the user and each query result is calculated, and query results whose similarity values fall outside the preset range are filtered out, so that the most similar questions can be reliably identified among a large number of similar questions.

In some optional implementations, referring to FIG. 5, step S205 specifically includes the following steps:

Step S501: obtain, through FastText, the word vectors of the question sentence entered by the user and of the query result, respectively.

The FastText model is an existing open-source word vector and text classification model in the field of natural language processing. It takes as input the words, represented as vectors, together with the N-Gram features corresponding to each word, and outputs the label corresponding to the text. Its output includes a by-product: the embedding vector corresponding to each word, which is the word vector in this embodiment. Here, an embedding vector is a vector that has undergone dimensionality reduction, and N-Gram features are word features used to evaluate the degree of difference between words. In this embodiment, the words represented as vectors and their corresponding N-Gram features are used as the input of the FastText model to obtain the word vector corresponding to each word.

Step S502: input the word vectors into the textCNN model; after the convolutional layer and pooling layer operations, construct a similarity matrix of the question sentence and the query result, and output the similarity value between the question sentence and the query result through the fully connected layer.

Specifically, the word vectors of the question sentence entered by the user and the word vectors of the query result are each concatenated to obtain the corresponding sentence vectors. The question sentence vector and the query result vector each pass through the convolutional layer, which captures the sequence information of the sentence, and through the pooling layer, which compresses the dimension of the sentence vector, and a similarity matrix of the user's question and the query result is constructed. The fully connected layer then converts the question sentence vector and the query result vector into a single vector, which is input into a logistic regression model to obtain the similarity value between the question sentence entered by the user and the query result.
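To make the convolution and pooling steps concrete, the sketch below slides one kernel over a sequence of word vectors, applies a ReLU, and keeps the maximum response (max-over-time pooling). It is a toy, pure-Python illustration with hand-picked numbers, not the trained textCNN of this embodiment; the function name, the ReLU activation, and the kernel values are assumptions.

```python
def conv_max_pool(word_vectors, kernel, bias=0.0):
    """Apply one convolution kernel spanning len(kernel) consecutive word
    vectors, then ReLU, then max pooling over all window positions."""
    height = len(kernel)
    if len(word_vectors) < height:
        raise ValueError("sentence is shorter than the kernel")
    features = []
    for start in range(len(word_vectors) - height + 1):
        window = word_vectors[start:start + height]
        total = bias
        for kernel_row, vector in zip(kernel, window):
            total += sum(w * x for w, x in zip(kernel_row, vector))
        features.append(max(0.0, total))  # ReLU
    return max(features)                  # max-over-time pooling


# Three 2-dimensional word vectors and one kernel covering two words (toy values).
sentence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
kernel = [[1.0, 0.0], [0.0, 1.0]]
feature = conv_max_pool(sentence, kernel)
```

A textCNN typically runs many such kernels of several heights in parallel and concatenates the pooled features before the fully connected layer.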

It should be noted that the logistic regression model is used to compress the result for the input vector into the range [0, 1] for output.

For example, suppose the vector obtained after the fully connected layer is X(x1, x2, x3, ..., xn). This vector is used as the input of the logistic regression model, according to the following principle:

Figure BDA0002555588520000131

Further,

Figure BDA0002555588520000132

where ω1, ω2, ..., ωn are the weights of the input components x1, x2, ..., xn, respectively.
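Since the two formula images are not reproduced, the sketch below shows the standard logistic regression form that matches the surrounding description: a weighted sum of the inputs passed through the sigmoid function, which maps any real number into [0, 1]. The optional `bias` term is an assumption (the text mentions only the weights), and the sample values are hypothetical.

```python
import math


def logistic_score(x, weights, bias=0.0):
    """Sigmoid of the weighted sum of x; the result lies in [0, 1]."""
    if len(x) != len(weights):
        raise ValueError("x and weights must have the same length")
    z = sum(w * xi for w, xi in zip(weights, x)) + bias
    # Evaluate the sigmoid in a numerically stable way for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    exp_z = math.exp(z)
    return exp_z / (1.0 + exp_z)


# Hypothetical fully connected layer output and weights.
similarity = logistic_score([0.2, -0.4, 1.0], [0.5, 0.25, 1.5])
```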

In this embodiment, combining the FastText model with the textCNN model to calculate the similarity value between the question sentence and the query result improves computational efficiency and further helps ensure that the user's question is answered accurately.

In some optional implementations, after the FAQ database is queried in step S203 for candidate questions corresponding to the question sentence, the method further includes the following step: determining the number of candidate questions, and performing a corresponding operation according to the result.

Specifically, if the number of questions is zero, the classification algorithm is applied directly. If the number of questions is less than or equal to a preset threshold, the candidate questions are returned directly to the user. If the number of questions is greater than the preset threshold, the candidate questions are scored according to the weight of each word, the candidate questions whose scores are greater than or equal to the preset score are selected as query results and sorted, and steps S204 and S205 are then performed. In this way, when no candidate question corresponding to the question sentence is found, the classification algorithm is applied directly and questions of the same category as the user's question are returned, which avoids a recall rate of zero. When the number of questions is less than or equal to the preset threshold, returning the candidate questions directly improves query efficiency. When the number of questions is greater than the preset threshold, screening the retrieved questions through multiple calculations improves the precision of question screening and recommends high-quality questions to the user.
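The three-way decision described above can be sketched as a small dispatch function. The function name and the returned labels are hypothetical placeholders for the operations named in the text.

```python
def choose_strategy(num_candidates, threshold):
    """Pick the processing path from the number of candidate questions found."""
    if num_candidates == 0:
        # Nothing retrieved: fall back to the classification algorithm directly.
        return "classify_directly"
    if num_candidates <= threshold:
        # Few enough candidates: return them to the user as they are.
        return "return_candidates"
    # Too many candidates: score, screen, then run steps S204 and S205.
    return "score_filter_classify"


strategy = choose_strategy(num_candidates=25, threshold=10)
```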

Those of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed, may include the processes of the embodiments of the methods described above. The storage medium may be a non-volatile storage medium such as a magnetic disk, an optical disc, or a read-only memory (ROM), or a random access memory (RAM).

It should be understood that although the steps in the flowcharts of the accompanying drawings are shown in sequence as indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, there is no strict restriction on the order of these steps, and they may be performed in other orders. Moreover, at least some of the steps in the flowcharts may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be performed at different times; they also need not be performed sequentially, and may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.

In one embodiment, as shown in FIG. 6, a question screening apparatus based on a FAQ database is provided, which corresponds one-to-one to the question screening method based on a FAQ database in the above embodiments. The question screening apparatus based on a FAQ database includes:

a word segmentation module 601, configured to analyze the question sentence entered by the user and segment the question sentence into words;

a processing module 602, configured to count the frequency in the FAQ database of each word obtained by segmentation, determine the weight of each word, and store the word weights and the segmentation result in the FAQ database;

a query scoring module 603, configured to query the FAQ database for candidate questions corresponding to the question sentence and score the candidate questions according to the weight of each word;

a screening module 604, configured to screen out questions whose scores are greater than or equal to a preset score as query results;

a similarity calculation module 605, configured to calculate the similarity value between the question sentence and each query result according to the similarity algorithm models, and filter out query results whose similarity values fall outside the preset range;

a classification calculation module 606, configured to apply a classification algorithm to the filtered query results; and

a determining module 607, configured to determine a preset number of questions with the highest similarity to the entered question sentence.

It should be emphasized that, to further ensure the privacy and security of the question sentence entered by the user, the question sentence may also be stored in a node of a blockchain.

In some optional implementations of this embodiment, the query scoring module 603 includes:

an extraction unit, configured to extract the keywords of the question sentence according to the weight of each word;

an expansion unit, configured to determine expansion words for the keywords; and

a query scoring unit, configured to generate candidate questions corresponding to the question sentence according to the keywords and expansion words, and score the candidate questions according to the weights of the words.

In some optional implementations of this embodiment, the similarity calculation module 605 includes:

a calculation unit, configured to separately calculate the Jaccard similarity, BM25 similarity, cosine similarity, and edit distance similarity between the question sentence and the query result, and compute a weighted sum of these similarities according to their respective weight values to obtain the similarity value between the question sentence and the query result; and

a filtering unit, configured to filter out query results whose similarity values fall outside the preset range.

Specifically, the filtering unit is configured to compare the similarity value with a preset similarity threshold and, according to the comparison result, retain query results whose similarity value is greater than or equal to the threshold and remove query results whose similarity value is less than the threshold.

In some optional implementations of this embodiment, the classification calculation module 606 includes:

an obtaining unit, configured to obtain, through FastText, the word vectors of the question sentence and of the query result, respectively; and

a processing unit, configured to input the word vectors into the textCNN model, construct a similarity matrix of the question sentence and the query result after the convolutional layer and pooling layer operations, and output the similarity value between the question sentence and the query result through the fully connected layer.

The above question screening apparatus based on a FAQ database analyzes the question sentence entered by the user, segments it into words, and determines the weight of each resulting word. It queries candidate questions corresponding to the question sentence, and scores and screens the candidate questions retrieved from the FAQ database according to the word weights. It then calculates the similarity value between the question sentence and each screened query result according to the similarity algorithm models, and filters out query results whose similarity values fall outside the preset range. Finally, it applies a classification algorithm to the filtered query results to determine a preset number of questions with the highest similarity to the entered question sentence. Screening the retrieved questions several times in this way improves the precision of question screening and recommends high-quality questions to the user.

To solve the above technical problem, an embodiment of the present application further provides a computer device. Refer to FIG. 7, which is a block diagram of the basic structure of the computer device of this embodiment.

The computer device 7 includes a memory 71, a processor 72, and a network interface 73 that are communicatively connected to one another through a system bus. It should be pointed out that the figure shows only the computer device 7 with components 71-73; it should be understood that not all of the components shown are required, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and that its hardware includes, but is not limited to, a microprocessor, an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a digital signal processor (DSP), an embedded device, and the like.

The computer device may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The computer device can interact with the user by means of a keyboard, a mouse, a remote control, a touchpad, a voice control device, or the like.

The memory 71 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, an optical disc, and the like. In some embodiments, the memory 71 may be an internal storage unit of the computer device 7, such as a hard disk or internal memory of the computer device 7. In other embodiments, the memory 71 may be an external storage device of the computer device 7, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device 7. The memory 71 may also include both an internal storage unit of the computer device 7 and an external storage device. In this embodiment, the memory 71 is generally used to store the operating system and the various application software installed on the computer device 7, such as the computer-readable instructions of the question screening method based on a FAQ database. In addition, the memory 71 may be used to temporarily store various types of data that have been output or are to be output.

In some embodiments, the processor 72 may be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 72 is generally used to control the overall operation of the computer device 7. In this embodiment, the processor 72 is configured to run the computer-readable instructions stored in the memory 71 or to process data, for example to run the computer-readable instructions of the question screening method based on the FAQ database.

The network interface 73 may include a wireless network interface or a wired network interface, and is generally used to establish a communication connection between the computer device 7 and other electronic devices.

In this embodiment, when the processor executes the computer-readable instructions stored in the memory, the steps of the question screening method based on the FAQ database described in the above embodiments are implemented. The queried questions are thereby computed and screened multiple times, which improves the accuracy of question screening and allows high-quality questions to be recommended to the user.

The present application also provides another embodiment, namely a computer-readable storage medium storing computer-readable instructions. The computer-readable instructions can be executed by at least one processor so that the at least one processor performs the steps of the question screening method based on the FAQ database described above. The queried questions are thereby computed and screened multiple times, which improves the accuracy of question screening and allows high-quality questions to be recommended to the user.

From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software together with a necessary general-purpose hardware platform, and can also be implemented in hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disk) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, or the like) to execute the methods described in the embodiments of this application.

Obviously, the embodiments described above are only some of the embodiments of the present application, not all of them. The accompanying drawings show preferred embodiments of the present application but do not limit its patent scope. The present application can be implemented in many different forms; these embodiments are provided so that the disclosure of the present application is understood more thoroughly and completely. Although the present application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing specific embodiments, or make equivalent replacements for some of their technical features. Any equivalent structure made using the contents of the description and drawings of the present application, and applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of the present application.

Claims (10)

1. A question screening method based on an FAQ database, characterized by comprising the following steps:
analyzing a question sentence input by a user, and performing word segmentation on the question sentence;
counting the word frequency, in the FAQ database, of each word obtained by word segmentation, determining the weight of each word, and storing the weight of each word and the word segmentation result in the FAQ database;
querying the FAQ database for candidate questions corresponding to the question sentence, scoring the candidate questions according to the weight of each word, and screening out the candidate questions whose score is greater than or equal to a preset score as query results;
calculating a similarity value between the question sentence and each query result according to a similarity algorithm model, and filtering out the query results whose similarity value is not within a preset range;
and applying a classification algorithm to the filtered query results, and determining a preset number of questions having the highest similarity to the input question sentence.
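The word-weighting and scoring steps of claim 1 can be illustrated with a minimal Python sketch. The claim does not state how a weight is derived from word frequency, so the IDF-style formula below is an assumption, as are the function names and the use of pre-segmented word lists in place of a real word segmenter.

```python
import math
from collections import Counter

def word_weights(segmented_faq):
    """Map each word to a weight derived from how many FAQ questions contain it.

    The smoothed IDF-style formula is an assumption; the claim only says that
    a weight is determined from the word frequency.
    """
    n = len(segmented_faq)
    doc_freq = Counter(w for words in segmented_faq for w in set(words))
    return {w: math.log((n + 1) / (df + 1)) + 1.0 for w, df in doc_freq.items()}

def score(query_words, candidate_words, weights):
    """Score a candidate as the sum of the weights of the words it shares with the query."""
    shared = set(query_words) & set(candidate_words)
    return sum(weights.get(w, 0.0) for w in shared)

def screen(query_words, candidates, weights, min_score):
    """Keep the candidates whose score is greater than or equal to the preset score."""
    return [c for c in candidates if score(query_words, c, weights) >= min_score]

# Hypothetical, already segmented FAQ questions.
faq = [["reset", "password"], ["change", "password"], ["open", "account"]]
weights = word_weights(faq)
results = screen(["reset", "password"], faq, weights, min_score=2.0)
```

With these toy inputs only the first FAQ question shares both query words, so it is the only one that reaches the preset score of 2.0.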
2. The question screening method based on the FAQ database according to claim 1, wherein the step of querying the FAQ database for candidate questions corresponding to the question sentence and scoring the candidate questions according to the weight of each word specifically comprises:
extracting keywords from the question sentence according to the weight of each word;
determining expansion words of the keywords;
generating candidate questions corresponding to the question sentence according to the keywords and the expansion words;
and scoring the candidate questions according to the weight of each word.
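As a rough illustration of claim 2, the sketch below picks the highest-weighted words as keywords, expands them, and retrieves the FAQ questions that contain any resulting term. The `synonyms` dictionary, the `top_k` cut-off, and the "contains any term" retrieval rule are all hypothetical; the claim does not say how expansion words are found or how candidates are generated from them.

```python
def extract_keywords(query_words, weights, top_k=2):
    """Return the top_k distinct query words with the largest weights."""
    ranked = sorted(set(query_words), key=lambda w: (-weights.get(w, 0.0), w))
    return ranked[:top_k]

def expand(keywords, synonyms):
    """Add expansion words for each keyword from a (hypothetical) synonym table."""
    terms = set(keywords)
    for w in keywords:
        terms.update(synonyms.get(w, ()))
    return terms

def candidate_questions(terms, segmented_faq):
    """Select the FAQ questions that contain at least one keyword or expansion word."""
    return [q for q in segmented_faq if terms & set(q)]

# Hypothetical weights, synonym table, and segmented FAQ questions.
weights = {"how": 0.1, "reset": 2.0, "my": 0.2, "password": 1.5}
synonyms = {"password": ["passcode"]}
faq = [["change", "passcode"], ["open", "account"], ["reset", "pin"]]

keywords = extract_keywords(["how", "reset", "my", "password"], weights)
terms = expand(keywords, synonyms)
candidates = candidate_questions(terms, faq)
```

The expansion step is what lets the question about "passcode" be retrieved even though the user typed "password".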
3. The question screening method based on the FAQ database according to claim 1, wherein the step of calculating a similarity value between the question sentence and each query result according to a similarity algorithm model and filtering out the query results whose similarity value is not within a preset range specifically comprises:
respectively calculating the Jaccard similarity, the BM25 similarity, the cosine similarity and the edit-distance similarity between the question sentence and the query result;
weighting and summing the calculated Jaccard similarity, BM25 similarity, cosine similarity and edit-distance similarity according to their respective weight values to obtain the similarity value between the question sentence and the query result;
and filtering out the query results whose similarity value is not within the preset range.
4. The question screening method based on the FAQ database according to claim 3, wherein the step of filtering out the query results whose similarity value is not within the preset range specifically comprises:
comparing the similarity value with a preset similarity threshold;
and, according to the comparison result, retaining the query results whose similarity value is greater than or equal to the preset similarity threshold, and removing the query results whose similarity value is smaller than the preset similarity threshold.
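The weighted combination in claim 3 and the threshold filter in claim 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the equal weights, the BM25 parameters `k1` and `b`, the particular BM25 IDF variant, and the division of the BM25 score by the best score in the corpus (so that all four terms lie in a comparable range) are not taken from the patent.

```python
import math
from collections import Counter

def jaccard(a, b):
    """Size of the word-set intersection divided by the size of the union."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def cosine(a, b):
    """Cosine of the angle between the two bag-of-words count vectors."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = math.sqrt(sum(v * v for v in ca.values())) * math.sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

def edit_similarity(s, t):
    """One minus the Levenshtein distance divided by the longer string length."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (cs != ct)))
        prev = cur
    longest = max(len(s), len(t))
    return 1.0 - prev[-1] / longest if longest else 1.0

def bm25(query, doc, corpus, k1=1.5, b=0.75):
    """Okapi BM25 score of doc for query; corpus must contain non-empty word lists."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    tf = Counter(doc)
    total = 0.0
    for w in set(query):
        df = sum(1 for d in corpus if w in d)
        idf = math.log((n - df + 0.5) / (df + 0.5) + 1.0)
        total += idf * tf[w] * (k1 + 1) / (tf[w] + k1 * (1 - b + b * len(doc) / avgdl))
    return total

def combined_similarity(query, doc, corpus, weights=(0.25, 0.25, 0.25, 0.25)):
    """Weighted sum of Jaccard, scaled BM25, cosine and edit-distance similarity."""
    best = max(bm25(query, d, corpus) for d in corpus) or 1.0  # assumed scaling
    parts = (
        jaccard(query, doc),
        bm25(query, doc, corpus) / best,
        cosine(query, doc),
        edit_similarity(" ".join(query), " ".join(doc)),
    )
    return sum(w * p for w, p in zip(weights, parts))

def filter_results(query, results, corpus, threshold):
    """Keep the results whose combined similarity reaches the preset threshold."""
    return [d for d in results if combined_similarity(query, d, corpus) >= threshold]
```

For example, with `corpus = [["reset", "password"], ["open", "account"]]` and the query `["reset", "password"]`, calling `filter_results(query, corpus, corpus, 0.5)` keeps only the first entry, because the second shares no words with the query and contributes only a small edit-distance term.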
5. The question screening method based on the FAQ database according to claim 1, wherein the step of applying a classification algorithm to the filtered query results and determining the questions having the highest similarity to the input question sentence specifically comprises:
respectively obtaining word vectors of the question sentence and of the query results through FastText;
and inputting the word vectors into a TextCNN model, constructing a similarity matrix of the question sentence and the query result after the operations of a convolution layer and a pooling layer, and outputting the similarity between the question sentence and the query result through a fully connected layer.
6. The question screening method based on the FAQ database according to claim 1, further comprising, after the step of querying the FAQ database for candidate questions corresponding to the question sentence:
determining the number of questions among the candidate questions, and performing a corresponding operation according to the determination result.
7. The question screening method based on the FAQ database according to claim 6, wherein the step of determining the number of questions among the candidate questions and performing a corresponding operation according to the determination result specifically comprises:
if the number of questions is zero, directly performing the calculation with the classification algorithm;
if the number of questions is less than or equal to a preset threshold, directly returning the candidate questions to the client;
and if the number of questions is greater than the preset threshold, scoring the candidate questions according to the weight of each word.
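The three-way branch in claim 7 can be written as a small dispatch function. The sketch below is illustrative only; the returned labels are hypothetical names for the three operations. Because zero is also less than or equal to any non-negative threshold, the zero case is tested first, which is an assumed reading of the claim.

```python
def route(candidates, threshold):
    """Choose the next operation from the number of candidate questions."""
    if not candidates:
        # No candidates were found: go straight to the classification algorithm.
        return "classify"
    if len(candidates) <= threshold:
        # Few enough candidates: return them to the client as they are.
        return "return_to_client"
    # Too many candidates: score them by word weight and screen further.
    return "score_by_weight"

decisions = [route(c, threshold=2) for c in ([], ["q1"], ["q1", "q2"], ["q1", "q2", "q3"])]
```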
8. A question screening apparatus based on an FAQ database, comprising:
a word segmentation module, configured to analyze a question sentence input by a user and perform word segmentation on the question sentence;
a processing module, configured to count the word frequency, in the FAQ database, of each word obtained by word segmentation, determine the weight of each word, and store the weight of each word and the word segmentation result in the FAQ database;
a query scoring module, configured to query the FAQ database for candidate questions corresponding to the question sentence and score the candidate questions according to the weight of each word;
a screening module, configured to screen out the questions whose score is greater than or equal to a preset score as query results;
a similarity calculation module, configured to calculate a similarity value between the question sentence and each query result according to a similarity algorithm model, and filter out the query results whose similarity value is not within a preset range;
a classification calculation module, configured to apply a classification algorithm to the filtered query results; and
a determining module, configured to determine a preset number of questions having the highest similarity to the input question sentence.
9. A computer device comprising a memory and a processor, wherein the memory stores computer-readable instructions, and the processor, when executing the computer-readable instructions, implements the steps of the question screening method based on the FAQ database according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-readable instructions which, when executed by a processor, implement the steps of the question screening method based on the FAQ database according to any one of claims 1 to 7.
CN202010591151.4A 2020-06-24 2020-06-24 Question screening method, device, computer equipment and medium based on FAQ database Pending CN111797214A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010591151.4A CN111797214A (en) 2020-06-24 2020-06-24 Question screening method, device, computer equipment and medium based on FAQ database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010591151.4A CN111797214A (en) 2020-06-24 2020-06-24 Question screening method, device, computer equipment and medium based on FAQ database

Publications (1)

Publication Number Publication Date
CN111797214A true CN111797214A (en) 2020-10-20

Family

ID=72804208

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010591151.4A Pending CN111797214A (en) 2020-06-24 2020-06-24 Question screening method, device, computer equipment and medium based on FAQ database

Country Status (1)

Country Link
CN (1) CN111797214A (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105989040A (en) * 2015-02-03 2016-10-05 阿里巴巴集团控股有限公司 Intelligent question-answer method, device and system
CN106557508A (en) * 2015-09-28 2017-04-05 北京神州泰岳软件股份有限公司 A kind of text key word extracting method and device
CN106611041A (en) * 2016-09-29 2017-05-03 四川用联信息技术有限公司 New text similarity solution method
CN107153639A (en) * 2016-03-04 2017-09-12 北大方正集团有限公司 Intelligent answer method and system
CN108345672A (en) * 2018-02-09 2018-07-31 平安科技(深圳)有限公司 Intelligent response method, electronic device and storage medium
CN108415980A (en) * 2018-02-09 2018-08-17 平安科技(深圳)有限公司 Question and answer data processing method, electronic device and storage medium
CN108519970A (en) * 2018-02-06 2018-09-11 平安科技(深圳)有限公司 The identification method of sensitive information, electronic device and readable storage medium storing program for executing in text
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system
CN108595619A (en) * 2018-04-23 2018-09-28 海信集团有限公司 A kind of answering method and equipment
CN109815492A (en) * 2019-01-04 2019-05-28 平安科技(深圳)有限公司 A kind of intension recognizing method based on identification model, identification equipment and medium
CN110414004A (en) * 2019-07-31 2019-11-05 阿里巴巴集团控股有限公司 A kind of method and system that core information extracts
CN111309878A (en) * 2020-01-19 2020-06-19 支付宝(杭州)信息技术有限公司 Retrieval type question-answering method, model training method, server and storage medium

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112541076A (en) * 2020-11-09 2021-03-23 北京百度网讯科技有限公司 Method and device for generating extended corpus of target field and electronic equipment
CN112541076B (en) * 2020-11-09 2024-03-29 北京百度网讯科技有限公司 Method and device for generating expanded corpus in target field and electronic equipment
CN112565663A (en) * 2020-11-26 2021-03-26 平安普惠企业管理有限公司 Demand question reply method and device, terminal equipment and storage medium
CN112565663B (en) * 2020-11-26 2022-11-18 平安普惠企业管理有限公司 Demand question reply method and device, terminal equipment and storage medium
CN112527988A (en) * 2020-12-14 2021-03-19 深圳市优必选科技股份有限公司 Automatic reply generation method and device and intelligent equipment
CN112632395A (en) * 2020-12-31 2021-04-09 深圳追一科技有限公司 Search recommendation method and device, server and computer-readable storage medium
CN112765960A (en) * 2021-02-07 2021-05-07 成都新潮传媒集团有限公司 Text matching method and device and computer equipment
CN112765960B (en) * 2021-02-07 2022-11-25 成都新潮传媒集团有限公司 Text matching method and device and computer equipment
CN112925889A (en) * 2021-02-26 2021-06-08 北京声智科技有限公司 Natural language processing method, device, electronic equipment and storage medium
CN112925889B (en) * 2021-02-26 2024-04-30 北京声智科技有限公司 Natural language processing method, device, electronic equipment and storage medium
CN112948553A (en) * 2021-02-26 2021-06-11 平安国际智慧城市科技股份有限公司 Legal intelligent question and answer method and device, electronic equipment and storage medium
CN112905752A (en) * 2021-03-30 2021-06-04 中国建设银行股份有限公司 Intelligent interaction method, device, equipment and storage medium
CN113312525A (en) * 2021-06-07 2021-08-27 浙江工业大学 Method for reversely calibrating steel seal code through java
CN113312525B (en) * 2021-06-07 2024-02-09 浙江工业大学 Method for reversely calibrating seal code through java
CN113313472A (en) * 2021-06-15 2021-08-27 海南君麟环境科技有限公司 Intelligent environmental control platform establishing method and system based on big data
CN114090747A (en) * 2021-10-14 2022-02-25 特斯联科技集团有限公司 Automatic question answering method, device, equipment and medium based on multiple semantic matching
CN113722465A (en) * 2021-11-02 2021-11-30 北京卓建智菡科技有限公司 Intention identification method and device
CN113722465B (en) * 2021-11-02 2022-01-21 北京卓建智菡科技有限公司 Intention identification method and device
CN114065735A (en) * 2021-11-24 2022-02-18 北京房江湖科技有限公司 Text error correction method
CN114372122A (en) * 2021-12-08 2022-04-19 阿里云计算有限公司 Information acquisition method, computing device and storage medium
CN114510918A (en) * 2022-02-16 2022-05-17 数字浙江技术运营有限公司 Expert matching method and device
CN114579601B (en) * 2022-02-28 2024-09-03 阿里巴巴(中国)有限公司 Data generation method, device, computing equipment and medium
CN114579601A (en) * 2022-02-28 2022-06-03 阿里巴巴(中国)有限公司 Data generation method and device, computing equipment and medium
CN114691994A (en) * 2022-04-02 2022-07-01 零氪科技(北京)有限公司 Medical knowledge recommendation method and device, electronic equipment and storage medium
CN114780700A (en) * 2022-04-20 2022-07-22 平安科技(深圳)有限公司 Intelligent question-answering method, device, equipment and medium based on machine reading understanding
CN115269804A (en) * 2022-08-03 2022-11-01 中国银行股份有限公司 A reply method, device, equipment and storage medium
CN116050426A (en) * 2022-12-30 2023-05-02 浪潮通用软件有限公司 Method, system, device and storage medium for fast menu query
CN116340481A (en) * 2023-02-27 2023-06-27 华院计算技术(上海)股份有限公司 Method and device for automatically replying to question, computer readable storage medium and terminal
CN116340481B (en) * 2023-02-27 2024-05-10 华院计算技术(上海)股份有限公司 Method and device for automatically replying to question, computer readable storage medium and terminal
CN116737875A (en) * 2023-05-31 2023-09-12 四川长虹电器股份有限公司 Skill semantic similarity retrieval method
CN117473069A (en) * 2023-12-26 2024-01-30 深圳市明源云客电子商务有限公司 Business corpus generation method, device and equipment and computer readable storage medium
CN117473069B (en) * 2023-12-26 2024-04-12 深圳市明源云客电子商务有限公司 Business corpus generation method, device, equipment and computer-readable storage medium
CN117725148A (en) * 2024-02-07 2024-03-19 湖南三湘银行股份有限公司 Question-answer word library updating method based on self-learning
CN118093768A (en) * 2024-02-19 2024-05-28 国网江苏省电力有限公司南通供电分公司 A method for constructing a knowledge base of power scientific research results based on large language model
CN118093768B (en) * 2024-02-19 2024-12-20 国网江苏省电力有限公司南通供电分公司 A method for constructing a knowledge base of power scientific research results based on large language model
CN119782511A (en) * 2024-12-05 2025-04-08 湖北泰跃卫星技术发展股份有限公司 Question matching method and system in digital man system

Similar Documents

Publication Publication Date Title
CN111797214A (en) Question screening method, device, computer equipment and medium based on FAQ database
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN112818093B (en) Evidence document retrieval method, system and storage medium based on semantic matching
US9311823B2 (en) Caching natural language questions and results in a question and answer system
CN113988157B (en) Semantic retrieval network training method, device, electronic equipment and storage medium
US20130060769A1 (en) System and method for identifying social media interactions
CN111737997A (en) A text similarity determination method, device and storage medium
CN114116997A (en) Knowledge question answering method, knowledge question answering device, electronic equipment and storage medium
CN113761125B (en) Dynamic summary determination method and device, computing device and computer storage medium
CN112686053A (en) Data enhancement method and device, computer equipment and storage medium
CN113204953A (en) Text matching method and device based on semantic recognition and device readable storage medium
CN115438149A (en) End-to-end model training method and device, computer equipment and storage medium
CN119046432A (en) Data generation method and device based on artificial intelligence, computer equipment and medium
CN119599130A (en) Self-adaptive sensitive information intelligent identification method, device, equipment, storage medium and product
CN118797005A (en) Intelligent question-answering method, device, electronic device, storage medium and product
CN111126073B (en) Semantic retrieval method and device
CN116245139A (en) Graph neural network model training method and device, event detection method and device
CN115878761A (en) Event context generation method, apparatus, and medium
CN113360602B (en) Method, apparatus, device and storage medium for outputting information
CN111737607B (en) Data processing method, device, electronic equipment and storage medium
CN116610782B (en) Text retrieval method, device, electronic equipment and medium
CN114647739B (en) Entity chain finger method, device, electronic equipment and storage medium
CN116010704A (en) Enterprise peer recommendation method, electronic equipment and storage medium
CN116383340A (en) Information search method, device, electronic device and storage medium
CN115328945A (en) Data asset retrieval method, electronic device and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20201020)