CN113220999A - User feature generation method and device, electronic equipment and storage medium - Google Patents
- Publication number
- CN113220999A, CN202110529089.0A
- Authority
- CN
- China
- Prior art keywords
- word
- word segmentation
- topic
- user
- under
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0635—Risk analysis of enterprise or organisation activities
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
Landscapes
- Engineering & Computer Science (AREA)
- Business, Economics & Management (AREA)
- Theoretical Computer Science (AREA)
- Human Resources & Organizations (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- Strategic Management (AREA)
- Economics (AREA)
- Entrepreneurship & Innovation (AREA)
- Databases & Information Systems (AREA)
- General Business, Economics & Management (AREA)
- Development Economics (AREA)
- General Engineering & Computer Science (AREA)
- Marketing (AREA)
- Quality & Reliability (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Operations Research (AREA)
- Game Theory and Decision Science (AREA)
- Tourism & Hospitality (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Accounting & Taxation (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Finance (AREA)
- Educational Administration (AREA)
- Technology Law (AREA)
- Data Mining & Analysis (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present disclosure provides a method and apparatus for generating user features, an electronic device, and a storage medium, and relates to the field of computer technology, in particular to artificial intelligence fields such as natural language processing and deep learning. The specific implementation is: obtaining first historical text data corresponding to a target user; parsing the first historical text data to determine a first word segment set corresponding to the target user; determining, according to the matching degree between each word segment in the first word segment set and each word segment corresponding to each topic, the number of word segments under each topic contained in the first word segment set; and determining the user features corresponding to the target user according to the number of word segments under each topic contained in the first word segment set. The user features are thus determined from the number of the target user's word segments under each topic, which improves the accuracy of the obtained user features.
Description
Technical Field
The present disclosure relates to the field of computer technology, in particular to artificial intelligence fields such as natural language processing and deep learning, and specifically to a method and apparatus for generating user features, a model training method and apparatus, an electronic device, and a storage medium.
Background
With the continuous development of Internet technology, many Internet-based products and services have emerged. To improve service quality and user experience, user features can be analyzed so that personalized, accurate services can be provided to users based on those features.
Therefore, how to improve the accuracy of the obtained user features is a problem that urgently needs to be solved.
Summary of the Invention
The present disclosure provides a method and apparatus for generating user features, a model training method and apparatus, an electronic device, and a storage medium.
According to one aspect of the present disclosure, a method for generating user features is provided, including:
obtaining first historical text data corresponding to a target user;
parsing the first historical text data to determine a first word segment set corresponding to the target user;
determining, according to the matching degree between each word segment in the first word segment set and each word segment corresponding to each topic, the number of word segments under each topic contained in the first word segment set; and
determining the user features corresponding to the target user according to the number of word segments under each topic contained in the first word segment set.
According to an aspect of the present disclosure, a model training method is provided, including:
obtaining a training data set, wherein the training data set includes a plurality of pieces of historical text data corresponding to each of a plurality of users;
parsing the plurality of pieces of historical text data corresponding to each user to determine a word segment set corresponding to each user;
determining the word segments under each topic contained in the word segment set corresponding to each user, and the labeled risk level corresponding to each user;
inputting the word segments under each topic contained in the word segment set corresponding to each user, together with the corresponding topics, into an initial neural network model to obtain a predicted risk level output by the initial neural network model; and
correcting the initial neural network model according to the difference between the predicted risk level and the labeled risk level, to generate a risk control model.
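The training steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the disclosed model: a single-layer logistic classifier stands in for the initial neural network, the risk level is simplified to a binary label (0 or 1), and each user is represented by per-topic word segment counts rather than by the word segments themselves. The names `train_risk_model` and `predict_risk` and the toy data are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_risk_model(features, labels, lr=0.1, epochs=500):
    """Fit a logistic classifier by stochastic gradient descent.

    features: one list of per-topic word segment counts per user (assumed input form).
    labels:   labeled risk level per user, simplified here to 0 or 1.
    """
    weights = [0.0] * len(features[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            predicted = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
            # The difference between predicted and labeled risk drives the correction.
            error = predicted - y
            weights = [w - lr * error * xi for w, xi in zip(weights, x)]
            bias -= lr * error
    return weights, bias


def predict_risk(weights, bias, x) -> int:
    score = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias)
    return 1 if score >= 0.5 else 0


# Toy data: column 0 counts segments under a risk-related topic, column 1 under another topic.
toy_features = [[6, 0], [5, 1], [0, 4], [1, 5]]
toy_labels = [1, 1, 0, 0]
toy_weights, toy_bias = train_risk_model(toy_features, toy_labels)
```

A real risk control model would likely use a multi-class output and a deeper network; the sketch only shows how the gap between predicted and labeled risk levels is used to update parameters.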
According to another aspect of the present disclosure, an apparatus for generating user features is provided, including:
a first obtaining module, configured to obtain first historical text data corresponding to a target user;
a first parsing module, configured to parse the first historical text data to determine a first word segment set corresponding to the target user;
a first determining module, configured to determine, according to the matching degree between each word segment in the first word segment set and each word segment corresponding to each topic, the number of word segments under each topic contained in the first word segment set; and
a second determining module, configured to determine the user features corresponding to the target user according to the number of word segments under each topic contained in the first word segment set.
According to another aspect of the present disclosure, a model training apparatus is provided, including:
a second obtaining module, configured to obtain a training data set, wherein the training data set includes a plurality of pieces of historical text data corresponding to each of a plurality of users;
a second parsing module, configured to parse the plurality of pieces of historical text data corresponding to each user to determine a word segment set corresponding to each user;
an eighth determining module, configured to determine the word segments under each topic contained in the word segment set corresponding to each user, and the labeled risk level corresponding to each user; and
a second training module, configured to input the word segments under each topic contained in the word segment set corresponding to each user, together with the corresponding topics, into an initial neural network model to obtain a predicted risk level output by the initial neural network model, and to correct the initial neural network model according to the difference between the predicted risk level and the labeled risk level, to generate a risk control model.
According to another aspect of the present disclosure, an electronic device is provided, including:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to perform the method of any of the above embodiments.
According to another aspect of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions is provided, wherein the computer instructions are used to cause a computer to perform the method of any of the above embodiments.
According to another aspect of the present disclosure, a computer program product is provided, including a computer program which, when executed by a processor, implements the method of any of the above embodiments.
It should be understood that the content described in this section is not intended to identify key or important features of the embodiments of the present disclosure, nor to limit the scope of the present disclosure. Other features of the present disclosure will become readily understood from the following description.
Description of Drawings
The accompanying drawings are provided for a better understanding of the present solution and do not limit the present disclosure. In the drawings:
FIG. 1 is a schematic flowchart of a method for generating user features according to an embodiment of the present disclosure;
FIG. 2 is a schematic flowchart of another method for generating user features according to an embodiment of the present disclosure;
FIG. 3 is a schematic flowchart of another method for generating user features according to an embodiment of the present disclosure;
FIG. 4 is a schematic flowchart of another method for generating user features according to an embodiment of the present disclosure;
FIG. 5 is a schematic flowchart of another method for generating user features according to an embodiment of the present disclosure;
FIG. 6 is a schematic flowchart of a model training method according to an embodiment of the present disclosure;
FIG. 7 is a schematic structural diagram of an apparatus for generating user features according to an embodiment of the present disclosure;
FIG. 8 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 9 is a block diagram of an electronic device used to implement the methods of the embodiments of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings. Various details of the embodiments are included to aid understanding and should be regarded as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present disclosure. Likewise, for clarity and conciseness, descriptions of well-known functions and structures are omitted from the following description.
The method, apparatus, electronic device, and storage medium for generating user features according to the embodiments of the present disclosure are described below with reference to the accompanying drawings.
Artificial intelligence is the discipline that studies how to use computers to simulate certain human thinking processes and intelligent behaviors (such as learning, reasoning, thinking, and planning), and it covers both hardware-level and software-level technologies. Artificial intelligence hardware technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, and big data processing. Artificial intelligence software technologies include several major directions: computer vision, speech recognition, natural language processing, deep learning, big data processing, and knowledge graph technology.
NLP (Natural Language Processing) is an important direction in computer science and artificial intelligence. NLP research includes, but is not limited to, the following branches: text classification, information extraction, automatic summarization, intelligent question answering, topic recommendation, machine translation, subject word recognition, knowledge base construction, deep text representation, named entity recognition, text generation, text analysis (lexical, syntactic, grammatical, etc.), and speech recognition and synthesis.
Deep learning is a new research direction in the field of machine learning. Deep learning learns the inherent laws and representation levels of sample data, and the information obtained during this learning is of great help in interpreting data such as text, images, and sound. Its ultimate goal is to give machines a human-like ability to analyze and learn, so that they can recognize data such as text, images, and sound.
FIG. 1 is a schematic flowchart of a method for generating user features according to an embodiment of the present disclosure.
As shown in FIG. 1, the method for generating user features includes:
Step 101: obtain first historical text data corresponding to a target user.
In the present disclosure, the historical text data of a user within a preset past time period may be obtained. For ease of distinction, this user is referred to as the target user, and the obtained historical text data is referred to as the first historical text data. The first historical text data may be, for example, webpage content or video content browsed by the target user.
In practical applications, the historical text data within a preset time period before a given operation may also be obtained, based on the time at which the target user performed that operation. For example, if user A initiates a credit request at 16:00 on February 20, 2020, user A's historical text data from the 15 days before that time can be obtained, such as browsed webpage content, online shopping activity, and browsed video content.
It should be noted that, in the present disclosure, the acquisition, storage, and use of users' personal information all comply with relevant laws and regulations and do not violate public order and good customs.
Step 102: parse the first historical text data to determine a first word segment set corresponding to the target user.
In the present disclosure, the first historical text data may be parsed, for example by removing interjections, particles, and the like, and then performing word segmentation and deduplication to obtain a plurality of word segments. These word segments constitute a word segment set which, for ease of distinction, is referred to as the first word segment set.
In practical applications, multiple pieces of historical text data of the target user may be obtained. Each piece can be parsed to obtain its own word segment set. The resulting word segment sets are then merged in the order in which the historical text data was generated, and deduplicated, to obtain the first word segment set. Alternatively, the multiple pieces of historical text data can be sorted chronologically and combined into a single piece of historical text data; that is, the first historical text data may be the result of combining multiple pieces of historical text data. This combined text is then parsed to obtain a plurality of word segments, which constitute the first word segment set.
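The parsing and merging described in step 102 might look like the sketch below. It is only illustrative: whitespace splitting stands in for a real word segmenter (Chinese text would need a dedicated segmentation tool), and the stop-word list, the function names, and the `(generation_time, text)` input layout are all assumptions.

```python
# Hypothetical stop words standing in for interjections and particles.
STOP_WORDS = {"oh", "ah", "the", "of"}


def segment(text):
    """Split text into word segments and drop stop words (placeholder segmenter)."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def build_segment_set(dated_texts):
    """Merge per-text segments in generation-time order and deduplicate.

    dated_texts: iterable of (generation_time, text) pairs; any sortable time value works.
    Returns a list so that first-seen order is preserved.
    """
    seen = set()
    merged = []
    for _, text in sorted(dated_texts, key=lambda pair: pair[0]):
        for word in segment(text):
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


first_segment_set = build_segment_set([
    (2, "oh new hero weather"),
    (1, "the hero ticket"),
])
# first_segment_set == ["hero", "ticket", "new", "weather"]
```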
Step 103: determine, according to the matching degree between each word segment in the first word segment set and each word segment corresponding to each topic, the number of word segments under each topic contained in the first word segment set.
In the present disclosure, multiple topics and the word segments corresponding to each topic may be obtained in advance. These topics and their corresponding word segments may be obtained by manually classifying the word segments in multiple documents.
For example, two topics "game" and "travel" are obtained. The word segments corresponding to the topic "game" include [Honor of Kings (王者荣耀), jungling, glory, support, ranking up, new hero], and the word segments corresponding to the topic "travel" include [scenic spot, weather, ticket, place of departure, destination], and so on.
It should be noted that the word segments corresponding to the topics in the above example are only illustrative and should not be regarded as limiting the present disclosure.
After the first word segment set is obtained, each word segment in it may be matched against each word segment corresponding to each topic. Based on the resulting matching degrees, the word segments under each topic contained in the first word segment set are determined, and from these the number of word segments under each topic contained in the first word segment set.
When matching, the distance between the word vectors of two word segments can be calculated and used to measure their matching degree: the smaller the distance, the higher the matching degree; the larger the distance, the lower the matching degree.
For example, word segment p1 in the first word segment set is matched against each word segment in topic a. If the matching degree between p1 and some word segment in topic a is greater than a preset matching degree threshold, the first word segment set can be considered to contain that word segment under topic a.
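The matching in step 103 can be sketched as follows. The source only says that a smaller word vector distance means a higher matching degree; the specific mapping `1 / (1 + distance)` with Euclidean distance, the threshold value, the toy two-dimensional vectors, and all function names here are assumptions chosen for illustration.

```python
import math


def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def matching_degree(u, v):
    """Map distance to a score in (0, 1]; smaller distance gives a higher score."""
    return 1.0 / (1.0 + euclidean(u, v))


def count_segments_per_topic(user_segments, topic_words, vectors, threshold=0.8):
    """Count, per topic, the user segments that match at least one topic word."""
    counts = {}
    for topic, words in topic_words.items():
        counts[topic] = sum(
            1 for seg in user_segments
            if seg in vectors and any(
                word in vectors
                and matching_degree(vectors[seg], vectors[word]) > threshold
                for word in words
            )
        )
    return counts


# Toy word vectors; a real system would load trained embeddings.
vectors = {
    "hero": (1.0, 0.0), "champion": (1.0, 0.1),
    "ticket": (0.0, 1.0), "fare": (0.1, 1.0), "weather": (0.0, 0.9),
    "tax": (5.0, 5.0),
}
topic_words = {"game": ["hero"], "travel": ["ticket", "weather"]}
counts = count_segments_per_topic(["champion", "fare", "tax"], topic_words, vectors)
# counts == {"game": 1, "travel": 1}
```

Cosine similarity would be an equally plausible choice of matching degree; the counting logic stays the same.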
Step 104: determine the user features corresponding to the target user according to the number of word segments under each topic contained in the first word segment set.
In the present disclosure, the number of word segments under each topic contained in the first word segment set may be compared with a preset number, and the topics whose number of word segments is greater than the preset number are used as the user features corresponding to the target user. The preset number can be determined according to actual needs.
For example, suppose the preset number is 0 and there are 8 topics in total. If the first word segment set contains word segments under 5 of these topics, those 5 topics can be used as the user features corresponding to the target user.
Alternatively, the one or more topics with the largest number of word segments in the first word segment set may be used as the user features corresponding to the target user. For example, if the first word segment set contains the most word segments under topic a and topic b, then topic a and topic b can be used as the user features corresponding to the target user.
As a further alternative, the number of word segments under each topic contained in the first word segment set may be used directly as the user features corresponding to the target user.
For example, if there are 5 topics and the numbers of word segments under each topic contained in the first word segment set are 6, 5, 4, 0, and 0 respectively, then the word segment counts corresponding to the 5 topics can be used as the user features corresponding to the target user.
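The three alternatives for deriving user features from per-topic counts can be put side by side in a short sketch. The function names and the topic labels `t1` to `t5` are hypothetical; the counts reuse the 6, 5, 4, 0, 0 example from the text.

```python
def topics_above(counts, preset_number=0):
    """Option 1: topics whose segment count exceeds a preset number."""
    return [topic for topic, count in counts.items() if count > preset_number]


def top_topics(counts, k=1):
    """Option 2: the k topics with the largest segment counts."""
    return sorted(counts, key=lambda topic: counts[topic], reverse=True)[:k]


def count_vector(counts, topic_order):
    """Option 3: the raw per-topic counts, in a fixed topic order."""
    return [counts.get(topic, 0) for topic in topic_order]


counts = {"t1": 6, "t2": 5, "t3": 4, "t4": 0, "t5": 0}
print(topics_above(counts))                                  # ['t1', 't2', 't3']
print(top_topics(counts, k=2))                               # ['t1', 't2']
print(count_vector(counts, ["t1", "t2", "t3", "t4", "t5"]))  # [6, 5, 4, 0, 0]
```

The first two options yield a set of topic labels, while the third yields a numeric vector that is more convenient as input to a downstream model.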
In the embodiments of the present disclosure, the first historical text data corresponding to the target user is parsed to determine the first word segment set corresponding to the target user. Each word segment in the first word segment set is matched against the word segments under each topic to determine the number of word segments under each topic contained in the first word segment set, and the user features corresponding to the target user are determined from these numbers. Determining the user features from the number of the target user's word segments under each topic improves the accuracy of the obtained user features.
In one embodiment of the present disclosure, after the user features are obtained, promotion information may also be pushed to the target user based on the user features, which can improve the precision of promotion information delivery. This is described below with reference to FIG. 2, which is a schematic flowchart of another method for generating user features according to an embodiment of the present disclosure.
As shown in FIG. 2, the method for generating user features includes:
Step 201: obtain first historical text data corresponding to a target user.
Step 202: parse the first historical text data to determine a first word segment set corresponding to the target user.
Step 203: determine, according to the matching degree between each word segment in the first word segment set and each word segment corresponding to each topic, the number of word segments under each topic contained in the first word segment set.
Step 204: determine the user features corresponding to the target user according to the number of word segments under each topic contained in the first word segment set.
In the present disclosure, steps 201 to 204 are similar to steps 101 to 104 above and are therefore not described again here.
Step 205: determine the degree of association between the user features and each piece of information to be promoted.
In the present disclosure, the degree of association between the user features and each piece of information to be promoted can be calculated. The information to be promoted may be advertisements, videos, news, and the like.
For example, if the user features are "game" and "travel", the degree of association between each piece of information to be promoted and "game", and between each piece of information to be promoted and "travel", can be calculated.
Step 206: determine target promotion information according to the degrees of association.
After the degree of association between the user features and each piece of information to be promoted is obtained, the information to be promoted whose degree of association is greater than a preset association threshold may be used as the target promotion information.
It can be understood that there may be one or more pieces of target promotion information.
Step 207: push the target promotion information to the target user.
After the target promotion information is obtained, it can be pushed to the target user through the client used by the target user.
For example, if the user features corresponding to the target user include "game", game-related news, software, and the like may be recommended to the user.
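Steps 205 and 206 can be sketched under one simple assumption. The source does not say how the degree of association is computed, so the sketch assumes each piece of information to be promoted carries a set of topic tags and uses Jaccard overlap with the user's feature topics as the association. The function names, tags, and threshold are hypothetical.

```python
def association_degree(user_topics, item_topics):
    """Jaccard overlap between user feature topics and an item's topic tags (assumed metric)."""
    user_set, item_set = set(user_topics), set(item_topics)
    if not user_set or not item_set:
        return 0.0
    return len(user_set & item_set) / len(user_set | item_set)


def select_target_promotions(user_topics, candidates, threshold=0.3):
    """Keep the candidates whose association degree exceeds the preset threshold."""
    return [name for name, topics in candidates.items()
            if association_degree(user_topics, topics) > threshold]


candidates = {
    "new hero trailer": ["game"],
    "flight sale": ["travel", "finance"],
    "tax guide": ["finance"],
}
targets = select_target_promotions(["game", "travel"], candidates)
# targets == ["new hero trailer", "flight sale"]
```

In practice the association could instead come from embedding similarity or a learned ranking model; only the thresholding step is taken directly from the text.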
In the embodiments of the present disclosure, after the user features corresponding to the target user are determined, the degree of association between the user features and each piece of information to be promoted can also be determined, the target promotion information is determined according to the degrees of association, and the target promotion information is pushed to the target user. Pushing promotion information to the target user based on the user features can improve the precision of promotion information delivery.
In one embodiment of the present disclosure, information to be promoted may also be pushed to the target user after the number of word segments under each topic contained in the first word segment set is obtained. For example, if the first word segment set is determined to contain the most word segments under topic b, the target promotion information can be determined based on topic b and pushed to the target user.
In one embodiment of the present disclosure, after the number of word segments under each topic contained in the first word segment set is determined, it can also be determined whether the target user is a user with a target risk. This is described below with reference to FIG. 3, which is a schematic flowchart of another method for generating user features according to an embodiment of the present disclosure.
如图3所示,该用户特征的生成方法包括:As shown in Figure 3, the generation method of the user feature includes:
步骤301,获取目标用户对应的第一历史文本数据。Step 301: Obtain first historical text data corresponding to the target user.
步骤302,对第一历史文本数据进行解析,以确定目标用户对应的第一分词集。Step 302: Parse the first historical text data to determine the first segmented word set corresponding to the target user.
本公开中,步骤301-步骤302与上述步骤101-步骤102类似,故在此不再赘述。In the present disclosure, steps 301 to 302 are similar to the above-mentioned
Step 303: determine the number of word segments under a topic of a specified type in the first word segment set, where the topic of the specified type is related to a target risk.
After the number of word segments under each topic in the first word segment set is obtained, the number of word segments under the topic of the specified type can be determined. The topic of the specified type is related to the target risk.
For example, if the topic of the specified type is installment payment, which is related to the risk of overspending, the number of word segments under the installment payment topic in the first word segment set can be determined. As another example, the target risk may be the risk of overdue repayment.
It should be noted that the present disclosure does not limit the topic of the specified type or the target risk.
Step 304: when the number of word segments under the topic of the specified type in the first word segment set is greater than a preset threshold, determine that the target user is a user with the target risk.
In the present disclosure, if the number of word segments under the topic of the specified type in the first word segment set is greater than the preset threshold, the target user can be determined to be a user with the target risk.
For example, if the number of word segments under the installment payment topic in the first word segment set is 20, which is greater than the preset threshold of 10, the target user can be considered a user with overspending risk.
As another example, in a credit risk control scenario, if the number of word segments under the credit topic in the word segment set corresponding to a user is greater than the preset threshold, the user can be considered to have a risk of overdue repayment, and the corresponding service may be refused to that user.
In this embodiment of the present disclosure, after the number of word segments under each topic in the first word segment set is determined, the number of word segments under a topic of a specified type, which is related to a target risk, may also be determined. When that number is greater than a preset threshold, the target user is determined to be a user with the target risk. Whether a user is a target-risk user can thus be identified from the number of word segments under the topic of the specified type in the first word segment set, and this in turn can be used to decide whether to provide certain services to the target user.
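Steps 303 and 304 reduce to a single threshold comparison. A minimal sketch, assuming the per-topic counts are already available as a dictionary; the function name and the topic labels are hypothetical.

```python
def has_target_risk(topic_counts, risk_topic, threshold):
    """True when the count under the risk-related topic exceeds the threshold."""
    return topic_counts.get(risk_topic, 0) > threshold


# Values taken from the example in the text: 20 word segments, threshold 10.
counts = {"installment payment": 20, "travel": 3}
flagged = has_target_risk(counts, "installment payment", threshold=10)
```

The comparison is strict, matching the "greater than a preset threshold" wording, so a count equal to the threshold does not flag the user.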
In one embodiment of the present disclosure, multiple topics and the word segments corresponding to each topic may be determined by clustering. This is described below with reference to FIG. 4, a schematic flowchart of another user feature generation method provided by an embodiment of the present disclosure.
As shown in FIG. 4, the user feature generation method may further include:
Step 401: obtain multiple pieces of second historical text data corresponding to each of multiple users.
Step 402: parse the multiple pieces of second historical text data corresponding to each user to determine the second word segment set corresponding to each user.
In the present disclosure, the second word segment set is obtained in a manner similar to the first word segment set, which is not repeated here.
Step 403: cluster the word segments in the multiple second word segment sets corresponding to the multiple users to obtain multiple topic thesauri.
In the present disclosure, each user corresponds to one second word segment set, and an LDA (Latent Dirichlet Allocation) model may be used to cluster the multiple second word segment sets. After the second word segment sets are obtained, the number of topics to be produced by clustering is set, and the second word segment set corresponding to each user is input into an initial LDA model to train it.
When the LDA model converges, a word segment probability distribution can be obtained, which contains the probability that each word segment belongs to each topic. The word segments in the multiple second word segment sets can then be clustered according to this distribution. For example, for each topic, the word segments whose probability is greater than a preset probability are taken as the word segments corresponding to that topic, which yields multiple topic thesauri. Each topic thesaurus contains one or more word segments.
For example, if the number of topics is set to m when training the LDA model, then m topic thesauri, topic_1, topic_2, ..., topic_m-1, topic_m, can be determined from the word segment probability distribution at convergence.
For example, topic_7: [王者荣耀 (Honor of Kings), jungling, honor, support, rank climbing, new hero]; topic_10: [credit, credit card, card application, 我爱卡, credit limit, 爱卡, installment, finance, repayment]; topic_11: [scenic spot, weather, train, train ticket, bus, bus ticket, airplane, airplane ticket, place of departure, destination].
The example above shows word segments contained in the three topic thesauri topic_7, topic_10 and topic_11, which may equally be regarded as the word segments corresponding to the three topics. The word segments listed are only part of each topic thesaurus.
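The thresholding step that turns the converged word segment probability distribution into topic thesauri can be sketched as below. The probability table is a hand-written stand-in for the output of a trained LDA model, and the function name, the preset probability of 0.6 and the sample words are all illustrative assumptions.

```python
def build_topic_thesauri(word_topic_prob, num_topics, min_prob):
    """Group word segments by topic, keeping those whose probability exceeds min_prob."""
    thesauri = {f"topic_{k + 1}": [] for k in range(num_topics)}
    for word, probs in word_topic_prob.items():
        for k in range(num_topics):
            if probs[k] > min_prob:
                thesauri[f"topic_{k + 1}"].append(word)
    return thesauri


# Hypothetical probabilities that each word segment belongs to each of two topics.
word_topic_prob = {
    "train ticket": [0.9, 0.1],
    "destination": [0.8, 0.2],
    "credit card": [0.05, 0.95],
    "weather": [0.5, 0.5],
}
thesauri = build_topic_thesauri(word_topic_prob, num_topics=2, min_prob=0.6)
```

With this rule a word segment can land in several thesauri or in none, depending on the preset probability; "weather" is dropped here because neither probability exceeds 0.6.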
Step 404: determine the topic corresponding to each topic thesaurus according to the degree of matching between the word segments in each topic thesaurus and preset topics.
The approach above does not determine which topic each topic thesaurus corresponds to. For example, the topic thesaurus topic_11 is [scenic spot, weather, train, train ticket, bus, bus ticket, airplane, airplane ticket, place of departure, destination], but what the topic of topic_11 is remains unknown. Therefore, in the present disclosure, the degree of matching between each word segment in each topic thesaurus and the preset topics may be calculated, and the topic corresponding to each topic thesaurus is determined according to these degrees of matching.
In the present disclosure there are multiple preset topics, and the degree of matching between each word segment in each topic thesaurus and each preset topic may be calculated. If the degree of matching between every word segment in a topic thesaurus and a given preset topic is greater than a preset matching threshold, that preset topic is determined to be the topic corresponding to the topic thesaurus. In this way the topic corresponding to each topic thesaurus can be determined, yielding multiple topics and the word segments corresponding to each topic. For example, the topic corresponding to the topic thesaurus topic_11 above may be determined to be "travel".
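The naming rule of step 404 can be sketched as follows. The disclosure does not specify how the degree of matching is computed, so it is passed in as a callable; the lookup table of scores, the threshold of 0.5 and the function names are hypothetical.

```python
def assign_topic(thesaurus_words, preset_topics, match_degree, min_match):
    """Return the first preset topic that every word segment matches above min_match."""
    if not thesaurus_words:
        return None
    for topic in preset_topics:
        if all(match_degree(word, topic) > min_match for word in thesaurus_words):
            return topic
    return None


# Hypothetical matching scores; pairs that are not listed default to 0.0.
scores = {
    ("train ticket", "travel"): 0.9,
    ("destination", "travel"): 0.8,
    ("train ticket", "credit"): 0.1,
}


def match_degree(word, topic):
    return scores.get((word, topic), 0.0)


topic = assign_topic(
    ["train ticket", "destination"], ["credit", "travel"], match_degree, min_match=0.5
)
```

Returning `None` when no preset topic qualifies is a design choice of this sketch; the text does not say what happens in that case.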
The user feature generation method of the embodiments of the present disclosure can be widely used in projects involving modeling and feature development. For example, in a joint modeling project, credit data of multiple users, such as each user's borrowing time, may be obtained. Based on the borrowing time, the historical text data associated with the user within a preset period before the borrowing time, such as browsed web page content and video content, may be obtained. Based on this historical text data, multiple topic thesauri and the topic corresponding to each topic thesaurus are determined.
In this embodiment of the present disclosure, multiple pieces of second historical text data corresponding to each of multiple users may also be obtained and parsed to determine the second word segment set corresponding to each user. The word segments in the multiple second word segment sets are clustered to obtain multiple topic thesauri, and the topic corresponding to each topic thesaurus is determined according to the degree of matching between the word segments in that thesaurus and the preset topics. Multiple topic thesauri and their topics are thus derived from the second historical text data of multiple users, so that the topics and the word segments corresponding to each topic can be used to determine user features.
In one embodiment of the present disclosure, after the topic corresponding to each topic thesaurus is determined, a risk control model may be trained based on the multiple second word segment sets corresponding to the multiple users. This is described below with reference to FIG. 5, a schematic flowchart of another user feature generation method provided by an embodiment of the present disclosure.
As shown in FIG. 5, after the topic corresponding to each topic thesaurus is determined, the method further includes:
Step 501: determine the word segments under each topic in each second word segment set according to the degree of matching between each word segment in that second word segment set and the word segments corresponding to each topic.
In the present disclosure, the method of determining the word segments under each topic in the second word segment set is similar to the method described above for the first word segment set and is not repeated here.
Step 502: determine the labeled risk level corresponding to each user based on the number of word segments under each topic in each second word segment set.
In the present disclosure, the number of word segments under risk-related topics in each second word segment set may be determined from the number of word segments under each topic in that set, and the labeled risk level corresponding to each user is determined from it.
The more word segments under risk-related topics a second word segment set contains, the higher the labeled risk level.
For example, the topics related to the risk of overdue repayment include credit and installment payment, and the labeled risk level corresponding to each user may be determined according to the number of word segments under the credit and installment payment topics in that user's second word segment set.
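One way to turn the risk-related counts into a labeled risk level is to bucket the total, which keeps the level non-decreasing as the count grows. The bucket boundaries (5, 10, 20), the number of levels and the function name below are assumptions; the disclosure only states that a larger count gives a higher level.

```python
from bisect import bisect_left


def labeled_risk_level(topic_counts, risk_topics, boundaries=(5, 10, 20)):
    """Map the total count under risk-related topics to a level 0..len(boundaries)."""
    risk_count = sum(topic_counts.get(topic, 0) for topic in risk_topics)
    # The level equals the number of boundaries that the count strictly exceeds.
    return bisect_left(boundaries, risk_count)


# Hypothetical per-topic counts for one user's second word segment set.
counts = {"credit": 8, "installment payment": 4, "travel": 30}
level = labeled_risk_level(counts, ("credit", "installment payment"))
```

Counts under topics that are not risk-related, such as "travel" here, do not affect the label.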
Step 503: input the word segments under each topic in each second word segment set, together with the corresponding topics, into an initial neural network model to obtain the predicted risk level output by the initial neural network model.
In the present disclosure, the word segments under each topic in the second word segment set and the corresponding topics may be input into the initial neural network model, which performs prediction to obtain the predicted risk level corresponding to the user.
Step 504: correct the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a risk control model.
In the present disclosure, the difference between the predicted risk level and the labeled risk level may be determined. If the difference is greater than a preset threshold, the initial neural network model is corrected, and the corrected model continues to be trained with the remaining second word segment sets until the model converges, producing the risk control model.
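The predict, compare and correct loop of steps 503 and 504 can be illustrated with a deliberately small stand-in. The sketch below replaces the neural network with a single linear unit and represents each user by the share of word segments under each topic; both choices, along with the learning rate, tolerance and sample data, are assumptions and not the model described in the disclosure.

```python
def predict(weights, bias, features):
    """Predicted risk level of a single linear unit (stand-in for the network)."""
    return bias + sum(w * x for w, x in zip(weights, features))


def train_risk_model(samples, num_features, lr=0.5, tolerance=0.1, max_epochs=2000):
    """Correct the model whenever prediction and label differ by more than tolerance."""
    weights = [0.0] * num_features
    bias = 0.0
    for _ in range(max_epochs):
        corrected = False
        for features, label in samples:
            error = predict(weights, bias, features) - label
            if abs(error) > tolerance:
                # Gradient step on the squared error for this sample.
                weights = [w - lr * error * x for w, x in zip(weights, features)]
                bias -= lr * error
                corrected = True
        if not corrected:  # every sample is within tolerance: treat as converged
            break
    return weights, bias


# Hypothetical samples: ([credit share, travel share], labeled risk level).
samples = [([0.9, 0.1], 2.0), ([0.5, 0.5], 1.0), ([0.1, 0.9], 0.0)]
weights, bias = train_risk_model(samples, num_features=2)
```

A real implementation would use a deep learning framework and a held-out validation set; the point here is only the correction rule driven by the gap between predicted and labeled risk levels.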
The risk control model may be trained by deep learning, which performs better on large data sets than other machine learning methods.
In the present disclosure, the risk control model may be, for example, a credit risk control model. When determining the labeled risk level, the labeled risk level corresponding to each user may be determined according to the number of word segments under the credit and installment payment topics in that user's second word segment set, and a credit risk control model is trained accordingly. As another example, the risk control model may be an insurance risk control model. In that case the labeled risk level corresponding to each user may be determined according to the number of word segments under insurance-related topics in that user's second word segment set, and an insurance risk control model is trained accordingly.
In this embodiment of the present disclosure, after the topic corresponding to each topic thesaurus is determined, the word segments under each topic in each second word segment set may be determined according to the degree of matching between each word segment in that set and the word segments corresponding to each topic. The labeled risk level corresponding to each user is determined based on the number of word segments under each topic in each second word segment set. The word segments under each topic and the corresponding topics are input into the initial neural network model to obtain the predicted risk level, and the initial neural network model is corrected according to the difference between the predicted risk level and the labeled risk level to generate the risk control model. The word segments under each topic in each user's second word segment set can thus be used to train the risk control model.
In practical applications, after the topic corresponding to each topic thesaurus is determined, the word segments under each topic in each user's second word segment set may also be used to train a recommendation model, which is then used to push promotion information to the target user.
In one embodiment of the present disclosure, the risk control model may also be used to determine whether to respond to a user request from the target user.
In the present disclosure, when a user request sent by the target user is received, the first historical text data of the target user within a preset period before the request was initiated may be obtained and parsed to determine the first word segment set. The word segments under each topic in the first word segment set are determined according to the degree of matching between each word segment in the first word segment set and the word segments under each topic. The word segments under each topic in the target user's first word segment set and the corresponding topics are then input into the risk control model, which outputs the risk level corresponding to the target user. If that risk level is lower than a preset risk level, the target user's risk of overdue repayment is relatively small, and the user request may be responded to. It can be understood that if the risk level corresponding to the target user is greater than or equal to the preset risk level, the user request may be rejected.
Alternatively, different user requests may be set to correspond to different risk levels. If the risk level corresponding to the target user is less than or equal to the risk level corresponding to the user request, the user request may be responded to.
For example, if the user request is a credit request and the risk level of the user determined with the credit risk control model is lower than the preset risk level, the credit request may be responded to, so that the corresponding service is provided to the user.
In this embodiment of the present disclosure, when a user request sent by the target user is received, the word segments under each topic in the target user's first word segment set and the corresponding topics may be input into the risk control model to determine the risk level corresponding to the target user, and the user request is responded to when the risk level is lower than the preset risk level. The risk control model can thus be used, based on the word segments under each topic in the target user's first word segment set, to decide whether to respond to the user request, which improves the model's ability to distinguish users likely to default and can reduce economic losses.
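The two decision rules described in this embodiment differ only in the comparison used. A minimal sketch, assuming risk levels are integers; the function names, request types and per-request levels are hypothetical.

```python
def should_respond(user_risk_level, preset_risk_level):
    """Single preset level: respond only when the user's level is strictly lower."""
    return user_risk_level < preset_risk_level


def should_respond_by_request(user_risk_level, request_type, request_risk_levels):
    """Per-request levels: respond when the user's level does not exceed the request's."""
    return user_risk_level <= request_risk_levels[request_type]


# Hypothetical risk levels tolerated by each kind of request.
request_risk_levels = {"credit": 1, "account_query": 3}
grant_credit = should_respond_by_request(2, "credit", request_risk_levels)
answer_query = should_respond_by_request(2, "account_query", request_risk_levels)
```

The strict comparison in the first rule and the inclusive comparison in the second follow the wording of the text.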
To implement the above embodiments, the embodiments of the present disclosure also provide a model training method. FIG. 6 is a schematic flowchart of a model training method provided by an embodiment of the present disclosure.
As shown in FIG. 6, the model training method includes:
Step 601: obtain a training data set, where the training data set includes multiple pieces of historical text data corresponding to each of multiple users.
In the present disclosure, multiple pieces of historical text data corresponding to each of multiple users, such as browsed web page content and browsed video content, may be obtained as the training data set.
It should be noted that, in the present disclosure, the acquisition, storage and use of users' personal information all comply with relevant laws and regulations and do not violate public order and good morals.
Step 602: parse the multiple pieces of historical text data corresponding to each user to determine the word segment set corresponding to each user.
In the present disclosure, each of the multiple users corresponds to multiple pieces of historical text data. Each piece may be parsed, for example by word segmentation and deduplication, to obtain a word segment set for that piece. The word segment sets are then merged in the order in which the pieces of historical text data were generated and deduplicated, yielding the word segment set corresponding to the user.
Alternatively, when parsing the multiple pieces of historical text data corresponding to each user, the pieces may be sorted chronologically and combined into a single piece of historical text data (that is, the first historical text data may itself be the result of combining multiple pieces), which is then parsed to obtain multiple word segments that constitute the word segment set.
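The merge-and-deduplicate variant of step 602 can be sketched as follows. Whitespace splitting stands in for a real Chinese word segmenter, and the timestamped sample texts and the function name are illustrative assumptions.

```python
def merge_segment_sets(texts, tokenize):
    """Merge per-text word segments in chronological order, dropping duplicates."""
    merged = []
    seen = set()
    for _, text in sorted(texts, key=lambda item: item[0]):
        for word in tokenize(text):
            if word not in seen:
                seen.add(word)
                merged.append(word)
    return merged


# Hypothetical (generation time, text) pairs for one user, deliberately out of order.
texts = [(2, "credit card repayment"), (1, "train ticket credit")]
segments = merge_segment_sets(texts, tokenize=str.split)
```

Because the first occurrence of each word segment is kept, the result preserves the chronological order in which the user first encountered each word.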
Step 603: determine the word segments under each topic in the word segment set corresponding to each user, and the labeled risk level corresponding to each user.
Step 604: input the word segments under each topic in the word segment set corresponding to each user, together with the corresponding topics, into an initial neural network model to obtain the predicted risk level output by the initial neural network model.
Step 605: correct the initial neural network model according to the difference between the predicted risk level and the labeled risk level to generate a risk control model.
In the present disclosure, steps 603 to 605 are similar to steps 502 to 504 described above and are not repeated here.
In the present disclosure, the risk control model may be, for example, a credit risk control model. When determining the labeled risk level, the labeled risk level corresponding to each user may be determined according to the number of word segments under the credit and installment payment topics in that user's word segment set, and a credit risk control model is trained accordingly. As another example, the risk control model may be an insurance risk control model. In that case the labeled risk level corresponding to each user may be determined according to the number of word segments under insurance-related topics in that user's word segment set, and an insurance risk control model is trained accordingly.
In this embodiment of the present disclosure, a training data set is obtained, and the multiple pieces of historical text data corresponding to each user in the training data set are parsed to determine the word segment set corresponding to each user. The word segments under each topic in each user's word segment set and the labeled risk level corresponding to each user are determined. The word segments under each topic and the corresponding topics are input into the initial neural network model to obtain the predicted risk level, and the initial neural network model is corrected according to the difference between the predicted risk level and the labeled risk level to generate the risk control model. Training the risk control model on the word segments under each topic in each user's word segment set thereby improves the accuracy of the model.
In one embodiment of the present disclosure, when determining the word segments under each topic in the word segment set corresponding to each user, a document topic generation model, such as an LDA model, may be used to cluster the multiple word segment sets corresponding to the multiple users.
After the multiple word segment sets are obtained, the number of topics to be produced by clustering is set, and each word segment in the word segment set corresponding to each user is input into an initial document topic generation model to train it.
When the document topic generation model converges, a word segment probability distribution can be obtained, which contains the probability that each word segment belongs to each topic. The word segments in the multiple word segment sets can then be clustered according to this distribution. For example, for each topic, the word segments whose probability is greater than a preset probability are taken as the word segments corresponding to that topic. The word segments corresponding to each topic constitute one topic thesaurus, so multiple topic thesauri are obtained. The number of topic thesauri equals the number of topics, and each topic thesaurus contains one or more word segments.
Obtaining the multiple topic thesauri from the word segment probability distribution produced by the document topic generation model improves the accuracy of the topic thesauri.
The document topic generation model yields multiple topic thesauri but does not determine which specific topic each one corresponds to. Therefore, in the present disclosure, the degree of matching between each word segment in each topic thesaurus and the preset topics may be calculated, and the topic corresponding to each topic thesaurus is determined according to these degrees of matching.
In the present disclosure, the topic corresponding to each topic thesaurus is determined in a manner similar to step 404 described above, which is not repeated here.
After the topic corresponding to each topic thesaurus is determined, the word segments under each topic in the word segment set corresponding to each user may be determined. The method is similar to the method described above for determining the word segments under each topic in the first word segment set and is not repeated here.
When determining the labeled risk level corresponding to each user, the number of word segments under risk-related topics in each user's word segment set may be determined from the number of word segments under each topic in that set, and the labeled risk level corresponding to each user is determined from that number.
In the present disclosure, determining the labeled risk level corresponding to each user based on the number of word segments under each topic in that user's word segment set improves the accuracy of the labeling.
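To implement the above embodiments, the embodiments of the present disclosure also provide a user feature generation apparatus. FIG. 7 is a schematic structural diagram of a user feature generation apparatus provided by an embodiment of the present disclosure.
As shown in FIG. 7, the user feature generation apparatus 700 includes: a first obtaining module 710, a first parsing module 720, a first determining module 730 and a second determining module 740.
The first obtaining module 710 is configured to obtain first historical text data corresponding to a target user;
the first parsing module 720 is configured to parse the first historical text data to determine the first word segment set corresponding to the target user;
the first determining module 730 is configured to determine the number of word segments under each topic in the first word segment set according to the degree of matching between each word segment in the first word segment set and the word segments corresponding to each topic;
the second determining module 740 is configured to determine the user features corresponding to the target user according to the number of word segments under each topic in the first word segment set.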
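The four modules can be pictured as stages of one pipeline. In the sketch below each module is a method of a hypothetical class; exact membership in a topic's word list stands in for the degree of matching, the top-k topics by count stand in for the user features, and whitespace splitting stands in for word segmentation. None of these choices is specified by the disclosure.

```python
from collections import Counter


class UserFeatureGenerator:
    """Illustrative counterpart of modules 710 to 740; all names are assumptions."""

    def __init__(self, topic_words, fetch_text, tokenize, top_k=1):
        self.topic_words = topic_words  # topic -> set of word segments
        self.fetch_text = fetch_text    # user id -> historical text
        self.tokenize = tokenize        # text -> list of word segments
        self.top_k = top_k

    def obtain(self, user_id):          # first obtaining module
        return self.fetch_text(user_id)

    def parse(self, text):              # first parsing module (deduplicated)
        return list(dict.fromkeys(self.tokenize(text)))

    def count_by_topic(self, segments):  # first determining module
        return Counter({
            topic: sum(1 for s in segments if s in words)
            for topic, words in self.topic_words.items()
        })

    def features(self, counts):         # second determining module
        return [t for t, c in counts.most_common(self.top_k) if c > 0]

    def generate(self, user_id):
        segments = self.parse(self.obtain(user_id))
        return self.features(self.count_by_topic(segments))


# Hypothetical topic word lists and a one-user text store.
topic_words = {"game": {"jungling", "hero"}, "travel": {"train", "ticket", "weather"}}
history = {"u1": "train ticket weather hero train"}
generator = UserFeatureGenerator(topic_words, history.get, str.split)
user_features = generator.generate("u1")
```

Keeping each module as a separate method mirrors the apparatus structure and makes it straightforward to add the optional modules described next.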
在本公开实施例一种可能的实现方式中,该装置还可包括:In a possible implementation manner of the embodiment of the present disclosure, the apparatus may further include:
第三确定模块,用于确定所述用户特征分别与各个待推广信息之间的各个关联度;a third determining module, configured to determine each correlation degree between the user feature and each information to be promoted;
第四确定模块,用于根据所述各个关联度,确定目标推广信息;a fourth determining module, configured to determine target promotion information according to the respective degrees of association;
推送模块,用于向所述目标用户推送所述目标推广信息。A push module, configured to push the target promotion information to the target user.
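In a possible implementation of the embodiment of the present disclosure, the first determining module 730 is further configured to determine the number of word segments under topics of a specified type contained in the first word segment set, where the topics of the specified type are related to a target risk;
the second determining module 740 is further configured to determine that the target user is a user with the target risk when the number of word segments under the topics of the specified type contained in the first word segment set is greater than a preset threshold.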
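The threshold check above reduces to a single comparison. The following minimal sketch assumes per-topic counts are already available as a dictionary; the function name `has_target_risk` and the sample values are hypothetical.

```python
from typing import Dict, Set


def has_target_risk(topic_counts: Dict[str, int], risk_topics: Set[str], threshold: int) -> bool:
    """Return True when the risk-related segment count is strictly greater than the threshold."""
    risk_count = sum(topic_counts.get(topic, 0) for topic in risk_topics)
    return risk_count > threshold


counts = {"gambling": 4, "loans": 2, "travel": 9}
print(has_target_risk(counts, {"gambling", "loans"}, threshold=5))  # True
```

The comparison is strict because the text says "greater than"; a count equal to the threshold does not mark the user as having the target risk.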
In a possible implementation of the embodiment of the present disclosure, the first obtaining module 710 is further configured to obtain multiple pieces of second historical text data corresponding to each of multiple users;
the first parsing module 720 is further configured to parse the multiple pieces of second historical text data corresponding to each user to determine a second word segment set corresponding to each user;
The apparatus may further include:
a clustering module, configured to cluster the word segments in the multiple second word segment sets corresponding to the multiple users to obtain multiple topic lexicons;
a fifth determining module, configured to determine the topic corresponding to each topic lexicon according to the degree of matching between the word segments in each topic lexicon and preset topics.
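The fifth determining module's job, attaching a preset topic to each clustered lexicon, might look like the sketch below. The matching degree used here (the fraction of lexicon words that appear among a preset topic's seed words) is an assumption, as are the name `name_lexicons` and the idea that each preset topic comes with seed words.

```python
from typing import Dict, Set


def name_lexicons(
    lexicons: Dict[int, Set[str]], preset_topics: Dict[str, Set[str]]
) -> Dict[int, str]:
    """Label each clustered lexicon with the preset topic it matches best."""
    labels: Dict[int, str] = {}
    for lexicon_id, words in lexicons.items():
        # Assumed matching degree: share of lexicon words found in the topic's seed words.
        scores = {
            topic: (len(words & seeds) / len(words) if words else 0.0)
            for topic, seeds in preset_topics.items()
        }
        labels[lexicon_id] = max(scores, key=scores.get)
    return labels


lexicons = {0: {"flight", "hotel", "visa"}, 1: {"loan", "stock"}}
preset = {"travel": {"flight", "hotel"}, "finance": {"loan", "stock", "bond"}}
print(name_lexicons(lexicons, preset))  # {0: 'travel', 1: 'finance'}
```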
In a possible implementation of the embodiment of the present disclosure, the first determining module 730 is further configured to determine the word segments under each topic contained in each second word segment set, according to the degree of matching between each word segment in each second word segment set and each word segment corresponding to each topic;
The apparatus may further include:
a sixth determining module, configured to determine the labeled risk level corresponding to each user based on the number of word segments under each topic contained in each second word segment set;
a first training module, configured to input the word segments under each topic contained in each second word segment set, together with the corresponding topics, into an initial neural network model to obtain a predicted risk level output by the initial neural network model, and to correct the initial neural network model according to the difference between the predicted risk level and the labeled risk level, so as to generate a risk control model.
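The training loop described above (predict, compare with the labeled level, correct the model) can be illustrated with a very small classifier. The disclosure does not specify the network architecture, the input encoding, or the loss, so everything here is an assumption: a single-layer softmax classifier stands in for the "initial neural network model", per-topic shares stand in for the input, and cross-entropy gradient descent stands in for the correction step.

```python
import math
from typing import List, Tuple


def softmax(z: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    s = sum(exps)
    return [e / s for e in exps]


def predict(weights: List[List[float]], bias: List[float], x: List[float]) -> List[float]:
    """Return the predicted probability of each risk level for one feature vector."""
    logits = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]
    return softmax(logits)


def train(
    features: List[List[float]], labels: List[int], n_levels: int,
    epochs: int = 200, lr: float = 0.5,
) -> Tuple[List[List[float]], List[float]]:
    """Fit a single-layer softmax classifier by gradient descent on cross-entropy."""
    n_inputs = len(features[0])
    weights = [[0.0] * n_inputs for _ in range(n_levels)]
    bias = [0.0] * n_levels
    for _ in range(epochs):
        for x, y in zip(features, labels):
            probs = predict(weights, bias, x)
            for k in range(n_levels):
                # Difference between predicted and labeled level drives the correction.
                err = probs[k] - (1.0 if k == y else 0.0)
                bias[k] -= lr * err
                for j in range(n_inputs):
                    weights[k][j] -= lr * err * x[j]
    return weights, bias


# Toy data (assumed): [share of risk-topic segments, share of other segments] -> level.
X = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
y = [1, 1, 0, 0]  # 1 = high risk, 0 = low risk
w, b = train(X, y, n_levels=2)
probs = predict(w, b, [0.85, 0.15])
print(probs.index(max(probs)))
```

A real risk control model would likely use a deeper network and a richer encoding of the word segments and topics; the loop structure is the part this sketch is meant to show.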
In a possible implementation of the embodiment of the present disclosure, the apparatus may further include:
a seventh determining module, configured to, when a user request sent by the target user is obtained, input the word segments under each topic contained in the first word segment set corresponding to the target user, together with the corresponding topics, into the risk control model to determine the risk level corresponding to the target user;
a response module, configured to respond to the user request when the risk level is less than a preset risk level.
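The response module's gate can be sketched in a few lines. Risk levels are assumed to be comparable integers, and `handle_request` is a hypothetical name. The disclosure only states that the request is answered when the level is below the preset level; returning `"reject"` otherwise is an assumption.

```python
def handle_request(risk_level: int, preset_level: int) -> str:
    """Respond only when the model's risk level is strictly below the preset level."""
    if risk_level < preset_level:
        return "respond"
    # Assumption: the disclosure does not say what happens in this branch.
    return "reject"


print(handle_request(risk_level=1, preset_level=2))  # respond
```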
It should be noted that the explanations given for the foregoing embodiments of the method for generating user features also apply to the apparatus for generating user features of this embodiment, and are not repeated here.
In the embodiment of the present disclosure, the first historical text data corresponding to the target user is parsed to determine the first word segment set corresponding to the target user. Each word segment in the first word segment set is matched against the word segments under each topic to determine the number of word segments under each topic contained in the first word segment set, and the user feature corresponding to the target user is determined from those numbers. Because the user feature is based on the number of the target user's word segments under each topic, the accuracy of the resulting user feature is improved.
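To implement the above embodiments, the present disclosure further provides a model training apparatus. FIG. 8 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure.
As shown in FIG. 8, the model training apparatus includes a second obtaining module 810, a second parsing module 820, an eighth determining module 830, and a second training module 840.
The second obtaining module 810 is configured to obtain a training data set, where the training data set includes multiple pieces of historical text data corresponding to each of multiple users;
the second parsing module 820 is configured to parse the multiple pieces of historical text data corresponding to each user to determine a word segment set corresponding to each user;
the eighth determining module 830 is configured to determine the word segments under each topic contained in the word segment set corresponding to each user, and the labeled risk level corresponding to each user;
the second training module 840 is configured to input the word segments under each topic contained in the word segment set corresponding to each user, together with the corresponding topics, into an initial neural network model to obtain a predicted risk level output by the initial neural network model, and to correct the initial neural network model according to the difference between the predicted risk level and the labeled risk level, so as to generate a risk control model.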
In a possible implementation of the embodiment of the present disclosure, the eighth determining module 830 includes:
a clustering unit, configured to cluster the word segments in the multiple word segment sets corresponding to the multiple users to obtain multiple topic lexicons;
a first determining unit, configured to determine the topic corresponding to each topic lexicon according to the degree of matching between the word segments in each topic lexicon and preset topics;
a second determining unit, configured to determine the word segments under each topic contained in each word segment set, according to the degree of matching between each word segment in each word segment set and each word segment corresponding to each topic;
a third determining unit, configured to determine the labeled risk level corresponding to each user based on the number of word segments under each topic contained in the word segment set corresponding to that user.
In a possible implementation of the embodiment of the present disclosure, the clustering unit is configured to:
input each word segment in the multiple word segment sets into a document topic generation model to obtain a word segment probability distribution, where the word segment probability distribution includes the probability that each word segment belongs to each topic;
cluster the word segments in the multiple word segment sets according to the word segment probability distribution to obtain multiple topic lexicons.
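The second step of the clustering unit can be sketched as follows, assuming the document topic generation model (for example, a model of the LDA family) has already produced a probability distribution for each word segment. The distribution below is hard-coded toy data, and assigning each word segment to its single most probable topic is an assumed clustering rule; `build_topic_lexicons` is a hypothetical name.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_topic_lexicons(word_topic_probs: Dict[str, List[float]]) -> Dict[int, Set[str]]:
    """Group word segments into topic lexicons by their most probable topic index."""
    lexicons: Dict[int, Set[str]] = defaultdict(set)
    for word, probs in word_topic_probs.items():
        # Index of the topic with the highest probability for this word segment.
        best_topic = max(range(len(probs)), key=probs.__getitem__)
        lexicons[best_topic].add(word)
    return dict(lexicons)


# Toy distribution (assumed): probability that each word segment belongs to topic 0 or 1.
probs = {"flight": [0.9, 0.1], "hotel": [0.7, 0.3], "loan": [0.2, 0.8]}
print(build_topic_lexicons(probs))
```

With this input, `flight` and `hotel` land in the lexicon for topic 0 and `loan` in the lexicon for topic 1. A probability cut-off could be used instead of the argmax if a word segment should be allowed to belong to several lexicons.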
In the embodiment of the present disclosure, a training data set is obtained, and the multiple pieces of historical text data corresponding to each user in the training data set are parsed to determine the word segment set corresponding to each user. The word segments under each topic contained in each user's word segment set, and the labeled risk level corresponding to each user, are then determined. The word segments under each topic contained in each user's word segment set, together with the corresponding topics, are input into an initial neural network model to obtain the predicted risk level output by the model, and the initial neural network model is corrected according to the difference between the predicted risk level and the labeled risk level to generate a risk control model. Training the risk control model on the word segments under each topic contained in each user's word segment set thus improves the accuracy of the model.
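According to embodiments of the present disclosure, the present disclosure further provides an electronic device, a readable storage medium, and a computer program product.
FIG. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptop computers, desktop computers, workstations, personal digital assistants, servers, blade servers, mainframe computers, and other suitable computers. Electronic devices may also represent various forms of mobile devices, such as personal digital processing devices, cellular phones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are examples only and are not intended to limit the implementations of the present disclosure described and/or claimed herein.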
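As shown in FIG. 9, the device 900 includes a computing unit 901, which can perform various appropriate actions and processing according to a computer program stored in a ROM (Read-Only Memory) 902 or a computer program loaded from a storage unit 908 into a RAM (Random Access Memory) 903. The RAM 903 may also store various programs and data required for the operation of the device 900. The computing unit 901, the ROM 902, and the RAM 903 are connected to one another through a bus 904. An I/O (Input/Output) interface 905 is also connected to the bus 904.
Multiple components in the device 900 are connected to the I/O interface 905, including: an input unit 906, such as a keyboard or a mouse; an output unit 907, such as various types of displays and speakers; a storage unit 908, such as a magnetic disk or an optical disc; and a communication unit 909, such as a network card, a modem, or a wireless communication transceiver. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the Internet and/or various telecommunication networks.
The computing unit 901 may be any of various general-purpose and/or special-purpose processing components with processing and computing capabilities. Some examples of the computing unit 901 include, but are not limited to, a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), various dedicated AI (Artificial Intelligence) computing chips, various computing units that run machine learning model algorithms, a DSP (Digital Signal Processor), and any appropriate processor, controller, or microcontroller. The computing unit 901 performs the methods and processing described above, such as the method for generating user features. For example, in some embodiments, the method for generating user features may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the method for generating user features described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the method for generating user features in any other appropriate manner (for example, by means of firmware).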
Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, an FPGA (Field Programmable Gate Array), an ASIC (Application-Specific Integrated Circuit), an ASSP (Application-Specific Standard Product), an SOC (System on Chip), a CPLD (Complex Programmable Logic Device), computer hardware, firmware, software, and/or combinations thereof. These implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor. The programmable processor may be a special-purpose or general-purpose programmable processor that can receive data and instructions from a storage system, at least one input device, and at least one output device, and transmit data and instructions to the storage system, the at least one input device, and the at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. The program code may be provided to a processor or controller of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowcharts and/or block diagrams to be implemented. The program code may execute entirely on a machine, partly on a machine, as a stand-alone software package partly on a machine and partly on a remote machine, or entirely on a remote machine or server.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by, or in connection with, an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of machine-readable storage media include an electrical connection based on one or more wires, a portable computer disk, a hard disk, RAM, ROM, EPROM (Erasable Programmable Read-Only Memory) or flash memory, optical fiber, CD-ROM (Compact Disc Read-Only Memory), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide interaction with a user, the systems and techniques described herein may be implemented on a computer having: a display device for displaying information to the user (for example, a CRT (Cathode-Ray Tube) or LCD (Liquid Crystal Display) monitor); and a keyboard and a pointing device (for example, a mouse or a trackball) through which the user can provide input to the computer. Other kinds of devices can also be used to provide interaction with the user; for example, the feedback provided to the user can be any form of sensory feedback (for example, visual feedback, auditory feedback, or tactile feedback), and input from the user can be received in any form (including acoustic input, voice input, or tactile input).
The systems and techniques described herein may be implemented in a computing system that includes a back-end component (for example, a data server), or a computing system that includes a middleware component (for example, an application server), or a computing system that includes a front-end component (for example, a user computer with a graphical user interface or a web browser through which the user can interact with implementations of the systems and techniques described herein), or a computing system that includes any combination of such back-end, middleware, or front-end components. The components of the system may be interconnected by digital data communication in any form or medium (for example, a communication network). Examples of communication networks include a LAN (Local Area Network), a WAN (Wide Area Network), the Internet, and blockchain networks.
A computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The client-server relationship arises from computer programs that run on the respective computers and have a client-server relationship with each other. The server may be a cloud server, also called a cloud computing server or cloud host, which is a host product in the cloud computing service system that addresses the difficult management and weak business scalability of traditional physical hosts and VPS (Virtual Private Server) services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be noted that the above electronic device can also execute the model training method of the present disclosure.
According to embodiments of the present disclosure, the present disclosure further provides a computer program product. When instructions in the computer program product are executed by a processor, the method for generating user features or the model training method proposed in the above embodiments of the present disclosure is performed.
It should be understood that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above specific embodiments do not limit the protection scope of the present disclosure. Those skilled in the art should understand that various modifications, combinations, sub-combinations, and substitutions may be made depending on design requirements and other factors. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present disclosure shall fall within the protection scope of the present disclosure.
Claims (21)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110529089.0A CN113220999B (en) | 2021-05-14 | 2021-05-14 | User characteristic generation method and device, electronic equipment and storage medium |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113220999A (en) | 2021-08-06 |
| CN113220999B (en) | 2024-07-09 |
Family
ID=77092063
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202110529089.0A Active CN113220999B (en) | 2021-05-14 | 2021-05-14 | User characteristic generation method and device, electronic equipment and storage medium |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113220999B (en) |
Cited By (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114004291A (en) * | 2021-10-28 | 2022-02-01 | 北京百度网讯科技有限公司 | Method, apparatus, apparatus and medium for generating bag of words |
| CN114048348A (en) * | 2021-10-14 | 2022-02-15 | 盐城金堤科技有限公司 | Video quality scoring method and device, storage medium and electronic equipment |
| CN114154995A (en) * | 2021-12-08 | 2022-03-08 | 河北晓博互联网科技有限公司 | Abnormal payment data analysis method and system applied to big data wind control |
| CN115859975A (en) * | 2023-02-07 | 2023-03-28 | 支付宝(杭州)信息技术有限公司 | Data processing method, device and equipment |
Citations (8)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN105224521A (en) * | 2015-09-28 | 2016-01-06 | 北大方正集团有限公司 | Key phrases extraction method and use its method obtaining correlated digital resource and device |
| CN105630769A (en) * | 2015-12-24 | 2016-06-01 | 东软集团股份有限公司 | Document subject term extraction method and device |
| CN107704512A (en) * | 2017-08-31 | 2018-02-16 | 平安科技(深圳)有限公司 | Financial product based on social data recommends method, electronic installation and medium |
| WO2019184217A1 (en) * | 2018-03-26 | 2019-10-03 | 平安科技(深圳)有限公司 | Hotspot event classification method and apparatus, and storage medium |
| US20190362707A1 (en) * | 2017-07-26 | 2019-11-28 | Tencent Technology (Shenzhen) Company Limited | Interactive method, interactive terminal, storage medium, and computer device |
| CN112052397A (en) * | 2020-09-29 | 2020-12-08 | 北京百度网讯科技有限公司 | User feature generation method and device, electronic equipment and storage medium |
| CN112632987A (en) * | 2020-12-25 | 2021-04-09 | 北京百度网讯科技有限公司 | Word slot recognition method and device and electronic equipment |
| WO2021068610A1 (en) * | 2019-10-12 | 2021-04-15 | 平安国际智慧城市科技股份有限公司 | Resource recommendation method and apparatus, electronic device and storage medium |
Also Published As
| Publication number | Publication date |
|---|---|
| CN113220999B (en) | 2024-07-09 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| CN113553412B (en) | Question-answering processing method, question-answering processing device, electronic equipment and storage medium | |
| CN113657100B (en) | Entity identification method, entity identification device, electronic equipment and storage medium | |
| KR102597357B1 (en) | Method and System for Sentiment Analysis of News Articles based on AI | |
| CN113220999B (en) | User characteristic generation method and device, electronic equipment and storage medium | |
| CN111191428A (en) | Comment information processing method, apparatus, computer equipment and medium | |
| CN112749344A (en) | Information recommendation method and device, electronic equipment, storage medium and program product | |
| CN113392218A (en) | Training method of text quality evaluation model and method for determining text quality | |
| US20210049443A1 (en) | Densely connected convolutional neural network for service ticket classification | |
| CN114387061A (en) | Product push method, device, electronic device and readable storage medium | |
| CN112926308B (en) | Method, device, equipment, storage medium and program product for matching text | |
| CN112784589B (en) | A method, device and electronic device for generating training samples | |
| JP7369228B2 (en) | Method, device, electronic device, and storage medium for generating images of user interest | |
| CN114416995A (en) | Information recommendation method, device and equipment | |
| US20230114673A1 (en) | Method for recognizing token, electronic device and storage medium | |
| CN115714002A (en) | Depression risk detection model training method, depression state early warning method and related equipment | |
| CN110457691B (en) | Script role based emotional curve analysis method and device | |
| CN114255067A (en) | Data pricing method and device, electronic equipment and storage medium | |
| CN116235163A (en) | Inference-based natural language interpretation | |
| CN113505293A (en) | Information pushing method and device, electronic equipment and storage medium | |
| CN116402166B (en) | Training method and device of prediction model, electronic equipment and storage medium | |
| CN114358736A (en) | Customer service ticket generation method, device, storage medium and electronic device | |
| CN114416976A (en) | Text annotation method, device and electronic equipment | |
| CN115809334B (en) | Training method, text processing method and device for event correlation classification model | |
| US12417345B2 (en) | Method and apparatus for constructing object relationship network, and electronic device | |
| CN113360602A (en) | Method, apparatus, device and storage medium for outputting information |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||
| GR01 | Patent grant |