CN109065174B - Medical record theme acquisition method and device considering similarity constraint - Google Patents
- Publication number
- CN109065174B (application CN201810843072.0A)
- Authority
- CN
- China
- Prior art keywords
- medical record
- similarity
- topic
- distribution
- subject
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16H—HEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
- G16H50/00—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics
- G16H50/70—ICT specially adapted for medical diagnosis, medical simulation or medical data mining; ICT specially adapted for detecting, monitoring or modelling epidemics or pandemics for mining of medical data, e.g. analysing previous cases of other patients
Landscapes
- Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Public Health (AREA)
- Biomedical Technology (AREA)
- Databases & Information Systems (AREA)
- Pathology (AREA)
- Epidemiology (AREA)
- General Health & Medical Sciences (AREA)
- Primary Health Care (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The present invention provides a method and device for obtaining medical record topics under similarity constraints. The method includes: calculating the similarity between any two medical record documents in the initial medical records, and obtaining a similarity-constrained medical record set composed of the medical record documents whose similarity is greater than or equal to a similarity threshold; and inputting each medical record document in the similarity-constrained medical record set into a preset LDA model in turn, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model. This embodiment can thus closely simulate the reasoning a doctor follows when writing medical record text during diagnosis and treatment, which helps improve the accuracy of the topics obtained.
Description
Technical field
The invention relates to the technical field of data mining, and in particular to a method and device for obtaining medical record topics under similarity constraints.
Background
At present, topic models are mostly applied to analysing how topics of online public opinion evolve on social media. The distribution of online topics over different time periods makes it possible to monitor changes in public opinion effectively, and even to actively guide its direction. Topic models have also seen limited use in clinical diagnosis and treatment, where the aim is to analyse disease-medication and disease-symptom patterns in medical record documents. The analysis feeds each medical record document into the model as an independent sample and obtains the final topic analysis result after extensive training.
However, in the course of implementing the present solution the inventors found the following. On the one hand, the condition develops similarly in two patients with the same disease, so the plan a doctor draws up is influenced by the plans previously used for similar patients. On the other hand, patients differ individually, for example in constitution, gender, age, and disease stage, so doctors issue different diagnosis and treatment plans for different patients. In practice there may be two patients with similar physical condition and similar diseases, whose diagnosis and treatment plans therefore also share similar parts. For example, a diabetic patient may suffer from several diabetic complications at once, yet the treatment plan and disease course for the same complication should be similar across patients.
Summary of the invention
In view of the deficiencies of the prior art, the present invention provides a method and device for obtaining medical record topics under similarity constraints, so as to solve the technical problems in the related art.
In a first aspect, an embodiment of the present invention provides a method for obtaining medical record topics under similarity constraints, the method including:
calculating the similarity between any two medical record documents in the initial medical records, and obtaining a similarity-constrained medical record set composed of the medical record documents whose similarity is greater than or equal to a similarity threshold;
inputting each medical record document in the similarity-constrained medical record set into a preset LDA model in turn, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.
Optionally, calculating the similarity between any two medical record documents in the initial medical records includes:
obtaining multiple similarity calculation factors of the medical records and the weight of each factor;
calculating, for any two medical record documents, the value of each similarity calculation factor;
calculating the similarity of the two medical record documents from the factor values and the factor weights.
Optionally, the similarity calculation factors include: the distance between gender attributes, the distance between age segments, and the distance between diagnosis results.
Optionally, deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model includes:
randomly assigning a topic number z to every word of every medical record document in the similarity-constrained medical record set;
rescanning the similarity-constrained medical record set and resampling the topic of every word according to the sampling formula [equation image missing from the source text], repeating until Gibbs sampling converges;
counting the topic-word co-occurrence frequency matrix of the corpus to obtain the document-topic distribution and the topic-word distribution.
Optionally, the preset LDA model includes:
the similarity constraint between any two medical record documents, expressed as the topic distribution distance dis(θr_m, θr_n) [equation image missing from the source text],
where θr_m = {θ_{m,1}, θ_{m,2}, …, θ_{m,L_m}} indicates that medical record document m contains L_m disease course records; θ_{m,L_m} is the topic of the L_m-th disease course record; and d(θ_{m,L_m}, θ_{n,L_n}) is the Euclidean distance between the topic vectors of two disease course records;
the preset LDA model further includes a Gibbs-EM iteration function [equation image missing from the source text],
in which the count term denotes the number of occurrences of word i assigned to topic k in the similarity-constrained medical record set.
In a second aspect, an embodiment of the present invention provides a device for obtaining medical record topics under similarity constraints, the device including:
a medical record set acquisition module, configured to calculate the similarity between any two medical record documents in the initial medical records and obtain a similarity-constrained medical record set composed of the medical record documents whose similarity is greater than or equal to a similarity threshold;
a topic distribution derivation module, configured to input each medical record document in the similarity-constrained medical record set into a preset LDA model in turn, and to derive the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.
Optionally, the medical record set acquisition module includes:
a weight acquisition unit, configured to obtain multiple similarity calculation factors of the medical records and the weight of each factor;
a factor value calculation unit, configured to calculate, for any two medical record documents, the value of each similarity calculation factor;
a similarity calculation unit, configured to calculate the similarity of the two medical record documents from the factor values and the factor weights.
Optionally, the similarity calculation factors include: the distance between gender attributes, the distance between age segments, and the distance between diagnosis results.
Optionally, the topic distribution derivation module includes:
a topic numbering unit, configured to randomly assign a topic number z to every word of every medical record document in the similarity-constrained medical record set;
a topic iteration unit, configured to rescan the similarity-constrained medical record set and resample the topic of every word according to the sampling formula [equation image missing from the source text], repeating until Gibbs sampling converges;
a topic distribution calculation unit, configured to count the topic-word co-occurrence frequency matrix of the corpus to obtain the document-topic distribution and the topic-word distribution.
Optionally, the preset LDA model includes:
the similarity constraint between any two medical record documents, expressed as the topic distribution distance dis(θr_m, θr_n) [equation image missing from the source text],
where θr_m = {θ_{m,1}, θ_{m,2}, …, θ_{m,L_m}} indicates that medical record document m contains L_m disease course records; θ_{m,L_m} is the topic of the L_m-th disease course record; and d(θ_{m,L_m}, θ_{n,L_n}) is the Euclidean distance between the topic vectors of two disease course records;
the preset LDA model further includes a Gibbs-EM iteration function [equation image missing from the source text],
in which the count term denotes the number of occurrences of word i assigned to topic k in the similarity-constrained medical record set.
As the above technical solutions show, the embodiments of the present invention calculate the similarity between pairs of medical record documents, select from the initial medical records the documents whose similarity is greater than or equal to the similarity threshold, and then use the resulting similarity-constrained medical record set as the documents for topic analysis. This embodiment can thus closely simulate the reasoning a doctor follows when writing medical record text during diagnosis and treatment, which helps improve the accuracy of the topics obtained.
Description of drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of a method for obtaining medical record topics under similarity constraints according to an embodiment of the present invention;
FIG. 2 shows a disease course record in a medical record document;
FIG. 3 shows the distribution of the number of diabetic complications in male patients;
FIG. 4 shows the distribution of the number of diabetic complications in female patients;
FIG. 5 is a schematic diagram of the relationship between the number of topics and the similarity constraint indicator SIM for similarity thresholds of 0.5 and 0.6;
FIG. 6 is a schematic diagram of the relationship between the number of topics and the similarity constraint indicator SIM for similarity thresholds of 0.7 and 0.8;
FIG. 7 shows the relationship between the number of topics and mutual information;
FIGS. 8 to 10 are block diagrams of a device for obtaining medical record topics under similarity constraints according to an embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.
FIG. 1 is a schematic flowchart of a method for obtaining medical record topics under similarity constraints according to an embodiment of the present invention. Referring to FIG. 1, the method includes:
101. calculating the similarity between any two medical record documents in the initial medical records, and obtaining a similarity-constrained medical record set composed of the medical record documents whose similarity is greater than or equal to a similarity threshold;
102. inputting each medical record document in the similarity-constrained medical record set into a preset LDA model in turn, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.
The steps of the method are described in detail below with reference to the drawings and embodiments.
First, step 101: calculating the similarity between any two medical record documents in the initial medical records, and obtaining a similarity-constrained medical record set composed of the medical record documents whose similarity is greater than or equal to the similarity threshold.
During hospitalization a patient accumulates various records, such as admission records, discharge records, disease course records, and consultation records. Computing similarity directly over all of these records would greatly increase the amount of computation. For convenience, the records before processing are called the initial medical records in this embodiment.
To reduce the amount of computation, this embodiment considers only the similarity of the admission diagnosis part of the initial medical records. In one embodiment, similarity means computing the distance between any two initial medical records, and building the medical record similarity constraint can be understood as collecting the set of medical records whose pairwise distance is less than a certain threshold.
In practice, an initial medical record may also list several complications of one condition. Diabetes, for example, leads to multiple complications, as shown in Table 1.
Table 1. Examples of complications in diabetic patients
Table 1 shows that patients in different age groups present diabetes and its complications differently. Patients of different ages also tolerate drugs differently, so presentation and medication differ during clinical diagnosis and treatment. The patient's basic information therefore needs to be considered when calculating the similarity of medical record documents. In this embodiment, the patient's gender and age are included as similarity calculation factors.
In one embodiment, the gender-attribute distance is set to 1 for two patients of the same gender and to 0 for two patients of different genders:
d(sex_i, sex_j) = 1 if sex_i = sex_j, and 0 otherwise,
where sex_i and sex_j denote the genders of two different patients.
In one embodiment, age is divided into four segments following the international population age structure: juvenile, 0 to 17 years, coded 1; youth, 18 to 45 years, coded 2; middle-aged, 46 to 59 years, coded 3; elderly, over 59 years, coded 4. The distance between the age segments of two patients can then be computed [equation image missing from the source text],
where age_i and age_j denote the ages of two different patients, and flag_i and flag_j denote the segments those ages fall in. The closer the two segments, the smaller the distance; the farther apart they are, the larger the distance.
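The age segmentation described above can be sketched as a small lookup. This is only an illustration: the function name `age_segment` is hypothetical, and the middle-aged band is assumed to be 46 to 59, which is what the neighbouring bands (youth up to 45, elderly over 59) imply. The distance formula between segments is not reproduced in the text, so the sketch stops at the segment code.

```python
def age_segment(age: int) -> int:
    """Map an age in years to the segment code 1..4 used in the description.

    Assumed bands: 0-17 -> 1, 18-45 -> 2, 46-59 -> 3, over 59 -> 4.
    """
    if age < 0:
        raise ValueError("age must be non-negative")
    if age <= 17:
        return 1  # juvenile
    if age <= 45:
        return 2  # youth
    if age <= 59:
        return 3  # middle-aged (assumed band)
    return 4      # elderly


if __name__ == "__main__":
    for years in (10, 30, 50, 70):
        print(years, age_segment(years))
```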
Because the initial medical records use discrete, textual descriptions, this embodiment uses the Jaccard distance to measure the distance between the diagnosis results of different initial medical records:
d(dia_i, dia_j) = |dia_i ∩ dia_j| / |dia_i ∪ dia_j|,
where dia_i and dia_j denote the Boolean vector spaces of the discharge diagnoses of medical record i and medical record j; this embodiment mainly considers the conditions that make up diabetic complications.
For example, if dia_i = {1, 2, 3} and dia_j = {2, 3, 4}, then dia_i ∩ dia_j = {2, 3} and dia_i ∪ dia_j = {1, 2, 3, 4}, so d(dia_i, dia_j) = 2/4 = 0.5.
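The worked example above can be checked with a few lines of Python. This is a minimal sketch with a hypothetical function name; it computes intersection size over union size, as in the example, which is the quantity usually called the Jaccard similarity coefficient. Returning 0.0 for two empty diagnosis sets is an assumption, since the text does not cover that case.

```python
def jaccard_overlap(dia_i: set, dia_j: set) -> float:
    """Return |dia_i & dia_j| / |dia_i | dia_j| for two sets of diagnosis codes."""
    union = dia_i | dia_j
    if not union:
        return 0.0  # assumption: two empty diagnosis sets share nothing
    return len(dia_i & dia_j) / len(union)


if __name__ == "__main__":
    # Diagnosis codes from the example in the text.
    print(jaccard_overlap({1, 2, 3}, {2, 3, 4}))  # 0.5
```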
Note that this embodiment only considers the case where the similarity calculation factors are the distance between gender attributes, the distance between age segments, and the distance between diagnosis results. When the text topic acquisition method is applied in a different scenario, the set of similarity calculation factors can be adjusted accordingly, and the adjusted solution also falls within the protection scope of this application.
Once the similarity calculation factors are determined, weight parameters μ_1, μ_2, μ_3 are set and the similarity between any two initial medical records is calculated as follows:
sim(T_i, T_j) = μ_1·d(sex_i, sex_j) + μ_2·d(age_i, age_j) + μ_3·d(dia_i, dia_j) (3)
μ_1 + μ_2 + μ_3 = 1 (4)
0 ≤ μ_1, μ_2, μ_3 ≤ 1 (5)
Finally, the similarity is compared with the similarity threshold τ. The initial medical records whose similarity is greater than or equal to the threshold are selected and form the similarity-constrained medical record set, denoted D = {(T_i, T_j) | i, j ∈ [1, M]}.
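Equations (3) to (5) and the threshold step can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function names, the default weights, and the input layout (a dict from record-index pairs to precomputed factor values) are all hypothetical. The three factor values are taken as inputs so that no particular age-distance formula has to be assumed.

```python
def weighted_similarity(d_sex, d_age, d_dia, mu=(0.2, 0.3, 0.5)):
    """Equation (3): weighted sum of the three factor values.

    The default weights are illustrative only; equations (4) and (5) require
    each weight to lie in [0, 1] and the weights to sum to 1.
    """
    if min(mu) < 0 or max(mu) > 1 or abs(sum(mu) - 1.0) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    mu1, mu2, mu3 = mu
    return mu1 * d_sex + mu2 * d_age + mu3 * d_dia


def constraint_set(factors, tau, mu=(0.2, 0.3, 0.5)):
    """Keep the record pairs whose similarity is at least the threshold tau.

    `factors` maps a pair (i, j) to its tuple (d_sex, d_age, d_dia).
    """
    return [pair for pair, d in factors.items()
            if weighted_similarity(*d, mu=mu) >= tau]


if __name__ == "__main__":
    factors = {(1, 2): (1, 0.5, 0.5), (1, 3): (0, 0.0, 0.1)}
    print(constraint_set(factors, tau=0.5))
```

Passing precomputed factor values keeps the weighting and thresholding logic separate from how each factor is measured, which is useful because the text allows the factor set to change with the application scenario.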
Next, step 102: inputting each medical record document in the similarity-constrained medical record set into a preset LDA model in turn, and deriving the document-topic distribution and topic-word distribution of each medical record document through the preset LDA model.
In this embodiment, the preset LDA model is an improvement on the existing LDA model. To help the skilled person understand the preset LDA model, the basic principle of LDA is described first.
Latent Dirichlet Allocation (LDA) is a topic model whose purpose is to discover the topics of documents. It has a three-layer structure of documents, topics, and words. Each document has its own probability distribution over topics, and the words of a document are sampled from the word distributions of different topics, as shown in equation (6):
p(word | document) = Σ_topic p(word | topic) · p(topic | document) (6)
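Equation (6) is a matrix product when the two distributions are stored as arrays. The sketch below uses NumPy with small made-up numbers; the function name and the array names `theta` (documents by topics) and `phi` (topics by words) are illustrative assumptions, not notation fixed by the text.

```python
import numpy as np


def word_given_document(theta: np.ndarray, phi: np.ndarray) -> np.ndarray:
    """p(word | document) = sum over topics of p(word | topic) * p(topic | document)."""
    return theta @ phi


if __name__ == "__main__":
    theta = np.array([[0.7, 0.3],        # document 0: mostly topic 0
                      [0.2, 0.8]])       # document 1: mostly topic 1
    phi = np.array([[0.5, 0.4, 0.1],     # topic 0 over a 3-word vocabulary
                    [0.1, 0.2, 0.7]])    # topic 1 over the same vocabulary
    print(word_given_document(theta, phi))
```

Each row of the result is again a probability distribution over the vocabulary, because each row of `theta` and of `phi` sums to 1.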
The LDA model is used to model the medical record documents. Let M be the total number of medical record documents, let the m-th document contain N_m clinical description words, and denote each word by ω_{m,n}. Following the existing bag-of-words model, documents and words are represented by a document-topic distribution and a topic-word distribution. In medical record text, a topic can be understood as a general term for a clinical care measure such as medication, observation, symptoms, or surgery. Each medical record text is a multinomial distribution over several topics, that is, each medical record text is composed of several steps of the clinical care process.
In the related art, the steps by which the LDA model generates medical record text are shown in Table 2.
Each topic is a multinomial distribution over several words, which corresponds to each clinical care step comprising several concrete clinical operations. The document-topic distribution and the topic-word distribution follow Dirichlet priors with parameters α and β respectively. The LDA model can therefore closely simulate the reasoning a doctor follows when writing medical record text during diagnosis and treatment.
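Since Table 2 is not reproduced in this text, the following sketch shows the standard LDA generative process as commonly described in the literature, not necessarily the exact steps of Table 2. The function name, parameter defaults, and fixed document length are assumptions made for illustration; symmetric Dirichlet priors are assumed.

```python
import numpy as np


def generate_corpus(n_docs, doc_len, n_topics, vocab_size,
                    alpha=0.5, beta=0.5, seed=0):
    """Sample a toy corpus from the standard LDA generative process."""
    rng = np.random.default_rng(seed)
    # One word distribution per topic, drawn from Dirichlet(beta).
    phi = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    docs, thetas = [], []
    for _ in range(n_docs):
        # Topic mixture of this document, drawn from Dirichlet(alpha).
        theta = rng.dirichlet(np.full(n_topics, alpha))
        # For every word position: pick a topic, then pick a word from that topic.
        topics = rng.choice(n_topics, size=doc_len, p=theta)
        words = [int(rng.choice(vocab_size, p=phi[k])) for k in topics]
        docs.append(words)
        thetas.append(theta)
    return docs, np.array(thetas), phi


if __name__ == "__main__":
    docs, thetas, phi = generate_corpus(n_docs=3, doc_len=8, n_topics=2, vocab_size=6)
    for words in docs:
        print(words)
```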
From the above analysis, the purpose of LDA inference is to estimate the unknown parameters of the LDA model from the current test document set and, from these parameters, to compute the topic-word distribution and the document-topic distribution. In practice, both distributions can be derived directly during the computation, without estimating those parameters explicitly.
In practice, parameter inference for the LDA model uses one of two algorithms: Gibbs sampling or variational EM. Both are introduced below.
First, Gibbs sampling is based on the Markov chain Monte Carlo (MCMC) method. In each iteration it changes the value of only one dimension of the parameters, and it continues until convergence, at which point the parameter estimates are output. Dirichlet parameter estimation yields the following expressions:
The derived expressions (equation images missing from the source text) give the document-topic distribution, the topic-word distribution, and the probability that a word is assigned to topic k; i denotes the index pair (m, n), that is, the n-th word of the m-th document.
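For orientation, the standard collapsed Gibbs sampling expressions for LDA, as found in the general literature, are given below. They are offered as a reference under the assumption that the missing equations follow this textbook form; the symbol φ for the topic-word distribution and the count notation n are not taken from the source. Here n_m^{(k)} counts the words of document m assigned to topic k, n_k^{(t)} counts the occurrences of vocabulary term t assigned to topic k, V is the vocabulary size, and the subscript ¬i excludes the current word i = (m, n), whose term is t.

```latex
\begin{align}
p(z_i = k \mid z_{\neg i}, w) &\propto
  \left(n_{m,\neg i}^{(k)} + \alpha_k\right)
  \frac{n_{k,\neg i}^{(t)} + \beta_t}
       {\sum_{t'=1}^{V} \left(n_{k,\neg i}^{(t')} + \beta_{t'}\right)} \\
\theta_{m,k} &= \frac{n_m^{(k)} + \alpha_k}
                     {\sum_{k'=1}^{K} \left(n_m^{(k')} + \alpha_{k'}\right)} \\
\varphi_{k,t} &= \frac{n_k^{(t)} + \beta_t}
                      {\sum_{t'=1}^{V} \left(n_k^{(t')} + \beta_{t'}\right)}
\end{align}
```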
Since there are K topics in total, K iterations are required. The training steps are shown in Table 3.
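Table 3 is not reproduced in this text. The sketch below is a plain collapsed Gibbs sampler for standard LDA, written to mirror the three steps listed earlier (random topic assignment, rescanning and resampling, counting co-occurrences). It is an illustration under assumptions: symmetric priors, a fixed number of sweeps in place of a convergence test, and hypothetical function and variable names. It does not include the similarity constraint of the preset LDA model.

```python
import numpy as np


def gibbs_lda(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iter=200, seed=0):
    """Collapsed Gibbs sampling for standard LDA on lists of word ids."""
    rng = np.random.default_rng(seed)
    n_mk = np.zeros((len(docs), n_topics))   # document-topic counts
    n_kt = np.zeros((n_topics, vocab_size))  # topic-word co-occurrence counts
    n_k = np.zeros(n_topics)                 # words assigned to each topic

    # Step 1: give every word a random topic number z.
    z = []
    for m, doc in enumerate(docs):
        z_m = rng.integers(n_topics, size=len(doc))
        z.append(z_m)
        for t, k in zip(doc, z_m):
            n_mk[m, k] += 1
            n_kt[k, t] += 1
            n_k[k] += 1

    # Step 2: rescan the corpus and resample each word's topic.
    for _ in range(n_iter):  # fixed sweep count stands in for a convergence check
        for m, doc in enumerate(docs):
            for i, t in enumerate(doc):
                k = z[m][i]
                n_mk[m, k] -= 1
                n_kt[k, t] -= 1
                n_k[k] -= 1
                p = (n_mk[m] + alpha) * (n_kt[:, t] + beta) / (n_k + vocab_size * beta)
                k = rng.choice(n_topics, p=p / p.sum())
                z[m][i] = k
                n_mk[m, k] += 1
                n_kt[k, t] += 1
                n_k[k] += 1

    # Step 3: turn the count matrices into the two distributions.
    theta = (n_mk + alpha) / (n_mk + alpha).sum(axis=1, keepdims=True)
    phi = (n_kt + beta) / (n_kt + beta).sum(axis=1, keepdims=True)
    return theta, phi


if __name__ == "__main__":
    toy_docs = [[0, 1, 0, 1, 0], [2, 3, 2, 3, 3], [0, 1, 2, 3, 1]]
    theta, phi = gibbs_lda(toy_docs, n_topics=2, vocab_size=4, n_iter=50)
    print(np.round(theta, 2))
    print(np.round(phi, 2))
```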
第二,EM变分算法在于寻找合适的参数,使得文本集中所观测到的主题-单词分布概率最大,类似于极大似然估计问题。EM变分算法分为两个迭代步骤:Second, the EM variational algorithm is to find suitable parameters to maximize the probability of the topic-word distribution observed in the text set, which is similar to the maximum likelihood estimation problem. The EM variational algorithm is divided into two iterative steps:
变分E-step考虑原步骤中后验概率p(w|α,β)公式求导困难,引入变分参数(γ,)求得近似后验概率分布q(θ,z|γ,)。The variational E-step considers the difficulty in derivation of the posterior probability p(w|α,β) formula in the original step, and introduces variational parameters (γ, ) to obtain the approximate posterior probability distribution q(θ,z|γ, ).
变分M-step根据变分E-step的变分参数最大化近似函数L(γ,β)。其中,先验狄利克雷分布参数(α,β)决定了主题-单词分布和文档-主题分布θ,w代表单词,z代表主题。The variational M-step maximizes the approximate function L(γ, β). Among them, the prior Dirichlet distribution parameters (α, β) determine the topic-word distribution and document-topic distribution θ, w represents words, and z represents topics.
由于LDA模型的迭代目标是最大化词语出现概率p(Z,W|α,β),这样可以有效满足糖尿病病程记录的数据特征,同时也会导致相似病历的主题分布出现较大差异,从而导致无法根据病历主题分布对病历进行有效的统计分析。Since the iterative goal of the LDA model is to maximize the probability of word occurrence p(Z,W|α,β), this can effectively meet the data characteristics of diabetes course records, and at the same time, it will also lead to large differences in the subject distribution of similar medical records, resulting in Valid statistical analysis of medical records based on the subject distribution of medical records is not possible.
To build a topic model that satisfies the medical record similarity constraint, this embodiment changes the convergence condition strategy of Gibbs sampling.
Each medical record contains multiple course records ordered by time, so the similarity calculation for medical record documents should consider the similarity between the course record sets of different medical record documents. In other words, the document-topic distributions of the course record sets of the medical record documents in the similarity-constrained medical record set D should be as similar as possible.
Let T_m denote the medical record numbered m, which contains L_m course records; the topic set of its course records is θr_m = {θ_{m,1}, θ_{m,2}, …, θ_{m,L_m}}. Given the course record topic sets θr_m and θr_n of two medical record documents, the medical record similarity constraint can be computed as the mean of the pairwise distances between their topic distributions, as follows:
where d(θ_{m,L_m}, θ_{n,L_n}) is the Euclidean distance between the topic vectors of two course records, and a larger dis(θr_m, θr_n) indicates a lower similarity.
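The distance dis(θr_m, θr_n) can be sketched as below. The text states only that it is the mean of pairwise Euclidean distances between course record topic distributions; averaging over all L_m × L_n pairs, and the function names `euclidean` and `record_distance`, are assumptions made for this illustration.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two topic vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def record_distance(theta_rm, theta_rn):
    """Mean pairwise Euclidean distance between two sets of topic vectors.

    theta_rm and theta_rn hold one topic vector per course record.
    A larger return value means the two medical records are less similar.
    """
    if not theta_rm or not theta_rn:
        raise ValueError("each medical record needs at least one course record")
    total = sum(euclidean(p, q) for p in theta_rm for q in theta_rn)
    return total / (len(theta_rm) * len(theta_rn))
```

For example, a record with course record topic vectors `[1, 0]` and `[0, 1]` compared with a record whose single course record has topic vector `[1, 0]` gives a distance of (0 + √2) / 2.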
The objective function to be maximized can then be modified as:
This embodiment uses the Gibbs-EM iterative method to derive the LDA model, replacing the document-topic distribution parameter α_m with a normally distributed μ_m, which gives the preset LDA model:
where μ_mk is the probability that medical record document m belongs to topic k. Since μ_m is assumed to follow a standard normal distribution, the improved objective function to be maximized is expressed as:
In addition, this embodiment first fixes the document-topic distribution α_m during sampling, so the Gibbs-EM iteration function is:
where the count term denotes the number of occurrences of word i assigned to topic k in the similarity-constrained medical record set. Because a normal distribution replaces the original α, formula (14) can be derived by stochastic gradient descent. The model training procedure is shown in Table 4:
Afterwards, each medical record document in the similarity-constrained medical record set is input in turn into the preset LDA model, and the document-topic distribution and the topic-word distribution of each medical record document are derived by the preset LDA model.
At this point, based on an analysis of the influence of text mining on medical diagnosis and of the modeling process and inference methods of the latent Dirichlet topic model, the embodiment of the present invention has designed a preset LDA model based on the medical record similarity constraint. The preset LDA model not only takes into account the similarity constraint between different medical record documents, but also defines the topic modeling objective, the inference process and the model-related metrics for medical text. The preset LDA model can therefore clearly reflect the focus of each stage of diagnosis and treatment as well as the evolution of the disease, which helps improve the scientific rigor, effectiveness and accuracy of medical record topic mining.
Below, the LDA model and the preset LDA model of the present application (hereinafter Medical Record Similarity-based Latent Dirichlet Allocation, MRS-LDA) are compared experimentally to illustrate the effectiveness and superiority of the medical record topic acquisition method considering similarity constraint provided by the embodiment of the present invention.
The initial medical records are inpatient records of patients in the Department of Endocrinology of the First Affiliated Hospital of Anhui Medical University, comprising the hospitalization records of 1294 diabetic patients from 2015 to 2017. Each medical record document mainly includes an admission record, course records (as shown in Figure 2), consultation records and a discharge record. The ratio of medical record documents for male and female patients is 648:646, which is roughly equal.
Referring to Figures 3 and 4, among the diabetic patients treated at the First Affiliated Hospital of Anhui Medical University, the number of concurrent complications identified from the admission diagnosis differs markedly across age groups and between genders. The elderly have considerably more concurrent diabetic complications than other age groups; middle-aged patients mostly have 3 to 5 concurrent complications; young patients develop diabetes but without further complications; and few young children have diabetes.
In this embodiment, the patient's gender, age and admission diagnosis from the admission record are selected as the data basis for calculating the medical record similarity constraint, and the course records written by doctors during the patient's hospitalization are used for topic analysis. During the experiment, the following processing can also be performed:
(1) Using a Python crawler approach, split the text records of each stage (admission record, discharge record, course records and so on) out of the HTML-format medical record documents of the 1294 patients, and extract the required patient information, diagnosis results and course record text.
(2) Build a dictionary and a stop word list. The present invention studies medicine-related words such as symptoms, drugs and treatments, but the medical record text also contains a large number of words irrelevant to this study. After counting the frequency of each word in the medical records, 12599 words were manually extracted and added to the stop word list. In addition, the disease names of the Chinese edition of ICD-10 were added to the dictionary as supplementary features.
(3) Use the jieba word segmentation tool in Python, together with the above dictionary and stop word list, to segment the text and remove stop words.
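Step (3) can be sketched as below. To keep the example dependency-free, the tokenizer is passed in as a parameter; the helper names `load_word_list` and `segment_and_filter` are hypothetical and are not taken from the patent.

```python
def load_word_list(lines):
    """Build a set of words from an iterable of lines, one word per line."""
    return {line.strip() for line in lines if line.strip()}


def segment_and_filter(text, tokenizer, stop_words):
    """Tokenize text, then drop stop words and whitespace-only tokens.

    tokenizer is any callable that maps a string to an iterable of tokens.
    """
    tokens = (tok.strip() for tok in tokenizer(text))
    return [tok for tok in tokens if tok and tok not in stop_words]
```

With jieba installed, the custom dictionary could be registered through `jieba.load_userdict` (given the path of the dictionary file) and `jieba.lcut` passed as the `tokenizer` argument; the stop word set would be built by passing an open stop word file to `load_word_list`.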
In topic mining of medical record documents, the number of topics affects text topic modeling, and different similarity thresholds yield different numbers of similar medical records. This embodiment therefore uses the similarity threshold and the number of topics as tuning parameters: the medical record similarity threshold τ ranges from 0.5 to 0.8, and the number of topics is K = 7, 10, 13, 15, 20, 30. The PMI-Score and the medical record similarity constraint of the model are calculated for each parameter setting.
Figures 5 and 6 compare the similarity constraint results of the MRS-LDA model and the LDA model under different topic parameters and different similarity thresholds, where the abscissa is the number of topics K and the ordinate is the similarity constraint index SIM. The comparison shows that the MRS-LDA model has a clear advantage on the medical record similarity constraint. For a fixed similarity threshold, the medical record similarity constraint declines slightly as the number of topics increases, but the MRS-LDA model still has a considerable advantage over the LDA model on this index.
Figure 7 compares the mutual information score (PMI-Score) of the MRS-LDA model and the LDA model under different topic parameters and different similarity thresholds, where the abscissa is the number of topics K and the ordinate is the PMI-Score metric. When the number of topics is K = 15, the MRS-LDA model outperforms the LDA model on the PMI-Score metric, and it also performs better than the LDA model when the medical record similarity threshold is 0.5.
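The text does not define the PMI-Score. One common definition, assumed here, is the average pointwise mutual information over all pairs of a topic's top words, with probabilities estimated from document co-occurrence. The sketch below follows that assumption; the function name `pmi_score` and the smoothing constant `eps` are illustrative choices.

```python
import math
from itertools import combinations


def pmi_score(top_words, documents, eps=1e-12):
    """Average pairwise PMI of a topic's top words.

    documents is a list of token lists; a word "occurs" in a document
    if it appears there at least once. eps avoids log(0) for word pairs
    that never co-occur.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)
    pairs = list(combinations(top_words, 2))
    if n_docs == 0 or not pairs:
        raise ValueError("need at least one document and two top words")

    def prob(*words):
        # Fraction of documents that contain all of the given words.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    total = 0.0
    for wi, wj in pairs:
        joint = prob(wi, wj)
        total += math.log((joint + eps) / (prob(wi) * prob(wj) + eps))
    return total / len(pairs)
```

A higher score means that the top words of a topic tend to appear in the same documents, which is usually read as a more coherent topic.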
The comparative experiments show that the MRS-LDA model performs well on the similarity constraint metric: for the same medical record similarity threshold and number of topics, the distance between the topic distributions of similar medical records obtained by the MRS-LDA model is smaller, so the model better describes the associations between similar medical records. In other words, the present invention adds the medical record similarity constraint when constructing the objective function, which brings the topic distributions of similar medical records closer together, suits the use scenario of medical record topic mining, and achieves high accuracy.
In a second aspect, an embodiment of the present invention provides a medical record topic acquisition device considering similarity constraint. Referring to FIG. 8, the device includes:
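a medical record set acquisition module 801, configured to calculate the similarity between any two medical record documents in the initial medical records and to obtain a similarity-constrained medical record set composed of the medical record documents whose similarity is greater than or equal to a similarity threshold;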
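a topic distribution derivation module 802, configured to input each medical record document in the similarity-constrained medical record set in turn into a preset LDA model, and to derive the document-topic distribution and the topic-word distribution of each medical record document through the preset LDA model.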
Optionally, referring to FIG. 9, the medical record set acquisition module 801 includes:
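a weight value acquisition unit 901, configured to acquire multiple similarity calculation factors of the medical records and the weight value of each similarity calculation factor;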
a factor data calculation unit 902, configured to calculate the value of each similarity calculation factor for any two medical record documents;
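a similarity calculation unit 903, configured to calculate the similarity of the two medical record documents from the value and the weight value of each similarity calculation factor.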
Optionally, the similarity calculation factors include: the distance between gender attributes, the distance between age segments, and the distance between diagnosis results.
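The work of units 901 to 903 and the threshold filtering of module 801 can be sketched as follows. This passage does not give the individual distance definitions or the weight values, so everything concrete here is an assumption: a 0/1 gender distance, a normalized distance between four hypothetical age segments, a Jaccard distance between sets of diagnosis codes, the weights (0.2, 0.3, 0.5), and similarity defined as 1 minus the weighted distance.

```python
from itertools import combinations

AGE_SEGMENTS = 4  # hypothetical: young child, young, middle-aged, elderly


def record_similarity(a, b, weights=(0.2, 0.3, 0.5)):
    """Similarity of two medical records as 1 minus a weighted distance.

    Each record is a dict with keys "gender", "age_segment" (an int in
    [0, AGE_SEGMENTS)) and "diagnoses" (a set of diagnosis codes).
    """
    d_gender = 0.0 if a["gender"] == b["gender"] else 1.0
    d_age = abs(a["age_segment"] - b["age_segment"]) / (AGE_SEGMENTS - 1)
    union = a["diagnoses"] | b["diagnoses"]
    # Jaccard distance between the two diagnosis sets.
    d_diag = 1.0 - len(a["diagnoses"] & b["diagnoses"]) / len(union) if union else 0.0
    w_gender, w_age, w_diag = weights
    return 1.0 - (w_gender * d_gender + w_age * d_age + w_diag * d_diag)


def similar_record_ids(records, threshold):
    """Ids of records in at least one pair with similarity >= threshold."""
    selected = set()
    for (i, a), (j, b) in combinations(records.items(), 2):
        if record_similarity(a, b) >= threshold:
            selected.update((i, j))
    return selected
```

With weights that sum to 1 the similarity stays in [0, 1], which matches the threshold range of 0.5 to 0.8 used in the experiments.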
Optionally, referring to FIG. 10, the topic distribution derivation module 802 includes:
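a topic numbering unit 1001, configured to randomly assign a topic number z to each word in each medical record document of the similarity-constrained medical record set;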
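a topic iteration unit 1002, configured to rescan the similarity-constrained medical record set and resample the topic of each word according to the sampling formula, such that the newly obtained topics satisfy Gibbs sampling convergence;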
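a topic distribution calculation unit 1003, configured to compute the topic-word co-occurrence frequency matrix of the corpus and obtain the document-topic distribution and the topic-word distribution.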
Optionally, the preset LDA model includes:
The similarity constraint between any two medical record documents is expressed by the topic distribution distance dis(θr_m, θr_n), given by:
where θr_m = {θ_{m,1}, θ_{m,2}, …, θ_{m,L_m}} indicates that each medical record document includes L_m course records; θ_{m,L_m} denotes the topic of the L_m-th course record; and d(θ_{m,L_m}, θ_{n,L_n}) denotes the Euclidean distance between the topic vectors of two course records;
The preset LDA model further includes a Gibbs-EM iteration function:
where the count term denotes the number of occurrences of word i assigned to topic k in the similarity-constrained medical record set.
It should be noted that the medical record topic acquisition device considering similarity constraint provided by the embodiment of the present invention corresponds one-to-one with the above method. The implementation details of the method also apply to the device, so the device is not described in further detail here.
The description of the present invention sets forth numerous specific details. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures and techniques are not shown in detail so as not to obscure the understanding of this description.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in those embodiments may still be modified, and that some or all of their technical features may be replaced by equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention, and they shall all be covered by the scope of the claims and description of the present invention.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810843072.0A CN109065174B (en) | 2018-07-27 | 2018-07-27 | Medical record theme acquisition method and device considering similarity constraint |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109065174A CN109065174A (en) | 2018-12-21 |
| CN109065174B true CN109065174B (en) | 2022-02-18 |
Family
ID=64836831
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN201810843072.0A Active CN109065174B (en) | 2018-07-27 | 2018-07-27 | Medical record theme acquisition method and device considering similarity constraint |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109065174B (en) |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||