CN113806496B

CN113806496B - Method and device for extracting entity from text sequence

Info

Publication number: CN113806496B
Application number: CN202111373041.1A
Authority: CN
Inventors: 郑俊康; 经小川; 王潇茵; 张家华; 丁醒醒
Original assignee: Aerospace Hongkang Intelligent Technology Beijing Co ltd
Current assignee: Aerospace Hongkang Intelligent Technology Beijing Co ltd
Priority date: 2021-11-19
Filing date: 2021-11-19
Publication date: 2022-02-15
Anticipated expiration: 2041-11-19
Also published as: CN113806496A

Abstract

The application provides a method and a device for extracting entities from a text sequence. The method comprises: acquiring a text sequence; calculating a first entity position probability for each character in the text sequence based on the text sequence; determining a probability mean based on the first entity position probabilities; comparing each first entity position probability with the probability mean, determining candidate characters, and adding position identifiers of the candidate characters to a first entity position list; and, based on the first entity position list, determining the characters appearing at predetermined reference positions so as to extract the first entity from the text sequence. The method and the device address the problem that low entity identification accuracy causes large deviations in the extracted semantic information: the mean of the entity position probabilities over all characters of the text sequence is computed from the entity position probability of each character, so that entity positions can be determined more accurately and the accuracy of entity extraction is improved.

Description

Method and device for extracting entity from text sequence

Technical Field

The present application relates to the field of natural language processing, and more particularly, to a method and apparatus for extracting entities from text sequences.

Background

With the rapid development of internet technology, the demand for processing natural language text data has increased dramatically, and obtaining valuable semantic information from text data has become a key research topic.

In semantic information processing of text data, it is generally necessary to extract entities and relationship information between entities from the text data, where in natural language processing, an entity refers to a set of a specific kind of things. In the extraction process, the positions of several entities in the text data and the relationships between the entities are usually determined, so as to obtain semantic information. Therefore, the accuracy of entity extraction affects the accuracy of the processing result of semantic information.

One current approach to semantic information processing is the pipeline method, which first identifies entities in the text data and then determines the relationship types among those entities. However, in this method entity identification is relatively independent, the correlation between subtasks in the identification process is ignored, and errors accumulate. This results in low entity identification accuracy, so the extracted information may deviate substantially.

Disclosure of Invention

In view of the problem that the extracted semantic information has large deviation due to low accuracy of entity identification in the conventional entity identification method, the application provides a method and a device for extracting entities from a text sequence.

According to a first aspect of the present application, there is provided a method of extracting an entity from a text sequence, the method comprising: acquiring a text sequence; calculating a first entity position probability for each character in the text sequence based on the text sequence, wherein the first entity position probability refers to the probability that the character appears at a predetermined reference position in a first entity; determining a probability mean of first entity position probabilities for all characters in the text sequence based on the first entity position probability; comparing the first entity position probability of each character with the probability mean, determining candidate characters appearing at the preset reference position according to the comparison result, and adding position identification of the candidate characters to a first entity position list, wherein the position identification represents the positions of the candidate characters in the text sequence; based on the first entity location list, determining a character appearing at the predetermined reference location from the location identifiers in the first entity location list to extract a first entity comprising the determined character from the text sequence.

Optionally, the step of determining, based on the first entity location list, a character appearing at the predetermined reference location from the location identifiers in the first entity location list, so as to extract the first entity including the determined character from the text sequence, includes: combining position identifications of candidate characters appearing at each predetermined reference position in the first entity position list into position combinations on the basis of the first entity position list according to the character head-tail direction and the direction opposite to the head-tail direction of the text sequence respectively to obtain a position group set comprising the position combinations; extracting, based on the set of position groups, characters corresponding to position identifications in each position combination in the set of position groups from the text sequence for use in determining the first entity.

Optionally, the predetermined reference position includes a head position and a tail position, the first entity position list includes a head position list and a tail position list, the head position list includes a position identifier of a candidate character as a head character of the first entity, and the tail position list includes a position identifier of a candidate character as a tail character of the first entity.

In this case, the step of determining, based on the first entity position list, a character appearing at the predetermined reference position from the position identifiers in the first entity position list, so as to extract a first entity including the determined character from the text sequence, comprises: for each position identifier in the head position list, determining, in the head-to-tail direction of the text sequence, a first position identifier in the tail position list that is adjacent to that position identifier, and combining each position identifier in the head position list with its corresponding first position identifier into a first position pair to obtain a first position pair set, wherein the first position pair set comprises the first position pair for each position identifier in the head position list; for each position identifier in the tail position list, determining, in the direction opposite to the head-to-tail direction of the text sequence, a second position identifier in the head position list that is adjacent to that position identifier, and combining each position identifier in the tail position list with its corresponding second position identifier into a second position pair to obtain a second position pair set, wherein the second position pair set comprises the second position pair for each position identifier in the tail position list; and determining a union of the first position pair set and the second position pair set, and extracting from the text sequence the character pair corresponding to each position pair in the union, together with the characters between the characters of each pair, to determine the first entity.

Optionally, the predetermined reference position includes a head position and a tail position, the first entity position probability includes a first entity head position probability and a first entity tail position probability, and the probability mean includes a head position probability mean and a tail position probability mean, wherein the step of determining the probability mean of the first entity position probabilities of all the characters in the text sequence based on the first entity position probability includes: determining the head position probability mean based on the first entity head position probabilities of all characters in the text sequence; and determining the tail position probability mean based on the first entity tail position probabilities of all characters in the text sequence.

Optionally, the step of comparing the first entity position probability of each character with the probability mean, determining candidate characters appearing at the predetermined reference position according to the comparison result, and adding the position identifiers of the candidate characters to the first entity position list includes: comparing the first entity head position probability of each character with the head position probability mean, determining head position candidate characters appearing at the head position according to the comparison result, and adding the position identifiers of the head position candidate characters to a head position list; and comparing the first entity tail position probability of each character with the tail position probability mean, determining tail position candidate characters appearing at the tail position according to the comparison result, and adding the position identifiers of the tail position candidate characters to a tail position list.

Optionally, the method further comprises: calculating, according to preset entity relationship categories between first entities and second entities, a second entity position probability of each character in the text sequence under each entity relationship category, wherein the second entity position probability refers to the probability that the character appears in a second entity that satisfies the entity relationship category with the extracted first entity, and each entity relationship category represents one entity relationship between a first entity and a second entity; and, for each extracted first entity, comparing the second entity position probability of each character in the text sequence with a probability threshold, and updating the information of the extracted first entity when the second entity position probabilities of all characters in the text sequence are less than the probability threshold under all entity relationship categories, or when, under all entity relationship categories, no character exists for at least one of the predetermined reference positions in the second entity.

Optionally, the method further comprises: and for each extracted first entity, when the probability of the second entity position of one or more characters in the text sequence under the entity relationship category is greater than or equal to a probability threshold and a character located in each predetermined reference position in the second entity exists in the one or more characters, determining the second entity according to the one or more characters, and determining the entity relationship represented by the entity relationship category as the entity relationship between the first entity and the second entity.

According to a second aspect of the present application, there is provided an apparatus for extracting an entity from a text sequence, the apparatus comprising: an acquisition unit that acquires a text sequence; a probability determination unit which calculates a first entity position probability for each character in the text sequence based on the text sequence, wherein the first entity position probability refers to the probability that the character appears at a predetermined reference position in a first entity; the mean value determining unit is used for determining the probability mean value of the first entity position probabilities of all the characters in the text sequence based on the first entity position probabilities; the list determining unit is used for comparing the first entity position probability of each character with the probability mean value, determining candidate characters appearing at the preset reference position according to the comparison result, and adding position marks of the candidate characters to the first entity position list, wherein the position marks represent the positions of the candidate characters in the text sequence; and the extraction unit is used for determining the character appearing at the preset reference position from the position identification in the first entity position list based on the first entity position list so as to extract the first entity comprising the determined character from the text sequence.

According to a third aspect of the present application, there is provided an electronic device comprising: a processor; a memory storing a computer program which, when executed by the processor, implements a method of extracting entities from a text sequence according to the first aspect of the present application.

According to a fourth aspect of the present application, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, carries out the method of extracting entities from a text sequence according to the first aspect of the present application.

According to the method and the device for extracting the entity from the text sequence, the probability mean value of the entity position of the character of the whole text sequence can be counted based on the probability distribution of the entity position of each character in the text sequence, so that the entity position can be determined more accurately, and the accuracy of entity extraction can be improved.

In order to make the aforementioned objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.

Fig. 1 shows a flowchart of a method for extracting an entity from a text sequence according to an embodiment of the present application.

Fig. 2 is a flowchart illustrating a step of determining a probability mean in a method of extracting an entity from a text sequence according to an embodiment of the present application.

Fig. 3 is a flowchart illustrating a step of determining a first entity location list in a method of extracting an entity from a text sequence according to an embodiment of the present application.

Fig. 4 is a flowchart illustrating a bidirectional entity extraction step in a method for extracting an entity from a text sequence according to an embodiment of the present application.

Fig. 5 shows a schematic block diagram of an apparatus for extracting entities from a text sequence according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. Every other embodiment that can be obtained by a person skilled in the art without making creative efforts based on the embodiments of the present application falls within the protection scope of the present application.

It should be noted that in the embodiments of the present application, the term "comprising" is used to indicate the presence of the features stated hereinafter, but does not exclude the addition of further features.

Before the present application was filed, existing entity extraction methods had low entity identification accuracy, so the extracted information deviated substantially.

Generally, entities and entity relationships extracted from text data can be represented in the form of triples, i.e., two entities extracted from a text sequence together with the relationship between them. The extraction result includes a subject entity (subject), an entity relationship (predicate), and an object entity (object). A pipeline method may be adopted to extract entities and entity relationships. However, as described above, the computation modules in the pipeline method are relatively independent: although the computation model is relatively simple to train and build, errors may accumulate between the computation modules, the mutual relations between subtasks are ignored, and information redundancy may exist.

To address this, entity extraction may be performed by a joint extraction method, in which the model extracts entities and entity relationships by using shared parameters or a novel labeling scheme. Taking the triple form as an example, the position of the subject entity may be determined first, and the determined position of the subject entity is then combined with the specific text sequence to further extract the object entity and the entity relationship of the triple.

Here, to determine the subject entity in the joint extraction method, the probability that each character (word/subword) in the text sequence occurs at a predetermined reference position of the entity may be calculated, the characters whose probability is greater than a preset fixed threshold may then be used as candidate characters constituting the entity, and the final entity characters may be determined from the candidate characters. However, the preset fixed threshold is not flexible enough: candidate characters are likely to be missed during extraction, and the missed characters cause further errors in the subsequent entity determination, so that the entity position cannot be determined accurately.

In view of this, a first aspect of the present application provides a method for extracting an entity from a text sequence, where the method is capable of counting an entity position probability mean of characters of the entire text sequence based on an entity position probability distribution of each character in the text sequence, so as to determine an entity position more accurately, and improve accuracy of entity extraction.

As shown in fig. 1, a method for extracting an entity from a text sequence according to an embodiment of the present application may include:

Step S10, acquiring a text sequence. In this step, the text may refer to natural language text. The text sequence may include a plurality of characters, each character having a predetermined sequence position in the text sequence. Here, the characters may also be referred to as subwords/words.

Step S20, calculating a first entity position probability for each character in the text sequence based on the text sequence.

In this step, the first entity position probability refers to a probability that a character appears at a predetermined reference position in the first entity.

Here, the predetermined reference position may include a head position and a tail position of the first entity. Thus, once the characters appearing at the predetermined reference positions of the first entity are determined, the character at the head position, the character at the tail position, and all characters between the head position and the tail position of the first entity may be extracted from the text sequence to constitute the first entity. However, the predetermined reference positions are not limited to the head position and the tail position, and positions may be added or removed according to actual needs. For example, the predetermined reference positions may also include an intermediate position or other specific positions, in which case the characters at each predetermined reference position and the characters between them may be extracted from the text sequence to constitute the first entity.

Furthermore, the probability values mentioned herein can be calculated by a machine learning model trained in advance for semantic probability prediction, for example, an existing information joint extraction model can be used for calculation.

Step S30, determining a probability mean of the first entity position probabilities for all characters in the text sequence based on the first entity position probabilities.

The method can count the entity position probability mean value of the characters of the whole text sequence based on the entity position probability distribution of each character in the text sequence, thereby more accurately determining the entity position and improving the accuracy of entity extraction.

In this step, taking the predetermined reference position as an example including a head position and a tail position, the first entity position probability may include a first entity head position probability and a first entity tail position probability, where the first entity head position probability refers to a probability that a character appears at a head position in the first entity, and the first entity tail position probability refers to a probability that a character appears at a tail position in the first entity.

Accordingly, the probability mean may include a head position probability mean, which refers to a probability mean of first entity head position probabilities of all characters, and a tail position probability mean, which refers to a probability mean of first entity tail position probabilities of all characters.

In this example, as shown in fig. 2, step S30 may include: S31, determining the head position probability mean based on the first entity head position probabilities of all characters in the text sequence; S32, determining the tail position probability mean based on the first entity tail position probabilities of all characters in the text sequence.
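By way of non-limiting illustration, steps S31 and S32 might be sketched as follows. The function name `position_probability_means` and the sample probability values are hypothetical and are not taken from the application; the sketch assumes that the mean is the arithmetic mean over all characters of one text sequence.

```python
from statistics import fmean


def position_probability_means(head_probs, tail_probs):
    """Return (head_mean, tail_mean) over all characters of one text sequence.

    head_probs[i] and tail_probs[i] are the first entity head and tail
    position probabilities of the i-th character (assumed model outputs).
    """
    if not head_probs or len(head_probs) != len(tail_probs):
        raise ValueError("expected one head and one tail probability per character")
    return fmean(head_probs), fmean(tail_probs)


# Hypothetical probabilities for a four-character text sequence.
head_mean, tail_mean = position_probability_means(
    [0.8, 0.1, 0.2, 0.1],
    [0.1, 0.7, 0.1, 0.3],
)
```

Each mean is computed per text sequence, so the comparison value used in step S40 changes from one text sequence to the next.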

Step S40, comparing the first entity position probability of each character with the probability mean, determining candidate characters appearing at the predetermined reference position according to the comparison result, and adding the position identifiers of the candidate characters to the first entity position list.

Here, the position identifier of a character indicates the position of that character in the text sequence. Since each character has a definite and unique position in the text sequence, each character can be marked by a position identifier; for example, the sequence number of the character in the text sequence can be used as the position identifier. Take as an example the 12-character text sequence {happy day in poetry, i.e. white curio}, a sentence about the poet Bai Juyi (白居易) and his courtesy name Letian (乐天, literally "happy day"): the position identifier of the character "happy" (乐) can be "6", and the position identifier of the character "day" (天) can be "7".
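As a minimal sketch of this convention, the position identifiers can be produced by numbering the characters from 1, which matches the example above in which the last of 12 characters would receive the identifier "12". The function name `position_identifiers` and the sample text are hypothetical.

```python
def position_identifiers(text_sequence):
    """Map each 1-based sequence number to the character at that position."""
    return {index: char for index, char in enumerate(text_sequence, start=1)}


# Hypothetical six-character text sequence.
ids = position_identifiers("Turing")
first_char = ids[1]  # "T"
last_char = ids[6]   # "g"
```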

In this step, the first entity position probability of each character may be compared with the probability mean to determine whether each character is a candidate character appearing at a predetermined reference position, and if so, the position identification of the candidate character is added to the first entity position list. Specifically, when the comparison result indicates that the probability of the first entity position is greater than or equal to the probability mean, the character can be predicted to appear at the first entity position and be taken as a candidate character; when the comparison indicates that the first entity position probability is less than the mean probability, it can be predicted that the character does not appear at the first entity position.

Here, a candidate character is a character predicted to be located at a predetermined reference position of the first entity, and the position of the first entity in the text sequence may be determined according to the positions of the candidate characters in the text sequence, as described in detail below.

In an example where the predetermined reference position includes a head position and a tail position, as shown in fig. 3, step S40 may include: S41, comparing the first entity head position probability of each character with the head position probability mean, determining head position candidate characters appearing at the head position according to the comparison result, and adding the position identifiers of the head position candidate characters to a head position list; S42, comparing the first entity tail position probability of each character with the tail position probability mean, determining tail position candidate characters appearing at the tail position according to the comparison result, and adding the position identifiers of the tail position candidate characters to a tail position list.

Taking the 12-character text sequence {happy day in poetry, i.e. white curio} as an example again, assume that the candidate character predicted for the head position of the first entity is the character "happy" (乐), and that the candidate characters predicted for the tail position of the first entity are the characters "day" (天) and "easy" (易). The head position list then contains the position identifier "6" of the character "happy", i.e. {6}, and the tail position list contains the position identifiers "7" and "12" of the characters "day" and "easy", i.e. {7, 12}.
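Steps S41 and S42 can be sketched as follows, by way of non-limiting illustration. The probability values are invented so that the result matches the position lists {6} and {7, 12} of the example; they are not outputs of any real model, and the function name `candidate_positions` is hypothetical. A character is kept when its probability is greater than or equal to the mean, following the comparison rule described for step S40.

```python
from statistics import fmean


def candidate_positions(probs):
    """Return 1-based positions whose probability is >= the sequence mean."""
    mean = fmean(probs)
    return [pos for pos, p in enumerate(probs, start=1) if p >= mean]


# Hypothetical head/tail probabilities for a 12-character text sequence.
head_probs = [0.02, 0.01, 0.03, 0.02, 0.01, 0.90,
              0.04, 0.02, 0.05, 0.03, 0.02, 0.01]
tail_probs = [0.01, 0.02, 0.01, 0.03, 0.02, 0.05,
              0.80, 0.03, 0.02, 0.04, 0.06, 0.70]

head_list = candidate_positions(head_probs)  # [6]
tail_list = candidate_positions(tail_probs)  # [7, 12]
```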

In the present application, in determining the position of the first entity, candidate characters belonging to the first entity are determined by predicting a probability that each character in the text sequence appears at a predetermined reference position in the first entity. Therefore, all the first entities can be identified in a joint calculation mode, subtasks for identifying each first entity are associated, and errors accumulated in the process of pipeline calculation and redundancy of information are reduced.

Further, a probability mean is determined separately for each predetermined reference position, the position identifiers of the candidate characters for each predetermined reference position are obtained by comparing each first entity position probability with the corresponding probability mean, and a first entity position list is established for each predetermined reference position. Compared with determining candidate characters against a probability threshold set to a fixed value, the threshold here is derived from the probability distribution over all characters in the text sequence: characters whose probability is at or above the probability mean are taken as candidate characters. This is more flexible, yields smaller errors across different text sequences, and avoids the loss of accuracy that a fixed probability threshold may cause when probability values vary widely between text sequences. In particular, it improves the accuracy and flexibility of determining the head and tail positions of an entity.
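The difference between the two thresholding modes can be illustrated with the following sketch. The fixed threshold value 0.5, the function names, and both probability lists are assumptions chosen for illustration only; they do not come from the application.

```python
from statistics import fmean


def candidates_fixed(probs, threshold=0.5):
    """Fixed-threshold mode: keep positions whose probability exceeds a constant."""
    return [pos for pos, p in enumerate(probs, start=1) if p > threshold]


def candidates_mean(probs):
    """Mean-based mode: keep positions whose probability is >= the sequence mean."""
    mean = fmean(probs)
    return [pos for pos, p in enumerate(probs, start=1) if p >= mean]


# Two hypothetical sequences with the same peak position but different scales.
confident = [0.05, 0.90, 0.05, 0.10]
hesitant = [0.05, 0.40, 0.05, 0.10]

fixed_confident = candidates_fixed(confident)  # [2]
fixed_hesitant = candidates_fixed(hesitant)    # [] (the peak is missed)
mean_confident = candidates_mean(confident)    # [2]
mean_hesitant = candidates_mean(hesitant)      # [2]
```

In the second sequence every probability is low, so the fixed threshold selects no candidate at all, whereas the mean-based comparison still selects the character that stands out from the rest of the sequence.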

Step S50, based on the first entity position list, determines a character appearing at a predetermined reference position from the position identifications in the first entity position list to extract the first entity including the determined character from the text sequence.

In this step, as an example, the position identifiers of the characters appearing at the predetermined reference positions may be searched out sequentially from the position identifiers of the candidate characters in the first entity position list according to the character arrangement direction of the text sequence. The arrangement direction may include the head-to-tail direction of the text sequence and the direction opposite to it, so that the characters appearing at the predetermined reference positions may be searched for twice, once in the head-to-tail direction and once in the tail-to-head direction. The results of the two searches may then be compared, duplicate results eliminated, and the remaining results used as the final search results.
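For the case in which the predetermined reference positions are a head position and a tail position, the two-direction search might be sketched as follows. This is a sketch under stated assumptions rather than the definitive procedure: "adjacent" is interpreted as the nearest tail at or after a head (head-to-tail search) and the nearest head at or before a tail (tail-to-head search), and a head and a tail at the same position are allowed to form a one-character entity. All function names are hypothetical.

```python
from bisect import bisect_left, bisect_right


def forward_pairs(heads, tails):
    """Head-to-tail search: pair each head with the nearest tail at or after it."""
    tails = sorted(tails)
    pairs = set()
    for head in heads:
        i = bisect_left(tails, head)
        if i < len(tails):
            pairs.add((head, tails[i]))
    return pairs


def backward_pairs(heads, tails):
    """Tail-to-head search: pair each tail with the nearest head at or before it."""
    heads = sorted(heads)
    pairs = set()
    for tail in tails:
        i = bisect_right(heads, tail)
        if i > 0:
            pairs.add((heads[i - 1], tail))
    return pairs


def bidirectional_pairs(heads, tails):
    """Union of both searches; the set union removes duplicate pairs."""
    return sorted(forward_pairs(heads, tails) | backward_pairs(heads, tails))


# Head position list {6} and tail position list {7, 12}.
pairs = bidirectional_pairs([6], [7, 12])  # [(6, 7), (6, 12)]
```

With these lists the head-to-tail search alone yields only (6, 7); the tail-to-head search additionally recovers (6, 12), which shows why the two passes are combined.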

Specifically, as shown in fig. 4, step S50 may include:

S51, based on the first entity position list, combining the position identifiers of the candidate characters appearing at each predetermined reference position in the first entity position list into position combinations, in the head-to-tail direction of the text sequence and in the direction opposite to the head-to-tail direction respectively, to obtain a position group set comprising the position combinations;

S52, based on the position group set, extracting from the text sequence the characters corresponding to the position identifiers in each position combination in the position group set, so as to determine the first entity.

In step S51, the position combination refers to a combination of position identifications of characters located at each predetermined reference position extracted from the position identifications of the candidate characters of the first entity position list, where the predetermined reference position may be one or more, and the character position identification for each of the one or more predetermined reference positions is extracted from the candidate characters, thereby obtaining one or more character position identifications to compose the position combination.

Here, since there may be a plurality of candidate characters for a single predetermined reference position in the first entity position list, there may be one or more position combinations, and the position group set includes all the position combinations.

For example, the predetermined reference positions may include position A, for which there are n candidate characters in the first entity position list, and position B, for which there are m candidate characters in the first entity position list. The n candidate characters at position A may then each be combined with the m candidate characters at position B to obtain a plurality of position combinations, each of which includes one candidate character at position A and one candidate character at position B, and the plurality of position combinations form a position group set.
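Pairing every candidate at position A with every candidate at position B is a Cartesian product, which yields n × m position combinations. The following sketch uses hypothetical candidate position identifiers and a hypothetical function name, and shows only the exhaustive pairing described in this example.

```python
from itertools import product


def position_combinations(candidates_a, candidates_b):
    """Pair every candidate position for A with every candidate position for B."""
    return list(product(candidates_a, candidates_b))


# Hypothetical candidates: n = 2 for position A, m = 3 for position B.
combos = position_combinations([2, 9], [4, 11, 15])
combo_count = len(combos)  # 6
```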

In step S52, since each position combination includes the position identifications of characters, the characters corresponding to those position identifications and the characters located between them can be extracted from the text sequence according to the position identifications in the position combination, and these characters can be used to determine the first entity. Here, since the characters are located at the predetermined reference positions of the first entity, the first entity may be determined according to the arrangement of the predetermined reference positions within the first entity.

Taking the case where the predetermined reference position includes a leading position and a trailing position as an example, the position combination may be a position pair composed of position identifications of a leading position character and a trailing position character.

Specifically, step S50 may include:

S501: for each position identifier in the head position list, determining, in the tail position list, a first position identifier adjacent to that position identifier in the head-to-tail direction of the text sequence, and combining each position identifier in the head position list with its corresponding first position identifier into a first position pair to obtain a first position pair set, wherein the first position pair set comprises the first position pair for each position identifier in the head position list;

S502: for each position identifier in the tail position list, determining, in the head position list, a second position identifier adjacent to that position identifier in the direction opposite to the head-to-tail direction of the text sequence, and combining each position identifier in the tail position list with its corresponding second position identifier into a second position pair to obtain a second position pair set, wherein the second position pair set comprises the second position pair for each position identifier in the tail position list;

S503: determining a union of the first position pair set and the second position pair set, and extracting from the text sequence the character pair corresponding to each position pair in the union and the characters between the characters of each pair, to determine the first entity.

In step S501, based on the position identifier of each head candidate character in the head position list, the position identifier of the tail candidate character adjacent to that head candidate character in the head-to-tail direction may be searched for in the tail position list according to the proximity principle.

Taking the text sequence { the "Letian" in the poem is Bai Juyi } as an example (the position identifications refer to the characters of the original Chinese sentence), if the head position list is {6} and the tail position list is {7, 12}, the first position pair found in the head-to-tail direction is {6, 7}; thus, the first position pair set is { {6, 7} }.
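The head-to-tail search of step S501 can be sketched as follows. This is a minimal sketch under stated assumptions: the function name `forward_pairs` is hypothetical, position identifications are treated as 1-based integers, and a tail position equal to the head position (a single-character entity) is allowed, which the description does not specify.

```python
def forward_pairs(head_positions, tail_positions):
    """For each head position, pair it with the nearest tail position found
    in the head-to-tail direction (proximity principle)."""
    pairs = set()
    for h in head_positions:
        # Assumption: a tail at the same position as the head is permitted.
        following = [t for t in tail_positions if t >= h]
        if following:
            pairs.add((h, min(following)))
    return pairs


first_position_pair_set = forward_pairs([6], [7, 12])
print(first_position_pair_set)  # {(6, 7)}
```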

In step S502, based on the position identifier of each tail candidate character in the tail position list, the position identifier of the head candidate character adjacent to that tail candidate character in the tail-to-head direction may be searched for in the head position list according to the proximity principle.

Taking the text sequence { the "Letian" in the poem is Bai Juyi } as an example, if the head position list is {6} and the tail position list is {7, 12}, the position pairs found in the tail-to-head direction are {6, 7} and {6, 12}, so the second position pair set is { {6, 7}, {6, 12} }. Here, the position identifications in position combinations found in different directions may all be arranged in the same order, for example in the head-to-tail order; that is, in each position combination, the head position character comes first and the tail position character comes last.
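The tail-to-head search of step S502 can be sketched in the same way. Again this is only an illustrative sketch: the function name `backward_pairs` is hypothetical, positions are assumed to be 1-based integers, and each pair is stored in (head, tail) order as the paragraph above suggests.

```python
def backward_pairs(head_positions, tail_positions):
    """For each tail position, pair it with the nearest head position found
    in the tail-to-head direction (proximity principle)."""
    pairs = set()
    for t in tail_positions:
        # Assumption: a head at the same position as the tail is permitted.
        preceding = [h for h in head_positions if h <= t]
        if preceding:
            # Store the pair head-first, regardless of the search direction.
            pairs.add((max(preceding), t))
    return pairs


second_position_pair_set = backward_pairs([6], [7, 12])
print(sorted(second_position_pair_set))  # [(6, 7), (6, 12)]
```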

In step S503, the union of the first position pair set and the second position pair set may be obtained to eliminate the duplicate position combinations contained in the two sets. For the above example, the first position pair set is { {6, 7} } and the second position pair set is { {6, 7}, {6, 12} }, so the union of the two sets is { {6, 7}, {6, 12} }.

Based on the union, the character pair corresponding to each position pair in the union, i.e., the head character and the tail character of a first entity, may be extracted from the text sequence, and then the characters located between the head character and the tail character in the text sequence may be extracted to determine all the characters constituting the first entity. For the above example, the character pairs corresponding to the position pairs in the union are { {Le, Tian}, {Le, Yi} }. Thus, two first entities can be extracted: "Letian"; and the span running from "Le" to "Yi" (that is, 'Letian" is Bai Juyi').
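Step S503 can be sketched as a set union followed by inclusive slicing. The sketch below is illustrative only: the function name `extract_spans` is hypothetical, and the token list is a stand-in in which only positions 6, 7 and 12 ("Le", "Tian", "Yi") come from the example; the remaining entries are placeholders.

```python
def extract_spans(tokens, position_pairs):
    """Map each (head, tail) pair to the tokens from head to tail inclusive.
    Position identifications are assumed to be 1-based."""
    return {pair: tokens[pair[0] - 1 : pair[1]] for pair in sorted(position_pairs)}


first_pairs = {(6, 7)}
second_pairs = {(6, 7), (6, 12)}
union = first_pairs | second_pairs  # duplicates such as (6, 7) collapse here

# Placeholder tokens; only "Le", "Tian" and "Yi" are taken from the example.
tokens = ["c1", "c2", "c3", "c4", "c5", "Le", "Tian", "c8", "c9", "c10", "c11", "Yi"]
spans = extract_spans(tokens, union)
print(spans[(6, 7)])  # ['Le', 'Tian']
```

Using a set for each position pair set makes the union remove repeated pairs without any extra bookkeeping.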

Here, compared with extracting the position pairs of head and tail characters from only a single direction (e.g., the head-to-tail direction), extracting the position pairs from two opposite directions can avoid entity extraction omissions. On the basis of the determined head and tail position lists of the first entity, entities are matched by bidirectional matching instead of the one-way proximity principle, and all entities with different positions are retained as candidates for the first entity.

For example, the text sequence may be { Lu Xun Collected Works } (four characters in the original Chinese), the predicted head position list is {1}, and the predicted tail position list is {2, 4}.

When the position pairs of head and tail characters are extracted only in the head-to-tail direction, the elements of the head position list serve as the reference of the search operation, so only the position identification "2", which is adjacent to the position identification "1", can be extracted from the tail position list, while the position identification "4" is discarded because it is not adjacent to the position identification "1". That is, the position pair set is { {1, 2} }, so the character corresponding to the position identification "4" does not appear in the extracted first entity, and the finally extracted first entity is: Lu Xun.

When the position pairs of head and tail characters are extracted in both the head-to-tail direction and the opposite direction, the elements of the head position list and of the tail position list can each serve as the reference of the search operation. A first position pair set { {1, 2} } is obtained based on the head position list, and a second position pair set { {1, 2}, {1, 4} } is obtained based on the tail position list; the union of the two sets is { {1, 2}, {1, 4} }, so two first entities are finally extracted: Lu Xun; Lu Xun Collected Works. Thus, the first entity "Lu Xun Collected Works" is prevented from being omitted in the extraction process.
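The difference between the two strategies for the four-character example can be reproduced with a short sketch. The function names are hypothetical, and the romanized tokens "Lu", "Xun", "Wen", "Ji" are assumed stand-ins for the four characters of the example; only the head list {1} and tail list {2, 4} come from the description.

```python
def forward_pairs(heads, tails):
    """Unidirectional search: nearest tail at or after each head."""
    pairs = set()
    for h in heads:
        later = [t for t in tails if t >= h]
        if later:
            pairs.add((h, min(later)))
    return pairs


def backward_pairs(heads, tails):
    """Reverse search: nearest head at or before each tail."""
    pairs = set()
    for t in tails:
        earlier = [h for h in heads if h <= t]
        if earlier:
            pairs.add((max(earlier), t))
    return pairs


def spans(tokens, pairs):
    """Join the tokens of each 1-based inclusive (head, tail) pair."""
    return [" ".join(tokens[h - 1 : t]) for h, t in sorted(pairs)]


tokens = ["Lu", "Xun", "Wen", "Ji"]  # assumed stand-ins for the four characters
heads, tails = [1], [2, 4]

unidirectional = spans(tokens, forward_pairs(heads, tails))
bidirectional = spans(tokens, forward_pairs(heads, tails) | backward_pairs(heads, tails))
print(unidirectional)  # ['Lu Xun']
print(bidirectional)   # ['Lu Xun', 'Lu Xun Wen Ji']
```

The longer entity appears only in the bidirectional result, because position 4 is reachable only when the tail position list is used as the search reference.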

It can be seen that, compared with unidirectional entity extraction, the bidirectional entity extraction method described in step S50 of the embodiment of the present application can solve the problem of entity omission that may occur when entity characters overlap (e.g., the characters "Lu" and "Xun" overlap in the entity "Lu Xun" and the entity "Lu Xun Collected Works"). The result of bidirectional extraction covers more entities than that of unidirectional extraction, which avoids omissions that would affect subsequent processing of the entities, such as determining other entities and entity relationships based on the extracted entities. This is particularly advantageous when the first entity is the subject entity of a triple: the more complete the extracted subject entities are, the more accurate the object entities and entity relationships found based on them will be, whereas when a subject entity is missing, the triple search result is likely to be inaccurate.

The process of extracting the first entity from the text sequence has been described above. According to the entity extraction method of the embodiment of the present application, the second entity and the entity relationship between the first entity and the second entity can also be determined based on the extracted first entity, as described in detail below.

Specifically, the method according to the embodiment of the present application may further include:

A first additional step: for each extracted first entity, calculating a second entity position probability of each character in the text sequence under each entity relationship category, according to the predetermined entity relationship categories between the first entity and the second entity.

Here, an entity relationship category refers to a kind of entity relationship that may exist between the first entity and the second entity, each entity relationship category representing one kind of entity relationship between the first entity and the second entity. For example, in a triple, the first entity may be the subject entity (host entity) and the second entity may be the object entity (guest entity), and an entity relationship may exist between the subject entity and the object entity.

The second entity location probability refers to the probability that a character appears in a second entity that satisfies the entity relationship category with the extracted first entity.

In this step, for each extracted first entity, a second entity position probability for each character may be calculated separately. That is, the likelihood that each character belongs to a second entity satisfying each entity relationship category with the first entity is predicted using the extracted first entity as a known quantity.

Since the position information of the first entity has already been determined after the first entity is extracted, the position information of the first entity may be fused into the source sequence information of the text sequence, and then the second entity and the entity relationship may be determined based on the sequence information of the fused text sequence.

A second additional step: for each extracted first entity, comparing the second entity position probability of each character in the text sequence with a probability threshold.

When the second entity position probability of every character in the text sequence is less than the probability threshold under all entity relationship categories, the extracted information of the first entity is updated. When the second entity position probability of one or more characters in the text sequence under an entity relationship category is greater than or equal to the probability threshold, the second entity is determined according to the one or more characters, and the entity relationship represented by that entity relationship category is determined as the entity relationship between the first entity and the second entity.

Here, the probability threshold of the second entity position probability may be a fixed value or may be a probability average value, and the probability average value is determined in the same manner as the probability average value of the first entity position probability.

In this step, if, for an extracted first entity, the second entity position probability of every character in the text sequence is less than the probability threshold, or the characters whose second entity position probability is greater than or equal to the probability threshold cannot constitute a second entity, it may be considered that no character satisfies any entity relationship with the first entity, that is, there is no second entity corresponding to the first entity. The information of that first entity may therefore be deleted from the extracted first entity information to correct the extraction result, so that the first entity does not appear in the final output, which consists of groups of a first entity, a second entity and an entity relationship. A group of a first entity, a second entity and an entity relationship is extracted only when there is a character whose second entity position probability is greater than or equal to the probability threshold.
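The filtering described in this step can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the embodiment: the function name `filter_first_entities`, the nested-dictionary layout, the entity and relationship labels, and all probability values are hypothetical.

```python
def filter_first_entities(second_entity_probs, threshold):
    """second_entity_probs maps first entity -> relationship category ->
    per-character second entity position probabilities.
    A first entity is kept only if, under at least one relationship category,
    some character reaches the threshold; otherwise it is dropped."""
    kept = {}
    for entity, by_relation in second_entity_probs.items():
        hits = {}
        for relation, probs in by_relation.items():
            # 1-based positions of characters at or above the threshold.
            positions = [i + 1 for i, p in enumerate(probs) if p >= threshold]
            if positions:
                hits[relation] = positions
        if hits:
            kept[entity] = hits
    return kept


# Hypothetical probabilities for two extracted first entities.
probs = {
    "entity_a": {"relation_x": [0.1, 0.7, 0.8, 0.2]},
    "entity_b": {"relation_x": [0.1, 0.2, 0.1, 0.05]},
}
kept = filter_first_entities(probs, threshold=0.5)
print(kept)  # {'entity_a': {'relation_x': [2, 3]}}
```

The `threshold` argument may be a fixed value or a probability mean computed beforehand, since the description allows either.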

In one case, taking the text sequence { the "Letian" in the poem is Bai Juyi } as an example, the extracted first entities are "Letian" and the span 'Letian" is Bai Juyi'. In the second additional step above, for the first entity 'Letian" is Bai Juyi', the second entity position probabilities of all characters are smaller than the probability threshold under all entity relationship categories, so this extracted first entity can be considered not to be a real entity and can be deleted to correct the extraction result of the first entity.

In another case, taking the first entity as the subject entity of a triple as an example, the subject entity extracted from the text sequence is "Letian". However, the correct triple for the above text should be ("subject": "Bai Juyi", "predicate": "courtesy name", "object": "Letian"); that is, "Letian" should be the object entity, and its corresponding subject entity is "Bai Juyi". In this case, in the second additional step above, since there is no entity relationship category with "Letian" as the subject entity, when the second entity (i.e., the object entity) is predicted with "Letian" as the subject entity, the second entity position probability of each character is smaller than the probability threshold under all entity relationship categories, and thus the first entity "Letian" can be deleted.

In yet another case, even if there is a character whose second entity position probability is greater than or equal to the probability threshold, the second entity may be considered absent if, under all entity relationship categories, there is no character located at at least one of the predetermined reference positions in the second entity.

Specifically, among the characters whose second entity position probability is greater than or equal to the probability threshold, there should be a character located at each of the predetermined reference positions in the second entity. When a character located at at least one of the predetermined reference positions in the second entity is absent, it is considered that the second entity corresponding to the first entity cannot be constituted even though there is a character whose second entity position probability is greater than or equal to the probability threshold, and the first entity is deleted.

Here, the predetermined reference position in the second entity may have the same meaning as the predetermined reference position in the first entity, and may include, for example, a head position and a tail position of the second entity, and thus, when a character appearing at the predetermined reference position of the second entity is determined, the character at the head position, the character at the tail position, and all characters between the head position and the tail position of the second entity may be extracted from the text sequence to constitute the second entity. However, the predetermined reference positions in the second entity are not limited to the leading position and the trailing position, and may be added or deleted according to actual needs, for example, the predetermined reference positions may also include intermediate positions or other specific positions, and the like, in which case characters at each predetermined reference position and characters between the characters may be extracted from the text sequence to constitute the second entity.
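The requirement that every predetermined reference position of the second entity be occupied can be sketched for the head/tail case. The sketch is illustrative only: the function name `extract_second_entity`, the use of separate head and tail probability lists, the choice of the first head candidate, and all values are assumptions rather than details given in the description.

```python
def extract_second_entity(tokens, head_probs, tail_probs, threshold):
    """Return the second entity tokens, or None when a head or tail
    candidate is missing. Positions are 1-based and inclusive."""
    heads = [i + 1 for i, p in enumerate(head_probs) if p >= threshold]
    tails = [i + 1 for i, p in enumerate(tail_probs) if p >= threshold]
    if not heads or not tails:
        return None  # a predetermined reference position has no character
    # Assumption: take the first head and the nearest tail at or after it.
    head = heads[0]
    following = [t for t in tails if t >= head]
    if not following:
        return None
    return tokens[head - 1 : min(following)]


tokens = ["a", "b", "c", "d", "e"]          # placeholder characters
head_probs = [0.1, 0.9, 0.1, 0.1, 0.1]      # hypothetical values
tail_probs = [0.1, 0.1, 0.1, 0.8, 0.1]      # hypothetical values
print(extract_second_entity(tokens, head_probs, tail_probs, 0.5))  # ['b', 'c', 'd']
print(extract_second_entity(tokens, head_probs, [0.1] * 5, 0.5))   # None
```

When `None` is returned under every entity relationship category, the corresponding first entity would be deleted, as described above.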

Therefore, through the first additional step and the second additional step, when an extracted first entity is incorrect, the extraction result of the first entity can be corrected in the extraction stage of the second entity (for example, "Letian" is not a subject entity, so no corresponding triple can be extracted in the object entity extraction stage), and the incorrectly extracted first entity is filtered out at this stage.

Another aspect of the present application relates to an apparatus for extracting entities from a text sequence. Fig. 4 shows a schematic block diagram of an apparatus for extracting entities from a text sequence according to an exemplary embodiment of the present application.

As shown in fig. 5, an apparatus for extracting an entity from a text sequence according to an exemplary embodiment of the present application includes an obtaining unit 100, a probability determining unit 200, a mean determining unit 300, a list determining unit 400, and an extracting unit 500.

The acquisition unit 100 acquires a text sequence.

The probability determination unit 200 calculates a first entity position probability for each character in the text sequence based on the text sequence. Here, the first entity position probability refers to a probability that a character appears at a predetermined reference position in the first entity.

The mean determination unit 300 determines a probability mean of the first entity position probabilities for all characters in the text sequence based on the first entity position probability.

The list determining unit 400 compares the first entity position probability of each character with the probability mean, determines a candidate character appearing at the predetermined reference position according to the comparison result, and adds a position identification of the candidate character to the first entity position list. Here, the position identity represents a position of a candidate character in the text sequence.

The extraction unit 500 determines a character appearing at the predetermined reference position from the position identifications in the first entity position list based on the first entity position list to extract a first entity including the determined character from the text sequence.

The extraction unit 500 may also: combining position identifications of candidate characters appearing at each predetermined reference position in the first entity position list into position combinations on the basis of the first entity position list according to the character head-tail direction and the direction opposite to the head-tail direction of the text sequence respectively to obtain a position group set comprising the position combinations; extracting characters corresponding to position identifications in position combinations in the position group set from the text sequence based on the position group set for determining the first entity.

The predetermined reference position may include a head position and a tail position, the first entity position list may include a head position list and a tail position list, the head position list may include position identifications of candidate characters as first characters of the first entity, and the tail position list may include position identifications of candidate characters as tail characters of the first entity.

In this case, the extraction unit 500 may further: for each position identifier in the head position list, determine, in the tail position list, a first position identifier adjacent to that position identifier in the head-to-tail direction of the text sequence, and combine each position identifier in the head position list with its corresponding first position identifier into a first position pair to obtain a first position pair set. Here, the first position pair set includes a first position pair for each position identifier in the head position list.

The extraction unit 500 may also: for each position identifier in the tail position list, determine, in the head position list, a second position identifier adjacent to that position identifier in the direction opposite to the head-to-tail direction of the text sequence, and combine each position identifier in the tail position list with its corresponding second position identifier into a second position pair to obtain a second position pair set. Here, the second position pair set includes a second position pair for each position identifier in the tail position list.

The extraction unit 500 may also: determining a union of the first set of position pairs and the second set of position pairs, and extracting character pairs corresponding to the position pairs in the union and characters between the character pairs from the text sequence to determine the first entity.

In addition, the first entity position probability may include a first entity head position probability and a first entity tail position probability, and the probability mean may include a head position probability mean and a tail position probability mean.

In this case, the mean determination unit 300 may further: determine a head position probability mean of the first entity head position probabilities of all characters in the text sequence based on the first entity head position probabilities; and determine a tail position probability mean of the first entity tail position probabilities of all characters in the text sequence based on the first entity tail position probabilities.

The list determination unit 400 may further: compare the first entity head position probability of each character with the head position probability mean, determine head candidate characters appearing at the head position according to the comparison result, and add the position identifications of the head candidate characters to the head position list; and compare the first entity tail position probability of each character with the tail position probability mean, determine tail candidate characters appearing at the tail position according to the comparison result, and add the position identifications of the tail candidate characters to the tail position list.
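The separate head and tail means can be sketched as follows. This is a hedged illustration: the function name `candidate_positions` and the probability values are hypothetical, and selecting characters whose probability is strictly greater than the mean is an assumption, since this passage only states that the comparison result is used.

```python
def candidate_positions(probs):
    """Return the 1-based positions whose probability exceeds the mean of
    all probabilities in the sequence."""
    mean = sum(probs) / len(probs)
    # Assumption: "greater than the mean" marks a candidate character.
    return [i + 1 for i, p in enumerate(probs) if p > mean]


# Hypothetical probabilities for a four-character text sequence.
head_probs = [0.9, 0.1, 0.1, 0.1]  # mean is about 0.3
tail_probs = [0.1, 0.6, 0.1, 0.8]  # mean is about 0.4

head_position_list = candidate_positions(head_probs)
tail_position_list = candidate_positions(tail_probs)
print(head_position_list, tail_position_list)  # [1] [2, 4]
```

Computing one mean per list means the head probabilities do not shift the tail cut-off, and vice versa.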

The apparatus for extracting an entity from a text sequence according to an exemplary embodiment of the present application may further include a second entity determining unit 600.

The second entity determining unit 600 may calculate, for each extracted first entity, a second entity position probability for each character in the text sequence under each entity relationship category according to predetermined entity relationship categories between the first entity and the second entity. Here, the second entity position probability refers to a probability that a character appears in a second entity satisfying entity relationship categories with the extracted first entity, each entity relationship category representing one entity relationship between the first entity and the second entity.

The second entity determining unit 600 may further compare the second entity position probability for each character in the text sequence with a probability threshold for each extracted first entity.

The second entity determining unit 600 may delete the extracted first entity when the second entity position probability of each character in the text sequence is less than the probability threshold under all entity relationship categories, or when, under all entity relationship categories, there is no character located at at least one of the predetermined reference positions in the second entity. When the second entity position probability of one or more characters in the text sequence under an entity relationship category is greater than or equal to the probability threshold, and the one or more characters include a character located at each of the predetermined reference positions in the second entity, the second entity determining unit 600 may determine the second entity according to the one or more characters, and determine the entity relationship represented by that entity relationship category as the entity relationship between the first entity and the second entity.

It should be noted that the obtaining unit 100, the probability determining unit 200, the mean determining unit 300, the list determining unit 400, the extracting unit 500, and the second entity determining unit 600 may perform the corresponding steps of the method for extracting an entity from a text sequence in the method embodiments shown in fig. 1 to fig. 4, for example, by means of machine-readable instructions executable by these units. For specific implementations of these units, reference may be made to the above-described method embodiments, which are not repeated here.
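The correspondence between the units and the method steps can be sketched as a class whose methods mirror units 100 to 500. This is only one possible arrangement under stated assumptions: the class name, the method names, the injected `probability_fn` (a stand-in for whatever model produces the probabilities), the strict comparison with the mean, and the toy probability values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical model interface: returns (head_probs, tail_probs) per character.
ProbabilityFn = Callable[[Sequence[str]], Tuple[List[float], List[float]]]


@dataclass
class EntityExtractionApparatus:
    probability_fn: ProbabilityFn  # stands in for probability determining unit 200

    def acquire(self, text):  # obtaining unit 100
        return list(text)

    def mean(self, probs):  # mean determining unit 300
        return sum(probs) / len(probs)

    def position_list(self, probs):  # list determining unit 400
        m = self.mean(probs)
        return [i + 1 for i, p in enumerate(probs) if p > m]

    def extract(self, chars, heads, tails):  # extracting unit 500
        pairs = set()
        for h in heads:  # head-to-tail direction
            later = [t for t in tails if t >= h]
            if later:
                pairs.add((h, min(later)))
        for t in tails:  # opposite direction
            earlier = [h for h in heads if h <= t]
            if earlier:
                pairs.add((max(earlier), t))
        return ["".join(chars[h - 1 : t]) for h, t in sorted(pairs)]

    def run(self, text):
        chars = self.acquire(text)
        head_probs, tail_probs = self.probability_fn(chars)
        heads = self.position_list(head_probs)
        tails = self.position_list(tail_probs)
        return self.extract(chars, heads, tails)


def toy_probabilities(chars):
    """Fixed, made-up probabilities for a four-character input."""
    return [0.9, 0.1, 0.1, 0.1], [0.1, 0.6, 0.1, 0.8]


apparatus = EntityExtractionApparatus(probability_fn=toy_probabilities)
entities = apparatus.run("ABCD")
print(entities)  # ['AB', 'ABCD']
```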

An embodiment of the present application further provides an electronic device, which includes a processor and a memory. The memory stores a computer program. When the computer program is executed by the processor, the electronic device may perform the corresponding steps of the method for extracting an entity from a text sequence in the method embodiments shown in fig. 1 to fig. 4, for example, by means of machine-readable instructions executable by the electronic device. For specific implementations of the electronic device, reference may be made to the above-described method embodiments, which are not repeated here.

The embodiment of the present application further provides a computer-readable storage medium storing a computer program, and when the computer program is executed by a processor, the computer program may perform the steps of the method for extracting an entity from a text sequence in the method embodiments shown in fig. 1 to fig. 4.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the embodiment scheme of the application.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application in essence, or the part thereof that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.

In addition, according to the method and the device for extracting the entities from the text sequence, the entities can be extracted from two directions, the entities are prevented from being omitted, and the accuracy of entity extraction is improved.

In addition, the method and the device for extracting the entity from the text sequence according to the present application can use the head position and the tail position as reference positions and extract the position pairs of head position characters and tail position characters from two directions, so as to combine the head position character, the tail position character and the characters between them into the entity.

In addition, according to the method and the device for extracting the entity from the text sequence, the head position and the tail position can be used as reference positions, probability mean values of the head position and the tail position are respectively determined, the head position probability and the tail position probability are separately processed to obtain two probability mean values, and the probability values of the head position and the tail position are compared with the corresponding probability mean values, so that the accuracy of entity extraction is further improved.

In addition, according to the method and the device for extracting the entities from the text sequence, the extraction result of the first entity can be corrected through the process of predicting the second entity based on the first entity, and the overall accuracy of semantic information extraction is improved.

Finally, it should be noted that: the above-mentioned embodiments are only specific embodiments of the present application, and are used for illustrating the technical solutions of the present application, but not limiting the same, and the scope of the present application is not limited thereto, and although the present application is described in detail with reference to the foregoing embodiments, those skilled in the art should understand that: any person skilled in the art can modify or easily conceive the technical solutions described in the foregoing embodiments or equivalent substitutes for some technical features within the technical scope disclosed in the present application; such modifications, changes or substitutions do not depart from the spirit and scope of the exemplary embodiments of the present application, and are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for extracting entities from a text sequence, the method comprising:

acquiring a text sequence;

calculating a first entity position probability for each character in the text sequence based on the text sequence, wherein the first entity position probability refers to the probability that the character appears at a predetermined reference position in a first entity;

determining a probability mean of first entity position probabilities for all characters in the text sequence based on the first entity position probability;

comparing the first entity position probability of each character with the probability mean, determining candidate characters appearing at the preset reference position according to the comparison result, and adding position identification of the candidate characters to a first entity position list, wherein the position identification represents the positions of the candidate characters in the text sequence;

determining a character appearing at the predetermined reference position from among the position identifications in the first entity position list based on the first entity position list to extract a first entity including the determined character from the text sequence,

wherein the predetermined reference position comprises a head position and a tail position, the first entity position list comprises a head position list and a tail position list, the head position list comprises position identifications of candidate characters as head characters of the first entity, and the tail position list comprises position identifications of candidate characters as tail characters of the first entity.
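The mean-based comparison recited in claim 1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the function name `candidate_positions` is hypothetical, the per-character probabilities are assumed to be already produced by some model, and a strict "greater than the mean" test is assumed because the claim only states that candidates are determined according to the comparison result.

```python
from statistics import fmean
from typing import List, Sequence, Tuple


def candidate_positions(
    head_probs: Sequence[float],
    tail_probs: Sequence[float],
) -> Tuple[List[int], List[int]]:
    """Build the head position list and the tail position list.

    head_probs[i] / tail_probs[i] are the probabilities that character i
    is the head / tail character of a first entity. Position
    identifications are assumed to be 0-based character indices.
    """
    if len(head_probs) != len(tail_probs):
        raise ValueError("both probability sequences must cover every character")
    if not head_probs:
        return [], []

    # Probability means over all characters of the text sequence.
    head_mean = fmean(head_probs)
    tail_mean = fmean(tail_probs)

    # Assumption: a character is a candidate when its probability
    # exceeds the corresponding mean.
    head_list = [i for i, p in enumerate(head_probs) if p > head_mean]
    tail_list = [i for i, p in enumerate(tail_probs) if p > tail_mean]
    return head_list, tail_list
```

Because the threshold is the mean of the sequence itself rather than a fixed constant, it adapts to each text sequence, which appears to be the point of the comparison step.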

2. The method of claim 1, wherein determining a character appearing at the predetermined reference position from the position identifiers in the first entity position list based on the first entity position list to extract the first entity including the determined character from the text sequence comprises:

combining, based on the first entity position list, the position identifications of candidate characters appearing at each predetermined reference position into position combinations, both along the head-to-tail direction of the characters of the text sequence and along the direction opposite to the head-to-tail direction, to obtain a position combination set comprising the position combinations;

extracting from the text sequence, based on the position combination set, the characters corresponding to the position identifications in each position combination in the position combination set, for use in determining the first entity.

3. The method of claim 1, wherein determining a character appearing at the predetermined reference position from the position identifiers in the first entity position list based on the first entity position list to extract the first entity including the determined character from the text sequence comprises:

for each position identifier in the head position list, determining, along the head-to-tail direction of the characters of the text sequence, a first position identifier in the tail position list that is adjacent to that position identifier, and combining the position identifier and the corresponding first position identifier into a first position pair to obtain a first position pair set, wherein the first position pair set comprises the first position pair for each position identifier in the head position list;

for each position identifier in the tail position list, determining, along the direction opposite to the head-to-tail direction of the characters of the text sequence, a second position identifier in the head position list that is adjacent to that position identifier, and combining the position identifier and the corresponding second position identifier into a second position pair to obtain a second position pair set, wherein the second position pair set comprises the second position pair for each position identifier in the tail position list;

determining a union of the first set of position pairs and the second set of position pairs, extracting from the text sequence a character pair corresponding to each position pair in the union and characters between the corresponding character pairs to determine the first entity.
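The bidirectional pairing of claim 3 can be sketched as follows. This is only an illustration under stated assumptions: the names `pair_positions` and `extract_entities` are hypothetical, "adjacent" is read as the nearest identifier in the given direction, a head and a tail may coincide (a single-character entity), and identifiers with no partner in the given direction are simply skipped.

```python
from bisect import bisect_left, bisect_right
from typing import List, Sequence, Set, Tuple


def pair_positions(
    head_list: Sequence[int], tail_list: Sequence[int]
) -> Set[Tuple[int, int]]:
    """Return the union of the first and second position pair sets."""
    heads = sorted(head_list)
    tails = sorted(tail_list)
    pairs: Set[Tuple[int, int]] = set()

    # Head-to-tail direction: each head is paired with the nearest tail
    # located at or after it.
    for h in heads:
        k = bisect_left(tails, h)
        if k < len(tails):
            pairs.add((h, tails[k]))

    # Opposite direction: each tail is paired with the nearest head
    # located at or before it.
    for t in tails:
        k = bisect_right(heads, t)
        if k > 0:
            pairs.add((heads[k - 1], t))

    return pairs


def extract_entities(
    text: str, head_list: Sequence[int], tail_list: Sequence[int]
) -> List[str]:
    """Extract the head character, the tail character and everything between."""
    return [text[h : t + 1] for h, t in sorted(pair_positions(head_list, tail_list))]
```

Under these assumptions, the backward pass can recover a pair that the forward pass misses, for example a tail whose nearest preceding head has already been matched to an earlier tail.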

4. The method of claim 1, wherein the first entity position probability comprises a first entity head position probability and a first entity tail position probability, and wherein the probability mean comprises a head position probability mean and a tail position probability mean,

wherein the step of determining the probability mean of the first entity position probabilities for all characters in the text sequence based on the first entity position probabilities comprises:

determining the head position probability mean based on the first entity head position probabilities of all characters in the text sequence;

and determining the tail position probability mean based on the first entity tail position probabilities of all characters in the text sequence.

5. The method of claim 4, wherein the step of comparing the first entity position probability of each character with the probability mean, determining candidate characters appearing at the predetermined reference position according to the comparison result, and adding the position identifications of the candidate characters to the first entity position list comprises:

comparing the first entity head position probability of each character with the head position probability mean, determining head position candidate characters appearing at the head position according to the comparison result, and adding position identifications of the head position candidate characters to the head position list;

and comparing the first entity tail position probability of each character with the tail position probability mean, determining tail position candidate characters appearing at the tail position according to the comparison result, and adding position identifications of the tail position candidate characters to the tail position list.

6. The method according to any one of claims 1 to 5, further comprising:

calculating a second entity position probability of each character in the text sequence under each entity relationship category according to preset entity relationship categories between first entities and second entities, wherein the second entity position probability refers to the probability that the character appears in a second entity that satisfies the entity relationship category with the extracted first entity, and each entity relationship category represents one entity relationship between the first entity and the second entity;

and for each extracted first entity, comparing the second entity position probability of each character in the text sequence with a probability threshold, and updating the information of the extracted first entity when the second entity position probability of every character in the text sequence is less than the probability threshold under all entity relationship categories, or when, under all entity relationship categories, no character exists at at least one of the predetermined reference positions in the second entity.

7. The method of claim 6, further comprising:

and for each extracted first entity, when the second entity position probability of one or more characters in the text sequence under an entity relationship category is greater than or equal to the probability threshold and the one or more characters include a character located at each predetermined reference position in the second entity, determining the second entity according to the one or more characters, and determining the entity relationship represented by that entity relationship category as the entity relationship between the first entity and the second entity.
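The threshold check of claims 6 and 7 can be sketched as follows. This is an illustration under stated assumptions rather than the claimed implementation: the names `find_second_entities` and `needs_update`, the default threshold of 0.5, and the dictionary layout (relationship category mapped to a pair of head and tail probability sequences) are all hypothetical. How the information of the first entity is updated is not specified in the claims, so the sketch only reports whether an update is needed.

```python
from typing import Dict, List, Sequence, Tuple

# (second entity head probabilities, second entity tail probabilities)
ProbPair = Tuple[Sequence[float], Sequence[float]]


def find_second_entities(
    text: str,
    relation_probs: Dict[str, ProbPair],
    threshold: float = 0.5,  # assumed value, not taken from the source
) -> List[Tuple[str, str]]:
    """Return (relationship category, second entity) pairs for one first entity."""
    found: List[Tuple[str, str]] = []
    for relation, (head_probs, tail_probs) in relation_probs.items():
        heads = [i for i, p in enumerate(head_probs) if p >= threshold]
        tails = [i for i, p in enumerate(tail_probs) if p >= threshold]
        # A second entity needs a character at both reference positions.
        for h in heads:
            later_tails = [t for t in tails if t >= h]
            if later_tails:
                found.append((relation, text[h : min(later_tails) + 1]))
    return found


def needs_update(
    text: str, relation_probs: Dict[str, ProbPair], threshold: float = 0.5
) -> bool:
    """True when no category yields a second entity for the first entity.

    Assumption: a head with no tail at or after it is also treated as a
    missing reference position.
    """
    return not find_second_entities(text, relation_probs, threshold)
```

A caller would run this once per extracted first entity and, when `needs_update` returns true, correct or discard that first entity according to its own policy.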

8. An apparatus for extracting entities from a text sequence, the apparatus comprising:

an acquisition unit that acquires a text sequence;

a probability determination unit which calculates a first entity position probability for each character in the text sequence based on the text sequence, wherein the first entity position probability refers to the probability that the character appears at a predetermined reference position in a first entity;

a mean determination unit that determines a probability mean of the first entity position probabilities of all characters in the text sequence based on the first entity position probability;

a list determination unit that compares the first entity position probability of each character with the probability mean, determines candidate characters appearing at the predetermined reference position according to the comparison result, and adds position identifications of the candidate characters to a first entity position list, wherein each position identification represents the position of a candidate character in the text sequence;

an extraction unit that determines a character appearing at the predetermined reference position from position identification in the first entity position list based on the first entity position list to extract a first entity including the determined character from the text sequence,

9. An electronic device, characterized in that the electronic device comprises:

a processor;

a memory storing a computer program which, when executed by the processor, implements the method for extracting entities from a text sequence according to any one of claims 1 to 7.

10. A computer-readable storage medium, in which a computer program is stored which, when executed by a processor, carries out the method for extracting entities from a text sequence according to any one of claims 1 to 7.