
CN107894978B - Time word extraction method and device - Google Patents

Time word extraction method and device

Info

Publication number
CN107894978B
Authority
CN
China
Prior art keywords
time
word
words
candidate
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711123985.7A
Other languages
Chinese (zh)
Other versions
CN107894978A (en
Inventor
任宁
张建军
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Science and Technology (Beijing) Co., Ltd.
Original Assignee
Dingfu Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dingfu Intelligent Technology Co Ltd
Priority to CN201711123985.7A
Publication of CN107894978A
Application granted
Publication of CN107894978B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

An embodiment of the invention discloses a method and a device for extracting time words. The method comprises the following steps: acquiring a text from which time words are to be extracted; extracting all candidate words in the text, wherein each candidate word has at least one meaning that represents time; determining the semantic region corresponding to each candidate word in the text, wherein the semantic region comprises the candidate word and a preset number of characters before and after it; and, if the semantic region does not contain a first preset character string corresponding to the candidate word, determining the candidate word to be a time word and outputting the time word. With this technical scheme, on the one hand, the extraction rules can be simplified and the number of extracted candidate words increased, which avoids omitting a large number of time words because the extraction rules are too complex; on the other hand, the candidate words are disambiguated, so that the time words in the text can be extracted more accurately. The method is particularly suitable for Chinese texts, in which time words take diverse forms.

Description

Time word extraction method and device
Technical Field
The invention relates to the technical field of information extraction and processing, in particular to a time word extraction method. In addition, the invention also relates to a time word extraction device.
Background
Information extraction is a technology for extracting information points from natural-language text. It aims to give people better tools for acquiring information in order to meet the challenges brought by the information explosion. Time information is an important component of natural language and is indispensable for fully understanding its semantics. Therefore, one important task of information extraction is to extract, from a text, the time words that represent time information.
The conventional method for extracting time words from a text mainly consists of constructing extraction rules and matching them against the text. For example, time words such as "December 12, 1999", "half past eight", and "Monday" are extracted in this way.
However, analysis shows that in Chinese text, especially ancient Chinese text, time words take many forms besides the conventional expressions of year, month, day, hour, minute, and second. To extract accurate time words from such text, complex extraction rules must be constructed, and complex extraction rules are likely to cause a large number of time words to be missed.
Disclosure of Invention
In order to solve the above technical problems, the present application provides a time word extraction method to solve the problems of complex time word extraction rules and easy omission.
In a first aspect, a method for extracting time words is provided, which includes the following steps:
acquiring a text from which time words are to be extracted;
extracting all candidate words in the text, wherein each candidate word has at least one meaning that represents time;
determining the semantic region corresponding to each candidate word in the text, wherein the semantic region comprises the candidate word and a preset number of characters before and after the candidate word;
and if the semantic region does not contain a first preset character string corresponding to the candidate word, determining the candidate word as a time word, and outputting the time word.
With reference to the first aspect, in a first possible implementation manner of the first aspect, the step of extracting all candidate words in the text includes:
extracting original words from the text;
determining a matching area corresponding to each original word in the text, wherein the matching area comprises the original words and a predetermined number of characters before and after the original words;
and generating a candidate word, wherein the candidate word is a word which contains the original word in the matching area and has at least one semantic meaning for representing time.
With reference to the first implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the step of outputting the time word includes:
if the time words contain numbers, judging whether the time words are preset exclusion types or not;
if the time word is not the preset exclusion type, converting the time word into a preset format;
and outputting the time words after the format conversion.
With reference to the first aspect and the foregoing possible implementation manners, in a third possible implementation manner of the first aspect, the step of outputting the time word includes:
determining a start-stop position of each time word in the text;
merging the time words with overlapped or adjacent start and stop positions;
and outputting the merged time words.
With reference to the first aspect and the foregoing possible implementations, in a fourth possible implementation of the first aspect, the step of merging time words with overlapping or adjacent start-stop positions includes:
judging whether the start-stop position of the current time word is overlapped or adjacent to the start-stop position of the next time word;
if the time words are overlapped or adjacent, updating the current time word and the next time word into a union of the current time word and the next time word;
determining the starting and ending positions of the updated time words in the text;
and if the start-stop position of the updated time word neither overlaps nor is adjacent to the start-stop position of the next time word, taking the updated time word as the merged time word.
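The merging steps above can be sketched as follows. This is a minimal illustration rather than the claimed implementation: the function name `merge_time_words`, the half-open `(start, end)` span convention, and the reading of "adjacent" as "one span begins exactly where the previous one ends" are all assumptions made for the example.

```python
def merge_time_words(text, spans):
    """Merge time-word spans that overlap or touch.

    spans: list of (start, end) character offsets, end exclusive (assumed convention).
    Returns a list of (start, end, merged_word).
    """
    merged = []
    for start, end in sorted(spans):
        # Overlapping (start < previous end) or adjacent (start == previous end):
        # replace the previous span with the union of the two spans.
        if merged and start <= merged[-1][1]:
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return [(s, e, text[s:e]) for s, e in merged]


if __name__ == "__main__":
    sample = "1893年8月15日申时到达"
    print(merge_time_words(sample, [(0, 10), (10, 12)]))
```

Sorting the spans first means each span only has to be compared with the most recently merged one, which mirrors the "current time word versus next time word" loop described in the steps.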
In a second aspect, a time word extracting apparatus is provided, including:
the acquisition unit is used for acquiring a text from which time words are to be extracted;
the processing unit is used for extracting all candidate words in the text, determining semantic areas corresponding to the candidate words in the text respectively, and determining the candidate words as time words under the condition that the semantic areas do not contain first preset character strings corresponding to the candidate words; each candidate word at least has a semantic meaning for representing time, and the semantic area comprises the candidate words and a preset number of characters before and after the candidate words;
and the output unit is used for outputting the time words.
With reference to the second aspect, in a first possible implementation manner of the second aspect, the processing unit is further configured to extract original words from the text, determine matching regions corresponding to the original words in the text, and generate candidate words; the matching area comprises an original word and a predetermined number of characters before and after the original word, and the candidate word is a word which contains the original word in the matching area and at least has one semantic meaning for representing time.
With reference to the first implementation manner of the second aspect, in a second possible implementation manner of the second aspect, the processing unit is further configured to determine whether the time word is a preset exclusion type when the time word includes a number, and if the time word is not the preset exclusion type, convert the time word into a preset format; the output unit is also used for outputting the time words after format conversion.
With reference to the second aspect and the foregoing possible implementations, in a third possible implementation of the second aspect, the processing unit is further configured to determine a start-stop position of each time word in the text, and merge time words with overlapping or adjacent start-stop positions; the output unit is further used for outputting the merged time words.
With reference to the second aspect and the foregoing possible implementation manners, in a fourth possible implementation manner of the second aspect, the processing unit is further configured to determine whether a start-stop position of a current time word overlaps or is adjacent to a start-stop position of a next time word, update the current time word and the next time word to a union of the current time word and the next time word under the overlapping or adjacent condition, determine a start-stop position of the updated time word in the text, and take the updated time word as the merged time word under the condition that the start-stop position of the updated time word does not overlap and is not adjacent to the start-stop position of the next time word.
According to the method and the device for extracting time words in the above technical scheme, a text from which time words are to be extracted is first obtained, and all candidate words are extracted from the text. Each candidate word has at least one meaning that represents time; that is, in the text, the candidate word may or may not actually be a time word. The semantic region corresponding to each candidate word in the text is then determined, and it is judged whether the semantic region contains a first preset character string corresponding to the candidate word. This determines whether the candidate word is a time word in the text and thereby eliminates the ambiguity. Finally, the time words are output, completing the extraction of time words from the text.
The method does not directly extract the accurate time words from the text at one time, but extracts the candidate words firstly, then determines the semantic area of the candidate words, and then judges whether the candidate words are the time words in the text by utilizing the semantic area and the first preset character string, thereby extracting the accurate time words from the text. Therefore, on one hand, the extraction rule can be simplified, the number of extracted candidate words can be increased, and the condition that a large number of time words are omitted due to the fact that the extraction rule is too complex is avoided; on the other hand, the candidate words are disambiguated, so that the time words in the text can be extracted more accurately, and the method is particularly suitable for Chinese texts with diversified time word expression forms. The extraction method of the time words is applied to extraction of the time words of the Chinese text, so that the extracted time words can be covered more comprehensively and have more diversified forms, and meanwhile, the missing quantity is also greatly reduced.
Drawings
In order to more clearly explain the technical solution of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious to those skilled in the art that other drawings can be obtained according to the drawings without any creative effort.
FIG. 1 is a flow chart of an embodiment of a method for extracting time words of the present application;
FIG. 2 is a flowchart of an embodiment of the step S200 in the present time word extraction method;
FIG. 3 is a flowchart illustrating a first example of a step of outputting time words according to an embodiment of the present invention;
FIG. 4 is a flowchart illustrating a second example of the step of outputting time words according to an embodiment of the method for extracting time words of the present application;
FIG. 5 is a flowchart of step S422 in the second example of the step of outputting time words in an embodiment of the method for extracting time words of the present application;
FIG. 6 is a schematic structural diagram of a specific embodiment of the time word extraction device according to the present application.
Detailed Description
The following provides a detailed description of the embodiments of the present application.
Referring to fig. 1, in a first embodiment of the present application, a method for extracting time words is provided, which includes steps S100 to S400.
S100: Acquire a text from which time words are to be extracted.
In step S100, the text may be a vernacular Chinese text, a classical Chinese text, or the like, which is not limited in the present application.
S200: Extract all candidate words in the text, wherein each candidate word has at least one meaning that represents time.
In step S200, each candidate word has at least one meaning that represents time; that is, besides the meaning that represents time, the candidate word may also have other meanings. For example, "the third" may denote a certain date, or may denote the ordinal number of a person or thing in a series.
All candidate words in the text may be extracted by constructing a regular expression and matching it directly against the text; other approaches may also be adopted.
In one implementation of extracting candidate words, a regular expression is matched directly against the text to extract the candidate words. When the regular expression is constructed, its specific character strings may cover multiple forms of time words. For example: time words that express time in the manner of the heavenly stems and earthly branches, such as "chou hour" (丑时), "noon" (午时), and "second watch" (二更); time words that express time by solar terms, such as "Major Cold", "Spring Equinox", and "Summer Solstice"; time words that express time by festivals, such as "the festival" and "Labor Day"; time words that denote an era or dynasty, such as "Tang dynasty", "Shang and Zhou", "remote antiquity", and "millennium years"; time words that denote fixed-interval time periods, such as "yearly" and "day by day"; and time words that denote fuzzy time periods, such as "chronological", "escape", and "decades".
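Direct regular-expression matching of candidate words might look like the sketch below. The pattern fragments are hypothetical stand-ins for the categories listed above (earthly-branch hours, night watches, solar terms, festivals, dynasties, numeric dates); the specific Chinese strings, the name `CANDIDATE_PATTERN`, and the function `extract_candidates` are assumptions for illustration, and a practical rule set would be far larger.

```python
import re

# Hypothetical pattern fragments, one per category of time expression.
CANDIDATE_PATTERN = re.compile(
    "|".join([
        r"[子丑寅卯辰巳午未申酉戌亥]时",   # earthly-branch hours, e.g. 申时
        r"[一二三四五]更",                   # night watches, e.g. 二更
        r"大寒|春分|夏至",                   # solar terms
        r"春节|劳动节",                      # festivals (assumed examples)
        r"唐朝|商周",                        # dynasties
        r"[0-9]{4}年[0-9]{1,2}月[0-9]{1,2}日",  # numeric dates
    ])
)


def extract_candidates(text):
    """Return (start, end, word) for every match; end is exclusive."""
    return [(m.start(), m.end(), m.group())
            for m in CANDIDATE_PATTERN.finditer(text)]


if __name__ == "__main__":
    for item in extract_candidates("她生于1893年8月15日，八十寿辰时设宴。"):
        print(item)
```

Note that the pattern deliberately over-matches: in "寿辰时" it picks up "辰时" even though the phrase does not denote an hour. Such candidates are what the later disambiguation steps are meant to filter out.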
In another implementation manner of extracting candidate words, referring to fig. 2, the step of extracting all candidate words in the text may specifically include:
S201: extracting original words from the text;
S202: determining a matching area corresponding to each original word in the text, wherein the matching area comprises the original word and a predetermined number of characters before and after the original word;
S203: generating a candidate word, wherein the candidate word is a word in the matching area that contains the original word and has at least one meaning that represents time.
In step S201, the original words may be extracted from the text by regular-expression matching. The original words here may be characters such as "更" (watch), "时" (hour), "鼓" (drum), "刻" (quarter-hour), "年" (year), "月" (month), and "日" (day), which can represent time together with the characters before and after them.
For example, text 1 of the time words to be extracted is "in Shen, state Munfu festoon, cheerful. When the user likes the dead, the smoke of the user is scattered immediately, and the colored lamps with little smoke and fire are reserved in the wind." Original word 1 "时", original word 2 "时", and original word 3 "刻" are extracted from text 1.
In step S202, an original word and a predetermined number of characters before and after the original word form a corresponding matching region of the original word in the text, and each original word has a corresponding matching region in the text.
Continuing the example in S201, suppose the matching area of the original word "时" is preset as the 2 characters before it, the original word itself, and the 1 character after it, and the matching area of the original word "刻" is preset as the 3 characters before it and the original word itself.
In the text 1 of the time word to be extracted, the matching area corresponding to each original word is as follows:
[Text 1 with matching region 1, matching region 2, and matching region 3 marked in square brackets.]
In step S203, the candidate word is a word in the matching area corresponding to the original word, and the candidate word includes the original word and also has at least one semantic meaning for representing time. The step of generating the candidate word may be to search whether a second preset character string corresponding to the original word exists in the matching area, so as to generate the candidate word; semantic analysis may also be performed on the text in the matching region to generate candidate words.
For example, continuing the example in S202, the second preset character strings corresponding to the original word "时" are preset to be the twelve earthly branches: "子", "丑", "寅", "卯", "辰", "巳", "午", "未", "申", "酉", "戌", and "亥". If the character before the original word "时" is any one of these second preset character strings, a candidate word of the form "second preset character string + 时" is generated in the matching area. The second preset character strings corresponding to the original word "刻" include "一", "二", "三", "1", "2", "3", and so on. If the character before the original word "刻" is any one of the second preset character strings corresponding to "刻", a candidate word of the form "second preset character string + 刻" is generated in the matching area.
Therefore, searching matching area 1 finds that the character before original word 1 "时" is "申", and candidate word 1 "申时" (the Shen hour) is generated. Searching matching area 2 finds that the character before original word 2 "时" is likewise one of the second preset character strings, and candidate word 2 is generated in the same way. Searching matching area 3 finds that the character before original word 3 "刻" is "三", and candidate word 3 "三刻" (three quarters) is generated.
As another example, still continuing the example in S202, word segmentation and semantic analysis are performed on the text in matching area 1; "申时" is found to have a meaning that represents time, and candidate word 1 "申时" is generated. Likewise, if word segmentation and semantic analysis of the text in matching area 2 find a word that has a meaning representing time, candidate word 2 is generated. For the text in matching area 3, word segmentation and semantic analysis find that the phrase containing "三刻" means "immediately" and does not denote a specific time point, so matching area 3 may generate no candidate word.
It should be noted that different methods of word segmentation and semantic analysis may produce different candidate words for the same matching area. For example, in the foregoing example, word segmentation of the text in matching area 3 may instead yield three segments, one of which is "三刻". Semantic analysis is then performed on each segment, and "三刻" is found to have a meaning that represents a specific time point, so candidate word 3 "三刻" can be generated for matching area 3.
The method for extracting the candidate words comprises the steps of extracting original words in the text. Compared with the extraction rule of directly extracting candidate words or the extraction rule of directly extracting accurate time words, the extraction rule of the original words is simpler, so that the original words can be extracted from the text as many as possible. And then determining a matching area corresponding to the original word, and generating a candidate word from the matching area, so that the rule of extracting the candidate word from the text can be simplified, and the problem of omission caused by complex extraction rules is further reduced.
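The original-word route of S201 to S203 can be sketched as below, using the second-preset-string variant of S203. The rule table `RULES`, the function `generate_candidates`, and the sample sentence are assumptions for illustration; the window sizes (2 before and 1 after for "时", 3 before for "刻") follow the example above, and the earthly-branch prefixes are a reading of the translated list.

```python
# original word -> (chars before, chars after, second preset character strings)
RULES = {
    "时": (2, 1, set("子丑寅卯辰巳午未申酉戌亥")),
    "刻": (3, 0, set("一二三123")),
}


def generate_candidates(text):
    """Return (candidate_word, matching_region) pairs found in the text."""
    results = []
    for pos, ch in enumerate(text):
        if ch not in RULES:                      # S201: find original words
            continue
        before, after, prefixes = RULES[ch]
        region_start = max(0, pos - before)      # S202: matching region bounds
        region_end = min(len(text), pos + 1 + after)
        region = text[region_start:region_end]
        # S203: the character just before the original word must lie inside
        # the region and be one of the second preset character strings.
        if pos - 1 >= region_start and text[pos - 1] in prefixes:
            results.append((text[pos - 1] + ch, region))
    return results


if __name__ == "__main__":
    print(generate_candidates("申时，周府张灯结彩。顿时三刻烟消云散。"))
```

In the sample sentence the "时" of "顿时" is found as an original word but yields no candidate, because "顿" is not among its second preset character strings.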
The candidate word extracted in step S200 has at least one meaning that represents time; that is, in the text the candidate word may or may not represent time, so ambiguity exists. For example, when the character before the candidate word "third" in the text is "man", "third" indicates the ordinal number of a person or thing in a series and does not represent time. As another example, when the character after the candidate word "7.6" in the text is "yuan", "gram", "meter", or the like, "7.6" indicates a quantity and does not represent time. For this reason, after step S200, steps S300 and S400 determine whether each candidate word is a time word that represents time, thereby disambiguating the candidate words and accurately extracting the time words in the text.
S300: Determine the semantic region corresponding to each candidate word in the text, wherein the semantic region comprises the candidate word and a preset number of characters before and after the candidate word.
In step S300, each candidate word corresponds to at least one semantic area in the text.
For example, text 2 reads: "She was born on 一八九三年八月十五日, that is, 1893年8月15日. On her eightieth birthday (八十寿辰时), the Zhou couple held a banquet to celebrate her longevity. At 申时 (the Shen hour), guests arrived at the Zhou mansion one after another." The candidate words extracted from text 2 are: candidate word 1 "一八九三年八月十五日", candidate word 2 "1893年8月15日", candidate word 3 "辰时" (the Chen hour), and candidate word 4 "申时" (the Shen hour).
Suppose that the semantic area of the candidate word "辰时" is preset as the 1 character before it plus the candidate word itself; that the semantic area of the candidate word "申时" is likewise preset as the 1 character before it plus the candidate word itself; and that the semantic area of a candidate word in the format "X年X月X日" (year X, month X, day X) is preset to run from 4 characters before the character "年" to the character "日". The semantic areas corresponding to the candidate words in text 2 are then determined as follows:
Semantic area 1: "一八九三年八月十五日"
Semantic area 2: "1893年8月15日"
Semantic area 3: "寿辰时"
Semantic area 4: "。申时" (the preceding character is the full stop that ends the previous sentence)
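Cutting out a semantic area amounts to widening the candidate's span by a preset number of characters on each side. The sketch below assumes half-open character offsets and a helper named `semantic_region` (both hypothetical), and uses a Chinese sample sentence assumed to correspond to text 2, with the one-character-before rule from the example.

```python
def semantic_region(text, start, end, before=1, after=0):
    """Return the candidate at text[start:end] widened by `before` characters
    on the left and `after` characters on the right, clipped to the text."""
    lo = max(0, start - before)
    hi = min(len(text), end + after)
    return text[lo:hi]


if __name__ == "__main__":
    sample = "在她八十寿辰时，周氏夫妇设宴。申时，宾客陆续到达周府。"
    chen = sample.index("辰时")
    shen = sample.index("申时")
    print(semantic_region(sample, chen, chen + 2))
    print(semantic_region(sample, shen, shen + 2))
```

Clipping to the text boundaries keeps the helper safe for candidates at the very start or end of the text, where fewer context characters are available.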
S400: If the semantic region does not contain a first preset character string corresponding to the candidate word, determine the candidate word as a time word, and output the time word.
In step S400, the first preset character string is a character string such that, when it and the candidate word belong to the same semantic area, the candidate word does not represent time. Different candidate words may correspond to different first preset character strings. When the first preset character string corresponding to a candidate word is empty, the candidate word has only the meaning that represents time, and no ambiguity exists.
The first preset character string corresponding to each candidate word may be pre-stored in the corpus. The first predetermined character string in the corpus may be accumulated from past experience, or may be generated in other manners.
For example, in an implementation manner of generating a first preset character string corresponding to a certain candidate word, a predetermined number of candidate sentences including the candidate word may be selected first; then, selecting a selected sentence from the candidate sentences, wherein the candidate words in the selected sentence do not represent time; and finally, extracting a first preset character string from the selected sentence, wherein the first preset character string only appears in the selected sentence and does not appear in other candidate sentences except the selected sentence.
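One way to read the generation procedure above is sketched here. It assumes that the candidate sentences have already been split, by manual labeling or some other means not specified in the text, into sentences where the candidate does not represent time and sentences where it does; it also assumes that first preset character strings are single characters directly beside the candidate. The function name `derive_first_preset` and the `window` parameter are hypothetical.

```python
def derive_first_preset(candidate, non_time_sentences, time_sentences, window=1):
    """Collect characters next to `candidate` in the non-time sentences and keep
    only those that never occur in the other (time) candidate sentences."""
    context_chars = set()
    for sentence in non_time_sentences:
        pos = sentence.find(candidate)
        while pos != -1:
            end = pos + len(candidate)
            context_chars.update(sentence[max(0, pos - window):pos])
            context_chars.update(sentence[end:end + window])
            pos = sentence.find(candidate, pos + 1)
    return {ch for ch in context_chars
            if not any(ch in s for s in time_sentences)}


if __name__ == "__main__":
    print(derive_first_preset(
        "辰时",
        non_time_sentences=["在她八十寿辰时，周氏夫妇设宴"],
        time_sentences=["辰时，宾客陆续到达"],
    ))
```

The final filter implements the condition that a first preset character string appears only in the selected sentences: the comma after "辰时" is discarded because it also follows the candidate in a sentence where "辰时" does denote an hour.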
The first preset character string is compared with the text of the semantic area corresponding to the candidate word. If the semantic area does not contain the first preset character string corresponding to the candidate word, the candidate word represents time in the text and is therefore a time word. If the semantic area contains the first preset character string corresponding to the candidate word, the candidate word is considered not to represent time in the text and is therefore not a time word.
For example, continuing the example in step S300, the first preset character string corresponding to the candidate word "辰时" is preset to be either of "寿" and "生"; the first preset character string corresponding to the candidate word "申时" is preset to be "引"; and the first preset character string corresponding to candidate words in the format "X年X月X日" is preset to be empty.
Because the first preset character string corresponding to candidate words in the format "X年X月X日" is empty, candidate word 1 "一八九三年八月十五日" and candidate word 2 "1893年8月15日" are determined to be time words.
The first preset character string corresponding to the candidate word "辰时" is either of "寿" and "生". Comparison shows that semantic area 3 contains the first preset character string "寿", so candidate word 3 "辰时" is not a time word in text 2.
The first preset character string corresponding to the candidate word "申时" is "引". Comparison shows that semantic area 4 does not contain "引", so candidate word 4 "申时" is a time word in text 2.
Finally, the time words "一八九三年八月十五日", "1893年8月15日", and "申时" are output.
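The comparison in S400 reduces to a substring test against a lookup table, as the sketch below shows. The table `FIRST_PRESET` and the function `is_time_word` are hypothetical names; the entries assume the example's candidates are "辰时" and "申时" with the blocking strings "寿", "生", and "引", and a candidate with no entry is treated as having an empty first preset character string.

```python
# candidate word -> first preset character strings that rule out a time reading
FIRST_PRESET = {
    "辰时": ["寿", "生"],
    "申时": ["引"],
}


def is_time_word(candidate, region):
    """A candidate is a time word unless its semantic region contains one of
    its first preset character strings; an empty list never blocks."""
    blockers = FIRST_PRESET.get(candidate, [])
    return not any(blocker in region for blocker in blockers)


if __name__ == "__main__":
    regions = [
        ("一八九三年八月十五日", "一八九三年八月十五日"),
        ("辰时", "寿辰时"),
        ("申时", "。申时"),
    ]
    print([word for word, region in regions if is_time_word(word, region)])
```

Keeping the blocking strings in a per-candidate table matches the statement that different candidate words may correspond to different first preset character strings.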
According to the method for extracting time words in the above technical scheme, a text from which time words are to be extracted is first obtained, and all candidate words are extracted from the text. Each candidate word has at least one meaning that represents time; that is, in the text, the candidate word may or may not actually be a time word. The semantic region corresponding to each candidate word in the text is then determined, and it is judged whether the semantic region contains a first preset character string corresponding to the candidate word. This determines whether the candidate word is a time word in the text and thereby eliminates the ambiguity. Finally, the time words are output, completing the extraction of time words from the text.
The method of the embodiment does not directly extract the accurate time words from the text at one time, but extracts the candidate words first, determines the semantic area of the candidate words, and then judges whether the candidate words are the time words in the text by using the semantic area and the first preset character string, thereby accurately extracting the time words from the text. Therefore, on one hand, the extraction rule can be simplified, the number of extracted candidate words can be increased, and the condition that a large number of time words are omitted due to the fact that the extraction rule is too complex is avoided; on the other hand, the candidate words are disambiguated, so that the time words in the text can be extracted more accurately, and the method is particularly suitable for Chinese texts with diversified time word expression forms. The extraction method of the time words is applied to extraction of the time words of the Chinese text, so that the extracted time words can be covered more comprehensively and have more diversified forms, and meanwhile, the missing quantity is also greatly reduced.
Optionally, referring to Fig. 3, in step S400, the step of outputting the time word may include:
S411: judging whether the time word contains a number; if so, executing S412; if not, executing S415.
S412: judging whether the time word is of a preset exclusion type; if not, executing S413; if so, executing S415.
S413: converting the time word into a preset format.
S414: outputting the format-converted time word, and ending.
S415: outputting the time word, and ending.
In step S411, the number contained in the time word may be written in Chinese numerals, Arabic numerals, or another form; the present application places no limit on this. For example, the time word "year one-eight-nine-three, August fifteenth" contains numbers written in Chinese numerals, whereas the time word "1893.08.15" contains Arabic numerals.
Steps S411 to S414 mainly serve to unify the format of time words that contain numbers. However, time words such as "two or three years", "several decades" and "one or two days" contain Chinese numerals, yet converting those numerals to Arabic numerals changes their meaning. For example, converting the Chinese numerals in "two or three years" yields "23 years", which means something different. Time words whose meaning would change under format conversion are treated as the preset exclusion type, and step S412 excludes them from the number-containing time words to be converted.
In step S413, the preset format may be set by the user as needed. For example, every year-month-day date may be uniformly expressed in the format "XXXX year XX month XX day", and every hour-and-minute time may be uniformly expressed as "XX:XX", as shown in the examples in Table 1.
TABLE 1
Time word before conversion | Converted time word
Two zero one seven year august one number | 8.8.1.2017
1998/7/20 | 20/7/1998
From 5/6/2007 to 9/10/2008 | 5 days 6/2007 to 9/10/2008
3 point 20 | 3:20
A quarter past 3 | 3:15
From 3 to five | 3:00-5:00
In this step, number-containing time words from classical Chinese text, such as "first watch" and "second drum" (traditional night-watch terms), can also be converted into the unified preset format, as shown in the examples in Table 2.
TABLE 2
(Table 2 is available only as images in the source: Figure BDA0001468007140000071 and Figure BDA0001468007140000081.)
Through the above steps, extracted time words that contain numbers are converted into a unified preset format before being output, so that the output time words are more standardized and easier to use in subsequent processing.
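The decision flow of S411 to S415 can be sketched in Python as below. This is only an illustrative sketch: the function names, the Chinese numeral table, and the exclusion check (two adjacent Chinese numerals differing by 1, followed by a year, month or day character) are assumptions made for the example, and the actual format conversion is supplied by the caller.

```python
import re
from typing import Callable

# Assumed numeral table for illustration; "两" is a common variant of "二" (2).
CN_DIGITS = {"一": 1, "二": 2, "两": 2, "三": 3, "四": 4,
             "五": 5, "六": 6, "七": 7, "八": 8, "九": 9}
OTHER_CN_NUMERALS = "零十百千"


def contains_number(word: str) -> bool:
    """S411: does the time word contain an Arabic or Chinese numeral?"""
    return any(ch.isdigit() or ch in CN_DIGITS or ch in OTHER_CN_NUMERALS
               for ch in word)


def is_excluded_type(word: str) -> bool:
    """S412 (simplified assumption): approximate expressions such as "两三年"."""
    m = re.fullmatch(r"([一二两三四五六七八九])([二三四五六七八九])[年月日天]", word)
    return m is not None and CN_DIGITS[m.group(2)] - CN_DIGITS[m.group(1)] == 1


def output_time_word(word: str, convert: Callable[[str], str]) -> str:
    if not contains_number(word):   # S411 -> S415: output unchanged
        return word
    if is_excluded_type(word):      # S412 -> S415: output unchanged
        return word
    return convert(word)            # S413 and S414: convert, then output


def half_past(word: str) -> str:
    """Hypothetical converter that handles only "X点半" (half past X)."""
    return re.sub(r"^(\d{1,2})点半$", r"\1:30", word)


print(output_time_word("5点半", half_past))   # converted
print(output_time_word("两三年", half_past))  # excluded type, left unchanged
print(output_time_word("傍晚", half_past))    # no number, left unchanged
```

Passing the converter in as a parameter keeps the S411/S412 checks independent of whichever preset format the user chooses.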
In addition, optionally, when a time word does not contain a number, it may also be judged whether the time word is a preset ancient time word, such as "Zi hour" (子时) or "Shen hour" (申时). If it is, the time word is converted into its corresponding preset format; if not, the time word is output directly without format conversion. For example, traditional hour names such as "Zi hour" and "Chou hour" (丑时) may be converted into a unified preset format, as shown in the examples in Table 3.
TABLE 3
Time word before conversion | Converted time word
Zi hour (子时) | 0:00-2:00
Chen hour (辰时) | 8:00-10:00
Shen hour (申时) | 16:00-18:00
Xu hour (戌时) | 20:00-22:00
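The mapping in Table 3 follows a regular two-hour pattern, so it can be generated rather than listed by hand. The sketch below assumes that the ancient time words are the twelve traditional hour names formed from the earthly branches (子, 丑, 寅, and so on up to 亥) followed by "时", and that each maps to a two-hour slot starting from 0:00 as in Table 3. The function name is hypothetical, and only the four rows of Table 3 come from the source; the other eight slots are an extrapolation.

```python
# The twelve earthly branches in order; extending Table 3 to all twelve
# is an assumption based on the regular two-hour pattern of its four rows.
EARTHLY_BRANCHES = "子丑寅卯辰巳午未申酉戌亥"


def convert_ancient_hour(word: str) -> str:
    """Convert e.g. "申时" to "16:00-18:00"; return other words unchanged."""
    if len(word) == 2 and word[1] == "时" and word[0] in EARTHLY_BRANCHES:
        start = 2 * EARTHLY_BRANCHES.index(word[0])
        return f"{start}:00-{start + 2}:00"
    return word


for w in ["子时", "辰时", "申时", "戌时", "傍晚"]:
    print(w, "->", convert_ancient_hour(w))
```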
Optionally, referring to Fig. 4, in step S400, the step of outputting the time word may include:
S421: determining the start-stop position of each time word in the text;
S422: merging time words whose start-stop positions overlap or are adjacent;
S423: outputting the merged time words.
In step S421, the start-stop position of a time word in the text comprises a start position and an end position, which may be determined by character order. For example, in text 2 of step S300, the time word "year one-eight-nine-three, August fifteenth" starts at the 5th character and ends at the 14th character, and the time word "Shen hour" starts at the 47th character and ends at the 48th character. Besides character order, the position of a time word may also be recorded in other ways, such as X-axis and Y-axis coordinates.
Text 2 of the time word to be extracted:
Figure BDA0001468007140000082
During extraction of the candidate words or original words, some of them may match several extraction rules, so some of the finally extracted time words may overlap or be adjacent to one another. In step S422, time words whose start-stop positions overlap or are adjacent fall into three cases.
In the first case, the previous time word partially overlaps the next time word. For example, from the text "September 1, 2015, morning, 8 o'clock", time word 1 "September 1, 2015, morning" and time word 2 "morning, 8 o'clock" are extracted. Once their start-stop positions are determined, it can be seen that the two overlap at "morning". Time words whose start-stop positions overlap in this way may be merged into "September 1, 2015, morning, 8 o'clock".
In the second case, one time word contains the other: either the previous time word contains the next one, or the next contains the previous. For example, from the text "September 1, 2015, morning, 8 o'clock", time word 1 "September 1, 2015, morning" and time word 2 "morning" are extracted. Once their start-stop positions are determined, it can be seen that time word 1 contains time word 2. Time words whose start-stop positions overlap in this way may be merged into "September 1, 2015, morning".
In the third case, the previous time word is adjacent to the next time word. For example, from the text "September 1, 2015, morning, 8 o'clock", time word 1 "September 1, 2015" and time word 2 "morning, 8 o'clock" are extracted. Once their start-stop positions are determined, it can be seen that the two are adjacent in the text. Time words whose start-stop positions are adjacent may be merged into "September 1, 2015, morning, 8 o'clock".
Before judging whether two time words overlap or are adjacent in the text, all time words extracted from the text may be sorted by their start-stop positions. Then only the start and end positions of the current time word and of the next time word need to be compared to decide whether the two overlap, are adjacent, or are separated. There is no need to compare each time word against every other time word, which greatly reduces the computation required by the merging step. More specifically, referring to Fig. 5, step S422 of merging time words whose start-stop positions overlap or are adjacent may include:
S4221: judging whether the start-stop position of the current time word overlaps or is adjacent to the start-stop position of the next time word;
S4222: if they overlap or are adjacent, updating the current time word and the next time word to the union of the two;
S4223: determining the start-stop position of the updated time word in the text;
S4224: repeating steps S4221 to S4223 until the start-stop position of the updated time word neither overlaps nor is adjacent to that of the next time word, and then taking the updated time word as the merged time word.
If the result of S4221 is that the start-stop position of the current time word neither overlaps nor is adjacent to that of the next time word, the current time word is output.
By the above method, two or more adjacent or overlapping time words can be merged into one and output.
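The sort-then-scan merge of S4221 to S4224 can be sketched as follows. The sketch assumes 1-based, inclusive character positions and treats two time words as adjacent when the next one starts immediately after the current one ends. The names `merge_spans`, `words_at` and `Span` are hypothetical, and the sample string is an assumed Chinese original of the September 1, 2015 example.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), 1-based and inclusive


def merge_spans(spans: List[Span]) -> List[Span]:
    merged: List[Span] = []
    for start, end in sorted(spans):                # sort by position first
        if merged and start <= merged[-1][1] + 1:   # overlapping or adjacent
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))  # take the union
        else:
            merged.append((start, end))             # separated: new time word
    return merged


def words_at(text: str, spans: List[Span]) -> List[str]:
    """Recover the time words from their start-stop positions."""
    return [text[s - 1:e] for s, e in spans]


text = "2015年9月1日上午8点"
print(words_at(text, merge_spans([(1, 11), (10, 13)])))  # partial overlap
print(words_at(text, merge_spans([(1, 11), (10, 11)])))  # containment
print(words_at(text, merge_spans([(1, 9), (10, 13)])))   # adjacency
```

Because the spans are sorted, each time word is compared only with the most recently merged one, which is the saving in computation described above.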
The time word extraction method of the present application is described below by another specific example.
Text 3 of the time word to be extracted:
At half past 5 in the evening of 6.27, the user ran 10 km in the city and returned home to Building No. 9, taking 44 minutes 33 seconds in total. The running has been kept up for two or three years and the harvest is quite plentiful. Before sleeping, the user read Jiu Ming Qin Cui (disorder of nine Ming Qi) by Wutrimang of the Qing dynasty, in which it is written: "I has counted here, brothers, is rarely at home, cannot catch a drum if they are lost, all alone! "
After text 3 is obtained, 8 candidate words are extracted from it:
Candidate word 1: 6.27
Candidate word 2: evening
Candidate word 3: half past 5
Candidate word 4: No. 9
Candidate word 5: 44 minutes 33 seconds
Candidate word 6: two or three years
Candidate word 7: Qing dynasty
Candidate word 8: one drum
It is preset that a candidate word in the numeric format "X.X" (where X stands for one or more digits; the same applies below), together with the 1 character that follows it in the text, forms the semantic area corresponding to "X.X". The first preset character string corresponding to the candidate word "X.X" is any one of "gram", "jin", "meter" and "yuan".
It is preset that the candidate word "No. X", together with the 1 character that follows it in the text, forms the semantic area corresponding to "No. X". The first preset character string corresponding to the candidate word "No. X" is any one of "building", "house" and "shop".
It is preset that the candidate word "one drum", together with the 2 characters that follow it in the text, forms the semantic area corresponding to "one drum". The first preset character string corresponding to the candidate word "one drum" is either "one plate" or "and catch".
It is preset that each of the candidate words "half past X", "evening", "X minutes X seconds", "X years" and "Qing dynasty" by itself forms its own semantic area in the text. The first preset character strings corresponding to "half past X", "evening", "X minutes X seconds", "X years" and "Qing dynasty" are all empty.
The semantic area corresponding to each candidate word in text 3 is then determined, as shown in Table 4.
TABLE 4
No. | Candidate word | Semantic region | First preset character string
1 | 6.27 | "6.27" and the 1 character after it | gram, jin, meter, yuan
2 | evening | evening | (empty)
3 | half past 5 | half past 5 | (empty)
4 | No. 9 | No. 9 building | building, house, shop
5 | 44 minutes 33 seconds | 44 minutes 33 seconds | (empty)
6 | two or three years | two or three years | (empty)
7 | Qing dynasty | Qing dynasty | (empty)
8 | one drum | one drum and catch | one plate, and catch
For each candidate word, it is judged whether its semantic area contains the corresponding first preset character string. If not, the candidate word is a time word in text 3; if so, it is not. Accordingly, 6 of the 8 candidate words are determined to be time words, as follows:
Time word 1: 6.27
Time word 2: evening
Time word 3: half past 5
Time word 4: 44 minutes 33 seconds
Time word 5: two or three years
Time word 6: Qing dynasty
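The disambiguation step can be sketched as below. The rule table, the regular expressions, the Chinese strings standing in for "gram, jin, meter, yuan" and "building, house, shop", and the short sample sentence are all assumptions for illustration. Only the idea of checking each candidate's semantic area against its first preset character strings comes from the description above.

```python
import re
from typing import List, NamedTuple, Tuple


class Rule(NamedTuple):
    pattern: str                # regex that extracts the candidate word
    chars_after: int            # characters after the candidate in the semantic area
    excluded: Tuple[str, ...]   # first preset character strings (may be empty)


# Hypothetical rules mirroring the "X.X", "No. X" and "evening" candidates.
RULES = [
    Rule(r"\d+\.\d+", 1, ("克", "斤", "米", "元")),
    Rule(r"\d+号", 1, ("楼", "房", "店")),
    Rule(r"傍晚", 0, ()),
]


def extract_time_words(text: str) -> List[str]:
    found = []
    for rule in RULES:
        for m in re.finditer(rule.pattern, text):
            # Semantic area: the candidate plus the configured following characters.
            area = text[m.start():m.end() + rule.chars_after]
            # A candidate is a time word only if no preset string occurs in its area.
            if not any(s in area for s in rule.excluded):
                found.append((m.start(), m.group()))
    return [word for _, word in sorted(found)]


sample = "6.27傍晚跑了10公里，花了6.27元，回到9号楼"
print(extract_time_words(sample))
```

In the sample, the first "6.27" is kept because the character after it is not a unit, while "6.27元" and "9号楼" are rejected because their semantic areas contain a preset string.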
It is preset that time words in the format "half past X" are converted into the preset format "X:30", that time words in the format "X.X" are converted into the preset format "X month X day", and that time words in the format "X minutes X seconds" keep their original format without conversion. The preset exclusion types are "XY years", "XY months" and "XY days", where X and Y are Chinese numerals and X is 1 less than Y.
It is judged whether each of the 6 time words contains a number. Four of them do: time word 1 "6.27", time word 3 "half past 5", time word 4 "44 minutes 33 seconds" and time word 5 "two or three years".
It is then judged whether these 4 time words are of the preset exclusion type. Time word 1 "6.27", time word 3 "half past 5" and time word 4 "44 minutes 33 seconds" are not, whereas time word 5 "two or three years" is.
The 3 number-containing time words that are not of the preset exclusion type are converted into the preset format, as shown in Table 5.
TABLE 5
Time word | Before conversion | After conversion
Time word 1 | 6.27 | June 27
Time word 3 | half past 5 | 5:30
Time word 4 | 44 minutes 33 seconds | 44 minutes 33 seconds
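The conversion rules preset for this example can be expressed as a small table of regular expressions. This is a sketch under assumptions: the Chinese patterns ("X点半" for "half past X", "X月X日" for "X month X day") are presumed originals of the translated formats, and `CONVERSION_RULES` and `to_preset_format` are hypothetical names.

```python
import re

# Each rule pairs a pattern for the whole time word with an output template.
CONVERSION_RULES = [
    (re.compile(r"(\d{1,2})点半"), r"\1:30"),            # half past X -> X:30
    (re.compile(r"(\d{1,2})\.(\d{1,2})"), r"\1月\2日"),  # X.X -> X month X day
]


def to_preset_format(word: str) -> str:
    """Apply the first matching rule; words with no rule keep their format."""
    for pattern, template in CONVERSION_RULES:
        m = pattern.fullmatch(word)
        if m:
            return m.expand(template)
    return word


for w in ["6.27", "5点半", "44分33秒"]:
    print(w, "->", to_preset_format(w))
```

Using `fullmatch` means a rule fires only when the entire time word has the expected shape, so "44分33秒" passes through unchanged, as in Table 5.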
Time words that contain no number, or that are of the preset exclusion type, remain unchanged. At this point, the time words extracted from text 3 are:
Time word 1: June 27
Time word 2: evening
Time word 3: 5:30
Time word 4: 44 minutes 33 seconds
Time word 5: two or three years
Time word 6: Qing dynasty
The 6 time words may be output, or the 6 time words may be output after the merging step.
The respective start and stop positions of the 6 time words in the text 3 are determined, and if the format conversion step has already been performed, the position of the time word before the format conversion in the text 3 is still taken as the start and stop position of the time word in the text 3. The results are shown in Table 6.
TABLE 6
No. | Time word | Start position | End position
1 | 6.27 | 1 | 4
2 | evening | 5 | 6
3 | half past 5 | 7 | 9
4 | 44 minutes 33 seconds | 31 | 36
5 | two or three years | 45 | 47
6 | Qing dynasty | 50 | 51
It is judged whether the current time word "6.27" overlaps or is adjacent to the next time word "evening". They are adjacent, so the union "6.27 evening" replaces "6.27" and "evening"; that is, "6.27 evening" becomes the new current time word. The start-stop position of the updated time word "6.27 evening" in text 3 is then determined, as shown in Table 7.
TABLE 7
No. | Time word | Start position | End position
1 | 6.27 evening | 1 | 6
2 | half past 5 | 7 | 9
3 | 44 minutes 33 seconds | 31 | 36
4 | two or three years | 45 | 47
5 | Qing dynasty | 50 | 51
The loop then judges whether the current time word "6.27 evening" overlaps or is adjacent to the next time word "half past 5". They are adjacent, so the union "6.27 evening half past 5" replaces "6.27 evening" and "half past 5"; that is, "6.27 evening half past 5" becomes the new current time word. The start-stop position of the updated time word "6.27 evening half past 5" in text 3 is determined again, as shown in Table 8.
TABLE 8
No. | Time word | Start position | End position
1 | 6.27 evening half past 5 | 1 | 9
2 | 44 minutes 33 seconds | 31 | 36
3 | two or three years | 45 | 47
4 | Qing dynasty | 50 | 51
The loop then judges whether the current time word "6.27 evening half past 5" overlaps or is adjacent to the next time word "44 minutes 33 seconds". They neither overlap nor are adjacent, so the current time word "6.27 evening half past 5" is output. Any part that has already been converted into the preset format is output in that format; that is, "June 27, evening, 5:30" is output.
Next, "44 minutes 33 seconds" is taken as the current time word, and it is judged whether it overlaps or is adjacent to the next time word "two or three years". They neither overlap nor are adjacent, so "44 minutes 33 seconds" is output directly.
Similarly, "two or three years" is taken as the current time word, and it is judged whether it overlaps or is adjacent to the next time word "Qing dynasty". They neither overlap nor are adjacent, so "two or three years" is output directly.
"Qing dynasty" is the last time word in text 3; since there is no next time word, it is output directly.
Therefore, the time words output for text 3 are: "June 27, evening, 5:30", "44 minutes 33 seconds", "two or three years" and "Qing dynasty".
Referring to Fig. 6, a second embodiment of the present application provides a time word extraction apparatus, comprising:
an acquisition unit 1, configured to obtain the text from which time words are to be extracted;
a processing unit 2, configured to extract all candidate words in the text, determine the semantic area corresponding to each candidate word in the text, and determine that a candidate word is a time word when its semantic area does not contain a first preset character string corresponding to the candidate word, wherein each candidate word has at least one meaning that denotes time, and the semantic area comprises the candidate word and a preset number of characters before and after it;
and an output unit 3, configured to output the time words.
Optionally, the processing unit 2 is further configured to extract original words from the text, determine matching regions corresponding to the original words in the text, and generate candidate words; the matching area comprises an original word and a predetermined number of characters before and after the original word, and the candidate word is a word which contains the original word in the matching area and at least has one semantic meaning for representing time.
Optionally, the processing unit 2 is further configured to determine whether the time word is of a preset exclusion type when the time word includes a number, and if not, convert the time word into a preset format; the output unit is also used for outputting the time words after format conversion.
Optionally, the processing unit 2 is further configured to determine a start-stop position of each time word in the text, and merge time words with overlapping or adjacent start-stop positions; the output unit is further used for outputting the merged time words.
Optionally, the processing unit 2 is further configured to determine whether a start-stop position of the current time word overlaps or is adjacent to a start-stop position of the next time word, update the current time word and the next time word to a union of the current time word and the next time word if the start-stop positions of the current time word and the next time word overlap or are adjacent, determine a start-stop position of the updated time word in the text, and take the updated time word as the merged time word if the start-stop position of the updated time word does not overlap and is not adjacent to the start-stop position of the next time word.
The same and similar parts in the various embodiments in this specification may be referred to each other. The above-described embodiments of the present invention should not be construed as limiting the scope of the present invention.

Claims (8)

1. A method for extracting time words is characterized by comprising the following steps:
acquiring a text of a time word to be extracted;
extracting all candidate words in the text, wherein each candidate word at least has one semantic meaning for representing time;
determining semantic regions corresponding to the candidate words in the text respectively, wherein the semantic regions comprise the candidate words and a preset number of characters before and after the candidate words;
if the semantic region does not contain a first preset character string corresponding to the candidate word, determining the candidate word as a time word, and outputting the time word;
and, the step of outputting the time word includes:
if the time words contain numbers, judging whether the time words are preset exclusion types or not;
if the time word is not the preset exclusion type, converting the time word into a preset format;
and outputting the time words after the format conversion.
2. The method for extracting time words according to claim 1, wherein the step of extracting all candidate words in the text comprises:
extracting original words from the text;
determining a matching area corresponding to each original word in the text, wherein the matching area comprises the original words and a predetermined number of characters before and after the original words;
and generating a candidate word, wherein the candidate word is a word which contains the original word in the matching area and has at least one semantic meaning for representing time.
3. The method for extracting time words according to claim 1, wherein the step of outputting the time words comprises:
determining a start-stop position of each time word in the text;
merging the time words with overlapped or adjacent start and stop positions;
and outputting the merged time words.
4. The method for extracting time words according to claim 3, wherein the step of merging time words with overlapping or adjacent start and stop positions comprises:
judging whether the start-stop position of the current time word is overlapped or adjacent to the start-stop position of the next time word;
if the time words are overlapped or adjacent, updating the current time word and the next time word into a union of the current time word and the next time word;
determining the starting and ending positions of the updated time words in the text;
and if the start-stop position of the updated time word neither overlaps nor is adjacent to the start-stop position of the next time word, taking the updated time word as the merged time word.
5. A time word extraction device, comprising:
the acquisition unit is used for acquiring a text of a time word to be extracted;
the processing unit is used for extracting all candidate words in the text, determining semantic areas corresponding to the candidate words in the text respectively, and determining the candidate words as time words under the condition that the semantic areas do not contain first preset character strings corresponding to the candidate words; each candidate word at least has a semantic meaning for representing time, and the semantic area comprises the candidate words and a preset number of characters before and after the candidate words;
an output unit for outputting the time word;
the processing unit is also used for judging whether the time words are in a preset exclusion type or not under the condition that the time words contain numbers, and if not, converting the time words into a preset format; the output unit is also used for outputting the time words after format conversion.
6. The apparatus according to claim 5, wherein the processing unit is further configured to extract original words from the text, determine matching areas corresponding to the original words in the text, and generate candidate words; the matching area comprises an original word and a predetermined number of characters before and after the original word, and the candidate word is a word which contains the original word in the matching area and at least has one semantic meaning for representing time.
7. The apparatus according to claim 5, wherein the processing unit is further configured to determine a start-stop position of each time word in the text, and merge time words with overlapping or adjacent start-stop positions; the output unit is further used for outputting the merged time words.
8. The apparatus according to claim 7, wherein the processing unit is further configured to determine whether a start/stop position of a current time word overlaps or is adjacent to a start/stop position of a next time word, update the current time word and the next time word to a union of the current time word and the next time word if the start/stop positions overlap or are adjacent to each other, determine a start/stop position of the updated time word in the text, and take the updated time word as the merged time word if the start/stop position of the updated time word does not overlap or is not adjacent to the start/stop position of the next time word.
CN201711123985.7A 2017-11-14 2017-11-14 Time word extraction method and device Active CN107894978B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711123985.7A CN107894978B (en) 2017-11-14 2017-11-14 Time word extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711123985.7A CN107894978B (en) 2017-11-14 2017-11-14 Time word extraction method and device

Publications (2)

Publication Number Publication Date
CN107894978A CN107894978A (en) 2018-04-10
CN107894978B true CN107894978B (en) 2021-04-09

Family

ID=61804470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711123985.7A Active CN107894978B (en) 2017-11-14 2017-11-14 Time word extraction method and device

Country Status (1)

Country Link
CN (1) CN107894978B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829673A (en) * 2018-06-08 2018-11-16 北京玄科技有限公司 The abstracting method and device of time word
CN111027319A (en) * 2019-10-30 2020-04-17 平安科技(深圳)有限公司 Method and device for analyzing natural language time words and computer equipment
US20230280989A1 (en) * 2022-03-04 2023-09-07 Microsoft Technology Licensing, Llc Synthesizing a computer program to include idiomatic function(s) and semantically-meaningful variable(s) using programming by example

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224601A (en) * 2015-08-31 2016-01-06 小米科技有限责任公司 A method and device for extracting time information
CN105824801A (en) * 2015-03-16 2016-08-03 国家计算机网络与信息安全管理中心 Entity relationship rapid extraction method based on automaton
CN107122416A (en) * 2017-03-31 2017-09-01 北京大学 A kind of Chinese event abstracting method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8521511B2 (en) * 2007-06-18 2013-08-27 International Business Machines Corporation Information extraction in a natural language understanding system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
突发事件Web新闻发生时间的抽取与时间粒度分析;崔纯爽;《信息安全与管理》;20150415;第57-60页 *

Also Published As

Publication number Publication date
CN107894978A (en) 2018-04-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20190906

Address after: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant after: China Science and Technology (Beijing) Co., Ltd.

Address before: 100089 Beijing city Haidian District wanquanzhuang Road No. 28 Wanliu new building block A Room 601

Applicant before: Beijing Shenzhou Taiyue Software Co., Ltd.

CB02 Change of applicant information

Address after: 230000 zone B, 19th floor, building A1, 3333 Xiyou Road, hi tech Zone, Hefei City, Anhui Province

Applicant after: Dingfu Intelligent Technology Co., Ltd

Address before: Room 630, 6th floor, Block A, Wanliu Xingui Building, 28 Wanquanzhuang Road, Haidian District, Beijing

Applicant before: DINFO (BEIJING) SCIENCE DEVELOPMENT Co.,Ltd.

GR01 Patent grant