CN112306961B - Log processing method, device, equipment and storage medium - Google Patents
- Publication number
- CN112306961B (application CN201910670466.5A)
- Authority
- CN
- China
- Prior art keywords
- preset
- character string
- string unit
- log
- character
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems

- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/16—File or folder operations, e.g. details of user interfaces specifically adapted to file systems

- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification

- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/906—Clustering; Classification

Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Human Computer Interaction (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention discloses a log processing method, a log processing device, log processing equipment and a storage medium. The method comprises the steps of replacing variable character strings in an original log line to be processed through multiple segmentation, replacement and recombination of the original log line to be processed, obtaining fixed character strings in the original log line to be processed, and representing sentence pattern characteristics of the original log line by the fixed character strings. According to the embodiment of the invention, basis can be provided for accurate classification of log lines, and further accurate classification of log lines is realized.
    Description
Technical Field
      The invention belongs to the technical field of big data, and particularly relates to a log processing method, device, equipment and storage medium.
    Background
      Log files are indispensable in large and very large software and network systems. A log file records the running state and running behavior of a system and contains information about its operating condition. It plays an important role in fault detection and problem localization, and is significant for quality assurance work such as detecting system anomalies and eliminating hidden risks.
      To ensure the accuracy of anomaly detection based on log files, the logs need to be accurately classified. However, existing log classification methods cannot classify logs accurately.
    Disclosure of Invention
      The embodiment of the invention provides a log processing method, a device, equipment and a storage medium, which can extract sentence pattern characteristics of log lines, provide basis for accurate classification of the log lines and further realize accurate classification of the log lines.
      In a first aspect, an embodiment of the present invention provides a log processing method, where the method includes:
      acquiring an original log line to be processed;
      dividing the original log line to be processed according to a preset first separator to obtain at least one first character string unit;
      replacing variable character strings in the at least one first character string unit with a preset symbol to obtain at least one updated first character string unit;
      reorganizing the at least one updated first character string unit to obtain a reorganized log line;
      dividing the reorganized log line according to the preset symbol to obtain a plurality of second character string units, wherein the plurality of second character string units are used for representing sentence pattern characteristics of the original log line.
      In a specific implementation manner, the log processing method provided by the embodiment of the invention further includes:
      the original log line is classified according to the plurality of second string units.
      In one embodiment, reorganizing the updated at least one first string unit includes:
      splicing the at least one updated first character string unit with the first separator as a spacer.
      In one embodiment, the at least one first string unit includes a first type string unit, wherein the first type string unit does not contain a numeric character;
      replacing a variable string in at least one first string unit with a preset symbol, comprising:
      dividing the first type string unit by using a preset second separator to obtain a plurality of first substrings;
      judging whether the number of first substrings is greater than a first preset number;
      if the number is greater than the first preset number, treating the whole first type string unit as a variable character string and replacing the whole first type string unit with the preset symbol;
      if the number is not greater than the first preset number, treating the first substring after the second separator as a variable character string and replacing that first substring with the preset symbol.
      In one embodiment, the at least one first string unit includes a second type string unit, the second type string unit containing a numeric character;
      replacing a variable string in at least one first string unit with a preset symbol, comprising:
      dividing the second type string unit by using a preset third separator to obtain a plurality of second substrings;
      and taking each second substring that contains a numeric character as a variable character string, and replacing it with the preset symbol.
      In a specific embodiment, before the reorganizing the updated at least one first string unit, the method further includes:
      dividing the at least one updated first character string unit by using a preset fourth separator to obtain a plurality of third substrings;
      judging whether the number of third substrings is greater than a second preset number;
      if the number is greater than the second preset number, judging whether a third substring contains the preset symbol;
      if a third substring contains the preset symbol, replacing that third substring with the preset symbol;
      if the number of third substrings is not greater than the second preset number, dividing the at least one updated first character string unit by using a preset fifth separator to obtain a plurality of fourth substrings;
      judging whether the number of fourth substrings is greater than a third preset number;
      if the number is greater than the third preset number, replacing the at least one updated first character string unit with the preset symbol.
      In a specific implementation manner, before the reorganized log line is divided according to the preset symbol, the log processing method provided by the embodiment of the invention further includes:
      judging whether the reorganized log line contains a substring having a preset format;
      if yes, replacing the substring having the preset format with the preset symbol;
      wherein the substring having the preset format represents one or more of file size, time length and operating system time.
      In a second aspect, an embodiment of the present invention provides a log processing apparatus, including:
      the data acquisition module is used for acquiring an original log line to be processed;
      the first segmentation module is used for segmenting an original log line to be processed by using a preset first segmenter to obtain at least one first character string unit;
      the replacing module is used for replacing variable character strings in at least one first character string unit by using preset symbols to obtain at least one updated first character string unit;
      the reorganization module is used for reorganizing the updated at least one first character string unit to obtain reorganized log lines;
      the second segmentation module is used for dividing the reorganized log line according to the preset symbol to obtain a plurality of second character string units, wherein the plurality of second character string units are used for representing sentence pattern characteristics of the original log line.
      In a third aspect, an embodiment of the present invention provides a log processing apparatus, including: a processor and a memory storing computer program instructions;
      the processor when executing the computer program instructions implements the log processing method as in the first aspect.
      In a fourth aspect, embodiments of the present invention provide a computer storage medium having stored thereon computer program instructions which, when executed by a processor, implement a log processing method as in the first aspect.
      According to the log processing method, device, equipment and storage medium, the variable character strings in the original log lines to be processed are replaced through multiple times of segmentation, replacement and recombination of the original log lines to be processed, so that the fixed character strings in the original log lines to be processed are obtained, the fixed character strings are used for representing sentence pattern characteristics of the original log lines, basis can be provided for accurate classification of the log lines, and further accurate classification of the log lines is achieved.
    Drawings
      In order to more clearly illustrate the technical solution of the embodiments of the present invention, the drawings that are needed to be used in the embodiments of the present invention will be briefly described, and it is possible for a person skilled in the art to obtain other drawings according to these drawings without inventive effort.
      FIG. 1 is a schematic flow chart of a log processing method according to an embodiment of the present invention;
      FIG. 2 is a schematic structural diagram of a log processing device according to an embodiment of the present invention;
      FIG. 3 is a schematic structural diagram of a log processing apparatus according to another embodiment of the present invention.
    Detailed Description
      Features and exemplary embodiments of various aspects of the present invention will be described in detail below, and in order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and the detailed embodiments. It should be understood that the specific embodiments described herein are merely configured to illustrate the invention and are not configured to limit the invention. It will be apparent to one skilled in the art that the present invention may be practiced without some of these specific details. The following description of the embodiments is merely intended to provide a better understanding of the invention by showing examples of the invention.
      It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising … …" does not exclude the presence of other like elements in a process, method, article or apparatus that comprises the element.
      The log file is typically in a semi-structured text format, recorded by time of occurrence. From the electronic device running log, to the operating system kernel log and the message log, to the logs of custom processes, business tools, software and high concurrency large software tools, a great amount of information is contained. In addition, as the system scale becomes larger and the complexity increases, the log volume becomes larger, the types become more and more, the information volume becomes higher and higher, and the quality problem of the system is more and more difficult to discover from massive logs only by naked eyes and a searching tool or by using a traditional log detection method.
      For example, if a given log line is assigned to class A on its first occurrence, it must also be assigned to class A, not class B, on its second occurrence. In addition, a log line belonging to class C must not be classified into class A or class B: if only class C log lines show an anomaly but class C has been mistakenly merged into class A or class B, the subsequent processing logic will report that class A or class B is anomalous, which reduces the accuracy of anomaly detection.
      Therefore, to ensure the accuracy of anomaly detection, logs must be classified at the minimum granularity. A text log entry is emitted under certain conditions and follows a template, which is determined by the program line that defines the log format. Logs are therefore classified at the minimum granularity by program line, i.e. the logs output by the same program line are treated as one class. A program line includes fixed character strings, which are constant character combinations, and variable character strings, whose values are determined by external input. Correspondingly, log lines also contain fixed character strings and variable character strings, and the characters at the positions of the variable character strings may differ from one log line to another. This feature makes log classification complicated and difficult.
      In order to solve at least one technical problem described above, an embodiment of the present invention provides a log processing method, apparatus, device, and storage medium. The log processing method provided by the embodiment of the invention is first described below.
      Fig. 1 shows a flow chart of a log processing method according to an embodiment of the present invention. As shown in fig. 1, the log processing method provided by the embodiment of the invention includes the following steps:
      S110, acquiring an original log line to be processed;
      S120, dividing the original log line to be processed according to a preset first separator to obtain at least one first character string unit;
      S130, replacing variable character strings in the at least one first character string unit with a preset symbol to obtain at least one updated first character string unit;
      S140, reorganizing the at least one updated first character string unit to obtain a reorganized log line;
      S150, dividing the reorganized log line according to the preset symbol to obtain a plurality of second character string units, wherein the plurality of second character string units are used for representing sentence pattern characteristics of the original log line.
      According to the log processing method provided by the embodiment of the invention, the variable character strings in the original log line to be processed are replaced through multiple times of segmentation, replacement and recombination of the original log line to be processed, so that the fixed character strings in the original log line to be processed are obtained, and the fixed character strings are used for representing the sentence pattern characteristics of the original log line, so that basis can be provided for accurate classification of the log line, and further accurate classification of the log line is realized.
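      The flow S110 to S150 can be sketched as follows. This is a minimal illustration, not the patented implementation: the names `sentence_pattern` and `replace_variables` are hypothetical, the space and "*" are assumed as first separator and preset symbol, S130 is reduced to the simplest rule (a unit containing any digit is treated as variable), and trimming whitespace around the resulting strings is an added convenience.

```python
PRESET_SYMBOL = "*"     # assumed preset symbol
FIRST_SEPARATOR = " "   # assumed first separator


def replace_variables(unit: str) -> str:
    """Simplified stand-in for S130: a unit containing a digit is variable."""
    return PRESET_SYMBOL if any(ch.isdigit() for ch in unit) else unit


def sentence_pattern(raw_line: str) -> list:
    units = raw_line.split(FIRST_SEPARATOR)            # S120: split into first units
    updated = [replace_variables(u) for u in units]    # S130: replace variable strings
    reorganized = FIRST_SEPARATOR.join(updated)        # S140: reorganize the line
    parts = reorganized.split(PRESET_SYMBOL)           # S150: split at the preset symbol
    # Dropping empty pieces and trimming spaces is an assumption of this sketch.
    return [p.strip() for p in parts if p.strip()]


features = sentence_pattern("Starting compaction of 3 file(s) in c of table42 into tmp")
print(features)
```

      Two log lines emitted by the same program line with different numeric values would yield the same list of fixed strings under this sketch.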
      In S110, the log file includes a plurality of log lines, the log lines correspond to program lines, and the log output by the same program line is used as the log of the same line. The original log lines are used as the objects to be processed with the minimum granularity, so that the accuracy of log classification can be improved.
      Further, the time stamp of the original log line may be removed before proceeding to S120. The format of the log file is generally composed of two parts, namely a time stamp and log content, wherein the time stamp is a variable character string, and the time stamp is removed, so that the fixed character string in the original log line can be conveniently extracted.
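      A minimal sketch of the time stamp removal is given below. The text does not specify the time stamp format, so the pattern used here (a leading "YYYY-MM-DD HH:MM:SS" with optional fractional seconds) and the name `strip_timestamp` are assumptions chosen only for illustration.

```python
import re

# Assumed time stamp layout: "2018-02-26 18:51:52,557 " at the start of the line.
TIMESTAMP = re.compile(r"^\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?\s*")


def strip_timestamp(line: str) -> str:
    """Remove a leading time stamp, leaving only the log content."""
    return TIMESTAMP.sub("", line, count=1)


content = strip_timestamp("2018-02-26 18:51:52,557 INFO Starting compaction")
print(content)
```

      Lines that do not start with a time stamp in the assumed layout are returned unchanged.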
      In S120, a preset first segmenter may be set according to the log line feature. As an example, most of the time, the words or character strings in the log line are spaced apart by spaces, and spaces may be set as the first separators.
      An example in which an original log line is processed is described below.
      For example, the acquired original log line to be processed is:
      INFO [regionserver/worker016-d44n06.cmsz.com/192.168.80.124:16020-shortCompactions-1519642312557] regionserver.HStore: Starting compaction of 3 file(s) in c of fms:Ht_ALL_CDR_RULE,99109964600096149135772017122714211554a09f9a571580ac13,1517469990221.5aedc40e1234496141f5f70d97aa64f5. into tmpdir=hdfs://szbdp/apps/hbase/data/data/fms/Ht_ALL_CDR_RULE/5aedc40e1234496141f5f70d97aa64f5/.tmp, totalSize=166.6 M
      Using the space as the first separator, the original log line is divided into 16 first character string units. For clearer display, each space is shown as # below; it is understood that spaces need not be replaced by # in the actual division process.
      The 16 first string units obtained are:
      INFO#[regionserver/worker016-d44n06.cmsz.com/192.168.80.124:16020-shortCompactions-1519642312557]#regionserver.HStore:#Starting#compaction#of#3#file(s)#in#c#of#fms:Ht_ALL_CDR_RULE,99109964600096149135772017122714211554a09f9a571580ac13,1517469990221.5aedc40e1234496141f5f70d97aa64f5.#into#tmpdir=hdfs://szbdp/apps/hbase/data/data/fms/Ht_ALL_CDR_RULE/5aedc40e1234496141f5f70d97aa64f5/.tmp,#totalSize=166.6#M
      As technology develops, the words or character strings in a log line may no longer be separated by spaces but by other symbols, in which case those symbols are used as the first separator.
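      The division of the example log line can be reproduced with a plain string split. In this sketch the space is assumed as the first separator, and the variable names are illustrative only.

```python
FIRST_SEPARATOR = " "  # assumed: the space is the first separator

raw_line = (
    "INFO [regionserver/worker016-d44n06.cmsz.com/192.168.80.124:16020"
    "-shortCompactions-1519642312557] regionserver.HStore: Starting "
    "compaction of 3 file(s) in c of fms:Ht_ALL_CDR_RULE,"
    "99109964600096149135772017122714211554a09f9a571580ac13,"
    "1517469990221.5aedc40e1234496141f5f70d97aa64f5. into "
    "tmpdir=hdfs://szbdp/apps/hbase/data/data/fms/Ht_ALL_CDR_RULE/"
    "5aedc40e1234496141f5f70d97aa64f5/.tmp, totalSize=166.6 M"
)

# S120: split the original log line into first character string units.
first_units = raw_line.split(FIRST_SEPARATOR)

# Display only: join with "#" so that the unit boundaries are visible.
display = "#".join(first_units)
print(len(first_units))
print(display)
```

      Changing `FIRST_SEPARATOR` is all that is needed if the log format uses a different symbol between words.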
      In S130, the preset symbol may be a special symbol that does not appear in the log line; if every special symbol appears, the special symbol with the lowest frequency of occurrence may be selected. For example, a symbol whose ASCII code value (in hexadecimal) lies in the interval [21, 30) ∪ (39, 41) ∪ (5a, 61) ∪ (7a, 7e] is referred to as a special symbol. In the present description, "*" is used as the preset symbol for illustration.
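      One way to pick the preset symbol is sketched below. It assumes that the interval endpoints are hexadecimal ASCII codes, which gives the 32 printable punctuation characters; the function name `choose_preset_symbol` is hypothetical.

```python
from collections import Counter

# Assumed reading of the intervals as hexadecimal ASCII codes:
# 0x21-0x2F, 0x3A-0x40, 0x5B-0x60 and 0x7B-0x7E.
SPECIAL_SYMBOLS = [
    chr(code)
    for codes in (range(0x21, 0x30), range(0x3A, 0x41),
                  range(0x5B, 0x61), range(0x7B, 0x7F))
    for code in codes
]


def choose_preset_symbol(log_lines):
    """Return an unused special symbol, or else the least frequent one."""
    counts = Counter(ch for line in log_lines for ch in line)
    # min() returns the first symbol with the lowest count, so a symbol
    # that never occurs (count 0) is preferred automatically.
    return min(SPECIAL_SYMBOLS, key=lambda symbol: counts[symbol])


symbol = choose_preset_symbol(["a=b!", 'x"y'])
print(symbol)
```

      Ties are broken by ASCII order in this sketch; the text does not state a tie-breaking rule.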
      The entire first string unit may be a variable character string, or only part of the first string unit may be a variable character string. In general, a character string in a log line that contains a numeric character is certainly a variable character string, but a character string that contains no numeric character may also be a variable character string.
      In some embodiments, the at least one first string unit comprises a first type string unit, the first type string unit containing no numeric characters; replacing a variable string in the at least one first string unit with a preset symbol comprises: dividing the first type string unit by using a preset second separator to obtain a plurality of first substrings; judging whether the number of first substrings is greater than a first preset number; if the number is greater than the first preset number, treating the whole first type string unit as a variable character string and replacing the whole first type string unit with the preset symbol; if the number is not greater than the first preset number, treating the first substring after the second separator as a variable character string and replacing that first substring with the preset symbol.
      For example, 11 first string units out of the 16 first string units obtained above do not contain a numeric character, and the first string units containing no numeric character are used as the first type string units, so that 11 first type string units are obtained.
      Although a first type string unit does not contain a numeric character, it may still contain a variable character string, for example a first type string unit containing the symbol "=". The preset second separator may be "=", so that the first type string unit is divided at the equal sign "=". Preferably, the first preset number is 2. If the number of first substrings obtained is greater than 2, the whole first type string unit may be regarded as a variable character string and is replaced by the preset symbol. If the number of first substrings obtained is less than or equal to 2, the first substring after the equal sign "=" is replaced by the preset symbol.
      The invention further carries out segmentation and replacement processing on the first type character string unit without numbers, can more accurately replace variable character strings in the log lines, so as to obtain fixed character strings, and is further beneficial to accurately classifying the log lines.
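      The handling of a first type string unit can be sketched as follows, assuming "=" as the second separator, 2 as the first preset number and "*" as the preset symbol. The function name is hypothetical, and leaving a unit without any "=" unchanged is an assumption of this sketch, since such a unit has no substring after the separator.

```python
def replace_in_first_type(unit, second_separator="=",
                          first_preset_number=2, symbol="*"):
    """Replace the variable part of a unit that contains no digits."""
    substrings = unit.split(second_separator)
    if len(substrings) == 1:
        # No separator present: nothing follows it, so keep the unit (assumption).
        return unit
    if len(substrings) > first_preset_number:
        # More substrings than the preset number: the whole unit is variable.
        return symbol
    # Otherwise only the substring after the separator is variable.
    return substrings[0] + second_separator + symbol


print(replace_in_first_type("state=RUNNING"))
print(replace_in_first_type("a=b=c"))
print(replace_in_first_type("Starting"))
```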
      In some embodiments, the at least one first string unit comprises a second type string unit, the second type string unit containing a numeric character; replacing a variable string in the at least one first string unit with a preset symbol comprises: dividing the second type string unit by using a preset third separator to obtain a plurality of second substrings; and taking each second substring that contains a numeric character as a variable character string, and replacing it with the preset symbol. The third separator may be a set including a plurality of separator symbols, or may include only one separator symbol.
      For example, among the 16 first string units obtained above, 5 first string units contain numeric characters, and the first string units containing numeric characters are used as the second type string units, so that 5 second type string units are obtained, which are respectively:
      1st second type string unit:
      [regionserver/worker016-d44n06.cmsz.com/192.168.80.124:16020-shortCompactions-1519642312557]
      2nd second type string unit:
      3
      3rd second type string unit:
      fms:Ht_ALL_CDR_RULE,99109964600096149135772017122714211554a09f9a571580ac13,1517469990221.5aedc40e1234496141f5f70d97aa64f5.
      4th second type string unit:
      tmpdir=hdfs://szbdp/apps/hbase/data/data/fms/Ht_ALL_CDR_RULE/5aedc40e1234496141f5f70d97aa64f5/.tmp,
      5th second type string unit:
      totalSize=166.6
      Although a second type string unit contains numeric characters, it may also contain a fixed string. The second type string unit is therefore further divided by the third separator into smaller second substrings, and each second substring containing a digit is replaced by the preset symbol.
      Taking the 3rd of the 5 second type string units obtained above as an example, and taking the separator set {: ___, &.. } as the third separator, the second substrings obtained are fms, Ht, ALL, CDR, RULE, 99109964600096149135772017122714211554a09f9a571580ac13, 1517469990221 and 5aedc40e1234496141f5f70d97aa64f5. After replacement, the fixed substrings fms, Ht, ALL, CDR and RULE remain, and the three substrings containing digits are each replaced by the preset symbol. The second type string unit is then updated with the replaced result, giving an updated second type string unit that begins with "fms:Ht_ALL_CDR_RULE".
      The invention further carries out segmentation and replacement processing on the second type character string unit containing the digital characters, can more accurately replace variable character strings in the log lines, so as to obtain fixed character strings, and further is beneficial to accurately classifying the log lines.
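      The handling of a second type string unit can be sketched with a regular-expression split that keeps the separators, so that the unit can be put back together after replacement. The separator set used here (":", "_", ",", ".", "/" and "=") is an assumption for illustration, as is the function name; "*" is assumed as the preset symbol.

```python
import re

THIRD_SEPARATORS = ":_,./="  # assumed separator set, for illustration only
# The capturing group makes re.split return the separators as well.
_SPLIT = re.compile("([" + re.escape(THIRD_SEPARATORS) + "])")


def replace_in_second_type(unit, symbol="*"):
    """Replace every digit-containing substring of the unit with the symbol."""
    pieces = _SPLIT.split(unit)
    return "".join(
        symbol if any(ch.isdigit() for ch in piece) else piece
        for piece in pieces
    )


unit = ("fms:Ht_ALL_CDR_RULE,"
        "99109964600096149135772017122714211554a09f9a571580ac13,"
        "1517469990221.5aedc40e1234496141f5f70d97aa64f5.")
updated = replace_in_second_type(unit)
print(updated)
```

      Separators never contain digits, so they pass through unchanged and the fixed substrings keep their original positions.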
      The updated 16 first string units obtained by the above-described processing are shown in table 1.
      TABLE 1
      Before S140, the log processing method provided by the embodiment of the present invention further includes: dividing the at least one updated first string unit by using a preset fourth separator to obtain a plurality of third substrings; judging whether the number of third substrings is greater than a second preset number; if the number is greater than the second preset number, judging whether a third substring contains the preset symbol; if a third substring contains the preset symbol, replacing that third substring with the preset symbol; if the number of third substrings is not greater than the second preset number, dividing the at least one updated first string unit by using a preset fifth separator to obtain a plurality of fourth substrings; judging whether the number of fourth substrings is greater than a third preset number; if the number is greater than the third preset number, replacing the at least one updated first string unit with the preset symbol.
      For example, the 16 updated first string units shown in Table 1 are further divided and replaced, where the preset fourth separator may be the symbol "=", the second preset number may be 1, the preset fifth separator may be the symbol ":", and the third preset number may be 1.
      For example, the 14th updated first string unit in Table 1, "tmpdir=hdfs://szbdp/apps/hbase/data/data/fms/Ht_ALL_CDR_RULE/*/.tmp", is divided using the symbol "=" to obtain two third substrings, "tmpdir" and "hdfs://szbdp/apps/hbase/data/data/fms/Ht_ALL_CDR_RULE/*/.tmp". The number of third substrings is greater than the second preset number 1, and because the second of these third substrings contains the preset symbol, the whole of that second third substring is replaced by the preset symbol.
      It should be understood that, with the preset fourth separator being the symbol "=" and the second preset number being 1, a number of third substrings not greater than 1 indicates that the updated first string unit does not contain the symbol "=".
      Illustratively, the symbol ":" is used to divide the 12th updated first string unit in Table 1 (which begins with "fms:Ht_ALL_CDR_RULE" and does not contain the symbol "=") to obtain two fourth substrings. Since the number of fourth substrings obtained is greater than 1, the whole 12th updated first string unit in Table 1 is replaced by the preset symbol.
      The invention further carries out segmentation and replacement processing on the updated first character string unit, can more accurately replace variable character strings in the log lines, so as to obtain fixed character strings, and is further beneficial to accurately classifying the log lines.
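      The further division by the fourth and fifth separators can be sketched as below, using the example values "=" and ":" with both preset numbers equal to 1 and "*" as the preset symbol. The function name is hypothetical. Counting only non-empty substrings is an assumption of this sketch, made so that a unit ending in ":" such as "regionserver.HStore:" is left unchanged, as it is in the reorganized log line of the example.

```python
def collapse_unit(unit, symbol="*", fourth_separator="=",
                  second_preset_number=1, fifth_separator=":",
                  third_preset_number=1):
    """Further generalize an updated first string unit."""
    third = unit.split(fourth_separator)
    if len([s for s in third if s]) > second_preset_number:
        # Replace each third substring that contains the preset symbol.
        return fourth_separator.join(
            symbol if symbol in s else s for s in third
        )
    fourth = [s for s in unit.split(fifth_separator) if s]
    if len(fourth) > third_preset_number:
        # No "=" but several ":"-separated parts: the whole unit is variable.
        return symbol
    return unit


print(collapse_unit("tmpdir=hdfs://szbdp/apps/*/.tmp"))
print(collapse_unit("fms:Ht_ALL_CDR_RULE,*,*.*."))
print(collapse_unit("regionserver.HStore:"))
```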
      Before S140, the log processing method provided by the embodiment of the present invention further includes determining whether the reorganized log line includes a substring having a preset format; if yes, replacing the sub-character strings with preset formats by using preset symbols; the substring in the preset format represents one or more of file size, time length and operating system time.
      In the following, file size is taken as an example. The substring format set P of the preset format representing a file size may be written as:
      “*{K,M,G,T,P,Z,KB,MB,GB,TB,PB,ZB,Mb,Mb,Gb,Tb,Pb,Zb}{2C,2E,20,EMT}”,
      where EMT means end. In implementation, P can be divided into two groups P1 and P2. P1 can be written as:
      “*{K,M,G,T,P,Z,KB,MB,GB,TB,PB,ZB,Mb,Mb,Gb,Tb,Pb,Zb}{2C,2E,20}”,
      and P2 can be written as:
      “*{K,M,G,T,P,Z,KB,MB,GB,TB,PB,ZB,Mb,Mb,Gb,Tb,Pb,Zb}”.
      Matching is performed first against the P1 subset and then against the P2 subset, thereby converting substrings having the preset format into the preset symbol.
      For example, the 16 th updated string unit "M" obtained in table 1 is a substring indicating the file size, and is converted into a preset symbol after processing.
      The invention further performs replacement processing on the updated first character string units, which removes substrings in the log line that represent file size, time length, operating system time and the like. Variable character strings in the log line are thereby replaced more accurately to obtain the fixed character strings, which helps classify the log lines accurately.
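      One possible reading of the file size format is sketched below for a single string unit. It assumes that 2C and 2E are the hexadecimal ASCII codes of the comma and the period, that a unit may start with a number or with the preset symbol "*", and that P1 and P2 can be merged into one pattern with an optional trailing comma or period; the names are hypothetical. The space terminator (20) does not occur inside a unit because units are obtained by splitting at spaces.

```python
import re

PRESET_SYMBOL = "*"  # assumed preset symbol
# Unit names taken from the format set, without the repeated entry.
SIZE_UNITS = ["KB", "MB", "GB", "TB", "PB", "ZB",
              "Mb", "Gb", "Tb", "Pb", "Zb",
              "K", "M", "G", "T", "P", "Z"]

SIZE_PATTERN = re.compile(
    r"(?:\d+(?:\.\d+)?|" + re.escape(PRESET_SYMBOL) + r")?"  # optional value
    r"(?:" + "|".join(SIZE_UNITS) + r")"                      # size unit
    r"[,.]?"                                                  # optional 2C / 2E
)


def replace_file_size(unit):
    """Replace a unit that looks like a file size with the preset symbol."""
    return PRESET_SYMBOL if SIZE_PATTERN.fullmatch(unit) else unit


print(replace_file_size("M"))
print(replace_file_size("166.6M"))
print(replace_file_size("Starting"))
```

      Single-letter units such as "T" or "P" also match under this reading, so a real deployment would likely need stricter context checks.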
      In S140, reorganizing the updated at least one first string unit, including: and splicing the updated at least one first character string unit by taking the first segmenter as a spacer. The embodiment of the invention reorganizes at least one updated first character string unit to obtain a complete log line, and prepares for the next segmentation.
      Illustratively, the 16 updated first string units after the further dividing and replacing processing are spliced by using spaces as spacers, so as to obtain a recombined log line. For the sake of clearer illustration, # is used instead of space, it being understood that in actual segmentation, # is not required to be used instead of space.
      Reorganized log line:
      INFO#*#regionserver.HStore:#Starting#compaction#of#*#file(s)#in#c#of#*#into#tmpdir=*#totalSize=*
      In S150, the reorganized log line is divided according to the preset symbol, and the 5 character strings representing the sentence pattern features of the original log line are obtained, as shown in Table 2.
      TABLE 2
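      Steps S140 and S150 for the example can be sketched as follows, starting from the updated units that appear in the reorganized log line above. The space and "*" are assumed as first separator and preset symbol; dropping empty pieces and trimming surrounding spaces are assumptions of this sketch, so the exact strings may differ from those listed in Table 2.

```python
FIRST_SEPARATOR = " "  # assumed first separator
PRESET_SYMBOL = "*"    # assumed preset symbol

updated_units = [
    "INFO", "*", "regionserver.HStore:", "Starting", "compaction", "of",
    "*", "file(s)", "in", "c", "of", "*", "into", "tmpdir=*", "totalSize=*",
]

# S140: splice the updated units with the first separator as the spacer.
reorganized = FIRST_SEPARATOR.join(updated_units)

# S150: divide the reorganized line at the preset symbol.
second_units = [
    part.strip()
    for part in reorganized.split(PRESET_SYMBOL)
    if part.strip()
]

print(reorganized)
print(second_units)
```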
      According to the log processing method provided by the embodiment of the invention, the sentence pattern features of log lines are constructed through multiple rounds of segmentation, replacement and reorganization; the variable character strings are separated out and the log features are accurately extracted, so that logs can be classified accurately at the minimum granularity. Meanwhile, the processing steps of the method are simple and general: the method is applicable to different languages, different software tools, and different operating system kernel and message logs, and is especially suitable for classifying massive, diverse, semi-structured logs in very large and complex big data systems.
      Further, after S150, the log processing method provided by the embodiment of the present invention further includes: classifying the original log line according to the plurality of second string units that represent the sentence pattern features of the original log line.
      Illustratively, the original log lines with the same second character string units representing sentence pattern features are taken as the same type of log lines, so as to realize accurate classification of the log lines.
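      A minimal sketch of this classification step follows, assuming the second string units of each original log line have already been computed. The function name `classify` and the sample log lines are hypothetical and serve only as illustration.

```python
from collections import defaultdict


def classify(lines_with_units):
    """Group original log lines whose second string units are identical.

    lines_with_units: iterable of (original_line, second_string_units) pairs.
    Returns a dict mapping each tuple of units to the lines that share it.
    """
    groups = defaultdict(list)
    for original_line, units in lines_with_units:
        groups[tuple(units)].append(original_line)
    return dict(groups)


# Hypothetical sample data: two lines share a sentence pattern, one differs.
samples = [
    ("Starting compaction of 3 file(s)", ["Starting compaction of", "file(s)"]),
    ("Starting compaction of 7 file(s)", ["Starting compaction of", "file(s)"]),
    ("Closing region r1", ["Closing region"]),
]
groups = classify(samples)
for units, lines in groups.items():
    print(units, len(lines))
```

      Using the tuple of second string units as the dictionary key means that two lines fall into the same class only when every unit matches in the same order.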
      Fig. 2 shows a schematic structural diagram of a log processing device according to an embodiment of the present invention. As shown in fig. 2, the log processing device provided in the embodiment of the present invention includes:
      a data acquisition module 201, configured to acquire an original log line to be processed;
      a first segmentation module 202, configured to divide the original log line to be processed by using a preset first separator to obtain at least one first string unit;
      a replacing module 203, configured to replace a variable string in at least one first string unit with a preset symbol, to obtain at least one updated first string unit;
      a reorganizing module 204, configured to reorganize the updated at least one first string unit to obtain reorganized log lines;
      a second segmentation module 205, configured to divide the recombined log line according to the preset symbol to obtain a plurality of second string units, where the plurality of second string units are used to represent sentence pattern features of the original log line.
      According to the log processing device provided by the embodiment of the invention, the variable character strings in the original log line to be processed are replaced through repeated division, replacement and recombination of that log line, leaving its fixed character strings. Because the fixed character strings represent the sentence pattern features of the original log line, they provide a basis for accurately classifying log lines.
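      The cooperation of modules 201 to 205 can be sketched as a small Python class. This is a sketch under stated assumptions, not the device itself: the class and method names are hypothetical, the first separator is assumed to be a space and the preset symbol `*`, and `mask_if_digit` is a deliberately simplified stand-in for the replacement rules (it masks any unit containing a digit).

```python
import re


def mask_if_digit(unit, symbol):
    """Simplified stand-in for the replacement rules: mask units containing a digit."""
    return symbol if re.search(r"\d", unit) else unit


class LogProcessor:
    def __init__(self, first_separator=" ", preset_symbol="*", replace_unit=mask_if_digit):
        self.first_separator = first_separator
        self.preset_symbol = preset_symbol
        self.replace_unit = replace_unit

    def acquire(self, raw_line):          # data acquisition module 201
        return raw_line.rstrip("\r\n")

    def first_split(self, line):          # first segmentation module 202
        return line.split(self.first_separator)

    def replace(self, units):             # replacing module 203
        return [self.replace_unit(u, self.preset_symbol) for u in units]

    def reorganize(self, units):          # reorganizing module 204
        return self.first_separator.join(units)

    def second_split(self, line):         # second segmentation module 205
        parts = line.split(self.preset_symbol)
        return [p.strip() for p in parts if p.strip()]

    def process(self, raw_line):
        units = self.first_split(self.acquire(raw_line))
        return self.second_split(self.reorganize(self.replace(units)))


processor = LogProcessor()
print(processor.process("INFO 2019-07-24 Starting compaction of 3 file(s)\n"))
```

      Passing the replacement rule in as a function keeps the five stages fixed while allowing the finer-grained rules of the later embodiments to be substituted.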
      In a specific implementation manner, the log processing device provided by the embodiment of the invention further includes:
      a classification unit, configured to classify the original log line according to the plurality of second string units.
      The original log lines with the same character strings representing sentence pattern features are used as the same type of log lines, so that the accurate classification of the log lines is realized.
      In one embodiment, the reorganization module 204 is specifically configured to:
      splice the updated at least one first string unit with the first separator as the spacer.
      The embodiment of the invention reorganizes at least one updated first character string unit to obtain a complete log line, and prepares for the next segmentation.
      In one embodiment, the at least one first string unit includes a first type string unit, wherein the first type string unit does not contain a numeric character; the replacement module 203 is specifically configured to:
      dividing the first type string unit by using a preset second separator to obtain a plurality of first substrings;
      judging whether the number of first substrings is greater than a first preset number;
      if the number is greater than the first preset number, treating the whole first type string unit as a variable character string and replacing the whole first type string unit with the preset symbol;
      if the number is not greater than the first preset number, treating the first substring after the second separator as a variable character string and replacing that first substring with the preset symbol.
      The embodiment of the invention thus further divides and replaces the first type string unit, which contains no digits, so that variable character strings in the log line are replaced more accurately and the fixed character strings are obtained, which in turn facilitates accurate classification of the log line.
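      The rule for first type string units can be sketched as follows. The text does not give concrete values, so the second separator `=` and the first preset number 2 are purely illustrative assumptions, as is the function name.

```python
PRESET_SYMBOL = "*"  # assumed preset symbol


def replace_first_type_unit(unit, second_separator="=", first_preset_number=2):
    """Replace the variable part of a string unit that contains no digits.

    second_separator and first_preset_number are illustrative values only.
    """
    substrings = unit.split(second_separator)
    if len(substrings) > first_preset_number:
        # Too many pieces: treat the whole unit as a variable string.
        return PRESET_SYMBOL
    if len(substrings) > 1:
        # Treat the substring right after the separator as the variable string.
        substrings[1] = PRESET_SYMBOL
    return second_separator.join(substrings)


for sample in ("state=RUNNING", "a=b=c", "Starting"):
    print(sample, "->", replace_first_type_unit(sample))
```

      A unit that does not contain the separator at all is returned unchanged in this sketch, since there is no substring after the separator to replace.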
      In one embodiment, the at least one first string unit includes a second type string unit, the second type string unit containing a numeric character; the replacement module 203 is specifically configured to:
      dividing the second type string unit by using a preset third separator to obtain a plurality of second substrings;
      treating each second substring that contains a numeric character as a variable character string, and replacing it with the preset symbol.
      The embodiment of the invention thus further divides and replaces the second type string unit, which contains numeric characters, so that variable character strings in the log line are replaced more accurately and the fixed character strings are obtained, which in turn facilitates accurate classification of the log line.
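      A minimal sketch of the rule for second type string units follows. The third separator is assumed here to be `=`, which is an illustrative choice and not stated in the text; the function name and the sample units are likewise hypothetical.

```python
PRESET_SYMBOL = "*"  # assumed preset symbol


def replace_second_type_unit(unit, third_separator="="):
    """Mask every substring of a digit-bearing unit that itself contains a digit.

    third_separator is an illustrative value only.
    """
    substrings = unit.split(third_separator)
    masked = [
        PRESET_SYMBOL if any(ch.isdigit() for ch in s) else s
        for s in substrings
    ]
    return third_separator.join(masked)


for sample in ("totalSize=11.1k", "tmpdir=hdfs://host:8020/tmp", "7"):
    print(sample, "->", replace_second_type_unit(sample))
```

      Substrings without digits, such as the key before the separator, are kept as fixed character strings.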
      In one embodiment, before reorganizing the updated at least one first string unit, the replacing module 203 is further configured to:
      dividing the updated at least one first string unit by using a preset fourth separator to obtain a plurality of third substrings;
      judging whether the number of third substrings is greater than a second preset number;
      if the number is greater than the second preset number, judging whether each third substring contains the preset symbol;
      if a third substring contains the preset symbol, replacing that third substring with the preset symbol;
      if the number of third substrings is not greater than the second preset number, dividing the updated at least one first string unit by using a preset fifth separator to obtain a plurality of fourth substrings;
      judging whether the number of fourth substrings is greater than a third preset number;
      if the number is greater than the third preset number, replacing the updated at least one first string unit with the preset symbol.
      The embodiment of the invention thus further divides and replaces the updated first string unit, so that variable character strings in the log line are replaced more accurately and the fixed character strings are obtained, which in turn facilitates accurate classification of the log line.
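      The two-stage check above can be sketched as follows. None of the concrete values appear in the text: the fourth separator `,`, the fifth separator `.`, and the preset numbers 1 and 3 are assumptions chosen only so that the example runs, and the function name is hypothetical.

```python
PRESET_SYMBOL = "*"  # assumed preset symbol


def post_process_unit(unit, fourth_separator=",", second_preset_number=1,
                      fifth_separator=".", third_preset_number=3):
    """Further division and replacement of an already updated first string unit.

    All separator and threshold values are illustrative assumptions.
    """
    third_substrings = unit.split(fourth_separator)
    if len(third_substrings) > second_preset_number:
        # Collapse every piece that already carries the preset symbol.
        return fourth_separator.join(
            PRESET_SYMBOL if PRESET_SYMBOL in s else s for s in third_substrings
        )
    fourth_substrings = unit.split(fifth_separator)
    if len(fourth_substrings) > third_preset_number:
        # Many pieces under the fifth separator: mask the whole unit.
        return PRESET_SYMBOL
    return unit


for sample in ("a,b*c,d", "org.apache.hadoop.hbase.Foo", "regionserver.HStore:"):
    print(sample, "->", post_process_unit(sample))
```

      With these illustrative values, a long dotted name is masked as a whole, while a short one such as `regionserver.HStore:` is kept as a fixed character string.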
      In one embodiment, before dividing the recombined log line with the preset symbol, the replacing module 203 is further configured to:
      judging whether the recombined log line contains a substring in a preset format;
      if yes, replacing the substring in the preset format with the preset symbol;
      wherein the substring in the preset format represents one or more of file size, duration and operating system time.
      The embodiment of the invention thus further performs replacement processing on the updated first string unit, removing substrings that represent file size, duration, operating system time, and the like. Variable character strings in the log line are thereby replaced more accurately, leaving the fixed character strings, which facilitates accurate classification of the log line.
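      One possible way to realize the preset-format replacement is with regular expressions, sketched below. The three patterns are assumptions about what a file size, a duration and an operating system time might look like; the text does not define the preset formats, and the sample line is hypothetical.

```python
import re

PRESET_SYMBOL = "*"  # assumed preset symbol

# Assumed preset formats; applied in this order.
PRESET_FORMATS = [
    # operating system time, e.g. 2019-07-24 10:15:30 or 2019-07-24T10:15:30,123
    re.compile(r"\b\d{4}-\d{2}-\d{2}[ T]\d{2}:\d{2}:\d{2}(?:[.,]\d+)?\b"),
    # file size, e.g. 11.1 KB, 512MB, 42 bytes
    re.compile(r"\b\d+(?:\.\d+)?\s?(?:[KMGT]i?B|bytes?)\b", re.IGNORECASE),
    # duration, e.g. 35ms, 1.5 s, 2 min
    re.compile(r"\b\d+(?:\.\d+)?\s?(?:ms|secs?|mins?|s|h)\b"),
]


def replace_preset_formats(line):
    """Replace every substring matching an assumed preset format with the preset symbol."""
    for pattern in PRESET_FORMATS:
        line = pattern.sub(PRESET_SYMBOL, line)
    return line


print(replace_preset_formats("took 35ms to write 11.1 KB at 2019-07-24 10:15:30"))
```

      The timestamp pattern is applied first so that its digits are not partially consumed by the size or duration patterns.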
      Fig. 3 shows a schematic hardware structure of a log processing device according to an embodiment of the present invention.
      A processor 301 and a memory 302 storing computer program instructions may be included in the log processing device.
      In particular, the processor 301 may include a Central Processing Unit (CPU), or an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or may be configured as one or more integrated circuits that implement embodiments of the present invention.
      Memory 302 may include mass storage for data or instructions. By way of example, and not limitation, memory 302 may comprise a hard disk drive (HDD), floppy disk drive, flash memory, optical disk, magneto-optical disk, magnetic tape, or Universal Serial Bus (USB) drive, or a combination of two or more of the foregoing. Memory 302 may include removable or non-removable (or fixed) media, where appropriate. Memory 302 may be internal or external to the log processing device, where appropriate. In a particular embodiment, the memory 302 is a non-volatile solid-state memory. In particular embodiments, memory 302 includes read-only memory (ROM). The ROM may be mask-programmed ROM, programmable ROM (PROM), erasable PROM (EPROM), electrically erasable PROM (EEPROM), electrically alterable ROM (EAROM), or flash memory, or a combination of two or more of these, where appropriate.
      The processor 301 implements any of the log processing methods of the above embodiments by reading and executing computer program instructions stored in the memory 302.
      In one example, the log processing device may also include a communication interface 303 and a bus 310. As shown in fig. 3, the processor 301, the memory 302, and the communication interface 303 are connected to each other by a bus 310 and perform communication with each other.
      The communication interface 303 is mainly used to implement communication between each module, device, unit and/or apparatus in the embodiment of the present invention.
      Bus 310 includes hardware, software, or both, coupling components of the log processing device to each other. By way of example, and not limitation, the bus may include an Accelerated Graphics Port (AGP) or other graphics bus, an Enhanced Industry Standard Architecture (EISA) bus, a Front Side Bus (FSB), a HyperTransport (HT) interconnect, an Industry Standard Architecture (ISA) bus, an InfiniBand interconnect, a Low Pin Count (LPC) bus, a memory bus, a Micro Channel Architecture (MCA) bus, a Peripheral Component Interconnect (PCI) bus, a PCI-Express (PCIe) bus, a Serial Advanced Technology Attachment (SATA) bus, a Video Electronics Standards Association local bus (VLB), or other suitable bus, or a combination of two or more of the above. Bus 310 may include one or more buses, where appropriate. Although embodiments of the invention have been described and illustrated with respect to a particular bus, the invention contemplates any suitable bus or interconnect.
      The log processing device may execute the log processing method in the embodiment of the present invention, thereby implementing the log processing method and apparatus described in connection with fig. 1 and fig. 2.
      In addition, in combination with the log processing method in the above embodiment, the embodiment of the present invention may be implemented by providing a computer storage medium. The computer storage medium has stored thereon computer program instructions; the computer program instructions, when executed by a processor, implement any of the log processing methods of the above embodiments.
      It should be understood that the invention is not limited to the particular arrangements and instrumentality described above and shown in the drawings. For the sake of brevity, a detailed description of known methods is omitted here. In the above embodiments, several specific steps are described and shown as examples. However, the method processes of the present invention are not limited to the specific steps described and shown, and those skilled in the art can make various changes, modifications and additions, or change the order between steps, after appreciating the spirit of the present invention.
      The functional blocks shown in the above-described structural block diagrams may be implemented in hardware, software, firmware, or a combination thereof. When implemented in hardware, a functional block may be, for example, an electronic circuit, an application specific integrated circuit (ASIC), suitable firmware, a plug-in, a function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The programs or code segments may be stored in a machine-readable medium or transmitted over a transmission medium or communication link by a data signal carried in a carrier wave. A "machine-readable medium" may include any medium that can store or transfer information. Examples of machine-readable media include electronic circuits, semiconductor memory devices, ROM, flash memory, erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, radio frequency (RF) links, and the like. The code segments may be downloaded via computer networks such as the Internet or an intranet.
      It should also be noted that the exemplary embodiments mentioned in this disclosure describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, or may be performed in a different order from the order in the embodiments, or several steps may be performed simultaneously.
      In the foregoing, only the specific embodiments of the present invention are described, and it will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, modules and units described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein. It should be understood that the scope of the present invention is not limited thereto, and any equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present invention, and they should be included in the scope of the present invention.
    Claims (9)
1. A log processing method, comprising:
      acquiring an original log line to be processed;
      dividing the original log line to be processed according to a preset first separator to obtain at least one first character string unit;
      replacing variable character strings in the at least one first character string unit by preset symbols to obtain at least one updated first character string unit;
      reorganizing the updated at least one first character string unit to obtain reorganized log lines;
      dividing the recombined log line according to the preset symbols to obtain a plurality of second character string units, wherein the plurality of second character string units are used for representing sentence pattern characteristics of the original log line;
      the at least one first character string unit comprises a first type character string unit, and the first type character string unit does not contain digital characters;
      the replacing the variable character string in the at least one first character string unit with a preset symbol includes: dividing the first type character string unit by using a preset second separator to obtain a plurality of first substrings; judging whether the number of the first substrings is greater than a first preset number; if the number is greater than the first preset number, the whole first type character string unit is a variable character string, and the whole first type character string unit is replaced by the preset symbol; if the number of the first substrings is not greater than the first preset number, the first substring after the second separator is a variable character string, and the first substring after the second separator is replaced by the preset symbol.
    2. The log processing method as defined in claim 1, further comprising:
      classifying the original log line according to the plurality of second character string units.
    3. The log processing method as set forth in claim 1, wherein reorganizing the updated at least one first string unit includes:
      splicing the updated at least one first character string unit with the first separator as the spacer.
    4. The log processing method as defined in claim 1, wherein the at least one first string unit includes a second type string unit, and the second type string unit contains a numeric character therein;
      the replacing the variable string in the at least one first string unit with a preset symbol includes:
      dividing the second type character string unit by using a preset third separator to obtain a plurality of second substrings;
      and taking each second substring that contains a numeric character among the plurality of second substrings as a variable character string, and replacing the second substring containing the numeric character with the preset symbol.
    5. The log processing method as defined in claim 1, wherein before the reorganizing the updated at least one first string unit, the method further comprises:
      dividing the updated at least one first character string unit by using a preset fourth separator to obtain a plurality of third substrings;
      judging whether the number of the third substrings is greater than a second preset number;
      if the number is greater than the second preset number, judging whether the third substring contains the preset symbol;
      if the third substring contains the preset symbol, replacing the third substring containing the preset symbol with the preset symbol;
      if the number of the third substrings is not greater than the second preset number, dividing the updated at least one first character string unit by using a preset fifth separator to obtain a plurality of fourth substrings;
      judging whether the number of the fourth substrings is greater than a third preset number;
      and if the number is greater than the third preset number, replacing the updated at least one first character string unit with the preset symbol.
    6. The log processing method of claim 1, wherein prior to the dividing the reorganized log line with the preset symbol, the method further comprises:
      judging whether the recombined log line contains a substring with a preset format or not;
      if yes, replacing the sub-character strings in the preset format by the preset symbols;
      wherein, the substring of preset format represents one or more of file size, time length and operating system time.
    7. A log processing apparatus, the apparatus comprising:
      the data acquisition module is used for acquiring an original log line to be processed;
      the first segmentation module is used for dividing the original log line to be processed by using a preset first separator to obtain at least one first character string unit;
      the replacing module is used for replacing variable character strings in the at least one first character string unit by preset symbols to obtain at least one updated first character string unit;
      the reorganization module is used for reorganizing the updated at least one first character string unit to obtain reorganized log lines;
      the second segmentation module is used for segmenting the recombined log lines according to the preset symbols to obtain a plurality of second character string units, wherein the plurality of second character string units are used for representing sentence pattern characteristics of the original log lines;
      the at least one first character string unit comprises a first type character string unit, and the first type character string unit does not contain digital characters;
      the replacing module is used for dividing the first type character string unit by using a preset second separator to obtain a plurality of first substrings; judging whether the number of the first substrings is greater than a first preset number; if the number is greater than the first preset number, the whole first type character string unit is a variable character string, and the whole first type character string unit is replaced by the preset symbol; if the number of the first substrings is not greater than the first preset number, the first substring after the second separator is a variable character string, and the first substring after the second separator is replaced by the preset symbol.
    8. A log processing apparatus, the apparatus comprising: a processor and a memory storing computer program instructions;
      the processor, when executing the computer program instructions, implements the log processing method of any of claims 1-6.
    9. A computer readable storage medium, wherein computer program instructions are stored on the computer readable storage medium, which when executed by a processor, implement the log processing method according to any one of claims 1-6.
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN201910670466.5A CN112306961B (en) | 2019-07-24 | 2019-07-24 | Log processing method, device, equipment and storage medium | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN112306961A (en) | 2021-02-02 | 
| CN112306961B (en) | 2024-03-19 | 
Family
ID=74329156
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date | 
|---|---|---|---|
| CN201910670466.5A Active CN112306961B (en) | 2019-07-24 | 2019-07-24 | Log processing method, device, equipment and storage medium | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN112306961B (en) | 
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN119088773B (en) * | 2024-11-07 | 2025-02-18 | 杭州浩联智能科技有限公司 | Log processing method, device, equipment and medium | 
Citations (5)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN102163353A (en) * | 2011-02-25 | 2011-08-24 | 广州广电运通金融电子股份有限公司 | Electronic journal diary intelligent analysis system and method | 
| CN102768636A (en) * | 2011-05-05 | 2012-11-07 | 阿里巴巴集团控股有限公司 | Log analysis method and log analysis device | 
| GB201220817D0 (en) * | 2011-11-28 | 2013-01-02 | Ibm | Data transformation by replacement of sensitive information in a log | 
| CN107315779A (en) * | 2017-06-05 | 2017-11-03 | 海致网络技术(北京)有限公司 | Log analysis method and system | 
| CN109885456A (en) * | 2019-02-20 | 2019-06-14 | 武汉大学 | A multi-type fault event prediction method and device based on system log clustering | 
Family Cites Families (3)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US9563635B2 (en) * | 2013-10-28 | 2017-02-07 | International Business Machines Corporation | Automated recognition of patterns in a log file having unknown grammar | 
| JP6244992B2 (en) * | 2014-03-07 | 2017-12-13 | 富士通株式会社 | Configuration information management program, configuration information management method, and configuration information management apparatus | 
| US9355111B2 (en) * | 2014-04-30 | 2016-05-31 | Microsoft Technology Licensing, Llc | Hierarchical index based compression | 
Non-Patent Citations (1)
| Title | 
|---|
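| Anomaly detection technology based on log templates; Wang Zhiyuan; Ren Chongguang; Chen Rong; Qin Li; 智能计算机与应用 (Intelligent Computer and Applications), No. 05; full text * | 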
Also Published As
| Publication number | Publication date | 
|---|---|
| CN112306961A (en) | 2021-02-02 | 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||