Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
In the description of the present invention, it should be noted that if the terms "upper", "lower", "inside", "outside", etc. indicate an orientation or a positional relationship based on that shown in the drawings or that the product of the present invention is used as it is, this is only for convenience of description and simplification of the description, and it does not indicate or imply that the device or the element referred to must have a specific orientation, be constructed in a specific orientation, and be operated, and thus should not be construed as limiting the present invention.
Furthermore, the appearances of the terms "first," "second," and the like, if any, are used solely to distinguish one from another and are not to be construed as indicating or implying relative importance.
It should be noted that the features of the embodiments of the present invention may be combined with each other without conflict.
At present, under the background of big data, along with the development of distributed computation, the drawback of the mode of synchronous processing shows gradually, because the producer who is responsible for data generation and the consumer who is responsible for processing data are not matched with each other in step, the synchronous processing mode often can cause losing of data, and the mode of asynchronous processing owing to kept in to the data that the producer generated to the problem that the data that the synchronous mode exists are lost has been solved. To achieve decoupling between the producer and the consumer, message middleware is typically employed to temporarily store data generated by the producer. Referring to fig. 1, fig. 1 is a schematic view illustrating an application scenario provided by an embodiment of the present invention, in fig. 1, a first server 10 is a consumer responsible for data processing, a second server 20 is a server running a message middleware and responsible for temporary storage of data, and a third server is a producer responsible for data generation. The producer generates data, sends the generated data to the message middleware for temporary storage, and the consumer acquires the data from the message middleware and processes the acquired data, for example, stores the acquired data in a preset database.
In this embodiment, the message middleware may be, but is not limited to, kakfa, ActiveMQ, RabbitMQ, and the like.
In this embodiment, the first server 10, the second server 20, and the third server 30 may be physical computers or virtual machines capable of implementing the same functions as the physical computers, one server, or a server cluster composed of multiple servers.
It should be noted that the message middleware may run on the second server 20 independently, or may run on the first server 10 or the third server 30 as a software functional module.
On the basis of fig. 1, a block schematic diagram of the first server 10 in fig. 1 is provided in the embodiment of the present invention, please refer to fig. 2, and fig. 2 shows a block schematic diagram of the first server 10 provided in the embodiment of the present invention.
The first server 10 comprises a processor 11, a memory 12, a bus 13, a communication interface 14. The processor 11 and the memory 12 are connected by a bus 13, and the processor 11 communicates with the second server 20 through a communication interface 14.
The processor 11 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 11. The Processor 11 may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), an off-the-shelf programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components.
The memory 12 is used for storing a program, such as the data processing apparatus 100 in the embodiment of the present invention, the data processing apparatus 100 includes at least one software functional module which can be stored in the memory 12 in a form of software or firmware (firmware), and the processor 11 executes the program after receiving an execution instruction to implement the data processing method in the embodiment of the present invention.
The Memory 12 may include a high-speed Random Access Memory (RAM) and may also include a non-volatile Memory (non-volatile Memory). Alternatively, the memory 12 may be a storage device built in the processor 11, or may be a storage device independent of the processor 11.
The bus 13 may be an ISA bus, a PCI bus, an EISA bus, or the like. Fig. 2 is represented by only one double-headed arrow, but does not represent only one bus or one type of bus.
Referring to fig. 3, fig. 3 is a flowchart illustrating a data processing method according to an embodiment of the present invention, where the method is applied to the first server 10 in fig. 1 and fig. 2, and the method includes the following steps:
step S100, data to be processed is obtained through message middleware, wherein the data to be processed comprises an identifier for uniquely representing the data to be processed.
In this embodiment, when the producer generates data, an identifier for uniquely characterizing the data is generated, the data and the identifier thereof are sent to the message middleware for temporary storage, and the consumer takes the data out of the middleware, processes the data, and stores the data into the data table. The data table may be stored in advance on the first server 10, or may be stored in advance in an external database server communicatively connected to the first server 10.
In this embodiment, the identifier of the data to be processed may be composed of two or more of the generation time, the service name, and the module number.
Step S110, if a checking table related to the data table exists, judging whether the data table has data to be processed by using the checking table, wherein the data table comprises an identification field, and a main key of the checking table is related to the identification field of the data table.
In this embodiment, the check table is provided with a primary key, where the primary key is a candidate field selected for unique identification, and the primary key may be composed of one field or a plurality of fields. The primary key typically functions four times: the integrity of an entity can be ensured; the operation speed of the database can be accelerated; when adding new record in the table, automatically checking the primary key value of the new record, and not allowing the value to be repeated with the primary key values of other records; and fourthly, automatically displaying the records in the table according to the sequence of the primary key values. Therefore, after the check table sets the primary key, when the identification of the data is stored in the check table, the primary key check is triggered first, that is, whether the value of the primary key of the data to be stored already exists is judged, if the value of the primary key of the data to be stored already exists, the primary key check fails, which means that the data is the repeated data, and the identification of the data stored in the check table fails in the data table, otherwise, the primary key check succeeds, which means that the data is not the repeated data, and the identification of the data stored in the check table succeeds.
In this embodiment, the primary key of the check table is related to the identification field of the data table, for example, the primary key of the check table may be the identification field of the data table, or may be a combination of the identification field of the data table and other fields. For example, the data table includes a field a, a field B, and a field C, where the field a is an identification field, that is, two records having the same value of the field a do not exist in the data table, and the related check table of the data table includes the field a, and the field a is set as the primary key.
In this embodiment, the number of fields in the check table may be smaller than that of the data table, for example, the check table includes field a and the data table includes fields A, B and C, or the number of data pieces in the check table is smaller than that of the data table, for example, the data table stores data of the last year, the number of data pieces is ten thousand, and the check table stores data of the last day, and the number of data pieces is only thousands. The data volume in the check table is smaller than that in the data table, the check table is provided with the main key, the data table does not need to be provided with the main key, and the main key is related to the identification field of the data table, so that whether the data to be processed is repeated data in the data table can be quickly judged by using the check table.
And step S120, if the data to be processed does not exist in the data table, processing the data to be processed.
In this embodiment, the absence of the data to be processed in the data table means that the data to be processed is not duplicated data, at this time, the data to be processed needs to be processed, and the data to be processed is stored in the data table after processing, it should be noted that the identifier of the data to be processed is correspondingly stored in the corresponding position of the identifier field in the data table.
According to the method provided by the embodiment of the invention, the comprehensiveness of the data is ensured and the reliability of the data is increased by setting the check table for the data table, the main key of the check table is determined according to the identification field of the data table, and whether repeated data to be processed exists in the data table can be judged according to the main key of the check table, so that the check efficiency of repeated data is improved, and the data processing efficiency is finally improved.
On the basis of fig. 3, an embodiment of the present invention further provides a specific implementation manner for determining whether there is data to be processed in the data table, please refer to fig. 4, where fig. 4 shows a flowchart of another data processing method provided in the embodiment of the present invention, and step S110 includes the following sub-steps:
and a substep S1101 of performing primary key verification on the identifier of the data to be processed by using the verification table.
In this embodiment, because the primary key of the check table is related to the identification field of the data table, and the identification field of the data table is the only field for representing data, the primary key check can be performed on the identification of the data to be processed through the check table, and whether the data to be processed is the repeated data in the data table is determined according to the primary key check result.
And a substep S1102, if the identifier of the data to be processed passes the primary key verification, judging that the data to be processed does not exist in the data table, and storing the identifier of the data to be processed into the verification table.
And a substep S1103 of determining that the data to be processed exists in the data table if the identifier of the data to be processed does not pass the primary key verification.
In this embodiment, if the identifier of the to-be-processed data does not pass the primary key verification, the identifier of the to-be-processed data is not stored in the verification table.
According to the method provided by the embodiment of the invention, the main key check is carried out on the identifier of the data to be processed to judge whether the data to be processed is the repeated data, so that the judging efficiency is improved.
In this embodiment, when the system is started, a check table is not yet created, and at this time, in order to subsequently perform repeated data determination according to the check table, the check table needs to be created first, and a primary key is set for the newly created check table, so that another data processing method is further provided in this embodiment of the present invention, referring to fig. 5, where fig. 5 shows a flowchart of another data processing method provided in this embodiment of the present invention, the method further includes:
step S130, if the checking table related to the data table does not exist, the checking table is created, and the main key of the checking table is determined according to the identification field of the data table.
According to the method provided by the embodiment of the invention, the main key of the check table is set while the check table is created, so that the influence of creating the check table on the subsequent data processing efficiency can be reduced to the maximum extent.
In this embodiment, in order to record the location of the consumer fetching data from the message middleware, the consumer usually sets a second offset in the message middleware to represent the location of the current data to be processed, and after the consumer finishes processing the data, the consumer usually sets a first offset to represent the location of the current processed data, where the first offset and the second offset are normally synchronous or substantially synchronous. However, when the message middleware is abnormal, for example, the consumer fails to write the second offset, or the consumer fails to process the data, or the consumer fails to write the first offset, at this time, data rollback needs to be performed, the stored data is restored to a previous state, and the first offset and the second offset also need to be synchronously rolled back to the previous state, so as to avoid missing during data restoration, an embodiment of the present invention further provides a specific implementation of abnormal restoration, please refer to fig. 6, and fig. 6 shows a flowchart of another data processing method provided by the embodiment of the present invention, where the method includes:
step S200, when detecting that the message middleware performs an abnormal recovery, using the smaller of the first offset and the second offset as a start position.
In this embodiment, the first offset and the second offset are continuously increased along with the processing of the data, so that the smaller position of the first offset and the second offset is used as the starting position of the recovery, the starting position corresponds to the earlier processed data, and in order to avoid the omission of the data, the data recovery is performed from the data at the starting position (i.e., the earlier data). For example, if the first offset is 1000 and the second offset is 998, the start position is 998, and data is restored from the data at the position 998.
Step S210, updating the data table and the check table from the data at the start position of the message middleware to perform data exception recovery.
In this embodiment, the process of performing data exception recovery on data from the start position is similar to the foregoing steps S100 to S120, and data to be recovered is sequentially taken out from the start position, and a check table is used to determine whether the data to be recovered already exists in the data table, if so, the data to be recovered is ignored, and the next data is continuously recovered, otherwise, the data to be recovered is processed and stored in the data table.
In this embodiment, in order to enable the first offset and the second offset to be always in correct positions, an embodiment of the present invention further provides a specific implementation manner for updating the first offset and the second offset, please refer to fig. 7, and fig. 7 shows a flowchart of another data processing method provided in the embodiment of the present invention, where the method includes step S101 and step S131.
Step S101, the control message middleware updates the second offset.
In this embodiment, the updating of the second offset may be performed after the first server 10 obtains the data to be processed through the message middleware, and the first server 10 controls the second server 20 to update the second offset in the second server 20, so that the second offset points to the next data to be processed, which needs to be obtained.
Step S121, store the data to be processed in the data table, and update the first offset.
In this embodiment, the updating of the first offset may be performed after the data to be processed is processed, in an application scenario where the data to be processed needs to be stored, after the data to be processed is processed, the data to be processed also needs to be stored in the data table, and after the data to be processed is successfully stored, the first offset is updated to point to the currently processed data.
Note that, when data is restored, the second offset and the first offset may not be updated in step S101 and step S131, so that the second offset and the first offset point to the correct positions after data restoration.
In this embodiment, in order to avoid that the data amount in the check table is too large to affect the efficiency of determining the repeated data, it is necessary to properly clear the data in the check table, so that only a preset amount of data is stored in the check table, and therefore, an embodiment of the present invention further provides a method for clearing the data in the check table, please refer to fig. 8, where fig. 8 shows a flowchart of another data processing method provided by the embodiment of the present invention, where the method includes the following steps:
step S300, analyzing the generation time of each data record from the identification field of each data record in the check table.
In this embodiment, the check table includes a plurality of data records, a value of an identification field of each data record is an identification of the data record, and the identification is determined according to a generation time of each data record, for example, the identification may be composed of the generation time and a serial number, and in a scenario where a plurality of types of services exist, a unique code may also be set for each type of service, and at this time, the identification may be composed of the generation time, a service name, a code of the service type, and a serial number. For example, each service type is defined as a unique code with a number of 2 bits, such as person: 01, face: 02, each type of service is also coded according to the codes of the pods, the number of coded bits is 2, if person-pod-0 is numbered 00, person-pod-1 is numbered 01, and the number of the service of the same type is unique. And the four-bit stream is processed and is subjected to scribing distribution through the serial number of the pod, the number of the 4-bit stream is 1 ten thousand pieces of data in total, if the same type of service is N, the number of one service is 10000/N, and if a remainder exists, the number is given to the last service number. The overall identification generation rule is as follows: time 20201024000000+ type number 01+ service number 01+ serial number 4 bits. The identifier generation rule provided by the embodiment can avoid repetition of the primary key to the greatest extent and prevent data loss.
In this embodiment, the producer generates an identifier for the produced data according to the identifier generation rule, sends the identifier and the corresponding data to the message middleware for temporary storage, and the consumer takes the data and the corresponding identifier out of the message middleware, stores the identifier of the data in the check table, and stores the data and the identifier of the data in the data table.
In step S310, the data record in the check table whose generation time is within the preset time period is cleared.
In this embodiment, a preset time period may be preset, only the data in the preset time period is stored in the check table, and the data in the check table may be cleared according to a preset cycle, so that only the data in the preset time period is stored in the check table. For example, the preset period is one day, the preset time period is the last 3 days, and when each day is 3 am, the generation time of each data record is analyzed from the identification field of each data record in the check table, all data before 2 days are deleted, and it is ensured that only the data of the last 3 days are stored in the check table.
According to the method provided by the embodiment of the invention, the data in the sub-table is maintained to be a relatively small data volume by regularly deleting the data in the sub-table, so that the verification speed of the repeated data is accelerated.
In order to more clearly illustrate the data processing scheme provided by the above embodiment, the embodiment of the present invention further provides a specific example for detailed description, and takes the message middleware kafaka as an example for description.
For example, the number of the vehicle consumption service is 01, the number is 4, the preset period of the check table is 1 day, and the current time is 2020, 10, 24, 05, 04 minutes and 02 seconds.
The data processing procedure is as follows:
firstly, the built-in initialization data is loaded in the service initialization process, the vehicle 01 is used for loading, the number is from 00 to 03 because the number is 4, and the processing serial numbers of each service are 00-03: 0000-; when the data is pod0 and only this piece of data is processed this second, the token generated according to the token generation rule of the above embodiment is 2020102405040201000000. And after the data generates the identification and completes the corresponding service processing, the data is stored in kafka.
Secondly, the consumer service monitors that the second offset in the kakfa is updated, the data in the kafka are consumed, the second offset on the kafka side is updated immediately after the data are consumed in the service memory, whether a check table for 24 days exists is determined, and if the second offset does not exist, the check table is created; if the data is consistent with the identification in the current check table, the data is discarded, if the data is not consistent with the identification, other data are continuously compared, after comparison processing is completed, the data is uniformly stored in the data table through a copy method, and whether new data is written into the check table or not, the first offset needs to be updated.
The data clearing process in the check table is as follows: the check table only stores the data of the identification field of the data table, and the data table stores the data of all the fields of the data. When the time is 10 and 25 months in 2020 and 3 am, a new checking table of 25/26/27 days is created, and if the checking table exists, the checking table is skipped. After the new check table is built, the expired data in the sub-table is cleaned, and the retention period is only 1 day, so that the expired data is data before 24 days, including all data of No. 23.
In order to perform the embodiments of the data processing method described above and the corresponding steps in the various possible embodiments, an implementation of the data processing apparatus 100 is given below. Referring to fig. 9, fig. 9 is a block diagram illustrating a data processing apparatus 100 according to an embodiment of the invention. It should be noted that the basic principle and the resulting technical effect of the data processing apparatus 100 provided in the present embodiment are the same as those of the above embodiments, and for the sake of brief description, no reference is made to this embodiment.
The data processing apparatus 100 includes an obtaining module 110, a determining module 120, a processing module 130, a recovering module 140, an updating module 150, and a clearing module 160.
The obtaining module 110 is configured to obtain data to be processed through message middleware, where the data to be processed includes an identifier for uniquely characterizing the data to be processed.
The determining module 120 is configured to determine whether to-be-processed data exists in the data table by using the check table if the check table related to the data table exists, where the data table includes an identification field, and a primary key of the check table is related to the identification field of the data table.
As a specific implementation manner, the determining module 120 is specifically configured to: performing primary key verification on the identifier of the data to be processed by using a verification table; if the identification of the data to be processed passes the primary key verification, judging that the data to be processed does not exist in the data table, and storing the identification of the data to be processed into a verification table; and if the identifier of the data to be processed does not pass the primary key verification, judging that the data to be processed exists in the data table.
As a specific implementation manner, the determining module 120 is further configured to: and if the checking table related to the data table does not exist, creating the checking table, and determining the main key of the checking table according to the identification field of the data table.
The processing module 130 is configured to process the data to be processed if the data to be processed does not exist in the data table.
A recovery module 140 for: when the message middleware is detected to perform abnormal recovery, the smaller one of the first offset and the second offset is used as an initial position; the data table and the check table are updated from the data at the start position of the message middleware for data exception recovery.
An updating module 150, configured to control the message middleware to update the second offset.
As a specific embodiment, the update module 150 is further configured to: and storing the data to be processed into a data table and updating the first offset.
A purge module 160 for: analyzing the generation time of each data record from the identification field of each data record in the check table; and clearing the data records with the generation time within the preset time period in the check table.
An embodiment of the present invention provides a computer-readable storage medium, on which a computer program is stored, which, when executed by a processor, implements the data processing method described above.
In summary, embodiments of the present invention provide a data processing method, an apparatus, a server, and a storage medium, which are applied to a first server, where the first server is communicatively connected to a second server running a message middleware, and the first server includes a data table, where the method includes: acquiring data to be processed through message middleware, wherein the data to be processed comprises an identifier for uniquely representing the data to be processed; if the checking table related to the data table exists, judging whether the data table has data to be processed or not by using the checking table, wherein the data table comprises an identification field, and a main key of the checking table is related to the identification field of the data table; and if the data to be processed does not exist in the data table, processing the data to be processed. Compared with the prior art, the embodiment of the invention determines the main key of the check table according to the identification field of the data table, and can judge whether repeated data to be processed exists in the data table according to the check table, thereby improving the check efficiency of repeated data and finally improving the data processing efficiency.
The above description is only for the specific embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the appended claims.