Disclosure of Invention
The application provides a real-time data processing method, a device and a storage medium of a distributed database, which are used for solving the problems of dynamic management, storage application delay, time-consuming data acquisition and high delay when a cluster is down of the data generated by the distributed database in real-time service processing, so as to ensure high availability of the system and reduce waiting time of users.
In a first aspect, the present application provides a method for real-time data processing of a distributed database, the method comprising:
when the distributed database receives a real-time task, acquiring real-time data to be processed;
Applying for the physical partition for the real-time data on the middleware of the distributed database in an asynchronous batch application mode, and writing the real-time data into the corresponding physical partition from the successful start of the first-batch physical partition application;
after the real-time data is successfully written into the corresponding physical partition, the real-time data is persisted from the middleware to the distributed database.
In a possible implementation manner, the applying for the physical partition for the real-time data on the middleware of the distributed database by adopting an asynchronous batch application mode includes:
determining the number of initialized physical partitions and the number of target physical partitions actually required by the real-time data according to the data quantity of the real-time data, and applying for the physical partitions for the real-time data on the middleware of the distributed database according to the number of initialized physical partitions;
under the condition that the number of the initialized physical partitions is smaller than the number of the target physical partitions, starting a timing task, wherein the timing task is used for periodically applying for the physical partitions for the real-time data on the middleware of the distributed database according to the set number;
And ending the timing task under the condition that the number of physical partitions applied to be successful reaches the target.
In a possible implementation manner, determining the number of initialized physical partitions according to the data amount of the real-time data includes:
Respectively comparing the data quantity of the real-time data with a preset first threshold value and a preset second threshold value, wherein the first threshold value is smaller than the second threshold value;
Determining the first threshold value as an initialized physical partition number under the condition that the data quantity of the real-time data is smaller than or equal to the first threshold value;
Determining the second threshold value as an initialized physical partition number under the condition that the data volume of the real-time data is larger than or equal to the second threshold value;
And under the condition that the data quantity of the real-time data is larger than the first threshold value and smaller than the second threshold value, determining an initialization proportion value according to the data quantity of the real-time data, and multiplying the initialization proportion value by the target physical partition number to obtain the initialization physical partition number.
In a possible implementation manner, determining an initialization scale value according to the data amount of the real-time data includes:
substituting the data quantity of the real-time data into a pre-fitted initialization proportion value calculation formula to obtain an initialization proportion value.
In a possible implementation manner, the applying for the physical partition for the real-time data on the middleware of the distributed database according to the initialized physical partition number includes:
determining the number of idle physical partitions on the middleware;
Comparing the initialized physical partition number with the number of free physical partitions;
Selecting a target physical partition from the idle physical partitions according to the number of the initialized physical partitions to be allocated to the real-time data under the condition that the number of the idle physical partitions is larger than or equal to the number of the initialized physical partitions;
And under the condition that the number of the idle physical partitions is smaller than the number of the initialized physical partitions, distributing all the idle physical partitions to the real-time data, and creating physical partitions for the real-time data on the middleware according to the difference value between the number of the initialized physical partitions and the number of the idle physical partitions.
In a possible implementation manner, after the writing the real-time data into the corresponding physical partition, the method further includes:
establishing a mapping relation between the real-time data and the physical partition written by the real-time data;
the method further comprises the steps of:
Deleting the mapping relation between the data and the physical partition written by the data after deleting the data in the distributed database;
the determining the number of idle physical partitions on the middleware includes:
and determining the physical partitions which are not written with data on the middleware and the physical partitions which are written with data but have no mapping relation as idle physical partitions, and determining the number of the idle physical partitions.
In a possible implementation manner, after the writing the real-time data into the corresponding physical partition, the method further includes:
establishing a mapping relation between the real-time data and the physical partition written by the real-time data;
And after the real-time data are all successfully written into the corresponding physical partitions, the mapping relation between the real-time data and the written physical partitions is persisted into the distributed database.
In a second aspect, the present application provides a real-time data processing apparatus for a distributed database, the apparatus comprising:
The real-time data acquisition module is used for acquiring real-time data to be processed when the distributed database receives the real-time task;
The partition application and data writing module is used for applying physical partitions for the real-time data on the middleware of the distributed database in an asynchronous batch application mode, and writing the real-time data into the corresponding physical partitions from the successful start of the first-batch physical partition application;
And the data persistence module is used for persistence of the real-time data from the middleware to the distributed database after the real-time data are completely successfully written into the corresponding physical partition.
In one possible implementation manner, the partition application and data writing module includes:
the physical partition number initializing unit is used for determining the number of initialized physical partitions and the number of target physical partitions actually required by the real-time data according to the data quantity of the real-time data, and applying for the physical partitions for the real-time data on the middleware of the distributed database according to the number of initialized physical partitions;
The timing task starting unit is used for starting a timing task under the condition that the number of the initialized physical partitions is smaller than the number of the target physical partitions, and the timing task is used for periodically applying for the physical partitions for the real-time data on the middleware of the distributed database according to the set number;
And the timing task ending unit is used for ending the timing task under the condition that the number of the physical partitions which are applied to be successful reaches the target.
In one possible implementation manner, the physical partition number initialization unit includes:
the data quantity threshold value evaluation subunit is used for respectively comparing the data quantity of the real-time data with a preset first threshold value and a preset second threshold value, wherein the first threshold value is smaller than the second threshold value;
a small data amount processing subunit, configured to determine the first threshold as an initialized physical partition number when it is compared that the data amount of the real-time data is less than or equal to the first threshold;
a large data amount processing subunit, configured to determine the second threshold as an initialized physical partition number when it is compared that the data amount of the real-time data is greater than or equal to the second threshold;
and the threshold interval processing subunit is used for determining an initialization proportion value according to the data quantity of the real-time data under the condition that the data quantity of the real-time data is larger than the first threshold and smaller than the second threshold, and multiplying the initialization proportion value by the target physical partition number to obtain the initialization physical partition number.
In a possible implementation manner, the threshold interval processing subunit is specifically configured to:
substituting the data quantity of the real-time data into a pre-fitted initialization proportion value calculation formula to obtain an initialization proportion value.
In one possible implementation manner, the physical partition number initialization unit includes:
an idle physical partition detection subunit, configured to determine the number of idle physical partitions on the middleware;
A physical partition number comparison subunit, configured to compare the initialized physical partition number with the number of idle physical partitions;
A target physical partition allocation subunit, configured to select, according to the number of initialized physical partitions, a target physical partition from the idle physical partitions to allocate to the real-time data, when the number of idle physical partitions is compared to be greater than or equal to the number of initialized physical partitions;
And the physical partition creation and allocation subunit is used for allocating all the idle physical partitions to the real-time data under the condition that the number of the idle physical partitions is smaller than the number of the initialized physical partitions, and creating physical partitions for the real-time data on the middleware according to the difference value between the number of the initialized physical partitions and the number of the idle physical partitions.
In a possible embodiment, the apparatus further comprises:
The mapping relation establishing module is used for establishing a mapping relation between the real-time data and the physical partition written by the real-time data;
The apparatus further comprises:
the mapping relation deleting module is used for deleting the mapping relation between the data and the physical partition written by the data after deleting the data in the distributed database;
The idle physical partition detection subunit is specifically configured to:
and determining the physical partitions which are not written with data on the middleware and the physical partitions which are written with data but have no mapping relation as idle physical partitions, and determining the number of the idle physical partitions.
In a possible embodiment, the apparatus further comprises:
The mapping relation establishing unit is used for establishing a mapping relation between the real-time data and the physical partition written by the real-time data;
And the mapping relation persistence unit is used for persistence of the mapping relation between the real-time data and the written physical partition into the distributed database after the real-time data are written into the corresponding physical partition successfully.
In a third aspect, the application provides an electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface, the memory complete communication with each other via the communication bus,
A memory for storing a computer program;
And the processor is used for realizing the real-time data processing method of the distributed database in any one of the first aspect when executing the program stored in the memory.
In a fourth aspect, the present application also provides a computer storage medium storing computer executable instructions for performing the method for processing real-time data of a distributed database according to any one of the above aspects of the present application.
Compared with the prior art, the technical scheme provided by the embodiment of the application has the following advantages: according to the method provided by the embodiment of the application, the data writing efficiency can be improved by applying for the physical partition in an asynchronous batch mode, so that the data processing process is not blocked, and the real-time or near real-time data writing is realized. The application and the data writing of the physical partition are carried out separately, so that the resources of the middleware and the database can be better utilized, and the writing delay caused by waiting for the application of the partition is avoided. After the data are written into the physical partition, persistence is carried out, so that the data are ensured not to be lost before the data are successfully put in storage, and the reliability of the system is enhanced. In summary, by applying for physical partitioning in an asynchronous batch mode, the advantages of asynchronous processing and layering architecture can be effectively combined, the writing efficiency, the resource utilization rate and the flexibility of the system are improved, and meanwhile, the reliability and consistency of data are enhanced.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more apparent, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The following disclosure provides many different embodiments, or examples, for implementing different structures of the invention. In order to simplify the present disclosure, components and arrangements of specific examples are described below. They are, of course, merely examples and are not intended to limit the invention. Furthermore, the present invention may repeat reference numerals and/or letters in the various examples. This repetition is for the purpose of simplicity and clarity and does not in itself dictate a relationship between the various embodiments and/or configurations discussed.
In order to solve the technical problems of dynamic management, delay of storage application, time consumption of data acquisition and high delay when a cluster is down of data generated by a distributed database in real-time service processing in the prior art, the application provides a real-time data processing device of the distributed database, which can realize asynchronous batch application of physical partition, can effectively combine the advantages of asynchronous processing and layering architecture, improves writing efficiency, resource utilization rate and flexibility of a system, and enhances reliability and consistency of data.
Fig. 1 is a flowchart of an embodiment of a method for processing real-time data in a distributed database according to an embodiment of the present application. As shown in fig. 1, the method comprises the following steps:
step 101: and when the distributed database receives the real-time task, acquiring real-time data to be processed.
A distributed database is a database system that stores data in a decentralized manner on a plurality of physical nodes, which may be co-located or geographically located, connected and coordinated by a network. Key features of a distributed database include data slicing, redundant backup, and fault tolerance, allowing data to be distributed and shared among multiple nodes, thereby improving the availability, performance, and scalability of the system. The architecture can enable the system to process massive data, cope with high concurrent access and ensure that normal service can be maintained when a certain node fails.
Real-time tasks refer to tasks that require processing and analysis immediately at the moment of data generation or reception. Such tasks are often very time-efficient and aim to convert data into usable information with as low a delay as possible.
In the context of distributed databases, when the amount of data that is newly added in real-time increases dramatically, the system must quickly partition and process the data effectively to ensure continuity of service and quick response.
For example, in a high availability cluster scenario, multiple nodes may be inconsistent in data copies due to downtime, and in real-time task execution, the system needs to newly add data copies in real-time to maintain data consistency and availability. In this case, the real-time task includes not only fast processing of newly added data, but also efficient management of logical partitions (e.g., topic) and physical partitions (e.g., partition) to ensure that the system is able to continue to efficiently perform business processes in the face of large amounts of newly added data. When the amount of data increases dramatically, the process of partitioning may involve higher resource consumption, including occupation of computing resources and network bandwidth. This not only delays the processing speed of the data, but may also affect the overall throughput of the system. Thus, faced with this challenge of data growth, real-time tasks need to have the ability to dynamically adjust partition policies to still maintain system efficiency at high loads. In addition, a monitoring mechanism is needed to be implemented to analyze the data inflow and the system performance in real time, so that adjustment can be made in time, and the system can continuously and efficiently provide services to meet the service requirements during large-scale data processing.
In the embodiment of the application, the real-time data to be processed comprises a plurality of table data, and one table data corresponds to one physical partition. Real-time data relates to a plurality of different data types and a large amount of dynamic information, and in the management of real-time data, data is generally distributed into a plurality of tables according to characteristics and use requirements thereof, and each table represents a specific data type or service function. For example, a user behavior data table for recording click, browse and interactive behaviors of the user, helping to analyze user preferences and behavior patterns; the transaction record list is used for storing detailed information of purchasing behavior, and is convenient for financial statement and audit. Physical partitioning refers to the actual physical distribution of data on a storage medium, with each data table corresponding to a physical partition, i.e., the data of each table is stored in separate, independent storage areas.
Structured management of real-time data by partitioning the real-time data into multiple tables and physically partitioning, the system is able to achieve efficient data writing and access. The newly generated data can be immediately written into the corresponding table and updated into the corresponding physical partition, thereby ensuring that the system always reflects the latest information state.
Step 102: and applying for the physical partition for the real-time data on the middleware of the distributed database in an asynchronous batch application mode, and executing the writing of the real-time data into the corresponding physical partition from the successful start of the first-batch physical partition application.
Asynchronous means that operations can continue without waiting for the completion of a previous operation, and in the context of databases and middleware, asynchronous operations allow programs to perform other tasks during waiting without waiting immediately for a read result when initiating a database connection or write operation. The batch application, i.e. the process of data processing, is not completed at one time, but is performed in multiple batches, so that the use of resources can be effectively controlled and the efficiency of operation can be improved.
Middleware is used in a distributed system to provide a layer of service between applications and infrastructure. The middleware is responsible for receiving partition application from each application, managing the state of the partition and ensuring that the data can be correctly and efficiently written into the corresponding physical partition in the process of processing the real-time data. And a unified interface is provided, so that a developer can conveniently perform asynchronous batch application, and the data writing efficiency is improved.
In an embodiment of the application, the middleware initiates multiple asynchronous application requests to create a new physical partition for real-time data. In this process, the system will package a batch of requests to reduce network latency and improve efficiency. Because the application is performed asynchronously, multiple requests can be processed simultaneously, thus fully utilizing system resources and reducing the response time of a single request. In each application request, a callback function can be set for executing corresponding operation when the application is successful, so that the system can respond to the application result in time. During this process, the middleware will continuously monitor the application state of each physical partition until all requests are completed. When a partition is applied successfully, corresponding callback logic is triggered, and a notification is sent to a system or a related module to inform that a new partition is available. After the physical partition application is successful, the middleware will begin to prepare to write real-time data to the corresponding physical partition. The middleware will execute data routing logic to determine which specific physical partition each piece of real-time data should be written to. Once the target partition for writing is determined, the middleware asynchronously initiates a writing operation and sends data to the corresponding physical partition in batches, so that the writing efficiency is improved and the burden of single writing is reduced. During writing, middleware needs to maintain control over the transaction, ensure consistency of data in writing, such as rollback mechanism of the transaction, retry logic, and the like, to cope with possible errors or failures.
In the whole process, the middleware can continuously monitor the written data state to ensure that all data are successfully written into the new partition. All partition applications and data write operations will be logged for subsequent auditing and analysis.
Step 103: after the real-time data is all successfully written into the corresponding physical partition, the real-time data is persisted from the middleware into the distributed database.
Middleware is typically used to process data in real time, such as receiving, forwarding, and preliminary processing. However, middleware is not always able to persist data. If the middleware fails, data may be lost. Writing data into the distributed database can ensure that the data is preserved when a fault occurs. The distributed database provides powerful data query functions and analysis capabilities compared to middleware. The real-time data is stored in the database, so that complex query, analysis and report generation can be conveniently carried out to support business decision.
After being processed by the middleware, the data are stored in a distributed database which can efficiently store and manage mass data in a scattered manner. In this process, the data transitions from a scratch state to long-term storage for subsequent querying and analysis.
In the embodiment of the application, the data writing efficiency can be improved by applying for the physical partition in an asynchronous batch mode, so that the data processing process is not blocked, and the real-time or near real-time data writing is realized. The application and the data writing of the physical partition are carried out separately, so that the resources of the middleware and the database can be better utilized, and the writing delay caused by waiting for the application of the partition is avoided. After the data are written into the physical partition, persistence is carried out, so that the data are ensured not to be lost before the data are successfully put in storage, and the reliability of the system is enhanced. In summary, by applying for physical partitioning in an asynchronous batch mode, the advantages of asynchronous processing and layering architecture can be effectively combined, the writing efficiency, the resource utilization rate and the flexibility of the system are improved, and meanwhile, the reliability and consistency of data are enhanced.
Fig. 2 is a flowchart of an embodiment of another method for processing real-time data in a distributed database according to an embodiment of the present application. The embodiment shown in fig. 2 further comprises the following steps, based on the embodiment shown in fig. 1:
Step 201: and establishing a mapping relation between the real-time data and the physical partition written by the real-time data.
The mapping relationship between real-time data and the physical partition to which it is written is a structure or record that indicates to which particular physical partition a piece of real-time data is written, typically contains the following information:
Data identifier: the data identifier is a unique identifier for each piece of real-time data, and is typically assigned automatically by the system at the time the data is generated. It may be in the form of a number, string or combination for uniquely identifying a record. Through the data identifier, the system can quickly find and access specific data records, meanwhile, each piece of data is ensured to be unique in the whole system, repeated records are avoided, and real-time data management and maintenance operations such as updating, deleting and the like are facilitated.
Partition identifier: partition identifiers refer to the identification of a particular physical partition to which real-time data is written, and are typically closely related to the data storage architecture. Each physical partition may be partitioned according to a different partitioning strategy (e.g., based on time, region, or data type, etc.). The partition identifier is used to indicate where the data is specifically stored in the hardware, so that the system can manage the physical storage conveniently. During inquiry, the search range can be quickly reduced, and the designated physical partition is accessed instead of the full library search, so that the search speed is improved, and the partition can be conveniently optimized and maintained, for example, the partition is divided or merged again after the data distribution is analyzed.
Writing a time stamp: the write timestamp refers to the particular time that real-time data is written to a physical partition, typically in a time format (e.g., UTC time). This timestamp is automatically recorded when the data is generated, ensuring that the time attributes of the data are accurately reflected. The method can be used for carrying out time sequencing or grouping analysis on the data, and is convenient for time sequence analysis and trend observation; the system is helped to determine the validity and freshness of the data, and is beneficial to cleaning and management of the expired data; and a time clue is provided for the data, so that auditing, tracing and compliance checking are facilitated, and the transparency of system operation is ensured.
After the mapping relation between the real-time data and the physical partition written in the real-time data is established, the system can track the source and the position of each piece of data, quickly locate the physical storage position of specific data, improve the efficiency of data storage and access, ensure quick retrieval and data management, and simultaneously facilitate the time sequencing and compliance audit of the data. The system is beneficial to realizing the optimized storage, the simplified maintenance and the improved overall performance when processing large-scale real-time data.
Step 202: and after the real-time data are completely successfully written into the corresponding physical partitions, the mapping relation between the real-time data and the written physical partitions is persisted into the distributed database.
The persisted mapping may ensure that the storage structure and associated information of the data is still available even in the event of a system crash, reboot, or other failure. In this way, the system can revert to the previous state, continuing to access the data quickly.
In practical applications, the mapping relationship may be formatted into a structure suitable for storage, such as JSON, CSV, or other database supported data formats. Using the API or query language (such as SQL, noSQL) of the database to write the formatted mapping data into the distributed database, checking whether the same mapping relation exists, and preventing the repeated writing; the atomicity of write operations, i.e., either all succeeds or all fails, is ensured to maintain data consistency. After successful writing, a query is performed to verify whether the mapping is stored correctly.
In this process, the system enforces a policy to create physical partitions on demand. When new real-time data needs to be stored, the system dynamically creates the corresponding physical partition, rather than pre-generating all possible partitions. For the data needing to be deleted in real time, only the corresponding mapping relation is updated or deleted, and the physical partition is not required to be processed. In this way, the system can efficiently determine which data has been deleted, thereby reducing unnecessary resource consumption and processing time.
In the embodiment of the application, the one-to-one correspondence between the real-time data and the physical storage positions is ensured through the establishment of the mapping relationship, which is beneficial to maintaining the consistency of the data. When recovering or inquiring data, the method can quickly locate the corresponding physical partition; the persistent mapping relation can accelerate the data query process, and the system can quickly find out the physical partition according to the mapping relation, so that the data retrieval efficiency is improved, and meanwhile, the integrity and consistency of the data can be recovered even if the system crashes or fails due to the fact that the mapping relation is persisted into the distributed database, and the fault tolerance of the system is enhanced.
Fig. 3 is a flowchart of an embodiment of a method for processing real-time data in a distributed database according to another embodiment of the present application. As shown in fig. 3, the method comprises the following steps:
Step 301: and when the distributed database receives the real-time task, acquiring real-time data to be processed.
The above description of step 301 may be referred to the related description in the above embodiment, and will not be repeated here.
Step 302: respectively comparing the data quantity of the real-time data with a preset first threshold value and a preset second threshold value, wherein the first threshold value is smaller than the second threshold value; in case the data amount of the real-time data is compared to be smaller than or equal to the first threshold value, step 303 is performed; if the data amount of the real-time data is greater than or equal to the second threshold, executing step 304; in case the amount of real-time data is compared to be larger than the first threshold and smaller than the second threshold, step 305 is performed.
Step 303: the first threshold is determined as the initialized physical partition number.
Step 304: the second threshold is determined as the initialized physical partition number.
Step 305: and determining an initialization proportion value according to the data quantity of the real-time data, and multiplying the initialization proportion value by the target physical partition number to obtain the initialization physical partition number.
For ease of understanding, steps 302 through 305 are collectively described as follows:
In the embodiment of the application, the first threshold value is the minimum partition amount applied by the system when asynchronous initialization is carried out. That is, when the actual data amount is below this value (e.g., 100), the system may apply for the space of the first threshold without affecting the traffic processing speed. Thus, setting a minimum threshold may ensure that the system has sufficient resources at start-up.
The second threshold is the maximum amount of partitioning that the system can accept when asynchronously initializing. Exceeding this value (e.g., 4000) may cause the system to be overburdened, thereby affecting performance. Thus, a maximum threshold is set to prevent the system from being degraded or otherwise problematic due to excessive partitioning.
By reasonably setting this minimum and maximum threshold, the system is able to remain efficient under varying workloads.
When the data amount of the real-time data is smaller and does not exceed the first threshold value, the system load can be considered to be lighter, so that the number of physical partitions can be set to be a smaller value, effective utilization of resources of the system is ensured, and resource waste caused by excessive partitioning is avoided. When the data amount of the real-time data reaches or exceeds the second threshold value, the system load is heavier, and in order to improve the processing capacity and performance, the number of physical partitions needs to be increased, so that more data processing requirements can be accommodated.
In an embodiment, when the data amount of the real-time data is greater than the first threshold and less than the second threshold, an initialization proportion value calculation formula may be fitted in advance, the data amount of the real-time data is substituted into the pre-fitted initialization proportion value calculation formula to obtain an initialization proportion value, and the initialization proportion value is multiplied by the target physical partition number to obtain the initialization physical partition number.
In another embodiment, the physical partition number algorithm formula may be fitted in advance, and then the data amount of the real-time data is substituted into the physical partition number algorithm formula fitted in advance. Wherein, the formula of the physical partition number algorithm is as follows:
f(x)=αln(x)+βxγ
Where α, β, γ are values fitted in advance according to the actual business data model, for example, α=1.274, β=0.0135, γ=0.585; x is the real-time traffic data volume, and f (x) is the number of initialized physical partitions.
The partition number can be flexibly adjusted when the load is changed, so that the partition number can be adapted to the current data quantity, and the processing efficiency of the system is improved.
Step 306: applying for physical partitions for real-time data on middleware of the distributed database according to the initialized physical partition number; and starting from the first physical partition application success, executing the writing of the real-time data into the corresponding physical partition.
After the number of the initialized physical partitions is determined through the steps, an application request is sent out on the middleware according to the number of the initialized physical partitions. Once the physical partition of the application is successfully created or allocated, the system receives an acknowledgment response. This acknowledgment indicates that the new physical partition is ready to accept data writes. From the time the acknowledgement response is sent out, the system begins writing real-time data to the newly applied physical partition. This process involves the distribution and scheduling of data, ensuring that the data is written to the various partitions uniformly and efficiently. By writing real-time data into a plurality of physical partitions, the system can realize parallel processing, and the throughput of data processing is remarkably improved.
Step 307: and under the condition that the number of the initialized physical partitions is smaller than the number of the target physical partitions, starting a timing task, wherein the timing task is used for periodically applying for the physical partitions for the real-time data on the middleware of the distributed database according to the set number.
Step 308: and ending the timing task under the condition that the number of the physical partitions which are applied to be successful reaches the target.
For ease of understanding, steps 307 to 308 are collectively described below:
The target number of physical partitions refers to the number of physical partitions required in a data storage or data processing system to efficiently store and process real-time data. The setting of the number is based on analysis of the storage requirements, and by reasonably planning the number of target physical partitions, the system can be ensured to efficiently and flexibly cope with the change of data inflow.
When the initialized physical partition number is smaller than the set target physical partition number, the physical partition is continuously created, so that the method is not only complementary to the current data processing capacity, but also active adaptation to future service growth, and can ensure the flexibility of data storage, improve the system performance, enhance the reliability and optimize the reasonable use of resources.
In one embodiment, a timed task is initiated when the number of initialized physical partitions is less than the set target number of physical partitions. The main functions of this task include:
Periodic application: the timing tasks are triggered once per minute and the application of partitions is performed on the middleware of the distributed database according to a set number (e.g., 100 physical partitions are applied each time).
Multiple experiments show that the operation of the timing task has negligible influence on the service, so that the parallel operation of the tasks can be performed.
When the number of the physical partitions successfully applied reaches the target number of the physical partitions, the timing tasks are automatically ended, so that unnecessary system burden is avoided, and smooth operation of other business tasks is ensured.
Step 309: after the real-time data is completely successfully written into the corresponding physical partition, the real-time data is persisted from the middleware into the distributed database.
The above description of step 309 may be referred to the related description in the above embodiment, and will not be repeated here.
In the embodiment of the application, the number of physical partitions can be dynamically adjusted by comparing the real-time data quantity with the preset threshold value. The self-adaptive partition management mechanism can optimize resource allocation according to actual data flow change, and improves the flexibility and coping capacity of the system. Meanwhile, the real-time data writing and the application of the physical partition are operated in parallel, so that writing delay caused by insufficient partition can be reduced. The parallel processing ensures that the system can maintain good performance under the condition of high load and timely respond to real-time task requirements. By setting two thresholds (a first threshold and a second threshold), only the necessary physical partition is applied when the data volume is small, and resource waste caused by excessive partition is avoided. When the data volume reaches a certain scale, more partitions are applied, and the control mechanism of the partition request effectively reduces the consumption of system resources. According to the method, the physical partition is periodically applied according to the requirement, the partition can be timely adjusted according to the change of real-time data, the problem of unbalanced system load caused by rapid increase of data quantity is avoided, and the load balancing capability of the system is improved. In summary, by the above scheme, the efficiency, flexibility and reliability of the real-time data processing of the distributed database can be improved, and the system can be ensured to stably operate in changeable business scenes.
Fig. 4 is a flowchart of an embodiment of a method for processing real-time data in a distributed database according to still another embodiment of the present application. As shown in fig. 4, the method comprises the following steps:
Step 401: and when the distributed database receives the real-time task, acquiring real-time data to be processed.
The description of step 401 above may be referred to the related description in the above embodiment, and will not be repeated here.
Step 402: and determining the number of the initialized physical partitions and the number of the target physical partitions actually required by the real-time data according to the data quantity of the real-time data.
Step 403: and determining the physical partitions which are not written with data on the middleware and the physical partitions which are written with data but have no mapping relation as idle physical partitions, and determining the number of the idle physical partitions.
Step 404: comparing the number of initialized physical partitions with the number of idle physical partitions, and executing step 405 if the number of idle physical partitions is greater than or equal to the number of initialized physical partitions; in the event that the number of comparison free physical partitions is less than the number of initialized physical partitions, step 406 is performed.
Step 405: and selecting a target physical partition from the idle physical partitions according to the number of the initialized physical partitions to be allocated to the real-time data.
Step 406: all the idle physical partitions are allocated to the real-time data, and physical partitions are created for the real-time data on the middleware according to the difference between the number of initialized physical partitions and the number of idle physical partitions.
For ease of understanding, steps 402 through 406 are collectively described as follows:
An empty physical partition refers to a physical partition in a data storage system or middleware that is not currently actually used or assigned to any data storage, i.e., a partition that has not written any data, that has not stored any information during its lifecycle, and is fully in an empty state. In addition, a partition that has been written with data, but does not currently have a mapping relationship with other data or partitions, may also be considered free. These partitions can be used for previous operations, but are no longer used at this stage due to the processing or cleaning of the data.
And determining the number of the initialized physical partitions and the number of the target physical partitions actually required by the real-time data according to the data quantity of the real-time data. In this step, the total amount of real-time data first needs to be evaluated, involving analysis of the current input stream, data size, processing power, etc. The initial number of physical partitions may ensure that the number of partitions needed to efficiently process and store real-time data, while the target number of physical partitions is a possible requirement calculated from the actual requirement of real-time data. In other words, the core is to determine how the system divides the storage area through data volume analysis to optimize the subsequent data processing efficiency.
And determining the physical partitions which are not written with data on the middleware and the physical partitions which are written with data but have no mapping relation as idle physical partitions, and determining the number of the idle physical partitions. Middleware is referred to herein as an abstraction layer between an application and the underlying operating system or database. Physical partitions where data is not written are those partitions that have not been used or stored with any data. A physical partition that has written data but no mapping relationship refers to a partition that should have data present, but is not actually used for some reason (e.g., data migration or offloading). Through the above operations, all the free physical partitions that are currently available can be found to ensure that there are enough resources to process the new real-time data.
The number of initialized physical partitions is compared to the number of free physical partitions. This step is to compare on the basis of the first two steps to determine whether the currently available resources (free physical partitions) can meet the initially set requirements (initializing the number of physical partitions). The next allocation policy is determined by whether the number of free resources is sufficient.
And selecting a target physical partition from the idle physical partitions according to the number of the initialized physical partitions to be allocated to the real-time data under the condition that the number of the idle physical partitions is larger than or equal to the number of the initialized physical partitions. If the number of free physical partitions found in the previous step of comparison is greater than or equal to the number of initial physical partitions required, the system will directly select a corresponding number of partitions from the free partitions to allocate to the storage of real-time data. Therefore, the system is ensured to fully utilize available physical resources, the waste of the resources is avoided, and the efficiency of data processing is improved.
And under the condition that the number of the idle physical partitions is smaller than the number of the initialized physical partitions, distributing all the idle physical partitions to the real-time data, and creating physical partitions for the real-time data on the middleware according to the difference between the number of the initialized physical partitions and the number of the idle physical partitions. If the number of free physical partitions is less than the number of initialized physical partitions, the system will first allocate all available free partitions to real-time data. In this case, since the available resources are insufficient to meet the demand, a new physical partition needs to be created on the middleware in the following according to the difference between the actual number of free partitions and the number of initialized physical partitions needed. The system needs to dynamically adjust the resource allocation to ensure that real-time data can be stored efficiently to prevent data loss or processing delays.
By establishing a dynamic physical partition management mechanism, storage resources can be flexibly allocated according to the needs of real-time data. This mechanism not only improves the effectiveness of storage, but also ensures that the system maintains good performance and responsiveness in the face of increasing data demands.
Step 407: and under the condition that the number of the initialized physical partitions is smaller than the number of the target physical partitions, starting a timing task, wherein the timing task is used for periodically applying for the physical partitions for the real-time data on the middleware of the distributed database according to the set number.
Step 408: and ending the timing task under the condition that the number of the physical partitions which are applied to be successful reaches the target.
Step 409: and starting from the first physical partition application success, executing the writing of the real-time data into the corresponding physical partition.
The descriptions of the above steps 407 to 409 may be referred to the related descriptions in the above embodiments, and are not repeated here.
Step 410: and establishing a mapping relation between the real-time data and the physical partition written by the real-time data, and deleting the mapping relation between the data and the physical partition written by the data after deleting the data in the distributed database.
In distributed databases, deleting data is a common operation, especially when the data is out of date or no longer needed. The deletion may be a logical deletion (marked as delete but not immediately removed) or a physical deletion (actually removing data from storage). Once the data is deleted, the mapping must be updated in time to reflect this change, i.e., removing the record of the deleted data in the mapping table, ensures that the system no longer associates the data with its physical location.
In the embodiment of the application, a logical deletion mode is adopted, namely, records are not deleted from the data storage actually, but the records are deleted by updating the mapping relation or the state identifier. By preserving the actual data, which facilitates temporary disabling or restoring of the data, rather than complete loss, the data rollback or restoration operation may be facilitated, particularly in the event of accidental deletion. By deleting only the mapping relation between the real-time data and the physical partition, and not deleting the actual data immediately, the purpose of flexibly managing the data can be achieved. Not only improves the efficiency of operation, but also provides convenience for future data recovery and management.
Step 411: after the real-time data is all successfully written into the corresponding physical partition, the real-time data is persisted from the middleware into the distributed database.
The above description of step 411 may be referred to the related description in the above embodiment, and will not be repeated here.
In the embodiment of the application, the number of physical partitions is determined according to the actual demand of the real-time data, and the processing capacity of the system on the high-load real-time data is improved; the physical partition is periodically applied through the timing task, so that the data demand which is continuously changed can be responded in time, and the flexibility and the response speed of the system are enhanced; the mapping relation between the real-time data and the physical partition is established, so that the data management and tracking are facilitated, and the accuracy and consistency of the data are ensured; after the real-time data are written successfully, persistence is carried out, so that the risk of data loss is reduced, and the disaster tolerance of the system is improved; through the timing tasks and the dynamic allocation mechanism, the distributed database can be correspondingly expanded according to the increased data demand, the large-scale data processing is supported, the dynamic demand of the real-time data processing and the resource management of the distributed system are effectively combined, and the flexibility, the efficiency and the data security of the system are improved.
Fig. 5 is a block diagram of an embodiment of a real-time data processing apparatus of a distributed database according to an embodiment of the present application. As shown in fig. 5, the apparatus includes:
The real-time data acquisition module 51 is configured to acquire real-time data to be processed when the distributed database receives a real-time task;
The partition application and data writing module 52 is configured to apply for a physical partition for the real-time data on the middleware of the distributed database in an asynchronous batch application manner, and perform writing of the real-time data into a corresponding physical partition from the successful start of the first-batch physical partition application;
A data persistence module 53, configured to persistence the real-time data from the middleware into the distributed database after the real-time data is all successfully written into the corresponding physical partition.
In one possible implementation, the partition application and data writing module 52 includes:
the physical partition number initializing unit is used for determining the number of initialized physical partitions and the number of target physical partitions actually required by the real-time data according to the data quantity of the real-time data, and applying for the physical partitions for the real-time data on the middleware of the distributed database according to the number of initialized physical partitions;
The timing task starting unit is used for starting a timing task under the condition that the number of the initialized physical partitions is smaller than the number of the target physical partitions, and the timing task is used for periodically applying for the physical partitions for the real-time data on the middleware of the distributed database according to the set number;
And the timing task ending unit is used for ending the timing task under the condition that the number of the physical partitions which are applied to be successful reaches the target.
In one possible implementation manner, the physical partition number initialization unit includes:
the data quantity threshold value evaluation subunit is used for respectively comparing the data quantity of the real-time data with a preset first threshold value and a preset second threshold value, wherein the first threshold value is smaller than the second threshold value;
a small data amount processing subunit, configured to determine the first threshold as an initialized physical partition number when it is compared that the data amount of the real-time data is less than or equal to the first threshold;
a large data amount processing subunit, configured to determine the second threshold as an initialized physical partition number when it is compared that the data amount of the real-time data is greater than or equal to the second threshold;
and the threshold interval processing subunit is used for determining an initialization proportion value according to the data quantity of the real-time data under the condition that the data quantity of the real-time data is larger than the first threshold and smaller than the second threshold, and multiplying the initialization proportion value by the target physical partition number to obtain the initialization physical partition number.
In a possible implementation manner, the threshold interval processing subunit is specifically configured to:
substituting the data quantity of the real-time data into a pre-fitted initialization proportion value calculation formula to obtain an initialization proportion value.
In one possible implementation manner, the physical partition number initialization unit includes:
an idle physical partition detection subunit, configured to determine the number of idle physical partitions on the middleware;
A physical partition number comparison subunit, configured to compare the initialized physical partition number with the number of idle physical partitions;
A target physical partition allocation subunit, configured to select, according to the number of initialized physical partitions, a target physical partition from the idle physical partitions to allocate to the real-time data, when the number of idle physical partitions is compared to be greater than or equal to the number of initialized physical partitions;
And the physical partition creation and allocation subunit is used for allocating all the idle physical partitions to the real-time data under the condition that the number of the idle physical partitions is smaller than the number of the initialized physical partitions, and creating physical partitions for the real-time data on the middleware according to the difference value between the number of the initialized physical partitions and the number of the idle physical partitions.
In a possible embodiment, the apparatus further comprises:
The mapping relation establishing module is used for establishing a mapping relation between the real-time data and the physical partition written by the real-time data;
The apparatus further comprises:
the mapping relation deleting module is used for deleting the mapping relation between the data and the physical partition written by the data after deleting the data in the distributed database;
The idle physical partition detection subunit is specifically configured to:
and determining the physical partitions which are not written with data on the middleware and the physical partitions which are written with data but have no mapping relation as idle physical partitions, and determining the number of the idle physical partitions.
In a possible embodiment, the apparatus further comprises:
The mapping relation establishing unit is used for establishing a mapping relation between the real-time data and the physical partition written by the real-time data;
And the mapping relation persistence unit is used for persistence of the mapping relation between the real-time data and the written physical partition into the distributed database after the real-time data are written into the corresponding physical partition successfully.
As shown in fig. 6, an embodiment of the present application provides an electronic device including a processor 111, a communication interface 112, a memory 113, and a communication bus 114, wherein the processor 111, the communication interface 112, and the memory 113 perform communication with each other through the communication bus 114,
A memory 113 for storing a computer program;
in one embodiment of the present application, the processor 111 is configured to implement the method for processing real-time data of a distributed database according to any one of the foregoing method embodiments when executing a program stored in the memory 113, where the method includes:
when the distributed database receives a real-time task, acquiring real-time data to be processed;
Applying for the physical partition for the real-time data on the middleware of the distributed database in an asynchronous batch application mode, and writing the real-time data into the corresponding physical partition from the successful start of the first-batch physical partition application;
after the real-time data is successfully written into the corresponding physical partition, the real-time data is persisted from the middleware to the distributed database.
The embodiment of the application also provides a computer readable storage medium, on which a computer program is stored, which when being executed by a processor, implements the steps of the method for processing real-time data of a distributed database provided in any one of the method embodiments described above.
The apparatus embodiments described above are merely illustrative, wherein the elements illustrated as separate elements may or may not be physically separate, and the elements shown as elements may or may not be physical elements, may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
From the above description of embodiments, it will be apparent to those skilled in the art that the embodiments may be implemented by means of software plus a general purpose hardware platform, or may be implemented by hardware. Based on such understanding, the foregoing technical solution may be embodied essentially or in a part contributing to the related art in the form of a software product, which may be stored in a computer readable storage medium, such as ROM/RAM, a magnetic disk, an optical disk, etc., including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform the method described in the respective embodiments or some parts of the embodiments.
It is to be understood that the terminology used herein is for the purpose of describing particular example embodiments only, and is not intended to be limiting. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The terms "comprises," "comprising," "includes," "including," and "having" are inclusive and therefore specify the presence of stated features, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, and/or groups thereof. The method steps, processes, and operations described herein are not to be construed as necessarily requiring their performance in the particular order described or illustrated, unless an order of performance is explicitly stated. It should also be appreciated that additional or alternative steps may be used.
The foregoing is only a specific embodiment of the invention to enable those skilled in the art to understand or practice the invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.