
CN119728498A - A data center network element fault scheduling method and electronic equipment - Google Patents


Info

Publication number
CN119728498A
Authority
CN
China
Prior art keywords
network
network node
fault
arbitration
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202411695331.1A
Other languages
Chinese (zh)
Inventor
罗云鹤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Telecom Cloud Technology Co Ltd
Original Assignee
China Telecom Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Telecom Cloud Technology Co Ltd filed Critical China Telecom Cloud Technology Co Ltd
Priority to CN202411695331.1A priority Critical patent/CN119728498A/en
Publication of CN119728498A publication Critical patent/CN119728498A/en
Pending legal-status Critical Current

Landscapes

  • Computer And Data Communications (AREA)

Abstract

An embodiment of the invention provides a data center network element fault scheduling method and electronic equipment. A data center cluster comprises more than one available area, a representative network node is preselected in each available area, and an arbitration module, a scheduling module and a network configuration storage cluster are deployed in the representative network nodes of all the available areas. The method comprises the following steps: a master arbitration module sends a network node detection request to all arbitration modules, so that the arbitration modules detect all the network nodes according to the request; the master arbitration module receives the detection results fed back by all arbitration modules and determines, according to a joint arbitration mechanism, whether a faulty network node exists; and if so, the network configuration storage cluster is updated with the information of the faulty network node, so that a master scheduling module schedules the network elements running on the faulty network node to non-faulty network nodes in the data center cluster according to the information of the faulty network node in the updated network configuration storage cluster. The method detects faults quickly, operates in real time, and responds faster in fault scheduling.

Description

Data center network element fault scheduling method and electronic equipment
Technical Field
The present invention relates to the field of data switching network technologies, and in particular, to a data center network element fault scheduling method and an electronic device.
Background
The network element is an important bridge for north-south communication between the user VPC (Virtual Private Cloud) and the outside, and is generally planned to run on a specific network node. Once that network node fails, the network element cannot work normally, and the data center cannot provide services to the outside. When a network node fails or is cut off by a network partition, isolating the problematic network node is not sufficient: the network elements running on it should also be scheduled to other normally working network nodes so that network services continue to be provided.
Traditional network devices run BFD (Bidirectional Forwarding Detection) to rapidly detect communication failures between devices, so that measures such as switching routes to a standby device can be taken in time to keep traffic flowing. BFD is a point-to-point failure detection method; for example, a network device establishes a BFD session with its upstream device and with its downstream device respectively. A data center generally runs a virtualized network, such as OVS (Open vSwitch). In that case the upstream device of the network element is a physical switch and can run BFD, but the downstream device of the network element is a virtualized network device: the next hop of the downstream traffic forwarded by the network element is a virtual switch gateway, and because this gateway is distributed, it cannot establish BFD the way a physical device can. One approach is to pin the virtual switch gateway to a single host node, but this introduces a new single point of failure and a traffic forwarding bottleneck, which is undesirable in practice.
Essentially, the physical carriers of the downstream virtual device of a network element are all the computing nodes, so every computing node would need to run a BFD-like detection with the network element. When the number of network elements is large, this generates a large amount of unnecessary detection traffic. Moreover, each computing node reaches its own arbitration result, whereas management often needs to determine which network element is providing service (for example, to know the load condition of each network node), so manageability is reduced.
One existing solution is for the network nodes to run VRRP (Virtual Router Redundancy Protocol): the node elected as leader in the VRRP election takes over all network elements, and the other nodes become backup nodes and do not provide service to the outside. The problems with this scheme are as follows. VRRP is an active/standby model in which only one network node works at any time, so a large amount of resources is wasted. In addition, VRRP generally requires the nodes participating in the election to be located in the same layer-2 network, whereas a multi-available-area data center often runs in several machine rooms, and it is difficult to provide a layer-2 network environment for network nodes distributed across them. VRRP is therefore not suitable for large-scale cluster scenarios with multiple available areas, and is better suited to small-scale scenarios with a single available area.
Another existing solution probes the availability of network nodes by means of a monitoring service fixed either outside or inside the cluster, and then schedules the network elements. The problem with this scheme is that it depends heavily on the condition of the physical links between the monitoring service and the network nodes, so misjudgment occurs easily and causes jitter of the network element service, and the availability of the monitoring service and the scheduling service themselves is not guaranteed. Furthermore, if these services are deployed outside the cluster, detection becomes unreliable because of the long links; if they are deployed in one available area of the cluster, they can no longer be provided once that available area fails.
In summary, a reliable, cluster-autonomous network element fault scheduling method suitable for large-scale data centers with multiple available areas is needed.
Disclosure of Invention
The embodiment of the invention provides a data center network element fault scheduling method, a data center network element fault scheduling device, electronic equipment and a computer readable storage medium, which are suitable for cluster-autonomous network element fault scheduling in large-scale data centers with multiple available areas.
The embodiment of the invention discloses a data center network element fault scheduling method. A data center cluster comprises more than one available area, a representative network node is preselected in each available area, and an arbitration module, a scheduling module and a network configuration storage cluster are deployed in the representative network nodes of all the available areas. The method comprises the following steps:
the master arbitration module sends a probe network node request to all arbitration modules, so that all the arbitration modules probe all the network nodes according to the probe network node request;
the master arbitration module receives the detection results fed back by all arbitration modules and determines, according to a joint arbitration mechanism, whether a faulty network node exists;
if a faulty network node exists, the network configuration storage cluster is updated with the faulty network node information, so that the master scheduling module schedules the network elements running on the faulty network node to non-faulty network nodes in the data center cluster according to the faulty network node information in the updated network configuration storage cluster.
Optionally, before the master arbitration module sends the probe network node request to all arbitration modules, the method further comprises:
all arbitration modules in the representative network nodes elect a master arbitration module by means of a distributed lock;
and/or,
all scheduling modules in the representative network nodes elect a master scheduling module by means of a distributed lock.
Optionally, the master arbitration module sends a probe network node request to all arbitration modules, including:
the main arbitration module sends a network element information list acquisition request to the network configuration storage cluster and receives list information returned by the network configuration storage cluster according to the network element information list acquisition request;
and the main arbitration module sends a detection network node request to all arbitration modules according to the list information, wherein the detection network node request comprises information for detecting each network node in the list information.
Optionally, the master arbitration module receiving the detection results fed back by all arbitration modules and determining whether a faulty network node exists according to a joint arbitration mechanism includes:
the master arbitration module collects the reply message information of each network node in the detection results, wherein a reply message is a message that the network node returns within a designated time in response to the detection message sent by an arbitration module;
the master arbitration module determines whether more than half of the arbitration modules have detected that a given network node is unavailable.
Optionally, the network configuration storage cluster is a distributed key value database Etcd.
The embodiment of the invention also discloses a data center network element fault scheduling method. The data center cluster comprises more than one available area, a representative network node is preselected in each available area, and an arbitration module, a scheduling module and a network configuration storage cluster are deployed in the representative network nodes of all the available areas. The method comprises the following steps:
the master scheduling module detects, by monitoring the network configuration storage cluster, that a faulty network node exists, and sends a configuration information acquisition request for the faulty network node to the network configuration storage cluster;
the master scheduling module receives the network element configuration information of the faulty network node returned by the network configuration storage cluster in response to the configuration information acquisition request;
the master scheduling module schedules the network elements running on the faulty network node to a non-faulty network node according to the network element configuration information and the information of the non-faulty network nodes in the data center cluster;
wherein the faulty network node is an unavailable network node that the master arbitration module has identified through interaction with all arbitration modules and recorded in the network configuration storage cluster.
Optionally, the main dispatching module dispatches the network element running on the fault network node to the non-fault network node according to the network element configuration information and the information of the non-fault network node in the data center cluster, including:
And the main dispatching module dispatches the network element running on the fault network node to the non-fault network node with high priority according to the network element configuration information and the priority of the non-fault network node.
Optionally, the method further comprises:
and the master scheduling module adjusts the priority of the network node according to the frequency and the duration of the fault of the network node and updates the priority to the network configuration storage cluster.
Optionally, if the non-faulty network node has a shadow network element, the main dispatching module dispatches the network element running on the faulty network node to the non-faulty network node according to the network element configuration information and the information of the non-faulty network node in the data center cluster, including:
The main dispatching module dispatches the network element running on the fault network node to the shadow network element of the non-fault network node according to the network element configuration information;
The shadow network element has the same configuration as the network element of the fault network node and is in a ready state without providing service.
The embodiment of the invention also discloses electronic equipment, which comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory are communicated with each other through the communication bus;
the memory is used for storing a computer program;
The processor is configured to implement the method according to the embodiment of the present invention when executing the program stored in the memory.
Embodiments of the invention also disclose one or more computer-readable media having instructions stored thereon, which when executed by one or more processors, cause the processors to perform the methods described in the embodiments of the invention.
The embodiment of the invention has the following advantages:
In the method of the embodiment of the invention, the master arbitration module interacts with the network configuration storage cluster to determine the unavailable network nodes, so that the master scheduling module can schedule the network elements running on an unavailable network node to other normally available network nodes, which continue to provide network service, thereby realizing reliable and autonomous network element fault scheduling. In particular, every module on the full link is highly available during probing, and no split-brain condition arises.
In the method of the embodiment of the invention, the master scheduling module interacts with the network configuration storage cluster and schedules the network elements running on unavailable network nodes to other normally available network nodes so that network services continue to be provided, thereby realizing fast, cluster-autonomous network element fault scheduling without the intervention of external nodes.
Further, in the embodiment of the invention, the scheduling priority of a network node can be dynamically adjusted according to the fault history of the network node, so that scheduling is dynamic.
Drawings
Fig. 1 is a schematic diagram of the availability zones in a data center cluster provided in an embodiment of the present invention;
Fig. 2 is a flowchart of steps of a method for scheduling faults of network elements in a data center according to an embodiment of the present invention;
Fig. 3 is a flowchart of steps of a method for scheduling faults of network elements in a data center according to an embodiment of the present invention;
Fig. 4 is a signaling diagram of a data center network element fault scheduling method provided in an embodiment of the present invention;
Fig. 5 is a block diagram of a data center network element fault dispatching device provided in an embodiment of the present invention;
Fig. 6 is a block diagram of an electronic device provided in an embodiment of the invention;
Fig. 7 is a schematic diagram of a computer readable medium provided in an embodiment of the invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Referring to fig. 1, there is shown a schematic diagram of a data center cluster provided in an embodiment of the present invention. The data center cluster includes a plurality of network nodes, which may be physical servers, and a plurality of network elements are configured on, that is, run on, each network node. In this embodiment, the data center cluster is divided in advance into more than one available area, a representative network node is preselected in each available area, and an arbitration module, a scheduling module, and a network configuration storage cluster are deployed in the representative network nodes of all the available areas.
In this embodiment, by dividing the data center cluster into a plurality of available areas, even if one of the available areas fails, the other available areas can still keep working normally, so as to ensure continuity of service.
Meanwhile, the access pressure can be shared among different available areas through a load balancing technology, so that the resource utilization efficiency is optimized. When the load of one available area is too high, part of flow can be automatically or manually guided to other available areas, so that single-point overload is avoided.
In this embodiment, the number of network nodes in each available area may be greater than 1, and each available area may select its representative network nodes according to the number of network nodes it contains. For example, one or two network nodes may be selected according to actual needs; this embodiment does not limit the number. Fig. 1 shows the case in which two representative network nodes (referred to simply as representative nodes) are selected for each available area.
In a specific implementation, a server with strong computing capability is preferably selected as a representative network node, so that it can efficiently process various tasks, particularly tasks that place high demands on computing resources. Because the representative network node hosts several key components (the arbitration module, the scheduling module and the network configuration storage cluster), a server with sufficient memory should be selected to reduce the risk of service interruption caused by insufficient memory. In addition, the representative network node often needs to communicate frequently with other nodes, so network bandwidth is an important consideration: selecting a server with high network bandwidth improves data transmission speed and reduces delay.
In this embodiment, an arbitration module, a scheduling module, and a network configuration storage cluster are deployed in each representative node. It can be understood that, after a representative node fails, operation may be switched manually to another representative node, and that the arbitration module, the scheduling module, and the network configuration storage cluster of each representative node may be deployed directly when the cluster is planned.
Each available area selects representative nodes on which the key components (the arbitration module, the scheduling module, and the network configuration storage cluster) are deployed centrally, which helps to simplify management and maintenance. Because these components are critical to the stable operation of the whole data center, the high availability and fast response capability of the representative nodes are particularly important.
If a representative node fails, the system is able to detect the failure automatically and switch quickly to a standby representative node to reduce service interruption. In addition, fault recovery drills should be performed periodically to ensure stable operation when a real fault occurs.
Referring to fig. 2, a step flow chart of a data center network element fault scheduling method provided in an embodiment of the present invention is shown, which specifically may include the following steps:
201. the master arbitration module sends probe network node requests to all arbitration modules so that all arbitration modules probe all network nodes according to the probe network node requests.
The master arbitration module in this embodiment is a master arbitration module determined by all arbitration modules through contention.
For example, all arbitration modules within the representative network node elect a master arbitration module by way of distributed locking.
In this embodiment, common distributed lock implementations include ZooKeeper and Etcd. These tools provide a strongly consistent distributed coordination service, which ensures that exactly one master arbitration module is elected fairly and efficiently among multiple nodes.
The arbitration module that holds the lock periodically sends a heartbeat to the distributed lock service to keep the lock valid. If that arbitration module stops sending heartbeats, the lock is released and the other arbitration modules may attempt to acquire it.
A reasonable timeout is also set, so that the lock is released in time when an arbitration module does not respond for a long time, which prevents deadlock.
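The heartbeat and timeout behaviour described above can be illustrated with a minimal sketch. It assumes an in-memory stand-in for the lock service rather than a real ZooKeeper or Etcd client; the class name `LeaseLock`, its methods, and the module identifiers are hypothetical and are not part of the embodiment.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """In-memory stand-in for a distributed lock with a time-to-live (hypothetical)."""
    ttl: float                      # seconds the holder keeps the lock without a heartbeat
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, candidate: str, now: float) -> bool:
        """Grant the lock if it is free, expired, or already held by the candidate."""
        if self.holder is None or now >= self.expires_at or self.holder == candidate:
            self.holder = candidate
            self.expires_at = now + self.ttl
            return True
        return False

    def heartbeat(self, candidate: str, now: float) -> bool:
        """Renew the lease; fails if the caller is not the current, unexpired holder."""
        if self.holder == candidate and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False


lock = LeaseLock(ttl=5.0)
lock.try_acquire("arbiter-az1", now=0.0)   # arbiter-az1 becomes the master
lock.heartbeat("arbiter-az1", now=4.0)     # lease extended to t = 9.0
lock.try_acquire("arbiter-az2", now=9.5)   # heartbeats stopped, so arbiter-az2 takes over
```

Passing the current time explicitly keeps the sketch deterministic; a real deployment would rely on the lease mechanism of the chosen coordination service instead.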
Step 201 includes the following sub-steps:
the master arbitration module sends a network element information list acquisition request to the network configuration storage cluster and receives the list information returned by the network configuration storage cluster in response to the request;
the master arbitration module sends a probe network node request to all arbitration modules according to the list information, wherein the probe network node request includes information for probing each network node in the list information.
In this embodiment, the network configuration storage cluster stores the following information:
basic information of the network nodes, such as IP address, MAC address, host name, operating system version, etc.;
configuration parameters of the network elements running on each network node, such as port number, service type, resource limits, etc.;
network element distribution information, that is, which network elements run on which network nodes;
health status information, specifically the current health status of the network nodes and network elements, such as CPU utilization, memory usage, network bandwidth, etc.
In a specific implementation, the master arbitration module sends a network element information list acquisition request to the network configuration storage cluster, where the request typically includes some necessary parameters, such as a requester identity, a request timestamp, etc., to ensure validity and timeliness of the request.
The network configuration storage cluster returns the generated network element information list to the main arbitration module. The list information may include:
network node ID: an ID that uniquely identifies each network node;
network element ID: an ID that uniquely identifies each network element;
network element type: the type of the network element, such as Web service, database service and the like;
network element configuration: the specific configuration parameters of the network element;
network node status: the current status of the network node, such as online, offline, maintenance, etc.
Further, after receiving the network element information list, the main arbitration module analyzes the information therein and extracts detailed information of each network node, including an IP address, a port number, and the like.
The master arbitration module then constructs probe network node requests from the parsed network node information. Each probe request contains the target network node ID, the probe parameters, and the probe instructions.
The master arbitration module sends the constructed probe request to all arbitration modules. Each arbitration module will probe the designated network node based on the received request.
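As a hedged illustration of how the probe requests might be derived from the returned list, the following sketch builds one request per distinct network node. The field names (`node_id`, `ip`), the `ProbeRequest` type, and the `"ping"` instruction are assumptions made for this example; the embodiment only states that a request carries a target node ID, probe parameters, and probe instructions.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ProbeRequest:
    target_node_id: str
    target_ip: str
    timeout_s: float      # probe parameter: how long to wait for a reply
    instruction: str      # probe instruction (hypothetical value: "ping")


def build_probe_requests(list_info: List[Dict[str, str]],
                         timeout_s: float = 1.0) -> List[ProbeRequest]:
    """Create one probe request per distinct network node in the list information."""
    seen = set()
    requests = []
    for entry in list_info:
        node_id = entry["node_id"]
        if node_id in seen:        # several network elements may run on the same node
            continue
        seen.add(node_id)
        requests.append(ProbeRequest(node_id, entry["ip"], timeout_s, "ping"))
    return requests


list_info = [
    {"node_id": "node-1", "ip": "10.0.0.1", "element_id": "ne-a"},
    {"node_id": "node-1", "ip": "10.0.0.1", "element_id": "ne-b"},
    {"node_id": "node-2", "ip": "10.0.0.2", "element_id": "ne-c"},
]
probe_requests = build_probe_requests(list_info)
```

Deduplicating by node ID reflects that the list is organised per network element, while probing is performed per network node.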
202. The main arbitration module receives the detection results fed back by all the arbitration modules, determines whether a fault network node exists according to the joint arbitration mechanism, if so, executes the following step 203, otherwise, continues to execute the step 201 according to the detection period.
For example, the master arbitration module collects the reply message information of each network node in the detection results, wherein a reply message is a message that the network node returns within a designated time in response to the detection message sent by an arbitration module. The master arbitration module then determines whether more than half of the arbitration modules have detected that a given network node is unavailable; if so, it determines that the network node is faulty and obtains the information of the faulty network node.
In particular, if more than half of the arbitration modules report that a network node is not available, then the network node is considered to be faulty. The detection result of each arbitration module corresponds to one vote, and if more than half of the voting results are "unavailable", the network node is considered as a faulty node.
In some cases, different arbitration modules may be assigned different weights to reflect their reliability or importance, and the final decision result may be based on weighted voting.
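The majority rule and its weighted variant can be written as a short sketch. The function name `is_node_faulty` and the default weight of 1.0 are assumptions for illustration; the embodiment does not specify how weights are chosen.

```python
from typing import Dict, Optional


def is_node_faulty(votes: Dict[str, bool],
                   weights: Optional[Dict[str, float]] = None) -> bool:
    """Joint arbitration for one network node.

    votes maps an arbitration module ID to True when that module found the
    node unavailable. Modules without an explicit weight count as 1.0, so
    calling the function without weights gives a plain majority vote.
    """
    if not votes:
        return False
    weights = weights or {}
    total = sum(weights.get(module, 1.0) for module in votes)
    unavailable = sum(weights.get(module, 1.0)
                      for module, down in votes.items() if down)
    # Strictly more than half of the (weighted) votes must report "unavailable".
    return unavailable > total / 2


majority = is_node_faulty({"az1": True, "az2": True, "az3": False})
tie = is_node_faulty({"az1": True, "az2": False})
weighted = is_node_faulty({"az1": True, "az2": False, "az3": False},
                          weights={"az1": 3.0})
```

Requiring a strict majority means that an even split does not mark the node as faulty, which avoids rescheduling on ambiguous evidence.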
203. If a faulty network node exists, the network configuration storage cluster is updated with the faulty network node information, so that the master scheduling module schedules the network elements running on the faulty network node to non-faulty network nodes in the data center cluster according to the faulty network node information in the updated network configuration storage cluster.
In addition, in this embodiment, all the scheduling modules in the representative network node elect a master scheduling module in a distributed lock manner.
Further, the master scheduling module selects a suitable target node for each network element of the faulty node according to the current network node status and load condition.
The load should be balanced, that is, a node with lower load is selected, so that no node becomes overloaded because several network elements from faulty nodes are scheduled onto it. In addition, the resources of the target node (such as CPU, memory and network bandwidth) must be sufficient to meet the requirements of the network elements being moved.
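One way to combine the load-balancing and resource checks is sketched below. The types `NodeLoad` and `ElementDemand`, the use of element count as the load metric, and the tie-break on node ID are all assumptions for this example; the embodiment leaves the concrete selection policy open.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class NodeLoad:
    node_id: str
    healthy: bool
    free_cpu: float        # free CPU cores
    free_mem_gb: float     # free memory in GB
    element_count: int     # network elements currently hosted (load metric)


@dataclass
class ElementDemand:
    element_id: str
    cpu: float
    mem_gb: float


def pick_target(nodes: List[NodeLoad], demand: ElementDemand) -> Optional[NodeLoad]:
    """Return the least-loaded healthy node that can satisfy the demand."""
    candidates = [n for n in nodes
                  if n.healthy and n.free_cpu >= demand.cpu
                  and n.free_mem_gb >= demand.mem_gb]
    if not candidates:
        return None
    return min(candidates, key=lambda n: (n.element_count, n.node_id))


def schedule_all(nodes: List[NodeLoad],
                 demands: List[ElementDemand]) -> Dict[str, Optional[str]]:
    """Place elements one by one, updating node load after each placement."""
    placement: Dict[str, Optional[str]] = {}
    for demand in demands:
        target = pick_target(nodes, demand)
        if target is None:
            placement[demand.element_id] = None   # no node has enough resources
            continue
        target.free_cpu -= demand.cpu
        target.free_mem_gb -= demand.mem_gb
        target.element_count += 1
        placement[demand.element_id] = target.node_id
    return placement


nodes = [NodeLoad("n1", True, 4.0, 8.0, 1),
         NodeLoad("n2", True, 4.0, 8.0, 1),
         NodeLoad("n3", False, 16.0, 64.0, 0)]
placement = schedule_all(nodes, [ElementDemand("e1", 2.0, 4.0),
                                 ElementDemand("e2", 2.0, 4.0)])
```

Updating the load after each placement is what spreads several displaced network elements across nodes instead of stacking them on the first candidate.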
The network configuration storage cluster in this embodiment may be a distributed key-value database Etcd, which can ensure consistency and high availability of data, so that the above-mentioned master arbitration module and master scheduling module may be distributed in different representative nodes or may be located in the same representative node.
In the method of this embodiment, the master arbitration module interacts with the network configuration storage cluster and rapidly identifies the faulty network nodes in the data center cluster. In the implementation, multiple arbitration modules probe the network nodes synchronously and arbitrate jointly, so reliability is high. The method requires no external node intervention, and every module on the full link is highly available during detection, so a split-brain situation is avoided.
Referring to fig. 3, a step flow chart of a data center network element fault scheduling method provided in an embodiment of the present invention is shown, which specifically may include the following steps:
301. the master scheduling module detects, by monitoring the network configuration storage cluster, that a faulty network node exists, and sends a configuration information acquisition request for the faulty network node to the network configuration storage cluster;
302. the master scheduling module receives the network element configuration information of the faulty network node returned by the network configuration storage cluster in response to the configuration information acquisition request;
303. the master scheduling module schedules the network elements running on the faulty network node to a non-faulty network node according to the network element configuration information and the information of the non-faulty network nodes in the data center cluster;
The faulty network node is an unavailable network node that the master arbitration module has identified through interaction with all arbitration modules and recorded in the network configuration storage cluster.
The master scheduling module in this embodiment may adjust the priority of the network node according to the frequency and the duration of the failure of the network node, and update the priority to the network configuration storage cluster.
Furthermore, in step 303, the master scheduling module may schedule the network element running on the failed network node to the non-failed network node with a high priority according to the network element configuration information and the priority of the non-failed network node.
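The embodiment does not give a formula for the priority adjustment, so the following is only a sketch under an assumed linear penalty: each recorded fault and each second of downtime lowers a node's priority. The names `NodeRecord`, `adjusted_priority`, and the penalty coefficients are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class NodeRecord:
    node_id: str
    base_priority: float
    fault_count: int = 0          # how often the node has failed
    fault_seconds: float = 0.0    # accumulated duration of its faults
    healthy: bool = True


def adjusted_priority(node: NodeRecord,
                      count_penalty: float = 10.0,
                      duration_penalty: float = 0.01) -> float:
    """Lower the priority of nodes that fail often or stay down for long (assumed formula)."""
    return (node.base_priority
            - count_penalty * node.fault_count
            - duration_penalty * node.fault_seconds)


def highest_priority_target(nodes: List[NodeRecord]) -> Optional[NodeRecord]:
    """Pick the healthy node with the highest adjusted priority, if any."""
    healthy = [n for n in nodes if n.healthy]
    return max(healthy, key=adjusted_priority, default=None)


flaky = NodeRecord("node-a", base_priority=100.0, fault_count=3, fault_seconds=600.0)
stable = NodeRecord("node-b", base_priority=80.0)
target = highest_priority_target([flaky, stable])
```

With these assumed coefficients the nominally stronger but frequently failing node ranks below the stable one, which is the effect the dynamic adjustment aims for.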
In an alternative implementation, if a shadow network element exists in the non-faulty network node, the step 303 may include the main dispatching module dispatching the network element running on the faulty network node to the shadow network element of the non-faulty network node according to the network element configuration information.
The shadow network element in this embodiment has the same configuration as the network element of the faulty network node and is in a ready state in which it does not yet provide service. In this embodiment, not every network node in the data center cluster has a shadow network element; shadow network elements are deployed according to the planning of the data center cluster.
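A minimal sketch of the shadow-based failover follows. It assumes that each node records the state of its shadow network elements as `"ready"` or `"active"`; these state names and the `Node` type are illustrative and are not defined by the embodiment.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    node_id: str
    healthy: bool
    # element ID -> shadow state ("ready" = configured but not serving)
    shadows: Dict[str, str] = field(default_factory=dict)


def find_shadow_target(nodes: List[Node], element_id: str) -> Optional[Node]:
    """Find a healthy node that already holds a ready shadow of the element."""
    for node in nodes:
        if node.healthy and node.shadows.get(element_id) == "ready":
            return node
    return None


def fail_over(nodes: List[Node], element_id: str) -> Optional[str]:
    """Activate the shadow if one exists; return None so the caller can
    fall back to ordinary scheduling when no shadow is available."""
    target = find_shadow_target(nodes, element_id)
    if target is None:
        return None
    target.shadows[element_id] = "active"
    return target.node_id


cluster = [Node("n1", healthy=False, shadows={"ne-a": "ready"}),
           Node("n2", healthy=True, shadows={"ne-a": "ready"}),
           Node("n3", healthy=True)]
chosen = fail_over(cluster, "ne-a")
```

Because the shadow already carries the same configuration, activation only flips its state, which is why this path can be faster than creating the network element from scratch on a new node.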
The method of the embodiment of the invention interacts with the network configuration storage cluster through the main dispatching module, and after the main arbitration module determines that the network node has a fault, the main dispatching module can dispatch the network element running on the fault network node to other normal network nodes so as to continuously provide network service, thereby realizing reliable, autonomous and rapid network element fault dispatching. The method is rapid in scheduling, does not need external node intervention, and achieves autonomy of the data center cluster.
The master scheduling module of this embodiment can dynamically adjust the priority of the network nodes, so that the choice of the normal network node that receives the network elements of a faulty network node adapts to each node's fault history, which improves reliability.
Referring to fig. 4, a signaling diagram of a data center network element fault scheduling method according to an embodiment of the present invention is shown, which may specifically include the following steps:
401. The master arbitration module sends a network element information list acquisition request to the network configuration storage cluster;
402. The network configuration storage cluster returns list information according to the network element information list acquisition request;
403. The master arbitration module sends a network node detection request to all arbitration modules;
404. All arbitration modules send detection messages to each network node in the data center cluster according to the network node detection request;
405. Each network node returns a response to the detection message it receives;
406. Each arbitration module summarizes the responses received from the network nodes and sends its detection result to the master arbitration module;
407. The master arbitration module integrates the detection results of all arbitration modules, jointly arbitrates the availability of the network nodes, and either identifies unavailable network nodes as failed network nodes or determines the availability status of each network node;
408. The master arbitration module updates the information of the unavailable network nodes, or the availability status of the network nodes, in the network configuration storage cluster;
409. The master scheduling module monitors node availability in the network configuration storage cluster and, on observing that a failed network node has changed to unavailable, requests from the network configuration storage cluster the distribution information of the network elements running on the failed network node, the basic network node information, the related network element configuration, and the like;
410. The master scheduling module receives the network element distribution information, basic network node information, related network element configuration, and the like returned by the network configuration storage cluster for the failed network node;
411. The master scheduling module selects a suitable, normally operating network node for each network element on the failed network node;
412. The master scheduling module schedules the network elements to the normal network nodes and updates information such as the network element distribution and the network node scheduling priorities in the network configuration storage cluster.
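The probing part of this flow (steps 403 to 406) can be sketched as follows. The sketch assumes each arbitration module is represented by a hypothetical `probe` callable that returns True when a node replies in time; the real detection message format and transport are not specified in this embodiment, and `probe_all` and `collect_round` are illustrative names.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Probe = Callable[[str], bool]  # hypothetical: True if the node replied within the timeout


def probe_all(nodes: List[str], probe: Probe, max_workers: int = 8) -> Dict[str, bool]:
    """Steps 404-406 for one arbitration module: probe every node and summarize the replies."""
    def safe_probe(node: str) -> bool:
        try:
            return bool(probe(node))
        except Exception:
            # Assumption: any error or timeout is treated as "no reply received".
            return False

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        replies = list(pool.map(safe_probe, nodes))
    return dict(zip(nodes, replies))


def collect_round(arbiters: Dict[str, Probe], nodes: List[str]) -> Dict[str, Dict[str, bool]]:
    """Step 403 and the receiving side of step 406: one detection result per arbitration module."""
    return {name: probe_all(nodes, probe) for name, probe in arbiters.items()}
```

Probing the nodes concurrently keeps one arbitration round bounded by the slowest reply rather than by the sum of all replies, which is one way to obtain a result within a single arbitration period.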
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 5, a block diagram of a data center network element fault dispatching device provided in an embodiment of the present invention is shown, where the data center network element fault dispatching device in the embodiment is located in a network node and is set when a data center cluster is deployed, and each data center network element fault dispatching device may specifically include an arbitration module, a dispatching module, and a network configuration storage cluster.
When the data center cluster is deployed, each available area selects network nodes as representative nodes in proportion to its scale. In general, the available areas in a data center cluster are comparable in the number and scale of their network nodes, so the same number of representative nodes is selected in each available area and every available area carries equal weight in arbitration. Each representative node deploys a data center network element fault scheduling device comprising the arbitration module, the scheduling module and the network configuration storage cluster.
The network configuration storage cluster in this embodiment may be formed by deploying Etcd on the representative nodes. Etcd is a highly available distributed key-value database; the network configuration storage cluster formed by Etcd provides highly available storage and distributed lock functionality for the interactions of the above-described methods.
In use, all arbitration modules in the representative nodes compete for a distributed lock to elect a master: the module that acquires the lock becomes the master arbitration module. Likewise, all scheduling modules compete for a distributed lock, and the module that acquires it becomes the master scheduling module. An arbitration module or scheduling module that did not acquire the lock keeps retrying so that it can take over as master. The distributed lock has a lease time, and the master arbitration module or master scheduling module periodically renews the lease. If the master fails and the lease is not renewed, the distributed lock is released automatically, and the other modules compete for it to elect a new master arbitration module or master scheduling module.
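The lease-based election described above can be illustrated with a small in-memory model. This is only a sketch of the behavior, not Etcd's API: `LeaseLock` is a hypothetical class, and time is passed in explicitly as `now` so that the example is deterministic. A real deployment would rely on the lease and lock primitives of the storage cluster instead.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LeaseLock:
    """Hypothetical single-holder lock whose ownership expires unless it is renewed."""
    ttl: float                      # lease time in seconds
    holder: Optional[str] = None
    expires_at: float = 0.0

    def try_acquire(self, candidate: str, now: float) -> bool:
        """A module becomes master if the lock is free or its lease has expired."""
        if self.holder is None or now >= self.expires_at:
            self.holder = candidate
            self.expires_at = now + self.ttl
            return True
        # The current holder asking again still holds the lock; everyone else must retry.
        return self.holder == candidate

    def renew(self, candidate: str, now: float) -> bool:
        """The master periodically extends its lease; renewal fails once the lease is lost."""
        if self.holder == candidate and now < self.expires_at:
            self.expires_at = now + self.ttl
            return True
        return False
```

For example, with a 5 second lease, a master that renews at t=4 keeps the lock until t=9; if it then stops renewing, a standby module that retries at t=9 acquires the lock and becomes the new master.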
In the initial planning and deployment stage, the network node planned by the data center cluster is used as a node for operating a network element, the network node information is stored in a network configuration storage cluster, and the condition of adding and deleting the network node needs to be updated into the network configuration storage cluster. The network element information, the network element scheduling distribution information and the network element operation configuration are stored in a network configuration storage cluster.
The master arbitration module periodically initiates network node detection requests to all arbitration modules; after receiving a request, each arbitration module probes the network nodes and returns its detection results to the master arbitration module.
The master arbitration module summarizes the detection results and performs joint arbitration: if more than half of the arbitration modules detect that a network node is unavailable, the node is judged unavailable and the unavailable state is updated to the network configuration storage cluster.
That is, the master arbitration module in this embodiment performs periodic joint arbitration. Specifically, the master arbitration module obtains network node list information from the network configuration storage cluster and initiates network node detection requests to all arbitration modules. After receiving the request, each arbitration module sends a detection message to the network nodes specified in the request to check whether they are alive. If a reply message is received, the network node is considered available; the arbitration module then returns the detection result to the master arbitration module.
The master arbitration module performs joint arbitration on the summarized detection results. If more than half of the arbitration modules detect that a network node is available, the node is determined to be available, and its state is updated to the network configuration storage cluster. If the number of summarized detection results is less than half of the number of arbitration modules, the joint arbitration is invalid.
With synchronous detection, the arbitration result is output directly within one arbitration period, so detection is fast, real-time performance is high, and the fault scheduling response is faster.
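The joint arbitration rule can be sketched as a majority vote. The function below is a minimal illustration under stated assumptions: `joint_arbitrate` is a hypothetical name, "more than half" is counted against the total number of arbitration modules, and a node for which neither side reaches a majority is left undecided (None), a case the embodiment does not describe.

```python
from typing import Dict, Optional

Report = Dict[str, bool]  # node name -> True if this arbitration module saw a reply


def joint_arbitrate(
    reports: Dict[str, Report], total_arbiters: int
) -> Optional[Dict[str, Optional[bool]]]:
    """Return node -> available (True), unavailable (False) or undecided (None).

    Returns None when the whole round is invalid because fewer than half of the
    arbitration modules delivered a detection result.
    """
    if len(reports) * 2 < total_arbiters:
        return None  # joint arbitration is invalid for this period

    nodes = sorted({node for report in reports.values() for node in report})
    verdicts: Dict[str, Optional[bool]] = {}
    for node in nodes:
        up = sum(1 for report in reports.values() if report.get(node) is True)
        down = sum(1 for report in reports.values() if report.get(node) is False)
        if up * 2 > total_arbiters:
            verdicts[node] = True
        elif down * 2 > total_arbiters:
            verdicts[node] = False
        else:
            verdicts[node] = None  # assumption: keep the previous state when undecided
    return verdicts
```

With three arbitration modules, two "unavailable" votes are enough to mark a node as failed, so a single arbitration module with a broken link to that node cannot trigger rescheduling on its own.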
After the master scheduling module observes that a network node in the network configuration storage cluster is unavailable, it schedules the network elements running on that node to other healthy network nodes, completing the fault scheduling process;
The main dispatching module monitors the change of network node information in the network configuration storage cluster, if the network node is found to be unavailable, the main dispatching module inquires the network element dispatching distribution information in the network configuration storage cluster, acquires the network element associated with the network node, acquires the network element operation configuration, selects proper available network nodes, dispatches the network element instances to the appointed network node, updates the network element dispatching distribution information to the network configuration storage cluster, and completes the fault dispatching process.
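The reaction of the master scheduling module to an unavailable node can be sketched against a plain dictionary standing in for the network configuration storage cluster. The layout of `store` (`"nodes"` and `"placement"` keys), the function name `handle_node_unavailable` and the `choose_target` callback are all assumptions made for this illustration.

```python
from typing import Callable, Dict, List

# Hypothetical store layout:
#   store["nodes"][node_name]   -> {"available": bool}
#   store["placement"][ne_id]   -> node_name currently running that network element
Store = Dict[str, dict]
ChooseTarget = Callable[[str, List[str]], str]


def handle_node_unavailable(
    store: Store, failed_node: str, choose_target: ChooseTarget
) -> Dict[str, str]:
    """Move every network element off the failed node and record the new placement."""
    healthy = [
        name
        for name, info in store["nodes"].items()
        if info["available"] and name != failed_node
    ]
    moved: Dict[str, str] = {}
    for ne_id, node in list(store["placement"].items()):
        if node != failed_node:
            continue
        if not healthy:
            raise RuntimeError("no healthy network node left to schedule onto")
        target = choose_target(ne_id, healthy)
        store["placement"][ne_id] = target   # update the scheduling distribution information
        moved[ne_id] = target
    return moved
```

Passing the selection policy in as `choose_target` keeps the rescheduling loop independent of how nodes are ranked, so the priority and shadow rules described elsewhere in this embodiment can be plugged in without changing it.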
The master scheduling module dynamically adjusts the network element scheduling priority of each network node according to the number and duration of the node's failures; network nodes that fail frequently are isolated and network element scheduling to them is prohibited;
The master scheduling module periodically assigns scheduling priorities to the network nodes, and network nodes with a high scheduling priority have a higher probability of being selected during network element scheduling. If a network node fails, its scheduling priority is lowered; if it does not fail within a threshold time, its scheduling priority is raised. Network nodes with a low scheduling priority are excluded from scheduling even though they are in an available state. The network node scheduling priority information is stored in the network configuration storage cluster.
In particular, scheduling priority information of the network nodes is stored in a network configuration storage cluster, which information includes:
network node ID, current scheduling priority, number of failures, duration of failure, and last failure time.
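One possible shape for this record is sketched below. The field names, the JSON encoding and the key prefix `/netconfig/priority/` are assumptions for illustration; the embodiment only lists which pieces of information are stored.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional, Tuple


@dataclass
class NodePriorityRecord:
    """Scheduling priority information kept per network node (field names are hypothetical)."""
    node_id: str
    priority: int                        # current scheduling priority
    failure_count: int                   # number of failures
    failure_duration_s: float            # duration of failure, in seconds
    last_failure_time: Optional[float]   # Unix timestamp, None if the node never failed


def encode(record: NodePriorityRecord) -> Tuple[str, str]:
    """Serialize a record to a (key, value) pair for a key-value store."""
    key = f"/netconfig/priority/{record.node_id}"   # hypothetical key layout
    return key, json.dumps(asdict(record), sort_keys=True)


def decode(value: str) -> NodePriorityRecord:
    """Rebuild a record from its stored JSON value."""
    return NodePriorityRecord(**json.loads(value))
```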
For example, the periodic evaluation of the network nodes by the master scheduling module includes:
Acquiring history records: acquiring the historical failure record of each network node from the network configuration storage cluster;
Calculating and adjusting priority: calculating the scheduling priority of each network node according to the number and duration of its failures.
Calculating and adjusting the scheduling priority specifically includes the following:
Lowering the priority:
The number of failures exceeds a threshold: for example, if a network node fails 3 times within the last 24 hours, its scheduling priority is lowered from 10 to 5;
The failure duration exceeds a threshold: for example, if the failure duration of a network node exceeds 1 hour, its scheduling priority is lowered from 8 to 4.
Raising the priority:
No failures: if a network node has not failed within the past 7 days, its scheduling priority is raised from 5 to 10;
Good recovery: if a network node failed in the previous evaluation period but performs well in the subsequent period, its scheduling priority can be raised gradually from 4 to 8.
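The example rules above can be written as a single evaluation function. The thresholds (3 failures in 24 hours, 1 hour, 7 days) come from the examples; everything else is an assumption of this sketch: the name `adjust_priority`, halving as the lowering step (chosen because it reproduces 10 to 5 and 8 to 4), a maximum priority of 10, and a step of 1 per clean evaluation period for gradual recovery.

```python
from typing import Optional

MAX_PRIORITY = 10          # assumption: upper bound taken from the examples
MIN_PRIORITY = 1           # assumption: priorities never drop to zero
DAY_S = 24 * 3600


def adjust_priority(
    priority: int,
    failures_24h: int,
    longest_failure_s: float,
    seconds_since_last_failure: Optional[float],
    failed_this_period: bool,
) -> int:
    """Return the new scheduling priority of one network node after an evaluation period."""
    # Lowering rules: too many recent failures, or a failure that lasted too long.
    if failures_24h >= 3 or longest_failure_s > 3600:
        return max(MIN_PRIORITY, priority // 2)
    # No failures for 7 days (or never failed): restore full priority.
    if seconds_since_last_failure is None or seconds_since_last_failure >= 7 * DAY_S:
        return MAX_PRIORITY
    # Good recovery: a clean period after an earlier failure raises the priority gradually.
    if not failed_this_period:
        return min(MAX_PRIORITY, priority + 1)
    return priority
```

Under these assumptions a node lowered to 4 needs four consecutive clean evaluation periods to climb back to 8, which matches the gradual recovery described in the last rule.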
The master scheduling module preferably selects network nodes with high scheduling priority when scheduling network elements, and even if some network nodes are in a usable state, the network nodes are excluded from the scheduling range if the scheduling priority is lower than a set threshold.
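The exclusion rule in the preceding paragraph can be sketched as a filter followed by a sort. `candidate_nodes` and the `(available, priority)` tuple layout are hypothetical, and the threshold value is left to the caller because the embodiment does not fix it.

```python
from typing import Dict, List, Tuple


def candidate_nodes(nodes: Dict[str, Tuple[bool, int]], min_priority: int) -> List[str]:
    """Return schedulable node names, highest scheduling priority first.

    `nodes` maps a node name to (available, scheduling priority). Nodes that are
    unavailable, or available but below `min_priority`, are excluded.
    """
    eligible = [
        (priority, name)
        for name, (available, priority) in nodes.items()
        if available and priority >= min_priority
    ]
    # Sort by descending priority; break ties by name so the order is deterministic.
    eligible.sort(key=lambda item: (-item[0], item[1]))
    return [name for _, name in eligible]
```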
In a specific implementation process, the master scheduling module may dynamically adjust the evaluation period and the priority adjustment rule to adapt to different service requirements and network conditions. For frequently failed network nodes, the master scheduling module can isolate the frequently failed network nodes and avoid affecting other nodes and services. Meanwhile, an alarm mechanism is set, and when the priority of the network node changes, operation and maintenance personnel are timely informed to process.
By the method, high availability and performance optimization of the data center cluster can be ensured, and service continuity and data safety are ensured.
In an alternative implementation, for a network element that is fault-sensitive, the primary scheduling module may deploy a shadow network element for the network element in a different network node, where the shadow network element is configured identically to the network element, but in a ready state, and does not provide services to the outside. Once the original network element fails, the shadow network element is immediately put into use, so that the fault dispatching is faster.
In other embodiments, if fault tolerance at the level of the available area is required, the cluster includes at least three available areas. This ensures that when one available area fails completely, the modules in the remaining available areas still make up more than half of the total, which satisfies the operating requirements of master election and joint arbitration.
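The reason for requiring at least three available areas can be checked with a short calculation. The sketch assumes every available area contributes the same number of representative nodes, as described for deployment; the function name is illustrative.

```python
def quorum_survives_zone_loss(zones: int, reps_per_zone: int) -> bool:
    """True if losing one whole available area still leaves more than half of the modules."""
    total = zones * reps_per_zone
    remaining = (zones - 1) * reps_per_zone
    # "More than half" of the original total, compared without floating point.
    return remaining * 2 > total
```

With two available areas, losing one leaves exactly half of the modules, which is not a majority; with three, two thirds remain, so master election and joint arbitration can continue.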
The network node applying the device can realize multi-point synchronous detection and joint arbitration in the data center cluster, and has high reliability and quick scheduling.
The data center cluster of the embodiment can complete fault dispatching without intervention of a platform outside the cluster. In addition, the data center cluster has high fault tolerance, can tolerate node level to available region level faults, has strong adaptability, and can adapt to small-scale clusters and large-scale clusters formed by a plurality of available regions.
Further, the modules along the full link of the data center cluster are highly available and free of split brain: the modules in the cluster run in primary/standby mode and are backed by the Raft protocol in the Etcd of the network configuration storage cluster, so no split brain occurs. The data center cluster realizes intelligent scheduling by dynamically adjusting scheduling priority according to the failure history of the network nodes, which prevents unstable nodes that are frequently unavailable from affecting service quality.
For the device embodiments, since they are substantially similar to the method embodiments, the description is relatively simple, and reference is made to the description of the method embodiments for relevant points.
In addition, the embodiment of the invention also provides a data center cluster, the cluster is divided into a plurality of available areas, each available area comprises a plurality of network nodes, and at least one network node comprises the data center network element fault scheduling device.
In addition, the embodiment of the present invention further provides an electronic device, as shown in fig. 6, including a processor 1301, a communication interface 1302, a memory 1303 and a communication bus 1304, where the processor 1301, the communication interface 1302, and the memory 1303 complete communication with each other through the communication bus 1304,
A memory 1303 for storing a computer program;
processor 1301, when executing the program stored in memory 1303, implements the following steps:
The main arbitration module sends a detection network node request to all arbitration modules so that all the arbitration modules detect all the network nodes according to the detection network node request;
the main arbitration module receives detection results fed back by all arbitration modules and determines whether a fault network node exists according to a joint arbitration mechanism;
If a failed network node exists, the master arbitration module updates the network configuration storage cluster according to the failed network node information, so that the master scheduling module schedules the network elements running on the failed network node to non-failed network nodes in the data center cluster according to the failed network node information in the updated network configuration storage cluster.
Or alternatively
The master scheduling module monitors that a fault network node exists in the network configuration storage cluster, and sends a configuration information acquisition request of the fault network node to the network configuration storage cluster;
the master dispatching module receives network element configuration information of the fault network node returned by the network configuration storage cluster according to the configuration information acquisition request;
The main dispatching module dispatches the network element running on the fault network node to the non-fault network node according to the network element configuration information and the information of the non-fault network node in the data center cluster;
the fault network node is an unavailable network node which is interactively detected by the main arbitration module and all arbitration modules, and is updated to the network configuration storage cluster.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be classified as an address bus, a data bus, a control bus, or the like. For ease of illustration, the bus is shown in the figure as a single bold line, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic device and other devices.
The memory may include random access memory (Random Access Memory, RAM) or may include non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU), a network processor (Network Processor, NP), a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided by the present invention, as shown in fig. 7, there is further provided a computer readable storage medium 1401, in which instructions are stored, which when run on a computer, cause the computer to perform a data center network element failure scheduling method as described in the above embodiment.
In yet another embodiment of the present invention, a computer program product containing instructions that, when run on a computer, cause the computer to perform a data center network element failure scheduling method as described in the above embodiment is also provided.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the flows or functions according to the embodiments of the present invention are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, wireless, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., a solid state disk (SSD)).
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
In this specification, the embodiments are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, since the system embodiments are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, see the description of the method embodiments.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (10)

1. A data center network element fault scheduling method, characterized in that a data center cluster comprises more than one available area, a representative network node is preselected in each available area, and an arbitration module, a scheduling module and a network configuration storage cluster are deployed in the representative network nodes of all the available areas, the method comprising the following steps:
The main arbitration module sends a detection network node request to all arbitration modules so that all the arbitration modules detect all the network nodes according to the detection network node request;
the main arbitration module receives detection results fed back by all arbitration modules and determines whether a fault network node exists according to a joint arbitration mechanism;
If a failed network node exists, the master arbitration module updates the network configuration storage cluster according to the failed network node information, so that the master scheduling module schedules the network elements running on the failed network node to non-failed network nodes in the data center cluster according to the failed network node information in the updated network configuration storage cluster.
2. The method of claim 1, wherein before the master arbitration module sends probe network node requests to all arbitration modules, the method further comprises:
all arbitration modules in the representative network nodes elect a main arbitration module in a distributed lock mode;
and/or,
All the dispatching modules in the representative network nodes elect a master dispatching module in a distributed locking mode.
3. The method of claim 1, wherein the master arbitration module sending probe network node requests to all arbitration modules comprises:
the main arbitration module sends a network element information list acquisition request to the network configuration storage cluster and receives list information returned by the network configuration storage cluster according to the network element information list acquisition request;
and the main arbitration module sends a detection network node request to all arbitration modules according to the list information, wherein the detection network node request comprises information for detecting each network node in the list information.
4. The method of claim 1, wherein the master arbitration module receives the detection results fed back by all arbitration modules, and determining whether a failed network node exists according to a joint arbitration mechanism comprises:
the main arbitration module collects the reply message information of each network node in the detection result, wherein the reply message is a message which is replied by the network node in a designated time in response to the detection message sent by the arbitration module;
and determines whether more than half of the arbitration modules detect that a network node is unavailable.
5. The method of claim 1, wherein the network configuration storage cluster is a distributed key value database Etcd.
6. A data center network element fault scheduling method, characterized in that a data center cluster comprises more than one available area, a representative network node is preselected in each available area, and an arbitration module, a scheduling module and a network configuration storage cluster are deployed in the representative network nodes of all the available areas, the method comprising the following steps:
the master scheduling module monitors that a fault network node exists in the network configuration storage cluster, and sends a configuration information acquisition request of the fault network node to the network configuration storage cluster;
the master dispatching module receives network element configuration information of the fault network node returned by the network configuration storage cluster according to the configuration information acquisition request;
The main dispatching module dispatches the network element running on the fault network node to the non-fault network node according to the network element configuration information and the information of the non-fault network node in the data center cluster;
the fault network node is an unavailable network node which is interactively detected by the main arbitration module and all arbitration modules, and is updated to the network configuration storage cluster.
7. The method of claim 6, wherein the master scheduling module schedules network elements operating on the failed network node to the non-failed network node based on the network element configuration information and information of the non-failed network nodes in the data center cluster, comprising:
And the main dispatching module dispatches the network element running on the fault network node to the non-fault network node with high priority according to the network element configuration information and the priority of the non-fault network node.
8. The method of claim 6, wherein the method further comprises:
and the master scheduling module adjusts the priority of the network node according to the frequency and the duration of the fault of the network node and updates the priority to the network configuration storage cluster.
9. The method of claim 6, wherein if the non-faulty network node has a shadow network element, the primary scheduling module schedules the network element running on the faulty network node to the non-faulty network node based on the network element configuration information and information of the non-faulty network node in the data center cluster, comprising:
The main dispatching module dispatches the network element running on the fault network node to the shadow network element of the non-fault network node according to the network element configuration information;
The shadow network element has the same configuration as the network element of the fault network node and is in a ready state without providing service.
10. An electronic device comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other via the communication bus;
the memory is used for storing a computer program;
the processor being configured to implement the method of any of claims 1-9 when executing a program stored on a memory.
CN202411695331.1A 2024-11-25 2024-11-25 A data center network element fault scheduling method and electronic equipment Pending CN119728498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411695331.1A CN119728498A (en) 2024-11-25 2024-11-25 A data center network element fault scheduling method and electronic equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411695331.1A CN119728498A (en) 2024-11-25 2024-11-25 A data center network element fault scheduling method and electronic equipment

Publications (1)

Publication Number Publication Date
CN119728498A true CN119728498A (en) 2025-03-28

Family

ID=95092324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411695331.1A Pending CN119728498A (en) 2024-11-25 2024-11-25 A data center network element fault scheduling method and electronic equipment

Country Status (1)

Country Link
CN (1) CN119728498A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN120539633A (en) * 2025-07-23 2025-08-26 南京纳恩自动化科技有限公司 Power equipment predicts early warning device
CN120539633B (en) * 2025-07-23 2025-09-23 南京纳恩自动化科技有限公司 A prediction and early warning device for power equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination