
CN118069349A - A variable depth resource management method and system for multiple scenarios

Info

Publication number
CN118069349A
Authority
CN
China
Prior art keywords
resource management
load
instance
throughput
resident
Prior art date
Legal status
Pending
Application number
CN202410043057.3A
Other languages
Chinese (zh)
Inventor
戴屹钦
王睿伯
董勇
卢凯
张伟
张文喆
谢旻
周恩强
迟万庆
李佳鑫
邬会军
吴振伟
葛可适
杨梨花
Current Assignee
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology
Priority to CN202410043057.3A
Publication of CN118069349A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00 Arrangements for program control, e.g. control units
    • G06F9/06 Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/46 Multiprogramming arrangements
    • G06F9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F9/5005 Allocation of resources, e.g. of the central processing unit [CPU] to service a request
    • G06F9/5027 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
    • G06F9/505 Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multi Processors (AREA)

Abstract

The invention discloses a variable depth resource management method and system for multiple scenarios. The method comprises: receiving a high-performance load or a high-throughput load submitted by a user and determining the load type; if the load is a high-performance load, allocating resources to the high-performance job tasks through a resident resource management instance and controlling their execution, the resident resource management instance having a fixed depth; if the load is a high-throughput load, having a high-throughput load manager parse the load and split it into multiple subtask sets and, after resource allocation is complete, starting a temporary resource management parent instance and child instances according to the number of subtask sets to control execution of the job tasks, the temporary resource management parent instance, the child instances, and the high-throughput load manager being deployed and destroyed together with the high-throughput load so as to achieve variable depth. The invention dynamically changes the depth of the resource management system according to the system load type to suit both high-performance computing and high-throughput computing, thereby providing stable and efficient resource management for both at the same time.

Description

A variable depth resource management method and system for multiple scenarios

Technical Field

The present invention relates to the technical field of system resource management, and specifically to a variable depth resource management method and system for multiple scenarios.

Background Art

A supercomputer consists of a large number of tightly coupled computing nodes interconnected by a high-speed network, which gives scientific computing an extremely high degree of parallelism. On this basis, high-performance computing (HPC) and high-throughput computing (HTC) are the two main computing modes of supercomputers. High-performance computing is the traditional mode. It targets large-scale scientific computing and simulation problems and is widely used in engineering design, weather forecasting, nuclear physics research, and other fields. A single high-performance computing load contains one or more high-performance computing tasks. Depending on the scientific purpose, the tasks within one load may depend on one another. Each high-performance computing task is mapped to one job on the supercomputer. A single high-performance job usually contains a large number of computing processes and occupies substantial computing and storage resources. Such jobs usually distribute their work across many computing processes through the Message Passing Interface (MPI) and rely on the supercomputer's computing and communication performance to complete the computation. As supercomputers grow, so does the scale of high-performance computing.
High-performance computing therefore requires the resource management system to be highly scalable and able to manage large-scale parallel jobs. For high-performance computing loads, the performance indicators of the resource management system mainly include overall system resource utilization and average job waiting time at different scales.

High-throughput computing is an emerging computing mode, used mainly in fields such as biomedicine, machine learning, and information security. A single high-throughput computing load consists of a large number of loosely coupled, mutually independent high-throughput tasks. Each high-throughput task is usually mapped to one job on the supercomputer, and the jobs can usually be executed in any order. In addition, a single high-throughput task has modest computing and memory requirements and usually only needs to run on a single computing core. The high concurrency of high-throughput computing requires the resource management system to have high throughput (throughput refers to the average amount of data that can be successfully transmitted per unit time on a communication channel), that is, the ability to schedule, load, and monitor a large number of small tasks over a long period. For high-throughput computing loads, the performance indicators of the resource management system therefore mainly include system throughput and task execution efficiency.

On a single supercomputer, high-performance computing and high-throughput computing may run at the same time, and the same computing resources may be used for the two modes in succession. This complex usage scenario requires the supercomputer to provide stable and efficient resource management services for both computing modes simultaneously. Existing resource management systems, however, often cannot manage high-performance and high-throughput computing tasks equally well, so that managing at least one of the two kinds of task wastes resources or slows computation. There is therefore an urgent need for a resource management method and system that can manage high-performance and high-throughput computing tasks effectively at the same time.

Summary of the Invention

The technical problem to be solved by the present invention is as follows: in view of the technical problems in the prior art, the present invention provides a variable depth resource management method and system for multiple scenarios, to solve the problem that existing resource management systems, because of their fixed depth, cannot handle the management and execution of both high-performance and high-throughput computing tasks.

To solve the above technical problem, the technical solution proposed by the present invention is:

A variable depth resource management method for multiple scenarios, comprising the following steps:

receiving a high-performance load or a high-throughput load submitted by a user and determining the load type;

if the load type is determined to be a high-performance load, allocating resources to the high-performance load job task by calling a resident resource management instance in the resource management system and, after resource allocation is complete, controlling execution of the high-performance load job task through the resident resource management instance, the resident resource management instance having a fixed depth;

if the load type is determined to be a high-throughput load, starting a high-throughput load manager through the unified interface of the resource management system; the high-throughput load manager parses the high-throughput load job task and splits it into multiple subtask sets; resources are allocated through the resident resource management instance; after resource allocation is complete, the resident resource management instance starts a temporary resource management parent instance and multiple temporary resource management child instances according to the number of subtask sets of the high-throughput load job task; the temporary resource management parent instance and each temporary resource management child instance control execution of the high-throughput load job task; and the temporary resource management parent instance, the temporary resource management child instances, and the high-throughput load manager are deployed and destroyed together with the high-throughput load so as to achieve variable depth.
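The branching in the steps above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `LoadType`, `Load`, and `manage` are hypothetical, the load type is assumed to be declared by the submitter (the text does not say how it is determined), and the depth value of 3 for the high-throughput branch is an assumption obtained by counting the chain resident instance, temporary parent, temporary child.

```python
from dataclasses import dataclass
from enum import Enum


class LoadType(Enum):
    HPC = "high-performance"
    HTC = "high-throughput"


@dataclass
class Load:
    name: str
    tasks: list          # one entry per task in the load
    load_type: LoadType  # assumed to be declared at submission time


def manage(load: Load, num_task_sets: int = 2) -> dict:
    """Describe which resource management instances would serve the load."""
    if load.load_type is LoadType.HPC:
        # The resident instance allocates resources and runs the job itself,
        # so no extra layer is added.
        return {"depth": 1, "instances": ["resident"]}
    # High-throughput branch: one temporary parent plus one child per subtask
    # set, all nested under the resident instance for the lifetime of the load.
    m = max(1, min(num_task_sets, len(load.tasks)))
    children = [f"temp-child-{i}" for i in range(m)]
    # Depth 3 is an assumption: resident -> temporary parent -> temporary child.
    return {"depth": 3, "instances": ["resident", "temp-parent", *children]}
```

Calling `manage` on a high-performance load returns only the resident instance, while a high-throughput load additionally lists the temporary instances that exist only while that load runs.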

As a further improvement of the above technical solution:

When the load type is determined to be a high-performance load, allocating resources to the high-performance load job task by calling the resident resource management instance in the resource management system comprises:

the unified interface of the resource management system calls the user interface of the resident resource management instance on the login node of the computer and requests resource allocation from the control node agent process of the resident resource management instance on the control node of the computer;

the control node agent process of the resident resource management instance parses the high-performance load job task, schedules the corresponding system resources for the current job task, and returns the allocation result to the user interface of the resident resource management instance on the login node;

the control node agent process of the resident resource management instance assigns a unique job number to the current high-performance load job task.

When the load type is determined to be a high-performance load, controlling execution of the high-performance load job task through the resident resource management instance comprises:

the control node agent process of the resident resource management instance sends, through the user interface, the information about the high-performance load job task to the computing node agent processes of the resident resource management instance on the computing nodes assigned to the current job task;

the computing node agent processes of the resident resource management instance load and run the high-performance load job task;

while the high-performance load job task is running, the resident resource management instance monitors its running state and records the state in real time on the control node agent process of the resident resource management instance; after the current job task finishes, the resident resource management instance reclaims the resources allocated to it, completing the run of the high-performance load.

When the load type is determined to be a high-throughput load, allocating resources through the resident resource management instance comprises:

the high-throughput load manager parses the total resource requirement of the current high-throughput load and determines the total number of computing nodes required; it splits the high-throughput tasks in the load into multiple subtask sets and generates one subtask set load script for each subtask set;

the high-throughput load manager calls the user interface of the resident resource management instance and sends a resource allocation request to the control node agent process of the resident resource management instance;

after receiving the resource allocation request, the control node agent process of the resident resource management instance allocates resources on the computing nodes of the computer and a unique job number to the high-throughput load job task, and returns them to the high-throughput load manager.
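The splitting step performed by the high-throughput load manager can be illustrated with a short Python sketch. The text does not specify how tasks are assigned to subtask sets or what a load script contains, so the round-robin split and the shell-style script layout below are assumptions, and the function names `split_into_task_sets` and `make_load_script` are hypothetical.

```python
def split_into_task_sets(tasks: list[str], m: int) -> list[list[str]]:
    """Split task command lines into at most m non-empty subtask sets.

    Round-robin assignment is an assumption; the source only says the load
    is split into multiple subtask sets.
    """
    if m < 1:
        raise ValueError("m must be at least 1")
    sets = [tasks[i::m] for i in range(m)]
    return [s for s in sets if s]


def make_load_script(task_set: list[str]) -> str:
    """Render one subtask set as a load script (hypothetical format)."""
    lines = ["#!/bin/sh", "# subtask set load script (illustrative layout)"]
    lines.extend(task_set)
    return "\n".join(lines) + "\n"


task_sets = split_into_task_sets(["t0", "t1", "t2", "t3", "t4"], 2)
scripts = [make_load_script(s) for s in task_sets]
```

Each generated script would later be run by one temporary resource management child instance, so the number of scripts equals the number of child instances to start.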

When the load type is determined to be a high-throughput load, starting, through the resident resource management instance, a temporary resource management parent instance and multiple temporary resource management child instances according to the number of subtask sets of the high-throughput load job task, and controlling execution of the high-throughput load job task by the temporary resource management parent instance and each temporary resource management child instance, comprises:

the resident resource management instance sends, through the user interface on the login node, the information about the current job to each computing node allocated to the current high-throughput load;

on each computing node allocated to the current task, the computing node agent process of the resident resource management instance starts the temporary resource management parent instance; the parent instance is the first-layer temporary resource management instance, consists of one temporary resource management instance agent process on each allocated node, and manages all resources allocated to the current high-throughput load;

the temporary resource management parent instance starts multiple temporary resource management child instances; each child instance manages the computing resources of a subset of the computing nodes and runs the corresponding subtask set load script; a child instance exits once all high-throughput tasks in its subtask set load script have finished;

while the high-throughput load job task is running, the high-throughput load manager continuously monitors its real-time status by the unique job number and stores the status in the temporary resource management instance global database of the resource management system.

The temporary resource management parent instance starts M temporary resource management child instances, where M is the number of high-throughput subtask sets produced by the split. Each child instance contains N/M agent processes, one for each of N/M computing nodes, and manages the computing resources of those N/M computing nodes, where N is the number of computing nodes allocated to the current high-throughput load. When a child instance runs its high-throughput task load script, it schedules the high-throughput tasks onto the computing resources it manages according to a FIFO scheduling policy.
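The N/M partition and the FIFO policy can be made concrete with a small Python simulation. This is a sketch under stated assumptions: N is taken to be divisible by M (the text does not say how a remainder is handled), each task occupies exactly one slot such as a core, and `partition_nodes` and `fifo_schedule` are hypothetical names.

```python
import heapq
from collections import deque


def partition_nodes(nodes: list[str], m: int) -> list[list[str]]:
    """Give each of the m child instances N/m contiguous nodes."""
    n = len(nodes)
    if m < 1 or n % m != 0:
        # Assumption: the source writes N/M, so N is taken to be divisible by M.
        raise ValueError("node count must be a positive multiple of m")
    size = n // m
    return [nodes[i * size:(i + 1) * size] for i in range(m)]


def fifo_schedule(durations: list[float], slots: int) -> list[float]:
    """Start tasks strictly in submission order on the earliest free slot.

    Returns the start time of each task; a slot stands for one core managed
    by a child instance.
    """
    if slots < 1:
        raise ValueError("slots must be at least 1")
    free_at = [0.0] * slots            # min-heap of times at which slots free up
    heapq.heapify(free_at)
    queue = deque(enumerate(durations))
    starts = [0.0] * len(durations)
    while queue:
        index, duration = queue.popleft()   # FIFO: oldest task first
        start = heapq.heappop(free_at)
        starts[index] = start
        heapq.heappush(free_at, start + duration)
    return starts


groups = partition_nodes(["n0", "n1", "n2", "n3"], 2)
starts = fifo_schedule([3, 1, 2, 1], slots=2)
```

With four nodes and M = 2, each child instance manages two nodes. In the scheduling example the third task waits until a slot frees at time 1, and the fourth starts at time 3, because tasks are never reordered.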

When the load type is determined to be a high-throughput load, starting the temporary resource management parent instance and multiple temporary resource management child instances through the resident resource management instance according to the number of subtask sets, and controlling execution of the high-throughput load job task by the parent instance and each child instance, further comprises:

when the high-throughput load manager detects that some tasks in the high-throughput load job task have failed, it determines whether each failed task has failed for the first time; for a first failure, it generates a new task script, starts a new temporary resource management child instance under the temporary resource management parent instance, and submits the failed task to the new child instance to run; for a repeated failure, it reports the error;

when the high-throughput load manager detects that all tasks in the high-throughput load job task have finished correctly, the temporary resource management parent instance is destroyed, the resident resource management instance reclaims the resources allocated to the high-throughput load, and the high-throughput load manager exits, completing the run of the high-throughput load.
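The retry-once rule described above can be sketched in Python. The sketch is illustrative only: `run_with_single_retry` is a hypothetical name, the `run` callback stands in for submitting a task to a child instance, and starting a new child instance is reduced to a second pass over the failed tasks.

```python
from typing import Callable


def run_with_single_retry(
    tasks: list[str], run: Callable[[str, int], bool]
) -> tuple[list[str], list[str]]:
    """Run tasks, resubmitting each failed task exactly once.

    run(task, attempt) returns True on success. Returns the list of tasks
    that completed and the list of tasks whose error is reported.
    """
    failed_once: set[str] = set()
    completed: list[str] = []
    reported: list[str] = []
    pending = list(tasks)
    while pending:
        retry: list[str] = []
        for task in pending:
            attempt = 2 if task in failed_once else 1
            if run(task, attempt):
                completed.append(task)
            elif task in failed_once:
                reported.append(task)      # second failure: report the error
            else:
                failed_once.add(task)      # first failure: schedule a rerun
                retry.append(task)
        # In the described method the retried tasks would be written to a new
        # task script and handed to a newly started child instance.
        pending = retry
    return completed, reported
```

A task that fails only on its first attempt ends up in the completed list, while a task that fails twice is reported and not run a third time.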

The method further comprises querying job task information, which comprises the following steps:

the resource management system receives a query request and determines the type of the job task from the job number given in the request;

if the job task to be queried is a high-performance load, the resource management system, through the unified interface, calls the user interface of the resident resource management instance on the login node of the computer to issue a job task information query request; the control node agent process of the resident resource management instance on the control node of the computer parses the request, looks up the information about the job task in the local data structure of the resource management system, and returns it to the unified interface;

if the job task to be queried is a high-throughput load, the resource management system looks up the information about the job task by job number in the temporary resource management instance global database and returns it to the unified interface.
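The query routing can be sketched as a small Python class. How the job type is derived from the job number is not described, so the `job_types` table below is an assumption, and the two dictionaries merely stand in for the control node agent's local data structure and the temporary resource management instance global database. The class name `UnifiedInterface` and its methods are hypothetical.

```python
class UnifiedInterface:
    """Route job queries to the store that holds the job's information."""

    def __init__(self) -> None:
        self.job_types: dict[int, str] = {}        # assumed bookkeeping: job number -> type
        self.resident_local: dict[int, dict] = {}  # stands in for the local data structure
        self.temp_global_db: dict[int, dict] = {}  # stands in for the global database

    def record(self, job_id: int, job_type: str, info: dict) -> None:
        """Store job information where the corresponding instance would keep it."""
        if job_type not in ("hpc", "htc"):
            raise ValueError("job_type must be 'hpc' or 'htc'")
        self.job_types[job_id] = job_type
        store = self.resident_local if job_type == "hpc" else self.temp_global_db
        store[job_id] = info

    def query(self, job_id: int) -> dict:
        """Return job information, choosing the store by job type."""
        if self.job_types[job_id] == "hpc":
            return self.resident_local[job_id]
        return self.temp_global_db[job_id]


ui = UnifiedInterface()
ui.record(101, "hpc", {"state": "running"})
ui.record(102, "htc", {"state": "running", "tasks_done": 40})
```

Here `ui.query(101)` is answered from the resident instance's store and `ui.query(102)` from the global database, mirroring the two branches above.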

The present invention also provides a variable depth resource management system for multiple scenarios, the resource management system comprising:

a resource management system interface module, configured to receive a high-performance load or a high-throughput load submitted by a user and determine the load type;

a high-performance load job task management module, configured to, if the load type is determined to be a high-performance load, allocate resources to the high-performance load job task by calling the resident resource management instance in the resource management system and, after resource allocation is complete, control execution of the high-performance load job task through the resident resource management instance, the resident resource management instance having a fixed depth;

a high-throughput load job task management module, configured to, if the load type is determined to be a high-throughput load, start a high-throughput load manager through the unified interface of the resource management system; the high-throughput load manager parses the high-throughput load job task and splits it into multiple subtask sets; resources are allocated through the resident resource management instance; after resource allocation is complete, the resident resource management instance starts a temporary resource management parent instance and multiple temporary resource management child instances according to the number of subtask sets of the high-throughput load job task; the temporary resource management parent instance and each temporary resource management child instance control execution of the high-throughput load job task; and the temporary resource management parent instance, the temporary resource management child instances, and the high-throughput load manager are deployed and destroyed together with the high-throughput load so as to achieve variable depth.

The present invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above method.

Compared with the prior art, the advantages of the present invention are:

The present invention determines the load type and dynamically changes the depth of the resource management system accordingly, so as to suit both high-performance computing and high-throughput computing. A high-performance load is served directly by the resident resource management instance, which ensures good support for high-performance loads. When a high-throughput load containing a large number of high-throughput tasks needs to run, a temporary resource management parent instance and child instances are started to increase the depth of the resource management system. This raises the throughput of the resource management system and allows high-throughput loads to be managed efficiently, so that the computer can provide stable and efficient resource management for high-performance computing and high-throughput computing at the same time.

Brief Description of the Drawings

FIG. 1 is an architecture diagram of the variable depth resource management system for multiple scenarios in an embodiment of the present invention.

FIG. 2 is a flow chart of the variable depth resource management method for multiple scenarios in this embodiment.

FIG. 3 is a flow chart of executing high-performance load and high-throughput load job tasks in the resource management method of this embodiment.

FIG. 4 is a flow chart of querying high-performance load and high-throughput load job tasks in the resource management method of this embodiment.

Detailed Description

For a better understanding of the above technical solution, it is described in detail below with reference to the accompanying drawings and specific embodiments.

The resource management system is key system software in a high-performance computer system. It manages the resources of the entire system and responds to user requests. Its performance directly affects key indicators such as system resource utilization, system throughput, job turnaround time, and parallel application performance, and therefore bears directly on overall system performance and user experience. How effectively a supercomputer can be used depends to a certain extent on the functionality and operating quality of its resource management system. A supercomputer has three main types of nodes: login nodes, control nodes, and computing nodes. Login nodes are where users log in to the system and interact with its job scheduler, control nodes coordinate and manage the various resources and services in the computer network, and computing nodes execute computing tasks.

A unit in a resource management system that provides resource management and job scheduling functions is called a resource management instance. Resource management instances may be nested: an instance can hand a subset of the resources it manages to a new resource management instance. A resource management system can therefore consist of a single resource management instance or of multiple, possibly nested, instances.

The depth of a resource management system is the maximum number of nesting layers among its resource management instances, and it directly affects the system's performance. The larger the number of nesting layers, the more units with resource management capability the system contains, the more distributed the resource management system is, the more jobs can be scheduled simultaneously, and the greater the throughput. On the other hand, more nesting layers also mean higher overhead for maintaining the instances and for resource coordination and load balancing across the layers.
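The definition of depth as the maximum nesting level can be expressed directly in Python. The `Instance` class and the instance names are hypothetical and serve only to illustrate the definition.

```python
from dataclasses import dataclass, field


@dataclass
class Instance:
    """A resource management instance and the instances nested inside it."""
    name: str
    children: list["Instance"] = field(default_factory=list)


def depth(root: Instance) -> int:
    """Maximum number of nesting layers, counting the root as layer 1."""
    return 1 + max((depth(child) for child in root.children), default=0)


# A single instance managing everything has depth 1.
flat = Instance("resident")

# Nesting a parent with two children under the root gives depth 3.
nested = Instance(
    "resident",
    [Instance("temp-parent", [Instance("temp-child-0"), Instance("temp-child-1")])],
)
```

Under this definition, adding sibling instances at an existing layer increases the number of management units without increasing the depth; only a new layer of nesting does.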

Commonly used resource management systems today rely on a resident resource management instance whose depth is fixed at 1. Such an instance has a global view of system resources and jobs and a rich set of resource allocation and job scheduling policies, so it copes well with multi-user environments and manages resources and jobs efficiently. It therefore supports large-scale parallel computing loads well and suits high-performance computing. However, because the depth is fixed, the system cannot add resource management units through instance nesting; it is usually throughput-limited and is not well suited to high-throughput computing.

To address the problem that a traditional resource management system has a fixed depth that is unfavorable to high-throughput computing, this embodiment constructs the variable-depth resource management system for multiple scenarios shown in FIG. 1. The system comprises a resident resource management instance, temporary resource management instances, a high-throughput load manager, and a temporary resource management instance global database, deployed across the login node, control node, and compute nodes of the computer.

The resident resource management instance is deployed in the system permanently. As the first-layer instance of the resource management system, it manages all resources in the system. It starts a control node agent process on the control node and a compute node agent process on every compute node, and it provides a user interface that users can call directly on the login node. The resident resource management instance has a global view of resources and jobs, and its rich allocation and scheduling policies provide complete job management services for traditional high-performance loads.

A temporary resource management instance is started by the resident resource management instance and is deployed and destroyed together with a high-throughput load. Each high-throughput load corresponds to one temporary resource management instance. For a high-throughput load, the resident resource management instance provides only resource allocation; the large number of high-throughput tasks is managed on the allocated resources by the temporary resource management instance that it starts. When several high-throughput loads execute at the same time, several temporary resource management instances run at the same time. All agent processes of a temporary resource management instance run on compute nodes and are called the compute node agent processes of the temporary resource management instance. The temporary resource management instance provides job management services for the high-throughput load on the resources allocated by the resident resource management instance.

The high-throughput load manager is deployed on the login node as a runtime tool that is deployed and destroyed together with a high-throughput load. Each high-throughput load corresponds to one high-throughput load manager. When several high-throughput loads execute at the same time, several high-throughput load managers run at the same time.

The temporary resource management instance global database is deployed permanently on the login node. It stores the real-time status of all high-throughput tasks in each high-throughput load, as well as the information corresponding to each high-performance load.

Based on the architecture of the variable-depth resource management system for multiple scenarios described above, this embodiment provides the variable-depth resource management method for multiple scenarios shown in FIG. 2, which comprises the following steps:

Step S01: receive a high-performance load or a high-throughput load submitted by a user and determine the load type;

Step S02: if the load type is a high-performance load, allocate resources to the high-performance load job task by calling the resident resource management instance of the resource management system and, once allocation is complete, execute the job task under the control of the resident resource management instance, which has a fixed depth;

Step S03: if the load type is a high-throughput load, start a high-throughput load manager through the unified interface of the resource management system; the high-throughput load manager parses the high-throughput load job task and splits it into multiple subtask sets, and resources are allocated through the resident resource management instance. Once allocation is complete, the resident resource management instance starts a temporary resource management parent instance and multiple temporary resource management child instances according to the number of subtask sets, and the parent instance and the child instances control execution of the high-throughput load job task. The parent instance, the child instances, and the high-throughput load manager are deployed and destroyed together with the high-throughput load, which makes the depth variable.

Specifically, in step S01 the user calls the unified interface of the variable-depth resource management system for multiple scenarios on the login node to submit a load script. The unified interface checks the submitted load script and determines its type.

By determining the load type, this embodiment can change the depth of the resource management system dynamically to suit both high-performance computing and high-throughput computing. A traditional high-performance load is served directly by the resident resource management instance, which preserves good support for high-performance loads. When a high-throughput load containing a large number of high-throughput tasks must be executed, a temporary resource management parent instance and child instances are started to increase the depth of the resource management system and raise its throughput, so that the high-throughput load is managed efficiently. The computer can thus provide stable and efficient resource management for high-performance and high-throughput computing at the same time.
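The branching in steps S01 to S03 can be sketched as follows. The text does not say how the unified interface recognizes the load type, so the `type_tag` field, the `"hpc"`/`"htc"` values, and all function names here are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class LoadType(Enum):
    HIGH_PERFORMANCE = "hpc"
    HIGH_THROUGHPUT = "htc"


@dataclass
class LoadScript:
    """A submitted load script; type_tag is an assumed, explicit type marker."""
    type_tag: str
    tasks: List[str]


def classify(script: LoadScript) -> LoadType:
    """Step S01 (sketch): map the assumed tag in the script to a load type."""
    return LoadType(script.type_tag)


def instance_layers(load_type: LoadType) -> List[str]:
    """Steps S02/S03 (sketch): list the instance layers that serve the load."""
    if load_type is LoadType.HIGH_PERFORMANCE:
        # Served directly by the resident instance: depth stays at 1.
        return ["resident"]
    # The resident instance only allocates resources; a temporary parent
    # instance and temporary child instances are nested beneath it.
    return ["resident", "temporary-parent", "temporary-child"]


for script in (LoadScript("hpc", ["mpi_job"]), LoadScript("htc", ["t0", "t1", "t2"])):
    layers = instance_layers(classify(script))
    print(script.type_tag, len(layers), layers)
```

An unknown tag raises `ValueError` from the `Enum` lookup, which is one simple way to reject scripts of an unrecognized type.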

In this embodiment, when the load type is determined in step S02 to be a high-performance load, allocating resources to the high-performance load job task by calling the resident resource management instance of the resource management system comprises:

Step S201: the unified interface of the resource management system calls the user interface of the resident resource management instance on the login node, which requests resource allocation from the control node agent process of the resident resource management instance on the control node;

Step S202: the control node agent process of the resident resource management instance parses the high-performance load job task, schedules the corresponding system resources and allocates them to the current job task, and returns the allocation result to the user interface of the resident resource management instance on the login node;

Step S203: the control node agent process of the resident resource management instance assigns a unique job number to the current high-performance load job task.

In this embodiment, when the load type is determined to be a high-performance load, executing the high-performance load job task under the control of the resident resource management instance comprises:

Step S211: the control node agent process of the resident resource management instance sends, through the user interface, the information about the high-performance load job task (including the job script and the resource allocation information) to the compute node agent processes of the resident resource management instance on the compute nodes assigned to the current job task;

Step S212: the compute node agent process of the resident resource management instance loads and runs the high-performance load job task;

Step S213: while the high-performance load job task runs, the resident resource management instance monitors its running status and records the real-time status on the control node agent process of the resident resource management instance; after the job task finishes, the resident resource management instance reclaims the resources allocated to it, which completes the run of the high-performance load.

As shown in FIG. 3, in a specific application embodiment, the detailed steps for a job task of the high-performance load type are as follows:

Step 1: the unified interface of the variable-depth resource management system for multiple scenarios directly calls the user interface of the resident resource management instance, which requests resource allocation from the agent process of the resident resource management instance on the control node;

Step 2: the control node agent process of the resident resource management instance parses the high-performance load script, schedules suitable system resources and allocates them to the job task, assigns a unique job number to the job task, and returns the allocation result to the user interface of the resident resource management instance on the login node; from this point on, the high-performance load exists in the system as a concrete high-performance job;

Step 3: the control node agent process of the resident resource management instance sends, through the user interface, the information about the high-performance load job task (including the job script and the resource allocation information) to the compute node agent processes of the resident resource management instance on the compute nodes assigned to the job task;

Step 4: the compute node agent process of the resident resource management instance loads the job task and starts the user process; the high-performance load job task submitted by the user now begins to run;

Step 5: the resident resource management instance monitors the running of the high-performance load job task, records the real-time status of the job task on the control node agent process of the resident resource management instance, and reclaims the computing resources allocated to the job task after it finishes; the high-performance load has then finished running.

As noted above, the depth of a resource management system is the maximum number of nesting levels among its resource management instances. When this embodiment determines that a job task is of the high-performance load type, the fixed-depth resident resource management instance alone is sufficient, given the characteristics of large-scale parallel high-performance computing loads, to manage system resources and job tasks efficiently. This saves system resources and reduces cost without affecting computing performance.

In this embodiment, when the load type is determined to be a high-throughput load, allocating resources through the resident resource management instance comprises:

Step S301: the high-throughput load manager parses the total resource requirement of the current high-throughput load and determines the total number of compute nodes required; it then splits the high-throughput tasks in the load into multiple subtask sets and generates one subtask set loading script for each set, keeping the number of high-throughput tasks in each subtask set as equal as possible;

Step S302: the high-throughput load manager calls the user interface of the resident resource management instance to send a resource allocation request to the control node agent process of the resident resource management instance;

Step S303: on receiving the resource allocation request, the control node agent process of the resident resource management instance allocates resources on the compute nodes and a unique job number to the high-throughput load job task and returns them to the high-throughput load manager.
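The balanced split described in step S301 can be illustrated with a minimal sketch. The function name `split_tasks` and the list-of-strings task representation are assumptions; the text only requires that the subtask sets be as equal in size as possible.

```python
from typing import List, Sequence


def split_tasks(tasks: Sequence[str], m: int) -> List[List[str]]:
    """Split tasks into m contiguous subtask sets whose sizes differ by at most one."""
    if m <= 0:
        raise ValueError("m must be positive")
    base, extra = divmod(len(tasks), m)
    subsets: List[List[str]] = []
    start = 0
    for i in range(m):
        # The first `extra` sets each take one additional task.
        size = base + (1 if i < extra else 0)
        subsets.append(list(tasks[start:start + size]))
        start += size
    return subsets


subsets = split_tasks([f"task-{i}" for i in range(10)], 3)
print([len(s) for s in subsets])  # prints [4, 3, 3]
```

Each resulting list would correspond to one subtask set loading script; how such a script is actually written is not specified in the text and is left out here.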

In this embodiment, when the load type is determined to be a high-throughput load, starting a temporary resource management parent instance and multiple temporary resource management child instances through the resident resource management instance according to the number of subtask sets, and executing the high-throughput load job task under the control of the parent instance and the child instances, comprises:

Step S311: the resident resource management instance sends the information about the current job, through the user interface on the login node, to each compute node allocated to the current high-throughput load;

Step S312: on each compute node allocated to the current task, the compute node agent process of the resident resource management instance starts the temporary resource management parent instance; the parent instance is the first-layer temporary resource management instance, consists of one temporary resource management instance agent process on each allocated node, and manages all resources allocated to the current high-throughput load;

Step S313: the temporary resource management parent instance starts multiple temporary resource management child instances; each child instance manages the computing resources of a subset of the compute nodes and runs the corresponding subtask set loading script, and exits once all high-throughput tasks in that script have finished;

Step S314: while the high-throughput load job task runs, the high-throughput load manager continuously monitors its real-time status by the unique job number and stores the status in the temporary resource management instance global database of the resource management system.

In this embodiment, the temporary resource management parent instance starts M temporary resource management child instances, where M is the number of high-throughput subtask sets produced by the split. Each child instance contains N/M agent processes, one per compute node, and manages the computing resources of those N/M compute nodes, where N is the number of compute nodes allocated to the current high-throughput load. When a child instance runs its high-throughput task loading script, it schedules the high-throughput tasks onto the computing resources it manages according to a FIFO policy.
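A minimal sketch of this arrangement follows: N nodes are divided among M child instances, and each child dispatches its tasks in FIFO order. The sketch assumes that N is divisible by M and that each node runs one task at a time; the task durations, node names, and function names are hypothetical.

```python
import heapq
from collections import deque
from typing import Deque, Dict, List, Tuple


def partition_nodes(nodes: List[str], m: int) -> List[List[str]]:
    """Give each of the m child instances N/m nodes (assumes N is divisible by m)."""
    if m <= 0 or len(nodes) % m != 0:
        raise ValueError("this sketch assumes m > 0 and N divisible by m")
    size = len(nodes) // m
    return [nodes[i * size:(i + 1) * size] for i in range(m)]


def fifo_schedule(tasks: List[Tuple[str, int]], nodes: List[str]) -> Dict[str, Tuple[str, int]]:
    """Dispatch (name, duration) tasks in submission order.

    Each task goes to the node that becomes free first (assumption: one task
    per node at a time). Returns a mapping task name -> (node, start time).
    """
    free_at = [(0, node) for node in nodes]  # (time the node becomes free, node)
    heapq.heapify(free_at)
    queue: Deque[Tuple[str, int]] = deque(tasks)
    placement: Dict[str, Tuple[str, int]] = {}
    while queue:
        name, duration = queue.popleft()  # FIFO: the oldest task is served first
        start, node = heapq.heappop(free_at)
        placement[name] = (node, start)
        heapq.heappush(free_at, (start + duration, node))
    return placement


groups = partition_nodes(["n0", "n1", "n2", "n3"], 2)
print(groups)  # prints [['n0', 'n1'], ['n2', 'n3']]
print(fifo_schedule([("a", 3), ("b", 1), ("c", 2)], groups[0]))
```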

In this embodiment, when the load type is determined to be a high-throughput load, starting the temporary resource management parent instance and child instances and executing the high-throughput load job task under their control further comprises:

Step S321: when the high-throughput load manager detects that some tasks in the high-throughput load job task have failed, it determines for each failed task whether this is the task's first failure; if so, it generates a new task script, starts a new temporary resource management child instance under the temporary resource management parent instance, and submits the failed task to the new child instance; if not, it reports the error;

Step S322: when the high-throughput load manager detects that all tasks in the high-throughput load job task have completed correctly, the temporary resource management parent instance is destroyed, the resident resource management instance reclaims the resources allocated to the high-throughput load, and the high-throughput load manager exits, which completes the run of the high-throughput load.
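The resubmit-once policy of step S321 can be sketched as below. The `run` callback stands in for executing a task on a child instance and is hypothetical, as are the task names. As a simplification, the sketch keeps processing the remaining tasks after an error has been reported.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def run_with_one_retry(tasks: Iterable[str], run: Callable[[str], bool]) -> Tuple[List[str], List[str]]:
    """Run every task; resubmit a task after its first failure, report it after the second.

    Returns (completed tasks, tasks reported as errors).
    """
    failures: Dict[str, int] = {}
    pending: List[str] = list(tasks)
    completed: List[str] = []
    reported: List[str] = []
    while pending:
        task = pending.pop(0)
        if run(task):
            completed.append(task)
        elif failures.get(task, 0) == 0:
            # First failure: resubmit (in the text, on a newly started child instance).
            failures[task] = 1
            pending.append(task)
        else:
            # Not the first failure: report the error instead of retrying again.
            reported.append(task)
    return completed, reported


attempts: Dict[str, int] = {}


def fake_run(task: str) -> bool:
    """Hypothetical executor: 'flaky' fails once, 'bad' always fails."""
    attempts[task] = attempts.get(task, 0) + 1
    if task == "bad":
        return False
    if task == "flaky":
        return attempts[task] > 1
    return True


completed, reported = run_with_one_retry(["ok", "flaky", "bad"], fake_run)
print(completed, reported)  # prints ['ok', 'flaky'] ['bad']
```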

As shown in FIG. 3, in a specific application embodiment, the detailed steps for a job task of the high-throughput load type are as follows:

Step 1: the unified interface of the variable-depth resource management system for multiple scenarios starts a high-throughput load manager;

Step 2: the job parsing/submission component of the high-throughput load manager parses the total resource requirement of the high-throughput load and determines the total number N of compute nodes required; the high-throughput load manager then splits the high-throughput tasks in the load into M subtask scripts (M is configurable), keeping the number of high-throughput tasks in each script as equal as possible;

Step 3: the job parsing/submission component of the high-throughput load manager calls the user interface of the resident resource management instance to send a resource allocation request to the control node agent process of the resident resource management instance;

Step 4: the control node agent process of the resident resource management instance allocates computing resources and a unique job number to the high-throughput load and returns them to the high-throughput load manager; using the job number, the job monitoring/resubmission component of the high-throughput load manager begins to monitor the real-time status of the tasks continuously and writes the status to the temporary resource management instance global database;

Step 5: the resident resource management instance sends the information about the job task, through the user interface on the login node, to every compute node allocated to the high-throughput load;

Step 6: on each compute node allocated to the high-throughput load, the compute node agent process of the resident resource management instance starts the first-layer temporary resource management instance (the parent instance for short); the parent instance consists of the temporary resource management instance compute node agent processes on the N allocated nodes and manages all resources allocated to the high-throughput load;

Step 7: the parent instance starts M second-layer temporary resource management instances (child instances for short); each child instance contains N/M temporary resource management instance compute node agent processes, one per compute node, manages the computing resources of those N/M compute nodes, runs its corresponding high-throughput task loading script, and schedules the high-throughput tasks onto the computing resources it manages according to a FIFO (first in, first out) policy;

Step 8: when all high-throughput tasks of a child instance have finished, that child instance is destroyed;

Step 9: when the job monitoring/resubmission component of the high-throughput load manager detects that some tasks in the high-throughput load job task have failed, it checks whether each failure is the task's first: if so, it generates a new task script, starts a new child instance under the parent instance, and resubmits the failed high-throughput task to that child instance; if the task has failed a second time, it reports the error to the user and jumps to step 11;

Step 10: when the job monitoring/resubmission component of the high-throughput load manager detects that all tasks in the high-throughput load job task have completed correctly, the parent instance is destroyed and the resident resource management instance reclaims the computing resources allocated to the high-throughput load;

Step 11: the high-throughput load manager is destroyed; the high-throughput load job task has then finished running.

When this embodiment determines that a job task is of the high-throughput load type, it takes into account that a resident resource management instance with depth fixed at 1 cannot add resource management instances through nesting and is therefore throughput-limited, which hampers the execution of high-throughput tasks. In this case the resident resource management instance only allocates resources for the high-throughput tasks and starts two layers of temporary resource management instances, the parent instance and the child instances, to manage and execute them. Instance nesting thus increases the depth of the resource management system from 1 to 3, and allocating and executing tasks with a three-layer system raises throughput and allows the high-throughput load to be managed efficiently.

This embodiment stores the job information of high-throughput tasks in the temporary resource management instance global database. Because temporary resource management instances are temporary, the job information an instance manages is deleted when the instance exits. To address this, the job monitoring/resubmission component of each high-throughput load manager periodically collects the real-time job information of its high-throughput load and writes it to the global database. The global database runs permanently on the login node and its state does not change when temporary resource management instances start or exit. It can therefore be updated promptly and retain, for the long term, the high-throughput load information managed by all current and past temporary resource management instances for users to query, which avoids loss of information.

This embodiment uses the job monitoring/resubmission component of the high-throughput load manager to monitor whether the high-throughput load job task completes successfully. When the component detects a failed high-throughput task, it resubmits the task to a temporary resource management instance, which gives the run of the high-throughput load a degree of fault tolerance.
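The idea that task status outlives the temporary instance that produced it can be sketched with Python's built-in `sqlite3` module. The text does not name a database engine or schema, so SQLite, the `task_status` table, and its columns are assumptions chosen only to keep the example self-contained.

```python
import sqlite3
from typing import Dict


def open_global_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (assumed) global database and create the status table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS task_status ("
        "job_id INTEGER, task TEXT, status TEXT, PRIMARY KEY (job_id, task))"
    )
    return conn


def record_status(conn: sqlite3.Connection, job_id: int, task: str, status: str) -> None:
    """Insert or overwrite the latest status of one task, as a periodic monitor would."""
    conn.execute("INSERT OR REPLACE INTO task_status VALUES (?, ?, ?)", (job_id, task, status))
    conn.commit()


def query_job(conn: sqlite3.Connection, job_id: int) -> Dict[str, str]:
    """Return task -> status for a job, whether or not its instances still exist."""
    rows = conn.execute(
        "SELECT task, status FROM task_status WHERE job_id = ? ORDER BY task", (job_id,)
    )
    return dict(rows.fetchall())


conn = open_global_db()
record_status(conn, 42, "t0", "running")
record_status(conn, 42, "t1", "running")
record_status(conn, 42, "t0", "done")  # a later poll overwrites the earlier status
print(query_job(conn, 42))  # prints {'t0': 'done', 't1': 'running'}
```

Passing a file path instead of `":memory:"` would keep the records across process restarts, which is closer to the long-lived database the text describes.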

As shown in FIG. 4, in this embodiment the method further includes job task information query, which comprises the following steps:

Step S401: the resource management system receives a query request and determines the type of the job task from the job number given in the request;

Step S402: if the job task to be queried is a high-performance load, the resource management system, through the unified interface, calls the user interface of the resident resource management instance on the login node to issue a job task information query request; the control node agent process of the resident resource management instance on the control node parses the request, looks up the information about the job task in the local data structures of the resource management system, and returns it to the unified interface;

Step S403: if the job task to be queried is a high-throughput load job task, the resource management system looks up the information about the job task by job number in the temporary resource management instance global database and returns it to the unified interface.

Specifically, in step S401 the user calls the unified interface of the variable-depth resource management system for multiple scenarios on the login node to submit a load query request. Because both high-performance loads and high-throughput loads correspond to a job number, the query command includes the job number to be queried. Step S403 may optionally be followed by a step in which the unified interface prints the job information to the screen for the user to view.
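The routing in steps S401 to S403 can be sketched as a lookup keyed by job number. The text does not state how the job type is derived from the job number; this sketch simply assumes that each job number is known to exactly one of two stores, both modeled here as plain dictionaries with hypothetical contents.

```python
from typing import Dict, Tuple

JobInfo = Dict[str, str]


def query_job_info(
    job_id: int,
    resident_jobs: Dict[int, JobInfo],
    global_db: Dict[int, JobInfo],
) -> Tuple[str, JobInfo]:
    """Route a query by job number and return (answering source, job information)."""
    if job_id in resident_jobs:
        # High-performance load: answered from the resident instance's local data.
        return "resident-instance", resident_jobs[job_id]
    if job_id in global_db:
        # High-throughput load: answered from the global database.
        return "global-database", global_db[job_id]
    raise KeyError(f"unknown job number: {job_id}")


# Hypothetical contents of the two stores.
resident_jobs = {101: {"type": "high-performance", "state": "running"}}
global_db = {202: {"type": "high-throughput", "state": "done"}}

for job_id in (101, 202):
    source, info = query_job_info(job_id, resident_jobs, global_db)
    print(job_id, source, info["state"])
```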

可以理解,设置临时资源管理实例全局数据库来存储所有高性能任务及高通量任务的信息数据,并通过每个任务对应的唯一作业号进行查询,可以更及时地关注作业任务的运行状态,并在执行完成后也能随时查看对应数据以备有需。It can be understood that setting up a temporary resource management instance global database to store information data of all high-performance tasks and high-throughput tasks, and querying through the unique job number corresponding to each task, can pay more attention to the running status of the job task in a more timely manner, and can also view the corresponding data at any time after the execution is completed in case of need.

本实施例还提供面向多场景的可变深度资源管理系统,资源管理系统包括:This embodiment also provides a variable depth resource management system for multiple scenarios, and the resource management system includes:

资源管理系统接口模块,用于接收用户提交的高性能负载或高通量负载并判断负载类型;The resource management system interface module is used to receive high-performance loads or high-throughput loads submitted by users and determine the load type;

高性能负载作业任务管理模块,用于如果判断到负载类型为高性能负载,通过调用资源管理系统中的常驻资源管理实例对高性能负载作业任务进行资源分配,并在资源分配完成后,通过常驻资源管理实例控制执行高性能负载作业任务,常驻资源管理实例为固定深度;The high-performance load job task management module is used to allocate resources to the high-performance load job task by calling the resident resource management instance in the resource management system if the load type is determined to be a high-performance load, and after the resource allocation is completed, control the execution of the high-performance load job task through the resident resource management instance, and the resident resource management instance has a fixed depth;

高通量负载作业任务管理模块,用于如果判断到负载类型为高通量负载,通过资源管理系统中的统一接口启动高通量负载管理器,由高通量负载管理器对高通量负载作业任务进行解析并拆分为多个子任务集合,通过常驻资源管理实例进行资源分配,在资源分配完成后,通过常驻资源管理实例根据高通量负载作业任务的子任务集合数量启动临时资源管理父实例和多个临时资源管理子实例,由临时资源管理父实例和各个临时资源管理子实例控制执行高通量负载作业任务,临时资源管理父实例、临时资源管理子实例高通量负载管理器随高通量负载临时部署与销毁以实现可变深度。The high-throughput load job task management module is used to start the high-throughput load manager through the unified interface in the resource management system if the load type is determined to be a high-throughput load. The high-throughput load manager parses the high-throughput load job task and splits it into multiple sub-task sets. Resources are allocated through the resident resource management instance. After the resource allocation is completed, a temporary resource management parent instance and multiple temporary resource management child instances are started through the resident resource management instance according to the number of sub-task sets of the high-throughput load job task. The temporary resource management parent instance and each temporary resource management child instance control the execution of the high-throughput load job task. The temporary resource management parent instance and the temporary resource management child instance high-throughput load manager are temporarily deployed and destroyed with the high-throughput load to achieve variable depth.

This embodiment also provides a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, it implements the above multi-scenario-oriented variable depth resource management method.

The system and medium of this embodiment correspond to the above method and therefore have the same advantages as described for the method.

All or part of the processes in the methods of the above embodiments may also be carried out by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the above method embodiments. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium includes any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. The memory is used to store computer programs and/or modules, and the processor implements various functions by running or executing the computer programs and/or modules stored in the memory and by calling data stored in the memory. The memory may include high-speed random access memory and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card (Flash Card), at least one magnetic disk storage device, a flash memory device, or another volatile solid-state storage device.

The above are only preferred embodiments of the present invention, and the protection scope of the present invention is not limited to the above embodiments; all technical solutions that fall under the concept of the present invention belong to its protection scope. It should be noted that, for a person of ordinary skill in the art, improvements and refinements made without departing from the principle of the present invention shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A multi-scenario-oriented variable depth resource management method, characterized by comprising the following steps:
receiving a high-performance load or a high-throughput load submitted by a user and determining the load type;
if the load type is determined to be a high-performance load, allocating resources to the high-performance load job task by calling a resident resource management instance in a resource management system and, after the resource allocation is completed, controlling execution of the high-performance load job task through the resident resource management instance, wherein the resident resource management instance has a fixed depth;
if the load type is determined to be a high-throughput load, starting a high-throughput load manager through a unified interface in the resource management system, the high-throughput load manager parsing the high-throughput load job task and splitting it into a plurality of subtask sets; allocating resources through the resident resource management instance; after the resource allocation is completed, starting, through the resident resource management instance, a temporary resource management parent instance and a plurality of temporary resource management child instances according to the number of subtask sets of the high-throughput load job task; and controlling execution of the high-throughput load job task by the temporary resource management parent instance and each temporary resource management child instance, wherein the temporary resource management parent instance, the temporary resource management child instances, and the high-throughput load manager are temporarily deployed and destroyed along with the high-throughput load so as to achieve variable depth.
2. The multi-scenario-oriented variable depth resource management method of claim 1, wherein:
if the load type is determined to be a high-performance load, allocating resources to the high-performance load job task by calling the resident resource management instance in the resource management system comprises the following steps:
the unified interface in the resource management system calls the user interface of the resident resource management instance located on a login node of the computer, and requests resource allocation from the resident resource management instance control node proxy process located on a control node of the computer;
the resident resource management instance control node proxy process parses the high-performance load job task, schedules the corresponding system resources for allocation to the current job task, and returns the resource allocation result to the resident resource management instance user interface on the login node;
the resident resource management instance control node proxy process allocates a unique job number to the current high-performance load job task.
3. The multi-scenario-oriented variable depth resource management method of claim 2, wherein:
if the load type is determined to be a high-performance load, controlling execution of the high-performance load job task through the resident resource management instance comprises the following steps:
the resident resource management instance control node proxy process sends, through the user interface, the related information of the high-performance load job task to the resident resource management instance computing node proxy process on the computing nodes corresponding to the current job task;
the resident resource management instance computing node proxy process loads and runs the high-performance load job task;
while the high-performance load job task is running, the resident resource management instance monitors its running state and records the state monitored in real time on the resident resource management instance control node proxy process; after the current high-performance load job task finishes running, the resident resource management instance reclaims the resources allocated to it, which completes the run of the high-performance load.
4. The multi-scenario-oriented variable depth resource management method of claim 1, wherein:
if the load type is determined to be a high-throughput load, allocating resources through the resident resource management instance comprises the following steps:
the high-throughput load manager analyzes the total resource requirement of the current high-throughput load and determines the total number of computing nodes required; the high-throughput load manager splits the high-throughput tasks in the high-throughput load into a plurality of subtask sets and generates a subtask set loading script for each subtask set;
the high-throughput load manager calls the user interface of the resident resource management instance and sends a resource allocation request to the resident resource management instance control node proxy process;
after receiving the resource allocation request, the resident resource management instance control node proxy process allocates resources on the computing nodes of the computer and a unique job number to the high-throughput load job task, and returns the allocated resources and the unique job number to the high-throughput load manager.
5. The multi-scenario-oriented variable depth resource management method of claim 4, wherein:
if the load type is determined to be a high-throughput load, starting the temporary resource management parent instance and the plurality of temporary resource management child instances through the resident resource management instance according to the number of subtask sets of the high-throughput load job task, and controlling execution of the high-throughput load job task by the temporary resource management parent instance and each temporary resource management child instance, comprises the following steps:
the resident resource management instance sends, through the user interface on the login node, the related information of the current job to each computing node allocated to the current high-throughput load;
on each computing node allocated to the current task, the resident resource management instance computing node proxy process starts the temporary resource management parent instance, wherein the temporary resource management parent instance is the first-layer temporary resource management instance, consists of one temporary resource management instance proxy process on each allocated node, and manages all resources allocated to the current high-throughput load;
the temporary resource management parent instance starts a plurality of temporary resource management child instances, each of which manages the computing resources of a subset of the computing nodes and runs the corresponding subtask set loading script; a temporary resource management child instance exits once all high-throughput tasks in its subtask set loading script have finished running;
while the high-throughput load job task is running, the high-throughput load manager continuously monitors real-time state information of the high-throughput load job task according to the unique job number and stores the information in a temporary resource management instance global database in the resource management system.
6. The multi-scenario-oriented variable depth resource management method of claim 5, wherein: the temporary resource management parent instance starts M temporary resource management child instances, M being the number of high-throughput subtask sets obtained by splitting; each temporary resource management child instance comprises N/M proxy processes corresponding to N/M computing nodes and manages the computing resources of those N/M computing nodes, N being the number of computing nodes allocated to the current high-throughput load; and, when running the corresponding high-throughput task loading script, each temporary resource management child instance schedules the high-throughput tasks onto the computing resources it manages according to a first-in first-out (FIFO) scheduling policy.
7. The multi-scenario-oriented variable depth resource management method according to any one of claims 1 to 6, wherein:
if the load type is determined to be a high-throughput load, starting the temporary resource management parent instance and the plurality of temporary resource management child instances through the resident resource management instance according to the number of subtask sets of the high-throughput load job task, and controlling execution of the high-throughput load job task by the temporary resource management parent instance and each temporary resource management child instance, further comprises the following steps:
when the high-throughput load manager detects that some tasks in the high-throughput load job task have failed, determining whether this is the first failure of each failed task; if so, generating a new task script, starting a new temporary resource management child instance under the temporary resource management parent instance, and submitting the failed task to the new temporary resource management child instance to run; if it is not the first failure, reporting error information;
when the high-throughput load manager detects that all tasks in the high-throughput load job task have run correctly, destroying the temporary resource management parent instance, reclaiming, by the resident resource management instance, the resources allocated to the high-throughput load, and exiting the high-throughput load manager, which completes the run of the high-throughput load.
8. The multi-scenario-oriented variable depth resource management method of claim 1, wherein:
the method further comprises a job task information query, which comprises the following steps:
the resource management system receives a query request and determines the type of the job task from the job number of the job task to be queried in the query request;
if the job task to be queried is determined to be a high-performance load, the resource management system calls, through the unified interface, the user interface of the resident resource management instance located on the login node of the computer to send a job task information query request; the resident resource management instance control node proxy process located on the control node of the computer parses the request and returns to the unified interface the related information of the queried job task held in a local data structure in the resource management system;
if the job task to be queried is determined to be a high-throughput load, the resource management system queries the related information of the job task by job number in the temporary resource management instance global database and returns it to the unified interface.
9. A multi-scenario-oriented variable depth resource management system, the resource management system comprising:
a resource management system interface module, configured to receive a high-performance load or a high-throughput load submitted by a user and to determine the load type;
a high-performance load job task management module, configured to, if the load type is determined to be a high-performance load, allocate resources to the high-performance load job task by calling a resident resource management instance in the resource management system and, after the resource allocation is completed, control execution of the high-performance load job task through the resident resource management instance, wherein the resident resource management instance has a fixed depth;
and a high-throughput load job task management module, configured to, if the load type is determined to be a high-throughput load, start a high-throughput load manager through a unified interface in the resource management system, the high-throughput load manager parsing the high-throughput load job task and splitting it into a plurality of subtask sets; allocate resources through the resident resource management instance; after the resource allocation is completed, start, through the resident resource management instance, a temporary resource management parent instance and a plurality of temporary resource management child instances according to the number of subtask sets of the high-throughput load job task; and control execution of the high-throughput load job task by the temporary resource management parent instance and each temporary resource management child instance, wherein the temporary resource management parent instance, the temporary resource management child instances, and the high-throughput load manager are temporarily deployed and destroyed along with the high-throughput load so as to achieve variable depth.
10. A computer-readable storage medium storing a computer program, characterized by: the computer program, when executed by a processor, implements the method of any one of claims 1 to 8.
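The node partitioning and FIFO scheduling recited in claim 6, together with the first-failure retry recited in claim 7, can be sketched in Python as follows. This is an illustration under stated assumptions, not the claimed implementation: `partition_nodes`, `run_child`, `run_high_throughput`, and the `run_task` callback are hypothetical names, N is assumed to be an exact multiple of M, child instances are simulated one after another rather than concurrently, and all first-time failures are retried together in a single new child instance.

```python
from collections import deque


def partition_nodes(nodes, m):
    """Split the N allocated nodes into M groups of N/M nodes each."""
    if m <= 0 or len(nodes) % m != 0:
        raise ValueError("this sketch assumes N is a positive multiple of M")
    size = len(nodes) // m
    return [nodes[i * size:(i + 1) * size] for i in range(m)]


def run_child(task_set, run_task):
    """Simulate one temporary child instance.

    Tasks are taken in first-in first-out order; run_task(task) is a
    hypothetical callback returning True on success. Returns the tasks
    that failed. The child "exits" when its queue is empty.
    """
    queue = deque(task_set)
    failed = []
    while queue:
        task = queue.popleft()
        if not run_task(task):
            failed.append(task)
    return failed


def run_high_throughput(nodes, task_sets, run_task):
    """Simulate the parent instance managing M child instances."""
    # One node group per subtask set; the groups are only reported here,
    # since the sketch does not model placement of tasks on nodes.
    groups = partition_nodes(nodes, len(task_sets))
    first_failures = []
    for task_set in task_sets:
        first_failures.extend(run_child(task_set, run_task))
    # A task failing for the first time is resubmitted to a new child
    # instance; a second failure is reported instead of retried.
    reported = run_child(first_failures, run_task) if first_failures else []
    return {"groups": groups, "retried": first_failures, "reported": reported}
```

With four nodes and two subtask sets, `partition_nodes` yields two groups of two nodes. A task that fails once and then succeeds appears only under `retried`, while a task that fails twice also appears under `reported`.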
CN202410043057.3A 2024-01-11 2024-01-11 A variable depth resource management method and system for multiple scenarios Pending CN118069349A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410043057.3A CN118069349A (en) 2024-01-11 2024-01-11 A variable depth resource management method and system for multiple scenarios

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410043057.3A CN118069349A (en) 2024-01-11 2024-01-11 A variable depth resource management method and system for multiple scenarios

Publications (1)

Publication Number Publication Date
CN118069349A true CN118069349A (en) 2024-05-24

Family

ID=91104704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410043057.3A Pending CN118069349A (en) 2024-01-11 2024-01-11 A variable depth resource management method and system for multiple scenarios

Country Status (1)

Country Link
CN (1) CN118069349A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118502969A (en) * 2024-07-17 2024-08-16 北京科东电力控制系统有限责任公司 K8S platform-based multi-scene training exercise application building method and system

Similar Documents

Publication Publication Date Title
Gu et al. Liquid: Intelligent resource estimation and network-efficient scheduling for deep learning jobs on distributed GPU clusters
US6732139B1 (en) Method to distribute programs using remote java objects
US8739171B2 (en) High-throughput-computing in a hybrid computing environment
CN103092698B (en) Cloud computing application automatic deployment system and method
Bode et al. The portable batch scheduler and the maui scheduler on linux clusters
US8914805B2 (en) Rescheduling workload in a hybrid computing environment
RU2481618C2 (en) Hierarchical infrastructure of resources backup planning
Wang et al. A three-phases scheduling in a hierarchical cloud computing network
Xu et al. Adaptive task scheduling strategy based on dynamic workload adjustment for heterogeneous Hadoop clusters
US12386658B2 (en) Pull mode and push mode combined resource management and job scheduling method and system, and medium
CN101719852B (en) Method and device for monitoring performance of middleware
Belalem et al. Approaches to improve the resources management in the simulator CloudSim
CN102081554A (en) Cloud computing operating system as well as kernel control system and method thereof
Liu et al. Dynamically negotiating capacity between on-demand and batch clusters
CN112148546A (en) Static safety analysis parallel computing system and method for power system
Bartolini et al. Proactive workload dispatching on the EURORA supercomputer
CN118069349A (en) A variable depth resource management method and system for multiple scenarios
CN113515361B (en) Lightweight heterogeneous computing cluster system facing service
CN114816694A (en) A multi-process collaborative RPA task scheduling method and device
CN113934525A (en) Hadoop cluster task scheduling method based on positive and negative feedback load scheduling algorithm
Cao et al. Performance prediction technology for agent-based resource management in grid environments
CN118819825A (en) Concurrent processing method, concurrent control system, electronic device and storage medium
CN112291320A (en) Distributed two-layer scheduling method and system for quantum computer cluster
CN115237547B (en) A unified container cluster hosting system and method for a non-intrusive HPC computing cluster
Walters et al. Enabling interactive jobs in virtualized data centers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination