Zhao et al., 2019 - Google Patents

Dynamic stale synchronous parallel distributed training for deep learning

Document ID: 9568723824413666372
Author: Zhao X; An A; Liu J; Chen B
Publication year: 2019
Publication venue: 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS)

Snippet

Deep learning is a popular machine learning technique and has been applied to many real-world problems, ranging from computer vision to natural language processing. However, training a deep neural network is very time-consuming, especially on big data. It has …

Classifications

- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Programme initiating; Programme switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/5088—Techniques for rebalancing the load in a distributed system involving task migration
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30386—Retrieval requests
- G06F17/30424—Query processing
- G06F17/30533—Other types of queries
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30575—Replication, distribution or synchronisation of data between databases or within a distributed database; Distributed database system architectures therefor
- G06F17/30584—Details of data partitioning, e.g. horizontal or vertical partitioning
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformations of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
- G06F8/456—Parallelism detection
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computer systems based on biological models
- G06N3/02—Computer systems based on biological models using neural network models
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring

Similar Documents

Publication	Publication Date	Title
Zhao et al.	2019	Dynamic stale synchronous parallel distributed training for deep learning
Wang et al.	2019	Distributed machine learning with a serverless architecture
Park et al.	2020	HetPipe: Enabling large DNN training on (whimpy) heterogeneous GPU clusters through integration of pipelined model parallelism and data parallelism
US11120368B2 (en)	2021-09-14	Scalable and efficient distributed auto-tuning of machine learning and deep learning models
Chen et al.	2018	Efficient and robust parallel dnn training through model parallelism on multi-gpu platform
CN108268638B (en)	2020-07-17	Distributed implementation method for generating countermeasure network based on Spark framework
CN105956021B (en)	2019-05-21	A kind of automation task suitable for distributed machines study parallel method and its system
US12314851B2 (en)	2025-05-27	Microservice-based training systems in heterogeneous graphic processor unit (GPU) cluster and operating method thereof
US20200219028A1 (en)	2020-07-09	Systems, methods, and media for distributing database queries across a metered virtual network
Eliad et al.	2021	Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism
Li et al.	2022	Amp: Automatically finding model parallel strategies with heterogeneity awareness
Zhou et al.	2023	Disttgl: Distributed memory-based temporal graph neural network training
EP3234659A1 (en)	2017-10-25	Scalable scheduling of parallel iterative seismic jobs
Kim et al.	2016	Deepspark: A spark-based distributed deep learning framework for commodity clusters
CN112035234A (en)	2020-12-04	Distributed batch job distribution method and device
Kim et al.	2022	Scale-train: A scalable dnn training framework for a heterogeneous gpu cloud
Zhang et al.	2018	Ftsgd: An adaptive stochastic gradient descent algorithm for spark mllib
Herrera et al.	2013	On a hybrid MPI-Pthread approach for simplicial branch-and-bound
Gu et al.	2018	Parallelizing machine learning optimization algorithms on distributed data-parallel platforms with parameter server
Fekry et al.	2019	Towards seamless configuration tuning of big data analytics
Li et al.	2018	Optimizing machine learning on apache spark in HPC environments
Yoon et al.	2023	MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training
CN119829296B (en)	2025-06-03	Load balancing method and system of integrative super fusion server of deposit and calculation
CN111813525A (en)	2020-10-23	A Workflow Scheduling Method for Heterogeneous Systems
Xu et al.	2023	Efficient supernet training using path parallelism