Zhao et al., 2019 - Google Patents
Dynamic stale synchronous parallel distributed training for deep learningZhao et al., 2019
View PDF- Document ID
- 9568723824413666372
- Author
- Zhao X
- An A
- Liu J
- Chen B
- Publication year
- Publication venue
- 2019 IEEE 39th International Conference on Distributed Computing Systems (ICDCS)
External Links
Snippet
Deep learning is a popular machine learning technique and has been applied to many real- world problems, ranging from computer vision to natural language processing. However, training a deep neural network is very time-consuming, especially on big data. It has …
- 230000001360 synchronised 0 title abstract description 15
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/505—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the load
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Programme initiating; Programme switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5083—Techniques for rebalancing the load in a distributed system
- G06F9/5088—Techniques for rebalancing the load in a distributed system involving task migration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5005—Allocation of resources, e.g. of the central processing unit [CPU] to service a request
- G06F9/5027—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals
- G06F9/5038—Allocation of resources, e.g. of the central processing unit [CPU] to service a request the resource being a machine, e.g. CPUs, Servers, Terminals considering the execution order of a plurality of tasks, e.g. taking priority or time dependency constraints into consideration
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
- G06F9/5066—Algorithms for mapping a plurality of inter-dependent sub-tasks onto a plurality of physical CPUs
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30386—Retrieval requests
- G06F17/30424—Query processing
- G06F17/30533—Other types of queries
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
- G06F17/30575—Replication, distribution or synchronisation of data between databases or within a distributed database; Distributed database system architectures therefor
- G06F17/30584—Details of data partitioning, e.g. horizontal or vertical partitioning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformations of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
- G06F8/456—Parallelism detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computer systems based on biological models
- G06N3/02—Computer systems based on biological models using neural network models
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| Zhao et al. | Dynamic stale synchronous parallel distributed training for deep learning | |
| Wang et al. | Distributed machine learning with a serverless architecture | |
| Park et al. | {HetPipe}: Enabling large {DNN} training on (whimpy) heterogeneous {GPU} clusters through integration of pipelined model parallelism and data parallelism | |
| US11120368B2 (en) | Scalable and efficient distributed auto-tuning of machine learning and deep learning models | |
| Chen et al. | Efficient and robust parallel dnn training through model parallelism on multi-gpu platform | |
| CN108268638B (en) | Distributed implementation method for generating countermeasure network based on Spark framework | |
| CN105956021B (en) | A kind of automation task suitable for distributed machines study parallel method and its system | |
| US12314851B2 (en) | Microservice-based training systems in heterogeneous graphic processor unit (GPU) cluster and operating method thereof | |
| US20200219028A1 (en) | Systems, methods, and media for distributing database queries across a metered virtual network | |
| Eliad et al. | Fine-tuning giant neural networks on commodity hardware with automatic pipeline model parallelism | |
| Li et al. | Amp: Automatically finding model parallel strategies with heterogeneity awareness | |
| Zhou et al. | Disttgl: Distributed memory-based temporal graph neural network training | |
| EP3234659A1 (en) | Scalable scheduling of parallel iterative seismic jobs | |
| Kim et al. | Deepspark: A spark-based distributed deep learning framework for commodity clusters | |
| CN112035234A (en) | Distributed batch job distribution method and device | |
| Kim et al. | Scale-train: A scalable dnn training framework for a heterogeneous gpu cloud | |
| Zhang et al. | Ftsgd: An adaptive stochastic gradient descent algorithm for spark mllib | |
| Herrera et al. | On a hybrid MPI-Pthread approach for simplicial branch-and-bound | |
| Gu et al. | Parallelizing machine learning optimization algorithms on distributed data-parallel platforms with parameter server | |
| Fekry et al. | Towards seamless configuration tuning of big data analytics | |
| Li et al. | Optimizing machine learning on apache spark in HPC environments | |
| Yoon et al. | MiCRO: Near-Zero Cost Gradient Sparsification for Scaling and Accelerating Distributed DNN Training | |
| CN119829296B (en) | Load balancing method and system of integrative super fusion server of deposit and calculation | |
| CN111813525A (en) | A Workflow Scheduling Method for Heterogeneous Systems | |
| Xu et al. | Efficient supernet training using path parallelism |