Zheng et al., 2022 - Google Patents
AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures
- Document ID
- 13439155535787267700
- Author
- Zheng Z
- Yang X
- Zhao P
- Long G
- Zhu K
- Zhu F
- Zhao W
- Liu X
- Yang J
- Zhai J
- Song S
- Lin W
- Publication year
- 2022
- Publication venue
- Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems
Snippet
This work reveals that memory-intensive computation is a rising performance-critical factor in recent machine learning models. Due to a unique set of new challenges, existing ML optimizing compilers cannot perform efficient fusion under complex two-level dependencies …
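The snippet refers to fusing memory-intensive (memory-bound) operators so that intermediate results do not round-trip through GPU global memory. The CUDA sketch below is purely illustrative and is not taken from the paper or any patent text: it contrasts an unfused add-then-ReLU pair of kernels with a single fused kernel that keeps the intermediate in a register. All kernel and variable names are invented for the example.

```cuda
// Illustrative sketch (not the paper's method): why fusing two memory-bound
// elementwise ops into one kernel saves global-memory traffic.
#include <cuda_runtime.h>
#include <cstdio>

// Unfused path: y = relu(a + b) takes two kernel launches and one full
// write plus one full read of the intermediate tensor t in global memory.
__global__ void add_kernel(const float* a, const float* b, float* t, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) t[i] = a[i] + b[i];
}
__global__ void relu_kernel(const float* t, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = t[i] > 0.f ? t[i] : 0.f;
}

// Fused path: the intermediate stays in a register, so the tensor t is
// never materialized in global memory and only one kernel is launched.
__global__ void fused_add_relu_kernel(const float* a, const float* b,
                                      float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float t = a[i] + b[i];      // intermediate kept in a register
        y[i] = t > 0.f ? t : 0.f;
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *t, *y;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&t, n * sizeof(float));
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = float(i % 7) - 3.f; b[i] = 1.f; }

    dim3 block(256), grid((n + 255) / 256);
    add_kernel<<<grid, block>>>(a, b, t, n);              // unfused: 2 launches
    relu_kernel<<<grid, block>>>(t, y, n);
    fused_add_relu_kernel<<<grid, block>>>(a, b, y, n);   // fused: 1 launch
    cudaDeviceSynchronize();
    printf("y[3] = %f\n", y[3]);

    cudaFree(a); cudaFree(b); cudaFree(t); cudaFree(y);
    return 0;
}
```

For a simple elementwise chain like this, the saving is one kernel launch and one round trip of the intermediate through global memory; per the abstract, the paper's focus is performing this kind of fusion automatically under more complex two-level dependency patterns, where existing ML compilers fall short.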
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformations of program code
- G06F8/41—Compilation
- G06F8/44—Encoding
- G06F8/443—Optimisation
- G06F8/4441—Reducing the execution time required by the program code
- G06F8/4442—Reducing the number of cache misses; Data prefetching
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/48—Programme initiating; Programme switching, e.g. by interrupt
- G06F9/4806—Task transfer initiation or dispatching
- G06F9/4843—Task transfer initiation or dispatching by program, e.g. task dispatcher, supervisor, operating system
- G06F9/4881—Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/40—Transformations of program code
- G06F8/41—Compilation
- G06F8/45—Exploiting coarse grain parallelism in compilation, i.e. parallelism between groups of instructions
- G06F8/456—Parallelism detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/46—Multiprogramming arrangements
- G06F9/50—Allocation of resources, e.g. of the central processing unit [CPU]
- G06F9/5061—Partitioning or combining of resources
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/44—Arrangements for executing specific programmes
- G06F9/455—Emulation; Software simulation, i.e. virtualisation or emulation of application or operating system execution engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for programme control, e.g. control unit
- G06F9/06—Arrangements for programme control, e.g. control unit using stored programme, i.e. using internal store of processing equipment to receive and retain programme
- G06F9/30—Arrangements for executing machine-instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline, look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units
- G06F9/3889—Concurrent instruction execution, e.g. pipeline, look ahead using a plurality of independent parallel functional units controlled by multiple instructions, e.g. MIMD, decoupled access or execute
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/30—Information retrieval; Database structures therefor; File system structures therefor
- G06F17/30286—Information retrieval; Database structures therefor; File system structures therefor in structured data stores
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/36—Preventing errors by testing or debugging software
- G06F11/3668—Software testing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/30—Creation or generation of source code
- G06F8/31—Programming languages or programming paradigms
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/70—Software maintenance or management
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F8/00—Arrangements for software engineering
- G06F8/60—Software deployment
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06F—ELECTRICAL DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/50—Computer-aided design
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N99/00—Subject matter not provided for in other groups of this subclass
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING; COUNTING
- G06N—COMPUTER SYSTEMS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computer systems utilising knowledge based models
- G06N5/02—Knowledge representation
- G06N5/022—Knowledge engineering, knowledge acquisition
Similar Documents
| Publication | Title |
|---|---|
| Zheng et al. | AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures |
| Gao et al. | Estimating GPU memory consumption of deep learning models |
| Xin et al. | Accelerating human-in-the-loop machine learning: Challenges and opportunities |
| Jia et al. | TASO: optimizing deep learning computation with automatic generation of graph substitutions |
| Dai et al. | Reveal training performance mystery between TensorFlow and PyTorch in the single GPU environment |
| Pouchet et al. | Loop transformations: convexity, pruning and optimization |
| Rasch et al. | ATF: A generic directive-based auto-tuning framework |
| Zheng et al. | FusionStitching: boosting memory intensive computations for deep learning workloads |
| Cao et al. | MoE-Lightning: High-throughput MoE inference on memory-constrained GPUs |
| Rausch et al. | A data-centric optimization framework for machine learning |
| Liou et al. | GEVO: GPU code optimization using evolutionary computation |
| Totoni et al. | HPAT: high performance analytics with scripting ease-of-use |
| Huang et al. | Mind the gap: Attainable data movement and operational intensity bounds for tensor algorithms |
| Chen et al. | Slapo: A schedule language for progressive optimization of large deep learning model training |
| Hu et al. | Optimal kernel orchestration for tensor programs with Korch |
| Fang et al. | Aristotle: A performance impact indicator for the OpenCL kernels using local memory |
| Yu et al. | Lorien: Efficient deep learning workloads delivery |
| Mastoras et al. | Design and implementation for nonblocking execution in GraphBLAS: Tradeoffs and performance |
| CN118747072B (en) | Compilation method, electronic device and storage medium |
| Ivanov et al. | STen: Productive and efficient sparsity in PyTorch |
| Kim et al. | STRADS-AP: Simplifying Distributed Machine Learning Programming without Introducing a New Programming Model |
| Kang et al. | PreScaler: An efficient system-aware precision scaling framework on heterogeneous systems |
| Dorozhinskii et al. | Fused GEMMs towards an efficient GPU implementation of the ADER-DG method in SeisSol |
| US11573777B2 (en) | Method and apparatus for enabling autonomous acceleration of dataflow AI applications |
| Alemany et al. | Jespipe: A plugin-based, open MPI framework for adversarial machine learning analysis |