Shengen Yan

ORCID: 0009-0005-3858-7972

According to our database, Shengen Yan authored at least 42 papers between 2012 and 2024.

Bibliography

2024
Proteus: Simulating the Performance of Distributed DNN Training.
IEEE Trans. Parallel Distributed Syst., October 2024

CSKV: Training-Efficient Channel Shrinking for KV Cache in Long-Context Scenarios.
CoRR, 2024

Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs.
CoRR, 2024

MoA: Mixture of Sparse Attention for Automatic Large Language Model Compression.
CoRR, 2024

DiTFastAttn: Attention Compression for Diffusion Transformer Models.
CoRR, 2024

ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation.
CoRR, 2024

MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization.
CoRR, 2024

HetHub: A Heterogeneous distributed hybrid training system for large-scale models.
CoRR, 2024

A Survey on Efficient Inference for Large Language Models.
CoRR, 2024

Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better.
CoRR, 2024

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K.
CoRR, 2024

Evaluating Quantized Large Language Models.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

A Holistic Functionalization Approach to Optimizing Imperative Tensor Programs in Deep Learning.
Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024

2023
Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023

2022
NeoFlow: A Flexible Framework for Enabling Efficient Compilation for High Performance DNN Training.
IEEE Trans. Parallel Distributed Syst., 2022

Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters.
IEEE Trans. Parallel Distributed Syst., 2022

DIESEL+: Accelerating Distributed Deep Learning Tasks on Image Datasets.
IEEE Trans. Parallel Distributed Syst., 2022

GradientFlow: Optimizing Network Performance for Large-Scale Distributed DNN Training.
IEEE Trans. Big Data, 2022

A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs.
CoRR, 2022

AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction.
Proceedings of the 49th Annual International Symposium on Computer Architecture (ISCA '22), 2022

LongTail-Bench: A Benchmark Suite for Domain-Specific Operators in Deep Learning.
Proceedings of the IEEE International Symposium on Workload Characterization, 2022

EasyView: Enabling and Scheduling Tensor Views in Deep Learning Compilers.
Proceedings of the 51st International Conference on Parallel Processing, 2022

2021
Characterization and prediction of deep learning workloads in large-scale GPU datacenters.
Proceedings of the International Conference for High Performance Computing, 2021

2020
Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels.
IEEE Trans. Computers, 2020

DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training.
Proceedings of the 49th International Conference on Parallel Processing (ICPP), 2020

Accelerating Deep Learning Tasks with Optimized GPU-assisted Image Decoding.
Proceedings of the 26th IEEE International Conference on Parallel and Distributed Systems, 2020

Elan: Towards Generic and Efficient Elastic Training for Deep Learning.
Proceedings of the 40th IEEE International Conference on Distributed Computing Systems, 2020

2019
Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform (面向GPU计算平台的归约算法的性能优化研究).
计算机科学 (Computer Science), 2019

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes.
CoRR, 2019

A coordinated tiling and batching framework for efficient GEMM on GPUs.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

2017
Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach.
Proceedings of the 2017 IEEE International Conference on Smart Computing, 2017

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs.
Proceedings of the 54th Annual Design Automation Conference, 2017

2016
A Cross-Platform SpMV Framework on Many-Core Architectures.
ACM Trans. Archit. Code Optim., 2016

Timed Dataflow: Reducing Communication Overhead for Distributed Machine Learning Systems.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

2014
yaSpMV: yet another SpMV framework on GPUs.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs.
Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software, 2014

A fast integral image generation algorithm on GPUs.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

2013
StreamScan: fast scan algorithms for GPUs without global barrier synchronization.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

CLSIFT: An Optimization Study of the Scale Invariance Feature Transform on GPUs.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

2012
An Insightful Program Performance Tuning Chain for GPU Computing.
Proceedings of Algorithms and Architectures for Parallel Processing (ICA3PP), 2012

GPURoofline: A Model for Guiding Performance Optimizations on GPUs.
Proceedings of the 18th International Euro-Par Conference on Parallel Processing, 2012

