Shengen Yan

According to our database, Shengen Yan authored at least 32 papers between 2012 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Evaluating Quantized Large Language Models.
CoRR, 2024

LV-Eval: A Balanced Long-Context Benchmark with 5 Length Levels Up to 256K.
CoRR, 2024

2023
Proteus: Simulating the Performance of Distributed DNN Training.
CoRR, 2023

Chimera: An Analytical Optimizing Framework for Effective Compute-intensive Operators Fusion.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023

2022
NeoFlow: A Flexible Framework for Enabling Efficient Compilation for High Performance DNN Training.
IEEE Trans. Parallel Distributed Syst., 2022

Astraea: A Fair Deep Learning Scheduler for Multi-Tenant GPU Clusters.
IEEE Trans. Parallel Distributed Syst., 2022

DIESEL+: Accelerating Distributed Deep Learning Tasks on Image Datasets.
IEEE Trans. Parallel Distributed Syst., 2022

GradientFlow: Optimizing Network Performance for Large-Scale Distributed DNN Training.
IEEE Trans. Big Data, 2022

A Simulation Platform for Multi-tenant Machine Learning Services on Thousands of GPUs.
CoRR, 2022

AMOS: enabling automatic mapping for tensor computations on spatial accelerators with hardware abstraction.
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022

LongTail-Bench: A Benchmark Suite for Domain-Specific Operators in Deep Learning.
Proceedings of the IEEE International Symposium on Workload Characterization, 2022

EasyView: Enabling and Scheduling Tensor Views in Deep Learning Compilers.
Proceedings of the 51st International Conference on Parallel Processing, 2022

2021
Characterization and prediction of deep learning workloads in large-scale GPU datacenters.
Proceedings of the International Conference for High Performance Computing, 2021

2020
Evaluating Fast Algorithms for Convolutional Neural Networks on FPGAs.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

Enabling Efficient Fast Convolution Algorithms on GPUs via MegaKernels.
IEEE Trans. Computers, 2020

DIESEL: A Dataset-Based Distributed Storage and Caching System for Large-Scale Deep Learning Training.
Proceedings of the ICPP 2020: 49th International Conference on Parallel Processing, 2020

Accelerating Deep Learning Tasks with Optimized GPU-assisted Image Decoding.
Proceedings of the 26th IEEE International Conference on Parallel and Distributed Systems, 2020

Elan: Towards Generic and Efficient Elastic Training for Deep Learning.
Proceedings of the 40th IEEE International Conference on Distributed Computing Systems, 2020

2019
Study on Performance Optimization of Reduction Algorithm Targeting GPU Computing Platform (in Chinese).
Computer Science (计算机科学), 2019

Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes.
CoRR, 2019

A coordinated tiling and batching framework for efficient GEMM on GPUs.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

2017
Towards Distributed Machine Learning in Shared Clusters: A Dynamically-Partitioned Approach.
Proceedings of the 2017 IEEE International Conference on Smart Computing, 2017

Exploring Heterogeneous Algorithms for Accelerating Deep Convolutional Neural Networks on FPGAs.
Proceedings of the 54th Annual Design Automation Conference, 2017

2016
A Cross-Platform SpMV Framework on Many-Core Architectures.
ACM Trans. Archit. Code Optim., 2016

Timed Dataflow: Reducing Communication Overhead for Distributed Machine Learning Systems.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

2014
yaSpMV: yet another SpMV framework on GPUs.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Understanding the tradeoffs between software-managed vs. hardware-managed caches in GPUs.
Proceedings of the 2014 IEEE International Symposium on Performance Analysis of Systems and Software, 2014

A fast integral image generation algorithm on GPUs.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

2013
StreamScan: fast scan algorithms for GPUs without global barrier synchronization.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

CLSIFT: An Optimization Study of the Scale Invariance Feature Transform on GPUs.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

2012
An Insightful Program Performance Tuning Chain for GPU Computing.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2012

GPURoofline: A Model for Guiding Performance Optimizations on GPUs.
Proceedings of the Euro-Par 2012 Parallel Processing - 18th International Conference, 2012
