Shuai Che

According to our database, Shuai Che authored at least 33 papers between 2008 and 2023.

Bibliography

2023
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
CoRR, 2023

2022
Towards Execution-Efficient LSTMs via Hardware-Guided Grow-and-Prune Paradigm.
IEEE Trans. Emerg. Top. Comput., 2022

2021
Software-Defined Design Space Exploration for an Efficient DNN Accelerator Architecture.
IEEE Trans. Computers, 2021

2020
Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point.
Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

AWB-GCN: A Graph Convolutional Network Accelerator with Runtime Workload Rebalancing.
Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020

2019
Software-Defined Design Space Exploration for an Efficient AI Accelerator Architecture.
CoRR, 2019

Hardware-Guided Symbiotic Training for Compact, Accurate, yet Execution-Efficient LSTM.
CoRR, 2019

Northup: Divide-and-Conquer Programming in Systems with Heterogeneous Memories and Processors.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

2017
Programming GPGPU Graph Applications with Linear Algebra Building Blocks.
Int. J. Parallel Program., 2017

Gravel: fine-grain GPU-initiated network messages.
Proceedings of the International Conference for High Performance Computing, 2017

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Work Stealing in a Shared Virtual-Memory Heterogeneous Environment: A Case Study with Betweenness Centrality.
Proceedings of the Computing Frontiers Conference, 2017

Accelerating Matrix Processing with GPUs.
Proceedings of the 24th IEEE Symposium on Computer Arithmetic, 2017

2016
Implementing directed acyclic graphs with the heterogeneous system architecture.
Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, 2016

Challenges of Programming a System with Heterogeneous Memories and Heterogeneous Processors: A Programmer's View.
Proceedings of the Second International Symposium on Memory Systems, 2016

Software Assisted Hardware Cache Coherence for Heterogeneous Processors.
Proceedings of the Second International Symposium on Memory Systems, 2016

Betweenness Centrality in an HSA-enabled System.
Proceedings of the ACM Workshop on High Performance Graph Processing, 2016

2015
Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Synchronization Using Remote-Scope Promotion.
Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015

2014
BenchFriend: Correlating the performance of GPU benchmarks.
Int. J. High Perform. Comput. Appl., 2014

SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 2014

Dymaxion++: A Directive-Based API to Optimize Data Layout and Memory Mapping for Heterogeneous Systems.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

BelRed: Constructing GPGPU graph applications with software building blocks.
Proceedings of the IEEE High Performance Extreme Computing Conference, 2014

GasCL: A vertex-centric graph model for GPUs.
Proceedings of the IEEE High Performance Extreme Computing Conference, 2014

QuickRelease: A throughput-oriented approach to release consistency on GPUs.
Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture, 2014

2013
Pannotia: Understanding irregular GPGPU graph applications.
Proceedings of the IEEE International Symposium on Workload Characterization, 2013

Load balancing in a changing world: dealing with heterogeneity and performance variability.
Proceedings of the Computing Frontiers Conference, 2013

2011
Dymaxion: optimizing memory access patterns for heterogeneous systems.
Proceedings of the Conference on High Performance Computing Networking, 2011

Using cycle stacks to understand scaling bottlenecks in multi-threaded workloads.
Proceedings of the 2011 IEEE International Symposium on Workload Characterization, 2011

2010
A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads.
Proceedings of the 2010 IEEE International Symposium on Workload Characterization, 2010

2009
Rodinia: A benchmark suite for heterogeneous computing.
Proceedings of the 2009 IEEE International Symposium on Workload Characterization, 2009

2008
A performance study of general-purpose applications on graphics processors using CUDA.
J. Parallel Distributed Comput., 2008

Accelerating Compute-Intensive Applications with GPUs and FPGAs.
Proceedings of the IEEE Symposium on Application Specific Processors, 2008

