Shuai Che

Orcid: 0000-0002-3192-3057

According to our database, Shuai Che authored at least 35 papers between 2008 and 2026.

Collaborative distances:

Dijkstra number of four.
Erdős number of three.

Bibliography

2026

AIMS: Cost-Efficient LLM-Based Agent Deployment in Hybrid Cloud-Edge Environments.

Proceedings of the 21st European Conference on Computer Systems, 2026

2025

HERA: Hybrid Edge-cloud Resource Allocation for Cost-Efficient AI Agents.

CoRR, April, 2025

2023

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.

Zhewei Yao

Reza Yazdani Aminabadi

CoRR, 2023

2022

Towards Execution-Efficient LSTMs via Hardware-Guided Grow-and-Prune Paradigm.

IEEE Trans. Emerg. Top. Comput., 2022

2021

Software-Defined Design Space Exploration for an Efficient DNN Accelerator Architecture.

IEEE Trans. Computers, 2021

2020

Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point.

Proceedings of the Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, 2020

AWB-GCN: A Graph Convolutional Network Accelerator with Runtime Workload Rebalancing.

Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020

2019

Software-Defined Design Space Exploration for an Efficient AI Accelerator Architecture.

CoRR, 2019

Hardware-Guided Symbiotic Training for Compact, Accurate, yet Execution-Efficient LSTM.

CoRR, 2019

Northup: Divide-and-Conquer Programming in Systems with Heterogeneous Memories and Processors.

Shuai Che

Jieming Yin

Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

2017

Programming GPGPU Graph Applications with Linear Algebra Building Blocks.

Shuai Che

Bradford M. Beckmann

Steven K. Reinhardt

Int. J. Parallel Program., 2017

Gravel: fine-grain GPU-initiated network messages.

Proceedings of the International Conference for High Performance Computing, 2017

Auto-Tuning Strategies for Parallelizing Sparse Matrix-Vector (SpMV) Multiplication on Multi- and Many-Core Processors.

Kaixi Hou

Wu-chun Feng

Shuai Che

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Work Stealing in a Shared Virtual-Memory Heterogeneous Environment: A Case Study with Betweenness Centrality.

Shuai Che

Marc S. Orr

Jonathan Gallmeier

Proceedings of the Computing Frontiers Conference, 2017

Accelerating Matrix Processing with GPUs.

Proceedings of the 24th IEEE Symposium on Computer Arithmetic, 2017

2016

Implementing directed acyclic graphs with the heterogeneous system architecture.

Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, 2016

Challenges of Programming a System with Heterogeneous Memories and Heterogeneous Processors: A Programmer's View.

Shuai Che

Arkaprava Basu

Jonathan Gallmeier

Proceedings of the Second International Symposium on Memory Systems, 2016

Software Assisted Hardware Cache Coherence for Heterogeneous Processors.

Proceedings of the Second International Symposium on Memory Systems, 2016

Betweenness Centrality in an HSA-enabled System.

Proceedings of the ACM Workshop on High Performance Graph Processing, 2016

2015

Graph Coloring on the GPU and Some Techniques to Improve Load Imbalance.

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Synchronization Using Remote-Scope Promotion.

Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, 2015

2014

BenchFriend: Correlating the performance of GPU benchmarks.

Shuai Che

Kevin Skadron

Int. J. High Perform. Comput. Appl., 2014

SPEC ACCEL: A Standard Application Suite for Measuring Hardware Accelerator Performance.

Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking, and Simulation, 2014

Dymaxion++: A Directive-Based API to Optimize Data Layout and Memory Mapping for Heterogeneous Systems.

Shuai Che

Jiayuan Meng

Kevin Skadron

Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

BelRed: Constructing GPGPU graph applications with software building blocks.

Shuai Che

Bradford M. Beckmann

Steven K. Reinhardt

Proceedings of the IEEE High Performance Extreme Computing Conference, 2014

GasCL: A vertex-centric graph model for GPUs.

Shuai Che

Proceedings of the IEEE High Performance Extreme Computing Conference, 2014

QuickRelease: A throughput-oriented approach to release consistency on GPUs.

Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture, 2014

2013

Pannotia: Understanding irregular GPGPU graph applications.

Proceedings of the IEEE International Symposium on Workload Characterization, 2013

Load balancing in a changing world: dealing with heterogeneity and performance variability.

Proceedings of the Computing Frontiers Conference, 2013

2011

Dymaxion: optimizing memory access patterns for heterogeneous systems.

Shuai Che

Jeremy W. Sheaffer

Kevin Skadron

Proceedings of the Conference on High Performance Computing Networking, 2011

Using cycle stacks to understand scaling bottlenecks in multi-threaded workloads.

Proceedings of the 2011 IEEE International Symposium on Workload Characterization, 2011

2010

A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads.

Proceedings of the 2010 IEEE International Symposium on Workload Characterization, 2010

2009

Rodinia: A benchmark suite for heterogeneous computing.

Proceedings of the 2009 IEEE International Symposium on Workload Characterization, 2009

2008

A performance study of general-purpose applications on graphics processors using CUDA.

J. Parallel Distributed Comput., 2008

Accelerating Compute-Intensive Applications with GPUs and FPGAs.

Proceedings of the IEEE Symposium on Application Specific Processors, 2008
