Shuaiwen Song

Orcid: 0000-0002-8402-1436

Affiliations:
  • University of Sydney, Australia


According to our database, Shuaiwen Song authored at least 117 papers between 2009 and 2024.

Bibliography

2024
TeGraph+: Scalable Temporal Graph Processing Enabling Flexible Edge Modifications.
IEEE Trans. Parallel Distributed Syst., August, 2024

TEA+: A Novel Temporal Graph Random Walk Engine with Hybrid Storage Architecture.
ACM Trans. Archit. Code Optim., June, 2024

Efficient Radius Search for Adaptive Foveal Sizing Mechanism in Collaborative Foveated Rendering Framework.
IEEE Trans. Mob. Comput., May, 2024

MalFox: Camouflaged Adversarial Malware Example Generation Based on Conv-GANs Against Black-Box Detectors.
IEEE Trans. Computers, April, 2024

CorDA: Context-Oriented Decomposition Adaptation of Large Language Models.
CoRR, 2024

Dragonfly: Multi-Resolution Zoom Supercharges Large Visual-Language Model.
CoRR, 2024

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design.
CoRR, 2024

Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs.
Proceedings of the 2024 USENIX Annual Technical Conference, 2024

System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.
Proceedings of the 43rd ACM Symposium on Principles of Distributed Computing, 2024

MonoNN: Enabling a New Monolithic Optimization Space for Neural Network Inference Tasks on Modern GPU-Centric Architectures.
Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation, 2024

System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

2023
Enabling High-Efficient ReRAM-Based CNN Training Via Exploiting Crossbar-Level Insignificant Writing Elimination.
IEEE Trans. Computers, November, 2023

Data Fusion in Infrastructure-Augmented Autonomous Driving System: Why? Where? and How?
IEEE Internet Things J., September, 2023

Flash-LLM: Enabling Low-Cost and Highly-Efficient Large Generative Model Inference With Unstructured Sparsity.
Proc. VLDB Endow., 2023

DeepSpeed4Science Initiative: Enabling Large-Scale Scientific Discovery through Sophisticated AI System Technologies.
CoRR, 2023

DeepSpeed Ulysses: System Optimizations for Enabling Training of Extreme Long Sequence Transformer Models.
CoRR, 2023

Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity.
CoRR, 2023

RenAIssance: A Survey into AI Text-to-Image Generation in the Era of Large Model.
CoRR, 2023

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.
CoRR, 2023

Chiplet Cloud: Building AI Supercomputers for Serving Large Generative Language Models.
CoRR, 2023

Mitigating Coupling Map Constrained Correlated Measurement Errors on Quantum Devices.
Proceedings of the International Conference for High Performance Computing, 2023

NAS-SE: Designing A Highly-Efficient In-Situ Neural Architecture Search Engine for Large-Scale Deployment.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023

HEAT: A Highly Efficient and Affordable Training System for Collaborative Filtering Based Recommendation on CPUs.
Proceedings of the 37th International Conference on Supercomputing, 2023

Post0-VR: Enabling Universal Realistic Rendering for Modern VR via Exploiting Architectural Similarity and Data Sharing.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023

TEA: A General-Purpose Temporal Graph Random Walk Engine.
Proceedings of the Eighteenth European Conference on Computer Systems, 2023

G-Sparse: Compiler-Driven Acceleration for Generalized Sparse Computation for Graph Neural Networks on Modern GPUs.
Proceedings of the 32nd International Conference on Parallel Architectures and Compilation Techniques, 2023

2022
Detecting Performance Variance for Parallel Applications Without Source Code.
IEEE Trans. Parallel Distributed Syst., 2022

DynamAP: Architectural Support for Dynamic Graph Traversal on the Automata Processor.
ACM Trans. Archit. Code Optim., 2022

MSREP: A Fast yet Light Sparse Matrix Framework for Multi-GPU Systems.
CoRR, 2022

Towards Efficient Architecture and Algorithms for Sensor Fusion.
CoRR, 2022

Brief Industry Paper: The Necessity of Adaptive Data Fusion in Infrastructure-Augmented Autonomous Driving System.
Proceedings of the 28th IEEE Real-Time and Embedded Technology and Applications Symposium, 2022

Vapro: performance variance detection and diagnosis for production-run parallel applications.
Proceedings of the PPoPP '22: 27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, Seoul, Republic of Korea, April 2, 2022

Randomness in Neural Network Training: Characterizing the Impact of Tooling.
Proceedings of the Fifth Conference on Machine Learning and Systems, 2022

Bring orders into uncertainty: enabling efficient uncertain graph processing via novel path sampling on multi-accelerator systems.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

TeGraph: A Novel General-Purpose Temporal Graph Computing Engine.
Proceedings of the 38th IEEE International Conference on Data Engineering, 2022

AStitch: enabling a new multi-dimensional optimization space for memory-intensive ML training and inference on modern SIMT architectures.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022

T-GCN: A Sampling Based Streaming Graph Neural Network System with Hybrid Architecture.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2022

2021
Enabling Highly Efficient Capsule Networks Processing Through Software-Hardware Co-Design.
IEEE Trans. Computers, 2021

COMET: A Novel Memory-Efficient Deep Learning Training Framework by Using Error-Bounded Lossy Compression.
Proc. VLDB Endow., 2021

TSM2X: High-performance tall-and-skinny matrix-matrix multiplication on GPUs.
J. Parallel Distributed Comput., 2021

Toward efficient interactions between Python and native libraries.
Proceedings of the ESEC/FSE '21: 29th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2021

MAPA: multi-accelerator pattern allocation policy for multi-tenant GPU servers.
Proceedings of the International Conference for High Performance Computing, 2021

Dr. Top-k: delegate-centric Top-k on GPUs.
Proceedings of the International Conference for High Performance Computing, 2021

An efficient uncertain graph processing framework for heterogeneous architectures.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

A novel memory-efficient deep learning training framework via error-bounded lossy compression.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Shift-BNN: Highly-Efficient Probabilistic Bayesian Neural Network Training via Memory-Friendly Pattern Retrieving.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

η-LSTM: Co-Designing Highly-Efficient Large LSTM Training via Exploiting Memory-Saving and Architectural Design Opportunities.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

ClickTrain: efficient and accurate end-to-end deep learning training via fine-grained architecture-preserving pruning.
Proceedings of the ICS '21: 2021 International Conference on Supercomputing, 2021

Fast and Scalable Sparse Triangular Solver for Multi-GPU Based HPC Architectures.
Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

Q-VR: system-level design for future mobile collaborative virtual reality.
Proceedings of the ASPLOS '21: 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021

2020
Evaluating Modern GPU Interconnect: PCIe, NVLink, NV-SLI, NVSwitch and GPUDirect.
IEEE Trans. Parallel Distributed Syst., 2020

Energy-Efficient GPU L2 Cache Design Using Instruction-Level Data Locality Similarity.
ACM Trans. Design Autom. Electr. Syst., 2020

An Efficient End-to-End Deep Learning Training Framework via Fine-Grained Pattern-Based Pruning.
CoRR, 2020

MalFox: Camouflaged Adversarial Malware Example Generation Based on C-GANs Against Black-Box Detectors.
CoRR, 2020

ISM2: Optimizing Irregular-Shaped Matrix-Matrix Multiplication on GPUs.
CoRR, 2020

Enabling Highly Efficient Capsule Networks Processing Through A PIM-Based Architecture Design.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

2019
Speeding up Collective Communications Through Inter-GPU Re-Routing.
IEEE Comput. Archit. Lett., 2019

BSTC: a novel binarized-soft-tensor-core design for accelerating bit-based approximated neural nets.
Proceedings of the International Conference for High Performance Computing, 2019

OO-VR: NUMA friendly object-oriented VR rendering framework for future NUMA-based multi-GPU systems.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

PIM-VR: Erasing Motion Anomalies In Highly-Interactive Virtual Reality World with Customized Memory Cube.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

LoSCache: Leveraging Locality Similarity to Build Energy-Efficient GPU L2 Cache.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2019

LP-BNN: Ultra-low-Latency BNN Inference with Layer Parallelism.
Proceedings of the 30th IEEE International Conference on Application-specific Systems, 2019

2018
NUMA-Caffe: NUMA-Aware Deep Learning Neural Networks.
ACM Trans. Archit. Code Optim., 2018

Superneurons: dynamic GPU memory management for training deep neural networks.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Introduction to HPPAC 2018.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Tartan: Evaluating Modern GPU Interconnect via a Multi-GPU Benchmark Suite.
Proceedings of the 2018 IEEE International Symposium on Workload Characterization, 2018

Warp-Consolidation: A Novel Execution Model for GPUs.
Proceedings of the 32nd International Conference on Supercomputing, 2018

Perception-Oriented 3D Rendering Approximation for Modern Graphics Processors.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

CUDAAdvisor: LLVM-based runtime profiling for modern GPUs.
Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018

Lightweight detection of cache conflicts.
Proceedings of the 2018 International Symposium on Code Generation and Optimization, 2018

2017
EvoGraph: On-the-Fly Efficient Mining of Evolving Graphs on GPU.
Proceedings of the High Performance Computing - 32nd International Conference, 2017

Evaluating GPGPU Memory Performance Through the C-AMAT Model.
Proceedings of the Workshop on Memory Centric Programming for HPC, 2017

Exploring and analyzing the real impact of modern on-package memory on HPC scientific kernels.
Proceedings of the International Conference for High Performance Computing, 2017

BVF: enabling significant on-chip power savings via bit-value-favor for throughput processors.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

HPPAC Workshop Introduction.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

IPDRM Workshop Introduction.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Enabling scalability-sensitive speculative parallelization for FSM computations.
Proceedings of the International Conference on Supercomputing, 2017

Processing-in-Memory Enabled Graphics Processors for 3D Rendering.
Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture, 2017

Locality-Aware CTA Clustering for Modern GPUs.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

2016
Scalable Energy Efficiency with Resilience for High Performance Computing Systems: A Quantitative Methodology.
ACM Trans. Archit. Code Optim., 2016

A Graph-based Model for GPU Caching Problems.
CoRR, 2016

Orion: A Framework for GPU Occupancy Tuning.
Proceedings of the 17th International Middleware Conference, Trento, Italy, December 12, 2016

IPDRM Introduction and Committees.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

HPPAC Introduction and Committees.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

X: A Comprehensive Analytic Model for Parallel Machines.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

SFU-Driven Transparent Approximation Acceleration on GPUs.
Proceedings of the 2016 International Conference on Supercomputing, 2016

Tag-Split Cache for Efficient GPGPU Cache Utilization.
Proceedings of the 2016 International Conference on Supercomputing, 2016

New-Sum: A Novel Online ABFT Scheme For General Iterative Methods.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

SMT-Aware Instantaneous Footprint Optimization.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

Critical points based register-concurrency autotuning for GPUs.
Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition, 2016

Combating the Reliability Challenge of GPU Register File at Low Supply Voltage.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

2015
Scaling Support Vector Machines on modern HPC platforms.
J. Parallel Distributed Comput., 2015

GraphReduce: processing large-scale graphs on accelerator-based systems.
Proceedings of the International Conference for High Performance Computing, 2015

Investigating the Interplay between Energy Efficiency and Resilience in High Performance Computing.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

GraphReduce: Large-Scale Graph Analytics on Accelerator-Based HPC Systems.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Locality-Driven Dynamic GPU Cache Bypassing.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Gregarious Data Re-structuring in a Many Core Architecture.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

2014
Extending PowerPack for Profiling and Analysis of High-Performance Accelerator-Based Systems.
Parallel Process. Lett., 2014

Evaluating multi-core and many-core architectures through accelerating the three-dimensional Lax-Wendroff correction stencil.
Int. J. High Perform. Comput. Appl., 2014

MIC-SVM: Designing a Highly Efficient Support Vector Machine for Advanced Modern Multi-core and Many-Core Architectures.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

The Power-Performance Tradeoffs of the Intel Xeon Phi on HPC Applications.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

An adaptive cross-architecture combination method for graph traversal.
Proceedings of the 2014 International Conference on Supercomputing, 2014

ACDT: Architected Composite Data Types trading-in unfettered data access for improved execution.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

2013
Designing energy efficient communication runtime systems: a view from PGAS models.
J. Supercomput., 2013

Unified performance and power modeling of scientific workloads.
Proceedings of the 1st International Workshop on Energy Efficient Supercomputing, 2013

A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures.
Proceedings of the 27th IEEE International Symposium on Parallel and Distributed Processing, 2013

EDR: An energy-aware runtime load distribution system for data-intensive applications in the cloud.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

2012
Abstract: Three Steps to Model Power-Performance Efficiency for Emergent GPU-Based Parallel Systems.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Poster: Three Steps to Model Power-Performance Efficiency for Emergent GPU-Based Parallel Systems.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Energy-Aware Replica Selection for Data-Intensive Services in Cloud.
Proceedings of the 20th IEEE International Symposium on Modeling, 2012

System-level power-performance efficiency modeling for emergent GPU architectures.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011
Iso-Energy-Efficiency: An Approach to Power-Constrained Parallel Computation.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

An ISO-Energy-Efficient Approach to Scalable System Power-Performance Optimization.
Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER), 2011

2010
PowerPack: Energy Profiling and Analysis of High-Performance Systems and Applications.
IEEE Trans. Parallel Distributed Syst., 2010

Fault-tolerant communication runtime support for data-centric programming models.
Proceedings of the 2010 International Conference on High Performance Computing, 2010

Designing Energy Efficient Communication Runtime Systems for Data Centric Programming Models.
Proceedings of the 2010 IEEE/ACM Int'l Conference on Green Computing and Communications, 2010

2009
Energy Profiling and Analysis of the HPC Challenge Benchmarks.
Int. J. High Perform. Comput. Appl., 2009

