Olatunji Ruwase

Orcid: 0000-0002-5508-0728

According to our database¹, Olatunji Ruwase authored at least 48 papers between 2004 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of three.

Bibliography

2026

MoE-Prefill: Zero Redundancy Overheads in MoE Prefill Serving.

CoRR, May, 2026

Cross-Layer Energy Analysis of Multimodal Training on Grace Hopper Superchips.

CoRR, May, 2026

AutoSP: Unlocking Long-Context LLM Training Via Compiler-Based Sequence Parallelism.

CoRR, April, 2026

SuperOffload: Unleashing the Power of Large-Scale LLM Training on Superchips.

Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2026

2025

DeepCompile: A Compiler-Driven Approach to Optimizing Distributed Deep Learning Training.

CoRR, April, 2025

Universal Checkpointing: A Flexible and Efficient Distributed Checkpointing System for Large-Scale DNN Training with Reconfigurable Parallelism.

Proceedings of the 2025 USENIX Annual Technical Conference, 2025

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer.

Proceedings of the Eighth Conference on Machine Learning and Systems, 2025

2024

Mojito: Motion Trajectory and Intensity Control for Video Generation.

CoRR, 2024

Domino: Eliminating Communication in LLM Training via Generic Tensor Slicing and Overlapping.

CoRR, 2024

Training Ultra Long Context Language Model with Fully Pipelined Distributed Transformer.

CoRR, 2024

Universal Checkpointing: Efficient and Flexible Checkpointing for Large Scale Distributed Training.

CoRR, 2024

FastPersist: Accelerating Model Checkpointing in Deep Learning.

CoRR, 2024

FP6-LLM: Efficiently Serving Large Language Models Through FP6-Centric Algorithm-System Co-Design.

CoRR, 2024

Quant-LLM: Accelerating the Serving of Large Language Models via FP6-Centric Algorithm-System Co-Design on Modern GPUs.

Proceedings of the 2024 USENIX Annual Technical Conference, 2024

RecFlex: Enabling Feature Heterogeneity-Aware Optimization for Deep Recommendation Models with Flexible Schedules.

Proceedings of the International Conference for High Performance Computing, 2024

Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding.

Advances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, 2024

ZeRO++: Extremely Efficient Collective Communication for Large Model Training.

Proceedings of the Twelfth International Conference on Learning Representations, 2024

2023

SHARP: An Adaptable, Energy-Efficient Accelerator for Recurrent Neural Networks.

Reza Yazdani Aminabadi

ACM Trans. Embed. Comput. Syst., March, 2023

ZeroQuant(4+2): Redefining LLMs Quantization with a New FP6-Centric Strategy for Diverse Generative Tasks.

Reza Yazdani Aminabadi

CoRR, 2023

DeepSpeed-VisualChat: Multi-Round Multi-Image Interleave Chat via Multi-Modal Causal Attention.

CoRR, 2023

DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales.

Zhewei Yao

Reza Yazdani Aminabadi

CoRR, 2023

ZeRO++: Extremely Efficient Collective Communication for Giant Model Training.

CoRR, 2023

A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training.

CoRR, 2023

A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training.

Proceedings of the 37th International Conference on Supercomputing, 2023

2022

DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale.

Reza Yazdani Aminabadi

Proceedings of the SC22: International Conference for High Performance Computing, 2022

2021

ZeRO-Offload: Democratizing Billion-Scale Model Training.

Jie Ren

Samyam Rajbhandari

Reza Yazdani Aminabadi

Proceedings of the 2021 USENIX Annual Technical Conference, 2021

ZeRO-infinity: breaking the GPU memory wall for extreme scale deep learning.

Proceedings of the International Conference for High Performance Computing, 2021

SimiGrad: Fine-Grained Adaptive Batching for Large Scale Training using Gradient Similarity Measurement.

Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, 2021

2020

ZeRO: memory optimizations toward training trillion parameter models.

Proceedings of the International Conference for High Performance Computing, 2020

DeepSpeed: System Optimizations Enable Training Deep Learning Models with Over 100 Billion Parameters.

Proceedings of the KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020

2019

LSTM-Sharp: An Adaptable, Energy-Efficient Hardware Accelerator for Long Short-Term Memory.

CoRR, 2019

ZeRO: Memory Optimization Towards Training A Trillion Parameter Models.

CoRR, 2019

Accelerating Large Scale Deep Learning Inference through DeepCPU at Microsoft.

Proceedings of the 2019 USENIX Conference on Operational Machine Learning, 2019

2018

Efficient Deep Neural Network Serving: Fast and Furious.

IEEE Trans. Netw. Serv. Manag., 2018

2017

HyperDrive: exploring hyperparameters with POP scheduling.

Proceedings of the 18th ACM/IFIP/USENIX Middleware Conference, Las Vegas, NV, USA, December 11, 2017

Optimizing CNNs on Multicores for Scalability, Performance and Goodput.

Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

2016

SERF: efficient scheduling for fast deep neural network serving via judicious parallelism.

Proceedings of the International Conference for High Performance Computing, 2016

2015

Performance Modeling and Scalability Optimization of Distributed Deep Learning Systems.

Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2015

Page overlays: an enhanced virtual memory framework to enable fine-grained memory management.

Proceedings of the 42nd Annual International Symposium on Computer Architecture, 2015

Toward accelerating deep learning at scale using specialized hardware in the datacenter.

Proceedings of the 2015 IEEE Hot Chips 27 Symposium (HCS), 2015

2014

Guardrail: a high fidelity approach to protecting hardware devices from buggy drivers.

Proceedings of the Architectural Support for Programming Languages and Operating Systems, 2014

2013

Improving Device Driver Reliability through Decoupled Dynamic Binary Analyses.

Olatunji Ruwase

PhD thesis, 2013

2010

Decoupled lifeguards: enabling path optimizations for dynamic correctness checking tools.

Proceedings of the 2010 ACM SIGPLAN Conference on Programming Language Design and Implementation, 2010

2009

Flexible Hardware Acceleration for Instruction-Grain Lifeguards.

IEEE Micro, 2009

2008

Parallelizing dynamic information flow tracking.

SPAA 2008: Proceedings of the 20th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2008

Ditto: a system for opportunistic caching in multi-hop wireless networks.

Proceedings of the 14th Annual International Conference on Mobile Computing and Networking, 2008

Flexible Hardware Acceleration for Instruction-Grain Program Monitoring.

Proceedings of the 35th International Symposium on Computer Architecture (ISCA 2008), 2008

2004

A Practical Dynamic Buffer Overflow Detector.

Olatunji Ruwase

Monica S. Lam

Proceedings of the Network and Distributed System Security Symposium, 2004
