Hyesoon Kim

Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2026

Inside VOLT: Designing an Open-Source GPU Compiler (Tool).

Proceedings of the 35th ACM SIGPLAN International Conference on Compiler Construction, 2026

2025

Inside VOLT: Designing an Open-Source GPU Compiler.

CoRR, November, 2025

RV-CURE: A RISC-V Capability Architecture for Full Memory Safety.

IEEE Trans. Computers, October, 2025

Aero: Adaptive Query Processing of ML Queries.

Proc. ACM Manag. Data, June, 2025

Hardware vs. Software Implementation of Warp-Level Features in Vortex RISC-V GPU.

CoRR, May, 2025

Contention-Aware GPU Thread Block Scheduler for Efficient GPU-SSD.

IEEE Comput. Archit. Lett., 2025

FlexInfer: Flexible LLM Inference with CPU Computations.

Christopher J. Hughes

Proceedings of the Eighth Conference on Machine Learning and Systems, 2025

Swift and Trustworthy Large-Scale GPU Simulation with Fine-Grained Error Modeling and Hierarchical Clustering.

Proceedings of the 58th IEEE/ACM International Symposium on Microarchitecture, 2025

Analysis of the RISC-V Vector Extension for Vulkan Graphics Kernels.

Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2025

Let-Me-In: (Still) Employing In-pointer Bounds Metadata for Fine-grained GPU Memory Safety.

Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2025

SparseWeaver: Converting Sparse Operations as Dense Operations on GPUs for Graph Workloads.

Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2025

SoftCUDA: Running CUDA on Softcore GPU.

Proceedings of the 33rd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2025

Buffer Management for Out-of-GPU LLM Execution.

Jiashen Cao

Joy Arulraj

Proceedings of the Workshop on Data Management for End-to-End Machine Learning, 2025

Multiway Merge Partitioning for Sparse-Sparse Matrix Multiplication on GPUs.

Proceedings of the 34th International Conference on Parallel Architectures and Compilation Techniques, 2025

2024

CuPBoP: Making CUDA a Portable Language.

ACM Trans. Design Autom. Electr. Syst., 2024

Quantifying CO₂ Emission Reduction Through Spatial Partitioning in Deep Learning Recommendation System Workloads.

Andrei Bersatti

Euna Kim

IEEE Micro, 2024

Towards Performance-Aware Allocation for Accelerated Machine Learning on GPU-SSD Systems.

Ayush Gundawar

Euijun Chung

CoRR, 2024

Hydro: Adaptive Query Processing of ML Queries.

CoRR, 2024

Unleashing CPU Potential for Executing GPU Programs Through Compiler/Runtime Optimizations.

Jisheng Zhao

Proceedings of the 57th IEEE/ACM International Symposium on Microarchitecture, 2024

Barre Chord: Efficient Virtual Memory Translation for Multi-Chip-Module GPUs.

Proceedings of the 51st ACM/IEEE Annual International Symposium on Computer Architecture, 2024

Comparative Analysis of Executing GPU Applications on FPGA: HLS vs. Soft GPU Approaches.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Understanding Performance Implications of LLM Inference on CPUs.

Proceedings of the IEEE International Symposium on Workload Characterization, 2024

Towards "True" GPU Performance Scaling for OpenGPU.

Blaise Tine

Proceedings of the 36th IEEE Hot Chips Symposium, 2024

Enabling Fine-Grained Incremental Builds by Making Compiler Stateful.

Jisheng Zhao

Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2024

Exponentially Expanding the Phase-Ordering Search Space via Dormant Information.

Sri Ranganathan Palaniappan

Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 2024

2023

GPU Database Systems Characterization and Optimization.

Proc. VLDB Endow., November, 2023

Revisiting Query Performance in GPU Database Systems.

CoRR, 2023

Hardware-Assisted Code-Pointer Tagging for Forward-Edge Control-Flow Integrity.

IEEE Comput. Archit. Lett., 2023

Mitigating Timing-Based NoC Side-Channel Attacks With LLC Remapping.

IEEE Comput. Archit. Lett., 2023

Unified Co-Simulation Framework for Autonomous UAVs.

Proceedings of the Practice and Experience in Advanced Research Computing, 2023

CuPBoP-AMD: Extending CUDA to AMD Platforms.

Jun Chen

Xule Zhou

Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

CuPBoP: A Framework to Make CUDA Portable.

Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023

Extending the Life of Old Systems with More Memory.

Euna Kim

Andrei Bersatti

Proceedings of the International Symposium on Memory Systems, 2023

EHT-SR: An Entropy-Based Hybrid Approach for Faster Super-Resolution.

Abhilash Dharmavarapu

Stefano Petrangeli

Jiashen Cao

Proceedings of the IEEE International Symposium on Multimedia, 2023

Traversing Large Compressed Graphs on GPUs.

Prasun Gera

Abhimanyu Rajeshkumar Bambhaniya

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

VEGETA: Vertically-Integrated Extensions for Sparse/Dense GEMM Tile Acceleration on CPUs.

Geonhwa Jeong

Sana Damani

Eric Qin

Christopher J. Hughes

Sreenivas Subramoney

Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2023

Spica: Exploring FPGA Optimizations to Enable an Efficient SpMV Implementation for Computations at Edge.

Dheeraj Ramchandani

Proceedings of the IEEE International Conference on Edge Computing and Communications, 2023

Context-Aware Task Handling in Resource-Constrained Robots with Virtualization.

Proceedings of the IEEE International Conference on Edge Computing and Communications, 2023

Reducing Inference Latency with Concurrent Architectures for Image Recognition at Edge.

Proceedings of the IEEE International Conference on Edge Computing and Communications, 2023

Creating Robust Deep Neural Networks with Coded Distributed Computing for IoT.

Proceedings of the IEEE International Conference on Edge Computing and Communications, 2023

Skybox: Open-Source Graphic Rendering on Programmable RISC-V GPUs.

Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

2022

COX: Exposing CUDA Warp-level Functions to CPUs.

ACM Trans. Archit. Code Optim., 2022

CuPBoP: CUDA for Parallelized and Broad-range Processors.

CoRR, 2022

FiGO: Fine-Grained Query Optimization in Video Analytics.

Proceedings of the SIGMOD '22: International Conference on Management of Data, Philadelphia, PA, USA, June 12, 2022

Securing GPU via region-based bounds checking.

Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022

Accelerating Graphic Rendering on Programmable RISC-V GPUs.

Tejaswini Anand Kumar

Jeffrey Young

Proceedings of the 2022 IEEE Hot Chips 34 Symposium, 2022

Maia: Matrix Inversion Acceleration Near Memory.

Proceedings of the 32nd International Conference on Field-Programmable Logic and Applications, 2022

2021

Efficiently Solving Partial Differential Equations in a Partially Reconfigurable Specialized Hardware.

Krishna Praveen Yalamarthy

IEEE Trans. Computers, 2021

COX: CUDA on X86 by Exposing Warp-Level Functions to CPUs.

CoRR, 2021

Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics Research.

Blaise Tine

Fares Elsabbagh

CoRR, 2021

Supporting CUDA for an extended RISC-V GPU architecture.

CoRR, 2021

Creating Robust Deep Neural Networks With Coded Distributed Computing for IoT Systems.

Jiashen Cao

Krishna Praveen Yalamarthy

CoRR, 2021

THIA: Accelerating Video Analytics using Early Inference and Fine-Grained Query Planning.

CoRR, 2021

SmaQ: Smart Quantization for DNN Training by Exploiting Value Clustering.

IEEE Comput. Archit. Lett., 2021

Vortex: Extending the RISC-V ISA for GPGPU and 3D-Graphics.

Blaise Tine

Fares Elsabbagh

Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads.

Proceedings of the IEEE International Symposium on Workload Characterization, 2021

FAFNIR: Accelerating Sparse Gathering by Using Efficient Near-Memory Intelligent Reduction.

Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2021

RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU.

Geonhwa Jeong

Eric Qin

Ananda Samajdar

Christopher J. Hughes

Sreenivas Subramoney

Proceedings of the 58th ACM/IEEE Design Automation Conference, 2021

Quantifying the design-space tradeoffs in autonomous drones.

Proceedings of the ASPLOS '21: 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2021

2020

Traversing Large Graphs on GPUs with Unified Memory.

Proc. VLDB Endow., 2020

The 2019 Top Picks in Computer Architecture.

IEEE Micro, 2020

Toward Collaborative Inferencing of Deep Neural Networks on Internet-of-Things Devices.

IEEE Internet Things J., 2020

Copernicus: Characterizing the Performance Implications of Compression Formats Used in Sparse Workloads.

CoRR, 2020

Secure Location-Aware Authentication and Communication for Intelligent Transportation Systems.

Pooya Shoghi Ghalehshahi

CoRR, 2020

Reducing Inference Latency with Concurrent Architectures for Image Recognition.

CoRR, 2020

Edge-Tailored Perception: Fast Inferencing in-the-Edge with Efficient Model Distribution.

CoRR, 2020

Vortex: OpenCL Compatible RISC-V GPGPU.

CoRR, 2020

Hardware-based Always-On Heap Memory Safety.

Yonghae Kim

Proceedings of the 53rd Annual IEEE/ACM International Symposium on Microarchitecture, 2020

Parallel Hash Table Design for NDP Systems.

Pranith Kumar

Proceedings of the MEMSYS 2020: The International Symposium on Memory Systems, 2020

Things to Consider to Enable Dynamic Graphs in Processing-in-Memory.

Euna Kim

Proceedings of the MEMSYS 2020: The International Symposium on Memory Systems, 2020

Neural Network Weight Compression with NNW-BDI.

Andrei Bersatti

Nima Shoghi

Proceedings of the MEMSYS 2020: The International Symposium on Memory Systems, 2020

Understanding the Software and Hardware Stacks of a General-Purpose Cognitive Drone.

Sam Jijina

Adriana Amyette

Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2020

MEISSA: Multiplying Matrices Efficiently in a Scalable Systolic Architecture.

Proceedings of the 38th IEEE International Conference on Computer Design, 2020

ALRESCHA: A Lightweight Reconfigurable Sparse-Computation Accelerator.

Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

RISC-V FPGA Platform Toward ROS-Based Robotics Application.

Proceedings of the 30th International Conference on Field-Programmable Logic and Applications, 2020

Productive Hardware Designs using Hybrid HLS-RTL Development.

Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020

Cash: A Single-Source Hardware-Software Codesign Framework for Rapid Prototyping.

Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020

Proposing a Fast and Scalable Systolic Array for Matrix Multiplication.

Proceedings of the 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2020

Tango: An Optimizing Compiler for Just-In-Time RTL Simulation.

Blaise-Pascal Tine

Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition, 2020

ASCELLA: Accelerating Sparse Computation by Enabling Stream Accesses to Memory.

Proceedings of the 2020 Design, Automation & Test in Europe Conference & Exhibition, 2020

PISCES: Power-Aware Implementation of SLAM by Customizing Efficient Sparse Algebra.

Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

Batch-Aware Unified Memory Management in GPUs for Irregular Workloads.

Proceedings of the ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020

2019

ERIDANUS: Efficiently Running Inference of DNNs Using Systolic Arrays.

IEEE Micro, 2019

Thermal-aware processing-in-memory instruction offloading.

J. Parallel Distributed Comput., 2019

A Case Study: Exploiting Neural Machine Translation to Translate CUDA to OpenCL.

Yonghae Kim

CoRR, 2019

Collaborative Execution of Deep Neural Networks on Internet of Things Devices.

CoRR, 2019

Characterizing the Execution of Deep Neural Networks on Collaborative Robots and Edge Devices.

Proceedings of the Practice and Experience in Advanced Research Computing on Rise of the Machines (learning), 2019

Empirical Investigation of Stale Value Tolerance on Parallel RNN Training.

Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2019

Characterizing the Deployment of Deep Neural Networks on Commercial Edge Devices.

Proceedings of the IEEE International Symposium on Workload Characterization, 2019

Capella: Customizing Perception for Edge Devices by Efficiently Allocating FPGAs to DNNs.

Proceedings of the 29th International Conference on Field Programmable Logic and Applications, 2019

FlashGPU: Placing New Flash Next to GPU Cores.

Proceedings of the 56th Annual Design Automation Conference 2019, 2019

Robustly Executing DNNs in IoT Systems Using Coded Distributed Computing.

Proceedings of the 56th Annual Design Automation Conference 2019, 2019

LODESTAR: Creating Locally-Dense CNNs for Efficient Inference on Systolic Arrays.

Proceedings of the 56th Annual Design Automation Conference 2019, 2019

Video analytics from edge to server: work-in-progress.

Proceedings of the International Conference on Hardware/Software Codesign and System Synthesis Companion, 2019

Translating CUDA to OpenCL for Hardware Generation using Neural Machine Translation.

Yonghae Kim

Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2019

POSTER: Tango: An Optimizing Compiler for Just-In-Time RTL Simulation.

Blaise-Pascal Tine

Jeffrey S. Vetter

Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019

2018

StaleLearn: Learning Acceleration with Asynchronous Synchronization Between Model Replicas on PIM.

IEEE Trans. Computers, 2018

CODA: Enabling Co-location of Computation and Data for Multiple GPU Systems.

ACM Trans. Archit. Code Optim., 2018

Distributed Perception by Collaborative Robots.

IEEE Robotics Autom. Lett., 2018

Musical Chair: Efficient Real-Time Recognition Using Collaborative IoT Devices.

CoRR, 2018

Performance Implications of NoCs on 3D-Stacked Memories: Insights from the Hybrid Memory Cube.

Jeffrey S. Young

Burhan Ahmad Mudassar

Kartikay Garg

Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2018

Performance Characterisation and Simulation of Intel's Integrated GPU Architecture.

Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2018

CoolPIM: Thermal-Aware Source Throttling for Efficient PIM Instruction Offloading.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Real-Time Image Recognition Using Collaborative IoT Devices.

Proceedings of the 1st on Reproducible Quality-Efficient Systems Tournament on Co-designing Pareto-efficient Deep Learning, 2018

2017

CAIRO: A Compiler-Assisted Technique for Enabling Instruction-Level Offloading of Processing-In-Memory.

ACM Trans. Archit. Code Optim., 2017

Exploring big graph computing - An empirical study from architectural perspective.

J. Parallel Distributed Comput., 2017

Louvre: Light-weight Ordering Using Versioning for Release Consistency.

CoRR, 2017

CODA: Enabling Co-location of Computation and Data for Near-Data Processing.

CoRR, 2017

Inferring Fine-grained Control Flow Inside SGX Enclaves with Branch Shadowing.

Proceedings of the 26th USENIX Security Symposium, 2017

Lightweight SIMT core designs for intelligent 3D stacked DRAM.

Chad D. Kersey

Proceedings of the International Symposium on Memory Systems, 2017

SimProf: A Sampling Framework for Data Analytic Workloads.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Demystifying the characteristics of 3D-stacked memories: A case study for Hybrid Memory Cube.

Burhan Ahmad Mudassar

Saibal Mukhopadhyay

Proceedings of the 2017 IEEE International Symposium on Workload Characterization, 2017

GraphPIM: Enabling Instruction-Level PIM Offloading in Graph Computing Frameworks.

Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture, 2017

2016

On the Internet of Things.

Vijay Janapa Reddi

IEEE Micro, 2016

Analyzing Consistency Issues in HMC Atomics.

Pranith Kumar

Lifeng Nai

Proceedings of the Second International Symposium on Memory Systems, 2016

2015

GREEN Cache: Exploiting the Disciplined Memory Model of OpenCL on GPUs.

IEEE Trans. Computers, 2015

Block-Precise Processors: Low-Power Processors with Reduced Operand Store Accesses and Result Broadcasts.

IEEE Trans. Computers, 2015

OpenCL Performance Evaluation on Modern Multicore CPUs.

Sci. Program., 2015

SP-CNN: A Scalable and Programmable CNN-Based Accelerator.

Dilan Manatunga

Saibal Mukhopadhyay

IEEE Micro, 2015

Accelerating Application Start-up with Nonvolatile Memory in Android Systems.

IEEE Micro, 2015

Hardware Support for Safe Execution of Native Client Applications.

Dilan Manatunga

IEEE Comput. Archit. Lett., 2015

GraphBIG: understanding graph computing in the context of industrial solutions.

Proceedings of the International Conference for High Performance Computing, 2015

Instruction Offloading with HMC 2.0 Standard: A Case Study for Graph Traversals.

Lifeng Nai

Proceedings of the 2015 International Symposium on Memory Systems, 2015

Understanding Energy Aspects of Processing-near-Memory for HPC Workloads.

Hyojong Kim

Arun F. Rodrigues

Proceedings of the 2015 International Symposium on Memory Systems, 2015

SIMT-based Logic Layers for Stacked DRAM Architectures: A Prototype.

Chad D. Kersey

Proceedings of the 2015 International Symposium on Memory Systems, 2015

BSSync: Processing Near Memory for Machine Learning Workloads with Bounded Staleness Consistency Models.

Jaewoong Sim

Proceedings of the 2015 International Conference on Parallel Architectures and Compilation, 2015

2014

Power Modeling for GPU Architectures Using McPAT.

Jieun Lim

William J. Song

Wonyong Sung

ACM Trans. Design Autom. Electr. Syst., 2014

Design Space Exploration of Memory Model for Heterogeneous Computing.

Jieun Lim

Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing, 2014

Transparent Hardware Management of Stacked DRAM as Part of Memory.

Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

GPUMech: GPU Performance Modeling Technique Based on Interval Analysis.

Proceedings of the 47th Annual IEEE/ACM International Symposium on Microarchitecture, 2014

TBPoint: Reducing Simulation Time for Large-Scale GPGPU Kernels.

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Spare register aware prefetching for graph algorithms on GPUs.

Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture, 2014

Harmonica: An FPGA-Based Data Parallel Soft Core.

Chad D. Kersey

Hyojong Kim

Nimit Nigania

Proceedings of the 22nd IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2014

2013

Adaptive virtual channel partitioning for network-on-chip in heterogeneous architectures.

Si Li

ACM Trans. Design Autom. Electr. Syst., 2013

SD3: An Efficient Dynamic Data-Dependence Profiling Mechanism.

Minjang Kim

Chi-Keung Luk

IEEE Trans. Computers, 2013

Design space exploration of on-chip ring interconnection for a CPU-GPU heterogeneous architecture.

Si Li

J. Parallel Distributed Comput., 2013

SESH Framework: A Space Exploration Framework for GPU Application and Hardware Codesign.

Jiayuan Meng

Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, 2013

OpenCL Performance Evaluation on Modern Multi Core CPUs.

Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

CHiP: A Profiler to Measure the Effect of Cache Contention on Scalability.

Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

2012

Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU).

Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, ISBN: 978-3-031-01737-7, 2012

When Prefetching Works, When It Doesn't, and Why.

Richard W. Vuduc

ACM Trans. Archit. Code Optim., 2012

DRAM Scheduling Policy for GPGPU Architectures Based on a Potential Function.

Jinwoo Shin

IEEE Comput. Archit. Lett., 2012

A performance analysis framework for identifying potential benefits in GPGPU applications.

Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

Supporting virtual memory in GPGPU without supporting precise exceptions.

Proceedings of the 2012 ACM SIGPLAN workshop on Memory Systems Performance and Correctness: held in conjunction with PLDI '12, 2012

A Mostly-Clean DRAM Cache for Effective Hit Speculation and Self-Balancing Dispatch.

Proceedings of the 45th Annual IEEE/ACM International Symposium on Microarchitecture, 2012

FLEXclusion: Balancing cache capacity and on-chip bandwidth via Flexible Exclusion.

Proceedings of the 39th International Symposium on Computer Architecture (ISCA 2012), 2012

Predicting Potential Speedup of Serial Code via Lightweight Profiling and Emulations with Memory Performance Model.

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

TAP: A TLP-aware cache management policy for a CPU-GPU heterogeneous architecture.

Proceedings of the 18th IEEE International Symposium on High Performance Computer Architecture, 2012

2010

Many-Thread Aware Prefetching Mechanisms for GPGPU Applications.

Richard W. Vuduc

Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

SD3: A Scalable Approach to Dynamic Data-Dependence Profiling.

Minjang Kim

Chi-Keung Luk

Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

An integrated GPU power and performance model.

Sunpyo Hong

Proceedings of the 37th International Symposium on Computer Architecture (ISCA 2010), 2010

Design space exploration of the turbo decoding algorithm on GPUs.

Dongwon Lee

Marilyn Wolf

Proceedings of the 2010 International Conference on Compilers, 2010

2009

Virtual Program Counter (VPC) Prediction: Very Low Cost Indirect Branch Prediction Using Conditional Branch Prediction Hardware.

IEEE Trans. Computers, 2009

Age based scheduling for asymmetric multiprocessors.

Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping.

Chi-Keung Luk

Sunpyo Hong

Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42 2009), 2009

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness.

Sunpyo Hong

Proceedings of the 36th International Symposium on Computer Architecture (ISCA 2009), 2009

2008

Dynamic Predication of Indirect Jumps.

IEEE Comput. Archit. Lett., 2008

Understanding performance, power and energy behavior in asymmetric multiprocessors.

Proceedings of the 26th International Conference on Computer Design, 2008

Performance-aware speculation control using wrong path usefulness prediction.

Proceedings of the 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 2008

Improving the performance of object-oriented languages with dynamic predication of indirect jumps.

Proceedings of the 13th International Conference on Architectural Support for Programming Languages and Operating Systems, 2008

2007

Diverge-Merge Processor: Generalized and Energy-Efficient Dynamic Predication.

IEEE Micro, 2007

VPC prediction: reducing the cost of indirect branches via hardware-based dynamic devirtualization.

Proceedings of the 34th International Symposium on Computer Architecture (ISCA 2007), 2007

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers.

Proceedings of the 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), 2007

Profile-assisted Compiler Support for Dynamic Predication in Diverge-Merge Processors.

Proceedings of the Fifth International Symposium on Code Generation and Optimization (CGO 2007), 2007

2006

Address-Value Delta (AVD) Prediction: A Hardware Technique for Efficiently Parallelizing Dependent Cache Misses.

IEEE Trans. Computers, 2006

Efficient Runahead Execution: Power-Efficient Memory Latency Tolerance.

IEEE Micro, 2006

Wish Branches: Enabling Adaptive and Aggressive Predicated Execution.

IEEE Micro, 2006

Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths.

Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-39 2006), 2006

2D-Profiling: Detecting Input-Dependent Branches with a Single Input Data Set.

Proceedings of the Fourth IEEE/ACM International Symposium on Code Generation and Optimization (CGO 2006), 2006

2005

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors.

IEEE Trans. Computers, 2005

Using the First-Level Caches as Filters to Reduce the Pollution Caused by Speculative Memory References.

Int. J. Parallel Program., 2005

On Reusing the Results of Pre-Executed Instructions in a Runahead Execution Processor.

IEEE Comput. Archit. Lett., 2005

Address-Value Delta (AVD) Prediction: Increasing the Effectiveness of Runahead Execution by Exploiting Regular Memory Allocation Patterns.

Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38 2005), 2005

Wish Branches: Combining Conditional Branching and Predication for Adaptive Predicated Execution.

Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-38 2005), 2005

Techniques for Efficient Processing in Runahead Execution Engines.
