Xuehai Qian

According to our database, Xuehai Qian authored at least 104 papers between 2007 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Fine-Grained Embedding Dimension Optimization During Training for Recommender Systems.
CoRR, 2024

Guser: A GPGPU Power Stressmark Generator.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2024

2023
DyNNamic: Dynamically Reshaping, High Data-Reuse Accelerator for Compact DNNs.
IEEE Trans. Computers, March, 2023

RDMA-Enabled Concurrency Control Protocols for Transactions in the Cloud Era.
IEEE Trans. Cloud Comput., 2023

RobustState: Boosting Fidelity of Quantum State Preparation via Noise-Aware Variational Training.
CoRR, 2023

GNNPipe: Accelerating Distributed Full-Graph GNN Training with Pipelined Model Parallelism.
CoRR, 2023

Hybrid Gate-Pulse Model for Variational Quantum Algorithms.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023

Achieving Sub-second Pairwise Query over Evolving Graphs.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

Khuzdul: Efficient and Scalable Distributed Graph Pattern Mining Engine.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

DecoMine: A Compilation-Based Graph Pattern Mining System with Pattern Decomposition.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

2022
Leveraging Code Snippets to Detect Variations in the Performance of HPC Systems.
IEEE Trans. Parallel Distributed Syst., 2022

Non-Structured DNN Weight Pruning - Is It Beneficial in Any Platform?
IEEE Trans. Neural Networks Learn. Syst., 2022

GRIM: A General, Real-Time Deep Learning Inference Framework for Mobile Devices Based on Fine-Grained Structured Weight Sparsity.
IEEE Trans. Pattern Anal. Mach. Intell., 2022

SOCA-DOM: A Mobile System-on-Chip Array System for Analyzing Big Data on the Move.
J. Comput. Sci. Technol., 2022

QuEst: Graph Transformer for Quantum Circuit Reliability Estimation.
CoRR, 2022

TopGen: Topology-Aware Bottom-Up Generator for Variational Quantum Circuits.
CoRR, 2022

PAN: Pulse Ansatz on NISQ Machines.
CoRR, 2022

Variational Quantum Pulse Learning.
Proceedings of the IEEE International Conference on Quantum Computing and Engineering, 2022

HyBP: Hybrid Isolation-Randomization Secure Branch Predictor.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2022

SparseCore: stream ISA and processor specialization for sparse computation.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022

2021
A Fast Lock for Explicit Message Passing Architectures.
IEEE Trans. Computers, 2021

3-D Partitioning for Large-Scale Graph Processing.
IEEE Trans. Computers, 2021

Kudu: An Efficient and Scalable Distributed Graph Pattern Mining Engine.
CoRR, 2021

Graph processing and machine learning architectures with emerging memory technologies: a survey.
Sci. China Inf. Sci., 2021

ESCALATE: Boosting the Efficiency of Sparse CNN Accelerator with Kernel Decomposition.
Proceedings of the MICRO '21: 54th Annual IEEE/ACM International Symposium on Microarchitecture, 2021

FORMS: Fine-grained Polarized ReRAM-based In-situ Computation for Mixed-signal DNN Accelerator.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

HASCO: Towards Agile HArdware and Software CO-design for Tensor Computation.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

GoSPA: An Energy-efficient High-performance Globally Optimized SParse Convolutional Neural Network Accelerator.
Proceedings of the 48th ACM/IEEE Annual International Symposium on Computer Architecture, 2021

Mix and Match: A Novel FPGA-Centric Deep Neural Network Quantization Framework.
Proceedings of the IEEE International Symposium on High-Performance Computer Architecture, 2021

A Lightweight Isolation Mechanism for Secure Branch Predictors.
Proceedings of the 58th ACM/IEEE Design Automation Conference, 2021

2020
iCELIA: A Full-Stack Framework for STT-MRAM-Based Deep Learning Acceleration.
IEEE Trans. Parallel Distributed Syst., 2020

Efficient Performance Estimation and Work-Group Size Pruning for OpenCL Kernels on GPUs.
IEEE Trans. Parallel Distributed Syst., 2020

Guest Editors' Introduction to the Special Issue on Machine Learning Architectures and Accelerators.
IEEE Trans. Computers, 2020

IntersectX: An Accelerator for Graph Mining.
CoRR, 2020

Low-Cost Floating-Point Processing in ReRAM for Scientific Computing.
CoRR, 2020

DwarvesGraph: A High-Performance Graph Mining System with Pattern Decomposition.
CoRR, 2020

ReversiSpec: Reversible Coherence Protocol for Defending Transient Attacks.
CoRR, 2020

A Comprehensive Evaluation of RDMA-enabled Concurrency Control Protocols.
CoRR, 2020

SympleGraph: distributed graph processing with precise loop-carried dependency guarantee.
Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2020

AccQOC: Accelerating Quantum Optimal Control Based Pulse Generation.
Proceedings of the 47th ACM/IEEE Annual International Symposium on Computer Architecture, 2020

AccPar: Tensor Partitioning for Heterogeneous Deep Learning Accelerators.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2020

TUPIM: A Transparent and Universal Processing-in-Memory Architecture for Unmodified Binaries.
Proceedings of the GLSVLSI '20: Great Lakes Symposium on VLSI 2020, 2020

DNNGuard: An Elastic Heterogeneous DNN Accelerator Architecture against Adversarial Attacks.
Proceedings of the ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020

Capuchin: Tensor-based GPU Memory Management for Deep Learning.
Proceedings of the ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020

PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning.
Proceedings of the ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020

AsymNVM: An Efficient Framework for Implementing Persistent Data Structures on Asymmetric NVM Architecture.
Proceedings of the ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020

Prague: High-Performance Heterogeneity-Aware Asynchronous Decentralized Training.
Proceedings of the ASPLOS '20: Architectural Support for Programming Languages and Operating Systems, 2020

2019
Clip: A Disk I/O Focused Parallel Out-of-Core Graph Processing System.
IEEE Trans. Parallel Distributed Syst., 2019

Distributed Graph Processing System and Processing-in-memory Architecture with Precise Loop-carried Dependency Guarantee.
ACM Trans. Comput. Syst., 2019

HEIF: Highly Efficient Stochastic Computing-Based Inference Framework for Deep Neural Networks.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2019

Heterogeneity-Aware Asynchronous Decentralized Training.
CoRR, 2019

A Stochastic-Computing based Deep Learning Framework using Adiabatic Quantum-Flux-Parametron Superconducting Technology.
CoRR, 2019

Non-structured DNN Weight Pruning Considered Harmful.
CoRR, 2019

ReBNN: in-situ acceleration of binarized neural networks in ReRAM using complementary resistive cell.
CCF Trans. High Perform. Comput., 2019

PIMSim: A Flexible and Detailed Processing-in-Memory Simulator.
IEEE Comput. Archit. Lett., 2019

GraphQ: Scalable PIM-Based Graph Processing.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

TPShare: a time-space sharing scheduling abstraction for shared cloud via vertical labels.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

TIE: energy-efficient tensor train-based inference engine for deep neural network.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

A stochastic-computing based deep learning framework using adiabatic quantum-flux-parametron superconducting technology.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

SpeedyBox: Low-Latency NFV Service Chains with Cross-NF Runtime Consolidation.
Proceedings of the 39th IEEE International Conference on Distributed Computing Systems, 2019

A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

pLock: A Fast Lock for Architectures with Explicit Inter-core Message Passing.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Methods of Multipliers.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

Hop: Heterogeneity-aware Decentralized Training.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

2018
DudeTx: Durable Transactions Made Decoupled.
ACM Trans. Storage, 2018

ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Method of Multipliers.
CoRR, 2018

An Efficient Framework for Implementing Persist Data Structures on Remote NVM.
CoRR, 2018

vSensor: leveraging fixed-workload snippets of programs for performance variance detection.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

CSE: Parallel Finite State Machines with Convergence Set Enumeration.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

CounterMiner: Mining Big Performance Data from Hardware Counters.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

PermDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

G-TSC: Timestamp Based Coherence for GPUs.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

GraphR: Accelerating Graph Processing Using ReRAM.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

ReRAM-based accelerator for deep learning.
Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition, 2018

Wonderland: A Novel Abstraction-Based Out-Of-Core Graph Processing System.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

VIBNN: Hardware Acceleration of Bayesian Neural Networks.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems.
Proceedings of the 23rd Asia and South Pacific Design Automation Conference, 2018

Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework.
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018

2017
CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices.
CoRR, 2017

Squeezing out All the Value of Loaded Data: An Out-of-core Graph Processing System with Reduced Disk I/O.
Proceedings of the 2017 USENIX Annual Technical Conference, 2017

CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

Power Efficient Sharing-Aware GPU Data Management.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning.
Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture, 2017

SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

DudeTM: Building Durable Transactions with Decoupling for Persistent Memory.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

2016
OPR: deterministic group replay for one-sided communication.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

Exploring the Hidden Dimension in Graph Processing.
Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016

SReplay: Deterministic Sub-Group Replay for One-Sided Communication.
Proceedings of the 2016 International Conference on Supercomputing, 2016

2015
Improving multiprocessor performance with fine-grain coherence bypass.
Sci. China Inf. Sci., 2015

2014
OmniOrder: Directory-based conflict serialization of transactions.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

2013
Scalable and flexible bulk architecture.
PhD thesis, 2013

BulkCommit: scalable and fast commit of atomic blocks in a lazy multiprocessor environment.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model.
Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, 2013

Volition: scalable and precise sequential consistency violation detection.
Proceedings of the Architectural Support for Programming Languages and Operating Systems, 2013

2012
BulkSMT: Designing SMT processors for atomic-block execution.
Proceedings of the 18th IEEE International Symposium on High Performance Computer Architecture, 2012

2010
ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment.
Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

2007
Design and Implementation of Floating Point Stack on General RISC Architecture.
Proceedings of the 15th Euromicro International Conference on Parallel, 2007

Circuit implementation of floating point range reduction for trigonometric functions.
Proceedings of the International Symposium on Circuits and Systems (ISCAS 2007), 2007

Optimized Register Renaming Scheme for Stack-Based x86 Operations.
Proceedings of the Architecture of Computing Systems, 2007

