Xuehai Qian

According to our database, Xuehai Qian authored at least 57 papers between 2007 and 2020.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2020
iCELIA: A Full-Stack Framework for STT-MRAM-Based Deep Learning Acceleration.
IEEE Trans. Parallel Distrib. Syst., 2020

Efficient Performance Estimation and Work-Group Size Pruning for OpenCL Kernels on GPUs.
IEEE Trans. Parallel Distrib. Syst., 2020

PatDNN: Achieving Real-Time DNN Execution on Mobile Devices with Pattern-based Weight Pruning.
CoRR, 2020

2019
Clip: A Disk I/O Focused Parallel Out-of-Core Graph Processing System.
IEEE Trans. Parallel Distrib. Syst., 2019

HEIF: Highly Efficient Stochastic Computing-Based Inference Framework for Deep Neural Networks.
IEEE Trans. on CAD of Integrated Circuits and Systems, 2019

Heterogeneity-Aware Asynchronous Decentralized Training.
CoRR, 2019

A Stochastic-Computing based Deep Learning Framework using Adiabatic Quantum-Flux-Parametron Superconducting Technology.
CoRR, 2019

Non-structured DNN Weight Pruning Considered Harmful.
CoRR, 2019

PIMSim: A Flexible and Detailed Processing-in-Memory Simulator.
Computer Architecture Letters, 2019

GraphQ: Scalable PIM-Based Graph Processing.
Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, 2019

TPShare: a time-space sharing scheduling abstraction for shared cloud via vertical labels.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

TIE: energy-efficient tensor train-based inference engine for deep neural network.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

A stochastic-computing based deep learning framework using adiabatic quantum-flux-parametron superconducting technology.
Proceedings of the 46th International Symposium on Computer Architecture, 2019

SpeedyBox: Low-Latency NFV Service Chains with Cross-NF Runtime Consolidation.
Proceedings of the 39th IEEE International Conference on Distributed Computing Systems, 2019

A Hybrid Framework for Fast and Accurate GPU Performance Estimation through Source-Level Analysis and Trace-Based Simulation.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

HyPar: Towards Hybrid Parallelism for Deep Learning Accelerator Array.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

E-RNN: Design Optimization for Efficient Recurrent Neural Networks in FPGAs.
Proceedings of the 25th IEEE International Symposium on High Performance Computer Architecture, 2019

pLock: A Fast Lock for Architectures with Explicit Inter-core Message Passing.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Methods of Multipliers.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

Hop: Heterogeneity-aware Decentralized Training.
Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, 2019

2018
DudeTx: Durable Transactions Made Decoupled.
TOS, 2018

ADMM-NN: An Algorithm-Hardware Co-Design Framework of DNNs Using Alternating Direction Method of Multipliers.
CoRR, 2018

An Efficient Framework for Implementing Persist Data Structures on Remote NVM.
CoRR, 2018

vSensor: leveraging fixed-workload snippets of programs for performance variance detection.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

CSE: Parallel Finite State Machines with Convergence Set Enumeration.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

CounterMiner: Mining Big Performance Data from Hardware Counters.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

PermDNN: Efficient Compressed DNN Architecture with Permuted Diagonal Matrices.
Proceedings of the 51st Annual IEEE/ACM International Symposium on Microarchitecture, 2018

GraphP: Reducing Communication for PIM-Based Graph Processing with Efficient Data Partition.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

G-TSC: Timestamp Based Coherence for GPUs.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

GraphR: Accelerating Graph Processing Using ReRAM.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

ReRAM-based accelerator for deep learning.
Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition, 2018

Wonderland: A Novel Abstraction-Based Out-Of-Core Graph Processing System.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

Datasize-Aware High Dimensional Configurations Auto-Tuning of In-Memory Cluster Computing.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

VIBNN: Hardware Acceleration of Bayesian Neural Networks.
Proceedings of the Twenty-Third International Conference on Architectural Support for Programming Languages and Operating Systems, 2018

Neu-NoC: A high-efficient interconnection network for accelerated neuromorphic systems.
Proceedings of the 23rd Asia and South Pacific Design Automation Conference, 2018

Towards Ultra-High Performance and Energy Efficiency of Deep Learning Systems: An Algorithm-Hardware Co-Optimization Framework.
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018

2017
CirCNN: Accelerating and Compressing Deep Neural Networks Using Block-Circulant Weight Matrices.
CoRR, 2017

Squeezing out All the Value of Loaded Data: An Out-of-core Graph Processing System with Reduced Disk I/O.
Proceedings of the 2017 USENIX Annual Technical Conference, 2017

CirCNN: accelerating and compressing deep neural networks using block-circulant weight matrices.
Proceedings of the 50th Annual IEEE/ACM International Symposium on Microarchitecture, 2017

Power Efficient Sharing-Aware GPU Data Management.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

PipeLayer: A Pipelined ReRAM-Based Accelerator for Deep Learning.
Proceedings of the 2017 IEEE International Symposium on High Performance Computer Architecture, 2017

SC-DCNN: Highly-Scalable Deep Convolutional Neural Network using Stochastic Computing.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

DudeTM: Building Durable Transactions with Decoupling for Persistent Memory.
Proceedings of the Twenty-Second International Conference on Architectural Support for Programming Languages and Operating Systems, 2017

2016
OPR: deterministic group replay for one-sided communication.
Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2016

Exploring the Hidden Dimension in Graph Processing.
Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation, 2016

SReplay: Deterministic Sub-Group Replay for One-Sided Communication.
Proceedings of the 2016 International Conference on Supercomputing, 2016

2015
Improving multiprocessor performance with fine-grain coherence bypass.
SCIENCE CHINA Information Sciences, 2015

2014
OmniOrder: Directory-based conflict serialization of transactions.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

Pacifier: Record and replay for relaxed-consistency multiprocessors with distributed directory protocol.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

2013
BulkCommit: scalable and fast commit of atomic blocks in a lazy multiprocessor environment.
Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture, 2013

Rainbow: Efficient memory dependence recording with high replay parallelism for relaxed memory model.
Proceedings of the 19th IEEE International Symposium on High Performance Computer Architecture, 2013

Volition: scalable and precise sequential consistency violation detection.
Proceedings of the Architectural Support for Programming Languages and Operating Systems, 2013

2012
BulkSMT: Designing SMT processors for atomic-block execution.
Proceedings of the 18th IEEE International Symposium on High Performance Computer Architecture, 2012

2010
ScalableBulk: Scalable Cache Coherence for Atomic Blocks in a Lazy Environment.
Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, 2010

2007
Design and Implementation of Floating Point Stack on General RISC Architecture.
Proceedings of the 15th Euromicro International Conference on Parallel, 2007

Circuit implementation of floating point range reduction for trigonometric functions.
Proceedings of the International Symposium on Circuits and Systems (ISCAS 2007), 2007

Optimized Register Renaming Scheme for Stack-Based x86 Operations.
Proceedings of the Architecture of Computing Systems, 2007

