Xueying Wang

ORCID: 0000-0002-7835-113X

Affiliations:
  • Beijing University of Posts and Telecommunications, Beijing, China


According to our database, Xueying Wang authored at least 23 papers between 2018 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
OptiFX: Automatic Optimization for Convolutional Neural Networks with Aggressive Operator Fusion on GPUs.
ACM Trans. Archit. Code Optim., June, 2025

Libra: Synergizing CUDA and Tensor Cores for High-Performance Sparse Matrix Multiplication.
CoRR, June, 2025

SparkAttention: high-performance multi-head attention for large models on Volta GPU architecture.
CCF Trans. High Perform. Comput., February, 2025

FlashSparse: Minimizing Computation Redundancy for Fast Sparse Matrix Multiplications on Tensor Cores.
Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2025

2024
Fast Convolution Meets Low Precision: Exploring Efficient Quantized Winograd Convolution on Modern CPUs.
ACM Trans. Archit. Code Optim., March, 2024

2023
CoAxNN: Optimizing on-device deep learning with conditional approximate neural networks.
J. Syst. Archit., October, 2023

Facilitating hardware-aware neural architecture search with learning-based predictive models.
J. Syst. Archit., April, 2023

2022
An Application-oblivious Memory Scheduling System for DNN Accelerators.
ACM Trans. Archit. Code Optim., 2022

Optimizing deep neural networks on intelligent edge accelerators via flexible-rate filter pruning.
J. Syst. Archit., 2022

Accelerating deep neural network filter pruning with mask-aware convolutional computations on modern CPUs.
Neurocomputing, 2022

2021
Compiler-assisted Operator Template Library for DNN Accelerators.
Int. J. Parallel Program., 2021

Pinpointing the Memory Behaviors of DNN Training.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2021

Unleashing the Low-Precision Computation Potential of Tensor Cores on GPUs.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2021

2020
Fusion-Catalyzed Pruning for Optimizing Deep Learning on Intelligent Edge Devices.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

Compiler-Assisted Operator Template Library for DNN Accelerators.
Proceedings of the Network and Parallel Computing, 2020

Characterizing the I/O Pipeline in the Deployment of CNNs on Commercial Accelerators.
Proceedings of the IEEE International Conference on Parallel & Distributed Processing with Applications, 2020

Lance: efficient low-precision quantized winograd convolution for neural networks based on graphics processing units.
Proceedings of the 2020 IEEE International Conference on Acoustics, Speech and Signal Processing, 2020

Accelerating Deep Learning Inference with Cross-Layer Data Reuse on GPUs.
Proceedings of the Euro-Par 2020: Parallel Processing, 2020

2019
Exploiting the input sparsity to accelerate deep neural networks: poster.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

XDN: Towards Efficient Inference of Residual Neural Networks on Cambricon Chips.
Proceedings of the Benchmarking, Measuring, and Optimizing, 2019

Acorns: A Framework for Accelerating Deep Neural Networks with Input Sparsity.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019

2018
Background Subtraction on Depth Videos with Convolutional Neural Networks.
Proceedings of the 2018 International Joint Conference on Neural Networks, 2018

Auto-tuning Neural Network Quantization Framework for Collaborative Inference Between the Cloud and Edge.
Proceedings of the Artificial Neural Networks and Machine Learning - ICANN 2018, 2018
