Xin You

Orcid: 0000-0002-5163-4607

Affiliations:
  • Beihang University, Beijing, China


According to our database, Xin You authored at least 44 papers between 2018 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
Exploiting Efficient Mapping and Pipelined Execution for Accelerating SpMV on Tensor Cores.
Proceedings of the 31st ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2026

Efficient Temporal Graph Network Training via Unified Redundancy Elimination.
Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2026

2025
LOw-cOst yet High-Performant Sparse Matrix-Matrix Multiplication on Arm SME Architectures.
CoRR, November, 2025

PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization.
CoRR, November, 2025

Identifying Performance Inefficiencies of Parallel Program With Spatial and Temporal Trace Analysis.
IEEE Trans. Parallel Distributed Syst., July, 2025

SimTrace: Exploiting Spatial and Temporal Sampling for Large-Scale Performance Analysis.
ACM Trans. Archit. Code Optim., June, 2025

Exploiting Dynamic Regular Patterns in Irregular Programs for Efficient Vectorization.
ACM Trans. Archit. Code Optim., June, 2025

Hotspy: identifying performance hotspot with graph neural network based static analysis.
CCF Trans. High Perform. Comput., June, 2025

Towards Efficient LLM Inference via Collective and Adaptive Speculative Decoding.
Proceedings of the International Conference for High Performance Computing, 2025

Zero-Value Code Specialization via Profile-Guided Control Data Flow Analysis.
Proceedings of the International Conference for High Performance Computing, 2025

Exploiting Transformer-Based Static Binary Analysis for Identifying Inefficient Locks.
Proceedings of the Network and Parallel Computing, 2025

INSPIRIT: Adaptive Priority-based Task Scheduling for Heterogeneous Hardware.
Proceedings of the 2025 IEEE International Parallel and Distributed Processing Symposium, 2025

GNNPerf: Towards Effective Performance Profiling and Analysis Across GNN Frameworks.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2025

Towards Efficient Instruction Stream Scheduling for Stencil Computation on ARM Processors.
Proceedings of the 2025 IEEE International Parallel and Distributed Processing Symposium, 2025

Efficient Locality-aware Instruction Stream Scheduling for Stencil Computation on ARM Processors.
Proceedings of the 39th ACM International Conference on Supercomputing, 2025

ESC: Effective Submanifold Convolution using Tensor Cores.
Proceedings of the 54th International Conference on Parallel Processing, 2025

Identifying Potential Anomalous Operations in Graph Neural Network Training.
Proceedings of the Advanced Parallel Processing Technologies, 2025

2024
AtRec: Accelerating Recommendation Model Training on CPUs.
IEEE Trans. Parallel Distributed Syst., June, 2024

Minions: Accelerating Large Language Model Inference with Adaptive and Collective Speculative Decoding.
CoRR, 2024

GVARP: Detecting Performance Variance on Large-Scale Heterogeneous Systems.
Proceedings of the International Conference for High Performance Computing, 2024

PRoof: A Comprehensive Hierarchical Profiling Framework for Deep Neural Networks with Roofline Analysis.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

Retrospection on the Performance Analysis Tools for Large-Scale HPC Programs.
Proceedings of the 31st IEEE International Conference on High Performance Computing, 2024

2023
TrivialSpy: Identifying Software Triviality via Fine-grained and Dataflow-based Value Profiling.
Proceedings of the International Conference for High Performance Computing, 2023

BiRFIA: Selective Binary Rewriting for Function Interception on ARM.
Proceedings of the 37th International Conference on Supercomputing, 2023

Accelerating Big Data Application by Eliminating Redundancy on Hadoop Cluster.
Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems, 2023

Efficient Deep Molecular Dynamic Model Training on Heterogeneous System.
Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems, 2023

VClinic: A Portable and Efficient Framework for Fine-Grained Value Profilers.
Proceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023

2022
Accelerating the cryo-EM structure determination in RELION on GPU cluster.
Frontiers Comput. Sci., 2022

PowerSpector: Towards Energy Efficiency with Calling-Context-Aware Profiling.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

Vectorizing SpMV by Exploiting Dynamic Regular Patterns.
Proceedings of the 51st International Conference on Parallel Processing, 2022

2021
The Deep Learning Compiler: A Comprehensive Survey.
IEEE Trans. Parallel Distributed Syst., 2021

dgQuEST: Accelerating Large Scale Quantum Circuit Simulation through Hybrid CPU-GPU Memory Hierarchies.
Proceedings of the Network and Parallel Computing, 2021

Automatic Code Generation and Optimization of Large-scale Stencil Computation on Many-core Processors.
Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

DRStencil: Exploiting Data Reuse within Low-order Stencil on GPU.
Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, 2021

2020
The Deep Learning Compiler: A Comprehensive Survey.
CoRR, 2020

swGBDT: Efficient Gradient Boosted Decision Tree on Sunway Many-Core Processor.
Proceedings of the Supercomputing Frontiers - 6th Asian Conference, 2020

ZeroSpy: exploring software inefficiency with redundant zeros.
Proceedings of the International Conference for High Performance Computing, 2020

Accelerating De Novo Assembler WTDBG2 on Commodity Servers.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2020

Towards GPU Acceleration of Phonon Computation with ShengBTE.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2020

2019
Performance Evaluation and Analysis of Linear Algebra Kernels in the Prototype Tianhe-3 Cluster.
Proceedings of the Supercomputing Frontiers - 5th Asian Conference, 2019

Improving the Parallelism of CESM on GPU.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2019

L-DAG: Enabling Loopy Workflow in Scientific Application with Automatic DAG Transformation.
Proceedings of the 2019 IEEE Intl Conf on Dependable, 2019

2018
swCaffe: A Parallel Framework for Accelerating Deep Learning Applications on Sunway TaihuLight.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

Performance Analysis and Optimization of Cyro-EM Structure Determination in RELION-2.
Proceedings of the Advanced Computer Architecture - 12th Conference, 2018

