Wei Xue

Orcid: 0000-0001-9740-6581

Affiliations:
  • Tsinghua University, Department of Computer Science and Technology, Beijing National Research Center for Information Science and Technology (BNRist), Beijing, China
  • Qinghai University, China
  • National Supercomputing Center in Wuxi, China


According to our database, Wei Xue authored at least 100 papers between 2004 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Accelerating Half-Precision Seismic Simulation on Neural Processing Unit.
IEEE Trans. Parallel Distributed Syst., September, 2025

MMStencil: Optimizing High-order Stencils on Multicore CPU using Matrix Unit.
CoRR, July, 2025

StructMG: A Fast and Scalable Structured Algebraic Multigrid.
CoRR, June, 2025

Semi-StructMG: A Fast and Scalable Semi-Structured Algebraic Multigrid.
Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2025

An AI-Enhanced 1km-Resolution Seamless Global Weather and Climate Model to Achieve Year-Scale Simulation Speed using 34 Million Cores.
Proceedings of the 30th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2025

VAE-Var: Variational Autoencoder-Enhanced Variational Methods for Data Assimilation in Meteorology.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025

2024
O2ath: an OpenMP offloading toolkit for the sunway heterogeneous manycore platform.
CCF Trans. High Perform. Comput., June, 2024

Toward efficient structured-grid triangular solver on sunway many-core processors.
J. Supercomput., May, 2024

Parallel optimization and application of unstructured sparse triangular solver on new generation of Sunway architecture.
Parallel Comput., 2024

VAE-Var: Variational-Autoencoder-Enhanced Variational Assimilation.
CoRR, 2024

Kilometer-Level Coupled Modeling Using 40 Million Cores: An Eight-Year Journey of Model Development.
CoRR, 2024

Full Lifecycle Data Analysis on a Large-scale and Leadership Supercomputer: What Can We Learn from It?
Proceedings of the 2024 USENIX Annual Technical Conference, 2024

POSTER: StructMG: A Fast and Scalable Structured Multigrid.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

FP16 Acceleration in Structured Multigrid Preconditioner for Real-World Applications.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

BoostN: Optimizing Imbalanced Neighborhood Communication on Homogeneous Many-Core System.
Proceedings of the 53rd International Conference on Parallel Processing, 2024

Towards a Self-contained Data-driven Global Weather Forecasting Framework.
Proceedings of the Forty-first International Conference on Machine Learning, 2024

A Data Optimizer for Region-Aware Self-describing Files in Scientific Computing.
Proceedings of the 2024 ACM Symposium on Cloud Computing, 2024

2023
Redesign and Accelerate the AIREBO Bond-Order Potential on the New Sunway Supercomputer.
IEEE Trans. Parallel Distributed Syst., December, 2023

GEO-WMS: an improved approach to geoscientific workflow management system on HPC.
CCF Trans. High Perform. Comput., December, 2023

Bio-ESMD: A Data Centric Implementation for Large-Scale Biological System Simulation on Sunway TaihuLight Supercomputer.
IEEE Trans. Parallel Distributed Syst., March, 2023

Model guided algorithm optimization for tridiagonal solver on many-core architectures.
CCF Trans. High Perform. Comput., March, 2023

End-to-end I/O Monitoring on Leading Supercomputers.
ACM Trans. Storage, February, 2023

A Novel Compute-Efficient Tridiagonal Solver for Many-Core Architectures.
IEEE Trans. Parallel Distributed Syst., 2023

FengWu-4DVar: Coupling the Data-driven Weather Forecasting Model with 4D Variational Assimilation.
CoRR, 2023

Automatic Search Guided Code Optimization Framework for Mixed-Precision Scientific Applications.
Proceedings of the SC '23 Workshops of The International Conference on High Performance Computing, 2023

69.7-PFlops Extreme Scale Earthquake Simulation with Crossing Multi-faults and Topography on Sunway.
Proceedings of the International Conference for High Performance Computing, 2023

5 ExaFlop/s HPL-MxP Benchmark with Linear Scalability on the 40-Million-Core Sunway Supercomputer.
Proceedings of the International Conference for High Performance Computing, 2023

Rapid simulations of atmospheric data assimilation of hourly-scale phenomena with modern neural networks.
Proceedings of the International Conference for High Performance Computing, 2023

Enabling Real World Scale Structural Superlubricity All-Atom Simulation on the Next-Generation Sunway Supercomputer.
Proceedings of the International Conference for High Performance Computing, 2023

HadaFS: A File System Bridging the Local and Shared Burst Buffer for Exascale Supercomputers.
Proceedings of the 21st USENIX Conference on File and Storage Technologies, 2023

2022
Leveraging Code Snippets to Detect Variations in the Performance of HPC Systems.
IEEE Trans. Parallel Distributed Syst., 2022

Jdebug: A Fast, Non-Intrusive and Scalable Fault Locating Tool for Ten-Million-Scale Parallel Applications.
IEEE Trans. Parallel Distributed Syst., 2022

Optimization of Reactive Force Field Simulation: Refactor, Parallelization, and Vectorization for Interactions.
IEEE Trans. Parallel Distributed Syst., 2022

Enabling Large-Scale Simulation of CAM on the Sunway TaihuLight Supercomputer.
IEEE Trans. Computers, 2022

An End-to-end and Adaptive I/O Optimization Tool for Modern HPC Storage Systems.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

2021
Many-core acceleration of the first-principles all-electron quantum perturbation calculations.
Comput. Phys. Commun., 2021

BuffetFS: Serve Yourself Permission Checks without Remote Procedure Calls.
CoRR, 2021

Editorial for the special issue on large-scale AI in classical HPC environment and AI for science.
CCF Trans. High Perform. Comput., 2021

LMFF: efficient and scalable layered materials force field on heterogeneous many-core processors.
Proceedings of the International Conference for High Performance Computing, 2021

Profiling HPC Applications with Low Overhead and High Accuracy.
Proceedings of the 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), New York City, NY, USA, September 30, 2021

2020
Millimeter-Scale and Billion-Atom Reactive Force Field Simulation on Sunway Taihulight.
IEEE Trans. Parallel Distributed Syst., 2020

Lessons Learned from Optimizing the Sunway Storage System for Higher Application I/O Performance.
J. Comput. Sci. Technol., 2020

Editorial for the special issue on HPC algorithms and applications.
CCF Trans. High Perform. Comput., 2020

Tuning a general purpose software cache library for TaihuLight's SW26010 processor.
CCF Trans. High Perform. Comput., 2020

APMT: an automatic hardware counter-based performance modeling tool for HPC applications.
CCF Trans. High Perform. Comput., 2020

Cell-list based molecular dynamics on many-core processors: a case study on sunway TaihuLight supercomputer.
Proceedings of the International Conference for High Performance Computing, 2020

Neighbor-list-free molecular dynamics on sunway TaihuLight supercomputer.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

Performance Modeling of Stencil Computation on SW26010 Processors.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2020

2019
Optimizing Finite Volume Method Solvers on Nvidia GPUs.
IEEE Trans. Parallel Distributed Syst., 2019

An automatic performance model-based scheduling tool for coupled climate system models.
J. Parallel Distributed Comput., 2019

NAMSG: An Efficient Method For Training Neural Networks.
CoRR, 2019

SW_GROMACS: accelerate GROMACS on Sunway TaihuLight.
Proceedings of the International Conference for High Performance Computing, 2019

End-to-end I/O Monitoring on a Leading Supercomputer.
Proceedings of the 16th USENIX Symposium on Networked Systems Design and Implementation, 2019

Refactoring and Optimizing WRF Model on Sunway TaihuLight.
Proceedings of the 48th International Conference on Parallel Processing, 2019

Automatic, Application-Aware I/O Forwarding Resource Allocation.
Proceedings of the 17th USENIX Conference on File and Storage Technologies, 2019

2018
ShenTu: processing multi-trillion edge graphs on millions of cores in seconds.
Proceedings of the International Conference for High Performance Computing, 2018

Redesigning LAMMPS for peta-scale and hundred-billion-atom simulation on Sunway TaihuLight.
Proceedings of the International Conference for High Performance Computing, 2018

swSpTRSV: a fast sparse triangular solve with sparse level tile layout on sunway architectures.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

vSensor: leveraging fixed-workload snippets of programs for performance variance detection.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Taming the "Monster": Overcoming Program Optimization Challenges on SW26010 Through Precise Performance Modeling.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Towards Efficient SpMV on Sunway Manycore Architectures.
Proceedings of the 32nd International Conference on Supercomputing, 2018

A Fast Sparse Triangular Solver for Structured-grid Problems on Sunway Many-core Processor SW26010.
Proceedings of the 47th International Conference on Parallel Processing, 2018

2017
Solving Mesoscale Atmospheric Dynamics Using a Reconfigurable Dataflow Architecture.
IEEE Micro, 2017

Understanding object-level memory access patterns across the spectrum.
Proceedings of the International Conference for High Performance Computing, 2017

18.9-Pflops nonlinear earthquake simulation on Sunway TaihuLight: enabling depiction of 18-Hz and 8-meter scenarios.
Proceedings of the International Conference for High Performance Computing, 2017

26 PFLOPS Stencil Computations for Atmospheric Modeling on Sunway TaihuLight.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

2016
The Sunway TaihuLight supercomputer: system and applications.
Sci. China Inf. Sci., 2016

10M-core scalable fully-implicit solver for nonhydrostatic atmospheric dynamics.
Proceedings of the International Conference for High Performance Computing, 2016

Refactoring and optimizing the community atmosphere model (CAM) on the sunway taihulight supercomputer.
Proceedings of the International Conference for High Performance Computing, 2016

A Fast Tridiagonal Solver for Intel MIC Architecture.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Accelerating the 3D euler atmospheric solver through heterogeneous CPU-GPU platforms.
Proceedings of the ACM International Conference on Computing Frontiers, CF'16, 2016

Generalized GPU Acceleration for Applications Employing Finite-Volume Methods.
Proceedings of the IEEE/ACM 16th International Symposium on Cluster, 2016

Unleashing the performance potential of CPU-GPU platforms for the 3D atmospheric Euler solver.
Proceedings of the 27th IEEE International Conference on Application-specific Systems, 2016

2015
Solving the Global Atmospheric Equations through Heterogeneous Reconfigurable Platforms.
ACM Trans. Reconfigurable Technol. Syst., 2015

Ultra-Scalable CPU-MIC Acceleration of Mesoscale Atmospheric Modeling on Tianhe-2.
IEEE Trans. Computers, 2015

ParSA: High-throughput scientific data analysis framework with distributed file system.
Future Gener. Comput. Syst., 2015

2014
A hierarchical tridiagonal system solver for heterogenous supercomputers.
Proceedings of the 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2014

Enabling and Scaling a Global Shallow-Water Atmospheric Model on Tianhe-2.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Scaling and analyzing the stencil performance on multi-core and many-core architectures.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

CESMTuner: An Auto-tuning Framework for the Community Earth System Model.
Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014

A highly-efficient and green data flow engine for solving euler atmospheric equations.
Proceedings of the 24th International Conference on Field Programmable Logic and Applications, 2014

2013
Performance modeling and optimization of sparse matrix-vector multiplication on NVIDIA CUDA platform.
J. Supercomput., 2013

A scalable Helmholtz solver in GRAPES over large-scale multicore cluster.
Concurr. Comput. Pract. Exp., 2013

A peta-scalable CPU-GPU algorithm for global atmospheric simulations.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

Accelerating solvers for global atmospheric equations through mixed-precision data flow engine.
Proceedings of the 23rd International Conference on Field programmable Logic and Applications, 2013

Global Atmospheric Simulation on a Reconfigurable Platform.
Proceedings of the 21st IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2013

A Scalable Barotropic Mode Solver for the Parallel Ocean Program.
Proceedings of the Euro-Par 2013 Parallel Processing, 2013

HW/SW approaches to accelerate GRAPES in an FU array.
Proceedings of the 2013 IEEE Symposium on Low-Power and High-Speed Chips, 2013

2012
Fast time domain simulation of power systems using multilevel preconditioners with adaptive reconstruction strategies.
Simul. Model. Pract. Theory, 2012

2010
SOPA: Selecting the optimal caching policy adaptively.
ACM Trans. Storage, 2010

2007
SLAS: An efficient approach to scaling round-robin striped volumes.
ACM Trans. Storage, 2007

Design and Implementation of an Out-of-Band Virtualization System for Large SANs.
IEEE Trans. Computers, 2007

Design and Implementation of an Efficient Multi-version File System.
Proceedings of the International Conference on Networking, 2007

An Efficient SAN-Level Caching Method Based on Chunk-Aging.
Proceedings of the International Conference on Networking, 2007

2006
A Database Redo Log System Based on Virtual Memory Disk.
Proceedings of the Computational Science, 2006

Design and Implementation of an Out-of-Band Virtualization System on Solaris 10.
Proceedings of the Computational Science, 2006

2005
MagicStore: A New Out-of-Band Virtualization System in SAN Environments.
Proceedings of the Network and Parallel Computing, IFIP International Conference, 2005

Parallel Algorithm and Implementation for Realtime Dynamic Simulation of Power System.
Proceedings of the 34th International Conference on Parallel Processing (ICPP 2005), 2005

TH-VSS: An Asymmetric Storage Virtualization System for the SAN Environment.
Proceedings of the Computational Science, 2005

2004
Parallel Transient Stability Simulation for National Power Grid of China.
Proceedings of the Parallel and Distributed Processing and Applications, 2004

