Yunquan Zhang

Orcid: 0000-0001-7520-9640

According to our database, Yunquan Zhang authored at least 148 papers between 2003 and 2024.

Bibliography

2024
IrGEMM: An Input-Aware Tuning Framework for Irregular GEMM on ARM and X86 CPUs.
IEEE Trans. Parallel Distributed Syst., September, 2024

Special issue of HPCChina 2023.
CCF Trans. High Perform. Comput., February, 2024

ConvStencil: Transform Stencil Computation to Matrix Multiplication on Tensor Cores.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

OpenFFT-SME: An Efficient Outer Product Pattern FFT Library on ARM SME CPUs.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

VNEC: A Vectorized Non-Empty Column Format for SpMV on CPUs.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Scalable and Differentiable Simulator for Quantum Computational Chemistry.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Stencil Computation with Vector Outer Product.
Proceedings of the 38th ACM International Conference on Supercomputing, 2024

HAM-SpMSpV: an Optimized Parallel Algorithm for Masked Sparse Matrix-Sparse Vector Multiplications on multi-core CPUs.
Proceedings of the 33rd International Symposium on High-Performance Parallel and Distributed Computing, 2024

2023
MP-DPS: adaptive distributed training for deep learning based on node merging and path prediction.
CCF Trans. High Perform. Comput., December, 2023

Adaptive Federated Learning With Non-IID Data.
Comput. J., November, 2023

Redesigning OpenKMC for Multi-Component Trillion-Atom Simulations on the New Sunway Supercomputer.
IEEE Trans. Parallel Distributed Syst., July, 2023

AGCM-3DLF: Accelerating Atmospheric General Circulation Model via 3-D Parallelization and Leap-Format.
IEEE Trans. Parallel Distributed Syst., March, 2023

Gamify Stencil Dwarf on Cloud for Democratizing Scientific Computing.
CoRR, 2023

Generating Fast FFT Kernels on CPUs via FFT-Specific Intrinsics.
Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023

Asynch-SGBDT: Train Stochastic Gradient Boosting Decision Trees in an Asynchronous Parallel Manner.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2023

OpenFFT: An Adaptive Tuning Framework for 3D FFT on ARM Multicore CPUs.
Proceedings of the 37th International Conference on Supercomputing, 2023

An Auto-Parallel Method for Deep Learning Models Based on Genetic Algorithm.
Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems, 2023

SA_TRSM: A Shape-Aware Auto-Tuning Framework for Small-Scale Irregular-Shaped TRSM.
Proceedings of the 29th IEEE International Conference on Parallel and Distributed Systems, 2023

2022
Publisher Correction: Smart scheduler: an adaptive NVM-aware thread scheduling approach on NUMA systems.
CCF Trans. High Perform. Comput., December, 2022

Smart scheduler: an adaptive NVM-aware thread scheduling approach on NUMA systems.
CCF Trans. High Perform. Comput., December, 2022

Scaling Poisson Solvers on Many Cores via MMEwald.
IEEE Trans. Parallel Distributed Syst., 2022

An Accurate and Efficient Large-Scale Regression Method Through Best Friend Clustering.
IEEE Trans. Parallel Distributed Syst., 2022

Trinity: Neural Network Adaptive Distributed Parallel Training Method Based on Reinforcement Learning.
Algorithms, 2022

Large-Scale Simulation of Quantum Computational Chemistry on a New Sunway Supercomputer.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

An Efficient Vectorization Scheme for Stencil Computation.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

IATF: An Input-Aware Tuning Framework for Compact BLAS Based on ARMv8 CPUs.
Proceedings of the 51st International Conference on Parallel Processing, 2022

Message from the High Performance Computing and Communications 2022 Program Chairs.
Proceedings of the 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, 2022

Aware: Adaptive Distributed Training with Computation, Communication and Position Awareness for Deep Learning Model.
Proceedings of the 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, 2022

LBBGEMM: A Load-balanced Batch GEMM Framework on ARM CPUs.
Proceedings of the 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, 2022

EgpuIP: An Embedded GPU Accelerated Library for Image Processing.
Proceedings of the 24th IEEE Int Conf on High Performance Computing & Communications; 8th Int Conf on Data Science & Systems; 20th Int Conf on Smart City; 8th Int Conf on Dependability in Sensor, 2022

2021
Why Dataset Properties Bound the Scalability of Parallel Machine Learning Training Algorithms.
IEEE Trans. Parallel Distributed Syst., 2021

Efficient parallel linear scaling method to get the response density matrix in all-electron real-space density-functional perturbation theory.
Comput. Phys. Commun., 2021

Many-core acceleration of the first-principles all-electron quantum perturbation calculations.
Comput. Phys. Commun., 2021

Enhanced AGCM3D: A Highly Scalable Dynamical Core of Atmospheric General Circulation Model Based on Leap-Format.
CoRR, 2021

Reducing Redundancy in Data Organization and Arithmetic Calculation for Stencil Computations.
CoRR, 2021

AutoFlow: Hotspot-Aware, Dynamic Load Balancing for Distributed Stream Processing.
CoRR, 2021

An Efficient Vectorization Scheme for Stencil Computation.
CoRR, 2021

AIPerf: Automated machine learning as an AI-HPC benchmark.
Big Data Min. Anal., 2021

Temporal vectorization for stencils.
Proceedings of the International Conference for High Performance Computing, 2021

Extreme-scale ab initio quantum raman spectra simulations on the leadership HPC system in China.
Proceedings of the International Conference for High Performance Computing, 2021

Accelerating all-electron ab initio simulation of raman spectra for biological systems.
Proceedings of the International Conference for High Performance Computing, 2021

TensorKMC: kinetic Monte Carlo simulation of 50 trillion atoms driven by deep learning on a new generation of Sunway supercomputer.
Proceedings of the International Conference for High Performance Computing, 2021

Reducing redundancy in data organization and arithmetic calculation for stencil computations.
Proceedings of the International Conference for High Performance Computing, 2021

AutoTSMM: An Auto-tuning Framework for Building High-Performance Tall-and-Skinny Matrix-Matrix Multiplication on CPUs.
Proceedings of the 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), New York City, NY, USA, September 30, 2021

IAAT: A Input-Aware Adaptive Tuning framework for Small GEMM.
Proceedings of the 27th IEEE International Conference on Parallel and Distributed Systems, 2021

AutoFlow: Hotspot-Aware, Dynamic Load Balancing for Distributed Stream Processing.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2021

A Transpose-free Three-dimensional FFT Algorithm on ARM CPUs.
Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, 2021

2020
Automatic Generation of High-Performance FFT Kernels on Arm and X86 CPUs.
IEEE Trans. Parallel Distributed Syst., 2020

FastNBL: fast neighbor lists establishment for molecular dynamics simulation based on bitwise operations.
J. Supercomput., 2020

并行程序设计语言中局部性机制的研究 (Research on Locality-aware Design Mechanism of State-of-the-art Parallel Programming Languages).
Computer Science (计算机科学), 2020

WP-SGD: Weighted parallel SGD for distributed unbalanced-workload training system.
J. Parallel Distributed Comput., 2020

The static parallel distribution algorithms for hybrid density-functional calculations in HONPAS package.
Int. J. High Perform. Comput. Appl., 2020

HPC software capability landscape in China.
Int. J. High Perform. Comput. Appl., 2020

Accelerated LiDAR data processing algorithm for self-driving cars on the heterogeneous computing platform.
IET Comput. Digit. Tech., 2020

The dynamic parallel distribution algorithm for hybrid density-functional calculations in HONPAS package.
Comput. Phys. Commun., 2020

A Highly Efficient Dynamical Core of Atmospheric General Circulation Model based on Leap-Format.
Proceedings of the 2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2020

Performance Optimization for Feature Extraction Section of DeepChem.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2020

2019
Correction to: FastNBL: fast neighbor lists establishment for molecular dynamics simulation based on bitwise operations.
J. Supercomput., 2019

A Relational Theory of Locality.
ACM Trans. Archit. Code Optim., 2019

2018年中国高性能计算机发展现状分析与展望 (State-of-the-art Analysis and Perspectives of 2018 China HPC Development).
Computer Science (计算机科学), 2019

Efficient parallel optimizations of a high-performance SIFT on GPUs.
J. Parallel Distributed Comput., 2019

Mining concise patterns on graph-connected itemsets.
Neurocomputing, 2019

The Scalability for Parallel Machine Learning Training Algorithm: Dataset Matters.
CoRR, 2019

HPC AI500: A Benchmark Suite for HPC AI Systems.
CoRR, 2019

OpenKMC: a KMC design for hundred-billion-atom simulation using millions of cores on Sunway Taihulight.
Proceedings of the International Conference for High Performance Computing, 2019

AutoFFT: a template-based FFT codes auto-generation framework for ARM and X86 CPUs.
Proceedings of the International Conference for High Performance Computing, 2019

swMD: Performance Optimizations for Molecular Dynamics Simulation on Sunway Taihulight.
Proceedings of the 2019 IEEE Intl Conf on Parallel & Distributed Processing with Applications, 2019

Using Gradient Based Multikernel Gaussian Process and Meta-Acquisition Function to Accelerate SMBO.
Proceedings of the 31st IEEE International Conference on Tools with Artificial Intelligence, 2019

Tessellating Star Stencils.
Proceedings of the 48th International Conference on Parallel Processing, 2019

2018
Cache-Oblivious MPI All-to-All Communications Based on Morton Order.
IEEE Trans. Parallel Distributed Syst., 2018

Using Known Information to Accelerate HyperParameters Optimization Based on SMBO.
CoRR, 2018

Asynchronous Parallel Sampling Gradient Boosting Decision Tree.
CoRR, 2018

A Measurement Theory of Locality.
CoRR, 2018

Rolling Forecasting Forward by Boosting Heterogeneous Kernels.
Proceedings of the Advances in Knowledge Discovery and Data Mining, 2018

Footmark: A New Formulation for Working Set Statistics.
Proceedings of the Languages and Compilers for Parallel Computing, 2018

Communication-Avoiding for Dynamical Core of Atmospheric General Circulation Model.
Proceedings of the 47th International Conference on Parallel Processing, 2018

Massively Scaling the Metal Microscopic Damage Simulation on Sunway TaihuLight Supercomputer.
Proceedings of the 47th International Conference on Parallel Processing, 2018

AGCM3D: A Highly Scalable Finite-Difference Dynamical Core of Atmospheric General Circulation Model Based on 3D Decomposition.
Proceedings of the 24th IEEE International Conference on Parallel and Distributed Systems, 2018

Implementation and Optimization of Multi-dimensional Real FFT on ARMv8 Platform.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2018

HPC AI500: A Benchmark Suite for HPC AI Systems.
Proceedings of the Benchmarking, Measuring, and Optimizing, 2018

2017
Special Issue on Network and Parallel Computing.
Int. J. Parallel Program., 2017

Hybrid-optimization strategy for the communication of large-scale Kinetic Monte Carlo simulation.
Comput. Phys. Commun., 2017

Asynchronous COMID: the theoretic basis for transmitted data sparsification tricks on Parameter Server.
CoRR, 2017

Weighted parallel SGD for distributed unbalanced-workload training system.
CoRR, 2017

Tessellating stencils.
Proceedings of the International Conference for High Performance Computing, 2017

POSTER: Cache-Oblivious MPI All-to-All Communications on Many-Core Architectures.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

HartSift: A High-Accuracy and Real-Time SIFT Based on GPU.
Proceedings of the 23rd IEEE International Conference on Parallel and Distributed Systems, 2017

2016
A Cross-Platform SpMV Framework on Many-Core Architectures.
ACM Trans. Archit. Code Optim., 2016

Parallel Processing Systems for Big Data: A Survey.
Proc. IEEE, 2016

P-DOT: a model of computation for big data.
Int. J. Parallel Emergent Distributed Syst., 2016

边缘海静力数值预报模式并行算法研究 (Parallelization of Hydrostatic Numerical Forecasting Model of Marginal Sea).
Computer Science (计算机科学), 2016

Workshop on high performance data intensive computing.
Concurr. Comput. Pract. Exp., 2016

Efficient Management for Hybrid Memory in Managed Language Runtime.
Proceedings of the Network and Parallel Computing, 2016

2015
基于Pthreads的并行DSRC压缩算法设计与实现 (Design and Implementation of Parallel DSRC Compression Algorithm Based on Pthreads).
Computer Science (计算机科学), 2015

基于Julia语言的并行计算方法初探 (Primary Investigation into Parallel Computing in Julia Language).
Computer Science (计算机科学), 2015

基于OpenCL的直方图生成算法优化方法研究 (Research on Histogram Generation Algorithm Optimization Based on OpenCL).
Computer Science (计算机科学), 2015

Automatic tuning of sparse matrix-vector multiplication on multicore clusters.
Sci. China Inf. Sci., 2015

AsHES Introduction and Committees.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Optimizing Image Sharpening Algorithm on GPU.
Proceedings of the 44th International Conference on Parallel Processing, 2015

Fast Convolution Operations on Many-Core Architectures.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Optimized Password Recovery for Encrypted RAR on GPUs.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

Analyzing MPI-3.0 Process-Level Shared Memory: A Case Study with Stencil Computations.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

Parallel Solving Method of SOR Based on the Numerical Marine Forecasting Model.
Proceedings of the 15th IEEE/ACM International Symposium on Cluster, 2015

2014
Function Prediction of Proteins in Yeast Networks Based on the MCL Algorithm.
J. Softw., 2014

Memory Efficient Two-Pass 3D FFT Algorithm for Intel® Xeon Phi™ Coprocessor.
J. Comput. Sci. Technol., 2014

yaSpMV: yet another SpMV framework on GPUs.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

AsHES Introduction and Committees.
Proceedings of the 2014 IEEE International Parallel & Distributed Processing Symposium Workshops, 2014

Physically based parallel ray tracer for the Metropolis light transport algorithm on the Tianhe-2 supercomputer.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

Research on Mahalanobis Distance Algorithm Optimization Based on OpenCL.
Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014

2013
MPFFT: An Auto-Tuning FFT Library for OpenCL GPUs.
J. Comput. Sci. Technol., 2013

AUGEM: automatically generate high performance dense linear algebra kernels on x86 CPUs.
Proceedings of the International Conference for High Performance Computing, 2013

StreamScan: fast scan algorithms for GPUs without global barrier synchronization.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

pVOCL: Power-Aware Dynamic Placement and Migration in Virtualized GPU Environments.
Proceedings of the IEEE 33rd International Conference on Distributed Computing Systems, 2013

H-DB: Yet Another Big Data Hybrid System of Hadoop and DBMS.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2013

Large Scale Satellite Imagery Simulations with Physically Based Ray Tracing on Tianhe-1A Supercomputer.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

CLSIFT: An Optimization Study of the Scale Invariance Feature Transform on GPUs.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

2012
Implementing High-performance Intensity Model with Blur Effect on GPUs for Large-scale Star Image Simulation.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Modeling the Locality in Graph Traversals.
Proceedings of the 41st International Conference on Parallel Processing, 2012

Model-driven Level 3 BLAS Performance Optimization on Loongson 3A Processor.
Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, 2012

An Insightful Program Performance Tuning Chain for GPU Computing.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2012

Accelerating Viola-Jones Face Detection Algorithm on GPUs.
Proceedings of the 14th IEEE International Conference on High Performance Computing and Communication & 9th IEEE International Conference on Embedded Software and Systems, 2012

GPURoofline: A Model for Guiding Performance Optimizations on GPUs.
Proceedings of the Euro-Par 2012 Parallel Processing - 18th International Conference, 2012

A Locality-based Performance Model for Load-and-Compute Style Computation.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

2011
Optimizing SpMV for Diagonal Sparse Matrices on GPU.
Proceedings of the International Conference on Parallel Processing, 2011

Automatic FFT Performance Tuning on OpenCL GPUs.
Proceedings of the 17th IEEE International Conference on Parallel and Distributed Systems, 2011

CRSD: Application Specific Auto-tuning of SpMV for Diagonal Sparse Matrices.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

2010
Perspectives of China's HPC system development: a view from the 2009 China HPC TOP100 list.
Frontiers Comput. Sci. China, 2010

Heterogeneous Multi-core Parallel SGEMM Performance Testing and Analysis on Cell/B.E Processor.
Proceedings of the Fifth International Conference on Networking, Architecture, and Storage, 2010

Optimizing Sparse Matrix Vector Multiplication Using Diagonal Storage Matrix Format.
Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications, 2010

Numerical Simulation of the Thermal Convection in the Earth's Outer Core.
Proceedings of the 12th IEEE International Conference on High Performance Computing and Communications, 2010

LogGPH: A Parallel Computational Model with Hierarchical Communication Awareness.
Proceedings of the 13th IEEE International Conference on Computational Science and Engineering, 2010

QuantWiz: A scalable parallel software package for label-free protein quantification.
Proceedings of the Fifth International Conference on Bio-Inspired Computing: Theories and Applications, 2010

Accelerating Linpack Performance with Mixed Precision Algorithm on CPU+GPGPU Heterogeneous Cluster.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

2009
A parallel shortest path algorithm based on graph-partitioning and iterative correcting.
Comput. Syst. Sci. Eng., 2009

Early Performance Evaluation of Dawning 5000A and DeepComp 7000.
Proceedings of the 15th IEEE International Conference on Parallel and Distributed Systems, 2009

QuantWiz: A Parallel Software Package for LC-MS-based Label-Free Protein Quantification.
Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications, 2009

Performance Evaluation of Multithreaded Sparse Matrix-Vector Multiplication Using OpenMP.
Proceedings of the 11th IEEE International Conference on High Performance Computing and Communications, 2009

Development of a Scalable Solver for the Earth's Core Convection.
Proceedings of the High Performance Computing and Applications, 2009

2008
Basic research in computer science and software engineering at SKLCS.
Frontiers Comput. Sci. China, 2008

Parallelization of FM-Index.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications, 2008

Memory Access Complexity Analysis of SpMV in RAM (h) Model.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications, 2008

Utilizing the Multi-threading Techniques to Improve the Two-Level Checkpoint/Rollback System for MPI Applications.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications, 2008

2007
Models of parallel computation: a survey and classification.
Frontiers Comput. Sci. China, 2007

A brief introduction to China HPC TOP100: from 2002 to 2006.
Proceedings of the CHINA HPC 2007, 2007

Block size selection of parallel LU and QR on PVP-based and RISC-based supercomputers.
Proceedings of the CHINA HPC 2007, 2007

Efficient Construction of FM-index Using Overlapping Block Processing for Large Scale Texts.
Proceedings of the Advances in Information Retrieval, 2007

2006
Study on Parallel Computing.
J. Comput. Sci. Technol., 2006

2003
Hardware Impact on Communication Performance of Beowulf LINUX Cluster.
Proceedings of the 21st IASTED International Multi-Conference on Applied Informatics (AI 2003), 2003

