Jianbin Fang

Orcid: 0000-0003-3542-4869

According to our database¹, Jianbin Fang authored at least 100 papers between 2010 and 2026.

Collaborative distances:

Dijkstra number² of four.
Erdős number³ of four.

Bibliography

2026

mtGEMM: An Efficient GEMM Library for Modern Multi-Core DSPs.

IEEE Trans. Parallel Distributed Syst., April, 2026

Optimizing small matrix multiplications via batch grouping on multi-core DSPs.

CCF Trans. High Perform. Comput., February, 2026

2025

Demystifying ARM SME to Optimize General Matrix Multiplications.

CoRR, December, 2025

DCSolver: Accelerating Sparse Iterative Solvers via Divide-and-Conquer on GPUs.

ACM Trans. Archit. Code Optim., September, 2025

nDirect2: A High-Performance Library for Direct Convolutions on Multicore CPUs.

IEEE Trans. Computers, June, 2025

Gator: Accelerating Graph Attention Networks by Jointly Optimizing Attention and Graph Processing.

ACM Trans. Archit. Code Optim., June, 2025

An empirical performance evaluation of SYCL on ARM multi-core processors.

CCF Trans. High Perform. Comput., February, 2025

Directed Testing in MLIR: Unleashing Its Potential by Overcoming the Limitations of Random Fuzzing.

Proc. ACM Softw. Eng., 2025

FMCC-RT: a scalable and fine-grained all-reduce algorithm for large-scale SMP clusters.

Sci. China Inf. Sci., 2025

Constraint-Driven Auto-Tuning of GEMM-like Operators for MT-3000 Many-core Processor.

Proceedings of the International Conference for High Performance Computing, 2025

Optimizing Direct Convolutions on High-Performance Multi-Core DSPs.

Proceedings of the 54th International Conference on Parallel Processing, 2025

Selection of Supervised Learning-Based Sparse Matrix Reordering Algorithms.

Proceedings of the 32nd IEEE International Conference on High Performance Computing, 2025

Me-MPK: Accelerating Krylov Subspace Solvers via Memory-efficient Matrix-Power Kernel.

Proceedings of the 62nd ACM/IEEE Design Automation Conference, 2025

2024

Mentor: A Memory-Efficient Sparse-dense Matrix Multiplication Accelerator Based on Column-Wise Product.

ACM Trans. Archit. Code Optim., December, 2024

Efficient compiler optimization by modeling passes dependence.

CCF Trans. High Perform. Comput., December, 2024

thSORT: an efficient parallel sorting algorithm on multi-core DSPs.

CCF Trans. High Perform. Comput., October, 2024

Editorial for the special issue on programming models and system software for High-Performance Computing (HPC) environments.

CCF Trans. High Perform. Comput., June, 2024

SNCL: a supernode OpenCL implementation for hybrid computing arrays.

J. Supercomput., May, 2024

Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs.

IEEE Trans. Parallel Distributed Syst., March, 2024

Enhancing Compiler Optimization with Reinforcement Learning and Monte Carlo Tree Search.

Proceedings of the 36th International Conference on Software Engineering and Knowledge Engineering, 2024

A Conflict-aware Divide-and-Conquer Algorithm for Symmetric Sparse Matrix-Vector Multiplication.

Proceedings of the International Conference for High Performance Computing, 2024

Towards Scalable Unstructured Mesh Computations on Shared Memory Many-Cores.

Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

GraphCube: Interconnection Hierarchy-aware Graph Processing.

Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

Optimizing General Matrix Multiplications on Modern Multi-core DSPs.

Proceedings of the IEEE International Parallel and Distributed Processing Symposium, 2024

Optimizing Stencil Computation on Multi-core DSPs.

Proceedings of the 53rd International Conference on Parallel Processing, 2024

Optimizing SpMV on Heterogeneous Multi-Core DSPs through Improved Locality and Vectorization.

Proceedings of the 53rd International Conference on Parallel Processing, 2024

2023

wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems.

J. Comput. Sci. Technol., December, 2023

Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000.

Frontiers Inf. Technol. Electron. Eng., 2023

Optimizing Direct Convolutions on ARM Multi-Cores.

Proceedings of the International Conference for High Performance Computing, 2023

Optimizing MPI Collectives on Shared Memory Multi-Cores.

Proceedings of the International Conference for High Performance Computing, 2023

Optimizing HPC I/O Performance with Regression Analysis and Ensemble Learning.

Proceedings of the IEEE International Conference on Cluster Computing, 2023

2022

FlowDNN: a physics-informed deep neural network for fast and accurate flow prediction.

Frontiers Inf. Technol. Electron. Eng., 2022

MT-3000: a heterogeneous multi-zone processor for HPC.

CCF Trans. High Perform. Comput., 2022

PipeFB: An Optimized Pipeline Parallelism Scheme to Reduce the Peak Memory Usage.

Proceedings of the Algorithms and Architectures for Parallel Processing, 2022

2021

BALS: Blocked Alternating Least Squares for Parallel Sparse Matrix Factorization on GPUs.

IEEE Trans. Parallel Distributed Syst., 2021

Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+.

J. Comput. Sci. Technol., 2021

LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores.

Proceedings of the International Conference for High Performance Computing, 2021

Characterizing Small-Scale Matrix Multiplications on ARMv8-based Many-Core Architectures.

Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Characterizing OpenMP Synchronization Implementations on ARMv8 Multi-Cores.

Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, 2021

Optimizing Barrier Synchronization on ARMv8 Many-Core Architectures.

Proceedings of the IEEE International Conference on Cluster Computing, 2021

2020

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures.

IEEE Trans. Parallel Distributed Syst., 2020

Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters.

IEEE Trans. Parallel Distributed Syst., 2020

Characterizing Scalability of Sparse Matrix-Vector Multiplications on Phytium FT-2000+.

Int. J. Parallel Program., 2020

clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization.

Future Gener. Comput. Syst., 2020

Parallel Programming Models for Heterogeneous Many-Cores : A Survey.

CoRR, 2020

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach.

CoRR, 2020

Parallel programming models for heterogeneous many-cores: a comprehensive survey.

CCF Trans. High Perform. Comput., 2020

NUMA-Aware Optimization of Sparse Matrix-Vector Multiplication on ARMv8-Based Many-Core Architectures.

Proceedings of the Network and Parallel Computing, 2020

FlowGAN: A Conditional Generative Adversarial Network for Flow Prediction in Various Conditions.

Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence, 2020

Dissecting the Phytium 2000+ Memory Hierarchy via Microbenchmarking.

Proceedings of the Advanced Computer Architecture - 13th Conference, 2020

Deep Program Structure Modeling Through Multi-Relational Graph-based Learning.

Proceedings of the PACT '20: International Conference on Parallel Architectures and Compilation Techniques, 2020

2019

Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization.

J. Supercomput., 2019

Optimizing Sparse Matrix-Vector Multiplications on an ARMv8-based Many-Core Architecture.

Int. J. Parallel Program., 2019

Characterizing Scalability of Sparse Matrix-Vector Multiplications on Phytium FT-2000+ Many-cores.

CoRR, 2019

Auto-Tuning MPI Collective Operations on Large-Scale Parallel Systems.

Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

2018

Benchmarking the GPU memory at the warp level.

Parallel Comput., 2018

Orchestrating parallel detection of strongly connected components on GPUs.

Parallel Comput., 2018

Moving from exascale to zettascale computing: challenges and techniques.

Frontiers Inf. Technol. Electron. Eng., 2018

Optimizing Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures.

CoRR, 2018

Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach.

CoRR, 2018

To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference.

Proceedings of the IEEE International Conference on Parallel & Distributed Processing with Applications, 2018

Auto-tuning Streamed Applications on Intel Xeon Phi.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Adaptive Optimization of Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures.

Proceedings of the 20th IEEE International Conference on High Performance Computing and Communications; 16th IEEE International Conference on Smart City; 4th IEEE International Conference on Data Science and Systems, 2018

Proteus: network-aware web browsing on heterogeneous mobile systems.

Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies, 2018

MOCL: an efficient openCL implementation for the matrix-2000 architecture.

Proceedings of the 15th ACM International Conference on Computing Frontiers, 2018

2017

Implementation and Performance Evaluation of Recommender Algorithms Based on Multi-/Many-core Platforms (多核/众核平台上推荐算法的实现与性能评估).

计算机科学 (Computer Science), 2017

Efficient and high-quality sparse graph coloring on GPUs.

Concurr. Comput. Pract. Exp., 2017

LU factorization on heterogeneous systems: an energy-efficient approach towards high performance.

Computing, 2017

High Performance Detection of Strongly Connected Components in Sparse Graphs on GPUs.

Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, 2017

Implementing and Evaluating OpenCL on an ARMv8 Multi-Core CPU.

Proceedings of the 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC), 2017

Efficient and Portable ALS Matrix Factorization for Recommender Systems.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

High Performance Coordinate Descent Matrix Factorization for Recommender Systems.

Proceedings of the Computing Frontiers Conference, 2017

2016

Evaluating Multiple Streams on Heterogeneous Platforms.

Parallel Process. Lett., 2016

Streaming Applications on Heterogeneous Platforms.

Proceedings of the Network and Parallel Computing, 2016

Evaluating Multi-core and Many-Core Architectures through Accelerating an Alternating Direction Implicit CFD Solver.

Proceedings of the 15th International Symposium on Parallel and Distributed Computing, 2016

Evaluating the Performance Impact of Multiple Streams on the MIC-Based Heterogeneous Platform.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

High Performance Parallel Graph Coloring on GPGPUs.

Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

An Energy-Efficient Implementation of LU Factorization on Heterogeneous Systems.

Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

2015

NEMO5: Achieving High-end Internode Communication for Performance Projection Beyond Moore's Law.

Robert Andrawis, José David Bermeo, Gerhard Klimeck, Zhengping Jiang, Daniel F. Mejia, Michael Povolotskyi, Santiago Alonso Pérez-Rubiano, Prasad Sarangapani, et al.

CoRR, 2015

Evaluating vector data type usage in OpenCL kernels.

Ana Lucia Varbanescu, et al.

Concurr. Comput. Pract. Exp., 2015

Realistic Performance Characterization of CFD Applications on Intel Many Integrated Core Architecture.

Comput. J., 2015

High Performance Computing of Fast Independent Component Analysis for Hyperspectral Image Dimensionality Reduction on MIC-Based Clusters.

Proceedings of the 44th International Conference on Parallel Processing Workshops, 2015

2014

Towards a Systematic Exploration of the Optimization Space for Many-Core Processors.

PhD thesis, 2014

Aristotle: A performance impact indicator for the OpenCL kernels using local memory.

Ana Lucia Varbanescu, et al.

Sci. Program., 2014

Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer.

J. Comput. Phys., 2014

Test-driving Intel Xeon Phi.

Ana Lucia Varbanescu, et al.

Proceedings of the ACM/SPEC International Conference on Performance Engineering, 2014

Parallel Computation of Non-Bonded Interactions in Drug Discovery: Nvidia GPUs vs. Intel Xeon Phi.

Ana Lucia Varbanescu, Baldomero Imbernon, José M. Cecilia, Horacio Emilio Pérez Sánchez, et al.

Proceedings of the International Work-Conference on Bioinformatics and Biomedical Engineering, 2014

Balancing CPU-GPU Collaborative High-Order CFD Simulations on the Tianhe-1A Supercomputer.

Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels.

Pekka Jääskeläinen, Ana Lucia Varbanescu, et al.

Proceedings of the 43rd International Conference on Parallel Processing, 2014

2013

An application-centric evaluation of OpenCL on multi-core CPUs.

Ana Lucia Varbanescu, et al.

Parallel Comput., 2013

An Empirical Study of Intel Xeon Phi.

Ana Lucia Varbanescu, et al.

CoRR, 2013

Parallelizing a High-Order CFD Software for 3D, Multi-block, Structural Grids on the TianHe-1A Supercomputer.

Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

Performance Traps in OpenCL for CPUs.

Ana Lucia Varbanescu, et al.

Proceedings of the 21st Euromicro International Conference on Parallel, 2013

ELMO: A User-Friendly API to Enable Local Memory in OpenCL Kernels.

Ana Lucia Varbanescu, et al.

Proceedings of the 21st Euromicro International Conference on Parallel, 2013

Sesame: A User-Transparent Optimizing Framework for Many-Core Processors.

Ana Lucia Varbanescu, et al.

Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

2012

Performance Gaps between OpenMP and OpenCL for Multi-core CPUs.

Ana Lucia Varbanescu, et al.

Proceedings of the 41st International Conference on Parallel Processing Workshops, 2012

Accelerating Cost Aggregation for Real-Time Stereo Matching.

Ana Lucia Varbanescu, Laurens van der Maaten, et al.

Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, 2012

2011

A Comprehensive Performance Comparison of CUDA and OpenCL.

Ana Lucia Varbanescu, et al.

Proceedings of the International Conference on Parallel Processing, 2011

An Auto-tuning Solution to Data Streams Clustering in OpenCL.

Ana Lucia Varbanescu, et al.

Proceedings of the 14th IEEE International Conference on Computational Science and Engineering, 2011

2010

Optimizing Adaptive Synchronization in Parallel Simulators for Large-scale Parallel Systems and Applications.

Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010
