Jianbin Fang

Orcid: 0000-0003-3542-4869

According to our database, Jianbin Fang authored at least 77 papers between 2010 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Optimizing Full-Spectrum Matrix Multiplications on ARMv8 Multi-Core CPUs.
IEEE Trans. Parallel Distributed Syst., March, 2024

Towards Scalable Unstructured Mesh Computations on Shared Memory Many-Cores.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

GraphCube: Interconnection Hierarchy-aware Graph Processing.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

2023
wrBench: Comparing Cache Architectures and Coherency Protocols on ARMv8 Many-Core Systems.
J. Comput. Sci. Technol., December, 2023

Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000.
Frontiers Inf. Technol. Electron. Eng., 2023

Optimizing Direct Convolutions on ARM Multi-Cores.
Proceedings of the International Conference for High Performance Computing, 2023

Optimizing MPI Collectives on Shared Memory Multi-Cores.
Proceedings of the International Conference for High Performance Computing, 2023

Optimizing HPC I/O Performance with Regression Analysis and Ensemble Learning.
Proceedings of the IEEE International Conference on Cluster Computing, 2023

2022
FlowDNN: a physics-informed deep neural network for fast and accurate flow prediction.
Frontiers Inf. Technol. Electron. Eng., 2022

MT-3000: a heterogeneous multi-zone processor for HPC.
CCF Trans. High Perform. Comput., 2022

PipeFB: An Optimized Pipeline Parallelism Scheme to Reduce the Peak Memory Usage.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2022

2021
BALS: Blocked Alternating Least Squares for Parallel Sparse Matrix Factorization on GPUs.
IEEE Trans. Parallel Distributed Syst., 2021

Performance Evaluation of Memory-Centric ARMv8 Many-Core Architectures: A Case Study with Phytium 2000+.
J. Comput. Sci. Technol., 2021

LIBSHALOM: optimizing small and irregular-shaped matrix multiplications on ARMv8 multi-cores.
Proceedings of the International Conference for High Performance Computing, 2021

Characterizing Small-Scale Matrix Multiplications on ARMv8-based Many-Core Architectures.
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Characterizing OpenMP Synchronization Implementations on ARMv8 Multi-Cores.
Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, 2021

Optimizing Barrier Synchronization on ARMv8 Many-Core Architectures.
Proceedings of the IEEE International Conference on Cluster Computing, 2021

2020
Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures.
IEEE Trans. Parallel Distributed Syst., 2020

Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters.
IEEE Trans. Parallel Distributed Syst., 2020

Characterizing Scalability of Sparse Matrix-Vector Multiplications on Phytium FT-2000+.
Int. J. Parallel Program., 2020

clMF: A fine-grained and portable alternating least squares algorithm for parallel matrix factorization.
Future Gener. Comput. Syst., 2020

Parallel Programming Models for Heterogeneous Many-Cores: A Survey.
CoRR, 2020

Optimizing Streaming Parallelism on Heterogeneous Many-Core Architectures: A Machine Learning Based Approach.
CoRR, 2020

Parallel programming models for heterogeneous many-cores: a comprehensive survey.
CCF Trans. High Perform. Comput., 2020

NUMA-Aware Optimization of Sparse Matrix-Vector Multiplication on ARMv8-Based Many-Core Architectures.
Proceedings of the Network and Parallel Computing, 2020

FlowGAN: A Conditional Generative Adversarial Network for Flow Prediction in Various Conditions.
Proceedings of the 32nd IEEE International Conference on Tools with Artificial Intelligence, 2020

Dissecting the Phytium 2000+ Memory Hierarchy via Microbenchmarking.
Proceedings of the Advanced Computer Architecture - 13th Conference, 2020

Deep Program Structure Modeling Through Multi-Relational Graph-based Learning.
Proceedings of the PACT '20: International Conference on Parallel Architectures and Compilation Techniques, 2020

2019
Toward fault-tolerant hybrid programming over large-scale heterogeneous clusters via checkpointing/restart optimization.
J. Supercomput., 2019

Optimizing Sparse Matrix-Vector Multiplications on an ARMv8-based Many-Core Architecture.
Int. J. Parallel Program., 2019

Characterizing Scalability of Sparse Matrix-Vector Multiplications on Phytium FT-2000+ Many-cores.
CoRR, 2019

Auto-Tuning MPI Collective Operations on Large-Scale Parallel Systems.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

2018
Benchmarking the GPU memory at the warp level.
Parallel Comput., 2018

Orchestrating parallel detection of strongly connected components on GPUs.
Parallel Comput., 2018

Moving from exascale to zettascale computing: challenges and techniques.
Frontiers Inf. Technol. Electron. Eng., 2018

Optimizing Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures.
CoRR, 2018

Tuning Streamed Applications on Intel Xeon Phi: A Machine Learning Based Approach.
CoRR, 2018

To Compress, or Not to Compress: Characterizing Deep Learning Model Compression for Embedded Inference.
Proceedings of the IEEE International Conference on Parallel & Distributed Processing with Applications, 2018

Auto-tuning Streamed Applications on Intel Xeon Phi.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Adaptive Optimization of Sparse Matrix-Vector Multiplication on Emerging Many-Core Architectures.
Proceedings of the 20th IEEE International Conference on High Performance Computing and Communications; 16th IEEE International Conference on Smart City; 4th IEEE International Conference on Data Science and Systems, 2018

Proteus: network-aware web browsing on heterogeneous mobile systems.
Proceedings of the 14th International Conference on emerging Networking EXperiments and Technologies, 2018

MOCL: an efficient OpenCL implementation for the Matrix-2000 architecture.
Proceedings of the 15th ACM International Conference on Computing Frontiers, 2018

2017
Implementation and Performance Evaluation of Recommender Algorithms Based on Multi-/Many-core Platforms (in Chinese).
计算机科学 (Computer Science), 2017

Efficient and high-quality sparse graph coloring on GPUs.
Concurr. Comput. Pract. Exp., 2017

LU factorization on heterogeneous systems: an energy-efficient approach towards high performance.
Computing, 2017

High Performance Detection of Strongly Connected Components in Sparse Graphs on GPUs.
Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores, 2017

Implementing and Evaluating OpenCL on an ARMv8 Multi-Core CPU.
Proceedings of the 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC), 2017

Efficient and Portable ALS Matrix Factorization for Recommender Systems.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

High Performance Coordinate Descent Matrix Factorization for Recommender Systems.
Proceedings of the Computing Frontiers Conference, 2017

2016
Evaluating Multiple Streams on Heterogeneous Platforms.
Parallel Process. Lett., 2016

Streaming Applications on Heterogeneous Platforms.
Proceedings of the Network and Parallel Computing, 2016

Evaluating Multi-core and Many-Core Architectures through Accelerating an Alternating Direction Implicit CFD Solver.
Proceedings of the 15th International Symposium on Parallel and Distributed Computing, 2016

Evaluating the Performance Impact of Multiple Streams on the MIC-Based Heterogeneous Platform.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

High Performance Parallel Graph Coloring on GPGPUs.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, 2016

An Energy-Efficient Implementation of LU Factorization on Heterogeneous Systems.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

2015
NEMO5: Achieving High-end Internode Communication for Performance Projection Beyond Moore's Law.
CoRR, 2015

Evaluating vector data type usage in OpenCL kernels.
Concurr. Comput. Pract. Exp., 2015

Realistic Performance Characterization of CFD Applications on Intel Many Integrated Core Architecture.
Comput. J., 2015

High Performance Computing of Fast Independent Component Analysis for Hyperspectral Image Dimensionality Reduction on MIC-Based Clusters.
Proceedings of the 44th International Conference on Parallel Processing Workshops, 2015

2014
Towards a Systematic Exploration of the Optimization Space for Many-Core Processors.
PhD thesis, 2014

Aristotle: A performance impact indicator for the OpenCL kernels using local memory.
Sci. Program., 2014

Collaborating CPU and GPU for large-scale high-order CFD simulations with complex grids on the TianHe-1A supercomputer.
J. Comput. Phys., 2014

Test-driving Intel Xeon Phi.
Proceedings of the ACM/SPEC International Conference on Performance Engineering, 2014

Parallel Computation of Non-Bonded Interactions in Drug Discovery: Nvidia GPUs vs. Intel Xeon Phi.
Proceedings of the International Work-Conference on Bioinformatics and Biomedical Engineering, 2014

Balancing CPU-GPU Collaborative High-Order CFD Simulations on the Tianhe-1A Supercomputer.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

2013
An application-centric evaluation of OpenCL on multi-core CPUs.
Parallel Comput., 2013

An Empirical Study of Intel Xeon Phi.
CoRR, 2013

Parallelizing a High-Order CFD Software for 3D, Multi-block, Structural Grids on the TianHe-1A Supercomputer.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

Performance Traps in OpenCL for CPUs.
Proceedings of the 21st Euromicro International Conference on Parallel, 2013

ELMO: A User-Friendly API to Enable Local Memory in OpenCL Kernels.
Proceedings of the 21st Euromicro International Conference on Parallel, 2013

Sesame: A User-Transparent Optimizing Framework for Many-Core Processors.
Proceedings of the 13th IEEE/ACM International Symposium on Cluster, 2013

2012
Performance Gaps between OpenMP and OpenCL for Multi-core CPUs.
Proceedings of the 41st International Conference on Parallel Processing Workshops, 2012

Accelerating Cost Aggregation for Real-Time Stereo Matching.
Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, 2012

2011
A Comprehensive Performance Comparison of CUDA and OpenCL.
Proceedings of the International Conference on Parallel Processing, 2011

An Auto-tuning Solution to Data Streams Clustering in OpenCL.
Proceedings of the 14th IEEE International Conference on Computational Science and Engineering, 2011

2010
Optimizing Adaptive Synchronization in Parallel Simulators for Large-scale Parallel Systems and Applications.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

