Kazuya Matsumoto

Yoichi Tomioka

Stanislav Sedukhin

Proceedings of the Computational Science and Its Applications - ICCSA 2022, 2022

2019

Implementation and performance evaluation of a communication-avoiding GMRES method for stencil-based code on GPU cluster.

J. Supercomput., 2019

Brain-inspired Co-design of Algorithm/Architecture for CNN Accelerators.

Stanislav Sedukhin

Yoichi Tomioka

Proceedings of the 8th International Congress on Advanced Applied Informatics, 2019

Effectiveness of performance tuning techniques for general matrix multiplication on the PEZY-SC2.

Toshiaki Hishinuma

Proceedings of the 10th International Symposium on Highly-Efficient Accelerators and Reconfigurable Technologies, 2019

2017

Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional Eulerian code on many core platforms.

Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2017

2016

Implementation and Evaluation of NAS Parallel CG Benchmark on GPU Cluster with Proprietary Interconnect TCA.

Proceedings of the High Performance Computing for Computational Science - VECPAR 2016, 2016

2015

Implementation of CG Method on GPU Cluster with Proprietary Interconnect TCA for GPU Direct Communication.

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Improving Strong-Scaling on GPU Cluster Based on Tightly Coupled Accelerators Architecture.

Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

Evaluation of FFT for GPU Cluster Using Tightly Coupled Accelerators Architecture.

Proceedings of the 2015 IEEE International Conference on Cluster Computing, 2015

2012

Blocked United Algorithm for the All-Pairs Shortest Paths Problem on Hybrid CPU-GPU Systems.

IEICE Trans. Inf. Syst., 2012

Performance Tuning of Matrix Multiplication in OpenCL on Different GPUs and CPUs.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Implementing a Code Generator for Fast Matrix Multiplication in OpenCL on the GPU.

Proceedings of the IEEE 6th International Symposium on Embedded Multicore/Manycore SoCs, 2012

2011

Multi-level Optimization of Matrix Multiplication for GPU-equipped Systems.

Proceedings of the International Conference on Computational Science, 2011

Blocked All-Pairs Shortest Paths Algorithm for Hybrid CPU-GPU System.

Proceedings of the 13th IEEE International Conference on High Performance Computing & Communication, 2011

2010

Matrix Multiply-Add in Min-plus Algebra on a Short-Vector SIMD Processor of Cell/B.E.

Proceedings of the First International Conference on Networking and Computing, 2010

2009

A Solution of the All-Pairs Shortest Paths Problem on the Cell Broadband Engine Processor.

IEICE Trans. Inf. Syst., 2009

Matrix Inversion on the Cell/B.E. Processor.

Shodai Yokoyama