Daichi Mukunoki

2026

Evaluating Claude Code's Coding and Test Automation for GPU Acceleration of a Legacy Fortran Application: A GeoFEM Case Study.

Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops, 2026

2025

3Dify: a Framework for Procedural 3D-CG Generation Assisted by LLMs Using MCP and RAG.

CoRR, October, 2025

VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs.

CoRR, October, 2025

DGEMM without FP64 Arithmetic - Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme.

CoRR, August, 2025

Towards Generalized Parameter Tuning in Coherent Ising Machines: A Portfolio-Based Approach.

CoRR, July, 2025

Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation.

CoRR, July, 2025

Sparse Iterative Solvers Using High-Precision Arithmetic with Quasi Multi-Word Algorithms.

Katsuhisa Ozaki

Proceedings of the 18th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2025

Performance Evaluation of Loop Body Splitting for Fast Modal Filtering in SCALE-DG on A64FX.

Proceedings of the 2025 International Conference on High Performance Computing in Asia-Pacific Region Workshops, 2025

An Algorithm Portfolio Approach for Parameter Tuning in Coherent Ising Machines.

Proceedings of the Thirteenth International Symposium on Computing and Networking, CANDAR 2025, 2025

2024

Performance evaluation and modelling of single-precision matrix multiplication on Cerebras CS-2.

Ryunosuke Matsuzaki

Takaaki Miyajima

Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

Reduced-Precision and Reduced-Exponent Formats for Accelerating Adaptive Precision Sparse Matrix-Vector Product.

Proceedings of the Euro-Par 2024: Parallel Processing, 2024

2023

Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor.

Masatoshi Kawai

Proceedings of the 16th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2023

2022

Infinite-Precision Inner Product and Sparse Matrix-Vector Multiplication Using Ozaki Scheme with Dot2 on Manycore Processors.

Proceedings of the Parallel Processing and Applied Mathematics, 2022

2021

Task Scheduling Strategies for Batched Basic Linear Algebra Subprograms on Many-core CPUs.

Yusuke Hirota

Proceedings of the 14th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2021

Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?

Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Accurate Matrix Multiplication on Binary128 Format Accelerated by Ozaki Scheme.

Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

A Rapid Euclidean Norm Calculation Algorithm that Reduces Overflow and Underflow.

Proceedings of the Computational Science and Its Applications - ICCSA 2021, 2021

Conjugate Gradient Solvers with High Accuracy and Bit-wise Reproducibility between CPU and GPU using Ozaki scheme.

Proceedings of the HPC Asia 2021: The International Conference on High Performance Computing in Asia-Pacific Region, 2021

2020

Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs.

Takeshi Ogita

J. Comput. Appl. Math., 2020

White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing.

CoRR, 2020

Can We Avoid Rounding-Error Estimation in HPC Codes and Still Get Trustworthy Results?

Proceedings of the Software Verification - 12th International Conference, 2020

DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions.

Proceedings of the High Performance Computing - 35th International Conference, 2020

2019

Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-Core Architectures.

Takeshi Ogita

Katsuhisa Ozaki

Proceedings of the Parallel Processing and Applied Mathematics, 2019

Design of an FPGA-Based Matrix Multiplier with Task Parallelism.

Yiyu Tan

Proceedings of the Parallel Computing: Technology Trends, 2019

2018

Performance Analysis of 2D-compatible 2.5D-PDGEMM on Knights Landing Cluster.

Proceedings of the Computational Science - ICCS 2018, 2018

2017

Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer.

Proceedings of the Parallel Processing and Applied Mathematics, 2017

Design Towards Modern High Performance Numerical LA Library Enabling Heterogeneity and Flexible Data Formats.

Proceedings of the Parallel Computing is Everywhere, 2017

2016

Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs.

Proceedings of the 10th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2016

Reduced-Precision Floating-Point Formats on GPUs for High Performance and Energy Efficient Computation.

Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

2015

Fast Implementation of General Matrix-Vector Multiplication (GEMV) on Kepler GPUs.

Proceedings of the 23rd Euromicro International Conference on Parallel, 2015

2013

Using Quadruple Precision Arithmetic to Accelerate Krylov Subspace Methods on GPUs.

Proceedings of the Parallel Processing and Applied Mathematics, 2013

Optimization of Sparse Matrix-Vector Multiplication for CRS Format on NVIDIA Kepler Architecture GPUs.

Proceedings of the Computational Science and Its Applications - ICCSA 2013, 2013

2012

Implementation and Evaluation of Triple Precision BLAS Subroutines on GPUs.

Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

2010

Implementation and Evaluation of Quadruple Precision BLAS Functions on GPUs.
