Daichi Mukunoki

Orcid: 0000-0002-0051-6811

According to our database, Daichi Mukunoki authored at least 38 papers between 2010 and 2026.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2026
Layer-wise MoE Routing Locality under Shared-Prefix Code Generation: Token-Identity Decomposition and Compile-Equivalent Fork Redundancy.
CoRR, April, 2026

Improving HPC Code Generation Capability of LLMs via Online Reinforcement Learning with Real-Machine Benchmark Rewards.
CoRR, February, 2026

Learning-Augmented Performance Model for Tensor Product Factorization in High-Order FEM.
IEEE Access, 2026

DGEMM using FP64 Arithmetic Emulation and FP8 Tensor Cores with Ozaki Scheme.
Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops, 2026

Evaluating Claude Code's Coding and Test Automation for GPU Acceleration of a Legacy Fortran Application: A GeoFEM Case Study.
Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops, 2026

2025
3Dify: a Framework for Procedural 3D-CG Generation Assisted by LLMs Using MCP and RAG.
CoRR, October, 2025

VibeCodeHPC: An Agent-Based Iterative Prompting Auto-Tuner for HPC Code Generation Using LLMs.
CoRR, October, 2025

DGEMM without FP64 Arithmetic - Using FP64 Emulation and FP8 Tensor Cores with Ozaki Scheme.
CoRR, August, 2025

Towards Generalized Parameter Tuning in Coherent Ising Machines: A Portfolio-Based Approach.
CoRR, July, 2025

Performance Evaluation of General Purpose Large Language Models for Basic Linear Algebra Subprograms Code Generation.
CoRR, July, 2025

Sparse Iterative Solvers Using High-Precision Arithmetic with Quasi Multi-Word Algorithms.
Proceedings of the 18th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2025

Performance Evaluation of Loop Body Splitting for Fast Modal Filtering in SCALE-DG on A64FX.
Proceedings of the 2025 International Conference on High Performance Computing in Asia-Pacific Region Workshops, 2025

An Algorithm Portfolio Approach for Parameter Tuning in Coherent Ising Machines.
Proceedings of the Thirteenth International Symposium on Computing and Networking, CANDAR 2025, 2025

2024
Performance evaluation and modelling of single-precision matrix multiplication on Cerebras CS-2.
Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

Reduced-Precision and Reduced-Exponent Formats for Accelerating Adaptive Precision Sparse Matrix-Vector Product.
Proceedings of the Euro-Par 2024: Parallel Processing, 2024

2023
Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor.
Proceedings of the 16th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2023

2022
Infinite-Precision Inner Product and Sparse Matrix-Vector Multiplication Using Ozaki Scheme with Dot2 on Manycore Processors.
Proceedings of the Parallel Processing and Applied Mathematics, 2022

2021
Task Scheduling Strategies for Batched Basic Linear Algebra Subprograms on Many-core CPUs.
Proceedings of the 14th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2021

Matrix Engines for High Performance Computing: A Paragon of Performance or Grasping at Straws?
Proceedings of the 35th IEEE International Parallel and Distributed Processing Symposium, 2021

Accurate Matrix Multiplication on Binary128 Format Accelerated by Ozaki Scheme.
Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

A Rapid Euclidean Norm Calculation Algorithm that Reduces Overflow and Underflow.
Proceedings of the Computational Science and Its Applications - ICCSA 2021, 2021

Conjugate Gradient Solvers with High Accuracy and Bit-wise Reproducibility between CPU and GPU using Ozaki scheme.
Proceedings of the HPC Asia 2021: The International Conference on High Performance Computing in Asia-Pacific Region, 2021

2020
Performance and energy consumption of accurate and mixed-precision linear algebra kernels on GPUs.
J. Comput. Appl. Math., 2020

White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing.
CoRR, 2020

Can We Avoid Rounding-Error Estimation in HPC Codes and Still Get Trustworthy Results?
Proceedings of the Software Verification - 12th International Conference, 2020

DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions.
Proceedings of the High Performance Computing - 35th International Conference, 2020

2019
Reproducible BLAS Routines with Tunable Accuracy Using Ozaki Scheme for Many-Core Architectures.
Proceedings of the Parallel Processing and Applied Mathematics, 2019

Design of an FPGA-Based Matrix Multiplier with Task Parallelism.
Proceedings of the Parallel Computing: Technology Trends, 2019

2018
Performance Analysis of 2D-compatible 2.5D-PDGEMM on Knights Landing Cluster.
Proceedings of the Computational Science - ICCS 2018, 2018

2017
Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer.
Proceedings of the Parallel Processing and Applied Mathematics, 2017

Design Towards Modern High Performance Numerical LA Library Enabling Heterogeneity and Flexible Data Formats.
Proceedings of the Parallel Computing is Everywhere, 2017

2016
Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs.
Proceedings of the 10th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2016

Reduced-Precision Floating-Point Formats on GPUs for High Performance and Energy Efficient Computation.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

2015
Fast Implementation of General Matrix-Vector Multiplication (GEMV) on Kepler GPUs.
Proceedings of the 23rd Euromicro International Conference on Parallel, 2015

2013
Using Quadruple Precision Arithmetic to Accelerate Krylov Subspace Methods on GPUs.
Proceedings of the Parallel Processing and Applied Mathematics, 2013

Optimization of Sparse Matrix-Vector Multiplication for CRS Format on NVIDIA Kepler Architecture GPUs.
Proceedings of the Computational Science and Its Applications - ICCSA 2013, 2013

2012
Implementation and Evaluation of Triple Precision BLAS Subroutines on GPUs.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

2010
Implementation and Evaluation of Quadruple Precision BLAS Functions on GPUs.
Proceedings of the Applied Parallel and Scientific Computing, 2010

