Toshiyuki Imamura

2026

CoRR, March, 2026

Iterative Refinement for a Subset of Eigenvectors of Symmetric Matrices via Matrix Multiplications.

CoRR, February, 2026

Error Analysis of Matrix Multiplication Emulation Using Ozaki-II Scheme.

CoRR, February, 2026

Solving large-scale eigen problem in quantum few-body system on massive parallel computer.

Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops, 2026

Mixed-precision Interpolative Decomposition on GPUs.

Qianxiang Ma

Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region, 2026

Integrating Quantum and HPC: A Prototype Hybrid Implementation and Benchmark of Quantum-Selected Configuration Interaction.

Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops, 2026

Orchid: Towards Heterogeneous Batched Eigenvalue Solvers.

Matthew Chung

Keita Teranishi

Narasinga Rao Miniskar

Mohammad Alaul Haque Monil

Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops, 2026

2025

Emulation of Complex Matrix Multiplication based on the Chinese Remainder Theorem.

CoRR, December, 2025

Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique.

CoRR, April, 2025

ML-Based Optimum Number of CUDA Streams for the GPU Implementation of the Tridiagonal Partition Method.

Milena Veneva

CoRR, January, 2025

Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit.

Int. J. High Perform. Comput. Appl., 2025

High-Performance and Power-Efficient Emulation of Matrix Multiplication using INT8 Matrix Engines.

Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, 2025

Parallel Tall-and-Skinny QR Factorization Based on LU-CholeskyQR Algorithm.

Proceedings of the IEEE International Conference on Cluster Computing, 2025

2024

Iterative refinement for an eigenpair subset of a real symmetric matrix.

Takeshi Terao

JSIAM Lett., 2024

Interface for Sparse Linear Algebra Operations.

CoRR, 2024

High-Performance Eigensolver Combining EigenExa and Iterative Refinement.

Proceedings of the SC24-W: Workshops of the International Conference for High Performance Computing, 2024

2023

Sparse Matrix-Vector Multiplication with Reduced-Precision Memory Accessor.

Masatoshi Kawai

Proceedings of the 16th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2023

A new data conversion method for mixed precision Krylov solvers with FP16/BF16 Jacobi preconditioners.

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2023

2022

High Performance Parallel LOBPCG Method for Large Hamiltonian Derived from Hubbard Model on Multi-GPU Systems.

Proceedings of the Supercomputing Frontiers - 7th Asian Conference, 2022

GPU Optimization of Lattice Boltzmann Method with Local Ensemble Transform Kalman Filter.

Proceedings of the IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Heterogeneous Systems, 2022

Infinite-Precision Inner Product and Sparse Matrix-Vector Multiplication Using Ozaki Scheme with Dot2 on Manycore Processors.

Proceedings of the Parallel Processing and Applied Mathematics, 2022

2021

MLPerf HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems.

CoRR, 2021

Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations.

Proceedings of the 12th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2021

MLPerf™ HPC: A Holistic Benchmark Suite for Scientific Machine Learning on HPC Systems.

Proceedings of the IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, 2021

Task Scheduling Strategies for Batched Basic Linear Algebra Subprograms on Many-core CPUs.

Yusuke Hirota

Proceedings of the 14th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2021

Accurate Matrix Multiplication on Binary128 Format Accelerated by Ozaki Scheme.

Proceedings of the ICPP 2021: 50th International Conference on Parallel Processing, Lemont, IL, USA, August 9, 2021

A Rapid Euclidean Norm Calculation Algorithm that Reduces Overflow and Underflow.

Proceedings of the Computational Science and Its Applications - ICCSA 2021, 2021

2020

White Paper from Workshop on Large-scale Parallel Numerical Computing Technology (LSPANC 2020): HPC and Computer Arithmetic toward Minimal-Precision Computing.

CoRR, 2020

Error Analysis of the Cholesky QR-Based Block Orthogonalization Process for the One-Sided Block Jacobi SVD Algorithm.

Shuhei Kudo

Yusaku Yamamoto

Comput. Informatics, 2020

Can We Avoid Rounding-Error Estimation in HPC Codes and Still Get Trustworthy Results?

Proceedings of the Software Verification - 12th International Conference, 2020

DGEMM Using Tensor Cores, and Its Accurate and Reproducible Versions.

Proceedings of the High Performance Computing - 35th International Conference, 2020

Implementation and Numerical Techniques for One EFlop/s HPL-AI Benchmark on Fugaku.

Proceedings of the 11th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2020

A 1024-member ensemble data assimilation with 3.5-km mesh global weather simulations.

Proceedings of the International Conference for High Performance Computing, 2020

Acceleration of fusion plasma turbulence simulations using the mixed-precision communication-avoiding krylov method.

Proceedings of the International Conference for High Performance Computing, 2020

An FPGA-based Sound Field Rendering System.

Proceedings of the IEEE International Conference on Cluster Computing, 2020

Prompt Report on Exa-Scale HPL-AI Benchmark.

Proceedings of the IEEE International Conference on Cluster Computing, 2020

2019

High Performance Eigenvalue Solver for Hubbard Model: Tuning Strategies for LOBPCG Method on CUDA GPU.

Proceedings of the Parallel Computing: Technology Trends, 2019

Design of an FPGA-Based Matrix Multiplier with Task Parallelism.

Proceedings of the Parallel Computing: Technology Trends, 2019

Batched 3D-Distributed FFT Kernels Towards Practical DNS Codes.

Masaaki Aoki

Mitsuo Yokokawa

Proceedings of the Parallel Computing: Technology Trends, 2019

Cache-efficient implementation and batching of tridiagonalization on manycore CPUs.

Shuhei Kudo

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2019

2018

High Performance LOBPCG Method for Solving Multiple Eigenvalues of Hubbard Model: Efficiency of Communication Avoiding Neumann Expansion Preconditioner.

Proceedings of the Supercomputing Frontiers - 4th Asian Conference, 2018

Application of a Preconditioned Chebyshev Basis Communication-Avoiding Conjugate Gradient Method to a Multiphase Thermal-Hydraulic CFD Code.

Proceedings of the Supercomputing Frontiers - 4th Asian Conference, 2018

Optimization of Reordering Procedures in HOTRG for Distributed Parallel Computing.

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

A Case Study on Modeling the Performance of Dense Matrix Computation: Tridiagonalization in the EigenExa Eigensolver on the K Computer.

Takeshi Fukaya

Yusaku Yamamoto

Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Performance Analysis of 2D-compatible 2.5D-PDGEMM on Knights Landing Cluster.

Proceedings of the Computational Science - ICCS 2018, 2018

Performance Evaluation of a Toolkit for Sparse Tensor Decomposition.

Proceedings of the Poster Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018

2017

Application of a communication-avoiding generalized minimal residual method to a gyrokinetic five dimensional eulerian code on many core platforms.

Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2017

Implementation and Performance Analysis of 2.5D-PDGEMM on the K Computer.

Proceedings of the Parallel Processing and Applied Mathematics, 2017

Parallel Divide-and-Conquer Algorithm for Solving Tridiagonal Eigenvalue Problems on Manycore Systems.

Yusuke Hirota

Proceedings of the Parallel Processing and Applied Mathematics, 2017

Communication Avoiding Neumann Expansion Preconditioner for LOBPCG Method: Convergence Property of Exact Diagonalization Method for Hubbard Model.

Proceedings of the Parallel Computing is Everywhere, 2017

Design Towards Modern High Performance Numerical LA Library Enabling Heterogeneity and Flexible Data Formats.

Proceedings of the Parallel Computing is Everywhere, 2017

Quadruple-Precision BLAS Using Bailey's Arithmetic with FMA Instruction: Its Performance and Applications.

Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

An energy-efficient FPGA-based matrix multiplier.

Proceedings of the 24th IEEE International Conference on Electronics, Circuits and Systems, 2017

2016

Parallel implementation of 3D FFT with volumetric decomposition schemes for efficient molecular dynamics simulations.

Comput. Phys. Commun., 2016

Left-Preconditioned Communication-Avoiding Conjugate Gradient Methods for Multiphase CFD Simulations on the K Computer.

Proceedings of the 7th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, 2016

Automatic Thread-Block Size Adjustment for Memory-Bound BLAS Kernels on GPUs.

Daisuke Takahashi

Proceedings of the 10th IEEE International Symposium on Embedded Multicore/Many-core Systems-on-Chip, 2016

Reduced-Precision Floating-Point Formats on GPUs for High Performance and Energy Efficient Computation.

Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

2015

Performance Analysis of the Chebyshev Basis Conjugate Gradient Method on the K Computer.

Proceedings of the Parallel Processing and Applied Mathematics, 2015

Fast Implementation of General Matrix-Vector Multiplication (GEMV) on Kepler GPUs.

Daisuke Takahashi

Proceedings of the 23rd Euromicro International Conference on Parallel, 2015

High Performance Eigenvalue Solver in Exact-diagonalization Method for Hubbard Model on CUDA GPU.

Proceedings of the Parallel Computing: On the Road to Exascale, 2015

CAHTR: Communication-Avoiding Householder TRidiagonalization.

Proceedings of the Parallel Computing: On the Road to Exascale, 2015

Performance Evaluation of the Eigen Exa Eigensolver on Oakleaf-FX: Tridiagonalization Versus Pentadiagonalization.

Takeshi Fukaya

Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

2014

Implementation of d-Spline-based incremental performance parameter estimation method with ppOpen-AT.

Sci. Program., 2014

Communication-overlap techniques for improved strong scaling of gyrokinetic Eulerian code beyond 100k cores on the K-computer.

Int. J. High Perform. Comput. Appl., 2014

Performance Analysis of the Householder-Type Parallel Tall-Skinny QR Factorizations Toward Automatic Algorithm Selection.

Takeshi Fukaya

Yusaku Yamamoto

Proceedings of the High Performance Computing for Computational Science - VECPAR 2014 - 11th International Conference, Eugene, OR, USA, June 30, 2014

A Study of Parallel Data Compression Using Proper Orthogonal Decomposition on the K Computer.

Proceedings of the 14th Eurographics Symposium on Parallel Graphics and Visualization, 2014

2013

Eigen-G: GPU-Based Eigenvalue Solver for Real-Symmetric Dense Matrices.

Proceedings of the Parallel Processing and Applied Mathematics, 2013

Parallel Computing Design for Exact Diagonalization Scheme on Multi-band Hubbard Cluster Models.

Proceedings of the Parallel Computing: Accelerating Computational Science and Engineering (CSE), 2013

Proper orthogonal decomposition based parallel compression for visualizing big data on the K computer.

Proceedings of the IEEE Symposium on Large-Scale Data Analysis and Visualization, 2013

2012

A High Performance SYMV Kernel on a Fermi-core GPU.

Proceedings of the High Performance Computing for Computational Science, 2012

Poster: Preliminary Report for a High Precision Distributed Memory Parallel Eigenvalue Solver.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Preliminary Report for a High Precision Distributed Memory Parallel Eigenvalue Solver.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Poster: Communication Overlap Techniques for Improved Strong Scaling of Gyrokinetic Eulerian Code beyond 100k Cores on the K-Computer.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Abstract: Communication Overlap Techniques for Improved Strong Scaling of Gyrokinetic Eulerian Code beyond 100k Cores on the K-Computer.

Proceedings of the 2012 SC Companion: High Performance Computing, 2012

2011

Parallelization design on multi-core platforms in density matrix renormalization group toward 2-D quantum strongly-correlated systems.

Proceedings of the Conference on High Performance Computing Networking, 2011

2010

High-Performance Quantum Simulation for Coupled Josephson Junctions on the Earth Simulator: a Challenge To the Schrödinger Equation On 256^4 Grids.

Int. J. High Perform. Comput. Appl., 2010

2009

Narrow-band reduction approach of a DRSM eigensolver on a multicore-based cluster system.

Proceedings of the Parallel Computing: From Multicores and GPU's to Petascale, 2009

2007

Recursive multi-factoring algorithm for MPI allreduce.

Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2007

2006

Gordon Bell finalists I - High-performance computing for exact numerical approaches to quantum many-body problems on the earth simulator.

Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

2005

16.447 TFlops and 159-Billion-dimensional Exact-diagonalization for Trapped Fermion-Hubbard Model on the Earth Simulator.

Proceedings of the ACM/IEEE SC2005 Conference on High Performance Networking and Computing, 2005

10TFLOPS Eigenvalue Solver for Strongly-Correlated Fermions on the Earth Simulator.

Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2005

C-Stab: Cache Stabilizing Algorithm for a Numerical Library.

Proceedings of the IASTED International Conference on Parallel and Distributed Computing and Networks, 2005

An Evaluation Towards Automatically Tuned Eigensolvers.

Ken Naono

Proceedings of the Large-Scale Scientific Computing, 5th International Conference, 2005

Automatic Tuning Technique Exploring Within the Hardware-Specific Constrained Parameters.

Ken Naono

Proceedings of the Large-Scale Scientific Computing, 5th International Conference, 2005

16.14 TFLOPS Eigenvalue Solver on the Earth Simulator: Exact Diagonalization for Ultra Largescale Hamiltonian Matrix.

Proceedings of the High-Performance Computing - 6th International Symposium, 2005

2003

MPI-2 Support in Heterogeneous Computing Environment Using an SCore Cluster System.

Proceedings of the Parallel and Distributed Processing and Applications, 2003

A Visual Resource Integration Environment for Distributed Applications on the ITBL System.

Proceedings of the High Performance Computing, 5th International Symposium, 2003

Grid Computing Supporting System on ITBL Project.

Proceedings of the High Performance Computing, 5th International Symposium, 2003

2002

Stampi-I/O: A Flexible Parallel-I/O Library for Heterogeneous Computing Environment.

Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 9th European PVM/MPI Users' Group Meeting, Linz, Austria, September 29, 2002

2000

An Estimation of Complexity and Computational Costs for Vertical Block-Cyclic Distributed Parallel LU Factorization.

J. Supercomput., 2000

An Architecture of Stampi: MPI Library on a Cluster of Parallel Computers.
