Toshio Endo

Orcid: 0000-0001-7297-6211

According to our database, Toshio Endo authored at least 73 papers between 1997 and 2024.

Bibliography

2024
AshPipe: Asynchronous Hybrid Pipeline Parallel for DNN Training.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2024

Retargeting and Respecializing GPU Workloads for Performance Portability.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2024

2023
Exploiting Scratchpad Memory for Deep Temporal Blocking: A case study for 2D Jacobian 5-point iterative stencil kernel (j2d5pt).
CoRR, 2023

Pyramid Swin Transformer: Different-Size Windows Swin Transformer for Image Classification and Object Detection.
Proceedings of the 18th International Joint Conference on Computer Vision, 2023

High-Performance GPU-to-CPU Transpilation and Optimization via High-Level Parallel Constructs.
Proceedings of the 28th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2023

Revisiting Temporal Blocking Stencil Optimizations.
Proceedings of the 37th International Conference on Supercomputing, 2023

PERKS: a Locality-Optimized Execution Model for Iterative Memory-bound GPU Applications.
Proceedings of the 37th International Conference on Supercomputing, 2023

The Aggressive Oversubscribing Scheduling for Interactive Jobs on a Supercomputing System.
Proceedings of the IEEE High Performance Extreme Computing Conference, 2023

Effectiveness of the Oversubscribing Scheduling on Supercomputer Systems.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2023

Pyramid Swin Transformer for Multi-task: Expanding to More Computer Vision Tasks.
Proceedings of the Advanced Concepts for Intelligent Vision Systems, 2023

2022
mdx: A Cloud Platform for Supporting Data Science and Cross-Disciplinary Research Collaborations.
CoRR, 2022

Speed-up Single Shot Detector on GPU with CUDA.
Proceedings of the 23rd ACIS International Summer Virtual Conference on Software Engineering, 2022

Efficient Stencil Computation with Temporal Blocking by Halide DSL.
Proceedings of the IEEE Intl Conf on Parallel & Distributed Processing with Applications, 2022


2021
Measurement and Modeling of Performance of HPC Applications Towards Overcommitting Scheduling Systems.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2021

Performance Modeling of HPC Applications on Overcommitted Systems.
Proceedings of the HPC Asia 2021: The International Conference on High Performance Computing in Asia-Pacific Region, 2021

2020
Integrating Cache Oblivious Approach with Modern Processor Architecture: The Case of Floyd-Warshall Algorithm.
Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region, 2020

AN5D: automated stencil framework for high-degree temporal blocking on GPUs.
Proceedings of the CGO '20: 18th ACM/IEEE International Symposium on Code Generation and Optimization, 2020

2019
An Autotuning Framework for Scalable Execution of Tiled Code via Iterative Polyhedral Compilation.
ACM Trans. Archit. Code Optim., 2019

Profiling based Out-of-core Hybrid Method for Large Neural Networks.
CoRR, 2019

Profiling based out-of-core hybrid method for large neural networks: poster.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

2018
Characterizing Memory-Latency Sensitivity of Sparse Matrix Kernels.
Proceedings of the 26th Euromicro International Conference on Parallel, 2018

Applying Recursive Temporal Blocking for Stencil Computations to Deeper Memory Hierarchy.
Proceedings of the IEEE 7th Non-Volatile Memory Systems and Applications Symposium, 2018

Scalable RMA-based Communication Library Featuring Node-local NVMs.
Proceedings of the 2018 IEEE High Performance Extreme Computing Conference, 2018

Exhaustive evaluation of memory-latency sensitivity on manycore processors with large cache.
Proceedings of the 2nd International Conference on High Performance Compilation, 2018

2017
Applying Temporal Blocking with a Directive-based Approach.
Proceedings of the Fourth Workshop on the LLVM Compiler Infrastructure in HPC, 2017

An Accurate Simulator of Cache-Line Conflicts to Exploit the Underlying Cache Performance.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

A Stencil Framework to Realize Large-Scale Computations Beyond Device Memory Capacity on GPU Supercomputers.
Proceedings of the 2017 IEEE International Conference on Cluster Computing, 2017

ExanaDBT: A Dynamic Compilation System for Transparent Polyhedral Optimizations at Runtime.
Proceedings of the Computing Frontiers Conference, 2017

ooc_cuDNN: Accommodating convolutional neural networks over GPU memory capacity.
Proceedings of the 2017 IEEE International Conference on Big Data (IEEE BigData 2017), 2017

2016
PGAS Communication Runtime for Extreme Large Data Computation.
Proceedings of the Second International Workshop on Extreme Scale Programming Models and Middleware, 2016

Advanced Computing and Optimization Infrastructure for Extremely Large-Scale Graphs on Post Peta-Scale Supercomputers.
Proceedings of the Mathematical Software - ICMS 2016, 2016

Realizing Out-of-Core Stencil Computations Using Multi-tier Memory Hierarchy on GPGPU Clusters.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

From FLOPS to BYTES: disruptive change in high-performance computing towards the post-moore era.
Proceedings of the ACM International Conference on Computing Frontiers, CF'16, 2016

Evaluating the impacts of code-level performance tunings on power efficiency.
Proceedings of the 2016 IEEE International Conference on Big Data (IEEE BigData 2016), 2016

2015
Power Capping of CPU-GPU Heterogeneous Systems using Power and Performance Models.
Proceedings of the SMARTGREENS 2015, 2015

The scalable petascale data-driven approach for the Cholesky factorization with multiple GPUs.
Proceedings of the First International Workshop on Extreme Scale Programming Models and Middleware, 2015

Investigating potential performance benefits of memory layout optimization based on roofline model.
Proceedings of the 2nd International Workshop on Software Engineering for Parallel Systems, 2015

Exana: an execution-driven application analysis tool for assisting productive performance tuning.
Proceedings of the 2nd International Workshop on Software Engineering for Parallel Systems, 2015

Data Driven Scheduling Approach for the Multi-node Multi-GPU Cholesky Decomposition.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2015

Exploration of Lossy Compression for Application-Level Checkpoint/Restart.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Realizing Extremely Large-Scale Stencil Applications on GPU Supercomputers.
Proceedings of the 21st IEEE International Conference on Parallel and Distributed Systems, 2015

2014
Special Issue on Applications for the Heterogeneous Computing Era.
Int. J. High Perform. Comput. Appl., 2014

Petascale General Solver for Semidefinite Programming Problems with Over Two Million Constraints.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

TSUBAME-KFC: A modern liquid submersion cooling prototype towards exascale becoming the greenest supercomputer in the world.
Proceedings of the 20th IEEE International Conference on Parallel and Distributed Systems, 2014

An evaluation of the potential of flash SSD as large and slow memory for stencil computations.
Proceedings of the International Conference on High Performance Computing & Simulation, 2014

Software technologies coping with memory hierarchy of GPGPU clusters for stencil computations.
Proceedings of the 2014 IEEE International Conference on Cluster Computing, 2014

2013
A Multi-Level Optimization Method for Stencil Computation on the Domain that is Bigger than Memory Capacity of GPU.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

A parallel optimization method for stencil computation on the domain that is bigger than memory capacity of GPUs.
Proceedings of the 2013 IEEE International Conference on Cluster Computing, 2013

2012
High-performance general solver for extremely large-scale semidefinite programming problems.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

2011
Peta-scale phase-field simulation for dendritic solidification on the TSUBAME 2.0 supercomputer.
Proceedings of the Conference on High Performance Computing Networking, 2011

Petaflop biofluidics simulations on a two million-core system.
Proceedings of the Conference on High Performance Computing Networking, 2011

2010
An 80-Fold Speedup, 15.0 TFlops Full GPU Acceleration of Non-Hydrostatic Weather Model ASUCA Production Code.
Proceedings of the Conference on High Performance Computing Networking, 2010

Linpack evaluation on a supercomputer with heterogeneous accelerators.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Statistical power modeling of GPU kernels using performance counters.
Proceedings of the International Green Computing Conference 2010, 2010

2009
Power-aware dynamic task scheduling for heterogeneous accelerated clusters.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

File Clustering Based Replication Algorithm in a Grid Environment.
Proceedings of the 9th IEEE/ACM International Symposium on Cluster Computing and the Grid, 2009

2008
Bandwidth intensive 3-D FFT kernel for GPUs using CUDA.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

Locality aware MPI communication on a commodity opto-electronic hybrid network.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

An efficient, model-based CPU-GPU heterogeneous FFT library.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Performance evaluation of parallel applications on next generation memory architecture with power-aware paging method.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Massive supercomputing coping with heterogeneity of modern accelerators.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Access-pattern and bandwidth aware file replication algorithm in a grid environment.
Proceedings of the 9th IEEE/ACM International Conference on Grid Computing (Grid 2008), Tsukuba, Japan, September 29, 2008

Environmental-aware optimization of MPI checkpointing intervals.
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

2007
ABARIS: An Adaptable Fault Detection/Recovery Component Framework for MPIs.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

High-Performance MPI Broadcast Algorithm for Grid Environments Utilizing Multi-lane NICs.
Proceedings of the Seventh IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2007), 2007

2005
Highly latency tolerant Gaussian elimination.
Proceedings of the 6th IEEE/ACM International Conference on Grid Computing (GRID 2005), 2005

2004
High performance LU factorization for non-dedicated clusters.
Proceedings of the 4th IEEE/ACM International Symposium on Cluster Computing and the Grid (CCGrid 2004), 2004

2003
Phoenix: a parallel programming model for accommodating dynamically joining/leaving resources.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2003

2002
Reducing pause time of conservative collectors.
Proceedings of The Workshop on Memory Systems Performance (MSP 2002), 2002

2001
Predicting Scalability of Parallel Garbage Collectors on Shared Memory Multiprocessors.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

1998
On a High-Speed Hough Transform Algorithm MRHT.
Proceedings of IAPR Workshop on Machine Vision Applications, 1998

1997
A Scalable Mark-Sweep Garbage Collector on Large-Scale Shared-Memory Machines.
Proceedings of the ACM/IEEE Conference on Supercomputing, 1997

