Tal Ben-Nun

Orcid: 0000-0002-3657-6568

According to our database, Tal Ben-Nun authored at least 71 papers between 2009 and 2024.

Bibliography

2024
Arrow Matrix Decomposition: A Novel Approach for Communication-Efficient Sparse Matrix Multiplication.
Proceedings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, 2024

2023
ComPile: A Large IR Dataset from Production Sources.
CoRR, 2023

Cached Operator Reordering: A Unified View for Fast GNN Training.
CoRR, 2023

STen: Productive and Efficient Sparsity in PyTorch.
CoRR, 2023

Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization.
CoRR, 2023

A Theory of I/O-Efficient Sparse Neural Network Inference.
CoRR, 2023

FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs.
Proceedings of the International Conference for High Performance Computing, 2023

VENOM: A Vectorized N:M Format for Unleashing the Power of Sparse Tensor Cores.
Proceedings of the International Conference for High Performance Computing, 2023

Performance Embeddings: A Similarity-Based Transfer Tuning Approach to Performance Optimization.
Proceedings of the 37th International Conference on Supercomputing, 2023

Maximum Flows in Parametric Graph Templates.
Proceedings of the Algorithms and Complexity - 13th International Conference, 2023

Bridging Control-Centric and Data-Centric Optimization.
Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, 2023

2022
Python FPGA Programming with Data-Centric Multi-Level Design.
CoRR, 2022

ENS-10: A Dataset For Post-Processing Ensemble Weather Forecast.
CoRR, 2022

Deinsum: Practically I/O Optimal Multilinear Algebra.
CoRR, 2022

The spatial computer: A model for energy-efficient parallel computation.
CoRR, 2022

Deinsum: Practically I/O Optimal Multi-Linear Algebra.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Boosting Performance Optimization with Interactive Data Movement Visualization.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Productive Performance Engineering for Weather and Climate Modeling with Python.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

ENS-10: A Dataset For Post-Processing Ensemble Weather Forecasts.
Proceedings of the Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, 2022

A data-centric optimization framework for machine learning.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

Lifting C semantics for dataflow optimization.
Proceedings of the ICS '22: 2022 International Conference on Supercomputing, Virtual Event, June 28, 2022

Temporal Vectorization: A Compiler Approach to Automatic Multi-Pumping.
Proceedings of the 41st IEEE/ACM International Conference on Computer-Aided Design, 2022

2021
Breaking (Global) Barriers in Parallel Stochastic Optimization With Wait-Avoiding Group Averaging.
IEEE Trans. Parallel Distributed Syst., 2021

Sparsity in Deep Learning: Pruning and growth for efficient inference and training in neural networks.
J. Mach. Learn. Res., 2021

Learning Combinatorial Node Labeling Algorithms.
CoRR, 2021

Pebbles, Graphs, and a Pinch of Combinatorics: Towards Tight I/O Lower Bounds for Statically Analyzable Programs.
Proceedings of the SPAA '21: 33rd ACM Symposium on Parallelism in Algorithms and Architectures, 2021

Productivity, portability, performance: data-centric Python.
Proceedings of the International Conference for High Performance Computing, 2021

On the parallel I/O optimality of linear algebra kernels: near-optimal matrix factorizations.
Proceedings of the International Conference for High Performance Computing, 2021

Clairvoyant prefetching for distributed machine learning I/O.
Proceedings of the International Conference for High Performance Computing, 2021

On the parallel I/O optimality of linear algebra kernels: near-optimal LU factorization.
Proceedings of the PPoPP '21: 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2021

Data Movement Is All You Need: A Case Study on Optimizing Transformers.
Proceedings of Machine Learning and Systems 2021, 2021

NPBench: a benchmarking suite for high-performance NumPy.
Proceedings of the ICS '21: 2021 International Conference on Supercomputing, 2021

ProGraML: A Graph-based Program Representation for Data Flow Analysis and Compiler Optimizations.
Proceedings of the 38th International Conference on Machine Learning, 2021

StencilFlow: Mapping Large Stencil Programs to Distributed Spatial Computing Systems.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2021

2020
Substream-Centric Maximum Matchings on FPGA.
ACM Trans. Reconfigurable Technol. Syst., 2020

Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing.
ACM Trans. Parallel Comput., 2020

Deep Data Flow Analysis.
CoRR, 2020

Parametric Graph Templates: Properties and Algorithms.
CoRR, 2020

Deep Learning for Post-Processing Ensemble Weather Forecasts.
CoRR, 2020

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging.
CoRR, 2020

ProGraML: Graph-based Deep Learning for Program Optimization and Analysis.
CoRR, 2020

Workflows are the New Applications: Challenges in Performance, Portability, and Productivity.
Proceedings of the IEEE/ACM International Workshop on Performance, 2020

Taming unbalanced training workloads in deep learning with partial collective operations.
Proceedings of the PPoPP '20: 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020

Augment Your Batch: Improving Generalization Through Instance Repetition.
Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2019
Demystifying Parallel and Distributed Deep Learning: An In-depth Concurrency Analysis.
ACM Comput. Surv., 2019

A Data-Centric Approach to Extreme-Scale Ab initio Dissipative Quantum Transport Simulations.
CoRR, 2019

Predicting Weather Uncertainty with Deep Convnets.
CoRR, 2019

Mix & Match: training convnets with mixed image sizes for improved accuracy, speed and scale resiliency.
CoRR, 2019

Graph Processing on FPGAs: Taxonomy, Survey, Challenges.
CoRR, 2019

Stateful Dataflow Multigraphs: A Data-Centric Model for High-Performance Parallel Programs.
CoRR, 2019

Augment your batch: better training with larger batches.
CoRR, 2019

Optimizing the data movement in quantum transport simulations via data-centric parallel programming.
Proceedings of the International Conference for High Performance Computing, 2019

A data-centric approach to extreme-scale ab initio dissipative quantum transport simulations.
Proceedings of the International Conference for High Performance Computing, 2019

Stateful dataflow multigraphs: a data-centric model for performance portability on heterogeneous architectures.
Proceedings of the International Conference for High Performance Computing, 2019

A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Substream-Centric Maximum Matchings on FPGA.
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019

2018
μ-cuDNN: Accelerating Deep Learning Frameworks with Micro-Batching.
CoRR, 2018

Neural Code Comprehension: A Learnable Representation of Code Semantics.
Proceedings of the Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, 2018

Optimizing Parallel Graph Connectivity Computation via Subgraph Sampling.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Accelerating Deep Learning Frameworks with Micro-Batches.
Proceedings of the IEEE International Conference on Cluster Computing, 2018

2017
Groute: An Asynchronous Multi-GPU Programming Model for Irregular Computations.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Big data causing big (TLB) problems: taming random memory accesses on the GPU.
Proceedings of the 13th International Workshop on Data Management on New Hardware, 2017

2016
FFMK: A Fast and Fault-Tolerant Microkernel-Based System for Exascale Computing.
Proceedings of the Software for Exascale Computing - SPPEXA 2013-2015, 2016

Memory-Oriented Programming: A Data-Centric Programming Model for Systems with Multiple Parallel Accelerators (additional title page in Hebrew: Memory-Oriented Programming: A Programming Model for Systems with Multiple Parallel Accelerators).
PhD thesis, 2016

Spline-based parallel nonlinear optimization of function sequences.
J. Parallel Distributed Comput., 2016

Reciprocal Grids: A Hierarchical Algorithm for Computing Solution X-ray Scattering Curves from Supramolecular Complexes at High Resolution.
J. Chem. Inf. Model., 2016

Adaptive Work-Efficient Connected Components on the GPU.
CoRR, 2016

2015
Memory access patterns: the missing piece of the multi-GPU puzzle.
Proceedings of the International Conference for High Performance Computing, 2015

2014
MAPS: Optimizing Massively Parallel Applications Using Device-Level Memory Abstraction.
ACM Trans. Archit. Code Optim., 2014

2010
Design and implementation of a generic resource sharing virtual time dispatcher.
Proceedings of SYSTOR 2010: The 3rd Annual Haifa Experimental Systems Conference, 2010

2009
A global scheduling framework for virtualization environments.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

