Torsten Hoefler

According to our database, Torsten Hoefler
  • authored at least 166 papers between 2005 and 2017.
  • has a "Dijkstra number" of four.

Bibliography

2017
Trends in Data Locality Abstractions for HPC Systems.
IEEE Trans. Parallel Distrib. Syst., 2017

Distributed Join Algorithms on Thousands of Cores.
PVLDB, 2017

Designing Databases for Future High-Performance Networks.
IEEE Data Eng. Bull., 2017

sPIN: High-performance streaming Processing in the Network.
CoRR, 2017

Communication Lower Bounds of Bilinear Algorithms for Symmetric Tensor Contractions.
CoRR, 2017

A Communication-Avoiding Parallel Algorithm for the Symmetric Eigenvalue Problem.
Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures, 2017

Scaling betweenness centrality using communication-efficient sparse matrix multiplication.
Proceedings of the International Conference for High Performance Computing, 2017

sPIN: high-performance streaming processing in the network.
Proceedings of the International Conference for High Performance Computing, 2017

Isoefficiency in Practice: Configuring and Understanding the Performance of Task-based Applications.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

POSTER: Cache-Oblivious MPI All-to-All Communications on Many-Core Architectures.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Communication-Avoiding Parallel Algorithms for Solving Triangular Systems of Linear Equations.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

IPDRM Workshop Introduction.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Corrected Gossip Algorithms for Fast Reliable Broadcast on Unreliable Systems.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

EMBRACE Keynote.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium Workshops, 2017

Transparent Caching for RMA Systems.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

SlimSell: A Vectorizable Graph Representation for Breadth-First Search.
Proceedings of the 2017 IEEE International Parallel and Distributed Processing Symposium, 2017

Model-Driven Choice of Numerical Methods for the Solution of the Linear Advection Equation.
Proceedings of the International Conference on Computational Science, 2017

AllConcur: Leaderless Concurrent Atomic Broadcast.
Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 2017

To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations.
Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 2017

An Effective Queuing Scheme to Provide Slim Fly Topologies with HoL Blocking Reduction and Deadlock Freedom for Minimal-Path Routing.
Proceedings of the 3rd IEEE International Workshop on High-Performance Interconnection Networks in the Exascale and Big-Data Era, 2017

Improving Non-minimal and Adaptive Routing Algorithms in Slim Fly Networks.
Proceedings of the 25th IEEE Annual Symposium on High-Performance Interconnects, 2017

Fast Networks and Slow Memories: A Mechanism for Mitigating Bandwidth Mismatches.
Proceedings of the 25th IEEE Annual Symposium on High-Performance Interconnects, 2017

Multi-agent Pathfinding with n Agents on Graphs with n Vertices: Combinatorial Classification and Tight Algorithmic Bounds.
Proceedings of the Algorithms and Complexity - 10th International Conference, 2017

2016
Automatic Performance Modeling of HPC Applications.
Proceedings of the Software for Exascale Computing - SPPEXA 2013-2015, 2016

Cache Line Aware Algorithm Design for Cache-Coherent Architectures.
IEEE Trans. Parallel Distrib. Syst., 2016

Exploiting Offload-Enabled Network Interfaces.
IEEE Micro, 2016

On noise and the performance benefit of nonblocking collectives.
IJHPCA, 2016

Communication-Avoiding Parallel Algorithms for Solving Triangular Systems of Linear Equations.
CoRR, 2016

Betweenness Centrality is more Parallelizable than Dense Matrix Multiplication.
CoRR, 2016

A communication-avoiding parallel algorithm for the symmetric eigenvalue problem.
CoRR, 2016

SDNsec: Forwarding Accountability for the SDN Data Plane.
CoRR, 2016

AllConcur: Leaderless Concurrent Atomic Broadcast (Extended Version).
CoRR, 2016

Extreme scale plasma turbulence simulations on top supercomputers worldwide.
Proceedings of the International Conference for High Performance Computing, 2016

A PCIe congestion-aware performance model for densely populated accelerator servers.
Proceedings of the International Conference for High Performance Computing, 2016

dCUDA: hardware supported overlap of computation and communication.
Proceedings of the International Conference for High Performance Computing, 2016

Scheduling-aware routing for supercomputers.
Proceedings of the International Conference for High Performance Computing, 2016

Selecting Technical Papers for an Interdisciplinary Conference: The PASC Review Process.
Proceedings of the Platform for Advanced Scientific Computing Conference, 2016

Modeling and analysis of remote memory access programming.
Proceedings of the 2016 ACM SIGPLAN International Conference on Object-Oriented Programming, 2016

Polly-ACC: Transparent compilation to heterogeneous hardware.
Proceedings of the 2016 International Conference on Supercomputing, 2016

SDNsec: Forwarding Accountability for the SDN Data Plane.
Proceedings of the 25th International Conference on Computer Communication and Networks, 2016

High-Performance Distributed RMA Locks.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

Routing on the Dependency Graph: A New Approach to Deadlock-Free High-Performance Routing.
Proceedings of the 25th ACM International Symposium on High-Performance Parallel and Distributed Computing, 2016

Ensuring Deadlock-Freedom in Low-Diameter InfiniBand Networks.
Proceedings of the 24th IEEE Annual Symposium on High-Performance Interconnects, 2016

Fast Multi-parameter Performance Modeling.
Proceedings of the 2016 IEEE International Conference on Cluster Computing, 2016

2015
Remote Memory Access Programming in MPI-3.
TOPC, 2015

Introduction to the Special Issue on SPAA 2013.
TOPC, 2015

Operating systems and runtime environments on supercomputers.
IJHPCA, 2015

Sparse Tensor Algebra as a Parallel Programming Model.
CoRR, 2015

Cost-effective diameter-two topologies: analysis and evaluation.
Proceedings of the International Conference for High Performance Computing, 2015

Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results.
Proceedings of the International Conference for High Performance Computing, 2015

HIPS-LSPP Keynotes.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Notified Access: Extending Remote Memory Access Programming Models for Producer-Consumer Synchronization.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Exascaling Your Library: Will Your Implementation Meet Your Expectations?
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

MODESTO: Data-centric Analytic Optimization of Complex Stencil Programs on Heterogeneous Architectures.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Active Access: A Mechanism for High-Performance Distributed Data-Centric Computations.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Cache Line Aware Optimizations for ccNUMA Systems.
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

DARE: High-Performance State Machine Replication on RDMA Networks.
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

Accelerating Irregular Computations with Hardware Transactional Memory and Active Messages.
Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, 2015

Distributing the Data Plane for Remote Storage Access.
Proceedings of the 15th Workshop on Hot Topics in Operating Systems, 2015

Exploiting Offload Enabled Network Interfaces.
Proceedings of the 23rd IEEE Annual Symposium on High-Performance Interconnects, 2015

Source-Based Path Selection: The Data Plane Perspective.
Proceedings of the 10th International Conference on Future Internet, 2015

Evaluating the Cost of Atomic Operations on Modern Architectures.
Proceedings of the 2015 International Conference on Parallel Architecture and Compilation, 2015

Using Compiler Techniques to Improve Automatic Performance Modeling.
Proceedings of the 2015 International Conference on Parallel Architecture and Compilation, 2015

2014
Enabling highly-scalable remote memory access programming with MPI-3 One Sided.
Scientific Programming, 2014

Application-oriented ping-pong benchmarking: how to assess the real communication overheads.
Computing, 2014

Improved MPI collectives for MPI processes in shared address spaces.
Cluster Computing, 2014

Automatic complexity analysis of explicitly parallel programs.
Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, 2014

Understanding the Effects of Communication and Coordination on Checkpointing at Scale.
Proceedings of the International Conference for High Performance Computing, 2014

Fail-in-Place Network Design: Interaction Between Topology, Routing Algorithm and Failures.
Proceedings of the International Conference for High Performance Computing, 2014

Slim Fly: A Cost Effective Low-Diameter Network Topology.
Proceedings of the International Conference for High Performance Computing, 2014

Exploring the effect of noise on the performance benefit of nonblocking allreduce.
Proceedings of the 21st European MPI Users' Group Meeting, 2014

Designing Bit-Reproducible Portable High-Performance Applications.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Efficient task placement and routing of nearest neighbor exchanges in dragonfly networks.
Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, 2014

Fault tolerance for remote memory access programming models.
Proceedings of the 23rd International Symposium on High-Performance Parallel and Distributed Computing, 2014

Catwalk: A Quick Development Path for Performance Models.
Proceedings of the Euro-Par 2014: Parallel Processing Workshops, 2014

PEMOGEN: automatic adaptive performance modeling during program runtime.
Proceedings of the International Conference on Parallel Architectures and Compilation, 2014

2013
Fast pattern-specific routing for fat tree networks.
TACO, 2013

Operating systems and runtime environments on supercomputers.
IJHPCA, 2013

MPI + MPI: a new hybrid approach to parallel programming with MPI plus shared memory.
Computing, 2013

Using Simulation to Evaluate the Performance of Resilience Strategies at Scale.
Proceedings of the High Performance Computing Systems. Performance Modeling, Benchmarking and Simulation, 2013

Enabling highly-scalable remote memory access programming with MPI-3 one sided.
Proceedings of the International Conference for High Performance Computing, 2013

Hybrid MPI: efficient message passing for multi-core systems.
Proceedings of the International Conference for High Performance Computing, 2013

Using automated performance modeling to find scalability bugs in complex codes.
Proceedings of the International Conference for High Performance Computing, 2013

MPI datatype processing using runtime compilation.
Proceedings of the 20th European MPI Users' Group Meeting, 2013

Ownership passing: efficient distributed memory programming on multi-core systems.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

Compiler Optimizations for Non-contiguous Remote Data Movement.
Proceedings of the Languages and Compilers for Parallel Computing, 2013

Bandwidth-optimal all-to-all exchanges in fat tree networks.
Proceedings of the International Conference on Supercomputing, 2013

Protocols for Fully Offloaded Collective Operations on Accelerated Network Adapters.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

Modeling communication in cache-coherent SMP systems: a case-study with Xeon Phi.
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

NUMA-aware shared-memory collective communication for MPI.
Proceedings of the 22nd International Symposium on High-Performance Parallel and Distributed Computing, 2013

Topic 13: High-Performance Networks and Communication - (Introduction).
Proceedings of the Euro-Par 2013 Parallel Processing, 2013

2012
Extensions for next-generation parallel programming models.
Parallel Computing, 2012

Operating systems and runtime environments on supercomputers.
IJHPCA, 2012

Abstract: Slack-Conscious Lightweight Loop Scheduling for Improving Scalability of Bulk-synchronous MPI Applications.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Optimization principles for collective neighborhood communications.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

Micro-applications for Communication Data Access Patterns and MPI Datatypes.
Proceedings of the Recent Advances in the Message Passing Interface, 2012

Exact Dependence Analysis for Increased Communication Overlap.
Proceedings of the Recent Advances in the Message Passing Interface, 2012

Leveraging MPI's One-Sided Communication Interface for Shared-Memory Programming.
Proceedings of the Recent Advances in the Message Passing Interface, 2012

Automatic datatype generation and optimization.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

Communication-centric optimizations by dynamically detecting collective operations.
Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2012

Assessing HPC Failure Detectors for MPI Jobs.
Proceedings of the 20th Euromicro International Conference on Parallel, 2012

On the Effects of CPU Caches on MPI Point-to-Point Communications.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

Productive Parallel Linear Algebra Programming with Unstructured Topology Adaption.
Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012

Performance Modeling and Comparative Analysis of the MILC Lattice QCD Application su3_rmd.
Proceedings of the 12th IEEE/ACM International Symposium on Cluster, 2012

Runtime detection and optimization of collective communication patterns.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011
MPI on Millions of Cores.
Parallel Processing Letters, 2011

The scalable process topology interface of MPI 2.2.
Concurrency and Computation: Practice and Experience, 2011

Performance modeling for systematic performance tuning.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

Design and Evaluation of Nonblocking Collective I/O Operations.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

Writing Parallel Libraries with MPI - Common Practice, Issues, and Extensions.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

Performance Expectations and Guidelines for MPI Derived Datatypes.
Proceedings of the Recent Advances in the Message Passing Interface, 2011

Active pebbles: a programming model for highly parallel fine-grained data-driven computations.
Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2011

Kanor - A Declarative Language for Explicit Communication.
Proceedings of the Practical Aspects of Declarative Languages, 2011

HIPS Introduction.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Deadlock-Free Oblivious Routing for Arbitrary Topologies.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Active pebbles: parallel programming for data-driven applications.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Generic topology mapping strategies for large-scale parallel architectures.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Kernel-Based Offload of Collective Operations - Implementation, Evaluation and Lessons Learned.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

2010
Accurately measuring overhead, communication time and progression of blocking and nonblocking collective operations at massive scale.
IJPEDS, 2010

Software and Hardware Techniques for Power-Efficient HPC Networking.
Computing in Science and Engineering, 2010

Characterizing the Influence of System Noise on Large-Scale Applications by Simulation.
Proceedings of the Conference on High Performance Computing Networking, 2010

Toward Performance Models of MPI Implementations for Understanding Application Scaling Issues.
Proceedings of the Recent Advances in the Message Passing Interface, 2010

Parallel Zero-Copy Algorithms for Fast Fourier Transform and Conjugate Gradient Using MPI Datatypes.
Proceedings of the Recent Advances in the Message Passing Interface, 2010

Efficient MPI Support for Advanced Hybrid Programming Models.
Proceedings of the Recent Advances in the Message Passing Interface, 2010

Scalable communication protocols for dynamic sparse data exchange.
Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2010

LogGOPSim: simulating large-scale applications in the LogGOPS model.
Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, 2010

The PERCS High-Performance Interconnect.
Proceedings of the IEEE 18th Annual Symposium on High Performance Interconnects, 2010

A space-efficient parallel algorithm for computing betweenness centrality in distributed memory.
Proceedings of the 2010 International Conference on High Performance Computing, 2010

Bridging Performance Analysis Tools and Analytic Performance Modeling for HPC.
Proceedings of the Euro-Par 2010 Parallel Processing Workshops, 2010

AM++: a generalized active message framework.
Proceedings of the 19th International Conference on Parallel Architecture and Compilation Techniques, 2010

2009
LogGP in theory and practice - An in-depth analysis of modern interconnection networks and benchmarking methods for collective operations.
Simulation Modelling Practice and Theory, 2009

The Effect of Network Noise on Large-Scale Collective Communications.
Parallel Processing Letters, 2009

Towards Efficient MapReduce Using MPI.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2009

Implementation and analysis of nonblocking collective operations on SCI networks.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Sparse collective operations for MPI.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

The impact of network noise at large-scale communication performance.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

A power-aware, application-based performance study of modern commodity cluster interconnection networks.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Group Operation Assembly Language - A Flexible Way to Express Collective Communication.
Proceedings of the ICPP 2009, 2009

Optimized Routing for Large-Scale InfiniBand Networks.
Proceedings of the 17th IEEE Symposium on High Performance Interconnects, 2009

Demand-driven execution of static directed acyclic graphs using task parallelism.
Proceedings of the 16th International Conference on High Performance Computing, 2009

2008
Leveraging non-blocking collective communication in high-performance applications.
SPAA 2008: Proceedings of the 20th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2008

Communication Optimization for Medical Image Reconstruction Algorithms.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008

Sparse Non-blocking Collectives in Quantum Mechanical Calculations.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2008

Accurately measuring collective operations at massive scale.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Optimizing non-blocking collective operations for infiniband.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Adaptive Routing Strategies for Modern High Performance Networks.
Proceedings of the 16th Annual IEEE Symposium on High Performance Interconnects (HOTI 2008), 2008

Multistage switches are not crossbars: Effects of static routing in high-performance networks.
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

Message progression in parallel computing - to thread or not to thread?
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

Overlapping Communication and Computation with High Level Communication Routines.
Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), 2008

An Optimized ZGEMM Implementation for the Cell BE.
Proceedings of the 9th Workshop on Parallel Systems and Algorithms (PASA) held at the 21st Conference on the Architecture of Computing Systems (ARCS), 2008

2007
Optimizing a conjugate gradient solver with non-blocking collective operations.
Parallel Computing, 2007

Implementation and performance analysis of non-blocking collective operations for MPI.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

A Case for Standard Non-blocking Collective Operations.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 14th European PVM/MPI User's Group Meeting, Paris, France, September 30, 2007

A practically constant-time MPI Broadcast Algorithm for large-scale InfiniBand Clusters with Multicast.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Low-Overhead LogGP Parameter Assessment for Modern Interconnection Networks.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Netgauge: A Network Performance Measurement Framework.
Proceedings of the High Performance Computing and Communications, 2007

2006
Optimizing a Conjugate Gradient Solver with Non-Blocking Collective Operations.
Proceedings of the Recent Advances in Parallel Virtual Machine and Message Passing Interface, 2006

IRS - A Portable Interface for Reconfigurable Systems.
Proceedings of the Fifth International Conference on Parallel Computing in Electrical Engineering (PARELEC 2006), 2006

Assessing Single-Message and Multi-Node Communication Performance of InfiniBand.
Proceedings of the Fifth International Conference on Parallel Computing in Electrical Engineering (PARELEC 2006), 2006

A Case for Non-blocking Collective Operations.
Proceedings of the Frontiers of High Performance Computing and Networking, 2006

LogfP - a model for small messages in InfiniBand.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Fast barrier synchronization for InfiniBand™.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Analysis of the Memory Registration Process in the Mellanox InfiniBand Software Stack.
Proceedings of the Euro-Par 2006, Parallel Processing, 12th International Euro-Par Conference, Dresden, Germany, August 28, 2006

Adding Low-Cost Hardware Barrier Support to Small Commodity Clusters.
Proceedings of the ARCS 2006, 2006

2005
A Practical Approach to the Rating of Barrier Algorithms Using the LogP Model and Open MPI.
Proceedings of the 34th International Conference on Parallel Processing Workshops (ICPP 2005 Workshops), 2005

