P. Sadayappan

According to our database, P. Sadayappan authored at least 317 papers between 1985 and 2021.

Awards

IEEE Fellow 2015, "For contributions to parallel programming tools for high-performance computing".

Bibliography

2021
Analytical Characterization and Design Space Exploration for Optimization of CNNs.
CoRR, 2021

2020
Efficient tiled sparse matrix multiplication through matrix signatures.
Proceedings of the International Conference for High Performance Computing, 2020

Scalable heterogeneous execution of a coupled-cluster model with perturbative triples.
Proceedings of the International Conference for High Performance Computing, 2020

Compiling generalized histograms for GPU.
Proceedings of the International Conference for High Performance Computing, 2020

Automated derivation of parametric data movement lower bounds for affine programs.
Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2020

ALO-NMF: Accelerated Locality-Optimized Non-negative Matrix Factorization.
Proceedings of the KDD '20: The 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, 2020

2019
PL-NMF: Parallel Locality-Optimized Non-negative Matrix Factorization.
CoRR, 2019

An efficient mixed-mode representation of sparse tensors.
Proceedings of the International Conference for High Performance Computing, 2019

Parallel Data-Local Training for Optimizing Word2Vec Embeddings for Word and Graph Embeddings.
Proceedings of the 2019 IEEE/ACM Workshop on Machine Learning in High Performance Computing Environments, 2019

Analytical cache modeling and tilesize optimization for tensor contractions.
Proceedings of the International Conference for High Performance Computing, 2019

Adaptive sparse tiling for sparse matrix multiplication.
Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2019

On Optimizing Complex Stencils on GPUs.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

Load-Balanced Sparse MTTKRP on GPUs.
Proceedings of the 2019 IEEE International Parallel and Distributed Processing Symposium, 2019

A Code Generator for High-Performance Tensor Contractions on GPUs.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2019

ATP: Directed Graph Embedding with Asymmetric Transitivity Preservation.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

2018
Domain-Specific Optimization and Generation of High-Performance GPU Code for Stencil Computations.
Proc. IEEE, 2018

Analytical modeling of cache behavior for affine programs.
Proc. ACM Program. Lang., 2018

Associative instruction reordering to alleviate register pressure.
Proceedings of the International Conference for High Performance Computing, 2018

Register optimizations for stencils on GPUs.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

Performance modeling for GPUs using abstract kernel emulation.
Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2018

GPU code optimization using abstract kernel emulation and sensitivity analysis.
Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2018

TTLG - An Efficient Tensor Transposition Library for GPUs.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium, 2018

Effective Machine Learning Based Format Selection and Performance Modeling for SpMV on GPUs.
Proceedings of the 2018 IEEE International Parallel and Distributed Processing Symposium Workshops, 2018

Optimizing Tensor Contractions in CCSD(T) for Efficient Execution on GPUs.
Proceedings of the 32nd International Conference on Supercomputing, 2018

Parallel Latent Dirichlet Allocation on GPUs.
Proceedings of the Computational Science - ICCS 2018, 2018

Efficient sparse-matrix multi-vector product on GPUs.
Proceedings of the 27th International Symposium on High-Performance Parallel and Distributed Computing, 2018

Sampled Dense Matrix Multiplication for High-Performance Machine Learning.
Proceedings of the 25th IEEE International Conference on High Performance Computing, 2018

2017
Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis.
Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2017

Parallel CCD++ on GPU for Matrix Factorization.
Proceedings of the General Purpose GPUs, 2017

Efficient Cache Simulation for Affine Computations.
Proceedings of the Languages and Compilers for Parallel Computing, 2017

On improving performance of sparse matrix-matrix multiplication on GPUs.
Proceedings of the International Conference on Supercomputing, 2017

Characterization of Data Movement Requirements for Sparse Matrix Computations on GPUs.
Proceedings of the 24th IEEE International Conference on High Performance Computing, 2017

POSTER: Statement Reordering to Alleviate Register Pressure for Stencils on GPUs.
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, 2017

MultiGraph: Efficient Graph Processing on GPUs.
Proceedings of the 26th International Conference on Parallel Architectures and Compilation Techniques, 2017

2016
Static and Dynamic Frequency Scaling on Multicore CPUs.
ACM Trans. Archit. Code Optim., 2016

Global-view coefficients: a data management solution for parallel quantum Monte Carlo applications.
Concurr. Comput. Pract. Exp., 2016

Work stealing for GPU-accelerated parallel programs in a global address space framework.
Concurr. Comput. Pract. Exp., 2016

Brief Announcement: Approximating the I/O Complexity of One-Shot Red-Blue Pebbling.
Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, 2016

A domain-specific compiler for a parallel multiresolution adaptive numerical simulation environment.
Proceedings of the International Conference for High Performance Computing, 2016

PIPES: a language and compiler for task-based programming on distributed-memory clusters.
Proceedings of the International Conference for High Performance Computing, 2016

Effective resource management for enhancing performance of 2D and 3D stencils on GPUs.
Proceedings of the 9th Annual Workshop on General Purpose Processing using Graphics Processing Unit, 2016

PolyCheck: dynamic verification of iteration space transformations on affine programs.
Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2016

Effective padding of multidimensional arrays to avoid cache conflict misses.
Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation, 2016

Architecting and Programming a Hardware-Incoherent Multiprocessor Cache Hierarchy.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Differentiated Scheduling of Response-Critical and Best-Effort Wide-Area Data Transfers.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

Compiler Support for Software Cache Coherence.
Proceedings of the 23rd IEEE International Conference on High Performance Computing, 2016

On fusing recursive traversals of K-d trees.
Proceedings of the 25th International Conference on Compiler Construction, 2016

Register allocation and promotion through combined instruction scheduling and loop unrolling.
Proceedings of the 25th International Conference on Compiler Construction, 2016

Resource Conscious Reuse-Driven Tiling for GPUs.
Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, 2016

2015
Introduction to the Special Issue on PPoPP'12.
ACM Trans. Parallel Comput., 2015

A model-driven blocking strategy for load balanced sparse matrix-vector multiplication on GPUs.
J. Parallel Distributed Comput., 2015

SDSLc: a multi-target domain-specific compiler for stencil computations.
Proceedings of the 5th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, 2015

An elegant sufficiency: load-aware differentiated scheduling of data transfers.
Proceedings of the International Conference for High Performance Computing, 2015

Distributed memory code generation for mixed Irregular/Regular computations.
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015

On optimizing machine learning workloads via kernel fusion.
Proceedings of the 20th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2015

On Characterizing the Data Access Complexity of Programs.
Proceedings of the 42nd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2015

iWAPT Invited Talks.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

A Roofline-Based Performance Estimator for Distributed Matrix-Multiply on Intel CnC.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium Workshop, 2015

Automatic Selection of Sparse Matrix Representation on GPUs.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Optimistic Delinearization of Parametrically Sized Arrays.
Proceedings of the 29th ACM on International Conference on Supercomputing, 2015

Characterizing and enhancing global memory data coalescing on GPUs.
Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2015

2014
Automatic parallelization of a class of irregular loops for distributed memory systems.
ACM Trans. Parallel Comput., 2014

Compiler/Runtime Framework for Dynamic Dataflow Parallelization of Tiled Programs.
ACM Trans. Archit. Code Optim., 2014

On Using the Roofline Model with Lower Bounds on Data Movement.
ACM Trans. Archit. Code Optim., 2014

The Relation Between Diamond Tiling and Hexagonal Tiling.
Parallel Process. Lett., 2014

Introduction to the JPDC Special Issue on Domain-Specific Languages and High-Level Frameworks for High-Performance Computing.
J. Parallel Distributed Comput., 2014

A Tiling Perspective for Register Optimization.
CoRR, 2014

On characterizing the data movement complexity of computational DAGs for parallel execution.
Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, 2014

A Communication-Optimal Framework for Contracting Distributed Tensors.
Proceedings of the International Conference for High Performance Computing, 2014

Fast Sparse Matrix-Vector Multiplication on GPUs for Graph Applications.
Proceedings of the International Conference for High Performance Computing, 2014

Compiler-assisted detection of transient memory errors.
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014

A framework for enhancing data reuse via associative reordering.
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2014

WOSC 2014: second workshop on optimizing stencil computations.
Proceedings of the Conference on Systems, 2014

An efficient two-dimensional blocking strategy for sparse matrix-vector multiplication on GPUs.
Proceedings of the 2014 International Conference on Supercomputing, 2014

PWCET: Power-Aware Worst Case Execution Time Analysis.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014

Checksumming Strategies for Data in Volatile Memories.
Proceedings of the 43rd International Conference on Parallel Processing Workshops, 2014

CAST: Contraction Algorithm for Symmetric Tensors.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

A fast implementation of MLR-MCL algorithm on multi-core processors.
Proceedings of the 21st International Conference on High Performance Computing, 2014

Hybrid Hexagonal/Classical Tiling for GPUs.
Proceedings of the 12th Annual IEEE/ACM International Symposium on Code Generation and Optimization, 2014

Modeling and Optimizing Large-Scale Wide-Area Data Transfers.
Proceedings of the 14th IEEE/ACM International Symposium on Cluster, 2014

Global graphs: A middleware for large scale graph processing.
Proceedings of the 2014 IEEE International Conference on Big Data, 2014

2013
Beyond reuse distance analysis: Dynamic analysis for characterization of data locality potential.
ACM Trans. Archit. Code Optim., 2013

Stencil-Aware GPU Optimization of Iterative Solvers.
SIAM J. Sci. Comput., 2013

Predictive Modeling in a Polyhedral Optimization Space.
Int. J. Parallel Program., 2013

Adaptive parallel tiled code generation and accelerated auto-tuning.
Int. J. High Perform. Comput. Appl., 2013

A framework for load balancing of tensor contraction expressions via dynamic task partitioning.
Proceedings of the International Conference for High Performance Computing, 2013

When polyhedral transformations meet SIMD code generation.
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2013

Parametric GPU Code Generation for Affine Loop Programs.
Proceedings of the Languages and Compilers for Parallel Computing, 2013

A Compiler Analysis to Determine Useful Cache Size for Energy Efficiency.
Proceedings of the 2013 IEEE International Symposium on Parallel & Distributed Processing, 2013

A stencil compiler for short-vector SIMD architectures.
Proceedings of the International Conference on Supercomputing, 2013

Stratification driven placement of complex data: A framework for distributed data analytics.
Proceedings of the 29th IEEE International Conference on Data Engineering, 2013

Accelerating Strassen-Winograd's matrix multiplication algorithm on GPUs.
Proceedings of the 20th Annual International Conference on High Performance Computing, 2013

Polyhedral-based data reuse optimization for configurable computing.
Proceedings of the 2013 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2013

Split tiling for GPUs: automatic parallelization using trapezoidal tiles.
Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, 2013

2012
Using machine learning to improve automatic vectorization.
ACM Trans. Archit. Code Optim., 2012

International Conference on Computational Science, ICCS 2012.
Proceedings of the International Conference on Computational Science, 2012

Empirical performance model-driven data layout optimization and library call selection for tensor contraction expressions.
J. Parallel Distributed Comput., 2012

Code generation for parallel execution of a class of irregular loops on distributed memory systems.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

GADBMS: A Framework for Scalable Array Analytics.
Proceedings of the 2012 SC Companion: High Performance Computing, 2012

Dynamic trace-based analysis of vectorization potential of applications.
Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation, 2012

PARDA: A Fast Parallel Reuse Distance Analysis Algorithm.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

Load Balancing of Dynamical Nucleation Theory Monte Carlo Simulations through Resource Sharing Barriers.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium, 2012

High-performance code generation for stencil computations on GPU architectures.
Proceedings of the International Conference on Supercomputing, 2012

A global address space approach to automated data management for parallel Quantum Monte Carlo applications.
Proceedings of the 19th International Conference on High Performance Computing, 2012

Analytical Bounds for Optimal Tile Size Selection.
Proceedings of the Compiler Construction - 21st International Conference, 2012

High-performance sparse matrix-vector multiplication on GPUs for structured grid computations.
Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, 2012

2011
Fast Sparse Matrix-Vector Multiplication on GPUs: Implications for Graph Mining.
Proc. VLDB Endow., 2011

Optimizing latency and throughput of application workflows on clusters.
Parallel Comput., 2011

Memory-optimal evaluation of expression trees involving large objects.
Comput. Lang. Syst. Struct., 2011

Poster: FOX: a fault-oblivious extreme scale execution environment.
Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, 2011

Loop transformations: convexity, pruning and optimization.
Proceedings of the 38th ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages, 2011

Model-Driven SIMD Code Generation for a Multi-resolution Tensor Kernel.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Dynamic selection of tile sizes.
Proceedings of the 18th International Conference on High Performance Computing, 2011

Application-Specific Fault Tolerance via Data Access Characterization.
Proceedings of the Euro-Par 2011 Parallel Processing - 17th International Conference, 2011

Predictive modeling in a polyhedral optimization space.
Proceedings of the CGO 2011, 2011

Data Layout Transformation for Stencil Computations on Short-Vector SIMD Architectures.
Proceedings of the Compiler Construction - 20th International Conference, 2011

StVEC: A Vector Instruction Extension for High Performance Stencil Computation.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

2010
Parameterized specification, configuration and execution of data-intensive scientific workflows.
Clust. Comput., 2010

Combined Iterative and Model-driven Optimization in an Automatic Parallelization Framework.
Proceedings of the Conference on High Performance Computing Networking, 2010

Optimal loop unrolling for GPGPU programs.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

DynTile: Parametric tiled loop generation for parallel execution on multicore processors.
Proceedings of the 24th IEEE International Symposium on Parallel and Distributed Processing, 2010

Understanding parallelism-inhibiting dependences in sequential Java programs.
Proceedings of the 26th IEEE International Conference on Software Maintenance (ICSM 2010), 2010

Parallel Job Scheduling Policies to Improve Fairness: A Case Study.
Proceedings of the 39th International Conference on Parallel Processing, 2010

Parameterized tiling revisited.
Proceedings of the CGO 2010, 2010

Hybrid parallel programming with MPI and unified parallel C.
Proceedings of the 7th Conference on Computing Frontiers, 2010

Selective Recovery from Failures in a Task Parallel Programming Model.
Proceedings of the 10th IEEE/ACM International Conference on Cluster, 2010

Automatic C-to-CUDA Code Generation for Affine Programs.
Proceedings of the Compiler Construction, 19th International Conference, 2010

2009
An Integrated Approach to Locality-Conscious Processor Allocation and Scheduling of Mixed-Parallel Applications.
IEEE Trans. Parallel Distributed Syst., 2009

Enabling software management for multicore caches with a lightweight hardware support.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Scalable work stealing.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2009

Compiler-assisted dynamic scheduling for effective parallelization of loop nests on multicore processors.
Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2009

Annotation-based empirical performance tuning using Orio.
Proceedings of the 23rd IEEE International Symposium on Parallel and Distributed Processing, 2009

Parametric multi-level tiling of imperfectly nested loops.
Proceedings of the 23rd international conference on Supercomputing, 2009

An integrated framework for performance-based optimization of scientific workflows.
Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, 2009

Scalable I/O forwarding framework for high-performance computing systems.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

Soft-OLP: Improving Hardware Cache Performance through Software-Controlled Object-Level Partitioning.
Proceedings of the PACT 2009, 2009

Data Layout Transformation for Enhancing Data Locality on NUCA Chip Multiprocessors.
Proceedings of the PACT 2009, 2009

2008
A message passing benchmark for unbalanced applications.
Simul. Model. Pract. Theory, 2008

A framework for characterizing overlap of communication and computation in parallel applications.
Clust. Comput., 2008

Global trees: a framework for linked data structures on distributed memory parallel systems.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

Using overlays for efficient data transfer over shared wide-area networks.
Proceedings of the ACM/IEEE Conference on High Performance Computing, 2008

Automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories.
Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008

A practical automatic polyhedral parallelizer and locality optimizer.
Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, 2008

A dynamic scheduling approach for coordinated wide-area data transfers using GridFTP.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Towards effective automatic parallelization for multicore systems.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

A compiler framework for optimization of affine loop nests for gpgpus.
Proceedings of the 22nd Annual International Conference on Supercomputing, 2008

A Duplication Based Algorithm for Optimizing Latency Under Throughput Constraints for Streaming Workflows.
Proceedings of the 2008 International Conference on Parallel Processing, 2008

Scioto: A Framework for Global-View Task Parallelism.
Proceedings of the 2008 International Conference on Parallel Processing, 2008

Integrated Data and Task Management for Scientific Applications.
Proceedings of the Computational Science, 2008

Multi-hop path splitting and multi-pathing optimizations for data transfers over shared wide-area networks using gridFTP.
Proceedings of the 17th International Symposium on High-Performance Distributed Computing (HPDC-17 2008), 2008

Gaining insights into multicore cache partitioning: Bridging the gap between simulation and real systems.
Proceedings of the 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 2008

Are nonblocking networks really needed for high-end-computing workloads?
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

An OSD-based approach to managing directory operations in parallel file systems.
Proceedings of the 2008 IEEE International Conference on Cluster Computing, 29 September, 2008

Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model.
Proceedings of the Compiler Construction, 17th International Conference, 2008

2007
Efficient search-space pruning for integrated fusion and tiling transformations.
Concurr. Comput. Pract. Exp., 2007

Integrating parallel file systems with object-based storage devices.
Proceedings of the ACM/IEEE Conference on High Performance Networking and Computing, 2007

Automatic mapping of nested loops to FPGAs.
Proceedings of the 12th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2007

Effective automatic parallelization of stencil computations.
Proceedings of the ACM SIGPLAN 2007 Conference on Programming Language Design and Implementation, 2007

A global address space framework for locality aware scheduling of block-sparse computations.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Dynamic Load Balancing of Unbalanced Computations Using Message Passing.
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Analyzing and Minimizing the Impact of Opportunity Cost in QoS-aware Job Scheduling.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

Toward Optimizing Latency Under Throughput Constraints for Application Workflows on Clusters.
Proceedings of the Euro-Par 2007, 2007

Scheduling File Transfers for Data-Intensive Jobs on Heterogeneous Clusters.
Proceedings of the Euro-Par 2007, 2007

Non-collective parallel I/O for global address space programming models.
Proceedings of the 2007 IEEE International Conference on Cluster Computing, 2007

2006
Layout transformation support for the disk resident arrays framework.
J. Supercomput., 2006

MOLAR: adaptive runtime support for high-end computing operating and runtime systems.
ACM SIGOPS Oper. Syst. Rev., 2006

Efficient synthesis of out-of-core algorithms using a nonlinear optimization solver.
J. Parallel Distributed Comput., 2006

M12 - Overview of the global arrays parallel software development toolkit.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

Data management and query - Hypergraph partitioning for automatic memory hierarchy management.
Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing, 2006

UTS: An Unbalanced Tree Search Benchmark.
Proceedings of the Languages and Compilers for Parallel Computing, 2006

Moldable Parallel Job Scheduling Using Job Efficiency: An Iterative Approach.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2006

A Data Locality Aware Online Scheduling Approach for I/O-Intensive Jobs with File Sharing.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2006

An approach to locality-conscious load balancing and transparent memory hierarchy management with a global-address-space parallel programming model.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

An extensible global address space framework with decoupled task and data abstractions.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Parallel FPGA-based all-pairs shortest-paths in a directed graph.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

Memory minimization for tensor contractions using integer linear programming.
Proceedings of the 20th International Parallel and Distributed Processing Symposium (IPDPS 2006), 2006

An Integrated Approach for Processor Allocation and Scheduling of Mixed-Parallel Applications.
Proceedings of the 2006 International Conference on Parallel Processing (ICPP 2006), 2006

Identifying Cost-Effective Common Subexpressions to Reduce Operation Count in Tensor Contraction Evaluations.
Proceedings of the Computational Science, 2006

Task Scheduling and File Replication for Data-Intensive Jobs with Batch-shared I/O.
Proceedings of the 15th IEEE International Symposium on High Performance Distributed Computing, 2006

Hardware/Software Integration for FPGA-based All-Pairs Shortest-Paths.
Proceedings of the 14th IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM 2006), 2006

Locality Conscious Processor Allocation and Scheduling for Mixed Parallel Applications.
Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

A Performance Instrumentation Framework to Characterize Computation-Communication Overlap in Message-Passing Systems.
Proceedings of the 2006 IEEE International Conference on Cluster Computing, 2006

Combining analytical and empirical approaches in tuning matrix transposition.
Proceedings of the 15th International Conference on Parallel Architectures and Compilation Techniques (PACT 2006), 2006

2005
Synthesis of High-Performance Parallel Programs for a Class of ab Initio Quantum Chemistry Models.
Proc. IEEE, 2005

Selective preemption strategies for parallel job scheduling.
Int. J. High Perform. Comput. Netw., 2005

Integrated Loop Optimizations for Data Locality Enhancement of Tensor Contraction Expressions.
Proceedings of the ACM/IEEE SC2005 Conference on High Performance Networking and Computing, 2005

Performance modeling and optimization of parallel out-of-core tensor contractions.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2005

Unfairness Metrics for Space-Sharing Parallel Job Schedulers.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2005

Cache Miss Characterization and Data Locality Optimization for Imperfectly Nested Loops on Shared Memory Multiprocessors.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Message from the Chairs.
Proceedings of the 34th International Conference on Parallel Processing Workshops (ICPP 2005 Workshops), 2005

Automated Operation Minimization of Tensor Contraction Expressions in Electronic Structure Calculations.
Proceedings of the Computational Science, 2005

Assessment and enhancement of meta-schedulers for multi-site job sharing.
Proceedings of the 14th IEEE International Symposium on High Performance Distributed Computing, 2005

Data and Computation Abstractions for Dynamic and Irregular Computations.
Proceedings of the High Performance Computing, 2005

A hypergraph partitioning based approach for scheduling of tasks with batch-shared I/O.
Proceedings of the 5th International Symposium on Cluster Computing and the Grid (CCGrid 2005), 2005

2004
Robust scheduling of moldable parallel jobs.
Int. J. High Perform. Comput. Netw., 2004

Efficient parallel out-of-core matrix transposition.
Int. J. High Perform. Comput. Netw., 2004

Empirical Performance-Model Driven Data Layout Optimization.
Proceedings of the Languages and Compilers for High Performance Computing, 2004

Applying MPI Derived Datatypes to the NAS Benchmarks: A Case Study.
Proceedings of the 33rd International Conference on Parallel Processing Workshops (ICPP 2004 Workshops), 2004

Message from the Chairs: International Workshop on Compile and Run Time Techniques for Parallel Computing.
Proceedings of the 33rd International Conference on Parallel Processing Workshops (ICPP 2004 Workshops), 2004

Job Fairness in Non-Preemptive Job Scheduling.
Proceedings of the 33rd International Conference on Parallel Processing (ICPP 2004), 2004

Efficient Layout Transformation for Disk-Based Multidimensional Arrays.
Proceedings of the High Performance Computing, 2004

Use of PVFS for Efficient Execution of Jobs with Pipeline-Shared I/O.
Proceedings of the 5th International Workshop on Grid Computing (GRID 2004), 2004

On fairness in distributed job scheduling across multiple sites.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

Towards provision of quality of service guarantees in job scheduling.
Proceedings of the 2004 IEEE International Conference on Cluster Computing (CLUSTER 2004), 2004

2003
Evaluating the Impact of Programming Language Features on the Performance of Parallel Applications on Cluster Architectures.
Proceedings of the Languages and Compilers for Parallel Computing, 2003

Memory-Constrained Data Locality Optimization for Tensor Contractions.
Proceedings of the Languages and Compilers for Parallel Computing, 2003

Scheduling of Parallel Jobs in a Heterogeneous Multi-site Environment.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2003

QoPS: A QoS Based Scheme for Parallel Job Scheduling.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2003

Global Communication Optimization for Tensor Contraction Expressions under Memory Constraints.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

Data Locality Optimization for Synthesis of Efficient Out-of-Core Algorithms.
Proceedings of the High Performance Computing - HiPC 2003, 10th International Conference, 2003

A Robust Scheduling Strategy for Moldable Scheduling of Parallel Jobs.
Proceedings of the 2003 IEEE International Conference on Cluster Computing (CLUSTER 2003), 2003

2002
A high-level approach to synthesis of high-performance codes for quantum chemistry.
Proceedings of the 2002 ACM/IEEE conference on Supercomputing, 2002

Space-Time Trade-Off Optimization for a Class of Electronic Structure Calculations.
Proceedings of the 2002 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2002

Memory-Constrained Communication Minimization for a Class of Array Computations.
Proceedings of the Languages and Compilers for Parallel Computing, 15th Workshop, 2002

Selective Reservation Strategies for Backfill Job Scheduling.
Proceedings of the Job Scheduling Strategies for Parallel Processing, 2002

A Performance Optimization Framework for Compilation of Tensor Contraction Expressions into Parallel Programs.
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

Characterization of Backfilling Strategies for Parallel Job Scheduling.
Proceedings of the 31st International Conference on Parallel Processing Workshops (ICPP 2002 Workshops), 2002

Message from the Co-Chairs.
Proceedings of the 31st International Conference on Parallel Processing Workshops (ICPP 2002 Workshops), 2002

A Reliable Multicast Algorithm for Mobile Ad Hoc Networks.
Proceedings of the 22nd International Conference on Distributed Computing Systems (ICDCS'02), 2002

Distributed Job Scheduling on Computational Grids Using Multiple Simultaneous Requests.
Proceedings of the 11th IEEE International Symposium on High Performance Distributed Computing (HPDC-11 2002), 2002

Effective Selection of Partition Sizes for Moldable Scheduling of Parallel Jobs.
Proceedings of the High Performance Computing, 2002

Selective Buddy Allocation for Scheduling Parallel Jobs on Clusters.
Proceedings of the 2002 IEEE International Conference on Cluster Computing (CLUSTER 2002), 2002

2001
Efficient Multicast Algorithms for Heterogeneous Switch-based Irregular Networks of Workstations.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

Performance Benefits of NIC-Based Barrier on Myrinet/GM.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

Fast NIC-Based Barrier over Myrinet/GM.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

VIBe: A Micro-benchmark Suite for Evaluating Virtual Interface Architecture (VIA) Implementations.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

Loop optimization for a class of memory-constrained computations.
Proceedings of the 15th international conference on Supercomputing, 2001

NIC-Based Rate Control for Proportional Bandwidth Allocation in Myrinet Clusters.
Proceedings of the 2001 International Conference on Parallel Processing, 2001

Implementing TreadMarks over VIA on Myrinet and Gigabit Ethernet: Challenges, Design Experience, and Performance Evaluation.
Proceedings of the 2001 International Conference on Parallel Processing, 2001

Towards Automatic Synthesis of High-Performance Codes for Electronic Structure Calculations: Data Locality Optimization.
Proceedings of the High Performance Computing - HiPC 2001, 8th International Conference, 2001

2000
Characterization and Enhancement of Dynamic Mapping Heuristics for Heterogeneous Systems.
Proceedings of the 2000 International Workshop on Parallel Processing, 2000

Message from the Chair.
Proceedings of the 2000 International Workshop on Parallel Processing, 2000

Balancing Web Server Load for Adaptable Video Distribution.
Proceedings of the 2000 International Workshop on Parallel Processing, 2000

Characterization and enhancement of Static Mapping Heuristics for Heterogeneous Systems.
Proceedings of the High Performance Computing, 2000

Fast Collective Communication Algorithms for Reflective Memory Network Clusters.
Proceedings of the Network-Based Parallel Computing: Communication, 2000

Broadcast/Multicast over Myrinet Using NIC-Assisted Multidestination Messages.
Proceedings of the Network-Based Parallel Computing: Communication, 2000

1999
Performance Optimization of a Class of Loops Involving Sums of Products of Sparse Arrays.
Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, 1999

Optimization of Memory Usage Requirement for a Class of Loops Implementing Multi-dimensional Integrals.
Proceedings of the Languages and Compilers for Parallel Computing, 1999

Low-Latency Message Passing on Workstation Clusters using SCRAMNet.
Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

All-to-All Broadcast on Switch-Based Clusters of Workstations.
Proceedings of the 13th International Parallel Processing Symposium / 10th Symposium on Parallel and Distributed Processing (IPPS / SPDP '99), 1999

An Incremental Methodology for Parallelizing Legacy Stencil Codes on Message-Passing Computers.
Proceedings of the International Conference on Parallel Processing 1999, 1999

Memory-Optimal Evaluation of Expression Trees Involving Large Objects.
Proceedings of the High Performance Computing, 1999

Communication Modeling of Heterogeneous Networks of Workstations for Performance Characterization of Collective Operations.
Proceedings of the 8th Heterogeneous Computing Workshop, 1999

Low Latency Message-Passing for Reflective Memory Networks.
Proceedings of the Network-Based Parallel Computing: Communication, 1999

1998
Partitioning Graphs on Message-Passing Machines by Pairwise Mincut.
Inf. Sci., 1998

A technique for overlapping computation and communication for block recursive algorithms.
Concurr. Pract. Exp., 1998

1997
On Optimizing a Class of Multi-Dimensional Loops with Reductions for Parallel Execution.
Parallel Process. Lett., 1997

Optimal Algorithms for All-to-All Personalized Communication on Rings and Two Dimensional Tori.
J. Parallel Distributed Comput., 1997

Optimization of a Class of Multi-Dimensional Integrals on Parallel Machines.
Proceedings of the Eighth SIAM Conference on Parallel Processing for Scientific Computing, 1997

On improving the performance of sparse matrix-vector multiplication.
Proceedings of the Fourth International Conference on High-Performance Computing, 1997

1996
Communication-Efficient Matrix Multiplication on Hypercubes.
Parallel Comput., 1996

Efficient Index Set Generation for Compiling HPF Array Statements on Distributed-Memory Machines.
J. Parallel Distributed Comput., 1996

Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines.
J. Parallel Distributed Comput., 1996

A Framework for Generating Distributed-Memory Parallel Programs for Block Recursive Algorithms.
J. Parallel Distributed Comput., 1996

An Algebraic Theory for Modeling Direct Interconnection Networks.
J. Inf. Sci. Eng., 1996

Introduction.
Int. J. Parallel Program., 1996

Optimal Reordering and Mapping of a Class of Nested-Loops for Parallel Execution.
Proceedings of the Languages and Compilers for Parallel Computing, 1996

Hybrid Algorithms for Complete Exchange in 2D Meshes.
Proceedings of the 10th international conference on Supercomputing, 1996

1995
A Tensor Product Formulation of Strassen's Matrix Multiplication Algorithm with Memory Reduction.
Sci. Program., 1995

A Clustering Algorithm for Parallel Sparse Cholesky Factorization.
Parallel Process. Lett., 1995

Practical abduction: characterization, decomposition and concurrency.
J. Exp. Theor. Artif. Intell., 1995

Mapping combinatorial optimization problems onto neural networks.
Inf. Sci., 1995

Compiling Array Statements for Efficient Execution on Distributed-Memory Machines: Two-Level Mappings.
Proceedings of the Languages and Compilers for Parallel Computing, 1995

Multi-phase array redistribution: modeling and evaluation.
Proceedings of IPPS '95, 1995

1994
Implementing Fast Fourier Transforms on Distributed-Memory Multiprocessors Using Data Redistributions.
Parallel Process. Lett., 1994

EXTENT: a portable programming environment for designing and implementing high-performance block recursive algorithms.
Proceedings of Supercomputing '94, 1994

Incremental Generation of Index Sets for Array Statement Execution on Distributed-Memory Machines.
Proceedings of the Languages and Compilers for Parallel Computing, 1994

A Clustered Reduced Communication Element by Element Preconditioned Conjugate Gradient Algorithm for Finite Element Computations.
Proceedings of the 8th International Symposium on Parallel Processing, 1994

On sparse matrix reordering for parallel factorization.
Proceedings of the 8th international conference on Supercomputing, 1994

An approach to communication-efficient data redistribution.
Proceedings of the 8th international conference on Supercomputing, 1994

Communication-Efficient Implementation of Block Recursive Algorithms on Distributed-Memory Machines.
Proceedings of the 1994 International Conference on Parallel and Distributed Systems, 1994

1993
Communication-Free Hyperplane Partitioning of Nested Loops.
J. Parallel Distributed Comput., 1993

Efficient transposition algorithms for large matrices.
Proceedings of Supercomputing '93, 1993

A Methodology for Generating Efficient Disk-Based Algorithms from Tensor Product Formulas.
Proceedings of the Languages and Compilers for Parallel Computing, 1993

A Parallel Progressive Refinement Image Rendering Algorithm on a Scalable Multithreaded VLSI Processor Array.
Proceedings of the 1993 International Conference on Parallel Processing, 1993

On Compiling Array Expressions for Efficient Execution on Distributed-Memory Machines.
Proceedings of the 1993 International Conference on Parallel Processing, 1993

Supernodal Sparse Cholesky Factorization on Distributed-Memory Multiprocessors.
Proceedings of the 1993 International Conference on Parallel Processing, 1993

Compile-Time Characterization of Recurrent Patterns in Irregular Computations.
Proceedings of the 1993 International Conference on Parallel Processing, 1993

Architectural Synthesis of Performance-Driven Multipliers with Accumulator Interleaving.
Proceedings of the 30th Design Automation Conference. Dallas, 1993

1992
Toward super-real-time simulation of robotic mechanisms using a parallel integration method.
IEEE Trans. Syst. Man Cybern., 1992

Tiling Multidimensional Iteration Spaces for Multicomputers.
J. Parallel Distributed Comput., 1992

The Rectilinear Steiner Arborescence Problem.
Algorithmica, 1992

A Methodology for Generating Data Distributions to Optimize Communication.
Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, 1992

On Data Dependence Analysis for Compiling Programs on Distributed-Memory Machines (Extended Abstract).
Proceedings of the 2nd SIGPLAN Workshop on Languages, Compilers, and Run-Time Environments for Distributed Memory Multiprocessors, Boulder, Colorado, September 30, 1992

On the Automatic Generation of Data Distributions.
Proceedings of the 2nd SIGPLAN Workshop on Languages, Compilers, and Run-Time Environments for Distributed Memory Multiprocessors, Boulder, Colorado, September 30, 1992

An Algebraic Theory for Modeling Direct Interconnection Networks.
Proceedings of Supercomputing '92, 1992

On the Synthesis of Parallel Programs from Tensor Product Formulas for Block Recursive Algorithms.
Proceedings of the Languages and Compilers for Parallel Computing, 1992

Efficient dynamic simulation of multiple manipulator systems with singularities.
Proceedings of the 1992 IEEE International Conference on Robotics and Automation, 1992

1991
Compile-Time Techniques for Data Distribution in Distributed Memory Machines.
IEEE Trans. Parallel Distributed Syst., 1991

Removal of Redundant Dependences in DOACROSS Loops with Constant Dependences.
IEEE Trans. Parallel Distributed Syst., 1991

Tiling multidimensional iteration spaces for nonshared memory machines.
Proceedings of Supercomputing '91, 1991

Removal of Redundant Dependences in DOACROSS Loops with Constant Dependences.
Proceedings of the Third ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming (PPOPP), 1991

Real-time robot dynamic simulation on a vector/parallel supercomputer.
Proceedings of the 1991 IEEE International Conference on Robotics and Automation, 1991

Computer Graphics Rendering on a Shared Memory Multiprocessor.
Proceedings of the International Conference on Parallel Processing, 1991

Multifrontal Factorization of Sparse Matrices on Shared-Memory Multiprocessors.
Proceedings of the International Conference on Parallel Processing, 1991

1990
Cluster partitioning approaches to mapping parallel programs onto a hypercube.
Parallel Comput., 1990

Task Allocation onto a Hypercube by Recursive Mincut Bipartitioning.
J. Parallel Distributed Comput., 1990

Dynamic Scheduling of DOACROSS Loops for Multiprocessors.
Proceedings of the Parallel Architectures (Postconference PARBASE-90), 1990

Tiling of Iteration Spaces for Multicomputers.
Proceedings of the 1990 International Conference on Parallel Processing, 1990

Exploiting Parallelism Through Run-Time Analysis on a Vector Processor (Abstract).
Proceedings of the ACM 18th Annual Computer Science Conference on Cooperation, 1990

1989
A restructurable VLSI robotics vector processor architecture for real-time control.
IEEE Trans. Robotics Autom., 1989

Efficient sparse matrix factorization for circuit simulation on vector supercomputers.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 1989

Communication reduction for distributed sparse matrix factorization on a processor mesh.
Proceedings of Supercomputing '89, Reno, NV, USA, November 12-17, 1989, 1989

A methodology for parallelizing programs for multicomputers and complex memory multiprocessors.
Proceedings of Supercomputing '89, Reno, NV, USA, November 12-17, 1989, 1989

One-to-one mapping of process graphs onto a hypercube.
Proceedings of the 3rd international conference on Supercomputing, 1989

Optimal Static Scheduling of Sequential Loops on Multiprocessors.
Proceedings of the International Conference on Parallel Processing, 1989

1988
Circuit Simulation on Shared-Memory Multiprocessors.
IEEE Trans. Computers, 1988

Iterative Algorithms for Solution of Large Sparse Systems of Linear Equations on Hypercubes.
IEEE Trans. Computers, 1988

Parallelization and performance evaluation of circuit simulation on a shared-memory multiprocessor.
Proceedings of the 2nd international conference on Supercomputing, 1988

An approach to synchronization for parallel computing.
Proceedings of the 2nd international conference on Supercomputing, 1988

A VLSI robotics vector processor for real-time control.
Proceedings of the 1988 IEEE International Conference on Robotics and Automation, 1988

Optimization by neural networks.
Proceedings of International Conference on Neural Networks (ICNN'88), 1988

Towards a 'neural' architecture for abductive reasoning.
Proceedings of International Conference on Neural Networks (ICNN'88), 1988

1987
Nearest-Neighbor Mapping of Finite Element Graphs onto Processor Meshes.
IEEE Trans. Computers, 1987

Cluster-Partitioning Approaches to Mapping Parallel Programs onto a Hypercube.
Proceedings of the Supercomputing, 1987

Mapping Finite Element Graphs onto Processor Meshes.
Proceedings of the International Conference on Parallel Processing, 1987

1985
Modeling switch-level simulation using data flow.
Proceedings of the 22nd ACM/IEEE conference on Design automation, 1985

