José Nelson Amaral

Orcid: 0000-0002-9943-1809

Affiliations:
  • University of Alberta, Edmonton, Canada


According to our database, José Nelson Amaral authored at least 131 papers between 1995 and 2024.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2024
Effective Communication of Scientific Results.
CoRR, 2024

Region-Based Data Layout via Data Reuse Analysis.
Proceedings of the 33rd ACM SIGPLAN International Conference on Compiler Construction, 2024

2023
Advancing Direct Convolution Using Convolution Slicing Optimization and ISA Extensions.
ACM Trans. Archit. Code Optim., December, 2023

Fast matrix multiplication via compiler-only layered data reorganization and intrinsic lowering.
Softw. Pract. Exp., September, 2023

YaConv: Convolution with Low Cache Footprint.
ACM Trans. Archit. Code Optim., March, 2023

CacheIR: The Benefits of a Structured Representation for Inline Caches.
Proceedings of the 20th ACM SIGPLAN International Conference on Managed Programming Languages and Runtimes, 2023

To Pack or Not to Pack: A Generalized Packing Analysis and Transformation.
Proceedings of the 21st ACM/IEEE International Symposium on Code Generation and Optimization, 2023

Efficient Auto-Vectorization for Control-flow Dependent Loops through Data Permutation.
Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering, 2023

Stub Folding: Retaining Type Specialization to Increase the Efficiency of Highly Polymorphic Inline Caches.
Proceedings of the 33rd Annual International Conference on Computer Science and Software Engineering, 2023

2022
Vectorizing divergent control flow with active-lane consolidation on long-vector architectures.
J. Supercomput., 2022

Compiling for the IBM Matrix Engine for Enterprise Workloads.
IEEE Micro, 2022

19th Compiler-Driven Performance Workshop.
Proceedings of the 32nd Annual International Conference on Computer Science and Software Engineering, 2022

Improving Convolution via Cache Hierarchy Tiling and Reduced Packing.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2022

2021
Instruction visibility in SPEC CPU2017.
J. Comput. Lang., 2021

Methodological Principles for Reproducible Performance Evaluation in Cloud Computing.
IEEE Trans. Software Eng., 2021

KernelFaRer: Replacing Native-Code Idioms with High-Performance Library Calls.
ACM Trans. Archit. Code Optim., 2021

Practical dynamic reconstruction of control flow graphs.
Softw. Pract. Exp., 2021

Pooling Acceleration in the DaVinci Architecture Using Im2col and Col2im Instructions.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2021

Vulkan Vision: Ray Tracing Workload Characterization using Automatic Graphics Instrumentation.
Proceedings of the IEEE/ACM International Symposium on Code Generation and Optimization, 2021

U can't inline this!
Proceedings of CASCON '21: 31st Annual International Conference on Computer Science and Software Engineering, Toronto, Ontario, Canada, November 22, 2021

2020
Flexibility Is Key in Organizing a Global Professional Conference Online: The ICPE 2020 Experience in the COVID-19 Era.
CoRR, 2020

PSU: A Framework for Dynamic Software Updates in Multi-threaded C-Language Programs.
Proceedings of the 32nd IEEE International Symposium on Computer Architecture and High Performance Computing, 2020

2019
Memory-access-aware Safety and Profitability Analysis for Transformation of Accelerator-bound OpenMP Loops.
ACM Trans. Archit. Code Optim., 2019

Efficient and Precise Dynamic Construction of Control Flow Graphs.
Proceedings of the XXIII Brazilian Symposium on Programming Languages, 2019

Toward an Analytical Performance Model to Select between GPU and CPU Execution.
Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops, 2019

Compiler-driven performance workshop.
Proceedings of the 29th Annual International Conference on Computer Science and Software Engineering, 2019

2018
Using Hardware-Transactional-Memory Support to Implement Thread-Level Speculation.
IEEE Trans. Parallel Distributed Syst., 2018

Syntax and sensibility: Using language models to detect and correct syntax errors.
Proceedings of the 25th International Conference on Software Analysis, 2018

OpenMP Code Offloading: Splitting GPU Kernels, Pipelining Communication and Computation, and Selecting Better Grid Geometries.
Proceedings of the Accelerator Programming Using Directives - 5th International Workshop, 2018

Automated GPU Grid Geometry Selection for OpenMP Kernels.
Proceedings of the 30th International Symposium on Computer Architecture and High Performance Computing, 2018

The Alberta Workloads for the SPEC CPU 2017 Benchmark Suite.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2018

Run-Length Base-Delta Encoding for High-Speed Compression.
Proceedings of the 47th International Conference on Parallel Processing, 2018

2017
Finding and correcting syntax errors using recurrent neural networks.
PeerJ Prepr., 2017

Performance Evaluation of Thread-Level Speculation in Off-the-Shelf Hardware Transactional Memories.
Proceedings of the Euro-Par 2017: Parallel Processing - 23rd International Conference on Parallel and Distributed Computing, Santiago de Compostela, Spain, August 28, 2017

2016
Combining Static and Dynamic Data Coalescing in Unified Parallel C.
IEEE Trans. Parallel Distributed Syst., 2016

The Truth, The Whole Truth, and Nothing But the Truth: A Pragmatic Guide to Assessing Empirical Evaluations.
ACM Trans. Program. Lang. Syst., 2016

SafeType: detecting type violations for type-based alias analysis of C.
Softw. Pract. Exp., 2016

Study of hardware transactional memory characteristics and serialization policies on Haswell.
Parallel Comput., 2016

Using shared-data localization to reduce the cost of inspector-execution in unified-parallel-C programs.
Parallel Comput., 2016

Evaluating and Improving Thread-Level Speculation in Hardware Transactional Memories.
Proceedings of the 2016 IEEE International Parallel and Distributed Processing Symposium, 2016

2015
Software Support and Evaluation of Hardware Transactional Memory on Blue Gene/Q.
IEEE Trans. Computers, 2015

Error location in Python: where the mutants hide.
PeerJ Prepr., 2015

Hybrid parallel task placement in irregular applications.
J. Parallel Distributed Comput., 2015

Guest Editorial: SBAC-PAD 2013.
Int. J. Parallel Program., 2015

In defense of soundiness: a manifesto.
Commun. ACM, 2015

Using Hardware Transactional Memory to Enable Speculative Trace Optimization.
Proceedings of the 2015 International Symposium on Computer Architecture and High Performance Computing Workshops, 2015

Serialization Management for Best-Effort Hardware Transactional Memory.
Proceedings of the 27th International Symposium on Computer Architecture and High Performance Computing, 2015

Stratified Sampling for Even Workload Partitioning Applied to IDA* and Delaunay Algorithms.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Stratified sampling for even workload partitioning applied to single source shortest path algorithm.
Proceedings of 25th Annual International Conference on Computer Science and Software Engineering, 2015

Data-dependence profiling to enable safe thread level speculation.
Proceedings of 25th Annual International Conference on Computer Science and Software Engineering, 2015

2014
A special issue from the international conference on performance engineering 2013.
Concurr. Comput. Pract. Exp., 2014

Multi-dimensional Evaluation of Haswell's Transactional Memory Performance.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing, 2014

Reducing Compiler-Inserted Instrumentation in Unified-Parallel-C Code Generation.
Proceedings of the 26th IEEE International Symposium on Computer Architecture and High Performance Computing, 2014

Syntax errors just aren't natural: improving error reporting with language models.
Proceedings of the 11th Working Conference on Mining Software Repositories, 2014

Measuring Effective Work to Reward Success in Dynamic Transaction Scheduling.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

Heavyweight Pattern Mining in Attributed Flow Graphs.
Proceedings of the 2014 IEEE International Conference on Data Mining, 2014

Optimizing shared data accesses in distributed-memory X10 systems.
Proceedings of the 21st International Conference on High Performance Computing, 2014

Stratified sampling for even workload partitioning.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2014

2013
Hybrid parallel task placement in X10.
Proceedings of the third ACM SIGPLAN X10 Workshop, 2013

Improving performance of all-to-all communication through loop scheduling in PGAS environments.
Proceedings of the International Conference on Supercomputing, 2013

Improving communication in PGAS environments: static and dynamic coalescing in UPC.
Proceedings of the International Conference on Supercomputing, 2013

On the Merits of Distributed Work-Stealing on Selective Locality-Aware Tasks.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

Automatic speculative parallelization of loops using polyhedral dependence analysis.
Proceedings of the First International Workshop on Code Optimisation for Multi and Many Cores, 2013

12th Compiler-Driven Performance Workshop.
Proceedings of the Center for Advanced Studies on Collaborative Research, 2013

2012
Combined profiling: A methodology to capture varied program behavior across multiple inputs.
Proceedings of the 2012 IEEE International Symposium on Performance Analysis of Systems & Software, 2012

11th Compiler-Driven Performance Workshop.
Proceedings of the Center for Advanced Studies on Collaborative Research, 2012

Evaluation of Blue Gene/Q hardware support for transactional memories.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

Transactional event profiling in a best-effort hardware transactional memory system.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011
Evaluating address register assignment and offset assignment algorithms.
ACM Trans. Embed. Comput. Syst., 2011

Combined profiling: practical collection of feedback information for code optimization.
Proceedings of the ICPE'11, 2011

Using machines to learn method-specific compilation strategies.
Proceedings of the CGO 2011, 2011

10th Workshop on Compiler-Driven Performance.
Proceedings of the Center for Advanced Studies on Collaborative Research, 2011

2010
An Optimal Encoding to Represent a Single Set in an ROBDD.
IEEE Trans. Computers, 2010

Using Support Vector Machines to Learn How to Compile a Method.
Proceedings of the 22nd International Symposium on Computer Architecture and High Performance Computing, 2010

Mining for Paths in Flow Graphs.
Proceedings of the Advances in Data Mining. Applications and Theoretical Aspects, 2010

Mining Opportunities for Code Improvement in a Just-In-Time Compiler.
Proceedings of the Compiler Construction, 19th International Conference, 2010

Compiling Python to a hybrid execution environment.
Proceedings of 3rd Workshop on General Purpose Processing on Graphics Processing Units, 2010

2009
Using XBDDs and ZBDDs in points-to analysis.
Softw. Pract. Exp., 2009

Workload Reduction for Multi-input Feedback-Directed Optimization.
Proceedings of the CGO 2009, 2009

2008
A cache-based internet protocol address lookup architecture.
Comput. Networks, 2008

MPADS: memory-pooling-assisted data splitting.
Proceedings of the 7th International Symposium on Memory Management, 2008

The MAP3S Static-and-Regular Mesh Simulation and Wavefront Parallel-Programming Patterns.
Proceedings of the 2008 International Conference on Parallel Processing, 2008

Topic 9: Parallel and Distributed Programming.
Proceedings of the Euro-Par 2008, 2008

2007
Forma: A framework for safe automatic array reshaping.
ACM Trans. Program. Lang. Syst., 2007

Ablego: a function outlining and partial inlining framework.
Softw. Pract. Exp., 2007

Using SIMD registers and instructions to enable instruction-level parallelism in sorting algorithms.
Proceedings of SPAA 2007: 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures, 2007

Using ZBDDs in Points-to Analysis.
Proceedings of the Languages and Compilers for Parallel Computing, 2007

Multidimensional Blocking in UPC.
Proceedings of the Languages and Compilers for Parallel Computing, 2007

Evaluation of Offset Assignment Heuristics.
Proceedings of the High Performance Embedded Architectures and Compilers, 2007

A Dimension Abstraction Approach to Vectorization in Matlab.
Proceedings of the Fifth International Symposium on Code Generation and Optimization (CGO 2007), 2007

2006
Is MPI suitable for a generative design-pattern system?
Parallel Comput., 2006

Eliminating Redundant Join-Set Computations in Static Single Assignment.
J. Univers. Comput. Sci., 2006

Shared memory programming for large scale machines.
Proceedings of the ACM SIGPLAN 2006 Conference on Programming Language Design and Implementation, 2006

A Characterization of Shared Data Access Patterns in UPC Programs.
Proceedings of the Languages and Compilers for Parallel Computing, 2006

Tree-Traversal Orientation Analysis.
Proceedings of the Languages and Compilers for Parallel Computing, 2006

Aestimo: a feedback-directed optimization evaluation tool.
Proceedings of the 2006 IEEE International Symposium on Performance Analysis of Systems and Software, 2006

A Parallel External-Memory Frontier Breadth-First Traversal Algorithm for Clusters of Workstations.
Proceedings of the 2006 International Conference on Parallel Processing (ICPP 2006), 2006

Utilizing field usage patterns for Java heap space optimization.
Proceedings of the 2006 conference of the Centre for Advanced Studies on Collaborative Research, 2006

Sequential and Parallel Algorithms for Frontier A* with Delayed Duplicate Detection.
Proceedings of the Proceedings, 2006

2005
Teaching digital design to computing science students in a single academic term.
IEEE Trans. Educ., 2005

Function Outlining and Partial Inlining.
Proceedings of the 17th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2005), 2005

A Multizone Pipelined Cache for IP Routing.
Proceedings of the NETWORKING 2005: Networking Technologies, 2005

A hardware-based longest prefix matching scheme for TCAMs.
Proceedings of the International Symposium on Circuits and Systems (ISCAS 2005), 2005

Feedback-Directed Switch-Case Statement Optimization.
Proceedings of the 34th International Conference on Parallel Processing Workshops (ICPP 2005 Workshops), 2005

Generalized Index-Set Splitting.
Proceedings of the Compiler Construction, 14th International Conference, 2005

2004
FPGA implementation and experimental evaluation of a multizone network cache.
Microprocess. Microsystems, 2004

A performance study of data layout techniques for improving data locality in refinement-based pathfinding.
ACM J. Exp. Algorithmics, 2004

An FPGA prototype for the experimental evaluation of a multizone network cache.
Proceedings of the ACM/SIGDA 12th International Symposium on Field Programmable Gate Arrays, 2004

Identifying opportunities for automatic remote field cloning.
Proceedings of the 2004 conference of the Centre for Advanced Studies on Collaborative research, 2004

2003
Minimum Register Instruction Sequencing to Reduce Register Spills in Out-of-Order Issue Superscalar Architectures.
IEEE Trans. Computers, 2003

Implementation of the EARTH programming model on SMP clusters: a multi-threaded language and runtime system.
Concurr. Comput. Pract. Exp., 2003

To Inline or Not to Inline? Enhanced Inlining Decisions.
Proceedings of the Languages and Compilers for Parallel Computing, 2003

Crafting Data Structures: A Study of Reference Locality in Refinement-Based Pathfinding.
Proceedings of the High Performance Computing - HiPC 2003, 10th International Conference, 2003

The Bank Nth Chance Replacement Policy for FPGA-Based CAMs.
Proceedings of the Field Programmable Logic and Application, 13th International Conference, 2003

Should potential loop optimizations influence inlining decisions?
Proceedings of the 2003 conference of the Centre for Advanced Studies on Collaborative Research, 2003

2002
On the Tamability of the Location Consistency Memory Model.
Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, 2002

Fine-Grain Stacked Register Allocation for the Itanium Architecture.
Proceedings of the Languages and Compilers for Parallel Computing, 15th Workshop, 2002

Removing Impediments to Loop Fusion Through Code Transformations.
Proceedings of the Languages and Compilers for Parallel Computing, 15th Workshop, 2002

2001
Dynamic Load Balancers for a Multithreaded Multiprocessor System.
Parallel Process. Lett., 2001

An Abstract State Machine Specification and Verification of the Location Consistency Memory Model and Cache Protocol.
J. Univers. Comput. Sci., 2001

Exploiting Locality in Single Assignment Data Structures Updated Through Split-Phase Transactions.
Clust. Comput., 2001

Minimum Register Instruction Sequence Problem: Revisiting Optimal Code Generation for DAGs.
Proceedings of the 15th International Parallel & Distributed Processing Symposium (IPDPS-01), 2001

Speculative Prefetching of Induction Pointers.
Proceedings of the Compiler Construction, 10th International Conference, 2001

2000
Design and Implementation of an Efficient Thread Partitioning Algorithm.
Proceedings of the High Performance Computing, Third International Symposium, 2000

Caching Single-Assignment Structures to Build a Robust Fine-Grain Multi-Threading System.
Proceedings of the 14th International Parallel & Distributed Processing Symposium (IPDPS'00), 2000

Automatic compiler techniques for thread coarsening for multithreaded architectures.
Proceedings of the 14th international conference on Supercomputing, 2000

1999
Coping with very High Latencies in Petaflop Computer Systems.
Proceedings of the High Performance Computing, Second International Symposium, 1999

1997
Invariant pattern recognition of 2D images using neural networks and frequency-domain representation.
Proceedings of International Conference on Neural Networks (ICNN'97), 1997

1996
A Concurrent Architecture for Serializable Production Systems.
IEEE Trans. Parallel Distributed Syst., 1996

1995
Designing genetic algorithms for the state assignment problem.
IEEE Trans. Syst. Man Cybern., 1995

Performance measurements of a concurrent production system architecture without global synchronization.
Proceedings of IPPS '95, 1995

