Per Stenström

According to our database, Per Stenström
  • authored at least 172 papers between 1987 and 2018.
  • has a "Dijkstra number" of three.

Awards

IEEE Fellow

IEEE Fellow 2007, "For contributions to design of high-performance memory systems".

Bibliography

2018
Scheduling Parallel Real-Time Recurrent Tasks on Multicore Platforms.
IEEE Trans. Parallel Distrib. Syst., 2018

ProFess: A Probabilistic Hybrid Main Memory Management Framework for High Performance and Fairness.
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2018

2017
SLOOP: QoS-Supervised Loop Execution to Reduce Energy on Heterogeneous Architectures.
TACO, 2017

A Framework for Automated and Controlled Floating-Point Accuracy Reduction in Graphics Applications on GPUs.
TACO, 2017

Runtime-Assisted Global Cache Management for Task-Based Parallel Programs.
Computer Architecture Letters, 2017

Timing-Anomaly Free Dynamic Scheduling of Task-Based Parallel Applications.
Proceedings of the 2017 IEEE Real-Time and Embedded Technology and Applications Symposium, 2017

Rock: a framework for pruning the design space of hybrid main memory systems.
Proceedings of the International Symposium on Memory Systems, 2017

2016
2015 Maurice Wilkes Award Given to Christos Kozyrakis.
IEEE Micro, 2016

PATer: A Hardware Prefetching Automatic Tuner on IBM POWER8 Processor.
Computer Architecture Letters, 2016

Timing-anomaly free dynamic scheduling of task-based parallel applications.
Proceedings of the 2016 IEEE Real-Time Systems Symposium, 2016

Adaptive Row Addressing for Cost-Efficient Parallel Memory Protocols in Large-Capacity Memories.
Proceedings of the Second International Symposium on Memory Systems, 2016

RADAR: Runtime-assisted dead region management for last-level caches.
Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture, 2016

EUROSERVER: Share-anything scale-out micro-server design.
Proceedings of the 2016 Design, Automation & Test in Europe Conference & Exhibition, 2016

2015
A Primer on Compression in the Memory Hierarchy
Synthesis Lectures on Computer Architecture, Morgan & Claypool Publishers, 2015

HyComp: a hybrid cache compression method for selection of data-type-specific compression methods.
Proceedings of the 48th International Symposium on Microarchitecture, 2015

Performance Impact of Batching Web-Application Requests Using Hot-Spot Processing on GPUs.
Proceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium, 2015

Enhancing Garbage Collection Synchronization Using Explicit Bit Barriers.
Proceedings of the 44th International Conference on Parallel Processing, 2015

2014
ZEBRA: Data-Centric Contention Management in Hardware Transactional Memory.
IEEE Trans. Parallel Distrib. Syst., 2014

Characterizing and Exploiting Small-Value Memory Instructions.
IEEE Trans. Computers, 2014

Introduction to the JPDC special issue on Perspectives on Parallel and Distributed Processing.
J. Parallel Distrib. Comput., 2014

Removal of Conflicts in Hardware Transactional Memory Systems.
International Journal of Parallel Programming, 2014

A Case for a Value-Aware Cache.
Computer Architecture Letters, 2014

Overhead-aware temporal partitioning on multicore processors.
Proceedings of the 20th IEEE Real-Time and Embedded Technology and Applications Symposium, 2014

SC2: A statistical compression cache scheme.
Proceedings of the ACM/IEEE 41st International Symposium on Computer Architecture, 2014

Runtime-Guided Cache Coherence Optimizations in Multi-core Architectures.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Performance and Energy Analysis of the Restricted Transactional Memory Implementation on Haswell.
Proceedings of the 2014 IEEE 28th International Parallel and Distributed Processing Symposium, 2014

Crystal: A Design-Time Resource Partitioning Method for Hybrid Main Memory.
Proceedings of the 43rd International Conference on Parallel Processing, 2014

Effective resource management towards efficient computing.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2014

2013
Eager Beats Lazy: Improving Store Management in Eager Hardware Transactional Memory.
IEEE Trans. Parallel Distrib. Syst., 2013

Moving from petaflops to petadata.
Commun. ACM, 2013

Efficient Forwarding of Producer-Consumer Data in Task-Based Programs.
Proceedings of the 42nd International Conference on Parallel Processing, 2013

HARP: Adaptive abort recurrence prediction for Hardware Transactional Memory.
Proceedings of the 20th Annual International Conference on High Performance Computing, 2013

Improving data access efficiency by using a tagless access buffer (TAB).
Proceedings of the 2013 IEEE/ACM International Symposium on Code Generation and Optimization, 2013

Keynote talk: Towards automatic resource management in parallel architectures.
Proceedings of the 22nd International Conference on Parallel Architectures and Compilation Techniques, 2013

2012
Introduction to the special issue on high-performance and embedded architectures and compilers.
TACO, 2012

Critical lock analysis: diagnosing critical section bottlenecks in multithreaded applications.
Proceedings of the SC Conference on High Performance Computing Networking, 2012

π-TM: Pessimistic invalidation for scalable lazy hardware transactional memory.
Proceedings of the 18th IEEE International Symposium on High Performance Computer Architecture, 2012

Transactional prefetching: narrowing the window of contention in hardware transactional memory.
Proceedings of the International Conference on Parallel Architectures and Compilation Techniques, 2012

2011
Classification and Elimination of Conflicts in Hardware Transactional Memory Systems.
Proceedings of the 23rd International Symposium on Computer Architecture and High Performance Computing, 2011

Panel Statement.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

The Impact of Non-coherent Buffers on Lazy Hardware Transactional Memory Systems.
Proceedings of the 25th IEEE International Symposium on Parallel and Distributed Processing, 2011

Poster: implications of merging phases on scalability of multi-core architectures.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

ZEBRA: a data-centric, hybrid-policy hardware transactional memory design.
Proceedings of the 25th International Conference on Supercomputing, 2011, Tucson, AZ, USA, May 31, 2011

Eager Meets Lazy: The Impact of Write-Buffering on Hardware Transactional Memory.
Proceedings of the International Conference on Parallel Processing, 2011

Implications of Merging Phases on Scalability of Multi-core Architectures.
Proceedings of the International Conference on Parallel Processing, 2011

A unified approach to eliminate memory accesses early.
Proceedings of the 14th International Conference on Compilers, 2011

Pi-TM: Pessimistic Invalidation for Scalable Lazy Hardware Transactional Memory.
Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, 2011

2010
The Velox Transactional Memory Stack.
IEEE Micro, 2010

LV*: A low complexity lazy versioning HTM infrastructure.
Proceedings of the 2010 International Conference on Embedded Computer Systems: Architectures, 2010

Characterization and exploitation of narrow-width loads: the narrow-width cache approach.
Proceedings of the 2010 International Conference on Compilers, 2010

2009
FlexCore: Utilizing Exposed Datapath Control for Efficient Computing.
Signal Processing Systems, 2009

Introduction.
Trans. HiPEAC, 2009

Schemes for avoiding starvation in transactional memory systems.
Concurrency and Computation: Practice and Experience, 2009

Cancellation of loads that return zero using zero-value caches.
Proceedings of the 23rd international conference on Supercomputing, 2009

A Flexible Code Compression Scheme Using Partitioned Look-Up Tables.
Proceedings of the High Performance Embedded Architectures and Compilers, 2009

Zero-Value Caches: Cancelling Loads that Return Zero.
Proceedings of the PACT 2009, 2009

Using Hoarding to Increase Availability in Shared File Systems.
Proceedings of the 8th IEEE/ACIS International Conference on Computer and Information Science, 2009

2008
The worst-case execution-time problem - overview of methods and survey of tools.
ACM Trans. Embedded Comput. Syst., 2008

Memory-Link Compression Schemes: A Value Locality Perspective.
IEEE Trans. Computers, 2008

Early detection and bypassing of trivial operations to improve energy efficiency of processors.
Microprocessors and Microsystems - Embedded Hardware Design, 2008

Simple Penalty-Sensitive Cache Replacement Policies.
J. Instruction-Level Parallelism, 2008

Dual-thread Speculation: A Simple Approach to Uncover Thread-level Parallelism on a Simultaneous Multithreaded Processor.
International Journal of Parallel Programming, 2008

Efficient management of speculative data in hardware transactional memory systems.
Proceedings of the 2008 International Conference on Embedded Computer Systems: Architectures, 2008

Intermediate checkpointing with conflicting access prediction in transactional memory systems.
Proceedings of the 22nd IEEE International Symposium on Parallel and Distributed Processing, 2008

Accommodation of the Bandwidth of Large Cache Blocks Using Cache/Memory Link Compression.
Proceedings of the 2008 International Conference on Parallel Processing, 2008

Leveraging Data Promotion for Low Power D-NUCA Caches.
Proceedings of the 11th Euromicro Conference on Digital System Design: Architectures, 2008

2007
Introduction to Part 1.
Trans. HiPEAC, 2007

High-Performance Embedded Architecture and Compilation Roadmap.
Trans. HiPEAC, 2007

Starvation-free commit arbitration policies for transactional memory systems.
SIGARCH Computer Architecture News, 2007

An LRU-based replacement algorithm augmented with frequency of access in shared chip-multiprocessor caches.
SIGARCH Computer Architecture News, 2007

Improving power efficiency of D-NUCA caches.
SIGARCH Computer Architecture News, 2007

SimWattch: Integrating Complete-System and User-Level Performance and Power Simulators.
IEEE Micro, 2007

Effectiveness of caching in a distributed digital library system.
Journal of Systems Architecture, 2007

Energy and Performance Trade-offs between Instruction Reuse and Trivial Computations for Embedded Applications.
Proceedings of the IEEE Second International Symposium on Industrial Embedded Systems, 2007

FlexCore: Utilizing Exposed Datapath Control for Efficient Computing.
Proceedings of the 2007 International Conference on Embedded Computer Systems: Architectures, 2007

IPDPS Panel: Is the Multi-Core Roadmap going to Live Up to its Promises?
Proceedings of the 21st International Parallel and Distributed Processing Symposium (IPDPS 2007), 2007

Loop-level Speculative Parallelism in Embedded Applications.
Proceedings of the 2007 International Conference on Parallel Processing (ICPP 2007), 2007

An Adaptive Shared/Private NUCA Cache Partitioning Scheme for Chip Multiprocessors.
Proceedings of the 13th International Conference on High-Performance Computer Architecture (HPCA-13 2007), 2007

Starvation-Free Transactional Memory-System Protocols.
Proceedings of the Euro-Par 2007, 2007

Microprocessors in the era of terascale integration.
Proceedings of the 2007 Design, Automation and Test in Europe Conference and Exposition, 2007

Implicit Transactional Memory in Kilo-Instruction Multiprocessors.
Proceedings of the Advances in Computer Systems Architecture, 2007

2006
Introduction.
J. Parallel Distrib. Comput., 2006

Dual-Thread Speculation: Two Threads in the Machine are Worth Eight in the Bush.
Proceedings of the 18th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2006), 2006

Scalable Value-Cache Based Compression Schemes for Multiprocessors.
Proceedings of the 18th Symposium on Computer Architecture and High Performance Computing (SBAC-PAD 2006), 2006

Reduction of Energy Consumption in Processors by Early Detection and Bypassing of Trivial Operations.
Proceedings of the 2006 International Conference on Embedded Computer Systems: Architectures, 2006

Chip-multiprocessing and beyond.
Proceedings of the 12th International Symposium on High-Performance Computer Architecture, 2006

A Cache-Partitioning Aware Replacement Policy for Chip Multiprocessors.
Proceedings of the High Performance Computing, 2006

Simple penalty-sensitive replacement policies for caches.
Proceedings of the Third Conference on Computing Frontiers, 2006

Enhancing Last-Level Cache Performance by Block Bypassing and Early Miss Determination.
Proceedings of the Advances in Computer Systems Architecture, 11th Asia-Pacific Conference, 2006

2005
Introduction to the special issue.
ACM Trans. Embedded Comput. Syst., 2005

Enhancing Multiprocessor Architecture Simulation Speed Using Matched-Pair Comparison.
Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, 2005

A Robust Main-Memory Compression Scheme.
Proceedings of the 32nd International Symposium on Computer Architecture (ISCA 2005), 2005

A Cost-Effective Main Memory Organization for Future Servers.
Proceedings of the 19th International Parallel and Distributed Processing Symposium (IPDPS 2005), 2005

Implementing Kilo-Instruction Multiprocessors.
Proceedings of the International Conference on Pervasive Services 2005, 2005

The Chip-Multiprocessing Paradigm Shift: Opportunities and Challenges.
Proceedings of the High Performance Embedded Architectures and Compilers, 2005

Reducing misspeculation overhead for module-level speculative execution.
Proceedings of the Second Conference on Computing Frontiers, 2005

Evaluation of extended dictionary-based static code compression schemes.
Proceedings of the Second Conference on Computing Frontiers, 2005

2004
A cache block reuse prediction scheme.
Microprocessors and Microsystems, 2004

A comparative evaluation of hardware-only and software-only directory protocols in shared-memory multiprocessors.
Journal of Systems Architecture, 2004

A case for multi-level main memory.
Proceedings of the 3rd Workshop on Memory Performance Issues, 2004

Self-correcting LRU replacement policies.
Proceedings of the First Conference on Computing Frontiers, 2004

2003
Integrating complete-system and user-level performance/power simulators: the SimWattch approach.
Proceedings of the 2003 IEEE International Symposium on Performance Analysis of Systems and Software, 2003

Improving Speculative Thread-Level Parallelism Through Module Run-Length Prediction.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

Speculative Lock Reordering: Optimistic Out-of-Order Execution of Critical Sections.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

The Coherence Predictor Cache: A Resource-Efficient and Accurate Coherence Prediction Infrastructure.
Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 2003

A Novel Approach to Cache Block Reuse Predictions.
Proceedings of the 32nd International Conference on Parallel Processing (ICPP 2003), 2003

Performance and Power Impact of Issue-width in Chip-Multiprocessor Cores.
Proceedings of the 32nd International Conference on Parallel Processing (ICPP 2003), 2003

One Chip, One Server: How Do We Exploit Its Power?
Proceedings of the High Performance Computing - HiPC 2003, 10th International Conference, 2003

An Evaluation of Document Prefetching in a Distributed Digital Library.
Proceedings of the Research and Advanced Technology for Digital Libraries, 2003

2002
Improvement of energy-efficiency in off-chip caches by selective prefetching.
Microprocessors and Microsystems, 2002

TLB and snoop energy-reduction using virtual caches in low-power chip-multiprocessors.
Proceedings of the 2002 International Symposium on Low Power Electronics and Design, 2002

Empirical Observations Regarding Predictability in User Access-Behavior in a Distributed Digital Library System.
Proceedings of the 16th International Parallel and Distributed Processing Symposium (IPDPS 2002), 2002

The FAB Predictor: Using Fourier Analysis to Predict the Outcome of Conditional Branches.
Proceedings of the Eighth International Symposium on High-Performance Computer Architecture (HPCA'02), 2002

2001
An All-Software Thread-Level Data Dependence Speculation System for Multiprocessors.
J. Instruction-Level Parallelism, 2001

A Case Study of Load Distribution in Parallel View Frustum Culling and Collision Detection.
Proceedings of the Euro-Par 2001: Parallel Processing, 2001

Limits on Speculative Module-Level Parallelism in Imperative and Object-Oriented Programs on CMP Platforms.
Proceedings of the 2001 International Conference on Parallel Architectures and Compilation Techniques (PACT 2001), 2001

2000
Comparative Evaluation of Latency-Tolerating and -Reducing Techniques for Hardware-Only and Software-Only Directory Protocols.
J. Parallel Distrib. Comput., 2000

Shared-memory multiprocessing: Current state and future directions.
Advances in Computers, 2000

An analytical model of the working-set sizes in decision-support systems.
Proceedings of the 2000 ACM SIGMETRICS international conference on Measurement and modeling of computer systems, 2000

Recency-based TLB preloading.
Proceedings of the 27th International Symposium on Computer Architecture (ISCA 2000), 2000

A Prefetching Technique for Irregular Accesses to Linked Data Structures.
Proceedings of the Sixth International Symposium on High-Performance Computer Architecture, 2000

Parallel Computer Architecture.
Proceedings of the Euro-Par 2000, Parallel Processing, 6th International Euro-Par Conference, Munich, Germany, August 29, 2000

1999
An Integrated Path and Timing Analysis Method based on Cycle-Level Symbolic Execution.
Real-Time Systems, 1999

Evaluation of Compiler-Controlled Updating to Reduce Coherence-Miss Penalties in Shared-Memory Multiprocessors.
J. Parallel Distrib. Comput., 1999

Timing Anomalies in Dynamically Scheduled Microprocessors.
Proceedings of the 20th IEEE Real-Time Systems Symposium, 1999

A Method to Improve the Estimated Worst-Case Performance of Data Caching.
Proceedings of the 6th International Workshop on Real-Time Computing and Applications Symposium (RTCSA '99), 1999

1998
Performance Evaluation and Cost Analysis of Cache Protocol Extensions for Shared-Memory Multiprocessors.
IEEE Trans. Computers, 1998

An evaluation of hardware-based and compiler-controlled optimizations of snooping cache protocols.
Future Generation Comp. Syst., 1998

A holistic approach to computer system design education based on system simulation techniques.
Proceedings of the 1998 workshop on Computer architecture education, 1998

SimICS/Sun4m: A Virtual Workstation.
Proceedings of the 1998 USENIX Annual Technical Conference, 1998

Integrating Path and Timing Analysis Using Instruction-Level Simulation Techniques.
Proceedings of the Languages, 1998

1997
Effectiveness of Dynamic Prefetching in Multiple-Writer Distributed Virtual Shared-Memory Systems.
J. Parallel Distrib. Comput., 1997

Trends in Shared Memory Multiprocessing.
IEEE Computer, 1997

Boosting the Performance of Shared Memory Multiprocessors.
IEEE Computer, 1997

Reducing the Read-Miss Penalty for Flat COMA Protocols.
Comput. J., 1997

Relative Performance of Hardware and Software-Only Directory Protocols Under Latency Tolerating and Reducing Techniques.
Proceedings of the 11th International Parallel Processing Symposium (IPPS '97), 1997

A Performance Tuning Approach for Shared-Memory Multiprocessors.
Proceedings of the Euro-Par '97 Parallel Processing, 1997

1996
Evaluation of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors.
IEEE Trans. Parallel Distrib. Syst., 1996

Using Dataflow Analysis Techniques to Reduce Ownership Overhead in Cache Coherence Protocols.
ACM Trans. Program. Lang. Syst., 1996

Characterising and Modelling Shared Memory Accesses in Multiprocessor Programs.
Parallel Computing, 1996

The design of a non-blocking load processor architecture.
Microprocessors and Microsystems - Embedded Hardware Design, 1996

Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection.
J. Parallel Distrib. Comput., 1996

Applications for Shared Memory Multiprocessors (Guest Editors' Introduction).
IEEE Computer, 1996

Performance Evaluation of a Cluster-Based Multiprocessor Built from ATM Switches and Bus-Based Multiprocessor Servers.
Proceedings of the Second International Symposium on High-Performance Computer Architecture, 1996

1995
Sequential Hardware Prefetching in Shared-Memory Multiprocessors.
IEEE Trans. Parallel Distrib. Syst., 1995

Essential Misses and Data Traffic in Coherence Protocols.
J. Parallel Distrib. Comput., 1995

Using Write Caches to Improve Performance of Cache Coherence Protocols in Shared-Memory Multiprocessors.
J. Parallel Distrib. Comput., 1995

Implementation and evaluation of update-based cache protocols under relaxed memory consistency models.
Future Generation Comp. Syst., 1995

Efficient Strategies for Software-Only Protocols in Shared-Memory Multiprocessors.
Proceedings of the 22nd Annual International Symposium on Computer Architecture, 1995

Effectiveness of Hardware-Based Stride and Sequential Prefetching in Shared-Memory Multiprocessors.
Proceedings of the 1st IEEE Symposium on High-Performance Computer Architecture (HPCA 1995), 1995

Using hints to reduce the read miss penalty for flat COMA protocols.
Proceedings of the 28th Annual Hawaii International Conference on System Sciences (HICSS-28), 1995

A compiler algorithm that reduces read latency in ownership-based cache coherence protocols.
Proceedings of the IFIP WG10.3 working conference on Parallel architectures and compilation techniques, 1995

1994
Modelling accesses to migratory and producer-consumer characterised data in a shared memory multiprocessor.
Proceedings of the Sixth IEEE Symposium on Parallel and Distributed Processing, 1994

An Adaptive Update-Based Cache Coherence Protocol for Reduction of Miss Rate and Traffic.
Proceedings of the PARLE '94: Parallel Architectures and Languages Europe, 1994

Combined Performance Gains of Simple Cache Protocol Extensions.
Proceedings of the 21st Annual International Symposium on Computer Architecture. Chicago, 1994

An Integrated Methodology for the Verification of Directory-Based Cache Protocols.
Proceedings of the 1994 International Conference on Parallel Processing, 1994

Reducing the Write Traffic for a Hybrid Cache Protocol.
Proceedings of the 1994 International Conference on Parallel Processing, 1994

Introduction.
Proceedings of the 27th Annual Hawaii International Conference on System Sciences (HICSS-27), 1994

Simple Compiler Algorithms to Reduce Ownership Overhead in Cache Coherence Protocols.
Proceedings of ASPLOS-VI, 1994

1993
An Adaptive Cache Coherence Protocol Optimized for Migratory Sharing.
Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993

The Detection and Elimination of Useless Misses in Multiprocessors.
Proceedings of the 20th Annual International Symposium on Computer Architecture, 1993

Fixed and Adaptive Sequential Prefetching in Shared Memory Multiprocessors.
Proceedings of the 1993 International Conference on Parallel Processing, 1993

The Cachemire Test Bench - A Flexible and Effective Approach for Simulation of Multiprocessors.
Proceedings of the 26th Annual Simulation Symposium, ANSS 1993, 1993

1992
The Scalable Tree Protocol - A Cache Coherence Approach for Large-Scale Multiprocessors.
Proceedings of the Fourth IEEE Symposium on Parallel and Distributed Processing, 1992

Comparative Performance Evaluation of Cache-Coherent NUMA and COMA Architectures.
Proceedings of the 19th Annual International Symposium on Computer Architecture. Gold Coast, 1992

A Latency-Hiding Scheme for Multiprocessors with Buffered Multistage Networks.
Proceedings of the 6th International Parallel Processing Symposium, 1992

1991
On Reconfigurable On-Chip Data Caches.
Proceedings of the 24th Annual IEEE/ACM International Symposium on Microarchitecture, 1991

A Lockup-Free Multiprocessor Cache Design.
Proceedings of the International Conference on Parallel Processing, 1991

1990
A Survey of Cache Coherence Schemes for Multiprocessors.
IEEE Computer, 1990

1989
A Cache Consistency Protocol for Multiprocessors with Multistage Networks.
Proceedings of the 16th Annual International Symposium on Computer Architecture. Jerusalem, 1989

1988
Reducing Contention in Shared-Memory Multiprocessors.
IEEE Computer, 1988

1987
A Layered Emulator for Design Evaluation of MIMD Multiprocessors with Shared Memory.
Proceedings of the PARLE, 1987

