Chunyuan Zhang

ORCID: 0000-0001-8297-9803

According to our database, Chunyuan Zhang authored at least 136 papers between 1997 and 2024.

Bibliography

2024
SparGD: A Sparse GEMM Accelerator with Dynamic Dataflow.
ACM Trans. Design Autom. Electr. Syst., March, 2024

2023
RHS-TRNG: A Resilient High-Speed True Random Number Generator Based on STT-MTJ Device.
IEEE Trans. Very Large Scale Integr. Syst., October, 2023

Recursive least squares method for training and pruning convolutional neural networks.
Appl. Intell., October, 2023

2022
Recursive Least Squares Advantage Actor-Critic Algorithms.
CoRR, 2022

Recursive Least Squares for Training and Pruning Convolutional Neural Networks.
CoRR, 2022

Recursive Least Squares Policy Control with Echo State Network.
CoRR, 2022

BP-Im2col: Implicit Im2col Supporting AI Backpropagation on Systolic Arrays.
Proceedings of the IEEE 40th International Conference on Computer Design, 2022

2021
Revisiting Recursive Least Squares for Training Deep Neural Networks.
CoRR, 2021

Minibatch Recursive Least Squares Q-Learning.
Comput. Intell. Neurosci., 2021

Automatic mapping and code optimization for OpenCL kernels on FT-matrix architecture (WIP paper).
Proceedings of the LCTES '21: 22nd ACM SIGPLAN/SIGBED International Conference on Languages, 2021

SAI: Self-Adjusting Incremental Quantile Estimation for Sparse Training of Neural Networks on Hardware Accelerators.
Proceedings of the 2021 IEEE 23rd Int Conf on High Performance Computing & Communications; 7th Int Conf on Data Science & Systems; 19th Int Conf on Smart City; 7th Int Conf on Dependability in Sensor, 2021

2020
Deep Learning Research and Development Platform: Characterizing and Scheduling with QoS Guarantees on GPU Clusters.
IEEE Trans. Parallel Distributed Syst., 2020

Toward an Efficient Deep Pipelined Template-Based Architecture for Accelerating the Entire 2-D and 3-D CNNs on FPGA.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2020

Efficient Parallel TLD on CPU-GPU Platform for Real-Time Tracking.
KSII Trans. Internet Inf. Syst., 2020

Estimation of the parameters of a weighted nuclear norm model and its application in image denoising.
Inf. Sci., 2020

P4 to FPGA-A Fast Approach for Generating Efficient Network Processors.
IEEE Access, 2020

Incremental Deployment of Programmable Switches for Sketch-based Network Measurement.
Proceedings of the IEEE Symposium on Computers and Communications, 2020

Efficient Mini-batch Training for Echo State Networks.
Proceedings of the ICRAI 2020: 6th International Conference on Robotics and Artificial Intelligence, 2020

Towards High-Efficiency Data Centers via Job-Aware Network Scheduling.
Proceedings of the ICPP 2020: 49th International Conference on Parallel Processing, 2020

HybridSketch: A Memory-centric Precise Approach for Flow Measurement.
Proceedings of the 2020 IEEE International Conference on Communications, 2020

Optimized HybridSketch: More Efficient with Analysis and Algorithm.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2020

Towards a Deep-Pipelined Architecture for Accelerating Deep GCN on a Multi-FPGA Platform.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2020

Scalable FPGA-based Architecture for High-Performance Per-Flow Traffic Measurement.
Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020

Towards Memory-Efficient Streaming Processing with Counter-Cascading Sketching on FPGA.
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020

2019
A Fast Approach for Generating Efficient Parsers on FPGAs.
Symmetry, 2019

Application-Oriented Network Scheduling With Metaflow.
IEEE Access, 2019

Interleaved Sketch: Toward Consistent Network Telemetry for Commodity Programmable Switches.
IEEE Access, 2019

KVSwitch: An In-network Load Balancer for Key-Value Stores.
Proceedings of the 2019 IEEE Symposium on Computers and Communications, 2019

Towards a Uniform Architecture for the Efficient Implementation of 2D and 3D Deconvolutional Neural Networks on FPGAs.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2019

Poster Abstract: Deep Learning Workloads Scheduling with Reinforcement Learning on GPU Clusters.
Proceedings of the IEEE INFOCOM 2019, 2019

Poster Abstract: A Template-based Framework for Generating Network Processor in FPGA.
Proceedings of the IEEE INFOCOM 2019, 2019

An Efficient Design Flow for Accelerating Complicated-connected CNNs on a Multi-FPGA Platform.
Proceedings of the 48th International Conference on Parallel Processing, 2019

TBSW: Time-Based Sliding Window Algorithm for Network Traffic Measurement.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

SACC: Configuring Application-Level Cache Intelligently for In-Memory Database Based on Long Short-Term Memory.
Proceedings of the 21st IEEE International Conference on High Performance Computing and Communications; 17th IEEE International Conference on Smart City; 5th IEEE International Conference on Data Science and Systems, 2019

SWAP: a sliding window algorithm for in-network packet measurement.
Proceedings of the 3rd International Conference on High Performance Compilation, 2019

Accelerating 3D CNN-based Lung Nodule Segmentation on a Multi-FPGA System.
Proceedings of the 2019 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2019

GENIE: QoS-guided Dynamic Scheduling for CNN-based Tasks on SME Clusters.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2019

Scale-out Acceleration for 3D CNN-based Lung Nodule Segmentation on a Multi-FPGA System.
Proceedings of the 56th Annual Design Automation Conference 2019, 2019

2018
HPGraph: High-Performance Graph Analytics with Productivity on the GPU.
Sci. Program., 2018

MALMM: A multi-array architecture for large-scale matrix multiplication on FPGA.
IEICE Electron. Express, 2018

Design of Practical Experiences to Improve Student Understanding of Efficiency and Scalability Issues in High Performance Computing: (Abstract Only).
Proceedings of the 49th ACM Technical Symposium on Computer Science Education, 2018

Towards a Multi-array Architecture for Accelerating Large-scale Matrix Multiplication on FPGAs.
Proceedings of the IEEE International Symposium on Circuits and Systems, 2018

Multiple CNN-based Tasks Scheduling across Shared GPU Platform in Research and Development Scenarios.
Proceedings of the 20th IEEE International Conference on High Performance Computing and Communications; 16th IEEE International Conference on Smart City; 4th IEEE International Conference on Data Science and Systems, 2018

High performance graph analytics with productivity on hybrid CPU-GPU platforms.
Proceedings of the 2nd International Conference on High Performance Compilation, 2018

Towards a Uniform Template-based Architecture for Accelerating 2D and 3D CNNs on FPGA.
Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2018

Parallel programming course development based on parallel computational thinking.
Proceedings of ACM Turing Celebration Conference - China, 2018

2017
A Highly Parallel and Scalable Motion Estimation Algorithm with GPU for HEVC.
Sci. Program., 2017

Exploiting a depth context model in visual tracking with correlation filter.
Frontiers Inf. Technol. Electron. Eng., 2017

Applying Detection Proposals to Visual Tracking for Scale and Aspect Ratio Adaptability.
Int. J. Comput. Vis., 2017

FPGA-accelerated deep convolutional neural networks for high throughput and energy efficiency.
Concurr. Comput. Pract. Exp., 2017

High Performance Implementation of 3D Convolutional Neural Networks on a GPU.
Comput. Intell. Neurosci., 2017

Optimizing OpenCL Implementation of Deep Convolutional Neural Network on FPGA.
Proceedings of the Network and Parallel Computing, 2017

DCC: Distributed Cache Consistency.
Proceedings of the Data Science, 2017

Winograd Algorithm for 3D Convolution Neural Networks.
Proceedings of the Artificial Neural Networks and Machine Learning - ICANN 2017, 2017

RVNet: A fast and high energy efficiency network packet processing system on RISC-V.
Proceedings of the 28th IEEE International Conference on Application-specific Systems, 2017

2016
Kernel Recursive Least-Squares Temporal Difference Algorithms with Sparsification and Regularization.
Comput. Intell. Neurosci., 2016

Enabling Tissue-Scale Cardiac Simulations Using Heterogeneous Computing on Tianhe-2.
Proceedings of the 22nd IEEE International Conference on Parallel and Distributed Systems, 2016

Multikernel Recursive Least-Squares Temporal Difference Learning.
Proceedings of the Intelligent Computing Methodologies - 12th International Conference, 2016

Improve security and availability for cloud storage.
Proceedings of the 4th International Conference on Cloud Computing and Intelligence Systems, 2016

2015
An analytical GPU performance model for 3D stencil computations from the angle of data traffic.
J. Supercomput., 2015

A Computational Model of the Short-Cut Rule for 2D Shape Decomposition.
IEEE Trans. Image Process., 2015

Towards simulation of subcellular calcium dynamics at nanometre resolution.
Int. J. High Perform. Comput. Appl., 2015

Enabling a Uniform OpenCL Device View for Heterogeneous Platforms.
IEICE Trans. Inf. Syst., 2015

Communication-hiding programming for clusters with multi-coprocessor nodes.
Concurr. Comput. Pract. Exp., 2015

Fast tracking via context depth model learning.
Proceedings of the 2015 IEEE International Conference on Image Processing, 2015

Enable Scale and Aspect Ratio Adaptability in Visual Tracking with Detection Proposals.
Proceedings of the British Machine Vision Conference 2015, 2015

2014
High efficient sedimentary basin simulations on hybrid CPU-GPU clusters.
Clust. Comput., 2014

Utilizing Multiple Xeon Phi Coprocessors on One Compute Node.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2014

Automated Transformation of GPU-Specific OpenCL Kernels Targeting Performance Portability on Multi-Core/Many-Core CPUs.
Proceedings of the Euro-Par 2014 Parallel Processing, 2014

A fault detection mechanism in a Data-flow scheduled Multithreaded processor.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2014

Rethread: A Low-Cost Transient Fault Recovery Scheme for Multithreaded Processors.
Proceedings of the Ninth International Conference on Availability, 2014

2013
Shape Similarity Analysis by Self-Tuning Locally Constrained Mixed-Diffusion.
IEEE Trans. Multim., 2013

Accelerating thread-intensive and explicit memory management programs with dynamic partial reconfiguration.
J. Supercomput., 2013

Resource-efficient utilization of CPU/GPU-based heterogeneous supercomputers for Bayesian phylogenetic inference.
J. Supercomput., 2013

Efficient fine-grained shared buffer management for multiple OpenCL devices.
J. Zhejiang Univ. Sci. C, 2013

Simulating Cardiac Electrophysiology in the Era of GPU-Cluster Computing.
IEICE Trans. Inf. Syst., 2013

On the GPU Performance of 3D Stencil Computations Implemented in OpenCL.
Proceedings of the Supercomputing - 28th International Supercomputing Conference, 2013

On-demand thread-level fault detection in a concurrent programming environment.
Proceedings of the 2013 International Conference on Embedded Computer Systems: Architectures, 2013

An Adaptive Low-Overhead Mechanism for Dependable General-Purpose Many-Core Processors.
Proceedings of the Information and Communication Technology - International Conference, 2013

On the GPU-CPU Performance Portability of OpenCL for 3D Stencil Computations.
Proceedings of the 19th IEEE International Conference on Parallel and Distributed Systems, 2013

Performance of Sediment Transport Simulations on NVIDIA's Kepler Architecture.
Proceedings of the International Conference on Computational Science, 2013

Solving the Cardiac Model Using Multi-core CPU and Many Integrated Cores (MIC).
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

Device View Redundancy: An Adaptive Low-Overhead Fault Tolerance Mechanism for Many-Core System.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

Automatic Mapping Single-Device OpenCL Program to Heterogeneous Multi-device Platform.
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013

ACF: Networks-on-Chip Deadlock Recovery with Accurate Detection and Elastic Credit.
Proceedings of the Advanced Parallel Processing Technologies, 2013

2012
Identification method of independent module for dynamic fault tree with interdependent basic events and repeated events.
Int. J. Comput. Appl. Technol., 2012

Improving Performance of GPU Specific OpenCL Program on CPUs.
Proceedings of the 13th International Conference on Parallel and Distributed Computing, 2012

Architecting Dependable Many-Core Processors Using Core-Level Dynamic Redundancy.
Proceedings of the Trustworthy Computing and Services - International Conference, ISCTCS 2012, Beijing, China, May 28, 2012

A Parallel H.264 Encoder with CUDA: Mapping and Evaluation.
Proceedings of the 18th IEEE International Conference on Parallel and Distributed Systems, 2012

Parallelization Design of Irregular Algorithms of Video Processing on GPUs.
Proceedings of the 2012 IEEE International Conference on Multimedia and Expo, 2012

Fully Distributed On-chip Instruction Memory Design for Stream Architecture Based on Field-Divided VLIW Compression.
Proceedings of the 14th IEEE International Conference on High Performance Computing and Communication & 9th IEEE International Conference on Embedded Software and Systems, 2012

Extending BORPH for shared memory reconfigurable computers.
Proceedings of the 22nd International Conference on Field Programmable Logic and Applications (FPL), 2012

The masala machine: accelerating thread-intensive and explicit memory management programs with dynamically reconfigurable FPGAs (abstract only).
Proceedings of the ACM/SIGDA 20th International Symposium on Field Programmable Gate Arrays, 2012

Using 1000+ GPUs and 10000+ CPUs for Sedimentary Basin Simulations.
Proceedings of the 2012 IEEE International Conference on Cluster Computing, 2012

2011
Tiled Multi-Core Stream Architecture.
Trans. High Perform. Embed. Archit. Compil., 2011

A high-efficient software parallel CAVLC encoder based on GPU.
Proceedings of the 34th International Conference on Telecommunications and Signal Processing (TSP 2011), 2011

High-efficient software parallel CAVLC encoder based on programmable stream processor.
Proceedings of the 19th International Conference on Multimedia 2011, Scottsdale, AZ, USA, November 28, 2011

A Multilevel Parallel Intra Coding for H.264/AVC Based on CUDA.
Proceedings of the Sixth International Conference on Image and Graphics, 2011

2010
Importance Measure Method for Dynamic Fault Tree Based on Isomorphic Node.
Proceedings of the Information Computing and Applications - First International Conference, 2010

A Parallel Streaming Motion Estimation for Real-Time HD H.264 Encoding on Programmable Processors.
Proceedings of the Fifth International Conference on Frontier of Computer Science and Technology, 2010

Software Managed Instruction Scratchpad Memory Optimization in Stream Architecture Based on Hot Code Analysis of Kernels.
Proceedings of the 13th Euromicro Conference on Digital System Design, 2010

SAT: A Stream Architecture Template for Embedded Applications.
Proceedings of the 10th IEEE International Conference on Computer and Information Technology, 2010

2009
Optimal Channel Width Adaptation, Logical Topology Design, and Routing in Wireless Mesh Networks.
EURASIP J. Wirel. Commun. Netw., 2009

QoS-aware on-demand channel width adaptation protocols for multi-radio ad-hoc networks.
Proceedings of the 2009 IEEE Wireless Communications and Networking Conference, 2009

Interference-aware Broadcast Routing and Channel Assignment for Multi-Radio Wireless Mesh Networks.
Proceedings of the 70th IEEE Vehicular Technology Conference, 2009

Streaming HD H.264 encoder on programmable processors.
Proceedings of the 17th International Conference on Multimedia 2009, 2009

Cache streamization for high performance stream processor.
Proceedings of the 16th International Conference on High Performance Computing, 2009

Software parallel CAVLC encoder based on stream processing.
Proceedings of the 7th IEEE/ACM/IFIP Workshop on Embedded Systems for Real-Time Multimedia, 2009

Joint Channel Width Adaptation, Topology Control, and Routing for Multi-Radio Multi-Channel Wireless Mesh Networks.
Proceedings of the 6th IEEE Consumer Communications and Networking Conference, 2009

2008
Transform coding on programmable stream processors.
J. Supercomput., 2008

On-Chip Memory System Optimization Design for the FT64 Scientific Stream Accelerator.
IEEE Micro, 2008

Improving the Quality of Graduate Education by Association Rules Analysis.
Proceedings of the Fifth International Conference on Fuzzy Systems and Knowledge Discovery, 2008

Load scheduling: Reducing pressure on distributed register files for free.
Proceedings of the 13th Asia South Pacific Design Automation Conference, 2008

FPGA-based Equivalent Simulation Technology (FEST) for clustered stream architecture.
Proceedings of the 13th Asia-Pacific Computer Systems Architecture Conference, 2008

2007
Stream Algorithm for 4*4 Integer Transform in H.264.
Proceedings of the 2007 International Conference on Multimedia and Ubiquitous Engineering (MUE 2007), 2007

Broadcasting in Multi-Radio Multi-Channel and Multi-Hop Wireless Networks.
Proceedings of the Real-Time Mobile Multimedia Services, 2007

Cut Sequence Set Generation for Fault Tree Analysis.
Proceedings of the Embedded Software and Systems, [Third] International Conference, 2007

A Fault-Tolerant Real-Time Scheduling Algorithm in Software Fault-Tolerant Module.
Proceedings of the Computational Science - ICCS 2007, 7th International Conference, Beijing, China, May 27, 2007

Quantification of Cut Sequence Set for Fault Tree Analysis.
Proceedings of the High Performance Computing and Communications, 2007

Efficient Broadcasting in Multi-radio Multi-channel and Multi-hop Wireless Networks Based on Self-pruning.
Proceedings of the High Performance Computing and Communications, 2007

FT64: Scientific Computing with Streams.
Proceedings of the High Performance Computing, 2007

The Design on SEU-Tolerant Information Processing System of the On-Board-Computer.
Proceedings of the Advanced Parallel Processing Technologies, 7th International Symposium, 2007

A Stream System-on-Chip Architecture for High Speed Target Recognition Based on Biologic Vision.
Proceedings of the Advances in Computer Systems Architecture, 2007

2006
Prediction-Table Based Fault-Tolerant Real-Time Scheduling Algorithm.
Proceedings of the Seventh International Conference on Parallel and Distributed Computing, 2006

The Process of Synchronization in Dual Redundant Fault-Tolerant System.
Proceedings of the Intelligent Information Processing III, 2006

A Streaming Implementation of Transform and Quantization in H.264.
Proceedings of the High Performance Computing and Communications, 2006

Register Allocation on Stream Processor with Local Register File.
Proceedings of the Advances in Computer Systems Architecture, 11th Asia-Pacific Conference, 2006

Optimization and Evaluating of StreamYGX2 on MASA Stream Processor.
Proceedings of the Advances in Computer Systems Architecture, 11th Asia-Pacific Conference, 2006

Analysis and Performance Results of a fluid dynamics Application on MASA Stream Processor.
Proceedings of the 5th Annual IEEE/ACIS International Conference on Computer and Information Science (ICIS 2006) and 1st IEEE/ACIS International Workshop on Component-Based Software Engineering, 2006

2005
Multiple-Morphs Adaptive Stream Architecture.
J. Comput. Sci. Technol., 2005

Accelerated Motion Estimation of H.264 on Imagine Stream Processor.
Proceedings of the Image Analysis and Recognition, Second International Conference, 2005

A Stream Architecture Supporting Multiple Stream Execution Models.
Proceedings of the Advances in Computer Systems Architecture, 10th Asia-Pacific Conference, 2005

2004
A Parallel Reed-Solomon Decoder on the Imagine Stream Processor.
Proceedings of the Parallel and Distributed Processing and Applications, 2004

A Case of SCMP with TLS.
Proceedings of the Parallel and Distributed Processing and Applications, 2004

Multiple-Dimension Scalable Adaptive Stream Architecture.
Proceedings of the Advances in Computer Systems Architecture, 9th Asia-Pacific Conference, 2004

1997
The study of parallel simulation processing based on MPP technology.
Proceedings of the 1997 Advances in Parallel and Distributed Computing Conference (APDC '97), 1997

