Guohao Dai
Orcid: 0000-0003-0849-3252
According to our database, Guohao Dai authored at least 134 papers between 2013 and 2025.
Bibliography
2025
VocabTailor: Dynamic Vocabulary Selection for Downstream Tasks in Small Language Models.
CoRR, August, 2025
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., July, 2025
Reducing GPU Memory Fragmentation via Spatio-Temporal Planning for Efficient Large-Scale Model Training.
CoRR, July, 2025
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., June, 2025
HyCTor: A Hybrid CNN-Transformer Network Accelerator With Flexible Weight/Output Stationary Dataflow and Multicore Extension.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., May, 2025
R2R: Efficiently Navigating Divergent Reasoning Paths with Small-Large Model Token Routing.
CoRR, May, 2025
CoRR, May, 2025
semi-PD: Towards Efficient LLM Serving via Phase-Wise Disaggregated Computation and Unified Storage.
CoRR, April, 2025
FlashOverlap: A Lightweight Design for Efficiently Overlapping Communication and Computation.
CoRR, April, 2025
CoRR, April, 2025
DiTFastAttnV2: Head-wise Attention Compression for Multi-Modality Diffusion Transformers.
CoRR, March, 2025
A Point Transformer Accelerator With Distribution-Aware Heuristic Distance Calculation.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., February, 2025
FrameFusion: Combining Similarity and Importance for Video Token Reduction on Large Visual Language Models.
CoRR, January, 2025
Proceedings of the 52nd Annual International Symposium on Computer Architecture, 2025
DeepGate4: Efficient and Effective Representation Learning for Circuit Design at Scale.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Accelerating Auto-regressive Text-to-Image Generation with Training-free Speculative Jacobi Decoding.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better.
Proceedings of the Thirteenth International Conference on Learning Representations, 2025
Proceedings of the IEEE International Symposium on High Performance Computer Architecture, 2025
FMC-LLM: Enabling FPGAs for Efficient Batched Decoding of 70B+ LLMs with a Memory-Centric Streaming Architecture.
Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2025
FlightVGM: Efficient Video Generation Model Inference with Online Sparsification and Hybrid Precision on FPGAs.
Proceedings of the 2025 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2025
DyLGNN: Efficient LM-GNN Fine-Tuning with Dynamic Node Partitioning, Low-Degree Sparsity, and Asynchronous Sub-Batch.
Proceedings of the Design, Automation & Test in Europe Conference, 2025
AiSpGEMM: Accelerating Imbalanced SpGEMM on FPGAs with Flexible Interconnect and Intra-row Parallel Merging.
Proceedings of the Design, Automation & Test in Europe Conference, 2025
SoftmAP: Software-Hardware Co-Design for Integer-Only Softmax on Associative Processors.
Proceedings of the Design, Automation & Test in Europe Conference, 2025
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
Deploying Diffusion Models with Scheduling Space Search and Memory Overflow Prevention Based on Graph Optimization.
Proceedings of the 30th Asia and South Pacific Design Automation Conference, 2025
Proceedings of the 30th Asia and South Pacific Design Automation Conference, 2025
LLSM: LLM-enhanced Logic Synthesis Model with EDA-guided CoT Prompting, Hybrid Embedding and AIG-tailored Acceleration.
Proceedings of the 30th Asia and South Pacific Design Automation Conference, 2025
ViDA: Video Diffusion Transformer Acceleration with Differential Approximation and Adaptive Dataflow.
Proceedings of the 30th Asia and South Pacific Design Automation Conference, 2025
2024
IEEE Trans. Circuits Syst. Video Technol., September, 2024
GRAPHIC: Gather and Process Harmoniously in the Cache With High Parallelism and Flexibility.
IEEE Trans. Emerg. Top. Comput., 2024
CoRR, 2024
Automating Energy-Efficient GPU Kernel Generation: A Fast Search-Based Compilation Approach.
CoRR, 2024
Efficient and Effective Retrieval of Dense-Sparse Hybrid Vectors using Graph-based Approximate Nearest Neighbor Search.
CoRR, 2024
CoRR, 2024
Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs.
CoRR, 2024
CoRR, 2024
ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation.
CoRR, 2024
CoRR, 2024
Linear Combination of Saved Checkpoints Makes Consistency and Diffusion Models Better.
CoRR, 2024
CoRR, 2024
Proceedings of the 37th IEEE International System-on-Chip Conference, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
Proceedings of the Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, 2024
FlashDecoding++: Faster Large Language Model Inference with Asynchronization, Flat GEMM Optimization, and Heuristics.
Proceedings of the Seventh Annual Conference on Machine Learning and Systems, 2024
Proceedings of the Forty-first International Conference on Machine Learning, 2024
Towards Floating Point-Based Attention-Free LLM: Hybrid PIM with Non-Uniform Data Format and Reduced Multiplications.
Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024
Fast and Efficient 2-bit LLM Inference on GPU: 2/4/16-bit in a Weight Matrix with Asynchronous Dequantization.
Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024
Proceedings of the 43rd IEEE/ACM International Conference on Computer-Aided Design, 2024
FlightLLM: Efficient Large Language Model Inference with a Complete Mapping Flow on FPGAs.
Proceedings of the 2024 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, 2024
Proceedings of the NeurIPS Efficient Natural Language and Speech Processing Workshop, 2024
MixDQ: Memory-Efficient Few-Step Text-to-Image Diffusion Models with Metric-Decoupled Mixed Precision Quantization.
Proceedings of the Computer Vision - ECCV 2024, 2024
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2024
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2024
Proceedings of the 61st ACM/IEEE Design Automation Conference, 2024
FlashEval: Towards Fast and Accurate Evaluation of Text-to-Image Diffusion Generative Models.
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
FEASTA: A Flexible and Efficient Accelerator for Sparse Tensor Algebra in Machine Learning.
Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2024
2023
CoGNN: An Algorithm-Hardware Co-Design Approach to Accelerate GNN Inference With Minibatch Sampling.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., December, 2023
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., November, 2023
Gibbon: An Efficient Co-Exploration Framework of NN Model and Processing-In-Memory Architecture.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., November, 2023
Adaptive Multidimensional Parallel Fault Simulation Framework on Heterogeneous System.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., June, 2023
CCF Trans. High Perform. Comput., June, 2023
Serving Multi-DNN Workloads on FPGAs: A Coordinated Architecture, Scheduling, and Mapping Perspective.
IEEE Trans. Computers, May, 2023
Proceedings of the ACM Web Conference 2023, 2023
History-Detr: Optimize Query Initialization Strategy by Using Historical Information and Kinematics.
Proceedings of the ACM Multimedia Asia 2023, 2023
HyperGef: A Framework Enabling Efficient Fusion for Hypergraph Neural Network on GPUs.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023
Exploiting Hardware Utilization and Adaptive Dataflow for Efficient Sparse Convolution in 3D Point Clouds.
Proceedings of the Sixth Conference on Machine Learning and Systems, 2023
DF-GAS: a Distributed FPGA-as-a-Service Architecture towards Billion-Scale Graph-based Approximate Nearest Neighbor Search.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023
TorchSparse++: Efficient Training and Inference Framework for Sparse Convolution on GPUs.
Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, 2023
Ada3D: Exploiting the Spatial Redundancy with Adaptive Inference for Efficient 3D Object Detection.
Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
TSTC: Two-Level Sparsity Tensor Core Enabling both Algorithm Flexibility and Hardware Efficiency.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023
OPT: Optimal Proposal Transfer for Efficient Yield Optimization for Analog and SRAM Circuits.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023
A Point Transformer Accelerator with Fine-Grained Pipelines and Distribution-Aware Dynamic FPS.
Proceedings of the IEEE/ACM International Conference on Computer Aided Design, 2023
Adam Accumulation to Reduce Memory Footprints of Both Activations and Gradients for Large-Scale DNN Training.
Proceedings of the ECAI 2023 - 26th European Conference on Artificial Intelligence, September 30 - October 4, 2023, Kraków, Poland, 2023
Minimizing Communication Conflicts in Network-On-Chip Based Processing-In-Memory Architecture.
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023
Proceedings of the Design, Automation & Test in Europe Conference & Exhibition, 2023
PIM-HLS: An Automatic Hardware Generation Tool for Heterogeneous Processing-In-Memory-based Neural Network Accelerators.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
Processing-In-Hierarchical-Memory Architecture for Billion-Scale Approximate Nearest Neighbor Search.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
An Efficient Accelerator for Point-based and Voxel-based Point Cloud Neural Networks.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
Seeking the Yield Barrier: High-Dimensional SRAM Evaluation Through Optimal Manifold.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
Memory-Efficient and Real-Time SPAD-based dToF Depth Sensor with Spatial and Statistical Correlation.
Proceedings of the 60th ACM/IEEE Design Automation Conference, 2023
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
High-Dimensional Yield Estimation Using Shrinkage Deep Features and Maximization of Integral Entropy Reduction.
Proceedings of the 28th Asia and South Pacific Design Automation Conference, 2023
Proceedings of the 28th Asia and South Pacific Design Automation Conference, 2023
2022
A Unified FPGA Virtualization Framework for General-Purpose Deep Neural Networks in the Cloud.
ACM Trans. Reconfigurable Technol. Syst., 2022
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2022
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2022
GRAPHIC: GatheR-And-Process in Highly parallel with In-SSD Compression Architecture in Very Large-Scale Graph.
CoRR, 2022
Understanding GNN Computational Graph: A Coordinated Computation, IO, and Memory Perspective.
Proceedings of the Fifth Conference on Machine Learning and Systems, 2022
Proceedings of the 23rd IEEE International Conference on Mobile Data Management, 2022
Proceedings of the ISCA '22: The 49th Annual International Symposium on Computer Architecture, New York, New York, USA, June 18, 2022
Exploiting Parallelism with Vertex-Clustering in Processing-In-Memory-based GCN Accelerators.
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022
Proceedings of the DAC '22: 59th ACM/IEEE Design Automation Conference, San Francisco, California, USA, July 10, 2022
A one-for-all and o(v log(v))-cost solution for parallel merge style operations on sorted key-value arrays.
Proceedings of the ASPLOS '22: 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February 2022, 2022
2021
Efficient Sparse Matrix Kernels based on Adaptive Workload-Balancing and Parallel-Reduction.
CoRR, 2021
Exploiting Online Locality and Reduction Parallelism for Sampled Dense Matrix Multiplication on GPUs.
Proceedings of the 39th IEEE International Conference on Computer Design, 2021
Rerec: In-ReRAM Acceleration with Access-Aware Mapping for Personalized Recommendation.
Proceedings of the IEEE/ACM International Conference On Computer Aided Design, 2021
3M-AI: A Multi-task and Multi-core Virtualization Framework for Multi-FPGA AI Systems in the Cloud.
Proceedings of the FPGA '21: The 2021 ACM/SIGDA International Symposium on Field Programmable Gate Arrays, Virtual Event, USA, February 28, 2021
2020
GE-SpMM: general-purpose sparse matrix-matrix multiplication on GPUs for graph neural networks.
Proceedings of the International Conference for High Performance Computing, 2020
Proceedings of the 2020 IEEE High Performance Extreme Computing Conference, 2020
Proceedings of the 2020 IEEE High Performance Extreme Computing Conference, 2020
MNSIM 2.0: A Behavior-Level Modeling Tool for Memristor-based Neuromorphic Computing Systems.
Proceedings of the GLSVLSI '20: Great Lakes Symposium on VLSI 2020, 2020
An Order Sampling Processing-in-Memory Architecture for Approximate Graph Pattern Mining.
Proceedings of the GLSVLSI '20: Great Lakes Symposium on VLSI 2020, 2020
Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020
Proceedings of the FPGA '20: The 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020
Proceedings of the 28th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2020
Proceedings of the 57th ACM/IEEE Design Automation Conference, 2020
2019
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., 2019
IEEE Trans. Computers, 2019
Centrifuge: Evaluating full-system HLS-generated heterogenous-accelerator SoCs using FPGA-Acceleration.
Proceedings of the International Conference on Computer-Aided Design, 2019
Proceedings of the 56th Annual Design Automation Conference 2019, 2019
Proceedings of the 56th Annual Design Automation Conference 2019, 2019
GraphSAR: a sparsity-aware processing-in-memory architecture for large-scale graph processing on ReRAMs.
Proceedings of the 24th Asia and South Pacific Design Automation Conference, 2019
2018
Proceedings of the International Symposium on Memory Systems, 2018
NewGraph: Balanced Large-Scale Graph Processing on FPGAs with Low Preprocessing Overheads.
Proceedings of the 26th IEEE Annual International Symposium on Field-Programmable Custom Computing Machines, 2018
Proceedings of the 2018 Design, Automation & Test in Europe Conference & Exhibition, 2018
2017
Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2017
2016
Proceedings of the 32nd IEEE International Conference on Data Engineering, 2016
Proceedings of the 26th International Conference on Field Programmable Logic and Applications, 2016
Proceedings of the 2016 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2016
2015
Proceedings of the 2015 International Conference on Field Programmable Technology, 2015
2014
Proceedings of the 2014 International Conference on Field-Programmable Technology, 2014
2013
Proceedings of the 10th IEEE International Conference on High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing, 2013