Kai Lu

Orcid: 0000-0003-2284-7897

Affiliations:
  • National University of Defense Technology, College of Computer Science, National Key Laboratory of Parallel and Distributed Processing, Changsha, China
  • National University of Defense Technology, Changsha, China (PhD 1999)


According to our database, Kai Lu authored at least 172 papers between 2003 and 2025.

Collaborative distances:
  • Dijkstra number of four.
  • Erdős number of four.

Bibliography

2025
Oases: Efficient Large-Scale Model Training on Commodity Servers via Overlapped and Automated Tensor Model Parallelism.
IEEE Trans. Parallel Distributed Syst., September, 2025

Eliminate Data Divergence in SpMV via Processor and Memory Co-Computing Framework.
IEEE Trans. Computers, June, 2025

TSN Cache: Exploiting Data Localities in Graph Computing Applications.
ACM Trans. Archit. Code Optim., June, 2025

Design of a Compact Low-Power Sub-2.4-GHz Transceiver for Medical Band Applications.
IEEE J. Solid State Circuits, May, 2025

AutoPipe-H: A Heterogeneity-Aware Data-Paralleled Pipeline Approach on Commodity GPU Servers.
IEEE Trans. Computers, April, 2025

Bubble-Swap Flow Control.
ACM Trans. Archit. Code Optim., March, 2025

Efficient Forward-Edge Control-Flow Integrity for COTS Binaries via Arm BTI.
IEEE Trans. Inf. Forensics Secur., 2025

Towards Megacity-Scale Wind Flow Simulations on Many-Core CPU-Accelerator Systems.
SIAM J. Sci. Comput., 2025

GraphCSR: A Space and Time-Efficient Sparse Matrix Representation for Web-scale Graph Processing.
Proceedings of the ACM on Web Conference 2025, 2025

GraphCom: Communication Hierarchy-aware Graph Engine for Distributed Model Training.
Proceedings of the ACM on Web Conference 2025, 2025

DPGA-TextSyn: Differentially Private Genetic Algorithm for Synthetic Text Generation.
Proceedings of the Findings of the Association for Computational Linguistics, 2025

AGD: Adversarial Game Defense Against Jailbreak Attacks in Large Language Models.
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Can Large Language Models Derive High-Level Cognition from Low-Level and Fragmented Foundational Information?
Proceedings of the AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25, 2025

2024
DELTA: Memory-Efficient Training via Dynamic Fine-Grained Recomputation and Swapping.
ACM Trans. Archit. Code Optim., December, 2024

MST: Topology-Aware Message Aggregation for Exascale Graph Processing of Traversal-Centric Algorithms.
ACM Trans. Archit. Code Optim., December, 2024

A Multidimensional Communication Scheduling Method for Hybrid Parallel DNN Training.
IEEE Trans. Parallel Distributed Syst., August, 2024

Instiller: Toward Efficient and Realistic RTL Fuzzing.
IEEE Trans. Comput. Aided Des. Integr. Circuits Syst., July, 2024

A survey of compute nodes with 100 TFLOPS and beyond for supercomputers.
CCF Trans. High Perform. Comput., June, 2024

SNCL: a supernode OpenCL implementation for hybrid computing arrays.
J. Supercomput., May, 2024

The progress, challenges, and perspectives of directed greybox fuzzing.
Softw. Test. Verification Reliab., March, 2024

Towards adaptive graph neural networks via solving prior-data conflicts.
Frontiers Inf. Technol. Electron. Eng., March, 2024

Faster and Scalable MPI Applications Launching.
IEEE Trans. Parallel Distributed Syst., February, 2024

Armor: Protecting Software Against Hardware Tracing Techniques.
IEEE Trans. Inf. Forensics Secur., 2024

INSTILLER: Towards Efficient and Realistic RTL Fuzzing.
CoRR, 2024

HyperGo: Probability-based directed hybrid fuzzing.
Comput. Secur., 2024

Towards Highly Compatible I/O-Aware Workflow Scheduling on HPC Systems.
Proceedings of the International Conference for High Performance Computing, 2024

Efficiently Rebuilding Coverage in Hardware-Assisted Greybox Fuzzing.
Proceedings of the 27th International Symposium on Research in Attacks, 2024

DeepGo: Predictive Directed Greybox Fuzzing.
Proceedings of the 31st Annual Network and Distributed System Security Symposium, 2024

Fully Decentralized Data Distribution for Exascale-HPC: End of the Provider-Demander Matching Puzzle.
Proceedings of the IEEE International Conference on Cluster Computing, 2024

A Fully Integrated 400MHz Band Transceiver with a 96Mbps 16QAM Transmitter and a Phase Tracking Receiver in 40-nm CMOS.
Proceedings of the IEEE Asian Solid-State Circuits Conference, 2024

2023
Accelerating GNN Training by Adapting Large Graphs to Distributed Heterogeneous Architectures.
IEEE Trans. Computers, December, 2023

A Survey of the Security Analysis of Embedded Devices.
Sensors, November, 2023

UltraFuzz: Towards Resource-Saving in Distributed Fuzzing.
IEEE Trans. Software Eng., April, 2023

Compressed Collective Sparse-Sketch for Distributed Data-Parallel Training of Deep Learning Models.
IEEE J. Sel. Areas Commun., April, 2023

Inspecting End-to-End Encrypted Communication Differentially for the Efficient Identification of Harmful Media.
IEEE Trans. Inf. Forensics Secur., 2023

Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000.
Frontiers Inf. Technol. Electron. Eng., 2023

Free energy perturbation-based large-scale virtual screening for effective drug discovery against COVID-19.
Int. J. High Perform. Comput. Appl., 2023

Automated Tensor Model Parallelism with Overlapped Communication for Efficient Foundation Model Training.
CoRR, 2023

Leveraging Free Labels to Power up Heterophilic Graph Learning in Weakly-Supervised Settings: An Empirical Study.
Proceedings of the Machine Learning and Knowledge Discovery in Databases: Research Track, 2023

VulHawk: Cross-architecture Vulnerability Detection with Entropy-based Binary Code Search.
Proceedings of the 30th Annual Network and Distributed System Security Symposium, 2023

Roar: A Router Microarchitecture for In-network Allreduce.
Proceedings of the 37th International Conference on Supercomputing, 2023

ReForker: Patching x86_64 Binaries with the Fork Server to Improve Hardware-Assisted Fuzzing through Trampoline-Based Binary Rewriting.
Proceedings of the 2nd International Conference on Networks, 2023

2022
TianheGraph: Customizing Graph Search for Graph500 on Tianhe Supercomputer.
IEEE Trans. Parallel Distributed Syst., 2022

ParaX: Bandwidth-Efficient Instance Assignment for DL on Multi-NUMA Many-Core CPUs.
IEEE Trans. Computers, 2022

Self-deployed execution environment for high performance computing.
Frontiers Inf. Technol. Electron. Eng., 2022

TEES: topology-aware execution environment service for fast and agile application deployment in HPC.
Frontiers Inf. Technol. Electron. Eng., 2022

ovAFLow: Detecting Memory Corruption Bugs with Fuzzing-Based Taint Inference.
J. Comput. Sci. Technol., 2022

Towards Defense Against Adversarial Attacks on Graph Neural Networks via Calibrated Co-Training.
J. Comput. Sci. Technol., 2022

Tree-based Search Graph for Approximate Nearest Neighbor Search.
CoRR, 2022

MT-3000: a heterogeneous multi-zone processor for HPC.
CCF Trans. High Perform. Comput., 2022

BioNet: a large-scale and heterogeneous biological network model for interaction prediction with graph convolution.
Briefings Bioinform., 2022

RED: Learning the role embedding in networks via Discrete-time quantum walk.
Appl. Intell., 2022

Game of Hide-and-Seek: Exposing Hidden Interfaces in Embedded Web Applications of IoT Devices.
Proceedings of the WWW '22: The ACM Web Conference 2022, Virtual Event, Lyon, France, April 25, 2022

vGraph: Memory-Efficient Multicore Graph Processing for Traversal-Centric Algorithms.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

Towards Scalable Resource Management for Supercomputers.
Proceedings of the SC22: International Conference for High Performance Computing, 2022

MobFuzz: Adaptive Multi-objective Optimization in Gray-box Fuzzing.
Proceedings of the 29th Annual Network and Distributed System Security Symposium, 2022

The Fast and Scalable MPI Application Launch of the Tianhe HPC system.
Proceedings of the 2022 IEEE International Parallel and Distributed Processing Symposium, 2022

XTree: Traversal-Based Partitioning for Extreme-Scale Graph Processing on Supercomputers.
Proceedings of the 38th IEEE International Conference on Data Engineering, 2022

PMemTrace: Lightweight and Efficient Memory Access Monitoring for Persistent Memory.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2022

Full-credit Flow Control: A Novel Technique to Implement Deadlock-free Adaptive Routing.
Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, 2022

2021
Coordinative Scheduling of Computation and Communication in Data-Parallel Systems.
IEEE Trans. Computers, 2021

MEBS: Uncovering Memory Life-Cycle Bugs in Operating System Kernels.
J. Comput. Sci. Technol., 2021

CoG: a Two-View Co-training Framework for Defending Adversarial Attacks on Graph.
CoRR, 2021

Processing extreme-scale graphs on China's supercomputers.
Commun. ACM, 2021

Correction to: Mining a stroke knowledge graph from literature.
BMC Bioinform., 2021

Mining a stroke knowledge graph from literature.
BMC Bioinform., 2021

QSIM: A novel approach to node proximity estimation based on Discrete-time quantum walk.
Appl. Intell., 2021

Microarchitecture of a Configurable High-Radix Router for the Post-Moore Era.
Proceedings of the High Performance Computing - 36th International Conference, 2021

Sparse Matrix-Vector Multiplication Cache Performance Evaluation and Design Exploration.
Proceedings of the 29th International Symposium on Modeling, 2021

2020
SMINT: Toward Interpretable and Robust Model Sharing for Deep Neural Networks.
ACM Trans. Web, 2020

High-Scalable Collaborated Parallel Framework for Large-Scale Molecular Dynamic Simulation on Tianhe-2 Supercomputer.
IEEE ACM Trans. Comput. Biol. Bioinform., 2020

Sabotaging the system boundary: A study of the inter-boundary vulnerability.
J. Inf. Secur. Appl., 2020

An efficient framework for generating robust adversarial examples.
Int. J. Intell. Syst., 2020

UniFuzz: Optimizing Distributed Fuzzing via Dynamic Centralized Task Scheduling.
CoRR, 2020

A survey on optimizations towards best-effort hardware transactional memory.
CCF Trans. High Perform. Comput., 2020

EcoFuzz: Adaptive Energy-Saving Greybox Fuzzing as a Variant of the Adversarial Multi-Armed Bandit.
Proceedings of the 29th USENIX Security Symposium, 2020

PMThreads: persistent memory threads harnessing versioned shadow copies.
Proceedings of the 41st ACM SIGPLAN International Conference on Programming Language Design and Implementation, 2020

Segment Medical Image Using U-Net Combining Recurrent Residuals and Attention.
Proceedings of 2020 International Conference on Medical Imaging and Computer-Aided Diagnosis, 2020

Representation Learning with Multiple Lipschitz-Constrained Alignments on Partially-Labeled Cross-Domain Data.
Proceedings of the Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2019
A Cost-Efficient Router Architecture for HPC Inter-Connection Networks: Design and Implementation.
IEEE Trans. Parallel Distributed Syst., 2019

CURE: Flexible Categorical Data Representation by Hierarchical Coupling Learning.
IEEE Trans. Knowl. Data Eng., 2019

DFTracker: detecting double-fetch bugs by multi-taint parallel tracking.
Frontiers Comput. Sci., 2019

MapEff: An Effective Graph Isomorphism Agorithm Based on the Discrete-Time Quantum Walk.
Entropy, 2019

Lightweight Container-based User Environment.
CoRR, 2019

The Vulnerabilities of Graph Convolutional Networks: Stronger Attacks and Defensive Techniques.
CoRR, 2019

AVPredictor: Comprehensive prediction and detection of atomicity violations.
Concurr. Comput. Pract. Exp., 2019

Efficient Algorithms on Rigidity Decomposition for Network Localizability Analysis.
Ad Hoc Sens. Wirel. Networks, 2019

Adversarial Examples for Graph Data: Deep Insights into Attack and Defense.
Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence, 2019

Evolutionarily Learning Multi-Aspect Interactions and Influences from Network Structure and Node Content.
Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence, 2019

POSTER: Quiescent and Versioned Shadow Copies for NVM.
Proceedings of the 28th International Conference on Parallel Architectures and Compilation Techniques, 2019

2018
A Differentiated Caching Mechanism to Enable Primary Storage Deduplication in Clouds.
IEEE Trans. Parallel Distributed Syst., 2018

Unsupervised Coupled Metric Similarity for Non-IID Categorical Data.
IEEE Trans. Knowl. Data Eng., 2018

Versionized process based on non-volatile random-access memory for fine-grained fault tolerance.
Frontiers Inf. Technol. Electron. Eng., 2018

Moving from exascale to zettascale computing: challenges and techniques.
Frontiers Inf. Technol. Electron. Eng., 2018

Untrusted Hardware Causes Double-Fetch Problems in the I/O Memory.
J. Comput. Sci. Technol., 2018

Marking Vertices to Find Graph Isomorphism Mapping Based on Continuous-Time Quantum Walk.
Entropy, 2018

A survey of the double-fetch vulnerabilities.
Concurr. Comput. Pract. Exp., 2018

Constructing a database for the relations between CNV and human genetic diseases via systematic text mining.
BMC Bioinform., 2018

Sharing Deep Neural Network Models with Interpretation.
Proceedings of the 2018 World Wide Web Conference on World Wide Web, 2018

DFTinker: Detecting and Fixing Double-Fetch Bugs in an Automated Way.
Proceedings of the Wireless Algorithms, Systems, and Applications, 2018

Detecting Multiple Information Sources Based on the Quantum Walk.
Proceedings of the 5th International Conference on Systems and Informatics, 2018

One Size Does Not Fit All: The Case for Chunking Configuration in Backup Deduplication.
Proceedings of the 18th IEEE/ACM International Symposium on Cluster, 2018

Metric-Based Auto-Instructor for Learning Mixed Data Representation.
Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, 2018

2017
Fine-grained checkpoint based on non-volatile memory.
Frontiers Inf. Technol. Electron. Eng., 2017

Fast Persistent Heap Based on Non-Volatile Memory.
IEICE Trans. Inf. Syst., 2017

Interpreting Shared Deep Learning Models via Explicable Boundary Trees.
CoRR, 2017

HPDedup: A Hybrid Prioritized Data Deduplication Mechanism for Primary Storage in the Cloud.
CoRR, 2017

Surveying concurrency bug detectors based on types of detected bugs.
Sci. China Inf. Sci., 2017

Topology-aware network fault influence domain analysis.
Comput. Electr. Eng., 2017

Community detection in attributed networks based on heterogeneous vertex interactions.
Appl. Intell., 2017

SwapX: An NVM-Based Hierarchical Swapping Framework.
IEEE Access, 2017

Building Emulation Framework for Non-Volatile Memory.
IEEE Access, 2017

Flexible Page-level Memory Access Monitoring Based on Virtualization Hardware.
Proceedings of the 13th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments, 2017

How Double-Fetch Situations turn into Double-Fetch Vulnerabilities: A Study of Double Fetches in the Linux Kernel.
Proceedings of the 26th USENIX Security Symposium, 2017

Embedding-based Representation of Categorical Data by Hierarchical Value Coupling Learning.
Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, 2017

Dynamic Community Detection Algorithm Based on Automatic Parameter Adjustment.
Proceedings of the Intelligent Data Engineering and Automated Learning - IDEAL 2017 - 18th International Conference, Guilin, China, October 30, 2017

A Case for Memory Frequency Sensitivity.
Proceedings of the 2017 IEEE International Conference on Web Services, 2017

Reinforcement Label Propagation Algorithm Based on History Record.
Proceedings of the Neural Information Processing - 24th International Conference, 2017

Building Emulation Framework for Non-volatile Memory.
Proceedings of the 37th IEEE International Conference on Distributed Computing Systems Workshops, 2017

High Performance Coordinate Descent Matrix Factorization for Recommender Systems.
Proceedings of the Computing Frontiers Conference, 2017

mD3DOCKxb: An Ultra-Scalable CPU-MIC Coordinated Virtual Screening Framework.
Proceedings of the 17th IEEE/ACM International Symposium on Cluster, 2017

2016
StageFS: A Parallel File System Optimizing Metadata Performance for SSD Based Clusters.
Proceedings of the 2016 IEEE Trustcom/BigDataSE/ISPA, 2016

Taint Reverse Propagation for Analysis of Privacy Leak.
Proceedings of the 2016 IEEE Trustcom/BigDataSE/ISPA, 2016

Application-Based Coarse-Grained Incremental Checkpointing Based on Non-volatile Memory.
Proceedings of the Network and Parallel Computing, 2016

Alleviating network congestion for HPC clusters with fat-tree interconnection leveraging software-defined networking.
Proceedings of the 3rd International Conference on Systems and Informatics, 2016

mAMBER: A CPU/MIC collaborated parallel framework for AMBER on Tianhe-2 supercomputer.
Proceedings of the IEEE International Conference on Bioinformatics and Biomedicine, 2016

Unified Weighted Label Propagation Algorithm Using Connection Factor.
Proceedings of the Advanced Data Mining and Applications - 12th International Conference, 2016

2015
Detecting harmful data races through parallel verification.
J. Supercomput., 2015

Write-Combined Logging: An Optimized Logging for Consistency in NVRAM.
Sci. Program., 2015

Local feature point extraction for quantum images.
Quantum Inf. Process., 2015

An Efficient and Flexible Deterministic Framework for Multithreaded Programs.
J. Comput. Sci. Technol., 2015

Collaborative Technique for Concurrency Bug Detection.
Int. J. Parallel Program., 2015

A Load-Balanced Deterministic Runtime for Pipeline Parallelism.
IEICE Trans. Inf. Syst., 2015

QSobel: A novel quantum image edge extraction algorithm.
Sci. China Inf. Sci., 2015

RaceChecker: Efficient Identification of Harmful Data Races.
Proceedings of the 23rd Euromicro International Conference on Parallel, 2015

Identifying Repeated Interleavings to Improve the Efficiency of Concurrency Bug Detection.
Proceedings of the Algorithms and Architectures for Parallel Processing, 2015

Efficiently Trigger Data Races through Speculative Execution.
Proceedings of the 17th IEEE International Conference on High Performance Computing and Communications, 2015

2014
Robust Component-Based Localization in Sparse Networks.
IEEE Trans. Parallel Distributed Syst., 2014

DRDet: Efficiently Making Data Races Deterministic.
IEICE Trans. Inf. Syst., 2014

Iaso: an autonomous fault-tolerant management system for supercomputers.
Frontiers Comput. Sci., 2014

Efficient deterministic multithreading without global barriers.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2014

Approximate Maximum Common Sub-graph Isomorphism Based on Discrete-Time Quantum Walk.
Proceedings of the 22nd International Conference on Pattern Recognition, 2014

Online Taint Propagation Analysis with Precise Pointer-to Analysis for Detecting Bugs in Binaries.
Proceedings of the 2014 IEEE International Conference on High Performance Computing and Communications, 2014

Enhancing the Security of Parallel Programs via Reducing Scheduling Space.
Proceedings of the IEEE 12th International Conference on Dependable, 2014

2013
A novel quantum representation for log-polar images.
Quantum Inf. Process., 2013

NEQR: a novel enhanced quantum representation of digital images.
Quantum Inf. Process., 2013

Self-Adaptive Power Management of Idle Nodes in Large Scale Systems.
Int. J. Next Gener. Comput., 2013

Deterministic Message Passing for Distributed Parallel Computing.
IEICE Trans. Inf. Syst., 2013

Understanding the Impact of BPRAM on Incremental Checkpoint.
IEICE Trans. Inf. Syst., 2013

OFA: An optimistic approach to conquer flip ambiguity in network localization.
Comput. Networks, 2013

ColFinder Collaborative Concurrency Bug Detection.
Proceedings of the 2013 13th International Conference on Quality Software, 2013

RaceFree: an efficient multi-threading model for determinism.
Proceedings of the ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2013

FLPI: representation of quantum images for log-polar coordinate.
Proceedings of the Fifth International Conference on Digital Image Processing, 2013

Pruning False Positives of Static Data-Race Detection via Thread Specialization.
Proceedings of the Advanced Parallel Processing Technologies, 2013

2012
Exploiting parallelism in deterministic shared memory multiprocessing.
J. Parallel Distributed Comput., 2012

dMPI: Facilitating Debugging of MPI Programs via Deterministic Message Passing.
Proceedings of the Network and Parallel Computing, 9th IFIP International Conference, 2012

A Power Provision and Capping Architecture for Large Scale Systems.
Proceedings of the 26th IEEE International Parallel and Distributed Processing Symposium Workshops & PhD Forum, 2012

Self-adaptive management of the sleep depths of idle nodes in large scale systems to balance between energy consumption and response times.
Proceedings of the 4th IEEE International Conference on Cloud Computing Technology and Science Proceedings, 2012

NV-process: a fault-tolerance process model based on non-volatile memory.
Proceedings of the Asia-Pacific Workshop on Systems, 2012

2011
The TianHe-1A Supercomputer: Its Hardware and Software.
J. Comput. Sci. Technol., 2011

2010
TH-1: China's first petaflop supercomputer.
Frontiers Comput. Sci. China, 2010

Brief announcement: NUMA-aware transactional memory.
Proceedings of the 29th Annual ACM Symposium on Principles of Distributed Computing, 2010

Adaptive Optimization for Petascale Heterogeneous CPU/GPU Computing.
Proceedings of the 2010 IEEE International Conference on Cluster Computing, 2010

2009
Mixing Concrete and Symbolic Execution to Improve the Performance of Dynamic Test Generation.
Proceedings of the NTMS 2009, 2009

Hierarchical Conflict Detection for Cluster's Transactional Memory.
Proceedings of the International Conference on Networked Computing and Advanced Information Management, 2009

Architecture- and OS-Independent Binary-Level Dynamic Test Generation.
Proceedings of the Information and Communications Security, 11th International Conference, 2009

Investigating transactional memory performance on ccNUMA machines.
Proceedings of the 18th ACM International Symposium on High Performance Distributed Computing, 2009

Decoupling Dynamic Test Generation from Specific Operating System Details Based on Whole System Virtual Machine.
Proceedings of the Fourth International Conference on Frontier of Computer Science and Technology, 2009

Two-phase conflict detection for transactional memory on clusters.
Proceedings of the 2009 IEEE International Conference on Cluster Computing, August 31, 2009

HPVZ: A High Performance Virtual Computing Environment for Super Computers.
Proceedings of the Advanced Parallel Processing Technologies, 8th International Symposium, 2009

2003
Dynamic Self-Adaptive Replica Location Method in Data Grids.
Proceedings of the 2003 IEEE International Conference on Cluster Computing (CLUSTER 2003), 2003

A Scalable Peer-to-Peer Network with Constant Degree.
Proceedings of the Advanced Parallel Programming Technologies, 5th International Workshop, 2003

