We stand with Ukraine

Kaiyue Wen

According to our database, Kaiyue Wen authored at least 23 papers between 2022 and 2025.

Collaborative distances:

Dijkstra number of four.
Erdős number of four.

Bibliography

2025

Fantastic Pretraining Optimizers and Where to Find Them.

,

,

,

CoRR, September, 2025

QuestA: Expanding Reasoning Capacity in LLMs via Question Augmentation.

,

,

,

,

,

,

,

CoRR, July, 2025

PaTH Attention: Position Encoding via Accumulating Householder Transformations.

,

,

,

,

,

,

,

CoRR, May, 2025

Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free.

,

,

,

,

,

,

,

,

,

,

,

,

CoRR, May, 2025

Weight Ensembling Improves Reasoning in Language Models.

,

,

,

,

Aditi Raghunathan

CoRR, April, 2025

Task Generalization With AutoRegressive Compositional Structure: Can Learning From D Tasks Generalize to D^T Tasks?

Amirhesam Abedsoltan

,

,

,

,

,

CoRR, February, 2025

Overtrained Language Models Are Harder to Fine-Tune.

Jacob Mitchell Springer

,

,

,

,

,

Sadhika Malladi

,

,

Aditi Raghunathan

Proceedings of the Forty-second International Conference on Machine Learning, 2025

Task Generalization with Autoregressive Compositional Structure: Can Learning from D Tasks Generalize to D^T Tasks?

Amirhesam Abedsoltan

,

,

,

,

,

Proceedings of the Forty-second International Conference on Machine Learning, 2025

From Sparse Dependence to Sparse Attention: Unveiling How Chain-of-Thought Enhances Transformer Sample Efficiency.

,

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

RNNs are not Transformers (Yet): The Key Bottleneck on In-Context Retrieval.

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape View.

,

,

,

David Leo Wright Hall

,

,

Proceedings of the Thirteenth International Conference on Learning Representations, 2025

Symmetrical Visual Contrastive Optimization: Aligning Vision-Language Models with Minimal Contrastive Images.

,

,

,

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

Demons in the Detail: On Implementing Load Balancing Loss for Training Specialized Mixture-of-Expert Models.

,

,

,

,

,

,

,

,

,

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025

2024

Understanding Warmup-Stable-Decay Learning Rates: A River Valley Loss Landscape Perspective.

,

,

,

,

,

CoRR, 2024

2023

Practically Solving LPN in High Noise Regimes Faster Using Neural Networks.

,

,

IACR Cryptol. ePrint Arch., 2023

Transformers are uninterpretable with myopic methods: a case study with bounded Dyck grammars.

,

,

,

Andrej Risteski

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Sharpness Minimization Algorithms Do Not Only Minimize Sharpness To Achieve Better Generalization.

,

,

Proceedings of the Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, 2023

Benign Overfitting in Classification: Provably Counter Label Noise with Larger Models.

,

,

Proceedings of the Eleventh International Conference on Learning Representations, 2023

How Sharpness-Aware Minimization Minimizes Sharpness?

,

,

Proceedings of the Eleventh International Conference on Learning Representations, 2023

2022

How Does Sharpness-Aware Minimization Minimize Sharpness?

,

,

CoRR, 2022

Realistic Deep Learning May Not Fit Benignly.

,

,

CoRR, 2022

On Transferability of Prompt Tuning for Natural Language Processing.

,

,

,

,

,

,

,

,

,

,

,

,

Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2022

Finding Skill Neurons in Pre-trained Transformer-based Language Models.

,

,

,

,

,

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022
