CSE599H: Advances and Challenges in Language Models, Reasoning, and AI Agents

Spring 2024-2025
Mondays and Wednesdays, 3pm to 4:20pm
CSE2 G04
Gradescope | Ed

Hanna Hajishirzi

Instructor

hannaneh [at] cs [dot] washington [dot] edu

Jiacheng Liu

liujc [at] cs [dot] washington [dot] edu

Rulin Shao

rulins [at] cs [dot] washington [dot] edu

Office hours by appointment.
To contact course staff, please make an Ed post.

Language models and the systems built on them, such as OpenAI o3, DeepSeek-R1, and Deep Research, have demonstrated remarkable capabilities in natural language understanding, generation, and reasoning, with applications ranging from literature summarization to complex problem-solving. However, as we will discuss, these models have clear limitations, including susceptibility to hallucination, weak strategic exploration, and limited long-horizon planning. In this class, we will explore the latest research on language models, reasoning, and AI agents, covering both the advances and the challenges in these areas. We will examine current state-of-the-art models, their limitations, and ongoing efforts to address them. Through paper discussions, you will gain a deeper understanding of the latest developments in the field and be prepared to contribute to ongoing research in this area.

This is a seminar designed for PhD students. Students are expected to be able to read and understand the assigned papers on their own, and they should be familiar with ML and NLP concepts at the level of having taken advanced undergraduate classes.

Schedule

Weekly due dates:

By Monday 11:59pm: Slides for Wednesday's papers (presenters only)
By Saturday 11:59pm: Slides for Monday's papers (presenters only)

Mar 31 (Mon)	Course overview (slides)
Apr 2 (Wed)	Basic Pre-training and Post-training (slides)
	The Llama 3 Herd of Models
	2 OLMo 2 Furious
	DeepSeek-V3 Technical Report
	Optional reading:
	The Ultra-Scale Playbook: Training LLMs on GPU Clusters
Apr 7 (Mon)	Guest Lecture: Nathan Lambert (slides)
	Tülu 3: Pushing Frontiers in Open Language Model Post-Training
	Direct Preference Optimization: Your Language Model is Secretly a Reward Model
	Optional reading:
	DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
	Unpacking DPO and PPO: Disentangling Best Practices for Learning from Preference Feedback
	Understanding R1-Zero-Like Training: A Critical Perspective
Apr 9 (Wed)	Guest Lecture: Kyle Lo (slides)
	Dolma: Open Corpus of Three Trillion Tokens
	DataComp-LM: Next Generation Training Sets
	The FineWeb Datasets: Decanting the Web for the Finest Text Data at Scale
	Optional reading:
	Does your data spark joy? Performance gains from domain upsampling at the end of training
	Organize the Web: Constructing Domains Enhances Pre-Training Data Curation
	Understanding Emergent Abilities of Language Models from the Loss Perspective
Apr 14 (Mon)	Scaling Laws of Language Models (slides)
	Training Compute-Optimal Large Language Models
	Language models scale reliably with over-training and on downstream tasks
	Optional reading:
	Scaling Laws for Neural Language Models
	A Hitchhiker's Guide to Scaling Law Estimation
Apr 16 (Wed)	Building Reasoning Models & Systems I (slides)
	Chain of Thought Prompting Elicits Reasoning in Large Language Models
	Self-Consistency Improves Chain of Thought Reasoning in Language Models
	STaR: Bootstrapping Reasoning With Reasoning
Apr 21 (Mon)	Building Reasoning Models & Systems II (slides)
	OpenAI o3-mini System Card
	Tülu 3: Pushing Frontiers in Open Language Model Post-Training (Section 6 only)
	DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
	SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Apr 23 (Wed)	Test-Time Scaling (slides)
	Stream of Search (SoS): Learning to Search in Language
	s1: Simple test-time scaling
	Sample, Don't Search: Rethinking Test-Time Alignment for Language Models
	Optional reading:
	A Survey of Frontiers in LLM Reasoning: Inference Scaling, Learning to Reason, and Agentic Systems
	Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
	Reasoning Models Can Be Effective Without Thinking
Apr 28 (Mon)	AI Agents and Tool Use (slides)
	ReAct: Synergizing Reasoning and Acting in Language Models
	Toolformer: Language Models Can Teach Themselves to Use Tools
	START: Self-taught Reasoner with Tools
Apr 30 (Wed)	AI Agents for Coding (slides)
	SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering
	OpenHands: An Open Platform for AI Software Developers as Generalist Agents
	Optional reading:
	LiveCodeBench: Holistic and Contamination Free Evaluation of Large Language Models for Code
	SWE-Lancer: Can Frontier LLMs Earn $1 Million from Real-World Freelance Software Engineering?
May 5 (Mon)	AI Agents for Computer Use and Web Browsing (slides)
	OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments
	WebVoyager: Building an End-to-End Web Agent with Large Multimodal Models
	UI-TARS: Pioneering Automated GUI Interaction with Native Agents
	Optional reading:
	WebArena: A Realistic Web Environment for Building Autonomous Agents
	[OpenAI] Computer-Using Agent
	[Anthropic] Developing a computer use model
May 7 (Wed)	AI Agents for Deep Research
	Deep Research System Card
	OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs
	Optional reading:
	Search-R1: Training LLMs to Reason and Leverage Search Engines
	ReSearch: Learning to Reason with Search for LLMs via Reinforcement Learning
	RAG-RL: Advancing Retrieval-Augmented Generation via RL and Curriculum Learning
May 12 (Mon)	Features and Limitations I
	AI models collapse when trained on recursively generated data
	The Reversal Curse: LLMs trained on "A is B" fail to learn "B is A"
	Mitigating Reversal Curse in Large Language Models via Semantic-aware Permutation Training
	Optional reading:
	Physics of Language Models: Part 3.2, Knowledge Manipulation
	Reverse training to nurse the reversal curse
	Faith and Fate: Limits of Transformers on Compositionality
	The Generative AI Paradox: "What It Can Create, It May Not Understand"
May 14 (Wed)	Features and Limitations II
	Alignment faking in large language models
	Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets
	The Hyperfitting Phenomenon: Sharpening and Stabilizing LLMs for Open-Ended Text Generation
	Optional reading:
	Are Emergent Abilities of Large Language Models a Mirage?
May 19 (Mon)	Alternative Architectures
	OLMoE: Open Mixture-of-Experts Language Models
	Mamba: Linear-Time Sequence Modeling with Selective State Spaces
	Learning to (Learn at Test Time): RNNs with Expressive Hidden States
	Optional reading:
	Large Concept Models: Language Modeling in a Sentence Representation Space
May 21 (Wed)	Efficiency and Scaling
	FlashAttention: Fast and Memory-Efficient Exact Attention
	vLLM: Easy, Fast, and Cheap LLM Serving
	QLoRA: Efficient Finetuning of Quantized LLMs
May 26 (Mon)	No class, Memorial Day
May 28 (Wed)	TBD
Jun 2 (Mon)	TBD
Jun 4 (Wed)	TBD

Acknowledgements

We are grateful to Pang Wei Koh for sharing their website template with us.