AGI House: NeurIPS Takeaways, Building with Gemini 3, and the SLM Thesis
In this issue: Key takeaways from NeurIPS and our Global Chess Challenge, what we're building at Gemini 3 Build Day, why small language models matter (new memo), joining the team, and what's capturing our attention
📣NeurIPS 2025: Chess, Reasoning Models, and AWS Trainium
On December 2nd, we partnered with AWS and AIcrowd to host Checkmate: Fine-Tune Your Own Small Language Model for Real-Time Chess Gameplay—a hands-on workshop at NeurIPS 2025 in San Diego, where attendees fine-tuned Qwen-based models on AWS Trainium, submitted to a live leaderboard, and watched their models compete in real time.

Participants received dedicated AWS Workshop Studio environments, mastered the nuances of post-training on Trainium chips, and deployed working chess-playing language models—all in 90 minutes. We are incredibly grateful to Emily Webber, Jim Burtoft, Armin Aghaebra, Ruchi Bhatia, and the AWS AI/ML team for their partnership, and to SP Mohanty and the AIcrowd team for powering the leaderboard infrastructure.
♟️The Global Chess Challenge Is Now Live
The NeurIPS workshop was merely the opening gambit. The Global Chess Challenge is now accepting submissions, with $17,000 in cash prizes and $8,000 in compute credits on the line.

The Challenge: Build a chess-playing AI that thinks out loud. Your model reads the board position as text, chooses its moves, and explains its reasoning—all without ever seeing the actual board. Stockfish then evaluates whether those moves hold up.
This is more than a chess competition—it's a reasoning laboratory. Can your model calculate tactics, execute coherent strategy, and communicate its thinking with precision? Chess offers fixed rules, perfect information, and objective measurement, making it an ideal testbed for evaluating text-based reasoning capabilities.
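"Reads the board position as text" typically means a FEN (Forsyth–Edwards Notation) string, which packs the entire game state into one line. A minimal sketch of what a text-only model would consume (our illustration; the exact prompt format is defined in the starter kit):

```python
def parse_fen(fen: str) -> dict:
    """Split a FEN string into its six standard fields."""
    placement, side, castling, en_passant, halfmove, fullmove = fen.split()
    return {
        "ranks": placement.split("/"),  # 8 ranks, from Black's back rank down
        "side_to_move": "white" if side == "w" else "black",
        "castling": castling,
        "en_passant": en_passant,
        "halfmove_clock": int(halfmove),
        "fullmove_number": int(fullmove),
    }

# The standard chess starting position, entirely as text
start = parse_fen("rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1")
```

Everything the model needs—piece placement, whose turn it is, castling rights—is recoverable from that single string, which is what makes chess such a clean fit for language-model reasoning.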
Two Research Tracks:
Data-centric fine-tuning: Supervised learning on Lichess open data with Stockfish annotations
RLVR (Reinforcement Learning with Verifiable Rewards): Use Stockfish as the verifier and reward signal, training with PPO or GRPO
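For the RLVR track, the central design decision is the reward function. One illustrative shape (our own sketch, not the official scoring—the real verifier setup is in the starter kit): reject illegal moves outright, and score legal moves by the engine's centipawn swing, squashed to a bounded range so PPO or GRPO updates stay stable.

```python
import math

def move_reward(eval_before_cp: float, eval_after_cp: float, legal: bool) -> float:
    """Hypothetical verifiable reward built on engine evaluations.
    eval_*_cp are Stockfish centipawn scores from the mover's
    perspective, before and after the candidate move."""
    if not legal:
        return -1.0  # hard penalty: the move failed verification
    swing = eval_after_cp - eval_before_cp
    return math.tanh(swing / 100.0)  # bounded in (-1, 1) for stable RL updates

# A move that gains roughly a pawn of evaluation earns a clearly positive reward
r = move_reward(eval_before_cp=0.0, eval_after_cp=100.0, legal=True)
```

The bounding matters: raw centipawn swings can span thousands of points, and unclipped rewards of that scale tend to destabilize policy-gradient training.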
Timeline:
Challenge opens: December 2, 2025
Round 1 closes: December 31, 2025 (23:55 UTC)
Prize Pool:
🥇 First Place: $10,000 cash + $5,000 compute credits
🥈 Second Place: $5,000 cash + $2,000 compute credits
🥉 Third Place: $2,000 cash + $1,000 compute credits
The starter kit includes a production-ready environment, baseline agents, and reference implementations. Whether you joined us at NeurIPS or are coming in fresh, you can shape the future of reasoning research.
🗓️ Upcoming Events
⭐️Gemini 3 Build Day | December 13

We're closing out the year with the most capable model launch of 2025. Google just shipped Gemini 3—the first LLM to exceed 1500 Elo on LMArena, achieving 31.1% on ARC-AGI-2 (a 6× leap from Gemini 2.5), alongside a 1M token context window and breakthrough capabilities like generative UI.
We're partnering with the Gemini team to bring builders together and explore what's suddenly possible. Note: This build day starts at 9 AM—one hour earlier than our usual events. Participants will have ten hours to push the boundaries and ship something extraordinary.
🤖AGI Market Research: Small Language Models

Last month we co-hosted an SLM Build Day with AWS in NYC, where teams built production-grade prototypes on Trainium in under 8 hours—clinical documentation systems, multi-agent frameworks, privacy-preserving data tools. The results validated something we've been tracking closely: the path to capable small models runs through efficiency, not scale.
We've now published our full memo – The SLM Opportunity: Data Flywheels, Edge Deployment, and the New AI Infrastructure Stack – synthesizing insights from that hackathon alongside research analysis on the SLM landscape.
The central insight: While frontier labs chase trillion-parameter models, a countervailing force has emerged. Smaller, specialized models now match or surpass frontier performance in targeted domains—at a fraction of the cost. DeepSeek-R1 rivals OpenAI's o1 on reasoning benchmarks. Microsoft's Phi-4 outperforms models 20× its size. Meta's Llama 3.2 delivers capable 1B and 3B models to edge devices.
We identify four enduring value sources:
Workflow-embedded applications with proprietary data flywheels that compound over time
Enterprise packaging that captures margin beyond commoditizing model weights
Hardware-architecture co-design creating defensible technical moats
Orchestration layers governing edge-cloud collaboration as the new control plane
The memo provides deep analysis of each vector, mapping the competitive landscape, citing foundational research, and outlining investment implications.
If you're building or investing in this space, consider it required reading.
💼We’re Hiring
Creative Director (Full-Time)
AGI House is seeking a Creative Director to lead our cross-platform content strategy across TikTok, Instagram, YouTube, X, and LinkedIn. You'll direct shoots, craft compelling narratives, oversee editing workflows, and leverage AI creative tools including Pika and Runway.
Exceptional candidates may receive complimentary residency at our Hillsborough house.
Video Editor (Contract)
We're also hiring a contract video editor for our upcoming Gemini 3 Build Day—ideal for rapid-turnaround storytelling and event documentation.
Apply Here → | Questions? Email [email protected]
📚What We've Been Reading
Gated Attention for Large Language Models – We couldn't walk ten feet at NeurIPS without hearing about this paper—it earned a Best Paper award for good reason. The insight is elegantly simple: adding a data-dependent sigmoid gate after attention output dramatically stabilizes training and enhances long-context performance. We're already observing this architectural shift in next-generation models like Qwen3-Next. For model builders in our network, this is a must-implement advancement.
Does Reinforcement Learning Really Incentivize Reasoning? – This paper challenges the prevailing RLVR narrative. The authors demonstrate that RL improves sampling efficiency but doesn't spontaneously generate new reasoning patterns—it surfaces latent capabilities already present in the base model. If you're competing in our Global Chess Challenge, understand this: your base model selection matters more than your RL tuning strategy.
From Words to Worlds: Spatial Intelligence is AI’s Next Frontier – Fei-Fei Li and the World Labs team published this manifesto on the evolution from digital to physical AI. Their argument: the next major scaling law won't emerge from additional text tokens, but from 3D consistency and physics prediction. We're witnessing this shift firsthand—builders in our portfolio are moving beyond pure language models toward "Large World Models" that comprehend geometry, not merely pixels.
Nested Learning: A New ML Paradigm for Continual Learning – Google DeepMind released this compelling paper proposing a "learning to learn" architecture. By embedding an inner optimization loop (rapid adaptation) within an outer loop (long-term stability), they directly address catastrophic forgetting. It evokes the early "Titans" memory research but applied to lifelong learning agents—a critical component for the long-context agents we'll build at the Gemini 3 event.
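As a toy of the two-loop idea (our own Reptile-style sketch, not DeepMind's method): an inner loop rapidly adapts fast weights to each task, while the outer loop nudges the shared slow weights only slightly per task, so earlier solutions aren't overwritten wholesale.

```python
def nested_learning_step(slow_w, tasks, inner_steps=20, inner_lr=0.1, outer_lr=0.2):
    """Toy continual learner on 1-D quadratic tasks with loss (w - target)^2.
    Inner loop: rapid per-task adaptation starting from the shared weights.
    Outer loop: slow, stability-preserving consolidation."""
    for target in tasks:
        fast_w = slow_w
        for _ in range(inner_steps):                # inner loop: fast adaptation
            fast_w -= inner_lr * 2 * (fast_w - target)
        slow_w += outer_lr * (fast_w - slow_w)      # outer loop: small, slow update
    return slow_w

# After two conflicting tasks, the slow weights stay between both targets
# instead of snapping entirely to the most recent one
w = nested_learning_step(slow_w=0.0, tasks=[1.0, -1.0])
```

The separation of timescales is the whole point: the fast weights chase each task, while the slow weights accumulate a compromise—a crude stand-in for how the paper combats catastrophic forgetting.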
Hugging Face Open-R1: Mixture-of-Thoughts – If you're following the data-centric AI conversation, you've encountered this dataset. It contains 350K verified reasoning traces distilled from DeepSeek-R1, effectively democratizing the pipeline behind SOTA reasoning models. This directly validates our SLM thesis: open, high-quality chain-of-thought data has become the primary lever for performance, not raw parameter count.
Join us at Gemini Build Day—or prove your model can outthink Stockfish in the Chess Challenge.
Till next time,
AGI House Team
Want something featured or interested in partnering? Email us