SmallFireDragon Lab

AI Science

Making complex AI concepts understandable for humans

06/28/2026

The "Context Window" of Modern AI: The Engineering Truth from Fixed Length to Infinite Expansion

In LLM brochures, the "Context Window" is often simplified to a single number, such as 128K or 1M. However, for engineers, the context window is not merely "sto

The "Computational Efficiency" of Modern AI: The Engineering Truth Behind Speculative Decoding

06/27/2026

Science

During the inference process of Large Language Models (LLMs), the core bottleneck lies in their "autoregressive" nature: for every token generated, the entire m

The "Computational Leverage" of Modern AI: The Engineering Truth Behind Mixture of Experts (MoE)

06/26/2026

Science

In the evolution of Large Language Models (LLMs), a core contradiction has always persisted: we desire models with vast knowledge (requiring more parameters), y

The "Inference Acceleration Puzzle" of Modern AI: Evolution from KV Cache to PagedAttention

06/25/2026

Science

In the field of LLM inference optimization, discussions often revolve around PagedAttention or Speculative Decoding. However, the underlying problem these techn

The "Health Checkup" for Modern AI Systems: Why Health Checks and Fallback Strategies Matter More Than Single Successes

06/24/2026

Science

Many AI systems appear to run smoothly during demos: a request is sent, the model returns an answer, and the result appears on the page. However, once deployed

The "Dynamic Dispatcher" of Modern AI Systems: A Deep Dive into Continuous Batching

06/23/2026

Science

In production environments for Large Language Models (LLMs), one of the biggest drivers of inference cost is GPU utilization. If you observe a simpl

"Memory Expansion" for Modern AI Systems: A Deep Dive into KV Cache Compression and Quantization

06/22/2026

Science

In the inference process of Large Language Models (LLMs), the most expensive resource is not computational power (FLOPs), but VRAM bandwidth and capacity. When

The "Inference Accelerator" of Modern AI Systems: A Deep Dive into Speculative Decoding

06/21/2026

Science

In production environments for Large Language Models (LLMs), the pain point users notice most directly is the slow "typewriter" output speed. Despite the astonishing com

MoE Routing Is Not a Cost-Saving Switch: Why Expert Models Fear Load Imbalance

06/20/2026

Science

Mixture of Experts (MoE) models appear to offer a straightforward optimization: only a small subset of experts is activated per request, allowing the parameter