vllm.v1.kv_cache_interface ¶
ChunkedLocalAttentionSpec dataclass ¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
max_admission_blocks_per_request ¶
Per-request admission cap, in blocks.
Single source of truth for both startup pool sizing (max_memory_usage_bytes) and the runtime admission gate, so requests admitted by startup can also be admitted at runtime.
Source code in vllm/v1/kv_cache_interface.py
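The cap itself is a ceiling division of the admissible token count by the block size. The sketch below is an illustrative reimplementation, not vLLM's code; the helper name `max_admission_blocks_per_request` mirrors the method above, but its signature and the token count it consumes are assumptions.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division."""
    return -(-a // b)

def max_admission_blocks_per_request(max_attended_tokens: int,
                                     block_size: int) -> int:
    # One value serves both startup pool sizing and the runtime admission
    # gate, so the two can never disagree about what fits.
    return cdiv(max_attended_tokens, block_size)

# With 16-token blocks, 1000 admissible tokens need 63 blocks.
print(max_admission_blocks_per_request(1000, 16))
```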
CrossAttentionSpec dataclass ¶
Bases: AttentionSpec
KV cache spec for cross-attention layers in encoder-decoder models.
Source code in vllm/v1/kv_cache_interface.py
FullAttentionSpec dataclass ¶
Bases: AttentionSpec
When the hybrid allocator is disabled and the model contains both full attention layers and sliding window attention layers, the sliding window attention layers are regarded as full attention by the KV cache manager (blocks are allocated for all tokens), while still being computed as sliding window attention in the model runner. In this case, we use FullAttentionSpec and record the sliding window size.
Source code in vllm/v1/kv_cache_interface.py
sliding_window class-attribute instance-attribute ¶
sliding_window: int | None = None
Defaults to None, meaning sliding window attention is not used.
merge classmethod ¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
Source code in vllm/v1/kv_cache_interface.py
KVCacheConfig dataclass ¶
The KV cache configuration of a model.
Source code in vllm/v1/kv_cache_interface.py
kv_cache_groups instance-attribute ¶
kv_cache_groups: list[KVCacheGroupSpec]
The kv cache groups of the model. For models with only one type of attention, there is only one group that contains all layers. For models with multiple types of attention, there will be multiple groups, see _get_kv_cache_config_uniform_page_size for more details.
kv_cache_tensors instance-attribute ¶
kv_cache_tensors: list[KVCacheTensor]
How the model runner should initialize the KV cache tensors for each layer.
KVCacheGroupSpec dataclass ¶
Represents a group of model layers that share the same KV cache block table. These layers are regarded as one layer in the KV cache manager.
Source code in vllm/v1/kv_cache_interface.py
KVCacheSpec dataclass ¶
A base class for specifying the KV cache format of one layer.
Source code in vllm/v1/kv_cache_interface.py
copy_with_new_block_size ¶
Create a new KVCacheSpec from self but replacing the block size.
max_memory_usage_bytes ¶
max_memory_usage_bytes(vllm_config: VllmConfig) -> int
The maximum possible memory usage of this KV cache in bytes.
Returns:
| Type | Description |
|---|---|
| int | The KV cache size in bytes |
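The bound follows from page size times the number of pages the longest possible request needs. A minimal sketch, assuming the standard KV layout (a factor of 2 for the K and V planes); the function name and parameters are illustrative, not vLLM's signature.

```python
def cdiv(a: int, b: int) -> int:
    return -(-a // b)

def max_memory_usage_bytes(max_model_len: int, block_size: int,
                           num_kv_heads: int, head_size: int,
                           dtype_bytes: int) -> int:
    # Page size: K and V planes (factor 2) for every slot in a block.
    page_size = 2 * block_size * num_kv_heads * head_size * dtype_bytes
    # Worst case: enough pages to hold the longest possible request.
    return cdiv(max_model_len, block_size) * page_size

# e.g. 8192-token context, 16-token blocks, 8 KV heads of size 128, fp16:
print(max_memory_usage_bytes(8192, 16, 8, 128, 2))
```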
merge classmethod ¶
Merge a list of KVCacheSpec objects into a single KVCacheSpec object.
Source code in vllm/v1/kv_cache_interface.py
KVCacheTensor dataclass ¶
A class for specifying how the workers should initialize the KV cache.
Source code in vllm/v1/kv_cache_interface.py
KVQuantMode ¶
Bases: IntEnum
KV cache quantization mode.
Used by attention backends and kernels to dispatch quantization logic without string matching on kv_cache_dtype.
Source code in vllm/v1/kv_cache_interface.py
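The point of an IntEnum here is that kernels branch on an integer value rather than comparing dtype strings at every call site. A hedged sketch; the member names and the scale value are assumptions (448.0 is the max representable e4m3 value), not vLLM's actual members.

```python
from enum import IntEnum

class KVQuantMode(IntEnum):
    # Illustrative members only.
    NONE = 0
    FP8 = 1

def scale_factor(mode: KVQuantMode) -> float:
    # Dispatch on the enum value instead of string-matching
    # kv_cache_dtype (e.g. "fp8_e4m3") inside the hot path.
    if mode is KVQuantMode.FP8:
        return 448.0
    return 1.0

print(scale_factor(KVQuantMode.FP8))
```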
SinkFullAttentionSpec dataclass ¶
Bases: FullAttentionSpec
Source code in vllm/v1/kv_cache_interface.py
merge classmethod ¶
Merge a list of FullAttentionSpec objects into a single FullAttentionSpec object.
Source code in vllm/v1/kv_cache_interface.py
SlidingWindowMLASpec dataclass ¶
Bases: SlidingWindowSpec
Sliding window attention with MLA cache format.
Source code in vllm/v1/kv_cache_interface.py
SlidingWindowSpec dataclass ¶
Bases: AttentionSpec
Source code in vllm/v1/kv_cache_interface.py
max_admission_blocks_per_request ¶
Per-request admission cap, in blocks.
Single source of truth for both startup pool sizing (max_memory_usage_bytes) and the runtime admission gate. Per-request real-held blocks plateau at this bound because SlidingWindowManager.remove_skipped_blocks runs from allocate_slots before each chunk's get_num_blocks_to_allocate.
Source code in vllm/v1/kv_cache_interface.py
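The plateau can be seen in a toy simulation: if blocks that fell entirely outside the window are released before each allocation, the blocks a request really holds stop growing once the window is full. Everything below is an assumption-laden sketch of that behavior, not the SlidingWindowManager logic itself.

```python
def cdiv(a: int, b: int) -> int:
    return -(-a // b)

def held_blocks_after(num_tokens: int, window: int, block_size: int) -> int:
    # Blocks entirely before the window's first token have been released.
    first_needed_token = max(0, num_tokens - window)
    first_needed_block = first_needed_token // block_size
    last_block = cdiv(num_tokens, block_size)
    return last_block - first_needed_block

# window=1024, block_size=16: the cap is the window's blocks plus one
# boundary block, and held blocks never exceed it as the request grows.
cap = cdiv(1024, 16) + 1
for n in (512, 1024, 1030, 4096, 65536):
    assert held_blocks_after(n, 1024, 16) <= cap
print("held blocks stay within the cap")
```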
TQFullAttentionSpec dataclass ¶
Bases: FullAttentionSpec
FullAttentionSpec with TQ-aware page size.
Python equivalent of the C++ TQ4FullAttentionSpec. Overrides real_page_size_bytes to use TQ slot bytes instead of the raw head_size * dtype formula.
Source code in vllm/v1/kv_cache_interface.py
UniformTypeKVCacheSpecs dataclass ¶
Bases: KVCacheSpec
A KV cache spec for multiple layers with the same type of attention. Here, same types means always need the same number of token slots. For example, sliding window attentions with different window sizes are not the same type and should not be merged into one UniformTypeKVCacheSpecs.
Source code in vllm/v1/kv_cache_interface.py
from_specs classmethod ¶
from_specs(
kv_cache_specs: dict[str, KVCacheSpec],
) -> Self | None
Return a UniformTypeKVCacheSpecs object if all layers have the same type of KV cache spec. Return None otherwise.
Source code in vllm/v1/kv_cache_interface.py
is_uniform_type classmethod ¶
is_uniform_type(
kv_cache_specs: dict[str, KVCacheSpec],
) -> bool
Whether all layers have the same type of KV cache spec.
Source code in vllm/v1/kv_cache_interface.py
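The uniformity rule from the class docstring — same type means the layers always need the same number of token slots — can be sketched with sliding window specs, where differing window sizes break uniformity. The stand-in class and check below are illustrative assumptions, not vLLM's implementation.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class SWSpec:
    # Simplified stand-in for a sliding window attention spec.
    sliding_window: int

def is_uniform_type(specs: dict[str, SWSpec]) -> bool:
    # Same window size => same number of token slots needed => uniform.
    return len({s.sliding_window for s in specs.values()}) == 1

print(is_uniform_type({"l0": SWSpec(1024), "l1": SWSpec(1024)}))  # True
print(is_uniform_type({"l0": SWSpec(1024), "l1": SWSpec(2048)}))  # False
```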
get_kv_quant_mode ¶
get_kv_quant_mode(kv_cache_dtype: str) -> KVQuantMode
Map a kv_cache_dtype string to a :class:KVQuantMode.
Source code in vllm/v1/kv_cache_interface.py
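A mapping like this is a one-time string parse at setup, after which everything dispatches on the enum. The sketch below is hypothetical: the accepted dtype strings and the prefix test are assumptions, not the actual rules vLLM applies.

```python
from enum import IntEnum

class KVQuantMode(IntEnum):
    # Illustrative members only.
    NONE = 0
    FP8 = 1

def get_kv_quant_mode(kv_cache_dtype: str) -> KVQuantMode:
    # Parse the string once here; kernels then see only the enum.
    if kv_cache_dtype.startswith("fp8"):   # e.g. "fp8", "fp8_e4m3"
        return KVQuantMode.FP8
    return KVQuantMode.NONE

print(get_kv_quant_mode("fp8_e4m3").name)  # FP8
```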
kv_cache_uses_per_token_head_scales ¶
Return True if kv_cache_dtype needs per-token-head scales.