vllm.model_executor.layers.sparse_attn_indexer ¶
Custom Sparse Attention Indexer layers.
SparseAttnIndexer ¶
Bases: CustomOp
Sparse Attention Indexer Custom Op Layer. This layer is extracted as a separate custom op because it relies on heavy custom kernels such as mqa_logits, paged_mqa_logits, and top_k_per_row. These kernels may require hardware-specific memory layouts or implementations to achieve optimal performance on different backends.
For now, the default native path uses the CUDA backend. Other platforms may need to add the corresponding custom-op name sparse_attn_indexer to custom_ops in CompilationConfig to enable their platform-specific path.
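As a rough sketch of what enabling the platform-specific path might look like, assuming the standard vLLM entry points (CompilationConfig and its "+op"/"-op" toggle syntax are vLLM's; the model name here is only a placeholder):

```python
from vllm import LLM
from vllm.config import CompilationConfig

# Prefixing an op name with "+" enables it in custom_ops; "-" would disable it.
llm = LLM(
    model="your/model-using-the-indexer",  # placeholder model name
    compilation_config=CompilationConfig(custom_ops=["+sparse_attn_indexer"]),
)
```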
Source code in vllm/model_executor/layers/sparse_attn_indexer.py
_gather_workspace_shapes ¶
_gather_workspace_shapes(
total_seq_lens: int,
head_dim: int,
fp8_dtype: dtype,
use_fp4_cache: bool,
) -> tuple[
tuple[tuple[int, int], dtype],
tuple[tuple[int, int], dtype],
]
Return ((values_shape, values_dtype), (scales_shape, scales_dtype)) for the K-gather workspace. FP8 path: (T, head_dim) fp8 values plus (T, 4) uint8 scales (fp32 scales viewed as bytes). MXFP4 path: (T, head_dim // 2) uint8 packed mxfp4 values plus (T, head_dim // MXFP4_BLOCK_SIZE) uint8 ue8m0 scales.
Source code in vllm/model_executor/layers/sparse_attn_indexer.py
kv_cache_as_quant_view ¶
Return a 4D [num_blocks, block_size, 1, head_width] view, as expected by DeepGEMM, of the 3D indexer kv-cache allocation.
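A minimal sketch of the reshape, assuming the 3D allocation is laid out as [num_blocks, block_size, head_width] (the real helper may also reinterpret the storage dtype for the quantized view):

```python
import torch


def kv_cache_as_quant_view(kv_cache: torch.Tensor, head_width: int) -> torch.Tensor:
    # Expose the 3D indexer kv-cache [num_blocks, block_size, head_width]
    # as the 4D layout DeepGEMM expects: [num_blocks, block_size, 1, head_width],
    # where the singleton dimension is the single indexer head.
    num_blocks, block_size, _ = kv_cache.shape
    return kv_cache.view(num_blocks, block_size, 1, head_width)
```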