vllm.compilation.passes.fusion.sequence_parallelism ¶
SequenceParallelismPass ¶
Bases: VllmPatternMatcherPass
This pass enables sequence parallelism for models. It identifies patterns where an AllReduce operation is followed by an RMSNorm (or RMSNorm and then Quantization) operation. These patterns are replaced with a ReduceScatter operation, followed by a local RMSNorm/Quantization, and then an AllGather operation.
The general transformation is:

Input -> AllReduce -> RMSNorm -> Output

becomes

Input -> ReduceScatter -> RMSNorm -> AllGather -> Output
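The rewrite is numerically equivalent because reducing each rank's token shard and then gathering the shards reproduces the full all-reduce result. A minimal single-process sketch (using NumPy arrays in place of real distributed collectives; the shapes and rank simulation are illustrative assumptions):

```python
import numpy as np

tp_size = 4
tokens, hidden = 8, 16
# Per-rank partial outputs (e.g. partial GEMM results before the AllReduce).
partials = [np.random.rand(tokens, hidden) for _ in range(tp_size)]

# AllReduce: every rank ends up with the elementwise sum over all ranks.
allreduce_out = sum(partials)

# ReduceScatter: each rank r reduces only its shard of tokens // tp_size rows.
shard = tokens // tp_size
scattered = [
    sum(p[r * shard:(r + 1) * shard] for p in partials)
    for r in range(tp_size)
]
# The local RMSNorm (and optional quantization) would run here on each shard.

# AllGather: concatenating the shards recovers the full all-reduced tensor.
allgather_out = np.concatenate(scattered, axis=0)

assert np.allclose(allreduce_out, allgather_out)
```

The local op between ReduceScatter and AllGather now operates on 1/tp_size of the tokens, which is what makes the later GEMM + collective fusions possible.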
While this pass itself does not directly yield performance improvements, it lays the groundwork for subsequent fusion passes, such as GEMM + ReduceScatter and AllGather + GEMM fusions. These fusions can significantly reduce communication overhead and improve overall model performance.
This pass is only supported when compiling the whole graph (fullgraph mode, i.e. using Inductor graph partition or empty splitting_ops). Piecewise compilation is not supported because the residual tensor gets split across TP ranks, causing size mismatches at subgraph boundaries.
This pass splits up the residual tensor across TP ranks and hence divides its size. Because the pattern matcher starts at the end of the graph, the replacement contains a slice that temporarily conforms the input residual to the correct size. After all patterns have been matched, we use a NoOpEliminationPass to clean up what have now become no-op slices.
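The slice bookkeeping described above can be sketched as follows (shapes and names are hypothetical; the real pass operates on FX graph nodes, not tensors):

```python
import numpy as np

tp_size, tokens, hidden = 4, 8, 16
residual = np.random.rand(tokens, hidden)

# Inside a replacement, the residual is temporarily sliced down to the
# per-rank shard size so shapes match at the pattern boundary.
shard = tokens // tp_size
local_residual = residual[:shard]
assert local_residual.shape == (shard, hidden)

# Once every pattern has been matched, the surviving slices span the full
# tensor, i.e. residual[:tokens] -- a no-op that NoOpEliminationPass removes.
noop_slice = residual[:tokens]
assert np.array_equal(noop_slice, residual)
```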
Source code in vllm/compilation/passes/fusion/sequence_parallelism.py
is_applicable_for_range ¶
Determines if sequence parallelism should be applied for the given compile range.
SP is only beneficial for larger batch sizes where the communication overhead is amortized. For small batches, the overhead of splitting and gathering tensors across TP ranks outweighs the benefits.
Returns False (SP disabled) when:

- min_token_num is None (SP disabled for this device/config)
- The compile range starts below the minimum token threshold
Source code in vllm/compilation/passes/fusion/sequence_parallelism.py
_SequenceParallelPatternHelper ¶
Helper for sequence parallelism patterns.
Source code in vllm/compilation/passes/fusion/sequence_parallelism.py
get_sequence_parallelism_threshold ¶
get_sequence_parallelism_threshold(
hidden_size: int, tp_size: int, element_size: int
) -> int | None
Calculate the minimum token threshold for applying sequence parallelism.
Returns None if sequence parallelism should not be applied based on model size.
Branching logic based on device capability:

- Check if hidden_size >= SP_MIN_HIDDEN_SIZE[device_capability]
- If not, returns None (SP disabled for small models on this device)
- If yes, calculates threshold based on per-GPU size
min_token_num = (min_per_gpu_size_mb * tp_size * MiB) // (hidden_size * element_size)