vllm.v1.worker.utils ¶
KVBlockZeroer ¶
Manages efficient zeroing of KV cache blocks via a Triton kernel.
Call init_meta once after KV caches are allocated to precompute segment addresses, then call zero_block_ids each step to zero newly-allocated blocks.
Source code in vllm/v1/worker/utils.py
init_meta ¶
init_meta(
attn_groups_iter: Iterable[AttentionGroup],
kernel_block_sizes: list[int],
cache_dtype: str,
runner_only_attn_layers: set[str],
static_forward_context: dict[str, Any],
) -> None
One-time precomputation for zero_block_ids.
Builds absolute-address table for the Triton zeroing kernel. Each entry is the absolute byte address of a segment start on the GPU, so segments in different CUDA allocations work correctly.
Block IDs from the scheduler reference logical blocks whose size may differ from the kernel block size (virtual block splitting). PAGE_SIZE_EL accounts for this ratio so that block_id * PAGE_SIZE_EL lands at the correct offset.
Only AttentionSpec layers are processed; Mamba layers are skipped.
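The logical-to-kernel block ratio can be illustrated with plain arithmetic. This is a hedged sketch only: the function and variable names below (block_offset_elements, logical_block_size, etc.) are illustrative assumptions, not the actual vLLM internals.

```python
# Sketch of the offset math behind PAGE_SIZE_EL (illustrative names,
# not the actual vLLM implementation).
def block_offset_elements(
    block_id: int,
    logical_block_size: int,
    kernel_block_size: int,
    elements_per_kernel_block: int,
) -> int:
    # One logical (scheduler) block spans several kernel blocks when the
    # backend splits blocks virtually.
    ratio = logical_block_size // kernel_block_size
    # Folding the ratio into the page size means block_id * PAGE_SIZE_EL
    # lands at the start of the logical block in the kernel's layout.
    page_size_el = ratio * elements_per_kernel_block
    return block_id * page_size_el

# e.g. logical blocks of 32 tokens split into kernel blocks of 16 tokens,
# each kernel block holding 1024 elements:
offset = block_offset_elements(3, 32, 16, 1024)
```

With a 2:1 split and 1024 elements per kernel block, block 3 starts at element 6144, exactly two kernel blocks' worth of data per logical block.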
Source code in vllm/v1/worker/utils.py
zero_block_ids ¶
Zero the KV cache memory for the given block IDs.
Source code in vllm/v1/worker/utils.py
_zero_kv_blocks_kernel ¶
_zero_kv_blocks_kernel(
seg_addrs_ptr,
block_ids_ptr,
n_blocks,
N_SEGS: constexpr,
PAGE_SIZE_EL: constexpr,
BLOCK_SIZE: constexpr,
)
Zero KV cache blocks across all segments in a single launch.
Each segment is a contiguous region of KV cache data within one buffer. For backends where blocks are outermost (block_dim=0) there is one segment per buffer; for backends where K/V is outermost (block_dim=1) there are two segments per buffer (one for K, one for V).
seg_addrs_ptr holds absolute byte addresses (int64) for each segment, allowing segments to live in different CUDA allocations.
Programs are mapped as (block_index, seg_index, chunk_index).
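The (block_index, seg_index, chunk_index) mapping can be mimicked in plain Python. This is a sketch of the launch-grid shape only; the chunk sizes are made up, and the real kernel runs these programs in parallel on the GPU rather than in a loop.

```python
# Pure-Python sketch of how the zeroing grid maps programs to work items
# (illustrative only; not the actual Triton launch code).
def enumerate_programs(n_blocks: int, n_segs: int,
                       page_size_el: int, block_size: int):
    # Each (block, segment) pair is zeroed in chunks of block_size elements.
    n_chunks = (page_size_el + block_size - 1) // block_size
    for block_index in range(n_blocks):
        for seg_index in range(n_segs):
            for chunk_index in range(n_chunks):
                yield (block_index, seg_index, chunk_index)

programs = list(enumerate_programs(n_blocks=2, n_segs=2,
                                   page_size_el=1024, block_size=512))
# 2 blocks x 2 segments x 2 chunks = 8 programs in one launch
```

A single launch covers every block in every segment, which is why the kernel can zero blocks living in different CUDA allocations at once.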
Source code in vllm/v1/worker/utils.py
add_kv_sharing_layers_to_kv_cache_groups ¶
add_kv_sharing_layers_to_kv_cache_groups(
shared_kv_cache_layers: dict[str, str],
kv_cache_groups: list[KVCacheGroupSpec],
runner_only_attn_layers: set[str] | None = None,
) -> None
Sets up KV cache sharing by reusing the allocated KV caches in kv_caches for layers that do not allocate their own KV cache, based on the mapping in shared_kv_cache_layers. Adds these layers to the corresponding KV cache group, which is needed to ensure that attention metadata is assigned later.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
shared_kv_cache_layers | dict[str, str] | Layer pairings for cross-layer KV sharing. | required |
kv_cache_groups | list[KVCacheGroupSpec] | The KV cache groups of the model. | required |
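The group bookkeeping can be sketched with plain lists standing in for KVCacheGroupSpec. All names and data shapes here are assumptions for illustration, not the actual vLLM structures.

```python
# Minimal sketch of KV-sharing group bookkeeping (plain lists of layer
# names stand in for KVCacheGroupSpec; not the actual vLLM code).
def add_shared_layers(shared_kv_cache_layers: dict[str, str],
                      kv_cache_groups: list[list[str]]) -> None:
    for layer_name, target_layer_name in shared_kv_cache_layers.items():
        for group in kv_cache_groups:
            # The sharing layer joins the group of the layer whose cache
            # it reuses, so attention metadata later covers it as well.
            if target_layer_name in group:
                group.append(layer_name)
                break

groups = [["model.layers.0.attn"], ["model.layers.1.attn"]]
add_shared_layers({"model.layers.2.attn": "model.layers.0.attn"}, groups)
# groups[0] now also contains "model.layers.2.attn"
```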
Source code in vllm/v1/worker/utils.py
bind_kv_cache ¶
bind_kv_cache(
kv_caches: dict[str, Tensor],
forward_context: dict[str, Attention],
runner_kv_caches: list[Tensor],
num_attn_module: int = 1,
) -> None
Bind the allocated KV cache to both ModelRunner and forward context so that the KV cache can be used in the forward pass.
This function:

1. Fills the ModelRunner's kv cache list (runner_kv_caches) with kv_caches.
2. Associates each attention layer in the forward_context with its corresponding KV cache in kv_caches.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
kv_caches | dict[str, Tensor] | The allocated kv_caches with layer names as keys. | required |
forward_context | dict[str, Attention] | The global forward context containing all Attention layers with layer names as keys. | required |
runner_kv_caches | list[Tensor] | The kv_cache declared by ModelRunner. | required |
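The two binding steps can be sketched as follows. This is a simplified stand-in: SimpleNamespace replaces the real Attention layer objects, strings replace tensors, and sorting by name stands in for the actual layer-ordering logic.

```python
# Hedged sketch of the two binding steps (not the actual vLLM code).
from types import SimpleNamespace

def bind_kv_cache_sketch(kv_caches: dict[str, object],
                         forward_context: dict[str, SimpleNamespace],
                         runner_kv_caches: list) -> None:
    # 1) Fill the ModelRunner's flat list, here simply in name order.
    for layer_name in sorted(kv_caches):
        runner_kv_caches.append(kv_caches[layer_name])
    # 2) Point each attention layer at its own cache so the forward
    #    pass can read and write it.
    for layer_name, cache in kv_caches.items():
        forward_context[layer_name].kv_cache = cache

caches = {"layer.0": "KV0", "layer.1": "KV1"}
ctx = {"layer.0": SimpleNamespace(kv_cache=None),
       "layer.1": SimpleNamespace(kv_cache=None)}
runner_caches: list = []
bind_kv_cache_sketch(caches, ctx, runner_caches)
```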
Source code in vllm/v1/worker/utils.py
is_residual_scattered_for_sp ¶
is_residual_scattered_for_sp(
vllm_config: VllmConfig, num_input_tokens: int
) -> bool
Check if the residual tensor is scattered for sequence parallelism.
The residual tensor is scattered across tensor parallel ranks when both sequence parallelism and tensor parallelism are enabled.
This follows the same logic as SequenceParallelismPass.is_applicable_for_range():

- In full-graph compilation mode (no splitting ops, or using inductor graph partition), SP is always applied.
- Otherwise, SP is only applied for specific shapes in compile_sizes.
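The decision above can be sketched as a small predicate. The flags below (sp_enabled, full_graph_mode, etc.) are illustrative assumptions, not the actual VllmConfig fields.

```python
# Sketch of the SP-applicability decision (illustrative flags, not the
# actual VllmConfig attributes).
def residual_scattered(sp_enabled: bool, tp_size: int,
                       full_graph_mode: bool,
                       compile_sizes: set,
                       num_input_tokens: int) -> bool:
    # SP only matters when both SP and TP are actually in effect.
    if not sp_enabled or tp_size <= 1:
        return False
    if full_graph_mode:
        # No splitting ops / inductor graph partition: always applied.
        return True
    # Otherwise applied only for explicitly compiled shapes.
    return num_input_tokens in compile_sizes
```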
Source code in vllm/v1/worker/utils.py
prepare_kernel_block_sizes ¶
prepare_kernel_block_sizes(
kv_cache_config: KVCacheConfig,
attn_groups: list[list[AttentionGroup]],
) -> list[int]
Generate kernel_block_sizes that matches each block_size.
For attention backends that support virtual block splitting, use the supported block sizes from the backend. For other backends (like Mamba), use the same block size (no splitting).
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
kv_cache_config | KVCacheConfig | The KV cache configuration. | required |
attn_groups | list[list[AttentionGroup]] | Attention groups indexed by KV cache group id. | required |
Returns:
| Type | Description |
|---|---|
list[int] | List of kernel block sizes for each cache group. |
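The per-group selection can be sketched as below. This is a hedged stand-in: supported sizes are passed as plain lists (None meaning "no splitting support, e.g. Mamba"), whereas the real function queries the backend classes.

```python
# Sketch of per-group kernel block size selection (not the actual
# vLLM logic; supported sizes are passed in directly for illustration).
def kernel_block_sizes_sketch(
    group_block_sizes: list,
    supported_sizes_per_group: list,
) -> list:
    out = []
    for block_size, supported in zip(group_block_sizes,
                                     supported_sizes_per_group):
        if supported is None:
            # Backends without virtual splitting keep the manager size.
            out.append(block_size)
        elif block_size in supported:
            out.append(block_size)
        else:
            # Largest supported size that divides the manager block size.
            out.append(max(s for s in supported if block_size % s == 0))
    return out
```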
Source code in vllm/v1/worker/utils.py
request_memory ¶
request_memory(
init_snapshot: MemorySnapshot, cache_config: CacheConfig
) -> int
Calculate the amount of memory required by vLLM, then validate that the amount of currently free memory is sufficient to provide it.
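A minimal sketch of this check, assuming the requested amount is total memory scaled by gpu_memory_utilization; the real function works from a MemorySnapshot and CacheConfig and may differ in detail.

```python
# Hedged sketch of the memory request/validation step (simplified;
# not the actual vLLM implementation).
def request_memory_sketch(total_gpu_bytes: int, free_gpu_bytes: int,
                          gpu_memory_utilization: float) -> int:
    requested = int(total_gpu_bytes * gpu_memory_utilization)
    if free_gpu_bytes < requested:
        raise ValueError(
            f"Need {requested} bytes but only {free_gpu_bytes} free; "
            "lower gpu_memory_utilization or free GPU memory.")
    return requested
```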
Source code in vllm/v1/worker/utils.py
sanity_check_mm_encoder_outputs ¶
sanity_check_mm_encoder_outputs(
mm_embeddings: MultiModalEmbeddings,
expected_num_items: int,
) -> None
Perform sanity checks for the result of vllm.model_executor.models.SupportsMultiModal.embed_multimodal.
Source code in vllm/v1/worker/utils.py
select_common_block_size ¶
select_common_block_size(
kv_manager_block_size: int,
backends: list[type[AttentionBackend]],
) -> int
Select a block size that is supported by all backends and is a factor of kv_manager_block_size.
If kv_manager_block_size is supported by all backends, return it directly. Otherwise, return the largest commonly supported size that divides kv_manager_block_size.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
kv_manager_block_size | int | Block size of KV cache. | required |
backends | list[type[AttentionBackend]] | List of attention backend classes. | required |
Returns:
| Type | Description |
|---|---|
int | The selected block size. |
Raises:
| Type | Description |
|---|---|
ValueError | If no valid block size found. |
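The selection rule can be sketched as follows, with each backend's supported sizes passed as a plain list; the real function queries AttentionBackend classes instead.

```python
# Sketch of common block size selection (illustrative; not the actual
# vLLM code, which inspects AttentionBackend classes).
def select_common_block_size_sketch(
    kv_manager_block_size: int,
    supported_sizes_per_backend: list,
) -> int:
    def ok(size: int) -> bool:
        return all(size in sizes for sizes in supported_sizes_per_backend)

    # Prefer the manager block size itself when everyone supports it.
    if ok(kv_manager_block_size):
        return kv_manager_block_size
    # Otherwise take the largest size supported by every backend that
    # divides the manager block size.
    candidates = [s for s in supported_sizes_per_backend[0]
                  if kv_manager_block_size % s == 0 and ok(s)]
    if not candidates:
        raise ValueError("No valid block size found")
    return max(candidates)
```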