vllm.model_executor.models.molmo2 ¶
AdapterConfig dataclass ¶
Config for a ViT-LLM adapter
Source code in vllm/model_executor/models/molmo2.py
ImagePoolingAttention ¶
Bases: Module
Multi-head attention used for image pooling
Source code in vllm/model_executor/models/molmo2.py
ImageProjectorMLP ¶
Bases: Module
MLP used for the image projector
Source code in vllm/model_executor/models/molmo2.py
LanguageModelMLP ¶
Bases: Module
Molmo2's LLM MLP.
Source code in vllm/model_executor/models/molmo2.py
Molmo2Attention ¶
Bases: Module
Molmo2's LLM Attention.
Source code in vllm/model_executor/models/molmo2.py
Molmo2ForConditionalGeneration ¶
Bases: Module, SupportsMultiModal, SupportsPP, SupportsLoRA, SupportsQuant
Source code in vllm/model_executor/models/molmo2.py
get_mm_mapping ¶
Get the module prefix in multimodal models
Source code in vllm/model_executor/models/molmo2.py
Molmo2ImageInputs ¶
Bases: TensorSchema
Dimensions
- nc: The total number of crops (dynamic)
- np: The total number of patches per crop
- cps: Number of channels * patch_size * patch_size
- npp: Number of pooled patches (dynamic)
- pp: pooling_size * pooling_size
- ni: Number of images
- nt: Number of image tokens (dynamic)
Source code in vllm/model_executor/models/molmo2.py
token_pooling instance-attribute ¶
token_pooling: Annotated[Tensor, TensorShape(npp, pp)]
An index tensor that maps image features to their corresponding patch tokens before pooling.
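As a rough illustration of how such an index map can drive pooling, the sketch below uses plain Python lists in place of tensors; the `(npp, pp)` shape comes from the schema above, but the mean-pooling step and the helper name are assumptions, not the actual vllm implementation.

```python
# Sketch: pooling patch features via an index map of shape (npp, pp).
# Each row of token_pooling lists the pp patch indices that are
# combined into one pooled token (mean pooling is assumed here).

def pool_features(features, token_pooling):
    """features: list of per-patch feature vectors.
    token_pooling: (npp, pp) nested list of patch indices."""
    pooled = []
    for patch_ids in token_pooling:
        gathered = [features[i] for i in patch_ids]
        # Average the gathered patch features element-wise.
        dim = len(gathered[0])
        pooled.append([sum(v[d] for v in gathered) / len(gathered)
                       for d in range(dim)])
    return pooled

# Four patches with 2-d features, pooled pairwise (pp = 2).
feats = [[1.0, 0.0], [3.0, 0.0], [0.0, 2.0], [0.0, 6.0]]
token_pooling = [[0, 1], [2, 3]]
print(pool_features(feats, token_pooling))  # [[2.0, 0.0], [0.0, 4.0]]
```

In the real model the gather/mean would be a batched tensor operation, but the index semantics are the same: row `i` of `token_pooling` selects the patches that collapse into pooled token `i`.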
Molmo2VideoInputs ¶
Bases: TensorSchema
Dimensions
- nc: The total number of frames (dynamic)
- np: The total number of patches per frame
- cps: Number of channels * patch_size * patch_size
- npp: Number of pooled patches (dynamic)
- pp: pooling_size * pooling_size
- nv: Number of videos
- nt: Number of video tokens (dynamic)
Source code in vllm/model_executor/models/molmo2.py
token_pooling instance-attribute ¶
token_pooling: Annotated[Tensor, TensorShape(npp, pp)]
An index tensor that maps video features to their corresponding patch tokens before pooling.
Molmo2VisionBackbone ¶
Bases: Module, SupportsQuant
Source code in vllm/model_executor/models/molmo2.py
encode_image ¶
:param images: (batch_size, num_crops, num_patch, n_pixels)
Source code in vllm/model_executor/models/molmo2.py
Molmo2VisionBlock ¶
Bases: Module
Residual attention block used in Vision Transformer.
Source code in vllm/model_executor/models/molmo2.py
Molmo2VisionBlockCollection ¶
Bases: Module
Collection of residual attention blocks used in Vision Transformer.
Source code in vllm/model_executor/models/molmo2.py
Molmo2VisionTransformer ¶
Bases: Module
Vision Transformer used in Vision Backbone.
Source code in vllm/model_executor/models/molmo2.py
forward ¶
:param x: (batch_size, num_patch, n_pixels)
Source code in vllm/model_executor/models/molmo2.py
TextConfig dataclass ¶
Configuration for a text model transformer
Source code in vllm/model_executor/models/molmo2.py
additional_vocab_size class-attribute instance-attribute ¶
additional_vocab_size: int = 128
Number of additional tokens for which to create input embeddings.
head_dim class-attribute instance-attribute ¶
head_dim: int = 128
The head dimensionality for the attention mechanism.
hidden_act class-attribute instance-attribute ¶
hidden_act: str = 'silu'
The activation function to use within the MLP layers.
hidden_size class-attribute instance-attribute ¶
hidden_size: int = 3584
The hidden size of the model.
intermediate_size class-attribute instance-attribute ¶
intermediate_size: int = 18944
The hidden size for the MLP.
layer_norm_eps class-attribute instance-attribute ¶
layer_norm_eps: float = 1e-06
Epsilon for layer norms.
max_position_embeddings class-attribute instance-attribute ¶
max_position_embeddings: int = 4096
Maximum number of positional embeddings to use in the RoPE cache.
norm_after class-attribute instance-attribute ¶
norm_after: bool = False
If True, apply layer norm after the attention and MLP blocks instead of before.
num_attention_heads class-attribute instance-attribute ¶
num_attention_heads: int = 28
The number of self-attention heads.
num_hidden_layers class-attribute instance-attribute ¶
num_hidden_layers: int = 48
The number of layers/blocks.
num_key_value_heads class-attribute instance-attribute ¶
num_key_value_heads: int = 4
The number of heads to use for keys and values.
qk_norm_type class-attribute instance-attribute ¶
qk_norm_type: str = 'olmo'
The type of layer norm to use for the keys and queries. Can be "olmo" or "qwen3".
rope_scaling_layers class-attribute instance-attribute ¶
RoPE scaling layers.
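To make the relationships between the attention-related defaults above concrete, here is a simplified stand-in for `TextConfig` (an illustration only, not the actual vllm dataclass; only the fields documented above are reproduced).

```python
from dataclasses import dataclass

# Simplified stand-in for TextConfig, using the defaults documented
# above. This is a sketch, not the vllm class itself.
@dataclass
class TextConfigSketch:
    hidden_size: int = 3584
    intermediate_size: int = 18944
    num_attention_heads: int = 28
    num_key_value_heads: int = 4
    head_dim: int = 128
    num_hidden_layers: int = 48

cfg = TextConfigSketch()
# head_dim is the hidden size split evenly across the query heads...
assert cfg.hidden_size // cfg.num_attention_heads == cfg.head_dim
# ...and grouped-query attention shares each KV head across 7 query heads.
print(cfg.num_attention_heads // cfg.num_key_value_heads)  # 7
```

The 28:4 ratio of query to key/value heads is what makes `num_key_value_heads` a grouped-query-attention setting rather than full multi-head attention.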
ViTMLP ¶
Bases: Module
MLP used in Vision Transformer.
Source code in vllm/model_executor/models/molmo2.py
ViTMultiHeadDotProductAttention ¶
Bases: Module
Multi-head attention used in Vision Transformer.
Source code in vllm/model_executor/models/molmo2.py
VitConfig dataclass ¶
Config for a vision transformer
Source code in vllm/model_executor/models/molmo2.py
get_candidate_target_fps ¶
get_candidate_target_fps(
video_fps: int | float,
sampling_fps: int | float,
max_fps: int | float = _MAX_VIDEO_FPS,
) -> list[float]
Return the subset of video_fps factors that remain multiples of sampling_fps.
Examples:
>>> get_candidate_target_fps(video_fps=6, sampling_fps=2)
[2, 6]
>>> get_candidate_target_fps(video_fps=5, sampling_fps=1)
[1, 5]
>>> get_candidate_target_fps(video_fps=2, sampling_fps=2)
[2]
>>> get_candidate_target_fps(video_fps=5, sampling_fps=2)
Traceback (most recent call last):
...
ValueError: sampling_fps=2 must divide video_fps=5 to produce
consistent frame steps.
Source code in vllm/model_executor/models/molmo2.py
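A sketch consistent with the doctests above; the exact vllm implementation, and the `_MAX_VIDEO_FPS` default (assumed to be 30 here), may differ.

```python
# Assumed cap on candidate fps values; the real _MAX_VIDEO_FPS in
# vllm may be a different value.
_MAX_VIDEO_FPS = 30.0

def get_candidate_target_fps(video_fps, sampling_fps,
                             max_fps=_MAX_VIDEO_FPS):
    if video_fps % sampling_fps != 0:
        raise ValueError(
            f"sampling_fps={sampling_fps} must divide "
            f"video_fps={video_fps} to produce consistent frame steps.")
    limit = int(min(video_fps, max_fps))
    # Keep every divisor of video_fps that is also a multiple of
    # sampling_fps, so sampled frames stay evenly spaced.
    return [float(f) for f in range(int(sampling_fps), limit + 1)
            if video_fps % f == 0 and f % sampling_fps == 0]

print(get_candidate_target_fps(6, 2))  # [2.0, 6.0]
print(get_candidate_target_fps(5, 1))  # [1.0, 5.0]
```

The divisor requirement is what keeps the frame step constant: sampling a `video_fps`-rate stream at a target fps that divides it evenly means taking every k-th frame for an integer k.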
get_target_fps ¶
get_target_fps(
video_fps: float,
max_frames: int,
total_frames: int,
frame_sample_mode: str,
candidate_target_fps: list[float],
) -> float | None
Get the target fps that best spans the video while sampling the most frames.
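One plausible reading of that selection rule is sketched below: span the whole video at each candidate fps and keep the highest fps whose sampled-frame count still fits within `max_frames`. This is a hedged sketch only; the real vllm logic, including how `frame_sample_mode` alters the choice, may differ.

```python
# Sketch of a "most frames within budget" selection rule (an
# assumption, not the actual vllm implementation).

def get_target_fps(video_fps, max_frames, total_frames,
                   frame_sample_mode, candidate_target_fps):
    # frame_sample_mode is accepted for signature parity but ignored
    # in this sketch.
    best = None
    for fps in sorted(candidate_target_fps):
        n_sampled = int(total_frames * fps / video_fps)
        if n_sampled <= max_frames:
            best = fps  # higher fps => more frames, still within budget
    return best

# 300 frames at 30 fps; a 64-frame budget admits 6 fps (60 frames)
# but not 10 fps (100 frames).
print(get_target_fps(30.0, 64, 300, "uniform", [2.0, 6.0, 10.0]))  # 6.0
```

Returning `None` when no candidate fits mirrors the `float | None` return type in the signature above.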