Models
Datasets
Spaces
Buckets new
Docs
Enterprise
Pricing
- Website
- Community
- Solutions
Log In
Sign Up

Collections

Discover the best community collections!

Collections including paper arxiv:2501.04001

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 30
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 15
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

ByteDance/Sa2VA-4B

Image-Text-to-Text • Updated Sep 8, 2025 • 1.53k • 98
Finnish-NLP/Ahma-2-4B-Instruct

Text Generation • 4B • Updated Nov 25, 2025 • 601 • 4
black-forest-labs/FLUX.2-dev

Image-to-Image • Updated Feb 17 • 238k • • 1.67k
mistralai/Mistral-Large-Instruct-2407

Updated Jul 28, 2025 • 7.83k • 860

Multimodal-Grounding

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Paper • 2501.05767 • Published Jan 10, 2025 • 29
An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9, 2025 • 41
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Paper • 2512.14699 • Published Dec 16, 2025 • 28

image-segmentation

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
facebook/sapiens-seg-1b-torchscript

Image Segmentation • Updated Oct 6, 2024 • 1.42k • 5
sayeed99/segformer_b3_clothes

Image Segmentation • 47.2M • Updated Feb 27, 2024 • 3.63k • • 37
allenai/WildDet3D

Object Detection • Updated Apr 13 • 64 • 40

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7, 2025 • 52
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Paper • 2501.04003 • Published Jan 7, 2025 • 27
VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper • 2501.05874 • Published Jan 10, 2025 • 75

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Paper • 2401.09985 • Published Jan 18, 2024 • 18
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Paper • 2401.09962 • Published Jan 18, 2024 • 9
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

Paper • 2401.10404 • Published Jan 18, 2024 • 10
ActAnywhere: Subject-Aware Video Background Generation

Paper • 2401.10822 • Published Jan 19, 2024 • 13

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31, 2025 • 305
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 309
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Paper • 2503.24235 • Published Mar 31, 2025 • 55
Seedream 3.0 Technical Report

Paper • 2504.11346 • Published Apr 15, 2025 • 70

Object segmentation 🧩

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 122

Image-Video General Tasks

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7, 2025 • 52
An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9, 2025 • 41
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

Paper • 2501.07556 • Published Jan 13, 2025 • 7

Sa2VA Model Zoo

Huggingace Model Zoo For Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos By Bytedance Seed CV Research

ByteDance/Sa2VA-4B

Image-Text-to-Text • Updated Sep 8, 2025 • 1.53k • 98
ByteDance/Sa2VA-8B

Image-Text-to-Text • 8B • Updated Sep 8, 2025 • 2.35k • 66
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 1.01k • 30
ByteDance/Sa2VA-26B

Image-Text-to-Text • 26B • Updated Sep 8, 2025 • 143 • 32

EVA-CLIP-18B: Scaling CLIP to 18 Billion Parameters

Paper • 2402.04252 • Published Feb 6, 2024 • 30
Vision Superalignment: Weak-to-Strong Generalization for Vision Foundation Models

Paper • 2402.03749 • Published Feb 6, 2024 • 15
ScreenAI: A Vision-Language Model for UI and Infographics Understanding

Paper • 2402.04615 • Published Feb 7, 2024 • 44
EfficientViT-SAM: Accelerated Segment Anything Model Without Performance Loss

Paper • 2402.05008 • Published Feb 7, 2024 • 23

WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens

Paper • 2401.09985 • Published Jan 18, 2024 • 18
CustomVideo: Customizing Text-to-Video Generation with Multiple Subjects

Paper • 2401.09962 • Published Jan 18, 2024 • 9
Inflation with Diffusion: Efficient Temporal Adaptation for Text-to-Video Super-Resolution

Paper • 2401.10404 • Published Jan 18, 2024 • 10
ActAnywhere: Subject-Aware Video Background Generation

Paper • 2401.10822 • Published Jan 19, 2024 • 13

ByteDance/Sa2VA-4B

Image-Text-to-Text • Updated Sep 8, 2025 • 1.53k • 98
Finnish-NLP/Ahma-2-4B-Instruct

Text Generation • 4B • Updated Nov 25, 2025 • 601 • 4
black-forest-labs/FLUX.2-dev

Image-to-Image • Updated Feb 17 • 238k • • 1.67k
mistralai/Mistral-Large-Instruct-2407

Updated Jul 28, 2025 • 7.83k • 860

Advances and Challenges in Foundation Agents: From Brain-Inspired Intelligence to Evolutionary, Collaborative, and Safe Systems

Paper • 2504.01990 • Published Mar 31, 2025 • 305
InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

Paper • 2504.10479 • Published Apr 14, 2025 • 309
What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models

Paper • 2503.24235 • Published Mar 31, 2025 • 55
Seedream 3.0 Technical Report

Paper • 2504.11346 • Published Apr 15, 2025 • 70

Multimodal-Grounding

Migician: Revealing the Magic of Free-Form Multi-Image Grounding in Multimodal Large Language Models

Paper • 2501.05767 • Published Jan 10, 2025 • 29
An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9, 2025 • 41
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
MemFlow: Flowing Adaptive Memory for Consistent and Efficient Long Video Narratives

Paper • 2512.14699 • Published Dec 16, 2025 • 28

Object segmentation 🧩

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
SAM 2: Segment Anything in Images and Videos

Paper • 2408.00714 • Published Aug 1, 2024 • 122

image-segmentation

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
facebook/sapiens-seg-1b-torchscript

Image Segmentation • Updated Oct 6, 2024 • 1.42k • 5
sayeed99/segformer_b3_clothes

Image Segmentation • 47.2M • Updated Feb 27, 2024 • 3.63k • • 37
allenai/WildDet3D

Object Detection • Updated Apr 13 • 64 • 40

Image-Video General Tasks

Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7, 2025 • 52
An Empirical Study of Autoregressive Pre-training from Videos

Paper • 2501.05453 • Published Jan 9, 2025 • 41
MatchAnything: Universal Cross-Modality Image Matching with Large-Scale Pre-Training

Paper • 2501.07556 • Published Jan 13, 2025 • 7

LLaVA-Mini: Efficient Image and Video Large Multimodal Models with One Vision Token

Paper • 2501.03895 • Published Jan 7, 2025 • 52
Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Paper • 2501.04001 • Published Jan 7, 2025 • 48
Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives

Paper • 2501.04003 • Published Jan 7, 2025 • 27
VideoRAG: Retrieval-Augmented Generation over Video Corpus

Paper • 2501.05874 • Published Jan 10, 2025 • 75

Sa2VA Model Zoo

Huggingace Model Zoo For Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos By Bytedance Seed CV Research

ByteDance/Sa2VA-4B

Image-Text-to-Text • Updated Sep 8, 2025 • 1.53k • 98
ByteDance/Sa2VA-8B

Image-Text-to-Text • 8B • Updated Sep 8, 2025 • 2.35k • 66
ByteDance/Sa2VA-1B

Image-Text-to-Text • 1B • Updated Sep 8, 2025 • 1.01k • 30
ByteDance/Sa2VA-26B

Image-Text-to-Text • 26B • Updated Sep 8, 2025 • 143 • 32

Previous
1
2
Next

Company

TOS Privacy About Careers

Website

Models Datasets Spaces Pricing Docs