tuandunghcmut's Collections

Agentic Benchmarks

updated Mar 26

  • OpenResearcher/OpenResearcher-Dataset

    Viewer • Updated Mar 25 • 97.6k • 8.3k • 124

    Note: For Deep Research Agent


  • gaia-benchmark/GAIA

    Viewer • Updated Oct 28, 2025 • 932 • 49.6k • 667

  • vaskarnath/toolcomp

    Viewer • Updated Aug 21, 2025 • 493 • 44 • 1

    Note: ToolComp from Scale AI


  • vaskarnath/toolcomp_process_supervision_eval

    Viewer • Updated Aug 21, 2025 • 1.72k • 30 • 2

  • ToolComp: A Multi-Tool Reasoning & Process Supervision Benchmark

    Paper • 2501.01290 • Published Jan 2, 2025 • 1