Real-Time AI Intelligence

AIBreakingNow

Track every model launch, benchmark, and research leak as it happens.

Sources: 🤗 HuggingFace (500K+ models) · 📄 PapersWithCode (90K+ papers) · GitHub (updated hourly)
Last sync: 10:15 AM GMT+5:30
01

HippoCamp: Benchmarking Contextual Agents on Personal Computers

We present HippoCamp, a new benchmark designed to evaluate agents' capabilities on multimodal file management. Unlike existing agent benchmarks that focus on tasks like web interaction, tool use, or software automation in generic settings, HippoCamp evaluates agents in user-centric environments, requiring them to model individual user profiles and search massive personal file collections for context-aware reasoning. Our benchmark instantiates device-scale file systems over real-world profiles spanning diverse modalities, comprising 42.4 GB of data across over 2K real-world files. Building upon the raw files, we construct 581 QA pairs to assess agents' capabilities in search, evidence perception, and multi-step reasoning. To facilitate fine-grained analysis, we provide 46.1K densely annotated structured trajectories for step-wise failure diagnosis. We evaluate a wide range of state-of-the-art multimodal large language models (MLLMs) and agentic methods on HippoCamp. Our comprehensive experiments reveal a significant performance gap: even the most advanced commercial models achieve only 48.3% accuracy in user profiling, struggling particularly with long-horizon retrieval and cross-modal reasoning within dense personal file systems. Furthermore, our step-wise failure diagnosis identifies multimodal perception and evidence grounding as the primary bottlenecks. Ultimately, HippoCamp exposes the critical limitations of current agents in realistic, user-centric environments and provides a robust foundation for developing next-generation personal AI assistants.

Zhe Yang, Shulin Tian, Kairui Hu · 2026-04-01
0 citations
02

LAtent Phase Inference from Short time sequences using SHallow REcurrent Decoders (LAPIS-SHRED)

Reconstructing full spatio-temporal dynamics from sparse observations in both space and time remains a central challenge in complex systems, as measurements can be spatially incomplete and also limited to narrow temporal windows. Yet approximating the complete spatio-temporal trajectory is essential for mechanistic insight and understanding, model calibration, and operational decision-making. We introduce LAPIS-SHRED (LAtent Phase Inference from Short time sequence using SHallow REcurrent Decoders), a modular architecture that reconstructs and/or forecasts complete spatiotemporal dynamics from sparse sensor observations confined to short temporal windows. LAPIS-SHRED operates through a three-stage pipeline: (i) a SHRED model is pre-trained entirely on simulation data to map sensor time-histories into a structured latent space, (ii) a temporal sequence model, trained on simulation-derived latent trajectories, learns to propagate latent states forward or backward in time to span unobserved temporal regions from short observational time windows, and (iii) at deployment, only a short observation window of hyper-sparse sensor measurements from the true system is provided, from which the frozen SHRED model and the temporal model jointly reconstruct or forecast the complete spatiotemporal trajectory. The framework supports bidirectional inference, inherits data assimilation and multiscale reconstruction capabilities from its modular structure, and accommodates extreme observational constraints including single-frame terminal inputs. We evaluate LAPIS-SHRED on six experiments spanning complex spatio-temporal physics: turbulent flows, multiscale propulsion physics, volatile combustion transients, and satellite-derived environmental fields, highlighting a lightweight, modular architecture suited for operational settings where observation is constrained by physical or logistical limitations.

Yuxuan Bao, Xingyue Zhang, J. Nathan Kutz · 2026-04-01
0 citations
03

The Recipe Matters More Than the Kitchen: Mathematical Foundations of the AI Weather Prediction Pipeline

AI weather prediction has advanced rapidly, yet no unified mathematical framework explains what determines forecast skill. Existing theory addresses specific architectural choices rather than the learning pipeline as a whole, while operational evidence from 2023-2026 demonstrates that training methodology, loss function design, and data diversity matter at least as much as architecture selection. This paper makes two interleaved contributions. Theoretically, we construct a framework rooted in approximation theory on the sphere, dynamical systems theory, information theory, and statistical learning theory that treats the complete learning pipeline (architecture, loss function, training strategy, data distribution) rather than architecture alone. We establish a Learning Pipeline Error Decomposition showing that estimation error (loss- and data-dependent) dominates approximation error (architecture-dependent) at current scales. We develop a Loss Function Spectral Theory formalizing MSE-induced spectral blurring in spherical harmonic coordinates, and derive Out-of-Distribution Extrapolation Bounds proving that data-driven models systematically underestimate record-breaking extremes with bias growing linearly in record exceedance. Empirically, we validate these predictions via inference across ten architecturally diverse AI weather models using NVIDIA Earth2Studio with ERA5 initial conditions, evaluating six metrics across 30 initialization dates spanning all seasons. Results confirm universal spectral energy loss at high wavenumbers for MSE-trained models, rising Error Consensus Ratios showing that the majority of forecast error is shared across architectures, and linear negative bias during extreme events. A Holistic Model Assessment Score provides unified multi-dimensional evaluation, and a prescriptive framework enables mathematical evaluation of proposed pipelines before training.

Piyush Garg, Diana R. Gergel, Andrew E. Shao · 2026-04-01
0 citations
04

$\texttt{YC-Bench}$: Benchmarking AI Agents for Long-Term Planning and Consistent Execution

As LLM agents tackle increasingly complex tasks, a critical question is whether they can maintain strategic coherence over long horizons: planning under uncertainty, learning from delayed feedback, and adapting when early mistakes compound. We introduce $\texttt{YC-Bench}$, a benchmark that evaluates these capabilities by tasking an agent with running a simulated startup over a one-year horizon spanning hundreds of turns. The agent must manage employees, select task contracts, and maintain profitability in a partially observable environment where adversarial clients and growing payroll create compounding consequences for poor decisions. We evaluate 12 models, both proprietary and open source, across 3 seeds each. Only three models consistently surpass the starting capital of \$200K, with Claude Opus 4.6 achieving the highest average final funds at \$1.27M, followed by GLM-5 at \$1.21M at 11$\times$ lower inference cost. Scratchpad usage, the sole mechanism for persisting information across context truncation, is the strongest predictor of success, and adversarial client detection is the primary failure mode, accounting for $47\%$ of bankruptcies. Our analysis reveals that frontier models still fail through distinct failure modes such as over-parallelization, demonstrating persistent capability gaps in long-horizon performance. $\texttt{YC-Bench}$ is open-source, reproducible, and configurable.

Muyu He, Adit Jain, Anand Kumar · 2026-04-01
0 citations
05

CliffSearch: Structured Agentic Co-Evolution over Theory and Code for Scientific Algorithm Discovery

Scientific algorithm discovery is iterative: hypotheses are proposed, implemented, stress-tested, and revised. Current LLM-guided search systems accelerate proposal generation, but often under-represent scientific structure by optimizing code-only artifacts with weak correctness/originality gating. We present CliffSearch, an agentic evolutionary framework in which the core evolution operators (pair selection, crossover, mutation, and review) are implemented as LLM agents, and the loop is designed around three principles: (1) each node is a structured scientific artifact, instantiated in either theory+code or code_only mode, (2) reviewer judgments of correctness and originality are first-class selection gates alongside optimization of the benchmark metric of interest, and (3) mutation is split into exploration and correction pathways with distinct objectives. Exploration mutation imports ideas from adjacent scientific domains to increase novelty, while correction mutation performs targeted evidence-guided repair using reviewer signals over theory, code, benchmark results, and runtime errors. We illustrate the framework on three benchmark-grounded studies: transformer hyper-connection evolution, optimizer discovery on a fixed nanoGPT stack, and a smaller native-optimizer ablation. Across these settings, the same loop supports explicit metric direction, reproducible persistence, and reviewer-gated comparison of discoveries under controlled search conditions. The result is a discovery workflow that prioritizes scientific interpretability and correctness while optimizing task metrics under controlled novelty constraints, rather than maximizing candidate throughput alone. Full run artifacts, interactive visualizations, and exported best nodes for the reported studies are available at https://cliffsearch.ai .

Youssef Mroueh, Carlos Fonseca, Brian Belgodere · 2026-04-01
0 citations

Understanding AI Benchmarks and Model Releases

A complete guide to evaluating large language models, tracking research breakthroughs, and understanding what benchmark scores really mean.

Key Benchmarks Reference

MMLU: Massive Multitask Language Understanding, 57 academic subjects (SOTA: 92%)
HumanEval: Python code generation accuracy from docstrings (SOTA: 87%)
GSM8K: Grade school math word problem solving (SOTA: 95%)
HellaSwag: Common sense reasoning and natural language inference (SOTA: 89%)
ARC-Challenge: Science question answering, challenge set (SOTA: 85%)
TruthfulQA: Measures model truthfulness across 817 questions (SOTA: 78%)

What Are AI Benchmarks?

Benchmarks are standardized evaluation suites that measure specific capabilities of large language models (LLMs). Unlike human intuition-based evaluations, benchmarks provide reproducible, quantitative comparisons across models from different organizations and research groups.

The most important benchmarks span reasoning (MMLU, ARC), coding (HumanEval, MBPP), mathematics (GSM8K, MATH), instruction following (IFEval), and long-context comprehension (SCROLLS, ZeroSCROLLS). Each captures a different dimension of intelligence.

How to Evaluate a New Model Release

When a new model drops, look beyond the headline number. First, check which benchmarks were reported - models often selectively report their best results. Second, verify if the evaluation was done zero-shot or few-shot, since few-shot prompting can inflate scores by 10-20%. Third, check for contamination - if training data overlaps with benchmark test sets, scores are unreliable.
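The three checks above can be sketched as a small triage pass over a release's reported scores. The data structures, core-benchmark set, and helper names below are illustrative assumptions for this sketch, not any real evaluation API:

```python
from dataclasses import dataclass

# Hypothetical "core" set a release would be expected to report.
CORE_BENCHMARKS = {"MMLU", "HumanEval", "GSM8K", "HellaSwag", "ARC-Challenge", "TruthfulQA"}

@dataclass
class Result:
    benchmark: str
    score: float            # accuracy, in percent
    num_shots: int          # 0 = zero-shot evaluation
    contaminated: bool = False  # known train/test overlap

def triage(results: list[Result]) -> list[str]:
    """Return human-readable caveats for a set of reported scores."""
    flags = []
    reported = {r.benchmark for r in results}
    missing = CORE_BENCHMARKS - reported
    if missing:
        # Selective reporting: which standard benchmarks went unmentioned?
        flags.append(f"unreported core benchmarks: {sorted(missing)}")
    for r in results:
        if r.contaminated:
            flags.append(f"{r.benchmark}: test-set contamination, score unreliable")
        elif r.num_shots > 0:
            flags.append(f"{r.benchmark}: {r.num_shots}-shot (few-shot can inflate scores ~10-20%)")
    return flags
```

Running `triage` on a release card that reports only a 5-shot MMLU number would surface both the few-shot caveat and the list of unreported core benchmarks.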

Community-run leaderboards like the Open LLM Leaderboard on HuggingFace provide independent, standardized evaluation across consistent benchmarks for fair apples-to-apples comparison.

Reading GitHub Stars as a Signal

GitHub star counts are a noisy-but-useful proxy for community interest. For inference frameworks like vLLM, llama.cpp, and Ollama, star velocity (growth rate over time, not absolute count) is the key signal. A repo gaining 5K stars in 7 days signals a breakthrough; a 500K-star repo with flat growth is stable infrastructure.
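Star velocity over a trailing window is just a difference of snapshots. A minimal sketch, with daily counts invented for illustration:

```python
def star_velocity(daily_counts: list[int], window: int = 7) -> int:
    """Stars gained over the last `window` daily snapshots."""
    if len(daily_counts) < window + 1:
        raise ValueError("need at least window + 1 snapshots")
    return daily_counts[-1] - daily_counts[-1 - window]

# Eight daily totals: gaining ~5K stars in a week reads as a breakthrough
# under the heuristic above, regardless of the absolute count.
counts = [12_000, 12_400, 12_900, 13_600, 14_500, 15_400, 16_300, 17_100]
print(star_velocity(counts))  # 5100
```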

Forks signal active use and development even more than stars - developers fork to build, not just bookmark. For model repos specifically, the "likes" metric on HuggingFace Hub is more signal-dense than GitHub stars.

Understanding Model Size vs. Capability

Parameter count (7B, 13B, 70B, 405B) is a rough proxy for capability but not a direct one. Advances in training data curation, instruction tuning, and RLHF alignment mean that modern 7B models frequently outperform earlier 70B models on many tasks. Architectural innovations like Mixture of Experts (MoE) allow models to have a large total parameter count (e.g. 141B in Mixtral 8x22B) while only activating 39B parameters per token.

The real-world deployment constraint is typically memory bandwidth, not raw FLOPS. A 7B model in float16 requires ~14GB VRAM; quantized to 4-bit (GGUF), it drops to ~4GB, enabling consumer GPU inference.
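The memory arithmetic above is a back-of-the-envelope calculation: parameter count times bits per parameter, divided by 8 to get bytes. A sketch using the decimal-GB convention (the function name is illustrative):

```python
def weight_vram_gb(n_params: float, bits_per_param: float) -> float:
    """Rough VRAM for model weights alone (decimal GB).

    Ignores KV cache, activations, and runtime overhead, which
    add more memory in practice.
    """
    return n_params * bits_per_param / 8 / 1e9

print(weight_vram_gb(7e9, 16))  # 14.0 -> ~14 GB for a 7B model in float16
print(weight_vram_gb(7e9, 4))   # 3.5  -> raw 4-bit weights; GGUF files land
                                #         nearer ~4 GB once metadata and
                                #         mixed-precision layers are included
```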