โ† back

Bookmarks

17 saved articles.

1
x.com · Chamath Palihapitiya · 08:57 · score 17.9

Bill Gates' 2007 definition of platform economics still defines today's AI hype

Chamath Palihapitiya recalls a 2007 meeting where Bill Gates challenged Facebook's platform definition, arguing a real platform lets third-party creators earn more collectively than the platform owner. Palihapitiya draws a parallel to current AI tokenization trends, suggesting today's AI industry mirrors that same flawed thinking: building extractive systems that claim to be platforms while concentrating value at the center rather than sharing it with participants.

1,494 ❤ · 106 RT · 101 replies
2
x.com · DAIR.AI · 10:30 · score 16.9

Weak AI model with verification beats advanced models on code tasks

Researchers found that a smaller, cheaper AI model paired with an execution-based verification system can match the performance of frontier models like Claude and Gemini on software engineering tasks. The key insight is that weak models already generate the correct solution in their top eight candidates most of the time, so running and testing those candidates automatically is more effective than asking the model to pick the best one. This suggests that model capability matters less than having a good selection mechanism, potentially saving significant compute costs.
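The selection step described above can be sketched roughly as follows. This is a minimal illustration, not the researchers' implementation: the summary does not say how candidates are executed or scored, so the function names `passes_tests` and `select_by_execution`, the use of a plain subprocess, and the "first candidate that passes" rule are all assumptions. A real system would also sandbox untrusted code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_src: str, test_src: str, timeout_s: float = 10.0) -> bool:
    """Run candidate code followed by its tests in a fresh interpreter.

    Hypothetical helper: returns True only if the script exits with code 0.
    """
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "run_candidate.py"
        script.write_text(candidate_src + "\n\n" + test_src)
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A candidate that hangs counts as a failure.
            return False
        return result.returncode == 0


def select_by_execution(candidates, test_src):
    """Return the first candidate that passes the tests, or None.

    Assumption: candidates are the model's top-k samples (eight in the
    summary above), and passing the tests is the only selection signal.
    """
    for src in candidates:
        if passes_tests(src, test_src):
            return src
    return None


if __name__ == "__main__":
    # Two toy candidates standing in for model samples; only the second is correct.
    samples = [
        "def add(a, b):\n    return a - b\n",
        "def add(a, b):\n    return a + b\n",
    ]
    tests = "assert add(2, 3) == 5\n"
    print(select_by_execution(samples, tests))
```

The design point is that the model is never asked to rank its own outputs; running each candidate against tests does the ranking.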

88 ❤ · 15 RT · 10 replies
4
x.com · Shubhankar · 09:33 · score 16.1

Browse.sh releases open library for AI agents navigating websites

Browse.sh is an open-source collection of recorded interactions and instructions that teach AI agents how to accomplish tasks on real websites. Instead of agents learning to click and fill forms from scratch, they get a playbook built from research across hundreds of actual sites. This matters because web automation is messy: every site has different layouts, paywalls, and interaction patterns. A shared library of working examples lets developers build reliable agents faster without reinventing the wheel for each new task.

17 ❤ · 1 RT · 1 reply
7
x.com · Cameron R. Wolfe, Ph.D. · 08:41 · score 16.4

Comprehensive guide to agent evaluation frameworks and benchmarks

Cameron Wolfe published a detailed guide covering agent evaluation methodology, progressing from foundational agent concepts through multi-agent systems to practical evaluation patterns and frameworks used in the field. The guide includes case studies of established agent benchmarks, providing practitioners with concrete patterns and approaches for assessing agent performance and behavior in production settings.

16 ❤ · 3 RT · 2 replies
9
x.com · Dr Milan Milanović · 01:30 · score 16.6

ProgramBench: frontier models solve 0% of full-stack software rebuild tasks

Meta, Stanford, and Harvard released ProgramBench, which evaluates nine frontier models on 200 software reconstruction tasks ranging from CLI tools to FFmpeg and SQLite. Across 1,800 runs, no model completed a single task end-to-end; the closest was Claude Opus 4.7, which matched 95% of unit tests on only 3% of tasks. Key failure modes include monolithic code generation (60% of solutions span 1–3 files, versus modular human baselines), language abandonment (models pick Python 36–79% of the time regardless of the original language), and poor compositional reasoning. C/C++ projects proved hardest (27.7% test pass rate vs. 38.5% for Rust/Go), with models succeeding only on small utilities while larger projects such as FFmpeg and php-src remained completely unsolved.

31 ❤ · 11 RT · 5 replies