Testing AI agents on real-world application and terminal tasks

First in a two-part series where I throw RLMs at benchmarks and see how far they can go.
