arXiv
Nikola Pavlovic, Sattar Vakili, Qing Zhao
Fri, May 22, 2026, 7:00 AM PDT
score 14.6

Learning optimal behavior from preference comparisons, not reward scores

Original: Learning Kernel-Based MDPs from Episodic Preferential Feedback

Source: arxiv.org
