Continue rapport · gwen.works/internshiplogs@dcfbc2e

+34

bib.yaml

@@ -547,3 +547,37 @@
     date: '2025-10-16'
     value: https://spinningup.openai.com/en/latest/algorithms/ppo.html
 
+rl-reproducibility:
+  type: article
+  title: Deep Reinforcement Learning that Matters
+  author:
+  - Henderson, Peter
+  - Islam, Riashat
+  - Bachman, Philip
+  - Pineau, Joelle
+  - Precup, Doina
+  - Meger, David
+  date: 2017-09
+  url:
+    value: http://arxiv.org/abs/1709.06560v3
+    date: '2025-10-16'
+  serial-number:
+    arxiv: 1709.06560v3
+  abstract: In recent years, significant progress has been made in solving challenging
+    problems across various domains using deep reinforcement learning (RL). Reproducing
+    existing work and accurately judging the improvements offered by novel methods
+    is vital to sustaining this progress. Unfortunately, reproducing results for state-of-the-art
+    deep RL methods is seldom straightforward. In particular, non-determinism in standard
+    benchmark environments, combined with variance intrinsic to the methods, can make
+    reported results tough to interpret. Without significance metrics and tighter
+    standardization of experimental reporting, it is difficult to determine whether
+    improvements over the prior state-of-the-art are meaningful. In this paper, we
+    investigate challenges posed by reproducibility, proper experimental techniques,
+    and reporting procedures. We illustrate the variability in reported metrics and
+    results when comparing against common baselines and suggest guidelines to make
+    future results in deep RL more reproducible. We aim to spur discussion about how
+    to ensure continued progress in the field by minimizing wasted effort stemming
+    from results that are non-reproducible and easily misinterpreted.
+  parent:
+    type: periodical
+

+40 -5

rapport/context.typ

@@ -572,20 +572,55 @@
 
 #section[By _clipping_ _(PPO-Clip)_]
 
-_PPO-Clip_ avoids computing a K-L#footnote[Kullback-Leibler] distance and removes the constraint from the optimization problem.
+_PPO-Clip_ removes the constraint from the optimization problem.
 
-We instead change the policy update, so as to limit directly in its expression the magnitude of the change to $Q_cal(P) (s_(t+1), a_(t+1)^*)$ (cf. @policy-update-loop)
+We instead change the objective, i.e. the quantity to optimize, so as to intrinsically limit the magnitude of the changes, by solving the following optimization problem @ppo-openai
+
+$
+  argmax_(cal(P)') & exp_((s, a) in cal(S)) overbracket(min(
+    (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a)) A_(cal(P)', R)(s, a), quad
+    op("clip")(
+      (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a)),
+      1 - epsilon,
+      1 + epsilon
+    ) A_(cal(P)', R)(s, a)
+  ), L(s, a, cal(P), cal(P)', R)) \
+  "s.t." & top
+$
 
-We use this update @ppo-openai
+where $epsilon in RR_+^*$ is a parameter indicating how far we may stray from the previous policy, and
 
 $
-Q_cal(P) (s_(t+1), a_(t+1)) <- min(
+  op("clip") := (x, m, M) |-> cases(
+    m "if" x < m,
+    M "if" x > M,
+    x "otherwise"
+  )
+$
 
+The complexity of the expression, and the presence of a $min$ rather than simply an $op("clip")$, is due to the fact that the advantage $A_(cal(P)', R) (s, a)$ can be negative:
+
+/ If the advantage is positive:
+  #diagram(
+    edge((-5, 0), "->", (5, 0)),
+    edge((-5, 0.25), "-", (-5, -0.25), label-side: left)[$0$]
 )
 $
-
+  L(s, a, cal(P), cal(P)', R) = min(
+    (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a)),
+    quad 1 + epsilon
+  ) A_(cal(P)', R)(s, a)
+$
+/ If the advantage is negative: $
+  L(s, a, cal(P), cal(P)', R) = max(
+    1 - epsilon, quad
+    (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a))
+  ) A_(cal(P)', R)(s, a)
+$
 
 == The _Unitree_ H1v2
 
 == Software reproducibility
 
+Reproducibility is particularly difficult in the field of reinforcement learning @rl-reproducibility
+
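The case split on the sign of the advantage added in rapport/context.typ can be checked numerically. The sketch below is only an illustration and not part of the commit: the function names `clip`, `ppo_clip_term` and `ppo_clip_term_by_sign` are hypothetical, `ratio` stands for the quotient Q_P'(s, a) / Q_P(s, a), and plain Python floats are assumed instead of any RL library.

```python
def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to [lo, hi], mirroring the op("clip") definition in the diff."""
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x


def ppo_clip_term(ratio: float, advantage: float, epsilon: float) -> float:
    """Per-sample objective L: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = clip(ratio, 1 - epsilon, 1 + epsilon)
    return min(ratio * advantage, clipped * advantage)


def ppo_clip_term_by_sign(ratio: float, advantage: float, epsilon: float) -> float:
    """Same quantity, written with the case split on the sign of the advantage."""
    if advantage >= 0:
        # Positive advantage: the ratio is only capped from above.
        return min(ratio, 1 + epsilon) * advantage
    # Negative advantage: the ratio is only capped from below.
    return max(1 - epsilon, ratio) * advantage


if __name__ == "__main__":
    epsilon = 0.2  # hypothetical value, chosen only for the illustration
    for ratio in (0.5, 0.9, 1.0, 1.1, 1.5):
        for advantage in (-2.0, 0.0, 2.0):
            general = ppo_clip_term(ratio, advantage, epsilon)
            by_sign = ppo_clip_term_by_sign(ratio, advantage, epsilon)
            print(f"ratio={ratio} A={advantage} L={general} by_sign={by_sign}")
```

For every pair printed by the loop, both forms should agree, which is the reason the single `min` over the clipped and unclipped terms is enough to cover both signs of the advantage.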

+1 -1

rapport/main.typ

@@ -94,7 +94,7 @@
   authors: (
     (
       name: "Gwenn Le Bihan",
-      email: "gwenn.lebihan@etu.inp-n7.fr",
+      email: "gwenn.lebihan7@gmail.com",
       affiliation: "ENSEEIHT",
     ),
   ),
