Continue rapport · gwen.works/internshiplogs@e6a7c6c

+25

bib.yaml

@@ -582,3 +582,28 @@
   parent:
     type: periodical
 
+ppo:
+  type: article
+  title: Simple Policy Optimization
+  author: Xie, Zhengpeng
+  date: 2024-01
+  url:
+    value: http://arxiv.org/abs/2401.16025v2
+    date: '2025-10-16'
+  serial-number:
+    arxiv: 2401.16025v2
+  abstract: PPO (Proximal Policy Optimization) algorithm has demonstrated excellent
+    performance in many fields, and it is considered as a simple version of TRPO (Trust
+    Region Policy Optimization) algorithm. However, the ratio clipping operation in
+    PPO may not always effectively enforce the trust region constraints, this can
+    be a potential factor affecting the stability of the algorithm. In this paper,
+    we propose Simple Policy Optimization (SPO) algorithm, which introduces a novel
+    clipping method for KL divergence between the old and current policies. Extensive
+    experimental results in Atari 2600 environments indicate that, compared to the
+    mainstream variants of PPO, SPO achieves better sample efficiency, extremely low
+    KL divergence, and higher policy entropy, and is robust to the increase in network
+    depth or complexity. More importantly, SPO maintains the simplicity of an unconstrained
+    first-order algorithm. Code is available at https://github.com/MyRepositories-hub/Simple-Policy-Optimization.
+  parent:
+    type: periodical
+

+2 -1

cite-arxiv.fish

@@ -3,7 +3,8 @@
 set doi (
     echo "$argv[2]" \
     | string replace "https://arxiv.org/pdf/" "" \
-    | string replace "https://arxiv.org/abs/" ""
+    | string replace "https://arxiv.org/abs/" "" \
+    | string replace "https://arxiv.org/html/" ""
 )
 
 set bibtex (uvx arxiv2bib "$doi")
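Note on this change: the pipeline strips a known arXiv URL prefix from the second script argument so that only the identifier is passed to `arxiv2bib`. The Python sketch below mirrors that behaviour for illustration only. The names `ARXIV_PREFIXES` and `arxiv_id_from_url` are hypothetical and do not appear in the repository, and the sketch assumes, as the fish pipeline does, that each prefix is removed at most once.

```python
# Hypothetical helper mirroring the `string replace` chain in cite-arxiv.fish.
# The three prefixes are the ones the script handles after this commit.
ARXIV_PREFIXES = (
    "https://arxiv.org/pdf/",
    "https://arxiv.org/abs/",
    "https://arxiv.org/html/",
)


def arxiv_id_from_url(url: str) -> str:
    """Return the arXiv identifier, stripping a known URL prefix if present."""
    for prefix in ARXIV_PREFIXES:
        # Remove only the first occurrence, like fish's `string replace`
        # without the --all flag.
        url = url.replace(prefix, "", 1)
    return url


if __name__ == "__main__":
    print(arxiv_id_from_url("https://arxiv.org/html/2401.16025v2"))
```

A bare identifier passes through unchanged, which matches the script: none of the replacements match, so the argument reaches `arxiv2bib` as-is.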

+63 -47

rapport/context.typ

@@ -4,6 +4,7 @@
 
 #show figure: set block(spacing: 4em)
 #let diagram = (caption: none, ..args) => figure(caption: caption, fletcher.diagram(..args))
+#let dontbreak = content => block(breakable: false, content)
 
 #show math.equation.where(block: true): set block(spacing: 2em)
 
@@ -566,26 +567,38 @@
 
 ==== _Proximal Policy Optimization_
 
-La _PPO_ repose sur le même principe de stabilisation de l'entraînement par limitation de l'ampleur des changements de politique à chaque pas.
+La _PPO_ repose sur le même principe de stabilisation de l'entraînement par limitation de l'ampleur des changements de politique à chaque pas.
+
+Cependant, les méthodes _PPO_ préfèrent changer la quantité à optimiser, pour limiter intrinsèquement l'ampleur des modifications, en résolvant un problème d'optimisation sans contraintes @ppo
+
+
+$
+  argmax_(cal(P)') & exp_((s, a) in cal(S)) L(s, a, cal(P), cal(P'), R) \
+  "s.c." & top
+$
 
 #section[Avec pénalité _(PPO-Penalty)_]
 
+_PPO-Penalty_ soustrait une divergence K-L pondérée à l'avantage:
+
+$
+  L(s, a, cal(P), cal(P'), R) = (Q_cal(P) (s, a)) / (Q_cal(P') (s, a)) A_(cal(P), R) (s, a) - beta D_"KL"(cal(P) || cal(P'))
+$
+
 #section[Par _clipping_ _(PPO-Clip)_]
 
-_PPO-Clip_ lève la contraînte du problème d'optimisation.
+_PPO-Clip_ utilise une limitation du ratio de probabilités (en minimum et en maximum) @ppo-openai
 
-On préfère changer l'objectif la quantité à optimiser, pour limiter intrinsèquement l'ampleur des modifications, en résolvant le problème d'optimisation suivant @ppo-openai
 
 $
-  argmax_(cal(P)') & exp_((s, a) in cal(S)) overbracket(min(
-    (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a)) A_(cal(P)', R)(s, a), quad
-    op("clip")(
+  L(s, a, cal(P), cal(P'), R) = min(
+    &(Q_cal(P)' (s, a)) / (Q_cal(P) (s, a)) A_(cal(P)', R)(s, a), quad \
+    &op("clip")(
       (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a)),
       1 - epsilon,
       1 + epsilon
     ) A_(cal(P)', R)(s, a)
-  ), L(s, a, cal(P), cal(P'), R)) \
-  "s.c." & top
+  )
 $
 
 Avec $epsilon in RR_+^*$ est un paramètre indiquant à quel point l'on peut s'écarter de la politique précédente, et
@@ -602,53 +615,56 @@
 
 #let named_point = (x, y, shape: "@", color: black, side: right, content) => edge((x, y), shape + "-", (x+0.01, y), label-side: side, stroke: color, text(fill: color, content))
 
-/ Si l'avantage est positif: $a$ est un meilleur choix que $cal(P)(s)$.
+#let equation_and_diagram = (eqn, diagrm) => stack(dir: ltr,
+  block(width: 70%, math.equation(numbering: none, block: true, eqn)),
+  diagrm
+)
 
-#stack(dir: ltr,
+#dontbreak[
 
-  block(width: 70%, math.equation(numbering: none, block: true, $
-    L(s, a, cal(P), cal(P)', R) = min(
-      (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a)),
-      quad 1 + epsilon
-    ) A_(cal(P)', R)(s, a)
-  $)),
+/ Si l'avantage est positif: $a$ est un meilleur choix que $cal(P)(s)$.
 
-  diagram(
-    spacing: (2.7em, 2em),
-    node((-1, 0))[$cal(P)'$],
-    edge((-1, 0), "->", (3, 0), stroke: luma(150)),
-    edge((-1, 0), "-|", (1, 0), extrude: (1, -1, 0) ),
-    named_point(1, 0, shape: "|")[$1+epsilon$],
-    named_point(0, 0)[$cal(P)$],
-    named_point(1.5, 0, color: red, side: left)[$times$],
-    named_point(0.5, 0, color: olive, side: left)[$checkmark$],
-  ),
-
+#equation_and_diagram(
+  $
+    L(s, a, cal(P), cal(P)', R) = min(
+      (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a)),
+      quad 1 + epsilon
+    ) A_(cal(P)', R)(s, a)
+  $,
+  diagram(
+    spacing: (2.7em, 2em),
+    node((-1, 0))[$cal(P)'$],
+    edge((-1, 0), "->", (3, 0), stroke: luma(150)),
+    edge((-1, 0), "-|", (1, 0), extrude: (1, -1, 0) ),
+    named_point(1, 0, shape: "|")[$1+epsilon$],
+    named_point(0, 0)[$cal(P)$],
+    named_point(1.5, 0, color: red, side: left)[$times$],
+    named_point(0.5, 0, color: olive, side: left)[$checkmark$],
+  )
 )
 
 / Si l'avantage est négatif: choisir $a$ est pire que garder $cal(P)(s)$.
 
-#stack(dir: ltr,
-
-  block(width: 70%, math.equation(numbering: none, block:true, $
-    L(s, a, cal(P), cal(P)', R) = max(
-      1 - epsilon, quad
-      (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a))
-    ) A_(cal(P)', R)(s, a)
-  $)),
-
-  diagram(
-    spacing: (2.7em, 2em),
-    node((3, 0))[$cal(P)'$],
-    edge((-1, 0), "<-", (3, 0), stroke: luma(150)),
-    edge((1, 0), "|-", (3, 0), extrude: (1, -1, 0) ),
-    named_point(1, 0, shape: "|")[$1-epsilon$],
-    named_point(2, 0)[$cal(P)$],
-    named_point(0, 0, color: red, side: left)[$times$],
-    named_point(1.5, 0, color: olive, side: left)[$checkmark$],
-  ),
+#equation_and_diagram(
+  $
+    L(s, a, cal(P), cal(P)', R) = max(
+      1 - epsilon, quad
+      (Q_cal(P)' (s, a)) / (Q_cal(P) (s, a))
+    ) A_(cal(P)', R)(s, a)
+  $,
+  diagram(
+    spacing: (2.7em, 2em),
+    node((3, 0))[$cal(P)'$],
+    edge((-1, 0), "<-", (3, 0), stroke: luma(150)),
+    edge((1, 0), "|-", (3, 0), extrude: (1, -1, 0) ),
+    named_point(1, 0, shape: "|")[$1-epsilon$],
+    named_point(2, 0)[$cal(P)$],
+    named_point(0, 0, color: red, side: left)[$times$],
+    named_point(1.5, 0, color: olive, side: left)[$checkmark$],
+  ),
+)
 
-)
+]
 
 == Le H1v2 d'_Unitree_
 
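Note on the PPO-Clip hunk: the report first gives the clipped objective as a single `min(...)` expression, then splits it into two cases depending on the sign of the advantage (a `min` with `1 + epsilon` when the advantage is positive, a `max` with `1 - epsilon` when it is negative). The Python sketch below checks numerically that the two forms agree. It is an illustration under stated assumptions, not code from the repository: `ratio` stands for the probability ratio Q_P'(s, a) / Q_P(s, a), `advantage` for A_(P', R)(s, a), and all function names are hypothetical.

```python
import math


def clip(x: float, lo: float, hi: float) -> float:
    """Clamp x to the closed interval [lo, hi]."""
    return max(lo, min(x, hi))


def ppo_clip_objective(ratio: float, advantage: float, epsilon: float) -> float:
    """General form: min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A)."""
    clipped = clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    return min(ratio * advantage, clipped * advantage)


def ppo_clip_case_split(ratio: float, advantage: float, epsilon: float) -> float:
    """Case-split form, following the two cases listed in the report."""
    if advantage >= 0:
        # Positive advantage: the ratio is capped from above at 1 + eps.
        return min(ratio, 1.0 + epsilon) * advantage
    # Negative advantage: the ratio is floored from below at 1 - eps.
    return max(1.0 - epsilon, ratio) * advantage


if __name__ == "__main__":
    epsilon = 0.2  # illustrative value, not taken from the report
    for ratio in (0.5, 0.9, 1.0, 1.1, 1.5):
        for advantage in (-2.0, 0.0, 3.0):
            general = ppo_clip_objective(ratio, advantage, epsilon)
            split = ppo_clip_case_split(ratio, advantage, epsilon)
            assert math.isclose(general, split)
    print("both forms agree on the sampled grid")
```

The case split works because multiplying by a non-negative advantage preserves the ordering inside `min`, while multiplying by a negative one reverses it and turns the `min` into a `max` over the ratios.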
