[At decision squares, the 5×5 rand-region cheese maze network will put max cumulative probability on the maximal-advantage action at least] 95% of the time

Created by TurnTrout on 2023-02-09; known on 2023-02-16

TurnTrout estimated 12% on 2023-02-09
peligrietzer estimated 5% on 2023-02-09
uli estimated 20% and said “I don’t know the details of the training process, but given the policy is (roughly) trained by supervised learning to argmax over the advantage, it doesn’t seem that unlikely” on 2023-02-12
TurnTrout changed the deadline from “on 2023-02-16” on 2023-03-01
rhaps0dy estimated 15% on 2023-03-02
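
A rough sketch of how the 95% figure in this prediction could be measured, for anyone wanting to check it. Everything here is an assumption rather than the setup actually used: it assumes "cumulative probability" means summing the policy's probabilities over raw action indices that produce the same move, that per-move advantages at each decision square are already available, and that a tie goes to the first group. The function name `top_action_agreement` and its arguments are hypothetical.

```python
import numpy as np

def top_action_agreement(action_probs, advantages, action_groups):
    """Fraction of states where the move with the highest cumulative
    policy probability is also the move with the highest advantage.

    action_probs:  (n_states, n_raw_actions) policy probabilities (assumed input)
    advantages:    (n_states, n_groups) advantage of each grouped move (assumed input)
    action_groups: list of lists of raw action indices, one list per move (assumed grouping)
    """
    action_probs = np.asarray(action_probs, dtype=float)
    advantages = np.asarray(advantages, dtype=float)
    # Sum raw-action probabilities within each group -> (n_states, n_groups)
    grouped = np.stack(
        [action_probs[:, idx].sum(axis=1) for idx in action_groups], axis=1
    )
    policy_choice = grouped.argmax(axis=1)   # move with max cumulative probability
    best_move = advantages.argmax(axis=1)    # maximal-advantage move
    return float((policy_choice == best_move).mean())

# Toy data: two decision squares, four raw actions, two moves.
probs = [[0.3, 0.3, 0.2, 0.2],
         [0.4, 0.0, 0.3, 0.3]]
advs = [[1.0, 0.0],
        [1.0, 0.0]]
groups = [[0, 1], [2, 3]]
rate = top_action_agreement(probs, advs, groups)
```

Under these assumptions the prediction would resolve true if `rate` were at least 0.95 over the decision squares sampled. Note that the second toy state shows why grouping matters: the single most likely raw action belongs to the first move, but the second move has more cumulative probability.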