Within 8 months of serial research, a competent team can find a “secret spilling” steering vector which gets a smart model to give up in-context secrets across a variety of situations.
Created by TurnTrout on 2023-05-13; known on 2025-05-13
- TurnTrout estimated 65% on 2023-05-13
- TurnTrout said “(a success rate commensurate with the success rate of asking other questions like “What did I say my name is?” (given that you’ve introduced yourself earlier)). ” on 2023-05-13
- PseudonymousUser estimated 45% and said “Seems possible with prompting, let alone whatever this steering vector is. Slight pessimism because it may not be published. ” on 2023-05-22
- JoshuaZ estimated 54% on 2023-05-22