TLS 1.3 has a HelloRetryRequest (HRR) path that is easy to leave unimplemented or get wrong. HRR costs a round trip, so there's an incentive to avoid it, which means that code will rarely be exercised.
The end result is that any groups we predict in initial ClientHellos will be impossible to remove. For this and other reasons, we are conservative and only predict X25519 (plus a GREASE key share so servers can handle multiple predictions). I doubt we'll want to take it out anytime soon, but it'd still be good to ensure this codepath works.
Unfortunately, stressing this codepath costs us a RTT, so we can't do it on every handshake like GREASE. Ideally we would:
a. Always offer key shares against servers that work without them. (Performance)
b. Never offer key shares against servers that only work with them. (Catch bad implementations in interop testing)
If we simply made 1% of connections require HRR, a retry (either user-initiated or automatic) would mask the bug. Further, a low probability means the bug won't be noticed early enough, while a high probability means performance suffers.
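To put rough numbers on the masking effect, here is a small sketch. It assumes (this is my simplification, not something measured) that each attempt independently requires HRR with probability p, and that a failed connection is retried a fixed number of times against a server whose HRR path is broken. The function name is made up for illustration.

```python
def visible_failure_rate(p: float, retries: int) -> float:
    """Chance that the original attempt AND every retry all require HRR.

    Assumes each attempt independently requires HRR with probability p,
    and that the server fails every handshake that needs HRR.
    """
    return p ** (retries + 1)


if __name__ == "__main__":
    for p in (0.01, 0.5):
        for retries in (0, 1, 2):
            rate = visible_failure_rate(p, retries)
            print(f"p={p} retries={retries} visible failure rate={rate:.6f}")
```

Under these assumptions, at p = 1% a single retry drops the visible failure rate to roughly 1 in 10,000, which is why a small random fraction is unlikely to get a broken server noticed.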
Here's one idea: add a stateful mechanism, combined with a probe. On a successful TLS 1.3 connection, probe for HRR support in the background (with appropriate priority, rate limiting, etc.) by making a connection that advertises no key shares.
If a keyshare-less ClientHello fails for any reason, place the server in an HRR blacklist. While a server is in the blacklist, we never reuse a socket to it (so socket reuse can't mask the failure) and always send it keyshare-less ClientHellos, until it successfully completes a handshake.
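A minimal sketch of the bookkeeping this implies, in Python for brevity. The class and method names are hypothetical, servers are keyed by a plain string, and priority and rate limiting of the probe are left out entirely.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class HrrState:
    """Tracks servers that failed a keyshare-less handshake (hypothetical)."""

    blacklist: Set[str] = field(default_factory=set)

    def offer_key_shares(self, server: str) -> bool:
        # Blacklisted servers always get a keyshare-less ClientHello,
        # so a broken HRR path keeps failing visibly.
        return server not in self.blacklist

    def allow_socket_reuse(self, server: str) -> bool:
        # Reusing a socket would skip the handshake and mask the failure.
        return server not in self.blacklist

    def should_probe(self, server: str, negotiated_tls13: bool) -> bool:
        # Probe in the background only after a successful TLS 1.3
        # connection; blacklisted servers are already tested on every
        # handshake, so no separate probe is needed for them.
        return negotiated_tls13 and server not in self.blacklist

    def on_keyshareless_result(self, server: str, succeeded: bool) -> None:
        # Any failure blacklists the server; a success clears the entry.
        if succeeded:
            self.blacklist.discard(server)
        else:
            self.blacklist.add(server)
```

For example, after `on_keyshareless_result("example.test", False)`, both `offer_key_shares("example.test")` and `allow_socket_reuse("example.test")` return False until a later keyshare-less handshake to that server succeeds.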
This gives us (a), except that spurious network errors will temporarily cost performance, and (b), except that the probe is asynchronous, so the first few HTTP requests will still go through. We should add metrics to monitor both.