Unlike most people, I'm actually fine with being proven wrong, because it means I've learned something. And frankly, it would be really convenient to have BRR's huge database at my disposal.
The graph (the first one in the OP) looks nice, but were multiple runs done, and was any statistical analysis performed on them? Which of the differences shown are significant, and which aren't? I read the Instagram post but didn't find that information.
The data is promising, though. This is the first time I've seen anyone even attempt to show that roller testing is applicable to the real world.