Crucially, these tests are generated by custom code and don’t rely on pre-existing images or tests that could be found on the public Internet, thereby “minimiz[ing] the chance that VLMs can solve by ...