Abstract: The integration of visual and textual data in Vision-Language Pre-training (VLP) models is crucial for enhancing vision-language understanding. However, the adversarial robustness of these ...