Abstract: Video-Text Retrieval (VTR) is a crucial multi-modal task in an era of massive video-text data on the Internet. A plethora of work characterized by using a two-stream Vision-Language model ...
Abstract: Videos contain multimodal content, and exploring multi-branch cross-modal interactions with natural language queries can be of benefit to the text-video retrieval task (TVR). However, recent ...