Author: 鐘蔚弘 (HIT-SCIR)

1. Introduction

With the development of pre-trained models, researchers have begun to apply pre-training architectures and methods to multimodal tasks, and in image-text tasks pre-trained models have already achieved excellent results. Compared with images, video content is richer but also more redundant: neighboring frames can be highly similar. Unlike images, videos naturally carry temporal information, and the longer a video is, the richer that temporal information becomes. At the same time, video data is far larger in volume than image data, so building datasets and models poses greater challenges for researchers. How to connect video and text representations elegantly and with high quality, enable good cross-modal interaction, and bring gains to downstream tasks has therefore become a central research question.

This article briefly reviews current video-text pre-training model architectures and the datasets they rely on, and, in view of the redundancy of video information, also gives a short introduction to work that incorporates fine-grained information.

2. Common Pre-training Datasets

Multimodal pre-training data usually comes from large-scale collections of aligned sample pairs across modalities. Because of the temporal dimension, videos contain richer and more redundant information than images, which makes it considerably harder to collect large-scale aligned video-text pairs for video pre-training. At present, the public pre-training datasets used by most researchers are mainly HowTo100M [1] and WebVid [2]; in addition, since video and image features are similar, many works also train on image-text pre-training datasets. This section briefly introduces the datasets commonly used in video-text pre-training.

2.1 HowTo100M

Learning cross-modal video-text representations usually requires video clips with manually annotated descriptions, and annotating such a dataset at large scale is very expensive. Miech et al. [1] released the HowTo100M dataset, which lets models learn cross-modal representations from videos accompanied by automatically transcribed narrations. HowTo100M contains 136M clips cut from 1.22M narrated instructional web videos; the instructional content is mostly demonstrated by humans and covers more than 23,000 distinct visual tasks. A minimal sketch of how such clip-narration pairs can be formed is given below.

Table 3: Statistics of the Conceptual Captions dataset [3]
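The supervision in HowTo100M comes from pairing each automatically transcribed narration segment with the video clip spanning the same time interval. The short Python sketch below illustrates this pairing under simple assumptions; the data classes, field names, and the example video id are hypothetical and do not correspond to the dataset's actual release format.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NarrationSegment:
    """One automatically transcribed (ASR) narration segment."""
    start: float  # start time in seconds
    end: float    # end time in seconds
    text: str     # transcribed narration text


@dataclass
class ClipTextPair:
    """A weakly aligned (video clip, narration) training pair."""
    video_id: str
    start: float
    end: float
    text: str


def clips_from_narrations(video_id: str,
                          segments: List[NarrationSegment]) -> List[ClipTextPair]:
    """Pair every non-empty narration segment with the video interval it spans."""
    pairs = []
    for seg in segments:
        if not seg.text.strip():
            continue  # drop segments where ASR produced no usable text
        pairs.append(ClipTextPair(video_id, seg.start, seg.end, seg.text))
    return pairs


# Example: a single narrated how-to video yields two clip-text pairs.
segments = [
    NarrationSegment(12.0, 17.5, "spread the glue evenly on the board"),
    NarrationSegment(17.5, 24.0, "now clamp the two pieces together"),
]
print(clips_from_narrations("hypothetical_video_001", segments))
```

Because ASR narrations describe what is said rather than exactly what is shown, pairs built this way are only weakly aligned and inherently noisy; this trade-off is what makes collecting supervision at this scale feasible without manual annotation.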
[1] Miech A, Zhukov D, Alayrac J B, et al. HowTo100M: Learning a text-video embedding by watching hundred million narrated video clips[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 2630-2640.
[2] Bain M, Nagrani A, Varol G, et al. Frozen in time: A joint video and image encoder for end-to-end retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 1728-1738.
[3] Sharma P, Ding N, Goodman S, et al. Conceptual Captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2018: 2556-2565.
[4] Sun C, Myers A, Vondrick C, et al. VideoBERT: A joint model for video and language representation learning[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019: 7464-7473.
[5] Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[J]. arXiv preprint arXiv:1810.04805, 2018.
[6] Lei J, Li L, Zhou L, et al. Less is more: ClipBERT for video-and-language learning via sparse sampling[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021: 7331-7341.
[7] Xu H, Ghosh G, Huang P Y, et al. VLM: Task-agnostic video-language model pre-training for video understanding[J]. arXiv preprint arXiv:2105.09996, 2021.
[8] Sun C, Baradel F, Murphy K, et al. Learning video representations using contrastive bidirectional transformer[J]. arXiv preprint arXiv:1906.05743, 2019.
[9] Luo H, Ji L, Shi B, et al. UniVL: A unified video and language pre-training model for multimodal understanding and generation[J]. arXiv preprint arXiv:2002.06353, 2020.
[10] Bertasius G, Wang H, Torresani L. Is space-time attention all you need for video understanding?[C]//ICML. 2021, 2(3): 4.
[11] Li L, Chen Y C, Cheng Y, et al. HERO: Hierarchical encoder for video+language omni-representation pre-training[J]. arXiv preprint arXiv:2005.00200, 2020.
[12] Zellers R, Lu X, Hessel J, et al. MERLOT: Multimodal neural script knowledge models[J]. Advances in Neural Information Processing Systems, 2021, 34: 23634-23651.
[13] Tang Z, Lei J, Bansal M. DeCEMBERT: Learning from noisy instructional videos via dense captions and entropy minimization[C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021: 2415-2426.
[14] Fu T J, Li L, Gan Z, et al. VIOLET: End-to-end video-language transformers with masked visual-token modeling[J]. arXiv preprint arXiv:2111.12681, 2021.
[15] Zhu L, Yang Y. ActBERT: Learning global-local video-text representations[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020: 8746-8755.
[16] Wang J, Ge Y, Cai G, et al. Object-aware video-language pre-training for retrieval[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 3313-3322.
[17] Li D, Li J, Li H, et al. Align and Prompt: Video-and-language pre-training with entity prompts[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 4953-4963.
[18] Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//International Conference on Machine Learning. PMLR, 2021: 8748-8763.
[19] Ge Y, Ge Y, Liu X, et al. Bridging video-text retrieval with multiple choice questions[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 16167-16176.
[20] Liu S, Fan H, Qian S, et al. HiT: Hierarchical transformer with momentum contrast for video-text retrieval[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2021: 11915-11925.
[21] Min S, Kong W, Tu R C, et al. HunYuan_tvr for Text-Video Retrivial[J]. arXiv preprint arXiv:2204.03382, 2022.
[22] Van Den Oord A, Vinyals O. Neural discrete representation learning[J]. Advances in Neural Information Processing Systems, 2017, 30.

Editor for this issue: 丁效
|