Coarse-to-Fine Vision-Language Pre-training with Fusion in the Backbone
Zi-Yi Dou, Aishwarya Kamath, Zhe Gan, Pengchuan Zhang, Jianfeng Wang, Linjie Li, Zicheng Liu, Ce Liu, Yann LeCun, Nanyun Peng, Jianfeng Gao, Lijuan Wang