# Video2Script

# Basic Concepts

# Literature Survey

VideoChat

[2305.06355] VideoChat: Chat-Centric Video Understanding

[Dual-stream: frame-by-frame + video]

MVBench (VideoChat2)

[2311.17005] MVBench: A Comprehensive Multi-modal Video Understanding Benchmark

Dolphin

GitHub - kaleido-lab/dolphin: General video interaction platform based on LLMs, including Video ChatGPT

[Open-source project]

# Related Work

# Other Work

GRiT

[2212.00280] GRiT: A Generative Region-to-text Transformer for Object Understanding

Dense Video Object Captioning from Disjoint Supervision

[2306.11729] Dense Video Object Captioning from Disjoint Supervision

A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video

[2312.01575] A Challenging Multimodal Video Summary: Simultaneously Extracting and Generating Keyframe-Caption Pairs from Video

Vid2Seq (CVPR-23)

[2302.14115] Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning

VidChapters-7M (NeurIPS-23)

[2309.13952] VidChapters-7M: Video Chapters at Scale
