'I Can See Forever!': Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments

Bibliographic Details
Title: 'I Can See Forever!': Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
Authors: Zhang, Ziyi, Sun, Zhen, Zhang, Zongmin, Peng, Zifan, Zhao, Yuemeng, Wang, Zichun, Luo, Zeren, Zuo, Ruiting, He, Xinlei
Publication Year: 2025
Collection: Computer Science
Subject Terms: Computer Science - Computer Vision and Pattern Recognition, Computer Science - Artificial Intelligence, Computer Science - Human-Computer Interaction, Computer Science - Multimedia
Description: The visually impaired population, especially the severely visually impaired, is currently large in scale, and daily activities pose significant challenges for them. Although many studies use large language and vision-language models to assist the blind, most focus on static content and fail to meet real-time perception needs in dynamic and complex environments, such as daily activities. To provide them with more effective intelligent assistance, it is imperative to incorporate advanced visual understanding technologies. Although real-time vision and speech interaction VideoLLMs demonstrate strong real-time visual understanding, no prior work has systematically evaluated their effectiveness in assisting visually impaired individuals. In this work, we conduct the first such evaluation. First, we construct a benchmark dataset (VisAssistDaily), covering three categories of assistive tasks for visually impaired individuals: Basic Skills, Home Life Tasks, and Social Life Tasks. The results show that GPT-4o achieves the highest task success rate. Next, we conduct a user study to evaluate the models in both closed-world and open-world scenarios, further exploring the practical challenges of applying VideoLLMs in assistive contexts. One key issue we identify is the difficulty current models face in perceiving potential hazards in dynamic environments. To address this, we build an environment-awareness dataset named SafeVid and introduce a polling mechanism that enables the model to proactively detect environmental risks. We hope this work provides valuable insights and inspiration for future research in this field.
Comment: 12 pages, 6 figures
Document Type: Working Paper
Access URL: http://arxiv.org/abs/2505.04488
Accession Number: edsarx.2505.04488
Database: arXiv
FullText Text:
  Availability: 0
CustomLinks:
  – Url: http://arxiv.org/abs/2505.04488
    Name: EDS - Arxiv
    Category: fullText
    Text: Full Text (arXiv)
    MouseOverText: Full Text (arXiv)
Header DbId: edsarx
DbLabel: arXiv
An: edsarx.2505.04488
RelevancyScore: 1147
AccessLevel: 3
PubType: Report
PubTypeId: report
PreciseRelevancyScore: 1146.60375976563
IllustrationInfo
PLink https://holycross.idm.oclc.org/login?auth=cas&url=https://search.ebscohost.com/login.aspx?direct=true&site=eds-live&db=edsarx&AN=edsarx.2505.04488
RecordInfo BibRecord:
  BibEntity:
    Subjects:
      – SubjectFull: Computer Science - Computer Vision and Pattern Recognition
        Type: general
      – SubjectFull: Computer Science - Artificial Intelligence
        Type: general
      – SubjectFull: Computer Science - Human-Computer Interaction
        Type: general
      – SubjectFull: Computer Science - Multimedia
        Type: general
    Titles:
      – TitleFull: 'I Can See Forever!': Evaluating Real-time VideoLLMs for Assisting Individuals with Visual Impairments
        Type: main
  BibRelationships:
    HasContributorRelationships:
      – PersonEntity:
          Name:
            NameFull: Zhang, Ziyi
      – PersonEntity:
          Name:
            NameFull: Sun, Zhen
      – PersonEntity:
          Name:
            NameFull: Zhang, Zongmin
      – PersonEntity:
          Name:
            NameFull: Peng, Zifan
      – PersonEntity:
          Name:
            NameFull: Zhao, Yuemeng
      – PersonEntity:
          Name:
            NameFull: Wang, Zichun
      – PersonEntity:
          Name:
            NameFull: Luo, Zeren
      – PersonEntity:
          Name:
            NameFull: Zuo, Ruiting
      – PersonEntity:
          Name:
            NameFull: He, Xinlei
    IsPartOfRelationships:
      – BibEntity:
          Dates:
            – D: 07
              M: 05
              Type: published
              Y: 2025
ResultId 1