Exploring Image-to-Text Visual Search Using Open Source Models
Published
Author
Type
Master's Thesis
Model builder
Journal Title
ISSN
Volume Title
Publisher
Abstract
Visual search refers to performing a search using visual data, typically images,
rather than textual input. Most visual search implementations rely on similarity
search over image features, in which a user-submitted query image is compared
against the features of all searchable entries before sufficiently similar results
are returned. This thesis explores a different method that uses image descriptions
generated by vision-language models instead of image features; the descriptions are
converted into embeddings in order to match against other search entries. Evaluation
data indicate that the method can provide satisfactory retrieval performance while
maintaining a low search query execution time, provided that an adequate
vision-language model is employed and sufficient server capacity is available.
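
As a concrete illustration of the pipeline the abstract describes, the sketch below captions an image with an off-the-shelf vision-language model, embeds the caption, and ranks catalog entries by cosine similarity. The specific models (BLIP and all-MiniLM-L6-v2) and the file names are illustrative assumptions, not the configuration evaluated in the thesis.

```python
# Minimal sketch of description-based visual search: image -> caption ->
# embedding -> similarity ranking. Model choices here are assumptions.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from sentence_transformers import SentenceTransformer, util

# Vision-language model that generates a textual description of an image.
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
captioner = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)

# Text embedding model that maps descriptions into a shared vector space.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

def describe(image_path: str) -> str:
    """Generate an image description with the vision-language model."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    output_ids = captioner.generate(**inputs, max_new_tokens=40)
    return processor.decode(output_ids[0], skip_special_tokens=True)

# Index phase: describe and embed all searchable entries once, offline.
catalog_paths = ["item1.jpg", "item2.jpg"]  # hypothetical catalog images
catalog_embeddings = embedder.encode([describe(p) for p in catalog_paths])

# Query phase: only the query image is captioned and embedded at search
# time, then ranked against the precomputed catalog embeddings.
query_embedding = embedder.encode(describe("query.jpg"))
scores = util.cos_sim(query_embedding, catalog_embeddings)[0]
best = scores.argmax().item()
print(f"Best match: {catalog_paths[best]} (score {scores[best].item():.3f})")
```

Because the expensive captioning and embedding of catalog entries happens at index time, the per-query cost reduces to one caption, one text embedding, and a vector similarity lookup, which is consistent with the low query execution times reported in the abstract.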
Description
Subject/Keywords
visual search, machine learning, deep learning, embedding, vision-language model, transformer, e-commerce
