Multimodal Semantic Indexing for Image Retrieval at IIIT Hyderabad

Slide Note

This research delves into multimodal semantic indexing methods for image retrieval, focusing on extending Latent Semantic Indexing (LSI) and probabilistic LSI to a multi-modal setting. Contributions include the refinement of graph models and partitioning algorithms to enhance image retrieval from tripartite graph models. Background information on latent semantic indexing and probabilistic latent semantic indexing is provided, along with mentions of related literature and the application of higher-order Singular Value Decomposition (SVD) for capturing latent semantics. The study emphasizes the importance of exploiting multi-modal data representations using tensors for effective image retrieval.

kailan Follow

Uploaded on Sep 16, 2024 | 1 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Multimodal Semantic Indexing for Image Retrieval P . L . Chandrika Advisors: Dr.C. V. Jawahar Centre for Visual Information Technology, IIIT- Hyderabad IIIT Hyderabad

Problem Setting Love Rose Flower Petals Red Gift Bud Green Semantics Not Captured Words IIIT Hyderabad *J Sivic & Zisserman,2003; Nister & Henrik,2006; Philbin,Sivic,Zisserman et la,2008;

Contribution Latent Semantic Indexing(LSI) is extended to Multi-modal LSI. pLSA (probabilistic Latent Semantic Analysis) is extended to Multi-modal pLSA. Extending Bipartite Graph Model to Tripartite Graph Model. A graph partitioning algorithm is refined for retrieving relevant images from a tripartite graph model. Verification on data sets and comparisons. IIIT Hyderabad

Background In Latent semantic Indexing, the term document matrix is decomposed using singular value decomposition. = t N U V k = ( , ) ( ) ( | ) ( | ) P d w P d P z d P w z i j i k i j k In Probabilistic Latent Semantic Indexing, P(d), P(z|d), P(w|z) are computed used EM algorithm. IIIT Hyderabad

Semantic Indexing Whippet doberman Animal GSD d Whippet daffodil w GSD tulip doberman P(w|d) rose daffodil LSI, pLSA, LDA tulip rose Flower IIIT Hyderabad * Hoffman 1999; Blei, Ng & Jordan, 2004; R. Lienhart and M. Slaney,2007

Literature LSI. pLSA. Incremental pLSA. Multilayer multimodal pLSA. High space complexity due to large matrix operations. Slow, resource intensive offline processing. IIIT Hyderabad *R. Lienhart and M. Slaney., Plsa on large scale image databases, in ECCV, 2006. *H. Wu, Y. Wang, and X. Cheng, Incremental probabilistic latent semantic analysis for automatic question recommendation, in AMC on RSRS, 2008. *R. Lienhart, S. Romberg, and E. H orster, Multilayer plsa for multimodal image retrieval, in CIVR, 2009.

Multimodal LSI Most of the current image representations either solely on visual features or on surrounding text. Tensor We represent the multi-modal data using 3rd order tensor. Vector: order-1 tensor Order-3 tensor Matrix: order-2 tensor IIIT Hyderabad

MultiModal LSI Higher Order SVD is used to capture the latent semantics. Finds correlated within the same mode and across different modes. HOSVD extension of SVD and represented as = A Z U U U 1 2 3 images visualword s textwords IIIT Hyderabad

HOSVD Algorithm IIIT Hyderabad

Multimodal PLSA An unobserved latent variable z is associated with the text words w t ,visual words wvand the documents d. The join probability for text words, images and visual words is t v t t v t = ( , , ) ( ) ( | ) ( | , ) P w d w P w P w d P w w d j i l j j i l j i Assumption: v t v = ( | , ) ( | ) P w w d P w d Thus, t v t t v = ( , , ) ( ) ( | ) ( | ) P w d w P w P w d P w d j i l j j i l i IIIT Hyderabad

Multimodal PLSA The joint probabilistic model for the above generative model is given by the following: t v t v = ( , , ) ( ) ( | ) ( | ) ( | ) ( | ) P w d w P d P w z P z d P w z P z d 2 = 2 t v ( ) ( | ) ( | ) ( | ) ( ) P d P w z P z d P w z P z Here we capture the patterns between images, text words and visual words by using EM algorithm to determine the hidden layers connecting them. IIIT Hyderabad

Multimodal PLSA E-Step: tj ( | ) ( | ) P w z P z id k k tj = ( | , ) P z id w k tj k n = ( | ) ( | ) P vj w z P z id n n 1 w ( | ) ( | ) P kz P kz id vj = ( | , ) P kz id w vj k n = ( | ) ( | ) P w z P z id n n 1 M-Step: M = t j t ( , ) ( | , ) n d w P z d w tjz w i k i j 1 i ( | ) P = k = M N = ( d t t ( , ) ( | , ) n d w P z d w i j k i j 1 M = 1 j i v l v l , ) ( | , ) n w P z d w v l i k i 1 i ( | ) P w z = k M = N = v l v l ( , ) ( | , ) n d w P z d w i k i 1 1 l i = N L t j v l t j v l ( , , ) ( | , ) ( | , ) n d w w P z d w P z d w i k i k i = 1 1 j l ( | ) P z kd = IIIT Hyderabad i ( ) n d i

Bipartite Graph Model w1 w1 w3 w2 w5 w2 w1 w3 w2 w5 w3 TF words Documents w1 w3 w2 w5 w4 IDF w1 w3 w2 w5 w5 w1 w3 w2 w5 w6 IIIT Hyderabad

BGM Query Image Cash Flow w1 w2 w3 w4 w5 w6 w7 w8 Results : IIIT Hyderabad *Suman karthik, chandrika pulla & C.V. Jawahar, "Incremental On-line semantic Indexing for Image Retrieval in Dynamic. Databases , Workshop on Semantic Learning and Applications, CVPR, 2008

Tripartite Graph Model Tensor represented as a Tripartite graph of text words, visual words and images. IIIT Hyderabad

Tripartite Graph Model The edge weights between text words with visual word are computed as: i d t e i d v e + C ( 1 ( ) ) , t v p p p q i W = pq i d t e i d v e + 1 ( ) p q i Learning edge weights to improve performance. Sum-of-squares error and log loss. L-BFGS for fast convergence and local minima IIIT Hyderabad * Wen-tan, Yih, Learning term-weighting functions for similarity measures, in EMNLP, 2009.

Offline Indexing Bipartite graph model as a special case of TGM. Reduce the computational time for retrieval. Similarity Matrix for graphs Ga and Gb T T = + S BS A B S A +1 p p p A and B are adjacency matrixes for Ga and Gb A special case is Ga = Gb =G . IIIT Hyderabad

Datasets University of Washington(UW) 1109 images. manually annotated key words. IAPR TC12 20,000 images of natural scenes(sports and actions, landscapes, cites etc) . 291 vocabulary size and 17,825 images for training. 1,980 images for testing. Corel 5000 images. 4500 for training and 500 for testing. 260 unique words. Holiday dataset 1491 images 500 categories Multi-label Image 139 urban scene images. Overlapping labels: Buildings, Flora, People and Sky. Manually created ground truth data for 50 images. IIIT Hyderabad

Experimental Settings Pre-processing Sift feature extraction. Quantization using k-means. Performance measures : The mean Average precision(mAP). Q q = ( ) AveP q 1 = mAP Q Time taken for semantic indexing. Memory space used for semantic indexing. IIIT Hyderabad

BGM vs pLSA,IpLSA Model mAP Time Space Probabilistic LSI 0.642 547s 3267Mb Incremental PLSA 0.567 56s 3356Mb BGM 0.594 42s 57Mb * On Holiday dataset IIIT Hyderabad

BGA vs pLSA,IpLSA pLSA Cannot scale for large databases. Cannot update incrementally. Latent topic initialization difficult Space complexity high IpLSA Cannot scale for large databases. Cannot update new latent topics. Latent topic initialization difficult Space complexity high BGM+Cashflow Efficient Low space com plexity IIIT Hyderabad

Results Datasets Visual-based Tag-based Pseudo single mode MMLSI UW 0.46 0.55 0.55 0.63 LSI vs MMLSI Multilabel 0.33 0.42 0.39 0.49 IAPR 0.42 0.46 0.43 0.55 Corel 0.25 0.46 0.47 0.53 Datasets Visual- based 0.60 0.36 0.43 0.33 Tag-based Pseudo single mode 0.59 0.36 0.44 0.48 mm-pLSA Our MM- pLSA 0.70 0.51 0.59 0.59 UW Multilabel IAPR Corel 0.57 0.41 0.47 0.47 0.68 0.50 0.56 0.59 pLSA vs MMpLSA IIIT Hyderabad

TGM vs MMLSI,MMpLSA,mm-pLSA mm-pLSA Merge dictionaries with different modes. No intraction between different modes. MMLSI and MMpLSA Cannot scale for large databases. Cannot update incrementally. Latent topic initialization difficult Space complexity high TGM+Cashflow Efficient Low space complexity Datasets MMLSI MMpLSA mm-pLSA TGM- TFIDF 0.64 0.49 0.56 0.35 TGM- learning 0.67 0.50 0.59 0.38 UW Multilabel IAPR Corel 0.63 0.49 0.55 0.33 0.70 0.51 0.59 0.39 0.68 0.50 0.56 0.37 IIIT Hyderabad

TGM vs MMLSI,MMpLSA,mm-pLSA TGM Takes few milliseconds for semantic indexing. Low space complexity Model MMLSI MMpLSA mm-pLSA TGM mAP 0.63 0.70 0.68 0.67 Time 1897s 983s 1123s 55s space 4856Mb 4267Mb 3812Mb 168Mb IIIT Hyderabad

Conclusion MMLSI and MMpLSA Outperforms single mode and existing multimodal. LSI, pLSA and multimodal techniques proposed. Memory and computational intensive. TGM Fast and effective retrieval. Scalable. Computationally light intensive. Less resource intensive. IIIT Hyderabad

Future work Learning approach to determine the size of the concept space. Various methods can be explored to determine the weights in TGM. Extending the algorithms designed for Video Retrieval . IIIT Hyderabad

Related Publications Suman Karthik, Chandrika Pulla, C.V.Jawahar, "Incremental On-line semantic Indexing for Image Retrieval in Dynamic. Databases" 4th International Workshop on Semantic Learning and Applications, CVPR, 2008. Chandrika pulla, C.V.Jawahar, Multi Modal Semantic Indexing for Image Retrieval ,In Proceedings of Conference on Image and Video Retrieval(CIVR), 2010. Chandrika pulla, Suman Karthik, C.V.Jawahar, Effective Semantic Indexing for Image Retrieval , In Proceedings of International Conference on Pattern Recognition(ICPR), 2010. Chandrika pulla, C.V.Jawahar, Tripartite Graph Models for Multi Modal Image Retrieval , In Proceedings of British Machine Vision Conference(BMVC), 2010. IIIT Hyderabad

Thank you IIIT Hyderabad

Multimodal Semantic Indexing for Image Retrieval at IIIT Hyderabad

Download Presentation

Presentation Transcript

Related

More Related Content