Enhancing Graph Neural Networks for Text-rich Graphs
Designing GNNs for text-rich graphs involves considering both textual and non-textual linkage information among entities, such as papers, webpages, and people. Utilizing structural information beyond citation networks and exploring latent textual linkages are key challenges. Previous work has focused on independent text usage, but the cooperative utilization of textual and non-textual linkages remains underexplored.
- Graph Neural Networks
- Text-rich Graphs
- Linkage Information
- Structural Information
- Latent Textual Relationships
Beyond Non-textual Linkage: Designing GNNs for Text-rich Graphs. Yanbang Wang, Jul 27, 2020, at UIUC DMG. Collaborative work with Carl Yang, Pan Li, and Prof. Jiawei Han.
Text-rich Graphs usually come with two things:
1. Node attributes: raw text (or bag-of-words / TF-IDF features); may or may not have extra numerical features.
2. Pregiven linkage information among entities, e.g. citation (paper cites paper).
Examples: paper-paper: Cora, Citeseer, Pubmed, arXiv, DBLP; webpage-webpage: WebKB; person-person: Facebook.
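As a minimal illustration of this setup, a text-rich graph can be represented as TF-IDF node attributes plus a pregiven citation edge list. The toy corpus and edges below are invented for the sketch, not taken from any of the datasets above:

```python
import numpy as np

# Toy corpus: each "paper" is a short string (illustrative only)
docs = [
    "graph neural network classification",
    "topic model for text classification",
    "graph topic propagation",
]
vocab = sorted({w for d in docs for w in d.split()})
w2i = {w: i for i, w in enumerate(vocab)}

# Term-frequency matrix (docs x vocab)
tf = np.zeros((len(docs), len(vocab)))
for d, doc in enumerate(docs):
    for w in doc.split():
        tf[d, w2i[w]] += 1

# Inverse document frequency -> TF-IDF node attributes
df = (tf > 0).sum(axis=0)
idf = np.log(len(docs) / df)
node_features = tf * idf

# Pregiven non-textual linkage: paper 0 cites paper 1, paper 2 cites paper 0
cite_edges = [(0, 1), (2, 0)]
```

The node features and edge list together are exactly the two inputs a GNN on a text-rich graph consumes.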
Task: prediction over text-rich graphs, including node classification (theme categorization) and link prediction (citation prediction); these tasks are prototypes for many important applications.
Opportunities & challenges from text: structural information beyond the citation network. The main advantage of GNNs (collective classification) is the utilization of the linkage between target entities. Almost all previous work simply follows the pregiven linkage information, or at most makes some modifications primarily based on it. However, in text-rich networks, the linkage information should not be confined to what is explicitly given. For example, the relationship between two papers is not just citation: text attributes endorse a much richer set of interrelationships beyond the pregiven linkage.
Two types of linkage:
1. Non-textual linkage: citation, co-authorship, same publication venue; usually pregiven and clean.
2. Textual linkage: topic clusters (e.g., different research areas); (dis)similarities among topic clusters (e.g., network analysis, graph mining, security); subtle semantic relationships (e.g., machine learning, deep learning, CNN); latent, complex, but very rich.
Previous Work. How to 1) model and 2) utilize these latent textual linkages is seriously underexplored. Previous work on text-rich graphs focuses on using the text independently: using deep models like Bi-LSTM to vectorize each node's textual tags independently, or treating the text attribute as a generic feature vector. Previous work on document classification cannot cooperatively use textual and non-textual linkage, and fails to capture the complexity of the textual linkages (e.g., simple keyword matching like Text-GNN [AAAI 19], or k-nearest neighbors like paper2vec).
Proposed method involving two phases. Basic idea: make the best use of the latent relationships of topics and word semantics. To model and utilize these latent textual linkages: Phase 1: graph construction (textual + non-textual -> heterogeneous graph); Phase 2: a specialized GNN to model the interaction. Whether or not the two phases are combined results in two variants.
Phase 1: Heterogeneous graph construction.
Input: node attributes (node text) and non-textual linkage (citation).
A topic model applied to the node text yields the doc-topic distribution P(topic | doc) (each doc <has mixture of> topics) and the topic-term distribution P(term | topic) (each topic <distributes over> terms).
Graph: Doc <cite> Doc; Doc <has mixture of> Topic; Topic <distributes over> Term; term nodes are initialized by embedding lookup.
Supervision: node/edge labels. Loss: L_task.
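A sketch of this construction, assuming a pretrained topic model has already produced the doc-topic and topic-term distributions (random Dirichlet placeholders below stand in for the real model's output, and the edge-type names follow the slide):

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder output of a pretrained topic model (e.g. LDA):
# rows are probability distributions, hence the Dirichlet samples.
n_docs, n_topics, n_terms = 4, 2, 6
doc_topic = rng.dirichlet(np.ones(n_topics), size=n_docs)    # P(topic | doc)
topic_term = rng.dirichlet(np.ones(n_terms), size=n_topics)  # P(term | topic)

# Pregiven non-textual linkage (citations among docs)
cite_edges = [(0, 1), (2, 3)]

# Typed, weighted edge list of the heterogeneous graph:
# (edge_type, source, target, weight)
edges = [("cite", s, t, 1.0) for s, t in cite_edges]
edges += [("has_mixture_of", d, k, doc_topic[d, k])
          for d in range(n_docs) for k in range(n_topics)]
edges += [("distributes_over", k, w, topic_term[k, w])
          for k in range(n_topics) for w in range(n_terms)]
```

The resulting typed edge list is what Phase 2 propagates over; in practice one would prune low-probability doc-topic and topic-term edges rather than keep the dense distributions.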
Phase 2: Neural propagation on the heterogeneous graph.
Step 1: encode the different edge types and edge weights.
Step 2: project all edges and node attributes into a unified feature space.
Step 3: propagate the features, where σ is the sigmoid, g is the softplus, and a_{s,t} encodes edge types and weights.
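The three steps can be sketched as follows. The exact gating function a_{s,t} is not spelled out on the slide, so the sigmoid of (type score x edge weight) below is an assumption, as are the per-type scores themselves:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softplus(x):
    return np.log1p(np.exp(x))

rng = np.random.default_rng(1)
n_nodes, dim = 5, 4
H = rng.normal(size=(n_nodes, dim))  # node features already in a unified space
W = np.eye(dim)                      # shared projection (identity for simplicity)

# Typed, weighted edges: (edge_type, source, target, weight)
edges = [("cite", 0, 1, 1.0), ("has_mixture_of", 0, 3, 0.7),
         ("distributes_over", 3, 4, 0.3)]
type_score = {"cite": 1.0, "has_mixture_of": 0.6, "distributes_over": 0.4}

def propagate(H, edges, W):
    """One message-passing step: gate each message by a sigmoid of the
    edge-type score times the edge weight, then apply softplus as g."""
    out = np.zeros_like(H)
    for etype, s, t, w in edges:
        a_st = sigmoid(type_score[etype] * w)  # a_{s,t}: edge type + weight
        out[t] += a_st * (H[s] @ W)
    return softplus(out)

H1 = propagate(H, edges, W)
```

Note that nodes with no incoming edges still pass through g, so every output row stays strictly positive after the softplus.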
What's captured and what's missed. Graph: Doc <cite> Doc; Doc <has mixture of> Topic; Topic <distributes over> Term. Pros: much richer latent textual relationships; interaction between textual and non-textual linkage; the phased framework is clean and transparent. Problem: it roots in the two phases being completely independent. The topic-clustering information is hard-coded by the pretrained topic model and remains frozen throughout the later GNN training: 1) graph construction receives no benefit from the supervision signals in the second phase; 2) the GNN training process has to accept whatever the topic model yields in the first phase.
Making graph construction trainable.
Input: node attributes (node text) and non-textual linkage (citation).
Graph: Doc <cite> Doc; Doc <has mixture of> Topic; Topic <distributes over> Term, with the topic model's distributions replaced by learnable attention: P(topic | doc) = att(h_doc, h_topic) and P(term | topic) = att(h_topic, h_term).
Supervision: node text (doc-term matrix) and node/edge labels. Loss: L_task + L_rec.
Integrating supervision from the text.
Input: doc node features H_d, topic node features H_t, term node features H_w.
Supervision: doc-term matrix D, extracted from the raw text.
Techniques: 1) enforce sparsity; 2) Gumbel softmax to learn the distributions.

A_dt = att(H_d, H_t) = softmax(σ((W_d H_d)(W_t H_t)^T))
A_tw = att(H_t, H_w) = softmax(σ((W_t H_t)(W_w H_w)^T))
L_rec = ||D - A_dt Λ A_tw||_F^2 + λ1 ||A_dt||_F^2 + λ2 ||A_tw||_F^2
L = L_task + λ L_rec

where Λ is a learnable diagonal matrix parametrizing topic weights, and λ1, λ2, and λ are scalar loss coefficients.
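A numpy sketch of this attention-based reconstruction. The shapes and the diagonal Λ follow the slide, but all feature values, projection matrices, and coefficients below are random or illustrative, and the Gumbel-softmax and sparsity tricks are omitted for brevity:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(2)
n_docs, n_topics, n_terms, dim = 4, 2, 6, 3
Hd = rng.normal(size=(n_docs, dim))    # doc node features
Ht = rng.normal(size=(n_topics, dim))  # topic node features
Hw = rng.normal(size=(n_terms, dim))   # term node features
Wd, Wt, Ww = (rng.normal(size=(dim, dim)) for _ in range(3))
D = rng.random((n_docs, n_terms))      # doc-term matrix from raw text

# Learnable attention replaces the frozen topic model's distributions
A_dt = softmax(sigmoid((Hd @ Wd) @ (Ht @ Wt).T), axis=1)  # ~ P(topic | doc)
A_tw = softmax(sigmoid((Ht @ Wt) @ (Hw @ Ww).T), axis=1)  # ~ P(term | topic)

Lam = np.diag(rng.random(n_topics) + 0.5)  # diagonal topic-weight matrix
lam1, lam2 = 1e-3, 1e-3                    # sparsity coefficients (illustrative)
L_rec = (np.linalg.norm(D - A_dt @ Lam @ A_tw, "fro") ** 2
         + lam1 * np.linalg.norm(A_dt, "fro") ** 2
         + lam2 * np.linalg.norm(A_tw, "fro") ** 2)
```

Because A_dt @ Lam @ A_tw must reconstruct the observed doc-term matrix, gradients from L_rec flow back into the node features and projections, which is exactly what makes the graph construction trainable.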
Experiments (node classification accuracy, %):

Method                   Cora    Citeseer  Pubmed
GCN                      87.63   77.28     87.17
GAT                      87.71   76.21     86.92
GraphSAGE                86.82   75.19     84.74
Text-rich GNN (phased)   88.92   78.56     88.08
Text-rich GNN (joint)    on-going
Conclusion: we propose to model and leverage the latent textual relationships in text-rich graphs.
Project Update: worked on several technical details of the GNN architecture; experimented with the 20NewsGroup dataset; set up systematic experiments.
Review.
Input: node attributes (node text) and pregiven links.
Graph: Doc <cite> Doc; Doc <has mixture of> Topic; Topic <distributes over> Term, with the topic model's distributions replaced by learnable attention: P(topic | doc) = att(h_doc, h_topic) and P(term | topic) = att(h_topic, h_term).
Supervision: node text (doc-term matrix) and node/edge labels. Loss: L_task + L_rec.
20NewsGroup Dataset. Document type: news reports; 20 news categories; no pregiven link information. Documents: 18,846; vocab size: 42,757; average length: 221.3.

Method     LSTM    Bi-LSTM  PTE     CNN     Text-GCN (doc-word, word-word)  Our Method
Accuracy   65.71   73.18    76.74   82.15   86.34                           83~84
Experiment Setup: 1. Prediction tasks; 2. Ablation & comparison study; 3. Effect of the topic model's parameters (#topics, #terms); 4. Analysis of the learned attention & topic models.
Prediction Task. Text-rich graphs with pregiven but weak link data, and document classification datasets without any links:

Name           Node Meaning                   Pregiven Link  Classification Target
CORA_ML        ML papers                      Citation       7 ML areas
Hep-th         High-energy physics papers     Citation       4 high-energy physics areas
WebKB          Webpages of top universities   Hyperlink      5 types of target readers (previous SOTA 0.6)
20 NewsGroup   News reports                   -              20 news categories
Movie Reviews  Movie reviews                  -              2 (positive/negative)
Reuters        News reports                   -              8 news categories
Baselines. GNN-based methods: GCN, GAT. Random-walk-based methods: paper2vec, TADW. Text-network-based methods: Text-GCN, PTE (Predictive Text Embedding).
Ablation & Comparison Study.
Remove different types of links in our network: pregiven, doc-topic, topic-term.
Initialize doc node attributes with different feature extractors for text: TF-IDF (default), GloVe vectors (mean-pooled), Bi-LSTM, Text CNN, (BERT).
Further Analysis. 1. Robustness to the topic model setting: what happens if we use different numbers of topic nodes and term nodes? 2. Analysis of the learned attention & topic models: can we visualize the learned attention and check how our GNN model finally learns the textual relationships?
Dataset Overview

Name           Node Meaning                  Pregiven Link?  Raw Text?  #docs / #edges / #vocab   #Target
CORA_ML        ML papers                     Citation        No         2708 / 5278 / N/A         7
Citeseer       ML papers                     Citation        No         3327 / 4552 / N/A         6
Pubmed         Biomed papers                 Citation        No         19717 / 44324 / N/A       3
Hep-th         High-energy physics papers    Citation        Yes        11752 / 134956 / 21614    4
Wikipedia      Wikipedia webpages            Hyperlink       No         2405 / 17981 / N/A        19
20 NewsGroup   News reports                  None            Yes        18846 / N/A / 42757       20
Reuters        News reports                  None            Yes        7647 / N/A / 7688         8

Notes: when we remove pregiven links from text-rich graphs, we get document collections (the last two datasets). Not all datasets we use come with raw text; some only have TF-IDF / word-frequency features. Our method works well without raw text and/or without pregiven links, while almost all baselines require at least one of them. We claim our major contribution on text-rich graphs (the first five datasets).
Main Performance Table: Text-rich Graph Datasets. Row-wise comparison: our method uniformly and significantly outperforms the state-of-the-art baselines on all these popular text-rich graph datasets. LDA does a better job than an MLP. GNN-based methods also generally show very competitive results, but there is limited difference within this line of work. Random-walk-based methods rely on matrix factorization, which is generally a linear process for relational data and lacks expressive power (even CANE is limited in this way).
Main Performance Table: Text-rich Graph Datasets. Row-wise comparison: Text-GCN is the strongest baseline; the idea of introducing additional text relations is rather game-changing. Our most significant gain is achieved on the most difficult dataset, Hep-th.
Ablation Study. Method: remove different components of our framework and check how the performance is affected. Goal: validate the usefulness of each building component.
Note: our method includes {doc-doc, doc-topic, topic-word, doc-word} links; one ablation setting reduces to Text-GCN.
Analysis:
- Ab. 1 vs. the rest: the importance of using various types of textual relationships.
- Ab. 0 vs. Ab. 2: pregiven links are usually helpful to some extent, though relatively limited on Hep-th.
- Ab. 0 vs. Ab. 3: usefulness of doc-word links (direct channels between documents and words).
- Ab. 0 vs. Ab. 4: our model trains word embeddings that better suit the downstream classification task.
- Ab. 0 vs. Ab. 5 and Ab. 6 vs. Ab. 7: the existence of topic nodes is highly important in most cases, no matter how many word nodes are used.
Analysis:
- Ab. 0 vs. Ab. 6: when topic nodes are NOT present, using the full vocabulary without PMI links is a very bad choice.
- Ab. 5 vs. Ab. 7: when topic nodes are present, using the full vocabulary without PMI links does not have a consistent effect.
- Ab. 7 vs. Ab. 8: word-word PMI links are highly crucial to the success of Text-GCN. However, they also require the full vocabulary to be used as word nodes, which leads to an implicit tradeoff.
Other interesting findings:
- The optimal number of topic nodes is usually 1 to 1.5 times the number of classification categories.
- The optimal #words/topic usually ranges from 20 to 100, accounting for less than 5% of the entire vocabulary.
- Using the full vocabulary leads to significant overfitting of the model.
On-going experiments: robustness to hyperparameters; initializing the node attributes with different feature extractors; end-to-end training framework; case study of learned topic and word embeddings.