Understanding Major Terms, Cluster Labels, and Themes in IN-SPIRE Training


Major terms in IN-SPIRE are the keywords used for clustering documents, while cluster labels in the Galaxy view are the most representative major terms associated with a point, listed in order of decreasing importance. Themes are calculated by clustering keywords rather than documents, and describe the data at a higher level than the Galaxy labels. Two PNNL techniques support this: RAKE extracts high-value single- and multi-word keywords, and CAST groups related keywords into themes. CAST applies a hierarchical agglomerative clustering algorithm to a keyword similarity matrix built from keyword-document associations. Links to further details on both techniques appear in the transcript.


Uploaded on Oct 04, 2024



Presentation Transcript


  1. Major terms, cluster labels and themes. Slides for IN-SPIRE training. OPA_T#964_Mar-14-2016, Office of Portfolio Analysis

  2. Terms and Themes. Major terms: the terms IN-SPIRE identifies as keywords and uses for clustering documents. Cluster labels (Galaxy view): the most representative major terms associated with a point, arranged in order of decreasing importance. Themes: collections of keywords and phrases that occur together in the documents and can be used to describe the data at a higher level than the Galaxy labels. Themes are calculated by clustering keywords rather than documents.

  3. PNNL techniques used in IN-SPIRE. Rapid Automatic Keyword Extraction (RAKE): extracts single- and multi-word keywords. RAKE provides a set of high-value keywords to CAST to identify themes. Computation and Analysis of Significant Themes (CAST): computes themes as clusters of related keywords, providing a higher-level grouping. Each computed theme comprises a set of highly associated keywords and a set of documents that are highly associated with the theme's keywords. Whereas many text analysis methods focus on what distinguishes documents, RAKE and CAST focus on what describes documents, ideally characterizing what each document is essentially about. PNNL website: http://vis.pnnl.gov/research.stm More details on RAKE: http://media.wiley.com/product_data/excerpt/22/04707498/0470749822.pdf
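As a rough illustration of the keyword extraction step on slide 3, the sketch below follows the published description of RAKE: text is split into candidate keywords at punctuation and stop words, each word is scored by the ratio of its degree (co-occurrence within candidates) to its frequency, and a candidate's score is the sum of its word scores. The stop list and the function names candidate_phrases and rake_scores are assumptions made for this example. This is not IN-SPIRE's code, and it omits refinements such as adjoining keywords.

```python
import re
from collections import defaultdict

# Tiny illustrative stop list (an assumption; real stop lists are far longer).
STOP_WORDS = {"a", "an", "and", "are", "as", "by", "for", "in", "is", "of",
              "on", "that", "the", "to", "with"}

def candidate_phrases(text):
    """Split text into candidate keywords at punctuation and stop words."""
    phrases = []
    for fragment in re.split(r"[^\w\s-]+", text.lower()):
        current = []
        for word in fragment.split():
            if word in STOP_WORDS:
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases

def rake_scores(text):
    """Return (keyword, score) pairs, highest score first."""
    phrases = candidate_phrases(text)
    freq = defaultdict(int)    # how often each word occurs in candidates
    degree = defaultdict(int)  # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            freq[word] += 1
            degree[word] += len(phrase)
    scores = {" ".join(p): sum(degree[w] / freq[w] for w in p) for p in phrases}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    sample = "Hierarchical clustering of keywords. Clustering of documents is the default."
    for keyword, score in rake_scores(sample):
        print(f"{score:.2f}  {keyword}")
```

Multi-word candidates tend to score higher because their words have a larger degree, which is why a method of this kind surfaces phrases as well as single terms.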

  4. Computation and Analysis of Significant Themes (CAST). The extracted keywords are grouped into coherent themes by applying a hierarchical agglomerative clustering algorithm to a keyword similarity matrix based on keyword-document associations in the corpus. In other words: each keyword is scored against every document, giving one vector per keyword. The vectors for every pair of keywords are then compared for similarity. Keywords are grouped by clustering on that similarity. The highest-scoring terms become candidate themes. Within each theme, keywords are ranked by the number of documents assigned to the theme, which sets the order of the theme label. More details on CAST: http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5961940
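The steps on slide 4 can be sketched in a few lines of Python. This is a minimal sketch under several assumptions, not the CAST implementation: the slide does not say how keywords are scored against documents, so the example uses raw occurrence counts; it uses cosine similarity and average linkage with a fixed stopping threshold; and it orders each theme's keywords by how many documents contain them. The selection of highest-scoring candidate themes is left out. All function names are hypothetical.

```python
import math

def keyword_vectors(keywords, documents):
    """One vector per keyword: its occurrence count in each document."""
    docs = [d.lower() for d in documents]
    return {k: [d.count(k) for d in docs] for k in keywords}

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def cluster_keywords(vectors, threshold=0.5):
    """Average-linkage agglomerative clustering of keywords.

    Repeatedly merges the two most similar clusters and stops once the
    best remaining pair falls below `threshold` (an assumed cut-off).
    """
    clusters = [[k] for k in vectors]

    def linkage(a, b):
        total = sum(cosine(vectors[x], vectors[y]) for x in a for y in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        best, i, j = max(
            (linkage(a, b), i, j)
            for i, a in enumerate(clusters)
            for j, b in enumerate(clusters) if i < j
        )
        if best < threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def label_themes(clusters, vectors):
    """Order each theme's keywords by the number of documents containing them."""
    def doc_count(keyword):
        return sum(1 for count in vectors[keyword] if count > 0)
    return [sorted(c, key=doc_count, reverse=True) for c in clusters]

if __name__ == "__main__":
    documents = [
        "Gene expression in tumor cells",
        "Tumor cells and gene expression profiles",
        "Neural network training",
        "Training a deep neural network",
    ]
    keywords = ["gene expression", "tumor cells", "neural network", "training"]
    vectors = keyword_vectors(keywords, documents)
    for theme in label_themes(cluster_keywords(vectors), vectors):
        print(", ".join(theme))
```

Because the vectors describe keywords rather than documents, two keywords end up in the same theme when they appear in the same documents, which matches the slide's point that themes come from clustering keywords, not documents.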
