Accessing NLM Data EDirect for PubMed - Part 2: Extracting Data from XML

 
The Insider’s Guide to Accessing NLM Data
 
Part 2: Extracting Data from XML
 
 
National Library of Medicine
National Institutes of Health
U.S. Department of Health and Human Services
 
EDirect for PubMed
 
Sarah Helson, MLIS
 
EDirect for PubMed Agenda
 
Part 1: Getting PubMed Data
Part 2: Extracting Data from XML
Part 3: Formatting Results and Unix tools
Part 4: xtract Conditional Arguments
Part 5: Developing and Building Scripts
 
 
Today’s Agenda
 
Recap of Part One
XML refresher
Creating basic tables with xtract
Processing your output
 
 
Recap of Part One
 
esearch: searches for PMIDs
efetch: retrieves records in a variety of formats
"|": pipes results from one command to the next
 
 
Tips for Cygwin users
 
Copy = Ctrl + Insert
Paste = Shift + Insert
Adjustable in Cygwin options.
 
 
Tips for all users
 
Ctrl + C = Cancel
Quick way out of a mistake
Up and Down arrows cycle through history
Helpful to edit or re-run recent commands
"clear" clears your screen
Doesn't clear your history!
 
 
Questions from last class? Homework?
 
 
Remember our theme…
 
Get exactly the data you need
…and only the data you need
…in the format you need.
 
 
XML
 
eXtensible Markup Language
Language for storing and transporting data
Human- and machine-readable
Composed of XML elements
 
XML Basics
 
Element
 
<Year>2015</Year>
Attribute
 
<MedlineCitation Status="MEDLINE">
  
[…]
 
</MedlineCitation>
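A minimal sketch of the element/attribute distinction, using Python's standard xml.etree.ElementTree module rather than EDirect. The PMID child inside MedlineCitation is only a stand-in for the elided content and is an assumption made for illustration.

```python
import xml.etree.ElementTree as ET

# An element: a name (Year) wrapped around a value (2015).
year = ET.fromstring("<Year>2015</Year>")

# An element that also carries an attribute (Status).
# The PMID child is a placeholder for the content elided on the slide.
citation = ET.fromstring(
    '<MedlineCitation Status="MEDLINE">'
    '<PMID Version="1">26438784</PMID>'
    "</MedlineCitation>"
)

print(year.tag, year.text)            # element name and element value
print(citation.tag, citation.attrib)  # element name and all of its attributes
print(citation.get("Status"))         # value of a single attribute
```

The element value sits between the opening and closing tags; the attribute value sits inside the opening tag.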
An XML example
<PubmedArticleSet>
    <PubmedArticle>
        <PMID Version="1">26438784</PMID>
        <DateCreated>
            <Year>2015</Year>
            <Month>12</Month>
            <Day>15</Day>
        </DateCreated>
        <Journal>
            <ISSN IssnType="Electronic">1468-201X</ISSN>
            <JournalIssue CitedMedium="Internet">
                <Volume>102</Volume>
                <Issue>1</Issue>
                <PubDate>
                    <Year>2016</Year>
                    <Month>Jan</Month>
                </PubDate>
            </JournalIssue>
            <Title>Heart (British Cardiac Society)</Title>
            <ISOAbbreviation>Heart</ISOAbbreviation>
        </Journal>
        <ArticleTitle>Handheld echocardiographic screening for rheumatic heart disease by non-experts.</ArticleTitle>
        <ELocationID EIdType="doi" ValidYN="Y">10.1136/heartjnl-2015-308236</ELocationID>
    </PubmedArticle>
</PubmedArticleSet>
Some XML elements repeat
<PubmedArticleSet>
    <PubmedArticle>
        <PMID Version="1">26438784</PMID>
        […]
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">25359968</PMID>
        […]
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">26276820</PMID>
        […]
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">25656311</PMID>
        […]
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">27580140</PMID>
        […]
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">15988488</PMID>
        […]
    </PubmedArticle>
</PubmedArticleSet>
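To see the repetition programmatically, a small Python sketch (standard library only, not part of EDirect) can count the repeating elements. The input is a shortened copy of the slide, with the elided content left out.

```python
import xml.etree.ElementTree as ET

# Three of the records from the slide; the elided content is omitted.
xml_text = """
<PubmedArticleSet>
    <PubmedArticle><PMID Version="1">26438784</PMID></PubmedArticle>
    <PubmedArticle><PMID Version="1">25359968</PMID></PubmedArticle>
    <PubmedArticle><PMID Version="1">26276820</PMID></PubmedArticle>
</PubmedArticleSet>
"""

root = ET.fromstring(xml_text)

# PubmedArticle repeats: one occurrence per record in the set.
articles = root.findall("PubmedArticle")
pmids = [article.findtext("PMID") for article in articles]

print(len(articles), "records")
print(pmids)
```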
 
xtract
 
Extracts specific elements from XML and arranges them in a customized tabular format.
Format determined by arguments
Very powerful and flexible
Not an E-utilities command
 
 
What XML are we xtract-ing from?
 
XML (any XML!)
Can get XML from "efetch"
Can pull XML in from a file on your computer using "-input"

efetch -format xml | xtract […]
xtract -input file.xml […]
 
PubMed XML Documentation
 
Large number of XML elements in PubMed
 
XML Element Descriptions
https://www.nlm.nih.gov/bsd/licensee/elements_descriptions.html
 
PubMed DTD Documentation
https://dtd.nlm.nih.gov/ncbi/pubmed/out/doc/2018/
 
 
 
 
Before you start xtract-ing…
 
Helpful to look at some PubMed XML
 
<PubmedArticle>
     <MedlineCitation Status="Publisher" Owner="NLM">
          <PMID Version="1">27577264</PMID>
          <DateCreated>
               <Year>2016</Year>
               <Month>8</Month>
               <Day>31</Day>
          </DateCreated>
          <DateRevised>
               <Year>2016</Year>
               <Month>8</Month>
               <Day>31</Day>
          </DateRevised>
          <Article PubModel="Print-Electronic">
               <Journal>
                    <ISSN IssnType="Electronic">1421-9751</ISSN>
                    <JournalIssue CitedMedium="Internet">
                         <Volume>136</Volume>
                         <Issue>2</Issue>
                         <PubDate>
 
Get a small sample dataset
 
Choose a few representative records
Avoid atypical examples…for now.
Write a quick efetch:

efetch -db pubmed -id 24102982,21171099,17150207 -format xml

Do this right now, save it for later!
 
xtract Example 1
 
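We have a set of records
We want a tabular list with PMID, Journal TA, and Title:

PMID1    Journal TA1    Article Title1
PMID2    Journal TA2    Article Title2
PMID3    Journal TA3    Article Title3
PMID4    Journal TA4    Article Title4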
Questions to ask when making a table
 
What connects the data in each row?
How many rows?
How many columns?
What data is in each column?

What connects the data in each row?
 
Specify an XML element using "-pattern"
All data in a row comes from descendants of one occurrence of the pattern
xtract scans the XML until it finds an occurrence of the pattern
When it does, it creates a new row

-pattern PubmedArticle
<PubmedArticleSet>
    <PubmedArticle>
        <PMID Version="1">26438784</PMID>
        <DateCreated>
            <Year>2015</Year>
            <Month>12</Month>
            <Day>15</Day>
        </DateCreated>
        <ArticleTitle>Handheld echocardiographic screening for rheumatic heart disease by non-experts.</ArticleTitle>
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">25359968</PMID>
        <DateCreated>
            <Year>2014</Year>
            <Month>10</Month>
            <Day>31</Day>
        </DateCreated>
        <ArticleTitle>Structural basis for microRNA targeting.</ArticleTitle>
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">26276820</PMID>
        <DateCreated>
            <Year>2015</Year>
            <Month>10</Month>
            <Day>2</Day>
[…]
 
How many rows?
 
One row per occurrence of the pattern in the XML input.
"-pattern PubmedArticle" creates as many rows as there are records in your input.
 
Won't my -pattern always be PubmedArticle?
 
Most of the time, yes.
Lets you tabulate PubMed records
Change -pattern to see different types of relationships
Tabulate author, grant information, etc.
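A quick way to see how the choice of pattern changes the number of rows is to count occurrences of each candidate element. This is a Python sketch, not EDirect; count_rows is a hypothetical helper, and the records and author names below are made up purely for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up miniature input: two records, three authors in total.
xml_text = """
<PubmedArticleSet>
  <PubmedArticle>
    <PMID>1</PMID>
    <AuthorList>
      <Author><LastName>Alpha</LastName><Initials>A</Initials></Author>
      <Author><LastName>Beta</LastName><Initials>B</Initials></Author>
    </AuthorList>
  </PubmedArticle>
  <PubmedArticle>
    <PMID>2</PMID>
    <AuthorList>
      <Author><LastName>Gamma</LastName><Initials>G</Initials></Author>
    </AuthorList>
  </PubmedArticle>
</PubmedArticleSet>
"""


def count_rows(xml_text, pattern):
    """Count occurrences of `pattern`: the number of rows a table built on it would have."""
    root = ET.fromstring(xml_text)
    return sum(1 for _ in root.iter(pattern))


print("rows with PubmedArticle as the pattern:", count_rows(xml_text, "PubmedArticle"))
print("rows with Author as the pattern:", count_rows(xml_text, "Author"))
```

With PubmedArticle as the pattern, each row describes a record; with Author as the pattern, each row describes one author.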
How many columns?
 
You specify a series of XML elements or attributes using "-element"
Each element or attribute name you specify creates a new column*
How many elements/attributes you specify determines the number of columns.*
(*Most of the time.)
 
What data is in each column?
 
Inside a pattern, xtract looks for the elements or attributes that you specify.
The value of each occurrence of the element/attribute is put in the column.
 
-element PMID Year ArticleTitle
<PubmedArticleSet>
    <PubmedArticle>
        <PMID Version="1">26438784</PMID>
        <DateCreated>
            <Year>2015</Year>
            <Month>12</Month>
            <Day>15</Day>
        </DateCreated>
        <ArticleTitle>Handheld echocardiographic screening for rheumatic heart disease by non-experts.</ArticleTitle>
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">25359968</PMID>
        <DateCreated>
            <Year>2014</Year>
            <Month>10</Month>
            <Day>31</Day>
        </DateCreated>
        <ArticleTitle>Structural basis for microRNA targeting.</ArticleTitle>
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">26276820</PMID>
        <DateCreated>
            <Year>2015</Year>
            <Month>10</Month>
            <Day>2</Day>
[…]
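A rough Python model of what "-pattern PubmedArticle -element PMID Year ArticleTitle" does with this input. make_table is a hypothetical helper, not xtract's actual implementation; it only mimics the row and column logic described on the previous slides. The third record is left out because it is truncated on the slide.

```python
import xml.etree.ElementTree as ET

# The two complete records from the slide.
xml_text = """
<PubmedArticleSet>
    <PubmedArticle>
        <PMID Version="1">26438784</PMID>
        <DateCreated>
            <Year>2015</Year>
            <Month>12</Month>
            <Day>15</Day>
        </DateCreated>
        <ArticleTitle>Handheld echocardiographic screening for rheumatic heart disease by non-experts.</ArticleTitle>
    </PubmedArticle>
    <PubmedArticle>
        <PMID Version="1">25359968</PMID>
        <DateCreated>
            <Year>2014</Year>
            <Month>10</Month>
            <Day>31</Day>
        </DateCreated>
        <ArticleTitle>Structural basis for microRNA targeting.</ArticleTitle>
    </PubmedArticle>
</PubmedArticleSet>
"""


def make_table(xml_text, pattern, elements):
    """Build one row per occurrence of `pattern`, filled from the named elements."""
    root = ET.fromstring(xml_text)
    rows = []
    for record in root.iter(pattern):      # each pattern occurrence starts a new row
        row = []
        for name in elements:              # element names are handled in the order given
            # Every occurrence of the element inside this pattern adds a value.
            row.extend(found.text or "" for found in record.iter(name))
        rows.append(row)
    return rows


table = make_table(xml_text, "PubmedArticle", ["PMID", "Year", "ArticleTitle"])
for row in table:
    print("\t".join(row))
```

In this sketch, an element name that occurs more than once inside a pattern (for example, a Year under both DateCreated and PubDate) adds every occurrence to the row, which is one way a row can end up with more values than the number of names listed.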
 
xtract syntax
 
xtract -pattern PubmedArticle -element ArticleTitle
 
Creating multiple columns
 
Separate element names with spaces.
 
You don’t need to repeat “-element”
 
xtract -pattern PubmedArticle -element Agency GrantID
 
Exercise 1
 
Write an xtract command that:
Creates a table with one row per PubMed article.
Each row should have two columns:
Volume
Issue Number
Use the following efetch as input:
 
efetch -db pubmed -id 24102982,21171099,17150207 -format xml

Exercise 1 Solution

xtract -pattern PubmedArticle -element Volume Issue
 
Parent/Child construction
 
Retrieves only elements that are the child of a specific parent
Isolates objects of the same name in different locations in the hierarchy

xtract -pattern PubmedArticle -element MedlineCitation/PMID
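Why this matters: a record can contain more than one PMID element. The Python sketch below (standard library, not EDirect) contrasts "any PMID" with "PMID whose parent is MedlineCitation". The record is a simplified illustration; placing a second PMID under CommentsCorrections is an assumption made for the example, and the two PMIDs are borrowed from earlier slides.

```python
import xml.etree.ElementTree as ET

# Simplified, illustrative record with PMID elements at two levels.
xml_text = """
<PubmedArticle>
    <MedlineCitation Status="MEDLINE">
        <PMID Version="1">26438784</PMID>
        <CommentsCorrectionsList>
            <CommentsCorrections RefType="CommentIn">
                <PMID Version="1">25359968</PMID>
            </CommentsCorrections>
        </CommentsCorrectionsList>
    </MedlineCitation>
</PubmedArticle>
"""

article = ET.fromstring(xml_text)

# Any PMID anywhere inside the record, in document order.
all_pmids = [found.text for found in article.iter("PMID")]

# Only PMID elements that are direct children of MedlineCitation.
child_pmids = [found.text for found in article.findall(".//MedlineCitation/PMID")]

print(all_pmids)
print(child_pmids)
```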
 
Back to xtract Example 1…
 
We have a set of records
We want a tabular list with PMID, Journal TA, and Title.
 
Solving xtract Example 1
 
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
Journal/ISOAbbreviation ArticleTitle
 
xtract-ing attribute values
 
Use “@” to specify an attribute
 
xtract -pattern PubmedArticle \
-element DescriptorName@MajorTopicYN
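The same idea as a Python sketch (standard library, not EDirect): for each occurrence of an element, read the value of one attribute instead of the element's own value. attribute_values is a hypothetical helper, and the two MeSH headings are illustrative sample data.

```python
import xml.etree.ElementTree as ET

# Illustrative record fragment with two MeSH headings.
xml_text = """
<PubmedArticle>
    <MeshHeadingList>
        <MeshHeading>
            <DescriptorName MajorTopicYN="N">Humans</DescriptorName>
        </MeshHeading>
        <MeshHeading>
            <DescriptorName MajorTopicYN="Y">Rheumatic Heart Disease</DescriptorName>
        </MeshHeading>
    </MeshHeadingList>
</PubmedArticle>
"""


def attribute_values(xml_text, element, attribute):
    """Return the value of `attribute` for every `element` that carries it."""
    root = ET.fromstring(xml_text)
    return [
        found.get(attribute)
        for found in root.iter(element)
        if found.get(attribute) is not None
    ]


print(attribute_values(xml_text, "DescriptorName", "MajorTopicYN"))
```

Note that the result holds the attribute values (N, Y), not the descriptor names themselves.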
 
Exercise 2
 
Write an xtract command that:
Has one row per PubMed record
Has three columns:
PMID
Journal ISSN
Citation Status
 
 
Exercise 2 Solution
 
xtract -pattern PubmedArticle -element MedlineCitation/PMID \
Journal/ISSN MedlineCitation@Status
 
Exercise 3: Putting it all together
 
We want to find out which authors have been writing about traumatic brain injury in athletes.
Limit to publications from 2016 and 2017.
We want to see just the author names, one per line.
We want the Last Name and Initials
We want the whole script (not just the xtract command).
 
Exercise 3 Solution
esearch -db pubmed -query "traumatic brain injury athletes" \
-datetype PDAT -mindate 2016 -maxdate 2017 | \
efetch -format xml | \
xtract -pattern Author -element LastName Initials
 
sort-uniq-count-rank
 
Helps to quantify lists of data
Shows unique values from a list, along with how many times they appeared
Sorts by frequency
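The counting-and-ranking idea can be modeled in a few lines of Python. This is a sketch of the concept, not the EDirect script itself: count_and_rank is a hypothetical helper, the tie-breaking order and output layout are assumptions, and the author names are made up.

```python
from collections import Counter


def count_and_rank(lines):
    """Return (count, value) pairs, most frequent first; ties are ordered by value."""
    counts = Counter(lines)
    pairs = [(count, value) for value, count in counts.items()]
    return sorted(pairs, key=lambda pair: (-pair[0], pair[1]))


# Made-up list with repeats, one name per line as in Exercise 3.
names = ["Alpha A", "Beta B", "Alpha A", "Gamma G", "Alpha A", "Beta B"]

ranked = count_and_rank(names)
for count, value in ranked:
    print(f"{count}\t{value}")
```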
 
 
head
 
Limits output to only the first few lines.
Syntax: "head -n 10"
Outputs the first ten lines of the input.
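For comparison, the same "first few lines only" behavior as a Python sketch. The head function here is a hypothetical stand-in written for illustration, not the Unix tool.

```python
from itertools import islice


def head(lines, n=10):
    """Return at most the first `n` items of `lines`."""
    return list(islice(lines, n))


# Twenty-five sample lines; only the first ten are kept.
numbered = [f"line {i}" for i in range(1, 26)]
first_ten = head(numbered, 10)

for line in first_ten:
    print(line)
```

If the input has fewer lines than requested, all of them are returned.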
 
 
Coming next week…
 
Limiting output using Conditional arguments
Suggestions for building solutions
Examine real-world practical applications
Step-by-step walkthrough of development process
 
 
In the meantime…
 
Insider’s Guide online
https://dataguide.nlm.nih.gov
Sign up for "utilities-announce" mailing list!
Questions?
https://dataguide.nlm.nih.gov/contact
 
 
Homework
 
Exercises at the bottom of handout
 
Send us your case studies!
https://dataguide.nlm.nih.gov/contact
 
 
Questions?
 

A comprehensive guide to using NLM's EDirect tool for PubMed, focusing on extracting data from XML. It covers manipulating output, creating basic tables, and tips for Cygwin and all users. The emphasis is on obtaining precise and relevant data in the desired format efficiently.

  • NLM
  • EDirect
  • PubMed
  • XML
  • Data Extraction

Uploaded on Sep 19, 2024 | 0 Views


