Using XPath for XML Document Processing
"XML structures data into elements and attributes, with XPath defining a way to select parts for further processing. Learn why XPath is crucial in XML technologies and how it evolved from XSL to become a powerful language with a wider range of functions and operators."
UFCFR5-15-3 Advanced Topics in Web Development II 2021/22 Lecture 5: Understanding & Using XPATH
abstract
XML structures data into a rather small number of different constructs, most notably elements and attributes. The XML Path Language (XPath) defines a way to select parts of XML documents so that they can be used for further processing. XPath's primary use is in XSL Transformations (XSLT) and XQuery (XQ). XPath is a very compact language with a syntax that resembles the path expressions well known from file systems. These path expressions, however, are generalized and therefore much more powerful than the rather simple path expressions in file systems. Because of its use in different XML technologies, XPath is one of the most important XML core technologies.
why xpath (1)
o XML is a syntax for trees
  o it defines a way for trees to be exchanged
o XML technologies should provide support for working with trees
  o when receiving trees, access to the tree should be easy (DOM)
  o validating trees should be easy (XSD)
  o mapping trees should be easy (XSLT)
  o querying tree collections should be easy (XQuery)
o XPath is to XML trees what regular expressions are to text-based information
why xpath (2)
o Different XML technologies need selection
  o XSLT needs it for selecting parts and manipulating them
  o XSD needs it for applying identity constraints
  o DOM needs it for extracting parts from an XML tree
  o XQuery needs it for writing XML-oriented queries
o XPath was created to be reusable
  o XML experts should only learn one selection language
  o this knowledge can be reused when learning new technologies
  o implementations can reuse code libraries
how xpath evolved
o XSL was designed as the new XML stylesheet language
  o XSL Transformations (XSLT) transform the input document
  o XSL Formatting Objects (XSL-FO) is what they will transform it to
o XSLT was designed to work on arbitrary XML input documents
  o started as a part of XSL (WD-xsl-19981216 → WD-xslt-19990421)
  o for selecting parts of the transformation input, a selection mechanism had to be provided
o XPath was turned into a standalone specification
  o started as a part of XSLT (WD-xslt-19990421 → WD-xslt-19990709)
  o reused in a number of other W3C specifications (XSD, DOM)
o Complete overhaul for XSLT 2.0 and XQuery
  o XPath 2.0 as the core language (current version is XPath 3.1)
  o a much larger set of functions and operators
  o the underlying data model which describes the foundation
XPath in the family of XML technologies
o a query language to extract data from an XML file or any collection of data that can be XML-like
o methods for creating internal and external links within XML documents, and associating metadata with those links
o a language for locating data within an XML document
o XPath is a language for finding data in an XML document (core XML technology)
o a language for transforming XML documents into other documents
Starting from the Infoset
o XPath operates on an abstract data model
  o a tree derived from the XML Information Set (XML Infoset)
  o a simplification (another one!) of the underlying XML
o The Infoset is turned into an XPath node tree
  o 11 infoset item types → 7 XPath tree node types
  o character items are merged into text nodes
  o namespace declarations are no longer visible as attributes
What is NOT in the Infoset
o Things which are not in the Infoset
  o the order of attributes in a start tag
  o the types of quotes around attribute values
  o character references and entities (e.g., &#252; or &uuml; for ü)
o And some more
  o namespace declarations are no longer visible as attributes
  o notations and unexpanded entity references
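The loss of these lexical details can be observed with any XML parser. The following is a minimal sketch using Python's standard-library xml.etree.ElementTree, which is not an Infoset implementation but discards the same details; the sample documents and the urn:example namespace are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Two hypothetical documents that differ only in attribute order,
# quote style, and the form of the character reference for "ü".
doc_a = ET.fromstring('<p lang="de" id="n1">M&#252;ller</p>')
doc_b = ET.fromstring("<p id='n1' lang='de'>M&#xFC;ller</p>")

# After parsing, none of these differences is visible any more.
same_attributes = doc_a.attrib == doc_b.attrib
same_text = doc_a.text == doc_b.text == "M\u00fcller"

# A namespace declaration is not reported as an attribute;
# it only shows up in the qualified element name.
ns_doc = ET.fromstring('<x:p xmlns:x="urn:example" lang="de"/>')
visible_attributes = dict(ns_doc.attrib)
qualified_name = ns_doc.tag

print(same_attributes, same_text, visible_attributes, qualified_name)
```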
xpath selectors: tree in / selection out
o XPath evaluates an expression based on a tree
o Where the tree comes from is out of XPath's scope
o The result of the evaluation is a selection
  o //img[not(@alt)] selects all images which have no alt attribute
  o count(//img) returns the number of images
  o /descendant::img[3]/@src returns the third image's src URI
  o starts-with(/html/@lang, 'en') tests whether the document's language is English
o Syntax errors may occur
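To make the four kinds of result concrete, here is a minimal sketch in Python. It assumes the standard-library xml.etree.ElementTree module, which supports only a small XPath subset, so not(), count(), and starts-with() are emulated in plain Python rather than evaluated by an XPath engine. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, not taken from the slides.
DOC = """<html lang="en-GB">
  <body>
    <img src="a.png" alt="first"/>
    <p><img src="b.png"/></p>
    <img src="c.png" alt="third"/>
  </body>
</html>"""
root = ET.fromstring(DOC)

# All img descendants in document order (the node set behind //img).
images = list(root.iter("img"))

# //img[not(@alt)] : a node set
no_alt = [img for img in images if img.get("alt") is None]

# count(//img) : a number
image_count = len(images)

# /descendant::img[3]/@src : the src value of the third image
third_src = images[2].get("src")

# starts-with(/html/@lang, 'en') : a boolean
is_english = root.get("lang", "").startswith("en")

print([img.get("src") for img in no_alt], image_count, third_src, is_english)
```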
xpath location paths (1): path structure
o Each location path consists of Location Steps
  o location steps are separated by /, like path names in file systems
o Similarities between XPath location paths and file systems
  o nodes in the XPath tree have different types
  o the type and number of nodes selected by one step
  o the direction in which each step moves
  o additional filters for selecting specific nodes
o Differences between XPath location paths and file systems
  o XPaths may return other data types than nodes
  o XPath provides a built-in function library
xpath location paths (2): file system v. xpath location
File System Path:   /   usr    local   apache   bin
# Selected Nodes:   1   1      1       1        1
XPath:              /   html   body    table    thead   tr
# Selected Nodes:   1   1      1       6        4       12
xpath location paths (3): test for nodes
o Name tests
  o testing for a particular name (elements/attributes): /html/head/title
  o wildcards (testing for any name): /html/head/*
o Node type tests
  o text nodes: text()
  o comment nodes: comment()
  o any nodes: node()
o Processing instruction tests
  o any PI: processing-instruction()
  o specific PI: processing-instruction("xml-stylesheet")
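The two name tests can be tried out with a short sketch. It assumes Python's xml.etree.ElementTree, whose XPath subset covers name tests and the * wildcard but not text(), comment(), node(), or processing-instruction(). Because ElementTree has no document node, the paths are written relative to the html root element. The sample document is invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<html>
  <head>
    <title>Demo</title>
    <meta name="author" content="nobody"/>
    <link rel="stylesheet" href="style.css"/>
  </head>
  <body/>
</html>"""
root = ET.fromstring(DOC)  # root is the html element

# /html/head/title : name test for one particular element name
titles = root.findall("head/title")

# /html/head/* : wildcard, any element child of head
head_children = root.findall("head/*")

title_names = [e.tag for e in titles]
child_names = [e.tag for e in head_children]
print(title_names, child_names)
```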
xpath axis (1): getting there
o File system paths go in one direction only
  o always one level down in the file system hierarchy
  o . and .. are clever directory shortcuts
  o other directions supported by tools (e.g., find)
o XPath allows steps in different directions
  o the default direction is child
  o other directions are explicitly specified: descendant::a
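A small sketch of steps in different directions, again assuming Python's xml.etree.ElementTree. ElementTree has no axis:: syntax, so the comments name the XPath axis that each abbreviated path corresponds to. The sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; the context node is the body element.
DOC = """<body>
  <a href="#top">top</a>
  <div>
    <p>See <a href="#more">more</a></p>
  </div>
</body>"""
root = ET.fromstring(DOC)

# child::a (the default axis): only a elements directly below body
child_links = root.findall("a")

# .//a selects the same nodes as descendant::a from this context
descendant_links = root.findall(".//a")

# .//a/.. steps back up: the parent of every a element
link_parents = root.findall(".//a/..")

child_hrefs = [a.get("href") for a in child_links]
descendant_hrefs = [a.get("href") for a in descendant_links]
parent_names = [e.tag for e in link_parents]
print(child_hrefs, descendant_hrefs, parent_names)
```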
xpath axis (2): axis peculiarities
o Attributes and Namespaces are not the children of elements, but
  o elements are their attributes' parent!
  o very counter-intuitive
  o very convenient
o Attributes and Namespaces are always leaves in the node tree
  o Attribute nodes have the attribute value as their value
  o Namespace nodes have the namespace name (i.e., a URI) as their value
o Namespace nodes exist because of namespace declarations
  o in the XPath node tree, only the namespace nodes are visible
  o the namespace declaration attributes (xmlns) are invisible
  o one namespace declaration potentially creates many namespace nodes
xpath axis (3)
XPath has thirteen axes:
o child
o parent
o descendant
o ancestor
o descendant-or-self
o ancestor-or-self
o following-sibling
o preceding-sibling
o following
o preceding
o attribute
o namespace
o self
putting it all together
o XPath location paths use a simple syntax
  o sequence of location steps, separated by /
o Each location step uses a simple structure (preceding::p[@class="warning"])
  o an axis followed by :: (if no axis is given, the default axis child is used)
  o a node test
  o 0-n predicates enclosed in []
o Location paths can be abbreviated
  o child:: can be omitted (default axis)
  o attribute:: can be written as @
  o . is an abbreviation for self::node()
  o .. is an abbreviation for parent::node()
  o // is an abbreviation for /descendant-or-self::node()/
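The abbreviation rules are mechanical enough to sketch as a string rewrite. The expand function below is a hypothetical helper, not part of any XPath library, and it is deliberately naive: it assumes that no / appears inside a predicate or string literal, and it leaves predicate contents untouched.

```python
def expand_step(step):
    """Rewrite one abbreviated location step into its unabbreviated form."""
    if step == "":
        return ""  # empty step: keeps the leading "/" when rejoined
    if step == ".":
        return "self::node()"
    if step == "..":
        return "parent::node()"
    name = step.split("[", 1)[0]  # the part before any predicate
    if "::" in name:
        return step  # an axis is already given explicitly
    if step.startswith("@"):
        return "attribute::" + step[1:]
    return "child::" + step  # no axis given: default axis


def expand(path):
    """Expand an abbreviated location path (naive sketch, see assumptions)."""
    path = path.replace("//", "/descendant-or-self::node()/")
    return "/".join(expand_step(step) for step in path.split("/"))


print(expand("//img/@src"))
print(expand("../p"))
```

A real implementation would tokenize the expression first; plain string splitting is only adequate for simple paths like the ones shown.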
location step filters
o Predicates are filters for each location step
  o there can be any number of filters (0-n)
  o each filter is applied to each selected node individually
o Each predicate is an XPath expression and is evaluated as a boolean
  o the context of this evaluation is the node for which the filter is evaluated
  o if the result is a number, it is compared with the position() function (/descendant::a[5])
o Predicates can only reduce the set of selected nodes
  o in the corner cases, the set of selected nodes stays unchanged or becomes empty
  o predicates are used in the majority of non-trivial XPath location paths
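Because a predicate belongs to one location step, a numeric predicate counts positions within that step. The sketch below, assuming Python's xml.etree.ElementTree and a made-up document, contrasts //a[2] (the second a child of each parent) with /descendant::a[2] (the second a in the whole document). The descendant form is emulated in Python because ElementTree has no explicit axis syntax.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document.
DOC = """<body>
  <p><a href="1" class="ext"/></p>
  <p><a href="2"/><a href="3" class="ext"/></p>
</body>"""
root = ET.fromstring(DOC)

# //a[2]: the position is counted among the a children of each parent
second_per_parent = [a.get("href") for a in root.findall(".//a[2]")]

# /descendant::a[2]: the position is counted among all a descendants
second_overall = list(root.iter("a"))[1].get("href")

# A boolean predicate: keep only the a elements with class="ext"
external = [a.get("href") for a in root.findall(".//a[@class='ext']")]

print(second_per_parent, second_overall, external)
```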
location path processing
Location paths are processed in a very simple way.
1. start with a given context
2. for each location step, repeat the following steps:
3. based on the context and the axis, select the nodes on this axis
4. reduce this selection to the nodes identified by the node test
5. sequentially apply all filters to each of these nodes
6. take the remaining node set as the context for the next location step
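The processing loop can be sketched as a tiny evaluator. This is an illustration under several assumptions, not how a real XPath engine is built: it works on xml.etree.ElementTree elements, supports only three axes and name tests, represents predicates as Python callables that return a boolean (so positional predicates are not covered), and does not re-sort results into document order. The function and variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Each axis maps a context node to the list of nodes on that axis.
AXES = {
    "self": lambda node: [node],
    "child": lambda node: list(node),
    "descendant": lambda node: [d for d in node.iter() if d is not node],
}


def evaluate(context_nodes, steps):
    """Evaluate steps given as (axis, name_test, predicates) tuples."""
    for axis, name_test, predicates in steps:
        selected = []
        for context in context_nodes:
            # 3. select the nodes on this axis
            candidates = AXES[axis](context)
            # 4. reduce the selection by the node test
            candidates = [n for n in candidates
                          if name_test == "*" or n.tag == name_test]
            # 5. apply all filters sequentially
            for predicate in predicates:
                candidates = [n for n in candidates if predicate(n)]
            # collect the survivors without duplicates
            for n in candidates:
                if not any(n is s for s in selected):
                    selected.append(n)
        # 6. the remaining nodes are the context for the next step
        context_nodes = selected
    return context_nodes


# Hypothetical sample document.
DOC = ("<html><body>"
       "<table><thead><tr/><tr/></thead></table>"
       "<table class='data'><thead><tr/></thead></table>"
       "</body></html>")
root = ET.fromstring(DOC)

# Roughly /html/body/table/thead/tr; ElementTree has no document node,
# so the first step tests the root element itself.
all_rows = evaluate([root], [
    ("self", "html", []), ("child", "body", []), ("child", "table", []),
    ("child", "thead", []), ("child", "tr", []),
])

# Roughly /html/body/table[@class='data']/descendant::tr
data_rows = evaluate([root], [
    ("self", "html", []), ("child", "body", []),
    ("child", "table", [lambda n: n.get("class") == "data"]),
    ("descendant", "tr", []),
])
print(len(all_rows), len(data_rows))
```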
xpath expressions
o XPath is a full expression language
  o any evaluated expression in XSLT is an XPath
  o XPath must be able to calculate operations on non-XML data types
o XPath uses a very simple data model
  o node sets: //img[not(@alt)]
  o number: count(//img)
  o string: /descendant::img[3]/@src
  o boolean: starts-with(/html/@lang, 'en')
xpath usage
o XPath is used in different technologies
  o XSLT uses XPath as its expression language
  o XSD uses XPath for selecting identity constraint nodes
  o DOM uses XPath as a way to select DOM nodes
o Depending on the environment, expressions must yield certain results
  o for conditionals, a boolean must be returned
  o iterations (in XSLT) only loop over nodes
  o when printing out text, a string must be produced
o XPath has built-in rules for casting types
  o node set → boolean: empty is false, non-empty is true
  o node → string: take the string value (i.e., concatenate all text node descendants)
  o string → number: interpret as decimal notation (otherwise return NaN)
o XPaths often return surprising results (//a[starts-with(@href, https)])
  o the unquoted https is a name test for child elements, not the string 'https'
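The three casting rules can be sketched in a few lines of Python. The helper names are hypothetical, the nodes are xml.etree.ElementTree elements, and the number pattern follows the XPath 1.0 notion of decimal notation (optional minus sign, digits, optional fraction, no exponent) as an approximation.

```python
import math
import re
import xml.etree.ElementTree as ET

# Decimal notation: optional whitespace, optional minus, digits, optional fraction.
_DECIMAL = re.compile(r"^\s*-?(\d+(\.\d*)?|\.\d+)\s*$")


def to_boolean(node_set):
    """node set -> boolean: empty is false, non-empty is true."""
    return len(node_set) > 0


def string_value(node):
    """node -> string: concatenate all text node descendants."""
    return "".join(node.itertext())


def to_number(text):
    """string -> number: decimal notation, otherwise NaN."""
    return float(text) if _DECIMAL.match(text) else math.nan


# Hypothetical sample element.
p = ET.fromstring("<p>Total: <b>42</b> items</p>")

print(string_value(p))
print(to_number(string_value(p.find("b"))))
print(to_number(string_value(p)))
print(to_boolean(p.findall("b")), to_boolean(p.findall("i")))
```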
xpath functions
o XPath has a small library of built-in functions
  o useful for basic XPath-level functions
  o other specs are allowed to extend it (XSLT does it)
o XPath functions return results of various data types
  o boolean: boolean, contains, false, lang, not, starts-with, true
  o number: ceiling, count, floor, last, number, position, round, string-length, sum
  o string: concat, local-name, name, namespace-uri, normalize-space, string, substring, substring-after, substring-before, translate
  o node set: id
using functions
o Functions and location paths are orthogonal
  o each construct may be based on the other
  o it is possible to nest them arbitrarily
  o predicates often contain functions: //a[substring(@href, string-length(@href)-2)='pdf']
o XPaths can become powerful and complex
  o writing some code or thinking about an XPath?
o XPaths are more declarative
  o they may be more robust against changes in the XML schema
  o they can be optimized by a smart XPath implementation
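The substring predicate works because XPath 1.0 string positions are 1-based, so string-length(@href)-2 is the position of the third character from the end. The sketch below re-implements that behaviour in Python to show why the predicate acts as an "ends with pdf" test. The helper names are hypothetical, and NaN or infinite arguments are not handled.

```python
import math


def xpath_round(x):
    """XPath round(): nearest integer, halves rounded up."""
    return math.floor(x + 0.5)


def xpath_substring(s, start, length=None):
    """Sketch of XPath 1.0 substring() with 1-based positions."""
    first = xpath_round(start)
    end = None if length is None else first + xpath_round(length)
    return "".join(
        ch for pos, ch in enumerate(s, start=1)
        if pos >= first and (end is None or pos < end)
    )


def ends_with_pdf(href):
    """Mirrors the predicate substring(@href, string-length(@href)-2)='pdf'."""
    return xpath_substring(href, len(href) - 2) == "pdf"


print(xpath_substring("12345", 2, 3))
print(ends_with_pdf("report.pdf"), ends_with_pdf("index.html"))
```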
xpath 3.0 functions & operators
[Figure: word cloud of XPath 3.0 function and operator names, for example abs, ceiling, distinct-values, format-date, string-join, tokenize, and the date, time, and duration operators such as add-dayTimeDuration-to-date]
xpath selects
o Query languages select and recombine
  1. look up all addresses by post code
  2. for each post code, count the number of addresses
o XSLT fills in the missing parts (as a programming language)
  o XSLT can construct XML and re-apply XPath
o XQuery fills in the missing parts (query-wise)
  o 80% of XQuery is XPath (in version 2.0, though)
  o the remaining 20% are bindings, constructors, and glue
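The split between selecting and recombining can be sketched with the post code example. Here Python stands in for the host language (the role the slide gives to XSLT or XQuery): the path expressions only select, while grouping and counting happen outside XPath. The address data and element names are hypothetical, and xml.etree.ElementTree is assumed as the XPath-subset engine.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical address list; element names and post codes are made up.
DOC = """<addresses>
  <address><name>Ada</name><postcode>BS1</postcode></address>
  <address><name>Ben</name><postcode>BA2</postcode></address>
  <address><name>Cy</name><postcode>BS1</postcode></address>
</addresses>"""
root = ET.fromstring(DOC)

# 1. Selection: look up all addresses with a given post code.
in_bs1 = root.findall("address[postcode='BS1']")
names_in_bs1 = [a.findtext("name") for a in in_bs1]

# 2. Recombination: count the addresses per post code.
#    The path selects the values; the grouping is done by the host language.
counts = Counter(pc.text for pc in root.findall("address/postcode"))

print(names_in_bs1, dict(counts))
```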
xpath conclusions
o XPath is a basic tool of the XML toolbox
o XPath is reused in various XML technologies
o XPath selects parts of an XML document
o XPath can do more general things by using expressions
resources
o XPath with PHP: https://www.ibm.com/developerworks/library/x-xpathphp/index.html
o XPath syntax: https://msdn.microsoft.com/en-us/library/ms256471(v=vs.110).aspx
o XPath functions: http://dh.obdurodon.org/functions.xhtml
o XPath Tester & Evaluator: https://www.freeformatter.com/xpath-tester.html