Annotation of embedaddon/libxml2/doc/xmlreader.html, revision 1.1
1.1 ! misho 1: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
! 2: "http://www.w3.org/TR/html4/loose.dtd">
! 3: <html>
! 4: <head>
! 5: <meta http-equiv="Content-Type" content="text/html">
! 6: <style type="text/css"></style>
! 7: <!--
! 8: TD {font-family: Verdana,Arial,Helvetica}
! 9: BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
! 10: H1 {font-family: Verdana,Arial,Helvetica}
! 11: H2 {font-family: Verdana,Arial,Helvetica}
! 12: H3 {font-family: Verdana,Arial,Helvetica}
! 13: A:link, A:visited, A:active { text-decoration: underline }
! 14: </style>
! 15: -->
! 16: <title>Libxml2 XmlTextReader Interface tutorial</title>
! 17: </head>
! 18:
! 19: <body bgcolor="#fffacd" text="#000000">
! 20: <h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>
! 21:
! 22: <p></p>
! 23:
! 24: <p>This document describes the use of the XmlTextReader streaming API added
! 25: to libxml2 in version 2.5.0. This API is closely modeled after the <a
! 26: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
! 27: and <a
! 28: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
! 29: classes of the C# language.</p>
! 30:
! 31: <p>This tutorial will present the key points of this API, and working
! 32: examples using both C and the Python bindings:</p>
! 33:
! 34: <p>Table of contents:</p>
! 35: <ul>
! 36: <li><a href="#Introducti">Introduction: why a new API</a></li>
! 37: <li><a href="#Walking">Walking a simple tree</a></li>
! 38: <li><a href="#Extracting">Extracting information for the current
! 39: node</a></li>
! 40: <li><a href="#Extracting1">Extracting information for the
! 41: attributes</a></li>
! 42: <li><a href="#Validating">Validating a document</a></li>
! 43: <li><a href="#Entities">Entities substitution</a></li>
! 44: <li><a href="#L1142">Relax-NG Validation</a></li>
! 45: <li><a href="#Mixing">Mixing the reader and tree or XPath
! 46: operations</a></li>
! 47: </ul>
! 48:
! 49: <p></p>
! 50:
! 51: <h2><a name="Introducti">Introduction: why a new API</a></h2>
! 52:
! 53: <p>Libxml2 <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
! 54: tree based</a>, where the parsing operation results in a document loaded
! 55: completely in memory and exposed as a tree of nodes all available at the
! 56: same time. This is very simple and quite powerful, but has the major
! 57: limitation that the size of the document that can be handled is limited by
! 58: the size of the memory available. Libxml2 also provides a <a
! 59: href="http://www.saxproject.org/">SAX</a> based API, but that version was
! 60: designed upon one of the early <a
! 61: href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and SAX is
! 62: also not formally defined for C. SAX basically works by registering callbacks
! 63: which are called directly by the parser as it progresses through the document
! 64: stream. The problem is that this programming model is relatively complex,
! 65: not well standardized, cannot provide validation directly, and makes entity,
! 66: namespace and base processing relatively hard.</p>
! 67:
! 68: <p>The <a
! 69: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
! 70: API from C#</a> provides a far simpler programming model. The API acts as a
! 71: cursor going forward on the document stream and stopping at each node in the
! 72: way. The user's code keeps control of the progress and simply calls a
! 73: Read() function repeatedly to progress to each node in sequence in document
! 74: order. There is direct support for namespaces, xml:base, entity handling and
! 75: adding DTD validation on top of it was relatively simple. This API is really
! 76: close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
! 77: specification</a>. This provides a far more standard, easy to use and powerful
! 78: API than the existing SAX. Moreover integrating extension features based on
! 79: the tree seems relatively easy.</p>
! 80:
! 81: <p>In a nutshell the XmlTextReader API provides a simpler, more standard and
! 82: more extensible interface to handle large documents than the existing SAX
! 83: version.</p>
! 84:
! 85: <h2><a name="Walking">Walking a simple tree</a></h2>
! 86:
! 87: <p>Basically the XmlTextReader API is a forward only tree walking interface.
! 88: The basic steps are:</p>
! 89: <ol>
! 90: <li>prepare a reader context operating on some input</li>
! 91: <li>run a loop iterating over all nodes in the document</li>
! 92: <li>free up the reader context</li>
! 93: </ol>
! 94:
! 95: <p>Here is a basic C sample doing this:</p>
! 96: <pre>#include &lt;libxml/xmlreader.h&gt;
! 97:
! 98: void processNode(xmlTextReaderPtr reader) {
! 99:     /* handling of a node in the tree */
! 100: }
! 101:
! 102: void streamFile(const char *filename) {
! 103:     xmlTextReaderPtr reader;
! 104:     int ret;
! 105:
! 106:     reader = xmlNewTextReaderFilename(filename);
! 107:     if (reader != NULL) {
! 108:         ret = xmlTextReaderRead(reader);
! 109:         while (ret == 1) {
! 110:             processNode(reader);
! 111:             ret = xmlTextReaderRead(reader);
! 112:         }
! 113:         xmlFreeTextReader(reader);
! 114:         if (ret != 0) {
! 115:             printf("%s : failed to parse\n", filename);
! 116:         }
! 117:     } else {
! 118:         printf("Unable to open %s\n", filename);
! 119:     }
! 120: }</pre>
! 121:
! 122: <p>A few things to notice:</p>
! 123: <ul>
! 124: <li>the include file needed: <code>libxml/xmlreader.h</code></li>
! 125: <li>the creation of the reader using a filename</li>
! 126: <li>the repeated call to xmlTextReaderRead() and how any return value
! 127: different from 1 should stop the loop</li>
! 128: <li>that a negative return means a parsing error</li>
! 129: <li>how xmlFreeTextReader() should be used to free up the resources used by
! 130: the reader.</li>
! 131: </ul>
! 132:
! 133: <p>Here is similar code in Python for exactly the same processing:</p>
! 134: <pre>import libxml2
! 135:
! 136: def processNode(reader):
! 137:     pass
! 138:
! 139: def streamFile(filename):
! 140:     try:
! 141:         reader = libxml2.newTextReaderFilename(filename)
! 142:     except Exception:
! 143:         print("unable to open %s" % (filename))
! 144:         return
! 145:
! 146:     ret = reader.Read()
! 147:     while ret == 1:
! 148:         processNode(reader)
! 149:         ret = reader.Read()
! 150:
! 151:     if ret != 0:
! 152:         print("%s : failed to parse" % (filename))</pre>
! 153:
! 154: <p>The only things worth adding are that the <a
! 155: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
! 156: is abstracted as a class like in C#</a> with the same method names (but the
! 157: properties are currently accessed with methods) and that one doesn't need to
! 158: free the reader at the end of the processing. It will get garbage collected
! 159: once all references have disappeared.</p>
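<p>The read loop shown above can be factored into a small helper. This is only a sketch: <code>walk()</code> and its <code>process_node</code> callback are hypothetical names, not part of the bindings, and the function assumes nothing about the reader beyond a <code>Read()</code> method returning 1, 0 or another value on error, as described above.</p>

```python
def walk(reader, process_node):
    """Call process_node(reader) on every node and return how many were seen.

    `reader` is any object with a Read() method following the convention
    described in the text: 1 for "stopped on a node", 0 for end of document,
    anything else for a parsing error.
    """
    count = 0
    ret = reader.Read()
    while ret == 1:
        process_node(reader)
        count += 1
        ret = reader.Read()
    if ret != 0:
        # Read() stopped for a reason other than a clean end of document
        raise ValueError("failed to parse (Read() returned %d)" % ret)
    return count
```

<p>With the bindings installed, a call would look like <code>walk(libxml2.newTextReaderFilename("tst.xml"), processNode)</code>. Raising an exception instead of printing is a design choice of this sketch, made so that callers can decide how to report the failure.</p>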
! 160:
! 161: <h2><a name="Extracting">Extracting information for the current node</a></h2>
! 162:
! 163: <p>So far the example code did not indicate how information was extracted
! 164: from the reader. It was abstracted as a call to the processNode() routine,
! 165: with the reader as the argument. At each invocation, the parser is stopped on
! 166: a given node and the reader can be used to query that node's properties. Each
! 167: <em>Property</em> is available at the C level as a function taking a single
! 168: xmlTextReaderPtr argument whose name is
! 169: <code>xmlTextReader</code><em>Property</em>. If the return type is an
! 170: <code>xmlChar *</code> string then it must be deallocated with
! 171: <code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
! 172: <em>Property</em> method to the reader class that can be called on the
! 173: instance. The list of the properties is based on the <a
! 174: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
! 175: XmlTextReader class</a> set of properties and methods:</p>
! 176: <ul>
! 177: <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
! 178: element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
! 179: entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
! 180: 9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
! 181: fragment and 12 for notation nodes.</li>
! 182: <li><em>Name</em>: the <a
! 183: href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
! 184: name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
! 185: <li><em>LocalName</em>: the <a
! 186: href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
! 187: the node.</li>
! 188: <li><em>Prefix</em>: a shorthand reference to the <a
! 189: href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
! 190: the node.</li>
! 191: <li><em>NamespaceUri</em>: the URI defining the <a
! 192: href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
! 193: the node.</li>
! 194: <li><em>BaseUri:</em> the base URI of the node. See the <a
! 195: href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
! 196: <li><em>Depth:</em> the depth of the node in the tree, starts at 0 for the
! 197: root node.</li>
! 198: <li><em>HasAttributes</em>: whether the node has attributes.</li>
! 199: <li><em>HasValue</em>: whether the node can have a text value.</li>
! 200: <li><em>Value</em>: provides the text value of the node if present.</li>
! 201: <li><em>IsDefault</em>: whether an Attribute node was generated from the
! 202: default value defined in the DTD or schema (<em>unsupported
! 203: yet</em>).</li>
! 204: <li><em>XmlLang</em>: the <a
! 205: href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
! 206: within which the node resides.</li>
! 207: <li><em>IsEmptyElement</em>: checks whether the current node is empty. This is a
! 208: bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
! 209: empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
! 210: <li><em>AttributeCount</em>: provides the number of attributes of the
! 211: current node.</li>
! 212: </ul>
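<p>When reading traces, the numeric <em>NodeType</em> codes are easier to follow with names attached. The table below is a minimal sketch that simply transcribes the codes listed above; <code>NODE_TYPE_NAMES</code> and <code>node_type_name()</code> are hypothetical helper names, and any code not mentioned in the list is reported as unknown rather than guessed.</p>

```python
# Codes and descriptions transcribed from the NodeType entry in the list above.
NODE_TYPE_NAMES = {
    1: "start element",
    2: "attribute",
    3: "text",
    4: "CData section",
    5: "entity reference",
    6: "entity declaration",
    7: "PI",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "end of element",
}


def node_type_name(code):
    """Return a readable name for a NodeType code, or an 'unknown' marker."""
    return NODE_TYPE_NAMES.get(code, "unknown (%d)" % code)
```

<p>Inside a <code>processNode()</code> routine this would be used as <code>node_type_name(reader.NodeType())</code>.</p>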
! 213:
! 214: <p>Let's look first at a small example to get this in practice by redefining
! 215: the processNode() function in the Python example:</p>
! 216: <pre>def processNode(reader):
! 217:     print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
! 218:                            reader.Name(), reader.IsEmptyElement()))</pre>
! 219:
! 220: <p>and look at the result of calling streamFile("tst.xml") for various
! 221: content of the XML test file.</p>
! 222:
! 223: <p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
! 224: <pre>0 1 doc 1</pre>
! 225:
! 226: <p>Only one node is found: its depth is 0, type 1 indicates an element start,
! 227: its name is "doc" and it is empty. Trying now with
! 228: "<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
! 229: <pre>0 1 doc 0
! 230: 0 15 doc 0</pre>
! 231:
! 232: <p>The document root node is not flagged as empty anymore and both a start
! 233: and an end of element are detected. The following document shows how
! 234: character data are reported:</p>
! 235: <pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
! 236: &lt;c/&gt;&lt;/doc&gt;</pre>
! 237:
! 238: <p>We modify the processNode() function to also report the node Value:</p>
! 239: <pre>def processNode(reader):
! 240:     print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
! 241:                               reader.Name(), reader.IsEmptyElement(),
! 242:                               reader.Value()))</pre>
! 243:
! 244: <p>The result of the test is:</p>
! 245: <pre>0 1 doc 0 None
! 246: 1 1 a 1 None
! 247: 1 1 b 0 None
! 248: 2 3 #text 0 some text
! 249: 1 15 b 0 None
! 250: 1 3 #text 0
! 251:
! 252: 1 1 c 1 None
! 253: 0 15 doc 0 None</pre>
! 254:
! 255: <p>There are a few things to note:</p>
! 256: <ul>
! 257: <li>the increase of the depth value (first row) as children nodes are
! 258: explored</li>
! 259: <li>the text node child of the b element, of type 3 and its content</li>
! 260: <li>the text node containing the line return between elements b and c</li>
! 261: <li>that elements have the Value None (or NULL in C)</li>
! 262: </ul>
! 263:
! 264: <p>The equivalent routine for <code>processNode()</code> as used by
! 265: <code>xmllint --stream --debug</code> is the following and can be found in
! 266: the xmllint.c module in the source distribution:</p>
! 267: <pre>static void processNode(xmlTextReaderPtr reader) {
! 268:     xmlChar *name, *value;
! 269:
! 270:     name = xmlTextReaderName(reader);
! 271:     if (name == NULL)
! 272:         name = xmlStrdup(BAD_CAST "--");
! 273:     value = xmlTextReaderValue(reader);
! 274:
! 275:     printf("%d %d %s %d",
! 276:            xmlTextReaderDepth(reader),
! 277:            xmlTextReaderNodeType(reader),
! 278:            name,
! 279:            xmlTextReaderIsEmptyElement(reader));
! 280:     xmlFree(name);
! 281:     if (value == NULL)
! 282:         printf("\n");
! 283:     else {
! 284:         printf(" %s\n", value);
! 285:         xmlFree(value);
! 286:     }
! 287: }</pre>
! 288:
! 289: <h2><a name="Extracting1">Extracting information for the attributes</a></h2>
! 290:
! 291: <p>The previous examples don't indicate how attributes are processed. The
! 292: simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
! 293: result:</p>
! 294: <pre>0 1 doc 1 None</pre>
! 295:
! 296: <p>This proves that attribute nodes are not traversed by default. The
! 297: <em>HasAttributes</em> property allows detecting their presence. To check
! 298: their content the API has special instructions. Basically two kinds of operations
! 299: are possible:</p>
! 300: <ol>
! 301: <li>to move the reader to the attribute nodes of the current element, in
! 302: that case the cursor is positioned on the attribute node</li>
! 303: <li>to directly query the element node for the attribute value</li>
! 304: </ol>
! 305:
! 306: <p>In both cases the attribute can be designated either by its position in the
! 307: list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
! 308: by its name (and namespace):</p>
! 309: <ul>
! 310: <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
! 311: the specified index no relative to the containing element.</li>
! 312: <li><em>GetAttribute</em>(name): provides the value of the attribute with
! 313: the specified qualified name.</li>
! 314: <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of the
! 315: attribute with the specified local name and namespace URI.</li>
! 316: <li><em>MoveToAttributeNo</em>(no): moves the position of the current
! 317: instance to the attribute with the specified index relative to the
! 318: containing element.</li>
! 319: <li><em>MoveToAttribute</em>(name): moves the position of the current
! 320: instance to the attribute with the specified qualified name.</li>
! 321: <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
! 322: of the current instance to the attribute with the specified local name
! 323: and namespace URI.</li>
! 324: <li><em>MoveToFirstAttribute</em>: moves the position of the current
! 325: instance to the first attribute associated with the current node.</li>
! 326: <li><em>MoveToNextAttribute</em>: moves the position of the current
! 327: instance to the next attribute associated with the current node.</li>
! 328: <li><em>MoveToElement</em>: moves the position of the current instance to
! 329: the node that contains the current Attribute node.</li>
! 330: </ul>
! 331:
! 332: <p>After modifying the processNode() function to show attributes:</p>
! 333: <pre>def processNode(reader):
! 334:     print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
! 335:                               reader.Name(), reader.IsEmptyElement(),
! 336:                               reader.Value()))
! 337:     if reader.NodeType() == 1: # Element
! 338:         while reader.MoveToNextAttribute() == 1:
! 339:             print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
! 340:                                           reader.Name(), reader.Value()))</pre>
! 341:
! 342: <p>The output for the same input document reflects the attribute:</p>
! 343: <pre>0 1 doc 1 None
! 344: -- 1 2 (a) [b]</pre>
! 345:
! 346: <p>There are a couple of things to note on the attribute processing:</p>
! 347: <ul>
! 348: <li>Their depth is the one of the carrying element plus one.</li>
! 349: <li>Namespace declarations are seen as attributes, as in DOM.</li>
! 350: </ul>
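<p>The cursor-moving style described above can be wrapped so that callers get the attributes back as plain data. The following is a sketch under stated assumptions: <code>element_attributes()</code> is a hypothetical helper, it relies only on the <em>NodeType</em>, <em>MoveToNextAttribute</em>, <em>Name</em>, <em>Value</em> and <em>MoveToElement</em> methods named in this section, and it assumes <em>MoveToNextAttribute</em> returns 1 while an attribute is available.</p>

```python
def element_attributes(reader):
    """Return [(name, value), ...] for the element the reader is stopped on.

    The cursor is moved across the attribute nodes and then put back on the
    element with MoveToElement(), so the caller's position is unchanged.
    """
    attrs = []
    if reader.NodeType() != 1:      # 1 is a start element
        return attrs
    while reader.MoveToNextAttribute() == 1:
        attrs.append((reader.Name(), reader.Value()))
    if attrs:
        reader.MoveToElement()      # leave the attribute nodes
    return attrs
```

<p>Returning to the element before handing control back matters when the caller goes on to query element-level properties such as <em>Depth</em> or <em>IsEmptyElement</em>.</p>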
! 351:
! 352: <h2><a name="Validating">Validating a document</a></h2>
! 353:
! 354: <p>The Libxml2 implementation adds some extra features on top of the XmlTextReader
! 355: API. The main one is the ability to DTD validate the parsed document
! 356: progressively. This is simply the activation of the associated feature of the
! 357: parser used by the reader structure. There are a few options available
! 358: defined as the enum xmlParserProperties in the libxml/xmlreader.h header
! 359: file:</p>
! 360: <ul>
! 361: <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
! 362: <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
! 363: loading the DTD)</li>
! 364: <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
! 365: the DTD)</li>
! 366: <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
! 367: reference nodes are not generated and are replaced by their expanded
! 368: content.</li>
! 369: <li>more settings might be added; those were the ones available at the 2.5.0
! 370: release...</li>
! 371: </ul>
! 372:
! 373: <p>The GetParserProp() and SetParserProp() methods can then be used to get
! 374: and set the values of those parser properties of the reader. For example:</p>
! 375: <pre>def parseAndValidate(file):
! 376:     reader = libxml2.newTextReaderFilename(file)
! 377:     reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
! 378:     ret = reader.Read()
! 379:     while ret == 1:
! 380:         ret = reader.Read()
! 381:     if ret != 0:
! 382:         print("Error parsing and validating %s" % (file))</pre>
! 383:
! 384: <p>This routine will parse and validate the file. Error messages can be
! 385: captured by registering an error handler. See python/tests/reader2.py for
! 386: more complete Python examples. At the C level the equivalent call to activate
! 387: the validation feature is just:</p>
! 388: <pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>
! 389:
! 390: <p>and a return value of 0 indicates success.</p>
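<p>The pieces above (setting the property, draining the reader, then checking the outcome) can be combined into one routine. This is a hedged sketch, not the bindings' own API: <code>parse_and_validate()</code> is a hypothetical name, the property constant is passed in by the caller (for instance <code>libxml2.PARSER_VALIDATE</code>), and the final <code>IsValid()</code> check assumes the reader exposes that method for DTD validation as it does for Relax-NG.</p>

```python
def parse_and_validate(reader, validate_prop):
    """Return True if the document parsed cleanly and was found valid.

    `validate_prop` is the parser property to switch on; it is passed in by
    the caller so that this sketch does not depend on any constant value.
    """
    if reader.SetParserProp(validate_prop, 1) != 0:   # 0 indicates success
        raise RuntimeError("could not activate validation")
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        return False                # the parse itself failed
    return reader.IsValid() == 1
```

<p>Separating "could not parse" from "parsed but invalid" would need a richer return value. The boolean is kept here only to keep the sketch short.</p>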
! 391:
! 392: <h2><a name="Entities">Entities substitution</a></h2>
! 393:
! 394: <p>By default the xmlReader will report entities as such and not replace them
! 395: with their content. This default behaviour can however be overridden using:</p>
! 396:
! 397: <p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>
! 398:
! 399: <h2><a name="L1142">Relax-NG Validation</a></h2>
! 400:
! 401: <p style="font-size: 10pt">Introduced in version 2.5.7</p>
! 402:
! 403: <p>Libxml2 can now validate the document being read through the xmlReader against
! 404: Relax-NG schemas. While the Relax NG validator can't always work in a
! 405: streamable mode, only subsets which cannot be reduced to regular expressions
! 406: need to have their subtree expanded for validation. In practice it means
! 407: that, unless the schema for the top level element content is not expressible
! 408: as a regexp, only chunks of the document need to be parsed while
! 409: validating.</p>
! 410:
! 411: <p>The steps to do so are:</p>
! 412: <ul>
! 413: <li>create a reader working on a document as usual</li>
! 414: <li>before any call to read, associate it with a Relax NG schema, either the
! 415: preparsed schema or the URL of the schema to use</li>
! 416: <li>errors will be reported the usual way, and the validity status can be
! 417: obtained using the IsValid() interface of the reader like for DTDs.</li>
! 418: </ul>
! 419:
! 420: <p>Example, assuming the reader has already been created and that the schema
! 421: string contains the Relax-NG schema:</p>
! 422: <pre><code>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
! 423: rngs = rngp.relaxNGParse()
! 424: reader.RelaxNGSetSchema(rngs)
! 425: ret = reader.Read()
! 426: while ret == 1:
! 427:     ret = reader.Read()
! 428: if ret != 0:
! 429:     print("Error parsing the document")
! 430: if reader.IsValid() != 1:
! 431:     print("Document failed to validate")</code>
! 432: </pre>
! 433:
! 434: <p>See <code>reader6.py</code> in the sources or documentation for a complete
! 435: example.</p>
! 436:
! 437: <h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>
! 438:
! 439: <p style="font-size: 10pt">Introduced in version 2.5.7</p>
! 440:
! 441: <p>While the reader is a streaming interface, its underlying implementation
! 442: is based on the DOM builder of libxml2. As a result it is relatively simple
! 443: to mix operations based on both models under some constraints. To do so the
! 444: reader has an Expand() operation that grows the subtree under the
! 445: current node. It returns a pointer to a standard node which can be
! 446: manipulated in the usual ways. The node will get all its ancestors and the
! 447: full subtree available. Usual operations like XPath queries can be used on
! 448: that reduced view of the document. Here is an example extracted from
! 449: reader5.py in the sources which extracts and prints the bibliography for the
! 450: "Dragon" compiler book from the XML 1.0 recommendation:</p>
! 451: <pre>f = open('../../test/valid/REC-xml-19980210.xml')
! 452: input = libxml2.inputBuffer(f)
! 453: reader = input.newTextReader("REC")
! 454: res = ""
! 455: while reader.Read() == 1:
! 456:     while reader.Name() == 'bibl':
! 457:         node = reader.Expand()            # expand the subtree
! 458:         if node.xpathEval("@id = 'Aho'"): # use XPath on it
! 459:             res = res + node.serialize()
! 460:         if reader.Next() != 1:            # skip the subtree
! 461:             break</pre>
! 462:
! 463: <p>Note, however, that the node instance returned by the Expand() call is only
! 464: valid until the next Read() operation. The Expand() operation does not
! 465: affect the Read() ones. However, once processed, the full subtree is usually
! 466: not useful anymore, and the Next() operation allows skipping it completely and
! 467: proceeding to the successor, or returns 0 if the document end is reached.</p>
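<p>The Expand()/Next() pattern can be generalised to any element name. The function below is only a sketch: <code>collect_expanded()</code> and its <code>keep</code> predicate are hypothetical names, and it assumes that <code>Next()</code> follows the same 1/0/error return convention as <code>Read()</code>. Because the expanded node is only valid until the reader moves on, the sketch serializes it immediately instead of storing the node.</p>

```python
def collect_expanded(reader, element_name, keep):
    """Serialize every `element_name` subtree for which keep(node) is true."""
    kept = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == 1 and reader.Name() == element_name:
            node = reader.Expand()            # build the subtree in memory
            if keep(node):
                kept.append(node.serialize()) # copy out before moving on
            ret = reader.Next()               # skip the subtree just handled
        else:
            ret = reader.Read()
    if ret != 0:
        raise ValueError("failed to parse (reader returned %d)" % ret)
    return kept
```

<p>The example from reader5.py would then correspond to <code>collect_expanded(reader, 'bibl', lambda n: n.xpathEval("@id = 'Aho'"))</code>.</p>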
! 468:
! 469: <p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>
! 470:
! 471: <p>$Id$</p>
! 472:
! 473: <p></p>
! 474: </body>
! 475: </html>