Tue Feb 21 23:37:59 2012 UTC (12 years, 4 months ago) by misho
<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css"></style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# class library.

This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.

Table of contents:
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
    node</a></li>
  <li><a href="#Extracting1">Extracting information for the
    attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entities substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
    operations</a></li>
</ul>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

Libxml2 <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in
memory and exposes it as a tree of nodes, all available at the same time.
This is very simple and quite powerful, but has the major limitation that
the size of the document that can be handled is limited by the size of the
memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is also not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex, is not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively hard.

The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and
powerful API than the existing SAX. Moreover, integrating extension features
based on the tree seems relatively easy.

In a nutshell the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.

<h2><a name="Walking">Walking a simple tree</a></h2>

Basically the XmlTextReader API is a forward only tree walking interface.
The basic steps are:
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

Here is a basic C sample doing this:
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* returns 0 on success, -1 if the file cannot be opened or parsed */
int streamFile(char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    return ret;
}</pre>

A few things to notice:
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

Here is similar code in Python for exactly the same processing:
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is
abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.

<h2><a name="Extracting">Extracting information for the current node</a></h2>

So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
Property is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code>Property; if the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
Property method on the reader class that can be called on the
instance.
The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:
<ul>
  <li>NodeType: The node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li>Name: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (Prefix:)LocalName.</li>
  <li>LocalName: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li>Prefix: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li>NamespaceUri: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li>BaseUri: the base URI of the node.
    See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li>Depth: the depth of the node in the tree, starts at 0 for the
    root node.</li>
  <li>HasAttributes: whether the node has attributes.</li>
  <li>HasValue: whether the node can have a text value.</li>
  <li>Value: provides the text value of the node if present.</li>
  <li>IsDefault: whether an Attribute node was generated from the
    default value defined in the DTD or schema (not supported
    yet).</li>
  <li>XmlLang: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li>IsEmptyElement: check if the current node is empty. This is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li>AttributeCount: provides the number of attributes of the
    current node.</li>
</ul>

Let's look first at a small example to put this in practice by redefining
the processNode() function in the Python example:
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.

For the minimal document "<code>&lt;doc/&gt;</code>" we get:
<pre>0 1 doc 1</pre>

Only one node is found: its depth is 0, type 1 indicates an element start,
its name is "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:
<pre>0 1 doc 0
0 15 doc 0</pre>

The document root node is not flagged as empty anymore and both a start
and an end of element are detected.
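
The numeric node types in these traces (1, 15, ...) are easier to follow
with readable labels. The following Python sketch is not part of libxml2:
NODE_TYPE_NAMES, node_type_name() and describe() are hypothetical helper
names, and the table only covers the values given in the NodeType list
above.

```python
# Hypothetical helper, not a libxml2 API: labels for the numeric NodeType
# values enumerated in the NodeType property description.
NODE_TYPE_NAMES = {
    1: "start element",
    2: "attribute",
    3: "text",
    4: "CData section",
    5: "entity reference",
    6: "entity declaration",
    7: "PI",
    8: "comment",
    9: "document",
    10: "DTD/Doctype",
    11: "document fragment",
    12: "notation",
    15: "end of element",
}

def node_type_name(node_type):
    """Return a readable label for a numeric NodeType, or 'unknown (N)'."""
    return NODE_TYPE_NAMES.get(node_type, "unknown (%d)" % node_type)

def describe(depth, node_type, name, is_empty):
    """Format one reader event like the traces, with a labelled node type."""
    return "%d %s %s %d" % (depth, node_type_name(node_type), name, is_empty)
```

Inside processNode() this could be called as
describe(reader.Depth(), reader.NodeType(), reader.Name(),
reader.IsEmptyElement()); for the second line of the trace above,
describe(0, 15, "doc", 0) gives the string "0 end of element doc 0".
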
The following document shows how
character data are reported:
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

We modify the processNode() function to also report the node Value:
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

The result of the test is:
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

There are a few things to note:
<ul>
  <li>the increase of the depth value (first column) as children nodes are
    explored</li>
  <li>the text node child of the b element, of type 3, and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>

The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

The previous examples don't indicate how attributes are processed.
The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:
<pre>0 1 doc 1 None</pre>

This proves that attribute nodes are not traversed by default. The
HasAttributes property allows their presence to be detected. To check
their content the API has dedicated operations. Basically two kinds of
operations are possible:
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    which case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

In both cases the attribute can be designated either by its position in the
list of attributes (MoveToAttributeNo or GetAttributeNo) or
by its name (and namespace):
<ul>
  <li>GetAttributeNo(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li>GetAttribute(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li>GetAttributeNs(localName, namespaceURI): provides the value of the
    attribute with the specified local name and namespace URI.</li>
  <li>MoveToAttributeNo(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li>MoveToAttribute(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li>MoveToAttributeNs(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li>MoveToFirstAttribute: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li>MoveToNextAttribute: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li>MoveToElement: moves the position
    of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

After modifying the processNode() function to show attributes:
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

The output for the same input document reflects the attribute:
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

There are a couple of things to note on the attribute processing:
<ul>
  <li>Their depth is that of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>

<h2><a name="Validating">Validating a document</a></h2>

The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure.
There are a few options available,
defined as the enum xmlParserProperties in the libxml/xmlreader.h header
file:
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added, those were the ones available at the
    2.5.0 release...</li>
</ul>

The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

and a return value of 0 indicates success.

<h2><a name="Entities">Entities substitution</a></h2>

By default the xmlReader will report entities as such and not replace them
with their content.
This default behaviour can however be overridden using:

<code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code>

<h2><a name="L1142">Relax-NG Validation</a></h2>

Introduced in version 2.5.7

Libxml2 can now validate the document being read through the xmlReader
against Relax-NG schemas. While the Relax NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice it means
that, unless the schema for the top level element content is not expressible
as a regexp, only chunks of the document need to be parsed while
validating.

The steps to do so are:
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to read, associate it with a Relax NG schema, either
    the preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader like for DTDs.</li>
</ul>

Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:
<pre><code>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</code>
</pre>

See <code>reader6.py</code> in the sources or documentation for a complete
example.

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

Introduced in version 2.5.7

While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2.
As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography entry
for the "Dragon" compiler book from the XML 1.0 recommendation:
<pre>f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res=""
while reader.Read():
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones. However, once processed, the full subtree is usually
not useful anymore, and the Next() operation skips it completely and
proceeds to the successor, or returns 0 if the document end is reached.

<a href="mailto:xml@gnome.org">Daniel Veillard</a>

$Id: xmlreader.html,v 1.1.1.1 2012/02/21 23:37:59 misho Exp $

</body>
</html>