Annotation of embedaddon/libxml2/doc/xmlreader.html, revision 1.1.1.1
1.1 misho 1: <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
2: "http://www.w3.org/TR/html4/loose.dtd">
3: <html>
4: <head>
5: <meta http-equiv="Content-Type" content="text/html">
6: <style type="text/css">
7: <!--
8: TD {font-family: Verdana,Arial,Helvetica}
9: BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
10: H1 {font-family: Verdana,Arial,Helvetica}
11: H2 {font-family: Verdana,Arial,Helvetica}
12: H3 {font-family: Verdana,Arial,Helvetica}
13: A:link, A:visited, A:active { text-decoration: underline }
14: -->
15: </style>
16: <title>Libxml2 XmlTextReader Interface tutorial</title>
17: </head>
18:
19: <body bgcolor="#fffacd" text="#000000">
20: <h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>
21:
22: <p></p>
23:
24: <p>This document describes the use of the XmlTextReader streaming API added
25: to libxml2 in version 2.5.0. This API is closely modeled after the <a
26: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
27: and <a
28: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
29: classes of the C# language.</p>
30:
31: <p>This tutorial presents the key points of this API, with working
32: examples using both C and the Python bindings.</p>
33:
34: <p>Table of contents:</p>
35: <ul>
36: <li><a href="#Introducti">Introduction: why a new API</a></li>
37: <li><a href="#Walking">Walking a simple tree</a></li>
38: <li><a href="#Extracting">Extracting information for the current
39: node</a></li>
40: <li><a href="#Extracting1">Extracting information for the
41: attributes</a></li>
42: <li><a href="#Validating">Validating a document</a></li>
43: <li><a href="#Entities">Entities substitution</a></li>
44: <li><a href="#L1142">Relax-NG Validation</a></li>
45: <li><a href="#Mixing">Mixing the reader and tree or XPath
46: operations</a></li>
47: </ul>
48:
49: <p></p>
50:
51: <h2><a name="Introducti">Introduction: why a new API</a></h2>
52:
53: <p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
54: tree based</a>: parsing loads the whole document into memory and exposes it
55: as a tree of nodes that are all available at the same time. This is very
56: simple and quite powerful, but has the major limitation that the size of the
57: document that can be handled is bounded by the amount of memory available.
58: Libxml2 also provides a <a
59: href="http://www.saxproject.org/">SAX</a> based API, but that version was
60: designed upon one of the early <a
61: href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and SAX
62: is not formally defined for C. SAX works by registering callbacks
63: which are called directly by the parser as it progresses through the document
64: stream. The problem is that this programming model is relatively complex and
65: not well standardized, cannot provide validation directly, and makes entity,
66: namespace and base processing relatively hard.</p>
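<p>To make the callback model concrete, here is a minimal sketch. It uses
Python's standard <code>xml.sax</code> module rather than libxml2's own SAX
interface, an assumption made only so that the sketch runs without libxml2
installed. The <code>ElementLogger</code> class and its fields are
hypothetical names. Note how the parser decides when each method runs and how
the depth has to be tracked by hand across callbacks:</p>

```python
import xml.sax


class ElementLogger(xml.sax.ContentHandler):
    """Records events; the parser, not our code, decides when methods run."""

    def __init__(self):
        super().__init__()
        self.events = []
        self.depth = 0  # state has to be carried by hand between callbacks

    def startElement(self, name, attrs):
        self.events.append(("start", self.depth, name))
        self.depth += 1

    def endElement(self, name):
        self.depth -= 1
        self.events.append(("end", self.depth, name))

    def characters(self, content):
        # Skip whitespace-only chunks to keep the log short.
        if content.strip():
            self.events.append(("text", self.depth, content))


handler = ElementLogger()
xml.sax.parseString(b"<doc><a/><b>some text</b></doc>", handler)
for event in handler.events:
    print(event)
```

<p>Everything the application wants to know about its position in the
document must be reconstructed inside the handler, which is the complexity
the reader API is meant to remove.</p>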
67:
68: <p>The <a
69: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
70: API from C#</a> provides a far simpler programming model. The API acts as a
71: cursor going forward on the document stream and stopping at each node on the
72: way. The user's code keeps control of the progress and simply calls a
73: Read() function repeatedly to progress to each node in sequence in document
74: order. There is direct support for namespaces, xml:base and entity handling, and
75: adding DTD validation on top of it was relatively simple. This API is really
76: close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
77: specification</a>. This provides a far more standard, easy to use and powerful
78: API than the existing SAX. Moreover, integrating extension features based on
79: the tree seems relatively easy.</p>
80:
81: <p>In a nutshell the XmlTextReader API provides a simpler, more standard and
82: more extensible interface to handle large documents than the existing SAX
83: version.</p>
84:
85: <h2><a name="Walking">Walking a simple tree</a></h2>
86:
87: <p>Basically the XmlTextReader API is a forward-only tree walking interface.
88: The basic steps are:</p>
89: <ol>
90: <li>prepare a reader context operating on some input</li>
91: <li>run a loop iterating over all nodes in the document</li>
92: <li>free up the reader context</li>
93: </ol>
94:
95: <p>Here is a basic C sample doing this:</p>
96: <pre>#include &lt;stdio.h&gt;
97: #include &lt;libxml/xmlreader.h&gt;
98: void processNode(xmlTextReaderPtr reader) {
99:     /* handling of a node in the tree */
100: }
101:
102: void streamFile(const char *filename) {
103:     xmlTextReaderPtr reader;
104:     int ret;
105:
106:     reader = xmlNewTextReaderFilename(filename);
107:     if (reader != NULL) {
108:         ret = xmlTextReaderRead(reader);
109:         while (ret == 1) {
110:             processNode(reader);
111:             ret = xmlTextReaderRead(reader);
112:         }
113:         xmlFreeTextReader(reader);
114:         if (ret != 0) {
115:             printf("%s : failed to parse\n", filename);
116:         }
117:     } else {
118:         printf("Unable to open %s\n", filename);
119:     }
120: }</pre>
121:
122: <p>A few things to notice:</p>
123: <ul>
124: <li>the include file needed: <code>libxml/xmlreader.h</code></li>
125: <li>the creation of the reader using a filename</li>
126: <li>the repeated call to xmlTextReaderRead() and how any return value
127: different from 1 should stop the loop</li>
128: <li>that a negative return means a parsing error</li>
129: <li>how xmlFreeTextReader() should be used to free up the resources used by
130: the reader.</li>
131: </ul>
132:
133: <p>Here is similar code in Python for exactly the same processing:</p>
134: <pre>import libxml2
135:
136: def processNode(reader):
137:     pass
138:
139: def streamFile(filename):
140:     try:
141:         reader = libxml2.newTextReaderFilename(filename)
142:     except Exception:
143:         print("unable to open %s" % (filename))
144:         return
145:
146:     ret = reader.Read()
147:     while ret == 1:
148:         processNode(reader)
149:         ret = reader.Read()
150:
151:     if ret != 0:
152:         print("%s : failed to parse" % (filename))</pre>
153:
154: <p>The only things worth adding are that the <a
155: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
156: is abstracted as a class like in C#</a> with the same method names (but the
157: properties are currently accessed with methods) and that one doesn't need to
158: free the reader at the end of the processing. It will get garbage collected
159: once all references have disappeared.</p>
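<p>The return-value convention of Read() (1 for a node, 0 for the end of the
document, a negative value for an error) can be isolated in a small helper.
The following is only a sketch: <code>stream()</code> and
<code>FakeReader</code> are hypothetical names, and <code>FakeReader</code>
is a stand-in that replays canned results so the loop can be exercised
without libxml2. Any object with a compatible Read() method could be passed
instead.</p>

```python
class FakeReader:
    """Hypothetical stand-in for a reader: replays canned Read() results."""

    def __init__(self, results):
        self._results = iter(results)

    def Read(self):
        return next(self._results)


def stream(reader, process_node):
    """Call process_node on every node; return True on a clean end."""
    ret = reader.Read()
    while ret == 1:            # 1: the reader is positioned on a new node
        process_node(reader)
        ret = reader.Read()
    return ret == 0            # 0: end of document, negative: parse error


seen = []
ok = stream(FakeReader([1, 1, 1, 0]), seen.append)
failed = stream(FakeReader([1, -1]), lambda reader: None)
print(ok, len(seen), failed)
```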
160:
161: <h2><a name="Extracting">Extracting information for the current node</a></h2>
162:
163: <p>So far the example code did not indicate how information was extracted
164: from the reader. It was abstracted as a call to the processNode() routine,
165: with the reader as the argument. At each invocation, the parser is stopped on
166: a given node and the reader can be used to query that node's properties. Each
167: <em>Property</em> is available at the C level as a function taking a single
168: xmlTextReaderPtr argument whose name is
169: <code>xmlTextReader</code><em>Property</em>. If the return type is an
170: <code>xmlChar *</code> string then it must be deallocated with
171: <code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
172: <em>Property</em> method on the reader class that can be called on the
173: instance. The list of the properties is based on the <a
174: href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
175: XmlTextReader class</a> set of properties and methods:</p>
176: <ul>
177: <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
178: element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
179: entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
180: 9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
181: fragment and 12 for notation nodes.</li>
182: <li><em>Name</em>: the <a
183: href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
184: name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
185: <li><em>LocalName</em>: the <a
186: href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
187: the node.</li>
188: <li><em>Prefix</em>: a shorthand reference to the <a
189: href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
190: the node.</li>
191: <li><em>NamespaceUri</em>: the URI defining the <a
192: href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
193: the node.</li>
194: <li><em>BaseUri</em>: the base URI of the node. See the <a
195: href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
196: <li><em>Depth</em>: the depth of the node in the tree, starts at 0 for the
197: root node.</li>
198: <li><em>HasAttributes</em>: whether the node has attributes.</li>
199: <li><em>HasValue</em>: whether the node can have a text value.</li>
200: <li><em>Value</em>: provides the text value of the node if present.</li>
201: <li><em>IsDefault</em>: whether an Attribute node was generated from the
202: default value defined in the DTD or schema (<em>not yet
203: supported</em>).</li>
204: <li><em>XmlLang</em>: the <a
205: href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
206: within which the node resides.</li>
207: <li><em>IsEmptyElement</em>: checks whether the current node is empty. This is a
208: bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
209: empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
210: <li><em>AttributeCount</em>: provides the number of attributes of the
211: current node.</li>
212: </ul>
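<p>The numeric NodeType codes listed above are easier to read once mapped to
labels. The table below is a sketch built only from the values given in this
list. <code>NODE_TYPES</code> and <code>describe()</code> are hypothetical
helper names, not part of the libxml2 bindings, and any code not listed here
falls back to an "unknown" label.</p>

```python
# Numeric codes exactly as listed for the NodeType property.
NODE_TYPES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}


def describe(depth, node_type, name):
    """Format one node as an indented, human-readable line."""
    label = NODE_TYPES.get(node_type, "unknown (%d)" % node_type)
    return "%s%s: %s" % ("  " * depth, label, name)


print(describe(0, 1, "doc"))
print(describe(1, 3, "#text"))
print(describe(0, 15, "doc"))
```

<p>In a real processNode() the three arguments would come from
reader.Depth(), reader.NodeType() and reader.Name().</p>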
213:
214: <p>Let's look first at a small example to see this in practice by redefining
215: the processNode() function in the Python example:</p>
216: <pre>def processNode(reader):
217:     print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
218:                            reader.Name(), reader.IsEmptyElement()))</pre>
219:
220: <p>and look at the result of calling streamFile("tst.xml") for various
221: contents of the XML test file.</p>
222:
223: <p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
224: <pre>0 1 doc 1</pre>
225:
226: <p>Only one node is found: its depth is 0, its type 1 indicates an element start,
227: its name is "doc" and it is empty. Trying now with
228: "<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
229: <pre>0 1 doc 0
230: 0 15 doc 0</pre>
231:
232: <p>The document root node is not flagged as empty anymore and both a start
233: and an end of element are detected. The following document shows how
234: character data are reported:</p>
235: <pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
236: &lt;c/&gt;&lt;/doc&gt;</pre>
237:
238: <p>We modify the processNode() function to also report the node Value:</p>
239: <pre>def processNode(reader):
240:     print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
241:                               reader.Name(), reader.IsEmptyElement(),
242:                               reader.Value()))</pre>
243:
244: <p>The result of the test is:</p>
245: <pre>0 1 doc 0 None
246: 1 1 a 1 None
247: 1 1 b 0 None
248: 2 3 #text 0 some text
249: 1 15 b 0 None
250: 1 3 #text 0
251:
252: 1 1 c 1 None
253: 0 15 doc 0 None</pre>
254:
255: <p>There are a few things to note:</p>
256: <ul>
257: <li>the increase of the depth value (first column) as children nodes are
258: explored</li>
259: <li>the text node child of the b element, of type 3 and its content</li>
260: <li>the text node containing the line return between elements b and c</li>
261: <li>that elements have the Value None (or NULL in C)</li>
262: </ul>
263:
264: <p>The equivalent routine for <code>processNode()</code> as used by
265: <code>xmllint --stream --debug</code> is the following and can be found in
266: the xmllint.c module in the source distribution:</p>
267: <pre>static void processNode(xmlTextReaderPtr reader) {
268:     xmlChar *name, *value;
269:
270:     name = xmlTextReaderName(reader);
271:     if (name == NULL)
272:         name = xmlStrdup(BAD_CAST "--");
273:     value = xmlTextReaderValue(reader);
274:
275:     printf("%d %d %s %d",
276:            xmlTextReaderDepth(reader),
277:            xmlTextReaderNodeType(reader),
278:            name,
279:            xmlTextReaderIsEmptyElement(reader));
280:     xmlFree(name);
281:     if (value == NULL)
282:         printf("\n");
283:     else {
284:         printf(" %s\n", value);
285:         xmlFree(value);
286:     }
287: }</pre>
288:
289: <h2><a name="Extracting1">Extracting information for the attributes</a></h2>
290:
291: <p>The previous examples don't indicate how attributes are processed. The
292: simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
293: result:</p>
294: <pre>0 1 doc 1 None</pre>
295:
296: <p>This shows that attribute nodes are not traversed by default. The
297: <em>HasAttributes</em> property allows detecting their presence. To check
298: their content the API has special instructions. Basically two kinds of operations
299: are possible:</p>
300: <ol>
301: <li>to move the reader to the attribute nodes of the current element, in
302: that case the cursor is positioned on the attribute node</li>
303: <li>to directly query the element node for the attribute value</li>
304: </ol>
305:
306: <p>In both cases the attribute can be designated either by its position in the
307: list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
308: by its name (and namespace):</p>
309: <ul>
310: <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
311: the specified index no relative to the containing element.</li>
312: <li><em>GetAttribute</em>(name): provides the value of the attribute with
313: the specified qualified name.</li>
314: <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of the
315: attribute with the specified local name and namespace URI.</li>
316: <li><em>MoveToAttributeNo</em>(no): moves the position of the current
317: instance to the attribute with the specified index relative to the
318: containing element.</li>
319: <li><em>MoveToAttribute</em>(name): moves the position of the current
320: instance to the attribute with the specified qualified name.</li>
321: <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
322: of the current instance to the attribute with the specified local name
323: and namespace URI.</li>
324: <li><em>MoveToFirstAttribute</em>: moves the position of the current
325: instance to the first attribute associated with the current node.</li>
326: <li><em>MoveToNextAttribute</em>: moves the position of the current
327: instance to the next attribute associated with the current node.</li>
328: <li><em>MoveToElement</em>: moves the position of the current instance to
329: the node that contains the current Attribute node.</li>
330: </ul>
331:
332: <p>After modifying the processNode() function to show attributes:</p>
333: <pre>def processNode(reader):
334:     print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
335:                               reader.Name(), reader.IsEmptyElement(),
336:                               reader.Value()))
337:     if reader.NodeType() == 1: # Element
338:         while reader.MoveToNextAttribute():
339:             print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
340:                                           reader.Name(), reader.Value()))</pre>
341:
342: <p>The output for the same input document reflects the attribute:</p>
343: <pre>0 1 doc 1 None
344: -- 1 2 (a) [b]</pre>
345:
346: <p>There are a couple of things to note on the attribute processing:</p>
347: <ul>
348: <li>Their depth is the one of the carrying element plus one.</li>
349: <li>Namespace declarations are seen as attributes, as in DOM.</li>
350: </ul>
351:
352: <h2><a name="Validating">Validating a document</a></h2>
353:
354: <p>The libxml2 implementation adds some extra features on top of the XmlTextReader
355: API. The main one is the ability to validate the parsed document against its DTD
356: progressively. This is simply the activation of the associated feature of the
357: parser used by the reader structure. There are a few options available
358: defined as the enum xmlParserProperties in the libxml/xmlreader.h header
359: file:</p>
360: <ul>
361: <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
362: <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
363: loading the DTD)</li>
364: <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
365: the DTD)</li>
366: <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly; entity
367: reference nodes are not generated and are replaced by their expanded
368: content.</li>
369: <li>more settings might be added, those were the ones available at the 2.5.0
370: release...</li>
371: </ul>
372:
373: <p>The GetParserProp() and SetParserProp() methods can then be used to get
374: and set the values of those parser properties of the reader. For example:</p>
375: <pre>def parseAndValidate(file):
376:     reader = libxml2.newTextReaderFilename(file)
377:     reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
378:     ret = reader.Read()
379:     while ret == 1:
380:         ret = reader.Read()
381:     if ret != 0:
382:         print("Error parsing and validating %s" % (file))</pre>
383:
384: <p>This routine will parse and validate the file. Error messages can be
385: captured by registering an error handler. See python/tests/reader2.py for
386: more complete Python examples. At the C level the equivalent call to activate
387: the validation feature is just:</p>
388: <pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>
389:
390: <p>and a return value of 0 indicates success.</p>
391:
392: <h2><a name="Entities">Entities substitution</a></h2>
393:
394: <p>By default the xmlReader will report entities as such and not replace them
395: with their content. This default behaviour can however be overridden using:</p>
396:
397: <p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>
398:
399: <h2><a name="L1142">Relax-NG Validation</a></h2>
400:
401: <p style="font-size: 10pt">Introduced in version 2.5.7</p>
402:
403: <p>Libxml2 can now validate the document being read by the xmlReader against
404: Relax-NG schemas. While the Relax NG validator can't always work in a
405: streamable mode, only subsets which cannot be reduced to regular expressions
406: need to have their subtree expanded for validation. In practice it means
407: that, unless the schema for the top level element content is not expressible
408: as a regexp, only chunks of the document need to be expanded while
409: validating.</p>
410:
411: <p>The steps to do so are:</p>
412: <ul>
413: <li>create a reader working on a document as usual</li>
414: <li>before any call to read, associate it with a Relax NG schema, either the
415: preparsed schema or the URL of the schema to use</li>
416: <li>errors will be reported the usual way, and the validity status can be
417: obtained using the IsValid() interface of the reader like for DTDs.</li>
418: </ul>
419:
420: <p>Example, assuming the reader has already been created and that the schema
421: string contains the Relax-NG schema:</p>
422: <pre><code>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
423: rngs = rngp.relaxNGParse()
424: reader.RelaxNGSetSchema(rngs)
425: ret = reader.Read()
426: while ret == 1:
427:     ret = reader.Read()
428: if ret != 0:
429:     print("Error parsing the document")
430: if reader.IsValid() != 1:
431:     print("Document failed to validate")</code>
432: </pre>
433:
434: <p>See <code>reader6.py</code> in the sources or documentation for a complete
435: example.</p>
436:
437: <h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>
438:
439: <p style="font-size: 10pt">Introduced in version 2.5.7</p>
440:
441: <p>While the reader is a streaming interface, its underlying implementation
442: is based on the DOM builder of libxml2. As a result it is relatively simple
443: to mix operations based on both models under some constraints. To do so the
444: reader has an Expand() operation that grows the subtree under the
445: current node. It returns a pointer to a standard node which can be
446: manipulated in the usual ways. The node will get all its ancestors and the
447: full subtree available. Usual operations like XPath queries can be used on
448: that reduced view of the document. Here is an example extracted from
449: reader5.py in the sources which extracts and prints the bibliography for the
450: "Dragon" compiler book from the XML 1.0 recommendation:</p>
451: <pre>f = open('../../test/valid/REC-xml-19980210.xml')
452: input = libxml2.inputBuffer(f)
453: reader = input.newTextReader("REC")
454: res=""
455: while reader.Read() == 1:                 # stop at the end or on error
456:     while reader.Name() == 'bibl':
457:         node = reader.Expand()            # expand the subtree
458:         if node.xpathEval("@id = 'Aho'"): # use XPath on it
459:             res = res + node.serialize()
460:         if reader.Next() != 1:            # skip the subtree
461:             break</pre>
462:
463: <p>Note, however, that the node instance returned by the Expand() call is only
464: valid until the next Read() operation. The Expand() operation does not
465: affect the Read() ones. However, once processed, the full subtree is usually
466: not useful anymore, and the Next() operation allows skipping it completely and
467: proceeding to the successor, or returns 0 if the document end is reached.</p>
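<p>The same "stream, then expand only what is needed" pattern can be tried
without libxml2 using Python's standard <code>xml.dom.pulldom</code> module,
whose <code>expandNode()</code> plays a role similar to Expand(). This is an
analogy under that assumption, not the libxml2 API: the sample document and
the <code>extract()</code> helper are hypothetical, and a plain attribute
comparison stands in for the XPath test.</p>

```python
from xml.dom import pulldom

# Hypothetical sample data standing in for the bibliography of the XML spec.
SAMPLE = (
    "<biblio>"
    "<bibl id='Aho'>Aho, Sethi, Ullman. Compilers.</bibl>"
    "<bibl id='Other'>Some other entry.</bibl>"
    "</biblio>"
)


def extract(xml_text, wanted_id):
    """Stream through xml_text, expanding only the matching bibl subtree."""
    events = pulldom.parseString(xml_text)
    for event, node in events:
        if event == pulldom.START_ELEMENT and node.tagName == "bibl":
            if node.getAttribute("id") == wanted_id:
                events.expandNode(node)  # build the subtree under this node
                return node.toxml()
    return None


result = extract(SAMPLE, "Aho")
print(result)
```

<p>Entries that do not match are never expanded, so only the selected
subtree is built as a tree.</p>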
468:
469: <p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>
470:
471: <p>$Id$</p>
472:
473: <p></p>
474: </body>
475: </html>