File:  [ELWIX - Embedded LightWeight unIX -] / embedaddon / libxml2 / doc / xmlreader.html
Revision 1.1.1.1 (vendor branch)
Tue Feb 21 23:37:59 2012 UTC (12 years, 4 months ago) by misho
Branches: libxml2, MAIN
CVS tags: v2_9_1p0, v2_9_1, v2_8_0p0, v2_8_0, v2_7_8, HEAD
libxml2

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css">
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
-->
  </style>
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
  node</a></li>
  <li><a href="#Extracting1">Extracting information for the
  attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entity substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
  operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation loads the document completely in
memory and exposes it as a tree of nodes all available at the same time.
This is very simple and quite powerful, but has the major limitation that
the size of the document that can be handled is limited by the size of the
memory available. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex, is not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
        }
    } else {
        printf("Unable to open %s\n", filename);
        ret = -1;
    }
    return ret;   /* 0 on success, negative on failure */
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except Exception:
        print("unable to open %s" % (filename))
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % (filename))</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.</p>

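<p>The Read() protocol used in both samples (1 for a node, 0 at the end of
the document, a negative value on error) can also be wrapped in a generator.
This is only a sketch: <code>iter_nodes()</code> and
<code>stream_file()</code> are hypothetical helper names, not part of the
libxml2 bindings, and the helper assumes nothing about the reader beyond a
<code>Read()</code> method.</p>

```python
def iter_nodes(reader):
    """Yield the reader once per node, following the Read() protocol.

    Hypothetical helper, not part of libxml2. Read() returns 1 when the
    reader is positioned on a new node, 0 at the end of the document and
    a negative value on a parse error.
    """
    ret = reader.Read()
    while ret == 1:
        yield reader
        ret = reader.Read()
    if ret != 0:
        raise ValueError("failed to parse (Read() returned %d)" % ret)


def stream_file(filename):
    """Print the name of every node in filename (needs the libxml2 bindings)."""
    import libxml2  # imported lazily so iter_nodes() is usable without it
    reader = libxml2.newTextReaderFilename(filename)
    for node in iter_nodes(reader):
        print(node.Name())
```

<p>With the bindings installed, <code>stream_file("tst.xml")</code> would
print one node name per line. The design choice is to turn the error return
code into an exception so the caller's loop body stays free of status
checks.</p>
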
<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string then it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: the node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starting at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not yet
    supported</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: checks whether the current node is empty. This
    is a bit bizarre in the sense that <code>&lt;a/&gt;</code> will be
    considered empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>

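<p>Since NodeType is reported as a bare integer, a small lookup table can
make debugging output easier to read. The sketch below only covers the codes
listed above. <code>NODE_TYPE_NAMES</code> and
<code>node_type_name()</code> are hypothetical names introduced for
illustration, not part of the bindings.</p>

```python
# Node type codes exactly as listed in the NodeType property description.
# Hypothetical lookup table, not provided by the libxml2 bindings.
NODE_TYPE_NAMES = {
    1: "element start",
    2: "attribute",
    3: "text",
    4: "CDATA section",
    5: "entity reference",
    6: "entity declaration",
    7: "processing instruction",
    8: "comment",
    9: "document",
    10: "DTD/doctype",
    11: "document fragment",
    12: "notation",
    15: "element end",
}


def node_type_name(code):
    """Return a readable label for a NodeType code, or a fallback string."""
    return NODE_TYPE_NAMES.get(code, "unknown (%d)" % code)
```

<p>For instance, <code>node_type_name(reader.NodeType())</code> could replace
the raw number in a debugging print.</p>
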
<p>Let's look first at a small example to put this in practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found, its depth is 0, type 1 indicates an element start,
of name "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as children nodes are
    explored</li>
  <li>the text node child of the b element, of type 3 and its content</li>
  <li>the text node containing the line return between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This shows that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has special operations. Basically two kinds of
operations are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    that case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>The output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note on the attribute processing:</p>
<ul>
  <li>Their depth is the one of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>

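<p>When the attributes are needed as data rather than printed, the same
cursor movements can be collected into a list. This is a minimal sketch:
<code>attributes_of()</code> is a hypothetical helper, not part of the
bindings, and it relies only on the <em>MoveToFirstAttribute</em>,
<em>MoveToNextAttribute</em> and <em>MoveToElement</em> operations described
above. It moves the cursor back to the element afterwards, so later property
queries in the caller see the element again.</p>

```python
def attributes_of(reader):
    """Return the attributes of the current element as (name, value) pairs.

    Hypothetical helper, not part of libxml2. The reader is moved back to
    the carrying element before returning.
    """
    attrs = []
    if reader.NodeType() != 1 or not reader.HasAttributes():
        return attrs  # not on an element start, or nothing to collect
    moved = reader.MoveToFirstAttribute()
    while moved == 1:
        attrs.append((reader.Name(), reader.Value()))
        moved = reader.MoveToNextAttribute()
    reader.MoveToElement()  # reposition the cursor on the element
    return attrs
```

<p>Called from processNode(), <code>attributes_of(reader)</code> can be
turned into a dictionary with <code>dict()</code> when attribute order does
not matter.</p>
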
<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. There are a few options
available, defined as the enum xmlParserProperties in the libxml/xmlreader.h
header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added, those were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % (file))</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>

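<p>To capture the error messages mentioned above instead of letting them go
to the standard error output, a handler can be registered before reading.
The sketch below assumes, as python/tests/reader2.py appears to do, that a
handler installed with <code>libxml2.registerErrorHandler()</code> receives
the validation messages as text fragments. <code>make_collector()</code> and
<code>validate_file()</code> are hypothetical helper names, not part of the
bindings.</p>

```python
def make_collector():
    """Return (callback, messages); hypothetical helper, not a libxml2 API."""
    messages = []

    def callback(ctx, message):
        # ctx is the value passed as second argument at registration time;
        # message is a fragment of the error text.
        messages.append(message)

    return callback, messages


def validate_file(filename):
    """Return (ok, messages) after DTD-validating filename.

    Needs the libxml2 Python bindings, which are imported lazily here.
    """
    import libxml2
    callback, messages = make_collector()
    libxml2.registerErrorHandler(callback, None)
    reader = libxml2.newTextReaderFilename(filename)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    # Well-formedness is reported by Read(), validity by IsValid().
    ok = (ret == 0 and reader.IsValid() == 1)
    return ok, messages
```

<p>The collected fragments can be joined with <code>"".join(messages)</code>
to rebuild the full report.</p>
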
<h2><a name="Entities">Entity substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now validate the document being read by the xmlReader against
Relax-NG schemas. While the Relax NG validator can't always work in a
streamable mode, only subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice this means
that, as long as the schema for the top-level element content is expressible
as a regexp, only chunks of the document need to be expanded while
validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to read, associate it with a Relax NG schema, either the
    preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader like for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>
<pre>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography entry
for the "Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res=""
while reader.Read() == 1:                 # stop at the end or on error
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones. However, once processed the full subtree is usually
not useful anymore, and the Next() operation allows skipping it completely
and proceeding to the successor, or returns 0 if the document end is
reached.</p>

<p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>

<p>$Id: xmlreader.html,v 1.1.1.1 2012/02/21 23:37:59 misho Exp $</p>

<p></p>
</body>
</html>
