<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    "http://www.w3.org/TR/html4/loose.dtd">
<html>
<head>
  <meta http-equiv="Content-Type" content="text/html">
  <style type="text/css"></style>
<!--
TD {font-family: Verdana,Arial,Helvetica}
BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
H1 {font-family: Verdana,Arial,Helvetica}
H2 {font-family: Verdana,Arial,Helvetica}
H3 {font-family: Verdana,Arial,Helvetica}
A:link, A:visited, A:active { text-decoration: underline }
  </style>
-->
  <title>Libxml2 XmlTextReader Interface tutorial</title>
</head>

<body bgcolor="#fffacd" text="#000000">
<h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>

<p></p>

<p>This document describes the use of the XmlTextReader streaming API added
to libxml2 in version 2.5.0. This API is closely modeled after the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
and <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
classes of the C# language.</p>

<p>This tutorial presents the key points of this API, with working
examples using both C and the Python bindings.</p>

<p>Table of contents:</p>
<ul>
  <li><a href="#Introducti">Introduction: why a new API</a></li>
  <li><a href="#Walking">Walking a simple tree</a></li>
  <li><a href="#Extracting">Extracting information for the current
  node</a></li>
  <li><a href="#Extracting1">Extracting information for the
  attributes</a></li>
  <li><a href="#Validating">Validating a document</a></li>
  <li><a href="#Entities">Entity substitution</a></li>
  <li><a href="#L1142">Relax-NG Validation</a></li>
  <li><a href="#Mixing">Mixing the reader and tree or XPath
  operations</a></li>
</ul>

<p></p>

<h2><a name="Introducti">Introduction: why a new API</a></h2>

<p>Libxml2's <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
tree based</a>: the parsing operation results in a document loaded
completely in memory and exposed as a tree of nodes all available at the
same time. This is very simple and quite powerful, but has the major
limitation that the size of the document that can be handled is limited by
the size of the available memory. Libxml2 also provides a <a
href="http://www.saxproject.org/">SAX</a> based API, but that version was
designed upon one of the early <a
href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and
SAX is not formally defined for C. SAX basically works by registering
callbacks which are called directly by the parser as it progresses through
the document stream. The problem is that this programming model is
relatively complex and not well standardized, cannot provide validation
directly, and makes entity, namespace and base processing relatively hard.</p>

<p>The <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
API from C#</a> provides a far simpler programming model. The API acts as a
cursor going forward on the document stream and stopping at each node on the
way. The user's code keeps control of the progress and simply calls a
Read() function repeatedly to progress to each node in sequence in document
order. There is direct support for namespaces, xml:base and entity handling,
and adding DTD validation on top of it was relatively simple. This API is
really close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
specification</a>. This provides a far more standard, easy to use and powerful
API than the existing SAX. Moreover, integrating extension features based on
the tree seems relatively easy.</p>

<p>In a nutshell the XmlTextReader API provides a simpler, more standard and
more extensible interface to handle large documents than the existing SAX
version.</p>

<h2><a name="Walking">Walking a simple tree</a></h2>

<p>Basically the XmlTextReader API is a forward-only tree walking interface.
The basic steps are:</p>
<ol>
  <li>prepare a reader context operating on some input</li>
  <li>run a loop iterating over all nodes in the document</li>
  <li>free up the reader context</li>
</ol>

<p>Here is a basic C sample doing this:</p>
<pre>#include &lt;stdio.h&gt;
#include &lt;libxml/xmlreader.h&gt;

void processNode(xmlTextReaderPtr reader) {
    /* handling of a node in the tree */
}

/* returns 0 on success, -1 if the file cannot be opened or parsed */
int streamFile(const char *filename) {
    xmlTextReaderPtr reader;
    int ret;

    reader = xmlNewTextReaderFilename(filename);
    if (reader != NULL) {
        ret = xmlTextReaderRead(reader);
        while (ret == 1) {
            processNode(reader);
            ret = xmlTextReaderRead(reader);
        }
        xmlFreeTextReader(reader);
        if (ret != 0) {
            printf("%s : failed to parse\n", filename);
            return -1;
        }
    } else {
        printf("Unable to open %s\n", filename);
        return -1;
    }
    return 0;
}</pre>

<p>A few things to notice:</p>
<ul>
  <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  <li>the creation of the reader using a filename</li>
  <li>the repeated call to xmlTextReaderRead() and how any return value
    different from 1 should stop the loop</li>
  <li>that a negative return means a parsing error</li>
  <li>how xmlFreeTextReader() should be used to free up the resources used by
    the reader.</li>
</ul>

<p>Here is similar code in Python for exactly the same processing:</p>
<pre>import libxml2

def processNode(reader):
    pass

def streamFile(filename):
    try:
        reader = libxml2.newTextReaderFilename(filename)
    except:
        print("unable to open %s" % filename)
        return

    ret = reader.Read()
    while ret == 1:
        processNode(reader)
        ret = reader.Read()

    if ret != 0:
        print("%s : failed to parse" % filename)</pre>

<p>The only things worth adding are that the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
is abstracted as a class like in C#</a> with the same method names (but the
properties are currently accessed with methods) and that one doesn't need to
free the reader at the end of the processing. It will get garbage collected
once all references have disappeared.</p>
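<p>As a sketch of how the loop above could be packaged for reuse, the
following generator wraps the repeated Read() calls. The names
<code>iter_nodes</code> and <code>ReaderError</code> are illustrative and not
part of the bindings; the only thing assumed about the reader is the Read()
method shown above, returning 1, 0 or a negative value.</p>
<pre>class ReaderError(Exception):
    """Hypothetical exception raised when Read() reports a parsing error."""


def iter_nodes(reader):
    """Yield the reader once per node, in document order.

    Assumes only that reader.Read() returns 1 while a node is available,
    0 at the end of the document and a negative value on error.
    """
    ret = reader.Read()
    while ret == 1:
        yield reader
        ret = reader.Read()
    if ret != 0:
        raise ReaderError("failed to parse (Read() returned %d)" % ret)</pre>

<p>With the bindings installed, it could be used as <code>for node in
iter_nodes(libxml2.newTextReaderFilename("tst.xml")): processNode(node)</code>.
The yielded object is the reader itself, so it must be queried before the loop
advances to the next node.</p>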

<h2><a name="Extracting">Extracting information for the current node</a></h2>

<p>So far the example code did not indicate how information was extracted
from the reader. It was abstracted as a call to the processNode() routine,
with the reader as the argument. At each invocation, the parser is stopped on
a given node and the reader can be used to query that node's properties. Each
<em>Property</em> is available at the C level as a function taking a single
xmlTextReaderPtr argument whose name is
<code>xmlTextReader</code><em>Property</em>. If the return type is an
<code>xmlChar *</code> string, it must be deallocated with
<code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
<em>Property</em> method on the reader class that can be called on the
instance. The list of the properties is based on the <a
href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
XmlTextReader class</a> set of properties and methods:</p>
<ul>
  <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
    element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
    entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
    9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
    fragment and 12 for notation nodes.</li>
  <li><em>Name</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
    name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  <li><em>LocalName</em>: the <a
    href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
    the node.</li>
  <li><em>Prefix</em>: a shorthand reference to the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>NamespaceUri</em>: the URI defining the <a
    href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
    the node.</li>
  <li><em>BaseUri</em>: the base URI of the node. See the <a
    href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  <li><em>Depth</em>: the depth of the node in the tree, starts at 0 for the
    root node.</li>
  <li><em>HasAttributes</em>: whether the node has attributes.</li>
  <li><em>HasValue</em>: whether the node can have a text value.</li>
  <li><em>Value</em>: provides the text value of the node if present.</li>
  <li><em>IsDefault</em>: whether an Attribute node was generated from the
    default value defined in the DTD or schema (<em>not supported
  yet</em>).</li>
  <li><em>XmlLang</em>: the <a
    href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
    within which the node resides.</li>
  <li><em>IsEmptyElement</em>: check if the current node is empty. This is a
    bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
    empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  <li><em>AttributeCount</em>: provides the number of attributes of the
    current node.</li>
</ul>
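<p>Since NodeType is reported as a bare number, a small lookup table can make
debugging output easier to read. The following is only a sketch: the
dictionary and both helper names are illustrative and not part of the
bindings, the readable names are chosen here for convenience, and only the
codes listed above are covered.</p>
<pre># Illustrative table, limited to the NodeType values listed above.
NODE_TYPE_NAMES = {
    1: "Element",
    2: "Attribute",
    3: "Text",
    4: "CDATA",
    5: "EntityReference",
    6: "Entity",
    7: "ProcessingInstruction",
    8: "Comment",
    9: "Document",
    10: "DocumentType",
    11: "DocumentFragment",
    12: "Notation",
    15: "EndElement",
}


def node_type_name(code):
    """Return a readable name for a NodeType value, or a numeric fallback."""
    return NODE_TYPE_NAMES.get(code, "type %d" % code)


def describe(reader):
    """Build a one-line summary from the Depth, NodeType and Name methods."""
    return "%d %s %s" % (reader.Depth(),
                         node_type_name(reader.NodeType()),
                         reader.Name())</pre>

<p>Any object offering the Depth(), NodeType() and Name() methods can be
passed to <code>describe()</code>, which is what makes the helper easy to try
out without a real document.</p>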

<p>Let's look first at a small example to put this in practice by redefining
the processNode() function in the Python example:</p>
<pre>def processNode(reader):
    print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
                           reader.Name(), reader.IsEmptyElement()))</pre>

<p>and look at the result of calling streamFile("tst.xml") for various
contents of the XML test file.</p>

<p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
<pre>0 1 doc 1</pre>

<p>Only one node is found, its depth is 0, type 1 indicates an element start,
of name "doc" and it is empty. Trying now with
"<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
<pre>0 1 doc 0
0 15 doc 0</pre>

<p>The document root node is not flagged as empty anymore and both a start
and an end of element are detected. The following document shows how
character data are reported:</p>
<pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
&lt;c/&gt;&lt;/doc&gt;</pre>

<p>We modify the processNode() function to also report the node Value:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))</pre>

<p>The result of the test is:</p>
<pre>0 1 doc 0 None
1 1 a 1 None
1 1 b 0 None
2 3 #text 0 some text
1 15 b 0 None
1 3 #text 0

1 1 c 1 None
0 15 doc 0 None</pre>

<p>There are a few things to note:</p>
<ul>
  <li>the increase of the depth value (first column) as child nodes are
    explored</li>
  <li>the text node child of the b element, of type 3 and its content</li>
  <li>the text node containing the line break between elements b and c</li>
  <li>that elements have the Value None (or NULL in C)</li>
</ul>
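<p>Applications often want to ignore text nodes that only hold whitespace,
such as the line break reported between b and c. A minimal sketch of such a
filter follows; the function names are illustrative, and it works on plain
(depth, type, name, empty, value) tuples mirroring the columns printed above
rather than on the reader itself.</p>
<pre>TEXT_NODE = 3  # NodeType value for text nodes, as listed earlier


def is_blank_text(node_type, value):
    """True for a text node whose value is missing or only whitespace."""
    return node_type == TEXT_NODE and (value is None or value.strip() == "")


def significant_nodes(rows):
    """Drop whitespace-only text nodes from a list of
    (depth, node_type, name, is_empty, value) tuples."""
    return [row for row in rows if not is_blank_text(row[1], row[4])]


# The rows printed in the output above, written out as tuples.
rows = [
    (0, 1, "doc", 0, None),
    (1, 1, "a", 1, None),
    (1, 1, "b", 0, None),
    (2, 3, "#text", 0, "some text"),
    (1, 15, "b", 0, None),
    (1, 3, "#text", 0, "\n"),
    (1, 1, "c", 1, None),
    (0, 15, "doc", 0, None),
]
kept = significant_nodes(rows)</pre>

<p>Here <code>kept</code> holds seven rows: everything except the text node
carrying the line break.</p>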

<p>The equivalent routine for <code>processNode()</code> as used by
<code>xmllint --stream --debug</code> is the following and can be found in
the xmllint.c module in the source distribution:</p>
<pre>static void processNode(xmlTextReaderPtr reader) {
    xmlChar *name, *value;

    name = xmlTextReaderName(reader);
    if (name == NULL)
        name = xmlStrdup(BAD_CAST "--");
    value = xmlTextReaderValue(reader);

    printf("%d %d %s %d",
            xmlTextReaderDepth(reader),
            xmlTextReaderNodeType(reader),
            name,
            xmlTextReaderIsEmptyElement(reader));
    xmlFree(name);
    if (value == NULL)
        printf("\n");
    else {
        printf(" %s\n", value);
        xmlFree(value);
    }
}</pre>

<h2><a name="Extracting1">Extracting information for the attributes</a></h2>

<p>The previous examples don't indicate how attributes are processed. The
simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
result:</p>
<pre>0 1 doc 1 None</pre>

<p>This shows that attribute nodes are not traversed by default. The
<em>HasAttributes</em> property allows detecting their presence. To check
their content the API has special instructions. Basically two kinds of operations
are possible:</p>
<ol>
  <li>to move the reader to the attribute nodes of the current element, in
    that case the cursor is positioned on the attribute node</li>
  <li>to directly query the element node for the attribute value</li>
</ol>

<p>In both cases the attribute can be designated either by its position in the
list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
by its name (and namespace):</p>
<ul>
  <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
    the specified index no relative to the containing element.</li>
  <li><em>GetAttribute</em>(name): provides the value of the attribute with
    the specified qualified name.</li>
  <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of
    the attribute with the specified local name and namespace URI.</li>
  <li><em>MoveToAttributeNo</em>(no): moves the position of the current
    instance to the attribute with the specified index relative to the
    containing element.</li>
  <li><em>MoveToAttribute</em>(name): moves the position of the current
    instance to the attribute with the specified qualified name.</li>
  <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
    of the current instance to the attribute with the specified local name
    and namespace URI.</li>
  <li><em>MoveToFirstAttribute</em>: moves the position of the current
    instance to the first attribute associated with the current node.</li>
  <li><em>MoveToNextAttribute</em>: moves the position of the current
    instance to the next attribute associated with the current node.</li>
  <li><em>MoveToElement</em>: moves the position of the current instance to
    the node that contains the current Attribute node.</li>
</ul>

<p>After modifying the processNode() function to show attributes:</p>
<pre>def processNode(reader):
    print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
                              reader.Name(), reader.IsEmptyElement(),
                              reader.Value()))
    if reader.NodeType() == 1: # Element
        while reader.MoveToNextAttribute():
            print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
                                          reader.Name(), reader.Value()))</pre>

<p>The output for the same input document reflects the attribute:</p>
<pre>0 1 doc 1 None
-- 1 2 (a) [b]</pre>

<p>There are a couple of things to note about attribute processing:</p>
<ul>
  <li>Their depth is that of the carrying element plus one.</li>
  <li>Namespace declarations are seen as attributes, as in DOM.</li>
</ul>
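<p>The cursor-moving style can be wrapped so that the caller gets the
attributes back as a list and the reader ends up on the element again. This
is a sketch under the assumption that the reader offers the NodeType(),
MoveToNextAttribute(), Name(), Value() and MoveToElement() methods described
above; the function name <code>attributes_of</code> is illustrative.</p>
<pre>def attributes_of(reader):
    """Return [(name, value), ...] for the element the reader is stopped on.

    Non-element nodes yield an empty list. The cursor is moved back to the
    element afterwards so that later calls see the element, not an attribute.
    """
    pairs = []
    if reader.NodeType() != 1:          # 1 is the start-element node type
        return pairs
    while reader.MoveToNextAttribute() == 1:
        pairs.append((reader.Name(), reader.Value()))
    reader.MoveToElement()              # put the cursor back on the element
    return pairs</pre>

<p>For the "<code>&lt;doc a="b"/&gt;</code>" test this would be expected to
give one pair, the name a with the value b. When only one known attribute is
needed, the direct query <code>reader.GetAttribute("a")</code> from the list
above avoids moving the cursor at all.</p>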

<h2><a name="Validating">Validating a document</a></h2>

<p>The libxml2 implementation adds some extra features on top of the
XmlTextReader API. The main one is the ability to DTD validate the parsed
document progressively. This is simply the activation of the associated
feature of the parser used by the reader structure. There are a few options
available, defined as the enum xmlParserProperties in the libxml/xmlreader.h
header file:</p>
<ul>
  <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
    loading the DTD)</li>
  <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
    the DTD)</li>
  <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
    reference nodes are not generated and are replaced by their expanded
    content.</li>
  <li>more settings might be added, these were the ones available at the 2.5.0
    release...</li>
</ul>

<p>The GetParserProp() and SetParserProp() methods can then be used to get
and set the values of those parser properties of the reader. For example:</p>
<pre>def parseAndValidate(file):
    reader = libxml2.newTextReaderFilename(file)
    reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    if ret != 0:
        print("Error parsing and validating %s" % file)</pre>

<p>This routine will parse and validate the file. Error messages can be
captured by registering an error handler. See python/tests/reader2.py for
more complete Python examples. At the C level the equivalent call to activate
the validation feature is just:</p>
<pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>

<p>and a return value of 0 indicates success.</p>
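<p>A variant of the Python routine that checks the SetParserProp() result and
reports parsing and validity separately might look as follows. It is a
sketch: the function name is illustrative, the property constant is passed in
by the caller (it is expected to be <code>libxml2.PARSER_VALIDATE</code>) so
that the block does not import the bindings itself, and it relies on the
IsValid() method mentioned in the Relax-NG section of this tutorial.</p>
<pre>def parse_and_check(reader, validate_prop):
    """Read the whole document with DTD validation switched on.

    validate_prop is assumed to be libxml2.PARSER_VALIDATE.
    Returns a (parsed, valid) pair of booleans.
    """
    if reader.SetParserProp(validate_prop, 1) != 0:
        raise RuntimeError("could not activate validation")
    ret = reader.Read()
    while ret == 1:
        ret = reader.Read()
    # ret is 0 at the end of the document and negative on a parsing error
    return (ret == 0, reader.IsValid() == 1)</pre>

<p>A call such as
<code>parse_and_check(libxml2.newTextReaderFilename("tst.xml"),
libxml2.PARSER_VALIDATE)</code> would then be expected to return a pair of
true values for a document that is both well-formed and valid.</p>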

<h2><a name="Entities">Entity substitution</a></h2>

<p>By default the xmlReader will report entities as such and not replace them
with their content. This default behaviour can however be overridden using:</p>

<p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>
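<p>One way to observe the difference is to count the entity reference nodes
(NodeType 5) met while reading. The helper below is a sketch with an
illustrative name; the optional second argument is assumed to be
<code>libxml2.PARSER_SUBST_ENTITIES</code> and is passed in by the caller so
that the block does not import the bindings itself.</p>
<pre>def collect_text(reader, subst_entities_prop=None):
    """Return (text, refs): the concatenated text node values and the
    number of entity reference nodes seen while reading the document."""
    if subst_entities_prop is not None:
        reader.SetParserProp(subst_entities_prop, 1)
    chunks = []
    refs = 0
    ret = reader.Read()
    while ret == 1:
        kind = reader.NodeType()
        if kind == 3:                   # text node
            chunks.append(reader.Value())
        elif kind == 5:                 # entity reference node
            refs = refs + 1
        ret = reader.Read()
    return ("".join(chunks), refs)</pre>

<p>With substitution activated, no entity reference nodes should be reported,
so the second value of the pair would be expected to be 0 and the expanded
content to show up in the text instead.</p>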

<h2><a name="L1142">Relax-NG Validation</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>Libxml2 can now use Relax-NG schemas to validate the document being read
by the xmlReader. The Relax NG validator can't always work in a streamable
mode, but only the subsets which cannot be reduced to regular expressions
need to have their subtree expanded for validation. In practice this means
that, unless the schema for the top-level element content cannot be expressed
as a regexp, only chunks of the document need to be expanded while
validating.</p>

<p>The steps to do so are:</p>
<ul>
  <li>create a reader working on a document as usual</li>
  <li>before any call to Read(), associate it with a Relax NG schema, either
    the preparsed schema or the URL of the schema to use</li>
  <li>errors will be reported the usual way, and the validity status can be
    obtained using the IsValid() interface of the reader like for DTDs.</li>
</ul>

<p>Example, assuming the reader has already been created and that the schema
string contains the Relax-NG schema:</p>
<pre>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
rngs = rngp.relaxNGParse()
reader.RelaxNGSetSchema(rngs)
ret = reader.Read()
while ret == 1:
    ret = reader.Read()
if ret != 0:
    print("Error parsing the document")
if reader.IsValid() != 1:
    print("Document failed to validate")</pre>

<p>See <code>reader6.py</code> in the sources or documentation for a complete
example.</p>

<h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>

<p style="font-size: 10pt">Introduced in version 2.5.7</p>

<p>While the reader is a streaming interface, its underlying implementation
is based on the DOM builder of libxml2. As a result it is relatively simple
to mix operations based on both models under some constraints. To do so the
reader has an Expand() operation which grows the subtree under the
current node. It returns a pointer to a standard node which can be
manipulated in the usual ways. The node will get all its ancestors and the
full subtree available. Usual operations like XPath queries can be used on
that reduced view of the document. Here is an example extracted from
reader5.py in the sources which extracts and prints the bibliography entry for
the "Dragon" compiler book from the XML 1.0 recommendation:</p>
<pre>f = open('../../test/valid/REC-xml-19980210.xml')
input = libxml2.inputBuffer(f)
reader = input.newTextReader("REC")
res=""
while reader.Read() == 1:                 # stop at end of document or on error
    while reader.Name() == 'bibl':
        node = reader.Expand()            # expand the subtree
        if node.xpathEval("@id = 'Aho'"): # use XPath on it
            res = res + node.serialize()
        if reader.Next() != 1:            # skip the subtree
            break</pre>

<p>Note, however, that the node instance returned by the Expand() call is only
valid until the next Read() operation. The Expand() operation does not
affect the Read() ones. However, once processed, the full subtree is usually
not useful anymore, and the Next() operation allows skipping it completely and
proceeding to the successor, or returns 0 if the document end is reached.</p>
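<p>The same pattern can be written as a reusable function that copies what it
needs out of each expanded node before moving on, which respects the lifetime
rule just described. This is a sketch: the function name is illustrative, and
it assumes only the Read(), Next(), NodeType(), Name() and Expand() reader
methods plus the xpathEval() and serialize() node methods used in the example
above.</p>
<pre>def serialize_matching(reader, name, xpath):
    """Serialize every element called name for which xpath evaluates true.

    Each node returned by Expand() is serialized immediately, before any
    further Read() or Next() call, and Next() then skips its subtree.
    """
    out = []
    ret = reader.Read()
    while ret == 1:
        if reader.NodeType() == 1 and reader.Name() == name:
            node = reader.Expand()         # build the subtree for this element
            if node is not None and node.xpathEval(xpath):
                out.append(node.serialize())
            ret = reader.Next()            # skip the subtree just handled
        else:
            ret = reader.Read()
    return out</pre>

<p>Called as <code>serialize_matching(reader, "bibl", "@id = 'Aho'")</code> on
the reader created above, it would be expected to collect the same entry as
the loop from reader5.py, as a list of strings instead of one concatenated
string.</p>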

<p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>

<p>$Id$</p>

<p></p>
</body>
</html>