XML files are text files, intended for human consumption, that mix markup with content. The markup follows a small number of relatively simple rules. In addition, structural requirements ensure that an XML file is well-formed, which is a minimal level of validity. Further rules, expressed as either a Document Type Definition (DTD) or an XML Schema Definition (XSD), can impose additional structural constraints.
There are three separate XML parsers available with Python. We'll ignore the xml.parsers.expat module (not for any good reason), and focus on the xml.sax and xml.dom.minidom parsers.
xml.sax Parsing. The Simple API for XML (SAX) parser is described as an event parser. The parser recognizes different elements of an XML document and invokes methods in a handler which you provide. Your handler will be given pieces of the document, and can do appropriate processing with those pieces.
For most XML processing, your program will have the following outline:
1. Define a subclass of xml.sax.ContentHandler. The methods of this class are where your unique processing will happen.
2. Request the module to create a parser instance (an xml.sax.xmlreader.XMLReader) by calling xml.sax.make_parser.
3. Create an instance of your handler class. Provide this to the parser you created. The parser will then use your ContentHandler as it parses.
4. Set any features or options in the parser.
5. Invoke the parser on your document (or incoming stream of data from a network socket).
Here's a short example that shows the essentials of building a simple XML parser with the xml.sax module. This example defines a simple ContentHandler that prints the tags and counts the occurrences of the <informaltable> tag.
    import xml.sax

    class DumpDetails( xml.sax.ContentHandler ):
        def __init__( self ):
            # Let the superclass set up its own state first.
            xml.sax.ContentHandler.__init__( self )
            self.depth= 0
            self.tableCount= 0
        def startElement( self, aName, someAttrs ):
            # Print each tag indented by its nesting depth.
            print( self.depth*' ' + aName )
            self.depth += 1
            if aName == 'informaltable':
                self.tableCount += 1
        def endElement( self, aName ):
            self.depth -= 1
        def characters( self, content ):
            pass # ignore the actual data

    p= xml.sax.make_parser()
    myHandler= DumpDetails()
    p.setContentHandler( myHandler )
    p.parse( "../p5-projects.xml" )
    print( "%d tables" % myHandler.tableCount )
Since the parsing is event-driven, your handler must accumulate any context required to determine where the individual tags occur. In some content models (like XHTML and DocBook) there are two levels of markup: structural and semantic. The structural markup includes books, parts, chapters, sections, lists and the like. The semantic markup is sometimes called "inline" markup, and it includes tags to identify function names, class names, exception names, variable names, and the like. When processing this kind of document, your application must determine which tag is which.
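One common way to accumulate that context is to keep a stack of the currently open tags. The following is a minimal sketch, written for Python 3; the ContextHandler class name and the sample fragment are hypothetical, chosen only for illustration. It records the path of enclosing tags for each inline <function> tag.

```python
import xml.sax

class ContextHandler(xml.sax.ContentHandler):
    """Keep a stack of open tags so each inline tag's context is known."""

    def __init__(self):
        super().__init__()
        self.stack = []   # names of the elements currently open
        self.found = []   # path to each <function> tag that was seen

    def startElement(self, name, attrs):
        self.stack.append(name)
        if name == "function":
            # The stack tells us which structural tags enclose this one.
            self.found.append("/".join(self.stack))

    def endElement(self, name):
        self.stack.pop()

# Hypothetical DocBook-like fragment, used only for illustration.
SAMPLE = (
    b"<chapter><section>"
    b"<title>About <function>close</function></title>"
    b"<para>Call <function>open</function> first.</para>"
    b"</section></chapter>"
)

handler = ContextHandler()
xml.sax.parseString(SAMPLE, handler)
for path in handler.found:
    print(path)
```

With the stack available, the handler can treat a <function> inside a <title> differently from one inside a <para>, without the parser keeping any of the document in memory.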
A ContentHandler Subclass. The heart of a SAX parser is the subclass of ContentHandler that you define in your application. There are a number of methods which you may want to override. Minimally, you'll override the startElement and characters methods. There are other methods of this class described in section 13.10.1 of the Python Library Reference.
setDocumentLocator(locator)
The parser will call this method to provide an xml.sax.xmlreader.Locator object. This object has the XML document ID information (the public and system identifiers), plus line and column information. The locator is updated by the parser as it works, so it should only be used within these handler methods.
startDocument()
The parser will call this method at the start of the document. It can be used for initialization and resetting any context information.
endDocument()
This method is paired with the startDocument method; it is called once by the parser at the end of the document.
startElement(name, attrs)
The parser calls this method with each start tag that is found, in non-namespace mode. The name is the string with the tag name. The attrs parameter is an object that implements the Attributes interface (an xml.sax.xmlreader.AttributesImpl). This object may be reused by the parser, so your handler cannot simply save a reference to it; call its copy method if you need to keep the attributes. The Attributes object behaves somewhat like a mapping. Its documented interface does not list the [] operator for getting values, but it does include the get, has_key (in Python 2; use the in operator in Python 3), items, keys, and values methods.
endElement(name)
The parser calls this method with each end tag that is found, in non-namespace mode. The name is the string with the tag name.
startElementNS(name, qname, attrs)
The parser calls this method with each start tag that is found, in namespace mode. You set namespace mode by using the parser's p.setFeature( xml.sax.handler.feature_namespaces, True ). The name is a tuple with the URI for the namespace and the tag name. The qname is the fully qualified text name; a parser may pass None here unless the feature_namespace_prefixes feature is also enabled. The attrs parameter is an object that implements the namespace-aware Attributes interface (an xml.sax.xmlreader.AttributesNSImpl). This object may be reused by the parser, so your handler cannot simply save a reference to it; call its copy method if you need to keep the attributes. It behaves somewhat like a mapping. Its documented interface does not list the [] operator for getting values, but it does include the get, has_key (in Python 2; use the in operator in Python 3), items, keys, and values methods.
endElementNS(name, qname)
The parser calls this method with each end tag that is found, in namespace mode. The name is a tuple with the URI for the namespace and the tag name. The qname is the fully qualified text name.
characters(content)
The parser uses this method to provide character data to the ContentHandler. The parser may provide character data in a single chunk, or it may provide the characters in several chunks.
ignorableWhitespace(whitespace)
The parser will use this method to provide ignorable whitespace to the ContentHandler. This is whitespace between tags, usually line breaks and indentation. The parser may provide whitespace in a single chunk, or it may provide the characters in several chunks.
processingInstruction(target, data)
The parser will provide all <?target data?> processing instructions to this method. Note that the initial <?xml version="1.0" encoding="UTF-8"?> declaration is not reported.
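Several of these methods can be seen working together in a small sketch. The example below is written for Python 3 and makes some assumptions for illustration only: the TitleCollector class, the sample document, and its namespace URI are all hypothetical. It turns on namespace mode, joins the chunks delivered to characters, notes the line reported by the locator, and records processing instructions.

```python
import io
import xml.sax
import xml.sax.handler

class TitleCollector(xml.sax.ContentHandler):
    """Collect <title> text in namespace mode, plus processing instructions."""

    def __init__(self):
        super().__init__()
        self.locator = None
        self.chunks = None        # text chunks gathered while inside a title
        self.start_line = None
        self.titles = []          # (line, namespace URI, text) tuples
        self.instructions = []    # (target, data) tuples

    def setDocumentLocator(self, locator):
        self.locator = locator

    def startElementNS(self, name, qname, attrs):
        uri, local = name
        if local == "title":
            self.chunks = []
            # Consult the locator only from inside a handler method.
            self.start_line = (
                self.locator.getLineNumber() if self.locator else None
            )

    def characters(self, content):
        # Text may arrive in several chunks, so accumulate it.
        if self.chunks is not None:
            self.chunks.append(content)

    def endElementNS(self, name, qname):
        uri, local = name
        if local == "title":
            self.titles.append((self.start_line, uri, "".join(self.chunks)))
            self.chunks = None

    def processingInstruction(self, target, data):
        self.instructions.append((target, data))

# Hypothetical document, used only for illustration.
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<?render mode="draft"?>
<d:book xmlns:d="http://example.com/doc">
  <d:title>Parsing &amp; Events</d:title>
</d:book>
"""

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, True)
handler = TitleCollector()
parser.setContentHandler(handler)
parser.parse(io.BytesIO(SAMPLE))

print(handler.titles)
print(handler.instructions)
```

Joining the chunks in endElementNS, rather than acting on each call to characters, keeps the handler correct however the parser chooses to split the text.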
xml.dom.minidom Parsing. The Document Object Model (DOM) parser creates a document object model from your XML document. The parser transforms the text of an XML document into a DOM object. Once your program has the DOM object, you can examine that object.
Here's a short example that shows the essentials of building a simple XML parser with the xml.dom.minidom module. This example walks the parsed document, printing the tags and collecting each occurrence of the <informaltable> tag.
We define a walkNode function which does a recursive, depth-first traversal of the elements in the document structure. In many applications, the structure of the XML document is well known, and functions which are tied to the structure of the document can be used. In this example, we're reading a DocBook XML file, which has a complex, highly-nested structure.
    import xml.dom.minidom

    tables= []

    def walkNode( n, depth=0 ):
        # Print this element's tag, indented by its nesting depth.
        print( depth*' ' + n.tagName )
        if n.tagName == "informaltable":
            tables.append( n )
        # Recurse into child elements only; skip text, comments and the like.
        for d in n.childNodes:
            if d.nodeType == xml.dom.Node.ELEMENT_NODE:
                walkNode( d, depth+1 )

    dom1 = xml.dom.minidom.parse("../p5-projects.xml")
    walkNode( dom1.documentElement )
    print( tables )
The DOM Object Model. The heart of a DOM parser is the DOM class hierarchy. Your program will work with a Document object. We'll look at a few essential classes of the DOM. There are other classes in this model, described in section 13.6.2 of the Python Library Reference. We'll focus on the most commonly-used classes.
The XML Document Object Model is a standard definition. The standard applies to Java programs as well as Python. The xml.dom package provides definitions which meet this standard. The standard doesn't address how XML is parsed to create this structure. Consequently, the xml.dom package has no official parser. You could, for example, use a SAX parser to produce a DOM structure. Your handler would create objects from the classes defined in xml.dom.
The xml.dom.minidom package is a slightly simplified implementation of the DOM standard. This implementation of the standard is extended to include a parser. The essential class definitions, however, come from xml.dom. We'll only look at methods used to get data from an XML document. We'll ignore the additional methods used by a parser to build a DOM object.
The Node class is the superclass for all of the various DOM classes. It defines a number of attributes and methods which are common to all of the various subclasses. This class should be thought of as abstract: it is not used directly; it exists to provide common features to all of the subclasses.
Here are the attributes which are common to all of the various kinds of Nodes:
nodeType
This is an integer code that discriminates among the subclasses of Node. There are a number of helpful symbolic constants which are class variables in xml.dom.Node. These constants define the various types of Nodes, and include ELEMENT_NODE, ATTRIBUTE_NODE, TEXT_NODE, CDATA_SECTION_NODE, ENTITY_NODE, PROCESSING_INSTRUCTION_NODE, COMMENT_NODE, DOCUMENT_NODE, DOCUMENT_TYPE_NODE, and NOTATION_NODE.
attributes
This is a map-like collection of attributes. It is an instance of xml.dom.NamedNodeMap. It has methods including get, getNamedItem, getNamedItemNS, has_key, item, items, itemsNS, keys, keysNS, removeNamedItem, removeNamedItemNS, setNamedItem, setNamedItemNS, and values, plus a length attribute. The item method and the length attribute are defined by the standard and provided for Java compatibility.
localName
If there is a namespace, then this is the portion of the name after the colon. If there is no namespace, this is the entire tag name.
prefix
If there is a namespace, then this is the portion of the name before the colon. If there is no namespace, this is empty (an empty string or None, depending on the implementation).
namespaceURI
If there is a namespace, this is the URI for that namespace. If there is no namespace, this is None.
parentNode
This is the parent of this Node. The Document Node will have None for this attribute, since it is the root that contains all of the other Nodes in the document. For all other Nodes, this is the context in which the Node appears.
previousSibling
Sibling Nodes share a common parent. This attribute of a Node is the Node which precedes it within a parent. If this is the first Node under a parent, the previousSibling will be None. Often, the preceding Node will be a Text containing whitespace.
nextSibling
Sibling Nodes share a common parent. This attribute of a Node is the Node which follows it within a parent. If this is the last Node under a parent, the nextSibling will be None. Often, the following Node will be a Text containing whitespace.
childNodes
The list of child Nodes under this Node. Generally, this will be an xml.dom.NodeList instance, not a simple Python list. A NodeList behaves like a list, but adds an item method and a length attribute, which are defined by the standard and provided for Java compatibility.
firstChild
The first Node in the childNodes list, similar to childNodes[0]. It will be None if the childNodes list is empty.
lastChild
The last Node in the childNodes list, similar to childNodes[-1]. It will be None if the childNodes list is empty.
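These navigation attributes are easiest to see on a small document. The following is a minimal sketch in Python 3 using xml.dom.minidom; the sample document is hypothetical. It shows that whitespace between tags appears as Text nodes among the children, and how parentNode, previousSibling, firstChild and lastChild relate to childNodes.

```python
import xml.dom
import xml.dom.minidom

# Hypothetical document, used only for illustration.
SAMPLE = """<list>
  <item id="a">first</item>
  <item id="b">second</item>
</list>"""

doc = xml.dom.minidom.parseString(SAMPLE)
root = doc.documentElement

# Whitespace between the tags shows up as Text nodes among the children.
for child in root.childNodes:
    print(child.nodeType, repr(child.nodeName))

# Filter on nodeType to keep only the Element children.
items = [c for c in root.childNodes
         if c.nodeType == xml.dom.Node.ELEMENT_NODE]
first_item = items[0]

print(first_item.parentNode is root)
print(first_item.previousSibling.nodeType == xml.dom.Node.TEXT_NODE)
print(root.firstChild is root.childNodes[0])
print(root.lastChild is root.childNodes[-1])
print(first_item.attributes.getNamedItem("id").value)
print(doc.parentNode)
```

Filtering on nodeType, as the items list does, is usually simpler than stepping through nextSibling and skipping the whitespace Text nodes by hand.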
Here are some attributes which are overridden in each subclass of Node. They have slightly different meanings for each node type.
nodeName
A string with the "name" for this Node. For an Element, this will be the same as the tagName attribute. In some cases, it will be None.
nodeValue
A string with the "value" for this Node. For a Text, this will be the same as the data attribute. In some cases, it will be None.
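To make the differing meanings concrete, here is a small sketch in Python 3 that prints nodeName and nodeValue for several kinds of node built by xml.dom.minidom. The one-line sample document is hypothetical, extended from the <para> example used in this section.

```python
import xml.dom.minidom

# Hypothetical one-line document with an attribute, text and a comment.
doc = xml.dom.minidom.parseString(
    '<para id="sample">Text<!-- note --></para>'
)
para = doc.documentElement
text, comment = para.childNodes
id_attr = para.getAttributeNode("id")

# Show how the two attributes vary from one node type to the next.
for node in (doc, para, id_attr, text, comment):
    print(type(node).__name__, repr(node.nodeName), repr(node.nodeValue))
```

For the Element the name matches tagName and the value is None, while for the Text and Comment nodes the value matches the data attribute.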
Here are some methods of a Node.
hasAttributes()
This function returns True if there are attributes associated with this Node.
hasChildNodes()
This function returns True if there are child Nodes associated with this Node.
Document
This is the top-level document, the object returned by the parser. It is a subclass of Node, so it inherits all of those attributes and methods. The Document class adds some attributes and method functions to the Node definition.
documentElement
This attribute refers to the top-most Element in the XML document. A Document may also contain DocumentType, ProcessingInstruction and Comment Nodes. This attribute saves you having to dig through the childNodes list for the top Element.
getElementsByTagName(tagName)
This function returns a NodeList with each Element in this Document that has the given tag name.
getElementsByTagNameNS(namespaceURI, tagName)
This function returns a NodeList with each Element in this Document that has the given namespace URI and local tag name.
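When you only need certain elements, these two methods can replace a hand-written recursive walk. The following is a minimal sketch in Python 3 using xml.dom.minidom; the sample document and its namespace URI are hypothetical.

```python
import xml.dom.minidom

# Hypothetical document; the namespace URI is made up for illustration.
SAMPLE = """<doc xmlns:t="http://example.com/tables">
  <informaltable/>
  <section><informaltable/><t:grid/></section>
</doc>"""

doc = xml.dom.minidom.parseString(SAMPLE)
print(doc.documentElement.tagName)

# Finds matching elements at any depth in the document.
tables = doc.getElementsByTagName("informaltable")
print(len(tables))

# The namespace-aware form matches on the URI and the local name.
grids = doc.getElementsByTagNameNS("http://example.com/tables", "grid")
print([g.tagName for g in grids])
```

Note that the namespace-aware search is given the local name grid, while the tagName of the element it finds still carries the t: prefix.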
Element
This is a specific element within an XML document. An element is surrounded by XML tags. In <para id="sample">Text</para>, the tag is <para>, which provides the name for the Element. Most Elements will have children; some will have attributes as well as children. The Element class adds some attributes and method functions to the Node definition.
tagName
The full name for the tag. If there is a namespace, this will be the complete name, including colons. This will also be in nodeName.
getElementsByTagName(tagName)
This function returns a NodeList with each Element in this Element that has the given tag name.
getElementsByTagNameNS(namespaceURI, tagName)
This function returns a NodeList with each Element in this Element that has the given namespace URI and local tag name.
hasAttribute(name)
Returns True if this Element has an Attr with the given name.
hasAttributeNS(namespaceURI, localName)
Returns True if this Element has an Attr with the given name based on the namespace and localName.
getAttribute(name)
Returns the string value of the Attr with the given name. If the attribute doesn't exist, this will return a zero-length string.
getAttributeNS(namespaceURI, localName)
Returns the string value of the Attr with the given namespace URI and local name. If the attribute doesn't exist, this will return a zero-length string.
getAttributeNode(name)
Returns the Attr with the given name. If the named attribute doesn't exist, this method returns None.
getAttributeNodeNS(namespaceURI, localName)
Returns the Attr with the given namespace URI and local name. If the named attribute doesn't exist, this method returns None.
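Because getAttribute returns a zero-length string for a missing attribute, it cannot distinguish "missing" from "present but empty". The sketch below, in Python 3 with xml.dom.minidom, shows the difference; the attribute_or_default helper and the sample element are hypothetical, introduced only for illustration.

```python
import xml.dom.minidom

def attribute_or_default(element, name, default=None):
    """Hypothetical helper: return the attribute value, or default if absent."""
    if element.hasAttribute(name):
        return element.getAttribute(name)
    return default

# Hypothetical element with one ordinary and one empty attribute.
doc = xml.dom.minidom.parseString('<para id="sample" role="">Text</para>')
para = doc.documentElement

print(repr(para.getAttribute("id")))      # an ordinary attribute
print(repr(para.getAttribute("role")))    # present, but empty
print(repr(para.getAttribute("lang")))    # missing, also an empty string
print(para.hasAttribute("role"), para.hasAttribute("lang"))
print(para.getAttributeNode("lang"))      # None for a missing attribute
print(repr(attribute_or_default(para, "lang", "en")))
```

Checking hasAttribute first, or testing the result of getAttributeNode against None, are two ways to tell the cases apart.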
Attr
This is an attribute within an Element. In <para id="sample">Text</para>, the tag is <para>; this tag has an attribute of id with a value of sample. Generally, the nodeType, nodeName and nodeValue attributes are all that are used. The Attr class adds some attributes to the Node definition.
name
The full name of the attribute, which may include colons. The Node class defines localName, prefix and namespaceURI, which may be necessary for correctly processing this attribute.
value
The string value of the attribute. Also note that nodeValue will have a copy of the attribute's value.
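As a brief illustration, the following Python 3 sketch steps through the Attr nodes of an element using the item method and length attribute of the attributes collection. The sample element is hypothetical, and the pairs are sorted so the result does not depend on attribute order.

```python
import xml.dom
import xml.dom.minidom

# Hypothetical element with two attributes.
doc = xml.dom.minidom.parseString('<para id="sample" role="note">Text</para>')
para = doc.documentElement
attrs = para.attributes

# Index through the collection the way the DOM standard describes.
pairs = []
for i in range(attrs.length):
    attr = attrs.item(i)
    pairs.append((attr.name, attr.value))

for name, value in sorted(pairs):
    print(name, "=", value)

# nodeName and nodeValue carry the same information as name and value.
id_attr = para.getAttributeNode("id")
print(id_attr.nodeType == xml.dom.Node.ATTRIBUTE_NODE)
print(id_attr.nodeName == id_attr.name, id_attr.nodeValue == id_attr.value)
```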
Text
This is the text within an element. In <para id="sample">Text</para>, the text is Text. Note that end of line characters and indentation also count as Text nodes. Further, the parser may break up a large piece of text into a number of smaller Text nodes. The Text class adds an attribute to the Node definition.
data
The text. Also note that nodeValue will have a copy of the text.
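Since an element's text may be spread over several Text nodes, a program usually gathers the pieces before using them. The following is a minimal sketch in Python 3; the element_text helper and the sample element are hypothetical, and the helper only looks at Text nodes directly under the element, not at text inside nested tags.

```python
import xml.dom
import xml.dom.minidom

def element_text(element):
    """Hypothetical helper: join the data of the Text children of element."""
    return "".join(
        child.data
        for child in element.childNodes
        if child.nodeType == xml.dom.Node.TEXT_NODE
    )

# Hypothetical element mixing text with an inline tag.
doc = xml.dom.minidom.parseString(
    "<para id='sample'>Some <emphasis>inline</emphasis> text.\n</para>"
)
para = doc.documentElement

print(repr(element_text(para)))
print(repr(element_text(para.getElementsByTagName("emphasis")[0])))
```

The end of line character is kept as part of the text, so an application that does not care about whitespace would typically strip or normalize the joined string.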
Comment
This is the text within a comment. The <!-- and --> characters are not included. The Comment class adds an attribute to the Node definition.
data
The comment. Also note that nodeValue will have a copy of the comment.
Published under the terms of the Open Publication License