Office Suite Extraction. Most office suite software can save files in XML format as
well as their own proprietary format. The XML is complex, but you
can examine it in pieces using Python programs. It helps to work
with highly structured data, like an XML version of a spreadsheet.
For example, your spreadsheet may use tags like
<Table>
, <Row>
and
<Cell>
to organize the content of the
spreadsheet.
First, write a simple program to show the top-level elements
of the document. It often helps to show the text within those
elements so that you can correlate the XML structure with the
original document contents.
Once you can display the top-level elements, you can focus on
the elements that have meaningful data. For example, if you are
parsing spreadsheet XML, you can assembled the values of all of the
<Cell>
's in a <Row>
into a
proper row of data, perhaps using a simple Python
list
.