Understanding XMLParser Validation

The XML 1.0 specification defines the behavior of two different types of XML parsers: validating and non-validating. Validating parsers enforce all document constraints, including Well-Formedness Constraints (WFCs) and Validity Constraints (VCs), while non-validating parsers enforce Well-Formedness Constraints only and ignore Validity Constraints. To enforce all VCs, validating parsers typically require that a document conform to some type of schema, like a DTD. Non-validating parsers only require that a document be well-formed (that it conform to all WFCs).

The Pharo XMLParser library supports both validating and non-validating modes of operation, and it uses separate exception classes, XMLWellFormednessException and XMLValidationException, to signal violations of WFCs and VCs.
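As a rough sketch of how the two exception classes can be told apart, the snippet below classifies a few inline documents. The sample documents and the classify block are made up for illustration; only the parser and exception class names come from the library.

```smalltalk
| classify |
"classify is a hypothetical helper block, not part of XMLParser"
classify := [:xml |
	[[XMLDOMParser parse: xml.
	#ok]
		on: XMLWellFormednessException
		do: [:error | #notWellFormed]]
		on: XMLValidationException
		do: [:error | #notValid]].

"mismatched tags violate a WFC"
classify value: '<a><b></a>'.

"well-formed, but a is declared EMPTY and has a child, violating a VC"
classify value: '<!DOCTYPE a [<!ELEMENT a EMPTY>]><a><b/></a>'.

"well-formed, with no DTD to validate against"
classify value: '<a/>'.
```

Inspecting each classify value: line separately shows which kind of constraint, if any, the document violates.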

By default, XMLParser operates as a validating parser, but it supports two different levels of validation: “soft” and “standard.”

Soft validation

With “soft” validation (the default), XMLParser enforces all entity-related VCs, checks that the name of the root element matches the name specified by the DOCTYPE declaration (if one is present), and validates any xml:id attributes. If the document has an internal or external DTD subset with at least one ELEMENT or ATTLIST declaration, it also attempts to validate the entire document against the DTD. In other words, in this mode, validation against a DTD is attempted only if one is present.
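The following is a minimal sketch of that behavior, assuming the default parser configuration. The passesSoftValidation block and the sample documents are made up for this example.

```smalltalk
| passesSoftValidation |
"passesSoftValidation is a hypothetical helper block, not part of XMLParser"
passesSoftValidation := [:xml |
	[XMLDOMParser parse: xml.
	true]
		on: XMLValidationException
		do: [:error | false]].

"no DOCTYPE declaration or DTD: there is nothing to validate against"
passesSoftValidation value: '<b>text</b>'.

"the DOCTYPE declaration names a, but the root element is b"
passesSoftValidation value: '<!DOCTYPE a><b>text</b>'.

"the internal subset declares a as EMPTY, but it has content"
passesSoftValidation value: '<!DOCTYPE a [<!ELEMENT a EMPTY>]><a>text</a>'.
```

Only the first document should pass, because it gives the parser nothing to validate against.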

Standard validation

With “standard” validation, all of the constraints enforced by “soft” validation are in effect. In addition, a DTD (with ELEMENT and ATTLIST declarations) or some other type of schema (only DTDs are presently supported) describing the structure of the document is required, and the absence of one is treated as a validation error. This is the behavior mandated by the XML 1.0 specification for validating parsers.

To get “standard” validation, set #requiresSchema: to true. Enabling #requiresSchema: also enables validation (if it wasn’t already enabled), and disabling validation (with #isValidating:) also disables #requiresSchema:.
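Here is a small sketch of the difference a required schema makes. The parseRequiringSchema block and the two inline documents are made up for illustration.

```smalltalk
| parseRequiringSchema |
"parseRequiringSchema is a hypothetical helper block, not part of XMLParser"
parseRequiringSchema := [:xml |
	[(XMLDOMParser on: xml)
		requiresSchema: true;
		parseDocument.
	#valid]
		on: XMLValidationException
		do: [:error | #notValid]].

"no DTD at all: treated as a validation error in this mode"
parseRequiringSchema value: '<root/>'.

"a DTD with an ELEMENT declaration that the document satisfies"
parseRequiringSchema value: '<!DOCTYPE root [<!ELEMENT root EMPTY>]><root/>'.
```

The same '<root/>' document parses without complaint under the default configuration, since no schema is required there.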

Implementation details

To implement the content model regular expression syntax of ELEMENT declarations, XMLParser uses a variant of the classic Thompson NFA construction with lazy, bounded conversion to DFAs.
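To give a feel for the kind of expression those automata have to match, here is a small sketch from the user's side (it does not show the NFA or DFA machinery itself). The DTD, element names, and matchesContentModel block are made up for this example; the declaration for doc combines a sequence, a choice, and a repetition.

```smalltalk
| dtd matchesContentModel |
dtd := '<!DOCTYPE doc [
	<!ELEMENT doc (title, (para | list)*)>
	<!ELEMENT title (#PCDATA)>
	<!ELEMENT para (#PCDATA)>
	<!ELEMENT list EMPTY>
]>'.

"matchesContentModel is a hypothetical helper block, not part of XMLParser"
matchesContentModel := [:body |
	[XMLDOMParser parse: dtd, body.
	true]
		on: XMLValidationException
		do: [:error | false]].

"a title followed by any mix of para and list elements matches"
matchesContentModel value: '<doc><title>t</title><para>p</para><list/></doc>'.

"the required leading title is missing, so this does not match"
matchesContentModel value: '<doc><para>p</para></doc>'.
```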

Feel free to contact me with any XMLParser-related questions.


XMLParser Performance Tips

Here are some tips for improving the performance of the Pharo XMLParser library.

Use SAX instead of DOM

SAX parsing is generally faster and more memory efficient than DOM parsing. However, there is a definite trade-off in usability, especially since the XPath library can only be used with DOM and StAX (from the XMLParserStAX project) parsers.
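For readers who have not written a SAX handler before, here is a minimal sketch that counts elements without building a DOM tree. The class name, package name, and sample document are placeholders, and the class is created and compiled from a workspace only to keep the example in one piece (you would normally define it in the browser; the class creation message may also differ between Pharo versions).

```smalltalk
| counterClass handler |
"XMLExampleRowCounter is a hypothetical example class"
counterClass := SAXHandler
	subclass: #XMLExampleRowCounter
	instanceVariableNames: 'rowCount'
	classVariableNames: ''
	package: 'XMLParser-Examples'.

"accessor that treats an unset count as zero"
counterClass compile: 'rowCount
	^ rowCount ifNil: [0]'.

"called by the parser for every start tag"
counterClass compile: 'startElement: aQualifiedName attributes: aDictionary
	aQualifiedName = ''row''
		ifTrue: [rowCount := self rowCount + 1]'.

handler := counterClass on: '<rows><row/><row/><row/></rows>'.
handler parseDocument.
handler rowCount.
```

The handler only keeps a counter, so memory use stays flat no matter how many row elements the input contains.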

Use partial document parsing

Instead of parsing the entire document with #parseDocument, use #parseDocumentWhile: or #parseDocumentUntil: to do partial document parsing. With SAX parsing, you can also use #interruptParsing from within a handler.
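A rough sketch of partial parsing with #parseDocumentWhile: follows. It assumes the block takes no arguments and is re-evaluated between parsing steps; the step counter and the limit of 4 are arbitrary choices for this example, and a real condition would usually test something the handler or parser has collected so far.

```smalltalk
| steps partialDocument |
steps := 0.

"keep parsing only while the block answers true"
partialDocument :=
	(XMLDOMParser on: '<rows><row>1</row><row>2</row><row>3</row></rows>')
		parseDocumentWhile: [(steps := steps + 1) <= 4].

"whatever was parsed before the block answered false"
partialDocument.
```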

Disable validation and namespace support

If you don’t need validation or namespace support, they can be disabled before parsing for performance:

(parser := XMLDOMParser on: xmlSource)
	usesNamespaces: false;
	isValidating: false;
	parseDocument.

#optimizeForLargeDocuments disables both, and also disables the document security read limit.

When namespace support is disabled, namespace declarations are treated as ordinary attributes.
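A short sketch of what that looks like in practice, using a made-up document and namespace URI:

```smalltalk
| root |
root := ((XMLDOMParser on: '<root xmlns:p="urn:example"/>')
	usesNamespaces: false;
	parseDocument) root.

"with namespace support disabled, the declaration is just another attribute"
root attributeAt: 'xmlns:p'.
```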

Avoid parsing files in-memory

"bad: slurps the entire contents of fileName
into an in-memory string before parsing it"
XMLDOMParser parse: fileName asFileReference contents.

"good: parses fileName directly, without
first reading it into an in-memory string"
XMLDOMParser parseFileNamed: fileName.

Use XMLAttributeList when DOM parsing

XMLCachingAttributeList is the default attribute list class used by XMLDOMParser. Unlike XMLAttributeList, it maintains an internal Dictionary to provide faster attribute lookup, which means it requires more memory than XMLAttributeList. To use XMLAttributeList instead, inject a custom node factory with it prior to parsing:

(parser := XMLDOMParser on: xmlSource)
	nodeFactory:
		(XMLPluggableNodeFactory new
			attributeListClass: XMLAttributeList);
	parseDocument.

DOM parsing with XMLAttributeList not only conserves memory but can also be slightly faster. However, manipulating XMLAttributeLists through the DOM API can be significantly slower, and adding new attributes to one through the DOM API can have quadratic complexity.


Querying US Government Data Sets With Pharo and XMLParser

data.medicare.gov is a US government website that distributes public healthcare data sets, including a collection of data sets called Hospital Compare for comparing American hospitals. These data sets are available in multiple formats, such as XML, JSON, and CSV.

Today, we’re going to use Pharo and its XMLParser library (I am its principal author and maintainer) to download, parse, and analyze the Total Performance Score XML data set from the Hospital Compare collection, which ranks hospitals by their performance on multiple healthcare outcome indexes.

First, we must install the XMLParser project from the Pharo Catalog. Opening the Catalog and typing “xml” returns a list of XML-related projects, including the one we want (XMLParser). Selecting it and clicking the green button installs the latest stable version into our image.

Once the installation completes, we can start working with the data set by inspecting this code in a workspace:

XMLDOMParser parseURL:

Inspecting (with Ctrl+i) brings up a GTInspector on the resulting XMLDocument object, which we will use to navigate and discover its tree structure.

We can see that it has a root response element with a single redundant row element, which in turn has many row element children containing the actual data for each hospital.

Let’s say we wanted to find the top-ranked hospitals in the state of Florida. Extracting the hospital_name and total_performance_score for each hospital in Florida and then sorting them by the total_performance_score would do the trick:

((self root firstElement
	elementsSelect: [:each |
		(each isNamed: 'row')
			and: [(each contentStringAt: 'state') = 'FL']])
	collect: [:each |
		(each contentStringAt: 'hospital_name') ->
			(each contentStringAt: 'total_performance_score') asNumber])
	sort: [:a :b | a value >= b value]

We use self root firstElement to get the first child row element from the root response element of the document. We then enumerate and filter its row children with elementsSelect:, and use the standard collect: to build an ordered list of hospital name -> score Associations, which we sort by value (the score) in descending order.
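To try the query without downloading anything, here is a sketch that runs it against a tiny inline document. The hospital names and scores are invented, and the document only mimics the response/row/row nesting described above; it is not real data from the data set.

```smalltalk
| document ranked |
"made-up sample data with the same nesting as the real data set"
document := XMLDOMParser parse:
	'<response><row>',
	'<row><hospital_name>Hospital A</hospital_name><state>FL</state>',
	'<total_performance_score>41.5</total_performance_score></row>',
	'<row><hospital_name>Hospital B</hospital_name><state>FL</state>',
	'<total_performance_score>63.0</total_performance_score></row>',
	'<row><hospital_name>Hospital C</hospital_name><state>GA</state>',
	'<total_performance_score>70.25</total_performance_score></row>',
	'</row></response>'.

"same query as before, with document root in place of self root"
ranked := ((document root firstElement
	elementsSelect: [:each |
		(each isNamed: 'row')
			and: [(each contentStringAt: 'state') = 'FL']])
	collect: [:each |
		(each contentStringAt: 'hospital_name') ->
			(each contentStringAt: 'total_performance_score') asNumber])
	sort: [:a :b | a value >= b value].

ranked first key.
```

Hospital C has the highest score in the sample but is filtered out by the state test, so only the two Florida rows are ranked.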

The top 10 hospitals:

