The XPath GTInspector

The XPath package comes with a custom GTInspector. Pressing Cmd+I on an XPath object (which can be created by sending #asXPath to a string) will bring it up.
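For example, evaluating the following in a playground creates an XPath object and opens the inspector on it (the expression itself is just an illustration):

```smalltalk
"Create an XPath object from a string and inspect it"
'//entry[position() = 2]' asXPath inspect.
```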

xpath-workspace-screenshot

Source tab

xpath-inspector-source-tab

The Source tab displays the XPath source code with syntax highlighting. The highlighting was implemented with a new expression parser handler class that just tracks the start and end positions of syntax constructs instead of building an AST, and with the XMLParser GT highlighting classes to highlight those constructs. We need this position-tracking parser handler even though our AST already supports printing with syntax highlighting, because the transformation from source to AST (even before optimization) discards things like non-significant whitespace, making it impossible to reproduce the original source. It’s also inefficient to generate a full AST when we only need token positions.

The default colors chosen to highlight XPath are based on the colors used to highlight Smalltalk, so similar constructs are highlighted the same way (strings, numbers, and blocks in the example).


The Source tab is also editable. Pressing Cmd+s (or clicking the green checkmark) accepts the edits and updates the XPath object. Cmd+z (or the purple arrow) reverts the edits.

xpath-inspector-source-tab-error

Invalid syntax edits (like the missing “]” above) are highlighted in red.

AST tab

The AST tab displays a navigable tree of AST node objects. Clicking an arrow expands a node, and clicking a node brings up a right pane with subtree and source tabs for that node.

Optimized AST tab

The Optimized AST tab shows the resulting AST after it has been optimized. Notice that the [position() = 2] predicate was simplified to [2]. Like the AST tab, the subtree and source tabs in the right pane are generated from the AST node you click on in the left pane.

Compiled tab

xpath-inspector-compiled-tab

The Compiled tab shows the Smalltalk code that the optimized XPath AST is translated into, which is compiled using the system compiler before evaluation.

Conclusion

The XPath GTInspector is useful for both users and maintainers. Users get an XPath editor with syntax highlighting and syntax checking, while maintainers can quickly compare XPath source with its AST, optimized AST, and generated Smalltalk code.

XMLParser and Security

Most security-aware programmers have heard of at least one XML-related vulnerability by now, such as the infamous XML External Entity (XXE) exploit, which resulted in Facebook paying out what was its largest ever bug bounty.

How does the Pharo XMLParser library deal with XXE and other XML exploits? We’ll use this list of common vulnerabilities from the Python documentation website: https://docs.python.org/3/library/xml.html#xml-vulnerabilities

Billion Laughs

The Billion Laughs attack – also known as exponential entity expansion – uses multiple levels of nested entities. Each entity refers to another entity several times, and the final entity definition contains a small string. The exponential expansion results in several gigabytes of text and consumes lots of memory and CPU time.
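The well-known shape of the attack (shown here with only three levels of nesting; the real attack uses around ten, hence “a billion”):

```xml
<?xml version="1.0"?>
<!DOCTYPE lolz [
  <!ENTITY lol "lol">
  <!ENTITY lol2 "&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;&lol;">
  <!ENTITY lol3 "&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;&lol2;">
]>
<lolz>&lol3;</lolz>
```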

Mitigation: XMLParser protects against this in two ways:

  1. It limits the depth of entity reference replacement by default.
  2. It limits the total number of logical characters the parser is allowed to read when parsing a document.

(These and other limits are configurable, for example using #documentReadLimit:, and can be removed quickly with #removeLimits. Browse SAXHandler’s “configuring” protocol for more.)
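For example, the read limit can be raised, or all limits removed, before parsing; a sketch using the selectors named above (the limit value here is arbitrary):

```smalltalk
"Raise the document read limit to 10 million characters"
(XMLDOMParser on: xmlSource)
	documentReadLimit: 10000000;
	parseDocument.

"Or remove all security limits (only do this for trusted input)"
(XMLDOMParser on: xmlSource)
	removeLimits;
	parseDocument.
```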

Quadratic Blowup Entity Expansion

A quadratic blowup attack is similar to a Billion Laughs attack; it also abuses entity expansion. Instead of nesting entities, it repeats one large entity (a few thousand characters) over and over again. The attack isn’t as efficient as the exponential case, but it avoids triggering parser countermeasures that forbid deeply nested entities.
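The shape of the payload, scaled down for illustration:

```xml
<?xml version="1.0"?>
<!DOCTYPE bomb [
  <!-- in a real attack, the entity value is thousands of characters
       long and the reference is repeated thousands of times -->
  <!ENTITY a "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa">
]>
<bomb>&a;&a;&a;&a;&a;&a;&a;&a;</bomb>
```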

Mitigation: same as previous.

External Entity Expansion

Entity declarations can contain more than just text for replacement. They can also point to external resources or local files. The XML parser accesses the resource and embeds the content into the XML document.

Mitigation: XMLParser disables external entity resolution by default, meaning that external entity references in the document being parsed will not be resolved, and the inability to resolve them will raise an error if resolution is required to properly parse the document. Note that even with external entity resolution disabled, you can still parse external documents using messages like parseURL: or onURL:, but make sure the URL is trusted.
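A minimal XXE payload has this shape; with external entity resolution disabled, XMLParser will not fetch the referenced file:

```xml
<?xml version="1.0"?>
<!DOCTYPE root [
  <!ENTITY xxe SYSTEM "file:///etc/passwd">
]>
<root>&xxe;</root>
```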

DTD Retrieval

Some XML libraries like Python’s xml.dom.pulldom retrieve document type definitions from remote or local locations. The feature has similar implications as the external entity expansion issue.

Mitigation: same as previous.

Decompression Bomb

Decompression bombs (aka ZIP bombs) apply to all XML libraries that can parse compressed XML streams, such as gzipped HTTP streams or LZMA-compressed files. For an attacker, they can reduce the amount of transmitted data by three orders of magnitude or more.

Mitigation: XMLParser supports HTTP internally through Pharo’s Zinc library and Squeak’s WebClient library (Zinc is used if both are installed) and through its XMLHTTPRequest/XMLHTTPResponse interfaces, which do support GZIP compression; however, compression is not enabled by default.

Here is the table taken from the page linked to above:

python xml library vulnerability table

Here is that same table, updated for comparison with Pharo’s XMLParser library:

python xml library vulnerability table with pharo xmlparser comparison

Understanding XMLParser Validation

The XML 1.0 specification defines the behavior of two different types of XML parsers: validating and non-validating. Validating parsers enforce all document constraints, including Well-Formedness Constraints (WFCs) and Validity Constraints (VCs), while non-validating parsers enforce Well-Formedness Constraints only and ignore Validity Constraints. To enforce all VCs, validating parsers typically require that a document conform to some type of schema, like a DTD. Non-validating parsers only require that a document be well-formed (that it conform to all WFCs).

The Pharo XMLParser library supports both validating and non-validating modes of operation, and it uses separate exception classes, XMLWellFormednessException and XMLValidationException, to signal violations of WFCs and VCs.
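Both exception classes are ordinary Pharo exceptions, so violations can be handled selectively; a sketch:

```smalltalk
"Handle validity errors separately from well-formedness errors"
[XMLDOMParser parse: xmlString]
	on: XMLValidationException
	do: [:err |
		Transcript show: 'validity error: ', err messageText; cr].
```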

By default, XMLParser operates as a validating parser. But it actually supports two different levels of validation: “soft” and “standard.”

Soft validation

With “soft” validation (the default), XMLParser enforces all entity-related VCs, checks that the name of the root element matches the name given in the DOCTYPE declaration (if one is present), and validates any xml:id attributes. If the document has an internal or external DTD subset with at least one ELEMENT or ATTLIST declaration, it will also attempt to validate the entire document against the DTD schema. In other words, in this mode, validation against a DTD is only attempted if one is present.

Standard validation

With “standard” validation, all of the constraints enforced by “soft” validation are in effect. In addition, a DTD (with ELEMENT and ATTLIST declarations) or some other type of schema (only DTDs are presently supported) describing the structure of the document is required, and the absence of one is treated as a validation error. This is the behavior mandated by the XML 1.0 specification for validating parsers.

To get “standard” validation, just set #requiresSchema: to true. Enabling #requiresSchema: enables validation (if it wasn’t already enabled), and disabling validation (with #isValidating:) also disables #requiresSchema:.
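For example:

```smalltalk
"Require a DTD schema, as the XML 1.0 spec mandates for validating parsers"
(XMLDOMParser on: xmlSource)
	requiresSchema: true;
	parseDocument.
```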

Implementation details

To implement the content model regular expression syntax of ELEMENT declarations, XMLParser uses a variant of the classic Thompson NFA construction with lazy, bounded conversion to DFAs.

Feel free to contact me with any XMLParser-related questions.

XMLParser Performance Tips

Here are some tips for improving the performance of the Pharo XMLParser library.

Use SAX instead of DOM

SAX parsing is generally faster and more memory efficient than DOM parsing. However, there is a definite trade-off in usability, especially since the XPath library can only be used with DOM and StAX (from the XMLParserStAX project) parsers.
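As a sketch of the SAX style, here is a minimal handler that counts elements instead of building a DOM. The class name and package are hypothetical; the event selectors follow SAXHandler’s handling protocol (browse SAXHandler for the full set of events):

```smalltalk
"Define a SAXHandler subclass (hypothetical name and package)"
SAXHandler subclass: #ElementCountingHandler
	instanceVariableNames: 'elementCount'
	classVariableNames: ''
	package: 'MyApp-XML'.

"ElementCountingHandler >> startDocument"
startDocument
	elementCount := 0.

"ElementCountingHandler >> startElement:attributes:"
startElement: aName attributes: anAttributeDictionary
	elementCount := elementCount + 1.
```

It can then be used like a DOM parser: (ElementCountingHandler on: xmlSource) parseDocument.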

Use partial document parsing

Instead of parsing the entire document with #parseDocument, use #parseDocumentWhile: or #parseDocumentUntil: to do partial document parsing. With SAX parsing, you can also use #interruptParsing from within a handler.
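For example, within a SAX event handler you might stop as soon as the data of interest has been seen (the element name here is hypothetical):

```smalltalk
"In a SAXHandler subclass: stop parsing once the target element is reached"
startElement: aName attributes: anAttributeDictionary
	aName = 'lastInterestingElement'
		ifTrue: [self interruptParsing].
```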

Disable validation and namespace support

If you don’t need validation or namespace support, they can be disabled before parsing for performance:

parser
	usesNamespaces: false;
	isValidating: false;
	parseDocument.

#optimizeForLargeDocuments disables both, and also disables the document security read limit.

When namespace support is disabled, namespace declarations are treated as ordinary attributes.
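For example:

```smalltalk
"Disable namespaces, validation, and the read limit in one message"
(XMLDOMParser on: xmlSource)
	optimizeForLargeDocuments;
	parseDocument.
```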

Avoid parsing files in-memory

"bad: slurps the entire contents of fileName
into an in-memory string before parsing it"
XMLDOMParser parse: fileName asFileReference contents.

"good: parses fileName directly, without
first reading it into an in-memory string"
XMLDOMParser parseFileNamed: fileName.

Use XMLAttributeList when DOM parsing

XMLCachingAttributeList is the default attribute list class used by XMLDOMParser. Unlike XMLAttributeList, it maintains an internal Dictionary to provide faster attribute lookup, which means it requires more memory than XMLAttributeList. To use XMLAttributeList instead, inject a custom node factory with it prior to parsing:

(parser := XMLDOMParser on: xmlSource)
	nodeFactory:
		(XMLPluggableNodeFactory new
			attributeListClass: XMLAttributeList);
	parseDocument.

DOM parsing with XMLAttributeList can not only conserve memory, but it can actually be slightly faster. But manipulating XMLAttributeLists through the DOM API can be significantly slower, and using the DOM API to add new attributes to one can have quadratic complexity.

Feel free to contact me with any XMLParser-related questions.

Querying US Government Data Sets With Pharo and XMLParser

data.medicare.gov is a US government website that distributes public healthcare data sets, including a collection of data sets called Hospital Compare for comparing American hospitals. These data sets are available in multiple formats, such as XML, JSON, and CSV.

Today, we’re going to use Pharo and its XMLParser library (I am its principal author and maintainer) to download, parse, and analyze the Total Performance Score XML data set from the Hospital Compare collection, which ranks hospitals by their performance on multiple healthcare outcome indexes.

First, we must install the XMLParser project from the Pharo Catalog. Opening the Catalog and typing “xml” returns a list of XML-related projects, including the one we want (XMLParser). Selecting it and clicking the green button installs the latest stable version into our image:

Once the installation completes, we can start working with the data set by inspecting this code in a workspace:

XMLDOMParser parseURL:
	'https://data.medicare.gov/api/views/ypbt-wvdk/rows.xml?accessType=DOWNLOAD'.

Inspecting (with Ctrl+i) brings up a GTInspector on the resulting XMLDocument object, which we will use to navigate and discover its tree structure:

We can see that it has a root response element with a single redundant row element, which in turn has many row element children containing the actual data for each hospital.

Let’s say we wanted to find the top-ranked hospitals in the state of Florida. Extracting the hospital_name and total_performance_score for each hospital in Florida and then sorting them by the total_performance_score would do the trick:

((self root firstElement
	elementsSelect: [:each |
		(each isNamed: 'row')
			and: [(each contentStringAt: 'state') = 'FL']])
		collect: [:each |
			(each contentStringAt: 'hospital_name') ->
				(each contentStringAt: 'total_performance_score') asNumber])
			sort: [:a :b |
				a value >= b value]

We use self root firstElement to get the first child row element from the root response element of the document. We then enumerate and filter all of its row children using elementsSelect:, and then use the standard collect: to get an ordered list of hospital name -> score Associations, which we sort by the values (the scores):
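If the XPath package is also installed, the selection step can be expressed more compactly as an XPath query. This is a sketch: #asXPath is the constructor described earlier, and #in: is assumed here to be the evaluation message that applies the XPath to a context node; check the XPath package for the exact API:

```smalltalk
"Select all Florida row elements via XPath (evaluation message assumed)"
'/response/row/row[state = ''FL'']' asXPath in: document.
```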

The top 10 hospitals:

  1. ST VINCENTS MEDICAL CENTER – CLAY COUNTY
  2. TWIN CITIES HOSPITAL
  3. WEST KENDALL BAPTIST HOSPITAL
  4. FLORIDA HOSPITAL WESLEY CHAPEL
  5. MAYO CLINIC
  6. GULF BREEZE HOSPITAL
  7. MEASE DUNEDIN HOSPITAL
  8. MEMORIAL HOSPITAL MIRAMAR
  9. FLORIDA HOSPITAL DELAND
  10. CLEVELAND CLINIC HOSPITAL

Feel free to contact me with any XMLParser-related questions.