RemexHtml

From mediawiki.org

Introduction

RemexHtml is a parser for HTML 5, written in PHP.

RemexHtml aims to be:

  • Modular and flexible.
  • Fast, as opposed to elegant. For example, we sometimes use direct member access instead of going through accessors, and manually inline some performance-sensitive code.
  • Robust, aiming for O(N) worst-case performance.

RemexHtml contains the following modules:

  • A compliant preprocessor and tokenizer. This generates a token event stream.
  • Compliant tree construction, including error recovery. This generates a tree mutation event stream.
  • A fast integrated HTML serializer, compliant with the HTML fragment serialization algorithm.
  • DOMDocument construction.

RemexHtml presently lacks:

  • Encoding support. The input is expected to be valid UTF-8.
  • Scripting.
  • XHTML serialization.
  • Precise compliance with specified parse error generation.

RemexHtml aims to be compliant with W3C recommendation HTML 5.1, except for minor backported bugfixes. We chose to implement the W3C standard rather than the latest WHATWG draft because our application needs stability more than feature completeness.

RemexHtml passes all html5lib tests, except for parse error counts and tests which reference a future version of the standard.

Installation

In MediaWiki

RemexHtml has been available in MediaWiki as a core composer dependency since MediaWiki 1.29. Its initial use case was as a replacement for HTML Tidy. Output from the wikitext parser is fed into RemexHtml's HTML parser and cleaned up per the HTML 5 tag soup specification. The Tokenizer component is now also used for tag stripping in Sanitizer.

It is also used for HTML postprocessing in the Collection, TEI and Wikibase extensions.

Everywhere else

Install the wikimedia/remex-html package from Packagist:

composer require wikimedia/remex-html

Semantic versioning is used. The major version number will be incremented for every change that breaks backwards compatibility.

Architecture overview

For full reference documentation, please see the documentation generated from the source (or the source itself).

RemexHtml uses a pipeline model. Each event producer calls the attached callback object when it has an event ready to produce. The pipeline stages are:

Tokenizer
Produces a stream of tokens from HTML. Performs tokenization, as described by the tokenization chapter in the HTML specification.
Dispatcher
Tracks the insertion mode, and relays token events to the handler specific to the current insertion mode. Each insertion mode has its own class, with methods for each of the token types.
TreeBuilder
A helper class for the insertion modes. It tracks the state of the tree construction process, receives requests for tree mutation from the insertion mode classes, and dispatches tree mutation events.

In the HTML specification, the tree construction algorithm is imagined as being tightly integrated with creation of a DOM data structure. A major innovation of RemexHtml is to separate tree construction into a phase which generates a tree mutation event stream, and a phase which actually produces the data structure. RemexHtml is able to directly serialize the tree mutation event stream, without needing to store the whole DOM in memory.

Serializer
Produces HTML from a tree mutation event stream.
DOMBuilder
Produces a native PHP DOMDocument from a tree mutation event stream.

When Serializer is used, there is a final pipeline stage:

Formatter
The Formatter interface converts SerializerNode objects to strings. It is a helper for Serializer which allows details of the produced HTML to be easily customised. Serializer is complex and stateful, whereas Formatter subclasses are generally stateless, except for configuration.

RemexHtml also provides:

DOMSerializer
A utility class which serializes a DOM contained within a DOMBuilder, with an interface similar to Serializer.
PropGuard
Many RemexHtml classes use the PropGuard trait, which prevents accidental assignment of undeclared properties. This helps detect developer confusion over class types. If there is a pressing need to use undeclared properties in your application, PropGuard can be globally disabled using PropGuard::$armed = false.
TokenGenerator
A class which provides a token stream via a generator interface, instead of an event stream. It constructs its own Tokenizer. Consuming token events in this way is less efficient, but may be more convenient for some use cases.
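
As a rough sketch of the generator interface, the following helper counts start tags by element name. It assumes that TokenGenerator::generate( $text, $options ) is the static entry point and that each yielded token is an array with a 'type' key and, for tags, a 'name' key; please check the generated documentation for the version you have installed. countStartTags is a hypothetical name, and depending on the release the classes may live under Wikimedia\RemexHtml rather than RemexHtml.

```php
<?php
use RemexHtml\Tokenizer\TokenGenerator;

/**
 * Count start tags by element name.
 * Hypothetical helper, not part of RemexHtml.
 */
function countStartTags( string $html ): array {
	$counts = [];
	// Assumption: generate() yields one array per token, with a 'type' key,
	// and start tag tokens carry the tag name under 'name'.
	foreach ( TokenGenerator::generate( $html, [] ) as $token ) {
		if ( $token['type'] === 'startTag' ) {
			$name = $token['name'];
			$counts[$name] = ( $counts[$name] ?? 0 ) + 1;
		}
	}
	return $counts;
}
```

Because the tokens are pulled through a foreach loop, the caller can stop early with break, which is awkward to do from inside an event handler.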

There are optional pipeline stages providing debugging facilities:

DispatchTracer
This class sits between Tokenizer and Dispatcher. It reports all token events, and reports insertion mode transitions within Dispatcher. Log messages are sent to a callback function.
TreeMutationTracer
This forwards tree mutation events coming from TreeBuilder, and reports such events to a callback.
DestructTracer
This class forwards tree mutation events, and reports when the Element object emitted by TreeBuilder is destroyed. This helps to identify memory leaks.
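
A tracer is attached like any other stage, by placing it between the two stages it should observe. The sketch below puts a TreeMutationTracer in front of a Serializer. It assumes the tracer's constructor takes the following stage and a callback which receives one message string per event; serializeWithTrace is a hypothetical name, and depending on the release the classes may live under Wikimedia\RemexHtml rather than RemexHtml.

```php
<?php
use RemexHtml\Serializer\HtmlFormatter;
use RemexHtml\Serializer\Serializer;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;
use RemexHtml\TreeBuilder\TreeMutationTracer;

/**
 * Reserialize HTML while reporting tree mutation events to $log.
 * Hypothetical helper, not part of RemexHtml.
 */
function serializeWithTrace( string $html, callable $log ): string {
	$serializer = new Serializer( new HtmlFormatter );
	// Assumption: the tracer forwards every event to $serializer and
	// calls $log with a description of each one.
	$tracer = new TreeMutationTracer( $serializer, $log );
	$treeBuilder = new TreeBuilder( $tracer );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $html );
	$tokenizer->execute();
	return $serializer->getResult();
}
```

Passing 'error_log' as $log should send each message to the PHP error log, provided the callback is invoked with a single string as assumed here.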

RemexHtml's model of a configurable pipeline provides a great deal of flexibility. Applications may subclass pipeline classes provided by RemexHtml, or write their own from scratch, implementing the relevant event receiver interface. Or they may interpose custom pipeline stages in between RemexHtml's standard stages.

However, simple use cases require a fair amount of boilerplate. Phabricator task T217850 proposes a simplified method for constructing a standard pipeline, but this has not yet been implemented.
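
The pipeline model described in this section can be illustrated with a toy example that does not use RemexHtml at all. Every name below (WordHandler, WordSplitter, UppercaseFilter, WordJoiner, shout) is hypothetical. The sketch only shows the shape of the design: a producer calls the handler it was given, an interposed stage forwards modified events to the next stage, and the final stage accumulates a result.

```php
<?php
// Toy model of an event pipeline. None of these names exist in RemexHtml.
interface WordHandler {
	public function word( string $word ): void;
	public function end(): void;
}

// Producer: splits the text and calls the attached handler for each event.
class WordSplitter {
	private $handler;
	private $text;

	public function __construct( WordHandler $handler, string $text ) {
		$this->handler = $handler;
		$this->text = $text;
	}

	public function execute(): void {
		$words = preg_split( '/\s+/', $this->text, -1, PREG_SPLIT_NO_EMPTY );
		foreach ( $words as $word ) {
			$this->handler->word( $word );
		}
		$this->handler->end();
	}
}

// Interposed stage: modifies each event, then forwards it to the next stage.
class UppercaseFilter implements WordHandler {
	private $next;

	public function __construct( WordHandler $next ) {
		$this->next = $next;
	}

	public function word( string $word ): void {
		$this->next->word( strtoupper( $word ) );
	}

	public function end(): void {
		$this->next->end();
	}
}

// Final stage: accumulates a result, much as a serializer would.
class WordJoiner implements WordHandler {
	private $words = [];

	public function word( string $word ): void {
		$this->words[] = $word;
	}

	public function end(): void {
		// Nothing to flush in this toy example.
	}

	public function getResult(): string {
		return implode( ' ', $this->words );
	}
}

// Construct the pipeline back to front, then run the producer.
function shout( string $text ): string {
	$joiner = new WordJoiner();
	$filter = new UppercaseFilter( $joiner );
	$splitter = new WordSplitter( $filter, $text );
	$splitter->execute();
	return $joiner->getResult();
}
```

Removing UppercaseFilter from the chain requires no change to the producer or the final stage, which is the property that makes interposing custom stages cheap.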

Examples

Construct a DOM from input text

use RemexHtml\DOM\DOMBuilder;
use RemexHtml\TreeBuilder\TreeBuilder;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\Tokenizer\Tokenizer;

function parseHtmlToDom( $input ) {
	$domBuilder = new DOMBuilder();
	$treeBuilder = new TreeBuilder( $domBuilder );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $input );
	$tokenizer->execute();
	return $domBuilder->getFragment();
}

In the above code sample, the pipeline is constructed backwards, from end to start. The constructor of each pipeline stage receives the following pipeline stage. Then with the pipeline fully constructed, $tokenizer->execute() causes the whole input text to be parsed and emitted through the pipeline, eventually reaching the DOMBuilder. After execution, the constructed document is available via $domBuilder->getFragment().

Modify a document on the fly

use RemexHtml\HTMLData;
use RemexHtml\Serializer\HtmlFormatter;
use RemexHtml\Serializer\Serializer;
use RemexHtml\Serializer\SerializerNode;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;

function changeLinks( $html ) {
	$formatter = new class extends HtmlFormatter {
		public function element( SerializerNode $parent, SerializerNode $node, $contents ) {
			if ( $node->namespace === HTMLData::NS_HTML
				&& $node->name === 'a'
				&& isset( $node->attrs['href'] )
			) {
				$node = clone $node;
				$node->attrs = clone $node->attrs;
				$node->attrs['href'] = 'http://example.com/' . $node->attrs['href'];
			}
			return parent::element( $parent, $node, $contents );
		}
	};

	$serializer = new Serializer( $formatter );
	$treeBuilder = new TreeBuilder( $serializer );
	$dispatcher = new Dispatcher( $treeBuilder );
	$tokenizer = new Tokenizer( $dispatcher, $html );
	$tokenizer->execute();
	return $serializer->getResult();
}

This example modifies an HTML document on the fly, altering href attributes inside <a> tags and returning an HTML string. It does this by subclassing HtmlFormatter, which is a relatively easy hook point into reserialization. It clones the SerializerNode and Attributes objects to avoid altering the document as seen by Serializer, since it is possible this function may be called more than once on each node, and we don't want to prefix the domain name more than once.

Alternatively, instead of cloning, we could have used SerializerNode::$snData as a flag to avoid double-prefixing:

if ( !$node->snData ) {
	$node->snData = true;
	$node->attrs['href'] = 'http://example.com/' . $node->attrs['href'];
}

Performance

Various options can be enabled which improve performance, potentially at the expense of correctness:

  • Tokenizer
    • ignoreErrors - This does not simply discard parse errors as they are generated. In some cases it chooses a more efficient algorithm which implicitly ignores errors. If parse errors are not required, this should always be set.
    • skipPreprocess - The HTML specification requires that the input be preprocessed to normalize line endings and strip control characters. If line endings are already normalized in your application, and if you don't mind control characters being propagated through to the output, this option can be enabled, for a small improvement to performance.
    • ignoreNulls - Enabling this option causes any null characters to be passed through to the output. The HTML specification requires complex, context-dependent handling of null characters whenever they appear in the input. So if the application simply strips null characters from the input and enables this option, the result will not be standards-compliant, but performance will be slightly improved.
    • ignoreCharRefs - This is an aggressive and rarely-useful optimisation option which ignores character references, passing them through unmodified. It needs to be paired with a special serializer that will emit bare ampersands from text nodes instead of escaping them.
  • TreeBuilder
    • ignoreNulls, ignoreErrors - Same as the corresponding Tokenizer options.
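
The options above can be sketched in use as follows. This assumes they are passed as an associative array in the last constructor argument of Tokenizer and TreeBuilder; reserializeFast is a hypothetical name, and depending on the release the classes may live under Wikimedia\RemexHtml rather than RemexHtml. Null characters are stripped from the input first, as described for ignoreNulls, so the result is not strictly standards-compliant.

```php
<?php
use RemexHtml\Serializer\HtmlFormatter;
use RemexHtml\Serializer\Serializer;
use RemexHtml\Tokenizer\Tokenizer;
use RemexHtml\TreeBuilder\Dispatcher;
use RemexHtml\TreeBuilder\TreeBuilder;

/**
 * Reserialize HTML with parse errors and null handling disabled.
 * Hypothetical helper, not part of RemexHtml.
 */
function reserializeFast( string $html ): string {
	// ignoreNulls passes null characters through, so remove them up front.
	$html = str_replace( "\0", '', $html );

	$serializer = new Serializer( new HtmlFormatter );
	// Assumption: options are given as an array after the handler.
	$treeBuilder = new TreeBuilder( $serializer, [
		'ignoreErrors' => true,
		'ignoreNulls' => true,
	] );
	$dispatcher = new Dispatcher( $treeBuilder );
	// Assumption: options are given as an array after the input text.
	$tokenizer = new Tokenizer( $dispatcher, $html, [
		'ignoreErrors' => true,
		'ignoreNulls' => true,
	] );
	$tokenizer->execute();
	return $serializer->getResult();
}
```

skipPreprocess and ignoreCharRefs are left off here, since they are only safe when the application already normalizes line endings or uses a serializer that emits bare ampersands.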

About TokenizerError exceptions

If RemexHtml throws a TokenizerError exception, for example "pcre.backtrack_limit exhausted", this is usually not a bug in RemexHtml. Either the relevant configuration setting should be increased, or the input size should be limited. The pcre.backtrack_limit INI setting should be at least double the input size.
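
One way to apply the "at least double the input size" rule is to check the setting before parsing. The helper below is a minimal sketch using only standard PHP functions; ensureBacktrackLimit is a hypothetical name and is not part of RemexHtml. It measures the input in bytes with strlen(), and it relies on ini_set() being permitted in your environment.

```php
<?php
/**
 * Raise pcre.backtrack_limit to at least double the input size, if needed.
 * Hypothetical helper; returns the limit in effect afterwards.
 */
function ensureBacktrackLimit( string $input ): int {
	$required = 2 * strlen( $input );
	$current = (int)ini_get( 'pcre.backtrack_limit' );
	if ( $current < $required ) {
		// Never lowers the limit, only raises it when it is too small.
		ini_set( 'pcre.backtrack_limit', (string)$required );
		$current = (int)ini_get( 'pcre.backtrack_limit' );
	}
	return $current;
}
```

If the returned value is still below double the input size, the setting could not be changed at runtime, and limiting the input size is the remaining option.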
