HTML Filters/Purifiers: The Need, and Introducing htmLawed

Web-based applications like blogs, content management systems (CMSs), forums, newsfeed aggregators, and wikis that utilize user-submitted text are widely deployed today. Often the applications permit HTML code in the text; after all, the input is used for display in web-pages.

HTML specifications are not just about HTML elements and attributes. For instance, there are rules regarding the validity of characters and character entities (surrogate text like '&amp;' used to represent the ampersand '&'). Plain text in the input that has no obvious HTML markup is thus still technically HTML.
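As a small illustration of the point about character entities, the sketch below uses Python's standard `html` module to turn plain text into HTML-safe text and back. The function name `plain_text_to_html` is made up for this example and is not part of any filter discussed here.

```python
import html

def plain_text_to_html(text):
    """Replace the characters that are special in HTML text with entities."""
    # quote=False leaves quotation marks alone; they only matter inside attributes.
    return html.escape(text, quote=False)

escaped = plain_text_to_html("Fish & Chips < $5")
print(escaped)                 # Fish &amp; Chips &lt; $5
print(html.unescape(escaped))  # Fish & Chips < $5
```

Even input with no tags at all is changed by this step, which is why such text still has to be treated as HTML.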

Users either type HTML code directly into the text, or enter it indirectly through BBCode (in which surrogates like '[url=...]' are used to represent HTML code like '<a href=...>') or through WYSIWYG (What You See Is What You Get) editors like TinyMCE. Though both BBCode and browser-based WYSIWYG systems can generate syntactically correct HTML markup free of typographical errors, they may have only a limited ability to restrict the markup (e.g., to disallow certain attributes for the HTML elements), and they generally do not cover all of HTML (e.g., the 'form' element).
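To make the BBCode idea concrete, here is a minimal sketch of a converter that handles only two codes, '[b]' and '[url=...]'. It is an illustration written for this article, not the behavior of any particular BBCode parser; the function name `bbcode_to_html` and the restriction to 'http'/'https' URLs are assumptions.

```python
import html
import re

def bbcode_to_html(text):
    """Convert a tiny, hypothetical subset of BBCode to HTML."""
    # Escape first, so any raw HTML typed by the user becomes plain text.
    out = html.escape(text, quote=True)
    # [b]...[/b] becomes <strong>...</strong>
    out = re.sub(r"\[b\](.*?)\[/b\]", r"<strong>\1</strong>", out, flags=re.S)
    # [url=http(s)://...]label[/url] becomes <a href="...">label</a>
    out = re.sub(
        r"\[url=(https?://[^\]\s]+)\](.*?)\[/url\]",
        r'<a href="\1">\2</a>',
        out,
        flags=re.S,
    )
    return out

print(bbcode_to_html("[url=http://example.com]site[/url] and [b]bold[/b]"))
```

Because the converter emits only the markup it knows about, the output is well-formed for those two codes, but nothing outside that subset (a 'form' element, for instance) can be expressed.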

***

The presence of HTML markup in the input text poses certain problems. The code may not comply with the applicable HTML standard; e.g., input meant for a web-page using the XHTML 1.0 Strict DTD may incorrectly use the deprecated 'u' tag to depict underlined text. A submitter may also have inadvertently mistyped HTML code, for instance by forgetting a closing tag or by nesting the HTML elements improperly. This too can make web-pages non-compliant with the standard. Poor standard compliance can break the display of a web-page, or it can defeat the purpose for which a tag was used.

A second issue with HTML markup is that of security. HTML code meant for cross-site scripting (XSS) attacks may have been put in by someone with malicious intent. Similarly, HTML code may be used to spam web-pages with links. HTML-invalid characters like the null character, as well as invalid character entities in the input, can crash browsers or prevent a web-page from being displayed.

HTML code, even if valid, can still mess up the design and layout of web-pages that use the input text by, for example, presenting text in disruptive sizes or styles. The content of a web-page may be in use outside the web-site (e.g., on newsfeed aggregator sites) and even rendered using clients that are not browsers (e.g., stand-alone XML readers), and bad user input can disrupt the functionality of such distributions.

***

It is thus important to check user-submitted text for security, and for compliance with standards and administrative policy. This is true in general for any case in which text from external sources is being used (e.g., a newsfeed aggregator displaying newsfeed items collected from others), and it also applies when HTML markup is generated indirectly by BBCode parsers, WYSIWYG editors, etc.

Stand-alone applications like 'HTML Tidy' and script-based code are available for this purpose. Such utilities are effectively input text filters that process, purify and sanitize the text. They take care of illegal characters and character entities, and of illegal or disallowed HTML elements and attributes, by removing them or transforming them to plain text or allowed markup. They also balance the tags used to represent the elements, ensure that the elements are properly nested, etc.
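The filtering steps described above can be sketched with Python's standard `html.parser` module. This is a toy illustration under stated assumptions, not how HTML Tidy, htmLawed, or any other real filter works: the `ALLOWED` whitelist, the `SAFE_SCHEMES` list, and the names `TinyFilter` and `filter_html` are all invented for the example. It should not be relied on for actual security.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical policy: allowed elements mapped to their allowed attributes.
ALLOWED = {"a": {"href"}, "b": set(), "em": set(), "i": set(), "p": set(), "strong": set()}
SAFE_SCHEMES = ("http://", "https://", "mailto:")

class TinyFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []        # pieces of filtered output
        self.open_tags = []  # stack of allowed elements currently open

    def handle_starttag(self, tag, attrs):
        if tag not in ALLOWED:
            return  # drop the disallowed tag; its text content is kept
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue  # drop disallowed or valueless attributes
            if name == "href" and not value.strip().lower().startswith(SAFE_SCHEMES):
                continue  # drop URLs with unexpected protocols
            kept.append(f' {name}="{escape(value, quote=True)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            return  # stray closing tag
        # Close everything opened after this element, then the element itself.
        while self.open_tags:
            top = self.open_tags.pop()
            self.out.append(f"</{top}>")
            if top == tag:
                break

    def handle_data(self, data):
        # Remove null characters and re-encode special characters as entities.
        self.out.append(escape(data.replace("\x00", ""), quote=False))

def filter_html(text):
    parser = TinyFilter()
    parser.feed(text)
    parser.close()
    while parser.open_tags:  # balance any elements left open
        parser.out.append(f"</{parser.open_tags.pop()}>")
    return "".join(parser.out)

print(filter_html('<b>bold <i>both</b>'))
print(filter_html('<a href="javascript:alert(1)" onclick="x()">hi</a>'))
```

In this sketch a disallowed element such as 'u' simply loses its tags while its text survives, and a mis-nested closing tag causes the inner elements to be closed first. A real filter has many more cases to handle, including void elements, nesting rules between specific elements, and character-encoding issues.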

Some HTML filters are able to check attribute values for correctness and can even modify them (e.g., to obfuscate email addresses as an anti-spam measure). The various filtering scripts available today differ in their capabilities and in how customizable they are. It is also possible to use two different filters in tandem to achieve the desired filtering effect. In general, filters with more capabilities require more time and resources (CPU cycles and memory) to process input text.

***

Some good HTML filters/purifiers available in various scripting languages are: for Perl, HTML Scrubber; for PHP, htmLawed and HTMLPurifier; for Python, HTML5lib; and, for Ruby, HTML5lib.

The htmLawed PHP script, a single file of about 45 KB, is fast, consumes little memory, and offers a high degree of configurability. Besides covering all aspects of HTML markup as described in the current HTML/XHTML standards, it can also deal with common but non-standard elements and attributes like 'embed'. Its additional capabilities include URL protocol checks, anti-email and anti-link spam measures, relative/absolute URL conversions, transformation of deprecated elements and attributes, and code beautification.

To learn more about using htmLawed in your applications, please visit the htmLawed web-site: http://www.bioinformatics.org/phplabware/internal_utilities/htmLawed
