5.2 Valid attribute-element combinations
* includes deprecated attributes (marked ^), attributes for microdata (marked *), some non-standard attributes for embed (marked **), and the non-standard bordercolor; can have multiple comma-separated values (marked %); can have multiple space-separated values (marked $)
* only non-frameset, HTML body elements
* name for a and map, and lang are invalid in XHTML 1.1
* xml:space is only for XHTML 1.1
* excludes data-* and author-specified, non-standard attributes of custom elements
abbr - td, th
accept - form, input
accept-charset - form
action - form
align - applet, caption^, col, colgroup, div^, embed, h1^, h2^, h3^, h4^, h5^, h6^, hr^, iframe, img^, input^, legend^, object^, p^, table^, tbody, td, tfoot, th, thead, tr
allowfullscreen - iframe
alt - applet, area, img, input
archive - applet, object
async - script
autocomplete - input
autofocus - button, input, keygen, select, textarea
autoplay - audio, video
axis - td, th
bgcolor - embed, table^, tbody^, td^, tfoot^, th^, thead^, tr^
border - img, object^, table
bordercolor - table, td, tr
cellpadding - table
cellspacing - table
challenge - keygen
char - col, colgroup, tbody, td, tfoot, th, thead, tr
charoff - col, colgroup, tbody, td, tfoot, th, thead, tr
charset - a, script
checked - command, input
cite - blockquote, del, ins, q
classid - object
clear - br^
code - applet
codebase - object, applet
codetype - object
color - font
cols - textarea
colspan - td, th
compact - dir, dl^, menu, ol^, ul^
content - meta
controls - audio, video
coords - area, a
crossorigin - img
data - object
datetime - del, ins, time
declare - object
default - track
defer - script
dir - bdo
dirname - input, textarea
disabled - button, command, fieldset, input, keygen, optgroup, option, select, textarea
download - a
enctype - form
face - font
flashvars** - embed
for - label, output
form - button, fieldset, input, keygen, label, object, output, select, textarea
formaction - button, input
formenctype - button, input
formmethod - button, input
formnovalidate - button, input
formtarget - button, input
frame - table
frameborder - iframe
headers - td, th
height - applet, canvas, embed, iframe, img, input, object, td^, th^, video
high - meter
href - a, area, link
hreflang - a, area, link
hspace - applet, embed, img^, object^
icon - command
ismap - img, input
keytype - keygen
keyparams - keygen
kind - track
label - command, menu, option, optgroup, track
language - script^
list - input
longdesc - img, iframe
loop - audio, video
low - meter
marginheight - iframe
marginwidth - iframe
max - input, meter, progress
maxlength - input, textarea
media - a, area, link, source, style
mediagroup - audio, video
method - form
min - input, meter
model** - embed
multiple - input, select
muted - audio, video
name - a^, applet^, button, embed, fieldset, form^, iframe^, img^, input, keygen, map^, object, output, param, select, slot, textarea
nohref - area
noshade - hr^
novalidate - form
nowrap - td^, th^
object - applet
open - details, dialog
optimum - meter
pattern - input
ping - a, area
placeholder - input, textarea
pluginspage** - embed
pluginurl** - embed
poster - video
pqg - keygen
preload - audio, video
prompt - isindex
pubdate - time
radiogroup* - command
readonly - input, textarea
required - input, select, textarea
rel$ - a, area, link
rev - a
reversed - ol
rows - textarea
rowspan - td, th
rules - table
sandbox - iframe
scope - td, th
scoped - style
scrolling - iframe
seamless - iframe
selected - option
shape - area, a
size - font, hr^, input, select
sizes - img, link, source
span - col, colgroup
src - audio, embed, iframe, img, input, script, source, track, video
srcdoc~ - iframe
srclang~ - track
srcset~% - img, link, source
standby - object
start - ol
step~ - input
summary - table
target - a, area, form
type - a, area, button, command, embed, input, li, link, menu, object, ol, param, script, source, style, ul
typemustmatch~ - object
usemap - img, input, object
valign - col, colgroup, tbody, td, tfoot, th, thead, tr
value - button, data, input, li, meter, option, param, progress
valuetype - param
vspace - applet, embed, img^, object^
width - applet, canvas, col, colgroup, embed, hr^, iframe, img, input, object, pre^, table, td^, th^, video
wmode - embed
wrap~ - textarea
The following attributes, including event-specific ones and attributes of ARIA and microdata specifications, are considered global and allowed in all elements:
accesskey, autocapitalize, autofocus, aria-activedescendant, aria-atomic, aria-autocomplete, aria-braillelabel, aria-brailleroledescription, aria-busy, aria-checked, aria-colcount, aria-colindex, aria-colindextext, aria-colspan, aria-controls, aria-current, aria-describedby, aria-description, aria-details, aria-disabled, aria-dropeffect, aria-errormessage, aria-expanded, aria-flowto, aria-grabbed, aria-haspopup, aria-hidden, aria-invalid, aria-keyshortcuts, aria-label, aria-labelledby, aria-level, aria-live, aria-multiline, aria-multiselectable, aria-orientation, aria-owns, aria-placeholder, aria-posinset, aria-pressed, aria-readonly, aria-relevant, aria-required, aria-roledescription, aria-rowcount, aria-rowindex, aria-rowindextext, aria-rowspan, aria-selected, aria-setsize, aria-sort, aria-valuemax, aria-valuemin, aria-valuenow, aria-valuetext, class, contenteditable, contextmenu, dir, draggable, dropzone, enterkeyhint, hidden, id, inert, inputmode, is, itemid, itemprop, itemref, itemscope, itemtype, lang, nonce, onabort, onblur, oncanplay, oncanplaythrough, onchange, onclick, oncontextmenu, oncopy, oncuechange, oncut, ondblclick, ondrag, ondragend, ondragenter, ondragleave, ondragover, ondragstart, ondrop, ondurationchange, onemptied, onended, onerror, onfocus, onformchange, onforminput, oninput, oninvalid, onkeydown, onkeypress, onkeyup, onload, onloadeddata, onloadedmetadata, onloadend, onloadstart, onlostpointercapture, onmousedown, onmousemove, onmouseout, onmouseover, onmouseup, onmousewheel, onpaste, onpause, onplay, onplaying, onpointercancel, ongotpointercapture, onpointerdown, onpointerenter, onpointerleave, onpointermove, onpointerout, onpointerover, onpointerup, onprogress, onratechange, onreadystatechange, onreset, onsearch, onscroll, onseeked, onseeking, onselect, onshow, onstalled, onsubmit, onsuspend, ontimeupdate, ontoggle, ontouchcancel, ontouchend, ontouchmove, ontouchstart, onvolumechange, onwaiting, onwheel, onauxclick, oncancel, onclose, 
oncontextlost, oncontextrestored, onformdata, onmouseenter, onmouseleave, onresize, onsecuritypolicyviolation, onslotchange, role, slot, spellcheck, style, tabindex, title, translate, xmlns, xml:base, xml:lang, xml:space
Custom data-* attributes, where the first three characters of the value of star (*) after lower-casing do not equal xml and the value of star does not have a colon (:), equal-to (=), newline, solidus (/), space, tab, or any A-Z character, are also considered global and allowed in all elements.
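The data-* rule above can be sketched as a small check. This is a Python illustration only, not htmLawed's PHP code; the function name is hypothetical, and rejecting an empty suffix is an assumption the text does not state.

```python
# Characters the rule above forbids in the part after "data-".
FORBIDDEN_CHARS = set(":=\n/ \t")


def is_global_data_attribute(name: str) -> bool:
    """Return True if `name` is a data-* attribute that passes the stated rule."""
    prefix = "data-"
    if not name.startswith(prefix):
        return False
    star = name[len(prefix):]
    # Assumption: an empty suffix ("data-") is rejected; the text does not say.
    if not star:
        return False
    # The first three characters, lower-cased, must not equal "xml".
    if star[:3].lower() == "xml":
        return False
    # No forbidden punctuation or white-space, and no upper-case A-Z character.
    return not any(ch in FORBIDDEN_CHARS or "A" <= ch <= "Z" for ch in star)
```

For instance, under these assumptions data-user-id passes, while data-xmlfoo, data-Foo and data-a:b do not.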
5.6 Brief on htmLawed code
Much of the code's logic and reasoning can be understood from the documentation above.
The output of htmLawed is a text string containing the processed input. There is no custom error tracking.
Function arguments for htmLawed are:
* $in - first argument; a text string; the input text to be processed. Any extraneous slashes added by PHP when magic quotes are enabled should be removed beforehand using PHP's stripslashes() function.
* $config - second argument; an associative array; optional; named $C within htmLawed code. The array has keys with names like balance and keep_bad, and the values, which can be boolean, string, or array, depending on the key, are read to set the corresponding configurable parameters (indicated by the keys). Every configurable parameter receives a default value if the user does not specify one through $config. Finalized $config is thus a filtered and possibly larger array.
* $spec - third argument; a text string; optional. The string has rules, written in an htmLawed-designated format, specifying element-specific attribute and attribute value restrictions. Function hl_spec() is used to convert the string to an associative array, named $S within htmLawed code, for internal use. Finalized $spec is thus an array.
Finalized $config and $spec are made global variables while htmLawed is at work. Values of any pre-existing global variables with the same names are noted, and their values are restored after htmLawed finishes processing the input (to capture the finalized values, the show_settings parameter of $config should be used). Depending on $config, another global variable, hl_Ids, which tracks id attribute values for uniqueness, may be set. Unlike the other two variables, this one is not reset (or unset) post-processing.
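The save-and-restore behaviour described above can be illustrated with a minimal Python sketch. It is not htmLawed's PHP code: the GLOBALS dictionary stands in for PHP's global scope, the function name and the work callback are hypothetical, and the use of try/finally is a choice made for this sketch.

```python
# Hypothetical stand-in for PHP's global scope.
GLOBALS = {}


def process_with_temporary_globals(text, config, spec, work):
    """Run work(text) while 'C' and 'S' are temporarily set in GLOBALS."""
    missing = object()  # sentinel: distinguishes "not set" from "set to None"
    # Note the values of any pre-existing globals with the same names.
    saved = {key: GLOBALS.get(key, missing) for key in ("C", "S")}
    GLOBALS["C"], GLOBALS["S"] = config, spec
    try:
        return work(text)
    finally:
        # Restore the noted values, or unset names that did not exist before.
        for key, old in saved.items():
            if old is missing:
                GLOBALS.pop(key, None)
            else:
                GLOBALS[key] = old
        # Anything else that work() set, such as 'hl_Ids', is left in place.
```

In this sketch a caller's own C value survives the call, while an hl_Ids entry written during processing remains afterwards, mirroring the distinction the paragraph draws.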
Except for the main htmLawed() function, htmLawed's functions are name-spaced using the hl_ prefix. The functions and their roles are:
* hl_attributeValue - check attribute values against $spec rules
* hl_balance - balance tags and ensure proper nesting
* hl_commentCdata - handle CDATA sections and HTML comments
* hl_deprecatedElement - transform element tags
* hl_entity - handle character entities
* hl_regex - check syntax of a regular expression
* hl_spec - convert $spec value to one used internally
* hl_tag - handle element tags and attributes
* hl_tidy - compact/beautify HTML
* hl_url - check URL-containing values
* hl_version - report htmLawed version
* htmLawed - main function
htmLawed() finalizes $spec (with the help of hl_spec()) and $config, and globalizes them. Finalization of $config involves setting default values if an inappropriate or invalid one is supplied. This includes calling hl_regex() to check well-formedness of regular expression patterns if such expressions are user-supplied through $config.
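As a rough illustration of this kind of finalization, the Python sketch below merges user settings over defaults and falls back to the default when a supplied regular expression does not compile. It is not htmLawed's PHP code: the default values are illustrative, the pattern key is hypothetical, and dropping unknown keys is an assumption about what "filtered" means.

```python
import re

# Illustrative defaults only; htmLawed's real parameters and defaults differ.
DEFAULTS = {"balance": 1, "keep_bad": 6, "pattern": None}


def is_valid_regex(pattern):
    """Return True if `pattern` compiles as a regular expression."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def finalize_config(user_config):
    """Return a full settings dict: defaults plus acceptable user values."""
    final = dict(DEFAULTS)
    for key, value in user_config.items():
        if key not in DEFAULTS:
            continue  # assumption: unknown keys are filtered out
        if key == "pattern":
            # Keep a user-supplied regex only if it is well-formed.
            if isinstance(value, str) and is_valid_regex(value):
                final[key] = value
        elif isinstance(value, int):
            final[key] = value
    return final
```

The result always contains every known key, which is one way a finalized array can end up both filtered and larger than the one supplied.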
htmLawed() then removes invalid characters like nulls and x01 and appropriately handles entities using hl_entity(). HTML comments and CDATA sections are identified and treated as per $config with the help of hl_commentCdata(). When retained, the < and > characters identifying them, and the <, > and & characters inside them, are replaced with control characters (code-points 1 to 5) until any tag balancing is completed.
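The masking idea can be sketched in Python as follows. This is an illustration rather than htmLawed's PHP code: it handles HTML comments only, the function names are hypothetical, and the assignment of each character to a particular code-point is an assumption, since the text only says that code-points 1 to 5 are used.

```python
import re

# Assumed mapping of characters to control characters (code-points 1 to 5).
DELIM_OPEN, DELIM_CLOSE = "\x01", "\x02"
INNER = {"&": "\x03", "<": "\x04", ">": "\x05"}
RESTORE = {"\x01": "<", "\x02": ">", "\x03": "&", "\x04": "<", "\x05": ">"}


def mask_comments(text):
    """Hide comment delimiters and inner <, >, & from later tag processing."""
    def mask(match):
        body = match.group(1)
        for char, control in INNER.items():
            body = body.replace(char, control)
        return f"{DELIM_OPEN}!--{body}--{DELIM_CLOSE}"

    return re.sub(r"<!--(.*?)-->", mask, text, flags=re.DOTALL)


def unmask(text):
    """Put the original characters back once tag balancing is done."""
    return text.translate(str.maketrans(RESTORE))
```

While masked, a comment contains no < or > characters, so a tag-matching pass cannot mistake its content for markup; unmask() reverses the substitution afterwards.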
After this initial processing, htmLawed() identifies tags using regex and processes them with the help of hl_tag() -- a large function that analyzes tag content, filtering it as per HTML standards, $config and $spec. Among other things, hl_tag() transforms deprecated elements using hl_deprecatedElement(), removes attributes from closing tags, checks attribute values as per $spec rules using hl_attributeValue(), and checks URL protocols using hl_url().
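The general pattern of finding tags with a regular expression and handing each one to a callback can be sketched in Python. This is a simplified illustration, not hl_tag(): the pattern and function names are hypothetical, the pattern does not cope with a > inside an attribute value, and lower-casing element names is an assumption of the sketch. Of the tasks listed above, it only shows the removal of attributes from closing tags.

```python
import re

# Simplified tag pattern: optional slash, element name, then everything up to ">".
TAG_RE = re.compile(r"<(/?)([a-zA-Z][a-zA-Z0-9]*)([^>]*)>")


def process_tag(match):
    """Callback applied to every tag the pattern finds."""
    closing, name, rest = match.groups()
    name = name.lower()  # assumption: element names are normalized to lower case
    if closing:
        return f"</{name}>"  # closing tags keep no attributes
    return f"<{name}{rest}>"


def process_tags(text):
    """Replace each tag in `text` with the callback's output."""
    return TAG_RE.sub(process_tag, text)
```

Text outside the matched tags passes through untouched, so all per-tag filtering lives in the callback.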
htmLawed() performs tag balancing and nesting checks with a call to hl_balance(), and optionally compacts/beautifies the output with proper white-spacing with a call to hl_tidy(). The latter temporarily replaces white-space, and <, > and & characters inside pre, script and textarea elements, and HTML comments and CDATA sections with control characters (code-points 1 to 5, and 7).
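Tag balancing in general can be illustrated with a stack of open elements. The Python sketch below is a generic technique and not hl_balance()'s algorithm, which also applies HTML nesting rules that are ignored here; the function name, the tag pattern and the short list of void elements are assumptions.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-z][a-z0-9]*)[^>]*>")
VOID = {"br", "hr", "img", "input"}  # partial list, enough for the sketch


def balance(text):
    """Close unclosed elements and drop closing tags that match nothing."""
    out, stack, pos = [], [], 0
    for match in TAG_RE.finditer(text):
        out.append(text[pos:match.start()])  # text before this tag
        pos = match.end()
        closing, name = match.group(1), match.group(2)
        if not closing:
            out.append(match.group(0))
            if name not in VOID:
                stack.append(name)
        elif name in stack:
            # Close every element opened after the matching one, then the match.
            while stack:
                top = stack.pop()
                out.append(f"</{top}>")
                if top == name:
                    break
        # Otherwise: a stray closing tag, which is dropped.
    out.append(text[pos:])
    # Close whatever is still open at the end of the input.
    out.extend(f"</{name}>" for name in reversed(stack))
    return "".join(out)
```

Given the input `<div><b>x</div>y</i>`, the sketch closes b before div and discards the unmatched `</i>`.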
htmLawed permits the use of custom code or hook functions at two stages. The first, called inside htmLawed(), allows the input text as well as the finalized $config and $spec values to be altered right after the initial processing (see section 3.7). The second is called by hl_tag() once the tag content is finalized (see section 3.4.9).
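The two hook points can be pictured with the following Python sketch. It is not htmLawed's PHP code or its hook interface: the names filter_text, after_init and on_tag and their signatures are hypothetical, and the processing steps are stand-ins reduced to the minimum needed to show where each callback runs.

```python
import re

TAG_RE = re.compile(r"<(/?)([a-z][a-z0-9]*)([^>]*)>")


def filter_text(text, config, after_init=None, on_tag=None):
    """Process `text`, calling the optional hooks at two stages."""
    # Stand-in for the initial processing: strip null characters.
    text = text.replace("\x00", "")
    # First hook: may alter the text and the settings before tags are handled.
    if after_init is not None:
        text, config = after_init(text, config)

    def handle_tag(match):
        closing, name, rest = match.groups()
        tag = f"</{name}>" if closing else f"<{name}{rest}>"
        # Second hook: called once the tag content is finalized.
        return on_tag(tag) if on_tag is not None else tag

    return TAG_RE.sub(handle_tag, text)
```

A call such as `filter_text(" <b>x</b> ", {}, after_init=lambda t, c: (t.strip(), c), on_tag=lambda tag: tag.replace("b", "strong"))` trims the input in the first hook and renames the element in the second.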
The functionality of htmLawed is dictated by the external HTML standards. The code of htmLawed is thus written for a clear-cut aim, with not much concern for tweaking by other developers. The code is only minimally annotated with comments -- it is not meant to instruct. PHP developers familiar with the HTML specifications will see the logic, and others can always refer to the htmLawed documentation.
htmLawed 1.2.15
Copyright Santosh Patnaik
Dual licensed with LGPL 3 and GPL 2+
A PHP Labware internal utility - https://bioinformatics.org/phplabware/internal_utilities/htmLawed