String Subtypes for Safer Web Programming

Valid HTML markup involves several different contexts and escaping rules, yet many APIs give no precise indication of which context their string return values are escaped for, or how strings should be escaped before being passed in (let’s not even get into character encoding). Most programming languages only have a single String type, so there’s a strong urge to document function with @param string and/or @return string and move on to other work, but this is rarely sufficient information.

Look at the documentation for WordPress’s get_the_title:


Post title. …

If the title is Stan "The Man" & Capt. <Awesome>, will & and < be escaped? Will the quotes be escaped? “string” leaves these important questions unanswered. This isn’t meant to slight WordPress’s documentation team (they at least frequently give you example code from which you can guess the escaping model); the problem is endemic to web software.

So for better web security—and developer sanity—I think we need a shared vocabulary of string subtypes which can supply this missing metadata at least via mention or annotation in the documentation (if not via actual types).

Proposed Subtypes and Content Models

A basic set of four might help quite a bit. Each should have its own URL to explain its content model in detail, and how it should be handled:

Arbitrary characters not escaped for HTML in any way, possibly including nulls/control characters. If a string’s subtype is not explicit, for safety it should be assumed to contain this content.
Well-formed HTML markup matching the serialization of a DocumentFragment
Markup containing no literal less-than sign (U+003C) characters (e.g. for output inside title/textarea elements)
TaglessMarkup containing no literal apostrophe (U+0027) or quotation mark (U+0022) characters, for output as a single/double-quoted attribute value

What would these really give us?

These subtypes cannot make promises about what they contain, but are rather for making explicit what they should contain. It’s still up to developers to correctly handle input, character encoding, filtering, and string operations to fulfill those contracts.

The work left to do is to define how these subtypes should be handled and in what contexts they can be output as-is, and what escaping needs to be applied in other contexts.

Obvious Limitations

For the sake of simplicity, these subtypes shouldn’t attempt to address notions of input filtering or whether a string should be considered “clean”, “tainted”, “unsafe”, etc. A type/annotation convention like this should be used to assist—not replace—experienced developers practicing secure coding methods.

RotURL: Rot13 for URLs

RotURL is a simple substitution cipher for encoding/obscuring URLs embedded in other URLs (e.g. in a querystring). Also, common chars that need to be escaped (:/?=&%#) are mapped to infrequently used capital letters, so this generally yields shorter querystrings, too.

 * Rot35 with URL/urlencode-friendly mappings. To avoid increasing size during
 * urlencode(), commonly encoded chars are mapped to more rarely used chars.
function rotUrl($url) {
    return strtr($url,
        './-:?=&%# ZQXJKVWPY abcdefghijklmnopqrstuvwxyz123456789ABCDEFGHILMNORSTU',
        'ZQXJKVWPY ./-:?=&%# 123456789ABCDEFGHILMNORSTUabcdefghijklmnopqrstuvwxyz');

    == '8MMGLJQQ5EZR9B9G5491ZFI7QRQ9E45SZG8GKM9MC5VxG5391CPcjx51I38WL51I38Vk1L5fdY6FF';
rotUrl(rotUrl($anyUrl)) = $anyUrl;

You could save a few more bytes by encoding the schema (e.g. “h” for http://, “H” for https://). Since your end encoding has to be URL-safe, there’s not much you can do beyond this to compress a URL embedded in a URL.

Validate Private Page Bookmarklet

ValidatePrivatePage <– validates in current window

ValidatePrivatePage <– validates in new window (your pop-up blocker may complain)

If you need to validate the markup of a page that’s not public (e.g. on localhost), you can now use this bookmarklet to auto-submit the current page source to the validator (instead of viewing source, copying, opening the validator, pasting in, and pressing “check”).

Note: this gets the page source making an XMLHTTPRequest to the current URL, so it does not get interpreted by the browser; i.e. this is NOT based on innerHTML(). If the request made returns a different page (e.g. you were logged out in the meantime), that page’s source will be sent to the validator. Not much can be done about that. I once wrote a crusty PHP4 class/bookmarklet combo that helped do this, but thanks to the standardization of XMLHTTPRequest, this is easy in JS now. You should also thank W3C for allowing cross-domain POSTs to the validator :)