This is a mirror of the official site.

The great thing about URL encodings is that there are so many to choose from

Monday, April 12, 2010
The phrase URL encoding appears to mean different things to different people.

First, Tim Berners-Lee says that URLs are encoded by using %xx to encode "dangerous" characters, or to suppress the special meaning that would normally be assigned to characters such as / or ?. For example, the URL http://server/why%3F/?q=bother is a request to the server server with the path /why?/ and with the query string q=bother. Notice that by escaping the question mark, we prevent it from being interpreted as the start of the query portion of the URL.
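As a rough illustration of the example above, here is a minimal Python sketch using the standard urllib.parse module (the choice of Python is an assumption of this example, not something the original post uses). It splits the URL first and only then decodes the path, which is why the escaped question mark never gets mistaken for the start of the query.

```python
from urllib.parse import urlsplit, unquote

url = "http://server/why%3F/?q=bother"
parts = urlsplit(url)

# urlsplit splits on the first literal "?"; the escaped %3F stays in the path.
raw_path = parts.path
query = parts.query

# Decoding happens only after the URL has been split into its components.
decoded_path = unquote(raw_path)

print(parts.netloc)   # server
print(raw_path)       # /why%3F/
print(decoded_path)   # /why?/
print(query)          # q=bother
```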

Now, it so happens that when a form is submitted via GET, then the contents of the form are encoded (by default) into the query according to a set of rules laid out in the HTML 4.01 specification: The query string takes the basic form of var=value&var=value&.... If a variable name or a value contains a "dangerous" character or a special character like = or &, then it must be %-escaped. For example, co=AT%26T says that the variable co has the value AT&T. Encoding the ampersand prevents it from being interpreted as a separator.
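The AT&T example can be sketched in Python with urllib.parse.urlencode and parse_qs. This is only an illustration of the var=value&var=value form, not the code a browser actually runs; the last call shows what goes wrong when the ampersand is left unescaped.

```python
from urllib.parse import urlencode, parse_qs

# Encode one form field; the ampersand in the value becomes %26.
encoded = urlencode({"co": "AT&T"})
print(encoded)            # co=AT%26T

# Decoding recovers the original value.
decoded = parse_qs(encoded)
print(decoded)            # {'co': ['AT&T']}

# Without escaping, the ampersand is read as a separator and the value is cut short.
unescaped = parse_qs("co=AT&T")
print(unescaped)          # {'co': ['AT']}
```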

And here is the special additional rule that confuses a lot of people: When submitting a form via GET, the form data is encoded into the query portion of a URL, and under the default encoding, the character U+0020 (space) is encoded as U+002B (plus sign). This special use of the plus sign applies only to the query portion of the URL. Sometimes people get confused and think that it applies to URLs in general.
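To see why the distinction matters when decoding, here is a small Python sketch. It assumes urllib.parse.unquote stands in for a general URL decoder and unquote_plus for a form-data decoder; the sample string is made up for illustration.

```python
from urllib.parse import unquote, unquote_plus

sample = "a+b%20c"  # hypothetical input

# General %xx decoding: the plus sign is just a plus sign.
as_path = unquote(sample)

# Form-data decoding: the plus sign means a space.
as_query_value = unquote_plus(sample)

print(as_path)          # a+b c
print(as_query_value)   # a b c
```

Applying the form-data decoder to a path would silently turn every literal plus sign into a space.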


Within a single URL, the base URL and the fragment use the %20 sequence to encode an embedded space, whereas the query uses the plus sign.
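A minimal Python sketch of one URL that uses both conventions follows. The path, query value, and fragment are invented for the example; quote is used for the path and fragment, and urlencode (which applies the plus-sign rule by default) for the query.

```python
from urllib.parse import quote, urlencode, urlunsplit

# Hypothetical pieces, each containing one space.
path = quote("/my docs/")                  # space -> %20, slash left alone
query = urlencode({"q": "hello world"})    # space -> +
fragment = quote("section two")            # space -> %20

url = urlunsplit(("http", "server", path, query, fragment))
print(url)  # http://server/my%20docs/?q=hello+world#section%20two
```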

You'd think that would be the end of the story, but in fact it's just the beginning, because now we get to throw in all sorts of nonstandard URL encoders.

The PHP function urlencode treats the entire string as if it were a value (or variable name) in a query string, encoding spaces as a plus sign and being careful to escape all other punctuation. Not to be confused with rawurlencode, which also escapes punctuation (even characters like /) but encodes spaces as %20 rather than as a plus sign.
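For readers without PHP at hand, the difference can be approximated in Python. This is a sketch of rough analogs only: quote_plus and quote(..., safe="") are assumed stand-ins for urlencode and rawurlencode, and they do not match PHP character for character (the tilde, for instance, may be treated differently).

```python
from urllib.parse import quote, quote_plus

text = "a b/c"  # hypothetical input

# Rough analog of PHP urlencode: space -> +, slash escaped.
like_urlencode = quote_plus(text)

# Rough analog of PHP rawurlencode: space -> %20, slash escaped.
like_rawurlencode = quote(text, safe="")

print(like_urlencode)      # a+b%2Fc
print(like_rawurlencode)   # a%20b%2Fc
```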

JScript comes with a whole bucketload of functions for URL encoding. There's escape(), which encodes almost everything but leaves the slash and, bafflingly, the plus sign unencoded. And then there's the encodeURI() function, which leaves a few more characters unencoded (including the colon (U+003A) and the question mark (U+003F)). But wait, there's also encodeURIComponent(), which goes to the effort of encoding slashes too. It's a total mess, but this site tries to make some sense out of the whole thing.
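The three functions differ mainly in which characters they leave alone. The Python sketch below imitates them by passing each function's unescaped punctuation set to urllib.parse.quote. The helper names are hypothetical, and the imitation is only approximate: it is meant for plain ASCII input, it does not reproduce escape()'s %uXXXX handling of non-ASCII characters, and Python's quote always leaves the tilde unescaped whereas escape() encodes it.

```python
from urllib.parse import quote

# Punctuation each JScript function leaves unencoded (letters and digits are always kept).
ESCAPE_SAFE = "@*_+-./"
ENCODE_URI_SAFE = ";,/?:@&=+$-_.!~*'()#"
ENCODE_URI_COMPONENT_SAFE = "-_.!~*'()"

def escape_like(s: str) -> str:
    # Hypothetical helper approximating escape() for ASCII input.
    return quote(s, safe=ESCAPE_SAFE)

def encode_uri_like(s: str) -> str:
    # Hypothetical helper approximating encodeURI().
    return quote(s, safe=ENCODE_URI_SAFE)

def encode_uri_component_like(s: str) -> str:
    # Hypothetical helper approximating encodeURIComponent().
    return quote(s, safe=ENCODE_URI_COMPONENT_SAFE)

sample = "a b/c+d?e:f"
print(escape_like(sample))                # a%20b/c+d%3Fe%3Af
print(encode_uri_like(sample))            # a%20b/c+d?e:f
print(encode_uri_component_like(sample))  # a%20b%2Fc%2Bd%3Fe%3Af
```

Note that none of the three turns a space into a plus sign, so none of them produces form-style query encoding on its own.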

Read more: The Old New Thing
