Quick code snippet: normalizing a URI (for friendly URLs, etc.)

December 3rd, 2008

When you enter the realm of "friendly URLs", "slugs", "nice names", or whatever else you call them, everything can look a lot better. However, if done incorrectly, you can end up with duplicate indexed pages and the like. I couldn't sleep and wanted to approach this again in a different way, and every URI I've thrown at it so far comes out how I want it.

Why does this matter? Typically, without rewrites and using normal webserver, directory and file semantics, a request for "/foo" (where foo is a directory) should make the webserver bounce you to "/foo/" - but when dealing with rewritten URLs, nothing enforces this behavior. A lot of the time (at least with the stuff I'm currently dealing with) the same page shows up at both "/foo" and "/foo/", and a search engine treats them as two distinct URLs. It's duplication of data, which violates the normalization devil in me! Even worse, certain apps might not process the two requests the same way: "/foo" could load one page, and "/foo/" could load another, or an error. On top of that, when people send URLs out, they sometimes take artistic license with what they look like. This is meant to thwart all that and force search engines, users, etc. to all see the same URL. I chose to enforce a URL structure ending with "/" as I think it helps establish that "final" signoff of the non-query string portion of the URI.

You could go about this in different ways; however, due to the way I have my nginx rewrites done, I can't rely on $_SERVER['QUERY_STRING'], which is how I had originally written it - and I wondered why I was getting some weird behavior. Now I realize I need the function to handle any string that is passed to it, query string included, and then I can make this behave appropriately.
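As a rough illustration of "handle any string that is passed to it", here is a minimal sketch that splits a URI string into its path and query parts without looking at $_SERVER at all. The name split_uri is hypothetical and not part of the function in this post; it uses explode() instead of the strpos()/substr() approach, purely to show the idea.

```php
<?php
# Hypothetical helper (not from the original post): split a URI string
# into its path and its query string, using only the string itself.
function split_uri($uri) {
   # a limit of 2 keeps any later '?' characters inside the query part
   $parts = explode('?', $uri, 2);
   # keep the leading '?' so the two pieces can simply be glued back together
   $query = isset($parts[1]) ? '?'.$parts[1] : '';
   return array($parts[0], $query);
}

list($path, $query) = split_uri('/bar/index.php?f=bar&fbahd=3');
echo $path."\n";   # /bar/index.php
echo $query."\n";  # ?f=bar&fbahd=3
```

Because the input is just a string, the same call works whether the URI came from $_SERVER['REQUEST_URI'], an nginx-provided variable, or a test array.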

Below is the function and some sample code to run it. There are probably a few opcodes that could be saved in trade for a little bit of memory by calculating the length of the URI, the position of the "?" if there is one, etc. However, I really hate defining variables that get used only once (a major pet peeve is when people create a new variable for absolutely no reason; this one would at least save a couple of CPU cycles...) - so this is my least-amount-of-code-possible version. Enjoy.

# some example URIs
$uris = array(
   "/bar/",
   "/bar",
   "/bar/ee",
   "/bar/index.html",
   "/bar/index.php",
   "/bar/index.php?fds",
   "/bar/index.php?f=bar&fbahd=3",
   "/bar/index.php?http://www.foo.com",
   "/bar/index.php/bark",
   "/bar/index.php!meow|fJG)*#)$*J:g",
   "?somehow",
   ""
);

foreach($uris as $uri) {
   echo $uri." => ".normalize_uri($uri)."\n";
}

function normalize_uri($uri) {
   # if there is a query string, we want to chop it off and put it aside
   $query = '';
   $pos = strpos($uri, '?');
   if($pos !== false) {
      $query = substr($uri, $pos);
      $uri = substr($uri, 0, $pos);
   }

   # scrub a trailing index.* filename (index.php, index.html, ...) off the end;
   # the dot is escaped and a leading '/' (or start of string) is required,
   # so something like "/reindex.php" is left alone
   $uri = preg_replace('#(^|/)index\.\w+$#', '$1', $uri);

   # if it doesn't end with a '/', then add one
   if(substr($uri, -1) !== '/') {
      $uri .= '/';
   }

   # finally, put the query string back on (it is '' if there was none)
   return $uri.$query;
}
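One property worth checking for any normalizer that feeds a redirect is that running it twice gives the same result as running it once; otherwise the redirected URL would itself get redirected. The sketch below is a hypothetical check, not part of the original code. It takes the normalizer as a callable and uses a simple stand-in closure so the snippet runs by itself; with the function above loaded, you would pass 'normalize_uri' instead.

```php
<?php
# Hypothetical check (not from the original post): return the inputs for
# which normalizing twice differs from normalizing once.
function non_idempotent_uris($normalize, array $uris) {
   $bad = array();
   foreach($uris as $uri) {
      $once = call_user_func($normalize, $uri);
      $twice = call_user_func($normalize, $once);
      if($once !== $twice) {
         $bad[] = $uri;
      }
   }
   return $bad;
}

# stand-in normalizer for the demo only: add a trailing '/' if it is missing
$add_slash = function($uri) {
   return substr($uri, -1) === '/' ? $uri : $uri.'/';
};

$bad = non_idempotent_uris($add_slash, array('/bar', '/bar/', ''));
echo count($bad) === 0 ? "stable\n" : "unstable: ".implode(', ', $bad)."\n";
```

An empty result means every sample URI settles after a single pass, which is what a 301 redirect needs.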

You'd tie this in with a redirect that sends a 301 (search engine friendly) header, something like:

header('Location: http://'.$_SERVER['HTTP_HOST'].$uri, true, 301);
exit();

Don't hardcode the scheme, though - it should be https or http, depending on what your site uses. Something like this should work:

if(isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on') {
   $scheme = 'https://';
} else {
   $scheme = 'http://';
}

This is how I would throw it all together:

# fill in $_SERVER['REQUEST_URI'] here with whatever is holding the original URI
# only redirect when normalizing actually changes the URI; checking just the
# last character would loop forever on something like "/foo/?a=1"
if(isset($_SERVER['REQUEST_URI']) && normalize_uri($_SERVER['REQUEST_URI']) !== $_SERVER['REQUEST_URI']) {
   if(isset($_SERVER['HTTPS']) && strtolower($_SERVER['HTTPS']) == 'on') {
      $url = 'https://';
   } else {
      $url = 'http://';
   }
   $url .= $_SERVER['HTTP_HOST'].normalize_uri($_SERVER['REQUEST_URI']);
   header('Location: '.$url, true, 301);
   exit();
}

Now, it is 3:30am and I am trying to compose this inside of WordPress, but I believe that will work.
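Since a snippet that calls header() and exit() is awkward to try out, here is a minimal sketch of the same decision as a plain function that returns the redirect target (or null when no redirect is needed). The name redirect_target, the $server parameter and the stand-in normalizer are all assumptions for illustration and not part of the original code; in real use you would pass $_SERVER and 'normalize_uri'.

```php
<?php
# Hypothetical helper (not from the original post): work out where to
# redirect to, or return null if the request should be left alone.
function redirect_target(array $server, $normalize) {
   if(!isset($server['REQUEST_URI'], $server['HTTP_HOST'])) {
      return null;
   }
   $wanted = call_user_func($normalize, $server['REQUEST_URI']);
   # already in the wanted form: redirecting again would just loop
   if($wanted === $server['REQUEST_URI']) {
      return null;
   }
   $https = isset($server['HTTPS']) && strtolower($server['HTTPS']) == 'on';
   return ($https ? 'https://' : 'http://').$server['HTTP_HOST'].$wanted;
}

# stand-in normalizer for the demo only: add a trailing '/' if it is missing
$add_slash = function($uri) {
   return substr($uri, -1) === '/' ? $uri : $uri.'/';
};

$fake = array('HTTP_HOST' => 'example.com', 'REQUEST_URI' => '/foo');
echo redirect_target($fake, $add_slash)."\n";   # http://example.com/foo/
```

The caller then only has to send header('Location: '.$target, true, 301) and exit() when the result is not null.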

I apologize - for some reason the indentation is not showing up... remind me to add in that neat code sample plugin soon.

Categories: PHP
  1. mike
    December 3rd, 2008 at 13:52 | #1

    Oops. Forgot the Location: in the header() call in the last code sample. That's what I get for doing this at 3:45am...
