Just Software Solutions

Blog Archive for / webdesign /

How Search Engines See Keywords

Friday, 25 July 2008

Jennifer Laycock's recent post on How Search Engines See Keywords over at Search Engine Guide really surprised me. It harks back to the 1990s, with talk of keyword density, and doesn't match my understanding of modern search engines at all. It especially surprised me given the author: I felt that Jennifer was pretty savvy about these things. Maybe I'm just missing something really crucial.

Anyway, my understanding is that the search engines index each and every word on your page, and store a count of each word and phrase. If you say "rubber balls" three times, it doesn't matter if you also say "red marbles" three times: the engines don't assign "keywords" to a page, they find pages that match what the user types. This is why if I include a random phrase on a web page exactly once, and then search for that phrase, my page will likely show up in the results (assuming the phrase was sufficiently uncommon), even though other phrases appear more often on the same page.

Once the engine has found the pages that contain the phrase that users have searched for (whether in content, or in links to that page), the search engine then ranks those pages to decide what to show. The ranking will use things like the number of times the phrase appears on the page, whether it appears in the title, in headings, links, <strong> tags or just in plain text, how many other pages link to that page with that phrase, and all the usual stuff.

Here, let's put it to the test. At the time of writing, a search on Google for "wibble flibble splodge bucket" with quotes returns no results, and a search without quotes returns just three entries. Given Google's crawl rate for my website, I expect this blog entry will turn up in the search results for that phrase within a few days, even though it only appears the once and other phrases such as "search engines" appear far more often. Of course, I may be wrong, but only time will tell.

Posted by Anthony Williams

Using CSS to Replace Text with Images

Monday, 29 October 2007

Lots has been said about ways to replace text with images so that users with a graphical browser get a nice pretty logo, whilst search engines and screen readers get to see the text version. Most recently, Eric Enge has posted A Comprehensive Guide to Hidden Text & Search Engines over at SEOmoz. In general, I think it's a fair summary of the techniques I've encountered.

However, I was surprised to see the order of entries in the "may be OK" list. Firstly, I'd have expected sIFR to be top of the list — this is a widely used technique, and just replaces existing text with the same text in a different font. I prefer to do without Flash where possible, and this only works where you want to change the font rather than use a logo, but I can certainly see the draw here.

Secondly, I was surprised that the suggestion at the top of the list is to position the text off screen. I think this is a really bad idea, for accessibility reasons. When I only had a dial-up connection, I often used to browse with images turned off in order to reduce download times, and if the text were positioned off screen, I would have just got a blank space. Even now, I often check websites with images turned off, because I think it is important that they still work that way. It is for this reason that my preferred technique is "Fahrner Image Replacement" (FIR). Whilst Eric says this is a no-no according to the Google Guidelines, I can't really see how: it's not deceptive in intent, and the text is seen by users without image support (or with images turned off) as well as by the search engine bots. Also, given the quote from Susan Moskwa, it seems fine. Here's a quick summary of how it works:

Overlaying text with an image in CSS

The key to this technique is to have a nested SPAN with no content, position it over the text, and set a background image on it. If the background image loads, it hides the original text.

<h1 id="title"><span></span>Some Title Text</h1>

It is important to set the size of the enclosing tag to match the image, so that the hidden text doesn't leak out round the edges at large font sizes. The CSS is simple:

#title
{
    position: relative;   /* establish a positioning context for the span */
    width: 200px;         /* match the image dimensions exactly */
    height: 100px;
    margin: 0px;
    padding: 0px;
    overflow: hidden;     /* clip any text that overflows at large font sizes */
}

#title span
{
    position: absolute;   /* overlay the span on top of the text */
    top: 0px;
    left: 0px;
    width: 200px;         /* match the image dimensions exactly */
    height: 100px;
    background-image: url(/images/title-image.png);
}

This simple technique works in all the major browsers, including Internet Explorer, and gracefully degrades. Obviously, you can't select text from the image, but you can generally select the hidden text (though it's hard to see what you're doing), and copying the whole page will include the hidden text. Check it out — how does the title above ("Overlaying text with an image in CSS") appear in your browser?

Update: It has been pointed out in a comment on the linked SEOmoz article by bjornjohansen that you need to be aware that browsers may be using a different font size. This is definitely important, and it's why we specify the exact dimensions for the enclosing element and use overflow: hidden to avoid overhang. It's also important to ensure that the raw text (without the image) fits the specified space when rendered in at least one font size larger than "normal", so that people who use larger fonts can still read it with images disabled, without getting the text clipped.

Update: In another comment over on the SEOmoz article, MarioFr suggested that for headings the A tag could be used instead of SPAN — since empty A tags can be used as a link target in the heading, it works as a suitable replacement. I've changed the heading above to use an A tag for both purposes as an example.
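
For reference, here's a minimal sketch of what that variant might look like; the anchor name is purely illustrative, and the #title span rule in the CSS above would change to match #title a:

<h1 id="title"><a name="css-overlay-example"></a>Some Title Text</h1>

The empty A tag then doubles as a link target, so other pages can link straight to the heading with a fragment URL such as #css-overlay-example.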

Posted by Anthony Williams

Reduce Bandwidth Usage by Compressing Pages in PHP

Monday, 15 October 2007

In Reduce Bandwidth Usage by Supporting If-Modified-Since in PHP, I identified one way to reduce your bandwidth usage — use the appropriate HTTP headers to avoid sending content that hasn't changed. Another way to reduce your bandwidth usage is to compress your pages.

HTTP headers

The Accept-Encoding HTTP header is used by browsers to specify potential encodings for a requested web page. For Firefox, this is generally set to "gzip, deflate", meaning that the browser will accept (and decompress) web pages compressed with the gzip or deflate compression algorithms. The web server can then use the Content-Encoding header to indicate that it has used a particular encoding for the served page. The Vary header tells browsers and proxies that the response depends on the value of the Accept-Encoding request header, so a cached copy is only reused for requests with matching encoding preferences. For example, if the server compresses the page using gzip, then it will return headers that say

    Content-Encoding: gzip
    Vary: Accept-Encoding

Handling compression in PHP

For static pages, compression is handled by your web server (though you might have to configure it to do so). For pages generated with PHP you are in charge. However, supporting compression is really easy. Just add:

    ob_start('ob_gzhandler');

to the start of the script. It is important that this comes before any output has been written: in order to compress the output, all of it has to pass through the filter, and the headers have to be set before any content goes to the browser. If any content has already been sent, then this won't work, which is why I put it at the very start of the script; that way, there's not much chance of anything interfering.
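
For illustration, here's a minimal sketch of a complete compressed page; the markup is just a placeholder:

<?php
// This must be the very first thing in the file: even a stray space or
// blank line before the opening PHP tag counts as output
ob_start('ob_gzhandler');
?>
<html>
<head><title>Compressed page</title></head>
<body>
<p>This page is compressed if the browser supports it.</p>
</body>
</html>

Conveniently, ob_gzhandler checks the Accept-Encoding header itself, and falls back to sending the page uncompressed if the browser doesn't support gzip or deflate.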

Posted by Anthony Williams

Reduce Bandwidth Usage by Supporting If-Modified-Since in PHP

Sunday, 30 September 2007

By default, pages generated with PHP are not cached by browsers or proxies, as they are generated anew by the server every time they are requested. If you have repeat visitors to your website, or even many visitors that use the same proxy, this means that a lot of bandwidth is wasted transferring content that hasn't changed since the last visit. By adding appropriate code to your PHP pages, you can allow your pages to be cached, and reduce the required bandwidth.

As Bruce Eckel points out in RSS: The Wrong Solution to a Broken Internet, this is a particular problem for RSS feeds — feed readers are often overly enthusiastic in their checking rate, and given the tendency of bloggers to provide full feeds this can lead to a lot of wasted bandwidth. By using the code from this article in your feed-generating code you can save yourself a whole lot of bandwidth.

Caching and HTTP headers

Whenever a browser requests a page, the server includes a Last-Modified header in the response to indicate the last modification time. For static pages, this is the last modification time of the file, but for dynamic pages it typically defaults to the time the page was requested. When a browser or proxy requests a page it has seen before, it generally takes the Last-Modified time from the cached version and puts it in an If-Modified-Since request header. If the page has not changed since then, the server should respond with a 304 response code to indicate that the cached version is still valid, rather than sending the page content again.
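
For example, a repeat request and the corresponding response might look like this (the names and dates are purely illustrative):

    GET /index.php HTTP/1.1
    Host: example.com
    If-Modified-Since: Sun, 30 Sep 2007 10:00:00 GMT

    HTTP/1.1 304 Not Modified

No page content follows the 304 status line, which is where the bandwidth saving comes from.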

To handle this correctly for PHP pages requires two things:

  • Identifying the last modification time for the page, and
  • Checking the request headers for If-Modified-Since, and responding accordingly.

Timestamps

There are two components to the last modification time: the date of the data used to generate the page, and the date of the script itself. Both are equally important: we want the page to be updated when the data changes, and if the script has been changed the generated page may be different (for example, the layout could be different). My PHP code incorporates both by defaulting to the modification time of the script, and allowing the caller to pass in the data modification time, which is used if it is more recent. The last modification time is then used to generate a Last-Modified header, and returned to the caller. Here is the function that adds the Last-Modified header. It uses both getlastmod() and filemtime(__FILE__) to determine the script modification time, on the assumption that this function is in a file included from the main script, and we want to detect changes to either.

function setLastModified($last_modified=NULL)
{
    // Modification time of the main script being served
    $page_modified=getlastmod();

    if(empty($last_modified) || ($last_modified < $page_modified))
    {
        $last_modified=$page_modified;
    }
    // Also allow for changes to the file containing this function
    $header_modified=filemtime(__FILE__);
    if($header_modified > $last_modified)
    {
        $last_modified=$header_modified;
    }
    // HTTP dates must be expressed in GMT
    header('Last-Modified: ' . gmdate('D, d M Y H:i:s',$last_modified) . ' GMT');
    return $last_modified;
}

Handling If-Modified-Since

If the If-Modified-Since request header is present, then it can be parsed to get a timestamp that can be compared against the modification time. If the page has not been modified since the time given in the header, a 304 response can be returned instead of generating the page.

In PHP, the HTTP request headers are generally stored in the $_SERVER superglobal, with a name starting with HTTP_ based on the header name. For our purposes, we need the HTTP_IF_MODIFIED_SINCE entry, which corresponds to the If-Modified-Since header. We can check for this with array_key_exists, and parse the date with strtotime. There's a slight complication in that old browsers used to add additional data to this header, separated with a semicolon, so we need to strip that out (using preg_replace) before parsing. If the header is present, and the specified date is at least as recent as the last-modified time, we can just return the 304 response code and quit, with no further output required. Here is the function that handles this:

function exitIfNotModifiedSince($last_modified)
{
    if(array_key_exists("HTTP_IF_MODIFIED_SINCE",$_SERVER))
    {
        // Some old browsers append "; length=..." to the date, so strip
        // anything after a semicolon before parsing
        $if_modified_since=strtotime(preg_replace('/;.*$/','',$_SERVER["HTTP_IF_MODIFIED_SINCE"]));
        if($if_modified_since >= $last_modified)
        {
            // The cached copy is still valid: tell the client and stop
            header("HTTP/1.0 304 Not Modified");
            exit();
        }
    }
}

Putting it all together

Using the two functions together is really simple:

     exitIfNotModifiedSince(setLastModified()); // for pages with no data-dependency
     exitIfNotModifiedSince(setLastModified($data_modification_time)); // for data-dependent pages

Of course, you can use the functions separately if that better suits your needs.
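
As a final illustration, here's a minimal sketch of a data-dependent page; the file names are illustrative, and the two functions are assumed to live in an included helper file, as described above:

<?php
require_once('last-modified.php'); // defines the two functions above

$data_file='news-items.txt';           // whatever data drives the page
$data_modified=filemtime($data_file);  // when that data last changed

exitIfNotModifiedSince(setLastModified($data_modified));

// Only reached when there is no valid cached copy
echo '<html><body><pre>';
echo htmlspecialchars(file_get_contents($data_file));
echo '</pre></body></html>';
?>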

Posted by Anthony Williams

Free SEO Tools for Webmasters

Monday, 24 September 2007

I thought I'd share some of the free online tools that I use for assisting with Search Engine Optimization.

The people behind the iWebTool Directory provide a set of free webmaster tools, including a Broken Link Checker, a Backlink Checker and their Rank Checker. For most tools, just enter your domain or URL in the box, click "Check!" and wait for the results.

Whereas the iWebTool tools each perform one small task, Website Grader is an all-in-one tool for grading your website. Put in your URL, the keywords you wish to rank well for, and the websites of your competitors (if you wish for a comparison). When you submit your site, the tool then displays its progress at the bottom of the page, and after a few moments will give a report on your website, including your PageRank, Alexa rank, inbound links and Google rankings for you and your competitors for the search terms you provided, as well as a quick analysis of the content of your page.

We Build Pages offers a suite of SEO tools, much like the ones from iWebTool. I find the Top Ten Analysis SEO Tool really useful, as it compares your site against the top ten ranking sites for the search term you specify. The Backlink and Anchor Text Tool is also pretty good — it takes a while, but eventually tells you which pages link to your site, and what anchor text they use for the link.

Posted by Anthony Williams

Design and Content Copyright © 2005-2024 Just Software Solutions Ltd. All rights reserved. | Privacy Policy