joereger.com



August 30, 2004, 11:31 AM

Linkrot Fixer



As the web becomes more interconnected and information systems rely more heavily on the currency of the link to determine relevance, Linkrot becomes a bigger problem.  So we've decided to implement a Linkrot Fixer as part of the Reger.com web logging tool.  This message outlines what the Linkrot Fixer does and asks some questions where we need help.


Influences for the Linkrot Fixer are Dave Winer's (http://www.scripting.com) weblog wishlist manifesto (summary by Lisa Williams http://www.cadence90.com/blogs/2004_03_01_nixon_archives.html#107902918872392913) and Phil Wolff's Linkrot Spider outline (http://radio.weblogs.com/0100827/categories/blueSkyRadio/2002/04/26.html#a323).


Linkrot Fixer finds broken outbound links, recommends fixes and allows you to quickly and easily repair them.  It's still in beta and we need help testing/refining it.  The two primary areas of evaluation are:



  1. Accuracy of recommended fixes - How well does the tool find the page that's moved or a reasonable replacement?  We're using Google's API to recommend pages, so this boils down to how well we can compress a page into keywords.
  2. User interface - How easy is it to fix your broken links?

Here's an outline of what the Linkrot Fixer does:



  • Every night a thread (spooky little invisible program) runs behind the scenes to find all outbound links in entries made with the reger.com web logging tool. 

    • Links are found by passing the contents of an entry into a parser that, very simply, looks for the <a href=''></a> tag, extracting the url when it finds it. 
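
A minimal sketch of that kind of parser, assuming a regular expression is good enough for entry markup.  The class name LinkExtractor and the choice to keep only absolute http/https urls are assumptions made for illustration, not a description of the actual reger.com code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: pulls href values out of anchor tags in an entry's HTML. */
final class LinkExtractor {

    // Matches <a ... href="url"> or <a ... href='url'>, ignoring case
    // and tolerating other attributes before the href.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private LinkExtractor() {}

    static List<String> extractLinks(String entryHtml) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(entryHtml);
        while (m.find()) {
            String url = m.group(2).trim();
            // Assumption: only absolute http(s) urls count as outbound links.
            if (url.startsWith("http://") || url.startsWith("https://")) {
                urls.add(url);
            }
        }
        return urls;
    }
}
```

Calling LinkExtractor.extractLinks(entryHtml) returns the urls in the order they appear in the entry.  A regular expression will miss unquoted href values and can be fooled by unusual markup, so a real HTML parser may be the safer choice.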

  • Each link is saved in the database and relationships to entries are created.  There is a one-to-many relationship between links and entries.  This allows us to test urls used frequently by web loggers just one time, saving bandwidth.
  • Each link is spidered, which means that the thread (that spooky little invisible program) acts like a web browser and retrieves the url in question.
  • When a status 200 is found (successful page retrieval) the html is compacted to a set of keywords that can be used later on when the link is broken and the page can't be found any more.  These keywords are stored with the link in the database.   Here's how it's done:

    • Extract the first three words of the page title (if there is one) that are at least three characters in length.  When we go back to the search engines to find the page once it's moved we're only allowed to search on ten words (Google's limit) so we need to be smart.  Many sites have standard page titles despite dynamic underlying content, so there's a balance here.  If we use too much of our word budget for the title, such sites will never be found.  At the same time, the page title is often very relevant and increases the quality of the recommendations.  The content of the page clearly needs to be taken into account.
    • Ignore metatags.  Keywords are too often added universally to all pages on a site.  This means that they aren't relevant to the page itself.
    • Remove everything in the Head section of the page.  We already have the title and are ignoring metatags.  Replace with a space to retain word boundaries.
    • Remove any javascript and any css by searching for <script> and <style> tags.  Replace with a space to retain word boundaries.
    • Remove all remaining html tags (just the tags, not what's between them).  Replace with a space to retain word boundaries.
    • Remove all punctuation.  This obliterates words with apostrophes but is worth it because the results are much cleaner.  We used Java's Character.isLetterOrDigit(char) function to determine whether a character was punctuation or not.  We could use some help to make this cleaner.  Anybody with a better Java method to do this?
    • Remove some misc stuff. (&nbsp;)
    • Create a hashmap of each word and the number of times that it is used on the page.  This looks something like this:

      • Joe - 3


      • massive - 21
      • of - 98
      • penis - 5
      • the - 176
      • somewhat - 1
      • girth - 18
      • logging - 15
      • pleasurable - 9

    • Remove any words less than 3 characters in length.
    • Remove popular words.  We used a list of the most used words in the English language (http://esl.about.com/library/vocabulary/bl1000_list1.htm).  So far we've found that using the top 100 provides good results.
    • Sort the hashmap by the number of times that words are used.  The goal here is to prioritize the words that are most used on the page.  The hashmap now looks like:

      • massive - 21
      • girth - 18
      • logging - 15
      • pleasurable - 9
      • penis - 5
      • Joe - 3
      • somewhat - 1

    • Create a string representation of the keywords with the most-used words on the left.  We can currently only throw 10 words into Google, but we're going to save a few more keywords just in case.
    • Save the string of keywords in the Linkrot database along with the link.  The string looks like: 

      • "massive girth logging pleasurable penis Joe somewhat"

    • This will happen over and over again each night for each URL until...
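
The keyword steps above can be sketched roughly as follows.  This is an illustration under stated assumptions rather than the production code: the class name KeywordExtractor, the ten-word stand-in for the top-100 list, lowercasing every word, stripping &nbsp; before punctuation, and placing the title words ahead of the frequency-sorted words are all choices made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch of the page-to-keywords compaction. */
final class KeywordExtractor {

    // Abbreviated stand-in for the "top 100 English words" list.
    private static final Set<String> COMMON_WORDS = new HashSet<>(Arrays.asList(
            "the", "and", "that", "have", "for", "not", "with", "you", "this", "but"));

    private static final Pattern TITLE = Pattern.compile(
            "<title[^>]*>(.*?)</title>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    private KeywordExtractor() {}

    static String extractKeywords(String html, int maxKeywords) {
        Set<String> keywords = new LinkedHashSet<>();

        // First three title words that are at least three characters long.
        Matcher t = TITLE.matcher(html);
        if (t.find()) {
            for (String w : words(t.group(1))) {
                if (w.length() >= 3 && keywords.size() < 3) keywords.add(w);
            }
        }

        // Drop head, scripts, styles, remaining tags and &nbsp;,
        // replacing each with a space to retain word boundaries.
        String body = html
                .replaceAll("(?is)<head\\b.*?</head>", " ")
                .replaceAll("(?is)<script\\b.*?</script>", " ")
                .replaceAll("(?is)<style\\b.*?</style>", " ")
                .replaceAll("<[^>]*>", " ")
                .replaceAll("(?i)&nbsp;", " ");

        // Count the words, skipping short and popular ones.
        Map<String, Integer> counts = new HashMap<>();
        for (String w : words(body)) {
            if (w.length() < 3 || COMMON_WORDS.contains(w)) continue;
            counts.merge(w, 1, Integer::sum);
        }

        // Most-used words first; ties are broken alphabetically.
        List<Map.Entry<String, Integer>> sorted = new ArrayList<>(counts.entrySet());
        sorted.sort((a, b) -> {
            int byCount = b.getValue().compareTo(a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        for (Map.Entry<String, Integer> e : sorted) {
            if (keywords.size() >= maxKeywords) break;
            keywords.add(e.getKey());
        }
        return String.join(" ", keywords);
    }

    // Lowercases the text, turns every non-letter/digit into a space, and splits on whitespace.
    private static List<String> words(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            sb.append(Character.isLetterOrDigit(c) ? Character.toLowerCase(c) : ' ');
        }
        List<String> out = new ArrayList<>();
        for (String w : sb.toString().trim().split("\\s+")) {
            if (!w.isEmpty()) out.add(w);
        }
        return out;
    }
}
```

Passing a maxKeywords value a little above ten matches the idea of saving a few spare keywords.  Because a word that is both in the title and frequent in the body is only stored once, the title never uses more than three slots of the budget.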

  • When a 404 Page Not Found error is found we mark the URL as broken in the database.

    • We do not generate keywords when a 404 is found because we want to defend the original keywords generated when the page was valid.
    • The sands of time flow until you log into your account and check your Linkrot Spider.

      • Question: How important is it that you're immediately emailed about broken links?  We think that simply finding them the next time you log on is reasonable but we're open to suggestions. 

    • User logs into his/her Linkrot Fixer utility and finds a list of dead links.

      • They are shown which entry the link is used in
      • They are presented with a "Fix" button

    • Click "Fix" button

      • Behind the scenes we call Google's API and pass it the keywords that were created when the link worked.  It's like bringing it back from the dead.  Or, less dramatically, using a compressed cached version.
      • Google responds with a set of web pages that it thinks fit the keywords. 
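
Since more keywords are stored than Google accepts, the stored string has to be trimmed before it is sent.  A small sketch of that step, with a hypothetical QueryBuilder class; the call to Google's API itself is deliberately left out:

```java
import java.util.Arrays;

/** Hypothetical helper: trims stored keywords to the search engine's word budget. */
final class QueryBuilder {

    // The ten-word search limit mentioned earlier in this post.
    static final int GOOGLE_WORD_LIMIT = 10;

    private QueryBuilder() {}

    static String buildQuery(String storedKeywords) {
        String[] words = storedKeywords.trim().split("\\s+");
        int n = Math.min(words.length, GOOGLE_WORD_LIMIT);
        // The stored string is ordered most-used first, so keep the leftmost words.
        return String.join(" ", Arrays.copyOf(words, n));
    }
}
```

Keeping the limit in one constant means the spare keywords become useful right away if the limit ever changes.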

    • User is presented with a list of possible fixes

      • They can one-click to fix the link with one of the recommended pages.

        • If another user has fixed the same link the url they used will be offered (see below)

      • They are also given the opportunity to manually type a fixed url.

    • Entry is updated.  The entry is parsed, the old link removed and the new one inserted. 

      • Note: If, at the time of fixing, the link is no longer found in the entry (a possibility because the Linkrot spider only runs once a day and users can update their entries 24/7) then nothing is done.  It's not like we can randomly choose another link to replace. 
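
One way the entry update could work, including the "do nothing if the link is gone" rule.  EntryLinkFixer and replaceLink are names invented for this sketch, and it assumes links are written with quoted href attributes:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: swaps a broken url for its fix inside an entry. */
final class EntryLinkFixer {

    private EntryLinkFixer() {}

    /** Returns the updated entry, or the entry unchanged if the old link is no longer in it. */
    static String replaceLink(String entryHtml, String oldUrl, String newUrl) {
        // Match href="oldUrl" or href='oldUrl' exactly, so a longer url that
        // merely starts with oldUrl is left alone.
        Pattern p = Pattern.compile(
                "(?<pre>\\bhref\\s*=\\s*(?<q>[\"']))" + Pattern.quote(oldUrl) + "\\k<q>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(entryHtml);
        if (!m.find()) {
            return entryHtml; // link was edited away since the last spider run
        }
        // quoteReplacement keeps '$' and '\' in the new url from being read as group references.
        return m.replaceAll("${pre}" + Matcher.quoteReplacement(newUrl) + "${q}");
    }
}
```

Every occurrence of the old url in the entry is replaced, which seems like the least surprising behavior when the same link appears twice.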

    • Linkrot database updated

      • If this user was the only one using the link then the link is removed from the Linkrot database. 
      • The new, fixed, link will be added later that night when the entry is re-parsed.
      • If another user was using the link it is kept in the Linkrot database.  This is key because each person needs to be able to fix their own links.

        • Question: Let's say ten people share the same link and suddenly it breaks.  When Bob logs in and fixes it, should Sally, Ted and the others who use it be able to see what URL Bob used to fix it?  Ideally Bob would be able to choose whether he wants to share his fix or not.
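
The bookkeeping described here can be pictured with an in-memory stand-in for the Linkrot database.  LinkRegistry is a hypothetical class, and it tracks entry ids rather than users, on the assumption that each entry belongs to one user:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical in-memory stand-in for the link-to-entries tables. */
final class LinkRegistry {

    // url -> ids of the entries that contain it; each url is stored (and spidered) once.
    private final Map<String, Set<Integer>> entriesByUrl = new HashMap<>();

    void addLink(String url, int entryId) {
        entriesByUrl.computeIfAbsent(url, k -> new HashSet<>()).add(entryId);
    }

    /** Called after a link is fixed in an entry; forgets the url once no entry uses it. */
    void removeLink(String url, int entryId) {
        Set<Integer> entries = entriesByUrl.get(url);
        if (entries == null) return;
        entries.remove(entryId);
        if (entries.isEmpty()) entriesByUrl.remove(url);
    }

    boolean isTracked(String url) {
        return entriesByUrl.containsKey(url);
    }

    int urlsToSpider() {
        return entriesByUrl.size();
    }
}
```

A url shared by several entries stays tracked until the last of them has been fixed, so everyone else still sees it as broken and can repair their own copy.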

  • When a 301 Moved Permanently status is found, the link that it points to is presented to the user and they can fix it. 

    • The redirect is stored as a recommendation and the status of 301 is saved so that users know it's a 301, not a 404.
    • Question: Would you like the nightly process to automatically update your links when a 301 is encountered?  We certainly want to keep every logger in control of his/her own content so this needs to be configurable. 

  • When a 500 (Server Error) status is found we mark the link as broken.  A 500 error says that there's something wrong on the server.  The page may be fine but have a small syntax error at the code level, etc.   A 500 does not indicate that the page isn't there or that it has moved. 

    • Question: Does this decision to mark broken make sense?  The reality is that a link returning a 500 looks bad to readers who click on it, which is what the Linkrot Fixer is trying to fix.  At the same time, technically the page is still there and it's the responsibility of the page owner to fix it.

  • Here's the code that determines what is done with each status code.  Ignore this if it doesn't make sense to you.  If it does make sense to you, note that there are only four functions to deal with all of the statuses:

    • if (myHttp.statusCode<300){
          // Anything below 300 is treated as a successful retrieval.
          process200(myHttp, eventid);
      } else if (myHttp.statusCode==300){
          process301(myHttp, eventid);
      } else if (myHttp.statusCode==301){
          process301(myHttp, eventid);
      } else if (myHttp.statusCode>=302 && myHttp.statusCode<=399){
          // All other redirects are handled like a 301.
          process301(myHttp, eventid);
      } else if (myHttp.statusCode>=400 && myHttp.statusCode<403){
          process404(myHttp, eventid);
      } else if (myHttp.statusCode==404){
          process404(myHttp, eventid);
      } else if (myHttp.statusCode>=405 && myHttp.statusCode<499){
          process404(myHttp, eventid);
      } else if (myHttp.statusCode==500){
          process500(myHttp, eventid);
      } else if (myHttp.statusCode>500){
          process500(myHttp, eventid);
      } else {
          // Codes not matched above (403 and 499) are also treated as broken.
          process404(myHttp, eventid);
      }

    • Question: Are other behaviors (methods) needed?  I'm particularly concerned with the 400 series of errors. They're client errors and I don't want errors on my side to give users false broken link reports.
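
Because every branch of the chain above ends in one of four handlers, the same decision can be written as four ranges.  The sketch below uses a hypothetical StatusClassifier with an Action enum in place of the processNNN calls; it is meant to behave the same as the chain for every status code, including 403 and 499, which reach process404 through the final else:

```java
/** Hypothetical range-based equivalent of the status dispatch. */
final class StatusClassifier {

    enum Action { OK, REDIRECT, BROKEN, SERVER_ERROR }

    private StatusClassifier() {}

    static Action classify(int statusCode) {
        if (statusCode < 300) return Action.OK;        // where process200 is called
        if (statusCode < 400) return Action.REDIRECT;  // where process301 is called
        if (statusCode < 500) return Action.BROKEN;    // where process404 is called
        return Action.SERVER_ERROR;                    // where process500 is called
    }
}
```

Written this way, a 400-series code that should not count as a broken link (for example, one caused by the spider's own request) could be carved out as its own case without touching the other ranges.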

  • How to help:

    • Create a free account at http://www.reger.com
    • Log in and create a web log entry with a few links.

      • Create at least a few valid links and a few broken links.  To create a broken link just find a valid link and add a few garbage characters (like "fssfd") to the end.
      • Advanced testing step: create a link to a page that you can control and break later.  This is important because the Linkrot spider needs to have a successful page retrieval at least once to generate keywords.

    • Wait for about 24 hours.  The Linkrot Fixer will go out and spider the links.

      • Depending on which timezone you're in the Linkrot Fixer may run at various times throughout the day.

    • Log back in to your account and click on the Advanced tab at the top right of the page.
    • Click on the Linkrot Fixer link at the bottom of the next page.
    • You'll be presented (if the Linkrot Fixer has had a chance to run) with a list of the outbound links on your site and a status for each of them.
    • Click "Fix" on a broken link and you'll be presented with a list of recommendations from Google.  You can also fix the link manually yourself. 

      • Note: If you didn't do the Advanced testing step above you're not likely to get good recommendations on your broken pages.  This is because the Linkrot Fixer hasn't had the opportunity to spider the page and generate keywords... the first time it visited the page it was broken.

    • Let us know what you think by sending feedback to: http://www.reger.com/about/feedback.log
    • Advanced testing step: Now you've seen how the system works.  It's time to really put it to the test.  Break the link that you control. 
    • Advanced testing step: Wait another 24 hours.
    • Advanced testing step: Log in to your account and go to the Linkrot Fixer. 
    • Advanced testing step: The link you control should now be broken and you can click "Fix"
    • Advanced testing step: You should see a set of page recommendations from Google.

  • Please give us any feedback at: http://www.reger.com/about/feedback.log
  • Thanks!