<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Chasing &#187; project</title>
	<atom:link href="http://chase.ratchetsoftware.com/tag/project/feed/" rel="self" type="application/rss+xml" />
	<link>http://chase.ratchetsoftware.com</link>
	<description>Chase Gray's blog with solutions to various problems by a curious American Ph.D. student.</description>
	<lastBuildDate>Wed, 09 Jun 2010 05:13:48 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=abc</generator>
		<item>
		<title>URL Archive System or: URL Hacking Made Easy</title>
		<link>http://chase.ratchetsoftware.com/2008/11/url-archive-system-or-url-hacking-made-easy/</link>
		<comments>http://chase.ratchetsoftware.com/2008/11/url-archive-system-or-url-hacking-made-easy/#comments</comments>
		<pubDate>Tue, 04 Nov 2008 02:53:32 +0000</pubDate>
		<dc:creator>chasemgray</dc:creator>
				<category><![CDATA[Projects]]></category>
		<category><![CDATA[Google]]></category>
		<category><![CDATA[project]]></category>

		<guid isPermaLink="false">http://chase.ratchetsoftware.com/?p=125</guid>
		<description><![CDATA[Have you ever found yourself slightly modifying a URL to try to find something you know used to exist or should exist but you keep getting that dreaded 404 page?  Perhaps you were trying to find something that shouldn&#8217;t be online anymore but it was simply unlinked to, benevolent purposes or otherwise? Another example might [...]]]></description>
			<content:encoded><![CDATA[<p>Have you ever found yourself slightly modifying a URL to try to find something you know used to exist or should exist, only to keep getting that dreaded 404 page?  Perhaps you were trying to find something that was meant to be taken offline but was simply unlinked, whether your purposes were benevolent or otherwise?  Another example: a site&#8217;s main homepage is down, so many users are unable to access the site.  If they had a list of the site&#8217;s URLs, they could use the URL Archive to see very quickly which are still valid and reach the site through those URLs.  <a href="http://www.googleguide.com/cached_pages.html">Google Cache</a> and <a href="http://www.archive.org/web/web.php">Wayback Machine</a> get you pretty close to what you&#8217;re after, but sometimes they fall a little short.  Some things are beyond even Google&#8217;s giant umbrella of web applications (at least for now).</p>
<p>I&#8217;m sure an application focusing more on URLs is somebody&#8217;s 10% project somewhere, sitting on the back burner until they have time to finish it.  It seems like almost anything you&#8217;d want to do has a corresponding web application.  When I come across a need that doesn&#8217;t, I feel like that void should soon be filled.  I&#8217;m going to outline the application that I believe would fill this void.  Maybe in future posts I&#8217;ll also walk through my attempts to build it as my personal .5% project.<span id="more-125"></span></p>
<h3>Doesn&#8217;t this already exist?</h3>
<p>When I give the quick summary of what I envision for the project, the first question I get is, &#8220;Doesn&#8217;t Google Cache already do that?&#8221;  My initial reaction is that if it did, I would already be using it for that purpose.  After thinking about it, it doesn&#8217;t seem like much of a stretch for Wayback Machine or Google Cache to archive URLs as well.  The real kicker is that, to be useful, a URL archive would need extra processing and utilities.  The information is already there, so what remains is to create an interface for analyzing the URLs a site has.  Before talking about the ideal solution, let&#8217;s first look at what Google Cache and Wayback Machine already offer for comparison.</p>
<p>As far as I know, the only way to get this sort of functionality out of Google Cache is to use a &#8220;site:&#8221; search.  Typing &#8220;site:arena.cse.sc.edu&#8221; into the search box, I get the following:</p>
<div id="attachment_127" class="wp-caption alignnone" style="width: 510px"><a href="http://chase.ratchetsoftware.com/wp-content/2008/11/picture-1.png" rel="lightbox[125]"><img class="size-full wp-image-127" title="ARENA Google Site: Result" src="http://chase.ratchetsoftware.com/wp-content/2008/11/picture-1.png" alt="Result for site:arena.cse.sc.edu" width="500" height="271" /></a><p class="wp-caption-text">Result for site:arena.cse.sc.edu</p></div>
<p>I know that our research group&#8217;s website lists most of the PDFs we have published, as well as a number of static pages, which together total more than the 17 listed in these results.  Of course, this site isn&#8217;t really optimized for search engines, so that might not be helping.  Searching for just PDFs returns a measly 7 results.  Some of our papers are listed as links to other sites such as IEEE, which keeps them out of these results.  Alternatively, if I could somehow easily browse all the URLs that exist for this domain, I might be able to see more about what the website actually contains internally (more on this later).  As another experiment, I searched for &#8220;site:chase.ratchetsoftware.com&#8221;, and because this site is optimized for search engines, most of the URLs that exist were found pretty easily.  There is still no real way to get an overview of what the site contains without going to the site or reading the provided description.  Put another way, it would be interesting to see something like a content tree that lets a viewer tell at a quick glance what the site contains.  These are all problems I believe can be solved, and solving them would be useful, but Google Cache and Wayback Machine are not the tools to do it.</p>
<h3>What would an ideal solution look like?</h3>
<p>I&#8217;ve given a few examples where Google Cache and Wayback Machine don&#8217;t offer what I am looking for.  I think to improve my argument it is best to talk about what the URL archive system would be in a perfect world.  It&#8217;s hard to see the difference in something that doesn&#8217;t yet exist.</p>
<p>I am going to list the features I have in mind and then explain any that aren&#8217;t obvious:</p>
<ol>
<li>Provide a list of all URLs ever found for a given domain.</li>
<li>When a user views a list, it should check and display whether that URL is still valid.</li>
<li>Organize the list of URLs so that it is manageable for a normal user.  It should have collapsible sub-URLs so that only top-level URLs are in the master list.</li>
<li>Provide a filter so that only URLs matching the given filter are shown.</li>
<li>Provide the ability to click on a URL to make that the new root URL for our list.</li>
<li>A utility that can display a graphical tree view of the page and its links.  If the page links to an outside page, then show the link with the icon to that page if available.  This view would ideally provide enough information and &#8220;graphics&#8221; to give the user a good idea of what is available on the site.</li>
<li>If the site uses dynamic URLs, perhaps because it is written in Rails or PHP, there should be some analysis of the GET arguments based on what the system can figure out.  This might be an area where users can submit additional information to improve the analysis of a web application&#8217;s generated URLs.</li>
<li>One of the coolest ones: a Firefox plugin that automatically helps you figure out URLs based on what is stored in the URL Archive.  So if you&#8217;re trying to modify a URL slightly, you can get real-time help and see which URLs are still valid, etc.  Firefox is good right now with its new location bar that automatically searches your history.  It&#8217;d be even better if it could help you with long URLs that you haven&#8217;t typed in before!</li>
<li>Another plugin idea would be to give screen readers easier access to a site&#8217;s content.  If all we had were the URLs and perhaps keywords for those URLs, a blind person might be able to find content more quickly.</li>
<li>One more plugin would be to have a sidebar while browsing sites, and as you browse you get an updated list of URLs that have existed at your current location on the site.  This would put a sidebar on your left with all the links you might see on that site and all the URLs that you might not see but existed in the past.  This would be useful for quickly navigating huge sites, especially ones that are moving their links around a lot.</li>
<li>The last idea I can think of right now is that the system could analyze how the URLs are changing and, based on the patterns in the current URLs and their change history, suggest URLs that may be used in the future or that are currently in use but not linked to.  This might be useful if someone is trying to URL hack but doesn&#8217;t have any idea where to start.</li>
</ol>
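<p>To make number seven a little more concrete, here is a minimal Python sketch of what a first pass at analyzing GET arguments might look like.  It only groups the values seen for each parameter by URL path.  The function name <code>summarize_query_params</code> and the example.com URLs are made up for illustration, and a real system would need much more than this.</p>

```python
from collections import defaultdict
from urllib.parse import urlsplit, parse_qsl


def summarize_query_params(urls):
    """Group the values seen for each GET parameter, keyed by URL path."""
    summary = defaultdict(lambda: defaultdict(set))
    for url in urls:
        parts = urlsplit(url)
        # keep_blank_values so that "?page=" still registers the parameter name
        for name, value in parse_qsl(parts.query, keep_blank_values=True):
            summary[parts.path][name].add(value)
    return summary


# Hypothetical archived URLs, invented for this example.
archived = [
    "http://example.com/papers.php?year=2007&type=pdf",
    "http://example.com/papers.php?year=2008&type=pdf",
    "http://example.com/papers.php?year=2008&type=ps",
]

for path, params in summarize_query_params(archived).items():
    for name, values in sorted(params.items()):
        print(path, name, sorted(values))
```

<p>Even a summary this simple hints at which parameters take a small, guessable set of values, which is both the useful part and the part a malicious user might exploit.</p>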
<p>I feel like number six needs further explanation in order to seem as cool to you as it does to me.  I believe there are existing tools that will show you a tree view of a website.  There are definitely tools that show this information for your own site, because I&#8217;ve used them before.  So how does number six differ from these tools?  The answer is that it doesn&#8217;t differ dramatically, but hopefully it would offer more information because the purpose is slightly different.  Take a user who is looking for a site that offers ebooks in PDF form.  The user might search Google and waste time at many sites that seem to offer ebooks but don&#8217;t have a large selection.  If the user could easily view a graph of each site before visiting, with a filter for &#8220;.pdf&#8221; turned on, he could hopefully see a tree that branches out into many different URLs leading to ebooks he&#8217;s interested in.  He should be able to quickly get an idea of the quantity, quality, and link validity of the site&#8217;s offerings.  Each node could be clicked to make it the new root of the tree, perhaps to get more specific information about a certain path.  There are many possibilities for graphically browsing a site&#8217;s URLs and URL relationships.  To me, this seems like a separate application entirely.  The URL Archive could expose a SOAP or REST interface, and another application could be built on top of it to provide this tree view of websites.</p>
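<p>The data behind such a tree view could be sketched roughly as follows in Python: nest the archived URLs by path segment, optionally keep only those ending in a given suffix, and allow re-rooting at any node.  The helper names (<code>build_tree</code>, <code>subtree</code>, <code>count_urls</code>) and the example.com URLs are hypothetical, and this ignores link validity and graphics entirely.</p>

```python
from urllib.parse import urlsplit


def build_tree(urls, suffix=""):
    """Nest URLs by path segment, keeping only paths that end with suffix."""
    tree = {}
    for url in urls:
        path = urlsplit(url).path
        if not path.endswith(suffix):
            continue
        node = tree
        for segment in path.strip("/").split("/"):
            node = node.setdefault(segment, {})
    return tree


def subtree(tree, prefix):
    """Re-root the tree at a path such as "books/scifi"."""
    node = tree
    for segment in prefix.strip("/").split("/"):
        node = node[segment]
    return node


def count_urls(node):
    """Count leaf entries below a node (a URL that is also a prefix of
    another URL is not counted separately in this simplified version)."""
    return sum(1 if not child else count_urls(child) for child in node.values())


# Hypothetical archived URLs, invented for this example.
archived = [
    "http://example.com/books/scifi/a.pdf",
    "http://example.com/books/scifi/b.pdf",
    "http://example.com/books/index.html",
    "http://example.com/about.html",
]

pdf_tree = build_tree(archived, suffix=".pdf")
print(count_urls(pdf_tree))
print(sorted(subtree(pdf_tree, "books/scifi")))
```

<p>A separate viewer application could then draw this nested structure however it likes, which is one reason to keep the archive and the tree view apart.</p>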
<h3>Potential problems and questions</h3>
<p>Not all good things come from having a URL Archive.  All the <a href="http://searchsecurity.techtarget.com/news/article/0,,sid14_gci1315588,00.html">security concerns</a> that come with Google Cache and Wayback Machine will also affect the URL Archive.  People who tried to remove information from the internet by unlinking it could be harmed by a system that indexes URLs forever.  Analyzing the GET parameters of a web application might be good for some users, but a malicious user could probably use this information to make the website behave in ways it wasn&#8217;t intended to.  Most of these problems come with developing this type of application.  There are tradeoffs, and we just need to do our best to mitigate them while still providing a good service.</p>
<p>Another issue is that I have no idea how to store the large amount of information this system would require.  I don&#8217;t even know where to begin with indexing the URLs on the Internet.  It seems like it would take way too long on the little virtual machine I have on a server somewhere.  I&#8217;d love to hear some comments about this.  How would one create an application like this that has to store such large amounts of information and process so much new data all the time?  Some resources relating to this would probably make for some great bathroom pulp <img src='http://chase.ratchetsoftware.com/wp-includes/images/smilies/icon_smile.gif' alt=':)' class='wp-smiley' /> .</p>
<h3>Call for help</h3>
<p>In my current situation I have way too much on my plate to try to develop an application for which there would be no immediate gain.  I need to work on my own business to try to get that going before I have to get a real job <img src='http://chase.ratchetsoftware.com/wp-includes/images/smilies/icon_wink.gif' alt=';)' class='wp-smiley' /> .  I also have a seemingly endless amount of Ph.D. work.  Despite this, I would be more than willing to contribute to a project that was working on this solution.  I would even be happy to set up a Rails application on this server with a Git repository if there were a couple of developers interested in working on it.  All I&#8217;d need to distract me from my other work is a little motivation from other people&#8217;s interest, so please let me know in the comments if this is something you&#8217;d like to use or have needed in the past.</p>
<p>Thanks for reading,</p>
<p>- Chase Gray</p>
]]></content:encoded>
			<wfw:commentRss>http://chase.ratchetsoftware.com/2008/11/url-archive-system-or-url-hacking-made-easy/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
