Treat Google-Referred Users Specially
The content of my main web application, Myhealthcaresource, will contain more than 15,000 detailed financial reports for nursing facilities at its future peak. Each of these reports contains textual information as well as monetary values. It might list administrator names, employee names and salaries, owners, and products or services purchased. I wanted all of this information to be searchable on Google, without Google caching the pages. I also wanted users who arrived from a Google search for this information to see it without having to log in, but only the page they found through that search.
I’m going to keep this post short and only list the steps I took to solve these problems.
Problem 1: How to let Google index my data that requires a login
A prerequisite for this step is to submit sitemaps for your site’s content to Google Webmaster Tools. Once you have done this, at some point in the future Googlebot will attempt to traverse your page and index its contents (hopefully).
The content on my site requires a user to pay a subscription fee in order to have free rein to browse whatever they please. I’m not very concerned about any single piece of information getting leaked; instead, I’m protecting the resource the site provides as a whole. By letting Google index your site, you’re obviously opening up your content to outside viewers who are not logged in. How much you need to protect your content may differ from my needs.
What I want is for a user to be able to type their name into Google and find out that they are referenced in the data provided by Myhealthcaresource.com. So I need to somehow let Googlebot traverse my site, even though it requires a login from a normal user.
I’m using Restful Authentication and a role requirement system, so this is how I include this exception.
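class FacilitiesController < ApplicationController
  # The condition is passed as a string so that it is evaluated on every request
  # (this assumes the role_requirement plugin, which evals string conditions in
  # the controller); `request` does not exist while the class body is loading.
  require_role :basic, :only => :show,
               :unless => "request.env['HTTP_USER_AGENT'] =~ /Googlebot/"
end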
This could vary depending on your authentication system, but the part that will remain the same is the check if the user agent string contains ‘Googlebot’.
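For authentication systems that expect a method rather than an inline condition, the same user agent check can be pulled into a small helper. This is only a sketch: the name `googlebot_user_agent?` is made up for illustration and is not part of Rails or any plugin.

```ruby
# Hypothetical helper (not from Rails or role_requirement):
# true when the User-Agent header mentions Googlebot.
def googlebot_user_agent?(user_agent)
  # to_s guards against a missing header (nil)
  user_agent.to_s.include?('Googlebot')
end

googlebot_user_agent?('Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)') # true
googlebot_user_agent?(nil)                                                                        # false
```

In a controller it would be called with `request.env['HTTP_USER_AGENT']`, the same value the condition above inspects.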
That’s all it takes to allow Googlebot to traverse my controller that previously required a login. A potential problem is that users could fake their user agent to say ‘Googlebot’ and gain access to the site. This could be a problem for some sites, and we will keep an eye out for it in our logs and analytics software. We provide a free trial to all our users before they decide to pay, so users already get a free look at the data and there is little reason to sneak around the site to see it. We’re not concerned with this problem, but it could be an issue for others.
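For sites where a faked user agent would matter, one possible safeguard is to confirm the visitor's IP address with a reverse DNS lookup followed by a forward lookup. The sketch below is not something we run on Myhealthcaresource. It assumes that genuine Googlebot hosts resolve to names under googlebot.com or google.com, and the method name `verified_googlebot_ip?` and the injectable `resolver` parameter are made up for illustration.

```ruby
require 'resolv'

# Hypothetical helper: checks that an IP claiming to be Googlebot really
# belongs to Google. `resolver` only needs getname and getaddresses;
# Ruby's standard Resolv module provides both.
def verified_googlebot_ip?(ip, resolver = Resolv)
  host = resolver.getname(ip)                 # reverse DNS: IP -> hostname
  # Assumption: real crawler hosts end in .googlebot.com or .google.com
  return false unless host =~ /\.(googlebot|google)\.com\z/
  resolver.getaddresses(host).include?(ip)    # forward DNS must map back to the IP
rescue Resolv::ResolvError
  false                                       # no reverse record for this IP
end
```

Because each check costs two DNS lookups, the result would normally be cached per IP rather than computed on every request.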
Problem 2: How to prevent Google from caching private pages
Once Google starts indexing pages of your site that require a login for a normal user, it will also start caching the pages it visits. If you are worried about your internal content getting cached by Google, you will want to prevent Google and any other robots from caching it. This is simple and is accomplished by putting the following in the head section of your main layout file.
<meta name="robots" content="index,follow,noarchive" />
<meta name="googlebot" content="noarchive" />
When Googlebot reaches your page and sees these tags, it will not cache it. I have them in the Myhealthcaresource layout, and if you search for myhealthcaresource on Google you can see that the results have no cache links.
Problem 3: Allowing Users Referred From Google to View Content Without Logging In
When a user sees a link to myhealthcaresource in a set of Google search results, it most likely points to one of the detailed facility reports that I submitted in a sitemap. They probably searched for their company name or their own name and found that it was referenced in my site’s information. When someone clicks this link in Google, by default they would get redirected to a page asking them to log in. This would make almost every user referred from Google hit the back button immediately; I noticed this behavior during the period when my site functioned this way. These users have no idea what kind of information our site offers, and they might be potential customers, so we don’t want them to leave because we present a login form when they first arrive.
An alternative is to allow users referred to our site through a Google search to view the entire page. We decided to do just this. When they click the link in Google’s search results they are not asked to log in and they are shown the same page a logged in user would see. What happens when they try to browse the rest of the private content? They are asked to log in, but we already have their interest at this point and they are more likely to stay if they found something that was useful to them on the page they landed on.
So how do we accomplish this? In much the same way that we allowed Googlebot to traverse our pages, we can allow users who were referred from Google to access our page:
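class FacilitiesController < ApplicationController
  # The condition is passed as a string so that it is evaluated on every request
  # (this assumes the role_requirement plugin, which evals string conditions in
  # the controller); `request` does not exist while the class body is loading.
  require_role :basic, :only => :show,
               :unless => "request.referer =~ /google/ || " \
                          "request.env['HTTP_USER_AGENT'] =~ /Googlebot/"
end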
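The referrer check above matches any URL that contains "google" anywhere, including in a query string on some unrelated site. A stricter variant could parse the referrer and look only at its host name. The following is a rough sketch of that idea, not what the site uses; `google_referral?` is a made-up name, and the pattern for Google's country domains is a heuristic.

```ruby
require 'uri'

# Hypothetical helper: true only when the referrer's host is google.<tld>
# or a subdomain of it, e.g. www.google.com or www.google.co.uk.
def google_referral?(referer)
  host = URI.parse(referer.to_s).host
  return false if host.nil?                # empty or relative referrer
  # Heuristic: "google." followed by a TLD such as com, de, co.uk or com.au
  !!(host.downcase =~ /(\A|\.)google\.[a-z]{2,3}(\.[a-z]{2})?\z/)
rescue URI::InvalidURIError
  false                                    # unparseable referrers count as not Google
end

google_referral?('http://www.google.com/search?q=nursing+facility') # true
google_referral?('http://example.com/?q=google')                    # false
```

Like the user agent, the Referer header is supplied by the client and can be forged, so this narrows accidental matches rather than adding real security.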
Of course, the require_role line would vary slightly if you are using a different authentication scheme, but the core idea would be the same. The potential problem with this approach is that a malicious user could use Google to repeatedly access your site by crafting clever search queries. In our case, if a user is this intent on stealing the data, we weren’t likely to win them as a customer in the first place. We can also monitor this through analytics tools, and if it is abused we will look into an alternative solution. One option would be to limit the number of Google referrals a single IP address can make within a given time period.
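To make the idea of limiting referrals per IP concrete, here is a minimal in-memory sketch. We have not implemented this; the class name `ReferralLimiter` and its parameters are invented for illustration, and a real deployment with several server processes would need a shared store instead of a Ruby hash.

```ruby
# Hypothetical limiter: allows at most `limit` Google-referred page views
# per IP address within a sliding window of `window` seconds.
class ReferralLimiter
  def initialize(limit, window)
    @limit = limit
    @window = window
    @hits = Hash.new { |hash, ip| hash[ip] = [] }  # ip -> visit timestamps
  end

  # Records the visit and returns true while the IP is under the limit.
  def allow?(ip, now = Time.now)
    recent = @hits[ip]
    recent.reject! { |t| now - t >= @window }      # drop visits outside the window
    return false if recent.size >= @limit
    recent << now
    true
  end
end

limiter = ReferralLimiter.new(5, 3600)  # e.g. five free pages per hour
limiter.allow?('203.0.113.9')           # true for the first five visits in an hour
```

The access condition could then require both a Google referrer and `limiter.allow?(request.remote_ip)` before skipping the login.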
I hope that some of this work can be useful to people who are looking for ways to leverage Google for their dynamically generated login-only content. We see people come from Google searches every day, and now they stay much longer as a result of these changes.
In a future post, I will briefly talk about another method our site uses to send links to users that allow them to browse the site freely without logging in. After a certain time period these links expire and the users must sign up to browse the site further.







