Overlap Category Archive
Salaries, Overlap, and the Perils of Phrase Searching
Summer's over, so it's time to start posting and updating SearchEngineShowdown again. I'll start off with yet another screencast of unique results found at only one or two search engines, but this search is also an example of the peril of relying too heavily on lengthy phrases for finding the best answers.
The story: I ran a quick search at Google in response to a question at the reference desk about starting salaries in engineering. I found a site with data from a 2005 NACE survey. Since the NACE site only makes its data available for a fee, I thought I'd try a phrase search for the most recent report. Unfortunately, "nace salary survey fall 2007" found nothing at Google. Yahoo! (and a few days later, Live) was the only search engine I tried that found some results, and those results gave me the information I was looking for. Yet the lesson from this example is more than just to be sure to check more than one search engine. There is also a lesson about relying too heavily on phrase searching.
Conflicting Overlap
I seem to be on a roll lately in finding Web pages that are not indexed by Google and are only found by one search engine. Yesterday I was exploring LibraryThing, a social networking and book cataloging site. I became curious as to how well the search engines covered the personal pages that people create in a social networking site like LibraryThing. So I took a look at two user profiles, grabbed a unique-looking phrase from the profile page and checked to see which, if any, of the main Web search engines could find it. Each page was found by only one search engine, but it was not the same one!
Page Found at 3 of 6: Not Google
While working on an upcoming presentation, I came across a Web page that I could not find on Google. The page has been at that URL, on a Canadian academic site, since at least 2003. I was rather surprised not to find it there, so I decided to try a quick overlap showdown. The comparison showed that it was indexed by Yahoo!, Gigablast, and Exalead, but it was not found by Google, Live, or Ask.
So here's another screencast showing the lack of overlap between search engines. This is a short screencast (01:30).
If the embedded version above does not work, try it direct at YouTube.
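For readers who like to keep a tally of these showdowns, here is a minimal sketch of how one check might be recorded and summarized. It only restates the result reported above (found at Yahoo!, Gigablast, and Exalead; not at Google, Live, or Ask). The `results` dictionary and the `summarize_overlap` helper are hypothetical names made up for illustration, and nothing here queries a live search engine.

```python
# Hypothetical record of one overlap check: engine name -> was the page found?
# The values restate the comparison described in the post; they are entered by hand.
results = {
    "Google": False,
    "Yahoo!": True,
    "Live": False,
    "Ask": False,
    "Gigablast": True,
    "Exalead": True,
}


def summarize_overlap(results):
    """Return (engines that found the page, engines that missed it, 'N of M' label)."""
    found = sorted(name for name, hit in results.items() if hit)
    missed = sorted(name for name, hit in results.items() if not hit)
    label = f"{len(found)} of {len(results)}"
    return found, missed, label


found, missed, label = summarize_overlap(results)
print(f"Found at {label}: {', '.join(found)}")
print(f"Not found at: {', '.join(missed)}")
```

Keeping one such record per search would make it easy to see, over time, how often a page turns up at only one or two of the engines checked.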
Overlap Showdown: Only at 1 of 6
Maybe it is just the type of searches I run, but today I had yet another example of the lack of overlap at the major search engines. I was searching for more information about someone for whom I had only an AIM screen name. Searching that screen name at Ask, Exalead, Gigablast, Google, Live, and Yahoo! (although not in that order), I found one page that actually had the information I wanted: the person's name. That one page was found by only one of the six search engines. All the rest found zero results.
HereUAre, Gigablast, 10 Billion, and Spam
Ever heard of HereUAre, which has "Over 10 billion pages indexed"? Try a search and you may recognize the results as coming from Gigablast. So what's the connection? This leads to a rather strange story of a vanished press release that I've been researching on and off for the past month or so. Here's the story.
In trying to update my site a while back, I came across one page that linked to a June 19, 2006 press release from Gigablast about a database size increase to 10 billion and a new "report as spam" feature. The linked page (beta.gigablast.com/prnew.html) was no longer live. I did find a cached copy of the page, from Sept. 10, 2006, only at MSN Search. (No cached copies were available on Oct. 8 at Google, Yahoo!, Ask, or the Wayback Machine.) Fortunately, when I came across it, I FURLed the MSN Search cached copy of the page. Since FURL saves a copy of the page, I have the text from the press release. I'm glad I do, since when I checked today I could not find a cached copy of or link to the page at Live or any of the other main search engines.
To summarize the release, Gigablast now has a database with over 10 billion pages, and this is where it calls it the "HereUAre search engine." It also mentions a beta (no longer available), "multi-language support, real-time indexing, and improved spam control." One part of the spam control is that at the end of each search result, Gigablast now has a link labeled "[report as spam]." Click on that link to report an entry as spam. The Gigablast site does not have the 10 billion claim on it, although it does continue to have the [report as spam] links. The HereUAre site does have the 10 billion claim and the spam reporting. It also makes it sound as if the search technology is its own, with no mention of Gigablast. I was also surprised that I found no mention of HereUAre, the Gigablast 10 billion, or the spam report at other search engine news sites. So, I'm posting what I've found out, and in the interest of sharing information, here is a copy of MSN's cached copy of the press release.
Why Search More than One?
Here is another example of a search I ran today where several search engines failed to give me the answer I needed. In particular, I was looking for a cached copy of a Web page, since the page was unavailable when I tried to view it. Three search engines had no record of the page, but fortunately, the last one I tried had the page indexed and a cached copy available for me to view. The winner? Live Search. The losers? Ask, Yahoo!, and Google.
Wikipedia as Source?
An interesting posting today first claimed that the U.S. Dept. of State shamelessly stole text from the Wikipedia:
"At this point some of you may ask just what the heck the US Dept. of State was doing, but let's take a moment to clear things up. First, it's obvious the Wikipedia page has been around for quite some time, and has evolved from that older state. . . . the US Dept. of State page doesn't even mention Wikipedia"
I find this posting fascinating in that some people assume that the Wikipedia is an old, established resource. Obviously, the author did not know that the State Department has been producing Background Notes for decades. Certainly, most librarians reading this will guess correctly that the Wikipedia grabbed the text from the State Department originally, and not vice versa.
The page also (somewhat) demonstrates how sometimes, the social, self-correcting nature of the Web can fix such mistakes. After its initial posting, the author added an update at the end and a "Read the update at the bottom, old article preserved for amusement potential only!" at the top. The update does note that
"Some people did some great digging and found a copy of the original US Dept. of State document. And guess what? It just barely predates the Wikipedia page."
But "just barely" still hints at a lack of understanding of the preceding print versions of the Background Notes.
Anyway, there is also an interesting Web search connection here. I first came across the page after the update had been added. I wanted to see an earlier version, but based on internal content, I could guess that the original had just been posted earlier today. In trying to find the older version, I knew it was too recent for the Wayback Machine. Instead, I checked for cached copies at the search engines. Yahoo! indeed had indexed it, and their cached copy was the earlier version. Out of curiosity I checked at Google, MSN, Ask, and Gigablast. None of the other search engines had yet indexed the page. Once again, the answer to my question was found by one search engine, and in this case, not by Google.
