Proposed Search Engine(s)
Enhancement

(2007 Jan 14 blog post)

A couple of years ago (2005 Mar), I tried to propose to Google a major enhancement to their search engine. I got an automated reply --- essentially a non-reply.

The image above indicates the suggestion --- a maximum distance apart, in words, that the user can specify for the search words. Many web pages are huge and contain sections on many different topics. If implemented as outlined further below, this feature would drastically reduce the number of useless 'hits' on large pages [such as Google blogspot.com pages] in most of my web searches.

I found what I thought was an appropriate email address --- suggestions@google.com. But their reply said to "register" at a Google "posting" web site and submit the suggestion there. Interesting --- the email address suggestions@google.com does not accept suggestions. As Spock would say, it is not logical.

    I did not have the time or energy to go through their registration dance to post the suggestion. The dance: Get a userid and password ... and try to remember where I hid the information (so that I can follow up on responses to the posting), as I go through computer and mail system migrations ... along with potentially 50 other registrations, if I responded to every such command to "register". So I let the suggestion to Google go, for the time being.

I am still, years later, just as frustrated by the massive number of non-pertinent pages that I get --- on doing almost any multi-word search, with any search engine.

So I am posting the suggestion openly now --- hoping that ANY search engine organization will take up this challenge. Are you listening AltaVista, A9, AOL, Ask.com, Clusty, Exalead, Gigablast, Google, Lycos, MSN, WiseNut, Yahoo, and others? Readers, please alert them.

I plan to periodically mail this suggestion to Google and others. Hence I am formatting this page to support printout with appropriate page breaks and other formatting.

    [Actually, there have been a couple of attempts at implementing an enhancement like this. But one was done by an essentially one-person search-engine development operation in the Netherlands --- walhello.com (Web+valhalla+hello). They/he had neither a very big database of web documents to search nor the huge server farm of an organization like Google.

    The other attempt was limited to two options --- a fixed word span of 16 words, OR no limit on word span (the current, lamentable state of affairs). This (preliminary?) attempt is by a major search engine organization in France, exalead.com.

    With Exalead, you can use the word NEAR between words in a search query --- to do a "proximity search". "The NEAR operator finds documents where the query terms are within 16 words of each other."

    Note that the French and Dutch are not willing to resign themselves to using Google for all their searches. They know they can do better.

    Hopefully, these two, and other searcher development organizations, are still working on this feature.]
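To make the difference between a fixed 16-word NEAR and a user-chosen distance concrete, here is a minimal Python sketch. It is only an illustration, not how Exalead or any other search engine actually works. The function name within_distance, the sample page, and the definition of "distance" as the difference between word positions are all assumptions made for this example.

```python
import re

def within_distance(text, term_a, term_b, max_gap):
    """Return True if term_a and term_b occur within max_gap words of each other.

    Assumption: 'distance' is the difference between word positions,
    counting words from the start of the text.
    """
    words = re.findall(r"\w+", text.lower())
    positions_a = [i for i, w in enumerate(words) if w == term_a.lower()]
    positions_b = [i for i, w in enumerate(words) if w == term_b.lower()]
    return any(abs(i - j) <= max_gap for i in positions_a for j in positions_b)

# Hypothetical page: the two search words are 21 word positions apart.
page = "apple " + "filler " * 20 + "banana"

print(within_distance(page, "apple", "banana", 16))  # fixed NEAR-style window
print(within_distance(page, "apple", "banana", 30))  # user-chosen window
```

With a fixed window of 16 the sample page is rejected; with a user-specified window of 30 it is accepted. That choice is exactly what the suggestion would hand to the user.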

The image at the top of this page (for a hypothetical search engine called Hoogle) gives the gist of the suggestion in a readily assimilable visual form.

To give some details of the suggestion, here is the text of the original proposal that I e-mailed to suggestions@google.com on March 13, 2005.

Subject: Suggestion for search feature to blow competitors away [2005 Mar]

Dear Google Developers,

In doing searches on multiple keywords, I am continually getting many pages that do not apply --- because they are long pages (like pages with hundreds of mail responses, or a lot of information on many different subjects).

Suggestion:
If there were a user-option to allow a Google user to say that they want the two (or more) keywords to be within, say, 30 words of each other, Google users could eliminate hundreds or thousands of 'false positives'.

Implementation:
It seems that when storing keywords, with each keyword, the Google data gathering engine(s) could store an integer that represents the location of the word in the page. Then the 'distance' between two keywords could be determined by subtracting the two integers stored with the two keywords.

Data gathering (word location) considerations:
I realize that the format of some pages may render the meaning of the integer-location rather meaningless as a measure of distance-between-words --- BUT for the vast majority of web pages, the integer would be useful to determine distance between words --- even if the integer were simply a count of the word-location in the sequence of text and HTML tags in a web page (i.e. treat the HTML source as plain text and simply count the keyword-location-integer using that approach).

Storage overhead:
The storage of the integer-location of keywords could be very compact, say a 4-byte binary integer, which would allow for assigning word-location-integers in web pages about 4 billion words long. This should accommodate essentially any page that anyone would want to look at.

Although the 4-bytes for each keyword might increase the size of Google database(s) by about 20%, the pay-back would be well worth it.

Cheers, a constant Google user (still looking for a better search engine)
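The "Implementation" and "Storage overhead" paragraphs of the e-mail above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not a description of how Google or any real search engine stores its data. The names build_index and search_within, the sample pages, and the in-memory dictionary layout are all hypothetical.

```python
import re
import struct
from collections import defaultdict

def build_index(pages):
    """Map each word to {page_id: [word positions]}.

    Each position is the integer word-location described in the proposal.
    """
    index = defaultdict(lambda: defaultdict(list))
    for page_id, text in pages.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word][page_id].append(position)
    return index

def search_within(index, term_a, term_b, max_gap):
    """Return ids of pages where both terms occur within max_gap words."""
    postings_a = index.get(term_a.lower(), {})
    postings_b = index.get(term_b.lower(), {})
    hits = []
    for page_id in postings_a.keys() & postings_b.keys():
        # 'Distance' is found by subtracting the stored integers.
        if any(abs(i - j) <= max_gap
               for i in postings_a[page_id] for j in postings_b[page_id]):
            hits.append(page_id)
    return sorted(hits)

# Hypothetical pages: one short and on-topic, one long and rambling.
pages = {
    "short": "tuning a diesel engine",
    "long": "diesel prices rose " + "blah " * 50 + "engine of growth",
}
index = build_index(pages)

print(search_within(index, "diesel", "engine", 30))   # only the short page
print(search_within(index, "diesel", "engine", 100))  # both pages

# Storage check: an unsigned 4-byte integer holds positions up to 2**32 - 1.
packed = struct.pack("<I", 53)
print(len(packed), 2**32)
```

With a limit of 30 words only the short page matches, because in the long page the two keywords are 53 positions apart. The last lines confirm that one position fits in 4 bytes and that 4 bytes allow 4,294,967,296 distinct positions, the "about 4 billion" mentioned in the e-mail.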


2013 UPDATE :

I recently (2013 April) bought a book called "Nine Algorithms That Changed the Future" by John MacCormick. That book points out, in the first chapter, on web search algorithms, that the position of words within web pages IS SAVED and accessible to search engines like Google. So there is no reason why they could not provide the facility suggested here --- if not on the main search page, then via the 'Advanced Search' link.

That chapter even points out that search engines like Google use the 'near' capability very heavily for their own purposes. Why they do not make that ability available to users is puzzling --- especially when it could cut searches that return millions of pages down to thousands. A situation devoutly to be wished --- especially as the databases of web pages explode in size.

Here is a page of web searcher sites for reference.


Posted 2007 Jan 14.
Some sentence additions and re-wording were done on 2007 Jan 21.
Added page breaks for better printout, and minor additions, 2009 Aug 10.
Minor format changes 2013 Apr 18.