Google's Search Results Protocols

Google has developed search results protocols for Google partners who want complete control over how they process and present Google search results.  This document describes two of those protocols: the XML results protocol and the protocol4 results protocol.

Format of a Google partner's search request

For Google partners using a Google search results protocol, a search request is a standard HTTP GET request.  Google recommends performing an HTTP Version 1.0 (or later) GET.  Google will notify the partner of which host and port it should send its GET requests to.  In general, a request should be for a URL something like the following:
/search?q=<REQ>&num=<NUM>&start=<START>&output=<FORMAT>&client=<PARTNER>
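where <REQ> is the URL-escaped search query; <NUM> is the number of results requested; <START> is the index of the first result requested; <FORMAT> is the name of the output format (e.g., xml or protocol4); and <PARTNER> is the partner's identification string.

As an example, suppose a Google partner has an agreement with Google to use Google's XML output results protocol to get 10 search results for web searches.  When one of the partner's end users enters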
chicken teriyaki
in his/her browser window to do a search, the partner's software might get its first page of chicken teriyaki results from Google by sending the following query to bar.google.com on port 80:
GET /search?q=chicken+teriyaki&start=0&output=xml&client=foo
This is only an example, of course; Google will inform each of its partners individually what machine and port number to use for queries (as well as the precise format for those queries).
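
The request format above can be assembled with standard URL-encoding routines.  The following Python sketch is illustrative only: the helper name build_search_path is hypothetical, and the parameter set simply mirrors the example request.  The precise format is whatever Google specifies to each partner.

```python
from urllib.parse import urlencode

def build_search_path(query, start, output, client, num=None):
    """Build the path and query string of a search request (hypothetical helper)."""
    params = [("q", query)]
    if num is not None:
        params.append(("num", num))
    params += [("start", start), ("output", output), ("client", client)]
    # urlencode URL-escapes every value; spaces become '+' symbols.
    return "/search?" + urlencode(params)

print("GET " + build_search_path("chicken teriyaki", 0, "xml", "foo"))
```

With the inputs shown, the printed request line matches the chicken teriyaki example above.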

Overview of Google's search results

In response to the requests described above, Google returns an HTTP response containing a page of results, which are presented as a list of items in a format described below.  Many of the items returned for each format are specified as being HTML.  The values of such items are ready to be put into HTML documents as they stand, as soon as any unescaping appropriate for the output format has been performed.

The MIME type (if any) returned by Google's webserver for results outputs may not be accurate, and should be ignored.

Using Google's output protocols

Since Google partners run on a tremendous variety of platforms, the exact way a partner makes use of Google's output protocols varies from partner to partner.  No matter what, however, a certain amount of development is required to make use of them.  At a very high level, the general control flow for Google partners' Google search is the following:
  1. An end user uses a browser to submit a search query to the partner's website via a form.
  2. The partner sends the query to Google.
  3. Google returns the search results to the partner.
  4. The partner parses Google's search results.
  5. The partner generates HTML on the fly which incorporates the parsed Google output.
  6. The partner returns the HTML to the end user's browser.
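
Steps 2 through 5 of this control flow might be sketched in Python as follows.  This is a minimal illustration under stated assumptions, not a reference implementation: handle_search is a hypothetical name, fetch is a hypothetical callable that performs the HTTP GET against whatever host and port Google assigns, and the XML output format (described later in this document) is assumed.

```python
from html import escape
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

def handle_search(query, fetch, client="foo"):
    """Query Google, parse the XML results, and build an HTML fragment."""
    # Step 2: build the request; `fetch` (hypothetical) sends it to Google.
    path = "/search?" + urlencode(
        [("q", query), ("start", 0), ("output", "xml"), ("client", client)])
    xml_text = fetch(path)           # Step 3: Google's response body
    root = ET.fromstring(xml_text)   # Step 4: parse (entities are unescaped here)
    items = []
    for result in root.iter("R"):    # Step 5: generate HTML on the fly
        url = result.findtext("U", default="")
        title = result.findtext("T")
        if title is None:
            title = escape(url)      # T is optional; fall back to the URL
        # The URL is escaped so that it is safe inside an attribute value.
        items.append('<li><a href="%s">%s</a></li>'
                     % (escape(url, quote=True), title))
    return "<ul>\n%s\n</ul>" % "\n".join(items)
```

Passing fetch in as a parameter keeps the sketch independent of any particular HTTP library or hosting platform.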
Here are a few of the many possible ways that a partner can use Google's output protocols:

Microsoft and ASP

A partner can write a COM object which fetches Google output results; parses them; and produces HTML to return.  ASP code can then invoke this COM object.

CGI and Perl

A partner can run a Perl script as a CGI on a webserver.  This script can take the search inputs submitted to it; make an appropriate query to Google; and produce HTML from the results it gets and return it.

Apache and C++

A partner can write an Apache module in C++ to fetch Google output results; parse them; and produce HTML from the results it gets and return it.

Upgrades and modifications to Google's output protocols

Minor upgrades

From time to time (potentially quite frequently), Google may upgrade its existing results output protocols to add in new features.  All Google results output protocol parsers should therefore be written in such a way that they function properly even when "unexpected information" is presented in Google's output; this way, partners' operations will remain unaffected by such protocol upgrades.  More information about how to ignore "unexpected information" will be presented specifically for each output protocol described here.

A minor upgrade of this sort will typically result in a results protocol's version's "minor" component being changed.  E.g., a protocol's version might change from "3.0" to "3.1" or from "4A.5" to "4A.6".
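
A partner that wants to log or react to version changes could compare version strings along the following lines.  This Python sketch assumes, based only on the examples above, that the "major" and "minor" components are separated by the first period; the function names are hypothetical.

```python
def split_version(version):
    """Split a version string such as "4A.6" into (major, minor) components."""
    major, _, minor = version.partition(".")
    return major, minor

def is_minor_upgrade(old, new):
    """True if only the minor component differs between two versions."""
    return old != new and split_version(old)[0] == split_version(new)[0]
```

For example, is_minor_upgrade("3.0", "3.1") is true, while a change from "3.1" to "4.0" is not treated as a minor upgrade.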

Despite the fact that results output protocol parsers should be written to handle minor upgrades to results output protocols completely transparently, Google will notify partners at least 1 business day in advance of making any such upgrades.  This way, partners have a chance to review their output protocol parsers to ensure that they correctly ignore any "unexpected information".

Major upgrades

Any upgrade to a results output protocol other than adding "unexpected information" to provide new features is considered a major upgrade.  Major upgrades are expected to be quite rare.  Because of the synchronization issues which such an upgrade causes between Google and Google's partners, major upgrades will be handled by creating a completely new output format.  E.g., instead of making a major upgrade to the "protocol4" output, Google will create a new "protocol4a" or "protocol5" output.

Major upgrades of a protocol may be followed by subsequent discontinuation (see below) of the original protocol.

Discontinuation

Rarely, Google will need to discontinue an output format (i.e., stop supporting it).  Since discontinuing an output format will obviously cause problems for any partner still using it, Google will give partners at least 60 days' advance notice before discontinuing any output protocol.  In addition, Google will of course send partners documentation on how to use the latest Google results output protocols.  Google partners will therefore have at least 60 days to modify their systems to deal with changes in Google output results protocols.

Exceptional searches

Google's GoogleScout ("related:") searches

In addition to performing "normal" web searches, Google can perform "related:" searches to find pages similar to a given page.

Given the way most partners will set up their Google search services, end users can perform "related:" searches by simply entering a query into the partner's search box.  If an end user enters the query:

related:<URL>
in his or her browser, the browser will URL-escape the string "related:<URL>" and send it to the partner.  The partner will then send that URL-escaped string to Google as the user's query, as usual.  Google will send back its results page (although if Google doesn't have related-page information for the page in question, the Google response will not have any actual results on it).

More typically, however, partners will cause "related:" searches to be done in a different way.  A partner might display a "related" link next to each search result displayed on a results page; clicking on it might cause the partner to send Google a "related:" query.  Note that Google should not be able to distinguish such a query from the query that results when an end user types a "related:" query; in particular, the query string sent to Google should be properly URL-escaped.  E.g., to have Google perform a "related:" search on the URL:

http://www.foo.com/frob.asp?A=bb&C=dd
the partner might send to Google the query
GET /search?q=related%3Ahttp%3A%2F%2Fwww.foo.com%2Ffrob.asp%3FA%3Dbb%26C%3Ddd
    &output=xml&client=foo
A "related:" query may fail to return any results.  Each of Google's output protocols has a way of indicating whether or not a particular search result has "related:" information available for it, and Google partners shouldn't display "related" links next to search results for which no "related:" information is available.
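
The URL-escaping of a "related:" query can be sketched in Python as follows.  The helper name build_related_path is hypothetical, and the parameter layout simply follows the example request above.

```python
from urllib.parse import quote

def build_related_path(url, output, client):
    """Build a "related:" request path for the given document URL."""
    # safe="" forces ':', '/', '?', '=' and '&' in the target URL to be
    # escaped, so they cannot be mistaken for parts of the request itself.
    q = quote("related:" + url, safe="")
    return "/search?q=%s&output=%s&client=%s" % (
        q, quote(output, safe=""), quote(client, safe=""))

print("GET " + build_related_path(
    "http://www.foo.com/frob.asp?A=bb&C=dd", "xml", "foo"))
```

For the URL shown, the escaped query string matches the one in the example request above.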

Google's "link:" searches

Google also has the ability to perform "link:" searches, which return pages which link to a given page.  These searches are similar to the "related:" searches described above.  Also, as with "related:" searches, each of Google's output protocols has a way of indicating whether or not a particular search result has "link:" information available for it.

Google's "cache:" queries

Google keeps the text of many of the documents it crawls available in a cache.  These cached documents can be accessed via "cache:" queries, which are somewhat (but not entirely) similar to "related:" queries and "link:" queries.

If desired, a partner can allow its end users to make use of Google's cache.  To do this, the page that a partner returns can have links which point directly to Google's cache.  To enable end users to use Google's cached text for, e.g., the document:

http://www.foo.com/frob.asp?A=bb&C=dd
the partner should create a hyperlink on its page to a "cache:" query for that document.  This hyperlink points to the same host to which the partner submits "normal" Google queries, and is similar in that the query begins with "/search" and has the "&client=<PARTNER>" in it (if the partner's "normal" Google queries begin with something other than "/search", that string should be used here, as well; similarly, the partner should substitute its own partner identification string for "foo" in "&client=foo").  However, it differs in that it is missing the "&output=<FORMAT>" portion present in "normal" Google queries.  This is because the cached content is independent of the partner's output format.

As with "related:" searches and "link:" searches, each of Google's output protocols has a way of indicating whether or not a particular search result has "cache:" information available for it.

Exceptional Search Results

Queries for results past what Google can return for a query

As mentioned earlier, Google cannot guarantee being able to return any particular number of results for a search.  If a query is submitted to Google requesting results past what Google can return for that search,  Google will perform one of two actions.  (Note that the partner cannot select which of these actions occurs.)
  1. Google may return a results page displaying results for the last page of available results for the query submitted.
  2. Google may return a results page indicating that there were no results for the query submitted (an error page, essentially).
If desired, a partner can easily test to determine whether or not the first of these actions has occurred for a query.  To test this, the partner simply needs to check if the result number of the first result Google returns matches the first result number that the partner requested.
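
The test described above might look like the following Python sketch.  It assumes that the start parameter is 0-based (as the start=0 example earlier suggests) while result numbers are 1-based; both the function name and that offset are assumptions to be checked against the format Google specifies to the partner.

```python
def classify_response(requested_start, first_result_number):
    """Classify a results page.

    requested_start: the value the partner sent in start= (assumed 0-based).
    first_result_number: the 1-based number of the first result returned,
        or None if the page contained no results.
    """
    if first_result_number is None:
        return "no_results"
    if first_result_number == requested_start + 1:
        return "requested_page"
    # Google answered with an earlier page than the one requested.
    return "last_available_page"
```

For instance, a request with start=30 that comes back with a first result numbered 11 would be classified as the last available page.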

If a partner creates results pages containing links to other results pages past the next page, then it will sometimes be the case that the destinations of some of these links actually will return no results.  For example, a partner's end user might perform a query which Google services, returning an estimate of 40 total results for the query to the partner.  The partner might reasonably then create a results page displaying results 1-10 and containing links to results page 2 (with results 11-20); results page 3 (with results 21-30); and results page 4 (with results 31-40).  However, it is possible that, despite Google's original estimate of 40 total results, only 20 results are actually available for the query.  This can happen because of Google's advanced filtering capabilities, which can weed out duplicate and other undesirable results.  In the example at hand, this means that results page 3 and results page 4 will actually not return any results.

Although this kind of unfortunate event need not be considered serious, some partners may wish to avoid it.  For such partners, Google suggests not displaying a full-fledged "navigation bar" on results pages.  Instead, Google suggests that these partners consider simply displaying "Previous" and "Next" links which go to the previous page and next page of results, respectively.  The results information that the partner receives in response to a Google query indicates whether it is appropriate for the partner to place a "Previous" link or a "Next" link on their own results page.  Because Google results are presented in decreasing order of quality, a typical user will be more likely to find the information he or she seeks on an earlier results page, instead of on a later results page (and the most likely page to find the information sought is the first results page).  Therefore, typical users won't be skipping ahead (e.g., skipping straight from the first results page to the fourth results page for a query), anyway.

Google's XML results protocol

Overview

When a search query specifies output=xml, the results for that query are returned in Google's XML format.  The DTD describing the XML grammar for Google's results page will be in the location indicated in the results page.  From time to time, Google's XML results format may be modified, resulting in a modification of the DTD.  However, as indicated earlier, such modifications will always be augmentations of Google's result format, and they will not cause any problems for a partner using a properly written XML parser, which should simply ignore elements which it is not expecting.

At the present time, the DTD describing Google's XML results has the following contents:

<!ELEMENT GSP (TM, Q, CT?, TT?, SC*, RES?)>
<!ATTLIST GSP VER CDATA #REQUIRED>
<!ELEMENT TM (#PCDATA)>
<!ELEMENT Q (#PCDATA)>
<!ELEMENT CT (#PCDATA)>
<!ELEMENT TT (#PCDATA)>
<!ELEMENT SC (#PCDATA)>
<!ELEMENT RES (M, NB?, R*)>
<!ATTLIST RES SN CDATA #REQUIRED
              EN CDATA #REQUIRED>
<!ELEMENT M (#PCDATA)>
<!ELEMENT NB (PU?, NU?)>
<!ELEMENT PU (#PCDATA)>
<!ELEMENT NU (#PCDATA)>
<!ELEMENT R (U, T?, RK, F*, S?, HAS)>
<!ATTLIST R N CDATA #REQUIRED
            L CDATA "1">
<!ELEMENT U (#PCDATA)>
<!ELEMENT T (#PCDATA)>
<!ELEMENT RK (#PCDATA)>
<!ELEMENT F  (#PCDATA)>
<!ELEMENT S  (#PCDATA)>
<!ELEMENT HAS (CI?, L?, C?, RT?)>
<!ELEMENT CI (RC, DT?, DS?)>
<!ELEMENT RC (#PCDATA)>
<!ELEMENT DT (#PCDATA)>
<!ELEMENT DS (#PCDATA)>
<!ELEMENT L EMPTY>
<!ATTLIST L TAG CDATA "link:">
<!ELEMENT C EMPTY>
<!ATTLIST C TAG CDATA "cache:"
            SZ  CDATA #REQUIRED>
<!ELEMENT RT EMPTY>
<!ATTLIST RT TAG CDATA "related:">

Values of elements; permitted characters in XML element values

The following characters in Google's elements' values will be escaped:
 
Character | Escaped form
< | either &lt; or &#60;
& | either &amp; or &#38;
> | either &gt; or &#62;
' | either &apos; or &#39;
" | either &quot; or &#34;

All other characters will be presented without modification.

In other words, partners should take the values for elements that Google sends and unescape only these five characters.  This unescaping should be performed only for elements which do not have element content; by performing this unescaping, the partner will recover the correct value of the element.  If the element is described as containing HTML, the newly unescaped string will be valid HTML suitable for displaying by inserting it into an HTML document.  If the element is described as containing a URL, the newly unescaped string will be a valid URL suitable for use as a link destination in a document (i.e., suitable for assigning to an href attribute); to have a browser actually display an element which is described as containing a URL, the newly unescaped string should be HTML-escaped.
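
For partners that do not use a full XML parser (which normally performs this unescaping itself), the unescaping rule can be sketched in Python as below.  The function names are hypothetical.  The substitution is done in a single pass so that text produced by one replacement is never unescaped a second time.

```python
import re
from html import escape

# The ten escaped forms listed in the table above.
_UNESCAPES = {
    "&lt;": "<", "&#60;": "<",
    "&amp;": "&", "&#38;": "&",
    "&gt;": ">", "&#62;": ">",
    "&apos;": "'", "&#39;": "'",
    "&quot;": '"', "&#34;": '"',
}
_ESCAPED = re.compile("|".join(re.escape(form) for form in _UNESCAPES))

def unescape_value(raw):
    """Undo the escaping of < & > ' " in one left-to-right pass."""
    return _ESCAPED.sub(lambda match: _UNESCAPES[match.group(0)], raw)

def url_for_display(raw_url):
    """Unescape a URL element, then HTML-escape it so it can be shown as text."""
    return escape(unescape_value(raw_url), quote=True)
```

The single pass matters for values that contain HTML: the string "1 &amp;lt; pi" unescapes to "1 &lt; pi", not to "1 < pi".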

The meanings of the tags in Google's XML results

The table below lists the various parts of Google's XML results.  As indicated in the DTD, all components of each element's value must be in the order indicated (although some components are optional, as indicated both in the DTD and in the table).

Further information and comments about the tags are listed below the table.
 

Tag Name | Meaning of Contents | Format | Attributes
GSP | The entire output from Google (GSP: "Google Search Protocol") | Contains a TM; a Q; an optional CT; an optional TT; any number of SC's; and an optional RES | VER
TM | Total search time in seconds | A floating-point number |
Q | The search query submitted, suitable for viewing | HTML |
CT | Search comments | HTML |
TT | Search tips | HTML |
SC | A directory category relevant to the search as a whole | A string (needs HTML-escaping to view; needs URL-escaping to put in a URL) |
RES | The search results themselves | Contains an M; an optional NB; and any number of R's | SN; EN
M | The estimated total number of results for the search | An integer |
NB | The search navigation bar | Contains an optional PU and an optional NU |
PU | Relative URL for the previous results page | [Relative] URL (needs HTML-escaping to view) |
NU | Relative URL for the next results page | [Relative] URL (needs HTML-escaping to view) |
R | A single search result | Contains a U; an optional T; an RK; any number of F's; an optional S; and a HAS | N; L
U | The URL of a single search result | [Absolute] URL (needs HTML-escaping to view) |
T | The title of a single search result | HTML |
RK | Google's rating of how good a single search result is | An integer in the range 0-10, inclusive |
F | Special-purpose field | Potentially anything |
S | A document snippet of a single search result | HTML |
HAS | Indicates what "special" features are available for this document | Contains an optional CI; an optional L; an optional C; and an optional RT |
CI | Directory category information for a single search result | Contains an RC; an optional DT; and an optional DS |
RC | A directory category for a single search result | A string (needs HTML-escaping to view; needs URL-escaping to put in a URL) |
DT | The title listed in the directory for a single search result | HTML |
DS | The summary listed in the directory of a single search result | HTML |
L | If present, indicates that Google has backlinks information for this document | (empty) |
C | If present, indicates that Google has this document in its cache | (empty) | SZ
RT | If present, indicates that Google has GoogleScout ("related:") information for this document | (empty) |

GSP

GSP has a VER attribute which indicates the version of the search results output format.  For Google's XML output format, this attribute will have a string value beginning with the character '3'.

If there are no appropriate search comments to put in the optional CT element, CT will not be present.  The same holds for the search tips in the optional TT element and the actual search results in the RES element.

Q

The search string sent to Google has been subjected to certain browser URL escapes (e.g., spaces are converted to '+' symbols).  These escapes are undone to create the value of the Q element.  So in a typical partner's usage of Google, if an end user types the value:
1 < pi
into a browser search box, the partner's server sees this value as:
1+%3C+pi
and therefore sends a query to Google like:
GET /search?q=1+%3C+pi
Google unescapes this query to see the end user's original query:
1 < pi
To make this string into something that can be put into an HTML document, relevant HTML characters are escaped.  In this case, only the '<' needs to be escaped, yielding:
1 &lt; pi
Finally, to create the value of the Q element, Google escapes all characters which need to be escaped for XML.  This produces the string:
1 &amp;lt; pi
This is the exact sequence of bytes that Google sends to the partner in between <Q> and </Q>.  The partner takes what it receives from Google and unescapes the characters that Google escaped,  yielding
1 &lt; pi
This text is a valid HTML snippet, and is ready to put into a document.

The above process is admittedly somewhat convoluted, but is really only presented in this amount of detail for completeness.  The only thing a partner needs to do to display Q is the same as for any other element containing HTML: unescape the characters that are escaped in the XML format, and then output the result as HTML.
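
The escaping chain described above can be reproduced in a few lines of Python.  This is only an illustration of the sequence of steps: the function name is hypothetical, and html.escape is used as a stand-in for whatever escaping routines Google actually applies.

```python
from html import escape
from urllib.parse import unquote_plus

def q_element_value(received_query):
    """Mimic the steps that turn a URL-escaped query into the bytes inside <Q>."""
    original = unquote_plus(received_query)       # undo the browser's URL escapes
    as_html = escape(original, quote=False)       # make it safe to show in HTML
    as_xml_value = escape(as_html, quote=False)   # escape again for XML transport
    return original, as_html, as_xml_value

print(q_element_value("1+%3C+pi"))
```

The three values returned for "1+%3C+pi" correspond to the three strings shown in the walkthrough above: the original query, its HTML form, and the form sent between <Q> and </Q>.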

CT

Search comments are query-specific messages such as:
"in" is a very common word and was ignored.  [details]

TT

Search tips are helpful messages which are potentially specific to the way the query was specified.  An example of a search tip from the Google website is:
Tip: in most browsers you can just hit the return key instead of clicking on the search button.
However, since Google doesn't know how its partners' customers have navigated to get their search results, this particular search tip is not relevant for its partners, and will not be returned in search results.  This example is only presented to indicate the general flavor intended for search tips.  At the present time, Google may not return any search tips.

SC

Google may associate one or more relevant directory categories with a query.  Each such category is returned as a string consisting of components separated by '/' characters; the first component in a category is the name of the directory itself.

To make an HTML-printable string from a category, the category should be HTML-escaped by substituting escaped values for each of the five characters

< & > ' "
This is the same process that should be applied to make any URL that Google returns into something viewable.

In addition to this HTML-escaping, partners may well want to make other modifications to category strings.  As a trivial example, a partner might substitute the string " &gt; " for each instance of the character '/' within the category string.

Category strings should be URL-escaped in some fashion before putting them into URLs.
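
Both treatments of a category string can be sketched in Python as follows.  The function names and the sample category are hypothetical; the " &gt; " separator is just the trivial example mentioned above.

```python
from html import escape
from urllib.parse import quote

def category_for_display(category):
    """HTML-escape a category, then show its components separated by ' > '."""
    # Escape first, so the separator inserted afterwards is not escaped again.
    return escape(category, quote=True).replace("/", " &gt; ")

def category_for_url(category):
    """URL-escape a category so that it can be used as a query parameter value."""
    return quote(category, safe="")

print(category_for_display("Top/Arts & Crafts"))
print(category_for_url("Top/Arts & Crafts"))
```

The order of operations in category_for_display is deliberate: replacing '/' before escaping would cause the '&' in the separator to be escaped a second time.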

RES

RES possesses SN and EN attributes which indicate the 1-based index of the first and last results on the results page.  E.g., for a results page containing the first 10 results for a query, SN="1" and EN="10".

M

Note that Google's estimate of the total number of hits for a search can be inaccurate in either direction (i.e., it can be too high or too low).  It can also exceed the number of results for a query that Google is actually willing to return.

NB, PU, and NU

The PU and NU elements contain URLs for the previous page and next page of search results for the current query, respectively.  The PU element will only be present when there is a previous results page (i.e., when the current results page is not the first results page), and the NU element will only be present when there is a next results page (i.e., when the current results page is not the last results page).  If neither a previous results page nor a next results page is available, the entire containing NB element will not be present.

R

An R holds the data for a single search result.  It has a required attribute, N, which holds the 1-based index of the result in the list of all results for the search.  E.g., the very first (and presumably very best) search result for a query has N="1"; the second search result for that query has N="2"; etc. The various R values in a results page will be listed in increasing order of their N attributes.

An R has another attribute, L, whose value indicates to what level it might be appropriate to indent that result if the partner wants to present results in a "clustered" format.  Google clusters its results by host so that multiple hits from the same host tend to appear together; the first hit from a given host has L="1", and later hits from the same host have L="2".  It is possible that Google will implement more sophisticated clustering in the future, so a parser should not assume that the only permissible values for L are "1" and "2".  However, it may be assumed that the value of L is a positive integer; its default value is "1", as indicated in the DTD.

F

F elements are reserved for future use by Google.  Unless you have a particular arrangement with Google to use them, your parser should ignore them.

S

S holds a query-dependent "snippet" for a result.  This snippet is the document summary that Google displays on the Google website, and it is intended to enable users to determine whether or not a particular document is relevant for their needs.  Snippets may contain arbitrary HTML, including <br> tags.

HAS

A single search result's HAS field can contain other elements, each of which indicates that some additional information or functionality is available through Google for that search result.

RC

An RC is analogous to a SC, except that it is a category associated with a single search result, instead of with a search query as a whole.

DT

If Google associates a particular directory category with a single search result, it may also associate a title from the directory with that result.  Note that this title is not necessarily related to the title in the document's T element, which (if present) comes from the actual document's HTML.

DS

If Google associates a particular directory category with a single search result, it may also associate a document summary from the directory with that result.  Note that this summary is not necessarily related to the query-dependent snippet in the document's S element.

L; C; and RT

When present, these elements indicate that Google has backlink information; cached text; or GoogleScout ("related:") information for a document, respectively.  If one of these elements is not present, the corresponding functionality is not available for the document.

The C element also has a mandatory attribute, SZ, which holds the size of Google's cached content for the document.  A typical value for the SZ attribute might be the string "8k".

Google's protocol4 results protocol

Overview

When a search query specifies output=protocol4, the results for that query are returned in Google's protocol4 format.  The protocol4 output format consists of a list of items in the format described below.  Each item in the output is a name-length-value triplet; the three pieces of the triplet are separated by colons (':').  The order of triplets in protocol4 output must be as indicated below.  Note that many of the triplets in the protocol4 format are optional, and may not be present.  Any triplet which is optional is marked as such below.

Between triplets, arbitrary quantities of whitespace may be present in Google's protocol4 output (primarily for the sake of legibility).  For example, a terminating newline character is likely to be appended after each triplet (although partners' protocol4 parsers should not require this).  In addition, arbitrary amounts of whitespace may precede or follow the entire collection of triplets (although, once more, partners' protocol4 parsers should not require any particular amount of whitespace in these places).

As indicated earlier, protocol4 may be modified from time to time.  Such modifications will always be augmentations: either new triplets will be added to the output format, or previously optional triplets will become mandatory.  Therefore, a protocol4 parser should be written so that it ignores any unexpected triplets.  In this way, it will not be affected by any protocol4 modifications.
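
A tolerant protocol4 reader might be structured like the Python sketch below.  It rests on assumptions that this document does not spell out and that should be confirmed with Google: that the length field is a decimal count of the bytes in the value, and that names never contain a colon.  The function name and the sample data are invented for illustration.

```python
def parse_triplets(data):
    """Parse protocol4 bytes into a list of (name, value) pairs.

    ASSUMPTION: each triplet is name:length:value, where length is the
    decimal number of bytes in value.
    """
    triplets = []
    pos, end = 0, len(data)
    while True:
        # Skip any whitespace before, between, or after triplets.
        while pos < end and data[pos:pos + 1].isspace():
            pos += 1
        if pos >= end:
            return triplets
        name_end = data.index(b":", pos)
        length_end = data.index(b":", name_end + 1)
        length = int(data[name_end + 1:length_end])
        value_start = length_end + 1
        value = data[value_start:value_start + length]
        if len(value) != length:
            raise ValueError("truncated protocol4 value")
        triplets.append((data[pos:name_end].decode("ascii"), value))
        # The length, not a delimiter, decides where the value ends, so a
        # value may itself contain colons or newlines.
        pos = value_start + length

# Hypothetical sample; Future_1 stands in for an unexpected triplet.
SAMPLE = b"GSPVersion:3:4.0\nTime:8:0.123456\n  Search:16:chicken teriyaki\nFuture_1:3:a:b\n"
KNOWN = {"GSPVersion", "Time", "Search"}
page = {name: value for name, value in parse_triplets(SAMPLE) if name in KNOWN}
```

Filtering on a set of known names is one simple way to ignore unexpected triplets: Future_1 is parsed (so that the reader stays in step with the stream) but never reaches the page dictionary.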

General information about the results page

The first set of triplets in the search results contains information about the results page itself (as opposed to containing information about a particular result on the page). These triplets occur in the order specified in the table below, although some are optional and may therefore not always be present.  They are:
 
Name | Meaning | Format of value | Comments
GSPVersion | The version number of the output format | A string beginning with '4' | Like the VER attribute of GSP in XML.  The current protocol version is 4.0
Time | The number of seconds the query took | A floating-point number | Like the TM element in XML
Search | The query that Google searched on | HTML | Like the Q element in XML
Comments | Search comments | HTML | Like the CT element in XML.  Optional
Tips | Search tips | HTML | Like the TT element in XML.  Optional
SearchCat_<i> | A directory category relevant to the search as a whole | A string (needs HTML-escaping to view; needs URL-escaping to put in a URL) | Like the SC element in XML.  Optional; any number of these may be present
Results | The (1-based) range of results displayed on this page | Two integers, separated by a hyphen | Holds the same information as in the SN and EN attributes of RES in XML.  If there are no results, this triplet will not be present.  Optional
Matches | Google's estimate of the total number of hits it has for the query | A single integer | Like the M element in XML.  If there are no results, this triplet will not be present.  Optional
BackURL | Relative URL for the previous results page | [Relative] URL (needs HTML-escaping to view) | Like the PU element in XML.  If there is no previous results page, this triplet will not be present.  Optional
NextURL | Relative URL for the next results page | [Relative] URL (needs HTML-escaping to view) | Like the NU element in XML.  If there is no next results page, this triplet will not be present.  Optional

Note that any number (including zero) of SearchCat_<i> triplets may be present.  The first one is SearchCat_1, the second one is SearchCat_2, etc.

Information about the specific "hits" for the query

After all the data described above, information about the actual documents returned for the query is supplied.  Each result returned has an index which indicates its position in the list of results; the lower the index for a result, the better Google considers that result to be.  For each result on the current page, a list of triplets is returned.  First, all the triplets for the first result (the result with the lowest index) on the page are returned; then, all the triplets for the second result (the result with the next lowest index) on the page are returned; etc.  The triplets for a single particular result occur in the order specified in the table below, although some are optional and may therefore not always be present.  In addition, it is possible for no results to be returned, in which case none of the triplets listed below will be present.

For result #<i>, the following triplets occur in the output results:
 
Name | Meaning | Format of value | Comments
Level_<i> | The "level" at which this result should be displayed | A positive integer | Like the L attribute of R in XML
URL_<i> | The URL of a single search result | [Absolute] URL (needs HTML-escaping to view) | Like the U element in XML
Title_<i> | The title of a single search result | HTML | Like the T element in XML.  Optional
Rank_<i> | Google's rating of how good a single search result is | An integer in the range 0-10, inclusive | Like the RK element in XML
Summary_<i> | A document snippet of a single search result | HTML | Like the S element in XML.  Optional
Cat_<i> | A directory category for a single search result | A string (needs HTML-escaping to view; needs URL-escaping to put in a URL) | Like the RC element in XML.  Optional
DirTitle_<i> | The title listed in the directory for a single search result | HTML | Like the DT element in XML.  Optional
DirSummary_<i> | The summary listed in the directory for a single search result | HTML | Like the DS element in XML.  Optional
Link_<i> | Indicates that Google has backlinks information for this document | (empty string) | Conveys the same information as the L element in XML.  If Google has no backlinks information for this document, this triplet will not be present.  Optional
CacheSize_<i> | The approximate size of Google's cached copy of this document | An integral number of kilobytes, such as "8k" | Holds the same information as the SZ attribute of C in XML.  If Google has no cached copy of this document, this triplet will not be present.  Optional
Related_<i> | Indicates that Google has GoogleScout ("related:") information for this document | (empty string) | Conveys the same information as the RT element in XML.  If Google has no GoogleScout ("related:") information for this document, this triplet will not be present.  Optional

A given result can only have a DirTitle_<i> triplet or a DirSummary_<i> triplet if it has a Cat_<i> triplet, as well.  However, the converse does not hold.  Also, a given result can have a DirTitle_<i> without having a DirSummary_<i>, and vice versa.
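
Once the triplets have been read, the per-result ones can be collected by their <i> suffix.  The Python sketch below takes already-parsed (name, value) pairs as input; the function name and the sample data are hypothetical, and names outside the table above are skipped so that future additions are ignored.

```python
# Per-result triplet names from the table above (without the _<i> suffix).
RESULT_FIELDS = {"Level", "URL", "Title", "Rank", "Summary", "Cat",
                 "DirTitle", "DirSummary", "Link", "CacheSize", "Related"}

def group_results(triplets):
    """Group (name, value) pairs such as ("URL_3", ...) by result index."""
    results = {}
    for name, value in triplets:
        base, sep, index = name.rpartition("_")
        if not sep or not index.isdecimal() or base not in RESULT_FIELDS:
            continue  # page-level (e.g. Time, SearchCat_1) or unexpected triplet
        results.setdefault(int(index), {})[base] = value
    return results

sample = [("GSPVersion", "4.0"), ("SearchCat_1", "Top/Food"),
          ("Level_1", "1"), ("URL_1", "http://www.foo.com/"), ("Rank_1", "7"),
          ("Related_1", ""),
          ("Level_2", "2"), ("URL_2", "http://www.foo.com/a"), ("Rank_2", "5"),
          ("Shiny_2", "x")]
grouped = group_results(sample)
```

A flag-style triplet such as Related_<i> carries an empty value, so its presence should be tested with "Related" in grouped[1] and not by the truthiness of its value.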

Limitations of this document and on the use of Google's results protocols

This document is solely a technical description of Google's results protocols. Being in possession of the information contained herein does not entitle you to use any of these protocols to send queries to Google; the only way to become so entitled is by making an appropriate agreement with Google.

Some agreements with Google may permit only the use of part of Google's results protocols. For example, a partner might have an agreement with Google to perform searches using Google's results protocols, but might nevertheless not be entitled to make use of Google's GoogleScout feature, despite the fact that an interface to it exists in Google's results protocols.  Or a partner might have an agreement with Google to perform searches-- including GoogleScout searches-- using Google's results protocols, but might nevertheless not be entitled to make use of Google's directory and category information.

The contents of this document are confidential and proprietary to Google.

©2000 Google Inc.