To encourage webmasters to create sites and content in accessible ways, each of the major search engines has built support and guidance-focused services. Each provides varying levels of value to search marketers, but all of them are worth understanding. These tools provide data points and opportunities for exchanging information with the engines that are not available anywhere else.
The sections below explain the common interactive elements that each of the major search engines supports and identify why they are useful. There are enough details on each of these elements to warrant their own articles, but for the purposes of this guide, only the most crucial and valuable components will be discussed.
Sitemaps are a tool that enables you to give hints to the search engines about how they can crawl your website. You can read the full details of the protocol at Sitemaps.org. In addition, you can build your own sitemaps at XML-Sitemaps.com. Sitemaps come in three varieties:
Extensible Markup Language (Recommended Format)
- This is the most widely accepted format for sitemaps. It is extremely easy for search engines to parse and can be produced by a plethora of sitemap generators. Additionally, it allows for the most granular control of page parameters.
- Relatively large file sizes. Since XML requires an open tag and a close tag around each element, file sizes can get very large.
Really Simple Syndication or Rich Site Summary
- Easy to maintain. RSS sitemaps can easily be coded to automatically update when new content is added.
- Harder to manage. Although RSS is a dialect of XML, it is actually much harder to manage due to its updating properties.
Text File
- Extremely easy. The text sitemap format is one URL per line, up to 50,000 lines.
- Does not provide the ability to add meta data to pages.
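Since the XML variety is both the recommended format and the easiest to produce programmatically, here is a minimal sketch of generating one with Python's standard library (the URLs are placeholders):

```python
from xml.etree import ElementTree as ET

# Minimal sketch of an XML sitemap per the sitemaps.org protocol.
# Only the required <loc> element is emitted; optional elements like
# <lastmod> and <priority> could be added the same way.
def build_sitemap(urls):
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for page in urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = page
    return ET.tostring(urlset, encoding="unicode")

xml = build_sitemap(["http://www.example.com/", "http://www.example.com/about"])
print(xml)
```

In practice, most sites use one of the many sitemap generators mentioned above rather than hand-rolled code, but the output is this simple.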
The robots.txt file (a product of the Robots Exclusion Protocol) should be stored in a website’s root directory (e.g., http://www.google.com/robots.txt). The file serves as an access guide for automated visitors (web robots). By using robots.txt, webmasters can indicate which areas of a site they would like to disallow bots from crawling, as well as indicate the locations of sitemap files (discussed above) and crawl-delay parameters. You can read more details about this at the robots.txt Knowledge Center page.
The following commands are available:
Disallow – Prevents compliant robots from accessing specific pages or folders.
Sitemap – Indicates the location of a website’s sitemap or sitemaps.
Crawl-delay – Indicates the delay (in seconds) that a robot should wait between requests to a server.
|An Example of Robots.txt

#Robots.txt www.example.com/robots.txt
User-agent: *
Disallow:

# Don’t allow spambot to crawl any pages
User-agent: spambot
Disallow: /

Sitemap: www.example.com/sitemap.xml
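Rules like these can be checked with Python's standard-library robots.txt parser; a small sketch using the example rules above (the URLs are placeholders):

```python
from urllib import robotparser

rp = robotparser.RobotFileParser()
# In practice you would call rp.set_url("http://www.example.com/robots.txt")
# followed by rp.read(); here the file's contents are supplied inline.
rp.parse("""
User-agent: *
Disallow:

User-agent: spambot
Disallow: /
""".splitlines())

print(rp.can_fetch("spambot", "http://www.example.com/page.html"))    # False
print(rp.can_fetch("Googlebot", "http://www.example.com/page.html"))  # True
```

Note that this only tells you what a compliant robot would do; as the warning below explains, not every robot complies.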
Warning: It is very important to realize that not all web robots follow robots.txt. People with bad intentions (e.g., e-mail address scrapers) build bots that don’t follow this protocol and, in extreme cases, can use it to identify the location of private information. For this reason, it is recommended that the location of administration sections and other private sections of publicly accessible websites not be included in the robots.txt. Instead, these pages can utilize the meta robots tag (discussed next) to keep the major search engines from indexing their high risk content.
The meta robots tag creates page-level instructions for search engine bots.
The meta robots tag should be included in the head section of the HTML document.
|An Example of Meta Robots
<title>The Best Webpage on the Internet</title>
<meta name="ROBOT NAME" content="ARGUMENTS" />
In the example above, “ROBOT NAME” is the user-agent of a specific web robot (e.g., Googlebot) or an asterisk to identify all robots, and “ARGUMENTS” is one of the accepted directives (e.g., index/noindex, follow/nofollow).
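As a sketch of how a crawler might read this tag (the class name and sample page below are hypothetical), Python's standard-library HTML parser can pull out the directives:

```python
from html.parser import HTMLParser

class MetaRobotsParser(HTMLParser):
    """Collects the directives from any <meta name="robots"> tag."""
    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and attrs.get("name", "").lower() == "robots":
            content = attrs.get("content", "")
            self.directives += [d.strip().lower() for d in content.split(",")]

page = '<html><head><meta name="robots" content="noindex, nofollow" /></head></html>'
parser = MetaRobotsParser()
parser.feed(page)
print(parser.directives)  # ['noindex', 'nofollow']
```

A real crawler also matches its own user-agent against robot-specific tags (e.g., name="googlebot"), but the page-level mechanism is exactly this simple.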
The rel=nofollow attribute creates link-level instructions for search engine bots, suggesting how the given link should be treated. While the search engines claim not to follow nofollowed links, tests show they actually do follow them for discovering new pages. These links certainly pass less juice (and in most cases no juice) than their non-nofollowed counterparts and, as such, followed links are still recommended for SEO purposes.
|An Example of nofollow
<a href="http://www.example.com" title="Example" rel="nofollow">Example Link</a>
In the example above, the value of the link would not be passed to example.com as the rel=nofollow attribute has been added.
Geographic Target – If a given site targets users in a particular location, webmasters can provide Google with information that will help determine how that site appears in its country-specific search results and also improve Google search results for geographic queries.
Preferred Domain – The preferred domain is the one that a webmaster would like used to index their site’s pages. If a webmaster specifies a preferred domain as http://www.example.com and Google finds a link to that site that is formatted as http://example.com, Google will treat that link as if it were pointing at http://www.example.com.
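A rough sketch of the normalization Google describes — treating the bare domain as equivalent to the preferred www version — might look like this (the function name and hosts are illustrative, not Google's implementation):

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical sketch: rewrite links on the non-preferred host so they
# count toward the preferred (www) domain, as the setting intends.
def to_preferred(url, preferred_host="www.example.com"):
    parts = urlsplit(url)
    bare_host = preferred_host.replace("www.", "", 1)
    if parts.netloc in (bare_host, preferred_host):
        parts = parts._replace(netloc=preferred_host)
    return urlunsplit(parts)

print(to_preferred("http://example.com/page"))  # http://www.example.com/page
print(to_preferred("http://other.com/page"))    # http://other.com/page (untouched)
```

The same consolidation can also be accomplished on the site itself with 301 redirects from one host to the other, which works for every engine rather than Google alone.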
Image Search – If a webmaster chooses to opt in to enhanced image search, Google may use tools such as Google Image Labeler to associate the images included in their site with labels that will improve indexing and search quality of those images.
Crawl Rate – The crawl rate affects the speed of Googlebot’s requests during the crawl process. It has no effect on how often Googlebot crawls a given site. Google determines the recommended rate based on the number of pages on a website.
Web Crawl – Web Crawl identifies problems Googlebot encounters when it crawls a given website. Specifically, it lists Sitemap errors, HTTP errors, nofollowed URLs, URLs restricted by robots.txt and URLs that time out.
Mobile Crawl – Identifies problems with mobile versions of websites.
Content Analysis – This analysis identifies search engine unfriendly HTML elements. Specifically, it lists meta description issues, title tag issues and non-indexable content issues.
Statistics – These statistics are a window into how Google sees a given website. Specifically, they identify top search queries, crawl stats, subscriber stats, “What Googlebot sees” and index stats.
Link Data – This section provides details on links. Specifically, it outlines external links, internal links and sitelinks. Sitelinks are section links that sometimes appear under websites when they are especially applicable to a given query.
Sitemaps – This is the interface for submitting and managing sitemaps directly with Google.
Statistics – These statistics are very basic and include data like the title tag of a homepage and number of indexed pages for the given site.
Feeds – This interface provides a way to directly submit feeds to Yahoo! for inclusion into its index. This is mostly useful for websites with frequently updated blogs.
Actions – This simplistic interface allows webmasters to delete URLs from Yahoo!’s index and to specify dynamic URLs. The latter is especially important because Yahoo! has traditionally had a lot of difficulty differentiating dynamic URLs.
Profile – This interface provides a way for webmasters to specify the location of sitemaps and a form to provide contact information so Bing can contact them if it encounters problems while crawling their website.
Crawl Issues – This helpful section identifies HTTP status code errors, robots.txt problems, long dynamic URLs, unsupported content types and, most importantly, pages infected with malware.
Backlinks – This section allows webmasters to find out which webpages (including their own) are linking to a given website.
Outbound Links – Similar to the aforementioned section, this interface allows webmasters to view all outbound links on a given webpage.
Keywords – This section allows webmasters to discover which of their webpages are deemed relevant to specific queries.
Sitemaps – This is the interface for submitting and managing sitemaps directly to Microsoft.
While not run by the search engines, SEOmoz’s Open Site Explorer does provide similar data.
Identify Powerful Links – Open Site Explorer sorts all of your inbound links by metrics that help you determine which links are most important.
Find the Strongest Linking Domains – This tool shows you the strongest domains linking to your domain.
Analyze Link Anchor Text Distribution – Open Site Explorer shows you the distribution of the text people used when linking to you.
Head to Head Comparison View – This feature allows you to compare two websites to see why one is outranking the other.
It is a relatively recent development that search engines have started to provide tools that allow webmasters to interact with their search results. This is a big step forward in SEO and in the webmaster/search engine relationship. That said, the engines can only go so far in helping webmasters. It is true today, and will likely be true in the future, that the ultimate responsibility for SEO rests with marketers and webmasters. It is for this reason that learning SEO is so important.
Myths and Misconceptions of SEO
Unfortunately, over the past 12 years, a great number of misconceptions have emerged about how the search engines operate and what’s required to perform effectively. In this section, we’ll cover the most common of these, and explain the real story behind the myths.
In classical SEO times (the late 1990’s), search engines had “submission” forms that were part of the optimization process. Webmasters & site owners would tag their sites & pages with information (this would sometimes even include the keywords they wanted to rank for), and “submit” them to the engines, after which a bot would crawl and include those resources in their index. For obvious reasons (manipulation, reliance on submitters, etc.), this practice was unscalable and eventually gave way to purely crawl-based engines. Since 2001, search engine submission has not only not been required, but is actually virtually useless. The engines have all publicly noted that they rarely use the “submission” URL lists, and that the best practice is to earn links from other sites, as this will expose the engines to your content naturally.
You can still see submission pages (for Yahoo!, Google, Bing), but these are remnants of time long past, and are essentially useless to the practice of modern SEO. If you hear a pitch from an SEO offering “search engine submission” services, run, don’t walk to a real SEO. Even if the engines did use the submission service to crawl your site, you’d be very unlikely to earn enough “link juice” to be included in their indices or rank competitively for search queries.
Once upon a time, much like search engine submission, meta tags (in particular, the meta keywords tag) were an important part of the SEO process. You would include the keywords you wanted your site to rank for and when users typed in those terms, your page could come up in a query. This process was quickly spammed to death, and today, only Yahoo! among the major engines will even index content from the meta keywords tag, and even they claim not to use those terms for ranking, but merely content discovery.
It is true that other meta tags, namely the title tag and meta description tag (which we’ve covered previously in this guide), are of critical importance to SEO best practices. And, certainly, the meta robots tag is an important tool for controlling spider access. However, SEO is not “all about meta tags”, at least, not anymore.
Not surprisingly, a persistent myth in SEO revolves around the concept that keyword density – a mathematical formula that divides the number of instances of a given keyword by the number of words on a page – is used by the search engines for relevancy & ranking calculations and should therefore be a focus of SEO efforts. Despite being proven untrue time and again, this farce has legs, and indeed, many SEO tools feed on the concept that keyword density is an important metric. It’s not. Ignore it and use keywords intelligently and with usability in mind. The value from an extra 10 instances of your keyword on the page is far less than earning one good editorial link from a source that doesn’t think you’re a search spammer.
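For illustration only (the function and sample text below are invented), the metric those tools compute is trivial to reproduce – which is part of why it carries so little signal:

```python
def keyword_density(text, keyword):
    """Toy illustration of the keyword-density formula:
    occurrences of the keyword divided by total words on the page."""
    words = text.lower().split()
    return words.count(keyword.lower()) / len(words)

sample = "seo tools love keyword density but search engines do not rank by density"
print(round(keyword_density(sample, "density"), 3))  # 2 occurrences / 13 words -> 0.154
```

Any page can hit an arbitrary density target by mechanical repetition, which is precisely why the engines cannot treat the number as a meaningful relevancy signal.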
Put on your tin foil hats, it’s time for the most common SEO conspiracy theory – that upping your PPC spend will improve your organic SEO rankings (or, likewise, that lowering that spend can cause ranking drops). In all of the experiences we’ve ever witnessed or heard about, this has never been proven, nor has it ever been a probable explanation for effects in the organic results. Google, Yahoo! & Bing all have very effective walls in their organizations to prevent precisely this type of crossover. At Google in particular, advertisers spending tens of millions of dollars each month have noted that even they cannot get special access or consideration from the search quality or web spam teams. So long as the existing barriers are in place and the search engines’ cultures maintain their separation, we believe that this will remain a myth. That said, we have seen anecdotal evidence that bidding on keywords you already organically rank for can help increase your organic click-through rate.
Personalization seems to primarily affect areas in which we devote tons of time, energy and repeated queries. This means for many/most “discovery” and early funnel searches, we’re going to get very standardized search results. It’s true that it can influence some searches significantly, but it’s also true that 90%+ of queries we perform are unaffected (and that goes for what we hear from other SEOs, too). This post helps to validate this, showing that while rankings changes can be dramatic, they only happen when there’s substantive query volume from a user around a specific topic.
Reciprocal links are of dubious value: they are easy for an algorithm to catch and to discount. Having your own version of the Yahoo! directory on your site isn’t helping your users, nor is it helping your SEO.
We wouldn’t be concerned at all with a technically “reciprocated” link, but we would watch out for schemes and directories that leverage this logic to earn their own links and promise value back to your site in exchange. Also, watch out for those who’ve evolved to build “three-way” or “four-way” reciprocal directories such that you link to them and they’ll link to you from a separate site – it’s still attempted manipulation and there are so many relevant directories out there; why bother!?
The practice of spamming the search engines – creating pages and schemes designed to artificially inflate rankings or abuse the ranking algorithms employed to sort content – has been rising since the mid-1990’s. With payouts so high (at one point, a fellow SEO noted to us that a single day ranking atop Google’s search results for the query “buy viagra” could bring upwards of $20,000 in affiliate revenue), it’s little wonder that manipulating the engines is such a popular activity on the web. However, it’s become increasingly difficult and, in our opinion, less and less worthwhile for two reasons.
Search engines have learned that users hate spam. This may seem a trivial and obvious lesson, but in fact, many who study the field of search from a macro perspective believe that along with improved relevancy, Google’s greatest product advantage over the last 10 years has been their ability to control and remove spam better than their competitors. While it’s hard to say if this directly influenced their dramatic rise to lead in market share worldwide, it’s undoubtedly something all the engines spend a great deal of time, effort and resources on – and with hundreds of the world’s smartest engineers dedicated to fighting the practice, those of us at SEOmoz are loath to ever recommend search spam as a winnable endeavor in the long term.
Search engines have done a remarkable job identifying scalable, intelligent methodologies for fighting manipulation and making it dramatically more difficult to adversely impact their intended algorithms. Concepts like TrustRank (which SEOmoz’s Linkscape index leverages), HITS, statistical analysis, historical data and more, along with specific implementations like the Google Sandbox, penalties for directories, reduction of value for paid links, combating footer links, etc. have all driven down the value of search spam and made so-called “white hat” tactics (those that don’t violate the search engines’ guidelines) far more attractive.
This guide is not intended to show off specific spam tactics (either those that no longer work or are still practiced), but, due to the large number of sites that get penalized, banned or flagged and seek help, we will cover the various factors the engines use to identify spam so as to help SEO practitioners avoid problems. For additional details about spam from the engines, see Google’s Webmaster Guidelines, Yahoo!’s Search Content Quality Guidelines & Bing’s Guidelines for Successful Indexing.
One of the most obvious and unfortunate spamming techniques, keyword stuffing, involves littering numerous repetitions of keyword terms or phrases into a page in order to make it appear more relevant to the search engines. The thought behind this – that increasing the number of times a term is mentioned can considerably boost a page’s ranking – is generally false. Studies looking at thousands of the top search results across different queries have found that keyword repetitions (or keyword density) appear to play an extremely limited role in boosting rankings, and have a low overall correlation with top placement.
The engines have very obvious and effective ways of fighting this. Scanning a page for stuffed keywords is not massively challenging, and the engines’ algorithms are all up to the task. You can read more about this practice, and Google’s views on the subject, in a blog post from the head of their web spam team – SEO Tip: Avoid Keyword Stuffing.
One of the most popular forms of web spam, manipulative link acquisition relies on the search engines’ use of link popularity in their ranking algorithms to attempt to artificially inflate these metrics and improve visibility. This is one of the most difficult forms of spamming for the search engines to overcome because it can come in so many forms. A few of the many ways manipulative links can appear include:
- Reciprocal link exchange programs, wherein sites create link pages that point back and forth to one another in an attempt to inflate link popularity. The engines are very good at spotting and devaluing these as they fit a very particular pattern.
- Incestuous or self-referential links, including “link farms” and “link networks” where fake or low value websites are built or maintained purely as link sources to artificially inflate popularity. The engines combat these through numerous methods of detecting connections between site registrations, link overlap or other common factors.
- Paid links, where those seeking to earn higher rankings buy links from sites and pages willing to place a link in exchange for funds. These sometimes evolve into larger networks of link buyers and sellers, and although the engines work hard to stop them (and Google in particular has taken dramatic actions), they persist in providing value to many buyers & sellers (see this post on paid links for more on that perspective).
- Low quality directory links are a frequent source of manipulation for many in the SEO field. A large number of pay-for-placement web directories exist to serve this market and pass themselves off as legitimate with varying degrees of success. Google often takes action against these sites by removing the PageRank score from the toolbar (or reducing it dramatically), but won’t do this in all cases.
There are many more manipulative link building tactics that the search engines have identified and, in most cases, found algorithmic methods of reducing their impact. As new spam systems (like this new reciprocal link cloaking scheme uncovered by Avvo Marketing Manager Conrad Saam) emerge, engineers will continue to fight them with targeted algorithms, human reviews and the collection of spam reports from webmasters & SEOs.
A basic tenet of all the search engine guidelines is to show the same content to the engine’s crawlers that you’d show to an ordinary visitor. When this guideline is broken, the engines call it “cloaking” and take action to prevent these pages from ranking in their results. Cloaking can be accomplished in any number of ways and for a variety of reasons, both positive and negative. In some cases, the engines may let practices that are technically “cloaking” pass, as they’re done for positive user experience reasons. For more on the subject of cloaking and the levels of risks associated with various tactics and intents, see this post, White Hat Cloaking, from Rand Fishkin.
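To make the concept concrete, user-agent cloaking in its crudest form looks like the hypothetical sketch below (the function and strings are invented; this illustrates what the guideline forbids, not a recommended practice):

```python
# Hypothetical sketch of crude user-agent cloaking: the server inspects
# who is asking and returns different content to crawlers than to people.
# This is the behavior the engines' guidelines prohibit.
def respond(user_agent):
    if "Googlebot" in user_agent:
        # Cloaked branch: the crawler sees content no visitor ever would.
        return "keyword-stuffed page built only for crawlers"
    return "normal page shown to human visitors"

print(respond("Mozilla/5.0 (Windows NT 10.0)"))
print(respond("Googlebot/2.1"))
```

Real detection is harder than this sketch suggests (cloaking can key off IP ranges, cookies or JavaScript support rather than the user-agent string), which is why the engines also crawl from undisclosed addresses to compare what they are served.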
Although it may not technically be considered “web spam,” the engines all have guidelines and methodologies to determine if a page provides unique content and “value” to its searchers before including it in their web indices and search results. The most commonly filtered types of pages are affiliate content (pages whose material is used on dozens or hundreds of other sites promoting the same product/service), duplicate content (pages whose content is a copy of or extremely similar to other pages already in the index), and dynamically generated content pages that provide very little unique text or value (this frequently occurs on pages where the same products/services are described for many different geographies with little content segmentation). The engines are generally against including these pages and use a variety of content and link analysis algorithms to filter out “low value” pages from appearing in the results.
In addition to watching individual pages for spam, engines can also identify traits and properties across entire root domains or subdomains that could flag them as spam signals. Obviously, excluding entire domains is tricky business, but it’s also much more practical in cases where greater scalability is required.
Just as with individual pages, the engines can monitor the kinds of links and quality of referrals sent to a website. Sites that are clearly engaging in the manipulative activities described above in a consistent or seriously impactful way may see their search traffic suffer, or even have their sites banned from the index. You can read about some examples of this from past posts – Widgetbait Gone Wild, What Makes a Good Directory and Why Google Penalized Dozens of Bad Ones, Google’s Sandbox Still Exists: Exemplified by Grader.com, and How to Handle a Google Penalty – And, an Example from the Field of Real Estate.
Websites that earn trusted status are often treated differently from those who have not. In fact, many SEOs have commented on the “double standards” that exist for judging “big brand” and high importance sites vs. newer, independent sites. For the search engines, trust most likely has a lot to do with the links your domain has earned (see these videos on Using Trust Rank to Guide Your Link Building and How the Link Graph Works for more). Thus, if you publish low quality, duplicate content on your personal blog, then buy several links from spammy directories, you’re likely to encounter considerable ranking problems. However, if you were to post that same content to a page on Wikipedia and get those same spammy links to point to that URL, it would likely still rank tremendously well – such is the power of domain trust & authority.
Trust built through links is also a great methodology for the engines to employ in considering new domains and analyzing the activities of a site. A little duplicate content and a few suspicious links are far more likely to be overlooked if your site has earned hundreds of links from high quality, editorial sources like CNN.com, LII.org, Cornell.edu, and similarly reputable players. On the flip side, if you have yet to earn high quality links, judgments may be far stricter from an algorithmic view.
Similar to how a page’s value is judged against criteria such as uniqueness and the experience it provides to search visitors, so too does this principle apply to entire domains. Sites that primarily serve non-unique, non-valuable content may find themselves unable to rank, even if classic on and off page factors are performed acceptably. The engines simply don’t want thousands of copies of Wikipedia or Amazon affiliate websites filling up their index, and thus take algorithmic and manual review methods to prevent this.
It can be tough to know if your site/page actually has a penalty or if things have changed, either in the search engines’ algorithms or on your site that negatively impacted rankings or inclusion. Before you assume a penalty, check for the following:
Once you’ve ruled out the list below, follow the flowchart beneath for more specific advice.
Errors on your site that may have inhibited or prevented crawling.
Changes to your site or pages that may have changed the way search engines view your content. (on-page changes, internal link structure changes, content moves, etc.)
Sites that share similar backlink profiles, and whether they’ve also lost rankings – when the engines update ranking algorithms, link valuation and importance can shift, causing ranking movements.
While this chart’s process won’t work for every situation, the logic has been uncanny in helping us identify spam penalties or mistaken flagging for spam by the engines and separating those from basic ranking drops. This page from Google (and the embedded YouTube video) may also provide value on this topic.
The task of requesting re-consideration or re-inclusion in the engines is painful and often unsuccessful. It’s also rarely accompanied by any feedback to let you know what happened or why. However, it is important to know what to do in the event of a penalty or banning.
Hence, the following recommendations:
- If you haven’t already, register your site with the engine’s Webmaster Tools service (Google’s, Yahoo!’s, Bing’s). This registration creates an additional layer of trust and connection between your site and the webmaster teams.
- Make sure to thoroughly review the data in your Webmaster Tools accounts, from broken pages to server or crawl errors to warnings or spam alert messages. Very often, what’s initially perceived as a mistaken spam penalty is, in fact, related to accessibility issues.
- Send your re-consideration/re-inclusion request through the engine’s Webmaster Tools service rather than the public form – again, creating a greater trust layer and a better chance of hearing back.
- Full disclosure is critical to getting consideration. If you’ve been spamming, own up to everything you’ve done – links you’ve acquired, how you got them, who sold them to you, etc. The engines, particularly Google, want the details, as they’ll apply this information to their algorithms for the future. Hold back, and they’re likely to view you as dishonest, corrupt or simply incorrigible (and fail to ever respond).
- Remove/fix everything you can. If you’ve acquired bad links, try to get them taken down. If you’ve done any manipulation on your own site (over-optimized internal linking, keyword stuffing, etc.), get it off before you submit your request.
- Get ready to wait – responses can take weeks, even months, and re-inclusion itself, if it happens, is a lengthy process. Hundreds (maybe thousands) of sites are penalized every week, so you can imagine the backlog the webmaster teams encounter.
- If you run a large, powerful brand on the web, re-inclusion can be faster by going directly to an individual source at a conference or event. Engineers from all of the engines regularly participate in search industry conferences (SMX, SES, Pubcon, etc.), and the value of being re-included more quickly than a standard request might allow can easily outweigh the cost of a ticket.
Be aware that with the search engines, lifting a penalty is not their obligation or responsibility. Legally (at least, so far), they have the right to include or reject any site/page for any reason (or no reason at all). Inclusion is a privilege, not a right, so be cautious and don’t apply techniques you’re unsure or skeptical of – or you could find yourself in a very rough spot.