From Wikipedia, the free encyclopedia

Cloaking is a search engine optimization (SEO) technique in which the content presented to the search engine spider is different from that presented to the user's browser. This is done by delivering content based on the IP address or the User-Agent HTTP header of the user requesting the page. When a user is identified as a search engine spider, a server-side script delivers a different version of the web page, one that contains content not present on the visible page, or that is present but not searchable. The purpose of cloaking is sometimes to deceive search engines so that they display the page when it would not otherwise be displayed (black hat SEO). However, it can also be a functional (though antiquated) technique for informing search engines of content they would not otherwise be able to locate because it is embedded in non-textual containers, such as video or certain Adobe Flash components. Since 2006, better methods of accessibility, including progressive enhancement, have been available, so cloaking is no longer necessary for regular SEO.[1]
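The mechanism described above amounts to a server-side branch on the requester's apparent identity. The following is a minimal illustrative sketch only, not taken from any cited source: the spider signatures, route, and page bodies are hypothetical, and a real cloaker might also check IP ranges rather than just the User-Agent header.

    # Illustrative sketch of User-Agent-based cloaking (hypothetical content).
    # A request whose User-Agent matches a known spider gets a different page
    # body than an ordinary browser would receive.
    from flask import Flask, request

    app = Flask(__name__)

    SPIDER_SIGNATURES = ("googlebot", "bingbot", "slurp")  # hypothetical list

    @app.route("/")
    def index():
        user_agent = request.headers.get("User-Agent", "").lower()
        if any(signature in user_agent for signature in SPIDER_SIGNATURES):
            # Version delivered to spiders: keyword-rich text the visitor never sees.
            return "<p>Keyword-rich text intended only for the crawler.</p>"
        # Version delivered to ordinary browsers.
        return "<img src='banner.jpg' alt='Splash page with no indexable text'>"

    if __name__ == "__main__":
        app.run()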

Cloaking is often used as a spamdexing technique to attempt to sway search engines into giving the site a higher ranking. By the same method, it can also be used to trick search engine users into visiting a site that is substantially different from the search engine description, including delivering pornographic content cloaked within non-pornographic search results.

Cloaking is a form of the doorway page technique.

A similar technique was used on the DMOZ web directory, but it differs in several ways from search engine cloaking:

  • It is intended to fool human editors rather than search engine spiders.
  • The decision to cloak is often based on the HTTP referrer, the user agent, or the visitor's IP address, but more advanced techniques analyze the client's behaviour over a few page requests: the number, order, and latency of subsequent HTTP requests, and whether the client fetches the robots.txt file, are parameters on which search engine spiders differ markedly from natural users. The referrer gives the URL of the page on which the visitor clicked a link to reach the current page. Some cloakers serve the fake page to anyone arriving from a web directory website, since directory editors usually examine sites by clicking links that appear on a directory page. Other cloakers serve the fake page to everyone except visitors coming from a major search engine; this makes the cloaking harder to detect while costing few visitors, since most people find websites through a search engine. A minimal sketch of such a referrer-based decision follows this list.
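The sketch below illustrates the referrer- and behaviour-based decision described in the last item. It is hypothetical throughout: the directory hostnames, the request-rate threshold, and the signals passed in are placeholders, not any documented implementation.

    # Hypothetical sketch: decide whether a visitor looks like a directory
    # editor (by referrer) or like a spider (by crude behavioural signals).
    DIRECTORY_HOSTS = ("dmoz.org", "dir.example.org")  # hypothetical

    def came_from_directory(referrer: str) -> bool:
        return any(host in referrer.lower() for host in DIRECTORY_HOSTS)

    def looks_like_spider(fetched_robots_txt: bool, requests_last_minute: int) -> bool:
        # Spiders typically fetch robots.txt and request pages in rapid,
        # orderly bursts; natural users rarely do either.
        return fetched_robots_txt or requests_last_minute > 60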

YouTube Encyclopedic

  • Cloaking
  • How does Google use human raters in web search?
  • Tempered Networks Breakthrough Cloaking Technology

Transcription

MATT CUTTS: Hi, everybody. It's Matt Cutts. And we're back to talk a little bit about cloaking today. A lot of people have questions about cloaking. What exactly is it? How does Google define it? Why is it high risk behavior? All those sorts of things. And there's a lot of HTML documentation. We've done a lot of blog posts. But I wanted to sort of do the definitive cloaking video, and answer some of those questions, and give people a few rules of thumb to make sure that you're not in a high risk area.

So first off, what is cloaking? Cloaking is essentially showing different content to users than to Googlebot. So imagine that you have a web server right here. And a user comes and asks for a page. So here's your user. You give him some sort of page. Everybody's happy. And now, let's have Googlebot come and ask for a page as well. And you give Googlebot a page. Now in the vast majority of situations, the same content goes to Googlebot and to users. Everybody's happy. Cloaking is when you show different content to users than to Googlebot. And it's definitely high risk. That's a violation of our quality guidelines. If you do a search for quality guidelines on Google, you'll find a list of all the stuff-- a lot of auxiliary documentation about how to find out whether you're in a high risk area.

But let's just talk through this a little bit. Why do we consider cloaking bad, or why does Google not like cloaking? Well, the answer is sort of in the ancient days of search engines, when you'd see a lot of people do really deceptive or misleading things with cloaking. So for example, when Googlebot came, the web server that was cloaking might return a page all about cartoons-- Disney cartoons, whatever. But when a user came and visited the page, the web server might return something like porn. And so if you did a search for Disney cartoons on Google, you'd get a page that looked like it would be about cartoons, you'd click on it, and then you'd get porn. That's a hugely bad experience. People complain about it. It's an awful experience for users. So we say that all types of cloaking are against our quality guidelines. So there's no such thing as white hat cloaking. Certainly, when somebody's doing something especially deceptive or misleading, that's when we care the most. That's when the web spam team really gets involved. But any type of cloaking is against our guidelines.

OK. So what are some rules of thumb to sort of save you the trouble or help you stay out of a high risk area? One way to think about cloaking is, almost take the page, like you Wget it or you cURL it. You somehow fetch it, and you take a hash of that page. So take all the different content and boil it down to one number. And then you pretend to be Googlebot, with a Googlebot user agent. We even have a Fetch as Googlebot feature in Google Webmaster Tools. So you fetch a page as Googlebot, and you hash that page as well. And if those numbers are different, then that could be a little bit tricky. That could be something where you might be in a high risk area. Now pages can be dynamic. You might have things like timestamps, the ads might change, so it's not a hard and fast rule. Another simple heuristic to keep in mind is if you were to look through the code of your web server, would you find something that deliberately checks for a user agent of Googlebot specifically or Googlebot's IP address specifically? Because if you're doing something very different, or special, or unusual for Googlebot-- either its user agent or its IP address-- that has the potential to mean you're showing different content to Googlebot than to users. And that's the stuff that's high risk. So keep those kinds of things in mind.

Now one question we get from a lot of people who are white hat, and don't want to be involved in cloaking in any way, and want to make sure that they steer clear of high risk areas, is what about geolocation and mobile user agents-- so phones and that sort of thing. And the good news-- the executive sort of summary-- is that you don't really need to worry about that. But let's talk through exactly why geolocation and handling mobile phones is not cloaking.

OK. So until now, we've had one user. Now let's go ahead and say this user is coming from France. And let's have a completely different user, and let's say maybe they're coming from the United Kingdom. In an ideal world, if you have your content available on a .fr domain, or a .uk domain, or in different languages, because you've gone through the work to translate them, it's really, really helpful if someone coming from a French IP address gets their content in French. They're going to be much happier about that. So what geolocation does is whenever a request comes in to the web server, you look at the IP address and you say, ah, this is a French IP address. I'm going to send them the French language version or send them to the .fr version of my domain. If someone comes in and their browser language is English, or their IP address is something from America or Canada, something like that, then you say, aha, English is probably the best match, unless they're coming from the French part of Canada, of course. So what that is doing is you're making the decision based on the IP address. As long as you're not making up some specific country that Googlebot belongs to-- Googlandia or something like that-- then you're not doing something special or different for Googlebot. At least currently-- when we're making this video-- Googlebot crawls from the United States. And so you would treat Googlebot just like a visitor from the United States. You'd serve up content in English. And we typically recommend that you treat Googlebot just like a regular desktop browser-- so Internet Explorer 7 or whatever a very common desktop browser is for your particular site. So geolocation-- that is, looking at the IP address and reacting to that-- is totally fine, as long as you're not reacting specifically to the IP address of just Googlebot, just that very narrow range. Instead, you're looking at, OK, what's the best user experience overall depending on the IP address?

In the same way, if someone now comes in-- and let's say that they're coming in from a mobile phone-- so they're accessing it via an iPhone or an Android phone. And you can figure out, OK, that is a completely different user agent. It's got completely different capabilities. It's totally fine to respond to that user agent and give them a more squeezed version of the website or something that fits better on a smaller screen. Again, the difference is if you're treating Googlebot like a desktop user-- so that user agent doesn't have anything special or different that you're doing-- then you should be in perfectly fine shape. So you're looking at the capabilities of the mobile phone, you're returning an appropriately customized page, but you're not trying to do anything deceptive or misleading. You're not treating Googlebot really differently, based on its user agent. And you should be fine there.

So the one last thing I want to mention-- and this is a little bit of a power user kind of thing-- is some people are like, OK, I won't make the distinction based on the exact user agent string or the exact IP address range that Googlebot comes from, but maybe I'll check for cookies. And if somebody doesn't respond to cookies or if they don't treat JavaScript the same way, then I'll carve that out and I'll treat it differently. And the litmus test there is: are you basically using that as an excuse to try to find a way to treat Googlebot differently or try to find some way to segment Googlebot and make it do a completely different thing? So again, the instinct behind cloaking is: are you treating users the same way as you're treating Googlebot? We want to score and return roughly the same page that the user is going to see. So we want the end user experience when they click on a Google result to be the same as if they'd just come to the page themselves. So that's why you shouldn't treat Googlebot differently. That's why cloaking is a bad experience, why it violates our quality guidelines. And that's why we do pay attention to it. There's no such thing as white hat cloaking. We really do want to make sure that the page the user sees is the same page that Googlebot saw.

OK, so I hope that kind of helps. I hope that explains a little bit about cloaking, some simple rules of thumb. And again, if you get nothing else from this video, basically ask yourself: do I have special code that looks exactly for the user agent Googlebot or the exact IP address of Googlebot and treats it differently somehow? If you treat it just like everybody else-- so you serve content based on geolocation, you look at the user agent for phones-- that sort of thing is fine. It's only if you're looking for Googlebot specifically, and you're doing something different, that you start to get into a high risk area. We've got more documentation on our website. So we'll probably have links to that, if you look at the metadata for this video. But I hope that explains a little bit about why we feel the way we do about cloaking, why we take it seriously, and how we look at the overall effect in trying to decide whether something is cloaking. The end user effect is what we're ultimately looking at. And so regardless of what your code is, if something is served up that's radically different to Googlebot than to users, that's something that we're probably going to be concerned about. Hope that helps.
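The "hash the page twice" rule of thumb from the transcript can be approximated with a short script. This is only a sketch under stated assumptions: it varies the User-Agent string, so it will not catch cloaking keyed to Googlebot's IP addresses (for that, the video points to the Fetch as Googlebot feature), and dynamic content such as ads or timestamps can legitimately make the hashes differ.

    # Fetch the same URL twice, once with a browser User-Agent and once with
    # Googlebot's, and compare content hashes. A mismatch is a hint, not proof.
    import hashlib
    import requests

    BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"
    GOOGLEBOT_UA = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"

    def page_hash(url: str, user_agent: str) -> str:
        response = requests.get(url, headers={"User-Agent": user_agent}, timeout=10)
        return hashlib.sha256(response.content).hexdigest()

    def possibly_cloaked(url: str) -> bool:
        return page_hash(url, BROWSER_UA) != page_hash(url, GOOGLEBOT_UA)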

Cloaking versus IP delivery

IP delivery can be considered a more benign variation of cloaking, where different content is served based upon the requester's IP address. With cloaking, search engines and human visitors never see the version of a page served to the other, whereas, with other uses of IP delivery, both search engines and people can see the same pages. This technique is sometimes used by graphics-heavy sites that have little textual content for spiders to analyze.[2]

One use of IP delivery is to determine the requester's location and deliver content written specifically for that country. This is not necessarily cloaking; for instance, Google uses IP delivery for its AdWords and AdSense advertising programs to target users in different geographic locations.
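A sketch of that kind of geotargeting follows. The country-to-page mapping is hypothetical, and the lookup function is a placeholder standing in for a real GeoIP database or service.

    # Pick a localized page by the visitor's country, as resolved from the IP.
    CONTENT_BY_COUNTRY = {"FR": "/fr/index.html", "GB": "/en-gb/index.html"}
    DEFAULT_PAGE = "/en/index.html"

    def country_for_ip(ip_address: str) -> str:
        # Placeholder for a real GeoIP lookup; returns a hard-coded guess here.
        return "FR" if ip_address.startswith("192.0.2.") else "US"

    def page_for_visitor(ip_address: str) -> str:
        return CONTENT_BY_COUNTRY.get(country_for_ip(ip_address), DEFAULT_PAGE)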

IP delivery is a crude and unreliable method of determining the language in which to provide content. Many countries and regions are multilingual, or the requester may be a foreign national. A better method of content negotiation is to examine the client's Accept-Language HTTP header.
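A minimal sketch of Accept-Language negotiation, assuming a hypothetical set of languages the site actually offers: it parses the header's quality values, prefers the highest-weighted language the site can serve, and falls back to a default.

    # Parse a header like "fr-CH, fr;q=0.9, en;q=0.8" and choose a language.
    AVAILABLE_LANGUAGES = {"en", "fr", "de"}  # hypothetical site catalogue

    def preferred_language(accept_language: str, default: str = "en") -> str:
        candidates = []
        for part in accept_language.split(","):
            piece = part.strip()
            if not piece:
                continue
            lang, _, q = piece.partition(";q=")
            try:
                weight = float(q) if q else 1.0
            except ValueError:
                weight = 0.0
            # Keep only the primary subtag, e.g. "fr-CH" -> "fr".
            candidates.append((weight, lang.split("-")[0].lower()))
        for _, lang in sorted(candidates, reverse=True):
            if lang in AVAILABLE_LANGUAGES:
                return lang
        return default

    # Example: preferred_language("fr-CH, fr;q=0.9, en;q=0.8") returns "fr".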

References

  1. ^ "Cloaking | Google Search Central". Google Developers.
  2. ^ Eberwein, Helgo (2012). Wettbewerbsrechtliche Aspekte von Domains und Suchmaschinen: die Rechtslage in Deutschland und Österreich (1st ed.). Baden-Baden. ISBN 978-3-8329-7890-7. OCLC 885168276.
