Robots Behaviors
The '''Robots Behaviors''' dropdown controls the degree to which your Yioop crawler respects '''robots.txt''' files. A '''robots.txt''' file is placed by a site operator in the document root of their web site, so it typically has a url like: https://some_host_name/robots.txt<br> or<br> http://some_host_name/robots.txt.

It specifies which files a particular kind of crawler is allowed to download from a site and at what rate. For example, it might have one set of instructions for how GoogleBot is allowed to crawl the site and another for BingBot.

The available options are:
* '''Always Follow''' which follows the robots.txt instructions to the best of Yioop's abilities.
* '''Allow Landing Page Crawl''' which allows Yioop to download urls of the form https://some_host_name/<br> or<br> http://some_host_name/ but otherwise respects the robots.txt file.
* '''Ignore''' which allows Yioop to completely ignore the robots.txt file. Use this option at your own risk. It may suit a case where you want to crawl part of a site that you own but where you don't control the robots.txt. For the most part, you should not use this option.
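
The difference between the three options can be sketched in a few lines of Python. This is only an illustration of the decision each option implies, not Yioop's actual code: the names `RobotsBehavior`, `is_landing_page`, `may_download`, and the user agent `ExampleBot` are hypothetical, and the robots.txt parsing is delegated to Python's standard `urllib.robotparser` rather than to Yioop's own parser. Crawl-rate handling is left out.

```python
from enum import Enum
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


class RobotsBehavior(Enum):
    """Hypothetical stand-ins for the three dropdown options."""
    ALWAYS_FOLLOW = "always_follow"
    ALLOW_LANDING_PAGE_CRAWL = "allow_landing_page_crawl"
    IGNORE = "ignore"


def is_landing_page(url):
    """True for urls of the form http(s)://some_host_name/.

    Assumption: a url with an empty path (no trailing slash) is treated
    the same as one whose path is "/".
    """
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")
        and bool(parts.netloc)
        and parts.path in ("", "/")
        and not parts.query
    )


def may_download(url, robots_txt, behavior, user_agent="ExampleBot"):
    """Decide whether url may be fetched under the chosen behavior.

    robots_txt is the text of the site's robots.txt file.
    """
    if behavior is RobotsBehavior.IGNORE:
        # robots.txt is not consulted at all.
        return True
    if (behavior is RobotsBehavior.ALLOW_LANDING_PAGE_CRAWL
            and is_landing_page(url)):
        # The landing page is exempt; every other url falls through.
        return True
    # Otherwise respect the rules in robots.txt.
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # A robots.txt that blocks every crawler from the whole site.
    blocked = "User-agent: *\nDisallow: /\n"
    for behavior in RobotsBehavior:
        for url in ("https://some_host_name/",
                    "https://some_host_name/private/page.html"):
            print(behavior.name, url, may_download(url, blocked, behavior))
```

With a robots.txt that disallows everything, only the landing page exemption and the '''Ignore''' option change the outcome, which is why '''Ignore''' carries the most risk.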
(c) 2024 Findcan -
Canadian Search Engine