The Robots Behaviors dropdown controls the degree to which your Yioop crawler respects robots.txt files. A robots.txt file is placed by a site operator in the document root of their website, so it typically has a URL like https://some_host_name/robots.txt or http://some_host_name/robots.txt. It specifies which files a particular kind of crawler is allowed to download from the site and at what rate. For example, it might have separate instructions for how Googlebot and Bingbot are allowed to crawl the site. The available options are:
  • Always Follow, which follows the robots.txt instructions to the best of Yioop's abilities.
  • Allow Landing Page Crawl, which allows Yioop to download URLs of the form https://some_host_name/ or http://some_host_name/ but otherwise respects the robots.txt file.
  • Ignore, which allows Yioop to completely ignore the robots.txt file. Use this option only at your own risk. It can be appropriate in a few cases, such as when you want to crawl part of a site that you own but whose robots.txt you do not control. For the most part, you should not use this option.
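
The difference between the three options can be illustrated with a minimal sketch in Python. This is not Yioop's implementation (Yioop itself is written in PHP). The constants, the function names, the ExampleBot user agent, and the sample robots.txt are all hypothetical, and the sketch assumes that a "landing page" means a URL whose path is empty or "/" with no query string.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

# Hypothetical labels mirroring the three dropdown options (not Yioop identifiers).
ALWAYS_FOLLOW = "always_follow"
ALLOW_LANDING_PAGE = "allow_landing_page"
IGNORE = "ignore"

# Made-up robots.txt: ExampleBot is barred from everything, while all other
# crawlers are asked to wait 5 seconds between requests and skip /private/.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /

User-agent: *
Crawl-delay: 5
Disallow: /private/
"""


def is_landing_page(url):
    """Assumption: a landing page is http(s)://host/ with no path or query."""
    parts = urlsplit(url)
    return (
        parts.scheme in ("http", "https")
        and parts.netloc != ""
        and parts.path in ("", "/")
        and parts.query == ""
    )


def may_download(url, user_agent, robots_txt, behavior):
    """Decide whether a crawler may fetch url under the given behavior."""
    if behavior == IGNORE:
        # robots.txt is not consulted at all.
        return True
    if behavior == ALLOW_LANDING_PAGE and is_landing_page(url):
        # The landing page is exempt; every other URL falls through.
        return True
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for behavior in (ALWAYS_FOLLOW, ALLOW_LANDING_PAGE, IGNORE):
        for url in ("https://some_host_name/", "https://some_host_name/docs/a.html"):
            allowed = may_download(url, "ExampleBot", SAMPLE_ROBOTS_TXT, behavior)
            print(behavior, url, allowed)
```

In this sketch only the Ignore branch skips the robots.txt check for every URL. Allow Landing Page Crawl relaxes it for a single URL per host, which is why it is the gentler of the two non-default options.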