2014-08-06

Summer 2014 Crawling Redux.

From July 9 to July 31, Findcan did seven crawls of between 1 and 10 million pages, with a last crawl of 18 million pages being left as the current default index. The goal of the earlier crawls was to experiment with different settings of the underlying [[http://www.seekquarry.com|Yioop Search Engine Software]] to try to get a reasonable crawl of "Canadian Content". I thought I'd share a little bit of the strategy behind getting a crawl for a particular country.
There are two main places where you can control which sites Yioop's crawler will crawl: Manage Crawl :: Options and Page Options. Manage Crawl :: Options allows you to specify the Allowed To Crawl Sites, the Disallowed To Crawl Sites, and an initial set of Seed Sites; Page Options allows you to turn on and off various indexing plugins. For the Canadian crawl I made use of the Word Filter plugin. I let my brother, Allan, give me an initial list of sites he thought were Canadian. To come up with seed sites I went to [[http://www.alexa.com/topsites/countries/CA|Alexa - Top Sites in Canada]] and went through the first 500 listings, looking for sites that were either on a .ca domain or looked like they had a primary focus on Canadian content. For example, the top two sites Alexa lists for Canada are google.ca and google.com; the latter did not seem focused on Canada, so it was not added as a seed site. After doing this, I added my brother's list to get an initial list of seed sites.
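The Seed Sites textarea is essentially a list of start URLs. A few lines of the kind it contained are sketched below; google.ca comes from the Alexa selection just described, while the other two are hypothetical stand-ins for the sorts of sites that met the criteria, not necessarily ones actually used:
 http://www.google.ca/
 http://www.cbc.ca/
 http://www.canada.ca/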
I then expanded this initial list by searching around for what were the top queries for Canada on the major search engines in 2013. This led me to tack on a few more sites. I then wanted to get the "Canadian portions" of sites like Wikipedia, Youtube, Dmoz, etc. This is where using the Word Filter Plugin under Page Options became useful. For instance, for wikipedia.org I used the following rule: [domain:wikipedia.org] -canada,-canadian,-quebec,-hockey,-newfoundland:NOPROCESS (shown again below as it would appear in the plugin's textarea). This says that if a page comes from wikipedia.org but does not contain the terms canada, canadian, etc., then don't process the page (so it won't appear in the index, and links on the page will be ignored as far as crawling goes).
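Entered in the Word Filter plugin's textarea, the rule's bracketed pattern and its term list sit on separate lines. Below is the wikipedia.org rule in that form, plus a similar rule one might add for youtube.com; the youtube.com rule and its term list are illustrative guesses, not necessarily what Findcan used:
 [domain:wikipedia.org]
 -canada,-canadian,-quebec,-hockey,-newfoundland:NOPROCESS
 [domain:youtube.com]
 -canada,-canadian,-quebec,-hockey:NOPROCESS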
Once I had my seed sites and my word filter rules in place, I wanted to restrict the crawl to urls which seemed to be "Canadian". Under Manage Crawl :: Options I checked the Restrict Sites By Url checkbox; this makes an Allowed To Crawl Sites textarea appear. I added lines whitelisting just those urls I wanted to allow to be crawled. For example, I had the line domain:ca to whitelist the whole .ca domain, but I also had other lines like http://www.calgarystampede.com/ to allow crawling of just that particular website. I whitelisted all of wikipedia.org, but the word filter plugin rules would prevent crawling of pages that didn't contain Canadian keywords.
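Concretely, the Allowed To Crawl Sites textarea ended up with lines along these lines. The domain:ca and calgarystampede.com entries are the ones mentioned above; the wikipedia.org line is a guess at how whitelisting all of wikipedia.org would have been written, following the same domain: syntax:
 domain:ca
 domain:wikipedia.org
 http://www.calgarystampede.com/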
The seven different crawls I carried out were to experiment with and try to improve the initial seed sites and filtering. For example, I noticed that it would be useful to add the Wikipedia pages of the provinces to the seed sites. I added sites for various major Canadian companies, restaurants, artists, magazines, etc. I also added some seed sites for tourism and real estate. On the other hand, I noticed that it was useful to restrict the rate at which some magazine sites and news sites were crawled -- news sites can be crawled on a periodic basis anyway using the news updater, and although the Yioop software does have some mechanisms to keep sitemaps from flooding results, these are more effective mid-crawl than at the start of a crawl. To do this kind of rate restriction I added lines like http://www.torontosun.com/#100 to the Disallowed Sites/Sites with Quotas textarea of Manage Crawl :: Options. This restricts crawling of http://www.torontosun.com/ to 100 urls/hour.
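In that textarea a quota line is just the site URL followed by # and the hourly URL limit. The torontosun.com line is the one described above; the second line is a purely hypothetical further example of the same syntax:
 http://www.torontosun.com/#100
 http://www.macleans.ca/#100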
Crawling on Findcan is now stopped for the time being. One of the secondary goals of this crawling was, on a limited scale, to work out any kinks that might appear in the Yioop software. One thing that was noticed was sites which block Findcan after an initial crawl by using a 302 Refresh back to the Findcan.ca IP address. It is of course completely up to a site operator whether they want a given crawler to crawl their site or not. The Yioop software checks at least once a day for a robots.txt file to determine whether or not it is allowed to crawl a site. However, when a Refresh was used, a given site also tended to prevent Findcan from actually downloading the site's robots.txt, which meant that if Findcan had already acquired several links to that site, it would assume that the robots.txt (which redirected back to Findcan's index page) was not preventing it from crawling anything. The Yioop software has since been modified to check for these kinds of redirects.
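A minimal sketch of that kind of check, written here in Python rather than in Yioop's own code, and assuming a hypothetical CRAWLER_HOST constant for the crawler's own address: if fetching a site's robots.txt lands back on the crawler's own host, the safe interpretation is that the site does not want to be crawled at all, not that it has no restrictions.
 from urllib.parse import urlparse
 from urllib.request import urlopen
 
 # Hypothetical constant: the host the crawler itself runs on.
 CRAWLER_HOST = "findcan.ca"
 
 def robots_redirects_to_crawler(site_url, timeout=10):
     """Fetch site_url's robots.txt and report whether the request was
     redirected back to the crawler's own host. If so, the site should be
     treated as disallowing everything, not as having no restrictions."""
     parts = urlparse(site_url)
     robots_url = "%s://%s/robots.txt" % (parts.scheme, parts.netloc)
     try:
         with urlopen(robots_url, timeout=timeout) as response:
             # geturl() gives the final URL after any redirects were followed.
             final_host = urlparse(response.geturl()).hostname or ""
     except OSError:
         return False  # fetch failures are a separate case, handled elsewhere
     return final_host == CRAWLER_HOST or final_host.endswith("." + CRAWLER_HOST)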
(Edited: 2015-07-29)