Web Scrapers are used to help Yioop get to the most important content on a web page during a crawl. When Yioop crawls, it tries to extract the most important content of a page into a succinct summary. It then indexes just this summary. Web pages generated by a content management system such as WordPress have a reasonably standard format, and a web scraper can be used to isolate the sub-portion of a web page which is more likely to have useful content. Below we describe how to use the Web Scraper activity to make a new scraper or view an existing one.

Name is what to call the scraper that is being defined. A Web Scraper must have a Name. The Signature and Scrape Rules fields are optional, but at least one of them must be present for the web scraper to have an effect while crawling.

Signature is used to detect when a particular Web Scraper should be used. It should consist of an XPath query that evaluates to a non-empty set of elements on any page the scraper is meant to handle.
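As a rough illustration of the signature test, the sketch below checks whether a query selects at least one element. It is not Yioop's code: it uses Python's standard library, whose XPath subset needs a relative form such as .//meta rather than //meta, and the sample page, the function name signature_matches, and both signatures are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, written as well-formed XML so the standard library can parse it.
PAGE = """<html><head><meta name="generator" content="WordPress 6.0" /></head>
<body><div id="content"><p>Hello</p></div></body></html>"""


def signature_matches(page_source, signature):
    """Return True when the signature query selects at least one element."""
    root = ET.fromstring(page_source)
    return len(root.findall(signature)) > 0


# Hypothetical signatures; ElementTree requires the leading "." on "//" paths.
wordpress_signature = ".//meta[@name='generator']"
other_signature = ".//meta[@name='some-other-cms']"

print(signature_matches(PAGE, wordpress_signature))
print(signature_matches(PAGE, other_signature))
```

A signature that selects nothing simply means the scraper is not considered for that page.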

Priority is used to determine which scraper to apply to a web page when the page matches multiple scraper signatures. Yioop chooses the matching scraper with the highest (largest) priority. If two scrapers have the same priority, it chooses the first matching one it finds. The priority dropdown allows one to set the priority of a scraper.
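The selection rule just described can be sketched as follows. This is only an illustration under assumptions: the Scraper class, its fields, and choose_scraper are hypothetical names, and the input list is assumed to hold only scrapers whose signatures already matched, in the order they were found.

```python
from dataclasses import dataclass


@dataclass
class Scraper:
    """Hypothetical record for a scraper whose signature matched the page."""
    name: str
    priority: int


def choose_scraper(matching):
    """Pick the highest-priority scraper; ties go to the earliest in the list."""
    if not matching:
        return None
    # max() returns the first maximal item it encounters, which gives the
    # "first one found" behaviour for equal priorities.
    return max(matching, key=lambda scraper: scraper.priority)


candidates = [Scraper("generic", 1), Scraper("blog", 3), Scraper("news", 3)]
print(choose_scraper(candidates).name)
```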

Text XPath is used to specify an XPath to the most important content of a page for summarization.

Delete XPaths is used to specify XPaths, one per line, of content under the Text XPath portion of the web page that should not be considered for summarization.
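To make the interplay of Text XPath and Delete XPaths concrete, here is a minimal sketch using Python's standard library rather than Yioop's own code. The page, the function name summarize_text, and the XPaths are hypothetical, and the queries use the relative .// form that ElementTree's XPath subset requires.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, written as well-formed XML for the standard-library parser.
PAGE = """<html><body>
<div id="nav">Home | About</div>
<div id="content"><p>Main article text.</p><div class="share">Share this</div><p>More text.</p></div>
</body></html>"""


def summarize_text(page_source, text_xpath, delete_xpaths):
    """Return the text under text_xpath after dropping nodes matched by delete_xpaths."""
    root = ET.fromstring(page_source)
    content = root.find(text_xpath)
    if content is None:
        return ""
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in content.iter() for child in parent}
    for xpath in delete_xpaths:
        for node in content.findall(xpath):
            parent = parent_of.get(node)
            if parent is not None:
                # Note: this also drops any tail text that follows the node.
                parent.remove(node)
    pieces = (text.strip() for text in content.itertext())
    return " ".join(piece for piece in pieces if piece)


print(summarize_text(PAGE, ".//div[@id='content']", [".//div[@class='share']"]))
```

Only text inside the selected content element is kept, so the navigation block never reaches the summary, and the share widget is removed by the delete rule.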

Extract Fields is used to specify a sequence of rules for extracting content into specific fields in the summary. Each rule should be on a line by itself and have the format: NAME_OF_SUMMARY_FIELD = SOME_XPATH. The meaning of such a rule is: compute the XPath on the original document and concatenate the text contents of the resulting nodes into NAME_OF_SUMMARY_FIELD in the summary. For example,

 SITE_NAME=//meta[@property='og:site_name']/@content

would take the value of the content attribute of all meta tags whose property attribute has the value og:site_name, concatenate them as a string, and store the key SITE_NAME with this string as its value in the page's summary when it is indexed.
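The behaviour of an Extract Fields rule can be approximated with the sketch below. It is an illustration only, not Yioop's implementation: the names parse_rule, evaluate, and extract_fields are hypothetical, the standard-library parser supports only a small XPath subset (so a trailing /@attribute step is handled by hand), and joining multiple values with a single space is an assumption, since the separator is not stated above.

```python
import xml.etree.ElementTree as ET

# Hypothetical page, written as well-formed XML for the standard-library parser.
PAGE = """<html><head>
<meta property="og:site_name" content="Example Blog" />
<meta property="og:type" content="article" />
</head><body><p>Hi</p></body></html>"""


def parse_rule(line):
    """Split 'FIELD = XPATH' on the first '=' only; the XPath may contain '='."""
    field, xpath = line.split("=", 1)
    return field.strip(), xpath.strip()


def evaluate(root, xpath):
    """Evaluate a small XPath subset, optionally ending in '/@attribute'."""
    element_path, separator, attribute = xpath.rpartition("/@")
    if not separator:
        element_path, attribute = xpath, None
    if element_path.startswith("//"):
        # ElementTree only accepts the relative form ".//tag".
        element_path = "." + element_path
    values = []
    for node in root.findall(element_path):
        value = node.get(attribute) if attribute else "".join(node.itertext())
        if value is not None:
            values.append(value)
    return values


def extract_fields(page_source, rules):
    """Apply each rule and store the concatenated values under its field name."""
    root = ET.fromstring(page_source)
    summary = {}
    for line in rules:
        field, xpath = parse_rule(line)
        # Assumption: values are joined with one space.
        summary[field] = " ".join(evaluate(root, xpath))
    return summary


rules = ["SITE_NAME=//meta[@property='og:site_name']/@content"]
print(extract_fields(PAGE, rules))
```

Splitting on the first = sign matters because the XPath predicate itself contains one.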