Vitus ElasticSearch Web Integration

What is Vitus?

The Vitus ElasticSearch Web Integration is a simple tool for transforming, cleaning and validating web data and indexing it into ElasticSearch.

Vitus ElasticSearch Web Integration is built with

  • Java 8
  • Akka (scalable actor-based processing)
  • and several other open source frameworks

The core consists of

  • A web crawler / spider (depth-based, with throttling)
  • A data transformation component that uses XPath to extract data and convert it into a JSON object
  • A data cleaner that uses regular expressions to remove or replace undesired text data
  • A regular expression data validator that ensures only valid objects are indexed
  • An indexer that sends the validated JSON objects to ElasticSearch

Use case: Structuring company data

A typical use case is a company intranet, company website or wiki that has grown out of proportion and lacks any form of common search interface. In many cases you also have domain-specific patterns in your data that you want to feed into dedicated fields in the search engine. This is seldom, if ever, supported by the built-in search engines, but it is a core component of the semantic web.

Install

Follow the installation instructions: https://github.com/tarjeir/vitus-elasticsearch-webintegration

Extract data from a web page

Crawler settings

  1. Open the home.folder/etc/crawler.properties file and follow the instructions

  2. Start by configuring crawler.url.filters. Use this setting to match the URLs you want to extract information from. These filters match the exact URLs as they appear in the source code of the HTML files, so be sure to check whether the target site uses relative or absolute URLs before you start. Let's create two patterns that match the wiki pages of the Vitus project: ^/tarjeir/vitus\\-elasticsearch\\-webintegration/wiki,^/tarjeir/vitus\\-elasticsearch\\-webintegration/wiki/[a-zA-Z0-9\\-%]+

  3. ElasticSearch needs to know which index the extracted data should be put into. Configure this by editing crawler.indexName and crawler.indexType.

  4. Next you can optionally configure crawler.indexHost and crawler.indexPort to point to either a local or a remote instance of ElasticSearch.

  5. The crawler.maxDepth parameter should be adjusted to reflect the page and link structure of your site. Depth is, in short, the number of levels in the link tree the crawler is allowed to traverse. Normally you need to set this relatively high or turn it off.

  6. The crawler.throttler.* parameters are important to adjust. They limit the number of requests sent towards the site you are visiting. A combined sketch of these settings follows this list.
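
To tie the steps above together, here is a purely illustrative crawler.properties fragment for the wiki example. The property names are the ones referenced above; the values are only examples, and the exact crawler.throttler.* names are documented in the file itself:

    # Illustrative values only; see the comments in crawler.properties for the authoritative documentation.
    crawler.url.filters=^/tarjeir/vitus\\-elasticsearch\\-webintegration/wiki,^/tarjeir/vitus\\-elasticsearch\\-webintegration/wiki/[a-zA-Z0-9\\-%]+
    crawler.indexName=vitus
    crawler.indexType=snapshot
    crawler.indexHost=localhost
    # Assumed port: 9300 for the Java transport client, 9200 if your setup expects the HTTP port.
    crawler.indexPort=9300
    # Example depth; set it relatively high or turn it off, depending on your site.
    crawler.maxDepth=10
    # crawler.throttler.*: tune the request rate as described in the property file.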

Transformation and cleaning

  1. Open home.folder/etc/transformation.json. This file contains the XPaths to the data you want to extract. XPath is an extremely powerful XML/HTML navigation language, and you can learn it here: http://www.w3schools.com/xpath/. (Hint: in Chrome, right-click the element you want to extract, choose "Inspect Element", then right-click the highlighted node and choose "Copy XPath" to get an exact XPath.)
    Add this content: {"code": "//*[@id='wiki-body']/div[1]/p/code/text()"}

  2. Open home.folder/etc/cleansing.json. In this file you add expressions that remove or transform the data in each of the corresponding fields. The fields support regular expression cleaning and will remove all matching occurrences. In addition, vim-like search and replace is supported: the pattern s/expression/replacewith/g replaces all matches of "expression" with the text "replacewith". In our example we simply remove all occurrences of "some garbage": {"code":"some garbage"}

  3. Open home.folder/etc/validation_pattern.json. This file contains all the data validation patterns. If a transformed data entry does not match the regular expression in the JSON file, the entry is not indexed and a warning is printed. In our case we want to validate JavaScript code. Example: {"code": "[0-9A-Za-z%\\s/\\.:,;\\&\\+=\\(\\)\\\\éä#\\$''\\-]+"} (If you leave the JSON object empty, all data is considered valid and inserted into the index.) A short illustrative sketch of these three stages follows this list.
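
To show how the three configuration files fit together, here is a minimal, self-contained Java sketch of the same idea: XPath extraction, regex-based cleaning and regex validation. It only illustrates the techniques the configuration drives and is not the Vitus source code; the class name and HTML snippet are made up.

    // Illustrative sketch only: not the Vitus source code.
    import java.io.ByteArrayInputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Pattern;
    import javax.xml.parsers.DocumentBuilderFactory;
    import javax.xml.xpath.XPathConstants;
    import javax.xml.xpath.XPathFactory;
    import org.w3c.dom.Document;

    public class PipelineSketch {
        public static void main(String[] args) throws Exception {
            String html = "<html><body><div id=\"wiki-body\"><div><p><code>"
                    + "var variable = new Code(); some garbage"
                    + "</code></p></div></div></body></html>";

            // 1. Transformation: extract a field with the XPath from transformation.json.
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8)));
            String code = (String) XPathFactory.newInstance().newXPath().evaluate(
                    "//*[@id='wiki-body']/div[1]/p/code/text()", doc, XPathConstants.STRING);

            // 2. Cleaning: remove every match of the pattern from cleansing.json.
            //    (A vim-like s/expression/replacewith/g entry corresponds to
            //    code.replaceAll("expression", "replacewith").)
            String cleaned = code.replaceAll("some garbage", "");

            // 3. Validation: only index the entry if the whole value matches the
            //    pattern from validation_pattern.json.
            String validation = "[0-9A-Za-z%\\s/\\.:,;\\&\\+=\\(\\)\\\\éä#\\$''\\-]+";
            if (Pattern.matches(validation, cleaned)) {
                System.out.println("valid, would be indexed: " + cleaned);
            } else {
                System.out.println("invalid, would be skipped with a warning");
            }
        }
    }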

Start the crawler

  1. cd into home.folder and run the bash script: ./docs/startCrawler.sh "https://github.com/tarjeir/vitus-elasticsearch-webintegration"

  2. Tail the logs in the home.folder/log folder

  3. A few errors are quite normal since this is a brute-force transformation and validation process. We plan to improve this at a later stage.

  4. To see the output, open http://localhost:9200/vitus/snapshot/_search?q=* in your browser. The single raw page should be indexed, and the field transformed_content should contain a validated and cleaned JSON object: transformed_content: {code: "var variable = new Code(); "} A sketch of the response follows below.
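
The relevant part of the search response might look roughly like the excerpt below. Apart from the transformed_content field shown in step 4, the surrounding layout is simply the standard ElasticSearch search-response structure and is shown only for orientation:

    {
      "hits": {
        "total": 1,
        "hits": [
          {
            "_index": "vitus",
            "_type": "snapshot",
            "_source": {
              "transformed_content": {
                "code": "var variable = new Code(); "
              }
            }
          }
        ]
      }
    }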

Tips

ElasticSearch handles a multitude of inputs but does not try to convert types unless this is specified in the mapping document (at least I have not found a way to do this yet). Vitus has no typecasting functionality yet, so I highly recommend creating a mapping before you start indexing. An example of a mapping document is included in the docs folder of the Vitus project.
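
As a rough illustration (this is a minimal sketch, not the example document shipped in the docs folder), a mapping for the snapshot type that declares the extracted code field as a string might look like this:

    {
      "snapshot": {
        "properties": {
          "transformed_content": {
            "properties": {
              "code": { "type": "string" }
            }
          }
        }
      }
    }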

Author

Tarjei Romtveit

Co-founder of Monokkel with solid experience in systems design, data management, data analysis, software development and agile processes.