Crawling for phenotype data

In this project, we implement computational services that simplify the collection, collation, storage, retrieval and analysis of phenotype data. The primary objective of the data crawler is to provide a consistent and interactive centralised management system for tracking the collection of data from various research centres, and for carrying out quality control over the collected data before they are made available to the general public and to researchers throughout the world.

The Crawler architecture

The crawler uses a multithreaded architecture. Each periodic crawl happens inside a crawling session, which consists of several processing phases. After a session has been initiated, the session thread starts a download manager. The task of the download manager is to spawn multiple threads, one for each of the file sources (i.e., FTP/SFTP servers) hosted by each of the centres. Each of these threads crawls one of the file sources by visiting its directories.
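
As a rough sketch of this phase (the class and method names below are illustrative, not the actual phenodcc-crawler API), the download manager might spawn one crawler thread per file source like this:

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    // Illustrative sketch only: the real classes in phenodcc-crawler may differ.
    public class DownloadManager {

        // One crawler thread per file source (an FTP/SFTP server URL).
        public void runCrawlingPhase(List<String> fileSources) throws InterruptedException {
            ExecutorService crawlers = Executors.newFixedThreadPool(fileSources.size());
            for (String source : fileSources) {
                crawlers.submit(() -> crawl(source)); // visit directories, record new zip files
            }
            crawlers.shutdown();                      // no new tasks; wait for all threads to return
            crawlers.awaitTermination(1, TimeUnit.HOURS);
            // ...the download and processing phase starts only after all crawlers return
        }

        private void crawl(String source) {
            // Walk the directory tree of the file source and record
            // unprocessed zip files in the tracker database (see below).
        }
    }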

[Figure: Crawler architecture]

During crawling, each crawler thread identifies new zip files that have not been processed. When choosing zip files, the crawler ignores files that do not conform to the IMPC file naming convention. For each file that should be processed, the crawler makes an entry in the tracker database identifying which file to download from which file source. The file download does not start immediately.
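
For illustration, the selection and tracking step might look like the sketch below. The regular expression is only a stand-in for the actual IMPC file naming convention, and the table and column names are hypothetical (the INSERT IGNORE idiom assumes a MySQL-style tracker database):

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.util.regex.Pattern;

    // Sketch of the selection and tracking step; names are hypothetical.
    public class ZipTracker {

        // Stand-in pattern for the IMPC naming convention, e.g. "J.2013-10-31.14.zip".
        private static final Pattern IMPC_ZIP =
            Pattern.compile("^[A-Za-z]+\\.\\d{4}-\\d{2}-\\d{2}\\.\\d+\\.zip$");

        private final Connection tracker; // connection to the tracker database

        public ZipTracker(Connection tracker) {
            this.tracker = tracker;
        }

        // Record a zip file for later download; skip files that do not conform
        // to the naming convention or that are already tracked.
        public void recordIfNew(String fileSource, String fileName) throws SQLException {
            if (!IMPC_ZIP.matcher(fileName).matches()) {
                return; // ignore non-conforming files
            }
            try (PreparedStatement insert = tracker.prepareStatement(
                    "insert ignore into zip_download (file_source, file_name) values (?, ?)")) {
                insert.setString(1, fileSource);
                insert.setString(2, fileName);
                insert.executeUpdate(); // the file itself is not downloaded yet
            }
        }
    }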

Once all of the crawling threads have returned, the download manager initiates the download and processing phase. In this phase, the download manager creates a thread pool of downloaders, with the pool size supplied by the user on the command line. The responsibility of each of these threads is to download and process the zip files that were mapped out by the crawlers.
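
A minimal sketch of the downloader pool, assuming the pool size arrives as the first command-line argument (the pending-file list and task body are placeholders):

    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    // Sketch: a downloader pool whose size is supplied on the command line.
    public class Downloaders {
        public static void main(String[] args) {
            int poolSize = Integer.parseInt(args[0]); // thread pool size from the user
            List<String> pendingZips = List.of();     // zip files mapped out by the crawlers
            ExecutorService pool = Executors.newFixedThreadPool(poolSize);
            for (String zip : pendingZips) {
                pool.submit(() -> downloadAndProcess(zip));
            }
            pool.shutdown();
        }

        private static void downloadAndProcess(String zip) {
            // Download the zip from its file source, then run the processing step.
        }
    }

A fixed-size pool keeps the number of concurrent downloads bounded regardless of how many files the crawlers found, which is presumably why the size is left to the user rather than derived from the workload.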

See the design manual for further details.

Source code

git clone https://github.com/mpi2/phenodcc-crawler.git