We are looking for talented people to help build and maintain the distributed data collection system at the heart of our business. You'll be on the front lines, facing massive (but interesting!) challenges as we try to scrape all available retail data.
We are a data-driven company that collects and processes more than 600 GB of raw data (HTML) every day. We leverage big data technologies such as serverless computing and Spark on AWS EMR to crunch these volumes of data and make them queryable.
In this role you will ensure that our data collection engine, which consists of distributed web crawlers, is state of the art and ahead of our competition. You will ensure that we can scrape any webshop, no matter what ban detection has been put in place. You will also make sure that proper monitoring tools are in place. We currently scrape 60 sites, and your goal is to at least triple that number without losing completeness or quality.
About the stack
Preferably you also have:
Daltix is a young company from Ghent (BE), with offices in Boom (BE) and Lisbon (PT), bringing real-time insights to the world of retail. We have developed a set of tools to gather, process and analyze e-commerce data from webshops. Every day, we collect prices, promotions and assortment data from a myriad of e-commerce channels. We turn this data into actionable information for the right people at the right time by extracting high-level insights, introducing more structure with AI techniques, and enriching the information with alternative data sources. Retailers and suppliers use these insights to guide their market positioning.