Distributed Systems Engineer - Web Scraping

Job description

We are looking for talented engineers to help build and maintain the distributed data collection system that is at the heart of our business.

We are a data-driven company that collects and processes more than 500 GB of raw data daily. We leverage big data technologies such as Serverless and Spark on AWS EMR to crunch this volume of data and make it queryable.

In this role, you will ensure that our data collection engine, which consists of distributed web crawlers, is state of the art and ahead of our competition. You will ensure that we can scrape any webshop, regardless of the ban-detection measures that have been put in place. You will also make sure that the proper monitoring tools are in place. We are currently scraping 60 sites, and your goal is to at least triple that number without losing completeness or quality.

As a distributed systems engineer you will be responsible for the following topics:

  • Distributed web crawling architectures.
  • Cost-effective data processing architectures.
  • Advanced system monitoring solutions & dashboards.
  • Advanced methods for interpreting scraped HTML.
  • Advanced proxy management.

On top of all this, you’ll make sure that Daltix stays competitive in terms of data collection by using the latest & most suitable technologies throughout our stack.

About the stack

  • This distributed system is built on top of Amazon Web Services and uses Serverless architectures where possible, with Python & JavaScript being the main programming languages used.
  • As Daltix scales from 50+ websites to 200+ websites (which it scrapes multiple times per day!), it has to invest in orchestration technologies such as Kubernetes, as well as logging & monitoring solutions, to keep an overview at scale.

Requirements

Minimum

  • At least 5 years of experience in object-oriented software engineering & design in any object-oriented programming language.
  • Experience with and understanding of large-scale web crawling.
  • Experience with databases and SQL.
  • Experience with infrastructure such as load balancers, caches, …
  • Highly proficient in spoken and written English.
  • You never stop learning.

Preferred

  • Have experience building on top of Amazon Web Services.
  • Have programming experience with Python.
  • Expert knowledge of web-scraping & web-scraping architectures.
  • Experience with Go & JavaScript (Node.js) is a plus.
  • Experience with big data technologies (such as Hadoop, Spark, Airflow, Cassandra, Elasticsearch) is a plus.
  • Have a deep understanding of cloud possibilities and limitations in the areas of distributed systems, load balancing and networking, massive data storage, and security.
  • Get energy from working in a highly complex and challenging startup environment with a high-tech product.

What can Daltix offer you?

Daltix offers a competitive salary (including various benefits) and a young, dynamic, and international atmosphere to work in (we have offices in Belgium and Portugal).

You will also have the option to work from home if you prefer (even if you live in Lisbon).

When you start working at Daltix, you will get a deep-dive experience. You will learn all you need to know about us, our journey, your future colleagues, the tools we work with, etc.

Going beyond is coded into our company DNA. As soon as you start working, we expect a hands-on approach and an entrepreneurial mentality.

You will also be able to participate in relevant training to stay at the top of this field.

Besides developing your technical skills, you will also have the opportunity to grow into the following skill sets:

  • Technical/architectural lead.
  • Software project management.
  • Team leading & coaching.