Data Engineer or Data Scientist? The jobs you’re looking for
Reading time: 4 minutes

By André Pires, Big Data Engineer, and Nuno Chicória, Data Scientist @Xpand IT

“90% of the existing data was created in the last two years”. As the quantity of data grows, so does the need to store it, analyze it, and make it useful. Out of these needs, different jobs started to emerge in the IT world, each with its own goals. It is in this landscape that we find data scientists and data engineers: united by data but separated by goals. In this article, we aim to clarify the main differences between the roles of a Data Engineer and a Data Scientist, since you probably have many questions about your career. Let’s find out which job better fits your interests.

Data scientist role explained

The number of data scientists has doubled over the last 4 years. The role has been consistently identified as the number one job in the US and has also been branded the sexiest job of the 21st century. But what does all this mean for this emerging job? What does data science even mean?

There is a running joke that a data scientist is better at programming than a statistician and better at statistics than a programmer (we never said it was a good one). That “joke” alone hints at the different skills a data scientist must have to thrive in the IT world. It’s not only programming, because at the end of the day we still care about Euclidean distances, standard deviations or averages; and it’s not only statistics, because you would never ask a statistician to train a Support Vector Machine in Python [1].
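To make the mix of statistics and programming concrete, here is a minimal sketch that computes the three quantities mentioned above using only Python's standard library. The sample values and variable names are invented for illustration; in day-to-day work a package such as NumPy would probably be used instead.

```python
import math
import statistics

# Hypothetical measurements, invented for illustration only.
heights_cm = [160.0, 170.0, 180.0]

average = statistics.mean(heights_cm)   # arithmetic mean
spread = statistics.stdev(heights_cm)   # sample standard deviation (n - 1)

# Euclidean distance between two points in a 2-D feature space.
point_a = (0.0, 0.0)
point_b = (3.0, 4.0)
distance = math.dist(point_a, point_b)  # requires Python 3.8+

print(average, spread, distance)
```

The programming part is trivial here; the statistical part is knowing, for instance, whether the sample or the population standard deviation is the right one for the question being asked.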

Data science is a mixture of tools, theoretical knowledge, and the limitations that those entail.

Now, how do we introduce some order into this cocktail of skills and knowledge? With questions. A data scientist lives and dies by their ability to ask the right questions. We can create a pipeline for a data scientist’s work (believe me, we’ve done it), and the foundation of it all is data exploration, which is heavily driven and influenced by the questions you ask. When we talk about data science, you will often hear that it is 80% exploratory data analysis (EDA) and 20% modelling, which is very close to the truth. Understanding the data received and knowing how to manipulate and “prepare” it are some of the most important skills needed to create a good data analysis or machine learning model that answers the questions we asked.
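As a rough illustration of what “understanding and preparing” the data can look like, the sketch below asks one simple question (“how much of each column is missing?”) and then applies one possible preparation step. The records, the column names and the choice of median imputation are assumptions made for this example, not a prescribed method.

```python
import statistics

# Hypothetical raw records; None marks a missing value.
records = [
    {"customer": "a", "age": 34, "spend": 120.0},
    {"customer": "b", "age": None, "spend": 80.0},
    {"customer": "c", "age": 29, "spend": None},
    {"customer": "d", "age": 41, "spend": 200.0},
]

def missing_counts(rows):
    """Exploration: count missing (None) values per column."""
    counts = {}
    for row in rows:
        for column, value in row.items():
            counts[column] = counts.get(column, 0) + (1 if value is None else 0)
    return counts

def fill_with_median(rows, column):
    """Preparation: replace missing values in `column` with the median
    of the observed values, returning new rows (input is not mutated)."""
    observed = [row[column] for row in rows if row[column] is not None]
    median = statistics.median(observed)
    return [
        {**row, column: median if row[column] is None else row[column]}
        for row in rows
    ]

print(missing_counts(records))
prepared = fill_with_median(fill_with_median(records, "age"), "spend")
print(prepared)
```

Whether the median is an acceptable substitute depends entirely on the question being asked, which is why exploration tends to take most of the time.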

The next question should be: ‘which tools fit the trade?’ In data science, there is no go-to solution. You will have to learn and work with a variety of tools. For example, there is no single standard programming language. You will see many people use Python, but R [2], MATLAB and Java are other viable options. Within these, you will find numerous packages that help you do your job faster and more easily (NumPy [3] and pandas in Python, and the tidyverse [4] in R). Finally, as your knowledge grows, you will start to delve into Machine Learning and Deep Learning, where you will also find many useful packages (scikit-learn [5], TensorFlow [6], PyTorch [7]).

In summary, a data scientist “only” needs to understand the business, know how to ask the right questions, be a statistician, be a programmer and be familiar with a “small” set of tools. The data scientist can be labelled the jack of all trades of the IT world.

What about Data Engineers?

If a Data Scientist is the jack of all trades of the IT world, a Big Data Engineer is the master plumber of the data-driven world. Why the master plumber? Because it is their responsibility to architect, design and develop the data pipelines (batch or streaming/near-real-time) that will be the backbone of future data-driven organizations.

A Data Engineer’s day-to-day consists of engineering (yes, most of our work is engineering!) efficient data processing pipelines and platforms that take an organization’s data from sources to destinations efficiently and with the best possible data quality. This way, organizations can leverage their best asset, their data, for operational purposes (e.g. feeding critical backends that serve as operational foundations) and analytical purposes (e.g. being used by Data Scientists to extract game-changing insights for the organization). Data Engineers should be seen as the facilitators of data, and their ultimate goal should be to make data easily available so that Data Scientists (though not exclusively) can shine.
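The source-to-destination flow described above can be sketched as a chain of small stages. This is a toy, single-process illustration in plain Python, not how a production pipeline on Spark or Kafka would be written; the stage names (`extract`, `validate`, `load`), the `user,amount` line format and the sample data are all hypothetical.

```python
from typing import Dict, Iterable, Iterator

def extract(raw_lines: Iterable[str]) -> Iterator[dict]:
    """Parse 'user,amount' lines from a (hypothetical) source into records."""
    for line in raw_lines:
        user, _, amount = line.strip().partition(",")
        yield {"user": user, "amount": amount}

def validate(records: Iterable[dict]) -> Iterator[dict]:
    """Data-quality gate: drop records with no user or a non-numeric amount."""
    for record in records:
        try:
            amount = float(record["amount"])
        except ValueError:
            continue  # malformed amount: skip the record
        if record["user"]:
            yield {"user": record["user"], "amount": amount}

def load(records: Iterable[dict]) -> Dict[str, float]:
    """Aggregate totals per user; the dict stands in for a real destination."""
    totals: Dict[str, float] = {}
    for record in records:
        totals[record["user"]] = totals.get(record["user"], 0.0) + record["amount"]
    return totals

# Invented sample input containing two bad records.
source = ["alice,10.5", "bob,abc", ",3.0", "alice,4.5", "bob,2"]
totals = load(validate(extract(source)))
print(totals)
```

Because each stage is a generator that handles one record at a time, the same stages can consume a finite list (batch) or an unbounded iterator (streaming), which is one reason this shape is common in pipeline code.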

Good data engineering can magnify the output of Data Scientists tenfold by providing good, timely data in the format that best suits their work.

Usually a Data Engineer is the middleman in the data world, responsible for integrating data across multiple boundaries: technological, political, and departmental. This usually makes a Data Engineer a facilitator and system integrator.

Being a Data Engineer calls for very good knowledge of distributed systems such as:

  • Storage Engines (e.g. HDFS [8], S3, NoSQL DBs)
  • Message Brokers (e.g. Apache Kafka [9], Apache Pulsar)
  • Distributed Processing Systems/Frameworks (e.g. Apache Spark [10], Apache Hive, Apache Impala, Kafka Streams, Apache Flink)
  • Cloud Computing Platforms (e.g. Azure, AWS)

A Data Engineer usually works with strongly typed, structured and very efficient programming languages/runtimes (e.g. Scala, Java) that allow them to produce robust and very fast data processing pipelines. Since a Data Engineer most of the time applies the “glue” that attaches multiple systems (including Data Science projects), they should also be well versed in Python and other languages commonly used by Data Scientists.

Data engineering also requires very good data knowledge, though not in the same way as data science. One needs to know how to map the properties of the data at hand (type, volume, production rate, relationships and more) to the most efficient processing transformations, physical storage layers and so on. Very good knowledge of SQL-like languages is also desirable.
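As a small illustration of the SQL knowledge mentioned above, the following sketch runs a typical filter-and-aggregate query against an in-memory SQLite database from Python's standard library. The `events` table, its columns and its rows are invented for the example; engines such as Hive or Impala use their own SQL dialects, so treat this only as the general shape of such a query.

```python
import sqlite3

# In-memory database standing in for a real storage layer.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (customer TEXT, kind TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        ("alice", "purchase", 10.0),
        ("bob", "purchase", 5.0),
        ("alice", "refund", -2.0),
        ("alice", "purchase", 7.0),
    ],
)

# Number of purchases and total purchase amount per customer.
rows = conn.execute(
    """
    SELECT customer, COUNT(*) AS purchases, SUM(amount) AS total
    FROM events
    WHERE kind = 'purchase'
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()
conn.close()

print(rows)
```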

In the end, the equation is: Data Engineer = Distributed Systems + Software Engineering + Data Knowledge.

To sum up…

Despite their differences, Data Engineers and Data Scientists should coexist in the same environment. We may share some tools, and there is an overlap of skills, but the skills that define us, and our goals, are very different and well established. Both roles bring great value to the IT world, but in different ways. While data engineers focus on efficient data processing/treatment, movement and storage, data scientists focus on knowledge discovery and data analysis. Regardless of your specific path, curiosity is the common prerequisite of these data-driven careers.

Future data platforms will be built on a well-balanced combination of Data Engineering and Data Science, which should amplify each other instead of trying to overshadow each other.

[1] https://www.python.org

[2] https://www.r-project.org

[3] https://www.numpy.org

[4] https://www.tidyverse.org

[5] https://scikit-learn.org/stable/

[6] https://www.tensorflow.org

[7] https://pytorch.org

[8] https://en.wikipedia.org/wiki/Apache_Hadoop

[9] https://kafka.apache.org/

[10] https://spark.apache.org/