What is data science anyway?

There is no generally accepted definition of what data science is. We would probably all agree that it is a relatively new term and that it is a field that requires a mix of IT, analytics and soft skills.

Our view is that data science is synonymous with empirical science and that the new term is a result of two major changes. First, empirical science has evolved to a level that now requires a much broader skill set. Second, business, industry and other non-scientific areas have evolved to a level where more and more processes require treatment at the level of empirical science.

Examples of empirical science go back as far as recorded history itself, from censuses in ancient Egypt that were used to scale the pyramid labor force to weather forecasting used to profit from olive oil futures in ancient Greece. Even as recently as 100 years ago, empirical science was much more difficult than it is today (not that it is not still very difficult if done properly). Data were recorded and manipulated by hand. Unless one was an expert mathematician and well-equipped with logarithm and quantile tables, it was near-impossible to do even the most basic statistical analyses. And to spread the results, one would write the report and plot the graphs by hand.

Over time, this skill set has evolved, with computers playing the leading role. Pens were replaced by typesetting software. Graph paper was replaced by visualization packages. And logarithm tables were replaced by computer code. Indeed, a lot of the brain-intensive mathematics gave way to computationally intensive numerical approaches, both out of convenience and out of necessity. On the other hand, progress also brought new challenges and raised standards in data management, software development, reproducible research and visualization.

The names that describe people with this skill set have also evolved, from scientist, empirical scientist, researcher and statistician to cooler names, such as quantitative analyst, quant, predictive analyst, data miner, machine learning expert and now data scientist. Tomorrow, data science will probably be replaced by another, even catchier term. But the core skills will remain the same: analytical thinking, attention to detail, communication and domain-specific knowledge. Mathematics and computation. Science.

Like the skill set itself, the demand for it has always been present. And growing. To the point where we now live in an increasingly data-driven world: all empirical sciences, finance, marketing, sports training, politics, HR and medicine, to name just a few, rely on data. Today, the practical reality is that there is a big shortage of people with such skills or even just some of them. This is a great opportunity for those considering becoming data scientists, but also a challenge for educational institutions: how best to prepare students for a career in this data-driven world.

0.1 The role of this course

This course aims at breadth, not depth. The goal is to familiarize ourselves with all key ingredients of data science from a practitioner's perspective. We will learn enough methods, techniques and tools to deal with the vast majority of data science scenarios we encounter in real life. However, only with practice and through more academic courses will we be able to develop a fundamental understanding of some of these methods and refine their use to a professional standard. That is, this course focuses on doing. With time, study, practice and through mistakes, we will learn how to do better.

So what do we think are the key ingredients of a data science practitioner? The first and most important is computer programming. Data science requires data manipulation and computation that cannot be done without a computer. Working with data also represents most of the workload in a typical data science scenario. And if we are not able to tell the computer exactly what to do, we cannot be professional data scientists. So, it should come as no surprise that Chapter 1 is dedicated to the Python programming language. Python and R are the two most common languages in data science and we advocate that a professional data scientist should learn both. Python is the more versatile of the two and thus more suitable for an introductory course such as this one.

Once we are able to do something, it becomes very important that we do it in a transparent and easily repeatable way. We put a lot of emphasis on reproducibility and the tools that help us achieve it. Why? First, because we need to be true to the word science in data science. If it is not clear how something was done, if some steps cannot be reproduced, then we cannot have confidence in the results and that is not science! And second, from a purely practical perspective, if our work is easily reproducible, we ourselves will have a much easier job of repeating our analyses on different data or with slightly modified methodology. The time we invest in reproducibility will be repaid tenfold later on. We will cover source code control in Chapter 2, Docker for portability in Chapter 6 and Jupyter notebooks and other dynamic reporting tools in Chapter 3.

In practice, data are in most cases not neatly arranged in tabular format and stored in a CSV file. It is important to know how to retrieve and deal with less structured data and how to store data in a more systematic and efficient way. We will cover web scraping in Chapter 4 and relational databases and SQL in Chapter 9.

While most of the work is in getting, preprocessing and storing the data, most of the added value in data science is in extracting valuable information from data. First, we need to learn how to efficiently and effectively summarize data and do exploratory data analysis. In most cases, these techniques combined with some method of quantifying uncertainty in our summaries will be all that is required to successfully complete our analysis! This broad topic will be covered in three chapters: basic numerical summaries for univariate and bivariate data are covered in Chapter 5, standard visualization techniques are covered in Chapter 7 and techniques for multivariate data are covered in Chapter 8. Predictive modelling, the family of tasks that includes the vast majority of the analytic tasks we encounter in practice, is covered in Chapter 10. And finally, dealing with missing data - a very important topic and a topic that requires us to combine all the summarization, exploratory and predictive techniques - is covered in Chapter 11.
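To make the idea of a summary paired with a measure of uncertainty concrete, here is a minimal sketch in Python. It reports a sample mean together with its standard error, which is only one of many possible ways to quantify uncertainty and not necessarily the approach taken in later chapters. The function name `summarize` and the measurements are hypothetical and made up purely for illustration.

```python
import math
import statistics


def summarize(values):
    """Return the sample mean and the standard error of that mean."""
    n = len(values)
    mean = statistics.mean(values)
    # Sample standard deviation (uses n - 1 in the denominator).
    sd = statistics.stdev(values)
    # The standard error shrinks as we collect more observations.
    se = sd / math.sqrt(n)
    return mean, se


# Hypothetical measurements, invented for this example only.
heights_cm = [172.0, 168.5, 181.2, 175.4, 169.9, 178.3]
mean, se = summarize(heights_cm)
print(f"mean = {mean:.1f} cm, standard error = {se:.1f} cm")
```

Reporting the standard error alongside the mean tells the reader not only what we estimated but also how much that estimate would be expected to vary if we repeated the data collection.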