Introduction to data science (IDS)¶

Slavko Žitnik, Tomaž Curk
Klemen Vovk, tutor





Data Science Master's, 2023/2024
University of Ljubljana, Faculty of Computer and Information Science.

What is data science?¶

There is no generally accepted definition of data science.

It is a mix of IT, analytics and soft skills.

Our view of data science¶

data science = empirical science

  • Empirical science nowadays requires a much broader set of skills.
  • Business, industry and other non-scientific areas are becoming empirical sciences.

Examples¶

The history of data science is as old as recorded history itself.

Building pyramids in ancient Egypt.¶

Census in ancient Egypt was used to scale the labor force needed to build pyramids.

Weather forecasting to profit from olive oil futures in ancient Greece.¶

An olive mill and an olive press dating from Roman times.

Thales of Miletus (born c. 624 BC), Greek mathematician, astronomer and philosopher.

Thales grew rich from an olive harvest by predicting the weather.

He secured all the olive presses in Miletus after predicting the weather and a good harvest for a particular year.

Thales had reserved presses in advance, at a discount, and could rent them out at a high price when demand peaked, following his prediction of a particularly good harvest.

This is the first historically known creation and use of futures.

Empirical science¶

In the past, empirical science was difficult:

  • data needed to be recorded and manipulated by hand,
  • a well-equipped expert mathematician with logarithm and quantile tables was required for even the most basic statistical analyses,
  • results had to be plotted by hand.

Nowadays, data science is still difficult.

All of these tasks are now assisted by computers and software tools.
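
As a small illustration, the lookups that once required printed logarithm and quantile tables are now single function calls. The following is a minimal sketch using only the Python standard library; the input value 2024 and the 0.975 probability level are arbitrary choices made for this example, not values from the course.

```python
import math
from statistics import NormalDist

# Base-10 logarithm, formerly read from a logarithm table.
# The input 2024 is an arbitrary illustrative value.
log_value = math.log10(2024)

# 97.5% quantile of the standard normal distribution,
# formerly read from a quantile table.
z = NormalDist().inv_cdf(0.975)

print(f"log10(2024) = {log_value:.4f}")
print(f"z(0.975)    = {z:.4f}")
```

The same two lines of computation would once have needed a table lookup and manual interpolation.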

However, there are new challenges and raised standards in data management, software development, reproducible research and visualization.

The names for people involved in empirical data science have evolved, from:

  • scientist
  • empirical scientist
  • researcher
  • statistician

to cooler names:

  • quantitative analyst
  • quant
  • predictive analyst
  • data miner
  • machine learning expert
  • data scientist

Next catchy name?

The names of the people involved may change, but the core skills remain the same.

Core skill sets needed¶

  • analytical thinking
  • attention to detail
  • communication and domain-specific knowledge
  • mathematics and computation
  • science
  • reporting and presentation

The demand for such skills is growing.

We now (still) live in a data-driven world:

  • all empirical sciences
  • finance
  • marketing
  • sports training
  • politics
  • HR
  • medicine

There is a big shortage of people with such skills.

Data Science course¶

An introductory overview of topics relevant to data science.

Main topics:

  • Working with data. Getting. Processing. Storing. Cleaning. Summarizing. Visualizing.

  • Analytics. Prediction. Clustering. Statistical inference.

  • Business and social aspects. Privacy. Security. Ethics. Licensing. Intellectual property.

  • Best practices (tools). Programming, coding standards (Python). Versioning (GitHub). Reproducibility (Jupyter). Typesetting (LaTeX). Public repositories (arXiv, Zenodo).

Goal of the course¶

  • To gain breadth, not depth.
  • To familiarize ourselves with all key ingredients of data science from a practitioner's perspective.
  • To learn enough methods, techniques and tools to deal with the vast majority of data science scenarios in real life.

Practice and other more academic courses will provide you with a fundamental understanding of the methods used.

  • Focus on doing.

You need to fail (make a mistake) to learn how to do better.

Key ingredients of a data science practitioner¶

Programming¶

Data manipulation and computation using a computer.

You need to tell the computer what to do.

tools: Python programming language.

Reproducibility¶

Tools that help you document all your steps.

Be true to science:

  • You need to be transparent about how you obtained your results.

  • You need to provide a means for others to repeat your work.

tools: Git, Docker, Jupyter notebooks and other dynamic reporting and dashboarding tools.
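
One small part of reproducibility can be sketched in plain Python: fixing the random seed and recording the environment, so that someone else can rerun a step and get the same numbers. This is only an illustrative sketch. The names `SEED`, `run_analysis` and `describe_environment` are hypothetical, and the "analysis" is a stand-in that just draws random numbers.

```python
import platform
import random
import sys

SEED = 42  # hypothetical seed; any fixed, documented value works


def run_analysis(seed: int) -> list[float]:
    """Stand-in for a real analysis step that uses randomness."""
    rng = random.Random(seed)  # local generator, independent of global state
    return [rng.random() for _ in range(3)]


def describe_environment() -> dict[str, str]:
    """Collect basic facts another person needs to rerun the work."""
    return {
        "python": sys.version.split()[0],
        "platform": platform.platform(),
        "seed": str(SEED),
    }


first = run_analysis(SEED)
second = run_analysis(SEED)
print("identical reruns:", first == second)
print(describe_environment())
```

Tools such as Git and Docker address the same goal at a larger scale, by versioning the code and pinning the whole software environment.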

Working with data¶

Retrieve and store data.

Most of the time is spent on getting, preprocessing and storing the data.

tools: web scraping, (non)relational databases
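
Storing and retrieving data in a relational database can be sketched with Python's built-in `sqlite3` module. The table name `temperature` and the rows below are hypothetical, made up for this example, and the database lives in memory only.

```python
import sqlite3

# Hypothetical measurements, e.g. as they might come out of a scraping step.
rows = [("Ljubljana", 21.5), ("Maribor", 19.0), ("Koper", 24.2)]

conn = sqlite3.connect(":memory:")  # in-memory database, nothing written to disk
conn.execute("CREATE TABLE temperature (city TEXT PRIMARY KEY, celsius REAL)")
conn.executemany("INSERT INTO temperature VALUES (?, ?)", rows)
conn.commit()

# Retrieve the stored data with a parameterized query.
warm = conn.execute(
    "SELECT city FROM temperature WHERE celsius > ? ORDER BY city", (20.0,)
).fetchall()
print(warm)
conn.close()
```

Passing values through `?` placeholders rather than building the SQL string by hand keeps the query safe when the inputs come from scraped or otherwise untrusted data.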

Understanding data and extracting information¶

Most of the added value in data science is achieved in these steps that lead to the extraction of information and new knowledge:

  • Summarization of data.
  • Visualization of data.
  • Exploratory data analysis.
  • Dealing with missing data.
  • Predictive modeling.
  • Reporting and presenting.

tools: R, Python and associated libraries.
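
Two of the steps above, summarization and dealing with missing data, can be sketched with the Python standard library alone. The values are hypothetical, and mean imputation is used here only because it is the simplest option to show, not because it is the recommended one.

```python
import statistics

# Hypothetical measurements; None marks a missing value.
values = [4.0, 7.0, None, 5.0, None, 8.0]

# Separate the observed values from the missing ones.
observed = [v for v in values if v is not None]
n_missing = len(values) - len(observed)

# Summarize the observed values only.
summary = {
    "n": len(values),
    "missing": n_missing,
    "mean": statistics.mean(observed),
    "median": statistics.median(observed),
    "stdev": statistics.stdev(observed),
}

# Simple imputation: replace each missing value with the observed mean.
imputed = [summary["mean"] if v is None else v for v in values]

print(summary)
print(imputed)
```

In practice, libraries such as pandas provide the same operations for whole tables, but the logic is the same: count what is missing, summarize what is observed, and decide explicitly how the gaps are handled.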

Topics and key ingredients of data science¶

Presented to students through lectures by faculty members and guest lecturers from industry and research institutions.