Slavko Žitnik, Tomaž Curk
Klemen Vovk, tutor
Data Science Master's, 2023/2024
University of Ljubljana, Faculty of Computer and Information Science.
No generally accepted definition of what data science is.
It is a mix of IT, analytics and soft skills.
data science = empirical science
The history of data science is as old as recorded history itself.
The census in ancient Egypt was used to size the labor force needed to build the pyramids.
Thales of Miletus (born c. 624 BC), Greek mathematician, astronomer and philosopher.
Thales grew rich from an olive harvest by predicting the weather.
After predicting a good harvest for a particular year, he secured the use of all the olive presses in Miletus.
He reserved the presses in advance, at a discount, and rented them out at a high price when demand peaked.
This is the first historically known creation and use of futures.
In the old times, empirical science was difficult:
Nowadays, data science is still difficult.
All these are now assisted by computer and software tools.
New challenges and raised standards in data management, software development, reproducible research and visualization.
The names of people involved with empirical data science evolved, from:
to cooler names:
Next catchy name?
The name of the people involved may change, but the core skills remain the same.
The demand for such skills is growing.
We now (still) live in a data-driven world:
A big shortage of people with such skills.
An introductory overview of topics relevant to data science.
Main topics:
Working with data. Getting. Processing. Storing. Cleaning. Summarizing. Visualizing.
Analytics. Prediction. Clustering. Statistical inference.
Business and social aspects. Privacy. Security. Ethics. Licensing. Intellectual property.
Best practices (tools). Programming, coding standards (Python). Versioning (GitHub). Reproducibility (Jupyter). Typesetting (LaTeX). Public repositories (arXiv, Zenodo).
Practice and other more academic courses will provide you with a fundamental understanding of the methods used.
You need to fail (make a mistake) to learn how to do better.
Data manipulation and computation using a computer.
You need to tell the computer what to do.
tools: Python programming language.
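As a small illustration of telling the computer what to do, the sketch below cleans a handful of values and summarizes them using only the Python standard library. The data and the names `raw` and `clean` are hypothetical and chosen for illustration; real course material may use different libraries.

```python
import statistics

# Hypothetical raw measurements; None and "n/a" stand for missing values.
raw = ["3.0", "5.0", None, "n/a", "7.0"]


def clean(values):
    """Keep only the entries that can be converted to float."""
    cleaned = []
    for value in values:
        try:
            cleaned.append(float(value))
        except (TypeError, ValueError):
            # Skip entries that are missing or not numeric.
            continue
    return cleaned


values = clean(raw)
summary = {
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
}
print(summary)
```

Every step (what counts as missing, which statistics are computed) is written down explicitly, so it can be inspected and rerun.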
Tools that help you document all your steps.
Be true to science:
You need to be transparent about how you obtained your results.
You need to provide a means for others to repeat your work.
tools: Git, Docker, Jupyter notebooks and other dynamic reporting and dashboarding tools.
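One small habit that supports repeatable work is fixing the random seed and recording the environment alongside the results. The following is a minimal sketch under that assumption; the seed value and the function name `sample_experiment` are hypothetical, and it does not replace version control or containers.

```python
import platform
import random
import sys

SEED = 42  # arbitrary value chosen for illustration


def sample_experiment(seed, n=5):
    """Draw n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


# Two runs with the same seed produce the same numbers.
run_1 = sample_experiment(SEED)
run_2 = sample_experiment(SEED)

# Record basic facts about the environment next to the results.
environment = {
    "python": platform.python_version(),
    "platform": sys.platform,
}
print(run_1 == run_2, environment)
```

Using a dedicated `random.Random` instance instead of the module-level functions keeps the seeded state local to the experiment.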
Retrieve and store data.
Most of the time is spent on getting, preprocessing and storing the data.
tools: web scraping, (non)relational databases
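The retrieve-and-store step can be sketched with the Python standard library alone. The example below parses a hypothetical HTML fragment (standing in for a downloaded page) and stores the extracted rows in an in-memory SQLite database. The fragment, the table name `items` and the helper names are assumptions made for illustration; real scraping usually involves an HTTP client and a more robust parser.

```python
import sqlite3
from html.parser import HTMLParser

# Hypothetical HTML fragment standing in for a downloaded web page.
PAGE = (
    "<table>"
    "<tr><td>apples</td><td>3</td></tr>"
    "<tr><td>pears</td><td>5</td></tr>"
    "</table>"
)


class CellCollector(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag == "td":
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell and data.strip():
            self.rows[-1].append(data.strip())


def scrape_rows(html):
    """Return the table rows found in `html` as tuples of strings."""
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return [tuple(row) for row in parser.rows if row]


def store_and_total(rows):
    """Store (name, amount) rows in SQLite and return the summed amount."""
    conn = sqlite3.connect(":memory:")
    with conn:  # commits the transaction on success
        conn.execute("CREATE TABLE items (name TEXT, amount INTEGER)")
        conn.executemany(
            "INSERT INTO items VALUES (?, ?)",
            [(name, int(amount)) for name, amount in rows],
        )
    total = conn.execute("SELECT SUM(amount) FROM items").fetchone()[0]
    conn.close()
    return total


rows = scrape_rows(PAGE)
total = store_and_total(rows)
print(rows, total)
```

Parameterized queries (the `?` placeholders) keep scraped text from being interpreted as SQL.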
Most of the added value in data science is achieved in the steps that lead to the extraction of information and new knowledge:
tools: R, Python and associated libraries.
Presented to students through lectures by faculty members and guest lecturers from industry and research institutions.