Course Description
This course covers methods that became influential, with few exceptions, during the two decades
around the turn of the twenty-first century. The main motivation for the development of these methods was
the tremendous increase in the volume and complexity of data in the 1990s. One of the important takeaways of the course is that even traditional methods, such as linear regression, need to be reviewed and adjusted for the realities of the big-data era.
Key features of this course are the fundamental academic background of the lecturer, combined with lifelong
teaching experience in multiple subjects related to data analysis and with experience solving real-life problems.
On one hand, the course presents deep conceptual content in an accessible form, with a large number of
interactive tools and examples. On the other hand, it contains a large number of workshops that help students
acquire hands-on experience.
This course illustrates methods with examples from a broad range of application areas in which the lecturer has had real experience during his career as a data scientist. One area in which the lecturer has specialized for an especially long time is the financial industry. Depending on demand from the audience, the lecturer can include discussions of and insights into applications in specific areas.
Projects in the course use R and Python. Familiarity with both languages, at least at an introductory level, is required.
Course Contents
Important Note: Changes may occur to the syllabus at the instructor's discretion. When changes are made, students will be notified via email and in-class announcement.
SESSION 1: Introduction
- Data Analysis from Gauss to Google.
- Review of regression methods, linear models with a large number of predictors, selection of predictors.
- Review of Principal Components Analysis applied to regression analysis (PCR).
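As a preview of the hands-on workshops, here is a minimal sketch of principal components regression in Python with scikit-learn; the synthetic data, latent-factor construction, and number of components are illustrative choices, not course materials:

```python
# Principal Components Regression (PCR): project many correlated
# predictors onto a few principal components, then fit OLS on them.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
Z = rng.normal(size=(200, 5))                 # hidden latent factors
W = rng.normal(size=(5, 50))
X = Z @ W + 0.1 * rng.normal(size=(200, 50))  # 50 correlated predictors
y = Z[:, 0] + rng.normal(scale=0.1, size=200)

# Keep only the first 5 components before the linear model.
pcr = make_pipeline(PCA(n_components=5), LinearRegression())
pcr.fit(X, y)
print(round(pcr.score(X, y), 3))
```

Because the 50 predictors are driven by only 5 latent factors, a handful of components recovers nearly all the signal that a full 50-predictor regression would.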
SESSION 2: Regularization Techniques
- Regularization and shrinkage of parameters in regression analysis.
- Ridge and lasso regression methods.
- Comparison with other regression methods on data with large number of predictors.
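A minimal sketch of shrinkage in Python with scikit-learn; the data and penalty strengths below are illustrative:

```python
# Ridge (L2) and lasso (L1) both shrink coefficients toward zero;
# the lasso can set some exactly to zero, selecting predictors.
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)  # 2 real signals

ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)

print("nonzero ridge coefficients:", int(np.sum(ridge.coef_ != 0)))  # all 20
print("nonzero lasso coefficients:", int(np.sum(lasso.coef_ != 0)))  # far fewer
```

With only two true predictors among twenty, the lasso zeroes out most of the noise coefficients while ridge merely shrinks them.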
SESSION 3: Decision Trees
- Decision trees for regression and classification: their assumptions, strengths, and limitations.
- Comparison of tree regression with linear and generalized linear models.
- Review of measures of classification decision quality: confusion matrix and log-loss.
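A short sketch of the two quality measures on a fitted tree, using scikit-learn; the synthetic dataset and tree depth are illustrative:

```python
# Fit a shallow classification tree, then assess it with a
# confusion matrix (hard labels) and log-loss (predicted probabilities).
from sklearn.datasets import make_classification
from sklearn.metrics import confusion_matrix, log_loss
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

pred = tree.predict(X)
proba = tree.predict_proba(X)
print(confusion_matrix(y, pred))        # rows: true class, columns: predicted
print(round(log_loss(y, proba), 3))     # smaller is better
```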
SESSION 4: Bagging and Boosting
- Random forests and the methods of bagging and boosting: parallel and sequential ways of building model complexity.
- Applications
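The parallel/sequential contrast can be sketched with scikit-learn's two standard ensembles; the dataset and ensemble sizes are illustrative:

```python
# Bagging grows many deep trees in parallel on bootstrap samples
# (random forest); boosting grows shallow trees sequentially, each
# one correcting the errors of its predecessors.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_informative=10, random_state=0)
models = {
    "random forest (bagging)": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient boosting": GradientBoostingClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(name, round(scores[name], 3))
```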
SESSION 5: Support Vector Machines
- Support vector machines for regression and classification
- Kernel trick and review of selected kernels
- Applications and comparison with other methods for regression and classification
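The kernel trick can be illustrated in a few lines with scikit-learn; the concentric-circles data below is a standard toy example, not course material:

```python
# The kernel trick lets an SVM fit a nonlinear boundary without ever
# computing the high-dimensional feature map explicitly.
from sklearn.datasets import make_circles
from sklearn.svm import SVC

X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)  # no line separates the circles
rbf = SVC(kernel="rbf").fit(X, y)        # radial kernel handles them easily

print("linear kernel accuracy:", round(linear.score(X, y), 2))
print("RBF kernel accuracy:", round(rbf.score(X, y), 2))
```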
SESSION 6: Filtering
- Recommender algorithms
- Regression vs. collaborative filtering
- Review of factorization machines
- Applications for classification, prediction and collaborative filtering
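The factorization idea behind collaborative filtering can be sketched in plain NumPy; the tiny ratings matrix, rank, and learning rate below are toy values chosen for illustration:

```python
# Collaborative filtering by matrix factorization: approximate the
# ratings matrix R as U @ V.T, learning U and V by gradient descent
# on the observed entries only (zeros mark missing ratings).
import numpy as np

rng = np.random.default_rng(0)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)
mask = R > 0                              # which ratings were observed
k, lr, reg = 2, 0.05, 0.01
U = 0.1 * rng.normal(size=(4, k))         # user factors
V = 0.1 * rng.normal(size=(4, k))         # item factors

for _ in range(2000):
    E = mask * (R - U @ V.T)              # error on observed entries only
    U += lr * (E @ V - reg * U)
    V += lr * (E.T @ U - reg * V)

print(np.round(U @ V.T, 1))               # filled-in ratings matrix
```

The unobserved cells of `U @ V.T` are the model's rating predictions, which is exactly the recommendation step.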
SESSION 7: Neural Networks
- Neural networks: from biological to artificial neural networks.
- Statistical models as neural networks.
- Applications
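The "statistical models as neural networks" view can be made concrete in NumPy: logistic regression is a one-neuron network, a linear layer followed by a sigmoid. The data and learning rate here are illustrative:

```python
# Logistic regression trained as a one-neuron neural network:
# linear combination -> sigmoid activation -> gradient descent on log-loss.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)   # linearly separable labels

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))      # sigmoid "activation"
    grad_w = X.T @ (p - y) / len(y)         # gradient of the log-loss
    grad_b = np.mean(p - y)
    w -= lr * grad_w
    b -= lr * grad_b

acc = np.mean((p > 0.5) == y)
print("training accuracy:", acc)
```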
SESSION 8: Deep Learning
- Introduction to deep learning: motivation, comparison with other methods.
- Backpropagation algorithm
- Introduction to TensorFlow
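Before turning to TensorFlow, backpropagation itself can be written out by hand for a tiny two-layer network; the XOR task, layer sizes, and learning rate are the usual textbook illustration:

```python
# Backpropagation on a two-layer network learning XOR: a forward pass,
# then errors propagated backward layer by layer via the chain rule.
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # hidden layer
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # output layer
lr = 0.5

for _ in range(5000):
    h = np.tanh(X @ W1 + b1)                 # forward: hidden activations
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))     # forward: output probability
    dz2 = p - y                              # backward: output-layer error
    dz1 = (dz2 @ W2.T) * (1 - h ** 2)        # backward: chain rule through tanh
    W2 -= lr * h.T @ dz2; b2 -= lr * dz2.sum(0)
    W1 -= lr * X.T @ dz1; b1 -= lr * dz1.sum(0)

print(np.round(p.ravel(), 2))   # should approach [0, 1, 1, 0] after training
```

Frameworks like TensorFlow automate exactly these backward-pass derivatives via automatic differentiation.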
SESSION 9: Deep Learning Continued
- Keras basics, main architectures, sequential networks
- Fitting and tuning models; using models for prediction
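A minimal Keras Sequential sketch, assuming TensorFlow is installed; the layer sizes, activations, and optimizer are illustrative choices rather than course material:

```python
# A small feed-forward network built with the Keras Sequential API:
# define the stack of layers, compile with a loss, then predict.
import numpy as np
from tensorflow import keras

model = keras.Sequential([
    keras.layers.Input(shape=(20,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

X = np.random.default_rng(0).normal(size=(50, 20)).astype("float32")
pred = model.predict(X, verbose=0)
print(pred.shape)               # one probability per input row
```

Fitting and tuning then proceed through `model.fit` with the usual epoch and batch-size arguments.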
SESSION 10: NLP
- Introduction to natural language processing.
- Feature extraction and engineering.
- Example project
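The feature-extraction step can be previewed with the standard bag-of-words transform in scikit-learn; the three toy documents are illustrative:

```python
# Bag-of-words feature extraction: each document becomes a sparse
# vector of token counts over a vocabulary learned from the corpus.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "the cat sat on the mat",
    "the dog ate my homework",
    "the cat and the dog",
]
vec = CountVectorizer()
X = vec.fit_transform(docs)

print(sorted(vec.vocabulary_))   # learned vocabulary
print(X.toarray())               # document-term count matrix
```

These count vectors (or their TF-IDF reweighting) are what the downstream regression and classification methods from earlier sessions consume.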