Portfolio Manager and Data Scientist
Data science and machine learning form a fast-growing industry with a slow-growing skills base. In fact, the resulting skills shortage is costing the UK economy £2 billion a year, yet most organisations are unaware of how to find and structure a data science team. Join Carlos Salas in this video as he explores the data science workflow and machine learning algorithms.
Western economies are currently struggling to close the skills gap in data-intensive workforce niches such as data science and machine learning. The most important of these job roles are business analysts, data scientists, software developers, data engineers and DevOps engineers. The easiest way to understand each team member's role is to first understand the whole data science workflow feedback loop, which can be broken down into five generic stages: problem definition, data analysis, model research, model prototype and model deployment.
Key learning objectives:
Identify data science job roles
Understand the data science workflow
This video is now available for free. It is also part of a premium, accredited video course. Sign up for a 14-day free trial to watch more.
Western economies are currently struggling to close the skills gap in data-intensive workforce niches such as data science and machine learning. As a result, a shortage of data skills has become prevalent across the job markets of many western countries. The UK is a clear example of this phenomenon: a government report published in 2021 points to an estimated 178,000 to 234,000 data roles that are yet to be filled, and a 2018 analysis showed that data-driven skills shortages were already costing the UK economy £2 billion a year.
Although there are many roles within a data science workflow, the most important are:
Business Analysts. Business analysts try to narrow the gap between IT and business by identifying how data can be linked to actionable business insights.
Data Scientists. Data Scientists gather and analyse information from databases and application programming interfaces (APIs) to explore the data, create visualisations and train machine learning models that extract insights for business decision-makers.
Software Developers. Software Developers are the link between Data Scientists and Data Engineers and their main role is to develop production versions of the models developed by Data Scientists. In other words, Software Developers play an important role in making the internally-developed models scalable.
Data Engineers. Data Engineers develop, maintain, test, and evaluate big data solutions within the organisation. They create data pipelines, big data platforms, and data integration into databases, data warehouses, and data lakes working with both on-premise and cloud technologies.
DevOps Engineers. DevOps Engineers rely on a combination of people, processes, and technology to deliver machine learning and software solutions in a robust, scalable, reliable, and automated way.
1. Problem Definition: Business analysts, and in some cases data scientists, ask questions about the business in order to work out what problems need solving.
2. Data Analysis: Data scientists conduct exploratory data analysis (EDA), data transformation and feature selection to prepare inputs for machine learning models.
3. Model Research: Data scientists develop and test multiple machine learning models to describe or predict the data, producing a tool that can answer the questions posed earlier in a systematic manner.
4. Model Prototype: Software developers use data scientist feedback to build a production prototype that will make use of the machine learning model on a regular basis.
5. Model Deployment: Software developers, data engineers and DevOps engineers collaborate to efficiently deploy a machine learning model prototype.
1. Data Analysis
Exploratory Data Analysis. This consists of understanding the data via visualisation, descriptive and statistical inference tests using multiple techniques such as univariate analysis, multivariate analysis, correlation analysis and normality tests, among others.
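A minimal EDA sketch of these techniques, using a synthetic dataset (the column names and distributions are hypothetical, chosen only for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic, illustrative data: daily returns and traded volume
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "returns": rng.normal(loc=0.0, scale=0.02, size=500),
    "volume": rng.lognormal(mean=10.0, sigma=1.0, size=500),
})

# Univariate analysis: descriptive statistics for each column
summary = df.describe()

# Correlation analysis between the two variables
corr = df["returns"].corr(df["volume"])

# Normality test (Shapiro-Wilk): a large p-value is consistent with normality
stat, p_value = stats.shapiro(df["returns"])
```

In practice this stage would also include visualisation (histograms, scatter plots) and multivariate analysis across many more columns.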
Data transformation. This is the ex-ante identification of our machine learning models' requirements, along with the implementation of multiple transformations so that the data can be more easily digested by the models. Common data transformation methodologies include rescaling, standardisation, and normalisation.
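The three methodologies mentioned above can be sketched with scikit-learn's preprocessing utilities (the tiny two-column array is purely illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler, Normalizer

# Toy data: two features on very different scales
X = np.array([[1.0, 200.0],
              [2.0, 300.0],
              [3.0, 400.0]])

# Rescaling: map each column onto the [0, 1] range
X_rescaled = MinMaxScaler().fit_transform(X)

# Standardisation: zero mean and unit variance per column
X_standardised = StandardScaler().fit_transform(X)

# Normalisation: scale each row (sample) to unit L2 norm
X_normalised = Normalizer().fit_transform(X)
```

Which transformation is appropriate depends on the model: distance-based models are sensitive to feature scale, while tree-based models are largely invariant to it.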
Feature engineering. This describes the process of selecting, manipulating, and transforming the raw data into features that can be used in the machine learning model in order to improve the model’s performance and robustness.
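As a sketch of turning raw data into model-ready features, consider a hypothetical transaction table (the column names and derived features are assumptions for illustration):

```python
import pandas as pd

# Hypothetical raw transaction data
raw = pd.DataFrame({
    "timestamp": pd.to_datetime(["2021-01-04", "2021-01-05", "2021-01-06"]),
    "price": [100.0, 102.0, 101.0],
    "quantity": [10, 12, 8],
})

# Engineered features: a derived value, a calendar feature, a lagged return
features = raw.assign(
    trade_value=raw["price"] * raw["quantity"],       # price x quantity
    day_of_week=raw["timestamp"].dt.dayofweek,        # 0 = Monday
    price_return=raw["price"].pct_change(),           # period-on-period change
)
```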
Feature selection. This is the process of trimming down the number of features to improve the performance and robustness of the model. Feature selection can be executed using techniques such as:
- Mean Decrease Impurity (MDI), based on using in-sample data and a tree-based classifier
- Mean Decrease Accuracy (MDA), based on out-of-sample data and any type of classifier algorithm
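The two techniques above can be sketched with scikit-learn, assuming a random forest classifier and a synthetic dataset: impurity-based importances (`feature_importances_`) correspond to MDI, while `permutation_importance` on held-out data is a common MDA-style implementation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic classification problem with 5 features, 2 of them informative
X, y = make_classification(n_samples=300, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

# MDI: impurity-based importances from the tree ensemble (in-sample)
mdi = clf.feature_importances_

# MDA: permutation importances measured on held-out data (out-of-sample)
mda = permutation_importance(clf, X_test, y_test,
                             random_state=0).importances_mean
```

Features with importances near zero under both measures are candidates for removal.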
2. Model Research
Model selection. This is carried out by understanding the problem at hand. The data scientist splits the dataset into train and test data before proceeding to the subsequent steps. This train-test split is essential to avoid creating models that only perform well under very specific circumstances.
Cross-validation. This stage consists of training the model with a portion of the in-sample data while the robustness of the model is confirmed using validation data. Some machine learning models contain specific parameters, or hyperparameters, that require calibration via cross-validation.
Generalisation performance. This is the last stage, where the data scientist selects the best cross-validated model and tests it using data from the test sample in order to understand whether or not the model generalises out-of-sample.
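The three stages above can be sketched end-to-end with scikit-learn, assuming a logistic regression model and the bundled breast cancer dataset (both are illustrative choices, not part of the original workflow description):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# 1. Train-test split: hold out data for the final generalisation check
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 2. Cross-validation: calibrate the regularisation hyperparameter C
#    using 5-fold cross-validation on the training data only
search = GridSearchCV(
    LogisticRegression(max_iter=5000),
    param_grid={"C": [0.01, 0.1, 1.0]},
    cv=5,
)
search.fit(X_train, y_train)

# 3. Generalisation performance: score the best model on unseen test data
test_accuracy = search.score(X_test, y_test)
```

Note that the test set is touched exactly once, at the very end; reusing it during model selection would leak information and overstate out-of-sample performance.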