Today we are going to look at how we can use Azure’s ML Studio platform to build a no-code AI model for tabular data.
Introducing the Kaggle Titanic ML project.
This is a common Machine Learning (ML) practice case where we are predicting which passengers survived the Titanic shipwreck based on features including their ticket class, gender, age and more.
From the Kaggle page we can see the dataset and have a quick look at the features available to us.
The first thing we need to do is create an ML Studio resource. To do this, go to the Azure Portal and search for ML Studio. Create a new resource and fill in the details following this Microsoft Guide.
It is a good idea to create a separate storage account for your ML Studio resource.
ML Studio requires a storage account to store Assets and a compute resource to train a model or run inference.
During resource creation you should have already created or linked to a storage account.
Let’s create a simple compute resource.
As we are running a single, computationally inexpensive training job, we will use a compute instance.
In your ML Studio resource go to Compute -> Compute instances -> New
Give your resource a unique name.
Select an appropriate VM. We don’t need anything fancy for this so we will use a Standard_DS3_v2 resource.
Configure any advanced settings you require; I am going to leave them as is for the demo.
Press Create and wait for the resource state to turn green and say Running.
We now have our environment to train a simple ML model.
First, let’s import the dataset for training the model into the storage account connected to our ML Studio resource.
Go to the Kaggle Titanic competition page and download the training dataset (train.csv). We will use this data to train our classification model.
In ML Studio, select Assets -> Datasets -> Registered datasets -> Create dataset
Select from local file.
Give the dataset an intuitive name and select Tabular for Dataset type.
Select next, then select workspaceblobstore as the datastore. This uses the storage account associated with the ML Studio workspace. Under the browse option, attach the train.csv file from your local storage and press next.
Leave settings and preview as is.
Leave schema as is. We could set it here, but we will set it in the pipeline instead as this makes it easier to change if we make a mistake.
Select next and create the dataset.
Go to Author -> Designer -> New pipeline -> Easy-to-use prebuilt components
Configure pipeline settings:
In its current format, the data would not train a very good model. Some columns have missing data, some are non-numeric, and some would bias the model because of their scale. We need to solve all of these problems before we can select and train our model.
First of all, we need to specify the datatype of all the columns and specify which are features and which is the label column.
Integer values:
String values:
Double values:
Categorical values:
Label Values:
We should now have a pipeline that looks something like this:
*Integer columns should contain whole numbers whose values follow a sequential, ordered pattern. PassengerId is therefore not considered an integer type, as we cannot say that a value of 2 holds any relationship to a value of 1. String columns should contain data where there is no relationship between values. Double columns should contain data that can take any numerical value and follows a sequential pattern. Categorical columns should contain only entries that match a set of predetermined values. The label column is what we are trying to predict; as this is a classification problem, the label is also a categorical type.
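If it helps to see the same idea outside of the designer, here is a rough pandas sketch of what these Edit Metadata steps amount to. The file path and the exact column-to-type mapping are assumptions based on the Kaggle train.csv schema, not a reproduction of what ML Studio runs.

```python
# Illustrative only: roughly what the Edit Metadata steps do, expressed in pandas.
# Column names follow the Kaggle train.csv schema; the file path is a hypothetical local path.
import pandas as pd

df = pd.read_csv("train.csv")

df["SibSp"] = df["SibSp"].astype("int64")            # integer: ordered whole numbers
df["Parch"] = df["Parch"].astype("int64")
df["Name"] = df["Name"].astype("string")             # string: no relationship between values
df["Ticket"] = df["Ticket"].astype("string")
df["Age"] = df["Age"].astype("float64")              # double: any numerical value
df["Fare"] = df["Fare"].astype("float64")
df["Pclass"] = df["Pclass"].astype("category")       # categorical: fixed set of values
df["Sex"] = df["Sex"].astype("category")
df["Embarked"] = df["Embarked"].astype("category")
df["Survived"] = df["Survived"].astype("category")   # label: what we are predicting

print(df.dtypes)
```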
Next, we will select the columns we think are most relevant for training the model. In this step we are going to disregard some of the columns we processed earlier. This may seem like a waste, but it means that if we come back later and decide to use different features (or create new ones, e.g. via PCA), they are already formatted.
Drag the Select Columns in Dataset block into the pipeline and attach it to the last Edit Metadata block.
We have chosen these because I am assuming there is a strong correlation between each of these features and the chance of survival. In practice, a better approach would be to explore the relationship of each feature with the label, or to train multiple instances on a variety of feature sets and assess them over a validation dataset. For an introduction to no-code ML, the features above will suffice (but feel free to explore alternative features; you may get better model performance!)
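As a point of reference, the Select Columns in Dataset step corresponds to something like the following in pandas. The feature subset shown is an assumption for illustration; use whichever columns you actually selected in the designer.

```python
# Illustrative sketch of the Select Columns in Dataset step in pandas.
# The chosen columns below are an assumed example, not the tutorial's exact selection.
import pandas as pd

df = pd.read_csv("train.csv")  # Kaggle Titanic training data

selected = df[["Pclass", "Sex", "Age", "SibSp", "Parch", "Fare", "Survived"]]
print(selected.head())
```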
Next, we will clean up any missing values from the dataset.
We can see from the dataset source (Kaggle) that the age column contains missing values.
Pull in the Clean Missing Data block and select the Age column.
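For intuition, this is roughly what cleaning the Age column does, sketched in pandas. The replacement strategy (column mean) is an assumption; the Clean Missing Data block offers several cleaning modes.

```python
# Rough pandas equivalent of cleaning missing Age values.
# Substituting the column mean is one possible strategy, assumed here for illustration.
import pandas as pd

df = pd.read_csv("train.csv")
print(df["Age"].isna().sum(), "missing Age values before cleaning")

df["Age"] = df["Age"].fillna(df["Age"].mean())  # replace missing ages with the column mean
print(df["Age"].isna().sum(), "missing Age values after cleaning")
```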
Now we will sort out the categorical columns. Most ML models can't handle categorical inputs directly, so we need to convert them into a representation the model can understand. Azure ML does this by converting each categorical value into its own binary column. For example, the Sex column would become two new columns, male and female, each with a value of 1 for true. This type of encoding is referred to as one-hot encoding.
In the pipeline, pull in the Convert to Indicator Values block and select the Pclass and Sex columns.
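If you prefer to see one-hot encoding in code, here is a minimal pandas sketch on a tiny made-up frame. It is purely illustrative and not what ML Studio runs internally.

```python
# One-hot encoding: each category becomes its own binary column.
import pandas as pd

df = pd.DataFrame({"Pclass": [1, 3, 2], "Sex": ["male", "female", "female"]})

encoded = pd.get_dummies(df, columns=["Pclass", "Sex"])
print(encoded)
# Produces columns such as Pclass_1, Pclass_3, Sex_female, Sex_male,
# with a value of 1 where the category applies.
```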
Finally, we will normalise the data. The scale of some features (e.g. Age) is much larger than that of others (e.g. the Sex indicator columns), which can result in a biased model that places too much weight on certain fields, leading to poor performance.
To counter this, we normalise all fields so that they sit on a similar scale while preserving the relative differences between values.
(Another popular method is MinMax scaling; however, a single outlier record can then dominate the scaling and distort the weight the model places on the feature.)
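Here is a small sketch of both scaling approaches on made-up numbers, assuming Z-score scaling as the method (the implied alternative to MinMax); it is for intuition only.

```python
# Comparing Z-score and MinMax scaling on toy data (illustrative values only).
import pandas as pd

df = pd.DataFrame({"Age": [22.0, 38.0, 26.0, 35.0], "Fare": [7.25, 71.28, 7.93, 53.10]})

zscore = (df - df.mean()) / df.std()  # roughly mean 0, standard deviation 1 per column
print(zscore)

minmax = (df - df.min()) / (df.max() - df.min())  # squashes everything into [0, 1]
print(minmax)
# With MinMax, one extreme value sets the range and compresses all other values towards 0.
```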
Hurray! We now have a nice, clean dataset ready for training a model.
First, we need to split the dataset into a training, validation and testing set.
We will randomly select 70% of our data to train & validate the model and 30% to test the best model.
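Conceptually, the Split Data block does the same job as a random split in code. The sketch below uses scikit-learn and an assumed feature subset purely for illustration.

```python
# A random 70/30 split, analogous in spirit to the Split Data block.
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv")
features = df[["Pclass", "Sex", "Age", "Fare"]]  # hypothetical feature subset
label = df["Survived"]

X_trainval, X_test, y_trainval, y_test = train_test_split(
    features, label, test_size=0.3, random_state=42
)
print(len(X_trainval), "rows for training/validation,", len(X_test), "rows for testing")
```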
We are performing a two-class classification problem (survived or did not survive) so we will select a Two-Class Decision Forest Block and drag it into the pipeline.
You could try any alternative two-class model, or even train multiple and validate the best performing model using the validation set.
We will not set any of the model hyperparameters ourselves (e.g. the number of decision trees), because we will automatically train multiple instances of the model with different hyperparameter combinations and select the one that performs best on the validation set.
Next, pull in the Tune Model Hyperparameters block. This performs the process described above. Attach the model, training data and validation data to the nodes in that order, as shown. Set the sweeping mode to Entire grid and the Metric for measuring performance for classification to F-score. Set the Label Column to Survived, as this is what we are predicting.
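To make the sweep concrete, here is a hedged scikit-learn sketch of the same idea: try every combination in a grid and keep the model with the best F1 score. The grid values, the stand-in data, and the use of cross-validation (rather than the fixed validation set ML Studio uses) are all assumptions for illustration.

```python
# Conceptual sketch of an "entire grid" hyperparameter sweep scored with F1.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=500, n_features=6, random_state=0)  # stand-in data

param_grid = {"n_estimators": [8, 32, 128], "max_depth": [2, 4, 8]}  # assumed grid
search = GridSearchCV(RandomForestClassifier(random_state=0), param_grid, scoring="f1", cv=3)
search.fit(X, y)  # tries every combination and keeps the best-scoring model

print(search.best_params_, search.best_score_)
```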
Now, when the pipeline runs, we will have a trained model that we can use. Once trained, we will need to assess its performance on the testing data (which it hasn’t seen before).
Pull in a Score Model block from the right and connect the tuned model and test set as shown. We don’t need to make any settings changes here.
Pull in an Evaluate Model block and attach the Score Model output as seen below.
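In code terms, scoring and evaluating boil down to predicting on the held-out test set and comparing predictions against the true labels. The data and model below are stand-ins for illustration only.

```python
# What Score Model and Evaluate Model conceptually do: predict on unseen data, then compare.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=6, random_state=0)  # stand-in data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
predictions = model.predict(X_test)           # scoring: label each test instance
print(confusion_matrix(y_test, predictions))  # evaluation: where were we right and wrong?
print("accuracy:", accuracy_score(y_test, predictions))
```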
We are now ready to run the pipeline and see what we get.
Make sure your compute instance created earlier is running.
Select Submit from the top, then select Create new experiment. Give the experiment an intuitive name and press Submit.
The pipeline will begin running. Wait for all tasks to complete.
Now let’s see how our model performed.
Assessing the performance of a classifier is often highly dependent on the scenario.
For example, if we were deciding whether someone should be screened for cancer (label) based on their symptoms (features), sending a few extra people (false positives) for testing when they don't need it is not too much of a problem. But we definitely don't want to accidentally refuse screening to someone who needs it (a false negative).
This complexity is well documented and there are plenty of other online resources on choosing the best performance criteria, so we won’t dive into that here.
For our classifier we will assess the performance based on accuracy. This is the fraction of instances that were correctly classified.
We can see that we have an accuracy of about 75%, which is not bad for a simple classifier.
We can see a threshold bar above the results. This lets us specify the confidence at which an instance will be classified as having survived.
With a value of 0.5, the classifier marks any instance it believes has an above-50% chance of surviving as survived.
Let's drop the value to 0.3 (a 30%+ chance of survival now counts as survived) and see what happens.
Notice that our overall accuracy now increases to 82% and our false negative rate drops, but we now have a higher false positive rate.
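The threshold slider is easy to reason about in code: it is simply the cut-off applied to the model's predicted probabilities. The numbers below are made up for illustration.

```python
# How a classification threshold works: "survived" whenever the predicted probability
# exceeds the chosen cut-off. Probabilities here are invented for illustration.
import numpy as np

probabilities = np.array([0.10, 0.35, 0.45, 0.62, 0.90])  # predicted chance of survival

print((probabilities >= 0.5).astype(int))  # default threshold of 0.5 -> [0 0 0 1 1]
print((probabilities >= 0.3).astype(int))  # lowered threshold of 0.3 -> [0 1 1 1 1]
# A lower threshold catches more true survivors (fewer false negatives)
# but also labels more non-survivors as survived (more false positives).
```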
Finally let’s have a look at the raw data.
And that's it! We have now used the pipeline designer in ML Studio to build a no-code classifier that predicts whether someone would survive on the Titanic.
To summarise the steps:
We have just seen the tip of the iceberg (pun intended…) of what Azure ML Studio can offer.
Once you have a basic classifier trained, explore ways of improving your model performance and let us know how it goes on LinkedIn. A few ideas:
Good luck!