Welcome
Hi again, hi again! If you've been catching up with my blog, thanks for your continued support 🙂 If you're new here, thank you for giving my blog a chance 🙂 Since I started learning R, I've thought about making code comparisons between Python and R. Coincidentally, I've also started learning machine learning, so I thought... why not try comparing machine learning code between Python and R! So far, I've learned how to build logistic regression models in both Python and R. Project 8 is divided into parts 1 and 2, covering the Python and R code respectively.
I will be using the Iris dataset to demonstrate how the code works 🙂 If you're someone who uses assistive software to read, I suggest downloading the PDF documents to read the code.
Python - Jupyter Notebook
For this project, I built a logistic regression model using scikit-learn (sklearn). For starters, the packages I used were pandas, NumPy, SciPy, scikit-learn, and Matplotlib.
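The post doesn't show the import code itself, but a minimal sketch of the setup could look like this. I'm assuming the column names `sepal_len`, `sepal_wid`, `petal_len`, and `petal_wid` mentioned later in the post, and I'm loading the copy of Iris that ships with scikit-learn rather than a CSV:

```python
import pandas as pd
from sklearn import datasets

# Load the Iris dataset bundled with scikit-learn
iris = datasets.load_iris()

# Build a DataFrame with the column names used in this post
iris_df = pd.DataFrame(
    iris.data,
    columns=["sepal_len", "sepal_wid", "petal_len", "petal_wid"])
iris_df["class"] = iris.target  # 0, 1, 2 for the three species

print(iris_df.shape)  # (150, 5)
```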
(Click here for the PDF version of code: import Iris dataset)
Now that I have predictions for class based on the other variables in the Iris dataset, how do I know how accurate the predictions are? One way is to use the Jaccard index to produce an average percentage of how similar the actual Y values were vs the predicted Y values (called Yhat).
Just type in iris_df to have a look at the dataset!
In order for the model to work, we have to make sure that the variable we want to predict, in this case "class", is stored as an integer.
Just type in iris_df["class"].dtype to confirm!
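If "class" had been read in as species names (strings) instead of integers, one way to convert it is a simple mapping. The species-to-code dictionary below is a hypothetical example of how that conversion might look:

```python
import pandas as pd

# Hypothetical case: "class" was read in as species names (strings)
iris_df = pd.DataFrame(
    {"class": ["Iris-setosa", "Iris-versicolor", "Iris-virginica"]})
print(iris_df["class"].dtype)  # object, i.e. not an integer yet

# Map each species name to an integer code so the model can use it
codes = {"Iris-setosa": 0, "Iris-versicolor": 1, "Iris-virginica": 2}
iris_df["class"] = iris_df["class"].map(codes)
print(iris_df["class"].dtype)
```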
To make a model that predicts Y ("class") from X ("sepal_len", "sepal_wid", "petal_len", "petal_wid"), both X and Y have to be turned into arrays.
X was rescaled so that each feature's maximum value becomes 1. Rescaling puts all four measurements on a comparable scale, which helps the model fit; the logistic function then maps the model's output for Y to a value within the range of 0 to 1.
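A sketch of the array conversion and rescaling step, assuming the post's rescaling (each feature's maximum becomes 1) is done with something like scikit-learn's `MinMaxScaler`, which squeezes every feature into the 0 to 1 range:

```python
import numpy as np
from sklearn import datasets, preprocessing

iris = datasets.load_iris()
X = np.asarray(iris.data)    # the four measurement columns
Y = np.asarray(iris.target)  # the integer-coded "class" column

# MinMaxScaler rescales each feature to 0-1, so each column's max becomes 1
X = preprocessing.MinMaxScaler().fit_transform(X)

print(X.max(), X.min())  # 1.0 0.0
```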
In order to test the model, the data was split into a training set and a test set. The training set provides information for the model to "learn" how to make classifications. The test set makes sure that the model can actually make classifications and is useful for finding out how accurate the model is. In this project, I split the data so that the training set contains 80% of the data and the test set the remaining 20%.
I checked whether the data was actually split 80:20. The full dataset has 150 data points; the training set has 120 data points and the test set has 30. Since 150 × 0.8 = 120 and 150 × 0.2 = 30, the split was performed correctly.
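The 80:20 split described above can be sketched with `train_test_split`; the `random_state` value here is my own assumption, added just to make the split reproducible:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, Y = iris.data, iris.target

# Hold out 20% of the 150 rows as a test set (150 * 0.2 = 30)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=4)

print(X_train.shape[0], X_test.shape[0])  # 120 30
```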
The logistic regression model was made using the train set.
You'll see a big array of decimal numbers ranging from 0 to 1. Logistic regression provides an outcome for the variable class as either "Yes" or "No"... kind of. What the model really does is provide the probability that class would be "Yes". The closer the Y value is to 1, the higher the chance that class is "Yes."
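Fitting the model on the training set and producing those probabilities could look like the sketch below. For Iris there are three classes rather than a single "Yes"/"No", so `predict_proba` returns one probability per class per test point; the `max_iter` and `random_state` values are my own assumptions:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4)

# Fit the logistic regression model on the training set only
model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)

# Each row holds the probability of each class and sums to 1;
# the class with the highest probability is the prediction
probs = model.predict_proba(X_test)
print(probs.shape)  # (30, 3)
```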
How useful is the model?
(Click here for the PDF version of code: how good is the model?)
The Jaccard index was 0.825. That means the model produces correct results 82.5% of the time. You could interpret that as "roughly 1 in 6 cases could be wrong," or you could say "it's a whole lot better than a 50:50 chance!" Personally, I think 82.5% is a pretty solid number considering that it's a pretty small dataset!
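Computing that score could be sketched with scikit-learn's `jaccard_score`. Since Iris has three classes, I'm assuming the per-class Jaccard indices are averaged (`average="macro"`); the exact value depends on the random split, so I don't print a specific number here:

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_score

iris = datasets.load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4)

model = LogisticRegression(max_iter=1000).fit(X_train, Y_train)
Yhat = model.predict(X_test)  # predicted classes for the test set

# Average the per-class Jaccard indices into one overall score (0 to 1)
score = jaccard_score(Y_test, Yhat, average="macro")
print(round(score, 3))
```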
Final thoughts
Thank you so much for reading! Making this post was actually a lot of fun and I hope you all enjoyed it ❤ Knowing that you are out there reading this blog keeps me motivated to keep on coding 🙂 Next time, I will show what making the same logistic regression model looks like in R. Until then, please feel free to read my other posts on this blog. If there's anything you want to say about this post, comment down below!