Monday, February 28, 2022

Project 8 Part 1: Logistic Regression - Python

 Welcome

Hi again, hi again!  If you've been catching up with my blog, thanks for your continuous support 💓 If you're new here, thank you for giving my blog a chance 💕 Since I started learning R, I've thought about making code comparisons between Python and R.  Coincidentally, I've also started learning machine learning, so I thought... why not try comparing machine learning code between Python and R!  So far, I've learned how to build logistic regression models using both Python and R.  Project 8 is divided into Parts 1 and 2, which cover the Python code and the R code respectively.

I will be using the Iris dataset to demonstrate how the code works 👍 If you're someone who requires assistive software to read, I suggest downloading the PDF documents to read the code.

Python - Jupyter Notebook

For this project, I built a logistic regression model using sklearn.  For starters, the packages I used were pandas, NumPy, SciPy, scikit-learn (sklearn), and Matplotlib.
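Since the code screenshots live in the PDFs, here's a sketch of the import cell (exactly which scipy and sklearn submodules you need depends on the later steps, so treat this as one reasonable setup):

```python
# Package imports (a sketch -- your notebook may pull in
# different submodules depending on the steps below)
import pandas as pd
import numpy as np
import scipy.optimize as opt
import matplotlib.pyplot as plt
from sklearn import preprocessing
```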

Sklearn allows us to import some of the most famous datasets used when learning data science.  For this project, I imported the Iris dataset with the data under the columns sepal_len, sepal_wid, petal_len, petal_wid, and class.  The NAs were dropped and empty lines were removed.  (Note:  This section of the code is based on the work of Srishti Saha from GitHub)

(Click here for the PDF version of code: import Iris dataset)
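For readers who can't open the PDF, here's a sketch of the loading step, assuming sklearn's built-in copy of Iris (I store the species names as strings in the "class" column, which is why the integer check further down matters; the original code may load the data differently):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load the built-in Iris dataset into a DataFrame with the
# column names used throughout this post
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=["sepal_len", "sepal_wid",
                                           "petal_len", "petal_wid"])
# Store the species name in the "class" column
iris_df["class"] = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))

# Drop NAs and empty rows (the built-in copy has none, but just in case)
iris_df = iris_df.dropna()
```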

Just type in iris_df to have a look at the dataset!

In order for the model to work, we have to make sure that the variable we want to predict, in this case "class", is encoded as an integer.

Just type in iris_df["class"].dtype to confirm!
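If class is stored as strings, its dtype will show up as object rather than an integer.  Here's a sketch of one way to encode it (the 0/1/2 mapping is my choice; any consistent integer coding works):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Setup from the previous step: class stored as species names
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=["sepal_len", "sepal_wid",
                                           "petal_len", "petal_wid"])
iris_df["class"] = pd.Series(iris.target).map(dict(enumerate(iris.target_names)))

print(iris_df["class"].dtype)   # object -- not an integer yet

# Map each species name to an integer code
iris_df["class"] = iris_df["class"].map({"setosa": 0,
                                         "versicolor": 1,
                                         "virginica": 2})
print(iris_df["class"].dtype)   # int64
```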

In order to make a model that predicts Y ("class") from X ("sepal_len", "sepal_wid", "petal_len", "petal_wid"), both X and Y have to be turned into arrays.
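A sketch of that conversion (assuming class already holds integer codes):

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_iris

# Setup: Iris with integer class codes
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=["sepal_len", "sepal_wid",
                                           "petal_len", "petal_wid"])
iris_df["class"] = iris.target

# Turn the predictors (X) and the target (Y) into NumPy arrays
X = np.asarray(iris_df[["sepal_len", "sepal_wid", "petal_len", "petal_wid"]])
Y = np.asarray(iris_df["class"])
print(X.shape, Y.shape)   # (150, 4) (150,)
```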

X had to be rescaled so that the maximum value becomes 1, putting all features on a 0-to-1 scale; the model then returns Y as a probability within the range of 0 to 1.
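One way to do the rescaling is sklearn's MinMaxScaler, which squeezes every column into [0, 1] so the maximum of each feature becomes exactly 1 (the original code may use a different scaler):

```python
from sklearn.datasets import load_iris
from sklearn.preprocessing import MinMaxScaler

X = load_iris().data   # setup from the previous steps

# Rescale every column into the range [0, 1]
X = MinMaxScaler().fit_transform(X)
print(X.max(), X.min())   # 1.0 0.0
```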


In order to test the model, the data was split into a training set and a test set.  The training set provides information for the model to "learn" how to make classifications.  The test set makes sure that the model can actually make classifications and is useful for finding out how accurate the model is.  In this project, I split the data so that the training set contains 80% of the data and the test set the remaining 20%.



I checked whether the data was actually split 80:20.  The full dataset has 150 data points; the training set has 120 data points and the test set has 30.  Since 150 × 0.8 = 120 and 150 × 0.2 = 30, the splitting was performed correctly.
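The split plus the check is only a few lines with sklearn's train_test_split; a sketch (the variable names and random_state value are my choices, and random_state just makes the split reproducible):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X, Y = iris.data, iris.target

# 80% training data, 20% test data
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.2, random_state=4)

print(X_train.shape[0])   # 120
print(X_test.shape[0])    # 30
```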


The logistic regression model was fitted using the training set.
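A sketch of the fitting step (the C and solver settings are assumptions on my part; sklearn's defaults work fine too):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4)

# Fit logistic regression on the training set
LR = LogisticRegression(C=0.01, solver="liblinear")
LR.fit(X_train, Y_train)
```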



This is what happened when I tried to test the model on the test set.

You'll see a big array of decimal numbers ranging from 0 to 1.  Logistic regression provides an outcome for the variable class as either "Yes" or "No"... kind of.  What the model really does is provide the probability that class would be "Yes".  The closer the Y value is to 1, the higher the chance that class is "Yes."
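Those decimals come from predict_proba.  Since Iris has three species, there's one probability column per class; a sketch (the model settings and random_state are the same assumptions as before):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4)
LR = LogisticRegression(C=0.01, solver="liblinear").fit(X_train, Y_train)

# Predicted classes for the test set...
yhat = LR.predict(X_test)
# ...and the probability of each class, one column per species
yhat_prob = LR.predict_proba(X_test)
print(yhat_prob.shape)   # (30, 3) -- every value is between 0 and 1
```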

How useful is the model?


So now that I have predictions for class based on the other variables in the Iris dataset, how would I know how accurate they are?  One way is to use the Jaccard index, which produces an average percentage of how similar the actual Y values are to the predicted Y values (called Yhat).

(Click here for the PDF version of code:  how good is the model?)
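In recent versions of sklearn this metric is called jaccard_score (the older jaccard_similarity_score function was removed); for three classes you pass an average argument to get a single overall number.  A sketch, with the same model assumptions as above:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import jaccard_score
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=4)
LR = LogisticRegression(C=0.01, solver="liblinear").fit(X_train, Y_train)
yhat = LR.predict(X_test)

# Averaged Jaccard index between the actual and predicted classes
score = jaccard_score(Y_test, yhat, average="micro")
print(round(score, 3))
```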

The Jaccard index was 0.825.  That means that the model produces correct results 82.5% of the time.  You could interpret that as "almost 1/5 of all cases could be wrong" or you could say "it's a whole lot better than a 50:50 chance!"  Personally, I think 82.5% is a pretty solid number considering that it's a pretty small dataset!

Final thoughts

Thank you so much for reading!  Making this post was actually a lot of fun and I hope you all enjoyed it ❤ I feel like knowing that you are out there reading this blog keeps me motivated to keep on coding 😀 Next time, I will be showing what making the same logistic regression model looks like when using R.  Until then, please feel free to read my other posts on this blog.  If there's anything you want to say about this post, comment down below!


