The source values are discrete. In a one-hot encoded column, the integer-encoded variable is removed and replaced with a new binary variable (0s and 1s) for each category. This lets categorical data be used and actually expressed, since the labels are changed into numbers.
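As a minimal sketch of how such a column could be built, assuming TensorFlow's feature-column API (the `'sex'` name and its vocabulary come from the Titanic data; the helper name is mine):

```python
import tensorflow as tf

def one_hot_cat_column(feature_name, vocab):
    # Map each string label to an integer index...
    cat_col = tf.feature_column.categorical_column_with_vocabulary_list(
        feature_name, vocab)
    # ...then expand that index into a binary (0/1) indicator vector.
    return tf.feature_column.indicator_column(cat_col)

sex_column = one_hot_cat_column('sex', ['female', 'male'])
```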
Viewing the data as a dense feature tensor exposes both the complete and the incomplete entries: wherever a value is missing, a 0 is filled in throughout the entire dataset. The model is then able to train on a complete set instead of figuring out how to process unknown data.
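A minimal sketch of that dense view, assuming the one-hot `'sex'` column from above; a value outside the vocabulary simply encodes as a row of zeros:

```python
import tensorflow as tf

sex_column = tf.feature_column.indicator_column(
    tf.feature_column.categorical_column_with_vocabulary_list(
        'sex', ['female', 'male']))

# DenseFeatures packs the raw inputs into one dense numeric tensor.
feature_layer = tf.keras.layers.DenseFeatures([sex_column])
batch = {'sex': tf.constant([['female'], ['n/a']])}
print(feature_layer(batch).numpy())
# [[1. 0.]    known value -> normal one-hot row
#  [0. 0.]]   unknown/missing value -> filled with 0s
```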
Comparing Logistic Regression to the Boosted Tree, the graphs show that the Boosted Tree model predicts that someone would not have survived slightly more often than Logistic Regression does. Overall, however, the two models are similar to each other.
### Boosted Tree
*Predicted probability density plot*

### Logistic Regression
*Predicted probability density plot*
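A minimal sketch of how density plots like the ones above could be produced, assuming two trained `tf.estimator` classifiers and an evaluation input function (`est_boosted`, `est_logistic`, and `eval_input_fn` are names from my setup, not fixed API):

```python
import matplotlib.pyplot as plt
import pandas as pd

def survival_probs(est):
    # Probability of the positive class ("survived") for each passenger.
    return pd.Series([p['probabilities'][1] for p in est.predict(eval_input_fn)])

survival_probs(est_boosted).plot(kind='density', label='Boosted Tree')
survival_probs(est_logistic).plot(kind='density', label='Logistic Regression')
plt.xlabel('predicted probability of survival')
plt.legend()
plt.show()
```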
The ROC curve shows that the model is okay; there could definitely be some improvements, but it is not a bad model. A perfect model would go straight up the left side of the graph at a false positive rate of 0, then across the top of the graph at a true positive rate of 1. A curve that is just a 45-degree diagonal line would be a very bad model, no more accurate than a person guessing between the two options. The more area there is under the curve (AUC), the better the model is performing.
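A minimal sketch of the ROC plot and its AUC, assuming `y_eval` holds the true labels and `probs` holds predicted survival probabilities like those from the density-plot snippet above:

```python
import matplotlib.pyplot as plt
from sklearn.metrics import roc_curve, roc_auc_score

# fpr/tpr trace the curve; AUC summarizes it as a single number.
fpr, tpr, _ = roc_curve(y_eval, probs)
plt.plot(fpr, tpr, label=f'ROC (AUC = {roc_auc_score(y_eval, probs):.3f})')
plt.plot([0, 1], [0, 1], 'k--', label='45-degree line (random guessing)')
plt.xlabel('false positive rate')
plt.ylabel('true positive rate')
plt.legend()
plt.show()
```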
I changed the ID to look at the data for the passenger at idx 110. The individual was lucky and appeared to have a lot of feature characteristics working in their favor. The two most influential features appear to be class and sex, as the individual was a second-class female.
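A minimal sketch of how I pulled the per-passenger contributions, assuming `est` is a trained `tf.estimator.BoostedTreesClassifier` (ideally trained with `center_bias=True` so the contributions are interpretable):

```python
import pandas as pd

# Each prediction dict carries 'dfc': the directional feature
# contributions that pushed this passenger's prediction up or down.
pred_dicts = list(est.experimental_predict_with_explanations(eval_input_fn))
df_dfc = pd.DataFrame([pred['dfc'] for pred in pred_dicts])

ID = 110  # the passenger discussed above
print(df_dfc.iloc[ID].sort_values(key=abs, ascending=False).head())
```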
The most important feature in the model's predictions was clearly the individual's gender. The second most important is harder to interpret, as it appears to be a close call between fare and class; thinking about it rationally, that makes sense given the real-world implications and the discrimination that took place along socioeconomic class divides.
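One way to sketch that global ranking, reusing `df_dfc` from the previous snippet, is to average the absolute contribution of each feature across the whole evaluation set:

```python
import matplotlib.pyplot as plt

# Features with larger average |contribution| matter more overall.
(df_dfc.abs().mean()
       .sort_values()
       .plot(kind='barh', title='mean |directional feature contribution|'))
plt.show()
```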