## Requires Changes

### 5 specifications require changes

Hello student,

Well done on your first submission! :clap: :clap:

A few minor changes are still required in order to meet our rubric.

Keep up the great work!

Cheers,

## T1 - Data Exploration

***Student's implementation correctly calculates the following:***
* ***Number of records***
* ***Number of individuals with income >$50,000***
* ***Number of individuals with income <=$50,000***
* ***Percentage of individuals with income > $50,000***

### Required

No big deal here. Kindly note that `greater_percent` is a percentage, not a decimal value. Are you sure that only 0.25% of individuals have an income above $50,000?

## T2 - Preparing the Data

***Student correctly implements one-hot encoding for the feature and income data.***

### Awesome

Well done using the map method combined with a lambda function!

### Comment

[This reference](http://www.kdnuggets.com/2015/12/beyond-one-hot-exploration-categorical-variables.html) provides 7 different encoding strategies. [Binary encoding](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelBinarizer.html) is a great choice for cases where the number of categories for a given feature is very high. Lately, [entity embedding](https://arxiv.org/abs/1604.06737) has become an increasingly popular choice as well. Another of my personal favorites is to train models using [LightGBM](https://lightgbm.readthedocs.io/en/latest/). It can handle categorical features without the need to one-hot encode them.

## Q1 - Naive Predictor Performance

***Student correctly calculates the benchmark score of the naive predictor for both accuracy and F1 scores.***

Great job calculating the accuracy and the F-score for a naive predictor!

### Comment

Note that the F-score is higher than the accuracy, which seems counter-intuitive since the F-score is a more elaborate calculation. That happens because a value of beta = 0.5 attenuates the influence of false negatives.
In other words, this value of beta gives more weight to precision than to recall, so the positive predictions (>50K) count for more than the negative ones (<=50K).

-----------

## Q2 - Model Evaluation

***The pros and cons or application for each model is provided with reasonable justification why each model was chosen to be explored. Please list all the references you use while listing out your pros and cons.***

### Required

For each estimator, please state clearly its application, its advantages, its weaknesses, and why it is a good candidate. For example, for AdaBoost, it is unclear whether the following sentence is an advantage or a reason why it is a good candidate:

> The reason it works well is because it takes "week classifiers" (such as decision trees) and combine their result to improve to a "strong classifier"

### Suggestion

For listing advantages and disadvantages, I highly suggest using the sklearn documentation. I couldn't find such a list for AdaBoost, but for other estimators like SGD [it is possible to find one](https://scikit-learn.org/stable/modules/sgd.html).

____________

## T3 - Creating a Training and Predicting Pipeline

***Student successfully implements a pipeline in code that will train and predict on the supervised learning algorithm given.***

### Awesome

Everything looks great here!

## T4 - Initial Model Evaluation

***Student correctly implements three supervised learning models and produces a performance visualization.***

### Required

As described in the project:

- *Use a 'random_state' for each model you use, if provided.*

Please make sure to use a `random_state` for each estimator (if available) in order to guarantee the [reproducibility](https://en.wikipedia.org/wiki/Reproducibility) of your results.

## Q3 - Improving Results

***Justification is provided for which model appears to be the best to use given computational cost, model performance, and the characteristics of the data.***

### Comment

I agree with your choice of AdaBoost!
It is one of the best estimators for this project, and in your analysis it is the one that achieves the highest test score. In general, tree-based estimators do better in this project because they have the flexibility to create non-linear decision boundaries, which leaves more room for better generalisation after tuning.

## Q4 - Model in Layman’s Terms

***Student is able to clearly and concisely describe how the optimal model works in layman's terms to someone who is not familiar with machine learning nor has a technical background.***

### Bonus

This [video](https://youtu.be/k4G2VCuOMMg) shows AdaBoost in action. It might be useful for getting an intuition of how this estimator works. I suggest that you watch it in slow motion:

![517d9e542ed24fadac05a88f4f9e1c77-1518138914015.gif](https://udacity-reviews-uploads.s3.us-west-2.amazonaws.com/_attachments/38140/1518139006/517d9e542ed24fadac05a88f4f9e1c77-1518138914015.gif)

## T5 - Model Tuning

***The final model chosen is correctly tuned using grid search with at least one parameter using at least three settings. If the model does not need any parameter tuning it is explicitly stated with reasonable justification.***

### Required

As before, please make sure to also set a `random_state` on the classifier here.

## Q5 - Final F1 Score

***Student reports the accuracy and F1 score of the optimized, unoptimized, models correctly in the table provided. Student compares the final model results to previous results obtained.***

### Comment

Great job! In this project, scores higher than 0.74 are only achieved with boosting algorithms! The best score I've seen was with Gradient Boosting (0.75).
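As a side note on the scores above, here is a minimal sketch of how accuracy and the F-score both follow from confusion-matrix counts, assuming the project's beta = 0.5. The helper names `accuracy` and `fbeta` and all the counts are hypothetical, made up purely for illustration; they are not your results.

```python
def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def fbeta(tp, fp, fn, beta=0.5):
    """F-beta score from confusion-matrix counts.

    With beta < 1, precision is weighted more heavily than recall.
    """
    if tp == 0:
        # No true positives: the score is defined as 0 here
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical counts: a naive predictor that labels all 1000 records
# as '>50K' when only 250 of them truly are.
naive_acc = accuracy(tp=250, fp=750, fn=0, tn=0)
naive_f = fbeta(tp=250, fp=750, fn=0)
print(f"accuracy={naive_acc:.4f}, F0.5={naive_f:.4f}")
```

For these made-up counts the F-score (about 0.29) comes out above the accuracy (0.25), which is the same pattern noted for the naive predictor: with no false negatives, recall is 1 and pulls the F-score up.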
### Bonus

You can also check your results with a [Confusion Matrix](https://en.wikipedia.org/wiki/Confusion_matrix):

    import seaborn as sns  # Install using 'pip install seaborn'
    from sklearn.metrics import confusion_matrix
    import matplotlib.pyplot as plt
    %matplotlib inline

    # Rows are the true labels, columns are the predicted labels
    cm_test = confusion_matrix(y_test, best_clf.predict(X_test))

    plt.figure(figsize=(7, 5))
    sns.heatmap(cm_test, annot=True, cmap='Greys', xticklabels=['No', 'Yes'], yticklabels=['No', 'Yes'])
    plt.title('Confusion Matrix for the Test Set')
    plt.ylabel('True')
    plt.xlabel('Predicted')

![Screen Shot 2017-11-18 at 23.30.06.png](https://udacity-reviews-uploads.s3.amazonaws.com/_attachments/38140/1511044234/Screen_Shot_2017-11-18_at_23.30.06.png)

## Q6 - Feature Relevance Observation

***Student ranks five features which they believe to be the most relevant for predicting an individual's income. Discussion is provided for why these features were chosen.***

### Required

Although it is a minor change, please make sure to also include a discussion of why the features were chosen. As mentioned in the question:

- and in what order would you rank them and **why?**

## Q7 - Extracting Feature Importance

***Student correctly implements a supervised learning model that makes use of the `feature_importances_` attribute. Additionally, student discusses the differences or similarities between the features they considered relevant and the reported relevant features.***

### Comment

It is worth noting that each model with `feature_importances_` might return different top predictive features, depending on its internal algorithm implementation.
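To illustrate how two models can disagree on their top features, here is a small sketch that ranks features by importance and compares the results. The helper `top_k_features`, the feature names, and both importance vectors are hypothetical, invented for illustration only; in your notebook the values would presumably come from each fitted model's `feature_importances_` attribute.

```python
def top_k_features(importances, feature_names, k=5):
    """Return the k (name, importance) pairs with the largest importance."""
    ranked = sorted(
        zip(feature_names, importances),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:k]


# Hypothetical feature names and importances for two different models
names = ["feat_a", "feat_b", "feat_c", "feat_d", "feat_e", "feat_f"]
model_a_importances = [0.30, 0.25, 0.20, 0.15, 0.07, 0.03]
model_b_importances = [0.10, 0.35, 0.05, 0.28, 0.20, 0.02]

top_a = top_k_features(model_a_importances, names, k=3)
top_b = top_k_features(model_b_importances, names, k=3)

# Features that both models place in their top 3
shared = {name for name, _ in top_a} & {name for name, _ in top_b}
print("Model A top 3:", top_a)
print("Model B top 3:", top_b)
print("Shared:", shared)
```

In this made-up case the two rankings share only one feature, which is why it helps to say which model your ranking comes from.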
### Suggestion

You can also use the `feature_importances_` attribute from `best_clf`. Since it is already tuned, it should give you a better choice of top 5 features:

```
importances = best_clf.feature_importances_
```

## Q8 - Extracting Feature Importance

***Student analyzes the final model's performance when only the top 5 features are used and compares this performance to the optimized model from **Question 5**.***

### Comment

An alternative strategy for reducing the number of features is to use dimensionality reduction techniques (PCA, for example). Then, we could keep only the most descriptive components for training the model. You will see more details about PCA in the next module.
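If you are curious before reaching that module, here is a rough sketch of the idea behind PCA using only NumPy. The function `pca_transform` and the synthetic data are assumptions made for illustration; in practice you would more likely reach for `sklearn.decomposition.PCA` and fit it on your training features.

```python
import numpy as np


def pca_transform(X, n_components):
    """Project X (n_samples, n_features) onto its top principal components.

    Returns the projected data and the explained-variance ratio
    of each kept component.
    """
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by decreasing variance
    _, singular_values, Vt = np.linalg.svd(X_centered, full_matrices=False)
    explained_variance = singular_values ** 2 / (X.shape[0] - 1)
    ratio = explained_variance / explained_variance.sum()
    return X_centered @ Vt[:n_components].T, ratio[:n_components]


# Synthetic data: 200 samples, 5 features, where feature 1 is almost
# a copy of feature 0 (so the two carry redundant information).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
X[:, 1] = X[:, 0] + 0.01 * rng.normal(size=200)

X_reduced, ratio = pca_transform(X, n_components=2)
print(X_reduced.shape)
print(ratio)
```

Unlike selecting the top 5 original features, each component here is a linear combination of all the features, so the reduced columns are harder to interpret even when they retain most of the variance.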