Simple and Efficient Machine Learning Prototyping in Python Using Sweetviz and PyCaret

Introduction

Prototyping new models can be a time-consuming process. It combines business understanding, data understanding, data preparation, model creation, and evaluation. Luckily, there are a few ways to speed up the process and enable fast decision-making in a business setting. We can accelerate data exploration with a package like Sweetviz and handle modeling with a semi-automated machine-learning workflow library like PyCaret. Used together, the two libraries can streamline your end-to-end data science workflow, provided you have some experience implementing models in a business environment.

This blog aims to help you explore, create, and quickly implement machine learning models so you can determine whether your current process and results make sense for the business goal you are trying to achieve.

Files

In this example, we will use the NYPD historic shooting incident dataset. You can find a link to the files in my GitHub repository: GitHub SolisAnalytics. The repository contains the complete code. The goal is to determine which independent variables (features) are most significant in predicting same-race murders in the New York area. This example is not meant to prove any real-life findings.

The Process

We will start by importing the necessary libraries and packages. The yml file in the GitHub repository contains all the dependencies.
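A minimal import block consistent with this walkthrough might look like the sketch below (the exact pinned versions live in the yml file):

```python
# Core data handling
import pandas as pd
import numpy as np

# Automated EDA reports
import sweetviz as sv

# Semi-automated ML workflow (classification module)
from pycaret.classification import (
    setup,
    compare_models,
    create_model,
    tune_model,
    plot_model,
)
```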

The dataset has twenty-one variables, not including the target variable, which we will engineer and call “same-race murder.” The target variable will be positive if the perpetrator's race is the same as the victim's; otherwise, it will be negative.

Before creating the target variable, let’s clean up the perpetrator race values by removing nulls and low-count categories; we can afford to drop them since more than 25,000 observations are present.
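Roughly, the filtering step looks like the sketch below. The file name and the exact categories kept in query_list are assumptions based on the public NYPD dataset, so check the repository for the actual values used:

```python
# Race categories with enough observations to be useful; the exact list is illustrative
query_list = ["BLACK", "WHITE", "WHITE HISPANIC", "BLACK HISPANIC", "ASIAN / PACIFIC ISLANDER"]

# Assumed file name for the historic shooting incident extract
df = pd.read_csv("NYPD_Shooting_Incident_Data_Historic.csv")

# Keep rows where both the perpetrator and victim race are in the allowed list
df = df.query("PERP_RACE in @query_list and VIC_RACE in @query_list")
```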

The code above uses the Pandas query function to include only the values in the query list.

After filtering, we can create the target variable. As mentioned above, a positive instance will be any instance where the perpetrator's race is the same as the victim's.
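A one-line sketch of that step (the target column name here is my own label):

```python
# 1 when the perpetrator and victim share the same race, 0 otherwise
df["same_race_murder"] = np.where(df["PERP_RACE"] == df["VIC_RACE"], 1, 0)
```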

The np.where function returns 1 for any instance where race is the same for perpetrator and victim. Otherwise, it returns 0.

We can use a tool like Sweetviz to identify potential features based on the correlations between the independent variables and the dependent (target) variable. We can also view the correlations among the independent variables themselves. Highly correlated independent variables can be problematic because they can undermine the statistical significance of one another.

Sweetviz also lets us look at the counts and distributions of every input variable, and it makes it easy to spot outliers that could affect model performance during the training phase. Let’s look at the simple code that outputs those insights.
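A minimal sketch of the Sweetviz call, assuming JURISDICTION_CODE is the column being forced to a categorical type and the target is the engineered same_race_murder flag:

```python
# Force a numeric-looking code column to be treated as categorical
feature_config = sv.FeatureConfig(force_cat=["JURISDICTION_CODE"])

# Build the EDA report against the target variable
report = sv.analyze(df, target_feat="same_race_murder", feat_cfg=feature_config)

# Write the interactive report to an HTML file
report.show_html("sweetviz_report.html")
```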

We can force a data type for the analytical method using “FeatureConfig”. The analyze method takes a dataset, a target feature, and any configurations. You can output the results as an HTML file using the “show_html” method.

Below are a few images of the findings, along with some comments. Some features were removed after performing exploratory data analysis using Sweetviz. The process is iterative, which allows you to focus on significant variables.

The target variable category counts are fairly balanced. There are no numerical features present, but there are quite a few categorical variables with some association with the target variable.

The association matrix shows us the correlation between all variables. Some features correlate with the target variable: BORO, perpetrator age group, perpetrator sex, and victim race. There is some sign of multicollinearity between the jurisdiction code and location description. However, PyCaret can remove highly correlated features once their correlation surpasses a set threshold.

We can use the insights gathered from Sweetviz to enhance our model setup in PyCaret. For instance, we will set a multicollinearity threshold of 0.90 to remove highly correlated variables. There are no numerical variables in our dataset, but we will still include common data preparation steps like normalization, outlier removal, and imputation in case we want to add numeric features later. We will also set a session id to reproduce results and use stratified k-fold cross-validation so the model generalizes well to unseen data.
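A setup() call consistent with those choices might look like this; the session id value and the imputation strategies are assumptions rather than the exact settings from the repository:

```python
clf = setup(
    data=df,
    target="same_race_murder",
    session_id=123,                      # assumed value; any fixed seed reproduces results
    remove_multicollinearity=True,
    multicollinearity_threshold=0.90,    # drop one of any feature pair correlated above 0.90
    normalize=True,                      # only relevant once numeric features are added
    remove_outliers=True,
    numeric_imputation="mean",
    categorical_imputation="mode",
    fold_strategy="stratifiedkfold",
    fold=5,
)
```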

We can compare various classification models once the classification object has been set up. We will include a list of seven models to compare predictive performance using an F1 score as our main indicator. In a future post, I will explain when certain classification models are a great choice.

In a real-world scenario, you would select a measure most appropriate for the business problem you are trying to solve. The F1 score is the harmonic mean of precision and recall, a good measure to use when there is an imbalance in the dataset.
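The comparison itself is a single call. The seven model IDs below are illustrative rather than the author's exact list, but "gbc" (the gradient boosting classifier) is included since it is discussed next:

```python
# Rank a short list of candidate classifiers by mean cross-validated F1
best_model = compare_models(
    include=["lr", "dt", "rf", "gbc", "ada", "et", "knn"],  # illustrative seven-model list
    sort="F1",
)
```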

The gradient boosting classifier has the highest F1 score among the selected classification models. We will tune it below by optimizing the F1 score with the tune_model method, then plot model performance and significant features.
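A sketch of that step; with the default of 10 candidate parameter sets evaluated over 5 folds, tune_model performs the 50 fits mentioned below:

```python
# Train the gradient boosting classifier, then search for better hyper-parameters on F1
gbc = create_model("gbc")
tuned_gbc = tune_model(gbc, optimize="F1")
```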

The mean F1 score from five-fold cross-validation, with a total of 50 fits, barely improved model performance. A more targeted tuning grid might improve performance further.

I listed the hyper-parameters of the original and tuned gradient boosting models to observe the differences and highlighted the important parameters that were tuned.
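One simple way to compare them is to diff the underlying scikit-learn parameters; this is my own helper rather than the original snippet:

```python
# Show only the hyper-parameters that changed during tuning
params = pd.DataFrame({
    "original": pd.Series(gbc.get_params()),
    "tuned": pd.Series(tuned_gbc.get_params()),
})
print(params[params["original"] != params["tuned"]])
```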

The plot_model method allows us to produce many model performance visualizations. We will plot the AUC, confusion matrix, learning curve, and feature importance to better understand how our model performs.
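Each plot is a single call against the tuned model; the plot names below are the standard PyCaret identifiers:

```python
plot_model(tuned_gbc, plot="auc")               # ROC curves per class
plot_model(tuned_gbc, plot="confusion_matrix")  # holdout confusion matrix
plot_model(tuned_gbc, plot="learning")          # learning curve
plot_model(tuned_gbc, plot="feature")           # top feature importances
```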

The AUC plot shows similar performance for both classes. Same-race murder (class 1) does show a higher true positive rate at around a 0.2 false positive rate.

The confusion matrix shows a recall of about 0.83 (2768 / (2768 + 562)) and a precision of about 0.87 (2768 / (2768 + 418)).

Training and cross-validation scores are similar, showing the model generalizes well to unseen instances. However, the cross-validation band is slightly wider.

The feature importance plot, calculated using Gini importance in the gradient boosting model, returns the top 10 most significant features. Some of these features are not helpful due to the nature of the data being used. However, there is some indication that same-race murders among minorities such as African Americans and Hispanics occur mainly in the Bronx borough (BORO).

Conclusion

In conclusion, the combination of Sweetviz and PyCaret offers a simple yet highly efficient approach to exploratory data analysis, model development, and evaluation. By leveraging these powerful Python libraries, we were able to gain valuable insights into the NYPD murder dataset and build robust predictive models.

Sweetviz provided an intuitive and comprehensive overview of the dataset, allowing us to understand its structure and identify key patterns and relationships quickly. Its automated visualizations and detailed statistical summaries were invaluable in uncovering insights, enabling us to make informed decisions in our analysis.

With the foundation laid by Sweetviz, PyCaret took our analysis to the next level by streamlining the model development and evaluation process. Its extensive collection of pre-processing techniques, feature selection methods, and model algorithms allowed us to iterate and experiment with different configurations rapidly.

By combining the strengths of Sweetviz and PyCaret, we could efficiently explore the NYPD murder dataset, gain a deep understanding of the underlying patterns, and develop a high-performing predictive model. This simple yet effective approach can be applied to various domains and datasets, empowering users to extract valuable insights and make data-driven decisions. Whether you are a beginner or an experienced practitioner, Sweetviz and PyCaret are valuable tools for your machine-learning toolkit.
