- import pandas as pd
- df = pd.read_csv()  # pass the path to the CSV file
- df.head()
- df.tail()
- df.info()
- df.dtypes
- Identify y, the target (the predicted output, or dependent variable).
- Identify the features X that are relevant to predicting y.
- df.isnull().sum()
- Check for null values in the dataset; the call above returns the count of missing values per column.
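The inspection steps above can be sketched on a tiny stand-in table. The values below are invented for illustration, and treating `Exited` as the target y is an assumption made only for this example.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; the values are invented.
toy_df = pd.DataFrame({
    "Tenure": [1, 5, np.nan, 3],
    "Gender": ["Female", None, "Male", "Male"],
    "Exited": [0, 1, 0, 1],
})

# Count the missing values in each column.
null_counts = toy_df.isnull().sum()
print(null_counts)

# Assumption: "Exited" is the target y; every other column is a candidate feature.
target_column = "Exited"
y = toy_df[target_column]
X = toy_df.drop(columns=[target_column])
print(X.columns.tolist())
```

Keeping the target name in one variable makes it easy to change once the real y has been identified.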
- Calculate Correlation Coefficients:
- A statistical measure that shows how strongly two variables are linearly related.
- Most common: the Pearson correlation coefficient (r), which ranges from -1 to 1.
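# numeric_only=True (pandas 1.5+) skips text columns such as Geography, which would otherwise raise an error in pandas 2.x
corr_matrix = df.corr(numeric_only=True).round(2) # round to make the visualization more aesthetic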
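To make the coefficient less of a black box, here is a minimal sketch that computes Pearson r by hand and compares it with what pandas returns. The helper `pearson_r` and the toy columns are hypothetical names introduced for this example only.

```python
import numpy as np
import pandas as pd


def pearson_r(x, y):
    """Pearson r: the co-variation of x and y divided by the product of their spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_centered = x - x.mean()
    y_centered = y - y.mean()
    return (x_centered * y_centered).sum() / np.sqrt(
        (x_centered ** 2).sum() * (y_centered ** 2).sum()
    )


# Toy columns for illustration only.
toy_df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "doubled": [2, 4, 6, 8, 10],   # rises in step with x
    "reversed": [5, 4, 3, 2, 1],   # falls as x rises
})

manual_positive = pearson_r(toy_df["x"], toy_df["doubled"])
manual_negative = pearson_r(toy_df["x"], toy_df["reversed"])
pandas_positive = toy_df["x"].corr(toy_df["doubled"])  # pandas uses Pearson by default

print(manual_positive, manual_negative, pandas_positive)
```

A value near 1 means the two columns rise together, a value near -1 means one falls as the other rises, and a value near 0 means little linear relationship.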
- Create a heatmap to visualize the correlation matrix
import seaborn as sns
import matplotlib.pyplot as plt
# sets the size of the figure
plt.figure(figsize=(10, 8)) # in inches
# use seaborn's heatmap() function to draw the correlation matrix,
# annotate each cell with its numeric value (annot=True),
# use the coolwarm color map (cmap='coolwarm'),
# and format the annotation text to 2 decimal places (fmt=".2f")
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt=".2f")
# display the plot
plt.show()
- Visualize categorical variables
plt.figure(figsize = (20, 25))
plt.subplot(5, 2, 1)
plt.gca().set_title('Variable Geography')
sns.countplot(x = 'Geography', palette = 'Set2', data = df)
plt.subplot(5, 2, 2)
plt.gca().set_title('Variable Gender')
sns.countplot(x = 'Gender', palette = 'Set2', data = df)
plt.subplot(5, 2, 3)
plt.gca().set_title('Variable Tenure')
sns.countplot(x = 'Tenure', palette = 'Set2', data = df)
plt.subplot(5, 2, 4)
plt.gca().set_title('Variable NumOfProducts')
sns.countplot(x = 'NumOfProducts', palette = 'Set2', data = df)
plt.subplot(5, 2, 5)
plt.gca().set_title('Variable HasCrCard')
sns.countplot(x = 'HasCrCard', palette = 'Set2', data = df)
plt.subplot(5, 2, 6)
plt.gca().set_title('Variable IsActiveMember')
sns.countplot(x = 'IsActiveMember', palette = 'Set2', data = df)
plt.subplot(5, 2, 7)
plt.gca().set_title('Variable Exited')
sns.countplot(x = 'Exited', palette = 'Set2', data = df)
plt.subplot(5, 2, 8)
plt.gca().set_title('Variable Complain')
sns.countplot(x = 'Complain', palette = 'Set2', data = df)
plt.subplot(5, 2, 9)
plt.gca().set_title('Variable Satisfaction Score')
sns.countplot(x = 'Satisfaction Score', palette = 'Set2', data = df)
plt.subplot(5, 2, 10)
plt.gca().set_title('Variable Card Type')
sns.countplot(x = 'Card Type', palette = 'Set2', data = df)
- Set the size of the overall figure that contains all the subplots: plt.figure(figsize=(20, 25))
- Place each subplot on the grid: plt.subplot(nrows, ncols, index), where index starts at 1
- Get the current axes and set its title to the variable name: plt.gca().set_title('Variable name')
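Because the same three calls repeat for every variable, the grid can also be drawn in a loop. The sketch below is one possible variant, not the only way to do it: `plot_category_counts` is a hypothetical helper, the toy data is invented, and it uses `plt.subplots` with explicit axes instead of `plt.subplot` plus `plt.gca()`.

```python
import math

import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_category_counts(df, columns, ncols=2):
    """Draw one count plot per column on a shared grid of subplots."""
    nrows = math.ceil(len(columns) / ncols)
    # One overall figure; each row of subplots gets 5 inches of height.
    fig, axes = plt.subplots(nrows, ncols, figsize=(20, 5 * nrows), squeeze=False)
    flat_axes = axes.ravel()
    for ax, column in zip(flat_axes, columns):
        sns.countplot(data=df, x=column, ax=ax)
        ax.set_title(f"Variable {column}")
    # Hide unused cells when the grid has more slots than columns.
    for ax in flat_axes[len(columns):]:
        ax.set_visible(False)
    fig.tight_layout()
    return fig, flat_axes


# Toy data for illustration only; not the real dataset.
toy_df = pd.DataFrame({
    "Geography": ["France", "Spain", "France", "Germany"],
    "Gender": ["Female", "Male", "Male", "Female"],
    "Exited": [1, 0, 0, 1],
})
fig, axes = plot_category_counts(toy_df, ["Geography", "Gender", "Exited"])
```

Call plt.show() afterwards to display the figure. Passing the list of ten column names from the plots above would reproduce the same 5 x 2 grid.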