People Analytics: Evaluating Factors Predictive of Resignation

Matheus Schmitz

LinkedIn

Github Portfolio

Problem Statement

I've obtained a dataset from IBM containing various work-related features about a large number of employees, including a target feature Attrition, which assigns an employee to one of three classes {Current Employee, Terminated, Voluntary Resignation}.

Although a target feature does exist, my goal here is not to develop a Machine Learning model, but rather to perform People Analytics, using all the available features to extract insights which can then be used by the company's HR (Human Resources) department.

The dataset contains information generated from internal employee records and from employee surveys on specific work-related matters, as well as information about the employees' work history. The latter is of special interest to me, as that information could be used during the hiring process to select employees less likely to quit (Voluntary Resignation) or to be fired (Terminated).

The bulk of the work in this project will consist of Exploratory Analysis to extract insights, and Feature Selection, to uncover which are the most relevant attributes when it comes to predicting turnover.

Dataset Source: https://developer.ibm.com/patterns/data-science-life-cycle-in-action-to-solve-employee-attrition-problem/

Data Encoding

| # | Education | EnvironmentSatisfaction | JobInvolvement | JobSatisfaction | PerformanceRating | RelationshipSatisfaction | WorkLifeBalance |
|---|---|---|---|---|---|---|---|
| 1 | Below College | Low | Low | Low | Low | Low | Bad |
| 2 | College | Medium | Medium | Medium | Good | Medium | Good |
| 3 | Bachelor | High | High | High | Excellent | High | Better |
| 4 | Master | Very High | Very High | Very High | Outstanding | Very High | Best |
| 5 | Doctor | | | | | | |

Define Directory

Packages

Dataset

Data Cleaning

Feature Engineering

A person's stability in a company (job tenure) is a measure of the time an employee has been employed by their current employer. An employee's job tenure history (how long the person stayed at each company) is highly important and is often taken into account by recruiters when hiring a new employee.

It seems that in this case I did break something. Checking back at the quartiles for the NumCompaniesWorked variable, I can see that some employees have a 0 in there, which means some rows got divided by zero, which resulted in those infs. I'll fix that by simply imputing 0 in place of the infs.
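The fix can be sketched in plain Python. This is a minimal illustration, assuming AverageTenure is defined as prior years of experience divided by NumCompaniesWorked (the exact formula is not shown here); the function name and the toy rows are hypothetical.

```python
def average_tenure(prior_years, num_companies):
    """Average years per previous employer; 0.0 when there were none.

    Assumption: AverageTenure = PriorYearsOfExperience / NumCompaniesWorked.
    Returning 0.0 mirrors the choice of imputing 0 for the divide-by-zero rows.
    """
    if num_companies == 0:
        return 0.0
    return prior_years / num_companies


# Made-up rows: (PriorYearsOfExperience, NumCompaniesWorked)
rows = [(6, 3), (0, 0), (10, 4)]
tenures = [average_tenure(prior, companies) for prior, companies in rows]
print(tenures)
```

Guarding the division up front avoids producing the infs in the first place, which has the same end result as replacing them afterwards.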

Exploratory Analysis

Prior Work Experience & Age

58% of the employees have under 3 years of work experience before joining IBM.

Possible problems: young workforce, underdeveloped skillsets, less mature work mentality.

Only 22% of the employees are under 30 years old. Turns out the employee base isn't as young as I had assumed!

Education

IBM's workforce is well educated, with ~39% of employees having a bachelor's degree, a further 27% having a master's degree and a further 3% having a doctorate. That's about 69% of the workforce having a university education.

Especially interesting are the 30% of the workforce that possess some sort of advanced degree, which ought to be considerably above average in comparison to other companies.

Monthly Income x Job Satisfaction

Monthly income appears to have no effect on job satisfaction.

Years at Company

Not much of a surprise here, considering that people realistically aren't promoted every year, and that even with a promotion they most often keep their roles and managers.

Monthly Income

The fact that TotalWorkingYears is a better predictor of MonthlyIncome than YearsAtCompany suggests that overall experience and expertise are more valuable (higher paid) than company loyalty.

Monthly Income x Work Life Balance

Employees who rate their work-life balance poorly (a rating of 1) also have a lower monthly income.

Poor work-life balance AND low income? That's a problem worth further investigation by IBM's HR!
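A comparison like this one boils down to summarising income per rating group. Below is a minimal sketch using only the standard library; the function name is hypothetical and the (WorkLifeBalance, MonthlyIncome) pairs are made-up values for illustration, not figures from the dataset.

```python
from collections import defaultdict
from statistics import median


def median_income_by_rating(rows):
    """Group incomes by work-life balance rating and take each group's median."""
    groups = defaultdict(list)
    for rating, income in rows:
        groups[rating].append(income)
    return {rating: median(incomes) for rating, incomes in sorted(groups.items())}


# Made-up (WorkLifeBalance, MonthlyIncome) pairs, NOT real data.
rows = [
    (1, 3000), (1, 4000),
    (2, 5000), (2, 7000),
    (3, 6000), (3, 6500),
    (4, 5500), (4, 7500),
]
medians = median_income_by_rating(rows)
print(medians)
```

The median is used here rather than the mean because income distributions tend to be skewed by a few high earners.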

Gender Pay Gap

There are no signs of gender-based pay discrimination in this data, with women earning slightly more than men.

Gender Breakdown

Job Role

Resignations

Note that here I'm using data_rh_1, which is the dataset without terminated employees.

Funnily enough, HR staff seem to have the highest resignation rates... Perhaps they know something other employees don't?

Unsurprisingly, managers and (research) directors are less prone to resignation, while sales representatives are quite prone to it. That is not really surprising for anyone who has been in the job market and knows that people working in sales tend to be those always chasing a better deal.

It looks like, at about the age people start having babies, they lose their appetite for resigning and seeking new opportunities...

Also, single people are somewhat more likely to resign, as are those living further away from the company.

Business travel and overtime requirements both make people more prone to resignation. Are those two correlated?

There is no correlation between BusinessTravel and OverTime. It seems that, regardless of business travel status, about 1/3 of employees report doing overtime.
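One way to check this kind of relationship between two categorical variables is to compute the overtime share within each travel category. The sketch below is a hypothetical illustration: the category labels and the records are made up, chosen so that every group has the same share.

```python
from collections import defaultdict


def overtime_share_by_travel(records):
    """Return the fraction of employees doing overtime per travel category."""
    counts = defaultdict(lambda: [0, 0])  # [overtime count, total count]
    for travel, overtime in records:
        counts[travel][1] += 1
        if overtime == "Yes":
            counts[travel][0] += 1
    return {travel: yes / total for travel, (yes, total) in counts.items()}


# Made-up (BusinessTravel, OverTime) records; the labels are assumptions.
records = [
    ("Travel_Rarely", "Yes"), ("Travel_Rarely", "No"), ("Travel_Rarely", "No"),
    ("Travel_Frequently", "Yes"), ("Travel_Frequently", "No"), ("Travel_Frequently", "No"),
    ("Non-Travel", "Yes"), ("Non-Travel", "No"), ("Non-Travel", "No"),
]
shares = overtime_share_by_travel(records)
print(shares)
```

If the shares are roughly equal across categories, as in this toy data, knowing someone's travel status says little about whether they do overtime.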

Again no surprises: less satisfaction and involvement lead to more resignations. The same goes for worse work-life balance.

Predictive Modeling

Here the goal isn't to train the best possible algorithm to predict new data points, but rather to train understandable algorithms which would provide useful information to the Human Resources department. For this reason there will also be no train-test split.

One of the main implications is that I will only be using variables which are known during the hiring process, as variables that only occur after hiring are useless for making hiring predictions.
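A minimal sketch of that restriction is shown below. The list of hiring-time columns is an assumption pieced together from the variables that appear in the models in this notebook (the exact selection is not shown here), and the employee record is made up.

```python
# Columns assumed to be known at hiring time (hypothetical selection).
HIRING_TIME_FEATURES = [
    "Age", "Department", "DistanceFromHome", "Education", "EducationField",
    "EmployeeSource", "JobRole", "MaritalStatus",
    "PriorYearsOfExperience", "AverageTenure",
]


def keep_hiring_time_features(record):
    """Return a copy of `record` restricted to hiring-time columns."""
    return {k: v for k, v in record.items() if k in HIRING_TIME_FEATURES}


# Made-up record mixing hiring-time and post-hiring columns.
employee = {"Age": 34, "JobRole": "Manager", "YearsAtCompany": 5, "JobSatisfaction": 3}
filtered = keep_hiring_time_features(employee)
print(filtered)
```

Columns such as YearsAtCompany or JobSatisfaction are dropped because they only exist once the person has already been hired.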

Logistic Regression

It seems that none of the education-related variables (Education and EducationField) are significant (they have high p-values) when it comes to predicting Attrition, so I'll remove them to make a simpler model.

Variables that are significant predictors of STAYING (negative coefficients):

  1. Age
  2. Department = Research & Development
  3. JobRole = Manager
  4. JobRole = Research Director
  5. AverageTenure

Variables that are significant predictors of LEAVING (positive coefficients):

  1. DistanceFromHome
  2. EmployeeSource = Company Website
  3. JobRole = Laboratory Technician
  4. JobRole = Sales Representative
  5. MaritalStatus = Married
  6. MaritalStatus = Single
  7. PriorYearsOfExperience
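The sign convention used in the two lists above can be made concrete by converting a logistic regression coefficient (a change in log-odds) into an odds ratio with exp(coefficient). The coefficient values below are hypothetical placeholders, not the fitted values from this model.

```python
import math


def odds_ratio(beta):
    """Convert a log-odds coefficient into a multiplicative odds ratio."""
    return math.exp(beta)


# Hypothetical coefficients for the log-odds of resigning; NOT fitted values.
coefficients = {"Age": -0.05, "DistanceFromHome": 0.03}

for name, beta in coefficients.items():
    direction = "leaving" if beta > 0 else "staying"
    print(f"{name}: odds ratio {odds_ratio(beta):.3f} -> predicts {direction}")
```

An odds ratio below 1 means a one-unit increase in the variable lowers the odds of resignation (staying), while a ratio above 1 raises them (leaving).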

Gradient Boosting with XGBoost

XGBoost documentation on Feature Selection: https://xgboost.readthedocs.io/en/latest/R-package/discoverYourData.html

As expected there is some imbalance, but if I recall the literature correctly, as long as the rare class is still ~20% the size of the common class, then one shouldn't expect many issues due to class imbalance.
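The rule of thumb above can be checked with a quick ratio of the rare class to the common class. This is a minimal sketch with made-up label counts; the function name is hypothetical and the ~20% threshold is the recollection stated above, not an established cutoff.

```python
from collections import Counter


def minority_to_majority_ratio(labels):
    """Size of the rarest class divided by the size of the most common class."""
    counts = Counter(labels)
    return min(counts.values()) / max(counts.values())


# Made-up labels: 8 current employees and 2 voluntary resignations.
labels = ["Current Employee"] * 8 + ["Voluntary Resignation"] * 2
ratio = minority_to_majority_ratio(labels)
print(f"rare class is {ratio:.0%} the size of the common class")
```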

Conclusion

All our p-values are much lower than 0.05, meaning we've found our most significant variables (among those available at hiring time) in predicting employee resignation!

The top 6 ranked features by XGBoost are also ranked as significant by the Logistic Regression. Those are: Age, DistanceFromHome, PriorYearsOfExperience, AverageTenure, Department (Research and Development), and MaritalStatus (Married or Single, but not divorced).

Among those, the GLM tells us that the variables correlated with staying are Age (older employees are more likely to stay), AverageTenure (longer tenures predict staying), and Department (those hired into the Research and Development department are more likely to stay). In contrast, increases in PriorYearsOfExperience predict a greater likelihood of resignation, as do any MaritalStatus other than divorced and a longer DistanceFromHome.

Additionally, XGBoost also ranks Education (2, 3 and 4) as important, while GLM adds JobRole (Manager, Research Director, Laboratory Technician or Sales Representative) and EmployeeSource (Company Website).
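The agreement between the two models can be summarised with simple set operations. The sketch below uses the variable names listed in this conclusion, collapsed to the variable level (for example, all JobRole categories become "JobRole"); that collapsing is a simplification made for illustration.

```python
# Variables the GLM flags as significant, collapsed to the variable level.
glm_significant = {
    "Age", "Department", "JobRole", "AverageTenure", "DistanceFromHome",
    "EmployeeSource", "MaritalStatus", "PriorYearsOfExperience",
}

# Variables XGBoost ranks as important (the top 6 plus Education).
xgb_important = {
    "Age", "DistanceFromHome", "PriorYearsOfExperience", "AverageTenure",
    "Department", "MaritalStatus", "Education",
}

shared = sorted(glm_significant & xgb_important)
only_glm = sorted(glm_significant - xgb_important)
only_xgb = sorted(xgb_important - glm_significant)

print("Both models:", shared)
print("GLM only:", only_glm)
print("XGBoost only:", only_xgb)
```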

The insights generated during both exploratory analysis and predictive modeling can now be shared with IBM's HR department as we team up to reduce employee resignation and turnover!

End