Visualization is crucial in a People Analytics project, both in the data exploration phase and the reporting. However, from my experience and endeavor to promote People Analytics practices among HR professionals, there is another advantage to compelling visualization.
Most HR professionals are still not data-savvy, and frankly, some may have aversive responses to tasks involving data analysis. However, an appealing and powerful visualization creates attention, curiosity, and enthusiasm that may assist in overcoming those barriers.
To demonstrate how visualization is helpful, I picked a relatively common topic in People Analytics: Employee absenteeism. In the following sections, I present two types of hypotheses regarding absenteeism, specifically, how employee background and work characteristics are related to absenteeism.
I used R to explore the data. I don’t necessarily expect HR professionals to do the same, but I do urge them to become good clients of data scientists. I hope that being exposed to such visualization possibilities will help them set expectations and demands. (Readers interested in exploring the R code of the following visualizations are warmly invited to visit my GitHub profile.)
Employee Absenteeism Open Data
I found the data of this demonstration in the UC Irvine Machine Learning Repository. The database was created at a courier company in Brazil. It includes records of absenteeism from July 2007 to July 2010. Variables in this dataset encompass time and duration of absence, employee background (distance from residence to work, service time, age, education, social drinking, social smoking), and work characteristics (workload, hit targets, disciplinary failure).
The structure of this data set is unique. It contains 740 rows and 20 columns. Each record represents an occurrence of absenteeism due to a single reason, measured in hours. Therefore, each employee may have multiple records, all marked with the same employee ID, and the hours in those records can be summed up per employee.
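To make the record structure concrete, here is a minimal sketch in Python (not the R code behind the figures) of how occurrence-level rows can be summed per employee. The field names `id`, `reason`, and `hours`, and all the values, are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict

# Hypothetical rows: one row per absence occurrence, as described above.
# IDs, reason codes, and hours are invented for illustration only.
records = [
    {"id": 11, "reason": 26, "hours": 4},
    {"id": 36, "reason": 23, "hours": 2},
    {"id": 11, "reason": 23, "hours": 8},
    {"id": 3, "reason": 7, "hours": 2},
    {"id": 36, "reason": 28, "hours": 1},
]


def total_hours_per_employee(rows):
    """Sum absence hours over all records that share an employee ID."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["id"]] += row["hours"]
    return dict(totals)


totals = total_hours_per_employee(records)
print(totals)  # {11: 12, 36: 3, 3: 2}
```

Whether to analyze occurrences or per-employee totals is a design choice: the first keeps the reason for each absence, while the second describes each employee once.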
The dataset was used in academic research at the Universidade Nove de Julho – Postgraduate Program in Informatics and Knowledge Management. The data creators are Andrea Martiniano, Ricardo Pinto Ferreira, and Renato Jose Sassi.
A special thanks to my colleagues who wrote the open book HR Analytics in R and brought this data set. However, my approach in this article is different from theirs. I aim my analysis towards actionable insights, as if my clients are HR leaders, rather than simply exploring the data for analysis. Therefore, all variables are considered predictors in creating the following visualizations, while absenteeism is the outcome.
Obviously, an actual project will include additional multivariate statistics and statistical models. Such analysis based on this dataset may become my next article. However, I added in most visualizations a remark for further critical thinking.
Absenteeism and Employee Background
Does absenteeism in the Brazilian courier company relate to its employees’ background? To be more precise, can we point to certain employee groups who are more prone to be absent? And maybe intervene among these groups?
In this part of the analysis, I explored employee background variables, such as age, tenure, body attributes, social behaviors, and family circumstances. Obviously, I tried to leverage whatever variables I could find in the data. However, People Analysts who work with actual data in their organizations may discover many more relevant variables. In addition, some variables in this dataset are not typical in organizations.
Employee Age and Tenure
How is absenteeism, measured in hours, associated with employee age or tenure? To explore the relationship between these three numerical variables, I chose a plot that captures the distribution of each variable, scatters each pair of variables, and presents their correlation.
As clearly shown in Figure 1, absenteeism is not related to age or tenure in the courier company. Notice, however, that age and tenure are correlated in this organization. It may not be the case among other occupations and organizations. Furthermore, the density plots that enable you to get an impression of the shape of the distribution of each variable may also be unique for this case study.
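The correlation shown in such a plot is typically the Pearson coefficient. As a minimal sketch in Python (not the R code behind Figure 1), the pairwise coefficients for three numeric columns could be computed as follows. The column names and every value are invented for illustration, so the printed numbers say nothing about the courier company.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented values, one entry per hypothetical employee.
columns = {
    "age": [28, 33, 38, 41, 47, 50],
    "tenure": [3, 9, 11, 14, 16, 18],
    "absence_hours": [8, 2, 16, 3, 8, 4],
}

# Print the coefficient for every pair of columns, as a pair plot would.
names = list(columns)
for i, first in enumerate(names):
    for second in names[i + 1:]:
        r = pearson(columns[first], columns[second])
        print(f"{first} vs {second}: r = {r:.2f}")
```

A coefficient near 1 or -1 indicates a strong linear relation, and one near 0 indicates none. Note that this measures only linear association.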
Employee Body Attributes
Is absenteeism associated with employee weight, height, or body mass index? Not at all, according to Figure 2. It may be nonsense to explore this in the context of an organization, and obviously, you don’t collect such data in most occupations. However, since the data set includes those variables, why not explore them? The following exploration is precisely the same as the previous one. Since I already had the code, I could quickly reproduce the visualization.
Notice that some employees in this data set are obese. If we find a positive correlation between absenteeism and weight, it doesn’t necessarily mean obesity causes absenteeism. Correlation does not imply causation. The alternative direction of the relation is possible too: People who tend to be absent gain more weight, for some reason. Either way, it’s not the case here.
Employee Social Behaviors
Do social smokers and social drinkers tend to be absent more than non-consumers? Here is where the results get interesting. To compare these types of employees, I chose box plots that capture the median, quartiles, and outliers of each group’s distribution. Box plots are very common and useful for data exploration. It is worth getting used to them because there is a good chance that your data scientist will use them.
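For readers curious about what a box plot actually summarizes, here is a minimal sketch in Python (not the R code behind the figures). It assumes the common convention that points more than 1.5 times the interquartile range beyond the quartiles are drawn as outliers. Plotting libraries may compute quartiles slightly differently, and the hours below are invented.

```python
import statistics


def box_plot_summary(values):
    """Return the quartiles and the outliers a box plot would display."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # interquartile range: the height of the box
    low_fence = q1 - 1.5 * iqr
    high_fence = q3 + 1.5 * iqr
    outliers = sorted(v for v in values if v < low_fence or v > high_fence)
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Invented absence hours for one hypothetical group of employees.
hours = [1, 2, 2, 3, 4, 8, 8, 8, 16, 120]

summary = box_plot_summary(hours)
print(summary)
```

Comparing two groups then comes down to comparing their medians and boxes, while the outliers show whether a few extreme absences dominate a group.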
As you can see in Figure 3, social smokers tend to be absent less than non-smokers. However, social drinkers tend to be absent more than non-drinkers, as seen in Figure 4. How can we explain this? Is it because drinking makes you wake up with a hangover, while smoking makes you prefer to show up and slack with your coworkers?
Employee Family Members
Does having kids or pets influence absenteeism? Here is where the results get pretty confusing. Let’s explore the visualization first and then resolve the confusion. I used box plots again to compare parents and non-parents, whether parents of kids or of pets. According to Figure 5, employees with kids tend to be slightly more absent. However, this is not the case for employees with pets, who tend to be less absent, as shown in Figure 6. Why is this visualization confusing or even misleading? As opposed to the former visualization, here it appears that we have a dependency between the two classifications. For example, parents may have both kids and pets. We don’t know how this confounding possibility affects each of the visualizations. I recommend always keeping your analytical mindset and critical thinking activated. (Hint: ask for additional inferential statistics, e.g., a Chi-square test, although this is beyond the scope of this demonstration.)
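As a hedged illustration of the Chi-square hint, the following minimal Python sketch (not the R code behind the figures) computes the Pearson chi-square statistic for a 2x2 table of kids by pets, without a continuity correction. The counts are invented and count hypothetical employees rather than absence records. The only outside fact used is the standard critical value of 3.841 for one degree of freedom at the 5% level.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic for a contingency table (list of rows)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected in this cell if the two classifications were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Invented counts of employees. Rows: has kids / no kids. Columns: has pets / no pets.
kids_by_pets = [
    [12, 8],
    [4, 12],
]

CRITICAL_VALUE_DF1_5PCT = 3.841  # chi-square critical value, 1 degree of freedom, alpha = 0.05

chi2 = chi_square_statistic(kids_by_pets)
print(f"chi-square = {chi2:.2f}")
print("dependent" if chi2 > CRITICAL_VALUE_DF1_5PCT else "no evidence of dependence")
```

If the statistic exceeds the critical value, the two classifications are unlikely to be independent, and the kids and pets box plots should not be read in isolation.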
Absenteeism and Work Characteristics
Is the absenteeism of employees in the Brazilian courier company associated with work characteristics? For example, are there situations or circumstances in which employees are at higher risk of absenteeism? What should we know about working conditions to prevent or reduce the risk?
In this part of the analysis, I explored work characteristic variables, such as workload, commute distance, hitting targets, disciplinary failures, and seasonality. Again, the variables in the dataset set my analysis boundaries. However, People Analysts should explore many more variables of work characteristics, which the HR department or the line of business may own.
Absenteeism, Workload, and Commute Distance
How do workload and commute distance affect absenteeism? Do these work characteristics influence the absenteeism of different employee types in the same way? It is reasonable to expect that both variables, heavy workload and long commute distance, cause increased absenteeism. However, as clearly shown in Figures 7 and 8, there are no such positive correlations in the dataset.
A deeper exploration of the data reveals that workload and commute distance affect absenteeism differently among employees of different education levels (see Figures 9-10). Workload is positively associated with absenteeism, but only among mid- and high-education employees. Commute distance is positively associated with absenteeism, but only among employees with high education. Sometimes a trend that appears in subgroups disappears when the groups are combined. This phenomenon is known as Simpson’s paradox.
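The mechanism behind Simpson’s paradox can be sketched with a toy example in Python (not the R code behind the figures). The numbers are invented and deliberately constructed: within each hypothetical education group, workload and absence hours rise together, yet the pooled correlation is zero because the groups sit at different levels. They do not reproduce the pattern in Figures 9-10.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Invented (workload, absence_hours) pairs per hypothetical education group.
groups = {
    "low": ([1, 3, 5], [5, 7, 9]),
    "mid": ([3, 5, 7], [3, 5, 7]),
    "high": ([5, 7, 9], [1, 3, 5]),
}

# Correlation inside each group: positive in all three.
within = {name: pearson(x, y) for name, (x, y) in groups.items()}

# Correlation after pooling all groups: the trend vanishes.
all_x = [v for x, _ in groups.values() for v in x]
all_y = [v for _, y in groups.values() for v in y]
pooled = pearson(all_x, all_y)

print(within)
print(f"pooled r = {pooled:.2f}")
```

This is why it pays to ask for the same plot split by a relevant grouping variable before concluding that two variables are unrelated.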
Absenteeism, Hitting targets, and Disciplinary failures
Do less disciplined employees differ from disciplined ones in their absenteeism at various target-hit levels? According to Figure 11, they do not. I picked this lollipop plot to mirror or present the two distributions against each other. Frankly, I also wanted to be more creative and sweetly go beyond the standard bar plots.
When Does Absenteeism Occur? Months and Seasons
Is there seasonality in absenteeism? Apparently, yes. According to Figure 12, absenteeism peaks in March and July, and its lowest point is in January. Not surprisingly, Mondays carry the highest risk of absenteeism.
Understanding such trends is essential for workforce planning. Unfortunately, the dataset does not include years, so Figures 12-13 represent the total hours of absenteeism in the entire research period. Otherwise, it would be interesting to explore the trend as a time series. However, a heatmap that presents total absenteeism hours by day and month adds an insightful perspective. As shown in Figure 14, each month has a unique pattern of absenteeism over the working week.
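The table behind such a heatmap is a grid of total hours per month and weekday. Here is a minimal sketch in Python (not the R code behind Figure 14) of that aggregation. The weekday labels, the tuple layout, and all values are invented for illustration.

```python
from collections import defaultdict

WEEKDAYS = ["Mon", "Tue", "Wed", "Thu", "Fri"]

# Invented records: (month number, weekday label, absence hours).
records = [
    (3, "Mon", 8),
    (3, "Mon", 4),
    (3, "Wed", 2),
    (7, "Mon", 16),
    (7, "Fri", 3),
    (1, "Tue", 1),
]


def absence_grid(rows):
    """Total hours per (month, weekday), as one row of weekday totals per month."""
    totals = defaultdict(int)
    for month, weekday, hours in rows:
        totals[(month, weekday)] += hours
    months = sorted({month for month, _, _ in rows})
    # Missing month/weekday combinations are filled with zero hours.
    return {m: [totals.get((m, d), 0) for d in WEEKDAYS] for m in months}


grid = absence_grid(records)
for month, row in grid.items():
    print(month, row)
```

Each cell of the grid then maps to one colored tile in the heatmap, with darker tiles for more hours.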