I. Preface

This notebook focuses on predictive modeling of machine maintenance and failure metrics. Using a dataset with a categorical failure label, we perform exploratory data analysis (EDA), then build and tune models, and finish with data visualization.

II. Importing Libraries and Dataset

# Versions attached in this session: tidyverse 2.0.0, tidymodels 1.2.0 (built under R 4.3.x)
library(tidyverse)
library(tidymodels)
# caret masks yardstick's precision(), recall(), sensitivity() and specificity(),
# and purrr's lift(); qualify calls with yardstick:: or purrr:: where needed.
library(caret)
machine <- read_csv("Downloads/predictive_maintenance.csv")
## Rows: 10000 Columns: 10
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): Product ID, Type, Failure Type
## dbl (7): UDI, Air temperature [K], Process temperature [K], Rotational speed...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
colnames(machine)
##  [1] "UDI"                     "Product ID"             
##  [3] "Type"                    "Air temperature [K]"    
##  [5] "Process temperature [K]" "Rotational speed [rpm]" 
##  [7] "Torque [Nm]"             "Tool wear [min]"        
##  [9] "Target"                  "Failure Type"

III. Exploratory Data Analysis

summary(machine)
##       UDI         Product ID            Type           Air temperature [K]
##  Min.   :    1   Length:10000       Length:10000       Min.   :295.3      
##  1st Qu.: 2501   Class :character   Class :character   1st Qu.:298.3      
##  Median : 5000   Mode  :character   Mode  :character   Median :300.1      
##  Mean   : 5000                                         Mean   :300.0      
##  3rd Qu.: 7500                                         3rd Qu.:301.5      
##  Max.   :10000                                         Max.   :304.5      
##  Process temperature [K] Rotational speed [rpm]  Torque [Nm]    Tool wear [min]
##  Min.   :305.7           Min.   :1168           Min.   : 3.80   Min.   :  0    
##  1st Qu.:308.8           1st Qu.:1423           1st Qu.:33.20   1st Qu.: 53    
##  Median :310.1           Median :1503           Median :40.10   Median :108    
##  Mean   :310.0           Mean   :1539           Mean   :39.99   Mean   :108    
##  3rd Qu.:311.1           3rd Qu.:1612           3rd Qu.:46.80   3rd Qu.:162    
##  Max.   :313.8           Max.   :2886           Max.   :76.60   Max.   :253    
##      Target       Failure Type      
##  Min.   :0.0000   Length:10000      
##  1st Qu.:0.0000   Class :character  
##  Median :0.0000   Mode  :character  
##  Mean   :0.0339                     
##  3rd Qu.:0.0000                     
##  Max.   :1.0000
sum(is.na(machine))
## [1] 0
library(corrplot)
## corrplot 0.94 loaded
machine %>%
  select(where(is.numeric)) %>%   # select_if() is superseded by select(where())
  drop_na() %>%
  cor() %>%
  corrplot(addCoef.col = "black") # addCoef.col expects a colour, not TRUE

The correlation plot shows a clear negative correlation between rotational speed and torque. This is expected: at a given power output, higher torque comes with lower rotational speed. Process temperature and air temperature are positively correlated.
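The torque and speed trade-off follows from the relation between mechanical power, torque, and angular speed (power = torque × angular speed). The sketch below is only an illustration under the assumption of a fixed power budget; the 6000 W figure, the speed values, and the names `power_w` and `budget_w` are hypothetical and do not come from the dataset.

```r
# Mechanical power (W) = torque (Nm) * angular speed (rad/s).
power_w <- function(torque_nm, speed_rpm) {
  torque_nm * speed_rpm * 2 * pi / 60
}

# Hypothetical fixed power budget, chosen only for illustration.
budget_w  <- 6000
speed_rpm <- c(1200, 1500, 2000, 2800)

# Torque available at each speed if power is held constant.
torque_nm <- budget_w / (speed_rpm * 2 * pi / 60)

data.frame(speed_rpm, torque_nm = round(torque_nm, 1))

# Negative: torque falls as speed rises when power is fixed.
cor(speed_rpm, torque_nm)
```

Real machines do not run at exactly constant power, so the observed relationship is noisier than this idealized curve.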

machine %>%
  ggplot(aes(`Torque [Nm]`, `Rotational speed [rpm]`)) +
  geom_point() +
  geom_smooth(method = "lm")
## `geom_smooth()` using formula = 'y ~ x'

table(machine$`Failure Type`)
## 
## Heat Dissipation Failure               No Failure       Overstrain Failure 
##                      112                     9652                       78 
##            Power Failure          Random Failures        Tool Wear Failure 
##                       95                       18                       45
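The counts above show a heavily imbalanced label, which matters later when choosing metrics and resampling. As a rough sketch, the shares can be computed directly from the printed counts; the vector below is typed in by hand from that output, and the variable names are illustrative.

```r
# Counts copied from the table(machine$`Failure Type`) output above.
failure_counts <- c(
  "Heat Dissipation Failure" = 112,
  "No Failure"               = 9652,
  "Overstrain Failure"       = 78,
  "Power Failure"            = 95,
  "Random Failures"          = 18,
  "Tool Wear Failure"        = 45
)

# Share of each class, in percent.
failure_share <- round(100 * failure_counts / sum(failure_counts), 2)
failure_share

# Rows carrying a failure label vs. rows flagged by the binary Target column.
# mean(Target) is taken from the summary() output (0.0339, rounded).
n_typed_failures  <- sum(failure_counts) - failure_counts[["No Failure"]]
n_target_failures <- round(0.0339 * sum(failure_counts))
c(typed = n_typed_failures, target = n_target_failures)
```

The two failure counts do not agree exactly, which suggests `Target` and `Failure Type` are not perfectly aligned; it may be worth cross-tabulating them before modeling.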
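# Drop identifier columns, plus rotational speed and air temperature, each of
# which is correlated with a retained variable (torque, process temperature).
machine2 <- machine %>%
  janitor::clean_names() %>%
  select(-udi, -product_id, -rotational_speed_rpm, -air_temperature_k)
head(machine2)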
## # A tibble: 6 × 6
##   type  process_temperature_k torque_nm tool_wear_min target failure_type
##   <chr>                 <dbl>     <dbl>         <dbl>  <dbl> <chr>       
## 1 M                      309.      42.8             0      0 No Failure  
## 2 L                      309.      46.3             3      0 No Failure  
## 3 L                      308.      49.4             5      0 No Failure  
## 4 L                      309.      39.5             7      0 No Failure  
## 5 L                      309.      40               9      0 No Failure  
## 6 M                      309.      41.9            11      0 No Failure
summary_table <- machine2 %>%
  group_by(failure_type) %>%
  summarise(process_temp_avg = mean(process_temperature_k),
            torque_avg = mean(torque_nm),
            tool_wear_min_avg = mean(tool_wear_min))
ggplot(summary_table, aes(failure_type, process_temp_avg)) +
  geom_col(aes(fill = failure_type)) +
  ggtext::geom_richtext(aes(label = round(process_temp_avg, 2)), alpha = 0.7, size = 6) +
  ggtitle("Process Temperature per Failure") +
  ylab("Temperature [K]") +
  theme_bw() +
  theme(axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 9),
        legend.title = element_blank())

The mean process temperature barely differs from one failure type to another. Process temperature on its own therefore does little to distinguish failure types, and we will not spend much time on it.
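Because the bars start at zero and the values sit near 310 K, small differences between groups are hard to judge by eye. One possible check, sketched here on a synthetic stand-in for `machine2` (the `toy` values are made up for illustration only), is to express each group mean as a distance from the overall mean in units of the overall standard deviation.

```r
library(dplyr)

# Synthetic stand-in for machine2; these values are NOT from the dataset.
toy <- tibble(
  failure_type = rep(c("No Failure", "Power Failure", "Tool Wear Failure"),
                     each = 4),
  process_temperature_k = c(309.1, 310.2, 310.9, 309.8,
                            309.6, 310.4, 310.1, 309.9,
                            310.3, 309.7, 310.6, 310.0)
)

overall_mean <- mean(toy$process_temperature_k)
overall_sd   <- sd(toy$process_temperature_k)

# Group mean, group size, and standardized gap from the overall mean.
temp_gap <- toy %>%
  group_by(failure_type) %>%
  summarise(mean_k = mean(process_temperature_k),
            n = n()) %>%
  mutate(gap_in_sd = (mean_k - overall_mean) / overall_sd)

temp_gap
```

Running the same pipeline on `machine2` would show whether any failure type sits noticeably away from the overall mean; gaps close to zero would support setting temperature aside.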

ggplot(summary_table, aes(failure_type, tool_wear_min_avg)) +
  geom_col(aes(fill = failure_type)) +
  # Label each bar with the tool wear average (the label previously showed process_temp_avg)
  ggtext::geom_richtext(aes(label = round(tool_wear_min_avg, 2)), alpha = 0.7, size = 6) +
  ggtitle("Tool Wear Avg per Failure") +
  ylab("Tool Wear [min]") +
  theme_bw() +
  theme(axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 9),
        legend.title = element_blank())

ggplot(summary_table, aes(failure_type, torque_avg)) +
  geom_col(aes(fill = failure_type)) +
  # Label each bar with the torque average (the label previously showed process_temp_avg)
  ggtext::geom_richtext(aes(label = round(torque_avg, 2)), alpha = 0.7, size = 6) +
  ggtitle("Torque Avg per Failure") +
  ylab("Torque [Nm]") +
  theme_bw() +
  theme(axis.title.x = element_blank(),
        plot.title = element_text(hjust = 0.5),
        text = element_text(size = 9),
        legend.title = element_blank())
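
The three bar charts above repeat the same layers and differ only in the plotted column, title, and axis label, which makes copy-paste slips easy. A minimal sketch of a shared helper follows; `plot_failure_avg` is a hypothetical name, `toy_summary` holds made-up values, and `geom_label()` stands in for `ggtext::geom_richtext()` only to keep the example dependent on ggplot2 alone.

```r
library(ggplot2)

# Hypothetical helper: bar chart of one per-failure average.
# value_col is passed unquoted and forwarded with {{ }}.
plot_failure_avg <- function(data, value_col, title, y_label) {
  ggplot(data, aes(failure_type, {{ value_col }})) +
    geom_col(aes(fill = failure_type)) +
    geom_label(aes(label = round({{ value_col }}, 2)), alpha = 0.7, size = 6) +
    ggtitle(title) +
    ylab(y_label) +
    theme_bw() +
    theme(axis.title.x = element_blank(),
          plot.title = element_text(hjust = 0.5),
          text = element_text(size = 9),
          legend.title = element_blank())
}

# Made-up summary values, for illustration only.
toy_summary <- data.frame(
  failure_type = c("No Failure", "Power Failure"),
  torque_avg   = c(40.0, 48.5)
)

p <- plot_failure_avg(toy_summary, torque_avg,
                      "Torque Avg per Failure", "Torque [Nm]")
p
```

With `summary_table` in place of `toy_summary`, the same call pattern would cover `process_temp_avg`, `tool_wear_min_avg`, and `torque_avg`, and the label always matches the plotted column.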