0.1 Introduction

Over 15,000 food establishments across the City of Chicago are subject to sanitation inspections by the Department of Public Health. Only three dozen inspectors are responsible for checking these establishments, which leaves each inspector covering nearly 470 food establishments. Locating restaurants with critical issues is an essential priority for the Department of Public Health: approximately 15% of inspections result in at least one critical violation. Given the large number of inspections that inspectors must complete, the time it takes to discover critical violations can mean prolonged exposure to potential disease, illness, and unsanitary conditions at some food establishments. This imposes both a social and a monetary cost on the City of Chicago, its residents and visitors, and of course the restaurants themselves.

To help the Chicago Department of Public Health better allocate its limited health inspectors across the many food establishments in the City, we designed an app that uses a predictive algorithm to flag at-risk food establishments.

The TasteCheck app is designed to help health inspectors prioritize the food establishments most likely to have critical violations, based on our predictive model. So who is this app for? Chicago health inspectors.

The app consists of a secure login page where health inspectors use their government email and password to access their TasteCheck account. Once logged in, the app operates much like Google Maps: a user can search for a specific location and get the fastest route via GPS. Where TasteCheck differs from Google Maps is that it is specific to food establishments across the City of Chicago. Inspectors can use the filter feature to locate low-, medium-, or high-risk establishments. Users can also click on a point location icon to see a quick summary of an establishment: an image of the restaurant, its name, business hours, and contact info. The last feature of this interface is the selected establishment landing page, which gives the user a more detailed summary of the location, including the restaurant’s previous violation history, owner name, and the date the business opened.

Our model allows inspectors to answer questions like, “which restaurants are on my priority list to visit today?” and “where are the highest-risk establishments located near me?”.

The following markdown demonstrates how we arrived at a better solution for the Chicago Department of Public Health using feature engineering, logistic regression modeling, and a cost/benefit analysis that optimizes resource allocation.

Link to Youtube Video: https://www.youtube.com/watch?v=mSOdjVIxOnE

0.2 Set-up

Using data on food establishments available in Chicago’s Open Data Portal, we began by visualizing inspection failures across Chicago, both as points and as a density surface. Because the data table is rather large, with inspections going back as far as 2010, we focused on 2018 inspections only.


chicagoBoundary <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform("EPSG:4326")
## Reading layer `chicagoBoundary' from data source 
##   `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA///Chapter5/chicagoBoundary.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84

inspect <- 
  read.socrata("https://data.cityofchicago.org/Health-Human-Services/Food-Inspections/4ijn-s7e5") 
  

0.3 Feature Engineering

To optimize the features in the food inspection dataset and prevent collinearity in features with many categories, we recoded several variables into clearer categories for use in the logistic regression. We also added external variables, transforming them into average nearest neighbor distance features before joining them to our dataset.

0.3.1 Manipulating Internal Variables

To develop a stronger model, we manipulated several of the dataset’s internal variables; designing better features improved our overall model accuracy. Although we were quite limited in the number of useful variables within the dataset, we were able to transform three internal variables.

inspect_ <- inspect

# Mutate results to numeric 
inspect_ <- mutate(inspect_, results_numeric = as.numeric(case_when(results == "Fail" ~ 1,
                                                    results == "Pass" ~ 0))) %>% 
    na.omit()%>%
    filter(grepl('2018', inspection_date))

# Recode facility type (keeps restaurants, grocery stores, and schools)
    inspect_ <- mutate(inspect_, facility = as.factor(case_when(facility_type == "Restaurant" ~ "Rest",
        facility_type == "Grocery Store" ~ "Grocery", facility_type == "School" ~ "School"))) %>% na.omit()
    

# Filter by Inspection Types
    inspect_ <- mutate(inspect_, inspection_type = as.factor(case_when(inspection_type == "Canvass" ~ "Canvass",
        inspection_type == "Canvass Re-Inspection" ~ "Re-Inspection",
        inspection_type == "Complaint" ~ "Complaint",
        inspection_type == "License" ~ "License"))) %>%na.omit()
    
  
 # Derive season from the inspection month
inspect_$month <- format(as.Date(inspect_$inspection_date, format="%Y/%m/%d"),"%m")

inspect_ <-
  inspect_%>%
  mutate(Season = case_when(
    month == "12" |month == "01" | month == "02" ~ "Winter",
    month == "03" |month == "04" | month == "05" ~ "Spring",
    month == "06" |month == "07" | month == "08" ~ "Summer",
    month == "09" |month == "10" | month == "11" ~ "Fall"))

  inspect_ <-    
    st_as_sf(inspect_,coords = c("longitude", "latitude"), crs = "EPSG:4326", agr = "constant")%>%
    st_transform(st_crs(chicagoBoundary)) %>% 
    distinct()
  
  inspect_ <- st_set_crs(inspect_, 4326)
    
  

0.3.2 Exploratory Maps

0.3.2.1 Visualizing Food Inspection Failure

As expected, the density of failed inspections is concentrated in the city center, with additional hot spots just to its north.


grid.arrange(ncol=2,
ggplot() + 
    geom_sf(data = chicagoBoundary,  fill = "ivory", color = "#999999") +
  geom_sf(data = inspect_, colour="#a72608", size=0.1, show.legend = "point") +
  labs(title= "Restaurant Inspection Failure, Points", 
       subtitle= "Chicago, IL") +
  mapTheme(title_size = 14),

ggplot() + 
 geom_sf(data = chicagoBoundary,  fill = "ivory", color = "#999999")+
  stat_density2d(data = data.frame(st_coordinates(inspect_)), 
                 aes(X, Y, fill = ..level.., alpha = ..level..),
                 size = 0.01, bins = 40, geom = 'polygon') + scale_fill_gradientn(colors = c("#1A1A1A","#581905",
"#a72608","#ac4e31", "#e6eed6"))+
  scale_alpha(range = c(0.00, 0.35), guide = FALSE) +
  labs(title = "Density of Restaurant Inspection Failures", 
       subtitle= "Chicago, IL") +
  mapTheme(title_size = 14) + theme(legend.position = "none"))

0.3.2.2 Plotting Categorical Features

The bar charts below compare pass/fail counts across the categorical features: facility type, inspection type, risk level, and season.

# Inspect_charts is assumed to be the engineered inspection data with its geometry dropped
Inspect_charts <- st_drop_geometry(inspect_)

Inspect_charts %>%
  dplyr::select(results, facility_type) %>%
  gather(Variable, value, -results) %>%
  count(Variable, value, results) %>%
  ggplot(aes(value, n, fill = results)) +   
    geom_bar(position = "dodge", stat="identity") +
    facet_wrap(~Variable, scales="free") +
    scale_fill_manual(values = palette2) +
    labs(x="Categories", y="Count",
         title = "Feature associations with the likelihood of failing",
         subtitle = "Three category features") +
    plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Inspect_charts %>%
  dplyr::select(results, inspection_type) %>%
  gather(Variable, value, -results) %>%
  count(Variable, value, results) %>%
  ggplot(aes(value, n, fill = results)) +   
    geom_bar(position = "dodge", stat="identity") +
    facet_wrap(~Variable, scales="free") +
    scale_fill_manual(values = palette2) +
    labs(x="Categories", y="Count",
         title = "Feature associations with the likelihood of failing",
         subtitle = "Three category features") +
    plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Inspect_charts %>%
  dplyr::select(results, risk) %>%
  gather(Variable, value, -results) %>%
  count(Variable, value, results) %>%
  ggplot(aes(value, n, fill = results)) +   
    geom_bar(position = "dodge", stat="identity") +
    facet_wrap(~Variable, scales="free") +
    scale_fill_manual(values = palette2) +
    labs(x="Categories", y="Count",
         title = "Feature associations with the likelihood of failing",
         subtitle = "Three category features") +
    plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

Inspect_charts %>%
  dplyr::select(results, Season) %>%
  gather(Variable, value, -results) %>%
  count(Variable, value, results) %>%
  ggplot(aes(value, n, fill = results)) +   
    geom_bar(position = "dodge", stat="identity") +
    facet_wrap(~Variable, scales="free") +
    scale_fill_manual(values = palette2) +
    labs(x="Categories", y="Count",
         title = "Feature associations with the likelihood of failing",
         subtitle = "Four category features") +
    plotTheme() + theme(axis.text.x = element_text(angle = 45, hjust = 1))

0.3.3 Adding External Variables

In addition to the internal variables manipulated within our dataset, we introduced external risk factors to strengthen the model’s ability to identify establishments at risk of critical health violations. These variables are burglary occurrences, sanitation complaints, rodent sightings, and garbage cart complaints.


burglaries <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
    filter(Primary.Type == "BURGLARY" & Description == "FORCIBLE ENTRY") %>%  
    dplyr::select(Y = Latitude, X = Longitude)%>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs =  "EPSG:4326", agr = "constant") %>%
    st_transform(st_crs(chicagoBoundary)) %>% 
    mutate(Legend = "Burglaries") %>% 
    distinct()


sanitation <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") 

sanitation2 <- sanitation
sanitation2 <- sanitation2 %>% mutate(year = substr(creation_date,1,4)) 
sanitation2 <- sanitation2 %>% filter(year == "2018") %>% 
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"),  crs =  "EPSG:4326", agr = "constant") %>%
    st_transform(st_crs(chicagoBoundary)) %>% mutate(Legend = "Sanitation")
rm(sanitation)


rodents <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-Historical/97t6-zrhs") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"),  crs =  "EPSG:4326", agr = "constant") %>%
    st_transform(st_crs(chicagoBoundary)) %>%
    mutate(Legend = "Rodents")

garbage <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Garbage-Carts-Historical/9ksk-na4q") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"),  crs = "EPSG:4326", agr = "constant") %>%
    st_transform(st_crs(chicagoBoundary)) %>%
    mutate(Legend = "Garbage")

0.3.4 Visualizing External Variables


chicagoBoundary2 <- 
  st_read(file.path(root.dir,"/Chapter5/chicagoBoundary.geojson")) %>%
  st_transform('ESRI:102271') 
## Reading layer `chicagoBoundary' from data source 
##   `https://raw.githubusercontent.com/urbanSpatial/Public-Policy-Analytics-Landing/master/DATA///Chapter5/chicagoBoundary.geojson' 
##   using driver `GeoJSON'
## Simple feature collection with 1 feature and 1 field
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: -87.8367 ymin: 41.64454 xmax: -87.52414 ymax: 42.02304
## Geodetic CRS:  WGS 84

burglaries2 <- 
  read.socrata("https://data.cityofchicago.org/Public-Safety/Crimes-2018/3i3m-jwuy") %>% 
    filter(Primary.Type == "BURGLARY" & Description == "FORCIBLE ENTRY") %>%  
    dplyr::select(Y = Latitude, X = Longitude)%>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"), crs =  "EPSG:4326", agr = "constant") %>%
    st_transform(st_crs(chicagoBoundary2)) %>% 
    mutate(Legend = "Burglaries") %>% 
    distinct()


sanitation3 <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Sanitation-Code-Complaints-Hi/me59-5fac") 

sanitation4 <- sanitation3
sanitation4 <- sanitation4 %>% mutate(year = substr(creation_date,1,4)) 
sanitation4 <- sanitation4 %>% filter(year == "2018") %>% 
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"),  crs =  "EPSG:4326", agr = "constant") %>%
    st_transform(st_crs(chicagoBoundary2)) %>% mutate(Legend = "Sanitation")



rodents2 <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Rodent-Baiting-Historical/97t6-zrhs") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"),  crs =  "EPSG:4326", agr = "constant") %>%
    st_transform(st_crs(chicagoBoundary2)) %>%
    mutate(Legend = "Rodents")

garbage2 <-
  read.socrata("https://data.cityofchicago.org/Service-Requests/311-Service-Requests-Garbage-Carts-Historical/9ksk-na4q") %>%
    mutate(year = substr(creation_date,1,4)) %>% filter(year == "2018") %>%
    dplyr::select(Y = latitude, X = longitude) %>%
    na.omit() %>%
    st_as_sf(coords = c("X", "Y"),  crs = "EPSG:4326", agr = "constant") %>%
    st_transform(st_crs(chicagoBoundary2)) %>%
    mutate(Legend = "Garbage")

# Make Chicago Fishnet

fishnet <- 
  st_make_grid(chicagoBoundary2,
               cellsize = 500, 
               square = TRUE) %>%
  .[chicagoBoundary2] %>%           
  st_sf() %>%
  mutate(uniqueID = 1:n()) 

# Aggregate features to our fishnet


vars_net <- 
  rbind(burglaries2, sanitation4, rodents2, garbage2) %>%
  st_join(., fishnet, join=st_within) %>%
  st_drop_geometry() %>%
  group_by(uniqueID, Legend) %>%
  summarize(count = n()) %>%
    full_join(fishnet) %>%
    spread(Legend, count, fill=0) %>%
    st_sf() %>%
    na.omit() %>%
    ungroup() %>% dplyr::select (uniqueID, geometry, Burglaries, Garbage, Rodents, Sanitation)
## `summarise()` has grouped output by 'uniqueID'. You can override using the
## `.groups` argument.
## Joining, by = "uniqueID"

vars_net.long <- 
  gather(vars_net, Variable, value, -geometry, -uniqueID)

# Plot Spatial variables
vars <- unique(vars_net.long$Variable)
mapList <- list()

for(i in vars){
  mapList[[i]] <- 
    ggplot() +
      geom_sf(data = filter(vars_net.long, Variable == i), aes(fill=value), colour=NA) +
       scale_fill_gradientn(colors = c("#802d2b", "#7a2a31", "#a50700", "#d84422", "#ff7249", "#ffa072", "#ffce9c", 
                                       "#f1f1f1")) +
      labs(title=i) +
      mapTheme(title_size = 16)}

do.call(grid.arrange,c(mapList, ncol=2))

0.3.5 Nearest Neighbors

Using the nearest neighbor tool, we calculated, for each food establishment, the average distance to its k nearest burglaries, sanitation complaints, rodent sightings, and garbage cart complaints, for k from 1 to 5.
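
The nn_function() helper used below is assumed to be the standard average nearest neighbor distance helper (as in the urbanSpatial Public Policy Analytics functions): for each origin point it returns the mean distance to its k nearest destination points. A minimal sketch, assuming coordinate matrices as inputs and FNN for the neighbor search:

# Sketch of the assumed helper: mean distance from each point in
# measureFrom to its k nearest neighbors in measureTo
library(FNN)

nn_function <- function(measureFrom, measureTo, k) {
  # get.knnx() returns an nrow(measureFrom) x k matrix of neighbor distances
  nn_dist <- get.knnx(measureTo, measureFrom, k)$nn.dist
  # average the k distances for each origin point
  rowMeans(nn_dist)
}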

# Burglaries
st_c <- st_coordinates

inspect_<-
 inspect_%>% 
    mutate(
    burglaries_nn1 = nn_function(st_c(inspect_), st_c(burglaries), 1),
      burglaries_nn2 = nn_function(st_c(inspect_), st_c(burglaries), 2), 
      burglaries_nn3 = nn_function(st_c(inspect_), st_c(burglaries), 3), 
      burglaries_nn4 = nn_function(st_c(inspect_), st_c(burglaries), 4), 
      burglaries_nn5 = nn_function(st_c(inspect_), st_c(burglaries), 5))


# Rodents 
inspect_<-
 inspect_%>% 
    mutate(
    rodents_nn1 = nn_function(st_c(inspect_), st_c(rodents), 1),
      rodents_nn2 = nn_function(st_c(inspect_), st_c(rodents), 2), 
      rodents_nn3 = nn_function(st_c(inspect_), st_c(rodents), 3), 
      rodents_nn4 = nn_function(st_c(inspect_), st_c(rodents), 4), 
      rodents_nn5 = nn_function(st_c(inspect_), st_c(rodents), 5))


# Sanitation 
inspect_<-
 inspect_%>% 
    mutate(
    sanitation_nn1 = nn_function(st_c(inspect_), st_c(sanitation2), 1),
      sanitation_nn2 = nn_function(st_c(inspect_), st_c(sanitation2), 2), 
      sanitation_nn3 = nn_function(st_c(inspect_), st_c(sanitation2), 3), 
      sanitation_nn4 = nn_function(st_c(inspect_), st_c(sanitation2), 4), 
      sanitation_nn5 = nn_function(st_c(inspect_), st_c(sanitation2), 5))


# Garbage cart complaints 
inspect_<-
 inspect_%>% 
    mutate(
    garbage_nn1 = nn_function(st_c(inspect_), st_c(garbage), 1),
      garbage_nn2 = nn_function(st_c(inspect_), st_c(garbage), 2), 
      garbage_nn3 = nn_function(st_c(inspect_), st_c(garbage), 3), 
      garbage_nn4 = nn_function(st_c(inspect_), st_c(garbage), 4), 
      garbage_nn5 = nn_function(st_c(inspect_), st_c(garbage), 5))

Inspect_nogeo<-st_set_geometry(inspect_, NULL)

0.4 Logistic Regression

We used logistic regression to predict the probability that a food establishment will “Fail” or “Pass” a food inspection, conditional on the features of the model. After adding the engineered internal and external features, we ran our regression. The dataset was split into a 75/25 training/test set, which we used to train the model to predict inspection failures and to subsequently generate our cost/benefit predictions.

set.seed(3456)
trainIndex <- createDataPartition(Inspect_nogeo$results, p = .75,
                                  list = FALSE,
                                  times = 1)
Train <- Inspect_nogeo[ trainIndex,]
Test  <- Inspect_nogeo[-trainIndex,]


reg1 <- glm(results_numeric ~ .,
                  data= Train %>% dplyr::select(facility_type, risk, inspection_type, results_numeric, burglaries_nn1, burglaries_nn2, burglaries_nn3, burglaries_nn4, burglaries_nn5, sanitation_nn1, sanitation_nn2, sanitation_nn3, sanitation_nn4, sanitation_nn5, garbage_nn1, garbage_nn2, garbage_nn3, garbage_nn4, garbage_nn5, Season),
                  family="binomial" (link="logit"))
## 
## Call:
## glm(formula = results_numeric ~ ., family = binomial(link = "logit"), 
##     data = Train %>% dplyr::select(facility_type, risk, inspection_type, 
##         results_numeric, burglaries_nn1, burglaries_nn2, burglaries_nn3, 
##         burglaries_nn4, burglaries_nn5, sanitation_nn1, sanitation_nn2, 
##         sanitation_nn3, sanitation_nn4, sanitation_nn5, garbage_nn1, 
##         garbage_nn2, garbage_nn3, garbage_nn4, garbage_nn5, Season))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8886  -0.9206  -0.6430   1.1298   2.3351  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   1.326e+00  1.581e-01   8.386  < 2e-16 ***
## facility_typeRestaurant      -5.564e-01  1.134e-01  -4.907 9.27e-07 ***
## facility_typeSchool          -5.997e-01  1.410e-01  -4.253 2.11e-05 ***
## riskRisk 2 (Medium)          -3.504e-01  9.330e-02  -3.756 0.000172 ***
## riskRisk 3 (Low)             -3.058e-01  1.789e-01  -1.709 0.087459 .  
## inspection_typeComplaint      6.064e-01  8.664e-02   6.999 2.58e-12 ***
## inspection_typeLicense        2.779e-01  8.616e-02   3.226 0.001257 ** 
## inspection_typeRe-Inspection -1.253e+00  1.033e-01 -12.136  < 2e-16 ***
## burglaries_nn1               -4.392e+00  9.782e+01  -0.045 0.964190    
## burglaries_nn2               -4.649e+02  2.454e+02  -1.895 0.058130 .  
## burglaries_nn3                1.034e+03  4.217e+02   2.452 0.014211 *  
## burglaries_nn4               -1.105e+02  5.542e+02  -0.199 0.841977    
## burglaries_nn5               -5.248e+02  3.165e+02  -1.658 0.097255 .  
## sanitation_nn1                1.258e+02  1.420e+02   0.886 0.375789    
## sanitation_nn2                2.917e+02  3.657e+02   0.798 0.425131    
## sanitation_nn3               -1.431e+03  6.176e+02  -2.317 0.020511 *  
## sanitation_nn4                1.969e+03  8.534e+02   2.308 0.021026 *  
## sanitation_nn5               -8.353e+02  4.811e+02  -1.736 0.082497 .  
## garbage_nn1                   8.219e+01  7.496e+01   1.097 0.272855    
## garbage_nn2                  -3.716e+02  2.121e+02  -1.752 0.079838 .  
## garbage_nn3                   7.405e+02  5.431e+02   1.363 0.172787    
## garbage_nn4                  -2.938e+02  6.672e+02  -0.440 0.659739    
## garbage_nn5                  -2.208e+02  3.580e+02  -0.617 0.537360    
## SeasonSpring                 -1.198e+00  9.910e-02 -12.083  < 2e-16 ***
## SeasonSummer                 -3.677e-01  1.091e-01  -3.369 0.000755 ***
## SeasonWinter                 -1.193e+00  1.042e-01 -11.449  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 7092.5  on 5382  degrees of freedom
## Residual deviance: 6396.0  on 5357  degrees of freedom
## AIC: 6448
## 
## Number of Fisher Scoring iterations: 4

0.5 Goodness of Fit

To test the goodness of fit of our model, we considered several methods. Starting with the McFadden R-squared, our engineered model scored 0.098. We tested multiple combinations of engineered features in the regression, and the McFadden R-squared remained at or below roughly 0.10, so we carried the best-performing model forward for the rest of the analysis.
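
The pseudo-R-squared shown below was presumably computed with pscl::pR2(), as it is for the facility-specific models later in the analysis:

pR2(reg1)[4]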

## fitting null model for pseudo-r2
##   McFadden 
## 0.09820099

0.5.1 Distribution of predicted probabilities by observed outcomes

As an alternative look at goodness of fit, we examined the distribution of predicted probabilities by observed outcome, illustrated in the graph below.

testProbs <- data.frame(Outcome = as.factor(Test$results_numeric),
                        Probs = predict(reg1, Test, type= "response"))
head(testProbs)
##    Outcome     Probs
## 2        0 0.2719177
## 9        1 0.3353489
## 13       0 0.3336171
## 20       0 0.3838073
## 24       1 0.3435746
## 29       1 0.3461836

ggplot(testProbs, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) + xlim(0, 1) +
  labs(x = "Outcome", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  plotTheme() + theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

testProbs <- 
  testProbs %>%
  mutate(predOutcome  = as.factor(ifelse(testProbs$Probs > 0.5 , 1, 0)))

head(testProbs)
##    Outcome     Probs predOutcome
## 2        0 0.2719177           0
## 9        1 0.3353489           0
## 13       0 0.3336171           0
## 20       0 0.3838073           0
## 24       1 0.3435746           0
## 29       1 0.3461836           0

0.5.2 Confusion matrix

The confusion matrix shows the counts of observed food inspection “fails” and “passes” against their predicted outcomes. Each entry in the matrix provides a different comparison between observed and predicted, given the 50% threshold.

  • There were 240 true positives (a “fail” correctly predicted as a “fail”).

  • There were 137 false positives (a “pass” incorrectly predicted as a “fail”).

  • There were 994 true negatives (a “pass” correctly predicted as a “pass”).

  • There were 423 false negatives (a “fail” incorrectly predicted as a “pass”).
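
The matrix and statistics below were presumably produced with caret::confusionMatrix() on the test set predictions, mirroring the calls used for the facility-specific models later in the analysis:

caret::confusionMatrix(testProbs$predOutcome, testProbs$Outcome, 
                       positive = "1")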

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 994 423
##          1 137 240
##                                           
##                Accuracy : 0.6878          
##                  95% CI : (0.6658, 0.7092)
##     No Information Rate : 0.6304          
##     P-Value [Acc > NIR] : 1.964e-07       
##                                           
##                   Kappa : 0.2645          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3620          
##             Specificity : 0.8789          
##          Pos Pred Value : 0.6366          
##          Neg Pred Value : 0.7015          
##              Prevalence : 0.3696          
##          Detection Rate : 0.1338          
##    Detection Prevalence : 0.2101          
##       Balanced Accuracy : 0.6204          
##                                           
##        'Positive' Class : 1               
## 

0.5.3 ROC Curves

The ROC curve visualizes the trade-off between the true positive and false positive rates across thresholds, while also providing a single goodness of fit indicator. The AUC for the engineered model is 0.73. Given that an AUC of 0.5 is no better than a coin flip and 1 is a perfect fit, our model is a decent predictive model.

ggplot(testProbs, aes(d = as.numeric(testProbs$Outcome), m = Probs)) +
  geom_roc(n.cuts = 50, labels = FALSE, colour = "#a72608") +
  style_roc(theme = theme_grey) +
  geom_abline(slope = 1, intercept = 0, size = 1.5, color = 'grey') +
  labs(title = "ROC Curve - Inspection Fail Model")

## Setting levels: control = 0, case = 1
## Setting direction: controls < cases
## Area under the curve: 0.7308

0.6 Cross-validation

The figure below plots the distribution of AUC, sensitivity, and specificity across the 100 folds. Our model generalizes well for specificity: the rate at which it correctly predicts that an establishment will not fail an inspection, so no inspector needs to be sent. It does not generalize as well for sensitivity: the rate at which it correctly predicts that an establishment will fail, so an inspector should be sent. Although the model is weaker at generalizing which establishments will fail, in this context it still tells inspectors which establishments are not at risk, which saves cost and time for the Chicago Department of Public Health and its limited number of inspectors.

ctrl <- trainControl(method = "cv", number = 100, classProbs=TRUE, summaryFunction=twoClassSummary)

cvFit <- train(results ~ ., data = Inspect_nogeo%>% dplyr::select(facility_type, risk, inspection_type, results, burglaries_nn1, burglaries_nn2, burglaries_nn3, burglaries_nn4, burglaries_nn5, sanitation_nn1, sanitation_nn2, sanitation_nn3, sanitation_nn4, sanitation_nn5, garbage_nn1, garbage_nn2, garbage_nn3, garbage_nn4, garbage_nn5, Season), 
                method="glm", family="binomial",
                metric="ROC", trControl = ctrl)
cvFit
## Generalized Linear Model 
## 
## 7177 samples
##   19 predictor
##    2 classes: 'Fail', 'Pass' 
## 
## No pre-processing
## Resampling: Cross-Validated (100 fold) 
## Summary of sample sizes: 7105, 7105, 7105, 7105, 7105, 7106, ... 
## Resampling results:
## 
##   ROC        Sens      Spec     
##   0.7068938  0.374245  0.8556329
dplyr::select(cvFit$resample, -Resample) %>%
  gather(metric, value) %>%
  left_join(gather(cvFit$results[2:4], metric, mean)) %>%
  ggplot(aes(value)) + 
    geom_histogram(bins=35, fill = "#babca0ff") +
    facet_wrap(~metric) +
    geom_vline(aes(xintercept = mean), colour = "#a72608", linetype = 3, size = 1.5) +
    scale_x_continuous(limits = c(0, 1)) +
    labs(x="Goodness of Fit", y="Count", title="CV Goodness of Fit Metrics",
         subtitle = "Across-fold mean reprented as dotted lines") +
    plotTheme()
## Joining, by = "metric"

0.7 Cost/Benefit Analysis

The cost/benefit analysis was developed with the following parameters:

  • Hourly rate for a food safety inspector, assuming one hour spent per restaurant: $30/hr

  • Minimum cost a restaurant assumes to rectify a failed inspection: $2,000

The cost/benefit for each outcome in our confusion matrix is calculated below:

  • True negative cost: We predicted that the restaurant would pass the inspection, and it did. The inspector was still deployed, but at a later time. Count * 30

  • True positive cost: We predicted that the restaurant would fail the inspection, and it did. The inspector was sent to the location immediately. Count * 30

  • False negative cost: We predicted that the restaurant would pass the inspection, but it did not. The inspector was deployed at a later time, so there is a social/health cost that impacts customers. (Count * 30) + (Count * 2000)

  • False positive cost: We predicted that the restaurant would fail the inspection, but it did not. The inspector was sent to the location immediately. Count * 30

cost_benefit_table <-
   testProbs %>%
      count(predOutcome, Outcome) %>%
      summarize(True_Negative = sum(n[predOutcome==0 & Outcome==0]),
                True_Positive = sum(n[predOutcome==1 & Outcome==1]),
                False_Negative = sum(n[predOutcome==0 & Outcome==1]),
                False_Positive = sum(n[predOutcome==1 & Outcome==0])) %>%
       gather(Variable, Count) %>%
       mutate(Revenue =
               case_when(Variable == "True_Negative"  ~ (Count * 30),
                         Variable == "True_Positive"  ~ (Count * 30),
                         Variable == "False_Negative" ~ ((Count *30) + (Count * 2000)),
                         Variable == "False_Positive" ~ (Count * 30))) %>%
    bind_cols(data.frame(Description = c(
              "We predicted that the restaurant would pass the inspection, and it did. Inspector was still deployed, but at a later time",
              "We predicted that the restaurant would fail the inspection, and it did. Inspector was sent to the location immediately",
              "We predicted that the restaurant would pass the inspection, but it did not. Inspector was deployed at a later time",
              "We predicted that the restaurant would fail the inspection, but it did not. Inspector was sent to the location immediately")))

kable(cost_benefit_table,
       caption = "Cost/Benefit Table") %>% kable_styling()
Cost/Benefit Table
Variable Count Revenue Description
True_Negative 994 29820 We predicted that the restaurant would pass the inspection, and it did. Inspector was still deployed, but at a later time
True_Positive 240 7200 We predicted that the restaurant would fail the inspection, and it did. Inspector was sent to the location immediately
False_Negative 423 858690 We predicted that the restaurant would pass the inspection, but it did not. Inspector was deployed at a later time
False_Positive 137 4110 We predicted that the restaurant would fail the inspection, but it did not. Inspector was sent to the location immediately
whichThreshold <- 
  iterateThresholds(
     data=testProbs, observedClass = Outcome, predictedProbs = Probs)

whichThreshold[1:5,]
## # A tibble: 5 × 10
##   Count_TN Count_TP Count_FN Count_FP Rate_TP Rate_FP Rate_FN  Rate_TN Accuracy
##      <int>    <int>    <int>    <int>   <dbl>   <dbl>   <dbl>    <dbl>    <dbl>
## 1        0      663        0     1131       1   1           0 0           0.370
## 2        1      663        0     1130       1   0.999       0 0.000884    0.370
## 3        2      663        0     1129       1   0.998       0 0.00177     0.371
## 4        3      663        0     1128       1   0.997       0 0.00265     0.371
## 5        4      663        0     1127       1   0.996       0 0.00354     0.372
## # … with 1 more variable: Threshold <dbl>

whichThreshold <- 
  whichThreshold %>%
    dplyr::select(starts_with("Count"), Threshold) %>%
    gather(Variable, Count, -Threshold) %>%
    mutate(Cost =
             case_when(Variable == "Count_TN"  ~ (Count * 30),
                       Variable == "Count_TP"  ~ (Count * 30),
                       Variable == "Count_FN"  ~ ((Count *30) + (Count * 2000)),
                       Variable == "Count_FP"  ~ (Count * 30)))
whichThreshold %>%
  ggplot(.,aes(Threshold, Cost, colour = Variable)) +
  geom_point() +
  scale_colour_manual(values = palette2[c(5, 1:3)]) +    
  labs(title = "Cost by confusion matrix type and threshold",
       y = "Cost") +
  plotTheme() +
  guides(colour=guide_legend(title = "Confusion Matrix")) 
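
To choose an operating threshold from this plot, the per-outcome costs can be totaled at each threshold and the cost-minimizing cutoff identified. A minimal sketch using the whichThreshold table built above:

# Total cost across all four confusion matrix outcomes, by threshold
whichThreshold_cost <- 
  whichThreshold %>%
    group_by(Threshold) %>%
    summarize(Total_Cost = sum(Cost))

# Threshold with the lowest total cost
whichThreshold_cost %>% filter(Total_Cost == min(Total_Cost))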

0.8 Comparing the efficacy of the model by facility types

We explored how our model performs by facility type, specifically restaurants versus grocery stores, since this offers a deeper look at how generalizable the model is.

0.8.1 Restaurants only

For comparison, the baseline regression, which included restaurants, grocery stores, and schools, predicted failed restaurants with a sensitivity of 36%. The restaurants-only model produced a similar accuracy of about 69% but a slightly lower sensitivity of about 34%. Overall, both models are better at identifying establishments that are not at risk of critical violations than those that are.


# Subset data for restaurants only
Inspect_nogeo2 <- Inspect_nogeo

Inspect_nogeo2 <- mutate(Inspect_nogeo2, facility = as.factor(case_when(facility_type == "Restaurant" ~ "Rest"))) %>%na.omit()

# Split data and run regression
set.seed(3456)
trainIndex <- createDataPartition(Inspect_nogeo2$results, p = .75,
                                  list = FALSE,
                                  times = 1)
Train2 <- Inspect_nogeo2[ trainIndex,]
Test2  <- Inspect_nogeo2[-trainIndex,]



reg2 <- glm(results_numeric ~ .,
                  data= Train2 %>% dplyr::select(risk, inspection_type, results_numeric, burglaries_nn1, burglaries_nn2, burglaries_nn3, burglaries_nn4, burglaries_nn5, sanitation_nn1, sanitation_nn2, sanitation_nn3, sanitation_nn4, sanitation_nn5, garbage_nn1, garbage_nn2, garbage_nn3, garbage_nn4, garbage_nn5, Season),
                  family="binomial" (link="logit"))

summary(reg2)
## 
## Call:
## glm(formula = results_numeric ~ ., family = binomial(link = "logit"), 
##     data = Train2 %>% dplyr::select(risk, inspection_type, results_numeric, 
##         burglaries_nn1, burglaries_nn2, burglaries_nn3, burglaries_nn4, 
##         burglaries_nn5, sanitation_nn1, sanitation_nn2, sanitation_nn3, 
##         sanitation_nn4, sanitation_nn5, garbage_nn1, garbage_nn2, 
##         garbage_nn3, garbage_nn4, garbage_nn5, Season))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8365  -0.9156  -0.6036   1.1033   2.5743  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   9.846e-01  1.440e-01   6.840 7.95e-12 ***
## riskRisk 2 (Medium)          -4.473e-01  1.079e-01  -4.146 3.38e-05 ***
## riskRisk 3 (Low)             -9.741e-02  4.333e-01  -0.225 0.822119    
## inspection_typeComplaint      5.821e-01  9.458e-02   6.155 7.49e-10 ***
## inspection_typeLicense        3.499e-01  9.403e-02   3.721 0.000198 ***
## inspection_typeRe-Inspection -1.784e+00  1.443e-01 -12.361  < 2e-16 ***
## burglaries_nn1               -4.764e+01  1.163e+02  -0.410 0.681984    
## burglaries_nn2               -6.011e+02  2.838e+02  -2.118 0.034177 *  
## burglaries_nn3                1.666e+03  4.881e+02   3.413 0.000644 ***
## burglaries_nn4               -9.408e+02  6.393e+02  -1.472 0.141125    
## burglaries_nn5               -1.572e+02  3.631e+02  -0.433 0.665067    
## sanitation_nn1                1.883e+02  1.666e+02   1.130 0.258471    
## sanitation_nn2                1.945e+02  4.264e+02   0.456 0.648275    
## sanitation_nn3               -1.310e+03  7.413e+02  -1.767 0.077248 .  
## sanitation_nn4                2.038e+03  9.977e+02   2.043 0.041084 *  
## sanitation_nn5               -9.677e+02  5.567e+02  -1.738 0.082182 .  
## garbage_nn1                  -3.100e+01  7.694e+01  -0.403 0.687035    
## garbage_nn2                  -2.783e+02  2.214e+02  -1.257 0.208766    
## garbage_nn3                   5.130e+02  5.685e+02   0.902 0.366858    
## garbage_nn4                   1.089e+02  6.757e+02   0.161 0.871934    
## garbage_nn5                  -3.980e+02  3.583e+02  -1.111 0.266666    
## SeasonSpring                 -1.472e+00  1.235e-01 -11.917  < 2e-16 ***
## SeasonSummer                 -5.481e-01  1.296e-01  -4.230 2.34e-05 ***
## SeasonWinter                 -1.346e+00  1.282e-01 -10.495  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 5379.4  on 4123  degrees of freedom
## Residual deviance: 4748.0  on 4100  degrees of freedom
## AIC: 4796
## 
## Number of Fisher Scoring iterations: 4

# Goodness of Fit
pR2(reg2)[4]
## fitting null model for pseudo-r2
##  McFadden 
## 0.1173753

testProbs2 <- data.frame(Outcome = as.factor(Test2$results_numeric),
                        Probs = predict(reg2, Test2, type= "response"))
head(testProbs2)
##    Outcome     Probs
## 2        0 0.3080540
## 11       1 0.3879545
## 12       1 0.3995623
## 26       1 0.4692226
## 29       1 0.3571005
## 30       1 0.3548451
ggplot(testProbs2, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) + xlim(0, 1) +
  labs(x = "Outcome", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  plotTheme() + theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

testProbs2 <- 
  testProbs2 %>%
  mutate(predOutcome2  = as.factor(ifelse(testProbs2$Probs > 0.5 , 1, 0)))

## Confusion matrix
caret::confusionMatrix(testProbs2$predOutcome2, testProbs2$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 775 325
##          1 107 167
##                                           
##                Accuracy : 0.6856          
##                  95% CI : (0.6603, 0.7101)
##     No Information Rate : 0.6419          
##     P-Value [Acc > NIR] : 0.0003649       
##                                           
##                   Kappa : 0.2418          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
##                                           
##             Sensitivity : 0.3394          
##             Specificity : 0.8787          
##          Pos Pred Value : 0.6095          
##          Neg Pred Value : 0.7045          
##              Prevalence : 0.3581          
##          Detection Rate : 0.1215          
##    Detection Prevalence : 0.1994          
##       Balanced Accuracy : 0.6091          
##                                           
##        'Positive' Class : 1               
## 

0.8.2 Grocery stores only

When analyzing grocery stores only, the model improved noticeably at identifying at-risk establishments. It had a somewhat lower accuracy of about 62%, but its sensitivity rose to about 67%. This model does a much better job of identifying at-risk grocery stores than grocery stores that are not at risk.

In summary, our model is decent at predicting at-risk restaurants and considerably better at predicting at-risk grocery stores. Considering the widespread impact a critical violation at a grocery store could have on a community (e.g., an E. coli outbreak or the recall of a food item), we consider this finding significant in this context.


# Subset for Grocery Stores                        
Inspect_nogeo3 <- Inspect_nogeo

Inspect_nogeo3 <- mutate(Inspect_nogeo3, facility = as.factor(case_when(facility_type == "Grocery Store" ~ "Grocery"))) %>%na.omit()                         

# Split data and run regression
set.seed(3456)
trainIndex <- createDataPartition(Inspect_nogeo3$results, p = .75,
                                  list = FALSE,
                                  times = 1)
Train3 <- Inspect_nogeo3[ trainIndex,]
Test3  <- Inspect_nogeo3[-trainIndex,]

reg3 <- glm(results_numeric ~ .,
                  data= Train3 %>% dplyr::select(risk, inspection_type, results_numeric, burglaries_nn1, burglaries_nn2, burglaries_nn3, burglaries_nn4, burglaries_nn5, sanitation_nn1, sanitation_nn2, sanitation_nn3, sanitation_nn4, sanitation_nn5, garbage_nn1, garbage_nn2, garbage_nn3, garbage_nn4, garbage_nn5, Season),
                  family="binomial" (link="logit"))

summary(reg3)
## 
## Call:
## glm(formula = results_numeric ~ ., family = binomial(link = "logit"), 
##     data = Train3 %>% dplyr::select(risk, inspection_type, results_numeric, 
##         burglaries_nn1, burglaries_nn2, burglaries_nn3, burglaries_nn4, 
##         burglaries_nn5, sanitation_nn1, sanitation_nn2, sanitation_nn3, 
##         sanitation_nn4, sanitation_nn5, garbage_nn1, garbage_nn2, 
##         garbage_nn3, garbage_nn4, garbage_nn5, Season))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.2328  -1.1026   0.4726   1.0537   2.2128  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                   2.408e+00  4.472e-01   5.385 7.25e-08 ***
## riskRisk 2 (Medium)           1.967e-01  2.536e-01   0.776  0.43802    
## riskRisk 3 (Low)              4.962e-02  2.607e-01   0.190  0.84906    
## inspection_typeComplaint      2.474e-01  2.543e-01   0.973  0.33069    
## inspection_typeLicense       -1.248e-01  2.915e-01  -0.428  0.66861    
## inspection_typeRe-Inspection -1.634e+00  3.630e-01  -4.500 6.78e-06 ***
## burglaries_nn1                2.460e+02  3.325e+02   0.740  0.45945    
## burglaries_nn2               -7.103e+02  7.627e+02  -0.931  0.35170    
## burglaries_nn3                7.794e+01  1.317e+03   0.059  0.95282    
## burglaries_nn4                7.766e+02  1.817e+03   0.427  0.66911    
## burglaries_nn5               -5.021e+02  1.055e+03  -0.476  0.63400    
## sanitation_nn1                2.138e+02  4.485e+02   0.477  0.63364    
## sanitation_nn2               -2.928e+02  1.224e+03  -0.239  0.81090    
## sanitation_nn3               -1.452e+03  2.034e+03  -0.714  0.47520    
## sanitation_nn4                2.534e+03  2.630e+03   0.964  0.33526    
## sanitation_nn5               -1.158e+03  1.472e+03  -0.787  0.43140    
## garbage_nn1                   1.081e+03  5.223e+02   2.070  0.03846 *  
## garbage_nn2                  -4.073e+02  1.181e+03  -0.345  0.73020    
## garbage_nn3                  -8.878e+02  2.878e+03  -0.308  0.75772    
## garbage_nn4                   6.033e+02  3.364e+03   0.179  0.85766    
## garbage_nn5                  -4.032e+02  1.642e+03  -0.246  0.80606    
## SeasonSpring                 -1.998e+00  3.422e-01  -5.839 5.26e-09 ***
## SeasonSummer                 -1.121e+00  3.563e-01  -3.146  0.00165 ** 
## SeasonWinter                 -1.943e+00  3.502e-01  -5.548 2.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 743.14  on 537  degrees of freedom
## Residual deviance: 637.50  on 514  degrees of freedom
## AIC: 685.5
## 
## Number of Fisher Scoring iterations: 6
# Goodness of Fit
pR2(reg3)[4]
## fitting null model for pseudo-r2
##  McFadden 
## 0.1421582

testProbs3 <- data.frame(Outcome = as.factor(Test3$results_numeric),
                        Probs = predict(reg3, Test3, type= "response"))
head(testProbs3)
##    Outcome     Probs
## 6        0 0.3898402
## 38       1 0.2929631
## 59       1 0.6443704
## 60       1 0.4850357
## 72       1 0.3763821
## 75       1 0.4718849
ggplot(testProbs3, aes(x = Probs, fill = as.factor(Outcome))) + 
  geom_density() +
  facet_grid(Outcome ~ .) +
  scale_fill_manual(values = palette2) + xlim(0, 1) +
  labs(x = "Outcome", y = "Density of probabilities",
       title = "Distribution of predicted probabilities by observed outcome") +
  plotTheme() + theme(strip.text.x = element_text(size = 18),
        legend.position = "none")

testProbs3 <- 
  testProbs3 %>%
  mutate(predOutcome3  = as.factor(ifelse(testProbs3$Probs > 0.5 , 1, 0)))

## Confusion matrix
caret::confusionMatrix(testProbs3$predOutcome3, testProbs3$Outcome, 
                       positive = "1")
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0 46 31
##          1 37 64
##                                           
##                Accuracy : 0.618           
##                  95% CI : (0.5423, 0.6897)
##     No Information Rate : 0.5337          
##     P-Value [Acc > NIR] : 0.0143          
##                                           
##                   Kappa : 0.2289          
##                                           
##  Mcnemar's Test P-Value : 0.5443          
##                                           
##             Sensitivity : 0.6737          
##             Specificity : 0.5542          
##          Pos Pred Value : 0.6337          
##          Neg Pred Value : 0.5974          
##              Prevalence : 0.5337          
##          Detection Rate : 0.3596          
##    Detection Prevalence : 0.5674          
##       Balanced Accuracy : 0.6140          
##                                           
##        'Positive' Class : 1               
## 

0.9 Conclusion

In conclusion, we found that our model best predicted grocery stores at risk of critical code violations. If time had permitted, an analysis of how the model performs across individual neighborhoods would be useful, along with additional variables such as whether an establishment holds an alcohol or tobacco license, the number of days since its last inspection, and how long it has been in operation.

We also found that the model’s performance changed when we filtered for one facility type at a time, which is important to consider in future analyses. Food processes, frequency of use, and operating logistics may differ greatly by facility type, so it may be beneficial to model facility types individually, or in groups of similar facilities, to optimize the prediction model.

Because of the threat food-borne disease poses to the public health of the community, our costs do not represent financial costs alone. While the Department of Public Health’s Food Protection department and restaurants would incur financial and time costs, the social cost to the community is far greater. We recommend our model as a tool to improve Chicago’s food inspection process, better allocate resources, and protect the citizens of Chicago.