
Classification Trees and Recidivism



Part 1A: Description of the Data

In this section, we explore two methods, classification trees and random forests, for predicting recidivism. First, let's take a look at the data.


Description of the Data

The data consist of 10,428 people arrested in Broward County, Florida between 2013 and 2014, and include each person's past criminal history and the charges they faced at arrest. The data also record the outcome (recidivated / did not recidivate) in the two years after they received their COMPAS risk assessment. For a description of each variable, please click here.


Classification Trees

The first machine learning method we will look at is classification trees. To familiarize yourself, please watch the video below, which walks through creating a classification tree using a sample of the data we will be working with.



When measuring accuracy, it is also important to measure a model's accuracy within different groups. This is a good way to check whether the model is biased against a specific group.
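As a concrete illustration, here is a minimal sketch in Python of computing accuracy within each group. The toy table and its column names are hypothetical; the app performs the same kind of calculation on the real data.

```python
import pandas as pd

# Toy predictions; the app computes the same idea for the real data.
df = pd.DataFrame({
    "sex":       ["Female", "Female", "Male", "Male", "Male"],
    "actual":    [0, 1, 1, 0, 1],
    "predicted": [0, 1, 0, 0, 1],
})

# Accuracy within each group: the share of rows where the model was right.
correct = df["actual"] == df["predicted"]
print(correct.groupby(df["sex"]).mean())
```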

Part 1B: Recidivism in the United States

Investigating Your Own Classification Tree

Use the app below to make your own classification trees. The app allows you to choose which variables go into a tree and how bushy (how many branches) the tree can be. You can see how well your tree performs by looking at its confusion matrix and overall accuracy, and you can also view the accuracy of your tree by race, sex, marital status, and age category.

To use the app below, start with the following settings, then answer the questions to the left.



Predicting whether someone will recidivate is a difficult task. While decision trees are powerful classification algorithms, they have both advantages and disadvantages. Unlike many machine learning algorithms, decision trees are white-box: the user can see the steps taken to classify an individual, and the construction of the tree can be clearly explained, so they are easy to understand and interpret. Generally, though, decision trees alone are of limited use, as they are unstable, inflexible, and relatively inaccurate compared to other algorithms.
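To make the white-box property concrete, here is a small sketch on synthetic data (not the Broward County data; scikit-learn and the feature names are assumptions for illustration) showing that a fitted tree's rules can be printed and read directly:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic stand-ins for two predictors, e.g. priors count and age.
rng = np.random.default_rng(0)
X = rng.integers(0, 10, size=(200, 2))
y = (X[:, 0] + rng.normal(0, 2, size=200) > 5).astype(int)  # toy label

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
# Every split the tree makes is visible and explainable.
print(export_text(tree, feature_names=["priors_count", "age"]))
```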

While the COMPAS algorithm is proprietary, its mechanisms are likely more complex than a standard decision tree. By adopting concepts from decision trees, we can build a more advanced model known as a random forest. Instead of outputting one specific category, a random forest returns the probability of an outcome based on the results of many decision trees. For example, the decision tree for recidivism in the app above returns either a Yes or No response, whereas a random forest provides the probability that a person will recidivate. When predicting recidivism, we can use this probabilistic output to create a more equitable model.
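Here is a sketch of this probabilistic output on toy data (scikit-learn is an assumption; COMPAS itself is proprietary and works differently):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.integers(0, 10, size=(300, 3))
y = (X[:, 0] > 4).astype(int)  # toy "recidivated" label

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
print(forest.predict(X[:3]))        # hard Yes/No-style labels: 0 or 1
print(forest.predict_proba(X[:3]))  # columns: P(no), P(yes)
```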

A very useful aspect of random forests is that the result is an aggregation of many different, uncorrelated decision trees. Each tree in a random forest is built from bootstrapped data: a new sample created by randomly selecting rows from the original sample, with replacement. Another reason random forests produce different trees is that only a subset of variables is considered at each node during a tree's construction. Since the trees are created with bootstrapping and we aggregate their results to return the probability of an outcome, a random forest is known as a bagging (bootstrap aggregating) algorithm.
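A hand-rolled sketch of the bootstrapping step (NumPy here is an assumption, used only to illustrate the resampling idea):

```python
import numpy as np

rng = np.random.default_rng(2)
n_rows = 10
rows = np.arange(n_rows)

# Draw a same-size sample of row indices, with replacement.
boot_idx = rng.choice(rows, size=n_rows, replace=True)
print(boot_idx)                      # some rows repeat, some never appear
print(np.setdiff1d(rows, boot_idx))  # "out-of-bag" rows this tree never sees
```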


Figure: Human versus COMPAS algorithmic predictions.

Figure: How random forests reach their predictions.

In essence, random forests differ from a single decision tree in three ways (summarized in the sketch after this list):

  1. Each tree in a random forest is created using a bootstrapped dataset that is the same size as the original.
  2. Only a subset of variables is considered at each node of a tree.
  3. Classification of an individual is based on the result of many different, uncorrelated trees.
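One way to see all three ingredients together is in scikit-learn's RandomForestClassifier (an illustration under that assumption; the app above may be implemented differently):

```python
from sklearn.ensemble import RandomForestClassifier

forest = RandomForestClassifier(
    n_estimators=500,     # (3) aggregate many different trees
    bootstrap=True,       # (1) each tree sees a same-size bootstrapped dataset
    max_features="sqrt",  # (2) only a subset of variables at each split
    random_state=0,
)
```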

Note: For more detail about random forests, we recommend watching the video here.


After running training or testing data through a model, we can create confusion matrices to examine different accuracy metrics. Following ProPublica’s report, we can visualize the racial disparities of COMPAS by creating a separate confusion matrix for each race:
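A minimal sketch of building one confusion matrix per group, on a toy results table (the table, its column names, and the group labels are hypothetical):

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Toy results table; the real analysis would use COMPAS predictions.
results = pd.DataFrame({
    "race":      ["A", "A", "A", "B", "B", "B"],
    "actual":    [1, 0, 1, 1, 0, 1],
    "predicted": [1, 1, 1, 0, 0, 1],
})

for race, sub in results.groupby("race"):
    print(race)
    # Rows are actual outcomes, columns are predicted outcomes.
    print(confusion_matrix(sub["actual"], sub["predicted"], labels=[0, 1]))
```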





Model Validation

While this was hinted at in the first section, we want to walk through how model validation is done. The technique we will use is called cross validation: a set of techniques for measuring how well a model generalizes to data not used in its creation. The most basic cross validation technique is the validation set approach. For this method, we divide the original dataset into two disjoint datasets: a training dataset, used to create the model, and a testing dataset, used to validate it. With two independent sets, we can more fairly assess differences in accuracy between models. This has already been done for you when you made your trees and forests above. How do the accuracies differ between the training dataset and the testing dataset?
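Here is a sketch of the validation set approach on synthetic data (scikit-learn's train_test_split is an assumption), comparing training and testing accuracy:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(3)
X = rng.integers(0, 10, size=(500, 3))
y = (X[:, 0] + rng.normal(0, 2, size=500) > 5).astype(int)

# Hold out 30% of the rows as an independent testing dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("training accuracy:", tree.score(X_train, y_train))
print("testing accuracy: ", tree.score(X_test, y_test))
# An unpruned tree usually scores far higher on the training data,
# a sign of overfitting.
```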



Part 1C: Get Curious

1. In the application above, how is the filter tab useful?

2. Do people who were identified as female by the police have the same arrest patterns as those who were identified as male?

3. In the application above, can we create a graphic that effectively uses both the "Facet By" and "Color By" options?

4. What is the best accuracy you can achieve?

5. Do accuracies vary by race? By age? What implications does this have?

6. Why do you think accuracies can go down when more variables are added?

7. Should states across the US continue to use the COMPAS algorithm? Why or why not?

8. Can we guarantee justice in the courtroom for all individuals when inherently biased algorithms are used in the decision process?

9. What questions should be asked before spending government money on such algorithms?

10. What other real-world applications could CART have?





