
Exploring the data

I love using array (vector) languages. Among them, J (http://www.jsoftware.com) and R (http://www.r-project.org/) are my two favorites. Why? Because they enable exploratory, interactive programming. This is such an underappreciated feature that I myself became aware of it only very recently.

As part of my job I oversee graduate students’ work, so I get asked questions about their thesis research. More often than not, the student will say “Let me show you how I did it” and walk me through whatever they tried to do. As I watch, questions start forming in my head. If the student has already thought about them, they can answer promptly; if not, we’re stuck! So I ask my questions, and then I either let the student try to figure out a reasonable answer or get ready to do some very rough, quick exploratory analysis myself. Whenever this happens I remember my own graduate school years. Back then I didn’t know R (only a little Python, and most definitely not J) or any other exploratory, interactive language, and I would spend hours finding basic information that now takes me a couple of minutes. Let me try to explain with an example.

Let’s say a student is very annoyed by everything [s]he’s hearing about racism and the militarization of police (see the Wikipedia articles on the Michael Brown case and the death of Eric Garner). [S]he has read or heard about Radley Balko’s Rise of the Warrior Cop: The Militarization of America’s Police Forces, has definitely read the NY Times article In Wake of Clashes, Calls to Demilitarize Police, and wants to recreate the map found on Mapping the Spread of the Military’s Surplus Gear. [S]he has found out that the data can be downloaded from https://github.com/TheUpshot/Military-Surplus-Gear. The student comes to me and says, “Vijay, I’m having a difficult time generating these numbers!” What they mean is: “I’m having a difficult time generating a count of items grouped by county.” I so wish some student had come to me with this problem. But then I remember the graduate school Vijay (who didn’t care about anything in the world but himself) and everything’s fine! So, how do I help this hypothetical student? Well, I use what I know to explore the data and also show him/her how to do it all by themselves.

I will assume that you have access to a computer (sorry, a tablet/iPhone/Android just won’t do for now) on which you have installed R and the package data.table (http://cran.r-project.org/web/packages/data.table/index.html). I also assume that you have 1033-program-foia-may-2014.csv stored in a directory somewhere and that your R session’s current working directory (which you can check with getwd()) is that directory. Below is the R session I used to get this information from the csv file. $ represents the shell prompt and > represents the R prompt.

$ R --no-init-file
> getwd()
[1] "/v/tmp"
> list.files()
[1] "1033-program-foia-may-2014.csv"
> library(data.table)
data.table 1.9.4  For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.
> d <- fread('1033-program-foia-may-2014.csv')
> 
> d
        State    County              NSN             Item Name Quantity   UI
     1:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
     2:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
     3:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
     4:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
     5:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
    ---                                                                     
243488:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
243489:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
243490:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
243491:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
243492:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
        Acquisition Cost  Ship Date
     1:              499 2012-08-30
     2:              499 2012-08-30
     3:              499 2012-08-30
     4:              499 2012-08-30
     5:              499 2012-08-30
    ---                            
243488:              138 2008-10-20
243489:              138 2008-10-20
243490:              138 2008-10-20
243491:              138 2008-10-20
243492:              138 2008-10-20
> setnames(d, gsub(" ", "_",colnames(d))) # ?setnames
> d[,list(State,County,.N),by=list(State,County,Item_Name)]
       State    County
    1:    AK ANCHORAGE
    2:    AK ANCHORAGE
    3:    AK ANCHORAGE
    4:    AK ANCHORAGE
    5:    AK ANCHORAGE
   ---                
84404:    WY  WASHAKIE
84405:    WY  WASHAKIE
84406:    WY  WASHAKIE
84407:    WY  WASHAKIE
84408:    WY    WESTON
                                                             Item_Name State
    1:                                           RIFLE,5.56 MILLIMETER    AK
    2:                                        HOLDER,MULTIPLE MAGAZINE    AK
    3: CAMOUFLAGE SCREENING SYSTEM,SNOW LIGHT WEIGHT RADAR TRANSPARENT    AK
    4:                          CAMOUFLAGE NET SYSTEM,RADAR SCATTERING    AK
    5:                                                       BINOCULAR    AK
   ---                                                                      
84404:                                    PISTOL,CALIBER .45,AUTOMATIC    WY
84405:                                                   TRUCK,UTILITY    WY
84406:                                                   CARRIER,CARGO    WY
84407:                                             MODULAR SLEEP SYSTE    WY
84408:                                           RIFLE,7.62 MILLIMETER    WY
          County   N
    1: ANCHORAGE 123
    2: ANCHORAGE   1
    3: ANCHORAGE   1
    4: ANCHORAGE   2
    5: ANCHORAGE   2
   ---              
84404:  WASHAKIE  10
84405:  WASHAKIE   5
84406:  WASHAKIE   1
84407:  WASHAKIE   1
84408:    WESTON   7
> write.csv(d[,list(State,County,.N),by=list(State,County,Item_Name)],'aggregation.csv')
> list.files()
[1] "1033-program-foia-may-2014.csv" "aggregation.csv"               
> 
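
The query in the session lists State and County both in the selection and in the grouping, which is why those two columns appear twice in the result. Here is a minimal sketch of a tidier version, run on a tiny made-up table rather than the real file (the rows are invented for illustration; only the column names match the session after setnames). It also sums Quantity, on the assumption that "count of items" might mean units received rather than number of records.

```r
library(data.table)

# Toy stand-in for the real file; these rows are invented for illustration.
toy <- data.table(
  State     = c("AK", "AK", "AK", "WY"),
  County    = c("ANCHORAGE", "ANCHORAGE", "ANCHORAGE", "WESTON"),
  Item_Name = c("RIFLE,5.56 MILLIMETER", "RIFLE,5.56 MILLIMETER",
                "BINOCULAR", "RIFLE,7.62 MILLIMETER"),
  Quantity  = c(1L, 1L, 2L, 1L)
)

# The grouping columns are returned automatically, so j only needs the summaries.
#   rows  = number of records in the group (what .N gives)
#   units = total Quantity in the group (an alternative reading of "count")
agg <- toy[, list(rows = .N, units = sum(Quantity)),
           by = list(State, County, Item_Name)]
print(agg)
```

Writing the result with write.csv(agg, 'aggregation.csv', row.names = FALSE) would also keep the row numbers out of the file, which makes a later join a little simpler.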

Now you can generate all the pretty maps by joining aggregation.csv with your counties shapefile. I’m not quite sure how the items were aggregated to generate the map found on Mapping the Spread of the Military’s Surplus Gear.
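
The join itself can be tried out on plain tables before any spatial package is involved. The sketch below is only an illustration under assumptions: the counties table stands in for a shapefile's attribute table, its column names (STATE_ABBR, COUNTY_NAME) are hypothetical, and all rows are invented. It collapses the per-item counts to one row per county and then left-joins them onto the county list.

```r
library(data.table)

# Per-item counts in the shape of aggregation.csv (invented rows).
agg <- data.table(
  State     = c("AK", "AK", "WY"),
  County    = c("ANCHORAGE", "ANCHORAGE", "WESTON"),
  Item_Name = c("RIFLE,5.56 MILLIMETER", "BINOCULAR", "RIFLE,7.62 MILLIMETER"),
  N         = c(123L, 2L, 7L)
)

# Hypothetical attribute table of a counties shapefile;
# real column names and spellings will differ.
counties <- data.table(
  STATE_ABBR  = c("AK", "WY", "WY"),
  COUNTY_NAME = c("Anchorage", "Weston", "Washakie")
)

# One row per county: add up the per-item record counts.
per_county <- agg[, list(records = sum(N)), by = list(State, County)]

# The surplus data spells county names in upper case, so build a matching key.
counties[, County := toupper(COUNTY_NAME)]

# Left join: keep every county, with NA where no records were found.
joined <- merge(counties, per_county,
                by.x = c("STATE_ABBR", "County"),
                by.y = c("State", "County"),
                all.x = TRUE)
print(joined)
```

Counties that come back with NA are either counties with no records or counties whose names did not match, so it is worth inspecting them before mapping.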