Category Archives: R

An example of reading fixed width format data with R

R is very useful for data import/export. It even has a detailed manual entitled R Data Import/Export on the CRAN page (link). I will first give an example and then explain the reasons why I prefer to use R for data munging. I will also list a few useful links at the end.

One of our students used Uniform Crime Report (UCR [Wikipedia link]) data provided by Indianapolis Metropolis Police Department (IMPD) [IMPD UCR Downloads link]) for his thesis. One of the first things he had to do was to clean up the data so that he could use it in his thesis. He tried to use Excel for much of his initial work but was running into some issues and he asked me to help him with it so I agreed. Here are some lessons I learned during my exploration.

The data listed on the IMPD’s website is in fixed width data format and has the following characteristics.

  1. The first line is comma separated.
  2. The data is in fixed width (as far as I’m aware) format.
  3. There are lots and lots of rows (Year 2013 has the least amount of rows but even then it is approximately 55K).

This is the code I used to convert it into CSV file which can be easily opened in Excel/ArcGIS/QGIS. It is assumed that

colwidths <- c(5,28,10,-1,5,-2,11,-1, 44, -1, 4, -3, 14, -9, 15)
colClasses = c("numeric", "character", "character",
               "character", "character", "character",
               "character", "numeric", "numeric")
colnames <- c("ucr", "crime", "date", "time",
              "case", "address", "beat", "x", "y")
IMPDUCR2013 <- read.fwf("IMPDUCR2013.txt", colwidths, colClasses=colClasses, skip=1, comment.char="")
write.csv(IMPDUCR2013, "IMPDUCR2013.csv")

Here are some things to think about:

  1. Positive widths in the colwidths variable are columns which will be read. Negative widths indicate the columns that are to be skipped.
  2. colClasses is defined because it makes the file reading faster (Variations on read.table) and raises errors (for e.g., when there's X instead of a NA/missing).
  3. I could've used column names from the first line but we skip it.
  4. Skip one (header) line. See the function argument skip=1. You can skip many as many lines as you want. NOTE: This is so useful!

You could have easily used Excel/Perl/Python/sed+awk to accomplish this task but here are some of my reasons for not using them.

  1. Excel only shows the first few lines of the text file and it is easy to miss how wide the column actually are!
  2. The column widths are never stored anywhere in Excel. How are you going to replicate your workflow for other years? R makes this utterly simple. I love this aspect of R!
  3. Using Perl/Python/sed+awk to clean this data is very easy but the only hindrance is, and it's a big hindrance, that you've got to know how to use them and have access to a computer (no longer a problem since you can just run a linux on Virtualbox!) with these utilities. However, these solutions are one-off/esoteric (nothing wrong with that). In contrast, R's import/export procedure is standard and it provides familiarity for options which might be useful in wide array of read.table functions!
  4. R script is so easy to use and share with colleagues.
  5. R workflow is reproducible.
  6. Found an error in columns not being delimited correctly? Tweak the colwidths variable.
  7. Never waste your time again trying to figure out how IMPD's UCR data is delimited.
  8. All the selection/subsetting flexibility of R at your fingertips! Check out Subsetting in Advanced R by Hadley Wickham.

Also check out Hadley Wickham's Tidy Data paper which talks about many different kinds of ways that data are messy and how you can clean them.

Some useful help queries/function arguments to keep in mind for your future adventures.

  • help(read.fwf) for this example. Others are help(read.csv), help(read.delim), or help(read.table). Actually, read.table is the general function which is called by all the other specialized functions.
  • Function arguments to pay particular attention to: sep, header, quote, colClasses, stringsAsFactors, and skip. There are other arguments but these are the ones that I have used and found useful to solve a lot of my data issues.

In conclusion, R provides some very compelling tools for data cleaning/munging which can be so useful in a GIS workflow. Not only will it make your analysis faster and more robust but it is also so much fun!

Exploring the data

I love using array (vector) languages. Amongst them J (http://www.jsoftware.com) and R (http://www.r-project.org/) are my two most preferred languages. Why? Because, they enable exploratory, interactive programming. This is such an under appreciated feature that I also became aware of it, only, very recently.

As a part of overseeing graduate students’ work (as a part of my job) I get asked questions about student’s theses research. In our discussion it will, more often than not, happen that the student will say “Let me show you how I did it” and will want to show me how they went about doing whatever it is that they tried to do. I observe what they are doing and I start getting questions in my head. If the student has thought about my questions they’ll be able to answer them promptly, but if they haven’t thought of it we’re stuck! I ask my questions. Either I let the student try to figure it out and provide a reasonable answer or get ready to do some very rough/quick exploratory analysis. Whenever this situation occurs I always remember my graduate school years. When I was a graduate student I didn’t know R (only a little Python…most definitely not J) or any other exploratory/interactive language and I would spend hours to find basic information which now takes me a couple of minutes. Let me try to explain with an example.

Let’s say a student is very annoyed with all the stuff that [s]he’s hearing about racism/militarization of police (Michael Brown Case Wikipedia article and Death of Eric Garner Wikipedia article) and has read/heard about Radley Balko’s Rise of the Warrior Cop: The Militarization of America’s Police Forces and has definitely read the NY Times article: In Wake of Clashes, Calls to Demilitarize Police and wants to recreate the map found on Mapping the Spread of the Military’s Surplus Gear. [S]He has found out that the data can be downloaded from https://github.com/TheUpshot/Military-Surplus-Gear. The student comes to me and asks, “Vijay, I’m having a difficult time generating these numbers!” What they mean is: “I’m having a difficult time generating count of items grouped by counties.” I so wish some student had come to me with this problem. But, then I remember the graduate school Vijay (who didn’t care about anything in the world but himself) and everything’s fine! So, how do I help this hypothetical student? Well, I use what I know to explore this data and also show him/her how they can do this all by themselves.

I will assume that you have access to a computer (sorry a tablet/iphone/android just won’t do for now) on which you have installed R and the package data.table (http://cran.r-project.org/web/packages/data.table/index.html) installed. Also, you have 1033-program-foia-may-2014.csv stored in a directory somewhere and your R session’s current working directory (inquired by getwd() in an R session) workspace is currently that directory. Below is the session of my usage of R to get this information from the csv file. represents shell prompt and represents R prompt.

$ R --no-init-file
> getwd()
[1] "/v/tmp"
> list.files()
[1] "1033-program-foia-may-2014.csv"
> library(data.table)
data.table 1.9.4  For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.
> d <- fread('1033-program-foia-may-2014.csv')
> 
> d
        State    County              NSN             Item Name Quantity   UI
     1:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
     2:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
     3:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
     4:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
     5:    AK ANCHORAGE 1005-00-073-9421 RIFLE,5.56 MILLIMETER        1 Each
    ---                                                                     
243488:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
243489:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
243490:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
243491:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
243492:    WY    WESTON 1005-00-589-1271 RIFLE,7.62 MILLIMETER        1 Each
        Acquisition Cost  Ship Date
     1:              499 2012-08-30
     2:              499 2012-08-30
     3:              499 2012-08-30
     4:              499 2012-08-30
     5:              499 2012-08-30
    ---                            
243488:              138 2008-10-20
243489:              138 2008-10-20
243490:              138 2008-10-20
243491:              138 2008-10-20
243492:              138 2008-10-20
> setnames(d, gsub(" ", "_",colnames(d))) # ?setnames
> d[,list(State,County,.N),by=list(State,County,Item_Name)]
       State    County
    1:    AK ANCHORAGE
    2:    AK ANCHORAGE
    3:    AK ANCHORAGE
    4:    AK ANCHORAGE
    5:    AK ANCHORAGE
   ---                
84404:    WY  WASHAKIE
84405:    WY  WASHAKIE
84406:    WY  WASHAKIE
84407:    WY  WASHAKIE
84408:    WY    WESTON
                                                             Item_Name State
    1:                                           RIFLE,5.56 MILLIMETER    AK
    2:                                        HOLDER,MULTIPLE MAGAZINE    AK
    3: CAMOUFLAGE SCREENING SYSTEM,SNOW LIGHT WEIGHT RADAR TRANSPARENT    AK
    4:                          CAMOUFLAGE NET SYSTEM,RADAR SCATTERING    AK
    5:                                                       BINOCULAR    AK
   ---                                                                      
84404:                                    PISTOL,CALIBER .45,AUTOMATIC    WY
84405:                                                   TRUCK,UTILITY    WY
84406:                                                   CARRIER,CARGO    WY
84407:                                             MODULAR SLEEP SYSTE    WY
84408:                                           RIFLE,7.62 MILLIMETER    WY
          County   N
    1: ANCHORAGE 123
    2: ANCHORAGE   1
    3: ANCHORAGE   1
    4: ANCHORAGE   2
    5: ANCHORAGE   2
   ---              
84404:  WASHAKIE  10
84405:  WASHAKIE   5
84406:  WASHAKIE   1
84407:  WASHAKIE   1
84408:    WESTON   7
> write.csv(d[,list(State,County,.N),by=list(State,County,Item_Name)],'aggregation.csv')
> list.files()
[1] "1033-program-foia-may-2014.csv" "aggregation.csv"               
> 

Now you can generate all the pretty maps by joining aggregation.csv with your counties shapefile. I’m not quite sure how the items were aggregated to generate the map found on Mapping the Spread of the Military’s Surplus Gear.

Getting familiar with R’s help system.

So you’ve decided to seriously try R for your analysis. Now the question is how do you get comfortable in using it? Or more importantly how do you use R’s help system. I have some friends who really want to become more proficient at R but are not sure how to begin. This is my feeble attempt to provide some pointers for them to consider. Maybe you, also, might find some of it useful.

First of all, try to use something more user friendly. Like RStudio (http://www.rstudio.com/). You can switch to base R console (or OS console) once you’re comfortable with R. I use base R console only because when I learned R (starting around 2007) there was no RStudio and now I don’t see any benefit of switching to it.

Let’s start exploring. So, now you have an R-session open and you’re staring at a blank area. You think you need help? Damn right, you need help!

  1. Type ?help in your session. A window or a webpage will popup. This is what popped up for me!

    help

  2. The top left of that page shows the function and package that provides this function. In our case help comes from utils package.
  3. If you really don’t know how to do stuff in R, just ask google. Since I don’t know how to read csv file in R I searched for read csv file in R I learned that I have to use read.csv . So, I looked up ?read.csv . And, lo and behold, it tells me that I’m reading help about read.table{utils} from the utils package.
  4. Just remember this for now.
  5. Scroll to the Usage section of the document and see how it’s defined and all the other fancy stuff. Don’t worry if it doesn’t make too much sense.
  6. Now scroll down to the bottom until you reach Examples section.
  7. Everything you see in the Examples section can be tried out in an active R session. Try out the examples listed there by manually typing them in the R session.
  8. I know I know. It is bogus!! There’s an easier way.
  9. Type example(read.csv) and see all the code in the Examples section executed in your session! This is one of the reasons why I really like R a lot!.

Just remember that google, ? OR help, and example can help you a lot. Just try to create a mental map of the function name and the package where it came from and you’re well on your way to becoming a competent R user/programmer.

Grouping in R using data.table

Check out http://www.ats.ucla.edu/stat/r/faq/firstlast.htm

Now compare the steps below.

> options(prompt="R=> ")
R=> hsb2 <- read.csv('http://www.ats.ucla.edu/stat/r/faq/hsb2.csv')
R=> library(data.table) # ?data.table
R=>
R=> hsb2dt <- as.data.table(hsb2)
R=> setkey(hsb2dt, math) # sort by math... See ?setkey
R=>
R=> highest <- hsb2dt[, tail(.SD, 1), by=prog] # ?tail
R=> highest
   prog  id female race ses schtyp read write math science socst
1:    3 143      0    4   2      1   63    63   75      72    66
2:    1 169      0    4   1      1   55    59   63      69    46
3:    2 200      0    4   2      2   68    54   75      66    66
R=>
R=> lowest <- hsb2dt[, head(.SD, 1), by=prog]
R=> lowest
   prog  id female race ses schtyp read write math science socst
1:    3   2      1    1   2      1   39    41   33      42    41
2:    1 167      0    4   2      1   63    49   35      66    41
3:    2 128      0    4   3      1   39    33   38      47    41
R=>
R=> quit()

Checkout data.table at https://github.com/Rdatatable/data.table/wiki