An example of reading fixed width format data with R

R is very useful for data import/export. It even has a detailed manual entitled R Data Import/Export on the CRAN page (link). I will first give an example and then explain the reasons why I prefer to use R for data munging. I will also list a few useful links at the end.

One of our students used Uniform Crime Report (UCR [Wikipedia link]) data provided by the Indianapolis Metropolitan Police Department (IMPD) [IMPD UCR Downloads link] for his thesis. One of the first things he had to do was clean up the data so that he could use it in his thesis. He tried to use Excel for much of his initial work but was running into some issues, so he asked me to help him with it, and I agreed. Here are some lessons I learned during my exploration.

The data listed on the IMPD’s website has the following characteristics.

  1. The first line is comma separated.
  2. The rest of the data is in fixed-width format (as far as I’m aware).
  3. There are lots and lots of rows (year 2013 has the fewest, but even that is approximately 55K).

This is the code I used to convert it into a CSV file, which can then be easily opened in Excel/ArcGIS/QGIS. It is assumed that the raw file has been saved as IMPDUCR2013.txt in R’s working directory.

# Positive widths are read; negative widths skip that many characters.
colwidths <- c(5, 28, 10, -1, 5, -2, 11, -1, 44, -1, 4, -3, 14, -9, 15)
# One class per *kept* column; declaring them up front speeds up reading
# and turns bad values into errors instead of silent surprises.
colClasses <- c("numeric", "character", "character",
                "character", "character", "character",
                "character", "numeric", "numeric")
col_names <- c("ucr", "crime", "date", "time",
               "case", "address", "beat", "x", "y")
IMPDUCR2013 <- read.fwf("IMPDUCR2013.txt", widths = colwidths,
                        col.names = col_names, colClasses = colClasses,
                        skip = 1, comment.char = "")
write.csv(IMPDUCR2013, "IMPDUCR2013.csv", row.names = FALSE)

Here are some things to think about:

  1. Positive widths in the colwidths variable are columns that will be read. Negative widths indicate how many characters to skip.
  2. colClasses is defined because it makes the file reading faster (see “Variations on read.table”) and raises errors (e.g., when there’s an X instead of an NA/missing value).
  3. I could have read the column names from the first (comma-separated) line, but we skip that line and supply our own names instead.
  4. Skip the one header line with the function argument skip=1. You can skip as many lines as you want. NOTE: This is so useful!
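To make the width bookkeeping concrete, here is a small self-contained sketch showing how negative widths and skip behave. The data and column layout here are made up purely for illustration; they are not the IMPD file format.

```r
# Hypothetical three-line file held in memory; the layout is invented
# purely to illustrate negative widths and skip.
txt <- c("id,name,score",        # comma-separated header, skipped below
         "001 Alice  9.5",
         "002 Bob    7.0")
df <- read.fwf(textConnection(txt),
               widths = c(3, -1, 5, -2, 3),   # -1/-2 skip the gaps
               col.names = c("id", "name", "score"),
               colClasses = c("character", "character", "numeric"),
               skip = 1)                       # skip the header line
df
```

Because id is declared "character", the leading zeros in "001" and "002" survive the import, which matters for codes and identifiers.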

You could have easily used Excel/Perl/Python/sed+awk to accomplish this task but here are some of my reasons for not using them.

  1. Excel only shows the first few lines of the text file, and it is easy to miss how wide the columns actually are!
  2. The column widths are never stored anywhere in Excel. How are you going to replicate your workflow for other years? R makes this utterly simple. I love this aspect of R!
  3. Cleaning this data with Perl/Python/sed+awk is easy too, but the big hindrance is that you have to know how to use them and have access to a computer with these utilities (no longer a problem, since you can just run Linux in VirtualBox). Moreover, those solutions tend to be one-off and esoteric (nothing wrong with that). In contrast, R’s import/export procedure is standard, and familiarity with its options pays off across the whole read.table family of functions!
  4. An R script is easy to use and share with colleagues.
  5. The R workflow is reproducible.
  6. Found an error in columns not being delimited correctly? Tweak the colwidths variable.
  7. Never waste your time again trying to figure out how IMPD's UCR data is delimited.
  8. All the selection/subsetting flexibility of R at your fingertips! Check out Subsetting in Advanced R by Hadley Wickham.
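As a quick illustration of that last point, here is a toy sketch of the kind of subsetting you get for free once the data is in a data frame. The values below are made up and are not real UCR data.

```r
# Made-up miniature data frame standing in for the imported UCR data.
df <- data.frame(ucr  = c(110, 220, 110),
                 beat = c("A1", "B2", "A1"),
                 stringsAsFactors = FALSE)
subset(df, ucr == 110)       # all rows with a given UCR code
df[df$beat == "A1", "ucr"]   # one column of the matching rows, as a vector
```

The same one-liners scale unchanged from three rows to the 55K-plus rows in the real file.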

Also check out Hadley Wickham’s Tidy Data paper, which discusses the many ways in which data can be messy and how you can clean it up.

Here are some useful help queries and function arguments to keep in mind for your future adventures.

  • help(read.fwf) for this example. Others are help(read.csv), help(read.delim), or help(read.table). Actually, read.table is the general function which is called by all the other specialized functions.
  • Function arguments to pay particular attention to: sep, header, quote, colClasses, stringsAsFactors, and skip. There are other arguments but these are the ones that I have used and found useful to solve a lot of my data issues.
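For instance, here is a minimal sketch exercising a few of those arguments on read.table directly. The semicolon-separated data is invented for the example.

```r
# Invented semicolon-separated data to exercise sep, header,
# colClasses and stringsAsFactors.
csv <- c("ucr;crime",
         "110;HOMICIDE",
         "220;BURGLARY")
d <- read.table(textConnection(csv), sep = ";", header = TRUE,
                colClasses = c("numeric", "character"),
                stringsAsFactors = FALSE)
str(d)   # crime stays character, not a factor
```

Here colClasses alone would already keep crime as character, but spelling out stringsAsFactors = FALSE as well makes the intent explicit for readers of the script.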

In conclusion, R provides some very compelling tools for data cleaning/munging which can be so useful in a GIS workflow. Not only will it make your analysis faster and more robust but it is also so much fun!