An R package to access CDC Wonder API


(Note: this blog post will be more comprehensible if you first take a look at the wonderapi Readme.)

“Write a package” has been on my wish-to-do list for quite a while, so this summer I took the plunge. I figured I’d start with something seemingly simple, like writing a few functions to query an API and parse the results. I chose to tackle the CDC Wonder API, with the goal of making federal health data more accessible. Bob Rudis had already created a function to make CDC Wonder API query requests, wondr::make_query(), so I focused on writing code to set up the query requests and process the results. As is often the case, the task was more complex than I anticipated.

The problems

1. A long list of parameters must be included in every query, but the CDC Wonder API help does not specify what those parameters are, other than providing two examples for one of many databases (Ex. 1, Ex. 2).

2. Query parameters are name-value pairs, which employ CDC codes for variables, such as:

<parameter>
    <name>V_D76.V22</name>
    <value>4</value>
</parameter>

However, no codebook is provided to translate the codes into human-readable form.
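To make the name-value structure concrete, here is a minimal sketch of how such pairs could be assembled into request XML with xml2. The helper build_request() is hypothetical (it is not part of wonderapi or wondr), and the root element name is an assumption based on the CDC examples.

```r
library(xml2)

# Hypothetical helper, not part of wonderapi: turn a named list of
# name-value pairs into <parameter> nodes under a single root node.
# The root name "request-parameters" is an assumption.
build_request <- function(params) {
  root <- xml_new_root("request-parameters")
  for (nm in names(params)) {
    p <- xml_add_child(root, "parameter")
    xml_add_child(p, "name", nm)
    # a parameter may carry more than one value
    for (v in params[[nm]]) {
      xml_add_child(p, "value", as.character(v))
    }
  }
  root
}

req <- build_request(list("V_D76.V22" = 4))
cat(as.character(req))
```

Without a codebook, of course, nothing in this request tells you what V_D76.V22 or the value 4 actually means.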

3. The important part of the XML response is a nested table that is somewhat difficult to parse, since the tables have a number of quirks, including rows with differing numbers of cells, as in this example:

> query_result %>% xml2::xml_find_all("//r") %>% head(3)
{xml_nodeset (3)}
[1] <r>\n  <c l="2007" r="7"/>\n  <c l="Female" r="3"/>\n  <c l="Married"/>\n  <c v= ...
[2] <r>\n  <c l="Unmarried"/>\n  <c v="839,267"/>\n</r>
[3] <r>\n  <c c="1"/>\n  <c dt="2,108,162"/>\n</r>

xml2 and rvest to the rescue

The big picture solution was to do a lot of work behind the scenes to figure out how to set up a query, and then store relevant information in two forms: internal lookup tables in R/sysdata.rda for use by package functions, and codebook vignettes for the user. The script to create all of the data files is here.

Addressing the problems above more specifically, here’s what I did. I welcome your feedback and would be happy to learn that there are better ways to get this done.

1. Which parameters to include?
Using the CDC examples as templates, I created default query lists so that users do not have to think about what needs to be in the query beyond the specific groupings and measures that they’re interested in obtaining. These default query lists can be found in the data-raw/ folder of the package. The make_query_list() function converts a single default query list to an R list; purrr::map() applies make_query_list() to each of the default query files, and the resulting list of lists is stored with the other lookup tables in R/sysdata.rda. When the user initiates a query with wonderapi::getData(), the appropriate default query list is retrieved, and defaults are replaced by the parameter values specified in the user’s call to getData().
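The "defaults replaced by user values" step can be sketched with base R's utils::modifyList(). This is only an illustration of the idea, not necessarily how getData() does it, and apart from V_D76.V22 the parameter names and all values below are made-up placeholders.

```r
# Placeholder defaults for one database (made-up names and values;
# the real default lists are stored in the package's internal data).
defaults <- list(
  "B_1"       = "placeholder-grouping",
  "M_1"       = "placeholder-measure",
  "V_D76.V22" = "placeholder-default"
)

# Values the user supplied in the call
user_values <- list("V_D76.V22" = "4")

# modifyList() keeps every default and overwrites only the named entries
query <- utils::modifyList(defaults, user_values)
str(query)
```

The user only ever has to name the parameters they care about; everything else falls through from the defaults.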

2. Missing codebooks
Not having codebooks is a major issue. The CDC recommends using View Source on the web pages of the Wonder web interface. Aiming for a more user-friendly experience, I created a make_codebook_vignette() function that employs rvest::html_form() to scrape the appropriate web form, take apart each item, and write it in human-readable form to an .Rmd file, to be built into a vignette. Automating the creation of codebooks this way has a lot of advantages, though I’m not sure that vignettes are the best place to store them. The idea was that as vignettes the codebooks would be readily available to the user. On the downside, they have to be built when the package is installed with devtools::install_github("socdataR/wonderapi", build_vignettes = TRUE). It’s slow, potentially prone to error, and most significantly, my sense is that a lot of people don’t read vignettes.
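As a rough sketch of the scraping idea, the snippet below pulls codes and labels out of a form's select elements. It uses xml2 directly on a made-up HTML fragment rather than rvest::html_form() on a real Wonder page, so the form contents and the parameter name V_X.V1 are invented for illustration.

```r
library(xml2)

# Made-up form fragment standing in for a Wonder web form
page <- read_html('<form>
  <select name="V_X.V1">
    <option value="1">Option one</option>
    <option value="2">Option two</option>
  </select>
</form>')

selects <- xml_find_all(page, "//select")

# One row per option: which parameter it belongs to, its code, its label
codebook <- do.call(rbind, lapply(seq_along(selects), function(i) {
  opts <- xml_find_all(selects[[i]], "./option")
  data.frame(
    parameter = xml_attr(selects[[i]], "name"),
    code      = xml_attr(opts, "value"),
    label     = xml_text(opts, trim = TRUE),
    stringsAsFactors = FALSE
  )
}))

print(codebook)
```

A table like this is the raw material for a codebook; writing it out to an .Rmd file is a separate formatting step.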

3. Turning the xml response into a tidy data frame
To address the issue of an inconsistent number of columns, I calculated the number of columns that the table should have, and built the tidy data frame row by row, padding rows short on columns with NAs on the left (see getrows()), and then pulling down contents from top to bottom to replace the NAs (see replaceNAs()). A lot of for-loops here, but they get the job done.
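The pad-left-then-fill-down idea can be sketched on a toy table. This is not the code in getrows() or replaceNAs(); the XML below uses made-up values, and the sketch ignores totals rows and other quirks of real responses.

```r
library(xml2)

# Toy response (made-up values): later rows omit their leading cells
doc <- read_xml('<data-table>
  <r><c l="2007" r="4"/><c l="Female" r="2"/><c l="Married"/><c v="10"/></r>
  <r><c l="Unmarried"/><c v="20"/></r>
  <r><c l="Male" r="2"/><c l="Married"/><c v="30"/></r>
  <r><c l="Unmarried"/><c v="40"/></r>
</data-table>')

rows  <- xml_find_all(doc, "//r")
ncols <- max(xml_length(rows))  # the widest row defines the column count

# A cell holds either a label (attribute l) or a value (attribute v)
cell_text <- function(cells) {
  l <- xml_attr(cells, "l")
  ifelse(is.na(l), xml_attr(cells, "v"), l)
}

# Step 1: pad short rows with NAs on the left
mat <- t(vapply(seq_along(rows), function(i) {
  vals <- cell_text(xml_children(rows[[i]]))
  c(rep(NA_character_, ncols - length(vals)), vals)
}, character(ncols)))

# Step 2: pull values down each column to replace the NAs
fill_down <- function(x) {
  for (i in seq_along(x)[-1]) {
    if (is.na(x[i])) x[i] <- x[i - 1]
  }
  x
}

df <- as.data.frame(apply(mat, 2, fill_down), stringsAsFactors = FALSE)
print(df)
```

Padding on the left works because the omitted cells are always the leading grouping columns, whose values carry over from the rows above.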

At the moment, the package can handle five of the CDC databases. I expect to add more soon. Although the process of adding a database is highly automated, each one–not surprisingly–alerts me to issues that I hadn’t considered previously, and requires some code adjusting (a.k.a. bug fixing). I also plan to add human readable entries for parameter values, not just names, and tests, an area I have avoided. There’s a list of other improvements that could be made, but the basic structure is there. I look forward to your comments!

** A big thank you to Elin Waring for initiating me into the world of package writing with concrete help and a good dose of optimism.

Datavis with R:
Drawing a Cleveland dot plot with ggplot2


Cleveland dot plots are a great alternative to a simple bar chart, particularly if you have more than a few items. It doesn’t take much for a bar chart to look cluttered. In the same amount of space, many more values can be included in a dot plot, and it’s easier to read as well. R has a built-in base function, dotchart(), but since it’s such an easy graph to draw, doing it “from scratch” in ggplot2 or base allows for more customization. Here is a dot plot showing fertility data from the built-in swiss dataset drawn with ggplot2:

[Figure: Cleveland dot plot with ggplot2]


```{r ggdot, fig.height = 6, fig.width = 5}
# (the Rmarkdown chunk header above sets figure height and width in inches)
library(dplyr)    # data manipulation package that works well with ggplot2
library(ggplot2)  # data visualization package based on the grammar of graphics

# create a theme for dot plots, which can be reused
theme_dotplot <- theme_bw(14) +   # white background, base font size 14 pt
    theme(axis.text.y = element_text(size = rel(.75)),    # y-axis tick labels (Province names) at 75% of default size
          axis.ticks.y = element_blank(),                 # remove y-axis tick marks
          axis.title.x = element_text(size = rel(.75)),   # x-axis label at 75% of default size
          panel.grid.major.x = element_blank(),           # remove major vertical gridlines
          panel.grid.major.y = element_line(size = 0.5),  # darken major horizontal gridlines (theme default is 0.2)
          panel.grid.minor.x = element_blank())           # remove minor vertical gridlines

# move row names to a dataframe column named "Province", needed since ggplot2 ignores rownames
# (add_rownames() is deprecated in current dplyr; tibble::rownames_to_column() is its replacement)
df <- swiss %>% add_rownames("Province")   # swiss is a built-in dataset

# create the plot
# x is mapped to Fertility; y is Province reordered by Fertility,
# so dots are plotted in ascending order from bottom to top
ggplot(df, aes(x = Fertility, y = reorder(Province, Fertility))) +
    geom_point(color = "blue") +                     # geom for scatterplots (a.k.a. "dots")
    scale_x_continuous(limits = c(35, 95),           # x-axis range: 35 to 95
                       breaks = seq(40, 90, 10)) +   # labeled tick marks at multiples of 10 from 40 to 90
    theme_dotplot +                                  # the theme created above; no parens since it's not a function
    xlab("\nannual live births per 1,000 women aged 15-44") +      # leading \n moves the x-axis label down
    ylab("French-speaking provinces\n") +                          # trailing \n moves the y-axis label to the left
    ggtitle("Standardized Fertility Measure\nSwitzerland, 1888")   # \n adds a line break in the title
```
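For comparison, here is a minimal sketch of the same data drawn with the built-in dotchart() function mentioned above. The styling choices (pch, color, cex) are arbitrary, and the labels simply mirror those used in the ggplot2 version.

```r
# Named vector of fertility values, sorted so dots ascend from bottom to top
fert <- sort(setNames(swiss$Fertility, rownames(swiss)))

# dotchart() uses the vector's names as row labels by default
dotchart(fert,
         pch = 19, col = "blue", cex = 0.7,
         xlab = "annual live births per 1,000 women aged 15-44",
         main = "Standardized Fertility Measure\nSwitzerland, 1888")
```

This gets a serviceable plot in a few lines; the ggplot2 version takes more code but gives finer control over gridlines, text sizes, and axis breaks.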

For more information on dot plots, see:
Naomi Robbins, “Dot Plots: A Useful Alternative to Bar Charts”

Webinar April 28, 10am PST: Effective Graphs with Microsoft R Open


(reposted from: http://blog.revolutionanalytics.com/2016/04/webinar-april-28-effective-graphs.html)

Naomi Robbins, author of Creating More Effective Graphs and Forbes contributor, has teamed up with daughter Dr Joyce Robbins to present a new webinar this Thursday, April 28: Creating Effective Graphs with Microsoft R Open. The webinar will demonstrate how to create a variety of useful graphics with R: comparisons, distributions, trends over time, relationships, divisions of a whole, and much more, like this:

[Figure: Likert chart]

This webinar will be useful for anyone who wants to learn how to display data graphically with the greatest impact. The webinar will use Microsoft R Open, but since it’s 100% compatible, the code provided during the webinar can be used with any edition of R. The webinar will begin at 10AM Pacific time (click here to see your local time), and I’ll be hosting and passing your questions to the presenters. Even if you can’t make the live event, sign up to receive a link to the slides and replay, plus a free copy of a new 50-page e-book by the presenters.

To register for the webinar, follow the link below.

Microsoft Advanced Analytics and IoT: Creating Effective Graphs with Microsoft R Open

 

How to Improve Your Graphs in R


I teamed up with my mom, Naomi Robbins, to write Effective Graphs with Microsoft R Open, which we just completed and which is available for download. I enjoyed returning to coding after a long break, and it was a great experience to be part of a mother-daughter team. Naomi took responsibility for the theory on drawing good graphs, and I did the R coding. The guide is written at an advanced beginner level (it assumes that readers will have basic knowledge of data manipulation in R), but our hope is that anyone with an interest in improving their graphs will find something useful in it. We divided the material into sections based on the type of data: direct comparisons, distributions, trends over time, relationships, percents, and “special cases”: diverging stacked bar charts (see graph below) and linked micromaps. Code is available on Github.

Two Worlds


In the blog posts to follow, I’ll share my observations on society, interspersed with how-to’s on programming, mainly with R. If you have any ideas on what you’d like me to blog about, please contact me.


 

“Specialists without spirit, sensualists without heart; this nullity imagines that it has attained a level of civilization never before achieved.”
-Max Weber