An R package to access CDC Wonder API

(Note: this blog post will be more comprehensible if you first take a look at the wonderapi Readme.)

“Write a package” has been on my wish-to-do list for quite a while, so this summer I took the plunge. I figured I’d start with something seemingly simple, like writing a few functions to query an API and parse the results. I chose to tackle the CDC Wonder API, with the goal of making federal health data more accessible. Bob Rudis had already created a function to make CDC Wonder API query requests, wondr::make_query(), so I focused on writing code to set up the query requests and process the results. As is often the case, the task was more complex than I anticipated.

The problems

1. A long list of parameters must be included in every query, but the CDC Wonder API help does not specify what those parameters are, other than providing two examples for one of many databases (Ex. 1, Ex. 2).

2. Query parameters are name-value pairs, which employ CDC codes for variables, such as:

<parameter>
    <name>V_D76.V22</name>
    <value>4</value>
</parameter>

However, no codebook is provided to translate the codes into human-readable form.

3. The important part of the XML response is a nested table that is somewhat difficult to parse, since the tables have a number of quirks, including rows with differing numbers of cells, as in this example:

> query_result %>% xml2::xml_find_all("//r") %>% head(3)
{xml_nodeset (3)}
[1] <r>\n  <c l="2007" r="7"/>\n  <c l="Female" r="3"/>\n  <c l="Married"/>\n  <c v= ...
[2] <r>\n  <c l="Unmarried"/>\n  <c v="839,267"/>\n</r>
[3] <r>\n  <c c="1"/>\n  <c dt="2,108,162"/>\n</r>

xml2 and rvest to the rescue

The big picture solution was to do a lot of work behind the scenes to figure out how to set up a query, and then store relevant information in two forms: internal lookup tables in R/sysdata.rda for use by package functions, and codebook vignettes for the user. The script to create all of the data files is here.

Addressing the problems above more specifically, here’s what I did. I welcome your feedback and would be happy to learn that there are better ways to get this done.

1. Which parameters to include?
Using the CDC examples as templates, I created default query lists so that users only have to think about the specific groupings and measures they’re interested in, not everything else the query requires. These default query lists can be found in the data-raw/ folder of the package. The make_query_list() function converts a single default query list to an R list; purrr::map() applies make_query_list() to each of the default query files, and the resulting list of lists is stored with the other lookup tables in R/sysdata.rda. When the user initiates a query with wonderapi::getData(), the appropriate default query list is retrieved, and its defaults are replaced by the parameter values specified in the user’s call to getData().
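The "defaults replaced by user values" step can be sketched in a few lines of base R. This is only an illustration of the idea, not the package's code: the parameter names and default values below are placeholders (only V_D76.V22 appears earlier in this post), and the use of utils::modifyList() is an assumption about one convenient way to do the merge.

```r
# Placeholder defaults standing in for one stored default query list.
# The names and values are illustrative, not the package's real defaults.
default_query <- list(
  "B_1"       = "placeholder-grouping",
  "V_D76.V22" = "placeholder-default"
)

# Values the user supplies; only these entries should change.
user_values <- list("V_D76.V22" = "4")

# modifyList() overwrites matching names and leaves the rest untouched.
query <- utils::modifyList(default_query, user_values)
str(query)
```

Any name in user_values that is not already in the defaults would simply be appended, so a real implementation would probably want to validate names against the lookup tables first.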

2. Missing codebooks
Not having codebooks is a major issue. The CDC recommends using View Source on the web pages of the Wonder web interface. Aiming for a more user-friendly experience, I created a make_codebook_vignette() function that employs rvest::html_form() to scrape the appropriate web form, take apart each item, and write it in human-readable form to an .Rmd file, to be built into a vignette. Automating the creation of codebooks this way has a lot of advantages, though I’m not sure that vignettes are the best place to store them. The idea was that as vignettes the codebooks would be readily available to the user. On the downside, they have to be built when the package is installed with devtools::install_github("socdataR/wonderapi", build_vignettes = TRUE). It’s slow, potentially prone to error, and most significantly, my sense is that a lot of people don’t read vignettes.
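To make the scraping idea concrete, here is a rough sketch of pulling code/label pairs out of a form. It is not make_codebook_vignette(): it uses plain xml2 XPath queries instead of rvest::html_form(), it runs against a tiny inline form rather than a real Wonder page, and the option labels are invented. Treating V_D76.V22 as the form field name is also an assumption.

```r
library(xml2)

# Toy stand-in for a Wonder web form; the labels are invented.
page <- read_html('
<form>
  <select name="V_D76.V22">
    <option value="1">First label</option>
    <option value="4">Fourth label</option>
  </select>
</form>')

selects <- xml_find_all(page, ".//select")

# One row per <option>: the parameter name, the code, and its label.
codebook <- do.call(rbind, lapply(seq_along(selects), function(i) {
  opts <- xml_find_all(selects[[i]], "./option")
  data.frame(
    parameter = xml_attr(selects[[i]], "name"),
    code      = xml_attr(opts, "value"),
    label     = xml_text(opts),
    stringsAsFactors = FALSE
  )
}))

# Lines like these could then be written to an .Rmd file.
rmd_lines <- sprintf("* %s: `%s` = %s",
                     codebook$parameter, codebook$code, codebook$label)
writeLines(rmd_lines)
```

A data frame like codebook could just as easily be stored as a lookup table, which is one alternative to shipping the codebooks only as vignettes.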

3. Turning the XML response into a tidy data frame
To address the inconsistent number of cells per row, I calculated the number of columns that the table should have and built the tidy data frame row by row, padding rows that are short on columns with NAs on the left (see getrows()), and then pulling contents down from top to bottom to replace the NAs (see replaceNAs()). A lot of for-loops here, but they get the job done.
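The pad-left-then-fill-down approach can be sketched on a toy response. This is a simplified illustration rather than the package's getrows() and replaceNAs(): the XML fragment, its labels and counts, and the column names are all invented, and total rows (cells with c= and dt= attributes, as in the third row of the real example above) are not handled.

```r
library(xml2)

# Toy fragment mimicking the <r>/<c> layout; all values are invented.
toy <- read_xml('<data-table>
  <r><c l="2007" r="4"/><c l="Female" r="2"/><c l="A"/><c v="10"/></r>
  <r><c l="B"/><c v="20"/></r>
  <r><c l="Male" r="2"/><c l="A"/><c v="30"/></r>
  <r><c l="B"/><c v="40"/></r>
</data-table>')

rows <- xml_find_all(toy, "//r")

# For each row, take a cell's label (l) if present, otherwise its value (v).
row_cells <- function(row) {
  cells  <- xml_find_all(row, "./c")
  labels <- xml_attr(cells, "l")
  values <- xml_attr(cells, "v")
  ifelse(is.na(labels), values, labels)
}
cells <- lapply(seq_along(rows), function(i) row_cells(rows[[i]]))

# The widest row defines the column count; shorter rows are padded on the left.
n_col  <- max(lengths(cells))
padded <- lapply(cells, function(x) c(rep(NA_character_, n_col - length(x)), x))
mat    <- do.call(rbind, padded)

# Fill each NA with the value from the row above it, column by column.
for (j in seq_len(n_col)) {
  for (i in seq_len(nrow(mat))[-1]) {
    if (is.na(mat[i, j])) mat[i, j] <- mat[i - 1, j]
  }
}

df <- as.data.frame(mat, stringsAsFactors = FALSE)
names(df) <- c("year", "sex", "group", "count")  # hypothetical column names
df$count <- as.numeric(gsub(",", "", df$count))  # strip thousands separators
df
```

Padding on the left works because the cells a row omits are always the leading ones, which are covered by an r (row span) attribute on an earlier row.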

At the moment, the package can handle five of the CDC databases. I expect to add more soon. Although the process of adding a database is highly automated, each one, not surprisingly, alerts me to issues that I hadn’t considered previously and requires some code adjusting (a.k.a. bug fixing). I also plan to add human-readable entries for parameter values, not just names, as well as tests, an area I have avoided. There’s a list of other improvements that could be made, but the basic structure is there. I look forward to your comments!

** A big thank you to Elin Waring for initiating me into the world of package writing with concrete help and a good dose of optimism.