OpenRefine (formerly Google Refine) is a popular, open source data cleaning software1. rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions available in this package, you can import, export, apply data cleaning operations, or delete a project in OpenRefine directly from R. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby2. rrefine extends this functionality to R users.
rrefine is available on CRAN:
The latest version of the package is also available on GitHub and can be installed via devtools by using the following:
lateformeeting
rrefine includes a sample “dirty” data set to
illustrate its features. This object (lateformeeting
) is a
simulated data frame that holds 63 observations of dates, days of the
week, numbers of hours slept and indicators of whether or not the
subject was on time for work. The data are recorded in inconsistent
formats and will require cleaning in order to be parsed correctly by R.
You can take a look at how messy things are below:
theDate | what.day.whas.it | sleephours | was.i.on.time.for.work |
---|---|---|---|
1/8/99 | Friday | 7 | Yes |
Janury 15 1999 | friday | 5 | No |
1/25/1999 | Monday | 8 | No |
1/28/1999 | thursday | 8 | Yes |
2/9/99 | Tuesday | six | Yes |
2/11/1999 | Thursday | 7 | No |
February 15, 1999 | Monday | 8 | Yes |
2/19/1999 | Friday | 10 | Yes |
3/3/1999 | Wednesday | 5 | n |
3/4/1999 | Thursday | 9 | Yes |
3/5/1999 | Fridday | 8 | Y |
March 8, 1999 | Monday | 11 | Yes |
3/15/99 | Monday | 8 | yep |
3/16/1999 | Tuesday | 8 | Yes |
3/17/1999 | Wedensday | 6.5 | N |
4/1/99 | Thursday | 7 | Yes |
4/12/1999 | Monday | 8 | Yes |
4/20/1999 | Tuesday | 10 | Yes |
4/27/99 | Tuesday | 7 | Yes |
5/5/1999 | Wednesday | 6 | N |
5/12/1999 | Wednseday | 9 | Yes |
5/13/99 | Thursday | 7 | Yes |
5/17/1999 | monday | 7 and a half | Y |
May 25 1999 | Tuesday | 8 | YES |
5/27/1999 | Thursday | 8 | No |
6/1/1999 | tuesday | 7 | Yes |
6/7/1999 | Monday | 10 | Yes |
6/11/1999 | Friday | 6 | NO |
6/15/99 | Tuesday | 8 | Yes |
6/24/1999 | thursday | 7.5 | Yes |
7/1/1999 | Thursday | 7 | No |
July 6 1999 | Wednesday | 5 | Yes |
7/12/1999 | Monday | 8 | Y |
7/20/1999 | tuesday | 9 | Yes |
7/21/99 | Wednesday | 6 | No |
7/23/1999 | fridya | 8 | Yes |
8/3/1999 | Tuesday | 7 | Yes |
8/5/1999 | Thursday | 6 | Y |
8/16/1999 | Modnay | 8 | Yes |
Aug 27 1999 | Friday | 4 | Yes |
August 31, 1999 | Teusday | 7 | yep |
September 1, 1999 | Wednesday | 11 | Yes |
9/10/1999 | Friday | 7 | Yes |
9/20/1999 | Monday | 7 | No |
9/23/1999 | thursdayy | 7.5 | Yes |
9/28/1999 | Monday | 8 | Yes |
10/1/99 | Friday | 7 | Y |
10/4/1999 | Monday | 6 | Yes |
10/5/1999 | tuesday | 9 | Yes |
10/14/1999 | Thursday | 8 | Yes |
10/22/1999 | Friday | 7 | No |
10/25/99 | Monday | 7 | Yes |
10/27/1999 | Wednesday | 8 | Y |
11/2/1999 | Tuesday | 7 | Yes |
11/3/1999 | Wednseday | 8 | Yes |
11/8/1999 | monday | 5 | Yes |
11/18/99 | Thursday | 8 | No |
12/1/1999 | Wednesday | 7 | Yes |
12/3/1999 | Friday | 6 | N |
12/6/1999 | Monday | 6.5 | Yes |
12/13/1999 | Monday | 9 | Yes |
12/16/1999 | Thursday | 7 | Yes |
12/21/99 | Tuesday | 8 | Y |
refine_upload()
While the data cleaning could be performed using R, the operations
here describe a typical scenario for OpenRefine users.
The first step to creating a new project is to make sure
OpenRefine is installed and running3. By default, the
application will run locally at http://127.0.0.1:3333/
. All
of the functions in rrefine will assume the default
local host name and port, however these can both be overridden4.
Additionally, as of v1.1.0
the package will internally
connect to the OpenRefine instance using a CSRF token
in API requests5. The refine_upload()
function
allows you to pass the contents of a delimited text file (csv or tsv)
along with a project name (optional) and an argument to automatically
open the browser in which OpenRefine is running. The
example below demonstrates this workflow using the
lateformeeting
sample data:
write.csv(lateformeeting, file = "lateformeeting.csv", row.names = FALSE)
refine_upload(file = "lateformeeting.csv", project.name = "lfm_cleanup", open.browser = TRUE)
With the project uploaded, you can perform any of the desired clean-up procedures in OpenRefine.
refine_operations()
Whether the data in OpenRefine has been uploaded via
refine_upload()
or another method, users can
programmatically apply operations to projects using
refine_operations()
. This function will pass an arbitrary
list of data cleaning operations to the specified project. Operations
must be defined in valid JSON format6. In addition to the generic
refine_operations()
that can flexibly accept any valid JSON
operation, the rrefine package includes a series of
wrapper functions to perform common data cleaning procedures:
refine_remove_column()
: Remove a column from a
projectrefine_add_column()
: Add a column to a projectrefine_rename_column()
: Rename an existing column in a
projectrefine_move_column()
: Move a column to a new indexrefine_transform()
: Apply arbitrary text
transformationsrefine_to_lower()
: Coerce text to lowercaserefine_to_upper()
: Coerce text to uppercaserefine_to_title()
: Coerce text to title caserefine_to_null()
: Set values to NULL
refine_to_empty()
: Set text values to empty string
(""
)refine_to_text()
: Coerce value to stringrefine_to_number()
: Coerce value to numericrefine_to_date()
: Coerce value to daterefine_trim_whitespace()
: Remove leading and trailing
whitespacesrefine_collapse_whitespace()
: Collapse consecutive
whitespaces to single whitespacerefine_unescape_html()
: Unescape HTML in stringThe example below demonstrates several operations using the
lateformeeting
sample data:
refine_add_column(new_column = "dotw_allcaps",
base_column = "what.day.whas.it",
value = "grel:value",
project.name = "lfm_cleanup")
refine_to_upper(column_name = "dotw_allcaps", project.name = "lfm_cleanup")
refine_export(project.name = "lfm_cleanup")$dotw_allcaps
## [1] "FRIDAY" "FRIDAY" "MONDAY" "THURSDAY" "TUESDAY" "THURSDAY"
## [7] "MONDAY" "FRIDAY" "WEDNESDAY" "THURSDAY" "FRIDAY" "MONDAY"
## [13] "MONDAY" "TUESDAY" "WEDNESDAY" "THURSDAY" "MONDAY" "TUESDAY"
## [19] "TUESDAY" "WEDNESDAY" "WEDNESDAY" "THURSDAY" "MONDAY" "TUESDAY"
## [25] "THURSDAY" "TUESDAY" "MONDAY" "FRIDAY" "TUESDAY" "THURSDAY"
## [31] "THURSDAY" "WEDNESDAY" "MONDAY" "TUESDAY" "WEDNESDAY" "FRIDAY"
## [37] "TUESDAY" "THURSDAY" "MONDAY" "FRIDAY" "TUESDAY" "WEDNESDAY"
## [43] "FRIDAY" "MONDAY" "THURSDAY" "MONDAY" "FRIDAY" "MONDAY"
## [49] "TUESDAY" "THURSDAY" "FRIDAY" "MONDAY" "WEDNESDAY" "TUESDAY"
## [55] "WEDNESDAY" "MONDAY" "THURSDAY" "WEDNESDAY" "FRIDAY" "MONDAY"
## [61] "MONDAY" "THURSDAY" "TUESDAY"
refine_export()
Once you’ve cleaned up the data in OpenRefine you
can pull it back into R for plotting, modeling, etc. by using
refine_export()
. This function will accept either
the project name or the numerical unique identifier. It is only
necessary to use both if there are multiple projects with the
same name in your OpenRefine application. Note that the
data is exported directly into R as a data frame and you can assign it
to a new object.
date | dotw | hours.slept | on.time |
---|---|---|---|
1999-01-08T00:00:00Z | Friday | 7.0 | Yes |
1999-01-15T00:00:00Z | Friday | 5.0 | No |
1999-01-25T00:00:00Z | Monday | 8.0 | No |
1999-01-28T00:00:00Z | Thursday | 8.0 | Yes |
1999-02-09T00:00:00Z | Tuesday | 6.0 | Yes |
1999-02-11T00:00:00Z | Thursday | 7.0 | No |
1999-02-15T00:00:00Z | Monday | 8.0 | Yes |
1999-02-19T00:00:00Z | Friday | 10.0 | Yes |
1999-03-03T00:00:00Z | Wednesday | 5.0 | No |
1999-03-04T00:00:00Z | Thursday | 9.0 | Yes |
1999-03-05T00:00:00Z | Friday | 8.0 | Yes |
1999-03-08T00:00:00Z | Monday | 11.0 | Yes |
1999-03-15T00:00:00Z | Monday | 8.0 | Yes |
1999-03-16T00:00:00Z | Tuesday | 8.0 | Yes |
1999-03-17T00:00:00Z | Wednesday | 6.5 | No |
1999-04-01T00:00:00Z | Thursday | 7.0 | Yes |
1999-04-12T00:00:00Z | Monday | 8.0 | Yes |
1999-04-20T00:00:00Z | Tuesday | 10.0 | Yes |
1999-04-27T00:00:00Z | Tuesday | 7.0 | Yes |
1999-05-05T00:00:00Z | Wednesday | 6.0 | No |
1999-05-12T00:00:00Z | Wednesday | 9.0 | Yes |
1999-05-13T00:00:00Z | Thursday | 7.0 | Yes |
1999-05-17T00:00:00Z | Monday | 7.5 | Yes |
1999-05-25T00:00:00Z | Tuesday | 8.0 | Yes |
1999-05-27T00:00:00Z | Thursday | 8.0 | No |
1999-06-01T00:00:00Z | Tuesday | 7.0 | Yes |
1999-06-07T00:00:00Z | Monday | 10.0 | Yes |
1999-06-11T00:00:00Z | Friday | 6.0 | No |
1999-06-15T00:00:00Z | Tuesday | 8.0 | Yes |
1999-06-24T00:00:00Z | Thursday | 7.5 | Yes |
1999-07-01T00:00:00Z | Thursday | 7.0 | No |
1999-07-06T00:00:00Z | Wednesday | 5.0 | Yes |
1999-07-12T00:00:00Z | Monday | 8.0 | Yes |
1999-07-20T00:00:00Z | Tuesday | 9.0 | Yes |
1999-07-21T00:00:00Z | Wednesday | 6.0 | No |
1999-07-23T00:00:00Z | Friday | 8.0 | Yes |
1999-08-03T00:00:00Z | Tuesday | 7.0 | Yes |
1999-08-05T00:00:00Z | Thursday | 6.0 | Yes |
1999-08-16T00:00:00Z | Monday | 8.0 | Yes |
1999-08-27T00:00:00Z | Friday | 4.0 | Yes |
1999-08-31T00:00:00Z | Tuesday | 7.0 | Yes |
1999-09-01T00:00:00Z | Wednesday | 11.0 | Yes |
1999-09-10T00:00:00Z | Friday | 7.0 | Yes |
1999-09-20T00:00:00Z | Monday | 7.0 | No |
1999-09-23T00:00:00Z | Thursday | 7.5 | Yes |
1999-09-28T00:00:00Z | Monday | 8.0 | Yes |
1999-10-01T00:00:00Z | Friday | 7.0 | Yes |
1999-10-04T00:00:00Z | Monday | 6.0 | Yes |
1999-10-05T00:00:00Z | Tuesday | 9.0 | Yes |
1999-10-14T00:00:00Z | Thursday | 8.0 | Yes |
1999-10-22T00:00:00Z | Friday | 7.0 | No |
1999-10-25T00:00:00Z | Monday | 7.0 | Yes |
1999-10-27T00:00:00Z | Wednesday | 8.0 | Yes |
1999-11-02T00:00:00Z | Tuesday | 7.0 | Yes |
1999-11-03T00:00:00Z | Wednesday | 8.0 | Yes |
1999-11-08T00:00:00Z | Monday | 5.0 | Yes |
1999-11-18T00:00:00Z | Thursday | 8.0 | No |
1999-12-01T00:00:00Z | Wednesday | 7.0 | Yes |
1999-12-03T00:00:00Z | Friday | 6.0 | No |
1999-12-06T00:00:00Z | Monday | 6.5 | Yes |
1999-12-13T00:00:00Z | Monday | 9.0 | Yes |
1999-12-16T00:00:00Z | Thursday | 7.0 | Yes |
1999-12-21T00:00:00Z | Tuesday | 8.0 | Yes |
From there the clean data is available for analyses that couldn’t have been performed in its original format.
refine_delete()
To clean up your OpenRefine workspace you can delete
projects using refine_delete()
. Just like
refine_export()
it’s possible to pass either a
project name or unique identifier to this function. And it is only
necessary to use both if there are multiple projects with the
same name.
https://docs.openrefine.org/technical-reference/openrefine-api#third-party-software-libraries↩︎
For documentation on how to specify a different host or
port number see ?refine_path()
.↩︎
https://github.com/OpenRefine/OpenRefine/wiki/Changes-for-3.3#csrf-protection-changes↩︎
https://docs.openrefine.org/technical-reference/openrefine-api#apply-operations↩︎