Title: | Work with Two-by-Two Tables |
---|---|
Description: | A collection of functions for data analysis with two-by-two contingency tables. The package provides tools to compute measures of effect (odds ratio, risk ratio, and risk difference), calculate impact numbers and attributable fractions, and perform hypothesis testing. Statistical analysis methods are oriented towards epidemiological investigation of relationships between exposures and outcomes. |
Authors: | VP Nagraj [aut, cre] |
Maintainer: | VP Nagraj <[email protected]> |
License: | MIT + file LICENSE |
Version: | 0.1.0 |
Built: | 2025-03-04 04:03:16 UTC |
Source: | https://github.com/vpnagraj/twoxtwo |
Provides a collection of functions for data analysis with two-by-two contingency tables.
In addition to measures of effect such as odds ratio, risk ratio, and risk difference, the twoxtwo framework allows for calculation of attributable fractions: attributable risk proportion in the exposed (ARP) and the population attributable risk proportion (PARP).
Estimates of the attributable fractions can be calculated with the arp() and parp() functions respectively. Each function takes an input dataset and arguments for outcome and exposure as bare, unquoted variable names. If the input has the twoxtwo class then the effect measures will be calculated using exposure and outcome information from that object. The functions all return a tidy tibble with the name of the measure, the point estimate, and lower/upper bounds of a confidence interval (CI) based on the SE.
Formulas used in point estimate and SE calculations are available in 'Details'.
arp(.data, exposure, outcome, alpha = 0.05, percent = FALSE, ...)
parp(.data, exposure, outcome, alpha = 0.05, percent = FALSE, prevalence = NULL, ...)
.data |
Either a data frame with observation-level exposure and outcome data or a twoxtwo object |
exposure |
Name of exposure variable; ignored if input to .data is a twoxtwo object |
outcome |
Name of outcome variable; ignored if input to .data is a twoxtwo object |
alpha |
Significance level to be used for constructing confidence interval; default is 0.05 |
percent |
Logical as to whether or not the measure should be returned as a percentage; default is FALSE |
... |
Additional arguments passed to twoxtwo function; ignored if input to .data is a twoxtwo object |
prevalence |
Prevalence of exposure in the population; must be numeric between 0 and 1; default is NULL |
The formulas below denote cell values as A,B,C,D. For more on twoxtwo notation see the twoxtwo documentation.
Note that formulas for standard errors are not provided below but are based on formulas described in Hildebrandt et al. (2006).
If the "prevalence" argument is not NULL then the formula uses the value specified for prevalence of exposure (p):
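As a rough illustration of the point estimates discussed above, the sketch below computes the two attributable fractions in base R from hypothetical cell counts. It uses the standard textbook definitions (ARP as 1 - 1/RR, PARP as the share of overall risk in excess of the unexposed risk, and a Levin-style formula when a prevalence p is supplied); these are assumed, not confirmed, to match what arp() and parp() compute. All variable names and counts are invented for the example.

```r
# Hypothetical cell counts in A,B,C,D notation (illustrative values only)
A <- 30  # exposure +, outcome +
B <- 70  # exposure +, outcome -
C <- 10  # exposure -, outcome +
D <- 90  # exposure -, outcome -

risk_exposed   <- A / (A + B)
risk_unexposed <- C / (C + D)
risk_overall   <- (A + C) / (A + B + C + D)
rr             <- risk_exposed / risk_unexposed

# Attributable risk proportion in the exposed: 1 - 1/RR
arp_est <- 1 - 1 / rr

# Population attributable risk proportion estimated from the table itself
parp_est <- (risk_overall - risk_unexposed) / risk_overall

# Levin-style version with an externally supplied exposure prevalence p
# (assumption: p = 0.5, which equals the exposed share in this table)
p <- 0.5
parp_prev <- (p * (rr - 1)) / (p * (rr - 1) + 1)

c(ARP = arp_est, PARP = parp_est, PARP_prevalence = parp_prev)
```

Because p here equals the exposed share of the table, the two PARP values agree; with a different prevalence they would diverge. Multiplying any of the estimates by 100 gives the percentage form.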
A tibble with the following columns:
measure: Name of the measure calculated
estimate: Point estimate for the effect measure
ci_lower: The lower bound of the confidence interval for the estimate
ci_upper: The upper bound of the confidence interval for the estimate
exposure: Name of the exposure variable followed by +/- levels (e.g. smoking::yes/no)
outcome: Name of the outcome variable followed by +/- levels (e.g. heart_disease::yes/no)
Hildebrandt, M., Bender, R., Gehrmann, U., & Blettner, M. (2006). Calculating confidence intervals for impact numbers. BMC medical research methodology, 6, 32. https://doi.org/10.1186/1471-2288-6-32
Szklo, M., & Nieto, F. J. (2007). Epidemiology: Beyond the basics. Sudbury, Massachusetts: Jones and Bartlett.
Zapata-Diomedi, B., Barendregt, J. J., & Veerman, J. L. (2018). Population attributable fraction: names, types and issues with incorrect interpretation of relative risks. British journal of sports medicine, 52(4), 212–213. https://doi.org/10.1136/bjsports-2015-095531
This unexported helper function bounds a numeric vector on a minimum and maximum value.
bound(x, min = 0.01, max = 0.99)
x |
Numeric vector to be bounded |
min |
Minimum allowed value for vector "x"; default is 0.01 |
max |
Maximum allowed value for vector "x"; default is 0.99 |
Numeric vector of the same length as x with no values less than minimum nor greater than maximum.
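The bounding behaviour described above can be sketched with base R's pmax() and pmin(). The function name bound_sketch is hypothetical and this is an illustrative re-implementation, not the package's source.

```r
# Hypothetical re-implementation for illustration; not the package's own code
bound_sketch <- function(x, min = 0.01, max = 0.99) {
  # pmax() raises anything below `min`; pmin() lowers anything above `max`
  pmin(pmax(x, min), max)
}

bounded <- bound_sketch(c(-0.2, 0.005, 0.5, 1.3))
bounded
```

Values already inside the interval pass through unchanged, and the result has the same length as the input.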
This function conducts a Pearson's chi-squared test for a twoxtwo constructed using the specified exposure and outcome. Internally the function uses chisq.test. The output of the function includes the chi-squared test statistic, degrees of freedom, and the p-value from the test.
chisq(.data, exposure, outcome, correct = TRUE, ...)
.data |
Either a data frame with observation-level exposure and outcome data or a twoxtwo object |
exposure |
Name of exposure variable; ignored if input to .data is a twoxtwo object |
outcome |
Name of outcome variable; ignored if input to .data is a twoxtwo object |
correct |
Logical as to whether or not to apply continuity correction; default is TRUE |
... |
Additional arguments passed to twoxtwo function; ignored if input to .data is a twoxtwo object |
A tibble with the following columns:
test: Name of the test conducted
estimate: Point estimate from the test (NA for chisq())
ci_lower: The lower bound of the confidence interval for the estimate (NA for chisq())
ci_upper: The upper bound of the confidence interval for the estimate (NA for chisq())
statistic: Test statistic from the test
df: Degrees of freedom parameter for the test statistic
pvalue: P-value from the test
exposure: Name of the exposure variable followed by +/- levels (e.g. smoking::yes/no)
outcome: Name of the outcome variable followed by +/- levels (e.g. heart_disease::yes/no)
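Since the function is described as a wrapper around chisq.test, the underlying call can be sketched directly in base R. The counts below are hypothetical, and the data frame only loosely mirrors the documented columns; it is not the package's actual return value.

```r
# Hypothetical 2x2 counts: rows are exposure (+, -), columns are outcome (+, -)
tab <- matrix(
  c(30, 70,
    10, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(exposure = c("yes", "no"), outcome = c("yes", "no"))
)

# Pearson's chi-squared test with continuity correction (correct = TRUE)
res <- chisq.test(tab, correct = TRUE)

# Collect the pieces the documentation lists: statistic, df, and p-value
out <- data.frame(
  test      = res$method,
  statistic = unname(res$statistic),
  df        = unname(res$parameter),
  pvalue    = res$p.value
)
out
```

Setting correct = FALSE would drop the continuity correction and yield a larger statistic for the same table.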
This is a helper to render a twoxtwo object as a kable. The function extracts twoxtwo cell counts and uses exposure levels as row names and outcome levels as column names.
display(.twoxtwo, ...)
.twoxtwo |
twoxtwo object |
... |
Additional arguments passed to kable |
A knitr_kable object with the twoxtwo cell counts, exposure levels as row names, and outcome levels as column names.
This function conducts a Fisher's exact test using specified exposure and outcome. Internally the function uses fisher.test to test independence of twoxtwo rows and columns. The output of the function includes the odds ratio, the lower/upper bounds for the confidence interval around the estimate, and the p-value from the test.
fisher(.data, exposure, outcome, alternative = "two.sided", conf_level = 0.95, or = 1, ...)
.data |
Either a data frame with observation-level exposure and outcome data or a twoxtwo object |
exposure |
Name of exposure variable; ignored if input to .data is a twoxtwo object |
outcome |
Name of outcome variable; ignored if input to .data is a twoxtwo object |
alternative |
Alternative hypothesis for test; must be one of "two.sided", "greater", or "less"; default is "two.sided" |
conf_level |
Confidence level for the confidence interval; default is 0.95 |
or |
Hypothesized odds ratio; default is 1 |
... |
Additional arguments passed to twoxtwo function; ignored if input to .data is a twoxtwo object |
A tibble with the following columns:
test: Name of the test conducted
estimate: Point estimate from the test
ci_lower: The lower bound of the confidence interval for the estimate
ci_upper: The upper bound of the confidence interval for the estimate
statistic: Test statistic from the test (NA for fisher())
df: Degrees of freedom parameter for the test statistic (NA for fisher())
pvalue: P-value from the test
exposure: Name of the exposure variable followed by +/- levels (e.g. smoking::yes/no)
outcome: Name of the outcome variable followed by +/- levels (e.g. heart_disease::yes/no)
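The underlying fisher.test call can be sketched in base R as follows. The counts are hypothetical, and the mapping of the wrapper's conf_level argument onto fisher.test's conf.level is an assumption based on the argument descriptions.

```r
# Hypothetical 2x2 counts: rows are exposure (+, -), columns are outcome (+, -)
tab <- matrix(
  c(30, 70,
    10, 90),
  nrow = 2, byrow = TRUE,
  dimnames = list(exposure = c("yes", "no"), outcome = c("yes", "no"))
)

# Fisher's exact test of independence between rows and columns
res <- fisher.test(tab, alternative = "two.sided", conf.level = 0.95, or = 1)

# Odds ratio estimate, confidence bounds, and p-value
out <- data.frame(
  estimate = unname(res$estimate),
  ci_lower = res$conf.int[1],
  ci_upper = res$conf.int[2],
  pvalue   = res$p.value
)
out
```

Note that fisher.test reports the conditional maximum likelihood estimate of the odds ratio, which generally differs slightly from the sample cross-product ratio (A*D)/(B*C).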
This helper takes the output from a twoxtwo effect measure function and formats the point estimate and lower/upper bounds of the computed confidence interval (CI) as a string.
format_measure(.data, digits = 3)
.data |
Output from a twoxtwo effect measure function (e.g. odds_ratio) |
digits |
Number of digits; default is 3 |
A character vector of length 1 with the effect measure formatted as point estimate (lower bound of CI, upper bound of CI). The point estimate and CI are rounded to precision specified in "digits" argument.
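One way such formatting could be done in base R is sketched below. The function name format_sketch, the input values, and the exact spacing of the output string are all assumptions for illustration; the package's own formatting may differ.

```r
# Hypothetical formatter: "estimate (lower, upper)" with fixed decimal places
format_sketch <- function(.data, digits = 3) {
  fmt <- function(x) formatC(x, format = "f", digits = digits)
  paste0(fmt(.data$estimate), " (", fmt(.data$ci_lower), ", ", fmt(.data$ci_upper), ")")
}

# Illustrative one-row input mimicking the documented output columns
res <- data.frame(
  measure  = "Odds Ratio",
  estimate = 3.857143,
  ci_lower = 1.766548,
  ci_upper = 8.421863
)

formatted <- format_sketch(res)
formatted
```

Using formatC with format = "f" keeps trailing zeros, so every value shows the same number of decimal places.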
Impact numbers are designed to communicate how impactful interventions and/or exposures can be on a population. The twoxtwo framework allows for calculation of impact numbers: exposure impact number (EIN), case impact number (CIN), and the exposed cases impact number (ECIN).
The ein(), cin(), and ecin() functions provide interfaces for calculating impact number estimates. Each function takes an input dataset and arguments for outcome and exposure as bare, unquoted variable names. If the input has the twoxtwo class then the measures will be calculated using exposure and outcome information from that object. The functions all return a tidy tibble with the name of the measure, the point estimate, and lower/upper bounds of a confidence interval (CI) based on the SE.
Formulas used in point estimate and SE calculations are available in 'Details'.
ein(.data, exposure, outcome, alpha = 0.05, ...)
cin(.data, exposure, outcome, alpha = 0.05, prevalence = NULL, ...)
ecin(.data, exposure, outcome, alpha = 0.05, ...)
.data |
Either a data frame with observation-level exposure and outcome data or a twoxtwo object |
exposure |
Name of exposure variable; ignored if input to .data is a twoxtwo object |
outcome |
Name of outcome variable; ignored if input to .data is a twoxtwo object |
alpha |
Significance level to be used for constructing confidence interval; default is 0.05 |
... |
Additional arguments passed to twoxtwo function; ignored if input to .data is a twoxtwo object |
prevalence |
Prevalence of exposure in the population; must be numeric between 0 and 1; default is NULL |
The formulas below denote cell values as A,B,C,D. For more on twoxtwo notation see the twoxtwo documentation.
Note that formulas for standard errors are not provided below but are based on formulas described in Hildebrandt et al. (2006).
If the "prevalence" argument is not NULL then the formula uses the value specified for prevalence of exposure (p):
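For orientation, the sketch below computes the three impact numbers in base R from hypothetical cell counts, using the reciprocal definitions given by Heller et al. (2002): EIN as the reciprocal of the risk difference, CIN as the reciprocal of the population attributable risk proportion, and ECIN as the reciprocal of the attributable risk proportion in the exposed. Whether ein(), cin(), and ecin() use exactly these expressions is an assumption, and all names and counts are illustrative.

```r
# Hypothetical cell counts in A,B,C,D notation (illustrative values only)
A <- 30  # exposure +, outcome +
B <- 70  # exposure +, outcome -
C <- 10  # exposure -, outcome +
D <- 90  # exposure -, outcome -

risk_exposed   <- A / (A + B)
risk_unexposed <- C / (C + D)
risk_overall   <- (A + C) / (A + B + C + D)
rr             <- risk_exposed / risk_unexposed

# Exposure impact number: reciprocal of the risk difference
ein_est <- 1 / (risk_exposed - risk_unexposed)

# Case impact number: reciprocal of the population attributable risk proportion
cin_est <- 1 / ((risk_overall - risk_unexposed) / risk_overall)

# Case impact number with a supplied exposure prevalence p (Levin-style PARP)
p <- 0.5  # assumed prevalence for the example
cin_prev <- 1 / ((p * (rr - 1)) / (p * (rr - 1) + 1))

# Exposed cases impact number: reciprocal of ARP, where ARP = 1 - 1/RR
ecin_est <- 1 / (1 - 1 / rr)

c(EIN = ein_est, CIN = cin_est, CIN_prevalence = cin_prev, ECIN = ecin_est)
```

With these counts one excess case would be expected per 5 exposed individuals, per 2 cases in the population, and per 1.5 exposed cases.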
A tibble with the following columns:
measure: Name of the measure calculated
estimate: Point estimate for the impact number
ci_lower: The lower bound of the confidence interval for the estimate
ci_upper: The upper bound of the confidence interval for the estimate
exposure: Name of the exposure variable followed by +/- levels (e.g. smoking::yes/no)
outcome: Name of the outcome variable followed by +/- levels (e.g. heart_disease::yes/no)
Hildebrandt, M., Bender, R., Gehrmann, U., & Blettner, M. (2006). Calculating confidence intervals for impact numbers. BMC medical research methodology, 6, 32. https://doi.org/10.1186/1471-2288-6-32
Heller, R. F., Dobson, A. J., Attia, J., & Page, J. (2002). Impact numbers: measures of risk factor impact on the whole population from case-control and cohort studies. Journal of epidemiology and community health, 56(8), 606–610. https://doi.org/10.1136/jech.56.8.606
The twoxtwo framework allows for estimation of the magnitude of association between an exposure and outcome. Measures of effect that can be calculated include odds ratio, risk ratio, and risk difference. Each measure can be calculated as a point estimate as well as the standard error (SE) around that value. It is critical to note that the interpretation of measures of effect depends on the study design and research question being investigated.
The odds_ratio(), risk_ratio(), and risk_diff() functions provide a standard interface for calculating measures of effect. Each function takes an input dataset and arguments for outcome and exposure as bare, unquoted variable names. If the input has the twoxtwo class then the effect measures will be calculated using exposure and outcome information from that object. The functions all return a tidy tibble with the name of the measure, the point estimate, and lower/upper bounds of a confidence interval (CI) based on the SE.
Formulas used in point estimate and SE calculations are available in 'Details'.
odds_ratio(.data, exposure, outcome, alpha = 0.05, ...)
risk_ratio(.data, exposure, outcome, alpha = 0.05, ...)
risk_diff(.data, exposure, outcome, alpha = 0.05, ...)
.data |
Either a data frame with observation-level exposure and outcome data or a twoxtwo object |
exposure |
Name of exposure variable; ignored if input to .data is a twoxtwo object |
outcome |
Name of outcome variable; ignored if input to .data is a twoxtwo object |
alpha |
Significance level to be used for constructing confidence interval; default is 0.05 |
... |
Additional arguments passed to twoxtwo function; ignored if input to .data is a twoxtwo object |
The formulas below denote cell values as A,B,C,D. For more on twoxtwo notation see the twoxtwo documentation.
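As a minimal sketch of how these measures follow from the A,B,C,D cells, the base R code below computes the odds ratio, risk ratio, and risk difference with large-sample (Wald) confidence intervals. The point estimates are the standard definitions; the standard-error formulas are the usual textbook ones and are assumed, not confirmed, to match the package. Counts and variable names are hypothetical.

```r
# Hypothetical cell counts in A,B,C,D notation (illustrative values only)
A <- 30  # exposure +, outcome +
B <- 70  # exposure +, outcome -
C <- 10  # exposure -, outcome +
D <- 90  # exposure -, outcome -

alpha <- 0.05
z <- qnorm(1 - alpha / 2)  # critical value for a two-sided interval

risk_exposed   <- A / (A + B)
risk_unexposed <- C / (C + D)

# Point estimates
or_est <- (A * D) / (B * C)
rr_est <- risk_exposed / risk_unexposed
rd_est <- risk_exposed - risk_unexposed

# Textbook standard errors (ratio measures on the log scale)
se_log_or <- sqrt(1 / A + 1 / B + 1 / C + 1 / D)
se_log_rr <- sqrt(1 / A - 1 / (A + B) + 1 / C - 1 / (C + D))
se_rd     <- sqrt(risk_exposed * (1 - risk_exposed) / (A + B) +
                  risk_unexposed * (1 - risk_unexposed) / (C + D))

# Wald intervals: exponentiate for ratios, additive for the difference
or_ci <- exp(log(or_est) + c(-1, 1) * z * se_log_or)
rr_ci <- exp(log(rr_est) + c(-1, 1) * z * se_log_rr)
rd_ci <- rd_est + c(-1, 1) * z * se_rd

measures <- data.frame(
  measure  = c("Odds Ratio", "Risk Ratio", "Risk Difference"),
  estimate = c(or_est, rr_est, rd_est),
  ci_lower = c(or_ci[1], rr_ci[1], rd_ci[1]),
  ci_upper = c(or_ci[2], rr_ci[2], rd_ci[2])
)
measures
```

A ratio interval that excludes 1, or a difference interval that excludes 0, corresponds to rejecting the null of no association at the chosen alpha.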
A tibble with the following columns:
measure: Name of the measure calculated
estimate: Point estimate for the effect measure
ci_lower: The lower bound of the confidence interval for the estimate
ci_upper: The upper bound of the confidence interval for the estimate
exposure: Name of the exposure variable followed by +/- levels (e.g. smoking::yes/no)
outcome: Name of the outcome variable followed by +/- levels (e.g. heart_disease::yes/no)
Tripepi, G., Jager, K. J., Dekker, F. W., Wanner, C., & Zoccali, C. (2007). Measures of effect: relative risks, odds ratios, risk difference, and 'number needed to treat'. Kidney international, 72(7), 789–791. https://doi.org/10.1038/sj.ki.5002432
Walter, S. D. (2000). Choice of effect measure for epidemiological data. Journal of clinical epidemiology, 53(9), 931–939. https://doi.org/10.1016/s0895-4356(00)00210-9
Szklo, M., & Nieto, F. J. (2007). Epidemiology: Beyond the basics. Sudbury, Massachusetts: Jones and Bartlett.
Keyes, K. M., & Galea, S. (2014). Epidemiology Matters: A new introduction to methodological foundations. New York, New York: Oxford University Press.
The print.twoxtwo() function provides an S3 method for printing objects created with twoxtwo. The printed output formats the contents of the twoxtwo table as a kable.
## S3 method for class 'twoxtwo'
print(x, ...)
x |
twoxtwo object |
... |
Additional arguments passed to kable |
A printed knitr_kable object with the twoxtwo cell counts, exposure levels as row names, and outcome levels as column names.
The summary.twoxtwo() function provides an S3 method for summarizing objects created with twoxtwo. The summary function prints the twoxtwo via print.twoxtwo along with characteristics of the contingency table such as the number of missing observations and exposure/outcome variables and levels. The summary will also compute effect measures using odds_ratio, risk_ratio, and risk_diff and print the estimates and confidence interval for each.
## S3 method for class 'twoxtwo'
summary(object, alpha = 0.05, ...)
object |
twoxtwo object |
alpha |
Significance level to be used for constructing confidence interval; default is 0.05 |
... |
Additional arguments passed to print.twoxtwo |
Printed summary information including the outcome and exposure variables and levels, as well as the number of missing observations, the twoxtwo contingency table, and formatted effect measures (see "Description"). In addition to printed output, the function invisibly returns a named list with computed effect measures (i.e. the tibble outputs from odds_ratio, risk_ratio, and risk_diff respectively).
This data is based on the Titanic dataset. Unlike the version in the datasets package, the data here is expanded to the observation level rather than cross-tabulated.
titanic
A data frame with 2201 rows and 5 variables:
Class: Passenger class ("1st", "2nd", "3rd") or crew status ("Crew")
Crew: Logical indicating whether the individual was a crew member (TRUE) or not (FALSE)
Sex: Sex of individual ("Male" or "Female")
Age: Categorized age ("Adult" or "Child")
Survived: Whether or not individual survived ("Yes" or "No")
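To illustrate what "expanded to the observation level" means, the sketch below rebuilds a comparable data frame from the cross-tabulated Titanic table in the datasets package. This is one plausible way to do the expansion, not necessarily how the packaged data was produced; the object names tab and obs are hypothetical.

```r
# Flatten the 4-way table into one row per cell with a Freq count
tab <- as.data.frame(datasets::Titanic)

# Repeat each row Freq times to get one row per individual
obs <- tab[rep(seq_len(nrow(tab)), times = tab$Freq),
           c("Class", "Sex", "Age", "Survived")]
rownames(obs) <- NULL

# Derive the logical crew indicator from the Class column
obs$Crew <- obs$Class == "Crew"

# Arrange columns in the order listed in the format description
obs <- obs[, c("Class", "Crew", "Sex", "Age", "Survived")]

dim(obs)
```

Cells with a zero count contribute no rows, so the result has exactly as many rows as there were people in the original table.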
head(titanic)
The twoxtwo constructor function takes an input data frame and summarizes counts of the specified exposure and outcome variables as a two-by-two contingency table. This function is used internally in other functions, but can be used on its own as well. The returned object is given a twoxtwo class which allows dispatch of the twoxtwo S3 methods (see print.twoxtwo and summary.twoxtwo).
For more information on how the two-by-two table is created see 'Details'.
twoxtwo(.data, exposure, outcome, levels = NULL, na.rm = TRUE, retain = TRUE)
.data |
Data frame with observation-level exposure and outcome data |
exposure |
Name of exposure variable |
outcome |
Name of outcome variable |
levels |
Levels for the exposure and outcome as a named list; if supplied, then the contingency table will be oriented with respect to the sequence of levels specified; default is NULL |
na.rm |
Logical as to whether or not to remove NA values; default is TRUE |
retain |
Logical as to whether or not the original data passed to the ".data" argument should be retained; if FALSE, the data element of the returned object will be NULL; default is TRUE |
The two-by-two table covers four conditions that can be specified with A,B,C,D notation:
A: Exposure "+" and Outcome "+"
B: Exposure "+" and Outcome "-"
C: Exposure "-" and Outcome "+"
D: Exposure "-" and Outcome "-"
twoxtwo() requires that the exposure and outcome variables are binary. The columns can be character, numeric, or factor but must have only two levels. Each column will internally be coerced to a factor with levels reversed. The reversal results in exposures with TRUE and FALSE (or 1 and 0) oriented in the two-by-two table with the TRUE as "+" (first row) and FALSE as "-" (second row). Likewise, TRUE/FALSE outcomes will be oriented with TRUE as "+" (first column) and FALSE as "-" (second column). Note that the user can also define the orientation of the table using the "levels" argument.
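The level reversal described above can be sketched in base R with factor() and table(). This is an illustration of the idea only, using a small invented data frame; the helper name to_reversed_factor and the column names are hypothetical, and the package's internal code may differ.

```r
# Small hypothetical observation-level dataset
dat <- data.frame(
  exposed = c(TRUE, TRUE, FALSE, FALSE, TRUE, FALSE),
  sick    = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE)
)

# Coerce to factor (levels sort as FALSE, TRUE), then reverse the level order
to_reversed_factor <- function(x) {
  f <- factor(x)
  factor(f, levels = rev(levels(f)))
}

exposure_f <- to_reversed_factor(dat$exposed)
outcome_f  <- to_reversed_factor(dat$sick)

# Cross-tabulate: TRUE ("+") now occupies the first row and first column
tab <- table(exposure = exposure_f, outcome = outcome_f)

# Map the table onto A,B,C,D notation
cells <- list(
  A = unname(tab[1, 1]),  # exposure +, outcome +
  B = unname(tab[1, 2]),  # exposure +, outcome -
  C = unname(tab[2, 1]),  # exposure -, outcome +
  D = unname(tab[2, 2])   # exposure -, outcome -
)
cells
```

Without the reversal, FALSE would sort first and the "+" cells would land in the second row and column, which would invert ratio measures computed from the A,B,C,D positions.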
A named list with the twoxtwo class. Elements include:
tbl: The summarized two-by-two contingency table as a tibble.
cells: Named list with the counts in each of the cells in the two-by-two contingency table (i.e. A,B,C,D)
exposure: Named list of exposure information (name of variable and levels)
outcome: Named list of outcome information (name of variable and levels)
n_missing: The number of missing values (in either exposure or outcome variable) removed prior to computing counts for the two-by-two table
data: The original data frame passed to the ".data" argument. If retain=FALSE, then this element will be NULL.