summarytools provides a coherent set of functions centered on data exploration and simple reporting. At its core reside the following four functions:
Function | Description |
---|---|
freq() | Frequency Tables featuring counts, proportions, cumulative statistics as well as missing data reporting |
ctable() | Cross-Tabulations (joint frequencies) between pairs of discrete/categorical variables, featuring marginal sums as well as row, column or total proportions |
descr() | Descriptive (Univariate) Statistics for numerical data, featuring common measures of central tendency and dispersion |
dfSummary() | Data Frame Summaries featuring type-specific information for all variables: univariate statistics and/or frequency distributions, bar charts or histograms, as well as missing data counts and proportions. Very useful to quickly detect anomalies and identify trends at a glance |
The package was developed with the following objectives in mind:
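Results can be displayed in the R console, rendered as html and shown in RStudio’s Viewer or in a Web browser, or written to text, markdown or html files.
When creating R Markdown documents, make sure to
- use the chunk option results="asis"
- use the function argument plain.ascii=FALSE
- set the style argument to "rmarkdown", or to "grid" for dfSummary()
The package also has the following characteristics:
- freq(), ctable() and descr() support sampling weights
- The pipe operators from magrittr (%>%, %$%) and pipeR (%>>%) are fully supported; the native |> introduced in R 4.1 is supported as well
- Default values for function arguments can be set with st_options() to minimize redundancy in function calls
- By-group processing is available through the stby() function, which is a slightly modified version of base::by(), but dplyr::group_by() is also supported
- Base R’s format() parameters are also supported; this can be used to set the thousands separator or modify the decimal separator, among other possibilities (see help("format"))
The freq() function generates frequency tables with counts, proportions, as well as missing data information. Side note: the very idea for creating this package stemmed from the absence of such a function in base R.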
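A minimal sketch of a call that is consistent with the table that follows. The plain.ascii and style arguments are the ones named in the text; the value "rmarkdown" for style is an assumption based on the options table further down, and species_freq is just an illustrative name.

```r
library(summarytools)

# Frequency table for a factor; the two formatting arguments are spelled out
# here, although they can also be set once and for all with st_options()
species_freq <- freq(iris$Species,
                     plain.ascii = FALSE,    # markdown-friendly output
                     style       = "rmarkdown")  # assumed style value
species_freq
```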
iris$Species
Type: Factor
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
---|---|---|---|---|---|
setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
<NA> | 0 | | | 0.00 | 100.00 |
Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
In this first example, the plain.ascii and style arguments were specified. However, since we have defined them globally for this document using st_options(), they are redundant and will be omitted from here on (section 16 contains a detailed description of this vignette’s configuration).
One of summarytools’ main purposes is to help clean and prepare data for further analysis. But in some circumstances, we don’t need (or already have) information about missing data. Using report.nas = FALSE makes the output table smaller by one row and two columns:
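As a sketch, a call along these lines would give the reduced table shown next. That headings = FALSE was also used is an assumption, inferred from the absence of a heading section in the output; species_freq_valid is an illustrative name.

```r
library(summarytools)

# Drop the <NA> row and the two "Total" proportion columns
species_freq_valid <- freq(iris$Species,
                           report.nas = FALSE,
                           headings   = FALSE)  # assumed: no heading section
species_freq_valid
```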
| | Freq | % | % Cum. |
---|---|---|---|
setosa | 50 | 33.33 | 33.33 |
versicolor | 50 | 33.33 | 66.67 |
virginica | 50 | 33.33 | 100.00 |
Total | 150 | 100.00 | 100.00 |
The headings = FALSE parameter suppresses the heading section.
By “switching off” all optional elements, a much simpler table will be produced:
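A hedged sketch of what switching off every optional element could look like; the exact set of arguments is an assumption, chosen to match the table that follows (no missing-data row, no totals, no cumulative column, no headings).

```r
library(summarytools)

# Every optional element of the frequency table is turned off
species_freq_min <- freq(iris$Species,
                         report.nas = FALSE,  # no <NA> row / "valid" columns
                         totals     = FALSE,  # no Total row
                         cumul      = FALSE,  # no cumulative proportions
                         headings   = FALSE)  # no heading section
species_freq_min
```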
| | Freq | % |
---|---|---|
setosa | 50 | 33.33 |
versicolor | 50 | 33.33 |
virginica | 50 | 33.33 |
While the output is much simplified, the syntax is not; I blame it on
Tesler’s law
of conservation of complexity! Thankfully, st_options()
is there to accommodate everyone’s preferences (see section on package options).
To generate frequency tables for all variables in a data frame, we
could (and in the earliest versions, needed to) use
lapply()
. However, this is not required since
freq()
accepts data frames as the main argument:
To avoid cluttering the results, numerical columns having
more than 25 distinct values are ignored. This threshold of 25 can be
changed by using st_options()
; for example, to change it to
10, we’d use st_options(freq.ignore.threshold = 10)
.
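The two steps described above can be sketched as follows, using the tobacco data frame that ships with the package. Restoring the threshold to 25 at the end is our own addition, so that the rest of the session is unaffected.

```r
library(summarytools)

# Frequency tables for all eligible columns of a data frame
tobacco_freqs <- freq(tobacco)

# Lower the threshold above which numerical columns are ignored
st_options(freq.ignore.threshold = 10)
tobacco_freqs_10 <- freq(tobacco)

# Put the threshold back to its default value
st_options(freq.ignore.threshold = 25)
```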
The tobacco data frame contains simulated data and is included in the package. Another simulated data frame is included: exams. Both have French versions (tabagisme, examens).
The rows parameter allows subsetting frequency tables; we can use this parameter in different ways:
- To select rows by position, we use a numerical vector; rows = 1:10 will show the frequencies for the first 10 values only. To account for the frequencies of unshown values, the “(Other)” row is automatically added
- To select rows by name, we use a character string, which is treated as a regular expression (see ?regex for more information on this topic)
By combining the order and rows parameters, we can easily filter the results to show, for example, the 5 most common values in a factor:
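A sketch of such a call, assuming the factor of interest is the disease column of tobacco (the values in the table below suggest so) and that headings were turned off:

```r
library(summarytools)

# Sort by decreasing frequency, then keep the first five rows;
# the remaining values are pooled in an "(Other)" row
top_diseases <- freq(tobacco$disease,
                     order    = "freq",
                     rows     = 1:5,
                     headings = FALSE)
top_diseases
```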
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
---|---|---|---|---|---|
Hypertension | 36 | 16.22 | 16.22 | 3.60 | 3.60 |
Cancer | 34 | 15.32 | 31.53 | 3.40 | 7.00 |
Cholesterol | 21 | 9.46 | 40.99 | 2.10 | 9.10 |
Heart | 20 | 9.01 | 50.00 | 2.00 | 11.10 |
Pulmonary | 20 | 9.01 | 59.01 | 2.00 | 13.10 |
(Other) | 91 | 40.99 | 100.00 | 9.10 | 22.20 |
<NA> | 778 | | | 77.80 | 100.00 |
Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |
Instead of "freq"
, we can use "-freq"
to
reverse the ordering and get results ranked from lowest to highest in
frequency.
Notice the “(Other)” row, which is automatically generated.
When generating html results, use the
collapse = TRUE
argument with print()
or
view()
/ stview()
to get collapsible sections;
clicking on the variable name in the heading section will collapse /
reveal the frequency table (results not shown).
ctable()
generates cross-tabulations (joint frequencies)
for pairs of categorical variables.
Using the tobacco simulated data frame, we’ll cross-tabulate the two categorical variables smoker and diseased.
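A minimal sketch of the corresponding call. Row proportions are requested explicitly with prop = "r", although this is also the default; smoker_by_disease is an illustrative name.

```r
library(summarytools)

# Cross-tabulation of two categorical variables, with row proportions
smoker_by_disease <- ctable(x    = tobacco$smoker,
                            y    = tobacco$diseased,
                            prop = "r")
smoker_by_disease
```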
smoker * diseased
Data Frame: tobacco
diseased | Yes | No | Total | |
smoker | ||||
Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) | |
No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) | |
Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |
As can be seen, since markdown does not fully support multiline table headings, pander does what it can to display this particular type of table. To get better results, the “render” method is recommended and will be used in the next examples.
Row proportions are shown by default. To display column or
total proportions, use prop = "c"
or
prop = "t"
, respectively. To omit proportions altogether,
use prop = "n"
.
By “switching off” all optional features, we get a simple “2 x 2” table:
with(tobacco,
print(ctable(x = smoker,
y = diseased,
prop = 'n',
totals = FALSE,
headings = FALSE),
method = "render")
)
diseased | ||
---|---|---|
smoker | Yes | No |
Yes | 125 | 173 |
No | 99 | 603 |
To display the chi-square statistic, set chisq = TRUE
.
For 2 x 2 tables, use OR
and RR
to
show odds ratio and risk ratio (also called relative risk),
respectively. Those can be set to TRUE
, in which case 95%
confidence intervals are shown; to use different confidence levels, use
for example OR = .90
.
Using pipes generally makes it easier to generate ctable() results.
library(magrittr)
tobacco %$% # Acts like with(tobacco, ...)
ctable(x = smoker, y = diseased,
chisq = TRUE,
OR = TRUE,
RR = TRUE,
headings = FALSE) %>%
print(method = "render")
smoker | diseased: Yes | diseased: No | Total |
---|---|---|---|
Yes | 125 (41.9%) | 173 (58.1%) | 298 (100.0%) |
No | 99 (14.1%) | 603 (85.9%) | 702 (100.0%) |
Total | 224 (22.4%) | 776 (77.6%) | 1000 (100.0%) |

Χ2 = 91.7088 df = 1 p = .0000
O.R. (95% C.I.) = 4.40 (3.22 - 6.02)
R.R. (95% C.I.) = 2.97 (2.37 - 3.73)
descr()
generates descriptive / univariate statistics,
i.e. common central tendency statistics and measures of
dispersion. It accepts single vectors as well as data frames; in the
latter case, all non-numerical columns are ignored, with a message to
that effect.
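For instance, passing the whole iris data frame is enough to get statistics for its four numerical columns; the Species factor is skipped with a message. This is a minimal sketch, and iris_descr is an illustrative name.

```r
library(summarytools)

# Univariate statistics for every numerical column of a data frame
iris_descr <- descr(iris)
iris_descr
```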
Non-numerical variable(s) ignored: Species
iris
N: 150
| | Petal.Length | Petal.Width | Sepal.Length | Sepal.Width |
---|---|---|---|---|
Mean | 3.76 | 1.20 | 5.84 | 3.06 |
Std.Dev | 1.77 | 0.76 | 0.83 | 0.44 |
Min | 1.00 | 0.10 | 4.30 | 2.00 |
Q1 | 1.60 | 0.30 | 5.10 | 2.80 |
Median | 4.35 | 1.30 | 5.80 | 3.00 |
Q3 | 5.10 | 1.80 | 6.40 | 3.30 |
Max | 6.90 | 2.50 | 7.90 | 4.40 |
MAD | 1.85 | 1.04 | 1.04 | 0.44 |
IQR | 3.50 | 1.50 | 1.30 | 0.50 |
CV | 0.47 | 0.64 | 0.14 | 0.14 |
Skewness | -0.27 | -0.10 | 0.31 | 0.31 |
SE.Skewness | 0.20 | 0.20 | 0.20 | 0.20 |
Kurtosis | -1.42 | -1.36 | -0.61 | 0.14 |
N.Valid | 150.00 | 150.00 | 150.00 | 150.00 |
Pct.Valid | 100.00 | 100.00 | 100.00 | 100.00 |
To turn off the variable-type messages, use silent = TRUE. It is possible to set that option globally, which we will do here, so it won’t be displayed in the remainder of this vignette.
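Setting the option globally could look like this, assuming the relevant global option is descr.silent (the name listed in the options table further down):

```r
library(summarytools)

# Silence the "Non-numerical variable(s) ignored" message for all descr() calls
st_options(descr.silent = TRUE)
```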
Results can be transposed by using transpose = TRUE
, and
statistics can be selected using the stats
argument:
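A sketch consistent with the table that follows; the choice of "mean" and "sd" and the use of headings = FALSE are assumptions inferred from that output.

```r
library(summarytools)

# Keep two statistics only and show one variable per row
iris_mean_sd <- descr(iris,
                      stats     = c("mean", "sd"),
                      transpose = TRUE,
                      headings  = FALSE)
iris_mean_sd
```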
| | Mean | Std.Dev |
---|---|---|
Petal.Length | 3.76 | 1.77 |
Petal.Width | 1.20 | 0.76 |
Sepal.Length | 5.84 | 0.83 |
Sepal.Width | 3.06 | 0.44 |
See ?descr
for a list of all available statistics.
Special values “all”, “fivenum”, and “common” are also valid. The
default value is “all”, and it can be modified using
st_options()
:
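As a sketch, the default could be changed as follows; the value "common" is only an example, and the option name descr.stats is taken from the options table further down.

```r
library(summarytools)

# Make "common" the default set of statistics for descr()
st_options(descr.stats = "common")
```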
dfSummary()
creates a summary table with statistics,
frequencies and graphs for all variables in a data frame. The
information displayed is type-specific (character, factor, numeric,
date) and also varies according to the number of distinct values.
To see the results in RStudio’s Viewer (or in the default Web browser
if working in another IDE or from a terminal window), use the
view()
function, or its twin stview()
in case
of name conflicts:
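A minimal sketch of that workflow. The interactive() guard is our own addition, so that nothing tries to open a viewer in a non-interactive session.

```r
library(summarytools)

# Build the summary once, then display it
iris_summary <- dfSummary(iris)

if (interactive()) {
  view(iris_summary)      # RStudio's Viewer, or the default Web browser
  # stview(iris_summary)  # same thing, immune to name conflicts
}
```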
Be careful to use view() and not View() (capital V). Some packages redefine view() to point to View(); loading summarytools after these packages will ensure its own view() works properly. Otherwise, stview() is always there as a foolproof alternative.
When using dfSummary()
in R Markdown documents,
it is generally a good idea to exclude a column or two to avoid margin
overflow. Since the Valid and Missing columns are
redundant, we can drop either one of them.
dfSummary(tobacco,
plain.ascii = FALSE,
style = "grid",
graph.magnif = 0.75,
valid.col = FALSE,
tmp.img.dir = "/tmp")
The tmp.img.dir
parameter is mandatory
when generating dfSummaries in R Markdown documents,
except for html rendering. The explanation for this can be
found further below.
Some users reported repeated X11 warnings; those can be avoided by setting the warning chunk option to FALSE: {r chunk_name, results="asis", warning=FALSE}.
This feature has been requested several times since the package was released. Introduced in version 1.0.0, it provides control over which statistics are shown in the Stats/Values column. Namely, the third row, which displays IQR (CV), can be modified to show any statistic available in R. An additional “slot” (unused by default) is also made available. To use this feature, define dfSummary.custom.1 and/or dfSummary.custom.2 using st_options() in the following way, encapsulating the code in an expression():
st_options(
dfSummary.custom.1 =
expression(
paste(
"Q1 - Q3 :",
round(
quantile(column_data, probs = .25, type = 2,
names = FALSE, na.rm = TRUE), digits = 1
), " - ",
round(
quantile(column_data, probs = .75, type = 2,
names = FALSE, na.rm = TRUE), digits = 1
)
)
)
)
print(
dfSummary(iris,
varnumbers = FALSE,
na.col = FALSE,
style = "multiline",
plain.ascii = FALSE,
headings = FALSE,
graph.magnif = .8),
method = "render"
)
Variable | Stats / Values | Freqs (% of Valid) | Graph | Valid |
---|---|---|---|---|
Sepal.Length [numeric] | | 35 distinct values | | 150 (100.0%) |
Sepal.Width [numeric] | | 23 distinct values | | 150 (100.0%) |
Petal.Length [numeric] | | 43 distinct values | | 150 (100.0%) |
Petal.Width [numeric] | | 22 distinct values | | 150 (100.0%) |
Species [factor] | | | | 150 (100.0%) |
If we had used dfSummary.custom.2
instead of
dfSummary.custom.1
, a fourth row would have been added
under the default IQR (CV)
row.
Note that instead of round()
, it is possible to use the
internal format_number()
, which ensures the number is
formatted according to all specified arguments (rounding digits, decimal
mark and thousands mark, etc.). The internal variable
round.digits
which contains the value of
st_options("round.digits")
can also be used. This is how
the default IQR (CV)
is defined – here we set the first
custom stat back to its default value and then display its definition
(formatR::tidy_source()
is used to format / indent the
expression):
library(formatR)
st_options(dfSummary.custom.1 = "default")
formatR::tidy_source(
text = deparse(st_options("dfSummary.custom.1")),
indent = 2,
args.newline = TRUE
)
expression(
paste(
paste0(
trs("iqr"),
" (", trs("cv"),
") : "
),
format_number(
IQR(column_data, na.rm = TRUE),
round.digits
),
" (", format_number(
sd(column_data, na.rm = TRUE)/mean(column_data, na.rm = TRUE),
round.digits
),
")", collapse = "", sep = ""
)
)
Don’t forget to specify na.rm = TRUE for all functions that use this parameter (most base R functions do).
The dfSummary() function also supports the max.tbl.height parameter, which limits the height of the rendered table. This is especially convenient if the analyzed data frame has numerous variables; see vignette("rmarkdown", package = "summarytools") for more details.
Although most columns can be excluded using the function’s parameters, it is also possible to delete them with the following syntax (results not shown):
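A sketch of that syntax, assuming the object returned by dfSummary() can be handled like a regular data frame and that the column to drop is the one named Variable:

```r
library(summarytools)

iris_summary <- dfSummary(iris)

# Delete the "Variable" column from the summary object itself
iris_summary$Variable <- NULL
```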
To produce optimal results, summarytools has its own
version of the base by()
function. It’s called
stby()
, and we use it exactly as we would
by()
:
(iris_stats_by_species <- stby(data = iris,
INDICES = iris$Species,
FUN = descr,
stats = "common",
transpose = TRUE))
iris
Group: Species = setosa
N: 50
| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
---|---|---|---|---|---|---|---|
Petal.Length | 1.46 | 0.17 | 1.00 | 1.50 | 1.90 | 50.00 | 100.00 |
Petal.Width | 0.25 | 0.11 | 0.10 | 0.20 | 0.60 | 50.00 | 100.00 |
Sepal.Length | 5.01 | 0.35 | 4.30 | 5.00 | 5.80 | 50.00 | 100.00 |
Sepal.Width | 3.43 | 0.38 | 2.30 | 3.40 | 4.40 | 50.00 | 100.00 |
Group: Species = versicolor
N: 50
| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
---|---|---|---|---|---|---|---|
Petal.Length | 4.26 | 0.47 | 3.00 | 4.35 | 5.10 | 50.00 | 100.00 |
Petal.Width | 1.33 | 0.20 | 1.00 | 1.30 | 1.80 | 50.00 | 100.00 |
Sepal.Length | 5.94 | 0.52 | 4.90 | 5.90 | 7.00 | 50.00 | 100.00 |
Sepal.Width | 2.77 | 0.31 | 2.00 | 2.80 | 3.40 | 50.00 | 100.00 |
Group: Species = virginica
N: 50
| | Mean | Std.Dev | Min | Median | Max | N.Valid | Pct.Valid |
---|---|---|---|---|---|---|---|
Petal.Length | 5.55 | 0.55 | 4.50 | 5.55 | 6.90 | 50.00 | 100.00 |
Petal.Width | 2.03 | 0.27 | 1.40 | 2.00 | 2.50 | 50.00 | 100.00 |
Sepal.Length | 6.59 | 0.64 | 4.90 | 6.50 | 7.90 | 50.00 | 100.00 |
Sepal.Width | 2.97 | 0.32 | 2.20 | 3.00 | 3.80 | 50.00 | 100.00 |
When used to produce split-group statistics for a single variable,
stby()
assembles everything into a single table instead of
displaying a series of one-column tables.
with(tobacco,
stby(data = BMI,
INDICES = age.gr,
FUN = descr,
stats = c("mean", "sd", "min", "med", "max"))
)
BMI by age.gr
Data Frame: tobacco
N: 258
| | 18-34 | 35-50 | 51-70 | 71 + |
---|---|---|---|---|
Mean | 23.84 | 25.11 | 26.91 | 27.45 |
Std.Dev | 4.23 | 4.34 | 4.26 | 4.37 |
Min | 8.83 | 10.35 | 9.01 | 16.36 |
Median | 24.04 | 25.11 | 26.77 | 27.52 |
Max | 34.84 | 39.44 | 39.21 | 38.37 |
The syntax for combining stby() with ctable() is a little trickier, so here is an example (results not shown):
stby(data = list(x = tobacco$smoker, y = tobacco$diseased),
INDICES = tobacco$gender,
FUN = ctable)
# or equivalently
with(tobacco,
stby(data = list(x = smoker, y = diseased),
INDICES = gender,
FUN = ctable))
To create grouped statistics with freq()
,
descr()
or dfSummary()
, it is possible to use
dplyr’s group_by()
as an alternative to
stby()
. Syntactic differences aside, one key distinction is
that group_by()
considers NA
values on the
grouping variable(s) as a valid category, albeit with a warning
suggesting the use of forcats::fct_explicit_na
to make
NA
’s explicit in factors. Following this advice, we
get:
library(dplyr)
tobacco$gender %<>% forcats::fct_explicit_na()
tobacco %>%
group_by(gender) %>%
descr(stats = "fivenum")
Warning: `fct_explicit_na()` was deprecated in forcats 1.0.0.
ℹ Please use `fct_na_value_to_level()` instead.
This warning is displayed once every 8 hours.
Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
generated.
tobacco
Group: gender = F
N: 489
| | BMI | age | cigs.per.day | samp.wgts |
---|---|---|---|---|
Min | 9.01 | 18.00 | 0.00 | 0.86 |
Q1 | 22.98 | 34.00 | 0.00 | 0.86 |
Median | 25.87 | 50.00 | 0.00 | 1.04 |
Q3 | 29.48 | 66.00 | 10.50 | 1.05 |
Max | 39.44 | 80.00 | 40.00 | 1.06 |
Group: gender = M
N: 489
| | BMI | age | cigs.per.day | samp.wgts |
---|---|---|---|---|
Min | 8.83 | 18.00 | 0.00 | 0.86 |
Q1 | 22.52 | 34.00 | 0.00 | 0.86 |
Median | 25.14 | 49.50 | 0.00 | 1.04 |
Q3 | 27.96 | 66.00 | 11.00 | 1.05 |
Max | 36.76 | 80.00 | 40.00 | 1.06 |
Group: gender = (Missing)
N: 22
| | BMI | age | cigs.per.day | samp.wgts |
---|---|---|---|---|
Min | 20.24 | 19.00 | 0.00 | 0.86 |
Q1 | 24.97 | 36.00 | 0.00 | 1.04 |
Median | 27.16 | 55.50 | 0.00 | 1.05 |
Q3 | 30.23 | 64.00 | 10.00 | 1.05 |
Max | 32.43 | 80.00 | 28.00 | 1.06 |
When generating freq()
or descr()
tables,
it is possible to turn the results into “tidy” tables with the use of
the tb()
function (think of tb as a diminutive for
tibble). For example:
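The two tibbles shown next are consistent with calls such as the following. The arguments passed to descr() and freq() are assumptions inferred from the columns that appear in the output.

```r
library(summarytools)

# descr() results as a tibble: one row per variable
descr_tbl <- tb(descr(iris, stats = "common"))
descr_tbl

# freq() results as a tibble: one row per value
freq_tbl <- tb(freq(iris$Species, cumul = FALSE, report.nas = FALSE))
freq_tbl
```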
# A tibble: 4 × 8
variable mean sd min med max n.valid pct.valid
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Petal.Length 3.76 1.77 1 4.35 6.9 150 100
2 Petal.Width 1.20 0.762 0.1 1.3 2.5 150 100
3 Sepal.Length 5.84 0.828 4.3 5.8 7.9 150 100
4 Sepal.Width 3.06 0.436 2 3 4.4 150 100
# A tibble: 3 × 3
Species freq pct
<fct> <dbl> <dbl>
1 setosa 50 33.3
2 versicolor 50 33.3
3 virginica 50 33.3
By definition, no total rows are part of tidy tables, and the row names are converted to a regular column.
When displaying tibbles using rmarkdown, the knitr chunk option results should be set to ‘markup’ instead of ‘asis’.
Here are some examples showing how lists created using
stby()
or group_by()
can be transformed into
tidy tibbles.
grouped_descr <- stby(data = exams,
INDICES = exams$gender,
FUN = descr,
stats = "common")
grouped_descr %>% tb()
# A tibble: 12 × 9
gender variable mean sd min med max n.valid pct.valid
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Girl economics 72.5 7.79 62.3 70.2 89.6 14 93.3
2 Girl english 73.9 9.41 58.3 71.8 93.1 14 93.3
3 Girl french 71.1 12.4 44.8 68.4 93.7 14 93.3
4 Girl geography 67.3 8.26 50.4 67.3 78.9 15 100
5 Girl history 71.2 9.17 53.9 72.9 86.4 15 100
6 Girl math 73.8 9.03 55.6 74.8 86.3 14 93.3
7 Boy economics 75.2 9.40 60.5 71.7 94.2 15 100
8 Boy english 77.8 5.94 69.6 77.6 90.2 15 100
9 Boy french 76.6 8.63 63.2 74.8 94.7 15 100
10 Boy geography 73 12.4 47.2 71.2 96.3 14 93.3
11 Boy history 74.4 11.2 54.4 72.6 93.5 15 100
12 Boy math 73.3 9.68 60.5 72.2 93.2 14 93.3
The order
parameter controls row ordering:
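A sketch of the call behind the next table, assuming order = 2 was used (the following paragraph compares it with order = 3). The grouped object is recreated here so that the snippet runs on its own.

```r
library(summarytools)

grouped_descr <- stby(data    = exams,
                      INDICES = exams$gender,
                      FUN     = descr,
                      stats   = "common")

# Sort by variable first, then by group
grouped_tbl <- tb(grouped_descr, order = 2)
grouped_tbl
```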
# A tibble: 12 × 9
gender variable mean sd min med max n.valid pct.valid
<fct> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Girl economics 72.5 7.79 62.3 70.2 89.6 14 93.3
2 Boy economics 75.2 9.40 60.5 71.7 94.2 15 100
3 Girl english 73.9 9.41 58.3 71.8 93.1 14 93.3
4 Boy english 77.8 5.94 69.6 77.6 90.2 15 100
5 Girl french 71.1 12.4 44.8 68.4 93.7 14 93.3
6 Boy french 76.6 8.63 63.2 74.8 94.7 15 100
7 Girl geography 67.3 8.26 50.4 67.3 78.9 15 100
8 Boy geography 73 12.4 47.2 71.2 96.3 14 93.3
9 Girl history 71.2 9.17 53.9 72.9 86.4 15 100
10 Boy history 74.4 11.2 54.4 72.6 93.5 15 100
11 Girl math 73.8 9.03 55.6 74.8 86.3 14 93.3
12 Boy math 73.3 9.68 60.5 72.2 93.2 14 93.3
Setting order = 3
changes the order of the sort
variables exactly as with order = 2
, but it also reorders
the columns:
# A tibble: 12 × 9
variable gender mean sd min med max n.valid pct.valid
<chr> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 economics Girl 72.5 7.79 62.3 70.2 89.6 14 93.3
2 economics Boy 75.2 9.40 60.5 71.7 94.2 15 100
3 english Girl 73.9 9.41 58.3 71.8 93.1 14 93.3
4 english Boy 77.8 5.94 69.6 77.6 90.2 15 100
5 french Girl 71.1 12.4 44.8 68.4 93.7 14 93.3
6 french Boy 76.6 8.63 63.2 74.8 94.7 15 100
7 geography Girl 67.3 8.26 50.4 67.3 78.9 15 100
8 geography Boy 73 12.4 47.2 71.2 96.3 14 93.3
9 history Girl 71.2 9.17 53.9 72.9 86.4 15 100
10 history Boy 74.4 11.2 54.4 72.6 93.5 15 100
11 math Girl 73.8 9.03 55.6 74.8 86.3 14 93.3
12 math Boy 73.3 9.68 60.5 72.2 93.2 14 93.3
For more details, see ?tb
.
summarytools objects are not always compatible with
packages focused on table formatting, such as formattable or
kableExtra.
However, tb()
can be used as a “bridge”, an intermediary
step turning freq()
and descr()
objects into
simple tables that any package can work with. Here is an example using
kableExtra:
library(kableExtra)
library(magrittr)
stby(data = iris,
INDICES = iris$Species,
FUN = descr,
stats = "fivenum") %>%
tb(order = 3) %>%
kable(format = "html", digits = 2) %>%
collapse_rows(columns = 1, valign = "top")
variable | Species | min | q1 | med | q3 | max |
---|---|---|---|---|---|---|
Petal.Length | setosa | 1.0 | 1.4 | 1.50 | 1.6 | 1.9 |
| | versicolor | 3.0 | 4.0 | 4.35 | 4.6 | 5.1 |
| | virginica | 4.5 | 5.1 | 5.55 | 5.9 | 6.9 |
Petal.Width | setosa | 0.1 | 0.2 | 0.20 | 0.3 | 0.6 |
| | versicolor | 1.0 | 1.2 | 1.30 | 1.5 | 1.8 |
| | virginica | 1.4 | 1.8 | 2.00 | 2.3 | 2.5 |
Sepal.Length | setosa | 4.3 | 4.8 | 5.00 | 5.2 | 5.8 |
| | versicolor | 4.9 | 5.6 | 5.90 | 6.3 | 7.0 |
| | virginica | 4.9 | 6.2 | 6.50 | 6.9 | 7.9 |
Sepal.Width | setosa | 2.3 | 3.2 | 3.40 | 3.7 | 4.4 |
| | versicolor | 2.0 | 2.5 | 2.80 | 3.0 | 3.4 |
| | virginica | 2.2 | 2.8 | 3.00 | 3.2 | 3.8 |
Using the file
argument with print()
or
view()
/ stview()
, we can write outputs to a
file, be it html, Rmd, md, or just plain text
(txt). The file extension is used by the package to determine
the type of content to write out.
view(iris_stats_by_species, file = "~/iris_stats_by_species.html")
view(iris_stats_by_species, file = "~/iris_stats_by_species.md")
A Note on PDF documents
There is no direct way to create a PDF file with summarytools. One option is to generate an html file and convert it to PDF using Pandoc or wkhtmltopdf (the latter gives better results than Pandoc with dfSummary() output).
Another option is to create an Rmd document using
PDF as the output format. See
vignette("rmarkdown", package = "summarytools")
for the
details on how to proceed.
The append
argument allows adding content to existing
files generated by summarytools. This is useful if we
wish to include several statistical tables in a single file. It is a
quick alternative to creating an Rmd document.
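A minimal sketch of appending two tables to the same html file. The file name is hypothetical, and the file is written to R's temporary directory to keep the example harmless.

```r
library(summarytools)

# Hypothetical output file, placed in R's temporary directory
report_file <- file.path(tempdir(), "iris_report.html")

# First table creates the file, second one is appended to it
print(freq(iris$Species), file = report_file)
print(descr(iris), file = report_file, append = TRUE)
```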
The following options can be set globally with
st_options()
:
Option name | Default | Note |
---|---|---|
style (1) | “simple” | Set to “rmarkdown” in .Rmd documents |
plain.ascii | TRUE | Set to FALSE in .Rmd documents |
round.digits (2) | 2 | Number of decimals to show |
headings | TRUE | Formerly “omit.headings” |
footnote | “default” | Customize or set to NA to omit |
display.labels | TRUE | Show variable / data frame labels in headings |
bootstrap.css (3) | TRUE | Include Bootstrap 4 CSS in html output files |
custom.css | NA | Path to your own CSS file |
escape.pipe | FALSE | Useful for some Pandoc conversions |
char.split (4) | 12 | Threshold for line-wrapping in column headings |
subtitle.emphasis | TRUE | Controls headings formatting |
lang | “en” | Language (always 2-letter, lowercase) |
1 Does not apply to dfSummary()
, which has
its own style option (see next table)
2 Does not apply to ctable()
, which has its own
round.digits
option (see next table)
3 Set to FALSE
in Shiny apps
4 Affects only html outputs for descr()
and ctable()
Option name | Default | Note |
---|---|---|
freq.cumul | TRUE | Display cumulative proportions in freq() |
freq.totals | TRUE | Display totals row in freq() |
freq.report.nas | TRUE | Display the <NA> row and “valid” columns in freq() |
freq.ignore.threshold (1) | 25 | Used to determine which vars to ignore |
freq.silent | FALSE | Hide console messages |
ctable.prop | “r” | Display row proportions by default |
ctable.totals | TRUE | Show marginal totals |
ctable.round.digits | 1 | Number of decimals to show in ctable() |
descr.stats | “all” | “fivenum”, “common” or vector of stats |
descr.transpose | FALSE | Display stats in columns instead of rows |
descr.silent | FALSE | Hide console messages |
dfSummary.style | “multiline” | Can be set to “grid” as an alternative |
dfSummary.varnumbers | TRUE | Show variable numbers in 1st col. |
dfSummary.labels.col | TRUE | Show variable labels when present |
dfSummary.graph.col | TRUE | Show graphs |
dfSummary.valid.col | TRUE | Include the Valid column in the output |
dfSummary.na.col | TRUE | Include the Missing column in the output |
dfSummary.graph.magnif | 1 | Zoom factor for bar plots and histograms |
dfSummary.silent | FALSE | Hide console messages |
tmp.img.dir (2) | NA | Directory to store temporary images |
use.x11 (3) | TRUE | Allow creation of Base64-encoded graphs |
1 See section 2.3 for
details
2 Applies to dfSummary()
only
3 Set to FALSE in text-only environments
When a summarytools object is created, its
formatting attributes are stored within it. However, we can override
most of them when using print()
or view()
.
The following table indicates what arguments can be used with
print()
or view()
to override formatting
attributes. Base R’s format()
function arguments can also
be used (although they are not listed here).
Argument | freq | ctable | descr | dfSummary |
---|---|---|---|---|
style | x | x | x | x |
round.digits | x | x | x | |
plain.ascii | x | x | x | x |
justify | x | x | x | x |
headings | x | x | x | x |
display.labels | x | x | x | x |
varnumbers | | | | x |
labels.col | | | | x |
graph.col | | | | x |
valid.col | | | | x |
na.col | | | | x |
col.widths | | | | x |
totals | x | x | ||
report.nas | x | |||
display.type | x | |||
missing | x | |||
split.tables (1) | x | x | x | x |
caption (1) | x | x | x | x |
1 pander options
To change the information shown in the heading section, use the
following arguments with print()
or
view()
:
Argument | freq | ctable | descr | dfSummary |
---|---|---|---|---|
Data.frame | x | x | x | x |
Data.frame.label | x | x | x | x |
Variable | x | x | x | |
Variable.label | x | x | x | |
Group | x | x | x | x |
date | x | x | x | x |
Weights | x | x | ||
Data.type | x | |||
Row.variable | | x | | |
Col.variable | | x | | |
In the following example, we will create and display a
freq()
object, and then display it again, this time
overriding three of its formatting attributes, as well as one of its
heading attributes.
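The two outputs below are consistent with the following sketch. The three formatting overrides and the heading override are inferred from the differences between the two tables, and age_stats is an illustrative name.

```r
library(summarytools)

# Create the object and display it with its stored attributes
age_stats <- freq(tobacco$age.gr)
age_stats

# Display it again, overriding three formatting attributes
# and one heading attribute
print(age_stats,
      report.nas     = FALSE,
      totals         = FALSE,
      display.type   = FALSE,
      Variable.label = "Age Group")
```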
tobacco$age.gr
Type: Factor
| | Freq | % Valid | % Valid Cum. | % Total | % Total Cum. |
---|---|---|---|---|---|
18-34 | 258 | 26.46 | 26.46 | 25.80 | 25.80 |
35-50 | 241 | 24.72 | 51.18 | 24.10 | 49.90 |
51-70 | 317 | 32.51 | 83.69 | 31.70 | 81.60 |
71 + | 159 | 16.31 | 100.00 | 15.90 | 97.50 |
<NA> | 25 | | | 2.50 | 100.00 |
Total | 1000 | 100.00 | 100.00 | 100.00 | 100.00 |
tobacco$age.gr
Label: Age Group
| | Freq | % | % Cum. |
---|---|---|---|
18-34 | 258 | 26.46 | 26.46 |
35-50 | 241 | 24.72 | 51.18 |
51-70 | 317 | 32.51 | 83.69 |
71 + | 159 | 16.31 | 100.00 |
1. print() or view() parameters have precedence (overriding feature)
2. freq() / ctable() / descr() / dfSummary() parameters come second
3. Global options set with st_options() come third and act as defaults
The logic for the evaluation of the various parameter values can be summarized as follows:
- If an argument is explicitly supplied in the function call, it will have precedence over any stored value for the parameter (stored values are the ones that are written to the object’s attributes when using a core function, as well as the ones stored in summarytools’ global options list).
- If both a core function and the print or view function are called at once and have conflicting parameter values, print/view has precedence (they always win the argument!).
- If the parameter values cannot be found in the function calls, the stored defaults (modified with st_options() or left as they are when loading the package) will be applied.
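The three levels can be illustrated with round.digits, which appears at each of them. This is a sketch of the rule rather than a description of the package internals.

```r
library(summarytools)

# Third priority: global default
st_options(round.digits = 2)

# Second priority: value stored in the object by the core function
species_freq <- freq(iris$Species, round.digits = 1)
species_freq                          # displayed with 1 decimal

# First priority: value supplied to print() overrides the stored one
print(species_freq, round.digits = 3) # displayed with 3 decimals
```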
When creating html reports, both Bootstrap’s CSS and summarytools.css are included by default. For greater control on the looks of html content, it is also possible to add class definitions in a custom CSS file.
We need to use a very small font size for a simple html
report containing a dfSummary()
. For this, we create a
.css file (with the name of our choosing) which contains the
following class definition:
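As a sketch, the file could be created directly from R. The class name tiny-text matches the table.classes value used in the print() call further down; the 8px font size and the file location are arbitrary assumptions.

```r
# Hypothetical location for the custom CSS file
css_path <- file.path(tempdir(), "custom.css")

# One class definition, written as plain text
writeLines(c(".tiny-text {",
             "  font-size: 8px;",   # assumed size, adjust to taste
             "}"),
           css_path)
```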
Then we use print()’s custom.css argument to specify the location of our newly created CSS file (results not shown):
print(dfSummary(tobacco),
custom.css = 'path/to/custom.css',
table.classes = 'tiny-text',
file = "tiny-tobacco-dfSummary.html")
To successfully include summarytools functions in Shiny apps,
- set bootstrap.css = FALSE to avoid interacting with the app’s layout
- set headings = FALSE in case problems arise
- adjust the size of the graphs with the graph.magnif parameter or with the dfSummary.graph.magnif global option
- if dfSummary() tables are too wide, omit a column or two (valid.col and varnumbers, for instance)
- if needed, set column widths with the col.widths parameter
- if col.widths or graph.magnif do not seem to work, try using them as parameters for print() rather than dfSummary()
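Put together, these recommendations could translate into a call like the one below, typically placed inside a Shiny render function. The specific values are examples only.

```r
library(summarytools)

# html rendering with settings that play well with a Shiny layout
shiny_summary <- print(
  dfSummary(tobacco,
            varnumbers   = FALSE,   # drop a column to narrow the table
            valid.col    = FALSE,   # drop another one
            graph.magnif = 0.8),    # slightly smaller graphs
  method        = "render",
  headings      = FALSE,
  bootstrap.css = FALSE
)
```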
When using dfSummary()
in an Rmd document using
markdown styling (as opposed to html rendering), three
elements are needed in order to display the png graphs
properly:
1 - plain.ascii must be set to FALSE
2 - style must be set to “grid”
3 - tmp.img.dir must be defined and be at most 5 characters wide
Note that as of version 0.9.9, setting
tmp.img.dir
is no longer required when
using method = "render"
and can be left to
NA
. It is only necessary to define it when a transitory
markdown table must be created, as shown below. Note how narrow the
Graph column is – this is actually required, since the width of
the rendered column is determined by the number of characters in the
cell, rather than the width of the image itself:
+---------------+--------+----------------------+---------+
| Variable | stats | Graph | Valid |
+===============+========+======================+=========+
| age\ | ... | ![](/tmp/ds0001.png) | 978\ |
| [numeric] | ... | | (97.8%) |
+---------------+--------+----------------------+---------+
CRAN policies are really strict when it comes to writing content in the user directories, or anywhere outside R’s temporary zone (for good reasons). So users need to set this temporary location themselves, therefore consenting to having content written outside R’s predefined temporary zone.
On Mac OS and Linux, using “/tmp” makes a lot of sense: it’s a short path, and the directory is purged automatically. On Windows, there is no such convenient directory, so we need to pick one – be it absolute (“/tmp”) or relative (“img”, or simply “.”).
Thanks to the R community’s efforts, the following languages can be used, in addition to English (default):
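To switch languages, simply set the lang option with st_options(), using the two-letter code of the language (for instance "fr" for French).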
All output from the core functions will now use that language:
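A sketch using French, which is the language of the output shown next. The language stays active until it is changed again, for example with st_options(lang = "en").

```r
library(summarytools)

# Switch the package's output language to French
st_options(lang = "fr")

species_freq_fr <- freq(iris$Species)
species_freq_fr
```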
iris$Species
Type: Facteur
| | Fréq. | % Valide | % Valide cum. | % Total | % Total cum. |
---|---|---|---|---|---|
setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
<NA> | 0 | 0.00 | 100.00 | ||
Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
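A table like the French one above can be obtained with a short sketch such as this one. The language code "fr" is inferred from the French headings shown; the variable name fr_table is purely illustrative.

```r
library(summarytools)

# Switch the package's output language to French
st_options(lang = "fr")

# Headings and column names are now in French
fr_table <- freq(iris$Species)
fr_table
```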
On most Windows systems, it is necessary to change the
LC_CTYPE
element of the locale settings if the character
set is not included in the system’s default locale. For instance, in
order to get good results with the Russian language in a “latin1”
environment, use the following settings:
To go back to default settings…
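Both steps, changing the settings and reverting them, could look like the sketch below. It assumes a Windows system that recognizes the locale name "russian" and that "ru" is the language code for Russian; on other systems Sys.setlocale() may only issue a warning and leave the locale unchanged.

```r
library(summarytools)

# Remember the current character-type locale (for reference)
old_ctype <- Sys.getlocale("LC_CTYPE")

# Assumption: the locale name "russian" is available (Windows)
Sys.setlocale("LC_CTYPE", "russian")
st_options(lang = "ru")

# ... generate tables in Russian here ...

# Go back to default settings: "" requests the system default locale
Sys.setlocale("LC_CTYPE", "")
st_options(lang = "en")
```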
Using the function use_custom_lang(), it is possible to add your own set of translations or personalized terms. To achieve this, get the csv template, customize one, many or all of the roughly 70 terms, and call use_custom_lang(), giving it as sole argument the path to the edited csv template. Note that such custom language settings will not persist across R sessions, so you should keep this csv file handy for future use.
The define_keywords() function makes it easy to change just one or a few terms. For instance, you might prefer using “N” or “Count” rather than “Freq” in the title row of freq() tables. Or you might want to generate a document which uses the tables’ titles as section headings.
For this, call define_keywords() and feed it the term(s) you wish to modify (which can themselves be stored in predefined variables). Here, the terms we need to change are title.freq and freq:
section_title <- "**Species of Iris**"
define_keywords(title.freq = section_title,
                freq = "N")
freq(iris$Species)
iris$Species
Type: Facteur
N | % Valide | % Valide cum. | % Total | % Total cum. | |
---|---|---|---|---|---|
setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
<NA> | 0 | 0.00 | 100.00 | ||
Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
Calling define_keywords() without any arguments will bring up, on systems that support graphical devices (the vast majority), a window from which we can edit all the terms we want.
After closing the edit window, a dialogue box gives the option to save the newly created custom language to a csv file (even though we changed just a few keywords, the package considers the terms as a whole). We can later reload the custom language file into memory by calling use_custom_lang("path-to-custom-language-file.csv").
See ?define_keywords for a list of all customizable terms in the package.
To revert all changes, we can simply use st_options(lang = "en").
It is possible to further customize the headings by adding arguments to the print() function. Here, we use an empty string to override the value of Variable; this causes the second line of the heading to disappear altogether.
define_keywords(title.freq = "Types and Counts, Iris Flowers")
print(
  freq(iris$Species,
       display.type = FALSE), # Variable type won't be displayed...
  Variable = ""               # and neither will the variable name
)
N | % Valide | % Valide cum. | % Total | % Total cum. | |
---|---|---|---|---|---|
setosa | 50 | 33.33 | 33.33 | 33.33 | 33.33 |
versicolor | 50 | 33.33 | 66.67 | 33.33 | 66.67 |
virginica | 50 | 33.33 | 100.00 | 33.33 | 100.00 |
<NA> | 0 | 0.00 | 100.00 | ||
Total | 150 | 100.00 | 100.00 | 100.00 | 100.00 |
Knowing how this vignette is configured can help you get started with using summarytools in R Markdown documents.
The output element is the one that matters:
---
output:
  rmarkdown::html_vignette:
    css:
    - !expr system.file("rmarkdown/templates/html_vignette/resources/vignette.css", package = "rmarkdown")
---
```{r setup, include=FALSE}
library(knitr)
opts_chunk$set(results = 'asis',     # Can also be set at chunk level
               comment = NA,
               prompt  = FALSE,
               cache   = FALSE)
library(summarytools)
st_options(plain.ascii       = FALSE,       # Always use in Rmd documents
           style             = "rmarkdown", # Always use in Rmd documents
           subtitle.emphasis = FALSE)       # Improves layout w/ some themes
```
The needed CSS is automatically added to html files created using print() or view() with the file argument. But in R Markdown documents, this needs to be done explicitly in a setup chunk just after the YAML header (or following a first setup chunk specifying knitr and summarytools options):
```{r, echo=FALSE}
st_css(main = TRUE, global = TRUE)
```
The package comes with no guarantees. It is a work in progress and feedback is always welcome. Please open an issue on GitHub if you find a bug or wish to submit a feature request.
Check out the GitHub project’s page; from there you can see the latest updates and also submit feature requests.
For a preview of what’s coming in the next release, have a look at the development branch.