1. Overview

summarytools provides a coherent set of functions centered on data exploration and simple reporting. At its core reside the following four functions:

Function	Description
`freq()`	Frequency Tables featuring counts, proportions, cumulative statistics as well as missing data reporting
`ctable()`	Cross-Tabulations (joint frequencies) between pairs of discrete/categorical variables, featuring marginal sums as well as row, column or total proportions
`descr()`	Univariate (‘Descriptive’) Statistics for numerical data, with common measures of central tendency and dispersion
`dfSummary()`	Data Frame Summaries featuring type-specific information for all variables: univariate statistics and/or frequency distributions, bar charts or histograms, as well as missing data counts and proportions. Very useful to quickly detect anomalies and identify trends at a glance

1.1 Motivation

The package was developed with the following objectives in mind:

Provide a coherent set of easy-to-use descriptive functions that are akin to those included in commercial statistical software suites such as SAS, SPSS, and Stata
Offer flexibility in terms of output format & content
Integrate well with commonly used software & tools for reporting (the RStudio IDE, Rmarkdown, and knitr) while also allowing for standalone, simple report generation using any R interface

1.2 Directing Output

Results can be

Displayed in the R console as plain text
Rendered as html and shown in a Web browser or in RStudio’s Viewer Pane
Written / appended to plain text, markdown, or html files

When creating R Markdown documents, make sure to

Use chunk option results = "asis"
Use the function argument plain.ascii = FALSE
Set the style parameter to “rmarkdown”, or “grid” for dfSummary()
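
These three settings are usually defined once, near the top of the document. The following is a minimal sketch of such a setup chunk, assuming we want them applied to every chunk rather than repeated in each call; it relies on knitr's opts_chunk$set() and on the st_options() entries listed in section 10:

```r
library(summarytools)
library(knitr)

# Emit summarytools output as raw markdown in every chunk
opts_chunk$set(results = "asis")

# Global defaults suited to R Markdown documents;
# dfSummary() has its own style option, set here to "grid"
st_options(plain.ascii     = FALSE,
           style           = "rmarkdown",
           dfSummary.style = "grid")
```

With these defaults in place, the plain.ascii and style arguments no longer need to appear in individual function calls.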

1.3 Other Characteristics

Weights-enabled: freq(), ctable() and descr() support sampling weights
Multilingual:
- Built-in translations exist for French, Portuguese, Spanish, Russian, and Turkish. Users can easily add custom translations or modify existing ones as needed
Flexible and extensible:
- The built-in features used to support alternate languages provide a way to modify a great number of terms used in outputs (headings and tables)
- Pipe operators from magrittr (%>%, %$%) and pipeR (%>>%) are fully supported; the native |> introduced in R 4.1 is supported as well and is in most cases a safer bet as far as looking up objects’ names goes.
- Default values for a good number of function parameters can be modified using st_options() to minimize redundancy in function calls
- By-group processing is easily achieved using the package’s stby() function which is a slightly modified version of base::by(), but dplyr::group_by() is also supported
- Pander options can be used to customize or enhance plain text and markdown tables
- Base R’s format() parameters are also supported; this can be used to set thousands separator or modify the decimal separator, among other possibilities (see help("format"))
- Bootstrap CSS is used by default with html output, and user-defined classes can be added at will
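
As an illustration of the weights support mentioned in the first item, here is a short sketch using the tobacco data frame shipped with the package, whose samp.wgts column holds sampling weights. Results are not shown, and some statistics may not be available when weights are used:

```r
library(summarytools)

# Weighted frequency table: counts are sums of weights rather than row counts
freq(tobacco$smoker, weights = tobacco$samp.wgts)

# descr() accepts sampling weights through the same argument
descr(tobacco$BMI, weights = tobacco$samp.wgts)
```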

2. Frequency Tables: freq()

The freq() function generates frequency tables with counts, proportions, as well as missing data information. Side note: the very idea for creating this package stemmed from the absence of such a function in base R.

freq(iris$Species, plain.ascii = FALSE, style = "rmarkdown")

Frequencies

iris$Species
Type: Factor

	Freq	% Valid	% Valid Cum.	% Total	% Total Cum.
setosa	50	33.33	33.33	33.33	33.33
versicolor	50	33.33	66.67	33.33	66.67
virginica	50	33.33	100.00	33.33	100.00
<NA>	0			0.00	100.00
Total	150	100.00	100.00	100.00	100.00

In this first example, the plain.ascii and style arguments were specified. However, since we have defined them globally for this document using st_options(), they are redundant and will be omitted from here on (section 16 contains a detailed description of this vignette’s configuration).

2.1 Missing Data

One of summarytools’ main purposes is to help clean and prepare data for further analysis. But in some circumstances, we don’t need (or already have) information about missing data. Using report.nas = FALSE makes the output table smaller by one row and two columns:

freq(iris$Species, report.nas = FALSE, headings = FALSE)

	Freq	%	% Cum.
setosa	50	33.33	33.33
versicolor	50	33.33	66.67
virginica	50	33.33	100.00
Total	150	100.00	100.00

The headings = FALSE parameter suppresses the heading section.

2.2 Simplest Expression

By “switching off” all optional elements, a much simpler table will be produced:

freq(iris$Species, 
     report.nas = FALSE, 
     totals     = FALSE, 
     cumul      = FALSE, 
     headings   = FALSE)

	Freq	%
setosa	50	33.33
versicolor	50	33.33
virginica	50	33.33

While the output is much simplified, the syntax is not; I blame it on Tesler’s law of conservation of complexity! Thankfully, st_options() is there to accommodate everyone’s preferences (see section on package options).

2.3 Multiple Frequency Tables At Once

To generate frequency tables for all variables in a data frame, we could (and in the earliest versions, needed to) use lapply(). However, this is not required since freq() accepts data frames as the main argument:

freq(tobacco)

To avoid cluttering the results, numerical columns having more than 25 distinct values are ignored. This threshold of 25 can be changed by using st_options(); for example, to change it to 10, we’d use st_options(freq.ignore.threshold = 10).

The tobacco data frame contains simulated data and is included in the package. Another simulated data frame is included: exams. Both have French versions (tabagisme, examens).

2.4 Subsetting (Filtering) Frequency Tables

The rows parameter allows subsetting frequency tables; we can use this parameter in different ways:

To filter rows by their order of appearance, we use a numerical vector; rows = 1:10 will show the frequencies for the first 10 values only. To account for the frequencies of unshown values, the “(Other)” row is automatically added
To filter rows by name, we can use either
- a character vector specifying all the row names we wish to keep
- a single character string, which will be used as a regular expression (see ?regex for more information on this topic)
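
The two name-based forms can be sketched as follows, using the disease column of the tobacco data frame. The row names and the pattern are chosen only for illustration, and results are not shown:

```r
library(summarytools)

# Character vector of length > 1: keep the rows with exactly these names
freq(tobacco$disease,
     rows     = c("Cancer", "Hypertension"),
     headings = FALSE)

# Single character string: treated as a regular expression;
# here, keep the rows whose name starts with "H"
freq(tobacco$disease,
     rows     = "^H",
     headings = FALSE)
```

Note that a single string is always interpreted as a pattern, so special regex characters in a row name would need to be escaped.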

Showing The Most Common Values

By combining the order and rows parameters, we can easily filter the results to show, for example, the 5 most common values in a factor:

freq(tobacco$disease, 
     order    = "freq",
     rows     = 1:5,
     headings = FALSE)

	Freq	% Valid	% Valid Cum.	% Total	% Total Cum.
Hypertension	36	16.22	16.22	3.60	3.60
Cancer	34	15.32	31.53	3.40	7.00
Cholesterol	21	9.46	40.99	2.10	9.10
Heart	20	9.01	50.00	2.00	11.10
Pulmonary	20	9.01	59.01	2.00	13.10
(Other)	91	40.99	100.00	9.10	22.20
<NA>	778			77.80	100.00
Total	1000	100.00	100.00	100.00	100.00

Instead of "freq", we can use "-freq" to reverse the ordering and get results ranked from lowest to highest in frequency.

Notice the “(Other)” row, which is automatically generated.

2.5 Collapsible Sections

When generating html results, use the collapse = TRUE argument with print() or view() / stview() to get collapsible sections; clicking on the variable name in the heading section will collapse / reveal the frequency table (results not shown).

view(freq(tobacco), collapse = TRUE)

3. Cross-Tabulations: ctable()

ctable() generates cross-tabulations (joint frequencies) for pairs of categorical variables.

Using the tobacco simulated data frame, we’ll cross-tabulate the two categorical variables smoker and diseased.

ctable(x = tobacco$smoker, 
       y = tobacco$diseased, 
       prop = "r")   # Show row proportions

Cross-Tabulation, Row Proportions

smoker * diseased
Data Frame: tobacco

	diseased	Yes	No	Total
smoker
Yes		125 (41.9%)	173 (58.1%)	298 (100.0%)
No		99 (14.1%)	603 (85.9%)	702 (100.0%)
Total		224 (22.4%)	776 (77.6%)	1000 (100.0%)

As can be seen, since markdown does not fully support multiline table headings, pander does what it can to display this particular type of table. To get better results, the “render” method is recommended and will be used in the next examples.

3.1 Row, Column, or Total Proportions

Row proportions are shown by default. To display column or total proportions, use prop = "c" or prop = "t", respectively. To omit proportions altogether, use prop = "n".
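
For reference, the three alternatives to the default can be written as below (results not shown); the calls reuse the smoker and diseased variables from the previous example:

```r
library(summarytools)

with(tobacco, ctable(x = smoker, y = diseased, prop = "c"))  # column proportions
with(tobacco, ctable(x = smoker, y = diseased, prop = "t"))  # total proportions
with(tobacco, ctable(x = smoker, y = diseased, prop = "n"))  # counts only
```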

3.2 Minimal Cross-Tabulations

By “switching off” all optional features, we get a simple “2 x 2” table:

with(tobacco, 
     print(ctable(x = smoker, 
                  y = diseased, 
                  prop     = 'n',
                  totals   = FALSE, 
                  headings = FALSE),
           method = "render")
)

	diseased
smoker	Yes	No
Yes	125	173
No	99	603

3.3 Chi-Square (𝛘²), Odds Ratio and Risk Ratio

To display the chi-square statistic, set chisq = TRUE. For 2 x 2 tables, use OR and RR to show odds ratio and risk ratio (also called relative risk), respectively. Those can be set to TRUE, in which case 95% confidence intervals are shown; to use different confidence levels, use for example OR = .90.

Using pipes generally makes it easier to generate ctable() results.

library(magrittr)
tobacco %$%  # Acts like with(tobacco, ...)
  ctable(x = smoker, y = diseased,
         chisq = TRUE,
         OR    = TRUE,
         RR    = TRUE,
         headings = FALSE) |>
  print(method = "render")

	diseased
smoker	Yes	No	Total
Yes	125 (41.9%)	173 (58.1%)	298 (100.0%)
No	99 (14.1%)	603 (85.9%)	702 (100.0%)
Total	224 (22.4%)	776 (77.6%)	1000 (100.0%)
Χ² = 91.7100 df = 1 p = .0000 O.R. (95% C.I.) = 4.40 (3.22 - 6.02) R.R. (95% C.I.) = 2.97 (2.37 - 3.73)
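
To request a confidence level other than 95%, the OR and RR arguments take the level itself instead of TRUE, as described at the start of this section. A brief sketch with 90% intervals (results not shown):

```r
library(summarytools)

with(tobacco,
     ctable(x = smoker, y = diseased,
            OR       = .90,    # odds ratio with a 90% confidence interval
            RR       = .90,    # risk ratio with a 90% confidence interval
            headings = FALSE))
```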

4. Descriptive (Univariate) Statistics: descr()

descr() generates descriptive / univariate statistics, i.e. common central tendency statistics and measures of dispersion. It accepts single vectors as well as data frames; in the latter case, all non-numerical columns are ignored, with a message to that effect.

descr(iris)

Non-numerical variable(s) ignored: Species

Descriptive Statistics

iris
N: 150

	Petal.Length	Petal.Width	Sepal.Length	Sepal.Width
Mean	3.76	1.20	5.84	3.06
Std.Dev	1.77	0.76	0.83	0.44
Min	1.00	0.10	4.30	2.00
Q1	1.60	0.30	5.10	2.80
Median	4.35	1.30	5.80	3.00
Q3	5.10	1.80	6.40	3.30
Max	6.90	2.50	7.90	4.40
MAD	1.85	1.04	1.04	0.44
IQR	3.50	1.50	1.30	0.50
CV	0.47	0.64	0.14	0.14
Skewness	-0.27	-0.10	0.31	0.31
SE.Skewness	0.20	0.20	0.20	0.20
Kurtosis	-1.42	-1.36	-0.61	0.14
N.Valid	150.00	150.00	150.00	150.00
N	150.00	150.00	150.00	150.00
Pct.Valid	100.00	100.00	100.00	100.00

To turn off the variable-type messages, use silent = TRUE. It is possible to set that option globally, which we will do here, so the message won’t be displayed in the remainder of this vignette.

st_options(descr.silent = TRUE)

4.1 Transposing and Selecting Statistics

Results can be transposed by using transpose = TRUE, and statistics can be selected using the stats argument:

descr(iris,
      stats     = c("mean", "sd"),
      transpose = TRUE,
      headings  = FALSE)

	Mean	Std.Dev
Petal.Length	3.76	1.77
Petal.Width	1.20	0.76
Sepal.Length	5.84	0.83
Sepal.Width	3.06	0.44

See ?descr for a list of all available statistics. Special values “all”, “fivenum”, and “common” are also valid. The default value is “all”, and it can be modified using st_options():

st_options(descr.stats = "common")

5. Data Frame Summaries: dfSummary()

dfSummary() creates a summary table with statistics, frequencies and graphs for all variables in a data frame. The information displayed is type-specific (character, factor, numeric, date) and also varies according to the number of distinct values.

To see the results in RStudio’s Viewer (or in the default Web browser if working in another IDE or from a terminal window), use the view() function, or its twin stview() in case of name conflicts:

view(dfSummary(iris))

Be careful to use view() (or stview()) and not View() with a capital “V”. Otherwise, results will be shown in the data viewer.

Also, be mindful of the order in which the packages are loaded. Some packages redefine view() to point to View(); loading summarytools after these packages will ensure its own view() function works properly. Otherwise, stview() is always there as a foolproof alternative.
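
When the load order is hard to control, two spellings sidestep the conflict. This is only a sketch: both calls open the summary in the Viewer or browser, so they are meant for interactive use:

```r
library(summarytools)

# stview() is the twin of view(), under a name that other
# packages are unlikely to redefine
stview(dfSummary(iris))

# A namespace-qualified call does not depend on package load order
summarytools::view(dfSummary(iris))
```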

5.1 Using dfSummary() in R Markdown Documents

When using dfSummary() in R Markdown documents, it is generally a good idea to exclude a column or two to avoid margin overflow. Since the Valid and Missing columns are complementary (and therefore redundant), we can safely drop either one of them.

dfSummary(tobacco, 
          plain.ascii  = FALSE, 
          style        = "grid", 
          graph.magnif = 0.75, 
          varnumbers   = FALSE,
          valid.col    = FALSE,
          tmp.img.dir  = "/tmp")

Data Frame Summary

tobacco
Dimensions: 1000 x 9
Duplicates: 2

Variable	Stats / Values	Freqs (% of Valid)	Missing
gender [factor]	1. F 2. M	489 (50.0%) 489 (50.0%)	22 (2.2%)
age [numeric]	Mean (sd) : 49.6 (18.3) min < med < max: 18 < 50 < 80 IQR (CV) : 32 (0.4)	63 distinct values	25 (2.5%)
age.gr [factor]	1. 18-34 2. 35-50 3. 51-70 4. 71 +	258 (26.5%) 241 (24.7%) 317 (32.5%) 159 (16.3%)	25 (2.5%)
BMI [numeric]	Mean (sd) : 25.7 (4.5) min < med < max: 8.8 < 25.6 < 39.4 IQR (CV) : 5.7 (0.2)	974 distinct values	26 (2.6%)
smoker [factor]	1. Yes 2. No	298 (29.8%) 702 (70.2%)	0 (0.0%)
cigs.per.day [numeric]	Mean (sd) : 6.8 (11.9) min < med < max: 0 < 0 < 40 IQR (CV) : 11 (1.8)	37 distinct values	35 (3.5%)
diseased [factor]	1. Yes 2. No	224 (22.4%) 776 (77.6%)	0 (0.0%)
disease [character]	1. Hypertension 2. Cancer 3. Cholesterol 4. Heart 5. Pulmonary 6. Musculoskeletal 7. Diabetes 8. Hearing 9. Digestive 10. Hypotension [ 3 others ]	36 (16.2%) 34 (15.3%) 21 ( 9.5%) 20 ( 9.0%) 20 ( 9.0%) 19 ( 8.6%) 14 ( 6.3%) 14 ( 6.3%) 12 ( 5.4%) 11 ( 5.0%) 21 ( 9.5%)	778 (77.8%)
samp.wgts [numeric]	Mean (sd) : 1 (0.1) min < med < max: 0.9 < 1 < 1.1 IQR (CV) : 0.2 (0.1)	0.86!: 267 (26.7%) 1.04!: 249 (24.9%) 1.05!: 324 (32.4%) 1.06!: 160 (16.0%) ! rounded	0 (0.0%)

The tmp.img.dir parameter is mandatory when generating dfSummaries in R Markdown documents, except for html rendering. The explanation for this can be found further below.

Some users reported repeated X11 warnings; those can be avoided by setting the warning chunk option to FALSE: {r chunk_name, results="asis", warning=FALSE}.

5.2 Optional Statistics

Introduced in version 1.0.0 in response to feature requests, a mechanism provides control over which statistics to show in the Stats/Values column. The third row, which displays IQR (CV), can be modified to show any statistic available in R. An additional “slot” (unused by default) is also made available. To use this feature, define dfSummary.custom.1 and/or dfSummary.custom.2 using st_options() in the following way, encapsulating the code in an expression():

st_options(
  dfSummary.custom.1 = 
    expression(
      paste(
        "Q1 - Q3 :",
        round(
          quantile(column_data, probs = .25, type = 2, 
                   names = FALSE, na.rm = TRUE), digits = 1
        ), " - ",
        round(
          quantile(column_data, probs = .75, type = 2, 
                   names = FALSE, na.rm = TRUE), digits = 1
        )
      )
    )
)

print(
  dfSummary(iris, 
            varnumbers   = FALSE,
            na.col       = FALSE,
            style        = "multiline",
            plain.ascii  = FALSE,
            headings     = FALSE,
            graph.magnif = .8),
  method = "render"
)

Variable	Stats / Values	Freqs (% of Valid)	Graph	Valid
Sepal.Length [numeric]	Mean (sd) : 5.8 (0.8) min ≤ med ≤ max: 4.3 ≤ 5.8 ≤ 7.9 Q1 - Q3 : 5.1 - 6.4	35 distinct values		150 (100.0%)
Sepal.Width [numeric]	Mean (sd) : 3.1 (0.4) min ≤ med ≤ max: 2 ≤ 3 ≤ 4.4 Q1 - Q3 : 2.8 - 3.3	23 distinct values		150 (100.0%)
Petal.Length [numeric]	Mean (sd) : 3.8 (1.8) min ≤ med ≤ max: 1 ≤ 4.3 ≤ 6.9 Q1 - Q3 : 1.6 - 5.1	43 distinct values		150 (100.0%)
Petal.Width [numeric]	Mean (sd) : 1.2 (0.8) min ≤ med ≤ max: 0.1 ≤ 1.3 ≤ 2.5 Q1 - Q3 : 0.3 - 1.8	22 distinct values		150 (100.0%)
Species [factor]	1. setosa 2. versicolor 3. virginica	50 (33.3%) 50 (33.3%) 50 (33.3%)		150 (100.0%)

If we had used dfSummary.custom.2 instead of dfSummary.custom.1, a fourth line would have been added under IQR (CV).

Note that instead of round(), it is possible to use the internal function format_number(), which will ensure correct formatting of numbers according to all specified arguments (rounding digits, decimal mark, big/small mark, etc.). The variable round.digits, which contains the value of st_options("round.digits") can also be used.
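
As a sketch of that alternative, the quartile range can be written with format_number() and round.digits and placed in the second slot, so that it appears in addition to IQR (CV). The two-argument call mirrors the one used in the package's default expression; since format_number() is internal, this assumes the expression is evaluated where that function is visible, which is the case inside dfSummary():

```r
library(summarytools)

st_options(
  dfSummary.custom.2 =
    expression(
      paste(
        "Q1 - Q3 :",
        # format_number() applies the rounding and formatting settings in effect
        format_number(
          quantile(column_data, probs = .25, type = 2,
                   names = FALSE, na.rm = TRUE),
          round.digits
        ),
        "-",
        format_number(
          quantile(column_data, probs = .75, type = 2,
                   names = FALSE, na.rm = TRUE),
          round.digits
        )
      )
    )
)
```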

This is how the default IQR (CV) is defined – here we set the first custom stat back to its default value and then display its definition (formatR::tidy_source() is used to format the expression):

library(formatR)
st_options(dfSummary.custom.1 = "default")
formatR::tidy_source(
  text   = deparse(st_options("dfSummary.custom.1")),
  indent = 2,
  args.newline = TRUE
)

expression(
  paste(
    paste0(
      trs("iqr"),
      " (", trs("cv"),
      ") : "
    ),
    format_number(
      IQR(column_data, na.rm = TRUE),
      round.digits
    ),
    " (", format_number(
      sd(column_data, na.rm = TRUE)/mean(column_data, na.rm = TRUE),
      round.digits
    ),
    ")", collapse = "", sep = ""
  )
)

Don’t forget to specify na.rm = TRUE for all functions that use this parameter (most base R summary functions, such as mean() and sd(), do).

5.3 Other Notable Features

The dfSummary() function also

Reports the number of duplicate records in the heading section
Detects UPC/EAN codes (barcode numbers) and doesn’t calculate irrelevant statistics for them
Detects email addresses and reports counts of valid, invalid and duplicate addresses; note that the proportions of valid and invalid sum up to 100%; the duplicates proportion is calculated independently, which is why in the bar chart (html version), the bar for this category is shown with a different color
Allows the display of “windowed” results by using the max.tbl.height parameter; this is especially convenient if the analyzed data frame has numerous variables; see vignette("rmarkdown", package = "summarytools") for more details
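
The “windowed” display from the last item could be requested as below. This is a sketch that assumes max.tbl.height is passed to print() together with method = "render" and is expressed in pixels; the value 300 is arbitrary:

```r
library(summarytools)

print(dfSummary(tobacco,
                varnumbers = FALSE,
                valid.col  = FALSE),
      method         = "render",
      max.tbl.height = 300)   # taller tables are shown in a scrollable window
```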

5.4 Excluding Columns

Although most columns can be excluded using the function’s parameters, it is also possible to delete them with the following syntax (results not shown):

dfs <- dfSummary(iris)
dfs$Variable <- NULL # This deletes the Variable column

6. Grouped Statistics: stby()

To produce optimal results, summarytools has its own version of the base by() function. It’s called stby(), and we use it exactly as we would by():

(iris_stats_by_species <- stby(data      = iris, 
                               INDICES   = iris$Species, 
                               FUN       = descr, 
                               stats     = "common", 
                               transpose = TRUE))

Descriptive Statistics

iris
Group: Species = setosa
N: 50

	Mean	Std.Dev	Min	Median	Max	N.Valid	N	Pct.Valid
Petal.Length	1.46	0.17	1.00	1.50	1.90	50.00	50.00	100.00
Petal.Width	0.25	0.11	0.10	0.20	0.60	50.00	50.00	100.00
Sepal.Length	5.01	0.35	4.30	5.00	5.80	50.00	50.00	100.00
Sepal.Width	3.43	0.38	2.30	3.40	4.40	50.00	50.00	100.00

Group: Species = versicolor
N: 50

	Mean	Std.Dev	Min	Median	Max	N.Valid	N	Pct.Valid
Petal.Length	4.26	0.47	3.00	4.35	5.10	50.00	50.00	100.00
Petal.Width	1.33	0.20	1.00	1.30	1.80	50.00	50.00	100.00
Sepal.Length	5.94	0.52	4.90	5.90	7.00	50.00	50.00	100.00
Sepal.Width	2.77	0.31	2.00	2.80	3.40	50.00	50.00	100.00

Group: Species = virginica
N: 50

	Mean	Std.Dev	Min	Median	Max	N.Valid	N	Pct.Valid
Petal.Length	5.55	0.55	4.50	5.55	6.90	50.00	50.00	100.00
Petal.Width	2.03	0.27	1.40	2.00	2.50	50.00	50.00	100.00
Sepal.Length	6.59	0.64	4.90	6.50	7.90	50.00	50.00	100.00
Sepal.Width	2.97	0.32	2.20	3.00	3.80	50.00	50.00	100.00

6.1 Special Case of descr() with stby()

When used to produce split-group statistics for a single variable, stby() assembles everything into a single table instead of displaying a series of one-column tables.

with(tobacco, 
     stby(data    = BMI, 
          INDICES = age.gr, 
          FUN     = descr,
          stats   = c("mean", "sd", "min", "med", "max"))
)

NA detected in grouping variable(s); consider using useNA = TRUE

Descriptive Statistics

BMI by age.gr
Data Frame: tobacco
N: 975

	18-34	35-50	51-70	71 +
Mean	23.84	25.11	26.91	27.45
Std.Dev	4.23	4.34	4.26	4.37
Min	8.83	10.35	9.01	16.36
Median	24.04	25.11	26.77	27.52
Max	34.84	39.44	39.21	38.37
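
The message at the top of this output points to the useNA argument. A sketch of the same call with missing age groups kept as a group of their own follows, assuming useNA is an argument of stby() itself, as the message suggests (results not shown):

```r
library(summarytools)

with(tobacco,
     stby(data    = BMI,
          INDICES = age.gr,
          FUN     = descr,
          stats   = c("mean", "sd", "min", "med", "max"),
          useNA   = TRUE)   # rows where age.gr is NA form an additional group
)
```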

6.2 Using stby() with ctable()

The syntax is a little trickier for this combination, so here is an example (results not shown):

stby(data    = list(x = tobacco$smoker, y = tobacco$diseased), 
     INDICES = tobacco$gender, 
     FUN     = ctable)

# or equivalently
with(tobacco, 
     stby(data    = list(x = smoker, y = diseased), 
          INDICES = gender, 
          FUN     = ctable))

7. Grouped Statistics: group_by()

To create grouped statistics with freq(), descr() or dfSummary(), it is possible to use dplyr’s group_by() as an alternative to stby(). Syntactic differences aside, one key distinction is that group_by() considers NA values on the grouping variable(s) as a valid category, albeit with a warning suggesting the use of forcats::fct_na_value_to_level to make NA’s explicit in factors. Following this advice, we get:

library(dplyr)
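# %<>% is magrittr's assignment pipe (magrittr was loaded earlier);
# the "(Missing)" label is set explicitly so that it matches the output below
tobacco$gender %<>% forcats::fct_na_value_to_level(level = "(Missing)")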
tobacco |> 
  group_by(gender) |> 
  descr(stats = "fivenum")

Descriptive Statistics

tobacco
Group: gender = F
N: 489

	BMI	age	cigs.per.day	samp.wgts
Min	9.01	18.00	0.00	0.86
Q1	22.98	34.00	0.00	0.86
Median	25.87	50.00	0.00	1.04
Q3	29.48	66.00	10.50	1.05
Max	39.44	80.00	40.00	1.06

Group: gender = M
N: 489

	BMI	age	cigs.per.day	samp.wgts
Min	8.83	18.00	0.00	0.86
Q1	22.52	34.00	0.00	0.86
Median	25.14	49.50	0.00	1.04
Q3	27.96	66.00	11.00	1.05
Max	36.76	80.00	40.00	1.06

Group: gender = (Missing)
N: 22

	BMI	age	cigs.per.day	samp.wgts
Min	20.24	19.00	0.00	0.86
Q1	24.97	36.00	0.00	1.04
Median	27.16	55.50	0.00	1.05
Q3	30.23	64.00	10.00	1.05
Max	32.43	80.00	28.00	1.06

8. Tidy Tables : tb()

When generating freq() or descr() tables, it is possible to turn the results into “tidy” tables with the use of the tb() function (think of tb as a diminutive for tibble). For example:

library(magrittr)
iris |>
  descr(stats = "common") |>
  tb() |>
  knitr::kable()

variable	mean	sd	min	med	max	n.valid	n	pct.valid
Petal.Length	3.758000	1.7652982	1.0	4.35	6.9	150	150	100
Petal.Width	1.199333	0.7622377	0.1	1.30	2.5	150	150	100
Sepal.Length	5.843333	0.8280661	4.3	5.80	7.9	150	150	100
Sepal.Width	3.057333	0.4358663	2.0	3.00	4.4	150	150	100

iris$Species |> 
  freq(cumul = FALSE, report.nas = FALSE) |> 
  tb() |>
  knitr::kable()

Species	freq	pct
setosa	50	33.33333
versicolor	50	33.33333
virginica	50	33.33333

By definition, no total rows are part of tidy tables, and the row names are turned into a regular column.

When displaying tibbles using rmarkdown, the knitr chunk option results should be set to ‘markup’ instead of ‘asis’.

Not all tables generated by tb() are strictly speaking tidy; you can choose for instance to not recalculate proportions/valid proportions of grouped freq outputs, using recalculate = FALSE (TRUE by default).

8.1 Tidy Split-Group Statistics

Here are some examples showing how lists created using stby() or group_by() can be transformed into tidy tibbles.

grouped_descr <- stby(data    = exams,
                      INDICES = exams$gender, 
                      FUN     = descr,
                      stats   = "common")

grouped_descr |> tb()

# A tibble: 12 × 10
   gender variable   mean    sd   min   med   max n.valid     n pct.valid
   <fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>     <dbl>
 1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14    15      93.3
 2 Girl   english    73.9  9.41  58.3  71.8  93.1      14    15      93.3
 3 Girl   french     71.1 12.4   44.8  68.4  93.7      14    15      93.3
 4 Girl   geography  67.3  8.26  50.4  67.3  78.9      15    15     100  
 5 Girl   history    71.2  9.17  53.9  72.9  86.4      15    15     100  
 6 Girl   math       73.8  9.03  55.6  74.8  86.3      14    15      93.3
 7 Boy    economics  75.2  9.40  60.5  71.7  94.2      15    15     100  
 8 Boy    english    77.8  5.94  69.6  77.6  90.2      15    15     100  
 9 Boy    french     76.6  8.63  63.2  74.8  94.7      15    15     100  
10 Boy    geography  73   12.4   47.2  71.2  96.3      14    15      93.3
11 Boy    history    74.4 11.2   54.4  72.6  93.5      15    15     100  
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14    15      93.3

The order parameter controls row ordering:

grouped_descr |> tb(order = 2)

# A tibble: 12 × 10
   gender variable   mean    sd   min   med   max n.valid     n pct.valid
   <fct>  <chr>     <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>     <dbl>
 1 Girl   economics  72.5  7.79  62.3  70.2  89.6      14    15      93.3
 2 Boy    economics  75.2  9.40  60.5  71.7  94.2      15    15     100  
 3 Girl   english    73.9  9.41  58.3  71.8  93.1      14    15      93.3
 4 Boy    english    77.8  5.94  69.6  77.6  90.2      15    15     100  
 5 Girl   french     71.1 12.4   44.8  68.4  93.7      14    15      93.3
 6 Boy    french     76.6  8.63  63.2  74.8  94.7      15    15     100  
 7 Girl   geography  67.3  8.26  50.4  67.3  78.9      15    15     100  
 8 Boy    geography  73   12.4   47.2  71.2  96.3      14    15      93.3
 9 Girl   history    71.2  9.17  53.9  72.9  86.4      15    15     100  
10 Boy    history    74.4 11.2   54.4  72.6  93.5      15    15     100  
11 Girl   math       73.8  9.03  55.6  74.8  86.3      14    15      93.3
12 Boy    math       73.3  9.68  60.5  72.2  93.2      14    15      93.3

Setting order = 3 changes the order of the sort variables exactly as with order = 2, but it also reorders the columns:

grouped_descr |> tb(order = 3)

# A tibble: 12 × 10
   variable  gender  mean    sd   min   med   max n.valid     n pct.valid
   <chr>     <fct>  <dbl> <dbl> <dbl> <dbl> <dbl>   <dbl> <dbl>     <dbl>
 1 economics Girl    72.5  7.79  62.3  70.2  89.6      14    15      93.3
 2 economics Boy     75.2  9.40  60.5  71.7  94.2      15    15     100  
 3 english   Girl    73.9  9.41  58.3  71.8  93.1      14    15      93.3
 4 english   Boy     77.8  5.94  69.6  77.6  90.2      15    15     100  
 5 french    Girl    71.1 12.4   44.8  68.4  93.7      14    15      93.3
 6 french    Boy     76.6  8.63  63.2  74.8  94.7      15    15     100  
 7 geography Girl    67.3  8.26  50.4  67.3  78.9      15    15     100  
 8 geography Boy     73   12.4   47.2  71.2  96.3      14    15      93.3
 9 history   Girl    71.2  9.17  53.9  72.9  86.4      15    15     100  
10 history   Boy     74.4 11.2   54.4  72.6  93.5      15    15     100  
11 math      Girl    73.8  9.03  55.6  74.8  86.3      14    15      93.3
12 math      Boy     73.3  9.68  60.5  72.2  93.2      14    15      93.3

Note that percentages will be recalculated unless tb()’s recalculate argument is set to FALSE, in which case the results won’t comply with the tidy principles.

tobacco |> dplyr::group_by(gender) |> freq(smoker) |> tb()

# A tibble: 9 × 7
  gender    smoker  freq pct_valid pct_valid_cum pct_tot pct_tot_cum
  <fct>     <fct>  <dbl>     <dbl>         <dbl>   <dbl>       <dbl>
1 F         Yes      147      14.7          14.7    14.7        14.7
2 F         No       342      34.2          48.9    34.2        48.9
3 F         <NA>       0      NA            NA       0          48.9
4 M         Yes      143      14.3          63.2    14.3        63.2
5 M         No       346      34.6          97.8    34.6        97.8
6 M         <NA>       0      NA            NA       0          97.8
7 (Missing) Yes        8       0.8          98.6     0.8        98.6
8 (Missing) No        14       1.4         100       1.4       100  
9 (Missing) <NA>       0      NA            NA       0         100
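
For comparison, the same pipeline can be run with recalculation switched off, in which case the percentages computed within each group are kept as they are. A sketch (results not shown):

```r
library(summarytools)

tobacco |>
  dplyr::group_by(gender) |>
  freq(smoker) |>
  tb(recalculate = FALSE)   # keep the within-group percentages from freq()
```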

For more details, see ?tb.

8.2 A Bridge to Other Packages

summarytools objects are not always compatible with packages focused on table formatting, such as formattable or kableExtra. However, tb() can be used as a “bridge”, an intermediary step turning freq() and descr() objects into simple tables that any package can work with. Here is an example using kableExtra:

library(kableExtra)
library(magrittr)
stby(data    = iris, 
     INDICES = iris$Species, 
     FUN     = descr, 
     stats   = "fivenum") |>
  tb(order = 3) |>
  kable(format = "html", digits = 2) |>
  collapse_rows(columns = 1, valign = "top")

variable	Species	min	q1	med	q3	max
Petal.Length	setosa	1.0	1.4	1.50	1.6	1.9
	versicolor	3.0	4.0	4.35	4.6	5.1
	virginica	4.5	5.1	5.55	5.9	6.9
Petal.Width	setosa	0.1	0.2	0.20	0.3	0.6
	versicolor	1.0	1.2	1.30	1.5	1.8
	virginica	1.4	1.8	2.00	2.3	2.5
Sepal.Length	setosa	4.3	4.8	5.00	5.2	5.8
	versicolor	4.9	5.6	5.90	6.3	7.0
	virginica	4.9	6.2	6.50	6.9	7.9
Sepal.Width	setosa	2.3	3.2	3.40	3.7	4.4
	versicolor	2.0	2.5	2.80	3.0	3.4
	virginica	2.2	2.8	3.00	3.2	3.8

9. Directing Output to Files

Using the file argument with print() or view() / stview(), we can write outputs to a file, be it html, Rmd, md, or just plain text (txt). The file extension is used to determine the type of content to write out.

view(iris_stats_by_species, file = "~/iris_stats_by_species.html")
view(iris_stats_by_species, file = "~/iris_stats_by_species.md")

A Note on PDF documents

There is no direct way to create a PDF file with summarytools. One option is to generate an html file and convert it to PDF using Pandoc or wkhtmltopdf (the latter gives better results than Pandoc with dfSummary() output).

Another option is to create an Rmd document using PDF as the output format. See vignette("rmarkdown", package = "summarytools") for details; some tweaking is necessary for the graphs to align properly.

9.1 Appending Output Files

The append argument allows adding content to existing files generated by summarytools. This is useful if we wish to include several statistical tables in a single file. It is a quick alternative to creating an Rmd document.
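
A minimal sketch of this workflow, writing two tables to one html file. The file name is arbitrary, and a temporary directory is used here only to keep the example self-contained:

```r
library(summarytools)

out <- file.path(tempdir(), "iris_tables.html")

# The first call creates (or overwrites) the file
print(freq(iris$Species), file = out)

# Later calls add their tables to the existing file
print(descr(iris), file = out, append = TRUE)
```

The same arguments can be used with view() / stview().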

10. Package Options

The following options can be set globally with st_options():

10.1 General Options

Option name	Default	Note
style ⁽¹⁾	“simple”	Set to “rmarkdown” in .Rmd documents
plain.ascii	TRUE	Set to FALSE in .Rmd documents
round.digits ⁽²⁾	2	Number of decimals to show
headings	TRUE	Formerly “omit.headings”
footnote	“default”	Customize or set to NA to omit
display.labels	TRUE	Show variable / data frame labels in headings
na.val	NULL	Value to treat as NA in factor / char variables
bootstrap.css ⁽³⁾	TRUE	Include Bootstrap 4 CSS in html output files
custom.css	NA	Path to your own CSS file
escape.pipe	FALSE	Useful for some Pandoc conversions
char.split ⁽⁴⁾	12	Threshold for line-wrapping in column headings
subtitle.emphasis	TRUE	Controls headings formatting
lang	“en”	Language (always 2-letter, lowercase)

¹ Does not apply to dfSummary(), which has its own style option (see next table)
² Does not apply to ctable(), which has its own round.digits option (see next table)
³ Set to FALSE in Shiny apps
⁴ Affects only html outputs for descr() and ctable()

10.2 Function-Specific Options

Option name	Default	Note
freq.cumul	TRUE	Display cumulative proportions in freq()
freq.totals	TRUE	Display totals row in freq()
freq.report.nas	TRUE	Display the <NA> row and “valid” columns
freq.ignore.threshold ⁽¹⁾	25	Used to determine which vars to ignore
freq.silent	FALSE	Hide console messages
ctable.prop	“r”	Display row proportions by default
ctable.totals	TRUE	Show marginal totals
ctable.round.digits	1	Number of decimals to show in `ctable()`
ctable.silent	FALSE	Hide console messages
descr.stats	“all”	“fivenum”, “common” or vector of stats
descr.transpose	FALSE	Display stats in columns instead of rows
descr.silent	FALSE	Hide console messages
dfSummary.style	“multiline”	Can be set to “grid” as an alternative
dfSummary.varnumbers	TRUE	Show variable numbers in 1st col.
dfSummary.labels.col	TRUE	Show variable labels when present
dfSummary.graph.col	TRUE	Show graphs
dfSummary.valid.col	TRUE	Include the Valid column in the output
dfSummary.na.col	TRUE	Include the Missing column in the output
dfSummary.graph.magnif	1	Zoom factor for bar plots and histograms
dfSummary.silent	FALSE	Hide console messages
tmp.img.dir ⁽²⁾	NA	Directory to store temporary images
use.x11 ⁽³⁾	TRUE	Allow creation of Base64-encoded graphs

¹ See section 2.3 for details
² Applies to dfSummary() only
³ Set to FALSE in text-only environments

Examples

st_options()                      # Display all global options values
st_options('round.digits')        # Display the value of a specific option
st_options(style = 'rmarkdown',   # Set the value of one or several options
           footnote = NA)         # Turn off the footnote for all html output


11. Format Attributes

When a summarytools object is created, its formatting attributes are stored within it. However, we can override most of them when using print() or view() / stview().

11.1 Overriding Function-Specific Arguments

The following table indicates what arguments can be used with print() or view() / stview() to override formatting attributes. Base R’s format() arguments can also be used (they are not listed here).

Argument	freq	ctable	descr	dfSummary
style	x	x	x	x
round.digits	x	x	x
plain.ascii	x	x	x	x
justify	x	x	x	x
headings	x	x	x	x
display.labels	x	x	x	x
varnumbers				x
labels.col				x
graph.col				x
valid.col				x
na.col				x
col.widths				x
totals	x	x
report.nas	x
display.type	x
missing	x
split.tables ⁽¹⁾	x	x	x	x
caption ⁽¹⁾	x	x	x	x

¹ pander options

11.2 Overriding Heading Contents

To change the information shown in the heading section, use the following arguments with print() or view():

Argument	freq	ctable	descr	dfSummary
Data.frame	x	x	x	x
Data.frame.label	x	x	x	x
Variable	x	x	x
Variable.label	x	x	x
Group	x	x	x	x
date	x	x	x	x
Weights	x		x
Data.type	x
Row.variable		x
Col.variable		x

Example

In the following example, we will create and display a freq() object, and then display it again, this time overriding three of its formatting attributes, as well as one of its heading attributes.

(age_stats <- freq(tobacco$age.gr))

Frequencies

tobacco$age.gr
Type: Factor

	Freq	% Valid	% Valid Cum.	% Total	% Total Cum.
18-34	258	26.46	26.46	25.80	25.80
35-50	241	24.72	51.18	24.10	49.90
51-70	317	32.51	83.69	31.70	81.60
71 +	159	16.31	100.00	15.90	97.50
<NA>	25			2.50	100.00
Total	1000	100.00	100.00	100.00	100.00

print(age_stats,
      report.nas     = FALSE, 
      totals         = FALSE, 
      display.type   = FALSE,
      Variable.label = "Age Group")

Frequencies

tobacco$age.gr
Label: Age Group

	Freq	%	% Cum.
18-34	258	26.46	26.46
35-50	241	24.72	51.18
51-70	317	32.51	83.69
71 +	159	16.31	100.00

11.3 Order of Priority for Parameters / Options

print() or view() parameters have precedence (overriding feature)
freq() / ctable() / descr() / dfSummary() parameters come second
Global options set with st_options() come third and act as default

The logic for the evaluation of the various parameter values can be summarized as follows:

If an argument is explicitly supplied in the function call, it will have precedence.
If both the core function and the print() / view() / stview() function are called at once and have conflicting parameter values, the latter has precedence (they always win the argument!).
If the parameter values cannot be found in the function calls, the stored defaults (which can be modified with st_options()) will be applied.
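
The three levels of priority can be illustrated with the report.nas parameter of freq(). This is only a sketch: the object name age_freq is an arbitrary choice, and the comments describe the behavior expected from the rules above rather than verified output.

```r
library(summarytools)

# Third priority: the global option acts as the default
st_options(freq.report.nas = FALSE)
freq(tobacco$age.gr)                  # <NA> row expected to be hidden

# Second priority: an argument of the core function overrides the global option
age_freq <- freq(tobacco$age.gr, report.nas = TRUE)
age_freq                              # <NA> row expected to be shown

# First priority: a print() argument overrides the stored attribute
print(age_freq, report.nas = FALSE)   # <NA> row expected to be hidden again

# Put the global option back to its default value
st_options(freq.report.nas = TRUE)
```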


12. Fine-Tuning Looks : CSS

When creating html reports, both Bootstrap’s CSS and summarytools.css are included by default. For greater control over the look of html content, it is also possible to add class definitions in a custom CSS file.

Example

We need to use a very small font size for a simple html report containing a dfSummary(). For this, we create a .css file (with the name of our choosing) which contains the following class definition:

.tiny-text {
  font-size: 8px;
}

Then we use print()’s custom.css argument to specify the location of our newly created CSS file (results not shown):

print(dfSummary(tobacco),
      custom.css    = 'path/to/custom.css', 
      table.classes = 'tiny-text',
      file          = "tiny-tobacco-dfSummary.html")


13. Shiny Apps

To successfully include summarytools functions in Shiny apps,

use html rendering
set bootstrap.css = FALSE to avoid interacting with the app’s layout
set headings = FALSE in case problems arise
adjust graph sizes with the graph.magnif parameter or with the dfSummary.graph.magnif global option
if dfSummary() tables are too wide, omit a column or two (valid.col and varnumbers, for instance)
if the results are still unsatisfactory, set column widths manually with the col.widths parameter
if col.widths or graph.magnif do not seem to work, try using them as parameters for print() rather than dfSummary()

Example (results not shown)

print(dfSummary(somedata, 
                varnumbers   = FALSE, 
                valid.col    = FALSE, 
                graph.magnif = 0.8), 
      method   = 'render',
      headings = FALSE,
      bootstrap.css = FALSE)
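
For context, here is a minimal sketch of how such a call might be wired into an app. The output id summary_table, the page title, and the use of the iris data are assumptions made for illustration; the sketch relies on shiny's htmlOutput() / renderUI() pair to display the html object returned by print() with method = 'render'.

```r
library(shiny)
library(summarytools)

# UI: a single placeholder for the rendered summary (hypothetical id)
ui <- fluidPage(
  titlePanel("Data Frame Summary"),
  htmlOutput("summary_table")
)

# Server: render the dfSummary() as html, following the advice above
server <- function(input, output, session) {
  output$summary_table <- renderUI({
    print(dfSummary(iris,
                    varnumbers   = FALSE,
                    valid.col    = FALSE,
                    graph.magnif = 0.8),
          method        = "render",
          headings      = FALSE,
          bootstrap.css = FALSE)
  })
}

# Uncomment to launch the app in an interactive session
# shinyApp(ui, server)
```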


14. Graphs in R Markdown

When using dfSummary() in an Rmd document using markdown styling (as opposed to html rendering), three elements are needed in order to display the png graphs properly:

1 - plain.ascii must be set to FALSE
2 - style must be set to “grid”
3 - tmp.img.dir must be defined and be at most 5 characters wide

Note that as of version 0.9.9, setting tmp.img.dir is no longer required when using method = "render" and can be left to NA. It is only necessary to define it when a transitory markdown table must be created, as shown below. Note how narrow the Graph column is – this is actually required, since the width of the rendered column is determined by the number of characters in the cell, rather than the width of the image itself:

+---------------+--------+----------------------+---------+
| Variable      | stats  |  Graph               | Valid   |
+===============+========+======================+=========+
| age\          |  ...   | ![](/tmp/ds0001.png) | 978\    |
| [numeric]     |  ...   |                      | (97.8%) |
+---------------+--------+----------------------+---------+

CRAN policies are really strict when it comes to writing content in the user directories, or anywhere outside R’s temporary zone (for good reasons). So users need to set this temporary location themselves.

On Mac OS and Linux, using “/tmp” makes a lot of sense: it’s a short path, and the directory is purged automatically. On Windows, there is no such convenient directory, so we need to pick one – be it absolute (“/tmp”) or relative (“img”, or simply “.”).
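
Putting the three requirements together, a chunk in an Rmd document (with results = "asis") might look like the sketch below. It assumes a Mac OS or Linux system where “/tmp” exists; the graph.magnif and valid.col values are optional choices meant to keep the table narrow, and the object name tobacco_summary is arbitrary.

```r
library(summarytools)

# Markdown (not html) styling: all three elements are supplied
tobacco_summary <- dfSummary(tobacco,
                             plain.ascii  = FALSE,   # 1 - no plain ascii
                             style        = "grid",  # 2 - grid style
                             tmp.img.dir  = "/tmp",  # 3 - short path for png files
                             graph.magnif = 0.75,    # optional: smaller graphs
                             valid.col    = FALSE)   # optional: one less column
tobacco_summary
```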


15. Languages & Term Customization

Thanks to the R community’s efforts, the following languages can be used, in addition to English (default):

French (fr)
Portuguese (pt)
Russian (ru)
Spanish (es)
Turkish (tr)

To switch languages, simply use

st_options(lang = "fr")

All output from the core functions will now use that language:

freq(iris$Species)

Tableau de fréquences

iris$Species
Type: Facteur

	Fréq.	% Valide	% Valide cum.	% Total	% Total cum.
setosa	50	33.33	33.33	33.33	33.33
versicolor	50	33.33	66.67	33.33	66.67
virginica	50	33.33	100.00	33.33	100.00
<NA>	0			0.00	100.00
Total	150	100.00	100.00	100.00	100.00

15.1 Non-UTF-8 Locales

On most Windows systems, it is necessary to change the LC_CTYPE element of the locale settings if the character set is not included in the system’s default locale. For instance, in order to get good results with the Russian language in a “latin1” environment, use the following settings:

Sys.setlocale("LC_CTYPE", "russian")
st_options(lang = 'ru')

To go back to default settings…

Sys.setlocale("LC_CTYPE", "")
st_options(lang = "en")

15.2 Defining and Using Custom Terms

Using the function use_custom_lang(), it is possible to add your own set of translations or custom terms. To achieve this, get the csv template, customize one, several, or all of the roughly 70 terms, and call use_custom_lang(), giving it as sole argument the path to the edited csv template. Note that such custom language settings will not persist across R sessions, so keep the csv file handy for future use.

15.3 Defining Only Specific Keywords

The define_keywords() function makes it easy to change just one or a few terms. For instance, you might prefer using “N” or “Count” rather than “Freq” in the title row of freq() tables. Or you might want to generate a document which uses the tables’ titles as section titles.

For this, call define_keywords() and feed it the term(s) you wish to modify (which can themselves be stored in predefined variables). Here, the terms we need to change are title.freq and freq:

section_title <- "**Species of Iris**"
define_keywords(title.freq = section_title,
                freq = "N")
freq(iris$Species)

Species of Iris

iris$Species
Type: Facteur

	N	% Valide	% Valide cum.	% Total	% Total cum.
setosa	50	33.33	33.33	33.33	33.33
versicolor	50	33.33	66.67	33.33	66.67
virginica	50	33.33	100.00	33.33	100.00
<NA>	0			0.00	100.00
Total	150	100.00	100.00	100.00	100.00

Calling define_keywords() without any arguments will bring up, on systems that support graphical devices (the vast majority, that is), a window from which you can edit all terms.

After closing the edit window, a dialogue box gives the option to save the newly created custom language to a csv file (even though we changed just a few keywords, the package considers the terms as part of a whole “language”). We can later reload into memory the custom language file by calling use_custom_lang("language-file.csv").

See ?define_keywords for a list of all customizable terms used in the package.

To revert all changes, we can simply use st_options(lang = "en").

15.4 Power-Tweaking Headings

It is possible to further customize the headings by adding arguments to the print() function. Here, we use an empty string to override the value of Variable; this causes the second line of the heading to disappear altogether.

define_keywords(title.freq = "Types and Counts, Iris Flowers")
print(
  freq(iris$Species,
       display.type = FALSE), # Variable type won't be displayed...
  Variable = ""               # and neither will the variable name
  )

Types and Counts, Iris Flowers

	N	% Valide	% Valide cum.	% Total	% Total cum.
setosa	50	33.33	33.33	33.33	33.33
versicolor	50	33.33	66.67	33.33	66.67
virginica	50	33.33	100.00	33.33	100.00
<NA>	0			0.00	100.00
Total	150	100.00	100.00	100.00	100.00


16. Vignette Setup

Knowing how this vignette is configured can help you get started with using summarytools in R Markdown documents.

16.1 The YAML Section

The output element is the one that matters:

---
output: 
 rmarkdown::html_vignette: 
   css:
   - !expr system.file("rmarkdown/templates/html_vignette/resources/vignette.css", package = "rmarkdown")
---

16.2 The Setup Chunk

```{r setup, include=FALSE} 
library(knitr)
opts_chunk$set(results = 'asis',     # Can also be set at chunk level
               comment = NA,
               prompt  = FALSE,
               cache   = FALSE)
library(summarytools)
st_options(plain.ascii = FALSE,       # Always use in Rmd documents
           style       = "rmarkdown", # Always use in Rmd documents
           subtitle.emphasis = FALSE) # Improves layout w/ some themes
```

16.3 Including summarytools’ CSS

The needed CSS is automatically added to html files created using print() or view() with the file argument. But in R Markdown documents, this needs to be done explicitly in a setup chunk just after the YAML header (or following a first setup chunk specifying knitr and summarytools options):

```{r, echo=FALSE} 
st_css(main = TRUE, global = TRUE)
```


17. Conclusion

The package comes with no guarantees. It is a work in progress, and feedback is welcome, as are PayPal donations. Please open an issue on GitHub if you find a bug or wish to submit a feature request.

Stay Up to Date

Check out the GitHub project’s page; from there you can see the latest updates and also submit feature requests.

For a preview of what’s coming in the next release, have a look at the development branch.
