Creates a table of the most frequently occurring words (unigrams) in the data. Optionally, weights can be supplied either through a `weight` column in the formatted data or through a `svydesign` object containing the raw (not yet formatted) data.

Usage

fst_freq_table(
  data,
  number = 10,
  norm = NULL,
  pos_filter = NULL,
  strict = TRUE,
  use_svydesign_weights = FALSE,
  id = "",
  svydesign = NULL,
  use_column_weights = FALSE
)

Arguments

data

A dataframe of text in CoNLL-U format, with optional additional columns.

number

The number of top words to return, default is `10`.

norm

The method for normalising the counts. Valid settings are `"number_words"` (divide by the number of words in the responses), `"number_resp"` (divide by the number of responses), or `NULL` (return raw counts). The default is `NULL`, which is also used when weights are applied.

pos_filter

UPOS tags to include, given either as a character vector or as a single comma-separated string (both forms appear in the examples). The default is `NULL`, which includes all word types.

strict

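Whether to cut the table off strictly at `number` rows, default is `TRUE`. Ties are ordered alphabetically, so with `strict = TRUE` tied words that fall after the cutoff are not shown; with `strict = FALSE`, all words occurring as often as the cutoff word are displayed.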

use_svydesign_weights

Option to weight words in the table using weights from a `svydesign` containing the raw data, default is `FALSE`.

id

ID column from raw data, required if `use_svydesign_weights = TRUE` and must match the `docid` in formatted `data`.

svydesign

A `svydesign` which contains the raw data and weights, required if `use_svydesign_weights = TRUE`.

use_column_weights

Option to weight words in the table using weights from formatted data which includes an additional `weight` column, default is `FALSE`.
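
The two `norm` settings can be pictured with a small base-R sketch. This is only an illustration on made-up tokens, not the package's implementation; in particular, the exact denominators the package uses (for example, whether filtered words are counted) are an assumption here.

```r
# Toy data: one row per word; `doc_id` identifies the response.
# The words and IDs are invented for illustration.
tokens <- data.frame(
  doc_id = c(1, 1, 1, 2, 2, 3),
  lemma = c("koira", "kissa", "koira", "koira", "lintu", "kissa"),
  stringsAsFactors = FALSE
)

# Raw counts per word (what norm = NULL is described as returning).
raw_counts <- table(tokens$lemma)

# Assumed denominators for the two normalisation settings.
n_words <- nrow(tokens)                  # total number of words
n_resp <- length(unique(tokens$doc_id))  # number of responses

per_word <- raw_counts / n_words  # analogue of norm = "number_words"
per_resp <- raw_counts / n_resp   # analogue of norm = "number_resp"

print(round(per_word, 3))
print(round(per_resp, 3))
```

Note that a per-response value can exceed 1 when a word appears more than once within a response.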

Value

A table of the most frequently occurring words in the data, with columns `words` and `occurrence`.
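
How `strict` affects the number of rows returned can be sketched in base R as follows. The helper `top_words()` is hypothetical and the counts are invented; it mimics the behaviour described for `strict`, not the package's actual code.

```r
# Toy counts; words and numbers are made up for illustration.
counts <- data.frame(
  words = c("d", "a", "c", "b", "e"),
  occurrence = c(5, 9, 5, 7, 5),
  stringsAsFactors = FALSE
)

# Sort by descending occurrence, breaking ties alphabetically.
counts <- counts[order(-counts$occurrence, counts$words), ]
rownames(counts) <- NULL

# Hypothetical helper (not part of the package).
top_words <- function(tbl, number, strict = TRUE) {
  if (strict || nrow(tbl) <= number) {
    # Hard cutoff: keep exactly the first `number` rows.
    return(head(tbl, number))
  }
  # Otherwise keep every word that occurs at least as often as the cutoff word.
  cutoff <- tbl$occurrence[number]
  tbl[tbl$occurrence >= cutoff, ]
}

strict_top <- top_words(counts, number = 3, strict = TRUE)
loose_top <- top_words(counts, number = 3, strict = FALSE)

print(strict_top)
print(loose_top)
```

With `number = 3`, the strict table stops at `c`, while the non-strict table also keeps `d` and `e`, which tie with the cutoff word.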

Examples

pf <- c("NOUN", "VERB", "ADJ", "ADV")
pf2 <- "NOUN, VERB, ADJ, ADV"
fst_freq_table(fst_child, number = 15, strict = FALSE, pos_filter = pf)
#> Note:
#>  Words with equal occurrence are presented in alphabetical order. 
#>  With `strict` = FALSE, words occurring equally often as the `number` cutoff word will be displayed. 
#> 
#>         words occurrence
#> 1       lyödä         71
#> 2    lyöminen         53
#> 3        paha         42
#> 4       sanoa         33
#> 5       tehdä         33
#> 6      tietää         32
#> 7       tulla         29
#> 8       ottaa         28
#> 9      potkia         27
#> 10    kiusata         26
#> 11 potkiminen         24
#> 12    haukkua         22
#> 13     kaveri         22
#> 14  töniminen         22
#> 15     toinen         17
#> 16      tyhmä         17
#> 17      tönia         17
fst_freq_table(fst_child, number = 15, strict = FALSE, pos_filter = pf2)
#> Note:
#>  Words with equal occurrence are presented in alphabetical order. 
#>  With `strict` = FALSE, words occurring equally often as the `number` cutoff word will be displayed. 
#> 
#>         words occurrence
#> 1       lyödä         71
#> 2    lyöminen         53
#> 3        paha         42
#> 4       sanoa         33
#> 5       tehdä         33
#> 6      tietää         32
#> 7       tulla         29
#> 8       ottaa         28
#> 9      potkia         27
#> 10    kiusata         26
#> 11 potkiminen         24
#> 12    haukkua         22
#> 13     kaveri         22
#> 14  töniminen         22
#> 15     toinen         17
#> 16      tyhmä         17
#> 17      tönia         17
fst_freq_table(fst_child, norm = 'number_words')
#> Note:
#>  Words with equal occurrence are presented in alphabetical order. 
#>  By default, words are presented in order to the `number` cutoff word. 
#>  This means that equally-occurring later-alphabetically words beyond the cutoff word will not be displayed.
#> 
#>       words occurrence
#> 1    toinen      0.075
#> 2     lyödä      0.045
#> 3  lyöminen      0.034
#> 4      joku      0.029
#> 5      paha      0.027
#> 6     tehdä      0.022
#> 7     sanoa      0.021
#> 8    tietää      0.020
#> 9     jokin      0.019
#> 10    tulla      0.019
fst_freq_table(fst_child, use_column_weights = TRUE)
#> Note:
#>  Words with equal occurrence are presented in alphabetical order. 
#>  By default, words are presented in order to the `number` cutoff word. 
#>  This means that equally-occurring later-alphabetically words beyond the cutoff word will not be displayed.
#> 
#>       words occurrence
#> 1    toinen  17538.566
#> 2     lyödä  10472.019
#> 3  lyöminen   7837.492
#> 4      joku   6794.576
#> 5      paha   6329.575
#> 6     tehdä   5056.231
#> 7     sanoa   4865.005
#> 8    tietää   4677.499
#> 9     jokin   4484.374
#> 10    tulla   4402.931
c2 <- fst_child_2
s <- survey::svydesign(id = ~1, weights = ~paino, data = child)
i <- 'fsd_id'
fst_freq_table(c2, use_svydesign_weights = TRUE, svydesign = s, id = i)
#> Note:
#>  Words with equal occurrence are presented in alphabetical order. 
#>  By default, words are presented in order to the `number` cutoff word. 
#>  This means that equally-occurring later-alphabetically words beyond the cutoff word will not be displayed.
#> 
#>       words occurrence
#> 1    toinen  17538.566
#> 2     lyödä  10472.019
#> 3  lyöminen   7837.492
#> 4      joku   6794.576
#> 5      paha   6329.575
#> 6     tehdä   5056.231
#> 7     sanoa   4865.005
#> 8    tietää   4677.499
#> 9     jokin   4484.374
#> 10    tulla   4402.931
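
The weighted tables above can be read as sums of respondent weights rather than plain counts. The following base-R sketch illustrates that idea on invented data; it assumes each word contributes the weight of the response it came from, which is an assumption about the method rather than something stated on this page. The column names `fsd_id` and `paino` are borrowed from the example call, and everything else is hypothetical.

```r
# Toy formatted data: one row per word, `doc_id` links back to the raw data.
tokens <- data.frame(
  doc_id = c("r1", "r1", "r2", "r3", "r3"),
  lemma = c("koira", "kissa", "koira", "kissa", "kissa"),
  stringsAsFactors = FALSE
)

# Toy raw data holding one weight per respondent (values are made up).
raw <- data.frame(
  fsd_id = c("r1", "r2", "r3"),
  paino = c(1.5, 0.5, 2.0),
  stringsAsFactors = FALSE
)

# Attach each respondent's weight to their words by matching the ID columns.
tokens$weight <- raw$paino[match(tokens$doc_id, raw$fsd_id)]

# Weighted occurrence: sum of weights per word instead of a row count.
weighted <- aggregate(weight ~ lemma, data = tokens, FUN = sum)
names(weighted) <- c("words", "occurrence")
weighted <- weighted[order(-weighted$occurrence, weighted$words), ]
rownames(weighted) <- NULL

print(weighted)
```

This also shows why `id` must match the `docid` in the formatted data: without a shared key, the weights cannot be attached to the words.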