Make Top N-grams Table — fst_ngrams_table • finnsurveytext

Creates a table of the most frequently occurring n-grams in the data. Optionally, weights can be supplied either through a `weight` column in the formatted data or from a `svydesign` object containing the raw (unformatted) data.

Usage

fst_ngrams_table(
  data,
  number = 10,
  ngrams = 1,
  norm = NULL,
  pos_filter = NULL,
  strict = TRUE,
  use_svydesign_weights = FALSE,
  id = "",
  svydesign = NULL,
  use_column_weights = FALSE
)

Arguments

data: A dataframe of text in CoNLL-U format, with optional additional columns.
number: The number of n-grams to return, default is `10`.
ngrams: The size of the n-grams to return (`1` for single words, `2` for bigrams, and so on), default is `1`.
norm: The method for normalising the counts. Valid settings are `"number_words"` (divide by the number of words in the responses), `"number_resp"` (divide by the number of responses), or `NULL` (return raw counts; this is the default and is also used when weights are applied).
pos_filter: UPOS tags to include, given as a character vector or as a single comma-separated string; the default `NULL` includes all word types.
strict: Whether to cut off strictly at `number` (ties are ordered alphabetically), default is `TRUE`.
use_svydesign_weights: Option to weight words in the table using weights from a `svydesign` containing the raw data, default is `FALSE`.
id: ID column from raw data, required if `use_svydesign_weights = TRUE` and must match the `docid` in formatted `data`.
svydesign: A `svydesign` which contains the raw data and weights, required if `use_svydesign_weights = TRUE`.
use_column_weights: Option to weight words in the table using weights from formatted data which includes an additional `weight` column, default is `FALSE`.
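
To make the `number`, `strict` and `norm` arguments more concrete, here is a minimal base R sketch on toy data. It is not the package's implementation. The toy `responses`, the helper columns `per_word` and `per_resp`, and the reading of a non-strict cutoff as "keep every n-gram tied with the last one kept" are assumptions made for illustration.

```r
# Toy lemmas from three hypothetical responses (not package data)
responses <- list(
  c("toinen", "paha", "toinen"),
  c("paha", "mieli"),
  c("toinen", "tulla")
)

# Count how often each word occurs across all responses
words <- unlist(responses)
counts <- as.data.frame(table(words), stringsAsFactors = FALSE)
names(counts) <- c("words", "occurrence")

# Sort by descending count, breaking ties alphabetically
counts <- counts[order(-counts$occurrence, counts$words), ]
rownames(counts) <- NULL

number <- 3

# Strict cutoff: keep exactly `number` rows, so "tulla" (tied with "mieli") is dropped
strict_top <- head(counts, number)

# Assumed non-strict cutoff: also keep rows tied with the last row kept
cutoff <- counts$occurrence[number]
loose_top <- counts[counts$occurrence >= cutoff, ]

# Normalisation in the spirit of norm = "number_words" and norm = "number_resp"
counts$per_word <- round(counts$occurrence / length(words), 3)
counts$per_resp <- round(counts$occurrence / length(responses), 3)

print(strict_top)
print(loose_top)
print(counts)
```

With `number <- 3`, the strict table stops at "mieli", while the non-strict table also keeps "tulla" because it occurs equally often.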

Value

A table of the most frequently occurring n-grams in the data.

Examples

pf <- c("NOUN", "VERB", "ADJ", "ADV")
pf2 <- "NOUN, VERB, ADJ, ADV"
fst_ngrams_table(fst_child, norm = NULL)
#> Note:
#>  N-grams with equal occurrence are presented in alphabetical order. 
#>  By default, n-grams are presented in order to the `number` cutoff n-gram. 
#>  This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed. 
#> 
#>       words occurrence
#> 1    toinen        118
#> 2     lyödä         71
#> 3  lyöminen         53
#> 4      joku         46
#> 5      paha         43
#> 6     tehdä         34
#> 7     sanoa         33
#> 8    tietää         32
#> 9     jokin         30
#> 10    tulla         30
fst_ngrams_table(fst_child, ngrams = 2, norm = "number_resp")
#> Note:
#>  N-grams with equal occurrence are presented in alphabetical order. 
#>  By default, n-grams are presented in order to the `number` cutoff n-gram. 
#>  This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed. 
#> 
#>                  words occurrence
#> 1  lyöminen potkiminen      0.041
#> 2           joku lyödä      0.027
#> 3         lyödä potkia      0.027
#> 4          osata sanoa      0.022
#> 5       haukkua toinen      0.019
#> 6          sanoa jokin      0.017
#> 7          tulla mieli      0.017
#> 8          joku toinen      0.015
#> 9         ottaa toinen      0.015
#> 10          paha mieli      0.015
fst_ngrams_table(fst_child, ngrams = 2, pos_filter = pf)
#> Note:
#>  N-grams with equal occurrence are presented in alphabetical order. 
#>  By default, n-grams are presented in order to the `number` cutoff n-gram. 
#>  This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed. 
#> 
#>                  words occurrence
#> 1  lyöminen potkiminen         17
#> 2         lyödä potkia         12
#> 3          osata sanoa          9
#> 4          tulla mieli          9
#> 5         potkia lyödä          8
#> 6         tietää lyödä          7
#> 7           paha mieli          6
#> 8      pyytää anteeksi          6
#> 9           tehdä paha          6
#> 10  töniminen lyöminen          6
fst_ngrams_table(fst_child, ngrams = 2, pos_filter = pf2)
#> Note:
#>  N-grams with equal occurrence are presented in alphabetical order. 
#>  By default, n-grams are presented in order to the `number` cutoff n-gram. 
#>  This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed. 
#> 
#>                  words occurrence
#> 1  lyöminen potkiminen         17
#> 2         lyödä potkia         12
#> 3          osata sanoa          9
#> 4          tulla mieli          9
#> 5         potkia lyödä          8
#> 6         tietää lyödä          7
#> 7           paha mieli          6
#> 8      pyytää anteeksi          6
#> 9           tehdä paha          6
#> 10  töniminen lyöminen          6
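
The two calls above return the same table because `pf` and `pf2` name the same UPOS tags in different forms. The base R sketch below shows one way a comma-separated string can be reduced to the vector form and used to filter tokens. How the package parses `pos_filter` internally is not shown in this page, so the splitting step and the toy `toy` data frame are assumptions.

```r
pf <- c("NOUN", "VERB", "ADJ", "ADV")
pf2 <- "NOUN, VERB, ADJ, ADV"

# Split the string on commas and trim the surrounding whitespace
pf2_split <- trimws(strsplit(pf2, ",", fixed = TRUE)[[1]])
same_tags <- identical(pf, pf2_split)

# Hypothetical CoNLL-U-like rows: one token per row with its UPOS tag
toy <- data.frame(
  lemma = c("paha", "ja", "mieli", "tulla"),
  upos = c("ADJ", "CCONJ", "NOUN", "VERB"),
  stringsAsFactors = FALSE
)

# Keep only tokens whose tag is in the filter; the conjunction "ja" is dropped
kept <- toy[toy$upos %in% pf2_split, ]

print(same_tags)
print(kept)
```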
c2 <- fst_child_2
s <- survey::svydesign(id = ~1, weights = ~paino, data = child)
i <- 'fsd_id'
fst_ngrams_table(c2, use_svydesign_weights = TRUE, svydesign = s, id = i)
#> Note:
#>  N-grams with equal occurrence are presented in alphabetical order. 
#>  By default, n-grams are presented in order to the `number` cutoff n-gram. 
#>  This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed. 
#> 
#>       words occurrence
#> 1    toinen  17538.566
#> 2     lyödä  10472.019
#> 3  lyöminen   7837.492
#> 4      joku   6794.576
#> 5      paha   6329.575
#> 6     tehdä   5056.231
#> 7     sanoa   4865.005
#> 8    tietää   4677.499
#> 9     jokin   4484.374
#> 10    tulla   4402.931
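
The weighted occurrences above are no longer whole numbers. One plausible reading, sketched below in base R, is that each token contributes its respondent's weight instead of 1. This is an assumption about the weighting and not the package's code. The data frames `raw` and `tokens` are hypothetical, with a plain data frame standing in for the `svydesign` object; only the column names `fsd_id` and `paino` are taken from the example above.

```r
# Hypothetical raw survey data: one row per respondent with a survey weight
raw <- data.frame(
  fsd_id = c(1, 2, 3),
  paino = c(1.5, 0.5, 2.5)
)

# Hypothetical formatted data: one row per token, keyed by document ID
tokens <- data.frame(
  doc_id = c(1, 1, 2, 3, 3),
  lemma = c("toinen", "paha", "toinen", "paha", "mieli"),
  stringsAsFactors = FALSE
)

# Attach each respondent's weight to their tokens by matching the ID columns
tokens$weight <- raw$paino[match(tokens$doc_id, raw$fsd_id)]

# Weighted occurrence: sum the weights per word instead of counting rows
weighted <- vapply(split(tokens$weight, tokens$lemma), sum, numeric(1))
weighted <- sort(weighted, decreasing = TRUE)

print(weighted)
```

This also suggests why `id` must match the `docid` in the formatted data: without a shared identifier, the weights cannot be attached to the right responses.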
fst_ngrams_table(fst_child, use_column_weights = TRUE, ngrams = 3)
#> Note:
#>  N-grams with equal occurrence are presented in alphabetical order. 
#>  By default, n-grams are presented in order to the `number` cutoff n-gram. 
#>  This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed. 
#> 
#>                            words occurrence
#> 1               tehdä jokin paha    462.179
#> 2              joku lyödä potkia    452.272
#> 3               tulla paha mieli    452.028
#> 4              ottaa toinen lelu    450.961
#> 5  töniminen lyöminen potkiminen    449.407
#> 6               joku lyödä tönia    439.120
#> 7                 jokin run sana    307.801
#> 8                sanoa jokin run    307.801
#> 9          semmoinen sanoa jokin    307.801
#> 10             toinen tulla paha    307.801