Creates a table of the most frequently occurring n-grams in the data. Optionally, weights can be supplied either through a `weight` column in the formatted data or from a `svydesign` object containing the raw (preformatted) data.
Usage
fst_ngrams_table(
  data,
  number = 10,
  ngrams = 1,
  norm = NULL,
  pos_filter = NULL,
  strict = TRUE,
  use_svydesign_weights = FALSE,
  id = "",
  svydesign = NULL,
  use_column_weights = FALSE
)
Arguments
- data
A dataframe of text in CoNLL-U format, with optional additional columns.
- number
The number of n-grams to return, default is `10`.
- ngrams
The size of the n-grams to return (for example `1` for single words, `2` for bigrams), default is `1`.
- norm
The method for normalising the data. Valid settings are `"number_words"` (the number of words in the responses), `"number_resp"` (the number of responses), or `NULL` (the default; raw counts are returned, and this is also used when weights are applied).
- pos_filter
List of UPOS tags to include. The default, `NULL`, includes all word types. In the examples, the tags are passed both as a character vector and as a single comma-separated string.
- strict
Whether to cut off strictly at `number` (ties are ordered alphabetically), default is `TRUE`.
- use_svydesign_weights
Option to weight words in the table using weights from a `svydesign` containing the raw data, default is `FALSE`.
- id
ID column from raw data, required if `use_svydesign_weights = TRUE` and must match the `docid` in formatted `data`.
- svydesign
A `svydesign` which contains the raw data and weights, required if `use_svydesign_weights = TRUE`.
- use_column_weights
Option to weight words in the table using weights from formatted data that includes an additional `weight` column, default is `FALSE`.
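To make the `norm` and `strict` arguments more concrete, the following base R sketch reproduces their described behaviour on toy data. It is an illustration under assumptions, not the package's implementation: `responses`, `normalise()` and `top_ngrams()` are hypothetical names, and rounding to three decimals is only a guess based on the example output.

```r
# Toy illustration only; this is not the finnsurveytext implementation.
# `responses` is a hypothetical set of already-lemmatised responses.
responses <- list(
  c("toinen", "paha"),
  c("toinen", "joku"),
  c("paha", "mieli"),
  c("toinen")
)

# Count each word (1-gram) across all responses.
counts <- as.data.frame(table(unlist(responses)), stringsAsFactors = FALSE)
names(counts) <- c("words", "occurrence")

# Sort by descending count; ties are ordered alphabetically.
counts <- counts[order(-counts$occurrence, counts$words), ]

# Hypothetical helper: divide raw counts by the chosen denominator.
normalise <- function(counts, responses, norm = NULL) {
  if (is.null(norm)) return(counts)
  denom <- switch(norm,
    number_words = length(unlist(responses)),  # total words in responses
    number_resp = length(responses),           # number of responses
    stop("unknown norm: ", norm)
  )
  counts$occurrence <- round(counts$occurrence / denom, 3)
  counts
}

# Hypothetical helper: strict keeps exactly `number` rows; otherwise all
# rows tied with the row at the cutoff are kept as well.
top_ngrams <- function(counts, number, strict = TRUE) {
  if (strict) return(head(counts, number))
  cutoff <- counts$occurrence[min(number, nrow(counts))]
  counts[counts$occurrence >= cutoff, ]
}

top_strict <- top_ngrams(counts, 3, strict = TRUE)
top_loose <- top_ngrams(counts, 3, strict = FALSE)
by_resp <- normalise(counts, responses, "number_resp")
by_words <- normalise(counts, responses, "number_words")
```

Here `top_strict` stops at the third row, while `top_loose` also keeps the word tied with it at the cutoff.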
Examples
pf <- c("NOUN", "VERB", "ADJ", "ADV")
pf2 <- "NOUN, VERB, ADJ, ADV"
fst_ngrams_table(fst_child, norm = NULL)
#> Note:
#> N-grams with equal occurrence are presented in alphabetical order.
#> By default, n-grams are presented in order to the `number` cutoff n-gram.
#> This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed.
#>
#> words occurrence
#> 1 toinen 118
#> 2 lyödä 71
#> 3 lyöminen 53
#> 4 joku 46
#> 5 paha 43
#> 6 tehdä 34
#> 7 sanoa 33
#> 8 tietää 32
#> 9 jokin 30
#> 10 tulla 30
fst_ngrams_table(fst_child, ngrams = 2, norm = "number_resp")
#> Note:
#> N-grams with equal occurrence are presented in alphabetical order.
#> By default, n-grams are presented in order to the `number` cutoff n-gram.
#> This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed.
#>
#> words occurrence
#> 1 lyöminen potkiminen 0.041
#> 2 joku lyödä 0.027
#> 3 lyödä potkia 0.027
#> 4 osata sanoa 0.022
#> 5 haukkua toinen 0.019
#> 6 sanoa jokin 0.017
#> 7 tulla mieli 0.017
#> 8 joku toinen 0.015
#> 9 ottaa toinen 0.015
#> 10 paha mieli 0.015
fst_ngrams_table(fst_child, ngrams = 2, pos_filter = pf)
#> Note:
#> N-grams with equal occurrence are presented in alphabetical order.
#> By default, n-grams are presented in order to the `number` cutoff n-gram.
#> This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed.
#>
#> words occurrence
#> 1 lyöminen potkiminen 17
#> 2 lyödä potkia 12
#> 3 osata sanoa 9
#> 4 tulla mieli 9
#> 5 potkia lyödä 8
#> 6 tietää lyödä 7
#> 7 paha mieli 6
#> 8 pyytää anteeksi 6
#> 9 tehdä paha 6
#> 10 töniminen lyöminen 6
fst_ngrams_table(fst_child, ngrams = 2, pos_filter = pf2)
#> Note:
#> N-grams with equal occurrence are presented in alphabetical order.
#> By default, n-grams are presented in order to the `number` cutoff n-gram.
#> This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed.
#>
#> words occurrence
#> 1 lyöminen potkiminen 17
#> 2 lyödä potkia 12
#> 3 osata sanoa 9
#> 4 tulla mieli 9
#> 5 potkia lyödä 8
#> 6 tietää lyödä 7
#> 7 paha mieli 6
#> 8 pyytää anteeksi 6
#> 9 tehdä paha 6
#> 10 töniminen lyöminen 6
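The two calls above return the same table whether `pos_filter` is a character vector (`pf`) or a comma-separated string (`pf2`). One plausible way to treat the two forms as equivalent is sketched below; `as_tag_vector()` and the toy `tokens` data are hypothetical, and the package may handle this differently internally.

```r
pf <- c("NOUN", "VERB", "ADJ", "ADV")
pf2 <- "NOUN, VERB, ADJ, ADV"

# Hypothetical helper: split a single comma-separated string into tags,
# and leave a character vector of several tags untouched.
as_tag_vector <- function(x) {
  if (length(x) == 1) {
    trimws(strsplit(x, ",", fixed = TRUE)[[1]])
  } else {
    x
  }
}

# Toy token table with a `upos` column, as found in CoNLL-U style data.
tokens <- data.frame(
  lemma = c("toinen", "ja", "paha"),
  upos = c("ADJ", "CCONJ", "ADJ"),
  stringsAsFactors = FALSE
)

# Keep only tokens whose tag is in the filter.
kept <- tokens[tokens$upos %in% as_tag_vector(pf2), ]
```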
c2 <- fst_child_2
s <- survey::svydesign(id = ~1, weights = ~paino, data = child)
i <- 'fsd_id'
fst_ngrams_table(c2, use_svydesign_weights = TRUE, svydesign = s, id = i)
#> Note:
#> N-grams with equal occurrence are presented in alphabetical order.
#> By default, n-grams are presented in order to the `number` cutoff n-gram.
#> This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed.
#>
#> words occurrence
#> 1 toinen 17538.566
#> 2 lyödä 10472.019
#> 3 lyöminen 7837.492
#> 4 joku 6794.576
#> 5 paha 6329.575
#> 6 tehdä 5056.231
#> 7 sanoa 4865.005
#> 8 tietää 4677.499
#> 9 jokin 4484.374
#> 10 tulla 4402.931
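With weights applied, the `occurrence` column is no longer a raw count. A reasonable reading is that each occurrence contributes its respondent's weight, matched through the ID column. The base R sketch below illustrates that idea on toy data; the data frames and the `doc_id` column name are assumptions, and this is not the package's own code.

```r
# Hypothetical raw survey data: one row per respondent, with a weight.
raw <- data.frame(fsd_id = c(1, 2, 3), paino = c(1.5, 0.5, 2.0))

# Hypothetical formatted data: one row per token, keyed by document ID.
tokens <- data.frame(
  doc_id = c(1, 1, 2, 3, 3),
  lemma = c("toinen", "paha", "toinen", "paha", "mieli"),
  stringsAsFactors = FALSE
)

# Attach each respondent's weight to their tokens by matching IDs.
tokens$weight <- raw$paino[match(tokens$doc_id, raw$fsd_id)]

# Weighted occurrence: sum of weights per lemma, largest first,
# with ties ordered alphabetically.
weighted <- aggregate(weight ~ lemma, data = tokens, FUN = sum)
weighted <- weighted[order(-weighted$weight, weighted$lemma), ]
```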
fst_ngrams_table(fst_child, use_column_weights = TRUE, ngrams = 3)
#> Note:
#> N-grams with equal occurrence are presented in alphabetical order.
#> By default, n-grams are presented in order to the `number` cutoff n-gram.
#> This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed.
#>
#> words occurrence
#> 1 tehdä jokin paha 462.179
#> 2 joku lyödä potkia 452.272
#> 3 tulla paha mieli 452.028
#> 4 ottaa toinen lelu 450.961
#> 5 töniminen lyöminen potkiminen 449.407
#> 6 joku lyödä tönia 439.120
#> 7 jokin run sana 307.801
#> 8 sanoa jokin run 307.801
#> 9 semmoinen sanoa jokin 307.801
#> 10 toinen tulla paha 307.801