Make Top N-grams Table
fst_get_top_ngrams.Rd
Creates a table of the most frequently occurring n-grams in the data.
Usage
fst_get_top_ngrams(
data,
number = 10,
ngrams = 1,
norm = "number_words",
pos_filter = NULL,
strict = TRUE
)
Arguments
- data
A dataframe of text in CoNLL-U format.
- number
The number of n-grams to return, default is `10`.
- ngrams
The size of the n-grams to return, that is, the number of words in each n-gram, default is `1`.
- norm
The method for normalising the data. Valid settings are `"number_words"` (the number of words in the responses, default), `"number_resp"` (the number of responses), or `NULL` (raw count returned).
- pos_filter
List of UPOS tags for inclusion, default is `NULL`, which means all word types are included.
- strict
Whether to cut off strictly at `number` (ties are alphabetically ordered), default is `TRUE`.
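The interaction of `number`, `norm` and `strict` can be illustrated with a small base R sketch. This is not the package's implementation: `top_ngrams_sketch()`, its `norm_total` argument and the toy `lemmas` vector are hypothetical names introduced here for illustration. The sketch assumes that normalisation divides each raw count by a total (the number of words or the number of responses), and that `strict = FALSE` keeps every n-gram tied with the one at the cutoff.

```r
# Hypothetical helper, for illustration only (not part of the package).
# lemmas:     character vector of unigrams (toy stand-in for CoNLL-U lemmas)
# norm_total: NULL for raw counts, otherwise the total to divide by
#             (assumed: number of words or number of responses)
top_ngrams_sketch <- function(lemmas, number = 10, norm_total = NULL,
                              strict = TRUE) {
  # Count each distinct word.
  counts <- as.data.frame(table(words = lemmas), stringsAsFactors = FALSE)
  names(counts)[2] <- "occurrence"

  # Most frequent first; ties are broken alphabetically.
  counts <- counts[order(-counts$occurrence, counts$words), ]

  if (nrow(counts) > number) {
    if (strict) {
      # Cut off exactly at `number` rows.
      counts <- counts[seq_len(number), ]
    } else {
      # Assumed behaviour: keep everything tied with the cutoff row.
      cutoff <- counts$occurrence[number]
      counts <- counts[counts$occurrence >= cutoff, ]
    }
  }

  # Normalise only when a total is supplied.
  if (!is.null(norm_total)) {
    counts$occurrence <- counts$occurrence / norm_total
  }

  rownames(counts) <- NULL
  counts
}

# Toy data: maa x3, asia x2, paha x2, elintaso x1.
lemmas <- c("maa", "asia", "maa", "paha", "asia", "maa", "elintaso", "paha")

top_ngrams_sketch(lemmas, number = 2)                  # strict cutoff
top_ngrams_sketch(lemmas, number = 2, strict = FALSE)  # ties at cutoff kept
top_ngrams_sketch(lemmas, number = 2, norm_total = length(lemmas))
```

With the strict cutoff, `paha` is dropped even though it occurs as often as `asia`, because it sorts later alphabetically. This is the situation the note printed in the examples describes.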
Examples
q11_1 <- conllu_dev_q11_1_nltk
fst_get_top_ngrams(q11_1, norm = NULL)
#> Note:
#> N-grams with equal occurrence are presented in alphabetical order.
#> By default, n-grams are presented in order to the `number` cutoff n-gram.
#> This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed.
#>
#> words occurrence
#> 1 ihminen 205
#> 2 asia 102
#> 3 elintaso 96
#> 4 köyhä 88
#> 5 paljon 86
#> 6 huono 83
#> 7 köyhyys 68
#> 8 tarvita 63
#> 9 kehitys 60
#> 10 maa 59
fst_get_top_ngrams(q11_1, number = 10, ngrams = 1, norm = "number_resp")
#> Note:
#> N-grams with equal occurrence are presented in alphabetical order.
#> By default, n-grams are presented in order to the `number` cutoff n-gram.
#> This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed.
#>
#> words occurrence
#> 1 ihminen 0.217
#> 2 asia 0.108
#> 3 elintaso 0.102
#> 4 köyhä 0.093
#> 5 paljon 0.091
#> 6 huono 0.088
#> 7 köyhyys 0.072
#> 8 tarvita 0.067
#> 9 kehitys 0.063
#> 10 maa 0.062
cb <- conllu_cb_bullying
pf <- c("NOUN", "VERB", "ADJ", "ADV")
fst_get_top_ngrams(cb, number = 15, pos_filter = pf)
#> Note:
#> N-grams with equal occurrence are presented in alphabetical order.
#> By default, n-grams are presented in order to the `number` cutoff n-gram.
#> This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed.
#>
#> words occurrence
#> 1 lyödä 0.023
#> 2 lyöminen 0.019
#> 3 paha 0.016
#> 4 sanoa 0.012
#> 5 tehdä 0.011
#> 6 tulla 0.011
#> 7 ottaa 0.010
#> 8 potkiminen 0.009
#> 9 töniminen 0.008
#> 10 kiusata 0.008
#> 11 potkia 0.008
#> 12 kaveri 0.007
#> 13 tyhmä 0.007
#> 14 haukkulla 0.007
#> 15 mieli 0.007