Creates a table of the most frequently occurring n-grams in the data.

Usage

fst_get_top_ngrams(
  data,
  number = 10,
  ngrams = 1,
  norm = "number_words",
  pos_filter = NULL,
  strict = TRUE
)

Arguments

data

A dataframe of text in CoNLL-U format.

number

The number of n-grams to return, default is `10`.

ngrams

The size of the n-grams to return (`1` for single words, `2` for bigrams, and so on), default is `1`.

norm

The method for normalising the data. Valid settings are `"number_words"` (the number of words in the responses, default), `"number_resp"` (the number of responses), or `NULL` (raw count returned).

pos_filter

A list of UPOS tags to include, default is `NULL`, which means all word types are included.

strict

Whether to cut off strictly at `number` (ties are ordered alphabetically), default is `TRUE`.
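
The interaction of `norm`, `pos_filter` and `strict` can be illustrated with a minimal base R sketch for the single-word case. This is not the package's implementation. The helper `sketch_top_unigrams()` and the toy data are hypothetical, the column names `doc_id`, `lemma` and `upos` are assumed from typical CoNLL-U data frames, and several details are assumptions: that the normalising denominator is taken before filtering, that results are rounded to three decimals (as the example output suggests), and that `strict = FALSE` keeps every n-gram tied with the cutoff n-gram.

```r
# Toy CoNLL-U-like data: one row per token (column names are an assumption).
toy <- data.frame(
  doc_id = c("r1", "r1", "r1", "r2", "r2", "r3", "r3", "r3"),
  lemma  = c("huono", "ihminen", "ja", "ihminen", "maa", "asia", "maa", "huono"),
  upos   = c("ADJ", "NOUN", "CCONJ", "NOUN", "NOUN", "NOUN", "NOUN", "ADJ"),
  stringsAsFactors = FALSE
)

# Hypothetical helper, not part of the package.
sketch_top_unigrams <- function(data, number = 10, norm = "number_words",
                                pos_filter = NULL, strict = TRUE) {
  # Denominator for normalisation, computed on the unfiltered data (assumption).
  denom <- if (is.null(norm)) {
    1
  } else if (norm == "number_words") {
    nrow(data)
  } else if (norm == "number_resp") {
    length(unique(data$doc_id))
  } else {
    stop("norm must be 'number_words', 'number_resp' or NULL")
  }

  # Keep only the requested UPOS tags, if any were given.
  if (!is.null(pos_filter)) {
    data <- data[data$upos %in% pos_filter, , drop = FALSE]
  }

  # Count each lemma.
  counts <- as.data.frame(table(words = data$lemma), stringsAsFactors = FALSE)
  names(counts)[2] <- "occurrence"

  # Most frequent first; equal counts are ordered alphabetically.
  counts <- counts[order(-counts$occurrence, counts$words), , drop = FALSE]

  if (nrow(counts) > number) {
    cutoff <- counts$occurrence[number]
    keep <- if (strict) {
      seq_len(number)                      # exactly `number` rows
    } else {
      which(counts$occurrence >= cutoff)   # also keep ties with the cutoff row
    }
    counts <- counts[keep, , drop = FALSE]
  }

  # Normalise and round (dividing by 1 leaves raw counts unchanged).
  counts$occurrence <- round(counts$occurrence / denom, 3)
  rownames(counts) <- NULL
  counts
}

pf <- c("NOUN", "ADJ")
sketch_top_unigrams(toy, number = 2, norm = NULL, pos_filter = pf)
sketch_top_unigrams(toy, number = 2, norm = NULL, pos_filter = pf, strict = FALSE)
```

In the toy data, three words occur twice, so the strict call returns two rows while the non-strict call returns all three tied words.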

Value

A table of the most frequently occurring n-grams in the data.
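
For `ngrams` greater than `1`, the table rows are sequences of adjacent words rather than single words. The following base R sketch shows one way such n-grams could be formed and counted. The helper `make_ngrams()` and the toy tokens are hypothetical, and building n-grams within each response (never across response boundaries) is an assumption rather than a documented behaviour of the function.

```r
# Hypothetical helper: join every run of `n` adjacent tokens into one string.
make_ngrams <- function(tokens, n = 2) {
  if (length(tokens) < n) {
    return(character(0))  # response too short to contain an n-gram
  }
  starts <- seq_len(length(tokens) - n + 1)
  vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
}

# Toy lemmas grouped by response (illustrative data only).
tokens_by_resp <- list(
  r1 = c("huono", "elintaso", "maa"),
  r2 = c("huono", "elintaso")
)

# Build bigrams per response, then pool and count them.
bigrams <- unlist(lapply(tokens_by_resp, make_ngrams, n = 2), use.names = FALSE)
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
bigram_counts
```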

Examples

q11_1 <- conllu_dev_q11_1_nltk
fst_get_top_ngrams(q11_1, norm = NULL)
#> Note:
#>  N-grams with equal occurrence are presented in alphabetical order. 
#>  By default, n-grams are presented in order to the `number` cutoff n-gram. 
#>  This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed. 
#> 
#>       words occurrence
#> 1   ihminen        205
#> 2      asia        102
#> 3  elintaso         96
#> 4     köyhä         88
#> 5    paljon         86
#> 6     huono         83
#> 7   köyhyys         68
#> 8   tarvita         63
#> 9   kehitys         60
#> 10      maa         59
fst_get_top_ngrams(q11_1, number = 10, ngrams = 1, norm = "number_resp")
#> Note:
#>  N-grams with equal occurrence are presented in alphabetical order. 
#>  By default, n-grams are presented in order to the `number` cutoff n-gram. 
#>  This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed. 
#> 
#>       words occurrence
#> 1   ihminen      0.217
#> 2      asia      0.108
#> 3  elintaso      0.102
#> 4     köyhä      0.093
#> 5    paljon      0.091
#> 6     huono      0.088
#> 7   köyhyys      0.072
#> 8   tarvita      0.067
#> 9   kehitys      0.063
#> 10      maa      0.062
cb <- conllu_cb_bullying
pf <- c("NOUN", "VERB", "ADJ", "ADV")
fst_get_top_ngrams(cb, number = 15, pos_filter = pf)
#> Note:
#>  N-grams with equal occurrence are presented in alphabetical order. 
#>  By default, n-grams are presented in order to the `number` cutoff n-gram. 
#>  This means that equally-occurring later-alphabetically n-grams beyond the cutoff n-gram will not be displayed. 
#> 
#>         words occurrence
#> 1       lyödä      0.023
#> 2    lyöminen      0.019
#> 3        paha      0.016
#> 4       sanoa      0.012
#> 5       tehdä      0.011
#> 6       tulla      0.011
#> 7       ottaa      0.010
#> 8  potkiminen      0.009
#> 9   töniminen      0.008
#> 10    kiusata      0.008
#> 11     potkia      0.008
#> 12     kaveri      0.007
#> 13      tyhmä      0.007
#> 14  haukkulla      0.007
#> 15      mieli      0.007