Creates a table of the most frequently occurring words (unigrams) in the data.

Usage

fst_get_top_words(
  data,
  number = 10,
  norm = "number_words",
  pos_filter = NULL,
  strict = TRUE
)

Arguments

data

A dataframe of text in CoNLL-U format.

number

The number of top words to return, default is `10`.

norm

The method for normalising the data. Valid settings are `"number_words"` (the number of words in the responses, default), `"number_resp"` (the number of responses), or `NULL` (raw count returned).

pos_filter

List of UPOS tags to include. The default, `NULL`, includes all word types.

strict

Whether to cut off strictly at `number` (ties are ordered alphabetically), default is `TRUE`.
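
The three `norm` settings can be illustrated with a minimal base-R sketch. It assumes a toy CoNLL-U-style data frame with `doc_id` and `lemma` columns (made-up data), that one `doc_id` corresponds to one response, and that normalisation divides each raw count by the chosen total. The package's internal implementation may differ, for example in how it treats punctuation or stopwords.

```r
# Toy CoNLL-U-style data: one row per token, one doc_id per response (made-up data).
toy <- data.frame(
  doc_id = c("r1", "r1", "r1", "r2", "r2", "r3"),
  lemma  = c("asia", "ihminen", "maa", "ihminen", "apu", "ihminen"),
  stringsAsFactors = FALSE
)

# Raw counts per lemma, most frequent first (what `norm = NULL` is described as returning).
raw <- sort(table(toy$lemma), decreasing = TRUE)

# Assumed meaning of "number_words": divide by the total number of words.
by_words <- round(raw / nrow(toy), 3)

# Assumed meaning of "number_resp": divide by the number of responses.
by_resp <- round(raw / length(unique(toy$doc_id)), 3)

print(raw)
print(by_words)
print(by_resp)
```

Under these assumptions, a word that appears once in every response scores 1 with `"number_resp"`, whereas its `"number_words"` score depends on how long the responses are.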

Value

A table of the most frequently occurring words in the data.

Examples

fst_get_top_words(conllu_dev_q11_1_nltk, number = 15, strict = FALSE)
#> Note:
#>  Words with equal occurrence are presented in alphabetical order. 
#>  With `strict` = FALSE, words occurring equally often as the `number` cutoff word will be displayed. 
#> 
#>       words occurrence
#> 1   ihminen      0.048
#> 2      asia      0.024
#> 3  elintaso      0.023
#> 4     köyhä      0.021
#> 5    paljon      0.020
#> 6     huono      0.019
#> 7   köyhyys      0.016
#> 8   tarvita      0.015
#> 9   kehitys      0.014
#> 10      maa      0.014
#> 11      apu      0.014
#> 12    ruoka      0.012
#> 13 alhainen      0.012
#> 14   kaikki      0.010
#> 15 kehittyä      0.010
#> 16    suuri      0.010
#> 17     taso      0.010
cb <- conllu_cb_bullying
pf <- c("NOUN", "VERB", "ADJ", "ADV")
fst_get_top_words(cb, number = 5, norm = "number_resp", pos_filter = pf)
#> Note:
#>  Words with equal occurrence are presented in alphabetical order. 
#>  By default, words are presented in order to the `number` cutoff word. 
#>  This means that equally-occurring later-alphabetically words beyond the cutoff word will not be displayed.
#> 
#>      words occurrence
#> 1    lyödä      0.152
#> 2 lyöminen      0.128
#> 3     paha      0.104
#> 4    sanoa      0.080
#> 5    tehdä      0.075
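
The effect of `strict` on ties, as described in the notes above, can be sketched in base R. The counts are made up, and `top_n_words()` is a hypothetical helper written for this illustration. It is not a function from the package and may not match the package's internal logic.

```r
# Made-up counts; "maa", "apu" and "asia" are tied.
counts <- data.frame(
  words = c("maa", "ihminen", "apu", "asia", "taso"),
  occurrence = c(2, 5, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: order by descending count, then alphabetically,
# and either cut strictly at `number` or keep every word tied with the cutoff word.
top_n_words <- function(counts, number, strict = TRUE) {
  ordered <- counts[order(-counts$occurrence, counts$words), ]
  if (number >= nrow(ordered)) {
    return(ordered)
  }
  if (strict) {
    return(ordered[seq_len(number), ])
  }
  cutoff <- ordered$occurrence[number]
  ordered[ordered$occurrence >= cutoff, ]
}

strict_top <- top_n_words(counts, number = 2, strict = TRUE)
loose_top  <- top_n_words(counts, number = 2, strict = FALSE)
print(strict_top)
print(loose_top)
```

With `number = 2`, the strict call keeps only `ihminen` and `apu`, while the non-strict call also keeps `asia` and `maa` because they occur as often as the cutoff word.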