Removes stopwords and punctuation from a dataframe of Finnish survey text that is already in CoNLL-U format.

Usage

fst_rm_stop_punct(
  data,
  stopword_list = "nltk",
  language = "fi",
  manual = FALSE,
  manual_list = ""
)

Arguments

data

A dataframe of Finnish text in CoNLL-U format.

stopword_list

The stopword list to use, corresponding to `source` in `stopwords::stopwords`. The default is `"nltk"`. Use `"manual"` to indicate that a manual list will be provided, or `"none"` to skip stopword removal.

language

Two-letter ISO code of the language for the stopword list.

manual

An optional boolean indicating that a manual list will be provided. Setting `stopword_list = "manual"` can be used as well or instead.

manual_list

A manual list of stopwords.
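
Before choosing a `stopword_list`, it can help to look at what a given source contains. The sketch below calls the `stopwords` package directly, assuming it is installed. It illustrates the `source` and `language` values only and does not show how `fst_rm_stop_punct` uses them internally.

```r
library(stopwords)

# List the stopword sources known to the stopwords package.
stopwords::stopwords_getsources()

# Finnish stopwords from the default source named on this page ("nltk").
fi_nltk <- stopwords::stopwords(language = "fi", source = "nltk")
head(fi_nltk)

# The same language from another source, for comparison.
fi_snowball <- stopwords::stopwords(language = "fi", source = "snowball")
length(fi_nltk)
length(fi_snowball)
```

Different sources can give lists of different lengths, so the choice of source affects how many tokens are removed.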

Value

A dataframe of text in CoNLL-U format without stopwords and punctuation.
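
To give a sense of the returned dataframe, here is a minimal base-R sketch on a toy CoNLL-U-style table. It assumes that stopwords are matched against the `lemma` column and that punctuation is identified by `upos == "PUNCT"`. Both are assumptions made for illustration, not a confirmed description of the package's implementation, and the `toy` data and `stops` list are hypothetical.

```r
# Toy CoNLL-U-style dataframe; real formatted data has more columns.
toy <- data.frame(
  doc_id = c("1", "1", "1", "1"),
  token  = c("Minä", "en", "tiedä", "."),
  lemma  = c("minä", "ei", "tietää", "."),
  upos   = c("PRON", "AUX", "VERB", "PUNCT"),
  stringsAsFactors = FALSE
)

# Hypothetical manual stopword list.
stops <- c("en", "et", "ei", "emme", "ette", "eivät", "minä", "minun")

# Assumption: drop rows whose lemma is a stopword or whose tag is punctuation.
cleaned <- toy[!(toy$lemma %in% stops) & toy$upos != "PUNCT", ]
print(cleaned)
```

Only the verb row remains in `cleaned`. The other columns are left unchanged, so the result is still in the same tabular format.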

Examples

if (FALSE) { # \dontrun{
c <- fst_format(child, question = 'q7', id = 'fsd_id')
fst_rm_stop_punct(c)
fst_rm_stop_punct(c, stopword_list = "snowball")
fst_rm_stop_punct(c, "stopwords-iso")

mlist <- c('en', 'et', 'ei', 'emme', 'ette', 'eivät', 'minä', 'minun')
mlist2 <- "en, et, ei, emme, ette, eivät, minä, minun"
fst_rm_stop_punct(c, manual = TRUE, manual_list = mlist)
fst_rm_stop_punct(c, stopword_list = "manual", manual_list = mlist)
unlink("finnish-ftb-ud-2.5-191206.udpipe")
} # }