Remove Finnish stopwords and punctuation from CoNLL-U dataframe
Source: R/01_prepare.R
fst_rm_stop_punct.Rd
Removes stopwords and punctuation from a dataframe containing Finnish survey text data which is already in CoNLL-U format.
Usage
fst_rm_stop_punct(
data,
stopword_list = "nltk",
language = "fi",
manual = FALSE,
manual_list = ""
)
Arguments
- data
A dataframe of Finnish text in CoNLL-U format.
- stopword_list
A valid stopword list, known as 'source' in `stopwords::stopwords`. The default is `"nltk"`. Use `"manual"` to indicate that a manual list will be provided, or `"none"` to skip stopword removal.
- language
The two-letter ISO code of the language for the stopword list. The default is `"fi"`.
- manual
An optional boolean indicating that a manual list will be provided. Setting `stopword_list = "manual"` can be used in addition to, or instead of, this argument.
- manual_list
A manual list of stopwords.
Examples
if (FALSE) { # \dontrun{
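# Convert the q7 responses in `child` to CoNLL-U format
c <- fst_format(child, question = 'q7', id = 'fsd_id')
# Default behaviour: NLTK stopword list
fst_rm_stop_punct(c)
# Other stopword sources, passed by name or by position
fst_rm_stop_punct(c, stopword_list = "snowball")
fst_rm_stop_punct(c, "stopwords-iso")
# A manual stopword list as a character vector
mlist <- c('en', 'et', 'ei', 'emme', 'ette', 'eivät', 'minä', 'minun')
# The same words as a single comma-separated string (not used below)
mlist2 <- "en, et, ei, emme, ette, eivät, minä, minun"
# Two equivalent ways to request the manual list
fst_rm_stop_punct(c, manual = TRUE, manual_list = mlist)
fst_rm_stop_punct(c, stopword_list = "manual", manual_list = mlist)
# Remove the udpipe model file from the working directory
unlink("finnish-ftb-ud-2.5-191206.udpipe")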
} # }
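To make the effect of the function more concrete, here is a minimal base R sketch of the general technique: dropping punctuation rows and stopword rows from a CoNLL-U-style dataframe. It is not the package's implementation. The helper `rm_stop_punct_sketch`, the toy data, the udpipe-style column names (`token`, `lemma`, `upos`), and the choice to match stopwords against `lemma` are all assumptions made for illustration.

```r
# Toy CoNLL-U-style dataframe with only the columns this sketch needs.
# Column names follow the udpipe convention (an assumption, not stated on this page).
conllu <- data.frame(
  doc_id = c("1", "1", "1", "1", "1"),
  token  = c("Minä", "en", "pidä", "sateesta", "."),
  lemma  = c("minä", "ei", "pitää", "sade", "."),
  upos   = c("PRON", "AUX", "VERB", "NOUN", "PUNCT"),
  stringsAsFactors = FALSE
)

# Hypothetical manual stopword list (a subset of the list used in the examples).
manual_stopwords <- c("en", "et", "ei", "emme", "ette", "eivät", "minä")

# Hypothetical helper: keep rows that are neither punctuation nor stopwords.
# Matching on the lower-cased lemma is an assumption of this sketch.
rm_stop_punct_sketch <- function(data, stopwords) {
  is_punct <- data$upos == "PUNCT"
  is_stop  <- tolower(data$lemma) %in% tolower(stopwords)
  data[!is_punct & !is_stop, , drop = FALSE]
}

cleaned <- rm_stop_punct_sketch(conllu, manual_stopwords)
print(cleaned[, c("token", "lemma", "upos")])
```

In this toy input only the verb and noun rows remain, because the pronoun and the negation verb match the manual list and the final row is tagged `PUNCT`.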