InDetail1-DataPreparation
Introduction
Many natural language processing (NLP) tasks require data which is systematically pre-processed into a format useful for analysis. Pre-processing commonly involves activities such as:
- tokenisation into words or sentences
- conversion to lowercase
- removing stopwords (common words like ‘a’, ‘the’, etc.)
- stemming (removing common suffixes from words, e.g. ‘walking’ and ‘walked’ become ‘walk’) or lemmatising (rewriting words in base form, e.g. ‘crying’ and ‘cried’ become ‘cry’)
  - finnsurveytext uses lemmatisation rather than stemming.
  - Stemming is a more straightforward process (as it just removes common suffixes such as ‘ing’) but can cause errors which change word meaning (e.g. ‘caring’ is stemmed to ‘car’).
  - Lemmatisation is often considered superior, but it is slower as it requires a dictionary of words.
- part-of-speech (POS) tagging
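Several of the steps listed above can be illustrated in base R. This is a minimal sketch for intuition only and is not how finnsurveytext implements them: the response text reuses an example shown later in this tutorial, `stopwords_subset` is a small hand-picked subset of the nltk list, and `naive_stem` is a hypothetical helper that only strips a trailing ‘ing’.

```r
# Illustrative only: base R versions of common pre-processing steps.
response <- "Jos kaveri tönii. Jos lyö."

# Tokenise into sentences (text up to and including ., ! or ?).
sentences <- trimws(regmatches(response, gregexpr("[^.!?]+[.!?]", response))[[1]])

# Tokenise into words after dropping punctuation.
words <- unlist(strsplit(gsub("[[:punct:]]", "", response), "\\s+"))

# Convert to lowercase.
words_lower <- tolower(words)

# Remove stopwords (a tiny, hand-picked subset of the nltk list).
stopwords_subset <- c("jos", "ja", "on")
content_words <- words_lower[!words_lower %in% stopwords_subset]

# Naive stemming (hypothetical helper): strip a trailing "ing".
# "caring" becomes "car", which changes the meaning of the word.
naive_stem <- function(x) sub("ing$", "", x)
stems <- naive_stem(c("walking", "caring"))

print(sentences)
print(content_words)
print(stems)
```

Lemmatisation cannot be reproduced this simply because it needs a dictionary or a trained language model.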
Installation of package.
Once the package is installed, you can load the finnsurveytext package with library(finnsurveytext). (Other required packages such as dplyr and stringr will also be installed if they are not currently installed in your environment.)
Data
This tutorial uses two sources of data from the Finnish Social Science Data Archive:
1. Child Barometer Data
- Source: FSD3134 Lapsibarometri 2016
- Question: q7 ‘Kertoisitko, mitä sinun mielestäsi kiusaaminen on? (Avokysymys)’
- Licence: (A) openly available for all users without registration (CC BY 4.0).
- Link to Data: https://urn.fi/urn:nbn:fi:fsd:T-FSD3134
2. Development Cooperation Data
- Source: FSD2821 Nuorten ajatuksia kehitysyhteistyöstä 2012
- Questions: q11_1 ‘Jatka lausetta: Kehitysmaa on maa, jossa… (Avokysymys)’, q11_2 ‘Jatka lausetta: Kehitysyhteistyö on toimintaa, jossa… (Avokysymys)’, q11_3 ‘Jatka lausetta: Maailman kolme suurinta ongelmaa ovat… (Avokysymys)’
- Licence: (A) openly available for all users without registration (CC BY 4.0).
- Link to Data: https://urn.fi/urn:nbn:fi:fsd:T-FSD2821
Both of these will be demonstrated below, but either can be used to complete the tutorial. If you would prefer to use your own data, you can read it in through read.csv() or similar so that you have a ‘raw’ dataframe ready in your R environment.
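As a rough sketch of reading in your own data, the example below writes a tiny CSV to a temporary file and reads it back with read.csv(). The file and its two rows are purely illustrative. The column names fsd_id and q7 mirror those used in this tutorial, and the response texts are reused from the examples shown later.

```r
# Write a tiny example survey to a temporary CSV so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
writeLines(
  c("fsd_id,q7",
    "1,Jos kaveri tönii.",
    "2,Kun nimitellään ja tönitään."),
  csv_path
)

# Read the file into a 'raw' dataframe, keeping responses as character strings.
raw <- read.csv(csv_path, stringsAsFactors = FALSE)

str(raw)
```

In practice you would replace `csv_path` with the path to your own file, and the names of the ID and question columns would be whatever your survey export uses.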
CoNLL-U Format Overview
The finnsurveytext package uses the CoNLL-U format. This tutorial demonstrates the process of preparing Finnish survey text data into this format using functions in r/01_prepare.r.
CoNLL-U is a popular annotation scheme often used in Natural Language Processing (NLP) tasks to tokenise and annotate text. In CoNLL-U format, the text is split into one line per word and ten features of each word are recorded, including an ID, part-of-speech tags, the word itself (e.g. ‘likes’), and the word lemma (e.g. ‘like’). CoNLL stands for the Conference on Computational Natural Language Learning, and the CoNLL-U format was introduced in 2014.
More information on CoNLL-U format can be found in the Universal Dependencies Project, https://universaldependencies.org/format.html.
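To make the ‘one line per word, ten features’ layout concrete, the sketch below builds the ten CoNLL-U fields for a single sentence in a base R dataframe and prints them as tab-separated lines. The values are adapted from the annotated example shown later in this tutorial. The use of "_" for unspecified values follows the CoNLL-U specification, whereas the dataframes later in this tutorial show NA in those positions.

```r
# The ten CoNLL-U fields, as defined by the Universal Dependencies project.
conllu_fields <- c("ID", "FORM", "LEMMA", "UPOS", "XPOS",
                   "FEATS", "HEAD", "DEPREL", "DEPS", "MISC")

# One row per token for the sentence "Jos kaveri tönii."
tokens <- data.frame(
  ID     = 1:4,
  FORM   = c("Jos", "kaveri", "tönii", "."),
  LEMMA  = c("jos", "kaveri", "töniä", "."),
  UPOS   = c("SCONJ", "NOUN", "VERB", "PUNCT"),
  XPOS   = c("C", "N", "V", "Punct"),
  FEATS  = c("_", "Case=Nom|Number=Sing",
             "Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act",
             "_"),
  HEAD   = c(3L, 3L, 0L, 3L),
  DEPREL = c("mark", "nsubj", "root", "punct"),
  DEPS   = "_",
  MISC   = c("_", "_", "SpaceAfter=No", "_"),
  stringsAsFactors = FALSE
)

# Render each token as one tab-separated CoNLL-U line.
conllu_lines <- do.call(paste, c(tokens, sep = "\t"))
writeLines(conllu_lines)
```

The tables later in this tutorial use lowercase names for these ten fields (token_id, token, lemma, upos, xpos, feats, head_token_id, dep_rel, deps, misc) alongside document, paragraph and sentence identifiers.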
The Whole Story
A single function, fst_prepare() (which calls all the data preparation functions within the package), can be used to prepare the data into the required CoNLL-U format.
1. Child Barometer Data
Using our Child Barometer bullying data, we can call this function as follows:
prepd_bullying <- fst_prepare(
  data = child,
  question = "q7",
  id = "fsd_id",
  stopword_list = "nltk",
  language = "fi",
  model = "ftb",
  weights = NULL,
  add_cols = NULL
)
Summary of components
- data is the dataframe of interest. In this case, we are using data that comes with the package called ‘child’. Otherwise, you can read in a csv containing your own data, such as through read.csv() in base R, for use in this tutorial.
- question is the name of the column in your data which contains the open-ended survey question. In this example, the responses about bullying are in question 7.
- id is a unique identifier for each response.
- We have chosen to remove stopwords from the “nltk” stopword list in this example. To find the relevant lists of stopwords, you can run the fst_find_stopwords() function which is outlined below. Punctuation is also removed from the data whenever stopwords are removed.
- The function also requires a language model available for udpipe. In this case we are using the default, FinnTreeBank, model = "ftb". (There are two options for the Finnish language model; the other option is the Turku Dependency Treebank. For further detail on the treebanks, see the Format as CoNLL-U section below.)
- Optionally, you can add a weight column and/or other columns for comparing different groups of respondents in later analysis. We are not adding these in this example.
- manual and manual_list can be used if you want to manually provide a list of stopwords to remove from the data.
- The results in CoNLL-U format are stored in the local environment as “prepd_bullying”.
2. Development Cooperation Data
As an example, the Development Cooperation survey q11_2 data could be prepared using this function call:
prepd_dev <- fst_prepare(
  data = dev_coop,
  question = "q11_2",
  id = "fsd_id",
  stopword_list = "none",
  model = "tdt",
  weights = NULL,
  add_cols = NULL,
  manual = FALSE,
  manual_list = ""
)
- data is the dataframe of interest. In this case, we are using data that comes with the package called ‘dev_coop’. Otherwise, you can read in a csv containing your own data, such as through read.csv() in base R, for use in this tutorial.
- question is the name of the column in your data which contains the open-ended survey question. In this example, the responses are in question 11_2.
- We have chosen not to remove stopwords in this example (stopword_list = "none").
- The function also requires a language model available for udpipe. In this case we are using the Turku Dependency Treebank, model = "tdt".
- The results in CoNLL-U format are stored in the local environment as “prepd_dev”.
In greater detail
To better understand the fst_prepare() function, we will go through each of the functions that this one calls. These are:
- fst_format()
- fst_rm_stop_punct()
Additionally, the fst_find_stopwords() function can be used to find currently available lists of stopwords for exclusion from the data. (The default “language” is “fi”.) The “Name” column can be used to choose a list for the stopword_list variable above. Stopword lists are lists of common words (e.g. “and”, “the”, and “is”, or in Finnish “olla”, “olen”, “olet”, and “on”…) which are often filtered out of the data, leaving less frequently occurring, and thus more meaningful, words remaining.
stopwords <- fst_find_stopwords("fi")
The stopword lists can be very long, so only one (nltk) is shown below. Another two lists, snowball and stopwords-iso, can be found by running the fst_find_stopwords() function in your local environment.
Name | Stopwords | Length |
---|---|---|
nltk | olla , olen , olet , on , olemme , olette , ovat , ole , oli , olisi , olisit , olisin , olisimme, olisitte, olisivat, olit , olin , olimme , olitte , olivat , ollut , olleet , en , et , ei , emme , ette , eivät , minä , minun , minut , minua , minussa , minusta , minuun , minulla , minulta , minulle , sinä , sinun , sinut , sinua , sinussa , sinusta , sinuun , sinulla , sinulta , sinulle , hän , hänen , hänet , häntä , hänessä , hänestä , häneen , hänellä , häneltä , hänelle , me , meidän , meidät , meitä , meissä , meistä , meihin , meillä , meiltä , meille , te , teidän , teidät , teitä , teissä , teistä , teihin , teillä , teiltä , teille , he , heidän , heidät , heitä , heissä , heistä , heihin , heillä , heiltä , heille , tämä , tämän , tätä , tässä , tästä , tähän , tallä , tältä , tälle , tänä , täksi , tuo , tuon , tuotä , tuossa , tuosta , tuohon , tuolla , tuolta , tuolle , tuona , tuoksi , se , sen , sitä , siinä , siitä , siihen , sillä , siltä , sille , siksi , nämä , näiden , näitä , näissä , näistä , näihin , näillä , näiltä , näille , näinä , näiksi , nuo , noiden , noita , noissa , noista , noihin , noilla , noilta , noille , noina , noiksi , ne , niiden , niitä , niissä , niistä , niihin , niillä , niiltä , niille , niinä , niiksi , kuka , kenen , kenet , ketä , kenessä , kenestä , keneen , kenellä , keneltä , kenelle , kenenä , keneksi , ketkä , keiden , keitä , keissä , keistä , keihin , keillä , keiltä , keille , keinä , keiksi , mikä , minkä , mitä , missä , mistä , mihin , millä , miltä , mille , miksi , mitkä , joka , jonka , jota , jossa , josta , johon , jolla , jolta , jolle , jona , joksi , jotka , joiden , joita , joissa , joista , joihin , joilla , joilta , joille , joina , joiksi , että , ja , jos , koska , kuin , mutta , niin , sekä , tai , vaan , vai , vaikka , kanssa , mukaan , noin , poikki , yli , kun , nyt , itse | 229 |
Format as CoNLL-U
This function is used to format the data from your open-ended survey question into CoNLL-U format. It also:
- trims trailing whitespace from the data;
- converts the ‘lemma’ and ‘token’ columns to lowercase; and,
- removes NA values.
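The clean-up steps listed above can be approximated in base R. This is only a sketch of the idea, not the package's internal code: the `responses` vector is hypothetical, and the lowercasing is shown on a single token because the package applies it to the ‘token’ and ‘lemma’ columns after annotation.

```r
# Hypothetical raw responses with trailing whitespace and a missing value.
responses <- c("Jos kaveri tönii.  ", NA, "Kun nimitellään ja tönitään. ")

# Remove NA values, then trim trailing whitespace.
cleaned <- trimws(responses[!is.na(responses)], which = "right")

# Lowercasing, shown here on a single token.
token_lower <- tolower("Jos")

print(cleaned)
print(token_lower)
```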
Our package works for two of the Finnish language models available, Turku Dependency Treebank (TDT) and FinnTreeBank (FTB). Further information about these treebanks can be found at the links but, in brief, the TDT is considered “broad coverage” and includes texts from Wikipedia and news sources, and FTB consists of manually annotated grammatical examples from VISK.
The fst_format() function utilises the udpipe package and can be run as follows:
conllu_dev_q11_1 <- fst_format(data = dev_coop, question = "q11_1", id = 'fsd_id')
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/finnish-ftb-ud-2.5-191206.udpipe to /home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/finnish-ftb-ud-2.5-191206.udpipe
#> - This model has been trained on version 2.5 of data from https://universaldependencies.org
#> - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
#> - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
#> - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
#> Downloading finished, model stored at '/home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/finnish-ftb-ud-2.5-191206.udpipe'
conllu_cb_bullying <- fst_format(data = child, question = "q7", model = "tdt", id = 'fsd_id')
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/finnish-tdt-ud-2.5-191206.udpipe to /home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/finnish-tdt-ud-2.5-191206.udpipe
#> - This model has been trained on version 2.5 of data from https://universaldependencies.org
#> - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
#> - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
#> - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
#> Downloading finished, model stored at '/home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/finnish-tdt-ud-2.5-191206.udpipe'
Note: the first time you run this function, it will download the relevant treebank from udpipe for use in the annotations.
The first six rows of the “conllu_cb_bullying” table are shown below:
doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Jos kaveri tönii. | 1 | jos | jos | SCONJ | C | NA | 3 | mark | NA | NA |
1 | 1 | 1 | Jos kaveri tönii. | 2 | kaveri | kaveri | NOUN | N | Case=Nom|Number=Sing | 3 | nsubj | NA | NA |
1 | 1 | 1 | Jos kaveri tönii. | 3 | tönii | töniä | VERB | V | Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act | 0 | root | NA | SpaceAfter=No |
1 | 1 | 1 | Jos kaveri tönii. | 4 | . | . | PUNCT | Punct | NA | 3 | punct | NA | NA |
1 | 1 | 2 | Jos lyö. | 1 | jos | jos | SCONJ | C | NA | 2 | mark | NA | NA |
1 | 1 | 2 | Jos lyö. | 2 | lyö | lyödä | VERB | V | Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act | 0 | root | NA | SpaceAfter=No |
fst_format() takes 6 arguments:
- data is the dataframe containing the survey data.
- question is the open-ended survey question header in the table, such as “q9”.
- id is the unique ID for each survey response.
- model is the chosen Finnish treebank for annotation, either “ftb” (the default) or “tdt”.
- weights, optional, is a column containing weights for the responses.
- add_cols, optional, is any other columns to bring into the formatted data.
Remove stopwords and punctuation from CoNLL-U data
This (optional) function will remove stopwords and punctuation from the CoNLL-U data. fst_find_stopwords() can be used to find options for stopword lists.
fst_rm_stop_punct() takes 2 arguments:
- data is output from fst_format().
- stopword_list is a list of Finnish stopwords. The default is “nltk”, but any value in the “Name” column from fst_find_stopwords() can be used.
conllu_dev_q11_1_nltk <- fst_rm_stop_punct(data = conllu_dev_q11_1)
conllu_cb_bullying_iso <- fst_rm_stop_punct(conllu_cb_bullying, "stopwords-iso")
The first six rows of the “conllu_cb_bullying_iso” table are shown below:
doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Jos kaveri tönii. | 2 | kaveri | kaveri | NOUN | N | Case=Nom|Number=Sing | 3 | nsubj | NA | NA |
1 | 1 | 1 | Jos kaveri tönii. | 3 | tönii | töniä | VERB | V | Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act | 0 | root | NA | SpaceAfter=No |
1 | 1 | 2 | Jos lyö. | 2 | lyö | lyödä | VERB | V | Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act | 0 | root | NA | SpaceAfter=No |
1 | 1 | 3 | Jos tappelee. | 2 | tappelee | tappella | VERB | V | Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act | 0 | root | NA | SpaceAfter=No |
2 | 1 | 1 | Kun nimitellään ja tönitään. | 2 | nimitellään | nimitellä | VERB | V | Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass | 0 | root | NA | NA |
2 | 1 | 1 | Kun nimitellään ja tönitään. | 4 | tönitään | töniä | VERB | V | Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass | 2 | conj | NA | SpaceAfter=No |
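The effect of this step can be reproduced on a few annotated rows with base R. This is a minimal sketch under stated assumptions, not the package's implementation: `rm_stop_punct_sketch` is a hypothetical helper, it matches stopwords against the lemma column and identifies punctuation by upos == "PUNCT", and `stopwords_subset` contains only three words taken from the nltk list.

```r
# A few annotated tokens, copied from the conllu_cb_bullying example.
annotated <- data.frame(
  doc_id   = c(1, 1, 1, 1),
  token_id = c(1, 2, 3, 4),
  token    = c("jos", "kaveri", "tönii", "."),
  lemma    = c("jos", "kaveri", "töniä", "."),
  upos     = c("SCONJ", "NOUN", "VERB", "PUNCT"),
  stringsAsFactors = FALSE
)

# A small subset of the nltk stopword list.
stopwords_subset <- c("jos", "ja", "kun")

# Hypothetical helper: keep rows that are neither stopwords nor punctuation.
rm_stop_punct_sketch <- function(data, stopwords) {
  keep <- !(data$lemma %in% stopwords) & data$upos != "PUNCT"
  data[keep, , drop = FALSE]
}

filtered <- rm_stop_punct_sketch(annotated, stopwords_subset)
print(filtered)
```

The remaining rows are token_id 2 and 3, which matches the first two rows of the filtered table.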
Conclusion
Now that you have data in CoNLL-U format, this pre-processed data is ready for analysis using finnsurveytext functions. For more information on these, please review the other vignettes in this package.
Citation
The Office of Ombudsman for Children: Child Barometer 2016 [dataset]. Version 1.0 (2016-12-09). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD3134
Finnish Children and Youth Foundation: Young People’s Views on Development Cooperation 2012 [dataset]. Version 2.0 (2019-01-22). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD2821