InDetail1-DataPreparation • finnsurveytext

Introduction

Many natural language processing (NLP) tasks require data which is systematically pre-processed into a format useful for analysis. Pre-processing commonly involves activities such as:

tokenisation into words or sentences
conversion to lowercase
removing stopwords (common words like ‘a’, ‘the’, etc.)
stemming (removing common suffixes from words, eg. ‘walking’ and ‘walked’ become ‘walk’) or lemmatising (rewriting words in base form, eg. ‘crying’ and ‘cried’ become ‘cry’)
- finnsurveytext uses lemmatisation rather than stemming.
- Stemming is a more straightforward process (as it just removes common suffixes such as ‘ing’) but can cause errors which change word meaning (eg. ‘caring’ is stemmed to ‘car’).
- Lemmatisation is often considered superior, but it is slower as it requires a dictionary of words.
part-of-speech (POS) tagging
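
The difference between stemming and lemmatisation can be illustrated with a minimal base R sketch. The suffix rule and the three-word dictionary are invented for this illustration; they are not how finnsurveytext or udpipe lemmatise text.

```r
# Naive stemming: strip a couple of common English suffixes with a regular expression.
naive_stem <- function(words) {
  sub("(ing|ed)$", "", words)
}

# Lemmatisation needs a dictionary mapping each word form to its base form.
# This tiny lookup table is made up for the example.
lemma_dictionary <- c(
  crying = "cry",
  cried  = "cry",
  caring = "care"
)

lemmatise <- function(words) {
  unname(lemma_dictionary[words])
}

stemmed <- naive_stem(c("walking", "walked", "caring"))
lemmatised <- lemmatise(c("crying", "cried", "caring"))

print(stemmed)     # "caring" loses its meaning and becomes "car"
print(lemmatised)  # "caring" is mapped to its base form "care"
```

The stemmer needs no resources but mangles ‘caring’, while the dictionary lookup gives the right base form only for words it already knows.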

Installation of package

Once the package is installed, you can load finnsurveytext as shown below. (Other required packages, such as dplyr and stringr, will also be installed if they are not already present in your environment.)

library(finnsurveytext)

Overview of Functions

The functions covered in this tutorial are:

fst_prepare()
fst_format()
fst_find_stopwords()
fst_rm_stop_punct()

Data

This tutorial uses two sources of data from the Finnish Social Science Data Archive:

1. Child Barometer Data

Source: FSD3134 Lapsibarometri 2016
Question: q7 ‘Kertoisitko, mitä sinun mielestäsi kiusaaminen on? (Avokysymys)’
Licence: (A) openly available for all users without registration (CC BY 4.0).
Link to Data: https://urn.fi/urn:nbn:fi:fsd:T-FSD3134

2. Development Cooperation Data

Source: FSD2821 Nuorten ajatuksia kehitysyhteistyöstä 2012
Questions: q11_1 ‘Jatka lausetta: Kehitysmaa on maa, jossa… (Avokysymys)’, q11_2 ‘Jatka lausetta: Kehitysyhteistyö on toimintaa, jossa… (Avokysymys)’, q11_3 ‘Jatka lausetta: Maailman kolme suurinta ongelmaa ovat… (Avokysymys)’
Licence: (A) openly available for all users without registration (CC BY 4.0).
Link to Data: https://urn.fi/urn:nbn:fi:fsd:T-FSD2821

Both datasets are demonstrated below, but either can be used to complete the tutorial. If you would prefer to use your own data, you can read it in with read.csv() or similar so that you have a ‘raw’ dataframe ready in your R environment.
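
As a minimal sketch of reading in your own data, the example below writes a small CSV to a temporary file and reads it back with read.csv(). The file, the column names (fsd_id, q7) and the two responses are made up for illustration; in practice you would point read.csv() at your own file.

```r
# Create a tiny example survey file in a temporary location (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
example_survey <- data.frame(
  fsd_id = c(1, 2),
  q7 = c("Jos kaveri kiusaa.", "Kun joku haukkuu."),
  stringsAsFactors = FALSE
)
write.csv(example_survey, csv_path, row.names = FALSE)

# Read the file into a 'raw' dataframe. Finnish text contains characters such
# as ä and ö, so stating the file encoding explicitly is a sensible precaution.
raw <- read.csv(csv_path, fileEncoding = "UTF-8", stringsAsFactors = FALSE)

str(raw)
```

The resulting dataframe has one row per response, with an identifier column and a column holding the open-ended text.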

CoNLL-U Format Overview

The finnsurveytext package uses the CoNLL-U format. This tutorial demonstrates the process of preparing Finnish survey text data into this format using functions in r/01_prepare.r.

CoNLL-U is a popular annotation scheme often used in Natural Language Processing (NLP) tasks to tokenise and annotate text. In CoNLL-U format, the text is split into one line per word and ten features of each word are recorded, including an ID, the word itself (eg. ‘likes’), the word lemma (eg. ‘like’), and a part-of-speech tag. CoNLL stands for the Conference on Computational Natural Language Learning, and CoNLL-U format was introduced in 2014.

More information on CoNLL-U format can be found in the Universal Dependencies Project, https://universaldependencies.org/format.html.
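
To make the ten fields concrete, here is a minimal base R sketch that splits a single CoNLL-U token line into its named fields. The field names follow the Universal Dependencies specification; the annotated line itself is a hand-written illustration, not output from this package.

```r
# The ten CoNLL-U fields, in the order given by the Universal Dependencies format.
conllu_fields <- c(
  "ID", "FORM", "LEMMA", "UPOS", "XPOS",
  "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"
)

# One annotated token; fields are tab-separated and an underscore marks an
# unspecified value.
conllu_line <- "2\tlikes\tlike\tVERB\tVBZ\tNumber=Sing|Person=3|Tense=Pres\t0\troot\t_\t_"

# Split on tabs and label each value with its field name.
values <- strsplit(conllu_line, "\t", fixed = TRUE)[[1]]
token <- setNames(values, conllu_fields)

print(token[["FORM"]])   # the word as written
print(token[["LEMMA"]])  # its base form
```

The tables shown later in this tutorial hold the same information, with one row per token and some additional columns such as doc_id and sentence.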

The Whole Story

A single function, fst_prepare() (which calls all the data preparation functions within the package), can be used to prepare the data into the required CoNLL-U format.

1. Child Barometer Data

Using our Child Barometer bullying data, we can call this function as follows:

prepd_bullying <- fst_prepare(
  data = child,
  question = "q7",
  id = "fsd_id",
  stopword_list = "nltk",
  language = "fi",
  model = "ftb",
  weights = NULL,
  add_cols = NULL
)

Summary of components

data is the dataframe of interest. In this case, we are using the ‘child’ dataframe that comes with the package. Alternatively, you can use a dataframe that you have read in yourself, for example with read.csv() in base R.
question is the name of the column in your data which contains the responses to the open-ended survey question. In this example, the responses about bullying are in column “q7”.
id is the name of the column containing a unique identifier for each response.
stopword_list is the list of stopwords to remove; we have chosen the “nltk” list in this example. To find the available lists of stopwords, you can run the fst_find_stopwords() function, which is outlined below. Punctuation is also removed from the data whenever stopwords are removed.
model is a language model available for udpipe. In this case we are using the default, FinnTreeBank, model = "ftb". (There are two options for the Finnish language model; the other is the Turku Dependency Treebank. For further detail on the treebanks, see the Format as CoNLL-U section below.)
weights and add_cols are optional: you can add a weight column and/or other columns for comparing different groups of respondents in later analysis. We are not adding these in this example.
manual and manual_list can be used if you want to manually provide a list of stopwords to remove from the data.
The results in CoNLL-U format are stored in the local environment as “prepd_bullying”.

2. Development Cooperation Data

As an example, the Development Cooperation survey q11_2 data could be prepared using this function call:

prepd_dev <- fst_prepare(
  data = dev_coop,
  question = "q11_2",
  id = "fsd_id",
  stopword_list = "none",
  model = "tdt",
  weights = NULL,
  add_cols = NULL,
  manual = FALSE,
  manual_list = ""
)

data is the dataframe of interest. In this case, we are using the ‘dev_coop’ dataframe that comes with the package. Alternatively, you can use a dataframe that you have read in yourself, for example with read.csv() in base R.
question is the name of the column in your data which contains the responses to the open-ended survey question. In this example, the responses are in column “q11_2”.
We have chosen not to remove stopwords in this example (stopword_list = "none").
The function also requires a language model available for udpipe; in this case we are using the Turku Dependency Treebank, model = "tdt".
The results in CoNLL-U format are stored in the local environment as “prepd_dev”.

In greater detail

To better understand the fst_prepare() function, we will go through each of the functions that this one calls. These are:

fst_format()
fst_rm_stop_punct()

Additionally, the fst_find_stopwords() function can be used to find the currently available lists of stopwords for exclusion from the data. (The default “language” is “fi”.) The “Name” column can be used to choose a list for the stopword_list argument above. Stopword lists are lists of common words (eg. “and”, “the”, and “is”, or in Finnish “olla”, “olet”, “olen”, and “on”) which are often filtered out of the data, leaving the less frequently occurring, and thus more meaningful, words.

stopwords <- fst_find_stopwords("fi")

The stopwords lists can be very long, so only one (nltk) is shown below. Another two lists, snowball and stopwords-iso, can be found by running the fst_find_stopwords() function in your local environment.

Name	Stopwords	Length
nltk	olla , olen , olet , on , olemme , olette , ovat , ole , oli , olisi , olisit , olisin , olisimme, olisitte, olisivat, olit , olin , olimme , olitte , olivat , ollut , olleet , en , et , ei , emme , ette , eivät , minä , minun , minut , minua , minussa , minusta , minuun , minulla , minulta , minulle , sinä , sinun , sinut , sinua , sinussa , sinusta , sinuun , sinulla , sinulta , sinulle , hän , hänen , hänet , häntä , hänessä , hänestä , häneen , hänellä , häneltä , hänelle , me , meidän , meidät , meitä , meissä , meistä , meihin , meillä , meiltä , meille , te , teidän , teidät , teitä , teissä , teistä , teihin , teillä , teiltä , teille , he , heidän , heidät , heitä , heissä , heistä , heihin , heillä , heiltä , heille , tämä , tämän , tätä , tässä , tästä , tähän , tallä , tältä , tälle , tänä , täksi , tuo , tuon , tuotä , tuossa , tuosta , tuohon , tuolla , tuolta , tuolle , tuona , tuoksi , se , sen , sitä , siinä , siitä , siihen , sillä , siltä , sille , siksi , nämä , näiden , näitä , näissä , näistä , näihin , näillä , näiltä , näille , näinä , näiksi , nuo , noiden , noita , noissa , noista , noihin , noilla , noilta , noille , noina , noiksi , ne , niiden , niitä , niissä , niistä , niihin , niillä , niiltä , niille , niinä , niiksi , kuka , kenen , kenet , ketä , kenessä , kenestä , keneen , kenellä , keneltä , kenelle , kenenä , keneksi , ketkä , keiden , keitä , keissä , keistä , keihin , keillä , keiltä , keille , keinä , keiksi , mikä , minkä , mitä , missä , mistä , mihin , millä , miltä , mille , miksi , mitkä , joka , jonka , jota , jossa , josta , johon , jolla , jolta , jolle , jona , joksi , jotka , joiden , joita , joissa , joista , joihin , joilla , joilta , joille , joina , joiksi , että , ja , jos , koska , kuin , mutta , niin , sekä , tai , vaan , vai , vaikka , kanssa , mukaan , noin , poikki , yli , kun , nyt , itse	229

Format as CoNLL-U

fst_format()

This function is used to format the data from your open-ended survey question into CoNLL-U format. It also:

trims trailing whitespace from the data;
converts the ‘lemma’ and ‘token’ columns to lowercase; and,
removes NA values.
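
The three clean-up steps listed above can be sketched in base R as follows. This is only an illustration on made-up responses, with a plain whitespace split standing in for real tokenisation; it is not the package's own implementation.

```r
# Hypothetical raw responses, including a missing value and trailing spaces.
responses <- c("Jos kaveri kiusaa.  ", NA, "Kun joku HAUKKUU. ")

# Drop missing responses, then trim trailing whitespace.
responses <- responses[!is.na(responses)]
responses <- trimws(responses, which = "right")

# Convert the tokens to lowercase. A whitespace split stands in for the
# tokenisation that udpipe would perform.
tokens <- tolower(unlist(strsplit(responses, " ", fixed = TRUE)))

print(responses)
print(tokens)
```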

Our package works for two of the Finnish language models available, Turku Dependency Treebank (TDT) and FinnTreeBank (FTB). Further information about these treebanks can be found at the links but, in brief, the TDT is considered “broad coverage” and includes texts from Wikipedia and news sources, and FTB consists of manually annotated grammatical examples from VISK.

The fst_format() function utilises the udpipe package and can be run as follows:

conllu_dev_q11_1 <- fst_format(data = dev_coop, question = "q11_1", id = 'fsd_id')
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/finnish-ftb-ud-2.5-191206.udpipe to /home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/finnish-ftb-ud-2.5-191206.udpipe
#>  - This model has been trained on version 2.5 of data from https://universaldependencies.org
#>  - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
#>  - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
#>  - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
#> Downloading finished, model stored at '/home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/finnish-ftb-ud-2.5-191206.udpipe'
conllu_cb_bullying <- fst_format(data = child, question = "q7", model = "tdt", id = 'fsd_id')
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/finnish-tdt-ud-2.5-191206.udpipe to /home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/finnish-tdt-ud-2.5-191206.udpipe
#>  - This model has been trained on version 2.5 of data from https://universaldependencies.org
#>  - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
#>  - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
#>  - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
#> Downloading finished, model stored at '/home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/finnish-tdt-ud-2.5-191206.udpipe'

Note: the first time you run this function, it will download the relevant udpipe language model for use in the annotations.

The first 6 rows of the “conllu_cb_bullying” table are shown below:

doc_id	paragraph_id	sentence_id	sentence	token_id	token	lemma	upos	xpos	feats	head_token_id	dep_rel	deps	misc
1	1	1	Jos kaveri tönii.	1	jos	jos	SCONJ	C	NA	3	mark	NA	NA
1	1	1	Jos kaveri tönii.	2	kaveri	kaveri	NOUN	N	Case=Nom|Number=Sing	3	nsubj	NA	NA
1	1	1	Jos kaveri tönii.	3	tönii	töniä	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	NA	SpaceAfter=No
1	1	1	Jos kaveri tönii.	4	.	.	PUNCT	Punct	NA	3	punct	NA	NA
1	1	2	Jos lyö.	1	jos	jos	SCONJ	C	NA	2	mark	NA	NA
1	1	2	Jos lyö.	2	lyö	lyödä	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	NA	SpaceAfter=No

fst_format() takes 6 arguments:

data is the dataframe containing the survey data.
question is the open-ended survey question header in the table, such as “q9”.
id is the unique ID for each survey response.
model is the chosen Finnish treebank for annotation, either “ftb” (the default) or “tdt”.
weights, optional, is a column containing weights for the responses.
add_cols, optional, is any other columns to bring into the formatted data.

Remove stopwords and punctuation from CoNLL-U data

fst_rm_stop_punct()

This (optional) function removes stopwords and punctuation from the CoNLL-U data. fst_find_stopwords() can be used to find the available stopword lists.

fst_rm_stop_punct() takes 2 arguments:

data is the output from fst_format().
stopword_list is a list of Finnish stopwords. The default is “nltk”, but any value in the “Name” column from fst_find_stopwords() can be used.

conllu_dev_q11_1_nltk <- fst_rm_stop_punct(data = conllu_dev_q11_1)
conllu_cb_bullying_iso <- fst_rm_stop_punct(conllu_cb_bullying, "stopwords-iso")

The first 6 rows of the “conllu_cb_bullying_iso” table are shown below:

doc_id	paragraph_id	sentence_id	sentence	token_id	token	lemma	upos	xpos	feats	head_token_id	dep_rel	deps	misc
1	1	1	Jos kaveri tönii.	2	kaveri	kaveri	NOUN	N	Case=Nom|Number=Sing	3	nsubj	NA	NA
1	1	1	Jos kaveri tönii.	3	tönii	töniä	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	NA	SpaceAfter=No
1	1	2	Jos lyö.	2	lyö	lyödä	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	NA	SpaceAfter=No
1	1	3	Jos tappelee.	2	tappelee	tappella	VERB	V	Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act	0	root	NA	SpaceAfter=No
2	1	1	Kun nimitellään ja tönitään.	2	nimitellään	nimitellä	VERB	V	Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass	0	root	NA	NA
2	1	1	Kun nimitellään ja tönitään.	4	tönitään	töniä	VERB	V	Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass	2	conj	NA	SpaceAfter=No
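
The effect of this step can be sketched in base R on a few annotated tokens from the first response. This is a simplified illustration, not the package's implementation: it assumes that stopwords are matched against the lemma column and that punctuation is identified by the PUNCT tag, and it uses only a five-word excerpt of the nltk list shown earlier.

```r
# A few annotated tokens from the first bullying response (three columns only).
annotated <- data.frame(
  token = c("jos", "kaveri", "tönii", ".", "jos", "lyö"),
  lemma = c("jos", "kaveri", "töniä", ".", "jos", "lyödä"),
  upos  = c("SCONJ", "NOUN", "VERB", "PUNCT", "SCONJ", "VERB"),
  stringsAsFactors = FALSE
)

# A short excerpt of Finnish stopwords (assumption: matching is done on lemmas).
stopwords_fi <- c("olla", "on", "jos", "ja", "kun")

# Keep rows that are neither punctuation nor a stopword.
is_punct <- annotated$upos == "PUNCT"
is_stopword <- annotated$lemma %in% stopwords_fi
kept <- annotated[!is_punct & !is_stopword, ]

print(kept)
```

Only the content words (kaveri, tönii, lyö) remain, which mirrors the rows for these two sentences in the table above.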

Conclusion

Now that your data is in CoNLL-U format, it is ready for analysis using the other finnsurveytext functions. For more information on these, please review the other vignettes in this package.

Citation

The Office of Ombudsman for Children: Child Barometer 2016 [dataset]. Version 1.0 (2016-12-09). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD3134

Finnish Children and Youth Foundation: Young People’s Views on Development Cooperation 2012 [dataset]. Version 2.0 (2019-01-22). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD2821