Introduction

The updated version of finnsurveytext works with svydesign objects, which can be created with the survey R package. There are two ways to use a svydesign object:

  1. As an input during the pre-processing of your data.
  2. As a way to add weights and additional columns within the data exploration and comparison functions.

First, let’s create a svydesign object for use in this tutorial. We will build it from the dev_coop sample dataset:

svy_d <- survey::svydesign(id = ~1, 
                           weights = ~paino, 
                           data = dev_coop)
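Before passing the object on, it can help to confirm that the weights were picked up. The following is a minimal sketch, assuming the survey package is installed. The toy_survey data frame and its values are invented for illustration; only the column names fsd_id, q11_3 and paino mirror the ones used in this tutorial.

```r
library(survey)

# Hypothetical toy data standing in for dev_coop (all values are invented)
toy_survey <- data.frame(
  fsd_id = 1:4,
  q11_3  = c("sota", "ilmasto", "talous", "sota"),
  paino  = c(0.5, 1.2, 0.8, 1.5)
)

# Same call shape as above: no clusters (id = ~1), weights taken from 'paino'
toy_svy <- svydesign(id = ~1, weights = ~paino, data = toy_survey)

# The design keeps the original columns in $variables
names(toy_svy$variables)

# weights() returns the sampling weights, one per row of the data
toy_weights <- as.numeric(weights(toy_svy))
toy_weights
```

If the weights printed here do not match the weight column in your data, the formula passed to weights is the first thing to check.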

Option 1: Formatting data using svydesign object

The relevant function here is fst_prepare_svydesign().

Explanation of parameters:

  • svydesign: the name of the svydesign object
  • question: the column of ‘data’ that contains the open-ended question
  • id: the ID column in ‘data’
  • model: a language model available for udpipe, “ftb” or “tdt”. (If you do not provide a value, this will default to “ftb”.)
  • stopword_list: a list of stopwords to remove from your ‘data’. (To find relevant stopwords lists, you can run fst_find_stopwords(). If you do not provide a value, this will default to “nltk”. “manual” can be used to indicate that a manual list will be provided.)
  • language: the two-letter ISO code for the language of the stopword list. (If you do not provide a value, this will default to “fi”.)
  • use_weights: optional, a boolean as to whether to include weights from the svydesign. (If you do not provide a value, this will default to TRUE.)
  • add_cols: any columns you want to add to the formatted data from the svydesign object, such as for use in comparison functions. (If you do not provide any values, this will default to NULL.)
  • manual: an optional boolean to indicate that a manual list will be provided. (stopword_list = "manual" can also or instead be used. If you do not provide any values, this will default to FALSE.)
  • manual_list: the manual list of stopwords if you choose to provide one. (If you do not provide any values, this will default to an empty list.)

Let’s prepare our data below:

df <- fst_prepare_svydesign(svydesign = svy_d,
                            question = 'q11_3',
                            id = 'fsd_id',
                            model = 'tdt',
                            stopword_list = 'snowball',
                            use_weights = TRUE,
                            add_cols = c('gender','region')
                            )

The data is now formatted:

knitr::kable(head(df, 5))

| doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc | weight | gender | region |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 1 | saastuminen | saastuminen | NOUN | N | Case=Nom\|Derivation=Minen\|Number=Sing | 0 | root | NA | NA | 0.544 | Female | Etelä-Suomi |
| 1 | 1 | 1 | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 3 | luonnonvarojen | luonnonvara | NOUN | N | Case=Gen\|Number=Plur | 4 | nmod:poss | NA | NA | 0.544 | Female | Etelä-Suomi |
| 1 | 1 | 1 | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 4 | liikakäyttö | liikakäyttö | NOUN | N | Case=Nom\|Number=Sing | 1 | conj | NA | SpaceAfter=No | 0.544 | Female | Etelä-Suomi |
| 1 | 1 | 1 | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 6 | nälänhätä | nälänhä | VERB | V | Mood=Imp\|Number=Sing\|Person=2\|VerbForm=Fin\|Voice=Act | 1 | conj | NA | NA | 0.544 | Female | Etelä-Suomi |
| 1 | 1 | 1 | saastuminen ja luonnonvarojen liikakäyttö, nälänhätä ja ylikansoittuminen | 8 | ylikansoittuminen | ylikansoittuminen | NOUN | N | Case=Nom\|Derivation=Minen\|Number=Sing | 1 | conj | NA | SpacesAfter= | 0.544 | Female | Etelä-Suomi |

Option 2: Using svydesign object in data exploration

The svydesign object can be used to add weights and other columns during data exploration.

First, let’s create formatted data without weights and additional columns ready to use with our svydesign object.

df2 <- fst_prepare(data = dev_coop,
                   question = 'q11_3',
                   id = 'fsd_id',
                   model = 'ftb',
                   stopword_list = 'nltk',
                   weights = NULL,
                   add_cols = NULL)

Each of the data analysis functions has three parameters that are used to add information from the svydesign object:

  • use_svydesign_weights: should be set as TRUE if we want to use weights from within a svydesign object.
  • id: only required if weights are coming from a svydesign object
  • svydesign: the svydesign object

In the single-group functions (those not used to compare groups), these parameters add weights from the svydesign object.

For example,

fst_wordcloud(df2, 
              pos_filter = c("NOUN", "VERB", "ADJ", "ADV"),
              max = 50,
              use_svydesign_weights = TRUE, 
              id = 'fsd_id', 
              svydesign = svy_d)


fst_freq(df2,
         number = 10,
         norm = NULL,
         pos_filter = NULL,
         strict = TRUE,
         name = NULL,
         use_svydesign_weights = TRUE,
         id = "fsd_id",
         svydesign = svy_d,
         use_column_weights = FALSE)
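To see what weighting changes in a frequency count, here is a minimal base R sketch. It is not the package's implementation; toy_tokens and its values are invented, and the only assumption carried over from the tutorial is that the formatted data has one row per token with lemma and weight columns.

```r
# Hypothetical formatted data: one row per token, with the weight of the
# response that the token came from (all values are invented)
toy_tokens <- data.frame(
  doc_id = c(1, 1, 2, 3, 3),
  lemma  = c("sota", "talous", "sota", "talous", "ilmasto"),
  weight = c(0.5, 0.5, 2.0, 1.0, 1.0)
)

# Unweighted frequency: every occurrence counts as 1
unweighted <- table(toy_tokens$lemma)

# Weighted frequency: every occurrence counts as its response's weight
weighted <- tapply(toy_tokens$weight, toy_tokens$lemma, sum)

sort(weighted, decreasing = TRUE)
```

In this toy data "sota" and "talous" both occur twice, but "sota" ranks higher once weights are applied because one of its occurrences comes from a heavily weighted response.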

Within the comparison functions, we have the following additional parameter:

  • use_svydesign_field: set to TRUE to take the field used for splitting the data from the svydesign object.

For example:

fst_ngrams_compare(fst_dev_coop_2,
                   field = 'gender',
                   number = 10,
                   ngrams = 1,
                   norm = NULL,
                   pos_filter = NULL,
                   strict = TRUE,
                   use_svydesign_weights = TRUE,
                   use_svydesign_field = TRUE,
                   id = "fsd_id",
                   svydesign = svy_d,
                   use_column_weights = FALSE,
                   exclude_nulls = TRUE,
                   rename_nulls = 'null_data',
                   unique_colour = "indianred",
                   title_size = 20,
                   subtitle_size = 15)
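Conceptually, a comparison splits the formatted data by the chosen field and builds one weighted frequency table per group. The base R sketch below illustrates that idea only; it is not how fst_ngrams_compare() is implemented, and toy_tokens and its values are invented.

```r
# Hypothetical formatted data with a weight and a grouping column
# (all values are invented)
toy_tokens <- data.frame(
  lemma  = c("sota", "talous", "sota", "ilmasto", "talous"),
  weight = c(0.5, 0.5, 2.0, 1.0, 1.0),
  gender = c("Female", "Female", "Male", "Male", "Female")
)

# Split the rows by the comparison field, then sum the weights per lemma
# within each group
by_group <- lapply(split(toy_tokens, toy_tokens$gender), function(g) {
  sort(tapply(g$weight, g$lemma, sum), decreasing = TRUE)
})

by_group
```

Each element of by_group is a named vector of weighted counts for one group, which is the kind of per-group summary a comparison plot is built from.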

All of these functions call the function fst_use_svydesign() in the background to add the svydesign data to your formatted dataframe.

# FUNCTION DEFINITION:
fst_use_svydesign <- function(data,
                              svydesign,
                              id,
                              add_cols = NULL,
                              add_weights = TRUE)
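
The body of fst_use_svydesign() is not shown here. As a rough illustration of the kind of join such a helper has to perform, the base R sketch below attaches weights and extra columns to token-level data by respondent ID. Everything in it is hypothetical: add_survey_columns(), its weight_col parameter, the toy data frames, and the assumption that doc_id in the formatted data holds the respondent ID.

```r
# Hypothetical formatted data (one row per token); doc_id is assumed to
# hold the respondent ID
toy_formatted <- data.frame(
  doc_id = c("a", "a", "b", "c"),
  lemma  = c("sota", "talous", "ilmasto", "sota")
)

# Hypothetical survey data, standing in for the data inside a svydesign object
toy_survey <- data.frame(
  fsd_id = c("a", "b", "c"),
  paino  = c(0.5, 1.2, 0.8),
  gender = c("Female", "Male", "Female")
)

# Illustrative helper, NOT the package function
add_survey_columns <- function(data, survey_data, id, weight_col,
                               add_cols = NULL, add_weights = TRUE) {
  # Keep only the ID column and any requested extra columns
  keep <- survey_data[, c(id, add_cols), drop = FALSE]
  if (add_weights) {
    keep$weight <- survey_data[[weight_col]]
  }
  # Left join on respondent ID so every token row gets its respondent's values
  merge(data, keep, by.x = "doc_id", by.y = id, all.x = TRUE, sort = FALSE)
}

toy_joined <- add_survey_columns(toy_formatted, toy_survey,
                                 id = "fsd_id", weight_col = "paino",
                                 add_cols = "gender")
toy_joined
```

The join keeps one row per token, so a respondent's weight is repeated on each of their tokens, just as the weight column repeats in the formatted data shown under Option 1.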