Skip to contents

Introduction

This tutorial follows on from Tutorial 2 and guides you through creation of a concept network plot. A concept network is a way to visualise words which often occur together.

Our concept network function uses the TextRank algorithm which is a graph-based ranking model for text processing. Vertices represent words and co-occurrence between words is shown through edges. Word importance is determined recursively (through the unsupervised learning TextRank algorithm) where words get more weight based on how many words co-occur and the weight of these co-occurring words.

To utilise the TextRank algorithm in finnsurveytext, we use the textrank package. For further information on the package, please see this documentation. This package implements the TextRank and PageRank algorithms. (PageRank is the algorithm that Google uses to rank webpages.) You can read about the underlying TextRank algorithm here and about the PageRank algorithm here.

Installation of package.

Once the package is installed, you can load the finnsurveytext package as below: (Other required packages such as dplyr and stringr will also be installed if they are not currently installed in your environment.)

Overview of Functions

The functions covered in this tutorial are in r/03_concept_network.R. These are:

  1. fst_cn_search()
  2. fst_cn_edges()
  3. fst_cn_nodes()
  4. fst_cn_plot()
  5. fst_concept_network()

Data

There are two sets of data files available within the package which could be used in this tutorial. These files have been created following the process demonstrated in Tutorial 1. The suffixes ‘iso’, ‘nltk’ and ‘snow’ refer to the types of stopwords which have been removed during the data preparation activities.

1. Child Barometer Data

  • data/conllu_cb_bullying.rda
  • data/conllu_cb_bullying_iso.rda

2. Development Cooperation Data

  • data/conllu_dev_q11_1.rda
  • data/conllu_dev_q11_1_nltk.rda
  • data/conllu_dev_q11_1_snow.rda
  • data/conllu_dev_q11_2.rda
  • data/conllu_dev_q11_2_nltk.rda
  • data/conllu_dev_q11_3.rda
  • data/conllu_dev_q11_3_nltk.rda

You can read these in as follows:

bullying <- conllu_cb_bullying_iso
q11_1 <- conllu_dev_q11_1_nltk
q11_2 <- conllu_dev_q11_2_nltk
q11_3 <- conllu_dev_q11_3_nltk

knitr::kable(head(bullying))
doc_id paragraph_id sentence_id sentence token_id token lemma upos xpos feats head_token_id dep_rel deps misc
doc1 1 1 Jos kaveri tönii. 2 kaveri kaveri NOUN N,Sg,Nom Case=Nom|Number=Sing 3 nsubj NA NA
doc1 1 1 Jos kaveri tönii. 3 tönii tönia VERB V,Act,Ind,Pres,Sg3 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 0 root NA SpaceAfter=No
doc1 1 2 Jos lyö. 2 lyö lyödä VERB V,Act,Ind,Pres,Sg3 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 0 root NA SpaceAfter=No
doc1 1 3 Jos tappelee. 2 tappelee tapella VERB V,Act,Ind,Pres,Sg3 Mood=Ind|Number=Sing|Person=3|Tense=Pres|VerbForm=Fin|Voice=Act 0 root NA SpaceAfter=No
doc2 1 1 Kun nimitellään ja tönitään. 2 nimitellään nimitellä VERB V,Pass,Ind,Pres Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass 0 root NA NA
doc2 1 1 Kun nimitellään ja tönitään. 4 tönitään tönitä VERB V,Pass,Ind,Pres Mood=Ind|Tense=Pres|VerbForm=Fin|Voice=Pass 2 conj NA SpaceAfter=No

FUNCTIONS

Concept Network - Search TextRank for Concepts

fst_cn_search()

This function is used to find words which are related to a list of provided terms. It utilises the textrank_keywords() function which is part of the textrank package.

This function goes through the following process:

  1. separates the search string into individual terms
  2. applies the textrank_keywords() function which finds n-grams in the text that occur multiple times
  3. finds all pairs of words included in these common ngrams
  4. filters the pairs so that at least one of the pair is a searched term.

The function fst_cn_search() is demonstrated below.

bullying_concepts <- fst_cn_search(
  data = bullying,
  concepts = "kiusata, lyöminen, lyödä, potkia"
)
q11_2_concepts <- fst_cn_search(
  data = conllu_dev_q11_2_nltk,
  concepts = "kehitysmaa, auttaa, pyrkiä, maa, ihminen",
  pos_filter = c("ADV", "ADJ")
)
q11_3_concepts <- fst_cn_search(
  data = conllu_dev_q11_3_nltk,
  concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute"
)

The resulting dataframe is formatted as below:

knitr::kable(head(q11_3_concepts, n = 10))
keyword ngram freq word1 word2
nälänhätä-sota 2 7 nälänhätä sota
köyhyys-nälänhätä 2 6 köyhyys nälänhätä
nälänhätä-huono 2 3 nälänhätä huono
sota-nälänhätä-puhdas-vesi-puute 5 3 sota nälänhätä
sota-nälänhätä-puhdas-vesi-puute 5 3 nälänhätä puhdas
sota-nälänhätä-puhdas-vesi-puute 5 3 puhdas vesi
sota-nälänhätä-puhdas-vesi-puute 5 3 vesi puute
sota-köyhyys 2 3 sota köyhyys
ilmastonmuutos-nälänhätä 2 3 ilmastonmuutos nälänhätä
sota-nälänhätä 2 3 sota nälänhätä

To run fst_cn_search, we provide the following arguments to the function:

  1. data which is output from data preparation, prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
  2. concepts is a string of concept terms to search for, separated by commas.
  3. relevant_pos is a list of UPOS tags for inclusion, default is c(“NOUN”, “VERB”, “ADJ”, “ADV”).

Concept Network - Get TextRank Edges

fst_cn_edges()

The Get Edges function runs the search function, fst_cn_search() and then filters for edges (pairs of co-occurring words where one is a concept word) which are larger than the threshold (occurring enough times). The resulting dataframe is simplified in preparation for plotting.

bullying_edges <- fst_cn_edges(
  data = bullying,
  concepts = "kiusata, lyöminen, lyödä, potkia"
)
q11_2_edges <- fst_cn_edges(
  data = conllu_dev_q11_2_nltk,
  concepts = "kehitysmaa, auttaa, pyrkiä, maa, ihminen",
  threshold = 5,
  norm = "number_resp"
)
q11_3_edges <- fst_cn_edges(
  data = conllu_dev_q11_3_nltk,
  concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute",
  threshold = 2
)

The dataframe has a simplified format from the fst_cn_search() results, with only columns “to”, “from” and “n” which indicates the number of occurrences.

knitr::kable(head(q11_2_edges, n = 10))
from to co_occurrence
apu tarvitsla 0.00882
asua ihminen 0.00662
auttaa apu 0.00882
auttaa kehitysmaa 0.07280
auttaa köys 0.00662
ihminen auttaa 0.00551
kehitysmaa asua 0.00992
kehitysmaa auttaa 0.01650
kehitysmaa ihminen 0.01430
kehitysmaa kehittyä 0.00662

The arguments are the same as for fst_cn_search() plus the threshold.

  1. data which is output from data preparation, prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
  2. concepts is a string of concept terms to search for, separated by commas.
  3. threshold is the minimum number of occurrences threshold for ‘edge’ between a concept term and other word, default is NULL.
  4. norm is the method for normalising the data. Valid settings are 'number_words' (the number of words in the responses, default), 'number_resp' (the number of responses), or NULL (raw count returned). Normalisation occurs after the threshold (if it exists) is applied.
  5. pos_filter is a list of UPOS tags for inclusion, default is NULL.

Concept Network - Get TextRank Nodes

fst_cn_nodes()

This function runs the textrank_keywords() function which is part of the textrank package and returns a dataframe containing relevant lemmas and their associated PageRank.

It is demonstrated as follows:

bullying_nodes <- fst_cn_nodes(data = bullying, edges = bullying_edges)
q11_2_nodes <- fst_cn_nodes(data = conllu_dev_q11_2_nltk, edges = q11_2_edges)
q11_3_nodes <- fst_cn_nodes(data = conllu_dev_q11_3_nltk, edges = q11_3_edges)
knitr::kable(head(q11_2_nodes, n = 10))
lemma pagerank
auttaa 0.0799940
pyrkiä 0.0358074
kehitysmaa 0.0768148
parantaa 0.0223635
ihminen 0.0250921
apu 0.0089812
olo 0.0123289
asua 0.0030227
kehittyä 0.0059800
tarvitsla 0.0034266

fst_cn_nodes() requires the following arguments:

  1. data which is output from data preparation, prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
  2. edges is the output from fst_cn_edges().
  3. pos_filter is a list of UPOS tags for inclusion, default is NULL.

Plot Concept Network

fst_cn_plot()

This function takes the output of the previous functions and plots the concept network. Edges between words in the plot show the number of occurrences with thicker and more opaque edges showing more occurrences. Similarly, the size of the word circle indicates the PageRank with higher PageRank resulting in a larger circle. Concept words are coloured red and other terms are black.

fst_cn_plot(
  edges = bullying_edges,
  nodes = bullying_nodes,
  concepts = "kiusata, lyöminen",
  title = "Bullying Data CN"
)

Recall that for q11_2_edges we have set the threshold as 5.

fst_cn_plot(
  edges = q11_2_edges,
  nodes = q11_2_nodes,
  concepts = "kehitysmaa, auttaa, pyrkiä, maa, ihminen"
)

Recall that for q11_3_edges we have set the threshold as 2. This is why there are fewer words included in the plot despite more words available (including stopwords) in this data.

fst_cn_plot(
  edges = q11_3_edges,
  nodes = q11_3_nodes,
  concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute"
)

As fst_cn_plot() uses results from the previous functions, it has 4 arguments:

  1. edges is the output of fst_cn_edges().
  2. nodes is the output of fst_cn_nodes()
  3. concepts is a list of terms which have been searched for, separated by commas.
  4. title is an optional title for plot, default is NULL and a generic title, (‘TextRank extracted keyword occurrences’) will be used.

Plot Concept Network

fst_concept_network()

If you don’t want to run all of the individual functions, fst_cn_search(), fst_cn_edges(), fst_cn_nodes(), and fst_cn_plot(), you can run them all within the one function, fst_concept_network().

This function is run as follows:

fst_concept_network(
  data = q11_1,
  concepts = "elintaso, köyhä, ihminen"
)

fst_concept_network(
  data = q11_1,
  concepts = "elintaso, köyhä, ihminen",
  threshold = 3
)

fst_concept_network(
  data = q11_2,
  concepts = "kehitysmaa, auttaa, pyrkiä, maa, ihminen"
)

fst_concept_network(
  data = q11_2,
  concepts = "kehitysmaa, auttaa, pyrkiä, maa, ihminen",
  threshold = 10
)

fst_concept_network(
  data = q11_2,
  concepts = "kehitysmaa, auttaa, pyrkiä, maa, ihminen",
  threshold = 5
)

fst_concept_network(
  data = q11_3,
  concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute"
)

fst_concept_network(
  data = q11_3,
  concepts = "köyhyys, nälänhätä, sota, ilmastonmuutos, puute",
  threshold = 3
)

fst_concept_network(
  data = bullying,
  concepts = "kiusata, lyöminen",
  title = "Concept Network of Bullying Data"
)

The arguments are:

  1. data which is output from data preparation, prepared data in CoNLL-U format, such as the output of fst_prepare_connlu().
  2. concept is a string of concept terms to search for, separated by commas.
  3. threshold is the minimum number of occurrences threshold for ‘edge’ between a concept term and other word, default is NULL.
  4. norm is the method for normalising the data. Valid settings are 'number_words' (the number of words in the responses, default), 'number_resp' (the number of responses), or NULL (raw count returned). Normalisation occurs after the threshold (if it exists) is applied.
  5. pos_filter is a list of UPOS tags for inclusion, default is NULL.
  6. title is an optional title for plot, default is NULL and a generic title, (‘TextRank extracted keyword occurrences’) will be used.

Conclusion

This tutorial ran you through the functions used to create Concept Networks which are included in finnsurveytext. A concept network visualises word importance and co-occurrence between words.

Citation

The Office of Ombudsman for Children: Child Barometer 2016 [dataset]. Version 1.0 (2016-12-09). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD3134

Finnish Children and Youth Foundation: Young People’s Views on Development Cooperation 2012 [dataset]. Version 2.0 (2019-01-22). Finnish Social Science Data Archive [distributor]. http://urn.fi/urn:nbn:fi:fsd:T-FSD2821