Extra-AnalysingOtherLanguages
Source:vignettes/web_only/Extra-AnalysingOtherLanguages.Rmd
How to Use finnsurveytext in another language!
Despite the package’s name, finnsurveytext can be used to analyse surveys in LOTS of different languages. This vignette aims to explain how to use finnsurveytext in another language with as little additional effort as possible.
The reason finnsurveytext can be used with other languages is that the packages it employs to process the raw survey data work in multiple languages! So we have the developers of the udpipe and stopwords packages to thank!
The package includes a survey in English, english_sample_survey, which we will use to demonstrate the package in a language other than Finnish.
id | label | label_coder1 | label_coder2 | text |
---|---|---|---|---|
1 | proactive | proactive | proactive | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. |
2 | proactive | proactive | proactive | I think he should have the receptionist talk to the doctor to make sure that he gets in there at the appropriate time; find out if it actually can be two weeks or if two weeks later would be OK. |
3 | proactive | proactive | proactive | Joe should talk to the doctor and make arrangements to come in in two weeks. He was pretty specific about that. |
4 | proactive | proactive | proactive | I think Joe should insist on an appointment in two weeks. |
5 | proactive | proactive | proactive | Joe should discuss this with the receptionist as to what the doctor told him to do. And insist on seeing him at two weeks. |
1. Essential: Your language has a language model available for udpipe
The udpipe package is available from CRAN. The relevant udpipe function we use is udpipe::udpipe_download_model. You can see the list of available models in the udpipe manual.
At the time of writing this vignette, these were:
afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb
Alternatively, you can find the list of available models by running fst_print_available_models(). If you provide a search term, the list is filtered to the models whose names contain that term:
fst_print_available_models()
#> [1] "afrikaans-afribooms" "ancient_greek-perseus"
#> [3] "ancient_greek-proiel" "arabic-padt"
#> [5] "armenian-armtdp" "basque-bdt"
#> [7] "belarusian-hse" "bulgarian-btb"
#> [9] "buryat-bdt" "catalan-ancora"
#> [11] "chinese-gsd" "chinese-gsdsimp"
#> [13] "classical_chinese-kyoto" "coptic-scriptorium"
#> [15] "croatian-set" "czech-cac"
#> [17] "czech-cltt" "czech-fictree"
#> [19] "czech-pdt" "danish-ddt"
#> [21] "dutch-alpino" "dutch-lassysmall"
#> [23] "english-ewt" "english-gum"
#> [25] "english-lines" "english-partut"
#> [27] "estonian-edt" "estonian-ewt"
#> [29] "finnish-ftb" "finnish-tdt"
#> [31] "french-gsd" "french-partut"
#> [33] "french-sequoia" "french-spoken"
#> [35] "galician-ctg" "galician-treegal"
#> [37] "german-gsd" "german-hdt"
#> [39] "gothic-proiel" "greek-gdt"
#> [41] "hebrew-htb" "hindi-hdtb"
#> [43] "hungarian-szeged" "indonesian-gsd"
#> [45] "irish-idt" "italian-isdt"
#> [47] "italian-partut" "italian-postwita"
#> [49] "italian-twittiro" "italian-vit"
#> [51] "japanese-gsd" "kazakh-ktb"
#> [53] "korean-gsd" "korean-kaist"
#> [55] "kurmanji-mg" "latin-ittb"
#> [57] "latin-perseus" "latin-proiel"
#> [59] "latvian-lvtb" "lithuanian-alksnis"
#> [61] "lithuanian-hse" "maltese-mudt"
#> [63] "marathi-ufal" "north_sami-giella"
#> [65] "norwegian-bokmaal" "norwegian-nynorsk"
#> [67] "norwegian-nynorsklia" "old_church_slavonic-proiel"
#> [69] "old_french-srcmf" "old_russian-torot"
#> [71] "persian-seraji" "polish-lfg"
#> [73] "polish-pdb" "polish-sz"
#> [75] "portuguese-bosque" "portuguese-br"
#> [77] "portuguese-gsd" "romanian-nonstandard"
#> [79] "romanian-rrt" "russian-gsd"
#> [81] "russian-syntagrus" "russian-taiga"
#> [83] "sanskrit-ufal" "scottish_gaelic-arcosg"
#> [85] "serbian-set" "slovak-snk"
#> [87] "slovenian-ssj" "slovenian-sst"
#> [89] "spanish-ancora" "spanish-gsd"
#> [91] "swedish-lines" "swedish-talbanken"
#> [93] "tamil-ttb" "telugu-mtg"
#> [95] "turkish-imst" "ukrainian-iu"
#> [97] "upper_sorbian-ufal" "urdu-udtb"
#> [99] "uyghur-udt" "vietnamese-vtb"
#> [101] "wolof-wtb"
fst_print_available_models(search = 'estonian')
#> [1] "estonian-edt" "estonian-ewt"
fst_print_available_models('sami')
#> [1] "north_sami-giella"
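As a rough illustration of what filtering by a search term amounts to, the base R sketch below keeps the model names that contain a given string. The helper filter_models() is hypothetical and is not the package's implementation; the model names are a small subset copied from the list above.

```r
# A few model names copied from the list above.
models <- c("english-ewt", "english-gum", "estonian-edt", "estonian-ewt",
            "finnish-ftb", "finnish-tdt", "north_sami-giella")

# Hypothetical helper (not part of finnsurveytext): keep the names that
# contain the search term, or return everything if no term is given.
filter_models <- function(models, search = NULL) {
  if (is.null(search)) {
    return(models)
  }
  models[grepl(search, models, fixed = TRUE)]
}

filter_models(models, "estonian")
filter_models(models, "sami")
```

Because the match is on a substring, a term such as "sami" finds "north_sami-giella" even though the model name does not start with it.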
How to use:
The relevant model, e.g. “swedish-talbanken”, should be passed as the model input to fst_format() or fst_prepare().
Demonstration:
We find an English model and format our English data below:
fst_print_available_models("english")
#> [1] "english-ewt" "english-gum" "english-lines" "english-partut"
en_df <- fst_format(data = english_sample_survey,
question = 'text',
id = 'id',
model = 'english-ewt'
)
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to /home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/english-ewt-ud-2.5-191206.udpipe
#> - This model has been trained on version 2.5 of data from https://universaldependencies.org
#> - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
#> - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
#> - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
#> Downloading finished, model stored at '/home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/english-ewt-ud-2.5-191206.udpipe'
knitr::kable(head(en_df, 5))
doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 1 | joe | joe | PROPN | NNP | Number=Sing | 3 | nsubj | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 2 | should | should | AUX | MD | VerbForm=Fin | 3 | aux | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 3 | talk | talk | VERB | VB | VerbForm=Inf | 0 | root | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 4 | to | to | ADP | IN | NA | 6 | case | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 5 | the | the | DET | DT | Definite=Def|PronType=Art | 6 | det | NA | NA |
2. Recommended: Your language has a stopwords list available in the stopwords package
The stopwords package is available from CRAN. The relevant stopwords functions are stopwords::stopwords, stopwords::stopwords_getsources and stopwords::stopwords_getlanguages. We recommend you first identify the two-letter ISO code for the language you are using. You can see the list of available sources and languages in the stopwords manual or by running the ‘get sources’ and ‘get languages’ functions:
stopwords_getsources()
#> [1] "snowball" "stopwords-iso" "misc" "smart"
#> [5] "marimo" "ancient" "nltk" "perseus"
stopwords::stopwords_getlanguages(source = 'nltk')
#> [1] "ar" "az" "da" "nl" "en" "fi" "fr" "de" "el" "hu" "id" "it" "kk" "ne" "no"
#> [16] "pt" "ro" "ru" "sl" "es" "sv" "tg" "tr"
stopwords('da', source = 'nltk')
#> [1] "og" "i" "jeg" "det" "at" "en" "den" "til"
#> [9] "er" "som" "på" "de" "med" "han" "af" "for"
#> [17] "ikke" "der" "var" "mig" "sig" "men" "et" "har"
#> [25] "om" "vi" "min" "havde" "ham" "hun" "nu" "over"
#> [33] "da" "fra" "du" "ud" "sin" "dem" "os" "op"
#> [41] "man" "hans" "hvor" "eller" "hvad" "skal" "selv" "her"
#> [49] "alle" "vil" "blev" "kunne" "ind" "når" "være" "dog"
#> [57] "noget" "ville" "jo" "deres" "efter" "ned" "skulle" "denne"
#> [65] "end" "dette" "mit" "også" "under" "have" "dig" "anden"
#> [73] "hende" "mine" "alt" "meget" "sit" "sine" "vor" "mod"
#> [81] "disse" "hvis" "din" "nogle" "hos" "blive" "mange" "ad"
#> [89] "bliver" "hendes" "været" "thi" "jer" "sådan"
stopwords('da') # The default source is 'snowball'
#> [1] "og" "i" "jeg" "det" "at" "en" "den" "til"
#> [9] "er" "som" "på" "de" "med" "han" "af" "for"
#> [17] "ikke" "der" "var" "mig" "sig" "men" "et" "har"
#> [25] "om" "vi" "min" "havde" "ham" "hun" "nu" "over"
#> [33] "da" "fra" "du" "ud" "sin" "dem" "os" "op"
#> [41] "man" "hans" "hvor" "eller" "hvad" "skal" "selv" "her"
#> [49] "alle" "vil" "blev" "kunne" "ind" "når" "være" "dog"
#> [57] "noget" "ville" "jo" "deres" "efter" "ned" "skulle" "denne"
#> [65] "end" "dette" "mit" "også" "under" "have" "dig" "anden"
#> [73] "hende" "mine" "alt" "meget" "sit" "sine" "vor" "mod"
#> [81] "disse" "hvis" "din" "nogle" "hos" "blive" "mange" "ad"
#> [89] "bliver" "hendes" "været" "thi" "jer" "sådan"
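The two Danish lists printed above appear to contain the same words, but with longer lists it is easier to check this in code than by eye. The sketch below is one possible way to do so in base R; compare_stopwords() is a hypothetical helper, and the two short vectors are stand-ins rather than real stopword lists.

```r
# Hypothetical helper (not part of finnsurveytext or stopwords):
# summarise how two stopword lists differ.
compare_stopwords <- function(a, b) {
  list(
    identical_sets = setequal(a, b),  # TRUE if both hold the same words
    only_in_a = setdiff(a, b),        # words missing from b
    only_in_b = setdiff(b, a)         # words missing from a
  )
}

# Short stand-in vectors. In practice these could be the results of
# stopwords::stopwords("da", source = "nltk") and stopwords::stopwords("da").
list_a <- c("og", "i", "jeg", "det")
list_b <- c("og", "i", "jeg", "at")

compare_stopwords(list_a, list_b)
```

If identical_sets is TRUE for the two sources you are considering, the choice between them makes no difference to which words are removed.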
Alternatively, you can use our function fst_find_stopwords() to simplify this process. For a given language, this function returns a table of the lists available through the stopwords package, along with their contents for comparison (useful if you have multiple options!). To run it, you need the two-letter ISO language code:
knitr::kable(fst_find_stopwords(language = 'lv'))
Name | Stopwords | Length |
---|---|---|
stopwords-iso | aiz , ap , apakš , apakšpus , ar , arī , augšpus , bet , bez , bija , biji , biju , bijām , bijāt , būs , būsi , būsiet , būsim , būt , būšu , caur , diemžēl , diezin , droši , dēļ , esam , esat , esi , esmu , gan , gar , iekam , iekams , iekām , iekāms , iekš , iekšpus , ik , ir , it , itin , iz , ja , jau , jeb , jebšu , jel , jo , jā , ka , kamēr , kaut , kolīdz , kopš , kā , kļuva , kļuvi , kļuvu , kļuvām , kļuvāt , kļūs , kļūsi , kļūsiet , kļūsim , kļūst , kļūstam , kļūstat , kļūsti , kļūstu , kļūt , kļūšu , labad , lai , lejpus , līdz , līdzko , ne , nebūt , nedz , nekā , nevis , nezin , no , nu , nē , otrpus , pa , par , pat , pie , pirms , pret , priekš , pār , pēc , starp , tad , tak , tapi , taps , tapsi , tapsiet , tapsim , tapt , tapāt , tapšu , taču , te , tiec , tiek , tiekam , tiekat , tieku , tik , tika , tikai , tiki , tikko , tiklab , tiklīdz , tiks , tiksiet , tiksim , tikt , tiku , tikvien , tikām , tikāt , tikšu , tomēr , topat , turpretim, turpretī , tā , tādēļ , tālab , tāpēc , un , uz , vai , var , varat , varēja , varēji , varēju , varējām , varējāt , varēs , varēsi , varēsiet , varēsim , varēt , varēšu , vien , virs , virspus , vis , viņpus , zem , ārpus , šaipus | 161 |
fst_find_stopwords(language = 'no')
#> # A tibble: 3 × 3
#> Name Stopwords Length
#> <chr> <list> <list>
#> 1 nltk <chr [172]> <int [1]>
#> 2 snowball <chr [176]> <int [1]>
#> 3 stopwords-iso <chr [221]> <int [1]>
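When the table is printed as a tibble, the Stopwords column is a list column, so each cell shows only a summary such as a type and length. The sketch below shows one way to pull a single list out of a table of this shape using base R. The table here is a small stand-in built by hand, truncated to the first four words of three of the English lists as they appear in the English demonstration in this vignette; it is not the output of fst_find_stopwords() itself.

```r
# Stand-in table with the same three columns, truncated to four words per
# list (an illustration only, not the real output).
stopword_table <- data.frame(Name = c("marimo", "nltk", "snowball"),
                             stringsAsFactors = FALSE)
stopword_table$Stopwords <- list(
  c("i", "me", "myself", "we"),
  c("i", "me", "my", "myself"),
  c("i", "me", "my", "myself")
)
stopword_table$Length <- lengths(stopword_table$Stopwords)

# Pull out the character vector stored in the 'nltk' row.
nltk_words <- stopword_table$Stopwords[[which(stopword_table$Name == "nltk")]]
nltk_words
```

The double brackets matter: single brackets would return a list of length one rather than the character vector of words.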
How to use:
The relevant language and stopword list (‘source’), e.g. “sv” and “nltk”, should be used for the language and stopword_list inputs respectively in fst_prepare() (or fst_rm_stop_punct(), which is automatically called within fst_prepare()).
Demonstration:
We can find and compare English stopwords lists as below. Once we have chosen a stopwords list, we can run fst_prepare() to format the data and remove the stopwords:
knitr::kable(head(fst_find_stopwords(language = 'en'), 5))
Name | Stopwords | Length |
---|---|---|
marimo | i , me , myself , we , ours , ourselves , you , yours , yourself , yourselves, he , him , himself , she , hers , herself , it , itself , they , them , theirs , themselves, this , that , these , those , my , our , your , his , her , its , their , what , which , who , whom , whose , when , where , why , how , i’m , you’re , he’s , she’s , it’s , we’re , they’re , i’ve , you’ve , we’ve , they’ve , i’d , you’d , he’d , she’d , we’d , they’d , i’ll , you’ll , he’ll , she’ll , we’ll , they’ll , am , is , are , was , were , be , been , being , have , has , had , having , do , does , did , doing , would , should , could , ought , will , isn’t , aren’t , wasn’t , weren’t , hasn’t , haven’t , hadn’t , doesn’t , don’t , didn’t , won’t , wouldn’t , shan’t , shouldn’t , can’t , cannot , couldn’t , mustn’t , let’s , that’s , who’s , what’s , here’s , there’s , when’s , where’s , why’s , how’s , say , says , said , tell , tells , told , report , reports , reported , a , an , the , and , but , if , or , because , so , while , nor , as , until , once , here , there , all , any , both , each , few , many , more , most , other , some , such , no , not , only , then , too , very , little , less , of , at , by , for , with , about , against , between , into , through , during , before , after , above , below , to , from , up , down , in , out , on , off , over , under , again , further , than , own , same , minute , hour , month , year , century , am , pm , january , february , march , april , may , june , july , august , september , october , november , december , jan , feb , mar , apr , may , jun , jul , aug , sep , sept , oct , nov , dec , sunday , monday , tuesday , wednesday , thursday , friday , saturday , one , two , three , four , five , six , seven , eight , nine , ten | 237 |
nltk | i , me , my , myself , we , our , ours , ourselves , you , you’re , you’ve , you’ll , you’d , your , yours , yourself , yourselves, he , him , his , himself , she , she’s , her , hers , herself , it , it’s , its , itself , they , them , their , theirs , themselves, what , which , who , whom , this , that , that’ll , these , those , am , is , are , was , were , be , been , being , have , has , had , having , do , does , did , doing , a , an , the , and , but , if , or , because , as , until , while , of , at , by , for , with , about , against , between , into , through , during , before , after , above , below , to , from , up , down , in , out , on , off , over , under , again , further , then , once , here , there , when , where , why , how , all , any , both , each , few , more , most , other , some , such , no , nor , not , only , own , same , so , than , too , very , s , t , can , will , just , don , don’t , should , should’ve , now , d , ll , m , o , re , ve , y , ain , aren , aren’t , couldn , couldn’t , didn , didn’t , doesn , doesn’t , hadn , hadn’t , hasn , hasn’t , haven , haven’t , isn , isn’t , ma , mightn , mightn’t , mustn , mustn’t , needn , needn’t , shan , shan’t , shouldn , shouldn’t , wasn , wasn’t , weren , weren’t , won , won’t , wouldn , wouldn’t | 179 |
smart | a , a’s , able , about , above , according , accordingly , across , actually , after , afterwards , again , against , ain’t , all , allow , allows , almost , alone , along , already , also , although , always , am , among , amongst , an , and , another , any , anybody , anyhow , anyone , anything , anyway , anyways , anywhere , apart , appear , appreciate , appropriate , are , aren’t , around , as , aside , ask , asking , associated , at , available , away , awfully , b , be , became , because , become , becomes , becoming , been , before , beforehand , behind , being , believe , below , beside , besides , best , better , between , beyond , both , brief , but , by , c , c’mon , c’s , came , can , can’t , cannot , cant , cause , causes , certain , certainly , changes , clearly , co , com , come , comes , concerning , consequently , consider , considering , contain , containing , contains , corresponding, could , couldn’t , course , currently , d , definitely , described , despite , did , didn’t , different , do , does , doesn’t , doing , don’t , done , down , downwards , during , e , each , edu , eg , eight , either , else , elsewhere , enough , entirely , especially , et , etc , even , ever , every , everybody , everyone , everything , everywhere , ex , exactly , example , except , f , far , few , fifth , first , five , followed , following , follows , for , former , formerly , forth , four , from , further , furthermore , g , get , gets , getting , given , gives , go , goes , going , gone , got , gotten , greetings , h , had , hadn’t , happens , hardly , has , hasn’t , have , haven’t , having , he , he’s , hello , help , hence , her , here , here’s , hereafter , hereby , herein , hereupon , hers , herself , hi , him , himself , his , hither , hopefully , how , howbeit , however , i , i’d , i’ll , i’m , i’ve , ie , if , ignored , immediate , in , inasmuch , inc , indeed , indicate , indicated , indicates , inner , insofar , instead , into , inward , is , 
isn’t , it , it’d , it’ll , it’s , its , itself , j , just , k , keep , keeps , kept , know , knows , known , l , last , lately , later , latter , latterly , least , less , lest , let , let’s , like , liked , likely , little , look , looking , looks , ltd , m , mainly , many , may , maybe , me , mean , meanwhile , merely , might , more , moreover , most , mostly , much , must , my , myself , n , name , namely , nd , near , nearly , necessary , need , needs , neither , never , nevertheless , new , next , nine , no , nobody , non , none , noone , nor , normally , not , nothing , novel , now , nowhere , o , obviously , of , off , often , oh , ok , okay , old , on , once , one , ones , only , onto , or , other , others , otherwise , ought , our , ours , ourselves , out , outside , over , overall , own , p , particular , particularly , per , perhaps , placed , please , plus , possible , presumably , probably , provides , q , que , quite , qv , r , rather , rd , re , really , reasonably , regarding , regardless , regards , relatively , respectively , right , s , said , same , saw , say , saying , says , second , secondly , see , seeing , seem , seemed , seeming , seems , seen , self , selves , sensible , sent , serious , seriously , seven , several , shall , she , should , shouldn’t , since , six , so , some , somebody , somehow , someone , something , sometime , sometimes , somewhat , somewhere , soon , sorry , specified , specify , specifying , still , sub , such , sup , sure , t , t’s , take , taken , tell , tends , th , than , thank , thanks , thanx , that , that’s , thats , the , their , theirs , them , themselves , then , thence , there , there’s , thereafter , thereby , therefore , therein , theres , thereupon , these , they , they’d , they’ll , they’re , they’ve , think , third , this , thorough , thoroughly , those , though , three , through , throughout , thru , thus , to , together , too , took , toward , towards , tried , tries , truly , try , trying , twice 
, two , u , un , under , unfortunately, unless , unlikely , until , unto , up , upon , us , use , used , useful , uses , using , usually , uucp , v , value , various , very , via , viz , vs , w , want , wants , was , wasn’t , way , we , we’d , we’ll , we’re , we’ve , welcome , well , went , were , weren’t , what , what’s , whatever , when , whence , whenever , where , where’s , whereafter , whereas , whereby , wherein , whereupon , wherever , whether , which , while , whither , who , who’s , whoever , whole , whom , whose , why , will , willing , wish , with , within , without , won’t , wonder , would , would , wouldn’t , x , y , yes , yet , you , you’d , you’ll , you’re , you’ve , your , yours , yourself , yourselves , z , zero | 571 |
snowball | i , me , my , myself , we , our , ours , ourselves , you , your , yours , yourself , yourselves, he , him , his , himself , she , her , hers , herself , it , its , itself , they , them , their , theirs , themselves, what , which , who , whom , this , that , these , those , am , is , are , was , were , be , been , being , have , has , had , having , do , does , did , doing , would , should , could , ought , i’m , you’re , he’s , she’s , it’s , we’re , they’re , i’ve , you’ve , we’ve , they’ve , i’d , you’d , he’d , she’d , we’d , they’d , i’ll , you’ll , he’ll , she’ll , we’ll , they’ll , isn’t , aren’t , wasn’t , weren’t , hasn’t , haven’t , hadn’t , doesn’t , don’t , didn’t , won’t , wouldn’t , shan’t , shouldn’t , can’t , cannot , couldn’t , mustn’t , let’s , that’s , who’s , what’s , here’s , there’s , when’s , where’s , why’s , how’s , a , an , the , and , but , if , or , because , as , until , while , of , at , by , for , with , about , against , between , into , through , during , before , after , above , below , to , from , up , down , in , out , on , off , over , under , again , further , then , once , here , there , when , where , why , how , all , any , both , each , few , more , most , other , some , such , no , nor , not , only , own , same , so , than , too , very , will | 175 |
stopwords-iso | ’ll , ’tis , ’twas , ’ve , 10 , 39 , a , a’s , able , ableabout , about , above , abroad , abst , accordance , according , accordingly , across , act , actually , ad , added , adj , adopted , ae , af , affected , affecting , affects , after , afterwards , ag , again , against , ago , ah , ahead , ai , ain’t , aint , al , all , allow , allows , almost , alone , along , alongside , already , also , although , always , am , amid , amidst , among , amongst , amoungst , amount , an , and , announce , another , any , anybody , anyhow , anymore , anyone , anything , anyway , anyways , anywhere , ao , apart , apparently , appear , appreciate , appropriate , approximately , aq , ar , are , area , areas , aren , aren’t , arent , arise , around , arpa , as , aside , ask , asked , asking , asks , associated , at , au , auth , available , aw , away , awfully , az , b , ba , back , backed , backing , backs , backward , backwards , bb , bd , be , became , because , become , becomes , becoming , been , before , beforehand , began , begin , beginning , beginnings , begins , behind , being , beings , believe , below , beside , besides , best , better , between , beyond , bf , bg , bh , bi , big , bill , billion , biol , bj , bm , bn , bo , both , bottom , br , brief , briefly , bs , bt , but , buy , bv , bw , by , bz , c , c’mon , c’s , ca , call , came , can , can’t , cannot , cant , caption , case , cases , cause , causes , cc , cd , certain , certainly , cf , cg , ch , changes , ci , ck , cl , clear , clearly , click , cm , cmon , cn , co , co. 
, com , come , comes , computer , con , concerning , consequently , consider , considering , contain , containing , contains , copy , corresponding , could , could’ve , couldn , couldn’t , couldnt , course , cr , cry , cs , cu , currently , cv , cx , cy , cz , d , dare , daren’t , darent , date , de , dear , definitely , describe , described , despite , detail , did , didn , didn’t , didnt , differ , different , differently , directly , dj , dk , dm , do , does , doesn , doesn’t , doesnt , doing , don , don’t , done , dont , doubtful , down , downed , downing , downs , downwards , due , during , dz , e , each , early , ec , ed , edu , ee , effect , eg , eh , eight , eighty , either , eleven , else , elsewhere , empty , end , ended , ending , ends , enough , entirely , er , es , especially , et , et-al , etc , even , evenly , ever , evermore , every , everybody , everyone , everything , everywhere , ex , exactly , example , except , f , face , faces , fact , facts , fairly , far , farther , felt , few , fewer , ff , fi , fifteen , fifth , fifty , fify , fill , find , finds , fire , first , five , fix , fj , fk , fm , fo , followed , following , follows , for , forever , former , formerly , forth , forty , forward , found , four , fr , free , from , front , full , fully , further , furthered , furthering , furthermore , furthers , fx , g , ga , gave , gb , gd , ge , general , generally , get , gets , getting , gf , gg , gh , gi , give , given , gives , giving , gl , gm , gmt , gn , go , goes , going , gone , good , goods , got , gotten , gov , gp , gq , gr , great , greater , greatest , greetings , group , grouped , grouping , groups , gs , gt , gu , gw , gy , h , had , hadn’t , hadnt , half , happens , hardly , has , hasn , hasn’t , hasnt , have , haven , haven’t , havent , having , he , he’d , he’ll , he’s , hed , hell , hello , help , hence , her , here , here’s , hereafter , hereby , herein , heres , hereupon , hers , herself , herse” , hes , hi , hid , high , 
higher , highest , him , himself , himse” , his , hither , hk , hm , hn , home , homepage , hopefully , how , how’d , how’ll , how’s , howbeit , however , hr , ht , htm , html , http , hu , hundred , i , i’d , i’ll , i’m , i’ve , i.e. , id , ie , if , ignored , ii , il , ill , im , immediate , immediately , importance , important , in , inasmuch , inc , inc. , indeed , index , indicate , indicated , indicates , information , inner , inside , insofar , instead , int , interest , interested , interesting , interests , into , invention , inward , io , iq , ir , is , isn , isn’t , isnt , it , it’d , it’ll , it’s , itd , itll , its , itself , itse” , ive , j , je , jm , jo , join , jp , just , k , ke , keep , keeps , kept , keys , kg , kh , ki , kind , km , kn , knew , know , known , knows , kp , kr , kw , ky , kz , l , la , large , largely , last , lately , later , latest , latter , latterly , lb , lc , least , length , less , lest , let , let’s , lets , li , like , liked , likely , likewise , line , little , lk , ll , long , longer , longest , look , looking , looks , low , lower , lr , ls , lt , ltd , lu , lv , ly , m , ma , made , mainly , make , makes , making , man , many , may , maybe , mayn’t , maynt , mc , md , me , mean , means , meantime , meanwhile , member , members , men , merely , mg , mh , microsoft , might , might’ve , mightn’t , mightnt , mil , mill , million , mine , minus , miss , mk , ml , mm , mn , mo , more , moreover , most , mostly , move , mp , mq , mr , mrs , ms , msie , mt , mu , much , mug , must , must’ve , mustn’t , mustnt , mv , mw , mx , my , myself , myse” , mz , n , na , name , namely , nay , nc , nd , ne , near , nearly , necessarily , necessary , need , needed , needing , needn’t , neednt , needs , neither , net , netscape , never , neverf , neverless , nevertheless , new , newer , newest , next , nf , ng , ni , nine , ninety , nl , no , no-one , nobody , non , none , nonetheless , noone , nor , normally , nos , not , noted , nothing 
, notwithstanding, novel , now , nowhere , np , nr , nu , null , number , numbers , nz , o , obtain , obtained , obviously , of , off , often , oh , ok , okay , old , older , oldest , om , omitted , on , once , one , one’s , ones , only , onto , open , opened , opening , opens , opposite , or , ord , order , ordered , ordering , orders , org , other , others , otherwise , ought , oughtn’t , oughtnt , our , ours , ourselves , out , outside , over , overall , owing , own , p , pa , page , pages , part , parted , particular , particularly , parting , parts , past , pe , per , perhaps , pf , pg , ph , pk , pl , place , placed , places , please , plus , pm , pmid , pn , point , pointed , pointing , points , poorly , possible , possibly , potentially , pp , pr , predominantly , present , presented , presenting , presents , presumably , previously , primarily , probably , problem , problems , promptly , proud , provided , provides , pt , put , puts , pw , py , q , qa , que , quickly , quite , qv , r , ran , rather , rd , re , readily , really , reasonably , recent , recently , ref , refs , regarding , regardless , regards , related , relatively , research , reserved , respectively , resulted , resulting , results , right , ring , ro , room , rooms , round , ru , run , rw , s , sa , said , same , saw , say , saying , says , sb , sc , sd , se , sec , second , secondly , seconds , section , see , seeing , seem , seemed , seeming , seems , seen , sees , self , selves , sensible , sent , serious , seriously , seven , seventy , several , sg , sh , shall , shan’t , shant , she , she’d , she’ll , she’s , shed , shell , shes , should , should’ve , shouldn , shouldn’t , shouldnt , show , showed , showing , shown , showns , shows , si , side , sides , significant , significantly , similar , similarly , since , sincere , site , six , sixty , sj , sk , sl , slightly , sm , small , smaller , smallest , sn , so , some , somebody , someday , somehow , someone , somethan , something , 
sometime , sometimes , somewhat , somewhere , soon , sorry , specifically , specified , specify , specifying , sr , st , state , states , still , stop , strongly , su , sub , substantially , successfully , such , sufficiently , suggest , sup , sure , sv , sy , system , sz , t , t’s , take , taken , taking , tc , td , tell , ten , tends , test , text , tf , tg , th , than , thank , thanks , thanx , that , that’ll , that’s , that’ve , thatll , thats , thatve , the , their , theirs , them , themselves , then , thence , there , there’d , there’ll , there’re , there’s , there’ve , thereafter , thereby , thered , therefore , therein , therell , thereof , therere , theres , thereto , thereupon , thereve , these , they , they’d , they’ll , they’re , they’ve , theyd , theyll , theyre , theyve , thick , thin , thing , things , think , thinks , third , thirty , this , thorough , thoroughly , those , thou , though , thoughh , thought , thoughts , thousand , three , throug , through , throughout , thru , thus , til , till , tip , tis , tj , tk , tm , tn , to , today , together , too , took , top , toward , towards , tp , tr , tried , tries , trillion , truly , try , trying , ts , tt , turn , turned , turning , turns , tv , tw , twas , twelve , twenty , twice , two , tz , u , ua , ug , uk , um , un , under , underneath , undoing , unfortunately , unless , unlike , unlikely , until , unto , up , upon , ups , upwards , us , use , used , useful , usefully , usefulness , uses , using , usually , uucp , uy , uz , v , va , value , various , vc , ve , versus , very , vg , vi , via , viz , vn , vol , vols , vs , vu , w , want , wanted , wanting , wants , was , wasn , wasn’t , wasnt , way , ways , we , we’d , we’ll , we’re , we’ve , web , webpage , website , wed , welcome , well , wells , went , were , weren , weren’t , werent , weve , wf , what , what’d , what’ll , what’s , what’ve , whatever , whatll , whats , whatve , when , when’d , when’ll , when’s , whence , whenever , where , 
where’d , where’ll , where’s , whereafter , whereas , whereby , wherein , wheres , whereupon , wherever , whether , which , whichever , while , whilst , whim , whither , who , who’d , who’ll , who’s , whod , whoever , whole , wholl , whom , whomever , whos , whose , why , why’d , why’ll , why’s , widely , width , will , willing , wish , with , within , without , won , won’t , wonder , wont , words , work , worked , working , works , world , would , would’ve , wouldn , wouldn’t , wouldnt , ws , www , x , y , ye , year , years , yes , yet , you , you’d , you’ll , you’re , you’ve , youd , youll , young , younger , youngest , your , youre , yours , yourself , yourselves , youve , yt , yu , z , za , zero , zm , zr | 1298 |
en_df2 <- fst_prepare(data = english_sample_survey,
question = 'text',
id = 'id',
model = 'english-ewt',
stopword_list = 'smart',
language = 'en')
knitr::kable(head(en_df2, 5))
doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 1 | joe | joe | PROPN | NNP | Number=Sing | 3 | nsubj | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 3 | talk | talk | VERB | VB | VerbForm=Inf | 0 | root | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 6 | doctor | doctor | NOUN | NN | Number=Sing | 3 | obl | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 10 | nurse | nurse | NOUN | NN | Number=Sing | 8 | obj | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 13 | doctor | doctor | NOUN | NN | Number=Sing | 14 | nsubj | NA | NA |
2b. Optional: Provide your own list of stopwords
If a stopword list is not available for your language, or you would
like to provide your own, you can use the manual_list
option within fst_prepare() (or fst_rm_stop_punct()). When you do, also set either
manual = TRUE
or
stopword_list = "manual".
You can also choose not to remove stopwords, but removing them usually gives more meaningful results.
If you provide a manual list, you can leave language
at its default value.
Demonstration
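# Example of providing a manual list
# Option 1: a character vector with one stopword per element
manualList <- c('and', 'the', 'of', 'you', 'me', 'ours', 'mine', 'them', 'theirs')
# Option 2: a single comma-separated string
manualList2 <- "to, the, I"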
df1 <- fst_prepare(data = english_sample_survey,
question = 'text',
id = 'id',
model = 'english-ewt',
manual_list = manualList,
stopword_list = 'manual'
)
knitr::kable(head(df1, 5))
doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 1 | joe | joe | PROPN | NNP | Number=Sing | 3 | nsubj | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 2 | should | should | AUX | MD | VerbForm=Fin | 3 | aux | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 3 | talk | talk | VERB | VB | VerbForm=Inf | 0 | root | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 4 | to | to | ADP | IN | NA | 6 | case | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 6 | doctor | doctor | NOUN | NN | Number=Sing | 3 | obl | NA | NA |
df2 <- fst_prepare(data = english_sample_survey,
question = 'text',
id = 'id',
model = 'english-ewt',
manual = TRUE,
manual_list = manualList2
)
knitr::kable(head(df2, 5))
doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 1 | joe | joe | PROPN | NNP | Number=Sing | 3 | nsubj | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 2 | should | should | AUX | MD | VerbForm=Fin | 3 | aux | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 3 | talk | talk | VERB | VB | VerbForm=Inf | 0 | root | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 6 | doctor | doctor | NOUN | NN | Number=Sing | 3 | obl | NA | NA |
1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 7 | or | or | CCONJ | CC | NA | 8 | cc | NA | NA |
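To make the difference between the two outputs above easier to see, here is a minimal base R sketch of what removing a manual stopword list amounts to. It is not the package's implementation: normalise_stopwords() and remove_manual_stopwords() are hypothetical helpers written for this illustration, and lower-casing the list before matching is an assumption. The sketch only shows why 'to' survives with manualList but is dropped with manualList2.

```r
# Hypothetical helper (not part of finnsurveytext): accept either a character
# vector or a comma-separated string and return a clean vector of stopwords.
normalise_stopwords <- function(manual_list) {
  # Split comma-separated strings, then trim surrounding whitespace
  words <- trimws(unlist(strsplit(manual_list, ",", fixed = TRUE)))
  # Assumption: matching is case-insensitive, so lower-case everything
  unique(tolower(words[nzchar(words)]))
}

# Hypothetical helper: keep only the tokens that are not in the stopword list
remove_manual_stopwords <- function(tokens, manual_list) {
  stops <- normalise_stopwords(manual_list)
  tokens[!(tolower(tokens) %in% stops)]
}

# The first few tokens of the first survey response
tokens <- c("joe", "should", "talk", "to", "the", "doctor")

manualList  <- c("and", "the", "of", "you", "me", "ours", "mine", "them", "theirs")
manualList2 <- "to, the, I"

kept1 <- remove_manual_stopwords(tokens, manualList)   # drops "the"
kept2 <- remove_manual_stopwords(tokens, manualList2)  # drops "to" and "the"

print(kept1)
print(kept2)
```

With manualList only 'the' is removed, whereas manualList2 also removes 'to', which matches the token columns of the two tables above.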
The remainder of the package works the same way regardless of the language of the survey responses.