Extra-AnalysingOtherLanguages
Source: vignettes/web_only/Extra-AnalysingOtherLanguages.Rmd
How to Use finnsurveytext in another language!
Despite the package’s name, finnsurveytext can be used
to analyse surveys in LOTS of different languages. This
vignette aims to explain how to use finnsurveytext in
another language with as little additional effort as possible.
The reason finnsurveytext can be used with other
languages is that the packages it employs to process the raw survey data
work in multiple languages! So we have the developers of the
udpipe and stopwords packages to thank!
There is a survey in English provided with the package called
english_sample_survey which we will use to demonstrate the
use of the package in a language other than Finnish.
| id | label | label_coder1 | label_coder2 | text |
|---|---|---|---|---|
| 1 | proactive | proactive | proactive | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. |
| 2 | proactive | proactive | proactive | I think he should have the receptionist talk to the doctor to make sure that he gets in there at the appropriate time; find out if it actually can be two weeks or if two weeks later would be OK. |
| 3 | proactive | proactive | proactive | Joe should talk to the doctor and make arrangements to come in in two weeks. He was pretty specific about that. |
| 4 | proactive | proactive | proactive | I think Joe should insist on an appointment in two weeks. |
| 5 | proactive | proactive | proactive | Joe should discuss this with the receptionist as to what the doctor told him to do. And insist on seeing him at two weeks. |
1. Essential: Your language has a language model available for udpipe
The udpipe package is available from CRAN. The relevant
udpipe function we use is
udpipe::udpipe_download_model. You can see the list of
available models in the udpipe manual.
At the time of writing this vignette, these were:
afrikaans-afribooms, ancient_greek-perseus, ancient_greek-proiel, arabic-padt, armenian-armtdp, basque-bdt, belarusian-hse, bulgarian-btb, buryat-bdt, catalan-ancora, chinese-gsd, chinese-gsdsimp, coptic-scriptorium, croatian-set, czech-cac, czech-cltt, czech-fictree, czech-pdt, danish-ddt, dutch-alpino, dutch-lassysmall, english-ewt, english-gum, english-lines, english-partut, estonian-edt, finnish-ftb, finnish-tdt, french-gsd, french-partut, french-sequoia, french-spoken, galician-ctg, galician-treegal, german-gsd, german-hdt, gothic-proiel, greek-gdt, hebrew-htb, hindi-hdtb, hungarian-szeged, indonesian-gsd, irish-idt, italian-isdt, italian-partut, italian-postwita, italian-twittiro, japanese-gsd, kazakh-ktb, korean-gsd, korean-kaist, kurmanji-mg, latin-ittb, latin-perseus, latin-proiel, latvian-lvtb, lithuanian-hse, maltese-mudt, marathi-ufal, north_sami-giella, norwegian-bokmaal, norwegian-nynorsk, norwegian-nynorsklia, old_church_slavonic-proiel, old_french-srcmf, persian-seraji, polish-lfg, polish-sz, portuguese-bosque, portuguese-br, portuguese-gsd, romanian-nonstandard, romanian-rrt, russian-gsd, russian-syntagrus, russian-taiga, sanskrit-ufal, scottish_gaelic-arcosg, serbian-set, slovak-snk, slovenian-ssj, slovenian-sst, spanish-ancora, spanish-gsd, swedish-lines, swedish-talbanken, tamil-ttb, telugu-mtg, turkish-imst, ukrainian-iu, upper_sorbian-ufal, urdu-udtb, uyghur-udt, vietnamese-vtb
Alternatively, you can list the available models by running
fst_print_available_models(). If you provide a
search term, the list is filtered to the models whose names
contain that term:
fst_print_available_models()
#> [1] "afrikaans-afribooms" "ancient_greek-perseus"
#> [3] "ancient_greek-proiel" "arabic-padt"
#> [5] "armenian-armtdp" "basque-bdt"
#> [7] "belarusian-hse" "bulgarian-btb"
#> [9] "buryat-bdt" "catalan-ancora"
#> [11] "chinese-gsd" "chinese-gsdsimp"
#> [13] "classical_chinese-kyoto" "coptic-scriptorium"
#> [15] "croatian-set" "czech-cac"
#> [17] "czech-cltt" "czech-fictree"
#> [19] "czech-pdt" "danish-ddt"
#> [21] "dutch-alpino" "dutch-lassysmall"
#> [23] "english-ewt" "english-gum"
#> [25] "english-lines" "english-partut"
#> [27] "estonian-edt" "estonian-ewt"
#> [29] "finnish-ftb" "finnish-tdt"
#> [31] "french-gsd" "french-partut"
#> [33] "french-sequoia" "french-spoken"
#> [35] "galician-ctg" "galician-treegal"
#> [37] "german-gsd" "german-hdt"
#> [39] "gothic-proiel" "greek-gdt"
#> [41] "hebrew-htb" "hindi-hdtb"
#> [43] "hungarian-szeged" "indonesian-gsd"
#> [45] "irish-idt" "italian-isdt"
#> [47] "italian-partut" "italian-postwita"
#> [49] "italian-twittiro" "italian-vit"
#> [51] "japanese-gsd" "kazakh-ktb"
#> [53] "korean-gsd" "korean-kaist"
#> [55] "kurmanji-mg" "latin-ittb"
#> [57] "latin-perseus" "latin-proiel"
#> [59] "latvian-lvtb" "lithuanian-alksnis"
#> [61] "lithuanian-hse" "maltese-mudt"
#> [63] "marathi-ufal" "north_sami-giella"
#> [65] "norwegian-bokmaal" "norwegian-nynorsk"
#> [67] "norwegian-nynorsklia" "old_church_slavonic-proiel"
#> [69] "old_french-srcmf" "old_russian-torot"
#> [71] "persian-seraji" "polish-lfg"
#> [73] "polish-pdb" "polish-sz"
#> [75] "portuguese-bosque" "portuguese-br"
#> [77] "portuguese-gsd" "romanian-nonstandard"
#> [79] "romanian-rrt" "russian-gsd"
#> [81] "russian-syntagrus" "russian-taiga"
#> [83] "sanskrit-ufal" "scottish_gaelic-arcosg"
#> [85] "serbian-set" "slovak-snk"
#> [87] "slovenian-ssj" "slovenian-sst"
#> [89] "spanish-ancora" "spanish-gsd"
#> [91] "swedish-lines" "swedish-talbanken"
#> [93] "tamil-ttb" "telugu-mtg"
#> [95] "turkish-imst" "ukrainian-iu"
#> [97] "upper_sorbian-ufal" "urdu-udtb"
#> [99] "uyghur-udt" "vietnamese-vtb"
#> [101] "wolof-wtb"
fst_print_available_models(search = 'estonian')
#> [1] "estonian-edt" "estonian-ewt"
fst_print_available_models('sami')
#> [1] "north_sami-giella"
How to use:
The relevant model, e.g. “swedish-talbanken”, should be passed as the
model argument to fst_format() or
fst_prepare().
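As a rough illustration of what the model name refers to, the sketch below calls udpipe directly rather than through finnsurveytext. It assumes the udpipe package is installed and an internet connection is available. The example sentence and the use of tempdir() as the download location are illustrative choices only, and this is not necessarily how finnsurveytext handles the model internally.

```r
library(udpipe)

# Download the model named by the same string you would pass as `model`.
# tempdir() is an illustrative choice to avoid cluttering the working directory.
model_info <- udpipe_download_model(language = "english-ewt",
                                    model_dir = tempdir())

if (!isTRUE(model_info$download_failed)) {
  # Load the downloaded model file and annotate a single example sentence.
  ud_model <- udpipe_load_model(file = model_info$file_model)
  annotated <- as.data.frame(
    udpipe_annotate(ud_model, x = "Joe should talk to the doctor.")
  )
  # The result has the same kind of columns as the formatted survey data.
  print(annotated[, c("token_id", "token", "lemma", "upos")])
}
```

If the download fails (for example, when offline), the block simply skips the annotation step.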
Demonstration:
We find an English model and format our English data below:
fst_print_available_models("english")
#> [1] "english-ewt" "english-gum" "english-lines" "english-partut"
en_df <- fst_format(data = english_sample_survey,
question = 'text',
id = 'id',
model = 'english-ewt'
)
#> Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to /home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/english-ewt-ud-2.5-191206.udpipe
#> - This model has been trained on version 2.5 of data from https://universaldependencies.org
#> - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
#> - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
#> - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
#> Downloading finished, model stored at '/home/runner/work/finnsurveytext/finnsurveytext/vignettes/web_only/english-ewt-ud-2.5-191206.udpipe'
knitr::kable(head(en_df, 5))
| doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 1 | joe | joe | PROPN | NNP | Number=Sing | 3 | nsubj | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 2 | should | should | AUX | MD | VerbForm=Fin | 3 | aux | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 3 | talk | talk | VERB | VB | VerbForm=Inf | 0 | root | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 4 | to | to | ADP | IN | NA | 6 | case | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 5 | the | the | DET | DT | Definite=Def|PronType=Art | 6 | det | NA | NA |
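The formatted output has one row per token, so it can be explored with ordinary data frame operations. The sketch below uses a small hand-built data frame that copies a few columns from the five rows shown above (it is not the full en_df), and the set of "content" tags it keeps is an illustrative assumption.

```r
# Hand-copied subset of the five rows shown above (not the full en_df).
tokens <- data.frame(
  doc_id   = c(1, 1, 1, 1, 1),
  token_id = 1:5,
  token    = c("joe", "should", "talk", "to", "the"),
  lemma    = c("joe", "should", "talk", "to", "the"),
  upos     = c("PROPN", "AUX", "VERB", "ADP", "DET"),
  stringsAsFactors = FALSE
)

# Count how often each part-of-speech tag occurs.
upos_counts <- table(tokens$upos)
print(upos_counts)

# Keep only content-bearing tags (an illustrative choice of tags).
content_tokens <- tokens[tokens$upos %in% c("NOUN", "VERB", "ADJ", "PROPN"), ]
print(content_tokens$lemma)
```

The same two steps should work unchanged on the real en_df, since it carries the same upos and lemma columns.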
2. Recommended: Your language has a stopwords list available in the stopwords package
The stopwords package is available from CRAN. The
relevant stopwords functions are
stopwords::stopwords,
stopwords::stopwords_getsources and
stopwords::stopwords_getlanguages. We recommend you first
identify the two-letter ISO
code for the language you are using. You can see the list of
available sources and languages in the stopwords manual
or by running the ‘get sources’ and ‘get languages’ functions:
stopwords_getsources()
#> [1] "snowball" "stopwords-iso" "misc" "smart"
#> [5] "marimo" "ancient" "nltk" "perseus"
stopwords::stopwords_getlanguages(source = 'nltk')
#> [1] "ar" "az" "da" "nl" "en" "fi" "fr" "de" "el" "hu" "id" "it" "kk" "ne" "no"
#> [16] "pt" "ro" "ru" "sl" "es" "sv" "tg" "tr"
stopwords('da', source = 'nltk')
#> [1] "og" "i" "jeg" "det" "at" "en" "den" "til"
#> [9] "er" "som" "på" "de" "med" "han" "af" "for"
#> [17] "ikke" "der" "var" "mig" "sig" "men" "et" "har"
#> [25] "om" "vi" "min" "havde" "ham" "hun" "nu" "over"
#> [33] "da" "fra" "du" "ud" "sin" "dem" "os" "op"
#> [41] "man" "hans" "hvor" "eller" "hvad" "skal" "selv" "her"
#> [49] "alle" "vil" "blev" "kunne" "ind" "når" "være" "dog"
#> [57] "noget" "ville" "jo" "deres" "efter" "ned" "skulle" "denne"
#> [65] "end" "dette" "mit" "også" "under" "have" "dig" "anden"
#> [73] "hende" "mine" "alt" "meget" "sit" "sine" "vor" "mod"
#> [81] "disse" "hvis" "din" "nogle" "hos" "blive" "mange" "ad"
#> [89] "bliver" "hendes" "været" "thi" "jer" "sådan"
stopwords('da') # The default source is 'snowball'
#> [1] "og" "i" "jeg" "det" "at" "en" "den" "til"
#> [9] "er" "som" "på" "de" "med" "han" "af" "for"
#> [17] "ikke" "der" "var" "mig" "sig" "men" "et" "har"
#> [25] "om" "vi" "min" "havde" "ham" "hun" "nu" "over"
#> [33] "da" "fra" "du" "ud" "sin" "dem" "os" "op"
#> [41] "man" "hans" "hvor" "eller" "hvad" "skal" "selv" "her"
#> [49] "alle" "vil" "blev" "kunne" "ind" "når" "være" "dog"
#> [57] "noget" "ville" "jo" "deres" "efter" "ned" "skulle" "denne"
#> [65] "end" "dette" "mit" "også" "under" "have" "dig" "anden"
#> [73] "hende" "mine" "alt" "meget" "sit" "sine" "vor" "mod"
#> [81] "disse" "hvis" "din" "nogle" "hos" "blive" "mange" "ad"
#> [89] "bliver" "hendes" "været" "thi" "jer" "sådan"
Alternatively, you can use our function
fst_find_stopwords to simplify this process. This function
provides a table of lists available through the stopwords
package for a language and provides the contents for comparison (if you
have multiple options!). To run this, you need the two-letter ISO
language code:
knitr::kable(fst_find_stopwords(language = 'lv'))
| Name | Stopwords | Length |
|---|---|---|
| stopwords-iso | aiz , ap , apakš , apakšpus , ar , arī , augšpus , bet , bez , bija , biji , biju , bijām , bijāt , būs , būsi , būsiet , būsim , būt , būšu , caur , diemžēl , diezin , droši , dēļ , esam , esat , esi , esmu , gan , gar , iekam , iekams , iekām , iekāms , iekš , iekšpus , ik , ir , it , itin , iz , ja , jau , jeb , jebšu , jel , jo , jā , ka , kamēr , kaut , kolīdz , kopš , kā , kļuva , kļuvi , kļuvu , kļuvām , kļuvāt , kļūs , kļūsi , kļūsiet , kļūsim , kļūst , kļūstam , kļūstat , kļūsti , kļūstu , kļūt , kļūšu , labad , lai , lejpus , līdz , līdzko , ne , nebūt , nedz , nekā , nevis , nezin , no , nu , nē , otrpus , pa , par , pat , pie , pirms , pret , priekš , pār , pēc , starp , tad , tak , tapi , taps , tapsi , tapsiet , tapsim , tapt , tapāt , tapšu , taču , te , tiec , tiek , tiekam , tiekat , tieku , tik , tika , tikai , tiki , tikko , tiklab , tiklīdz , tiks , tiksiet , tiksim , tikt , tiku , tikvien , tikām , tikāt , tikšu , tomēr , topat , turpretim, turpretī , tā , tādēļ , tālab , tāpēc , un , uz , vai , var , varat , varēja , varēji , varēju , varējām , varējāt , varēs , varēsi , varēsiet , varēsim , varēt , varēšu , vien , virs , virspus , vis , viņpus , zem , ārpus , šaipus | 161 |
fst_find_stopwords(language = 'no')
#> # A tibble: 3 × 3
#> Name Stopwords Length
#> <chr> <list> <list>
#> 1 nltk <chr [172]> <int [1]>
#> 2 snowball <chr [176]> <int [1]>
#> 3 stopwords-iso <chr [221]> <int [1]>
How to use:
The relevant language and stopword list (‘source’), e.g. “sv” and
“nltk”, should be passed as the language and
stopword_list arguments respectively to
fst_prepare() (or to fst_rm_stop_punct(), which is
automatically called within fst_prepare()).
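Before choosing values for language and stopword_list, the available lists can also be compared using the stopwords package on its own. The sketch below is a minimal version of that comparison, assuming the stopwords package is installed. The variable names are hypothetical, and this is not the implementation of fst_find_stopwords.

```r
library(stopwords)

lang <- "no"  # two-letter ISO code, as in the example above

# Find which sources offer a list for this language.
sources <- stopwords_getsources()
has_lang <- vapply(
  sources,
  function(s) lang %in% stopwords_getlanguages(source = s),
  logical(1)
)

# Record the length of each available list for comparison.
list_lengths <- vapply(
  sources[has_lang],
  function(s) length(stopwords(language = lang, source = s)),
  integer(1)
)
print(list_lengths)
```

The names of list_lengths are the source names, any of which can then be used as the stopword_list value.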
Demonstration:
We can find and compare English stopwords lists as below. Once we
have chosen a stopwords list, we can run fst_prepare() to
format the data and remove the stopwords:
knitr::kable(head(fst_find_stopwords(language = 'en'), 5))
| Name | Stopwords | Length |
|---|---|---|
| marimo | i , me , myself , we , ours , ourselves , you , yours , yourself , yourselves, he , him , himself , she , hers , herself , it , itself , they , them , theirs , themselves, this , that , these , those , my , our , your , his , her , its , their , what , which , who , whom , whose , when , where , why , how , i’m , you’re , he’s , she’s , it’s , we’re , they’re , i’ve , you’ve , we’ve , they’ve , i’d , you’d , he’d , she’d , we’d , they’d , i’ll , you’ll , he’ll , she’ll , we’ll , they’ll , am , is , are , was , were , be , been , being , have , has , had , having , do , does , did , doing , would , should , could , ought , will , isn’t , aren’t , wasn’t , weren’t , hasn’t , haven’t , hadn’t , doesn’t , don’t , didn’t , won’t , wouldn’t , shan’t , shouldn’t , can’t , cannot , couldn’t , mustn’t , let’s , that’s , who’s , what’s , here’s , there’s , when’s , where’s , why’s , how’s , say , says , said , tell , tells , told , report , reports , reported , a , an , the , and , but , if , or , because , so , while , nor , as , until , once , here , there , all , any , both , each , few , many , more , most , other , some , such , no , not , only , then , too , very , little , less , of , at , by , for , with , about , against , between , into , through , during , before , after , above , below , to , from , up , down , in , out , on , off , over , under , again , further , than , own , same , minute , hour , month , year , century , am , pm , january , february , march , april , may , june , july , august , september , october , november , december , jan , feb , mar , apr , may , jun , jul , aug , sep , sept , oct , nov , dec , sunday , monday , tuesday , wednesday , thursday , friday , saturday , one , two , three , four , five , six , seven , eight , nine , ten | 237 |
| nltk | i , me , my , myself , we , our , ours , ourselves , you , you’re , you’ve , you’ll , you’d , your , yours , yourself , yourselves, he , him , his , himself , she , she’s , her , hers , herself , it , it’s , its , itself , they , them , their , theirs , themselves, what , which , who , whom , this , that , that’ll , these , those , am , is , are , was , were , be , been , being , have , has , had , having , do , does , did , doing , a , an , the , and , but , if , or , because , as , until , while , of , at , by , for , with , about , against , between , into , through , during , before , after , above , below , to , from , up , down , in , out , on , off , over , under , again , further , then , once , here , there , when , where , why , how , all , any , both , each , few , more , most , other , some , such , no , nor , not , only , own , same , so , than , too , very , s , t , can , will , just , don , don’t , should , should’ve , now , d , ll , m , o , re , ve , y , ain , aren , aren’t , couldn , couldn’t , didn , didn’t , doesn , doesn’t , hadn , hadn’t , hasn , hasn’t , haven , haven’t , isn , isn’t , ma , mightn , mightn’t , mustn , mustn’t , needn , needn’t , shan , shan’t , shouldn , shouldn’t , wasn , wasn’t , weren , weren’t , won , won’t , wouldn , wouldn’t | 179 |
| smart | a , a’s , able , about , above , according , accordingly , across , actually , after , afterwards , again , against , ain’t , all , allow , allows , almost , alone , along , already , also , although , always , am , among , amongst , an , and , another , any , anybody , anyhow , anyone , anything , anyway , anyways , anywhere , apart , appear , appreciate , appropriate , are , aren’t , around , as , aside , ask , asking , associated , at , available , away , awfully , b , be , became , because , become , becomes , becoming , been , before , beforehand , behind , being , believe , below , beside , besides , best , better , between , beyond , both , brief , but , by , c , c’mon , c’s , came , can , can’t , cannot , cant , cause , causes , certain , certainly , changes , clearly , co , com , come , comes , concerning , consequently , consider , considering , contain , containing , contains , corresponding, could , couldn’t , course , currently , d , definitely , described , despite , did , didn’t , different , do , does , doesn’t , doing , don’t , done , down , downwards , during , e , each , edu , eg , eight , either , else , elsewhere , enough , entirely , especially , et , etc , even , ever , every , everybody , everyone , everything , everywhere , ex , exactly , example , except , f , far , few , fifth , first , five , followed , following , follows , for , former , formerly , forth , four , from , further , furthermore , g , get , gets , getting , given , gives , go , goes , going , gone , got , gotten , greetings , h , had , hadn’t , happens , hardly , has , hasn’t , have , haven’t , having , he , he’s , hello , help , hence , her , here , here’s , hereafter , hereby , herein , hereupon , hers , herself , hi , him , himself , his , hither , hopefully , how , howbeit , however , i , i’d , i’ll , i’m , i’ve , ie , if , ignored , immediate , in , inasmuch , inc , indeed , indicate , indicated , indicates , inner , insofar , instead , into , inward , is , 
isn’t , it , it’d , it’ll , it’s , its , itself , j , just , k , keep , keeps , kept , know , knows , known , l , last , lately , later , latter , latterly , least , less , lest , let , let’s , like , liked , likely , little , look , looking , looks , ltd , m , mainly , many , may , maybe , me , mean , meanwhile , merely , might , more , moreover , most , mostly , much , must , my , myself , n , name , namely , nd , near , nearly , necessary , need , needs , neither , never , nevertheless , new , next , nine , no , nobody , non , none , noone , nor , normally , not , nothing , novel , now , nowhere , o , obviously , of , off , often , oh , ok , okay , old , on , once , one , ones , only , onto , or , other , others , otherwise , ought , our , ours , ourselves , out , outside , over , overall , own , p , particular , particularly , per , perhaps , placed , please , plus , possible , presumably , probably , provides , q , que , quite , qv , r , rather , rd , re , really , reasonably , regarding , regardless , regards , relatively , respectively , right , s , said , same , saw , say , saying , says , second , secondly , see , seeing , seem , seemed , seeming , seems , seen , self , selves , sensible , sent , serious , seriously , seven , several , shall , she , should , shouldn’t , since , six , so , some , somebody , somehow , someone , something , sometime , sometimes , somewhat , somewhere , soon , sorry , specified , specify , specifying , still , sub , such , sup , sure , t , t’s , take , taken , tell , tends , th , than , thank , thanks , thanx , that , that’s , thats , the , their , theirs , them , themselves , then , thence , there , there’s , thereafter , thereby , therefore , therein , theres , thereupon , these , they , they’d , they’ll , they’re , they’ve , think , third , this , thorough , thoroughly , those , though , three , through , throughout , thru , thus , to , together , too , took , toward , towards , tried , tries , truly , try , trying , twice 
, two , u , un , under , unfortunately, unless , unlikely , until , unto , up , upon , us , use , used , useful , uses , using , usually , uucp , v , value , various , very , via , viz , vs , w , want , wants , was , wasn’t , way , we , we’d , we’ll , we’re , we’ve , welcome , well , went , were , weren’t , what , what’s , whatever , when , whence , whenever , where , where’s , whereafter , whereas , whereby , wherein , whereupon , wherever , whether , which , while , whither , who , who’s , whoever , whole , whom , whose , why , will , willing , wish , with , within , without , won’t , wonder , would , would , wouldn’t , x , y , yes , yet , you , you’d , you’ll , you’re , you’ve , your , yours , yourself , yourselves , z , zero | 571 |
| snowball | i , me , my , myself , we , our , ours , ourselves , you , your , yours , yourself , yourselves, he , him , his , himself , she , her , hers , herself , it , its , itself , they , them , their , theirs , themselves, what , which , who , whom , this , that , these , those , am , is , are , was , were , be , been , being , have , has , had , having , do , does , did , doing , would , should , could , ought , i’m , you’re , he’s , she’s , it’s , we’re , they’re , i’ve , you’ve , we’ve , they’ve , i’d , you’d , he’d , she’d , we’d , they’d , i’ll , you’ll , he’ll , she’ll , we’ll , they’ll , isn’t , aren’t , wasn’t , weren’t , hasn’t , haven’t , hadn’t , doesn’t , don’t , didn’t , won’t , wouldn’t , shan’t , shouldn’t , can’t , cannot , couldn’t , mustn’t , let’s , that’s , who’s , what’s , here’s , there’s , when’s , where’s , why’s , how’s , a , an , the , and , but , if , or , because , as , until , while , of , at , by , for , with , about , against , between , into , through , during , before , after , above , below , to , from , up , down , in , out , on , off , over , under , again , further , then , once , here , there , when , where , why , how , all , any , both , each , few , more , most , other , some , such , no , nor , not , only , own , same , so , than , too , very , will | 175 |
| stopwords-iso | ’ll , ’tis , ’twas , ’ve , 10 , 39 , a , a’s , able , ableabout , about , above , abroad , abst , accordance , according , accordingly , across , act , actually , ad , added , adj , adopted , ae , af , affected , affecting , affects , after , afterwards , ag , again , against , ago , ah , ahead , ai , ain’t , aint , al , all , allow , allows , almost , alone , along , alongside , already , also , although , always , am , amid , amidst , among , amongst , amoungst , amount , an , and , announce , another , any , anybody , anyhow , anymore , anyone , anything , anyway , anyways , anywhere , ao , apart , apparently , appear , appreciate , appropriate , approximately , aq , ar , are , area , areas , aren , aren’t , arent , arise , around , arpa , as , aside , ask , asked , asking , asks , associated , at , au , auth , available , aw , away , awfully , az , b , ba , back , backed , backing , backs , backward , backwards , bb , bd , be , became , because , become , becomes , becoming , been , before , beforehand , began , begin , beginning , beginnings , begins , behind , being , beings , believe , below , beside , besides , best , better , between , beyond , bf , bg , bh , bi , big , bill , billion , biol , bj , bm , bn , bo , both , bottom , br , brief , briefly , bs , bt , but , buy , bv , bw , by , bz , c , c’mon , c’s , ca , call , came , can , can’t , cannot , cant , caption , case , cases , cause , causes , cc , cd , certain , certainly , cf , cg , ch , changes , ci , ck , cl , clear , clearly , click , cm , cmon , cn , co , co. 
, com , come , comes , computer , con , concerning , consequently , consider , considering , contain , containing , contains , copy , corresponding , could , could’ve , couldn , couldn’t , couldnt , course , cr , cry , cs , cu , currently , cv , cx , cy , cz , d , dare , daren’t , darent , date , de , dear , definitely , describe , described , despite , detail , did , didn , didn’t , didnt , differ , different , differently , directly , dj , dk , dm , do , does , doesn , doesn’t , doesnt , doing , don , don’t , done , dont , doubtful , down , downed , downing , downs , downwards , due , during , dz , e , each , early , ec , ed , edu , ee , effect , eg , eh , eight , eighty , either , eleven , else , elsewhere , empty , end , ended , ending , ends , enough , entirely , er , es , especially , et , et-al , etc , even , evenly , ever , evermore , every , everybody , everyone , everything , everywhere , ex , exactly , example , except , f , face , faces , fact , facts , fairly , far , farther , felt , few , fewer , ff , fi , fifteen , fifth , fifty , fify , fill , find , finds , fire , first , five , fix , fj , fk , fm , fo , followed , following , follows , for , forever , former , formerly , forth , forty , forward , found , four , fr , free , from , front , full , fully , further , furthered , furthering , furthermore , furthers , fx , g , ga , gave , gb , gd , ge , general , generally , get , gets , getting , gf , gg , gh , gi , give , given , gives , giving , gl , gm , gmt , gn , go , goes , going , gone , good , goods , got , gotten , gov , gp , gq , gr , great , greater , greatest , greetings , group , grouped , grouping , groups , gs , gt , gu , gw , gy , h , had , hadn’t , hadnt , half , happens , hardly , has , hasn , hasn’t , hasnt , have , haven , haven’t , havent , having , he , he’d , he’ll , he’s , hed , hell , hello , help , hence , her , here , here’s , hereafter , hereby , herein , heres , hereupon , hers , herself , herse” , hes , hi , hid , high , 
higher , highest , him , himself , himse” , his , hither , hk , hm , hn , home , homepage , hopefully , how , how’d , how’ll , how’s , howbeit , however , hr , ht , htm , html , http , hu , hundred , i , i’d , i’ll , i’m , i’ve , i.e. , id , ie , if , ignored , ii , il , ill , im , immediate , immediately , importance , important , in , inasmuch , inc , inc. , indeed , index , indicate , indicated , indicates , information , inner , inside , insofar , instead , int , interest , interested , interesting , interests , into , invention , inward , io , iq , ir , is , isn , isn’t , isnt , it , it’d , it’ll , it’s , itd , itll , its , itself , itse” , ive , j , je , jm , jo , join , jp , just , k , ke , keep , keeps , kept , keys , kg , kh , ki , kind , km , kn , knew , know , known , knows , kp , kr , kw , ky , kz , l , la , large , largely , last , lately , later , latest , latter , latterly , lb , lc , least , length , less , lest , let , let’s , lets , li , like , liked , likely , likewise , line , little , lk , ll , long , longer , longest , look , looking , looks , low , lower , lr , ls , lt , ltd , lu , lv , ly , m , ma , made , mainly , make , makes , making , man , many , may , maybe , mayn’t , maynt , mc , md , me , mean , means , meantime , meanwhile , member , members , men , merely , mg , mh , microsoft , might , might’ve , mightn’t , mightnt , mil , mill , million , mine , minus , miss , mk , ml , mm , mn , mo , more , moreover , most , mostly , move , mp , mq , mr , mrs , ms , msie , mt , mu , much , mug , must , must’ve , mustn’t , mustnt , mv , mw , mx , my , myself , myse” , mz , n , na , name , namely , nay , nc , nd , ne , near , nearly , necessarily , necessary , need , needed , needing , needn’t , neednt , needs , neither , net , netscape , never , neverf , neverless , nevertheless , new , newer , newest , next , nf , ng , ni , nine , ninety , nl , no , no-one , nobody , non , none , nonetheless , noone , nor , normally , nos , not , noted , nothing 
, notwithstanding, novel , now , nowhere , np , nr , nu , null , number , numbers , nz , o , obtain , obtained , obviously , of , off , often , oh , ok , okay , old , older , oldest , om , omitted , on , once , one , one’s , ones , only , onto , open , opened , opening , opens , opposite , or , ord , order , ordered , ordering , orders , org , other , others , otherwise , ought , oughtn’t , oughtnt , our , ours , ourselves , out , outside , over , overall , owing , own , p , pa , page , pages , part , parted , particular , particularly , parting , parts , past , pe , per , perhaps , pf , pg , ph , pk , pl , place , placed , places , please , plus , pm , pmid , pn , point , pointed , pointing , points , poorly , possible , possibly , potentially , pp , pr , predominantly , present , presented , presenting , presents , presumably , previously , primarily , probably , problem , problems , promptly , proud , provided , provides , pt , put , puts , pw , py , q , qa , que , quickly , quite , qv , r , ran , rather , rd , re , readily , really , reasonably , recent , recently , ref , refs , regarding , regardless , regards , related , relatively , research , reserved , respectively , resulted , resulting , results , right , ring , ro , room , rooms , round , ru , run , rw , s , sa , said , same , saw , say , saying , says , sb , sc , sd , se , sec , second , secondly , seconds , section , see , seeing , seem , seemed , seeming , seems , seen , sees , self , selves , sensible , sent , serious , seriously , seven , seventy , several , sg , sh , shall , shan’t , shant , she , she’d , she’ll , she’s , shed , shell , shes , should , should’ve , shouldn , shouldn’t , shouldnt , show , showed , showing , shown , showns , shows , si , side , sides , significant , significantly , similar , similarly , since , sincere , site , six , sixty , sj , sk , sl , slightly , sm , small , smaller , smallest , sn , so , some , somebody , someday , somehow , someone , somethan , something , 
sometime , sometimes , somewhat , somewhere , soon , sorry , specifically , specified , specify , specifying , sr , st , state , states , still , stop , strongly , su , sub , substantially , successfully , such , sufficiently , suggest , sup , sure , sv , sy , system , sz , t , t’s , take , taken , taking , tc , td , tell , ten , tends , test , text , tf , tg , th , than , thank , thanks , thanx , that , that’ll , that’s , that’ve , thatll , thats , thatve , the , their , theirs , them , themselves , then , thence , there , there’d , there’ll , there’re , there’s , there’ve , thereafter , thereby , thered , therefore , therein , therell , thereof , therere , theres , thereto , thereupon , thereve , these , they , they’d , they’ll , they’re , they’ve , theyd , theyll , theyre , theyve , thick , thin , thing , things , think , thinks , third , thirty , this , thorough , thoroughly , those , thou , though , thoughh , thought , thoughts , thousand , three , throug , through , throughout , thru , thus , til , till , tip , tis , tj , tk , tm , tn , to , today , together , too , took , top , toward , towards , tp , tr , tried , tries , trillion , truly , try , trying , ts , tt , turn , turned , turning , turns , tv , tw , twas , twelve , twenty , twice , two , tz , u , ua , ug , uk , um , un , under , underneath , undoing , unfortunately , unless , unlike , unlikely , until , unto , up , upon , ups , upwards , us , use , used , useful , usefully , usefulness , uses , using , usually , uucp , uy , uz , v , va , value , various , vc , ve , versus , very , vg , vi , via , viz , vn , vol , vols , vs , vu , w , want , wanted , wanting , wants , was , wasn , wasn’t , wasnt , way , ways , we , we’d , we’ll , we’re , we’ve , web , webpage , website , wed , welcome , well , wells , went , were , weren , weren’t , werent , weve , wf , what , what’d , what’ll , what’s , what’ve , whatever , whatll , whats , whatve , when , when’d , when’ll , when’s , whence , whenever , where , 
where’d , where’ll , where’s , whereafter , whereas , whereby , wherein , wheres , whereupon , wherever , whether , which , whichever , while , whilst , whim , whither , who , who’d , who’ll , who’s , whod , whoever , whole , wholl , whom , whomever , whos , whose , why , why’d , why’ll , why’s , widely , width , will , willing , wish , with , within , without , won , won’t , wonder , wont , words , work , worked , working , works , world , would , would’ve , wouldn , wouldn’t , wouldnt , ws , www , x , y , ye , year , years , yes , yet , you , you’d , you’ll , you’re , you’ve , youd , youll , young , younger , youngest , your , youre , yours , yourself , yourselves , youve , yt , yu , z , za , zero , zm , zr | 1298 |
en_df2 <- fst_prepare(data = english_sample_survey,
question = 'text',
id = 'id',
model = 'english-ewt',
stopword_list = 'smart',
language = 'en')
knitr::kable(head(en_df2, 5))

| doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 1 | joe | joe | PROPN | NNP | Number=Sing | 3 | nsubj | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 3 | talk | talk | VERB | VB | VerbForm=Inf | 0 | root | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 6 | doctor | doctor | NOUN | NN | Number=Sing | 3 | obl | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 10 | nurse | nurse | NOUN | NN | Number=Sing | 8 | obj | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 13 | doctor | doctor | NOUN | NN | Number=Sing | 14 | nsubj | NA | NA |
2b. Optional: Provide your own list of stopwords
If a stopword list is not available for your language, or you would
prefer to provide your own, you can use the manual_list
option within fst_prepare() (or
fst_rm_stop_punct()), making sure to also set either
manual = TRUE or
stopword_list = "manual".
You can also choose not to remove stopwords, but removing them will often give more meaningful results.
If you provide a manual list, you can leave language as
its default value.
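If your stopwords are kept in a plain text file, one word per line, a few lines of base R can turn them into a character vector suitable for manual_list. The following is a minimal sketch, not part of finnsurveytext: the file and its contents are hypothetical, and lower-casing the words is an assumption based on the tokens appearing in lower case in the output tables of this vignette.

```r
# Hypothetical stopword file, one word per line. It is written to a temporary
# file here only so that the sketch is self-contained.
stopword_file <- tempfile(fileext = ".txt")
writeLines(c("and", "The", "  of ", "", "the", "you"), stopword_file)

# Read the file, trim surrounding whitespace and lower-case every entry
# (assumption: tokens are compared in lower case).
words <- tolower(trimws(readLines(stopword_file)))

# Drop blank lines and duplicates.
my_stopwords <- unique(words[nzchar(words)])

print(my_stopwords)

# The resulting vector could then be supplied to fst_prepare(), for example:
# fst_prepare(data = english_sample_survey, question = 'text', id = 'id',
#             model = 'english-ewt', stopword_list = 'manual',
#             manual_list = my_stopwords)
```

Cleaning the list before passing it on avoids entries that can never match a token, such as words with stray spaces or empty strings.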
Demonstration
# Example of providing a manual list
# Either as a character vector ...
manualList <- c('and', 'the', 'of', 'you', 'me', 'ours', 'mine', 'them', 'theirs')
# ... or as a single comma-separated string
manualList2 <- "to, the, I"
df1 <- fst_prepare(data = english_sample_survey,
question = 'text',
id = 'id',
model = 'english-ewt',
manual_list = manualList,
stopword_list = 'manual'
)
knitr::kable(head(df1, 5))

| doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 1 | joe | joe | PROPN | NNP | Number=Sing | 3 | nsubj | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 2 | should | should | AUX | MD | VerbForm=Fin | 3 | aux | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 3 | talk | talk | VERB | VB | VerbForm=Inf | 0 | root | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 4 | to | to | ADP | IN | NA | 6 | case | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 6 | doctor | doctor | NOUN | NN | Number=Sing | 3 | obl | NA | NA |
df2 <- fst_prepare(data = english_sample_survey,
question = 'text',
id = 'id',
model = 'english-ewt',
manual = TRUE,
manual_list = manualList2
)
knitr::kable(head(df2, 5))

| doc_id | paragraph_id | sentence_id | sentence | token_id | token | lemma | upos | xpos | feats | head_token_id | dep_rel | deps | misc |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 1 | joe | joe | PROPN | NNP | Number=Sing | 3 | nsubj | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 2 | should | should | AUX | MD | VerbForm=Fin | 3 | aux | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 3 | talk | talk | VERB | VB | VerbForm=Inf | 0 | root | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 6 | doctor | doctor | NOUN | NN | Number=Sing | 3 | obl | NA | NA |
| 1 | 1 | 1 | Joe should talk to the doctor or tell the nurse that the doctor said he has to come back in two weeks. | 7 | or | or | CCONJ | CC | NA | 8 | cc | NA | NA |
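The two tables differ because the two manual lists contain different words: the first list removes "the" but keeps "to", while the second removes both. The base R sketch below reproduces this on the first few tokens of the first response. It only illustrates the filtering idea and is not the package's implementation; the token vector is typed in by hand, and splitting the comma-separated string with strsplit() is an assumption about how such a string might be interpreted.

```r
# First few tokens of response 1, lower-cased as in the tables above
# (typed in by hand for illustration).
tokens <- c("joe", "should", "talk", "to", "the", "doctor", "or", "tell")

manualList  <- c('and', 'the', 'of', 'you', 'me', 'ours', 'mine', 'them', 'theirs')
manualList2 <- "to, the, I"

# Assumption: a comma-separated string is split into individual words,
# trimmed and lower-cased before matching.
list2 <- tolower(trimws(strsplit(manualList2, ",")[[1]]))

# Keep only the tokens that do not appear in each stopword list.
kept1 <- tokens[!tokens %in% manualList]
kept2 <- tokens[!tokens %in% list2]

print(head(kept1, 5))  # "joe" "should" "talk" "to" "doctor"
print(head(kept2, 5))  # "joe" "should" "talk" "doctor" "or"
```

The first five tokens kept by each list match the token column of the corresponding table.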
The remainder of the package works the same way regardless of the language of the survey responses.