The SemNetCleaner package houses several functions for the cleaning and preprocessing of semantic data. The purpose of this package is to facilitate efficient and reproducible preprocessing of semantic data. Notably, other R packages perform similar functions (e.g., spell-checking, text mining) such as hunspell (ooms2018hunspell?), qdap (rinker2019qdap?), and tm (feinerer2008tm?). However, the SemNetCleaner package sets itself apart from these other packages by focusing specifically on commonly used tasks for SemNA (e.g., verbal fluency), which allows for greater automation of data cleaning and preprocessing.
The SemNetCleaner package applies several steps to preprocess raw verbal fluency data so that it is ready to be used for estimating semantic networks. These steps include spell-checking, verifying the accuracy of the spell-check, and obtaining a binary response matrix for network estimation. To initialize this process, the following code must be run:
# Run 'textcleaner'
clean <- textcleaner(data = open.animals[,-c(1:2)], miss = 99,
partBY = "row", dictionary = "animals")
textcleaner
is the main function that handles the data cleaning and preprocessing in SemNetCleaner (for argument descriptions, see Table 2). For input into data
, it’s strongly recommended that the user input the full verbal fluency dataset and not data already separated into groups. If verbal fluency responses are already separated, then they will need to be inputted and preprocessed separately. Therefore, it’s preferable to separate the preprocessed data into groups at a later stage of the SemNA pipeline.
Table 2. textcleaner Arguments | |
Argument | Description |
---|---|
data
|
A matrix or data frame object that contains the participants’ IDs and semantic data |
miss
|
A number or character that corresponds to the symbol used for missing data. The default is set to 99
|
partBY
|
Specifies whether participants are across the rows ("row" ) or down the columns ("col" )
|
dictionary
|
Specifies which dictionaries from SemNetDictionaries should be used (more than one is possible). If no dictionary is chosen, then the "general" dictionary is used
|
tolerance
|
Enables automated spell-checking using the Damerau-Levenshtein distance (defaults to 1 )
|
When running the above code, textcleaner
will start preprocessing the data immediately. The reader may notice that a progress bar appears, which lets the user know about how many more words need to be processed (i.e., number of words processed out of how many words in total need to be processed). The progress bar should read “10 of 269 words done,” meaning that textcleaner
has already automatically processed several words. Before continuing with the tutorial, we describe how the automatic spell-check operations of textcleaner
work and then continue the tutorial with the manual spell-check operation.
The first step of textcleaner
is to spell-check all responses. The spell-checking algorithm of textcleaner
uses automatic and manual spell-checking processes in parallel. First, missing values (e.g., NA
), punctuations, digits, and extra white spaces are removed from each response in the raw verbal fluency data. From these responses, only the unique responses across participants are obtained, which are used as input into the spell-checking algorithm. Although these unique responses include responses that are misspelled, they drastically reduce the number of responses that textcleaner
needs to spell-check.
Next, these unique responses are checked against a dictionary and its associated monikers (only if it’s a dictionary from SemNetDictionaries) and replaced with a homogenized name (e.g., grizzly \(\rightarrow\) grizzly bear). In this process, responses are checked against their plural and singular forms to further expedite the identification of correctly spelled responses. Responses that are matched with their plural form are converted to their singular form.
The unique responses that have not been matched in this process are then forwarded, one-by-one, to the spell-check algorithm. The algorithm will first attempt to auto-correct the response. If it cannot be auto-corrected, then the response is passed onto the manual portion of the algorithm. This process is repeated for each unique response entered into the spell-check algorithm. We first describe how a response gets auto-corrected in the automated spell-check and then we describe the manual spell-check for a response that could not be auto-corrected.
There are two auto-correct operations in the automated portion of the algorithm. The first auto-correct operation computes the Damerau-Levenshtein (DL) distance (Damerau, 1964; Levenshtein, 1966), a method to compute the edit distance (i.e., the (dis)similarity of two words), to determine how similar a given response is to every response in the dictionary. This computation is done by counting the number of errors—insertion (i.e., adding a letter), deletion (i.e., removing a letter), substitution (i.e., exchanging one letter for an incorrect letter), and transposition (i.e., changing the position of two adjacent letters)—that are made between the target word and potential response from the dictionary.
Notably, Damerau (1964) states that the majority of spelling errors (more than 80%) are made with only one of these errors. Based on this finding, the auto-correct operation in textcleaner
can be set (using the tolerance
argument, see Table 2) to automatically correct an incorrect (or inappropriate) response when the DL distance is less than or equal to the given tolerance
value (e.g., one). The tolerance
value is used as a criterion for how close a response in the dictionary must be to the original response before it is auto-corrected. These values are integers that range anywhere from 1 to infinity. The default value is 1, following Damerau’s (1964) observation. Values greater than 1 provide a less strict criterion, however, this may increase the number of incorrect corrections made by the automated portion of the algorithm. If more than one response in the dictionary has a DL distance that is less than or equal to the tolerance
value, then they are passed onto a second auto-correct operation.
The second auto-correct operation checks for spelling errors that may have been due to erroneous keystrokes on a QWERTY keyboard—the so-called, QWERTY distance. This distance is computed by summing the physical distance (i.e., number of keys) between the letters in the response and the letters in the responses passed on from the first auto-correct operation. The letter “f,” for example, has a distance of one from the letters “d,” “e,” “r,” “t,” “g,” “v,” and “c.” This second auto-correct operation will automatically correct an incorrect (or inappropriate) response when the distance is less than or equal to the same tolerance
value as the DL distance. If no response or more than one response is less than or equal to the tolerance
value, then the response is passed onto the manual spell-check.
Because the automated spell-check occurs prior to manual spell-check for each response, the user will only receive manual spell-check prompts for responses that could not be auto-corrected. The manual spell-checking operation allows the user to self-select the appropriate correction by choosing one of several response options from an interactive menu. Our tutorial will cover an example of each response option in the interactive menu. After running the above textcleaner
code on the open.animal
data, an interactive menu appears that allows the reader to correct an incorrectly spelled word (Figure 2; for figures, see Christensen & Kenett (2020)).
The first prompt contains a continuous string (i.e., multiple responses entered as a single response): turtle <<catdog>> elephant fish bird squiral rabbit fox deer monkey giraff
. The target response that needs a decision is denoted between <<
and >>
(in this example: catdog
). Under the continuous string, the reader will find response options (denoted by Potential responses:
) to manually correct the response. The first 10 response options are the responses in the dictionary that had the lowest DL distance with the target response. The next six response options are additional options that provide the user with a greater flexibility of options for correcting responses. These six additional response options are defined in the Table 3.
Table 3. Additional Response Options | |
Option | Description |
---|---|
11:ADD TO DICTIONARY
|
Allows user to add the response to a temporary appendix dictionary |
12:TYPE MY OWN
|
Allows user to type their own response if it is not provided in the potential response options (if necessary, multiple responses can be typed using spaces) |
13:GOOGLE IT
|
Opens the user’s default internet browser to Google’s webpage and searches for a definition of the original response using the terms: dictionary ‘RESPONSE’ |
14:BAD RESPONSE
|
Marks the original response as bad and makes it so the response will be missing (i.e., NA ) and not included in the final results
|
15:SKIP
|
Allows the original response to be included in the final results but does not add it to the temporary appendix dictionary |
16:CONTEXT
|
(Single responses only) Provides the target response in context of the participant’s other responses. Will print each participant’s responses that provide the target response |
16:BAD STRING
|
(Continuous strings only) Marks the entire continuous string of responses as bad and makes all responses missing (i.e., NA ) and not included in the final results
|
For the target response in this example (i.e., catdog
), the participant likely intended to type cat and dog as separate responses. When examining the offered responses for correction (options 1-10
), the reader may notice that cat and dog are listed but none of the response options have the option to separate the response into cat and dog. To do this, we can use one of the additional response options: 12:TYPE MY OWN
. This response option acts as a catch-all option that enables the user to type the response that should replace the original response.
To select this response option, the reader can type 12
and press ENTER
. Next, the reader will be prompted with Type response:
. Here, the reader can type their correction (without quotations) of what word(s) should replace the original response. The response cat dog
should be typed and the reader can press ENTER
(Figure 3).
This completes the first prompt and moves the reader to the second prompt. The second prompt is another continuous string: dog cat horse <<guinea>> pig rooster bird fish mouse rat owl
(Figure 4).
For this continuous string, all words are actually spelled correctly; however, because textcleaner
sifts through each response word-by-word, it’s stopped at the word guinea, which is not in the dictionary. Based on the participant’s next word pig
, they likely meant to type guinea pig. The reader should not correct the response; instead, textcleaner
will handle this when parsing the continuous string. As a general rule of thumb, the reader should always focus on the target word when making a correction. Because guinea is spelled correctly,1 the reader can use the 15:SKIP
option, which will keep the word “as is,” by pressing 15
and then ENTER
. textcleaner
will remember this choice the next time it encounters the word guinea and will no longer prompt the user for a correction.
Next, the reader will be prompted when textcleaner
attempts to parse the response (Figure 4). Here, the reader should decide whether guinea and pig should be combined into a single response or remain separated as two responses. The response should be combined into a single response, guinea pig, so the response 1:combined:'guinea pig'
should be selected by pressing 1
and then ENTER
. The next prompt is another “combine or separate” response option for "bat cat dog sheep"
. With this prompt, the reader can press 2
for 2:separated:'bat' cat' 'dog' 'sheep'
and ENTER
to separate the string into individual responses.
Most responses are fairly easy to determine the word the participant intended with the offered responses; however, there are instances where it’s impossible to know exactly what the participant intended. An example of this is in the next prompt: dog cat <<mose>> moose horse lion tiger bear dear doe pig cow
(Figure 5).
Here, the first three response options: 1:mole
, 2:moose
, and 3:mouse
are equally plausible. On the one hand, it’s unlikely the participant intended to type mole because the “s” and “l” keys are quite distant from one another. On the other hand, it’s hard to know whether the participant intended to type moose or mouse. The next response in the string is moose, which could mean that the participant attempted to correct their initial response; however, there is no way of knowing for certain. In these instances, our recommendation is to err on the conservative side—that is, to not include the response in the final results. To do so, the user can type 14
for 14:BAD RESPONSE
and press ENTER
, which will remove the response from the final results.
The reader can continue through the next ten prompts using the response options that have been covered. Below is a table with the response options we selected for these prompts:
Table 4. Responses for next ten prompts | ||
Prompt | Selection | Type My Own |
---|---|---|
creatures
|
14:BAD RESPONSE
|
— |
catefrog
|
12:TYPE MY OWN
|
cat frog |
criters
|
14:BAD RESPONSE
|
— |
mario
|
14:BAD RESPONSE
|
— |
garafi
|
14:BAD RESPONSE
|
— |
snack
|
14:BAD RESPONSE
|
— |
girrage
|
14:BAD RESPONSE
|
— |
<<gieuna>> pig
|
12:TYPE MY OWN
|
guinea |
jesus
|
14:BAD RESPONSE
|
— |
squrill
|
5:squirrel
|
— |
After going through these prompts, the reader will arrive at the prompt: <<your>> mom
. Sometimes all responses in a prompt will be inappropriate for the category like your, mom, and the string of your mom. In these instances, the user can select 16
and press ENTER
for the response option 16:BAD STRING
. This response option will remove all responses in the string from the final results (Table 3). The next two prompts—<<geaniu>> pig
and dinasor
—can be corrected using response options we’ve already covered: 12:TYPE MY OWN
(guinea
) and 2:dinosaur
.
After managing these two prompts, the reader comes to a prompt for bluebird
. If the user is unsure whether a word is an actual category exemplar (or just sounds like one), then they can press 13
and ENTER
for the response option 13:GOOGLE IT
. This will open the user’s default web browser and search Google using the terms: “dictionary ‘bluebird’.” When doing so, we can see that bluebird is indeed a category exemplar. Because bluebird is not in the dictionary, the reader should add it to their temporary appendix dictionary. The reader can do so by pressing 11
and ENTER
for the response option 11:ADD TO DICTIONARY
.
The options ADD TO DICTIONARY
and TYPE MY OWN
allow the user to add the original or typed response, respectively, to a temporary appendix dictionary. For TYPE MY OWN
, the user will only be prompted to add the response to the temporary appendix dictionary if the typed response is not already in the (temporary) dictionary. textcleaner
will use these additional words to facilitate the automation of future instances of these words.
These examples fill out what is necessary to fully apply textcleaner
to the data. At the end of the textcleaner
process, the reader will be prompted on whether they would like to save their appendix dictionary to their computer, which allows them to use the dictionary in the future. If the user chooses to save the dictionary to their computer, then they will be asked to provide a name for this dictionary—for the tutorial, we named it: appendix
. The file will then be saved as appendix.dictionary.rds
in the directory the user chooses. Note that the appendix dictionary does not actually update the original pre-defined dictionary, so it’s necessary to input the name of the dictionary in the dictionary
argument of textcleaner
when using any appendix dictionary in the future (e.g., dictionary = c("animals", "appendix")
).
textcleaner
OutputThere are several other output objects from the textcleaner
function. These objects are stored in a list object, which we designated in our example as clean
. These output objects are summarized in the table below.
Table 5. textcleaner and correct.changes Output Objects
|
||
Object | Nested Object | Description |
---|---|---|
binary
|
— | Binary response matrix where rows are participants and columns are responses. 1’s are responses given by a participant and 0’s are responses not given by a participant |
responses
|
||
clean.resp
|
Spell-corrected response matrix where the ordering of the original responses are preserved. Inappropriate and duplicate responses have been removed | |
orig.resp
|
Original response matrix where uppercase letters were made to lowercase and white spaces before and after responses were removed | |
spellcheck
|
||
full
|
List of all responses whether or not they have been spell-corrected | |
auto
|
List of only unique responses that were auto-corrected and corrected by the user | |
removed
|
||
rows
|
Vector of rows for the participants with no appropriate responses | |
ids
|
Vector of the participants’ IDs with no appropriate responses | |
partChanges
|
ID
|
List of list objects labeled with each participant’s ID variable. Each participant’s list contains a data frame of the specific words that were changed for the participant |
These objects can be accessed using a dollar sign (e.g., clean$responses
) and nested objects can be accessed within their parent object (e.g., clean$responses$clean.resp
). Some of these output are useful for accessing the spell-check changes that occurred. For example, clean$spellcheck
contains objects that refer to the full list of original responses regardless of whether there were spelling changes ($full
) and a list of unique responses that were corrected during the spell-check algorithm ($auto
). The removed
object contains lists of participants who were removed because of a lack of appropriate responses, which can be identified by either the participant’s row (or column; $rows
) or ID variable ($ids
) in the input dataset (these will be the same if no ID variable is provided). Finally, the partChanges
object contains list objects, which correspond to each participant’s unique ID and the specific correction changes that were made to their responses.
Although textcleaner
is highly efficient and automatizes most of the cleaning process, it’s possible that some of the auto-correction changes are incorrect. Moreover, the user may have entered a wrong response option or misspelled a response in the TYPE MY OWN
option during the process. Therefore, the user may still need to make corrections to the output provided by textcleaner
. To view the changes made during the spell-checking step, the reader can enter the following code:
# View unique spelling changes
View(clean$spellcheck$auto)
The View
function will open a tab in R allowing the user to examine a matrix containing all of the unique changes that were made during the textcleaner
process (i.e., clean$spellcheck$auto
). The first column of this matrix is named “from” and contains the unique raw responses given by the sample. The next several columns are all named “to” and contain the spell corrected responses made by textcleaner
. The reader should see that the first row, for example, contains the response “life” in the “from” column and “louse” in the “to” column. At first, this may seem like an incorrect change; however, “life” was auto-corrected to “lice” during the spell-checking process, which was then changed to “louse” during the plural-to-singular form process.
Another worthwhile example is in the sixth row where a continuous string (i.e., “horse cat dog pig goat fidh deer duck swan goose bird eagle giraffe lion hippo”) was separated into individual responses. It’s important that the reader checks to make sure that (1) each response was separated correctly and (2) each response is spelled correctly. Finally, in the twenty-fourth row, “creatures” appears in the “from” column and the “to” column is blank. A blank in the “to” column means that the response in the “from” column has been removed from the preprocessed data.
The reader should inspect each response in the “from” column and verify that the response(s) in the “to” column(s) are correct. If any responses in the “to” column(s) are not correct, then they need to be corrected. To do so, the function correct.changes
can be used:
# Corrected 'clean' object from 'textcleaner'
corr.clean <- correct.changes(textcleaner.obj = clean,
dictionary = "animals",
incorrect = c("house", "beasts", "god",
"gunny pig", "liam", "loin",
"farrot", "oh my", "lizers",
"teranchilla","manster", "lamp"))
correct.changes
accepts textcleaner
objects only. This means that the output we stored from our textcleaner
run (i.e., clean
) should be input into this function (i.e., textcleaner.obj = clean
). Like textcleaner
, the user can also specify one or more dictionaries from SemNetDictionaries to provide potential response options. Finally, the argument incorrect
is used as the input for responses that were incorrectly changed. The responses entered here should be the original response in the “from” column. In the code above, we’ve identified several of these responses (i.e., incorrect = c("house", "beasts", "god", "gunny pig", "liam", "loin", "farrot", "oh my", "lizers", "teranchilla","manster", "lamp"))
).
Similar to textcleaner
, correct.changes
uses an interactive menu to correct responses. The first three response options—1:TYPE MY OWN
, 2:GOOGLE IT
, and 3:BAD RESPONSE
—are the same as those in textcleaner
(Table 3). Following these response options are the potential responses from the dictionary. If a response from the dictionary does not offer the appropriate correction, then TYPE MY OWN
can be used. When using TYPE MY OWN
, if the old response was intended to be multiple responses (e.g., catdog), then the user should type a comma to separate the responses (e.g., cat, dog
). If no comma is added, then correct.changes
will consider the TYPE MY OWN
response as a continuous string. BAD RESPONSE
is necessary if the changed response is not correct and the old response is an inappropriate category exemplar (this also works for continuous strings).
The reader should run through correct.changes
and make the appropriate changes. Below is a table of the changes we applied:
Table 6. Responses for correct.changes
|
|||
From | To | Selection | Type My Own |
---|---|---|---|
house
|
mouse
|
3:BAD RESPONSE
|
— |
beasts
|
yeast
|
3:BAD RESPONSE
|
— |
god
|
cod
|
3:BAD RESPONSE
|
— |
gunny pig
|
'bunny' 'pig'
|
4:guinea pig
|
— |
liam
|
lion
|
3:BAD RESPONSE
|
— |
loin
|
loon
|
4:lion
|
— |
farrot
|
parrot
|
1:TYPE MY OWN
|
ferret |
oh my
|
ox
|
3:BAD RESPONSE
|
— |
lizers
|
liger
|
5:lizard
|
— |
teranchilla
|
chinchilla
|
5:tarantula
|
— |
manster
|
hamster
|
3:BAD RESPONSE
|
— |
lamp
|
lamb
|
3:BAD RESPONSE
|
— |
When finished, correct.changes
will store its output in the object corr.clean
. Once again, the user should verify that all changes are correct using View(corr.clean$spellcheck$auto)
. This process should be repeated until all changes are correct. Once thoroughly checked for accuracy, a final .csv file can be saved to distribute these changes to others (e.g., colleagues, peer reviewers), enhancing the transparency of the preprocessing stage of SemNA. To do so, the reader can create a .csv file:
# Save .csv of unique changes
write.csv(corr.clean$spellcheck$auto,
"unique_changes.csv", row.names = FALSE)
correct.changes
outputThe output of correct.changes
is exactly the same as textcleaner
(Table 5), except that it has been corrected for the incorrect changes in the textcleaner
output. Note that this output was saved in a object with a different name: corr.clean
. A couple of these objects are worth detailing further because they can be used for standard verbal fluency analyses. First, the nested object clean.resp
contains the cleaned verbal fluency data for each participant in the order the participant gave the responses. These data are useful for performing standard analyses of clustering and switching (e.g., Troyer, Moscovitch, & Winocur, 1997), particularly with the advent of automated scoring procedures (e.g., Kim, Kim, Wolters, MacPherson, & Park, 2019). This can be exported using the following code:
# Save .csv of clean responses
write.csv(corr.clean$responses$clean.resp, "cleaned_verbal_fluency.csv")
Second, the binary
object contains the binary response matrix where each participant received a `1' for a response they provided and a `0' for a response they did not. This matrix can be used to total the number of appropriate responses each participant gave. This can be done using the following code:
# Verbal fluency response totals
totals <- rowSums(corr.clean$binary)
In this section we described and demonstrated how the packages SemNetDictionaries and SemNetCleaner are used to facilitate efficient and reproducible preprocessing of verbal fluency data. In this process, the raw data have been spell-checked, duplicate and inappropriate responses have been removed, monikers have been converged into one response, and a binary response matrix has been generated. The binary response matrix (corr.clean$binary
) is used in the next stage of the SemNA pipeline to estimate semantic networks.
Although guinea is spelled correctly, it should not be added to the dictionary—there is no animal with only the name guinea. If added, then the auto-correct functions will begin treating guinea as an appropriate category exemplar, despite it not being an appropriate response by itself.↩︎