
Standardization and other processing of raw dictionary data

Term standardization

To facilitate comparisons between dictionaries, the terms in all dictionaries have been transformed into a common format. All terms are lowercase, and spelling and spacing have been standardized across dictionaries (US spellings are generally used). Spaces are represented by underscores. Accents and punctuation have been removed. For transparency, the code used to perform this standardization is included in the source code of this package at R/raw_data_processing.R.
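As a rough illustration of the kind of transformation described above, here is a minimal base R sketch. The function name `standardize_term()` and the specific accent mapping are hypothetical and are not taken from R/raw_data_processing.R; the sketch also leaves out the spelling harmonization step, which would need a lookup table of variant spellings.

```r
# Hypothetical helper, not the package's own code: lowercases terms,
# strips a handful of common accents, turns runs of whitespace into
# single underscores, and drops any remaining punctuation.
standardize_term <- function(x) {
  x <- tolower(x)
  # Map a small, illustrative set of accented letters to plain ASCII.
  x <- chartr(
    "\u00e0\u00e1\u00e2\u00e4\u00e7\u00e8\u00e9\u00ea\u00eb\u00ed\u00f1\u00f3\u00f6\u00fa\u00fc",
    "aaaaceeeeinoouu",
    x
  )
  # Collapse whitespace to underscores after trimming the ends.
  x <- gsub("[[:space:]]+", "_", trimws(x))
  # Remove anything that is not a lowercase letter, digit, or underscore.
  gsub("[^a-z0-9_]", "", x)
}

standardize_term(c("Best  Friend", " Mother ", "Fianc\u00e9"))
```

How the package itself treats particular characters, such as hyphens, may differ from this sketch; the source file named above is the authoritative reference.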

Summary statistics calculation

Where individual data is available, the mean, standard deviation, and covariance values that are reported in the summary data have been calculated directly from the individual data within this package. The script used for this is located in the source code of this package at data-raw/dicts.R, which calls functions included in R/raw_data_processing.R. Users can calculate summary statistics for their own subsets of the individual data using epa_summary().
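To make the calculation concrete, the following base R sketch computes a mean vector, standard deviations, and a covariance matrix per term from individual-level ratings. It does not use `epa_summary()`, whose arguments are not described here. The column names `term`, `E`, `P`, and `A`, the helper `summarize_term()`, and the ratings themselves are made up for illustration.

```r
# Toy individual-level ratings (invented values): one row per rater per term.
ratings <- data.frame(
  term = c("doctor", "doctor", "doctor", "thief", "thief", "thief"),
  E = c(2.5, 3.0, 2.0, -3.0, -2.0, -2.5),
  P = c(2.0, 2.5, 3.0, -0.5, 0.5, 0.0),
  A = c(0.5, 0.0, -0.5, 1.0, 1.5, 0.5)
)

# Hypothetical helper: summary statistics for one term's ratings.
summarize_term <- function(d) {
  epa <- d[, c("E", "P", "A")]
  list(
    mean = colMeans(epa),             # per-dimension means
    sd   = vapply(epa, sd, numeric(1)), # per-dimension standard deviations
    cov  = cov(epa)                   # 3 x 3 covariance matrix
  )
}

# Split the ratings by term and summarize each subset.
summaries <- lapply(split(ratings, ratings$term), summarize_term)
summaries$doctor$mean
```

The same pattern applies to any subset of raters: filter the rows first, then summarize.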

Institution code standardization

Dictionaries often contain 14-digit binary strings known as “institution codes” that contain information about what contexts terms apply to. See the dictionaries help page for more information on what these strings mean and how to use functions within this package that help demystify and filter by them.

To determine an institution code for a term, researchers code whether or not they believe the term applies in a number of different social contexts. This coding process has been done several times by different researchers over time.
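For readers who want to inspect these strings by hand, the sketch below treats an institution code as 14 yes/no flags, one per position. The helper names and the example codes are hypothetical, and the sketch assumes the code is stored as exactly 14 characters of `0` and `1` with no separators. What each position means is documented on the dictionaries help page, and the package's own helper functions should be preferred in practice.

```r
# Hypothetical helper: split a single 14-character code into logical flags.
parse_institution_code <- function(code) {
  stopifnot(length(code) == 1, grepl("^[01]{14}$", code))
  strsplit(code, "")[[1]] == "1"
}

flags <- parse_institution_code("10010000000000")
which(flags)  # positions at which the term was coded as applying

# Hypothetical helper: does each code have a 1 at the given position?
has_flag <- function(codes, position) {
  substr(codes, position, position) == "1"
}

# Made-up terms and codes, filtered to those flagged at position 4.
toy <- data.frame(
  term = c("term_a", "term_b", "term_c"),
  code = c("10010000000000", "00000000000001", "00010000000000")
)
toy[has_flag(toy$code, 4), ]
```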

Because membership in institutional categories is sometimes ambiguous, repeated coding means that the same term is sometimes assigned different institution codes in different raw data sets. In addition, a term often has an assigned institution code in one data set but not in others. In the data made available in this package, I have attempted to standardize and extend institution code assignment.

The most comprehensive attempt at institution code creation was undertaken by Kaitlyn Boyle and Dawn Robinson, who led a team of coders to develop institution codes for terms contained in the uga2015 dictionary. Where the code assigned to a term by this team conflicts with the code assigned to the same term in another data set, I replace the conflicting code with the UGA code. Where a term has an institution code in uga2015 but not in another data set, I fill in the institution code in the other data set with the one assigned in uga2015. The script used for this is located in the source code of this package at data-raw/dicts.R.

Note that this process did not resolve all institution code conflicts, but rather only those for terms that were assigned codes in uga2015. A small number of terms that were not recoded in uga2015 still have conflicting codes in other data sets because in these instances it is not clear which code should be preferred.
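The replace-and-fill rule described in this section can be sketched in base R as follows. This is an illustration under stated assumptions, not the code in data-raw/dicts.R: the function `apply_reference_codes()`, the column names `term` and `code`, and the toy terms and codes are all hypothetical.

```r
# Hypothetical helper: wherever the reference data set has a code for a term,
# use it in the target data set, overwriting conflicts and filling in gaps.
# Terms with no reference code keep whatever the target already had.
apply_reference_codes <- function(target, reference) {
  idx <- match(target$term, reference$term)
  ref_code <- reference$code[idx]
  use_ref <- !is.na(ref_code)
  target$code[use_ref] <- ref_code[use_ref]
  target
}

# Made-up reference codes standing in for the uga2015 assignments.
reference <- data.frame(
  term = c("term_a", "term_b"),
  code = c("10000000000000", "01000000000000"),
  stringsAsFactors = FALSE
)

# Made-up target data set: one conflict, one missing code, one unmatched term.
target <- data.frame(
  term = c("term_a", "term_b", "term_c"),
  code = c("00000000000001", NA, "00100000000000"),
  stringsAsFactors = FALSE
)

apply_reference_codes(target, reference)
```

In this toy example `term_c` keeps its original code because the reference has no entry for it, which mirrors the unresolved cases noted above.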