Title: Linguistic Matching and Accommodation
Description: Measure similarity between texts. Offers a variety of processing tools and similarity metrics to facilitate flexible representation of texts and matching. Implements forms of Language Style Matching (Ireland & Pennebaker, 2010) <doi:10.1037/a0020386> and Latent Semantic Analysis (Landauer & Dumais, 1997) <doi:10.1037/0033-295X.104.2.211>.
Authors: Micah Iserman
Maintainer: Micah Iserman <[email protected]>
License: GPL (>= 2)
Version: 1.0.8
Built: 2024-11-08 21:08:12 UTC
Source: https://github.com/miserman/lingmatch
Assess Dictionary Categories Within a Latent Semantic Space
dictionary_meta(dict, space = "auto", n_spaces = 5, suggest = FALSE,
  suggestion_terms = 10, suggest_stopwords = FALSE,
  suggest_discriminate = TRUE, expand_cutoff_freq = 0.98,
  expand_cutoff_spaces = 10, dimension_prop = 1, pairwise = TRUE,
  glob = TRUE, space_dir = getOption("lingmatch.lspace.dir"),
  verbose = TRUE)
dict: A vector of terms, list of such vectors, or a matrix-like object to be categorized by read.dic.
space: A vector space used to calculate similarities between terms. Names of spaces (see select.lspace), a matrix with terms as row names, or "auto" to select a space based on matched terms.
n_spaces: Number of spaces to draw from if space is "auto".
suggest: Logical; if TRUE, will search for terms to suggest adding to each category.
suggestion_terms: Number of terms to use when selecting suggested additions.
suggest_stopwords: Logical; if TRUE, will include function words among suggestions.
suggest_discriminate: Logical; if TRUE, will adjust for similarity to other categories when selecting suggested additions.
expand_cutoff_freq: Proportion of mapped terms to include when expanding dictionary terms. Applies when space is "auto".
expand_cutoff_spaces: Number of spaces in which a term has to appear to be considered for expansion. Applies when space is "auto".
dimension_prop: Proportion of dimensions to use when searching for suggested additions, where less than 1 will calculate similarities to the category core using fewer dimensions of the space.
pairwise: Logical; if FALSE, will compare candidate terms with an averaged category vector rather than with each category term.
glob: Logical; if TRUE, converts globs (asterisk wildcards) to regular expressions.
space_dir: Directory from which space should be loaded.
verbose: Logical; if FALSE, will not show status messages.
A list:
expanded: A version of dict with fuzzy terms expanded.
summary: A summary of each dictionary category.
terms: Match (expanded term) similarities within terms and categories.
suggested: If suggest is TRUE, a list with suggested additions for each dictionary category. Each entry is a named numeric vector with similarities for each suggested term.
To just expand fuzzy terms, see report_term_matches(). Similar information is provided in the dictionary builder web tool.
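As a minimal sketch of the suggestion workflow (the category and terms here are illustrative, and a downloaded space is assumed, as in the example below):

if (dir.exists("~/Latent Semantic Spaces")) {
  dict <- list(affect = c("happ*", "sad*", "angr*"))
  meta <- dictionary_meta(dict,
    suggest = TRUE, space_dir = "~/Latent Semantic Spaces"
  )
  # named numeric vector of candidate terms and their similarities
  meta$suggested$affect
}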
Other Dictionary functions: download.dict(), lma_patcat(), lma_termcat(), read.dic(), report_term_matches(), select.dict()
if (dir.exists("~/Latent Semantic Spaces")) {
  dict <- list(
    furniture = c("table", "chair", "desk*", "couch*", "sofa*"),
    well_adjusted = c("happy", "bright*", "friend*", "she", "he", "they")
  )
  dictionary_meta(dict, space_dir = "~/Latent Semantic Spaces")
}
Downloads the specified dictionaries from osf.io/y6g5b.
download.dict(dict = "lusi", check.md5 = TRUE, mode = "wb",
  dir = getOption("lingmatch.dict.dir"), overwrite = FALSE)
dict: One or more names of dictionaries to download, or "all" for all available.
check.md5: Logical; if TRUE (default), checks the downloaded file's MD5 checksum against the expected value.
mode: A character specifying the file write mode; default is 'wb'. See download.file.
dir: Directory in which to save the dictionary; defaults to getOption("lingmatch.dict.dir").
overwrite: Logical; if TRUE, will replace existing files.
Path to the downloaded dictionary, or a list of such if multiple were downloaded.
Other Dictionary functions: dictionary_meta(), lma_patcat(), lma_termcat(), read.dic(), report_term_matches(), select.dict()
## Not run:
download.dict("lusi", dir = "~/Dictionaries")
## End(Not run)
Downloads the specified semantic space from osf.io/489he.
download.lspace(space = "100k_lsa", decompress = TRUE, check.md5 = TRUE,
  mode = "wb", dir = getOption("lingmatch.lspace.dir"), overwrite = FALSE)
space: Name of one or more spaces you want to download, or "all" for all available.
decompress: Logical; if TRUE (default), decompresses the downloaded file, if a compressed version was downloaded.
check.md5: Logical; if TRUE (default), checks the downloaded file's MD5 checksum against the expected value.
mode: A character specifying the file write mode; default is 'wb'. See download.file.
dir: Directory in which to save the space. Specify this here, or set the lspace directory option (e.g., options(lingmatch.lspace.dir = "~/Latent Semantic Spaces")), or use lma_initdirs to initialize a directory.
overwrite: Logical; if TRUE, will replace existing files.
A character vector with paths to the [1] data and [2] term files.
Other Latent Semantic Space functions: lma_lspace(), select.lspace(), standardize.lspace()
## Not run:
download.lspace("glove_crawl", dir = "~/Latent Semantic Spaces")
## End(Not run)
Offers a variety of methods to assess linguistic matching or accommodation, where matching is general similarity (sometimes called homophily), and accommodation is some form of conditional similarity (accounting for some base-rate or precedent; sometimes called alignment).
lingmatch(input = NULL, comp = mean, data = NULL, group = NULL, ...,
  comp.data = NULL, comp.group = NULL, order = NULL, drop = FALSE,
  all.levels = FALSE, type = "lsm")
input: Texts to be compared; a vector, document-term matrix (dtm; with terms as column names), or path to a file (.txt or .csv, with texts separated by one or more lines/rows).
comp: Defines the comparison to be made: a text or profile to compare input against, a number pointing to an entry in input, a function used to calculate a comparison (mean by default), or a character matching "sequential" (to compare each text with the next) or "auto" (to use a standard).
data: A matrix-like object as a reference for column names, if variables are referred to in other arguments (e.g., "LIWC" if "LIWC" is a column in data).
group: A logical or factor-like vector the same length as the number of texts, used to define groups.
...: Passes arguments to lma_dtm, lma_weight, lma_termcat, lma_lspace, and lma_simets.
comp.data: A matrix-like object as a source for comp variables.
comp.group: The column name of the grouping variable(s) in comp.data; if group contains references to column names, and comp.group is not specified, group variables are also looked for in comp.data.
order: A numeric vector the same length as the number of texts, used to sort them (e.g., by time).
drop: Logical; if TRUE, will drop columns with a sum of 0.
all.levels: Logical; if TRUE, similarities are calculated within each level of every grouping variable in sequence; otherwise, grouping variables are combined before comparisons are made.
type: A character at least partially matching 'lsm' or 'lsa'; applies default settings aligning with the standard calculations of each type:
  LSM: function-word categories (dict = lma_dict(1:9)), frequency weighting, and the Canberra metric.
  LSA: the "100k_lsa" latent semantic space, tf-idf weighting, and the cosine metric.
There are a great many points of decision in the assessment of linguistic similarity and/or accommodation, partly inherited from the great many points of decision inherent in the numerical representation of language. Two general types of matching are implemented here as sets of defaults: Language/Linguistic Style Matching (LSM; Niederhoffer & Pennebaker, 2002; Ireland & Pennebaker, 2010), and Latent Semantic Analysis/Similarity (LSA; Landauer & Dumais, 1997; Babcock, Ta, & Ickes, 2014). See the type argument for specifics.
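For instance (a minimal sketch; the 'lsa' call is commented out because it would try to load or download the default latent semantic space):

texts <- c("We were walking far too fast.", "You are walking too fast.")

# function-word based style matching (the default)
lingmatch(texts, type = "lsm")

# semantic similarity within a latent semantic space
# lingmatch(texts, type = "lsa")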
A list with processed components of the input, information about the comparison, and results of the comparison:
dtm: A sparse matrix; the raw count-dtm, or a version of the original input if it is more processed.
processed: A matrix-like object; a processed version of the input (e.g., weighted and categorized).
comp.type: A string describing the comparison, if applicable.
comp: A vector or matrix-like object; the comparison data, if applicable.
group: A string describing the group, if applicable.
sim: Result of lma_simets.
Defining groups and comparisons can sometimes be a bit complicated, and requires dataset-specific knowledge, so it can't always (readily) be done automatically. Variables entered in the group argument are treated differently depending on their position and other arguments:

By default, groups are treated as if they define separate chunks of data in which comparisons should be calculated. Functions used to calculate comparisons, and pairwise comparisons, are performed separately in each of these groups. For example, if you wanted to compare each text with the mean of all texts in its condition, a group variable could identify and split by condition. Given multiple grouping variables, calculations will either be done in each split (if all.levels = TRUE; applied in sequence so that groups become smaller and smaller), or once after all splits are made (if all.levels = FALSE). This makes for 'one to many' comparisons with either calculated or preexisting standards (i.e., the profile of the current data, or a precalculated profile, respectively).

When comparison data is identified in comp, groups are assumed to apply to both input and comp (either both in data, or separately between data and comp.data, in which case comp.group may be needed if the same grouping variables have different names between data and comp.data). In this case, multiple grouping variables are combined into a single factor assumed to uniquely identify a comparison. This makes for 'one to many' comparisons with specific texts (as in the case of manipulated prompts or text-based conditions).

If comp matches 'sequential', the last grouping variable entered is assumed to identify something like speakers (i.e., a factor with two or more levels and multiple observations per level). In this case, the data are assumed to be ordered (or ordered once sorted by order if specified). Any additional grouping variables before the last are treated as splitting groups. This can set up for probabilistic accommodation metrics. At the moment, when sequential comparisons are made within groups, similarity scores between speakers are averaged, resulting in mean matching between speakers within the group. A grouped, sequential example is sketched below.
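A minimal sketch of that grouped, sequential case, assuming grouping variables can be named as columns of data (per the group and comp.group descriptions above); the speakers and conversations here are illustrative:

dat <- data.frame(
  text = c(
    "hello there", "hi, how are you", "good, you?",
    "want to get lunch?", "sure, where to?", "the usual spot"
  ),
  speaker = c("A", "B", "A", "C", "D", "C"),
  conversation = c(1, 1, 1, 2, 2, 2)
)
# compare each turn with the next within conversations;
# between-speaker similarities are averaged per conversation
lingmatch(dat$text, "seq", data = dat, group = c("conversation", "speaker"))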
Babcock, M. J., Ta, V. P., & Ickes, W. (2014). Latent semantic similarity and language style matching in initial dyadic interactions. Journal of Language and Social Psychology, 33, 78-88.
Ireland, M. E., & Pennebaker, J. W. (2010). Language style matching in writing: synchrony in essays, correspondence, and poetry. Journal of Personality and Social Psychology, 99, 549.
Landauer, T. K., & Dumais, S. T. (1997). A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction, and representation of knowledge. Psychological Review, 104, 211.
Niederhoffer, K. G., & Pennebaker, J. W. (2002). Linguistic style matching in social interaction. Journal of Language and Social Psychology, 21, 337-360.
For a general text processing function, see lma_process().
# compare single strings
lingmatch("Compare this sentence.", "With this other sentence.")

# compare each entry in a character vector with...
texts <- c(
  "One bit of text as an entry...",
  "Maybe multiple sentences in an entry. Maybe essays or posts or a book.",
  "Could be lines or a column from a read-in file..."
)

## one another
lingmatch(texts)

## the first
lingmatch(texts, 1)

## the next
lingmatch(texts, "seq")

## the set average
lingmatch(texts, mean)

## other entries in a group
lingmatch(texts, group = c("a", "a", "b"))

## one another, without stop words
lingmatch(texts, exclude = "function")

## a standard average (based on function words)
lingmatch(texts, "auto", dict = lma_dict(1:9))
Returns a list of function words based on the Linguistic Inquiry and Word Count 2015 dictionary (in terms of category names – words were selected independently), or a list of special characters and patterns.
lma_dict(..., as.regex = TRUE, as.function = FALSE)
...: Numbers or letters corresponding to category names: ppron, ipron, article, adverb, conj, prep, auxverb, negate, quant, interrog, number, interjection, or special.
as.regex: Logical; if FALSE, lists are returned without regular expression.
as.function: Logical or a function; if specified and as.regex is TRUE, the selected categories are collapsed to a regular expression string, and a function for matching character vectors to that string is returned.
A list with a vector of terms for each category, or (when as.function = TRUE) a function which accepts an initial "terms" argument (a character vector), and any additional arguments determined by the function entered as as.function (grepl by default).
The special category is not returned unless specifically requested. It is a list of regular expression strings attempting to capture special things like ellipses and emojis, or sets of special characters (those outside of the Basic Latin range; [^\u0020-\u007F]), which can be used for character conversions. If special is part of the returned list, as.regex is set to TRUE.

The special list is always used by both lma_dtm and lma_termcat. When creating a dtm, special is used to clean the original input (so that, by default, the punctuation involved in ellipses and emojis is treated as distinct – as ellipses and emojis rather than as periods and parens and colons and such). When categorizing a dtm, the input dictionary is passed through the special lists to be sure the terms in the dtm match up with the dictionary (so, for example, ": (" would be replaced with "repfrown" in both the text and dictionary).
To score texts with these categories, use lma_termcat().
# return the full dictionary (excluding special)
lma_dict()

# return the standard 7 category lsm categories
lma_dict(1:7)

# return just a few categories without regular expression
lma_dict(neg, ppron, aux, as.regex = FALSE)

# return special specifically
lma_dict(special)

# returning a function
is.ppron <- lma_dict(ppron, as.function = TRUE)
is.ppron(c("i", "am", "you", "were"))

in.lsmcat <- lma_dict(1:7, as.function = TRUE)
in.lsmcat(c("a", "frog", "for", "me"))

## use as a stopword filter
is.stopword <- lma_dict(as.function = TRUE)
dtm <- lma_dtm("Most of these words might not be all that relevant.")
dtm[, !is.stopword(colnames(dtm))]

## use to replace special characters
clean <- lma_dict(special, as.function = gsub)
clean(c(
  "\u201Ccurly quotes\u201D", "na\u00EFve", "typographer\u2019s apostrophe",
  "en\u2013dash", "em\u2014dash"
))
Creates a document-term matrix (dtm) from a set of texts.
lma_dtm(text, exclude = NULL, context = NULL, replace.special = FALSE,
  numbers = FALSE, punct = FALSE, urls = TRUE, emojis = FALSE,
  to.lower = TRUE, word.break = " +", dc.min = 0, dc.max = Inf,
  sparse = TRUE, tokens.only = FALSE)
text: Texts to be processed. This can be a vector (such as a column in a data frame) or list. When a list, these can be in the form returned with tokens.only = TRUE, or a list with named entries, where names are tokens and values are frequencies or the like.
exclude: A character vector of words to be excluded. If exclude is a single string matching 'function', lma_dict(1:9) will be used.
context: A character vector used to reformat text based on look-ahead/behind. For example, you might attempt to disambiguate "like" by reformatting certain uses (e.g., context = c("(i) like*", "(you) like*"), where words in parentheses are used to identify the context without being replaced), which would be counted as different terms.
replace.special: Logical; if TRUE, special characters are replaced with regular equivalents using the lma_dict special list.
numbers: Logical; if TRUE, includes numbers as terms.
punct: Logical; if TRUE, includes punctuation as terms.
urls: Logical; if FALSE, excludes URLs.
emojis: Logical; if TRUE, includes emojis as terms.
to.lower: Logical; if FALSE, does not convert text to lower case.
word.break: A regular expression string determining the way words are split. Default is " +", which splits words at spaces.
dc.min: Numeric; excludes terms appearing in the set number of documents or fewer. Default is 0 (no limit).
dc.max: Numeric; excludes terms appearing in the set number of documents or more. Default is Inf (no limit).
sparse: Logical; if FALSE, a regular, dense matrix is returned.
tokens.only: Logical; if TRUE, returns a list of tokens and indices rather than a matrix, which can be converted to a dtm with a later lma_dtm call (as in the examples).
A sparse matrix (or regular matrix if sparse = FALSE) with a row per text and a column per term, or a list if tokens.only = TRUE. Includes an attribute with options (opts), and attributes with word counts (WC) and column sums (colsums) if tokens.only = FALSE.
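For instance, a minimal sketch of the attached attributes described above:

dtm <- lma_dtm(c("one two two", "three"))
attr(dtm, "WC")      # word count per text
attr(dtm, "colsums") # term totals across texts
attr(dtm, "opts")    # processing options used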
This is a relatively simple way to make a dtm. To calculate the (more or less) standard forms of LSM and LSS, a somewhat raw dtm should be fine, because both processes essentially use dictionaries (obviating stemming) and weighting or categorization (largely obviating 'stop word' removal). The exact effect of additional processing will depend on the dictionary/semantic space and weighting scheme used (particularly for LSA). This function also does some processing which may matter if you plan on categorizing with categories that have terms with look-ahead/behind assertions (like LIWC dictionaries). Otherwise, other methods may be faster, more memory efficient, and/or more featureful.
text <- c(
  "Why, hello there! How are you this evening?",
  "I am well, thank you for your inquiry!",
  "You are a most good at social interactions person!",
  "Why, thank you! You're not all bad yourself!"
)
lma_dtm(text)

# return tokens only
(tokens <- lma_dtm(text, tokens.only = TRUE))

## convert those to a regular DTM
lma_dtm(tokens)

# convert a list-representation to a sparse matrix
lma_dtm(list(
  doc1 = c(why = 1, hello = 1, there = 1),
  doc2 = c(i = 1, am = 1, well = 1)
))
Creates directories for dictionaries and latent semantic spaces if needed, sets them as the lingmatch.dict.dir and lingmatch.lspace.dir options if they are not already set, and creates links to them in their expected locations ('~/Dictionaries' and '~/Latent Semantic Spaces') by default, if applicable.
lma_initdirs(base = "", dict = "Dictionaries",
  lspace = "Latent Semantic Spaces", link = TRUE)
base: Path to a directory in which to create the dict and lspace subdirectories.
dict: Path to the dictionaries directory relative to base.
lspace: Path to the latent semantic spaces directory relative to base.
link: Logical; if TRUE (default), creates links to the directories in their expected locations ('~/Dictionaries' and '~/Latent Semantic Spaces'), if they are elsewhere.
Paths to the [1] dictionaries and [2] latent semantic space directories, or a single path if only dict or lspace is specified.
## Not run:
# set up the expected dictionary and latent semantic space directories
lma_initdirs("~")

# set up directories elsewhere, and links to the expected locations
lma_initdirs("d:")

# point options and create links to preexisting directories
lma_initdirs("~/NLP_Resources", "Dicts", "Dicts/Embeddings")

# create just a dictionaries directory and set the
# lingmatch.dict.dir option without creating a link
lma_initdirs(dict = "z:/external_dictionaries", link = FALSE)
## End(Not run)
Map a document-term matrix onto a latent semantic space, extract terms from a latent semantic space (if dtm is a character vector, or map.space = FALSE), or perform a singular value decomposition of a document-term matrix (if dtm is a matrix and space is missing).
lma_lspace(dtm = "", space, map.space = TRUE, fill.missing = FALSE,
  term.map = NULL, dim.cutoff = 0.5, keep.dim = FALSE, use.scan = FALSE,
  dir = getOption("lingmatch.lspace.dir"))
dtm: A matrix with terms as column names, or a character vector of terms to be extracted from a specified space. If this is of length 1 and space is missing, it will be treated as the name of a space.
space: A matrix with terms as row names. If missing, this will be the right singular vectors of a singular value decomposition of dtm. If a character, a file matching the name will be searched for in dir (see select.lspace for available spaces).
map.space: Logical; if FALSE, the original vectors of space for terms found in dtm are returned, without mapping.
fill.missing: Logical; if TRUE and terms are being extracted from a space, rows of zeros are included for terms not found in the space.
term.map: A matrix used to map terms between dtm and space (such as the term map that accompanies downloaded spaces).
dim.cutoff: If a space is being calculated, this proportion of variance is used to decide on the number of dimensions to retain (e.g., a dim.cutoff of .5 would keep dimensions accounting for half of the variance). Default is .5.
keep.dim: Logical; if TRUE, and a space is being calculated, a document by term matrix is returned, rather than a document by dimension matrix.
use.scan: Logical; if TRUE, reads in space files with scan rather than the default reader.
dir: Path to a folder containing spaces; defaults to getOption("lingmatch.lspace.dir").
A matrix or sparse matrix with either (a) a row per term and a column per latent dimension (a latent space, either calculated from the input, or retrieved when map.space = FALSE), (b) a row per document and a column per latent dimension (when a dtm is mapped to a space), or (c) a row per document and a column per term (when a space is calculated and keep.dim = TRUE).
A traditional latent semantic space is a selection of right singular vectors from the singular value decomposition of a dtm (svd(dtm)$v[, 1:k], where k is the selected number of dimensions, decided here by dim.cutoff).

Mapping a new dtm into a latent semantic space consists of multiplying common terms: dtm[, ct] %*% space[ct, ], where ct = colnames(dtm)[colnames(dtm) %in% rownames(space)] – the terms common between the dtm and the space. This results in a matrix with documents as rows, and dimensions as columns, replacing terms.
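A minimal sketch of that mapping, using a small space calculated from the dtm itself (only base R and functions from this package):

dtm <- lma_dtm(c("cats are great", "dogs are great", "cars are fast"))
space <- lma_lspace(dtm) # terms as rows, latent dimensions as columns
ct <- colnames(dtm)[colnames(dtm) %in% rownames(space)]
manual <- as.matrix(dtm)[, ct] %*% space[ct, ]
mapped <- lma_lspace(dtm, space)
all.equal(as.numeric(manual), as.numeric(mapped)) # should be TRUE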
Other Latent Semantic Space functions: download.lspace(), select.lspace(), standardize.lspace()
text <- c(
  paste(
    "Hey, I like kittens. I think all kinds of cats really are just the",
    "best pet ever."
  ),
  paste(
    "Oh yeah? Well I really like cars. All the wheels and the turbos...",
    "I think that's the best ever."
  ),
  paste(
    "You know what? Poo on you. Cats, dogs, rabbits -- you know, living",
    "creatures... to think you'd care about anything else!"
  ),
  paste(
    "You can stick to your opinion. You can be wrong if you want. You know",
    "what life's about? Supercharging, diesel guzzling, exhaust spewing,",
    "piston moving ignitions."
  )
)

dtm <- lma_dtm(text)

# calculate a latent semantic space from the example text
lss <- lma_lspace(dtm)

# show that document similarities between the truncated and full space are the same
spaces <- list(
  full = lma_lspace(dtm, keep.dim = TRUE),
  truncated = lma_lspace(dtm, lss)
)
sapply(spaces, lma_simets, metric = "cosine")

## Not run:
# specify a directory containing spaces,
# or where you would like to download spaces
space_dir <- "~/Latent Semantic Spaces"

# map to a pretrained space
ddm <- lma_lspace(dtm, "100k", dir = space_dir)

# load the matching subset of the space
# without mapping
lss_100k_part <- lma_lspace(colnames(dtm), "100k", dir = space_dir)

## or
lss_100k_part <- lma_lspace(dtm, "100k", map.space = FALSE, dir = space_dir)

# load the full space
lss_100k <- lma_lspace("100k", dir = space_dir)

## or
lss_100k <- lma_lspace(space = "100k", dir = space_dir)
## End(Not run)
Calculate simple descriptive statistics from text.
lma_meta(text)
text: A character vector of texts.
A data.frame:
characters: Total number of characters.
syllables: Total number of syllables, as estimated by split length of 'a+[eu]*|e+a*|i+|o+[ui]*|u+|y+[aeiou]*' - 1.
words: Total number of words (raw word count).
unique_words: Number of unique words (binary word count).
clauses: Number of clauses, as marked by commas, colons, semicolons, dashes, or brackets within sentences.
sentences: Number of sentences, as marked by periods, question marks, exclamation points, or new line characters.
words_per_clause: Average number of words per clause.
words_per_sentence: Average number of words per sentence.
sixltr: Number of words 6 or more characters long.
characters_per_word: Average number of characters per word (characters / words).
syllables_per_word: Average number of syllables per word (syllables / words).
type_token_ratio: Ratio of unique to total words: unique_words / words.
reading_grade: Flesch-Kincaid grade level: .39 * words / sentences + 11.8 * syllables / words - 15.59.
numbers: Number of terms starting with numbers.
punct: Number of terms starting with non-alphanumeric characters.
periods: Number of periods.
commas: Number of commas.
qmarks: Number of question marks.
exclams: Number of exclamation points.
quotes: Number of quotation marks (single and double).
apostrophes: Number of apostrophes, defined as any modified letter apostrophe, or backtick or single straight or curly quote surrounded by letters.
brackets: Number of bracketing characters (including parentheses, and square, curly, and angle brackets).
orgmarks: Number of characters used for organization or structuring (including dashes, forward slashes, colons, and semicolons).
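As a quick check of the reading_grade formula, a minimal sketch using the columns described above (assuming reading_grade is stored unrounded):

meta <- lma_meta("The quick brown fox jumps over the lazy dog.")
grade <- with(meta, .39 * words / sentences + 11.8 * syllables / words - 15.59)
all.equal(grade, meta$reading_grade) # should be TRUE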
text <- c(
  succinct = "It is here.",
  verbose = "Hear me now. I shall tell you about it. It is here. Do you hear?",
  couched = "I might be wrong, but it seems to me that it might be here.",
  bigwords = "Object located thither.",
  excited = "It's there! It's there! It's there!",
  drippy = "It's 'there', right? Not 'here'? 'there'? Are you Sure?",
  struggly = "It's here -- in that place where it is. Like... the 1st place (here)."
)
lma_meta(text)
Categorize raw texts using a pattern-based dictionary.
lma_patcat(text, dict = NULL, pattern.weights = "weight",
  pattern.categories = "category", bias = NULL, to.lower = TRUE,
  return.dtm = FALSE, drop.zeros = FALSE, exclusive = TRUE,
  boundary = NULL, fixed = TRUE, globtoregex = FALSE,
  name.map = c(intname = "_intercept", term = "term"),
  dir = getOption("lingmatch.dict.dir"))
text: A vector of texts to be categorized. Texts are padded by 2 spaces, and potentially lowercased.
dict: At least a vector of terms (patterns), usually a matrix-like object with columns for terms, categories, and weights.
pattern.weights: A vector of weights corresponding to terms in dict, or the column name of weights in dict.
pattern.categories: A vector of category names corresponding to terms in dict, or the column name of categories in dict.
bias: A constant to add to each category after weighting and summing. Can be a vector with names corresponding to the unique values in dict's category column.
to.lower: Logical indicating whether text should be converted to lower case.
return.dtm: Logical; if TRUE, returns a term-level dtm rather than a category-level matrix.
drop.zeros: Logical; if TRUE, drops categories or terms with no matches.
exclusive: Logical; if FALSE, each dictionary term is searched for in the original text, so overlapping patterns can all match.
boundary: A string to add to the beginning and end of each dictionary term. If TRUE, boundary will be ' ' (a space), avoiding within-word matches.
fixed: Logical; if FALSE, patterns are treated as regular expressions.
globtoregex: Logical; if TRUE, converts globs (asterisk wildcards) to regular expressions.
name.map: A named character vector: intname names the term representing the intercept (bias; default "_intercept"), and term names the column containing terms in dict (default "term"). Missing names are added, so names can be specified positionally (e.g., c("_int", "words")).
dir: Path to a folder in which to look for dict if it is the name of a file to be read in; defaults to getOption("lingmatch.dict.dir").
A matrix with a row per text and a column per dictionary category, or (when return.dtm = TRUE) a sparse matrix with a row per text and a column per term. Includes a WC attribute with original word counts, and a categories attribute with row indices associated with each category if return.dtm = TRUE.
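For instance, a minimal sketch of the attached attributes described above (the dictionary here is illustrative):

res <- lma_patcat(
  "a quick example", list(letters = c("a", "e"), speed = "quick"),
  return.dtm = TRUE
)
attr(res, "WC")         # original word counts
attr(res, "categories") # indices associated with each category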
For applying term-based dictionaries (to a document-term matrix) see lma_termcat().
Other Dictionary functions: dictionary_meta(), download.dict(), lma_termcat(), read.dic(), report_term_matches(), select.dict()
# example text
text <- c(
  paste(
    "Oh, what youth was! What I had and gave away.",
    "What I took and spent and saw. What I lost. And now? Ruin."
  ),
  paste(
    "God, are you so bored?! You just want what's gone from us all?",
    "I miss the you that was too. I love that you."
  ),
  paste(
    "Tomorrow! Tomorrow--nay, even tonight--you wait, as I am about to change.",
    "Soon I will off to revert. Please wait."
  )
)

# make a document-term matrix with pre-specified terms only
lma_patcat(text, c("bored?!", "i lo", ". "), return.dtm = TRUE)

# get counts of sets of letters
lma_patcat(text, list(c("a", "b", "c"), c("d", "e", "f")))

# same thing with regular expressions
lma_patcat(text, list("[abc]", "[def]"), fixed = FALSE)

# match only words
lma_patcat(text, list("i"), boundary = TRUE)

# match only words, ignoring punctuation
lma_patcat(
  text, c("you", "tomorrow", "was"),
  fixed = FALSE, boundary = "\\b", return.dtm = TRUE
)

## Not run:
# read in the temporal orientation lexicon from the World Well-Being Project
tempori <- read.csv(paste0(
  "https://raw.githubusercontent.com/wwbp/lexica/master/",
  "temporal_orientation/temporal_orientation_lexicon.csv"
))
lma_patcat(text, tempori)

# or use the standardized version
tempori_std <- read.dic("wwbp_prospection", dir = "~/Dictionaries")
lma_patcat(text, tempori_std)

## get scores on the same scale by adjusting the standardized values
tempori_std[, -1] <- tempori_std[, -1] / 100 *
  select.dict("wwbp_prospection")$selected[, "original_max"]
lma_patcat(text, tempori_std)[, unique(tempori$category)]
## End(Not run)
A wrapper to other pre-processing functions: potentially from read.segments, to lma_dtm or lma_patcat, to lma_weight, then lma_termcat or lma_lspace, and optionally including lma_meta output.
lma_process(input = NULL, ..., meta = TRUE, coverage = FALSE)
input: A vector of text, or path to a text file or folder.
...: Arguments to be passed to lma_dtm, lma_patcat, lma_weight, lma_termcat, and/or lma_lspace.
meta: Logical; if FALSE, metastatistics are not included.
coverage: Logical; if TRUE and a dictionary is being used, includes the coverage (number of unique term matches) of each category.
A matrix with texts represented by rows, and features in columns, unless there are multiple rows per output (e.g., when a latent semantic space is applied without terms being mapped) in which case only the special output is returned (e.g., a matrix with terms as rows and latent dimensions in columns).
If you just want to compare texts, see the lingmatch() function.
# starting with some texts in a vector
texts <- c(
  "Firstly, I would like to say, and with all due respect...",
  "Please, proceed. I hope you feel you can speak freely...",
  "Oh, of course, I just hope to be clear, and not cause offense...",
  "Oh, no, don't monitor yourself on my account..."
)

# by default, term counts and metastatistics are returned
lma_process(texts)

# add dictionary and percent arguments for standard dictionary-based results
lma_process(texts, dict = lma_dict(), percent = TRUE)

# add space and weight arguments for standard word-centroid vectors
lma_process(texts, space = lma_lspace(texts), weight = "tfidf")
Enter a numerical matrix, set of vectors, or set of matrices to calculate similarity per vector.
lma_simets(a, b = NULL, metric = NULL, group = NULL, lag = 0, agg = TRUE,
  agg.mean = TRUE, pairwise = TRUE, symmetrical = FALSE, mean = FALSE,
  return.list = FALSE)
a: A vector or matrix. If a vector, b must also be provided.
b: A vector or matrix to be compared with a or rows of a.
metric: A character or vector of characters at least partially matching one of the available metric names (or 'all' to explicitly include all metrics), or a number or vector of numbers indicating the metric by index: jaccard (1), euclidean (2), canberra (3), cosine (4), or pearson (5).
group: If b is missing and a has multiple rows, a vector to use for grouped comparisons.
lag: Amount to adjust the b index; either rows if b has multiple rows (e.g., for lag = 1, a[1, ] is compared with b[2, ]), or values otherwise (e.g., for lag = 1, a[1] is compared with b[2]).
agg: Logical; if FALSE, only the boundary rows between groups are compared; see the example.
agg.mean: Logical; if FALSE, aggregated rows are summed instead of averaged.
pairwise: Logical; if FALSE and a and b are matrices with the same number of rows, only paired rows are compared.
symmetrical: Logical; if TRUE and pairwise comparisons between rows of a are made, the results in the lower triangle are copied to the upper triangle.
mean: Logical; if TRUE, a single mean for each metric is returned per row of a.
return.list: Logical; if TRUE, a list-like object is always returned, with an entry for each metric, even when only one metric is requested.
Use RcppParallel::setThreadOptions to change parallelization options; e.g., run RcppParallel::setThreadOptions(4) before a call to lma_simets to set the number of CPU threads to 4.
Output varies based on the dimensions of a and b:
A vector with a value per metric: when a and b are both vectors.
A vector with a value per row: any time a single value is expected per row (a or b is a vector, a and b are matrices with the same number of rows and pairwise = FALSE, a group is specified, or mean = TRUE), and only one metric is requested.
A data.frame with a column per metric: when multiple metrics are requested in the previous case.
A sparse matrix with a metric attribute naming the metric: pairwise comparisons within an a matrix or between an a and b matrix, when only one metric is requested.
A list with a sparse matrix per metric: when multiple metrics are requested in the previous case.
text <- c( "words of speaker A", "more words from speaker A", "words from speaker B", "more words from speaker B" ) (dtm <- lma_dtm(text)) # compare each entry lma_simets(dtm) # compare each entry with the mean of all entries lma_simets(dtm, colMeans(dtm)) # compare by group (corresponding to speakers and turns in this case) speaker <- c("A", "A", "B", "B") ## by default, consecutive rows from the same group are averaged: lma_simets(dtm, group = speaker) ## with agg = FALSE, only the rows at the boundary between ## groups (rows 2 and 3 in this case) are used: lma_simets(dtm, group = speaker, agg = FALSE)
text <- c( "words of speaker A", "more words from speaker A", "words from speaker B", "more words from speaker B" ) (dtm <- lma_dtm(text)) # compare each entry lma_simets(dtm) # compare each entry with the mean of all entries lma_simets(dtm, colMeans(dtm)) # compare by group (corresponding to speakers and turns in this case) speaker <- c("A", "A", "B", "B") ## by default, consecutive rows from the same group are averaged: lma_simets(dtm, group = speaker) ## with agg = FALSE, only the rows at the boundary between ## groups (rows 2 and 3 in this case) are used: lma_simets(dtm, group = speaker, agg = FALSE)
Reduces the dimensions of a document-term matrix by dictionary-based categorization.
lma_termcat(dtm, dict, term.weights = NULL, bias = NULL,
  bias.name = "_intercept", escape = TRUE, partial = FALSE, glob = TRUE,
  term.filter = NULL, term.break = 20000, to.lower = FALSE,
  dir = getOption("lingmatch.dict.dir"), coverage = FALSE)
dtm: A matrix with terms as column names.
dict: The name of a provided dictionary (osf.io/y6g5b/wiki) or of a file found in dir, or a list with named character vectors as word lists, or a matrix-like object with columns for terms, categories, and weights.
term.weights: A list with named numeric vectors lining up with the character vectors in dict, used to weight the terms in each dict vector.
bias: A list or named vector specifying a constant to add to the named category. If a term matching bias.name is included in a category, it will be removed, and its weight will be used as the bias for that category.
bias.name: A character specifying a term to be used as a category bias; default is "_intercept".
escape: Logical indicating whether the terms in dict should be escaped (i.e., not treated as regular expressions). Set to FALSE if you want to use regular expressions in dict.
partial: Logical; if TRUE, terms are partially matched (not padded by ^ and $).
glob: Logical; if TRUE, asterisks are taken as wildcards.
term.filter: A regular expression string used to format the text of each term (passed to gsub). For example, if terms are part-of-speech tagged (e.g., "a_DT"), "_.*" would remove the tag.
term.break: If a category has more than term.break terms, it will be processed in chunks. Reduce from 20000 if you run into memory limits.
to.lower: Logical; if TRUE, converts dictionary terms to lower case.
dir: Path to a folder in which to look for dict; defaults to getOption("lingmatch.dict.dir").
coverage: Logical; if TRUE, includes the coverage (number of unique term matches) of each category.
A matrix with a row per dtm row and a column per dictionary category (with added coverage_ versions if coverage is TRUE), and a WC attribute with original word counts.
For applying pattern-based dictionaries (to raw text) see lma_patcat().
Other Dictionary functions: dictionary_meta(), download.dict(), lma_patcat(), read.dic(), report_term_matches(), select.dict()
dict <- list(category = c("cat", "dog", "pet*"))
lma_termcat(c(
  "cat, cat, cat, cat, cat, cat, cat, cat",
  "a cat, dog, or anything petlike, really",
  "petite petrochemical petitioned petty peter for petrified petunia petals"
), dict, coverage = TRUE)

## Not run:
# Score texts with the NRC Affect Intensity Lexicon
dict <- readLines("https://saifmohammad.com/WebDocs/NRC-AffectIntensity-Lexicon.txt")
dict <- read.table(
  text = dict[-seq_len(grep("term\tscore", dict, fixed = TRUE)[[1]])],
  col.names = c("term", "weight", "category")
)

text <- c(
  angry = paste(
    "We are outraged by their hateful brutality,",
    "and by the way they terrorize us with their hatred."
  ),
  fearful = paste(
    "The horrific torture of that terrorist was tantamount",
    "to the terrorism of terrorists."
  ),
  joyous = "I am jubilant to be celebrating the bliss of this happiest happiness.",
  sad = paste(
    "They are nearly suicidal in their mourning after",
    "the tragic and heartbreaking holocaust."
  )
)

emotion_scores <- lma_termcat(text, dict)
if (require("splot")) splot(emotion_scores ~ names(text), leg = "out")

## or use the standardized version (which includes more categories)
emotion_scores <- lma_termcat(text, "nrc_eil", dir = "~/Dictionaries")
emotion_scores <- emotion_scores[, c("anger", "fear", "joy", "sadness")]
if (require("splot")) splot(emotion_scores ~ names(text), leg = "out")
## End(Not run)
Weight a document-term matrix.
lma_weight(dtm, weight = "count", normalize = TRUE, wc.complete = TRUE,
  log.base = 10, alpha = 1, pois.x = 1L, doc.only = FALSE, percent = FALSE)
dtm: A matrix with words as column names.
weight: A string referring at least partially to one (or a combination; see note) of the available weighting methods, either term weights (applied uniquely to each cell: binary, log, sqrt, count, amplify) or document weights (applied by column: df, dflog, dfmax, dfmlog, idf, ridf, normal, dpois, ppois, entropy). Alternatively, 'pmi' or 'ppmi' will apply a pointwise mutual information weighting scheme (with alpha applied to document frequency).
normalize: Logical; if FALSE, the dtm is not divided by document word-count before being weighted.
wc.complete: If the dtm was made with lma_dtm (has a WC attribute), word counts for frequencies can be based on the raw count (wc.complete = TRUE) or the post-processed count (wc.complete = FALSE).
log.base: The base of logs, applied to any weight using log. Default is 10.
alpha: A scaling factor applied to document frequency as part of pointwise mutual information weighting, or amplify's power (dtm ^ alpha).
pois.x: Integer; quantile or probability of the Poisson distribution used by the dpois and ppois weights.
doc.only: Logical; if TRUE, only document weights are returned (a term-named vector rather than a weighted dtm).
percent: Logical; if TRUE, frequencies are multiplied by 100.
A weighted version of dtm, with a type attribute added (attr(dtm, 'type')).
Term weights work to adjust differences in counts within documents, with differences meaning increasingly more from binary to log to sqrt to count to amplify.

Document weights work to treat words differently based on their between-document or overall frequency. When term frequencies are constant, dpois, idf, ridf, and normal give less common words increasingly more weight, and dfmax, dfmlog, ppois, df, dflog, and entropy give less common words increasingly less weight.
weight can either be a vector with two characters, corresponding to term weight and document weight (e.g., c('count', 'idf')), or it can be a string with term and document weights separated by any of :\*_/; ,- (e.g., 'count-idf'). 'tf' is also acceptable for 'count', and 'tfidf' will be parsed as c('count', 'idf'), though this is a special case.

For weight, term or document weights can be entered individually; term weights alone will not apply any document weight, and document weights alone will apply a 'count' term weight (unless doc.only = TRUE, in which case a term-named vector of document weights is returned instead of a weighted dtm).
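For example, a minimal sketch of the parsing described above; these three specifications should be equivalent:

dtm <- lma_dtm(c("a big cat", "a dog", "a big, big dog"))
w1 <- lma_weight(dtm, "tfidf")
w2 <- lma_weight(dtm, c("count", "idf"))
w3 <- lma_weight(dtm, "count-idf")
all.equal(w1, w2) # should be TRUE
all.equal(w2, w3) # should be TRUE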
# visualize term and document weights

## term weights
term_weights <- c("binary", "log", "sqrt", "count", "amplify")
Weighted <- sapply(term_weights, function(w) lma_weight(1:20, w, FALSE))
if (require(splot)) splot(Weighted ~ 1:20, labx = "Raw Count", lines = "co")

## document weights
doc_weights <- c(
  "df", "dflog", "dfmax", "dfmlog", "idf", "ridf",
  "normal", "dpois", "ppois", "entropy"
)
weight_range <- function(w, value = 1) {
  m <- diag(20)
  m[upper.tri(m, TRUE)] <- if (is.numeric(value)) {
    value
  } else {
    unlist(lapply(
      1:20,
      function(v) rep(if (value == "inverted") 21 - v else v, v)
    ))
  }
  lma_weight(m, w, FALSE, doc.only = TRUE)
}

if (require(splot)) {
  category <- rep(c("df", "idf", "normal", "poisson", "entropy"), c(4, 2, 1, 2, 1))
  op <- list(
    laby = "Relative (Scaled) Weight", labx = "Document Frequency",
    leg = "outside", lines = "connected", mv.scale = TRUE, note = FALSE
  )
  splot(
    sapply(doc_weights, weight_range) ~ 1:20,
    options = op, title = "Same Term, Varying Document Frequencies",
    sud = "All term frequencies are 1.",
    colorby = list(category, grade = TRUE)
  )
  splot(
    sapply(doc_weights, weight_range, value = "sequence") ~ 1:20,
    options = op, title = "Term as Document Frequencies",
    sud = "Non-zero terms are the number of non-zero terms.",
    colorby = list(category, grade = TRUE)
  )
  splot(
    sapply(doc_weights, weight_range, value = "inverted") ~ 1:20,
    options = op, title = "Term Opposite of Document Frequencies",
    sud = "Non-zero terms are the number of zero terms + 1.",
    colorby = list(category, grade = TRUE)
  )
}
Read in or write dictionary files in Comma-Separated Values (.csv; weighted) or Linguistic Inquiry and Word Count (.dic; non-weighted) format.
read.dic(path, cats = NULL, type = "asis", as.weighted = FALSE,
  dir = getOption("lingmatch.dict.dir"), ..., term.name = "term",
  category.name = "category", raw = FALSE)

write.dic(dict, filename = NULL, type = "asis", as.weighted = FALSE,
  save = TRUE)
path: Path to a file, a name corresponding to a file in dir (e.g., "inquirer"), or a matrix-like object or list to be converted.
cats: A character vector of category names to be returned. All categories are returned by default.
type: A character indicating whether and how terms should be altered. Unspecified or matching 'asis' leaves terms as they are. Other options change wildcards to regular expressions.
as.weighted: Logical; if TRUE, prefers the weighted (data.frame) format; if FALSE, prefers the unweighted (list) format.
dir: Path to a folder containing dictionaries, or where you would like dictionaries to be downloaded; passed to select.dict and/or download.dict.
...: Passes arguments to readLines.
term.name, category.name: Strings identifying the column names in path containing terms and categories respectively.
raw: Logical or a character. As logical, indicates if path should be treated as raw dictionary text to be parsed rather than a path. As a character, the raw dictionary text itself.
dict: A list with a named entry of terms for each category, or a data.frame with terms in one column, and categories or weights in the rest.
filename: The name of the file to be saved.
save: Logical; if FALSE, the dictionary is converted but not written to a file.
read.dic: A list (unweighted) with an entry for each category containing character vectors of terms, or a data.frame (weighted) with columns for terms (first, "term") and weights (all subsequent, with category labels as names).
write.dic: A version of the written dictionary – a raw character vector for unweighted dictionaries, or a data.frame for weighted dictionaries.
Other Dictionary functions: dictionary_meta(), download.dict(), lma_patcat(), lma_termcat(), report_term_matches(), select.dict()
# make a small murder related dictionary
dict <- list(
  kill = c("kill*", "murd*", "wound*", "die*"),
  death = c("death*", "dying", "die*", "kill*")
)

# convert it to a weighted format
(dict_weighted <- read.dic(dict, as.weighted = TRUE))

# categorize it back
read.dic(dict_weighted)

# convert it to a string without writing to a file
cat(raw_dict <- write.dic(dict, save = FALSE))

# parse it back in
read.dic(raw = raw_dict)

## Not run:
# save it as a .dic file
write.dic(dict, "murder")

# read it back in as a list
read.dic("murder.dic")

# read in the Moral Foundations or LUSI dictionaries from urls
moral_dict <- read.dic("https://osf.io/download/whjt2")
lusi_dict <- read.dic("https://osf.io/download/29ayf")

# save and read in a version of the General Inquirer dictionary
inquirer <- read.dic("inquirer", dir = "~/Dictionaries")
## End(Not run)
Split texts by word count or specific characters. Input texts directly, or read them in from files.
read.segments(path = ".", segment = NULL, ext = ".txt", subdir = FALSE,
  segment.size = -1, bysentence = FALSE, end_in_quotes = TRUE,
  preclean = FALSE, text = NULL)
path: Path to a folder containing files, or a vector of paths to files. If no folders or files are recognized in path, it is treated as text.
segment: Specifies how the text of each file should be segmented. If a character, split at that character; '\n' by default. If a number, texts will be broken into that many segments, each with a roughly equal number of words.
ext: The extension of the files you want to read in. '.txt' by default.
subdir: Logical; if TRUE, files in folders within path are also included.
segment.size: Numeric; if specified, segment is ignored, and texts are broken into segments containing roughly segment.size words each.
bysentence: Logical; if TRUE, and segment is a number or segment.size is specified, sentences are kept together, rather than potentially being broken across segments.
end_in_quotes: Logical; if FALSE, sentence-ending marks (.?!) within quotes are not treated as sentence ends.
preclean: Logical; if TRUE, text is cleaned with lma_dict(special) before segmentation.
text: A character vector with text to be split, used in place of path. Each entry is treated as a file.
A data.frame with columns for file names (input), segment number within file (segment), word count for each segment (WC), and the text of each segment (text).
# split preloaded text
read.segments("split this text into two segments", 2)

## Not run:
# read in all files from the package directory
texts <- read.segments(path.package("lingmatch"), ext = "")
texts[, -4]

# segment .txt files in dir in a few ways:
dir <- "path/to/files"

## into 1 line segments
texts_lines <- read.segments(dir)

## into 5 even segments each
texts_5segs <- read.segments(dir, 5)

## into 50 word segments
texts_50words <- read.segments(dir, segment.size = 50)

## into 1 sentence segments
texts_1sent <- read.segments(dir, segment.size = 1, bysentence = TRUE)
## End(Not run)
Extract matches to fuzzy terms (globs/wildcards or regular expressions) from provided text, in order to assess their appropriateness for inclusion in a dictionary.
report_term_matches(dict, text = NULL, space = NULL, glob = TRUE,
  parse_phrases = TRUE, tolower = TRUE, punct = TRUE, special = TRUE,
  as_terms = FALSE, bysentence = FALSE, as_string = TRUE,
  term_map_freq = 1, term_map_spaces = 1, outFile = NULL,
  space_dir = getOption("lingmatch.lspace.dir"), verbose = TRUE)
dict: A vector of terms, list of such vectors, or a matrix-like object to be categorized by read.dic.
text: A vector of text to extract matches from. If not specified, will use the terms in the term map retrieved from select.lspace.
space: A vector space used to calculate similarities between term matches. Name of a space (see select.lspace), a matrix with terms as row names, or TRUE to select a space based on matched terms.
glob: Logical; if TRUE, converts globs (asterisk wildcards) to regular expressions.
parse_phrases: Logical; if TRUE (default), spaces in terms are treated as word boundaries when extracting matches.
tolower: Logical; if FALSE, terms and text are not converted to lower case.
punct: Logical; if FALSE, punctuation is removed before matches are extracted.
special: Logical; if FALSE, special characters are not replaced (with the lma_dict special list).
as_terms: Logical; if TRUE, text is treated as a set of terms rather than of documents.
bysentence: Logical; if TRUE, text is split into sentences before matches are extracted.
as_string: Logical; if FALSE, matches are returned as a data.frame per term rather than a comma-separated string.
term_map_freq: Proportion of terms to include when using the term map as a source of terms. Applies when text is not specified.
term_map_spaces: Number of spaces in which a term has to appear to be included. Applies when text is not specified.
outFile: File path to write results to, always ending in .csv.
space_dir: Directory from which space should be loaded.
verbose: Logical; if FALSE, will not show status messages.
A data.frame of results, with a row for each unique term, and the following columns:
term: The originally entered term.
regex: The converted and applied regular expression form of the term.
categories: Comma-separated category names, if dict is a list with named entries.
count: Total number of matches to the term.
max_count: Number of matches to the most representative (that with the highest average similarity) variant of the term.
variants: Number of variants of the term.
space: Name of the latent semantic space, if one was used.
mean_sim: Average similarity to the most representative variant among terms found in the space, if one was used.
min_sim: Minimal similarity to the most representative variant.
matches: Variants, with counts and similarity (Pearson's r) to the most representative term (if a space was specified), either in the form of a comma-separated string or a data.frame (if as_string is FALSE).
Matches are extracted for each term independently, so they may not align with some implementations of dictionaries. For instance, lma_patcat by default matches destructively, sorting terms by length so that shorter terms do not match text already matched by a longer, overlapping term; here, the match would show up for both terms.
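To make this concrete, the following rough sketch contrasts the two functions; output formats differ, the point is only the counting behavior, and the list-style dict passed to lma_patcat here is an assumption about its accepted input:

library(lingmatch)

text <- "hope is not hopeless"

# report_term_matches() counts matches per term independently, so the token
# "hopeless" appears as a match for both "hope*" and "hopeless"
report_term_matches(list(hope = "hope*", hopeless = "hopeless"), text)

# lma_patcat() treats terms as fixed patterns and, by default, matches
# destructively, sorting by length: "hopeless" consumes its text first,
# leaving "hope" to match only the standalone word
lma_patcat(text, list(hope = "hope", hopeless = "hopeless"))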
For a more complete assessment of dictionaries, see dictionary_meta(). Similar information is provided in the dictionary builder web tool.
Other Dictionary functions: dictionary_meta(), download.dict(), lma_patcat(), lma_termcat(), read.dic(), select.dict()
text <- c(
  "I am sadly homeless, and suffering from depression :(",
  "This wholesome happiness brings joy to my heart! :D:D:D",
  "They are joyous in these fearsome happenings D:",
  "I feel weightless now that my sadness has been depressed! :()"
)
dict <- list(
  sad = c("*less", "sad*", "depres*", ":("),
  happy = c("*some", "happ*", "joy*", "d:"),
  self = c("i *", "my *")
)
report_term_matches(dict, text)
Retrieve information and links to dictionaries (lexicons/word lists) available at osf.io/y6g5b.
select.dict(query = NULL, dir = getOption("lingmatch.dict.dir"), check.md5 = TRUE, mode = "wb")
query | A character matching a dictionary name, or a set of keywords to search for in dictionary information. |
dir | Path to a folder containing dictionaries, or where you want them to be saved. Will look in getOption('lingmatch.dict.dir') and '~/Dictionaries' by default. |
check.md5 | Logical; if TRUE, verifies the MD5 checksum of downloaded files. |
mode | Passed to download.file() if a dictionary is downloaded. |
A list with varying entries:

info: The version of osf.io/kjqb8 stored internally; a data.frame with dictionary names as row names, and information about each dictionary in columns. Also described at osf.io/y6g5b/wiki/dict_variables, where short (corresponding to the file name [{short}.(csv|dic)] and wiki URL [https://osf.io/y6g5b/wiki/{short}]) is set as row names and removed:

  name: Full name of the dictionary.
  description: Description of the dictionary, relating to its purpose and development.
  note: Notes about processing decisions that additionally alter the original.
  constructor: How the dictionary was constructed:
    algorithm: Terms were selected by some automated process, potentially learned from data or other resources.
    crowd: Several individuals rated the terms, and in aggregate those ratings translate to categories and weights.
    mixed: Some combination of the other methods, usually in some iterative process.
    team: One or more individuals make decisions about term inclusions, categories, and weights.
  subject: Broad, rough subject or purpose of the dictionary:
    emotion: Terms relate to emotions, potentially exemplifying or expressing them.
    general: A large range of categories, aiming to capture the content of the text.
    impression: Terms are categorized and weighted based on the impression they might give.
    language: Terms are categorized or weighted based on their linguistic features, such as part of speech, specificity, or area of use.
    social: Terms relate to social phenomena, such as characteristics or concerns of social entities.
  terms: Number of unique terms across categories.
  term_type: Format of the terms:
    glob: Terms include asterisks, which denote inclusion of any characters until a word boundary.
    glob+: Glob-style asterisks with regular expressions within terms.
    ngram: Terms include any number of words, separated by spaces.
    pattern: A string of characters, potentially within or between words, or spanning words.
    regex: Regular expressions.
    stem: Unigrams with common endings removed.
    unigram: Complete single words.
  weighted: Indicates whether weights are associated with terms. This determines the file type of the dictionary: dictionaries with weights are stored as .csv files, and those without as .dic files.
  regex_characters: Logical indicating whether special regular expression characters are present in any term, which might need to be escaped if the terms are used in regular expressions. Glob-type terms allow complete parentheses (at least one open and one closed, indicating preceding or following words), and initial and terminal asterisks. For all other terms, [](){}*.^$+?\| are counted as regex characters. These could be escaped in R with gsub("([][)(}{*.^$+?\\\\|])", "\\\\\\1", terms) if terms is a character vector, and in Python with [re.sub(r'([][)(}{*.^$+?\\|])', r'\\\1', term) for term in terms] if terms is a list (see the sketch following this list).
  categories: Category names in the order in which they appear in the dictionary file, separated by commas.
  ncategories: Number of categories.
  original_max: Maximum value of the original dictionary before standardization (original values / max(original values) * 100). Dictionaries with no weights are considered to have a max of 1.
  osf: ID of the file on OSF, translating to the file's URL: https://osf.io/{osf}.
  wiki: URL of the dictionary's wiki.
  downloaded: Path to the file if downloaded, and '' otherwise.

selected: A subset of info selected by query.
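As a concrete example of the escaping described under regex_characters, here it is applied to a few made-up terms (written with R-source escaping, since the rendered pattern drops one level of backslashes):

terms <- c("well-being (positive)", "self+other", "cost|benefit")

# prefix each special regex character with a backslash
escaped <- gsub("([][)(}{*.^$+?\\\\|])", "\\\\\\1", terms)
escaped
# prints as "well-being \\(positive\\)" "self\\+other" "cost\\|benefit"

# escaped terms can then be embedded in larger patterns as literals
grepl(escaped[2], "self+other and more") # TRUE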
Other Dictionary functions: dictionary_meta(), download.dict(), lma_patcat(), lma_termcat(), read.dic(), report_term_matches()
# just retrieve information about available dictionaries
dicts <- select.dict()$info
dicts[1:10, 4:9]

# select all dictionaries mentioning sentiment or emotion
sentiment_dicts <- select.dict("sentiment emotion")$selected
sentiment_dicts[1:10, 4:9]
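Beyond keyword queries, the info columns described above can also be filtered directly; a minimal sketch, assuming weighted is stored as a logical column:

info <- select.dict()$info

# unweighted, glob-type dictionaries (stored as .dic files)
glob_dicts <- info[!info$weighted & info$term_type == "glob", ]

# emotion dictionaries with relatively few categories
emotion_dicts <- info[info$subject == "emotion" & info$ncategories <= 10, ]
rownames(emotion_dicts)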
Retrieve information and links to latent semantic spaces (sets of word vectors/embeddings) available at osf.io/489he, and optionally download their term mappings (osf.io/xr7jv).
select.lspace(query = NULL, dir = getOption("lingmatch.lspace.dir"), terms = NULL, get.map = FALSE, check.md5 = TRUE, mode = "wb")
query | A character used to select spaces, based on names or other features. If length is over 1, entries are treated as terms, as with the terms argument. |
dir | Path to a directory containing lma_term_map.rda and downloaded spaces. |
terms | A character vector of terms to search for in the downloaded term map, to calculate coverage of spaces, or select by coverage if query is not specified. |
get.map | Logical; if TRUE, downloads the term map (osf.io/xr7jv) if it is not already present in dir. |
check.md5 | Logical; if TRUE, verifies the MD5 checksum of downloaded files. |
mode | Passed to download.file() if files are downloaded. |
A list with varying entries:

info: The version of osf.io/9yzca stored internally; a data.frame with spaces as row names, and information about each space in columns:

  terms: Number of terms in the space.
  corpus: Corpus (or corpora) on which the space was trained.
  model: Model from which the space was trained.
  dimensions: Number of dimensions in the model (columns of the space).
  model_info: Some parameter details about the model.
  original_max: Maximum value used to normalize the space; the original space would be (vectors * original_max) / 100.
  osf_dat: OSF ID for the .dat files; the URL would be https://osf.io/{osf_dat}.
  osf_terms: OSF ID for the _terms.txt files; the URL would be https://osf.io/{osf_terms}.
  wiki: Link to the wiki for the space.
  downloaded: Path to the .dat file if downloaded, and '' otherwise.

selected: A subset of info selected by query.

term_map: If get.map is TRUE or lma_term_map.rda is found in dir, a copy of osf.io/xr7jv, which has space names as column names, terms as row names, and indices as values, with 0 indicating the term is not present in the associated space.
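Since 0 marks absence, coverage of a custom term list can be computed from the term map directly; a minimal sketch, assuming the map has been downloaded:

## Not run:
res <- select.lspace(get.map = TRUE)

my_terms <- c("happy", "bright", "friend")
present <- rownames(res$term_map) %in% my_terms

# proportion of my_terms with a nonzero index in each space
coverage <- colSums(res$term_map[present, , drop = FALSE] > 0) / length(my_terms)
sort(coverage, decreasing = TRUE)[1:5]
## End(Not run)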
Other Latent Semantic Space functions: download.lspace(), lma_lspace(), standardize.lspace()
# just retrieve information about available spaces
spaces <- select.lspace()
spaces$info[1:10, c("terms", "dimensions", "original_max")]

# retrieve all spaces that used word2vec
w2v_spaces <- select.lspace("word2vec")$selected
w2v_spaces[, c("terms", "dimensions", "original_max")]

## Not run:
# select spaces by terms
select.lspace(terms = c(
  "part-time", "i/o", "'cause", "brexit", "debuffs"
))$selected[, c("terms", "coverage")]
## End(Not run)
Reformat a .rda file containing a matrix with terms as row names, or a plain-text embeddings file with a term at the start of each line and consistent delimiting characters. Plain-text files are processed line by line, so large spaces can be reformatted without loading them entirely into memory.
standardize.lspace(infile, name, sep = " ", digits = 9, dir = getOption("lingmatch.lspace.dir"), outdir = dir, remove = "", term_check = "^[a-zA-Z]+$|^['a-zA-Z][a-zA-Z.'\\/-]*[a-zA-Z.]$", verbose = FALSE)
infile | Name of the .rda or plain-text file relative to dir. |
name | Base name of the reformatted file and term file; e.g., "glove" would result in glove.dat and glove_terms.txt. |
sep | Delimiting character between values in each line, e.g., " " or "\t". |
digits | Number of digits to round values to; default is 9. |
dir | Path to folder containing infile. |
outdir | Path to folder in which to save standardized files; default is dir. |
remove | A string with a regex pattern to be removed from term names. |
term_check | A string with a regex pattern by which to filter terms; i.e., only lines with fully matched terms are written to the reformatted file. The default attempts to retain only regular words, including those with dashes, forward slashes, and periods. Set to an empty string ("") to include all terms (see the sketch after this table). |
verbose | Logical; if TRUE, displays progress. |
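To preview which terms the default term_check pattern retains, it can be applied directly with grepl():

# default pattern: plain words, plus words with internal apostrophes,
# periods, slashes, or dashes
pattern <- "^[a-zA-Z]+$|^['a-zA-Z][a-zA-Z.'\\/-]*[a-zA-Z.]$"

candidates <- c("cat", "co-op", "don't", "U.S.", "3d", "hello_world")
candidates[grepl(pattern, candidates)]
# "cat" "co-op" "don't" "U.S."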
Path to the standardized [1] data file and [2] terms file if applicable.
Other Latent Semantic Space functions: download.lspace(), lma_lspace(), select.lspace()
## Not run:
# from https://sites.google.com/site/fritzgntr/software-resources/semantic_spaces
standardize.lspace("EN_100k_lsa.rda", "100k_lsa")

# from https://fasttext.cc/docs/en/english-vectors.html
standardize.lspace("crawl-300d-2M.vec", "facebook_crawl")

# Standardized versions of these spaces can also be downloaded with download.lspace.
## End(Not run)
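After standardizing, a quick sanity check of the output files might look like this; a sketch assuming the "100k_lsa" example above, the default output directory, and the default space delimiter:

## Not run:
dir <- getOption("lingmatch.lspace.dir")

# one delimited vector per line of the .dat file...
first_vector <- scan(file.path(dir, "100k_lsa.dat"), nlines = 1, quiet = TRUE)
length(first_vector) # number of dimensions

# ...with the corresponding term on the same line of the terms file
readLines(file.path(dir, "100k_lsa_terms.txt"), n = 1)
## End(Not run)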