This vignette demonstrates the usage of various similarity functions for analyzing speeches. We will use the example data speeches_data, stored in inst/extdata, to showcase these functions.

First, let's load the example data:
data_path <- system.file("extdata", "speeches_data.Rdata", package = "conversim")
load(data_path)
# Print a summary of the speeches data
print(summary(speeches_data))
## speaker_id text
## Length:2 Length:2
## Class :character Class :character
## Mode :character Mode :character
Before we turn to the similarity functions, let's look at the preprocess_text function:
# Example usage with our data
original_text <- substr(speeches_data$text[1], 1, 200) # First 200 characters of speech A
preprocessed_text <- preprocess_text(original_text)
print(paste("Original:", original_text))
## [1] "Original: Ladies and Gentlemen, Distinguished Guests,\n\nToday, I stand before you to address one of the most pressing challenges of our time—climate change. What was once a distant concern is now an undeniable r"
## [1] "Preprocessed: ladies and gentlemen distinguished guests today i stand before you to address one of the most pressing challenges of our timeclimate change what was once a distant concern is now an undeniable r"
The topic_similarity function calculates the similarity between two speeches based on their topics:
# Example usage with our speeches data
lda_similarity <- topic_similarity(speeches_data$text[1], speeches_data$text[2], method = "lda", num_topics = 5)
lsa_similarity <- topic_similarity(speeches_data$text[1], speeches_data$text[2], method = "lsa", num_topics = 5)
print(paste("LDA Similarity:", lda_similarity))
## [1] "LDA Similarity: 0.169419269706043"
## [1] "LSA Similarity: 1"
Note: the difference between the LDA (Latent Dirichlet Allocation) topic similarity (0.1694) and the LSA (Latent Semantic Analysis) topic similarity (1) can be attributed to several factors:

1. LDA and LSA use fundamentally different approaches to topic modeling and semantic analysis.
2. Both methods are sensitive to the input parameters, especially the number of topics chosen. The code above used five topics for both methods, which may have been more appropriate for LDA than for LSA in this particular case (see the sketch below).
3. Although both speeches are about climate change, they focus on different aspects of the topic. LDA may be better suited to capture these nuanced differences in topic distribution, whereas LSA can oversimplify the comparison because of the shared overall theme and vocabulary.
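One way to probe the parameter sensitivity in point 2 is to recompute both scores over a range of topic counts. The loop below reuses the same topic_similarity() call as above; the specific topic counts are an arbitrary choice:

# Recompute LDA and LSA similarity for several (arbitrarily chosen) topic counts
for (k in c(2, 5, 10)) {
  lda_k <- topic_similarity(speeches_data$text[1], speeches_data$text[2],
                            method = "lda", num_topics = k)
  lsa_k <- topic_similarity(speeches_data$text[1], speeches_data$text[2],
                            method = "lsa", num_topics = k)
  print(paste0("num_topics = ", k, ": LDA = ", round(lda_k, 3),
               ", LSA = ", round(lsa_k, 3)))
}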
The lexical_similarity function calculates the similarity between two speeches based on their shared unique words:
# Example usage with our speeches data
lex_similarity <- lexical_similarity(speeches_data$text[1], speeches_data$text[2])
print(paste("Lexical Similarity:", lex_similarity))
## [1] "Lexical Similarity: 0.15180265654649"
The semantic_similarity function calculates the semantic similarity between two speeches using different methods:
# Example usage with our speeches data
tfidf_similarity <- semantic_similarity(speeches_data$text[1], speeches_data$text[2], method = "tfidf")
word2vec_similarity <- semantic_similarity(speeches_data$text[1], speeches_data$text[2], method = "word2vec")
print(paste("TF-IDF Similarity:", tfidf_similarity))
## [1] "TF-IDF Similarity: 0.5"
## [1] "Word2Vec Similarity: 0.998952039779365"
The structural_similarity function calculates the similarity between two speeches based on their structure:
# Example usage with our speeches data
struct_similarity <- structural_similarity(strsplit(speeches_data$text[1], "\n")[[1]],
strsplit(speeches_data$text[2], "\n")[[1]])
print(paste("Structural Similarity:", struct_similarity))
## [1] "Structural Similarity: 0.889420039965884"
The stylistic_similarity function calculates various stylistic features and their similarity between two speeches:
# Example usage with our speeches data
style_similarity <- stylistic_similarity(speeches_data$text[1], speeches_data$text[2])
print("Stylistic Similarity Results:")
## [1] "Stylistic Similarity Results:"
## $text1_features
## ttr avg_sentence_length fk_grade
## 0.644186 23.888889 19.878760
##
## $text2_features
## ttr avg_sentence_length fk_grade
## 0.5490849 23.1153846 17.0446339
##
## $feature_differences
## ttr avg_sentence_length fk_grade
## 0.09510119 0.77350427 2.83412575
##
## $overall_similarity
## [1] 0.8924734
##
## $cosine_similarity
## [1] 0.9949162
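The three features reported above are standard stylometric measures: the type-token ratio (unique words divided by total words), the average sentence length in words, and the Flesch-Kincaid grade level. The first two are easy to reproduce by hand; the sketch below is a rough sanity check for speech A, and its values may differ slightly from stylistic_similarity's own tokenization:

# Recompute type-token ratio and average sentence length for speech A by hand
tokens    <- strsplit(preprocess_text(speeches_data$text[1]), "\\s+")[[1]]
sentences <- unlist(strsplit(speeches_data$text[1], "[.!?]+"))
sentences <- sentences[nchar(trimws(sentences)) > 0]
ttr_a <- length(unique(tokens)) / length(tokens)   # type-token ratio
asl_a <- length(tokens) / length(sentences)        # words per sentence
print(c(ttr = ttr_a, avg_sentence_length = asl_a))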
The sentiment_similarity function calculates the sentiment similarity between two speeches:
# Example usage with our speeches data
sent_similarity <- sentiment_similarity(speeches_data$text[1], speeches_data$text[2])
print(paste("Sentiment Similarity:", sent_similarity))
## [1] "Sentiment Similarity: 0.952602694643716"
This vignette has demonstrated the usage of various similarity functions for analyzing speeches using the provided speeches_data.Rdata. These functions can be used individually or combined, as sketched below, into a comprehensive similarity analysis between different speeches in your dataset.
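For example, the individual scores computed above can be collected and averaged into a single summary score; the equal weights implied by the plain mean are an arbitrary choice and should be adjusted to suit your analysis:

# Combine the individual scores computed above into one overall score
# (an unweighted mean; the weighting is an arbitrary choice)
scores <- c(topic      = lda_similarity,
            lexical    = lex_similarity,
            semantic   = tfidf_similarity,
            structural = struct_similarity,
            stylistic  = style_similarity$overall_similarity,
            sentiment  = sent_similarity)
print(round(scores, 3))
print(paste("Overall similarity (unweighted mean):", round(mean(scores), 3)))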