This vignette will introduce you to phrase mining using the phm package. Those who are familiar with the tm package will recognize that there are similarities in functionality between the two.
Phrase mining is done on a corpus of texts, or on a vector where each element is a text, via the function phraseDoc. This function creates a phraseDoc object, which is equivalent to a term-document matrix stored in a more efficient manner. To see the term-document matrix, use the function as.matrix on its output.
The term-document matrix has phrases on its rows and documents on its columns; each cell holds the number of times the phrase occurs in the document.
When the phrase document is created from a vector of texts, each element of the vector is treated as a document whose ID is the index of that element.
The phraseDoc function will extract principal phrases from the texts given to it. A principal phrase is a phrase that is frequent in its own right (so not as part of a different phrase), is meaningful, does not cross punctuation marks, and does not start or end with so-called stop-words (with a few exceptions).
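To make "frequent in its own right" concrete, here is a minimal base-R sketch that counts a short phrase only where it is not part of a longer phrase. It is an illustration under simplifying assumptions, not the algorithm phraseDoc uses: the helper count_phrase is hypothetical, the longer phrases are supplied by hand, and each longer phrase is assumed to contain the short phrase exactly once.

```r
texts <- c("This is a test text",
           "This is another test text",
           "This boy will test text that man")

# Hypothetical helper: literal, non-overlapping occurrences of a phrase per text
count_phrase <- function(phrase, texts) {
  vapply(texts, function(txt) {
    hits <- gregexpr(phrase, txt, fixed = TRUE)[[1]]
    if (hits[1] == -1L) 0L else length(hits)
  }, integer(1), USE.NAMES = FALSE)
}

# Raw counts: "test text" occurs once in every text
raw_short <- count_phrase("test text", texts)

# Occurrences that are really part of a longer phrase
raw_long <- count_phrase("another test text", texts) +
  count_phrase("will test text that man", texts)

# Occurrences of "test text" in its own right
own_right <- raw_short - raw_long
own_right
```

Only the first text keeps a nonzero count for the short phrase; in the other two it is absorbed by a longer phrase.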
The phraseDoc function gives progress updates, since at times it may take a while to complete. These can be silenced if desired. When using this function in a Shiny application, these progress updates can be given via a Shiny progress meter; the function uses about 100 progress steps, so it should be called inside a withProgress function with the argument max set to at least 100. The argument shiny in the phraseDoc function should be set to TRUE in that case.
When converting a phraseDoc object to a term-document matrix using as.matrix (see ?as.matrix.phraseDoc), by default the IDs of the documents are displayed on the columns. This can be changed to display the indices of the documents instead.
Once the phraseDoc object has been created, there are several functions available that will obtain information from it:
freqPhrases will display its most frequent phrases
getDocs will display all its documents that have nonzero frequencies for phrases that appear in a vector of phrases
getPhrases will display all phrases occurring in documents that appear in a vector of document IDs or document indices
removePhrases will return the phraseDoc object with a set of phrases removed
As an example, we create the following vector with texts:
tst=c("This is a test text",
"This is a test text 2",
"This is another test text",
"This is another test text 2",
"This girl will test text that man",
"This boy will test text that man")
Create the phraseDoc object on it:
pd=phraseDoc(tst)
#> [1] "2024-01-26 17:23:49 EST"
#> [1] "2024-01-26 17:23:49 EST"
#> [1] "Rectifying frequencies..."
#> [1] "2024-01-26 17:23:49 EST"
Display the term-document matrix:
as.matrix(pd)
#>                          docs
#> phrases                   1 2 3 4 5 6
#>   another test text       0 0 1 1 0 0
#>   test text               1 1 0 0 0 0
#>   will test text that man 0 0 0 0 1 1
Get the 3 most frequent principal phrases:
freqPhrases(pd,3)
Obtain all frequencies for documents with the phrases “test text” or “another test text”:
getDocs(pd,c("test text","another test text"))
#>                   1 2 3 4
#> another test text 0 0 1 1
#> test text         1 1 0 0
Obtain all frequencies for principal phrases in documents 1 and 2:
getPhrases(pd,c(1,2))
Remove the phrase “test text” from the phrase document:
pd=removePhrases(pd, "test text")
as.matrix(pd)
#>                          docs
#> phrases                   1 2 3 4 5 6
#>   another test text       0 0 1 1 0 0
#>   will test text that man 0 0 0 0 1 1
Note that removePhrases will remove a phrase from the phrase document, but it is not able to restore the frequencies of the phrases inside the removed phrases. If this is desired, the phrases should be removed when creating the phrase document using the argument sp instead.
The phm package also provides a distance measure that is optimal for text. Text distance is calculated as the proportion of unmatched frequencies, i.e., the number of unmatched frequencies divided by the total frequencies among the two vectors. Text clustering functions can be used for term-document matrices with phrases, as well as for regular term-document matrices where the terms are words (usually obtained via functions in the tm package).
Text distance is a number between 0 and 1, where 0 means that the two texts have the same terms and the same frequencies of those terms, and 1 indicates that they have no terms in common. A smaller number means that the texts are more alike, while a larger number (closer to 1) means they are less alike.
The function textDist will calculate the text distance between two numeric vectors. Each vector represents a document, and the numbers in the vectors are the frequencies of terms. For instance, if the first vector starts with 1 and the second starts with 0, the first document has one occurrence of the first term, while the second document has none.
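The definition of text distance can be sketched in a few lines of base R. This is not the source of textDist; it assumes that "unmatched frequencies" means the sum of the absolute differences between the two frequency vectors, a reading that reproduces the distances printed in this vignette. The function name text_dist_sketch and the example vectors are made up for illustration.

```r
# Sketch: unmatched frequencies divided by total frequencies
text_dist_sketch <- function(x, y) {
  stopifnot(length(x) == length(y))
  total <- sum(x) + sum(y)
  # Assumption: the distance is left undefined when both documents are empty
  if (total == 0) return(NA_real_)
  sum(abs(x - y)) / total
}

x <- c(1, 2, 0)  # document 1: term frequencies
y <- c(0, 1, 3)  # document 2: term frequencies

d          <- text_dist_sketch(x, y)              # 5 unmatched out of 7 in total
d_same     <- text_dist_sketch(x, x)              # identical documents
d_disjoint <- text_dist_sketch(c(1, 0), c(0, 4))  # no terms in common
c(d, d_same, d_disjoint)
```

Identical documents give 0 and documents without shared terms give 1, in line with the range described above.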
The function textDist can also be used on matrices, in which case a vector with the text distance between corresponding columns is returned:
(M1=matrix(c(0,1,0,2,0,10,0,14),4))
#> [,1] [,2]
#> [1,] 0 0
#> [2,] 1 10
#> [3,] 0 0
#> [4,] 2 14
(M2=matrix(c(12,0,8,0,1,3,1,2),4))
#> [,1] [,2]
#> [1,] 12 1
#> [2,] 0 3
#> [3,] 8 1
#> [4,] 0 2
textDist(M1,M2)
#> [1] 1.0000000 0.6774194
Note that the first columns of the two matrices have no terms in common, and so their distance is the highest possible: 1.
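As a cross-check, the two distances above can be recomputed column by column in base R. The formula is an assumed reading of "unmatched frequencies divided by total frequencies" (sum of absolute differences over the sum of all frequencies), not code taken from the package.

```r
M1 <- matrix(c(0, 1, 0, 2, 0, 10, 0, 14), 4)
M2 <- matrix(c(12, 0, 8, 0, 1, 3, 1, 2), 4)

# Unmatched frequencies per column pair, divided by the total frequencies
col_dist <- colSums(abs(M1 - M2)) / (colSums(M1) + colSums(M2))
round(col_dist, 7)
# [1] 1.0000000 0.6774194
```

For the second pair of columns, 21 of the 31 frequencies are unmatched, which gives 21/31.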
The function textDistMatrix calculates the text distance between all combinations of the columns of a matrix:
M=matrix(c(0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0),4)
colnames(M)=1:4;rownames(M)=c("A","B","C","D")
M
#> 1 2 3 4
#> A 0 0 12 1
#> B 1 10 0 0
#> C 0 0 8 1
#> D 2 14 0 0
(tdm=textDistMatrix(M))
#> 1 2 3
#> 2 0.7777778
#> 3 1.0000000 1.0000000
#> 4 1.0000000 1.0000000 0.8181818
class(tdm)
#> [1] "dist"
Note that the output of this function is an object of class dist.
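Because the result is a dist object, it can be handed to any base-R function that accepts one. The sketch below rebuilds the same pairwise distances by hand, under the assumption that text distance is the sum of absolute differences divided by the sum of all frequencies, and passes them to hclust as one possible use. It is not how textDistMatrix is implemented.

```r
M <- matrix(c(0, 1, 0, 2, 0, 10, 0, 14, 12, 0, 8, 0, 1, 0, 1, 0), 4,
            dimnames = list(c("A", "B", "C", "D"), as.character(1:4)))

# Pairwise text distance between all columns (documents)
n <- ncol(M)
D <- matrix(0, n, n, dimnames = list(colnames(M), colnames(M)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    D[i, j] <- sum(abs(M[, i] - M[, j])) / (sum(M[, i]) + sum(M[, j]))
  }
}

d <- as.dist(D)                       # same class as the output of textDistMatrix
hc <- hclust(d, method = "average")   # hierarchical clustering on text distance
groups <- cutree(hc, k = 2)           # cut the tree into two groups
groups
```

With two groups, documents 1 and 2 end up together, as do documents 3 and 4, since those are the only pairs that share terms.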
The function textCluster will use text clustering to cluster any term-document matrix. Its output is similar to the output of the function kmeans. However, note that, if there are any documents without terms, they will all be stored in the last cluster.
First we create a term-document matrix:
M=matrix(c(rep(0,4),0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0,rep(0,4)),4)
colnames(M)=1:6;rownames(M)=c("A","B","C","D")
M
#> 1 2 3 4 5 6
#> A 0 0 0 12 1 0
#> B 0 1 10 0 0 0
#> C 0 0 0 8 1 0
#> D 0 2 14 0 0 0
Then we cluster it into 3 clusters:
tc=textCluster(M,3)
We can look at the output of the function:
#This shows for each document what cluster it is in
tc$cluster
#> 1 2 3 4 5 6
#> 3 1 1 2 2 3
#This shows for each cluster how many documents it contains
tc$size
#> 1 2 3
#> 2 2 2
#This matrix shows the centroid for each cluster on the columns, with terms on
#the rows
tc$centroids
#> 1 2 3
#> A 0.0 6.5 0
#> B 5.5 0.0 0
#> C 0.0 4.5 0
#> D 8.0 0.0 0
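The sizes and centroids shown above can be recomputed from the matrix and the cluster assignment alone. In this sketch the cluster vector is copied from the printed output, and the centroid of a cluster is taken to be the per-term mean over its documents; that this is how textCluster computes centroids is an assumption, although it matches the printed values.

```r
M <- matrix(c(rep(0, 4), 0, 1, 0, 2, 0, 10, 0, 14, 12, 0, 8, 0, 1, 0, 1, 0, rep(0, 4)), 4,
            dimnames = list(c("A", "B", "C", "D"), as.character(1:6)))

# Cluster of each document, copied from the tc$cluster output above
cl <- c(3, 1, 1, 2, 2, 3)

# Number of documents per cluster
sizes <- table(cl)

# Per-term mean over the documents of each cluster, one column per cluster
centroids <- sapply(1:3, function(k) rowMeans(M[, cl == k, drop = FALSE]))
colnames(centroids) <- 1:3

sizes
centroids
```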
The function showCluster will show the contents of one specific cluster. For each term, it also shows a column with the number of documents the term appears in and a column with the total frequency of the term in the cluster. The terms are displayed in descending order of those last two columns, so the most common terms are displayed first.
Note that this function can be used with any clustering method; all it needs is the term-document matrix, a vector with the cluster ID for each document, and the number of the cluster to be displayed.
Let’s take a look at the clusters we have created:
showCluster(M,tc$cluster,1)
#> 2 3 nDocs totFreq
#> D 2 14 2 16
#> B 1 10 2 11
showCluster(M,tc$cluster,2)
#> 4 5 nDocs totFreq
#> A 12 1 2 13
#> C 8 1 2 9
showCluster(M,tc$cluster,3)
#> $docs
#> [1] "1" "6"
#>
#> $note
#> [1] "Documents have no terms"
We see for example that the first cluster consists of documents 2 and 3, which contain only terms “B” and “D”, both occurring in both documents, with “D” having the greatest overall frequency in the cluster, so it occurs first.
The second cluster consists of documents 4 and 5, which contain only terms “A” and “C”. Both documents contain those two terms, but the frequency of “A” is the largest and thus it appears first.
Note that the last cluster looks different from the others; it contains all documents without terms. These are documents 1 and 6.
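The two summary columns are simple to reproduce, which also shows why showCluster works with any clustering method. The following base-R sketch rebuilds the display for cluster 1 from the matrix and a cluster vector; it mimics the printed output and is not the package's own code.

```r
M <- matrix(c(rep(0, 4), 0, 1, 0, 2, 0, 10, 0, 14, 12, 0, 8, 0, 1, 0, 1, 0, rep(0, 4)), 4,
            dimnames = list(c("A", "B", "C", "D"), as.character(1:6)))
cl <- c(3, 1, 1, 2, 2, 3)  # cluster of each document, from any clustering method

# Documents in cluster 1, keeping only terms that occur in them
sub <- M[, cl == 1, drop = FALSE]
sub <- sub[rowSums(sub) > 0, , drop = FALSE]

# Number of documents per term and total frequency per term
summary_tab <- cbind(sub, nDocs = rowSums(sub > 0), totFreq = rowSums(sub))

# Most common terms first: by nDocs, then by totFreq, both descending
ord <- order(-summary_tab[, "nDocs"], -summary_tab[, "totFreq"])
summary_tab <- summary_tab[ord, , drop = FALSE]
summary_tab
```

Terms B and D both occur in two documents, so the tie is broken by total frequency and D is listed first.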
We can create a corpus from a data frame such that all variables except the text variable are stored in the meta fields of the documents. This is done using the function DFSource, in conjunction with the VCorpus function, which resides in the tm package:
(df=data.frame(id=LETTERS[1:3],text=c("First text","Second text","Third text"),
title=c("N1","N2","N3"),author=c("Smith","Jones","Jones")))
#>   id        text title author
#> 1  A  First text    N1  Smith
#> 2  B Second text    N2  Jones
#> 3  C  Third text    N3  Jones
#Create the corpus
co=tm::VCorpus(DFSource(df))
#The content of one of the documents
co[[1]]$content
#> [1] "First text"
#The meta data of one of the documents; all variables are present.
co[[1]]$meta
#> author : Smith
#> datetimestamp: 2024-01-26 22:23:49.3588500022888
#> description : character(0)
#> heading : character(0)
#> id : A
#> language : en
#> origin : character(0)
#> title : N1
Note that the data frame must have the variables text and id.
The PubMed website will allow a user to enter search criteria, and will return abstracts of medical publications related to those search criteria. These abstracts can be saved in PubMed format, in which case a file will be created on the user’s host system containing these abstracts together with some additional information for each publication.
Running the function getPubMed on this file will create a data table with its contents, with each row representing a publication. The data table will have an id variable, containing the PMIDs of the publications, a text variable containing the text of the abstracts, and several other variables.
Passing this data table to DFSource, and the result to tm::VCorpus, will create a corpus with a plain text document for each publication. This corpus may then be used to perform phrase mining or regular (word) text mining.
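For readers who have not seen a PubMed-format file, the sketch below shows the kind of input involved and how an id and a text variable can be derived from it. It is not the implementation of getPubMed: the two records are made up, the helper get_field is hypothetical, and only the PMID and AB (abstract) fields are read. It relies on the layout of the PubMed format, in which field tags are padded to four characters and followed by "- ", continuation lines are indented by six spaces, and records are separated by blank lines.

```r
# Two made-up records in PubMed format
lines <- c(
  "PMID- 11111111",
  "TI  - A made-up title",
  "AB  - First line of a made-up abstract",
  "      continued on a second line.",
  "",
  "PMID- 22222222",
  "TI  - Another made-up title",
  "AB  - A second made-up abstract."
)

# Split the lines into records at blank lines
rec_id  <- cumsum(lines == "") + 1L
records <- split(lines[lines != ""], rec_id[lines != ""])

# Hypothetical helper: value of one tagged field, joining continuation lines
get_field <- function(rec, tag) {
  start <- which(startsWith(rec, sprintf("%-4s- ", tag)))
  if (length(start) == 0L) return(NA_character_)
  start <- start[1]
  end <- start
  while (end < length(rec) && startsWith(rec[end + 1L], "      ")) end <- end + 1L
  paste(trimws(substring(rec[start:end], 7L)), collapse = " ")
}

# One row per publication, with the id and text variables a corpus source needs
pubs <- data.frame(
  id   = vapply(records, get_field, character(1), tag = "PMID"),
  text = vapply(records, get_field, character(1), tag = "AB"),
  stringsAsFactors = FALSE, row.names = NULL
)
pubs
```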