phm-intro

1. Phrase Mining

This vignette will introduce you to phrase mining using the phm package. Those who are familiar with the tm package will recognize that there are similarities in functionality between the two.

Phrase Mining is done on a corpus with texts, or on a vector where each element is a text, via the function phraseDoc. This function will create a phraseDoc object, which is equivalent to a term-document matrix stored in a more efficient manner. To see the term-document matrix, use the function as.matrix on its output.

The term-document matrix will have phrases on its rows and documents on its columns, and on the intersection of row and column there will be a frequency indicating the number of times a phrase occurs in the document.

In the case that the phrase document is created on a vector with texts, each element in the vector is considered to be a document, with as its ID the index of the element.

The phraseDoc function will extract principal phrases from the texts given to it. A principal phrase is a phrase that is frequent in its own right (so not as part of a different phrase), is meaningful, does not cross punctuation marks, and does not start or end with so-called stop-words (with a few exceptions).

The phraseDoc function gives progress updates, since at times it may take a while to complete. These can be silenced if desired. When using this function in a Shiny application, these progress updates can be given via a Shiny progress meter; the function uses about 100 progress steps, so it should be created inside a withProgress function with the argument max set to at least 100. The argument shiny in the phraseDoc function should be set to TRUE in that case.

When converting a phraseDoc object to a term-document matrix using as.matrix (see ?as.matrix.phraseDoc) by default the Ids of the documents are displayed on the columns. This can be changed to display the indices of the documents instead.

Once the phraseDoc object has been created, there are several functions available that will obtain information from it:

freqPhrases will display its most frequent phrases
getDocs will display all its documents that have nonzero frequencies for phrases that appear in a vector of phrases
getPhrases will display all phrases occurring in documents that appear in a vector of document IDs or document indices
removePhrases will return the phraseDoc object with a set of phrases removed

As an example, we create the following vector with texts:

tst=c("This is a test text",
      "This is a test text 2",
      "This is another test text",
      "This is another test text 2",
      "This girl will test text that man",
      "This boy will test text that man")

Create the phraseDoc object on it:

pd=phraseDoc(tst)
#> [1] "2024-01-26 17:23:49 EST"
#> [1] "2024-01-26 17:23:49 EST"
#> [1] "Rectifying frequencies..."
#> [1] "2024-01-26 17:23:49 EST"

Display the term-document matrix:

as.matrix(pd)
#>                          docs
#> phrases                   1 2 3 4 5 6
#>   another test text       0 0 1 1 0 0
#>   test text               1 1 0 0 0 0
#>   will test text that man 0 0 0 0 1 1

Get the 3 most frequent principal phrases:

freqPhrases(pd,3)
#>                         frequency
#> will test text that man         2
#> test text                       2
#> another test text               2

Obtain all frequencies for documents with the phrases “test text” or “another test text”:

getDocs(pd,c("test text","another test text"))
#>                   1 2 3 4
#> another test text 0 0 1 1
#> test text         1 1 0 0

Obtain all frequencies for principal phrases in documents 1 and 2:

getPhrases(pd, 1:2)
#>           1 2
#> test text 1 1

Remove the phrase “test text” from the phrase document:

pd=removePhrases(pd, "test text")
as.matrix(pd)
#>                          docs
#> phrases                   1 2 3 4 5 6
#>   another test text       0 0 1 1 0 0
#>   will test text that man 0 0 0 0 1 1

Note that removePhrases will remove a phrase from the phrase document, but it is not able to restore the frequencies of the phrases inside the removed phrases. If this is desired, the phrases should be removed when creating the phrase document using the argument sp instead.

2. Text Clustering

The phm package also provides a distance measure that is optimal for text. Text distance is calculated as the proportion of unmatched frequencies, i.e., the number of unmatched frequencies divided by the total frequencies among the two vectors. Text clustering functions can be used for term-document matrices with phrases, as well as for regular term-document matrices where the terms are words (usually obtained via functions in the tm package).

Text distance is a number between 0 and 1, where 0 means that the two texts have the same terms and the same frequencies of those terms, and 1 indicates that they have no terms in common. A smaller number means that the texts are more alike, while a larger number (closer to 1) means they are less alike.

The function textDist will calculate the text distance between two numeric vectors:

textDist(c(1,2,0),c(0,1,1))
#> [1] 0.6

Each vector represents a document, and the numbers in the vectors are the frequencies of terms. In the example, the first document/vector has one occurrence of the first term, while the second document has none.

The function textDist can also be used on matrices, in which case a vector with the text distance between corresponding columns is returned:

(M1=matrix(c(0,1,0,2,0,10,0,14),4))
#>      [,1] [,2]
#> [1,]    0    0
#> [2,]    1   10
#> [3,]    0    0
#> [4,]    2   14
(M2=matrix(c(12,0,8,0,1,3,1,2),4))
#>      [,1] [,2]
#> [1,]   12    1
#> [2,]    0    3
#> [3,]    8    1
#> [4,]    0    2
textDist(M1,M2)
#> [1] 1.0000000 0.6774194

Note that the first columns of the two matrices have no terms in common, and so their distance is the highest possible: 1.

The function textDistMatrix calculates the text distance between all combinations of the columns of a matrix:

M=matrix(c(0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0),4)
colnames(M)=1:4;rownames(M)=c("A","B","C","D")
M
#>   1  2  3 4
#> A 0  0 12 1
#> B 1 10  0 0
#> C 0  0  8 1
#> D 2 14  0 0
(tdm=textDistMatrix(M))
#>           1         2         3
#> 2 0.7777778                    
#> 3 1.0000000 1.0000000          
#> 4 1.0000000 1.0000000 0.8181818
class(tdm)
#> [1] "dist"

Note that the output of this function is of type dist.

The function textCluster will use text clustering to cluster any term-document matrix. Its output is similar to the output of the function kmeans. However, note that, if there are any documents without terms, they will all be stored in the last cluster.

First we create a term-document matrix:

M=matrix(c(rep(0,4),0,1,0,2,0,10,0,14,12,0,8,0,1,0,1,0,rep(0,4)),4)
colnames(M)=1:6;rownames(M)=c("A","B","C","D")
M
#>   1 2  3  4 5 6
#> A 0 0  0 12 1 0
#> B 0 1 10  0 0 0
#> C 0 0  0  8 1 0
#> D 0 2 14  0 0 0

Then we cluster it into 3 clusters:

(tc=textCluster(M,3))
#> << textCluster >>
#> 3 clusters
#> 6 documents

We can look at the output of the function:

#This shows for each document what cluster it is in
tc$cluster
#> 1 2 3 4 5 6 
#> 3 1 1 2 2 3

#This shows for each cluster how many documents it contains
tc$size
#> 1 2 3 
#> 2 2 2

#This matrix shows the centroid for each cluster on the columns, with terms on
#the rows
tc$centroids
#>     1   2 3
#> A 0.0 6.5 0
#> B 5.5 0.0 0
#> C 0.0 4.5 0
#> D 8.0 0.0 0

The function showCluster will show the contents of one specific cluster. It will also show a column with for each term the number of documents it appears in, and a column with the total frequency of each term in the cluster. The terms are displayed in descending order of those last two columns, so the most common terms are displayed first.

Note that this function can be used with any clustering method; all it needs is the term-document matrix, a vector with the cluster ID for each document, and the number of the cluster to be displayed.

Let’s take a look at the clusters we have created:

showCluster(M,tc$cluster,1)
#>   2  3 nDocs totFreq
#> D 2 14     2      16
#> B 1 10     2      11
showCluster(M,tc$cluster,2)
#>    4 5 nDocs totFreq
#> A 12 1     2      13
#> C  8 1     2       9
showCluster(M,tc$cluster,3)
#> $docs
#> [1] "1" "6"
#> 
#> $note
#> [1] "Documents have no terms"

We see for example that the first cluster consists of documents 2 and 3, which contain only terms “B” and “D”, both occurring in both documents, with “D” having the greatest overall frequency in the cluster, so it occurs first.

The second cluster consists of documents 4 and 5, which contain only terms “A” and “C”. Both documents contain those two terms, but the frequency of “A” is the largest and thus it appears first.

Note that the last cluster looks different from the others; it contains all documents without terms. These are documents 1 and 6.

3. Create a Corpus from a Data Frame

We can create a corpus from a data frame such that all variables except the text variable are stored in the meta fields of the documents. This is done using the function DFSource, in conjunction with the VCorpus function, which resides in the tm package:

(df=data.frame(id=LETTERS[1:3],text=c("First text","Second text","Third text"),
              title=c("N1","N2","N3"),author=c("Smith","Jones","Jones")))
#>   id        text title author
#> 1  A  First text    N1  Smith
#> 2  B Second text    N2  Jones
#> 3  C  Third text    N3  Jones

#Create the corpus
co=tm::VCorpus(DFSource(df))

#The content of one of the documents
co[[1]]$content
#> [1] "First text"

#The meta data of one of the documents; all variables are present.
co[[1]]$meta
#>   author       : Smith
#>   datetimestamp: 2024-01-26 22:23:49.3588500022888
#>   description  : character(0)
#>   heading      : character(0)
#>   id           : A
#>   language     : en
#>   origin       : character(0)
#>   title        : N1

Note that the data frame must have the variables text and id.

phm-intro

1. Phrase Mining

2. Text Clustering

3. Create a Corpus from a Data Frame

4. Obtain Data from PubMed