malaytextr: An R package to process Malay text data. It offers a number of functions/datasets for analyzing and working with text data in the Malay language.
Install the latest version of this package by entering the following in R:
install.packages("malaytextr")
Or you can install the development version from GitHub with:
# install.packages("devtools")
devtools::install_github("zahiernasrudin/malaytextr")
There is a data frame of Malay root words that can be used as a dictionary:
malayrootwords
# A tibble: 4,365 x 2
`Col Word` `Root Word`
<chr> <chr>
1 ad ada
2 ak aku
3 akn akan
4 ank anak
5 ap apa
6 awl awal
7 bg bagi
8 bkn bukan
9 blm belum
10 bnjr banjir
# ... with 4,355 more rows
stem_malay() first looks up root words in a dictionary, for which the malayrootwords data frame can be used, and then removes the “extra suffix”, the “prefix” and lastly the “suffix”.
To stem the word “banyaknya”, call stem_malay() as follows. It returns a data frame with the word “banyaknya” and the stemmed word “banyak”:
Note: ‘Root Word’ is now returned instead of ‘root_word’
stem_malay(word = "banyaknya", dictionary = malayrootwords)
'Root Word' is now returned instead of 'root_word'
   Col Word Root Word
1 banyaknya    banyak
To stem words in a data frame:
x <- data.frame(text = c("banyaknya","sangat","terkedu", "pengetahuan"))
stem_malay(word = x,
dictionary = malayrootwords,
col_feature1 = "text")
'Root Word' is now returned instead of 'root_word'
     Col Word Root Word
1   banyaknya    banyak
2      sangat    sangat
3     terkedu      kedu
4 pengetahuan      tahu
remove_url() will remove all URLs found in a string:
x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")
remove_url(x)
[1] "test " "another one to try"
There is a data frame of Malay stop words:
malaystopwords
# A tibble: 512 x 1
stopwords
<chr>
1 ada
2 sampai
3 sana
4 itu
5 sangat
6 saya
7 jadi
8 se
9 agak
10 jangan
# ... with 502 more rows
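As a minimal sketch of how this list might be used, the snippet below drops stop words from a character vector of tokens using base R only. The `tokens` vector is made up for illustration; the only thing taken from the package is the `stopwords` column shown above.

```r
library(malaytextr)

# Hypothetical tokens, e.g. the result of splitting a sentence on spaces
tokens <- c("saya", "sangat", "suka", "banjir")

# Keep only the tokens that do not appear in the stop word list
kept <- tokens[!tokens %in% malaystopwords$stopwords]
kept
```

Matching here is exact and case-sensitive, so it probably makes sense to lower-case the text before filtering.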
This lexicon includes words that have been labelled as positive or negative:
sentiment_general
# A tibble: 1,424 × 2
Word Sentiment
<chr> <chr>
1 aduan Negative
2 agresif Negative
3 amaran Negative
4 anarki Negative
5 ancaman Negative
6 aneh Negative
7 antagonis Negative
8 azab Negative
9 babi Negative
10 bahaya Negative
# … with 1,414 more rows
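One possible way to use the lexicon is to join it onto a data frame of tokens and count the labels. This is only a sketch with base R's merge(); the `tokens` data frame is invented for the example, and the column names `Word` and `Sentiment` are the ones printed above.

```r
library(malaytextr)

# Hypothetical tokens to be labelled
tokens <- data.frame(Word = c("amaran", "bahaya", "rumah"))

# Left join: tokens that are missing from the lexicon get NA as Sentiment
scored <- merge(tokens, sentiment_general, by = "Word", all.x = TRUE)

# Count how many tokens carry each label (NA = not in the lexicon)
table(scored$Sentiment, useNA = "ifany")
```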
This dataset is a development version that aims to provide a standardized version of Malay words. It is designed to standardize words that have multiple variations or spellings:
normalized
# A tibble: 153 × 2
`Col Word` `Normalized Word`
<chr> <chr>
1 ad ada
2 ak aku
3 akn akan
4 ank anak
5 ap apa
6 awl awal
7 bg bagi
8 bkn bukan
9 blm belum
10 bnjr banjir
# … with 143 more rows
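A lookup against this table could be sketched as below, again with base R only. The `words` vector is hypothetical; words that have no entry in `Col Word` are left unchanged.

```r
library(malaytextr)

# Hypothetical input containing informal spellings
words <- c("ak", "blm", "makan")

# Position of each word in the lookup column (NA when there is no entry)
idx <- match(words, normalized$`Col Word`)

# Replace matched words with their normalized form, keep the rest as they are
cleaned <- ifelse(is.na(idx), words, normalized$`Normalized Word`[idx])
cleaned
```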
To report a bug, please file an issue on GitHub.
MIT License