malaytextr

library(malaytextr)

Examples

Malay root words

There is a data frame of Malay root words that can be used as a dictionary:


head(malayrootwords)
#>      Col Word Root Word
#> 1 pengabadian     abadi
#> 2  pengabdian      abdi
#> 3 pengacaraan     acara
#> 4 pengadangan     adang
#> 5  pengadilan      adil
#> 6   pengairan       air

Stem Malay words

stem_malay() looks up root words in a dictionary (the malayrootwords data frame can be used for this), then removes the “extra suffix”, the “prefix” and lastly the “suffix”.

To stem the word “banyaknya”, pass it to stem_malay(). This returns a data frame containing the word “banyaknya” and the stemmed word “banyak”:


stem_malay(word = "banyaknya", dictionary = malayrootwords)
#> 'Root Word' is now returned instead of 'root_word'
#>    Col Word Root Word
#> 1 banyaknya    banyak

To stem words in a data frame:

  1. Specify the data frame
  2. Specify the dictionary
  3. Specify the column that needs to be stemmed

x <- data.frame(text = c("banyaknya","sangat","terkedu", "pengetahuan"))

stem_malay(word = x, 
          dictionary = malayrootwords, 
          col_feature1 = "text")
#> 'Root Word' is now returned instead of 'root_word'
#>      Col Word Root Word
#> 1   banyaknya    banyak
#> 2      sangat    sangat
#> 3     terkedu      kedu
#> 4 pengetahuan      tahu
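
The order of operations described above (dictionary lookup, then “extra suffix”, “prefix” and “suffix” removal) can be illustrated with a small base R sketch. This is not the stem_malay() implementation: the function name stem_sketch, the toy dictionary and all three affix lists are assumptions made for illustration, and treating the affix rules as a fallback for words missing from the dictionary is also an assumption.

```r
# Toy dictionary with the same column names as malayrootwords (illustrative rows only)
toy_dictionary <- data.frame(
  `Col Word`  = c("pengadilan", "pengairan"),
  `Root Word` = c("adil", "air"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: dictionary lookup first, then naive affix stripping
stem_sketch <- function(word, dictionary) {
  hit <- match(word, dictionary[["Col Word"]])
  if (!is.na(hit)) {
    return(dictionary[["Root Word"]][hit])
  }
  stem <- sub("(nya|lah|kah)$", "", word)              # assumed "extra suffix" list
  stem <- sub("^(meng|peng|ber|ter|di|ke)", "", stem)  # assumed prefix list
  stem <- sub("(kan|an|i)$", "", stem)                 # assumed suffix list
  stem
}

stem_sketch("pengadilan", toy_dictionary)  # found in the toy dictionary
stem_sketch("banyaknya", toy_dictionary)   # falls back to affix stripping
stem_sketch("terkedu", toy_dictionary)     # falls back to affix stripping
```

Such naive rules over-strip or under-strip many words, which is why a curated dictionary such as malayrootwords is the first step.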

Remove URLs

remove_url() removes all URLs found in a string:


x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")

remove_url(x)
#> [1] "test "               "another one  to try"
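
For readers curious how this kind of cleaning can be done, here is a minimal base R sketch using a regular expression. The helper name strip_urls and the pattern are assumptions for illustration and are not taken from the package source; remove_url() may match URLs differently.

```r
# Hypothetical helper: drop anything that starts with http:// or https://
# and runs up to the next whitespace character
strip_urls <- function(x) {
  gsub("https?://[^[:space:]]+", "", x)
}

x <- c("test https://t.co/fkQC2dXwnc", "another one https://www.google.com/ to try")
cleaned <- strip_urls(x)
cleaned
```

Like the output shown above, this leaves the whitespace around the removed URL in place; trimws() can be applied afterwards if trailing spaces are unwanted.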

Malay stop words

There is a data frame of Malay stop words:


head(malaystopwords)
#> # A tibble: 6 × 1
#>   stopwords
#>   <chr>    
#> 1 ada      
#> 2 sampai   
#> 3 sana     
#> 4 itu      
#> 5 sangat   
#> 6 saya
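
A typical use of a stop word list is to filter tokens before further analysis. The sketch below uses only base R and a toy data frame built from the six rows printed above, so it runs without the package; the helper name remove_stopwords and the whitespace tokenisation are assumptions for illustration.

```r
# Toy stop word table with the same column name as malaystopwords
toy_stopwords <- data.frame(
  stopwords = c("ada", "sampai", "sana", "itu", "sangat", "saya"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: split on whitespace, drop stop words, paste back together
remove_stopwords <- function(text, stopwords) {
  tokens <- strsplit(tolower(text), "[[:space:]]+")
  vapply(
    tokens,
    function(tok) paste(tok[!tok %in% stopwords], collapse = " "),
    character(1)
  )
}

filtered <- remove_stopwords(
  c("saya sangat suka buku itu", "ada kucing di sana"),
  toy_stopwords$stopwords
)
filtered
```

With the package loaded, malaystopwords$stopwords could be passed in place of the toy vector.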

Sentiment lexicon

This lexicon contains words labelled as positive or negative. It is useful for tasks such as sentiment analysis, which involves determining the overall sentiment expressed in a piece of text. To use the lexicon, process the text and check each word against the lexicon to determine its sentiment. Note that this sentiment lexicon was built from a general corpus sourced from news articles:


head(sentiment_general)
#> # A tibble: 6 × 2
#>   Word    Sentiment
#>   <chr>   <chr>    
#> 1 aduan   Negative 
#> 2 agresif Negative 
#> 3 amaran  Negative 
#> 4 anarki  Negative 
#> 5 ancaman Negative 
#> 6 aneh    Negative
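
The "check each word against the lexicon" step can be sketched in base R as follows. The lexicon here is a toy table with the same column names as sentiment_general: the negative words come from the rows printed above, while the two positive entries ("baik", "gembira") are assumptions added for illustration and are not claimed to appear in the package data. The helper name score_sentiment and the scoring rule (positive count minus negative count) are likewise hypothetical.

```r
# Toy lexicon; the positive rows are illustrative assumptions
toy_lexicon <- data.frame(
  Word      = c("aduan", "amaran", "ancaman", "baik", "gembira"),
  Sentiment = c("Negative", "Negative", "Negative", "Positive", "Positive"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: count positive matches minus negative matches
score_sentiment <- function(text, lexicon) {
  tokens <- strsplit(tolower(text), "[^[:alpha:]]+")[[1]]
  tokens <- tokens[nzchar(tokens)]
  labels <- lexicon$Sentiment[match(tokens, lexicon$Word)]  # NA for unknown words
  sum(labels == "Positive", na.rm = TRUE) - sum(labels == "Negative", na.rm = TRUE)
}

score_sentiment("Amaran dan ancaman itu", toy_lexicon)
score_sentiment("Hari yang baik dan gembira", toy_lexicon)
```

A negative score suggests mostly negative wording and a positive score the opposite; words absent from the lexicon are ignored.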