machine learning - Row-wise frequency of every word in URL in R


I am new to programming and need help with R for a university project. I want to create a table with the frequency of each word. The input file has around 70,000 rows of data, consisting of ids and the web URLs visited by each id/user, separated by a comma in a CSV file. For example:

id                 urls
m7fdn              privatkunden:handys, tablets, tarife:vorteile & services:ausland & roaming,privatkunden:hilfe:mehr hilfe:ger,privatkunden:hilfe:service-themen:internet  dsl & ltekonfigurieren
9ufdf              mein website:kontostand & rechnung:meinerechnung:6-monate-übersicht zu ihrer rufnummer,mein website:kontostand & rechnung:meinerechnung:kosten
09nd7              404 <https://www.website.de/ussa/login/login.ftel?errorcode=2001&name=%20&goto=https%3a%,mein website:login online user:show form:login.ftel / login),mobile,mobile:meinwebsite:kundendaten (mydata.html),mobile:meinwebsite:startseite (index.html),privatkunden:home,privatkunden:meinwebsite:login.ftel

The code below removes the special characters from the URLs and gives the frequency of each word used in the whole document. I don't want the whole document at once; I want the output per row.

library(tm)

text <- readLines("sample.csv")
docs <- Corpus(VectorSource(text))
inspect(docs)

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, ",")
docs <- tm_map(docs, toSpace, ";")
docs <- tm_map(docs, toSpace, "://")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, "<")
docs <- tm_map(docs, toSpace, ">")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "_")
docs <- tm_map(docs, toSpace, "&")
docs <- tm_map(docs, toSpace, ")")
docs <- tm_map(docs, toSpace, "%")

dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)

The output I am getting is below:

                     word freq
mein                 mein 1451
website           website 1038
privatkunden privatkunden  898
meinwebsite   meinwebsite  479
rechnung         rechnung  474

The output I want should look like this:

id      privatkunden  website  hilfe  rechnung  kosten
m7fdn              4        7      2         7       0
9ufdf              3        1      9         3       5
09nd7              5        7      2         8       9

The above table means that the id m7fdn has privatkunden 4 times in its URLs, hilfe 2 times, and so on. The table above is just a sample and does not show the exact counts. The table can be very large, depending on how many words there are. Please help me get this output. Once I have this table, I have to apply machine learning.

I think there are two points to mention here:

1) Reading in the data:

text <- readLines("sample.csv")

gives a vector with text[1] being the full first line of data, text[2] the full second line, and so on. What you need for VectorSource is one column containing only the URLs. Either use read.table or, e.g., do this:

require(tidyr)
text <- readLines("1.txt")
text <- data.frame(a = text[-1]) %>% separate(a, c("id", "urls"), sep = 6)
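If the ids are not always exactly 5 characters long, splitting at a fixed position (sep = 6) breaks. A minimal alternative sketch, assuming the id and the URL string are separated only by whitespace in the raw file (that separator is my assumption), splits each line at its first run of spaces instead:

require(tidyr)
text <- readLines("1.txt")
# split at the first whitespace run; extra = "merge" keeps everything after
# that first gap (including further spaces and commas) in the urls column
text <- data.frame(a = text[-1], stringsAsFactors = FALSE) %>%
  separate(a, c("id", "urls"), sep = "\\s+", extra = "merge")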

2) Using the data in tm: make the URLs a corpus with:

docs <- Corpus(VectorSource(text$urls))
names(docs) <- text$id

Now do your tm_map transformations... At the end, do:

dtm <- DocumentTermMatrix(docs)

And there you go:

> as.matrix(dtm[1:3, 1:5])
       Terms
Docs    (index.html (mydata.html 404 übersicht ausland
  m7fdn           0            0   0         0       1
  9ufdf           0            0   0         1       0
  09nd7           1            1   1         0       0
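If you want this as a plain data frame with one row per id, like the table in the question, a small follow-up sketch (the names m and result are just illustrative):

m <- as.matrix(dtm)   # rows = ids (docs), columns = words (terms)
result <- data.frame(id = rownames(m), m, check.names = FALSE, row.names = NULL)
head(result[, 1:6])   # id plus the first five word-count columns

check.names = FALSE keeps term names like "(index.html" unchanged instead of converting them to syntactic R names.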
