machine learning - Row-wise frequency of every word in URL in R
I am new to programming and need help with R for a university project. I want to create a table with the frequency of each word. The input CSV file has around 70,000 rows of data, each containing an id and the web URLs visited by that user, separated by a comma. For example:
id     urls
m7fdn  privatkunden:handys, tablets, tarife:vorteile & services:ausland & roaming,privatkunden:hilfe:mehr hilfe:ger,privatkunden:hilfe:service-themen:internet dsl & ltekonfigurieren
9ufdf  mein website:kontostand & rechnung:meinerechnung:6-monate-übersicht zu ihrer rufnummer,mein website:kontostand & rechnung:meinerechnung:kosten
09nd7  404 <https://www.website.de/ussa/login/login.ftel?errorcode=2001&name=%20&goto=https%3a%,mein website:login online user:show form:login.ftel / login),mobile,mobile:meinwebsite:kundendaten (mydata.html),mobile:meinwebsite:startseite (index.html),privatkunden:home,privatkunden:meinwebsite:login.ftel
The code below removes special characters from the URLs and gives the frequency of each word across the whole document. But I don't want the counts for the whole document at once; I want the output per row.
text <- readLines("sample.csv")
docs <- Corpus(VectorSource(text))
inspect(docs)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x, fixed = TRUE))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, ",")
docs <- tm_map(docs, toSpace, ";")
docs <- tm_map(docs, toSpace, "://")
docs <- tm_map(docs, toSpace, ":")
docs <- tm_map(docs, toSpace, "<")
docs <- tm_map(docs, toSpace, ">")
docs <- tm_map(docs, toSpace, "-")
docs <- tm_map(docs, toSpace, "_")
docs <- tm_map(docs, toSpace, "&")
docs <- tm_map(docs, toSpace, ")")
docs <- tm_map(docs, toSpace, "%")
dtm <- TermDocumentMatrix(docs)
m <- as.matrix(dtm)
v <- sort(rowSums(m), decreasing = TRUE)
d <- data.frame(word = names(v), freq = v)
The output I am getting is below:
                     word freq
mein                 mein 1451
website           website 1038
privatkunden privatkunden  898
meinwebsite   meinwebsite  479
rechnung         rechnung  474
The output I want should look like this:
id    privatkunden website hilfe rechnung kosten
m7fdn            4       7     2        7      0
9ufdf            3       1     9        3      5
09nd7            5       7     2        8      9
The table above means that id m7fdn has privatkunden 4 times in its URLs, hilfe 2 times, and so on. The table is just a sample and does not contain exact word counts; the real table can be very long, with as many columns as there are words. Please help me get this output. Once I have this table I will apply machine learning.
I think there are two points to mention here:
1) Reading in the data:
text <- readLines("sample.csv")
gives a vector, with text[1] being the full first line of data, text[2] being the full second line, and so on. What you need to pass to VectorSource is just the urls column, not the whole line. Either use read.table, or e.g. do this:
require(tidyr)
text <- readLines("1.txt")
text <- data.frame(a = text[-1]) %>%
  separate(a, c("id", "urls"), sep = 6)
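If you prefer not to depend on tidyr, the same split can be done in base R. This is a sketch under the assumption that the id is always the text before the first space on each line (which holds for the sample shown, where ids are 5 characters):

```r
text <- readLines("1.txt")
text <- text[-1]                    # drop the header line
id   <- sub(" .*$", "", text)      # everything before the first space
urls <- sub("^[^ ]+ ", "", text)   # everything after the first space
text <- data.frame(id = id, urls = urls, stringsAsFactors = FALSE)
```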
2) Using the data in tm
Make your urls a corpus with:
docs <- Corpus(VectorSource(text$urls))
names(docs) <- text$id
Now do your tm_map transformations, and at the end do:
dtm <- DocumentTermMatrix(docs)
and there you go:
> as.matrix(dtm[1:3, 1:5])
       Terms
Docs    (index.html (mydata.html 404 übersicht ausland
  m7fdn           0            0   0         0       1
  9ufdf           0            0   0         1       0
  09nd7           1            1   1         0       0
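Putting the two points together, here is a self-contained sketch of the whole pipeline, from raw rows to the id-by-word table the question asks for. It assumes the tm package is installed, and it uses a small inline stand-in for the question's data rather than the real sample.csv; gsub is called with fixed = TRUE so that characters like ")" are treated literally rather than as regex metacharacters:

```r
library(tm)

# toy stand-in for the question's data, so the sketch runs on its own
text <- data.frame(
  id   = c("m7fdn", "9ufdf"),
  urls = c("privatkunden:hilfe,privatkunden:home",
           "mein website:rechnung:kosten"),
  stringsAsFactors = FALSE
)

docs <- Corpus(VectorSource(text$urls))
names(docs) <- text$id

# replace each separator character with a space, literally (fixed = TRUE)
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x, fixed = TRUE))
for (ch in c("/", "@", ",", ";", ":", "<", ">", "-", "_", "&", ")", "%"))
  docs <- tm_map(docs, toSpace, ch)

# one row per id, one column per word, cells are per-row frequencies
dtm <- DocumentTermMatrix(docs)
m   <- as.matrix(dtm)
result <- data.frame(id = text$id, m, check.names = FALSE, row.names = NULL)
result
```

The resulting data frame has one row per id and one numeric column per word, which is the shape the question's desired table shows and a reasonable input for downstream machine learning.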