Read JSON files from Spark streaming into H2O


I've got a cluster on AWS where I've installed H2O, Sparkling Water, and H2O Flow for machine learning purposes on lots of data.

Now, these files come in JSON format from a streaming job. Let's say they are placed in S3 in a folder called streamed-data.

From Spark, using a SparkSession, I read them in one go to create a DataFrame (this is Python, but that's not important):

spark = SparkSession.builder.getOrCreate()
df = spark.read.json('path/streamed-data')

This reads them all and creates a DataFrame for me, which is handy.
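One detail worth knowing: by default Spark's JSON reader expects JSON Lines (one complete JSON object per line), not a single pretty-printed array; multi-line documents need the multiLine option. A small plain-Python sketch of the format (the field names are made up for illustration):

```python
import json

# Hypothetical records standing in for the streamed events in streamed-data
records = [
    {"id": 1, "value": 3.14},
    {"id": 2, "value": 2.71},
]

# JSON Lines: one JSON object per line, which is what spark.read.json
# parses out of the box
lines = "\n".join(json.dumps(r) for r in records)
print(lines)

# Round-trip check: each line parses back to the original dict
parsed = [json.loads(line) for line in lines.splitlines()]
assert parsed == records
```

If the streaming job emits files in this shape, they can be read directly; otherwise they would need the multiLine option or a preprocessing step.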

Now, I'd like to leverage the capabilities of H2O, hence I've installed it on the cluster, along with the other software mentioned.

Looking at H2O Flow, the problem is the lack of a JSON parser, so I'm wondering whether I can import the files into H2O in the first place, or if there's a way around the problem.

When running Sparkling Water you can convert an RDD/DataFrame/Dataset into an H2O frame quite easily. In Scala (Python is similar), something like this should work:

val dataDF = spark.read.json("path/streamed-data")
val h2oContext = H2OContext.getOrCreate(spark)
import h2oContext.implicits._
val h2oFrame = h2oContext.asH2OFrame(dataDF, "my-frame-name")

From then on you can use the frame at code level and/or in the Flow UI.
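Since the question is in Python, a rough PySparkling equivalent might look like the sketch below. It is untested here and assumes Sparkling Water's Python package (pysparkling) is installed on the cluster; depending on the Sparkling Water version, H2OContext.getOrCreate may also take the Spark session as an argument.

```python
from pyspark.sql import SparkSession
from pysparkling import H2OContext

# Assumes a cluster with Sparkling Water (pysparkling) available
spark = SparkSession.builder.getOrCreate()
hc = H2OContext.getOrCreate()

# Read the streamed JSON files into a DataFrame, then hand it to H2O;
# the resulting frame is visible from H2O Flow as well
data_df = spark.read.json("path/streamed-data")
h2o_frame = hc.asH2OFrame(data_df, "my-frame-name")
```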

You can find more examples here for Python and here for Scala.

