hadoop - aggregate ordered rows in hive table -


i have table in hive 4 columns this:

row_id| user_id|product_id| duration 1       1       product1    3 2       1       product1    1 3       1       product2    6 4       1       product3    2 5       1       product1    4 6       1       product4    3 7       1       product4    5 8       1       product4    7 9       2       product4    3 10      2       product4    6 

i want aggregate rows of same product each user, sum duration , count clicks if consequent in order

row_id| user_id|product_id |duration_per_product |clicks_per_product 1       1       product1    4                       2 2       1       product2    6                       1 3       1       product3    2                       1 4       1       product1    4                       1 5       1       product4    15                      3 6       2       product4    9                       2 

any ideas how in hive 1.1.0?

group doesn't work because don't want group products if consequent , have tried case,lag , lead didn't work!

thanks!

first off, want in loop, hive not suitable these kind of problems.

that being said, here approach should work:

suppose our dataset

1       1       product1    3 2       1       product1    1 3       1       product2    6 4       1       product1    4 

identify starter rows: 1,3,4 can done doing left join on id=id+1 , seeing whether user , product match.

join onto these starters user , product:

1 1 1 2 1 4 3 3 4 1 4 2 4 4 

filter out things in wrong order (starter after row), remaining are:

1 1 1 2 1 4 3 3 4 4 

group find maximum valid starter each row, remaining be:

1 1  1 2 3 3 4 4 

now join reattach relevant dimensions

1 1 3 1 2 1 3 3 6 4 4 4 

now can results grouping on starter id.

1 4 3 6 4 4 

of course can choose use join attach name of product.

1 product1 4 3 product2 6 4 product1 4 

and all!


Comments

Popular posts from this blog

java - Jasper subreport showing only one entry from the JSON data source when embedded in the Title band -

mapreduce - Resource manager does not transit to active state from standby -

serialization - Convert Any type in scala to Array[Byte] and back -