hadoop - Hive 'limit' in subquery executes after full query -


i’m testing rather taxing rlike function in hive query. figured i’d test against subset first before applying tb+ of data. full query is:

create table proxy_parsed_clean select   a.*,   case      when domainname rlike '.*:443$' 1     else 0   end used_https proxy_parsed a; 

since there's data, wrote query (ostensibly) operate against subset:

select   case      when a.domainname rlike '.*:443$' 1     else 0   end used_https (select domainname proxy_parsed limit 10) a; 

however, seems take just long first query. rather applying outer query subset, seems apply case statement entire dataset , then limit. running explain confirmed suspicions (notice limit clause moved end of query):

> explain select case when a.domainname rlike '.*:443$' 1 else 0 end (select domainname proxy_parsed limit 10) a;  +---------------------------------------------------------------------------------------------------------------------+--+ |                                                       explain                                                       | +---------------------------------------------------------------------------------------------------------------------+--+ | stage dependencies:                                                                                                 | |   stage-1 root stage                                                                                           | |   stage-0 depends on stages: stage-1                                                                                | |                                                                                                                     | | stage plans:                                                                                                        | |   stage: stage-1                                                                                                    | |     map reduce                                                                                                      | |       map operator tree:                                                                                            | |           tablescan                                                                                                 | |             alias: proxy_parsed                                                                                     | |             statistics: num rows: 157462377267 data size: 6298495090688 basic stats: complete column stats: none    | |             select operator                                                                                         | |               expressions: domainname (type: varchar(40))                                                           | |               outputcolumnnames: _col0                                                                              | |               statistics: num rows: 157462377267 data size: 6298495090688 basic stats: complete column stats: none  | |               limit                                                                                                 | |                 number of rows: 10                                                                                  | |                 statistics: num rows: 10 data size: 400 basic stats: complete column stats: none                    | |                 reduce output operator                                                                              | |                   sort order:                                                                                       | |                   statistics: num rows: 10 data size: 400 basic stats: complete column stats: none                  | |                   topn hash memory usage: 0.1                                                                       | |                   value expressions: _col0 (type: varchar(40))                                                      | |       reduce operator tree:                                                                                         | |         select operator                                                                                             | |           expressions: value._col0 (type: varchar(40))                                                              | |           outputcolumnnames: _col0                                                                                  | |           statistics: num rows: 10 data size: 400 basic stats: complete column stats: none                          | |           limit                                                                                                     | |             number of rows: 10                                                                                      | |             statistics: num rows: 10 data size: 400 basic stats: complete column stats: none                        | |             select operator                                                                                         | |               expressions: case when ((_col0 rlike '.*:443$')) (1) else (0) end (type: int)                    | |               outputcolumnnames: _col0                                                                              | |               statistics: num rows: 10 data size: 400 basic stats: complete column stats: none                      | |               file output operator                                                                                  | |                 compressed: false                                                                                   | |                 statistics: num rows: 10 data size: 400 basic stats: complete column stats: none                    | |                 table:                                                                                              | |                     input format: org.apache.hadoop.mapred.textinputformat                                          | |                     output format: org.apache.hadoop.hive.ql.io.hiveignorekeytextoutputformat                       | |                     serde: org.apache.hadoop.hive.serde2.lazy.lazysimpleserde                                       | |                                                                                                                     | |   stage: stage-0                                                                                                    | |     fetch operator                                                                                                  | |       limit: -1                                                                                                     | |       processor tree:                                                                                               | |         listsink                                                                                                    | |                                                                                                                     | +---------------------------------------------------------------------------------------------------------------------+--+ 

if run select * proxy_parsed limit 10;, query executes blazingly quickly. can explain a), why query not executing on subset, , b) how make so?

i create temp table, select 10 records , execute query, seems sloppy. plus, i'd have temp table clean after that. behavior seems hive bug, i.e., limit behavior not behaving seems should in case.

the limit not applied after case, before , during processing case - gets applied twice. although coincidence, in case 2 applications of limit correspond inner , outer query, respectively.

in query plan can see map phase selects single column ("expressions: domainname") , reduces number of results 10 (from 157462377267). corresponds inner query. reduce phase applies case ("expressions: case when ((_col0 rlike '.*:443$')) (1) else (0) end") , reduces number of rows 10, can see expected number of input rows 10 in phase. reduce phase corresponds outer query.

the reason why limit applied twice distributed execution. since @ end of map phase want minimize amount of data sent reducers, makes sense apply limit here. after limit reached, mapper won't process more of input. not enough however, since potentially each mapper may produce 10 results, adding ten times number of mappers, thereby reduce phase has apply limit again. because of mechanism, in general should apply limit directly instead of creating subquery sole purpose.

to summarize, in interpretation query plan looks - limit processed in places should. answers question why limit gets applied before case. sadly, not explain why takes time though.

update: please see ozw1z5rd's answer why query slow in spite of using limit. explains using subquery causes mapreduce job launched, while direct query avoids that.


Comments

Popular posts from this blog

serialization - Convert Any type in scala to Array[Byte] and back -

matplotlib support failed in PyCharm on OSX -

python - Matplotlib: TypeError: 'AxesSubplot' object is not callable -