python - pandas 'chunked' save to_csv of pivoted table on very large dataframes -


my original file pivot large memory:

tbl = pd.read_csv('tbl_sale_items.csv',sep=';',dtype={'saleid': np.str, 'upc': np.str}) tbl.info() <class 'pandas.core.frame.dataframe'> rangeindex: 18570726 entries, 0 18570725 data columns (total 2 columns): saleid    object upc       object dtypes: object(2) memory usage: 283.4+ mb 

the unique upcs around 40000 , saleid row around million+.

the file looks like:

    saleid                              upc 0   155_02127453_20090616_135212_0021   02317639000000 1   155_02127453_20090616_135212_0021   00000000000888 2   155_01605733_20090616_135221_0016   00264850000000 3   155_01072401_20090616_135224_0010   02316877000000 4   155_01072401_20090616_135224_0010   05051969277205 

it represents 1 customer (saleid) , items he/she got (upc of item)

i correctly pivot table using solution.

                                   02317639000000 00000000000888 00264850000000 02316877000000 155_02127453_20090616_135212_0021               1              1              0              0 155_01605733_20090616_135221_0016               0              0              1              0 155_01072401_20090616_135224_0010               0              0              0              0 

so, columns unique upcs , rows unique saleids.

its not possible pivot such big file in memory (even 128gb have) try read , save in chunks:

chunksize = 1000000 f = 0 chunk in pd.read_csv('tbl_sale_items.csv',sep=';',dtype={'saleid': np.str, 'upc': np.str}, chunksize=chunksize):     print(f)     t = pd.crosstab(chunk.saleid, chunk.upc)     t.head(3)     t.to_csv('tbl_sales_index_converted_' + str(f) + '.csv.bz2',header=true,sep=';',compression='bz2')     f = f+1 

but! doing resulting pivoted tables have unique upcs in columns exist in chunk, around 25000, not of columns should around 40000. need consistent number of columns , order of columns of course.

how read, pivot , write csv in chunks keep unified column structure on resulted csvs?


Comments

Popular posts from this blog

serialization - Convert Any type in scala to Array[Byte] and back -

matplotlib support failed in PyCharm on OSX -

python - Matplotlib: TypeError: 'AxesSubplot' object is not callable -