python - pandas 'chunked' save to_csv of pivoted table on very large dataframes
My original file is too large to pivot in memory:
import numpy as np
import pandas as pd

tbl = pd.read_csv('tbl_sale_items.csv', sep=';', dtype={'saleid': np.str, 'upc': np.str})
tbl.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18570726 entries, 0 to 18570725
Data columns (total 2 columns):
saleid    object
upc       object
dtypes: object(2)
memory usage: 283.4+ MB
There are around 40,000 unique UPCs and over a million unique saleids.

The file looks like:
                               saleid             upc
0  155_02127453_20090616_135212_0021  02317639000000
1  155_02127453_20090616_135212_0021  00000000000888
2  155_01605733_20090616_135221_0016  00264850000000
3  155_01072401_20090616_135224_0010  02316877000000
4  155_01072401_20090616_135224_0010  05051969277205
Each row is one item (upc) bought by one customer transaction (saleid).
I can pivot the table correctly using pd.crosstab, producing:
                                    02317639000000  00000000000888  00264850000000  02316877000000
155_02127453_20090616_135212_0021               1               1               0               0
155_01605733_20090616_135221_0016               0               0               1               0
155_01072401_20090616_135224_0010               0               0               0               0
So the columns are the unique UPCs and the rows are the unique saleids. A minimal sketch of that pivot on the small sample follows below (variable names are mine).
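import pandas as pd

# Hypothetical reconstruction of three of the sample rows shown above
sample = pd.DataFrame({
    'saleid': ['155_02127453_20090616_135212_0021',
               '155_02127453_20090616_135212_0021',
               '155_01605733_20090616_135221_0016'],
    'upc':    ['02317639000000', '00000000000888', '00264850000000'],
}, dtype=str)

# One row per saleid, one column per UPC, cells are occurrence counts
print(pd.crosstab(sample.saleid, sample.upc))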
It's not possible to pivot such a big file in memory (not even with the 128 GB I have), so I try to read and save it in chunks:
chunksize = 1000000
f = 0
for chunk in pd.read_csv('tbl_sale_items.csv', sep=';',
                         dtype={'saleid': np.str, 'upc': np.str},
                         chunksize=chunksize):
    print(f)
    t = pd.crosstab(chunk.saleid, chunk.upc)
    t.head(3)
    t.to_csv('tbl_sales_index_converted_' + str(f) + '.csv.bz2',
             header=True, sep=';', compression='bz2')
    f = f + 1
But! Done this way, each resulting pivoted table only has as columns the unique UPCs that appear in that chunk, around 25,000, not the full set of around 40,000 columns it should have. I need a consistent number and order of columns across all chunks, of course.
How can I read, pivot and write the CSV in chunks while keeping a unified column structure across the resulting CSVs?
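One possible approach (a sketch, not a tested solution): make a cheap first pass over the file to collect the full set of unique UPCs, then reindex each chunk's crosstab to that fixed column list before writing. The variable names below are mine; the file names match the question.

import pandas as pd

chunksize = 1000000

# First pass: collect every unique UPC so all output chunks
# can share the same column set and order.
all_upcs = set()
for chunk in pd.read_csv('tbl_sale_items.csv', sep=';',
                         dtype={'upc': str}, usecols=['upc'],
                         chunksize=chunksize):
    all_upcs.update(chunk.upc.unique())
all_upcs = sorted(all_upcs)          # fixed, consistent column order

# Second pass: pivot each chunk, then reindex to the full column
# list, filling missing UPCs with 0, so every CSV has identical columns.
f = 0
for chunk in pd.read_csv('tbl_sale_items.csv', sep=';',
                         dtype={'saleid': str, 'upc': str},
                         chunksize=chunksize):
    t = pd.crosstab(chunk.saleid, chunk.upc)
    t = t.reindex(columns=all_upcs, fill_value=0)
    t.to_csv('tbl_sales_index_converted_' + str(f) + '.csv.bz2',
             header=True, sep=';', compression='bz2')
    f += 1

Note that a saleid whose rows straddle a chunk boundary would still end up split across two output files; that caveat already exists in the original chunked code as well.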