caching - Why does this recursive C++ function have such a bad cache behavior? -
let t
rooted binary tree such every internal node has 2 children. nodes of tree stored in array, let call treearray
following preorder layout.
then treearray
contain following node objects:
7, 3, 1, 0, 2, 6, 12, 9, 8, 11, 13
a node in tree struct of kind:
struct tree_node{ int id; //id of node, randomly generated int numchildren; //number of children, 2 leafs it's 0 int pos; //position in treearray node stored int lpos; //position of left child int rpos; //position of right child tree_node(){ id = -1; pos = lpos = rpos = -1; numchildren = 0; } };
now suppose want function returns sum of ids in tree. sounds trivial, have use loop iterates on treearray
, accumulates found ids. however, interested in understanding cache behavior of following implementation:
void testcache1(int cur){ //find positions of left , right children int lpos = treearray[cur].lpos; int rpos = treearray[cur].rpos; //if there no children @ leaf update r , return if(treearray[cur].numchildren == 0){ r += treearray[cur].id; return; } //otherwise in internal node, update r , recurse //first left subtree , right subtree r += treearray[cur].id; testcache1(lpos); testcache1(rpos); }
to test cache behavior have following experiment:
r = 0; //r global variable int main(int argc, char* argv[]){ for(int i=0;i<100;i++) { r = 0; testcache1(0); } cout<<r<<endl; return 0; }
for random tree 5 million leafs, perf stat -b -e cache-misses,cache-references,instructions ./run_tests 111.txt
prints following:
performance counter stats './run_tests 111.txt': 469,511,047 cache-misses # 89.379 % of cache refs 525,301,814 cache-references 20,715,360,185 instructions 11.214075268 seconds time elapsed
in beginning thought maybe because of way generate tree, exclude including in question, when run sudo perf record -e cache-misses ./run_tests 111.txt
received following output:
as can see of cache misses come function. can not understand why case. values of cur
sequential, first access position 0
of treearray
, position 1
, 2
, 3
etc.
to add more doubt understanding of happening, have following function finds same summation:
void testcache4(int index){ if(index == treearray.size) return; r += treearray[index].id; testcache4(index+1); }
testcache4
accesses elements of treearray
in same way, cache behavior better.
output perf stat -b -e cache-misses,cache-references,instructions ./run_tests 11.txt
:
performance counter stats './run_tests 111.txt': 396,941,872 cache-misses # 54.293 % of cache refs 731,109,661 cache-references 11,547,097,924 instructions 4.306576556 seconds time elapsed
in output sudo perf record -e cache-misses ./run_tests 111.txt
function not there:
i apologize long post feel lost. thank in advance.
edit:
here entire test file, parsers , required. assumed tree available inside text file given argument. compile typing g++ -o3 -std=c++11 file.cpp
, run typing ./executable tree.txt
. tree using can found here (don't open, click save us).
#include <iostream> #include <fstream> #define billion 1000000000ll using namespace std; /* * * timing functions * */ timespec startt, endt; void starttimer(){ clock_gettime(clock_process_cputime_id, &startt); } double endtimer(){ clock_gettime(clock_process_cputime_id, &endt); return endt.tv_sec * billion + endt.tv_nsec - (startt.tv_sec * billion + startt.tv_nsec); } /* * * tree node * */ //this struct used creating first tree after reading external file, need left , child pointers struct tree_node_temp{ int id; //id of node, randomly generated int numchildren; //number of children, 2 leafs it's 0 int size; //size of subtree rooted @ current node tree_node_temp *leftchild; tree_node_temp *rightchild; tree_node_temp(){ id = -1; size = 1; leftchild = nullptr; rightchild = nullptr; numchildren = 0; } }; struct tree_node{ int id; //id of node, randomly generated int numchildren; //number of children, 2 leafs it's 0 int size; //size of subtree rooted @ current node int pos; //position in treearray node stored int lpos; //position of left child int rpos; //position of right child tree_node(){ id = -1; pos = lpos = rpos = -1; numchildren = 0; } }; /* * * tree parser. input file containing tree in newick format. * */ string treenewickstr; //string storing newick format of tree read file int treecurstrindex; //index current position in while reading newick string int treenumleafs; //number of leafs in current tree tree_node ** treearrayreferences; //stack of references free node objects tree_node *treearray; //array of node objects int treestackreferencestop; //the top index references stack int curpos; //used find pos,lpos , rpos when creating pre order layout tree //helper function readnewick tree_node_temp* readnewickhelper() { int i; if(treecurstrindex == treenewickstr.size()) return nullptr; tree_node_temp * leftchild; tree_node_temp * rightchild; if(treenewickstr[treecurstrindex] == '('){ //create left child treecurstrindex++; leftchild = readnewickhelper(); } if(treenewickstr[treecurstrindex] == ','){ //create right child treecurstrindex++; rightchild = readnewickhelper(); } if(treenewickstr[treecurstrindex] == ')' || treenewickstr[treecurstrindex] == ';'){ treecurstrindex++; tree_node_temp * cur = new tree_node_temp(); cur->numchildren = 2; cur->leftchild = leftchild; cur->rightchild = rightchild; cur->size = 1 + leftchild->size + rightchild->size; return cur; } //we read label, keep reading until read "," ")" or "(" (we assume newick string has right format) = 0; char treelabel[20]; //buffer used label while(treenewickstr[treecurstrindex]!=',' && treenewickstr[treecurstrindex]!='(' && treenewickstr[treecurstrindex]!=')'){ treelabel[i] = treenewickstr[treecurstrindex]; treecurstrindex++; i++; } treelabel[i] = '\0'; tree_node_temp * cur = new tree_node_temp(); cur->numchildren = 0; cur->id = atoi(treelabel)-1; treenumleafs++; return cur; } //create pre order tree, curroot in first call points root of first tree given parser void treeinit(tree_node_temp * curroot){ tree_node * curfinalroot = treearrayreferences[curpos]; curfinalroot->pos = curpos; if(curroot->numchildren == 0) { curfinalroot->id = curroot->id; return; } //add left child tree_node * cnode = treearrayreferences[treestackreferencestop]; curfinalroot->lpos = curpos + 1; curpos = curpos + 1; treestackreferencestop++; cnode->id = curroot->leftchild->id; treeinit(curroot->leftchild); //add right child curfinalroot->rpos = curpos + 1; curpos = curpos + 1; cnode = treearrayreferences[treestackreferencestop]; treestackreferencestop++; cnode->id = curroot->rightchild->id; treeinit(curroot->rightchild); curfinalroot->id = curroot->id; curfinalroot->numchildren = 2; curfinalroot->size = curroot->size; } //the ids of leafs deteremined newick file, internal nodes incrementally give id determined dfs traversal void updateinternalnodeids(int cur){ tree_node* curnode = treearrayreferences[cur]; if(curnode->numchildren == 0){ return; } curnode->id = treenumleafs++; updateinternalnodeids(curnode->lpos); updateinternalnodeids(curnode->rpos); } //frees memory of first tree generated parser void treefreememory(tree_node_temp* cur){ if(cur->numchildren == 0){ delete cur; return; } treefreememory(cur->leftchild); treefreememory(cur->rightchild); delete cur; } //reads tree stored in "file" under newick format , creates in main memory. output (what function returns) pointer root of tree. //this tree scattered anywhere in memory. tree_node* readnewick(string& file){ treecurstrindex = -1; treenewickstr = ""; treenumleafs = 0; ifstream treefin; treefin.open(file, ios_base::in); //read newick format of tree , store in string treefin>>treenewickstr; //initialize index reading string treecurstrindex = 0; //create tree in main memory tree_node_temp* root = readnewickhelper(); //store tree in array following pre order layout treearray = new tree_node[root->size]; treearrayreferences = new tree_node*[root->size]; int i; for(i=0;i<root->size;i++) treearrayreferences[i] = &treearray[i]; treestackreferencestop = 0; tree_node* finalroot = treearrayreferences[treestackreferencestop]; curpos = treestackreferencestop; treestackreferencestop++; finalroot->id = root->id; treeinit(root); //update internal node ids (the leaf ids defined ids stored in newick string) updateinternalnodeids(0); //close file treefin.close(); //free memory of initial tree treefreememory(root); //return pre order tree return finalroot; } /* * experiments * */ int r; tree_node* t; void testcache1(int cur){ int lpos = treearray[cur].lpos; int rpos = treearray[cur].rpos; if(treearray[cur].numchildren == 0){ r += treearray[cur].id; return; } r += treearray[cur].id; testcache1(lpos); testcache1(rpos); } void testcache4(int index){ if(index == t->size) return; r += treearray[index].id; testcache4(index+1); } int main(int argc, char* argv[]){ string tnewick = argv[1]; t = readnewick(tnewick); double tt; starttimer(); for(int i=0;i<100;i++) { r = 0; testcache4(0); } tt = endtimer(); cout<<r<<endl; cout<<tt/billion<<endl; starttimer(); for(int i=0;i<100;i++) { r = 0; testcache1(0); } tt = endtimer(); cout<<r<<endl; cout<<tt/billion<<endl; delete[] treearray; delete[] treearrayreferences; return 0; }
edit2:
i run profiling tests valgrind. instructions might overhead here, don't understand why. example in experiments above perf, 1 version gives around 20 billion instructions , other 11 billion. that's 9 billion difference.
with -o3
enabled following:
so function calls expensive in testcache1
, cost nothing in testcache4
? amount of function calls in both cases should same...
i guess problem misunderstanding of cache-references count.
as explained in this answer cache-references on intel cpus number of references last-level cache. therefore memory references served l1 cache not counted. intel 64 , ia-32 architectures developer's manual states loads l1 prefetcher counted.
if compare absolute number of cache-misses, see equal both functions. used balanced tree test, removed pos
size of 16 fitting nicely cache lines , got following numbers:
testcache4
:
843.628.131 l1-dcache-loads (56,83%) 193.006.858 l1-dcache-load-misses # 22,73% of l1-dcache hits (57,31%) 326.698.621 cache-references (57,07%) 188.435.203 cache-misses # 57,679 % of cache refs (56,76%)
testcache1
:
3.519.968.253 l1-dcache-loads (57,17%) 193.664.806 l1-dcache-load-misses # 5,50% of l1-dcache hits (57,24%) 256.638.490 cache-references (57,12%) 188.007.927 cache-misses # 73,258 % of cache refs (57,23%)
and if manually disable hardware prefetchers:
testcache4
:
846.124.474 l1-dcache-loads (57,22%) 192.495.450 l1-dcache-load-misses # 22,75% of l1-dcache hits (57,31%) 193.699.811 cache-references (57,03%) 185.445.753 cache-misses # 95,739 % of cache refs (57,17%)
testcache1
:
3.534.308.118 l1-dcache-loads (57,16%) 193.595.962 l1-dcache-load-misses # 5,48% of l1-dcache hits (57,18%) 193.639.498 cache-references (57,12%) 185.120.733 cache-misses # 95,601 % of cache refs (57,15%)
as can see differences gone now. there additional cache-references events due prefetcher , actual reference being counted twice.
actually if count memory references testcache1
has lower total cache miss ratio, because each tree_node
referenced 4 times instead of ones, each data member of tree_node
lies on same cache line , there 1 out of 4 misses.
for testcache4
can see l1d load miss ratio close 25% expect if sizeof(tree_node) == 16
, cache lines 64 bytes.
also compiler (at least gcc -o2) applies tail recursion optimization both functions eliminating recursion of testcache4
, while making testcache1
one-way recursive. therefore testcache1
has many additional cache references stack frames testcache4
not have.
you can result without prefetcher using valgrind, more reliable in output. not simulate properties of cpu caches though.
regarding edits: metioned gcc aplies tail recursion optimization, there no calls left in testcache4
, of course recursion , additional memory loads in testcache1
have significant instruction overhead compared simple load/add loop left in testcache4
.
Comments
Post a Comment