caching - Why does this recursive C++ function have such a bad cache behavior? -


let t rooted binary tree such every internal node has 2 children. nodes of tree stored in array, let call treearray following preorder layout.

so example if tree have: enter image description here

then treearray contain following node objects:

7, 3, 1, 0, 2, 6, 12, 9, 8, 11, 13

a node in tree struct of kind:

struct tree_node{      int id; //id of node, randomly generated     int numchildren; //number of children, 2 leafs it's 0      int pos; //position in treearray node stored     int lpos; //position of left child     int rpos; //position of right child      tree_node(){             id = -1;             pos = lpos = rpos = -1;             numchildren = 0;        }  }; 

now suppose want function returns sum of ids in tree. sounds trivial, have use loop iterates on treearray , accumulates found ids. however, interested in understanding cache behavior of following implementation:

void testcache1(int cur){       //find positions of left , right children      int lpos = treearray[cur].lpos;      int rpos = treearray[cur].rpos;       //if there no children @ leaf update r , return       if(treearray[cur].numchildren == 0){         r += treearray[cur].id;         return;      }       //otherwise in internal node, update r , recurse      //first left subtree , right subtree       r += treearray[cur].id;       testcache1(lpos);      testcache1(rpos);  } 

to test cache behavior have following experiment:

r = 0; //r global variable int main(int argc, char* argv[]){      for(int i=0;i<100;i++) {         r = 0;         testcache1(0);     }      cout<<r<<endl;     return 0; } 

for random tree 5 million leafs, perf stat -b -e cache-misses,cache-references,instructions ./run_tests 111.txt prints following:

 performance counter stats './run_tests 111.txt':     469,511,047      cache-misses              #   89.379 % of cache refs        525,301,814      cache-references                                             20,715,360,185      instructions                 11.214075268 seconds time elapsed 

in beginning thought maybe because of way generate tree, exclude including in question, when run sudo perf record -e cache-misses ./run_tests 111.txt received following output:

enter image description here

as can see of cache misses come function. can not understand why case. values of cur sequential, first access position 0 of treearray, position 1, 2, 3 etc.

to add more doubt understanding of happening, have following function finds same summation:

void testcache4(int index){       if(index == treearray.size) return;       r += treearray[index].id;       testcache4(index+1);  } 

testcache4 accesses elements of treearray in same way, cache behavior better.

output perf stat -b -e cache-misses,cache-references,instructions ./run_tests 11.txt:

 performance counter stats './run_tests 111.txt':     396,941,872      cache-misses              #   54.293 % of cache refs        731,109,661      cache-references                                             11,547,097,924      instructions                  4.306576556 seconds time elapsed 

in output sudo perf record -e cache-misses ./run_tests 111.txt function not there:

enter image description here

i apologize long post feel lost. thank in advance.

edit:

here entire test file, parsers , required. assumed tree available inside text file given argument. compile typing g++ -o3 -std=c++11 file.cpp , run typing ./executable tree.txt. tree using can found here (don't open, click save us).

#include <iostream> #include <fstream> #define billion  1000000000ll  using namespace std;  /*  *  * timing functions  *  */  timespec startt, endt;  void starttimer(){     clock_gettime(clock_process_cputime_id, &startt); }  double endtimer(){     clock_gettime(clock_process_cputime_id, &endt);     return endt.tv_sec * billion + endt.tv_nsec - (startt.tv_sec * billion + startt.tv_nsec); }  /*  *  * tree node  *  */  //this struct used creating first tree after reading external file, need left , child pointers  struct tree_node_temp{      int id; //id of node, randomly generated     int numchildren; //number of children, 2 leafs it's 0     int size; //size of subtree rooted @ current node     tree_node_temp *leftchild;     tree_node_temp *rightchild;      tree_node_temp(){         id = -1;         size = 1;         leftchild = nullptr;         rightchild = nullptr;         numchildren = 0;     }  };  struct tree_node{      int id; //id of node, randomly generated     int numchildren; //number of children, 2 leafs it's 0     int size; //size of subtree rooted @ current node      int pos; //position in treearray node stored     int lpos; //position of left child     int rpos; //position of right child      tree_node(){         id = -1;         pos = lpos = rpos = -1;         numchildren = 0;     }  };  /*  *  * tree parser. input file containing tree in newick format.  *  */  string treenewickstr; //string storing newick format of tree read file int treecurstrindex; //index current position in while reading newick string int treenumleafs; //number of leafs in current tree tree_node ** treearrayreferences; //stack of references free node objects tree_node *treearray; //array of node objects int treestackreferencestop; //the top index references stack int curpos; //used find pos,lpos , rpos when creating pre order layout tree   //helper function readnewick tree_node_temp* readnewickhelper() {      int i;     if(treecurstrindex == treenewickstr.size())         return nullptr;      tree_node_temp * leftchild;     tree_node_temp * rightchild;      if(treenewickstr[treecurstrindex] == '('){         //create left child         treecurstrindex++;         leftchild = readnewickhelper();     }      if(treenewickstr[treecurstrindex] == ','){         //create right child         treecurstrindex++;         rightchild = readnewickhelper();     }      if(treenewickstr[treecurstrindex] == ')' || treenewickstr[treecurstrindex] == ';'){         treecurstrindex++;         tree_node_temp * cur = new tree_node_temp();         cur->numchildren = 2;         cur->leftchild = leftchild;         cur->rightchild = rightchild;         cur->size = 1 + leftchild->size + rightchild->size;         return cur;     }      //we read label, keep reading until read "," ")" or "(" (we assume newick string has right format)     = 0;     char treelabel[20]; //buffer used label     while(treenewickstr[treecurstrindex]!=',' && treenewickstr[treecurstrindex]!='(' && treenewickstr[treecurstrindex]!=')'){         treelabel[i] = treenewickstr[treecurstrindex];         treecurstrindex++;         i++;     }      treelabel[i] = '\0';     tree_node_temp * cur = new tree_node_temp();     cur->numchildren = 0;     cur->id = atoi(treelabel)-1;     treenumleafs++;      return cur; }  //create pre order tree, curroot in first call points root of first tree given parser void treeinit(tree_node_temp * curroot){      tree_node * curfinalroot = treearrayreferences[curpos];      curfinalroot->pos = curpos;      if(curroot->numchildren == 0) {         curfinalroot->id = curroot->id;         return;     }      //add left child     tree_node * cnode = treearrayreferences[treestackreferencestop];     curfinalroot->lpos = curpos + 1;     curpos = curpos + 1;     treestackreferencestop++;     cnode->id = curroot->leftchild->id;     treeinit(curroot->leftchild);      //add right child     curfinalroot->rpos = curpos + 1;     curpos = curpos + 1;     cnode = treearrayreferences[treestackreferencestop];     treestackreferencestop++;     cnode->id = curroot->rightchild->id;     treeinit(curroot->rightchild);      curfinalroot->id = curroot->id;     curfinalroot->numchildren = 2;     curfinalroot->size = curroot->size;  }  //the ids of leafs deteremined newick file, internal nodes incrementally give id determined dfs traversal void updateinternalnodeids(int cur){      tree_node* curnode = treearrayreferences[cur];      if(curnode->numchildren == 0){         return;     }     curnode->id = treenumleafs++;     updateinternalnodeids(curnode->lpos);     updateinternalnodeids(curnode->rpos);  }  //frees memory of first tree generated parser void treefreememory(tree_node_temp* cur){      if(cur->numchildren == 0){         delete cur;         return;     }     treefreememory(cur->leftchild);     treefreememory(cur->rightchild);      delete cur;  }  //reads tree stored in "file" under newick format , creates in main memory. output (what function returns) pointer root of tree. //this tree scattered anywhere in memory.  tree_node* readnewick(string& file){      treecurstrindex = -1;     treenewickstr = "";     treenumleafs = 0;      ifstream treefin;      treefin.open(file, ios_base::in);     //read newick format of tree , store in string     treefin>>treenewickstr;     //initialize index reading string     treecurstrindex = 0;     //create tree in main memory     tree_node_temp* root = readnewickhelper();      //store tree in array following pre order layout     treearray = new tree_node[root->size];     treearrayreferences = new tree_node*[root->size];     int i;     for(i=0;i<root->size;i++)         treearrayreferences[i] = &treearray[i];     treestackreferencestop = 0;      tree_node* finalroot = treearrayreferences[treestackreferencestop];     curpos = treestackreferencestop;     treestackreferencestop++;     finalroot->id = root->id;     treeinit(root);      //update internal node ids (the leaf ids defined ids stored in newick string)     updateinternalnodeids(0);     //close file     treefin.close();      //free memory of initial tree     treefreememory(root);     //return pre order tree     return finalroot;  }  /*  * experiments  *  */  int r; tree_node* t;  void testcache1(int cur){      int lpos = treearray[cur].lpos;     int rpos = treearray[cur].rpos;      if(treearray[cur].numchildren == 0){         r += treearray[cur].id;         return;     }      r += treearray[cur].id;      testcache1(lpos);     testcache1(rpos);  }   void testcache4(int index){      if(index == t->size) return;      r += treearray[index].id;      testcache4(index+1);  }   int main(int argc, char* argv[]){      string tnewick = argv[1];     t = readnewick(tnewick);     double tt;      starttimer();     for(int i=0;i<100;i++) {         r = 0;         testcache4(0);     }     tt = endtimer();     cout<<r<<endl;     cout<<tt/billion<<endl;      starttimer();     for(int i=0;i<100;i++) {         r = 0;         testcache1(0);     }     tt = endtimer();     cout<<r<<endl;     cout<<tt/billion<<endl;      delete[] treearray;     delete[] treearrayreferences;      return 0; } 

edit2:

i run profiling tests valgrind. instructions might overhead here, don't understand why. example in experiments above perf, 1 version gives around 20 billion instructions , other 11 billion. that's 9 billion difference.

with -o3 enabled following:

enter image description here

so function calls expensive in testcache1 , cost nothing in testcache4? amount of function calls in both cases should same...

i guess problem misunderstanding of cache-references count.

as explained in this answer cache-references on intel cpus number of references last-level cache. therefore memory references served l1 cache not counted. intel 64 , ia-32 architectures developer's manual states loads l1 prefetcher counted.

if compare absolute number of cache-misses, see equal both functions. used balanced tree test, removed pos size of 16 fitting nicely cache lines , got following numbers:

testcache4:

843.628.131      l1-dcache-loads                                               (56,83%) 193.006.858      l1-dcache-load-misses     #   22,73% of l1-dcache hits    (57,31%) 326.698.621      cache-references                                              (57,07%) 188.435.203      cache-misses              #   57,679 % of cache refs      (56,76%) 

testcache1:

3.519.968.253    l1-dcache-loads                                               (57,17%) 193.664.806      l1-dcache-load-misses     #    5,50% of l1-dcache hits    (57,24%) 256.638.490      cache-references                                              (57,12%) 188.007.927      cache-misses              #   73,258 % of cache refs      (57,23%) 

and if manually disable hardware prefetchers:

testcache4:

846.124.474      l1-dcache-loads                                               (57,22%) 192.495.450      l1-dcache-load-misses     #   22,75% of l1-dcache hits    (57,31%) 193.699.811      cache-references                                              (57,03%) 185.445.753      cache-misses              #   95,739 % of cache refs      (57,17%) 

testcache1:

3.534.308.118    l1-dcache-loads                                               (57,16%) 193.595.962      l1-dcache-load-misses     #    5,48% of l1-dcache hits    (57,18%) 193.639.498      cache-references                                              (57,12%) 185.120.733      cache-misses              #   95,601 % of cache refs      (57,15%) 

as can see differences gone now. there additional cache-references events due prefetcher , actual reference being counted twice.

actually if count memory references testcache1 has lower total cache miss ratio, because each tree_node referenced 4 times instead of ones, each data member of tree_node lies on same cache line , there 1 out of 4 misses.

for testcache4 can see l1d load miss ratio close 25% expect if sizeof(tree_node) == 16 , cache lines 64 bytes.

also compiler (at least gcc -o2) applies tail recursion optimization both functions eliminating recursion of testcache4, while making testcache1 one-way recursive. therefore testcache1 has many additional cache references stack frames testcache4 not have.

you can result without prefetcher using valgrind, more reliable in output. not simulate properties of cpu caches though.

regarding edits: metioned gcc aplies tail recursion optimization, there no calls left in testcache4 , of course recursion , additional memory loads in testcache1 have significant instruction overhead compared simple load/add loop left in testcache4.


Comments

Popular posts from this blog

serialization - Convert Any type in scala to Array[Byte] and back -

matplotlib support failed in PyCharm on OSX -

python - Matplotlib: TypeError: 'AxesSubplot' object is not callable -