Can't reproduce CPU cache miss


Good day!

I'm reading the wonderful article "What Every Programmer Should Know About Memory". Right now I'm trying to figure out how CPU caches work and to reproduce the experiment with cache misses. The aim is to reproduce the performance degradation as the amount of accessed data rises (figure 3.4). I wrote a little program that should reproduce the degradation, but it doesn't: the performance degradation appears only after I allocate more than 4 GB of memory, and I don't understand why. I think it should appear when 12 or maybe 100 MB are allocated. Maybe the program is wrong and I'm missing something? I use an

Intel Core i7-2630QM, L1: 256 KB, L2: 1 MB, L3: 6 MB
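For reference, the cache sizes above translate into int64 working-set sizes like this (a quick sketch; the assumption that the listed sizes are totals across all four cores is mine):

```go
package main

import "fmt"

func main() {
	// Cache sizes as listed above for the i7-2630QM
	// (assumed to be totals across all four cores).
	caches := []struct {
		name string
		kb   int
	}{
		{"L1", 256},
		{"L2", 1024},
		{"L3", 6 * 1024},
	}
	for _, c := range caches {
		// Each int64 is 8 bytes, so this many elements fill the cache.
		fmt.Printf("%s: %4d KB = %6d int64 elements\n",
			c.name, c.kb, c.kb*1024/8)
	}
}
```

So a random-access working set past roughly 786432 int64s (6 MB) should fall out of the last-level cache.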

Here goes the listing.

main.go

package main

import (
	"fmt"
	"math/rand"
)

const (
	N0 = 1000
	N1 = 100000
)

func ReadInt64Time(slice []int64, idx int) int64

func main() {
	ss := make([][]int64, N0)
	for i := range ss {
		ss[i] = make([]int64, N1)
		for j := range ss[i] {
			ss[i][j] = int64(i + j)
		}
	}
	var t int64
	for i := 0; i < N0; i++ {
		for j := 0; j < N1; j++ {
			t0 := ReadInt64Time(ss[i], rand.Intn(N1))
			if t0 <= 0 {
				panic(t0)
			}
			t += t0
		}
	}
	fmt.Println("Avg time:", t/int64(N0*N1))
}

main.s

// func ReadInt64Time(slice []int64, idx int) int64
TEXT ·ReadInt64Time(SB), $0-40
	MOVQ	slice+0(FP), R8
	MOVQ	idx+24(FP), R9
	RDTSC
	SHLQ	$32, DX
	ORQ	DX, AX
	MOVQ	AX, R10
	MOVQ	(R8)(R9*8), R8	// here I'm reading memory
	RDTSC
	SHLQ	$32, DX
	ORQ	DX, AX
	SUBQ	R10, AX
	MOVQ	AX, ret+32(FP)
	RET

For those interested: I reproduced the cache-miss behaviour, though the performance degradation is not as dramatic as the article describes. Here is the final benchmark listing:

main.go

package main

import (
	"fmt"
	"math/rand"
	"runtime"
	"runtime/debug"
)

func ReadInt64Time(slice []int64, idx int) int64

const count = 2 << 25

func measure(np uint) {
	n := 2 << np
	s := make([]int64, n)
	for i := range s {
		s[i] = int64(i)
	}
	t := int64(0)
	n8 := n >> 3
	for i := 0; i < count; i++ {
		// index is 64-byte aligned, since the cache line is 64 bytes
		t0 := ReadInt64Time(s, rand.Intn(n8)<<3)
		t += t0
	}
	fmt.Printf("Allocated %d KB. Avg time: %v\n",
		n/128, t/count)
}

func main() {
	debug.SetGCPercent(-1) // eliminate GC influence
	for i := uint(10); i < 27; i++ {
		measure(i)
		runtime.GC()
	}
}

main_amd64.s

// func ReadInt64Time(slice []int64, idx int) int64
TEXT ·ReadInt64Time(SB), $0-40
	MOVQ	slice+0(FP), R8
	MOVQ	idx+24(FP), R9
	RDTSC
	SHLQ	$32, DX
	ORQ	DX, AX
	MOVQ	AX, R10
	MOVQ	(R8)(R9*8), R11	// read memory
	MOVQ	$0, (R8)(R9*8)	// write memory
	RDTSC
	SHLQ	$32, DX
	ORQ	DX, AX
	SUBQ	R10, AX
	MOVQ	AX, ret+32(FP)
	RET

I disabled the garbage collector to eliminate its influence and made the index 64-byte aligned, since the processor has a 64-byte cache line.
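The alignment trick can be checked in isolation. A minimal sketch (the slice length n is just an example value I picked) confirming that `rand.Intn(n8)<<3` always lands on the first int64 of a 64-byte line:

```go
package main

import (
	"fmt"
	"math/rand"
)

func main() {
	n := 1 << 20 // example slice length in int64s
	n8 := n >> 3 // number of 64-byte cache lines in the slice
	aligned := true
	for i := 0; i < 1000; i++ {
		idx := rand.Intn(n8) << 3 // first int64 of a random cache line
		if idx*8%64 != 0 {        // byte offset within the slice
			aligned = false
		}
	}
	fmt.Println("all offsets 64-byte aligned:", aligned)
	// prints "all offsets 64-byte aligned: true"
}
```

Since idx is always a multiple of 8 and each int64 is 8 bytes, every access starts a fresh cache line and never straddles two.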

the benchmark result is:

Allocated 16 KB. Avg time: 22
Allocated 32 KB. Avg time: 22
Allocated 64 KB. Avg time: 22
Allocated 128 KB. Avg time: 22
Allocated 256 KB. Avg time: 22
Allocated 512 KB. Avg time: 23
Allocated 1024 KB. Avg time: 23
Allocated 2048 KB. Avg time: 24
Allocated 4096 KB. Avg time: 25
Allocated 8192 KB. Avg time: 29
Allocated 16384 KB. Avg time: 31
Allocated 32768 KB. Avg time: 33
Allocated 65536 KB. Avg time: 34
Allocated 131072 KB. Avg time: 34
Allocated 262144 KB. Avg time: 35
Allocated 524288 KB. Avg time: 35
Allocated 1048576 KB. Avg time: 39

I ran the bench many times, and it gave me similar results every run. If I removed the read and write ops from the asm code, I got 22 cycles for all allocation sizes, so the time difference is the memory access time. As you can see, the first time shift is at the 512 KB allocation size. It is just 1 CPU cycle, but it's stable. The next time change is at 2 MB. At 8 MB there is a more significant change, but it is still only 4 cycles, and by then we are out of the caches entirely.
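Subtracting that 22-cycle empty-harness baseline makes the net memory-access cost explicit; a quick sketch over a few of the measurements above:

```go
package main

import "fmt"

func main() {
	const baseline = 22 // cycles measured with the read/write ops removed
	// A few (size KB, avg cycles) pairs from the benchmark output above.
	results := []struct {
		kb, cycles int64
	}{
		{512, 23},
		{2048, 24},
		{8192, 29},
		{1048576, 39},
	}
	for _, r := range results {
		fmt.Printf("%8d KB: ~%2d cycles of pure memory access\n",
			r.kb, r.cycles-baseline)
	}
}
```

This is why the shifts look so small here: the measurement overhead dominates the raw numbers.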

After these tests I see there is no dramatic cost to a cache miss. It is still significant, but the time difference is 10-15 times, not the 50-500 times I saw in the article. Maybe today's memory is faster than it was 7 years ago? That looks promising =) Maybe after the next 7 years there will be architectures with no CPU caches at all. We'll see.

EDIT: As @Leeor mentioned, the RDTSC instruction has no serializing behaviour, so the reads can be reordered by out-of-order execution. I tried the RDTSCP instruction instead:

main_amd64.s

// func ReadInt64Time(slice []int64, idx int) int64
TEXT ·ReadInt64Time(SB), $0-40
	MOVQ	slice+0(FP), R8
	MOVQ	idx+24(FP), R9
	BYTE $0x0f; BYTE $0x01; BYTE $0xf9	// RDTSCP
	SHLQ	$32, DX
	ORQ	DX, AX
	MOVQ	AX, R10
	MOVQ	(R8)(R9*8), R11	// read memory
	MOVQ	$0, (R8)(R9*8)	// write memory
	BYTE $0x0f; BYTE $0x01; BYTE $0xf9	// RDTSCP
	SHLQ	$32, DX
	ORQ	DX, AX
	SUBQ	R10, AX
	MOVQ	AX, ret+32(FP)
	RET

Here is what I got with these changes:

Allocated 16 KB. Avg time: 27
Allocated 32 KB. Avg time: 27
Allocated 64 KB. Avg time: 28
Allocated 128 KB. Avg time: 29
Allocated 256 KB. Avg time: 30
Allocated 512 KB. Avg time: 34
Allocated 1024 KB. Avg time: 42
Allocated 2048 KB. Avg time: 55
Allocated 4096 KB. Avg time: 120
Allocated 8192 KB. Avg time: 167
Allocated 16384 KB. Avg time: 173
Allocated 32768 KB. Avg time: 189
Allocated 65536 KB. Avg time: 201
Allocated 131072 KB. Avg time: 215
Allocated 262144 KB. Avg time: 224
Allocated 524288 KB. Avg time: 242
Allocated 1048576 KB. Avg time: 281

Now I see a big difference between cache and RAM access. The times are actually about 2 times lower than in the article, but that's predictable, since the memory frequency is twice as high.
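That difference can be put into a single ratio; a small sketch using two of the RDTSCP measurements above (smallest working set as a cache proxy, largest as a RAM proxy):

```go
package main

import "fmt"

func main() {
	// Avg cycles from the RDTSCP run above.
	l1 := 27.0   // 16 KB working set, fits in L1
	ram := 281.0 // 1 GB working set, mostly served from RAM
	fmt.Printf("RAM/L1 cycle ratio: ~%.1f\n", ram/l1)
	// prints "RAM/L1 cycle ratio: ~10.4"
}
```

Note the ratio still includes the RDTSCP overhead in both numbers, so the true RAM-to-L1 latency gap is somewhat larger than ~10x.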

