Grebennikov Roman Findify
scala.io / 2016
The resemblance of any opinion, recommendation
or comment made during this presentation to performance
tuning advice is merely coincidental
|
|
Scala & code hot spots:
Basics:
Real life:
val start = System.currentTimeMillis()
for (i <- 0 to 1000) yield doSomeStuff()
val perf = (System.currentTimeMillis() - start) / 1000.0
Avoid all the traps:
|
Harness for implementing & running [micro] benchmarks.
sbt-jmh plugin for easy scala integration:
package io.scala
import org.openjdk.jmh.annotations._
class Demo {
@Benchmark
def helloWorld = 42
}
trait Base
class Foo extends Base
class Bar extends Base
class Baz extends Base
// which class does implement Base?
def select(value:Base) = ???
@Benchmark
def measurePatMat = input match {
case _:Foo => 1
case _:Bar => 2
case _:Baz => 3
case _ => 4
}
@Benchmark
def measureIf = if (input.isInstanceOf[Foo])
1
else if (input.isInstanceOf[Bar])
2
else if (input.isInstanceOf[Baz])
3
else
4
Benchmark (classType) Mode Cnt Score Error Units
ScalaTypeMatch.measureIf foo avgt 100 3.491 ± 0.025 ns/op
ScalaTypeMatch.measureIf bar avgt 100 4.065 ± 0.016 ns/op
ScalaTypeMatch.measureIf baz avgt 100 4.167 ± 0.022 ns/op
ScalaTypeMatch.measurePatternMatch foo avgt 100 3.484 ± 0.020 ns/op
ScalaTypeMatch.measurePatternMatch bar avgt 100 4.058 ± 0.008 ns/op
ScalaTypeMatch.measurePatternMatch baz avgt 100 4.168 ± 0.015 ns/op
Benchmark (classType) Mode Cnt Score Error Units
ScalaTypeMatch.measureIf foo avgt 100 3.491 ± 0.025 ns/op
ScalaTypeMatch.measureIf bar avgt 100 4.065 ± 0.016 ns/op
ScalaTypeMatch.measureIf baz avgt 100 4.167 ± 0.022 ns/op
ScalaTypeMatch.measurePatternMatch foo avgt 100 3.484 ± 0.020 ns/op
ScalaTypeMatch.measurePatternMatch bar avgt 100 4.058 ± 0.008 ns/op
ScalaTypeMatch.measurePatternMatch baz avgt 100 4.168 ± 0.015 ns/op
Is it really the same?
I want my scala back!
mov 0x10(%r8),%r10d ;*getfield someClass // someClass = Baz
mov 0x8(%r12,%r10,8),%ecx ;*instanceof
cmp $0xf8019288,%ecx ;{metadata('ru/jugvrn/ScalaTypeMatch$Foo')}
je 0x00007f3e0d222734 ;*ifeq
cmp $0xf8019306,%ecx ;{metadata('ru/jugvrn/ScalaTypeMatch$Bar')}
je 0x00007f3e0d222761 ;*ifeq
cmp $0xf8019248,%ecx ;{metadata('ru/jugvrn/ScalaTypeMatch$Baz')}
jne 0x00007f3e0d22281d ;*instanceof
...
callq 0x00007f3e0d046020 ; *invokevirtual consume
mov 0x10(%r8),%r11d ;*getfield someClass
mov 0x8(%r12,%r11,8),%edx ;*instanceof
cmp $0xf8019288,%edx ;{metadata('ru/jugvrn/ScalaTypeMatch$Foo')}
je 0x00007faa4522ac3b ;*ifeq
cmp $0xf8019306,%edx ;{metadata('ru/jugvrn/ScalaTypeMatch$Bar')}
je 0x00007faa4522ac71 ;*ifeq
cmp $0xf8019248,%edx ;{metadata('ru/jugvrn/ScalaTypeMatch$Baz')}
jne 0x00007faa4522ad31 ;*instanceof
...
callq 0x00007faa45046020 ; *invokevirtual consume
pattern matching ~ if-else*
@Setup def setup() = {
someString = Some("hello")
nullableString = "hello"
}
@Benchmark def measureMatchOption() = someString match {
case Some(str) => str
case _ => "default value"
}
@Benchmark def measureIfNull() = if (nullableString != null)
nullableString
else
"default value"
Benchmark Mode Cnt Score Error Units
ScalaOptionMatch.measureIfNull avgt 100 3.677 ± 0.035 ns/op
ScalaOptionMatch.measureMatchOption avgt 100 4.153 ± 0.066 ns/op
mov 0x10(%r10),%r11d ;*getfield nullableString
test %r11d,%r11d
je 0x00007fb929230422 ;*ifnull
lea (%r12,%r11,8),%rdx ;*getfield nullableString
...
callq 0x00007fb929046020 ;*invokevirtual consume
mov 0xc(%r10),%r11d ;*getfield someString
mov 0x8(%r12,%r11,8),%r8d ; implicit exception: dispatches to ...
cmp $0xf80191cc,%r8d ; {metadata('scala/Some')}
jne 0x00007f7be1233711
lea (%r12,%r11,8),%r10 ;*instanceof
mov 0xc(%r10),%r11d ;*getfield value
mov 0x8(%r12,%r11,8),%r10d ; implicit exception: dispatches to ...
cmp $0xf80002da,%r10d ; {metadata('java/lang/String')}
jne 0x00007f7be1233725
...
callq 0x00007f7be1046020 ;*invokevirtual consume
Pattern matching may have overhead :(
The overhead is small :)
0 1 1 2 3 5 8 13 21 ...
@tailrec
private def fibonacciTailRec(n: Int, a:Int, b:Int):Int = n match {
case 0 => a
case _ => fibonacciTailRec(n-1, b, a+b)
}
def fibonacciRec(n:Int):Int = n match {
case 0 => a
case _ => fibonacciRec(n-1, b, a+b)
}
def fibonacciLoop(n:Int):Int = {
var a: Int = 0
var b: Int = 1
var i: Int = 0
while (i < n) {
val sum = a + b
a = b
b = sum
i += 1
}
a
}
Benchmark (N) Mode Cnt Score Error Units
ScalaFibonacci.measureLoop 10 avgt 10 6.510 ± 0.177 ns/op
ScalaFibonacci.measureLoop 20 avgt 10 10.431 ± 0.446 ns/op
ScalaFibonacci.measureLoop 40 avgt 10 20.051 ± 0.026 ns/op
ScalaFibonacci.measureRec 10 avgt 10 18.733 ± 0.049 ns/op
ScalaFibonacci.measureRec 20 avgt 10 35.517 ± 0.162 ns/op
ScalaFibonacci.measureRec 40 avgt 10 77.468 ± 0.774 ns/op
ScalaFibonacci.measureTailRec 10 avgt 10 12.477 ± 0.289 ns/op
ScalaFibonacci.measureTailRec 20 avgt 10 9.660 ± 0.190 ns/op
ScalaFibonacci.measureTailRec 40 avgt 10 20.165 ± 2.081 ns/op
Welcome to Scala 2.11.8
(Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions for evaluation. Or try :help.
scala> :javap -c -p ru.jugvrn.ScalaFibonacci
Compiled from "ScalaFibonacci.scala"
public class ru.jugvrn.ScalaFibonacci {
private int N;
...
0: iload_1 // Load first function argument, N
1: istore 5
3: iload 5
5: tableswitch {
0: 37 // if N == 0, goto :37
default: 24 // else goto :24
}
24: iload_1 // get arg1 (stack: N)
25: iconst_1 // push 1 (stack: 1, N)
26: isub // subtract (stack: N-1)
27: iload_3 // get arg3 (stack: b, N-1)
28: iload_2 // get arg2 (stack: a, b, N-1)
29: iload_3 // get arg3 (stack: b, a, b, N-1)
30: iadd // add 2 vars (stack: a+b, b, N-1)
31: istore_3 // set arg3=a+b (stack: b, N-1)
32: istore_2 // set arg2=b (stack: N-1)
33: istore_1 // set arg1=N-1 (stack: )
34: goto 0 // jump to the function start
37: iload_2
38: ireturn // return arg2=b
0: iload_1 // Load first function argument, N
1: istore 4
3: iload 4
5: tableswitch {
0: 38 // if N == 0, goto :38
default: 24 // else goto :24
}
24: aload_0 // get this reference (stack: this)
25: iload_1 // get arg1 (stack: N, this)
26: iconst_1 // push 1 (stack: 1, N, this)
27: isub // subtract (stack: N-1, this)
28: iload_3 // get arg3 (stack: b, N-1, this)
29: iload_2 // get arg2 (stack: a, b, N-1, this)
30: iload_3 // get arg3 (stack: b, a, b, N-1, this)
31: iadd // add 2 vars from the stack (stack: a+b, b, N-1, this)
32: invokevirtual #40 // call fibonacciRec:(III)I (stack: )
35: goto 39
38: iload_2
39: ireturn // return arg2=b
....[Hottest Region 1].................................................
C2, level 4, ru.jugvrn.ScalaFibonacci::fibonacciRec, version 462 (62 bytes)
0x00007f89f51fea20: mov 0x8(%rsi),%r10d ; note the real function address
0x00007f89f51fea24: shl $0x3,%r10
0x00007f89f51fea28: cmp %r10,%rax
0x00007f89f51fea2b: jne 0x00007f89f5045e20 ; {runtime_call}
0x00007f89f51fea31: xchg %ax,%ax
0x00007f89f51fea34: nopl 0x0(%rax,%rax,1)
0x00007f89f51fea3c: xchg %ax,%ax
[Verified Entry Point]
0x00007f89f51fea40: mov %eax,-0x14000(%rsp)
... skip ...
0x00007f89f51fea73: callq 0x00007f89f5046020 ; jump straight to JVM guts!
; OopMap{off=88}
;*invokevirtual fibonacciRec
; {optimized virtual_call}
Goto: simple unconditional jump, one x86 instruction.
Invokevirtual: a lot of overhead happening under the hood.
def fibonacciTCOUnroll(n: Int, a: Int, b: Int): Int = {
if (n == 0) {
a
} else if (n>16) {
val f1 = a + b
val f2 = f1 + b
val f3 = f2 + f1
val f4 = f3 + f2
val f5 = f4 + f3
// skip f6..f15
val f16 = f13 + f14
fibonacciTCOUnroll(n - 16, f15, f16)
} else {
fibonacciTCOUnroll(n - 1, b, a+b)
}
}
in: [1, 2, 3]
out: 1^2 + 2^2 + 3^2 = 14
val array:Array[Int] = ???
@Benchmark
def baseline:Long = {
var i = 0
var sum = 0L
while (i < SIZE) {
sum += array(i) * array(i)
i += 1
}
sum
}
val list:List[Int] = ???
@Benchmark
def foldList:Long = {
list.foldLeft(0L)( (sum, next) => sum + next*next)
}
Fold performance is ____________:
Benchmark (SIZE) Mode Cnt Score Error Units
baseline 100 avgt 100 71.695 ± 0.524 ns/op
baseline 1000 avgt 100 655.157 ± 2.248 ns/op
baseline 10000 avgt 100 6457.211 ± 17.542 ns/op
foldList 100 avgt 100 525.201 ± 5.903 ns/op
foldList 1000 avgt 100 4880.612 ± 23.996 ns/op
foldList 10000 avgt 100 47098.707 ± 67.334 ns/op
jmh:run -prof perf ru.jugvrn.ScalaListSquares.*
9931.655974 task-clock (msec) # 0.475 CPUs utilized
932 context-switches # 0.094 K/sec
46 cpu-migrations # 0.005 K/sec
113 page-faults # 0.011 K/sec
25,735,792,483 cycles # 2.591 GHz
9,267,196,267 stalled-cycles-frontend # 36.01% frontend cycles idle
66,270,818,101 instructions # 2.58 insns per cycle
# 0.14 stalled cycles per insn
1,876,355,486 branches # 188.927 M/sec
2,411,709 branch-misses # 0.13% of all branches
14,960,422,185 L1-dcache-loads # 1506.337 M/sec
492,379,799 L1-dcache-load-misses # 3.29% of all L1-dcache hits
62,251 LLC-loads # 0.006 M/sec
399,928 L1-icache-load-misses # 0.040 M/sec
7,623,879,953 dTLB-loads # 767.634 M/sec
13,381 dTLB-load-misses # 0.00% of all dTLB cache hits
15,045 iTLB-loads # 0.002 M/sec
3,078 iTLB-load-misses # 20.46% of all iTLB cache hits
465,275,969 L1-dcache-prefetch-misses # 46.848 M/sec
9988.074145 task-clock (msec) # 0.477 CPUs utilized
2,230 context-switches # 0.223 K/sec
386 cpu-migrations # 0.039 K/sec
208 page-faults # 0.021 K/sec
24,191,015,522 cycles # 2.422 GHz
5,917,795,429 stalled-cycles-frontend # 24.46% frontend cycles idle
78,929,694,484 instructions # 3.26 insns per cycle
# 0.07 stalled cycles per insn
13,743,825,372 branches # 1376.024 M/sec
2,356,556 branch-misses # 0.02% of all branches
26,411,359,322 L1-dcache-loads # 2644.289 M/sec
1,235,559,432 L1-dcache-load-misses # 4.68% of all L1-dcache hits
151,011,070 LLC-loads # 15.119 M/sec
2,217,386 L1-icache-load-misses # 0.222 M/sec
24,726,569,508 dTLB-loads # 2475.609 M/sec
127,828 dTLB-load-misses # 0.00% of all dTLB cache hits
655,836 iTLB-loads # 0.066 M/sec
121,926 iTLB-load-misses # 18.59% of all iTLB cache hits
1,168,908,448 L1-dcache-prefetch-misses # 117.030 M/sec
val array:Array[Int] = ???
@Benchmark
def foldArray:Long = {
array.foldLeft(0L)( (sum, next) => sum + next*next)
}
Benchmark (SIZE) Mode Cnt Score Error Units
ScalaListSquares.baseline 100 avgt 100 71.695 ± 0.524 ns/op
ScalaListSquares.baseline 1000 avgt 100 655.157 ± 2.248 ns/op
ScalaListSquares.baseline 10000 avgt 100 6457.211 ± 17.542 ns/op
ScalaListSquares.foldList 100 avgt 100 525.201 ± 5.903 ns/op
ScalaListSquares.foldList 1000 avgt 100 4880.612 ± 23.996 ns/op
ScalaListSquares.foldList 10000 avgt 100 47098.707 ± 67.334 ns/op
ScalaListSquares.foldArray 100 avgt 100 470.091 ± 1.249 ns/op
ScalaListSquares.foldArray 1000 avgt 100 6192.707 ± 12.944 ns/op
ScalaListSquares.foldArray 10000 avgt 100 60182.719 ± 215.496 ns/op
Lightweight low-overhead JMX-based GC profiling
jmh:run -prof gc ru.jugvrn.ScalaListSquares.foldArray
foldArray 88434.926 ± 4464.565 ns/op
foldArray:·gc.alloc.rate 1682.633 ± 81.873 MB/sec
foldArray:·gc.alloc.rate.norm 155936.207 ± 0.497 B/op
foldArray:·gc.churn.PS_Eden_Space 1683.088 ± 130.033 MB/sec
foldArray:·gc.churn.PS_Eden_Space.norm 155924.367 ± 7109.615 B/op
foldArray:·gc.churn.PS_Survivor_Space 0.078 ± 0.068 MB/sec
foldArray:·gc.churn.PS_Survivor_Space.norm 7.232 ± 6.315 B/op
foldArray:·gc.count 160.000 counts
foldArray:·gc.time 92.000 ms
def baseline:Long = {
var i = 0
var sum = 0L // <- primitive long
while (i < SIZE) {
sum += array(i) * array(i)
i += 1
}
sum
}
def foldArray:Long = {
array.foldLeft(0L)( (sum, next) => sum + next*next) // <- boxed Long
}
Creating a new object on each iteration
It's time for 2.12!
Benchmark (SIZE) Mode Cnt Score Error Units
ScalaListSquares.baseline 100 avgt 100 71.695 ± 0.524 ns/op
ScalaListSquares.baseline 1000 avgt 100 655.157 ± 2.248 ns/op
ScalaListSquares.baseline 10000 avgt 100 6457.211 ± 17.542 ns/op
ScalaListSquares.foldArray 100 avgt 100 470.091 ± 1.249 ns/op
ScalaListSquares.foldArray 1000 avgt 100 6192.707 ± 12.944 ns/op
ScalaListSquares.foldArray 10000 avgt 100 60182.719 ± 215.496 ns/op
ScalaListSquares.foldArray12 100 avgt 100 482.969 ± 1.975 ns/op
ScalaListSquares.foldArray12 1000 avgt 100 7114.175 ± 19.649 ns/op
ScalaListSquares.foldArray12 10000 avgt 100 71247.456 ± 311.178 ns/op
def foldArray(): Long = scala.Long.unbox(
scala.this.Predef.intArrayOps(ScalaListSquares.this.array())
.foldLeft(
scala.Long.box(0L),
{ (new <$anon: Function2>(ScalaListSquares.this): Function2) }
)
);
final class anonfun$foldArray$1 extends scala.runtime.AbstractFunction2$mcJJI$sp with Serializable {
final def apply(part: Long, next: Int): Long =
anonfun$foldArray$1.this.apply$mcJJI$sp(part, next);
def apply$mcJJI$sp(part: Long, next: Int): Long =
part.+(next.*(next));
final def apply(v1: Object, v2: Object): Object =
scala.Long.box( // here it is!
anonfun$foldArray$1.this.apply(
scala.Long.unbox(v1), // again!
scala.Int.unbox(v2) // and again!
)
);
def <init>($outer: ru.jugvrn.ScalaListSquares): <$anon: Function2> =
anonfun$foldArray$1.super.();
}
def foldArray(): Long = scala.Long.unbox(
new collection.mutable.ArrayOps$ofInt(
scala.Predef.intArrayOps(ScalaListSquares.this.array())
).foldLeft(
scala.Long.box(0L),
{ ((part: Long, next: Int) =>
ScalaListSquares.this.$anonfun$foldArray$1(part, next)) }
)
);
final def $anonfun$foldArray$1(part: Long, next: Int): Long =
part.+(next.*(next)); // Where is box/unbox?
2.11, java7:
|
2.12, java8:
|
package scala.runtime.java8;
@FunctionalInterface
public interface JFunction2$mcDII$sp
extends scala.Function2, java.io.Serializable {
double apply$mcDII$sp(int v1, int v2);
default Object apply(Object v1, Object v2) {
return scala.runtime.BoxesRunTime.boxToDouble( // found it!
apply$mcDII$sp(
scala.runtime.BoxesRunTime.unboxToInt(v1),
scala.runtime.BoxesRunTime.unboxToInt(v2)
)
);
}
}
Why foldArray@2.12 is slower than foldArray@2.11?
....[Hottest Methods (after inlining)]...................................
96.29% 97.82% C2, level 4 r.j.g.ScalaListSquares_foldArray_jmhTest
0.37% 0.19% libjvm.so StringTable::unlink_or_oops_do
0.34% 0.08% libjvm.so SpinPause
0.16% 0.04% libjvm.so ParallelTaskTerminator::offer_termination
0.14% 0.01% libjvm.so PSScavengeKlassClosure::do_klass
0.09% 0.05% libc-2.23.so vfprintf
....[Hottest Methods (after inlining)]...................................
89.76% 91.56% C2, level 4 r.j.g.ScalaListSquares_foldArray_jmhTest
4.71% 4.70% C2, level 4 s.c.m.ArrayOps$ofLong::foldLeft
0.44% 0.45% C2, level 4 r.j.ScalaListSquares::foldArray
0.26% 0.01% libjvm.so SpinPause
0.20% 0.29% libjvm.so StringTable::unlink_or_oops_do
0.17% 0.10% libjvm.so fileStream::write
0.14% 0.13% libc-2.23.so vfprintf
Scala can be slow:
Scala can be fast:
Benchmarks: github.com/shuttie/scala-perf-talk
Slides: dfdx.me/talks/scala_is_slow