Is scala code slow?

Grebennikov Roman Findify

scala.io / 2016

Disclaimer

The resemblance of any opinion, recommendation
or comment made during this presentation to performance
tuning advice is merely coincidental

Intro: cool things in scala

FP style
type system
pattern matching

laziness
collections
etc...

Intro: hot cool things

Scala & code hot spots:

More readable
Less predictable:
- create new objects
- do unexpected computations
- surprize the JVM

Agenda

Basics:

Performance measuring is hard
What JVM & scalac do with your code
JMH & how to measure ~~wrong~~ right

Agenda

Real life:

Pattern matching
Recursion
Collections

Is benchmarking really hard?


val start = System.currentTimeMillis()
for (i <- 0 to 1000) yield doSomeStuff()
val perf = (System.currentTimeMillis() - start) / 1000.0

loop can be eliminated by optimizer
doSomeStuff() can be interpreted, not compiled
the whole loop may take less than 1ms
... and many more

HotSpot JVM is smarter than you

Hit the compile button

C0 ⟶ C1 ⟶ C2
- more agressive: code is faster
- more time for warm-up
Pitfalls everywhere

Need reliable result?

Avoid all the traps:

time measurement
dead code elimination
constant folding
loop unrolling
... and many more!

Meet JMH

Harness for implementing & running [micro] benchmarks.

sbt-jmh plugin for easy scala integration:


package io.scala

import org.openjdk.jmh.annotations._

class Demo {
	@Benchmark
	def helloWorld = 42
}

live!

JVM performance testing: deep dive

Alexey Shilipev talks & blog posts:
JMH code samples

Pattern matching

Problem

trait Base
class Foo extends Base
class Bar extends Base
class Baz extends Base

// which class does implement Base?
def select(value:Base) = ???

If vs pattern matching

@Benchmark
def measurePatMat = input match {
	case _:Foo => 1
	case _:Bar => 2
	case _:Baz => 3
	case _ => 4
}

@Benchmark
def measureIf = if (input.isInstanceOf[Foo])
  1
else if (input.isInstanceOf[Bar])
  2
else if (input.isInstanceOf[Baz])
  3
else
  4

Results

Benchmark                   (classType)  Mode  Cnt  Score   Error  Units
ScalaTypeMatch.measureIf            foo  avgt  100  3.491 ± 0.025  ns/op
ScalaTypeMatch.measureIf            bar  avgt  100  4.065 ± 0.016  ns/op
ScalaTypeMatch.measureIf            baz  avgt  100  4.167 ± 0.022  ns/op
ScalaTypeMatch.measurePatternMatch  foo  avgt  100  3.484 ± 0.020  ns/op
ScalaTypeMatch.measurePatternMatch  bar  avgt  100  4.058 ± 0.008  ns/op
ScalaTypeMatch.measurePatternMatch  baz  avgt  100  4.168 ± 0.015  ns/op

Results

Benchmark                   (classType)  Mode  Cnt  Score   Error  Units
ScalaTypeMatch.measureIf            foo  avgt  100  3.491 ± 0.025  ns/op
ScalaTypeMatch.measureIf            bar  avgt  100  4.065 ± 0.016  ns/op
ScalaTypeMatch.measureIf            baz  avgt  100  4.167 ± 0.022  ns/op
ScalaTypeMatch.measurePatternMatch  foo  avgt  100  3.484 ± 0.020  ns/op
ScalaTypeMatch.measurePatternMatch  bar  avgt  100  4.058 ± 0.008  ns/op
ScalaTypeMatch.measurePatternMatch  baz  avgt  100  4.168 ± 0.015  ns/op

Is it really the same?

Machine code to the rescue

Shows all the truth
Only the brave can read it

Want some x86_64?

-XX:+PrintAssembly: dump all the compiled methods: TL&DR
perf: Linux CPU perf counters: too low level
JMH perfasm: combines both, made for humans: OK

live!

I want my scala back!

Inside pattern matching

mov    0x10(%r8),%r10d    ;*getfield someClass // someClass = Baz
mov    0x8(%r12,%r10,8),%ecx  ;*instanceof
cmp    $0xf8019288,%ecx   ;{metadata('ru/jugvrn/ScalaTypeMatch$Foo')}
je     0x00007f3e0d222734  ;*ifeq
cmp    $0xf8019306,%ecx   ;{metadata('ru/jugvrn/ScalaTypeMatch$Bar')}
je     0x00007f3e0d222761  ;*ifeq
cmp    $0xf8019248,%ecx   ;{metadata('ru/jugvrn/ScalaTypeMatch$Baz')}
jne    0x00007f3e0d22281d  ;*instanceof
...
callq  0x00007f3e0d046020  ; *invokevirtual consume

Inside if-else

mov    0x10(%r8),%r11d    ;*getfield someClass
mov    0x8(%r12,%r11,8),%edx  ;*instanceof
cmp    $0xf8019288,%edx   ;{metadata('ru/jugvrn/ScalaTypeMatch$Foo')}
je     0x00007faa4522ac3b  ;*ifeq
cmp    $0xf8019306,%edx   ;{metadata('ru/jugvrn/ScalaTypeMatch$Bar')}
je     0x00007faa4522ac71  ;*ifeq
cmp    $0xf8019248,%edx   ;{metadata('ru/jugvrn/ScalaTypeMatch$Baz')}
jne    0x00007faa4522ad31  ;*instanceof
...
callq  0x00007faa45046020  ; *invokevirtual consume

Same x86_64 code!

pattern matching ~ if-else*

What about Options?

@Setup def setup() = {
	someString = Some("hello")
	nullableString = "hello"
}

@Benchmark def measureMatchOption() = someString match {
  case Some(str) => str
  case _ => "default value"
}

@Benchmark def measureIfNull() = if (nullableString != null)
  nullableString
else
  "default value"

Results


Benchmark                            Mode  Cnt  Score   Error  Units
ScalaOptionMatch.measureIfNull       avgt  100  3.677 ± 0.035  ns/op
ScalaOptionMatch.measureMatchOption  avgt  100  4.153 ± 0.066  ns/op

Results are close, but null check is faster
Why?

Inside nullcheck

mov    0x10(%r10),%r11d   ;*getfield nullableString
test   %r11d,%r11d
je     0x00007fb929230422  ;*ifnull
lea    (%r12,%r11,8),%rdx  ;*getfield nullableString
...
callq  0x00007fb929046020  ;*invokevirtual consume

Inside option matching

mov    0xc(%r10),%r11d    ;*getfield someString
mov    0x8(%r12,%r11,8),%r8d  ; implicit exception: dispatches to ...
cmp    $0xf80191cc,%r8d   ;   {metadata('scala/Some')}
jne    0x00007f7be1233711
lea    (%r12,%r11,8),%r10  ;*instanceof
mov    0xc(%r10),%r11d    ;*getfield value
mov    0x8(%r12,%r11,8),%r10d  ; implicit exception: dispatches to ...
cmp    $0xf80002da,%r10d  ;   {metadata('java/lang/String')}
jne    0x00007f7be1233725
...
callq  0x00007f7be1046020  ;*invokevirtual consume

What's the difference?

Null check: single branch
Match by Option: TWO branches
- Are we dealing with an Option?
- Is there a String inside?

nullcheck vs patmat: Conclusion

Pattern matching may have overhead :(

The overhead is small :)

Tail recursion

Tail call optimization

Compiler trick: transform recursive call into a loop
scalac can do it only for tail calls
hotspot has no native support for TCO

Meet Fibonacci

0 1 1 2 3 5 8 13 21 ...

Recursion: with and without tail

@tailrec
private def fibonacciTailRec(n: Int, a:Int, b:Int):Int = n match {
  case 0 => a
  case _ => fibonacciTailRec(n-1, b, a+b)
}

def fibonacciRec(n:Int):Int = n match {
  case 0 => a
	case _ => fibonacciRec(n-1, b, a+b)
}

Old good loop

def fibonacciLoop(n:Int):Int = {
  var a: Int = 0
  var b: Int = 1
  var i: Int = 0
	while (i < n) {
	  val sum = a + b
	  a = b
	  b = sum
	  i += 1
	}
	a
}

Results

Benchmark                      (N)  Mode  Cnt   Score   Error  Units
ScalaFibonacci.measureLoop      10  avgt   10   6.510 ± 0.177  ns/op
ScalaFibonacci.measureLoop      20  avgt   10  10.431 ± 0.446  ns/op
ScalaFibonacci.measureLoop      40  avgt   10  20.051 ± 0.026  ns/op
ScalaFibonacci.measureRec       10  avgt   10  18.733 ± 0.049  ns/op
ScalaFibonacci.measureRec       20  avgt   10  35.517 ± 0.162  ns/op
ScalaFibonacci.measureRec       40  avgt   10  77.468 ± 0.774  ns/op
ScalaFibonacci.measureTailRec   10  avgt   10  12.477 ± 0.289  ns/op
ScalaFibonacci.measureTailRec   20  avgt   10   9.660 ± 0.190  ns/op
ScalaFibonacci.measureTailRec   40  avgt   10  20.165 ± 2.081  ns/op

measureLoop is close to measureTailRec
measureRec is 2x-4x slower than measureTailRec
measureTailRec for N=10 is slower than for N=20

Why is non-TCO function so slow?

bytecode to the rescue!

JVM Bytecode is simple
javap: a tool for class file disassembly

Welcome to Scala 2.11.8
(Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_102)
Type in expressions for evaluation. Or try :help.

scala> :javap -c -p ru.jugvrn.ScalaFibonacci
Compiled from "ScalaFibonacci.scala"
public class ru.jugvrn.ScalaFibonacci {
private int N;

...

Fibonacci with TCO

       0: iload_1           // Load first function argument, N
       1: istore        5
       3: iload         5
       5: tableswitch   {
                     0: 37  // if N == 0, goto :37
               default: 24  // else goto :24
          }
      24: iload_1           // get arg1     (stack: N)
      25: iconst_1          // push 1       (stack: 1, N)
      26: isub              // subtract     (stack: N-1)
      27: iload_3           // get arg3     (stack: b, N-1)
      28: iload_2           // get arg2     (stack: a, b, N-1)
      29: iload_3           // get arg3     (stack: b, a, b, N-1)
      30: iadd              // add 2 vars   (stack: a+b, b, N-1)
      31: istore_3          // set arg3=a+b (stack: b, N-1)
      32: istore_2          // set arg2=b   (stack: N-1)
      33: istore_1          // set arg1=N-1 (stack: )
      34: goto          0   // jump to the function start
      37: iload_2
      38: ireturn           // return arg2=b

Fibonacci without TCO

       0: iload_1           // Load first function argument, N
       1: istore        4
       3: iload         4
       5: tableswitch   {
                     0: 38  // if N == 0, goto :38
               default: 24  // else goto :24
          }
      24: aload_0           // get this reference         (stack: this)
      25: iload_1           // get arg1                   (stack: N, this)
      26: iconst_1          // push 1                     (stack: 1, N, this)
      27: isub              // subtract                   (stack: N-1, this)
      28: iload_3           // get arg3                   (stack: b, N-1, this)
      29: iload_2           // get arg2                   (stack: a, b, N-1, this)
      30: iload_3           // get arg3                   (stack: b, a, b, N-1, this)
      31: iadd              // add 2 vars from the stack  (stack: a+b, b, N-1, this)
      32: invokevirtual #40 // call fibonacciRec:(III)I   (stack: )
      35: goto          39
      38: iload_2
      39: ireturn           // return arg2=b

The secret life of invokevirtual

JVM takes control back from your code
Resolves method by name
Looks up method addr in vtable
Creates stack frame
Passes arguments
Jumps back to your function code

Go deeper

....[Hottest Region 1].................................................
C2, level 4, ru.jugvrn.ScalaFibonacci::fibonacciRec, version 462 (62 bytes)

	0x00007f89f51fea20: mov    0x8(%rsi),%r10d     ; note the real function address
  0x00007f89f51fea24: shl    $0x3,%r10
  0x00007f89f51fea28: cmp    %r10,%rax
  0x00007f89f51fea2b: jne    0x00007f89f5045e20  ;   {runtime_call}
  0x00007f89f51fea31: xchg   %ax,%ax
  0x00007f89f51fea34: nopl   0x0(%rax,%rax,1)
  0x00007f89f51fea3c: xchg   %ax,%ax
    [Verified Entry Point]
  0x00007f89f51fea40: mov    %eax,-0x14000(%rsp)

		... skip ...

  0x00007f89f51fea73: callq  0x00007f89f5046020  ; jump straight to JVM guts!
	                                               ; OopMap{off=88}
                                                 ;*invokevirtual fibonacciRec
                                                 ;   {optimized virtual_call}

goto vs invokevirtual

Goto: simple unconditional jump, one x86 instruction.

Invokevirtual: a lot of overhead happening under the hood.

Why is N=10 slower than N=20?

N=10..20 tailrec performance

live!

Loop unroll

  def fibonacciTCOUnroll(n: Int, a: Int, b: Int): Int = {
	  if (n == 0) {
			a
		} else if (n>16) {
      val f1 = a + b
      val f2 = f1 + b
      val f3 = f2 + f1
      val f4 = f3 + f2
      val f5 = f4 + f3
      // skip f6..f15
      val f16 = f13 + f14
      fibonacciTCOUnroll(n - 16, f15, f16)
		} else {
      fibonacciTCOUnroll(n - 1, b, a+b)
    }
  }

Recursive conclusion

TCO is equal to loop
TCO is cool
- stop thinking with loops
- stop computing fibonacci numbers
- start writing readable recursive code

More about recursion

Collections

Scala collections

Easy way to do complex things
Cool things cannot be free: what's the overhead?

So many things to benchmark!

* Diagrams by Lokesh Jein: http://lkumarjain.blogspot.ru/2015/01/scala-collection-framework.html

The problem

Given a sequence of integer numbers
Compute sum of their squares


  in: [1, 2, 3]
  out: 1^2 + 2^2 + 3^2 = 14

Old-school solution

val array:Array[Int] = ???

@Benchmark
def baseline:Long = {
  var i = 0
  var sum = 0L
  while (i < SIZE) {
	  sum += array(i) * array(i)
	  i += 1
	}
	sum
}

FP-way

val list:List[Int] = ???

@Benchmark
def foldList:Long = {
  list.foldLeft(0L)( (sum, next) => sum + next*next)
}

Quiz

Fold performance is ____________:

the same as loop
1.5x faster than loop
2x slower than loop
7x slower than loop

Results

  Benchmark   (SIZE)  Mode  Cnt      Score     Error  Units
  baseline       100  avgt  100     71.695 ±   0.524  ns/op
	baseline      1000  avgt  100    655.157 ±   2.248  ns/op
	baseline     10000  avgt  100   6457.211 ±  17.542  ns/op
	foldList       100  avgt  100    525.201 ±   5.903  ns/op
	foldList      1000  avgt  100   4880.612 ±  23.996  ns/op
	foldList     10000  avgt  100  47098.707 ±  67.334  ns/op

List != Array

JMH perf profiler

Hardware performance counters to the rescue!


  jmh:run -prof perf ru.jugvrn.ScalaListSquares.*

Loop perf counters


    9931.655974      task-clock (msec)         #    0.475 CPUs utilized
            932      context-switches          #    0.094 K/sec
             46      cpu-migrations            #    0.005 K/sec
            113      page-faults               #    0.011 K/sec
 25,735,792,483      cycles                    #    2.591 GHz
  9,267,196,267      stalled-cycles-frontend   #   36.01% frontend cycles idle
 66,270,818,101      instructions              #    2.58  insns per cycle
                                               #    0.14  stalled cycles per insn
  1,876,355,486      branches                  #  188.927 M/sec                  
      2,411,709      branch-misses             #    0.13% of all branches
 14,960,422,185      L1-dcache-loads           # 1506.337 M/sec
    492,379,799      L1-dcache-load-misses     #    3.29% of all L1-dcache hits
         62,251      LLC-loads                 #    0.006 M/sec                  
        399,928      L1-icache-load-misses     #    0.040 M/sec
  7,623,879,953      dTLB-loads                #  767.634 M/sec
         13,381      dTLB-load-misses          #    0.00% of all dTLB cache hits
         15,045      iTLB-loads                #    0.002 M/sec
          3,078      iTLB-load-misses          #   20.46% of all iTLB cache hits
    465,275,969      L1-dcache-prefetch-misses #   46.848 M/sec

List perf counters


    9988.074145      task-clock (msec)         #    0.477 CPUs utilized
          2,230      context-switches          #    0.223 K/sec
            386      cpu-migrations            #    0.039 K/sec
            208      page-faults               #    0.021 K/sec
 24,191,015,522      cycles                    #    2.422 GHz
  5,917,795,429      stalled-cycles-frontend   #   24.46% frontend cycles idle
 78,929,694,484      instructions              #    3.26  insns per cycle
                                               #    0.07  stalled cycles per insn
 13,743,825,372      branches                  # 1376.024 M/sec                  
      2,356,556      branch-misses             #    0.02% of all branches
 26,411,359,322      L1-dcache-loads           # 2644.289 M/sec
  1,235,559,432      L1-dcache-load-misses     #    4.68% of all L1-dcache hits
    151,011,070      LLC-loads                 #   15.119 M/sec                  
      2,217,386      L1-icache-load-misses     #    0.222 M/sec
 24,726,569,508      dTLB-loads                # 2475.609 M/sec
        127,828      dTLB-load-misses          #    0.00% of all dTLB cache hits
        655,836      iTLB-loads                #    0.066 M/sec
        121,926      iTLB-load-misses          #   18.59% of all iTLB cache hits
  1,168,908,448      L1-dcache-prefetch-misses #  117.030 M/sec

List vs Array

Get rid of List!

	val array:Array[Int] = ???

	@Benchmark
	def foldArray:Long = {
	  array.foldLeft(0L)( (sum, next) => sum + next*next)
	}

Results

  Benchmark                    (SIZE)  Mode  Cnt      Score     Error  Units
	ScalaListSquares.baseline       100  avgt  100     71.695 ±   0.524  ns/op
	ScalaListSquares.baseline      1000  avgt  100    655.157 ±   2.248  ns/op
	ScalaListSquares.baseline     10000  avgt  100   6457.211 ±  17.542  ns/op
	ScalaListSquares.foldList       100  avgt  100    525.201 ±   5.903  ns/op
	ScalaListSquares.foldList      1000  avgt  100   4880.612 ±  23.996  ns/op
	ScalaListSquares.foldList     10000  avgt  100  47098.707 ±  67.334  ns/op
  ScalaListSquares.foldArray      100  avgt  100    470.091 ±   1.249  ns/op
  ScalaListSquares.foldArray     1000  avgt  100   6192.707 ±  12.944  ns/op
  ScalaListSquares.foldArray    10000  avgt  100  60182.719 ± 215.496  ns/op

JMH Gc profiler

Lightweight low-overhead JMX-based GC profiling


jmh:run -prof gc ru.jugvrn.ScalaListSquares.foldArray

GC profiling results

foldArray                                       88434.926 ± 4464.565   ns/op
foldArray:·gc.alloc.rate                         1682.633 ±   81.873  MB/sec
foldArray:·gc.alloc.rate.norm                  155936.207 ±    0.497    B/op
foldArray:·gc.churn.PS_Eden_Space                1683.088 ±  130.033  MB/sec
foldArray:·gc.churn.PS_Eden_Space.norm         155924.367 ± 7109.615    B/op
foldArray:·gc.churn.PS_Survivor_Space               0.078 ±    0.068  MB/sec
foldArray:·gc.churn.PS_Survivor_Space.norm          7.232 ±    6.315    B/op
foldArray:·gc.count                               160.000             counts
foldArray:·gc.time                                 92.000                 ms

Apples vs Oranges

def baseline:Long = {
  var i = 0
  var sum = 0L // <- primitive long
  while (i < SIZE) {
	  sum += array(i) * array(i)
	  i += 1
  }
  sum
}

def foldArray:Long = {
  array.foldLeft(0L)( (sum, next) => sum + next*next) // <- boxed Long
}

Creating a new object on each iteration

Scala collections are generic

def foldLeft[B](z: B)(op: (B, A) => B): B
A & B cannot be primitive*
No escape from boxing/unboxing ¯\_(ツ)_/¯

No escape?

It's time for 2.12!

Results with 2.12

  Benchmark                    (SIZE)  Mode  Cnt      Score     Error  Units
	ScalaListSquares.baseline       100  avgt  100     71.695 ±   0.524  ns/op
	ScalaListSquares.baseline      1000  avgt  100    655.157 ±   2.248  ns/op
	ScalaListSquares.baseline     10000  avgt  100   6457.211 ±  17.542  ns/op
  ScalaListSquares.foldArray      100  avgt  100    470.091 ±   1.249  ns/op
  ScalaListSquares.foldArray     1000  avgt  100   6192.707 ±  12.944  ns/op
  ScalaListSquares.foldArray    10000  avgt  100  60182.719 ± 215.496  ns/op
  ScalaListSquares.foldArray12    100  avgt  100    482.969 ±   1.975  ns/op
	ScalaListSquares.foldArray12   1000  avgt  100   7114.175 ±  19.649  ns/op
	ScalaListSquares.foldArray12  10000  avgt  100  71247.456 ± 311.178  ns/op

Sugar-free scala

scalac -print Foo.scala

2.11 desugared

def foldArray(): Long = scala.Long.unbox(
  scala.this.Predef.intArrayOps(ScalaListSquares.this.array())
	  .foldLeft(
		  scala.Long.box(0L),
			{ (new <$anon: Function2>(ScalaListSquares.this): Function2) }
		)
	);
final class anonfun$foldArray$1 extends scala.runtime.AbstractFunction2$mcJJI$sp with Serializable {
  final def apply(part: Long, next: Int): Long =
	  anonfun$foldArray$1.this.apply$mcJJI$sp(part, next);
  def apply$mcJJI$sp(part: Long, next: Int): Long =
	  part.+(next.*(next));
  final def apply(v1: Object, v2: Object): Object =
	  scala.Long.box(                   // here it is!
		  anonfun$foldArray$1.this.apply(
			  scala.Long.unbox(v1),         // again!
				scala.Int.unbox(v2)           // and again!
			)
		);
  def <init>($outer: ru.jugvrn.ScalaListSquares): <$anon: Function2> =
    anonfun$foldArray$1.super.();
}

2.12 desugared


def foldArray(): Long = scala.Long.unbox(
  new collection.mutable.ArrayOps$ofInt(
	  scala.Predef.intArrayOps(ScalaListSquares.this.array())
	).foldLeft(
	  scala.Long.box(0L),
		{ ((part: Long, next: Int) =>
			ScalaListSquares.this.$anonfun$foldArray$1(part, next)) }
	)
);

final def $anonfun$foldArray$1(part: Long, next: Int): Long =
  part.+(next.*(next)); // Where is box/unbox?

Lambdas in 2.11/2.12

2.11, java7:

FunctionN class wrapper
compile-time linking
invokevirtual

2.12, java8:

lambda as a method
run-time linking
invokedynamic
JFunctionN functional interface

JFunction2

package scala.runtime.java8;

@FunctionalInterface
public interface JFunction2$mcDII$sp
    extends scala.Function2, java.io.Serializable {
  double apply$mcDII$sp(int v1, int v2);

  default Object apply(Object v1, Object v2) {
		return scala.runtime.BoxesRunTime.boxToDouble( // found it!
		  apply$mcDII$sp(
			  scala.runtime.BoxesRunTime.unboxToInt(v1),
				scala.runtime.BoxesRunTime.unboxToInt(v2)
			)
		);
	}
}

2.11 vs 2.12

Why foldArray@2.12 is slower than foldArray@2.11?

2.11: hottest methods

....[Hottest Methods (after inlining)]...................................
96.29%   97.82%     C2, level 4  r.j.g.ScalaListSquares_foldArray_jmhTest
 0.37%    0.19%       libjvm.so  StringTable::unlink_or_oops_do
 0.34%    0.08%       libjvm.so  SpinPause
 0.16%    0.04%       libjvm.so  ParallelTaskTerminator::offer_termination
 0.14%    0.01%       libjvm.so  PSScavengeKlassClosure::do_klass
 0.09%    0.05%    libc-2.23.so  vfprintf

2.12: hottest methods

....[Hottest Methods (after inlining)]...................................
89.76%   91.56%     C2, level 4  r.j.g.ScalaListSquares_foldArray_jmhTest
 4.71%    4.70%     C2, level 4  s.c.m.ArrayOps$ofLong::foldLeft
 0.44%    0.45%     C2, level 4  r.j.ScalaListSquares::foldArray
 0.26%    0.01%       libjvm.so  SpinPause
 0.20%    0.29%       libjvm.so  StringTable::unlink_or_oops_do
 0.17%    0.10%       libjvm.so  fileStream::write
 0.14%    0.13%    libc-2.23.so  vfprintf

Scala collections

Be aware of boxing
Even Array can cause boxing
Alternative collections: debox, abc

The end

Scala can be slow:

easy to write beautiful, but slow code
collections + primitives = pain
JVM doesn't like scalac bytecode

Scala can be fast:

Know your roots: beautiful code can be fast
Scala 2.12 is more JVM-friendly

Materials

Benchmarks: github.com/shuttie/scala-perf-talk

Slides: dfdx.me/talks/scala_is_slow

Me: @public_void_grv, grv@dfdx.me