[Figure 13. Detailed trace recording statistics for the SunSpider benchmark set. Table columns: Loops, Trees, Traces, Aborts, Flushes, Trees/Loop, Traces/Tree, Traces/Loop, Speedup; one row per benchmark (3d-cube, 3d-morph, 3d-raytrace, access-binary-trees, access-fannkuch, access-nbody, access-nsieve, bitops-3bit-bits-in-byte, bitops-bits-in-byte, bitops-bitwise-and, bitops-nsieve-bits, controlflow-recursive, crypto-aes, crypto-md5, crypto-sha1, date-format-tofte, date-format-xparb, math-partial-sums, math-spectral-norm, math-cordic, regexp-dna, string-base64, string-fasta, string-tagcloud, string-unpack-code, string-validate-input). The individual cell values are not recoverable from this extraction.]

mean). We exclude regexp-dna from the following calculations, because most of its time is spent in the regular expression matcher, which has much different performance characteristics from the other programs. (Note that this exclusion changes the results by only about 10%.) Dividing the total execution time in processor clock cycles by the number of bytecodes executed in the base interpreter shows that, on average, a bytecode executes in about 35 cycles. Native traces take about 9 cycles per bytecode, a 3.9x speedup over the interpreter.

Using similar computations, we find that trace recording takes about 3800 cycles per bytecode, and compilation 3150 cycles per bytecode. Hence, during recording and compiling, the VM runs at 1/200 the speed of the interpreter. Because it costs 6950 cycles to compile a bytecode, and we save 26 cycles each time that code is run natively, we break even after running a trace 270 times (the arithmetic is restated below).

The other VMs we compared with achieve an overall speedup of 3.0x relative to our baseline interpreter. Our estimated native-code speedup of 3.9x is significantly better. This suggests that our compilation techniques can generate more efficient native code than any other current JavaScript VM.

These estimates also indicate that our startup performance could be substantially better if we improved the speed of trace recording and compilation. The estimated 200x slowdown for recording and compilation is very rough and may be influenced by startup factors in the interpreter (e.g., caches that have not warmed up yet during recording). One observation supporting this conjecture is that in the tracer, interpreted bytecodes take about 180 cycles to run. Still, recording and compilation are clearly both expensive, and a better implementation, possibly including a redesign of the LIR abstract syntax or encoding, would improve startup performance.

Our performance results confirm that type specialization using trace trees substantially improves performance. We are able to outperform the fastest available JavaScript compiler (V8) and the fastest available JavaScript inline threaded interpreter (SFX) on 9 of 26 benchmarks.
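For concreteness, the break-even figure quoted above is a restatement of the per-bytecode cycle counts already given (no new measurements; the quotient rounds to the quoted figure of 270):

\[
\text{speedup}_{\text{native}} = \frac{35\ \text{cycles/bytecode (interpreter)}}{9\ \text{cycles/bytecode (native trace)}} \approx 3.9\times
\]

\[
N_{\text{break-even}} = \frac{\text{recording cost} + \text{compilation cost}}{\text{cycles saved per native execution}} = \frac{3800 + 3150}{35 - 9} = \frac{6950}{26} \approx 267 \approx 270\ \text{executions of a trace.}
\]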

8. Related Work

Trace optimization for dynamic languages. The closest area of related work is applying trace optimization to type-specialize dynamic languages. Existing work shares the idea of generating type-specialized code speculatively, with guards, along interpreter traces.

To our knowledge, Rigo's Psyco (16) is the only published type-specializing trace compiler for a dynamic language (Python). Psyco does not attempt to identify hot loops or inline function calls. Instead, Psyco transforms loops to mutual recursion before running and traces all operations (see the sketch below).

Pall's LuaJIT is a Lua VM in development that uses trace compilation ideas (1). There are no publications on LuaJIT, but the creator has told us that LuaJIT has a similar design to our system, but will use less aggressive type speculation (e.g., using a floating-point representation for all number values) and does not generate nested traces for nested loops.

General trace optimization. General trace optimization has a longer history that has mostly treated native code and typed languages like Java. Thus, these systems have focused less on type specialization and more on other optimizations.

Dynamo (7), by Bala et al., introduced native-code tracing as a replacement for profile-guided optimization (PGO). A major goal was to perform PGO online so that the profile was specific to the current execution. Dynamo used loop headers as candidate hot traces, but did not try to create loop traces specifically.

Trace trees were originally proposed by Gal et al. (11) in the context of Java, a statically typed language. Their trace trees actually inlined parts of outer loops within the inner loops (because
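As referenced in the Psyco paragraph above, here is a minimal sketch, in Python, of what transforming a loop into mutual recursion can look like at the source level. This is an invented illustration, not Psyco's actual mechanism (which operates inside the interpreter rather than on user source code); the names loop_sum, header, and body are hypothetical.

    # Illustrative sketch only: a loop rewritten as two mutually
    # recursive functions, so each iteration becomes a call boundary
    # that a call-based specializer such as Psyco can intercept and
    # type-specialize.

    def loop_sum(n):
        # Original loop form: sum of 0..n-1.
        total = 0
        for i in range(n):
            total += i
        return total

    def header(i, n, total):
        # "Loop header": test the condition; exit or take an iteration.
        if i >= n:
            return total
        return body(i, n, total)

    def body(i, n, total):
        # "Loop body": do one iteration's work, then return to the header.
        return header(i + 1, n, total + i)

    # Both forms agree (for small n; the recursive form is bounded by
    # Python's recursion limit).
    assert loop_sum(10) == header(0, 10, 0) == 45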
