
Performance Improvements in .NET 7


A year ago, I published Performance Improvements in .NET 6, following on the heels of similar posts for .NET 5, .NET Core 3.0, .NET Core 2.1, and .NET Core 2.0. I enjoy writing these posts and love reading developers’ responses to them. One comment in particular last year resonated with me. The commenter cited the Die Hard movie quote, “‘When Alexander saw the breadth of his domain, he wept for there were no more worlds to conquer’,” and questioned whether .NET performance improvements were similar. Has the well run dry? Are there no more “[performance] worlds to conquer”? I’m a bit giddy to say that, even with how fast .NET 6 is, .NET 7 definitively highlights how much more can be and has been done.
As with previous versions of .NET, performance is a key focus that pervades the entire stack, whether it be features created explicitly for performance or non-performance-related features that are still designed and implemented with performance keenly in mind. And now that a .NET 7 release candidate is just around the corner, it’s a good time to discuss much of it. Over the course of the last year, every time I’ve reviewed a PR that might positively impact performance, I’ve copied that link to a journal I maintain for the purposes of writing this post. When I sat down to write this a few weeks ago, I was faced with a list of almost 1000 performance-impacting PRs (out of more than 7000 PRs that went into the release), and I’m excited to share almost 500 of them here with you.
One thought before we dive in. In past years, I’ve received the odd piece of negative feedback about the length of some of my performance-focused write-ups, and while I disagree with the criticism, I respect the opinion. So, this year, consider this a “choose your own adventure.” If you’re here just looking for a super short adventure, one that provides the top-level summary and a core message to take away from your time here, I’m happy to oblige:
TL;DR: .NET 7 is fast. Really fast. A thousand performance-impacting PRs went into runtime and core libraries this release, never mind all the improvements in ASP.NET Core and Windows Forms and Entity Framework and beyond. It’s the fastest .NET ever. If your manager asks you why your project should upgrade to .NET 7, you can say “in addition to all the new functionality in the release, .NET 7 is super fast.”
Or, if you prefer a slightly longer adventure, one filled with interesting nuggets of performance-focused data, consider skimming through the post, looking for the small code snippets and corresponding tables showing a wealth of measurable performance improvements. At that point, you, too, may walk away with your head held high and my thanks.
Both noted paths achieve one of my primary goals for spending the time to write these posts, to highlight the greatness of the next release and to encourage everyone to give it a try. But, I have other goals for these posts, too. I want everyone interested to walk away from this post with an upleveled understanding of how .NET is implemented, why various decisions were made, tradeoffs that were evaluated, techniques that were employed, algorithms that were considered, and valuable tools and approaches that were utilized to make .NET even faster than it was previously. I want developers to learn from our own learnings and find ways to apply this new-found knowledge to their own codebases, thereby further increasing the overall performance of code in the ecosystem. I want developers to take an extra beat, think about reaching for a profiler the next time they’re working on a gnarly problem, think about looking at the source for the component they’re using in order to better understand how to work with it, and think about revisiting previous assumptions and decisions to determine whether they’re still accurate and appropriate. And I want developers to be excited at the prospect of submitting PRs to improve .NET not only for themselves but for every developer around the globe using .NET. If any of that sounds interesting, then I encourage you to choose the last adventure: prepare a carafe of your favorite hot beverage, get comfortable, and please enjoy.
Oh, and please don’t print this to paper. “Print to PDF” tells me it would take a third of a ream. If you would like a nicely formatted PDF, one is available for download here.
Setup
The microbenchmarks throughout this post utilize benchmarkdotnet. To make it easy for you to follow along with your own validation, I have a very simple setup for the benchmarks I use. Create a new C# project:
dotnet new console -o benchmarks
cd benchmarks

Your new benchmarks directory will contain a benchmarks.csproj file and a Program.cs file. Replace the contents of benchmarks.csproj with a project file along these lines (a minimal sketch: the exact BenchmarkDotNet package version is an assumption, so use the latest available):
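<Project Sdk="Microsoft.NET.Sdk">

  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <TargetFrameworks>net7.0;net6.0</TargetFrameworks>
    <LangVersion>Preview</LangVersion>
    <AllowUnsafeBlocks>true</AllowUnsafeBlocks>
  </PropertyGroup>

  <ItemGroup>
    <!-- The package version here is an assumption; use the latest BenchmarkDotNet available. -->
    <PackageReference Include="BenchmarkDotNet" Version="0.13.2" />
  </ItemGroup>

</Project>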
and the contents of Program.cs with this:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using Microsoft.Win32;
using System;
using System.Buffers;
using System.Collections.Generic;
using System.Collections.Immutable;
using System.ComponentModel;
using System.Diagnostics;
using System.IO;
using System.IO.Compression;
using System.IO.MemoryMappedFiles;
using System.IO.Pipes;
using System.Linq;
using System.Net;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Net.Security;
using System.Net.Sockets;
using System.Numerics;
using System.Reflection;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;
using System.Security.Authentication;
using System.Security.Cryptography;
using System.Security.Cryptography.X509Certificates;
using System.Text;
using System.Text.Json;
using System.Text.RegularExpressions;
using System.Threading;
using System.Threading.Tasks;
using System.Xml;

[MemoryDiagnoser(displayGenColumns: false)]
[DisassemblyDiagnoser]
[HideColumns("Error", "StdDev", "Median", "RatioSD")]
public partial class Program
{
    static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    // ... copy [Benchmark]s here
}

For each benchmark included in this write-up, you can then just copy and paste the code into this test class, and run the benchmarks. For example, to run a benchmark comparing performance on .NET 6 and .NET 7, do:
dotnet run -c Release -f net6.0 --filter '**' --runtimes net6.0 net7.0

This command says “build the benchmarks in release configuration targeting the .NET 6 surface area, and then run all of the benchmarks on both .NET 6 and .NET 7.” Or to run just on .NET 7:
dotnet run -c Release -f net7.0 --filter '**' --runtimes net7.0

which instead builds targeting the .NET 7 surface area and then only runs once against .NET 7. You can do this on any of Windows, Linux, or macOS. Unless otherwise called out (e.g. where the improvements are specific to Unix and I run the benchmarks on Linux), the results I share were recorded on Windows 11 64-bit but aren’t Windows-specific and should show similar relative differences on the other operating systems as well.
The release of the first .NET 7 release candidate is right around the corner. All of the measurements in this post were gathered with a recent daily build of .NET 7 RC1.
Also, my standard caveat: These are microbenchmarks. It is expected that different hardware, different versions of operating systems, and the way in which the wind is currently blowing can affect the numbers involved. Your mileage may vary.
JIT
I’d like to kick off a discussion of performance improvements in the Just-In-Time (JIT) compiler by talking about something that itself isn’t actually a performance improvement. Being able to understand exactly what assembly code is generated by the JIT is critical when fine-tuning lower-level, performance-sensitive code. There are multiple ways to get at that assembly code. The online tool sharplab.io is incredibly useful for this (thanks to @ashmind for this tool); however it currently only targets a single release, so as I write this I’m only able to see the output for .NET 6, which makes it difficult to use for A/B comparisons. godbolt.org is also valuable for this, with C# support added in compiler-explorer/compiler-explorer#3168 from @hez2010, with similar limitations. The most flexible solutions involve getting at that assembly code locally, as it enables comparing whatever versions or local builds you desire with whatever configurations and switches set that you need.
One common approach is to use the [DisassemblyDiagnoser] in benchmarkdotnet. Simply slap the [DisassemblyDiagnoser] attribute onto your test class: benchmarkdotnet will find the assembly code generated for your tests and some depth of functions they call, and dump out the found assembly code in a human-readable form. For example, if I run this test:
using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;

[DisassemblyDiagnoser]
public partial class Program
{
    static void Main(string[] args) =>
        BenchmarkSwitcher.FromAssembly(typeof(Program).Assembly).Run(args);

    private int _a = 42, _b = 84;

    [Benchmark]
    public int Min() => Math.Min(_a, _b);
}

with:
dotnet run -c Release -f net7.0 --filter '**'

in addition to doing all of its normal test execution and timing, benchmarkdotnet also outputs a Program-asm.md file that contains this:
; Program.Min()
       mov       eax,[rcx+8]
       mov       edx,[rcx+0C]
       cmp       eax,edx
       jg        short M00_L01
       mov       edx,eax
M00_L00:
       mov       eax,edx
       ret
M00_L01:
       jmp       short M00_L00
; Total bytes of code 17

Pretty neat. This support was recently improved further in dotnet/benchmarkdotnet#2072, which allows passing a filter list on the command-line to benchmarkdotnet to tell it exactly which methods’ assembly code should be dumped.
If you can get your hands on a “debug” or “checked” build of the .NET runtime (“checked” is a build that has optimizations enabled but also still includes asserts), and specifically of clrjit.dll, another valuable approach is to set an environment variable that causes the JIT itself to spit out a human-readable description of all of the assembly code it emits. This can be used with any kind of application, as it’s part of the JIT itself rather than part of any specific tool or other environment, it supports showing the code the JIT generates each time it generates code (e.g. if it first compiles a method without optimization and then later recompiles it with optimization), and overall it’s the most accurate picture of the assembly code as it comes “straight from the horse’s mouth,” as it were. The (big) downside of course is that it requires a non-release build of the runtime, which typically means you need to build it yourself from the sources in the dotnet/runtime repo.
… until .NET 7, that is. As of dotnet/runtime#73365, this assembly dumping support is now available in release builds as well, which means it’s simply part of .NET 7 and you don’t need anything special to use it. To see this, try creating a simple “hello world” app like:
using System;

class Program
{
    public static void Main() => Console.WriteLine("Hello, world!");
}

and building it (e.g. dotnet build -c Release). Then, set the DOTNET_JitDisasm environment variable to the name of the method we care about, in this case “Main” (the exact syntax allowed is more permissive and allows for some use of wildcards, optional namespace and class names, etc.). As I’m using PowerShell, that means:
$env:DOTNET_JitDisasm="Main"

and then running the app. You should see code like this output to the console:
; Assembly listing for method Program:Main()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       4883EC20             sub      rsp, 32
       488D6C2420           lea      rbp, [rsp+20H]

G_M000_IG02:                ;; offset=000AH
       48B9D820400A8E010000 mov      rcx, 0x18E0A4020D8
       488B09               mov      rcx, gword ptr [rcx]
       FF1583B31000         call     [Console:WriteLine(String)]
       90                   nop

G_M000_IG03:                ;; offset=001EH
       4883C420             add      rsp, 32
       5D                   pop      rbp
       C3                   ret

; Total bytes of code 36

Hello, world!

This is immeasurably helpful for performance analysis and tuning, even for questions as simple as “did my function get inlined” or “is this code I expected to be optimized away actually getting optimized away.” Throughout the rest of this post, I’ll include assembly snippets generated by one of these two mechanisms, in order to help exemplify concepts.
Note that it can sometimes be a little confusing figuring out what name to specify as the value for DOTNET_JitDisasm, especially when the method you care about is one that the C# compiler names or name mangles (since the JIT only sees the IL and metadata, not the original C#), e.g. the name of the entry point method for a program with top-level statements, the names of local functions, etc. To both help with this and to provide a really valuable top-level view of the work the JIT is doing, .NET 7 also supports the new DOTNET_JitDisasmSummary environment variable (introduced in dotnet/runtime#74090). Set that to “1”, and it’ll result in the JIT emitting a line every time it compiles a method, including the name of that method which is copy/pasteable with DOTNET_JitDisasm. This feature is useful in-and-of-itself, however, as it can quickly highlight for you what’s being compiled, when, and with what settings. For example, if I set the environment variable and then run a “hello, world” console app, I get this output:
1: JIT compiled CastHelpers:StelemRef(Array,long,Object) [Tier1, IL size=88, code size=93]
2: JIT compiled CastHelpers:LdelemaRef(Array,long,long):byref [Tier1, IL size=44, code size=44]
3: JIT compiled SpanHelpers:IndexOfNullCharacter(byref):int [Tier1, IL size=792, code size=388]
4: JIT compiled Program:Main() [Tier0, IL size=11, code size=36]
5: JIT compiled ASCIIUtility:NarrowUtf16ToAscii(long,long,long):long [Tier0, IL size=490, code size=1187]
Hello, world!

We can see for “hello, world” there are only five methods that actually get JIT compiled. There are of course many more methods that get executed as part of a simple “hello, world,” but almost all of them have precompiled native code available as part of the “Ready To Run” (R2R) images of the core libraries. The first three in the above list (StelemRef, LdelemaRef, and IndexOfNullCharacter) don’t because they explicitly opted-out of R2R via use of the [MethodImpl(MethodImplOptions.AggressiveOptimization)] attribute (despite the name, this attribute should almost never be used, and is only used for very specific reasons in a few very specific places in the core libraries). Then there’s our Main method. And lastly there’s the NarrowUtf16ToAscii method, which doesn’t have R2R code, either, due to using the variable-width Vector<T>.
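As with DOTNET_JitDisasm above, in PowerShell enabling that summary and re-running the app looks like:

$env:DOTNET_JitDisasmSummary="1"
dotnet run -c Release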
With that out of the way, let’s move on to actual performance improvements, starting with on-stack replacement.
On-Stack Replacement
On-stack replacement (OSR) is one of the coolest features to hit the JIT in .NET 7. But to really understand OSR, we first need to understand tiered compilation, so a quick recap…
One of the issues a managed environment with a JIT compiler has to deal with is tradeoffs between startup and throughput. Historically, the job of an optimizing compiler is to, well, optimize, in order to enable the best possible throughput of the application or service once running. But such optimization takes analysis, takes time, and performing all of that work then leads to increased startup time, as all of the code on the startup path (e.g. all of the code that needs to be run before a web server can serve the first request) needs to be compiled. So a JIT compiler needs to make tradeoffs: better throughput at the expense of longer startup time, or better startup time at the expense of decreased throughput. For some kinds of apps and services, the tradeoff is an easy call, e.g. if your service starts up once and then runs for days, several extra seconds of startup time doesn’t matter, or if you’re a console application that’s going to do a quick computation and exit, startup time is all that matters. But how can the JIT know which scenario it’s in, and do we really want every developer having to know about these kinds of settings and tradeoffs and configure every one of their applications accordingly? One answer to this has been ahead-of-time compilation, which has taken various forms in .NET. For example, all of the core libraries are “crossgen”’d, meaning they’ve been run through a tool that produces the previously mentioned R2R format, yielding binaries that contain assembly code that needs only minor tweaks to actually execute; not every method can have code generated for it, but enough that it significantly reduces startup time. Of course, such approaches have their own downsides, e.g. one of the promises of a JIT compiler is it can take advantage of knowledge of the current machine / process in order to best optimize, so for example the R2R images have to assume a certain baseline instruction set (e.g. what vectorizing instructions are available) whereas the JIT can see what’s actually available and use the best. “Tiered compilation” provides another answer, one that’s usable with or without these other ahead-of-time (AOT) compilation solutions.
Tiered compilation enables the JIT to have its proverbial cake and eat it, too. The idea is simple: allow the JIT to compile the same code multiple times. The first time, the JIT can use as few optimizations as make sense (a handful of optimizations can actually make the JIT’s own throughput faster, so those still make sense to apply), producing fairly unoptimized assembly code but doing so really quickly. And when it does so, it can add some instrumentation into the assembly to track how often the methods are called. As it turns out, many functions used on a startup path are invoked once or maybe only a handful of times, and it would take more time to optimize them than it does to just execute them unoptimized. Then, when the method’s instrumentation triggers some threshold, for example a method having been executed 30 times, a work item gets queued to recompile that method, but this time with all the optimizations the JIT can throw at it. This is lovingly referred to as “tiering up.” Once that recompilation has completed, call sites to the method are patched with the address of the newly highly optimized assembly code, and future invocations will then take the fast path. So, we get faster startup and faster sustained throughput. At least, that’s the hope.
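As a mental model, that tier-up flow looks something like the following sketch (illustrative only: none of these names are real runtime APIs, and the actual policy is more sophisticated than a single counter):

using System.Collections.Concurrent;
using System.Threading.Tasks;

// Illustrative model of tiered compilation; not actual runtime code or APIs.
sealed class TieredCompilationModel
{
    const int TierUpThreshold = 30; // call-count threshold before tiering up (illustrative)

    sealed class MethodState
    {
        public int CallCount;
        public string Code = "tier-0: quick, minimally optimized, instrumented";
    }

    readonly ConcurrentDictionary<string, MethodState> _methods = new();

    public void OnMethodInvoked(string methodName)
    {
        MethodState m = _methods.GetOrAdd(methodName, _ => new MethodState());
        if (++m.CallCount == TierUpThreshold)
        {
            // Recompile in the background with full optimizations ("tiering up"),
            // then patch call sites so future invocations take the fast path.
            Task.Run(() => m.Code = "tier-1: fully optimized");
        }
    }
}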
A problem, however, is methods that don’t fit this mold. While it’s certainly the case that many performance-sensitive methods are relatively quick and executed many, many, many times, there’s also a large number of performance-sensitive methods that are executed just a handful of times, or maybe even only once, but that take a very long time to execute, maybe even the duration of the whole process: methods with loops. As a result, by default tiered compilation hasn’t applied to loops, though it can be enabled by setting the DOTNET_TC_QuickJitForLoops environment variable to 1. We can see the effect of this by trying this simple console app with .NET 6. With the default settings, run this app:
class Program
{
    static void Main()
    {
        var sw = new System.Diagnostics.Stopwatch();
        while (true)
        {
            sw.Restart();
            for (int trial = 0; trial < 10_000; trial++)
            {
                int count = 0;
                for (int i = 0; i < char.MaxValue; i++)
                    if (IsAsciiDigit((char)i))
                        count++;
            }
            sw.Stop();
            Console.WriteLine(sw.Elapsed);
        }

        static bool IsAsciiDigit(char c) => (uint)(c - '0') <= 9;
    }
}

I get numbers printed out like:
00:00:00.5734352
00:00:00.5526667
00:00:00.5675267
00:00:00.5588724
00:00:00.5616028

Now, try setting DOTNET_TC_QuickJitForLoops to 1. When I then run it again, I get numbers like this:
00:00:01.2841397
00:00:01.2693485
00:00:01.2755646
00:00:01.2656678
00:00:01.2679925

In other words, with DOTNET_TC_QuickJitForLoops enabled, it’s taking 2.5x as long as without (the default in .NET 6). That’s because this main function never gets optimizations applied to it. By setting DOTNET_TC_QuickJitForLoops to 1, we’re saying “JIT, please apply tiering to methods with loops as well,” but this method with a loop is only ever invoked once, so for the duration of the process it ends up remaining at “tier-0,” aka unoptimized. Now, let’s try the same thing with .NET 7. Regardless of whether that environment variable is set, I again get numbers like this:
00:00:00.5528889
00:00:00.5562563
00:00:00.5622086
00:00:00.5668220
00:00:00.5589112

but importantly, this method was still participating in tiering. In fact, we can get confirmation of that by using the aforementioned DOTNET_JitDisasmSummary=1 environment variable. When I set that and run again, I see these lines in the output:
4: JIT compiled Program:Main() [Tier0, IL size=83, code size=319]
...
6: JIT compiled Program:Main() [Tier1-OSR @0x27, IL size=83, code size=380]

highlighting that Main was indeed compiled twice. How is that possible? On-stack replacement.
The idea behind on-stack replacement is a method can be replaced not just between invocations but even while it’s executing, while it’s “on the stack.” In addition to the tier-0 code being instrumented for call counts, loops are also instrumented for iteration counts. When the iterations surpass a certain limit, the JIT compiles a new highly optimized version of that method, transfers all the local/register state from the current invocation to the new invocation, and then jumps to the appropriate location in the new method. We can see this in action by using the previously discussed DOTNET_JitDisasm environment variable. Set that to Program:* in order to see the assembly code generated for all of the methods in the Program class, and then run the app again. You should see output like the following:
; Assembly listing for method Program:Main()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       4881EC80000000       sub      rsp, 128
       488DAC2480000000     lea      rbp, [rsp+80H]
       C5D857E4             vxorps   xmm4, xmm4
       C5F97F65B0           vmovdqa  xmmword ptr [rbp-50H], xmm4
       33C0                 xor      eax, eax
       488945C0             mov      qword ptr [rbp-40H], rax

G_M000_IG02:                ;; offset=001FH
       48B9002F0B50FC7F0000 mov      rcx, 0x7FFC500B2F00
       E8721FB25F           call     CORINFO_HELP_NEWSFAST
       488945B0             mov      gword ptr [rbp-50H], rax
       488B4DB0             mov      rcx, gword ptr [rbp-50H]
       FF1544C70D00         call     [Stopwatch:.ctor():this]
       488B4DB0             mov      rcx, gword ptr [rbp-50H]
       48894DC0             mov      gword ptr [rbp-40H], rcx
       C745A8E8030000       mov      dword ptr [rbp-58H], 0x3E8

G_M000_IG03:                ;; offset=004BH
       8B4DA8               mov      ecx, dword ptr [rbp-58H]
       FFC9                 dec      ecx
       894DA8               mov      dword ptr [rbp-58H], ecx
       837DA800             cmp      dword ptr [rbp-58H], 0
       7F0E                 jg       SHORT G_M000_IG05

G_M000_IG04:                ;; offset=0059H
       488D4DA8             lea      rcx, [rbp-58H]
       BA06000000           mov      edx, 6
       E8B985AB5F           call     CORINFO_HELP_PATCHPOINT

G_M000_IG05:                ;; offset=0067H
       488B4DC0             mov      rcx, gword ptr [rbp-40H]
       3909                 cmp      dword ptr [rcx], ecx
       FF1585C70D00         call     [Stopwatch:Restart():this]
       33C9                 xor      ecx, ecx
       894DBC               mov      dword ptr [rbp-44H], ecx
       33C9                 xor      ecx, ecx
       894DB8               mov      dword ptr [rbp-48H], ecx
       EB20                 jmp      SHORT G_M000_IG08

G_M000_IG06:                ;; offset=007FH
       8B4DB8               mov      ecx, dword ptr [rbp-48H]
       0FB7C9               movzx    rcx, cx
       FF152DD40B00         call     [Program:

A few relevant things to notice here. First, the comments at the top highlight how this code was compiled:
; Tier-0 compilation
; MinOpts code

So, we know this is the initial version (“Tier-0”) of the method compiled with minimal optimization (“MinOpts”). Second, note this line of the assembly:
       FF152DD40B00         call     [Program:

Our IsAsciiDigit helper method is trivially inlineable, but it’s not getting inlined; instead, the assembly has a call to it, and indeed we can see below the generated code (also “MinOpts”) for IsAsciiDigit. Why? Because inlining is an optimization (a really important one) that’s disabled as part of tier-0 (because the analysis for doing inlining well is also quite costly). Third, we can see the code the JIT is outputting to instrument this method. This is a bit more involved, but I’ll point out the relevant parts. First, we see:
       C745A8E8030000       mov      dword ptr [rbp-58H], 0x3E8

That 0x3E8 is the hex value for the decimal 1,000, which is the default number of iterations a loop needs to iterate before the JIT will generate the optimized version of the method (this is configurable via the DOTNET_TC_OnStackReplacement_InitialCounter environment variable). So we see 1,000 being stored into this stack location. Then a bit later in the method we see this:
G_M000_IG03:                ;; offset=004BH
       8B4DA8               mov      ecx, dword ptr [rbp-58H]
       FFC9                 dec      ecx
       894DA8               mov      dword ptr [rbp-58H], ecx
       837DA800             cmp      dword ptr [rbp-58H], 0
       7F0E                 jg       SHORT G_M000_IG05

G_M000_IG04:                ;; offset=0059H
       488D4DA8             lea      rcx, [rbp-58H]
       BA06000000           mov      edx, 6
       E8B985AB5F           call     CORINFO_HELP_PATCHPOINT

G_M000_IG05:                ;; offset=0067H

The generated code is loading that counter into the ecx register, decrementing it, storing it back, and then seeing whether the counter dropped to 0. If it didn’t, the code skips to G_M000_IG05, which is the label for the actual code in the rest of the loop. But if the counter did drop to 0, the JIT proceeds to store relevant state into the rcx and edx registers and then calls the CORINFO_HELP_PATCHPOINT helper method. That helper is responsible for triggering the creation of the optimized method if it doesn’t yet exist, fixing up all appropriate tracking state, and jumping to the new method. And indeed, if you look again at your console output from running the program, you’ll see yet another output for the Main method:
; Assembly listing for method Program:Main()
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; OSR variant for entry point 0x23
; optimized code
; rsp based frame
; fully interruptible
; No PGO data
; 1 inlinees with PGO data; 8 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       4883EC58             sub      rsp, 88
       4889BC24D8000000     mov      qword ptr [rsp+D8H], rdi
       4889B424D0000000     mov      qword ptr [rsp+D0H], rsi
       48899C24C8000000     mov      qword ptr [rsp+C8H], rbx
       C5F877               vzeroupper
       33C0                 xor      eax, eax
       4889442428           mov      qword ptr [rsp+28H], rax
       4889442420           mov      qword ptr [rsp+20H], rax
       488B9C24A0000000     mov      rbx, gword ptr [rsp+A0H]
       8BBC249C000000       mov      edi, dword ptr [rsp+9CH]
       8BB42498000000       mov      esi, dword ptr [rsp+98H]

G_M000_IG02:                ;; offset=0041H
       EB45                 jmp      SHORT G_M000_IG05
                            align    [0 bytes for IG06]

G_M000_IG03:                ;; offset=0043H
       33C9                 xor      ecx, ecx
       488B9C24A0000000     mov      rbx, gword ptr [rsp+A0H]
       48894B08             mov      qword ptr [rbx+08H], rcx
       488D4C2428           lea      rcx, [rsp+28H]
       48B87066E68AFD7F0000 mov      rax, 0x7FFD8AE66670

G_M000_IG04:                ;; offset=0060H
       FFD0                 call     rax ; Kernel32:QueryPerformanceCounter(long):int
       488B442428           mov      rax, qword ptr [rsp+28H]
       488B9C24A0000000     mov      rbx, gword ptr [rsp+A0H]
       48894310             mov      qword ptr [rbx+10H], rax
       C6431801             mov      byte ptr [rbx+18H], 1
       33FF                 xor      edi, edi
       33F6                 xor      esi, esi
       833D92A1E55F00       cmp      dword ptr [(reloc 0x7ffcafe1ae34)], 0
       0F85CA000000         jne      G_M000_IG13

G_M000_IG05:                ;; offset=0088H
       81FE00CA9A3B         cmp      esi, 0x3B9ACA00
       7D17                 jge      SHORT G_M000_IG09

G_M000_IG06:                ;; offset=0090H
       0FB7CE               movzx    rcx, si
       83C1D0               add      ecx, -48
       83F909               cmp      ecx, 9
       7702                 ja       SHORT G_M000_IG08

G_M000_IG07:                ;; offset=009BH
       FFC7                 inc      edi

G_M000_IG08:                ;; offset=009DH
       FFC6                 inc      esi
       81FE00CA9A3B         cmp      esi, 0x3B9ACA00
       7CE9                 jl       SHORT G_M000_IG06

G_M000_IG09:                ;; offset=00A7H
       488B6B08             mov      rbp, qword ptr [rbx+08H]
       48899C24A0000000     mov      gword ptr [rsp+A0H], rbx
       807B1800             cmp      byte ptr [rbx+18H], 0
       7436                 je       SHORT G_M000_IG12

G_M000_IG10:                ;; offset=00B9H
       488D4C2420           lea      rcx, [rsp+20H]
       48B87066E68AFD7F0000 mov      rax, 0x7FFD8AE66670

G_M000_IG11:                ;; offset=00C8H
       FFD0                 call     rax ; Kernel32:QueryPerformanceCounter(long):int
       488B4C2420           mov      rcx, qword ptr [rsp+20H]
       488B9C24A0000000     mov      rbx, gword ptr [rsp+A0H]
       482B4B10             sub      rcx, qword ptr [rbx+10H]
       4803E9               add      rbp, rcx
       833D2FA1E55F00       cmp      dword ptr [(reloc 0x7ffcafe1ae34)], 0
       48899C24A0000000     mov      gword ptr [rsp+A0H], rbx
       756D                 jne      SHORT G_M000_IG14

G_M000_IG12:                ;; offset=00EFH
       C5F857C0             vxorps   xmm0, xmm0
       C4E1FB2AC5           vcvtsi2sd xmm0, rbp
       C5FB11442430         vmovsd   qword ptr [rsp+30H], xmm0
       48B9F04BF24FFC7F0000 mov      rcx, 0x7FFC4FF24BF0
       BAE7070000           mov      edx, 0x7E7
       E82E1FB25F           call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       C5FB10442430         vmovsd   xmm0, qword ptr [rsp+30H]
       C5FB5905E049F6FF     vmulsd   xmm0, xmm0, qword ptr [(reloc 0x7ffc4ff25720)]
       C4E1FB2CD0           vcvttsd2si rdx, xmm0
       48B94B598638D6C56D34 mov      rcx, 0x346DC5D63886594B
       488BC1               mov      rax, rcx
       48F7EA               imul     rdx:rax, rdx
       488BCA               mov      rcx, rdx
       48C1E93F             shr      rcx, 63
       48C1FA0B             sar      rdx, 11
       4803CA               add      rcx, rdx
       FF1567CE0D00         call     [Console:WriteLine(long)]
       E9F5FEFFFF           jmp      G_M000_IG03

G_M000_IG13:                ;; offset=014EH
       E8DDCBAC5F           call     CORINFO_HELP_POLL_GC
       E930FFFFFF           jmp      G_M000_IG05

G_M000_IG14:                ;; offset=0158H
       E8D3CBAC5F           call     CORINFO_HELP_POLL_GC
       EB90                 jmp      SHORT G_M000_IG12

; Total bytes of code 351

Here, again, we notice a few interesting things. First, in the header we see this:
; Tier-1 compilation
; OSR variant for entry point 0x23
; optimized code

so we know this is both optimized “tier-1” code and is the “OSR variant” for this method. Second, notice there’s no longer a call to the IsAsciiDigit helper. Instead, where that call would have been, we see this:
G_M000_IG06:                ;; offset=0090H
       0FB7CE               movzx    rcx, si
       83C1D0               add      ecx, -48
       83F909               cmp      ecx, 9
       7702                 ja       SHORT G_M000_IG08

This is loading a value into rcx, subtracting 48 from it (48 is the decimal ASCII value of the ‘0’ character) and comparing the resulting value to 9. Sounds an awful lot like our IsAsciiDigit implementation ((uint)(c - '0') <= 9), doesn’t it? That’s because it is. The helper was successfully inlined in this now-optimized code.
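Conceptually, then, the tier-0 instrumentation we just walked through behaves like this sketch (illustrative pseudo-code only; the counter name is made up, and the actual transition is performed by the runtime’s CORINFO_HELP_PATCHPOINT helper in generated code, not in C#):

int patchpointCounter = 1_000; // DOTNET_TC_OnStackReplacement_InitialCounter's default
int count = 0;
for (int i = 0; i < char.MaxValue; i++)
{
    if (--patchpointCounter <= 0)
    {
        // Compile an optimized (OSR) version of this method if one doesn't yet exist,
        // transfer the live locals/registers to it, and jump into it mid-loop.
    }

    if ((uint)(i - '0') <= 9) // the now-inlined IsAsciiDigit
        count++;
}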
Great, so now in .NET 7, we can largely avoid the tradeoffs between startup and throughput, as OSR enables tiered compilation to apply to all methods, even those that are long-running. A multitude of PRs went into enabling this, including many over the last few years, but all of the functionality was disabled in the shipping bits. Thanks to improvements like dotnet/runtime#62831 which implemented support for OSR on Arm64 (previously only x64 support was implemented), and dotnet/runtime#63406 and dotnet/runtime#65609 which revised how OSR imports and epilogs are handled, dotnet/runtime#65675 enables OSR (and as a result DOTNET_TC_QuickJitForLoops) by default.
But, tiered compilation and OSR aren’t just about startup (though they’re of course very valuable there). They’re also about further improving throughput. Even though tiered compilation was originally envisioned as a way to optimize startup while not hurting throughput, it’s become much more than that. There are various things the JIT can learn about a method during tier-0 that it can then use for tier-1. For example, the very fact that the tier-0 code executed means that any statics accessed by the method will have been initialized, and that means that any readonly statics will not only have been initialized by the time the tier-1 code executes but their values won’t ever change. And that in turn means that any readonly statics of primitive types (e.g. bool, int, etc.) can be treated like consts instead of static readonly fields, and during tier-1 compilation the JIT can optimize them just as it would have optimized a const. For example, try running this simple program after setting DOTNET_JitDisasm to Program:Test:
using System.Runtime.CompilerServices;

class Program
{
    static readonly bool Is64Bit = Environment.Is64BitProcess;

    static int Main()
    {
        int count = 0;
        for (int i = 0; i < 1_000_000_000; i++)
            if (Test())
                count++;
        return count;
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static bool Test() => Is64Bit;
}

When I do so, I get this output:
; Assembly listing for method Program:Test():bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-0 compilation
; MinOpts code
; rbp based frame
; partially interruptible

G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       4883EC20             sub      rsp, 32
       488D6C2420           lea      rbp, [rsp+20H]

G_M000_IG02:                ;; offset=000AH
       48B9B8639A3FFC7F0000 mov      rcx, 0x7FFC3F9A63B8
       BA01000000           mov      edx, 1
       E8C220B25F           call     CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE
       0FB60545580C00       movzx    rax, byte ptr [(reloc 0x7ffc3f9a63ea)]

G_M000_IG03:                ;; offset=0025H
       4883C420             add      rsp, 32
       5D                   pop      rbp
       C3                   ret

; Total bytes of code 43

; Assembly listing for method Program:Test():bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0000H

G_M000_IG02:                ;; offset=0000H
       B801000000           mov      eax, 1

G_M000_IG03:                ;; offset=0005H
       C3                   ret

; Total bytes of code 6

Note, again, we see two outputs for Program:Test. First, we see the “Tier-0” code, which is accessing a static (note the call CORINFO_HELP_GETSHARED_NONGCSTATIC_BASE instruction). But then we see the “Tier-1” code, where all of that overhead has vanished and is instead replaced simply by mov eax, 1. Since the “Tier-0” code had to have executed in order for it to tier up, the “Tier-1” code was generated knowing that the value of the static readonly bool Is64Bit field was true (1), and so the entirety of this method is storing the value 1 into the eax register used for the return value.
This is so useful that components are now written with tiering in mind. Consider the new Regex source generator, which is discussed later in this post (Roslyn source generators were introduced a couple of years ago; just as how Roslyn analyzers are able to plug into the compiler and surface additional diagnostics based on all of the data the compiler learns from the source code, Roslyn source generators are able to analyze that same data and then further augment the compilation unit with additional source). The Regex source generator applies a technique based on this in dotnet/runtime#67775. Regex supports setting a process-wide timeout that gets applied to Regex instances that don’t explicitly set a timeout. That means, even though it’s super rare for such a process-wide timeout to be set, the Regex source generator still needs to output timeout-related code just in case it’s needed. It does so by outputting some helpers like this:
static class Utilities
{
    internal static readonly TimeSpan s_defaultTimeout =
        AppContext.GetData("REGEX_DEFAULT_MATCH_TIMEOUT") is TimeSpan timeout ?
            timeout :
            Timeout.InfiniteTimeSpan;

    internal static readonly bool s_hasTimeout = s_defaultTimeout != Timeout.InfiniteTimeSpan;
}

which it then uses at call sites like this:
if (Utilities.s_hasTimeout)
{
    base.CheckTimeout();
}

In tier-0, these checks will still be emitted in the assembly code, but in tier-1 where throughput matters, if the relevant AppContext switch hasn’t been set, then s_defaultTimeout will be Timeout.InfiniteTimeSpan, at which point s_hasTimeout will be false. And since s_hasTimeout is a static readonly bool, the JIT will be able to treat that as a const, and all conditions like if (Utilities.s_hasTimeout) will be treated equal to if (false) and be eliminated from the assembly code entirely as dead code.
But, this is somewhat old news. The JIT has been able to do such an optimization since tiered compilation was introduced in .NET Core 3.0. Now in .NET 7, though, with OSR it’s also able to do so by default for methods with loops (and thus enable cases like the regex one). However, the real magic of OSR comes into play when combined with another exciting feature: dynamic PGO.
PGO
I wrote about profile-guided optimization (PGO) in my Performance Improvements in .NET 6 post, but I’ll cover it again here as it’s seen a multitude of improvements for .NET 7.
PGO has been around for a long time, in any number of languages and compilers. The basic idea is you compile your app, asking the compiler to inject instrumentation into the application to track various pieces of interesting information. You then put your app through its paces, running through various common scenarios, causing that instrumentation to “profile” what happens when the app is executed, and the results of that are then saved out. The app is then recompiled, feeding those instrumentation results back into the compiler, and allowing it to optimize the app for exactly how it’s expected to be used. This approach to PGO is referred to as “static PGO,” as the information is all gleaned ahead of actual deployment, and it’s something .NET has been doing in various forms for years. From my perspective, though, the really interesting development in .NET is “dynamic PGO,” which was introduced in .NET 6, but off by default.
Dynamic PGO takes advantage of tiered compilation. I noted that the JIT instruments the tier-0 code to track how many times the method is called, or in the case of loops, how many times the loop executes. It can instrument it for other things as well. For example, it can track exactly which concrete types are used as the target of an interface dispatch, and then in tier-1 specialize the code to expect the most common types (this is referred to as “guarded devirtualization,” or GDV). You can see this in this little example. Set the DOTNET_TieredPGO environment variable to 1, and then run this on .NET 7:
class Program
{
    static void Main()
    {
        IPrinter printer = new Printer();
        for (int i = 0; ; i++)
        {
            DoWork(printer, i);
        }
    }

    static void DoWork(IPrinter printer, int i)
    {
        printer.PrintIfTrue(i == int.MaxValue);
    }

    interface IPrinter
    {
        void PrintIfTrue(bool condition);
    }

    class Printer : IPrinter
    {
        public void PrintIfTrue(bool condition)
        {
            if (condition) Console.WriteLine("Print!");
        }
    }
}

The tier-0 code for DoWork ends up looking like this:
G_M000_IG01:                ;; offset=0000H
       55                   push     rbp
       4883EC30             sub      rsp, 48
       488D6C2430           lea      rbp, [rsp+30H]
       33C0                 xor      eax, eax
       488945F8             mov      qword ptr [rbp-08H], rax
       488945F0             mov      qword ptr [rbp-10H], rax
       48894D10             mov      gword ptr [rbp+10H], rcx
       895518               mov      dword ptr [rbp+18H], edx

G_M000_IG02:                ;; offset=001BH
       FF059F220F00         inc      dword ptr [(reloc 0x7ffc3f1b2ea0)]
       488B4D10             mov      rcx, gword ptr [rbp+10H]
       48894DF8             mov      gword ptr [rbp-08H], rcx
       488B4DF8             mov      rcx, gword ptr [rbp-08H]
       48BAA82E1B3FFC7F0000 mov      rdx, 0x7FFC3F1B2EA8
       E8B47EC55F           call     CORINFO_HELP_CLASSPROFILE32
       488B4DF8             mov      rcx, gword ptr [rbp-08H]
       48894DF0             mov      gword ptr [rbp-10H], rcx
       488B4DF0             mov      rcx, gword ptr [rbp-10H]
       33D2                 xor      edx, edx
       817D18FFFFFF7F       cmp      dword ptr [rbp+18H], 0x7FFFFFFF
       0F94C2               sete     dl
       49BB0800F13EFC7F0000 mov      r11, 0x7FFC3EF10008
       41FF13               call     [r11]IPrinter:PrintIfTrue(bool):this
       90                   nop

G_M000_IG03:                ;; offset=0062H
       4883C430             add      rsp, 48
       5D                   pop      rbp
       C3                   ret

and most notably, you can see the call [r11]IPrinter:PrintIfTrue(bool):this doing the interface dispatch. But, then look at the code generated for tier-1. We still see the call [r11]IPrinter:PrintIfTrue(bool):this, but we also see this:
G_M000_IG02:                ;; offset=0020H
       48B9982D1B3FFC7F0000 mov      rcx, 0x7FFC3F1B2D98
       48390F               cmp      qword ptr [rdi], rcx
       7521                 jne      SHORT G_M000_IG05
       81FEFFFFFF7F         cmp      esi, 0x7FFFFFFF
       7404                 je       SHORT G_M000_IG04

G_M000_IG03:                ;; offset=0037H
       FFC6                 inc      esi
       EBE5                 jmp      SHORT G_M000_IG02

G_M000_IG04:                ;; offset=003BH
       48B9D820801A24020000 mov      rcx, 0x2241A8020D8
       488B09               mov      rcx, gword ptr [rcx]
       FF1572CD0D00         call     [Console:WriteLine(String)]
       EBE7                 jmp      SHORT G_M000_IG03

That first block is checking the concrete type of the IPrinter (stored in rdi) and comparing it against the known type for Printer (0x7FFC3F1B2D98). If they’re different, it just jumps to the same interface dispatch it was doing in the unoptimized version. But if they’re the same, it then jumps directly to an inlined version of Printer.PrintIfTrue (you can see the call to Console:WriteLine right there in this method). Thus, the common case (the only case in this example) is super efficient at the expense of a single comparison and branch.
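In rough C# terms, and using the IPrinter/Printer types from the example above, the guarded devirtualization the JIT emitted behaves like this sketch (illustrative only; the transformation happens in the generated assembly, not in your IL):

static void DoWork(IPrinter printer, int i)
{
    if (printer.GetType() == typeof(Printer))
    {
        // Guarded fast path: devirtualized and inlined Printer.PrintIfTrue.
        if (i == int.MaxValue) Console.WriteLine("Print!");
    }
    else
    {
        // Fallback: normal interface dispatch for any other IPrinter.
        printer.PrintIfTrue(i == int.MaxValue);
    }
}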
That all existed in .NET 6, so why are we talking about it now? Several things have improved. First, PGO now works with OSR, thanks to improvements like dotnet/runtime#61453. That’s a big deal, as it means hot long-running methods that do this kind of interface dispatch (which are fairly common) can get these kinds of devirtualization/inlining optimizations. Second, while PGO isn’t currently enabled by default, we’ve made it much easier to turn on. Between dotnet/runtime#71438 and dotnet/sdk#26350, it’s now possible to simply put <TieredPGO>true</TieredPGO> into your project file, which has the same effect as setting the DOTNET_TieredPGO environment variable to 1 for every invocation of the app.
PGO already knew how to instrument virtual dispatch. Now in .NET 7, thanks in large part to dotnet/runtime#68703, it can do so for delegates as well (at least for delegates to instance methods). Consider this simple console app:
using System.Runtime.CompilerServices;

class Program
{
    static int[] s_values = Enumerable.Range(0, 1_000).ToArray();

    static void Main()
    {
        for (int i = 0; i < 1_000_000; i++)
            Sum(s_values, i => i * 42);
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static int Sum(int[] values, Func<int, int> func)
    {
        int sum = 0;
        foreach (int value in values)
            sum += func(value);
        return sum;
    }
}

Without PGO enabled, I get generated optimized assembly like this:
; Assembly listing for method Program:Sum(ref,Func`2):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; partially interruptible
; No PGO data

G_M000_IG01:                ;; offset=0000H
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC20             sub      rsp, 32
       488BF2               mov      rsi, rdx

G_M000_IG02:                ;; offset=000DH
       33FF                 xor      edi, edi
       488BD9               mov      rbx, rcx
       33ED                 xor      ebp, ebp
       448B7308             mov      r14d, dword ptr [rbx+08H]
       4585F6               test     r14d, r14d
       7E16                 jle      SHORT G_M000_IG04

G_M000_IG03:                ;; offset=001DH
       8BD5                 mov      edx, ebp
       8B549310             mov      edx, dword ptr [rbx+4*rdx+10H]
       488B4E08             mov      rcx, gword ptr [rsi+08H]
       FF5618               call     [rsi+18H]Func`2:Invoke(int):int:this
       03F8                 add      edi, eax
       FFC5                 inc      ebp
       443BF5               cmp      r14d, ebp
       7FEA                 jg       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=0033H
       8BC7                 mov      eax, edi

G_M000_IG05:                ;; offset=0035H
       4883C420             add      rsp, 32
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       C3                   ret

; Total bytes of code 64

Note the call [rsi+18H]Func`2:Invoke(int):int:this in there that’s invoking the delegate. Now with PGO enabled:
; Assembly listing for method Program:Sum(ref,Func`2):int
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; fully interruptible
; with Dynamic PGO: edge weights are valid, and fgCalledCount is 5628
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       4157                 push     r15
       4156                 push     r14
       57                   push     rdi
       56                   push     rsi
       55                   push     rbp
       53                   push     rbx
       4883EC28             sub      rsp, 40
       488BF2               mov      rsi, rdx

G_M000_IG02:                ;; offset=000FH
       33FF                 xor      edi, edi
       488BD9               mov      rbx, rcx
       33ED                 xor      ebp, ebp
       448B7308             mov      r14d, dword ptr [rbx+08H]
       4585F6               test     r14d, r14d
       7E27                 jle      SHORT G_M000_IG05

G_M000_IG03:                ;; offset=001FH
       8BC5                 mov      eax, ebp
       8B548310             mov      edx, dword ptr [rbx+4*rax+10H]
       4C8B4618             mov      r8, qword ptr [rsi+18H]
       48B8A0C2CF3CFC7F0000 mov      rax, 0x7FFC3CCFC2A0
       4C3BC0               cmp      r8, rax
       751D                 jne      SHORT G_M000_IG07
       446BFA2A             imul     r15d, edx, 42

G_M000_IG04:                ;; offset=003CH
       4103FF               add      edi, r15d
       FFC5                 inc      ebp
       443BF5               cmp      r14d, ebp
       7FD9                 jg       SHORT G_M000_IG03

G_M000_IG05:                ;; offset=0046H
       8BC7                 mov      eax, edi

G_M000_IG06:                ;; offset=0048H
       4883C428             add      rsp, 40
       5B                   pop      rbx
       5D                   pop      rbp
       5E                   pop      rsi
       5F                   pop      rdi
       415E                 pop      r14
       415F                 pop      r15
       C3                   ret

G_M000_IG07:                ;; offset=0055H
       488B4E08             mov      rcx, gword ptr [rsi+08H]
       41FFD0               call     r8
       448BF8               mov      r15d, eax
       EBDB                 jmp      SHORT G_M000_IG04

I chose the 42 constant in i => i * 42 to make it easy to see in the assembly, and sure enough, there it is:
G_M000_IG03:                ;; offset=001FH
       8BC5                 mov      eax, ebp
       8B548310             mov      edx, dword ptr [rbx+4*rax+10H]
       4C8B4618             mov      r8, qword ptr [rsi+18H]
       48B8A0C2CF3CFC7F0000 mov      rax, 0x7FFC3CCFC2A0
       4C3BC0               cmp      r8, rax
       751D                 jne      SHORT G_M000_IG07
       446BFA2A             imul     r15d, edx, 42

This is loading the target address from the delegate into r8 and is loading the address of the expected target into rax. If they’re the same, it then simply performs the inlined operation (imul r15d, edx, 42), and otherwise it jumps to G_M000_IG07 which calls to the function in r8. The effect of this is obvious if we run this as a benchmark:
static int[] s_values = Enumerable.Range(0, 1_000).ToArray();

[Benchmark]
public int DelegatePGO() => Sum(s_values, i => i * 42);

static int Sum(int[] values, Func<int, int> func)
{
    int sum = 0;
    foreach (int value in values)
        sum += func(value);
    return sum;
}

With PGO disabled, we get the same performance throughput for .NET 6 and .NET 7:
Method | Runtime | Mean | Ratio |
---|---|---|---|
DelegatePGO | .NET 6.0 | 1.665 us | 1.00 |
DelegatePGO | .NET 7.0 | 1.659 us | 1.00 |
But the picture changes when we enable dynamic PGO (DOTNET_TieredPGO=1). .NET 6 gets ~14% faster, but .NET 7 gets ~3x faster!
Method | Runtime | Mean | Ratio |
---|---|---|---|
DelegatePGO | .NET 6.0 | 1,427.7 ns | 1.00 |
DelegatePGO | .NET 7.0 | 539.0 ns | 0.38 |
dotnet/runtime#70377 is another valuable improvement with dynamic PGO, which enables PGO to play nicely with loop cloning and invariant hoisting. To understand this better, a brief digression into what those are. Loop cloning is a mechanism the JIT employs to avoid various overheads in the fast path of a loop. Consider the Test method in this example:
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        int[] array = new int[10_000_000];
        for (int i = 0; i < 1_000_000; i++)
        {
            Test(array);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    private static bool Test(int[] array)
    {
        for (int i = 0; i < 0x12345; i++)
        {
            if (array[i] == 42)
            {
                return true;
            }
        }

        return false;
    }
}

The JIT doesn’t know whether the passed in array is of sufficient length that all accesses to array[i] inside the loop will be in bounds, and thus it would need to inject bounds checks for every access. While it’d be nice to simply do the length check up front and simply throw an exception early if it wasn’t long enough, doing so could also change behavior (imagine the method were writing into the array as it went, or otherwise mutating some shared state). Instead, the JIT employs “loop cloning.” It essentially rewrites this Test method to be more like this:
if (array is not null && array.Length >= 0x12345)
{
    for (int i = 0; i < 0x12345; i++)
    {
        if (array[i] == 42) // no bounds checks emitted for this access :-)
        {
            return true;
        }
    }
}
else
{
    for (int i = 0; i < 0x12345; i++)
    {
        if (array[i] == 42) // bounds checks emitted for this access :-(
        {
            return true;
        }
    }
}
return false;

That way, at the expense of some code duplication, we get our fast loop without bounds checks and only pay for the bounds checks in the slow path. You can see this in the generated assembly (if you can’t already tell, DOTNET_JitDisasm is one of my favorite features in .NET 7):
; Assembly listing for method Program:Test(ref):bool
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; rsp based frame
; fully interruptible
; No PGO data

G_M000_IG01:                ;; offset=0000H
       4883EC28             sub      rsp, 40

G_M000_IG02:                ;; offset=0004H
       33C0                 xor      eax, eax
       4885C9               test     rcx, rcx
       7429                 je       SHORT G_M000_IG05
       81790845230100       cmp      dword ptr [rcx+08H], 0x12345
       7C20                 jl       SHORT G_M000_IG05
       0F1F40000F1F840000000000 align [12 bytes for IG03]

G_M000_IG03:                ;; offset=0020H
       8BD0                 mov      edx, eax
       837C91102A           cmp      dword ptr [rcx+4*rdx+10H], 42
       7429                 je       SHORT G_M000_IG08
       FFC0                 inc      eax
       3D45230100           cmp      eax, 0x12345
       7CEE                 jl       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=0032H
       EB17                 jmp      SHORT G_M000_IG06

G_M000_IG05:                ;; offset=0034H
       3B4108               cmp      eax, dword ptr [rcx+08H]
       7323                 jae      SHORT G_M000_IG10
       8BD0                 mov      edx, eax
       837C91102A           cmp      dword ptr [rcx+4*rdx+10H], 42
       7410                 je       SHORT G_M000_IG08
       FFC0                 inc      eax
       3D45230100           cmp      eax, 0x12345
       7CE9                 jl       SHORT G_M000_IG05

G_M000_IG06:                ;; offset=004BH
       33C0                 xor      eax, eax

G_M000_IG07:                ;; offset=004DH
       4883C428             add      rsp, 40
       C3                   ret

G_M000_IG08:                ;; offset=0052H
       B801000000           mov      eax, 1

G_M000_IG09:                ;; offset=0057H
       4883C428             add      rsp, 40
       C3                   ret

G_M000_IG10:                ;; offset=005CH
       E81FA0C15F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3

; Total bytes of code 98

That G_M000_IG02 section is doing the null check and the length check, jumping to the G_M000_IG05 block if either fails. If both succeed, it’s then executing the loop (block G_M000_IG03) without bounds checks:
G_M000_IG03:                ;; offset=0020H
       8BD0                 mov      edx, eax
       837C91102A           cmp      dword ptr [rcx+4*rdx+10H], 42
       7429                 je       SHORT G_M000_IG08
       FFC0                 inc      eax
       3D45230100           cmp      eax, 0x12345
       7CEE                 jl       SHORT G_M000_IG03

with the bounds checks only showing up in the slow-path block:
G_M000_IG05:                ;; offset=0034H
       3B4108               cmp      eax, dword ptr [rcx+08H]
       7323                 jae      SHORT G_M000_IG10
       8BD0                 mov      edx, eax
       837C91102A           cmp      dword ptr [rcx+4*rdx+10H], 42
       7410                 je       SHORT G_M000_IG08
       FFC0                 inc      eax
       3D45230100           cmp      eax, 0x12345
       7CE9                 jl       SHORT G_M000_IG05

That’s “loop cloning.” What about “invariant hoisting”? Hoisting means pulling something out of a loop to be before the loop, and invariants are things that don’t change. Thus invariant hoisting is pulling something out of a loop to before the loop in order to avoid recomputing every iteration of the loop an answer that won’t change. Effectively, the previous example already showed invariant hoisting, in that the bounds check is moved to be before the loop rather than in the loop, but a more concrete example would be something like this:
[MethodImpl(MethodImplOptions.NoInlining)]
private static bool Test(int[] array)
{
    for (int i = 0; i < 0x12345; i++)
    {
        if (array[i] == array.Length - 42)
        {
            return true;
        }
    }

    return false;
}

Note that the value of array.Length - 42 doesn’t change on each iteration of the loop, so it’s “invariant” to the loop iteration and can be lifted out, which the generated code does:
G_M000_IG02:                ;; offset=0004H
       33D2                 xor      edx, edx
       4885C9               test     rcx, rcx
       742A                 je       SHORT G_M000_IG05
       448B4108             mov      r8d, dword ptr [rcx+08H]
       4181F845230100       cmp      r8d, 0x12345
       7C1D                 jl       SHORT G_M000_IG05
       4183C0D6             add      r8d, -42
       0F1F4000             align    [4 bytes for IG03]

G_M000_IG03:                ;; offset=0020H
       8BC2                 mov      eax, edx
       4439448110           cmp      dword ptr [rcx+4*rax+10H], r8d
       7433                 je       SHORT G_M000_IG08
       FFC2                 inc      edx
       81FA45230100         cmp      edx, 0x12345
       7CED                 jl       SHORT G_M000_IG03

Here again we see the array being tested for null (test rcx, rcx) and the array’s length being checked (mov r8d, dword ptr [rcx+08H] then cmp r8d, 0x12345), but then with the array’s length in r8d, we then see this up-front block subtracting 42 from the length (add r8d, -42), and that’s before we continue into the fast-path loop in the G_M000_IG03 block. This keeps that additional set of operations out of the loop, thereby avoiding the overhead of recomputing the value per iteration.
Ok, so how does this apply to dynamic PGO? Remember that with the interface/virtual dispatch avoidance PGO is able to do, it does so by doing a type check to see whether the type in use is the most common type; if it is, it uses a fast path that calls directly to that type’s method (and in doing so that call is then potentially inlined), and if it isn’t, it falls back to normal interface/virtual dispatch. That check can be invariant to a loop. So when a method is tiered up and PGO kicks in, the type check can now be hoisted out of the loop, making it even cheaper to handle the common case. Consider this variation of our original example:
using System.Runtime.CompilerServices;

class Program
{
    static void Main()
    {
        IPrinter printer = new BlankPrinter();
        while (true)
        {
            DoWork(printer);
        }
    }

    [MethodImpl(MethodImplOptions.NoInlining)]
    static void DoWork(IPrinter printer)
    {
        for (int j = 0; j < 123; j++)
        {
            printer.Print(j);
        }
    }

    interface IPrinter
    {
        void Print(int i);
    }

    class BlankPrinter : IPrinter
    {
        public void Print(int i)
        {
            Console.Write("");
        }
    }
}

When we look at the optimized assembly generated for this with dynamic PGO enabled, we see this:
; Assembly listing for method Program:DoWork(IPrinter)
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with Dynamic PGO: edge weights are invalid, and fgCalledCount is 12187
; 0 inlinees with PGO data; 1 single block inlinees; 0 inlinees without PGO data

G_M000_IG01:                ;; offset=0000H
       57                   push     rdi
       56                   push     rsi
       4883EC28             sub      rsp, 40
       488BF1               mov      rsi, rcx

G_M000_IG02:                ;; offset=0009H
       33FF                 xor      edi, edi
       4885F6               test     rsi, rsi
       742B                 je       SHORT G_M000_IG05
       48B9982DD43CFC7F0000 mov      rcx, 0x7FFC3CD42D98
       48390E               cmp      qword ptr [rsi], rcx
       751C                 jne      SHORT G_M000_IG05

G_M000_IG03:                ;; offset=001FH
       48B9282040F948020000 mov      rcx, 0x248F9402028
       488B09               mov      rcx, gword ptr [rcx]
       FF1526A80D00         call     [Console:Write(String)]
       FFC7                 inc      edi
       83FF7B               cmp      edi, 123
       7CE6                 jl       SHORT G_M000_IG03

G_M000_IG04:                ;; offset=0039H
       EB29                 jmp      SHORT G_M000_IG07

G_M000_IG05:                ;; offset=003BH
       48B9982DD43CFC7F0000 mov      rcx, 0x7FFC3CD42D98
       48390E               cmp      qword ptr [rsi], rcx
       7521                 jne      SHORT G_M000_IG08
       48B9282040F948020000 mov      rcx, 0x248F9402028
       488B09               mov      rcx, gword ptr [rcx]
       FF15FBA70D00         call     [Console:Write(String)]

G_M000_IG06:                ;; offset=005DH
       FFC7                 inc      edi
       83FF7B               cmp      edi, 123
       7CD7                 jl       SHORT G_M000_IG05

G_M000_IG07:                ;; offset=0064H
       4883C428             add      rsp, 40
       5E                   pop      rsi
       5F                   pop      rdi
       C3                   ret

G_M000_IG08:                ;; offset=006BH
       488BCE               mov      rcx, rsi
       8BD7                 mov      edx, edi
       49BB1000AA3CFC7F0000 mov      r11, 0x7FFC3CAA0010
       41FF13               call     [r11]IPrinter:Print(int):this
       EBDE                 jmp      SHORT G_M000_IG06

; Total bytes of code 127

We can see in the G_M000_IG02 block that it’s doing the type check on the IPrinter instance and jumping to G_M000_IG05 if the check fails (mov rcx, 0x7FFC3CD42D98 then cmp qword ptr [rsi], rcx then jne SHORT G_M000_IG05), otherwise falling through to G_M000_IG03, which is a tight fast-path loop with BlankPrinter.Print inlined and no type checks in sight!
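In rough C# terms, and given the IPrinter/BlankPrinter types from the example above, the combination of PGO, loop cloning, and hoisting behaves like this sketch (illustrative only; the JIT does this in the generated assembly):

static void DoWork(IPrinter printer)
{
    // Null check and type check hoisted out of the loop:
    if (printer is not null && printer.GetType() == typeof(BlankPrinter))
    {
        for (int j = 0; j < 123; j++)
        {
            Console.Write(""); // devirtualized and inlined BlankPrinter.Print
        }
    }
    else
    {
        for (int j = 0; j < 123; j++)
        {
            printer.Print(j); // fallback: per-iteration check/dispatch
        }
    }
}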
Interestingly, improvements like this can bring with them their own challenges. PGO leads to a significant increase in the number of type checks, since call sites that specialize for a given type need to compare against that type. However, common subexpression elimination (CSE) hasn’t historically worked for such type handles (CSE is a compiler optimization where duplicate expressions are eliminated by computing the result once and then storing it for subsequent use rather than recomputing it each time). dotnet/runtime#70580 fixes this by enabling CSE for such constant handles. For example, consider this method:
[Benchmark]
[Arguments("", "", "", "")]
public bool AllAreStrings(object o1, object o2, object o3, object o4) =>
    o1 is string && o2 is string && o3 is string && o4 is string;

On .NET 6, the JIT produced this assembly code:
; Program.AllAreStrings(System.Object, System.Object, System.Object, System.Object)
       test      rdx,rdx
       je        short M00_L01
       mov       rax,offset MT_System.String
       cmp       [rdx],rax
       jne       short M00_L01
       test      r8,r8
       je        short M00_L01
       mov       rax,offset MT_System.String
       cmp       [r8],rax
       jne       short M00_L01
       test      r9,r9
       je        short M00_L01
       mov       rax,offset MT_System.String
       cmp       [r9],rax
       jne       short M00_L01
       mov       rax,[rsp+28]
       test      rax,rax
       je        short M00_L00
       mov       rdx,offset MT_System.String
       cmp       [rax],rdx
       je        short M00_L00
       xor       eax,eax
M00_L00:
       test      rax,rax
       setne     al
       movzx     eax,al
       ret
M00_L01:
       xor       eax,eax
       ret
; Total bytes of code 100

Note the C# has four tests for string and the assembly code has four loads with mov rax,offset MT_System.String. Now on .NET 7, the load is performed just once:
; Program.AllAreStrings(System.Object, System.Object, System.Object, System.Object)
       test      rdx,rdx
       je        short M00_L01
       mov       rax,offset MT_System.String
       cmp       [rdx],rax
       jne       short M00_L01
       test      r8,r8
       je        short M00_L01
       cmp       [r8],rax
       jne       short M00_L01
       test      r9,r9
       je        short M00_L01
       cmp       [r9],rax
       jne       short M00_L01
       mov       rdx,[rsp+28]
       test      rdx,rdx
       je        short M00_L00
       cmp       [rdx],rax
       je        short M00_L00
       xor       edx,edx
M00_L00:
       xor       eax,eax
       test      rdx,rdx
       setne     al
       ret
M00_L01:
       xor       eax,eax
       ret
; Total bytes of code 69

Bounds Check Elimination
One of the things that makes .NET attractive is its safety. The runtime guards access to arrays, strings, and spans such that you can’t accidentally corrupt memory by walking off either end; if you do, rather than reading/writing arbitrary memory, you’ll get exceptions. Of course, that’s not magic; it’s done by the JIT inserting bounds checks every time one of these data structures is indexed. For example, this:
[MethodImpl(MethodImplOptions.NoInlining)]
static int Read0thElement(int[] array) => array[0];

results in:
G_M000_IG01:                ;; offset=0000H
       4883EC28             sub      rsp, 40

G_M000_IG02:                ;; offset=0004H
       83790800             cmp      dword ptr [rcx+08H], 0
       7608                 jbe      SHORT G_M000_IG04
       8B4110               mov      eax, dword ptr [rcx+10H]

G_M000_IG03:                ;; offset=000DH
       4883C428             add      rsp, 40
       C3                   ret

G_M000_IG04:                ;; offset=0012H
       E8E9A0C25F           call     CORINFO_HELP_RNGCHKFAIL
       CC                   int3

The array is passed into this method in the rcx register, pointing to the method table pointer in the object, and the length of an array is stored in the object just after that method table pointer (which is 8 bytes in a 64-bit process). Thus the cmp dword ptr [rcx+08H], 0 instruction is reading the length of the array and comparing the length to 0; that makes sense, since the length can’t be negative, and we’re trying to access the 0th element, so as long as the length isn’t 0, the array has enough elements for us to access its 0th element. In the event that the length was 0, the code jumps to the end of the function, which contains call CORINFO_HELP_RNGCHKFAIL; that’s a JIT helper function that throws an IndexOutOfRangeException. If the length was sufficient, however, it then reads the int stored at the beginning of the array’s data, which on 64-bit is 16 bytes (0x10) past the pointer (mov eax, dword ptr [rcx+10H]).
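Put differently, the object layout the JIT is relying on here looks roughly like this (a conceptual diagram derived from the offsets just described):

// Conceptual layout of an int[] in a 64-bit process (from the offsets above):
//
//   [rcx+0x00]  method table pointer   (8 bytes)
//   [rcx+0x08]  Length                 (4 bytes, followed by padding)
//   [rcx+0x10]  element 0              <- mov eax, dword ptr [rcx+10H]
//   [rcx+0x14]  element 1
//   ...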
While these bounds checks in and of themselves aren’t super expensive, perform a lot of them and their costs add up. So while the JIT needs to ensure that “safe” accesses don’t go out of bounds, it also tries to prove that certain accesses won’t, in which case it needn’t emit the bounds check that it knows would be superfluous. In every release of .NET, more and more cases have been added to find places these bounds checks can be eliminated, and .NET 7 is no exception.
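The canonical case, which the JIT has handled for years, is a loop whose induction variable is provably bounded by the array’s length; a quick illustration (not tied to any particular PR mentioned here):

```csharp
static int Sum(int[] data)
{
    int sum = 0;
    for (int i = 0; i < data.Length; i++)
    {
        // The loop condition proves 0 <= i < data.Length on every iteration,
        // so the JIT can omit the bounds check on this access.
        sum += data[i];
    }
    return sum;
}
```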
For example, dotnet/runtime#61662 from @anthonycanino enabled the JIT to understand various forms of binary operations as part of range checks. Consider this method:
```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
private static ushort[]? Convert(ReadOnlySpan<byte> bytes)
{
    if (bytes.Length != 16)
    {
        return null;
    }

    ushort[] result = new ushort[8];
    for (int i = 0; i < result.Length; i++)
    {
        result[i] = (ushort)((bytes[i * 2] << 8) + bytes[i * 2 + 1]);
    }

    return result;
}
```

It’s validating that the input span is 16 bytes long and then creating a new ushort[8] where each ushort in the array combines two of the input bytes. To do that, it’s looping over the output array, and indexing into the bytes array using i * 2 and i * 2 + 1 as the indices. On .NET 6, each of those indexing operations would result in a bounds check, with assembly like:
```
       cmp       r8d,10
       jae       short G_M000_IG04
       movsxd    r8,r8d
```

where that G_M000_IG04 is the call CORINFO_HELP_RNGCHKFAIL we’re now familiar with. But on .NET 7, we get this assembly for the method:
```
G_M000_IG01:                ;; offset=0000H
       56                   push     rsi
       4883EC20             sub      rsp, 32
G_M000_IG02:                ;; offset=0005H
       488B31               mov      rsi, bword ptr [rcx]
       8B4908               mov      ecx, dword ptr [rcx+08H]
       83F910               cmp      ecx, 16
       754C                 jne      SHORT G_M000_IG05
       48B9302F542FFC7F0000 mov      rcx, 0x7FFC2F542F30
       BA08000000           mov      edx, 8
       E80C1EB05F           call     CORINFO_HELP_NEWARR_1_VC
       33D2                 xor      edx, edx
                            align    [0 bytes for IG03]
G_M000_IG03:                ;; offset=0026H
       8D0C12               lea      ecx, [rdx+rdx]
       448BC1               mov      r8d, ecx
       FFC1                 inc      ecx
       458BC0               mov      r8d, r8d
       460FB60406           movzx    r8, byte ptr [rsi+r8]
       41C1E008             shl      r8d, 8
       8BC9                 mov      ecx, ecx
       0FB60C0E             movzx    rcx, byte ptr [rsi+rcx]
       4103C8               add      ecx, r8d
       0FB7C9               movzx    rcx, cx
       448BC2               mov      r8d, edx
       6642894C4010         mov      word ptr [rax+2*r8+10H], cx
       FFC2                 inc      edx
       83FA08               cmp      edx, 8
       7CD0                 jl       SHORT G_M000_IG03
G_M000_IG04:                ;; offset=0056H
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret
G_M000_IG05:                ;; offset=005CH
       33C0                 xor      rax, rax
G_M000_IG06:                ;; offset=005EH
       4883C420             add      rsp, 32
       5E                   pop      rsi
       C3                   ret
; Total bytes of code 100
```

No bounds checks, which is most easily seen by the lack of the telltale call CORINFO_HELP_RNGCHKFAIL at the end of the method. With this PR, the JIT is able to understand the impact of certain multiplication and shift operations and their relationships to the bounds of the data structure. Since it can see that the result array’s length is 8 and the loop is iterating from 0 to that exclusive upper bound, it knows that i will always be in the range [0, 7], which means that i * 2 will always be in the range [0, 14] and i * 2 + 1 will always be in the range [0, 15]. As such, it’s able to prove that the bounds checks aren’t needed.
dotnet/runtime#61569 and dotnet/runtime#62864 also help to eliminate bounds checks when dealing with constant strings and spans initialized from RVA statics (“Relative Virtual Address” static fields, basically a static field that lives in a module’s data section). For example, consider this benchmark:
```csharp
[Benchmark]
[Arguments(1)]
public char GetChar(int i)
{
    const string Text = "hello";
    return (uint)i < Text.Length ? Text[i] : '\0';
}
```

On .NET 6, we get this assembly:
```
; Program.GetChar(Int32)
       sub       rsp,28
       mov       eax,edx
       cmp       rax,5
       jl        short M00_L00
       xor       eax,eax
       add       rsp,28
       ret
M00_L00:
       cmp       edx,5
       jae       short M00_L01
       mov       rax,2278B331450
       mov       rax,[rax]
       movsxd    rdx,edx
       movzx     eax,word ptr [rax+rdx*2+0C]
       add       rsp,28
       ret
M00_L01:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 56
```

The beginning of this makes sense: the JIT was obviously able to see that the length of Text is 5, so it’s implementing the (uint)i < Text.Length check by doing cmp rax,5, and if i as an unsigned value is greater than or equal to 5, it’s then zero’ing out the return value (to return the '\0') and exiting. If i is less than 5 (in which case it’s also at least 0 due to the unsigned comparison), it then jumps to M00_L00 to read the value from the string… but we then see another cmp against 5, this time as part of a range check. So even though the JIT knew the index was in bounds, it wasn’t able to remove the bounds check. Now it is; in .NET 7, we get this:
```
; Program.GetChar(Int32)
       cmp       edx,5
       jb        short M00_L00
       xor       eax,eax
       ret
M00_L00:
       mov       rax,2B0AF002530
       mov       rax,[rax]
       mov       edx,edx
       movzx     eax,word ptr [rax+rdx*2+0C]
       ret
; Total bytes of code 29
```

So much nicer.
dotnet/runtime#67141 is a great example of how evolving ecosystem needs drive specific optimizations into the JIT. The Regex compiler and source generator handle some cases of regular expression character classes by using a bitmap lookup stored in strings. For example, to determine whether a char c is in the character class “[A-Za-z0-9_]” (which will match an underscore or any ASCII letter or digit), the implementation ends up generating an expression like the body of the following method:
```csharp
[Benchmark]
[Arguments('a')]
public bool IsInSet(char c) =>
    c < 128 && ("\0\0\0\u03FF\uFFFE\u87FF\uFFFE\u07FF"[c >> 4] & (1 << (c & 0xF))) != 0;
```

The implementation is treating an 8-character string as a 128-bit lookup table. If the character is known to be in range (such that it’s effectively a 7-bit value), it’s then using the top 3 bits of the value to index into the 8 elements of the string, and the bottom 4 bits to select one of the 16 bits in that element, giving us an answer as to whether this input character is in the set or not. For example, for c = 'a' (0x61), c >> 4 is 6, so we read the '\uFFFE' at index 6, and c & 0xF is 1, so we test bit 1 of 0xFFFE, which is set: 'a' is in the set. In .NET 6, even though we know the character is in range of the string, the JIT couldn’t see through either the length comparison or the bit shift.
```
; Program.IsInSet(Char)
       sub       rsp,28
       movzx     eax,dx
       cmp       eax,80
       jge       short M00_L00
       mov       edx,eax
       sar       edx,4
       cmp       edx,8
       jae       short M00_L01
       mov       rcx,299835A1518
       mov       rcx,[rcx]
       movsxd    rdx,edx
       movzx     edx,word ptr [rcx+rdx*2+0C]
       and       eax,0F
       bt        edx,eax
       setb      al
       movzx     eax,al
       add       rsp,28
       ret
M00_L00:
       xor       eax,eax
       add       rsp,28
       ret
M00_L01:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 75
```

The previously mentioned PR takes care of the length check. And this PR takes care of the bit shift. So in .NET 7, we get this loveliness:
```
; Program.IsInSet(Char)
       movzx     eax,dx
       cmp       eax,80
       jge       short M00_L00
       mov       edx,eax
       sar       edx,4
       mov       rcx,197D4800608
       mov       rcx,[rcx]
       mov       edx,edx
       movzx     edx,word ptr [rcx+rdx*2+0C]
       and       eax,0F
       bt        edx,eax
       setb      al
       movzx     eax,al
       ret
M00_L00:
       xor       eax,eax
       ret
; Total bytes of code 51
```

Note the distinct lack of a call CORINFO_HELP_RNGCHKFAIL. And as you might guess, this check can happen a lot in a Regex, making this a very useful addition.
Bounds checks are an obvious source of overhead when talking about array access, but they’re not the only ones. There’s also the need to use the cheapest instructions possible. In .NET 6, with a method like:
```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
private static int Get(int[] values, int i) => values[i];
```

assembly code like the following would be generated:
```
; Program.Get(Int32[], Int32)
       sub       rsp,28
       cmp       edx,[rcx+8]
       jae       short M01_L00
       movsxd    rax,edx
       mov       eax,[rcx+rax*4+10]
       add       rsp,28
       ret
M01_L00:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 27
```

This should look fairly familiar from our previous discussion; the JIT is loading the array’s length ([rcx+8]) and comparing that with the value of i (in edx), and then jumping to the end to throw an exception if i is out of bounds. Immediately after that jump we see a movsxd rax, edx instruction, which is taking the 32-bit value of i from edx and moving it into the 64-bit register rax. And as part of moving it, it’s sign-extending it; that’s the “sxd” part of the instruction name (sign-extending means the upper 32 bits of the new 64-bit value will be set to the value of the upper bit of the 32-bit value, so that the number retains its signed value). The interesting thing is, though, we know that the Length of an array and of a span is non-negative, and since we just bounds checked i against the Length, we also know that i is non-negative. That makes such sign-extension useless, since the upper bit is guaranteed to be 0. Since the mov instruction that zero-extends is a tad cheaper than movsxd, we can simply use that instead. And that’s exactly what dotnet/runtime#57970 from @pentp does for both arrays and spans (dotnet/runtime#70884 also similarly avoids some signed casts in other situations). Now on .NET 7, we get this:
```
; Program.Get(Int32[], Int32)
       sub       rsp,28
       cmp       edx,[rcx+8]
       jae       short M01_L00
       mov       eax,edx
       mov       eax,[rcx+rax*4+10]
       add       rsp,28
       ret
M01_L00:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 26
```

That’s not the only source of overhead with array access, though. In fact, there’s a very large category of array access overhead that’s been there forever, but that’s so well known there are even old FxCop rules and newer Roslyn analyzers that warn against it: multidimensional array accesses. The overhead in the case of a multidimensional array isn’t just an extra branch on every indexing operation, or additional math required to compute the location of the element, but rather that they currently pass through the JIT’s optimization phases largely unmodified. dotnet/runtime#70271 improves the state of the world here by doing an expansion of a multidimensional array access early in the JIT’s pipeline, such that later optimization phases can improve multidimensional accesses as they would other code, including CSE and loop invariant hoisting. The impact of this is visible in a simple benchmark that sums all the elements of a multidimensional array.
```csharp
private int[,] _square;

[Params(1000)]
public int Size { get; set; }

[GlobalSetup]
public void Setup()
{
    int count = 0;
    _square = new int[Size, Size];
    for (int i = 0; i < Size; i++)
    {
        for (int j = 0; j < Size; j++)
        {
            _square[i, j] = count++;
        }
    }
}

[Benchmark]
public int Sum()
{
    int[,] square = _square;
    int sum = 0;

    for (int i = 0; i < Size; i++)
    {
        for (int j = 0; j < Size; j++)
        {
            sum += square[i, j];
        }
    }

    return sum;
}
```

Method | Runtime | Mean | Ratio |
---|---|---|---|
Sum | .NET 6.0 | 964.1 us | 1.00 |
Sum | .NET 7.0 | 674.7 us | 0.70 |
This previous example assumes you know the size of each dimension of the multidimensional array (it’s referring to the Size directly in the loops). That’s obviously not always (or maybe even rarely) the case. In such situations, you’d be more likely to use the Array.GetUpperBound method, and because multidimensional arrays can have a non-zero lower bound, Array.GetLowerBound. That would lead to code like this:
```csharp
private int[,] _square;

[Params(1000)]
public int Size { get; set; }

[GlobalSetup]
public void Setup()
{
    int count = 0;
    _square = new int[Size, Size];
    for (int i = 0; i < Size; i++)
    {
        for (int j = 0; j < Size; j++)
        {
            _square[i, j] = count++;
        }
    }
}

[Benchmark]
public int Sum()
{
    int[,] square = _square;
    int sum = 0;

    for (int i = square.GetLowerBound(0); i < square.GetUpperBound(0); i++)
    {
        for (int j = square.GetLowerBound(1); j < square.GetUpperBound(1); j++)
        {
            sum += square[i, j];
        }
    }

    return sum;
}
```

In .NET 7, thanks to dotnet/runtime#60816, those GetLowerBound and GetUpperBound calls become JIT intrinsics. An “intrinsic” to a compiler is something the compiler has intrinsic knowledge of, such that rather than relying solely on a method’s defined implementation (if it even has one), the compiler can substitute in something it considers to be better. There are literally thousands of methods in .NET known in this manner to the JIT, with GetLowerBound and GetUpperBound being two of the most recent. Now as intrinsics, when they’re passed a constant value (e.g. 0 for the 0th rank), the JIT can substitute the necessary assembly instructions to read directly from the memory location that houses the bounds. Here’s what the assembly code for this benchmark looked like with .NET 6; the main thing to see here are all of the calls out to GetLowerBound and GetUpperBound:
```
; Program.Sum()
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,28
       mov       rsi,[rcx+8]
       xor       edi,edi
       mov       rcx,rsi
       xor       edx,edx
       cmp       [rcx],ecx
       call      System.Array.GetLowerBound(Int32)
       mov       ebx,eax
       mov       rcx,rsi
       xor       edx,edx
       call      System.Array.GetUpperBound(Int32)
       cmp       eax,ebx
       jle       short M00_L03
M00_L00:
       mov       rcx,[rsi]
       mov       ecx,[rcx+4]
       add       ecx,0FFFFFFE8
       shr       ecx,3
       cmp       ecx,1
       jbe       short M00_L05
       lea       rdx,[rsi+10]
       inc       ecx
       movsxd    rcx,ecx
       mov       ebp,[rdx+rcx*4]
       mov       rcx,rsi
       mov       edx,1
       call      System.Array.GetUpperBound(Int32)
       cmp       eax,ebp
       jle       short M00_L02
M00_L01:
       mov       ecx,ebx
       sub       ecx,[rsi+18]
       cmp       ecx,[rsi+10]
       jae       short M00_L04
       mov       edx,ebp
       sub       edx,[rsi+1C]
       cmp       edx,[rsi+14]
       jae       short M00_L04
       mov       eax,[rsi+14]
       imul      rax,rcx
       mov       rcx,rdx
       add       rcx,rax
       add       edi,[rsi+rcx*4+20]
       inc       ebp
       mov       rcx,rsi
       mov       edx,1
       call      System.Array.GetUpperBound(Int32)
       cmp       eax,ebp
       jg        short M00_L01
M00_L02:
       inc       ebx
       mov       rcx,rsi
       xor       edx,edx
       call      System.Array.GetUpperBound(Int32)
       cmp       eax,ebx
       jg        short M00_L00
M00_L03:
       mov       eax,edi
       add       rsp,28
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       ret
M00_L04:
       call      CORINFO_HELP_RNGCHKFAIL
M00_L05:
       mov       rcx,offset MT_System.IndexOutOfRangeException
       call      CORINFO_HELP_NEWSFAST
       mov       rsi,rax
       call      System.SR.get_IndexOutOfRange_ArrayRankIndex()
       mov       rdx,rax
       mov       rcx,rsi
       call      System.IndexOutOfRangeException..ctor(System.String)
       mov       rcx,rsi
       call      CORINFO_HELP_THROW
       int       3
; Total bytes of code 219
```

Now here’s what it is for .NET 7:
```
; Program.Sum()
       push      r14
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,20
       mov       rdx,[rcx+8]
       xor       eax,eax
       mov       ecx,[rdx+18]
       mov       r8d,ecx
       mov       r9d,[rdx+10]
       lea       ecx,[rcx+r9+0FFFF]
       cmp       ecx,r8d
       jle       short M00_L03
       mov       r9d,[rdx+1C]
       mov       r10d,[rdx+14]
       lea       r10d,[r9+r10+0FFFF]
M00_L00:
       mov       r11d,r9d
       cmp       r10d,r11d
       jle       short M00_L02
       mov       esi,r8d
       sub       esi,[rdx+18]
       mov       edi,[rdx+10]
M00_L01:
       mov       ebx,esi
       cmp       ebx,edi
       jae       short M00_L04
       mov       ebp,[rdx+14]
       imul      ebx,ebp
       mov       r14d,r11d
       sub       r14d,[rdx+1C]
       cmp       r14d,ebp
       jae       short M00_L04
       add       ebx,r14d
       add       eax,[rdx+rbx*4+20]
       inc       r11d
       cmp       r10d,r11d
       jg        short M00_L01
M00_L02:
       inc       r8d
       cmp       ecx,r8d
       jg        short M00_L00
M00_L03:
       add       rsp,20
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       ret
M00_L04:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 130
```

Importantly, note there are no more calls (other than for the bounds check exception at the end). For example, instead of that first GetUpperBound call:
```
       call      System.Array.GetUpperBound(Int32)
```

we get:
```
       mov       r9d,[rdx+1C]
       mov       r10d,[rdx+14]
       lea       r10d,[r9+r10+0FFFF]
```

and it ends up being much faster:
Method | Runtime | Mean | Ratio |
---|---|---|---|
Sum | .NET 6.0 | 2,657.5 us | 1.00 |
Sum | .NET 7.0 | 676.3 us | 0.25 |
Loop Hoisting and Cloning
We previously saw how PGO interacts with loop hoisting and cloning, and those optimizations have seen other improvements, as well.
Historically, the JIT’s support for hoisting has been limited to lifting an invariant out one level. Consider this example:
```csharp
[Benchmark]
public void Compute()
{
    for (int thousands = 0; thousands < 10; thousands++)
    {
        for (int hundreds = 0; hundreds < 10; hundreds++)
        {
            for (int tens = 0; tens < 10; tens++)
            {
                for (int ones = 0; ones < 10; ones++)
                {
                    int n = ComputeNumber(thousands, hundreds, tens, ones);
                    Process(n);
                }
            }
        }
    }
}

static int ComputeNumber(int thousands, int hundreds, int tens, int ones) =>
    (thousands * 1000) + (hundreds * 100) + (tens * 10) + ones;

[MethodImpl(MethodImplOptions.NoInlining)]
static void Process(int n) { }
```

At first glance, you might look at this and say “what could be hoisted, the computation of n requires all of the loop inputs, and all of that computation is in ComputeNumber.” But from a compiler’s perspective, the ComputeNumber function is inlineable and thus logically can be part of its caller, the computation of n is actually split into multiple pieces, and each of those pieces can be hoisted to different levels, e.g. the tens computation can be hoisted out one level, the hundreds out two levels, and the thousands out three levels. Here’s what [DisassemblyDiagnoser] outputs for .NET 6:
```
; Program.Compute()
       push      r14
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,20
       xor       esi,esi
M00_L00:
       xor       edi,edi
M00_L01:
       xor       ebx,ebx
M00_L02:
       xor       ebp,ebp
       imul      ecx,esi,3E8
       imul      eax,edi,64
       add       ecx,eax
       lea       eax,[rbx+rbx*4]
       lea       r14d,[rcx+rax*2]
M00_L03:
       lea       ecx,[r14+rbp]
       call      Program.Process(Int32)
       inc       ebp
       cmp       ebp,0A
       jl        short M00_L03
       inc       ebx
       cmp       ebx,0A
       jl        short M00_L02
       inc       edi
       cmp       edi,0A
       jl        short M00_L01
       inc       esi
       cmp       esi,0A
       jl        short M00_L00
       add       rsp,20
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r14
       ret
; Total bytes of code 84
```

We can see that some hoisting has happened here. After all, the innermost loop (tagged M00_L03) is only five instructions: add the ones counter in ebp to the hoisted partial result in r14 (producing n in ecx for the call), call Process, increment ebp, compare it against 0xA (10), and if it’s still less, jump back to M00_L03. Great, so we’ve hoisted all of the unnecessary computation out of the inner loop, being left only with adding the ones position to the rest of the number. Let’s go out a level. M00_L02 is the label for the tens loop. What do we see there? Trouble. The two instructions imul ecx,esi,3E8 and imul eax,edi,64 are performing the thousands * 1000 and hundreds * 100 operations, highlighting that these operations which could have been hoisted out further were left stuck in the next-to-innermost loop. Now, here’s what we get for .NET 7, where this was improved in dotnet/runtime#68061:
```
; Program.Compute()
       push      r15
       push      r14
       push      r12
       push      rdi
       push      rsi
       push      rbp
       push      rbx
       sub       rsp,20
       xor       esi,esi
M00_L00:
       xor       edi,edi
       imul      ebx,esi,3E8
M00_L01:
       xor       ebp,ebp
       imul      r14d,edi,64
       add       r14d,ebx
M00_L02:
       xor       r15d,r15d
       lea       ecx,[rbp+rbp*4]
       lea       r12d,[r14+rcx*2]
M00_L03:
       lea       ecx,[r12+r15]
       call      qword ptr [Program.Process(Int32)]
       inc       r15d
       cmp       r15d,0A
       jl        short M00_L03
       inc       ebp
       cmp       ebp,0A
       jl        short M00_L02
       inc       edi
       cmp       edi,0A
       jl        short M00_L01
       inc       esi
       cmp       esi,0A
       jl        short M00_L00
       add       rsp,20
       pop       rbx
       pop       rbp
       pop       rsi
       pop       rdi
       pop       r12
       pop       r14
       pop       r15
       ret
; Total bytes of code 99
```

Notice now where those imul instructions live. There are four labels, each one corresponding to one of the loops, and we can see the outermost loop has the imul ebx,esi,3E8 (for the thousands computation) and the next loop has the imul r14d,edi,64 (for the hundreds computation), highlighting that these computations were hoisted out to the appropriate level (the tens and ones computation are still in the right places).
More improvements have gone in on the cloning side. Previously, loop cloning would only apply for loops iterating by 1 from a low to a high value. With dotnet/runtime#60148, the comparison against the upper value can be <= rather than just <. And with dotnet/runtime#67930, loops that iterate downward can also be cloned, as can loops that have increments and decrements larger than 1. Consider this benchmark:
```csharp
private int[] _values = Enumerable.Range(0, 1000).ToArray();

[Benchmark]
[Arguments(0, 0, 1000)]
public int LastIndexOf(int arg, int offset, int count)
{
    int[] values = _values;
    for (int i = offset + count - 1; i >= offset; i--)
        if (values[i] == arg)
            return i;

    return 0;
}
```

Without loop cloning, the JIT can’t assume that offset through offset+count are in range, and thus every access to the array needs to be bounds checked. With loop cloning, the JIT could generate one version of the loop without bounds checks and only use that when it knows all accesses will be valid. That’s exactly what happens now in .NET 7. Here’s what we got with .NET 6:
```
; Program.LastIndexOf(Int32, Int32, Int32)
       sub       rsp,28
       mov       rcx,[rcx+8]
       lea       eax,[r8+r9+0FFFF]
       cmp       eax,r8d
       jl        short M00_L01
       mov       r9d,[rcx+8]
       nop       word ptr [rax+rax]
M00_L00:
       cmp       eax,r9d
       jae       short M00_L03
       movsxd    r10,eax
       cmp       [rcx+r10*4+10],edx
       je        short M00_L02
       dec       eax
       cmp       eax,r8d
       jge       short M00_L00
M00_L01:
       xor       eax,eax
       add       rsp,28
       ret
M00_L02:
       add       rsp,28
       ret
M00_L03:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 72
```

Notice how in the core loop, at label M00_L00, there’s a bounds check (cmp eax,r9d and jae short M00_L03, which jumps to a call CORINFO_HELP_RNGCHKFAIL). And here’s what we get with .NET 7:
```
; Program.LastIndexOf(Int32, Int32, Int32)
       sub       rsp,28
       mov       rax,[rcx+8]
       lea       ecx,[r8+r9+0FFFF]
       cmp       ecx,r8d
       jl        short M00_L02
       test      rax,rax
       je        short M00_L01
       test      ecx,ecx
       jl        short M00_L01
       test      r8d,r8d
       jl        short M00_L01
       cmp       [rax+8],ecx
       jle       short M00_L01
M00_L00:
       mov       r9d,ecx
       cmp       [rax+r9*4+10],edx
       je        short M00_L03
       dec       ecx
       cmp       ecx,r8d
       jge       short M00_L00
       jmp       short M00_L02
M00_L01:
       cmp       ecx,[rax+8]
       jae       short M00_L04
       mov       r9d,ecx
       cmp       [rax+r9*4+10],edx
       je        short M00_L03
       dec       ecx
       cmp       ecx,r8d
       jge       short M00_L01
M00_L02:
       xor       eax,eax
       add       rsp,28
       ret
M00_L03:
       mov       eax,ecx
       add       rsp,28
       ret
M00_L04:
       call      CORINFO_HELP_RNGCHKFAIL
       int       3
; Total bytes of code 98
```

Notice how the code size is larger, and how there are now two variations of the loop: one at M00_L00 and one at M00_L01. The second one, M00_L01, has a branch to that same call CORINFO_HELP_RNGCHKFAIL, but the first one doesn’t, because that loop will only end up being used after proving that the offset, count, and _values.Length are such that the indexing will always be in bounds.
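In C# terms, you can think of the cloned output as something along these lines (a mental model only; the exact checks are the JIT’s and this isn’t its literal output):

```csharp
// Hypothetical sketch of the cloned loops:
if (values is not null &&
    offset >= 0 &&
    offset + count - 1 < values.Length)
{
    // Fast path: every index is provably in bounds, so no per-iteration checks.
    for (int i = offset + count - 1; i >= offset; i--)
        if (values[i] == arg)
            return i;
}
else
{
    // Slow path: the identical loop, but each values[i] is bounds checked.
    for (int i = offset + count - 1; i >= offset; i--)
        if (values[i] == arg)
            return i;
}
```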
Other changes also improved loop cloning. dotnet/runtime#59886 enables the JIT to choose different forms for how to emit the conditions for choosing the fast or slow loop path, e.g. whether to emit all the conditions, & them together, and then branch (if (!(cond1 & cond2)) goto slowPath), or whether to emit each condition on its own (if (!cond1) goto slowPath; if (!cond2) goto slowPath). dotnet/runtime#66257 enables loop cloning to kick in when the loop variable is initialized to more kinds of expressions (e.g. for (int fromindex = lastIndex - lengthToClear; …)). And dotnet/runtime#70232 increases the JIT’s willingness to clone loops with bodies that do a broader set of operations.
Folding, propagation, and substitution
Constant folding is an optimization where a compiler computes the value of an expression involving only constants at compile-time rather than generating the code to compute the value at run-time. There are multiple levels of constant folding in .NET, with some constant folding performed by the C# compiler and some constant folding performed by the JIT compiler. For example, given the C# code:
```csharp
[Benchmark]
public int A() => 3 + (4 * 5);

[Benchmark]
public int B() => A() * 2;
```

the C# compiler will generate IL for these methods like the following:
```
.method public hidebysig instance int32 A () cil managed
{
    .maxstack 8
    IL_0000: ldc.i4.s 23
    IL_0002: ret
}

.method public hidebysig instance int32 B () cil managed
{
    .maxstack 8
    IL_0000: ldarg.0
    IL_0001: call instance int32 Program::A()
    IL_0006: ldc.i4.2
    IL_0007: mul
    IL_0008: ret
}
```

You can see that the C# compiler has computed the value of 3 + (4*5), as the IL for method A simply contains the equivalent of return 23;. However, method B contains the equivalent of return A() * 2;, highlighting that the constant folding performed by the C# compiler was intramethod only. Now here’s what the JIT generates:
```
; Program.A()
       mov       eax,17
       ret
; Total bytes of code 6

; Program.B()
       mov       eax,2E
       ret
; Total bytes of code 6
```

The assembly for method A isn’t particularly interesting; it’s just returning that same value 23 (hex 0x17). But method B is more interesting. The JIT has inlined the call from B to A, exposing the contents of A to B, such that the JIT effectively sees the body of B as the equivalent of return 23 * 2;. At that point, the JIT can do its own constant folding, and it transforms the body of B to simply return 46 (hex 0x2e). Constant propagation is intricately linked to constant folding and is essentially just the idea that you can substitute a constant value (typically one computed via constant folding) into further expressions, at which point they may also be able to be folded.
The JIT has long performed constant folding, but it improves further in .NET 7. One of the ways constant folding can improve is by exposing more values to be folded, which often means more inlining. dotnet/runtime#55745 helped the inliner to understand that a method call like M(constant + constant) (noting that those constants might be the result of some other method call) is itself passing a constant to M, and a constant being passed to a method call is a hint to the inliner that it should consider being more aggressive about inlining, since exposing that constant to the body of the callee can potentially significantly reduce the amount of code required to implement the callee. The JIT might have previously inlined such a method anyway, but when it comes to inlining, the JIT is all about heuristics and generating enough evidence that it’s worthwhile to inline something; this contributes to that evidence. This pattern shows up, for example, in the various FromXx methods on TimeSpan. For example, TimeSpan.FromSeconds is implemented as:
```csharp
public static TimeSpan FromSeconds(double value) =>
    Interval(value, TicksPerSecond); // TicksPerSecond is a constant
```

and, eschewing argument validation for the purposes of this example, Interval is:
```csharp
private static TimeSpan Interval(double value, double scale) =>
    IntervalFromDoubleTicks(value * scale);

private static TimeSpan IntervalFromDoubleTicks(double ticks) =>
    ticks == long.MaxValue ? TimeSpan.MaxValue : new TimeSpan((long)ticks);
```

which if everything gets inlined means FromSeconds is essentially:
```csharp
public static TimeSpan FromSeconds(double value)
{
    double ticks = value * 10_000_000;
    return ticks == long.MaxValue ? TimeSpan.MaxValue : new TimeSpan((long)ticks);
}
```

and if value is a constant, let’s say 5, that whole thing can be constant folded (with dead code elimination on the ticks == long.MaxValue branch) to simply:
```csharp
return new TimeSpan(50_000_000);
```

I’ll spare you the .NET 6 assembly for this, but on .NET 7 with a benchmark like:
```csharp
[Benchmark]
public TimeSpan FromSeconds() => TimeSpan.FromSeconds(5);
```

we now get the simple and clean:
```
; Program.FromSeconds()
       mov       eax,2FAF080
       ret
; Total bytes of code 6
```

Another change improving constant folding included dotnet/runtime#57726 from @SingleAccretion, which unblocked constant folding in a particular scenario that sometimes manifests when doing field-by-field assignment of structs being returned from method calls. As a small example, consider this trivial property, which accesses the Color.DarkOrange property, which in turn does new Color(KnownColor.DarkOrange):
```csharp
[Benchmark]
public Color DarkOrange() => Color.DarkOrange;
```

In .NET 6, the JIT generated this:
```
; Program.DarkOrange()
       mov       eax,1
       mov       ecx,39
       xor       r8d,r8d
       mov       [rdx],r8
       mov       [rdx+8],r8
       mov       [rdx+10],cx
       mov       [rdx+12],ax
       mov       rax,rdx
       ret
; Total bytes of code 32
```

The interesting thing here is that some constants (39, which is the value of KnownColor.DarkOrange, and 1, which is a private StateKnownColorValid constant) are being loaded into registers (mov eax, 1 then mov ecx, 39) and then later being stored into the relevant location for the Color struct being returned (mov [rdx+12],ax and mov [rdx+10],cx). In .NET 7, it now generates:
```
; Program.DarkOrange()
       xor       eax,eax
       mov       [rdx],rax
       mov       [rdx+8],rax
       mov       word ptr [rdx+10],39
       mov       word ptr [rdx+12],1
       mov       rax,rdx
       ret
; Total bytes of code 25
```

with direct assignment of these constant values into their destination locations (mov word ptr [rdx+12],1 and mov word ptr [rdx+10],39). Other changes contributing to constant folding included dotnet/runtime#58171 from @SingleAccretion and dotnet/runtime#57605 from @SingleAccretion.
However, a large category of improvement came from an optimization related to propagation, that of forward substitution. Consider this silly benchmark:
```csharp
[Benchmark]
public int Compute1() => Value + Value + Value + Value + Value;

[Benchmark]
public int Compute2() => SomethingElse() + Value + Value + Value + Value + Value;

private static int Value => 16;

[MethodImpl(MethodImplOptions.NoInlining)]
private static int SomethingElse() => 42;
```

If we look at the assembly code generated for Compute1 on .NET 6, it looks like what we’d hope for. We’re adding Value 5 times, Value is trivially inlined and returns a constant value 16, and so we’d hope that the assembly code generated for Compute1 would effectively just be returning the value 80 (hex 0x50), which is exactly what happens:
```
; Program.Compute1()
       mov       eax,50
       ret
; Total bytes of code 6
```

But Compute2 is a bit different. The structure of the code is such that the additional call to SomethingElse ends up slightly perturbing something about the JIT’s analysis, and .NET 6 ends up with this assembly code:
```
; Program.Compute2()
       sub       rsp,28
       call      Program.SomethingElse()
       add       eax,10
       add       eax,10
       add       eax,10
       add       eax,10
       add       eax,10
       add       rsp,28
       ret
; Total bytes of code 29
```

Rather than a single mov eax, 50 to put the value 0x50 into the return register, we have 5 separate add eax, 10 to build up that same 0x50 (80) value. That’s… not ideal.
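Conceptually, after inlining, the JIT ends up seeing Compute2 as a chain of single-use temporaries, something like this (a mental model with hypothetical temporary names, not actual JIT output):

```csharp
// Mental model of Compute2 after inlining:
int t0 = SomethingElse();
int t1 = t0 + 16; // + Value
int t2 = t1 + 16; // + Value
int t3 = t2 + 16; // + Value
int t4 = t3 + 16; // + Value
int t5 = t4 + 16; // + Value
return t5;
```

Forward substitution, discussed next, is what lets the JIT fold those single-use temporaries back into one expression tree, which can then be constant folded into a single addition.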
It turns out that many of the JIT’s optimizations operate on the tree data structures created as part of parsing the IL. In some cases, optimizations can do better when they’re exposed to more of the program, in other words when the tree they’re operating on is larger and contains more to be analyzed. However, various operations can break up these trees into smaller, individual ones, such as with temporary variables created as part of inlining, and in doing so can inhibit these operations. Something is needed in order to effectively stitch these trees back together, and that’s forward substitution. You can think of forward substitution almost like an inverse of CSE; rather than trying to find duplicate expressions and eliminate them by computing the value once and storing it into a temporary, forward substitution eliminates that temporary and effectively moves the expression tree into its use site. Obviously you don’t want to do this if it would then negate CSE and result in duplicate work, but for expressions that are defined once and used once, this kind of forward propagation is valuable. dotnet/runtime#61023 added an initial limited version of forward substitution, and then dotnet/runtime#63720 added a more robust generalized implementation. Subsequently, dotnet/runtime#70587 expanded it to also cover some SIMD vectors, and then dotnet/runtime#71161 improved it further to enable substitutions into more places (in this case into call arguments). And with those, our silly benchmark now produces the following on .NET 7:
```
; Program.Compute2()
       sub       rsp,28
       call      qword ptr [7FFCB8DAF9A8]
       add       eax,50
       add       rsp,28
       ret
; Total bytes of code 18
```

Vectorization
SIMD, or Single Instruction Multiple Data, is a kind of processing in which one instruction applies to multiple pieces of data at the same time. You’ve got a list of numbers and you want to find the index of a particular value? You could walk the list comparing one element at a time, and that would be fine functionally. But what if in the same amount of time it takes you to read and compare one element, you could instead read and compare two elements, or four elements, or 32 elements? That’s SIMD, and the art of utilizing SIMD instructions is lovingly referred to as “vectorization,” where operations are applied to all of the elements in a “vector” at the same time.
.NET has long had support for vectorization in the form of Vector<T>, which has the benefit of being both platform-agnostic and width-agnostic: code written against it doesn’t need to know or care whether the hardware offers 128-bit or 256-bit vectors or which instruction set provides them, and the same code automatically lights up with whatever acceleration is available. The downside is that it exposes a relatively constrained set of operations, and because its size varies from machine to machine, it can’t be used for algorithms that require a specific, fixed vector width.
Starting in .NET Core 3.0, .NET gained literally thousands of new “hardware intrinsics” methods, most of which are .NET APIs that map down to one of these SIMD instructions. These intrinsics enable an expert to write an implementation tuned to a specific instruction set, and if done well, get the best possible performance, but it also requires the developer to understand each instruction set and to implement their algorithm for each instruction set that might be relevant, e.g. an AVX2 implementation if it’s supported, or an SSE2 implementation if it’s supported, or an ArmBase implementation if it’s supported, and so on.
.NET 7 has introduced a middle ground. Previous releases saw the introduction of the Vector128<T> and Vector256<T> types, but purely as the currency by which data was passed in and out of the hardware intrinsics; the types themselves exposed essentially no operations of their own. Now in .NET 7, a large set of cross-platform operations has been exposed for these types (e.g. Vector128.Equals, Vector128.ExtractMostSignificantBits, Vector128.ConditionalSelect, and many more), letting a developer code to a fixed vector width without tying the implementation to any particular instruction set; the JIT then emits the best instructions available on the current hardware.
I have two functions: one that directly uses the Sse2.MoveMask hardware intrinsic and one that uses the new cross-platform Vector128.ExtractMostSignificantBits method, both run through [DisassemblyDiagnoser] on an x64 machine.
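A minimal sketch of such a pair (the method and field names here are illustrative):

```csharp
private Vector128<byte> _value = Vector128.Create((byte)123);

[Benchmark]
public uint MoveMaskSse2() => (uint)Sse2.MoveMask(_value);

[Benchmark]
public uint MoveMaskCrossPlat() => _value.ExtractMostSignificantBits();
```

On x64, the disassembly for both ends up along these lines:

```
; Program.MoveMaskSse2()
       vzeroupper
       vmovupd   xmm0,[rcx+8]
       vpmovmskb eax,xmm0
       ret

; Program.MoveMaskCrossPlat()
       vzeroupper
       vmovupd   xmm0,[rcx+8]
       vpmovmskb eax,xmm0
       ret
```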
Notice anything? The code for the two methods is identical, both resulting in a vpmovmskb (Move Byte Mask) instruction. Yet the former code will only work on a platform that supports SSE2 whereas the latter code will work on any platform with support for 128-bit vectors, including Arm64 and WASM (and any future platforms on-boarded that also support SIMD); it’ll just result in different instructions being emitted on those platforms.
To explore this a bit more, let’s take a simple example and vectorize it. We’ll implement a Contains method, where we want to search a span of bytes for a specific value and return whether it was found:
```csharp
static bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    for (int i = 0; i < haystack.Length; i++)
    {
        if (haystack[i] == needle)
        {
            return true;
        }
    }

    return false;
}
```

How would we vectorize this with Vector<T>? First, the vectorized path is only applicable if hardware acceleration is actually available, which we can test with Vector.IsHardwareAccelerated. Second, it’s only applicable if we have at least one vector’s worth of data, i.e. if haystack.Length is at least Vector<byte>.Count; if it isn’t, we can simply fall back to the sequential implementation above.
Now that we know we have enough data, we can get to coding our vectorized loop. In this loop, we’ll be searching for the needle, which means we need a vector that contains that value for every element; the Vector<byte> constructor does exactly that, broadcasting the supplied value into every element (new Vector<byte>(needle)). Then we can walk through the input a vector at a time, loading a vector’s worth of bytes at the current position, using Vector.EqualsAny to see whether any element of it matches the needle vector, returning true if one does, and otherwise advancing our position by Vector<byte>.Count.
And we’re almost done. The last issue to handle is we may still have a few elements at the end we haven’t searched. There are a couple of ways we could handle that. One would be to just continue with our fall back implementation and process each of the remaining elements one at a time. Another would be to employ a trick that’s common when vectorizing idempotent operations. Our operation isn’t mutating anything, which means it doesn’t matter if we compare the same element multiple times, which means we can just do one final vector compare for the last vector in the search space; that might or might not overlap with elements we’ve already looked at, but it won’t hurt anything if it does. And with that, our implementation is complete:
```csharp
static unsafe bool Contains(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector.IsHardwareAccelerated && haystack.Length >= Vector<byte>.Count)
    {
        fixed (byte* haystackPtr = &MemoryMarshal.GetReference(haystack))
        {
            Vector<byte> target = new Vector<byte>(needle);
            byte* current = haystackPtr;
            byte* endMinusOneVector = haystackPtr + haystack.Length - Vector<byte>.Count;

            do
            {
                if (Vector.EqualsAny(Unsafe.ReadUnaligned<Vector<byte>>(current), target))
                {
                    return true;
                }

                current += Vector<byte>.Count;
            }
            while (current < endMinusOneVector);

            // One final compare for the last vector's worth of data, which may
            // overlap bytes we already examined; the operation is idempotent,
            // so that's harmless.
            if (Vector.EqualsAny(Unsafe.ReadUnaligned<Vector<byte>>(endMinusOneVector), target))
            {
                return true;
            }
        }
    }
    else
    {
        for (int i = 0; i < haystack.Length; i++)
        {
            if (haystack[i] == needle)
            {
                return true;
            }
        }
    }

    return false;
}
```

Congratulations, we’ve vectorized this operation, and fairly decently at that. We can throw this into BenchmarkDotNet and see really nice speedups:
```csharp
private byte[] _data = Enumerable.Repeat((byte)123, 999).Append((byte)42).ToArray();

[Benchmark(Baseline = true)]
[Arguments((byte)42)]
public bool Find(byte value) => Contains(_data, value); // just the fallback path in its own method

[Benchmark]
[Arguments((byte)42)]
public bool FindVectorized(byte value) => Contains_Vectorized(_data, value); // the implementation we just wrote
```

Method | Mean | Ratio |
---|---|---|
Find | 484.05 ns | 1.00 |
FindVectorized | 20.21 ns | 0.04 |
A 24x speedup! Woo hoo, victory, all your performance are belong to us!
You deploy this in your service, and you see Contains being called on your hot path, but you don’t see the improvements you were expecting. You dig in a little more, and you discover that while you tested this with an input array with 1000 elements, typical inputs had more like 30 elements. What happens if we change our benchmark to have just 30 elements? That’s not long enough to form a vector, so we fall back to the one-at-a-time path, and we don’t get any speedups at all.
One thing we can now do is switch from using Vector<T> to the new Vector128<T>. 128 bits is 16 bytes, so a Vector128<byte>-based implementation needs only 16 input bytes before its vectorized path can kick in. The shape of the algorithm stays exactly the same; we just swap in the cross-platform Vector128 helpers, as in the sketch below.
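A minimal sketch of that rewrite (assuming the same overall structure as the Vector<T> version; the name Contains_Vectorized matches the benchmark that follows):

```csharp
static bool Contains_Vectorized(ReadOnlySpan<byte> haystack, byte needle)
{
    if (Vector128.IsHardwareAccelerated && haystack.Length >= Vector128<byte>.Count)
    {
        ref byte start = ref MemoryMarshal.GetReference(haystack);
        Vector128<byte> target = Vector128.Create(needle);

        int lastVectorStart = haystack.Length - Vector128<byte>.Count;
        for (int i = 0; i < lastVectorStart; i += Vector128<byte>.Count)
        {
            if (Vector128.EqualsAny(Vector128.LoadUnsafe(ref start, (nuint)i), target))
            {
                return true;
            }
        }

        // Same trailing-elements trick as before: compare the final vector,
        // which may overlap data we've already examined.
        return Vector128.EqualsAny(Vector128.LoadUnsafe(ref start, (nuint)lastVectorStart), target);
    }

    for (int i = 0; i < haystack.Length; i++)
    {
        if (haystack[i] == needle)
        {
            return true;
        }
    }

    return false;
}
```

A nice side benefit: the Vector128.LoadUnsafe(ref T, nuint) overloads let this version avoid pinning and pointer arithmetic entirely.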
With that in hand, we can now try it on our smaller 30 element data set:
```csharp
private byte[] _data = Enumerable.Repeat((byte)123, 29).Append((byte)42).ToArray();

[Benchmark(Baseline = true)]
[Arguments((byte)42)]
public bool Find(byte value) => Contains(_data, value);

[Benchmark]
[Arguments((byte)42)]
public bool FindVectorized(byte value) => Contains_Vectorized(_data, value);
```

Method | Mean | Ratio |
---|---|---|
Find | 15.388 ns | 1.00 |
FindVectorized | 1.747 ns | 0.11 |
Woo hoo, victory, all your performance are belong to us… again!
What about on the larger data set again? Previously with Vector<T>, which on this machine is 256 bits wide, we were processing 32 bytes at a time; with Vector128<T> we’re only processing 16 bytes at a time:
Method | Mean | Ratio |
---|---|---|
Find | 484.25 ns | 1.00 |
FindVectorized | 32.92 ns | 0.07 |
… closer to 15x. Nothing to sneeze at, but it’s not the 24x we previously saw. What if we want to have our cake and eat it, too? Let’s also add a Vector256<T> path, preferring it when 256-bit vectors are hardware accelerated and the input is at least 32 bytes, and otherwise falling back to the Vector128<T> path and then to the scalar loop, as sketched below.
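A sketch of the added tier (illustrative; Contains_Vector128 stands in for the Vector128 version shown earlier):

```csharp
static bool Contains_Vectorized(ReadOnlySpan<byte> haystack, byte needle)
{
    // Prefer 256-bit vectors when the hardware supports them and the input is
    // long enough; otherwise defer to the 128-bit/scalar version from before.
    if (!Vector256.IsHardwareAccelerated || haystack.Length < Vector256<byte>.Count)
    {
        return Contains_Vector128(haystack, needle);
    }

    ref byte start = ref MemoryMarshal.GetReference(haystack);
    Vector256<byte> target = Vector256.Create(needle);

    int lastVectorStart = haystack.Length - Vector256<byte>.Count;
    for (int i = 0; i < lastVectorStart; i += Vector256<byte>.Count)
    {
        if (Vector256.EqualsAny(Vector256.LoadUnsafe(ref start, (nuint)i), target))
        {
            return true;
        }
    }

    // Compare the final (possibly overlapping) vector.
    return Vector256.EqualsAny(Vector256.LoadUnsafe(ref start, (nuint)lastVectorStart), target);
}
```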
And, boom, we’re back:
Method | Mean | Ratio |
---|---|---|
Find | 484.53 ns | 1.00 |
FindVectorized | 20.08 ns | 0.04 |
We now have an implementation that is vectorized on any platform with either 128-bit or 256-bit vector instructions (x86, x64, Arm64, WASM, etc.), that can use either based on the input length, and that can be included in an R2R image if that’s of interest.
There are many factors that impact which path you go down, and I expect we’ll have guidance forthcoming to help navigate all the factors and approaches. But the capabilities are all there, and whether you choose to use Vector<T>, the new cross-platform Vector128<T>/Vector256<T> operations, or the raw hardware intrinsics, there’s a good path to vectorized performance.
I already mentioned several PRs that exposed the new cross-platform vector support, but that only scratches the surface of the work done to actually enable these operations and to enable them to produce high-quality code. As just one example of a category of such work, a set of changes went in to help ensure that zero vector constants are handled well, such as dotnet/runtime#63821 that “morphed” (changed) Vector128/256<T>.Create(default) into Vector128/256<T>.Zero, which then enables subsequent optimizations to recognize and special-case the zero vector.
Inlining
Inlining is one of the most important optimizations the JIT can do. The concept is simple: instead of making a call to some method, take the code from that method and bake it into the call site. This has the obvious advantage of avoiding the overhead of a method call, but except for really small methods on really hot paths, that’s often on the smaller side of the wins inlining brings. The bigger wins are due to the callee’s code being exposed to the caller’s code, and vice versa. So, for example, if the caller is passing a constant as an argument to the callee, if the method isn’t inlined, the compilation of the callee has no knowledge of that constant, but if the callee is inlined, all of the code in the callee is then aware of its argument being a constant value, and can do all of the optimizations possible with such a constant, like dead code elimination, branch elimination, constant folding and propagation, and so on. Of course, if it were all rainbows and unicorns, everything possible to be inlined would be inlined, and that’s obviously not happening. Inlining brings with it the cost of potentially increased binary size. If the code being inlined would result in the same amount or less assembly code in the caller than it takes to call the callee (and if the JIT can quickly determine that), then inlining is a no-brainer. But if the code being inlined would increase the size of the callee non-trivially, now the JIT needs to weigh that increase in code size against the throughput benefits that could come from it. That code size increase can itself result in throughput regressions, due to increasing the number of distinct instructions to be executed and thereby putting more pressure on the instruction cache. As with any cache, the more times you need to read from memory to populate it, the less effective the cache will be. If you have a function that gets inlined into 100 different call sites, every one of those call sites’ copies of the callee’s instructions are unique, and calling each of those 100 functions could end up thrashing the instruction cache; in contrast, if all of those 100 functions “shared” the same instructions by simply calling the single instance of the callee, it’s likely the instruction cache would be much more effective and lead to fewer trips to memory.
All that is to say, inlining is really important, it’s important that the “right” things be inlined and that it not overinline, and as such every release of .NET in recent memory has seen nice improvements around inlining. .NET 7 is no exception.
One really interesting improvement around inlining is dotnet/runtime#64521, and it might be surprising. Consider the Boolean.ToString method; here’s its full implementation:
```csharp
public override string ToString()
{
    if (!m_value) return "False";
    return "True";
}
```

Pretty simple, right? You’d expect something this trivial to be inlined. Alas, on .NET 6, this benchmark:
```csharp
private bool _value = true;

[Benchmark]
public int BoolStringLength() => _value.ToString().Length;
```

produces this assembly code:
```
; Program.BoolStringLength()
       sub       rsp,28
       cmp       [rcx],ecx
       add       rcx,8
       call      System.Boolean.ToString()
       mov       eax,[rax+8]
       add       rsp,28
       ret
; Total bytes of code 23
```

Note the call System.Boolean.ToString(). The reason for this is, historically, the JIT has been unable to inline methods across assembly boundaries if those methods contain string literals (like the “False” and “True” in that Boolean.ToString implementation). This restriction had to do with string interning and the possibility that such inlining could lead to visible behavioral differences. Those concerns are no longer valid, and so this PR removes the restriction. As a result, that same benchmark on .NET 7 now produces this:
```
; Program.BoolStringLength()
       cmp       byte ptr [rcx+8],0
       je        short M00_L01
       mov       rax,1DB54800D20
       mov       rax,[rax]
M00_L00:
       mov       eax,[rax+8]
       ret
M00_L01:
       mov       rax,1DB54800D18
       mov       rax,[rax]
       jmp       short M00_L00
; Total bytes of code 38
```

No more call System.Boolean.ToString().
dotnet/runtime#61408 made two changes related to inlining. First, it taught the inliner how to better see what methods were being called in an inlining candidate, and in particular when tiered compilation is disabled or when a method would bypass tier-0 (such as a method with loops before OSR existed or with OSR disabled); by understanding what methods are being called, it can better understand the cost of the method, e.g. if those method calls are actually hardware intrinsics with a very low cost. Second, it enabled CSE in more cases with SIMD vectors.
dotnet/runtime#71778 also impacted inlining, and in particular in situations where a typeof() could be propagated to the callee (e.g. via a method argument). In previous releases of .NET, various members on Type like IsValueType were turned into JIT intrinsics, such that the JIT could substitute a constant value for calls where it could compute the answer at compile time. For example, this:
```csharp
[Benchmark]
public bool IsValueType() => IsValueType<int>();

private static bool IsValueType<T>() => typeof(T).IsValueType;
```

results in this assembly code on .NET 6:
```
; Program.IsValueType()
       mov       eax,1
       ret
; Total bytes of code 6
```

However, change the benchmark slightly:
```csharp
[Benchmark]
public bool IsValueType() => IsValueType(typeof(int));

private static bool IsValueType(Type t) => t.IsValueType;
```

and it’s no longer as simple:
```
; Program.IsValueType()
       sub       rsp,28
       mov       rcx,offset MT_System.Int32
       call      CORINFO_HELP_TYPEHANDLE_TO_RUNTIMETYPE
       mov       rcx,rax
       mov       rax,[7FFCA47C9560]
       cmp       [rcx],ecx
       add       rsp,28
       jmp       rax
; Total bytes of code 38
```

Effectively, as part of inlining the JIT loses the notion that the argument is a constant and fails to propagate it. This PR fixes that, such that on .NET 7, we now get what we expect:
```
; Program.IsValueType()
       mov       eax,1
       ret
; Total bytes of code 6
```

Arm64
A huge amount of effort in .NET 7 went into making code gen for Arm64 as good or better than its x64 counterpart. I’ve already discussed a bunch of PRs that are relevant regardless of architecture, and others that are specific to Arm, but there are plenty more. To rattle off some of them:
- Addressing modes. “Addressing mode” is the term used to refer to how the operands of instructions are specified. It could be the actual value, it could be the address from where a value should be loaded, it could be the register containing the value, and so on. Arm supports a “scaled” addressing mode, typically used for indexing into an array, where the size of each element is supplied and the instruction “scales” the provided offset by the specified scale. dotnet/runtime#60808 enables the JIT to utilize this addressing mode. More generally, dotnet/runtime#70749 enables the JIT to use addressing modes when accessing elements of managed arrays. dotnet/runtime#66902 improves the use of addressing modes when the element type is byte. dotnet/runtime#65468 improves addressing modes used for floating point. And dotnet/runtime#67490 implements addressing modes for SIMD vectors, specifically for loads with unscaled indices.
- Better instruction selection. Various techniques go into ensuring that the best instructions are selected to represent input code. dotnet/runtime#61037 teaches the JIT how to recognize the pattern (a * b) + c with integers and fold that into a single madd or msub instruction (see the sketch after this list), while dotnet/runtime#66621 does the same for a - (b * c) and msub. dotnet/runtime#61045 enables the JIT to recognize certain constant bit shift operations (either explicit in the code or implicit to various forms of managed array access) and emit sbfiz/ubfiz instructions. dotnet/runtime#70599, dotnet/runtime#66407, and dotnet/runtime#65535 all handle various forms of optimizing a % b. dotnet/runtime#61847 from @SeanWoo removes an unnecessary movi emitted as part of setting a dereferenced pointer to a constant value. dotnet/runtime#57926 from @SingleAccretion enables computing a 64-bit result as the multiplication of two 32-bit integers to be done with smull/umull. And dotnet/runtime#61549 folds adds with sign extension or zero extension into a single add instruction with uxtw/sxtw/lsl, while dotnet/runtime#62630 drops redundant zero extensions after a ldr instruction.
- Vectorization. dotnet/runtime#64864 adds new AdvSimd.LoadPairVector64/AdvSimd.LoadPairVector128 hardware intrinsics.
- Zeroing. Lots of operations require state to be set to zero, such as initializing all reference locals in a method to zero as part of the method’s prologue (so that the GC doesn’t see and try to follow garbage references). While such functionality was previously vectorized, dotnet/runtime#63422 enables this to be implemented using 128-bit width vector instructions on Arm. And dotnet/runtime#64481 changes the instruction sequences used for zeroing in order to avoid unnecessary zeroing, free up additional registers, and enable the CPU to recognize various instruction sequences and better optimize.
- Memory Model. dotnet/runtime#62895 enables store barriers to be used wherever possible instead of full barriers, and uses one-way barriers for volatile variables. dotnet/runtime#67384 enables volatile reads/writes to be implemented with the ldapr instruction, while dotnet/runtime#64354 uses a cheaper instruction sequence to handle volatile indirections. There’s dotnet/runtime#70600, which enables LSE Atomics to be used for Interlocked operations; dotnet/runtime#71512, which enables using the atomics instruction on Unix machines; and dotnet/runtime#70921, which enables the same but on Windows.
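As a tiny illustration of the instruction-selection bullet above, a method like the following (illustrative, not taken from the PRs) can now compile on Arm64 to a single multiply-add instead of separate multiply and add instructions:

```csharp
[MethodImpl(MethodImplOptions.NoInlining)]
static int MultiplyAdd(int a, int b, int c) => (a * b) + c;

// Expected shape on Arm64 (conceptually):
//     madd w0, w0, w1, w2   ; w0 = (w0 * w1) + w2
//     ret
```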
JIT helpers
While logically part of the runtime, the JIT is actually isolated from the rest of the runtime, only interacting with it through an interface that enables communication between the JIT and the rest of the VM (Virtual Machine). There’s a large amount of VM functionality then that the JIT relies on for good performance.
dotnet/runtime#65738 rewrote various “stubs” to be more efficient. Stubs are tiny bits of code that serve to perform some check and then redirect execution somewhere else. For example, when an interface dispatch call site is expected to only ever be used with a single implementation of that interface, the JIT might employ a “dispatch stub” that compares the type of the object against the single one it’s cached, and if they’re equal simply jumps to the right target. You know you’re in the corest of the core areas of the runtime when a PR contains lots of assembly code for every architecture the runtime targets. And it paid off; there’s a virtual group of folks from around .NET that review performance improvements and regressions in our automated performance test suites, and attribute these back to the PRs likely to be the cause (this is mostly automated but requires some human oversight). It’s always nice then when, a few days after a PR is merged and performance information has stabilized, you see a rash of celebratory comments like there were on this PR.
For anyone familiar with generics and interested in performance, you may have heard the refrain that generic virtual methods are relatively expensive. They are, comparatively. For example on .NET 6, this code:
```csharp
private Example _example = new Example();

[Benchmark(Baseline = true)]
public void GenericNonVirtual() => _example.GenericNonVirtual<int>();

[Benchmark]
public void GenericVirtual() => _example.GenericVirtual<int>();

class Example
{
    public void GenericNonVirtual<T>() { }
    public virtual void GenericVirtual<T>() { }
}
```

results in:
Method | Mean | Ratio |
---|---|---|
GenericNonVirtual | 0.4866 ns | 1.00 |
GenericVirtual | 6.4552 ns | 13.28 |
dotnet/runtime#65926 eases the pain a tad. Some of the cost comes from looking up some cached information in a hash table in the runtime, and as is the case with many map implementations, this one involves computing a hash code and using a mod operation to map to the right bucket. Other hash table implementations around dotnet/runtime, including Dictionary<,>, HashSet<,>, and ConcurrentDictionary<,> previously switched to a “fastmod” implementation; this PR does the same for this EEHashtable, which is used as part of the CORINFO_GENERIC_HANDLE JIT helper function employed:
Method | Runtime | Mean | Ratio |
---|---|---|---|
GenericVirtual | .NET 6.0 | 6.475 ns | 1.00 |
GenericVirtual | .NET 7.0 | 6.119 ns | 0.95 |
Not enough of an improvement for us to start recommending people use them, but a 5% improvement takes a bit of the edge off the sting.
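If you’re curious what “fastmod” refers to: for a fixed divisor, a modulus can be computed with a couple of multiplies and shifts instead of a comparatively slow division instruction. A sketch, modeled on the approach used by the core collections (the real helpers live in the runtime’s internal HashHelpers):

```csharp
// Precompute once per divisor (e.g. whenever a hash table resizes):
static ulong GetFastModMultiplier(uint divisor) =>
    (ulong.MaxValue / divisor) + 1;

// Then value % divisor becomes multiplies and shifts.
// (Valid for divisors up to int.MaxValue.)
static uint FastMod(uint value, uint divisor, ulong multiplier) =>
    (uint)(((((multiplier * value) >> 32) + 1) * divisor) >> 32);
```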
Grab Bag
It’s near impossible to cover every performance change that goes into the JIT, and I’m not going to try. But there were so many more PRs, I couldn’t just leave them all unsung, so here’s a few more quickies:
- dotnet/runtime#58727 from @benjamin-hodgson. Given an expression like (byte)x | (byte)y, that can be morphed into (byte)(x | y), which can optimize away some movs.

```csharp
[Benchmark]
[Arguments(1, 2)]
public int Test(int x, int y) => (byte)x | (byte)y;
```

```
; *** .NET 6 ***
; Program.Test(Int32, Int32)
       movzx     eax,dl
       movzx     edx,r8b
       or        eax,edx
       ret
; Total bytes of code 10

; *** .NET 7 ***
; Program.Test(Int32, Int32)
       or        edx,r8d
       movzx     eax,dl
       ret
; Total bytes of code 7
```
- dotnet/runtime#67182. On a machine with support for BMI2, 64-bit shifts can be performed with the shlx, sarx, and shrx instructions.

```csharp
[Benchmark]
[Arguments(123, 1)]
public ulong Shift(ulong x, int y) => x << y;
```

```
; *** .NET 6 ***
; Program.Shift(UInt64, Int32)
       mov       ecx,r8d
       mov       rax,rdx
       shl       rax,cl
       ret
; Total bytes of code 10

; *** .NET 7 ***
; Program.Shift(UInt64, Int32)
       shlx      rax,rdx,r8
       ret
; Total bytes of code 6
```
- dotnet/runtime#69003 from @SkiFoD. The pattern ~x + 1 can be changed into a two’s-complement negation.

```csharp
[Benchmark]
[Arguments(42)]
public int Neg(int i) => ~i + 1;
```

```
; *** .NET 6 ***
; Program.Neg(Int32)
       mov       eax,edx
       not       eax
       inc       eax
       ret
; Total bytes of code 7

; *** .NET 7 ***
; Program.Neg(Int32)
       mov       eax,edx
       neg       eax
       ret
; Total bytes of code 5
```
- dotnet/runtime#61412 from @SkiFoD. An expression X & 1 == 1 to test whether the bottom bit of a number is set can be changed to the cheaper X & 1 (which isn’t actually expressible without a following != 0 in C#).

```csharp
[Benchmark]
[Arguments(42)]
public bool BitSet(int x) => (x & 1) == 1;
```

```
; *** .NET 6 ***
; Program.BitSet(Int32)
       test      dl,1
       setne     al
       movzx     eax,al
       ret
; Total bytes of code 10

; *** .NET 7 ***
; Program.BitSet(Int32)
       mov       eax,edx
       and       eax,1
       ret
; Total bytes of code 6
```
- dotnet/runtime#63545 from @Wraith2. The expression x & (x - 1) can be lowered to the blsr instruction.

```csharp
[Benchmark]
[Arguments(42)]
public int ResetLowestSetBit(int x) => x & (x - 1);
```

```
; *** .NET 6 ***
; Program.ResetLowestSetBit(Int32)
       lea       eax,[rdx+0FFFF]
       and       eax,edx
       ret
; Total bytes of code 6

; *** .NET 7 ***
; Program.ResetLowestSetBit(Int32)
       blsr      eax,edx
       ret
; Total bytes of code 6
```
- dotnet/runtime#62394. / and % by a vector’s .Count wasn’t recognizing that Count can be unsigned, but doing so leads to better code gen.

```csharp
[Benchmark]
[Arguments(42u)]
public long DivideByVectorCount(uint i) => i / Vector<byte>.Count;
```

```
; *** .NET 6 ***
; Program.DivideByVectorCount(UInt32)
       mov       eax,edx
       mov       rdx,rax
       sar       rdx,3F
       and       rdx,1F
       add       rax,rdx
       sar       rax,5
       ret
; Total bytes of code 21

; *** .NET 7 ***
; Program.DivideByVectorCount(UInt32)
       mov       eax,edx
       shr       rax,5
       ret
; Total bytes of code 7
```

- dotnet/runtime#60787. Loop alignment in .NET 6 provides a very nice exploration of why and how the JIT handles loop alignment. This PR extends that further by trying to “hide” an emitted align instruction behind an unconditional jmp that might already exist, in order to minimize the impact of the processor having to fetch and decode nops.
GC
“Regions” is a feature of the garbage collector (GC) that’s been in the works for multiple years. It’s enabled by default in 64-bit processes in .NET 7 as of dotnet/runtime#64688, but as with other multi-year features, a multitude of PRs went into making it a reality. At a 30,000 foot level, “regions” replaces the current “segments” approach to managing memory on the GC heap; rather than having a few gigantic segments of memory (e.g. each 1GB), often associated 1:1 with a generation, the GC instead maintains many, many smaller regions (e.g. each 4MB) as their own entity. This enables the GC to be more agile with regards to operations like repurposing regions of memory from one generation to another. For more information on regions, the blog post Put a DPAD on that GC! from the primary developer on the GC is still the best resource.
Native AOT
To many people, the word “performance” in the context of software is about throughput. How fast does something execute? How much data per second can it process? How many requests per second can it process? And so on. But there are many other facets to performance. How much memory does it consume? How fast does it start up and get to the point of doing something useful? How much space does it consume on disk? How long would it take to download? And then there are related concerns. In order to achieve these goals, what dependencies are required? What kinds of operations does it need to perform to achieve these goals, and are all of those operations permitted in the target environment? If any of this paragraph resonates with you, you are the target audience for the Native AOT support now shipping in .NET 7.
.NET has long had support for AOT code generation. For example, .NET Framework had it in the form of ngen, and .NET Core has it in the form of crossgen. Both of those solutions involve a standard .NET executable that has some of its IL already compiled to assembly code, but not all methods will have assembly code generated for them, various things can invalidate the assembly code that was generated, external .NET assemblies without any native assembly code can be loaded, and so on, and in all of those cases, the runtime continues to utilize a JIT compiler. Native AOT is different. It’s an evolution of CoreRT, which itself was an evolution of .NET Native, and it’s entirely free of a JIT. The binary that results from publishing a build is a completely standalone executable in the target platform’s platform-specific file format (e.g. COFF on Windows, ELF on Linux, Mach-O on macOS) with no external dependencies other than ones standard to that platform (e.g. libc). And it’s entirely native: no IL in sight, no JIT, no nothing. All required code is compiled and/or linked in to the executable, including the same GC that’s used with standard .NET apps and services, and a minimal runtime that provides services around threading and the like. All of that brings great benefits: super fast startup time, small and entirely-self contained deployment, and ability to run in places JIT compilers aren’t allowed (e.g. because memory pages that were writable can’t then be executable). It also brings limitations: no JIT means no dynamic loading of arbitrary assemblies (e.g. Assembly.LoadFile) and no reflection emit (e.g. DynamicMethod), everything compiled and linked in to the app means the more functionality that’s used (or might be used) the larger is your deployment, etc. Even with those limitations, for a certain class of application, Native AOT is an incredibly exciting and welcome addition to .NET 7.
Too many PRs to mention have gone into bringing up the Native AOT stack, in part because it’s been in the works for years (as part of the archived dotnet/corert project and then as part of dotnet/runtimelab/feature/NativeAOT) and in part because there have been over a hundred PRs just in dotnet/runtime that have gone into bringing Native AOT up to a shippable state since the code was originally brought over from dotnet/runtimelab in dotnet/runtime#62563 and dotnet/runtime#62611. Between that and there not being a previous version to compare its performance to, instead of focusing PR by PR on improvements, let’s just look at how to use it and the benefits it brings.
Today, Native AOT is focused on console applications, so let’s create a console app:
```
dotnet new console -o nativeaotexample
```

We now have our nativeaotexample directory containing a nativeaotexample.csproj and a “hello, world” Program.cs. To enable publishing the application with Native AOT, edit the .csproj to include this in the existing <PropertyGroup>:
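```xml
<PublishAot>true</PublishAot>
```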
And then… actually, that’s it. Our app is now fully configured to be able to target Native AOT. All that’s left is to publish. As I’m currently writing this on my Windows x64 machine, I’ll target that:
```
dotnet publish -r win-x64 -c Release
```

I now have my generated executable in the output publish directory:
```
    Directory: C:\nativeaotexample\bin\Release\net7.0\win-x64\publish

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---           8/27/2022  6:18 PM        3648512 nativeaotexample.exe
-a---           8/27/2022  6:18 PM       14290944 nativeaotexample.pdb
```
That ~3.5MB .exe is the executable, and the .pdb next to it is debug information, which needn’t actually be deployed with the app. I can now copy that nativeaotexample.exe to any 64-bit Windows machine, regardless of what .NET may or may not be installed anywhere on the box, and my app will run. Now, if what you really care about is size, and 3.5MB is too big for you, you can start making more tradeoffs. There are a bunch of switches you can pass to the Native AOT compiler (ILC) and to the trimmer that impact what code gets included in the resulting image. Let me turn the dial up a bit:
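```xml
<!-- A sketch of the kind of size-focused settings involved; these property names
     correspond directly to the tradeoffs described below. -->
<PropertyGroup>
  <InvariantGlobalization>true</InvariantGlobalization>
  <UseSystemResourceKeys>true</UseSystemResourceKeys>
  <IlcGenerateStackTraceData>false</IlcGenerateStackTraceData>
  <DebuggerSupport>false</DebuggerSupport>
</PropertyGroup>
```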
I republish, and now I have:
```
    Directory: C:\nativeaotexample\bin\Release\net7.0\win-x64\publish

Mode                 LastWriteTime         Length Name
----                 -------------         ------ ----
-a---           8/27/2022  6:19 PM        2061824 nativeaotexample.exe
-a---           8/27/2022  6:19 PM       14290944 nativeaotexample.pdb
```
so ~2MB instead of ~3.5MB. Of course, for that significant reduction I’ve given up some things:
- Setting InvariantGlobalization to true means I’m now not respecting culture information and am instead using a set of invariant data for most globalization operations (a concrete example follows this list).
- Setting UseSystemResourceKeys to true means nice exception messages are stripped away.
- Setting IlcGenerateStackTraceData to false means I’m going to get fairly poor stack traces should I need to debug an exception.
- Setting DebuggerSupport to false… good luck debugging things.
- … you get the idea.
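As an illustration of that first tradeoff, here’s a minimal sketch (my own example) of the kind of culture-specific behavior that invariant globalization gives up:

```csharp
using System;
using System.Globalization;

// With full globalization data available, Turkish casing rules apply:
// "i".ToUpper(turkish) yields "İ" (U+0130) rather than "I".
CultureInfo turkish = new CultureInfo("tr-TR");
Console.WriteLine("i".ToUpper(turkish));

// With <InvariantGlobalization>true</InvariantGlobalization>, that culture-specific
// data is unavailable: depending on configuration, constructing "tr-TR" either throws
// or silently behaves like the invariant culture, uppercasing "i" to plain "I".
```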
One of the potentially mind-boggling aspects of Native AOT for a developer used to .NET is that, as it says on the tin, it really is native. After publishing the app, there is no IL involved, and there’s no JIT that could even process it. This makes some of the other investments in .NET 7 all the more valuable, for example the investments happening everywhere in source generators. Code that previously relied on reflection emit for good performance will need another scheme. We can see that, for example, with Regex. Historically, for optimal throughput with Regex, it’s been recommended to use RegexOptions.Compiled, which uses reflection emit at run-time to generate an optimized implementation of the specified pattern. But if you look at the implementation of the Regex constructor, you’ll find this nugget:
```csharp
if (RuntimeFeature.IsDynamicCodeCompiled)
{
    factory = Compile(pattern, tree, options, matchTimeout != InfiniteMatchTimeout);
}
```
With the JIT, IsDynamicCodeCompiled is true. But with Native AOT, it’s false. Thus, with Native AOT and Regex, there’s no difference between specifying RegexOptions.Compiled and not, and another mechanism is required to get the throughput benefits promised by RegexOptions.Compiled. Enter [GeneratedRegex(...)], which, along with the new regex source generator shipping in the .NET 7 SDK, emits C# code into the assembly using it. That C# code takes the place of the reflection emit that would have happened at run-time, and is thus able to work successfully with Native AOT.
```csharp
private static readonly string s_haystack = new HttpClient().GetStringAsync("https://www.gutenberg.org/files/1661/1661-0.txt").Result;

private Regex _interpreter = new Regex(@"^.*elementary.*$", RegexOptions.Multiline);
private Regex _compiled = new Regex(@"^.*elementary.*$", RegexOptions.Compiled | RegexOptions.Multiline);

[GeneratedRegex(@"^.*elementary.*$", RegexOptions.Multiline)]
private partial Regex SG();

[Benchmark(Baseline = true)]
public int Interpreter() => _interpreter.Count(s_haystack);

[Benchmark]
public int Compiled() => _compiled.Count(s_haystack);

[Benchmark]
public int SourceGenerator() => SG().Count(s_haystack);
```
| Method | Mean | Ratio |
|---|---|---|
| Interpreter | 9,036.7 us | 1.00 |
| Compiled | 9,064.8 us | 1.00 |
| SourceGenerator | 426.1 us | 0.05 |
So, yes, there are some constraints associated with Native AOT, but there are also solutions for working with those constraints. And further, those constraints can actually bring further benefits. Consider dotnet/runtime#64497. Remember how we talked about “guarded devirtualization” in dynamic PGO, where via instrumentation the JIT can determine the most likely type to be used at a given call site and special-case it? With Native AOT, the entirety of the program is known at compile time, with no support for Assembly.LoadFrom or the like. That means at compile time, the compiler can do whole-program analysis to determine what types implement what interfaces. If a given interface only has a single type that implements it, then every call site through that interface can be unconditionally devirtualized, without any type-check guards.
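To make that concrete, here’s a small illustrative example (my own, with hypothetical type names, not from the PR):

```csharp
using System;

public interface IShape
{
    double Area();
}

// Suppose Circle is the only type in the entire program that implements IShape.
public sealed class Circle : IShape
{
    private readonly double _radius;
    public Circle(double radius) => _radius = radius;
    public double Area() => Math.PI * _radius * _radius;
}

public static class Geometry
{
    // With whole-program analysis, the Native AOT compiler can turn this interface
    // dispatch into a direct (and potentially inlined) call to Circle.Area,
    // with no type-check guard at all.
    public static double TotalArea(IShape shape) => shape.Area();
}
```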
This is a really exciting space, one we expect to see flourish in coming releases.
Mono
Up until now I’ve referred to “the JIT,” “the GC,” and “the runtime,” but in reality there are actually multiple runtimes in .NET. I’ve been talking about “coreclr,” which is the runtime that’s recommended for use on Linux, macOS, and Windows. However, there’s also “mono,” which powers Blazor wasm applications, Android apps, and iOS apps. It’s also seen significant improvements in .NET 7.
Just as with coreclr (which can JIT compile, AOT compile partially with JIT fallback, and fully Native AOT compile), mono has multiple ways of actually executing code. One of those ways is an interpreter, which enables mono to execute .NET code in environments that don’t permit JIT’ing and without requiring ahead-of-time compilation or incurring any limitations it may bring. Interestingly, though, the interpreter is itself almost a full-fledged compiler, parsing the IL, generating its own intermediate representation (IR) for it, and doing one or more optimization passes over that IR; it’s just that at the end of the pipeline, when a compiler would normally emit code, the interpreter instead saves off that data to interpret when the time comes to run. As such, the interpreter faces a conundrum very similar to the one we discussed with coreclr’s JIT: the time it takes to optimize vs the desire to start up quickly. And in .NET 7, the interpreter employs a similar solution: tiered compilation. dotnet/runtime#68823 adds the ability for the interpreter to initially compile with minimal optimization of that IR, and then, once a certain call-count threshold has been hit, take the time to do as much optimization on the IR as possible for all future invocations of that method. This yields the same benefits as it does for coreclr: improved startup time while also having efficient sustained throughput. When this merged, we saw Blazor wasm app startup time improve by 10-20% in apps tracked in our benchmarking system.
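Conceptually, the approach looks something like the following toy model (my own sketch for illustration only; mono’s actual implementation differs):

```csharp
using System;
using System.Threading;

// A toy model of call-count-based tiering: execute cheaply-produced IR at first,
// and only pay for full optimization once a method proves itself hot.
sealed class TieredMethod
{
    private const int PromotionThreshold = 30; // hypothetical threshold
    private int _callCount;
    private string _ir = "unoptimized IR";     // stands in for the interpreter's real IR

    public void Invoke()
    {
        if (Interlocked.Increment(ref _callCount) == PromotionThreshold)
        {
            _ir = "optimized IR"; // one-time optimization pass over the stored IR
        }

        Console.WriteLine($"interpreting with {_ir}");
    }
}
```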
The interpreter isn’t just used for entire apps, though. Just as coreclr can use the JIT when an R2R image doesn’t contain code for a method, mono can use the interpreter when there’s no AOT code for a method. One such case on mono was generic delegate invocation, where the presence of a generic delegate being invoked would trigger falling back to the interpreter; for .NET 7, that gap was addressed with dotnet/runtime#70653. A more impactful case, however, is dotnet/runtime#64867. Previously, any method with catch or filter exception handling clauses couldn’t be AOT compiled and would fall back to being interpreted. With this PR, such a method is now able to be AOT compiled, and it only falls back to using the interpreter when an exception actually occurs, switching over to the interpreter for the remainder of that method call’s execution. Since many methods contain such clauses, this can make a big difference in throughput and CPU consumption. In the same vein, dotnet/runtime#63065 enabled methods with finally exception handling clauses to be AOT compiled; just the finally block gets interpreted rather than the entire method.
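Here’s a trivial example (my own) of the kind of method this affects:

```csharp
using System;

// Prior to .NET 7, the presence of this catch clause meant mono couldn't AOT compile
// the method at all, so the whole method was interpreted. Now the method is AOT
// compiled, and the interpreter only takes over if an exception is actually thrown.
static int ParseOrDefault(string text)
{
    try
    {
        return int.Parse(text);
    }
    catch (FormatException)
    {
        return 0;
    }
}
```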
Beyond such backend improvements, another class of improvement came from further unification between coreclr and mono. Years ago, coreclr and mono each had their own entire library stack built on top of them. Over time, as .NET was open sourced, portions of mono’s stack were replaced by shared components, bit by bit. Fast forward to today: all of the core .NET libraries above System.Private.CoreLib are the same regardless of which runtime is being employed. In fact, the source for CoreLib itself is almost entirely shared, with ~95% of the source files being compiled into the CoreLib that’s built for each runtime, and just a few percent of the source specialized for each (which means the vast majority of the performance improvements discussed in the rest of this post apply equally whether running on mono or coreclr). Even so, every release we try to chip away at the few remaining percent, for reasons of maintainability, but also because the source used for coreclr’s CoreLib has generally had more attention paid to it from a performance perspective. dotnet/runtime#71325, for example, moves the generic sorting utility class mono used for array and span sorting over to the more efficient implementation used by coreclr.
One of the biggest categories of improvements, however, is in vectorization. This comes in two pieces. First, Vector
Reflection
Reflection is one of those areas you either love or hate (I find it a bit humorous to be writing this section immediately after writing the Native AOT section). It’s immensely powerful, providing the ability to query all of the metadata for code in your process and for arbitrary assemblies you might encounter, to invoke arbitrary functionality dynamically, and even to emit dynamically-generated IL at run-time. It’s also difficult to handle well in the face of tooling like a linker or a solution like Native AOT that needs to be able to determine at build time exactly what code will be executed, and it’s generally quite expensive at run-time; thus it’s something we both strive to avoid when possible and invest in reducing the costs of, as it’s incredibly useful and popular in many different kinds of applications. As with most releases, it’s seen some nice improvements in .NET 7.
One of the most impacted areas is reflection invoke. Available via MethodBase.Invoke, this functionality lets you take a MethodBase (e.g. MethodInfo) object that represents some method for which the caller previously queried, and call it, with arbitrary arguments that the runtime needs to marshal through to the callee, and with an arbitrary return value that needs to be marshaled back. If you know the signature of the method ahead of time, the best way to optimize invocation speed is to create a delegate from the MethodBase via CreateDelegate<T> and then use that delegate for all future invocations. But when that’s not an option and MethodBase.Invoke is all you have, .NET 7 makes it significantly faster:
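A minimal sketch of the kind of benchmark being measured here (my own reconstruction, with a hypothetical stand-in target method, not the exact benchmark):

```csharp
using System.Reflection;
using BenchmarkDotNet.Attributes;

public class ReflectionInvokeBenchmarks
{
    // A stand-in target; any previously-queried method is invoked the same way.
    private static readonly MethodInfo s_method =
        typeof(ReflectionInvokeBenchmarks).GetMethod(nameof(Target),
            BindingFlags.NonPublic | BindingFlags.Static)!;

    private static int Target() => 42;

    [Benchmark]
    public object? MethodInfoInvoke() => s_method.Invoke(null, null);
}
```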
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| MethodInfoInvoke | .NET 6.0 | 43.846 ns | 1.00 |
| MethodInfoInvoke | .NET 7.0 | 8.078 ns | 0.18 |
Reflection also involves lots of manipulation of objects that represent types, methods, properties, and so on, and tweaks here and there can add up to a measurable difference when using these APIs. For example, I’ve talked in past performance posts about how, potentially counterintuitively, one of the ways we’ve achieved performance boosts is by porting native code from the runtime back into managed C#. There are a variety of ways in which doing so can help performance, but one is that there is some overhead associated with calling from managed code into the runtime, and eliminating such hops avoids that overhead. This can be seen in full effect in dotnet/runtime#71873, which moves several of these “FCalls” related to Type, RuntimeType (the Type-derived class used by the runtime to represent its types), and Enum out of native into managed.
```csharp
[Benchmark]
public Type GetUnderlyingType() => Enum.GetUnderlyingType(typeof(DayOfWeek));
```
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| GetUnderlyingType | .NET 6.0 | 27.413 ns | 1.00 |
| GetUnderlyingType | .NET 7.0 | 5.115 ns | 0.19 |
Another example of this phenomenon comes in dotnet/runtime#62866, which moved much of the underlying support for AssemblyName out of native runtime code into managed code in CoreLib. That in turn has an impact on anything that uses it, such as when using Activator.CreateInstance overloads that take assembly names that need to be parsed.
```csharp
private readonly string _assemblyName = typeof(MyClass).Assembly.FullName;
private readonly string _typeName = typeof(MyClass).FullName;

public class MyClass { }

[Benchmark]
public object CreateInstance() => Activator.CreateInstance(_assemblyName, _typeName);
```
| Method | Runtime | Mean | Ratio |
|---|---|---|---|
| CreateInstance | .NET 6.0 | 3.827 us | 1.00 |
| CreateInstance | .NET 7.0 | 2.276 us | 0.60 |
Other changes contributed to Activator.CreateInstance improvements as well. dotnet/runtime#67148 removed several array and list allocations from inside of the RuntimeType.CreateInstanceImpl method that’s used by CreateInstance (using Type.EmptyTypes instead of allocating a new Type[0], avoiding unnecessarily turning a builder into an array, etc.), resulting in less allocation and faster throughput.
```csharp
// (Assumes the MyClass used here has a non-public constructor for the NonPublic
// binding flags to find, e.g.: public class MyClass { internal MyClass() { } })
[Benchmark]
public void CreateInstance() => Activator.CreateInstance(typeof(MyClass), BindingFlags.NonPublic | BindingFlags.Instance, null, Array.Empty<object>(), null);
```