Vertical integration. My understanding was it's because the Apple silicon ARM has special support to make it fast. Apple has had enough experience to know that some hardware support can go a long way to making the binary emulation situation better.
That is correct, the article goes into details why. See the "Apple's Secret Extension" section as well as the "Total Store Ordering" section.
The "Apple's Secret Extension" section talks about how the M1 has 4 flag bits and the x86 has 6 flag bits, and how emulating those 2 extra flags would make every add/sub/cmp instruction significantly slower. Apple has an undocumented extension that adds 2 more flag bits to make the M1's flag bits behave the same as x86.
The "Total Store Ordering" section talks about how Apple has added a non-standard store ordering mode to the M1 that makes it order its stores the way x86 guarantees instead of the way ARM guarantees. Without this, there's no good way to translate code in and around an x86 memory fence. If you see a memory fence in x86 code, it's safe to assume that it depends on x86 memory store semantics, and if you don't have those you'll need to emulate them with many mostly unnecessary memory fences, which will be devastating for performance.
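To make the fence cost concrete, here is a toy sketch (not what Rosetta 2 actually emits) that lowers a short stream of x86 memory operations to ARM-style pseudo-assembly, with and without hardware TSO. The `dmb` mnemonics are real AArch64 barriers, but the `lower` and `count_barriers` helpers, the register names, and the choice of one barrier per access are assumptions made for illustration; a real translator has more options.

```python
# Toy model only: the barrier placement below is one conservative mapping,
# chosen for illustration, not taken from any real translator.

def lower(ops, hardware_tso):
    """Lower a list of "load" / "store" / "mfence" ops to pseudo-assembly."""
    out = []
    for op in ops:
        if op == "load":
            out.append("ldr x0, [x1]")
            if not hardware_tso:
                # Keep this load ordered before later loads and stores.
                out.append("dmb ishld")
        elif op == "store":
            if not hardware_tso:
                # Keep earlier stores ordered before this store.
                out.append("dmb ishst")
            out.append("str x0, [x1]")
        elif op == "mfence":
            # An explicit x86 fence needs a full barrier either way.
            out.append("dmb ish")
        else:
            raise ValueError(f"unknown op: {op}")
    return out


def count_barriers(code):
    """Count the barrier instructions in the lowered code."""
    return sum(1 for line in code if line.startswith("dmb"))


guest = ["load", "store", "load", "load", "mfence", "store"]
with_tso = lower(guest, hardware_tso=True)
without_tso = lower(guest, hardware_tso=False)
print(count_barriers(with_tso), count_barriers(without_tso))  # prints "1 6"
```

In this model, hardware TSO leaves only the barrier the guest explicitly asked for, while the software-only version pays for one on every single access.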
I’m aware of both of these extensions; they’re not actually necessary for most applications. Yes, you trade fidelity for performance, but it’s not that big of a deal. The majority of Rosetta’s performance comes from good software decisions, not hardware.
Yeah, these features exist, and they help, but I don't think they should be given all the credit. Both "Apple's Secret Extension" and "Total Store Ordering" are features that other emulators can choose to disable to get exactly the same performance.
"Apple's Secret Extension" isn't even used by Rosetta 2 on Linux (opting for, at least, explicit parity flag calculations rather than reduced fidelity). It's still fast.
TSO is only required for accuracy on multithreaded applications, and the PF and AF flags are basically never used, and, if they are, will usually be used immediately after being set, allowing emulators to achieve reasonable fidelity by only calculating them when used.
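The "only calculate them when used" idea can be sketched in a few lines. This is a minimal illustration, assuming a hypothetical `LazyFlags` helper that handles only `add`; it is not how Rosetta 2 or any particular emulator implements it. The PF and AF definitions themselves are the standard x86 ones: PF reflects the parity of the low 8 bits of the result, and AF is the carry out of bit 3.

```python
class LazyFlags:
    """Remembers the last add instead of computing PF/AF eagerly."""

    def __init__(self):
        self.a = self.b = self.result = 0

    def record_add(self, a, b, width=32):
        # The hot path only stores the operands and the truncated result.
        mask = (1 << width) - 1
        self.a, self.b = a & mask, b & mask
        self.result = (a + b) & mask
        return self.result

    @property
    def pf(self):
        # PF is set when the low 8 bits of the result hold an even number of 1s.
        return bin(self.result & 0xFF).count("1") % 2 == 0

    @property
    def af(self):
        # AF is the carry out of bit 3 into bit 4.
        return bool((self.a ^ self.b ^ self.result) & 0x10)


flags = LazyFlags()
flags.record_add(0x0F, 0x01)   # result is 0x10
print(flags.pf, flags.af)      # prints "False True"
```

The parity and adjust work only happens if some later instruction actually reads `pf` or `af`, which per the above is rare.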
There's perhaps a better argument for performance-via-vertical-integration with the flag-manipulation extensions, which I believe Apple created and standardised, but which now anyone can use.
But the reason I wrote this post is that I think most of the ideas are transferable and could help other emulators :)
> TSO is only required for accuracy on multithreaded applications
If by accuracy you mean not segfaulting then yes. Every moderately complex x86-64 application will have memory fences in the generated machine code. The x86-64 design of store buffers and load buffers makes memory fences a necessity. In practice, just using a mutex or atomics in your code is enough to end up with a memory fence in your generated machine code. So, I'd say that this particular part of the Rosetta/M1 design is quite important, if not the most important. Without it applications wouldn't run.
Not true. The required fencing has a huge impact. I led development of the CHPE compiler for Windows on ARM, and the fencing was a major source of our gains.
I don't think we disagree :) If you're going for full accuracy you morally need barriers all over the place. If you have TSO in your chips that makes things far easier; alternatively you can do stuff with RCpc if your hardware supports it. Otherwise you get stuck with fences everywhere, or you force your hardware into TSO compliance mode (read: turn off all the other cores) and that sucks.
The other option is you relax on the "required fencing", with the assumption that most accesses do not actually exercise the full semantics that TSO guarantees. Obviously some synchronization does matter, so you need heuristics and those won't always work. My understanding was that XTA has some of these, with knobs to turn them off if they don't work? You probably know more about that than I do. In iSH we play it even more fast-and-loose, with all regular memory accesses being lowered to ARM loads and stores, and locked operations to whatever seemed the closest. It's definitely not production-grade but we have shockingly good compatibility for what it is.