Anandtech released their review of the iPhone XS a few months ago and despite a vague commitment to Android, my curiosity got the better of me: I read the whole thing. Most of the review didn't surprise: the phone had a typically great screen, battery life was solid and GPU performance had improved significantly. But several pages of it were dedicated to analysing and testing the micro-architecture of Apple’s new A12 chip, and these sections stood out in a big way. Reading the final sentence, I was somewhat shocked:
“Of course there’s compiler considerations and various frequency concerns to take into account, but still we’re now talking about very small margins until Apple’s mobile SoCs outperform the fastest desktop CPUs in terms of ST [single thread] performance.”
Surely that’s not possible, I figured. Desktop processors were just faster. They always have been. Surely my enormous gaming setup, with an i7 4790k processor, 5 fans and far too much desk space could not be bested in peak CPU performance by a smartphone. Bullshit. So I ignored it and moved on - nothing to see here.
A few months after this dismissal the iPad Pro 2018 was announced, and an interesting headline immediately hit the press: “Apple's A12X reaches desktop CPU performance in benchmarks”. I saw posts with titles to this effect in various places - on tech websites, Reddit, forums, in discussion threads, comment sections and more - and noticed a common reaction: many saw that the benchmark in question was Geekbench or something similar, and they rejected the central claim outright. The popular consensus was very clear: if you were to truly compare the performance in an objective fashion, a high-end Intel Core i7, or even an i3, would obliterate the A12X.
I wasn’t satisfied with this reaction for a bunch of reasons, even though it was precisely the reflexive rejection I employed after reading the iPhone XS review. Firstly, it struck me as odd that Anandtech, arguably the most thorough technology reviewers in tech journalism, would make such a bold claim about the A12 if it could be easily disproved. Secondly, as I looked through many iPad Pro reviews, I was consistently shocked at just how well it performed, not just in terms of fluidity, but in workload tasks too. This was completely against what I understood to be true: that modern x86 architectures were inherently far better than ARM-based chips, and that ARM chips only had an efficiency advantage.
But after a fair amount of research and deliberation, I came to a very different conclusion. There’s nuance in every corner, but after working through common arguments and misconceptions, such as that the RISC (ARM) instruction set architecture is inherently inferior to CISC (x86), that the benchmarks spruiking excellent ARM performance are inaccurate, that the performance gap is explained by different operating systems, and more, I’ve found that the central claim of these articles largely holds true: ARM processors, particularly those produced by Apple, have caught up to high-end x86 processors in most aspects of performance. This article is the result of a deep-dive into the topic, intended somewhat to end the myth that ARM processors are behind in terms of performance. Be ready for a good, long read.
The first response I saw many employ against ARM parity was that x86 chips are based on the CISC Instruction Set Architecture (ISA) and are therefore inherently superior to mobile chips based on the RISC ISA. This argument centres around a postulation that complex desktop programs perform worse on ARM chips because they are RISC-based. Many discussions and upvotes seemed to end on this point.
CISC is an approach to ISA design that expresses complex operations in as few instructions as possible so as to minimise reliance on cache and memory, while RISC breaks operations into multiple simple instructions that can each be executed in a single CPU cycle. Here’s a fantastic article covering the differences between the two. While they have different advantages, CISC was first prioritised not necessarily because it was inherently better, but because memory was expensive at the time and because RISC processors required a bit of software wizardry to get right (Stanford University). Once Windows cemented support for CISC processors only, specifically x86, the industry and investment interest focused on CISC, and faster and faster x86 processors were demanded. A notable exception existed in PowerPC and the Mac, but soon enough Apple joined the CISC bandwagon in their Intel transition.
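To make the distinction concrete, here is a minimal sketch of a toy machine that multiplies two values held in memory, once with a single CISC-style instruction and once with a RISC-style load/operate/store sequence. The opcode names (`MULT`, `LOAD`, `PROD`, `STORE`) and the `run` helper are made up for illustration; they are not real x86 or ARM instructions, and the instruction count says nothing about how long each instruction takes to execute.

```python
def run(program, memory):
    """Execute (opcode, *operands) tuples on a toy machine.

    Returns the number of instructions executed. All opcodes here are
    hypothetical and exist only to illustrate the two styles.
    """
    registers = {}
    for op, *args in program:
        if op == "MULT":      # CISC-style: operands and result live in memory
            a, b = args
            memory[a] = memory[a] * memory[b]
        elif op == "LOAD":    # RISC-style: copy memory into a register
            reg, addr = args
            registers[reg] = memory[addr]
        elif op == "PROD":    # RISC-style: multiply two registers
            dst, src = args
            registers[dst] = registers[dst] * registers[src]
        elif op == "STORE":   # RISC-style: copy a register back to memory
            addr, reg = args
            memory[addr] = registers[reg]
        else:
            raise ValueError(f"unknown opcode: {op}")
    return len(program)


# One complex instruction that does everything.
cisc_program = [("MULT", "x", "y")]

# Four simple instructions that together do the same job.
risc_program = [
    ("LOAD", "A", "x"),
    ("LOAD", "B", "y"),
    ("PROD", "A", "B"),
    ("STORE", "x", "A"),
]

cisc_memory = {"x": 6, "y": 7}
risc_memory = {"x": 6, "y": 7}
cisc_count = run(cisc_program, cisc_memory)
risc_count = run(risc_program, risc_memory)

print("CISC result:", cisc_memory["x"], "instructions:", cisc_count)
print("RISC result:", risc_memory["x"], "instructions:", risc_count)
```

Both programs leave the same product in memory; the CISC-style version just packs the work into one instruction, which is why it needed less program memory back when memory was expensive.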
In the meantime, ARM Holdings (keeping in mind that ARM stands for Advanced RISC Machine) started to develop chips for mobile devices where efficiency, not raw performance, was the initial focus. Alongside a competitive environment between Intel and AMD and a lack of demand for RISC chips in desktop systems, this created a vague industry momentum - RISC for mobile and CISC for desktop. At some point in time, this momentum turned into a popular assumption that the reason the performance gap existed was the difference in ISA. Accordingly, RISC was meant to be inferior and naturally relegated to slow but efficient chips, while high-end Intel and AMD chips would forever reign supreme on the performance front. But this is a myth, and it hasn't borne out - in so many ways the lines between RISC and CISC have blurred, with each borrowing design ideas from the other over decades of manufacturing. RISC stands for Reduced Instruction Set Computer, and yet most modern RISC processors have large, complex instruction sets that rival their Complex Instruction Set Computer counterparts. Today, micro-architecture, which is how an ISA is actually implemented, is far more important.
A study by the University of Wisconsin backs this up, stating that the performance difference between RISC and CISC is not due to their ISA, but the micro-architecture differences between products in the market, suggesting that either ISA has the potential to perform as well as the other. In any case, 5 years ago x86 CISC chips had the developmental and micro-architectural advantage, so the ISA argument was hard to disprove.
But some combination of the portable device boom, demand for fast and efficient server processors and a general trend towards more efficient devices created a firestorm of investment and R&D that saw ARM processors begin to improve radically. In the last decade Intel and AMD CPUs have made small overall performance improvements year on year, while ARM chips have made sizable leaps each generation. Granted, 10 years ago commercial ARM chips were absolutely obliterated by any half-decent desktop Intel or AMD processor, so these leaps were absolutely necessary for any chance at actually competing, but the changes have been so quick that the old truths about ARM inferiority haven’t had time to leave the computing vernacular.
All this said, the original claim against RISC still has an element of truth to it - Windows doesn’t fully support ARM or other RISC-based processors. When these chips have been used on Windows devices, such as in the case of the HP Envy X2, a 32-bit x86 emulator is employed to run x86 programs, and they perform terribly. To be clear, this doesn't indicate that ARM chips are slower, just that they aren’t natively supported by desktop OSes like Windows and macOS. The key takeaway is that there’s no ISA battle anymore: the performance debate between RISC and CISC is moot, and gains are to be found in microarchitecture, not instruction set architecture. Improvements to cache implementation, instruction set width, chip size, branch prediction, and other architectural improvements are where the biggest IPC gains exist.
The next issue I want to address is a fallacy I've seen permeate discussions around ARM performance: that a benchmark is not at all useful because it is flawed or not objective in some way. I understand the reaction to a degree (I once wrote an article criticising camera benchmarks that reduce complex data into single scalar numbers), but I also believe that it’s possible to understand the shortcomings of a benchmark, to avoid treating it as objective truth, and to properly consider what it might indicate, rather than dismissing it out of hand.
There's a popular but somewhat misguided belief that cross-platform benchmarks are not at all relevant because different operating systems like iOS and Windows skew the results too significantly. You've seen it before: "Yeah, it performs that well on iOS, but on a Windows machine an i5 would destroy it". On the surface this doesn't sound like a bad argument, since it's true in many instances, especially for those benchmarks that use high-level OS APIs. Metal for example is a highly optimised low-level graphics API for iOS/macOS only, so comparing its performance on an iPhone against OpenCL performance on a Windows machine would produce highly misleading results. But unless there's a bug or performance issue in the kernel or firmware of a tested device, the kind of compute tasks most cross-platform CPU benchmarks test are designed so that they’re not particularly bound to the platforms on which they run.
To explain with a hypothetical: let’s say, using a native program, I ask an iPad Pro 2018 to calculate ten million digits of pi (a rudimentary way to measure one very particular facet of performance) and it takes 60 seconds, compared to 90 seconds on a Windows desktop machine. In this scenario one of two things is roughly true: the hardware in the iPad Pro genuinely completes this task in 33% less time, or there's an inefficiency, bug or poor optimisation somewhere in the setup of that Windows device that is limiting its true performance in this task. It's the first scenario that is most often true. Again, this doesn’t mean the takeaway is that the iPad is 33% quicker across the board, but it does typically indicate some advantage in that particular facet of performance.
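The arithmetic behind that hypothetical is worth spelling out, because "percent faster" can be read two ways. The sketch below uses only the 60 and 90 second figures from the example; the function names are my own.

```python
def time_saved_fraction(fast_seconds: float, slow_seconds: float) -> float:
    """Fraction of the slower machine's run time that the faster machine saves."""
    return (slow_seconds - fast_seconds) / slow_seconds


def speedup(fast_seconds: float, slow_seconds: float) -> float:
    """How many times more work per second the faster machine gets through."""
    return slow_seconds / fast_seconds


ipad_seconds = 60.0     # hypothetical iPad Pro run time from the example
desktop_seconds = 90.0  # hypothetical Windows desktop run time

saved = time_saved_fraction(ipad_seconds, desktop_seconds)
ratio = speedup(ipad_seconds, desktop_seconds)

print(f"time saved: {saved:.0%}, speedup: {ratio:.2f}x")
```

Read as time saved, the gap is a third; read as throughput, the iPad is doing one and a half times the work per second. Either way it describes this one task only.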
There are examples, however, where the difference between OSes is much more significant: browser tests and JS performance. In so many of these cases, the A12X excels. NotebookCheck’s review of the iPad Pro shows it performing 16% - 50% better than the brand new Surface Pro i7 model in browser benchmarks. If we compare to a desktop processor by cross-referencing the data with Anandtech’s review of the i7-8700k, we can see that the iPad Pro either wins or is on par in all tested browser benchmarks, only falling behind by a decent margin in WebXPRT 15 (keeping in mind the desktop setup is using a GTX 1080).
You’d be completely correct if you're thinking that there’s decent room for error in these kinds of tests - they run on different browsers and operating systems after all, and one setup could be significantly more optimised than the other. But optimisation isn’t some magic bullet pulling performance out of thin air - optimisation is something that is done to make the most out of the available hardware. It's fair to criticise a performance benchmark when the device or software in question is clearly not correctly optimised or compiled, but it's somewhat silly to ignore a benchmark because a device is too well optimised, as so often happens in response to positive ARM benchmarks. The fact that the browser performance of a device using the A12X is trading blows with a high-end desktop processor is impressive from whichever angle you look at it. Even in the worst case scenario - on the negative end of the high margin of error that browser benchmarks have - these tests indicate clearly that an ARM-powered chip is within punching distance of high-end x86 chips in browser performance, and that’s all that needs to be true to support the central claim.
Beyond the cross-platform point, there's a tendency for some to disregard benchmarks that show ARM as competitive because they believe that they don't accurately represent CPU workloads, that the compute tasks they test aren't relevant to actual performance, or that they don't actually indicate end-user experience. These are also sometimes true – after all, the final score in a benchmark like Geekbench is estimated and derived from various compute tasks and the usefulness, weighting and relevance of those tests are all subjective. Linus Torvalds famously ripped into Geekbench 3 for its ridiculous testing of crypto tasks, but this and various other issues were addressed in Geekbench 4, so the benchmark today isn't actually a bad indicator of generalised peak performance. It does not necessarily indicate sustained performance, but it does give a roughly, kinda accurate standardised way to compare peak potential. Once again, there can be an appreciable margin of error in these benchmarks, but it's not so significant as to render them useless or worthy of total dismissal.
In other words: while anyone using Geekbench as a be-all-and-end-all measure of performance is, put plainly, using it wrong, the score still has relevance. Geekbench is a thermometer that’s a little broken - sure, it might be off by a few degrees Fahrenheit most of the time, but if the temperature is reporting double what it was yesterday, you can bet it's gonna feel almost double as hot. In my opinion, those who completely, utterly dismiss Geekbench as evidence are engaging in logic just as reductive as those who interpret it with objective authority - the truth lies somewhere in the middle, as it often does. All this said, I would not want to rest my ARM performance case on Geekbench alone, so let’s explore deeper.
For the sake of seeking more objective and thorough data, let’s look at an industry standard benchmark, SPEC2006, an encompassing tool that, according to the excellent writers at Anandtech, “is much better [than Geekbench] as a representative benchmark that fully exhibits more details of a given microarchitecture”. SPEC2006 measures genuine real world tasks - compression, image recognition, compiling, spam filtering, game/chess AI, pathfinding and XML processing. The numbers aren’t arbitrary.
I've picked the i7 6700k to compare the A12X with in this scenario. It’s the most recent and fastest i7 in the official SPEC2006 database, so in honesty, it's the only option. But since SPEC2006 is a single-core benchmark, core count changes in recent generations aren’t that relevant, and since Intel haven’t improved IPC in a few generations, the results will still be quite relevant compared to using the 8700k. The iPad Pro results have been pulled from Anandtech’s review of the iPad Pro. The i7 results have been pulled directly from the SPEC2006 website.
Compared to the top-of-the-line consumer desktop CPU from a few years ago, the A12X performs 20% worse on average in the various SPECint2006 benchmarks, a figure which almost sounds like I’ve produced evidence that doesn’t support my argument. But it achieves this result while being clocked 40% lower, while using a fraction of the power, and with absolutely zero active cooling. No one-pound heatsink and fan or case fans – the processor sits inside a 5.9mm slate of silicon, metal and glass. It doesn’t matter how you slice or stretch the numbers – they are wildly impressive. The A12X is definitely winning on the IPC front, and absolutely obliterating on the performance per watt front. If you normalised for clock speed, the A12X would have performed significantly better than the i7 overall.
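As a rough sketch of what normalising for clock speed means here, the snippet below divides the relative score by the relative clock, using only the two percentages quoted above. It assumes performance scales linearly with clock speed, which is a simplification (memory latency, for one, does not speed up with the core), so treat the result as a ballpark rather than a measurement. The function name is my own.

```python
def per_clock_ratio(relative_score: float, relative_clock: float) -> float:
    """Score per unit of clock speed, relative to a baseline chip (baseline = 1.0).

    Assumes performance scales linearly with clock speed, which real
    chips only approximate.
    """
    if relative_clock <= 0:
        raise ValueError("relative_clock must be positive")
    return relative_score / relative_clock


# Figures from the text, with the i7 as the 1.0 baseline.
a12x_relative_score = 1.0 - 0.20  # 20% lower average SPECint2006 score
a12x_relative_clock = 1.0 - 0.40  # clocked 40% lower

ratio = per_clock_ratio(a12x_relative_score, a12x_relative_clock)
print(f"A12X per-clock score relative to the i7: {ratio:.2f}x")
```

Under that linear assumption, 0.8 divided by 0.6 puts the A12X at roughly a third more work per clock cycle than the i7 in these tests.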
I want to emphasise a minor amendment after sharing this article: I am not suggesting that it would be simple or even physically possible to boost the A12X as high as 5GHz; it was just a hypothetical to take clock speed out of the equation and focus on IPC. Nonetheless, with a performance gap of only 20%, I don't think it's a stretch whatsoever to assume that without limitations around power consumption, active cooling and specific design tweaks, the A12X could easily, easily surpass this hurdle. Edit date: 12 March 2019
At first I found some of these numbers hard to believe, but they're hard to argue with. The A12X has reached parity on Geekbench scores, it’s approaching parity on a host of assorted system benchmarks, it’s approaching parity on SPEC2006, it’s at parity in just about every browser benchmark, and every indication shows that it has better IPC and better performance per watt than current high-end Intel and AMD processors. Even if Anandtech are exaggerating, if all these benchmarks are ARM-favoured and every supporting figure has a huge margin of error, the results are still absolutely remarkable. We’re talking about processors that are designed to live in your pocket with no cooling, not ones designed to go toe-to-toe with desktops, but they're doing it anyway.
It’s completely true that any decent Intel or AMD processor would defeat the A12X in intensive real-world tasks that last longer than a few minutes, like rendering a video or running a PC game (in theory), but this isn't really a rebuttal to the theme of this article, for one significant reason. Before I explain, let me clarify: the purpose of this article is not to suggest that one could take the A12X from an iPad Pro, put it in a desktop as-is and use it without any issues today, but rather to end the myth that ARM chips are behind performance-wise. With this in mind, the reason that the lack of sustained performance is not really an issue is simply because ARM processors have an absolutely enormous disadvantage that has nothing to do with microarchitecture or ISA: in practically every consumer device they are used in, they have no active cooling solutions.
ARM chips on the market already run cooler and more efficiently than modern x86 chips, so a cooling unit as good as any generic desktop fan would not only solve the issue of sustained performance and throttling, but also likely increase the headroom for boost clocks, assuming assorted 7nm nodes are mainly limited by heat to reach higher clock speeds. With an IPC lead already, this change would allow good ARM chips like the A12X to properly compare in almost all performance scenarios.
The fact that they're performing so well despite an enormous disadvantage is impressive in itself, but if Apple were to custom create an ARM chip designed to be used in a high-end desktop or laptop system with active cooling, they would likely produce the fastest consumer chip on the market. Apple modify each generation of their chip architecture and optimise it for each device category – the A12X for the iPad Pro being an enhanced tablet version of the A12 for the iPhone XS – so any ARM CPU that Apple might stick in a laptop would be accordingly modified and enhanced: more Big cores, some tweaks that cater to desktop use-cases, higher base and boost clocks etc. To say we haven't seen the full potential of ARM yet would be an understatement.
In a broad sense, ARM chips have caught up because of a significant demand in the market for fast, efficient chips in portable devices, and now servers. Because of this, ARM reference designs got better and better, and Apple began to double down in a big way. While they're not the only company developing extremely fast chips, with the upcoming Qualcomm Snapdragon 8cx expected to perform rather well, their custom designs are particularly... insane? The A12X has more execution ports than any other consumer processor Anandtech have seen (in a desktop, laptop, phone, tablet or otherwise):
“Monsoon (A11) and Vortex (A12) are extremely wide machines – with 6 integer execution pipelines among which two are complex units, two load/store units, two branch ports, and three FP/vector pipelines this gives an estimated 13 execution ports, far wider than ARM's upcoming Cortex A76 and also wider than Samsung’s M3. In fact, assuming we're not looking at an atypical shared port situation, Apple’s microarchitecture seems to far surpass anything else in terms of width, including desktop CPUs.”
This brings us back to the earlier point in the RISC v CISC discussion: performance is coming down to micro-architecture, not RISC or CISC, not ARM or x86. With this in mind, ARM's desirability is only likely to further improve from here. Due in part to a stagnant desktop CPU industry, with Intel not facing significant competition from AMD up until just recently, x86 development has been, plainly put, dull. Performance and feature improvements have been minor besides core count increases. The opposite has been true in the mobile and tablet industry, and there's no sign of any slowing down.
Even if Intel manages to drastically improve IPC and efficiency in the shift to 7nm, they would still be counting on ARM not making further strides ahead given the sheer levels of research, development and demand being funnelled from the smartphone, server, tablet, and soon laptop industry.
Why has nothing changed? Why is x86 relegated to desktop, and why aren’t people talking about this shift?
One answer to these questions is simple: only in the last 3 years has ARM started to become truly competitive, and it takes far longer than that to trigger widespread adoption given just how entrenched Windows and x86 software are. While integrated mobile and tablet chips are currently excellent, a lot of work still has to be done to create a similar desktop chip with modular parts.
The other answer is more complex: things are changing, behind the scenes. There’s already evidence of Apple’s shift to ARM - though it generated very little press coverage or discussion, the latest MacBook Pros are using Apple’s proprietary ARM-based T2 chip, built on their A10 iPhone chip, to handle H.265 video decoding. I imagine that each generation of MacBook processors will offload more and more tasks to the ARM co-processor, until eventually, some generation completes the transition, and the co-processor becomes the processor.
And this is great news for all of us consumers: what better way to light a fire under the desktop and tablet CPU industry than the threat of the most lucrative hardware company in the world shifting to a competing architecture?
One day soon, we might be able to buy MacBooks with near complete vertical integration from Apple, and despite the possibility of disruption and compatibility issues, I imagine these devices might have some pretty great advantages: industry leading performance and great battery life. For now, let's wait to see what the A13X looks like - if the last few years indicate anything, we're in for a chip that could foreshadow the biggest change in the dynamic of the processor industry in a long, long time.