White Paper
Is it Necessary to Make Great DSPs Greater?
The HiFi family comprises a rich portfolio of audio, voice, and speech processing DSPs, flanked at the ultra-low-power end by the HiFi 1 DSP and at the high-performance end by the versatile HiFi 5, which melds AI and DSP processing. Well received in the market, these DSPs have made their way into various applications, from the always-on island of an earbud SoC to higher-order ambisonics processing in premium cars and home theaters, enhancing user interaction and the listening experience. So, is it necessary to improve these DSPs? If so, in what respects?
The post-pandemic market has a growing appetite for richer features, better user experiences, and greater convenience. How can the DSP platform support this? Does it suffice to offer a commensurate increase in raw performance? Customers would certainly welcome quantum jumps in general performance that improve their applications across the board. In many cases, though, scaling raw performance does not solve customers' pain points. Product managers ask how well the DSP plays with the rest of the system: is it too specialized to be used outside its core strengths, or can it be stretched to help the other compute elements in the SoC? Software architects ask how difficult it is to harness the increased performance, and system architects are concerned with how easy it is to connect the DSP efficiently to the SoC fabric.
The latest HiFi DSP upgrades address these customer concerns. The manifold improvements are discussed below.
Overview
Platform Upgrade
The HiFi DSPs are built on the LX controller platform, which has evolved through seven generations to the current LX7. The new HiFi DSPs move to the upgraded LX8 controller platform, which brings controller and system-level improvements.
The LX8 platform thus enables HiFi DSPs to achieve higher performance through controller and system-level enhancements. Customers can enable only the features they need to hit various performance/area tradeoff targets.
Auto-vectorization
The Cadence compiler toolchain, based on LLVM, has excellent instruction-level parallelization built in to utilize the two to five issue slots of the HiFi DSPs' VLIW architecture. However, it is harder to achieve the data-level parallelism needed to utilize the HiFi's SIMD capability. Software engineers typically spend large amounts of time hand-optimizing code, retrofitting it with HiFi-specific intrinsics (code that describes data parallelism) to get the required performance. Not only does this process affect time to market, but it also ties up premium resources, namely DSP programmers, who could otherwise be developing new algorithms.
The HiFi 1s and HiFi 5s DSPs address this problem by significantly enhancing auto-vectorization. They incorporate innovations that allow the compiler to generate SIMD code automatically, without programmer intervention. For well-written, parallelizable code, the compiler generates SIMD code for the HiFi 1s and HiFi 5s DSPs whose performance rivals that of hand-written or hand-optimized code.
These innovations required careful hardware-software co-design, spanning efforts across multiple engineering teams. Close collaboration between the DSP hardware, software, and compiler teams led to special instructions and data types being embedded in the HiFi DSPs and supported in the compiler. Now, the compiler can auto-vectorize data arrays of all standard "C" data types for the HiFi 1s and HiFi 5s DSPs, saving the programmer significant time and effort.
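As an illustration, consider a loop written in plain, parallelizable C with no intrinsics at all. The function name and Q-formats below are illustrative, not from any Cadence library; the point is that a vectorizing compiler can map such a loop to SIMD automatically:

```c
#include <stdint.h>

/* Plain C gain loop: no intrinsics, no DSP-specific code. With a
 * simple counted loop and no pointer aliasing (restrict), a
 * vectorizing compiler can emit SIMD code for this automatically.
 * Q15/Q31 formats here are illustrative. */
void mix_gain(const int32_t *restrict in, int32_t *restrict out,
              int32_t gain_q15, int n)
{
    for (int i = 0; i < n; i++) {
        /* Apply a Q15 gain to Q31 samples; widen to 64 bits before
         * the multiply, then shift back down. */
        out[i] = (int32_t)(((int64_t)in[i] * gain_q15) >> 15);
    }
}
```

The same source compiles to 2-way, 4-way, or wider SIMD depending on the target DSP, with no per-DSP changes.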
Well-written versus non-parallelizable code
Not all code, however, is well written or structured to be parallelizable. Out-of-box (OOB) gains with such code will be limited. Programmers will generally hand-optimize such code directly rather than take the time and effort to refactor it to be parallelizable, since they see little benefit in refactoring first and then optimizing. Yet hand-optimization is painstaking, time-intensive work, and the hand-optimized code looks very different from the original, reducing backward traceability of performance and functional issues. Further, the optimizations target a single DSP and do not scale well to others: code hand-optimized for HiFi 1 will run on HiFi 5 but will not utilize HiFi 5's greater SIMD capacity.
Simple edits to handle non-parallelizable code
Code can become non-parallelizable for many reasons. For one, array pointers not marked as non-overlapping force the compiler to treat the code as serial. If the pointers were intended to be non-overlapping, a simple pragma directive can qualify them, and the compiler is then free to auto-vectorize the code. Auto-vectorization can also stumble when data types and values are not crisply defined. For example, assigning a variable the value 1.0 causes the compiler to treat it (unnecessarily) as a double, preventing auto-vectorization, whereas the value 1.0f defines it as a single-precision float and allows the compiler to vectorize. Such issues often stem from code developed on PCs, which support double-precision vector operations and run them efficiently. Programmers may be blissfully unaware that the type definition is overkill until the code is ported to an embedded DSP, where it runs functionally correctly but can face severe performance bottlenecks.
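A minimal sketch of both fixes follows, using C99's standard `restrict` qualifier as one common way to declare pointers non-overlapping (the Cadence toolchain may also accept a dedicated pragma for this) together with a single-precision literal:

```c
#include <stddef.h>

/* Non-vectorizable form: the compiler must assume dst and src may
 * overlap, and the 1.0 literal promotes the math to double. */
void scale_serial(float *dst, const float *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 1.0;   /* double literal blocks SP vectorization */
}

/* Vectorizable form: restrict promises no aliasing, and 1.0f keeps
 * the arithmetic in single precision, so the loop can be SIMD-ized. */
void scale_vec(float *restrict dst, const float *restrict src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i] * 1.0f;
}
```

Both functions compute identical results; only the second gives the compiler the freedom to vectorize.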
A list of these and other conditions that get in the way of auto-vectorization, along with recommendations for handling them, is available; please check with your Cadence sales representative. As the above examples show, simple edits can convert code from non-parallelizable to compiler-vectorizable. The resulting source code is portable across HiFi DSPs and compiles and runs optimally on each of them without per-DSP code optimization. The compiler performs the DSP-specific optimization, freeing the software engineer from that burden.
ITU-T STL2019/STL2023 and Dolby Intrinsics
Codecs from standards organizations and ecosystem leaders define various intrinsics for their mathematical operations. Auto-vectorization can parallelize these as well, providing optimal to near-optimal out-of-the-box performance on HiFi 1s and HiFi 5s.
Double-Precision Floating Point Unit (DPFPU)
Audio, vision, and other algorithms sometimes cannot live with single-precision floating-point operations. The tradeoff between precision and range imposed by a 32-bit representation causes problems, so they often rely on double-precision computation; examples include functions such as log, exponential, and tangent. At other times, audio algorithms start life as double-precision code, typically in the PC domain, so as not to make any compromises that could affect quality, with the intention of moving to single precision or fixed point when optimizing for embedded systems. But then, who has the time? Requalifying the code with all the test inputs and test conditions after converting to single precision, adjusting the algorithm to make up for any quality issues that crop up, and then re-testing is a lengthy process. By then, time-to-market pressures set in, and teams are forced to leave double-precision operations sprinkled throughout the code to ship the product on time. Performance suffers if the DSP platform does not include acceleration for double precision.
The inclusion of the double-precision floating point unit is optional and can be selected or deselected through the Cadence Xplorer tool while configuring the DSP. Operation speedups of up to 30X have been observed with the scalar double-precision floating point unit.
Audio DSP adding imaging features?
The market is changing quickly. Domain-specific DSPs were once the order of the day. Now, SoC architects are asking, “So it’s a great audio DSP. What else can it do?” The question arises because SoC architects are under pressure to maximize compute capability across the disparate compute elements they design into the SoC. Depending on the use case, one compute element may be overloaded while another has many cycles to spare. Architects would like to rebalance the workload so that no compute element gets overloaded, which requires moving a function temporarily to one or more of the other compute elements in the SoC.
Yet another case is the always-on use case. Yesterday, one DSP listened for spoken keywords while another recognized the person speaking to the device. Today, customers want more efficiency: why can’t the same DSP perform both functions? After all, in the always-on domain, keeping leakage low is important, which in turn means a smaller area and fewer compute elements.
Considering such use cases, HiFi 1s and HiFi 5s include imaging ISA and 8-bit MACs at different performance levels to process images and vision efficiently.
AI Performance Enhancement
HiFi 1s and HiFi 5s inherit the AI performance of their predecessors, including acceleration for non-linear functions such as sigmoid and tanh, which are key layers in many neural networks. While HiFi 5 already had 32 8-bit MACs (and as many 4x16 and 8x16 MACs), HiFi 1s adds 8 MACs each of 8x8 and 8x16, significantly enhancing its ability to handle neural networks. With HiFi 1s, always-on subsystems can efficiently run more intensive AI inferencing workloads than they could on its predecessor.
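The scalar reference form of the 8x8 MAC operation these DSPs accelerate is a simple dot product, the core of convolution and fully connected neural-network layers. The code below is an illustrative reference, not a Cadence kernel; the SIMD MAC arrays execute many such multiply-accumulates per cycle.

```c
#include <stdint.h>

/* Reference 8x8 multiply-accumulate dot product with 32-bit
 * accumulation. On HiFi 1s/5s, the 8-bit MAC arrays perform many of
 * these operations in parallel per cycle for NN inference. */
int32_t dot_q7(const int8_t *restrict a, const int8_t *restrict b, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];   /* 8x8 -> 32-bit MAC */
    return acc;
}
```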
Conclusion
The LX8 platform and the new HiFi DSPs, namely HiFi 1s and HiFi 5s, represent significant advances in usability, software ease, time to market, performance, and functionality, leading the charge for convergent DSPs that solve a variety of problems at the edge across the audio, voice, AI, and imaging domains.
Cadence is a pivotal leader in electronic systems design and computational expertise, using its Intelligent System Design strategy to turn design concepts into reality. Cadence.com