This page goes into more depth about ABIs and performance.

VFP versions

VFP is ARM's "Vector Floating Point" unit. SIMD operations are better performed on one of the dedicated SIMD extensions: ARM's NEON (as in Cortex-A8 and Cortex-A9), Intel/Marvell's iwMMXt, or a supplemental DSP coprocessor (TI OMAP). VFP can perform vector operations across "banks" of registers, but this feature is rarely used and is deprecated where NEON is present; in all other respects VFP is a standard FPU as you would find in any other processor.

  • VFPv1 - obsoleted by ARM
  • VFPv2 - optional on ARMv5 and ARMv6 cores
    • Supports standard FPU arithmetic (add, sub, neg, mul, div), full square root
    • 16 64-bit FPU registers
  • VFPv3[-D32]
    • Broadly compatible with VFPv2 but adds
      • Exception-less FPU usage
      • 32 64-bit FPU registers as standard
      • Adds VCVT instructions to convert between fixed-point, single- and double-precision values
      • Adds immediate mode to VMOV such that constants can be loaded into FPU registers
  • VFPv3-D16
    • As above, but only has 16 64-bit FPU registers in VFPv3-D16 variant
  • VFPv3-FP16 variant
    • Uncommon, but adds conversion to and from the IEEE 754-2008 half-precision (16-bit) floating point format
  • VFPv4
    • As implemented in Cortex-A5
    • Adds fused multiply-accumulate instructions (VFMA/VFMS) - see the sketch after this list
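
A minimal sketch of what the VFPv4 addition buys (not guaranteed compiler output; the function name is made up for illustration): with -mfpu=vfpv4 and a hardware floating-point ABI, a compiler may turn the C99 fmaf() routine into a single fused multiply-accumulate (VFMA.F32), whereas on VFPv2/VFPv3 it has to fall back to a library call or a separate multiply and add.

#include <math.h>

/* a + (b * c) with a single rounding step - a candidate for VFMA.F32 on VFPv4 */
float fused_madd(float a, float b, float c)
{
    return fmaf(b, c, a);
}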

VFP Instruction Set Quick Reference

Details on GCC floating-point options

For historical reasons, and to match the ARM RVCT toolkit, the GCC FPU and ABI selection options are not entirely orthogonal. The -mfloat-abi= option controls both the calling-convention ABI and whether floating-point instructions may be used. The available options are:

"soft"

  • Full software floating point - the compiler will not generate real FPU instructions and -mfpu= is ignored.

  • FPU operations are emulated by the compiler
  • Function calls are generated to pass FP arguments (float, double) in integer registers (one for float, a pair of registers for double)

"softfp"

  • Hardware floating point using the soft floating point ABI

  • To reiterate, function calls are generated to pass FP arguments in integer registers
  • Compiler can make smart choices about when and if it generates emulated or real FPU instructions depending on chosen FPU type (-mfpu=)

  • This means soft and softfp code can be intermixed

The caveat is that copying data from integer to floating-point registers incurs a pipeline stall for each register passed (rN->fN), or a memory read for stack items. This has noticeable performance implications: a lot of time is spent in the function prologue and epilogue copying data back and forth to FPU registers. This could be 20 cycles or more.

As a thought experiment, consider a function which takes 3 float arguments and needs ~20 cycles to do its work (to simplify, the cycle timing for FMAC is used here, on the huge assumption that the compiler will recognise that the operation translates directly to FMAC; confusingly, the VFP multiply-accumulate instruction is called VMLA in the ARM Architecture Reference Manual).

float fmadd(float a, float b, float c)  {   return a + (b * c);   }

Passing these 3 FP arguments will incur 20+ cycles per float argument on entry to the function (~60 cycles), plus at least one register transfer for the float result (~80 cycles of ABI overhead in total). In this case a 20-cycle function now takes ~100 cycles to complete, 5 times more than the actual operation, and ~80% of the function time is spent handling the ABI requirements. Pseudocode follows:

u32 fmadd(u32 a, u32 b, u32 c)
{
    float fa, fb, fc;
    MOV fa, a;        /* copy r0 into an FPU register: pipeline stall */
    MOV fb, b;        /* copy r1 into an FPU register: pipeline stall */
    MOV fc, c;        /* copy r2 into an FPU register: pipeline stall */
    FMAC fa, fb, fc;  /* the actual ~20-cycle operation: fa += fb * fc */
    MOV a, fa;        /* copy the result back to r0 as the return value */
}

Consider that a double requires 2 integer registers, aligned to an even/odd pair (r0/r1, r2/r3). This wastes integer registers in the first place, so passing mixed FP and integer types should be considered carefully. Copying this data into a double-precision register takes twice as long, yet the additional time to process double-precision data in the FPU is only between 1 and 5 cycles more. In this case the function will spend ~160 cycles on ABI overhead, plus the extra 5 cycles maximum for double precision, and other overheads (masking and shifting 32-bit data into a 64-bit register) could bump this to well over 200 cycles. This is 10 times more time than it takes to run FMAC alone, and ~90% of the function run time. You had better hope the compiler inlines it :)
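
To make the register waste concrete, here is a sketch of how the base (soft-float) calling convention would assign a hypothetical mixed signature under the alignment rule just described:

/* void f(int a, double b, int c);
 *   a -> r0
 *   b -> r2/r3   (the double must start at an even register, so r1 is wasted)
 *   c -> stack   (r0-r3 are exhausted)
 */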

Longer functions obviously have less overhead in comparison, but conversely may require more floating-point arguments as inputs, adding more overhead. It is a big trade-off in performance to use "softfp" on VFPv3.

Cortex-A9 processors have far, far better FPU instruction timings, meaning that the relative ABI overhead is far, far higher.

"hard"

  • Full hardware floating point.
  • FP arguments are passed directly in FPU registers
  • Code cannot run without the FPU specified in -mfpu= (or a superset of it, where relevant)

  • FPU instructions could be emulated by the kernel so that FPU-less systems could run with this ABI, but as far as we know no such implementation exists
  • No function prologue or epilogue requirements for FP arguments, no pipeline stalls, maximum performance (just like in PowerPC and MIPS)

FPU selection

In addition, the -mfpu= option can be used to select a VFP/NEON (or FPA or Maverick) variant.

This has no effect when -mfloat-abi=soft is specified.

The combination of -mfpu=vfp and -mfloat-abi=hard is not available in FSF GCC 4.4, but the CodeSourcery 2009q1 compiler (and later, current release is 2010q1) supports it as does FSF GCC 4.5.
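
For illustration only (the FPU name will vary with the target), the three ABI settings discussed above might be selected like this:

gcc -c -mfloat-abi=soft foo.c                    # no FPU instructions, -mfpu= is ignored
gcc -c -mfloat-abi=softfp -mfpu=vfpv3-d16 foo.c  # VFP instructions, FP arguments passed in integer registers
gcc -c -mfloat-abi=hard -mfpu=vfpv3-d16 foo.c    # VFP instructions, FP arguments passed in VFP registers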

ld.so hwcaps

The GCC -mfloat-abi=softfp flag allows use of VFP while remaining compatible with soft-float code. This allows selection of appropriate routines at runtime based on the availability of VFP hardware.

The runtime linker, ld.so, supports a mechanism for selecting runtime libraries based on features reported by the kernel. For instance, it's possible to provide two versions of libm, one in /lib and another one in /lib/vfp, and ld.so will select the /lib/vfp one on systems with VFP.

This mechanism is dubbed "hwcaps".
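
As an illustration of the layout (library version numbers are just an example), a system might carry:

/lib/libm.so.6        baseline soft-float build, always usable
/lib/vfp/libm.so.6    VFP build, picked by ld.so when the kernel reports VFP support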

Huge Caveat: This ld.so operation relies implicitly on the code being linked having a compatible ABI. While on, say, PowerPC there is adequate information in the ELF header to describe the floating-point ABI of the executable and the endianness of the data involved, the ARM EABI and ELF specification has no way to tell whether soft, softfp or hard was used to build it. Your only protection is that the toolchain can detect, at the earlier stage of static object generation, that the soft and hard ABIs are incompatible and refuse to link them into a single (static) object file; but dynamic linking of an executable to a library will not know that it is a bad idea to link a softfp libc or libm to a hardfp executable and vice-versa.

This is described in the Application Binary Interface for the ARM Architecture manual (ARM IHI 0036B), section 3.10.1, where a build attributes record is appended to "ar"-style objects produced by the compiler to "allow linkers to determine whether separately built relocatable files are inter-operable or incompatible, and to select the variant of a required library member that best matches the intentions of their builders." This build attributes record is NOT carried over in detail into the ELF for the ARM Architecture document (ARM IHI 0044D), section 4.4.6, which merely notes that the extension exists. None of the ELF files tested, nor GNU ld.so, seems to take this information into account - or at least objdump seems incapable of telling the difference.

This needs to be looked into: is it only part of the version 5 EABI and simply not implemented? Could there be scope for a GNU EABI extension for this, or a fix for the linker?

Endianness, architecture level, CPU, VFP level

A new port would be little-endian, as that is the most widely used endianness in recent ARM designs.

Since the new port would require VFP, this limits which SoCs the new port can support.

The toolchain needs to be configured with a specific base CPU and base VFP version in mind.

It might make sense for such a new port -- which would essentially target newer hardware -- to target newer CPUs. For instance, it could target ARMv6 or ARMv7 SoCs, and VFPv2, VFPv3-D16 or NEON.

If targeting ARMv7, another option is to build for Thumb-2.

FPU

  • VFPv3-D16 is the common denominator of the processors to support here (therefore the recommended build option is -mfpu=vfpv3-d16)

  • Some of them support the IEEE half-precision FP format (-mfpu=vfpv3-fp16), but not all, and in any case the usefulness of this extension is debatable

  • Building for VFPv3-D16 instead of VFPv3[-D32] only loses the use of 16 FP registers - not a great loss

CPU

  • The lowest CPU implementation is ARMv7-A (therefore the recommended build option is -march=armv7-a)

  • Some concern for fast-enough, pretty awesome (600MHz+) ARMv6 + VFPv2 processors here - i.MX37 etc. - which will not be supported, but we will have to live with that
    • The difference between ARMv6 and ARMv7 is mostly at the kernel level, though ARMv7 also has better knowledge of the cache and some extra memory barrier instructions
    • The difference between VFPv2 and VFPv3 is fundamentally the float-to-fixed and float-to-double conversion instructions (VCVT) and the loading of common FP constants (VMOV immediate)
    • These are very useful and very, very desirable (see the example build flags after this list)
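
Putting the FPU and CPU recommendations above together, a build for such a port might use something like the following (a sketch only; -mthumb for Thumb-2 is the optional extra mentioned earlier, and the choice between softfp and hard is the ABI question discussed above):

gcc -march=armv7-a -mthumb -mfpu=vfpv3-d16 -mfloat-abi=hard -c foo.c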

NEON

NEON is an extension of the VFP which allows for very efficient manipulation of matrices and vector data in general. This is notably useful for processing audio and video data, but also has potential to be used for high-speed memory copies (128 bits at a time).

  • Building for -mfpu=neon means Marvell and nVidia parts are left out.

  • While optimizing the entire system for NEON would be awesome, there is very little benefit on standard code

    • Using NEON as a scalar FPU runs faster (much as -mfpmath=sse beats -mfpmath=387 on x86)

      • but it's not IEEE 754 compliant (NEON arithmetic flushes denormals to zero, among other deviations), so using NEON like this is sometimes a bad idea

      • GCC doesn't implement this for ARM anyway, the way it does with -mfpmath=sse on x86
      • In any case on Cortex-A9 the benefit is nil (VFP runs at the same performance as NEON for scalar fp)

    • Autovectorizing (-ftree-vectorize) for NEON gives between zero and negligible performance gains

    • NEON optimizations - as with AltiVec (PPC) and SSE (x86) - usually come from targeted optimization of code by hand, using intrinsics or hand-written assembler (see the intrinsics sketch after this list).

      • This includes tricks for linear algebra (matrices etc.). One technique to speed up large matrix calculations is to subdivide them and process 2x2 blocks in one NEON operation; autovectorizing compilers cannot detect this.

      • This also includes things like using NEON to approximate transcendental functions (sin, cos, etc.) by performing multiple reciprocal estimates (good replacement for division) at once to refine the accuracy. Autovectorizing compilers do not do this (although GCC -freciprocal-math does, can't recall if it's good for ARM)

      • The best performance comes from deriving parallelization using mathematical proof of the original function, and autovectorizing compilers don't do this. Pretty much all they do is unroll loops.
      • Therefore: make sure -ftree-vectorize is turned off :)
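
As a small illustration of the hand-written approach described above (a sketch only; the function and its assumptions are invented for this example, and the element count is assumed to be a multiple of 4), a NEON intrinsics kernel built with something like -mfpu=neon looks like this:

#include <arm_neon.h>

/* dst[i] = a[i] + b[i] * k, computed four single-precision lanes at a time */
void scale_add(float *dst, const float *a, const float *b, float k, unsigned n)
{
    float32x4_t vk = vdupq_n_f32(k);                /* broadcast k into all 4 lanes */
    for (unsigned i = 0; i < n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);          /* load 4 floats from a */
        float32x4_t vb = vld1q_f32(b + i);          /* load 4 floats from b */
        vst1q_f32(dst + i, vmlaq_f32(va, vb, vk));  /* a + b*k per lane, then store */
    }
}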