This page goes into more depth about ABIs and performance.

VFP versions

VFP is ARM's "Vector Floating Point" unit. SIMD operations are better served by the extensions provided by ARM (NEON, as in Cortex-A8 and Cortex-A9), by Intel, Marvell and others (iwMMXt), or by supplemental DSP coprocessors (TI OMAP). VFP can perform vector operations across "banks" of registers, but this is rarely used and is deprecated in the presence of NEON. In all other respects it is a standard FPU, as you would see in any other processor.

VFP Instruction Set Quick Reference

Details on GCC floating-point options

For historical reasons and to match the ARM RVCT kit, the GCC FPU and ABI selection options are not entirely orthogonal. The -mfloat-abi= option controls both the ABI and whether floating-point instructions may be used. The available options are:

"soft"

"softfp"

The caveat is that copying data from integer to floating-point registers incurs a pipeline stall for each register passed (rN->fN), or a memory read for stack items. This has noticeable performance implications: a lot of time is spent in the function prologue and epilogue copying data back and forth between integer and FPU registers. Each transfer could cost 20 cycles or more.

As a thought experiment, consider a function that takes 3 float arguments and needs ~20 cycles to do its work (to simplify, I used the cycle timing for FMAC and am making the huge assumption that the compiler will recognise that the operation translates directly to FMAC. Confusingly, the VFP multiply-accumulate instruction is called VMLA in the ARM Architecture Reference Manual).

float fmadd(float a, float b, float c)  {   return a + (b * c);   }

Passing these 3 FP arguments will incur 20+ cycles per float argument on entry to the function (~60 cycles in total) and at least one register transfer for the float result (bringing the overhead to ~80 cycles). In this case a 20-cycle function now takes 100 cycles to complete, 5 times as long as the actual operation, and ~80% of the function time is spent handling the ABI requirements. Pseudocode follows:

u32 fmadd(u32 a, u32 b, u32 c)  { float fa, fb, fc; MOV fa, a; MOV fb, b; MOV fc, c; FMAC fa, fb, fc; MOV a, fa; return a; }
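The pseudocode above can be approximated in portable C. This is only a sketch of what the callee effectively has to do under a soft-float calling convention, not what the compiler literally emits: the helper names (bits_to_float, float_to_bits, fmadd_softfp_model) are made up for illustration, and memcpy stands in for the integer-to-FPU register moves.

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret a 32-bit integer register image as a float
 * (stands in for an rN -> fN move). Hypothetical helper. */
static float bits_to_float(uint32_t bits)
{
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Reinterpret a float as its 32-bit integer register image
 * (stands in for an fN -> rN move). Hypothetical helper. */
static uint32_t float_to_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Model of the callee: arguments arrive as integer bit patterns,
 * are moved to FP registers, the multiply-accumulate runs, and the
 * result is moved back to an integer register. */
uint32_t fmadd_softfp_model(uint32_t a, uint32_t b, uint32_t c)
{
    float fa = bits_to_float(a);
    float fb = bits_to_float(b);
    float fc = bits_to_float(c);
    return float_to_bits(fa + fb * fc);
}
```

Only the single multiply-accumulate expression is useful work; every other line models a transfer that the calling convention forces on the function.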

Consider that a double requires 2 integer registers, paired so that the pair starts at an even register (r0/r1, r2/r3). This wastes integer registers in the first place, so passing mixed FP and integer types should be considered carefully. Copying this data into a double register will take twice as long. However, the additional time to process double-precision data in the FPU is only between 1 and 5 cycles. In this case, the ABI overhead alone is 160 cycles; adding the 20 cycles of work, at most 5 extra cycles for double precision, and other overheads (masking and shifting 32-bit data into a 64-bit register) could bump the total to well over 200 cycles. This is 10 times longer than it takes to run FMAC alone, and the overhead is ~90% of the function run time. You had better hope the compiler inlines it :)
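The arithmetic in this thought experiment can be reproduced with a few lines of C. The constants are the rough estimates quoted in the text, not measurements, and the function names are invented for this sketch.

```c
/* Cycle estimates taken from the text; rough figures, not measurements. */
enum {
    TRANSFER_CYCLES = 20, /* one integer <-> FPU register move */
    WORK_CYCLES     = 20, /* the multiply-accumulate itself */
    DOUBLE_EXTRA    = 5   /* worst-case extra cost of double precision */
};

/* Cycles spent moving nargs arguments in and one result out.
 * words_per_value is 1 for float and 2 for double. */
int abi_overhead_cycles(int nargs, int words_per_value)
{
    return (nargs + 1) * words_per_value * TRANSFER_CYCLES;
}

/* Percentage of the total call time spent on ABI transfers,
 * rounded down to a whole percent. */
int abi_overhead_percent(int overhead, int work)
{
    return (100 * overhead) / (overhead + work);
}
```

With these figures, three float arguments give 80 cycles of overhead out of 100 (80%). Three double arguments give 160 cycles of overhead on top of 25 cycles of work, or about 86%, before the additional masking and shifting costs that the text expects to push the total past 200 cycles.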

Longer functions obviously have less overhead in comparison, but conversely may require more floating-point arguments as inputs, adding more overhead. Using "softfp" on VFPv3 is a big trade-off in performance.

Cortex-A9 processors have far, far better FPU instruction timings, meaning that the relative overhead of the register transfers is far, far higher.

"hard"

FPU selection

In addition, the -mfpu= option can be used to select a VFP/NEON (or FPA or Maverick) variant.

This has no effect when -mfloat-abi=soft is specified.

The combination of -mfpu=vfp and -mfloat-abi=hard is not available in FSF GCC 4.4, but the CodeSourcery 2009q1 compiler and later releases (the current release is 2010q1) support it, as does FSF GCC 4.5.
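To check which float ABI a given file was actually compiled for, the compiler's predefined macros can be inspected. The following is a minimal sketch that assumes GCC's __ARM_PCS_VFP and __SOFTFP__ macros behave as described in the comments; other compilers and older GCC releases may not define them, and the function name float_abi_name is invented for this example.

```c
/* Returns a string naming the float ABI this file was compiled for.
 * Assumption: GCC predefines __ARM_PCS_VFP for -mfloat-abi=hard and
 * __SOFTFP__ for -mfloat-abi=soft; other compilers may differ. */
const char *float_abi_name(void)
{
#if !defined(__arm__)
    return "not-arm";
#elif defined(__ARM_PCS_VFP)
    return "hard";    /* FP arguments are passed in VFP registers */
#elif defined(__SOFTFP__)
    return "soft";    /* no FP instructions, library calls only */
#else
    return "softfp";  /* FP instructions, integer-register calling convention */
#endif
}
```

Building the same file with -mfloat-abi=soft, -mfloat-abi=softfp and -mfloat-abi=hard (together with a suitable -mfpu= value) and printing the result is a quick way to confirm what a toolchain defaults to.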

ld.so hwcaps

The GCC -mfloat-abi=softfp flag allows use of VFP while remaining compatible with soft-float code. This allows selection of appropriate routines at runtime based on the availability of VFP hardware.

The runtime linker, ld.so, supports a mechanism for selecting runtime libraries based on features reported by the kernel. For instance, it's possible to provide two versions of libm, one in /lib and another one in /lib/vfp, and ld.so will select the /lib/vfp one on systems with VFP.

This mechanism is dubbed "hwcaps".
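To see which features the kernel reports on a given board, the "Features" line of /proc/cpuinfo can be inspected; on ARM Linux it lists capability names such as vfp and neon. The sketch below only illustrates that check. It does not reproduce how ld.so itself obtains the capabilities, and the function names has_feature and cpu_has_feature are invented for this example.

```c
#include <stdio.h>
#include <string.h>

/* Return 1 if `flag` appears as a whole whitespace-separated token
 * in `features`, 0 otherwise. */
int has_feature(const char *features, const char *flag)
{
    size_t n = strlen(flag);
    const char *p = features;

    if (n == 0)
        return 0;
    while ((p = strstr(p, flag)) != NULL) {
        int starts = (p == features) || p[-1] == ' ' || p[-1] == '\t';
        int ends = p[n] == '\0' || p[n] == ' ' || p[n] == '\n';
        if (starts && ends)
            return 1;
        p += n;
    }
    return 0;
}

/* Scan /proc/cpuinfo for the "Features" line and test it for `flag`.
 * Returns -1 if the file or the line is not available. */
int cpu_has_feature(const char *flag)
{
    char line[1024];
    FILE *f = fopen("/proc/cpuinfo", "r");
    int result = -1;

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (strncmp(line, "Features", 8) == 0) {
            const char *colon = strchr(line, ':');
            if (colon != NULL)
                result = has_feature(colon + 1, flag);
            break;
        }
    }
    fclose(f);
    return result;
}
```

Matching whole tokens matters here: a CPU that lists only vfpv3 should not be reported as having the plain vfp entry by accident.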

Huge Caveat: This ld.so operation relies implicitly on the code being linked having a compatible ABI. On PowerPC, say, the ELF header carries adequate information to describe the floating-point ABI of the executable and the endianness of the data involved, but the ARM EABI and ELF specifications have no way to tell which of soft, softfp or hard was used to build a binary. Your only protection is that the compiler can detect at an earlier stage of object generation that the soft and hard ABIs are incompatible and prevent linking them into a single (static) object file. Dynamic linking of an executable to a library, however, will not know that it is a bad idea to link a softfp libc or libm to a hardfp executable, and vice versa.

This is described in the ARM Application Binary Interface for ARM Architecture manual (ARM IHI 0036B), section 3.10.1, where a build attributes record is appended to "ar" style objects produced by the compiler to "allow linkers to determine whether separately built relocatable files are inter-operable or incompatible, and to select the variant of a required library member that best matches the intentions of their builders." The ELF for ARM Architecture document (ARM IHI 0044D), section 4.4.6, does NOT carry this build attributes record over in detail; it simply notes that the extension exists. Neither the ELF files tested nor GNU ld.so seem to take this information into account, or at least objdump seems incapable of telling the difference.

This needs to be looked into. Is it only a part of the version 5 EABI and not implemented? Could there possibly be scope for a GNU EABI extension for this, or a fix for the linker?

Endianness, architecture level, CPU, VFP level

A new port would be little-endian, as that is the most widely used endianness in recent ARM designs.

Since the new port would require VFP, it would limit which SoCs are supported by the new port.

The toolchain needs to be configured with a specific base CPU and base VFP version in mind.

It might make sense for such a new port, which would essentially target newer hardware, to target newer CPUs. For instance, it could target ARMv6 or ARMv7 SoCs, and VFPv2, VFPv3-D16 or NEON.

If targeting ARMv7, another option is to build for Thumb-2.
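Whatever baseline is chosen, it can be useful to confirm what a toolchain was actually configured for. The sketch below collects a few compile-time facts into a struct; it assumes the GCC-style predefined macros __ARM_ARCH, __thumb2__ and __BYTE_ORDER__ are available, and both the struct and the function name are hypothetical.

```c
/* Summary of the baseline this translation unit was compiled for. */
struct arm_target {
    int arch;          /* ARM architecture level, 0 if unknown or not ARM */
    int thumb2;        /* 1 if generating Thumb-2 code, else 0 */
    int little_endian; /* 1 if little-endian, 0 if big-endian or unknown */
};

/* Assumption: the compiler provides GCC-style predefined macros;
 * any macro that is missing leaves the field at 0. */
struct arm_target describe_target(void)
{
    struct arm_target t = {0, 0, 0};
#if defined(__ARM_ARCH)
    t.arch = __ARM_ARCH;
#endif
#if defined(__thumb2__)
    t.thumb2 = 1;
#endif
#if defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__)
    t.little_endian = (__BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__);
#endif
    return t;
}
```

A port that settles on, say, ARMv7 with Thumb-2 could use the same macros in a build-time check so that a misconfigured compiler is caught early.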

FPU

CPU

NEON

NEON is an extension of the VFP which allows for very efficient manipulation of matrices, and vector data in general. This is notably useful for processing audio and video data but also has potential to be used for high-speed memory copies (128-bit at a time).
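As a small illustration of the 128-bit operation described above, the following sketch adds two float arrays four lanes at a time using the NEON intrinsics from arm_neon.h, with a scalar fallback when NEON is not enabled at compile time. The function name add_f32 is invented for this example, and no claim is made about how much faster it is on any particular core.

```c
#include <stddef.h>

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* dst[i] = a[i] + b[i] for n floats; processes four lanes (128 bits)
 * per iteration when NEON is enabled at compile time. */
void add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i); /* load 4 floats */
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(dst + i, vaddq_f32(va, vb));
    }
#endif
    for (; i < n; i++) /* scalar tail, or the whole array without NEON */
        dst[i] = a[i] + b[i];
}
```

On 32-bit ARM with GCC the NEON path is only compiled when an option such as -mfpu=neon is given; otherwise the scalar loop handles everything.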