Differences between revisions 40 and 41
Revision 40 as of 2010-07-14 19:05:25
Size: 13068
Editor: MattSealey
Comment: thumb interworking doc from gcc svn
Revision 41 as of 2010-08-11 13:53:51
Size: 13291
Editor: PeterMaydell
Comment: clarify that interworking on v7A cores isn't as expensive as old v4T interworking was

This page gathers thoughts and ideas around a hard-float ARM port for Debian.

Rationale

A lot of modern ARM boards and devices ship with a floating-point unit (FPU), but the current Debian armel port doesn't take much advantage of it.

A new ARM port requiring the presence of an FPU would help get the most performance out of hardware that has one.

Background information

This section provides some background information on FPUs, ARM EABI, GCC floating-point ABIs, hwcaps...

VFP

With ARMv5 an optional floating point instruction set known as Vector Floating Point (VFP) was introduced. This is now effectively the norm for modern ARM implementations and a requirement for Cortex-A8 and Cortex-A9. Prior to this there was no real standardized set of floating point instructions, with some vendors supplying their own coprocessors.

VFP was extended over time, with VFPv2 (some ARM9/ARM11), VFPv3-D16 (e.g. Marvell Dove) and VFPv3+NEON (most Cortex-A8) present in current production silicon.

In spite of the name, the base VFP architecture is not well suited for vector operations. For practical purposes it is a normal scalar floating point unit. The NEON extension defines vector instructions similar to SSE or AltiVec and shares a register file with the VFP unit.

Both VFP and NEON remain optional parts of the architecture.

ARM EABI

The ARM EABI specification covers calling conventions across libraries and binaries. It defines two incompatible ABIs: one uses (VFP) floating point registers for passing function arguments, and the other does not.

Unlike many other architectures, ARM supports use of FPU instructions while still conforming to the base ABI. This allows code to take advantage of the FPU without breaking compatibility with older libraries or applications. This does incur some overhead relative to a full hard-float system, and obviously requires a VFP-capable CPU.

GCC floating-point options

For historical reasons and to match the ARM RVCT kit, the GCC FPU and ABI selection options are not entirely orthogonal. The -mfloat-abi= option controls both the ABI and whether floating point instructions may be used. The available options are:

  • soft: Full software floating point.

  • softfp: Use the FPU, but remain compatible with soft-float code.

  • hard: Full hardware floating point.

In addition, the -mfpu= option can be used to select a VFP/NEON (or FPA or Maverick) variant. This has no effect when -mfloat-abi=soft is specified.

The combination of -mfpu=vfp and -mfloat-abi=hard is not available in FSF GCC 4.4; see TODO section below for options.
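
The three -mfloat-abi= settings differ along two axes: whether FPU instructions may be emitted, and how floating-point arguments are passed. A minimal Python sketch of that model follows. The FloatAbi class and the link_compatible helper are illustrative names made up for this page, not part of GCC or any toolchain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FloatAbi:
    """Hypothetical model of one -mfloat-abi= setting."""
    name: str
    uses_fpu_instructions: bool     # may the compiler emit FPU instructions?
    fp_args_in_vfp_registers: bool  # are FP arguments passed in VFP registers?


# The three values described above.
FLOAT_ABIS = {
    "soft": FloatAbi("soft", False, False),
    "softfp": FloatAbi("softfp", True, False),
    "hard": FloatAbi("hard", True, True),
}


def link_compatible(a: str, b: str) -> bool:
    """Code built with settings a and b can call each other only if both
    agree on how floating-point arguments are passed."""
    return (FLOAT_ABIS[a].fp_args_in_vfp_registers
            == FLOAT_ABIS[b].fp_args_in_vfp_registers)
```

In this model soft and softfp code can be mixed, while hard code can only be mixed with other hard code, which is why a hard-float system needs a separate port.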

See /VfpComparison for an in depth discussion and some performance research.

ld.so hwcaps

The GCC -mfloat-abi=softfp flag allows use of VFP while remaining compatible with soft-float code. This allows selection of appropriate routines at runtime based on the availability of VFP hardware.

The runtime linker, ld.so, supports a mechanism for selecting runtime libraries based on features reported by the kernel. For instance, it's possible to provide two versions of libm, one in /lib and another one in /lib/vfp, and ld.so will select the /lib/vfp one on systems with VFP.

This mechanism is dubbed "hwcaps".

This only works when the binaries are compatible (use the same ABI); you can't select between hard-float and soft-float libs with hwcaps. See /VfpComparison for more details.
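
As a rough illustration of the idea, the selection can be modelled as "prefer a capability-specific subdirectory if one exists, else fall back to the base directory". The sketch below is a deliberately simplified Python model with hypothetical names (pick_library_dir and its parameters); the real search logic in ld.so is more involved.

```python
def pick_library_dir(base_dir, hwcaps, available_dirs):
    """Return the directory a hwcaps-aware loader would prefer (simplified).

    base_dir       -- default library directory, e.g. "/lib"
    hwcaps         -- capability names reported for this CPU, most preferred first
    available_dirs -- set of directories that actually contain the library
    """
    for cap in hwcaps:
        candidate = f"{base_dir}/{cap}"
        if candidate in available_dirs:
            return candidate
    # No capability-specific build found: use the generic one.
    return base_dir
```

With a libm installed in both /lib and /lib/vfp, a system reporting "vfp" would get /lib/vfp and a system reporting nothing would get /lib. Both copies must use the same calling convention for this to be safe.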

Endianness, architecture level, CPU, VFP level

A new port would be little-endian as that is the most widely used endianness in recent ARM designs.

Since the new port would require VFP, it would limit which SoCs are supported by the new port.

The toolchain needs to be configured with a specific base CPU and base VFP version in mind.

It might make sense for such a new port -- which would essentially target newer hardware -- to target newer CPUs. For instance, it could target ARMv6 or ARMv7 SoCs, and VFPv2, VFPv3-D16 or NEON.

If targeting ARMv7, another option is to build for Thumb-2.

Name of the port

The table below recaps which port names Debian/dpkg has used so far.

name  | endianness    | status
------+---------------+--------------------------------------------------------------------------------------
arm   | little-endian | last release in Debian lenny; being retired in favor of armel
armel | little-endian | introduced in Debian lenny; actively maintained; targets armv4t; doesn't require an FPU
armeb | big-endian    | unofficial port; inactive and dead

The name of a new ARM port using the hard-float ABI should probably start with arm and include hf for hard-float or fp for floating-point in the name.

It is possible, though not required, to encode the base architecture / CPU in the port name, e.g. arm7hf for an ARMv7 hard-float port.

It is also possible to encode the endianness explicitly, e.g. armelhf, but the new port could also simply be named armhf since a big-endian port is unlikely.

It is also possible to encode profiles in the name as A/M/R.

Triplet

GCC, when built to target the GNU arm-linux-gnueabi triplet, will support both the hard-float and soft-float calling conventions.

dpkg relies on the triplet to identify the port (gcc -dumpmachine output). Some other projects such as multiarch rely on having distinct triplets across all Debian architectures.

For the new Debian port, it is possible to use the vendor field in the triplet to have distinct triplets. For instance, the triplet could be arm-hardfloat-linux-gnueabi.

arm-none-linux-gnueabi, as used by the CodeSourcery compilers, would be an option, but arm-linux-gnueabi and arm-none-linux-gnueabi are easy to confuse with each other. arm-hardfloat-linux-gnueabi is clearer, and it also distinguishes the new port from the CodeSourcery toolchains.
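
To make the role of the vendor field concrete, here is a small Python sketch that splits a triplet into its parts. The parse_triplet function is a hypothetical helper written for this page, not something dpkg or GCC provides, and it assumes only the three-part and four-part forms discussed above.

```python
def parse_triplet(triplet):
    """Split a GNU triplet into cpu, vendor, kernel and system fields.

    The vendor field is optional: "arm-linux-gnueabi" has none, while
    "arm-hardfloat-linux-gnueabi" uses "hardfloat" as the vendor.
    """
    parts = triplet.split("-")
    if len(parts) == 3:
        cpu, kernel, system = parts
        vendor = None
    elif len(parts) == 4:
        cpu, vendor, kernel, system = parts
    else:
        raise ValueError(f"unexpected triplet form: {triplet!r}")
    return {"cpu": cpu, "vendor": vendor, "kernel": kernel, "system": system}
```

Under this reading, arm-linux-gnueabi and arm-hardfloat-linux-gnueabi differ only in the vendor field, which is enough to keep the two ports distinct for tools that key on the full triplet string.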

Performance improvements and benchmarks

Genesi USA, Inc. did a proof-of-concept rebuild of Ubuntu karmic (9.10)'s armel port with the hard-float ABI. They noticed significant wins (on the order of a 40% performance improvement) in floating-point-heavy applications/libraries such as mesa, on a Cortex-A8 CPU.

It's likely that the performance benefits are much larger on Cortex-A8 CPUs than on Cortex-A9 CPUs which have a faster VFP design and more conventional pipeline.

NEON

NEON is an extension of the VFP which allows for very efficient manipulation of matrices, and vector data in general. This is notably useful for processing audio and video data, or for fast memcpy().

Programs usually take advantage of NEON thanks to hand-crafted assembly routines. GCC can automatically vectorize code and generate NEON instructions, but this tends to have limited success. It would seem sensible NOT to require NEON in a new port since some modern ARMv7 SoCs such as Marvell Dove and Nvidia Tegra2 don't implement it.

It is also possible to use NEON instructions for regular scalar floating point code, and this can give a significant (2-3x) speedup on Cortex-A8 hardware. However, GCC does not currently implement this, and it is not always applicable as NEON instructions are not fully IEEE compliant.

See /VfpComparison for an in depth discussion and some performance research.

Hardware

Genesi-USA would be happy to continue sharing the 9 EfikaMX (Freescale i.MX51) buildds used for their proof-of-concept to help get a new port started.

Genesi-USA is also giving hardware (10 EfikaMX T03) to main Debian sub-project leads for Education, Embedded, Live systems, Ubuntu developers, and Linaro developers.

Genesi-USA is also giving away old hardware (EfikaMX T02) which could be used to help out buildds, set up porterboxes, or be given to interested developers who would work on the new port. Stock is limited; if you are interested, register on the PowerDeveloper site and then contact Hector Oron <zumbi@debian.org>.

TODO

  • Choose a port name
  • Decide on a triplet
  • Get these into dpkg
  • Get a compiler in shape for this port:
    • GCC 4.5 supports the hard-float ABI, but 4.4 does not
    • the new port could either have backported support in gcc-4.4, or use gcc-4.5 from the start
    • or use a different code base such as CodeSourcery SourceryG++ (~4.4.1) or Linaro GCC (~4.4.1 -> 4.5)
      • These two compilers are essentially FSF GCC with the fancier features from future GCC releases backported
  • Start bootstrapping
  • Fix / port packages
    • libffi needs porting

Partial reference of SoC and supported ISAs

Manufacturer      | SoC        | architecture | VFP       | SIMD    | Notes
------------------+------------+--------------+-----------+---------+----------------------------------------------------
Freescale         | iMX3x      | armv6        | VFPv2     | none    | ARM11
Freescale         | iMX5x      | armv7        | VFPv3     | NEON    | Cortex-A8; NEON only reliable in Tape-Out 3 or above
Nvidia            | Tegra2     | armv7        | VFPv3 D16 | none    |
Marvell           | Dove       | armv7        | VFPv3 D16 | iwMMXt  |
Texas Instruments | OMAP3xxx   | armv7        | VFPv3     | NEON    | Cortex-A8
Texas Instruments | OMAP4xxx   | armv7        | VFPv4     | unclear | Cortex-A9
Qualcomm          | Snapdragon | armv7        | VFPv3     | NEON[1] | Qualcomm "Scorpion" core

Larger list at http://en.wikipedia.org/wiki/ARM_architecture

Opinions

Genesi's (MattSealey's) opinion/recommendations:

Port Name

As of 2010-07-14 the port name "cortex" seems to make more sense, considering that armv7-a and vfpv3-d16 are the base specification for all Cortex-A8 and broadly compatible processors (A8, A9, A5, Qualcomm Scorpion officially, and Marvell's armv7 additionally) as defined in the Cortex-A Series documentation and the ARM Architecture Reference Manual, ARMv7-A and ARMv7-R edition. Since all of these processor lines are required to contain a VFPv3 FPU (and now come with VFPv4 in A9, VFPv4-D16 in A5), this restricts the port to these and any future compatible processors. ARM may drop the Cortex moniker for far-future chips but this has not stopped debian-ppc from having to rename to debian-powerarchitecture, and ARM have a great track record of keeping binary compatibility as they move forward.

Losing the "e" for EABI and "l" for little endian doesn't seem like that big a loss, as by far the most usual operation of these processor cores is under a little-endian EABI environment, and OABI is not even considered supported anymore for these chips. It is implied that it is an ARM architecture by the compiler tuple (arm-vendor-linux-gnueabi).

  • Old Opinion: Debian's original suggestion was "armelfp" or "armelhp". We like "armelvfp" as it removes the expectation that it might run on FPA or other weird FPU variants.

Compiler

CodeSourcery and Linaro GCC are almost the same thing. Genesi recommends CodeSourcery 2010q1 for now: it is gcc 4.4.1, but it is functionally equivalent to gcc 4.5 for the purpose of a port. There are qualms about not using mainline FSF GCC; as I understand it, Debian policy allows no custom patches unless they are going to be mainlined in the future. It can be argued that CodeSourcery commits most of the ARM and PPC support for GCC and are the architects of hardfp support in gcc 4.5 (published and maintained in the SourceryG++ tree). It's still GPL.

Minimum CPU & FPU

  • The lowest worthwhile CPU implementation is ARMv7-A (therefore the recommended build option is -march=armv7-a)

  • FPU should be set at VFPv3-D16 as it represents the minimum specification of the processors to support here (therefore the recommended build option is -mfpu=vfpv3-d16)

    • Building for VFPv3-D16 instead of VFPv3[-D32] only loses the use of 16 FP registers - not a great loss
  • Thumb2/ThumbEE: NO.
    • Thumb interworking relies on performance-sapping call stubs (extra branches, to be exact) to make sure standard ARM code and Thumb code can call each other. Applications can ship Thumb libraries if they like, or be compiled as Thumb, and have the interworking code embedded in themselves. The standard base system should be standard ARM.

  • Some concern for fast-enough, pretty awesome (600MHz+) ARMv6 + VFPv2 processors here (i.MX37 etc.), which will not be supported, but we will have to live with that

[Actually that gcc README is describing the ancient ARMv4T Thumb1 interworking; for Thumb2 (which any v7A core will have) there is no need for interworking veneers, so it's not a reason to reject Thumb2 -- PeterMaydell]
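
For illustration only, the two build options recommended above can be combined with the hard-float ABI setting from the GCC section into a single compile command. The Python sketch below just assembles the argument list; build_compile_command, its parameters, and the default compiler name "gcc" are assumptions made for this example, and a real build would go through the port's cross or native toolchain with a compiler that supports -mfloat-abi=hard.

```python
def build_compile_command(source, output, compiler="gcc"):
    """Assemble a compile command using the options recommended above.

    The flags come from the text: -march=armv7-a and -mfpu=vfpv3-d16 from
    the minimum CPU/FPU recommendation, and -mfloat-abi=hard for the
    hard-float calling convention.
    """
    flags = ["-march=armv7-a", "-mfpu=vfpv3-d16", "-mfloat-abi=hard"]
    return [compiler, *flags, "-c", source, "-o", output]
```

Calling build_compile_command("foo.c", "foo.o") yields a list suitable for passing to subprocess.run on a machine with a suitable ARM compiler installed.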

Summary of Benefits

Using armv7 as a base

[1] http://www.insidedsp.com/Articles/tabid/64/articleType/ArticleView/articleId/238/Qualcomm-Reveals-Details-on-Scorpion-Core.aspx