Differences between revisions 27 and 28
Revision 27 as of 2010-07-09 11:27:07
Size: 7698
Editor: LoicMinier
Comment:
Revision 28 as of 2010-07-09 19:52:51
Size: 15572
Editor: MattSealey
Comment: Additions from Genesi (very verbose)

This page gathers thoughts and ideas around a hard-float ARM port for Debian.

Rationale

A lot of modern ARM boards and devices ship with a floating-point unit (FPU), but the current Debian armel port doesn't take much advantage of it.

A new ARM port requiring the presence of an FPU would help squeeze the most performance juice out of hardware with an FPU.

Background information

This section provides some background information on FPUs, the ARM EABI, GCC floating-point ABIs and hwcaps.

VFP

A floating point instruction set known as Vector Floating Point (VFP) was introduced as an optional extension in the ARMv5TE era and became common with ARMv6. This is now effectively the norm for modern ARM implementations. Prior to this there was no real standardized set of floating point instructions, with some vendors supplying their own coprocessors.

VFP was extended over time, with VFPv2 (some ARM9/ARM11, such as the i.MX31 and i.MX37 among many others), VFPv3-D16 (e.g. Marvell Dove) and VFPv3+NEON (most Cortex-A8) present in current production silicon.

In spite of the name, the base VFP architecture is not well suited for vector operations. For practical purposes it is a normal scalar floating point unit. The NEON extension defines vector instructions similar to SSE or AltiVec and shares a register file with the VFP unit.

Both VFP and NEON remain optional parts of the architecture.

ARM EABI

The ARM EABI specification covers calling conventions across libraries and binaries. It defines two incompatible ABIs: one uses (VFP) floating point registers for passing function arguments, and the other does not.

Unlike many other architectures, ARM supports use of FPU instructions while still conforming to the base ABI. This allows code to take advantage of the FPU without breaking compatibility with older libraries or applications. This does incur some overhead relative to a full hard-float system, and obviously requires a VFP capable CPU.

GCC floating-point options

For historical reasons and to match ARM's RVCT toolchain, the GCC FPU and ABI selection options are not entirely orthogonal. The -mfloat-abi= option controls both the ABI and whether floating point instructions may be used. The available options are:

"soft" ABI

  • Full software floating point - compiler should refuse to generate a real FPU instruction and -mfpu= is ignored.

  • FPU operations are emulated using a compiler software floating point emulation library
  • Function calls are generated to pass FP arguments (float, double) in integer registers (one for float, a pair of registers for double)

"softfp" ABI

  • Hardware floating point using the soft floating point ABI

  • To reiterate, function calls are generated to pass FP arguments in integer registers
  • The compiler chooses between emulated and real FPU instructions depending on the chosen FPU type (-mfpu=)

  • This means soft and softfp code can be intermixed

The caveat is that copying data from integer to floating point registers incurs a pipeline stall for each register passed (rN->fN), or a memory read for stack items. This has noticeable performance implications, in that a lot of time is spent in the function prologue and epilogue copying data back and forth between integer and FPU registers. Each such transfer could cost 20 cycles or more.

As a thought experiment, consider a function that takes 3 float arguments and needs ~20 cycles to do its work (to simplify, I used the cycle timing for FMAC and am making the huge assumption that the compiler will recognise that the operation translates directly to FMAC).

float fmadd(float a, float b, float c)  {   return a + (b * c);   }

Passing these 3 FP arguments will incur 20+ cycles per float argument on entry to the function (~60 cycles), plus at least one more register transfer for the float result (~80 cycles in total). In this case a 20-cycle function now takes 100 cycles to complete, 5 times as long as the actual operation, and ~80% of the function time is spent handling the ABI requirements. Pseudocode follows:

u32 fmadd(u32 a, u32 b, u32 c)  { float fa, fb, fc; MOV fa, a; MOV fb, b; MOV fc, c; FMAC fa, fb, fc; MOV a, fa; return a; }

Consider that a double requires 2 integer registers, and the pair must start at an even-numbered register (r0/r1 or r2/r3; on little-endian ARM the least significant word goes in the even register). Only r0-r3 are used for arguments, so a third double argument is passed on the stack. This wastes integer registers in the first place, so passing mixed FP and integer types should be considered carefully. Copying this data into a double register will take twice as long. However, the additional time to process double precision data in the FPU is only between 1 and 5 cycles more. In this case the register transfers alone take ~160 cycles (twice the ~80 of the float case); add the actual work, up to 5 extra cycles for double precision, and other overheads (masking and shifting 32-bit data into a 64-bit register), and this could well exceed 200 cycles. That is 10 times as long as it takes to run FMAC alone, with ~90% of the function run time spent on the ABI. You had better hope the compiler inlines it :)

Longer functions obviously have less overhead in comparison, but conversely may require more floating point arguments as inputs, adding more overhead. Using "softfp" on VFPv3 is a big trade-off in performance.

Cortex-A9 processors have far better FPU instruction timings, which means the ABI overhead is proportionally far higher.

"hard" ABI

  • Full hardware floating point.
  • FP arguments are passed directly in FPU registers
  • Cannot possibly run without the FPU defined in -mfpu= (or a superset of the FPU defined, where relevant)

  • No function prologue or epilogue requirements for FP arguments, no pipeline stalls, maximum performance (just like in PowerPC and MIPS).

FPU selection

In addition, the -mfpu= option can be used to select a VFP/NEON (or FPA or Maverick) variant. This has no effect when -mfloat-abi=soft is specified.

The combination of -mfpu=vfp and -mfloat-abi=hard is not available in FSF GCC 4.4, but the CodeSourcery 2009q1 compiler (and later releases; the current one is 2010q1) supports it, as does FSF GCC 4.5.

ld.so hwcaps

The GCC -mfloat-abi=softfp flag allows use of VFP while remaining compatible with soft-float code. This allows selection of appropriate routines at runtime based on the availability of VFP hardware.

The runtime linker, ld.so, supports a mechanism for selecting runtime libraries based on features reported by the kernel. For instance, it's possible to provide two versions of libm, one in /lib and another one in /lib/vfp, and ld.so will select the /lib/vfp one on systems with VFP.

This mechanism is dubbed "hwcaps".

Huge Caveat: This ld.so operation relies implicitly on the code being linked having a compatible ABI. While on, say, PowerPC there is adequate information in the ELF header to describe the floating point ABI of the executable and the endianness of the data involved, the ARM ELF header has no field that says which of soft, softfp or hard was used to build it. Your only protection is at build time: the toolchain can detect that soft and hard ABI objects are incompatible and refuse to link them into a single (static) object file (GNU ld compares the Tag_ABI_VFP_args build attribute of each object). Dynamic linking of an executable to a library will not notice that it is a bad idea to link a softfp libc or libm to a hard-float executable, or vice versa. This needs to be looked into. Could there be scope for a GNU EABI extension for this?

Endianess, architecture level, CPU, VFP level

A new port would be little-endian, as that is the most widely used endianness in recent ARM designs.

Since the new port would require VFP, it would limit which SoCs it supports.

The toolchain needs to be configured with a specific base CPU and base VFP version in mind.

It might make sense for such a new port, which would essentially target newer hardware, to target newer CPUs. For instance, it could target ARMv6 or ARMv7 SoCs, and VFPv2, VFPv3-D16 or NEON.

If targeting ARMv7, another option is to build for Thumb-2.

Genesi's recommendations:

CPU

  • The lowest CPU implementation is ARMv7-A (therefore the recommended build option is -march=armv7-a)

FPU

  • VFPv3-D16, as it represents the minimum specification of the processors to support here (therefore the recommended build option is -mfpu=vfpv3-d16)

    • Marvell Dove is ARMv7-A + VFPv3-D16 (+ iWMMXt?)
    • Freescale i.MX5x is ARMv7-A + VFPv3 + NEON (traditional Cortex-A8)
    • TI OMAP3 is ARMv7-A + VFPv3 + NEON (traditional Cortex-A8)
    • Qualcomm Snapdragon is ARMv7-A + VFPv3 + NEON (Qualcomm's own Scorpion core, comparable to a traditional Cortex-A8)
    • nVidia Tegra2 is ARMv7-A + VFPv3-D16 (Cortex-A9 with no SIMD at all)
    • TI OMAP4 is ARMv7-A + VFPv3 + NEON (traditional Cortex-A9)
  • Some of them support the IEEE half-precision FP format (-mfpu=vfpv3-fp16) but not all, and in any case the usefulness of this extension is debatable

  • Building for VFPv3-D16 instead of VFPv3[-D32] only loses the use of 16 FP registers - not a great loss
  • There is some concern about fast enough, quite capable (600MHz+) ARMv6 + VFPv2 processors such as the i.MX37, which will not be supported, but we will have to live with that
    • The difference between ARMv6 and ARMv7 is mostly at kernel level, though ARMv7 has better architectural knowledge of the cache hierarchy and some extra memory barrier instructions
    • The difference between VFPv2 and VFPv3 is fundamentally the conversions between floating point and fixed point (VCVT) and the loading of common FP constants as immediates (VMOV immediate).
    • These are very useful and highly desirable

Name of the port

The table below recaps which port names Debian/dpkg has seen so far.

| name | endianness | status |
|---|---|---|
| arm | little-endian | last release in Debian lenny; being retired in favor of armel |
| armel | little-endian | introduced in Debian lenny; actively maintained; targets armv4t; doesn't require an FPU |
| armeb | big-endian | unofficial port; inactive and dead |

The name of a new ARM port using the hard-float ABI should probably start with arm and include hf for hard-float or fp for floating-point in the name.

It is possible, but not required, to encode the base architecture / CPU in the port name, e.g. arm7hf for an ARMv7 hard-float port.

It is also possible to encode the endianness explicitly, e.g. armelhf, but the new port could also simply be named armhf since a big-endian port is unlikely.

It is also possible to encode the architecture profile (A, M or R) in the name.

Triplet

GCC when built to target the GNU arm-linux-gnueabi triplet will support both the hard-float and soft-float calling conventions.

dpkg relies on the triplet to identify the port (gcc -dumpmachine output). Some other projects such as multiarch rely on having distinct triplets across all Debian architectures.

For the new Debian port, it is possible to use the vendor field in the triplet to have distinct triplets. For instance, the triplet could be arm-hardfloat-linux-gnueabi.

Note: what about triplets whose vendor field is none, as with the CodeSourcery compiler (arm-none-linux-gnueabi)? It behaves the same as arm-linux-gnueabi, but there are some annoying nits with cross compiling, such as package Makefiles hardcoded for arm-linux-gnueabi instead of arm-none-linux-gnueabi (and even changing behavior when the two should be the same!)

Performance improvements and benchmarks

Genesi USA did a proof-of-concept rebuild of Ubuntu karmic (9.10)'s armel port with the hard-float ABI. They noticed important wins (on the order of 40% performance improvement) in floating-point heavy applications/libraries such as mesa, with a Cortex-A8 CPU.

It's likely that the performance benefits are much larger on Cortex-A8 CPUs than on Cortex-A9 CPUs, which have a faster VFP design and a more conventional pipeline.

NEON

NEON is an extension of the VFP which allows for very efficient manipulation of matrices, and vector data in general. This is notably useful for processing audio and video data but also has potential to be used for high-speed memory copies (128 bits at a time).

  • Building for -mfpu=neon means Marvell (Dove) and nVidia (Tegra2) SoCs are left out.

  • While optimizing the entire system for NEON would be awesome, there is very little benefit on standard code
    • Using NEON as a scalar FPU runs faster (as does -mfpmath=sse on x86 vs. -mfpmath=387)
      • but it's not IEEE 754 compliant, so using NEON like this is sometimes a bad idea
      • GCC doesn't implement it anyway, as it does with -mfpmath=sse on x86
      • In any case on Cortex-A9 the benefit is nil (its VFPv3 unit runs scalar FP at the same performance as NEON)
    • Autovectorizing (-ftree-vectorize) for NEON gives between zero and negligible performance gains

    • NEON optimizations, as with AltiVec (PPC) and SSE (x86), usually come from targeted optimization of code by hand using intrinsics or hand-written assembler.

      • This includes tricks for linear algebra (matrices etc.). One technique to speed up large matrix calculations is to subdivide them and process 2x2 blocks in one NEON operation. Autovectorizing compilers cannot detect this
      • This also includes things like using NEON to approximate transcendental functions (sin, cos, etc.) by performing multiple reciprocal estimates (good replacement for division) at once to refine the accuracy. Autovectorizing compilers do not do this (although GCC -freciprocal-math does, can't recall if it's good for ARM)

      • The best performance comes from deriving the parallelism from a mathematical analysis of the original function, and autovectorizing compilers don't do this. Pretty much all they do is unroll loops.
      • Therefore: make sure -ftree-vectorize is turned off :)

Hardware

Genesi-USA would be happy to continue sharing the 9 EfikaMX (Freescale i.MX51) buildds used for their proof-of-concept to help get a new port started.

Genesi-USA is also giving hardware (10 EfikaMX T03) to main Debian sub-project leads for Education, Embedded, Live systems and some Ubuntu/Linaro developers.

Genesi-USA is also giving old hardware (EfikaMX T02), which could be used to help out buildds, set up porterboxes or give away to interested developers who would work on the new port. Stock is limited; if you are interested, register on the PowerDeveloper site and then contact Hector Oron <zumbi@debian.org>.

TODO

  • Choose a port name
    • Genesi likes "armelvfp" as it removes the expectation that it might run on FPA or other weird FPU variants
  • Decide on a triplet
    • is the more explicit "arm-none-linux-gnueabi" different enough from "arm-linux-gnueabi"?
  • Get these into dpkg
  • Get a compiler in shape for this port; GCC 4.5 supports the hard-float ABI, but 4.4 does not; the new port could either have backported support in gcc-4.4, or use gcc-4.5 from the start, or use a different code base such as CodeSourcery SourceryG++ or Linaro GCC

    • Since CodeSourcery and Linaro gcc are almost the same thing, Genesi recommends CodeSourcery 2010q1 for now; it is gcc 4.4.1 but it is functionally equivalent to gcc 4.5 for the purpose of a port

  • Start bootstrapping
  • Fix / port packages
    • libffi needs porting