Making fast floating point math work on the Cirrus MaverickCrunch floating point unit

Arm EABI systems can find themselves running on hardware with the non-VFP "MaverickCrunch" FPU. Mainline GCC support for it has never worked, but a modified compiler is available that does work and that can generate Crunch-accelerated Debian packages.

This page records the issues and patches to make GCC generate reliable code for the Cirrus Logic MaverickCrunch floating point unit.

It applies to the "armel" ArmEabiPort of Debian when compiling for a Cirrus Logic EP93xx ARM + Maverick chip with

    gcc -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp

For docs on the MaverickCrunch unit and its problems, see the EP93xx User's Guide, chapter 2, and the Errata of the documents for the EP9312.

Discussion specific to it usually happens on the linux-cirrus mailing list.

The compilers can be downloaded under http://martinwguy.net/martin/crunch and repositories of Crunch-accelerated Debian packages built using this compiler are available under http://martinwguy.net/crunch/debian

About the FPU

The MaverickCrunch Coprocessor is an IEEE-754 floating point accelerator unit that comes together with an ARM920T integer CPU in Cirrus Logic's EP9301, EP9302, EP9307, EP9312 and EP9315 chips. It has a different instruction set from other floating point accelerators that are found with ARM processors: ARM's old FPA unit, now rare, and the more recent VFP unit. ARMs also come with iWMMXt, NEON and DaVinci coprocessors, but these are specialized SIMD and DSP processors, not generic floating point math ones.

Five revisions of the silicon were issued: D0, D1, E0, E1 and E2. The revision of a chip is printed as the 5th and 6th characters of the second line of text on the chip housing. The now rare D0 revision has a more extensive range of hardware bugs than the later revisions; from D1-E2 no further modifications were made to the design of the Maverick unit. Here we only attempt to work around the bugs in the later series.

Cirrus stopped development of its ARM devices on 1st April 2008 (no joke!) but will continue to sell the chips.

Registers

It has 16 64-bit registers, which can be treated as single- or double-precision floating point values, or as 32- or 64-bit integers. Single-precision floats live in the top 32 bits of the register and, when they are written, the lower 32 bits are zeroed. 32-bit integers live in the lower 32 bits and, when they are written, the top 32 bits are (usually!) sign-extended according to bit 31. It also has four 72-bit multiply-accumulate integer registers which are not used by GCC.
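
The layout rules above can be modelled in portable C. This is only a sketch of the register contents as described here, not a description of how the hardware is implemented; the function names are invented for illustration, and it assumes that float is a 32-bit IEEE-754 type on the machine where it runs.

```c
#include <stdint.h>
#include <string.h>

/* Model of a 64-bit Maverick register after a single-precision write:
 * the float occupies the top 32 bits and the low 32 bits are zeroed. */
uint64_t crunch_reg_from_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* raw IEEE-754 bit pattern */
    return (uint64_t)bits << 32;
}

/* Model after a 32-bit integer write: the value sits in the low 32 bits
 * and the top 32 bits are copies of bit 31 (sign extension). */
uint64_t crunch_reg_from_int32(int32_t i)
{
    return (uint64_t)(int64_t)i;
}
```

For example, writing 1.0f gives 0x3f80000000000000 and writing the integer -1 gives 0xffffffffffffffff under this model.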

Instruction set

It provides instructions to add, subtract, multiply, compare, negate and give absolute value for all these types, to shift the registers in the two integer modes, and to convert between the data types. These operations can only be done between Maverick registers, but data can be copied between Maverick and ARM registers and between Maverick registers and main memory.

Operating modes

The FPU can operate in several modes, controlled by bits in its status register:

Instruction format

MaverickCrunch instructions are 32-bit words that are interleaved with the regular ARM instruction stream. The unit appears as coprocessors 4, 5 and 6, and its instruction words in hexadecimal match the regular expression 0x.[cde]...[456]..

In GCC output, this is further restricted to 0xe[cde]...[45].. as all Maverick instructions are unconditional ("e"), and it does not use the CP6 72-bit multiply-and-accumulate instructions.
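
The two patterns above can be checked with a few bit operations, which is handy when scanning a binary for Crunch code. This is a minimal sketch; the function names are invented here, and matching the pattern is all it tests.

```c
#include <stdint.h>

/* Nonzero if the word matches 0x.[cde]...[456].. */
int is_crunch_insn(uint32_t w)
{
    unsigned op = (w >> 24) & 0xf;   /* second hex digit */
    unsigned cp = (w >> 8) & 0xf;    /* sixth hex digit: coprocessor number */
    return op >= 0xc && op <= 0xe && cp >= 4 && cp <= 6;
}

/* Nonzero if the word matches the narrower GCC pattern 0xe[cde]...[45].. */
int is_gcc_crunch_insn(uint32_t w)
{
    unsigned cond = w >> 28;         /* first hex digit: condition field */
    unsigned cp = (w >> 8) & 0xf;
    return is_crunch_insn(w) && cond == 0xe && cp != 6;
}
```

A word such as 0xe1a00000 (mov r0, r0) is rejected by both tests, while a conditional word addressed to coprocessor 6 passes the first test only.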

GCC support

Mainline GCC

The mainline GCC support for it that was submitted in 2003 by RedHat, usually selected with

    -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp

has never produced working code for it. Most crucially, it fails to take proper account of the way that the FPU sets the condition codes after a comparison, so the code it generates sometimes gets floating point and 64-bit integer comparisons wrong. It also fails to account for several of the hardware bugs.

GCC does not use:

It has a -mcirrus-fix-invalid-insns flag, which is supposed to ensure that the two instructions following a branch are not Cirrus ones (but fails to do so) and that every cfldrd, cfldr64, cfstrd or cfstr64 is followed by one non-Cirrus instruction, which should fix bugs 1 and 2.

Cirrus crunch GCC

Cirrus's own GCC work was actually carried out by Nucleusys of Bulgaria. There are three versions of it, all based on gcc-4.1.2 and uclibc in a buildroot environment (unpacked here), but none of them passes the GCC testsuite, and other test programs go into infinite loops or crash. Some real-life programs compiled with it do seem to work, though. The modifications are published as a 500 megabyte tarball, from which a single monolithic patch can be derived by diffing it against the mainline source releases. What a crock!

Futaris patches

The futaris patches for gcc-4.1.2 and 4.2.0 for the OpenEmbedded environment are provided as several dozen individual patches. The resulting compiler passes the GCC IEEE testsuite, and the intensive paranoia testsuite finds only one defect and two flaws in it (which is better than the Intel and AMD chips!). Futaris' strategy includes disabling all conditional instructions other than branch and all 64-bit integer operations.

Here is how to build a futaris-patched compiler, a summary of their merits, and some benchmarks.

gcc-crunch

Developed from the futaris patches, this version passes all FP testsuites and runs as fast as possible at slightly reduced precision unless the -mieee flag is supplied.

It disables all 64-bit integer operations which appear to have more unidentified hardware bugs, as shown by the openssl testsuite. The -mcirrus-di flag enables them, caveat emptor.

There is a long description of it at http://martinwguy.net/martin/crunch and the patches for gcc-4.2.4 and gcc-4.3.4 as well as prebuilt native compiler packages can be found under http://simplemachines.it/tools.

Summary of bugs

Columns: CMP, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12a, 12b, 13, 14, 15. Cells are listed in column order within each group; empty cells are not shown.

Toolchain             CMP, 1-6         7-11        12a-15
Debian gcc 4.1.3      X \ \ \ f s /    - - - - -   X X
Debian gcc 4.2.3      X \ \ \ f s /    - - - - -   X X
Debian gcc 4.3.0      X \ \ \ f s /    - - - - -   X X
futaris 4.1.2         / f / /          - - - - -   / / X X
futaris 4.2.0         / f s /          - - - - -   / / X X
gcc-crunch            / / / / f s /    - - - - -   i / / / /
crunchtools 1.4.0                      - - - - -
crunchtools 1.4.1-2                    - - - - -
crunchtools 1.4.3                      - - - - -

Legend

X   broken
/   fixed
\   should(*) be fixed when -mcirrus-fix-invalid-insns is given
f   fixed when forwarding is disabled, which is the default in mainline Linux
i   fixed when the -mieee flag is given
s   fixed when not running in serialised mode, which seems to be usual (test program)
-   doesn't affect GCC's use of the Maverick FPU

(*) But the code is bad. It turns branch; cirrus; non-cirrus into branch; nop; cirrus; non-cirrus, which actually makes errata 1 and 3 more likely.

The bugs

The bugs are:

The FPU doesn't set the same condition codes as the ARM core

The ARM and its FPA and VFP math coprocessors' comparison instructions all set the condition codes the same way:

ARM/FPA/VFP - (cmp*):
        N  Z  C  V
A == B  0  1  1  0
A <  B  1  0  0  0
A >  B  0  0  1  0
unord   0  0  1  1

but comparisons performed in the Maverick, on either floating point or integer values, set them a different way:

MaverickCrunch - (cfcmp*):
        N  Z  C  V
A == B  0  1  0  0
A <  B  1  0  0  0
A >  B  1  0  0  1
unord   0  0  0  0

This means that not only do you have to test for the same conditions in different ways, according to which unit did the comparison, but some cases require strange sequences of conditional instructions.

Some conditions are only tested in GCC when floating point values have been compared, including all those involving "unordered", and others only when integers have been compared (the unsigned variants LTU GTU LEU and GEU - is this right?), so we can know how to interpret the condition codes for these when Maverick code generation is enabled.

#include <stdio.h>

int main(void)
{
        int i; double d;

        /* Compare 3, 4 and 5 with 4.0 and print every relation that holds */
        for (i=3, d=3; i<=5; i++, d++) {
                printf("%g", d);
                if (d < 4.0) printf(" lt");
                if (d > 4.0) printf(" gt");
                if (d <= 4.0) printf(" le");
                if (d >= 4.0) printf(" ge");
                putchar('\n');
        }
        return 0;
}

should output:

3 lt le
4 le ge
5 gt ge

but unpatched mainline GCC outputs:

3 lt le
4 le ge
5 lt gt le ge
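
The wrong output can be reproduced on any machine by evaluating ARM condition tests against the two flag tables above. The sketch below assumes that the compiler tests lt, gt, le and ge with the MI, GT, LS and GE conditions, which suit the ARM/FPA/VFP flag layout; that choice is an assumption made for illustration, as are all the names in the code.

```c
#include <string.h>

/* Condition flags as left by a comparison. */
struct flags { int n, z, c, v; };

enum relation { REL_EQ, REL_LT, REL_GT, REL_UNORD };

/* Flags set by cmp* on ARM/FPA/VFP, copied from the first table. */
struct flags arm_flags(enum relation r)
{
    static const struct flags table[] = {
        [REL_EQ]    = {0, 1, 1, 0},
        [REL_LT]    = {1, 0, 0, 0},
        [REL_GT]    = {0, 0, 1, 0},
        [REL_UNORD] = {0, 0, 1, 1},
    };
    return table[r];
}

/* Flags set by cfcmp* on MaverickCrunch, copied from the second table. */
struct flags crunch_flags(enum relation r)
{
    static const struct flags table[] = {
        [REL_EQ]    = {0, 1, 0, 0},
        [REL_LT]    = {1, 0, 0, 0},
        [REL_GT]    = {1, 0, 0, 1},
        [REL_UNORD] = {0, 0, 0, 0},
    };
    return table[r];
}

/* Standard ARM condition tests. */
int cond_mi(struct flags f) { return f.n; }
int cond_gt(struct flags f) { return !f.z && f.n == f.v; }
int cond_ls(struct flags f) { return !f.c || f.z; }
int cond_ge(struct flags f) { return f.n == f.v; }

/* Builds the words the test program would print if lt/gt/le/ge were
 * tested with MI/GT/LS/GE. out must hold at least 13 bytes. */
void printed_words(struct flags f, char *out)
{
    out[0] = '\0';
    if (cond_mi(f)) strcat(out, " lt");
    if (cond_gt(f)) strcat(out, " gt");
    if (cond_ls(f)) strcat(out, " le");
    if (cond_ge(f)) strcat(out, " ge");
}
```

With the ARM/FPA/VFP flags the model gives the expected words for every case, and nothing at all for unordered. With the Maverick flags the "greater" case yields " lt gt le ge", as in the faulty output above, and GE also comes out true for unordered operands.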

The best summary of the why's and wherefores is on the gcc mailing list.

Futaris-4.1.2 passes the GCC testsuite by disabling conditional execution of all instructions except for branches.

The unpublished futaris patches for 4.2.2 and 4.3.0 fix this with the exception of the "unordered" and "ltgt" cases, which are missing and cause an internal compiler error if a program mentions the isunordered() function.

They also hit an internal compiler error when compiling foo(double x, double y) { return(x >= y); } with optimisation enabled, because GCC first produces BGE (which can be represented on Maverick with a sequence of tests) and then optimises that to a MOVLT; MOVGE sequence, and the GE cannot be represented by a single condition code on Maverick (it can, with GT, if we don't honor the UNORDERED case).

C++ exception unwinding is broken

With Maverick FP enabled, C++ exceptions (catch - try blocks) fail to preserve the Maverick FPU state.

This thread on the binutils mailing list explains why unwind support is needed. Additionally, this document (ARM IHI 0038A) explains the unwind process using EABI. As you can see in Sec 9.3 of that document, there is no unwinding EABI for popping MaverickCrunch instructions. The above patch incorrectly calls the iWMMXt pop functions. A new Pop MV registers instruction needs to be added to the table, along with changes to Sec 7.5.

libunwind support

unwind support should only be needed if libunwind support is enabled. At the moment, only the development branch (git) of libunwind supports ARM processors.

Joseph S. Myers says on linux-cirrus 31 Mar 2008:

iWMMXt unwind support has been in GCC since my patch
<http://gcc.gnu.org/ml/gcc-patches/2007-01/msg00049.html>.
That illustrates the sort of thing that needs changing to implement unwind
support for a new coprocessor.  Obviously you need to get the unwind
specification in the official ARM EABI documents first before implementing
it in GCC, and binutils will also need to support generating correct
information given .save directives for the coprocessor registers.
For setjmp/longjmp support in glibc you also need to get an HWCAP value
allocated in the kernel.

Hardware bugs

See cirrus.com -> ARM Processors -> EP93{01,02,07,12,15} -> Errata (PDF) -> Maverick Crunch

Errata are different for silicon revisions D0, D1/E0/E1 and differences are reported with E2 although no further changes are said to have been made to the Maverick design.

The following is from the EP9302 rev E2 errata:

Definitions

1. "Instruction does not execute": An instruction appears in the coprocessor pipeline, but does not execute for one of the following reasons:

2. "Processor is in serialized mode": It is, if and only if both:

("Serialized mode" makes instruction processing less fast so that an exception can reliably be traced to the instruction that caused it. In the sample I have tested (a TS7250) it is not operating in serialised mode by these criteria because no exceptions are enabled. Source: dspsc.c)

3. "An instruction updating an accumulator": These include all of the following:

4. "An instruction involving any two-word coprocessor load or store":

1. two-word load / store

Result: register or memory corruption

Summary: a nonexecuted coprocessor instruction that is also stalled due to an internal dependency (operand is a non-cached memory read or the result of a previous incomplete coprocessor instruction) must not immediately precede a load/store 64/double.

Effects: in a 64-bit register load, the top 32 bits are loaded with junk; in a 64-bit memory store, an extra 32 bits of junk are written to memory in the word following the 64-bits that were correctly written.

An instruction may be nonexecuted because it is conditional and the condition is false, e.g.

    cfaddne c0, c1, c2
    cfldrd  c3, [r2, #0x0]

corrected by

    cfaddne c0, c1, c2
    nop
    cfldrd  c3, [r2, #0x0]

or it may be nonexecuted because it was one of the two instructions following a branch that was taken, so was loaded into the pipeline.

   target
      cfldrd  c3, [r2, #0x0]

      b       target
      nop
      cfadd   c0, c1, c2 ; though in pipeline, this does not execute

said to be corrected by:

   target
      cfldrd c3, [r2, #0x0]

      b      target
      cfadd  c0, c1, c2 ; though in pipeline, this does not execute
      nop

because the "previous instruction that does not execute" in the two-word pipeline is the nop rather than the cfadd.

GCC does not emit conditional Maverick instructions, and the branch case would be covered by mainline's -mcirrus-fix-invalid-insns flag if that code were not broken: in fact it turns b;cfxxx;non-cirrus into b;nop;cfxxx;non-cirrus thereby causing the bug to occur!

Futaris and Cirrus remove this flag.

A test program tickles the bug in both ways on revision E1 silicon.

2. instruction with source operand

Result: bad calculation or stored value

Workaround: change instruction sequence

  1. Execute a coprocessor instruction whose target is one of the coprocessor general purpose register c0 through c15.
  2. Let the second instruction be an instruction with the same target, but not be executed.
  3. Execute a third instruction at least one of whose operands is the target of the previous two instructions.

For example, assume no pipeline interlocks other than the dependencies involving register c0 in the following instruction sequence:

    cfadd32    c0, c1, c2
    cfsub32ne  c0, c3, c4    ; assume this does not execute
    cfstr32    c0, [r2, #0x0]

In this particular case, the incorrect value stored at the address in r2 is the previous value in c0, not the expected one resulting from the cfadd32.

Suggested fix:

    cfadd32   c0, c1, c2
    nop                      ; inserted extra instruction here
    nop                      ; inserted extra instruction here
    cfsub32ne c0, c3, c4     ; assume this does not execute
    nop                      ; inserted extra instruction here
    nop                      ; inserted extra instruction here
    nop                      ; inserted extra instruction here
    cfstr32   c0, [r2, #0x0]

The exact interval for safe operation is uncertain. Empirically, on an EP9307 REV E1, we get the following results:

Buggy  cfadd - cfaddne - cfstr

Buggy  cfadd - nop - cfaddne - cfstr
Buggy  cfadd - cfaddne - nop - cfstr

OK     cfadd - nop - nop - cfaddne - cfstr
Buggy  cfadd - nop - cfaddne - nop - cfstr
Buggy  cfadd - cfaddne - nop - nop - cfstr

OK     cfadd - nop - nop - nop - cfaddne - cfstr
OK     cfadd - nop - nop - cfaddne - nop - cfstr
OK     cfadd - nop - cfaddne - nop - nop - cfstr
OK     cfadd - cfaddne - nop - nop - nop - cfstr

Buggy  cfadd - cfaddne - cfaddne - cfstr
Buggy  cfadd - cfaddne - cfaddne - nop - cfstr
OK     cfadd - cfaddne - cfaddne - nop - nop - cfstr
OK     cfadd - nop - cfaddne - cfaddne - cfstr
OK     cfadd - nop - cfaddne - cfaddne - nop - cfstr
OK     cfadd - nop - cfaddne - cfaddne - nop - nop - cfstr

The second instruction may also fail to execute because it follows a branch, as in the following real-life case from liboil/simdpack/multsum.c, which fails if either or both of the pair of nops is removed:

    cfsh64  mvdx0, mvdx4, #0        @ dest is C0
    cmp     r3, r8
    ble     548 <multsum_f64_unroll8+0x438>
    nop                     (mov r0,r0)
    nop                     (mov r0,r0)
    cfldrd  mvd0, [sl]              @ dest is C0
    cfldrd  mvd1, [r9]
    cfmuld  mvd0, mvd0, mvd1
    cfaddd  mvd0, mvd0, mvd4
    ldr     r5, [sp, #72]           @ branch target
    nop                     (mov r0,r0)
    cfstrd  mvd0, [r5]              @ src is C0

A test program tickles the bug in both ways on revision E1 silicon.

GCC doesn't emit conditional Maverick instructions, and the jump case should be fixed by mainline's -mcirrus-fix-invalid-insns.

3. two-word load / store

Data in coprocessor general purpose registers or in memory may be corrupted.

  1. Let the first instruction be a serialized instruction that does not execute. For an instruction to be serialized, at least one of the following must be true:
    • The processor must be operating in serialized mode.
    • The instruction must move to or from the DSPSC (either cfmv32sc or cfmvsc32).
  2. Let the immediately following instruction be a two-word coprocessor load or store.

In the case of a load, only the lower 32 bits (the first word) will be loaded into the target register. For example:

    cfadd32ne   c0, c1, c2    ; assume this does not execute
    cfldr64     c3, [r2, #0x0]

The lower 32 bits of c3 will correctly become what is at the memory address in r2, but the upper 32 bits of c3 will not become what is at address r2 + 0x4.

Workaround:

    cfadd32ne c0, c1, c2     ; assume this does not execute
    nop                      ; inserted extra instruction here
    cfldr64   c3, [r2, #0x0]
                             ; store sequence
    cfadd32ne c4, c5, c6     ; assume this does not execute
    nop                      ; inserted extra instruction here
    cfstr64   c3, [r2, #0x0]

The real-world CPUs I've tested are not running in serialized mode, and GCC does not emit cfmv32sc or cfmvsc32.

If there are serialized ones out there, GCC does not emit conditional Maverick instructions, which just leaves the case of a Maverick instruction being in one of the two slots after a branch that is taken, which is covered by -mcirrus-fix-invalid-insns.

4. two-word store

Only in mode: forwarding, not serialised

Result: memory corruption

Summary: data operation into Crunch register followed by 64-bit store of the same Maverick register into RAM (cfstrd or cfstr64) may write rubbish

Description: When the coprocessor is not in serialized mode and forwarding is enabled, memory can be corrupted when two types of instructions appear in the instruction stream with a particular relative timing.

  1. Execute an instruction that is a data operation (not a move between ARM and coprocessor registers) whose destination is one of the general purpose register c0 through c15.
  2. Execute an instruction that is a two-word coprocessor store (either cfstr64 or cfstrd), where the destination register of the first instruction is the source of the store instruction, that is, the second instruction stores the result of the first one to memory.
  3. Finally, the first and second instruction must appear to the coprocessor with the correct relative timing; this timing is not simply proportional to the number of intervening instructions and is difficult to predict in general.

The result is that the lower 32 bits of the result stored to memory will be correct, but the upper 32 bits will be wrong. The value appearing in the target register will still be correct.

Workarounds:

GCC does output guilty instruction sequences. Examples from LAME:

    cfmuld  mvd1, mvd1, mvd0
    mov     r2, r7
    mov     r3, r5
    mov     r0, r8
    ldr     r1, [pc, #364]
    cfstrd  mvd1, [sp, #8]

    cfmuld  mvd1, mvd1, mvd0
    mov     r0, r8
    mov     r3, r4
    ldr     r1, [pc, #1004]
    cfstrd  mvd1, [sp]

    cfldrd  mvd0, [r8, #8]
    cfaddd  mvd0, mvd1, mvd0
    cfstrd  mvd0, [r8, #8]

but a sample system was not operating with forwarding enabled.

Code to enable forwarding (under Linux with Maverick support enabled in the kernel, the effect is limited to the process that does this):

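    void crunch_fwden(void)
    {
        /* A single asm statement, so that the compiler cannot place other
         * code between the steps. r0 is declared as clobbered; Maverick
         * register 0 is overwritten too. */
        __asm__ volatile(
            "cfmv32sc   mvdx0, dspsc\n\t"    /* Read status register */
            "cfmvrdl    r0, mvd0\n\t"        /* Move LSW to ARM */
            "orr        r0, r0, #0x4000\n\t" /* Set forwarding bit */
            "cfmvdlr    mvd0, r0\n\t"        /* Move ARM to LSW */
            "cfmvsc32   dspsc, mvdx0"        /* Write to status register */
            : : : "r0");
    }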

This appears to be unresolved at present.

Under Linux on the sample board I use, forwarding is disabled by default. With forwarding enabled in a test program on revision E1 hardware, I have been unable to get this bug to bite.

5. cfrshl32, cfrshl64

When operating in serialized mode, cfrshl32 and cfrshl64 (logical shifts on coprocessor registers) do not work properly. The instructions shift by an unpredictable amount, but cause no other side effects.

cfrshl32 is disabled in mainline gcc, and cfrshl64 is disabled by futaris, even though real-world CPUs seem not to run in serialized mode.

6. cfldr32, cfmv64lr may not sign-extend

If an interrupt occurs during the execution of cfldr32 or cfmv64lr, the instruction may not sign extend the result correctly.

Possible workarounds include:

Mainline GCC does not emit cfldr32, and use of cfmv64lr is disabled as buggy. In three places it is used as the first of a two-instruction sequence: in all cases the top 32 bits are either overwritten or ignored by the second instruction.

Verdict: Not a problem.

7. accumulator updates

The coprocessor can incorrectly update one of its destination accumulators even if the coprocessor instruction should not have been executed or is canceled by the ARM processor. This error can occur if the following is true:

  1. The first instruction must be a coprocessor compare instruction, one of cfcmp32, cfcmp64, cfcmps, and cfcmpd.

  2. The second instruction:
    • has an accumulator as a destination.
    • does not execute.

GCC does not use the accumulator instructions.

8. accumulator updates

If a data abort occurs on an instruction preceding a coprocessor data path instruction that writes to one of the accumulators, the accumulator may be updated even though the instruction was canceled.

GCC does not use the accumulator instructions.

9. accumulator updates

The coprocessor will erroneously update an accumulator if the coprocessor instruction that updates an accumulator is canceled and is followed by a coprocessor instruction that is not a data path instruction. This error will occur under the following conditions:

  1. The first instruction:
    • must update a coprocessor accumulator.
    • does not execute.
  2. The second instruction is not a coprocessor data path instruction. Coprocessor data path instructions include any instruction that does not move data to or from memory or to or from the ARM registers.

GCC does not use the accumulator instructions.

10. accumulator updates

An instruction that writes a result to an accumulator may cause corruption of any of the four accumulators when the coprocessor is operating in serialized mode.

GCC does not use the accumulator instructions.

11. two-word load / store

An erroneous memory transfer to or from any of the coprocessor general purpose registers c0 through c15 can occur given the following conditions are satisfied:

  1. The first instruction:
    • is a two-word load or store.
    • fails its condition code check.
    • does not busy-wait.
  2. The second consecutive instruction:
    • is a coprocessor load or store.
    • is executed.
    • does not busy-wait.

When the error occurs, the result is either coprocessor register or memory corruption. Here are several examples:

   cfstr64ne     c0, [r0, #0x0]   ; assume does not execute
   cfldrs        c2, [r2, #0x8]   ; could corrupt c2!
   cfldrdge      c0, [r0, #0x0]   ; assume does not execute
   cfstrd        c2, [r2, #0x8]   ; could corrupt memory!
   cfldr64ne     c0, [r0, #0x0]   ; assume does not execute
   cfldrdgt      c2, [r2, #0x8]   ; could corrupt c2!

The software workaround involves avoiding a pair of consecutive instructions with these properties. For example, if a conditional coprocessor two-word load or store appears, ensure that the following instruction is not a coprocessor load or store:

   cfstr64ne    c0, [r0, #0x0]     ; assume does not execute
   nop                             ; separate two instructions
   cfldrs       c2, [r2, #0x8]     ; c2 will be ok

Another workaround is to ensure that the first instruction is not conditional:

   cfstr64      c0, [r0, #0x0]     ; executes
   cfldrs       c2, [r2, #0x8]     ; c2 will be ok

Note: If both instructions depend on the same condition code, the error should not occur, as either both or neither will execute.

GCC does not emit conditional Maverick instructions.

12a. cpy/add/abs/neg/cvt take denormalised operands as zero

Result: denorm operand forced to zero

Description: When the Crunch add/subtract unit is presented with denormalized values, it takes them as zero for that input of the calculation. The sign is unaffected. This affects values of +/- 2^-149 to 2^-126 for floats and from 2^-1074 to 2^-1022 for doubles when using the following instructions:

Workaround: none. The Cirrus crunch softfloat library has integer asm code to check for denorm values before these operations (e.g. macro isddf in ieee754-df.S).

cfcpys and cfcpyd can be replaced with cfsh64 #0, which does a bit-wise copy, and cfnegs and cfnegd are disabled by futaris so the operation is performed in pairs of ARM registers (the same could be done for cfabs). Disabling the rest would only leave multiply and compare, so we live with the imprecision.

12b. cpy/neg never produces negative zero

When the operand is negative zero, cfcpys and cfcpyd write positive zero to the destination register, while the result should be negative zero. When the operand is positive zero, cfnegs and cfnegd write positive zero to the destination register, while the result should be negative zero.

Futaris disables the cfneg instructions in its arm-crunch-neg2 patch; it would be better to protect it under & ! HONOR_NEGATIVE_ZEROS([SD]Fmode) so that it is enabled when -ffast-math is selected.

cfcpyd can be replaced with cfsh64 #0, which does a bitwise copy. cfsh32 is no use for cfcpys as it copies the lower 32 bits, while floats are stored in the upper 32, but we can just use cfsh64 to copy all 64 bits the same way.
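
The difference between the faulty instructions and the bit-wise replacements can be shown with a model in portable C. This is a sketch under the description above; the function names are invented here, and plain integer operations stand in for the shift by zero and for negation done in ARM registers.

```c
#include <stdint.h>
#include <string.h>

/* Raw bit pattern of a double, and the reverse conversion. */
uint64_t bits_of(double x)
{
    uint64_t b;
    memcpy(&b, &x, sizeof b);
    return b;
}

double double_of(uint64_t b)
{
    double x;
    memcpy(&x, &b, sizeof x);
    return x;
}

/* 1 if the sign bit (bit 63) is set, so that -0.0 can be told from +0.0. */
int sign_bit_set(double x)
{
    return (int)(bits_of(x) >> 63);
}

/* Models of erratum 12b: a zero result always comes out as +0. */
double cfcpyd_model(double x) { return x == 0.0 ? 0.0 : x; }
double cfnegd_model(double x) { return x == 0.0 ? 0.0 : -x; }

/* Bit-wise copy, as a 64-bit shift by zero would do. */
double copy_bitwise(double x) { return double_of(bits_of(x) << 0); }

/* Negation carried out on the integer bit pattern: flip bit 63. */
double negate_bitwise(double x)
{
    return double_of(bits_of(x) ^ (UINT64_C(1) << 63));
}
```

In this model cfcpyd_model(-0.0) loses the sign while copy_bitwise(-0.0) keeps it, and negate_bitwise(0.0) produces the negative zero that cfnegd_model(0.0) does not.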

13. cfcvtds never produces denorms

The operation cfcvtds, which converts a double floating point value to a single floating point value, never produces a denormalized result, even if the value can be accurately represented as such. The result underflows directly to zero. Sign is preserved properly, however.

Workaround: none

Futaris' arm-crunch-cfcvtds-disable patch disables this instruction: double to float conversion is done using the soft-float functions. Given that any resulting denormalised numbers will probably be truncated to zero by the math ops in bug 12a, there may not be much point in doing this.

14. double word load/store corrupts memory

There is an extra, undocumented hardware bug, in E1 silicon at least.

When an ARM register is loaded from memory and a double-word cirrus register is immediately stored indirected through the same ARM register, memory is corrupted. For example:

    ldr     r2, [r3, #xx]
    cfstrd  mvd1, [r2]

stores the 64-bit value at an unpredictable memory location. Presumably, cfstr64 does the same. 64-bit loads in the same context also cause memory corruption.

Often, the memory corrupted will be two words of the kernel's in-core cached copy of the program itself (despite it being read-only in the MMU!), which results in the contents of the executable file appearing to have changed after the executable has been run. It can be "restored" by running something that uses as much VRAM as there is physical RAM so that the cached copy is discarded.

The solution is to insert some other instruction between the ldr and the 64-bit load or store, such as a nop.
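
A pass that applies this fix can be sketched over an abstract instruction list. The representation below (the kind enum, the insn struct and the function name) is entirely hypothetical and much simpler than anything inside a real compiler or assembler; it only shows the rule of separating the two instructions.

```c
#include <stddef.h>

enum kind { K_OTHER, K_ARM_LOAD, K_CRUNCH_MEM64, K_NOP };

struct insn {
    enum kind kind;
    int reg;   /* K_ARM_LOAD: destination ARM register;
                  K_CRUNCH_MEM64: ARM register used as the address */
};

/* Copies in[0..n) to out, inserting a nop wherever a 64-bit Maverick
 * load or store immediately follows an ARM load of its address register.
 * out must have room for 2*n entries. Returns the number written. */
size_t fix_erratum14(const struct insn *in, size_t n, struct insn *out)
{
    size_t m = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 &&
            in[i].kind == K_CRUNCH_MEM64 &&
            in[i - 1].kind == K_ARM_LOAD &&
            in[i - 1].reg == in[i].reg) {
            out[m++] = (struct insn){ K_NOP, -1 };
        }
        out[m++] = in[i];
    }
    return m;
}
```

Only the adjacent, same-register case is touched, so a 64-bit access through a register that was loaded earlier, or through a different register, is left alone.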

15. 64-bit shift counts are truncated to 6-bit signed

While the manual says that the constant shift counts are limited to -32 (right shift) to +31 (left), the cfrshl64 instruction also only examines the lower 6 bits of its shift count, so a value of 48 in the ARM register results in an arithmetic right shift by 16 bits.

Mainline GCC thinks it can shift a 64-bit integer register left or right by 0 to 63 bits, so this needs working around in the two constant cases cirrus_ashldi_const and cirrus_ashiftrtdi_const and the variable one cirrus_ashldi3. For the latter, Paolo Bonzini suggests (http://gcc.gnu.org/ml/gcc/2009-03/msg00460.html):

This could already be handled by faking a 63 bit truncation and using a
splitter to expand those into something like this (I only know integer
ARM assembly, so I'm making this up):

   AND R1, R0, #31
   MOV R2, R2, SHIFT R1
   ANDS R1, R0, #32
   MOVNE R2, R2, SHIFT #31
   MOVNE R2, R2, SHIFT #1

or

   ANDS R1, R0, #32
   MOVNE R2, R2, SHIFT #-32
   SUB R1, R1, R0              ; R1 = (x >= 32 ? 32 - x : -x)
   MOV R2, R2, SHIFT R1
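
The count truncation, and a split in the spirit of the first suggestion above, can be modelled in portable C. This is a sketch with invented function names: ordinary C shifts stand in for the hardware, and the split is shown for left shifts only.

```c
#include <stdint.h>

/* Model of how the shift count is read: only the low 6 bits are examined,
 * as a signed 6-bit number. 0..31 shift left; -32..-1 shift right by the
 * magnitude. */
int crunch_shift_count(unsigned r)
{
    int c = (int)(r & 0x3f);
    return c >= 32 ? c - 64 : c;
}

/* Left shift by n (0..63) built only from steps of 0..31, each of which
 * is read as itself by crunch_shift_count(). */
uint64_t shl64_in_steps(uint64_t v, unsigned n)
{
    if (n >= 32) {
        v <<= 31;   /* together these two steps shift by 32 */
        v <<= 1;
        n -= 32;
    }
    return v << n;
}
```

crunch_shift_count(48) returns -16, matching the arithmetic right shift by 16 bits described above, while shl64_in_steps(1, 48) still gives the intended 1 << 48.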

Note that in gcc, gas, gdb and objdump, the source and destination maverick registers for the cfrshl32 and cfrshl64 instructions are inverted, so what appears in assembler listings as cfrshl64 mvd8, mvd0, r6 has source register mvd8 and writes into mvd0.