Differences between revisions 75 and 76
Revision 75 as of 2008-10-15 08:33:12
Size: 30574
Editor: MartinGuy
Comment:
Revision 76 as of 2008-10-15 09:03:08
Size: 31342
Editor: MartinGuy
Comment: Add range of values lost when denorms truncated

Making fast floating point math work on the Cirrus MaverickCrunch floating point unit

This page records the issues and existing patches to make GCC generate reliable code for the Cirrus Logic [http://en.wikipedia.com/wiki/MaverickCrunch Maverick Crunch] floating point unit.

It applies to the "armel" ArmEabiPort of Debian when compiling for a Cirrus Logic EP93xx ARM + Maverick chip with

    gcc -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp

For docs on the Maverick Crunch unit and its problems, see the EP93xx User's Guide, chapter 2, and the Errata of the [http://www.cirrus.com/en/products/pro/detail/P131.html documents for the EP9312].

Discussion specific to it usually happens on the [http://www.freelists.org/archives/linux-cirrus/ linux-cirrus mailing list].


About the FPU

The MaverickCrunch Coprocessor is an IEEE-754 floating point accelerator unit that comes together with an ARM920T integer CPU in Cirrus Logic's EP9301, EP9302, EP9307, EP9312 and EP9315 chips. It has a different instruction set from other floating point accelerators that are found with ARM processors: ARM's old FPA unit, now rare, and the more recent VFP unit. ARMs also come with iWMMXt and DaVinci coprocessors, but these are specialized SIMD and DSP processors, not generic floating point math ones.

Five revisions of the silicon were issued: D0, D1, E0, E1 and E2. The revision of a chip is printed as the 5th and 6th characters of the second line of text on the chip housing. The now rare D0 revision has a more extensive range of hardware bugs than the later revisions; from D1-E2 no further modifications were made to the design of the Maverick unit. Here we only attempt to work around the bugs in the later series.

Cirrus stopped development of its ARM devices on 1st April 2008 (no joke!) but will continue to sell the chips.

Registers

It has 16 64-bit registers, which can be treated as single- or double-precision floating point values, or as 32- or 64-bit integers. Single-precision floats live in the top 32 bits of the register and, when they are written, the lower 32 bits are zeroed. 32-bit integers live in the lower 32 bits and, when they are written, the top 32 bits are (usually!) sign-extended according to bit 31. It also has four 72-bit multiply-accumulate integer registers which are not used by GCC.
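As a rough illustration of the layout just described, the following C sketch models one Maverick register as a 64-bit integer. The function names are invented for this example and nothing here touches the real hardware; it only shows where the bits end up.

```c
#include <stdint.h>
#include <string.h>

/* Model of one 64-bit Maverick register (illustrative names only). */

/* A single-precision float lives in the top 32 bits; the low 32 bits are zeroed. */
uint64_t crunch_reg_from_float(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);   /* raw IEEE-754 bit pattern */
    return (uint64_t)bits << 32;
}

/* A 32-bit integer lives in the low 32 bits; the top 32 bits copy bit 31. */
uint64_t crunch_reg_from_int32(int32_t i)
{
    return (uint64_t)(int64_t)i;      /* widening a signed value sign-extends */
}
```

For example, 1.0f has the bit pattern 0x3F800000, so its register image is 0x3F80000000000000, while the integer -1 fills all 64 bits with ones.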

Instruction set

It provides instructions to add, subtract, multiply, compare, negate and give absolute value for all these types, to shift the registers in the two integer modes, and to convert between the data types. These operations can only be done between Maverick registers, but data can be copied between Maverick and ARM registers and between Maverick registers and main memory.

Operating modes

The FPU can operate in several modes, controlled by bits in its status register:

  • ISAT: Selects saturating arithmetic for integer operations instead of overflowing. The default is non-saturating, as required by C.
  • UI: Unsigned integer: in comparisons between integers, the values are considered signed or unsigned when they are compared, unlike the ARM (and FPA and VFP) comparisons, which set condition codes that are then interpreted as signed or unsigned when a decision is made. The default is signed.
  • Synchronous/Asynchronous: Synchronous mode is much slower, but ensures that, if floating point exceptions are enabled and occur, you can be sure to pinpoint the offending instruction. The default is asynchronous.
  • Forwarding/Non-forwarding: Forwarding channels the results of arithmetic operations back to the input of the logic unit as well as to the destination registers so that, when the result of one instruction is used in another soon after, execution is faster. LAME gains 2.5% in speed by enabling this; the default is non-forwarding.

How GCC uses it

GCC does not use:

  • the 32-bit integer operations. It performs these in ARM registers as usual.
  • 64-bit integer comparisons, because the signed/unsigned compare mechanism does not fit its conceptual model and setting/clearing bits in the status register at runtime to prepare for individual comparison instructions is awkward. 64-bit integer comparisons are performed in the ARM registers.
  • the conditional execution bits of Maverick instructions, to avoid hardware bugs.

The Maverick registers are handled in two banks:

  • registers 0 to 7 are used as scratch registers which are not preserved across function calls and can freely be scribbled in;
  • registers 8 to 15's values are persistent across function calls, and any function that uses them must save the ones it uses on entry and restore them on exit.

Versions of GCC

Mainline GCC has never been able to generate working code for the Maverick Crunch. It has a -mcirrus-fix-invalid-insns flag, which ensures that the two instructions following a branch are not Cirrus ones, and that every cfldrd, cfldr64, cfstrd, cfstr64 is followed by one non-Cirrus instruction, which should fix bugs number 1 and ?. In a moderately FP-intensive test piece (LAME), it only causes an overall slowdown of 0.7%.
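The two rules that the flag is described as enforcing can be checked mechanically. The sketch below is not GCC's implementation; it is a small stand-alone checker over a hypothetical instruction record (struct insn and its fields are invented here), and it assumes that every Maverick mnemonic starts with "cf".

```c
#include <string.h>

/* Hypothetical instruction record for this sketch. */
struct insn {
    const char *mnemonic;  /* e.g. "cfldrd", "mov" */
    int is_branch;         /* nonzero for any branch instruction */
};

/* Assumption: all Maverick mnemonics begin with "cf". */
static int is_crunch(const struct insn *i)
{
    return strncmp(i->mnemonic, "cf", 2) == 0;
}

/* Two-word loads/stores, with or without a condition suffix. */
static int is_two_word_ldst(const struct insn *i)
{
    static const char *const names[] = { "cfldrd", "cfldr64", "cfstrd", "cfstr64" };
    for (size_t k = 0; k < sizeof names / sizeof names[0]; k++)
        if (strncmp(i->mnemonic, names[k], strlen(names[k])) == 0)
            return 1;
    return 0;
}

/* Returns the index of the first instruction that breaks either rule, or -1. */
int first_violation(const struct insn *code, int n)
{
    for (int k = 0; k < n; k++) {
        /* Rule A: the two instructions after a branch must not be Maverick ones. */
        if (code[k].is_branch)
            for (int j = k + 1; j <= k + 2 && j < n; j++)
                if (is_crunch(&code[j]))
                    return j;
        /* Rule B: a two-word load/store must be followed by a non-Maverick instruction. */
        if (is_two_word_ldst(&code[k]) && k + 1 < n && is_crunch(&code[k + 1]))
            return k + 1;
    }
    return -1;
}
```

A sequence such as b / nop / cfadd is reported at the cfadd, and cfldrd immediately followed by cfaddd is reported at the cfaddd; inserting a nop between the latter pair makes the check pass.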

The two main efforts at fixing it properly are:

  • [http://arm.cirrus.com/files/index.php?path=tools/ Cirrus GCC] actually carried out by [http://www.nucleusys.com Nucleusys] of Bulgaria. There are three versions of it, all based on gcc-4.1.2 and uclibc in a buildroot environment (unpacked [http://martinwguy.co.uk/martin/tech/ts7250/FPU/cirrus here]), but none of them passes the GCC testsuite, while other test programs go into infinite loops or crash. Some real-life programs compiled with it do seem to work though. The modifications are published as a 500 megabyte tarball from which a single monolithic patch can be derived by diffing it against the mainline source releases. What a crock!

  • [http://files.futaris.org/gcc futaris patches] for gcc-4.1.2 and 4.2.0 for the OpenEmbedded environment are provided as several dozen individual patches and succeed in passing the GCC IEEE testsuite; the intensive paranoia testsuite finds only one defect and two flaws in them (which is better than the Intel and AMD chips!). Futaris' strategy includes disabling all conditional instructions other than branch and all 64-bit integer operations.

Here is [http://martinwguy.co.uk/martin/FPU a summary of their merits and some benchmarks].

Summary of fixed bugs

  • CMP: Maverick sets condition codes differently from ARM/FPA/VFP
  • C++ EXC: C++ exceptions do not restore FPU state when taken (maybe?)
  • 1. two-word load / store
  • 2. instruction with source operand
  • 3. two-word load / store
  • 4. two-word store
  • 5. cfrshl32, cfrshl64
  • 6. cfldr32, cfmv64lr may not sign-extend
  • 7. accumulator updates
  • 8. accumulator updates
  • 9. accumulator updates
  • 10. accumulator updates
  • 11. two-word load / store
  • 12a. cpy/add/abs/neg/cvt take denormalised operands as zero
  • 12b. cpy/neg never produces negative zero
  • 13. cfcvtds

Toolchain            CMP  C++ exc  unwind  1  2  3  4  5  6  7  8  9  10  11  12a  12b  13
Debian gcc 4.1.3     X                     \  \  \  f  s  /  -  -  -  -   -
Debian gcc 4.2.3     X                     \  \  \  f  s  /  -  -  -  -   -
Debian gcc 4.3.0     X                     \  \  \  f  s  /  -  -  -  -   -
futaris 4.1.2        /                              f  /  /  -  -  -  -   -        /    /
futaris 4.2.0        /                              f  s  /  -  -  -  -   -        /    /
futaris-mg 4.3.2                           \  \  \  f  s  /  -  -  -  -   -        /    /
crunchtools 1.4.0                                            -  -  -  -   -
crunchtools 1.4.1-2                                          -  -  -  -   -
crunchtools 1.4.3                                            -  -  -  -   -

Legend

  X  broken
  /  fixed
  \  fixed when -mcirrus-fix-invalid-insns is given
  f  fixed when forwarding is disabled, which is the default in mainline Linux
  s  fixed when not running in serialised mode, which seems usual [http://martinwguy.co.uk/martin/tech/Maverick/dspsc.c test program]
  -  doesn't affect GCC's use of Maverick FPU
  ?  not resolved but I can't reproduce the bug on revision E1 hardware any more.

futaris-mg is my (Martin Guy's) unpublished reworking of Futaris' unpublished patches for 4.2.2 and 4.3.0 from April 2008. I include it here to help me keep track of what I still need to check.

The bugs

The bugs are:

  • an ARM/Maverick condition code-setting anomaly;
  • C++ exception unwinding is broken;
  • 14 hardware bugs in Cirrus' silicon.

The FPU doesn't set the same condition codes as the ARM core

The comparison instructions of the ARM and of its FPA and VFP math coprocessors all set the condition codes the same way:

ARM/FPA/VFP - (cmp*):
        N  Z  C  V
A == B  0  1  1  0
A <  B  1  0  0  0
A >  B  0  0  1  0
unord   0  0  1  1

but comparisons performed in the Maverick, on either floating point or integer values, set them a different way:

MaverickCrunch - (cfcmp*):
        N  Z  C  V
A == B  0  1  0  0
A <  B  1  0  0  0
A >  B  1  0  0  1
unord   0  0  0  0

This means that not only do you have to test for the same conditions in different ways, according to which unit did the comparison, but some cases require strange sequences of conditional instructions.
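To make the mismatch concrete, the sketch below encodes both tables and applies four ARM condition tests to the resulting flags. The choice of tests is an assumption: it supposes that after a floating point compare the ARM backend tests "less than" with MI, "less or equal" with LS, "greater than" with GT and "greater or equal" with GE. All names in the code are invented for the example.

```c
/* NZCV flags left behind by a comparison. */
struct flags { int n, z, c, v; };

enum relation { REL_EQ, REL_LT, REL_GT, REL_UNORD };

/* Flags set by ARM/FPA/VFP compares, copied from the first table above. */
struct flags arm_flags(enum relation r)
{
    struct flags f = {0, 0, 0, 0};
    switch (r) {
    case REL_EQ:    f.z = 1; f.c = 1; break;
    case REL_LT:    f.n = 1;          break;
    case REL_GT:    f.c = 1;          break;
    case REL_UNORD: f.c = 1; f.v = 1; break;
    }
    return f;
}

/* Flags set by cfcmp*, copied from the MaverickCrunch table above. */
struct flags maverick_flags(enum relation r)
{
    struct flags f = {0, 0, 0, 0};    /* unordered leaves all flags clear */
    switch (r) {
    case REL_EQ:    f.z = 1;          break;
    case REL_LT:    f.n = 1;          break;
    case REL_GT:    f.n = 1; f.v = 1; break;
    case REL_UNORD:                   break;
    }
    return f;
}

/* ARM condition tests, ASSUMED to be the ones used after a float compare. */
int test_lt(struct flags f) { return f.n; }                 /* MI */
int test_le(struct flags f) { return !f.c || f.z; }         /* LS */
int test_gt(struct flags f) { return !f.z && f.n == f.v; }  /* GT */
int test_ge(struct flags f) { return f.n == f.v; }          /* GE */
```

Under that assumption, the flags from arm_flags() give the expected answers for every relation, whereas maverick_flags(REL_GT) satisfies all four tests at once and the less-than and equal cases still come out right.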

Some conditions are only tested in GCC when floating point values have been compared, including all those involving "unordered", and others only when integers have been compared (the unsigned variants LTU, GTU, LEU and GEU - is this right?), so we can know how to interpret the condition codes for these when Maverick code generation is enabled.

#include <stdio.h>

int main(void)
{
        int i; double d;

        /* Compare 3, 4 and 5 against 4.0 in each of the four ordered ways */
        for (i=3, d=3; i<=5; i++, d++) {
                printf("%g", d);
                if (d < 4.0) printf(" lt");
                if (d > 4.0) printf(" gt");
                if (d <= 4.0) printf(" le");
                if (d >= 4.0) printf(" ge");
                putchar('\n');
        }
        return 0;
}

should output:

3 lt le
4 le ge
5 gt ge

but unpatched mainline GCC outputs:

3 lt le
4 le ge
5 lt gt le ge

The best summary of the why's and wherefores is [http://gcc.gnu.org/ml/gcc/2007-06/msg00938.html on the gcc mailing list].

Futaris-4.1.2 passes the GCC testsuite by disabling conditional execution of all instructions except for branches.

The unpublished futaris patches for 4.2.2 and 4.3.0 fix this with the exception of the "unordered" and "ltgt" cases, which are missing and cause an internal compiler error if a program mentions the isunordered() function.

They also get an ICE compiling  foo(double x, double y) { return(x >= y); }  with optimisation enabled, because GCC first produces BGE (which can be represented on Maverick with a sequence of tests) and then optimises that to a MOVLT; MOVGE sequence, and the GE cannot be represented by a single condition code on Maverick (it can, with GT, if we don't honor the UNORDERED case).

C++ exception unwinding is broken

With Maverick FP enabled, C++ exceptions (try/catch blocks) fail to preserve the Maverick FPU state.

This thread on the [http://sourceware.org/ml/binutils/2008-02/msg00273.html binutils] mailing list explains why unwind support is needed. Additionally, this [http://infocenter.arm.com/help/topic/com.arm.doc.ihi0038a/IHI0038A_ehabi.pdf document (ARM IHI 0038A)] explains the unwind process using EABI. As you can see in Sec 9.3 of that document, there is no EABI unwinding instruction for popping MaverickCrunch registers. The above patch incorrectly calls the iWMMXt pop functions. A new Pop MV registers instruction needs to be added to the table, along with changes to Sec 7.5.

libunwind support

Unwind support should only be needed if [http://www.nongnu.org/libunwind/ libunwind] support is enabled. At the moment, only the development branch (git) of libunwind supports ARM processors.

Joseph S. Myers says on linux-cirrus 31 Mar 2008:

iWMMXt unwind support has been in GCC since my patch
<http://gcc.gnu.org/ml/gcc-patches/2007-01/msg00049.html>.
That illustrates the sort of thing that needs changing to implement unwind
support for a new coprocessor.  Obviously you need to get the unwind
specification in the official ARM EABI documents first before implementing
it in GCC, and binutils will also need to support generating correct
information given .save directives for the coprocessor registers.
For setjmp/longjmp support in glibc you also need to get an HWCAP value
allocated in the kernel.

Hardware bugs

See [http://www.cirrus.com/en/ cirrus.com] -> ARM Processors -> EP93{01,02,07,12,15} -> Errata (PDF) -> Maverick Crunch

Errata are different for silicon revisions D0, D1/E0/E1 and differences are reported with E2 although no further changes are said to have been made to the Maverick design.

The following is from [http://www.cirrus.com/en/pubs/errata/ER653E2B.pdf the EP9302 rev E2 errata]:

Definitions

1. "Instruction does not execute": An instruction appears in the coprocessor pipeline, but does not execute for one of the following reasons:

  • It fails its condition code check.
  • A branch is taken and it is one of the two instructions in the branch delay slot.
  • An exception occurs.
  • An interrupt occurs.

2. "Processor is in serialized mode": It is, if and only if both:

  • At least one exception type is enabled by setting one of the following bits in the DSPSC: IXE, UFE, OFE, or IOE.
  • Serialization is not specifically disabled by setting the AEXC bit in the DSPSC.

("Serialized mode" makes instruction processing less fast so that an exception can reliably be traced to the instruction that caused it. In the sample I have tested (a TS7250) it is not operating in serialised mode by these criteria because no exceptions are enabled. Source: [http://martinwguy.co.uk/martin/FPU/dspsc.c dspsc.c])

3. "An instruction updating an accumulator": These include all of the following:

  • Moves to accumulators: cfmva32, cfmva64, cfmval32, cfmvam32, cfmvah32.

  • Arithmetic into accumulators: cfmadd32, cfmadda32, cfmsub32, cfmsuba32.

4. "An instruction involving any two-word coprocessor load or store":

  • cfldr64 / cfldrd / cfstr64 / cfstrd.

1. two-word load / store

Result: register or memory corruption

Summary: a nonexecuted coprocessor instruction that is also stalled due to an internal dependency (operand is a non-cached memory read or the result of a previous incomplete coprocessor instruction) must not immediately precede a load/store 64/double.

Effects: in a 64-bit register load, the top 32 bits are loaded with junk; in a 64-bit memory store, an extra 32 bits of junk are written to memory in the word following the 64-bits that were correctly written.

An instruction may be nonexecuted because it is conditional and the condition is false, e.g.

    cfaddne c0, c1, c2
    cfldrd  c3, [r2, #0x0]

corrected by

    cfaddne c0, c1, c2
    nop
    cfldrd  c3, [r2, #0x0]

or it may be nonexecuted because it was one of the two instructions following a branch that was taken, so was loaded into the pipeline.

   target
      cfldrd  c3, [r2, #0x0]

      b       target
      nop
      cfadd   c0, c1, c2 ; though in pipeline, this does not execute

said to be corrected by:

   target
      cfldrd c3, [r2, #0x0]

      b      target
      cfadd  c0, c1, c2 ; though in pipeline, this does not execute
      nop

because the "previous instruction that does not execute" in the two-word pipeline is the nop rather than the cfadd.

GCC does not emit conditional Maverick instructions, but the branch case is not covered.

The Cirrus patches (and futaris' saveregs patch) pointlessly add a nop before every cfldrd while restoring saved Crunch registers before a function return... but not before its use in regular code.

The branch case seems not to have been addressed; one solution would be to ensure that Maverick Crunch instructions are never placed in the two instructions following any branch instruction, which is one thing mainline's -mcirrus-fix-invalid-insns flag does. Unfortunately, Futaris and Cirrus remove this flag.

If this bug is real, the flag should be enabled by default when compiling Maverick instructions.

2. instruction with source operand

Result: bad calculation or stored value

Workaround: change instruction sequence

  1. Execute a coprocessor instruction whose target is one of the coprocessor general purpose registers c0 through c15.
  2. Let the second instruction be an instruction with the same target, but not be executed.
  3. Execute a third instruction at least one of whose operands is the target of the previous two instructions.

For example, assume no pipeline interlocks other than the dependencies involving register c0 in the following instruction sequence:

    cfadd32    c0, c1, c2
    cfsub32ne  c0, c3, c4    ; assume this does not execute
    cfstr32    c0, [r2, #0x0]

In this particular case, the incorrect value stored at the address in r2 is the previous value in c0, not the expected one resulting from the cfadd32.

Suggested fix:

    cfadd32   c0, c1, c2
    nop                      ; inserted extra instruction here
    nop                      ; inserted extra instruction here
    cfsub32ne c0, c3, c4     ; assume this does not execute
    nop                      ; inserted extra instruction here
    nop                      ; inserted extra instruction here
    nop                      ; inserted extra instruction here
    cfstr32   c0, [r2, #0x0]

The exact interval for safe operation is uncertain. Empirically, on an EP9307 REV E1, a total of three inserted nops are necessary in the above case, placed before, after or around the non-executed instruction.

In the branch case:

    cfadd32   c0, c1, c2
    b         foo
    cfsub32   c0, c3, c4     ; this does not execute

foo:
    cfstr32   c0, [r2, #0x0]

a single nop after the b, or at the branch target, is sufficient.

[http://martinwguy.co.uk/martin/tech/Maverick/bug2.c A test program] tickles the bug in both ways on revision E1 silicon.

GCC doesn't emit conditional Maverick instructions and the jump case, which can occur, is fixed by mainline's -mcirrus-fix-invalid-insns.

3. two-word load / store

Data in coprocessor general purpose registers or in memory may be corrupted.

  1. Let the first instruction be a serialized instruction that does not execute. For an instruction to be serialized, at least one of the following must be true:
    • The processor must be operating in serialized mode.
    • The instruction must move to or from the DSPSC (either cfmv32sc or cfmvsc32).
  2. Let the immediately following instruction be a two-word coprocessor load or store.

In the case of a load, only the lower 32 bits (the first word) will be loaded into the target register. For example:

    cfadd32ne   c0, c1, c2    ; assume this does not execute
    cfldr64     c3, [r2, #0x0]

The lower 32 bits of c3 will correctly become what is at the memory address in r2, but the upper 32 bits of c3 will not become what is at address r2 + 0x4.

Workaround:

    cfadd32ne c0, c1, c2     ; assume this does not execute
    nop                      ; inserted extra instruction here
    cfldr64   c3, [r2, #0x0]

    ; store sequence
    cfadd32ne c4, c5, c6     ; assume this does not execute
    nop                      ; inserted extra instruction here
    cfstr64   c3, [r2, #0x0]

The real-world CPUs I've tested are not running in serialized mode, and GCC does not emit cfmv32sc or cfmvsc32.

If there are serialized ones out there, GCC does not emit conditional Maverick instructions, which just leaves the case of a Maverick instruction being in one of the two slots after a branch that is taken, which is covered by -mcirrus-fix-invalid-insns.

4. two-word store

Only in mode: forwarding, not serialised

Result: memory corruption

Summary: data operation into Crunch register followed by 64-bit store of the same Maverick register into RAM (cfstrd or cfstr64) may write rubbish

Description: When the coprocessor is not in serialized mode and forwarding is enabled, memory can be corrupted when two types of instructions appear in the instruction stream with a particular relative timing.

  1. Execute an instruction that is a data operation (not a move between ARM and coprocessor registers) whose destination is one of the general purpose registers c0 through c15.
  2. Execute an instruction that is a two-word coprocessor store (either cfstr64 or cfstrd), where the destination register of the first instruction is the source of the store instruction, that is, the second instruction stores the result of the first one to memory.
  3. Finally, the first and second instruction must appear to the coprocessor with the correct relative timing; this timing is not simply proportional to the number of intervening instructions and is difficult to predict in general.

The result is that the lower 32 bits of the result stored to memory will be correct, but the upper 32 bits will be wrong. The value appearing in the target register will still be correct.

The exact timing involved for reliable/unreliable operation is uncertain but can be tickled with [http://martinwguy.co.uk/martin/tech/ts7250/FPU/dspsc.c a test program].

Workarounds:

  • Operate the FPU without forwarding enabled, with a possible decrease in performance
  • Operate in serialized mode by enabling at least one exception, with significantly reduced performance
  • Ensure that at least seven instructions appear between the first and second instructions that cause the error
  • Disable 64-bit load/store instructions (e.g. replace them with two 32-bit insns)
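The seven-instruction workaround lends itself to a simple check over generated code. The following is only a sketch: struct crunch_op and its fields are hypothetical, and the caller is assumed to fill in dest_reg only for data operations (not for moves between ARM and coprocessor registers).

```c
#include <string.h>

/* Hypothetical instruction record for this sketch. */
struct crunch_op {
    const char *mnemonic;
    int dest_reg;   /* Maverick register written by a data operation, else -1 */
    int src_reg;    /* Maverick register stored by cfstrd/cfstr64, else -1 */
};

#define MIN_GAP 7   /* intervening instructions required by the workaround */

static int is_two_word_store(const char *m)
{
    return strncmp(m, "cfstrd", 6) == 0 || strncmp(m, "cfstr64", 7) == 0;
}

/* Returns the index of the first two-word store that follows a data
   operation on the same register too closely, or -1 if there is none. */
int first_risky_store(const struct crunch_op *code, int n)
{
    for (int s = 0; s < n; s++) {
        if (!is_two_word_store(code[s].mnemonic))
            continue;
        /* Look back while fewer than MIN_GAP instructions lie in between. */
        for (int p = s - 1; p >= 0 && s - p - 1 < MIN_GAP; p--)
            if (code[p].dest_reg >= 0 && code[p].dest_reg == code[s].src_reg)
                return s;
    }
    return -1;
}
```

A cfmuld into mvd1 followed four instructions later by cfstrd of mvd1 would be reported, while the same pair separated by seven unrelated instructions would not.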

GCC does output guilty instruction sequences. Examples from LAME:

    cfmuld  mvd1, mvd1, mvd0
    mov     r2, r7
    mov     r3, r5
    mov     r0, r8
    ldr     r1, [pc, #364]
    cfstrd  mvd1, [sp, #8]

    cfmuld  mvd1, mvd1, mvd0
    mov     r0, r8
    mov     r3, r4
    ldr     r1, [pc, #1004]
    cfstrd  mvd1, [sp]

    cfldrd  mvd0, [r8, #8]
    cfaddd  mvd0, mvd1, mvd0
    cfstrd  mvd0, [r8, #8]

but a sample system was not operating with forwarding enabled.

Code to enable forwarding (under Linux with Maverick support enabled in the kernel, the effect is limited to the process that does this):

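    void crunch_fwden(void)
    {
        /* A single asm statement with r0 declared as clobbered, so the
           compiler cannot keep a live value in r0 across these steps.
           mvd0 is overwritten as well. */
        asm volatile(
            "cfmv32sc   mvdx0, dspsc\n\t"     /* Read status register */
            "cfmvrdl    r0, mvd0\n\t"         /* Move LSW to ARM */
            "orr        r0, r0, #0x4000\n\t"  /* Set forwarding bit */
            "cfmvdlr    mvd0, r0\n\t"         /* Move ARM to LSW */
            "cfmvsc32   dspsc, mvdx0"         /* Write to status register */
            : : : "r0");
    }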

This appears to be unresolved at present.

Under Linux on the sample board I use, forwarding is disabled by default. Enabling forwarding in [http://martinwguy.co.uk/martin/tech/Maverick/bug4.c a test program] on revision E1 hardware, I have been unable to get this bug to bite.

5. cfrshl32, cfrshl64

When operating in serialized mode, cfrshl32 and cfrshl64 (logical shifts on coprocessor registers) do not work properly. The instructions shift by an unpredictable amount, but cause no other side effects.

cfrshl32 is disabled in mainline gcc, and cfrshl64 is disabled by futaris, even though real-world CPUs seem not to run in serialized mode.

6. cfldr32, cfmv64lr may not sign-extend

If an interrupt occurs during the execution of cfldr32 or cfmv64lr, the instruction may not sign extend the result correctly.

Possible workarounds include:

  • Disable interrupts when executing cfldr32 or cfmv64lr instructions.

  • Avoid executing these two instructions.
  • Do not depend on the sign extension to occur; that is, ignore the upper word in any calculations involving data loaded using these instructions.
  • Add extra code to sign extend the lower word after it is loaded by explicitly forcing the upper word to be all zeroes or all ones, as appropriate. It is possible to do this selectively in exception or interrupt handler code. If the instruction preceding the interrupted instruction can be determined, and it is a cfldr32 or cfmv64lr, the instruction may be re-executed or explicitly sign extended before returning from interrupt or exception.
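The last workaround, forcing the upper word to all zeroes or all ones, amounts to the following operation on the 64-bit register contents. This is a plain C model with an invented function name, not code for an interrupt handler.

```c
#include <stdint.h>

/* Rebuild a 64-bit register image from its low word, forcing the upper
   word to all zeroes or all ones according to bit 31 of the low word. */
uint64_t resign_extend(uint64_t reg)
{
    uint32_t low = (uint32_t)reg;
    uint64_t high = (low & 0x80000000u) ? 0xFFFFFFFF00000000ULL : 0;
    return high | low;
}
```

Whatever junk the upper word held, the result depends only on the lower word, so applying the function to a correctly extended value leaves it unchanged.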

Mainline GCC does not emit cfldr32, and use of cfmv64lr is disabled as buggy. In three places it is used as the first of a two-instruction sequence: in all cases the top 32 bits are either overwritten or ignored by the second instruction.

Verdict: Not a problem.

7. accumulator updates

The coprocessor can incorrectly update one of its destination accumulators even if the coprocessor instruction should not have been executed or is canceled by the ARM processor. This error can occur if the following is true:

  1. The first instruction must be a coprocessor compare instruction, one of cfcmp32, cfcmp64, cfcmps, and cfcmpd.

  2. The second instruction:
    • has an accumulator as a destination.
    • does not execute.

GCC does not use the accumulator instructions.

8. accumulator updates

If a data abort occurs on an instruction preceding a coprocessor data path instruction that writes to one of the accumulators, the accumulator may be updated even though the instruction was canceled.

GCC does not use the accumulator instructions.

9. accumulator updates

The coprocessor will erroneously update an accumulator if the coprocessor instruction that updates an accumulator is canceled and is followed by a coprocessor instruction that is not a data path instruction. This error will occur under the following conditions:

  1. The first instruction:
    • must update a coprocessor accumulator.
    • does not execute.
  2. The second instruction is not a coprocessor data path instruction. Coprocessor data path instructions include any instruction that does not move data to or from memory or to or from the ARM registers.

GCC does not use the accumulator instructions.

10. accumulator updates

An instruction that writes a result to an accumulator may cause corruption of any of the four accumulators when the coprocessor is operating in serialized mode.

GCC does not use the accumulator instructions.

11. two-word load / store

An erroneous memory transfer to or from any of the coprocessor general purpose registers c0 through c15 can occur given the following conditions are satisfied:

  1. The first instruction:
    • is a two-word load or store (see definition 4).
    • fails its condition code check.
    • does not busy-wait.
  2. The second consecutive instruction:
    • is a coprocessor load or store.
    • is executed.
    • does not busy-wait.

When the error occurs, the result is either coprocessor register or memory corruption. Here are several examples:

   cfstr64ne     c0, [r0, #0x0]   ; assume does not execute
   cfldrs        c2, [r2, #0x8]   ; could corrupt c2!

   cfldrdge      c0, [r0, #0x0]   ; assume does not execute
   cfstrd        c2, [r2, #0x8]   ; could corrupt memory!

   cfldr64ne     c0, [r0, #0x0]   ; assume does not execute
   cfldrdgt      c2, [r2, #0x8]   ; could corrupt c2!

The software workaround involves avoiding a pair of consecutive instructions with these properties. For example, if a conditional coprocessor two-word load or store appears, ensure that the following instruction is not a coprocessor load or store:

   cfstr64ne    c0, [r0, #0x0]     ; assume does not execute
   nop                             ; separate two instructions
   cfldrs       c2, [r2, #0x8]     ; c2 will be ok

Another workaround is to ensure that the first instruction is not conditional:

   cfstr64      c0, [r0, #0x0]     ; executes
   cfldrs       c2, [r2, #0x8]     ; c2 will be ok

Note: If both instructions depend on the same condition code, the error should not occur, as either both or neither will execute.

GCC does not emit conditional Maverick instructions.

12a. cpy/add/abs/neg/cvt take denormalised operands as zero

Result: denorm operand forced to zero

Description: When the Crunch add/subtract unit is presented with a denormalized value, it takes it as zero for that input of the calculation. The sign is unaffected. This affects values with a magnitude from 2^-149 up to, but not including, 2^-126 for floats and from 2^-1074 up to, but not including, 2^-1022 for doubles when using the following instructions:

  • Copies: cfcpys, cfcpyd
  • Add/Sub: cfadds, cfaddd, cfsubs, cfsubd
  • Absolute value: cfabss, cfabsd
  • Negation: cfnegs, cfnegd
  • Conversions: cfcvtsd, cfcvtds

Workaround: none

cfcpys and cfcpyd can be replaced with cfsh64 #0, which does a bit-wise copy, and cfnegs and cfnegd are disabled by futaris so the operation is performed in pairs of ARM registers. Disabling the rest would mean losing most FP math ops, so we live with the imprecision.

12b. cpy/neg never produces negative zero

When the operand is negative zero, cfcpys and cfcpyd write positive zero to the destination register, while the result should be negative zero. When the operand is positive zero, cfnegs and cfnegd write positive zero to the destination register, while the result should be negative zero.

Futaris disables the cfneg instructions in its arm-crunch-neg2 patch; it would be better to guard them with a ! HONOR_NEGATIVE_ZEROS([SD]Fmode) condition so that they are enabled when -ffast-math is selected.

cfcpyd can be replaced with cfsh64 #0, which does a bitwise copy. cfsh32 is no use for cfcpys as it copies the lower 32 bits instead of the upper 32, but we can just use cfsh64 to copy all 64 bits the same way.
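For comparison, a negation done on the raw bits, as it would be when the value is held in integer registers, always handles zero correctly because it only flips the sign bit. The helper names below are invented for this sketch.

```c
#include <stdint.h>

/* Negate a float held as raw bits: flip bit 31, so +0 becomes -0. */
uint32_t neg_float_bits(uint32_t bits)
{
    return bits ^ 0x80000000u;
}

/* Negate a double held as raw bits: flip bit 63. */
uint64_t neg_double_bits(uint64_t bits)
{
    return bits ^ 0x8000000000000000ULL;
}
```

Positive zero is the all-zero bit pattern, so negating it this way yields the pattern with only the sign bit set, which is negative zero.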

13. cfcvtds

The operation cfcvtds, which converts a double floating point value to a single floating point value, never produces a denormalized result, even if the value can be accurately represented as such. The result underflows directly to zero. Sign is preserved properly, however.

Workaround: none

Futaris' arm-crunch-cfcvtds-disable patch disables this instruction: double to float conversion is done using the soft-float functions. Given that any resulting denormalised numbers will probably be truncated to zero by the math ops in bug 12, there may not be much point in doing this.
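The described behaviour of cfcvtds can be imitated in C for experimentation on another machine. This is a model under the assumption that the only deviation from IEEE conversion is the flush of denormal results to a signed zero; the function name is invented.

```c
#include <float.h>

/* Model of cfcvtds under erratum 13: a result that IEEE conversion would
   deliver as a denormal float is replaced by zero with the same sign. */
float cvtds_as_maverick(double d)
{
    float f = (float)d;                            /* ordinary IEEE conversion */
    if (f != 0.0f && f < FLT_MIN && f > -FLT_MIN)  /* denormal result */
        return (d < 0.0) ? -0.0f : 0.0f;           /* underflow straight to zero */
    return f;
}
```

A value such as 1e-40, which fits in a float only as a denormal, comes back as zero, while values of normal magnitude convert as usual.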