Differences between revisions 28 and 29
Revision 28 as of 2008-04-03 01:46:43
Size: 20546
Editor: HasjimWilliams
Comment:
Revision 29 as of 2008-04-09 11:36:22
Size: 20721
Editor: MartinGuy
Comment: Get "serialised mode" right.

This page records the issues and existing patches to make GCC generate reliable code for the Cirrus Logic [http://en.wikipedia.com/wiki/MaverickCrunch Maverick Crunch] floating point unit.

It applies to the "armel" ArmEabiPort of Debian when compiling for a Cirrus Logic EP93xx ARM + Maverick chip with

    gcc -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp

For docs on the Maverick Crunch unit and its problems, see [http://www.cirrus.com/en/products/pro/detail/P1066.html the EP9302 user's guide], chapter 2, and [http://www.cirrus.com/en/pubs/errata/ER653E2B.pdf the EP93xx errata].


Status

Mainline GCC does not have reliable Maverick Crunch code generation. Three sets of patches exist:

Futaris' strategy is to implement the funny >= case after FP compares and to disable all conditional instructions other than branch, as well as 64-bit operations (thereby avoiding several of the timing bugs). No set is perfect yet. Here is [http://martinwguy.co.uk/martin/FPU a summary of their merits and some benchmarks].

The bugs are: an ARM/Maverick condition code-setting anomaly, C++ exception unwinding and 13 hardware bugs to work around, of which the last two just affect limit-of-precision accuracy with very tiny numbers. The following are resolved:

|| Toolchain || CMP || C++ exc || libunwind || 1 || 2 || 3 || 4 || 5 || 6 || 7 || 8 || 9 || 10 || 11 ||
|| Debian gcc 4.1.3 (0) || {X} ||
|| Debian gcc 4.2.3 || {X} ||
|| Debian gcc 4.3.0 || {X} ||
|| futaris 4.1.2 || (./) ||
|| futaris 4.2.0 || (./) ||
|| crunchtools 1.4.0 (1) ||
|| crunchtools 1.4.1-2 (2) ||
|| crunchtools 1.4.3 (3) ||

(./) = works, {X} = broken.

  1. mainline gcc has an -mcirrus-fix-invalid-insns flag that should work around some bugs but doesn't.

  2. crunchtools-1.4.0 is gcc 4.1.2, uclibc; use gcc -mfix-crunch-d1

  3. 0.

Problems

The FPU doesn't set the same condition codes as the ARM core

Test program:

#include <stdio.h>

int main(void)
{
        int i; double d;
        /* Compare 3, 4 and 5 against 4.0 with each relational operator */
        for (i=3, d=3; i<=5; i++, d++) {
                printf("%g", d);
                if (d < 4.0) printf(" lt");
                if (d > 4.0) printf(" gt");
                if (d <= 4.0) printf(" le");
                if (d >= 4.0) printf(" ge");
                putchar('\n');
        }
        return 0;
}

should output:

3 lt le
4 le ge
5 gt ge

not:

3 lt le
4 le ge
5 lt gt le ge

Best summary of the why's and wherefores [http://gcc.gnu.org/ml/gcc/2007-06/msg00938.html on the gcc mailing list].
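To run the same check without reading the output by eye, the comparisons can be collected into a string and matched against the expected result. This is only a sketch: the helper name relations_to_4 is made up for illustration and is not part of any of the patch sets.

```c
#include <string.h>

/* Hypothetical helper: append " lt", " gt", " le", " ge" to buf for each
 * comparison of d against 4.0 that evaluates true, in the same order as
 * the test program above. buf must hold at least 13 bytes. */
static const char *relations_to_4(double d, char *buf)
{
    buf[0] = '\0';
    if (d <  4.0) strcat(buf, " lt");
    if (d >  4.0) strcat(buf, " gt");
    if (d <= 4.0) strcat(buf, " le");
    if (d >= 4.0) strcat(buf, " ge");
    return buf;
}
```

With a toolchain that has the compare bug, the result for 5.0 would be expected to come back as " lt gt le ge" instead of " gt ge".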

C++ exception unwinding is broken

With Maverick FP enabled, C++ exceptions (try/catch blocks) fail to preserve the Maverick FPU state. There is a patch that tries to implement this, but it is invalid.

Hasjim Williams has [http://files.futaris.org/glibc/glibc-crunch.patch one patch] which he [http://www.freelists.org/archives/linux-cirrus/03-2008/msg00004.html explains and expresses doubt about].

This thread on the [http://sourceware.org/ml/binutils/2008-02/msg00273.html binutils] mailing list explains why unwind support is needed. Additionally, this [http://infocenter.arm.com/help/topic/com.arm.doc.ihi0038a/IHI0038A_ehabi.pdf document (ARM IHI 0038A)] explains the unwind process using EABI. As you can see in Sec 9.3 of that document, the EABI defines no unwinding instruction for popping MaverickCrunch registers. The above patch incorrectly calls the iWMMXt pop functions. A new "Pop MV registers" instruction needs to be added to the table, along with changes to Sec 7.5.

libunwind support

Unwind support should only be needed if [http://www.nongnu.org/libunwind/ libunwind] support is enabled. At the moment, only the development branch (git) of libunwind supports ARM processors.

Joseph S. Myers says on linux-cirrus 31 Mar 2008:

iWMMXt unwind support has been in GCC since my patch
<http://gcc.gnu.org/ml/gcc-patches/2007-01/msg00049.html>.
That illustrates the sort of thing that needs changing to implement unwind
support for a new coprocessor.  Obviously you need to get the unwind
specification in the official ARM EABI documents first before implementing
it in GCC, and binutils will also need to support generating correct
information given .save directives for the coprocessor registers.
For setjmp/longjmp support in glibc you also need to get an HWCAP value
allocated in the kernel.

Timing-dependent hardware bugs must be avoided

See [http://www.cirrus.com/en/ cirrus.com] -> ARM Processors -> EP93{02,07,12,15} -> Errata (PDF) -> Maverick Crunch

Errata are different for silicon revisions D0, D1/E0/E1 and E2.

The failings fall into the following four categories, and are detailed below.

1. An instruction appears in the coprocessor pipeline, but does not execute for one of the following reasons:

  • It fails its condition code check.
  • A branch is taken and it is one of the two instructions in the branch delay slot.
  • An exception occurs.
  • An interrupt occurs.

2. The behaviour depends on whether the coprocessor is operating in serialized mode. It is in serialized mode if and only if both of the following are true:

  • At least one exception type is enabled by setting one of the following bits in the DSPSC: IXE, UFE, OFE, or IOE.
  • Serialization is not specifically disabled by setting the AEXC bit in the DSPSC.

("Serialized mode" slows instruction processing so that an exception can reliably be traced to the instruction that caused it. The sample I have tested (a TS7250) is not operating in serialised mode by these criteria, because no exceptions are enabled. Source: [http://martinwguy.co.uk/martin/FPU/dspsc.c dspsc.c])
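The two criteria can be written as a small predicate. This is a sketch only: the struct and function names are invented here, and the bit masks are deliberately left to the caller because the real DSPSC bit positions have to be taken from the EP93xx user's guide.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit masks for the DSPSC fields named above. Hypothetical container:
 * the real positions must come from the user's guide. */
struct dspsc_masks {
    uint64_t ixe, ufe, ofe, ioe;  /* exception enable bits */
    uint64_t aexc;                /* set = serialization disabled */
};

/* True if a DSPSC value read by the caller meets both criteria. */
static bool crunch_is_serialized(uint64_t dspsc, const struct dspsc_masks *m)
{
    bool any_exception_enabled =
        (dspsc & (m->ixe | m->ufe | m->ofe | m->ioe)) != 0;
    bool serialization_disabled = (dspsc & m->aexc) != 0;
    return any_exception_enabled && !serialization_disabled;
}
```

A DSPSC value with no exception enable bits set, as reported for the TS7250 above, makes this return false whatever the state of AEXC.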

3. An instruction updating an accumulator. These include all of the following:

  • Moves to accumulators: cfmva32, cfmva64, cfmval32, cfmvam32, cfmvah32.
  • Arithmetic into accumulators: cfmadd32, cfmadda32, cfmsub32, cfmsuba32.

4. An instruction involving any two-word coprocessor load or store:

  • cfldr64, cfldrd, cfstr64, and cfstrd.
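A checker for categories 3 and 4 could start from simple mnemonic tables like the ones below. This is a minimal sketch with made-up function names; it matches base mnemonics only, so a caller would have to strip any condition suffix (ne, ge, ...) first.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Linear search of a small table of mnemonics. */
static bool in_list(const char *mnemonic, const char *const *list, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (strcmp(mnemonic, list[i]) == 0)
            return true;
    return false;
}

/* Category 3: instructions that update an accumulator. */
static bool updates_accumulator(const char *mnemonic)
{
    static const char *const acc[] = {
        "cfmva32", "cfmva64", "cfmval32", "cfmvam32", "cfmvah32",
        "cfmadd32", "cfmadda32", "cfmsub32", "cfmsuba32",
    };
    return in_list(mnemonic, acc, sizeof acc / sizeof acc[0]);
}

/* Category 4: two-word coprocessor loads and stores. */
static bool is_two_word_load_store(const char *mnemonic)
{
    static const char *const ldst[] = {
        "cfldr64", "cfldrd", "cfstr64", "cfstrd",
    };
    return in_list(mnemonic, ldst, sizeof ldst / sizeof ldst[0]);
}
```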

From [http://www.cirrus.com/en/pubs/errata/ER653E2B.pdf the EP9302 rev E2 errata]:

1. two-word load / store

Result: register or memory corruption

Summary: a conditional coprocessor instruction must not immediately precede a load/store 64/double (insert one nop) e.g.

    cfaddne c0, c1, c2
    nop
    cfldrd  c3, [r2, #0x0]

Futaris avoids this by disabling all non-branch conditional instructions.
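For toolchains that keep conditional instructions, the rule above could be checked over an instruction stream roughly as follows. The struct insn representation is a simplified assumption for illustration, not how GCC or any of the patch sets model instructions.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of one instruction. */
struct insn {
    bool coprocessor;   /* any Maverick Crunch instruction */
    bool conditional;   /* carries a condition other than "always" */
    bool two_word_ldst; /* cfldr64, cfldrd, cfstr64 or cfstrd */
};

/* Return the index of the first two-word load/store that is immediately
 * preceded by a conditional coprocessor instruction, or -1 if none. */
static long find_erratum1_hazard(const struct insn *code, size_t n)
{
    for (size_t i = 1; i < n; i++)
        if (code[i].two_word_ldst &&
            code[i - 1].coprocessor && code[i - 1].conditional)
            return (long)i;
    return -1;
}
```

The cfaddne / nop / cfldrd sequence shown above passes this check, because the nop sits between the conditional instruction and the load.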

In the same paragraph, Cirrus also say "Finally, consider a case where a branch occurs:"

   target
      cfldrd  c3, [r2, #0x0]
      b       target
      nop
      cfadd   c0, c1, c2 ; though in pipeline, this does not execute

said to be corrected by:

   target
      cfldrd c3, [r2, #0x0]
      b      target
      cfadd  c0, c1, c2 ; though in pipeline, this does not execute
      nop

Does anyone understand this? It may mean: don't insert the nop straight after the branch instruction, but after the instruction that follows it.

2. instruction with source operand

Result: bad calculation or stored value

Workaround: change instruction sequence

  1. Execute a coprocessor instruction whose target is one of the coprocessor general purpose registers c0 through c15.
  2. Let the second instruction be one with the same target that is not executed.
  3. Execute a third instruction at least one of whose operands is the target of the previous two instructions.

For example, assume no pipeline interlocks other than the dependencies involving register c0 in the following instruction sequence:

    cfadd32    c0, c1, c2
    cfsub32ne  c0, c3, c4    ; assume this does not execute
    cfstr32    c0, [r2, #0x0]

In this particular case, the incorrect value stored at the address in r2 is the previous value in c0, not the expected one resulting from the cfadd32.

Suggested fix:

    cfadd32   c0, c1, c2
    nop                      ; inserted extra instruction here
    nop                      ; inserted extra instruction here
    cfsub32ne c0, c3, c4     ; assume this does not execute
    nop                      ; inserted extra instruction here
    nop                      ; inserted extra instruction here
    nop                      ; inserted extra instruction here
    cfstr32   c0, [r2, #0x0]

The exact interval for safe operation is uncertain.

Futaris solves this by disabling conditional instructions other than branch.

3. two-word load / store

Data in coprocessor general purpose registers or in memory may be corrupted.

  1. Let the first instruction be a serialized instruction that does not execute. For an instruction to be serialized, at least one of the following must be true:
    • The processor must be operating in serialized mode.
    • The instruction must move to or from the DSPSC (either cfmv32sc or cfmvsc32).
  2. Let the immediately following instruction be a two-word coprocessor load or store.

In the case of a load, only the lower 32 bits (the first word) will be loaded into the target register. For example:

    cfadd32ne   c0, c1, c2    ; assume this does not execute
    cfldr64     c3, [r2, #0x0]

The lower 32 bits of c3 will correctly become what is at the memory address in r2, but the upper 32 bits of c3 will not become what is at address r2 + 0x4.

Workaround:

    ; load sequence
    cfadd32ne c0, c1, c2     ; assume this does not execute
    nop                      ; inserted extra instruction here
    cfldr64   c3, [r2, #0x0]

    ; store sequence
    cfadd32ne c4, c5, c6     ; assume this does not execute
    nop                      ; inserted extra instruction here
    cfstr64   c3, [r2, #0x0]

Futaris resolves this by disabling all conditional instructions except branch.

4. two-word store

Only in mode: forwarding, not serialised

Result: memory corruption

Summary: data operation into Crunch register followed by 64-bit store of the same Maverick register into RAM writes rubbish

Description: When the coprocessor is not in serialized mode and forwarding is enabled, memory can be corrupted when two types of instructions appear in the instruction stream with a particular relative timing.

  1. Execute an instruction that is a data operation (not a move between ARM and coprocessor registers) whose destination is one of the general purpose registers c0 through c15.
  2. Execute an instruction that is a two-word coprocessor store (either cfstr64 or cfstrd), where the destination register of the first instruction is the source of the store instruction, that is, the second instruction stores the result of the first one to memory.
  3. Finally, the first and second instruction must appear to the coprocessor with the correct relative timing; this timing is not simply proportional to the number of intervening instructions and is difficult to predict in general.

The result is that the lower 32 bits of the result stored to memory will be correct, but the upper 32 bits will be wrong. The value appearing in the target register will still be correct.

The exact timing involved for reliable/unreliable operation is uncertain but can be tickled with [http://martinwguy.co.uk/martin/tech/ts7250/FPU/dspsc.c a test program].

Workarounds:

  • Operate the FPU without forwarding enabled, with a possible decrease in performance
  • Operate in serialized mode by enabling at least one exception, with significantly reduced performance
  • Ensure that at least seven instructions appear between the first and second instructions that cause the error

Futaris gets round this by disabling 64-bit instructions.
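If the third workaround were used instead, a scheduling pass would need to know how much padding to add between the data operation and the store. A minimal sketch, with an invented function name and taking the figure of seven from the list above:

```c
/* Number of filler instructions to add so that at least min_gap
 * instructions separate the data operation from the 64-bit store.
 * gap is the number of instructions already between them. */
static unsigned padding_needed(unsigned gap, unsigned min_gap)
{
    return gap >= min_gap ? 0 : min_gap - gap;
}
```

For example, with three unrelated instructions already in between, padding_needed(3, 7) asks for four more.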

5. cfrshl32, cfrshl64

When operating in serialized mode, cfrshl32 and cfrshl64 (logical shifts on coprocessor registers) do not work properly. The instructions shift by an unpredictable amount, but cause no other side effects.

Futaris solves this by disabling these two instructions.

6. cfldr32, cfmv64lr

If an interrupt occurs during the execution of cfldr32 or cfmv64lr, the instruction may not sign extend the result correctly.

Possible workarounds include:

  • Disable interrupts when executing cfldr32 or cfmv64lr instructions.
  • Avoid executing these two instructions.
  • Do not depend on the sign extension to occur; that is, ignore the upper word in any calculations involving data loaded using these instructions.
  • Add extra code to sign extend the lower word after it is loaded by explicitly forcing the upper word to be all zeroes or all ones, as appropriate. It is possible to do this selectively in exception or interrupt handler code. If the instruction preceding the interrupted instruction can be determined, and it is a cfldr32 or cfmv64lr, the instruction may be re-executed or explicitly sign extended before returning from interrupt or exception.
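The last workaround, forcing the upper word to all zeroes or all ones, amounts to the following bit manipulation. It is shown here in C on a plain 64-bit integer as an illustration; the function name is made up, and real code would have to do this on the coprocessor register contents.

```c
#include <stdint.h>

/* Given the raw 64-bit register contents, keep the lower word and
 * rebuild the upper word from the lower word's sign bit. */
static uint64_t fix_sign_extension(uint64_t reg)
{
    uint64_t low = reg & 0xFFFFFFFFull;
    uint64_t high = (low & 0x80000000ull) ? 0xFFFFFFFF00000000ull : 0;
    return high | low;
}
```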

7. accumulator updates

The coprocessor can incorrectly update one of its destination accumulators even if the coprocessor instruction should not have been executed or is canceled by the ARM processor. This error can occur if the following is true:

  1. The first instruction must be a coprocessor compare instruction, one of cfcmp32, cfcmp64, cfcmps, and cfcmpd.
  2. The second instruction:
    • has an accumulator as a destination (see category 3 above).
    • does not execute.

Example 1: In this case the second instruction may modify a2 even if the condition is not matched.

    cfcmp32            r15, c0, c5
    cfmva64ne          a2, c8

Example 2: In this case the second instruction may modify a2 even if an interrupt or exception causes it to be canceled and re-executed after the interrupt/exception handler returns.

    cfcmp32            r15, c0, c5
    cfmadda            a2, a2, c0, c1

The workaround for this issue is to ensure that at least one other instruction appears between these instructions. For example, possible fixes for the instruction sequences above are:

   cfcmp32            r15, c0, c5
   nop
   cfmva64ne          a2, c8

and

   cfcmp32            r15, c0, c5
   nop
   cfmadda            a2, a2, c0, c1

Futaris solves this by disabling conditional instructions.
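The nop-insertion workaround could be done as a pass over an already classified instruction stream. The sketch below assumes a hypothetical enum insn_kind in place of real instruction data, and pads after every compare because any instruction may be cancelled by an interrupt.

```c
#include <stddef.h>

/* Hypothetical classification of instructions for this erratum. */
enum insn_kind { OTHER, NOP, CRUNCH_COMPARE, ACC_UPDATE };

/* Copy in[0..n) to out, inserting a NOP between every Crunch compare and
 * an immediately following accumulator update. out must have room for
 * 2*n entries. Returns the number of entries written. */
static size_t separate_compare_from_acc(const enum insn_kind *in, size_t n,
                                        enum insn_kind *out)
{
    size_t w = 0;
    for (size_t i = 0; i < n; i++) {
        if (i > 0 && in[i] == ACC_UPDATE && in[i - 1] == CRUNCH_COMPARE)
            out[w++] = NOP;
        out[w++] = in[i];
    }
    return w;
}
```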

8. accumulator updates

If a data abort occurs on an instruction preceding a coprocessor data path instruction that writes to one of the accumulators, the accumulator may be updated even though the instruction was canceled.

For example:

    str       r7, [r0, #0x1d] ; assume this causes a data abort
    cfmadda32 a0, a2, c0, c1

The second instruction will update a0 even though it should be canceled due to the data abort on the previous instruction.

A complete software workaround requires ensuring that data aborts do not occur due to any instruction immediately preceding a coprocessor instruction that writes to an accumulator. The only way to ensure this is to not allow memory operations immediately preceding these types of instructions. For example, the fix for the instructions above is:

    str        r7, [r0, #0x1d] ; assume this causes a data abort
    nop
    cfmadda32  a0, a2, c0, c1

9. accumulator updates

The coprocessor will erroneously update an accumulator if the coprocessor instruction that updates an accumulator is canceled and is followed by a coprocessor instruction that is not a data path instruction. This error will occur under the following conditions:

  1. The first instruction:
    • must update a coprocessor accumulator.
    • does not execute.
  2. The second instruction is not a coprocessor data path instruction. Coprocessor data path instructions include any instruction that does not move data to or from memory or to or from the ARM registers.

For example:

    cfmva64ne  a2, c3
    cfmvr64l   r4, c15

If the first instruction should not execute or is interrupted, it may incorrectly update a2.

Because any instruction may be canceled due to an asynchronous interrupt, the most general software workaround is to ensure that no instruction that updates an accumulator is followed immediately by a non-data path coprocessor instruction. For example, the fix for the instruction sequence above is:

  cfmva64ne    a2, c3
  nop
  cfmvr64l     r4, c15

Futaris gets round this by disabling conditional instructions.

10. accumulator updates

An instruction that writes a result to an accumulator may cause corruption of any of the four accumulators when the coprocessor is operating in serialized mode.

For example, the following sequence of instructions may corrupt a2 if the second instruction is not executed.

  cfmadda32   a0, a2, c0, c1
  cfmadda32ne a2, c3, c0, c1

The only workaround for this issue is to operate the coprocessor in unserialized mode.

11. two-word load / store

An erroneous memory transfer to or from any of the coprocessor general purpose registers c0 through c15 can occur when the following conditions are satisfied:

  1. The first instruction:
    • is a two-word load or store (see category 4 above).
    • fails its condition code check.
    • does not busy-wait.
  2. The second consecutive instruction:
    • is a coprocessor load or store.
    • is executed.
    • does not busy-wait.

When the error occurs, the result is either coprocessor register or memory corruption. Here are several examples:

   cfstr64ne     c0, [r0, #0x0]   ; assume does not execute
   cfldrs        c2, [r2, #0x8]   ; could corrupt c2!

   cfldrdge      c0, [r0, #0x0]   ; assume does not execute
   cfstrd        c2, [r2, #0x8]   ; could corrupt memory!

   cfldr64ne     c0, [r0, #0x0]   ; assume does not execute
   cfldrdgt      c2, [r2, #0x8]   ; could corrupt c2!

The software workaround involves avoiding a pair of consecutive instructions with these properties. For example, if a conditional coprocessor two-word load or store appears, ensure that the following instruction is not a coprocessor load or store:

   cfstr64ne    c0, [r0, #0x0]     ; assume does not execute
   nop                             ; separate two instructions
   cfldrs       c2, [r2, #0x8]     ; c2 will be ok

Another workaround is to ensure that the first instruction is not conditional:

   cfstr64      c0, [r0, #0x0]     ; executes
   cfldrs       c2, [r2, #0x8]     ; c2 will be ok

Note: If both instructions depend on the same condition code, the error should not occur, as either both or neither will execute.

Futaris disables conditional instructions other than branch, as well as 64-bit operations. This should stop this bug from being triggered.
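The pair test for this erratum, including the same-condition exception in the note, can be sketched as below. The struct cp_insn layout and the COND_ALWAYS encoding are assumptions made for the example. Busy-waiting cannot be known statically, so the check is conservative and ignores it.

```c
#include <stdbool.h>

#define COND_ALWAYS 0  /* placeholder encoding for "unconditional" */

/* Hypothetical, simplified view of one instruction. */
struct cp_insn {
    bool cp_load_store;  /* any coprocessor load or store */
    bool two_word;       /* cfldr64, cfldrd, cfstr64 or cfstrd */
    int  cond;           /* condition code, COND_ALWAYS if none */
};

/* True if the two consecutive instructions could trigger the erratum. */
static bool erratum11_pair(const struct cp_insn *first,
                           const struct cp_insn *second)
{
    if (!(first->cp_load_store && first->two_word))
        return false;
    if (first->cond == COND_ALWAYS)   /* always executes: safe */
        return false;
    if (!second->cp_load_store)
        return false;
    /* Same condition: both execute or neither does (see the note). */
    if (second->cond == first->cond)
        return false;
    return true;
}
```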

12. floating point add, cpy, abs, neg

Result: denorm operand forced to zero, cpy/neg never produces -zero

Description: When an operand to the Crunch add/subtract unit is denormalized, it is forced to zero before input to the calculation. The sign is unaffected. This affects the following instructions:

  • Copies: cfcpys, cfcpyd
  • Add/Sub: cfadds, cfaddd, cfsubs, cfsubd
  • Absolute value: cfabss, cfabsd
  • Negation: cfnegs, cfnegd
  • Conversions: cfcvtsd, cfcvtds

When the operand is negative zero, cfcpys and cfcpyd write positive zero to the destination register, while the result should be negative zero. When the operand is positive zero, cfnegs and cfnegd write positive zero to the destination register, while the result should be negative zero.
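To see where results would differ from IEEE arithmetic, the operand handling described above can be modelled on a host machine. This is a rough model under the stated description only, with invented function names; it does not reproduce the separate negative-zero behaviour of cfcpy and cfneg.

```c
#include <math.h>

/* Model of the documented operand handling: a denormal input is replaced
 * by a zero of the same sign before the operation. */
static double crunch_operand(double x)
{
    if (fpclassify(x) == FP_SUBNORMAL)
        return signbit(x) ? -0.0 : 0.0;
    return x;
}

/* Model of a double-precision add built on that operand handling. */
static double crunch_addd_model(double a, double b)
{
    return crunch_operand(a) + crunch_operand(b);
}
```

Adding two denormals gives zero in this model, where IEEE arithmetic would give a non-zero denormal.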

Workaround: none

These are limit-of-precision cases and should not affect results. Ignore.

13. cfcvtds

The operation cfcvtds, which converts a double floating point value to a single floating point value, never produces a denormalized result, even if the value can be accurately represented as such. The result underflows directly to zero. Sign is preserved properly, however.
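The described behaviour can be modelled on a host machine to find the inputs that would be affected. A sketch under that description only: the function name is invented, and the rounding of other values on real hardware is not modelled.

```c
#include <math.h>

/* Model of cfcvtds as described: a result that would be a denormal float
 * becomes a zero of the same sign instead. */
static float crunch_cvtds_model(double d)
{
    float f = (float)d;
    if (fpclassify(f) == FP_SUBNORMAL)
        return signbit(f) ? -0.0f : 0.0f;
    return f;
}
```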

Workaround: none

This is a limit-of-precision case which we can ignore.

arm-crunch-cfcvtds-disable.patch gets around this bug by disabling the instruction, so that double to float conversion is done by the soft-float functions.