Making fast floating point math work on the Cirrus MaverickCrunch floating point unit
This page records the issues and existing patches to make GCC generate reliable code for the Cirrus Logic Maverick Crunch floating point unit.
It applies to the "armel" ArmEabiPort of Debian when compiling for a Cirrus Logic EP93xx ARM + Maverick chip with
gcc -mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp
For docs on the Maverick Crunch unit and its problems, see the EP93xx User's Guide, chapter 2, and the Errata of the documents for the EP9312.
Discussion specific to it usually happens on the linux-cirrus mailing list.
Contents
- Making fast floating point math work on the Cirrus MaverickCrunch floating point unit
- About the FPU
- Versions of GCC
- The bugs
- The FPU doesn't set the same condition codes as the ARM core
- C++ exception unwinding is broken
- libunwind support
- Hardware bugs
- Definitions
- 1. two-word load / store
- 2. instruction with source operand
- 3. two-word load / store
- 4. two-word store
- 5. cfrshl32, cfrshl64
- 6. cfldr32, cfmv64lr may not sign-extend
- 7. accumulator updates
- 8. accumulator updates
- 9. accumulator updates
- 10. accumulator updates
- 11. two-word load / store
- 12a. cpy/add/abs/neg/cvt take denormalised operands as zero
- 12b. cpy/neg never produces negative zero
- 13. cfcvtds never produces denorms
- 14. double word load/store corrupts memory
- 15. Shift counts are truncated to 6-bit signed
About the FPU
The MaverickCrunch Coprocessor is an IEEE-754 floating point accelerator unit that comes together with an ARM920T integer CPU in Cirrus Logic's EP9301, EP9302, EP9307, EP9312 and EP9315 chips. It has a different instruction set from other floating point accelerators that are found with ARM processors: ARM's old FPA unit, now rare, and the more recent VFP unit. ARMs also come with iWMMXt, NEON and DaVinci coprocessors, but these are specialized SIMD and DSP processors, not generic floating point math ones.
Five revisions of the silicon were issued: D0, D1, E0, E1 and E2. The revision of a chip is printed as the 5th and 6th characters of the second line of text on the chip housing. The now rare D0 revision has a more extensive range of hardware bugs than the later revisions; from D1-E2 no further modifications were made to the design of the Maverick unit. Here we only attempt to work around the bugs in the later series.
Cirrus stopped development of its ARM devices on 1st April 2008 (no joke!) but will continue to sell the chips.
Registers
It has 16 64-bit registers, which can be treated as single- or double-precision floating point values, or as 32- or 64-bit integers. Single-precision floats live in the top 32 bits of the register and, when they are written, the lower 32 bits are zeroed. 32-bit integers live in the lower 32 bits and, when they are written, the top 32 bits are (usually!) sign-extended according to bit 31. It also has four 72-bit multiply-accumulate integer registers which are not used by GCC.
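The register layout described above can be sketched in C as follows. This is only a model of the documented behaviour, not code that touches the FPU; the function names are made up for illustration.

```c
#include <stdint.h>

/* Contents of a 64-bit Maverick register after a single-precision write:
   the float's bit pattern sits in the top 32 bits, the lower 32 are zeroed. */
uint64_t crunch_write_f32(uint32_t float_bits)
{
    return (uint64_t)float_bits << 32;
}

/* Contents after a 32-bit integer write: the integer sits in the lower
   32 bits and bit 31 is copied into every bit of the upper word. */
uint64_t crunch_write_i32(uint32_t value)
{
    uint64_t upper = (value & 0x80000000u) ? 0xFFFFFFFF00000000ull : 0;
    return upper | value;
}
```

For example, writing the integer -1 (0xFFFFFFFF) gives a register full of ones, while writing the float 1.0 (bit pattern 0x3F800000) leaves the lower word zero.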
Instruction set
It provides instructions to add, subtract, multiply, compare, negate and give absolute value for all these types, to shift the registers in the two integer modes, and to convert between the data types. These operations can only be done between Maverick registers, but data can be copied between Maverick and ARM registers and between Maverick registers and main memory.
Operating modes
The FPU can operate in several modes, controlled by bits in its status register:
- ISAT: Selects saturating arithmetic for integer operations instead of overflowing. The default is non-saturating, as required by C.
- UI: Unsigned integer: in comparisons between integers, the values are treated as signed or unsigned at the moment they are compared. This differs from the ARM (and FPA and VFP) comparisons, which set condition codes that are only interpreted as signed or unsigned when a decision is made. The default is signed.
- Synchronous/Asynchronous: Synchronous mode is much slower, but ensures that, if floating point exceptions are enabled and occur, you can be sure to pinpoint the offending instruction. The default is asynchronous (i.e. fast).
- Forwarding/Non-forwarding: Forwarding channels the results of arithmetic operations back to the input of the logic unit as well as to the destination registers so that, when the result of one instruction is used in another soon after, execution is faster. LAME gains 2.5% in speed by enabling this; the default is non-forwarding.
Instruction format
MaverickCrunch instructions are 32-bit words that are interleaved with the regular ARM instruction stream. The unit appears as coprocessors 4, 5 and 6, and its instruction words in hexadecimal match the eight-digit regular expression 0x.[cde]...[456]..
In GCC output, this is further restricted to 0xe[cde]...[45].. as all Maverick instructions are unconditional ("e"), and it does not use the CP6 72-bit-multiply-and-accumulate instructions.
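As a rough sketch, the two patterns above translate into bit tests on the instruction word. The function names are hypothetical, and matching the pattern does not prove a word is an instruction: data embedded in the code stream can match too.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the word matches 0x.[cde]...[456].. */
bool is_maverick_insn(uint32_t word)
{
    unsigned op = (word >> 24) & 0xF;   /* second hex digit */
    unsigned cp = (word >> 8) & 0xF;    /* sixth hex digit: coprocessor number */
    return op >= 0xC && op <= 0xE && cp >= 4 && cp <= 6;
}

/* True if the word matches the narrower 0xe[cde]...[45].. pattern
   that GCC output is limited to: unconditional, and never CP6. */
bool is_gcc_maverick_insn(uint32_t word)
{
    return is_maverick_insn(word)
        && (word >> 28) == 0xE
        && ((word >> 8) & 0xF) != 6;
}
```

A disassembly filter could use the second function to count how many Maverick instructions a compiler actually emitted in a binary.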
GCC support
The mainline GCC support for it that was submitted in 2003 by RedHat, usually selected with
-mcpu=ep9312 -mfpu=maverick -mfloat-abi=softfp
fails to produce working code for it. Most crucially, it fails to take proper account of the way that the FPU sets the condition code registers after a comparison, so the code it generates sometimes gets floating point and 64-bit integer comparisons wrong.
GCC does not use:
- the 32-bit integer operations. It performs these in ARM registers as usual.
- conditional execution of Maverick instructions, to avoid hardware bugs.
Versions of GCC
Mainline GCC has never been able to generate working code for the Maverick Crunch. It has a -mcirrus-fix-invalid-insns flag, which ensures that the two instructions following a branch are not Cirrus ones, and that every cfldrd, cfldr64, cfstrd, cfstr64 is followed by one non-Cirrus instruction, which should fix bugs 1 and 2. In a moderately FP-intensive test piece (LAME), it only causes an overall slowdown of 0.7%.
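The rule that flag is meant to enforce can be written down as a small checker. This is a sketch of the rule as stated above, not GCC's implementation; the instruction classification (enum insn_kind) and the function names are invented for the example.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical classification of an instruction stream. */
enum insn_kind { INSN_OTHER, INSN_BRANCH, INSN_CIRRUS, INSN_CIRRUS_64 };

static bool is_cirrus(enum insn_kind k)
{
    return k == INSN_CIRRUS || k == INSN_CIRRUS_64;
}

/* Returns true if the sequence obeys both parts of the rule:
   no Cirrus instruction in the two slots after a branch, and no Cirrus
   instruction directly after a two-word Cirrus load/store. */
bool obeys_fix_rule(const enum insn_kind *seq, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (seq[i] == INSN_BRANCH) {
            for (size_t j = i + 1; j <= i + 2 && j < n; j++)
                if (is_cirrus(seq[j]))
                    return false;
        }
        if (seq[i] == INSN_CIRRUS_64 && i + 1 < n && is_cirrus(seq[i + 1]))
            return false;
    }
    return true;
}
```

Note that a sequence of branch, nop, Cirrus, non-Cirrus fails this check, because the Cirrus instruction still sits in the second slot after the branch.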
The two main efforts at fixing it properly are:
- Cirrus GCC, actually carried out by Nucleusys of Bulgaria. There are three versions of it, all based on gcc-4.1.2 and uclibc in a buildroot environment (unpacked here), but none of them passes the GCC testsuite, and other test programs go into infinite loops or crash. Some real-life programs compiled with it do seem to work, though. The modifications are published as a 500 megabyte tarball from which a single monolithic patch can be derived by diffing it against the mainline source releases. What a crock!
- The futaris patches for gcc-4.1.2 and 4.2.0 for the OpenEmbedded environment, provided as several dozen individual patches. They succeed in passing the GCC IEEE testsuite, and the intensive paranoia testsuite finds only one defect and two flaws (which is better than the Intel and AMD chips!). Futaris' strategy includes disabling all conditional instructions other than branch and all 64-bit integer operations.
Here is how to build a futaris-patched compiler, a summary of their merits, and some benchmarks.
Summary of fixed bugs
- CMP: Maverick sets condition codes differently from ARM/FPA/VFP
- C++ EXC: C++ exceptions do not restore FPU state when taken (maybe?)
- 1. two-word load / store
- 2. instruction with source operand
- 3. two-word load / store
- 4. two-word store
- 5. cfrshl32, cfrshl64
- 6. cfldr32, cfmv64lr may not sign-extend
- 7. accumulator updates
- 8. accumulator updates
- 9. accumulator updates
- 10. accumulator updates
- 11. two-word load / store
- 12a. cpy/add/abs/neg/cvt take denormalised operands as zero
- 12b. cpy/neg never produces negative zero
- 13. cfcvtds
Toolchain           | CMP | C++ | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12a | 12b | 13 |
Debian gcc 4.1.3    | X   |     | \ | \ | \ | f | s | / | - | - | - | -  | -  |     |     |    |
Debian gcc 4.2.3    | X   |     | \ | \ | \ | f | s | / | - | - | - | -  | -  |     |     |    |
Debian gcc 4.3.0    | X   |     | \ | \ | \ | f | s | / | - | - | - | -  | -  |     |     |    |
futaris 4.1.2       | /   |     |   |   |   | f | / | / | - | - | - | -  | -  |     | /   | /  |
futaris 4.2.0       | /   |     |   |   |   | f | s | / | - | - | - | -  | -  |     | /   | /  |
futaris-mg 4.3.2    |     |     | \ | \ | \ | f | s | / | - | - | - | -  | -  |     | /   | /  |
crunchtools 1.4.0   |     |     |   |   |   |   |   |   | - | - | - | -  | -  |     |     |    |
crunchtools 1.4.1-2 |     |     |   |   |   |   |   |   | - | - | - | -  | -  |     |     |    |
crunchtools 1.4.3   |     |     |   |   |   |   |   |   | - | - | - | -  | -  |     |     |    |
Legend
X | broken |
/ | fixed |
\ | should(*) be fixed when -mcirrus-fix-invalid-insns is given |
f | fixed when forwarding is disabled, which is the default in mainline Linux |
s | fixed when not running in serialised mode, which seems usual (test program) |
- | doesn't affect GCC's use of Maverick FPU |
(*) But the code is bad. It turns branch; cirrus; non-cirrus into branch; nop; cirrus; non-cirrus, which actually makes errata 1 and 3 more likely.
futaris-mg is my (Martin Guy's) unpublished reworking of Futaris' unpublished patches for 4.2.2 and 4.3.0 from April 2008. I include it here to help me keep track of what I still need to check.
The bugs
The bugs are:
- an ARM/Maverick condition code-setting anomaly;
- C++ exception unwinding is broken;
- 14 hardware bugs in Cirrus' silicon.
The FPU doesn't set the same condition codes as the ARM core
The ARM and its FPA and VFP math coprocessors' comparison instructions all set the condition codes the same way:

    ARM/FPA/VFP - (cmp*):
              N  Z  C  V
    A == B    0  1  1  0
    A <  B    1  0  0  0
    A >  B    0  0  1  0
    unord     0  0  1  1
but comparisons performed in the Maverick, on both floating point or integer values, set them a different way:
    MaverickCrunch - (cfcmp*):
              N  Z  C  V
    A == B    0  1  0  0
    A <  B    1  0  0  0
    A >  B    1  0  0  1
    unord     0  0  0  0
This means that not only do you have to test for the same conditions in different ways, according to which unit did the comparison, but some cases require strange sequences of conditional instructions.
Some conditions are only tested in GCC when floating point values have been compared, including all those involving "unordered", and others only when integers have been compared (the unsigned variants LTU GTU LEU and GEU - is this right?), so we can know how to interpret the condition codes for these when Maverick code generation is enabled.
    #include <stdio.h>

    int main(void)
    {
        int i;
        double d;

        for (i = 3, d = 3; i <= 5; i++, d++) {
            printf("%g", d);
            if (d < 4.0) printf(" lt");
            if (d > 4.0) printf(" gt");
            if (d <= 4.0) printf(" le");
            if (d >= 4.0) printf(" ge");
            putchar('\n');
        }
        return 0;
    }
should output:
    3 lt le
    4 le ge
    5 gt ge
but unpatched mainline GCC outputs:
    3 lt le
    4 le ge
    5 lt gt le ge
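The wrong line for 5 can be reproduced from the two condition code tables alone. The sketch below assumes that the compiler tests the ARM conditions MI, LS, GT and GE for <, <=, > and >=, as is conventional after an FPA or VFP comparison. That mapping is my assumption, not something stated on this page, and all names in the code are made up for the example.

```c
#include <stdbool.h>

/* Flags packed as N=8, Z=4, C=2, V=1, indexed by comparison outcome. */
enum outcome { OUT_EQ, OUT_LT, OUT_GT, OUT_UNORDERED };

const unsigned arm_nzcv[4]      = { 0x6, 0x8, 0x2, 0x3 };  /* ARM/FPA/VFP table */
const unsigned maverick_nzcv[4] = { 0x4, 0x8, 0x9, 0x0 };  /* MaverickCrunch table */

static bool n(unsigned f) { return (f & 8) != 0; }
static bool z(unsigned f) { return (f & 4) != 0; }
static bool c(unsigned f) { return (f & 2) != 0; }
static bool v(unsigned f) { return (f & 1) != 0; }

/* Standard ARM condition code definitions. */
bool cond_mi(unsigned f) { return n(f); }                   /* assumed for "<"  */
bool cond_ls(unsigned f) { return !c(f) || z(f); }          /* assumed for "<=" */
bool cond_gt(unsigned f) { return !z(f) && n(f) == v(f); }  /* assumed for ">"  */
bool cond_ge(unsigned f) { return n(f) == v(f); }           /* assumed for ">=" */
bool cond_vs(unsigned f) { return v(f); }
```

With the Maverick flags for A > B (N and V set), all four of MI, LS, GT and GE are true, which gives "lt gt le ge". In this model VS is true only for A > B, and GE is also true for the unordered case, which is why some relations need a different condition or a sequence of tests on Maverick.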
The best summary of the whys and wherefores is on the gcc mailing list.
Futaris-4.1.2 passes the GCC testsuite by disabling conditional execution of all instructions except for branches.
The unpublished futaris patches for 4.2.2 and 4.3.0 fix this with the exception of the "unordered" and "ltgt" cases, which are missing and cause an internal compiler error if a program mentions the isunordered() function.
They also get an ICE when compiling foo(double x, double y) { return(x >= y); } with optimisation enabled. GCC first produces BGE, which can be represented on Maverick with a sequence of tests, and then optimises that to a MOVLT; MOVGE sequence. The GE there cannot be represented by a single condition code on Maverick (it can, with GT, if we don't honor the UNORDERED case).
C++ exception unwinding is broken
With Maverick FP enabled, C++ exceptions (catch - try blocks) fail to preserve the Maverick FPU state.
This thread on the binutils mailing list explains why unwind support is needed. Additionally, this document (ARM IHI 0038A) explains the unwind process using EABI. As you can see in Sec 9.3 of that document, the EABI defines no unwinding instruction for popping MaverickCrunch registers. The above patch incorrectly calls the iWMMXt pop functions. A new Pop MV registers instruction needs to be added to the table, along with changes to Sec 7.5.
libunwind support
Unwind support should only be needed if libunwind support is enabled. At the moment, only the development branch (git) of libunwind supports ARM processors.
Joseph S. Myers says on linux-cirrus 31 Mar 2008:
iWMMXt unwind support has been in GCC since my patch <http://gcc.gnu.org/ml/gcc-patches/2007-01/msg00049.html>. That illustrates the sort of thing that needs changing to implement unwind support for a new coprocessor. Obviously you need to get the unwind specification in the official ARM EABI documents first before implementing it in GCC, and binutils will also need to support generating correct information given .save directives for the coprocessor registers. For setjmp/longjmp support in glibc you also need to get an HWCAP value allocated in the kernel.
Hardware bugs
See cirrus.com -> ARM Processors -> EP93{01,02,07,12,15} -> Errata (PDF) -> Maverick Crunch
Errata are different for silicon revisions D0, D1/E0/E1 and differences are reported with E2 although no further changes are said to have been made to the Maverick design.
The following is from the EP9302 rev E2 errata:
Definitions
1. "Instruction does not execute": An instruction appears in the coprocessor pipeline, but does not execute for one of the following reasons:
- It fails its condition code check.
- A branch is taken and it is one of the two instructions in the branch delay slot.
- An exception occurs.
- An interrupt occurs.
2. "Processor is in serialized mode": It is, if and only if both:
- At least one exception type is enabled by setting one of the following bits in the DSPSC: IXE, UFE, OFE, or IOE.
- Serialization is not specifically disabled by setting the AEXC bit in the DSPSC.
("Serialized mode" makes instruction processing less fast so that an exception can reliably be traced to the instruction that caused it. In the sample I have tested (a TS7250) it is not operating in serialised mode by these criteria because no exceptions are enabled. Source: dspsc.c)
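Definition 2 amounts to a two-part test on the status register value. In the sketch below the bit masks are passed in as parameters because this page does not give the bit positions of IXE, UFE, OFE, IOE or AEXC; take them from the User's Guide. The function name is made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* dspsc:           status word read from the DSPSC
   exc_enable_mask: OR of the IXE, UFE, OFE and IOE bits (caller-supplied)
   aexc_mask:       the AEXC bit (caller-supplied) */
bool crunch_is_serialized(uint32_t dspsc,
                          uint32_t exc_enable_mask,
                          uint32_t aexc_mask)
{
    bool any_exception_enabled  = (dspsc & exc_enable_mask) != 0;
    bool serialization_disabled = (dspsc & aexc_mask) != 0;
    return any_exception_enabled && !serialization_disabled;
}
```

With no exception enable bits set, the function returns false whatever the state of AEXC, which matches the observation about the TS7250 above.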
3. "An instruction updating an accumulator": These include all of the following:
Moves to accumulators: cfmva32, cfmva64, cfmval32, cfmvam32, cfmvah32.
Arithmetic into accumulators: cfmadd32, cfmadda32, cfmsub32, cfmsuba32.
4. "An instruction involving any two-word coprocessor load or store":
cfldr64 / cfldrd / cfstr64 / cfstrd.
1. two-word load / store
Result: register or memory corruption
Summary: a nonexecuted coprocessor instruction that is also stalled due to an internal dependency (operand is a non-cached memory read or the result of a previous incomplete coprocessor instruction) must not immediately precede a load/store 64/double.
Effects: in a 64-bit register load, the top 32 bits are loaded with junk; in a 64-bit memory store, an extra 32 bits of junk are written to memory in the word following the 64-bits that were correctly written.
An instruction may be nonexecuted because it is conditional and the condition is false, e.g.
    cfaddne c0, c1, c2
    cfldrd  c3, [r2, #0x0]
corrected by
    cfaddne c0, c1, c2
    nop
    cfldrd  c3, [r2, #0x0]
or it may be nonexecuted because it was one of the two instructions following a branch that was taken, so was loaded into the pipeline.
    target  cfldrd c3, [r2, #0x0]
            b target
            nop
            cfadd c0, c1, c2        ; though in pipeline, this does not execute
said to be corrected by:
    target  cfldrd c3, [r2, #0x0]
            b target
            cfadd c0, c1, c2        ; though in pipeline, this does not execute
            nop
because the "previous instruction that does not execute" in the two-word pipeline is the nop rather than the cfadd.
GCC does not emit conditional Maverick instructions, and the branch case would be covered by mainline's -mcirrus-fix-invalid-insns flag if that code were not broken: in fact it turns b;cfxxx;non-cirrus into b;nop;cfxxx;non-cirrus thereby causing the bug to occur!
Futaris and Cirrus remove this flag.
A test program tickles the bug in both ways on revision E1 silicon.
2. instruction with source operand
Result: bad calculation or stored value
Workaround: change instruction sequence
- Execute a coprocessor instruction whose target is one of the coprocessor general purpose register c0 through c15.
- Let the second instruction be an instruction with the same target, but not be executed.
- Execute a third instruction at least one of whose operands is the target of the previous two instructions.
For example, assume no pipeline interlocks other than the dependencies involving register c0 in the following instruction sequence:
    cfadd32   c0, c1, c2
    cfsub32ne c0, c3, c4        ; assume this does not execute
    cfstr32   c0, [r2, #0x0]
In this particular case, the incorrect value stored at the address in r2 is the previous value in c0, not the expected one resulting from the cfadd32.
Suggested fix:
    cfadd32   c0, c1, c2
    nop                         ; inserted extra instruction here
    nop                         ; inserted extra instruction here
    cfsub32ne c0, c3, c4        ; assume this does not execute
    nop                         ; inserted extra instruction here
    nop                         ; inserted extra instruction here
    nop                         ; inserted extra instruction here
    cfstr32   c0, [r2, #0x0]
The exact interval for safe operation is uncertain. Empirically, on an EP9307 REV E1, a total of three inserted nops are necessary in the above case, placed before, after or around the non-executed instruction.
The second instruction may also not be executed because it follows a branch: as in the following real-life case from liboil/simdpack/multsum.c which fails if either or both of the pair of nops is removed:
    cfsh64  mvdx0, mvdx4, #0        @ dest is C0
    cmp     r3, r8
    ble     548 <multsum_f64_unroll8+0x438>
    nop     (mov r0,r0)
    nop     (mov r0,r0)
    cfldrd  mvd0, [sl]              @ dest is C0
    cfldrd  mvd1, [r9]
    cfmuld  mvd0, mvd0, mvd1
    cfaddd  mvd0, mvd0, mvd4
    ldr     r5, [sp, #72]           @ branch target
    nop     (mov r0,r0)
    cfstrd  mvd0, [r5]              @ src is C0
A test program tickles the bug in both ways on revision E1 silicon.
GCC doesn't emit conditional Maverick instructions, and the jump case should be fixed by mainline's -mcirrus-fix-invalid-insns.
3. two-word load / store
Data in coprocessor general purpose registers or in memory may be corrupted.
- Let the first instruction be a serialized instruction that does not execute. For an instruction to be serialized, at least one of the following must be true:
- The processor must be operating in serialized mode.
- The instruction must move to or from the DSPSC (either cfmv32sc or cfmvsc32).
- Let the immediately following instruction be a two-word coprocessor load or store.
In the case of a load, only the lower 32 bits (the first word) will be loaded into the target register. For example:
    cfadd32ne c0, c1, c2        ; assume this does not execute
    cfldr64   c3, [r2, #0x0]
The lower 32 bits of c3 will correctly become what is at the memory address in r2, but the upper 32 bits of c3 will not become what is at address r2 + 0x4.
Workaround:
    cfadd32ne c0, c1, c2        ; assume this does not execute
    nop                         ; inserted extra instruction here
    cfldr64   c3, [r2, #0x0]

    ; store sequence
    cfadd32ne c4, c5, c6        ; assume this does not execute
    nop                         ; inserted extra instruction here
    cfstr64   c3, [r2, #0x0]
The real-world CPUs I've tested are not running in serialized mode, and GCC does not emit cfmv32sc or cfmvsc32.
If there are serialized ones out there, GCC does not emit conditional Maverick instructions, which just leaves the case of a Maverick instruction being in one of the two slots after a branch that is taken, which is covered by -mcirrus-fix-invalid-insns.
4. two-word store
Only in mode: forwarding, not serialised
Result: memory corruption
Summary: data operation into Crunch register followed by 64-bit store of the same Maverick register into RAM (cfstrd or cfstr64) may write rubbish
Description: When the coprocessor is not in serialized mode and forwarding is enabled, memory can be corrupted when two types of instructions appear in the instruction stream with a particular relative timing.
- Execute an instruction that is a data operation (not a move between ARM and coprocessor registers) whose destination is one of the general purpose register c0 through c15.
- Execute an instruction that is a two-word coprocessor store (either cfstr64 or cfstrd), where the destination register of the first instruction is the source of the store instruction, that is, the second instruction stores the result of the first one to memory.
- Finally, the first and second instruction must appear to the coprocessor with the correct relative timing; this timing is not simply proportional to the number of intervening instructions and is difficult to predict in general.
The result is that the lower 32 bits of the result stored to memory will be correct, but the upper 32 bits will be wrong. The value appearing in the target register will still be correct.
Workarounds:
- Operate the FPU without forwarding enabled, with a possible decrease in performance
- Operate in serialized mode by enabling at least one exception, with significantly reduced performance
- Ensure that at least seven instructions appear between the first and second instructions that cause the error
- disable 64-bit load/store instructions (e.g. replace them with two 32-bit insns)
GCC does output guilty instruction sequences. Examples from LAME:
    cfmuld  mvd1, mvd1, mvd0
    mov     r2, r7
    mov     r3, r5
    mov     r0, r8
    ldr     r1, [pc, #364]
    cfstrd  mvd1, [sp, #8]

    cfmuld  mvd1, mvd1, mvd0
    mov     r0, r8
    mov     r3, r4
    ldr     r1, [pc, #1004]
    cfstrd  mvd1, [sp]

    cfldrd  mvd0, [r8, #8]
    cfaddd  mvd0, mvd1, mvd0
    cfstrd  mvd0, [r8, #8]
but a sample system was not operating with forwarding enabled.
Code to enable forwarding (under Linux with Maverick support enabled in the kernel, the effect is limited to the process that does this):
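    void crunch_fwden(void)
    {
        /* A single asm statement keeps the steps together, and r0 is
           declared as clobbered. Maverick register 0 is overwritten too. */
        asm volatile(
            "cfmv32sc mvdx0, dspsc\n\t"     /* Read status register */
            "cfmvrdl r0, mvd0\n\t"          /* Move LSW to ARM */
            "orr r0, r0, #0x4000\n\t"       /* Set forwarding bit */
            "cfmvdlr mvd0, r0\n\t"          /* Move ARM to LSW */
            "cfmvsc32 dspsc, mvdx0"         /* Write to status register */
            : : : "r0");
    }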
This appears to be unresolved at present.
Under Linux on the sample board I use, forwarding is disabled by default. Enabling forwarding in a test program on revision E1 hardware, I have been unable to get this bug to bite.
5. cfrshl32, cfrshl64
When operating in serialized mode, cfrshl32 and cfrshl64 (logical shifts on coprocessor registers) do not work properly. The instructions shift by an unpredictable amount, but cause no other side effects.
cfrshl32 is disabled in mainline gcc, and cfrshl64 is disabled by futaris, even though real-world CPUs seem not to run in serialized mode.
6. cfldr32, cfmv64lr may not sign-extend
If an interrupt occurs during the execution of cfldr32 or cfmv64lr, the instruction may not sign extend the result correctly.
Possible workarounds include:
- Disable interrupts when executing cfldr32 or cfmv64lr instructions.
- Avoid executing these two instructions.
- Do not depend on the sign extension to occur; that is, ignore the upper word in any calculations involving data loaded using these instructions.
- Add extra code to sign extend the lower word after it is loaded by explicitly forcing the upper word to be all zeroes or all ones, as appropriate. It is possible to do this selectively in exception or interrupt handler code. If the instruction preceding the interrupted instruction can be determined, and it is a cfldr32 or cfmv64lr, the instruction may be re-executed or explicitly sign extended before returning from interrupt or exception.
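The last workaround, forcing the upper word to all zeroes or all ones, can be sketched in C on a 64-bit register image. This only illustrates the arithmetic; the function names are hypothetical, and real code would have to apply the repair to the Maverick register itself.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the upper word already equals the sign extension of the lower word. */
bool crunch_is_sign_extended(uint64_t reg)
{
    uint32_t low = (uint32_t)reg;
    uint32_t expected_high = (low & 0x80000000u) ? 0xFFFFFFFFu : 0u;
    return (uint32_t)(reg >> 32) == expected_high;
}

/* Rebuild the register from its lower word, discarding whatever the
   interrupted instruction left in the upper word. */
uint64_t crunch_resign_extend(uint64_t reg)
{
    uint32_t low = (uint32_t)reg;
    uint64_t upper = (low & 0x80000000u) ? 0xFFFFFFFF00000000ull : 0;
    return upper | low;
}
```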
Mainline GCC does not emit cfldr32, and use of cfmv64lr is disabled as buggy. In three places it is used as the first of a two-instruction sequence: in all cases the top 32 bits are either overwritten or ignored by the second instruction.
Verdict: Not a problem.
7. accumulator updates
The coprocessor can incorrectly update one of its destination accumulators even if the coprocessor instruction should not have been executed or is canceled by the ARM processor. This error can occur if the following is true:
- The first instruction must be a coprocessor compare instruction, one of cfcmp32, cfcmp64, cfcmps, and cfcmpd.
- The second instruction:
- has an accumulator as a destination.
- does not execute.
GCC does not use the accumulator instructions.
8. accumulator updates
If a data abort occurs on an instruction preceding a coprocessor data path instruction that writes to one of the accumulators, the accumulator may be updated even though the instruction was canceled.
GCC does not use the accumulator instructions.
9. accumulator updates
The coprocessor will erroneously update an accumulator if the coprocessor instruction that updates an accumulator is canceled and is followed by a coprocessor instruction that is not a data path instruction. This error will occur under the following conditions:
- The first instruction:
- must update a coprocessor accumulator.
- does not execute.
- The second instruction is not a coprocessor data path instruction. Coprocessor data path instructions include any instruction that does not move data to or from memory or to or from the ARM registers.
GCC does not use the accumulator instructions.
10. accumulator updates
An instruction that writes a result to an accumulator may cause corruption of any of the four accumulators when the coprocessor is operating in serialized mode.
GCC does not use the accumulator instructions.
11. two-word load / store
An erroneous memory transfer to or from any of the coprocessor general purpose registers c0 through c15 can occur given the following conditions are satisfied:
- The first instruction:
- is a two-word load or store (see definition 4).
- fails its condition code check.
- does not busy-wait.
- The second consecutive instruction:
- is a coprocessor load or store.
- is executed.
- does not busy-wait.
When the error occurs, the result is either coprocessor register or memory corruption. Here are several examples:
    cfstr64ne c0, [r0, #0x0]    ; assume does not execute
    cfldrs    c2, [r2, #0x8]    ; could corrupt c2!

    cfldrdge  c0, [r0, #0x0]    ; assume does not execute
    cfstrd    c2, [r2, #0x8]    ; could corrupt memory!

    cfldr64ne c0, [r0, #0x0]    ; assume does not execute
    cfldrdgt  c2, [r2, #0x8]    ; could corrupt c2!
The software workaround involves avoiding a pair of consecutive instructions with these properties. For example, if a conditional coprocessor two-word load or store appears, ensure that the following instruction is not a coprocessor load or store:
    cfstr64ne c0, [r0, #0x0]    ; assume does not execute
    nop                         ; separate two instructions
    cfldrs    c2, [r2, #0x8]    ; c2 will be ok
Another workaround is to ensure that the first instruction is not conditional:
    cfstr64 c0, [r0, #0x0]      ; executes
    cfldrs  c2, [r2, #0x8]      ; c2 will be ok
Note: If both instructions depend on the same condition code, the error should not occur, as either both or neither will execute.
GCC does not emit conditional Maverick instructions.
12a. cpy/add/abs/neg/cvt take denormalised operands as zero
Result: denorm operand forced to zero
Description: When the Crunch add/subtract unit is presented with denormalized values, it takes them as zero for that input of the calculation. The sign is unaffected. This affects values of +/- 2^-149 to 2^-126 for floats and from 2^-1074 to 2^-1022 for doubles when using the following instructions:
- Copies: cfcpys, cfcpyd
- Add/Sub: cfadds, cfaddd, cfsubs, cfsubd
- Absolute value: cfabss, cfabsd
- Negation: cfnegs, cfnegd
- Conversions: cfcvtsd, cfcvtds
Workaround: none. The Cirrus crunch softfloat library (http://arm.cirrus.com/files/index.php?path=linux%2Fcrunch%2Fsoftfloat-crunch/) has integer asm code to check for denorm values before these operations (e.g. macro isddf in ieee754-df.S).
cfcpys and cfcpyd can be replaced with cfsh64 #0, which does a bit-wise copy, and cfnegs and cfnegd are disabled by futaris so the operation is performed in pairs of ARM registers (the same could be done for cfabs). Disabling the rest would only leave multiply and compare, so we live with the imprecision.
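To see which inputs are affected, the denormal test and the flush-to-zero effect can be modelled in portable C for doubles. This is a sketch of the behaviour described above, assuming the host uses IEEE-754 doubles; it is unrelated to the isddf macro in the Cirrus library, and the function names are made up.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Raw IEEE-754 bit pattern of a double. */
uint64_t double_bits(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return bits;
}

/* Denormal: exponent field is zero and mantissa is non-zero. */
bool is_denormal_double(double d)
{
    uint64_t bits = double_bits(d);
    return ((bits >> 52) & 0x7FF) == 0 && (bits & 0xFFFFFFFFFFFFFull) != 0;
}

/* The operand as the add/subtract unit is described to see it:
   denormals become zero, keeping their sign; everything else is unchanged. */
double crunch_operand_view(double d)
{
    if (is_denormal_double(d)) {
        uint64_t bits = double_bits(d) & 0x8000000000000000ull;
        memcpy(&d, &bits, sizeof d);
    }
    return d;
}
```

A soft-float wrapper could call is_denormal_double on each operand and fall back to integer code only when it returns true.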
12b. cpy/neg never produces negative zero
When the operand is negative zero, cfcpys and cfcpyd write positive zero to the destination register, while the result should be negative zero. When the operand is positive zero, cfnegs and cfnegd write positive zero to the destination register, while the result should be negative zero.
Futaris disables the cfneg instructions in its arm-crunch-neg2 patch; it would be better to protect them under & ! HONOR_NEGATIVE_ZEROS([SD]Fmode) so that they are enabled when -ffast-math is selected.
cfcpyd can be replaced with cfsh64 #0, which does a bitwise copy. cfsh32 is no use for cfcpys as it copies the lower 32 bits instead of the upper 32, but we can just use cfsh64 to copy all 64 bits the same way.
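When negation is done on the integer side instead of with cfnegd, one straightforward way is to flip the sign bit of the bit pattern, which handles both zeroes correctly. The sketch below illustrates that idea in C, assuming IEEE-754 doubles; it is not taken from the futaris patch, and the function names are made up.

```c
#include <stdint.h>
#include <string.h>

/* Negate by flipping only bit 63, as integer code can do. */
double soft_neg(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    bits ^= 0x8000000000000000ull;
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Returns 1 if the sign bit is set, 0 otherwise. */
int sign_bit(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (int)(bits >> 63);
}
```

Here soft_neg(0.0) has its sign bit set, which is the negative zero that cfnegd fails to produce.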
13. cfcvtds never produces denorms
The operation cfcvtds, which converts a double floating point value to a single floating point value, never produces a denormalized result, even if the value can be accurately represented as such. The result underflows directly to zero. Sign is preserved properly, however.
Workaround: none
Futaris' arm-crunch-cfcvtds-disable patch disables this instruction: double to float conversion is done using the soft-float functions. Given that any resulting denormalised numbers will probably be truncated to zero by the math ops in bug 12, there may not be much point in doing this.
14. double word load/store corrupts memory
There is an extra, undocumented hardware bug, in E1 silicon at least.
When an ARM register is loaded from memory and a double-word cirrus register is immediately stored indirected through the same ARM register, memory is corrupted. For example:
    ldr     r2, [r3, #xx]
    cfstrd  mvd1, [r2]

stores the 64-bit value at an unpredictable memory location. Presumably, cfstr64 does the same. 64-bit loads in the same context also cause memory corruption.
Often, the memory corrupted will be two words of the kernel's in-core cached copy of the program itself (despite it being read-only in the MMU!), which results in the contents of the executable file appearing to have changed after the executable has been run. It can be "restored" by running something that uses as much VRAM as there is physical RAM so that the cached copy is discarded.
The solution is to insert some other instruction between the ldr and the 64-bit load or store, such as a nop.
15. Shift counts are truncated to 6-bit signed
While the manual says that the constant shift counts are limited to -32 (right shift) to +31 (left), the cfrshl64 instruction also only examines the lower 6 bits of its shift count, so a value of 48 in the ARM register results in an arithmetic right shift by 16 bits.
Mainline GCC thinks it can shift a 64-bit integer register left or right by 0 to 63 bits, so this needs working around in the two constant cases cirrus_ashldi_const and cirrus_ashiftrtdi_const and the variable one ashldi3_cirrus. For the latter, Paolo Bonzini suggests (http://gcc.gnu.org/ml/gcc/2009-03/msg00460.html):
This could already be handled by faking a 63 bit truncation and using a splitter to expand those into something like this (I only know integer ARM assembly, so I'm making this up):

    AND   R1, R0, #31
    MOV   R2, R2, SHIFT R1
    ANDS  R1, R0, #32
    MOVNE R2, R2, SHIFT #31
    MOVNE R2, R2, SHIFT #1

or

    ANDS  R1, R0, #32
    MOVNE R2, R2, SHIFT #-32
    SUB   R1, R1, R0        ; R1 = (x >= 32 ? 32 - x : -x)
    MOV   R2, R2, SHIFT R1
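The truncation and the first of the two suggested sequences can be modelled in C. This is a sketch of the arithmetic only, with made-up function names: the first function shows how a count is interpreted when only its lower 6 bits are read as a signed value, and the second builds a left shift by 0 to 63 out of steps that each stay within 0 to 31.

```c
#include <stdint.h>

/* Shift the hardware is described to perform for a given count:
   the low 6 bits read as a signed number, -32 (right) to +31 (left). */
int crunch_effective_shift(int count)
{
    int c = count & 0x3F;
    return (c & 0x20) ? c - 64 : c;
}

/* Left shift by 0..63 using only step sizes the hardware treats as
   left shifts: first by (count & 31), then by 31 and by 1 if bit 5 is set. */
uint64_t shl64_in_steps(uint64_t x, unsigned count)
{
    x <<= (count & 31);
    if (count & 32) {
        x <<= 31;
        x <<= 1;
    }
    return x;
}
```

For a count of 48 the first function returns -16, the right shift by 16 described above, while shl64_in_steps(1, 48) still yields bit 48 set.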
Note that in gcc, gas, gdb and objdump, the source and destination maverick registers for the cfrshl32 and cfrshl64 instructions are inverted, so what appears in assembler listings as cfrshl64 mvd8, mvd0, r6 has source register mvd8 and writes into mvd0.