Acorn Arcade forums: Programming: Code GCC produces that makes you cry #12684
|
Code GCC produces that makes you cry #12684 |
|
Jeffrey Lee |
Message #114920, posted by Phlamethrower at 23:22, 1/8/2010 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
Exhibit A: Why have one branch when you can have three?
000032A4 : BHI &000032E4
000032A8 : MOV R12,R4,LSL #12
000032AC : MOV R1,R12,LSR #28
000032B0 : CMP R1,#&0F ; =15
....
000032E0 : LDMDB R11,{R4,R5,R11,R13,PC}
000032E4 : BL &00001E2C
000032E8 : B &000032A8
Exhibit B: Does 15*4=60?
00003B5C : CMP R0,#&0F ; =15
00003B60 : LDREQ R2,[R4,#60]
00003B64 : LDRNE R8,[R4,R0,LSL #2]
00003B68 : MOV R3,R1,LSR #7
00003B6C : BICEQ R8,R2,#&FC000003
Exhibit C: If A then B else B:
00003C10 : CMP R0,#&0F ; =15
00003C14 : LDREQ R2,[R4,#60]
00003C18 : LDRNE R2,[R4,#60]
Exhibit D: Ever heard of AND? (Take a look at how R3 is used):
00003E14 : MOV R3,R1,LSL #16
00003E18 : AND R2,R8,#&1E ; =30
00003E1C : AND R1,R1,#&FF ; ="ÿ"
00003E20 : SUB R1,R12,R1,ROR R2
00003E24 : MOV R3,R3,LSR #28
00003E28 : CMP R3,#&0F ; =15
00003E2C : MOV R0,R4
00003E30 : BIC R2,R1,#&FC000003
00003E34 : BEQ &00003E40
00003E38 : STR R1,[R4,R3,LSL #2]
Maybe I should try compiling this with GCC 4.6 and see what that makes of it!
[Edited by Phlamethrower at 00:26, 2/8/2010] |
|
Rob Kendrick |
Message #114921, posted by nunfetishist at 14:25, 2/8/2010, in reply to message #114920 |
Today's phish is trout a la creme.
Posts: 524
|
GCC for ARM has been getting progressively more awful over the past 5 years or so. Roll on LLVM and Clang. |
|
Jason Tribbeck |
Message #114928, posted by tribbles at 22:09, 2/8/2010, in reply to message #114920 |
Captain Helix
Posts: 929
|
Exhibit A is vaguely understandable if the code in &00001e2c is in a different .o file.
Exhibit B would be easier to understand if the original source is in place.
Exhibit C is inexcusable! (Original source might be good here to see why it decided to opt for that approach)
Exhibit D is taking bits 12 to 15 of R1, and using them as the pointer offset - for a few seconds, I couldn't think of a way of doing it which would be smaller, but then I thought of this:
AND R3, R1, #&F000
...
CMP R3, #&F000
...
STR R1, [R4, R3, LSR#10]
Unfortunately, there is the possibility that R3 is used after this fragment, but I'm guessing you've not been too clip happy
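For anyone who wants to convince themselves the two sequences agree, here is a minimal C sketch of the arithmetic. The function names and the `store_word` helper are made up for illustration and are not from ArcEm; the only claim is that masking with &F000 and shifting right by 10 gives the same byte offset as extracting bits 12 to 15 and multiplying by 4.

```c
#include <assert.h>
#include <stdint.h>

/* What the GCC output does: shift left 16, then right 28, leaving bits 12-15. */
static uint32_t field_by_shifts(uint32_t r1)
{
    return (r1 << 16) >> 28;
}

/* The suggested alternative: mask in place and let the addressing mode scale it. */
static uint32_t byte_offset_by_mask(uint32_t r1)
{
    uint32_t masked = r1 & 0xF000u;   /* AND R3, R1, #&F000 */
    return masked >> 10;              /* ..., [R4, R3, LSR #10] */
}

/* Hypothetical helper: store through a byte offset into a 16-word register file. */
static void store_word(uint32_t *regs, uint32_t byte_offset, uint32_t value)
{
    assert(byte_offset % 4 == 0 && byte_offset < 16 * 4);
    regs[byte_offset / 4] = value;
}
```

The comparison against 15 becomes a comparison against &F000, which is still a valid ARM immediate, so nothing is lost there either.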
I have seen GCC produce some quite nice code, but I have also seen it produce drivel.
I've been meaning to write something on how to produce optimal ARM code, but haven't had the time... |
|
Jeffrey Lee |
Message #114929, posted by Phlamethrower at 22:55, 2/8/2010, in reply to message #114928 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
> Exhibit A is vaguely understandable if the code in &00001e2c is in a different .o file.
Nope, all part of the same function. Edit: Looks like I misread your statement. Although as it turns out, it is the same .o file (just not the same function)
The problem code is the result of a ternary operator, where one outcome (LS, not shown) is just a simple LDR while the other (HI) is the result of a function call.
> Exhibit B would be easier to understand if the original source is in place.
Ternary, again... it's the result of the LHS macro, where:
#define LHS ((LHSReg == 15) ? R15PC : (state->Reg[LHSReg]))
#define R15PC (state->Reg[15] & R15PCBITS)
#define R15PCBITS (0x03fffffcL)
> Exhibit C is inexcusable! (Original source might be good here to see why it decided to opt for that approach)
It looks like it's a slightly messed up combination of the LHS macro and the CFLAG macro (where CFLAG gets the carry flag from R15):
dest = LHS + DPImmRHS + CFLAG;
It performs the comparison for the LHS ternary operator, then either loads R15 and extracts the PC bits into R8 (not shown, obviously), or loads the general-purpose register R0 (not shown). But somewhere along the way the instruction scheduler has decided that once R15 has been loaded it's a good time to extract the carry flag, but for R15 to be loaded all the time it would need to insert an extra instruction to load it for the R0!=15 case... without spotting that it's an identical instruction to the one used to load it for R0==15.
> Exhibit D is taking bits 12 to 15 of R1, and using them as the pointer offset - for a few seconds, I couldn't think of a way of doing it which would be smaller, but then I thought of this:
> AND R3, R1, #&F000
> ...
> CMP R3, #&F000
> ...
> STR R1, [R4, R3, LSR#10]
Correct!
> Unfortunately, there is the possibility that R3 is used after this fragment, but I'm guessing you've not been too clip happy
Yes, the final STR was the only place in which R3 would be used.
In case you're wondering, I found all these examples in the binary from the latest version of my ARM optimised version of ArcEm
[Edited by Phlamethrower at 00:08, 3/8/2010] |
|
Jeffrey Lee |
Message #116553, posted by Phlamethrower at 12:02, 10/2/2011, in reply to message #114929 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
As many flaws as GCC seems to have with its optimiser, the compiler I'm currently using at work (which I probably can't name for legal reasons) seems to be about 10 million times worse.
One of these days I'm going to snap and just start writing everything in assembler. |
|
Rob Kendrick |
Message #116554, posted by nunfetishist at 12:09, 10/2/2011, in reply to message #116553 |
Today's phish is trout a la creme.
Posts: 524
|
> As many flaws as GCC seems to have with its optimiser, the compiler I'm currently using at work (which I probably can't name for legal reasons) seems to be about 10 million times worse.
Greenhills? |
|
Jeffrey Lee |
Message #116555, posted by Phlamethrower at 12:24, 10/2/2011, in reply to message #116554 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
Never even heard of it. Should I be glad? |
|
Rob Kendrick |
Message #116557, posted by nunfetishist at 14:52, 10/2/2011, in reply to message #116555 |
Today's phish is trout a la creme.
Posts: 524
|
> Never even heard of it. Should I be glad?
Very.
People who in my experience have produced dreadful compilers include: GNU, Greenhills, Borland, IAR, and loads of others. |
|
Trevor Johnson |
Message #116558, posted by trevj at 15:05, 10/2/2011, in reply to message #116557 |
Member
Posts: 660
|
> People who in my experience have produced dreadful compilers include: GNU, Greenhills, Borland, IAR, and loads of others.
What's up with Borland? I thought it'd be a good idea for me to get to grips with something like that before venturing into GCC and the Acorn/Castle/ROOL tools under RISC OS (and also disregarding the small matter of not being a programmer). |
|
Peter Howkins |
Message #116574, posted by flibble at 21:51, 10/2/2011, in reply to message #116558 |
Posts: 892
|
> I thought it'd be a good idea for me to get to grips with something like that before venturing into GCC and the Acorn/Castle/ROOL tools under RISC OS (and also disregarding the small matter of not being a programmer).
Don't spoil yourself, things like IDEs and integrated debuggers will ruin you for the basicness of RISC OS dev. |
|
qUE |
Message #116576, posted by qUE at 03:38, 11/2/2011, in reply to message #116558 |
Posts: 187
|
> People who in my experience have produced dreadful compilers include: GNU, Greenhills, Borland, IAR, and loads of others.
> What's up with Borland? I thought it'd be a good idea for me to get to grips with something like that before venturing into GCC and the Acorn/Castle/ROOL tools under RISC OS (and also disregarding the small matter of not being a programmer).
I think rjek means the machine code output from the compiler. TBH you can't expect a compiler to weave any decent code anyway it's just impossible, it's limited by the rules and preset routines it's given, a compiler won't be able to read code and then understand what's going on with it like a human can. Of course the flip side of that is human error. But from personal experience a human will always produce faster, more optimal code than a compiler. |
|
Rob Kendrick |
Message #116583, posted by nunfetishist at 11:20, 11/2/2011, in reply to message #116576 |
Today's phish is trout a la creme.
Posts: 524
|
> I think rjek means the machine code output from the compiler.
No, I mean bugginess, painfulness to use, documentation, completeness of implementation *and* the performance of the code it outputs.
> TBH you can't expect a compiler to weave any decent code anyway it's just impossible
Except this isn't true, of course. Compilers are able to produce excellent quality code (GCC on x86, LLVM etc), often better than humans can (see almost any compiler for IA64.)
> But from personal experience a human will always produce faster, more optimal code than a compiler.
I bet a compiler will produce better code than you for all of Cortex-M4, IA64, HPPA, POWER, Alpha, SPARC, ARC, and MIPS.
The quality of the instructions fed to the CPU is not the only performance issue. I've seen delightfully over-optimised assembler programs that have been slower than ones written in BBC Basic because the author spent more time tuning the code than thinking about using the right algorithm. |
|
qUE |
Message #116586, posted by qUE at 16:42, 11/2/2011, in reply to message #116583 |
Posts: 187
|
> GCC on x86
Not entirely convinced GCC produces better code than other x86 compilers, but I'll take another look, long time since I've had a proper look at GCC.
> I bet a compiler will produce better code than you for all of Cortex-M4, IA64, HPPA, POWER, Alpha, SPARC, ARC, and MIPS
All of these are on processors which are either long gone, or will be replaced by a newer range tomorrow, so why anyone would code for these is anyone's guess.
> The quality of the instructions fed to the CPU is not the only performance issue. I've seen delightfully over-optimised assembler programs that have been slower than ones written in BBC Basic because the author spent more time tuning the code than thinking about using the right algorithm.
I was implying the coder actually knew what they were doing, not just anyone. I'm surprised anything ran slower than BBC BASIC, that is a feat. |
|
Rob Kendrick |
Message #116587, posted by nunfetishist at 17:43, 11/2/2011, in reply to message #116586 |
Today's phish is trout a la creme.
Posts: 524
|
> I bet a compiler will produce better code than you for all of Cortex-M4, IA64, HPPA, POWER, Alpha, SPARC, ARC, and MIPS
> All of these are on processors which are either long gone, or will be replaced by a newer range tomorrow, so why anyone would code for these is anyone's guess.
Err... I can't tell if you're being ironic, or just painfully wrong.
> The quality of the instructions fed to the CPU is not the only performance issue. I've seen delightfully over-optimised assembler programs that have been slower than ones written in BBC Basic because the author spent more time tuning the code than thinking about using the right algorithm.
> I was implying the coder actually knew what they were doing, not just anyone. I'm surprised anything ran slower than BBC BASIC, that is a feat.
I'd say two thirds of the people I know who insist on writing huge chunks of assembler by hand know how to write fast assembly code, but have no idea how to write a fast program. |
|
Martin Bazley |
Message #121805, posted by swirlythingy at 23:53, 24/1/2013, in reply to message #116583 |
Posts: 460
|
> I've seen delightfully over-optimised assembler programs that have been slower than ones written in BBC Basic because the author spent more time tuning the code than thinking about using the right algorithm.
I didn't even know it was possible to unroll a logical shift.
        SUBS R2,R0,#&1C
        BPL |L00000420|
        SUBS R2,R0,#8
        BMI |L0000048C|
        SUBS R2,R2,#4
        MOVMI R1,#1
        BMI |L00000464|
        SUBS R2,R2,#4
        MOVMI R1,#2
        BMI |L00000464|
        SUBS R2,R2,#4
        MOVMI R1,#3
        BMI |L00000464|
        SUBS R2,R2,#4
        MOVMI R1,#4
        BMI |L00000464|
        SUBS R2,R2,#4
        MOVMI R1,#5
        BMI |L00000464|
|L00000420|
        SUBS R2,R2,#4
        MOVMI R1,#6
        BMI |L00000464|
        SUBS R2,R2,#4
        MOVMI R1,#7
        BMI |L00000464|
        SUBS R2,R2,#4
        MOVMI R1,#8
        BMI |L00000464|
        SUBS R2,R2,#4
        MOVMI R1,#9
        BMI |L00000464|
        SUBS R2,R2,#4
        MOVMI R1,#&0A
        BMI |L00000464|
        SUB R2,R2,#4
        MOV R1,#&0B
|L00000464|
I particularly like the little punchline at the beginning, where he pre-emptively jumps to the middle of the chain if necessary... one presumes to reduce the number of cycles. |
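As far as one can tell from the listing alone, the chain computes R1 as roughly (R0 >> 2) - 1, clamped to 11, for R0 >= 8 (smaller values branch off to code not shown). The C sketch below models the subtract-and-test chain as a loop and checks it against the shift; the function names are invented here, and the model ignores whatever the original does with R2 afterwards.

```c
#include <assert.h>

/* Model of the unrolled chain: subtract 4 repeatedly, stop when negative. */
static int count_by_chain(int r0)
{
    assert(r0 >= 8);              /* smaller values branch elsewhere in the original */
    int r2 = r0 - 8;              /* SUBS R2,R0,#8 */
    for (int r1 = 1; r1 <= 10; r1++) {
        r2 -= 4;                  /* SUBS R2,R2,#4 */
        if (r2 < 0)               /* MOVMI R1,#n : BMI done */
            return r1;
    }
    return 11;                    /* final MOV R1,#&0B */
}

/* The same result from a shift, a subtract and a clamp. */
static int count_by_shift(int r0)
{
    assert(r0 >= 8);
    int r1 = (r0 >> 2) - 1;
    return r1 > 11 ? 11 : r1;
}

/* Compare the two over a range of inputs. */
static void check_chain_matches_shift(void)
{
    for (int r0 = 8; r0 < 256; r0++)
        assert(count_by_chain(r0) == count_by_shift(r0));
}
```

The early BPL into the middle of the chain does not change the result in this model, since R2 holds the same value at that label on both paths.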
|
Jeffrey Lee |
Message #123830, posted by Phlamethrower at 09:47, 11/5/2016, in reply to message #114921 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
> GCC for ARM has been getting progressively more awful over the past 5 years or so. Roll on LLVM and Clang.
You say that, but...
#define LHS ((LHSReg == 15) ? R15PC : (state->Reg[LHSReg]))
#define R15PC (state->Reg[15] & R15PCBITS)
#define R15PCBITS (0x03fffffcL)
typedef struct {
    int Reg[16];
} state_t;
int func(state_t *state, int LHSReg)
{
    return LHS;
}
->
clang -O3 -marm -S test.c --target=arm-linux-gnueabi -mcpu=cortex-a8
->
cmp r1, #15
ldrne r0, [r0, r1, lsl #2]
ldreq r0, [r0, #60]
biceq r0, r0, #-67108861
bx lr
LLVM/Clang (3.7) is almost certainly better than GCC at some things, but it seems that optimising ternary operators is not one of them.
I should probably check Norcroft when I get home, hopefully that can restore my faith in humanity. |
|
Jeffrey Lee |
Message #123833, posted by Phlamethrower at 21:27, 11/5/2016, in reply to message #123830 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
> I should probably check Norcroft when I get home, hopefully that can restore my faith in humanity.
Bah!
CMP a2,#&f
LDRNE a1,[a1,a2,LSL #2]
LDREQ a2,[a1,#&3c]!
BICEQ a1,a2,#&fc000003
MOV pc,lr
Another fun way I've discovered of generating horrible code is to have a switch statement where all the cases contain identical code. Sometimes you'll get lucky and find a compiler which will share the code between all the different case statements, but I'm yet to find a compiler which is smart enough to eliminate the jump table altogether.
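To make the switch complaint concrete, here is a small C sketch of the pattern and of the by-hand collapse one would hope a compiler finds. The opcode numbering and function names are hypothetical, and no claim is made about what any particular compiler emits for the first form; the point is only that the two are equivalent, so a jump table is never needed.

```c
#include <assert.h>

/* Every case body is identical; only the default differs. */
static int handle_switch(int op, int x)
{
    switch (op) {
    case 0: return x + 1;
    case 1: return x + 1;
    case 2: return x + 1;
    case 3: return x + 1;
    default: return x;
    }
}

/* The same function with the switch reduced to a single range check. */
static int handle_collapsed(int op, int x)
{
    return (op >= 0 && op <= 3) ? x + 1 : x;
}

/* Compare the two forms over in-range and out-of-range opcodes. */
static void check_switch_equivalence(void)
{
    for (int op = -2; op <= 6; op++)
        assert(handle_switch(op, 10) == handle_collapsed(op, 10));
}
```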
[Edited by Phlamethrower at 22:28, 11/5/2016] |
|
Simon Willcocks |
Message #123873, posted by Stoppers at 13:50, 31/10/2016, in reply to message #123833 |
Member
Posts: 302
|
Still a recent discussion!
int func2(state_t *state, int LHSReg)
{
    int reg = state->Reg[LHSReg];
    if (LHSReg == 15)
        reg = reg & R15PCBITS;
    return reg;
}
arm-none-eabi-gcc -c x.c -O4
00000014 <func2>:
  14: e7900101 ldr r0, [r0, r1, lsl #2]
  18: e351000f cmp r1, #15
  1c: 03c003ff biceq r0, r0, #-67108861 ; 0xfc000003
  20: e12fff1e bx lr |
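A quick way to check that the restructured function is interchangeable with the ternary version is to run both over every register index. The sketch below reuses `state_t`, `func`, `func2` and `R15PCBITS` from the posts above (with the LHS macro expanded by hand and `const` added); the checking function and its test values are made up for illustration.

```c
#include <assert.h>

#define R15PCBITS (0x03fffffcL)

typedef struct { int Reg[16]; } state_t;

/* The original ternary form (LHS macro expanded by hand). */
static int func(const state_t *state, int LHSReg)
{
    return (LHSReg == 15) ? (int)(state->Reg[15] & R15PCBITS)
                          : state->Reg[LHSReg];
}

/* The restructured form: one unconditional load, one conditional mask. */
static int func2(const state_t *state, int LHSReg)
{
    int reg = state->Reg[LHSReg];
    if (LHSReg == 15)
        reg = (int)(reg & R15PCBITS);
    return reg;
}

/* Fill the registers with values that have the flag bits set, then compare. */
static void check_same_result(void)
{
    state_t s;
    for (int i = 0; i < 16; i++)
        s.Reg[i] = -1 - i;
    for (int i = 0; i < 16; i++)
        assert(func(&s, i) == func2(&s, i));
}
```

The `#-67108861` in the disassembly is just 0xfc000003 printed as a signed number, i.e. the complement of R15PCBITS, which is why a BIC does the job of the `&`.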
|
Jeffrey Lee |
Message #123877, posted by Phlamethrower at 19:42, 31/10/2016, in reply to message #123873 |
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff
Posts: 15100
|
Yes, if you restructure the code then you can get the compiler to produce something sensible. But the point is that the compiler should be able to spot a lot of these transformations itself. It's incredibly basic CSE which the compilers are failing at, for example:
#define THING (fudge*fudge)
#define R15PCBITS (0x03fffffcL)
int func(int LHSReg, int fudge)
{
    if (LHSReg == 15) {
        return THING & R15PCBITS;
    } else {
        return THING;
    }
}
Any sane person would calculate THING once and use the result in both halves of the if. But that appears to be beyond the capabilities of GCC.
func(int, int):
        cmp r0, #15
        muleq r1, r1, r1
        biceq r0, r1, #-67108861
        mulne r0, r1, r1
        bx lr
Unless you change THING to something simpler like 'fudge+fudge', in which case you get:
func(int, int):
        cmp r0, #15
        lsl r0, r1, #1
        biceq r0, r0, #-67108861
        bx lr
Lord knows why it would decide to optimise one but not the other.
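For reference, the transformation being asked for is nothing more exotic than this hand-hoisted version. It is a sketch of what common subexpression elimination would amount to at source level, not a claim about GCC internals; `func_hoisted` is a name invented here.

```c
#include <assert.h>

#define THING (fudge * fudge)
#define R15PCBITS (0x03fffffcL)

/* As posted: THING appears in both arms of the if. */
static int func(int LHSReg, int fudge)
{
    if (LHSReg == 15) {
        return (int)(THING & R15PCBITS);
    } else {
        return THING;
    }
}

/* Hoisted by hand: compute THING once, then mask conditionally. */
static int func_hoisted(int LHSReg, int fudge)
{
    int thing = fudge * fudge;
    return (LHSReg == 15) ? (int)(thing & R15PCBITS) : thing;
}

/* Compare both forms over a small range that cannot overflow int. */
static void check_hoisting(void)
{
    for (int reg = 0; reg < 16; reg++)
        for (int fudge = -100; fudge <= 100; fudge++)
            assert(func(reg, fudge) == func_hoisted(reg, fudge));
}
```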
http://gcc.godbolt.org/ is a fun place to go if you feel like breaking compilers. Just remember -marm when testing ARM output otherwise it may be being crippled by thumb compatibility. |
|
Rob Kendrick |
Message #123878, posted by nunfetishist at 11:06, 1/11/2016, in reply to message #123877 |
Today's phish is trout a la creme.
Posts: 524
|
> Any sane person would calculate THING once and use the result in both halves of the if. But that appears to be beyond the capabilities of GCC.
> func(int, int):
>         cmp r0, #15
>         muleq r1, r1, r1
>         biceq r0, r1, #-67108861
>         mulne r0, r1, r1
>         bx lr
To be fair, this code does only calculate it once. Still, there's a cycle that could be saved. |
|
Adrian Lees |
Message #123885, posted by adrianl at 18:36, 9/11/2016, in reply to message #123878 |
Member
Posts: 1637
|
> To be fair, this code does only calculate it once. Still, there's a cycle that could be saved.
<uber pedant> Two cycles, but you have to transpose the input arguments such that you can use:
mul r0,r0,r0
cmp r1,#15
biceq r0,r0,#-67108861
bx lr
Promoting the mul then mitigates the result-use dependency for some implementations, although also note that the result is unpredictable for pre-v6 architectures, as is the 'muleq r1,r1,r1' since Rd == Rm. </uber pedant> |
|
Adrian Lees |
Message #124279, posted by adrianl at 07:12, 29/4/2018, in reply to message #123885 |
Member
Posts: 1637
|
Compilers, even if they produced great code, could only be expected to do so much anyway, given that they must adhere to the rules of the language and have no knowledge of the application. I am more concerned by what the programmers produce.
Aside from that general point, I collect micro-level stupidity from human programmers. An old favourite was of the form:
unsigned n;
...
a = (int)pow(2, n);
which happened sometimes to crash on the target system, deep within the FP library implementation, locking up the embedded processor on the PCI card, and ultimately the entire industrial PC. Funnily enough, I never debugged the FP library to work out why.
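The integer-only way to get a power of two, which avoids the FP library entirely, is a shift. A minimal sketch, with `pow2_u32` being a name made up here and the range check an assumption about what the caller wants:

```c
#include <assert.h>
#include <stdint.h>

/* 2 to the power n, for n in 0..31, using only an integer shift. */
static uint32_t pow2_u32(unsigned n)
{
    assert(n < 32);               /* shifting by 32 or more is undefined */
    return UINT32_C(1) << n;
}
```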
I have, as of last night, another for my collection:
unsigned b = ..
uint32_t val = readl(...);
val &= ~(1 << b);
val |= (1 << b);
writel(val, );
You know, 'just in case...'
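Spelled out, the clear is redundant because the OR that follows sets the bit regardless of its previous state. The sketch below drops the `readl`/`writel` register access and works on plain values; the function names are hypothetical, and it uses an unsigned constant for the shift, which the original snippet did not.

```c
#include <assert.h>
#include <stdint.h>

/* The pattern as found: clear bit b, then immediately set it again. */
static uint32_t clear_then_set(uint32_t val, unsigned b)
{
    assert(b < 32);
    val &= ~(UINT32_C(1) << b);
    val |= (UINT32_C(1) << b);
    return val;
}

/* What it reduces to: just set the bit. */
static uint32_t set_only(uint32_t val, unsigned b)
{
    assert(b < 32);
    return val | (UINT32_C(1) << b);
}
```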
Anyone got more for my collection? |
|
Simon Willcocks |
Message #124364, posted by Stoppers at 18:03, 29/10/2018, in reply to message #114920 |
Member
Posts: 302
|
Beware of inline assembler (that uses loops)!
asm volatile ( "\n0:"
               "\n\tldxr %[result], [%[first]]"
               "\n\tadd %[tmp], %[result], %[size]"
               "\n\tstxr %w[flag], %[tmp], [%[first]]"
               "\n\tcbnz %w[flag], 0b"
             : [flag] "=r" (flag), [tmp] "=r" (tmp), [result] "=r" (result)
             : [size] "r" (count * 4096), [first] "r" (&core_0_storage->next_free_page) );
GCC: I'll just use the same register for tmp and first, what could possibly go wrong?
301c: f9448001 ldr x1, [x0, #2304]
3020: 91004021 add x1, x1, #0x10
3024: c85f7c20 ldxr x0, [x1]
3028: 52820002 mov w2, #0x1000 // #4096
302c: f9000420 str x0, [x1, #8]
3030: 8b020001 add x1, x0, x2
3034: c8027c21 stxr w2, x1, [x1]
3038: 35ffff62 cbnz w2, 3024 <allocate_pages.constprop.0+0xc>
FWIW, the trick is to take the loop out of the assembler.
register u64 tmp, blocked;
register void *result;
do {
    asm volatile ( "\n\tldxr %[result], [%[first]]"
                   "\n\tadd %[tmp], %[result], %[size]"
                   "\n\tstxr %w[flag], %[tmp], [%[first]]"
                 : [flag] "=r" (blocked), [tmp] "=r" (tmp), [result] "=r" (result)
                 : [size] "r" (count * 4096), [first] "r" (&core_0_storage->next_free_page) );
} while (blocked);
return result;
gcc-6.3.0 |
|
Charles Baylis |
Message #124373, posted by cbcbcb at 13:16, 2/11/2018, in reply to message #124364 |
Member
Posts: 3
|
> GCC: I'll just use the same register for tmp and first, what could possibly go wrong?
This happens because your asm constraints are wrong (and are still wrong in your second version)
If your inline asm performs register writes before register reads, then those outputs must be declared as early clobber, ie "=&r" (var)
[Edited by cbcbcb at 13:17, 2/11/2018] |
|
Simon Willcocks |
Message #124378, posted by Stoppers at 13:33, 3/11/2018, in reply to message #124373 |
Member
Posts: 302
|
> This happens because your asm constraints are wrong (and are still wrong in your second version)
Ah, thanks. Proof that the best way to get an answer on the internet isn't to ask the question, but to post the wrong answer!
I tried every combination I could find in examples, and the documentation, especially about "early-clobber", isn't as clear as it could be... https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html
> If your inline asm performs register writes before register reads, then those outputs must be declared as early clobber, ie "=&r" (var)
OK, that seems to work, but I still don't understand what it's doing, exactly.
For example, if I try:
asm volatile ( "\n0:"
               "\n\tldxr %[result], [%[first]]"
               "\n\tadd %[tmp], %[result], %[size]"
               "\n\tstxr %w[flag], %[tmp], [%[first]]"
               "\n\tcbnz %w[flag], 0b"
             : [flag] "=r" (flag), [tmp] "=&r" (tmp), [result] "=&r" (result)
             : [size] "r" (count * 4096), [first] "r" (&core_0_storage->next_free_page) );
(i.e. without the =&r for flag) it uses (part of) the same register for flag as for the [first], which is an input that I still want to use.
305c: f9409801 ldr x1, [x0, #304]
3060: 91004021 add x1, x1, #0x10
3064: c85f7c20 ldxr x0, [x1]
3068: 8b020003 add x3, x0, x2
306c: c8017c23 stxr w1, x3, [x1] <<< Here; I still want to use x1 (first)
3070: 35ffffa1 cbnz w1, 3064 <allocate_pages.constprop.0+0x10>
|
Charles Baylis |
Message #124392, posted by cbcbcb at 19:04, 12/11/2018, in reply to message #124378 |
Member
Posts: 3
|
> (i.e. without the =&r for flag) it uses (part of) the same register for flag as for the [first], which is an input that I still want to use.
Your code writes to flag, then, if you go round the loop, read first. This is the same write-before-read situation which necessitates the early clobber flag. |
|
Simon Willcocks |
Message #124393, posted by Stoppers at 21:51, 12/11/2018, in reply to message #124392 |
Member
Posts: 302
|
> (i.e. without the =&r for flag) it uses (part of) the same register for flag as for the [first], which is an input that I still want to use.
> Your code writes to flag, then, if you go round the loop, read first. This is the same write-before-read situation which necessitates the early clobber flag.
OK, I think I understand what is going on, and why my non-loopy version was also wrong.
asm volatile ( "\n ldxr %[result], [%[first]]"
               "\n add %[tmp], %[result], %[size]"
               "\n stxr %w[flag], %[tmp], [%[first]]"
             : [flag] "=r" (blocked), [tmp] "=r" (tmp), [result] "=r" (result)
             : [size] "r" (count * 4096), [first] "r" (&core_0_storage->next_free_page) );
The three outputs, "flag", "tmp", and "result" will have to be allocated separate registers, otherwise the information won't get out of the code. The two inputs "size" and "first" will likewise have to be separate, otherwise the information will not get into the code.
Am I right in thinking that the compiler assumes that once it's used an input, it can safely re-use that register for one of the outputs? That would mean that, in the above example, I was just lucky that it didn't use the register it chose for "first" for one of the output registers, which meant that the input value of "first" was still available for the third line.
Call me dumb, if you like, but it doesn't seem to me that "clobbering" is what the output registers are doing, and that it would be easier to understand if you could mark an input as "single-use" (or "used-once"), allowing re-use of the register, and default to using a separate register for each input and output.
Thanks very much for breaking your long silence on here and helping me understand it better!
[Edited by Stoppers at 07:31, 13/11/2018] |
|
Charles Baylis |
Message #124395, posted by cbcbcb at 15:02, 16/11/2018, in reply to message #124393 |
Member
Posts: 3
|
> Am I right in thinking that the compiler assumes that once it's used an input, it can safely re-use that register for one of the outputs? That would mean that, in the above example, I was just lucky that it didn't use the register it chose for "first" for one of the output registers, which meant that the input value of "first" was still available for the third line.
The compiler does not analyse the text in the asm block at all. The compiler models the inline asm as a sequence of
1. read all inputs
2. do something
3. write all outputs
So the writes of the output registers overwrite ("clobber") the previous content of the registers late in the process. Where there is a write before a read, it overwrites the previous value "early" in the process, hence "early clobber".
You're right that the non-loopy version was just lucky.
Looking at the constraints you're using again, the use of "first" is also wrong. You need to tell the compiler that you are accessing memory, or it could move a store to that address below the asm. So your constraint should be:
[first] "m" (core_0_storage->next_free_page)
and you should refer to it as %[first] (ie without the [] around it)
You should also get better code as the compiler won't have to generate an ADD instruction to implement &core_0_storage->next_free_page. |
|
Simon Willcocks |
Message #124396, posted by Stoppers at 17:19, 18/11/2018, in reply to message #124395 |
Member
Posts: 302
|
> Am I right in thinking that the compiler assumes that once it's used an input, it can safely re-use that register for one of the outputs? That would mean that, in the above example, I was just lucky that it didn't use the register it chose for "first" for one of the output registers, which meant that the input value of "first" was still available for the third line.
> The compiler does not analyse the text in the asm block at all. The compiler models the inline asm as a sequence of
> 1. read all inputs
> 2. do something
> 3. write all outputs
> So the writes of the output registers overwrite ("clobber") the previous content of the registers late in the process. Where there is a write before a read, it overwrites the previous value "early" in the process, hence "early clobber".
Not trying to be funny, but is this written down somewhere? It's a nice, simple, explanation.
The question that springs to mind is: how can you know where you can safely read the inputs into before doing something with them? (I had high hopes for my "tmp" pattern, but now I'm not so sure. Maybe labelled as "+r"?)
> You're right that the non-loopy version was just lucky.
> Looking at the constraints you're using again, the use of "first" is also wrong. You need to tell the compiler that you are accessing memory, or it could move a store to that address below the asm. So your constraint should be:
> [first] "m" (core_0_storage->next_free_page)
> and you should refer to it as %[first] (ie without the [] around it)
> You should also get better code as the compiler won't have to generate an ADD instruction to implement &core_0_storage->next_free_page.
I was expecting the "volatile" keyword to take care of the instruction ordering. I hope it does, because gcc (6.3) produces invalid assembler for the ldxr and stxr instructions, when I mark "first" that way:
/tmp/ccPRcVRj.s: Assembler messages:
/tmp/ccPRcVRj.s:1348: Error: the optional immediate offset can only be 0 at operand 2 -- `ldxr x0,[x2,16]'
/tmp/ccPRcVRj.s:1350: Error: the optional immediate offset can only be 0 at operand 3 -- `stxr w3,x4,[x2,16]'
I'll have to make do with a "memory" clobber. |
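Pulling the advice from this exchange together, a looped version with every output marked early-clobber and a "memory" clobber might look like the sketch below. It is an illustration under assumptions, not the code from the original project: `fetch_add_u64` and its parameters are invented names, the AArch64 path is only compiled on that architecture with a GNU-compatible compiler, and the fallback uses the GCC/Clang `__atomic_fetch_add` builtin so the function can still be exercised elsewhere.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: atomically add size to *next and return the old value. */
static uint64_t fetch_add_u64(uint64_t *next, uint64_t size)
{
    assert(next != 0);
#if defined(__aarch64__) && defined(__GNUC__)
    uint64_t result, tmp;
    uint32_t flag;
    __asm__ __volatile__ (
        "0:\n\t"
        "ldxr %[result], [%[first]]\n\t"
        "add  %[tmp], %[result], %[size]\n\t"
        "stxr %w[flag], %[tmp], [%[first]]\n\t"
        "cbnz %w[flag], 0b"
        /* "=&r": each output is written while inputs are still needed,
           so none of them may share a register with an input. */
        : [flag] "=&r" (flag), [tmp] "=&r" (tmp), [result] "=&r" (result)
        : [size] "r" (size), [first] "r" (next)
        /* "memory": the asm reads and writes *next behind the compiler's back. */
        : "memory");
    return result;
#else
    /* Fallback for other targets (GCC/Clang builtin). */
    return __atomic_fetch_add(next, size, __ATOMIC_RELAXED);
#endif
}
```

A call such as `fetch_add_u64(&next_free_page, count * 4096)` would then return the previous value of the counter, which is what the ldxr/stxr loop in the thread is after.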
|
Simon Willcocks |
Message #124443, posted by Stoppers at 16:41, 13/2/2019, in reply to message #114920 |
Member
Posts: 302
|
This is some code I'm working on, and what it compiled to, without any warnings from gcc:
VirtualMemoryBlock *vma = allocate_contiguous_memory( &all_memory, allocate, 12 * sizeof( VirtualMemoryBlock ) );
vma[0].start_page = 0;
 284: d2800001 mov x1, #0x0 // #0
 288: f9400020 ldr x0, [x1]
 28c: 92689c00 and x0, x0, #0xffffffffff000000
 290: f9000020 str x0, [x1]
 294: d4207d00 brk #0x3e8
No function call, nothing. It turns out that some part of the code simplifies to division by zero (where?!), which is what that brk instruction reports. (Those first two instructions are interesting, as well!)
An Arnd Bergmann reduced the test case to a trivial division by zero:
static inline int return0(void) { return 0; }
int provoke_div0_warning(void) { return 1 / return0(); }
which gets turned into a single trapping instruction
(aarch64)
0000000000000018 <provoke_div0_warning>:
  18: d4207d00 brk #0x3e8
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=79828
Apparently, there's too high a chance of false positives if a warning were introduced...
I can only find one division in the whole file, and it's by the sizeof a structure. |
|