log in | register | forums
Show:
Go:
Forums
Username:

Password:

User accounts
Register new account
Forgot password
Forum stats
List of members
Search the forums

Advanced search
Recent discussions
- Elsear brings super-fast Networking to Risc PC/A7000/A7000+ (News:)
- Latest hardware upgrade from RISCOSbits (News:)
- RISC OS London Show Report 2024 (News:1)
- Announcing the TIB 2024 Advent Calendar (News:1)
- Code GCC produces that makes you cry #12684 (Prog:39)
- RISCOSbits releases a new laptop solution (News:)
- Rougol November 2024 meeting on monday (News:)
- Drag'n'Drop 14i1 edition reviewed (News:)
- WROCC November 2024 talk o...ay - Andrew Rawnsley (ROD) (News:2)
- October 2024 News Summary (News:3)
Latest postings RSS Feeds
RSS 2.0 | 1.0 | 0.9
Atom 0.3
Misc RDF | CDF
 
View on Mastodon
@www.iconbar.com@rss-parrot.net
Site Search
 
Article archives
Acorn Arcade forums: Programming: Test your optimization skills pt.1
 
  Test your optimization skills pt.1
  sirbod (19:02 24/10/2012)
  Phlamethrower (19:22 24/10/2012)
    nunfetishist (19:31 24/10/2012)
      sirbod (20:33 24/10/2012)
        nunfetishist (22:02 24/10/2012)
          Phlamethrower (00:28 25/10/2012)
            sirbod (08:59 25/10/2012)
              Phlamethrower (11:57 25/10/2012)
                sirbod (12:11 25/10/2012)
                  Phlamethrower (12:20 25/10/2012)
                    sirbod (12:29 25/10/2012)
                      Phlamethrower (14:19 25/10/2012)
    sirbod (20:33 24/10/2012)
      Phlamethrower (21:28 24/10/2012)
 
Jon Abbott Message #121308, posted by sirbod at 19:02, 24/10/2012
Member
Posts: 563
This is just a bit of fun, but interesting if you're into ARM code. I was looking at the MFM code in ADFFS and thought, what an interesting problem with many solutions.

PROBLEM: Reduce a 32 bit value down to just it's even bits in as few cycles and registers as possible. ie 32 bits > 16 bits, containing input bits 0, 3, 5 ... in output bits 0, 1, 2 ...

INPUT: R0 = value
OUTPUT: R0 = reduced value

eg. an input of AAAAAAAA becomes 0 output and 55555555 becomes FFFF

Here are some examples to solving the problem:

32 instructions, TST method
AND R1, R0, #1
TST R0, #1<<2
ORRNE R1, R1, #1<<1
TST R0, #1<<4
ORRNE R1, R1, #1<<2
TST R0, #1<<6
ORRNE R1, R1, #1<<3
TST R0, #1<<8
ORRNE R1, R1, #1<<4
TST R0, #1<<10
ORRNE R1, R1, #1<<5
TST R0, #1<<12
ORRNE R1, R1, #1<<6
TST R0, #1<<14
ORRNE R1, R1, #1<<7
TST R0, #1<<16
ORRNE R1, R1, #1<<8
TST R0, #1<<18
ORRNE R1, R1, #1<<9
TST R0, #1<<20
ORRNE R1, R1, #1<<10
TST R0, #1<<22
ORRNE R1, R1, #1<<11
TST R0, #1<<24
ORRNE R1, R1, #1<<12
TST R0, #1<<26
ORRNE R1, R1, #1<<13
TST R0, #1<<28
ORRNE R1, R1, #1<<14
TST R0, #1<<30
ORRNE R1, R1, #1<<15
MOV R0, R1


28 instructions, multi-flag method
MOV R1, #&AA
ORR R1, R1, R1, LSL #8
ORR R1, R1, R1, LSL #16
BIC R0, R0, R1
ORR R1, R0, R0, LSL #1

MOV R0, R1, LSR #1
AND R0, R0, #%11
MOVS R1, R1, LSL #2
ORRCS R0, R0, #1<<15
ORRMI R0, R0, #1<<14

MOVS R1, R1, LSL #4
ORRCS R0, R0, #1<<13
ORRMI R0, R0, #1<<12

MOVS R1, R1, LSL #4
ORRCS R0, R0, #1<<11
ORRMI R0, R0, #1<<10

MOVS R1, R1, LSL #4
ORRCS R0, R0, #1<<9
ORRMI R0, R0, #1<<8

MOVS R1, R1, LSL #4
ORRCS R0, R0, #1<<7
ORRMI R0, R0, #1<<6

MOVS R1, R1, LSL #4
ORRCS R0, R0, #1<<5
ORRMI R0, R0, #1<<4

MOVS R1, R1, LSL #4
ORRCS R0, R0, #1<<3
ORRMI R0, R0, #1<<2


Current winner: Jeffrey Lee with 14 instructions
MOV R3, #&0F
ORR R3, R3, R3, LSL #8
ORR R3, R3, R3, LSL #16
EOR R2,R3,R3,LSL #2
EOR R1,R2,R2,LSL #1
AND R0,R0,R1
ORR R0,R0,R0,LSR #1
AND R0,R0,R2
ORR R0,R0,R0,LSR #2
AND R0,R0,R3
ORR R0,R0,R0,LSR #4
AND R1,R0,#&FF0000
AND R0,R0,#&FF
ORR R0,R0,R1,LSR #8


[Edited by sirbod at 13:12, 25/10/2012]
  ^[ Log in to reply ]
 
Jeffrey Lee Message #121309, posted by Phlamethrower at 19:22, 24/10/2012, in reply to message #121308
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Yay! I like these things.

4 instructions (plus one 64K lookup table wink)
LDRB R0,[lut,R1,LSR #16]
MOV R1,R1,LSL #16
LDRB R1,[lut,R1,LSR #16]
ORR R0,R1,R0,LSL #8
  ^[ Log in to reply ]
 
Rob Kendrick Message #121310, posted by nunfetishist at 19:31, 24/10/2012, in reply to message #121309
nunfetishist
Today's phish is trout a la creme.

Posts: 524
My C compiler spat out this rather naïve, but brief, code:
mov ip, #0
str lr, [sp, #-4]!
mov r2, #31
mov r1, ip
mov lr, #1
.L2:
ands r3, r0, lr, asl r2
movne r3, #0
moveq r3, #1
orr r3, ip, r3, asl r1
sub r2, r2, #2
mov r3, r3, asl #16
cmn r2, #1
mov ip, r3, lsr #16
add r1, r1, #1
bne .L2
mov r0, ip
ldr pc, [sp], #4
(Edit to remove code tags that TIB's bbcode doesn't support.)

[Edited by nunfetishist at 23:02, 24/10/2012]
  ^[ Log in to reply ]
 
Jon Abbott Message #121311, posted by sirbod at 20:33, 24/10/2012, in reply to message #121309
Member
Posts: 563
4 instructions (plus one 64K lookup table wink)
LDRB R0,[lut,R1,LSR #16]
MOV R1,R1,LSL #16
LDRB R1,[lut,R1,LSR #16]
ORR R0,R1,R0,LSL #8
4 out of 5 - you drop a point for the 64K table!
  ^[ Log in to reply ]
 
Jon Abbott Message #121312, posted by sirbod at 20:33, 24/10/2012, in reply to message #121310
Member
Posts: 563
My C compiler
Instant fail! wink
  ^[ Log in to reply ]
 
Jeffrey Lee Message #121315, posted by Phlamethrower at 21:28, 24/10/2012, in reply to message #121311
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
4 instructions (plus one 64K lookup table wink)
LDRB R0,[lut,R1,LSR #16]
MOV R1,R1,LSL #16
LDRB R1,[lut,R1,LSR #16]
ORR R0,R1,R0,LSL #8
4 out of 5 - you drop a point for the 64K table!
How about 5 instructions and a 32K table? wink (i.e. just mask off the top bit of each halfword since you're not interested in them)
  ^[ Log in to reply ]
 
Rob Kendrick Message #121317, posted by nunfetishist at 22:02, 24/10/2012, in reply to message #121312
nunfetishist
Today's phish is trout a la creme.

Posts: 524
My C compiler
Instant fail! wink
Err, why? The code seems quite tight to me, self-explanatory, takes advantage of branch prediction on newer CPUs, and is quite tiny. Also, if targeting ARMv6 or later, you can use the sign-extend instructions to make it tighter still.
  ^[ Log in to reply ]
 
Jeffrey Lee Message #121321, posted by Phlamethrower at 00:28, 25/10/2012, in reply to message #121317
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Mask-and-shift approach, 8 instructions plus 3 registers needed for constants:

AND R0,R0,#&55555555
ORR R0,R0,R0,LSR #1
AND R0,R0,#&33333333
ORR R0,R0,R0,LSR #2
AND R0,R0,#&0F0F0F0F
ORR R0,R0,R0,LSR #4
BIC R0,R0,#&0000FF00
ORR R0,R0,R0,LSR #8


Similar routine in NEON, 1.75 instructions (7 instructions long, but 4 words processed), 3 Q regs needed for constants:

VAND Q0,Q0,#&55555555555555555555555555555555
VSRA.U8 Q0,Q0,#1
VAND Q0,Q0,#&33333333333333333333333333333333
VSRA.U8 Q0,Q0,#2
VAND Q0,Q0,#&0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F0F
VSRA.U16 Q0,Q0,#4
VUZP.8 Q0,Q0
  ^[ Log in to reply ]
 
Jon Abbott Message #121322, posted by sirbod at 08:59, 25/10/2012, in reply to message #121321
Member
Posts: 563
Mask-and-shift approach, 8 instructions plus 3 registers needed for constants:

AND R0,R0,#&55555555
ORR R0,R0,R0,LSR #1
AND R0,R0,#&33333333
ORR R0,R0,R0,LSR #2
AND R0,R0,#&0F0F0F0F
ORR R0,R0,R0,LSR #4
BIC R0,R0,#&0000FF00
ORR R0,R0,R0,LSR #8
Impressive, 18 instructions in total.

EDIT: Correction, its 19 instructions as you need to clear two sets of bits at the end.

[Edited by sirbod at 11:22, 25/10/2012]
  ^[ Log in to reply ]
 
Jeffrey Lee Message #121325, posted by Phlamethrower at 11:57, 25/10/2012, in reply to message #121322
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
19 instructions? Really?

LDR R1,=&55555555
LDR R2,=&33333333
LDR R3,=&0F0F0F0F
AND R0,R0,R1
ORR R0,R0,R0,LSR #1
AND R0,R0,R2
ORR R0,R0,R0,LSR #2
AND R0,R0,R3
ORR R0,R0,R0,LSR #4
AND R1,R0,#&FF0000
AND R0,R0,#&FF
ORR R0,R0,R1,LSR #8
MOV PC,LR


That's 13 instructions for an APCS compliant function with (up to) 3 extra words taken from the literal pool.

Or if you really care about your literal pool size:

LDR R3,=&0F0F0F0F
EOR R2,R3,R3,LSL #2
EOR R1,R2,R2,LSL #1


[Edited by Phlamethrower at 12:59, 25/10/2012]
  ^[ Log in to reply ]
 
Jon Abbott Message #121326, posted by sirbod at 12:11, 25/10/2012, in reply to message #121325
Member
Posts: 563
19 instructions? Really?

LDR R1,=&55555555
LDR R2,=&33333333
LDR R3,=&0F0F0F0F
AND R0,R0,R1
ORR R0,R0,R0,LSR #1
AND R0,R0,R2
ORR R0,R0,R0,LSR #2
AND R0,R0,R3
ORR R0,R0,R0,LSR #4
AND R1,R0,#&FF0000
AND R0,R0,#&FF
ORR R0,R0,R1,LSR #8
MOV PC,LR


That's 13 instructions for an APCS compliant function with (up to) 3 extra words taken from the literal pool.

Or if you really care about your literal pool size:

LDR R3,=&0F0F0F0F
EOR R2,R3,R3,LSL #2
EOR R1,R2,R2,LSL #1
The LDR's are an extra cache miss, and as we're looking for the least cycles, its quicker to code the values. Using the additional optimizations above you get:

MOV R3, #&0F
ORR R3, R3, R3, LSL #8
ORR R3, R3, R3, LSL #16
EOR R2,R3,R3,LSL #2
EOR R1,R2,R2,LSL #1
AND R0,R0,R1
ORR R0,R0,R0,LSR #1
AND R0,R0,R2
ORR R0,R0,R0,LSR #2
AND R0,R0,R3
ORR R0,R0,R0,LSR #4
AND R1,R0,#&FF0000
AND R0,R0,#&FF
ORR R0,R0,R1,LSR #8


...14 cycles, this has to be the winner!
  ^[ Log in to reply ]
 
Jeffrey Lee Message #121327, posted by Phlamethrower at 12:20, 25/10/2012, in reply to message #121326
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
I'm guessing this code must only need to work on one word of data at a time? You'd obviously get better performance if you could run it on a batch of data to avoid recalculating the constants all the time, and if you interleaved the processing of one word with another to avoid pipeline stalls due to RAW hazards (and to allow dual issue on A8/A9).
  ^[ Log in to reply ]
 
Jon Abbott Message #121328, posted by sirbod at 12:29, 25/10/2012, in reply to message #121327
Member
Posts: 563
I'm guessing this code must only need to work on one word of data at a time? You'd obviously get better performance if you could run it on a batch of data to avoid recalculating the constants all the time, and if you interleaved the processing of one word with another to avoid pipeline stalls due to RAW hazards (and to allow dual issue on A8/A9).
The problem was for a single word, but clearly it could be optimized more for batches of data.

May I use your earlier example in ADFFS?
  ^[ Log in to reply ]
 
Jeffrey Lee Message #121330, posted by Phlamethrower at 14:19, 25/10/2012, in reply to message #121328
PhlamethrowerHot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
May I use your earlier example in ADFFS?
Sure. It's hardly rocket science!
  ^[ Log in to reply ]
 

Acorn Arcade forums: Programming: Test your optimization skills pt.1