Acorn Arcade forums: Programming: Test your optimization skills pt.2
 
  Test your optimization skills pt.2
Jon Abbott Message #122859, posted by sirbod at 15:21, 27/11/2013
Member
Posts: 563
The aim of this task is to convert a 4-bit mode to an 8-bit mode in as few cycles as possible using ARMv6 instructions:

Assumptions you can make:

R0 - 4-bit mode buffer pointer
R1 - 8-bit mode buffer pointer
R2 - pixels remaining
R3 - R12 are free to use

The screen mode you're writing to is indexed 256 colour and you can preset the palette entries above 16 as you please.

If the word read from [R0] is &87654321 (one hex digit per pixel), the output should be two words written to [R1]: &04030201 &08070605

You can process the data one byte / word or multiple words at a time. You may also interleave the instructions for optimization purposes where appropriate.
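
For checking a candidate routine against the expected output, a plain C reference model can help. This is only a sketch and not part of the challenge: the function names are made up for illustration, and it assumes pixel 0 is the least significant nibble of each source word and that the pixel count is a multiple of 8, as the loop counters in the examples imply.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference model (hypothetical helper, not from the thread): expand one
 * source word holding eight 4-bit pixels into two words holding eight
 * 8-bit pixels. Pixel 0 is the least significant nibble of src. */
void expand_word_4to8(uint32_t src, uint32_t out[2])
{
    out[0] = 0;
    out[1] = 0;
    for (int pixel = 0; pixel < 8; pixel++) {
        uint32_t nibble = (src >> (4 * pixel)) & 0xFu;
        /* Pixels 0-3 land in out[0], pixels 4-7 in out[1], one byte each. */
        out[pixel / 4] |= nibble << (8 * (pixel % 4));
    }
}

/* Expand a whole buffer. pixel_count is assumed to be a multiple of 8,
 * matching the "SUBS R2, R2, #8" loop counter used in the examples. */
void expand_buffer_4to8(const uint32_t *src, uint32_t *dst, size_t pixel_count)
{
    for (size_t done = 0; done < pixel_count; done += 8) {
        expand_word_4to8(*src++, dst);
        dst += 2;
    }
}
```

Feeding the same source buffer to this model and to an assembler routine, then comparing the two outputs word by word, is one way to catch ordering mistakes before worrying about cycle counts.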


Example 1: Rotate method

.L1
LDR R7, [R0], #4
MOV R9, #0
AND R10, R7, #&F0000000
MOV R12, #32 - 4
.L2
MOV R11, R7, LSR R12
AND R11, R11, #&F
MOV R10, R10, LSL #8
ORR R10, R10, R9, LSR #32 - 8
ORR R9, R11, R9, LSL #8

SUBS R12, R12, #4
BPL L2

STMIA R1!, {R9-R10}

SUBS R2, R2, #8
BNE L1



Example 2: Full palette use

This assumes the palette entries &10 / &20 / &30 etc map to the logical colours 1 / 2 / 3 etc.

Where [R0] = &87654321 the output to [R1] will be &04300210 &08700650

MOV R12, #&FF00
ORR R12, R12, R12, LSL #16

.L1
LDR R4,[R0], #4
AND R3, R4, R12, LSR #8
AND R11, R4, #&FF00
AND R4, R4, R12
MOV R3, R3, LSL #4
MOV R4, R4, LSR #4
EOR R4, R4, R3, LSR #16
AND R3, R3, #&FF0
EOR R4, R4, R11, LSR #4
ORR R3, R3, R11, LSL #12
STMIA R1!, {R3-R4}
SUBS R2, R2, #8
BNE L1
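
To make the byte layout of the full-palette trick concrete, here is a small C model of it. It is a sketch under assumptions, not code from ADFFS: both function names are hypothetical, and the palette rule for indices that have both nibbles set is arbitrary, since the conversion never produces them.

```c
#include <stdint.h>

/* Hypothetical helper: pack one source word the way Example 2 does.
 * Even-numbered pixels (0, 2, 4, 6) end up in the high nibble of their
 * output byte, odd-numbered pixels (1, 3, 5, 7) in the low nibble. */
void expand_word_full_palette(uint32_t src, uint32_t out[2])
{
    out[0] = 0;
    out[1] = 0;
    for (int pixel = 0; pixel < 8; pixel++) {
        uint32_t nibble = (src >> (4 * pixel)) & 0xFu;
        uint32_t byte = (pixel & 1) ? nibble : (nibble << 4);
        out[pixel / 4] |= byte << (8 * (pixel % 4));
    }
}

/* Hypothetical helper: the 16-colour logical colour that an 8-bit palette
 * index should display. Indices &10, &20, &30 ... carry the colour in the
 * high nibble; indices 0-15 carry it in the low nibble. Indices with both
 * nibbles set are never generated, so the low nibble is used arbitrarily. */
uint8_t logical_colour_for_index(uint8_t index)
{
    uint8_t low = (uint8_t)(index & 0x0Fu);
    return low ? low : (uint8_t)(index >> 4);
}
```

Looping `logical_colour_for_index` over all 256 indices gives the table of colours to program into the 256-colour palette, so that either nibble position displays the same colour.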


The winner gets their code into ADFFS and a mention in the credits.
 
Andrew Rawnsley Message #122860, posted by arawnsley at 19:19, 27/11/2013, in reply to message #122859
R-Comp chap
Posts: 600
However you end up doing this, can I suggest that it be submitted to ROOL, as 16 colour mode emulation would be a very worthwhile addition to the OS generally. Indeed, I suspect some of the other routines that you're developing for ADFFS might make worthwhile additions to the operating system as a whole.

I appreciate the code also needs to be present in ADFFS for compatibility with older systems, but the whole "handling 16 colour screen modes" is going to be an issue for every port of RISC OS, pretty much, going forwards.

Indeed, one suggestion would be to look into using the second core of a dual core CPU to sit there doing mode conversion code (one proposed solution to the infamous RGB<->BGR RISC OS "issue"), although I suspect it'd bottleneck because the conversion code cannot be executed until the 16 colour buffer has been calculated...
 
Jon Abbott Message #122861, posted by sirbod at 19:43, 27/11/2013, in reply to message #122860
Member
Posts: 563
I was pondering that very question today. The solution is so deceptively simple it could easily be added into the core OS to provide legacy MODE support.

There's no hackery involved; it's all done using valid RO SWIs and leaves the OS to handle just about everything, with the screen buffer being in DA2 instead of the GPU.

I may well knock up a stripped down stand-alone module at some point, once I've added 1 bpp and 2 bpp translation.

The only botch I had to do was to figure out the logical GPU screen buffer address. The OS could really do with an SWI to get that info...or OS_Memory extended to handle IO physical addresses, which seems the sensible thing to do.
 
Jeffrey Lee Message #122863, posted by Phlamethrower at 20:54, 27/11/2013, in reply to message #122861
Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot Hot stuff

Posts: 15100
Unrolling the loop will make for a much faster rotate method:


.L1
LDR R3,[R0],#4

MOV R5,R3,LSR #28 ; nibble 7
MOV R3,R3,LSL #4
MOV R5,R5,LSL #8
ORR R5,R5,R3,LSR #28 ; nibble 6
MOV R3,R3,LSL #4
MOV R5,R5,LSL #8
ORR R5,R5,R3,LSR #28 ; nibble 5
MOV R3,R3,LSL #4
MOV R5,R5,LSL #8
ORR R5,R5,R3,LSR #28 ; nibble 4

MOV R3,R3,LSL #4 ; bring nibble 3 to the top
MOV R4,R3,LSR #28 ; nibble 3
MOV R3,R3,LSL #4
MOV R4,R4,LSL #8
ORR R4,R4,R3,LSR #28 ; nibble 2
MOV R3,R3,LSL #4
MOV R4,R4,LSL #8
ORR R4,R4,R3,LSR #28 ; nibble 1
MOV R3,R3,LSL #4
MOV R4,R4,LSL #8
ORR R4,R4,R3,LSR #28 ; nibble 0

SUBS R2,R2,#8
STMIA R1!,{R4-R5}
BNE L1


That's 8 pixels per 25 instructions, compared to 40+ for your version. But using AND to extract a pixel and then ORRing it in at the correct offset is faster, as it'll be two instructions per pixel instead of three:


MOV R9,#&FF0
.L1
LDR R3,[R0],#4

AND R4,R9,R3,LSL #4 ; nibbles 0&1
AND R5,R3,#&FF00
ORR R4,R4,R5,LSL #12 ; 2&3

AND R5,R9,R3,LSR #12 ; nibbles 4&5
AND R6,R3,#&FF000000
ORR R5,R5,R6,LSR #4 ; 6&7

SUBS R2,R2,#8
STMIA R1!,{R4-R5}
BNE L1


8 pixels in 10 instructions, using the full palette hack, compared to the 13 instructions for your approach. But considering the inner portion is so short, you could easily boost it further by unrolling the loop a few times.

The PLD instruction should also come in useful. The cache line size on the Pi is 32 bytes, so I'd suggest unrolling the loop to the point where each iteration processes 32 source bytes, with a preload instruction somewhere to preload a future cacheline. I'm not sure off the top of my head how far ahead the data should be preloaded, but I'd say 128 bytes ahead should give the hardware plenty of time to fetch the data before you need it.
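
As a rough illustration of that structure, here is a C sketch rather than tuned assembler. It assumes a GCC or Clang compiler, whose `__builtin_prefetch` typically compiles to PLD on ARM targets; the function names and the two constants are made up for illustration, and the 128-byte prefetch distance is only the guess from the paragraph above, not a measured value.

```c
#include <stddef.h>
#include <stdint.h>

/* C rendering of the six-instruction AND/ORR kernel above
 * (full-palette output): one source word in, two words out. */
static inline void expand_word_masked(uint32_t s, uint32_t *dst)
{
    dst[0] = ((s << 4) & 0x00000FF0u) | ((s & 0x0000FF00u) << 12);
    dst[1] = ((s >> 12) & 0x00000FF0u) | ((s & 0xFF000000u) >> 4);
}

enum {
    WORDS_PER_LINE = 8,   /* 32-byte cache line = 8 source words */
    PREFETCH_WORDS = 32   /* assumed prefetch distance: 128 bytes ahead */
};

/* Hypothetical helper: word_count is assumed to be a multiple of 8.
 * Each outer iteration handles one 32-byte cache line of source data and
 * asks for a line 128 bytes further on. The inner loop has a fixed trip
 * count, so the compiler is free to unroll it. */
void expand_buffer_prefetch(const uint32_t *src, uint32_t *dst, size_t word_count)
{
    for (size_t i = 0; i < word_count; i += WORDS_PER_LINE) {
#if defined(__GNUC__)
        /* Only prefetch addresses that are still inside the source buffer. */
        if (i + PREFETCH_WORDS < word_count)
            __builtin_prefetch(&src[i + PREFETCH_WORDS], 0, 0);
#endif
        for (size_t j = 0; j < WORDS_PER_LINE; j++)
            expand_word_masked(src[i + j], &dst[2 * (i + j)]);
    }
}
```

Whether the prefetch helps at all, and at what distance, would need timing on real hardware; the sketch only shows where it sits in the loop.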

It's also worth noting that this research suggests that the optimum write size is 4 words.

I'll leave the production of cycle timing optimised routines to someone with more spare time than myself :)

[Edited by Phlamethrower at 21:02, 27/11/2013]
 
