BorlandTalk.com Borland discussion newsgroups

Nils Haeck Guest
Posted: Wed Mar 21, 2007 4:00 am Post subject: repeated float operations.. SSE?
Hi Guys,
I need to do the following as fast as possible:
procedure Convert(X, Y: pdouble; Count: integer; const A, B: double);
var
  i: integer;
begin
  for i := 0 to Count - 1 do
  begin
    Y^ := X^ * A + B;
    inc(X);
    inc(Y);
  end;
end;
The Count variable is usually something like 100,000 (so X and Y are arrays
of 100,000 elements).
How can I optimize this using ASM? I'm looking for something that performs
well on floating point numbers. Preferably I work with double precision floats,
but I could convert the algorithm to work with single precision floats if
that is significantly faster.
Thanks in advance,
Nils Haeck

Les Guest
Posted: Fri Mar 23, 2007 2:38 am Post subject: Re: repeated float operations.. SSE?
| Quote: | I need to do the following as fast as possible:
procedure Convert(X, Y: pdouble; Count: integer; const A, B: double);
var
i: integer;
begin
for i := 0 to Count - 1 do
begin
Y^ := X^ * A + B;
inc(X);
inc(Y);
end;
end;
The Count variable is usually something like 100.000 (so X and Y are
arrays of 100.000 elements).
|
Both procedures below require SSE2. I have no access to an SSE2 CPU right now,
so they are completely untested. The loops can be unrolled by 2 or 4 for some
extra speed (interleave xmm2 with xmm3, xmm4, xmm5). The Count requirements on
entry would change if the loops were unrolled.
ConvSingle should theoretically be twice as fast, since each instruction
handles four singles instead of two doubles.
ConvDouble requires (Count >= 2) and (Count mod 2 = 0). X and Y must be
16-byte aligned. If they are not, use movupd instead of movapd (slower).
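procedure ConvDouble(X, Y: PDouble; Count: Integer; const A, B: Double);
register;
asm
  // eax = X, edx = Y, ecx = Count; A and B are on the stack
  movsd xmm0, [ebp+16]    // xmm0 low half = A
  movsd xmm1, [ebp+8]     // xmm1 low half = B
  pshufd xmm0, xmm0, $44  // xmm0 = A A
  pshufd xmm1, xmm1, $44  // xmm1 = B B
  and ecx, -2             // round Count down to an even number
  shl ecx, 3              // Count in bytes (8 per double)
  add eax, ecx            // eax = end of X
  add edx, ecx            // edx = end of Y
  neg ecx                 // negative byte offset from the end
@loop:
  movapd xmm2, [eax+ecx]  // load two doubles from X
  mulpd xmm2, xmm0        // * A
  addpd xmm2, xmm1        // + B
  movapd [edx+ecx], xmm2  // store two doubles to Y
  add ecx, 16             // advance 16 bytes
  js @loop                // repeat while the offset is still negative
end;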
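ConvSingle requires (Count >= 4) and (Count mod 4 = 0). X and Y must be
16-byte aligned. If they are not, use movups instead of movaps (slower).
procedure ConvSingle(X, Y: PSingle; Count: Integer; const A, B: Single);
register;
asm
  // eax = X, edx = Y, ecx = Count; A and B are on the stack
  movss xmm0, [ebp+12]    // xmm0 low element = A
  movss xmm1, [ebp+8]     // xmm1 low element = B
  pshufd xmm0, xmm0, 0    // xmm0 = A A A A
  pshufd xmm1, xmm1, 0    // xmm1 = B B B B
  and ecx, -4             // round Count down to a multiple of 4
  shl ecx, 2              // Count in bytes (4 per single)
  add eax, ecx            // eax = end of X
  add edx, ecx            // edx = end of Y
  neg ecx                 // negative byte offset from the end
@loop:
  movaps xmm2, [eax+ecx]  // load four singles from X
  mulps xmm2, xmm0        // * A
  addps xmm2, xmm1        // + B
  movaps [edx+ecx], xmm2  // store four singles to Y
  add ecx, 16             // advance 16 bytes
  js @loop                // repeat while the offset is still negative
end;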
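For readers who want to check the arithmetic outside Delphi, here is a minimal sketch of the same double precision loop in C with SSE2 intrinsics. It is an illustration under assumptions, not the code from this post: the function name convert_double is hypothetical, it uses unaligned loads and stores (the movupd variant), and it adds a scalar tail so Count does not have to be even. When the compiler does not define __SSE2__ it falls back to the plain loop.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* y[i] = x[i] * a + b for i in [0, count). Hypothetical helper name. */
void convert_double(const double *x, double *y, size_t count,
                    double a, double b)
{
    assert(count == 0 || (x != NULL && y != NULL));
    size_t i = 0;
#if defined(__SSE2__)
    const __m128d va = _mm_set1_pd(a);  /* a a, the role of xmm0 above */
    const __m128d vb = _mm_set1_pd(b);  /* b b, the role of xmm1 above */
    for (; i + 2 <= count; i += 2) {
        __m128d v = _mm_loadu_pd(x + i);        /* movupd load */
        v = _mm_add_pd(_mm_mul_pd(v, va), vb);  /* mulpd, then addpd */
        _mm_storeu_pd(y + i, v);                /* movupd store */
    }
#endif
    for (; i < count; ++i)  /* scalar tail, or the whole loop without SSE2 */
        y[i] = x[i] * a + b;
}
```

Calling convert_double(x, y, 5, 2.0, 1.0) handles four elements in the vector loop and the fifth in the scalar tail.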
Les.

Nils Haeck Guest
Posted: Fri Mar 23, 2007 7:16 pm Post subject: Re: repeated float operations.. SSE?
Thank you, Les.. I'll play around with it.
Nils

Nils Haeck Guest
Posted: Fri Mar 23, 2007 8:12 pm Post subject: Re: repeated float operations.. SSE?
I did some testing..
Good news: your code seems to work :)
Bad news: there's no difference *at all* in speed between the routine I
wrote and the SSE/SSE2 versions.
I'm using Delphi 7's compiler and an Intel Pentium 4 CPU (2.93 GHz). Might it
be that the SSE commands somehow get mapped to "standard" FP processing? Or
might it be that Delphi already optimizes my Pascal code to use SSE?
Nils

Nils Haeck Guest
Posted: Fri Mar 23, 2007 8:22 pm Post subject: Re: repeated float operations.. SSE?
Update..
It does seem to be a memory access problem.
When I run 300 passes over a list of 1000000 (1 million) elements, I see no
speed difference.
When I run 30000 passes over a list of 10000 elements (so the same number in
total), I see a significant speed advantage for the SSE methods.
Fortunately, this last case is more like "real life". However, I'd like to
know if there's anything I can do to speed up memory access, as this seems
to be the limiting factor.
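One plausible reading of these timings is that the large list does not fit in the CPU cache while the small one does. The sketch below just does that arithmetic; the helper names are hypothetical and the 1 MiB cache size used in the note after it is an assumption, not a measured property of this machine.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes touched by one pass: one input array plus one output array. */
size_t working_set_bytes(size_t count, size_t elem_size)
{
    assert(elem_size > 0);
    return 2 * count * elem_size;
}

/* Returns 1 if one pass fits in a cache of cache_bytes, otherwise 0.
   cache_bytes is an assumed figure supplied by the caller. */
int fits_in_cache(size_t count, size_t elem_size, size_t cache_bytes)
{
    return working_set_bytes(count, elem_size) <= cache_bytes;
}
```

With 8-byte doubles, 1000000 elements touch 16000000 bytes per pass, far more than an assumed 1 MiB L2 cache, while 10000 elements touch 160000 bytes and would stay cached between passes.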
Nils

Les Pawelczyk Guest
Posted: Fri Mar 23, 2007 11:22 pm Post subject: Re: repeated float operations.. SSE?
| Quote: | It does seem to be a mem access problem.
When I use 300x a list length of 1000000 (1 million), I see no speed
difference.
When I use 30000 x list length of 10000 (so same number in total), I see
significant speed advantage for the SSE methods.
Fortunately, this last case is more like "real-life". However, I'd like to
know if there's anything I can do to speed up memory access, as this seems
to be the limiting factor.
|
Try the procedure below.
It requires (Count >= 8) and (Count mod 8 = 0). X and Y must be 16-byte aligned, but it will likely be faster if they are 64-byte aligned. For best performance, experiment with the offset value in 'prefetchnta' (128) on your particular system. Anything between 128 and 512 (in 64-byte increments) may prove to be the best.
If the array is really large (much larger than the L2 cache), you can try 'movntpd' instead of 'movapd' when saving back to memory. The second 'prefetchnta' (the one with edx) will not be needed in that case.
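procedure ConvDouble(X, Y: PDouble; Count: Integer; const A, B: Double); register;
asm
  movsd xmm0, [ebp+16]         // xmm0 low half = A
  movsd xmm1, [ebp+8]          // xmm1 low half = B
  pshufd xmm0, xmm0, $44       // xmm0 = A A
  pshufd xmm1, xmm1, $44       // xmm1 = B B
  and ecx, -8                  // round Count down to a multiple of 8
  shl ecx, 3                   // Count in bytes (8 per double)
  add eax, ecx                 // eax = end of X
  add edx, ecx                 // edx = end of Y
  neg ecx                      // negative byte offset from the end
@loop:
  prefetchnta [eax+ecx+128]    // fetch X ahead of use
  prefetchnta [edx+ecx+128]    // fetch Y ahead of use
  movapd xmm2, [eax+ecx]       // load 8 doubles (64 bytes) from X
  movapd xmm3, [eax+ecx+16]
  movapd xmm4, [eax+ecx+32]
  movapd xmm5, [eax+ecx+48]
  mulpd xmm2, xmm0             // * A
  mulpd xmm3, xmm0
  mulpd xmm4, xmm0
  mulpd xmm5, xmm0
  addpd xmm2, xmm1             // + B
  addpd xmm3, xmm1
  addpd xmm4, xmm1
  addpd xmm5, xmm1
  movapd [edx+ecx], xmm2       // store 8 doubles to Y
  movapd [edx+ecx+16], xmm3
  movapd [edx+ecx+32], xmm4
  movapd [edx+ecx+48], xmm5
  add ecx, 64                  // advance 64 bytes
  js @loop                     // repeat while the offset is still negative
end;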
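The 'movntpd' variant mentioned above can be sketched in C with SSE2 intrinsics as follows. This is an assumption-laden illustration, not a tuned routine: the name convert_double_stream is hypothetical, the 128-byte prefetch distance is only the starting value suggested above, and the loop is not unrolled. _mm_stream_pd corresponds to movntpd and needs a 16-byte aligned destination; without __SSE2__ the function falls back to a plain loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics, also pulls in _mm_prefetch */
#endif

/* y[i] = x[i] * a + b. Requires count to be even and y 16-byte aligned. */
void convert_double_stream(const double *x, double *y, size_t count,
                           double a, double b)
{
    assert(count % 2 == 0);
    assert(((uintptr_t)y & 15) == 0);
#if defined(__SSE2__)
    const __m128d va = _mm_set1_pd(a);
    const __m128d vb = _mm_set1_pd(b);
    for (size_t i = 0; i < count; i += 2) {
        /* One prefetch per 64 bytes (8 doubles), 128 bytes ahead,
           skipped near the end so the address stays inside the array. */
        if ((i & 7) == 0 && i + 16 < count)
            _mm_prefetch((const char *)(x + i + 16), _MM_HINT_NTA);
        __m128d v = _mm_loadu_pd(x + i);
        v = _mm_add_pd(_mm_mul_pd(v, va), vb);
        _mm_stream_pd(y + i, v);  /* movntpd: store that bypasses the cache */
    }
    _mm_sfence();  /* make the streamed stores visible before returning */
#else
    for (size_t i = 0; i < count; ++i)
        y[i] = x[i] * a + b;
#endif
}
```

Because the stores bypass the cache, no prefetch of Y is issued, which matches the remark that the second 'prefetchnta' is not needed in this case.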
Les.

Nils Haeck Guest
Posted: Sat Mar 24, 2007 4:46 am Post subject: Re: repeated float operations.. SSE?
Thanks a lot, Les.
I'm trying to understand what happens.. I have added some comments here:
procedure ConvDoubleSSE2(X, Y: PDouble; Count: Integer; const A, B: Double);
// (Count >= 2) and (Count mod 2 = 0). X and Y must be 16-byte aligned. If not,
// then use movupd instead of movapd (slower).
register;
asm
  movsd xmm0, [ebp+16] // Copy A to xmm0: 0 A
  movsd xmm1, [ebp+8] // Copy B to xmm1: 0 B
  pshufd xmm0, xmm0, $44 // xmm0: A A
  pshufd xmm1, xmm1, $44 // xmm1: B B
  and ecx, -2 // clear last two bits of ecx (Count now multiple of 4)
  shl ecx, 3 // Count := Count * 8
  add eax, ecx // X pointer + Count
  add edx, ecx // Y pointer + Count
  neg ecx // Count := - Count
@loop:
  movapd xmm2, [eax+ecx] // xmm2: X0 X1 (two X doubles)
  mulpd xmm2, xmm0 // X := X * A (two doubles at a time)
  addpd xmm2, xmm1 // X := X + B (two doubles at a time)
  movapd [edx+ecx], xmm2 // Store as Y (two doubles at a time)
  add ecx, 16 // increment count with 2 * 8 (2 doubles)
  js @loop // loop as long as count < 0
end;
I think I understand most of it, but I have a question about the first two
lines. How does it work with e.g. [ebp+16]? I would think +16 would be B and
+8 would be A, but this seems not to be the case.
Also, if I would want to do a bit more complex thing, like
Z := A * X + B * Y + C
Would I still have enough registers for the mem pointers and counters?
PS I'll also test the prefetch code.
Nils

Nils Haeck Guest
Posted: Sat Mar 24, 2007 8:11 am Post subject: Re: repeated float operations.. SSE?
Thank you Bob, I'm starting to get to grips with it :)
| Quote: | -2 has only one clear bit, not 2.
ecx is now even;
|
Yes I saw this too late, already realised it. But thanks anyway.
So about the stack..
procedure Blabla(...... const A, B: Double);
[ebp + 16] points to A
[ebp + 8] points to B
but in the other example..
procedure Blabla(..... const A, B: Single);
I would expect
[ebp + 8] points to A
[ebp + 4] points to B
But when I try it, that is obviously not correct; I have to use
[ebp + 12] points to A
[ebp + 8] points to B
Is there some "fixed offset" of 8?
Then about the other parameters, is it always true that
procedure Blabla(x, y, z: integer; ....)
eax = x
edx = y
ecx = z
(of course "integer" can also be "pointer" or any other 32bit equivalent)
Also, do I have to push/pop any of these to save them?
Nils

Bob Gonder Guest
Posted: Sat Mar 24, 2007 8:11 am Post subject: Re: repeated float operations.. SSE?
Nils Haeck wrote:
| Quote: | and ecx, -2 // clear last two bits of ecx (Count now
multiple of 4)
|
-2 has only one clear bit, not 2.
ecx is now even;
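A small C sketch of the same masking step may help: in two's complement, -2 is all ones except bit 0, -4 is all ones except the two low bits, and so on, so an AND with -step rounds down to a multiple of step. The function name round_down is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Round count down to a multiple of step, where step is a power of two.
   This is what "and ecx, -2", "and ecx, -4" and "and ecx, -8" do. */
int32_t round_down(int32_t count, int32_t step)
{
    assert(step > 0 && (step & (step - 1)) == 0);
    return count & -step;
}
```

For example, round_down(7, 2) gives 6, round_down(7, 4) gives 4 and round_down(15, 8) gives 8.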
| Quote: | procedure ConvDoubleSSE2(X, Y: PDouble; Count: Integer; const A, B: Double);
movsd xmm0, [ebp+16] // Copy A to xmm0: 0 A
movsd xmm1, [ebp+8] // Copy B to xmm1: 0 B
I think I understand most, but I have a question about first two lines. How
does it work with e.g. [ebp+16]? I would think +16 would be B and +8 would
be A, but this seems not the case.
|
Pascal Parameters are pushed onto the stack from left to right.
So A is pushed first, then B.
The stack grows Down to smaller addresses, so A is above B.
| Quote: | Also, if I would want to do a bit more complex thing, like
Z := A * X + B * Y + C
Would I still have enough registers for the mem pointers and counters?
|
I don't know SSE, but you are only using 0,1,2, so at least 3 should
be available.

Bob Gonder Guest
Posted: Sat Mar 24, 2007 7:51 pm Post subject: Re: repeated float operations.. SSE?
Nils Haeck wrote:
| Quote: | Is there some "fixed offset" of 8?
|
FunctionEntryPoint: // esp == return address
push ebp // esp+4 == return address
mov ebp,esp // ebp+4 == return address
// ebp+8 == lowest stack parameter (the last one pushed)
sub esp, local stack size
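The offsets discussed in this thread follow from that layout. As a rough check, the C sketch below computes the [ebp + offset] of each stack parameter, assuming left-to-right pushing and the standard frame shown above (saved ebp at [ebp], return address at [ebp+4]). The function name param_offset is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Offset from ebp of stack parameter `index` (0 = leftmost), given the
   byte sizes of all stack parameters in declaration order.
   Assumes left-to-right pushing, so later parameters sit at lower
   addresses, and 8 bytes of saved ebp plus return address below them. */
size_t param_offset(const size_t *sizes, size_t n, size_t index)
{
    assert(index < n);
    size_t offset = 8;  /* saved ebp (4) + return address (4) */
    for (size_t i = index + 1; i < n; ++i)
        offset += sizes[i];  /* skip parameters pushed after this one */
    return offset;
}
```

For (const A, B: Double) the sizes are 8 and 8, giving A at 16 and B at 8. For (const A, B: Single) they are 4 and 4, giving A at 12 and B at 8.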
| Quote: | Then about the other parameters, is it always true that
procedure Blabla(x, y, z: integer; ....)
eax = x
edx = y
ecx = z
|
Only if Pascal is using the fastcall method (Delphi calls it 'register').
And only for parameters that are 32bit or less.
A 64bit integer does not qualify for a register in Delphi's 'register'
convention; it is passed on the stack, and the registers go to the remaining
parameters of 32bit or less.
| Quote: | Also, do I have to push/pop any of these to save them?
|
Only if your code requires it.
Copy:
push esi // cannot change these registers
push edi
mov esi,eax // source
mov edi,edx // destination
rep movsb // count ecx down to 0
pop edi // restore non-changeable registers
pop esi
ret
Copy2Places:
push ebp
mov ebp,esp // set up the frame so [ebp+8] addresses the 4th parameter
push esi // cannot change these registers
push edi
push eax // save parameters we will reuse
push ecx
mov esi,eax
mov edi,edx
rep movsb
mov edi,[ebp+8] // 4th parameter
pop ecx // reuse parameters
pop esi
rep movsb
pop edi // restore non-changeable registers
pop esi
pop ebp
ret 4 // remove 4th parameter (4 bytes)

Les Pawelczyk Guest
Posted: Mon Mar 26, 2007 7:16 pm Post subject: Re: repeated float operations.. SSE?
| Quote: | Also, if I would want to do a bit more complex thing, like
Z := A * X + B * Y + C
Would I still have enough registers for the mem pointers and counters?
|
There are eight xmm registers: xmm0 - xmm7. There are sixteen in a 64-bit environment.
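To make the register count concrete, here is a minimal C sketch of Z := A * X + B * Y + C with SSE2 intrinsics. It is an illustration under assumptions: the name combine_double and its parameter order are hypothetical, it uses unaligned loads and stores, and it falls back to a scalar loop without __SSE2__. Three vector values hold the constants and two more serve as temporaries, which fits within eight xmm registers.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>  /* SSE2 intrinsics */
#endif

/* z[i] = a * x[i] + b * y[i] + c for i in [0, count). */
void combine_double(const double *x, const double *y, double *z,
                    size_t count, double a, double b, double c)
{
    assert(count == 0 || (x != NULL && y != NULL && z != NULL));
    size_t i = 0;
#if defined(__SSE2__)
    const __m128d va = _mm_set1_pd(a);  /* three registers for constants */
    const __m128d vb = _mm_set1_pd(b);
    const __m128d vc = _mm_set1_pd(c);
    for (; i + 2 <= count; i += 2) {
        __m128d vx = _mm_mul_pd(_mm_loadu_pd(x + i), va);  /* a * x */
        __m128d vy = _mm_mul_pd(_mm_loadu_pd(y + i), vb);  /* b * y */
        _mm_storeu_pd(z + i, _mm_add_pd(_mm_add_pd(vx, vy), vc));
    }
#endif
    for (; i < count; ++i)  /* scalar tail, or the whole loop without SSE2 */
        z[i] = a * x[i] + b * y[i] + c;
}
```

In a 32-bit Delphi asm version, the three pointers would presumably arrive in eax, edx and ecx, so Count would come from the stack and need a fourth general-purpose register such as ebx or esi, saved and restored around the loop.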
| Quote: | PS I'll also test the prefetch code.
|
Don't forget that prefetch only needs to be done once every 64 bytes.
Les.

Les Pawelczyk Guest
Posted: Mon Mar 26, 2007 7:16 pm Post subject: Re: repeated float operations.. SSE?
| Quote: | I would expect
[ebp + 8] points to A
[ebp + 4] points to B
But obviously when I try it is not correct; I have to use
[ebp + 12] points to A
[ebp + 8] points to B
|
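The 'call' instruction used to run your procedure needs 4 bytes to save the return address, and the 'push ebp' that sets up the stack frame takes another 4, so the stack parameters start at [ebp + 8].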
| Quote: | Then about the other parameters, is it always true that
procedure Blabla(x, y, z: integer; ....)
eax = x
edx = y
ecx = z
|
Only if you use the 'register' calling convention. This is the default calling convention, but it doesn't hurt to specify it anyway.
| Quote: | Also, do I have to push/pop any of these to save them?
|
You can freely use eax, edx and ecx without having to restore their original values before return. Every other general-purpose register (ebx, esi, edi, ebp, esp) must be preserved.
Les.

Nils Haeck Guest
Posted: Mon Mar 26, 2007 9:56 pm Post subject: Re: repeated float operations.. SSE?
Thanks for all the feedback, everyone! It makes it all much more clear now.
Nils