SPO600 Project – blog #2

To compare performance across architectures, I wrote a small test program and timed it on both the AARCH64 and X86_64 architectures.

Basic testing program:

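#include <string.h>
#include <stdio.h>

int main(void){
    char* res;
    /* Run the same substring search 10,000,000 times. */
    for(int i = 0; i < 10000000; i++){
        char* hash = "HelloMyNameIsLawrence";
        char* key = "Is";
        res = strstr(hash, key);
    }
    /* Print the last result so that res is actually used. */
    printf("%s\n", res);
}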

Results in X86_64:

/*
The program was run against the version of strstr that was installed on the system at the time.
*/

Command:
gcc -std=c99 tst-strstr.c

Time:
real 0m0.491s
user 0m0.490s
sys 0m0.001s

Disassembly:
0000000000400546 <main>:
400546: 55 push %rbp
400547: 48 89 e5 mov %rsp,%rbp
40054a: 48 83 ec 20 sub $0x20,%rsp
40054e: c7 45 f4 00 00 00 00 movl $0x0,-0xc(%rbp)
400555: eb 2b jmp 400582 <main+0x3c>
400557: 48 c7 45 e8 30 06 40 movq $0x400630,-0x18(%rbp)
40055e: 00
40055f: 48 c7 45 e0 46 06 40 movq $0x400646,-0x20(%rbp)
400566: 00
400567: 48 8b 55 e0 mov -0x20(%rbp),%rdx
40056b: 48 8b 45 e8 mov -0x18(%rbp),%rax
40056f: 48 89 d6 mov %rdx,%rsi
400572: 48 89 c7 mov %rax,%rdi
400575: e8 c6 fe ff ff callq 400440 <strstr@plt>
40057a: 48 89 45 f8 mov %rax,-0x8(%rbp)
40057e: 83 45 f4 01 addl $0x1,-0xc(%rbp)
400582: 81 7d f4 7f 96 98 00 cmpl $0x98967f,-0xc(%rbp)
400589: 7e cc jle 400557 <main+0x11>
40058b: 48 8b 45 f8 mov -0x8(%rbp),%rax
40058f: 48 89 c7 mov %rax,%rdi
400592: e8 99 fe ff ff callq 400430 <puts@plt>
400597: b8 00 00 00 00 mov $0x0,%eax
40059c: c9 leaveq
40059d: c3 retq
40059e: 66 90 xchg %ax,%ax

Command:
gcc -std=c99 -O2 tst-strstr.c

Time:
real 0m0.001s
user 0m0.001s
sys 0m0.000s

Disassembly:
0000000000400400 <main>:
400400: 48 83 ec 08 sub $0x8,%rsp
400404: bf bb 05 40 00 mov $0x4005bb,%edi
400409: e8 e2 ff ff ff callq 4003f0 <puts@plt>
40040e: 31 c0 xor %eax,%eax
400410: 48 83 c4 08 add $0x8,%rsp
400414: c3 retq
400415: 66 2e 0f 1f 84 00 00 nopw %cs:0x0(%rax,%rax,1)
40041c: 00 00 00
40041f: 90 nop

Results in AARCH64:

/*
The program was run against the version of strstr that was installed on the system at the time.
*/

Command:
gcc -std=c99 tst-strstr.c

Time:
real 0m0.742s
user 0m0.740s
sys 0m0.000s

Disassembly:
0000000000400620 <main>:
400620: a9bd7bfd stp x29, x30, [sp,#-48]!
400624: 910003fd mov x29, sp
400628: b90027bf str wzr, [x29,#36]
40062c: 1400000e b 400664 <main+0x44>
400630: 90000000 adrp x0, 400000 <_init-0x430>
400634: 911ca000 add x0, x0, #0x728
400638: f9000fa0 str x0, [x29,#24]
40063c: 90000000 adrp x0, 400000 <_init-0x430>
400640: 911d0000 add x0, x0, #0x740
400644: f9000ba0 str x0, [x29,#16]
400648: f9400fa0 ldr x0, [x29,#24]
40064c: f9400ba1 ldr x1, [x29,#16]
400650: 97ffff98 bl 4004b0 <strstr@plt>
400654: f90017a0 str x0, [x29,#40]
400658: b94027a0 ldr w0, [x29,#36]
40065c: 11000400 add w0, w0, #0x1
400660: b90027a0 str w0, [x29,#36]
400664: b94027a1 ldr w1, [x29,#36]
400668: 5292cfe0 mov w0, #0x967f // #38527
40066c: 72a01300 movk w0, #0x98, lsl #16
400670: 6b00003f cmp w1, w0
400674: 54fffded b.le 400630 <main+0x10>
400678: f94017a0 ldr x0, [x29,#40]
40067c: 97ffff89 bl 4004a0 <puts@plt>
400680: 52800000 mov w0, #0x0 // #0
400684: a8c37bfd ldp x29, x30, [sp],#48
400688: d65f03c0 ret

Command:
gcc -std=c99 -O2 tst-strstr.c

Time:
real 0m0.001s
user 0m0.000s
sys 0m0.000s

Disassembly:
0000000000400470 <main>:
400470: a9bf7bfd stp x29, x30, [sp,#-16]!
400474: 910003fd mov x29, sp
400478: 90000000 adrp x0, 400000 <_init-0x3f0>
40047c: 911a6c00 add x0, x0, #0x69b
400480: 97fffff8 bl 400460 <puts@plt>
400484: 52800000 mov w0, #0x0 // #0
400488: a8c17bfd ldp x29, x30, [sp],#16
40048c: d65f03c0 ret

Conclusions:

As you can see, when the program is compiled without any compiler optimization, it takes longer to run on AARCH64 (0.742s) than on X86_64 (0.491s). That made me think there is some room for optimization on the AARCH64 side.

What worries me, though, is that once the compiler optimizes the program with -O2, the run time on both architectures drops to just 0.001s. Because of this result, I compared the assembly code generated for the two architectures and found no meaningful difference between them: in both -O2 listings, main contains no loop and no call to strstr, only a single call to puts with a constant address. This suggests that the compiler resolved the search at compile time, so the -O2 timings do not really measure strstr at all.
