Every asm coder wants to produce fast, short and efficient code. In asm
language the things should be always pushed to the limits. Imagine, 58
bytes are enough to display color PCX file on the screen, 2 bytes can
cause system restart etc...
Especially virus coders should produce tight and effective code, it is
not very pleasant thing to see unoptimized code of viruses. Sometimes
the optimisation can spare couple hundred bytes - good reason to deal
with this interesting topic.
Index
0.  |   What is optimization |
A.  |   Opening words |
B.  |   Uninicialized data |
C.  |   Register settings in various situations |
D.  |   Putting 0 in register |
E.  |   Test if register is clear |
F.  |   Putting 0FFh in AH, 0FFFFh in DX and CX |
G.  |   Test if registers are 0FFFFh |
H.  |   Using EAX/AX/AL saves bytes |
I.  |   MOV vs. XCHG |
J.  |   16 bit and 8 bit registers |
K.  |   Registers and immediate operands |
L.  |   Segment registers playground |
M.  |   The string instructions |
N.  |   DEC/INC vs SUB/ADD plus SI/DI |
O.  |   SHR/SHL |
P.  |   Procedures |
Q.  |   Multiple pops/pusheh |
R.  |   Object code |
S.  |   Structure of code |
T.  |   Arithmetics with SIB (LEA) |
B9 0014 90 mov cx, buffer_size
14*(??) buffer: db 20 dup (?)
=0014 buffer_size equ $-buffer
Better result we get with multiple passes (/m switch):
B9 0014 mov cx, buffer_size
14*(??) buffer: db 20 dup (?)
=0014 buffer_size equ $-buffer
Very bad idea idea is calculating in the code some weird value which could be calculated directly in the source code using operands and parenthesses.
call read_header
call process_it
.....
buffer: db 18 dup (?)
.....
read_header:
mov ax, 4200h
xor cx, cx
cwd
int 21h
mov cx, 18
mov dx, offset buffer
mov ah, 3f
int 21h
Well, as you can seek if you place the uninicialized data in your virus somewhere inside the body size of the resulting code is larger as necessary. Therefore avoid placing uninicialized variables, structures or buffers in the code. All the mentioned things should be placed on the heap. To see example how it should not look like, just take a look on your COMMAND.COM - in the one of DOS 6.22 as well as of Windows 98 you will seek a lot of zeroes near the EOF. That code is especially poor optimized. Try to use something like
virus_code_end label near
buffer: db 1024 dup(?)
|
| ||||||||||||||||||||||||||||||||||||||||||||||||||||||||
* as to Ender's tests, this might varry from bios to bios, however it remains constant for one machine (or bios version). You can use it as some sort of snapshot taken at first boot time, and then used for example in decryptor. Antivirus will not be able at all to guess these numbers without reboot and it will not be able correctly decrypt poly virus in mbr. (Quite nice Ender's idea, isn't it?) |
As we all know Turbo Debugger sets all general registers to 0. When we some perform operation like
xor cx, bp
... some code not affecting cx
cmp cx, 93Eh
jne TD_is_here
... no debuger here
This way virus can easy escape from debugger. But on good writtten emulation system we are without any chance because they (AV) know we know and they are ready. But we can easy top all the wannabies trying to "research" our virus and give them the fun they want to have.
D. Putting 0 in register This above mentioned approach can be used every time you need. Under Win32
is the above mentioned optimization more efficient, just take look at
following piece of code: because after the exit from some loop (i do not mean hard or conditional
jump out of the loop) the CX register is already cleared as well as
after the REP ??? operations. will set for us sign flag if the register is zero. Situation under Windoze
is similar: jumps when CX is zero, we do not need comprare CX with zero. But
especially demo coders should pay attention to the fact, this
instruction is relatively slow.
E. Test if register is clear
The usual way of clearing a register is
B8 0000 mov ax, 0 ; costs 3 bytes
33 C0 xor ax, ax ; is more optimised
2B C0 sub ax, ax ; only 2 bytes !
Now we will take a sneak peek on the specific situations.B8 00000000 mov eax, 0 ; costs 5 bytes
33 C0 xor eax, eax
2B C0 sub eax, eax ; costs only 2 bytes
Let's assume the situation we need to clear the AH register. Generally we
will do it this way:
but if the AL walue is less than 80h we can save a byte (DOS)
B4 00 mov ah, 0 ; cost 2 bytes in DOS
2A E4 sub ah, ah
32 E4 xor ah, ah ; is the same
while under Windoze the cbw will be assembled as
98 cbw
66 98 cbw
We use the same approach as with AH. If value in AX <8000h we can use
to clear the DX99 cwd
There is absolutely useless use code like
....
loop some_label
xor cx, cx ; or even worse - mov cx, 0
Normal, uneducated coder will use this code:
but hard core programmer will for sure use
3D 0000 cmp ax, 0 ; which takes 3 bytes
83 FB 00 cmp bx, 0
if the value can be discarded we can save another byte
0B C0 or ax, ax ; this takes only 2 bytes
0B DB or bx, bx
48 dec ax
4B dec bx ; this is only 1 byte
83 F8 00 cmp eax, 0
0B C0 or eax, eax
0B C0 or eax, eax
66| 0B C0 or ax, ax ; this is not default
; operand size
In x86 processors CX register is used as counter, the instruction set
supports this use.
E3 ?? jcxz some_label
....
loop some_label
dec cx ; CX is here 0 this, sets it to 0FFFFh.
G. Test if registers are 0FFFFh
If we need to test if the 2 registers hold both 0FFFFh and value of one
is not needed afterwards, we can easy do it with e.g.
23 C3 and ax, bx
40 inc ax
74 ?? jz a ; 5 bytes altogether
instead of
3D FFFF cmp ax, 0ffffh
75 ?? jnz a
83 FB FF cmp bx, 0ffffh
75 ?? jzn a ; making 10 bytes
H. Using EAX/AX/AL saves bytes
Opcodes for some types of instructions are shorter, when you use AX or
AL instead of any other registers. This is the case of instruction
like:
A1 0084 mov ax, word ptr ds:[84h] ; 3 bytes
8B 1E 0084 mov bx, word ptr ds:[84h] ; 4 bytes
A0 0084 mov al, byte ptr ds:[84h] ; 3 bytes
8A 26 0084 mov ah, byte ptr ds:[84h]
8A 3E 0084 mov bh, byte ptr ds:[84h] ; 4 bytes each
A3 0084 mov word ptr ds:[84h], ax ; 3 bytes
89 1E 0084 mov word ptr ds:[84h], bx ; 4 bytes
A2 0084 mov byte ptr ds:[84h], al ; 3 bytes
88 26 0084 mov byte ptr ds:[84h], ah
88 3E 0084 mov byte ptr ds:[84h], bh ; 4 bytes each
3D 04D3 cmp ax, 0D304h ; 3 bytes
81 FB 04D1 cmp bx, 0D104h ; 4 bytes
3C 22 cmp al, 34 ; 2 bytes
80 FC 22 cmp ah, 34
80 FF 22 cmp bh, 34 ; 3 bytes each
I. MOV vs. XCHG
Moving value from one register to another can be replaced with XCHG but
only in cases the value in one of source register is not important. AX
register as expected is here better solution too.
8B C3 mov ax, bx ; 2 bytes
93 xchg ax, bx ; 1 byte
Following code is of equal size
8A C3 mov al, bl
86 C3 xchg al, bl
8A E6 mov ah, dh
86 E6 xchg ah, dh ; 2 bytes all
but you can even enlarge the code with :
A1 0080 mov ax, word ptr ds:[80h] ; 3 bytes
87 06 0080 xchg ax, word ptr ds:[80h] ; 4 bytes
86 06 0080 xchg al, byte ptr ds:[80h] ; this is bad
J. 16 bit and 8 bit registers
I doubt there is any non-newbie coder who will use code ala
B0 10 mov al, 10h
B4 20 mov ah, 20h
which takes 4 bytes, but doing the code above at once with
B8 2010 mov ax, 2010h
takes only 3 bytes.
K. Registers and immediate operands
Immediate operands cost more bytes than use of registers. Just looka
at the example below:
C6 06 010C 00 mov byte ptr [10Ch], 0 ; 5 bytes
88 3E 010C mov byte ptr [10Ch], bh; 4 bytes
A2 010C mov byte ptr [10Ch], al; 3 bytes
L. Segment registers playground STOS type save 2 bytes in comparision with move. Therefore MOVS
instruction save 3 bytes (is the same as LODS followed by STOS). SCAS type instruction save 2 byte in comparition with CMP.
CMPS instruction does the same as We dont have to omit REP prefixes - using this 1 byte sized instruction
we can do miracles almost at no costs. In addition after REP ??? is CX
set to 0 Even if we need to add/sub 2 from register inc/sub twice is 1 byte
shorter. For SI/DI if AX doesn't matter we can use some of string
instruction for INC/DEC (depending on direction flag). for multiply al with 4 and save a byte here. If we are multiplying just
by 2, we can save 2 bytes because from the formula above you can clearly see, that every repeated code
which size is 3 bytes or less can not be replaced by procedure in the
process of optimisation. When we know the size of code (which should be
at least 4 bytes) and the number of repeating we can estimate the number
of saved bytes. So, for code with length of 4 bytes we save 1 bytes, if
the code is repeated 6 times and put in the procedure. R. Object code S. Structure of code Size of the code above is 26 bytes. Every added handled function adds
just 3 further bytes. But when we use this virus typical sequence of
CMP, JE/JNE, JMP the it will look like: thus we have to count at least 8 bytes for every single handled
function. If we want to hadle 10 functions, it will be as much as 80
bytes. Structured coding from previous example will fit the same
function in only (26+3*8) = 50 bytes. 30 bytes saved, do i have to
further explain all the pluses structured coding can bring to you?
If you use not default segmet register for the operation, segment prefix
will be generated, which add 1 byte to the size of the code.
3E: 8B 86 0100 mov ax, word ptr ds:[100h][bp]
8B 84 0100 mov ax, word ptr ss:[100h][bp]
We can't move directly value from one segment register to another. It
has to be coded ala
with length of 4 bytes but
8C C8 mov ax, cs
8E D8 mov ds, ax
takes only 2 bytes0E push cs
1F pop ds
LODS type instruction save 1 byte in comparision with move.
AC lodsb
8A 04 mov al, byte ptr ds:[si]
AD lodsw
8B 04 mov ax, word ptr ds:[si]
66| AD lodsd
66| 8B 04 mov eax, dword ptr ds:[si]
AA stosb
26: 88 05 mov byte ptr es:[di], al
AB stosw
26: 89 05 mov word ptr es:[di], ax
66| AB stosd
66| 26: 89 05 mov dword ptr es:[di], eax
AE scasb
26: 3A 05 cmp al, byte ptr es:[di]
AF scasw
26: 3B 05 cmp ax, word ptr es:[di]
66| AF scasd
66| 26: 3B 05 cmp eax, dword ptr es:[di]
thus we save 3 bytes with CMPS. LODSB/LODSW/LODSD
CMP AL/AX/EAX, byte/word/dword ptr ES:[DI]
40 inc ax
FE C0 inc al
4A dec dx
FE CE dec dh
2D 0001 sub ax, 1
05 0001 add ax, 1
83 2E 0080 01 sub word ptr ds:[80h], 1
FF 0E 0080 dec word ptr ds:[80h]
83 06 0080 01 add word ptr ds:[80h], 1
FF 06 0080 inc word ptr ds:[80h]
use just
B1 02 mov cl,2
F6 E1 mul cl ; 4 bytes in total
C0 E0 02 shl al,2
has size only 2 bytes.
D0 E0 shl al,1
( S+1 )
+
N*3
+
A
(ret + the size of code)
+
N*3 bytes for each call
+
difference
N*S = (S+1) + 3*N + A
N*S - 3*N - (S+1) = A
N*(S-3) - S - 1 = A
Next thing you can do with procedures, is using multiple entry points
for the same procedure.
call push_reg
...
...
push_reg:
pop si ; pop return adress
push ax
push cx
push dx
push bx
jmp si ; this does what ret normally does
Some opcodes are in some memory models aren't accesible. Therefore to
use some workaround. Most typical use of object code in virus if famous
return to original dos handler:
Another nice use of object code is let's say in decryptor ala:
back db 9ah ; this is for JMP FAR PTR
dosadr dd ? ; and here comes the adress
but as the every copy of virus will have different key, we code this
instruction as
xor ax, key
which will work perfectly.
db 35
key: dw ?
This is really what is optimisation about. If you structure your code
good you can save lot of bytes. If you are using lot of procedures, do
not forget to check input and outpot registers. Sometimes it could be
valuable to use register, which cost less bytes to handle in procedures.
As and example for good structured code, here look at part of INT 21h
handler for some hypotetic virus. Pay attention to the use of call
instruction here: 86 C4 xchg al,ah
E8 0007 call jump_there
40 db 40h
021Ar dw offset write
3E db 3eh
0458r dw offset close
00 db 0 ; end of table
jump_there:
5F pop di ; pointer to
; table begin
search:
80 3D 00 cmp byte ptr ds:[di],0
74 08 je exit ; table end ?
AE scasb
74 03 je handle ; AL = AH ?
AF scasw ; DI = DI + 2
EB F5 jmp short search
handle:
FF 25 jmp word ptr ds:[di] ; jump to
; function
exit:
80 FC 40 cmp ah, 40h
75 03 jne check_3e
E9 022D jmp infect
check_3e:
89 84 CB 12345678 mov [ecx*8+ebx+12345678h], eax
Scale is register (any 32bit) multiplied by 1, 2, 4 or 8;
Index is another 32bit register, and finally Base is a raw offset. Base
is a 32bit value as well, that takes another 4 bytes, but it might be used
also without this constant.
You can use if for aritmetical operations, even more it is faster than
comparisonable equivalents using multiplying or shifts, and you can perform
a mutliplication for even non-standard values, like eax mul 9, for example:
8D 04 C0 lea eax, [eax*8+eax]
As you can see, this lea-trick can be used in many
circumstances, and I'll not list them all here, of cos.
Hope this little introduction helps you with some ideas, but you surely
know: "truth is out there" - you need to optimize your code in more general
context, not only on instruction base, but on register usage optimizing
and stack usage as well, and there are many other things that can't be
decribed in such a general way. May be it is for another article. But for
now, go ahead and won't your code be pesimised...