In this article I would like to give my own view of the macro scanning technologies used in the most common antiviruses.
Structure of WB6/VBA
The first thing you have to understand is how a macro is stored in a file. This
is probably the most important reason why scanning is done the way it is.
So let's start. I will assume you know something about VBA (Visual Basic
for Applications), which is the most common language used for macros.
I would like to focus on VBA, because it is more common than WB6 (WordBasic 6).
A VBA project is stored in its own folder. (You can easily take a look
at it using dfview, supplied with DevStudio.) Each macro is stored in
a stream with a somewhat complex structure. Before a macro is written to the file it is
compiled by the built-in compiler to something like a stack-machine language.
So a = b + c becomes something like this:
push b ; push b onto the stack
push c ; push c onto the stack
add ; pop b and c, push their sum onto the stack
pop a ; pop the value and store it in a
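To make the listing above concrete, here is a minimal sketch of a stack machine that runs exactly those four instructions. The opcode names are taken from the listing, not from the real compiled VBA format, and the tuple encoding and the function name run are assumptions made only for this illustration.

```python
# Hypothetical mini stack machine. The opcodes mirror the listing above;
# they are NOT the real opcodes of compiled VBA.
def run(program, variables):
    """Execute (opcode, *args) tuples and return the updated variables."""
    stack = []
    for op, *args in program:
        if op == "push":            # put the value of a variable on the stack
            stack.append(variables[args[0]])
        elif op == "add":           # pop two values, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "pop":           # pop a value and store it in a variable
            variables[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown opcode: {op}")
    return variables


# a = b + c
program = [("push", "b"), ("push", "c"), ("add",), ("pop", "a")]
result = run(program, {"b": 2, "c": 3})
print(result["a"])
```

Note that the program itself only carries the names as operands. If those operands were indexes into a separate dictionary, as described for VBA, a scanner could still follow the push/push/add/pop shape without ever resolving the names.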
Variable and function names are usually kept in a dictionary, so it is a bit difficult to find them. The code holds just pointers or indexes. Moreover, the dictionary is not stored with the macro itself, so a lazy AV (and many are) just skips the names.
Another important thing is that jumps and calls are not linked in the file, so it is a bit difficult to trace the code flow.
Another approach to scanning VBA is to decode the "source" supplied with the macro. In this case the AV has the full source code (with all names and so on), so it is enough to skip some headers and the like and CRC the rest.
WB6 is much simpler in structure. Its code is stored as tokens (?), where each token has an exact meaning: for example space, +, -, SomeFunction, a number, a variable, and so on. So it is rather difficult to write an emulator for this language. I don't want to waste time on this, so here is an example (the same case I used for VBA):
variable "a"
operator '='
variable "b"
operator '+'
variable "c"
and a function call:
internal function MacroCopy
string "my_macro"
operator ','
string "new_macro"
As you can easily see, this structure is very simple, and that is probably why the technology of choice has been CRC.
Scanning
The number of macro virii is growing day by day. This is a reason to use
automatic systems to extract signatures. Because the first ones came with
WB6, and WB6 has a very simple structure, it seemed that CRC was the best.
It is "absolutely" exact, so it can recognise sub-variants, and it is fast
and easy to implement. It may seem silly now, but those were the days when
everybody believed that polymorphism in macros was not possible (what a
mistake :-). So the scanning algorithm is something like this: CRC all macros
and look them up in the database. Look at the macros you found: if they make
up a whole variant, the virus is identified exactly. If something is missing
but you think what you found is enough: possible virus. You can see the
legacy of those good old days even now. Many variants differ in just one
tab or space after the end of a macro. CRC rulez...
Suddenly polymorphic ones came to light and CRC became not very
efficient. Still, it is very comfortable to just run some program on
your collection and get signatures. The simplest way out is just to
modify the CRC algorithm. In those days polymorphism meant that the
name of a variable changed or something like that, so why not skip
whitespace and variable names? And this is probably the final solution:
so-called smart-CRC. Because of the structure of VBA you can afford to
skip variable names (and it is even more comfortable: who wants to look
them up in tables?). The structure of the code is stored in the code
itself, and it is enough to check that structure. You can easily see that
a macro is doing something like ? = ? + ?. And this is enough. With this
technique you can identify at least 90% of current virii.
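The difference between plain CRC and smart-CRC can be sketched in a few lines. This is only an illustration under stated assumptions: the (kind, text) token encoding is borrowed from the WB6 listing earlier in the article, the function names plain_crc and smart_crc are made up, and no real product is claimed to work exactly like this.

```python
import zlib


def plain_crc(tokens):
    """CRC over everything, so one changed tab or name gives a new value."""
    data = "".join(text for _kind, text in tokens)
    return zlib.crc32(data.encode("utf-8"))


def smart_crc(tokens):
    """CRC over the structure only: whitespace is skipped and every
    variable name is replaced by the placeholder '?'."""
    parts = []
    for kind, text in tokens:
        if kind == "space":
            continue
        parts.append("?" if kind == "variable" else text)
    # A separator byte keeps neighbouring tokens from running together.
    return zlib.crc32("\x00".join(parts).encode("utf-8"))


# a = b + c
original = [("variable", "a"), ("space", " "), ("operator", "="),
            ("space", " "), ("variable", "b"), ("operator", "+"),
            ("variable", "c")]
# Same statement with renamed variables and different whitespace.
renamed = [("variable", "x1"), ("operator", "="), ("variable", "y2"),
           ("space", "\t"), ("operator", "+"), ("variable", "z3")]
# Different structure: ? = ? - ?
changed = [("variable", "a"), ("operator", "="), ("variable", "b"),
           ("operator", "-"), ("variable", "c")]

print(plain_crc(original) == plain_crc(renamed))   # plain CRC is fooled
print(smart_crc(original) == smart_crc(renamed))   # smart-CRC is not
print(smart_crc(original) == smart_crc(changed))   # structure change shows
```

The last comparison is the whole point of the next paragraph: as soon as the structure itself changes, the smart-CRC changes too.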
Finally, if you want polymorphism you NEED to change the structure of the code, for
example swap lines or add garbage code. Then automatic AV technologies
will be smashed to the ground.
The next, and very old, way is scanstrings. It is easy to implement a scanstring-like scanner for macros. It can be exact, because you can follow either the structure of the code or the names. For example you may be looking for:
ToolsMacro .Name = something with .Edit
These two methods are very easy to implement but efficient in use.
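A scanstring of the kind shown above can be sketched as a list of literal pieces that must appear in order on one line, with anything allowed in between. The names SIGNATURE, matches_scanstring and scan, and the sample macro text, are all made up for this illustration.

```python
# Hypothetical scanstring for "ToolsMacro .Name = something with .Edit".
# The gaps between the pieces act as wildcards.
SIGNATURE = ["toolsmacro", ".name", "=", ".edit"]


def matches_scanstring(line, pieces):
    """True if all pieces occur in the line, in the given order."""
    text = line.lower()              # compare case-insensitively
    pos = 0
    for piece in pieces:
        found = text.find(piece, pos)
        if found < 0:
            return False
        pos = found + len(piece)     # the next piece must come after this one
    return True


def scan(source, pieces):
    """Return the 1-based numbers of the lines that match the scanstring."""
    return [number
            for number, line in enumerate(source.splitlines(), start=1)
            if matches_scanstring(line, pieces)]


# Made-up sample, not taken from any real macro.
sample = 'Sub MAIN\nToolsMacro .Name = "AutoOpen", .Show = 3, .Edit\nEnd Sub'
print(scan(sample, SIGNATURE))
```

Because the pieces are matched on names here, renaming does not matter only as long as the keywords themselves stay. A structure-based scanstring would match on token kinds instead.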
Some words about heuristics
The main problem with macro virii is that non-viral macros hardly exist
(in fact they are very rare). So AV people do not worry about how to detect,
but about how to distinguish between variants and how to find out that a macro
is not a virus :).
It is easy to follow the functions a macro is using, so writing macro virus
heuristics is easy. It is no problem to say: "this may be a macro virus",
because you can easily find "ToolsMacro .Edit", "Organizer .Copy" and other
shit a macro virus must be dealing with. But in fact that is not enough. There are
many legitimate self-installing macros that use macro copying functions.
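A passive heuristic of this kind is little more than a keyword search. The sketch below assumes a marker list built only from the names mentioned in this article (the choice of markers and the function name passive_flags are illustrative, not any real product's rules).

```python
# Hypothetical marker list, built from names mentioned in the article.
MARKERS = ("toolsmacro", ".edit", "organizer", ".copy", "macrocopy")


def passive_flags(source):
    """Return the markers found in the macro text; nothing is emulated."""
    text = source.lower()
    return [marker for marker in MARKERS if marker in text]


print(passive_flags('Organizer .Copy Source:="a"'))
print(passive_flags('MacroCopy "my_macro", "new_macro"'))
print(passive_flags('MsgBox "hello"'))
```

A legitimate self-installing macro that calls MacroCopy raises exactly the same flag as a virus, which is why flags like these are not enough on their own.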
In virii you can usually find something special ...
Very important flags for heuristics may be:
Because the structure of VBA is a bit more complex, it is hard to simulate. In fact all heuristics I know about are passive (meaning they just search for something), and I don't know of any big advances in more complex analysis. Simulating a whole macro is a difficult task: try writing the whole of VBA... I have heard that someone emulates variables, but I don't believe it. And I am almost sure there is no full emulator of VBA. So something like this
a$="tomacro"
b$="au"
c$=b$+a$
is hardly ever suspicious to scanners. And I can't imagine that this could be suspicious:
....
a$=""
b$=""
10 c$=b$+a$
if c$<>"" goto 20
a$="tomacro"
b$="au"
goto 10
20 ....
Of course this is a silly example, but I want to point out that without emulation that follows the code flow, emulating variables is useless. And you may use as many jumps as you want, or even functions, loops or something even more evil. If you want to make an AVer's day harder, just assign variables in a way that cannot be simulated from top to bottom. That means using goto to jump back with new values, and the heuristics will have to be much more complicated.
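The two examples above can be replayed with a toy model. This is only a sketch under loud assumptions: the tuple encoding of the statements, the statement kinds (let, cat, label, goto, if_nonempty_goto) and both function names are invented here, and nothing below describes a real scanner.

```python
def assign(stmt, env, seen):
    """Apply a 'let' or 'cat' statement; return False for anything else."""
    if stmt[0] == "let":                       # x$ = "literal"
        env[stmt[1]] = stmt[2]
    elif stmt[0] == "cat":                     # x$ = y$ + z$
        env[stmt[1]] = env.get(stmt[2], "") + env.get(stmt[3], "")
    else:
        return False
    seen.add(env[stmt[1]])                     # remember every value assigned
    return True


def top_down_values(program):
    """Variable emulation without code following: jumps are ignored."""
    env, seen = {}, set()
    for stmt in program:
        assign(stmt, env, seen)
    return seen


def following_values(program, max_steps=1000):
    """Variable emulation that follows jumps, with a step limit."""
    labels = {s[1]: i for i, s in enumerate(program) if s[0] == "label"}
    env, seen = {}, set()
    pc = steps = 0
    while pc < len(program) and steps < max_steps:
        stmt = program[pc]
        steps += 1
        pc += 1
        if assign(stmt, env, seen):
            continue
        if stmt[0] == "goto":
            pc = labels[stmt[1]]
        elif stmt[0] == "if_nonempty_goto" and env.get(stmt[1], ""):
            pc = labels[stmt[2]]
    return seen


simple = [("let", "a", "tomacro"), ("let", "b", "au"), ("cat", "c", "b", "a")]
tricky = [("let", "a", ""), ("let", "b", ""), ("label", 10),
          ("cat", "c", "b", "a"), ("if_nonempty_goto", "c", 20),
          ("let", "a", "tomacro"), ("let", "b", "au"), ("goto", 10),
          ("label", 20)]

print("automacro" in top_down_values(simple))    # top-down catches this one
print("automacro" in top_down_values(tricky))    # but misses this one
print("automacro" in following_values(tricky))   # following the goto finds it
```

Even this toy needs a step limit to survive endless loops, which hints at how much more a real emulator would have to handle.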
So that is all I want to say ....