Introduction
If we can handle a target as complex as PE files, we still face the sad
fact that we can infect files on the Intel platform but can never get
outside it. A rare exception to this axiom is the virus Esperanto (by
Mr. Sandman, published in 29A Nr. 2), the first of its kind, capable of
spreading across various platforms and processors. Glory goes to
Mr. Sandman, but unfortunately this approach cannot be used for larger
projects. Esperanto's whole solution is based on the presence of two
parts, one for Intel processors and the other for Macs, practically
doubling the size of the necessary code. It doesn't seem to be the ideal
solution: imagine 50 kB of viral code for three processors and we will
land somewhere around a 150 kB maxivirus.
Idea
I would solve this problem using another approach. My approach would be
more difficult (but not impossible) to code.
I state here that I am not ready to participate in such a project (no
time and morale left). I would like to find some newbies or people ready
to work hard. The idea is quite simple: we should carry the body in some
kind of pre-compiled state, which should be easy to translate into the
assembly language of every single target processor.
Imagine we have a C compiler which produces output at a level between C and assembly language. "Between C and assembly" means that before the code is assembled, it has to be compiled by a special C compiler. In fact the code should be at the lowest level it can be, because we need to assemble it for various architectures. Because of this, the code should be independent of registers and memory addressing modes. The model I like best is a stack machine (using RPN, reverse Polish notation) with a direct memory addressing mode (only the value on top of the stack is a memory address). Of course, this means the compiled "code" will be larger than regular Intel code.
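To make the stack-machine model concrete, here is a minimal sketch of a generic interpreter for such an intermediate form, of the kind found in any compiler textbook. The opcode names (push, load, store, add) and the flat list used as memory are my assumptions for illustration; the text above does not fix any instruction set. The statement c = a + b becomes the RPN sequence "a load, b load, add, c store", and the only way memory is touched is through an address sitting on top of the stack.

```python
# Toy stack machine: every opcode name here is hypothetical, chosen for illustration.
def run(program, memory):
    """Execute a list of (opcode, *operands) tuples against a flat memory list."""
    stack = []
    for op, *args in program:
        if op == "push":        # push an immediate value (here: a memory address)
            stack.append(args[0])
        elif op == "load":      # pop an address, push the value stored there
            stack.append(memory[stack.pop()])
        elif op == "store":     # pop an address, then a value; write the value
            addr = stack.pop()
            memory[addr] = stack.pop()
        elif op == "add":       # pop two values, push their sum
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b)
        else:
            raise ValueError(f"unknown opcode: {op}")
    return memory

# Addresses of the variables a, b and c in the toy memory.
A, B, C = 0, 1, 2

# c = a + b in reverse Polish order.
program = [
    ("push", A), ("load",),
    ("push", B), ("load",),
    ("add",),
    ("push", C), ("store",),
]

print(run(program, [3, 4, 0]))  # [3, 4, 7]
```

Note that the sequence names no registers and no addressing modes, which is the property asked for above. It also shows the price: seven operations for a statement that a register machine handles in two or three instructions.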
The resulting code for a given processor could gain quite high variability this way (every single translation could use different instructions or registers). Also, if the resulting code is close enough to code produced by C compilers (a standard stack frame, analogous use of the stack, registers, variables and so on), it would be very hard to tell apart by heuristics without further analysis. And it will be even harder (if not impossible) to distinguish between variants. This would make complicated (and unemulatable) polymorphic routines, decryptors and the like redundant. The only condition for not being a simple target is to keep the "source" (which is by its nature more or less static) encoded and to decode it only when it needs to replicate.
Of course, the precompiled "source code" has to contain an "assembler" for every supported processor. The assembler, as the heart of the body, gives a virus its variability and complexity, so detection is only as hard as the assembler is good. That is the reason why such virii can be very long. Just 5 kB, as for classic poly routines, will not be enough. That is the reason why such viruses (probably) wouldn't be spread by mail. But apart from this, the code will be very similar to that of standard languages. You needn't deal with infecting files in general; you can link your data area wherever you need, so you need not use writable sections for code, which is in my opinion the strongest heuristic flag.
Real time compiling
Another possibility is to compile code at run time: you needn't have the
whole code compiled in the host file. You can compile it at the time you
need it. This may at least reduce the amount by which the file size
grows. I am not sure whether this is safe enough to stay invisible, but I
think compilation is complex enough to slow emulation down, and may make
scanning speed unacceptable, so avers will have to find new ways of
detecting.
Code morphing
Another advantage is the BIG potential for modifying the pre-compiled
code, because you know exactly what your code means and what kinds of
modifications can be performed on it. Because each new one inherits its
code from its parent, within 10 generations there can be a very big
difference between existing variants. Just imagine block permutations
(modules or just functions) and minor changes in code like
c=a+b -> c=b+a. I think this is good enough to totally change the look of
virii from parent to child, not to mention the differences between
distant variants. A bit more complex changes are possible too; of course
that depends on the source language and on you.
Disadvantages - size
As I see it, the main disadvantage is size. Because the necessary
technologies are rather difficult to implement, I don't even hope that
the resulting code will be smaller than 50 kB, which is imho a bit of a
problem these days.
First of all, you can't use a mailing strategy for spreading. It takes
some time to download 150 kB of mail :-(. I have heard that 300 kB is
nothing, and media with 100MBs throughput really are coming, but the main
limiting factor is the floppy disk/internet, and we still live in a world
where 3 kB/s is a high speed (33k6 modems are quite usual for internet
use from home).
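The figures above can be checked with a few lines of arithmetic. This is only a back-of-the-envelope sketch: it assumes 1 kB = 1000 bytes and ignores protocol overhead and modem compression, neither of which is discussed in the text. The function name is made up for the example.

```python
def transfer_seconds(size_kb, rate_kb_per_s):
    """Time to move size_kb kilobytes at a sustained rate of rate_kb_per_s."""
    return size_kb / rate_kb_per_s

# Raw line rate of a 33k6 modem: 33,600 bit/s, 8 bits per byte, 1000 bytes per kB.
MODEM_BITS_PER_S = 33_600
raw_kb_per_s = MODEM_BITS_PER_S / 8 / 1000

print(raw_kb_per_s)                   # 4.2
print(transfer_seconds(150, 3))       # 50.0
print(transfer_seconds(300, 3))       # 100.0
```

So the raw line rate of 4.2 kB/s is consistent with about 3 kB/s in practice, and at that speed 150 kB takes roughly 50 seconds while 300 kB takes 100 seconds.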
There can be some problems at the interface level (the level where the host file and the virus are directly connected). We are not far enough along to say whether it can be handled entirely by the compiler or whether it needs special handling with PE- and platform-dependent code. But it should not be a big problem.
And now some sci-fi:
Probably the first reason we started with all this stuff was to try out
how genetics would work in vx. This approach gives you much better
control over code modularization and code generation. Our first idea was
to create virii able to exchange modules with one another in order to
optimize themselves and adapt to the current environment. This gives a
much better probability of survival, but it requires an environment with
a strong exchange of genes, which is difficult to create. And now back to
the real world ...
Closing
And now some closing words. The main advantage of pre-compiled code is
the possibility of cross-platform infection. Besides this, the approach
opens new horizons, at least at the level of today's poly engines, and in
the eternal 'game of hiding the body' it moves more in the direction of
giving the virus body the 'right color' than of building 'bullet-proof'
walls of anti code. This in no way leads to lower variability of the
code. With these features, the concept leads to viruses which are
TMC-like.
Another plus is that programming in an HLL is more comfortable and faster, read: more effective, not to mention the base address independence :-).
Think about it !