Hello, that's me again, if you didn't get bored of me before. Originally this part was supposed to be written by another famouse coder, but he did not do that due to beeing short of time. Unfortunately, I'm much much shorter of them than him. But never mind: we are going to aim at heuristical principles in a short articles, but be sure to read also main articles about antiviruses.
The main reason of heurisics is, as I already mentioned,
to detect unkown viruses as many viruses appears every month and it becomes
difficult to keep track of them. First it was introduces by F-Prot (well,
some kind of, a bit hard to say if we can call it heuristic) and first
real implemantation in well-known TBAV.
At first, we should define what heuristics exactly is, I try it by my own:
heuristic scanner is a program (anti-virus, more exaclty) that is able
to detect viruses by analyzis of their code - what they do. But to decide
if the given code is a virus or not isn't easy even if it can look like it
is - it is difficult to made it reliable. If you have a look on viruses,
the code they use, if you have a look on many many viruses, like avers did,
you can easily tell the things that are common for all the viruses.
This are the beginings of heuristics - F-Prot used something that was called
in av-community "heuristics scan-strings". A short scan strings, searched
in whole body, of these typical constructions: like write command which
is typicaly mov ah, 40h; mov cx, 1234h (size) ; int 21h.
This is of course only illustration,
this can be done in many ways, but not that much to have most of them
factorized (using wild-card scan-stings). If several of these scan-strings
are found, you can say there is probably a virus. But many regular programs
written in assembler looks this way and not to have false possitives it
is required to hit many of these strings to report a possible virus
infection.
Some avers trusted to f-prot's reports of possible viruses, but presented
form wasn't quite reliable and moreover, it was not able to detect more
comlpicated pieces as it was set-up-ed to low sensitivity.
TBAV ruled the world
for a short time at least. Franz Heldman presented a brand new technology
called heuristics in excelent look. (but only for a first look). For a
first look all stared in amazement: avers because they even never think
about such a things (many of them are only doing their work without real
invetions), and vx-ers because it was able to detect even viruses they are
going to write. But reallity was a bit different: avers for a long time
didn't count TBAV's heuristics into scanning methods at all, they reported
heuristics as not reliable (mostly because they weren't able to replicate
this technology even in simple look as TBAV has). Virus writers started
to find a ways how to fool TBAV (as soon as they stopped affraid of it).
Let's see how TBScan works: it uses passive heuristics (structured
dissassembly) to analyze instructions. Main aim was as before - to detect
usual code-sequences found in viruses. Tbscan marked them with letters by
each file, and there were so many flags during years of development of tbav
that covers whole alphabet plus some other characters. Starting from
entry-point Tbscan checks instruction by instruction judging them and
marking known code-sequences. But thats not enought, for sure. Also jumps
are followed and on conditional jumps both paths are disassembled.
However, as dissassembly is done in single-pass, a simple tricks that
breakes intructions, etc can make tbscan to loose its track of code.
Also it is easy to fool tbscan by doing the things in non-usual way
or indirectly. As it disassembles the code, even simple mov ah, 3f; inc ah
were enought to do so.
Tbscan also has many false possitives due to its not-fully relieable
technology - when tbscan lost track of codeflow (that happens quite often)
it detects many flags on garbage code it finds. There were a quite long
database of files that are known false postitives - some kind of anti-scan-strings,
if found, heuristics is not performed on such a file.
TBAV's main weak point is it is so clear for everyone - even for virus
writers they may easily guess how it works - and how to avoid to be caught.
As soon as TBAV becomes popular, neartly everyone started to exclame
their features they are tbav-proof. All is needed, during programmig,
periodicaly run tbscan to see when it displays its flags. Well, main keypoints
to keep tbav far from you is to use good encryption (that can't be passed
by tbscan's decryptor), or to do things not as clearly as it is usual. Tbscan
detects only usual schemes, so simple tricks like and-s, add/sub on comparing
will work. However, tbscan is out of game today. There were also some plagiats,
a german 'Suspiciouse' (as I remember), but all they went as unsucessful
as tbav.
To fix these disadvantages it is possible to partialy find out the values of registers by semi-emulating of piece of code before key instruction (e.g. int 21). Only registers are emulated and memory access only for reading (not to damage memory). This is used for example by active heuristic scanners to analyze code they can't reach (we can call it local semi-emulation). In this stage doing mov/inc will not help, but doing rot-s or and-s instead of comparing will sure fool this alorythms.
Improoving heuristics - emulation
There were a lot of big words how to do heuristics in real way, to do
the things as they really are in file, but not runned. Someone may guess
a single-stepping might be used, but in reality it weren't ever used for it.
It is equivalent to running each file, but checking what you are executing.
But your automatic debugging (its somesthing like it) can't be used due to
many protective envelopes that are designed to crash debugers. In other words
single stepping was never used for active heuristics as it can crash several
times on a hard disc files per scan. I remember, for example, dedicated
scanner for EMM1:Level_3 that uses single-stepping. It hangs several times
in my utilities directory, runs many files (even pkzip), etc.
In fact, only emulation can be used for active heuristics - that is to
check for is file exactly doing, and to decide if it is viral code or not.
In this point of view, there are two primary objectives. First one is
a bit like before - to find out suspective code constructions, but it
is less important now. The more important is to monitor activities that
are really done. Let's imagine what virus usualy do - it tests something,
becomes resident (if it is a resident virus), and infects files on
some certain activity in system. Well, and for example becoming resident
can be easily caught by active heuristics even if it is done in unreadable
way - because it detects direct modifications (let's talk about dos now) of
0:[413] or MCBs. But what really defines a virus is a infection of other
files - if emulated program searches for executables, modifies them (in
order to replicate) it is virus nearl for sure. If virus only installs
itself into memory, a simple tests are run in virual machine - a file is
runned (and checked for infection), or opened for r/w, or opened on removable
drive (likely copied to floppy). This usualy notices any virus. Now you
can surely guess some tips how to fool them. But we have to continue:
Because active heuristics is not 100% stable, there is usualy still engine
for searching of typical constructions with local semi-emulation (mentioned
before). Capabilities of virus can be detected also this way - even if they
may not appear from the first view (or emulation ;-)
Now have a look at limits of emulators - this is primary subject to be
undetectable by active heuristics: there is virtual machine. Its main advantage
and disatvantage at the same time. Of course, it is not V86 virtual machine
you probably think, but emulated computer:
Currently leading heuristic scanners are NOD/iCE32 and Dr.Web - both of
them are using mentioned technologies (with also mentioned limitations), but
only for dos executables (how lucky). At the present time, none of them has
32bit emulator (I mean not written in 32bit, but fully emulating 32bit) and thats why
they can't perform active heuristic for Windows executables (PE/LX), and
viruses for windows are not that much affected by their heuristics power.
For 32bit Win executables they are using only passive part - i.e. disassembly
and searching for suspicious code constructions (typical viral sequences).
These two antiviruses has much less false possitives, as they need exact actions to judge the file as infected. (but they uses anti-scan-strings as well, because there are always false posstives). But the most amazing thing uppon them is quite detailed description they report for infected file (especialy by NOD) - they can find out if virus infect boot as well as com/exe files, if it infects sys files, if it is resident, stealth, polymorphical, etc. You can nicely see it on NOD/iCE in which the scan-strings can be turned of to use only a heurisics. The hit-rates running heuristics-only are quite impressive.
Other heurisics scanners
There are of course some other heuristics scanners in the world, but less
important. AVP has a kind of active heurisics too, as it is part of generic
decryption engine AVP has. But as AVP is the highest-standard antivirus
in the world and it has really lots of scanstrings, its heuristics can be
set-up-ed for lower sensitivity which also brings less false possitives.
Heuristics is also less visible, because it reports unknown virus really
rarely.
Simmilar situation is for Dr.Solomon's Toolkit (not Solomon's any more, of
course). In our tests we modified toolkit's viral databse not to have
any scanstrings to test heuristics only. Result was as expected: less
than 70% (slightly vary). You really don't need to affair of this
heuristical engine, if you can beat those mentioned above. Solomon added
only some very easy one (like AVP) to have some less hitrate also on
unknown viruses. But Solomon's policy in scan-strings was to add anything,
no matter if it is a virus - so they have a biggest hitrates without
any thinking of it (this is why I don't like it).
The worst heuristical scanner I know is AVG, time ago it has same weak
point as Tbscan has, even more - it shows emulation process (with optional
step-by-step confirmation) and you can see code and registers - and easily
test the bugs in it :) It was showen only to impress audience, because
it was really buggy and useless. It was several times improoved, but
without reasonable result - first versions were extremly slow and buggy.
Afterwards, they used new scanning/heuristical core (developed by someone
else who joined their team) which is a bit faster and better, but still
pretty weak.
To finish a overview of others I have to mention NAI as well. But it
is rather easy to accomplish, because NAI has no own technology (or really
very little - only some programmers that downgrade buyed technology by
putting it together with others). NAI buys anything that can be buyed,
currently as far as I know they are using engine of dr.solomon with
roughly same capabilities as dr.solomon. May be they'll try to buy another
heuristical scanner... Who knows...
Heuristical cleaning
Now we are in second, more-less important chapter of this article. Heuristical
cleaning was firstly presented by TBAV, program named Tbclean. But at first
we have to explain what heuristical cleaning is: a cleaning of virus from
file without knowing virus exactly, just by tracing it or more complex
automated analyzis. But heuristical cleaning is less important than
scanning, because it is much more reliable and also much less used. Moreover,
the hitrate is analyzed in test-tables, not these high-tech features.
Tbclean performs it in most easy way. As TBAV was lack of emulator engine,
Tbclean uses single-stepping to trace program. You can surely guess it
will not work in many cases. Of course - it crashes (whole computer) on
protective envelopes, and sometimes also on usual programs. But it sometimes
works. Principle was simple tracing virus, because virus when it does
usual things, reconstructs host body and passes control there. Idea is
to allow reconstruction (but disallow instalation, if possible), and on
jump to host body make a snapshot of reconstructed file. Passing control
back to host was detected by jumping (or ret or whatever) to offset 100h
(for com's), or far-jump (retf respectively) for exe files. Nearly every
virus ends this way. All is needed is to write image back to disk and
work is done (it is verry simmilar to exe-unpackers with tracing).
To prevent instalation, tbscan for example returns for GetDosVersion call
version 2, that most of viruses refuses. Simple and effective. But there
were (are) also many other tricks. Some of them you may guess if you want
to write simmilar cleaner: prevent of some instructions (like cli, i/o
ports, hard stack modifications, filter interrupts (they are redirected
not to be accidentaly infected - int21 for example usualy returns error
(carry set)). Big problem it to find out where to cut-down the file.
File can be easily reconstructed, but cleaner don't know where to cut it.
There are several possibilities:
Tbclean was first and really simple and buggy. It crashed very often, in many cases reconstructed file was corrupted, and moreover - as it becomes pretty famouse there were tricks like in virus Varicella, that was really executed whey it was cleaned by tbclean (this virus mades Franz Heldman really angry ;) But in these days it is forgotten as well as whole TBAV.
Emulation makes it reliable
Yes, thats right. With a new generation of heuristical scanners (Dr.Web
and NOD/iCE), there are better possibilities to clean file - and not to crash.
Principle remains the same - emulate virus as much as possible, to find
out as many information as possible. This is some deep heavy woodoo magic
of avers - thats why only really few of them can do this. Of course, much
better is exact disinfection which is much more valueable (like AVP has),
but to impress us (vx-ers) and other avers, this magic is here ;-)
At the present time, however, it is functional enought only in NOD/iCE
(pretty impressive, but due to mentioned limitation of their emulator it doesn't
work for 32bit files (win)). Heuristics cleaning is also used in AVG, but
I would say the same as to their heuristical scanner: it was really poor
time ago, and after upgrading core it is still not enought. Finally,
there is Dr.solomon (which is also used as core engine in NAI's now)
but it doesn't perform generic cleaning at all (I mean cleaning unknown
viruses). As far as I have informations,
it is only used to disinfect virus that are possitively known as cleanable.
Thats why you will not notice a heuristical cleaning at all. Reffering to
things above, I will focus primarily on NOD/iCE as it uses imho best technologies
of all mentioned.
Generic idea is same - to emulate virus to allow him reconstruct host file,
and use retrieved infromation to repair host. All is done in emulated virtual
PC, with emulated disks. The simpliest way of disintection is to save
reconstructed file and to cut out virus. As there are several types of
appending virus to file, it might slightly differ on technology how to
cut it out. Clever technology, if virus works fine in emulator, is to virtually
infect several virtual files (goats) to find out size of virus and where
the virus is stored (by diffing with original file). This way they may guess
nearly exact the infection methods of virus. For combined boot/exe/com viruses,
as they are usualy installing itself to mbr, a virtual reboot is done to
activate them in virtual pc after they installs into virtual mbr. (oops, so
virtually ;) After guessing the size and knowing where virus stores itself
in file, a real infected file might be run (virtually, again) to repair
host, find out entry-point, and cut it using informations retrieved before
(if they are not available, some alchemy is done or virus body is left
there).
This way it looks simply, but isn't. I guess heuristics scanning and heuristics
cleaning is top high-tech technology avers are using now (also reffer
to my overview of antivirus methods). Even if the principles and ideas
are simple there is lots of things to be done to make virus work in virtual
pc, so the complications you might prepare for heuristics (cleaning especialy)
might be awarded by your success. Good luck!
flush