David Chess, Steve White
 Virus Bulletin Conference
 September 2000
 Download PDF (32.87Kb) (You need to be registered on forum)
Download PDF (32.87Kb) (You need to be registered on forum)One of the few solid theoretical results in the study of computer viruses is Cohen's 1987 demonstration that there is no algorithm that can perfectly detect all possible viruses [1]. This brief paper adds to the bad news, by pointing out that there are computer viruses which no algorithm can detect, even under a somewhat more liberal definition of detection. We also comment on the senses of "detect" used in these results, and note that the immediate impact of these results on computer virus detection in the real world is small.
Consider the set of programs which produce one or more programs as output. For any pair of programs  and
 and  ,
,  eventually produces
 eventually produces  if and only if
 if and only if  produces
 produces  either directly or through a series of steps (the "eventually produces" relation is the transitive closure of the "produces" relation.) A viral set is a maximal set of programs
 either directly or through a series of steps (the "eventually produces" relation is the transitive closure of the "produces" relation.) A viral set is a maximal set of programs  such that for every pair of programs
 such that for every pair of programs  and
 and  in
 in  ,
,  eventually produces
 eventually produces  , and
, and  eventually produces
 eventually produces  . ("Maximal" here means that there is no program
. ("Maximal" here means that there is no program  not in the set that could be added to the set and have the set still satisfy the conditions.) For the purposes of this paper, a computer virus is a viral set; a program
 not in the set that could be added to the set and have the set still satisfy the conditions.) For the purposes of this paper, a computer virus is a viral set; a program  is said to be an instance of, or to be infected with, a virus
 is said to be an instance of, or to be infected with, a virus  precisely when
 precisely when  is a member of the viral set
 is a member of the viral set  . A program is said to be infected simpliciter when there is some viral set
. A program is said to be infected simpliciter when there is some viral set  of which it is a member. A program which is an instance of some virus is said to spread whenever it produces another instance of that virus. The simplest virus is a viral set that contains exactly one program, where that program simply produces itself. Larger sets represent polymorphic viruses, which have a number of different possible forms, all of which eventually produce all the others.
 of which it is a member. A program which is an instance of some virus is said to spread whenever it produces another instance of that virus. The simplest virus is a viral set that contains exactly one program, where that program simply produces itself. Larger sets represent polymorphic viruses, which have a number of different possible forms, all of which eventually produce all the others.
 
	Figure 1. The shapes represent programs, and the arrows show which programs produce which as output. The filled shapes are members of viral sets, the empty shapes are not. The filled hexagon represents a simple non-polymorphic virus, whose sole member produces only itself.
In practical terms, this notion of computer virus encompasses overwriting viruses (which replace existing programs with copies of themselves) and some kinds of worms (which spread as standalone programs by creating new copies of themselves). A more complex notion of computer virus would incorporate "parasitic" viruses, which infect other programs by inserting themselves in such a way that both the viral code and the original program are executed when the infected program is executed. (The classic informal definition of "computer virus" is "a program that can 'infect' other programs by modifying them to include a possibly evolved copy of itself." [1]. A more formal definition in terms of regions of a Turing Machine tape can be found in [2].) In a subsequent paper, we will extend the current results to that richer notion of computer virus; essentially all the results we obtain here still hold.
For the purposes of this paper, an algorithm  detects a virus
 detects a virus  if and only if for every program
 if and only if for every program  ,
,  terminates, and returns "true" if and only if
 terminates, and returns "true" if and only if  is infected with
 is infected with  . Similarly, an algorithm
. Similarly, an algorithm  detects a set of viruses
 detects a set of viruses  if and only if for every program
 if and only if for every program  ,
,  terminates, and returns "true" if and only if
 terminates, and returns "true" if and only if  is infected with some virus
 is infected with some virus  which is a member of
 which is a member of  . This is essentially Cohen's definition in [1], and it is the only formal definition of detection that has proven theoretically fruitful. It also captures (at least to a first approximation) our intuitive notion of computer virus detection.
. This is essentially Cohen's definition in [1], and it is the only formal definition of detection that has proven theoretically fruitful. It also captures (at least to a first approximation) our intuitive notion of computer virus detection.
In [1], Fred Cohen demonstrates that there is no algorithm that can detect the set of all possible computer viruses (returning "true" if and only if its input is an object infected with some computer virus). The proof is a simple diagonal argument, like Cantor's proof of the uncountability [3] of the real numbers, or Turing's proof of the undecidability of the Halting Problem [4]. For any candidate computer virus detection algorithm  , there is a program
, there is a program  , which reads:
, which reads:
Clearly  does not return the correct result when called on
 does not return the correct result when called on  , since if it returns "true" (i.e. it says that
, since if it returns "true" (i.e. it says that  is infected), then
 is infected), then  just exits (and is therefore not infected), whereas if
 just exits (and is therefore not infected), whereas if  returns anything else (i.e. it says that
 returns anything else (i.e. it says that  is not infected), then
 is not infected), then  spreads (and is therefore infected).1 So there is no algorithm which detects all viruses without error; any program that attempts to detect all viruses will either miss some infected files (a false negative), accuse some non-infected files of being infected (a false positive) or fail to return anything (a bug).
 spreads (and is therefore infected).1 So there is no algorithm which detects all viruses without error; any program that attempts to detect all viruses will either miss some infected files (a false negative), accuse some non-infected files of being infected (a false positive) or fail to return anything (a bug).
A very similar example demonstrates that there are viruses for which no error-free detection algorithm exists. That is, not only can we not write a program that detects all viruses known and unknown with no false positives, but in addition there are some viruses for which, even when we have a sample of the virus in hand and have analyzed it completely, we cannot write a program that detects just that particular virus with no false positives.2
As noted above, a virus is said to be "polymorphic" if the size of the viral set is greater than one; that is, if the code of the virus is different in different infected objects. Consider a virus which is sufficiently polymorphic that for any implementable algorithm  the program
 the program  :
:
is an instance of the virus (provided of course that  actually spreads). There is no algorithm
 actually spreads). There is no algorithm  that correctly detects this virus, by an argument directly analogous to that above: for any algorithm
 that correctly detects this virus, by an argument directly analogous to that above: for any algorithm  that claims to detect this virus, there is a program
 that claims to detect this virus, there is a program  :
:
for which  does not return the correct result. If
 does not return the correct result. If  returns true, then
 returns true, then  does not spread, and is therefore not an instance of this or any other virus; whereas if
 does not spread, and is therefore not an instance of this or any other virus; whereas if  returns false, then
 returns false, then  does spread, and is an instance of the virus.
 does spread, and is an instance of the virus.
Is any possible actual virus sufficiently polymorphic to have this property? Clearly yes. Consider a virus  one instance of which is
 one instance of which is  :
:
For any candidate  -detection algorithm
-detection algorithm  , there is a program
, there is a program  :
:
for which  does not return the correct result; if
 does not return the correct result; if  returns true, then
 returns true, then  just exits (and is therefore not an instance of
 just exits (and is therefore not an instance of  , or of any other virus), whereas if
, or of any other virus), whereas if  returns false, then
 returns false, then  is an instance of
 is an instance of  . So no algorithm can detect
. So no algorithm can detect  without error.3
 without error.3
There is a looser notion of detection under which our result still holds. We may be willing to forgive a candidate  -detection algorithm for claiming to find
-detection algorithm for claiming to find  in some program
 in some program  which is not infected with
 which is not infected with  , provided that
, provided that  is infected with some virus. Let us say, then, that an algorithm
 is infected with some virus. Let us say, then, that an algorithm  loosely-detects a virus
 loosely-detects a virus  if and only if for every program
 if and only if for every program  ,
,  terminates, returning "true" if
 terminates, returning "true" if  is infected with
 is infected with  , and returning something other than "true" if
, and returning something other than "true" if  is not infected with any virus. The algorithm may return any result at all for programs infected with some virus other than
 is not infected with any virus. The algorithm may return any result at all for programs infected with some virus other than  (although it must still terminate).
 (although it must still terminate).
 
	Figure 2. The slanted lines show the (perfect) detection of the viral set of filled ovals; the algorithm picks out exactly those programs infected with that virus. The vertical lines show loose-detection of the viral set consisting of the filled hexagons; the algorithm picks out all the programs in that viral set, as well as some other infected programs.
It is clear that our result still applies under this looser notion of detection. Since every algorithm either returns true for a program which simply exits, or fails to return true for some program infected with  , no algorithm even loosely-detects
, no algorithm even loosely-detects  .
.
Our result is clearly complementary to Cohen's result in [1] that no algorithm can detect all viruses. That result may be expressed as
 
whereas our results are
 
 
	Figure 3. Cohen's result says that it is impossible for a program to perfectly draw the solid line suggested above, enclosing all and only those programs that are infected with some virus. For every program that attempts to draw that line, there will be some infected object that the program says is uninfected, or some uninfected object that it says is infected.
 
	Figure 4. Our result says that for some viruses it is impossible for a program to correctly draw the solid line suggested above, enclosing all those programs that are infected with that virus, and enclosing no programs that are not infected with any virus.
The virus  above is clearly not a remarkably viable virus, nor is the failure of detection a particularly serious one. In particular, no user of a real virus detection program
 above is clearly not a remarkably viable virus, nor is the failure of detection a particularly serious one. In particular, no user of a real virus detection program  would object if it were to say that it found a virus in
 would object if it were to say that it found a virus in  :
:
or in  :
:
since, although these do not in fact spread, they are closely related to programs that do spread. The main immediate impact of Cohen's undecidability result on the day-to-day activities of those who study computer viruses is that we can dismiss without detailed study any claim that some method correctly detects "all possible viruses known and unknown". The main practical impact of the current result is to dispel the notion that it is always possible to create a detector for a given virus that has no false positives, even if you have a copy of the virus in hand.
When we say in common speech that a given program detects a given virus, we mean something rather different from the formal senses above. Every widely-deployed virus detection program in use today will claim to find a virus in at least some non-viral objects (a false positive), because the methods used for detection are approximate, based on the presence of a particular binary string in a certain place, on the calculation of the finite-size checksum of a macro, on a certain pattern of changes to a file, and so on. Producers of anti-virus software of course try to minimize the number of actual non-viral programs that are falsely detected. But no one worries about the fact that the algorithms used to detect viruses produce false positives on an enormous number of non-viral objects that have never been, and will never be, present on any actual user's computer. This paper's title, then, is deliberately somewhat provocative: while the viruses that we present here are undetectable in the strict formal sense of the term, there is no reason to think that it is impossible to write a program that would detect them sufficiently well for all practical purposes.
Acceptable virus detection, in the real world, involves detecting all viable instances of the virus in question, and preferably some number of minor variants of it, while falsely detecting the virus in only a vanishingly small number of innocent programs that are actually present on a computer somewhere. It would be helpful to have a formal characterization of this more realistic notion of detection; theorists in the area of computer virus protection might usefully work toward such a characterization.
The example undetectable virus in the body of the paper requires access to an arbitrarily-large stream of random bits; many formalizations of the notion of algorithm do not allow this. This Appendix gives a slightly more complex example that does not require any random bits. Understanding the main results of this paper does not require understanding this Appendix.
Consider the virus  one instance of which is
 one instance of which is  :
:
where  is a mapping from the integers to the set of programs for the relevant machine, such that for every implementable algorithm
 is a mapping from the integers to the set of programs for the relevant machine, such that for every implementable algorithm  , there is some integer
, there is some integer  such that
 such that  maps
 maps  to
 to  . Many such mappings of course exist for the typical machine. (Note that the variable serial_number must be stored as an unlimited-size integer (a "bignum").)
. Many such mappings of course exist for the typical machine. (Note that the variable serial_number must be stored as an unlimited-size integer (a "bignum").)
All instances of the above which begin "state = 0;" are viral, and trivial to detect; they differ only in the second line, where serial_number is set. Every program which differs from the above only in having a different non-negative integer in the second line is an element of  . Instances of the above which begin "state = 1;", on the other hand, are only sometimes viral, since the content of subroutine_one() is different, and will often not terminate at all, or will return true (causing the program to exit before spreading). Those that are viral create, in the next generation, exactly the above code again, so every element of
. Instances of the above which begin "state = 1;", on the other hand, are only sometimes viral, since the content of subroutine_one() is different, and will often not terminate at all, or will return true (causing the program to exit before spreading). Those that are viral create, in the next generation, exactly the above code again, so every element of  is an ancestor, eventually, of every other (and
 is an ancestor, eventually, of every other (and  is therefore a viral set). For any candidate
 is therefore a viral set). For any candidate  -detection algorithm
-detection algorithm  , consider program
, consider program  :
:
where  maps
 maps  to an implementation of
 to an implementation of  . Just as described above,
. Just as described above,  does not return the correct result when called on
 does not return the correct result when called on  , so
, so  does not detect
 does not detect  . And because
. And because  either returns "true" on a program that simply exits, or returns something other than "true" on a program infected with
 either returns "true" on a program that simply exits, or returns something other than "true" on a program infected with  ,
,  also fails to loosely-detect
 also fails to loosely-detect  in the sense defined above.
 in the sense defined above.
1 Note that A is not an input to p here; every time p is run, it calls A on itself, and spreads if and only if A returns false. The program p therefore always spreads, or always exits, regardless of any input.
2 A similar proof, showing that no Turing Machine program can decide if one virus "evolves" into another, can be found in [2], but as far as we are aware the implications of that result for virus detection have never been explored.
3 This example assumes that P has access to an arbitrarily-long stream of random bits; some formalizations of the notion of algorithm do not allow this. See the appendix for a somewhat more complex example that does not require any random bits.
[Back to index] [Comments (0)]