David Chess, Steve White
Virus Bulletin Conference
September 2000
One of the few solid theoretical results in the study of computer viruses is Cohen's 1987 demonstration that there is no algorithm that can perfectly detect all possible viruses [1]. This brief paper adds to the bad news, by pointing out that there are computer viruses which no algorithm can detect, even under a somewhat more liberal definition of detection. We also comment on the senses of "detect" used in these results, and note that the immediate impact of these results on computer virus detection in the real world is small.
Consider the set of programs which produce one or more programs as output. For any pair of programs and , eventually produces if and only if produces either directly or through a series of steps (the "eventually produces" relation is the transitive closure of the "produces" relation.) A viral set is a maximal set of programs such that for every pair of programs and in , eventually produces , and eventually produces . ("Maximal" here means that there is no program not in the set that could be added to the set and have the set still satisfy the conditions.) For the purposes of this paper, a computer virus is a viral set; a program is said to be an instance of, or to be infected with, a virus precisely when is a member of the viral set . A program is said to be infected simpliciter when there is some viral set of which it is a member. A program which is an instance of some virus is said to spread whenever it produces another instance of that virus. The simplest virus is a viral set that contains exactly one program, where that program simply produces itself. Larger sets represent polymorphic viruses, which have a number of different possible forms, all of which eventually produce all the others.
Figure 1. The shapes represent programs, and the arrows show which programs produce which as output. The filled shapes are members of viral sets, the empty shapes are not. The filled hexagon represents a simple non-polymorphic virus, whose sole member produces only itself.
In practical terms, this notion of computer virus encompasses overwriting viruses (which replace existing programs with copies of themselves) and some kinds of worms (which spread as standalone programs by creating new copies of themselves). A more complex notion of computer virus would incorporate "parasitic" viruses, which infect other programs by inserting themselves in such a way that both the viral code and the original program are executed when the infected program is executed. (The classic informal definition of "computer virus" is "a program that can 'infect' other programs by modifying them to include a possibly evolved copy of itself." [1]. A more formal definition in terms of regions of a Turing Machine tape can be found in [2].) In a subsequent paper, we will extend the current results to that richer notion of computer virus; essentially all the results we obtain here still hold.
For the purposes of this paper, an algorithm detects a virus if and only if for every program , terminates, and returns "true" if and only if is infected with . Similarly, an algorithm detects a set of viruses if and only if for every program , terminates, and returns "true" if and only if is infected with some virus which is a member of . This is essentially Cohen's definition in [1], and it is the only formal definition of detection that has proven theoretically fruitful. It also captures (at least to a first approximation) our intuitive notion of computer virus detection.
In [1], Fred Cohen demonstrates that there is no algorithm that can detect the set of all possible computer viruses (returning "true" if and only if its input is an object infected with some computer virus). The proof is a simple diagonal argument, like Cantor's proof of the uncountability [3] of the real numbers, or Turing's proof of the undecidability of the Halting Problem [4]. For any candidate computer virus detection algorithm , there is a program , which reads:
Clearly does not return the correct result when called on , since if it returns "true" (i.e. it says that is infected), then just exits (and is therefore not infected), whereas if returns anything else (i.e. it says that is not infected), then spreads (and is therefore infected).1 So there is no algorithm which detects all viruses without error; any program that attempts to detect all viruses will either miss some infected files (a false negative), accuse some non-infected files of being infected (a false positive) or fail to return anything (a bug).
A very similar example demonstrates that there are viruses for which no error-free detection algorithm exists. That is, not only can we not write a program that detects all viruses known and unknown with no false positives, but in addition there are some viruses for which, even when we have a sample of the virus in hand and have analyzed it completely, we cannot write a program that detects just that particular virus with no false positives.2
As noted above, a virus is said to be "polymorphic" if the size of the viral set is greater than one; that is, if the code of the virus is different in different infected objects. Consider a virus which is sufficiently polymorphic that for any implementable algorithm the program :
is an instance of the virus (provided of course that actually spreads). There is no algorithm that correctly detects this virus, by an argument directly analogous to that above: for any algorithm that claims to detect this virus, there is a program :
for which does not return the correct result. If returns true, then does not spread, and is therefore not an instance of this or any other virus; whereas if returns false, then does spread, and is an instance of the virus.
Is any possible actual virus sufficiently polymorphic to have this property? Clearly yes. Consider a virus one instance of which is :
For any candidate -detection algorithm , there is a program :
for which does not return the correct result; if returns true, then just exits (and is therefore not an instance of , or of any other virus), whereas if returns false, then is an instance of . So no algorithm can detect without error.3
There is a looser notion of detection under which our result still holds. We may be willing to forgive a candidate -detection algorithm for claiming to find in some program which is not infected with , provided that is infected with some virus. Let us say, then, that an algorithm loosely-detects a virus if and only if for every program , terminates, returning "true" if is infected with , and returning something other than "true" if is not infected with any virus. The algorithm may return any result at all for programs infected with some virus other than (although it must still terminate).
Figure 2. The slanted lines show the (perfect) detection of the viral set of filled ovals; the algorithm picks out exactly those programs infected with that virus. The vertical lines show loose-detection of the viral set consisting of the filled hexagons; the algorithm picks out all the programs in that viral set, as well as some other infected programs.
It is clear that our result still applies under this looser notion of detection. Since every algorithm either returns true for a program which simply exits, or fails to return true for some program infected with , no algorithm even loosely-detects .
Our result is clearly complementary to Cohen's result in [1] that no algorithm can detect all viruses. That result may be expressed as
whereas our results are
Figure 3. Cohen's result says that it is impossible for a program to perfectly draw the solid line suggested above, enclosing all and only those programs that are infected with some virus. For every program that attempts to draw that line, there will be some infected object that the program says is uninfected, or some uninfected object that it says is infected.
Figure 4. Our result says that for some viruses it is impossible for a program to correctly draw the solid line suggested above, enclosing all those programs that are infected with that virus, and enclosing no programs that are not infected with any virus.
The virus above is clearly not a remarkably viable virus, nor is the failure of detection a particularly serious one. In particular, no user of a real virus detection program would object if it were to say that it found a virus in :
or in :
since, although these do not in fact spread, they are closely related to programs that do spread. The main immediate impact of Cohen's undecidability result on the day-to-day activities of those who study computer viruses is that we can dismiss without detailed study any claim that some method correctly detects "all possible viruses known and unknown". The main practical impact of the current result is to dispel the notion that it is always possible to create a detector for a given virus that has no false positives, even if you have a copy of the virus in hand.
When we say in common speech that a given program detects a given virus, we mean something rather different from the formal senses above. Every widely-deployed virus detection program in use today will claim to find a virus in at least some non-viral objects (a false positive), because the methods used for detection are approximate, based on the presence of a particular binary string in a certain place, on the calculation of the finite-size checksum of a macro, on a certain pattern of changes to a file, and so on. Producers of anti-virus software of course try to minimize the number of actual non-viral programs that are falsely detected. But no one worries about the fact that the algorithms used to detect viruses produce false positives on an enormous number of non-viral objects that have never been, and will never be, present on any actual user's computer. This paper's title, then, is deliberately somewhat provocative: while the viruses that we present here are undetectable in the strict formal sense of the term, there is no reason to think that it is impossible to write a program that would detect them sufficiently well for all practical purposes.
Acceptable virus detection, in the real world, involves detecting all viable instances of the virus in question, and preferably some number of minor variants of it, while falsely detecting the virus in only a vanishingly small number of innocent programs that are actually present on a computer somewhere. It would be helpful to have a formal characterization of this more realistic notion of detection; theorists in the area of computer virus protection might usefully work toward such a characterization.
The example undetectable virus in the body of the paper requires access to an arbitrarily-large stream of random bits; many formalizations of the notion of algorithm do not allow this. This Appendix gives a slightly more complex example that does not require any random bits. Understanding the main results of this paper does not require understanding this Appendix.
Consider the virus one instance of which is :
where is a mapping from the integers to the set of programs for the relevant machine, such that for every implementable algorithm , there is some integer such that maps to . Many such mappings of course exist for the typical machine. (Note that the variable serial_number must be stored as an unlimited-size integer (a "bignum").)
All instances of the above which begin "state = 0;" are viral, and trivial to detect; they differ only in the second line, where serial_number is set. Every program which differs from the above only in having a different non-negative integer in the second line is an element of . Instances of the above which begin "state = 1;", on the other hand, are only sometimes viral, since the content of subroutine_one() is different, and will often not terminate at all, or will return true (causing the program to exit before spreading). Those that are viral create, in the next generation, exactly the above code again, so every element of is an ancestor, eventually, of every other (and is therefore a viral set). For any candidate -detection algorithm , consider program :
where maps to an implementation of . Just as described above, does not return the correct result when called on , so does not detect . And because either returns "true" on a program that simply exits, or returns something other than "true" on a program infected with , also fails to loosely-detect in the sense defined above.
1 Note that A is not an input to p here; every time p is run, it calls A on itself, and spreads if and only if A returns false. The program p therefore always spreads, or always exits, regardless of any input.
2 A similar proof, showing that no Turing Machine program can decide if one virus "evolves" into another, can be found in [2], but as far as we are aware the implications of that result for virus detection have never been explored.
3 This example assumes that P has access to an arbitrarily-long stream of random bits; some formalizations of the notion of algorithm do not allow this. See the appendix for a somewhat more complex example that does not require any random bits.
[Back to index] [Comments (0)]