Guillaume Bonfante, Matthieu Kaczmarek, Jean-Yves Marion
Lecture Notes in Computer Science, volume 3722, pp.579-593. Springer, Oct 2005.
October 2005
We are concerned with theoretical aspects of computer viruses. For this, we suggest a new definition of viruses which is clearly based on the iteration theorem and above all on Kleene's recursion theorem. We show that we capture in a natural way previous definitions, and in particular the one of Adleman. We establish generic constructions in order to construct viruses, and we illustrate them by various examples. We discuss about the relationship between information theory and virus and we propose a defense against some kind of viral propagation. Lastly, we show that virus detection is -complete. However, since we are able to deal with system vulnerability, we exhibit another defense based on controlling system access.
Computer viruses seem to be an omnipresent issue of information technology; there is a lot of books, see [13] or [16], discussing practical issues. But, as far as we know, there are only a few theoretical studies. This situation is even more amazing because the word "computer virus" comes from the seminal theoretical works of Cohen [4 - 6] and Adleman [1] in the mid-1980's. We do think that theoretical point of view on computer viruses may bring some new insights to the area, as it is also advocated for example by Filiol [8], an expert on computer viruses and cryptology. Indeed, a deep comprehension of mechanisms of computer viruses is from our point of view a promising way to suggest new directions on virus detection and defence against attacks. On theoretical approach to virology, there is an interesting survey of Bishop [2] and we aware of the paper of Thimbleby, Anderson and Cairns [10] and of Chess and White paper [3].
This being said, the first question is what is a virus? In his Phd-thesis [4], Cohen defines viruses with respect to Turing Machines. Roughly speaking, a virus is a word on a Turing machine tape such that when it is activated, it duplicates or mutates on the tape. Adleman took a more abstract formulation of computer viruses based on recursive function in order to have a definition independent from computation models. A recent article of Zuo and Zhou [21] completes Aldemans work, in particular in formalizing polymorphic viruses. In both approaches, a virus is a self-replicating device. So, a virus has the capacity to act on a description of itself. That is why Kleene's recursion theorem is central in the description of the viral mechanism.
This paper is an attempt to use computability and information theory as a vantage point from which to understand viruses. We suggest a definition which embeds Adelman's as well as Zuo and Zhou's definitions in a natural way.
A virus is a program which is solution of the fixed point equation
where is a function which describes the propagation and mutation of the virus in the system. This approach has at least three advantages compared with others mentioned above. First, a virus is a program and not a function. Thus, we switch from a purely functional point of view to a more programming perspective one.
Second, we consider the propagation function, unlike others. So, we are able to have another look at virus replications. All the more so since corresponds also to a system vulnerability. Lastly, since the definition is clearly based on recursion theorem, we are able to describe a lot of kind of virus smoothly. To illustrate our words, we establish a general construction of trigger virus in Section 3.3.
The results and the organization of the paper is the following. Section 2 presents the theoretical tools needed to define viruses. We will focus in particular in the s-m-n theorem and the recursion theorem. In section 3, we propose a virus definition and we pursue by a first approach to self-duplication. Section 4 is devoted to Adlemans virus definition. Then, we explore another duplication methods by mutations. We compare our work with Zuo and Zhou definition of polymorphic viruses. Lastly, Section 6 ends with a discussion on the relation with information theory. From that, we deduce an original defense against some particular kind of viruses, see 6.3. The last Section is about virus search complexity which turns out to -complete. It is worth to mention that we conclude the paper on some research direction to study system flaws, see Theorem 14.
We are not taking a particular programming language but we are rather considering an abstract, and so simplified, definition of a programming language. However, we shall illustrate all along the theoretical constructions by bash programs. The examples and analogies that we shall present are there to help the reader having a better understanding of the main ideas but also to show that the theoretical constructions are applicable to any programming language.
We briefly present the necessary definitions to study programming languages in an independent way from a particular computational model. We refer to the book of Davis [7], of Rogers [15] and of Odifreddi [14].
Throughout, we consider that we are given a set , the domain of the computation. As it is convenient, we take to be the set of words over some fixed alphabet. But we could also have taken natural numbers or any free algebra as domains. The size of a word is the number of letters in .
A programming language is a mapping from such that for each program is the partial function computed by . Following the convention used in calculability theory, we write instead of . Notice that there is no distinction between programs and data.
We write to say that for each , either and are defined and or both are undefined on .
A total function is computable wrt if there is a program such that . If is a partial function, we shall say that is semi-computable. Similarly, a set is computable (resp. semi-computable) if its characteristic function is computable (semi-computable).
We also assume that there is a pairing computable function (_,_) such that from two words and of , we form a pair . A pair can be decomposed uniquely into and by two computable projection functions. Next, a finite sequence of words is built by repeatedly applying the pairing function, that is .
So, we won't make any longer the distinction between a n-uple and its encoding. Every function is formally considered unary even if we have in mind a binary one. The context will always be clear.
It is worth to mention that the pairing function may be seen as an encryption function and the projections as decryption function.
Following Uspenski [19] and Rogers [15], a programming language is acceptable if
Of course, the function is the well-known s-m-n function written .
The existence of an acceptable programming language was demonstrated by Turing [18].
Kleene's Iteration Theorem yields a function which specializes an argument in a program. The self-application that is corresponds to the construction of a program which can read its own code . By analogy with bash programs, it means that the variable $0
is assigned to the text, that is , of the executed bash file.
We present now a version of the second recursion theorem which is due to Kleene. This theorem is one of the deepest result in theory of recursive function. It is the cornerstone of the paper that is why we write the proof. We could also have presented Rogers's recursion theorem but we have preferred to focus on only one recursion theorem in order not to introduce any extra difficulties. It is worth also to cite the paper [11] in which the s-m-n function and the recursion theorem are experimented;
Theorem 1 (Kleene's second recursion Theorem). If is a semicomputable function, then there is a program such that
Proof. Let be a program of the semi-computable function . We have
By setting , we have
A virus may be thought of as a program which reproduces, and executes some actions. Hence, a virus is a program whose propagation mechanism is described by a computable function . The propagation function searches and selects a sequence of programs among inputs . Then, replicates the virus inside . In other words, is the vector which carries and transmits the virus to a program. On the other hand, the function can be also seen as a flaw in the programming environment. Indeed, is a functional property of the programming language which is used by a virus to enter and propagate into the system. We suggest below an abstract formalization of viruses which reflects the picture that we have described above.
Definition 2. Assume that is a semi-computable function. A virus wrt is a program such that for each and in ,
The function is called the propagation function of the virus .
Throughout, we call virus a program, which satisfies the above definition.
As we have said above, we make no distinction between programs and data. However we write in bold face words of , like , which denote programs. On the other hand, the argument does not necessarily denote a data. Nevertheless, in both cases, or refer either to a single word or a sequence of words. (For example .)
A distinctive feature of viruses is the self-reproduction property. This has been well developed for cellular automata from the work of von Neumann [20]. Hence, Cohen [4] demonstrated how a virus reproduces in the context of Turing machines.
We show next that a virus can copy itself in several ways. We present some typical examples which in particular illustrate the key role of the recursion Theorem.
We give a first definition of self-reproduction. (A second direction will be discussed in Section 5.) A duplication function is a total computable function such that is a word which contains at least an occurrence of . A duplicating virus is a virus, which satisfies . The existence of duplicating viruses is a consequence of the following Theorem by setting .
Theorem 3. Given a semi-computable function , there is a virus such that
Proof. For set . Recursion Theorem implies that the semi-computable function has a fixed point that we call . We have .
Next, let be a code of , that is . The propagation function induced by is defined by , since
It is worth to say that the propagation function lies on the s-m-n function. The s-m-n function specializes the program to and , and thus it drops the virus in the system and propagates it. So, in some sense, the s-m-n function should be be considered as a flaw, which is inherent to each acceptable programming language.
To illustrate behaviors of duplicating viruses, we consider several examples, which correspond to known computer viruses.
A duplication function is a crushing if .
This basic idea is in fact the starting point of a lot of computer viruses. Most of the email worms use this methods, copying their script to many directories. The e-mail worm "loveletter" copies itself as "MSKernel32.vbs". Lastly, here is a tiny bash program which copies itself.
Suppose that . Then, a virus is cloning wrt , if where is a duplication function. A cloning virus keeps the structure of the program environment but copies itself into some parts. For example, we can think that is a directory and are the file inside. So a cloning virus infects some files in the directory.
Moreover, a cloning virus should also verify that . Then, the virus does not increase the program size, and so the detection of such non-size increasing virus is harder.
A cloning virus is usually quite malicious, because it overwrites existing program. A concrete example is the virus named "4870 Overwriting". The next bash program illustrates of a cloning virus.
A virus is an ecto-symbiote if it lives on the body surface of the program . For example, where is the word concatenation.
The following bash code adds its own code at the end of every file.
The computer virus "Jerusalem" is an ecto-symbiote since it copies itself to executable file (that is, ".COM" or ".EXE" files).
We establish a result which constructs a virus which performs several actions depending on some conditions on its arguments. This construction of trigger viruses is very general and embeds a lot of practical cases.
Theorem 4. Let be semi-computable disjoint subsets of and be semi-computable functions There is a virus which satisfies for all and , the equation
Proof. Define
The function is computable and has a code such that . Again, recursion Theorem yields a fixed point of which satisfies the Theorem equation. The induced propagation function is
Adleman's modeling is based on the following scenario. For every program, there is an "infected" form of the program.
The virus is a computable function from programs to "infected" programs. An infected program has several behaviors which depend on the input . Adleman lists three actions. In the first (13) the infected program ignores the intended task and executes some "destroying" code. So it is why it is called injure. In the second (14), the infected program infects the others, that is it performs the intended task of the original, a priori sane, program, and then it contaminates other programs. In the third and last one (15), the infected program imitates the original program and stays quiescent.
We translate Adleman's original definition into our formalism.
Definition 5 (Adleman's viruses). A total computable function is said to be a -viral function (virus in the sense of Adleman) if for each one of the three following properties holds:
This first kind of behavior corresponds to the execution of some viral functions independently from the infected program.
The second item corresponds to the case of infection. One sees that ny part of is rewritten according to .
The last item corresponds to mimic the original program.
Our definition respects Adleman's idea and implies easily the original infection definition. In Adleman's paper, the infection definition is very closed to the crushing virus as they have defined previously. However, our definition of the infect case is slightly stronger. Indeed, there is no condition or restriction on the application of the A-viral function to to unlike Adleman's definition. Indeed, he assumes that and that where is a computable function which depends on A.
Theorem 6. Assume that is a A-virus. Then there is a virus which performs the same actions that .
Proof. Let be the code of , that is . There is a semi-computable function such that . Suppose that is the code of . Take . We have
We conclude that the propagation function is
Until now, we have considered viruses which duplicate themselves without modifying their code. Now, we consider viruses which mutate when they duplicate. Such viruses are called polymorphic; they are common computer viruses. The appendix gives more "practical informations" about them.
This suggests a second definition of self-reproduction. A mutation function is a total computable function such that is a word which contains at least an occurrence of a virus . The difference with the previous definition of duplication function in Subsection 3.2 is that is a mutated version of wrt .
Theorem 3, and the implicit virus Theorem 4, shows that a virus is essentially a fixed point of a semi-computable function. In other word, a virus is obtained by solving the equation: . And solutions are fixed points of . Rogers [15] established that a computable function has an infinite number of fixed points. So, a first mutation strategy could be to enumerate fixed points of . However, the set of fixed points of a computable function is , and worst it is -complete for constant functions.
So we can not enumerate all fixed points because it is not a semicomputable set. But, we can generate an infinite number of fixed points.
To illustrate it, we suggest to use a classical padding function which satisfies
Lemma 7. There is a computable padding function .
Proof. Take as a computable bijective encoding of pairs. Let be first projection function of . Define as the code of .
Theorem 8. Let be a computable function. Then there is a computable function such that
Proof. In fact, is the ith fixed point of wrt to a fixed point enumeration procedure. A construction of a fixed point enumeration procedure is made by padding Kleene's fixed point given by the proof of the recursion Theorem.
For this, suppose that is a program of the semi-computable function . We have
By setting , we have
Remark 9. For a virus writer, a mutation function is a polymorphic engine, such as the well known "Dark Avenger". A polymorphic engine is a module which gives de ability to look different on replication most of them are encryptor, decryptor functions.
Polymorphic viruses were foreseen by Cohen and Adleman. As far as we know, Zuo and Zhou's are the first in [21] to propose a formal definition of the virus mutation process. They discuss on viruses that evolve into at most n forms, and then they consider polymorphism with an infinite numbers of possible mutations.
Definition 10 (Zuo and Zhou viruses). Assume that and are two disjoint computable sets. A total computable function is a ZZ-viral polymorphic function if for all n and ,
This definition is closed to the one of Adleman, where corresponds to a set of arguments for which the virus injures and is a set of arguments for which the virus infects. The last case corresponds to the imitation behavior of a virus. So, the difference stands on the argument n which is used to mutate the virus in the infect case. Hence, a given program has an infinite set of infected forms which are . (Technically, is an encoding of natural numbers into .)
Theorem 11. Assume that is a ZZ-viral polymorphic function. Then there is a virus which performs the same actions that ZZ wrt a propagation function.
Proof. The proof is a direct consequence of implicit virus Theorem 4 by setting .
There are various way to define a mutation function. A crucial feature of a virus is to be as small as possible. Thus, it is much harder to detect it. We now revisit clone and symbiote virus definitions.
A compressed clone is a mutated virus such that . A compression may use informations inside the program . There are several compression algorithms which perform such replications.
An endo-symbiote is a virus which hides (and lives) in a program. A spyware is a kind of endo-symbiote. For this, it suffices that
Both examples above show an interesting relationship with complexity information Theory. For this, we refer to the book of Li and Vitányi [12]. Complexity information theory leans on Kolmogorov complexity. The Kolmogorov complexity of a word wrt and knowing is . The fundamental Theorem of Kolmogorov complexity theory yields: There is universal program such that for any program , we have where is some constant. This means that the minimal size of a program which computes a word wrt is , up to an additive constant.
Now, suppose that the virus mutates to from . That is . An interesting question is then to determine the amount of information which is needed to produce the virus . The answer is bits, up to an additive constant.
The demonstration of the fundamental Theorem implies that the shortest description of a word is made of two parts. The first part encodes the word regularity and the second part represents the "randomness" side of . And, we have . Here, the program plays the role of an interpreter which executes in order to print . Now, let us decompose into two parts (i) an interpreter and (ii) a random data part such that . In this construction, the virus introduces an interpreter for hiding itself. This is justified by the fundamental Theorem which says that it is an efficient way to compress a virus. In [9], Goel and Bush use Kolmogorov complexity to make a comparison and establish results between biological and computer viruses.
We suggest an original defense (as far as we know) against some viruses based on information Theory. We use the notations introduced in Section 6 about endo-symbiosis and Kolmogorov complexity.
Our defense prevents the system to be infected by endo-symbiote. Suppose that the programming environment is composed of an interpreter which is a universal program. We modify it to construct in such way that . Hence, intuitively a program for is a description of a program wrt .
Given a constant , we define a -compression of a program as a program such that and . Observe that .
Now, suppose that is an endo-symbiote. So, there is a mutation function and two associated projections et . We have by definition of endo-symbiotes that . By definition of , we have . As a consequence, . So, . Finally, the space to encode the virus is bounded by . Notice that it is not difficult to forbid to execute programs which have less than bits. In this case, no endo-symbiote can infect . Therefore, -compressed programs are safe from attack by endo-symbiotes.
Of course, this defense strategy is infeasible because there is no way to approximate the Kolmogorov complexity by mean of a computable function. In consequence, we can not produce -compressed programs. However, we do think this kind of idea shed some light on self-defense programming systems.
Let us first consider the set of viruses wrt a function . It is formally given by . As the formulation of shows it, we have:
Proposition 12. Given a recursive function is .
Theorem 13. There are some functions for which is -complete.
Proof. Suppose now given a computable function , it has an index . It is well known that the set is -complete. Define now . Observe that a virus verify: . The pairing procedure being surjective, is an index of . Conversly, suppose that is not a virus. In that case, there is some for which . As a consequence, it is not an index of . So, .
Theorem 14. There are some functions for which it is decidable whether is a virus or not.
Proof. Let us define . Being recursive, it has a code, say . Application of s-m-n Theorem provides such that for all , we have . Let us define . It is routine to check that for all , is a virus for . So, in that case, any index is a virus.
A consequence of this is that there are some weakness for which it is decidable whether a code is a virus or not. This is again, as far as we know, one of the first positive results concerning the detection of viruses.
A method widely used for virus detection is file scanning. It uses short strings, refered as signatures, resulting from reverse engineering of viral codes. Those signatures only match the considered virus and not healthy programs. Thus, using a search engine, if a signature is found a virus is detected.
To avoid this detection, one could consider and old virus and change some instructions in order to fool the signature recognition. As an illustration, consider the following signature of a viral bash code
The following code denotes the same program but with an other signature
Polymorphic viruses use this idea, when it replicates, such a virus changes some parts of its code to look different.
Virus writers began experimenting with such techniques in the early nineties and it achieved with the creation of mutation engines. Historically the first one was "Dark Avenger". Nowadays, many mutation engines have been released, most of them use encryption, decryption functions. The idea, is to break the code into two parts, the first one is a decryptor responsible for decrypting the second part and passing the control to it. Then the second part generates a new decryptor, encrypts itself and links both parts to create a new version of the virus.
A polymorphic virus could be illustrated by the following bash code, it is a simple virus which use as polymorphic engine a swap of two characters.
To detect polymorphic computer viruses, anti virus editors have used code emulation techniques and static analysers. The idea of emulation, is to execute programs in a controled fake environment. Thus an encrypted virus will decrypt itself in this environment and some signature detection can be done. Concerning static analysers, they are improved signature maching engines which are able to recognize simple code variation.
To thward those methods, since 2001 virus writers has investigated metamorphism. This is an enhanced morphism technique. Where polymorphic engines generate a variable encryptor, a metamorhic engine generates a whole variable code using some obfuscation functions. Moreover, to fool emulation methods metamorphic viruses can alter their behavior if they detect a controled environment.
When it is executed, a metamorphic virus desassembles its own code, reverse engineers it and transforms it using its environment. If it detects that his environment is controled, it transforms itself into a healthy program, else it recreates a new whole viral code using reverse ingineered information, in order to generate a replication semantically indentical but programmatically different.
Such a virus is really difficult to analyse, thus it could take a long period to understand its behavior. During this period, it replicates freely.
Intuitively, polymorphic viruses mutates without concidering their environment whereas metamophic viruses spawn their next generation using new information. As a matter of fact, to capture this notion, one must consider the equation in its entirety.
[Back to index] [Comments (0)]