Eric Filiol
INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE
ISSN 0249-6399
June 2004
Rapport de recherche Nº 5250 - Juin 2004 - 10 pages
Imagining what the nature of future viral attacks might look like is the key to successfully protecting against them. This paper discusses how cryptography and key management techniques may definitively checkmate antiviral analysis and mechanisms. We present a generic virus, denoted BRADLEY which protects its code with a very secure, ultra-fast symmetric encryption. Since the main drawback of using encryption in that case lies on the existence of the secret key or information about it within the viral code, we show how to bypass this limitation by using suitable key management techniques. Finally, we show that the complexity of the BRADLEY code analysis is at least as high as that of the cryptanalysis of its underlying encryption algorithm.
Antiviral detection is directly based on the capability to have malware codes at one's disposal and to study them by disassembly means. Thus, viral databases can be updated and antiviral engines can be upgraded.
A few malware writers try to make this task more difficult by implementing various techniques which aim at delaying the knowledge and the understanding of their codes: obfusctating, rewriting, encryption.... These codes are denoted armoured codes. The first and most famous one is probably the Whale virus appeared in the early nineties. More recently, the MyDoom virus very naively tries to complicate antiviral experts' work by implementing basic encryption techniques. Up to now, none of the known malware succeeded in preventing code analysis.
The main explanation for this failure lies on two facts:
In this paper, we present a new concept about malicious codes combining efficient key management with high-level security encryption algorithm. Different analysis and experiments have confirmed the impossibility to study the code, under the assumption that we managed to get a copy of it. By limiting the code presence and virulence in the computer, we show also how to make this assumption very unlikely. We illustrate these concepts (proof-of-concept) by presenting from an algorithmic point of view the most simple example of a new virus family called the BRADLEY viruses. As a main result, we show that the general problem of BRADLEY code analysis is equivalent to the cryptanalysis of a secure encryption algorithm in the sense that it is exponentially complex.
This paper is organized as follows. In Section 2, we first define precisely the background and the known cases where attempts to use cryptography in viral codes have been made. We show why all this attempts were bound to fail. Section 3 recalls key management techniques presented in [11]. In Section 4, we present the generic viral concept2 using strong encryption combined with optimal key generation and key management. At last, Section 5 prooves that BRADLEY viral code analysis is equivalent to encryption systems cryptanalysis and that it is, in fact, of exponential complexity. Conclusion will address the problem of fighting against such armoured malware.
The purpose of this paper is purely academic and draws our attention on the evolution of viral risks. It shows how the malware risk may evolve very quickly (if not already the case) and cause great concern among the antiviral community. This is the reason why we will not give any detailed code. The activity of our laboratory is dedicated to defensive aspects. Our mission consists in identifying new risks, in testing them in practice and assessing the level of potential threat precisely.
The reader is supposed to be familiar with basic definitions about malware (virus, worms, trojan horses...) and antiviral techniques. We just recall the following starting definition of armoured codes.
Definition 1 (Armoured codes) An armoured code is a program which contains instructions whose goal is to delay, complicate or forbid its own analysis during either its execution or through its disassembly.
The best known example is probably the Whale virus which appeared in September 1990. The virus did actually represent a very limited risk but it intended obviously to make its analysis very difficult. Its code contains roughly a dozen of program traps and tricks hampering trace, disassembling and code analysis: dynamic decryption/encryption, code obfuscation, code nesting...Once activated, the viral code tries to detect the potential use of a debugger and consequently freezes the keyboard. Using polymorphism techniques, about 30 different random variants were possible for an infected file.
What the Whale virus easily managed to cause is not a terrific epidemic but a waste of anti-virus experts' time and a nearly three-day delay to eradicate it. Nowadays, the main part of the viral action is completed during the first thirty minutes after the beginning of the infection (a good example could be the Slammer worm which appeared in January 2003); therefore, such a delay in code analysis cannot be acceptable. That is the reason why armouring code techniques must be seriously taken into account.
These techniques can be divided into different classes, as follows.
Obfuscated codes are generally too slow and of too large a size to be efficiently used by undetectable malware. Moreover, the results presented in [2] show there is no transformation able to prevent the code of every program from revealing any other information except the program's input-output behaviour.
As these first two techniques are concerned, code analysts fortunately will always get to the end of it since these techniques always produce deterministic results (even if some algorithms may be partly probabilistic). The only thing is that the analysts need time to study the code behaviour instruction by instruction. Therefore, this study is likely to be time-greedy and requires many human resources and much effort dedication.
Encryption is maybe less obviously easy to handle contrary to what experience tells us so far. Fortunately, only naive or very unsecure encryption methods have been used in known malware [4]: constant masking (like in earlier macro-viruses for example), rot13 encryption or other weak encryption systems (as an example, DarkParanoid virus used very simple arithmetic functions - ADD, SUB, XOR, NEG, NOT, ROL, ROR - as encryption functions).... Besides, weak key management always allows to recover the key and then to decipher the malware code when dealing with strong cryptosystems.
The reader is supposed to be familiar with basic concepts of cryptology as well. A detailed monography about cryptography will be found in [8]. We will just recall previous uses of cryptology inside malware and a few useful concepts we will use throughout this paper.
Cryptology has previously been envisaged to provide computer virology with very efficient tools. On the one hand, cryptographic techniques have recently be considered as a means for optimal worm propagation [1]. The use of cryptographic hash functions, for instance, is suitable for speeding up Curious Yellow worm propagation [15].
On the other hand, the combination of cryptographic techniques with viral technologies led in 1996 to the concept of "Cryptovirology" as presented in [6, 16, 17]. Cryptovirology consists in applying cryptography tools to malicious codes in order to strengthen, improve or develop such codes.
Particularly, cryptography appears to be very efficient in designing pay loads. Several convincing examples are presented in [16]. The main goal is to make a victim host dependent upon the virus - i.e. a virus can survive in the host if it makes the host depend in a critical way on the very presence of the virus itself. These results are mainly obtained with public-key cryptographic techniques3 combined with limited symmetric cryptography techniques.
Though very efficient, these approaches aim only at protecting the action of the virus (its payload) but not the virus itself. In other words, if a copy of a cryptovirus is somehow or other obtained and analyzed by reverse engineering, none of the cryptographic tools it contains will totally protect them against its code analysis. Thus, the exact knowledge of the code is likely to allow antiviral software update and limit/forbid the malware's action. The main limitation comes from the fact that cryptovirus as defined by Young and Yung, is not able to manage secret key part in a suitable and efficient way for that particular purpose.
The basic technique we discuss in this paper can effectively forbid such code analysis and thus, properly complement all the approaches developped in [6, 16, 17]. Other more sophisticated techniques are being tested in our laboratory.
Malware are mobile agents by nature. If they pass through an "insecure network" or environment (from the malware's point of view), they may be analyzed (disassembled) so that their code will be completely accessible to the attacker (the analyst). As previously explained , traditional encryption systems are dealing with static keys. Actually, the key is present somehow or other in the agent (hardware or software).
In 1998, B. Schneier and J. Riordan [11] introduced the notion of environmental key generation to address this problem. In other words, keying material is contructed from certain classes of environmental data. Environmental key generation can thus be useful when the sender wishes to communicate with the receiver such that the receiver could only receive the message if some environmental conditions are true. Environmental key generation can even be used in circumstances where the receiver is not aware of the specific environmental conditions that the sender wants his communication to depend on. This latter case corresponds exactly to our malware code analysis problem. The receiver here is the malware code present in a computer (the environment) and the sender is malware code author or the target system itself.
The difficulty with building an environmental key generation protocol is that the threat model assumes that any attacker (the malware code analyst) has total control over the environment. All information available to the malware program can be found by the attacker as well. All inputs to the program are supplied by the attacker and the program states themselves is completely determined by the attacker during the code analysis. As such, the constructions must resist direct analysis and dictionnary attacks in the form of Cartesian deception, that is to say, in which the attacker tells lies about the environment.
The authors of [11] discuss several construction for environmental key generation. To illustrate their approach, let us consider the following basic construction. Let be an integer corresponding to an environmental observation, a one-way function (typically a hash function), the hash of the observation the bitwise exclusive-or operator, || the concatenation operator, a nonce and a key. The value is carried by the agent (the malware code in our case). Hash function can be used to conduct tests and construct the keys so that examination of the agent does not reveal the required environmental information. Then possible constructions, among many others, are:
Let us note that the first construction is used in most of static encrypted password authentication schemes. The most important feature of each of these constructions is that knowledge of does not leak any information on .
Riordan and Schneier proposed several efficient constructions which provide efficient environmental key generation protocols using various techniques: thresholding (protocol using the ideas of cryptographic secret sharing), nesting (action of the mobile agent is ruled by several environmental keys used in the sequential way), time indexation (part of the environmental date required to generate the key are based on time)...
Environmental key generation has only been proposed from a theoretical point of view by the authors. Some aspects still need to be thoroughly tested. Particularly, for most of the constructions they proposed, the attacker is likely to find the key by observing both the agent and the environment. The search space for the activation data may always be small enough to allow an exhaustive search approach. Moreover, by observing mobile agent actions, the attacker may easily determine where and which kind of data the agent is interested in. That implies that a patient analyst will obtain information about the agent at the same time this latter is activated by the suitable environmental data.
We now present a practical and efficient use of environmental key generation in the case of viral code armouring.
Let us discuss now the generic family virus named BRADLEY. Without loss of generality, we choose to describe only a basic but powerful example. Some More complex protocols have been developped or are currently under study (see Section 6). Two different codes have been developped and tested:
Minor variants have been tested as well and will be listed later on. The codes have been designed both for Windows and Unix systems. They successfully managed to bypass antiviral software which all remained silent. Since BRADLEY viruses are only proof-of-concept viruses, we will focus only on the armoring protocol part. Complete source code is not available. The general structure of the codes is given in Figure (1) and summarized as follows:
Figure 1: Overall structure of BRADLEY codes
Note that these three encrypted parts are exactly of the same size in order to give the slightest information on the underlying code.
Let us now describe the key management protocol. The activation data - in other words the data required to construct the different keys - are (variant A):
For the variant B, data is a given public key which is present in a pubring.gpg for example. Thus, the virus may target a particular user or users communicating through encrypted emails/data with any given user. The viral code uses the hash function SHA-1 [9] as one-way function (here denoted ). Then, the environmental key protocol is described as follows:
Some remarks can be made about this protocol:
Other variants have been tested as well, particularly to produce the most optimal code in terms of size and stealthiness. The most significative variant is the following:
For all variants we developped, encryption algorithms that have been used are RC4 [12] and RC6 [13] while gzip compression has been chosen.
To evaluate the code analysis complexity, two cases must be considered:
The second case is more likely to happen if we consider that, in any case, the virus limits its presence inside the target system by disinfecting itself from it.
But let us suppose that the analyst, even if it is very unlikely, has managed to get one copy of the virus binaries. Let us show that the environmental key generation protocol presented in section 4 efficiently forbids code analysis unless a cryptanalysis problem of exponential complexity is solved.
Proposition 1 The analysis of a code protected by the environmental key generation protocol defined in Section 4 is a problem which has exponential complexity.
Let us now prove this proposition.
Proof.
Firstly, let us remark that decipherment procedure D leaks only the following information:
Moreover, the analyst is able to analyze the virus if and only if he knows the secret key . It can be obtained either by direct cryptanalysis or by guessing the exact values of the different activation data required to generate the good key. This guessing is equivalent to dictionary attacks.
The cryptanalysis approach aims at finding the value such that where and are easy to identify in the decipherment procedure . A hash function is highly non injective by nature. Thus it cannot be computationaly inversed in any way (preimage resistance). Consequently, this problem must be reformulated as a collision search problem (for more details, refer to [8, Chapter 9]). In other words, find all pairs of input and such that . This problem itself is computationally infeasible. To be more precise finding such a pair requires operations for -bit input values ( = 512 for SHA-1). Since the analyst must absolutely find the exact key (secret key really used to encrypt the viral code), he must beforehand compute all the values such that . For a -bit input, -bit output hash function, there exists such ( for SHA-1). Then, to summarize, recovering the key requires operations - that is to say operations ( for SHA-1).
Let us now consider the dictionary attack approach. It consists in enumerating all the possible values that might have been used as activation data. Note that, in that particular case, the analyst must simultaneously consider both the encrypted viral code and the system in which the code has been found. The analyst can try all the possible data relevant to the system (that is to say ) over which he has control during the analysis. Unfortunately, data remains out of his control and thus he will not be able to determine its exact value. Thus there is no any other more efficient approach than searching exhaustively for the value . Since at least will be chosen randomly by the viral code author, this exhaustive search has complexity if a -bit input hash function has been used ( for SHA-1). All things considered, the overall complexity of the code analysis is .
The proof-of-concept virus BRADLEY has been designed and discussed to illustrate the fact that efficient armouring is possible. BRADLEY and other efficient viruses of same kind pose the problem of a threat which so far, is impossible to deal with. The polymorphic nature of such codes, when optimally implemented, forbids any detection based only on the decipherment procedure . During the experiments, detection based on behaviour monitoring and analysis has been successfully bypassed as well.
Permament and direct memory monitoring might be a solution to deal with such efficient armoured codes. Besides, heavy system ressources are required and this approach implies to be aware of this particular threat (efficient code armouring). Current research carried out in our laboratory aims at proving that even memory management and monitoring can be bypassed. We particularly designed a far more complex variant directly drawn from the DarkParanoid virus: at any time, only a single instruction can be found in an unencrypted form in memory. But it requires a far more complex environmental key generation than that simple one presented in Section 4.
Unless a solution is rapidly found to fight against these virus, this study outlines that an isolation of critical networks and a strict computer security policy is absolutely essential. Moreover, this implies that the antiviral companies must develop cryptanalysis skills in the very near future, under the assumption that it is possible to obtain a viral code sample and that breakable cryptosystems have been used.
1 Virulence is an index measuring the level of risk for self-reproducing codes. This index, hence the risk itself, is related to the number of copies of the malware code. For details, see [7, chapter 4, pp. 89ff].
2 Without loss of generality, we will use the general term "virus" but everything presented in this paper may apply to any other malware type: trojans, worms, logical bombs...
3 A cryptovirus is defined as a computer virus that contains and uses a public key.
4 One may object that the presence of the webpage url within the procedure could give a useful information to the analyst. Since the webpage is under malware author's control, it is a very dunious hypothesis that the attacker could successfully access to the suitable activation data, especially if the data availability is very limited in time. Nonetheless, we have developped a variant of the basic protocol which is discussed in this section. Instead of only one data , we use two external activation data and . Each of them come from two different webpages. The second webpage's url is encrypted by means of the key contructed from data . Once decrypted, the virus gets the second activation data and a secret permutation function (at the very beginning of ). Finally, the key is build from data and in the same way as key is. In this variant, is superseded by .
5 "Repeatedly" means here that the virus scans any data contained in the system. In our case (a given file), the virus looks recursively for that data through the tree file system.
[Back to index] [Comments (0)]