Virus Cryptoanalysis

Mircea Ciubotariu
Virus Bulletin
November 2003

'A soul on his journey to the Afterlife comes to a fork in his path. One way leads to Heaven and is guarded by an angel who always tells the truth; the other way leads to Hell and is guarded by an angel who always lies. The soul is permitted to ask only one question before he must continue his journey. 'If I were to ask your buddy here which is the way to Heaven what would he reply?' he asked one of them. Given the answer he knew for sure which path to follow...'

Riddle from a recreational math book

Introduction

As technology has evolved, more opportunities have become available for virus writers to express their imagination in malicious code.

The introduction of cryptography into virus writing has become a necessity in order for virus writers to protect their code against external factors that might reveal its malicious intentions -- such as anti-virus programs.

In general, matters concerning cryptography reside in the time required to encrypt/decrypt the information compared to the strength of the algorithm used. Any functional encrypted virus has to decrypt its code in order run it, whether it decrypts instruction by instruction, as DarkParanoid did (see VB, January 1998, p.8), or even if it attempts brute force on its encrypted part, like DarkMillennium (W32/Darkmil).

Encryption is performed by applying a function to the original viral code and transforming it into a chunk of data which becomes meaningless without the decryption function and encryption key(s).

Decryption of the encrypted chunk of data is achieved by applying the inverse of the function used to encrypt it, but with the same key(s).

Keys are the initial parameters used by encryption functions. Generally the keys are a byte or a word in size for 16-bit viruses, while the keys for 32-bit viruses are a double word.

A clear distinction must be made between the decryption code and the encrypted virus body. Decryption code in its essence is not harmful and acts as a tool attached to the virus, but in many cases it may be considered to be a signature of a specific virus due to the information contained within, such as the encrypted block length, relative addresses, etc.

That is why in some cases it would be relatively safe to extract a signature for a virus from its decryption code, but not foolproof.

Many existing encrypted viruses use very simple encryption functions and hence can be caught using a CPU emulator which goes beyond the encrypted layer(s) within a relatively short amount of time by emulating a given, specific, number of instructions.

Problems arise when the length of time spent emulating in order to catch an encrypted virus is too long, or when by natural means the virus damages its decryption code -- or simply when the information contained in the decryption code is useless.

The concept

Because many encrypted viruses use simple encryption functions such as ADD or XOR, we shall consider a different approach to virus recognition and try to adapt the notion of the virus signature in the light of the concept explained below.

Let us suppose we are dealing with a file infector virus which uses a simple function to encrypt itself. Each time it infects a file it encrypts its main body in a buffer with a random key, then attaches the decryption code, set up with the encryption key, to the victim file and finally it appends the encrypted data.

Now let us drop all the information contained in the decryption code and remain only with the encrypted data. This means we know nothing about the function or the key that was used to encrypt it.

Next, we select one of the more commonly used functions (let's say ^ and its inverse ~). Even if the function has been guessed correctly we still need a key in order to get any valid information from our chunk of data.

Labelling the units (bytes, words or double words, according to key size) in data chunks as A₀, A₁, A₂, ... in order (e.g. A₀ is the first unit, A₁ the second etc.), we assume that they were obtained by performing the function ^ on the original units a₀, a₁, a₂, ... with the key K.

For the sake of simplicity we assume for this example that the key was kept constant during encryption. We have:

A₀ = a₀ ^ K, a₀ = A₀ ~ K
A₁ = a₁ ^ K, a₁ = A₁ ~ K
A₂ = a₂ ^ K, a₂ = A₂ ~ K
...

Assuming the function has the associative property let us consider:

A₀ ~ A₁ = (a₀ ^ K) ~ (a₁ ^ K) = a₀ ~ a₁ = s₁
A₁ ~ A₂ = (a₁ ^ K) ~ (a₂ ^ K) = a₁ ~ a₂ = s₂
...

For easier understanding, imagine ^ is XOR function, so ~ is the same as ^.

Thus, for any n given units of data where n < N (N is the total number of units in chunk), we have (n - 1) units of s. This is a transformation of our initial block of data independent of encryption key at a cost of one unit.

The resulting (n - 1) units block may be considered as a hash value which can be looked up in a table of such hashes. If the hash is found, further comparison is performed based on the same criteria.

If the function we chose was wrong (i.e. the hash value was not found in the function hashes list), we try other functions until we have either a match or there are no more functions remaining to test.

Polynomial functions

This approach considers and is limited to ADD and SUB functions taking into consideration a key modifier K₁ for K₀; that is at each step of encryption another function ADD or SUB is applied to the key K₀ with parameter K₁.

Let us refer to the encryption function as + (SUB is the same as ADD with negative argument) and we have:

A₀ = a₀ + K₀, K₀' = K₀ + K₁
A₁ = a₁ + K₀', K₀'' = K₀' + K₁ = K₀ + 2K₁
A₂ = a₂ + K₀'', K₀''' = K₀'' + K₁ = K₀ + 3K₁
...

The equivalent polynomial function is f(x) = K0 * x0 + K1 * x1, where x is the current step in encryption (or unit index). We shall consider:

A₀ - A₁ = (a₀ + K₀) - (a₁ + (K₀ + K₁)) = a₀ - a₁ - K₁ = s₁
A₁ - A₂ = (a₁ + (K₀ + K₁)) - (a₂ + (K₀ + 2K₁) = a₁ - a₂ - K₁ = s₂
...
s₁ - s₂ = (a₀ - a₁ - K₁) - (a₁ - a₂ - K₁) = a₀ - 2a₁ + a₂ = S₁
s₂ - s₃ = (a₁ - a₂ - K₁) - (a₂ - a3 - K₁) = a₁ - 2a₂ + a3 = S₂
...

We get (n - 2) units of S, which are an exact transformation of the original data block, independent of the key and key modifier.

It is possible to have another modifier, K₂, for the K₁ modifier, but in practice this situation is very rare and the solution would be to iterate the above once again. The general form of polynomial function of the nth degree is:

f(x) = K₀ * x₀ + K₁ * x₁ + ... + K_n-1 * x_n-1 + K_n * _n

Implementation

At implementation level this may be accomplished in two ways: we either use key-independent hashes in signatures or keep a long enough hash from the original signature bytes (decrypted) and with it generate key-independent hashes at run time.

The first solution is limited in that it applies strictly to the data encrypted with the same function that was used to generate the signature -- so viruses that use a random function from a given set require as many signatures as encryption functions used. Another limitation of the first solution would be that, even if the signature is identified correctly, the decrypted data still won't be available, since the trick of this type of hash is to avoid keys.

Therefore the second solution may be a more efficient implementation, especially because it gives the whole decrypted data and the key used, as:

K = A₀ ~ a₀ = A₁ ~ a₁ = A₂ ~ a₂ = ...
x = X ~ K

Where K is the key deduced from the hash, X is any unit of data outside the hash. Using K the whole chunk of data is decrypted and further classic methods may be applied for an exact match or for further analysis.

Conclusion

The cost of achieving cryptoanalysis as described here is a slight decrease in performance due to the run time generation of key-independent hashes and also a slight increase in the amount of storage space required for signature(s) and for each function implemented as many look-up tables are required.

However, it should be mentioned that this approach comes as an extension of the classic hash signature type which may be regarded as encrypted with a function f(x) = x, where x is the original data.

Finally, the ideas presented in this paper are just a starting point for what may become the basis for a powerful and more complex cryptoanalysis engine as other functions such as ROR or ROL may be easily implemented. Although mixed functions and multi-layer, multi-function encryption methods push the complexity beyond practical implementation, these are subject to a different approach.