Neural Networks for Computer Virus Recognition

Gerald Tesauro, Jeffrey Kephart, Gregory Sorkin
IEEE Expert, vol. 11, no. 4, Aug. 1996, pp. 5-6.
August 1996

We have developed a neural network for generic detection of a particular class of computer viruses_the so-called boot sector viruses that infect the boot sector of a floppy disk or a hard drive. This is an important and relatively tractable subproblem of generic virus detection. Only about 5% of all known viruses are boot sector viruses, yet they account for nearly 90% of all virus incidents.1 We have successfully deployed our neural network as a commercial product, distributing it to millions of PC users worldwide as part of the IBM AntiVirus software package.

We faced several challenges in taking our neural network from a research idea to a commercial product. These included designing an appropriate input representation scheme; dealing with the scarcity of available training data; finding an appropriate trade-off point between false positives and false negatives to conform to user expectations; and making the software conform to strict constraints on memory and speed of computation needed to run on PCs. This essay discusses our methods for handling these challenges.

PC boot sectors are 512 bytes long, and an input representation consisting of raw bits or bytes would clearly be infeasible. Instead, we developed a novel representation scheme that indicates the presence (1) or absence (0) of a few dozen features. The features are short byte strings (specifically, trigrams) generated by an automated procedure from a corpus of training examples. The basic idea is that the features should appear frequently in viral boot sectors but infrequently in legitimate ones. In brief outline, we obtain the features by, first, forming a list of all trigrams contained in the corpus of viral boot sectors; second, eliminating trigrams that appear in any of the legitimate boot sectors in the corpus, or that appear too frequently in a separate corpus of uninfected PC programs; and then computing a "4-cover" of the viruses_extracting a list of features such that every virus in the training set contains at least four of the features in the list. This procedure winnows the list of trigram features from tens of thousands down to about 50.2

The available training data for this classification task was extremely limited: our entire data collection consisted of about 200 viral boot sectors and 100 legitimate boot sectors. Our experiments typically reserved 50% of the data for use as a validation set, leaving only about 100 viral examples and 50 nonviral examples for training. With 50 input dimensions and such a small training set, we could train only single-layer networks with no hidden units. Even a few hidden units would make the number of weights in the network comparable to or greater than the number of exemplars, and thus create a likelihood of overfitting the training data.

A further problem with the training data is that, by construction, the legitimate boot sectors are all represented by a single input pattern consisting of all zeroes. Hence, any set of positive weights would achieve 100% classification accuracy on the training set. Clearly, this would lead to a bad generalization, as the network would produce a false positive for any legitimate boot sector that had any nonzero feature values. We addressed this problem by training as nonviral an additional set of input patterns with a single input set to 1 and all others set to 0. This step enforced the constraint that no single input feature in isolation should be sufficient to indicate viralness. We also generated additional artificial exemplars using the 512-byte strings, starting from the entry points in our corpus of PC programs. (The thought was that these sections of code, like real boot sectors, might be oriented to machine setup rather than performance of applications.) This procedure yielded another several hundred nontrivial patterns, containing some nonzero inputs, that were trained as nonviral exemplars.

In typical training runs using backpropagation, we found that the network achieved 90-95% classification accuracy on the training data, using a classification threshold of 0.5 (that is, patterns scoring greater than 0.5 were classified as viral, while patterns scoring less than 0.5 were classified as nonviral). (We did not achieve 100% accuracy because a few of the less robust input features were deleted.) Performance on the validation sets was typically 80-85% for the viral boot sectors and 100% for the legitimate boot sectors. That is, the false negative rate was 15-20%, and the false positive rate was lower than the resolution of our validation set (about 1%).

For practical boot virus detection, we need to place more emphasis on false positives than false negatives. False positives are potentially much more frequent, because a false positive on one legitimate boot sector will mean false positive events on thousands of computers. In the field, the network should operate at a classification threshold higher than 0.5. It is difficult to choose an appropriate threshold value, because there is so little data, and because, at a threshold of 0.5, the false positive rate is already quite low. After consulting with virus experts in our group, we settled on a threshold value of approximately 0.7, a level deemed to provide an adequate safety cushion for false positives, while only slightly increasing (by a few percent) the false negative rate.

The neural net's performance in the field has, in fact, been excellent, consistent with our expectations based on the validation set measurements. The classifier, incorporated into IBM AntiVirus, has caught approximately 75% of new boot sector viruses that have come out since the product was released. Most of the viruses that escaped detection did so not because they failed to contain the feature trigrams but because the code sections containing them were obscured in various ways. We are working on independent means for capturing the obscured code, which should enable the program to detect most of these viruses. Also, to our knowledge, only two false positives have appeared since the initial product release, both on boot sectors that human experts think have suspicious virus-like qualities. Subsequent product releases have easily eliminated these false positives by using specific exception-handling code.

In incorporating the neural-net classifier into IBM AntiVirus, we faced severe constraints on CPU, memory, and disk-space use. IBM AntiVirus is already a substantial PC program and must run on minimal machines with little memory, older CPUs, and so forth. This places a premium on all computational resources, and the neural-net code must squeeze into a memory footprint of at most a few kilobytes. So, we needed to recast the neural net's floating-point arithmetic in integer form. We stored the weights as low-precision integers (using about 5 to 6 bits) and eliminated the sigmoidal output. These conversions are manageable in a small, single-layer net. However, the memory-footprint and CPU-usage issues present considerable obstacles to commercial use of larger multilayer nets.

References

S.R. White, J.O. Kephart, and D.M. Chess, "Computer Viruses: A Global Perspective," Proc. Fifth Int'l Virus Bulletin Conf., Virus Bulletin Ltd., Abingdon, England, 1995, pp. 165-181.
J.O. Kephart et al., "Biologically Inspired Defenses Against Computer Viruses," Proc. Int'l J. Conf. on AI (IJCAI-95), Morgan Kaufmann, San Francisco, 1995, pp. 985-996.

VX Heavens

Neural Networks for Computer Virus Recognition

References