Before reading this abstract, you can refer to this link to understand the basic concepts of antivirus software.
An Abstract of Malware Detection Techniques
Introduction
Techniques used for detecting malware can be broadly divided into two categories: anomaly-based detection and signature-based detection. An anomaly-based detection technique uses its knowledge of what constitutes normal behavior to decide the maliciousness of a program under inspection. A special type of anomaly-based detection is referred to as specification-based detection. Specification-based techniques leverage a specification or rule set of what is valid behavior in order to decide the maliciousness of a program under inspection. Programs violating the specification are considered anomalous and, usually, malicious. Signature-based detection, by contrast, uses its characterization of what is known to be malicious to decide the maliciousness of a program under inspection. As one may imagine, this characterization, or signature, of the malicious behavior is the key to a signature-based detection method's effectiveness.
Signature-Based Detection Method
Signature-based detection attempts to model the malicious behavior of malware and uses this model to detect malware. The collection of all these models represents the knowledge of a signature-based detector. Such a model of malicious behavior is often referred to as a signature. Ideally, a signature should be able to identify any malware exhibiting the malicious behavior it specifies. Like any data that exists in large quantities, signatures require a repository for storage. This repository represents all of the knowledge the signature-based method has as it pertains to malware detection. The repository is searched when the method assesses whether the program under inspection (PUI) contains a known signature. The most common signature types are hashes and byte signatures. Here we choose hash signatures.
Malware Detector
A malware detector D is defined as a function whose domain is the set of executable programs P and whose range is the set {malicious, benign}. In other words, a malware detector can be defined as shown below.
\[
D(p) = \begin{cases} \text{malicious} & \text{if } p \text{ contains malicious code} \\ \text{benign} & \text{otherwise} \end{cases}
\]
The detector scans a program p ∈ P to test whether it is benign or malicious. The goal of testing is to measure the detector's false positives, false negatives, and hit ratio. The malware detector detects malware based on malware signatures. The binary pattern of the machine code of a particular virus is called its signature. Antivirus programs compare their database of virus signatures with the files on the hard disk and removable media (including the boot sectors of the disks) as well as with the contents of RAM. The antivirus vendor updates the signatures frequently and makes them available to customers via the Web.
Hash Signatures
The simplest type of signature is a hash value. A hash value is produced by a hash function, a procedure or mathematical function that converts a large amount of data into a single fixed-size value. The most commonly used hash functions are MD5 and SHA-1. These hash functions are deterministic: the same file always yields the same hash, and two different files are very unlikely to share a hash by accident. The MD5 hashing algorithm produces a fixed 16-byte output from the variable-size file that is being inspected for malicious behavior.
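As a minimal sketch of how such a hash signature could be computed, the following Python uses the standard hashlib module. The function names md5_signature and md5_file_signature, and the chunk size, are illustrative assumptions and not part of any particular antivirus product.

```python
import hashlib


def md5_signature(data: bytes) -> str:
    """Return the MD5 digest of data as an uppercase hex string."""
    return hashlib.md5(data).hexdigest().upper()


def md5_file_signature(path: str, chunk_size: int = 65536) -> str:
    """Hash a file in chunks so that large files need not fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps reading until read() returns b"".
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest().upper()


print(md5_signature(b"abc"))        # 900150983CD24FB0D6963F7D28E17F72
print(hashlib.md5().digest_size)    # 16 (bytes), regardless of input size
```

Whatever the size of the input, the digest is always 16 bytes (32 hex characters), which is what makes it convenient to store and compare.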
Naming malware
Typically the primary, human-readable name of a piece of malware is decided by the anti-virus researcher who first analyzes the malware. Names are often based on unique characteristics that malware has, either some feature of its code or some effect that it has.
There is no central naming authority for malware, and the result is that a piece of malware will often have several different names. Needless to say, this is confusing for users of anti-virus software who try to reconcile names heard in alerts and media reports with the names used by their own anti-virus software. To compound the problem, some sites use anti-virus software from multiple vendors, each of whom may have a different name for the same piece of malware.
Virus Databases
Conceptually, a virus database is a database containing records, one for every known virus. When a virus is detected using a known-virus detection method, one side effect is to produce a virus identifier. This virus identifier may not be the virus' name, or even be human-readable, but can be used to index into the virus database and find the record corresponding to the found virus. A virus record will contain all the information that the anti-virus software requires to handle the virus. This may include:
. A printable name for the virus, to display for the user.
. A unique virus hash signature
The example below shows the format of the virus signature database. Each entry consists of two parts: the text before the equals sign is the malware name, and the text after it is the MD5 hash signature of the corresponding virus.
Malware Name=virus
MD5 Hash Signature=14379AC2B18ECA31088ECD5A3AA58DD8
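A lookup against such a database can be sketched as follows. This is only an illustration under stated assumptions: each database line is assumed to have the form name=hash, the entry "demo-sample" is a hypothetical stand-in whose hash is simply the MD5 of the bytes "abc", and the function names load_signatures and detect are invented for this example.

```python
import hashlib
from typing import Dict, Optional, Tuple

# Assumed "name=MD5" layout. The first entry is the one quoted in the text;
# the second is a hypothetical demo entry (MD5 of b"abc").
SAMPLE_DATABASE = """\
virus=14379AC2B18ECA31088ECD5A3AA58DD8
demo-sample=900150983CD24FB0D6963F7D28E17F72
"""


def load_signatures(text: str) -> Dict[str, str]:
    """Map each uppercase MD5 hex digest to its malware name."""
    signatures = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        name, _, digest = line.partition("=")
        signatures[digest.strip().upper()] = name.strip()
    return signatures


def detect(data: bytes, signatures: Dict[str, str]) -> Tuple[str, Optional[str]]:
    """Return ("malicious", name) on a signature match, else ("benign", None)."""
    digest = hashlib.md5(data).hexdigest().upper()
    name = signatures.get(digest)
    if name is not None:
        return ("malicious", name)
    return ("benign", None)


signatures = load_signatures(SAMPLE_DATABASE)
print(detect(b"abc", signatures))   # ('malicious', 'demo-sample')
print(detect(b"abd", signatures))   # ('benign', None)
```

Keying the dictionary by hash rather than by name makes the check a single lookup per scanned file, however many signatures the database holds.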
Any virus signatures stored in the database must be handled carefully. A potential problem arises when more than one anti-virus program is present on a system. If virus signatures are stored in an unencrypted form, then one anti-virus program may declare another vendor's virus database to be infected, because it can find a wealth of virus signatures in the database file! The safest strategy is to encrypt stored virus signatures, and never to decrypt them. Instead, the input data being checked for a signature can be similarly encrypted, and the signature check can compare the encrypted forms.
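The idea of comparing only transformed signatures can be sketched as below. Note the substitution: instead of reversible encryption, this sketch uses a keyed one-way transform (HMAC-SHA256 from Python's standard hmac module), which fits the "never decrypt" requirement. The key value and the function names protect, build_protected_store, and matches are hypothetical.

```python
import hashlib
import hmac

# Hypothetical vendor key; a real product would manage this secret itself.
VENDOR_KEY = b"example-vendor-key"


def protect(signature: bytes, key: bytes = VENDOR_KEY) -> bytes:
    """Apply a keyed one-way transform so the raw signature is never stored."""
    return hmac.new(key, signature, hashlib.sha256).digest()


def build_protected_store(raw_signatures):
    """Store only the transformed form of each raw signature."""
    return {protect(sig) for sig in raw_signatures}


def matches(candidate: bytes, store) -> bool:
    """Transform the candidate the same way and compare transformed forms."""
    return protect(candidate) in store


# The raw signature here is the MD5 digest of a stand-in sample.
raw = hashlib.md5(b"abc").digest()
store = build_protected_store([raw])

print(matches(hashlib.md5(b"abc").digest(), store))  # True
print(raw in store)                                  # False: raw form is not stored
```

Because the stored values differ from the raw signatures, another scanner that searches the database file for known signatures would not find them there.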
The virus definition file holds the signatures of known malware only. Whenever new viruses become known, their signatures must be added to this file so that the antivirus can detect them. As new viruses are discovered, an anti-virus vendor will update their virus database, and all their users will require an updated copy of the virus database in order to be properly protected against the latest threats.
Advantages of Signature-Based Virus Detection
. Few false positives
A false positive occurs when a virus scanner erroneously detects a 'virus' in a non-infected file. False positives result when the signature used to detect a particular virus is not unique to the virus, i.e. the same signature appears in legitimate, non-infected software.
. False negatives
A false negative occurs when a virus scanner fails to detect a virus in an infected file. The antivirus scanner may fail to detect the virus because the virus is new and no signature is yet available, or because of configuration settings or even faulty signatures.
. Hit Ratio
A hit occurs when the malware detector detects the malware. This happens because the signature of the malware matches a signature stored in the signature database. The hit ratio is the proportion of malware samples that the detector detects.
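The three measures can be computed from a labeled test run, as in the sketch below. The formulas are assumptions based on common usage: hit ratio as detected malware over all malware, false positive rate as flagged benign files over all benign files, and false negative rate as missed malware over all malware. The function name evaluate and the sample data are hypothetical.

```python
def evaluate(results):
    """Compute detection measures from (actual, predicted) label pairs.

    Each label is either "malicious" or "benign".
    """
    tp = fp = tn = fn = 0
    for actual, predicted in results:
        if actual == "malicious" and predicted == "malicious":
            tp += 1   # hit: malware correctly detected
        elif actual == "benign" and predicted == "malicious":
            fp += 1   # false positive: clean file flagged
        elif actual == "benign" and predicted == "benign":
            tn += 1   # clean file correctly passed
        elif actual == "malicious" and predicted == "benign":
            fn += 1   # false negative: malware missed
        else:
            raise ValueError(f"unexpected labels: {actual!r}, {predicted!r}")

    malware_total = tp + fn
    benign_total = fp + tn
    return {
        "hit_ratio": tp / malware_total if malware_total else 0.0,
        "false_positive_rate": fp / benign_total if benign_total else 0.0,
        "false_negative_rate": fn / malware_total if malware_total else 0.0,
    }


# Made-up test run: 4 malware samples (3 detected) and 5 clean files (1 flagged).
sample = (
    [("malicious", "malicious")] * 3
    + [("malicious", "benign")]
    + [("benign", "benign")] * 4
    + [("benign", "malicious")]
)
print(evaluate(sample))
# {'hit_ratio': 0.75, 'false_positive_rate': 0.2, 'false_negative_rate': 0.25}
```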
[Figure: signature-based scanning workflow, with the components Input file, Unpack file, Hashed file, Virus Definition Database, Comparison engine, and Report.]
Disadvantages of Signature-Based Virus Detection
. Signature extraction and distribution is a complex task.
. Signature generation involves manual intervention and requires strict code analysis.
. Signatures can be easily bypassed: as soon as new signatures are created, malware can be modified so that it no longer matches them.
. The size of the signature repository keeps growing at an alarming rate.
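How easily a hash signature can be bypassed is illustrated by the sketch below: appending a single byte to a known sample produces a different MD5 hash, so the variant no longer matches the stored signature. The payload bytes and the function name is_known are placeholders for illustration only.

```python
import hashlib


def is_known(data: bytes, known_signatures) -> bool:
    """Check whether the MD5 hash of data is in the set of known signatures."""
    return hashlib.md5(data).hexdigest() in known_signatures


original = b"example payload bytes"   # stand-in for a known malware sample
variant = original + b"\x00"          # same content plus one padding byte

known_signatures = {hashlib.md5(original).hexdigest()}

print(is_known(original, known_signatures))  # True
print(is_known(variant, known_signatures))   # False
```

Each such variant needs its own signature, which also helps explain why the signature repository keeps growing.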
Static Anomaly-Based Detection Method (Based on API Call Sequence)
Anomaly-based detection usually occurs in two phases: a training (learning) phase and a detection (monitoring) phase. During the training phase the detector attempts to learn the normal behavior. The detector could be learning the behavior of the host, of the program under inspection, or of a combination of both during the training phase. A key advantage of anomaly-based detection is its ability to detect zero-day attacks. Similar to zero-day exploits, zero-day attacks are attacks that are previously unknown to the malware detector. The two fundamental limitations of this technique are its high false alarm rate and the complexity involved in determining what features should be learned in the training phase.
In static anomaly-based detection, characteristics of the file structure of the program under inspection are used to detect malicious code. A key advantage of static anomaly-based detection is that it may make it possible to detect malware without having to allow the malware-carrying program to execute on the host system.
The following steps are adopted to detect the malicious nature of a program.
Step 1: The program executable is decompressed if it is compressed (optional).
Step 2: The decompressed program is disassembled using the disassembler module.
Step 3: Each disassembled program is represented as a vector of functions. Each function is represented as an array of equal length.
Step 4: The similarity between the functions of programs P' and P'' is computed using the cosine similarity measure.
Step 5: The similarity value is compared with the threshold value; if it is less than the threshold, the program under inspection is considered benign, otherwise malicious.
Note that the choice of threshold is crucial. A high threshold value increases the risk of false negatives, and a low value increases the risk of false positives.
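Steps 4 and 5 can be sketched as follows. Several details are assumptions made only for illustration: each function is taken to be an equal-length numeric array (for example, counts of selected API calls), the program-level score is the average of each function's best match in the other program, and the threshold of 0.8 is arbitrary. The text does not specify any of these choices.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have equal length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an all-zero vector has no direction to compare
    return dot / (norm_a * norm_b)


def program_similarity(functions_p, functions_q):
    """Assumed aggregation: average, over the functions of one program,
    of the best cosine match among the functions of the other."""
    if not functions_p or not functions_q:
        return 0.0
    best = [
        max(cosine_similarity(f, g) for g in functions_q)
        for f in functions_p
    ]
    return sum(best) / len(best)


def classify(pui_functions, malware_functions, threshold=0.8):
    """Step 5: below the threshold is benign, otherwise malicious."""
    score = program_similarity(pui_functions, malware_functions)
    return "malicious" if score >= threshold else "benign"


# Hypothetical feature vectors for a known malicious program and two PUIs.
known_malware = [[3, 0, 1, 2], [0, 4, 0, 1]]
similar_pui = [[3, 0, 1, 2], [0, 4, 0, 1]]
different_pui = [[0, 0, 5, 0], [1, 0, 0, 0]]

print(classify(similar_pui, known_malware))    # malicious
print(classify(different_pui, known_malware))  # benign
```

Raising the threshold argument makes the classifier stricter about calling a program malicious, which is the trade-off described in the note above.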
Similarity Analysis
Malware signatures are long, and similarity analysis between the signatures of different samples requires a large number of comparisons. Thus there is a need for shorter signatures.
Author:
Natnael Issak
Nankai University
College of Software Engineering
nati332014@gmail.com