Markov models: detecting malware through language recognition

Datetime: 2016-08-23 03:17:01

Hi there, in this blog post I'm going to give an introduction to the use of probability models for (written) language recognition to detect malware, and demonstrate it with a .NET/PowerShell implementation of a Markov model.

It's not uncommon to encounter text that was generated randomly by malware. Some examples are file names, process names, names of registry keys, obfuscated scripts or even URLs produced by a domain generation algorithm (DGA). The use of randomly generated data can hinder detection and makes the creation of signatures or IOCs more difficult. Take, for example, the DGAs used by some malware families. One remedy against them consists of reverse engineering the specific algorithm used to generate the random domain names and then generating and blacklisting all possible outcomes of that algorithm. While this might work, wouldn't it be easier if you had an algorithm which could recognize a language and tell whether a text string was randomly generated, no matter which algorithm was used to create it?

When you ask a person to describe the text "VMIVKPQYUNL" compared to words like "TIGER", "BASKETBALL" or "SUMMER", he or she will probably say that the first is random gibberish and the other three are 'normal words' which 'make sense'. It might seem almost trivial for humans, but it's a little harder for computers to make that distinction. While a lot of text-based problems can be solved with regular expressions, even they can't help you out here.

One way of solving the problem is using a probability model, in this particular case a 'Markov model', which will allow us to calculate a probability that indicates whether a certain word 'sounds' like a legit word in a specific language (we'll be using English in this article).

To be clear: the method I'll be describing allows you to classify meaningless strings like "uqxypfdjiwwrdvi" (without any recognizable pattern) as randomly generated. It does not, however, allow you to detect words generated by a more advanced program which itself uses a probability model (which would be more or less the inverse of what I'm going to show here). So a word like "crazyous", which you won't find in your English dictionary but 'sounds' English, will be recognized as a legit word by the algorithm.

In this article I'll mainly be focusing on the implementation, but if you would like some background theory (which you will need to understand what the code is roughly doing), here's an introduction to the theory behind Markov models for language recognition.

I'm aware that there are other ways to solve the problem, based on entropy, letter distribution and so on, but I just wanted to try out this Markov model approach, so the article only shows my interpretation and a .NET/PowerShell solution that works for me.

Markov Chains

First I'll try to give an overview of how the algorithm works (with an emphasis on 'try') and what's needed for it to deliver proper results. If you don't want to know about all that and just want to try out the program, go ahead and skip this section.

Wikipedia defines a Markov chain as "a process that undergoes transitions from one state to another on a state space. It must possess a property that is usually characterized as 'memorylessness': the probability distribution of the next state depends only on the current state and not on the sequence of events that preceded it. This specific kind of 'memorylessness' is called the Markov property."

Well that probably explains it all. Not.

So I hope a short explanation of the algorithm I'm using will clarify things a little. You can check the code too for a better understanding.

When using the algorithm there are two different phases:

1) A learning phase where we build our probability model based on the text we feed to the program.

2) An evaluation phase in which we evaluate strings and decide whether they're considered random or not. This decision is based on the model we built before.

For the learning phase you will need some text. Lots of text! The more text you can use as input, the better your model will be. One requirement is that the text itself can be considered 'correct': if you're using a text with 100,000 non-existing or misspelled words, your results will be accordingly poor. And if you're using the complete works of Shakespeare as a learning text, don't be surprised if a word like 'netflix' comes out as unrecognized. Also, if you're going to use this script in an IT-related environment, it might not be a bad idea to include texts with words that are related to your environment/context but have no meaning outside of it.

If you want to use the program to identify multiple languages, I would advise you to use a text for each language and build a separate model from each one. A Google search for 'big English text' gives us a nice text to get you started with. Thank you, mister Norvig!

In this implementation I'm filtering out all characters other than those of the English alphabet in the PowerShell code. If you want to implement it for another language which uses special characters, you might want to change that.
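As a quick illustration (in Python rather than PowerShell, and purely a sketch), the filtering step amounts to something like this:

```python
import re

# Keep only English letters and spaces, then lowercase -- the same idea as the
# '[^a-zA-Z ]' replace in the PowerShell code further down. To support another
# language, widen the character class.
def clean(text):
    return re.sub(r"[^a-zA-Z ]", "", text).lower()

print(clean("Hello, World! 123"))  # "hello world " (digits and punctuation dropped)
```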

Now for the learning part. To "learn" a word with your model, you first have to split it up into n-grams (n > 0). These n-grams will represent the states from the wiki definition.

By default n = 2 in the code, so essentially you're splitting up the words into bigrams (but you can switch to trigrams, 4-grams, ...).

An example says more than a thousand words:

'computer' would be split up into 'co', 'om', 'mp', 'pu', 'ut' and 'te'
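In Python form (a sketch of the idea, not the author's code), the splitting step looks like this:

```python
# Split a word into (n-gram, following letter) pairs -- the states and
# transitions of the model. A sketch assuming the default n = 2 (bigrams).
def ngrams_with_next(word, n=2):
    word = word.lower()
    return [(word[i:i + n], word[i + n]) for i in range(len(word) - n)]

print(ngrams_with_next("computer"))
# [('co', 'm'), ('om', 'p'), ('mp', 'u'), ('pu', 't'), ('ut', 'e'), ('te', 'r')]
```

Note that the last bigram 'er' has no following letter, so it never becomes a state on its own.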

Then for each n-gram, the algorithm checks which letter follows it (these are the transitions) and keeps track, for each n-gram, of how many times it is followed by each specific letter.

'co' is followed by 'm'

'om' is followed by 'p' ...

At the end of the learning phase (when you're done processing all the words of the input text), the totals of all the letter occurrences are normalized for all the n-grams, and your Markov model is ready.
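Putting the learning phase together, a minimal Python sketch (names are mine, not the script's) could look like:

```python
from collections import defaultdict

# Learning-phase sketch: count transitions per n-gram, then normalize the
# counts into probabilities. It mirrors the learnWord() and
# calculateProbabilities() methods of the C# class below, but is only an
# illustration, not the author's implementation.
def build_model(words, n=2):
    counts = defaultdict(lambda: defaultdict(float))
    for word in words:
        word = word.lower()
        for i in range(len(word) - n):
            counts[word[i:i + n]][word[i + n]] += 1
    for nexts in counts.values():
        total = sum(nexts.values())
        for letter in nexts:
            nexts[letter] /= total
    return counts

model = build_model(["computer", "company"])
# 'co' was followed by 'm' in both words and by nothing else,
# so model['co']['m'] comes out as 1.0
```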

In the next phase you're going to evaluate the text string you're testing in a similar way: again you split your string into n-grams, but now for each n-gram you look up in your model the probability that it is followed by the succeeding letter. All you have to do then is multiply the probabilities of all the transitions. If a certain transition is unknown in our model, say from 'br' to 'x', then we multiply by a constant value PROB_UNKNOWN_TRANSITION.

As you'll see in the code, instead of multiplying the probabilities I'm calculating the sum of their logarithms. Google 'avoid underflow probability calculation' for an explanation.
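A sketch of the evaluation step using that log-sum trick (Python; the PROB_UNKNOWN_TRANSITION constant is copied from the C# code below, everything else is illustrative):

```python
import math

PROB_UNKNOWN_TRANSITION = 1e-6  # same constant as in the C# class below

# Evaluation sketch: sum the log-probabilities of all transitions instead of
# multiplying the raw probabilities, which would quickly underflow to 0.0.
# `model` is a dict-of-dicts mapping n-gram -> next letter -> probability.
def score_word(model, word, n=2):
    word = word.lower()
    log_prob = 0.0
    for i in range(len(word) - n):
        ngram, letter = word[i:i + n], word[i + n]
        p = model.get(ngram, {}).get(letter, PROB_UNKNOWN_TRANSITION)
        log_prob += math.log(p)
    return log_prob
```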

Okay, now we have this negative value as output; what does it tell me? (FYI, the logarithm causes the negative value, don't worry about it.)

Well, this is the trickier part: you have to decide which probability value is accepted as an okay word and which is not. I've decided to compare each probability to an 'EvaluationValue', which is a function of the length of the word you're testing and some constants (check the code for details). If the probability is greater than the EvaluationValue, the word is accepted. I know it's not perfect and could be improved, but test results were acceptable.
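The decision rule the script performs then reduces to one comparison (a sketch; the 2.5 divisor is the tuning constant from the script and is only known to behave well for n = 2):

```python
import math

# EvaluationValue sketch: a length-scaled threshold derived from the
# unknown-transition log-probability, as in the PowerShell script below.
EVALUATION_FACTOR = math.log(1e-6) / 2.5  # a negative number

def is_recognized(log_prob, word_length):
    # scores are negative; a 'good' word scores closer to zero than the threshold
    return log_prob > word_length * EVALUATION_FACTOR
```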

Also, I would not advise depending too much on the algorithm for word lengths < 5, because at such short lengths it becomes a really thin line between what's acceptable and what's not. But this shouldn't be a problem, because malware authors tend to generate random data of considerable length for the same reason. For example, in the case of a random registry key name: if they generate a 4-letter key name, chances are that it might conflict with an already existing key, so they tend to use lengths which ensure that their precious malware gets installed correctly.

.NET / PowerShell implementation

You'll notice that I implemented the Markov model in C# and loaded the types in the PowerShell script. The main reason for this is performance. I first attempted to implement everything purely in PowerShell, but to my surprise it took quite a while to build the model when using big text files (>5 MB) for learning. After some searching I found this interesting article.

Since a lot of text processing means a lot of for loops, I decided to implement the MarkovModel class in C#. The cool thing with PowerShell is that you can just import these classes and then immediately combine their functionality with all your favorite PowerShell kung fu.

The MarkovModel class also contains two pretty straightforward methods for serialization and deserialization. These allow you to write the MarkovModel constructed in your session to a binary file on disk and then reload the corresponding probabilities in a later session, which will save you a lot of time. In other words: you only have to run the learning phase once per learning text.
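The idea behind those two methods can be sketched in Python with pickle (the real class writes its own binary layout with BinaryWriter; this only illustrates the concept):

```python
import pickle

# Persist the model once after the (slow) learning phase, then reload it in
# later sessions instead of rebuilding -- a sketch of serialize/deserialize,
# assuming the model is a plain dict-of-dicts.
def save_model(model, path):
    with open(path, "wb") as f:
        pickle.dump(model, f)

def load_model(path):
    with open(path, "rb") as f:
        return pickle.load(f)
```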

That's all very nice, but what can I use this for?

I'll start by showing what the script returns when passing it the words 'powershell' and 'hjgkdurhdrz'.

The next test shows all entries in my HKLM\Software hive, which I exported to a file and used as input for this test. For testing purposes I manually added one random sub-key.

Nice, the random key gets detected.

For the next test I wanted to simulate some sort of DGA detection.

I first ran the script with a list of the first 40 sites (top-level and second-level domains filtered out) in this list.

Two sites were not recognized: 'jimdo' and 'baiyewang', which from an English perspective is understandable. If you want to overcome this, you can create a whitelist of known sites and only test the unknown sites with the algorithm.
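The whitelist idea is trivial to bolt on; a hypothetical sketch (KNOWN_SITES is made up for illustration):

```python
# Only pass unknown domains to the Markov test; whitelisted ones are trusted.
KNOWN_SITES = {"jimdo", "baiyewang"}  # hypothetical whitelist

def domains_to_test(domains):
    return [d for d in domains if d not in KNOWN_SITES]

print(domains_to_test(["jimdo", "qxvpzrtk"]))  # ['qxvpzrtk']
```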

Then I used a list of random domain names; some were made up, but some were copy-pasted from actual ransomware and malware research.

For random words of considerable length it becomes pretty clear that comparing the probability with the EvaluationValue delivers reliable results.

More examples of practical use could be the detection of random file names and process names, or even the detection of obfuscated scripts...

THE CODE

You can download the code here. (Tested on Win 10 / PowerShell v5 and Win 7 / PowerShell v3.)

It's a zip file containing the script, a learning text and some txt files for testing.

md5: 0b30f010cbd1ce979ede98b42af8d521

If you'd like, you can follow me on Twitter here.

<#

MarkovModel
Tom Asselman 17-8-2016

.Description
This script can be used for word recognition. The script uses an implementation of a markov model.
The main purpose is to be able to detect randomly generated strings like "vxagtvsyqxtrfcm"


.PARAMETER Word
Optional parameter. A word to test.

.PARAMETER File
Optional parameter. A file containing words to test, one per line.

.PARAMETER ModelPath
Optional parameter. Allows you to load (deserialize) a previously serialized model instead of building a new one. Specifies the file name of that model.
This will save you time when using big text files.

.PARAMETER InputText
Optional parameter. Specifies the input text that will be used to learn and build the Markov Model. By default the program uses file "big.txt". 
The file is expected to be located IN THE FOLDER OF THE SCRIPT!

.PARAMETER SerializeModel
Optional parameter. Specifies whether the model needs to be serialized at the end of the processing.
False by default

.PARAMETER SaveModelAs
Optional parameter. Specifies the name of the file used to serialize the model to. The file will be created in the same folder as the script.
#>



[CmdletBinding()]
Param(
    [Parameter(Mandatory=$False,Position=0)]
        [String]$Word="",
    [Parameter(Mandatory=$False,Position=1)]
        [String]$File="",
    [Parameter(Mandatory=$False,Position=2)]
        [String]$ModelPath="",
    [Parameter(Mandatory=$False,Position=3)]
        [String]$InputText="big.txt",
    [Parameter(Mandatory=$False,Position=4)]
        [Switch]$SerializeModel=$False,
    [Parameter(Mandatory=$False,Position=5)]
        [String]$SaveModelAs=""
)


 "
 ##############################################################################################
 # MarkovModel.ps1
 # © Tom Asselman
 # Cyberforce
 # www.cyberforce.be
 # This script is free to use but please do not use, reuse, build new scripts using this code 
 # or spread without giving credit to the author.
 #
 ##############################################################################################
 
 "

cd $PSScriptRoot

$Source = @"
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Collections;
using System.IO;

namespace MarkovProject
{
    public class MarkovModel
    {
        //Probability of a transition that we have no data for in our model
        private const double PROB_UNKNOWN_TRANSITION = 0.000001;

        private int n;

        private Dictionary<string, Dictionary<string, float>> learningTable;

        public Dictionary<string, Dictionary<string, float>> LearningTable
        {
            get
            {
                return learningTable;
            }

            set
            {
                learningTable = value;
            }
        }

        public int N
        {
            get
            {
                return n;
            }

            set
            {
                n = value;
            }
        }

        public MarkovModel()
        {
            N = 2 ;
            LearningTable = new Dictionary<string, Dictionary<string, float>>();
            
        }

        /*
            You can use a different value for n but everything is optimized for standard N=2
            The value PROB_UNKNOWN_TRANSITION was chosen to give good results for n=2 and you will need to change it for any other value for n.
        */
        public MarkovModel(int _N)
        {
            N = _N;
            LearningTable = new Dictionary<string, Dictionary<string, float>>();

        }


        public int getN()
        {
            return N;
        }

        public void learnWord(string word)
        {
            word = word.ToLower();
            for (int i = 0; i < word.Length-N; i++)
            {
                processNgram(word.Substring(i, N), word.Substring(i + N, 1));
                
            }
        }


        private void processNgram(String ngram, string letter)
        {
            Dictionary<string, float> currentNgramList;
            try
            {
                currentNgramList = LearningTable[ngram];
            }
            catch (KeyNotFoundException)
            {
                currentNgramList = new Dictionary<string, float>();
                LearningTable.Add(ngram, currentNgramList);
                currentNgramList.Add(letter, 1);
                return;
            }

            try { 
                currentNgramList[letter]++;
            }
            catch (KeyNotFoundException)
            {
                currentNgramList.Add(letter, 1);
            }

        }

        //Calculates and sets the probabilities of all n-grams in the model
        public void calculateProbabilities()
        {
            // Dictionary<string, Dictionary<string, float>>.ValueCollection ngramLists = learningTable.Values;

            foreach ( Dictionary<string, float> ngramList in LearningTable.Values)
            {
                float sum = 0;
                foreach( float number in ngramList.Values)
                {
                    sum += number;
                }
                //We add .ToList() to allow us to make changes while iterating
                foreach ( string letter in ngramList.Keys.ToList())
                {
                    ngramList[letter] /= sum;
                }
            }
        }

        //Calculates the log-probability of a word (sum of transition log-probabilities)
        public float calculateProbabilityWord(string word)
        {
            float prob = 0;
            for (int i = 0; i < word.Length - N; i++)
            {
                string ngram = word.Substring(i, N);
                string letter = word.Substring(i + N, 1);

                if( learningTable.ContainsKey(ngram) &&  learningTable[ngram].ContainsKey(letter))
                {
                    prob += (float) Math.Log(learningTable[ngram][letter]);
                }
                else
                {
                    prob+= getUnknownTransition();
                }
            }
            return prob;
        }

        public float getUnknownTransition()
        {
            return (float) Math.Log(PROB_UNKNOWN_TRANSITION);
        }

        //simple method for serialization
        public void serializeToBinary(string fileName)
        {
            BinaryWriter writer = new BinaryWriter(File.Open(fileName, FileMode.Create)) ;
            writer.Write(N);
            writer.Write(learningTable.Count);
            foreach (KeyValuePair<string,Dictionary<string,float>> outerKvp in learningTable)
            {
                writer.Write(outerKvp.Key);
                writer.Write(outerKvp.Value.Count);
                foreach (KeyValuePair<string, float> innerKvp in outerKvp.Value)
                {
                    writer.Write(innerKvp.Key);
                    writer.Write((double) innerKvp.Value);
                }
            }
            writer.Flush();
            writer.Close();
        }

        // reads from a serialized binary file and adds data to the learningTable
        public void deSerializeFromBinary(string fileName)
        {
            BinaryReader reader = new BinaryReader(File.Open(fileName, FileMode.Open));
            int _N = reader.ReadInt32();
            N = _N;
            
            int nGramsCount = reader.ReadInt32();

            for (int i = 0; i < nGramsCount; i++)
            {
                string outerKey,innerKey;
                outerKey = reader.ReadString();
                int letterCount = reader.ReadInt32();
                Dictionary<string, float> nGramList = new Dictionary<string, float>(letterCount);
                for (int j = 0; j < letterCount; j++)
                {
                    innerKey = reader.ReadString();
                    float letterProb = (float) reader.ReadDouble();
                    nGramList.Add(innerKey, letterProb);
                }
                learningTable.Add(outerKey, nGramList);
                
            }
            reader.Close();
        }
    }
}

"@


<#
    We add the types from our c# code to the powershell session.
#>

Add-Type -TypeDefinition $Source -Language CSharp  
$scriptPath = split-path -parent $MyInvocation.MyCommand.Definition


<#
    Function isWordRecognized
        Checks whether a word is recognized as a real-sounding word. If not, the word is considered to be made up of random characters.
        "vrigfjkdljdr" will not be recognized
    Parameters
        $word  = string you want to check
    Returns  a psObject with the following properties :
        Text :            The text that was checked
        Recognized :      Is the word recognized or is it some random shizzle
        Probability :     The calculated probability value
        Length:           Length of the text that was checked
        EvaluationValue : Value that was used to decide whether the word is random or not. If EvaluationValue > probability => Random text
#>


function isWordRecognized( [string] $word){

   begin{

      if ($PSVersionTable.PSVersion.Major -gt 2)
      {
        $properties = [ordered] @{ 
                     "Text"= '';
                     "Recognized" = $False 
                     "Probability" = 0;
                     "Length" = 0;
                     "EvaluationValue" = 0;
                      }
      }
      else{
         $properties = @{ 
                     "Text"= '';
                     "Recognized" = $False 
                     "Probability" = 0;
                     "Length" = 0;
                     "EvaluationValue" = 0;
                      }
      }
      $returnObj = New-Object PSObject -Property $properties

   }  
   
   process{  
      
     
      #Filtering of unwanted characters
      #Just make sure your filter here is identical to the one used to filter text for the learning model
      $word = ($word -replace '[^a-zA-Z ]', '').ToLower().Trim()
      $returnObj.Text = $word
      $returnObj.Length= $word.Length
      $probWord =  [math]::Round($m.calculateProbabilityWord($word),3)
      $returnObj.Probability =$probWord
      $returnObj.EvaluationValue= [math]::Round(  $word.Length * $EvaluationFactor ,3) 
      $returnObj.Recognized = ($probWord -gt $word.Length * $EvaluationFactor)
      Write-output $returnObj
   }
}

<#
    Function BuildModel
        Reads input text to learn , builds the model and calculates the probabilities.
#>

function BuildModel(){
    
    "Reading $InputText"
    $lines = Get-Content $InputText
    "-->Input file read."
    #Remove all characters that do not appear in the English alphabet, if you want to allow non-English characters for other languages change it here
    #Just make sure your filter here is identical to the one in function isWordRecognized
    $lines = ($lines -replace '[^a-zA-Z ]', '').ToLower()
    "-->Unwanted characters removed from $InputText."
    "Building the model, please wait." 
    "Processing time depends on the size of the input text..."
    foreach ($line in $lines){
        $m.learnWord($line) 
    }
    $m.calculateProbabilities()
    "--> Model ready."

}



<# MAIN #>

$m = new-object MarkovProject.MarkovModel

<#
    $EvaluationFactor is important for deciding whether a word is recognized as being legit or just random stuff.
    So it's just a value that seemed to be a good consensus that I found through testing. 
    It is however based on the standard value for n = 2. 
    So if you want to use another value for n you HAVE TO CHANGE $EvaluationFactor, or the value for getUnknownTransition(), 
    because results will need to be interpreted differently.
#>

$EvaluationFactor = $m.getUnknownTransition() / (2.5)


if ($ModelPath -eq "") {
    BuildModel    
}

else{
    if (-not [System.IO.Path]::IsPathRooted($ModelPath)){
        $ModelPath = $scriptPath +'\'+ $ModelPath 
    }
    "Loading existing model from " + $ModelPath
    $m.deSerializeFromBinary($ModelPath)
}

if($word -ne ""){
    isWordRecognized($word)
}

elseif($file -ne ""){
    #We filter out words with length <= N + 2 otherwise results aren't always reliable

    Get-Content($file) | %{ isWordRecognized($_)} |Where-Object length -gt ($m.N + 2) | Sort-Object -Property length| Format-Table -AutoSize 
}
else{
  "Nothing to do! Please specify a word or a file."
}

if ($SaveModelAs -ne ""){
    $m.serializeToBinary( "$PSScriptRoot\$SaveModelAs")    
    "-->Model was serialized to $PSScriptRoot\$SaveModelAs"
}