Home Use of Natural Language Processing techniques for the automatic identification of TTPs in Threat Intelligence Reports.
Post
Cancel

Use of Natural Language Processing techniques for the automatic identification of TTPs in Threat Intelligence Reports.

In this post I present my Master’s Thesis to conclude my studies in the official master’s degree in Cybersecurity of the UAH.

I take as a starting point the open source platform, developed by MITRE Corporation called TRAM (Threat Report ATT&CK Mapping), which is designed for the automatic extraction of TTPs from iberintelligence reports. The analysis performed on it yields a number of weaknesses on which this Master’s Thesis is going to focus.

The code can be found in the following GitHub repository. Two videos (link to first video and link to second video) summarizing the functionalities of the project.

Post index

Abstract

Threat intelligence sharing has grown exponentially over the past few years. This has given cybersecurity professionals access to a vast amount of data to analyze. This discipline is known as cyber threat intelligence (CTI) which seeks to collect and organize information that can be used to prevent, detect or mitigate cyber attacks.

Without the right tools, security analysts have to read and retrieve information from reports manually. Considering that the number of reports published daily has increased considerably, it is clear that this option is unfeasible, so the use of tools that automate the extraction of this information is necessary.

After reviewing the state of the art, the open source platform developed by MITRE Corporation called TRAM (Threat Report ATT&CK Mapping), which is designed for the automatic extraction of TTPs from cyber intelligence reports, is taken as a starting point. The analysis carried out on it shows a series of weak points on which this Master’s Thesis will focus.

We propose a set of new functionalities that, by crossing the extracted TTPs with data from public sources and the use of NLP techniques, generate conclusive data for the analyst. Specifically, these new features show the analyst a summary describing the threat present in the report, a top 3 of groups that could be involved in the attack, a visual representation of the chain of steps the threat follows to be exploited, a list with information on how to detect and mitigate the threat and, finally, the extraction of IoCs along with their validation in Alien-Vault to determine their criticality.

With faster analysis, teams can more easily implement their countermeasures. While what has been developed in this TFM cannot and is not intended to replace a human analyst, it can help by providing them with some baseline data on the report.

Introduction

In a report1 published by Check Point Research, a leading provider of cybersecurity solutions worldwide, global statistics on the increases in cyberattacks are analyzed observed by region, country and sector during 2021. A 50% increase was detected over the previous year, 2020. the previous year, 2020.

This is why it is increasingly important to evaluate the capacity for detection and rapid response to threats. For this purpose, a number of indicators are available that can be used to detect an adversary. The problem is that not all indicators are equally effective. This can be seen in what is known as “The Pyramid of Pain”2, which reflects the relationship between indicators versus the potential damage that denying those indicators can cause to attackers. In other words, the higher you are in the pyramid, the more effective our measures will be in creating a safe environment to create a hostile environment for the attacker.

Pyramid of pain Figure 1 - Pyramid of pain

Threat information sharing has expanded over the past few years. This has led practitioners to have a wealth of data, including TTPs (Tactics, Techniques and Procedures), which as seen in the pain pyramid in Figure 1 are particularly valuable, but are usually found in unstructured reports, which makes it difficult to extract information automatically.

This Master’s thesis aims to help cyber intelligence analysts by automatically extracting TTPs from reports using natural language processing techniques. With faster analysis, teams can implement their countermeasures more easily. While what has been developed in this project cannot and is not intended to replace a human analyst, it can help by providing them with some baseline data about the report.

Objectives

Although the final objective of this Master’s Thesis is the proposal of a system capable of extracting TTPs (Tactics, Techniques and Procedures) from cyberintelligence reports, this objective will be divided into a series of milestones necessary to achieve it. This objective will be divided into the series of milestones necessary to achieve it.

The first milestone consists of the study and analysis of existing tools. Today there are several proposals, although with different approaches, that aim at extracting TTPs. This first phase is fundamental to know the mechanisms and methodologies applied up to the moment of the realization of this Master’s Thesis.

In order to carry out information extraction in unstructured reports it is necessary to research Natural Language Processing or also known as NLP. The second milestone focuses on the study and analysis of this technique in order to understand its operation and how it works and to narrow the scope. After this, it is necessary to compile those procedures that can serve as a starting point for the resolution of the problem.

In the third milestone, alternative methods to the use of Natural Language Processing (NLP) techniques for the extraction of TTPs from cyberintelligence reports are raised. The use of rules and pattern detection to recover vulnerabilities, use of cybersecurity ontologies, among others.

Once the study and analysis phases have been completed, the fourth milestone begins, which focuses on the implementation of the improvements and proposals extracted from the previous phases.

Finally, an analysis of the results is performed during which a report describing a real threat is chosen, imported into the tool, processed and studied to see if the generated output is appropiate.

Results (Video demo)

Demo with the functionalities of the application

Results analysis

The following report3 is chosen for testing and checking the results. The report in question was written in 2013 and discusses the increased use of the malware called njRAT over several weeks.

It presents, as a summary, the operation and objectives pursued by this malware to determine whether the results obtained in the tests are successful or not.

What is it njRAT?

njRAT, also called Bladabindi and Njw0rm, is a remote access Trojan used to remotely control infected machines. Due to the overabundance of online tutorials, the large amount of information and the robust set of evasion techniques implemented, njRAT has become one of the most widely used RATs in the world.

This malware was first detected in 2013. The largest increase in attacks of the njRAT Trojan was recorded in 2014 in the Middle East, which is the region most targeted by this malware. It allows attackers to activate the webcam, record keystrokes and steal passwords. In addition, it gives access to the command line on the infected machine, thus allowing it to kill processes, as well as execute and manipulate files remotely. Finally, it is able to manipulate the system registry.

How does it work?

Once infected, the Trojan will collect various data about the PC on which it has been introduced, such as the computer name, operating system number, computer country, user names and operating system version.

After infecting a computer, the malware uses a variable name and copies itself to %TEMP%, %APPDATA%, %USERPROFILE%, %ALLUSERSPROFILE% or %windir%. It can also copy itself to .exe, to make sure it will be activated every time the victim turns on his computer.

It has some tricks to avoid being detected by antivirus. For example, it uses multiple .NET obfuscators to obscure its code. Another technique it uses is to camouflage itself in a critical process, which does not allow the user to shut it down. It can disable processes that belong to antivirus software, which allows it to remain hidden. It knows how to detect if it has been executed in a virtual machine, which helps attackers to set up countermeasures against researchers.

The authors exploit Pastebin to avoid investigation by cybersecurity researchers, as it downloads additional components and executes second-stage payloads from Pastebin. Thus, the malware does not need to establish a traditional command and control (C2) server. Pastebin creates a pathway between njRAT infections and new payloads. With the Trojan acting as a downloader, it will take the encrypted data dumped into Pastebin, decode it and deploy it. For propagation, it can detect external hard drives connected via USB. Once the device is detected, the RAT will copy itself to the connected drive and create a shortcut.

Analysis of results

The first feature to be analyzed is the summary and keywords. With this functionality, it is intended that the analyst has access to several types of summaries that conceptually allow him/her to understand what is described in the report. The summary is automatically extracted verbatim from the text, so there is nothing to analyze. However, it is possible to evaluate the accuracy of the keywords with which the analyst can get an idea of the type of malware he is dealing with. For example, in the case you are dealing with, njRAT is a remote access Trojan, which once it gains access to the target machine exfiltrates information through a command-and-control server. Such actions can be seen in both the verb-type keywords in Figure 2 and the application process-type keywords in Figure 3. It can be determined that the keywords used accurately describe the actions performed by a malware of these characteristics, so it can be concluded that the results are good.

Keywords njRAT 1 Figure 2 - Keywords njRAT 1

Keywords njRAT 2 Figure 3 - Keywords njRAT 2

To evaluate the top 3 groups most likely to have carried out the attack based on the TTPs described, the main targets of njRAT must be taken into account.

The summary shows that the peak of attacks occurred in 2014, when its targets were mostly aimed at the Middle East, although in reality it was a malware created for general use. In any case, Figure 4 shows whether any of the top 3 groups have Middle Eastern countries among their targets. Of these three groups, only the last one has Middle Eastern nations as its main target. Even so, and according to Wikipedia [68], the authorship is recognized to a group called M38dHhM, which is not present in the MITRE database, from which the groups are extracted for the realization of this top 3, so it is impossible to select it as a group present in the top 3.

In many cases it is very difficult to determine which group has carried out the attack, and this is one of them, because the authorship is recognized to an alleged group of which there is no further evidence. This is why it is difficult to assess the accuracy of the results presented.

Possible perpetrator of the attack Figure 4 - Possible perpetrator of the attack

To evaluate the accuracy of TTP prediction, and with it the display of TTPs in the MITRE matrix, we will check how many TTPs the system predicts autonomously and without manual review by an analyst. For this purpose, a report [69] from the eset company is used, which analyzes the malware and defines the set of TTPs used by the different versions of njRAT.

This report presents a total of 30 TTPs, while the prediction made by TRAM only generates a total of 15, as shown in Figure 5. This is by no means a failure, since it should be borne in mind that the eset report includes the TTPs used by the different versions of the malware, which means that not all versions have to use all 30 TTPs. That said, the results presented by TRAM’s NLP-based system generate a more than acceptable prediction.

Cyber kill chain Figure 5 - Cyber kill chain

As for the recommendation and detection system, the information shown only refers to the extracted TTPs, so its accuracy depends on how accurate the prediction process is. On the other hand, the information shown is correct, as it is compiled from MITRE itself. All in all, the results are good and allow the ana-list to have utilities at hand to detect and mitigate threats.

Finally, the extraction of IoCs plus their validation. The main point is to recall what was explained in 2.2.2 The big problem of IoCs, where the lack of updates of the indicators of compromise and their short lifetime were mentioned, which made them unreliable.

With this in mind, we chose one of the hundreds of IoCs extracted from the document and observed whether it had any relationship with the malware in question. As can be seen in Figure 6, the MD5 IoC shown has two pulses in AlienVault related to this malware.

MD5 IoC Figure 6 - MD5 IoC

On the other hand, the actions performed by the malware, described at the beginning of this section, are also taken into account for the evaluation. In this explanation, the paths that are normally used by the malware to carry out its actions are shown. In this case, an IoC is collected in which actions are seen both in the %appdata% directory, in Figure 7, and in the %temp% directory, in Figure 8.

IoC 1 Figure 7 - IoC 1

IoC 2 Figure 8 - IoC 2

It is observed that both the extraction of IoCs and their validation generate good results by showing information that is in line with that described in the malware report.

In short, it is concluded that the functionalities developed do produce an improvement in the tool, as they provide additional information that can be of great use to analysts.


Reverse Footnote

This post is licensed under CC BY 4.0 by the author.