Capturing 0day Exploits with PERFectly Placed Hardware Traps
By: Cody Pierce, Matt Spisak, and Kenneth Fitch / August 22, 2016
As we discussed in an earlier post, most defenses focus on the post-exploitation stage of the attack, by which point it is too late and the attacker will always maintain the advantage. Instead of focusing on the post-exploitation stage, we leverage the enforcement of coarse-grained Control Flow Integrity (CFI) to enhance detection at the exploitation stage. Existing implementations of CFI require recompilation or extensive software updates, or incur a significant performance penalty, making them difficult to adopt and use in the enterprise. At Black Hat USA 2016, we presented our hardware-assisted technique, which has proven successful at blocking exploits while minimizing the impact on performance to ensure operational utility at scale. To enable earlier detection while limiting the impact on performance, we have developed a new concept we’re calling Hardware Assisted Control Flow Integrity, or HA-CFI. This technology uses hardware features available in Intel processors to monitor and prevent exploitation in real time, with manageable overhead. By leveraging hardware features, we can detect exploits before they reach the “Post-Exploitation” stage and provide stronger protections while the defense still has the upper hand.
Prior Art and Operational Constraints
Our work builds on previous research that identified the Performance Monitoring Unit (PMU) of microprocessors as a good candidate for enforcing control-flow integrity. The PMU is a specialized unit in most microprocessor architectures that provides useful performance measuring facilities for developers. Most features of the unit are intended to count hardware level events during program execution to aid in program optimization and debugging.
In their paper, Yuan et al. [YUAN11] introduced the novel application of these events to exploit detection for software security. Their research focused on using PMU events along with Branch Trace Store (BTS) messages to correlate and detect code-injection and code-reuse attacks without source code. Xia et al. explored the idea further in their paper CFIMon [XIA12], combining precise event context gathering with the BTS and Precise Event-Based Sampling (PEBS) to enforce real-time control-flow integrity. In addition to these foundational papers, others have pursued variations on the idea to specifically target exploit techniques such as Return-Oriented Programming.
Alternatively, just-in-time CFI solutions have been proposed using dynamic instrumentation frameworks such as PIN [PIN12] or DynamoRIO [DYN16]. These frameworks dynamically interpret code as it is executed while providing instrumentation functionality to developers. Applying control flow policies with a framework like PIN allows for the flexible and reliable checking of code. However, it often incurs a significant CPU overhead, on the order of 10x to 100x, making it unusable in the enterprise.
Our research into dynamic run-time CFI was scoped to make the approach relevant to enterprise security while also providing significant detection and prevention assurances. To that end, we established several functional requirements, such as support for both 32-bit and 64-bit operating systems and application without software recompilation or access to source code.
HA-CFI uses PMU-based traps to apply coarse-grained CFI on indirect calls on the x86 architecture. The system uses the PMU to count and trap mispredicted indirect branches in order to validate branch destinations in real time. In addition to gaining assistance from a carefully tuned PMU, a practical implementation of this approach requires support from Intel’s Last Branch Record (LBR) feature and a method for tracking thread context switching in a given OS. It also requires an algorithm for validating branch destination addresses, all while keeping performance overhead to a minimum. After more than a year of fine-tuning these hardware features, we have proven our model is capable of generically detecting control-flow hijacks in real time with acceptable performance overhead on both Windows and Linux. Because control-flow hijack attacks often stem from a corrupted or modified VTable, many CFI designs focus on validating all indirect branches. In a hijack, however, the affected call site has never before jumped to the attacker-controlled address, so the indirect call is almost always mispredicted by the branch prediction unit. Therefore, by focusing only on mispredicted indirect call sites, we greatly limit the number of places where a CFI check is necessary.
HA-CFI configures the Intel PMU on each core to count and generate an interrupt on every mispredicted indirect branch. The PMU is capable of delivering an interrupt any time an event counter overflows, and thus HA-CFI sets the initial counter value to -1 and resets the counter to -1 from the interrupt service routine (ISR) to generate a trap for every occurrence of the event. In this way, the HA-CFI ISR becomes our CFI component, capable of validating each mispredicted call and determining whether it is the result of malicious behavior. To validate indirect branch target addresses, HA-CFI builds a comprehensive whitelist of valid code pointer addresses as each .dll/.so is loaded into protected processes. When a counter overflows, the ISR compares the target of the mispredicted branch against the whitelist and determines whether the branch is anomalous.
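The trap-on-every-event trick can be illustrated with a small simulation. This is only a toy model under stated assumptions, not the HA-CFI driver: the 48-bit counter width, the class name `TrapOnEveryEvent`, the handler `cfi_isr`, and all addresses are hypothetical. It shows why loading -1 makes the counter overflow, and therefore interrupt, on the very next counted event.

```python
COUNTER_WIDTH = 48  # assumed width in bits; real hardware reports its own width


class TrapOnEveryEvent:
    """Toy model of a PMU counter that interrupts whenever it overflows."""

    def __init__(self, isr, width=COUNTER_WIDTH):
        self.mask = (1 << width) - 1
        self.isr = isr
        self.arm()

    def arm(self):
        # -1 in two's complement is all ones, so the next event wraps to 0.
        self.value = -1 & self.mask

    def record(self, event):
        """Count one event; on overflow, run the ISR and re-arm."""
        self.value = (self.value + 1) & self.mask
        if self.value == 0:
            self.isr(event)   # overflow is delivered as an interrupt
            self.arm()        # reset to -1 so the next event traps too


whitelist = {0x401000, 0x402000}  # hypothetical valid call targets
flagged = []


def cfi_isr(branch):
    """Hypothetical CFI check: flag any target that is not whitelisted."""
    source, target = branch
    if target not in whitelist:
        flagged.append(branch)


counter = TrapOnEveryEvent(cfi_isr)
for branch in [(0x400100, 0x401000), (0x400200, 0x402000), (0x400300, 0x7FFE0000)]:
    counter.record(branch)
```

Every simulated misprediction reaches `cfi_isr`, and only the branch whose target is missing from the whitelist ends up in `flagged`.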
Figure 1: High level design of HA-CFI using the PMU to validate mispredicted branches
To minimize the overhead of HA-CFI while maintaining an extremely low false-positive rate, we had to make several key design decisions, which are described below.
The Indirect Branch: On the Intel x86 architecture, an indirect branch can occur at either a CALL or a JMP instruction. We focus exclusively on the CALL instruction for several reasons. First, indirect JMP branches are frequently used for switch statements. In our experimentation on Linux, we found roughly 12% of hijacked indirect branches occurred as part of an indirect JMP, and they occurred even less frequently on Windows. Second, ignoring mispredicted JMP instructions further reduces the overhead of HA-CFI. Therefore, we opted to omit mispredicted JMP branches during this research, which can be achieved with settings on the PMU and LBR.
Figure 2: A breakdown of hijackable indirect JMP vs CALL instructions found in Windows and Linux x64 binaries
Added Precision with the LBR: Given our requirement for real-time detection and prevention of control-flow hijacks, unlike the majority of previous research, we couldn’t use the Intel Branch Trace Store (BTS), which does not permit analysis of the trace data in real time. Instead, to precisely resolve the exact branch that caused the PMU to generate an interrupt, we make use of Intel’s Last Branch Record (LBR) stack. A powerful feature of the LBR is the ability to filter the types of branches that are recorded. For example, returns, indirect calls, indirect jumps, and conditional branches can all be included or excluded. With this in mind, we can configure the LBR to only record indirect call branches occurring in user mode. Additionally, the most significant bit of the LBR branch FROM address indicates whether the branch was actually mispredicted. As a result, this provides a quick filter for the ISR to ignore the branch if it was predicted correctly. It’s important to note that we are not iterating over the entire LBR stack, only the most recently inserted branch.
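The LBR filter and the misprediction flag can be sketched as plain bit manipulation. The bit positions below follow our reading of the Intel SDM description of the LBR filter register (MSR_LBR_SELECT), where setting a bit suppresses that class of branch; they should be verified against the manual for the target microarchitecture. The helper names, the 48-bit address mask (which assumes user-mode addresses), and the sample value are assumptions for illustration only.

```python
# Assumed suppress-bit layout of the LBR filter register (verify per CPU model).
LBR_SELECT_SUPPRESS = {
    "CPL_EQ_0": 0,       # branches ending in ring 0
    "CPL_NEQ_0": 1,      # branches ending in ring > 0
    "JCC": 2,            # conditional branches
    "NEAR_REL_CALL": 3,  # direct near calls
    "NEAR_IND_CALL": 4,  # indirect near calls
    "NEAR_RET": 5,       # near returns
    "NEAR_IND_JMP": 6,   # indirect near jumps
    "NEAR_REL_JMP": 7,   # direct near jumps
    "FAR_BRANCH": 8,     # far branches
}


def lbr_select_keep_only(*kept):
    """Build a filter value that suppresses every class not named in `kept`."""
    mask = 0
    for name, bit in LBR_SELECT_SUPPRESS.items():
        if name not in kept:
            mask |= 1 << bit
    return mask


# Record only indirect calls that occur outside ring 0 (user mode).
USER_INDIRECT_CALLS = lbr_select_keep_only("CPL_NEQ_0", "NEAR_IND_CALL")

MISPRED_FLAG = 1 << 63        # most significant bit of the raw FROM value
ADDRESS_MASK = (1 << 48) - 1  # assumption: user-mode (lower-half) addresses


def decode_from(raw_from):
    """Split a raw LBR FROM value into (branch address, mispredicted?)."""
    return raw_from & ADDRESS_MASK, bool(raw_from & MISPRED_FLAG)


def needs_cfi_check(raw_from):
    """Quick ISR filter: only mispredicted branches need validation."""
    _, mispredicted = decode_from(raw_from)
    return mispredicted


sample = MISPRED_FLAG | 0x7FF612340000  # hypothetical newest LBR entry
address, mispredicted = decode_from(sample)
```

Checking only the newest entry keeps the ISR short: a single mask-and-test decides whether the whitelist lookup is needed at all.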
On-Demand PMU-Assisted CFI: HA-CFI is focused on protecting commonly exploited applications such as browsers, mail clients, and Flash. As such, the PMU and LBR are both configured to only operate on mispredicted indirect calls occurring in user mode, ignoring those that occur in ring-0. Moreover, by monitoring thread context switches in both Windows and Linux, we can turn the entire PMU on and off depending upon which applications are being protected. This design decision is perhaps the most critical element in keeping our performance overhead at an acceptable level.
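The on-demand gating can be pictured as a small piece of per-core state updated at every context switch. The sketch below is a simulation only: `PmuGate`, the process IDs, and the core numbering are hypothetical, and the dictionary write stands in for whatever privileged register writes a real driver would perform.

```python
class PmuGate:
    """Toy model: enable counting on a core only while a protected process runs."""

    def __init__(self, protected_pids):
        self.protected = set(protected_pids)
        self.counting = {}  # core id -> whether the PMU is currently enabled

    def on_context_switch(self, core, incoming_pid):
        """Called when `incoming_pid` is scheduled onto `core`."""
        enable = incoming_pid in self.protected
        if self.counting.get(core, False) != enable:
            # Stand-in for toggling the PMU and LBR on this core.
            self.counting[core] = enable
        return enable


gate = PmuGate(protected_pids={4242})  # hypothetical browser process
states = [gate.on_context_switch(0, pid) for pid in (4242, 17, 4242)]
```

Because unprotected processes never run with the counter enabled, they generate no interrupts and pay none of the validation cost.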
Runtime Whitelist Generation: The final component of the HA-CFI system is the actual integrity check, which involves querying a whitelist data structure containing valid destination addresses for indirect calls. Whitelist generation is performed at run time for each image loaded into a protected process. We generated the whitelist such that every branch in our dataset could be verified with a hashtable lookup, leaving zero captured branches unknown.
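A hashtable-backed whitelist of this kind might look like the following minimal sketch. How the valid code pointers are discovered inside each image is not described here, so the list of relative addresses is taken as a given input; the class name, method names, image name, and addresses are all hypothetical.

```python
class CallTargetWhitelist:
    """Toy whitelist of valid indirect-call targets, keyed by loaded image."""

    def __init__(self):
        self._targets = set()  # hash set gives constant-time lookups
        self._by_image = {}    # image name -> addresses it contributed

    def image_loaded(self, name, base, code_pointer_rvas):
        """Add an image's code pointers, rebased to its load address."""
        addresses = {base + rva for rva in code_pointer_rvas}
        self._by_image[name] = addresses
        self._targets |= addresses

    def image_unloaded(self, name):
        """Drop the addresses contributed by an image that was unloaded."""
        self._targets -= self._by_image.pop(name, set())

    def is_valid(self, target):
        """Integrity check used when a mispredicted indirect call traps."""
        return target in self._targets


wl = CallTargetWhitelist()
wl.image_loaded("example.dll", base=0x10000000, code_pointer_rvas=[0x1000, 0x2340])
legit = wl.is_valid(0x10001000)      # function entry point inside the image
hijacked = wl.is_valid(0x0C0C0C0C)   # attacker-controlled address
```

Keeping the per-image sets alongside the flat set is a design choice for this sketch: it lets an unloaded image be removed without rebuilding the whole table.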
Throughout the course of our research, we encountered numerous hurdles in meeting our original goal of low overhead and high detection rates. First, registering for PMU interrupts on Windows was a major challenge. Our initial prototype was developed under Linux. Transferring the same techniques to Windows proved problematic, especially with regard to Kernel Patch Protection. After significant research, we discovered an undocumented option in the Windows Hardware Abstraction Layer (HAL) that registers a driver-supplied interrupt handler for PMU interrupts. Second, our implementation of CFI on Windows had to restrict PMU monitoring to a single process or thread, which requires knowing when a monitored thread is scheduled.
The technique we ultimately arrived at makes use of a thread's Asynchronous Procedure Call (APC) mechanism. Windows allows developers to register APC routines for a given thread, which are then added to a queue to be executed at certain points. By keeping an APC registered at all times on every thread that we seek to monitor, we are notified that a thread has resumed execution when our routine executes. The routine re-enables the PMU counter if necessary and updates various tracking metrics. We detect that a processor has swapped out a thread and begun executing another when we receive an interrupt in a different thread context. We can then disable the PMU counters, if needed.
To evaluate our system, we measured success both in terms of performance overhead added by HA-CFI as well as detection statistics when tested against various exploits in common client applications, including the most common web browsers, as well as Microsoft Office and Adobe Flash. We sourced exploits from Metasploit modules for testing, as well as numerous live samples from popular Exploit Kits found in the wild.
Exploit Detection: We extensively tested HA-CFI against a variety of exploits to determine its efficacy against as many bug classes and exploitation techniques as possible, with an emphasis on recent samples using approaches intended to bypass other mitigation measures. We ran one set of tests against more than 15 Metasploit exploits targeting Adobe Flash Player, Internet Explorer, and Microsoft Word. HA-CFI detected and prevented exploitation for each of the tested modules, with an overall detection rate greater than 98%.
We found the Metasploit results to be encouraging, but came to the conclusion that they did not provide sufficient diversity in exploitation techniques needed to comprehensively test HA-CFI. We used the VirusTotal service to download a set of samples used in real-world exploit kit campaigns from several widespread kits [KAF16]. In total, we tested forty-eight samples comprising twenty unique CVE vulnerabilities. We analyzed the samples to verify that they employed a varied set of both Return-Oriented Programming (ROP) and “ROPless” techniques. HA-CFI succeeded in detecting all 48 samples, with an overall detection rate of 96% in a multiple trial consistency test.
Results of VirusTotal Sample Testing, by Exploitation Technique
Results of VirusTotal Sample Testing, by Bug Class
Modern exploitation techniques are rapidly changing, requiring a new approach to exploit detection. We demonstrated such an approach by using the Performance Monitoring Unit to enforce control flow integrity on branch mispredictions. A whitelist generated at run time determines the validity of indirect call targets, so that calls to locations outside the whitelist are classified as malicious. This approach greatly reduces the overhead of the instrumentation by moving the policy enforcement to a “coarse-grained” verifier on mispredicted indirect branch targets. The data provided also shows the efficacy of such a system on samples captured in the wild. These samples, from popular exploit kits, allow us to measure against unknown threats, further validating its application. As exploits advance, we also need advanced exploit detection. Our hardware-assisted CFI (HA-CFI) system has a low performance impact and measurable prevention success against 0day exploits and previously unknown exploitation techniques. Using HA-CFI, we have advanced the state of the art, moving the industry from post-exploitation to exploitation, to give enterprise-scale security software an upper hand in earlier detection of exploitation. To learn more about pre-exploit detection and mitigation, we'll be discussing our approach during a webinar on August 25th at 1 pm ET.
[YUAN11] L. Yuan, W. Xing, H. Chen, B. Zang, “Security Breaches as PMU Deviation: Detecting and Identifying Security Attacks Using Performance Counters”, APSys’11, July 11-12, 2011.
[PIN12] PIN: A Dynamic Binary Instrumentation Tool. https://software.intel.com/en-us/articles/pin-a-dynamic-binary-instrumentation-tool
[DYN16] DynamoRIO: Dynamic Instrumentation Tool Platform. http://www.dynamorio.org/
[XIA12] Y. Xia, Y. Liu, H. Chen, and B. Zang, “CFIMon: Detecting violation of control flow integrity using performance counters,” in Proceedings of the 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pp. 1–12, IEEE Computer Society, 2012.
[EME16] The Enhanced Mitigation Experience Toolkit. https://support.microsoft.com/en-us/kb/2458544
[KAF16] Kafeine. Exploit Kit Samples. http://malware.dontneedcoffee.com/