This post complements a presentation I gave at DEF CON 30, "Avoiding Memory Scanners: Customizing Malware to Evade YARA, PE-sieve, and More", which included the public release of a new tool called AceLdr. The slides for this presentation are available on the conference website.
As open-source tools and commercial security products improve their ability to scan process memory for malware on Windows, red teams are forced to improve their tradecraft to evade them consistently.
Typically, beaconing C2 implants follow a common paradigm in which the malware executes an instruction and then sleeps for a period. This process presents a set of opportunities for detection and evasion, which this post aims to detail.
Open-source memory scanners have varying features that can be grouped into the following categories.
Signature or pattern matching may be the most recognized feature of memory scanners and commercial security products. A prime example of this technique is YARA, which can perform string and byte pattern matching with conditional logic. Consider the following example rule:
In this rule, the target must contain one of the following to match:
This simple example should provide a good picture of what is possible with YARA. Anything from simple string or byte patterns to relatively complex combinations of these primitives can be defined.
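To make the matching model concrete, here is a minimal pure-Python sketch of string and byte pattern matching with an "any of" condition. It does not use YARA itself, and the pattern names and values are hypothetical, chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical patterns for illustration only; not taken from any real rule.
PATTERNS: Dict[str, bytes] = {
    "text_marker": b"example-implant-string",
    "byte_marker": bytes.fromhex("deadbeef41424344"),
}


def matching_patterns(buffer: bytes, patterns: Dict[str, bytes]) -> List[str]:
    """Return the names of all patterns found anywhere in the buffer."""
    return [name for name, needle in patterns.items() if needle in buffer]


def rule_matches(buffer: bytes, patterns: Dict[str, bytes]) -> bool:
    """Model an 'any of' condition: a single hit is enough to match."""
    return len(matching_patterns(buffer, patterns)) > 0


if __name__ == "__main__":
    benign = b"\x00" * 64 + b"nothing interesting here"
    suspicious = b"\x90" * 16 + bytes.fromhex("deadbeef41424344") + b"\x00" * 16
    print(rule_matches(benign, PATTERNS))      # False
    print(rule_matches(suspicious, PATTERNS))  # True
```

Real rules layer more expressive conditions on top of this primitive, but the core operation is the same: search a buffer for known byte sequences and combine the hits with boolean logic.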
Since YARA scans all memory allocated by a target process, many projects build off YARA to create more efficient scanners with specific goals. For example, BeaconEye only scans heap memory in search of Cobalt Strike configuration structures which are dynamically allocated at initialization.
Commercial security products like AV and EDR are also known to use YARA. Carbon Black and CrowdStrike explicitly mention using YARA, and other vendors likely use it as well.
A quick Google search turns up many YARA rules for Cobalt Strike. For example, the following demonstration uses a set of rules targeting Cobalt Strike to scan two cmd.exe processes: one benign and one injected with an implant.
Detecting Cobalt Strike with YARA
Attributes of memory such as permissions and mapping information can also be used to identify potentially malicious code. Memory can be readable, writeable, or executable and mapped as image commit or private commit data. Memory is "image commit" if it was created by loading a file from disk such as an EXE or DLL. Memory is "private commit" if the process dynamically allocated it through API calls such as VirtualAlloc.
Moneta scans for memory pages that are both executable and private commit. All code must be executable, but code on Windows tends to be loaded from disk, so executable private memory is unusual. It does occur legitimately in JIT environments such as the .NET runtime or web browsers. Additionally, Moneta will check the start address of all threads for private commit memory addresses. This check is simple enough to evade since the start address of a thread is not changed after creation. A new thread with an image commit start address can be created in a suspended state, modified to execute the target shellcode, and resumed.
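The two attribute checks described above can be modeled in a few lines. This is a simplified sketch, not Moneta's implementation: the `Region` type, its fields, and the sample addresses are assumptions made for illustration, and a real scanner would read this information from the operating system.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    base: int
    size: int
    executable: bool
    mapping: str  # "image" or "private" (simplified; hypothetical labels)

    def contains(self, address: int) -> bool:
        return self.base <= address < self.base + self.size


def executable_private(regions: List[Region]) -> List[Region]:
    """Regions that are both executable and private commit."""
    return [r for r in regions if r.executable and r.mapping == "private"]


def thread_start_in_private(start: int, regions: List[Region]) -> bool:
    """True if a thread's start address falls inside private commit memory."""
    return any(r.mapping == "private" and r.contains(start) for r in regions)


if __name__ == "__main__":
    # Hypothetical memory layout for illustration.
    regions = [
        Region(0x7FF600000000, 0x10000, True, "image"),
        Region(0x1A0000, 0x1000, True, "private"),
        Region(0x2B0000, 0x2000, False, "private"),
    ]
    print([hex(r.base) for r in executable_private(regions)])  # ['0x1a0000']
    print(thread_start_in_private(0x1A0010, regions))           # True
    print(thread_start_in_private(0x7FF600001000, regions))     # False
```

Note that the thread check only looks at the recorded start address, which is why a thread created at an image commit address and redirected afterwards is not flagged by it.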
PE-sieve will scan executable, non-executable, or inaccessible memory for patterns that typically occur in shellcode, depending on the usage. In addition, PE-sieve will check the return address of all threads for private commit memory addresses.
Detecting Cobalt Strike with Moneta and PE-sieve
Finally, more recent memory scanners have introduced tracing of thread call stacks to identify potentially malicious code. Tools like BeaconHunter and Hunt-Sleeping-Beacons operate on a simple premise: identify any thread with a wait reason of "DelayExecution". Since Cobalt Strike and many other implants use the Sleep API call, this method can reliably detect malware implants. Unfortunately, there are often many false positives associated with the technique.
Since the initial release of AceLdr, Hunt-Sleeping-Beacons has been updated with a new method to detect FOLIAGE (more on this in the next section). The scanner now looks for threads with a wait reason of "UserRequest" which also have a return address to KiUserApcDispatcher somewhere on their call stack. This will be covered in further detail below.
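Both wait-reason heuristics reduce to simple predicates over per-thread data. The sketch below models them under stated assumptions: the `ThreadInfo` type is hypothetical, call stacks are represented as already-resolved symbol names, and the sample threads are invented. It is not the code of either tool.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ThreadInfo:
    tid: int
    wait_reason: str
    # Resolved symbol names, innermost frame first (hypothetical representation).
    call_stack: List[str] = field(default_factory=list)


def flag_delay_execution(thread: ThreadInfo) -> bool:
    """Original heuristic: any thread waiting with reason DelayExecution."""
    return thread.wait_reason == "DelayExecution"


def flag_apc_user_request(thread: ThreadInfo) -> bool:
    """Newer heuristic: UserRequest wait with KiUserApcDispatcher on the stack."""
    return thread.wait_reason == "UserRequest" and any(
        "KiUserApcDispatcher" in frame for frame in thread.call_stack
    )


if __name__ == "__main__":
    threads = [
        ThreadInfo(1, "DelayExecution", ["ntdll!NtDelayExecution", "KERNELBASE!SleepEx"]),
        ThreadInfo(2, "UserRequest", ["ntdll!NtWaitForSingleObject"]),
        ThreadInfo(3, "UserRequest", ["ntdll!NtWaitForSingleObject",
                                      "ntdll!KiUserApcDispatcher"]),
    ]
    print([t.tid for t in threads if flag_delay_execution(t)])   # [1]
    print([t.tid for t in threads if flag_apc_user_request(t)])  # [3]
```

The first predicate is broad, which explains the false positives noted above: any legitimate thread that calls Sleep satisfies it. The second is narrower because it requires two conditions at once.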
An interesting variation of stack tracing can be found in MalMemDetect. This scanner hooks API calls such as RtlAllocateHeap to check the return address at execution time. When Beacon calls one of these APIs, the return address on the stack will point to the implant shellcode, which resides in private commit memory.
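The execution-time check can be pictured as a lookup performed inside the hook. The following is a model only: the memory map, the `hooked_allocate` stand-in, and the addresses are hypothetical, and MalMemDetect itself is native code that reads the real return address from the stack.

```python
from bisect import bisect_right
from typing import List, Tuple

# (base, size, mapping) tuples sorted by base address; hypothetical layout.
MemoryMap = List[Tuple[int, int, str]]


def mapping_for(address: int, memory_map: MemoryMap) -> str:
    """Return the mapping type of the region containing address."""
    bases = [base for base, _, _ in memory_map]
    index = bisect_right(bases, address) - 1
    if index < 0:
        return "unmapped"
    base, size, mapping = memory_map[index]
    return mapping if address < base + size else "unmapped"


def hooked_allocate(return_address: int, memory_map: MemoryMap,
                    alerts: List[int]) -> None:
    """Stand-in for a hooked API: record callers returning into private memory."""
    if mapping_for(return_address, memory_map) == "private":
        alerts.append(return_address)


if __name__ == "__main__":
    memory_map = [(0x1000, 0x1000, "image"), (0x5000, 0x1000, "private")]
    alerts: List[int] = []
    hooked_allocate(0x1800, memory_map, alerts)  # caller in a loaded module
    hooked_allocate(0x5800, memory_map, alerts)  # caller in private commit memory
    print([hex(a) for a in alerts])  # ['0x5800']
```

Because the check runs at the moment of the call, protections that only apply while the implant sleeps have no effect on it.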
Detecting Cobalt Strike with MalMemDetect
The tools discussed above have capabilities outside this post's scope. I'd recommend looking through the code of each scanner if you're interested in learning more.
Developers can take advantage of their C2 implant's sleep period to implement protections that obfuscate the malware while it is idle, reducing the likelihood that a scanner will detect it. The longer an implant sleeps, the more of its lifetime it spends in this protected state, and the less likely it is to be found by the scanners these protections evade.
A bypass in the context of this post does not generate false positives. It is not meant to confuse analysts or blend in with existing results. A true bypass produces zero detections from a memory scanner both before and after an implant is injected.
The first technique that comes to mind for encrypting data is often single-byte XOR. Single-byte XOR is conveniently easy to implement, doesn't require API calls, and runs relatively quickly. Unfortunately, the authors of tools like YARA and PE-sieve recognized this and found ways to detect this encryption method with ease.
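A short sketch shows why single-byte XOR offers so little protection: there are only 256 possible keys, so a scanner can try all of them. The signature and buffer below are hypothetical, and this is an illustration of the idea rather than the approach any particular tool takes. YARA exposes a comparable capability through its `xor` string modifier.

```python
from typing import Optional


def xor_bytes(data: bytes, key: int) -> bytes:
    """XOR every byte of data with a single-byte key."""
    return bytes(b ^ key for b in data)


def find_single_byte_xor_key(buffer: bytes, signature: bytes) -> Optional[int]:
    """Return the first key under which the signature appears in buffer.

    The signature is XORed with each candidate key and searched for,
    which avoids decoding the whole buffer 256 times.
    """
    for key in range(256):
        if xor_bytes(signature, key) in buffer:
            return key
    return None


if __name__ == "__main__":
    signature = b"hypothetical-signature"  # invented for illustration
    masked = b"\x00" * 8 + xor_bytes(signature, 0x5A) + b"\x00" * 8
    print(find_single_byte_xor_key(masked, signature))        # 90
    print(find_single_byte_xor_key(b"\x00" * 32, signature))  # None
```

The cost to the scanner is 256 substring searches per signature, which is negligible compared to the cost of reading process memory in the first place.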
An alternative solution might implement functions that perform multi-byte XOR, AES, or RC4. However, it will become apparent in the following sections that this is not a viable option either. To completely evade scanners like Moneta, which search for any executable private memory, the code used for encrypting data must reside in image commit memory.
You can perform AES encryption using Windows APIs, but it requires a combination of multiple API calls to encrypt and decrypt data. An excellent solution for this problem is hinted at in Mimikatz. The author uses SystemFunction032: a system function that can be resolved from advapi32.dll to perform RC4 encryption and decryption. This API call accepts two arguments that contain the target memory and a key, allowing us to dynamically generate a key and encrypt data without executing code in private commit memory. Technically, SystemFunction032 is for encryption, and SystemFunction033 is for decryption. RC4 is a symmetric stream cipher in which encryption and decryption are the same operation, though, so you can use either API for either purpose.
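The reason either function works in both directions is that RC4 only XORs the data with a keystream derived from the key. The sketch below is a plain textbook RC4 in Python, written to demonstrate that property. It is not SystemFunction032 and makes no claims about that API's interface; the key and message are the widely published "Key" / "Plaintext" test vector.

```python
def rc4(key: bytes, data: bytes) -> bytes:
    """Textbook RC4: the same call encrypts and decrypts."""
    # Key-scheduling algorithm (KSA).
    s = list(range(256))
    j = 0
    for i in range(256):
        j = (j + s[i] + key[i % len(key)]) % 256
        s[i], s[j] = s[j], s[i]

    # Pseudo-random generation algorithm (PRGA), XORed with the data.
    i = j = 0
    out = bytearray()
    for byte in data:
        i = (i + 1) % 256
        j = (j + s[i]) % 256
        s[i], s[j] = s[j], s[i]
        out.append(byte ^ s[(s[i] + s[j]) % 256])
    return bytes(out)


if __name__ == "__main__":
    key = b"Key"
    message = b"Plaintext"
    ciphertext = rc4(key, message)
    print(ciphertext.hex())      # bbf316e8d940af0ad3
    print(rc4(key, ciphertext))  # b'Plaintext'
```

Applying the function twice with the same key returns the original bytes, which is the behavior the paragraph above relies on.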
Now that we've identified a method of encrypting data, we must decide which data should be encrypted. The beginning of this post referenced BeaconEye, a tool that scans dynamically allocated memory for Cobalt Strike configuration data structures.
Heap encryption is probably best performed in one of two ways:
The official Sleep Mask Kit from Cobalt Strike provides a list of memory addresses for encryption. Their solution is clean, but it requires the use of Sleep Mask Kit, which, as described in the following section, prevents us from bypassing some scanners.
Last year, I released a fork of TitanLdr, which creates a new heap before Beacon is loaded. The GetProcessHeap API is hooked in the implant's IAT so that the implant receives this new heap whenever it asks for the process heap to allocate memory. This allows us to encrypt all entries on the secondary heap since only the implant should use it. The following demonstration uses this fork to bypass BeaconEye.
Consistently bypassing tools like Moneta and PE-sieve requires a combination of encryption to evade pattern matching and memory permission control to evade attribute scanning.
An executable stub such as that used in Sleep Mask Kit or Shellcode Fluctuation can encrypt the implant code at rest and make it non-executable. Both examples require at least one executable region to remain unchanged, though. There will always be at least one point of detection from scanners using the "masking stub" technique, and YARA rules can be created to detect the stub itself.
The Gargoyle PoC influenced the creation of the other techniques discussed in this section. The author used asynchronous procedure calls to queue and execute a series of ROP gadgets that run while the initiating code is non-executable.
Gargoyle is only provided for 32-bit Windows, and the PoC only executes a message box. Earlier this year, Waldo-irc released YouMayPasser: a 64-bit implementation of Gargoyle, ready to use with Cobalt Strike.
Gargoyle and YouMayPasser achieve our goal of changing the implant code to non-executable. Still, they suffer the same issues as many ROP exploits: different versions of Windows require modifications to the gadget offsets. There are ways to solve this problem, but they can introduce significant complexity.
Inspired by Gargoyle, Austin Hudson released FOLIAGE: an alternative to traditional ROP, which uses the NtContinue API call to control execution during sleep. NtContinue is typically used in error handling to restore the execution context of a thread. It accepts a new context as the single argument and modifies the current thread to use this context. A context structure specifies values for CPU registers, including the instruction pointer, so it can redirect execution to a specified address. FOLIAGE queues a series of APCs which execute NtContinue to switch contexts repeatedly. A new context structure is used for each of the following steps in a chain that obfuscates the implant.
This process can be further examined by reviewing lines 217-512 of sleep.c in FOLIAGE.
A couple of months ago, C5pider claimed to have reversed MDSec NightHawk to create Ekko: an alternative to FOLIAGE which uses CreateTimerQueueTimer instead of NtQueueApcThread to queue calls to NtContinue.
The following demonstration uses FOLIAGE to bypass Moneta and PE-sieve.
Avoiding Moneta and PE-sieve
NtContinue is not the only API call that forcefully changes execution with context structures. It conveniently requires only one argument, but there are also viable alternatives.
Tools like BeaconHunter and Hunt-Sleeping-Beacons alert on threads with a wait reason of "DelayExecution". This detection can be easily evaded using an alternative method of delaying execution which does not set this wait reason. WaitForSingleObject is an API that fits this requirement and sets a wait reason of "UserRequest". The following demonstration replaces the Sleep API call with WaitForSingleObject to bypass these tools.
Return address spoofing modifies the return address on the call stack so that it does not point to private commit memory. This section is split into two distinct techniques: return address spoofing at rest and at execution.
The term "at rest" refers to the implant during sleep. Most of the techniques discussed so far focus on this time as well. Commercial security products do not appear to be scanning the thread call stacks at rest, but open-source scanners such as PE-sieve will check return addresses when scanning.
This detection can be partially evaded using a technique such as ThreadStackSpoofer. This PoC hides the return address by overwriting it with zero, effectively truncating the stack. Depending on the state of the stack, however, this technique may leak arguments onto the stack. These arguments may resemble memory addresses and so create an indicator for scanners that inspect return addresses.
A more stable technique is demonstrated in FOLIAGE. Its author uses NtSetContextThread to overwrite the original thread's context with a manufactured context that sets the desired return address. The usage of NtSetContextThread is relatively rare and may be a point of detection. At the time of release, I had not observed open-source scanners or commercial security products raising alerts on this behavior.
The other time a thread's call stack may be captured is "at execution". This is demonstrated most clearly in MalMemDetect, as described above. To evade tools like this, our return address must point to image commit memory whenever we make hooked API calls.
The x64 Return Address Spoofing PoC accomplishes this nicely. Before the API call is made, the address of a ROP gadget from a loaded DLL is stored as the return address. When the API returns, the gadget jumps to a stub that restores the context necessary to continue execution.
Since the release of AceLdr, Hunt-Sleeping-Beacons has been updated to detect FOLIAGE. The scanner will now check all threads with a wait reason of "UserRequest" which also have a return address to KiUserApcDispatcher somewhere on their call stack. This cannot be easily bypassed with the public implementation of FOLIAGE as it requires call stack spoofing of API calls in the sleep chain at execution. Since FOLIAGE is obfuscating the shellcode used for return address spoofing, it cannot be called by the APC thread to spoof return addresses.
As a part of this research, I released an implementation of the previously discussed techniques called AceLdr. This tool is a user-defined reflective loader (UDRL) for Cobalt Strike with the following features at the time of release.
Black Hills Information Security used this tool for approximately one year before releasing it publicly. Below is a demonstration of AceLdr bypassing several memory scanners.
Avoiding Memory Scanners with AceLdr
While AceLdr is made explicitly for Cobalt Strike, the techniques demonstrated in this post can be easily ported to many other projects. Each method presented here bypasses existing scanners. However, this does not guarantee they will evade future implementations, as we've already seen with Hunt-Sleeping-Beacons.
Memory scanners and commercial security products share many characteristics, but they are not the same. Evading open-source scanners does not guarantee security product evasion. Conversely, security product evasion often does not require a complete memory scanner bypass since system resources and development costs limit vendors.
Published by Kyle Avery on September 9, 2022