首页
社区
课程
招聘
[转帖]Dumping DPC Queues: Adventures in HIGH_LEVEL IRQL
发表于: 2020-2-12 11:32 2095

[转帖]Dumping DPC Queues: Adventures in HIGH_LEVEL IRQL

2020-2-12 11:32
2095

Original link: https://repnz.github.io/posts/practical-reverse-engineering/dumping-dpc-queues/

 

This post is part of the Practical Reverse Engineering Exercises series.

 

To understand more about the basics of DPCs, read Reversing KeInsertQueueDpc

 

(Source code below.)

 

Exercise: Write a driver to enumerate all DPCs on the entire system. Make sure you support multi-processor systems! Explain the difficulties and how you solved them.

 

Sounds fun! let’s start. I thought about dividing this post to 2 posts, but nah

Doing Undocumented Shit in Windows

First of all, we need to understand that accessing the DPC queue from a real product is an extremely bad idea because it’s a pretty undocumented data structure. Although the usage of DPC is documented, the internals are not. Doing stuff like this in a real product is a good way to cause blue screens to your clients. For example, the DPC queue data structure was changed between windows 7 and 10. If you wrote code that used the old structure, your code won’t work or will cause BSODs. The thing is, sometimes it is needed to do undocumented stuff in the kernel, so it’s useful to know how to do it reliably. Altough sometimes undocumented stuff can be done reliably, this is very hard and requires years of experience in the windows kernel - honestly I don’t have the confidence to say that my solution is reliable but I do know I tried my best to not cause BSODs. So for the sake of the exercise it’s fine but.. yeah.

Finding the DPC queue

To solve this exercise, we need to understand where DPCs are saved. As we saw in the last solution each CPU has a structure called the PRCB that contains much per-processor information. The PRCB contains the DpcData - this is a structure that describes the queue of DPCs.

struct _KPRCB {
    ...
    ...
    ...
    struct _KDPC_DATA DpcData[2];
    ...
    ...
}

This structure looks like this:

//
// This is Prcb.DpcData in < 8.1
//
typedef struct _KDPC_DATA_1 {
    LIST_ENTRY DpcListHead; // Practically pointer to the ListHead and LastEntry..
    KSPIN_LOCK DpcLock;
    LONG DpcQueueDepth;
    ULONG DpcCount;
} KDPC_DATA_1, * PKDPC_DATA_1;


//
// This structure only exists in >= 8.1.
//
typedef struct _KDPC_LIST {
    SINGLE_LIST_ENTRY ListHead;
    PSINGLE_LIST_ENTRY LastEntry;
} KDPC_LIST, * PKDPC_LIST;


//
// This is Prcb.DpcData in >= 8.1
//
typedef struct _KDPC_DATA_2 {
    KDPC_LIST DpcList;
    KSPIN_LOCK DpcLock;
    LONG DpcQueueDepth;
    ULONG DpcCount;
    PKDPC ActiveDpc;
} KDPC_DATA_2, * PKDPC_DATA_2;

As you can see it was changed in windows 8.1 so we’ll have to be careful with that in our code. In the practical reverse engineering book, the structure was defined like the older one.

 

Ok so we need to find it in runtime. How can we do it?

 

The first thing that comes in mind is - look at the KDPC structure:

//0x40 bytes (sizeof)
struct _KDPC
{
    UCHAR Type;                                   //0x0
    UCHAR Importance;                             //0x1
    volatile USHORT Number;                       //0x2
    SINGLE_LIST_ENTRY DpcListEntry;               //0x8
    ULONGLONG ProcessorHistory;                   //0x10
    PKDEFERRED_PROCEDURE DeferredRoutine;         //0x18
    VOID* DeferredContext;                        //0x20
    VOID* SystemArgument1;                        //0x28
    VOID* SystemArgument2;                        //0x30
    PVOID DpcData;                                //0x38
};

We can see DpcData over there - yay! Actually if you read the book, you see the following sentence:

 

“DpcData—A pointer to a KDPC_DATA structure:”

 

Ok this sounds promising! Let’s queue a DPC object and then look at the DpcData member and that way we can find KDPC_DATA. There’s only one problem with this logic - it does not work - let’s see why. When we look at the value of DpcData we can see it’s initialized inside KiInsertQueueDpc:

mov     rcx, [rbp+2D90h] ; KPRCB.IsrDpcStats
cmovnz  rcx, rax
xor     eax, eax
lock cmpxchg [rdi+_KDPC.DpcData], rcx

So - the thing is: The DpcData pointer does not point to KDPC_DATA anymore! It was like this before the change but now it points to some shitty structure containing statistics about interrupts. Heh..

 

So we need to find another way to locate DpcData..

 

If you look at the PRCB structure, you’ll see this:

typedef struct _KPRCB {
    ...........
    ...........
    ...........
    union _SLIST_HEADER InterruptObjectPool;                                //0x2dc0
    ULONGLONG PrcbPad41[6];                                                 //0x2dd0
    struct _KDPC_DATA DpcData[2];                                           //0x2e00
    VOID* DpcStack;                                                         //0x2e50
    LONG MaximumDpcQueueDepth;                                              //0x2e58
    ...........
    ...........
    ...........
} KPRCB, *PKPRCB;

Ok, do you have any ideas?

 

When a DPC is scheduled to run (When it does not run under the idle thread) the value of DpcStack is used to initialize the stack of the DPC. AHA! We can do the following:

  • Schedule a DPC and then save the value of RSP at the DpcRoutine.
  • Use this value and find DpcStack in the PRCB.
  • Look right above the DpcStack and find DpcData

This is pretty hacky but it could work, Let’s think about another implementation. As we can see the KDPC_DATA is embedded inside the KPRCB structure. If you look at the KDPC_DATA structure, you’ll see it has 2 pointers:

  • Pointer to the top item in the queue
  • Pointer to the last entry in the queue

The next idea is to basically insert items to the queue by ourselves and then search for a pointer to our items inside the KPRCB. I implemented the scan with the Rsp value in my solution, Although it’s probably better to do it using the second method because it’s more reliable.

Safe (?) Signatures

Ok.. so let’s say we implement the idea from above

  • How can we know the address we got is the right one?
  • How can we know the structure was not changed and the offsets are the same?

Doing shady stuff in kernel mode can easily cause blue screens, let’s understand why. Eventually we’ll need to dereference pointers to do useful work - let’s say we think we found the address of the DPC queue and we want to use it. We may naively think about using “__try” to prevent access violation exceptions in case we get a bad address:

//
// DpcData may be invalid... 
//
PKDPC_DATA DpcData = FindDpcData(); 
....
....
// 
// Go over the DPC queue and copy the DPC information into the output buffer
// **** BAD CODE: USING __try IS DUMB HERE. *****
//
__try {
    //
    // Dereference DpcData to get the list head
    //
    PKDPC CurrentDpc = CONTAINING_RECORD(DpcData->ListHead.Next, KDPC, DpcListEntry);

    for (LONG i = 0; i < DpcData->DpcQueueDepth; i++) {
        RtlCopyMemory(&DpcObjectsCopy[i], CurrentDpc, sizeof(KDPC));

        if (CurrentDpc->DpcListEntry.Next == NULL) {
            break;
        }

        //
        // Dereference CurrentDpc to get the next entry
        //
        CurrentDpc = CONTAINING_RECORD((CurrentDpc->DpcListEntry.Next), KDPC, DpcListEntry);
    }

    return TRUE;
}
__except(EXCEPTION_EXECUTE_HANDLER) { 
    return FALSE;
}

This code may look fine - the problem is - if we were incorrect, the exception probably won’t be caught in our SEH handler. In the windows kernel, Windows expects you not to access invalid addresses in the kernel range of addresses - That’s why in certain situations the windows kernel will immediately initiate a blue screen without executing SEH when an exception occurs. The usage of __try is mainly for validating and accessing user mode addresses (Also make sure you lock the pages and use ProbeForRead/Write).

 

So how can we know we got the correct address? Let’s say we want to validate the address of DpcData - how should we do it?

 

A bad way to do it (which sometimes is necessary) is to use MmIsAddressValid. This function returns TRUE if the PTE for the virtual address is a valid PTE, else FALSE. This function is bad because:

  1. It’s prone to race conditions - the address may not be valid after the call to MmIsAddressValid
  2. This function has a big drawback: If the memory is paged out, it will return FALSE - we would want it to return TRUE because eventually if we access a paged out address (Not >= DISPATCH_LEVEL and not inside a file system paging IO handler) it should just page in the memory and give us the memory as expected - so in case the address can be paged out we cannot use MmIsAddressValid.

I found that MmIsAddressValid can be useful in case you want to scan and look for an address.. Still, avoid using it if not necessary.

 

So we may want a different solution. If you hang out in unknown cheats enough you’ll find the following post: “Read Unknown Kernel Address In A Safe Way”

 

In short, you can use MmCopyMemory - If you look at the MSDN documentation you’ll see the following sentence:

 

Kernel-mode drivers can call this routine to safely access arbitrary physical or virtual addresses.

 

The function allows the caller to read arbitrary amount of memory safely:)

 

This function was added in 8.1. Before that, you can use MmMapIoSpace to create another virtual address range for a physical address, and use this new range to read the pages. (The drawback of this approach is that if the range is more than one page it’ll be harder but possible..)

 

In our case, these functions cannot be used because of synchronization problems we’ll discuss soon. (We need to access these data structures at HIGH_LEVEL IRQL to prevent preemption and race conditions..) so we need to think of other ways to validate our result.

 

Sometimes in these signatures you have to be creative - that’s why I think it’s fun. (Remember, fun stuff is not always allowed to get into real products because of stability issues…) - Actually we can validate the address of DpcData without performing any dereference ourselves!

 

The address of the DpcData we get must be within the range of the KPRCB, that means can safely access these members:

typedef struct _KDPC_DATA {    
    _KDPC_LIST DpcList
    KSPIN_LOCK DpcLock;
    LONG DpcQueueDepth;
    ULONG DpcCount;
    PKDPC ActiveDpc;
} KDPC_DATA, *PKDPC_DATA;

typedef struct _KDPC_LIST {
    SINGLE_LIST_ENTRY ListHead;
    PSINGLE_LIST_ENTRY LastEntry;
} KDPC_LIST, * PKDPC_LIST;


typedef struct _SINGLE_LIST_ENTRY {
    struct _SINGLE_LIST_ENTRY *Next;
} SINGLE_LIST_ENTRY, *PSINGLE_LIST_ENTRY;

Ok, some validations we can still perform:

  1. Queue a DPC (With KeInsertQueueDpc) then inside the DpcRoutine validate that ActiveDpc points to our DPC object.
  2. Queue a HighImportance DPC and compare DpcData->ListHead.Next to the DpcListEntry of our DPC object.
  3. Queue a LowImportance DPC and compare DpcData->LastEntry to the DpcListEntry of our DPC object.
  4. After queueing a new DPC, validate that the DpcCount and DpcQueueDepth are increased by one.
  5. After removing a DPC, validate that the DpcQueueDepth is decreased and DpcCount stays the same.

These validations are simply comparisons of values inside KDPC_DATA and do not require any dereference. That way we can validate that we got the correct DpcData address and the offsets of the members were not changed. The only member that probably cannot be validated is the DpcLock - When we discuss synchronization we’ll talk about how we can solve locking issues.

 

Again, these methods are not 100% reliable, but they are good enough for the solution to the exercise. Sometimes in the windows kernel you just have to use undocumented functions / data structures, try to do it safely.

Synchronization Issues

Ok, now we have the address of DpcData and we want to iterate over the DPCs in the queue and then print information about the queue. As we said earlier, the DpcData structure is embedded inside the KPRCB - that means this data structure is a per-processor data structure. Let’s understand what synchronization issues we need to deal with before accessing this data structure.

Synchronization at PASSIVE_LEVEL

Normally, when we want to access a data structure that is shared between a couple of threads (That run in Irql PASSIVE_LEVEL) we need to use some synchronization primitive - typically some kind of “lock” that will prevent 2 threads from modifying the data structure at the same time. For example, in the windows kernel we can use the ERESOURCE reader-writer lock to synchronize access to a data structure. In user mode we have critical sections.

 

But what if we want to access this data structure in IRQL DISPATCH_LEVEL?

What’s IRQL?

If you don’t know what’s IRQL - in a sentence it’s the definition in windows of the priority level of the CPU. Code in user mode runs at 0 priority level (PASSIVE_LEVEL) and each registered interrupt has a level assigned with it. When an interrupt arrives to the CPU, The CPU checks whether the current CPU level is higher than the level of the interrupt - if the level is higher or equal, the interrupt will be pending until the level lowers down to a lower level. After that, the interrupt will execute. This mechanism is used to:

  • Allow interrupts from hardware devices to execute right away (interrupt handlers need to execute very fast)
  • Allow software interrupts to execute - For example APCs and DPCs are implemented using software interrupts - Special kernel APC runs at APC_LEVEL and if it’s queued an actual software interrupt is sent to the CPU.
  • Assist the implementation of context switching

Summary of IRQLs below: (The values are from x86)

  • 0 - PASSIVE_LEVEL - Most code you’ll write.
  • 1 - APC_LEVEL - Special kernel APCs
  • 2 - DISPATCH_LEVEL - The scheduler cannot be activated - You cannot sleep, cannot call WaitForObject or perform any file I/O.
  • above 2 - DIRQL (Device IRQL) - Typically IRQLs assigned to interrupt handlers.

IRQL is a pretty essential topic for kernel developers and researchers. I think it’s also cool;)

Synchronization at DISPATCH_LEVEL

Sometimes you’ll run code at DISPATCH_LEVEL - This will typically occur if you:

  • Write network filters - Some of the callbacks of network filters are invoked at DISPATCH_LEVEL (WFP callouts for example)
  • Use KTIMER Timers - This timer works by queuing a DPC for you - DPCs are invoked at DISPATCH_LEVEL.
  • Write an interrupt handler - Typically DPCs are queued in interrupt handlers to continue the work of the DPC.
  • Raise IRQL to DISPATCH_LEVEL - This happens in certain cases we’ll see soon.

To synchronize at DISPATCH_LEVEL you’ll have to use a synchronizaion object called Spin Locks.

 

A spin lock is simply a flag that can be either acquired or released. When you want to acquire the spin lock, you just wait in a loop until the spin lock is released and then change the flag to be acquired. When you want to release the spin lock, you simply set it to “released”. This is also called “busy waiting” and this synchronization object is used mostly in high IRQLs. WaitForObject cannot be used at >= DISPATCH_LEVEL becuase the scheduler cannot be invoked.

 

Ok so let’s say we want to share a queue between DISPATCH_LEVEL code and an interrupt handler (let’s say IRQL 3). To do this, we’ll try to use Spin Locks. There’s a problem with using a spin lock to do so. Try to find the bug here:

//
// Runs at IRQL 3 (DIRQL)
//
VOID InterruptHandler() 
{
    //
    // When an interrupt arrives get the DEVICE_JOB
    // and queue it
    //
    DEVICE_JOB* Info = GetDeviceJob();

    AcquireSpinLock(&DeviceJobQueue->Lock);

    QueueDeviceJob(&DeviceJobQueue, Info);

    ReleaseSpinLock(&DeviceJobQueue->Lock);
}

//
// Runs at DISPATCH_LEVEL
//
VOID RunQueuedItems() 
{
    //
    // Process all the jobs in the queue every 10 seconds
    //
    while (TRUE) {
        Sleep(SECONDS(10));

        // DONT DO IT: There's a bug here.
        // Use spin lock to synchronize access to the queue.
        // Make sure that jobs cannot be added to the queue while processing

        AcquireSpinLock(&DeviceJobQueue->Lock);

        for (ULONG i = 0; i < DeviceJobQueue->JobCount; i++) {
            ProcessJob(&DeviceJobQueue->Queue[i]);
        }

        DeviceJobQueue->JobCount = 0;

        ReleaseSpinLock(&DeviceJobQueue->Lock);
    }

}

Image the following situation:

  1. The DeviceJobQueue has 10 items in it.
  2. Queue processing is started, the spin lock is acquired to process the queue.
  3. Device interrupt arrives! The code of the InterruptHandler function starts at IRQL 3.
  4. AcquireSpinLock is called while the spin lock is acquired.
  5. The InterruptHandler function waits for the spin lock to be released.
  6. The spin lock is never released because the RunQueuedItems function was interrupted.
  7. Deadlock:)

How would you solve this deadlock?

 

The simplest solution (used also in Windows) is to disable this preemption - prevent the code of the interrupt handler from interrupting code that touches the queue. To do this we’ll raise our IRQL to the same IRQL of the interrupt handler (3):

VOID RunQueuedItems() 
{
    //
    // Process all the jobs in the queue every 10 seconds
    //
    while (TRUE) {
        Sleep(SECONDS(10));

        while (TRUE) {

            KIRQL OldIrql;

            //
            // Get a job from the queue
            // Make sure to raise the IRQL when accessing the queue
            //
            KeRaiseIrql(DeviceJobIrql, &OldIrql);

            AcquireSpinLock(&DeviceJobQueue->Lock);
            DEVICE_JOB* DeviceJob = PopDeviceJobQueue(&DeviceJobQueue);
            ReleaseSpinLock(&DeviceJobQueue->Lock);

            KeLowerIrql(OldIrql);

            if (!JobQueue)
                break;

            //
            // Process the job
            //
            ProcessJob(Job);

        }
    }

}

A similar approach is used in the DPC mechanism.

Synchronizing with DPC functions

DPCs can be queued from any IRQL (even the highest IRQL) that means that to synchronize with such code we have to raise our IRQL to HIGH_LEVEL so we won’t get interrupted. If you look at the implementation of KeInsertQueueDpc, you’ll see the following instructions:

mov     rcx, cr8       ; CR8 is the register that holds the current IRQL
mov     [rsp+TempIRQL], rcx
mov     eax, HIGH_LEVEL
mov     cr8, rax        ; Raise to HIGH_LEVEL

After raising the IRQL to HIGH_LEVEL the function inserts the DPC to the queue.

 

At the end of the function, you’ll see this code:

mov     rbx, [rsp+TempIRQL]
...
...
mov     cr8, rbx

To synchronize with such code we’ll have to raise the IRQL to HIGH_LEVEL as well to prevent interrupts. Wait, is something missing? If you look at the implementation of KeInsertQueueDpc you’ll see it acquires the DpcData->DpcLock before touching the queue. You may ask yourselves now: Why is this lock needed if the IRQL is raised?

 

The thing is: The spin lock is needed to allow access from multiple CPUs to the same DPC queue - this is needed typically when using the following function:

void KeSetTargetProcessorDpc(
  PRKDPC Dpc,
  CCHAR  Number
);

This function allows the user to queue a DPC to a different CPU, that’s why you need to access the DPC queue of another CPU. As a matter of fact, If the DPC could not be queued to a different CPU, raising IRQL would be enough. The method of disabling interrupts is still useful on some weird machines with 1 core.

Recursive spin lock issue

If you remember from our signature implementation, we would like to do the following: (can you spot the issue?)

  1. Acquire the DPC queue spin lock
  2. save the depth of the queue
  3. call KeInsertQueueDpc
  4. check that the depth was increased by one
  5. Do a bunch of other validations…
  6. release the spin lock

The issue here is that calling KeInsertQueueDpc would cause a deadlock! The reason is that spin locks cannot be acquired recursively and KeInsertQueueDpc tries to acquire the spin lock after we already acquired it in step 1.

 

Oh man… What can we do to solve this issue? The reason we wanted to acquire the spin lock from the beginning is to make sure other CPUs will not access our queue. Another issue with acquiring this lock is that the offset of the DpcLock is undocumented, we may think we know the lock is there but maybe it’s not there..?

 

Maybe we should think about a different way to make sure the queue will not be accessed by other CPUs.. Can you think of such way?

 

Imagine the following: We can send a DPC to all processors and then try to synchronize using a spinlock to make sure our code is the only code that runs globally on all CPUs! Man, this sounds complicated (not really) but Windows has already a built in mechanism called IPI (Inter-Processor Interrupt) we can utilize for such case:

ULONG_PTR KeIpiGenericCall(
  PKIPI_BROADCAST_WORKER BroadcastFunction,
  ULONG_PTR              Context
);

MSDN Documentation:

 

The KeIpiGenericCall routine causes the specified routine to run on all processors simultaneously. When a driver calls KeIpiGenericCall, the system interrupts every processor and raises the IRQL to IPI_LEVEL (interprocessor interrupt level). Each processor spins on a barrier until all processors have reached the barrier; then, all processors begin calling IpiGenericCall. KeIpiGenericCall waits for all calls to IpiGenericCall to complete before returning.

 

We can broadcast our function to all processors! This way we make sure our code is the only code that executes on all processors. Using IPI has these advantages:

  1. We don’t need to be afraid of using an undocumented spin lock
  2. We can call the DPC functions without getting a deadlock.

Implementation

Source Code

 

The general implementation does the following:

  1. Validate that the layout of the DPC object is as expected by creating a DPC and checking the values of its members.
  2. Fetch the DpcData member by assuming that this is the layout of the structure:
struct _KPRCB {
    ...
    ...
    ...
    struct _KDPC_DATA DpcData[2];
    VOID* DpcStack;                                                         
    ...
    ...
}

The DpcStack member is right after the DpcData member. If the offset of DpcStack can be found, it can be used to find the DpcData member. Of course, this assumption is validated.

 

The offset of the DpcStack member is found by queueing a DPC. When the DPC routine runs the value of RSP is taken from this member. This is true only in the context of a DPC interrupt. If the idle thread runs the DPC routine, the value of RSP is not changed - So we have to make sure the DPC is run in the context of a DPC interrupt. We do this by queuing the DPC from PASSIVE_LEVEL with HighImportance. So we take the value of RSP, search in the KPRCB, and find the DpcData.

 

3) After the KDPC_DATA was found, trigger an inter-processor interrupt that will validate the contents of the KDPC_DATA structure. An IPI is used to prevent other CPUs from changing the queue.

 

(I think the better implementation is using the ListHead / LastEntry scan to find the structure but honestly this is the solution I wrote first.)

Summary

This exercise was pretty fun, see you next time! @0xrepnz


[招生]科锐逆向工程师培训(2024年11月15日实地,远程教学同时开班, 第51期)

收藏
免费 0
支持
分享
最新回复 (2)
雪    币: 2510
能力值: ( LV1,RANK:0 )
在线值:
发帖
回帖
粉丝
2
感谢分享
2020-2-13 01:06
0
雪    币: 83
活跃值: (1092)
能力值: ( LV8,RANK:130 )
在线值:
发帖
回帖
粉丝
3
转载 最好搞成翻译版 加上自己对文章的理解与注释 会更好
2020-2-13 06:55
0
游客
登录 | 注册 方可回帖
返回
//