[Original] Source Code Analysis of the Memory Management Module in Linux Kernel 5.13

Posted on 2021-09-01 09:47
This article is based on Linux kernel 5.13, the newest release at the time of writing.
Memory management has always been one of the most important subsystems in the kernel. This article tries to walk through some of its core pieces and, combined with our experience in exploit development, deepen the reader's understanding of the kernel.
Let's start from the very top.
Suppose we have three CPUs: C1, C2, and C3.
UMA/SMP: Uniform Memory Access.
Roughly speaking, C1, C2, and C3 share all physical memory as one pool, while each processor may have its own private high-speed cache.
NUMA: Non-Uniform Memory Access.
Here C1, C2, and C3 do not simply "share" memory.
Concretely, from the point of view of CPU1, the memory attached to CPU1's memory controller is considered local memory, while the memory attached to CPU2 is considered external or remote memory for CPU1.
Remote memory accesses carry extra latency compared to local accesses, because they have to traverse the interconnect (point-to-point links) and go through the remote memory controller. Because memory is placed differently, the system experiences "non-uniform" memory access times.
The memory hierarchy in the Linux kernel is organized as node -> zone -> page.
As shown above, each CPU maintains its own node, and under NUMA that node can be thought of as its local memory.
Each node is further divided into several zones.
In the kernel source, the nodes are kept in a global array.
In that array, each pglist_data contains the node's node_zones together with references to them.
Each zone maintains a few important fields:
watermark: the zone's watermarks.
spanned_pages
long lowmem_reserve[MAX_NR_ZONES]
An array that reserves some memory in lower zones, to avoid triggering the OOM killer in a low zone while plenty of reclaimable memory is still available in higher zones. It is another form of reserved memory.
zone_start_pfn: the first physical page frame number of the zone; zone_start_pfn + spanned_pages gives the last page frame number of the zone.
free_area: describes how many free page frames are still available for allocation in this zone.
It is worth mentioning that zones come in several types (analogous to slab caches), which can be inspected as follows:
Next, let's talk about pages and page frames. Their relationship is like eggs (pages) in a basket (page frames).
A page is normally 4 KB in size and is the smallest unit used to manage physical memory.
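As a quick sanity check, the page size can be queried from user space; a minimal sketch using sysconf() (the result is typically 4096 on x86-64, though other values are possible, e.g. 16 KB on some arm64 configurations):

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Ask the kernel which page size this process actually sees. */
    long page_size = sysconf(_SC_PAGESIZE);

    printf("page size: %ld bytes\n", page_size);   /* typically 4096 */
    return 0;
}
```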
The most important members of struct page are:
flags: describes various attributes of the page frame.
The layout of flags is as follows:
We mainly care about the flag bits at the end, which describe the page's state.
_mapcount: how many times this page frame has been mapped, i.e. how many page table entries reference it.
lru: depending on how actively the page frame is used (its access frequency), it is placed on different lists, which later drive page reclaim.
_refcount: the reference count.
This field must not be used directly; it has to be read and written atomically through the helpers in include/linux/page_ref.h (a short sketch follows after this list).
pgoff_t index: for file mappings, the offset of this page within the file, in units of the page size.
mapping: we only cover the two most common cases:
If the page is an anonymous page, page->mapping points to its anon_vma, and the PAGE_MAPPING_ANON bit is set to mark this case.
If the page is not anonymous, i.e. it is associated with a file, mapping points to the address space of the file's inode.
Depending on whether the page lies in a VM_MERGEABLE area and whether CONFIG_KSM is enabled, this pointer can point to yet other places.
See /include/linux/page-flags.h for details.
With this background in place, further reading on struct page will deepen the understanding.
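To illustrate the _refcount point above, here is a minimal kernel-side sketch (assuming module context) that only uses the accessors from include/linux/page_ref.h and linux/mm.h instead of touching the field directly:

```c
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/page_ref.h>
#include <linux/printk.h>

static void refcount_demo(void)
{
	/* Grab one page from the buddy allocator; it comes back with _refcount == 1. */
	struct page *page = alloc_page(GFP_KERNEL);

	if (!page)
		return;

	pr_info("refcount after alloc: %d\n", page_ref_count(page));

	get_page(page);		/* atomically take an extra reference */
	pr_info("refcount after get_page: %d\n", page_ref_count(page));

	put_page(page);		/* drop our extra reference */
	put_page(page);		/* drop the last reference; the page returns to the buddy system */
}
```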
In this part we focus on the 4-level page table layout used on x86-64,
namely PGD -> PUD -> PMD -> PTE.
A helpful illustration:
Given a virtual address (v_addr), the paging mechanism is used to obtain the corresponding physical address (p_addr). Let's walk through the v_addr -> p_addr translation.
Note that every process owns its own PGD. The PGD is a physical page holding an array of pgd_t, so every process has its own set of page tables.
On a context switch, the process page tables are switched: the new process's pgd (page directory) is loaded into the CR3 register.
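To make the walk concrete, here is a hedged sketch of resolving an address in software with the generic page-table accessors (in 5.13 there is formally a p4d level between PGD and PUD, folded away on 4-level configurations); huge-page leaves are ignored and error handling is minimal:

```c
#include <linux/mm.h>
#include <asm/pgtable.h>

/* Walk the given mm's page tables for one address; returns the PTE value or an empty PTE. */
static pte_t lookup_pte(struct mm_struct *mm, unsigned long addr)
{
	pgd_t *pgd = pgd_offset(mm, addr);      /* bits 47:39 index the PGD */
	p4d_t *p4d;
	pud_t *pud;
	pmd_t *pmd;
	pte_t *pte;

	if (pgd_none(*pgd) || pgd_bad(*pgd))
		return __pte(0);
	p4d = p4d_offset(pgd, addr);            /* folded on 4-level paging */
	if (p4d_none(*p4d) || p4d_bad(*p4d))
		return __pte(0);
	pud = pud_offset(p4d, addr);            /* bits 38:30 */
	if (pud_none(*pud) || pud_bad(*pud))
		return __pte(0);
	pmd = pmd_offset(pud, addr);            /* bits 29:21 */
	if (pmd_none(*pmd) || pmd_bad(*pmd))
		return __pte(0);
	pte = pte_offset_kernel(pmd, addr);     /* bits 20:12 select the PTE */
	return *pte;
}
```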
If you are familiar with kernel pwn, there is a mitigation called KPTI (Kernel Page Table Isolation). When a challenge enables KPTI, you cannot simply "land" back in user mode.
The core idea of KPTI is that every process gets two sets of page tables: a kernel-mode set (only accessible in kernel mode) and a user-mode set, living in different address spaces.
Every syscall therefore involves switching between the user and kernel page tables (switching CR3).
If we return to user space (iretq/sysretq) without properly switching or setting CR3, the page tables are wrong and we end up with a segmentation fault.
To bypass this we usually reset CR3 the way SWITCH_USER_CR3 does:
or return through the swapgs_restore_regs_and_return_to_usermode function.
Knowing how the page tables look under KPTI, it becomes obvious that without KPTI only the per-process page tables keep changing; there is a single global kernel page table shared by all processes, and the kernel-address part of each process's page tables is a copy of that kernel page table. To index the kernel page table we can use init_mm.pgd.
And swapper_pg_dir is essentially the base address of the kernel PGD.
For how the kernel page tables are created, see:
https://richardweiyang-2.gitbook.io/kernel-exploring/00-evolution_of_kernel_pagetable
TLB stands for Translation Lookaside Buffer; it is essentially a small, fast cache. As covered in computer architecture courses, caches can be fully associative, set associative, or direct mapped.
Normally, translating a virtual address into a physical one requires a four-level page table walk.
The TLB provides a much faster way to perform that translation.
The TLB is a small, virtually addressed cache in which each line holds a block consisting of a single PTE (page table entry). Without a TLB, every data access would require two memory accesses: one to look up the page table for the physical address and one to fetch the data.
Caches with different mapping schemes are organized differently, but the overall idea is the same: look up the cache by virtual address, and on a TLB hit the physical address is obtained directly.
The TLB holds recently used page table entries. Given a virtual address, the processor checks the TLB for a matching entry (a TLB hit), retrieves the frame number, and forms the physical address. If no entry is found (a TLB miss), the page number is used to index the process's page table. The walk checks whether the page is in main memory; if not, a page fault is raised, and the TLB is then updated with the new entry.
The figure makes it clear that the TLB provides a mapping from v_addr[12:47] to p_addr[12:47] (the low 12 bits are identical, so they need no translation).
The ASID is mainly used to distinguish different processes.
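The bit-slicing behind this is easy to reproduce: with 4 KB pages, bits 11:0 are the in-page offset and each of the four levels consumes 9 index bits. A small user-space sketch of the decomposition (pure arithmetic, no kernel interfaces; the address is an arbitrary example):

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t v_addr = 0x00007f1234567abcULL;   /* arbitrary example address */

    unsigned offset = v_addr & 0xfff;          /* bits 11:0  - copied verbatim to p_addr */
    unsigned pte    = (v_addr >> 12) & 0x1ff;  /* bits 20:12 - PTE index */
    unsigned pmd    = (v_addr >> 21) & 0x1ff;  /* bits 29:21 - PMD index */
    unsigned pud    = (v_addr >> 30) & 0x1ff;  /* bits 38:30 - PUD index */
    unsigned pgd    = (v_addr >> 39) & 0x1ff;  /* bits 47:39 - PGD index */

    printf("pgd=%u pud=%u pmd=%u pte=%u offset=0x%x\n",
           pgd, pud, pmd, pte, offset);
    return 0;
}
```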
First, to be clear: the page cache is the main disk cache used by the Linux kernel.
page cache is the main disk cache used by the Linux kernel.
When reading and writing files asynchronously, writes first go into the page cache, turning the pages into dirty pages; kernel writeback threads (historically pdflush, nowadays per-device flusher workers) later write them back to disk. Similarly, when we read a file, the data is first placed in the page cache and then copied to user space. If the same file is read again and the data is already in the page cache, performance improves considerably.
An inverted page table (IPT), as the name suggests, stores information per physical page frame.
It was introduced to mitigate the memory consumed by multi-level page tables: there is exactly one entry per physical page frame rather than one entry per virtual page.
It contains far fewer entries (physical memory is usually much smaller than virtual memory), so it is indexed by page frame number rather than by virtual page number.
Although the IPT design saves a lot of space, it makes virtual-to-physical translation much harder. When process n accesses virtual page p, the hardware can no longer use p as an index into a page table; instead it must search the whole inverted page table for a matching entry.
By comparison, the TLB is the better technique in practice.
Huge pages are also called large pages.
Pages are normally 4 KB, which creates a problem: with a lot of physical memory the page tables become very large and themselves consume a lot of physical memory. Huge pages make each page bigger, so fewer page table entries are needed and less memory is spent on page tables.
The x86-64 four-level page table supports 2 MB and 1 GB huge pages.
The main benefits are fewer page table entries, faster lookups, and a higher TLB hit rate.
Setting the PSE (Page Size Extension) bit in CR4 enables the corresponding huge pages.
The downside is that huge pages must be reserved in advance; if too many are reserved, memory is wasted and cannot be used by other programs. A user-space sketch of explicitly mapping a huge page follows below.
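A minimal user-space sketch of using explicitly reserved huge pages via mmap(MAP_HUGETLB), assuming the administrator has pre-reserved pages through /proc/sys/vm/nr_hugepages:

```c
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)   /* one 2 MB huge page */

int main(void)
{
    /* Requires pre-reserved huge pages, e.g.:
     *   echo 16 > /proc/sys/vm/nr_hugepages
     */
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");
        return 1;
    }
    memset(p, 0, LEN);            /* touch it so the huge page is actually faulted in */
    printf("2 MB huge page mapped at %p\n", p);
    munmap(p, LEN);
    return 0;
}
```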
THP (Transparent Huge Pages) is an optimization on top of huge pages that allows them to be allocated dynamically. THP reduces the overhead of supporting huge pages and lets applications pick a larger virtual page size when it helps, without being forced onto 2 MB pages.
THP works by being able to break huge pages back down into ordinary 4 KB pages, which can then be swapped out normally. To use huge pages effectively, the kernel must still find physically contiguous, properly aligned regions large enough for the request; for this purpose a khugepaged kernel thread was added. It periodically tries to replace runs of smaller in-use pages with huge page allocations, maximizing THP usage. On the user side no application changes are required (hence "transparent"), although there are ways to optimize usage: applications that want huge pages can use posix_memalign() to make sure large allocations are aligned to huge page (2 MB) boundaries. Also, THP is only enabled for anonymous memory areas.
The catch is that, because of its dynamic nature and the heavy memory-lock traffic involved, THP can easily cause performance regressions.
https://zhuanlan.zhihu.com/p/67053210
P (Present): 1 means the page is currently in physical memory. If it is 0, the rest of the PTE is meaningless and a page fault is triggered directly. A PTE with P=0 also has no TLB entry, because the corresponding TLB entry was flushed back when P went from 1 to 0.
G (Global): used so that kernel TLB entries do not have to be flushed on a context switch; this flag also exists in the TLB entry.
A (Accessed): the hardware sets this bit when the page is accessed (read or written); the TLB only caches mappings for pages with A=1. Software can clear the bit, after which the corresponding TLB entry is flushed. This lets software count how often each page is accessed, which serves as a hint when deciding which pages to reclaim under memory pressure.
D (Dirty): this flag matters for file-backed pages, not anonymous ones. The hardware sets it when the page is written, indicating that the page content is newer than the corresponding data on disk/flash; when memory is tight and the page is reclaimed, its content must first be flushed to backing storage, after which software clears the flag.
R/W and U/S are permission bits:
R/W (Read/Write): 1 means the page is writable, 0 means read-only; writing to a read-only page triggers a page fault.
U/S (User/Supervisor): 0 means only the supervisor (e.g. the kernel) may access the page; 1 means user mode may access it as well.
PCD and PWT control caching:
PCD (Page Cache Disable): 1 disables caching of the page's contents. If it is 0 (caching enabled), the CD bit in CR0, the global switch, must also be 0.
PWT (Page Write Through): 1 means the page's cache lines use write-through; otherwise write-back is used.
Under 64-bit the PTE additionally carries the physical frame number in its middle bits and the NX (execute-disable) bit at the top. A small sketch of decoding these bits follows.
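As a worked example of these bits, the following user-space sketch decodes a raw x86-64 PTE value (the constant here is hypothetical) using the standard bit positions: P=0, R/W=1, U/S=2, PWT=3, PCD=4, A=5, D=6, G=8, NX=63, with the frame number in bits 51:12:

```c
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    uint64_t pte = 0x8000000012345867ULL;    /* hypothetical raw PTE value */

    printf("P   = %u\n", (unsigned)((pte >> 0) & 1));    /* present          */
    printf("R/W = %u\n", (unsigned)((pte >> 1) & 1));    /* writable         */
    printf("U/S = %u\n", (unsigned)((pte >> 2) & 1));    /* user-accessible  */
    printf("PWT = %u\n", (unsigned)((pte >> 3) & 1));    /* write-through    */
    printf("PCD = %u\n", (unsigned)((pte >> 4) & 1));    /* cache disable    */
    printf("A   = %u\n", (unsigned)((pte >> 5) & 1));    /* accessed         */
    printf("D   = %u\n", (unsigned)((pte >> 6) & 1));    /* dirty            */
    printf("G   = %u\n", (unsigned)((pte >> 8) & 1));    /* global           */
    printf("NX  = %u\n", (unsigned)((pte >> 63) & 1));   /* execute-disable  */
    printf("pfn = 0x%llx\n",
           (unsigned long long)((pte >> 12) & ((1ULL << 40) - 1)));  /* bits 51:12 */
    return 0;
}
```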
The buddy system splits free memory into power-of-two sized blocks until a block of the requested size is obtained.
As shown, each zone maintains MAX_ORDER free_area entries, where MAX_ORDER bounds the largest power of two used for splitting.
The corresponding MIGRATE_TYPES are:
Furthermore, each free_list has different properties:
The buddy system mainly revolves around alloc_pages, alloc_page, and a family of related functions; let's start from the top-level interface.
The parameters of this function are:
This function is the core of buddy allocation.
The fast allocation path.
get_page_from_freelist tries to allocate the pages; if that fails, __alloc_pages_slowpath handles the special cases.
rmqueue
When a single page is requested (order = 0), it is taken directly from the per-CPU lists.
rmqueue_pcplist goes through the following steps:
__rmqueue_pcplist goes through the following steps:
__rmqueue_bulk goes through the following steps:
__rmqueue goes through the following steps:
__rmqueue_smallest goes through the following steps:
It mainly scans each order's freelist for a page whose size and migrate type both fit.
expand goes through the following steps:
If current_order > order:
Assume high = 4 and low = 2 (i.e. current_order and order).
The surplus pages are then marked as guard pages, which are not accessible, and the split pages are put back on the appropriate free lists.
When the fast path fails, allocation falls through to the slow path.
In free_the_page:
free_unref_page_commit
free_one_page -->__free_one_page
free_pcppages_bulk
kmem_cache
kmem_cache_cpu
kmem_cache_node
A clearer view of the three-level structure:
In some kernel challenges, when CONFIG_SLAB_FREELIST_HARDENED is enabled, the freelist_ptr function decodes (de-obfuscates) an object's obfuscated next pointer.
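Below is a simplified user-space model of that obfuscation scheme: the stored pointer is the real next pointer XORed with a per-cache random value and the byte-swapped address of the freelist slot (the in-kernel freelist_ptr() additionally deals with KASAN tags; the names and values here are illustrative):

```c
#include <stdio.h>
#include <stdint.h>

/* stand-in for the kernel's swab() applied to the slot's own address */
static uint64_t swab64(uint64_t x) { return __builtin_bswap64(x); }

/* Model of CONFIG_SLAB_FREELIST_HARDENED: XOR with a per-cache secret and the
 * swabbed storage address. Applying the same operation twice decodes again. */
static uint64_t obfuscate(uint64_t next, uint64_t cache_random, uint64_t ptr_addr)
{
    return next ^ cache_random ^ swab64(ptr_addr);
}

int main(void)
{
    uint64_t cache_random = 0xdeadbeefcafebabeULL;  /* s->random (per kmem_cache)  */
    uint64_t next         = 0xffff888012345678ULL;  /* real next free object       */
    uint64_t ptr_addr     = 0xffff888012340000ULL;  /* where the pointer is stored */

    uint64_t stored = obfuscate(next, cache_random, ptr_addr);
    printf("stored  : 0x%016llx\n", (unsigned long long)stored);
    printf("decoded : 0x%016llx\n",
           (unsigned long long)obfuscate(stored, cache_random, ptr_addr));
    return 0;
}
```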
Quote:
Creating a new slab essentially means allocating pages of the appropriate order to hold enough objects. Note how the order and the number of objects are determined; the two influence each other. Both are stored in the kmem_cache member kmem_cache_order_objects: the low 16 bits hold the object count and the high bits hold the order. The relationship is simply ((PAGE_SIZE << order) - reserved) / size.
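A quick sketch of that packing and formula (PAGE_SIZE, order, reserved and size are example values here, not taken from a real cache):

```c
#include <stdio.h>

#define PAGE_SIZE 4096UL

int main(void)
{
    unsigned long order = 1, reserved = 0, size = 256;   /* example kmem_cache parameters */

    /* objects per slab: ((PAGE_SIZE << order) - reserved) / size */
    unsigned long objects = ((PAGE_SIZE << order) - reserved) / size;

    /* kmem_cache_order_objects packs both: high bits = order, low 16 bits = objects */
    unsigned int oo = (unsigned int)((order << 16) | objects);

    printf("objects = %lu, oo = 0x%x (order=%u, objects=%u)\n",
           objects, oo, oo >> 16, oo & 0xffff);
    return 0;
}
```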
This eventually reaches slab_alloc_node (the fast path).
get_freepointer_safe behaves as follows:
__slab_alloc, the slow path:
slab_alloc_node -> __slab_alloc -> ___slab_alloc
deactivate_slab
This function mainly puts a slab back onto its node.
___slab_free
discard_slab
Release path: discard_slab -> free_slab -> __free_slab -> __free_pages
SLUB DEBUG can detect problems such as out-of-bounds accesses and use-after-free.
How to enable it:
Recommended reading:
Linux内核slab内存的越界检查——SLUB_DEBUG
Every user-space process has several segments at run time, each with its own attributes (executable, readable, and so on), and the segments are not necessarily contiguous. The kernel's per-process VMA structures are what maintain these runtime segments. (A small user-space illustration follows below.)
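To see these per-process segments/VMAs from user space, one can simply dump /proc/self/maps, where each line corresponds to one vm_area_struct; a minimal sketch:

```c
#include <stdio.h>

int main(void)
{
    char line[512];
    /* Each line of /proc/self/maps describes one VMA:
     * start-end  perms  offset  dev  inode  path */
    FILE *f = fopen("/proc/self/maps", "r");
    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}
```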
For each process's task_struct:
Image from the WeChat public account LoyenWang
vm_area_struct
find_vma
vmacache_find
insert_vm_struct
When does a page fault happen:
no matching PTE is found in the page table, or
the PTE for the virtual address denies the access. (A small user-space demo follows below.)
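A tiny user-space demo of the second case (the PTE forbids the access): mapping a page read-only and then writing to it triggers a fault that the kernel delivers as SIGSEGV; a minimal sketch:

```c
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

static void handler(int sig)
{
    (void)sig;
    /* The write below faulted: a PTE exists but it denies write access. */
    static const char msg[] = "caught SIGSEGV from the forbidden write\n";
    write(STDOUT_FILENO, msg, sizeof(msg) - 1);
    _exit(0);
}

int main(void)
{
    signal(SIGSEGV, handler);

    char *p = mmap(NULL, 4096, PROT_READ,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    p[0] = 'x';      /* write to a read-only page -> page fault -> SIGSEGV */
    return 0;        /* never reached */
}
```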
In kernel 5.13, __do_page_fault has been removed (on x86) and replaced by handle_page_fault.
vmalloc_fault
The kernel uses vmalloc to allocate memory that is contiguous in virtual memory but not necessarily contiguous in physical memory.
It handles faults in the vmalloc or module mapping area. This is needed because there is a race window between the moment the vmalloc mapping code updates a PMD and the moment that update is synchronized with the other page tables in the system. During this window, another thread/CPU can map a region over the same PMD, find it already present, and not yet synchronized with the rest of the system. vmalloc may therefore return regions that are not yet mapped in every page table, and accessing them raises otherwise unhandled page faults.
The fix is essentially to copy the init process's (global) page table entries into the current process's page tables, keeping the kernel address space of all processes in sync.
spurious_kernel_fault
This function handles spurious faults caused by TLB entries that have not been refreshed in time.
Possible cause: the TLB entry carries fewer permissions than the page table entry.
Accesses that can trigger it:
1. a write performed in ring 0;
2. an instruction fetch from an NX region.
bad_area_nosemaphore
bad_area_nosemaphore -> __bad_area_nosemaphore
kernelmode_fixup_or_oops
handle_mm_fault
handle_pte_fault
For a more detailed look at the calls made inside this function, see:
https://bbs.pediy.com/thread-264199.htm
It mainly covers the COW-related handling.
Special thanks to povcfe for his analysis of Linux memory management, from which I learned a lot.
https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html
https://frankdenneman.nl/2016/07/07/numa-deep-dive-part-1-uma-numa/
https://zhuanlan.zhihu.com/p/68465952
https://blog.csdn.net/jasonchen_gbd/article/details/79462014
https://blog.csdn.net/zhoutaopower/article/details/87090982
https://blog.csdn.net/zhoutaopower/article/details/88025712
https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html
https://zhuanlan.zhihu.com/p/137277724
https://segmentfault.com/a/1190000012269249
https://www.codenong.com/cs105984564/
https://rtoax.blog.csdn.net/article/details/108663898
https://qinglinmao8315.github.io/linux/2018/03/14/linux-page-cache.html
https://www.jianshu.com/p/8a86033dfcb0
https://blog.csdn.net/wh8_2011/article/details/53138377
https://zhuanlan.zhihu.com/p/258921453
https://blog.csdn.net/FreeeLinux/article/details/54754752
https://www.sohu.com/a/297831850_467784
https://www.cnblogs.com/adera/p/11718765.html
https://blog.csdn.net/zhuyong006/article/details/100737724
https://blog.csdn.net/wangquan1992/article/details/105036282/
https://blog.csdn.net/sykpour/article/details/24044641
- Before We Start
- NUMA vs. UMA/SMP
- Hierarchy
- Page Table Organization
- A Page Table Walk
- KPTI and the Kernel Page Table
- The TLB Cache
- Page Cache
- Inverted Page Tables (IPT)
- Huge Pages
- THP (Transparent Huge Pages)
- Page Table Flag Bits
- The Buddy System
- Overview
- alloc_pages(gfp_t gfp_mask, unsigned int order)
- __alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid, nodemask_t *nodemask)
- get_page_from_freelist (from the zone freelist, fast path)
- __alloc_pages_slowpath (slow path)
- __free_pages
- The SLAB/SLUB Allocator
- Key Structures
- slab hardening mitigations
- kmem_cache_alloc
- kmem_cache_free
- do_slab_free (fast path)
- __slab_free (slow path)
- check_object and CONFIG_SLUB
- Processes
- Page Fault
- handle_page_fault
- **do_kern_addr_fault**
- **do_user_addr_fault**
- References
```c
// arch/x86/mm/numa.c
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
```
```c
typedef struct pglist_data {
	/*
	 * node_zones contains just the zones for THIS node. Not all of the
	 * zones may be populated, but it is the full list. It is referenced by
	 * this node's node_zonelists as well as other node's node_zonelists.
	 */
	struct zone node_zones[MAX_NR_ZONES];

	/*
	 * node_zonelists contains references to all zones in all nodes.
	 * Generally the first zones will be references to this node's
	 * node_zones.
	 */
	struct zonelist node_zonelists[MAX_ZONELISTS];

	int nr_zones; /* number of populated zones in this node */
	......
```
```c
/* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
```
```
root@ubuntu:~# cat /proc/zoneinfo | grep Node
Node 0, zone      DMA
Node 0, zone    DMA32
Node 0, zone   Normal
Node 0, zone  Movable
Node 0, zone   Device
```
```c
// mm_types.h
struct page {
	unsigned long flags;		/* Atomic flags, some possibly
					 * updated asynchronously */
	union {
		struct {	/* Page cache and anonymous pages */
			struct list_head lru;
			/* See page-flags.h for PAGE_MAPPING_FLAGS */
			struct address_space *mapping;
			pgoff_t index;		/* Our offset within mapping. */
			/**
			 * @private: Mapping-private opaque data.
			 * Usually used for buffer_heads if PagePrivate.
			 * Used for swp_entry_t if PageSwapCache.
			 * Indicates order in the buddy system if PageBuddy.
			 */
			unsigned long private;
		};
		struct {	/* page_pool used by netstack */
			/**
			 * @dma_addr: might require a 64-bit value on
			 * 32-bit architectures.
			 */
			unsigned long dma_addr[2];
		};
		struct {	/* slab, slob and slub */
			union {
				struct list_head slab_list;
				struct {	/* Partial pages */
					struct page *next;
#ifdef CONFIG_64BIT
					int pages;	/* Nr of pages left */
					int pobjects;	/* Approximate count */
#else
					short int pages;
					short int pobjects;
#endif
				};
			};
			struct kmem_cache *slab_cache; /* not slob */
			/* Double-word boundary */
			void *freelist;		/* first free object */
			union {
				void *s_mem;	/* slab: first object */
				unsigned long counters;		/* SLUB */
				struct {			/* SLUB */
					unsigned inuse:16;
					unsigned objects:15;
					unsigned frozen:1;
				};
			};
		};
		struct {	/* Tail pages of compound page */
			unsigned long compound_head;	/* Bit zero is set */

			/* First tail page only */
			unsigned char compound_dtor;
			unsigned char compound_order;
			atomic_t compound_mapcount;
			unsigned int compound_nr; /* 1 << compound_order */
		};
		struct {	/* Second tail page of compound page */
			unsigned long _compound_pad_1;	/* compound_head */
			atomic_t hpage_pinned_refcount;
			/* For both global and memcg */
			struct list_head deferred_list;
		};
		struct {	/* Page table pages */
			unsigned long _pt_pad_1;	/* compound_head */
			pgtable_t pmd_huge_pte; /* protected by page->ptl */
			unsigned long _pt_pad_2;	/* mapping */
			union {
				struct mm_struct *pt_mm; /* x86 pgds only */
				atomic_t pt_frag_refcount; /* powerpc */
			};
#if ALLOC_SPLIT_PTLOCKS
			spinlock_t *ptl;
#else
			spinlock_t ptl;
#endif
		};
		struct {	/* ZONE_DEVICE pages */
			/** @pgmap: Points to the hosting device page map. */
			struct dev_pagemap *pgmap;
			void *zone_device_data;
		};

		/** @rcu_head: You can use this to free a page by RCU. */
		struct rcu_head rcu_head;
	};

	union {		/* This union is 4 bytes in size. */
		atomic_t _mapcount;
		/*
		 * If the page is neither PageSlab nor mappable to userspace,
		 * the value stored here may help determine what this page
		 * is used for. See page-flags.h for a list of page types
		 * which are currently stored here.
		 */
		unsigned int page_type;

		unsigned int active;		/* SLAB */
		int units;			/* SLOB */
	};

	/* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
	atomic_t _refcount;

#ifdef CONFIG_MEMCG
	unsigned long memcg_data;
#endif

#if defined(WANT_PAGE_VIRTUAL)
	void *virtual;			/* Kernel virtual address (NULL if
					   not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */

#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
	int _last_cpupid;
#endif
} _struct_page_alignment;
```
```c
enum pageflags {
	PG_locked,		/* Page is locked. Don't touch. */
	PG_referenced,		// the page has just been accessed
	PG_uptodate,
	PG_dirty,		// the page data has been modified (dirty page)
	PG_lru,			// the page is on an LRU list
	PG_active,
	PG_workingset,
	PG_waiters,
	PG_error,
	PG_slab,		// the page belongs to the slab allocator
	PG_owner_priv_1,	/* Owner use. If pagecache, fs may use */
	PG_arch_1,
	PG_reserved,
	PG_private,		/* If pagecache, has fs-private data */
	PG_private_2,		/* If pagecache, has fs aux data */
	PG_writeback,		// the page is being written back
	PG_head,		/* A head page */
	PG_mappedtodisk,	/* Has blocks allocated on-disk */
	PG_reclaim,		/* To be reclaimed asap */
	PG_swapbacked,		/* Page is backed by RAM/swap */
	PG_unevictable,		/* Page is "unevictable" */
#ifdef CONFIG_MMU
	PG_mlocked,		/* Page is vma mlocked */
#endif
#ifdef CONFIG_ARCH_USES_PG_UNCACHED
	PG_uncached,		/* Page has been mapped as uncached */
#endif
#ifdef CONFIG_MEMORY_FAILURE
	PG_hwpoison,		/* hardware poisoned page. Don't touch */
#endif
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
	PG_young,
	PG_idle,
#endif
#ifdef CONFIG_64BIT
	PG_arch_2,
#endif
	__NR_PAGEFLAGS,

	/* Filesystems */
	PG_checked = PG_owner_priv_1,

	/* SwapBacked */
	PG_swapcache = PG_owner_priv_1,	// the page is in the swap cache
					/* Swap page: swp_entry_t in private */

	/*
	 * Two page bits are conscripted by FS-Cache to maintain local caching
	 * state.  These bits are set on pages belonging to the netfs's inodes
	 * when those inodes are being locally cached.
	 */
	PG_fscache = PG_private_2,	/* page backed by cache */

	/* XEN */
	/* Pinned in Xen as a read-only pagetable page. */
	PG_pinned = PG_owner_priv_1,
	/* Pinned as part of domain save (see xen_mm_pin_all()). */
	PG_savepinned = PG_dirty,
	/* Has a grant mapping of another (foreign) domain's page. */
	PG_foreign = PG_owner_priv_1,
	/* Remapped by swiotlb-xen. */
	PG_xen_remapped = PG_owner_priv_1,

	/* SLOB */
	PG_slob_free = PG_private,

	/* Compound pages. Stored in first tail page's flags */
	PG_double_map = PG_workingset,

	/* non-lru isolated movable page */
	PG_isolated = PG_reclaim,

	/* Only valid for buddy pages. Used to track pages that are reported */
	PG_reported = PG_uptodate,
};
```
```
root@ubuntu:~# free
              total        used        free      shared  buff/cache   available
Mem:        4012836      207344     3317312        1128      488180     3499580
Swap:        998396           0      998396
```
| PGD | Page Global Directory |
| --- | --- |
| PUD | Page Upper Directory |
| PMD | Page Middle Directory |
| PTE | Page Table |
task_struct -> mm_struct -> pgd_t *pgd
```asm
mov rdi, cr3
or  rdi, 1000h
mov cr3, rdi
```
```c
struct mm_struct init_mm = {
	.mm_rb		= RB_ROOT,
	.pgd		= swapper_pg_dir,
	.mm_users	= ATOMIC_INIT(2),
	.mm_count	= ATOMIC_INIT(1),
	.write_protect_seq = SEQCNT_ZERO(init_mm.write_protect_seq),
	MMAP_LOCK_INITIALIZER(init_mm)
	.page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
	.arg_lock	=  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
	.mmlist		= LIST_HEAD_INIT(init_mm.mmlist),
	.user_ns	= &init_user_ns,
	.cpu_bitmap	= CPU_BITS_NONE,
	INIT_MM_CONTEXT(init_mm)
};
```
```c
/*
 * Initialized during boot, and readonly for initializing page tables
 * afterwards
 */
pgd_t swapper_pg_dir[PTRS_PER_PGD];
```
```c
/*
 * Set up kernel memory allocators
 */
static void __init mm_init(void)
{
	......
	mem_init();		// buddy system initialization
	......
	kmem_cache_init();	// slab initialization
	......
}
```
```c
#define MAX_ORDER 11

struct zone {
	...
	/* free areas of different sizes */
	struct free_area	free_area[MAX_ORDER];
	...
}
```
```c
struct free_area {
	struct list_head	free_list[MIGRATE_TYPES];
	unsigned long		nr_free;
};
```
```c
enum migratetype {
	MIGRATE_UNMOVABLE,	// unmovable pages
	MIGRATE_MOVABLE,	// movable pages
	MIGRATE_RECLAIMABLE,	// reclaimable pages
	MIGRATE_PCPTYPES,	/* the number of types on the pcp lists */
	MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES,
				// in rare cases the kernel needs a high-order block and cannot
				// sleep; if allocation from a list with a specific mobility
				// fails, such an emergency request can be served from
				// MIGRATE_HIGHATOMIC
#ifdef CONFIG_CMA
	MIGRATE_CMA,		// the Contiguous Memory Allocator (CMA), used to avoid
				// reserving large blocks of memory up front
#endif
#ifdef CONFIG_MEMORY_ISOLATION
	MIGRATE_ISOLATE,	// a special virtual area used to move physical pages across
				// NUMA nodes; on large systems it helps move pages close to
				// the CPUs that use them most
#endif
	MIGRATE_TYPES		// just the number of migrate types, not a real area
};
```
```c
static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
	return alloc_pages_node(numa_node_id(), gfp_mask, order);
}
```
- rdi: the GFP bitmask, i.e. the allocation attributes. See the [appendix](#1).
- rsi: the order of the allocation.

Following the call chain:

```c
alloc_pages
    alloc_pages_node
        __alloc_pages_node(nid, gfp_mask, order)      // nid is the node closest to the current CPU
            __alloc_pages(gfp_mask, order, nid, NULL) // the 'heart' of the zoned buddy allocator
```
```c
/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
			   nodemask_t *nodemask)
{
	struct page *page;
	// start with the LOW watermark
	unsigned int alloc_flags = ALLOC_WMARK_LOW;
	// the effective gfp describing this allocation
	gfp_t alloc_gfp;
	// holds the mostly-invariant allocation parameters passed between the
	// functions involved in the allocation (the alloc_pages* family);
	// it represents the fixed context of this allocation:
	/*
	 * struct alloc_context {
	 *	struct zonelist *zonelist;
	 *	nodemask_t *nodemask;
	 *	struct zoneref *preferred_zoneref;
	 *	int migratetype;
	 *	enum zone_type highest_zoneidx;
	 *	bool spread_dirty_pages;
	 * };
	 */
	struct alloc_context ac = { };

	// sanity-check the order
	if (unlikely(order >= MAX_ORDER)) {
		WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
		return NULL;
	}

	// restrict gfp to the currently allowed mask (GFP_BOOT_MASK, presumably for early boot)
	gfp &= gfp_allowed_mask;
	// adjust gfp according to the current task's flags (current->flags)
	gfp = current_gfp_context(gfp);
	alloc_gfp = gfp;

	// prepare_alloc_pages fills in struct alloc_context:
	/*
	 * ac->highest_zoneidx = gfp_zone(gfp_mask);
	 * ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
	 * ac->nodemask = nodemask;
	 * ac->migratetype = gfp_migratetype(gfp_mask);
	 */
	if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
			&alloc_gfp, &alloc_flags))
		return NULL;

	// avoid fragmentation
	// alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
	alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);

	// first allocation attempt
	page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
	if (likely(page))
		goto out;

	alloc_gfp = gfp;
	ac.spread_dirty_pages = false;

	/*
	 * Restore the original nodemask if it was potentially replaced with
	 * &cpuset_current_mems_allowed to optimize the fast-path attempt.
	 */
	ac.nodemask = nodemask;

	// the first attempt failed; retry via the slow path
	page = __alloc_pages_slowpath(alloc_gfp, order, &ac);

out:
	if (memcg_kmem_enabled() && (gfp & __GFP_ACCOUNT) && page &&
	    unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
		__free_pages(page, order);
		page = NULL;
	}

	trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);

	return page;
}
```
```c
/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
						const struct alloc_context *ac)
{
	struct zoneref *z;
	struct zone *zone;
	struct pglist_data *last_pgdat_dirty_limit = NULL;
	bool no_fallback;

retry:
	// scan the zonelist, trying to find a zone with enough free pages
	no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
	z = ac->preferred_zoneref;	// the preferred zone to start from, taken from the context
	for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx, ac->nodemask) {
		struct page *page;
		unsigned long mark;

		if (cpusets_enabled() &&
			(alloc_flags & ALLOC_CPUSET) &&
			!__cpuset_zone_allowed(zone, gfp_mask))
				continue;
		// stay within the dirty limit, so that writeback is not forced off the
		// LRU lists and kswapd alone can keep things balanced
		if (ac->spread_dirty_pages) {
			if (last_pgdat_dirty_limit == zone->zone_pgdat)
				continue;

			if (!node_dirty_ok(zone->zone_pgdat)) {
				last_pgdat_dirty_limit = zone->zone_pgdat;
				continue;
			}
		}

		if (no_fallback && nr_online_nodes > 1 &&
		    zone != ac->preferred_zoneref->zone) {
			int local_nid;

			/*
			 * If moving to a remote node, retry but allow
			 * fragmenting fallbacks. Locality is more important
			 * than fragmentation avoidance.
			 */
			local_nid = zone_to_nid(ac->preferred_zoneref->zone);	// local node id
			if (zone_to_nid(zone) != local_nid) {	// not allocating from the local node
				alloc_flags &= ~ALLOC_NOFRAGMENT;	// mark it and retry
				goto retry;
			}
		}

		// check whether the watermark is sufficient, reclaiming if necessary
		mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
		if (!zone_watermark_fast(zone, order, mark,
				       ac->highest_zoneidx, alloc_flags,
				       gfp_mask)) {
			int ret;
			......
			/* Checked here to keep the fast path fast */
			BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
			if (alloc_flags & ALLOC_NO_WATERMARKS)
				goto try_this_zone;

			if (!node_reclaim_enabled() ||
			    !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
				continue;

			ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
			switch (ret) {
			case NODE_RECLAIM_NOSCAN:
				/* did not scan */
				continue;
			case NODE_RECLAIM_FULL:
				/* scanned but unreclaimable */
				continue;
			default:
				/* did we reclaim enough */
				if (zone_watermark_ok(zone, order, mark,
					ac->highest_zoneidx, alloc_flags))
					goto try_this_zone;

				continue;
			}
		}

		// call rmqueue to do the actual allocation
try_this_zone:
		page = rmqueue(ac->preferred_zoneref->zone, zone, order,
				gfp_mask, alloc_flags, ac->migratetype);
		if (page) {	// allocation succeeded
			prep_new_page(page, order, gfp_mask, alloc_flags);

			/*
			 * If this is a high-order atomic allocation then check
			 * if the pageblock should be reserved for the future
			 */
			if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
				reserve_highatomic_pageblock(page, zone, order);

			return page;
		} else {	// allocation failed
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
			/* Try again if zone has deferred pages */
			if (static_branch_unlikely(&deferred_pages)) {
				if (_deferred_grow_zone(zone, order))
					goto try_this_zone;
			}
#endif
		}
	}

	/*
	 * It's possible on a UMA machine to get through all zones that are
	 * fragmented. If avoiding fragmentation, reset and try again.
	 */
	if (no_fallback) {
		alloc_flags &= ~ALLOC_NOFRAGMENT;
		goto retry;
	}

	return NULL;
}
```
```c
/*
 * Allocate a page from the given zone. Use pcplists for order-0 allocations.
 */
static inline
struct page *rmqueue(struct zone *preferred_zone,
			struct zone *zone, unsigned int order,
			gfp_t gfp_flags, unsigned int alloc_flags,
			int migratetype)
{
	unsigned long flags;
	struct page *page;

	// for order-0 requests, allocate directly from the per-cpu lists
	if (likely(order == 0)) {
		if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
				migratetype != MIGRATE_MOVABLE) {
			page = rmqueue_pcplist(preferred_zone, zone, gfp_flags, migratetype, alloc_flags);
			goto out;
		}
	}

	// with __GFP_NOFAIL set, allocations of order > 1 are not allowed
	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
	// take the zone lock
	spin_lock_irqsave(&zone->lock, flags);

	do {
		page = NULL;
		/*
		 * order-0 request can reach here when the pcplist is skipped
		 * due to non-CMA allocation context. HIGHATOMIC area is
		 * reserved for high-order atomic allocation, so order-0
		 * request should skip it.
		 */
		if (order > 0 && alloc_flags & ALLOC_HARDER) {
			// allocate via __rmqueue_smallest with migrate type MIGRATE_HIGHATOMIC
			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
			if (page)
				trace_mm_page_alloc_zone_locked(page, order, migratetype);
		}
		// otherwise (or if that failed), fall back to __rmqueue
		if (!page)
			page = __rmqueue(zone, order, migratetype, alloc_flags);
	} while (page && check_new_pages(page, order));
	spin_unlock(&zone->lock);
	if (!page)
		goto failed;

	// update the zone's free-page statistics
	__mod_zone_freepage_state(zone, -(1 << order), get_pcppage_migratetype(page));
	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
	// account NUMA statistics (hit/miss)
	zone_statistics(preferred_zone, zone);
	// re-enable interrupts
	local_irq_restore(flags);

out:
	/* Separate test+clear to avoid unnecessary atomics */
	if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
	}

	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
	return page;

failed:
	local_irq_restore(flags);
	return NULL;
}
```
```c
/* Remove page from the per-cpu list, caller must protect the list */
static inline
struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
			unsigned int alloc_flags,
			struct per_cpu_pages *pcp,
			struct list_head *list)
{
	struct page *page;

	do {
		// use the list's next pointer to check whether the list is empty
		if (list_empty(list)) {
			// if so, call rmqueue_bulk to refill the supplied list
			pcp->count += rmqueue_bulk(zone, 0, READ_ONCE(pcp->batch),
					list, migratetype, alloc_flags);
			if (unlikely(list_empty(list)))
				return NULL;
		}

		// take the first element of the list
		page = list_first_entry(list, struct page, lru);
		// unlink it from the lru list
		list_del(&page->lru);
		// decrement the free count
		pcp->count--;
	} while (check_new_pcp(page));

	return page;
}
```
```c
/*
 * Obtain a specified number of elements from the buddy allocator, all under
 * a single hold of the lock, for efficiency.  Add them to the supplied list.
 * Returns the number of new pages which were placed at *list.
 */
static int rmqueue_bulk(struct zone *zone, unsigned int order,
			unsigned long count, struct list_head *list,
			int migratetype, unsigned int alloc_flags)
{
	int i, allocated = 0;

	spin_lock(&zone->lock);
	// scan the zone's order lists, trying to find suitable pages
	for (i = 0; i < count; ++i) {
		// take one page out
		struct page *page = __rmqueue(zone, order, migratetype, alloc_flags);
		if (unlikely(page == NULL))
			break;

		if (unlikely(check_pcp_refill(page)))
			continue;

		// add the page to the lru list
		list_add_tail(&page->lru, list);
		allocated++;
		// if the page is in a CMA area, update the relevant zone counters
		// (NR_FREE_CMA_PAGES)
		/*
		 * atomic_long_add(x, &zone->vm_stat[item]);
		 * atomic_long_add(x, &vm_zone_stat[item]);
		 */
		if (is_migrate_cma(get_pcppage_migratetype(page)))
			__mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
					      -(1 << order));
	}

	// if check_pcp_refill failed, some pages were dropped; the loop ran i times
	// and each page block contains 2^order pages, so adjust NR_FREE_PAGES accordingly
	__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
	spin_unlock(&zone->lock);
	return allocated;
}
```
```c
/*
 * Do the hard work of removing an element from the buddy allocator.
 * Call me with the zone->lock already held.
 */
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
						unsigned int alloc_flags)
{
	struct page *page;

	// with CMA enabled, balance movable allocations between the regular and the
	// CMA areas: allocate from CMA when more than half of the zone's free memory
	// is in the CMA area
	if (IS_ENABLED(CONFIG_CMA)) {
		if (alloc_flags & ALLOC_CMA &&
		    zone_page_state(zone, NR_FREE_CMA_PAGES) >
		    zone_page_state(zone, NR_FREE_PAGES) / 2) {
			page = __rmqueue_cma_fallback(zone, order);
			if (page)
				goto out;
		}
	}
retry:
	// otherwise allocate directly via __rmqueue_smallest
	page = __rmqueue_smallest(zone, order, migratetype);
	if (unlikely(!page)) {
		if (alloc_flags & ALLOC_CMA)
			page = __rmqueue_cma_fallback(zone, order);

		if (!page && __rmqueue_fallback(zone, order, migratetype,
								alloc_flags))
			goto retry;
	}
out:
	if (page)
		trace_mm_page_alloc_zone_locked(page, order, migratetype);
	return page;
}
```
```c
/*
 * Go through the free lists for the given migratetype and remove
 * the smallest available page from the freelists
 */
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
						int migratetype)
{
	unsigned int current_order;
	struct free_area *area;
	struct page *page;

	/* Find a page of the appropriate size in the preferred list */
	for (current_order = order; current_order < MAX_ORDER; ++current_order) {
		area = &(zone->free_area[current_order]);
		// take a page from the head of the list for this migrate type
		page = get_page_from_free_area(area, migratetype);
		if (!page)
			continue;
		// remove the page and update the zone
		del_page_from_free_list(page, zone, current_order);
		expand(zone, page, order, current_order, migratetype);
		// record the migrate type
		set_pcppage_migratetype(page, migratetype);
		return page;
	}

	return NULL;
}
```
```c
	while (high > low) {
		high--;
		size >>= 1;
		VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);

		/*
		 * Mark as guard pages (or page), that will allow to
		 * merge back to allocator when buddy will be freed.
		 * Corresponding page table entries will not be touched,
		 * pages will stay not present in virtual address space
		 */
		if (set_page_guard(zone, &page[size], high, migratetype))
			continue;

		add_to_free_list(&page[size], zone, high, migratetype);
		set_buddy_order(&page[size], high);
	}
```
```c
static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
						struct alloc_context *ac)
{
	bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
	const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
	struct page *page = NULL;
	unsigned int alloc_flags;
	unsigned long did_some_progress;
	enum compact_priority compact_priority;
	enum compact_result compact_result;
	int compaction_retries;
	int no_progress_loops;
	unsigned int cpuset_mems_cookie;
	int reserve_flags;

	// __GFP_ATOMIC (atomic request) and __GFP_DIRECT_RECLAIM (may reclaim directly)
	// conflict with each other; drop the atomic flag
	if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) ==
				(__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
		gfp_mask &= ~__GFP_ATOMIC;

retry_cpuset:
	compaction_retries = 0;
	no_progress_loops = 0;
	compact_priority = DEF_COMPACT_PRIORITY;
	cpuset_mems_cookie = read_mems_allowed_begin();

	// the fast path used conservative alloc_flags; recompute them here so the
	// slow path can be less restrictive
	alloc_flags = gfp_to_alloc_flags(gfp_mask);

	// recompute the starting point for iterating over the zones
	ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
					ac->highest_zoneidx, ac->nodemask);
	if (!ac->preferred_zoneref->zone)
		goto nopage;

	// if ALLOC_KSWAPD is set, wake up the kswapd threads
	if (alloc_flags & ALLOC_KSWAPD)
		wake_all_kswapds(order, gfp_mask, ac);

	// retry the allocation with the adjusted parameters
	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
	if (page)
		goto got_pg;

	/*
	 * For costly allocations, try direct compaction first, as it's likely
	 * that we have enough base pages and don't need to reclaim. For non-
	 * movable high-order allocations, do that as well, as compaction will
	 * try prevent permanent fragmentation by migrating from blocks of the
	 * same migratetype.
	 * Don't try this for allocations that are allowed to ignore
	 * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
	 */
	// compact memory directly when the situation calls for it
	if (can_direct_reclaim &&
			(costly_order ||
			   (order > 0 && ac->migratetype != MIGRATE_MOVABLE))
			&& !gfp_pfmemalloc_allowed(gfp_mask)) {
		page = __alloc_pages_direct_compact(gfp_mask, order,
						alloc_flags, ac,
						INIT_COMPACT_PRIORITY,
						&compact_result);
		if (page)
			goto got_pg;

		// with __GFP_NORETRY set (which may include some THP page fault allocations)
		if (costly_order && (gfp_mask & __GFP_NORETRY)) {
			if (compact_result == COMPACT_SKIPPED ||
			    compact_result == COMPACT_DEFERRED)
				goto nopage;

			// synchronous compaction is too costly; stay asynchronous
			compact_priority = INIT_COMPACT_PRIORITY;
		}
	}

retry:
	// make sure kswapd does not go to sleep; wake it up again
	if (alloc_flags & ALLOC_KSWAPD)
		wake_all_kswapds(order, gfp_mask, ac);

	// distinguish requests that genuinely need access to the full memory reserves
	// from those that can tolerate being OOM-killed
	reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
	if (reserve_flags)
		alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags);

	// if the allocation is not restricted to the current cpuset, or reserve_flags
	// is set, relax the constraints and reset the preferred zone iterator before retrying
	if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
		ac->nodemask = NULL;
		ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
					ac->highest_zoneidx, ac->nodemask);
	}

	/* Attempt with potentially adjusted zonelist and alloc_flags */
	page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
	if (page)
		goto got_pg;

	/* Caller is not willing to reclaim, we can't balance anything */
	if (!can_direct_reclaim)
		goto nopage;

	/* Avoid recursion of direct reclaim */
	if (current->flags & PF_MEMALLOC)
		goto nopage;

	// try to reclaim first, then allocate
	page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
							&did_some_progress);
	if (page)
		goto got_pg;

	// try direct compaction, then allocate
	page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
					compact_priority, &compact_result);
	if (page)
		goto got_pg;

	/* Do not loop if specifically requested */
	if (gfp_mask & __GFP_NORETRY)
		goto nopage;

	/*
	 * Do not retry costly high order allocations unless they are
	 * __GFP_RETRY_MAYFAIL
	 */
	if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
		goto nopage;

	// should we retry reclaim?
	if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
				 did_some_progress > 0, &no_progress_loops))
		goto retry;

	// should we retry compaction?
	if (did_some_progress > 0 &&
			should_compact_retry(ac, order, alloc_flags,
				compact_result, &compact_priority,
				&compaction_retries))
		goto retry;

	// check for a possible race with cpuset changes before firing up the OOM killer
	if (check_retry_cpuset(cpuset_mems_cookie, ac))
		goto retry_cpuset;

	// reclaim failed; start the OOM killer to free memory by killing some processes
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

	// avoid unbounded use of memory with no watermarks
	if (tsk_is_oom_victim(current) &&
	    (alloc_flags & ALLOC_OOM ||
	     (gfp_mask & __GFP_NOMEMALLOC)))
		goto nopage;

	if (did_some_progress) {
		no_progress_loops = 0;
		goto retry;
	}

nopage:
	if (check_retry_cpuset(cpuset_mems_cookie, ac))
		goto retry_cpuset;

	// with __GFP_NOFAIL set, keep trying
	if (gfp_mask & __GFP_NOFAIL) {
		// if all NOFAIL requests would block, warn that NOWAIT should be used instead
		if (WARN_ON_ONCE(!can_direct_reclaim))
			goto fail;

		WARN_ON_ONCE(current->flags & PF_MEMALLOC);
		WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);

		/*
		 * Help non-failing allocations by giving them access to the memory
		 * reserves, but do not use ALLOC_NO_WATERMARKS, because that could
		 * exhaust the reserves entirely and make things worse.
		 */
		page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
		if (page)
			goto got_pg;

		cond_resched();
		goto retry;
	}
fail:
	warn_alloc(gfp_mask, ac->nodemask,
			"page allocation failure: order:%u", order);
got_pg:
	return page;
}
```
void __free_pages(struct page *page, unsigned int order)
{
    // Check whether anyone still uses the page frame: drop _refcount and test whether it hit zero
    if (put_page_testzero(page))
        free_the_page(page, order);
    // My own understanding: analogous to the earlier set_page_guard step -- the block was
    // allocated at a higher order than is now referenced, so the extra sub-blocks are freed one by one here
    else if (!PageHead(page))
        while (order-- > 0)
            free_the_page(page + (1 << order), order);
}
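To make the else branch concrete, the following trivial user-space program (my own illustration, not kernel code) prints which sub-blocks the while (order-- > 0) loop releases when the original allocation was order 3 and the first page is still referenced: the blocks at page offsets +4, +2 and +1, at orders 2, 1 and 0 respectively, while page +0 itself stays allocated.

/* Illustration only: which sub-blocks __free_pages()'s else-branch releases
 * for an order-3 block whose first page is still referenced. */
#include <stdio.h>

int main(void)
{
    unsigned int order = 3; /* pretend the original allocation order was 3 */

    while (order-- > 0)
        printf("free sub-block at page offset +%u, order %u (%u pages)\n",
               1u << order, order, 1u << order);
    printf("page +0 itself stays allocated until its refcount drops\n");
    return 0;
}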
static inline void free_the_page(struct page *page, unsigned int order)
{
    // Order-0 pages go back through the per-cpu (pcp) lists
    if (order == 0)     /* Via pcp? */
        free_unref_page(page);
    // Otherwise call __free_pages_ok
    else
        __free_pages_ok(page, order, FPI_NONE);
}
/*
 * Free a 0-order page
 */
void free_unref_page(struct page *page)
{
    unsigned long flags;
    // Get the page frame number
    unsigned long pfn = page_to_pfn(page);

    // Sanity checks before freeing
    if (!free_unref_page_prepare(page, pfn))
        return;

    local_irq_save(flags);
    free_unref_page_commit(page, pfn);
    local_irq_restore(flags);
}
static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
    struct zone *zone = page_zone(page);
    struct per_cpu_pages *pcp;
    int migratetype;

    // Get the migrate type of this page
    migratetype = get_pcppage_migratetype(page);
    __count_vm_event(PGFREE);

    // The per-cpu lists only hold a few designated migrate types
    if (migratetype >= MIGRATE_PCPTYPES) {
        if (unlikely(is_migrate_isolate(migratetype))) {
            // Free directly back to the buddy system
            free_one_page(zone, page, pfn, 0, migratetype,
                      FPI_NONE);
            return;
        }
        migratetype = MIGRATE_MOVABLE;
    }

    pcp = &this_cpu_ptr(zone->pageset)->pcp;
    // Insert the page at the head of pcp->lists[migratetype]
    list_add(&page->lru, &pcp->lists[migratetype]);
    pcp->count++;
    // If the pcp list has grown past its high watermark, return a batch of pages to the buddy system
    if (pcp->count >= READ_ONCE(pcp->high))
        free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
}
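The high/batch interplay at the end of free_unref_page_commit() is easier to see in isolation. Below is a minimal user-space model (my own sketch; PCP_HIGH, PCP_BATCH and the *_model functions are made-up stand-ins for pcp->high, pcp->batch and the real kernel functions, whose actual values show up under the per-cpu pagesets in /proc/zoneinfo): every free bumps the per-cpu count, and once it reaches the high watermark a whole batch is drained back to the buddy system.

/* Simplified model of free_unref_page_commit()'s trimming policy. */
#include <stdio.h>

#define PCP_HIGH  16    /* example value standing in for pcp->high  */
#define PCP_BATCH 4     /* example value standing in for pcp->batch */

static int pcp_count;

static void free_pcppages_bulk_model(int batch)
{
    pcp_count -= batch; /* pretend 'batch' pages went back to the buddy system */
    printf("  drained %d pages to the buddy system, count=%d\n", batch, pcp_count);
}

static void free_unref_page_model(int page_id)
{
    pcp_count++;        /* models list_add(&page->lru, &pcp->lists[migratetype]) */
    printf("freed page %2d to pcp list, count=%d\n", page_id, pcp_count);
    if (pcp_count >= PCP_HIGH)
        free_pcppages_bulk_model(PCP_BATCH);
}

int main(void)
{
    for (int i = 0; i < 20; i++)
        free_unref_page_model(i);
    return 0;
}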
static inline void __free_one_page(struct page *page,
        unsigned long pfn,
        struct zone *zone, unsigned int order,
        int migratetype, fpi_t fpi_flags)
{
    struct capture_control *capc = task_capc(zone);
    unsigned long buddy_pfn;
    unsigned long combined_pfn;
    unsigned int max_order;
    struct page *buddy;
    bool to_tail;

    // Merging is capped at min(MAX_ORDER - 1, pageblock_order)
    max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);

    VM_BUG_ON(!zone_is_initialized(zone));
    VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);

    VM_BUG_ON(migratetype == -1);
    if (likely(!is_migrate_isolate(migratetype)))
        __mod_zone_freepage_state(zone, 1 << order, migratetype);

    VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
    VM_BUG_ON_PAGE(bad_range(zone, page), page);

continue_merging:
    // Keep merging until order == max_order - 1
    while (order < max_order) {
        if (compaction_capture(capc, page, order, migratetype)) {
            __mod_zone_freepage_state(zone, -(1 << order),
                                migratetype);
            return;
        }
        // Find the buddy page frame: buddy_pfn = pfn ^ (1 << order)
        buddy_pfn = __find_buddy_pfn(pfn, order);
        // Get the corresponding struct page
        buddy = page + (buddy_pfn - pfn);

        // Is the buddy pfn valid?
        if (!pfn_valid_within(buddy_pfn))
            goto done_merging;
        /*
         * Check whether the buddy page is free and mergeable. It must:
         * 1. be in the buddy system
         * 2. have the same order
         * 3. be in the same zone
         */
        if (!page_is_buddy(page, buddy, order))
            goto done_merging;

        // The buddy is free (or is a guard page): merge and move up one order
        if (page_is_guard(buddy))
            clear_page_guard(zone, buddy, order, migratetype);
        else
            del_page_from_free_list(buddy, zone, order);
        // Merge the pair and compute the new (combined) pfn
        combined_pfn = buddy_pfn & pfn;
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;
        order++;
    }
    if (order < MAX_ORDER - 1) {
        // Prevent merging between isolated pageblocks and normal pageblocks
        if (unlikely(has_isolate_pageblock(zone))) {
            int buddy_mt;

            buddy_pfn = __find_buddy_pfn(pfn, order);
            buddy = page + (buddy_pfn - pfn);
            buddy_mt = get_pageblock_migratetype(buddy);

            if (migratetype != buddy_mt
                    && (is_migrate_isolate(migratetype) ||
                        is_migrate_isolate(buddy_mt)))
                goto done_merging;
        }
        max_order = order + 1;
        goto continue_merging;
    }

done_merging:
    // Record the order and mark the page as belonging to the buddy system
    set_buddy_order(page, order);

    if (fpi_flags & FPI_TO_TAIL)
        to_tail = true;
    else if (is_shuffle_order(order))
        // is_shuffle_order() normally returns false unless free-page shuffling is enabled
        to_tail = shuffle_pick_tail();
    else
        // If this is not a maximum-order page, check whether its buddy looks free.
        // If so, the buddy is probably being freed right now and will be merged soon.
        // In that case, prefer adding the page to the tail of the zone->free_area[order]
        // list, delaying its reuse so the two can merge once the buddy is actually freed.
        to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);

    // Insert at the tail
    if (to_tail)
        add_to_free_list_tail(page, zone, order, migratetype);
    else
        // Insert at the head
        add_to_free_list(page, zone, order, migratetype);

    /* Notify page reporting subsystem of freed page */
    if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY))
        page_reporting_notify_free(order);
}
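The heart of the merge loop is plain pfn arithmetic: __find_buddy_pfn() computes pfn ^ (1 << order), and the merged block starts at buddy_pfn & pfn. The runnable user-space sketch below (my own illustration, not kernel code) walks an example pfn up a few orders, which also shows why a free block of order n is always naturally aligned to 2^n pages.

/* Demonstrates the buddy/combined pfn arithmetic used by __free_one_page(). */
#include <stdio.h>

static unsigned long find_buddy_pfn(unsigned long pfn, unsigned int order)
{
    return pfn ^ (1UL << order);    /* same formula as __find_buddy_pfn() */
}

int main(void)
{
    unsigned long pfn = 0x1234; /* an arbitrary example pfn */

    for (unsigned int order = 0; order < 4; order++) {
        unsigned long buddy_pfn = find_buddy_pfn(pfn, order);
        unsigned long combined_pfn = buddy_pfn & pfn;

        printf("order %u: pfn=0x%lx buddy=0x%lx combined=0x%lx\n",
               order, pfn, buddy_pfn, combined_pfn);
        /* pretend the buddy was free and mergeable, so move up one order */
        pfn = combined_pfn;
    }
    return 0;
}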
static void free_pcppages_bulk(struct zone *zone, int count,
                    struct per_cpu_pages *pcp)
{
    int migratetype = 0;
    int batch_free = 0;
    int prefetch_nr = READ_ONCE(pcp->batch);
    bool isolated_pageblocks;
    struct page *page, *tmp;
    LIST_HEAD(head);

    // Never try to drain more pages than the pcp lists actually hold
    count = min(pcp->count, count);

    while (count) {
        struct list_head *list;

        /*
         * Remove pages from lists in a round-robin fashion. A
         * batch_free count is maintained that is incremented when an
         * empty list is encountered. This is so more pages are freed
         * off fuller lists instead of spinning excessively around empty
         * lists
         */
        // Walk the migrate-type lists round-robin; batch_free grows for every empty list skipped
        do {
            batch_free++;
            if (++migratetype == MIGRATE_PCPTYPES)
                migratetype = 0;
            list = &pcp->lists[migratetype];
        } while (list_empty(list));

        // Only one list is non-empty: drain it for the rest of count
        if (batch_free == MIGRATE_PCPTYPES)
            batch_free = count;

        do {
            // Take the entry at the tail of the list
            page = list_last_entry(list, struct page, lru);
            /* must delete to avoid corrupting pcp list */
            list_del(&page->lru);
            pcp->count--;

            if (bulkfree_pcp_prepare(page))
                continue;

            // Collect the page on the local head list
            list_add_tail(&page->lru, &head);

            // Prefetch the page's buddy
            if (prefetch_nr) {
                prefetch_buddy(page);
                prefetch_nr--;
            }
        } while (--count && --batch_free && !list_empty(list));
    }

    spin_lock(&zone->lock);
    isolated_pageblocks = has_isolate_pageblock(zone);

    /*
     * Use safe version since after __free_one_page(),
     * page->lru.next will not point to original list.
     */
    list_for_each_entry_safe(page, tmp, &head, lru) {
        int mt = get_pcppage_migratetype(page);
        // MIGRATE_ISOLATE pages must never end up on a pcp list
        VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
        // The pageblock may have been isolated in the meantime, so re-read its migratetype
        if (unlikely(isolated_pageblocks))
            mt = get_pageblock_migratetype(page);

        // Hand the page back to the buddy system via __free_one_page()
        __free_one_page(page, page_to_pfn(page), zone, 0, mt, FPI_NONE);
        trace_mm_page_pcpu_drain(page, 0, mt);
    }
    spin_unlock(&zone->lock);
}
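The least obvious part of free_pcppages_bulk() is the round-robin/batch_free selection. The user-space model below (my own simplification with three fake lists; the counts are arbitrary) mirrors just that logic: each pass advances migratetype until a non-empty list is found, batch_free grows by one for every empty list skipped, and so fuller lists give up more pages per pass.

/* Models free_pcppages_bulk()'s round-robin list selection and batch_free logic. */
#include <stdio.h>

#define MIGRATE_PCPTYPES 3

int main(void)
{
    int lists[MIGRATE_PCPTYPES] = { 5, 0, 9 }; /* fake page counts per migratetype list */
    int count = 8;                             /* how many pages to drain in total */
    int migratetype = 0;
    int batch_free = 0;

    while (count) {
        /* walk the lists round-robin; batch_free grows for every empty list skipped */
        do {
            batch_free++;
            if (++migratetype == MIGRATE_PCPTYPES)
                migratetype = 0;
        } while (lists[migratetype] == 0);

        /* only one list is non-empty: take everything still needed from it */
        if (batch_free == MIGRATE_PCPTYPES)
            batch_free = count;

        do {
            lists[migratetype]--;   /* models list_del() + moving the page to the local list */
            printf("take one page from list %d\n", migratetype);
        } while (--count && --batch_free && lists[migratetype]);
    }
    return 0;
}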
/*
 * Slab cache management.
 */
struct kmem_cache {
    struct kmem_cache_cpu __percpu *cpu_slab;   // per-cpu cache
    /* Used for retrieving partial slabs, etc. */
    slab_flags_t flags;
    unsigned long min_partial;  // minimum number of partial slabs kept on each node's partial list
    unsigned int size;          /* The size of an object including metadata -- the real per-object footprint */
    unsigned int object_size;   /* The size of an object without metadata */
    struct reciprocal_value reciprocal_size;
    unsigned int offset;        /* Free pointer offset -- where the pointer to the next free object is stored */
#ifdef CONFIG_SLUB_CPU_PARTIAL
    /* Number of per cpu partial objects to keep around */
    unsigned int cpu_partial;   // upper bound for the per-cpu partial list; slabs beyond it are moved to the kmem_cache_node partial list
#endif
    struct kmem_cache_order_objects oo; // packs the slab page order (high 16 bits) and the objects per slab (low 16 bits)

    /* Allocation and freeing of slabs */
    struct kmem_cache_order_objects max;    // maximum order/objects combination
    struct kmem_cache_order_objects min;    // minimum order/objects combination (fallback when memory is fragmented)
    gfp_t allocflags;       /* gfp flags added to every allocation request sent to the buddy system */
    int refcount;           /* Refcount for slab cache destroy */
    void (*ctor)(void *);
    unsigned int inuse;     /* Offset to metadata */
    unsigned int align;     /* Alignment */
    unsigned int red_left_pad;  /* Left redzone padding size */
    const char *name;       /* Name, shown via sysfs/procfs */
    struct list_head list;  /* Node in the global list of slab caches */
#ifdef CONFIG_SYSFS
    struct kobject kobj;    /* For sysfs */
#endif
#ifdef CONFIG_SLAB_FREELIST_HARDENED
    unsigned long random;
#endif

#ifdef CONFIG_NUMA
    /*
     * Defragmentation by allocating from a remote node.
     */
    unsigned int remote_node_defrag_ratio;
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
    unsigned int *random_seq;
#endif

#ifdef CONFIG_KASAN
    struct kasan_cache kasan_info;
#endif

    unsigned int useroffset;    /* Usercopy region offset */
    unsigned int usersize;      /* Usercopy region size */

    struct kmem_cache_node *node[MAX_NUMNODES]; // per-NUMA-node slab lists
};
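From an exploitation point of view the offset and random fields are the interesting ones: each free object stores the pointer to the next free object at object + s->offset, and with CONFIG_SLAB_FREELIST_HARDENED that pointer is stored obfuscated. The user-space sketch below mimics mm/slub.c's freelist_ptr() as I understand it for 5.13 (the pointer XORed with s->random and the byte-swapped address of the slot itself); treat the exact formula as an assumption and verify it against mm/slub.c before relying on it.

/* Rough user-space model of SLUB's hardened freelist pointer obfuscation. */
#include <stdio.h>
#include <stdint.h>
#include <byteswap.h>

/* Assumed to correspond to mm/slub.c's freelist_ptr() under CONFIG_SLAB_FREELIST_HARDENED. */
static uint64_t freelist_ptr(uint64_t cache_random, uint64_t ptr, uint64_t ptr_addr)
{
    return ptr ^ cache_random ^ bswap_64(ptr_addr);
}

int main(void)
{
    uint64_t cache_random = 0xdeadbeefcafebabeULL;  /* s->random, fixed at cache creation */
    uint64_t next_free    = 0xffff888012345000ULL;  /* plaintext "next free object" pointer */
    uint64_t ptr_addr     = 0xffff888012345100ULL;  /* &object->freeptr, i.e. object + s->offset */

    uint64_t stored = freelist_ptr(cache_random, next_free, ptr_addr);
    printf("value stored in the object : 0x%016llx\n", (unsigned long long)stored);

    /* XOR is its own inverse, so decoding is the same operation. */
    uint64_t decoded = freelist_ptr(cache_random, stored, ptr_addr);
    printf("decoded back               : 0x%016llx\n", (unsigned long long)decoded);
    return 0;
}

Because decoding is the same XOR, knowing the plaintext of one slot (for example, the last object of a slab stores an encoded NULL) together with its slot address is enough to recover s->random directly.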
struct kmem_cache_cpu {
    void **freelist;    /* Pointer to next available object */
    unsigned long tid;  /* Unique per-cpu transaction id, used to detect cpu switches/preemption */
    struct page *page;  /* The slab we are currently allocating from */
#ifdef CONFIG_SLUB_CPU_PARTIAL
    struct page *partial;   /* Per-cpu partial slabs (slabs that still contain free objects) */
#endif
#ifdef CONFIG_SLUB_STATS