[Original] Source Code Analysis of the Memory Management Subsystem in Linux Kernel 5.13
2021-9-1 09:47


This article is based on Linux kernel 5.13, the latest release at the time of writing.

 

The memory management subsystem has always been one of the most important parts of the kernel. This article tries to walk through some of its core pieces and, combined with our experience from exploit development, deepen our understanding of the kernel.

 


Before We Start

NUMA vs. UMA/SMP

Let's start from the very top.

 

Suppose we have three CPUs: C1, C2, and C3.

 

UMA/SMP: Uniform Memory Access

 

Roughly speaking, C1, C2, and C3 act as a single whole and share all of the physical memory. Each processor may have its own private cache.

 

[Figure: UMA/SMP architecture]

 

NUMA: Non-Uniform Memory Access

 

For processors C1, C2, and C3, memory is no longer simply "shared".

 

[Figure: NUMA architecture]

 

Concretely, from CPU1's point of view, the memory attached to CPU1's own memory controller is its local memory, while memory attached to CPU2 counts as foreign or remote memory for CPU1.

 

Remote accesses carry extra latency, because they must traverse the interconnect (point-to-point links) and go through the remote memory controller. Because memory location matters, the system experiences "non-uniform" memory access times.
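To see the distinction in practice, here is a small user-space sketch (it assumes libnuma is installed and the program is linked with -lnuma; the buffer size and node number are arbitrary choices for illustration) that asks for memory backed by a specific node:

```c
#include <numa.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
    if (numa_available() < 0) {
        fprintf(stderr, "NUMA is not available on this system\n");
        return 1;
    }

    int last_node = numa_max_node();
    size_t sz = 16 * 1024 * 1024;

    /* ask the kernel to back this allocation with memory from node 0 */
    void *buf = numa_alloc_onnode(sz, 0);
    if (!buf)
        return 1;

    memset(buf, 0, sz);          /* fault the pages in, so they really land on node 0 */
    printf("nodes: 0..%d, allocated %zu bytes on node 0\n", last_node, sz);

    numa_free(buf, sz);
    return 0;
}
```

On a NUMA box, keeping a thread and the memory it touches on the same node avoids the remote-access penalty described above.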

The Hierarchy

Memory in the Linux kernel is organized hierarchically: node -> zone -> page.

 

[Figure: node -> zone -> page hierarchy]

 

As the figure shows, each CPU maintains its own node, and under NUMA that node can be understood as the CPU's local memory.

 

Each node is further divided into several zones.

 

In the kernel source, the nodes live in a global array:

//arch/x86/mm/numa.c
struct pglist_data *node_data[MAX_NUMNODES] __read_mostly;
typedef struct pglist_data {
    /*
     * node_zones contains just the zones for THIS node. Not all of the
     * zones may be populated, but it is the full list. It is referenced by
     * this node's node_zonelists as well as other node's node_zonelists.
     */
    struct zone node_zones[MAX_NR_ZONES];
 
    /*
     * node_zonelists contains references to all zones in all nodes.
     * Generally the first zones will be references to this node's
     * node_zones.
     */
    struct zonelist node_zonelists[MAX_ZONELISTS];
 
    int nr_zones; /* number of populated zones in this node */
  ......

As we can see, each pglist_data in this array contains the node's own node_zones as well as node_zonelists referencing zones from all nodes.
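Before looking at individual zone fields, here is a minimal kernel-module sketch (the module and function names are invented for the example) that walks every online node and prints its populated zones, just to make the node/zone hierarchy tangible:

```c
#include <linux/module.h>
#include <linux/mmzone.h>
#include <linux/nodemask.h>

static int __init nodezone_init(void)
{
    int nid;

    for_each_online_node(nid) {
        pg_data_t *pgdat = NODE_DATA(nid);   /* this node's pglist_data */
        int i;

        for (i = 0; i < MAX_NR_ZONES; i++) {
            struct zone *z = &pgdat->node_zones[i];

            if (!populated_zone(z))          /* skip zones that hold no pages */
                continue;
            pr_info("node %d zone %-8s spanned %lu present %lu\n",
                    nid, z->name, z->spanned_pages, z->present_pages);
        }
    }
    return 0;
}

static void __exit nodezone_exit(void) { }

module_init(nodezone_init);
module_exit(nodezone_exit);
MODULE_LICENSE("GPL");
```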

 

Each zone maintains several important fields:

  • watermark
    • The watermarks describe how tight the zone's supply of free pages is. When free pages fall below the low watermark, kswapd is woken up to reclaim memory (and below the min watermark, allocations enter direct reclaim).
  • spanned_pages
    • The number of page frames spanned by this zone.
  • long lowmem_reserve[MAX_NR_ZONES]: an array that reserves some memory in the lower zones, so we do not trigger the OOM killer in a low zone while a higher zone still holds plenty of freeable memory. It is another form of reserved memory.
  • zone_start_pfn: the first physical page frame number of the zone; zone_start_pfn + spanned_pages gives the zone's ending page frame number.
    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
  • free_area: describes how many free page frames the zone still has available for allocation.

It is worth mentioning that zones come in different types (loosely analogous to how the slab allocator has different caches), which you can list with:

root@ubuntu:~# cat /proc/zoneinfo |grep Node
Node 0, zone      DMA
Node 0, zone    DMA32
Node 0, zone   Normal
Node 0, zone  Movable
Node 0, zone   Device

Next up: pages and page frames. Their relationship is like that of eggs (pages) and baskets (page frames).

 

Generally a page is 4 KB and is the smallest unit in which physical memory is managed.
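To make the page / page frame / address relationships concrete, here is a hedged kernel-module sketch (names invented for the example) that allocates one page frame and prints its pfn, physical address, and kernel virtual address:

```c
#include <linux/module.h>
#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/pfn.h>

static int __init page_demo_init(void)
{
    struct page *page = alloc_page(GFP_KERNEL);   /* grab a single page frame */
    unsigned long pfn;
    void *vaddr;

    if (!page)
        return -ENOMEM;

    pfn   = page_to_pfn(page);      /* index of this frame in physical memory */
    vaddr = page_address(page);     /* its address in the kernel linear mapping */

    pr_info("PAGE_SIZE=%lu PAGE_SHIFT=%d\n", PAGE_SIZE, PAGE_SHIFT);
    pr_info("pfn=%lu phys=0x%llx kva=%px\n",
            pfn, (unsigned long long)PFN_PHYS(pfn), vaddr);

    __free_page(page);              /* give the frame back */
    return 0;
}

static void __exit page_demo_exit(void) { }

module_init(page_demo_init);
module_exit(page_demo_exit);
MODULE_LICENSE("GPL");
```

With that correspondence in mind, here is the struct page definition from mm_types.h: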

//mm_types.h
struct page {
    unsigned long flags;        /* Atomic flags, some possibly
                     * updated asynchronously */
    union {
        struct {    /* Page cache and anonymous pages */
            struct list_head lru;
            /* See page-flags.h for PAGE_MAPPING_FLAGS */
            struct address_space *mapping;
            pgoff_t index;        /* Our offset within mapping. */
            /**
             * @private: Mapping-private opaque data.
             * Usually used for buffer_heads if PagePrivate.
             * Used for swp_entry_t if PageSwapCache.
             * Indicates order in the buddy system if PageBuddy.
             */
            unsigned long private;
        };
        struct {    /* page_pool used by netstack */
            /**
             * @dma_addr: might require a 64-bit value on
             * 32-bit architectures.
             */
            unsigned long dma_addr[2];
        };
        struct {    /* slab, slob and slub */
            union {
                struct list_head slab_list;
                struct {    /* Partial pages */
                    struct page *next;
#ifdef CONFIG_64BIT
                    int pages;    /* Nr of pages left */
                    int pobjects;    /* Approximate count */
#else
                    short int pages;
                    short int pobjects;
#endif
                };
            };
            struct kmem_cache *slab_cache; /* not slob */
            /* Double-word boundary */
            void *freelist;        /* first free object */
            union {
                void *s_mem;    /* slab: first object */
                unsigned long counters;        /* SLUB */
                struct {            /* SLUB */
                    unsigned inuse:16;
                    unsigned objects:15;
                    unsigned frozen:1;
                };
            };
        };
        struct {    /* Tail pages of compound page */
            unsigned long compound_head;    /* Bit zero is set */
 
            /* First tail page only */
            unsigned char compound_dtor;
            unsigned char compound_order;
            atomic_t compound_mapcount;
            unsigned int compound_nr; /* 1 << compound_order */
        };
        struct {    /* Second tail page of compound page */
            unsigned long _compound_pad_1;    /* compound_head */
            atomic_t hpage_pinned_refcount;
            /* For both global and memcg */
            struct list_head deferred_list;
        };
        struct {    /* Page table pages */
            unsigned long _pt_pad_1;    /* compound_head */
            pgtable_t pmd_huge_pte; /* protected by page->ptl */
            unsigned long _pt_pad_2;    /* mapping */
            union {
                struct mm_struct *pt_mm; /* x86 pgds only */
                atomic_t pt_frag_refcount; /* powerpc */
            };
#if ALLOC_SPLIT_PTLOCKS
            spinlock_t *ptl;
#else
            spinlock_t ptl;
#endif
        };
        struct {    /* ZONE_DEVICE pages */
            /** @pgmap: Points to the hosting device page map. */
            struct dev_pagemap *pgmap;
            void *zone_device_data;
        };
 
        /** @rcu_head: You can use this to free a page by RCU. */
        struct rcu_head rcu_head;
    };
 
    union {        /* This union is 4 bytes in size. */
        atomic_t _mapcount;
 
        /*
         * If the page is neither PageSlab nor mappable to userspace,
         * the value stored here may help determine what this page
         * is used for.  See page-flags.h for a list of page types
         * which are currently stored here.
         */
        unsigned int page_type;
 
        unsigned int active;        /* SLAB */
        int units;            /* SLOB */
    };
 
    /* Usage count. *DO NOT USE DIRECTLY*. See page_ref.h */
    atomic_t _refcount;
 
#ifdef CONFIG_MEMCG
    unsigned long memcg_data;
#endif
#if defined(WANT_PAGE_VIRTUAL)
    void *virtual;            /* Kernel virtual address (NULL if
                       not kmapped, ie. highmem) */
#endif /* WANT_PAGE_VIRTUAL */
 
#ifdef LAST_CPUPID_NOT_IN_PAGE_FLAGS
    int _last_cpupid;
#endif
} _struct_page_alignment;

Let's focus on a few of the important members:

  • flags: encodes various attributes of the page frame.

    The layout of flags is as follows:

    [Figure: layout of page->flags]

We mainly care about the low-order flag bits, which identify the page's state:

enum pageflags {
    PG_locked,        /* Page is locked. Don't touch. */
    PG_referenced,    /* the page was recently referenced */
    PG_uptodate,
    PG_dirty,            /* the page's data has been modified (dirty page) */
    PG_lru,                /* the page is on an LRU list */
    PG_active,
    PG_workingset,
    PG_waiters,       
    PG_error,
    PG_slab,            /* the page belongs to the slab allocator */
    PG_owner_priv_1,    /* Owner use. If pagecache, fs may use*/
    PG_arch_1,
    PG_reserved,
    PG_private,        /* If pagecache, has fs-private data */
    PG_private_2,        /* If pagecache, has fs aux data */
    PG_writeback,        /* the page is being written back */
    PG_head,        /* A head page */
    PG_mappedtodisk,    /* Has blocks allocated on-disk */
    PG_reclaim,        /* To be reclaimed asap */
    PG_swapbacked,        /* Page is backed by RAM/swap */
    PG_unevictable,        /* Page is "unevictable"  */
#ifdef CONFIG_MMU
    PG_mlocked,        /* Page is vma mlocked */
#endif
#ifdef CONFIG_ARCH_USES_PG_UNCACHED
    PG_uncached,        /* Page has been mapped as uncached */
#endif
#ifdef CONFIG_MEMORY_FAILURE
    PG_hwpoison,        /* hardware poisoned page. Don't touch */
#endif
#if defined(CONFIG_IDLE_PAGE_TRACKING) && defined(CONFIG_64BIT)
    PG_young,
    PG_idle,
#endif
#ifdef CONFIG_64BIT
    PG_arch_2,
#endif
    __NR_PAGEFLAGS,
 
    /* Filesystems */
    PG_checked = PG_owner_priv_1,
 
    /* SwapBacked */
    PG_swapcache = PG_owner_priv_1,    /* the page is in the swap cache */
  /* Swap page: swp_entry_t in private */
 
    /* Two page bits are conscripted by FS-Cache to maintain local caching
     * state.  These bits are set on pages belonging to the netfs's inodes
     * when those inodes are being locally cached.
     */
    PG_fscache = PG_private_2,    /* page backed by cache */
 
    /* XEN */
    /* Pinned in Xen as a read-only pagetable page. */
    PG_pinned = PG_owner_priv_1,
    /* Pinned as part of domain save (see xen_mm_pin_all()). */
    PG_savepinned = PG_dirty,
    /* Has a grant mapping of another (foreign) domain's page. */
    PG_foreign = PG_owner_priv_1,
    /* Remapped by swiotlb-xen. */
    PG_xen_remapped = PG_owner_priv_1,
 
    /* SLOB */
    PG_slob_free = PG_private,
 
    /* Compound pages. Stored in first tail page's flags */
    PG_double_map = PG_workingset,
 
    /* non-lru isolated movable page */
    PG_isolated = PG_reclaim,
 
    /* Only valid for buddy pages. Used to track pages that are reported */
    PG_reported = PG_uptodate,
};
  • _mapcount: how many times this page frame is mapped (i.e. referenced by page tables).

  • lru: depending on how actively the page frame is used, it is placed on different LRU lists, which page reclaim later uses to decide what to evict.

  • _refcount: the reference count.

    This field must not be used directly; it is read and written atomically through the helpers in include/linux/page_ref.h.

  • pgoff_t index: for file mappings, the page's offset within the file, in units of the page size.

  • mapping: we only cover the two most common cases here:

    • If the page is anonymous, page->mapping points to its anon_vma, and the PAGE_MAPPING_ANON bit is set to mark it as such.

    • If the page is not anonymous, i.e. it is associated with a file, mapping points to the file inode's address_space.

    • Depending on whether the region is VM_MERGEABLE and whether CONFIG_KSM is enabled, the pointer can still mean slightly different things.

      See /include/linux/page-flags.h for details.
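A tiny illustrative helper (hypothetical, not a kernel API) showing how those mapping conventions are usually tested in code:

```c
#include <linux/mm.h>
#include <linux/page-flags.h>

/* hypothetical helper: classify a page the way page->mapping encodes it */
static const char *classify_page(struct page *page)
{
    if (PageSlab(page))
        return "slab page: mapping is reused by the slab allocator";
    if (PageAnon(page))     /* low bit of page->mapping is PAGE_MAPPING_ANON */
        return "anonymous page: mapping points to its anon_vma";
    if (page->mapping)
        return "file-backed page: mapping points to the inode's address_space";
    return "page with no mapping";
}
```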

Once you are comfortable with this part, you can read about

 

the file system cache and anonymous page swapping

 

to deepen your understanding.

root@ubuntu:~# free
              total        used        free      shared  buff/cache   available
Mem:        4012836      207344     3317312        1128      488180     3499580
Swap:        998396           0      998396

Page Table Organization

In this part we focus on the 4-level page table layout used on x86-64.

 

That is: PGD -> PUD -> PMD -> PTE (the source also defines a P4D level, which is folded away under 4-level paging).

PGD: Page Global Directory
PUD: Page Upper Directory
PMD: Page Middle Directory
PTE: Page Table
 

A helpful illustration:

 

[Figure: x86-64 4-level page table walk]

A Single Page Table Walk

Given a virtual address (v_addr), the paging mechanism has to resolve it to the corresponding physical address (p_addr). Let's walk through the v_addr -> p_addr translation.

  1. Read the physical base address of the PML4T (the top-level page directory) from the CR3 register.
  2. Combine it with the corresponding bits of v_addr to obtain the physical address of the page global directory entry.
  3. Read that entry (pgd_t) and take from it the physical base address of the PUD.
  4. Add the relevant bits of v_addr to get the physical address of the PUD entry.
  5. Read the pud_t to obtain the base address of the PMD.
  6. Add the relevant v_addr bits to locate the PMD entry's physical address.
  7. The pmd_t yields the physical base of the page table; adding the relevant v_addr bits gives the physical address of the page table entry (PTE).
  8. Read the pte_t, which holds the base address of the actual physical page.
  9. Take the in-page offset from the last part of v_addr and add it to obtain the final physical address.
  10. Access the data at that physical address.

Note that every process has its own PGD: a physical page holding an array of pgd_t. In other words, each process has its own set of page tables.

task_struct -> mm_struct -> pgd_t * pgd

On a context switch the page tables are switched as well: the new process's pgd (page directory) is loaded into CR3.
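The same walk can be done in software. Below is a hedged helper sketch (the function name is made up; locking via mmap_lock and huge-page handling are omitted for brevity) that resolves a user virtual address of the current process to a page frame number using the kernel's pgd/p4d/pud/pmd/pte accessors:

```c
#include <linux/mm.h>
#include <linux/sched.h>
#include <linux/pgtable.h>

/* Hypothetical helper: resolve a user virtual address of `current` to a pfn
 * by walking PGD -> P4D -> PUD -> PMD -> PTE, mirroring the hardware walk. */
static unsigned long va_to_pfn(unsigned long vaddr)
{
    struct mm_struct *mm = current->mm;
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    unsigned long pfn = 0;

    pgd = pgd_offset(mm, vaddr);            /* mm->pgd (CR3) + PGD index bits */
    if (pgd_none(*pgd) || pgd_bad(*pgd))
        return 0;

    p4d = p4d_offset(pgd, vaddr);           /* folded no-op with 4-level paging */
    if (p4d_none(*p4d) || p4d_bad(*p4d))
        return 0;

    pud = pud_offset(p4d, vaddr);
    if (pud_none(*pud) || pud_bad(*pud))
        return 0;

    pmd = pmd_offset(pud, vaddr);
    if (pmd_none(*pmd) || pmd_bad(*pmd))
        return 0;

    pte = pte_offset_map(pmd, vaddr);       /* the leaf page table entry */
    if (pte_present(*pte))
        pfn = pte_pfn(*pte);                /* physical frame number */
    pte_unmap(pte);

    /* the final physical address would be (pfn << PAGE_SHIFT) | (vaddr & ~PAGE_MASK) */
    return pfn;
}
```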

KPTI and the Kernel Page Tables

If you have done kernel pwn, you will know the mitigation called KPTI (Kernel Page Table Isolation). When a challenge enables KPTI, you cannot simply land back in user mode.

 

The core of KPTI is that, with it enabled, every process has two sets of page tables: a kernel-mode set (only reachable while in kernel mode) and a user-mode set, living in separate address spaces.

 

Every syscall therefore involves switching between the user and kernel page tables (i.e. switching CR3).

 

[Figure: KPTI user/kernel page table split]

 

If we return to user mode (iretq/sysret) without properly switching/setting CR3, we end up on the wrong page tables and eventually hit a segmentation fault.

 

To bypass this we often use a SWITCH_USER_CR3-style gadget:

mov     rdi, cr3
or      rdi, 1000h
mov     cr3, rdi

to re-set the CR3 register (bit 12 selects the user copy of the page tables).

 

Alternatively, we can return through the kernel's own swapgs_restore_regs_and_return_to_usermode path.

 

That is the page table picture with KPTI. Without KPTI, it should be obvious that only the per-process page tables keep changing; there is a single set of kernel page tables globally, shared by all processes, and the kernel-space portion of every process's page tables is a copy of that "kernel master page table". When we want to reach the kernel page tables, we can go through init_mm.pgd:

struct mm_struct init_mm = {
    .mm_rb        = RB_ROOT,
    .pgd        = swapper_pg_dir,
    .mm_users    = ATOMIC_INIT(2),
    .mm_count    = ATOMIC_INIT(1),
    .write_protect_seq = SEQCNT_ZERO(init_mm.write_protect_seq),
    MMAP_LOCK_INITIALIZER(init_mm)
    .page_table_lock =  __SPIN_LOCK_UNLOCKED(init_mm.page_table_lock),
    .arg_lock    =  __SPIN_LOCK_UNLOCKED(init_mm.arg_lock),
    .mmlist        = LIST_HEAD_INIT(init_mm.mmlist),
    .user_ns    = &init_user_ns,
    .cpu_bitmap    = CPU_BITS_NONE,
    INIT_MM_CONTEXT(init_mm)
};

swapper_pg_dir is essentially the base address of the kernel's PGD.

/*
 * Initialized during boot, and readonly for initializing page tables
 * afterwards
 */
pgd_t swapper_pg_dir[PTRS_PER_PGD];

For how the kernel page tables are created, see:

 

https://richardweiyang-2.gitbook.io/kernel-exploring/00-evolution_of_kernel_pagetable

The TLB

TLB is short for Translation Lookaside Buffer; it is essentially a fast cache. You may remember fully-associative, set-associative, and direct-mapped caches from a computer architecture course.

 

Normally we translate a virtual address into a physical one by walking the four levels of page tables.

 

The TLB offers a much faster way to perform that translation.

The TLB is a small, virtually addressed cache in which each line holds a block consisting of a single PTE (page table entry). Without a TLB, every data access would need two memory accesses: one to read the page table and obtain the physical address, and one to fetch the data itself.

 

Caches with different mapping schemes are organized differently, but the overall idea is the same: look the virtual address up in the cache, and on a TLB hit the physical address comes out directly.

 

The TLB holds recently used page table entries. Given a virtual address, the processor first checks the TLB for a matching entry (a TLB hit), retrieves the frame number and forms the physical address. If no entry is found (a TLB miss), the page number is used to index the process's page tables; the walk first checks whether the page is in main memory, raising a page fault if it is not, and the TLB is then updated to include the new entry.

 

[Figure: TLB lookup flow]

 

The figure makes it clear that the TLB provides a mapping from v_addr[12:47] to p_addr[12:47]. (The low 12 bits are the page offset and are identical on both sides, so they need no translation.)

 

The ASID mainly serves to distinguish different processes.

Page Cache

First, to be clear: the page cache is the main disk cache used by the Linux kernel.

page cache is the main disk cache used by the Linux kernel.

 

With ordinary buffered I/O, writes first land in the page cache; the pages become dirty pages, and kernel writeback threads (historically pdflush, nowadays the per-backing-device flusher threads) later write them back to disk. Likewise, reads are first brought into the page cache and then copied to user space. If we read the same file again and it is already in the page cache, performance improves dramatically.
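A quick user-space way to observe the page cache at work is mincore(), which reports which pages of a mapping are currently resident. The sketch below (file path taken from argv, page size hard-coded to 4096 for brevity) prints how much of a file is cached:

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc != 2)
        return 1;

    int fd = open(argv[1], O_RDONLY);
    struct stat st;
    if (fd < 0 || fstat(fd, &st) < 0 || st.st_size == 0)
        return 1;

    void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return 1;

    size_t pages = (st.st_size + 4095) / 4096;
    unsigned char *vec = malloc(pages);
    size_t resident = 0;

    /* mincore() reports, per page, whether it is currently in memory,
     * i.e. present in the page cache for a file mapping */
    if (vec && mincore(map, st.st_size, vec) == 0) {
        for (size_t i = 0; i < pages; i++)
            resident += vec[i] & 1;
        printf("%zu / %zu pages of %s are in the page cache\n",
               resident, pages, argv[1]);
    }

    free(vec);
    munmap(map, st.st_size);
    close(fd);
    return 0;
}
```

Run it before and after reading the file and you can watch the resident count jump.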

Inverted page tables(IPT)

An inverted page table, as the name suggests, stores one entry per physical page frame.

 

It exists to relieve the memory cost of multi-level page tables: entries correspond one-to-one with physical page frames rather than with virtual pages.

 

It therefore holds far fewer entries (physical memory is usually much smaller than the virtual address space), and it is indexed by page frame number rather than by virtual page number.

 

Although the IPT design saves a lot of space, it makes virtual-to-physical translation harder: when process n accesses virtual page p, the hardware can no longer use p as an index into a page table. Instead it has to search the inverted table for a matching entry (in practice with the help of hashing).

 

That is why such designs lean heavily on the TLB to keep translation fast in the common case.

Huge page

Huge pages are also known as large pages.

 

Normally a page is 4 KB. That creates a problem: with a lot of physical memory, the page tables needed to map it all become very large and themselves consume a lot of memory. Huge pages make each page bigger, so fewer page table entries are needed and the page-table overhead drops.

 

The x86-64 four-level paging scheme supports 2 MB and 1 GB huge pages.

 

The main benefits are fewer page table entries, faster lookups, and a higher TLB hit rate.

 

On 32-bit x86 this is enabled through the PSE (page size extension) bit in CR4; in 64-bit long mode, a large page is selected by the PS bit in the corresponding PDE/PDPTE instead.

 

The drawback is that huge pages must be reserved up front; if too many are reserved, the memory is wasted and unavailable to other programs.
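For completeness, this is roughly how a program asks for one of those pre-reserved huge pages from user space (a sketch; it assumes 2 MB huge pages have already been reserved, e.g. via /proc/sys/vm/nr_hugepages):

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define LEN (2UL * 1024 * 1024)          /* one 2 MB huge page */

int main(void)
{
    void *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) {
        perror("mmap(MAP_HUGETLB)");     /* fails if no huge pages are reserved */
        return 1;
    }

    memset(p, 0, LEN);                   /* the whole region is backed by a single PMD-level mapping */
    printf("huge page mapped at %p\n", p);
    munmap(p, LEN);
    return 0;
}
```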

THP (Transparent Huge Pages)

 

THP (transparent huge pages) is an optimization on top of huge pages that allows them to be allocated dynamically. THP reduces the overhead of supporting huge pages and lets applications use a larger virtual page size when it helps, rather than being forced onto 2 MB pages everywhere.

 

A transparent huge page can be split back into ordinary 4 KB pages, which are then swapped out normally. To use huge pages effectively, though, the kernel has to find physically contiguous regions that are large enough and correctly aligned; for this there is the khugepaged kernel thread, which occasionally tries to replace ranges of small pages currently in use with huge-page allocations, maximizing THP usage. No application changes are required (hence "transparent"), but there are ways to help: an application that wants huge pages can use posix_memalign() to make sure large allocations are aligned to the 2 MB huge page boundary. Also, THP is only enabled for anonymous memory regions.

 

The catch is that, because of the dynamic allocation and the heavier locking involved, THP can actually hurt performance in some workloads.
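By contrast, THP needs no up-front reservation; a sketch of the madvise()-based hint mentioned above (the sizes are arbitrary) looks like this:

```c
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define SZ    (64UL * 1024 * 1024)
#define HPAGE (2UL * 1024 * 1024)

int main(void)
{
    void *p;

    /* align to the 2 MB boundary so the fault path / khugepaged can map whole PMDs */
    if (posix_memalign(&p, HPAGE, SZ))
        return 1;

    /* ask for transparent huge pages on this range (THP must be set to madvise or always) */
    if (madvise(p, SZ, MADV_HUGEPAGE))
        perror("madvise(MADV_HUGEPAGE)");

    memset(p, 0, SZ);    /* touching the memory lets the kernel install 2 MB mappings */
    printf("region at %p, check AnonHugePages in /proc/self/smaps\n", p);

    free(p);
    return 0;
}
```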

Page Table Entry Flag Bits

https://zhuanlan.zhihu.com/p/67053210

 

P (Present): 1 means the page is currently resident in physical memory. If it is 0, the rest of the PTE is meaningless to the hardware and any access simply triggers a page fault. A PTE with P=0 also has no TLB entry, because the corresponding TLB entry was flushed back when P went from 1 to 0.

G (Global): marks translations (typically the kernel's) that should not be flushed on a context switch; the flag is carried in the TLB entry as well.

A (Accessed): set by the hardware when the page is read or written; the TLB only caches translations for pages whose A bit is 1. Software can clear the bit (flushing the corresponding TLB entry), and by doing so repeatedly it can estimate how often each page is accessed, which feeds into the decision of which pages to reclaim when memory runs low.

D (Dirty): only really meaningful for file-backed pages, not anonymous ones. The hardware sets it when the page is written, meaning the in-memory contents are newer than the copy on disk/flash; before such a page can be reclaimed its contents must first be flushed to storage, after which software clears the bit.

R/W and U/S are permission bits:

R/W (Read/Write): 1 means the page is writable, 0 means read-only; writing to a read-only page triggers a page fault.

U/S (User/Supervisor): 0 means only supervisor code (e.g. the kernel) may access the page; 1 means user mode may access it as well.

PCD and PWT control cacheability:

PCD (Page Cache Disable): 1 means the page's contents must not be cached; if it is 0 (caching enabled), the CD bit in CR0, the master switch, must also be 0 for caching to actually happen.

PWT (Page Write-Through): 1 means the page's cache lines use write-through; otherwise write-back.

In 64-bit mode, additionally:

  • CR3 gained PCID support: when CR4.PCIDE is 1, the low 12 bits of CR3 hold the PCID, overriding the PCD/PWT bits there.
  • XD (eXecute Disable) controls whether the page may be executed.
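To tie the bit positions together, here is a small stand-alone decoder for a raw x86-64 PTE value (the sample value is invented):

```c
#include <stdio.h>
#include <stdint.h>

#define PTE_P    (1ULL << 0)    /* Present */
#define PTE_RW   (1ULL << 1)    /* Read/Write */
#define PTE_US   (1ULL << 2)    /* User/Supervisor */
#define PTE_PWT  (1ULL << 3)    /* Page Write-Through */
#define PTE_PCD  (1ULL << 4)    /* Page Cache Disable */
#define PTE_A    (1ULL << 5)    /* Accessed */
#define PTE_D    (1ULL << 6)    /* Dirty */
#define PTE_G    (1ULL << 8)    /* Global */
#define PTE_XD   (1ULL << 63)   /* eXecute Disable (NX) */

static void decode_pte(uint64_t pte)
{
    printf("PTE %#llx: %s %s %s %s%s%s%s phys=%#llx\n",
           (unsigned long long)pte,
           pte & PTE_P  ? "present"   : "not-present",
           pte & PTE_RW ? "writable"  : "read-only",
           pte & PTE_US ? "user"      : "supervisor",
           pte & PTE_A  ? "accessed " : "",
           pte & PTE_D  ? "dirty "    : "",
           pte & PTE_G  ? "global "   : "",
           pte & PTE_XD ? "NX"        : "executable",
           (unsigned long long)(pte & 0x000ffffffffff000ULL)); /* bits 12..51: frame base */
}

int main(void)
{
    decode_pte(0x8000000012345867ULL);   /* arbitrary example value */
    return 0;
}
```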

The Buddy System

/*
 * Set up kernel memory allocators
 */
static void __init mm_init(void)
{
    ......
    mem_init();        /* initializes the buddy system */
  ......
    kmem_cache_init(); /* initializes the slab allocator */
    ......
}

overview

The buddy system carves free memory into power-of-two sized blocks, splitting repeatedly until it obtains a block of the requested size.
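Before diving into the kernel structures, here is a toy user-space model of that splitting step (it mirrors the logic of the kernel's expand(), shown later, but keeps only counters; all names are invented):

```c
#include <stdio.h>

#define MAX_ORDER 11

static int nr_free[MAX_ORDER];   /* how many free blocks of each order we hold */

/* take one free block of order `high` and peel off buddies until only a block
 * of order `low` remains; each peeled half goes on the free list of its order */
static void expand_block(unsigned long pfn, int high, int low)
{
    unsigned long size = 1UL << high;

    while (high > low) {
        high--;
        size >>= 1;
        nr_free[high]++;         /* the upper half becomes a free block of order `high` */
        printf("put buddy at pfn %lu on free_area[%d] (%lu pages)\n",
               pfn + size, high, size);
    }
    printf("hand out pfn %lu as an order-%d block (%lu pages)\n",
           pfn, low, 1UL << low);
}

int main(void)
{
    /* e.g. an order-4 block (16 pages) serving an order-2 request (4 pages) */
    expand_block(0, 4, 2);
    return 0;
}
```

In the kernel, the bookkeeping for these power-of-two blocks lives inside each zone: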

#define MAX_ORDER 11
struct zone{
  ...
  /* free areas of different sizes */
    struct free_area    free_area[MAX_ORDER];
  ...
}

As shown above, each zone maintains MAX_ORDER free_area entries, where MAX_ORDER bounds the largest power-of-two block size.

struct free_area {
    struct list_head    free_list[MIGRATE_TYPES];
    unsigned long        nr_free;
};

The corresponding migrate types are:

enum migratetype {
    MIGRATE_UNMOVABLE,      /* pages that cannot be moved */
    MIGRATE_MOVABLE,        /* movable pages */
    MIGRATE_RECLAIMABLE,    /* reclaimable pages */
    MIGRATE_PCPTYPES,    /* the number of types on the pcp lists */
    MIGRATE_HIGHATOMIC = MIGRATE_PCPTYPES, /* in rare cases the kernel needs a high-order block
                                            * and cannot sleep; if allocating from the lists of a
                                            * given mobility fails, this emergency reserve is used */
#ifdef CONFIG_CMA
    MIGRATE_CMA,        /* the Contiguous Memory Allocator, used to avoid permanently reserving large blocks */
#endif
#ifdef CONFIG_MEMORY_ISOLATION
    MIGRATE_ISOLATE,    /* a special virtual area used to move physical pages between NUMA nodes;
                         * on large systems it helps move pages close to the CPUs using them most */
#endif
    MIGRATE_TYPES            /* just the number of migrate types, not a real type */
};
  • free_area[0] holds lists of blocks of 2^0 pages (i.e. one page)
  • free_area[1] holds lists of blocks of 2^1 pages (i.e. two pages)
  • ......
  • free_area[10] holds lists of blocks of 2^10 pages (i.e. 1024 pages)

Going further, each free_area keeps separate free_lists per migrate type:

 

[Figure: per-migratetype free lists within a free_area]

 

The buddy system's main entry points are alloc_pages, alloc_page, and a series of related functions; let's start from the top-level interface.
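For orientation, this is roughly how kernel code consumes these interfaces (a module sketch with invented names):

```c
#include <linux/module.h>
#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/string.h>

static int __init buddy_demo_init(void)
{
    unsigned int order = 2;                       /* 4 contiguous pages = 16 KiB */
    struct page *pages = alloc_pages(GFP_KERNEL, order);
    void *buf;

    if (!pages)
        return -ENOMEM;

    buf = page_address(pages);                    /* lowmem pages are always mapped */
    memset(buf, 0, PAGE_SIZE << order);
    pr_info("got order-%u block at pfn %lu\n", order, page_to_pfn(pages));

    __free_pages(pages, order);                   /* return the block to the buddy system */
    return 0;
}

static void __exit buddy_demo_exit(void) { }

module_init(buddy_demo_init);
module_exit(buddy_demo_exit);
MODULE_LICENSE("GPL");
```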

alloc_pages(gfp_t gfp_mask, unsigned int order)

static inline struct page *alloc_pages(gfp_t gfp_mask, unsigned int order)
{
    return alloc_pages_node(numa_node_id(), gfp_mask, order);
}

The parameters of this function:

- gfp_mask (passed in rdi): the GFP bitmask describing the allocation's attributes. See the [appendix](#1).
- order (passed in rsi): the order of the allocation.
 
Following the call chain:
 
alloc_pages
  alloc_pages_node
      __alloc_pages_node(nid, gfp_mask, order) //nid is the node closest to the current CPU
          __alloc_pages(gfp_mask, order, nid, NULL) //the 'heart' of the zoned buddy allocator

__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,nodemask_t *nodemask)

This is the core allocation function of the buddy system.

/*
 * This is the 'heart' of the zoned buddy allocator.
 */
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,nodemask_t *nodemask)
{
    struct page *page;
 
  //先设置WMark为low
    unsigned int alloc_flags = ALLOC_WMARK_LOW;
  //新的gfp,用于标定分配的属性
    gfp_t alloc_gfp;
 
  //用于保存参与分配的函数之间传递的大部分不可变的分配参数的结构,包括alloc_pages*系列函数。
  //代表了分配时的固定的上下文信息。
  /*
  struct alloc_context
{
    struct zonelist *zonelist;
    nodemask_t *nodemask;
    struct zoneref *preferred_zoneref;
    int migratetype;
    enum zone_type highest_zoneidx;
    bool spread_dirty_pages;
};
  */
    struct alloc_context ac = { };
 
    // 检查order
    if (unlikely(order >= MAX_ORDER)) {
        WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
        return NULL;
    }
 
  /* gfp_allowed_mask restricts the usable GFP flags; during early boot it is
   * GFP_BOOT_MASK (no I/O, FS or reclaim allowed yet) */
    gfp &= gfp_allowed_mask;
  //根据当前进程的flags(current->flags)调整gfp
    gfp = current_gfp_context(gfp);
    alloc_gfp = gfp;
 
  //prepare_alloc_pages对于struct alloc_context进行赋值
  /*
  ac->highest_zoneidx = gfp_zone(gfp_mask);
    ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
    ac->nodemask = nodemask;
    ac->migratetype = gfp_migratetype(gfp_mask);
  */
    if (!prepare_alloc_pages(gfp, order, preferred_nid, nodemask, &ac,
            &alloc_gfp, &alloc_flags))
        return NULL;
 
  //避免碎片化
  //alloc_flags = (__force int) (gfp_mask & __GFP_KSWAPD_RECLAIM);
    alloc_flags |= alloc_flags_nofragment(ac.preferred_zoneref->zone, gfp);
 
    //第一次内存分配尝试
    page = get_page_from_freelist(alloc_gfp, order, alloc_flags, &ac);
    if (likely(page))
        goto out;
 
    alloc_gfp = gfp;
    ac.spread_dirty_pages = false;
 
    /*
     * Restore the original nodemask if it was potentially replaced with
     * &cpuset_current_mems_allowed to optimize the fast-path attempt.
     */
    ac.nodemask = nodemask;
    //第一次分配失败,第二次尝试分配
    page = __alloc_pages_slowpath(alloc_gfp, order, &ac);
 
out:
    if (memcg_kmem_enabled() && (gfp & __GFP_ACCOUNT) && page &&
        unlikely(__memcg_kmem_charge_page(page, gfp, order) != 0)) {
        __free_pages(page, order);
        page = NULL;
    }
 
    trace_mm_page_alloc(page, order, alloc_gfp, ac.migratetype);
 
    return page;
}

get_page_from_freelist (fast path: allocate straight from a zone's free lists)

This is the fast allocation path.

 

get_page_from_freelist tries to allocate the pages; if it fails, __alloc_pages_slowpath takes over to handle the harder cases.

/*
 * get_page_from_freelist goes through the zonelist trying to allocate
 * a page.
 */
static struct page *
get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
                        const struct alloc_context *ac)
{
    struct zoneref *z;
    struct zone *zone;
    struct pglist_data *last_pgdat_dirty_limit = NULL;
    bool no_fallback;
 
retry:
  //扫描zone,尝试查找一个有足够的free pages的zone
    no_fallback = alloc_flags & ALLOC_NOFRAGMENT;
    z = ac->preferred_zoneref;
  //z这里是优先查找的zone,从context中获得的。
    for_next_zone_zonelist_nodemask(zone, z, ac->highest_zoneidx,ac->nodemask) {
        struct page *page;
        unsigned long mark;
 
        if (cpusets_enabled() &&
            (alloc_flags & ALLOC_CPUSET) &&
            !__cpuset_zone_allowed(zone, gfp_mask))
                continue;
 
    //主要是要保证在dirty limit之内分配,防止从LRU list中写入,kswapd即可完成平衡
        if (ac->spread_dirty_pages) {
            if (last_pgdat_dirty_limit == zone->zone_pgdat)
                continue;
 
            if (!node_dirty_ok(zone->zone_pgdat)) {
                last_pgdat_dirty_limit = zone->zone_pgdat;
                continue;
            }
        }
 
 
        if (no_fallback && nr_online_nodes > 1 &&
            zone != ac->preferred_zoneref->zone)
    {
            int local_nid;
 
            /*
             * If moving to a remote node, retry but allow
             * fragmenting fallbacks. Locality is more important
             * than fragmentation avoidance.
             */
      //如果移动到一个远的node,但是允许碎片化回退,那么局部性比碎片避免更重要
            local_nid = zone_to_nid(ac->preferred_zoneref->zone);    //获取local node id
            if (zone_to_nid(zone) != local_nid) {//如果使用的不是local node
                alloc_flags &= ~ALLOC_NOFRAGMENT;        //进行标记,retry
                goto retry;
            }
        }
 
    //检查水位是否充足,并进行回收
        mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
        if (!zone_watermark_fast(zone, order, mark,
                       ac->highest_zoneidx, alloc_flags,
                       gfp_mask))
    {
            int ret;
 
        ......
            /* Checked here to keep the fast path fast */
            BUILD_BUG_ON(ALLOC_NO_WATERMARKS < NR_WMARK);
            if (alloc_flags & ALLOC_NO_WATERMARKS)
                goto try_this_zone;
 
            if (!node_reclaim_enabled() ||
                !zone_allows_reclaim(ac->preferred_zoneref->zone, zone))
                continue;
 
            ret = node_reclaim(zone->zone_pgdat, gfp_mask, order);
            switch (ret) {
            case NODE_RECLAIM_NOSCAN:
                /* did not scan */
                continue;
            case NODE_RECLAIM_FULL:
                /* scanned but unreclaimable */
                continue;
            default:
                /* did we reclaim enough */
                if (zone_watermark_ok(zone, order, mark,
                    ac->highest_zoneidx, alloc_flags))
                    goto try_this_zone;
 
                continue;
            }
        }
 
    //调用rmqueue进行分配
try_this_zone:
        page = rmqueue(ac->preferred_zoneref->zone, zone, order,
                gfp_mask, alloc_flags, ac->migratetype);
    //如果分配成功
        if (page) {
            prep_new_page(page, order, gfp_mask, alloc_flags);
 
            /*
             * If this is a high-order atomic allocation then check
             * if the pageblock should be reserved for the future
             */
            if (unlikely(order && (alloc_flags & ALLOC_HARDER)))
                reserve_highatomic_pageblock(page, zone, order);
 
            return page;
        }
    //如果分配失败
    else {
#ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
            /* Try again if zone has deferred pages */
            if (static_branch_unlikely(&deferred_pages)) {
                if (_deferred_grow_zone(zone, order))
                    goto try_this_zone;
            }
#endif
        }
    }
 
    /*
     * It's possible on a UMA machine to get through all zones that are
     * fragmented. If avoiding fragmentation, reset and try again.
     */
    if (no_fallback) {
        alloc_flags &= ~ALLOC_NOFRAGMENT;
        goto retry;
    }
 
    return NULL;
}

rmqueue

/*
 * Allocate a page from the given zone. Use pcplists for order-0 allocations.
 */
static inline
struct page *rmqueue(struct zone *preferred_zone,
            struct zone *zone, unsigned int order,
            gfp_t gfp_flags, unsigned int alloc_flags,
            int migratetype)
{
    unsigned long flags;
    struct page *page;
    // 如果当前的阶是0,直接使用per cpu lists分配
    if (likely(order == 0)) {
        if (!IS_ENABLED(CONFIG_CMA) || alloc_flags & ALLOC_CMA ||
                migratetype != MIGRATE_MOVABLE) {
            page = rmqueue_pcplist(preferred_zone, zone, gfp_flags,migratetype, alloc_flags);
            goto out;
        }
    }
 
 
  //当设置了__GFP_NOFAIL,不能分配order > 1的空间。
    WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
  //加锁
    spin_lock_irqsave(&zone->lock, flags);
 
    do {
        page = NULL;
        /*
         * order-0 request can reach here when the pcplist is skipped
         * due to non-CMA allocation context. HIGHATOMIC area is
         * reserved for high-order atomic allocation, so order-0
         * request should skip it.
         */
    //如果pcplist分配被跳过,那么order=0会到达这里,但是我们的HIGHATOMIC区域是保留给高阶原子分配,所以order-0请求应该跳过它。
        if (order > 0 && alloc_flags & ALLOC_HARDER)
    {
      //调用__rmqueue_smallest分配,页迁移类型为MIGRATE_HIGHATOMIC
            page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
            if (page)
                trace_mm_page_alloc_zone_locked(page, order, migratetype);
        }
    //不满足上一个if,或分配失败,调用__rmqueue分配
        if (!page)
            page = __rmqueue(zone, order, migratetype, alloc_flags);
    } while (page && check_new_pages(page, order));
 
    spin_unlock(&zone->lock);
 
    if (!page)
        goto failed;
  //更新zone的freepage状态
    __mod_zone_freepage_state(zone, -(1 << order),get_pcppage_migratetype(page));
 
    __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
  //统计NUMA架构信息(hit/miss)
    zone_statistics(preferred_zone, zone);
  //恢复中断
    local_irq_restore(flags);
 
out:
    /* Separate test+clear to avoid unnecessary atomics */
    if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
        clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
        wakeup_kswapd(zone, 0, 0, zone_idx(zone));
    }
 
    VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
    return page;
 
failed:
    local_irq_restore(flags);
    return NULL;
}

When we allocate a single page (order = 0), it is taken directly from the per-CPU page lists.

 

rmqueue_pcplist goes through the following steps:

  1. Disable interrupts and save the interrupt state
  2. Fetch the per_cpu_pages pointer of the target zone on the current CPU
  3. Pick the page list of the requested migrate type out of per_cpu_pages
  4. Call __rmqueue_pcplist to take a page off that list
  5. On success, update the zone's statistics
  6. Restore interrupts

__rmqueue_pcplist works as follows:

/* Remove page from the per-cpu list, caller must protect the list */
static inline
struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
            unsigned int alloc_flags,
            struct per_cpu_pages *pcp,
            struct list_head *list)
{
    struct page *page;
 
    do {
        /* is the per-cpu list empty? */
        if (list_empty(list)) {
            /* if so, refill it: rmqueue_bulk pulls a batch of pages from the
             * buddy system and adds them to this list */
            pcp->count += rmqueue_bulk(zone, 0,READ_ONCE(pcp->batch), list,migratetype, alloc_flags);
            if (unlikely(list_empty(list)))
                return NULL;
        }
        /* take the first page off the list */
        page = list_first_entry(list, struct page, lru);
        /* unlink it from the list via its lru node */
        list_del(&page->lru);
        /* one fewer page cached on this per-cpu list */
        pcp->count--;
    } while (check_new_pcp(page));
 
    return page;
}

rmqueue_bulk works as follows:

/*
 * Obtain a specified number of elements from the buddy allocator, all under
 * a single hold of the lock, for efficiency.  Add them to the supplied list.
 * Returns the number of new pages which were placed at *list.
 */
static int rmqueue_bulk(struct zone *zone, unsigned int order,
            unsigned long count, struct list_head *list,
            int migratetype, unsigned int alloc_flags)
{
    int i, allocated = 0;
 
    spin_lock(&zone->lock);
  //扫描当前的zone的每个order list,尝试找一个最合适的page
    for (i = 0; i < count; ++i)
  {
    //取出一个page
        struct page *page = __rmqueue(zone, order, migratetype,alloc_flags);
        if (unlikely(page == NULL))
            break;
 
        if (unlikely(check_pcp_refill(page)))
            continue;
 
    //将当前page添加到lrulist
        list_add_tail(&page->lru, list);
        allocated++;
    //如果page在cma区域中,更新zone部分成员的信息,调整NR_FREE_PAGES
    /*
    atomic_long_add(x, &zone->vm_stat[item]);
        atomic_long_add(x, &vm_zone_stat[item]);
    */
        if (is_migrate_cma(get_pcppage_migratetype(page)))
            __mod_zone_page_state(zone, NR_FREE_CMA_PAGES, -(1 << order));
    }
 
  //如果check_pcp_refill检查失败,移除页面,调整NR_FREE_PAGES
  //for循环i次,扫描了i个pageblock,而每个pageblock有2^i个pages,更新NR_FREE_PAGES
    __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
    spin_unlock(&zone->lock);
    return allocated;
}

__rmqueue works as follows:

/*
 * Do the hard work of removing an element from the buddy allocator.
 * Call me with the zone->lock already held.
 */
static __always_inline struct page *
__rmqueue(struct zone *zone, unsigned int order, int migratetype,
                        unsigned int alloc_flags)
{
    struct page *page;
 
  // 如果打开了CMA,平衡常规区域和CMA区域之间的可移动分配,当该区一半以上的可用内存在CMA区域时,从CMA分配。
    if (IS_ENABLED(CONFIG_CMA))
  {
        if (alloc_flags & ALLOC_CMA &&
            zone_page_state(zone, NR_FREE_CMA_PAGES) > zone_page_state(zone, NR_FREE_PAGES) / 2)         {
            page = __rmqueue_cma_fallback(zone, order);
            if (page)
                goto out;
        }
    }
retry:
  //否则直接调用__rmqueue_smallest分配。
    page = __rmqueue_smallest(zone, order, migratetype);
    if (unlikely(!page)) {
        if (alloc_flags & ALLOC_CMA)
            page = __rmqueue_cma_fallback(zone, order);
        if (!page && __rmqueue_fallback(zone, order, migratetype,
                                alloc_flags))
            goto retry;
    }
out:
    if (page)
        trace_mm_page_alloc_zone_locked(page, order, migratetype);
    return page;
}

__rmqueue_smallest works as follows:

 

It walks the free lists order by order, looking for a block whose size and migrate type both fit.

/*
 * Go through the free lists for the given migratetype and remove
 * the smallest available page from the freelists
 */
static __always_inline
struct page *__rmqueue_smallest(struct zone *zone, unsigned int order,
                        int migratetype)
{
    unsigned int current_order;
    struct free_area *area;
    struct page *page;
 
    /* Find a page of the appropriate size in the preferred list */
    for (current_order = order; current_order < MAX_ORDER; ++current_order) {
        area = &(zone->free_area[current_order]);
    //从对应迁移类型的链表头获取page。
        page = get_page_from_free_area(area, migratetype);
        if (!page)
            continue;
    //删除page,更新zone
        del_page_from_free_list(page, zone, current_order);
 
        expand(zone, page, order, current_order, migratetype);
    //设置迁移类型
        set_pcppage_migratetype(page, migratetype);
        return page;
    }
 
    return NULL;
}

expand works as follows:

 

If current_order > order:

 

Suppose high = 4 and low = 2 (i.e. current_order and order).

while (high > low) {
        high--;
        size >>= 1;
        VM_BUG_ON_PAGE(bad_range(zone, &page[size]), &page[size]);
 
        /*
         * Mark as guard pages (or page), that will allow to
         * merge back to allocator when buddy will be freed.
         * Corresponding page table entries will not be touched,
         * pages will stay not present in virtual address space
         */
        if (set_page_guard(zone, &page[size], high, migratetype))
            continue;
 
        add_to_free_list(&page[size], zone, high, migratetype);
        set_buddy_order(&page[size], high);
    }

The surplus halves are marked as guard pages only when the debug option calls for it (guard pages cannot be touched and are merged back once their buddy is freed); otherwise each split-off half is simply placed on the free list of its order. In the high=4, low=2 example, the 16-page block sheds an 8-page buddy onto free_area[3] and a 4-page buddy onto free_area[2], and the remaining 4 pages are handed to the caller.

__alloc_pages_slowpath (the slow path)

When the fast path fails, allocation falls back to the slow path.

static inline struct page *
__alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
                        struct alloc_context *ac)
{
    bool can_direct_reclaim = gfp_mask & __GFP_DIRECT_RECLAIM;
    const bool costly_order = order > PAGE_ALLOC_COSTLY_ORDER;
    struct page *page = NULL;
    unsigned int alloc_flags;
    unsigned long did_some_progress;
    enum compact_priority compact_priority;
    enum compact_result compact_result;
    int compaction_retries;
    int no_progress_loops;
    unsigned int cpuset_mems_cookie;
    int reserve_flags;
 
    //如果内存分配来自__GFP_ATOMIC(原子请求)、__GFP_DIRECT_RECLAIM(可直接回收),会产生冲突,取消原子标识
    if (WARN_ON_ONCE((gfp_mask & (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)) == (__GFP_ATOMIC|__GFP_DIRECT_RECLAIM)))
        gfp_mask &= ~__GFP_ATOMIC;
 
retry_cpuset:
    compaction_retries = 0;
    no_progress_loops = 0;
    compact_priority = DEF_COMPACT_PRIORITY;
    cpuset_mems_cookie = read_mems_allowed_begin();
 
    //快速分配采用保守的alloc_flags,我们这里进行重新设置,降低成本。
    alloc_flags = gfp_to_alloc_flags(gfp_mask);
 
    //重新计算分配迭代zone的起始点。
    ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                    ac->highest_zoneidx, ac->nodemask);
    if (!ac->preferred_zoneref->zone)
        goto nopage;
 
    //如果设置了ALLOC_KSWAPD,唤醒kswapds进程
    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);
 
    //使用重新调整后的信息再次重新分配
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;
 
    /*
     * For costly allocations, try direct compaction first, as it's likely
     * that we have enough base pages and don't need to reclaim. For non-
     * movable high-order allocations, do that as well, as compaction will
     * try prevent permanent fragmentation by migrating from blocks of the
     * same migratetype.
     * Don't try this for allocations that are allowed to ignore
     * watermarks, as the ALLOC_NO_WATERMARKS attempt didn't yet happen.
     */
    //示情况进行内存压缩
    if (can_direct_reclaim &&
            (costly_order ||
               (order > 0 && ac->migratetype != MIGRATE_MOVABLE)) && !gfp_pfmemalloc_allowed(gfp_mask)) {
        page = __alloc_pages_direct_compact(gfp_mask, order,
                        alloc_flags, ac,
                        INIT_COMPACT_PRIORITY,
                        &compact_result);
        if (page)
            goto got_pg;
 
    //如果设置了__GFP_NORETRY,可能包含了一些THP page fault的分配
        if (costly_order && (gfp_mask & __GFP_NORETRY)) {
            if (compact_result == COMPACT_SKIPPED ||
                compact_result == COMPACT_DEFERRED)
                goto nopage;
 
            //同步压缩开销太大,保持异步压缩
            compact_priority = INIT_COMPACT_PRIORITY;
        }
    }
 
retry:
    //保证kswapd不会睡眠,再次唤醒
    if (alloc_flags & ALLOC_KSWAPD)
        wake_all_kswapds(order, gfp_mask, ac);
 
    //区分真正需要访问全部内存储备的请求和可以承受部分内存的被oom kill掉的请求。
    reserve_flags = __gfp_pfmemalloc_flags(gfp_mask);
    if (reserve_flags)
        alloc_flags = gfp_to_alloc_flags_cma(gfp_mask, reserve_flags);
 
    //当不允许在当前cpu-node中分配,且设置了reserve_flags,那么降低此时的分配标准,重置高优先级的迭代器再进行分配。
    if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
        ac->nodemask = NULL;
        ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                    ac->highest_zoneidx, ac->nodemask);
    }
 
    /* Attempt with potentially adjusted zonelist and alloc_flags */
    page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
    if (page)
        goto got_pg;
 
    /* Caller is not willing to reclaim, we can't balance anything */
    if (!can_direct_reclaim)
        goto nopage;
 
    /* Avoid recursion of direct reclaim */
    if (current->flags & PF_MEMALLOC)
        goto nopage;
 
    //尝试先回收,再分配
    page = __alloc_pages_direct_reclaim(gfp_mask, order, alloc_flags, ac,
                            &did_some_progress);
    if (page)
        goto got_pg;
 
    //尝试直接压缩,再分配
    page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                    compact_priority, &compact_result);
    if (page)
        goto got_pg;
 
    /* Do not loop if specifically requested */
    if (gfp_mask & __GFP_NORETRY)
        goto nopage;
 
    /*
     * Do not retry costly high order allocations unless they are
     * __GFP_RETRY_MAYFAIL
     */
    if (costly_order && !(gfp_mask & __GFP_RETRY_MAYFAIL))
        goto nopage;
 
    //是否应当再次进行内存回收
    if (should_reclaim_retry(gfp_mask, order, ac, alloc_flags,
                 did_some_progress > 0, &no_progress_loops))
        goto retry;
 
    //是否应该再次压缩
    if (did_some_progress > 0 &&
            should_compact_retry(ac, order, alloc_flags,
                compact_result, &compact_priority,
                &compaction_retries))
        goto retry;
 
 
 
    //在我们启动oom之前判断可能的条件竞争问题
    if (check_retry_cpuset(cpuset_mems_cookie, ac))
        goto retry_cpuset;
 
    //回收失败,开启oomkiller,杀死一些进程以获得内存
    page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
    if (page)
        goto got_pg;
 
    //避免nowatermark的内存无限使用
    if (tsk_is_oom_victim(current) &&
        (alloc_flags & ALLOC_OOM ||
         (gfp_mask & __GFP_NOMEMALLOC)))
        goto nopage;
 
    if (did_some_progress) {
        no_progress_loops = 0;
        goto retry;
    }
 
nopage:
 
    if (check_retry_cpuset(cpuset_mems_cookie, ac))
        goto retry_cpuset;
 
    //当设置了__GFP_NOFAIL时,多次尝试
    if (gfp_mask & __GFP_NOFAIL) {
        //当所有的NOFAIL的请求都被blocked掉时,警告用户此时应该使用NOWAIT
        if (WARN_ON_ONCE(!can_direct_reclaim))
            goto fail;
 
        WARN_ON_ONCE(current->flags & PF_MEMALLOC);
        WARN_ON_ONCE(order > PAGE_ALLOC_COSTLY_ORDER);
 
        /*
         通过让他们访问内存储备来帮助不失败的分配,但不要使用ALLOC_NO_WATERMARKS,因为这可能耗尽整个内存储备,使情况变得更糟
         */
        page = __alloc_pages_cpuset_fallback(gfp_mask, order, ALLOC_HARDER, ac);
        if (page)
            goto got_pg;
 
        cond_resched();
        goto retry;
    }
fail:
    warn_alloc(gfp_mask, ac->nodemask,
            "page allocation failure: order:%u", order);
got_pg:
    return page;
}

__free_pages

void __free_pages(struct page *page, unsigned int order)
{
  /* drop our reference; returns true if _refcount hit zero, i.e. nobody else
   * still uses this page block */
    if (put_page_testzero(page))
        free_the_page(page, order);
  /* for a non-compound high-order allocation only the first sub-page carries the
   * refcount; if someone still holds that first page we must keep it, but the
   * remaining 2^order - 1 sub-pages are ours to free, half a block at a time */
    else if (!PageHead(page))
        while (order-- > 0)
            free_the_page(page + (1 << order), order);
}

Inside free_the_page:

static inline void free_the_page(struct page *page, unsigned int order)
{
  /* order-0 pages go back to the per-cpu lists */
    if (order == 0)        /* Via pcp? */
        free_unref_page(page);
  /* everything else goes straight back to the buddy system */
    else
        __free_pages_ok(page, order, FPI_NONE);
}
/*
 * Free a 0-order page
 */
void free_unref_page(struct page *page)
{
    unsigned long flags;
  /* get the page frame number */
    unsigned long pfn = page_to_pfn(page);
    /* sanity checks before freeing */
    if (!free_unref_page_prepare(page, pfn))
        return;
 
    local_irq_save(flags);
    free_unref_page_commit(page, pfn);
    local_irq_restore(flags);
}

free_unref_page_commit

static void free_unref_page_commit(struct page *page, unsigned long pfn)
{
    struct zone *zone = page_zone(page);
    struct per_cpu_pages *pcp;
    int migratetype;
    //获取当前page的迁移类型
    migratetype = get_pcppage_migratetype(page);
    __count_vm_event(PGFREE);
 
    //percpu list只放入几种制定类型的page
    if (migratetype >= MIGRATE_PCPTYPES) {
        if (unlikely(is_migrate_isolate(migratetype))) {
      //free
            free_one_page(zone, page, pfn, 0, migratetype,
                      FPI_NONE);
            return;
        }
        migratetype = MIGRATE_MOVABLE;
    }
 
    pcp = &this_cpu_ptr(zone->pageset)->pcp;
 
  //将page用头插法放入pcp->lists[migratetype]链表头
    list_add(&page->lru, &pcp->lists[migratetype]);
    pcp->count++;
 
  //如果pcp中的page数量大于最大数量,则将多余的page放入伙伴系统
    if (pcp->count >= READ_ONCE(pcp->high))
        free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
}

free_one_page -->__free_one_page

static inline void __free_one_page(struct page *page,
        unsigned long pfn,
        struct zone *zone, unsigned int order,
        int migratetype, fpi_t fpi_flags)
{
    struct capture_control *capc = task_capc(zone);
    unsigned long buddy_pfn;
    unsigned long combined_pfn;
    unsigned int max_order;
    struct page *buddy;
    bool to_tail;
    //获取最大order-1
    max_order = min_t(unsigned int, MAX_ORDER - 1, pageblock_order);
 
    VM_BUG_ON(!zone_is_initialized(zone));
    VM_BUG_ON_PAGE(page->flags & PAGE_FLAGS_CHECK_AT_PREP, page);
 
    VM_BUG_ON(migratetype == -1);
    if (likely(!is_migrate_isolate(migratetype)))
        __mod_zone_freepage_state(zone, 1 << order, migratetype);
 
    VM_BUG_ON_PAGE(pfn & ((1 << order) - 1), page);
    VM_BUG_ON_PAGE(bad_range(zone, page), page);
 
continue_merging:
    // 循环扫描直到order == max_order-1
    // 处理合并问题
    while (order < max_order)
    {
        if (compaction_capture(capc, page, order, migratetype)) {
            __mod_zone_freepage_state(zone, -(1 << order),
                                migratetype);
            return;
        }
        //查找buddy page frame
        //page_pfn ^ (1 << order)
        buddy_pfn = __find_buddy_pfn(pfn, order);
        //获得对应的 struct page
        buddy = page + (buddy_pfn - pfn);
        //判断是否有效
        if (!pfn_valid_within(buddy_pfn))
            goto done_merging;
        /*
        检查当前的buddy page是否是free状态可合并的。
        主要满足以下条件:
        1.处于buddy system中
        2.有相同的order
        3.处于同一个zone
        */
        if (!page_is_buddy(page, buddy, order))
            goto done_merging;
        //如果满足free条件,或者是一个gaurd page,那么进行合并,合并后向上移动一个order。
        if (page_is_guard(buddy))
            clear_page_guard(zone, buddy, order, migratetype);
        else
            del_page_from_free_list(buddy, zone, order);
 
        //合并页,设置新的pfn
        combined_pfn = buddy_pfn & pfn;
        page = page + (combined_pfn - pfn);
        pfn = combined_pfn;
        order++;
    }
    if (order < MAX_ORDER - 1) {
        //防止隔离pageblock和正常pageblock上page的合并
        if (unlikely(has_isolate_pageblock(zone))) {
            int buddy_mt;
 
            buddy_pfn = __find_buddy_pfn(pfn, order);
            buddy = page + (buddy_pfn - pfn);
            buddy_mt = get_pageblock_migratetype(buddy);
 
            if (migratetype != buddy_mt
                    && (is_migrate_isolate(migratetype) ||
                        is_migrate_isolate(buddy_mt)))
                goto done_merging;
        }
        max_order = order + 1;
        goto continue_merging;
    }
 
done_merging:
    //设置阶,标记为伙伴系统的page
    set_buddy_order(page, order);
 
    if (fpi_flags & FPI_TO_TAIL)
        to_tail = true;
    else if (is_shuffle_order(order))    //is_shuffle_order,return false
        to_tail = shuffle_pick_tail();
    else
        //如果此时的page不是最大的page,那么检查是否buddy page是否是空的。 如果是的话,说明buddy page很可能正在被释放,而很快就要被合并起来。
        //在这种情况下,我们优先将page插入zone->free_area[order]的list的尾部,延缓page的使用,从而方便buddy被free掉后,两个页进行合并。
        to_tail = buddy_merge_likely(pfn, buddy_pfn, page, order);
 
    //插入尾部
    if (to_tail)
        add_to_free_list_tail(page, zone, order, migratetype);
    else
    //插入头部
        add_to_free_list(page, zone, order, migratetype);
 
    /* Notify page reporting subsystem of freed page */
    if (!(fpi_flags & FPI_SKIP_REPORT_NOTIFY))
        page_reporting_notify_free(order);
}

free_pcppages_bulk

static void free_pcppages_bulk(struct zone *zone, int count,
                    struct per_cpu_pages *pcp)
{
    int migratetype = 0;
    int batch_free = 0;
    int prefetch_nr = READ_ONCE(pcp->batch);
    bool isolated_pageblocks;
    struct page *page, *tmp;
    LIST_HEAD(head);
 
    //获取pcpulist中pages最大数量
    count = min(pcp->count, count);
 
    while (count)
    {
        struct list_head *list;
 
        /*
         * Remove pages from lists in a round-robin fashion. A
         * batch_free count is maintained that is incremented when an
         * empty list is encountered.  This is so more pages are freed
         * off fuller lists instead of spinning excessively around empty
         * lists
         */
        //batch_free(删除页数递增),遍历迁移列表
        do {
            batch_free++;
            if (++migratetype == MIGRATE_PCPTYPES)
                migratetype = 0;
            list = &pcp->lists[migratetype];
        } while (list_empty(list));
 
        //如果只有一个非空列表
        if (batch_free == MIGRATE_PCPTYPES)
            batch_free = count;
 
        do {
            //获取列表尾部的元素
            page = list_last_entry(list, struct page, lru);
            /* must delete to avoid corrupting pcp list */
            list_del(&page->lru);
            pcp->count--;
 
            if (bulkfree_pcp_prepare(page))
                continue;
            // 放入head列表中
            list_add_tail(&page->lru, &head);
 
            //对page的buddy页进行预取
            if (prefetch_nr) {
                prefetch_buddy(page);
                prefetch_nr--;
            }
        } while (--count && --batch_free && !list_empty(list));
    }
 
    spin_lock(&zone->lock);
    isolated_pageblocks = has_isolate_pageblock(zone);
 
    /*
     * Use safe version since after __free_one_page(),
     * page->lru.next will not point to original list.
     */
    list_for_each_entry_safe(page, tmp, &head, lru) {
        int mt = get_pcppage_migratetype(page);
        //MIGRATE_ISOLATE的page不可以被放入pcplist
        VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
        //迁移类型不可以是isolate?但是has_isolate_pageblock未实现。
        if (unlikely(isolated_pageblocks))
            mt = get_pageblock_migratetype(page);
        //调用__free_one_page放入伙伴系统
        __free_one_page(page, page_to_pfn(page), zone, 0, mt, FPI_NONE);
        trace_mm_page_pcpu_drain(page, 0, mt);
    }
    spin_unlock(&zone->lock);
}

The SLAB/SLUB Allocator

[Figure: SLUB allocator structure overview]

Key Structures

kmem_cache

/*
 * Slab cache management.
 */
struct kmem_cache {
    struct kmem_cache_cpu __percpu *cpu_slab;    /* per-cpu cache */
    /* Used for retrieving partial slabs, etc. */
    slab_flags_t flags;
    unsigned long min_partial;    /* minimum number of partial slabs to keep per node */
    unsigned int size;    /* The size of an object including metadata - what each allocated chunk really takes */
    unsigned int object_size;/* The size of an object without metadata */
    struct reciprocal_value reciprocal_size;
    unsigned int offset;    /* Free pointer offset - where, inside an object, the pointer to the next free object lives */
#ifdef CONFIG_SLUB_CPU_PARTIAL
    /* Number of per cpu partial objects to keep around */
    unsigned int cpu_partial;        /* limit for the per-cpu partial list; slabs beyond it overflow to the kmem_cache_node partial list */
#endif
    struct kmem_cache_order_objects oo;/* pages per slab (high 16 bits) and objects per slab (low 16 bits) */
 
    /* Allocation and freeing of slabs */
    struct kmem_cache_order_objects max;    /* maximum allocation */
    struct kmem_cache_order_objects min;    /* minimum allocation */
    gfp_t allocflags;    /* GFP mask passed down to the buddy system */
    int refcount;        /* Refcount for slab cache destroy */
    void (*ctor)(void *);
    unsigned int inuse;        /* Offset to metadata */
    unsigned int align;        /* Alignment */
    unsigned int red_left_pad;    /* Left redzone padding size */
    const char *name;    /* name shown through sysfs */
    struct list_head list;    /* list of all slab caches */
#ifdef CONFIG_SYSFS
    struct kobject kobj;    /* for sysfs */
#endif
#ifdef CONFIG_SLAB_FREELIST_HARDENED
    unsigned long random;
#endif
 
#ifdef CONFIG_NUMA
    /*
     * Defragmentation by allocating from a remote node.
     */
    unsigned int remote_node_defrag_ratio;
#endif
 
#ifdef CONFIG_SLAB_FREELIST_RANDOM
    unsigned int *random_seq;
#endif
 
#ifdef CONFIG_KASAN
    struct kasan_cache kasan_info;
#endif
 
    unsigned int useroffset;    /* Usercopy region offset */
    unsigned int usersize;        /* Usercopy region size */
 
    struct kmem_cache_node *node[MAX_NUMNODES];        /* per-node slab lists */
};

kmem_cache_cpu

struct kmem_cache_cpu {
    void **freelist;    /* Pointer to next available object */
    unsigned long tid;    /* unique transaction id of this CPU */
    struct page *page;    /* the slab we are currently allocating from */
#ifdef CONFIG_SLUB_CPU_PARTIAL
    struct page *partial;    /* per-cpu list of partially filled slabs (slabs that still have free objects) */
#endif
#ifdef CONFIG_SLUB_STATS
    unsigned stat[NR_SLUB_STAT_ITEMS];
#endif
};

kmem_cache_node

struct kmem_cache_node {
    spinlock_t list_lock;        /* protects the lists below */
#ifdef CONFIG_SLAB
    ......
#endif
 
#ifdef CONFIG_SLUB
    unsigned long nr_partial;        /* number of slabs on the partial list */
    struct list_head partial;        /* this node's partial list */
#ifdef CONFIG_SLUB_DEBUG
    atomic_long_t nr_slabs;
    atomic_long_t total_objects;
    struct list_head full;
#endif
#endif
 
};

A clearer view of the three-level structure:

 

[Figure: kmem_cache / kmem_cache_cpu / kmem_cache_node three-level structure]

The freelist-hardening mitigation (CONFIG_SLAB_FREELIST_HARDENED)

static inline void *freelist_ptr(const struct kmem_cache *s, void *ptr,
                 unsigned long ptr_addr)
{
#ifdef CONFIG_SLAB_FREELIST_HARDENED
    /*
     * When CONFIG_KASAN_SW/HW_TAGS is enabled, ptr_addr might be tagged.
     * Normally, this doesn't cause any issues, as both set_freepointer()
     * and get_freepointer() are called with a pointer with the same tag.
     * However, there are some issues with CONFIG_SLUB_DEBUG code. For
     * example, when __free_slub() iterates over objects in a cache, it
     * passes untagged pointers to check_object(). check_object() in turns
     * calls get_freepointer() with an untagged pointer, which causes the
     * freepointer to be restored incorrectly.
     */
    return (void *)((unsigned long)ptr ^ s->random ^
            swab((unsigned long)kasan_reset_tag((void *)ptr_addr)));
#else
    return ptr;
#endif
}

In some kernel challenges, when CONFIG_SLAB_FREELIST_HARDENED is enabled, freelist_ptr() is what decodes (and encodes) an object's obfuscated next pointer.
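A user-space model of that obfuscation makes the scheme (and its weakness) obvious; the constants below are invented stand-ins for s->random and real heap addresses:

```c
#include <stdio.h>
#include <stdint.h>
#include <byteswap.h>

static uint64_t cache_random = 0xdeadbeefcafebabeULL;   /* stand-in for s->random */

/* stored = next ^ s->random ^ swab(address of the slot holding the pointer) */
static uint64_t obfuscate(uint64_t next, uint64_t ptr_addr)
{
    return next ^ cache_random ^ bswap_64(ptr_addr);     /* swab() is a byte swap */
}

int main(void)
{
    uint64_t next     = 0xffff888000123400ULL;   /* "real" next-free-object pointer */
    uint64_t ptr_addr = 0xffff888000123500ULL;   /* where that pointer is stored */

    uint64_t stored    = obfuscate(next, ptr_addr);
    uint64_t recovered = obfuscate(stored, ptr_addr);    /* XOR twice = identity */

    printf("stored    = %#lx\n", (unsigned long)stored);
    printf("recovered = %#lx\n", (unsigned long)recovered);

    /* an attacker who leaks next, stored and ptr_addr recovers s->random: */
    printf("leaked random = %#lx\n",
           (unsigned long)(stored ^ next ^ bswap_64(ptr_addr)));
    return 0;
}
```

Knowing any (plaintext, stored, address) triple leaks the per-cache secret, which is why a heap info-leak often defeats this hardening in kernel exploitation.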

kmem_cache_alloc

Quoting:

 

Creating a new slab really just means requesting pages of the appropriate order and packing the right number of objects into them. Note how the order and the object count determine one another: both are stored in the kmem_cache member kmem_cache_order_objects, with the object count in the low 16 bits and the order in the high bits. Their relationship is simply ((PAGE_SIZE << order) - reserved) / size.
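For example, a 256-byte object at order 0 gives (4096 << 0) / 256 = 16 objects per slab. From a consumer's point of view the API looks roughly like this (a module sketch with invented names):

```c
#include <linux/module.h>
#include <linux/slab.h>

struct demo_obj {
    char buf[256];
};

static struct kmem_cache *demo_cachep;

static int __init slub_demo_init(void)
{
    struct demo_obj *obj;

    demo_cachep = kmem_cache_create("demo_obj_cache", sizeof(struct demo_obj),
                                    0, SLAB_HWCACHE_ALIGN, NULL);
    if (!demo_cachep)
        return -ENOMEM;

    obj = kmem_cache_alloc(demo_cachep, GFP_KERNEL);  /* fast path: per-cpu freelist */
    if (obj) {
        pr_info("object at %px\n", obj);
        kmem_cache_free(demo_cachep, obj);            /* back onto the freelist */
    }
    return 0;
}

static void __exit slub_demo_exit(void)
{
    kmem_cache_destroy(demo_cachep);
}

module_init(slub_demo_init);
module_exit(slub_demo_exit);
MODULE_LICENSE("GPL");
```

Now let's follow what kmem_cache_alloc actually does: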

kmem_cache_alloc(struct kmem_cache *s, gfp_t gfpflags)
    slab_alloc(s, gfpflags, _RET_IP_, s->object_size)
      slab_alloc_node(s, gfpflags, NUMA_NO_NODE, addr, orig_size)

which eventually reaches slab_alloc_node (the fast path):

static __always_inline void *slab_alloc_node(struct kmem_cache *s,
        gfp_t gfpflags, int node, unsigned long addr, size_t orig_size)
{
    void *object;
    struct kmem_cache_cpu *c;
    struct page *page;
    unsigned long tid;
    struct obj_cgroup *objcg = NULL;
    bool init = false;
 
    //hook函数,预处理
    s = slab_pre_alloc_hook(s, &objcg, 1, gfpflags);
    if (!s)
        return NULL;
 
    //kfence_alloc未实现,return NULL
    object = kfence_alloc(s, orig_size, gfpflags);
    if (unlikely(object))
        goto out;
 
redo:
    /*
    1.开启抢占
    2.通过cpu ptr读取kmem_cache cpu数据
    3.在读取时可以在CPU之间切换,只要最终回到原来的CPU上
    4.保证tid和kmem_cache在cpu上是一致性的,如果开启了CONFIG_PREEMPTION,那么可能不同,所以需要check
    */
    do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPTION) && unlikely(tid != READ_ONCE(c->tid)));
 
    //内存屏障,保证访问顺序。通过一个 cache 一致性协议来避免数据不一致的问题,防止由于优化被影响。
    //此时c是当前的kmem_cache.cpu_slab
    barrier();
    object = c->freelist;    //    kmem_cache_cpu->freelist,相当于freelist第一项
    page = c->page;            //    kmem_cache_cpu->page
 
    //如果当前CPU的空闲list和分配的list中的page都为空;或者node和page->node不匹配
    //调用__slab_alloc走慢分配
    if (unlikely(!object || !page || !node_match(page, node))) {
        object = __slab_alloc(s, gfpflags, node, addr, c);
    }
    else {
        //获得当前obj的下一个obj地址。
        void *next_object = get_freepointer_safe(s, object);
 
        /*
        原子化的如下操作:
        if(s->cpu_slab->freelist == object && s->cpu_slab->tid == tid){
            s->cpu_slab->freelist = next_object; 相当于freelist链上取下来了first obj,然后把next obj挂上去
            s->cpu_slab->tid = next_tid(tid);
            return 1;
        }else{return 0;}
        */
        if (unlikely(!this_cpu_cmpxchg_double(
                s->cpu_slab->freelist, s->cpu_slab->tid,
                object, tid,
                next_object, next_tid(tid))))
        {
 
            note_cmpxchg_failure("slab_alloc", s, tid);
            goto redo;
        }
        //gcc,数据预取
        prefetch_freepointer(s, next_object);
        //记录状态
        stat(s, ALLOC_FASTPATH);
    }
    //擦除freeptr指针
    maybe_wipe_obj_freeptr(s, object);
    init = slab_want_init_on_alloc(gfpflags, s);
 
out:
    slab_post_alloc_hook(s, objcg, gfpflags, 1, &object, init);
 
    return object;
}

get_freepointer_safe behaves as follows:

static inline void *get_freepointer_safe(struct kmem_cache *s, void *object)
{
    unsigned long freepointer_addr;
    void *p;
 
    if (!debug_pagealloc_enabled_static())
    /* common case: read the free pointer directly (decoding it if hardened) */
        return get_freepointer(s, object);
 
    /* kasan_reset_tag() simply returns the untagged object pointer */
    object = kasan_reset_tag(object);
    /* object address + s->offset gives the location, inside the object, of the
     * pointer to the next entry on the freelist */
    freepointer_addr = (unsigned long)object + s->offset;
    /* safely read the next pointer from freepointer_addr into p */
    copy_from_kernel_nofault(&p, (void **)freepointer_addr, sizeof(p));
    /* with CONFIG_SLAB_FREELIST_HARDENED this decodes the obfuscated next pointer */
    return freelist_ptr(s, p, freepointer_addr);
}

__slab_alloc: the slow path

 

slab_alloc_node -> __slab_alloc -> ___slab_alloc

/* if the current CPU's freelist is empty, there is no per-cpu slab page, or the
 * page's node does not match the requested node: take the slow path */
if (unlikely(!object || !page || !node_match(page, node))) {
    object = __slab_alloc(s, gfpflags, node, addr, c);
}
static void *___slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
              unsigned long addr, struct kmem_cache_cpu *c)
{
    /*
    Preconditions for entering the slow path:
    1. the lockless per-cpu freelist is empty, or
    2. additional processing (e.g. debug checks) has to be performed
    */
    void *freelist;
    struct page *page;
 
    stat(s, ALLOC_SLOWPATH);
 
    page = c->page;//kmem_cache_cpu->page为空代表没有可用的slab
    if (!page) {
        //如果node不在线或者没有正常的内存,忽略node的约束限制
        if (unlikely(node != NUMA_NO_NODE &&
                 !node_isset(node, slab_nodes)))
            node = NUMA_NO_NODE;
        goto new_slab;
    }
redo:
    //page->node与node是否相等?
    if (unlikely(!node_match(page, node))) {
        //如果不相等
        if (!node_isset(node, slab_nodes)) {
            node = NUMA_NO_NODE;
            goto redo;
        } else {
            //记录状态为ALLOC_NODE_MISMATCH
            stat(s, ALLOC_NODE_MISMATCH);
            //把一个slab移除slab_cache
            deactivate_slab(s, page, c->freelist, c);
            goto new_slab;
        }
    }
 
 
    //如果当前的page是PF_MEMALLOC,调用deactivate_slab
    /*
    当前进程有很多可以释放的内存,如果能分配一点紧急内存给当前进程,那么当前进程可以返回更多的内存给系统。
    非内存管理子系统不应该使用这个标记,除非这次分配保证会释放更大的内存给系统。如果每个子系统都滥用这个标记,
    可能会耗尽内存管理子系统的保留内存。
    */
    if (unlikely(!pfmemalloc_match(page, gfpflags))) {
        deactivate_slab(s, page, c->freelist, c);
        goto new_slab;
    }
 
 
    //再检查一下freelist,防止由于cpu迁移或者中断导致freelist非空
    freelist = c->freelist;
    if (freelist)
        goto load_freelist;
 
    //获取 (struct kmem_cache_cpu *c)->freelist
    freelist = get_freelist(s, page);
 
    if (!freelist) {
        c->page = NULL;
        stat(s, DEACTIVATE_BYPASS);
        goto new_slab;
    }
 
    stat(s, ALLOC_REFILL);
 
load_freelist:
    //c->page指向的是这些已经被分配obj所在的页面,应该被cpu冻结,以保证分配发挥作用
    VM_BUG_ON(!c->page->frozen);
    //更新kmem_cache_cpu对应的指针
    c->freelist = get_freepointer(s, freelist);
    c->tid = next_tid(c->tid);
    return freelist;
 
new_slab:
    //判断我们的kmem_cache_cpu中是否存在半满的partial slab,即只有部分空间被使用的page
    if (slub_percpu_partial(c)) {
        page = c->page = slub_percpu_partial(c);//如果存在,将partial链表中的page拿来分配obj
        slub_set_percpu_partial(c, page);        //更新partial链表
        stat(s, CPU_PARTIAL_ALLOC);                //记录为parital分配
        goto redo;
    }
    /*
    new_slab_objects:
    1.首先尝试从kmem_cache_node的partial链表中分配page
    2.使用new_slab底层调用buddy system,分配page
    */
    freelist = new_slab_objects(s, gfpflags, node, &c);
    if (unlikely(!freelist)) {
        slab_out_of_memory(s, gfpflags, node);
        return NULL;
    }
 
    page = c->page;
    if (likely(!kmem_cache_debug(s) && pfmemalloc_match(page, gfpflags)))
        goto load_freelist;
 
    /* Only entered in the debug case */
    if (kmem_cache_debug(s) &&
            !alloc_debug_processing(s, page, freelist, addr))
        goto new_slab;    /* Slab failed checks. Next slab needed */
 
    deactivate_slab(s, page, get_freepointer(s, freelist), c);
    return freelist;
}

deactivate_slab

 

This function mainly puts a slab back onto its node's lists:

  1. First count how many objects are left on the cpu freelist.
  2. Unfreeze the page and release every object on the per-cpu freelist back onto page->freelist.
  3. Depending on the resulting state of the slab, move the page onto the appropriate list and update its state: frozen means the slab belongs to a cpu slab (kmem_cache_cpu), unfrozen means it sits on the partial or full list; decide accordingly whether the slab should be freed.

kmem_cache_free

void kmem_cache_free(struct kmem_cache *s, void *x)
{
  /*
  cache_from_obj:
  功能:定位object所在的 kmem_cache
  过程:
  1.如果没有开启CONFIG_SLAB_FREELIST_HARDENED && SLAB_CONSISTENCY_CHECKS,直接返回用户选择的kmem_cache
  2.否则判定用户传入的不可信,通过以下路径:
      -> virt_to_cache(x)
          -> virt_to_head_page(obj=x)
              -> virt_to_head_page(obj),经由对象地址获得其页面page管理结构
 
      在获取了obj对应的page之后返回page->slab_cache即对应的更加准确的struct kmem_cache
      然后判断我们获取的跟用户传入的是否相等。
  3.最终定位定位object所在的 kmem_cache,返回
  */
    s = cache_from_obj(s, x);
  //判断obj对应的kmem_cache是否获取成功
    if (!s)
        return;
  /*
  slab_free
      ->slab_free_freelist_hook,如果开启了harened,这里做了指针加密
      ->do_slab_free
  */
    slab_free(s, virt_to_head_page(x), x, NULL, 1, _RET_IP_);
    trace_kmem_cache_free(_RET_IP_, x, s->name);
}

do_slab_free (fast path)

static __always_inline void do_slab_free(struct kmem_cache *s,
                struct page *page, void *head, void *tail,
                int cnt, unsigned long addr)
{
    void *tail_obj = tail ? : head;
    struct kmem_cache_cpu *c;
    unsigned long tid;
 
    memcg_slab_free_hook(s, &head, 1);
redo:
    //保证tid的同步
    do {
        tid = this_cpu_read(s->cpu_slab->tid);
        c = raw_cpu_ptr(s->cpu_slab);
    } while (IS_ENABLED(CONFIG_PREEMPTION) && unlikely(tid != READ_ONCE(c->tid)));
 
    //内存屏障,防止优化导致的问题
    barrier();
    //判断我们待时放的obj所属的page是否是kmem_cache_cpu的page
    if (likely(page == c->page))
    {
        void **freelist = READ_ONCE(c->freelist);
        //tail_obj是待插入的obj,这里将freelist放到obj对应的指针域
        // *(obj+offset) = freelist
        set_freepointer(s, tail_obj, freelist);
        /*
 
        原子操作:
        验证成功后:
        s->cpu_slab->freelist = head(此时就是待插入的obj);
        s->cpu_slab->tid = next_tid(tid);
 
        **********************************************
        当这一步运行结束时。free obj已经被成功插入了freelist
        **********************************************
 
        */
        if (unlikely(!this_cpu_cmpxchg_double(
                s->cpu_slab->freelist, s->cpu_slab->tid,
                freelist, tid,
                head, next_tid(tid)))) {
 
            note_cmpxchg_failure("slab_free", s, tid);
            goto redo;
        }
        stat(s, FREE_FASTPATH);
    }
    //如果不是,则进入__slab_free走慢释放
    else
        __slab_free(s, page, head, tail_obj, cnt, addr);
 
}

__slab_free (slow path)

Call path: do_slab_free -> __slab_free

static void __slab_free(struct kmem_cache *s, struct page *page,
            void *head, void *tail, int cnt,
            unsigned long addr)
 
{
    void *prior;
    int was_frozen;
    struct page new;
    unsigned long counters;
    struct kmem_cache_node *n = NULL;
    unsigned long flags;
 
    stat(s, FREE_SLOWPATH);
 
    //return false,未实现
    if (kfence_free(head))
        return;
 
    if (kmem_cache_debug(s) &&
        !free_debug_processing(s, page, head, tail, cnt, addr))
        return;
 
    do {
        //释放free_debug_processing设置的自旋锁
        if (unlikely(n)) {
            spin_unlock_irqrestore(&n->list_lock, flags);
            n = NULL;
        }
 
        prior = page->freelist;
        counters = page->counters;
 
        //tail此时是待插入的obj,设置obj的freepointer
        set_freepointer(s, tail, prior);
 
        new.counters = counters;
        was_frozen = new.frozen;
        new.inuse -= cnt;    //根据释放了多少个obj更新inuse
 
        /*
        如果当前的page没有正在被使用的obj 或者 没有可以被使用的free obj
        并且
        不处于frozen状态,即不属于某一个CPU
        的slab cache。
        那么
        */
        if ((!new.inuse || !prior) && !was_frozen) {
            //如果当前kmem_cache存在cpu_slab的partial链表,且没有可以使用的空闲obj(freelist为空);则标记page被冻结(属于cpu slab)
            //并且后续准备放入cpu_slab->partial
            if (kmem_cache_has_cpu_partial(s) && !prior) {
                new.frozen = 1;
 
            } else { /* Needs to be taken off a list */
                //获取node,加锁
                n = get_node(s, page_to_nid(page));
                spin_lock_irqsave(&n->list_lock, flags);
 
            }
        }
    /*
    page->freelist = head
    page->counters = new.counters
    */
    } while (!cmpxchg_double_slab(s, page,
        prior, counters,
        head, new.counters,
        "__slab_free"));
 
    /*
     n为空的可能性较大,即当前释放的对象是slab中的最后一个对象
     的可能性较小。其他的可能情况为:
     1. slab已满,并且slab不属于某个CPU
     2. slab已经属于某个CPU
     3. 无论slab是否属于某个CPU,slab的freelist不为空,且inuse
     字段不为0
     */
    if (likely(!n)) {
        //如果page被冻结,那么只更新FREE_FROZEN信息
        //此时说明slab已经属于其他CPU的slab cache,而当前的cpu不是冻结slab的cpu
        if (likely(was_frozen)) {
            stat(s, FREE_FROZEN);
        }
        //对于刚刚更新的frozen操作,此时cpu与冻结的操作是一致的,将page添加到当前CPU的slab cache的partial链表中
        else if (new.frozen) {
            put_cpu_partial(s, page, 1);
            stat(s, CPU_PARTIAL_FREE);
        }
 
        return;
    }
 
    //如果释放当前obj之后,slab为空,并且partial中的半满page数量高于最小值,进行slab的直接释放
    if (unlikely(!new.inuse && n->nr_partial >= s->min_partial))
        goto slab_empty;
 
 
    //释放当前obj后,仍有obj在slab中,slab为半满
    if (!kmem_cache_has_cpu_partial(s) && unlikely(!prior)) {
        //将obj从fulllist中删除,插入partial list尾部
        remove_full(s, n, page);
        add_partial(n, page, DEACTIVATE_TO_TAIL);
        stat(s, FREE_ADD_PARTIAL);
    }
    //解锁
    spin_unlock_irqrestore(&n->list_lock, flags);
    return;
 
slab_empty:
    if (prior) {
        //如果当前page存在可用的obj,那么说明当前page在partial中,所以在partial list中删除page
        remove_partial(n, page);
        stat(s, FREE_REMOVE_PARTIAL);
    } else {
        //从full list中删除
        remove_full(s, n, page);
    }
 
    spin_unlock_irqrestore(&n->list_lock, flags);
    stat(s, FREE_SLAB);
    /* 释放slab */
    discard_slab(s, page);
}

discard_slab

 

Release path: discard_slab -> free_slab -> __free_slab -> __free_pages

static void __free_slab(struct kmem_cache *s, struct page *page)
{
    //获取page的order
    int order = compound_order(page);
    int pages = 1 << order;
 
    if (kmem_cache_debug_flags(s, SLAB_CONSISTENCY_CHECKS)) {
        void *p;
 
        //安全检查
        slab_pad_check(s, page);
        //对于每个obj,调用check_object进行检查
        for_each_object(p, s, page_address(page),
                        page->objects)
            check_object(s, page, p, SLUB_RED_INACTIVE);
    }
    //清除page标志位
    __ClearPageSlabPfmemalloc(page);
    __ClearPageSlab(page);
 
    page->slab_cache = NULL;
 
    //更新当前进程的内存回收状态
    if (current->reclaim_state)
        current->reclaim_state->reclaimed_slab += pages;
    //更新系统状态
    unaccount_slab_page(page, order, s);
    //调用buddy system,释放page
    __free_pages(page, order);
}

check_object and SLUB_DEBUG

SLUB DEBUG can detect problems such as out-of-bounds accesses and use-after-free on slab memory.

 

How to enable it:

CONFIG_SLUB=y
 
CONFIG_SLUB_DEBUG=y
 
CONFIG_SLUB_DEBUG_ON=y

Recommended reading:

 

Linux内核slab内存的越界检查——SLUB_DEBUG

static int check_object(struct kmem_cache *s, struct page *page,
                    void *object, u8 val)
{
    u8 *p = object;
    u8 *endobject = object + s->object_size;
 
    //如果存在RED_ZONE,则进行检测
    if (s->flags & SLAB_RED_ZONE) {
        // 检测left red zone(用于检测是否左边发生了OOB)
        if (!check_bytes_and_report(s, page, object, "Left Redzone",
            object - s->red_left_pad, val, s->red_left_pad))
            return 0;
        // 检测RED ZONE(当前的obj是否有OOB)
        if (!check_bytes_and_report(s, page, object, "Right Redzone",
            endobject, val, s->inuse - s->object_size))
            return 0;
    } else {
        if ((s->flags & SLAB_POISON) && s->object_size < s->inuse) {
            //检测padding区域
            check_bytes_and_report(s, page, p, "Alignment padding",
                endobject, POISON_INUSE,
                s->inuse - s->object_size);
        }
    }
 
    if (s->flags & SLAB_POISON) {
        if (val != SLUB_RED_ACTIVE && (s->flags & __OBJECT_POISON) &&
            //检测obj是否已经被free(obj最后一个byte是不是0xa5
            (!check_bytes_and_report(s, page, p, "Poison", p,
                    POISON_FREE, s->object_size - 1) ||
             !check_bytes_and_report(s, page, p, "End Poison",
                p + s->object_size - 1, POISON_END, 1)))
            return 0;
        check_pad_bytes(s, page, p);
    }
    //检测overlap
    if (!freeptr_outside_object(s) && val == SLUB_RED_ACTIVE)   
        return 1;
 
    //检测freelist有效性
    if (!check_valid_pointer(s, page, get_freepointer(s, p))) {
        object_err(s, page, p, "Freepointer corrupt");
        //如果无效,则抛弃剩余的obj
        set_freepointer(s, p, NULL);
        return 0;
    }
    return 1;
}
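check_bytes_and_report() is essentially a loop that verifies every byte of a guard region still holds its poison pattern and reports the first mismatch. The userspace sketch below mimics that red-zone technique; the constants and layout are illustrative (the kernel's real poison values are defined in include/linux/poison.h), not kernel code.

/*
 * Userspace illustration of the red-zone technique used by SLUB debug:
 * guard bytes around an object are filled with a known pattern, and a
 * check walks them to detect out-of-bounds writes.
 */
#include <stdio.h>
#include <string.h>

#define REDZONE_SIZE 8
#define OBJ_SIZE     32
#define RED_PATTERN  0xcc            /* illustrative poison byte */

static unsigned char slot[REDZONE_SIZE + OBJ_SIZE + REDZONE_SIZE];

static int check_redzone(const unsigned char *zone, size_t len, const char *what)
{
    for (size_t i = 0; i < len; i++) {
        if (zone[i] != RED_PATTERN) {
            printf("%s overwritten at offset %zu (0x%02x)\n", what, i, zone[i]);
            return 0;
        }
    }
    return 1;
}

int main(void)
{
    unsigned char *object = slot + REDZONE_SIZE;

    memset(slot, RED_PATTERN, sizeof(slot));   /* poison both red zones   */
    memset(object, 0, OBJ_SIZE);               /* "allocate" the object   */

    object[OBJ_SIZE] = 0x41;                   /* simulate a 1-byte OOB   */

    check_redzone(slot, REDZONE_SIZE, "Left Redzone");
    check_redzone(object + OBJ_SIZE, REDZONE_SIZE, "Right Redzone");
    return 0;
}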

Processes

For every user-space process, the address space at run time is made up of different segments; each segment has its own attributes (executable, readable, and so on), and adjacent segments are not necessarily contiguous. The kernel's per-process vma structures are what maintain these run-time segments.

 

For the task_struct of each process:

task_struct -> mm_struct -> struct vm_area_struct *mmap;

Image from the WeChat public account LoyenWang

 

haZSr6.png
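Each vm_area_struct of a process corresponds to one line of /proc/<pid>/maps (start-end, permissions, file offset, backing file), so dumping that file is the quickest way to inspect this vma list from user space; a minimal example:

/* Print this process's VMAs; each line of /proc/self/maps is one vma. */
#include <stdio.h>

int main(void)
{
    char line[512];
    FILE *f = fopen("/proc/self/maps", "r");

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);      /* start-end perms offset dev inode path */
    fclose(f);
    return 0;
}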

 

vm_area_struct

struct vm_area_struct {
  //当前vma在进程中对应内存的起始和结束地址
    unsigned long vm_start;       
    unsigned long vm_end;       
 
    //vm双链表指针
    struct vm_area_struct *vm_next, *vm_prev;
    //红黑树节点
    struct rb_node vm_rb;
    //记录此vma和上一个vma之间的空闲大小(双链表或红黑树)
    unsigned long rb_subtree_gap;
    struct mm_struct *vm_mm;    //指向vma属于的mm_struct
    pgprot_t vm_page_prot;        //vma访问权限
    unsigned long vm_flags;        //标志位
    /*
     * For areas with an address space and backing store,
     * linkage into the address_space->i_mmap interval tree.
     */
    struct {
        struct rb_node rb;
        unsigned long rb_subtree_last;
    } shared;
 
    /*
     * A file's MAP_PRIVATE vma can be in both i_mmap tree and anon_vma
     * list, after a COW of one of the file pages.    A MAP_SHARED vma
     * can only be in the i_mmap tree.  An anonymous MAP_PRIVATE, stack
     * or brk vma (with NULL file) can only be in an anon_vma list.
     */
  //一个文件pages被COW之后,MAP_PRIVATE vma可以在i_mmap与anon_vma之中
  //MAP_SHARED vma只能在i_mmap树中
  //匿名的MAP_PRIVATE,栈或者brk的vma只能在anon_vma链表中
    struct list_head anon_vma_chain;
    struct anon_vma *anon_vma;   
    const struct vm_operations_struct *vm_ops;//虚表指针
    unsigned long vm_pgoff;        //以page为单位的文件映射偏移量
    struct file * vm_file;        //映射了哪个文件
    void * vm_private_data;        /* was vm_pte (shared mem) */
    ......
} __randomize_layout;

find_vma

//通过addr查找vma
struct vm_area_struct *find_vma(struct mm_struct *mm, unsigned long addr)
{
    struct rb_node *rb_node;
    struct vm_area_struct *vma;
 
    //首先从vma cache中寻找
    vma = vmacache_find(mm, addr);
    if (likely(vma))
        return vma;
    //红黑树跟节点
    rb_node = mm->mm_rb.rb_node;
    //遍历红黑树上的vma
    while (rb_node) {
        struct vm_area_struct *tmp;
        //当前节点
        tmp = rb_entry(rb_node, struct vm_area_struct, vm_rb);
 
        if (tmp->vm_end > addr) {
            vma = tmp;
            if (tmp->vm_start <= addr)
                break;
            rb_node = rb_node->rb_left;
        } else
            rb_node = rb_node->rb_right;
    }
    //如果成功找到,更新vma cache
    if (vma)
        vmacache_update(addr, vma);
    return vma;
}
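One detail worth stressing: find_vma returns the first vma whose vm_end is greater than addr, and that vma does not necessarily contain addr, so callers still have to check vma->vm_start <= addr (do_user_addr_fault does exactly that). The sketch below reproduces the same lookup over a plain sorted array, purely as an illustration of the contract:

/*
 * Sketch of find_vma()'s contract: return the first region whose end is
 * above addr, which may or may not actually contain addr.
 */
#include <stdio.h>

struct region { unsigned long start, end; };

/* regions sorted by address, like the vma rbtree flattened out */
static struct region regions[] = {
    { 0x1000, 0x2000 },
    { 0x5000, 0x8000 },
    { 0x9000, 0xa000 },
};

static struct region *find_region(unsigned long addr)
{
    for (unsigned i = 0; i < sizeof(regions) / sizeof(regions[0]); i++)
        if (regions[i].end > addr)
            return &regions[i];
    return NULL;
}

int main(void)
{
    unsigned long addr = 0x3000;              /* falls in a hole */
    struct region *r = find_region(addr);

    if (r && r->start <= addr)
        printf("addr 0x%lx is inside [0x%lx, 0x%lx)\n", addr, r->start, r->end);
    else if (r)
        printf("addr 0x%lx lies below [0x%lx, 0x%lx) -> bad_area / stack-growth check\n",
               addr, r->start, r->end);
    else
        printf("addr 0x%lx is above every region\n", addr);
    return 0;
}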

vmacache_find

struct vm_area_struct *vmacache_find(struct mm_struct *mm, unsigned long addr)
{
    //    addr右移,定位idx
    int idx = VMACACHE_HASH(addr);
    int i;
    // 记录一次VMACACHE_FIND_CALLS事件
    count_vm_vmacache_event(VMACACHE_FIND_CALLS);
    //mm是否是当前进程的mm_struct
    if (!vmacache_valid(mm))
        return NULL;
    //从idx位置开始扫描查找vma
    for (i = 0; i < VMACACHE_SIZE; i++) {
        struct vm_area_struct *vma = current->vmacache.vmas[idx];
        if (vma) {
#ifdef CONFIG_DEBUG_VM_VMACACHE
            if (WARN_ON_ONCE(vma->vm_mm != mm))
                break;
#endif
            //判断是否查找成功,如果成功,记录事件VMACACHE_FIND_HITS
            if (vma->vm_start <= addr && vma->vm_end > addr) {
                count_vm_vmacache_event(VMACACHE_FIND_HITS);
                return vma;
            }
        }
        //如果已经扫描了[idx,VMACACHE_SIZE],那么idx归0,从起点开始扫描
        if (++idx == VMACACHE_SIZE)
            idx = 0;
    }
 
    return NULL;
}

insert_vm_struct

int insert_vm_struct(struct mm_struct *mm, struct vm_area_struct *vma)
{
    struct vm_area_struct *prev;
    struct rb_node **rb_link, *rb_parent;
 
    //定位vma的插入位置(双链表与红黑树)
    if (find_vma_links(mm, vma->vm_start, vma->vm_end,
               &prev, &rb_link, &rb_parent))
        return -ENOMEM;
 
    if ((vma->vm_flags & VM_ACCOUNT) &&
         security_vm_enough_memory_mm(mm, vma_pages(vma)))
        return -ENOMEM;
    //如果是匿名的vma那么需要设置vm_pgoff
    if (vma_is_anonymous(vma)) {
        BUG_ON(vma->anon_vma);
        vma->vm_pgoff = vma->vm_start >> PAGE_SHIFT;
    }
    //vma插入红黑树与双链表中
    vma_link(mm, vma, prev, rb_link, rb_parent);
    return 0;
}

Page Fault

When does a page fault occur? (A small userspace demo follows this list.)

  • The page table contains no PTE for the faulting address

    • Invalid address: the kernel looks up the vma by addr; if none is found the address is invalid (segmentation fault for a user address, oops/panic for a kernel one)
    • Valid address, but the page is not currently resident in main memory
      • First access: demand paging brings the page in
      • The page's present bit is 0: it was swapped out and has to be read back from backing storage
      • Access-semantics conflict during COW, e.g. the PTE is read-only but a write is performed, which triggers the COW mechanism and the write lands in the copied page
  • The PTE for the virtual address denies the access
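From user space, the two common outcomes of such a fault show up in the siginfo delivered with SIGSEGV: SEGV_MAPERR when no vma maps the address at all, and SEGV_ACCERR when a vma exists but the access violates its permissions. A small demo (run it with the argument "accerr" to see the second case):

/*
 * Demonstrate the two common user-visible page-fault outcomes:
 * SEGV_MAPERR (unmapped address) vs SEGV_ACCERR (permission violation).
 */
#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

static void handler(int sig, siginfo_t *info, void *ctx)
{
    (void)sig; (void)ctx;
    printf("SIGSEGV at %p, si_code=%s\n", info->si_addr,
           info->si_code == SEGV_MAPERR ? "SEGV_MAPERR" :
           info->si_code == SEGV_ACCERR ? "SEGV_ACCERR" : "other");
    _exit(0);
}

int main(int argc, char **argv)
{
    struct sigaction sa = { .sa_sigaction = handler, .sa_flags = SA_SIGINFO };

    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    if (argc > 1 && !strcmp(argv[1], "accerr")) {
        /* mapped but read-only -> protection fault -> SEGV_ACCERR */
        char *p = mmap(NULL, 4096, PROT_READ,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        p[0] = 1;
    } else {
        /* address with no vma at all -> SEGV_MAPERR */
        *(volatile char *)0x1 = 1;
    }
    return 0;
}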

/*
 * Page fault error code bits:
 *
 *   bit 0 ==     0: no page found    1: protection fault
 *   bit 1 ==     0: read access        1: write access
 *   bit 2 ==     0: kernel-mode access    1: user-mode access
 *   bit 3 ==                1: use of reserved bit detected
 *   bit 4 ==                1: fault was an instruction fetch
 *   bit 5 ==                1: protection keys block access
 *   bit 15 ==                1: SGX MMU page-fault
 */
enum x86_pf_error_code {
    X86_PF_PROT    =        1 << 0,
    X86_PF_WRITE    =        1 << 1,
    X86_PF_USER    =        1 << 2,
    X86_PF_RSVD    =        1 << 3,
    X86_PF_INSTR    =        1 << 4,
    X86_PF_PK    =        1 << 5,
    X86_PF_SGX    =        1 << 15,
};
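As a concrete reading of these bits: error_code 0x6 (X86_PF_USER | X86_PF_WRITE) is a user-mode write to a not-present page, i.e. the classic demand-paging case, while 0x7 additionally sets X86_PF_PROT, meaning the page was present but the write violated its protections (for example a COW page). A tiny decoder:

/* Decode the x86 page-fault hardware error code bits listed above. */
#include <stdio.h>

#define X86_PF_PROT  (1 << 0)
#define X86_PF_WRITE (1 << 1)
#define X86_PF_USER  (1 << 2)
#define X86_PF_RSVD  (1 << 3)
#define X86_PF_INSTR (1 << 4)
#define X86_PF_PK    (1 << 5)
#define X86_PF_SGX   (1 << 15)

static void decode(unsigned long ec)
{
    printf("error_code=0x%lx: %s, %s access, %s mode%s%s%s%s\n", ec,
           ec & X86_PF_PROT  ? "protection fault" : "page not present",
           ec & X86_PF_WRITE ? "write"            : "read",
           ec & X86_PF_USER  ? "user"             : "kernel",
           ec & X86_PF_RSVD  ? ", reserved bit set"  : "",
           ec & X86_PF_INSTR ? ", instruction fetch" : "",
           ec & X86_PF_PK    ? ", protection key"    : "",
           ec & X86_PF_SGX   ? ", SGX"               : "");
}

int main(void)
{
    decode(0x6);   /* user write, page not present (demand paging)      */
    decode(0x7);   /* user write, protection violation (e.g. COW page)  */
    decode(0x15);  /* user instruction fetch hitting NX (PROT|USER|INSTR) */
    return 0;
}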

handle_page_fault

In the 5.13 kernel, __do_page_fault has been removed on x86 and replaced by handle_page_fault.

DEFINE_IDTENTRY_RAW_ERRORCODE(exc_page_fault)
    ->handle_page_fault(regs, error_code, address)
static __always_inline void
trace_page_fault_entries(struct pt_regs *regs, unsigned long error_code,
             unsigned long address)
{
    if (!trace_pagefault_enabled())
        return;
    /*
     * user_mode(regs) checks whether the fault came from user mode; on
     * x86 this is decided by the privilege level in regs->cs (not by an
     * MSR bit).  The tracepoints recorded here are defined via
     * DEFINE_PAGE_FAULT_EVENT(page_fault_user) and
     * DEFINE_PAGE_FAULT_EVENT(page_fault_kernel).
     */
    if (user_mode(regs))
        trace_page_fault_user(address, regs, error_code);
    else
        trace_page_fault_kernel(address, regs, error_code);
}
 
static __always_inline void
handle_page_fault(struct pt_regs *regs, unsigned long error_code,
                  unsigned long address)
{
    //记录事件
    trace_page_fault_entries(regs, error_code, address);
    //跟踪判断是否是kmmio的falut,如果是调用kmmio_handler
    if (unlikely(kmmio_fault(regs, address)))
        return;
 
    //判断是发生在内核态还是用户态的fault
    //fault_in_kernel_space判断address是否处于vsyscall区域(大于TASK_SIZE_MAX),但是这片区域不被当作kernel sapce对待。
    if (unlikely(fault_in_kernel_space(address))) {
        do_kern_addr_fault(regs, error_code, address);
    } else {
        do_user_addr_fault(regs, error_code, address);
        /*
         * User address page fault handling might have reenabled
         * interrupts. Fixing up all potential exit points of
         * do_user_addr_fault() and its leaf functions is just not
         * doable w/o creating an unholy mess or turning the code
         * upside down.
         */
        local_irq_disable();
    }
}

do_kern_addr_fault

static void
do_kern_addr_fault(struct pt_regs *regs, unsigned long hw_error_code,
           unsigned long address)
{
    //X86_PF_PK只针对用户态的page起作用
    WARN_ON_ONCE(hw_error_code & X86_PF_PK);
    //检测是否由于特有的bug产生
    if (is_f00f_bug(regs, hw_error_code, address))
        return;
  /*
    确保故障不是由以下引起的:
    1.不是在一个有保留位设置的PTE上的故障
    2.不是由user mode访问kernel mem访问引起的故障
    3.不是由违反页级保护引起的故障(反问一个不存在的页,且它的X86_PF_PROT==0
    */
  //32位下检测是否是由于vmalloc_fault导致
    if (!(hw_error_code & (X86_PF_RSVD | X86_PF_USER | X86_PF_PROT))) {
        if (vmalloc_fault(address) >= 0)
            return;
    }
    /*    是否是由于TLB陈旧导致的,没有及时更新 */
    if (spurious_kernel_fault(hw_error_code, address))
        return;
    /* 是否是kprobe hook了缺页错误 */
    if (kprobe_page_fault(regs, X86_TRAP_PF))
        return;
    //非法地址访问导致的错误
    bad_area_nosemaphore(regs, hw_error_code, address);
}

vmalloc_fault

 

The kernel uses vmalloc to allocate memory that is contiguous in virtual address space but not necessarily contiguous in physical memory.

 

This handles faults in the vmalloc or module mapping area. It is needed because there is a race between the moment the vmalloc mapping code updates a PMD and the moment that update is synchronized with the other page tables in the system. Within that window another thread/CPU can map a region on the same PMD and find that it already exists, while the entry has not yet been propagated to the rest of the system. As a result, vmalloc can return a region that is not yet mapped in every page table, and accessing it raises a page fault that has to be handled here.

 

The fix essentially copies the (global) kernel page-table entries of the init process (init_mm) into the current process's page tables, which keeps the kernel part of every process's address space in sync.

static noinline int vmalloc_fault(unsigned long address)
{
    unsigned long pgd_paddr;
    pmd_t *pmd_k;
    pte_t *pte_k;
 
    /* 确保我们在系统的vmalloc区域 */
    if (!(address >= VMALLOC_START && address < VMALLOC_END))
        return -1;
    //首先拿到进程内核的PGD的物理地址(进程内核页表)以及全局内核页表,然后复制更新进程内核页表,完成同步
    pgd_paddr = read_cr3_pa();
    pmd_k = vmalloc_sync_one(__va(pgd_paddr), address);
 
    if (!pmd_k)
        return -1;
    //是否是大页(PSE)
    if (pmd_large(*pmd_k))
        return 0;
 
    //在pmd中找到pte的对应指针
    pte_k = pte_offset_kernel(pmd_k, address);
    //检查PTE是否存在于内存中,如果不存在则缺页
    if (!pte_present(*pte_k))
        return -1;
 
    return 0;
}

spurious_kernel_fault

 

This function handles spurious faults caused by TLB entries that have not yet been refreshed.

 

Why this can happen: the permissions cached in the TLB entry are more restrictive than those in the current page-table entry.

 

Typical triggers:

1. A ring-0 (kernel) write to a page whose stale TLB entry still marks it read-only.

2. An instruction fetch from a region whose stale TLB entry still marks it NX.

static noinline int
spurious_kernel_fault(unsigned long error_code, unsigned long address)
{
    pgd_t *pgd;
    p4d_t *p4d;
    pud_t *pud;
    pmd_t *pmd;
    pte_t *pte;
    int ret;
 
    //只可能由如下标志导致X86_PF_WRITE / X86_PF_INSTR
    if (error_code != (X86_PF_WRITE | X86_PF_PROT) &&
        error_code != (X86_PF_INSTR | X86_PF_PROT))
        return 0;
 
    //定位内核页表
    pgd = init_mm.pgd + pgd_index(address);
    //判断pgd是否在内存中
    if (!pgd_present(*pgd))
        return 0;
    //通过偏移获得p4d entry
    p4d = p4d_offset(pgd, address);
    if (!p4d_present(*p4d))
        return 0;
    //是否是大页?
    if (p4d_large(*p4d))
        //如果p4d开启了PSE,那么调用spurious_kernel_fault_check,此时p4d作为pte被检测
        return spurious_kernel_fault_check(error_code, (pte_t *) p4d);
 
    pud = pud_offset(p4d, address);
    if (!pud_present(*pud))
        return 0;
    if (pud_large(*pud))
        return spurious_kernel_fault_check(error_code, (pte_t *) pud);
    pmd = pmd_offset(pud, address);
    if (!pmd_present(*pmd))
        return 0;
    if (pmd_large(*pmd))
        return spurious_kernel_fault_check(error_code, (pte_t *) pmd);
    //最终检测到pte
    pte = pte_offset_kernel(pmd, address);
    if (!pte_present(*pte))
        return 0;
    ret = spurious_kernel_fault_check(error_code, pte);
    if (!ret)
        return 0;
 
    //如果在pte阶段还是没有check到当前的虚假错误,那么说明页表可能有bug
    ret = spurious_kernel_fault_check(error_code, (pte_t *) pmd);
    WARN_ONCE(!ret, "PMD has incorrect permission bits\n");
 
    return ret;
}

bad_area_nosemaphore

 

bad_area_nosemaphore -> __bad_area_nosemaphore

static void
__bad_area_nosemaphore(struct pt_regs *regs, unsigned long error_code,
               unsigned long address, u32 pkey, int si_code)
{
    struct task_struct *tsk = current;
 
    //如果请求不是来自用户态,而是内核中发生缺页(或者vsyscall区域),尝试调用 kernelmode_fixup_or_oops
    //先尝试进行修复,修不好了就kernel panic,然后oops打印信息。
    if (!user_mode(regs)) {
        kernelmode_fixup_or_oops(regs, error_code, address, pkey, si_code);
        return;
    }
    //如果是用户对内核内存的隐性访问,oops掉
    if ( !(error_code & X86_PF_USER) ) {
        /* Implicit user access to kernel memory -- just oops */
        page_fault_oops(regs, error_code, address);
        return;
    }
 
    /*
    用户模式的访问只会导致SIGSEGV。这里有可能关闭中断。
    */
    local_irq_enable();
    /*
    如果访问来自用户态
    */
    if (is_prefetch(regs, error_code, address))
        return;
 
    if (is_errata100(regs, address))
        return;
    //伪造访问错误,防止泄漏内核信息
    sanitize_error_code(address, &error_code);
    //单独fixup vdso错误
    if (fixup_vdso_exception(regs, X86_TRAP_PF, error_code, address))
        return;
    if (likely(show_unhandled_signals))
    // 打印错误信息
        show_signal_msg(regs, error_code, address, tsk);
    set_signal_archinfo(address, error_code);
 
    if (si_code == SEGV_PKUERR)
        force_sig_pkuerr((void __user *)address, pkey);
    else
        // 发送SIGSEGV信号
        force_sig_fault(SIGSEGV, si_code, (void __user *)address);
 
    local_irq_disable();
}

kernelmode_fixup_or_oops

static noinline void
kernelmode_fixup_or_oops(struct pt_regs *regs, unsigned long error_code,
             unsigned long address, int signal, int si_code)
{
    //如果是用户态调用此函数,直接oops
    WARN_ON_ONCE(user_mode(regs));
 
    /*
    尝试通过搜索exception table,调用ex_fixup_handler(search_exception_tables(regs->ip))进行修复
    */
    if (fixup_exception(regs, X86_TRAP_PF, error_code, address))
    {
        /*
         任何发生故障的中断都会得到修复。
         这使得下面的递归故障逻辑只适用于来自task context任务上下文的故障
         */
        if (in_interrupt())
            return;
 
        /*
        在这种情况下,我们需要确保我们没有通过emulate_vsyscall()逻辑进行递归故障
         */
        if (current->thread.sig_on_uaccess_err && signal) {
            //设置 error_code |= X86_PF_PROT,防止泄漏内核页表信息。
      //sanitize_error_code将访问内核空间的错误伪造成protection faults
            sanitize_error_code(address, &error_code);
 
            set_signal_archinfo(address, error_code);
            force_sig_fault(signal, si_code, (void __user *)address);
        }
        return;
    }
 
    /*
     AMD错误#91在PREFETCH指令上表现为虚假的页面故障。
     */
    if (is_prefetch(regs, error_code, address))
        return;
    //使用oops打印错误
    page_fault_oops(regs, error_code, address);
}

do_user_addr_fault

static inline
void do_user_addr_fault(struct pt_regs *regs,
            unsigned long error_code,
            unsigned long address)
{
    struct vm_area_struct *vma;
    struct task_struct *tsk;
    struct mm_struct *mm;
    vm_fault_t fault;
    unsigned int flags = FAULT_FLAG_DEFAULT;
 
    tsk = current;
    mm = tsk->mm;
 
    // this is kernel mode code trying to execute from  user memory
    // 这里代表尝试从用户空间执行内核代码
    // 除非是AMD的一些bug,否则直接oops
    if (unlikely((error_code & (X86_PF_USER | X86_PF_INSTR)) == X86_PF_INSTR)) {
        if (is_errata93(regs, address))
            return;
        page_fault_oops(regs, error_code, address);
        return;
    }
    //是否是kprobe hook掉了
    if (unlikely(kprobe_page_fault(regs, X86_TRAP_PF)))
        return;
 
    /*
       Reserved bits 不应该被设置在用户态的页表项
       如果设置了就发生页表错误
     */
    if (unlikely(error_code & X86_PF_RSVD))
        pgtable_bad(regs, error_code, address);
    /*
     如果开了SMAP,并且kernel-space尝试访问user-sapce,那么直接page_fault_oops
     */
    if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
             !(error_code & X86_PF_USER) &&
             !(regs->flags & X86_EFLAGS_AC))) {
        page_fault_oops(regs, error_code, address);
        return;
    }
 
    /*
    如果我们处于中断中,没有用户上下文;
    或者运行在page faults 禁止的区域
    直接调用bad_area_nosemaphore
    */
    if (unlikely(faulthandler_disabled() || !mm)) {
        bad_area_nosemaphore(regs, error_code, address);
        return;
    }
 
    /*
    在cr2寄存器被保存,以及vmalloc  fault被处理后,允许中断是安全的
    */
    if (user_mode(regs)) {
        local_irq_enable();
        flags |= FAULT_FLAG_USER;
    } else {
        if (regs->flags & X86_EFLAGS_IF)
            local_irq_enable();
    }
    //记录事件
    perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
    // 更新flag标志位
    if (error_code & X86_PF_WRITE)
        flags |= FAULT_FLAG_WRITE;
    if (error_code & X86_PF_INSTR)
        flags |= FAULT_FLAG_INSTRUCTION;
 
#ifdef CONFIG_X86_64
    /*
    由于vsyscall没有vma,所以对其进行vma模拟在查找对应vma之前
    */
    if (is_vsyscall_vaddr(address)) {
        if (emulate_vsyscall(error_code, regs, address))
            return;
    }
#endif
    /*
    内核模式对用户地址空间的访问应该只发生在在异常表(exception_tables)中列出的定义明确的单一指令上
    但是如果kernel faults同时发生在持有mmap_lock的区域之外,可能导致deadlock
 
    只有当我们可能面临死锁的风险时,才会进行昂贵的异常表搜索。 这发生在我们:
         * 1. 没能获得mmap_lock
          * 2. 访问不是从用户空间开始的
    */
    if (unlikely(!mmap_read_trylock(mm)))
    {
        if (!user_mode(regs) && !search_exception_tables(regs->ip)) {
            /*
             * Fault from code in kernel from
             * which we do not expect faults.
             */
            bad_area_nosemaphore(regs, error_code, address);
            return;
        }
retry:
        mmap_read_lock(mm);
    } else {
        //当mmap_read_trylock成功,可能错过mmap_read_lock中的might_sleep,这里补一下
        might_sleep();
    }
    //查找vma,判断合法性
    vma = find_vma(mm, address);
    if (unlikely(!vma)) {
        bad_area(regs, error_code, address);
        return;
    }
    if (likely(vma->vm_start <= address))
        goto good_area;
 
    // 越界访问
    // address > vma->vm_start
    // 判断对应的进程vma对应的用户空间的地址是否是向下增长的,如果是,说明访问出现在holing区域?
    if (unlikely(!(vma->vm_flags & VM_GROWSDOWN))) {
        bad_area(regs, error_code, address);
        return;
    }
 
    // 用户堆栈的扩展
    // 如果address的访问在栈空间,判断是否进行堆栈扩充
    // 在expand_stack()中只是建立堆栈区的vm_area_struct结构,并未建立起新扩展的页面对物理内存的映射,该任务由good_area完成
    if (unlikely(expand_stack(vma, address))) {
        bad_area(regs, error_code, address);
        return;
    }
 
    /*
     到这里为止我们创建好了需要的vma结构
     */
good_area:
    //判断当前的操作是否与vma的权限不符合,此时access_error中如果error_code为X86_PF_PK,那么不进行接下来的处理,直接返回,留给COW
    if (unlikely(access_error(error_code, vma))) {
        bad_area_access_error(regs, error_code, address, vma);
        return;
    }
 
    /*
     缺页错误,尝试进行最后的处理和判断,防止不停的重复此故障
     */
    fault = handle_mm_fault(vma, address, flags, regs);
    if (fault_signal_pending(fault, regs)) {
        /*
         * Quick path to respond to signals.  The core mm code
         * has unlocked the mm for us if we get here.
         */
        if (!user_mode(regs))
            kernelmode_fixup_or_oops(regs, error_code, address,
                         SIGBUS, BUS_ADRERR);
        return;
    }
 
    /*
     * If we need to retry the mmap_lock has already been released,
     * and if there is a fatal signal pending there is no guarantee
     * that we made any progress. Handle this case first.
     */
    if (unlikely((fault & VM_FAULT_RETRY) &&
             (flags & FAULT_FLAG_ALLOW_RETRY))) {
        flags |= FAULT_FLAG_TRIED;
        goto retry;
    }
    //解锁
    mmap_read_unlock(mm);
    if (likely(!(fault & VM_FAULT_ERROR)))
        return;
 
    if (fatal_signal_pending(current) && !user_mode(regs)) {
        kernelmode_fixup_or_oops(regs, error_code, address, 0, 0);
        return;
    }
    //是否需要启动oom?
    if (fault & VM_FAULT_OOM) {
        /* Kernel mode? Handle exceptions or die: */
        if (!user_mode(regs)) {
            kernelmode_fixup_or_oops(regs, error_code, address,
                         SIGSEGV, SEGV_MAPERR);
            return;
        }
        //启动oom killer
        pagefault_out_of_memory();
    }
    else {
        if (fault & (VM_FAULT_SIGBUS|VM_FAULT_HWPOISON|
                 VM_FAULT_HWPOISON_LARGE))
            do_sigbus(regs, error_code, address, fault);
        else if (fault & VM_FAULT_SIGSEGV)
            bad_area_nosemaphore(regs, error_code, address);
        else
            BUG();
    }
}

handle_mm_fault

handle_mm_fault -> __handle_mm_fault
static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
        unsigned long address, unsigned int flags)
{
    struct vm_fault vmf = {
        .vma = vma,
        .address = address & PAGE_MASK,
        .flags = flags,
        .pgoff = linear_page_index(vma, address),
        .gfp_mask = __get_fault_gfp_mask(vma),
    };
 
    unsigned int dirty = flags & FAULT_FLAG_WRITE;
 
 
    struct mm_struct *mm = vma->vm_mm;
    pgd_t *pgd;
    p4d_t *p4d;
    vm_fault_t ret;
    // mm->pgd + pgd_index(address)
    //定位到对应的pgd中的entry
    pgd = pgd_offset(mm, address);
    //尝试分配新的p4d(针对5级页表:PGD->P4D->PUD->PMD->PTE),如果没有5级页表,就返回pgd
    p4d = p4d_alloc(mm, pgd, address);
    if (!p4d)
        return VM_FAULT_OOM;
    //定位到pud
    vmf.pud = pud_alloc(mm, p4d, address);
    if (!vmf.pud)
        return VM_FAULT_OOM;
 
retry_pud:
    // 如果PUD表项为空,且开启透明大页(不支持匿名页),huge_fault触发
    if (pud_none(*vmf.pud) && __transparent_hugepage_enabled(vma)) {
        ret = create_huge_pud(&vmf);
        if (!(ret & VM_FAULT_FALLBACK))
            return ret;
    }
    else {
        pud_t orig_pud = *vmf.pud;
 
        barrier();
        // 如果pud开启PSE或者是devmap
        if (pud_trans_huge(orig_pud) || pud_devmap(orig_pud)) {
            // pud被更新为脏,触发huge_fault(不支持匿名页)
            if (dirty && !pud_write(orig_pud)) {
                ret = wp_huge_pud(&vmf, orig_pud);
                if (!(ret & VM_FAULT_FALLBACK))
                    return ret;
            } else {
                huge_pud_set_accessed(&vmf, orig_pud);
                return 0;
            }
        }
    }
 
    vmf.pmd = pmd_alloc(mm, vmf.pud, address);
    if (!vmf.pmd)
        return VM_FAULT_OOM;
 
    /* Huge pud page fault raced with pmd_alloc? */
    if (pud_trans_unstable(vmf.pud))
        goto retry_pud;
 
    // 如果pmd为空,且vma可以创建THP,则调用create_huge_pmd创建THP
    if (pmd_none(*vmf.pmd) && __transparent_hugepage_enabled(vma)) {
        ret = create_huge_pmd(&vmf);
        if (!(ret & VM_FAULT_FALLBACK))
            return ret;
    } else {
        pmd_t orig_pmd = *vmf.pmd;
        barrier();
        //判断pmd是否在swap分区中(被换出内存)
        if (unlikely(is_swap_pmd(orig_pmd))) {
            //如果支持THP迁移,但是orig_pmd不是迁移对应的entry
            VM_BUG_ON(thp_migration_supported() &&
                      !is_pmd_migration_entry(orig_pmd));
            if (is_pmd_migration_entry(orig_pmd))
                pmd_migration_entry_wait(mm, vmf.pmd);
            return 0;
        }
        // pud具有_PAGE_PSE标志位(开启THP), 或者pud为devmap
        if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
            if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
                return do_huge_pmd_numa_page(&vmf, orig_pmd);
 
            if (dirty && !pmd_write(orig_pmd)) {
                ret = wp_huge_pmd(&vmf, orig_pmd);
                if (!(ret & VM_FAULT_FALLBACK))
                    return ret;
            } else {
                huge_pmd_set_accessed(&vmf, orig_pmd);
                return 0;
            }
        }
    }
    //根据对应的vmf分配物理页面
    return handle_pte_fault(&vmf);
}

handle_pte_fault

static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
{
    pte_t entry;
    //如果对应的pmd不存在,则pte不存在
    if (unlikely(pmd_none(*vmf->pmd))) {
        vmf->pte = NULL;
    } else {
        /*
        static inline int pmd_devmap_trans_unstable(pmd_t *pmd)
        {
            return pmd_devmap(*pmd) || pmd_trans_unstable(pmd);
        }
        pmd_devmap:通过检测_PAGE_DEVMAP判断是否是devmap,如果不是devmap,则:调用pmd_trans_unstable
        pmd_trans_unstable:
            1.如果kernel不支持THP,那么就是noop
            2.如果支持THP,那么调用pmd_none_or_trans_huge_or_clear_bad
                首先检测pmd是否为空 或者 是否可以转换成THP 或者 是否允许THP迁移,并且是否设置了_PAGE_PRESENT | _PAGE_PROTNONE | _PAGE_PSE
                    return 1
                否则进入pmd_clear_bad,调用pmd_ERROR->pr_err打印错误,最终调用 native_set_pmd(pmd, native_make_pmd(0))
        */
        if (pmd_devmap_trans_unstable(vmf->pmd))
            return 0;
        /*
         至此,一个普通的pmd建立起来了,而且此时它不能再从我们下面变成一个huge pmd,
         因为我们持有mmap_lock的读侧,而khugepaged以写侧拿走了它。
         所以现在运行pte_offset_map()是安全的。
         */
        vmf->pte = pte_offset_map(vmf->pmd, vmf->address);
        vmf->orig_pte = *vmf->pte;
        barrier();
        if (pte_none(vmf->orig_pte)) {
            pte_unmap(vmf->pte);
            vmf->pte = NULL;
        }
    }
    // 如果pte为空,进行页表分配
    if (!vmf->pte) {
        if (vma_is_anonymous(vmf->vma))
            //处理匿名页
            return do_anonymous_page(vmf);
        else
            //处理文件映射的页,此时不是匿名页,调用do_fault调入页
            //这里可以看https://bbs.pediy.com/thread-264199.htm#msg_header_h3_5
            return do_fault(vmf);
    }
 
    //此时物理页已经存在
 
    //如果不在内存,swap进来
    if (!pte_present(vmf->orig_pte))
        return do_swap_page(vmf);
    //维持node平衡,进行页迁移
    if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
        return do_numa_page(vmf);
 
    vmf->ptl = pte_lockptr(vmf->vma->vm_mm, vmf->pmd);
    spin_lock(vmf->ptl);
    entry = vmf->orig_pte;
 
    //检测我们的pte是否发生变化,如果发生变化,更新TLB,然后解锁,返回
    if (unlikely(!pte_same(*vmf->pte, entry))) {
        update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
        goto unlock;
    }
 
    // 如果这个fault是由于write触发
    if (vmf->flags & FAULT_FLAG_WRITE) {
        if (!pte_write(entry))
            //调用do_wp_page处理COW中断,发现页此时已经调入但是存在foll_flags语义冲突(_PAGE_RONLY但是要做写操作),调用do_wp_page
            return do_wp_page(vmf);
            // 标记为脏页
        entry = pte_mkdirty(entry);
    }
    //标记_PAGE_ACCESSED位
    entry = pte_mkyoung(entry);
 
    //ptep_set_access_flags首先判断pte是否changed, 然后用于在其他架构上设置页表项的访问位或脏位
    if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,vmf->flags & FAULT_FLAG_WRITE)) {
        // pte内容更改, mmu
        update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
    } else {
        //当pte没改变
        if (vmf->flags & FAULT_FLAG_TRIED)
            goto unlock;
        // 如果是一个写故障,可能对应COW,刷新TLB
        if (vmf->flags & FAULT_FLAG_WRITE)
            flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
    }
unlock:
    pte_unmap_unlock(vmf->pte, vmf->ptl);
    return 0;
}
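The do_wp_page() branch above is what implements copy-on-write: after fork(), parent and child share the same physical pages through write-protected PTEs, so the first write from either side is a write fault on a non-writable PTE, which ends up in do_wp_page() and copies the page. The effect is easy to observe from user space (the PTE manipulation itself is of course invisible here):

/*
 * Observe COW from user space: parent and child share an anonymous page
 * after fork(); the child's write triggers the do_wp_page path and only
 * the child's copy changes.
 */
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return 1;
    strcpy(p, "original");

    pid_t pid = fork();
    if (pid == 0) {                    /* child: first write after fork */
        strcpy(p, "child-copy");       /* write fault -> COW -> new page */
        printf("child  sees: %s\n", p);
        _exit(0);
    }
    waitpid(pid, NULL, 0);
    printf("parent sees: %s\n", p);    /* still "original" */
    munmap(p, 4096);
    return 0;
}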

Some of the calls made in this function are covered in more detail at:

 

https://bbs.pediy.com/thread-264199.htm

 

which mainly discusses the corresponding COW handling.

References

Special thanks to povcfe, whose analysis of Linux memory management taught me a lot.

 

https://www.kernel.org/doc/html/latest/core-api/memory-allocation.html

 

https://frankdenneman.nl/2016/07/07/numa-deep-dive-part-1-uma-numa/

 

https://zhuanlan.zhihu.com/p/68465952

 

https://blog.csdn.net/jasonchen_gbd/article/details/79462014

 

https://blog.csdn.net/zhoutaopower/article/details/87090982

 

https://blog.csdn.net/zhoutaopower/article/details/88025712

 

https://0xax.gitbooks.io/linux-insides/content/Theory/linux-theory-1.html

 

https://zhuanlan.zhihu.com/p/137277724

 

https://segmentfault.com/a/1190000012269249

 

https://www.codenong.com/cs105984564/

 

https://rtoax.blog.csdn.net/article/details/108663898?utm_medium=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-2.essearch_pc_relevant&depth_1-utm_source=distribute.pc_relevant.none-task-blog-2%7Edefault%7EBlogCommendFromMachineLearnPai2%7Edefault-2.essearch_pc_relevant

 

https://qinglinmao8315.github.io/linux/2018/03/14/linux-page-cache.html

 

https://www.jianshu.com/p/8a86033dfcb0

 

https://blog.csdn.net/wh8_2011/article/details/53138377

 

https://zhuanlan.zhihu.com/p/258921453?utm_source=wechat_timeline

 

https://blog.csdn.net/FreeeLinux/article/details/54754752

 

https://www.sohu.com/a/297831850_467784

 

https://www.cnblogs.com/adera/p/11718765.html

 

https://blog.csdn.net/zhuyong006/article/details/100737724

 

https://blog.csdn.net/wangquan1992/article/details/105036282/

 

https://blog.csdn.net/sykpour/article/details/24044641

 

页迁移与碎片整理

