[Original][eBPF Source Analysis] Call-chain and hook-point analysis for the Socket_filter program type
Posted: 2024-5-11 18:51

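A socket filter has the BPF program type BPF_PROG_TYPE_SOCKET_FILTER and, as the name suggests, implements a filter on a socket.
This article analyzes how BPF_PROG_TYPE_SOCKET_FILTER programs are implemented, tracing all the way down to the hook function.
The kernel ships sample code in samples/bpf/sock_example.c, samples/bpf/sockex1_kern.c, and others.
The section name of a socket filter program is usually defined as SEC("socketxxx").
The code analysis below is based on kernel version 5.15.99.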

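We start by studying samples/bpf/sock_example.c.
The header comment and includes:

Create the map with bpf_create_map. This is a libbpf user-space wrapper around the bpf() system call, not a kernel function.

The BPF program body, defined directly as bytecode:

bpf_insn is the low-level bytecode of a BPF program, a kind of abstract assembly. Programs written with higher-level libraries are all lowered to this form, which the kernel then JIT-compiles to native machine code or runs in its interpreter.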
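To make the 8-byte instruction layout concrete, here is a minimal user-space sketch that mirrors struct bpf_insn and hand-encodes two instructions. The names struct insn, mov64_imm, and exit_insn are made up for this illustration; the opcode constants carry the values of BPF_ALU64, BPF_MOV, BPF_K, BPF_JMP, and BPF_EXIT from the kernel UAPI headers. The sample itself builds instructions with the macros from bpf_insn.h instead.

```c
#include <assert.h>
#include <stdint.h>

/* User-space mirror of the kernel's struct bpf_insn (illustrative copy). */
struct insn {
    uint8_t code;        /* opcode */
    uint8_t dst_reg : 4; /* destination register */
    uint8_t src_reg : 4; /* source register */
    int16_t off;         /* signed offset */
    int32_t imm;         /* signed immediate constant */
};

/* Opcode building blocks; values match the kernel UAPI definitions. */
enum {
    CLS_JMP   = 0x05, /* BPF_JMP */
    CLS_ALU64 = 0x07, /* BPF_ALU64 */
    SRC_K     = 0x00, /* BPF_K: operand is the immediate */
    OP_MOV    = 0xb0, /* BPF_MOV */
    OP_EXIT   = 0x90, /* BPF_EXIT */
};

/* dst = imm as a 64-bit move, the encoding behind BPF_MOV64_IMM(dst, imm). */
static struct insn mov64_imm(uint8_t dst, int32_t imm)
{
    assert(dst <= 10); /* eBPF exposes registers r0..r10 */
    struct insn i = {
        .code = CLS_ALU64 | OP_MOV | SRC_K,
        .dst_reg = (uint8_t)(dst & 0xf),
        .imm = imm,
    };
    return i;
}

/* Program exit, the encoding behind BPF_EXIT_INSN(). */
static struct insn exit_insn(void)
{
    struct insn i = { .code = CLS_JMP | OP_EXIT };
    return i;
}
```

With these helpers, the two trailing instructions of the sample (r0 = 0 followed by exit) are mov64_imm(0, 0) and exit_insn(), and together they occupy 16 bytes.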

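The program above reads the IP protocol byte of each packet, stores it at fp - 4 as the map key, looks that key up in the map, and increments the 64-bit counter when the lookup succeeds.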
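The same logic can be modeled in plain user-space C, which may help when reading the bytecode. This is a sketch rather than the sample's code: count_by_proto and the counters array (standing in for the BPF array map) are hypothetical names, and the frame is assumed to be an Ethernet frame carrying IPv4.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN     14  /* same value as ETH_HLEN */
#define IP_PROTO_OFFSET 9   /* offsetof(struct iphdr, protocol) */
#define MAP_ENTRIES     256 /* the sample creates a 256-entry array map */

/*
 * Model of the sample filter: bump counters[ip->protocol] and return 0,
 * which is also what the bytecode leaves in r0 before exiting.
 */
static unsigned int count_by_proto(const uint8_t *frame, size_t len,
                                   uint64_t counters[MAP_ENTRIES])
{
    assert(frame != NULL && counters != NULL);

    size_t pos = ETH_HDR_LEN + IP_PROTO_OFFSET;
    if (pos >= len)
        return 0; /* an out-of-bounds BPF_LD_ABS also ends the program with 0 */

    uint32_t key = frame[pos]; /* r0 = ip->proto, stored as a u32 key */
    counters[key] += 1;        /* a byte always fits the 256-entry array */
    return 0;
}
```

Feeding it a frame whose byte 23 is 6 increments counters[6], the TCP counter. Like the sample program, the model always returns 0: it only counts packets and never passes them on to the socket.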

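Load the program with bpf_load_program (also a libbpf wrapper, not a kernel function). The BPF_PROG_TYPE_SOCKET_FILTER argument selects the socket filter type.

open_raw_sock creates a raw socket, and setsockopt with the SO_ATTACH_BPF option attaches the BPF program to it.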

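Next, we focus on how setsockopt leads to the socket filter hook point.
Search the setsockopt source for the SO_ATTACH_BPF case. In 5.15.99 the handling logic is at net/core/sock.c:1169.

Setting aside the BPF bookkeeping (operations on the prog object itself), follow __sk_attach_prog.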

After the steps annotated in the code comments, sk->sk_filter->prog points to the prog object.
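The swap performed by __sk_attach_prog can be sketched in user space as follows. This is a simplified, single-threaded model under stated assumptions: model_sock, model_filter, and model_attach are hypothetical names, plain assignment stands in for rcu_assign_pointer, and the memory charging done by __sk_filter_charge is left out.

```c
#include <assert.h>
#include <stdlib.h>

struct model_prog { int id; };

struct model_filter {
    int refcnt;
    struct model_prog *prog;
};

struct model_sock {
    struct model_filter *filter; /* stands in for sk->sk_filter */
    int filters_released;        /* bookkeeping for the demo only */
};

/* Attach prog to sk, replacing any previous filter. Returns 0 or -1. */
static int model_attach(struct model_sock *sk, struct model_prog *prog)
{
    assert(sk != NULL && prog != NULL);

    struct model_filter *fp = malloc(sizeof(*fp));
    if (!fp)
        return -1;
    fp->prog = prog;
    fp->refcnt = 1;

    struct model_filter *old = sk->filter; /* fetch the previous filter */
    sk->filter = fp;                       /* publish the new one */

    /* Drop the socket's reference on the displaced filter, if any. */
    if (old && --old->refcnt == 0) {
        free(old);
        sk->filters_released++;
    }
    return 0;
}
```

Attaching a second program to the same model socket leaves sk->filter->prog pointing at the new program and releases the first filter object, which is the behavior the kernel code achieves with sk_filter_uncharge.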

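Search for code that uses sk_filter to find the callers. It shows up in many functions; the prototype is in include/linux/filter.h.

sk_filter is a thin wrapper for running the filter in sk->sk_filter. Other code runs the SOCKET_FILTER program by reading sk->sk_filter itself or by calling sk_filter_trim_cap directly.

Follow sk_filter_trim_cap in net/core/filter.c.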

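Meaning of PF_MEMALLOC:
The current process can free a lot of memory; if it is given a little emergency memory, it can return more memory to the system. Subsystems other than memory management should not use this flag unless the allocation is guaranteed to release more memory back to the system. If every subsystem abused the flag, the reserves of the memory-management subsystem could be exhausted.

The function first checks whether the skb was allocated from the PFMEMALLOC reserves (skb_pfmemalloc). If so, only a socket with the SOCK_MEMALLOC flag may use it; otherwise the function returns -ENOMEM and increments the LINUX_MIB_PFMEMALLOCDROP counter. This keeps non-critical sockets from consuming scarce memory when the system is low on memory.
Next it calls BPF_CGROUP_RUN_PROG_INET_INGRESS(), which runs the eBPF programs attached to the cgroup ingress hook; a non-zero err is returned immediately. This implements cgroup-level network isolation and restriction.
If the CGROUP_INET_INGRESS attach point of CGROUP_BPF is enabled, the macro calls __cgroup_bpf_run_filter_skb to run the cgroup filter programs. Otherwise it evaluates to 0 and execution continues.
The cgroup details are left for a later discussion.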

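It then calls security_sock_rcv_skb, a hook point reserved for LSMs, which checks whether the socket is permitted to receive the skb.
Next it enters an RCU read-side critical section (rcu_read_lock), so the sk_filter object cannot be freed while it is in use, and reads the sk_filter pointer from sk.
With sk->sk_filter in hand, it points skb->sk at the socket passed in, runs the BPF program through bpf_prog_run_save_cb, and then restores the previous skb->sk.

If the returned length is non-zero, pskb_trim is called to trim the skb data to the larger of cap and the returned length (nothing is trimmed if the skb is already shorter), and any trim failure is returned as the error code. If the returned length is 0, the error code is set to -EPERM, meaning the packet is to be dropped.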
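The order of checks and the trim rule can be condensed into a small user-space model. It is a sketch under stated assumptions, not kernel code: model_pkt, model_sk, and model_filter_trim_cap are hypothetical names, the attached program is reduced to the value it would return, and the cgroup and LSM hooks are represented only by a comment.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct model_pkt {
    unsigned int len; /* current skb length */
    bool pfmemalloc;  /* skb came from the emergency reserves */
};

struct model_sk {
    bool memalloc;           /* SOCK_MEMALLOC is set */
    bool has_filter;         /* sk->sk_filter != NULL */
    unsigned int filter_ret; /* value the attached program would return */
};

/* Returns 0 to accept (pkt->len possibly trimmed) or a negative errno. */
static int model_filter_trim_cap(const struct model_sk *sk,
                                 struct model_pkt *pkt, unsigned int cap)
{
    if (pkt->pfmemalloc && !sk->memalloc)
        return -ENOMEM;
    /* The cgroup ingress programs and the LSM hook would run here. */
    if (!sk->has_filter)
        return 0;
    if (sk->filter_ret == 0)
        return -EPERM; /* program asked to drop the packet */

    unsigned int keep = sk->filter_ret > cap ? sk->filter_ret : cap;
    if (keep < pkt->len)
        pkt->len = keep; /* trimming only ever shrinks the packet */
    return 0;
}

/* Usage example: a 100-byte packet, first trimmed to 60, then dropped. */
static void model_filter_examples(void)
{
    struct model_sk sk = { .has_filter = true, .filter_ret = 60 };
    struct model_pkt pkt = { .len = 100 };

    assert(model_filter_trim_cap(&sk, &pkt, 1) == 0 && pkt.len == 60);
    sk.filter_ret = 0;
    assert(model_filter_trim_cap(&sk, &pkt, 1) == -EPERM);
}
```

The cap argument is why a program cannot cut a packet shorter than its caller allows: a return value of 4 with cap 20 still leaves 20 bytes.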

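A quick look at how the BPF prog is executed:
The details involved are covered in the BPF subsystem source analysis.

Searching for sk->sk_filter, sk_filter, and sk_filter_trim_cap locates the places where filter programs are invoked.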

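Look up the function references and trace the call chain backwards.
Among the callers of sk_filter we find sock_queue_rcv_skb (net/core/sock.c). Many of these functions carry explanatory comments.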

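The sk_filter call in this code is the hook function.
There are a great many callers, most of them adapters for individual protocols, for example J1939, which turns out to be a communication protocol for the automotive CAN bus. Here we look at net/ieee802154/socket.c.
Both dgram_rcv_skb (datagram sockets) and raw_rcv_skb call sock_queue_rcv_skb.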

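The IPv4 raw_rcv_skb has much the same logic; it is called from raw_rcv.

Follow raw_rcv in net/ipv4/raw.c.

Follow raw_v4_input. This function mainly dispatches received packets to the matching raw sockets. SOCK_RAW allows several sockets to receive the same packet, so each matching socket is handed its own clone.

Follow raw_local_deliver.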

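Follow ip_local_deliver->ip_local_deliver_finish->ip_protocol_deliver_rcu->raw_local_deliver. This is the entry point where the network layer hands packets to the transport layer: ip_local_deliver is responsible for delivering packets from the network layer to upper-layer protocols. Because SOCK_RAW bypasses the transport layer, the check sits here. For details, see the article on the network subsystem:
[Linux内核源码分析]网络子系统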

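In net/ipv4/tcp_ipv4.c, the tcp_filter function calls sk_filter_trim_cap.

Follow tcp_v4_rcv (the receive function for AF_INET TCP).
tcp_filter is called both in the TCP_NEW_SYN_RECV handling and in the main body of the function.
tcp_v4_rcv handles TCP_NEW_SYN_RECV: if the connection check succeeds, a new control block must be created to handle the connection, and that new control block is put in the TCP_SYN_RECV state.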

Searching for sk_filter also turns up udp_queue_rcv_one_skb, which is called from udp_queue_rcv_skb.

Tracing upwards leads to udp_rcv, so the check lives inside the receive path of the UDP stack. Now walk forward from udp_rcv.

Follow __udp4_lib_rcv.

Both udp_unicast_rcv_skb and __udp4_lib_mcast_deliver call udp_queue_rcv_skb, which in turn calls udp_queue_rcv_one_skb.
Finally, udp_queue_rcv_one_skb calls sk_filter_trim_cap.

Call chain: udp_rcv->__udp4_lib_rcv->udp_unicast_rcv_skb/__udp4_lib_mcast_deliver->udp_queue_rcv_skb->udp_queue_rcv_one_skb->sk_filter_trim_cap

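Some other kernel functions also call the sk_filter family, but they belong to generic sock handling (__sk_receive_skb) used by other protocols such as DCCP, pppoe, and l2tp, so they are not analyzed here.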

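The tracing point is __sk_destruct (net/core/sock.c). It checks whether the socket still has an sk_filter and, if so, calls sk_filter_uncharge to release the memory charged for it.

Follow sk_destruct.

Follow __sk_free.

Follow sk_free.

sk_free is the kernel function that destroys a socket object; sk_alloc is the one that allocates it. tipc_sk_create (net/tipc/socket.c) serves as an example: it creates the socket object with sk_alloc and, when it detects a failure, releases the memory with sk_free.

Call chain of this logic: sk_free->__sk_free->sk_destruct->__sk_destruct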

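run_filter in net/packet/af_packet.c takes the program from sk->sk_filter->prog and runs it.

Follow packet_rcv.

Follow packet_create, the create function of the PF_PACKET family. It creates the packet_sock and installs packet_rcv as the receive handler (po->prot_hook.func). When a socket is created, __sock_create looks up the create function that the address family registered in net_families; for PF_PACKET that is packet_create. (inet_create and the inetsw array play the analogous role for PF_INET only.)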
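Family-based dispatch at socket creation can be pictured with a small table of create callbacks. The sketch below is a user-space model only: all model_* names are hypothetical, the table size is arbitrary, and the real kernel keeps an RCU-protected net_families[] array that is filled by sock_register().

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MODEL_NFAMILIES 64 /* arbitrary table size for the model */
#define MODEL_PF_INET   2  /* same numeric value as PF_INET */
#define MODEL_PF_PACKET 17 /* same numeric value as PF_PACKET */

typedef const char *(*model_create_fn)(void);

static model_create_fn model_families[MODEL_NFAMILIES];

/* Stand-ins for the per-family create functions; each reports its name. */
static const char *model_inet_create(void)   { return "inet_create"; }
static const char *model_packet_create(void) { return "packet_create"; }

/* Register a family's create function; returns 0 on success, -1 on error. */
static int model_sock_register(int family, model_create_fn create)
{
    if (family < 0 || family >= MODEL_NFAMILIES || model_families[family])
        return -1;
    model_families[family] = create;
    return 0;
}

/* Dispatch creation by family; NULL if the family is not registered. */
static const char *model_sock_create(int family)
{
    if (family < 0 || family >= MODEL_NFAMILIES || !model_families[family])
        return NULL;
    return model_families[family]();
}

/* Usage example: PF_PACKET resolves to the packet family's create hook. */
static void model_dispatch_examples(void)
{
    assert(model_sock_register(MODEL_PF_INET, model_inet_create) == 0);
    assert(model_sock_register(MODEL_PF_PACKET, model_packet_create) == 0);
    assert(strcmp(model_sock_create(MODEL_PF_PACKET), "packet_create") == 0);
    assert(model_sock_create(40) == NULL);
}
```

The point of the model is that the family number alone selects the create function, so a PF_PACKET socket never passes through the PF_INET create path.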

Structure of packet_sock:

A BPF program of type BPF_PROG_TYPE_SOCKET_FILTER must be bound with setsockopt; the hook function is sk_filter_trim_cap in net/core/filter.c.

Call chain summary

/* eBPF example program:
 * - creates arraymap in kernel with key 4 bytes and value 8 bytes
 *
 * - loads eBPF program:
 *   r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)];
 *   *(u32*)(fp - 4) = r0;
 *   // assuming packet is IPv4, lookup ip->proto in a map
 *   value = bpf_map_lookup_elem(map_fd, fp - 4);
 *   if (value)
 *        (*(u64*)value) += 1;
 *
 * - attaches this program to loopback interface "lo" raw socket
 *
 * - every second user space reads map[tcp], map[udp], map[icmp] to see
 *   how many packets of given protocol were seen on "lo"
 */
#include <stdio.h>
#include <unistd.h>
#include <assert.h>
#include <linux/bpf.h>
#include <string.h>
#include <stdlib.h>
#include <errno.h>
#include <sys/socket.h>
#include <arpa/inet.h>
#include <linux/if_ether.h>
#include <linux/ip.h>
#include <stddef.h>
#include <bpf/bpf.h>
#include "bpf_insn.h"
#include "sock_example.h"
int sock = -1, map_fd, prog_fd, i, key;
long long value = 0, tcp_cnt, udp_cnt, icmp_cnt;
 
map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), sizeof(value),
                            256, 0);
if (map_fd < 0) {
    printf("failed to create map '%s'\n", strerror(errno));
    goto cleanup;
}
struct bpf_insn prog[] = {
    BPF_MOV64_REG(BPF_REG_6, BPF_REG_1),
    BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol) /* R0 = ip->proto */),
    BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), /* *(u32 *)(fp - 4) = r0 */
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = fp - 4 */
    BPF_LD_MAP_FD(BPF_REG_1, map_fd),
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
    BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */
    BPF_ATOMIC_OP(BPF_DW, BPF_ADD, BPF_REG_0, BPF_REG_1, 0),
    BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */
    BPF_EXIT_INSN(),
};
size_t insns_cnt = sizeof(prog) / sizeof(struct bpf_insn);
struct bpf_insn {
    __u8    code;       /* opcode */
    __u8    dst_reg:4;  /* dest register */
    __u8    src_reg:4;  /* source register */
    __s16   off;        /* signed offset */
    __s32   imm;        /* signed immediate constant */
};
r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)];
*(u32*)(fp - 4) = r0;
// assuming packet is IPv4, lookup ip->proto in a map
value = bpf_map_lookup_elem(map_fd, fp - 4);
if (value)
    (*(u64*)value) += 1;
prog_fd = bpf_load_program(BPF_PROG_TYPE_SOCKET_FILTER, prog, insns_cnt,
                           "GPL", 0, bpf_log_buf, BPF_LOG_BUF_SIZE);
if (prog_fd < 0) {
    printf("failed to load prog '%s'\n", strerror(errno));
    goto cleanup;
}
sock = open_raw_sock("lo");
 
if (setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd,
                   sizeof(prog_fd)) < 0) {
    printf("setsockopt %s\n", strerror(errno));
    goto cleanup;
}
case SO_ATTACH_BPF:
    ret = -EINVAL;
    if (optlen == sizeof(u32)) {
        u32 ufd;
 
        ret = -EFAULT;
        if (copy_from_sockptr(&ufd, optval, sizeof(ufd)))
            break;
 
        ret = sk_attach_bpf(ufd, sk);
    }
    break;
Jump to the `sk_attach_bpf` function (`net/core/filter.c:1571`)
int sk_attach_bpf(u32 ufd, struct sock *sk)
{
    struct bpf_prog *prog = __get_bpf(ufd, sk);
    int err;
 
    if (IS_ERR(prog))
        return PTR_ERR(prog);
 
    err = __sk_attach_prog(prog, sk);
    if (err < 0) {
        bpf_prog_put(prog);
        return err;
    }
 
    return 0;
}
static int __sk_attach_prog(struct bpf_prog *prog, struct sock *sk)
{
    // Create the sk_filter object:
    // struct sk_filter {
    //      refcount_t  refcnt;
    //      struct rcu_head rcu;
    //      struct bpf_prog *prog;
    // };
    struct sk_filter *fp, *old_fp;
 
    fp = kmalloc(sizeof(*fp), GFP_KERNEL);
    if (!fp)
        return -ENOMEM;
 
    fp->prog = prog;
 
    // Charge the filter's memory to the socket (option-memory accounting); on failure, free fp
    if (!__sk_filter_charge(sk, fp)) {
        kfree(fp);
        return -ENOMEM;
    }
    refcount_set(&fp->refcnt, 1);
 
    // Fetch the socket's previous filter
    old_fp = rcu_dereference_protected(sk->sk_filter,
                       lockdep_sock_is_held(sk));
    // Point sk->sk_filter at the newly allocated fp
    rcu_assign_pointer(sk->sk_filter, fp);
 
    // If an old filter was displaced, release it
    if (old_fp)
        sk_filter_uncharge(sk, old_fp);
 
    return 0;
}
int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap);
static inline int sk_filter(struct sock *sk, struct sk_buff *skb)
{
    return sk_filter_trim_cap(sk, skb, 1);
}
/**
 *  sk_filter_trim_cap - run a packet through a socket filter
 *  @sk: sock associated with &sk_buff
 *  @skb: buffer to filter
 *  @cap: limit on how short the eBPF program may trim the packet
 *
 * Run the eBPF program and then cut skb->data to correct size returned by
 * the program. If pkt_len is 0 we toss packet. If skb->len is smaller
 * than pkt_len we keep whole skb->data. This is the socket level
 * wrapper to bpf_prog_run. It returns 0 if the packet should
 * be accepted or -EPERM if the packet should be tossed.
 *
 */
int sk_filter_trim_cap(struct sock *sk, struct sk_buff *skb, unsigned int cap)
{
    int err;
    struct sk_filter *filter;
 
    /*
     * If the skb was allocated from pfmemalloc reserves, only
     * allow SOCK_MEMALLOC sockets to use it as this socket is
     * helping free memory
     */
    // Check whether the skb was allocated from the pfmemalloc reserves
    if (skb_pfmemalloc(skb) && !sock_flag(sk, SOCK_MEMALLOC)) {
        NET_INC_STATS(sock_net(sk), LINUX_MIB_PFMEMALLOCDROP);
        return -ENOMEM;
    }
    err = BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb);
    if (err)
        return err;
 
    // Hook point of the LSM framework
    err = security_sock_rcv_skb(sk, skb);
    if (err)
        return err;
 
    rcu_read_lock();
    filter = rcu_dereference(sk->sk_filter);
    if (filter) {
        struct sock *save_sk = skb->sk;
        unsigned int pkt_len;
 
        skb->sk = sk;
        pkt_len = bpf_prog_run_save_cb(filter->prog, skb);
        skb->sk = save_sk;
        err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
    }
    rcu_read_unlock();
 
    return err;
}
EXPORT_SYMBOL(sk_filter_trim_cap);
/* Wrappers for __cgroup_bpf_run_filter_skb() guarded by cgroup_bpf_enabled. */
#define BPF_CGROUP_RUN_PROG_INET_INGRESS(sk, skb)                 \
({                                        \
    int __ret = 0;                                \
    if (cgroup_bpf_enabled(CGROUP_INET_INGRESS))              \
        __ret = __cgroup_bpf_run_filter_skb(sk, skb,              \
                            CGROUP_INET_INGRESS); \
                                          \
    __ret;                                    \
})
rcu_read_lock();
filter = rcu_dereference(sk->sk_filter);
if (filter) {
    struct sock *save_sk = skb->sk;
    unsigned int pkt_len;
 
    skb->sk = sk;
    pkt_len = bpf_prog_run_save_cb(filter->prog, skb);
    skb->sk = save_sk;
    err = pkt_len ? pskb_trim(skb, max(cap, pkt_len)) : -EPERM;
}
rcu_read_unlock();
static inline u32 bpf_prog_run_save_cb(const struct bpf_prog *prog,
                       struct sk_buff *skb)
{
    u32 res;
 
    migrate_disable();
    res = __bpf_prog_run_save_cb(prog, skb);
    migrate_enable();
    return res;
}
 
/* Must be invoked with migration disabled */
static inline u32 __bpf_prog_run_save_cb(const struct bpf_prog *prog,
                     const void *ctx)
{
    const struct sk_buff *skb = ctx;
    u8 *cb_data = bpf_skb_cb(skb);
    u8 cb_saved[BPF_SKB_CB_LEN];
    u32 res;
 
    if (unlikely(prog->cb_access)) {
        memcpy(cb_saved, cb_data, sizeof(cb_saved));
        memset(cb_data, 0, sizeof(cb_saved));
    }
 
    res = bpf_prog_run(prog, skb);
 
    if (unlikely(prog->cb_access))
        memcpy(cb_data, cb_saved, sizeof(cb_saved));
 
    return res;
}
static inline u8 *bpf_skb_cb(const struct sk_buff *skb)
{
    /* eBPF programs may read/write skb->cb[] area to transfer meta
     * data between tail calls. Since this also needs to work with
     * tc, that scratch memory is mapped to qdisc_skb_cb's data area.
     *
     * In some socket filter cases, the cb unfortunately needs to be
     * saved/restored so that protocol specific skb->cb[] data won't
     * be lost. In any case, due to unpriviledged eBPF programs
     * attached to sockets, we need to clear the bpf_skb_cb() area
     * to not leak previous contents to user space.
     */
    BUILD_BUG_ON(sizeof_field(struct __sk_buff, cb) != BPF_SKB_CB_LEN);
    BUILD_BUG_ON(sizeof_field(struct __sk_buff, cb) !=
             sizeof_field(struct qdisc_skb_cb, data));
 
    return qdisc_skb_cb(skb)->data;
}
static __always_inline u32 bpf_prog_run(const struct bpf_prog *prog, const void *ctx)
{
    return __bpf_prog_run(prog, ctx, bpf_dispatcher_nop_func);
}
static __always_inline u32 __bpf_prog_run(const struct bpf_prog *prog,
                      const void *ctx,
                      bpf_dispatcher_fn dfunc)
{
    u32 ret;
 
    cant_migrate();
    if (static_branch_unlikely(&bpf_stats_enabled_key)) {
        struct bpf_prog_stats *stats;
        u64 start = sched_clock();
        unsigned long flags;
 
        ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
        stats = this_cpu_ptr(prog->stats);
        flags = u64_stats_update_begin_irqsave(&stats->syncp);
        u64_stats_inc(&stats->cnt);
        u64_stats_add(&stats->nsecs, sched_clock() - start);
        u64_stats_update_end_irqrestore(&stats->syncp, flags);
    } else {
        ret = dfunc(ctx, prog->insnsi, prog->bpf_func);
    }
    return ret;
}
int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
    int err;
 
    err = sk_filter(sk, skb);
    if (err)
        return err;
 
    return __sock_queue_rcv_skb(sk, skb);
}
EXPORT_SYMBOL(sock_queue_rcv_skb);
static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
    skb = skb_share_check(skb, GFP_ATOMIC);
    if (!skb)
        return NET_RX_DROP;
 
    if (sock_queue_rcv_skb(sk, skb) < 0) {
        kfree_skb(skb);
        return NET_RX_DROP;
    }
 
    return NET_RX_SUCCESS;
}
static int raw_rcv_skb(struct sock *sk, struct sk_buff *skb)
{
    /* Charge it to the socket. */
 
    ipv4_pktinfo_prepare(sk, skb);
    if (sock_queue_rcv_skb(sk, skb) < 0) {
        kfree_skb(skb);
        return NET_RX_DROP;
    }
 
    return NET_RX_SUCCESS;
}
int raw_rcv(struct sock *sk, struct sk_buff *skb)
{
    // Security (xfrm) policy check
    if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb)) {
        atomic_inc(&sk->sk_drops);
        kfree_skb(skb);
        return NET_RX_DROP;
    }
    // Netfilter: reset the connection-tracking state attached to the skb
    nf_reset_ct(skb);
 
    skb_push(skb, skb->data - skb_network_header(skb));
 
    raw_rcv_skb(sk, skb);
    return 0;
}
/* IP input processing comes here for RAW socket delivery.
 * Caller owns SKB, so we must make clones.
 *
 * RFC 1122: SHOULD pass TOS value up to the transport layer.
 * -> It does. And not only TOS, but all IP header.
 */
static int raw_v4_input(struct sk_buff *skb, const struct iphdr *iph, int hash)
{
    ......
    // Look up the matching sockets, starting from the receiving network device
    net = dev_net(skb->dev);
    sk = __raw_v4_lookup(net, __sk_head(head), iph->protocol,
                 iph->saddr, iph->daddr, dif, sdif);
 
    while (sk) {
        delivered = 1;
        if ((iph->protocol != IPPROTO_ICMP || !icmp_filter(sk, skb)) &&
            ip_mc_sf_allow(sk, iph->daddr, iph->saddr,
                   skb->dev->ifindex, sdif)) {
            // Clone so that each socket owns its packet instead of sharing one
            struct sk_buff *clone = skb_clone(skb, GFP_ATOMIC);
 
            /* Not releasing hash table! */
            if (clone)
                raw_rcv(sk, clone);
        }
        sk = __raw_v4_lookup(net, sk_next(sk), iph->protocol,
                     iph->saddr, iph->daddr,
                     dif, sdif);
    }
out:
    read_unlock(&raw_v4_hashinfo.lock);
    return delivered;
}

Last edited by 天堂猪0ink on 2024-5-11 18:55. Reason: style edits