[Original] A Brief Analysis of the Linux System Call Mechanism
Posted: 2021-10-27 17:08

Author: erfze

This article does not cover prerequisite topics such as CPU privilege levels, interrupts, MSRs, segmentation, and paging. If you have not encountered these before, it is recommended to read the corresponding chapters of the Intel SDM or reference [1] before continuing. This article is based on the following environment:

We begin with a source-level analysis of the legacy system call mechanism, int 0x80. The IDT (Interrupt Descriptor Table) is set up in arch/x86/kernel/traps.c:

The idt_setup_traps() function is defined in arch/x86/kernel/idt.c:

It in turn calls idt_setup_from_table, located in the same file:

def_idts stores the default values of the IDT entries and is defined as follows:

The IA32_SYSCALL_VECTOR entry differs depending on the configuration: if CONFIG_IA32_EMULATION is enabled, 32-bit programs run in 64-bit compatibility mode; otherwise the kernel itself is 32-bit. IA32_SYSCALL_VECTOR is defined as follows:

INTG and SYSG differ only in the DPL:

The related definitions are as follows:

Gate descriptors and their types are defined as follows (in arch/x86/include/asm/desc_defs.h):

This corresponds to the Intel SDM:

The idt_init_desc function is defined as follows:

write_idt_entry is a thin wrapper around memcpy:

With that, the system call handler's address has been written into IDT entry 0x80. The call chain of the functions above is:

entry_INT80_32 is defined in arch/x86/entry/entry_32.S:

The main code that executes the system call lives in do_int80_syscall_32 (arch/x86/entry/common.c):

do_syscall_32_irqs_on is defined as follows:

The call chain of the functions above is:

ia32_sys_call_table is defined in syscall_32.c in the same directory:

sys_ni_syscall (kernel/sys_ni.c), shown below, corresponds to unimplemented system calls:

The asm/syscalls_32.h file is generated from syscall_32.tbl by the syscalltbl.sh script; the build rules are in arch/x86/entry/syscalls/Makefile:

syscall_32.tbl stores the system call names, numbers, and entry points:

syscall_32.c contains the following macro definitions:

The ia32_sys_call_table array therefore takes the following form:

The macro #define __SYSCALL_I386(nr, sym, qual) [nr] = sym, defines the entries of ia32_sys_call_table, indexed by system call number, while #define __SYSCALL_I386(nr, sym, qual) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long); declares the entry point of each system call function.

Thus ia32_sys_call_table[nr]((unsigned int)regs->bx, (unsigned int)regs->cx, (unsigned int)regs->dx, (unsigned int)regs->si, (unsigned int)regs->di, (unsigned int)regs->bp); invokes the function that actually implements the call. Take sys_restart_syscall as an example; it is defined in kernel/signal.c:
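The initializer pattern relied on here, a range designated initializer that fills every slot with &sys_ni_syscall followed by per-entry overrides pulled in from the generated header, can be reproduced in a small stand-alone sketch. The table, function names, and numbers below are made up for illustration, and the `[first ... last]` range form is a GCC extension, just as in the kernel:

```c
#include <assert.h>

typedef long (*sys_call_ptr_t)(void);

static long sys_ni(void)        { return -38; }  /* models -ENOSYS */
static long sys_restart(void)   { return 0; }
static long sys_exit_demo(void) { return 1; }

#define NR_MAX 9

/* Later designated initializers override the earlier range fill,
 * just as [0 ... __NR_syscall_compat_max] = &sys_ni_syscall is
 * overridden by the entries included from asm/syscalls_32.h. */
static const sys_call_ptr_t demo_table[NR_MAX + 1] = {
    [0 ... NR_MAX] = &sys_ni,
    [0] = sys_restart,
    [1] = sys_exit_demo,
};

long demo_dispatch(unsigned int nr)
{
    /* Mirrors the nr < IA32_NR_syscalls bounds check. */
    if (nr <= NR_MAX)
        return demo_table[nr]();
    return -38;
}
```

Any number without an explicit override falls through to the sys_ni default, which is exactly how unimplemented syscalls end up returning -ENOSYS.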

The SYSCALL_DEFINE family of macros is defined in include/linux/syscalls.h:
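The core trick in these macros is token pasting: the macro glues the sys_ prefix (plus the SyS/SYSC variants in this kernel version) onto the given name and supplies the signature. A stripped-down model, with a hypothetical MY_SYSCALL_DEFINE0 standing in for the real SYSCALL_DEFINE0:

```c
#include <assert.h>

/* Simplified model of SYSCALL_DEFINE0: paste sys_ onto the name and
 * supply the no-argument signature, so that
 * MY_SYSCALL_DEFINE0(restart_syscall) defines sys_restart_syscall(). */
#define MY_SYSCALL_DEFINE0(sname) long sys_##sname(void)

MY_SYSCALL_DEFINE0(restart_syscall)
{
    return 42;  /* stand-in body for illustration */
}
```

The real macros additionally emit metadata and, for the argument-taking variants, the SyS##name wrapper that sign-extends and re-casts each argument.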

The system call returns via the IRET instruction:

The register values it pops were saved on the stack when the interrupt occurred:

Next, we dissect the legacy system call path through dynamic debugging (the environment is built with QEMU + GDB + BusyBox). After setting a breakpoint at entry_INT80_32 and typing the clear command, the breakpoint is hit:

Examine the register values on the stack:

It is indeed a legacy INT $0x80 system call:

The system call number and related register values are saved:

The regs argument is passed to do_int80_syscall_32, whose members are then referenced:

The corresponding source is:

The pt_regs structure is defined as follows:

The system call number is then used to dispatch into the function that actually implements the call:

The VM bit in EFLAGS, the TI bit in SS, and the RPL in CS are checked:

If the TI bit is not set, the GDT is used for indexing. The registers saved by SAVE_ALL are then restored (the push and pop order matches the layout defined in pt_regs), and the IRET instruction returns to the calling program:

The return value was saved on the stack earlier by do_syscall_32_irqs_on:

When RESTORE_REGS restores the registers, the value is popped into EAX so it can be passed back to the caller.

According to the Intel SDM, the SYSENTER instruction requires the following three MSRs to be set up beforehand:

When the SYSENTER instruction executes, the CPU does the following:

In the Linux source, these three MSRs are set in the syscall_init function (arch/x86/kernel/cpu/common.c):

The CONFIG_IA32_EMULATION option must be enabled at build time. entry_SYSENTER_compat is defined in arch/x86/entry/entry_64_compat.S:

For SWAPGS, see reference [7]:

do_fast_syscall_32 calls do_syscall_32_irqs_on:

The remainder of this function is described later.

The following code serves as an example (invoking a system call this way is not recommended; the code below is for demonstration only):

The breakpoint at entry_SYSENTER_compat is hit:

regs is passed to do_fast_syscall_32:

Note that the offset of its orig_ax member has changed compared to before; this is because the structure backing regs here is defined as:

Control returns to the calling program via the SYSRET instruction:

The Intel SDM describes this instruction as follows:

Strictly speaking, the example in the previous subsection does not follow the system call convention; in testing, manually executing SYSENTER produced errors. The example for this subsection is:

It is compiled statically, targeting a 32-bit platform. Tracing the open function call:

The corresponding source is in arch/x86/entry/vdso/vdso32/system_call.S:

As for the system call instruction, SYSENTER or SYSCALL is chosen depending on the platform; if neither is supported, the legacy int $0x80 is executed.
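That selection boils down to a simple preference order. A hypothetical sketch of the decision (in reality the vDSO entry is patched via ALTERNATIVE at boot rather than decided by a C function):

```c
#include <assert.h>
#include <string.h>

/* Preference order used by the 32-bit vDSO entry point: SYSENTER if
 * the CPU supports it, else SYSCALL, else the legacy int $0x80. */
const char *pick_syscall_insn(int has_sysenter, int has_syscall)
{
    if (has_sysenter)
        return "sysenter";
    if (has_syscall)
        return "syscall";
    return "int $0x80";
}
```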

Intel SDM:

The setup is again in the syscall_init function:

entry_SYSCALL_64 executes the system call as follows:

The calling convention is:

The return again uses the SYSRET instruction:

VDSO stands for Virtual Dynamic Shared Object. It is mapped into user address space and can be called directly by user programs, yet it has no backing file; the kernel maps it directly:

Its exported functions are listed in arch/x86/entry/vdso/vdso.lds.S:

Take gettimeofday as an example; it is defined in vclock_gettime.c in the same directory:

When a user calls gettimeofday, what actually executes is __vdso_gettimeofday. Example code:
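The original sample program is not reproduced here; a minimal equivalent that exercises this path might look like the following (glibc routes gettimeofday through __vdso_gettimeofday when the vDSO provides it):

```c
#include <assert.h>
#include <stddef.h>
#include <sys/time.h>

/* Return the current wall-clock time in seconds, or -1 on failure.
 * On x86 Linux this call is normally satisfied inside the vDSO,
 * without entering the kernel at all. */
long wallclock_seconds(void)
{
    struct timeval tv;
    if (gettimeofday(&tv, NULL) != 0)
        return -1;
    return (long)tv.tv_sec;
}
```

Stepping through this under a debugger and comparing the executing addresses against /proc/self/maps shows whether the vdso mapping was used.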

After compiling, trace the gettimeofday function call:

Check the process's memory mappings:

The executed instructions are indeed mapped within the vdso region.

void __init trap_init(void)
{
    /* Init cpu_entry_area before IST entries are set up */
    setup_cpu_entry_areas();
 
    idt_setup_traps();
 
    /*
     * Set the IDT descriptor to a fixed read-only location, so that the
     * "sidt" instruction will not leak the location of the kernel, and
     * to defend the IDT against arbitrary memory write vulnerabilities.
     * It will be reloaded in cpu_init() */
    cea_set_pte(CPU_ENTRY_AREA_RO_IDT_VADDR, __pa_symbol(idt_table),
            PAGE_KERNEL_RO);
    idt_descr.address = CPU_ENTRY_AREA_RO_IDT;
 
    /*
     * Should be a barrier for any external CPU state:
     */
    cpu_init();
 
    idt_setup_ist_traps();
 
    x86_init.irqs.trap_init();
 
    idt_setup_debugidt_traps();
}
/**
 * idt_setup_traps - Initialize the idt table with default traps
 */
void __init idt_setup_traps(void)
{
    idt_setup_from_table(idt_table, def_idts, ARRAY_SIZE(def_idts), true);
}
static void
idt_setup_from_table(gate_desc *idt, const struct idt_data *t, int size, bool sys)
{
    gate_desc desc;
 
    for (; size > 0; t++, size--) {
        idt_init_desc(&desc, t);
        write_idt_entry(idt, t->vector, &desc);
        if (sys)
            set_bit(t->vector, system_vectors);
    }
}
/*
 * The default IDT entries which are set up in trap_init() before
 * cpu_init() is invoked. Interrupt stacks cannot be used at that point and
 * the traps which use them are reinitialized with IST after cpu_init() has
 * set up TSS.
 */
static const __initconst struct idt_data def_idts[] = {
    INTG(X86_TRAP_DE,        divide_error),
    INTG(X86_TRAP_NMI,        nmi),
    INTG(X86_TRAP_BR,        bounds),
    INTG(X86_TRAP_UD,        invalid_op),
    INTG(X86_TRAP_NM,        device_not_available),
    INTG(X86_TRAP_OLD_MF,        coprocessor_segment_overrun),
    INTG(X86_TRAP_TS,        invalid_TSS),
    INTG(X86_TRAP_NP,        segment_not_present),
    INTG(X86_TRAP_SS,        stack_segment),
    INTG(X86_TRAP_GP,        general_protection),
    INTG(X86_TRAP_SPURIOUS,        spurious_interrupt_bug),
    INTG(X86_TRAP_MF,        coprocessor_error),
    INTG(X86_TRAP_AC,        alignment_check),
    INTG(X86_TRAP_XF,        simd_coprocessor_error),
 
#ifdef CONFIG_X86_32
    TSKG(X86_TRAP_DF,        GDT_ENTRY_DOUBLEFAULT_TSS),
#else
    INTG(X86_TRAP_DF,        double_fault),
#endif
    INTG(X86_TRAP_DB,        debug),
 
#ifdef CONFIG_X86_MCE
    INTG(X86_TRAP_MC,        &machine_check),
#endif
 
    SYSG(X86_TRAP_OF,        overflow),
#if defined(CONFIG_IA32_EMULATION)
    SYSG(IA32_SYSCALL_VECTOR,    entry_INT80_compat),
#elif defined(CONFIG_X86_32)
    SYSG(IA32_SYSCALL_VECTOR,    entry_INT80_32),
#endif
};
#define IA32_SYSCALL_VECTOR        0x80
/* Interrupt gate */
#define INTG(_vector, _addr)                \
    G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL0, __KERNEL_CS)
 
/* System interrupt gate */
#define SYSG(_vector, _addr)                \
    G(_vector, _addr, DEFAULT_STACK, GATE_INTERRUPT, DPL3, __KERNEL_CS)
#define DPL0        0x0
#define DPL3        0x3
 
#define DEFAULT_STACK    0
 
#define G(_vector, _addr, _ist, _type, _dpl, _segment)    \
    {                        \
        .vector        = _vector,        \
        .bits.ist    = _ist,            \
        .bits.type    = _type,        \
        .bits.dpl    = _dpl,            \
        .bits.p        = 1,            \
        .addr        = _addr,        \
        .segment    = _segment,        \
    }
struct gate_struct {
    u16        offset_low;
    u16        segment;
    struct idt_bits    bits;
    u16        offset_middle;
#ifdef CONFIG_X86_64
    u32        offset_high;
    u32        reserved;
#endif
} __attribute__((packed));
 
enum {
    GATE_INTERRUPT = 0xE,
    GATE_TRAP = 0xF,
    GATE_CALL = 0xC,
    GATE_TASK = 0x5,
};
 
 
static inline void idt_init_desc(gate_desc *gate, const struct idt_data *d)
{
    unsigned long addr = (unsigned long) d->addr;
 
    gate->offset_low    = (u16) addr;
    gate->segment        = (u16) d->segment;
    gate->bits        = d->bits;
    gate->offset_middle    = (u16) (addr >> 16);
#ifdef CONFIG_X86_64
    gate->offset_high    = (u32) (addr >> 32);
    gate->reserved        = 0;
#endif
}
#define write_idt_entry(dt, entry, g)        native_write_idt_entry(dt, entry, g)
......
static inline void native_write_idt_entry(gate_desc *idt, int entry, const gate_desc *gate)
{
    memcpy(&idt[entry], gate, sizeof(*gate));
}
 
 
ENTRY(entry_INT80_32)
    ASM_CLAC
    pushl    %eax            /* pt_regs->orig_ax */
    SAVE_ALL pt_regs_ax=$-ENOSYS    /* save rest */
 
    /*
     * User mode is traced as though IRQs are on, and the interrupt gate
     * turned them off.
     */
    TRACE_IRQS_OFF
 
    movl    %esp, %eax
    call    do_int80_syscall_32
.Lsyscall_32_done:
 
restore_all:
    TRACE_IRQS_IRET
.Lrestore_all_notrace:
#ifdef CONFIG_X86_ESPFIX32
    ALTERNATIVE    "jmp .Lrestore_nocheck", "", X86_BUG_ESPFIX
 
    movl    PT_EFLAGS(%esp), %eax        # mix EFLAGS, SS and CS
    /*
     * Warning: PT_OLDSS(%esp) contains the wrong/random values if we
     * are returning to the kernel.
     * See comments in process.c:copy_thread() for details.
     */
    movb    PT_OLDSS(%esp), %ah
    movb    PT_CS(%esp), %al
    andl    $(X86_EFLAGS_VM | (SEGMENT_TI_MASK << 8) | SEGMENT_RPL_MASK), %eax
    cmpl    $((SEGMENT_LDT << 8) | USER_RPL), %eax
    je .Lldt_ss                # returning to user-space with LDT SS
#endif
.Lrestore_nocheck:
    RESTORE_REGS 4                # skip orig_eax/error_code
.Lirq_return:
    INTERRUPT_RETURN
 
.section .fixup, "ax"
ENTRY(iret_exc    )
    pushl    $0                # no error code
    pushl    $do_iret_error
    jmp    common_exception
.previous
    _ASM_EXTABLE(.Lirq_return, iret_exc)
 
#ifdef CONFIG_X86_ESPFIX32
.Lldt_ss:
/*
 * Setup and switch to ESPFIX stack
 *
 * We're returning to userspace with a 16 bit stack. The CPU will not
 * restore the high word of ESP for us on executing iret... This is an
 * "official" bug of all the x86-compatible CPUs, which we can work
 * around to make dosemu and wine happy. We do this by preloading the
 * high word of ESP with the high word of the userspace ESP while
 * compensating for the offset by changing to the ESPFIX segment with
 * a base address that matches for the difference.
 */
#define GDT_ESPFIX_SS PER_CPU_VAR(gdt_page) + (GDT_ENTRY_ESPFIX_SS * 8)
    mov    %esp, %edx            /* load kernel esp */
    mov    PT_OLDESP(%esp), %eax        /* load userspace esp */
    mov    %dx, %ax            /* eax: new kernel esp */
    sub    %eax, %edx            /* offset (low word is 0) */
    shr    $16, %edx
    mov    %dl, GDT_ESPFIX_SS + 4        /* bits 16..23 */
    mov    %dh, GDT_ESPFIX_SS + 7        /* bits 24..31 */
    pushl    $__ESPFIX_SS
    pushl    %eax                /* new kernel esp */
    /*
     * Disable interrupts, but do not irqtrace this section: we
     * will soon execute iret and the tracer was already set to
     * the irqstate after the IRET:
     */
    DISABLE_INTERRUPTS(CLBR_ANY)
    lss    (%esp), %esp            /* switch to espfix segment */
    jmp    .Lrestore_nocheck
#endif
ENDPROC(entry_INT80_32)
/* Handles int $0x80 */
__visible void do_int80_syscall_32(struct pt_regs *regs)
{
    enter_from_user_mode();
    local_irq_enable();
    do_syscall_32_irqs_on(regs);
}
#if defined(CONFIG_X86_32) || defined(CONFIG_IA32_EMULATION)
/*
 * Does a 32-bit syscall.  Called with IRQs on in CONTEXT_KERNEL.  Does
 * all entry and exit work and returns with IRQs off.  This function is
 * extremely hot in workloads that use it, and it's usually called from
 * do_fast_syscall_32, so forcibly inline it to improve performance.
 */
static __always_inline void do_syscall_32_irqs_on(struct pt_regs *regs)
{
    struct thread_info *ti = current_thread_info();
    unsigned int nr = (unsigned int)regs->orig_ax;
 
#ifdef CONFIG_IA32_EMULATION
    current->thread.status |= TS_COMPAT;
#endif
 
    if (READ_ONCE(ti->flags) & _TIF_WORK_SYSCALL_ENTRY) {
        /*
         * Subtlety here: if ptrace pokes something larger than
         * 2^32-1 into orig_ax, this truncates it.  This may or
         * may not be necessary, but it matches the old asm
         * behavior.
         */
        nr = syscall_trace_enter(regs);
    }
 
    if (likely(nr < IA32_NR_syscalls)) {
        /*
         * It's possible that a 32-bit syscall implementation
         * takes a 64-bit parameter but nonetheless assumes that
         * the high bits are zero.  Make sure we zero-extend all
         * of the args.
         */
        regs->ax = ia32_sys_call_table[nr](
            (unsigned int)regs->bx, (unsigned int)regs->cx,
            (unsigned int)regs->dx, (unsigned int)regs->si,
            (unsigned int)regs->di, (unsigned int)regs->bp);
    }
 
    syscall_return_slowpath(regs);
}
 
 
extern asmlinkage long sys_ni_syscall(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long);
 
__visible const sys_call_ptr_t ia32_sys_call_table[__NR_syscall_compat_max+1] = {
    /*
     * Smells like a compiler bug -- it doesn't work
     * when the & below is removed.
     */
    [0 ... __NR_syscall_compat_max] = &sys_ni_syscall,
#include <asm/syscalls_32.h>
};
/*
 * Non-implemented system calls get redirected here.
 */
asmlinkage long sys_ni_syscall(void)
{
    return -ENOSYS;
}
syscall32 := $(srctree)/$(src)/syscall_32.tbl
syscall64 := $(srctree)/$(src)/syscall_64.tbl
 
syshdr := $(srctree)/$(src)/syscallhdr.sh
systbl := $(srctree)/$(src)/syscalltbl.sh
......
$(out)/syscalls_32.h: $(syscall32) $(systbl)
    $(call if_changed,systbl)
$(out)/syscalls_64.h: $(syscall64) $(systbl)
    $(call if_changed,systbl)
 
 
#define __SYSCALL_I386(nr, sym, qual) extern asmlinkage long sym(unsigned long, unsigned long, unsigned long, unsigned long, unsigned long, unsigned long) ;
#include <asm/syscalls_32.h>
#undef __SYSCALL_I386
 
#define __SYSCALL_I386(nr, sym, qual) [nr] = sym,
[0 ... __NR_syscall_compat_max] = &sys_ni_syscall,
[0] = sys_restart_syscall,
[1] = sys_exit,
......
 
/**
 *  sys_restart_syscall - restart a system call
 */
SYSCALL_DEFINE0(restart_syscall)
{
    struct restart_block *restart = &current->restart_block;
    return restart->fn(restart);
}
#define SYSCALL_METADATA(sname, nb, ...)
 
static inline int is_syscall_trace_event(struct trace_event_call *tp_event)
{
    return 0;
}
#endif
 
#define SYSCALL_DEFINE0(sname)                    \
    SYSCALL_METADATA(_##sname, 0);                \
    asmlinkage long sys_##sname(void)
 
#define SYSCALL_DEFINE1(name, ...) SYSCALL_DEFINEx(1, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE2(name, ...) SYSCALL_DEFINEx(2, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE3(name, ...) SYSCALL_DEFINEx(3, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE4(name, ...) SYSCALL_DEFINEx(4, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE5(name, ...) SYSCALL_DEFINEx(5, _##name, __VA_ARGS__)
#define SYSCALL_DEFINE6(name, ...) SYSCALL_DEFINEx(6, _##name, __VA_ARGS__)
 
#define SYSCALL_DEFINE_MAXARGS    6
 
#define SYSCALL_DEFINEx(x, sname, ...)                \
    SYSCALL_METADATA(sname, x, __VA_ARGS__)            \
    __SYSCALL_DEFINEx(x, sname, __VA_ARGS__)
 
#define __PROTECT(...) asmlinkage_protect(__VA_ARGS__)
#define __SYSCALL_DEFINEx(x, name, ...)                    \
    asmlinkage long sys##name(__MAP(x,__SC_DECL,__VA_ARGS__))    \
        __attribute__((alias(__stringify(SyS##name))));        \
    static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__));    \
    asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__));    \
    asmlinkage long SyS##name(__MAP(x,__SC_LONG,__VA_ARGS__))    \
    {                                \
        long ret = SYSC##name(__MAP(x,__SC_CAST,__VA_ARGS__));    \
        __MAP(x,__SC_TEST,__VA_ARGS__);                \
        __PROTECT(x, ret,__MAP(x,__SC_ARGS,__VA_ARGS__));    \
        return ret;                        \
    }                                \
    static inline long SYSC##name(__MAP(x,__SC_DECL,__VA_ARGS__))
 
unsigned int nr = (unsigned int)regs->orig_ax;
......
if (likely(nr < IA32_NR_syscalls)) {
        /*
         * It's possible that a 32-bit syscall implementation
         * takes a 64-bit parameter but nonetheless assumes that
         * the high bits are zero.  Make sure we zero-extend all
         * of the args.
         */
        regs->ax = ia32_sys_call_table[nr](
            (unsigned int)regs->bx, (unsigned int)regs->cx,
            (unsigned int)regs->dx, (unsigned int)regs->si,
            (unsigned int)regs->di, (unsigned int)regs->bp);
    }
