[Original] What is runC?
2022-1-11 11:19
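What is runC
The OCI standard
A container runtime is the tool that manages and runs containers. There are many container tools today, such as docker and rkt, but if every tool shipped its own runtime, that would hinder the development of the container field. Several container vendors therefore jointly defined standards for the container image format and the container runtime: the Open Container Initiative (OCI).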
OCI bundle
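An OCI bundle is a set of files that satisfies the OCI standard. These files contain all the data needed to run a container and are stored in one common directory, which contains the following two items:
- config.json: the configuration data for running the container
- the container's root filesystem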
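The runC framework
This is runC's main code logic. The libcontainer part was a core building block of early docker and was wrapped a second time to fit the OCI format.
Taking runc create as an example, its main operations are as follows:
startContainer: reads config.json, converts the configuration into the in-memory data structures defined by the OCI standard, attempts to create the container, and performs different operations depending on the arguments, such as run, start, or Restore.
The data structures for a container are as follows. An interface is defined here that contains all the operations a container needs: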
```go
type BaseContainer interface {
	// Returns the ID of the container
	ID() string

	// Returns the current status of the container.
	Status() (Status, error)

	// State returns the current container's state information.
	State() (*State, error)

	// OCIState returns the current container's state information.
	OCIState() (*specs.State, error)

	// Returns the current config of the container.
	Config() configs.Config

	// Returns the PIDs inside this container. The PIDs are in the namespace of the calling process.
	//
	// Some of the returned PIDs may no longer refer to processes in the Container, unless
	// the Container state is PAUSED in which case every PID in the slice is valid.
	Processes() ([]int, error)

	// Returns statistics for the container.
	Stats() (*Stats, error)

	// Set resources of container as configured
	//
	// We can use this to change resources when containers are running.
	//
	Set(config configs.Config) error

	// Start a process inside the container. Returns error if process fails to
	// start. You can track process lifecycle with passed Process structure.
	Start(process *Process) (err error)

	// Run immediately starts the process inside the container. Returns error if process
	// fails to start. It does not block waiting for the exec fifo after start returns but
	// opens the fifo after start returns.
	Run(process *Process) (err error)

	// Destroys the container, if its in a valid state, after killing any
	// remaining running processes.
	//
	// Any event registrations are removed before the container is destroyed.
	// No error is returned if the container is already destroyed.
	//
	// Running containers must first be stopped using Signal(..).
	// Paused containers must first be resumed using Resume(..).
	Destroy() error

	// Signal sends the provided signal code to the container's initial process.
	//
	// If all is specified the signal is sent to all processes in the container
	// including the initial process.
	Signal(s os.Signal, all bool) error

	// Exec signals the container to exec the users process at the end of the init.
	Exec() error
}
```
On the Linux platform, this interface is wrapped to produce a Linux-specific interface:
```go
// Container is a libcontainer container object.
//
// Each container is thread-safe within the same process. Since a container can
// be destroyed by a separate process, any function may return that the container
// was not found.
type Container interface {
	BaseContainer

	// Methods below here are platform specific

	// Checkpoint checkpoints the running container's state to disk using the criu(8) utility.
	Checkpoint(criuOpts *CriuOpts) error

	// Restore restores the checkpointed container to a running state using the criu(8) utility.
	Restore(process *Process, criuOpts *CriuOpts) error

	// If the Container state is RUNNING or CREATED, sets the Container state to PAUSING and pauses
	// the execution of any user processes. Asynchronously, when the container finished being paused the
	// state is changed to PAUSED.
	// If the Container state is PAUSED, do nothing.
	Pause() error

	// If the Container state is PAUSED, resumes the execution of any user processes in the
	// Container before setting the Container state to RUNNING.
	// If the Container state is RUNNING, do nothing.
	Resume() error

	// NotifyOOM returns a read-only channel signaling when the container receives an OOM notification.
	NotifyOOM() (<-chan struct{}, error)

	// NotifyMemoryPressure returns a read-only channel signaling when the container reaches a given pressure level
	NotifyMemoryPressure(level PressureLevel) (<-chan struct{}, error)
}
```
Another important interface is Factory:
```go
type Factory interface {
	// Creates a new container with the given id and starts the initial process inside it.
	// id must be a string containing only letters, digits and underscores and must contain
	// between 1 and 1024 characters, inclusive.
	//
	// The id must not already be in use by an existing container. Containers created using
	// a factory with the same path (and filesystem) must have distinct ids.
	//
	// Returns the new container with a running process.
	//
	// On error, any partially created container parts are cleaned up (the operation is atomic).
	Create(id string, config *configs.Config) (Container, error)

	// Load takes an ID for an existing container and returns the container information
	// from the state. This presents a read only view of the container.
	Load(id string) (Container, error)

	// StartInitialization is an internal API to libcontainer used during the reexec of the
	// container.
	StartInitialization() error

	// Type returns info string about factory type (e.g. lxc, libcontainer...)
	Type() string
}
```
The Factory interface also has a corresponding Linux implementation:
```go
// LinuxFactory implements the default factory interface for linux based systems.
type LinuxFactory struct {
	// Root directory for the factory to store state.
	Root string

	// InitPath is the path for calling the init responsibilities for spawning
	// a container.
	InitPath string

	// InitArgs are arguments for calling the init responsibilities for spawning
	// a container.
	InitArgs []string

	// CriuPath is the path to the criu binary used for checkpoint and restore of
	// containers.
	CriuPath string

	// New{u,g}idmapPath is the path to the binaries used for mapping with
	// rootless containers.
	NewuidmapPath string
	NewgidmapPath string

	// Validator provides validation to container configurations.
	Validator validate.Validator

	// NewIntelRdtManager returns an initialized Intel RDT manager for a single container.
	NewIntelRdtManager func(config *configs.Config, id string, path string) intelrdt.Manager
}
```
The Create implementation in LinuxFactory simply creates a linuxContainer (which corresponds to the Linux Container interface described earlier):
```go
type linuxContainer struct {
	id                   string
	root                 string
	config               *configs.Config
	cgroupManager        cgroups.Manager
	intelRdtManager      intelrdt.Manager
	initPath             string
	initArgs             []string
	initProcess          parentProcess
	initProcessStartTime uint64
	criuPath             string
	newuidmapPath        string
	newgidmapPath        string
	m                    sync.Mutex
	criuVersion          int
	state                containerState
	created              time.Time
	fifo                 *os.File
}
```
```go
func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
	rootlessCg, err := shouldUseRootlessCgroupManager(context)
	if err != nil {
		return nil, err
	}
	config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
		CgroupName:       id,
		UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
		NoPivotRoot:      context.Bool("no-pivot"),
		NoNewKeyring:     context.Bool("no-new-keyring"),
		Spec:             spec,
		RootlessEUID:     os.Geteuid() != 0,
		RootlessCgroups:  rootlessCg,
	})
	if err != nil {
		return nil, err
	}

	factory, err := loadFactory(context)
	if err != nil {
		return nil, err
	}
	return factory.Create(id, config)
}
```
As shown, createContainer first builds the configuration config, then uses loadFactory to create the LinuxFactory, and finally calls factory.Create(id, config), which returns a linuxContainer. loadFactory is the key step: at its end it calls libcontainer.New() to return the LinuxFactory, and New sets InitPath (InitPath is very important):
```go
// New returns a linux based container factory based in the root directory and
// configures the factory with the provided option funcs.
func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
	if root != "" {
		if err := os.MkdirAll(root, 0o700); err != nil {
			return nil, err
		}
	}
	l := &LinuxFactory{
		Root:      root,
		InitPath:  "/proc/self/exe",
		InitArgs:  []string{os.Args[0], "init"},
		Validator: validate.New(),
		CriuPath:  "criu",
	}

	for _, opt := range options {
		if opt == nil {
			continue
		}
		if err := opt(l); err != nil {
			return nil, err
		}
	}
	return l, nil
}
```
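New takes its configuration as a variadic list of func(*LinuxFactory) error values, which is the functional options pattern: defaults are set first, then each option may override a field or fail. The following self-contained sketch shows the same shape on a two-field struct. The names `factory`, `option`, `withCriuPath` and `newFactory` are hypothetical and exist only in this illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// factory is a hypothetical stand-in for LinuxFactory with only two fields.
type factory struct {
	Root     string
	CriuPath string
}

// option has the same shape as the func(*LinuxFactory) error parameters of New.
type option func(*factory) error

// withCriuPath is a hypothetical option that overrides the default criu path.
func withCriuPath(path string) option {
	return func(f *factory) error {
		if path == "" {
			return errors.New("criu path must not be empty")
		}
		f.CriuPath = path
		return nil
	}
}

// newFactory sets the defaults first and then applies each option in order,
// skipping nil options and stopping at the first error, as New does.
func newFactory(root string, options ...option) (*factory, error) {
	f := &factory{Root: root, CriuPath: "criu"}
	for _, opt := range options {
		if opt == nil {
			continue
		}
		if err := opt(f); err != nil {
			return nil, err
		}
	}
	return f, nil
}

func main() {
	f, err := newFactory("/run/sketch", withCriuPath("/usr/local/bin/criu"))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("root:", f.Root, "criu:", f.CriuPath)
}
```

The benefit of this design is that callers who are happy with the defaults (such as InitPath pointing at /proc/self/exe) pass nothing, while the function signature never has to change when a new setting is added.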
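During LinuxFactory's Create, InitPath and InitArgs are passed on to the linuxContainer. Now that we know how a linuxContainer is created, we return to startContainer. At its end this function builds a runner struct and calls its run method with spec.Process as the argument; spec.Process is the process information from config.json.
In the run method, newProcess first builds a libcontainer.Process struct using config.json as the template; process-related settings such as rlimits and Capabilities are filled in at this point. The method then performs one of three operations depending on action: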
```go
switch r.action {
case CT_ACT_CREATE:
	err = r.container.Start(process)
case CT_ACT_RESTORE:
	err = r.container.Restore(process, r.criuOpts)
case CT_ACT_RUN:
	err = r.container.Run(process)
default:
	panic("Unknown action")
}
```
The Process struct, most of whose content comes from the config.json file:
```go
// Process specifies the configuration and IO for a process inside
// a container.
type Process struct {
	// The command to be run followed by any arguments.
	Args []string

	// Env specifies the environment variables for the process.
	Env []string

	// User will set the uid and gid of the executing process running inside the container
	// local to the container's user and group configuration.
	User string

	// AdditionalGroups specifies the gids that should be added to supplementary groups
	// in addition to those that the user belongs to.
	AdditionalGroups []string

	// Cwd will change the processes current working directory inside the container's rootfs.
	Cwd string

	// Stdin is a pointer to a reader which provides the standard input stream.
	Stdin io.Reader

	// Stdout is a pointer to a writer which receives the standard output stream.
	Stdout io.Writer

	// Stderr is a pointer to a writer which receives the standard error stream.
	Stderr io.Writer

	// ExtraFiles specifies additional open files to be inherited by the container
	ExtraFiles []*os.File

	// Initial sizings for the console
	ConsoleWidth  uint16
	ConsoleHeight uint16

	// Capabilities specify the capabilities to keep when executing the process inside the container
	// All capabilities not specified will be dropped from the processes capability mask
	Capabilities *configs.Capabilities

	// AppArmorProfile specifies the profile to apply to the process and is
	// changed at the time the process is execed
	AppArmorProfile string

	// Label specifies the label to apply to the process. It is commonly used by selinux
	Label string

	// NoNewPrivileges controls whether processes can gain additional privileges.
	NoNewPrivileges *bool

	// Rlimits specifies the resource limits, such as max open files, to set in the container
	// If Rlimits are not set, the container will inherit rlimits from the parent process
	Rlimits []configs.Rlimit

	// ConsoleSocket provides the masterfd console.
	ConsoleSocket *os.File

	// Init specifies whether the process is the first process in the container.
	Init bool

	ops processOperations

	LogLevel string

	// SubCgroupPaths specifies sub-cgroups to run the process in.
	// Map keys are controller names, map values are paths (relative to
	// container's top-level cgroup).
	//
	// If empty, the default top-level container's cgroup is used.
	//
	// For cgroup v2, the only key allowed is "".
	SubCgroupPaths map[string]string
}
```
The Start method:
```go
func (c *linuxContainer) Start(process *Process) error {
	c.m.Lock()
	defer c.m.Unlock()
	if c.config.Cgroups.Resources.SkipDevices {
		return errors.New("can't start container with SkipDevices set")
	}
	if process.Init {
		if err := c.createExecFifo(); err != nil {
			return err
		}
	}
	if err := c.start(process); err != nil {
		if process.Init {
			c.deleteExecFifo()
		}
		return err
	}
	return nil
}
```
As shown, Start mainly creates a FIFO (used mainly for blocking; it comes up again later) and then calls the start method.
```go
func (c *linuxContainer) start(process *Process) (retErr error) {
	parent, err := c.newParentProcess(process)
	if err != nil {
		return fmt.Errorf("unable to create new parent process: %w", err)
	}

	logsDone := parent.forwardChildLogs()
	if logsDone != nil {
		defer func() {
			// Wait for log forwarder to finish. This depends on
			// runc init closing the _LIBCONTAINER_LOGPIPE log fd.
			err := <-logsDone
			if err != nil && retErr == nil {
				retErr = fmt.Errorf("unable to forward init logs: %w", err)
			}
		}()
	}

	if err := parent.start(); err != nil {
		return fmt.Errorf("unable to start container process: %w", err)
	}

	if process.Init {
		c.fifo.Close()
		if c.config.Hooks != nil {
			s, err := c.currentOCIState()
			if err != nil {
				return err
			}

			if err := c.config.Hooks[configs.Poststart].RunHooks(s); err != nil {
				if err := ignoreTerminateErrors(parent.terminate()); err != nil {
					logrus.Warn(fmt.Errorf("error running poststart hook: %w", err))
				}
				return err
			}
		}
	}
	return nil
}
```
The first step of start returns an initProcess struct, which implements the parentProcess interface. The struct is created by linuxContainer's newInitProcess function.
```go
type initProcess struct {
	cmd             *exec.Cmd
	messageSockPair filePair
	logFilePair     filePair
	config          *initConfig
	manager         cgroups.Manager
	intelRdtManager intelrdt.Manager
	container       *linuxContainer
	fds             []string
	process         *Process
	bootstrapData   io.Reader
	sharePidns      bool
}
```
The parentProcess interface is as follows:
```go
type parentProcess interface {
	// pid returns the pid for the running process.
	pid() int

	// start starts the process execution.
	start() error

	// send a SIGKILL to the process and wait for the exit.
	terminate() error

	// wait waits on the process returning the process state.
	wait() (*os.ProcessState, error)

	// startTime returns the process start time.
	startTime() (uint64, error)

	signal(os.Signal) error

	externalDescriptors() []string

	setExternalDescriptors(fds []string)

	forwardChildLogs() chan error
}
```
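In newParentProcess, a socket pair and a pipe pair are created first. The child ends (childsock and childpipe) are then used to build a cmd template. The command this template runs is exactly the InitPath set earlier ("/proc/self/exe" with the argument "init"), which means runC executes itself with init as its argument. The socket and the pipe carry data between cmd and the parent process: they are placed in cmd.ExtraFiles, and the corresponding file descriptor numbers are placed in environment variables. Next, the function checks whether the process is the init process. If it is not, newSetnsProcess is called and returns a setnsProcess struct, which also implements the parentProcess interface; newSetnsProcess is mainly used to create a new process inside an existing container.
Next, includeExecFifo() is executed. It opens the exec.fifo file created earlier and stores it in cmd.ExtraFiles and in an environment variable. Finally the most important function, newInitProcess, is called to create the initProcess struct: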
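The mechanism of handing a file to a child through cmd.ExtraFiles and telling it the descriptor number through an environment variable can be tried on its own. In the sketch below the child is a small `sh` command instead of a re-executed runC, the channel is a plain pipe instead of a socket pair, and `SKETCH_PIPE_FD` is a made-up variable name; only the wiring pattern is the point. It assumes `sh` and `cat` are available on the PATH.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// stdioFdCount is the number of descriptors every child already has
// (stdin, stdout, stderr), so ExtraFiles[i] becomes fd 3+i in the child.
const stdioFdCount = 3

// pipeFdEnv is a hypothetical variable name used only by this sketch.
const pipeFdEnv = "SKETCH_PIPE_FD"

// sendToChild passes the read end of a pipe to a child process, announces
// its fd number in the environment, and returns what the child read from it.
func sendToChild(msg string) (string, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return "", err
	}
	defer r.Close()

	// The child copies whatever arrives on the announced fd to its stdout.
	cmd := exec.Command("sh", "-c", "cat <&$"+pipeFdEnv)
	cmd.ExtraFiles = append(cmd.ExtraFiles, r)
	fd := stdioFdCount + len(cmd.ExtraFiles) - 1
	cmd.Env = append(os.Environ(), pipeFdEnv+"="+strconv.Itoa(fd))

	// A short message fits in the pipe buffer, so writing before the child
	// starts does not block. Closing the write end gives the child EOF.
	if _, err := w.Write([]byte(msg)); err != nil {
		w.Close()
		return "", err
	}
	if err := w.Close(); err != nil {
		return "", err
	}

	out, err := cmd.Output()
	return string(out), err
}

func main() {
	out, err := sendToChild("hello from the parent\n")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Print(out)
}
```

The environment variable is needed because the child has no other way to learn which of its inherited descriptors carries which channel.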
```go
func (c *linuxContainer) newInitProcess(p *Process, cmd *exec.Cmd, messageSockPair, logFilePair filePair) (*initProcess, error) {
	cmd.Env = append(cmd.Env, "_LIBCONTAINER_INITTYPE="+string(initStandard))
	nsMaps := make(map[configs.NamespaceType]string)
	for _, ns := range c.config.Namespaces {
		if ns.Path != "" {
			nsMaps[ns.Type] = ns.Path
		}
	}
	_, sharePidns := nsMaps[configs.NEWPID]
	data, err := c.bootstrapData(c.config.Namespaces.CloneFlags(), nsMaps, initStandard)
	if err != nil {
		return nil, err
	}

	if c.shouldSendMountSources() {
		// Elements on this slice will be paired with mounts (see StartInitialization() and
		// prepareRootfs()). This slice MUST have the same size as c.config.Mounts.
		mountFds := make([]int, len(c.config.Mounts))
		for i, m := range c.config.Mounts {
			if !m.IsBind() {
				// Non bind-mounts do not use an fd.
				mountFds[i] = -1
				continue
			}

			// The fd passed here will not be used: nsexec.c will overwrite it with dup3(). We just need
			// to allocate a fd so that we know the number to pass in the environment variable. The fd
			// must not be closed before cmd.Start(), so we reuse messageSockPair.child because the
			// lifecycle of that fd is already taken care of.
			cmd.ExtraFiles = append(cmd.ExtraFiles, messageSockPair.child)
			mountFds[i] = stdioFdCount + len(cmd.ExtraFiles) - 1
		}

		mountFdsJson, err := json.Marshal(mountFds)
		if err != nil {
			return nil, fmt.Errorf("Error creating _LIBCONTAINER_MOUNT_FDS: %w", err)
		}

		cmd.Env = append(cmd.Env,
			"_LIBCONTAINER_MOUNT_FDS="+string(mountFdsJson),
		)
	}

	init := &initProcess{
		cmd:             cmd,
		messageSockPair: messageSockPair,
		logFilePair:     logFilePair,
		manager:         c.cgroupManager,
		intelRdtManager: c.intelRdtManager,
		config:          c.newInitConfig(p),
		container:       c,
		process:         p,
		bootstrapData:   data,
		sharePidns:      sharePidns,
	}
	c.initProcess = init
	return init, nil
}
```
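This function first sets the standard init-type environment variable, then reads the namespaces configured in config.json and stores that data (the paths of namespaces that already exist go into nsMaps, and the clone flags for the rest are packed into the bootstrap data), and then creates the initProcess struct. The shouldSendMountSources branch in the middle needs no special attention; it exists to support mounting certain directories. At this point the parentProcess struct is essentially set up.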
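The namespace bookkeeping at the top of newInitProcess is small enough to reproduce as a standalone sketch. The types `namespaceType` and `namespace` and the function `joinPaths` are hypothetical simplifications for this illustration; they mimic the loop over c.config.Namespaces and the sharePidns lookup, nothing more.

```go
package main

import "fmt"

// namespaceType and namespace are hypothetical, simplified stand-ins for
// the configs types used by newInitProcess.
type namespaceType string

const (
	newPID namespaceType = "NEWPID"
	newNET namespaceType = "NEWNET"
)

type namespace struct {
	Type namespaceType
	Path string // empty: create a new namespace; non-empty: join the one at Path
}

// joinPaths collects the namespaces that carry a path and reports whether
// the pid namespace is among them, which is what sharePidns records.
func joinPaths(namespaces []namespace) (paths map[namespaceType]string, sharePidns bool) {
	paths = make(map[namespaceType]string)
	for _, ns := range namespaces {
		if ns.Path != "" {
			paths[ns.Type] = ns.Path
		}
	}
	_, sharePidns = paths[newPID]
	return paths, sharePidns
}

func main() {
	paths, share := joinPaths([]namespace{
		{Type: newPID, Path: "/proc/1234/ns/pid"},
		{Type: newNET},
	})
	fmt.Println("paths to join:", paths)
	fmt.Println("shares an existing pid namespace:", share)
}
```

A namespace entry without a path never enters the map, so in this sketch a brand-new pid namespace leaves the share flag false.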
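The start method next calls the start() function of parentProcess, which here is the one implemented by the initProcess struct. That function launches the /proc/self/exe process configured earlier, with the argument init, applies the cgroup settings to it, and then sends information to the child over the socket. The key point is that a runC init child process is started. Because the container being created may have new namespaces, running runC init in a child process makes it easy to switch namespaces with setns(). However, setns() is not allowed in a multithreaded process (for example, a multithreaded process may not join a user namespace), and the go runtime is multithreaded, so the namespaces must be set before the go runtime starts. runC therefore uses cgo to run C code that sets the namespaces before the go runtime starts.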
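That the go runtime is already multithreaded by the time user code runs can be observed directly on Linux by counting the entries in /proc/self/task. The sketch below does only that; `threadCount` is a hypothetical helper, and the exact number it reports depends on the Go version and the machine.

```go
package main

import (
	"fmt"
	"os"
)

// threadCount returns the number of OS threads in the current process by
// listing /proc/self/task, which has one entry per thread (Linux only).
func threadCount() (int, error) {
	entries, err := os.ReadDir("/proc/self/task")
	if err != nil {
		return 0, err
	}
	return len(entries), nil
}

func main() {
	n, err := threadCount()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// No goroutine was started by this program, yet the runtime has
	// typically created more than one OS thread already.
	fmt.Println("OS threads at the start of main:", n)
}
```

Since this count is usually above one before main does anything, namespace switching that requires a single-threaded process has to happen before the runtime starts, which is the reason for the C code described above.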
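In the cgo code, the pipe is first obtained from an environment variable (the parent process set it there earlier), and the config sent by the parent process is read in netlink message format. The code then also creates a socket pair so that it can communicate with the grandchild process. Next, in the style of a state machine, it uses clone to create a process with the namespaces specified in config.json, after which the original child process exits with exit(0).
Back in create: after the init process has been launched, cgroup limits are applied to it, which prevents child processes from escaping the cgroup in the following steps. The parent process then sends bootstrapData to the init process. After that, create obtains the pid of the child process created by init, and receives through the pipe the fds opened by the child and saves them. After a series of further settings, sendConfig sends the information about the process to run from config.json. What follows is container initialization and execution of the process configured in config.json; for details, see the Init function of linuxStandardInit in standard_init_linux.go. This completes the overview of how a container is started.
Reference links: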
https://segmentfault.com/a/1190000017576314#item-1
https://github.com/opencontainers/runc