[Original] What is runC?
Posted: 2022-1-11 11:19
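A container runtime is the tool that manages and runs containers. There are many container tools today, such as docker and rkt, but if each tool used its own runtime, that would hold back the container field as a whole. Several container vendors therefore jointly defined standards for the container image format and the container runtime: the Open Container Initiative (OCI).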
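An OCI Bundle is a set of files that conform to the OCI standard. These files contain all the data needed to run a container and are stored in one common directory, which contains the following two items: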
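As a rough illustration of the layout: under the OCI runtime specification, a bundle holds a config.json file at its top level plus the container's root filesystem, which is the directory named by root.path in that file (conventionally rootfs). The sketch below checks a directory for those two items. checkBundle, makeDemoBundle and the bundleConfig type are hypothetical helpers written for this example; they are not runC code.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
)

// bundleConfig holds only the part of config.json this sketch needs.
// Hypothetical type for illustration; the real spec has many more fields.
type bundleConfig struct {
	Root struct {
		Path string `json:"path"`
	} `json:"root"`
}

// checkBundle reports whether dir looks like an OCI bundle: it must contain
// config.json, and the root.path named there must be an existing directory.
func checkBundle(dir string) error {
	data, err := os.ReadFile(filepath.Join(dir, "config.json"))
	if err != nil {
		return fmt.Errorf("reading config.json: %w", err)
	}
	var cfg bundleConfig
	if err := json.Unmarshal(data, &cfg); err != nil {
		return fmt.Errorf("parsing config.json: %w", err)
	}
	if cfg.Root.Path == "" {
		return fmt.Errorf("config.json has no root.path")
	}
	rootfs := cfg.Root.Path
	if !filepath.IsAbs(rootfs) {
		// Relative paths are resolved against the bundle directory.
		rootfs = filepath.Join(dir, rootfs)
	}
	info, err := os.Stat(rootfs)
	if err != nil {
		return fmt.Errorf("checking root filesystem: %w", err)
	}
	if !info.IsDir() {
		return fmt.Errorf("%s is not a directory", rootfs)
	}
	return nil
}

// makeDemoBundle builds a throwaway bundle in a temp directory.
func makeDemoBundle() (string, error) {
	dir, err := os.MkdirTemp("", "bundle")
	if err != nil {
		return "", err
	}
	if err := os.Mkdir(filepath.Join(dir, "rootfs"), 0o755); err != nil {
		return "", err
	}
	cfg := []byte(`{"ociVersion":"1.0.2","root":{"path":"rootfs"}}`)
	return dir, os.WriteFile(filepath.Join(dir, "config.json"), cfg, 0o644)
}

func main() {
	dir, err := makeDemoBundle()
	if err != nil {
		fmt.Println("setup failed:", err)
		return
	}
	defer os.RemoveAll(dir)
	fmt.Println("bundle check error:", checkBundle(dir))
}
```

The check is deliberately shallow: it says nothing about whether the rest of config.json is valid, only that the two pieces of the bundle are where a runtime would look for them.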
Within runC's main code, libcontainer was one of the major foundations of early docker; it has been wrapped a second time to fit the OCI format.
Taking runc create as an example, its main operations are as follows:
The data structures for a container are as follows. An interface, BaseContainer, is defined here, and it covers every operation a container needs:
On the Linux platform this interface is wrapped to produce a Linux-specific interface, Container:
Another important interface is Factory:
It also has a Linux implementation, LinuxFactory:
The concrete implementation of Create in the Linux factory simply creates a linuxContainer (which matches the Linux container interface described above):
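As createContainer shows, the config is loaded first, then loadFactory creates the LinuxFactory, and finally factory.Create(id, config) is called, which returns a linuxContainer. loadFactory is the key step: at its end it calls libcontainer.New() to return the LinuxFactory, and New sets InitPath there (InitPath is very important; see the New listing).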
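New accepts a list of option funcs of type func(*LinuxFactory) error, which is how callers adjust the defaults such as InitPath and InitArgs. A minimal, self-contained sketch of that pattern follows. demoFactory, newDemoFactory and withInitArgs are hypothetical stand-ins invented here; they are not libcontainer names.

```go
package main

import (
	"errors"
	"fmt"
)

// demoFactory is a hypothetical stand-in for LinuxFactory with only the
// two fields discussed in the text.
type demoFactory struct {
	InitPath string
	InitArgs []string
}

// option mirrors the shape of the option funcs accepted by New.
type option func(*demoFactory) error

// withInitArgs is a hypothetical option that overrides InitArgs.
func withInitArgs(args ...string) option {
	return func(f *demoFactory) error {
		if len(args) == 0 {
			return errors.New("init args must not be empty")
		}
		f.InitArgs = args
		return nil
	}
}

// newDemoFactory sets defaults first, then lets each option adjust them,
// skipping nil options and stopping at the first error.
func newDemoFactory(self string, opts ...option) (*demoFactory, error) {
	f := &demoFactory{
		InitPath: "/proc/self/exe",
		InitArgs: []string{self, "init"},
	}
	for _, opt := range opts {
		if opt == nil {
			continue
		}
		if err := opt(f); err != nil {
			return nil, err
		}
	}
	return f, nil
}

func main() {
	f, err := newDemoFactory("runc")
	fmt.Println(f.InitPath, f.InitArgs, err)
}
```

Setting the defaults before the loop means a caller that passes no options still gets a factory that re-executes the current binary with init as its argument.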
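During LinuxFactory's Create, InitPath and InitArgs are passed on to the linuxContainer. Now that we know how a linuxContainer is created, let us return to startContainer. At its end, that function builds a runner struct and calls its run method with spec.Process as the argument; spec.Process is the process information from the original config.json.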
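In the run method, newProcess first uses config.json as a template to create a libcontainer.Process struct; process-related settings such as limits and Capabilities are completed at this point. Then one of three operations is performed, depending on action: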
The Process struct, most of whose contents come from the config.json file:
The Start method:
As can be seen, the Start method mainly creates a fifo (used for blocking, as will be seen later) and then calls the internal start method.
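A fifo is useful for blocking because opening it for writing does not return until some other party opens it for reading, and vice versa. The sketch below shows that rendezvous inside one process using two goroutines. The file name exec.fifo follows the text; waitOnFifo, runFifoDemo and the temp directory are assumptions made for this example. It uses syscall.Mkfifo, so it only works on Unix-like systems.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// waitOnFifo creates a fifo at path, then blocks in OpenFile until a reader
// shows up. It returns the byte that the reader side received.
func waitOnFifo(path string) (byte, error) {
	if err := syscall.Mkfifo(path, 0o622); err != nil {
		return 0, err
	}
	defer os.Remove(path)

	got := make(chan byte, 1)
	errc := make(chan error, 1)
	go func() {
		// Reader side: opening read-only unblocks the writer below.
		r, err := os.OpenFile(path, os.O_RDONLY, 0)
		if err != nil {
			errc <- err
			return
		}
		defer r.Close()
		buf := make([]byte, 1)
		if _, err := r.Read(buf); err != nil {
			errc <- err
			return
		}
		got <- buf[0]
	}()

	// Writer side: this open blocks until the reader has opened the fifo.
	w, err := os.OpenFile(path, os.O_WRONLY, 0)
	if err != nil {
		return 0, err
	}
	_, err = w.Write([]byte{'0'})
	w.Close()
	if err != nil {
		return 0, err
	}

	select {
	case b := <-got:
		return b, nil
	case err := <-errc:
		return 0, err
	}
}

// runFifoDemo runs waitOnFifo against exec.fifo in a fresh temp directory.
func runFifoDemo() (byte, error) {
	dir, err := os.MkdirTemp("", "fifo")
	if err != nil {
		return 0, err
	}
	defer os.RemoveAll(dir)
	return waitOnFifo(filepath.Join(dir, "exec.fifo"))
}

func main() {
	b, err := runFifoDemo()
	fmt.Printf("received %q, err=%v\n", b, err)
}
```

In a real runtime the two ends would live in different processes, so the side that opens the fifo first simply waits until the other side arrives.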
The first step of that method returns an initProcess struct, which implements the parentProcess interface; the struct is created by linuxContainer's newInitProcess function.
The interface is as follows:
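Throughout newParentProcess, a pair of sockets and a pair of pipes are created first. The child ends (childsock and childpipe) are then used to build a cmd template whose command is exactly the path set in InitPath earlier ("/proc/self/exe" with the argument "init", which means runC executes itself with init as the argument). The sockets and pipes exist to carry data between cmd and the parent process: they are placed in cmd.ExtraFiles, and the corresponding file descriptor numbers are placed in environment variables. Next comes a check on whether the process is the init process. If it is not, newSetnsProcess is called and returns a setnsProcess struct, which also implements the parentProcess interface; newSetnsProcess is mainly used to create a new process inside an existing container.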
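The cmd template described above can be sketched with os/exec. Files placed in cmd.ExtraFiles become descriptors 3, 4, and so on in the child, so the parent can tell the child which number to use through an environment variable. buildInitCmd, demoInitCmd and the variable name EXAMPLE_INITPIPE are hypothetical, chosen for this sketch; the command is only built here, never started.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
)

// stdioFdCount is the number of descriptors taken by stdin, stdout, stderr;
// entry i of ExtraFiles is seen by the child as descriptor stdioFdCount+i.
const stdioFdCount = 3

// buildInitCmd prepares (but does not start) a re-exec of initPath with
// initArgs as argv and childPipe passed as an inherited file.
func buildInitCmd(initPath string, initArgs []string, childPipe *os.File) *exec.Cmd {
	cmd := exec.Command(initPath, initArgs[1:]...)
	cmd.Args[0] = initArgs[0]
	cmd.ExtraFiles = append(cmd.ExtraFiles, childPipe)
	fd := stdioFdCount + len(cmd.ExtraFiles) - 1
	// EXAMPLE_INITPIPE is a made-up name for this sketch.
	cmd.Env = append(cmd.Env, "EXAMPLE_INITPIPE="+strconv.Itoa(fd))
	return cmd
}

// demoInitCmd creates a pipe and builds the command around its child end.
// The returned cleanup func closes both pipe ends.
func demoInitCmd() (*exec.Cmd, func(), error) {
	parentEnd, childEnd, err := os.Pipe()
	if err != nil {
		return nil, nil, err
	}
	cleanup := func() {
		parentEnd.Close()
		childEnd.Close()
	}
	cmd := buildInitCmd("/proc/self/exe", []string{os.Args[0], "init"}, childEnd)
	return cmd, cleanup, nil
}

func main() {
	cmd, cleanup, err := demoInitCmd()
	if err != nil {
		fmt.Println("pipe failed:", err)
		return
	}
	defer cleanup()
	fmt.Println(cmd.Path, cmd.Args[1:], cmd.Env)
}
```

Passing the descriptor number through the environment keeps the child independent of how many files the parent happened to add before it.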
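Next, includeExecFifo() is executed: it opens the exec.fifo file created earlier and stores it in cmd.ExtraFiles and in the environment variables. Finally, the key function newInitProcess is called to create the initProcess struct: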
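This function first sets the standard environment variable, then reads from config.json the namespaces that need to be created and stores that data, and then creates the initProcess struct. The shouldSendMountSources step in the middle needs no special attention; it exists to support mounting certain directories. At this point the parentProcess struct is essentially set up.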
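Next, the start method calls parentProcess's start() function, which here is the one implemented by the initProcess struct. This start function launches the /proc/self/exe process configured earlier with init as its argument, then sets the cgroup for the parent process, and then sends information to the child process over the socket. The key point is that a runC init child process is started. Because the container being created may have new namespaces, running runC init in a child process makes it easy to switch namespaces through setns(). However, setns is not allowed under multi-threaded conditions, and the go runtime is multi-threaded, so the namespaces must be set before the go runtime starts. cgo is therefore used to run C code that sets the namespaces before the go runtime starts.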
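In the cgo code, the pipe is first obtained from an environment variable (as seen earlier, the parent process set it there), and the config sent by the parent process is then read in netlink msg format. Next, it likewise creates socket pairs so that it can communicate with the grandchild process. Then, in the form of a state machine, it uses clone to create a process with the namespaces specified in config.json, after which the original child process is destroyed with exit(0).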
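The netlink msg format mentioned above packs each field as an attribute: a 4-byte header (16-bit length, 16-bit type) followed by the payload, padded to a 4-byte boundary. The following is a sketch of that attribute layout only. It assumes little-endian byte order (netlink uses host order), and the attribute type numbers and function names are made up for this example; they are not the ones runC uses.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

const (
	attrHdrLen = 4 // 2 bytes length + 2 bytes type
	attrAlign  = 4 // attributes start on 4-byte boundaries
)

// align rounds n up to the next multiple of attrAlign.
func align(n int) int { return (n + attrAlign - 1) &^ (attrAlign - 1) }

// appendAttr appends one attribute to buf. The length field covers the
// header and payload but not the trailing padding. Payloads must fit in
// a 16-bit length in this sketch.
func appendAttr(buf []byte, typ uint16, payload []byte) []byte {
	n := attrHdrLen + len(payload)
	hdr := make([]byte, attrHdrLen)
	binary.LittleEndian.PutUint16(hdr[0:2], uint16(n))
	binary.LittleEndian.PutUint16(hdr[2:4], typ)
	buf = append(buf, hdr...)
	buf = append(buf, payload...)
	return append(buf, make([]byte, align(n)-n)...)
}

// parseAttrs walks buf and returns payloads keyed by attribute type.
func parseAttrs(buf []byte) (map[uint16][]byte, error) {
	out := make(map[uint16][]byte)
	for len(buf) > 0 {
		if len(buf) < attrHdrLen {
			return nil, errors.New("truncated attribute header")
		}
		n := int(binary.LittleEndian.Uint16(buf[0:2]))
		typ := binary.LittleEndian.Uint16(buf[2:4])
		if n < attrHdrLen || n > len(buf) {
			return nil, errors.New("bad attribute length")
		}
		out[typ] = buf[attrHdrLen:n]
		next := align(n)
		if next > len(buf) {
			next = len(buf) // last attribute may lack padding
		}
		buf = buf[next:]
	}
	return out, nil
}

func main() {
	// Type 1 carries a 32-bit flag word, type 2 a path; both are
	// illustrative values, not real bootstrap data.
	flags := make([]byte, 4)
	binary.LittleEndian.PutUint32(flags, 0x20000000)
	buf := appendAttr(nil, 1, flags)
	buf = appendAttr(buf, 2, []byte("/proc/1/ns/net"))
	attrs, err := parseAttrs(buf)
	fmt.Println(len(buf), len(attrs), err)
}
```

A type-length-value layout like this lets the C side read the message field by field without needing a JSON parser before the go runtime is up.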
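Back in create: after the init process has been started, cgroup limits are applied to it, which also helps prevent the child process from escaping through cgroups later on. The parent process then sends the bootstrapData to the init process. After that, create obtains the pid of the child process created by init, and through the pipe it obtains and saves the fds opened by the child. After a series of settings, sendConfig sends the information about the process to be executed, as given in config.json. What follows is container initialization and execution of the process set in config.json; for the details, see the Init function of linuxStandardInit in standard_init_linux.go. This concludes the rough analysis of how a container starts.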
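The handoff from the parent to init can be pictured as JSON written to one end of a socket pair and decoded at the other. The sketch below keeps both ends in a single process so that it stays runnable. demoConfig, newSockPair and sendDemoConfig are hypothetical names for this example, and the config that runC really sends carries far more than the two fields shown. It relies on syscall.Socketpair with SOCK_CLOEXEC, so it is Linux-only.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"syscall"
)

// demoConfig is a hypothetical, heavily reduced stand-in for the process
// information taken from config.json.
type demoConfig struct {
	Args []string `json:"args"`
	Cwd  string   `json:"cwd"`
}

// newSockPair returns the two connected ends of a Unix stream socket pair.
func newSockPair(name string) (parent, child *os.File, err error) {
	fds, err := syscall.Socketpair(syscall.AF_UNIX, syscall.SOCK_STREAM|syscall.SOCK_CLOEXEC, 0)
	if err != nil {
		return nil, nil, err
	}
	return os.NewFile(uintptr(fds[0]), name+"-p"), os.NewFile(uintptr(fds[1]), name+"-c"), nil
}

// sendDemoConfig writes cfg as JSON on the parent end, then reads it back
// from the child end, the way a child process would.
func sendDemoConfig(cfg demoConfig) (demoConfig, error) {
	parent, child, err := newSockPair("init")
	if err != nil {
		return demoConfig{}, err
	}
	defer child.Close()
	if err := json.NewEncoder(parent).Encode(cfg); err != nil {
		parent.Close()
		return demoConfig{}, err
	}
	parent.Close() // closing signals that no more data will follow

	var got demoConfig
	err = json.NewDecoder(child).Decode(&got)
	return got, err
}

func main() {
	got, err := sendDemoConfig(demoConfig{Args: []string{"sh"}, Cwd: "/"})
	fmt.Println(got.Args, got.Cwd, err)
}
```

In the real flow the child end would be inherited by the init process through cmd.ExtraFiles rather than read in the same process.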
References:
https://segmentfault.com/a/1190000017576314#item-1
https://github.com/opencontainers/runc
type BaseContainer interface {
	// Returns the ID of the container
	ID() string

	// Returns the current status of the container.
	Status() (Status, error)

	// State returns the current container's state information.
	State() (*State, error)

	// OCIState returns the current container's state information.
	OCIState() (*specs.State, error)

	// Returns the current config of the container.
	Config() configs.Config

	// Returns the PIDs inside this container. The PIDs are in the namespace of the calling process.
	//
	// Some of the returned PIDs may no longer refer to processes in the Container, unless
	// the Container state is PAUSED in which case every PID in the slice is valid.
	Processes() ([]int, error)

	// Returns statistics for the container.
	Stats() (*Stats, error)

	// Set resources of container as configured
	//
	// We can use this to change resources when containers are running.
	//
	Set(config configs.Config) error

	// Start a process inside the container. Returns error if process fails to
	// start. You can track process lifecycle with passed Process structure.
	Start(process *Process) (err error)

	// Run immediately starts the process inside the container. Returns error if process
	// fails to start. It does not block waiting for the exec fifo after start returns but
	// opens the fifo after start returns.
	Run(process *Process) (err error)

	// Destroys the container, if its in a valid state, after killing any
	// remaining running processes.
	//
	// Any event registrations are removed before the container is destroyed.
	// No error is returned if the container is already destroyed.
	//
	// Running containers must first be stopped using Signal(..).
	// Paused containers must first be resumed using Resume(..).
	Destroy() error

	// Signal sends the provided signal code to the container's initial process.
	//
	// If all is specified the signal is sent to all processes in the container
	// including the initial process.
	Signal(s os.Signal, all bool) error

	// Exec signals the container to exec the users process at the end of the init.
	Exec() error
}
// Container is a libcontainer container object.
//
// Each container is thread-safe within the same process. Since a container can
// be destroyed by a separate process, any function may return that the container
// was not found.
type Container interface {
	BaseContainer

	// Methods below here are platform specific

	// Checkpoint checkpoints the running container's state to disk using the criu(8) utility.
	Checkpoint(criuOpts *CriuOpts) error

	// Restore restores the checkpointed container to a running state using the criu(8) utility.
	Restore(process *Process, criuOpts *CriuOpts) error

	// If the Container state is RUNNING or CREATED, sets the Container state to PAUSING and pauses
	// the execution of any user processes. Asynchronously, when the container finished being paused the
	// state is changed to PAUSED.
	// If the Container state is PAUSED, do nothing.
	Pause() error

	// If the Container state is PAUSED, resumes the execution of any user processes in the
	// Container before setting the Container state to RUNNING.
	// If the Container state is RUNNING, do nothing.
	Resume() error

	// NotifyOOM returns a read-only channel signaling when the container receives an OOM notification.
	NotifyOOM() (<-chan struct{}, error)

	// NotifyMemoryPressure returns a read-only channel signaling when the container reaches a given pressure level
	NotifyMemoryPressure(level PressureLevel) (<-chan struct{}, error)
}
type Factory interface {
	// Creates a new container with the given id and starts the initial process inside it.
	// id must be a string containing only letters, digits and underscores and must contain
	// between 1 and 1024 characters, inclusive.
	//
	// The id must not already be in use by an existing container. Containers created using
	// a factory with the same path (and filesystem) must have distinct ids.
	//
	// Returns the new container with a running process.
	//
	// On error, any partially created container parts are cleaned up (the operation is atomic).
	Create(id string, config *configs.Config) (Container, error)

	// Load takes an ID for an existing container and returns the container information
	// from the state. This presents a read only view of the container.
	Load(id string) (Container, error)

	// StartInitialization is an internal API to libcontainer used during the reexec of the
	// container.
	StartInitialization() error

	// Type returns info string about factory type (e.g. lxc, libcontainer...)
	Type() string
}
// LinuxFactory implements the default factory interface for linux based systems.
type LinuxFactory struct {
	// Root directory for the factory to store state.
	Root string

	// InitPath is the path for calling the init responsibilities for spawning
	// a container.
	InitPath string

	// InitArgs are arguments for calling the init responsibilities for spawning
	// a container.
	InitArgs []string

	// CriuPath is the path to the criu binary used for checkpoint and restore of
	// containers.
	CriuPath string

	// New{u,g}idmapPath is the path to the binaries used for mapping with
	// rootless containers.
	NewuidmapPath string
	NewgidmapPath string

	// Validator provides validation to container configurations.
	Validator validate.Validator

	// NewIntelRdtManager returns an initialized Intel RDT manager for a single container.
	NewIntelRdtManager func(config *configs.Config, id string, path string) intelrdt.Manager
}
type linuxContainer struct {
	id                   string
	root                 string
	config               *configs.Config
	cgroupManager        cgroups.Manager
	intelRdtManager      intelrdt.Manager
	initPath             string
	initArgs             []string
	initProcess          parentProcess
	initProcessStartTime uint64
	criuPath             string
	newuidmapPath        string
	newgidmapPath        string
	m                    sync.Mutex
	criuVersion          int
	state                containerState
	created              time.Time
	fifo                 *os.File
}
func createContainer(context *cli.Context, id string, spec *specs.Spec) (libcontainer.Container, error) {
	rootlessCg, err := shouldUseRootlessCgroupManager(context)
	if err != nil {
		return nil, err
	}
	config, err := specconv.CreateLibcontainerConfig(&specconv.CreateOpts{
		CgroupName:       id,
		UseSystemdCgroup: context.GlobalBool("systemd-cgroup"),
		NoPivotRoot:      context.Bool("no-pivot"),
		NoNewKeyring:     context.Bool("no-new-keyring"),
		Spec:             spec,
		RootlessEUID:     os.Geteuid() != 0,
		RootlessCgroups:  rootlessCg,
	})
	if err != nil {
		return nil, err
	}

	factory, err := loadFactory(context)
	if err != nil {
		return nil, err
	}
	return factory.Create(id, config)
}
// New returns a linux based container factory based in the root directory and
// configures the factory with the provided option funcs.
func New(root string, options ...func(*LinuxFactory) error) (Factory, error) {
	if root != "" {
		if err := os.MkdirAll(root, 0o700); err != nil {
			return nil, err
		}
	}
	l := &LinuxFactory{
		Root:      root,
		InitPath:  "/proc/self/exe",
		InitArgs:  []string{os.Args[0], "init"},
		Validator: validate.New(),
		CriuPath:  "criu",
	}

	for _, opt := range options {
		if opt == nil {
			continue
		}
		if err := opt(l); err != nil {
			return nil, err
		}
	}
	return l, nil
}
switch r.action {
case CT_ACT_CREATE:
	err = r.container.Start(process)
case CT_ACT_RESTORE:
	err = r.container.Restore(process, r.criuOpts)
case CT_ACT_RUN:
	err = r.container.Run(process)
default:
	panic("Unknown action")
}