Linux capabilities were introduced in kernel 2.2. Namespaces arrived piecemeal, starting with mount namespaces in kernel 2.4.19, and user namespaces only became usable by unprivileged processes in kernel 3.8 (by introduction I mean the first appearance, not the final implementation of a feature, which sometimes takes many releases). Two technologies that started so far apart in the release cycle came together in the patchset proposed by Serge E. Hallyn. Let me break things down for you.

A. Linux Capabilities

Capabilities were introduced to split up the power of root (EUID 0, traditionally what programs started with sudo run as). Before them, a program running as root was all-powerful. Now let's say I am running Wireshark: it should only have network-related powers, and if it tries to insert a kernel module (say someone exploited my Wireshark instance) it should be denied. With the EUID as the only check, that was not possible. With capabilities, I can give Wireshark "CAP_NET_ADMIN" and "CAP_NET_RAW" (and perhaps some other networking-related capabilities) without giving it "CAP_SYS_MODULE".
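
To make the idea concrete, here is a minimal sketch that models a capability set as a bitmask. The bit numbers mirror the kernel's numbering for these three capabilities, but the "toy_" names and helper functions are made up for illustration; this is not how capabilities are granted to a real binary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical names; the values mirror CAP_NET_ADMIN, CAP_NET_RAW and
 * CAP_SYS_MODULE in <linux/capability.h>. */
enum {
    TOY_CAP_NET_ADMIN  = 12,
    TOY_CAP_NET_RAW    = 13,
    TOY_CAP_SYS_MODULE = 16
};

/* Return a copy of mask with the given capability bit set. */
static uint64_t toy_cap_grant(uint64_t mask, int cap)
{
    assert(cap >= 0 && cap < 64);
    return mask | (UINT64_C(1) << cap);
}

/* True if the capability bit is present in mask. */
static bool toy_cap_has(uint64_t mask, int cap)
{
    assert(cap >= 0 && cap < 64);
    return (mask >> cap) & 1u;
}

/* The capability set we would like a packet sniffer to run with:
 * networking powers only, nothing related to kernel modules. */
static uint64_t toy_sniffer_caps(void)
{
    uint64_t mask = 0;

    mask = toy_cap_grant(mask, TOY_CAP_NET_ADMIN);
    mask = toy_cap_grant(mask, TOY_CAP_NET_RAW);
    return mask;
}
```

With this model, toy_cap_has(toy_sniffer_caps(), TOY_CAP_SYS_MODULE) is false, which is exactly the denial we want when an exploited sniffer tries to load a module.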

B. Namespaces

Namespaces are a Linux kernel feature that isolates and virtualizes resources (PIDs, hostnames, user IDs, network, IPC, filesystems) of a collection of processes (~ source: Wikipedia). For example, two processes running on the same system can have the same PID, as long as they belong to different PID namespaces.
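
One way to see which namespaces a process belongs to is through the symbolic links under /proc/self/ns. Each link reads as something like "pid:[4026531836]", and two processes share a namespace when their links read the same. The sketch below reads one of these links; the helper name read_ns_id is my own, and the code assumes a Linux system with /proc mounted.

```c
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy the namespace identifier of the calling process for the given kind
 * ("pid", "user", "net", "mnt", "uts", "ipc") into buf.
 * Returns 0 on success, -1 on failure (for example when /proc is missing).
 */
static int read_ns_id(const char *kind, char *buf, size_t buflen)
{
    char path[64];
    ssize_t n;
    int len;

    if (buflen == 0)
        return -1;

    len = snprintf(path, sizeof(path), "/proc/self/ns/%s", kind);
    if (len < 0 || (size_t)len >= sizeof(path))
        return -1;

    n = readlink(path, buf, buflen - 1);
    if (n < 0)
        return -1;

    buf[n] = '\0';   /* readlink() does not NUL-terminate the result */
    return 0;
}
```

Calling read_ns_id("pid", buf, sizeof(buf)) from two different processes and comparing the strings tells you whether they live in the same PID namespace.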

C. User Namespace

User namespaces isolate security-related identifiers and attributes, in particular, user IDs and group IDs (see credentials(7)), the root directory, keys (see keyctl(2)), and capabilities (see capabilities(7)). A process's user and group IDs can be different inside and outside a user namespace. In particular, a process can have a normal unprivileged user ID outside a user namespace while at the same time having a user ID of 0 inside the namespace; in other words, the process has full privileges for operations inside the user namespace, but is unprivileged for operations outside the namespace. (~ source : man page)
*Permissions for namespace of the other kinds are checked in the user namespace, they got created in.* (~ source : wikipedia).

So, in a way, user namespaces control access to the various components of a system (for example managing files, killing processes, etc.), just like user and group IDs do in a traditional system. The only difference is that here the "system" is defined by the other namespaces associated with the user namespace, such as the mount namespace, PID namespace, etc.

The highlighted line above then says something like this: if a process in user namespace UN1 has the capability to kill other processes (I'll explain how capabilities intersect with user namespaces in a while; for now assume that it is run with sudo), and it tries to kill a process in another user namespace UN2, where UN1 and UN2 are disjoint and neither is an ancestor of the other, then this must not be allowed.
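
The rule can be sketched with a toy model. The code below is loosely modeled on the ancestor walk the kernel performs in cap_capable() (security/commoncap.c), heavily simplified; every struct and function name in it is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy user namespace: only the fields the check needs. */
struct toy_user_ns {
    const struct toy_user_ns *parent;  /* NULL for the initial namespace */
    unsigned int owner_uid;            /* uid that created this namespace */
};

/* Toy credentials of a process. */
struct toy_cred {
    const struct toy_user_ns *user_ns; /* namespace the process lives in */
    unsigned int euid;
    uint64_t cap_effective;            /* bitmask, bit N = capability N */
};

/* Does the process described by cred hold cap with respect to target? */
static bool toy_ns_capable(const struct toy_cred *cred,
                           const struct toy_user_ns *target, int cap)
{
    const struct toy_user_ns *ns = target;

    while (ns != NULL) {
        /* Same namespace: the process's own capability set decides. */
        if (ns == cred->user_ns)
            return (cred->cap_effective >> cap) & 1u;

        /* The creator of a child namespace holds all capabilities in it. */
        if (ns->parent == cred->user_ns && ns->owner_uid == cred->euid)
            return true;

        ns = ns->parent;   /* try the next ancestor of the target */
    }

    /* The process's namespace is not an ancestor of the target. */
    return false;
}
```

If UN1 and UN2 are both children of the initial namespace, a fully capable process in UN1 passes the check for targets in UN1 but fails it for targets in UN2 or in the initial namespace, while a fully capable process in the initial namespace passes it for all three.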

D. The init_user namespace (init_user_ns)

This is simply the root user namespace, i.e. the user namespace that is created at boot time.

E. The confluence

The intersection of the two technologies allows an unprivileged process to create a new user namespace in which it has full capabilities. Thus, in the "pseudo system" it has created, it can use Linux capabilities, but outside this "system" it is powerless. By "pseudo system" I mean a new user namespace, together with the resources assigned to it through other namespaces such as PID, mount, net, etc.
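
This can be observed from user space. The sketch below forks a child that calls unshare(CLONE_NEWUSER) and then reads its own effective capability mask from the "CapEff:" line of /proc/self/status. The helper names are mine, and the code assumes a Linux system that permits unprivileged user namespaces; where they are disabled, it simply reports failure.

```c
#define _GNU_SOURCE
#include <inttypes.h>
#include <sched.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Parse the "CapEff:" line of /proc/self/status. Returns 0 on success. */
static int read_cap_eff(uint64_t *out)
{
    char line[256];
    int found = -1;
    FILE *f = fopen("/proc/self/status", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof(line), f) != NULL) {
        if (sscanf(line, "CapEff: %" SCNx64, out) == 1) {
            found = 0;
            break;
        }
    }
    fclose(f);
    return found;
}

/*
 * Fork a child that enters a new user namespace and reports its effective
 * capability mask back through a pipe. Returns 0 and fills *caps on
 * success, -1 if anything fails (for example when the system forbids
 * unprivileged user namespaces).
 */
static int cap_eff_in_new_userns(uint64_t *caps)
{
    int fds[2];
    int status;
    pid_t pid;
    ssize_t n;

    if (pipe(fds) != 0)
        return -1;

    pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pid == 0) {                         /* child */
        uint64_t mask = 0;

        close(fds[0]);
        if (unshare(CLONE_NEWUSER) != 0 || read_cap_eff(&mask) != 0)
            _exit(1);
        if (write(fds[1], &mask, sizeof(mask)) != (ssize_t)sizeof(mask))
            _exit(1);
        _exit(0);
    }

    close(fds[1]);                          /* parent */
    n = read(fds[0], caps, sizeof(*caps));
    close(fds[0]);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (n != (ssize_t)sizeof(*caps) || !WIFEXITED(status) ||
        WEXITSTATUS(status) != 0)
        return -1;
    return 0;
}
```

When the call succeeds, the mask reported by the child should have the capability bits set, including bit 16 (CAP_SYS_MODULE), even if the parent is an ordinary user. Those capabilities only count inside the new user namespace.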

F. Digging deeper

If we look at all the places in the Linux kernel where capability checks are made, we see two types of checks, depending on the operation being performed.

  • Checks with respect to the user namespace of the target process.
    For operations in which one process tries to affect another, the acting process must have the required capability in the user namespace of the process it is acting on. For example, if one process is trying to kill another, it must have "CAP_KILL" in the user namespace of the process it is trying to kill. Let's look at the following code.

/* kernel/signal.c:692 */
static int kill_ok_by_cred(struct task_struct *t) /* t is the task_struct of the task that is being killed */
{
    const struct cred *cred = current_cred();  /* credentials of the sender */
    const struct cred *tcred = __task_cred(t); /* credentials of the target */

    /* Classic UID match between sender and target. */
    if (uid_eq(cred->euid, tcred->suid) ||
        uid_eq(cred->euid, tcred->uid)  ||
        uid_eq(cred->uid,  tcred->suid) ||
        uid_eq(cred->uid,  tcred->uid))
        return 1;

    /* Otherwise the sender needs CAP_KILL in the target's user namespace. */
    if (ns_capable(tcred->user_ns, CAP_KILL))
        return 1;

    return 0;
}

"kill_ok_by_cred()" is part of the permission checks performed when a process tries to kill another process. As the call "ns_capable(tcred->user_ns, CAP_KILL)" shows, the check does not just ask whether the current process has the capability; it asks whether it has the capability with respect to the user namespace of the target process.

  • (Incorrect) Checks that just consider the capability of the process, without worrying about the user namespace the process is in.
    Let's say a process P1 is trying to install a kernel module. From the point of view of the process, it is not trying to influence a process belonging to some other user namespace, so it is tempting to conclude that only the capability matters and not the user namespace. This is incorrect; such checks are actually made with respect to init_user_ns, as described next.

  • Checks that consider the capability with respect to the init_user namespace (init_user_ns).
    Some actions have system-wide consequences, for example a process P1 trying to install a module. Such checks are made with respect to the init_user namespace. The question "is this user allowed to install a module with respect to their own namespace?" has no meaning, since the installation affects the whole system. A better question is "is this user allowed to install a module with respect to the root namespace (init_user)?". Since init_user is the ancestor of all other user namespaces, a capability held there has system-wide scope.

    Consider the following code. The check that the process has the required capability is made by "may_init_module()", which calls "capable()", which in turn calls "ns_capable()", the same function used in the previous code in this article. Notice the difference, though: instead of the target's user namespace, the capability is now checked with respect to the "init_user" namespace.

/* kernel/module.c:3321 */
SYSCALL_DEFINE3(finit_module, int, fd, const char __user *, uargs, int, flags)
{
    int err;
    struct load_info info = { };

    err = may_init_module();
    if (err)
    . . .

}

/* kernel/module.c:3136 */
static int may_init_module(void)
{
    if (!capable(CAP_SYS_MODULE) || modules_disabled)
        return -EPERM;

    return 0;
}

/* kernel/capability.c:405 */
bool capable(int cap)
{
    return ns_capable(&init_user_ns, cap);
}
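
The effect of this check can be probed from user space. In the sketch below, a child enters a new user namespace (where it holds a full capability set) and then calls init_module(2) with deliberately bogus arguments. Since the capability check shown above runs before anything else, the call should fail with EPERM and nothing is ever loaded. The function name is mine, and the code assumes a Linux system that permits unprivileged user namespaces; sandboxes that filter these system calls may report a different error.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Returns 1 if a process that is fully capable inside a fresh user
 * namespace gets EPERM from init_module(2), 0 if the call failed in some
 * other way, and -1 if the experiment could not be set up (fork or
 * unshare failed).
 */
static int module_load_denied_in_new_userns(void)
{
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {                         /* child */
        if (unshare(CLONE_NEWUSER) != 0)
            _exit(2);
        /* Bogus image pointer and length: we only want to hit the
         * permission check, never to load a module. */
        if (syscall(SYS_init_module, NULL, 0UL, "") == -1 && errno == EPERM)
            _exit(0);
        _exit(1);
    }

    if (waitpid(pid, &status, 0) < 0 || !WIFEXITED(status))
        return -1;
    switch (WEXITSTATUS(status)) {
    case 0:
        return 1;    /* denied with EPERM, as the kernel code suggests */
    case 1:
        return 0;    /* failed for some other reason */
    default:
        return -1;   /* could not enter a new user namespace */
    }
}
```

A return value of 1 means the child held CAP_SYS_MODULE in its own user namespace and was still refused, because the question the kernel asks is about init_user_ns.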

Resources