Thanks, @polarathene.
Description
The max open files limit (`NOFILE`) of dockerd is 1048576, which is defined in dockerd's systemd unit file:
```
$ cat /proc/$(pidof dockerd)/limits | grep "Max open files"
$ systemctl show docker | grep LimitNOFILE
```
However, inside the container, the value of the limit is a very large number — 1073741816:
```
$ docker run --rm ubuntu bash -c "cat /proc/self/limits" | grep "Max open files"
```
This may cause a program to iterate over all available fds until the limit is reached; for example, xinetd sets the number of file descriptors using setrlimit(2) at initialization, which causes an unnecessary waste of CPU time closing 1073741816 fds:
```
root@1b3165886528# strace xinetd
```
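As a rough illustration (a minimal Python sketch, not from the original report), the cost of such a close-everything loop scales directly with the `RLIMIT_NOFILE` soft limit, because almost every `close(2)` call is for a descriptor that was never open:

```python
import os
import resource
import time

# The soft limit is what close-everything loops typically iterate up to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")

start = time.monotonic()
# Naive "close all possible fds" pattern: one close(2) attempt per potential
# descriptor, almost all of which fail with EBADF. With soft=1024 this is
# instant; with soft=1073741816 it takes a very long time.
for fd in range(3, soft):
    try:
        os.close(fd)
    except OSError:
        pass
print(f"attempted {soft - 3} close() calls in {time.monotonic() - start:.2f}s")
```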
We found similar cases:
yum hang
I noticed that the newest version of Docker on rockylinux-9, installed from https://download.docker.com/linux/centos/$releasever/$basearch/stable, is a bit slow, especially for operations done by yum.
On both centos-7 and rocky-9 hosts I did:
```
docker run -itd --name centos7 quay.io/centos/centos:centos7
```
On the centos-7 host it takes ~2 minutes.
On the rocky-9 host it had not completed after an hour; I can leave it running under tmux to find out how long it takes.
Steps to reproduce:
```
docker run -itd --name centos7 quay.io/centos/centos:centos7
```
rpm slow
Run the below command on the host:
```
time zypper --reposd-dir /workspace/zypper/reposd --cache-dir /workspace/zypper/cache --solv-cache-dir /workspace/zypper/solv --pkg-cache-dir /workspace/zypper/pkg --non-interactive --root /workspace/root install rpm subversion
```
Time spent:
```
real 0m11.248s
```
When testing it in a container:
```
docker run --rm --net=none --log-driver=none -v "/workspace:/workspace" -v "/disks:/disks" opensuse bash -c "time zypper --reposd-dir /workspace/zypper/reposd --cache-dir /workspace/zypper/cache --solv-cache-dir /workspace/zypper/solv --pkg-cache-dir /workspace/zypper/pkg --non-interactive --root /workspace/root install rpm subversion"
```
Time spent:
```
real 0m31.089s
```
Here's the relevant section of code from RPM. It's part of the POSIX Lua library that's inside RPM, and was added by rpm-software-management/rpm@7a7c31f.
```
static int Pexec(lua_State *L)  /** exec(path,[args]) */
```
So the reason for doing `F_GETFD` is that they are setting all of the FDs to `CLOEXEC` before doing the requested `exec(2)`. There's a Red Hat Bugzilla entry in the commit message, which says that this was an SELinux issue where Fedora (or RHEL) has an SELinux setup where you cannot execute a process if it will inherit FDs it shouldn't have access to? I guess if this is an SELinux issue it should be handled by only applying this fix when SELinux is in use (though there are arguably security reasons why you might want to `CLOEXEC` every file descriptor).
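A rough sketch of the difference in Python (helper names are mine; rpm's actual fix is in C): setting `FD_CLOEXEC` by walking the whole limit versus only touching the descriptors that `/proc/self/fd` reports as open:

```python
import fcntl
import os
import resource

def cloexec_all_by_limit():
    # One fcntl() pair per *possible* fd, up to the soft limit.
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    for fd in range(3, soft):
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        except OSError:
            pass  # fd not open

def cloexec_open_fds_only():
    # Only touch descriptors that are actually open, per /proc/self/fd.
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        except OSError:
            pass  # e.g. the fd listdir() used internally is already gone
```

With a limit of 1073741816, the first variant issues billions of syscalls before the `exec(2)`, while the second stays proportional to the handful of fds the process actually has open.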
PtyProcess.spawn slowdown in close() loop
The following code in ptyprocess:

```
# Do not allow child to inherit open file descriptors from parent,
```
is looping through all possible file descriptors in order to close them (note that `closerange()` is implemented as a loop, at least on Linux). If the limit on open fds (aka `ulimit -n`, aka `RLIMIT_NOFILE`, aka `SC_OPEN_MAX`) is set too high (for example, with recent Docker it is 1024*1024), this loop takes considerable time, as it results in about a million `close()` syscalls.
The solution (at least for Linux and Darwin) is to obtain the list of actually opened fds, and only close those. This is implemented in the `subprocess` module in Python 3, and there is a backport of it to Python 2 called `subprocess32`.
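A minimal sketch of that approach (a hypothetical helper, not the actual ptyprocess or subprocess code), intended to run in the child between `fork()` and `exec()`:

```python
import os

def close_open_fds(keep=(0, 1, 2)):
    # Close only the descriptors that /proc/self/fd reports as open,
    # instead of looping over every fd up to RLIMIT_NOFILE.
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError:
            pass  # already closed (e.g. the fd listdir() used internally)
```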
MySQL has been known to allocate excessive memory
In idle mode, the “mysql” container should use ~200MB of memory; ~200-300MB for the “lms” and “cms” containers.
On some operating systems, such as RedHat, Arch Linux or Fedora, a very high limit on the number of open files (nofile) per container may cause the “mysql”, “lms” and “cms” containers to use a lot of memory: up to 8-16GB. To check whether you might be impacted, run::
    cat /proc/$(pgrep dockerd)/limits | grep "Max open files"
If the output is 1073741816 or higher, then it is likely that you are affected by `this mysql issue <https://github.com/docker-library/mysql/issues/579>`__. To learn more about the root cause, read `this containerd issue comment <https://github.com/containerd/containerd/pull/7566#issuecomment-1285417325>`__. Basically, the OS is hard-coding a very high limit for the allowed number of open files, and this is causing some containers to fail. To resolve the problem, you should configure the Docker daemon to enforce a lower value, as described `here <https://github.com/docker-library/mysql/issues/579#issuecomment-1432576518>`__. Edit /etc/docker/daemon.json and add the following contents::
    {
        "default-ulimits": {
            "nofile": {
                "Name": "nofile",
                "Hard": 1048576,
                "Soft": 1048576
            }
        }
    }
Check your configuration is valid with:
dockerd --validate
Then restart the Docker service:
sudo systemctl restart docker.service
Technical Background Introduction
1. RLIMIT_NOFILE
https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#Process%20Properties
Don’t use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux. Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults. Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, i.e. do not use select(2). Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit. Use MemoryMax=
to control overall service memory use, including file descriptor memory.
https://github.com/systemd/systemd/blob/1742aae2aa8cd33897250d6fcfbe10928e43eb2f/NEWS#L60..L94
The Linux kernel’s current default RLIMIT_NOFILE resource limit for userspace processes is set to 1024 (soft) and 4096 (hard). Previously, systemd passed this on unmodified to all processes it forked off. With this systemd release the hard limit systemd passes on is increased to 512K, overriding the kernel’s defaults and substantially increasing the number of simultaneous file descriptors unprivileged userspace processes can allocate. Note that the soft limit remains at 1024 for compatibility reasons: the traditional UNIX select() call cannot deal with file descriptors >= 1024 and increasing the soft limit globally might thus result in programs unexpectedly allocating a high file descriptor and thus failing abnormally when attempting to use it with select() (of course, programs shouldn’t use select() anymore, and prefer poll()/epoll, but the call unfortunately remains undeservedly popular at this time). This change reflects the fact that file descriptor
handling in the Linux kernel has been optimized in more recent kernels and allocating large numbers of them should be much cheaper both in memory and in performance than it used to be. Programs that
want to take benefit of the increased limit have to “opt-in” into high file descriptors explicitly by raising their soft limit. Of course, when they do that they must acknowledge that they cannot use select() anymore (and neither can any shared library they use — or any shared library used by any shared library they use and so on). Which default hard limit is most appropriate is of course hard to decide. However, given reports that ~300K file descriptors are used in real-life applications we believe 512K is sufficiently high as new default for now. Note that there are also reports that using very high hard limits (e.g. 1G) is problematic: some software allocates large arrays with one element for each potential file descriptor (Java, …) — a high hard limit thus triggers excessively large memory allocations in these applications. Hopefully, the new default of 512K is a good middle ground: higher than what real-life applications currently need, and low enough for avoid triggering excessively large allocations in problematic software. (And yes, somebody should fix
Java.)
systemd v240 was released in 2018Q4. Both the Docker and containerd projects have recently removed the line from their configs to rely on the 1024:524288 default that systemd v240 provides (unless the system has been configured explicitly to some other value, which the system administrator may do when they know they need higher limits).
2. File Descriptor Limits
> This specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.)
The primary way to reference, allocate and pin runtime OS resources on Linux today are file descriptors (“fds”). Originally they were used to reference open files and directories and maybe a bit more, but today they may be used to reference almost any kind of runtime resource in Linux userspace, including open devices, memory (memfd_create(2)), timers (timerfd_create(2)) and even processes (with the new pidfd_open(2) system call). In a way, the philosophically skewed UNIX concept of “everything is a file” through the proliferation of fds actually acquires a bit of sensible meaning: “everything has a file descriptor” is certainly a much better motto to adopt.
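For instance (a quick Python illustration on Linux, not part of the quoted post), several of these resources really are handed to you as ordinary fds:

```python
import os

mem_fd = os.memfd_create("scratch")   # anonymous memory, memfd_create(2), Python 3.8+
pid_fd = os.pidfd_open(os.getpid())   # a process, pidfd_open(2), Python 3.9+

print(mem_fd, pid_fd)                 # plain small integers, like any other fd
for fd in (mem_fd, pid_fd):
    os.close(fd)
```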
Because of this proliferation of fds, non-trivial modern programs tend to have to deal with substantially more fds at the same time than they traditionally did. Today, you’ll often encounter real-life programs that have a few thousand fds open at the same time.
Like on most runtime resources on Linux limits are enforced on file descriptors: once you hit the resource limit configured via RLIMIT_NOFILE
any attempt to allocate more is refused with the EMFILE
error — until you close a couple of those you already have open.
Because fds weren’t such a universal concept traditionally, the limit of RLIMIT_NOFILE
used to be quite low. Specifically, when the Linux kernel first invokes userspace it still sets RLIMIT_NOFILE
to a low value of 1024 (soft) and 4096 (hard). (Quick explanation: the soft limit is what matters and causes the EMFILE
issues, the hard limit is a secondary limit that processes may bump their soft limit to — if they like — without requiring further privileges to do so. Bumping the limit further would require privileges however.). A limit of 1024 fds made fds a scarce resource: APIs tried to be careful with using fds, since you simply couldn’t have that many of them at the same time. This resulted in some questionable coding decisions and concepts at various places: often secondary descriptors that are very similar to fds — but were not actually fds — were introduced (e.g. inotify watch descriptors), simply to avoid for them the low limits enforced on true fds. Or code tried to aggressively close fds when not absolutely needing them (e.g. ftw()
/nftw()
), losing the nice + stable “pinning” effect of open fds.
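To make the soft/hard distinction concrete, here is a small Python sketch (mine, not from the quoted post) that lowers the soft limit, runs into `EMFILE`, and then raises it back without any privileges:

```python
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Lowering the soft limit needs no privileges ...
resource.setrlimit(resource.RLIMIT_NOFILE, (16, hard))

fds = []
try:
    while True:
        fds.append(os.open("/dev/null", os.O_RDONLY))
except OSError as exc:
    # ... and hitting it yields EMFILE ("Too many open files").
    print(f"failed after {len(fds)} extra fds: {exc}")
finally:
    for fd in fds:
        os.close(fd)
    # Raising the soft limit back up (anywhere up to the hard limit) also
    # needs no privileges.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```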
Worse though is that certain OS level APIs were designed having only the low limits in mind. The worst offender being the BSD/POSIX select(2)
system call: it only works with fds in the numeric range of 0…1023 (aka FD_SETSIZE
-1). If you have an fd outside of this range, tough luck: select() won’t work, and only if you are lucky you’ll detect that and can handle it somehow.
Linux fds are exposed as simple integers, and for most calls it is guaranteed that the lowest unused integer is allocated for new fds. Thus, as long as the RLIMIT_NOFILE
soft limit is set to 1024 everything remains compatible with select()
: the resulting fds will also be below 1024. Yay. If we’d bump the soft limit above this threshold though and at some point in time an fd higher than the threshold is allocated, this fd would not be compatible with select()
anymore.
Because of that, indiscriminately increasing the soft RLIMIT_NOFILE
resource limit today for every userspace process is problematic: as long as there’s userspace code still using select()
doing so will risk triggering hard-to-handle, hard-to-debug errors all over the place.
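The `select()` problem is easy to reproduce; this Python sketch (an illustration, assuming the hard limit is at least a few thousand, as with systemd's defaults) bumps the soft limit, allocates an fd above 1023 and tries to `select()` on it:

```python
import os
import resource
import select

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = 4096 if hard == resource.RLIM_INFINITY else min(hard, 4096)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

# Occupy descriptors until one lands above the historic 0..1023 range.
fds = [os.open("/dev/null", os.O_RDONLY) for _ in range(1100)]
print("highest fd:", fds[-1])

try:
    select.select([fds[-1]], [], [], 0)
except ValueError as exc:
    # CPython refuses fds >= FD_SETSIZE (1024) rather than corrupting memory.
    print("select() cannot handle it:", exc)

for fd in fds:
    os.close(fd)
```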
However, given the nowadays ubiquitous use of fds for all kinds of resources (did you know, an eBPF program is an fd? and a cgroup too? and attaching an eBPF program to cgroup is another fd? …), we’d really like to raise the limit anyway.
So before we continue thinking about this problem, let’s make the problem more complex (…uh, I mean… “more exciting”) first. Having just one hard and one soft per-process limit on fds is boring. Let’s add more limits on fds to the mix. Specifically on Linux there are two system-wide sysctls: fs.nr_open
and fs.file-max
. (Don’t ask me why one uses a dash and the other an underscore, or why there are two of them…) On today’s kernels they kinda lost their relevance. They had some originally, because fds weren’t accounted by any other counter. But today, the kernel tracks fds mostly as small pieces of memory allocated on userspace requests — because that’s ultimately what they are —, and thus charges them to the memory accounting done anyway.
So now, we have four limits (actually: five if you count the memory accounting) on the same kind of resource, and all of them make a resource artificially scarce that we don’t want to be scarce. So what to do?
Back in systemd v240 already (i.e. 2019) we decided to do something about it. Specifically:
- Automatically at boot we’ll now bump the two sysctls to their maximum, making them effectively ineffective. This one was easy. We got rid of two pretty much redundant knobs. Nice!
- The `RLIMIT_NOFILE` hard limit is bumped substantially to 512K. Yay, cheap fds! You may have an fd, and you, and you as well, everyone may have an fd!
- But … we left the soft `RLIMIT_NOFILE` limit at 1024. We weren’t quite ready to break all programs still using `select()` in 2019 yet. But it’s not as bad as it might sound I think: given the hard limit is bumped every program can easily opt-in to a larger number of fds, by setting the soft limit to the hard limit early on — without requiring privileges.
So effectively, with this approach fds should be much less scarce (at least for programs that opt into that), and the limits should be much easier to configure, since there are only two knobs now one really needs to care about:
- Configure the `RLIMIT_NOFILE` hard limit to the maximum number of fds you actually want to allow a process.
- In the program code then either bump the soft to the hard limit, or not. If you do, you basically declare “I understood the problem, I promise to not use `select()`, drown me in fds please!”. If you don’t then effectively everything remains as it always was.
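The opt-in described in the second knob above is a couple of lines at program start-up; a Python sketch of the general pattern (not any particular project's code, helper name is mine):

```python
import resource

def opt_in_to_many_fds():
    # "I understood the problem, I promise to not use select()":
    # raise the soft limit to whatever hard limit the administrator allowed.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY and soft < hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))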
Apparently this approach worked, since the negative feedback on change was even scarcer than fds traditionally were (ha, fun!). We got reports from pretty much only two projects that were bitten by the change (one being a JVM implementation): they already bumped their soft limit automatically to their hard limit during program initialization, and then allocated an array with one entry per possible fd. With the new high limit this resulted in one massive allocation that traditionally was just a few K, and this caused memory checks to be hit.
Anyway, here’s the take away of this blog story:
- Don’t use `select()` anymore in 2021. Use `poll()`, `epoll`, `iouring`, …, but for heaven’s sake don’t use `select()`. It might have been all the rage in the 1990s but it doesn’t scale and is simply not designed for today’s programs. I wished the man page of `select()` would make clearer how icky it is and that there are plenty of more preferable APIs.
- If you hack on a program that potentially uses a lot of fds, add some simple code somewhere to its start-up that bumps the `RLIMIT_NOFILE` soft limit to the hard limit. But if you do this, you have to make sure your code (and any code that you link to from it) refrains from using `select()`. (Note: there’s at least one glibc NSS plugin using `select()` internally. Given that NSS modules can end up being loaded into pretty much any process such modules should probably be considered just buggy.)
- If said program you hack on forks off foreign programs, make sure to reset the `RLIMIT_NOFILE` soft limit back to 1024 for them. Just because your program might be fine with fds >= 1024 it doesn’t mean that those foreign programs might. And unfortunately `RLIMIT_NOFILE` is inherited down the process tree unless explicitly set.
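The last point, resetting the soft limit before spawning foreign programs, can look like this in Python (a sketch with a hypothetical helper, using `preexec_fn` so the reset happens in the child after `fork()` and before `exec()`):

```python
import resource
import subprocess

def run_foreign_program(argv):
    def reset_nofile():
        # Give the child the traditional soft limit so select()-based code in
        # it keeps working; the hard limit is left untouched.
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        soft = 1024 if hard == resource.RLIM_INFINITY else min(1024, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    return subprocess.run(argv, preexec_fn=reset_nofile)
```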
select

- supervisord with `select()` (2011 reported, 2014 fixed).
- Nginx will raise the soft limit when you tell it to via config (bug report in 2015, where `select()` was used by nginx, limiting it to `1024`). Alternatively, if you're using the nginx container, you can raise the soft limit at the container level.
- Redis docs advising `2^16` in an example; the right value will be dependent upon your workload.
  - `redis-py` Dec 2013 issue with `select()`, fixed in June 2014.
  - `redis/hiredis` 2015 issue where a user was relying on `select()`.
  - Nov 2020 article on `select()`, references that Redis still carries `select()` as a fallback (`ae_select.c`).
- httpd has `select()` (see this 2002 commit, which is still present today in 2024).
- Postgres has this 2010 response for why they don't use the hard limit, so as to not negatively impact other software running. Less of an issue in a container, especially when you can set the limits. The limits work a little bit differently now, so that issue shouldn't be applicable anymore (the global FD limit for a system is notably higher than the hard limit tends to be per process).
  - Docs on `max_files_per_process`. Similar to nginx, there is a specific setting here, and it's also per process this software manages. Both of these are doing the correct thing by not having a monolithic process sharing the same FD limit. The soft limit applied is per process: if you have 1024 processes with a soft limit of 1024, you have `2^20` FDs available across them… not only 1024.
  - Postgres still has source with usage of `select()` here, here and various other locations if you want to look through it.
- MongoDB was using `select()` in 2014; in the 3.7.5 release it was still using it for `listen.cpp`, but that was dropped in the 3.7.6 release (April 2018). A `select()` call still exists though.
Digging deeper into ulimit
ulimit, being an archaic resource management mechanism, is not completely obsoleted by cgroup controllers, and it is still an essential part of system administration.
Default ulimits for a new container are derived from those of dockerd/containerd itself. They are set in the `containerd.service` systemd unit file to `unlimited` values:
```
$ grep ^Limit /lib/systemd/system/containerd.service
```
This is required for containerd itself, but is way too generous for containers it runs. For comparison, ulimits for a user (including root) on the host system are pretty modest (this is an example from Ubuntu 18.04):
```
$ ulimit -a
```
This can create a number of problems, such as a container abusing system resources (e.g. DoS attacks). In general, cgroup limits should be used to prevent those, yet I think ulimits should be set to saner values.
In particular, `RLIMIT_NOFILE`, the open files limit, which is set to 2^20 (aka 1048576), causes a slowdown in a number of programs, as they use the upper limit value to iterate over all potentially opened file descriptors, closing them (or setting the CLOEXEC bit) before every fork/exec. I am aware of the following cases:
- rpm, reported in Slow performance when installing RPMs to create new Docker images #23137, https://bugzilla.redhat.com/show_bug.cgi?id=1537564, fix: Optimize and unite setting CLOEXEC on fds rpm-software-management/rpm#444 (fixed in Fedora 28).
- python2, reported in Spawning PTY processes is many times slower on Docker 18.09 docker/for-linux#502, proposed fix: [2.7] bpo-35757: subprocess.Popen: optimize close_fds for Linux python/cpython#11584 (WONTFIX as python2 is frozen)
- python’s pexpect/ptyprocess library, reported in PtyProcess.spawn (and thus pexpect) slowdown in close() loop pexpect/ptyprocess#50.
Attacking those one by one proved complicated and not very fruitful, as some software is obsolete, some is hard to fix, etc. In addition, the above list is not an exhaustive one, so there might be more cases like this we're not aware of.
Investigated limits impact and costs
`2^16` (65k) `busybox` containers estimated resource usage:
- 688k tasks + 206 GB (192 GiB) memory in `containerd` (10.5 tasks + 3MiB per container).
- Requiring at minimum `LimitNOFILE=262144` (`containerd.service`) + `LimitNOFILE=393216` (`docker.service`) - based on `4:1` + `6:1` service FDs needed per container ratio.
- 2.49 million open files (`fs.file-nr` must be below the `fs.file-max` limit) - approx 38 FDs per container.
- 25 GiB memory for the containers cgroup (approx 400KiB per container).
`LimitNOFILE=524288` (the systemd default since v240) should be a sane default for most systems. It is more than enough for both `docker.service` and `containerd.service` resource needs, capable of supporting 65k containers.
Containers that do need higher limits can explicitly declare that (via `--ulimit` or equivalent), as the upper bound is not impacted by `containerd.service`. The same can be done for lowering limits; both should rarely be necessary for most containers.
While `docker.service` and `containerd.service` need the higher soft limit (enforced implicitly since Go 1.19), it is unlikely to be required for containers. An upcoming release of Go (with backports to 1.19) will implicitly restore the soft limit for `fork`/`exec`'d processes AFAIK. Until then, the Docker daemon can be configured with the `default-ulimit` setting to enforce a `1024` soft limit on containers.
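For example, a `daemon.json` along these lines (values are illustrative, mirroring the snippet shown earlier) keeps a generous hard limit while restoring the traditional `1024` soft limit for containers:

```json
{
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 524288,
      "Soft": 1024
    }
  }
}
```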
System details
```
Fedora 37 VM 6.1.9 kernel x86_64 (16 GB memory)
```
Observations in `.service` files for `LimitNOFILE`
On a fresh install (via VM on Vultr) there were approx 1800 file descriptors open (sysctl `fs.file-nr`). I used a shell loop to run `busybox` containers until failure and adjusted the `LimitNOFILE` for `docker.service` and `containerd.service` to collect metrics for insights.
I noticed a consistent ratio of number of FDs needed per container:
- `docker.service` - `6:1` ratio (`5:1` with `--network=host`), approx 853 containers with `LimitNOFILE=5120` (1024 with host network).
- `containerd.service` - `4:1` ratio (I did not verify if `--network=host` reduced this), `LimitNOFILE=1024` should be capable of 256 containers, provided `docker.service` is also high enough (eg: `LimitNOFILE=2048`).
In `containerd.service` there was also a clear pattern in resources per container, where the `LimitNOFILE` value, image used (`busybox`, `alpine`, `debian`), and number of containers remained constant:
- Each container's systemd `.scope` has 1 task and approx 400KiB memory (a little bit less for `alpine` and `debian`).
- 10.5 tasks + 3MiB memory added per container to the `systemctl status containerd` report.
- Approx 38 open files per container running (`fs.file-nr` after, minus the before value, divided by the number of containers).
`mailserver/docker-mailserver:edge` was also tested to compare to the `sleep 180` containers:
- 33 tasks per container `.scope` and 85MiB memory reported via `systemd-cgtop` (*10GiB needed min to run 120 of these containers*).
- In `containerd` per container average resources were 11 tasks + 3.4MiB memory (*approx 400MiB usage for 120 of these containers*). Roughly consistent with the lighter images resource usage in `containerd`.
- Files opened per container also increased to 291 (*approx 35k files open for 120 of these containers*).
- If you want to reproduce for this image, `docker run` should include these extra options: `--hostname example.test --env SMTP_ONLY=1` (*hostname required to init, `SMTP_ONLY=1` skips needing an account configured*).
Operations like `docker stats` need to open as many file descriptors as there are containers running, otherwise they'll hang waiting. You can observe if the daemon has reached the limit with `ls -1 /proc/$(pidof dockerd)/fd | wc -l`.
Reproduction
Set `LimitNOFILE=768` in `docker.service`, then `systemctl daemon-reload && systemctl restart docker`. You can confirm the limit is applied to the daemon process with `cat /proc/$(pidof dockerd)/limits`.
Running the following should list:
- How many containers are running.
- Number of open files.
- How many tasks and memory both the `containerd` and `dockerd` daemons are using.
```
# Useful to run before the loop to compare against output after the loop is done
```
Running the loop below should fail on the last few containers; about 123 should be created:
```
# When `docker.service` limit is the bottleneck, you may need to `CTRL + C` to exit the loop
```
You can add additional options:
```
--network host
```
- Avoids creating a new veth interface (see `ip link`) to the default Docker bridge each `docker run`.
- Without this `docker run` [may fail after `1023` interfaces are present on a single bridge](https://github.com/moby/moby/issues/44973#issuecomment-1543747718)?
- Creation will be a bit faster, and the FD to container ratio for `dockerd` is lowered to `5:1`.
```
--ulimit "nofile=1023456789"
```
- Useful to observe that it does not affect memory usage on its own.
- Also shows `dockerd` + `containerd` limits don't affect how high this can go.
- For Debian based distros this would fail as it's higher than `fs.nr_open` (`1 048 576`), use that or a lower value.
```
--cgroup-parent=LimitTests.slice
```
- Similar to `docker stats` but isolated from other running containers. `systemd-cgtop` does include disk cache (*file-backed memory pages*) in its reported memory usage however (*use `sync && sysctl vm.drop_caches=3` to clear that*).
- This can be useful if you want a better overview of resource usage across all the containers created:
- Create a temporary slice for testing with: `mkdir /sys/fs/cgroup/LimitTests.slice`.
- Run `systemd-cgtop --order=memory LimitTests.slice` to only view the containers running in this cgroup slice sorted by memory usage.
- Memory usage shown for the entire slice and per container. A `busybox` container uses roughly 400KB per container.
Limits impact across process children
I had a misconception that child processes contributed to the parent's open files limit. However, as my notes in this section detail, children only inherit the limits applied; each process seems to have its own individual count.
Although I'm probably missing something here, as I have read of processes passing down FDs to children, which I think is also why daemons have a common hygiene practice of closing the available FD range? This is lower-level than I'm familiar with 😅
- You can also observe the number of file descriptors open for the `dockerd` and `containerd` processes like this: `ls -1 /proc/$(pidof dockerd)/fd | wc -l`.
- This isn't applicable to the `containerd-shim` process that is responsible for the container, so `ls -1 /proc/$(pgrep --newest --exact containerd-shim)/fd | wc -l` won't be useful there.
To confirm this, run a container to test with: `docker run --rm -it --ulimit "nofile=1024:1048576" alpine ash`. Then try the following:
```
# Create a folder to add many files to:
```
You will notice that:
- Each process adds those FDs to the open file count returned from `fs.file-nr`, and frees them when that process is closed.
- You can re-run the loops for the same process and observe no change; the files are already counted as open for that process.
- There is a memory cost involved:
  - Each file `touch` costs about `2048` bytes (disk-cache only until opened).
  - Each file open (1 or more FD references each increment `fs.file-nr`) costs about `512` bytes per FD open for it.
  - Creating 512k files this way uses approx 1.1GiB memory (not released with `sysctl vm.drop_caches=3` while opened by at least one FD), while each process opening the equivalent amount of file descriptors additionally uses 250MiB (262MB).
Errors
Nothing useful here, other than that slightly different errors appeared depending on which service limit was exhausted first.
Sometimes this made any `docker` command like `docker ps` hang (daemon exhausted its limit). I also observed:
- Containers not running (no `pgrep containerd-shim` output, but `docker ps` listed containers running well beyond when they should have exited).
- Containers running with a `containerd-shim` process (using memory), despite `systemctl stop docker containerd`. Sometimes this needed `pkill containerd-shim` to clean up, and `systemctl start docker containerd` would log a bunch of errors in `journalctl` while handling cleanup of dead shims (depending on the number of containers, this may time out and need the `containerd` service to be started again).
- Even with all that out of the way, there was some memory usage of several hundred MB above the baseline that lingered. As it didn't seem to belong to any process, I assume it was kernel memory. I think the largest number of containers I experimented with running was around 1600-ish.
`docker.service` limit exceeded
This failure output more errors per `docker run`, but the errors varied:
```
ERRO[0000] Error waiting for container: container caff476371b6897ef35a95e26429f100d0d929120ff1abecc8a16aa674d692bf: driver "overlay2" failed to remove root filesystem: openfdat /var/lib/docker/overlay2/35f26ec862bb91d7c3214f76f8660938145bbb36eda114f67e711aad2be89578-init/diff/etc: too many open files

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: time="2023-03-12T02:26:20Z" level=fatal msg="failed to create a netlink handle: could not get current namespace while creating netlink socket: too many open files": unknown.

docker: Error response from daemon: failed to initialize logging driver: open /var/lib/docker/containers/b014a19f7eb89bb909dee158d21f35f001cfeb80c01e0078d6f20aac8151573f/b014a19f7eb89bb909dee158d21f35f001cfeb80c01e0078d6f20aac8151573f-json.log: too many open files.
```
`containerd.service` limit exceeded
I think I’ve seen some others, but it’s usually this one:
```
docker: Error response from daemon: failed to start shim: start failed: : pipe2: too many open files: unknown.
```
Scope and Explanation
- Aug 2023: `LimitNOFILE=infinity` removed from `docker.service`.
- May 2021: `LimitNOFILE=infinity` + `LimitNPROC=infinity` brought back into `docker.service` to sync with Docker CE's equivalent config.
  - This PR was a merge commit of this one from Sep 2018 (commit history interleaved, ranging from 2017-2021).
  - As the PR main diff shows, `LimitNPROC=infinity` was already added, and `LimitNOFILE=1048576` was changed to `infinity` by the PR merge (initially confusing since I'm using `git blame` on the master branch).
- July 2016: `LimitNOFILE=infinity` changed to `LimitNOFILE=1048576` (this number is `2^20`).
  - Discussion references a 2009 StackOverflow answer about `infinity` being capped to `2^20` in a specific distro release / kernel. On some systems today, that ceiling can be 1024 times higher (`1073741816`, roughly `2^30`, over 1 billion).
- July 2016: `LimitNOFILE` and `LimitNPROC` changed from `1048576` to `infinity`.
  - This PR's `LimitNOFILE` change was reverted shortly after, as described above.
- March 2014: Original `LimitNOFILE` + `LimitNPROC` added with `1048576`.
  - The linked PR comment mentions that this 2^20 value is already higher than Docker needs.
  - It appears it was later changed to `infinity` to improve CI times where a smaller limit was applied (like this comment about Ubuntu 14.04 adjusting any limit exceeding `2^20` down to `2^10`?).
  - The PR also [referenced relevant systemd docs](https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Process%20Properties) (which may have changed since 2014).

Current Status:

- `LimitNOFILE=infinity` is still the case until Docker v25, unless the team backports the change to any releases built with Go 1.19+.
- `containerd` has merged the equivalent change to remove `LimitNOFILE` from their systemd service file.
Systemd < 240
Why is LimitNOFILE not set to infinity when configured in the service?
After setting LimitNOFILE to infinity in the service, when checking the limit of the process ID (pid), it is observed that the open file limit is 65536 instead of infinity.
Please review the service configuration.
```
[root@XXX ~]# ulimit -n -u
```
containerd systemd configuration:
```
cat /usr/lib/systemd/system/containerd.service
```
Viewing the effect of the configuration:
```
[root@XXX ~]# cat /proc/$(pidof dockerd)/limits
```
> This has systemd look at /proc/sys/fs/nr_open to find the current maximum of …
see: https://github.com/systemd/systemd/issues/6559
References
- https://github.com/moby/moby/issues/45838
- https://github.com/moby/moby/issues/23137
- https://0pointer.net/blog/file-descriptor-limits.html
- https://www.codenong.com/cs105896693/
- https://github.com/moby/moby/issues/38814
- https://github.com/cri-o/cri-o/issues/7703
- https://github.com/envoyproxy/envoy/issues/31502
- https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#Process%20Properties