Thanks, @polarathene.
Description
The max open files limit (`NOFILE`) of dockerd is 1048576, which is defined in dockerd's systemd unit file:
```
$ cat /proc/$(pidof dockerd)/limits | grep "Max open files"
$ systemctl show docker | grep LimitNOFILE
```
However, inside the container, the value of the limit is a very large number — 1073741816:
```
$ docker run --rm ubuntu bash -c "cat /proc/self/limits" | grep "Max open files"
```
This may cause a program to iterate over all available fds until the limit is reached; for example, xinetd sets the number of file descriptors using setrlimit(2) at initialization, which causes an unnecessary waste of CPU time closing 1073741816 fds:
```
root@1b3165886528# strace xinetd
```
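As a rough illustration (a minimal Python sketch, not from the original report), the cost of such a close-everything loop scales directly with the `RLIMIT_NOFILE` soft limit, because almost every `close(2)` call is for a descriptor that was never open:

```python
import os
import resource
import time

# The soft limit is what close-everything loops typically iterate up to.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"RLIMIT_NOFILE soft={soft} hard={hard}")

start = time.monotonic()
# Naive "close all possible fds" pattern: one close(2) attempt per potential
# descriptor, almost all of which fail with EBADF. With soft=1024 this is
# instant; with soft=1073741816 it takes a very long time.
for fd in range(3, soft):
    try:
        os.close(fd)
    except OSError:
        pass
print(f"attempted {soft - 3} close() calls in {time.monotonic() - start:.2f}s")
```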
We found similar cases:
yum hang
I noticed that the newest version of Docker on rockylinux-9, installed from https://download.docker.com/linux/centos/$releasever/$basearch/stable, is a bit slow, especially for operations done by yum.
On both centos-7 and rocky-9 hosts I did:
```
docker run -itd --name centos7 quay.io/centos/centos:centos7
```
On the centos-7 host it takes ~2 minutes.
On the rocky-9 host it had not completed after an hour; I can leave it running under tmux to find out how long it takes.
Steps to reproduce:
```
docker run -itd --name centos7 quay.io/centos/centos:centos7
```
rpm slow
Run the below command on the host:
```
time zypper --reposd-dir /workspace/zypper/reposd --cache-dir /workspace/zypper/cache --solv-cache-dir /workspace/zypper/solv --pkg-cache-dir /workspace/zypper/pkg --non-interactive --root /workspace/root install rpm subversion
```
Time spent:
```
real 0m11.248s
```
When testing it in a container:
```
docker run --rm --net=none --log-driver=none -v "/workspace:/workspace" -v "/disks:/disks" opensuse bash -c "time zypper --reposd-dir /workspace/zypper/reposd --cache-dir /workspace/zypper/cache --solv-cache-dir /workspace/zypper/solv --pkg-cache-dir /workspace/zypper/pkg --non-interactive --root /workspace/root install rpm subversion"
```
Time spent:
```
real 0m31.089s
```
Here's the relevant section of code from RPM. It's part of the POSIX Lua library that's inside RPM, and was added by rpm-software-management/rpm@7a7c31f.
```
static int Pexec(lua_State *L)  /** exec(path,[args]) */
```
So the reason for doing `F_GETFD` is that they are setting all of the FDs to `CLOEXEC` before doing the requested `exec(2)`. There's a Red Hat Bugzilla entry in the commit message, which says that this was an SELinux issue where Fedora (or RHEL) has an SELinux setup where you cannot execute a process if it will inherit FDs it shouldn't have access to? I guess if this is an SELinux issue it should be handled by only applying this fix when SELinux is in use (though there are arguably security reasons why you might want to `CLOEXEC` every file descriptor).
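A rough sketch of the difference in Python (helper names are mine; rpm's actual fix is in C): setting `FD_CLOEXEC` by walking the whole limit versus only touching the descriptors that `/proc/self/fd` reports as open:

```python
import fcntl
import os
import resource

def cloexec_all_by_limit():
    # One fcntl() pair per *possible* fd, up to the soft limit.
    soft, _ = resource.getrlimit(resource.RLIMIT_NOFILE)
    for fd in range(3, soft):
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        except OSError:
            pass  # fd not open

def cloexec_open_fds_only():
    # Only touch descriptors that are actually open, per /proc/self/fd.
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        try:
            flags = fcntl.fcntl(fd, fcntl.F_GETFD)
            fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)
        except OSError:
            pass  # e.g. the fd listdir() used internally is already gone
```

With a limit of 1073741816, the first variant issues billions of syscalls before the `exec(2)`, while the second stays proportional to the handful of fds the process actually has open.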
PtyProcess.spawn slowdown in close() loop
The following code in ptyprocess:

```
# Do not allow child to inherit open file descriptors from parent,
```
is looping through all possible file descriptors in order to close them (note that `closerange()` is implemented as a loop, at least on Linux). If the limit on open fds (aka `ulimit -n`, aka `RLIMIT_NOFILE`, aka `SC_OPEN_MAX`) is set too high (for example, with recent Docker it is 1024*1024), this loop takes considerable time, as it results in about a million `close()` syscalls.
The solution (at least for Linux and Darwin) is to obtain the list of actually opened fds, and only close those. This is implemented in the `subprocess` module in Python 3, and there is a backport of it to Python 2 called `subprocess32`.
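A minimal sketch of that approach (a hypothetical helper, not the actual ptyprocess or subprocess code), intended to run in the child between `fork()` and `exec()`:

```python
import os

def close_open_fds(keep=(0, 1, 2)):
    # Close only the descriptors that /proc/self/fd reports as open,
    # instead of looping over every fd up to RLIMIT_NOFILE.
    for name in os.listdir("/proc/self/fd"):
        fd = int(name)
        if fd in keep:
            continue
        try:
            os.close(fd)
        except OSError:
            pass  # already closed (e.g. the fd listdir() used internally)
```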
MySQL has been known to allocate excessive memory
In idle mode, the “mysql” container should use ~200MB of memory; ~200-300MB for the “lms” and “cms” containers.
On some operating systems, such as RedHat, Arch Linux or Fedora, a very high limit on the number of open files (nofile) per container may cause the “mysql”, “lms” and “cms” containers to use a lot of memory: up to 8-16GB. To check whether you might be impacted, run::
    cat /proc/$(pgrep dockerd)/limits | grep "Max open files"
If the output is 1073741816 or higher, then it is likely that you are affected by `this mysql issue <https://github.com/docker-library/mysql/issues/579>`__. To learn more about the root cause, read `this containerd issue comment <https://github.com/containerd/containerd/pull/7566#issuecomment-1285417325>`__. Basically, the OS is hard-coding a very high limit for the allowed number of open files, and this is causing some containers to fail. To resolve the problem, you should configure the Docker daemon to enforce a lower value, as described `here <https://github.com/docker-library/mysql/issues/579#issuecomment-1432576518>`__. Edit /etc/docker/daemon.json and add the following contents::
    {
        "default-ulimits": {
            "nofile": {
                "Name": "nofile",
                "Hard": 1048576,
                "Soft": 1048576
            }
        }
    }
Check your configuration is valid with:
dockerd --validate
Then restart the Docker service:
sudo systemctl restart docker.service
Technical Background Introduction
1. RLIMIT_NOFILE
https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#Process%20Properties
Don’t use. Be careful when raising the soft limit above 1024, since select(2) cannot function with file descriptors above 1023 on Linux. Nowadays, the hard limit defaults to 524288, a very high value compared to historical defaults. Typically applications should increase their soft limit to the hard limit on their own, if they are OK with working with file descriptors above 1023, i.e. do not use select(2). Note that file descriptors are nowadays accounted like any other form of memory, thus there should not be any need to lower the hard limit. Use MemoryMax=
to control overall service memory use, including file descriptor memory.
https://github.com/systemd/systemd/blob/1742aae2aa8cd33897250d6fcfbe10928e43eb2f/NEWS#L60..L94
The Linux kernel’s current default RLIMIT_NOFILE resource limit for userspace processes is set to 1024 (soft) and 4096 (hard). Previously, systemd passed this on unmodified to all processes it forked off. With this systemd release the hard limit systemd passes on is increased to 512K, overriding the kernel’s defaults and substantially increasing the number of simultaneous file descriptors unprivileged userspace processes can allocate. Note that the soft limit remains at 1024 for compatibility reasons: the traditional UNIX select() call cannot deal with file descriptors >= 1024 and increasing the soft limit globally might thus result in programs unexpectedly allocating a high file descriptor and thus failing abnormally when attempting to use it with select() (of course, programs shouldn’t use select() anymore, and prefer poll()/epoll, but the call unfortunately remains undeservedly popular at this time). This change reflects the fact that file descriptor
handling in the Linux kernel has been optimized in more recent kernels and allocating large numbers of them should be much cheaper both in memory and in performance than it used to be. Programs that
want to take benefit of the increased limit have to “opt-in” into high file descriptors explicitly by raising their soft limit. Of course, when they do that they must acknowledge that they cannot use select() anymore (and neither can any shared library they use — or any shared library used by any shared library they use and so on). Which default hard limit is most appropriate is of course hard to decide. However, given reports that ~300K file descriptors are used in real-life applications we believe 512K is sufficiently high as new default for now. Note that there are also reports that using very high hard limits (e.g. 1G) is problematic: some software allocates large arrays with one element for each potential file descriptor (Java, …) — a high hard limit thus triggers excessively large memory allocations in these applications. Hopefully, the new default of 512K is a good middle ground: higher than what real-life applications currently need, and low enough for avoid triggering excessively large allocations in problematic software. (And yes, somebody should fix
Java.)
systemd v240 was released in 2018Q4. Both the Docker and containerd projects have recently removed the line from their configs to rely on the 1024:524288 default that systemd v240 provides (unless the system has been configured explicitly to some other value, which the system administrator may do when they know they need higher limits).
2. File Descriptor Limits
> This specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(2), pipe(2), dup(2), etc.) to exceed this limit yield the error EMFILE. (Historically, this limit was named RLIMIT_OFILE on BSD.)
The primary way to reference, allocate and pin runtime OS resources on Linux today are file descriptors (“fds”). Originally they were used to reference open files and directories and maybe a bit more, but today they may be used to reference almost any kind of runtime resource in Linux userspace, including open devices, memory (memfd_create(2)), timers (timerfd_create(2)) and even processes (with the new pidfd_open(2) system call). In a way, the philosophically skewed UNIX concept of “everything is a file” through the proliferation of fds actually acquires a bit of sensible meaning: “everything has a file descriptor” is certainly a much better motto to adopt.
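For instance (a quick Python illustration on Linux, not part of the quoted post), several of these resources really are handed to you as ordinary fds:

```python
import os

mem_fd = os.memfd_create("scratch")   # anonymous memory, memfd_create(2), Python 3.8+
pid_fd = os.pidfd_open(os.getpid())   # a process, pidfd_open(2), Python 3.9+

print(mem_fd, pid_fd)                 # plain small integers, like any other fd
for fd in (mem_fd, pid_fd):
    os.close(fd)
```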
Because of this proliferation of fds, non-trivial modern programs tend to have to deal with substantially more fds at the same time than they traditionally did. Today, you’ll often encounter real-life programs that have a few thousand fds open at the same time.
Like on most runtime resources on Linux limits are enforced on file descriptors: once you hit the resource limit configured via RLIMIT_NOFILE
any attempt to allocate more is refused with the EMFILE
error — until you close a couple of those you already have open.
Because fds weren’t such a universal concept traditionally, the limit of RLIMIT_NOFILE
used to be quite low. Specifically, when the Linux kernel first invokes userspace it still sets RLIMIT_NOFILE
to a low value of 1024 (soft) and 4096 (hard). (Quick explanation: the soft limit is what matters and causes the EMFILE
issues, the hard limit is a secondary limit that processes may bump their soft limit to — if they like — without requiring further privileges to do so. Bumping the limit further would require privileges however.). A limit of 1024 fds made fds a scarce resource: APIs tried to be careful with using fds, since you simply couldn’t have that many of them at the same time. This resulted in some questionable coding decisions and concepts at various places: often secondary descriptors that are very similar to fds — but were not actually fds — were introduced (e.g. inotify watch descriptors), simply to avoid for them the low limits enforced on true fds. Or code tried to aggressively close fds when not absolutely needing them (e.g. ftw()
/nftw()
), losing the nice + stable “pinning” effect of open fds.
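To make the soft/hard distinction concrete, here is a small Python sketch (mine, not from the quoted post) that lowers the soft limit, runs into `EMFILE`, and then raises it back without any privileges:

```python
import os
import resource

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"soft={soft} hard={hard}")

# Lowering the soft limit needs no privileges ...
resource.setrlimit(resource.RLIMIT_NOFILE, (16, hard))

fds = []
try:
    while True:
        fds.append(os.open("/dev/null", os.O_RDONLY))
except OSError as exc:
    # ... and hitting it yields EMFILE ("Too many open files").
    print(f"failed after {len(fds)} extra fds: {exc}")
finally:
    for fd in fds:
        os.close(fd)
    # Raising the soft limit back up (anywhere up to the hard limit) also
    # needs no privileges.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
```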
Worse though is that certain OS level APIs were designed having only the low limits in mind. The worst offender being the BSD/POSIX select(2)
system call: it only works with fds in the numeric range of 0…1023 (aka FD_SETSIZE
-1). If you have an fd outside of this range, tough luck: select() won’t work, and only if you are lucky you’ll detect that and can handle it somehow.
Linux fds are exposed as simple integers, and for most calls it is guaranteed that the lowest unused integer is allocated for new fds. Thus, as long as the RLIMIT_NOFILE
soft limit is set to 1024 everything remains compatible with select()
: the resulting fds will also be below 1024. Yay. If we’d bump the soft limit above this threshold though and at some point in time an fd higher than the threshold is allocated, this fd would not be compatible with select()
anymore.
Because of that, indiscriminately increasing the soft RLIMIT_NOFILE
resource limit today for every userspace process is problematic: as long as there’s userspace code still using select()
doing so will risk triggering hard-to-handle, hard-to-debug errors all over the place.
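The `select()` problem is easy to reproduce; this Python sketch (an illustration, assuming the hard limit is at least a few thousand, as with systemd's defaults) bumps the soft limit, allocates an fd above 1023 and tries to `select()` on it:

```python
import os
import resource
import select

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
new_soft = 4096 if hard == resource.RLIM_INFINITY else min(hard, 4096)
resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))

# Occupy descriptors until one lands above the historic 0..1023 range.
fds = [os.open("/dev/null", os.O_RDONLY) for _ in range(1100)]
print("highest fd:", fds[-1])

try:
    select.select([fds[-1]], [], [], 0)
except ValueError as exc:
    # CPython refuses fds >= FD_SETSIZE (1024) rather than corrupting memory.
    print("select() cannot handle it:", exc)

for fd in fds:
    os.close(fd)
```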
However, given the nowadays ubiquitous use of fds for all kinds of resources (did you know, an eBPF program is an fd? and a cgroup too? and attaching an eBPF program to cgroup is another fd? …), we’d really like to raise the limit anyway.
So before we continue thinking about this problem, let’s make the problem more complex (…uh, I mean… “more exciting”) first. Having just one hard and one soft per-process limit on fds is boring. Let’s add more limits on fds to the mix. Specifically on Linux there are two system-wide sysctls: fs.nr_open
and fs.file-max
. (Don’t ask me why one uses a dash and the other an underscore, or why there are two of them…) On today’s kernels they kinda lost their relevance. They had some originally, because fds weren’t accounted by any other counter. But today, the kernel tracks fds mostly as small pieces of memory allocated on userspace requests — because that’s ultimately what they are —, and thus charges them to the memory accounting done anyway.
So now, we have four limits (actually: five if you count the memory accounting) on the same kind of resource, and all of them make a resource artificially scarce that we don’t want to be scarce. So what to do?
Back in systemd v240 already (i.e. 2019) we decided to do something about it. Specifically:
- Automatically at boot we’ll now bump the two sysctls to their maximum, making them effectively ineffective. This one was easy. We got rid of two pretty much redundant knobs. Nice!
- The `RLIMIT_NOFILE` hard limit is bumped substantially to 512K. Yay, cheap fds! You may have an fd, and you, and you as well, everyone may have an fd!
- But … we left the soft `RLIMIT_NOFILE` limit at 1024. We weren’t quite ready to break all programs still using `select()` in 2019 yet. But it’s not as bad as it might sound I think: given the hard limit is bumped every program can easily opt-in to a larger number of fds, by setting the soft limit to the hard limit early on — without requiring privileges.
So effectively, with this approach fds should be much less scarce (at least for programs that opt into that), and the limits should be much easier to configure, since there are only two knobs now one really needs to care about:
- Configure the `RLIMIT_NOFILE` hard limit to the maximum number of fds you actually want to allow a process.
- In the program code then either bump the soft to the hard limit, or not. If you do, you basically declare “I understood the problem, I promise to not use `select()`, drown me in fds please!”. If you don’t then effectively everything remains as it always was.
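The opt-in described in the second knob above is a couple of lines at program start-up; a Python sketch of the general pattern (not any particular project's code, helper name is mine):

```python
import resource

def opt_in_to_many_fds():
    # "I understood the problem, I promise to not use select()":
    # raise the soft limit to whatever hard limit the administrator allowed.
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY and soft < hard:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))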
Apparently this approach worked, since the negative feedback on change was even scarcer than fds traditionally were (ha, fun!). We got reports from pretty much only two projects that were bitten by the change (one being a JVM implementation): they already bumped their soft limit automatically to their hard limit during program initialization, and then allocated an array with one entry per possible fd. With the new high limit this resulted in one massive allocation that traditionally was just a few K, and this caused memory checks to be hit.
Anyway, here’s the take away of this blog story:
- Don’t use `select()` anymore in 2021. Use `poll()`, `epoll`, `iouring`, …, but for heaven’s sake don’t use `select()`. It might have been all the rage in the 1990s but it doesn’t scale and is simply not designed for today’s programs. I wished the man page of `select()` would make clearer how icky it is and that there are plenty of more preferable APIs.
- If you hack on a program that potentially uses a lot of fds, add some simple code somewhere to its start-up that bumps the `RLIMIT_NOFILE` soft limit to the hard limit. But if you do this, you have to make sure your code (and any code that you link to from it) refrains from using `select()`. (Note: there’s at least one glibc NSS plugin using `select()` internally. Given that NSS modules can end up being loaded into pretty much any process such modules should probably be considered just buggy.)
- If said program you hack on forks off foreign programs, make sure to reset the `RLIMIT_NOFILE` soft limit back to 1024 for them. Just because your program might be fine with fds >= 1024 it doesn’t mean that those foreign programs might. And unfortunately `RLIMIT_NOFILE` is inherited down the process tree unless explicitly set.
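The last point, resetting the soft limit before spawning foreign programs, can look like this in Python (a sketch with a hypothetical helper, using `preexec_fn` so the reset happens in the child after `fork()` and before `exec()`):

```python
import resource
import subprocess

def run_foreign_program(argv):
    def reset_nofile():
        # Give the child the traditional soft limit so select()-based code in
        # it keeps working; the hard limit is left untouched.
        _, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
        soft = 1024 if hard == resource.RLIM_INFINITY else min(1024, hard)
        resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))

    return subprocess.run(argv, preexec_fn=reset_nofile)
```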
select

- supervisord with `select()` (2011 reported, 2014 fixed).
- Nginx will raise the soft limit when you tell it to via config (bug report in 2015, where `select()` was used by nginx, limiting it to `1024`). Alternatively, if you're using the nginx container, you can raise the soft limit at the container level.
- Redis docs advising `2^16` in an example; the right value will be dependent upon your workload.
  - `redis-py` Dec 2013 issue with `select()`, fixed in June 2014.
  - `redis/hiredis` 2015 issue where a user was relying on `select()`.
  - Nov 2020 article on `select()`, references that Redis still carries `select()` as a fallback (`ae_select.c`).
- httpd has `select()` (see this 2002 commit, which is still present today in 2024).
- Postgres has this 2010 response for why they don't use the hard limit, so as to not negatively impact other software running. Less of an issue in a container, especially when you can set the limits. The limits work a little bit differently now, so that issue shouldn't be applicable anymore (the global FD limit for a system is notably higher than the hard limit tends to be per process).
  - Docs on `max_files_per_process`. Similar to nginx, there is a specific setting here, and it's also per process this software manages. Both of these are doing the correct thing by not having a monolithic process sharing the same FD limit. The soft limit applied is per process: if you have 1024 processes with a soft limit of 1024, you have `2^20` FDs available across them… not only 1024.
  - Postgres still has source with usage of `select()` here, here and various other locations if you want to look through it.
- MongoDB was using `select()` in 2014; in the 3.7.5 release it was still using it for `listen.cpp`, but that was dropped in the 3.7.6 release (April 2018). A `select()` call still exists though.
Digging deeper into ulimit
ulimit, being an archaic resource management mechanism, is not completely obsoleted by cgroup controllers, and it is still an essential part of system administration.
Default ulimits for a new container are derived from those of dockerd/containerd itself. They are set in the `containerd.service` systemd unit file to `unlimited` values:
```
$ grep ^Limit /lib/systemd/system/containerd.service
```
This is required for containerd itself, but is way too generous for containers it runs. For comparison, ulimits for a user (including root) on the host system are pretty modest (this is an example from Ubuntu 18.04):
```
$ ulimit -a
```
This can create a number of problems, such as a container abusing system resources (e.g. DoS attacks). In general, cgroup limits should be used to prevent those, yet I think ulimits should be set to saner values.
In particular, `RLIMIT_NOFILE`, the open files limit, which is set to 2^20 (aka 1048576), causes a slowdown in a number of programs, as they use the upper limit value to iterate over all potentially opened file descriptors, closing them (or setting the CLOEXEC bit) before every fork/exec. I am aware of the following cases:
- rpm, reported in Slow performance when installing RPMs to create new Docker images #23137, https://bugzilla.redhat.com/show_bug.cgi?id=1537564, fix: Optimize and unite setting CLOEXEC on fds rpm-software-management/rpm#444 (fixed in Fedora 28).
- python2, reported in Spawning PTY processes is many times slower on Docker 18.09 docker/for-linux#502, proposed fix: [2.7] bpo-35757: subprocess.Popen: optimize close_fds for Linux python/cpython#11584 (WONTFIX as python2 is frozen)
- python’s pexpect/ptyprocess library, reported in PtyProcess.spawn (and thus pexpect) slowdown in close() loop pexpect/ptyprocess#50.
Attacking those one by one proved complicated and not very fruitful, as some software is obsolete, some is hard to fix, etc. In addition, the above list is not an exhaustive one, so there might be more cases like this we're not aware of.
Investigated limits impact and costs
`2^16` (65k) `busybox` containers estimated resource usage:
- 688k tasks + 206 GB (192 GiB) memory in `containerd` (10.5 tasks + 3MiB per container).
- Requiring at minimum `LimitNOFILE=262144` (`containerd.service`) + `LimitNOFILE=393216` (`docker.service`) - based on `4:1` + `6:1` service FDs needed per container ratio.
- 2.49 million open files (`fs.file-nr` must be below the `fs.file-max` limit) - approx 38 FDs per container.
- 25 GiB memory for the containers cgroup (approx 400KiB per container).
`LimitNOFILE=524288` (the systemd default since v240) should be a sane default for most systems. It is more than enough for both `docker.service` and `containerd.service` resource needs, capable of supporting 65k containers.
Containers that do need higher limits can explicitly declare that (via `--ulimit` or equivalent), as the upper bound is not impacted by `containerd.service`. The same can be done for lowering limits; both should rarely be necessary for most containers.
While `docker.service` and `containerd.service` need the higher soft limit (enforced implicitly since Go 1.19), it is unlikely to be required for containers. An upcoming release of Go (with backports to 1.19) will implicitly restore the soft limit for `fork`/`exec`'d processes AFAIK. Until then, the Docker daemon can be configured with the `default-ulimit` setting to enforce a `1024` soft limit on containers.
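For example, a `daemon.json` along these lines (values are illustrative, mirroring the snippet shown earlier) keeps a generous hard limit while restoring the traditional `1024` soft limit for containers:

```json
{
  "default-ulimits": {
    "nofile": {
      "Name": "nofile",
      "Hard": 524288,
      "Soft": 1024
    }
  }
}
```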
System details
```
Fedora 37 VM 6.1.9 kernel x86_64 (16 GB memory)
```
Observations in `.service` files for `LimitNOFILE`
On a fresh install (via VM on Vultr) there were approx 1800 file descriptors open (sysctl `fs.file-nr`). I used a shell loop to run `busybox` containers until failure and adjusted the `LimitNOFILE` for `docker.service` and `containerd.service` to collect metrics for insights.
I noticed a consistent ratio of number of FDs needed per container:
- `docker.service` - `6:1` ratio (`5:1` with `--network=host`), approx 853 containers with `LimitNOFILE=5120` (1024 with host network).
- `containerd.service` - `4:1` ratio (I did not verify if `--network=host` reduced this), `LimitNOFILE=1024` should be capable of 256 containers, provided `docker.service` is also high enough (eg: `LimitNOFILE=2048`).
In `containerd.service` there was also a clear pattern in resources per container, where the `LimitNOFILE` value, image used (`busybox`, `alpine`, `debian`), and number of containers remained constant:
- Each container's systemd `.scope` has 1 task and approx 400KiB memory (a little bit less for `alpine` and `debian`).
- 10.5 tasks + 3MiB memory added per container to the `systemctl status containerd` report.
- Approx 38 open files per container running (`fs.file-nr` after, minus the before value, divided by the number of containers).
`mailserver/docker-mailserver:edge` was also tested to compare to the `sleep 180` containers:
- 33 tasks per container `.scope` and 85MiB memory reported via `systemd-cgtop` (*10GiB needed min to run 120 of these containers*).
- In `containerd` per container average resources were 11 tasks + 3.4MiB memory (*approx 400MiB usage for 120 of these containers*). Roughly consistent with the lighter images resource usage in `containerd`.
- Files opened per container also increased to 291 (*approx 35k files open for 120 of these containers*).
- If you want to reproduce for this image, `docker run` should include these extra options: `--hostname example.test --env SMTP_ONLY=1` (*hostname required to init, `SMTP_ONLY=1` skips needing an account configured*).
Operations like `docker stats` need to open as many file descriptors as there are containers running, otherwise they'll hang waiting. You can observe if the daemon has reached the limit with `ls -1 /proc/$(pidof dockerd)/fd | wc -l`.
Reproduction
Set `LimitNOFILE=768` in `docker.service`, then `systemctl daemon-reload && systemctl restart docker`. You can confirm the limit is applied to the daemon process with `cat /proc/$(pidof dockerd)/limits`.
Running the following should list:
- How many containers are running.
- Number of open files.
- How many tasks and memory both the `containerd` and `dockerd` daemons are using.
```
# Useful to run before the loop to compare against output after the loop is done
```
Running the loop below should fail on the last few containers; about 123 should be created:
```
# When `docker.service` limit is the bottleneck, you may need to `CTRL + C` to exit the loop
```
You can add additional options:
```
--network host
```
- Avoids creating a new veth interface (see `ip link`) to the default Docker bridge each `docker run`.
- Without this `docker run` [may fail after `1023` interfaces are present on a single bridge](https://github.com/moby/moby/issues/44973#issuecomment-1543747718)?
- Creation will be a bit faster, and the FD to container ratio for `dockerd` is lowered to `5:1`.
```
--ulimit "nofile=1023456789"
```
- Useful to observe that it does not affect memory usage on its own.
- Also shows `dockerd` + `containerd` limits don't affect how high this can go.
- For Debian based distros this would fail as it's higher than `fs.nr_open` (`1 048 576`), use that or a lower value.
```
--cgroup-parent=LimitTests.slice
```
- Similar to `docker stats` but isolated from other running containers. `systemd-cgtop` does include disk cache (*file-backed memory pages*) in its reported memory usage however (*use `sync && sysctl vm.drop_caches=3` to clear that*).
- This can be useful if you want a better overview of resource usage across all the containers created:
- Create a temporary slice for testing with: `mkdir /sys/fs/cgroup/LimitTests.slice`.
- Run `systemd-cgtop --order=memory LimitTests.slice` to only view the containers running in this cgroup slice sorted by memory usage.
- Memory usage shown for the entire slice and per container. A `busybox` container uses roughly 400KB per container.
Limits impact across process children
I had a misconception that child processes contributed to the parent's open files limit. However, as my notes in this section detail, children only inherit the limits applied; each process seems to have its own individual count.
Although I'm probably missing something here, as I have read of processes passing down FDs to children, which I think is also why daemons have a common hygiene practice of closing the available FD range? This is lower-level than I'm familiar with 😅
- You can also observe the number of file descriptors open for the `dockerd` and `containerd` processes like this: `ls -1 /proc/$(pidof dockerd)/fd | wc -l`.
- This isn't applicable to the `containerd-shim` process that is responsible for the container, so `ls -1 /proc/$(pgrep --newest --exact containerd-shim)/fd | wc -l` won't be useful there.
To confirm this, run a container to test with: `docker run --rm -it --ulimit "nofile=1024:1048576" alpine ash`. Then try the following:
```
# Create a folder to add many files to:
```
You will notice that:
- Each process adds those FDs to the open file count returned from `fs.file-nr`, and frees them when that process is closed.
- You can re-run the loops for the same process and observe no change; the files are already counted as open for that process.
- There is a memory cost involved:
  - Each file `touch` costs about `2048` bytes (disk-cache only until opened).
  - Each file open (1 or more FD references each increment `fs.file-nr`) costs about `512` bytes per FD open for it.
  - Creating 512k files this way uses approx 1.1GiB memory (not released with `sysctl vm.drop_caches=3` while opened by at least one FD), while each process opening the equivalent amount of file descriptors additionally uses 250MiB (262MB).
Errors
Nothing useful here, other than that slightly different errors appeared depending on which service limit was exhausted first.
Sometimes this made any `docker` command like `docker ps` hang (daemon exhausted its limit). I also observed:
- Containers not running (no `pgrep containerd-shim` output, but `docker ps` listed containers running well beyond when they should have exited).
- Containers running with a `containerd-shim` process (using memory), despite `systemctl stop docker containerd`. Sometimes this needed `pkill containerd-shim` to clean up, and `systemctl start docker containerd` would log a bunch of errors in `journalctl` while handling cleanup of dead shims (depending on the number of containers, this may time out and need the `containerd` service to be started again).
- Even with all that out of the way, there was some memory usage of several hundred MB above the baseline that lingered. As it didn't seem to belong to any process, I assume it was kernel memory. I think the largest number of containers I experimented with running was around 1600-ish.
`docker.service` limit exceeded
This failure output more errors per `docker run`, but the errors varied:
```
ERRO[0000] Error waiting for container: container caff476371b6897ef35a95e26429f100d0d929120ff1abecc8a16aa674d692bf: driver "overlay2" failed to remove root filesystem: openfdat /var/lib/docker/overlay2/35f26ec862bb91d7c3214f76f8660938145bbb36eda114f67e711aad2be89578-init/diff/etc: too many open files

docker: Error response from daemon: failed to create shim task: OCI runtime create failed: runc create failed: unable to start container process: error during container init: error running hook #0: error running hook: exit status 1, stdout: , stderr: time="2023-03-12T02:26:20Z" level=fatal msg="failed to create a netlink handle: could not get current namespace while creating netlink socket: too many open files": unknown.

docker: Error response from daemon: failed to initialize logging driver: open /var/lib/docker/containers/b014a19f7eb89bb909dee158d21f35f001cfeb80c01e0078d6f20aac8151573f/b014a19f7eb89bb909dee158d21f35f001cfeb80c01e0078d6f20aac8151573f-json.log: too many open files.
```
`containerd.service` limit exceeded
I think I’ve seen some others, but it’s usually this one:
```
docker: Error response from daemon: failed to start shim: start failed: : pipe2: too many open files: unknown.
```
Scope and Explanation
- Aug 2023: `LimitNOFILE=infinity` removed from `docker.service`.
- May 2021: `LimitNOFILE=infinity` + `LimitNPROC=infinity` brought back into `docker.service` to sync with Docker CE's equivalent config.
  - This PR was a merge commit of this one from Sep 2018 (commit history interleaved, ranging from 2017-2021).
  - As the PR main diff shows, `LimitNPROC=infinity` was already added, and `LimitNOFILE=1048576` was changed to `infinity` by the PR merge (initially confusing since I'm using `git blame` on the master branch).
- July 2016: `LimitNOFILE=infinity` changed to `LimitNOFILE=1048576` (this number is `2^20`).
  - Discussion references a 2009 StackOverflow answer about `infinity` being capped to `2^20` in a specific distro release / kernel. On some systems today, that ceiling can be 1024 times higher (`1073741816`, roughly `2^30`, over 1 billion).
- July 2016: `LimitNOFILE` and `LimitNPROC` changed from `1048576` to `infinity`.
  - This PR's `LimitNOFILE` change was reverted shortly after, as described above.
- March 2014: Original `LimitNOFILE` + `LimitNPROC` added with `1048576`.
  - The linked PR comment mentions that this 2^20 value is already higher than Docker needs.
  - It appears it was later changed to `infinity` to improve CI times where a smaller limit was applied (like this comment about Ubuntu 14.04 adjusting any limit exceeding `2^20` down to `2^10`?).
  - The PR also [referenced relevant systemd docs](https://www.freedesktop.org/software/systemd/man/systemd.exec.html#Process%20Properties) (which may have changed since 2014).

Current Status:

- `LimitNOFILE=infinity` is still the case until Docker v25, unless the team backports the change to any releases built with Go 1.19+.
- `containerd` has merged the equivalent change to remove `LimitNOFILE` from their systemd service file.
Systemd < 240
Why is LimitNOFILE not set to infinity when configured in the service?
After setting LimitNOFILE to infinity in the service, when checking the limit of the process ID (pid), it is observed that the open file limit is 65536 instead of infinity.
Please review the service configuration.
```
[root@XXX ~]# ulimit -n -u
```
containerd systemd configuration:
```
cat /usr/lib/systemd/system/containerd.service
```
Viewing the effect of the configuration:
```
[root@XXX ~]# cat /proc/$(pidof dockerd)/limits
```
> This has systemd look at /proc/sys/fs/nr_open to find the current maximum of …
see: https://github.com/systemd/systemd/issues/6559
References
- https://github.com/moby/moby/issues/45838
- https://github.com/moby/moby/issues/23137
- https://0pointer.net/blog/file-descriptor-limits.html
- https://www.codenong.com/cs105896693/
- https://github.com/moby/moby/issues/38814
- https://github.com/cri-o/cri-o/issues/7703
- https://github.com/envoyproxy/envoy/issues/31502
- https://www.freedesktop.org/software/systemd/man/latest/systemd.exec.html#Process%20Properties