267 lines
11 KiB
XML
267 lines
11 KiB
XML
<?xml version='1.0' encoding='utf-8' ?>
|
|
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
|
|
]>
|
|
<chapter id="sect-Defensive_Coding-Tasks-Descriptors">
|
|
<title>File Descriptor Management</title>
|
|
<para>
|
|
File descriptors underlie all input/output mechanisms offered by
|
|
the system. They are used to implementation the <literal>FILE
|
|
*</literal>-based functions found in
|
|
<literal><stdio.h></literal>, and all the file and network
|
|
communication facilities provided by the Python and Java
|
|
environments are eventually implemented in them.
|
|
</para>
|
|
<para>
|
|
File descriptors are small, non-negative integers in userspace,
|
|
and are backed on the kernel side with complicated data structures
|
|
which can sometimes grow very large.
|
|
</para>
|
|
<section>
|
|
<title>Closing descriptors</title>
|
|
<para>
|
|
If a descriptor is no longer used by a program and is not closed
|
|
explicitly, its number cannot be reused (which is problematic in
|
|
itself, see <xref
|
|
linkend="sect-Defensive_Coding-Tasks-Descriptors-Limit"/>), and
|
|
the kernel resources are not freed. Therefore, it is important
|
|
to close all descriptors at the earlierst point in time
|
|
possible, but not earlier.
|
|
</para>
|
|
<section>
|
|
<title>Error handling during descriptor close</title>
|
|
<para>
|
|
The <function>close</function> system call is always
|
|
successful in the sense that the passed file descriptor is
|
|
never valid after the function has been called. However,
|
|
<function>close</function> still can return an error, for
|
|
example if there was a file system failure. But this error is
|
|
not very useful because the absence of an error does not mean
|
|
that all caches have been emptied and previous writes have
|
|
been made durable. Programs which need such guarantees must
|
|
open files with <literal>O_SYNC</literal> or use
|
|
<literal>fsync</literal> or <literal>fdatasync</literal>, and
|
|
may also have to <literal>fsync</literal> the directory
|
|
containing the file.
|
|
</para>
|
|
</section>
|
|
<section>
|
|
<title>Closing descriptors and race conditions</title>
|
|
<para>
|
|
Unlike process IDs, which are recycle only gradually, the
|
|
kernel always allocates the lowest unused file descriptor when
|
|
a new descriptor is created. This means that in a
|
|
multi-threaded program which constantly opens and closes file
|
|
descriptors, descriptors are reused very quickly. Unless
|
|
descriptor closing and other operations on the same file
|
|
descriptor are synchronized (typically, using a mutex), there
|
|
will be race coniditons and I/O operations will be applied to
|
|
the wrong file descriptor.
|
|
</para>
|
|
<para>
|
|
Sometimes, it is necessary to close a file descriptor
|
|
concurrently, while another thread might be about to use it in
|
|
a system call. In order to support this, a program needs to
|
|
create a single special file descriptor, one on which all I/O
|
|
operations fail. One way to achieve this is to use
|
|
<function>socketpair</function>, close one of the descriptors,
|
|
and call <literal>shutdown(fd, SHUTRDWR)</literal> on the
|
|
other.
|
|
</para>
|
|
<para>
|
|
When a descriptor is closed concurrently, the program does not
|
|
call <function>close</function> on the descriptor. Instead it
|
|
program uses <function>dup2</function> to replace the
|
|
descriptor to be closed with the dummy descriptor created
|
|
earlier. This way, the kernel will not reuse the descriptor,
|
|
but it will carry out all other steps associated with calling
|
|
a descriptor (for instance, if the descriptor refers to a
|
|
stream socket, the peer will be notified).
|
|
</para>
|
|
<para>
|
|
This is just a sketch, and many details are missing.
|
|
Additional data structures are needed to determine when it is
|
|
safe to really close the descriptor, and proper locking is
|
|
required for that.
|
|
</para>
|
|
</section>
|
|
<section>
|
|
<title>Lingering state after close</title>
|
|
<para>
|
|
By default, closing a stream socket returns immediately, and
|
|
the kernel will try to send the data in the background. This
|
|
means that it is impossible to implement accurate accounting
|
|
of network-related resource utilization from userspace.
|
|
</para>
|
|
<para>
|
|
The <literal>SO_LINGER</literal> socket option alters the
|
|
behavior of <function>close</function>, so that it will return
|
|
only after the lingering data has been processed, either by
|
|
sending it to the peer successfully, or by discarding it after
|
|
the configured timeout. However, there is no interface which
|
|
could perform this operation in the background, so a separate
|
|
userspace thread is needed for each <function>close</function>
|
|
call, causing scalability issues.
|
|
</para>
|
|
<para>
|
|
Currently, there is no application-level countermeasure which
|
|
applies universally. Mitigation is possible with
|
|
<application>iptables</application> (the
|
|
<literal>connlimit</literal> match type in particular) and
|
|
specialized filtering devices for denial-of-service network
|
|
traffic.
|
|
</para>
|
|
<para>
|
|
These problems are not related to the
|
|
<literal>TIME_WAIT</literal> state commonly seen in
|
|
<application>netstat</application> output. The kernel
|
|
automatically expires such sockets if necessary.
|
|
</para>
|
|
</section>
|
|
</section>
|
|
|
|
<section id="sect-Defensive_Coding-Tasks-Descriptors-Child_Processes">
|
|
<title>Preventing file descriptor leaks to child processes</title>
|
|
<para>
|
|
Child processes created with <function>fork</function> share
|
|
the initial set of file descriptors with their parent
|
|
process. By default, file descriptors are also preserved if
|
|
a new process image is created with <function>execve</function>
|
|
(or any of the other functions such as <function>system</function>
|
|
or <function>posix_spawn</function>).
|
|
</para>
|
|
<para>
|
|
Usually, this behavior is not desirable. There are two ways to
|
|
turn it off, that is, to prevent new process images from
|
|
inheriting the file descriptors in the parent process:
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Set the close-on-exec flag on all newly created file
|
|
descriptors. Traditionally, this flag is controlled by the
|
|
<literal>FD_CLOEXEC</literal> flag, using
|
|
<literal>F_GETFD</literal> and <literal>F_SETFD</literal>
|
|
operations of the <function>fcntl</function> function.
|
|
</para>
|
|
<para>
|
|
However, in a multi-threaded process, there is a race
|
|
condition: a subprocess could have been created between the
|
|
time the descriptor was created and the
|
|
<literal>FD_CLOEXEC</literal> was set. Therefore, many system
|
|
calls which create descriptors (such as
|
|
<function>open</function> and <function>openat</function>)
|
|
now accept the <function>O_CLOEXEC</function> flag
|
|
(<function>SOCK_CLOEXEC</function> for
|
|
<function>socket</function> and
|
|
<function>socketpair</function>), which cause the
|
|
<literal>FD_CLOEXEC</literal> flag to be set for the file
|
|
descriptor in an atomic fashion. In addition, a few new
|
|
systems calls were introduced, such as
|
|
<function>pipe2</function> and <function>dup3</function>.
|
|
</para>
|
|
<para>
|
|
The downside of this approach is that every descriptor needs
|
|
to receive special treatment at the time of creation,
|
|
otherwise it is not completely effective.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
After calling <function>fork</function>, but before creating
|
|
a new process image with <function>execve</function>, all
|
|
file descriptors which the child process will not need are
|
|
closed.
|
|
</para>
|
|
<para>
|
|
Traditionally, this was implemented as a loop over file
|
|
descriptors ranging from <literal>3</literal> to
|
|
<literal>255</literal> and later <literal>1023</literal>.
|
|
But this is only an approximatio because it is possible to
|
|
create file descriptors outside this range easily (see <xref
|
|
linkend="sect-Defensive_Coding-Tasks-Descriptors-Limit"/>).
|
|
Another approach reads <filename>/proc/self/fd</filename>
|
|
and closes the unexpected descriptors listed there, but this
|
|
approach is much slower.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
<para>
|
|
At present, environments which care about file descriptor
|
|
leakage implement the second approach. OpenJDK 6 and 7
|
|
are among them.
|
|
</para>
|
|
</section>
|
|
|
|
<section id="sect-Defensive_Coding-Tasks-Descriptors-Limit">
|
|
<title>Dealing with the <function>select</function> limit</title>
|
|
<para>
|
|
By default, a user is allowed to open only 1024 files in a
|
|
single process, but the system administrator can easily change
|
|
this limit (which is necessary for busy network servers).
|
|
However, there is another restriction which is more difficult to
|
|
overcome.
|
|
</para>
|
|
<para>
|
|
The <function>select</function> function only supports a
|
|
maximum of <literal>FD_SETSIZE</literal> file descriptors
|
|
(that is, the maximum permitted value for a file descriptor
|
|
is <literal>FD_SETSIZE - 1</literal>, usually 1023.) If a
|
|
process opens many files, descriptors may exceed such
|
|
limits. It is impossible to query such descriptors using
|
|
<function>select</function>.
|
|
</para>
|
|
<para>
|
|
If a library which creates many file descriptors is used in
|
|
the same process as a library which uses
|
|
<function>select</function>, at least one of them needs to
|
|
be changed. <!-- ??? refer to event-driven programming -->
|
|
Calls to <function>select</function> can be replaced with
|
|
calls to <function>poll</function> or another event handling
|
|
mechanism. Replacing the <function>select</function> function
|
|
is the recommended approach.
|
|
</para>
|
|
<para>
|
|
Alternatively, the library with high descriptor usage can
|
|
relocate descriptors above the <literal>FD_SETSIZE</literal>
|
|
limit using the following procedure.
|
|
</para>
|
|
<itemizedlist>
|
|
<listitem>
|
|
<para>
|
|
Create the file descriptor <literal>fd</literal> as
|
|
usual, preferably with the <literal>O_CLOEXEC</literal>
|
|
flag.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Before doing anything else with the descriptor
|
|
<literal>fd</literal>, invoke:
|
|
</para>
|
|
<programlisting language="C">
|
|
int newfd = fcntl(fd, F_DUPFD_CLOEXEC, (long)FD_SETSIZE);
|
|
</programlisting>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Check that <literal>newfd</literal> result is
|
|
non-negative, otherwise close <literal>fd</literal> and
|
|
report an error, and return.
|
|
</para>
|
|
</listitem>
|
|
<listitem>
|
|
<para>
|
|
Close <literal>fd</literal> and continue to use
|
|
<literal>newfd</literal>.
|
|
</para>
|
|
</listitem>
|
|
</itemizedlist>
|
|
<para>
|
|
The new descriptor has been allocated above the
|
|
<literal>FD_SETSIZE</literal>. Even though this algorithm
|
|
is racy in the sense that the <literal>FD_SETSIZE</literal>
|
|
first descriptors could fill up, a very high degree of
|
|
physical parallelism is required before this becomes a problem.
|
|
</para>
|
|
</section>
|
|
</chapter>
|