defensive-coding-guide/en-US/Tasks-Descriptors.xml

<?xml version='1.0' encoding='utf-8' ?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN" "http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
]>
<chapter id="sect-Defensive_Coding-Tasks-Descriptors">
  <title>File Descriptor Management</title>
  <para>
    File descriptors underlie all input/output mechanisms offered by
    the system.  They are used to implementation the <literal>FILE
    *</literal>-based functions found in
    <literal>&lt;stdio.h&gt;</literal>, and all the file and network
    communication facilities provided by the Python and Java
    environments are eventually implemented in them.
  </para>
  <para>
    File descriptors are small, non-negative integers in userspace,
    and are backed on the kernel side with complicated data structures
    which can sometimes grow very large.
  </para>
  <section>
    <title>Closing descriptors</title>
    <para>
      If a descriptor is no longer used by a program and is not closed
      explicitly, its number cannot be reused (which is problematic in
      itself, see <xref
      linkend="sect-Defensive_Coding-Tasks-Descriptors-Limit"/>), and
      the kernel resources are not freed.  Therefore, it is important
      to close all descriptors at the earlierst point in time
      possible, but not earlier.
    </para>
    <section>
      <title>Error handling during descriptor close</title>
      <para>
	The <function>close</function> system call is always
	successful in the sense that the passed file descriptor is
	never valid after the function has been called.  However,
	<function>close</function> still can return an error, for
	example if there was a file system failure.  But this error is
	not very useful because the absence of an error does not mean
	that all caches have been emptied and previous writes have
	been made durable.  Programs which need such guarantees must
	open files with <literal>O_SYNC</literal> or use
	<literal>fsync</literal> or <literal>fdatasync</literal>, and
	may also have to <literal>fsync</literal> the directory
	containing the file.
      </para>
    </section>
    <section>
      <title>Closing descriptors and race conditions</title>
      <para>
	Unlike process IDs, which are recycle only gradually, the
	kernel always allocates the lowest unused file descriptor when
	a new descriptor is created.  This means that in a
	multi-threaded program which constantly opens and closes file
	descriptors, descriptors are reused very quickly.  Unless
	descriptor closing and other operations on the same file
	descriptor are synchronized (typically, using a mutex), there
	will be race coniditons and I/O operations will be applied to
	the wrong file descriptor.
      </para>
      <para>
	Sometimes, it is necessary to close a file descriptor
	concurrently, while another thread might be about to use it in
	a system call.  In order to support this, a program needs to
	create a single special file descriptor, one on which all I/O
	operations fail.  One way to achieve this is to use
	<function>socketpair</function>, close one of the descriptors,
	and call <literal>shutdown(fd, SHUTRDWR)</literal> on the
	other.
      </para>
      <para>
	When a descriptor is closed concurrently, the program does not
	call <function>close</function> on the descriptor.  Instead it
	program uses <function>dup2</function> to replace the
	descriptor to be closed with the dummy descriptor created
	earlier.  This way, the kernel will not reuse the descriptor,
	but it will carry out all other steps associated with calling
	a descriptor (for instance, if the descriptor refers to a
	stream socket, the peer will be notified).
      </para>
      <para>
	This is just a sketch, and many details are missing.
	Additional data structures are needed to determine when it is
	safe to really close the descriptor, and proper locking is
	required for that.
      </para>
    </section>
    <section>
      <title>Lingering state after close</title>
      <para>
	By default, closing a stream socket returns immediately, and
	the kernel will try to send the data in the background.  This
	means that it is impossible to implement accurate accounting
	of network-related resource utilization from userspace.
      </para>
      <para>
	The <literal>SO_LINGER</literal> socket option alters the
	behavior of <function>close</function>, so that it will return
	only after the lingering data has been processed, either by
	sending it to the peer successfully, or by discarding it after
	the configured timeout.  However, there is no interface which
	could perform this operation in the background, so a separate
	userspace thread is needed for each <function>close</function>
	call, causing scalability issues.
      </para>
      <para>
	Currently, there is no application-level countermeasure which
	applies universally.  Mitigation is possible with
	<application>iptables</application> (the
	<literal>connlimit</literal> match type in particular) and
	specialized filtering devices for denial-of-service network
	traffic.
      </para>
      <para>
	These problems are not related to the
	<literal>TIME_WAIT</literal> state commonly seen in
	<application>netstat</application> output.  The kernel
	automatically expires such sockets if necessary.
      </para>
    </section>
  </section>

  <section id="sect-Defensive_Coding-Tasks-Descriptors-Child_Processes">
    <title>Preventing file descriptor leaks to child processes</title>
    <para>
      Child processes created with <function>fork</function> share
      the initial set of file descriptors with their parent
      process.  By default, file descriptors are also preserved if
      a new process image is created with <function>execve</function>
      (or any of the other functions such as <function>system</function>
      or <function>posix_spawn</function>).
    </para>
    <para>
      Usually, this behavior is not desirable.  There are two ways to
      turn it off, that is, to prevent new process images from
      inheriting the file descriptors in the parent process:
    </para>
    <itemizedlist>
      <listitem>
	<para>
	  Set the close-on-exec flag on all newly created file
	  descriptors.  Traditionally, this flag is controlled by the
	  <literal>FD_CLOEXEC</literal> flag, using
	  <literal>F_GETFD</literal> and <literal>F_SETFD</literal>
	  operations of the <function>fcntl</function> function.
	</para>
	<para>
	  However, in a multi-threaded process, there is a race
	  condition: a subprocess could have been created between the
	  time the descriptor was created and the
	  <literal>FD_CLOEXEC</literal> was set.  Therefore, many system
	  calls which create descriptors (such as
	  <function>open</function> and <function>openat</function>)
	  now accept the <function>O_CLOEXEC</function> flag
	  (<function>SOCK_CLOEXEC</function> for
	  <function>socket</function> and
	  <function>socketpair</function>), which cause the
	  <literal>FD_CLOEXEC</literal> flag to be set for the file
	  descriptor in an atomic fashion.  In addition, a few new
	  systems calls were introduced, such as
	  <function>pipe2</function> and <function>dup3</function>.
	</para>
	<para>
	  The downside of this approach is that every descriptor needs
	  to receive special treatment at the time of creation,
	  otherwise it is not completely effective.
	</para>
      </listitem>
      <listitem>
	<para>
	  After calling <function>fork</function>, but before creating
	  a new process image with <function>execve</function>, all
	  file descriptors which the child process will not need are
	  closed.
	</para>
	<para>
	  Traditionally, this was implemented as a loop over file
	  descriptors ranging from <literal>3</literal> to
	  <literal>255</literal> and later <literal>1023</literal>.
	  But this is only an approximatio because it is possible to
	  create file descriptors outside this range easily (see <xref
	  linkend="sect-Defensive_Coding-Tasks-Descriptors-Limit"/>).
	  Another approach reads <filename>/proc/self/fd</filename>
	  and closes the unexpected descriptors listed there, but this
	  approach is much slower.
	</para>
      </listitem>
    </itemizedlist>
    <para>
      At present, environments which care about file descriptor
      leakage implement the second approach.  OpenJDK 6 and 7
      are among them.
    </para>
  </section>

  <section id="sect-Defensive_Coding-Tasks-Descriptors-Limit">
    <title>Dealing with the <function>select</function> limit</title>
    <para>
      By default, a user is allowed to open only 1024 files in a
      single process, but the system administrator can easily change
      this limit (which is necessary for busy network servers).
      However, there is another restriction which is more difficult to
      overcome.
    </para>
    <para>
      The <function>select</function> function only supports a
      maximum of <literal>FD_SETSIZE</literal> file descriptors
      (that is, the maximum permitted value for a file descriptor
      is <literal>FD_SETSIZE - 1</literal>, usually 1023.)  If a
      process opens many files, descriptors may exceed such
      limits.  It is impossible to query such descriptors using
      <function>select</function>.
    </para>
    <para>
      If a library which creates many file descriptors is used in
      the same process as a library which uses
      <function>select</function>, at least one of them needs to
      be changed.  <!-- ??? refer to event-driven programming -->
      Calls to <function>select</function> can be replaced with
      calls to <function>poll</function> or another event handling
      mechanism.  Replacing the <function>select</function> function
      is the recommended approach.
    </para>
    <para>
      Alternatively, the library with high descriptor usage can
      relocate descriptors above the <literal>FD_SETSIZE</literal>
      limit using the following procedure.
    </para>
    <itemizedlist>
      <listitem>
	<para>
	  Create the file descriptor <literal>fd</literal> as
	  usual, preferably with the <literal>O_CLOEXEC</literal>
	  flag.
	</para>
      </listitem>
      <listitem>
	<para>
	  Before doing anything else with the descriptor
	  <literal>fd</literal>, invoke:
	</para>
	<programlisting language="C">
	  int newfd = fcntl(fd, F_DUPFD_CLOEXEC, (long)FD_SETSIZE);
	</programlisting>
      </listitem>
      <listitem>
	<para>
	  Check that <literal>newfd</literal> result is
	  non-negative, otherwise close <literal>fd</literal> and
	  report an error, and return.
	</para>
      </listitem>
      <listitem>
	<para>
	  Close <literal>fd</literal> and continue to use
	  <literal>newfd</literal>.
	</para>
      </listitem>
    </itemizedlist>
    <para>
      The new descriptor has been allocated above the
      <literal>FD_SETSIZE</literal>.  Even though this algorithm
      is racy in the sense that the <literal>FD_SETSIZE</literal>
      first descriptors could fill up, a very high degree of
      physical parallelism is required before this becomes a problem.
    </para>
  </section>
</chapter>