Serialization and Deserialization
Protocol decoders and file format parsers are often the most exposed
part of an application: they process untrusted input with little or
no user interaction, and before any authentication or security checks
are made. They are also difficult to write robustly in languages
which are not memory-safe.
Recommendations for manually written decoders
For C and C++, the general advice for those languages elsewhere in
this document applies. In addition, avoid forming non-character
pointers that point directly into input buffers; such pointers can be
misaligned, and misaligned accesses cause crashes on some
architectures.
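For example, rather than casting a pointer into the buffer to a wider integer type, copy the bytes out with memcpy. A minimal C++ sketch (the function name and the big-endian field layout are illustrative assumptions):

  #include <cstddef>
  #include <cstdint>
  #include <cstring>

  // Read a 32-bit big-endian value at the given offset.  Copying the
  // bytes with memcpy avoids forming a (possibly misaligned) uint32_t
  // pointer into the input buffer.
  std::uint32_t read_be32(const unsigned char *buf, std::size_t offset)
  {
    unsigned char raw[4];
    std::memcpy(raw, buf + offset, sizeof(raw));
    return (std::uint32_t(raw[0]) << 24) | (std::uint32_t(raw[1]) << 16)
         | (std::uint32_t(raw[2]) << 8) | std::uint32_t(raw[3]);
  }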
When reading variable-sized objects, do not allocate large
amounts of data solely based on the value of a size field. If
possible, grow the data structure as more data is read from the
source, and stop when no more data is available. This helps to avoid
denial-of-service attacks in which small amounts of input data trigger
enormous memory allocations during decoding. Alternatively, you can
impose reasonable bounds on memory allocations, but some protocols do
not permit this.
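A minimal C++ sketch of the incremental approach (the Source interface, record representation, and error handling are illustrative assumptions, not part of any particular protocol):

  #include <cstdint>
  #include <stdexcept>
  #include <vector>

  // Hypothetical data source: read_record returns false when no more
  // data is available.
  struct Source {
    bool read_record(std::vector<std::uint8_t> &out);
  };

  std::vector<std::vector<std::uint8_t>>
  read_records(Source &src, std::uint32_t declared_count)
  {
    // Do not reserve declared_count entries up front; that value is
    // attacker-controlled.  Grow only as records actually arrive.
    std::vector<std::vector<std::uint8_t>> records;
    std::vector<std::uint8_t> record;
    while (src.read_record(record)) {
      if (records.size() >= declared_count)
        throw std::runtime_error("more records than declared");
      records.push_back(record);
    }
    if (records.size() != declared_count)
      throw std::runtime_error("truncated input");
    return records;
  }

Only the records that actually arrive are stored, so a forged size field cannot trigger a large allocation by itself.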
Protocol design
Binary formats with explicit length fields are more difficult to
parse robustly than those where the length of dynamically-sized
elements is derived from sentinel values. A protocol which does
not use length fields and can be written in printable ASCII
characters simplifies testing and debugging. However, binary
protocols with length fields may be more efficient to parse.
Library support for deserialization
For some languages, generic libraries are available which can
serialize and deserialize user-defined objects. The
deserialization part comes in one of two flavors, depending on
the library. The first kind uses type information in the data
stream to control which objects are instantiated. The second
kind uses type definitions supplied by the programmer. The
first one allows arbitrary object instantiation, the second one
generally does not.
The following serialization frameworks are in the first category,
are known to be unsafe, and must not be used for untrusted data:
Python's pickle and cPickle
modules, and wrappers such as shelve
Perl's Storable package
Java serialization (java.io.ObjectInputStream),
even if encoded in other formats (as with
java.beans.XMLDecoder)
PHP serialization (unserialize)
Most implementations of YAML
When using a type-directed deserialization format where the
types of the deserialized objects are specified by the
programmer, make sure that the objects which can be instantiated
cannot perform any destructive actions in their destructors,
even when the data members have been manipulated.
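As an illustration, consider a hypothetical class (not taken from any particular framework) whose destructor acts on a data member. If a type-directed deserializer is allowed to instantiate it, the destructor performs an attacker-influenced action:

  #include <cstdio>
  #include <string>

  // Hypothetical type exposed to a type-directed deserializer.
  struct TempFile {
    std::string path;
    // The destructor acts on a data member.  If an attacker can set
    // "path" through deserialization, destroying the object removes an
    // attacker-chosen file.
    ~TempFile() { std::remove(path.c_str()); }
  };

Types exposed to deserialization should have destructors that remain harmless regardless of the values of their data members.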
In general, JSON decoders do not suffer from this problem. But
you must not use the eval function to parse
JSON objects in JavaScript; even with the regular expression
filter from RFC 4627, there are still information leaks
remaining. JSON-based formats can still turn out risky if they
serve as an encoding form for any of the serialization
frameworks listed above.
XML serialization
External references
XML documents can contain external references, which can occur
in various places:
In the DTD declaration in the header of an XML document
In a namespace declaration
In an entity definition
In a notation declaration
Originally, these external references were intended as unique
identifiers, but many XML implementations use them to locate the
data for the referenced element. This causes unwanted network
traffic, and may disclose file system contents or otherwise
unreachable network resources, so this functionality should be
disabled.
Depending on the XML library, external references might be
processed not just when parsing XML, but also when generating
it.
Entity expansion
When external DTD processing is disabled, an internal DTD
subset can still contain entity definitions. Entity
declarations can reference other entities. Some XML libraries
expand entities automatically, and this processing cannot be
switched off in some places (such as attribute values or
content models). Without limits on the entity nesting level,
this expansion results in data which can grow exponentially in
length with the size of the input. (If there is a limit on the
nesting level, the growth is still polynomial, unless further
limits are imposed.)
Consequently, the processing of internal DTD subsets should be
disabled if possible, and only trusted DTDs should be
processed. If a particular XML application does not permit
such restrictions, then application-specific limits are called
for.
XInclude processing
XInclude processing can reference file and network resources
and include them into the document, much like external entity
references. When parsing untrusted XML documents, XInclude
processing should be turned off.
XInclude processing is also fairly complex and may pull in
support for the XPointer and XPath specifications,
considerably increasing the amount of code required for XML
processing.
Algorithmic complexity of XML validation
DTD-based XML validation uses regular expressions for content
models. The XML specification requires that content models
are deterministic, which means that efficient validation is
possible. However, some implementations do not enforce
determinism and require exponential (or, in milder cases,
polynomial) amounts of space or time to validate some
DTD/document combinations.
XML schemas and RELAX NG (via the xsd:
prefix) directly support textual regular expressions which are
not required to be deterministic.
Using Expat for XML parsing
By default, Expat does not try to resolve external IDs, so no
steps are required to block them. However, internal entity
declarations are processed. Installing a callback which stops
parsing as soon as such entities are encountered disables
them, as shown in the example below.
Expat does not perform any validation, so there are no
problems related to that.
Disabling XML entity processing with Expat
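A minimal sketch of such a handler, using Expat's C API from C++ (Expat 2.x assumed; the handler name is illustrative):

  #include <expat.h>

  // Stop parsing as soon as any entity declaration is encountered.  The
  // parser is registered as its own user data (see the next example), so
  // the handler can call XML_StopParser on it.
  static void XMLCALL
  EntityDeclHandler(void *userData,
                    const XML_Char *, int, const XML_Char *, int,
                    const XML_Char *, const XML_Char *,
                    const XML_Char *, const XML_Char *)
  {
    XML_StopParser(static_cast<XML_Parser>(userData), XML_FALSE);
  }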
This handler must be installed when the
XML_Parser object is created (see the next example).
Creating an Expat XML parser
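A sketch of the corresponding parser setup (Expat 2.x assumed; the function name and error handling are illustrative):

  #include <expat.h>
  #include <stdexcept>

  // Create a parser with entity processing disabled.
  XML_Parser create_parser()
  {
    XML_Parser parser = XML_ParserCreate("UTF-8");
    if (parser == nullptr)
      throw std::runtime_error("XML_ParserCreate failed");
    // EntityDeclHandler (previous example) needs a reference to the
    // parser in order to stop parsing, so register the parser as its
    // own user data.
    XML_SetUserData(parser, parser);
    // Disable entity processing, to inhibit entity expansion.
    XML_SetEntityDeclHandler(parser, EntityDeclHandler);
    return parser;
  }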
It is also possible to reject internal DTD subsets altogether,
using a suitable
XML_StartDoctypeDeclHandler handler
installed with XML_SetDoctypeDeclHandler, as sketched below.
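A sketch of a stricter variant which stops parsing at any DOCTYPE declaration, and therefore also at any internal DTD subset (again assuming the parser has been registered as its own user data):

  #include <expat.h>

  // Reject any document that carries a DOCTYPE declaration, which also
  // rules out internal DTD subsets.
  static void XMLCALL
  StartDoctypeDeclHandler(void *userData, const XML_Char *,
                          const XML_Char *, const XML_Char *, int)
  {
    XML_StopParser(static_cast<XML_Parser>(userData), XML_FALSE);
  }

  // Installed on a parser created as in the previous example:
  //   XML_SetDoctypeDeclHandler(parser, StartDoctypeDeclHandler, nullptr);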
Using Qt for XML parsing
The XML component of Qt, QtXml, does not resolve external IDs
by default, so no steps are required to prevent such resolution.
Internal entities are processed, though. To change that, custom
QXmlDeclHandler and
QXmlSimpleReader subclasses are needed. It
is not possible to use the
QDomDocument::setContent(const QByteArray
&) convenience methods.
The example below shows an entity handler which always returns
errors, causing parsing to stop when entity declarations are
encountered.
A QtXml entity handler which blocks entity processing
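A sketch of such a handler, assuming the QtXml module of Qt 4 or Qt 5 (the class name NoEntityHandler is illustrative):

  #include <QtXml/QXmlDeclHandler>

  // Declaration handler which rejects all attribute and entity
  // declarations, so parsing stops with an error as soon as one is seen.
  class NoEntityHandler : public QXmlDeclHandler {
  public:
    bool attributeDecl(const QString &, const QString &, const QString &,
                       const QString &, const QString &)
    { return false; }
    bool internalEntityDecl(const QString &, const QString &)
    { return false; }
    bool externalEntityDecl(const QString &, const QString &,
                            const QString &)
    { return false; }
    QString errorString() const
    { return "XML declaration not permitted"; }
  };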
This handler is used in the custom
QXmlReader subclass shown in the next example.
Some parts of QtXml will call the
setDeclHandler(QXmlDeclHandler *) method.
Consequently, we prevent overriding our custom handler by
providing a definition of this method which does nothing. In
the constructor, we activate namespace processing; this part
may need adjusting.
A QtXml XML reader which blocks entity processing
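A sketch of the NoEntityReader class described above (again assuming QtXml from Qt 4 or Qt 5):

  #include <QtXml/QXmlSimpleReader>

  // Reader which installs NoEntityHandler and refuses to let other
  // code replace it.
  class NoEntityReader : public QXmlSimpleReader {
    NoEntityHandler handler;
  public:
    NoEntityReader()
    {
      QXmlSimpleReader::setDeclHandler(&handler);
      // Activate namespace processing; this part may need adjusting.
      setFeature("http://xml.org/sax/features/namespaces", true);
      setFeature("http://xml.org/sax/features/namespace-prefixes", false);
    }
    // Some parts of QtXml call setDeclHandler; ignore those attempts so
    // our handler stays in place.
    void setDeclHandler(QXmlDeclHandler *) { }
  };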
Our NoEntityReader class can be used with
one of the overloaded
QDomDocument::setContent methods. The
example below shows how the buffer object (of type
QByteArray) is wrapped as a
QXmlInputSource. After calling the
setContent method, you should check the
return value and report any error.
Parsing an XML document with QDomDocument, without entity expansion
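A sketch of the parsing step, assuming the XML document is held in a QByteArray named data (the wrapper function is illustrative):

  #include <QtCore/QBuffer>
  #include <QtXml/QDomDocument>
  #include <QtXml/QXmlInputSource>

  // Parse untrusted XML held in "data" without entity expansion.
  bool parseWithoutEntities(QByteArray &data, QDomDocument &doc,
                            QString &errorMsg, int &errorLine,
                            int &errorColumn)
  {
    NoEntityReader reader;
    QBuffer buffer(&data);
    buffer.open(QIODevice::ReadOnly);
    QXmlInputSource source(&buffer);
    // Check the return value of setContent and report any error.
    return doc.setContent(&source, &reader,
                          &errorMsg, &errorLine, &errorColumn);
  }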
Using OpenJDK for XML parsing and validation
OpenJDK contains facilities for DOM-based, SAX-based, and
StAX-based document parsing. Documents can be validated
against DTDs or XML schemas.
The approach taken to deal with entity expansion differs from
the general recommendation in the section on entity expansion
above. We enable the feature flag
javax.xml.XMLConstants.FEATURE_SECURE_PROCESSING,
which enforces heuristic restrictions on the number of entity
expansions. Note that this flag alone does not prevent
resolution of external references (system IDs or public IDs),
so it is slightly misnamed.
In the following sections, we use helper classes to prevent
external ID resolution.
Helper class to prevent DTD external entity resolution in OpenJDK
Helper class to prevent schema resolution in
OpenJDK
The imports used by the examples are shown in the listing below.
Java imports for OpenJDK XML parsing
DOM-based XML parsing and DTD validation in OpenJDK
This approach produces an
org.w3c.dom.Document object from an input
stream. The example below uses the data from the
java.io.InputStream instance in the
inputStream variable.
DOM-based XML parsing in OpenJDK
External entity references are prohibited using the
NoEntityResolver class from the helper
class listing above.
Because external DTD references are prohibited, DTD validation
(if enabled) will only happen against the internal DTD subset
embedded in the XML document.
To validate the document against an external DTD, use a
javax.xml.transform.Transformer class to
add the DTD reference to the document, and an entity
resolver which whitelists this external reference.
XML Schema validation in OpenJDK
The example below shows how to validate a document against an
XML schema, using a SAX-based approach. The XML data is read
from a java.io.InputStream in the
inputStream variable.
SAX-based validation against an XML schema in
OpenJDK
The NoResourceResolver class is defined
in the helper class listing above.
If you need to validate a document against an XML schema,
use the code in the DOM-based parsing example above
to create the document, but do not enable validation at this
point. Then use the code in the example below
to perform the schema-based validation on the
org.w3c.dom.Document instance
document.
Validation of a DOM document against an XML schema in
OpenJDK
Other XML parsers in OpenJDK
OpenJDK contains additional XML parsing and processing
facilities. Some of them are insecure.
The class java.beans.XMLDecoder acts as a
bridge between the Java object serialization format and XML.
It is close to impossible to securely deserialize Java
objects in this format from untrusted inputs, so its use is
not recommended, as with the Java object serialization
format itself (see the list of unsafe serialization frameworks
above).
Protocol Encoders
For protocol encoders, you should write bytes to a buffer which
grows as needed, using an exponential sizing policy. Explicit
lengths can be patched in later, once they are known.
Allocating the required number of bytes upfront typically
requires separate code to compute the final size, which must be
kept in sync with the actual encoding step, or vulnerabilities
may result. In multi-threaded code, parts of the object being
serialized might change, so that the computed size is out of
date.
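A minimal C++ sketch of this approach; the frame layout (a 32-bit big-endian length prefix followed by the payload) is an illustrative assumption, and std::vector supplies the exponential growth policy:

  #include <cstddef>
  #include <cstdint>
  #include <stdexcept>
  #include <string>
  #include <vector>

  // Encode a frame as a 4-byte big-endian payload length followed by
  // the payload.  The buffer grows as needed, and the length is patched
  // in after the payload has been written.
  std::vector<unsigned char> encode_frame(const std::string &payload)
  {
    if (payload.size() > 0xffffffffu)
      throw std::length_error("payload too large for 32-bit length field");
    std::vector<unsigned char> out;
    std::size_t length_pos = out.size();
    out.insert(out.end(), 4, 0);                    // length placeholder
    out.insert(out.end(), payload.begin(), payload.end());
    std::uint32_t len =
      static_cast<std::uint32_t>(out.size() - length_pos - 4);
    out[length_pos]     = static_cast<unsigned char>(len >> 24);
    out[length_pos + 1] = static_cast<unsigned char>(len >> 16);
    out[length_pos + 2] = static_cast<unsigned char>(len >> 8);
    out[length_pos + 3] = static_cast<unsigned char>(len);
    return out;
  }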
You should avoid copying data directly from a received packet
during encoding, without regard to its format; propagating
malformed data in this way could enable attacks on other
recipients of that data.
When using C or C++ and copying whole data structures directly
into the output, make sure that you do not leak information in
padding bytes between fields or at the end of the
struct.
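A C++ sketch of one way to avoid such leaks: clear the whole structure before filling in its fields (the wire_record layout is illustrative):

  #include <cstring>

  // Hypothetical on-the-wire record: the compiler may insert padding
  // between "type" and "value" and at the end of the struct.
  struct wire_record {
    char type;
    unsigned int value;
  };

  void encode_record(unsigned char *out, char type, unsigned int value)
  {
    wire_record rec;
    // Clear the whole object first, so padding bytes cannot leak stale
    // stack contents into the output.
    std::memset(&rec, 0, sizeof(rec));
    rec.type = type;
    rec.value = value;
    std::memcpy(out, &rec, sizeof(rec));
  }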