XML (eXtensible Markup Language) makes it possible to define a grammar for describing and formatting structured documents, so that data can be transferred together with the information needed to process it. One of the advantages of XML is that the data it carries is understandable both to a human reader and to a computer.
But there is another side to the coin. The use of metadata increases the amount of data to be transmitted. At the same time, as documents grow in size, XML becomes difficult for human eyes to interpret. Furthermore, there is often no need to keep the document in a human-readable form between one processing step and the next.
Over time, the demand for tools to optimize XML documents has become widespread. This article presents the problem, several useful approaches to solving it, and the solution offered by the W3C (World Wide Web Consortium). Finally, it introduces some tools for optimization in Java.
The article consists of the following sections:
- Introduction to the problem and resolution techniques
- The W3C approach: EXI (Efficient XML Interchange)
- Other solutions for optimizing XML in Java
Introduction to the problem and resolution techniques
As already mentioned, several reasons lie behind the demand for optimization of XML documents. A parallel problem is the optimization of XML queries, but that is beyond the scope of this article.
This section presents the main difficulties that led to the demand for optimization, although it should be stressed that the causes often overlap and are not mutually exclusive. The main solutions follow, with their related trade-offs.
- Memory and bandwidth constraints. Storing and transporting markup is expensive, so applications subject to limited bandwidth or stringent memory constraints may find XML impractical because of its verbosity. The same problem applies to applications that exchange a very high volume of data.
- Processing constraints. Generating and parsing an XML document demands a lot of resources from the machine that processes it. The resulting overhead, especially as the complexity of the markup increases, tends to become unacceptable for systems with limited processing capacity. As a secondary effect, processing times limit the volume of documents that can be processed, so applications may not scale as desired, or you might be forced to opt for less complex markup.
- Binary data. XML is a textual language and does not accommodate other formats such as binary ones. XML can carry binary content encoded as base64 or hexadecimal text, but this in turn limits performance, especially for large payloads.
- Random access. XML is monolithic: it requires the entire document to be available for processing (as opposed to other formats such as JPEG). As a consequence, it is possible to establish whether a document is well-formed or valid only once it has been acquired in full. In addition, mechanisms such as XML Namespaces complicate the problem, because the evaluation must be carried out over extended contexts.
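To give a rough idea of the cost mentioned under "Binary data", the following sketch measures how many characters are needed to carry a binary payload as base64 or hexadecimal text. The class name, method names and the 3000-byte payload are hypothetical and chosen only for illustration.

```java
import java.util.Base64;

public class BinaryInXmlOverhead {

    /** Number of characters needed to carry the payload as base64 text. */
    public static int base64Length(byte[] payload) {
        return Base64.getEncoder().encodeToString(payload).length();
    }

    /** Number of characters needed to carry the payload as hexadecimal text. */
    public static int hexLength(byte[] payload) {
        StringBuilder hex = new StringBuilder(payload.length * 2);
        for (byte b : payload) {
            // Two hexadecimal digits per byte.
            hex.append(String.format("%02x", b));
        }
        return hex.length();
    }

    public static void main(String[] args) {
        // Hypothetical binary payload (for example, a fragment of an image).
        byte[] payload = new byte[3000];
        System.out.println("raw bytes:    " + payload.length);
        System.out.println("base64 chars: " + base64Length(payload));
        System.out.println("hex chars:    " + hexLength(payload));
    }
}
```

Base64 uses four characters for every three bytes and hexadecimal uses two characters per byte, so the textual form is always larger than the binary original, before any markup is added around it.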
… and the solutions
So far we have introduced the problems. Turning to solutions, the first thing to say is that there is a trade-off: each solution is a compromise between the benefits offered by XML and the desire to improve performance. Among its main advantages, XML is self-describing, easily extensible, readable by humans and conceptually simple. Any optimization should take these aspects into account and preserve them. In addition, it should be lossless, i.e. starting from an optimized document it should be possible to rebuild the original XML document.
Various solutions to the problem have been proposed, among them compression methods based on gzip or ASN.1 encoding. However, it would be preferable to use the markup itself as the basis for targeted optimization. The solutions can be classified as follows:
|Text compression||a traditional compression technique is applied to reduce the size of the document. Its disadvantages are that it takes time and typically makes the document opaque until it is decompressed.|
|Length coding||encoding the length of elements, attributes and other structures allows efficient, arbitrary access to specific parts of the document, because it is not necessary to parse every byte. The drawback is that the document tends to become difficult for a human to read.|
|Tag dictionary||the XML QNames are replaced with more uniform, fixed-length identifiers to improve processing efficiency and reduce document size, at the expense of readability.|
|Selective shuffle||flexibility in how the document is serialized can allow some applications to improve performance, especially if they need to access only parts of the document being parsed. This can be detrimental to readability.|
|Selectable recoding||some types of XML data, such as base64 binary, are particularly inefficient in terms of both size and processing time. The encoding of this data can be changed, probably at the cost of no longer being able to display it in ordinary text editors.|
|Type-based encoding||when the type of a piece of information is known, it can be used to recode the document by putting values in a canonical form, for example with a fixed length. The native form can be recovered without problems. The processing involved is not very expensive, but it comes at the cost of the readability and editability of the document.|
|Schema-based encoding||based on the structure of the XML schema associated with the documents, it is possible to generate specific formats, which however require producers and consumers to share the same version of the schema in order to generate and consume a given document. This technique reduces processing and space requirements, but the result is not self-describing, so it poses problems as the format evolves, and it is normally not particularly readable.|
In general, it is likely that an optimization tool will combine several of the techniques presented in order to increase their effectiveness.
We can therefore say that optimization consists of a set of methods for reformatting an XML document in order to minimize its impact in terms of the resources used, be it bandwidth, memory, and so on.
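As an illustration of the first technique in the classification, text compression, here is a minimal sketch that compresses an XML string with gzip using the standard `java.util.zip` classes and then restores it. The class name, the helper methods and the repetitive sample document are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class XmlTextCompression {

    /** Compresses the XML text with gzip; the result is opaque until decompressed. */
    public static byte[] compress(String xml) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(buffer)) {
            gzip.write(xml.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /** Restores the original XML text from its gzip form. */
    public static String decompress(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int read;
            while ((read = gzip.read(chunk)) != -1) {
                buffer.write(chunk, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    /** Builds a hypothetical, highly repetitive sample document. */
    public static String sampleDocument(int persons) {
        StringBuilder xml = new StringBuilder("<People>");
        for (int i = 0; i < persons; i++) {
            xml.append("<Person><Name>Pippo</Name><Surname>De Pippis</Surname></Person>");
        }
        return xml.append("</People>").toString();
    }

    public static void main(String[] args) {
        String xml = sampleDocument(200);
        byte[] compressed = compress(xml);
        System.out.println("original bytes:   " + xml.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("compressed bytes: " + compressed.length);
        System.out.println("lossless:         " + decompress(compressed).equals(xml));
    }
}
```

The round trip is lossless, but the compressed bytes cannot be read or queried until they have been decompressed, and both steps cost processing time.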
Here is an example of how optimization could work. Let's start with a classic "Person" structure:
<pre class="brush: php; html-script: true">
<Person>
<Name>Pippo</Name>
<Surname>De Pippis</Surname>
</Person>
</pre>
A possible optimization could bring the XML into a flat form, replacing the child elements with attributes, for example:
<pre class="brush: php; html-script: true">
<Person Name="Pippo" Surname="De Pippis" />
</pre>
This leads to a significant reduction in the space occupied (a saving of about a third), but is it possible to recover the original document? In such a simple case yes, but in general how would we distinguish the original attributes from those created by the optimization?
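The flattening described above can be sketched with the standard DOM API. This is only an illustration under stated assumptions: the class and method names are hypothetical, and the code deliberately ignores any attributes already present on the root element, which is exactly the ambiguity raised by the question above.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class FlattenToAttributes {

    /** Rewrites each child element of the root as an attribute of the root. */
    public static String flatten(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            Element root = doc.getDocumentElement();
            StringBuilder flat = new StringBuilder("<").append(root.getTagName());
            NodeList children = root.getChildNodes();
            for (int i = 0; i < children.getLength(); i++) {
                Node child = children.item(i);
                // Only element children become attributes; whitespace text is skipped.
                if (child.getNodeType() == Node.ELEMENT_NODE) {
                    flat.append(' ').append(child.getNodeName())
                        .append("=\"").append(escape(child.getTextContent().trim())).append('"');
                }
            }
            return flat.append("/>").toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("not a parsable XML document", e);
        }
    }

    /** Minimal escaping for characters that are not allowed in attribute values. */
    private static String escape(String value) {
        return value.replace("&", "&amp;").replace("<", "&lt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        String original = "<Person><Name>Pippo</Name><Surname>De Pippis</Surname></Person>";
        String flat = flatten(original);
        System.out.println(flat);
        System.out.println(original.length() + " characters reduced to " + flat.length());
    }
}
```

For the sample document, written without whitespace, the flat form takes 42 characters against the original 63, which matches the saving of about a third. The transformation is not reversible in general, because the flat form no longer records which attributes were originally elements.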
The problem is not so simple, so in the next section we will look at a solution developed by the W3C.
The W3C approach: EXI (Efficient XML Interchange)
In order to maximize the number of systems and applications capable of communicating via XML data, the W3C started the work that led to the specification of the EXI (Efficient XML Interchange) format, a very compact representation of XML information. EXI is a binary interchange format, which reached version 1.0 and became a W3C Recommendation in March 2011.
EXI bases its optimization on a grammar-driven approach, using a module called an EXI processor to encode XML data into EXI streams, or to decode them and make the data usable again.
EXI is described as schema-informed: it can use a schema to improve compactness and performance, but it does not depend on one and can work without it. If a schema is used for encoding, the same schema must be available to recover the original document. It is also possible to choose whether or not to compress the stream in order to save more space.
This section introduces the basic concepts of the specification and some evaluations of the performance it offers. It then considers the impact of the format on applications that use XML and, finally, presents the main implementations of the specification.
Elements of the format
During encoding, the grammar is used to map a stream of XML information into a smaller stream of events. The event stream is then encoded using event codes, which are similar to Huffman codes. The codes form a sequence of values that can optionally be compressed, replacing frequent values with specific patterns to further reduce the size.
An EXI stream consists of a header (the EXI header) followed by a body (the EXI body). The body represents the content of the document, while the header contains information about the EXI format version and the options used to encode the content (for example, whether compression has been enabled, whether the encoding is schema-informed, and so on).
An EXI header may start with an EXI Cookie, an optional field indicating that what follows is an EXI stream. The field consists of four characters ($, E, X and I) and is used to distinguish EXI streams from other types of stream.
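A check for the cookie described above can be sketched as follows. The class and method names are hypothetical. Since a header only "may" start with the cookie, a negative result does not prove that the bytes are not an EXI stream.

```java
import java.nio.charset.StandardCharsets;

public class ExiCookieCheck {

    // The four cookie characters: $, E, X and I.
    private static final byte[] EXI_COOKIE = "$EXI".getBytes(StandardCharsets.US_ASCII);

    /** Returns true if the stream starts with the four-character EXI Cookie. */
    public static boolean startsWithExiCookie(byte[] stream) {
        if (stream == null || stream.length < EXI_COOKIE.length) {
            return false;
        }
        for (int i = 0; i < EXI_COOKIE.length; i++) {
            if (stream[i] != EXI_COOKIE[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // Hypothetical stream prefixes, used only for illustration.
        byte[] withCookie = {0x24, 0x45, 0x58, 0x49, (byte) 0x80};
        byte[] plainXml = "<?xml version=\"1.0\"?>".getBytes(StandardCharsets.US_ASCII);
        System.out.println("cookie found in first stream:  " + startsWithExiCookie(withCookie));
        System.out.println("cookie found in second stream: " + startsWithExiCookie(plainXml));
    }
}
```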
The EXI body consists of a sequence of events, the EXI events. As in XML, element events must come in matching start and end pairs. Each event is associated with an event code, which is encoded in binary.
In a nutshell, the original document is mapped to a series of events, which in turn are encoded in binary.
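To make the idea of an event sequence more concrete, the sketch below uses the standard StAX API to turn an XML document into a list of events. It is not an EXI encoder: it produces no event codes and no binary output. The labels (SD, SE, AT, CH, EE, ED) are borrowed from the event names used in the EXI specification, while the class name and the textual rendering of each event are assumptions made for this example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

public class XmlEventStream {

    /** Maps an XML document to a flat list of event labels. */
    public static List<String> toEvents(String xml) {
        List<String> events = new ArrayList<>();
        try {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            // Deliver adjacent character data as a single event.
            factory.setProperty(XMLInputFactory.IS_COALESCING, Boolean.TRUE);
            XMLStreamReader reader = factory.createXMLStreamReader(new StringReader(xml));
            events.add("SD"); // start of document
            while (reader.hasNext()) {
                switch (reader.next()) {
                    case XMLStreamConstants.START_ELEMENT:
                        events.add("SE(" + reader.getLocalName() + ")");
                        for (int i = 0; i < reader.getAttributeCount(); i++) {
                            events.add("AT(" + reader.getAttributeLocalName(i)
                                    + "=" + reader.getAttributeValue(i) + ")");
                        }
                        break;
                    case XMLStreamConstants.CHARACTERS:
                        if (!reader.isWhiteSpace()) {
                            events.add("CH(" + reader.getText() + ")");
                        }
                        break;
                    case XMLStreamConstants.END_ELEMENT:
                        events.add("EE"); // every SE is closed by a matching EE
                        break;
                    case XMLStreamConstants.END_DOCUMENT:
                        events.add("ED"); // end of document
                        break;
                    default:
                        break; // comments, processing instructions, etc. are ignored here
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalArgumentException("not a parsable XML document", e);
        }
        return events;
    }

    public static void main(String[] args) {
        String xml = "<Person><Name>Pippo</Name><Surname>De Pippis</Surname></Person>";
        toEvents(xml).forEach(System.out::println);
    }
}
```

In a real EXI processor each of these events would be replaced by a short event code chosen by the grammar, which is where most of the size reduction comes from.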
To evaluate the effectiveness of EXI, in terms of both compactness and processing efficiency, the following evaluations were carried out.
For more information on testing, see EXI Evaluation.
Compaction: comparisons with gzip and ASN.1
The two graphs compare one implementation of the EXI specification (Efficient XML 4.0) with gzip and ASN.1. The vertical axis shows the compaction achieved, expressed as a percentage of the original document size. The results are ordered with the best on the left. With EXI, in some cases the resulting document was up to 100 times smaller than the original, while on average the reduction is about 10 times.
EXI produces significantly better results than gzip and ASN.1: up to 10 times better than gzip (for example with messages made up of a high volume of small data items, as is typical of geolocation data), and up to 20 times better than ASN.1 (which in many cases does not produce appreciable results).
To evaluate the advantages in terms of processing speed, basic parsing and serialization operations were performed on plain XML, on EXI without compression and on EXI with compression. For decoding, EXI improved performance on average by about 15 times without compression and about 9 times with compression.
For serialization, performance improved on average by about 6 times without compression and about 5 times with compression.
Impact of EXI
EXI was designed to be compatible with XML without placing a particular burden on the applications that use it. At least in theory. The objective has been partially met, but the impact can be reduced further.
|Readability||one of the sore points is the adoption of a binary format, which sacrifices the readability of the optimized document. Unless specific editors are used, one of the main advantages of XML is lost.|
|APIs||at least formally, EXI claims to support all the APIs commonly used to process XML, so EXI should have no immediate impact on existing APIs. In practice, using pre-existing APIs requires all the names and text of an EXI document to be converted to strings. Further work can therefore be expected to reduce this aspect of EXI's impact.|
|Security||signature and encryption can be used with EXI, with some workarounds. Signature can be used with EXI by specifying an existing canonicalization algorithm (for example Canonical XML, a specification that establishes a method for determining whether two documents are identical). In particular, EXI has no impact on canonicalization algorithms for XML documents. As for encryption, if the recipients are known to be able to receive EXI documents, the MimeType attribute of the EncryptedData element could be used to indicate EXI as the format of the encrypted data.|
At the moment the following implementations of the specification are reported:
|Efficient XML from AgileDelta||Probably the most mature implementation of the specification, providing specific support for web services and APIs such as DOM and SAX. This is a commercial product, available for a 30-day evaluation. Runs on Java.|
|EXIficient||Open source project initiated by Siemens AG. It supports all the features and encoding modes of EXI. Runs on Java.|
|EXIP||Open source project led by EISLAB. Written in C.|
|OpenEXI||Open source project led by Fujitsu. Runs on Java.|
Other solutions for optimizing XML in Java
In addition to the EXI specification and its implementations, various other solutions to alleviate the problem have already been studied and applied. Some of these aim at converting to lighter interchange formats, while others aim at optimizing performance based on schemas known a priori. Some of the solutions adopted are presented here as examples.
|JAXB (Java Architecture for XML Binding)||is a specification (with several implementations, and a reference implementation bundled with Java starting from Java SE 6) that, starting from the XML schema, builds the object tree corresponding to the parsed XML document, keeping the XML content in memory while avoiding the weight of the DOM approach. The reference implementation is described in detail in the article JAXB: XML in Java and related articles.|
|SOAP protocol||has a mechanism, MTOM/XOP (Message Transmission Optimization Mechanism / XML-binary Optimized Packaging), which is useful to optimize the transmission of XML data of type
|Apache Axiom||the core of the Apache Axis2 framework; in addition to supporting the construction of the object tree, it provides native MTOM/XOP support for efficiently transporting binary data.|