Data Format Description Language

Data Format Description Language (DFDL, often pronounced daff-o-dil), published as an Open Grid Forum Recommendation in February 2021, is a modeling language for describing general text and binary data in a standard way. A DFDL model or schema allows any text or binary data to be read (or "parsed") from its native format and presented as an instance of an information set. An information set is a logical representation of the data contents, independent of the physical format. For example, two records could be in different formats, because one has fixed-length fields and the other uses delimiters, yet contain exactly the same data; both would then be represented by the same information set. The same DFDL schema also allows data to be taken from an instance of an information set and written out (or "serialized") to its native format.
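The idea of one information set behind two physical formats can be sketched in a few lines of Python. This is only an illustration of the concept, not how a DFDL processor works: the record layouts (a 10-character name followed by a 3-character age, and a comma-delimited equivalent), the sample values and all function names are assumptions made for this sketch.

```python
# Hypothetical record layouts, chosen only to illustrate the idea of an
# information set; they are not taken from the DFDL specification.

def parse_delimited(data: str) -> dict:
    """Parse a comma-delimited record into a logical information set."""
    name, age = data.split(",")
    return {"name": name, "age": int(age)}

def parse_fixed(data: str) -> dict:
    """Parse a fixed-length record: 10-character name, then 3-character age."""
    return {"name": data[:10].rstrip(), "age": int(data[10:13])}

def serialize_delimited(infoset: dict) -> str:
    """Write an information set back out in the delimited format."""
    return f"{infoset['name']},{infoset['age']}"

delimited_record = "Alice,42"
fixed_record = "Alice".ljust(10) + "042"

infoset_a = parse_delimited(delimited_record)
infoset_b = parse_fixed(fixed_record)

# Different physical formats, same logical content.
print(infoset_a == infoset_b)
```

In DFDL the two parsers would not be hand-written; each format would instead be described by its own schema, and a generic processor would produce the information set from the data and the schema.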

DFDL is descriptive and not prescriptive. DFDL is not a data format, nor does it impose the use of any particular data format. Instead it provides a standard way of describing many different kinds of data formats. This approach has several advantages.^[1] It allows an application author to design an appropriate data representation according to their requirements while describing it in a standard way which can be shared, enabling multiple programs to directly interchange the data.

DFDL achieves this by building upon the facilities of W3C XML Schema 1.0. A subset of XML Schema is used, enough to enable the modeling of non-XML data. The motivations for this approach are to avoid inventing a completely new schema language, and to make it easy to convert general text and binary data, via a DFDL information set, into a corresponding XML document.

Educational material is available in the form of DFDL Tutorials, videos and several hands-on DFDL labs.

History

DFDL was created in response to a need for grid APIs to be able to understand data regardless of source. A language was needed capable of modeling a wide variety of existing text and binary data formats. A working group was established at the Global Grid Forum (which later became the Open Grid Forum) in 2003 to create a specification for such a language.

A decision was made early on to base the language on a subset of W3C XML Schema, using <xs:appinfo> annotations to carry the extra information necessary to describe non-XML physical representations. This is an established approach that is already being used today in commercial systems. DFDL takes this approach and evolves it into an open standard capable of describing many text or binary data formats.

Work continued on the language, resulting in the publication of a DFDL 1.0 specification as OGF Proposed Recommendation GFD.174 in January 2011.

The official OGF Recommendation is GFD.240, published in February 2021. It obsoletes all prior versions and incorporates all issues noted to date (it is also available as HTML). A summary of DFDL and its features is available at the OGF. Issues with the specification are tracked using GitHub issue trackers.

Implementations

Implementations of DFDL processors that can parse and serialize data using DFDL schemas are available.

IBM has a production-ready DFDL 1.0 streaming parser, modeler and visual tester.^[2] This is available in several IBM products including IBM App Connect Enterprise (formerly known as IBM Integration Bus). A free developer edition is available.
Apache Daffodil is an open-source DFDL processor with both a parser and an unparser, as well as integrations into Apache NiFi and the XML Calabash XProc pipeline engine. It remains under active development.
The European Space Agency's S2G Data Viewer project includes a parser, DFDL4S,^[3] that implements a subset of the DFDL 1.0 specification.

A public repository for DFDL schemas that describe commercial and scientific data formats has been established on GitHub. DFDL schemas for formats like UN/EDIFACT, NACHA, MIL-STD-2045, NITF, and ISO8583 are available for free download.

Example

Take as an example the following text data stream which gives the name, age and location of a person:

The logical model for this data can be described by the following fragment of an XML Schema document. The order, names, types and cardinality of the fields are expressed by the XML schema model.

<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>

<xs:complexType name="person_type">
  <xs:sequence>
    <xs:element name="name" type="xs:string"/>
    <xs:element name="age" type="xs:short"/>
    <xs:element name="county" type="xs:string"/>
    <xs:element name="country" type="xs:string"/>
  </xs:sequence>
</xs:complexType>

</xs:schema>

To additionally model the physical representation of the data stream, DFDL augments the XML schema fragment with annotations on the xs:element and xs:sequence objects, as follows:

<xs:schema xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>

<xs:complexType name="person_type">
  <xs:sequence>
    <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:sequence encoding="ASCII" sequenceKind="ordered" 
                       separator="," separatorType="infix" separatorPolicy="required"/>                   
    </xs:appinfo></xs:annotation>
    <xs:element name="name" type="xs:string">
      <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:element lengthKind="delimited" encoding="ASCII"/>                   
      </xs:appinfo></xs:annotation>
    </xs:element>
    <xs:element name="age" type="xs:short">
      <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:element representation="text" lengthKind="delimited" encoding="ASCII"
                      textNumberRep="standard" textNumberPattern="#0" textNumberBase="10"/>                   
      </xs:appinfo></xs:annotation>
    </xs:element>
    <xs:element name="county" type="xs:string">
      <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:element lengthKind="delimited" encoding="ASCII"/>                   
      </xs:appinfo></xs:annotation>
    </xs:element>
    <xs:element name="country" type="xs:string">
      <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:element lengthKind="delimited" encoding="ASCII"/>                   
      </xs:appinfo></xs:annotation>
    </xs:element>
  </xs:sequence>
</xs:complexType>

</xs:schema>

The property attributes on these DFDL annotations express that the data are represented in an ASCII text format, with fields of variable length delimited by commas.
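To make the effect of these properties concrete, the following Python sketch hand-codes the parse and unparse behaviour that the annotated schema describes: ASCII text, four fields in order, and a comma between fields but not before the first or after the last. It is a simplified illustration and not a DFDL processor. The sample record, the `Person` type and the function names are hypothetical, and Python's `int()` is more lenient than a real text-number pattern would be.

```python
from typing import NamedTuple

SEPARATOR = ","                       # separator, infix: only between fields
ENCODING = "ascii"                    # encoding of the whole record
SHORT_MIN, SHORT_MAX = -32768, 32767  # value range of xs:short

class Person(NamedTuple):
    name: str
    age: int
    county: str
    country: str

def parse_person(data: bytes) -> Person:
    """Split a delimited ASCII record into the four fields of person_type."""
    fields = data.decode(ENCODING).split(SEPARATOR)
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, found {len(fields)}")
    name, age_text, county, country = fields
    age = int(age_text)  # base-10 text number (simplified)
    if not SHORT_MIN <= age <= SHORT_MAX:
        raise ValueError(f"age {age} does not fit in xs:short")
    return Person(name, age, county, country)

def unparse_person(person: Person) -> bytes:
    """Write the fields back out, separated by commas, as ASCII bytes."""
    fields = [person.name, str(person.age), person.county, person.country]
    return SEPARATOR.join(fields).encode(ENCODING)

# Hypothetical sample record, for illustration only.
record = b"Jane Doe,35,Essex,UK"
person = parse_person(record)
print(person)
print(unparse_person(person) == record)
```

A DFDL processor derives the same behaviour from the schema alone, so changing the separator or the encoding means editing a property value rather than code.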

An alternative, more compact syntax is also provided, where DFDL properties are carried as non-native attributes on the XML Schema objects themselves.

<xs:schema xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/" xmlns:xs="http://www.w3.org/2001/XMLSchema" ...>

<xs:complexType name="person_type">
  <xs:sequence dfdl:encoding="ASCII" dfdl:sequenceKind="ordered" 
               dfdl:separator="," dfdl:separatorType="infix" dfdl:separatorPolicy="required">
    <xs:element name="name" type="xs:string"
                dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>                   
    <xs:element name="age" type="xs:short"
                dfdl:representation="text" dfdl:lengthKind="delimited" dfdl:encoding="ASCII"
                dfdl:textNumberRep="standard" dfdl:textNumberPattern="#0" dfdl:textNumberBase="10"/>                   
    <xs:element name="county" type="xs:string"
                dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>                   
    <xs:element name="country" type="xs:string"
                dfdl:lengthKind="delimited" dfdl:encoding="ASCII"/>                   
  </xs:sequence>
</xs:complexType>

</xs:schema>
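Because the short form carries DFDL properties as ordinary namespaced attributes, any XML tool can read them. The sketch below uses Python's standard `xml.etree.ElementTree` to collect the `dfdl:` attributes from a cut-down copy of the schema above. The helper name `dfdl_properties` is an assumption of this sketch, and the property names are simply copied from the example rather than checked against the specification.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"
DFDL = "http://www.ogf.org/dfdl/dfdl-1.0/"

# Cut-down, self-contained copy of the short-form example.
SCHEMA = f"""
<xs:schema xmlns:xs="{XS}" xmlns:dfdl="{DFDL}">
  <xs:complexType name="person_type">
    <xs:sequence dfdl:separator="," dfdl:separatorType="infix">
      <xs:element name="name" type="xs:string" dfdl:lengthKind="delimited"/>
      <xs:element name="age" type="xs:short" dfdl:lengthKind="delimited"
                  dfdl:representation="text"/>
    </xs:sequence>
  </xs:complexType>
</xs:schema>
""".strip()

def dfdl_properties(node: ET.Element) -> dict:
    """Return the attributes of node that live in the DFDL namespace."""
    prefix = "{" + DFDL + "}"
    return {key[len(prefix):]: value
            for key, value in node.attrib.items()
            if key.startswith(prefix)}

root = ET.fromstring(SCHEMA)
sequence = root.find(f".//{{{XS}}}sequence")
elements = {el.get("name"): dfdl_properties(el)
            for el in sequence.findall(f"{{{XS}}}element")}

print(dfdl_properties(sequence))
print(elements)
```

The native XML Schema attributes (`name`, `type`) carry no namespace, which is how they stay distinct from the DFDL properties on the same object.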

Features

The goal of DFDL is to provide a rich modeling language capable of representing any text or binary data format. The 1.0 release is a major step towards this goal. The capability includes support for:

Text data types such as strings, numbers, zoned decimals, calendars and Booleans
Binary data types such as two's complement integers, BCD, packed decimals, floats, calendars and Booleans
Fixed length data and data delimited by text or binary markup
Language data structures found in languages like COBOL, C and PL/I
Industry standards such as CSV, SWIFT, FIX, HL7, X12, HIPAA, EDIFACT, ISO 8583
Any encoding and endianness
Bit data of arbitrary length
Pattern languages for text numbers and calendars
Ordered, unordered and floating content
Default values on parsing and serializing
Nil values capability for handling out-of-band data
Fixed and variable arrays
XPath 2.0 expression language including variables to model dynamic data
Speculative parsing and other mechanisms to resolve choices and optionality
Validation to XML Schema 1.0 rules
A scoping mechanism that allows common property values to be applied at multiple annotation points
Hiding elements in the data from the information set
Calculating element values for the information set
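Two of the binary capabilities listed above, byte order and packed decimals, can be illustrated with a short Python sketch. It shows the kind of decoding a DFDL processor has to perform once a schema declares a binary representation. It is not DFDL itself: the function names and the byte-order labels are assumptions of this sketch, and the packed-decimal sign handling follows the common IBM convention (a final nibble of 0xB or 0xD means negative).

```python
import struct

def decode_int16(data: bytes, byte_order: str) -> int:
    """Decode a 2-byte two's complement integer in the given byte order."""
    fmt = {"bigEndian": ">h", "littleEndian": "<h"}[byte_order]
    return struct.unpack(fmt, data)[0]

def decode_packed_decimal(data: bytes) -> int:
    """Decode a packed decimal: two digits per byte, last nibble is the sign."""
    nibbles = []
    for byte in data:
        nibbles.append(byte >> 4)
        nibbles.append(byte & 0x0F)
    *digits, sign = nibbles
    if any(d > 9 for d in digits) or sign < 0x0A:
        raise ValueError("not a valid packed decimal")
    value = int("".join(str(d) for d in digits))
    return -value if sign in (0x0B, 0x0D) else value

# The same two bytes give different values depending on byte order.
print(decode_int16(b"\xff\xfe", "bigEndian"))
print(decode_int16(b"\xff\xfe", "littleEndian"))

# Digits 1 2 3 4 5 followed by a sign nibble.
print(decode_packed_decimal(bytes.fromhex("12345C")))
print(decode_packed_decimal(bytes.fromhex("12345D")))
```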

This page was last edited on 30 June 2023, at 14:02

From Wikipedia, the free encyclopedia