06
  • Introduction
  • 6.1
  • 6.2
  • 6.3
  • 6.4
  • 6.5
  • 6.6
  • XML and JSON

    Languages used to describe serialized data being communicated to/from web services.

    XML has many additional uses as well

    Both are text-based and support a standard syntax.
    open and non-proprietary
    easier heterogeneous integration – everything text based
    really for describing data and not programming languages
    a standard syntax avoids the use of native data formats and aids the reuse of common tooling

    JSON much lighter but can do much more with XML...

    REST uses JSON/XML; SOAP uses XML.

  • XML in Brief

    <myelement>some text data</myelement>
    

    XML is a language for creating other languages – it is a meta-language.
    a meta-language is a language that is used to describe other languages i.e. you define your own tags
    essentially an XML document is a text file, representing a data structure defined in terms of elements, attributes and entities etc.

    XML 1.0 and 1.1 are W3C recommendations i.e. international standards.

    Two important recommendations which are used with XML to do anything useful are:-
    XML Schema
    XML Namespaces

    XML lets you define schemas for tag-based languages (thus - “markup language”).
    i.e. XML defines the syntax and you (or a standard) define the structure of these languages
    e.g. defines the allowed elements, what attributes go with what element, elelment order, allowed child elements etc. etc.
    structure is defined in a schema (schema with a small ‘s’). Several options (of which the main three are):-
    DTDs
    XML Schema (XSD)
    RelaxNG
    in fact the use of any of these is optional
    documents which are syntactically correct are called well-formed
    documents which additionally adhere to a given schema e.g. a XML Schema definition are called valid

    📷 XML Example

    XML 1.0/1.1 and JSON

    XML 1.0/1.1
    XML Declaration and Character Encoding
    XML 1.0 and 1.1
    Elements
    Attributes
    PCDATA
    Entities
    CDATA
    Processing Instructions
    Serialization of XML

    JSON
    strings, numbers, booleans, null, objects and arrays
    comparison with XML
    Serialization of JSON
    X
    XML Example Diagram
  • XML 1.0 – A Representative Document

    <?xml version="1.0" encoding="UTF-8"?>
    <!DOCTYPE students SYSTEM "student.dtd">
    
    <!-- This is an XML document that describes students -->
    <?studentdb displaydesc="true"?>
    
    <students>
       <student id="0001">
          <name>Jim Bob</name>
          <status>graduate</status>
          <residence/>
          <degree>Advanced Computing & Music</degree>
          <description>Jim Bob!
                            Hi my name is Jim. I look like
                            <img src="jim.jpg"> ]]>
          </description>
       </student>
      <student id="0002">
         ...
      </student>
    </students>
    

    XML Declaration

    <?xml version="1.0" encoding="UTF-8"?>
    

    Optional
    default implies XML 1.0 and UTF-8 assumed
    so it’s 1.0 unless you specify otherwise

    A Processing Instruction (see later).

    Must appear as the very first thing in an XML document.

    Used to indicate the version of the specification to which the document conforms (and whether the document is “standalone” i.e. a DTD is part of the file).

    Used to indicate the character encoding of the document.
    UTF-8
    UTF-16
    iso-8859-1
    ...

    XML Character Encoding

    Every character in an XML document is a unicode character.
    the encoding (which may not be unicode) is merely a convenience for representing the characters in a file
    the parser will translate this into unicode internally

    A very messy world but usually not a problem if you stick to the first 128 characters (“US-ASCII”) which are all the same:-
    iso-8859-1 (Latin characters/Western European – old but sill used)
    windows-1252 (or “ANSI”) (27 differences from iso-8859-1)
    UTF-8 (newer, recommended) “8-bit Unicode Transformation Format” (can represent any character as a sequence of 8-bit values)

    You should stick to UTF-8 (default) and always use a declaration.

    If you do not use UTF-8 then make sure the declaration is consistent for your character content.

    XML 1.0 and XML 1.1

    XML 1.1 has:-
    better support for Unicode characters
    better support for control characters
    mandatory XML declarations

    XML 1.0 and XML 1.1 are not quite upwardly compatible!

    In English and European language markup this all makes little effective difference.

    XML 1.1 (Second Edition) (2006) incorporates errata.

    Tool/compiler support in Java is for XML 1.1.

    Tool/compiler support in .NET is for XML 1.0.

    DOCTYPE

    <!DOCTYPE students SYSTEM "students.dtd">
    

    References a Document Type Definition (DTD).

    Can refer to an external DTD file or include some internal DTD information within the tag itself.

    DTD is the original mechanism for specifying the schema of an XML document.

    Inherited in part from previous SGML standard.
    arcane syntax
    limited expressive functionality

    We will consider newer XML Schema instead of DTDs.

    Elements

    <name>Jim Bob</name>
    

    Main structure in an XML document.

    Only one root element allowed.

    everything else inside this

    Start Tag

    allows specification of zero or more attributes

    <student id="0001" ...>
    

    Closing Tag (or End Tag)

    must match name, case, and nesting level of start tag

    </student>
    

    Name must start with letter or underscore and can contain only letters, numbers, hyphens, periods, and underscores.

    Element Content

    <student>
       <status>...</status>
    </student>
    

    Parsed Character Data (aka PCDATA, aka Text)

    <name>Jim Bob</name>
    

    Mixed Content

    <name>Jim <initial>J</initial> Bob</name>
    

    No Content

    <residence/>
    

    The usual relative terms are used for adjacent elements.

    parent, child, sibling

    Attributes

    <student id="0001">
    

    Element properties

    must be valued

    Name
    must start with letter or underscore and can contain only letters, numbers, hyphens, periods, and underscores
    must be unique in a given element

    Value
    can be of several types, but is almost always a string
    must be quoted and have the ‘=‘
    title="Lecture 2"
    match='item="baseball bat"'
    cannot contain < (at all) or & (by itself)

    Elements and Attributes

    Use which where?
    “Should this data be coded as an element or an attribute?”
    XML’s great imponderable

    If data in an element:-
    easier to read (or so it’s said anyway)
    can have multiple values (i.e. sibling elements)
    can have children
    extendible

    If data in an attribute:-
    should be used as meta-data about the element
    not hierarchical
    used to distinguish between elements and has no further use

    No hard and fast answer
    a design decision
    programmatically much the same work to get at element or attribute data

    PCDATA

    Jim Bob
    

    “Parsed Character Data”

    Has to be parsed because it may contain child elements.

    Text that appears as the content of an element.

    Can reference entities.

    Data within is checked for sub-elements, entities, etc.

    Cannot contain either < or & (by itself).

    Entities

    &amp;
    

    Used to “escape” content or include content that is hard to enter or repeated frequently.

    somewhat like macros

    Five (and only five) pre-defined entities (note the trailing ‘;’).

    &amp; &lt; &gt; &apos; &quot;

    Character entities can refer to a single character by unicode number.

    e.g., &#x00A9; is ©

    Symbolic entities must be declared to be legal (in a DTD) i.e.
    <!ENTITY mycopy "&#x00A9;">
    &mycopy; could then be used in the XML document

    Cannot refer to themselves.

    CDATA

    <![CDATA[ <h1>Jim Bob!</h1> ...]]>
    

    “Character Data”

    Parsed in “one chunk” by the XML parser.

    Really data for a recipient application.

    Data within is not checked for sub-elements, entities, etc.

    Only string you can’t have is ]]>

    Allows you to include badly formed markup or character data that would cause a problem during parsing.

    Example:-

    including HTML tags in an XML document which use markup minimisation i.e. no closing tag

    Comments

    <-- This is ... -->
    

    Can include any text inside a comment to make it easier for human readers to understand your document.

    Generally not available to applications reading the document.

    Always begin with <!-- and end with -->

    Cannot contain --

    Processing Instructions

    <?studentdb displaydesc="true"?>
    

    “Sticky notes” to applications processing an XML document that explain how to handle content.

    The target portion (e.g., studentdb) of a PI indicates the application that is to process this instruction; cannot start with “xml”.

    The remainder of the PI can be any text that gives instructions to the application.

    Examples:-
    XML Declaration
    instructions to an application to display different versions of an image
    instructions to an application to suppress display of certain content

    XML Tools and API Libraries

    Many language bindings available with different implementations.
    Xerces/Xalan(Apache), .NET System.XML (Microsoft)…
    in the core language/framework class libraries
    will additionally check if document well-formed and valid

    Many approaches to process XML over the years:-
    stream parsers
    in memory processing e.g. DOM
    processing engines e.g. XSLT

    As a case study Lab 3 contains a .NET/C# application to serialize/de-serialize a list of objects of an arbitrary class.

    uses current standard libraries
  • JSON

    JavaScript Object Notation (JSON)

    JSON is an alternative to XML and is the data interchange format of choice in modern web services.

    Wire format introduced for AJAX Web applications (browser-web server communication).
    specifically for data-exchange (self-documenting)
    considered simpler than XML (simple and faster parsing)
    much more compact and lightweight than XML (minimal; but still text)
    most common as representation in RESTful web services

    Data-types completely described by:-
    strings, numbers, booleans and null
    objects
    unordered collections of name/value pairs
    values themselves may be of any type including objects and arrays i.e. can be nested
    arrays
    ordered sequences of values
    can represent records, list and trees using objects and arrays etc. by nesting them to an arbitrary complexity

    Simple syntax
    objects enclosed in curly braces with colon between name and value
    arrays enclosed in square brackets
    items are separated by commas and white space is insignificant

    Supported in most languages.

    📷 JSON Example

    JSON and XML

    JSON has much less expressive power than XML.

    JSON has no feature to express:-
    attributes
    namespaces
    mixed content

    Although it is possible to shoehorn some of these features into JSON e.g. attributes, they are not represented in a standard manner.

    XML has many more processing APIs than JSON.

    and many of them (not all) are standardised

    Any translation scheme from XML to JSON should be lossless i.e. not lose any information but this is difficult to achieve and depends on the XML document.

    JSON Schema
    is a standardisation initiative for a schema representation
    an IETF draft standard
    schema are themselves defined in JSON
    🔗 JSON Schema
    VS 2015 onwards has support for JSON Schema in its JSON editor

    Many JSON API libraries including:-
    JSON-P libraries in Java EE 8
    .NET 4.5 – Json.Net (Newtonsoft) is the current API library of choice.
    🔗 https://www.newtonsoft.com/json
    X
    JSON Example Diagram
  • XML Namespaces

    XML documents:-
    are a tree of elements
    each element has a name and a set of attributes
    each attribute has a name and a value
    each element and attribute name is an unrestricted string from a set of restricted characters
    we can call such a name, as we have defined so far, a local name
    an application (a parser) will process the element based on the element and attribute name/values
    they are used extensively in WSDL which we will be studying shortly
    XML passed in web service communication will most likely use namespaces.

    If there is no control over naming in XML:-
    different programmers/standards bodies may/will pick the same names producing conflicts
    we may wish to mix different languages in the one XML document and these may use (some of) the same names (compound documents)
    the application (parser) will not be able tell elements/attributes with the same names apart and will wrongly process them
    we do not want potential name clashes to be a concern for the designer of an XML language
    see http://www.rpbourret.com/xml/NamespacesFAQ.htm#ns_2 for more reasons

    Solution:-
    introduce expanded names which are additionally scoped by a namespace
    an optional mechanism providing additional information external to the XML document
  • URIs

    URI (Uniform Resource Identifier)
    “a compact string of characters for identifying an abstract or physical resource”
    see: 🔗 http://tools.ietf.org/html/rfc3986

    URL (Uniform Resource Locator)
    subset of URIs
    its location i.e. how to find it
    “identify resources via a representation of their primary access mechanism, rather than … by name or some other attribute”

    URN (Uniform Resource Name)
    subset of URIs
    its identity e.g. urn:isbn:0451450523
    persistent and location-independent
    “required to remain globally unique and persistent even when resource ceases to exist or becomes unavailable”
  • Namespaces in XML 1.0/1.1

    Namespaces 1.1 adds support for:-
    unicodeURIs (called IRIs)
    undeclaringprefixes (and thus the current namespace)
    Java (through Xerces) has support for Namespaces 1.1
    .NET 4.5 however has support for Namespaces 1.0

    An up-to-date reference for XML namespaces is at:-
    🔗 http://www.rpbourret.com/xml/NamespacesFAQ.htm

    XML Schema

    XML Schema Definition (XSD) became an official recommendation in May 2001; Second Edition in October 2004.
    Primer
    an overview (> 150 pages alone!)
    Structures
    Data Types

    A very large standard.
    much of which should be kept as a reference rather than attempt to learn -a large and verbose standard

    The Second Edition addresses only some bugs and minor loose ends of the original standard.

    Main Features of XSD

    XML Syntax
    there is a schema for schemas
    you can validate the schema (in EditiX -Lab 3)

    Object-oriented type system
    with inheritance, abstract types and finals

    Global(=top-level) and local(=inline) type definitions

    Modularisation

    Built-in data-types
    about 40 of these
    not just positive integers etc. but dates etc.
    can add constraints to the built-in types e.g. cardinality, ranges...

    Regular expressions

    An Example...

    config.xml and config.xsd

    The XML document which is meant to be valid with respect to a given XML schema definition is referred to as the instance document.

    Notice that:
    XML instance points to schema
    XML Schema declares elements (and attributes)
    XML Schema defines types
    types come in a number of varieties:
    Built-in types (e.g., xsd:string, xsd:date)
    Simple types
    Complex types
    there is a implicit assumption of 1 embedded in many constructs e.g.
    <xsd:sequence>
    i.e. here the sequence must occur once only
    another value or range can be specified

    Why?

    Data validation
    structure of elements and attributes
    order of elements
    data values of elements and attributes
    uniqueness of values

    Simplifies interoperability and tooling if parties can agree acceptable data formats described in a standard way.

    Documentation.

    Augmentation of instance with default values.

    Storage of application information.

    Application can avoid error checking if it can be assumed instance already validated.

    There is a meta-schema for XML Schema documents.
    i.e. a XML Schema document’s own validity can be checked against this!

    Another Example (product.xsd)

    <xsd:schemaxmlns:xsd="http://www.w3.org/2001/XMLSchema">
      <xsd:elementname="product" type="ProductType"/>
      <xsd:complexTypename="ProductType">
           <xsd:sequence>
                  <xsd:elementname="number" type="xsd:integer"/>
                  <xsd:elementname="size" type="SizeType"/>
           </xsd:sequence>
           <xsd:attributename="effDate" type="xsd:date"/>
      </xsd:complexType>
      <xsd:simpleTypename="SizeType">
            <xsd:restrictionbase="xsd:integer">
                 <xsd:minInclusivevalue="2"/>
                 <xsd:maxInclusivevalue="18"/>
            </xsd:restriction>
      </xsd:simpleType>             <product effDate="2008-04-02">
    </xsd:schema>                     <number>557</number>
                                         <size>10</size>
                                       </product>
    

    Namespace Usage Examples

    <prod:product xmlns:prod="http://example.org/prod"
        xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
        xsi:schemaLocation="http://example.org/prod prod.xsd">
      <prod:number>557</prod:number>
      <prod:size>10</prod:size>
    </prod:product>
    
    <order xmlns="http://example.org/ord"
           xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
           xsi:schemaLocation="http://example.org/prod prod.xsd
           http://example.org/ord ord.xsd"
    <number>123ABBCC123</number>
    <items>
       <product xmlns="http://example.org/prod">
         <number>557</number>
         <size system="US-DRESS">10</size>
       </product>
    </items>
    </order>
    
School of Computing, Engineering and Built Environment