Week 6 - XML 1.0/1.1 and JSON

XML and JSON

Languages used to describe serialized data being communicated to/from web services.

XML has many additional uses as well

Both are text-based and support a standard syntax.

open and non-proprietary

easier heterogeneous integration – everything text based

really for describing data and not programming languages

a standard syntax avoids the use of native data formats and aids the reuse of common tooling

JSON much lighter but can do much more with XML...

REST uses JSON/XML; SOAP uses XML.
XML in Brief
```
<myelement>some text data</myelement>
```
XML is a language for creating other languages – it is a meta-language.

a meta-language is a language that is used to describe other languages i.e. you define your own tags

essentially an XML document is a text file, representing a data structure defined in terms of elements, attributes and entities etc.

XML 1.0 and 1.1 are W3C recommendations i.e. international standards.

Two important recommendations which are used with XML to do anything useful are:-

XML Schema

XML Namespaces

XML lets you define schemas for tag-based languages (thus - “markup language”).

i.e. XML defines the syntax and you (or a standard) define the structure of these languages

e.g. defines the allowed elements, what attributes go with what element, elelment order, allowed child elements etc. etc.

structure is defined in a schema (schema with a small ‘s’). Several options (of which the main three are):-

DTDs

XML Schema (XSD)

RelaxNG
in fact the use of any of these is optional

documents which are syntactically correct are called well-formed

documents which additionally adhere to a given schema e.g. a XML Schema definition are called valid

📷 XML Example

XML 1.0/1.1 and JSON

XML 1.0/1.1

XML Declaration and Character Encoding

XML 1.0 and 1.1

Elements

Attributes

PCDATA

Entities

CDATA

Processing Instructions

Serialization of XML

JSON

strings, numbers, booleans, null, objects and arrays

comparison with XML

Serialization of JSON
X
XML 1.0 – A Representative Document
```
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE students SYSTEM "student.dtd">


<?studentdb displaydesc="true"?>

<students>
   <student id="0001">
      <name>Jim Bob</name>
      <status>graduate</status>
      <residence/>
      <degree>Advanced Computing & Music</degree>
      <description>Jim Bob!
                        Hi my name is Jim. I look like
                        <img src="jim.jpg"> ]]>
      </description>
   </student>
  <student id="0002">
     ...
  </student>
</students>
```
XML Declaration
```
<?xml version="1.0" encoding="UTF-8"?>
```
Optional

default implies XML 1.0 and UTF-8 assumed

so it’s 1.0 unless you specify otherwise

A Processing Instruction (see later).

Must appear as the very first thing in an XML document.

Used to indicate the version of the specification to which the document conforms (and whether the document is “standalone” i.e. a DTD is part of the file).

Used to indicate the character encoding of the document.

UTF-8

UTF-16

iso-8859-1

...

XML Character Encoding

Every character in an XML document is a unicode character.

the encoding (which may not be unicode) is merely a convenience for representing the characters in a file

the parser will translate this into unicode internally

A very messy world but usually not a problem if you stick to the first 128 characters (“US-ASCII”) which are all the same:-

iso-8859-1 (Latin characters/Western European – old but sill used)

windows-1252 (or “ANSI”) (27 differences from iso-8859-1)

UTF-8 (newer, recommended) “8-bit Unicode Transformation Format” (can represent any character as a sequence of 8-bit values)

You should stick to UTF-8 (default) and always use a declaration.

If you do not use UTF-8 then make sure the declaration is consistent for your character content.

XML 1.0 and XML 1.1

XML 1.1 has:-

better support for Unicode characters

better support for control characters

mandatory XML declarations

XML 1.0 and XML 1.1 are not quite upwardly compatible!

In English and European language markup this all makes little effective difference.

XML 1.1 (Second Edition) (2006) incorporates errata.

Tool/compiler support in Java is for XML 1.1.

Tool/compiler support in .NET is for XML 1.0.

DOCTYPE
```
<!DOCTYPE students SYSTEM "students.dtd">
```
References a Document Type Definition (DTD).

Can refer to an external DTD file or include some internal DTD information within the tag itself.

DTD is the original mechanism for specifying the schema of an XML document.

Inherited in part from previous SGML standard.

arcane syntax

limited expressive functionality

We will consider newer XML Schema instead of DTDs.

Elements
```
<name>Jim Bob</name>
```
Main structure in an XML document.

Only one root element allowed.

everything else inside this

Start Tag

allows specification of zero or more attributes
```
<student id="0001" ...>
```
Closing Tag (or End Tag)

must match name, case, and nesting level of start tag
```
</student>
```
Name must start with letter or underscore and can contain only letters, numbers, hyphens, periods, and underscores.

Element Content
```
<student>
   <status>...</status>
</student>
```
Parsed Character Data (aka PCDATA, aka Text)
```
<name>Jim Bob</name>
```
Mixed Content
```
<name>Jim <initial>J</initial> Bob</name>
```
No Content
```
<residence/>
```
The usual relative terms are used for adjacent elements.

parent, child, sibling

Attributes
```
<student id="0001">
```
Element properties

must be valued

Name

must start with letter or underscore and can contain only letters, numbers, hyphens, periods, and underscores

must be unique in a given element

Value

can be of several types, but is almost always a string

must be quoted and have the ‘=‘

title="Lecture 2"

match='item="baseball bat"'

cannot contain < (at all) or & (by itself)

Elements and Attributes

Use which where?

“Should this data be coded as an element or an attribute?”

XML’s great imponderable

If data in an element:-

easier to read (or so it’s said anyway)

can have multiple values (i.e. sibling elements)

can have children

extendible

If data in an attribute:-

should be used as meta-data about the element

not hierarchical

used to distinguish between elements and has no further use

No hard and fast answer

a design decision

programmatically much the same work to get at element or attribute data

PCDATA
```
Jim Bob
```
“Parsed Character Data”

Has to be parsed because it may contain child elements.

Text that appears as the content of an element.

Can reference entities.

Data within is checked for sub-elements, entities, etc.

Cannot contain either < or & (by itself).

Entities
```
&amp;
```
Used to “escape” content or include content that is hard to enter or repeated frequently.

somewhat like macros

Five (and only five) pre-defined entities (note the trailing ‘;’).

& < > ' "

Character entities can refer to a single character by unicode number.

e.g., © is ©

Symbolic entities must be declared to be legal (in a DTD) i.e.

<!ENTITY mycopy "©">

&mycopy; could then be used in the XML document

Cannot refer to themselves.

CDATA
```
<![CDATA[ <h1>Jim Bob!</h1> ...]]>
```
“Character Data”

Parsed in “one chunk” by the XML parser.

Really data for a recipient application.

Data within is not checked for sub-elements, entities, etc.

Only string you can’t have is ]]>

Allows you to include badly formed markup or character data that would cause a problem during parsing.

Example:-

including HTML tags in an XML document which use markup minimisation i.e. no closing tag

Comments
```
<-- This is ... -->
```
Can include any text inside a comment to make it easier for human readers to understand your document.

Generally not available to applications reading the document.

Always begin with 

Cannot contain --

Processing Instructions
```
<?studentdb displaydesc="true"?>
```
“Sticky notes” to applications processing an XML document that explain how to handle content.

The target portion (e.g., studentdb) of a PI indicates the application that is to process this instruction; cannot start with “xml”.

The remainder of the PI can be any text that gives instructions to the application.

Examples:-

XML Declaration

instructions to an application to display different versions of an image

instructions to an application to suppress display of certain content

XML Tools and API Libraries

Many language bindings available with different implementations.

Xerces/Xalan(Apache), .NET System.XML (Microsoft)…

in the core language/framework class libraries

will additionally check if document well-formed and valid

Many approaches to process XML over the years:-

stream parsers

in memory processing e.g. DOM

processing engines e.g. XSLT

As a case study Lab 3 contains a .NET/C# application to serialize/de-serialize a list of objects of an arbitrary class.

uses current standard libraries
JSON

JavaScript Object Notation (JSON)

JSON is an alternative to XML and is the data interchange format of choice in modern web services.

Wire format introduced for AJAX Web applications (browser-web server communication).

specifically for data-exchange (self-documenting)

considered simpler than XML (simple and faster parsing)

much more compact and lightweight than XML (minimal; but still text)

most common as representation in RESTful web services

Data-types completely described by:-

strings, numbers, booleans and null

objects

unordered collections of name/value pairs

values themselves may be of any type including objects and arrays i.e. can be nested

arrays

ordered sequences of values

can represent records, list and trees using objects and arrays etc. by nesting them to an arbitrary complexity

Simple syntax

objects enclosed in curly braces with colon between name and value

arrays enclosed in square brackets

items are separated by commas and white space is insignificant

Supported in most languages.

📷 JSON Example

JSON and XML

JSON has much less expressive power than XML.

JSON has no feature to express:-

attributes

namespaces

mixed content

Although it is possible to shoehorn some of these features into JSON e.g. attributes, they are not represented in a standard manner.

XML has many more processing APIs than JSON.

and many of them (not all) are standardised

Any translation scheme from XML to JSON should be lossless i.e. not lose any information but this is difficult to achieve and depends on the XML document.

JSON Schema

is a standardisation initiative for a schema representation

an IETF draft standard

schema are themselves defined in JSON

🔗 JSON Schema

VS 2015 onwards has support for JSON Schema in its JSON editor

Many JSON API libraries including:-

JSON-P libraries in Java EE 8

.NET 4.5 – Json.Net (Newtonsoft) is the current API library of choice.

🔗 https://www.newtonsoft.com/json

X
XML Namespaces

XML documents:-

are a tree of elements

each element has a name and a set of attributes

each attribute has a name and a value

each element and attribute name is an unrestricted string from a set of restricted characters

we can call such a name, as we have defined so far, a local name

an application (a parser) will process the element based on the element and attribute name/values

they are used extensively in WSDL which we will be studying shortly

XML passed in web service communication will most likely use namespaces.

If there is no control over naming in XML:-

different programmers/standards bodies may/will pick the same names producing conflicts

we may wish to mix different languages in the one XML document and these may use (some of) the same names (compound documents)

the application (parser) will not be able tell elements/attributes with the same names apart and will wrongly process them

we do not want potential name clashes to be a concern for the designer of an XML language

see http://www.rpbourret.com/xml/NamespacesFAQ.htm#ns_2 for more reasons

Solution:-

introduce expanded names which are additionally scoped by a namespace

an optional mechanism providing additional information external to the XML document
URIs

URI (Uniform Resource Identifier)

“a compact string of characters for identifying an abstract or physical resource”

see: 🔗 http://tools.ietf.org/html/rfc3986

URL (Uniform Resource Locator)

subset of URIs

its location i.e. how to find it

“identify resources via a representation of their primary access mechanism, rather than … by name or some other attribute”

URN (Uniform Resource Name)

subset of URIs

its identity e.g. urn:isbn:0451450523

persistent and location-independent

“required to remain globally unique and persistent even when resource ceases to exist or becomes unavailable”
Namespaces in XML 1.0/1.1
Namespaces 1.1 adds support for:-

unicodeURIs (called IRIs)

undeclaringprefixes (and thus the current namespace)

Java (through Xerces) has support for Namespaces 1.1

.NET 4.5 however has support for Namespaces 1.0

An up-to-date reference for XML namespaces is at:-
🔗 http://www.rpbourret.com/xml/NamespacesFAQ.htm

XML Schema

XML Schema Definition (XSD) became an official recommendation in May 2001; Second Edition in October 2004.

Primer

an overview (> 150 pages alone!)

Structures

Data Types

A very large standard.

much of which should be kept as a reference rather than attempt to learn -a large and verbose standard

The Second Edition addresses only some bugs and minor loose ends of the original standard.

Main Features of XSD

XML Syntax

there is a schema for schemas

you can validate the schema (in EditiX -Lab 3)

Object-oriented type system

with inheritance, abstract types and finals

Global(=top-level) and local(=inline) type definitions

Modularisation

Built-in data-types

about 40 of these

not just positive integers etc. but dates etc.

can add constraints to the built-in types e.g. cardinality, ranges...

Regular expressions

An Example...

config.xml and config.xsd

The XML document which is meant to be valid with respect to a given XML schema definition is referred to as the instance document.

Notice that:

XML instance points to schema

XML Schema declares elements (and attributes)

XML Schema defines types

types come in a number of varieties:

Built-in types (e.g., xsd:string, xsd:date)

Simple types

Complex types

there is a implicit assumption of 1 embedded in many constructs e.g.

<xsd:sequence>

i.e. here the sequence must occur once only

another value or range can be specified

Why?

Data validation

structure of elements and attributes

order of elements

data values of elements and attributes

uniqueness of values

Simplifies interoperability and tooling if parties can agree acceptable data formats described in a standard way.

Documentation.

Augmentation of instance with default values.

Storage of application information.

Application can avoid error checking if it can be assumed instance already validated.

There is a meta-schema for XML Schema documents.

i.e. a XML Schema document’s own validity can be checked against this!

Another Example (product.xsd)
```
<xsd:schemaxmlns:xsd="http://www.w3.org/2001/XMLSchema">
  <xsd:elementname="product" type="ProductType"/>
  <xsd:complexTypename="ProductType">
       <xsd:sequence>
              <xsd:elementname="number" type="xsd:integer"/>
              <xsd:elementname="size" type="SizeType"/>
       </xsd:sequence>
       <xsd:attributename="effDate" type="xsd:date"/>
  </xsd:complexType>
  <xsd:simpleTypename="SizeType">
        <xsd:restrictionbase="xsd:integer">
             <xsd:minInclusivevalue="2"/>
             <xsd:maxInclusivevalue="18"/>
        </xsd:restriction>
  </xsd:simpleType>             <product effDate="2008-04-02">
</xsd:schema>                     <number>557</number>
                                     <size>10</size>
                                   </product>
```
Namespace Usage Examples
```
<prod:product xmlns:prod="http://example.org/prod"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://example.org/prod prod.xsd">
  <prod:number>557</prod:number>
  <prod:size>10</prod:size>
</prod:product>

<order xmlns="http://example.org/ord"
       xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance">
       xsi:schemaLocation="http://example.org/prod prod.xsd
       http://example.org/ord ord.xsd"
<number>123ABBCC123</number>
<items>
   <product xmlns="http://example.org/prod">
     <number>557</number>
     <size system="US-DRESS">10</size>
   </product>
</items>
</order>
```

XML and JSON

XML in Brief

XML 1.0 – A Representative Document

JSON

XML Namespaces

URIs

Namespaces in XML 1.0/1.1

XML Schema

An Example...

Why?

Another Example (product.xsd)

Namespace Usage Examples