04
  • Motivation

    Scaling computation is only part of the story…

    A cloud application’s data handling has to scale well. If it doesn’t, the cloud’s computational scalability is bottlenecked by data access and effectively wasted.

    Data has to be spread across multiple servers, sometimes with geo-replication.

    Often (certainly with relational data) there are no easy answers; the right approach depends on both the data and the application.

    Overview

    Azure Cloud Storage (non-relational)
    A Microsoft product, but other cloud vendors offer similar technology, often implemented as services
    semi/unstructured data
    massive scalability
    key-value stores
    blobs, queues, (non-relational) tables
    uses some features from non-relational databases
    abstractions of block storage and not a full database as such

    Non-Relational Databases (NoSQL)
    semi/unstructured data
    massive scalability
    key-value stores, column stores, document stores, graph databases...

    Relational Databases
    good for structured data - schemas
    but not so good for semi/unstructured data which has increased in prevalence
    do not scale well
    but lots of relational data exists that needs to be deployed in the cloud...
    various techniques to try to make them scale
  • Key/Value Stores

    Standard concept for cloud storage and non-relational databases.

    and some distributed caches, e.g. Redis

    Unique key identifies a value whilst hiding its actual location.

    i.e. user does not know directly where it is in the cloud or its replication

    Value is dependent on implementation.

    in a column store NoSQL database this would be akin to a record
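
    The idea above can be sketched in a few lines of Python (an illustrative toy, not any vendor's implementation): the caller addresses a value by its unique key only, and a deterministic hash hides which "server" actually holds it.

```python
import hashlib

class KeyValueStore:
    """Minimal key/value store sketch: callers address values by unique
    key only; which server holds a value is hidden behind the hash."""

    def __init__(self, n_servers=3):
        # each "server" is stood in for by a plain dict
        self.servers = [{} for _ in range(n_servers)]

    def _server_for(self, key):
        # a deterministic hash of the key picks the server
        digest = hashlib.md5(key.encode()).hexdigest()
        return self.servers[int(digest, 16) % len(self.servers)]

    def put(self, key, value):
        self._server_for(key)[key] = value

    def get(self, key):
        return self._server_for(key)[key]

store = KeyValueStore()
store.put("user:42", {"name": "Ada"})
print(store.get("user:42"))  # the caller never learns which server held it
```

    In a real cloud store the "servers" would be remote, and the same hiding of location is what allows the provider to move and replicate data freely.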

    Azure Cloud Storage

    A bit like “a file system for the cloud”.

    i.e. always there and accessible from applications without having to start any separate service

    Common storage structures include blobs, tables, and queues.

    Generally do not attempt to enforce any data integrity, e.g. referential integrity.

    whatever integrity the application requires is the responsibility of the program developer

    As a result cloud storage tends to scale very well.

    Cloud storage exposes RESTful APIs.
    so you can get or put a blob etc…
    the Azure SDK for storage is in fact a CLR wrapper around REST calls
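
    As a sketch of the RESTful style, the request below is only constructed, never sent: the account, container and blob names are invented, and a real Azure call would additionally need authentication headers (e.g. a shared-key signature or SAS token).

```python
from urllib import request

# Hypothetical names; a real call also needs auth headers, omitted here.
account, container, blob = "myaccount", "photos", "cat.png"
url = f"https://{account}.blob.core.windows.net/{container}/{blob}"

req = request.Request(url, method="GET")  # GET fetches the blob's bytes;
                                          # PUT with a body would store them
# request.urlopen(req) would actually perform the call - not done here
print(req.get_method(), req.full_url)
```

    The SDK wrapper mentioned above builds and signs exactly this kind of HTTP request on the programmer's behalf.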

    Again a key/value store.

    Values are blobs or (non-relational) tables or queues.
    basically share the same idea
    blobs and tables exist separately to allow different optimisations
    for unstructured and semi-structured data respectively

    (on Azure) Each cloud server splits disk space into 1 or more partitions.
    a partition is a set of disk blocks’ data on one server
    implemented on an enhanced distributed file system (hidden from programmer)
    each partition can be moved between servers dynamically within the cloud according to load or size to ensure it scales
    the key includes the identity of the partition (partition key)
    in detail the keys are used differently in tables, blobs and queues

    Azure Cloud Storage Structure (layered, top to bottom):

    Blobs | Tables | Queues
    Key/Value Store (movable and resizable partitions)
    Distributed File System
    Disk Blocks

    Blobs

    BLOB = Binary Large Object.

    Storage of an unstructured binary object in a single partition.

    Contents are opaque to the storage service.

    Blob has settable metadata.

    e.g. a name allowing one blob to be distinguished from another

    Stored in blocks (optional access at this level enables efficient random access).

    Can map a file system onto a blob for legacy applications.

    Example of content: videos, songs, pictures, newspaper articles.

    Azure Blobs or AWS Simple Storage Service (S3).

    Azure Blobs

    Azure Blobs Diagram

    Containers (buckets in S3) facilitate an administrative breakdown of the blob space.

    provides a namespace mechanism

    The key value identifying the partition is the container name + blob name.
    each blob has its own partition
    blobs can be distributed across many servers

    📷 Common BLOB Use Case

    Queues

    A queue is a classic first-in, first-out data storage structure.

    Primarily used for passing data from one computing job to another in a loosely-coupled fashion.

    typically between role instances

    Generally not used for long-term storage.

    Scale really well.
    there are limits on the rate of message processing
    if the rate proves insufficient can be scaled by creating multiple queues

    The key for a message is the queue name.
    all messages in the same queue are grouped into a single partition and are served by a single server
    different queues may be processed by different servers i.e. to balance the load for multiple queues
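
    Both points can be sketched together in Python (queue names invented): a deque per queue gives the FIFO behaviour, each queue standing for one partition on one server, and round-robin sending over two queues shows how throughput is scaled past a single queue's rate limit.

```python
from collections import deque
import itertools

# each queue is a single partition served by one server; spreading sends
# round-robin over several queues sketches scaling past one queue's limit
queues = {name: deque() for name in ("jobs-0", "jobs-1")}
choose = itertools.cycle(queues)

def send(message):
    queues[next(choose)].append(message)     # enqueue at the tail

def receive(queue_name):
    return queues[queue_name].popleft()      # dequeue from the head (FIFO)

for i in range(4):
    send(f"task-{i}")
print(receive("jobs-0"), receive("jobs-1"))  # task-0 task-1
```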

    Azure Queues or AWS Simple Queue Service (SQS).

    offer similar feature sets with minor differences in QoS

    📷 Azure Queues

    📷 Azure Queues Use Case

    Azure Tables

    Semi-structured data without schemas.

    Basically collections of name-value pairs.

    You can use them in any convenient way, like a property bag.

    each row/entity can have a different set of properties

    However probably most useful in the classic row-column case.

    Primary difference from relational tables is the absence of relations and thus referential integrity checks e.g. no joins are available.
    scale much better than relational tables
    it is possible to work round the lack of relations but the functionality has to be “hand crafted”
    Azure Tables Diagram

    in Azure entity = row; property = column

    Key is 2 parts: partition key and row key

    Max entity size is 1MB

    Azure Tables Diagram 2


    The programmer has control over the entity/row partitioning i.e. what partition has what entities.
    at least logically to ensure that a subset of entities can be kept together
    need to balance the scalability benefits of spreading entities across multiple partitions with the data access advantages of grouping entities in a single partition
    apart – better load balancing and scalability
    together – atomic batch operations possible e.g. limited transactional operations across multiple entities
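
    The entity and key structure can be sketched with plain dicts (all names and property sets invented): PartitionKey groups entities onto one partition, RowKey is unique within it, and other properties can vary per entity.

```python
# entities as dicts: PartitionKey groups rows onto one partition,
# RowKey is unique within it; other properties differ freely per entity
entities = [
    {"PartitionKey": "uk", "RowKey": "001", "name": "Ada", "city": "London"},
    {"PartitionKey": "uk", "RowKey": "002", "name": "Alan"},          # no city
    {"PartitionKey": "us", "RowKey": "001", "name": "Grace", "age": 85},
]

def partition(rows, pkey):
    # entities sharing a PartitionKey are stored together, which is what
    # makes limited (single-partition) batch transactions possible
    return [e for e in rows if e["PartitionKey"] == pkey]

print([e["RowKey"] for e in partition(entities, "uk")])  # ['001', '002']
```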

    DynamoDB is an AWS offering with similar functionality and some enhanced features, delivered as a managed service

    📷 Azure Tables Use Case

    Lab 4 contains code examples.
    Part 1 - TestTables - demonstrates using a console application how to carry out basic tasks on Azure tables
    Parts 3 to 5 - ProductStore - RESTful application using an Azure table to persist data

    We will always be using just one table with one partition with a fixed property (column) set.
    note that you could easily simulate relations by putting key values of a separate table in a property
    but you would have to code the integrity...
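
    A minimal sketch of that hand-crafted integrity (table and property names invented): the "foreign key" is just another table's key pair stored in a property, and the check is ordinary application code because the store will not do it.

```python
# the "relation" is another table's (PartitionKey, RowKey) held in
# properties; the storage service never checks it - the application must
products = {("electronics", "p1"): {"name": "Phone"}}
orders = [{"PartitionKey": "2024", "RowKey": "o1",
           "ProductPK": "electronics", "ProductRK": "p1"}]

def orders_consistent(orders, products):
    # hand-coded referential integrity check
    return all((o["ProductPK"], o["ProductRK"]) in products for o in orders)

print(orders_consistent(orders, products))  # True
```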

    Useful link: 🔗 Get started with Azure Table storage and the Azure Cosmos DB Table API using .NET

  • Non-Relational Databases

    Many different options with different characteristics.
    many available as managed services from cloud vendors
    usually have a weaker consistency model than relational databases (see later lecture)

    Column Stores: data stored in schema-less records with variable columns (fields). Wide column stores can hold a very large number of dynamic columns, e.g. Cassandra

    Document Stores: structured data stored as values e.g. JSON. Data can be fully indexed and queried e.g. MongoDB, MS Document DB

    Azure Tables in Context

    Azure Tables basically offers a high-performance, geo-replicated column store database with a RESTful API.

    However, if you require complex queries or need to manipulate richer data types, you may need a full-blown NoSQL database.
    e.g. with Azure tables it is inefficient to query on values
    only the partition and row keys can be indexed

    Performance and cost also have to be benchmarked for the data and data access behaviour.

    Not an easy answer and a moving target...

  • Relational Databases

    Q: What’s the problem?

    A: Relational tables are difficult to scale out well.
    by definition relational databases contain necessary and complex relations between tables

    These do not scale well because of the need to enforce referential integrity.
    this check is undertaken by the dbms itself
    the dbms needs to be able to communicate between all tables involved at the same time

    This ability degrades as the number of tables and relations increases and with the number of rows in each table.

    may be able to scale up to a certain degree, but scaling out is difficult

    Important observation: most data services are read-dominant.

    database replication works well for them

    Some of the scalability techniques used predate the cloud and have been leveraged.

    we will review the common techniques:-

    1. Functional Partitioning
    distribute complete tables
    limiting the scope of UDI queries i.e. UPDATE, DELETE, INSERT

    2. Vertical Partitioning
    split tables by columns
    denormalisation

    3. Horizontal Partitioning
    sharding
    split tables by rows

    A relational database designed for the cloud (e.g. Azure SQL) will offer a mix of these techniques but their actual use is up to the database designer and is application dependent.

    none of this can be applied automatically…

    Typical Simple Web Architecture

    One application server runs application code.

    assume a simple one instance web role

    One database server holds the application state.

    The application code can issue any query to the database.
    SELECT (read queries)
    UPDATE, DELETE, INSERT (UDI queries)
    transactions
    Typical Simple Web Architecture Diagram

    Scaling the Architecture

    The application server contains only the application code.
    could be a simple multi-instance web role
    it does not hold state
    different requests can be processed independently
    Scaling the Architecture Diagram

    Replicating the Database

    Data is fully replicated across multiple database servers.
    read queries can be addressed at any replica
    UDIs must be issued at every replica (to ensure data consistency)

    Each database server must process:-

    1/N Read Queries + UDI Queries
    increasing N does not help when the UDIs alone saturate the server’s capacity
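
    With made-up request rates, the effect is easy to compute: each replica's read share shrinks with N, but its UDI share never does.

```python
reads, udis = 9000, 1000   # requests/sec, illustrative numbers only

for n in (1, 3, 9):
    per_replica = reads / n + udis   # each replica: its share of reads + ALL UDIs
    print(f"N={n}: {per_replica:.0f} req/s per replica")
# N=1: 10000, N=3: 4000, N=9: 2000 - the 1000 UDI req/s floor never drops,
# so adding replicas cannot help once UDIs alone saturate a server
```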

    Replicating the Database Diagram

    Functional Partitioning

    We must send fewer UDIs to each server.
    each server contains a subset of all tables
    note tables are intact here
    limits the scope of UDI queries

    Updates to T1 must be addressed to only 2 servers.

    We must place tables according to query templates.

    we cannot execute a query that joins T1 and T2...

    Functional Partitioning Diagram
    Functional Partitioning Diagram 2
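
    The routing rule can be sketched as follows (table and server names invented, with T1 held on two servers as in the example above): a query is only executable where every table it touches is co-located.

```python
# table -> servers holding a complete copy of it (names invented)
placement = {"T1": {"A", "B"}, "T2": {"C"}, "T3": {"B", "C"}}

def servers_for(tables):
    """Servers holding every table a query touches (empty = not routable)."""
    return set.intersection(*(placement[t] for t in tables))

print(sorted(servers_for(["T1"])))        # ['A', 'B'] - a T1 update hits 2 servers
print(sorted(servers_for(["T2", "T3"])))  # ['C']
print(servers_for(["T1", "T2"]))          # set() - a T1/T2 join cannot run anywhere
```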

    It can be shown that this can dramatically improve scalability.

    However granularity may be too coarse.
    maximum gain limited by number of tables
    that said, the tables themselves may be replicated
    we can introduce a finer granularity: column-level

    Vertical Partitioning

    Vertical Partitioning Diagram
    Reduces the I/O and performance costs associated with fetching the items that are accessed most frequently.
    column-level oriented
    schema has to be reorganised
    columns with common access behaviours can be held together and optimised separately
    e.g. speeded up with the introduction of a cache
    different collections of columns can be secured separately

    However there is now the issue that even simple queries may have to access data from multiple data services.

    Performance can be further improved if some partitions are read-only and can thus be easily replicated.
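
    The trade-off can be sketched in Python (columns and names invented): "hot" columns read on every request live in one store, rarely read "cold" columns in another, and even a whole-row query now touches both.

```python
# one logical row split column-wise into two stores: "hot" columns read
# on every request, "cold" columns read rarely (all names invented)
hot  = {"u1": {"name": "Ada", "avatar": "ada.png"}}
cold = {"u1": {"bio": "Mathematician", "signup_date": "2019-04-01"}}

def full_row(uid):
    # even a simple whole-row query must now touch both data services
    return {**hot[uid], **cold[uid]}

print(full_row("u1")["name"])  # Ada - but assembled from two services
```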

    Denormalisation

    Denormalisation is a variant of vertical partitioning.

    Intentionally duplicate columns across multiple tables.
    denormalise data into separate services
    replicate read-only columns across multiple tables and data services

    Horizontal Partitioning

    Horizontal Partitioning Diagram

    Also commonly called sharding.

    Notice all shards have the same schema.

    Most queries are indexed by primary key.

    Common approach is to hash table records by their primary key.
    (not shown in the diagram)
    other mathematical functions are used too e.g. modulo
    split hash space evenly between shards (i.e. servers)
    can be difficult to later rebalance the data distribution...
    as the hashed key value is random this ensures balanced distribution of rows between shards and can ensure balanced workload
    avoids “hot partitions”
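
    The hash-modulo scheme above is a one-liner in practice; this sketch (shard count chosen arbitrarily) also checks that the distribution over shards comes out roughly even.

```python
import hashlib

SHARDS = 4

def shard_for(primary_key):
    # hash of the primary key modulo the shard count picks the server;
    # a good hash spreads keys evenly, avoiding hot partitions
    digest = hashlib.sha256(str(primary_key).encode()).hexdigest()
    return int(digest, 16) % SHARDS

counts = [0] * SHARDS
for pk in range(10_000):
    counts[shard_for(pk)] += 1
print(counts)  # each shard ends up with roughly 10_000 / 4 keys
```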

    However the main requirement is to balance requests across shards, not to ensure they contain the same amount of data.

    important to ensure that a single shard does not exceed the scale limits (in terms of capacity and processing resources) of the data store being used to host that shard

    Consistent Hashing

    The problem with adding/removing servers is rebalancing the data distribution...

    need to reorganise values across the current servers

    Consistent hashing is an approach that limits the disruption so that only about K/S values (K = number of keys, S = number of shards) have to be reorganised when a shard server is added or removed.
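
    A minimal ring sketch (one hash point per server; real systems use many virtual nodes per server to even out arc sizes): servers and keys are hashed onto a circle, each key belongs to the next server clockwise, and adding a server moves only the keys on its new arc rather than nearly all of them.

```python
import bisect
import hashlib

def h(s):
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def build_ring(servers):
    # each server is placed at its hash position on a circular key space
    return sorted((h(s), s) for s in servers)

def lookup(ring, key):
    # a key belongs to the first server at or after its hash (wrapping round)
    points = [p for p, _ in ring]
    return ring[bisect.bisect(points, h(key)) % len(ring)][1]

keys = [f"k{i}" for i in range(1000)]
before = build_ring(["s0", "s1", "s2"])
after = build_ring(["s0", "s1", "s2", "s3"])
moved = sum(lookup(before, k) != lookup(after, k) for k in keys)
print(f"{moved} of {len(keys)} keys moved")  # only keys on s3's new arc move
```

    With naive hash-modulo, adding a fourth server would instead remap around three quarters of all keys.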

    Note that in general the concept of sharding, key hashing (and consistent hashing) is used in NoSQL databases and other technologies as well.

School of Computing, Engineering and Built Environment