Migrating schema and data in Azure Table Storage

Recently I faced a problem: I had to change and adjust the schema of tables stored in Azure Table Storage. The tricky part was automating the changes so I wouldn't have to perform them manually on each environment. This is why I created a simple library called AzureTableStorageMigrator, which helps with such tasks and eases the whole process.

The basics

The basic idea was to create two things:

  • a simple fluent API, which will take care of chaining all tasks
  • a table which will hold all migration metadata

The current version (1.0) gives you the following operations:

  • void Insert<T>(string tableName, T entity, bool createIfNotExists = false)
  • void DeleteTable(string tableName)
  • void CreateTable(string tableName)
  • void RenameTable<T>(string originTable, string destinationTable)
  • void Delete<T>(string tableName, T entity)
  • void Delete(string tableName, string partitionKey)
  • void Delete(string tableName, string partitionKey, string rowKey)
  • void Clear(string tableName)

and when you take a look at an example of its usage:

var migrator = new Migrator();
// All tasks belonging to this migration are chained inside a single CreateMigration() call
migrator.CreateMigration(_ =>
{
  _.CreateTable("table1");
  _.CreateTable("table2");
  _.Insert("table1", new DummyEntity { PartitionKey = "pk", RowKey = DateTime.UtcNow.Ticks.ToString(), Name = "foo" });
  _.Insert("table1", new DummyEntity { PartitionKey = "pk", RowKey = DateTime.UtcNow.Ticks.ToString(), Name = "foo2" });
  _.Insert("table2", new DummyEntity { PartitionKey = "pk", RowKey = DateTime.UtcNow.Ticks.ToString(), Name = "foo" });
}, 1, "1.1", "My first migration!"); // id, version number and description of the migration

you'll see that it's pretty straightforward and self-describing.

The way it works is very simple: each call to CreateMigration() is described using three values - an id, a version number and a description. Each time the method is called, it adds a new record to the versionData table, so the metadata is saved and the same migration won't be run twice.
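To illustrate the run-once idea, here's a rough sketch of how such a check could be implemented with the classic Azure Storage SDK. This is not ATSM's actual code - the partition key, property names and connection string are assumptions made purely for the example:

using Microsoft.WindowsAzure.Storage;
using Microsoft.WindowsAzure.Storage.Table;

class VersionCheckSketch
{
    static void Main()
    {
        // Hypothetical connection string - use your own storage account here
        var account = CloudStorageAccount.Parse("UseDevelopmentStorage=true");
        var table = account.CreateCloudTableClient().GetTableReference("versionData");
        table.CreateIfNotExists();

        // One row per migration; here the migration id doubles as the row key
        var migrationId = "1";
        var existing = table.Execute(
            TableOperation.Retrieve<DynamicTableEntity>("migrations", migrationId));

        if (existing.Result == null)
        {
            // Migration hasn't been run yet - execute it, then record the metadata
            var entry = new DynamicTableEntity("migrations", migrationId);
            entry.Properties["Version"] = new EntityProperty("1.1");
            entry.Properties["Description"] = new EntityProperty("My first migration!");
            table.Execute(TableOperation.Insert(entry));
        }
    }
}

On the next run the Retrieve call finds the record, so the migration body is simply skipped.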

Why should I use it?

In fact, it's not a matter of what you "should" do but rather what is "good" for your project. Versioning is generally a good idea, especially if you follow a CI/CD approach, where the goal is to deploy and roll back with ease. If you perform migrations by hand, you'll eventually face a situation where a rollback is either very time-consuming or almost impossible.

It's good to remember that making your database a part of your repository (in terms of storing schema, of course, not data) is considered good practice and is a core part of many modern projects.

What's next?

I published ATSM because I couldn't find a similar tool that would help me version tables in Table Storage easily. New features will certainly be added in the future; in the meantime, if you find this project interesting, feel free to post an issue or a feature request - I'll be more than happy to discuss it.

Demystifying things - WASB in Azure

When you want to interact with Azure Storage, the easiest way is to access it via an HTTP client (whether we're talking about the REST API or hitting an endpoint manually). This is fairly easy to spot - consider the following URL:

https://yourstorageaccount.blob.core.windows.net/path
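Fetching a blob really can be a plain HTTP GET. Below is a minimal sketch - the account, container and blob names are made up, and the blob is assumed to be publicly readable (a private one would additionally need a SAS token or an authorization header):

using System;
using System.Net.Http;
using System.Threading.Tasks;

class BlobOverHttp
{
    static async Task Main()
    {
        // Hypothetical, publicly readable blob
        var url = "https://yourstorageaccount.blob.core.windows.net/mycontainer/file.txt";

        using (var client = new HttpClient())
        {
            var response = await client.GetAsync(url);
            response.EnsureSuccessStatusCode();
            Console.WriteLine(await response.Content.ReadAsStringAsync());
        }
    }
}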

Simple as that. However, when it comes to Blob Storage, you can also find another scheme in the documentation:

wasb[s]://<containername>@<accountname>.blob.core.windows.net/<path>

What is this mysterious WASB, and what is it used for?

Hadoop-in-the-cloud

One of Hadoop's core components is HDFS, the file system it uses to manage data and storage. The limitation of HDFS is that it only has access to files stored locally in your cluster. So the question arises: what if I'd like to access files stored in blob storage? Well, this is where WASB comes into play.

WASB - or Windows Azure Storage Blob - is an abstraction built on top of the HDFS interface. It allows Hadoop (or HDInsight, since we're talking about Hadoop in the Azure cloud) to seamlessly integrate with Azure Blob Storage. What's more, it allows multiple clusters to access data stored in one place. But what does it really give you?

Sharing is fun!

Before Hadoop can start working on data, it has to load it from somewhere. Normally you either store it in your cluster or load it from an external source; the important point is that the data has to be accessible locally. Now imagine you'd like to destroy the cluster each time a computation finishes (e.g. it runs twice a week, so there's no need to pay for the cluster every day). Moving and loading the data each time you want to do something with it consumes time and resources.

When using HDInsight, you no longer have to worry about those caveats. Thanks to WASB, Hadoop can read data from blob storage directly - you can connect multiple consumers and run computations at the same time. What's more, WASB can be configured in a traditional Hadoop installation, so even when you provision a cluster on your own in Azure, it's still possible to use WASB there.
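To make the relationship between the two schemes concrete, here's a small sketch (all names are made up) showing that a file a cluster reads via wasb[s]:// is the very same blob you could fetch over plain HTTPS - the container and account move from the authority part of the URI into the path:

using System;

class WasbAddressing
{
    static void Main()
    {
        // Hypothetical names, purely for illustration
        var container = "mycontainer";
        var account = "yourstorageaccount";
        var path = "data/input.csv";

        // How an HDInsight cluster addresses the file...
        var wasbUri = $"wasbs://{container}@{account}.blob.core.windows.net/{path}";

        // ...and the same blob addressed over plain HTTPS
        var httpsUrl = $"https://{account}.blob.core.windows.net/{container}/{path}";

        Console.WriteLine(wasbUri);
        Console.WriteLine(httpsUrl);
    }
}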