Building your Big Data playground with Azure

Let's say you've been assigned a task that requires provisioning a whole new environment with technologies that aren't exactly "cool" to run on your dev machine. Take Hadoop - it's becoming more and more popular, yet it's still a black box for many people (including me). What if you'd like to play with it a little? Well, there are instructions out there for installing and running it on Windows. Trust me - it's doable... and that's the only "good" part of the whole process.

Do it for me?

I don't like wasting my time and my computer's resources on temporary things I need only for a few hours. What I like is to make something do it for me. If you take a look at the Azure Marketplace, you'll see plenty of available software images, many of them built on OSS, which can be installed and used without additional charges for the software itself. Does that include Hadoop? Yes, it does. Let's grab it and install it.
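If you prefer the command line over the portal, you can browse Marketplace images with the Azure CLI. This is just a sketch - the publisher name below ("hortonworks") is an assumption; search for whichever Hadoop distribution you actually want to use.

```shell
# List Marketplace VM images from a given publisher (placeholder publisher name).
# Requires the Azure CLI ("az") and a prior "az login".
az vm image list --all --publisher hortonworks --output table
```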

Do I have money for it?

If you have an Azure subscription, don't worry about installing this image - as I said, it charges you only for the resources you're using. When you're done with it, you can either delete the whole resource group with the provisioned resources, or just stop (deallocate) the VM used by Hadoop - that saves you time the next time you need it, and the cost in that case is negligible.
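Both cleanup options can be done from the Azure CLI as well. The resource group and VM names below are placeholders - substitute whatever you chose during provisioning.

```shell
# Deallocate the VM so compute billing stops (managed disks still incur a small charge):
az vm deallocate --resource-group hadoop-rg --name hadoop-vm

# Or remove everything at once when you no longer need the playground:
az group delete --name hadoop-rg --yes
```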

Got it! What's next?

The Linux VM instance Hadoop is installed on is accessible through an SSH client and requires either the SSH key or the password you provided when connecting (you can use whichever client you like - PuTTY, or even the terminal bundled with SourceTree). Once connected, you can run tasks and scripts designed for Hadoop.
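From a terminal, the connection is a one-liner. The username, key path, and IP address below are placeholders - use the values from your own deployment.

```shell
# Connect with the key (or password) you provided during provisioning:
ssh -i ~/.ssh/hadoop_vm_key azureuser@13.95.0.1

# Once logged in, a quick sanity check that Hadoop is on the PATH:
hadoop version
```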

Just to make things clear - in the Azure Portal, when you go to the Overview tab of the VM provisioned for Hadoop, you'll see its public IP address, which you can use to connect. What's more, you can use SFTP to upload files to the VM or download them from it. Open your SFTP client, use your_VM_IP with port 22 as the host, and enter your credentials. You'll see the default directory of your VM. From this point everything is set - you have your very own Hadoop playground, which you can use whenever you want.
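The same transfer works from the command line with the stock sftp client. Again, the username, IP, and file names here are purely illustrative.

```shell
# Open an SFTP session to the VM (SFTP runs over SSH on port 22)
# and run a small batch of commands on it:
sftp azureuser@13.95.0.1 <<'EOF'
put ./data/input.csv
get results.txt
EOF
```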

Overview | Mesosphere DC/OS on Azure - allowing external access

In the previous post I introduced DC/OS and provided a way to install it. There was a minor caveat related to accessing your DC/OS instance - you need SSH to connect to it and some way to tunnel port 80 from the VM to your local computer. In fact, the whole infrastructure is sealed and allows connections only via SSH. What if I'd like to open it up some other way? Well, there's a simple way to do it, which I'll present by allowing access directly from a browser on port 80.
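For completeness, the tunneling workaround mentioned above looks something like this (the username and load balancer address are placeholders for your deployment):

```shell
# Forward local port 8080 to port 80 on the DC/OS master,
# connecting through the load balancer's SSH endpoint on port 2200:
ssh -p 2200 -L 8080:localhost:80 azureuser@<master-lb-public-ip>

# While the tunnel is open, browse to http://localhost:8080
```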

Inbound security rules...

The whole DC/OS isolation comes from the fact that it resides inside a virtual network, protected by both its security rules and a load balancer that directs traffic inside the network. By default it allows SSH connections on port 2200, which are forwarded to port 22. To allow access to any other service, we have to perform the following steps:

  1. Add a new inbound NAT rule to the load balancer to forward traffic on port 80 to port 80 inside our VM
  2. Allow access to our network on port 80

Note - we're talking about HTTP here; there's no problem changing the configured port to 443 and accessing the VM only via HTTPS.

How can I do it?

To allow access to our VM via HTTP, perform the following steps:

  1. Go to the Azure Portal and open the resource group containing your Mesosphere DC/OS instance
  2. Find the master load balancer (its name usually contains something like dcos-master)
  3. Go to Inbound NAT rules and click +Add
  4. Provide a name for the rule and, from the Service dropdown, select the service you'd like to configure (e.g. HTTP)
  5. In the Target field, select the VM you're interested in
  6. Then click OK and wait a minute for the load balancer to be reconfigured
  7. Now go back to the DC/OS resources and find the network security group associated with the master node
  8. Go to Inbound security rules and click +Add
  9. Provide a name and select the service you're interested in
  10. Make sure Allow is selected and click OK
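The portal steps above can also be sketched with the Azure CLI. All resource names below are hypothetical - adjust them to whatever your DC/OS deployment created, and note that unlike the portal's Target field, a NAT rule created this way still has to be associated with the VM's network interface separately.

```shell
# 1) Forward port 80 on the master load balancer to port 80 inside the VM:
az network lb inbound-nat-rule create \
  --resource-group dcos-rg \
  --lb-name dcos-master-lb \
  --name HTTP \
  --protocol Tcp \
  --frontend-port 80 \
  --backend-port 80

# 2) Open port 80 in the master node's network security group:
az network nsg rule create \
  --resource-group dcos-rg \
  --nsg-name dcos-master-nsg \
  --name AllowHTTP \
  --priority 200 \
  --protocol Tcp \
  --destination-port-ranges 80 \
  --access Allow
```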

Once the configuration is finished, you should be able to access DC/OS in your browser using the VM's public IP address.