Docker Swarm – Experience / Issues

Posted: June 12, 2017 in Docker, Cluster Manager, Uncategorized

Over the last few months I worked on a cloud-based cluster manager using Docker Swarm to set up new customer instances for my project, and the experience was wonderful: it gave me a deep understanding of how cloud-based cluster solutions work compared to the old-school application deployment/setup model. Our solution is mainly meant to replace another popular cluster management stack, DMM – Docker, Marathon & Mesos.

In DMM, Marathon-lb + Marathon + Mesos take care of connecting customer requests to their respective services running on the cluster. But that is a more involved effort, and Docker Swarm provides these capabilities out of the box with simple commands. Some of the technologies that we are using …

  • Mesos => Master / Slave configuration forms the cluster
  • Chronos / Marathon (REST API) => Resource Manager, helps to orchestrate containerized services/apps
  • Docker containers isolate resource consumption between tenants
  • Marathon-lb / HAProxy => binds to the service port of every app and forwards incoming requests to app instances
    • marathon-lb.py calls the Marathon API to retrieve all running services/apps, generates/updates the HAProxy config, and reloads the HAProxy service (a simplified sketch of this loop follows this list)
  • Spark => distributed compute software; can run in local, standalone, or Mesos/YARN mode
  • HDFS => global shared file system
  • ZK => distributed/centralized configuration management tool used by many distributed systems such as Kafka, Spark, etc. to achieve high availability through the leader/follower model that ZK provides
  • Nginx => high-performance HTTP server and reverse proxy
  • Docker Swarm => Swarm mode on the Docker Engine helps us manage the cluster natively.
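To make the marathon-lb piece concrete, here is a minimal sketch of that regenerate-and-reload loop in Python. The Marathon URL, the config fragment path, and the backend-only output are assumptions for illustration; the real marathon-lb script renders a full HAProxy configuration (global/defaults/frontend sections included) from templates.

```python
import subprocess

import requests

MARATHON_URL = "http://marathon.example.com:8080"   # assumed Marathon endpoint
BACKENDS_CFG = "/etc/haproxy/conf.d/backends.cfg"   # assumed config fragment path

def fetch_apps():
    # Marathon's REST API lists running apps; embed=apps.tasks includes their tasks.
    resp = requests.get(MARATHON_URL + "/v2/apps", params={"embed": "apps.tasks"})
    resp.raise_for_status()
    return resp.json()["apps"]

def render_backends(apps):
    # One HAProxy backend per app, one server line per running task instance.
    lines = []
    for app in apps:
        backend = app["id"].strip("/").replace("/", "_")
        lines.append("backend %s" % backend)
        for task in app.get("tasks", []):
            for port in task.get("ports", []):
                lines.append("  server %s %s:%d check" % (task["id"], task["host"], port))
    return "\n".join(lines) + "\n"

def reload_haproxy(config_text):
    with open(BACKENDS_CFG, "w") as f:
        f.write(config_text)
    # Graceful reload so existing connections are not dropped.
    subprocess.check_call(["systemctl", "reload", "haproxy"])

if __name__ == "__main__":
    reload_haproxy(render_backends(fetch_apps()))
```

Swarm mode removes the need for this loop entirely: the ingress routing mesh and the DNS-based service VIPs route incoming requests to service replicas for us.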

For building the new architecture I had to learn various commands related to Docker/Docker Swarm, HDFS, Spark & Linux (thanks to our great Chief Architect for his vision/inputs). We built a Python-based provisioning service to create a customer-specific instance, which involves setting up many swarm services (a stripped-down provisioning sketch follows the list below) …

  • Core product service
  • Backend ETL product service
  • Spark standalone cluster services, i.e. master service + worker services
  • Other non-Docker-Swarm-based configuration
    • Customer space in HDFS with default data
    • ZK configuration for all components.
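To give a flavour of the provisioning flow, here is a stripped-down sketch using the Docker SDK for Python, the hdfs CLI, and kazoo for the ZK part. All image names, network names, paths and the ZK layout are hypothetical placeholders rather than our actual product configuration.

```python
import subprocess

import docker
from docker.types import EndpointSpec, ServiceMode
from kazoo.client import KazooClient

def provision_customer(tenant, web_port):
    client = docker.from_env()

    # One overlay network per tenant keeps service-to-service traffic isolated.
    net = "%s-net" % tenant
    client.networks.create(net, driver="overlay")

    # Core product service, published on the ingress routing mesh.
    client.services.create(
        image="registry.example.com/core-product:latest",   # hypothetical image
        name="%s-core" % tenant,
        networks=[net],
        env=["TENANT=%s" % tenant],
        endpoint_spec=EndpointSpec(ports={web_port: 8080}),
    )

    # Backend ETL product service.
    client.services.create(
        image="registry.example.com/etl:latest",             # hypothetical image
        name="%s-etl" % tenant,
        networks=[net],
        env=["TENANT=%s" % tenant],
    )

    # Spark standalone cluster: one master + replicated workers.
    client.services.create(
        image="registry.example.com/spark:2.1",              # hypothetical image
        name="%s-spark-master" % tenant,
        command=["/opt/spark/bin/spark-class", "org.apache.spark.deploy.master.Master"],
        networks=[net],
    )
    client.services.create(
        image="registry.example.com/spark:2.1",
        name="%s-spark-worker" % tenant,
        command=["/opt/spark/bin/spark-class", "org.apache.spark.deploy.worker.Worker",
                 "spark://%s-spark-master:7077" % tenant],   # resolved via swarm DNS
        networks=[net],
        mode=ServiceMode("replicated", replicas=3),
    )

    # Non-swarm steps: customer space in HDFS with default data ...
    hdfs_home = "/customers/%s" % tenant
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_home])
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", "default-data", hdfs_home])

    # ... and ZK configuration for all components.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()
    zk.ensure_path("/tenants/%s/config" % tenant)
    zk.set("/tenants/%s/config" % tenant, b'{"spark_workers": 3}')
    zk.stop()

if __name__ == "__main__":
    provision_customer("acme", 9001)
```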

Docker Swarm is a new technology, and we ran into lots of issues due to the Docker version + Linux OS/version that we were using. Our journey of debugging/fixing issues …

  • Started on CentOS 7.0 + Docker 1.23 => ran into a lot of Docker Swarm service connectivity and other weird issues
  • Upgraded CentOS + Docker to the latest versions => still had the issues
  • Upgraded the OS to Ubuntu 14.04 + Docker to 17.03 => still had service connectivity issues in Docker Swarm
  • Upgraded Docker to 17.05 => issues came down, but we still noticed a few connectivity issues. Posted the issue to the Docker team – https://github.com/moby/moby/issues/32830. Later we came to know that there was a race condition, which has since been fixed.
  • Upgraded the OS to the latest Ubuntu 16.04 with the latest kernel => yet to apply the 17.06 release to see whether the connectivity issues are completely gone. For now we check connectivity health using scripts (a sketch of such a check is below).
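As an illustration, here is a minimal sketch of the kind of connectivity health check this could be. It assumes the script runs on a swarm node with access to the Docker socket, and the checks and timeouts are illustrative rather than our production script; a fuller check would also exercise container-to-container traffic on the overlay networks, which is where the connectivity issues above showed up.

```python
import socket

import docker

def check_swarm_connectivity():
    client = docker.from_env()
    problems = []

    for service in client.services.list():
        name = service.name

        # 1. Every task of the service should be in the "running" state.
        states = [t["Status"]["State"] for t in service.tasks()]
        if any(s != "running" for s in states):
            problems.append("%s: task states %s" % (name, states))

        # 2. Each published port should answer through the ingress routing mesh.
        ports = service.attrs.get("Endpoint", {}).get("Ports", [])
        for p in ports:
            published = p.get("PublishedPort")
            if not published:
                continue
            try:
                # Running on a swarm node, so "localhost" reaches the routing mesh.
                with socket.create_connection(("localhost", published), timeout=3):
                    pass
            except OSError as exc:
                problems.append("%s: port %d unreachable (%s)" % (name, published, exc))

    return problems

if __name__ == "__main__":
    for p in check_swarm_connectivity():
        print("PROBLEM:", p)
```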

 
