Docker Swarm – Experience / Issues

Posted: June 12, 2017 in Docker, Cluster Manager, Uncategorized

Over the last few months I worked on a cloud-based cluster manager using Docker Swarm to set up new customer instances for my project, and the experience was wonderful: it gave me a deep understanding of how cloud-based cluster solutions work compared to the old-school application deployment/setup model. Our solution is mainly meant to replace another popular cluster management stack, DMM – Docker, Marathon & Mesos.

In DMM, Marathon-lb + Marathon + Mesos take care of connecting customer requests to their respective services running on the cluster. But that is a more involved effort, and Docker Swarm provides these capabilities out of the box with simple commands. Some of the technologies that we are using …

  • Mesos => Master / Slave configuration forms the cluster
  • Chronos / Marathon (REST API) => Resource Manager, helps to orchestrate containerized services/apps
  • Docker containers isolate resource consumption between tenants
  • Marathon-lb / HAProxy => binds to the service port of every app and forwards incoming requests to app instances
    • marathon-lb.py calls the Marathon API to retrieve all running services/apps, generates/updates the HAProxy config, and reloads the HAProxy service (a simplified sketch of this loop follows this list)
  • Spark => distributed compute software; can run in local, standalone, or Mesos/YARN mode
  • HDFS => global shared file system
  • ZK => distributed/centralized configuration management tool used by many distributed systems such as Kafka, Spark, etc. to achieve high availability through the leader/follower model that ZK provides
  • Nginx => high-performance HTTP server and reverse proxy
  • Docker Swarm => Swarm mode on the Docker Engine helps us manage the cluster natively.
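To make the marathon-lb piece concrete, here is a minimal sketch of that regenerate-and-reload loop in Python. The Marathon URL, the config fragment path, and the backend-only output are assumptions for illustration; the real marathon-lb script renders a full HAProxy configuration (global/defaults/frontend sections included) from templates.

```python
import subprocess

import requests

MARATHON_URL = "http://marathon.example.com:8080"   # assumed Marathon endpoint
BACKENDS_CFG = "/etc/haproxy/conf.d/backends.cfg"   # assumed config fragment path

def fetch_apps():
    # Marathon's REST API lists running apps; embed=apps.tasks includes their tasks.
    resp = requests.get(MARATHON_URL + "/v2/apps", params={"embed": "apps.tasks"})
    resp.raise_for_status()
    return resp.json()["apps"]

def render_backends(apps):
    # One HAProxy backend per app, one server line per running task instance.
    lines = []
    for app in apps:
        backend = app["id"].strip("/").replace("/", "_")
        lines.append("backend %s" % backend)
        for task in app.get("tasks", []):
            for port in task.get("ports", []):
                lines.append("  server %s %s:%d check" % (task["id"], task["host"], port))
    return "\n".join(lines) + "\n"

def reload_haproxy(config_text):
    with open(BACKENDS_CFG, "w") as f:
        f.write(config_text)
    # Graceful reload so existing connections are not dropped.
    subprocess.check_call(["systemctl", "reload", "haproxy"])

if __name__ == "__main__":
    reload_haproxy(render_backends(fetch_apps()))
```

Swarm mode removes the need for this loop entirely: the ingress routing mesh and the DNS-based service VIPs route incoming requests to service replicas for us.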

For building the new architecture I had to learn various commands related to Docker/Docker Swarm, HDFS, Spark & Linux (thanks to our great Chief Architect for his vision/inputs). We built a Python-based provisioning service to create a customer-specific instance, which involves setting up many swarm services (a stripped-down provisioning sketch follows the list below) …

  • Core product service
  • Backend ETL product service
  • Spark standalone cluster services, i.e. master service + worker services
  • Other non-Docker-Swarm-based configuration
    • Customer space in HDFS with default data
    • ZK configuration for all components.
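To give a flavour of the provisioning flow, here is a stripped-down sketch using the Docker SDK for Python, the hdfs CLI, and kazoo for the ZK part. All image names, network names, paths and the ZK layout are hypothetical placeholders rather than our actual product configuration.

```python
import subprocess

import docker
from docker.types import EndpointSpec, ServiceMode
from kazoo.client import KazooClient

def provision_customer(tenant, web_port):
    client = docker.from_env()

    # One overlay network per tenant keeps service-to-service traffic isolated.
    net = "%s-net" % tenant
    client.networks.create(net, driver="overlay")

    # Core product service, published on the ingress routing mesh.
    client.services.create(
        image="registry.example.com/core-product:latest",   # hypothetical image
        name="%s-core" % tenant,
        networks=[net],
        env=["TENANT=%s" % tenant],
        endpoint_spec=EndpointSpec(ports={web_port: 8080}),
    )

    # Backend ETL product service.
    client.services.create(
        image="registry.example.com/etl:latest",             # hypothetical image
        name="%s-etl" % tenant,
        networks=[net],
        env=["TENANT=%s" % tenant],
    )

    # Spark standalone cluster: one master + replicated workers.
    client.services.create(
        image="registry.example.com/spark:2.1",              # hypothetical image
        name="%s-spark-master" % tenant,
        command=["/opt/spark/bin/spark-class", "org.apache.spark.deploy.master.Master"],
        networks=[net],
    )
    client.services.create(
        image="registry.example.com/spark:2.1",
        name="%s-spark-worker" % tenant,
        command=["/opt/spark/bin/spark-class", "org.apache.spark.deploy.worker.Worker",
                 "spark://%s-spark-master:7077" % tenant],   # resolved via swarm DNS
        networks=[net],
        mode=ServiceMode("replicated", replicas=3),
    )

    # Non-swarm steps: customer space in HDFS with default data ...
    hdfs_home = "/customers/%s" % tenant
    subprocess.check_call(["hdfs", "dfs", "-mkdir", "-p", hdfs_home])
    subprocess.check_call(["hdfs", "dfs", "-put", "-f", "default-data", hdfs_home])

    # ... and ZK configuration for all components.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181")
    zk.start()
    zk.ensure_path("/tenants/%s/config" % tenant)
    zk.set("/tenants/%s/config" % tenant, b'{"spark_workers": 3}')
    zk.stop()

if __name__ == "__main__":
    provision_customer("acme", 9001)
```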

Docker Swarm is a new technology, and we ran into lots of issues due to the Docker version + Linux OS/version that we were using. Our journey of debugging/fixing issues …

  • Started on CentOS 7.0 + Docker 1.23 => ran into a lot of Docker Swarm service connectivity and other weird issues
  • Upgraded CentOS + Docker to the latest versions => still had the issues
  • Upgraded the OS to Ubuntu 14.04 + Docker to 17.03 => still had service connectivity issues in Docker Swarm
  • Upgraded Docker to 17.05 => issues came down, but we still noticed a few connectivity issues. Posted the issue to the Docker team – https://github.com/moby/moby/issues/32830. Later we came to know that there was a race condition, which has since been fixed.
  • Upgraded the OS to the latest Ubuntu 16.04 with the latest kernel => yet to apply the 17.06 release to see whether the connectivity issues are completely gone. For now we check connectivity health using scripts (a sketch of such a check is below).
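As an illustration, here is a minimal sketch of the kind of connectivity health check this could be. It assumes the script runs on a swarm node with access to the Docker socket, and the checks and timeouts are illustrative rather than our production script; a fuller check would also exercise container-to-container traffic on the overlay networks, which is where the connectivity issues above showed up.

```python
import socket

import docker

def check_swarm_connectivity():
    client = docker.from_env()
    problems = []

    for service in client.services.list():
        name = service.name

        # 1. Every task of the service should be in the "running" state.
        states = [t["Status"]["State"] for t in service.tasks()]
        if any(s != "running" for s in states):
            problems.append("%s: task states %s" % (name, states))

        # 2. Each published port should answer through the ingress routing mesh.
        ports = service.attrs.get("Endpoint", {}).get("Ports", [])
        for p in ports:
            published = p.get("PublishedPort")
            if not published:
                continue
            try:
                # Running on a swarm node, so "localhost" reaches the routing mesh.
                with socket.create_connection(("localhost", published), timeout=3):
                    pass
            except OSError as exc:
                problems.append("%s: port %d unreachable (%s)" % (name, published, exc))

    return problems

if __name__ == "__main__":
    for p in check_swarm_connectivity():
        print("PROBLEM:", p)
```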

 
