tusacentral.net working for a better world... before the end!Setup ProxySQL as High Available (and not a SPOF)

tusacentral.net working for a better world... before the end!Setup ProxySQL as High Available (and not a SPOF)

During the last few months we had a lot of opportunities to present and discuss about a very powerful tool that will become more and more used in the architectures supporting MySQL, ProxySQL .

ProxySQL is becoming every day more flexible, solid, performant and used ( http://www.proxysql.com/ and recent http://www.proxysql.com/compare ).

This is it, the tool is a winner in comparing it with similar ones, and we all need to have a clear(er) idea on how integrate it in our architectures in order to achieve the best results.

The first to keep in mind is that ProxySQL is not natively supporting any high availability solution, in short we can setup a cluster of MySQL(s) and achieve 4 or even 5 nines of HA, but if we include ProxySQL, as it is, and as single block, our HA will include a single point of failure (SPOF) that will drag us down in case of crash.

To solve this, the most common solution so far had be to setup ProxySQL as part of a tile architecture, where Application/ProxySQL where deploy together.

This is a good solution for some cases, and for sure it reduce the network hops, but what it may be less than practical when our architecture has a very high number of tiles. Say 100 or 400 application servers, not so unusual nowadays. In that case managing the ProxySQL will be challenging, but most problematic it will be the fact that ProxySQL must perform several checks on the destination servers (MySQL), and if we have 400 instance of ProxySQL we will end up keeping our databases busy just because the checks.

In short ... is not a smart move.

Another possible approach, used so far was to have two layers of ProxySQL, one close to the application, another in the middle to finally connect to the database.

I personally don't like this approach for many reasons, but the most relevants are that this approach create additional complexity in the management of the platform, and it adds network hops.

So what can be done?

I like to have things simple, I love the KISS principle, I love to have things simple and because I am lazy I love to reuse the wheel instead re-invent things that someone else had already invent.

Last thing I like to have my customers not depending from me or any other colleague, once I am done, and gone, they must be able to manage their things, understand their things, fix their things by themselves.

Anyhow as said I like simple things. So my point here is the following:

What I can use for the remaining cases?

The answer comes with existing solutions and combining existing blocks, KeepAlived + ProxySQl + MySQL .

Short description "Keepalived is a routing software written in C. The main goal of this project is to provide simple and robust facilities for loadbalancing and high-availability to Linux system and Linux based infrastructures. Loadbalancing framework relies on well-known and widely used Linux Virtual Server (IPVS) kernel module providing Layer4 loadbalancing. Keepalived implements a set of checkers to dynamically and adaptively maintain and manage loadbalanced server pool according their health. On the other hand high-availability is achieved by VRRP protocol. VRRP is a fundamental brick for router failover. In addition, Keepalived implements a set of hooks to the VRRP finite state machine providing low-level and high-speed protocol interactions. Keepalived frameworks can be used independently or all together to provide resilient infrastructures."

Bingo! this is exactly what we need for our ProxySQL setup.

Below I will show how to setup:

Just remind that what we want to achieve is to prevent ProxySQL to become a SPOF, that's it.

While achieving that we need to reduce as much as possible the network hops and keep the solution SIMPLE.

Another important concept to keep in mind is that ProxySQL (re)start take place in less then a second.

This means that if it crash and it can be restarted by the angel process, having it doing so and recovery the service is much more efficient than to have any kind of failover mechanism to take place.

As such whenever you plan your solution keep in mind the ~1 second time of ProxySQL restart as base line.


Let's go.


Choose 3 machines that will host the combination of Keepalive and ProxySQL.

In the following example I will use 3 machines for ProxySQL and Keepalived and 3 hosting PXC, but you can have the Keepalived+ProxySQL whenever you like also on the same PXC box.

For the following examples we will have:

To check I will use this table, please create it in your MySQL server:

And this bash TEST command to use later

Once you have your ProxySQL up (run the same on all ProxySQL nodes, it is much simpler), connect to the Admin interface and:

Create a my.cnf file in your default dir with

Simple Setup using a single VIP 3 ProxySQL 3 Galera nodes

First setup the keepalive configuration file (/etc/keepalived/keepalived.conf):

Given the above and given I want to have test1 as main priority will be set as:

Modify the config in each node following the above values and (re)start keepalived.

If all is set correctly in the system log of the TEST1 machine you will see:

While in the other two:

Which means node is there as ... :D Backup.

Now is time to test our connection to our ProxySQL pool.

From an application node or just from your laptop.

Open 3 terminals and in each one:

You will see that unless you are already sending queries to proxies, you have the Proxies just doing nothing. Time to start the test bash as I indicate above. If everything is working correctly you will see the bash command reporting this:

The other 3 running bash commands will show that ONLY the ProxySQL in TEST1 is currently getting/serving requests, because is the one with the VIP.


So nothing special right, all as expected.

Time to see if the failover-failback works along the chain.

Let us kill the ProxySQL on TEST1 while the test bash command is running.

Here is what you will get:

the source had change but not the PXC node.

If you check the system log for TEST1:

While on TEST2

Simple ... and elegant. No need to re-invent the wheel and works smooth.

The total time for the recovery given the ProxySQL crash had be of 5.06 seconds ,

considering the wider window ( last application start, last recovery in PXC 2017-01-10 18:19:06.188233|2017-01-10 18:19:11.250927)

As such the worse scenario, keeping in mind we run the check for the ProxySQL every 2 seconds (real recover max window 5-2=3 sec ).

OK what about fail-back?

Let us restart the proxysql service:

Here the output:

Worse recovery time = 4.04 seconds of which 2 of delay because the check interval.

Of course the test is running every second and is running one single operation, as such the impact is minimal (no error in fail-back), and recovery longer.

But I think I have made clear the concept here.

Let see another thing... is the failover working as expected? Test1 -> 2 -> 3 ??

Let us kill 1 - 2 and see:

This image is where you should be at the end:

In this case given I have done one kill immediately after the other, Keepalived had take a bit more in failing over, but still it did correctly and following the planned chain.

Fail-back as smooth as usual:

Let us see now another case.

The case above is nice and simple, but as a cavet.

I can access only one ProxySQL a time, which may be good or not.

In any case it may be nice to have the possibility to choose, and with Keepalived you can.

We can actually set an X number of VIP and associate them to each test box.

The result will be that each server hosting ProxySQL will also host a VIP, and will be eventually able to fail-over to any of the other two servers.

Failing-over/Back will be fully managed by Keepalived, checking as we did before if ProxySQL is running. Example of configuration for one node can be the one below:

The tricky part in this case is to play with the PRIORITY for each VIP and each server such that you will NOT assign the same ip twice.

The whole set of configs can be found here

Performing the check with the test bash as above we have:

Final state of IPs on Test3:

And this is the image:

Recovery times:


In this example I had just use a test that checks the process, but a check can be anything reporting 0|1, the limit is define only from what you need.

The times for the failover can be significant shorter, reducing the check time and considering only the time taken to move the VIP, I had prefer to show the worse case scenario considering an application with a second interval, but that is a pessimistic view of what normally happens with real traffic.

I was looking for a simple, simple simple way to add HA to ProxySQL, something that can be easily integrate with automation and that is actually also well established and maintained.

In my opinion using Keepalived is a good solution because it match all the above expectations.

Implementing a set of ProxySQL and have Keepalived manage the failover between them is pretty easy, but you can expand the usage (and the complexity) if you need, counting on tools that are already part of the Linux stack, no need to re-invent the wheel with crazy mechanism.

If you want to have fun doing crazy things... at least start from something that helps you to go beyond the basiscs.

For instance I was also playing a bit with keepalived and virtual server, creating set of redundant Proxysql with load balancers and ... .. but this is another story (blog).