Asserting for DB failovers #35
Comments
If what you're looking for is a way to simulate MySQL disconnects to test the retries, we do that with a background thread that kills connections:

```ruby
def with_pt_kill(timeout, query_match)
  pool = connection.pool
  c = pool.checkout # separate connection used only to find and kill the target query
  t = Thread.new do
    sleep timeout
    processlist = c.execute("SHOW PROCESSLIST").to_a
    # Row layout: Id, User, Host, db, Command, Time, State, Info
    process = processlist.find do |(_id, _user, _host, _db, _command, _time, _state, info)|
      info =~ query_match
    end
    raise "target query not found" if process.nil? # find returns nil when nothing matches
    c.execute("KILL #{process[0]}")
  end
  yield
  t.join
ensure
  ActiveRecord::Base.clear_all_connections!
end

with_pt_kill(0.5, /^SELECT SLEEP/) do
  connection.execute("SELECT SLEEP(1)") # the query will be killed but then retried by the patch
end
```

No toxiproxy required! |
that's some beautiful rubby lol |
What even is MySQL's behaviour on these kinds of failovers? Does it just close all connections? How are you expecting your application to react to the failover? How are you expecting to tell your application that a failover has occurred? Something about this feels fishy. I feel like we're trying to test this at the wrong layer. |
This is interesting! I hadn't considered not using Toxiproxy for this, as I was looking to assert that when MySQL was (properly) offline, and not just when ActiveRecord was unable to connect, the queries wouldn't end up on the floor. We've taken a similar approach to Dalibor's post, whereby the connection keeps reconnecting and the query is then retried (our logic contains some different measures but is still similar).
The crux of the issue that we are working around is within AWS RDS, where the cluster gets a CNAME. ActiveRecord doesn't re-resolve the CNAME when it gets a failure, so even if a promotion from reader => writer occurs, ActiveRecord doesn't get the new values and attempts to write to a non-writable host. We're transitioning away from using a DNS CNAME (as I've discussed with @sirupsen in some detail), but we also need to address the ActiveRecord issue here and ensure that even when a host disappears or goes read-only, we get the correct values. To answer your question about the failure modes, there are a couple of scenarios that we want to cover at the moment:
Both of these scenarios have been tested manually with our patch and seem to work for what we need to cover for failovers.
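As a rough illustration of the CNAME problem described above, one possible workaround is to look the cluster endpoint up again on every reconnect attempt rather than relying on an address resolved once. This is only a sketch: `current_db_host` and the `resolver:` parameter are hypothetical names, the endpoint in the usage comment is made up, and whether it helps in practice depends on how the driver and the OS cache lookups.

```ruby
require "resolv"

# Hypothetical helper (not part of ActiveRecord or our patch): resolve the
# endpoint on every call so a reconnect after a failover uses whatever the
# name points at now. The resolver is injectable so it can be stubbed in tests.
def current_db_host(endpoint, resolver: Resolv.method(:getaddress))
  resolver.call(endpoint)
end

# Possible usage before reconnecting (endpoint name is an assumption):
#   host = current_db_host("my-cluster.example.internal")
```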
Nothing too special at the moment; just that the client shouldn't drop the queries and should retry sending them once the exception is cleared.
Outside of the exception clearing, we're not really. The idea is that the application just knows it wasn't able to complete the query and it should retry at the defined interval. We don't really want the MySQL parts of the application to know the difference between things like a failover or broken connection to a host which self heals.
This very well could be the case 😄 I'm open to other thoughts on how this could be better tested if anyone has them. |
It's taken us years to get a similar patch half-right. I gisted it for you |
Half right? 😂 Can you elaborate on what is still half wrong with the patch or where it could do with improvements? More for my own understanding than anything. BTW, thanks for including the tests. That helps me understand a bunch more of the purpose of some of the included code. |
Well, only in the past few weeks we found this bug: brianmario/mysql2#1022 So you need this custom mysql2 version. Mind you, we've still run it in production for years. These are some gnarly edge-cases. |
Closing outdated stale issues that haven't been touched in years. |
I'm using Toxiproxy for our external services and we're now getting ready to do a bunch of DB failover work. To better handle our failovers without dropping queries, we've patched ActiveRecord to catch any MySQL errors, perform a reconnect and then try the query again. I can manually confirm this works by kicking off this script and either toggling the availability of the toxiproxy or DB server manually during the execution.
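To make the catch, reconnect and retry idea concrete, here is a minimal sketch of the pattern. It is not the actual patch: `with_reconnect_retry` and its `errors:`, `attempts:` and `interval:` parameters are hypothetical, and `conn` is assumed to be any object that responds to `reconnect!` (as ActiveRecord connection adapters do).

```ruby
# Hypothetical helper, not the patch discussed in this issue: run the block,
# and if one of the given error classes is raised, wait, reconnect and retry.
def with_reconnect_retry(conn, errors:, attempts: 3, interval: 0.1)
  tries = 0
  begin
    yield conn
  rescue *errors
    tries += 1
    raise if tries >= attempts # give up and surface the last error
    sleep interval             # give the server time to come back
    begin
      conn.reconnect!
    rescue *errors
      # still down; the next attempt will fail and be retried again
    end
    retry                      # run the block again from the top
  end
end
```

With ActiveRecord this might be called as `with_reconnect_retry(connection, errors: [ActiveRecord::StatementInvalid]) { |c| c.execute("SELECT 1") }`, though the right error classes to rescue depend on the adapter and Rails version.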
However, I'm getting a little stuck when it comes to using Toxiproxy to emulate the failover completing. I first tried:
It seems our patch works a little too well, because it sits here waiting for the MySQL server to come back, but it never does as the `yield` is still running. I then tried to split the `enable`/`disable` but still had the same results with the following:
Which leads me to the following questions:
`down` (and later `disable`) which would only disable the proxy for a period of time. Is applying a non-blocking timeout to that functionality something you'd consider useful for the library?
Thanks!
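For reference, the non-blocking timeout asked about above could look roughly like the sketch below. `down_for` is a hypothetical helper, not something the Toxiproxy gem provides; it only assumes the proxy object responds to `disable` and `enable`.

```ruby
# Hypothetical helper, not part of the Toxiproxy gem: disable the proxy
# immediately and re-enable it from a background thread after `seconds`,
# so the caller can keep issuing queries while the proxy is down.
def down_for(proxy, seconds)
  proxy.disable
  Thread.new do
    sleep seconds
    proxy.enable
  end
end

# Possible usage (the proxy name :mysql is an assumption):
#   t = down_for(Toxiproxy[:mysql], 2)
#   connection.execute("SELECT 1") # retried by the patch until the proxy returns
#   t.join
```

Returning the thread lets the test join it at the end, so the proxy is not left disabled for later tests.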