
Asserting for DB failovers #35

Closed
jacobbednarz opened this issue Mar 4, 2019 · 10 comments
Comments

@jacobbednarz

I'm using Toxiproxy for our external services and we're now getting ready to do a bunch of DB failover work. To handle failovers without dropping queries, we've patched ActiveRecord to catch any MySQL errors, reconnect, and then retry the query. I can confirm this works by kicking off the script below and manually toggling the availability of the Toxiproxy proxy or the DB server during execution.

ATTEMPT_COUNT = 300
puts "==> Truncating the users test_db table"
ActiveRecord::Base.connection.execute("truncate test_db.user")

puts "==> Starting to send MySQL queries"
require 'securerandom'
ATTEMPT_COUNT.times do
  sleep 0.1
  hash = SecureRandom.uuid
  begin
    puts "    [#{Time.now.strftime("%T.%L")}] Inserting #{hash}"
    ActiveRecord::Base.connection.execute("INSERT INTO user (first_name, last_name) VALUES ('test', '#{hash}')")
    puts "    Success!"
  rescue Exception => e
    puts "    [#{Time.now.strftime("%T.%L")}] #{e}"
  end
end

row_count = ActiveRecord::Base.connection.execute("select * from user").count
puts
puts "Attempted writes: #{ATTEMPT_COUNT}"
puts "DB row count:     #{row_count}"
puts "Variance:         #{ATTEMPT_COUNT - row_count}"

However, I'm getting a little stuck when it comes to using Toxiproxy to emulate the failover completing. I first tried:

Toxiproxy[:mysql_master].down do
  User.first
end

It seems our patch works a little too well: the block sits there waiting for the MySQL server to come back, but the proxy never comes back up because down only re-enables it once the yield returns. I then tried splitting the enable/disable but had the same result with the following:

Toxiproxy[:mysql_master].disable
User.first
Toxiproxy[:mysql_master].enable

Which leads me to a couple of questions:

  • Could you share how you're using Toxiproxy with things like DB failovers? Is this something you're able to test in a similar way to what I'm attempting, or do you handle it on a per-model basis? I essentially need the proxy to be down for only a short period of time and then come back up once that time has passed.
  • The only way I can think of to make this work would be to pass another argument to down (and later disable) that only disables the proxy for a given period of time. Is a non-blocking timeout like that something you'd consider useful for the library? (Rough sketch of what I mean below.)
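
Something like this, where down_for is a made-up helper name rather than part of the current API:

def down_for(proxy, timeout)
  proxy.disable
  restorer = Thread.new do
    sleep timeout
    proxy.enable # bring the proxy back while the block is still running
  end
  yield
ensure
  restorer.join if restorer
end

down_for(Toxiproxy[:mysql_master], 5) do
  User.first # hangs/retries while the proxy is down, then succeeds once it comes back
end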

Thanks!

@jacobbednarz
Author

cc @sirupsen @kirs @jpittis as IIRC I've heard you all discuss this type of thing at one point or another, and I'd be interested in your thoughts and insights here 😬

@kirs

kirs commented Mar 6, 2019

If what you're looking for is a way to simulate MySQL disconnects to test the retries, we do that with a background thread that kills connections:

def with_pt_kill(timeout, query_match)
  pool = connection.pool
  c = pool.checkout # dedicated connection used only to find and kill the target query

  t = Thread.new do
    sleep timeout
    processlist = c.execute("SHOW PROCESSLIST").to_a
    process = processlist.find do |(_id, _user, _host, _db, _command, _time, _state, info)|
      info =~ query_match
    end
    raise "target query not found" if process.nil?
    c.execute("KILL #{process[0]}") # kill by thread id to simulate a dropped connection
  end

  yield

  t.join
ensure
  ActiveRecord::Base.clear_all_connections!
end

with_pt_kill(0.5, /^SELECT SLEEP/) do
  connection.execute("SELECT SLEEP(1)") # the query will be killed but then retried by the patch
end

No toxiproxy required!
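
For completeness, a variant of the same idea is to record the connection's thread id up front instead of matching on the processlist (rough sketch, assuming rows come back as arrays like above and the pool has more than one connection):

def with_connection_kill(timeout)
  victim_id = connection.execute("SELECT CONNECTION_ID()").to_a.first.first # id of the connection the block will use
  killer = connection.pool.checkout # separate connection used only to issue the KILL

  t = Thread.new do
    sleep timeout
    killer.execute("KILL #{victim_id}")
  end

  yield

  t.join
ensure
  ActiveRecord::Base.clear_all_connections!
end

with_connection_kill(0.5) do
  connection.execute("SELECT SLEEP(1)") # killed mid-flight, then retried by the patch
end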

@sirupsen
Contributor

sirupsen commented Mar 6, 2019

that's some beautiful rubby lol

@jpittis
Contributor

jpittis commented Mar 6, 2019

What even is MySQL's behaviour on these kinds of failovers? Does it just close all connections?

How are you expecting your application to react to the failover?

How are you expecting to tell your application that a failover has occurred?

Something about this feels fishy. I feel like we're trying to test this at the wrong layer.

@jacobbednarz
Author

If what you're looking for is a way to simulate MySQL disconnects to test the retries, we do that with a background thread that kills connections:
...
No toxiproxy required!

This is interesting! I hadn't considered not using Toxiproxy for this, as I was looking to assert that when MySQL was properly offline (not just ActiveRecord being unable to connect), the queries wouldn't end up on the floor. We've taken a similar approach to Dalibor's post, whereby the connection keeps reconnecting and the query is then retried (our logic contains some different measures but is still similar).

What even is MySQL's behaviour on these kinds of failovers? Does it just close all connections?

The crux of the issue we're working around is within AWS RDS, where the cluster gets a CNAME. ActiveRecord doesn't re-resolve the CNAME when it gets a failure, so even if a promotion from reader => writer occurs, ActiveRecord doesn't pick up the new host and attempts to write to a non-writable one. We're transitioning away from using a DNS CNAME (as I've discussed with @sirupsen in some detail), but we also need to address the ActiveRecord issue here and ensure that even when a host disappears or goes read only, we end up connected to the correct host.
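
For context, the patch is roughly shaped like the following, a deliberately simplified sketch rather than the real thing (the error classes, attempt count and delay are illustrative):

module RetryOnFailover
  MAX_ATTEMPTS = 5 # illustrative
  RETRY_DELAY  = 1 # seconds, illustrative

  def execute(*args)
    attempts = 0
    begin
      super
    rescue ActiveRecord::StatementInvalid, Mysql2::Error => e
      attempts += 1
      raise if attempts >= MAX_ATTEMPTS
      warn "[failover retry] #{e.class}: #{e.message}"
      sleep RETRY_DELAY
      reconnect! # opens a fresh connection, which also resolves the CNAME again
      retry
    end
  end
end

# ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(RetryOnFailover)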

To answer your question about the failure modes, there are a couple of scenarios that we want to cover at the moment:

  • MySQL instance goes read only. The host still accepts connections but responds with an exception if you try to write to it.
  • MySQL instance goes away. The host doesn't accept connections, or the client times out.

Both of these scenarios have been tested manually with our patch and seem to cover what we need for failovers.
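
For reference, each mode can be reproduced by hand with something along these lines (the credentials, host and timings are placeholders, and SET GLOBAL read_only needs a privileged account):

# 1. Host goes read only: flip the server's read_only flag around the writes;
#    inserts in between fail with a "running with the --read-only option" error.
admin = Mysql2::Client.new(host: "127.0.0.1", username: "root") # placeholder credentials
admin.query("SET GLOBAL read_only = ON")
# ... run the insert loop and watch the patch retry ...
admin.query("SET GLOBAL read_only = OFF")

# 2. Host goes away: drop the proxy while the insert loop runs elsewhere,
#    then bring it back after a few seconds.
Toxiproxy[:mysql_master].disable
sleep 5
Toxiproxy[:mysql_master].enable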

How are you expecting your application to react to the failover?

Nothing too special at the moment; just that the client shouldn't drop the queries and should retry sending them once the exception clears.

How are you expecting to tell your application that a failover has occurred?

Outside of the exception clearing, we're not really. The idea is that the application just knows it wasn't able to complete the query and should retry at the defined interval. We don't really want the MySQL parts of the application to know the difference between things like a failover and a broken connection to a host that self-heals.

I feel like we're trying to test this at the wrong layer.

This very well could be the case 😄 I'm open to other thoughts on how this could be better tested if anyone has them.

@jacobbednarz
Author

Thanks for the pointer @kirs! It got me on the right track and Toxiproxy wasn't needed here.

@jpittis Happy to continue discussing this if you think others will benefit from it too. Or feel free to ignore if you've got more important things to look at 😁

@sirupsen
Contributor

sirupsen commented Mar 7, 2019

reconnecting and the query is then retried (our logic contains some different measures but is still similar).

It's taken us years to get a similar patch half-right. I gisted it for you.

@jacobbednarz
Author

It's taken us years to get a similar patch half-right.

Half right? 😂 Can you elaborate on what is still half wrong with the patch, or where it could do with improvements? More for my own understanding than anything. BTW, thanks for including the tests; that helps me understand a lot more about the purpose of some of the included code.

@sirupsen
Contributor

sirupsen commented Mar 7, 2019

Well, only in the past few weeks we found this bug: brianmario/mysql2#1022, so you need this custom mysql2 version. Mind you, we've still run it in production for years. These are some gnarly edge cases.

@miry added the Toxiproxy label Jan 4, 2023
@abecevello
Member

Closing outdated stale issues that haven't been touched in years.

@abecevello closed this as not planned Apr 5, 2023