
Asserting for DB failovers #35

Closed
jacobbednarz opened this issue Mar 4, 2019 · 10 comments
Comments

@jacobbednarz

I'm using Toxiproxy for our external services and we're now getting ready to do a bunch of DB failover work. To handle failovers without dropping queries, we've patched ActiveRecord to catch any MySQL errors, reconnect, and then retry the query. I can confirm this works by kicking off the script below and manually toggling the availability of the Toxiproxy proxy or the DB server during execution.

ATTEMPT_COUNT = 300
puts "==> Truncating the users test_db table"
ActiveRecord::Base.connection.execute("truncate test_db.user")

puts "==> Starting to send MySQL queries"
require 'securerandom'
ATTEMPT_COUNT.times do
  sleep 0.1
  hash = SecureRandom.uuid
  begin
    puts "    [#{Time.now.strftime("%T.%L")}] Inserting #{hash}"
    ActiveRecord::Base.connection.execute("INSERT INTO user (first_name, last_name) VALUES ('test', '#{hash}')")
    puts "    Success!"
  rescue Exception => e
    puts "    [#{Time.now.strftime("%T.%L")}] #{e}"
  end
end

row_count = ActiveRecord::Base.connection.execute("select * from user").count
puts
puts "Attempted writes: #{ATTEMPT_COUNT}"
puts "DB row count:     #{row_count}"
puts "Variance:         #{ATTEMPT_COUNT - row_count}"

However, I'm getting a little stuck when it comes to using Toxiproxy to emulate the failover completing. I first tried:

Toxiproxy[:mysql_master].down do
  User.first
end

It seems our patch works a little too well: the block sits there waiting for the MySQL server to come back, but the proxy never comes back up because down only re-enables it once the yield returns. I then tried splitting the enable/disable but had the same result with the following:

Toxiproxy[:mysql_master].disable
User.first
Toxiproxy[:mysql_master].enable

Which leads me to a couple of questions:

  • Could you share how you're using Toxiproxy with things like DB failovers? Is this something you're able to test in a similar way to what I'm attempting, or do you handle it on a per-model basis? I essentially need the proxy to be down for only a short period of time and then come back up once that time has passed.
  • The only way I can think of to make this work would be to pass another argument to down (and later disable) that only disables the proxy for a given period of time. Is a non-blocking timeout like that something you'd consider useful for the library? (Rough sketch of what I mean below.)
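
Something like this, where down_for is a made-up helper name rather than part of the current API:

def down_for(proxy, timeout)
  proxy.disable
  restorer = Thread.new do
    sleep timeout
    proxy.enable # bring the proxy back while the block is still running
  end
  yield
ensure
  restorer.join if restorer
end

down_for(Toxiproxy[:mysql_master], 5) do
  User.first # hangs/retries while the proxy is down, then succeeds once it comes back
end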

Thanks!

@jacobbednarz
Author

cc @sirupsen @kirs @jpittis as IIRC I've heard you all discuss this type of thing at one point or another, and I'd be interested in your thoughts and insights here 😬

@kirs

kirs commented Mar 6, 2019

If what you're looking for is a way to simulate MySQL disconnects to test the retries, we do that with a background thread that kills connections:

def with_pt_kill(timeout, query_match)
  pool = connection.pool
  c = pool.checkout # dedicated connection used only to find and kill the target query

  t = Thread.new do
    sleep timeout
    processlist = c.execute("SHOW PROCESSLIST").to_a
    process = processlist.find do |(_id, _user, _host, _db, _command, _time, _state, info)|
      info =~ query_match
    end
    raise "target query not found" if process.nil?
    c.execute("KILL #{process[0]}") # kill by thread id to simulate a dropped connection
  end

  yield

  t.join
ensure
  ActiveRecord::Base.clear_all_connections!
end

with_pt_kill(0.5, /^SELECT SLEEP/) do
  connection.execute("SELECT SLEEP(1)") # the query will be killed but then retried by the patch
end

No toxiproxy required!
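
For completeness, a variant of the same idea is to record the connection's thread id up front instead of matching on the processlist (rough sketch, assuming rows come back as arrays like above and the pool has more than one connection):

def with_connection_kill(timeout)
  victim_id = connection.execute("SELECT CONNECTION_ID()").to_a.first.first # id of the connection the block will use
  killer = connection.pool.checkout # separate connection used only to issue the KILL

  t = Thread.new do
    sleep timeout
    killer.execute("KILL #{victim_id}")
  end

  yield

  t.join
ensure
  ActiveRecord::Base.clear_all_connections!
end

with_connection_kill(0.5) do
  connection.execute("SELECT SLEEP(1)") # killed mid-flight, then retried by the patch
end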

@sirupsen
Contributor

sirupsen commented Mar 6, 2019

that's some beautiful rubby lol

@jpittis
Contributor

jpittis commented Mar 6, 2019

What even is MySQL's behaviour on these kinds of failovers? Does it just close all connections?

How are you expecting your application to react to the failover?

How are you expecting to tell your application that a failover has occurred?

Something about this feels fishy. I feel like we're trying to test this at the wrong layer.

@jacobbednarz
Author

If what you're looking for is a way to simulate MySQL disconnects to test the retries, we do that with a background thread that kills connections:
...
No toxiproxy required!

This is interesting! I hadn't considered not using Toxiproxy for this, as I was looking to assert that when MySQL was properly offline (not just ActiveRecord being unable to connect), the queries wouldn't end up on the floor. We've taken a similar approach to Dalibor's post, whereby the connection keeps reconnecting and the query is then retried (our logic contains some different measures but is still similar).

What even is MySQL's behaviour on these kinds of failovers? Does it just close all connections?

The crux of the issue we're working around is within AWS RDS, where the cluster gets a CNAME. ActiveRecord doesn't re-resolve the CNAME when it gets a failure, so even if a promotion from reader => writer occurs, ActiveRecord doesn't pick up the new host and attempts to write to a non-writable one. We're transitioning away from using a DNS CNAME (as I've discussed with @sirupsen in some detail), but we also need to address the ActiveRecord issue here and ensure that even when a host disappears or goes read only, we end up connected to the correct host.
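
For context, the patch is roughly shaped like the following, a deliberately simplified sketch rather than the real thing (the error classes, attempt count and delay are illustrative):

module RetryOnFailover
  MAX_ATTEMPTS = 5 # illustrative
  RETRY_DELAY  = 1 # seconds, illustrative

  def execute(*args)
    attempts = 0
    begin
      super
    rescue ActiveRecord::StatementInvalid, Mysql2::Error => e
      attempts += 1
      raise if attempts >= MAX_ATTEMPTS
      warn "[failover retry] #{e.class}: #{e.message}"
      sleep RETRY_DELAY
      reconnect! # opens a fresh connection, which also resolves the CNAME again
      retry
    end
  end
end

# ActiveRecord::ConnectionAdapters::Mysql2Adapter.prepend(RetryOnFailover)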

To answer your question about the failure modes, there are a couple of scenarios that we want to cover at the moment:

  • MySQL instance goes read only. The host still accepts connections but responds with an exception if you try to write to it.
  • MySQL instance goes away. The host doesn't accept connections, or the client times out.

Both of these scenarios have been tested manually with our patch and seem to cover what we need for failovers.
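
For reference, each mode can be reproduced by hand with something along these lines (the credentials, host and timings are placeholders, and SET GLOBAL read_only needs a privileged account):

# 1. Host goes read only: flip the server's read_only flag around the writes;
#    inserts in between fail with a "running with the --read-only option" error.
admin = Mysql2::Client.new(host: "127.0.0.1", username: "root") # placeholder credentials
admin.query("SET GLOBAL read_only = ON")
# ... run the insert loop and watch the patch retry ...
admin.query("SET GLOBAL read_only = OFF")

# 2. Host goes away: drop the proxy while the insert loop runs elsewhere,
#    then bring it back after a few seconds.
Toxiproxy[:mysql_master].disable
sleep 5
Toxiproxy[:mysql_master].enable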

How are you expecting your application to react to the failover?

Nothing too special at the moment; just that the client shouldn't drop the queries and should retry sending them once the exception clears.

How are you expecting to tell your application that a failover has occurred?

Outside of the exception clearing, we're not really. The idea is that the application just knows it wasn't able to complete the query and should retry at the defined interval. We don't really want the MySQL parts of the application to know the difference between things like a failover and a broken connection to a host that self-heals.

I feel like we're trying to test this at the wrong layer.

This very well could be the case 😄 I'm open to other thoughts on how this could be better tested if anyone has them.

@jacobbednarz
Author

Thanks for the pointer @kirs! It got me on the right track and Toxiproxy wasn't needed here.

@jpittis Happy to continue discussing this if you think others will benefit from it too. Or feel free to ignore if you've got more important things to look at 😁

@sirupsen
Contributor

sirupsen commented Mar 7, 2019

reconnecting and the query is then retried (our logic contains some different measures but is still similar).

It's taken us years to get a similar patch half-right. I gisted it for you.

@jacobbednarz
Author

It's taken us years to get a similar patch half-right.

Half right? 😂 Can you elaborate on what is still half wrong with the patch, or where it could do with improvements? More for my own understanding than anything. BTW, thanks for including the tests; that helps me understand a lot more about the purpose of some of the included code.

@sirupsen
Contributor

sirupsen commented Mar 7, 2019

Well, only in the past few weeks we found this bug: brianmario/mysql2#1022, so you need this custom mysql2 version. Mind you, we've still run it in production for years. These are some gnarly edge cases.

@miry added the Toxiproxy label Jan 4, 2023
@abecevello
Member

Closing outdated stale issues that haven't been touched in years.

@abecevello closed this as not planned Apr 5, 2023