Fixes for RDMA #65

Closed

@mtisza mtisza commented Aug 8, 2023

I've spent some time debugging issues with RDMA. Without these patches RDMA did not work at all (crashes, hangs, random ping timeouts, ...). After these patches it works quite well.

This does resolve #58, as well as other issues.

Miki Grof-Tisza added 10 commits August 8, 2023 21:31
…_irq methods, to prevent interrupt settings getting changed.

One example of this is an interrupt inadvertently getting enabled inside an IRQ handler:
[   84.546271] ------------[ cut here ]------------
[   84.546290] WARNING: CPU: 0 PID: 0 at kernel/softirq.c:362 __local_bh_enable_ip+0x3a/0x60
[   84.546313] Modules linked in: drbd_transport_rdma(O) drbd(O) ip6table_nat iptable_nat nf_nat bpfilter nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c rdma_ucm rdma_cm ib_cm iw_cm ib_umad ib_ipoib mlx4_ib kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd nvme nvme_core mlx4_core ib_uverbs ib_core dummy bonding [last unloaded: drbd]
[   84.546374] CPU: 0 PID: 0 Comm: swapper/0 Tainted: G        W  O      5.15.75 LINBIT#2
[   84.546386] Hardware name: Insyde Grantley/Analytic Blade Board, BIOS 05.04.21.0038.00.011 05/09/2018
[   84.546394] RIP: 0010:__local_bh_enable_ip+0x3a/0x60
[   84.546406] Code: a9 00 00 0f 00 75 23 83 ee 01 f7 de 65 01 35 5d 6a f9 7e 65 8b 05 56 6a f9 7e a9 00 ff ff 00 74 0d 5d 65 ff 0d 47 6a f9 7e c3 <0f> 0b eb d9 65 66 8b 05 fa 5a fa 7e 66 85 c0 74 e6 e8 20 ff ff ff
[   84.546419] RSP: 0018:ffff88fe7f805c80 EFLAGS: 00010206
[   84.546429] RAX: 0000000080010200 RBX: ffff888104de3000 RCX: 0000000000000001
[   84.546437] RDX: ffff888104de3238 RSI: 0000000000000200 RDI: ffffffffa022c3f0
[   84.546445] RBP: ffff88fe7f805c80 R08: 000000000000000e R09: 0000000000000535
[   84.546452] R10: 0000000000000001 R11: ffff8881082a0450 R12: 0000000000000000
[   84.546460] R13: 0000000000000a20 R14: 0000000000000021 R15: ffff888104de323c
[   84.546468] FS:  0000000000000000(0000) GS:ffff88fe7f800000(0000) knlGS:0000000000000000
[   84.546477] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   84.546486] CR2: 00005586e40e7688 CR3: 000000000220a006 CR4: 00000000003706f0
[   84.546494] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   84.546502] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   84.546509] Call Trace:
[   84.546517]  <IRQ>
[   84.546525]  _raw_spin_unlock_bh+0x1a/0x20
[   84.546555]  dtr_send_flow_control_msg+0xb0/0x410 [drbd_transport_rdma]
[   84.546566]  ? mlx4_ib_post_recv+0x10/0x20 [mlx4_ib]
[   84.546582]  dtr_recycle_rx_desc.constprop.0+0xb7/0xc0 [drbd_transport_rdma]
[   84.546591]  dtr_control_data_ready+0xbe/0xe0 [drbd_transport_rdma]
[   84.546599]  dtr_rx_cq_event_handler+0x413/0x5e0 [drbd_transport_rdma]
[   84.546607]  mlx4_ib_cq_comp+0x20/0x30 [mlx4_ib]
[   84.546618]  mlx4_cq_completion+0x42/0x60 [mlx4_core]
[   84.546650]  mlx4_eq_int+0x1d4/0x7f0 [mlx4_core]
[   84.546670]  ? note_gp_changes+0x60/0x70
[   84.546681]  mlx4_msi_x_interrupt+0x11/0x20 [mlx4_core]
[   84.546701]  ? mlx4_msi_x_interrupt+0x11/0x20 [mlx4_core]
[   84.546722]  __handle_irq_event_percpu+0x3f/0x150
[   84.546731]  handle_irq_event+0x4d/0xb0
[   84.546739]  handle_edge_irq+0x94/0x1f0
[   84.546745]  __common_interrupt+0x44/0xa0
[   84.546752]  common_interrupt+0x85/0xa0
[   84.546761]  </IRQ>
[   84.546765]  <TASK>
[   84.546769]  asm_common_interrupt+0x27/0x40
[   84.546776] RIP: 0010:cpuidle_enter_state+0xd3/0x350
[   84.546784] Code: 89 c6 0f 1f 44 00 00 31 ff e8 59 15 a0 ff 80 7d d7 00 74 12 9c 58 f6 c4 02 0f 85 69 02 00 00 31 ff e8 f1 46 a5 ff fb 45 85 ff <0f> 88 fa 00 00 00 49 63 cf 4c 8b 55 c8 48 8d 04 49 48 8d 14 81 48
[   84.546793] RSP: 0018:ffffffff82203df0 EFLAGS: 00000202
[   84.546798] RAX: ffff88fe7f82a580 RBX: ffffe8ffff402888 RCX: 000000000000001f
[   84.546804] RDX: 00000013af5996af RSI: 000000003d17f3e5 RDI: 0000000000000000
[   84.546809] RBP: ffffffff82203e28 R08: 0000000000000002 R09: ffff88fe7f8294c4
[   84.546814] R10: 0000000000000018 R11: 0000000000000067 R12: 0000000000000004
[   84.546819] R13: ffffffff82377ce0 R14: 00000013af5996af R15: 0000000000000004
[   84.546826]  cpuidle_enter+0x2e/0x40
[   84.546834]  do_idle+0x1ca/0x220
[   84.546842]  cpu_startup_entry+0x1d/0x20
[   84.546849]  rest_init+0xbf/0xd0
[   84.546854]  arch_call_rest_init+0xe/0x1b
[   84.546871]  start_kernel+0x65f/0x685
[   84.546879]  x86_64_start_reservations+0x24/0x26
[   84.546887]  x86_64_start_kernel+0x9c/0x9f
[   84.546894]  secondary_startup_64_no_verify+0xc2/0xcb
[   84.546903]  </TASK>
[   84.546907] ---[ end trace baa9db6983265450 ]---
…ace condition possibly ending in GPF or NPR
drbd testres1: Preparing cluster-wide state change 1915447682 (0->-1 3/2)
drbd testres1: State change 1915447682: primary_nodes=0, weak_nodes=0
drbd testres1: Committing cluster-wide state change 1915447682 (0ms)
drbd testres1: role( Primary -> Secondary )
drbd testres1: Preparing cluster-wide state change 4059059782 (0->1 496/16)
drbd testres1: State change 4059059782: primary_nodes=0, weak_nodes=0
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Cluster is now split
drbd testres1: Committing cluster-wide state change 4059059782 (0ms)
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: conn( Connected -> Disconnecting ) peer( Secondary -> Unknown )
drbd testres1/0 drbd2 ybos-00000000-0000-0000-0000-38b8ebd03c78: pdsk( UpToDate -> DUnknown ) repl( Established -> Off )
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: sock_recvmsg returned -4
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Terminating sender thread
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Starting sender thread (from drbd_r_testres1 [3923])
BUG: kernel NULL pointer dereference, address: 0000000000000008
PGD 0 P4D 0
Oops: 0000 [LINBIT#1] SMP
CPU: 0 PID: 16 Comm: kworker/0:1 Tainted: G           O      5.15.75 LINBIT#3
Hardware name: Insyde Grantley/Analytic Blade Board, BIOS 05.04.21.0038.00.011 05/09/2018
Workqueue: events dtr_end_rx_work_fn [drbd_transport_rdma]
RIP: 0010:dtr_free_rx_desc.part.0+0x15/0xa0 [drbd_transport_rdma]
Code: 00 48 89 d7 e8 8c 6f 2a e1 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 4c 8b 67 20 48 89 fb <49> 8b 44 24 08 4d 8b 6c 24 10 48 8b 00 48 8b 38 48 85 ff 74 1f 49
RSP: 0018:ffff888100c0fe30 EFLAGS: 00010082
RAX: ffff888460e46b88 RBX: ffff888460e46b80 RCX: ffff888460e46bc8
RDX: 0000000000000001 RSI: 807fffffffffffff RDI: ffff888460e46b80
RBP: ffff888100c0fe48 R08: ffff8881102cb728 R09: ffff88810006b1b4
R10: 0000000000000018 R11: fefefefefefefeff R12: 0000000000000000
R13: ffff8881102cb718 R14: ffff888460e46bc0 R15: ffff88fe7f829f00
FS:  0000000000000000(0000) GS:ffff88fe7f800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000000220a002 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <TASK>
 dtr_end_rx_work_fn+0x48/0x70 [drbd_transport_rdma]
 process_one_work+0x1e4/0x390
 worker_thread+0x50/0x3e0
 ? rescuer_thread+0x3a0/0x3a0
 kthread+0x12a/0x150
 ? set_kthread_struct+0x50/0x50
 ret_from_fork+0x1f/0x30
 </TASK>
Modules linked in: ext4 mbcache jbd2 drbd_transport_rdma(O) drbd(O) ip6table_nat iptable_nat nf_nat bpfilter nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c rdma_ucm rdma_cm ib_cm iw_cm ib_umad ib_ipoib mlx4_ib kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd nvme nvme_core mlx4_core ib_uverbs ib_core dummy bonding [last unloaded: drbd]
CR2: 0000000000000008
---[ end trace 5d134c4748bcd1c9 ]---
RIP: 0010:dtr_free_rx_desc.part.0+0x15/0xa0 [drbd_transport_rdma]
Code: 00 48 89 d7 e8 8c 6f 2a e1 5d c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 41 55 41 54 53 4c 8b 67 20 48 89 fb <49> 8b 44 24 08 4d 8b 6c 24 10 48 8b 00 48 8b 38 48 85 ff 74 1f 49
RSP: 0018:ffff888100c0fe30 EFLAGS: 00010082
RAX: ffff888460e46b88 RBX: ffff888460e46b80 RCX: ffff888460e46bc8
RDX: 0000000000000001 RSI: 807fffffffffffff RDI: ffff888460e46b80
RBP: ffff888100c0fe48 R08: ffff8881102cb728 R09: ffff88810006b1b4
R10: 0000000000000018 R11: fefefefefefefeff R12: 0000000000000000
R13: ffff8881102cb718 R14: ffff888460e46bc0 R15: ffff88fe7f829f00
FS:  0000000000000000(0000) GS:ffff88fe7f800000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000008 CR3: 000000000220a002 CR4: 00000000003706f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Kernel panic - not syncing: Fatal exception
Kernel Offset: disabled
Rebooting in 5 seconds..
This can be reproduced within about 10-20 iterations of:
"while true; do drbdadm up testres1 && sleep 2 && drbdadm down testres1 && sleep 2; done"
The key is that the testres1.res file points to a peer node that either doesn't exist or isn't booted yet.
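
A bounded variant of the loop above, as a minimal sketch. The variable names (RES, MAX_ITER, PAUSE, DRBDADM) and the repro_loop function are illustrative and not part of the patches. Unlike the original one-liner, this version stops at the first failing drbdadm call instead of looping forever.

```shell
#!/bin/sh
# Sketch only: bounded up/down loop for the reproduction described above.
# All names below are hypothetical helpers, not part of DRBD.
RES="${RES:-testres1}"          # resource name taken from the report
MAX_ITER="${MAX_ITER:-20}"      # the report mentions about 10-20 iterations
PAUSE="${PAUSE:-2}"             # seconds to sleep between up and down
DRBDADM="${DRBDADM:-drbdadm}"   # overridable so the loop can be dry-run

repro_loop() {
    i=1
    while [ "$i" -le "$MAX_ITER" ]; do
        echo "iteration $i of $MAX_ITER"
        "$DRBDADM" up "$RES" || return 1
        sleep "$PAUSE"
        "$DRBDADM" down "$RES" || return 1
        sleep "$PAUSE"
        i=$((i + 1))
    done
}

# Run as root on the test node: repro_loop
```

Printing the iteration number makes it easy to see how many cycles ran before the panic when the console output is captured.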

drbd testres1/0 drbd2: disk( UpToDate -> Detaching )
drbd testres1/0 drbd2: disk( Detaching -> Diskless )
drbd testres1/0 drbd2: drbd_bm_resize called with capacity == 0
drbd testres1: Terminating worker thread
drbd testres1: Starting worker thread (from drbdsetup [4990])
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Starting sender thread (from drbdsetup [4997])
drbd testres1/0 drbd2: meta-data IO uses: blk-bio
drbd testres1/0 drbd2: disk( Diskless -> Attaching )
drbd testres1/0 drbd2: Maximum number of peer devices = 1
drbd testres1: Method to ensure write ordering: flush
drbd testres1/0 drbd2: drbd_bm_resize called with capacity == 7501244792
drbd testres1/0 drbd2: resync bitmap: bits=937655599 words=14650869 pages=28615
drbd2: detected capacity change from 0 to 7501244792
drbd testres1/0 drbd2: size = 3577 GB (3750622396 KB)
drbd testres1/0 drbd2: size = 3577 GB (3750622396 KB)
drbd testres1/0 drbd2: recounting of set bits took additional 40ms
drbd testres1/0 drbd2: disk( Attaching -> UpToDate )
drbd testres1/0 drbd2: attached to current UUID: 027B94FF1B3EC8D4
drbd testres1/0 drbd2: Setting exposed data uuid: 027B94FF1B3EC8D4
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: conn( StandAlone -> Unconnected )
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Starting receiver thread (from drbd_w_testres1 [4991])
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: conn( Unconnected -> Connecting )
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: conn( Connecting -> Disconnecting )
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Failed to initiate connection, err=-512
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Terminating sender thread
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Starting sender thread (from drbd_r_testres1 [5010])
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Connection closed
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: conn( Disconnecting -> StandAlone )
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Terminating receiver thread
drbd testres1 ybos-00000000-0000-0000-0000-38b8ebd03c78: Terminating sender thread
BUG: kernel NULL pointer dereference, address: 0000000000000208
PGD 0 P4D 0
Oops: 0002 [LINBIT#1] SMP
CPU: 11 PID: 0 Comm: swapper/11 Tainted: G           O      5.15.75 LINBIT#3
Hardware name: Insyde Grantley/Analytic Blade Board, BIOS 05.04.21.0038.00.011 05/09/2018
RIP: 0010:__run_timers+0x1df/0x280
Code: 48 c7 43 08 00 00 00 00 48 85 c0 0f 84 86 00 00 00 49 8b 0c 24 48 89 4b 08 66 90 48 8b 01 48 8b 51 08 48 89 02 48 85 c0 74 04 <48> 89 50 08 48 c7 41 08 00 00 00 00 48 8b 71 18 4c 89 31 f6 41 22
RSP: 0018:ffff88fe7fac5ed0 EFLAGS: 00010006
RAX: 0000000000000200 RBX: ffff88fe7fadb740 RCX: ffff88810bf17578
RDX: ffff88fe7fac5ef8 RSI: 0000000140003d80 RDI: ffff88fe7fadb768
RBP: ffff88fe7fac5f68 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88fe7fac5ef8
R13: 0000000100003d80 R14: dead000000000122 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88fe7fac0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000208 CR3: 000000000220a002 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
 <IRQ>
 run_timer_softirq+0x1d/0x40
 __do_softirq+0xc6/0x27d
 irq_exit_rcu+0x86/0xb0
 sysvec_apic_timer_interrupt+0x78/0xa0
 </IRQ>
 <TASK>
 asm_sysvec_apic_timer_interrupt+0x1b/0x20
RIP: 0010:cpuidle_enter_state+0xd3/0x350
Code: 89 c6 0f 1f 44 00 00 31 ff e8 59 15 a0 ff 80 7d d7 00 74 12 9c 58 f6 c4 02 0f 85 69 02 00 00 31 ff e8 f1 46 a5 ff fb 45 85 ff <0f> 88 fa 00 00 00 49 63 cf 4c 8b 55 c8 48 8d 04 49 48 8d 14 81 48
RSP: 0018:ffff8881010dbe70 EFLAGS: 00000202
RAX: ffff88fe7faea580 RBX: ffffe8ffff6c2888 RCX: 000000000000001f
RDX: 0000006a8769d584 RSI: 000000003d17f1fb RDI: 0000000000000000
RBP: ffff8881010dbea8 R08: 0000000000000002 R09: ffff88fe7fae94a4
R10: 0000000000000008 R11: 0000000000066863 R12: 0000000000000004
R13: ffffffff82377ce0 R14: 0000006a8769d584 R15: 0000000000000004
 cpuidle_enter+0x2e/0x40
 do_idle+0x1ca/0x220
 cpu_startup_entry+0x1d/0x20
 start_secondary+0xe1/0xf0
 secondary_startup_64_no_verify+0xc2/0xcb
 </TASK>
Modules linked in: drbd_transport_rdma(O) ip6table_nat iptable_nat nf_nat bpfilter drbd(O) nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 libcrc32c rdma_ucm rdma_cm ib_cm iw_cm ib_umad ib_ipoib mlx4_ib kvm_intel kvm irqbypass crc32_pclmul ghash_clmulni_intel aesni_intel crypto_simd cryptd nvme nvme_core mlx4_core ib_uverbs ib_core dummy bonding
CR2: 0000000000000208
---[ end trace 9617d0e986125e0b ]---
RIP: 0010:__run_timers+0x1df/0x280
Code: 48 c7 43 08 00 00 00 00 48 85 c0 0f 84 86 00 00 00 49 8b 0c 24 48 89 4b 08 66 90 48 8b 01 48 8b 51 08 48 89 02 48 85 c0 74 04 <48> 89 50 08 48 c7 41 08 00 00 00 00 48 8b 71 18 4c 89 31 f6 41 22
RSP: 0018:ffff88fe7fac5ed0 EFLAGS: 00010006
RAX: 0000000000000200 RBX: ffff88fe7fadb740 RCX: ffff88810bf17578
RDX: ffff88fe7fac5ef8 RSI: 0000000140003d80 RDI: ffff88fe7fadb768
RBP: ffff88fe7fac5f68 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000001 R12: ffff88fe7fac5ef8
R13: 0000000100003d80 R14: dead000000000122 R15: 0000000000000000
FS:  0000000000000000(0000) GS:ffff88fe7fac0000(0000) knlGS:0000000000000000
CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000208 CR3: 000000000220a002 CR4: 00000000003706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
Rebooting in 5 seconds..
In my setup I need more control stream buffers to prevent the "Not sending flow_control mgs, no receive window!" error message (usually seen during resync). This is a hackish way to increase the number of buffers; perhaps it should be configurable. The original value was 64. I get better resync performance if I set it to 2 (half as many buffers as the data stream).
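
To illustrate the numbers, assuming the value acts as a divisor on the data stream buffer count (this is inferred from "half as many buffers as the data stream"; the commit message does not say so explicitly). The control_bufs helper and the count of 1024 are made up for the example.

```shell
#!/bin/sh
# Sketch only: control stream buffers as a fraction of data stream buffers.
# control_bufs is a hypothetical helper; the divisor reading is an assumption.
control_bufs() {
    # $1: data stream buffer count, $2: divisor
    echo $(( $1 / $2 ))
}

DATA_BUFS=1024   # hypothetical data stream buffer count
echo "divisor 64: $(control_bufs "$DATA_BUFS" 64) control buffers"
echo "divisor 2:  $(control_bufs "$DATA_BUFS" 2) control buffers"
```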
@LinbitPRBot
Collaborator

Hi @mtisza!

Thanks for your contribution to the LINBIT software!

Development for this project happens on mailing lists, rather than on GitHub - this GitHub repository is a read-only mirror that isn't used for accepting contributions. So that your change can become part of our software, please email it to us as a patch.

Here's what to do:

  • Format your contribution
  • Decide where to send your contribution to
  • Set up your system to send your contribution as an email
  • Send your contribution and wait for feedback

How do I format my contribution?

Firstly, all contributions need to be formatted as patches. A patch is a plain text document showing the change you want to make to the code, and documenting why it is a good idea.

You can create patches with git format-patch.

Secondly, patches need 'commit messages': the human-friendly documentation explaining what the change is and why it's necessary.

Who do I send my contribution to?

There are two mailing lists:

  • DRBD-dev is "strictly" used for patch coordination.
  • DRBD-user is also a fine place to send your ideas and initial "RFC" patches. You probably want to start here.

If you're interested in DRBD development, subscribing to these mailing lists is a good idea.

How do I send my contribution?

Use git send-email, which will ensure that your patches are formatted in the standard manner. In order to use git send-email, you'll need to configure git to use your SMTP email server.

For more information about using git send-email, look at the Git documentation or type git help send-email. There are a number of useful guides and tutorials about git send-email that can be found on the internet.
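
The formatting and sending steps might look roughly like the sketch below. The branch name in BASE, the output directory, the list address, the SMTP values, and the two helper functions are placeholders, not project-specific settings. The --dry-run flag makes git send-email show what it would send without contacting the SMTP server.

```shell
#!/bin/sh
# Sketch only: turn local commits into patch files and mail them.
# BASE, OUT_DIR and LIST_ADDR are placeholders to be replaced.
BASE="${BASE:-master}"                          # branch the commits sit on top of
OUT_DIR="${OUT_DIR:-outgoing}"                  # where the patch files are written
LIST_ADDR="${LIST_ADDR:-list@example.invalid}"  # mailing list address (placeholder)

# One-time SMTP setup (placeholder values):
#   git config sendemail.smtpServer smtp.example.invalid
#   git config sendemail.smtpServerPort 587
#   git config sendemail.smtpEncryption tls

make_patches() {
    # One patch file per commit in BASE..HEAD
    git format-patch -o "$OUT_DIR" "$BASE"..HEAD
}

send_patches() {
    # Remove --dry-run once the output looks right
    git send-email --dry-run --to="$LIST_ADDR" "$OUT_DIR"/*.patch
}

# Typical use inside the repository: make_patches && send_patches
```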

How do I get help if I'm stuck?

Firstly, don't get discouraged, we are here to help! If you are lost in the process, and really tried, you will usually find contact information in header/implementation files, or see who touched the code with git blame. If it was an @linbit.com person, write to them. We are more interested in good patches than strictly following the rules (but you should try first!).

I sent my patch - now what?

You wait.

You can check that your email has been received by checking the mailing list archives for the mailing list you sent your patch to. Messages may not be received instantly, so be patient. Developers are generally very busy people, so it may take a few days, even weeks before your patch is looked at.

Then, you keep waiting. It is fine to kick us again if you did not receive an answer within 2 weeks, but usually we are a lot faster.

Further information

Happy hacking!

This message was posted by a bot - if you have any questions or suggestions, please talk to my owner, @rck

@LinbitPRBot LinbitPRBot closed this Aug 8, 2023
@mtisza mtisza mentioned this pull request Aug 9, 2023
@mtisza
Author

mtisza commented Aug 9, 2023

I know this is closed, but I submitted a fixed version as #66, because this one had a problem I failed to catch before submitting it (a build issue caused by rebasing onto the latest master).
