Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dmitryc patches8 #6

Open
wants to merge 8 commits into
base: master
Choose a base branch
from
Open

Dmitryc patches8 #6

wants to merge 8 commits into from

Conversation

ramosian-glider
Copy link
Owner

@dvyukov: please take a look. I've ran the patches through checkpatch.pl and think those are ready for upstream review.

ramosian-glider and others added 8 commits January 27, 2016 14:32
depending on which allocator (SLAB or SLUB) is being used

Signed-off-by: Alexander Potapenko <[email protected]>
This patch adds KASAN hooks to SLAB allocator.

This patch is based on the "mm: kasan: unified support for SLUB and
SLAB allocators" patch originally prepared by Dmitry Chernenkov.

Signed-off-by: Dmitry Chernenkov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Add GFP flags to KASAN hooks for future patches to use.

This patch is based on the "mm: kasan: unified support for SLUB and
SLAB allocators" patch originally prepared by Dmitry Chernenkov.

Signed-off-by: Dmitry Chernenkov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
KASAN needs to know whether the allocation happens in an IRQ handler.
This lets us strip everything below the IRQ entry point to reduce the number
of unique stack traces needed to be stored.

Signed-off-by: Alexander Potapenko <[email protected]>
Stack depot will allow KASAN store allocation/deallocation stack traces
for memory chunks. The stack traces are stored in a hash table and
referenced by handles which reside in the kasan_alloc_meta and
kasan_free_meta structures in the allocated memory chunks.

IRQ stack traces are cut below the IRQ entry point to avoid unnecessary
duplication.

Right now stackdepot support is only enabled in SLAB allocator.
Once KASAN features in SLAB are on par with those in SLUB we can switch
SLUB to stackdepot as well, thus removing the dependency on SLUB_DEBUG.

This patch is based on the "mm: kasan: stack depots" patch originally
prepared by Dmitry Chernenkov.

Signed-off-by: Dmitry Chernenkov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
Quarantine isolates freed objects in a separate queue. The objects are
returned to the allocator later, which helps to detect use-after-free
errors.

Freed objects are first added to per-cpu quarantine queues.
When a cache is destroyed or memory shrinking is requested, the objects
are moved into the global quarantine queue. Whenever a kmalloc call
allows memory reclaiming, the oldest objects are popped out of the
global queue until the total size of objects in quarantine is less than
3/4 of the maximum quarantine size (which is a fraction of installed
physical memory).

Right now quarantine support is only enabled in SLAB allocator.
Unification of KASAN features in SLAB and SLUB will be done later.

This patch is based on the "mm: kasan: quarantine" patch originally
prepared by Dmitry Chernenkov.

Signed-off-by: Dmitry Chernenkov <[email protected]>
Signed-off-by: Alexander Potapenko <[email protected]>
ramosian-glider pushed a commit that referenced this pull request Feb 4, 2020
There exists a deadlock with range_cyclic that has existed forever.  If
we loop around with a bio already built we could deadlock with a writer
who has the page locked that we're attempting to write but is waiting on
a page in our bio to be written out.  The task traces are as follows

  PID: 1329874  TASK: ffff889ebcdf3800  CPU: 33  COMMAND: "kworker/u113:5"
   #0 [ffffc900297bb658] __schedule at ffffffff81a4c33f
   #1 [ffffc900297bb6e0] schedule at ffffffff81a4c6e3
   #2 [ffffc900297bb6f8] io_schedule at ffffffff81a4ca42
   #3 [ffffc900297bb708] __lock_page at ffffffff811f145b
   #4 [ffffc900297bb798] __process_pages_contig at ffffffff814bc502
   #5 [ffffc900297bb8c8] lock_delalloc_pages at ffffffff814bc684
   #6 [ffffc900297bb900] find_lock_delalloc_range at ffffffff814be9ff
   #7 [ffffc900297bb9a0] writepage_delalloc at ffffffff814bebd0
   torvalds#8 [ffffc900297bba18] __extent_writepage at ffffffff814bfbf2
   torvalds#9 [ffffc900297bba98] extent_write_cache_pages at ffffffff814bffbd

  PID: 2167901  TASK: ffff889dc6a59c00  CPU: 14  COMMAND:
  "aio-dio-invalid"
   #0 [ffffc9003b50bb18] __schedule at ffffffff81a4c33f
   #1 [ffffc9003b50bba0] schedule at ffffffff81a4c6e3
   #2 [ffffc9003b50bbb8] io_schedule at ffffffff81a4ca42
   #3 [ffffc9003b50bbc8] wait_on_page_bit at ffffffff811f24d6
   #4 [ffffc9003b50bc60] prepare_pages at ffffffff814b05a7
   #5 [ffffc9003b50bcd8] btrfs_buffered_write at ffffffff814b1359
   #6 [ffffc9003b50bdb0] btrfs_file_write_iter at ffffffff814b5933
   #7 [ffffc9003b50be38] new_sync_write at ffffffff8128f6a8
   torvalds#8 [ffffc9003b50bec8] vfs_write at ffffffff81292b9d
   torvalds#9 [ffffc9003b50bf00] ksys_pwrite64 at ffffffff81293032

I used drgn to find the respective pages we were stuck on

page_entry.page 0xffffea00fbfc7500 index 8148 bit 15 pid 2167901
page_entry.page 0xffffea00f9bb7400 index 7680 bit 0 pid 1329874

As you can see the kworker is waiting for bit 0 (PG_locked) on index
7680, and aio-dio-invalid is waiting for bit 15 (PG_writeback) on index
8148.  aio-dio-invalid has 7680, and the kworker epd looks like the
following

  crash> struct extent_page_data ffffc900297bbbb0
  struct extent_page_data {
    bio = 0xffff889f747ed830,
    tree = 0xffff889eed6ba448,
    extent_locked = 0,
    sync_io = 0
  }

Probably worth mentioning as well that it waits for writeback of the
page to complete while holding a lock on it (at prepare_pages()).

Using drgn I walked the bio pages looking for page
0xffffea00fbfc7500 which is the one we're waiting for writeback on

  bio = Object(prog, 'struct bio', address=0xffff889f747ed830)
  for i in range(0, bio.bi_vcnt.value_()):
      bv = bio.bi_io_vec[i]
      if bv.bv_page.value_() == 0xffffea00fbfc7500:
	  print("FOUND IT")

which validated what I suspected.

The fix for this is simple, flush the epd before we loop back around to
the beginning of the file during writeout.

Fixes: b293f02 ("Btrfs: Add writepages support")
CC: [email protected] # 4.4+
Reviewed-by: Filipe Manana <[email protected]>
Signed-off-by: Josef Bacik <[email protected]>
Signed-off-by: David Sterba <[email protected]>
ramosian-glider pushed a commit that referenced this pull request Feb 4, 2020
It is not that hard to trigger lockdep splats by calling printk from
under zone->lock.  Most of them are false positives caused by lock
chains introduced early in the boot process and they do not cause any
real problems (although most of the early boot lock dependencies could
happen after boot as well).  There are some console drivers which do
allocate from the printk context as well and those should be fixed.  In
any case, false positives are not that trivial to workaround and it is
far from optimal to lose lockdep functionality for something that is a
non-issue.

So change has_unmovable_pages() so that it no longer calls dump_page()
itself - instead it returns a "struct page *" of the unmovable page back
to the caller so that in the case of a has_unmovable_pages() failure,
the caller can call dump_page() after releasing zone->lock.  Also, make
dump_page() is able to report a CMA page as well, so the reason string
from has_unmovable_pages() can be removed.

Even though has_unmovable_pages doesn't hold any reference to the
returned page this should be reasonably safe for the purpose of
reporting the page (dump_page) because it cannot be hotremoved in the
context of memory unplug.  The state of the page might change but that
is the case even with the existing code as zone->lock only plays role
for free pages.

While at it, remove a similar but unnecessary debug-only printk() as
well.  A sample of one of those lockdep splats is,

  WARNING: possible circular locking dependency detected
  ------------------------------------------------------
  test.sh/8653 is trying to acquire lock:
  ffffffff865a4460 (console_owner){-.-.}, at:
  console_unlock+0x207/0x750

  but task is already holding lock:
  ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
  __offline_isolated_pages+0x179/0x3e0

  which lock already depends on the new lock.

  the existing dependency chain (in reverse order) is:

  -> #3 (&(&zone->lock)->rlock){-.-.}:
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         _raw_spin_lock+0x2f/0x40
         rmqueue_bulk.constprop.21+0xb6/0x1160
         get_page_from_freelist+0x898/0x22c0
         __alloc_pages_nodemask+0x2f3/0x1cd0
         alloc_pages_current+0x9c/0x110
         allocate_slab+0x4c6/0x19c0
         new_slab+0x46/0x70
         ___slab_alloc+0x58b/0x960
         __slab_alloc+0x43/0x70
         __kmalloc+0x3ad/0x4b0
         __tty_buffer_request_room+0x100/0x250
         tty_insert_flip_string_fixed_flag+0x67/0x110
         pty_write+0xa2/0xf0
         n_tty_write+0x36b/0x7b0
         tty_write+0x284/0x4c0
         __vfs_write+0x50/0xa0
         vfs_write+0x105/0x290
         redirected_tty_write+0x6a/0xc0
         do_iter_write+0x248/0x2a0
         vfs_writev+0x106/0x1e0
         do_writev+0xd4/0x180
         __x64_sys_writev+0x45/0x50
         do_syscall_64+0xcc/0x76c
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  -> #2 (&(&port->lock)->rlock){-.-.}:
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         _raw_spin_lock_irqsave+0x3a/0x50
         tty_port_tty_get+0x20/0x60
         tty_port_default_wakeup+0xf/0x30
         tty_port_tty_wakeup+0x39/0x40
         uart_write_wakeup+0x2a/0x40
         serial8250_tx_chars+0x22e/0x440
         serial8250_handle_irq.part.8+0x14a/0x170
         serial8250_default_handle_irq+0x5c/0x90
         serial8250_interrupt+0xa6/0x130
         __handle_irq_event_percpu+0x78/0x4f0
         handle_irq_event_percpu+0x70/0x100
         handle_irq_event+0x5a/0x8b
         handle_edge_irq+0x117/0x370
         do_IRQ+0x9e/0x1e0
         ret_from_intr+0x0/0x2a
         cpuidle_enter_state+0x156/0x8e0
         cpuidle_enter+0x41/0x70
         call_cpuidle+0x5e/0x90
         do_idle+0x333/0x370
         cpu_startup_entry+0x1d/0x1f
         start_secondary+0x290/0x330
         secondary_startup_64+0xb6/0xc0

  -> #1 (&port_lock_key){-.-.}:
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         _raw_spin_lock_irqsave+0x3a/0x50
         serial8250_console_write+0x3e4/0x450
         univ8250_console_write+0x4b/0x60
         console_unlock+0x501/0x750
         vprintk_emit+0x10d/0x340
         vprintk_default+0x1f/0x30
         vprintk_func+0x44/0xd4
         printk+0x9f/0xc5

  -> #0 (console_owner){-.-.}:
         check_prev_add+0x107/0xea0
         validate_chain+0x8fc/0x1200
         __lock_acquire+0x5b3/0xb40
         lock_acquire+0x126/0x280
         console_unlock+0x269/0x750
         vprintk_emit+0x10d/0x340
         vprintk_default+0x1f/0x30
         vprintk_func+0x44/0xd4
         printk+0x9f/0xc5
         __offline_isolated_pages.cold.52+0x2f/0x30a
         offline_isolated_pages_cb+0x17/0x30
         walk_system_ram_range+0xda/0x160
         __offline_pages+0x79c/0xa10
         offline_pages+0x11/0x20
         memory_subsys_offline+0x7e/0xc0
         device_offline+0xd5/0x110
         state_store+0xc6/0xe0
         dev_attr_store+0x3f/0x60
         sysfs_kf_write+0x89/0xb0
         kernfs_fop_write+0x188/0x240
         __vfs_write+0x50/0xa0
         vfs_write+0x105/0x290
         ksys_write+0xc6/0x160
         __x64_sys_write+0x43/0x50
         do_syscall_64+0xcc/0x76c
         entry_SYSCALL_64_after_hwframe+0x49/0xbe

  other info that might help us debug this:

  Chain exists of:
    console_owner --> &(&port->lock)->rlock --> &(&zone->lock)->rlock

   Possible unsafe locking scenario:

         CPU0                    CPU1
         ----                    ----
    lock(&(&zone->lock)->rlock);
                                 lock(&(&port->lock)->rlock);
                                 lock(&(&zone->lock)->rlock);
    lock(console_owner);

   *** DEADLOCK ***

  9 locks held by test.sh/8653:
   #0: ffff88839ba7d408 (sb_writers#4){.+.+}, at:
  vfs_write+0x25f/0x290
   #1: ffff888277618880 (&of->mutex){+.+.}, at:
  kernfs_fop_write+0x128/0x240
   #2: ffff8898131fc218 (kn->count#115){.+.+}, at:
  kernfs_fop_write+0x138/0x240
   #3: ffffffff86962a80 (device_hotplug_lock){+.+.}, at:
  lock_device_hotplug_sysfs+0x16/0x50
   #4: ffff8884374f4990 (&dev->mutex){....}, at:
  device_offline+0x70/0x110
   #5: ffffffff86515250 (cpu_hotplug_lock.rw_sem){++++}, at:
  __offline_pages+0xbf/0xa10
   #6: ffffffff867405f0 (mem_hotplug_lock.rw_sem){++++}, at:
  percpu_down_write+0x87/0x2f0
   #7: ffff88883fff3c58 (&(&zone->lock)->rlock){-.-.}, at:
  __offline_isolated_pages+0x179/0x3e0
   torvalds#8: ffffffff865a4920 (console_lock){+.+.}, at:
  vprintk_emit+0x100/0x340

  stack backtrace:
  Hardware name: HPE ProLiant DL560 Gen10/ProLiant DL560 Gen10,
  BIOS U34 05/21/2019
  Call Trace:
   dump_stack+0x86/0xca
   print_circular_bug.cold.31+0x243/0x26e
   check_noncircular+0x29e/0x2e0
   check_prev_add+0x107/0xea0
   validate_chain+0x8fc/0x1200
   __lock_acquire+0x5b3/0xb40
   lock_acquire+0x126/0x280
   console_unlock+0x269/0x750
   vprintk_emit+0x10d/0x340
   vprintk_default+0x1f/0x30
   vprintk_func+0x44/0xd4
   printk+0x9f/0xc5
   __offline_isolated_pages.cold.52+0x2f/0x30a
   offline_isolated_pages_cb+0x17/0x30
   walk_system_ram_range+0xda/0x160
   __offline_pages+0x79c/0xa10
   offline_pages+0x11/0x20
   memory_subsys_offline+0x7e/0xc0
   device_offline+0xd5/0x110
   state_store+0xc6/0xe0
   dev_attr_store+0x3f/0x60
   sysfs_kf_write+0x89/0xb0
   kernfs_fop_write+0x188/0x240
   __vfs_write+0x50/0xa0
   vfs_write+0x105/0x290
   ksys_write+0xc6/0x160
   __x64_sys_write+0x43/0x50
   do_syscall_64+0xcc/0x76c
   entry_SYSCALL_64_after_hwframe+0x49/0xbe

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Qian Cai <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Petr Mladek <[email protected]>
Cc: Steven Rostedt (VMware) <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
ramosian-glider pushed a commit that referenced this pull request Feb 17, 2025
When COWing a relocation tree path, at relocation.c:replace_path(), we
can trigger a lockdep splat while we are in the btrfs_search_slot() call
against the relocation root. This happens in that callchain at
ctree.c:read_block_for_search() when we happen to find a child extent
buffer already loaded through the fs tree with a lockdep class set to
the fs tree. So when we attempt to lock that extent buffer through a
relocation tree we have to reset the lockdep class to the class for a
relocation tree, since a relocation tree has extent buffers that used
to belong to a fs tree and may currently be already loaded (we swap
extent buffers between the two trees at the end of replace_path()).

However we are missing calls to btrfs_maybe_reset_lockdep_class() to reset
the lockdep class at ctree.c:read_block_for_search() before we read lock
an extent buffer, just like we did for btrfs_search_slot() in commit
b40130b ("btrfs: fix lockdep splat with reloc root extent buffers").

So add the missing btrfs_maybe_reset_lockdep_class() calls before the
attempts to read lock an extent buffer at ctree.c:read_block_for_search().

The lockdep splat was reported by syzbot and it looks like this:

   ======================================================
   WARNING: possible circular locking dependency detected
   6.13.0-rc5-syzkaller-00163-gab75170520d4 #0 Not tainted
   ------------------------------------------------------
   syz.0.0/5335 is trying to acquire lock:
   ffff8880545dbc38 (btrfs-tree-01){++++}-{4:4}, at: btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146

   but task is already holding lock:
   ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #2 (btrfs-treloc-02/1){+.+.}-{4:4}:
          reacquire_held_locks+0x3eb/0x690 kernel/locking/lockdep.c:5374
          __lock_release kernel/locking/lockdep.c:5563 [inline]
          lock_release+0x396/0xa30 kernel/locking/lockdep.c:5870
          up_write+0x79/0x590 kernel/locking/rwsem.c:1629
          btrfs_force_cow_block+0x14b3/0x1fd0 fs/btrfs/ctree.c:660
          btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
          btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
          replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
          merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
          merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
          relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
          btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
          btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
          __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
          btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
          btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
          vfs_ioctl fs/ioctl.c:51 [inline]
          __do_sys_ioctl fs/ioctl.c:906 [inline]
          __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f

   -> #1 (btrfs-tree-01/1){+.+.}-{4:4}:
          lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
          down_write_nested+0xa2/0x220 kernel/locking/rwsem.c:1693
          btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
          btrfs_init_new_buffer fs/btrfs/extent-tree.c:5052 [inline]
          btrfs_alloc_tree_block+0x41c/0x1440 fs/btrfs/extent-tree.c:5132
          btrfs_force_cow_block+0x526/0x1fd0 fs/btrfs/ctree.c:573
          btrfs_cow_block+0x371/0x830 fs/btrfs/ctree.c:755
          btrfs_search_slot+0xc01/0x3180 fs/btrfs/ctree.c:2153
          btrfs_insert_empty_items+0x9c/0x1a0 fs/btrfs/ctree.c:4351
          btrfs_insert_empty_item fs/btrfs/ctree.h:688 [inline]
          btrfs_insert_inode_ref+0x2bb/0xf80 fs/btrfs/inode-item.c:330
          btrfs_rename_exchange fs/btrfs/inode.c:7990 [inline]
          btrfs_rename2+0xcb7/0x2b90 fs/btrfs/inode.c:8374
          vfs_rename+0xbdb/0xf00 fs/namei.c:5067
          do_renameat2+0xd94/0x13f0 fs/namei.c:5224
          __do_sys_renameat2 fs/namei.c:5258 [inline]
          __se_sys_renameat2 fs/namei.c:5255 [inline]
          __x64_sys_renameat2+0xce/0xe0 fs/namei.c:5255
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f

   -> #0 (btrfs-tree-01){++++}-{4:4}:
          check_prev_add kernel/locking/lockdep.c:3161 [inline]
          check_prevs_add kernel/locking/lockdep.c:3280 [inline]
          validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
          __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
          lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
          down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
          btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
          btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
          read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
          btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
          replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
          merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
          merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
          relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
          btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
          btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
          __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
          btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
          btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
          vfs_ioctl fs/ioctl.c:51 [inline]
          __do_sys_ioctl fs/ioctl.c:906 [inline]
          __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
          do_syscall_x64 arch/x86/entry/common.c:52 [inline]
          do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
          entry_SYSCALL_64_after_hwframe+0x77/0x7f

   other info that might help us debug this:

   Chain exists of:
     btrfs-tree-01 --> btrfs-tree-01/1 --> btrfs-treloc-02/1

    Possible unsafe locking scenario:

          CPU0                    CPU1
          ----                    ----
     lock(btrfs-treloc-02/1);
                                  lock(btrfs-tree-01/1);
                                  lock(btrfs-treloc-02/1);
     rlock(btrfs-tree-01);

    *** DEADLOCK ***

   8 locks held by syz.0.0/5335:
    #0: ffff88801e3ae420 (sb_writers#13){.+.+}-{0:0}, at: mnt_want_write_file+0x5e/0x200 fs/namespace.c:559
    #1: ffff888052c760d0 (&fs_info->reclaim_bgs_lock){+.+.}-{4:4}, at: __btrfs_balance+0x4c2/0x26b0 fs/btrfs/volumes.c:4183
    #2: ffff888052c74850 (&fs_info->cleaner_mutex){+.+.}-{4:4}, at: btrfs_relocate_block_group+0x775/0xd90 fs/btrfs/relocation.c:4086
    #3: ffff88801e3ae610 (sb_internal#2){.+.+}-{0:0}, at: merge_reloc_root+0xf11/0x1ad0 fs/btrfs/relocation.c:1659
    #4: ffff888052c76470 (btrfs_trans_num_writers){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
    #5: ffff888052c76498 (btrfs_trans_num_extwriters){++++}-{0:0}, at: join_transaction+0x405/0xda0 fs/btrfs/transaction.c:288
    #6: ffff8880545db878 (btrfs-tree-01/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189
    #7: ffff8880545dba58 (btrfs-treloc-02/1){+.+.}-{4:4}, at: btrfs_tree_lock_nested+0x2f/0x250 fs/btrfs/locking.c:189

   stack backtrace:
   CPU: 0 UID: 0 PID: 5335 Comm: syz.0.0 Not tainted 6.13.0-rc5-syzkaller-00163-gab75170520d4 #0
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
   Call Trace:
    <TASK>
    __dump_stack lib/dump_stack.c:94 [inline]
    dump_stack_lvl+0x241/0x360 lib/dump_stack.c:120
    print_circular_bug+0x13a/0x1b0 kernel/locking/lockdep.c:2074
    check_noncircular+0x36a/0x4a0 kernel/locking/lockdep.c:2206
    check_prev_add kernel/locking/lockdep.c:3161 [inline]
    check_prevs_add kernel/locking/lockdep.c:3280 [inline]
    validate_chain+0x18ef/0x5920 kernel/locking/lockdep.c:3904
    __lock_acquire+0x1397/0x2100 kernel/locking/lockdep.c:5226
    lock_acquire+0x1ed/0x550 kernel/locking/lockdep.c:5849
    down_read_nested+0xb5/0xa50 kernel/locking/rwsem.c:1649
    btrfs_tree_read_lock_nested+0x2f/0x250 fs/btrfs/locking.c:146
    btrfs_tree_read_lock fs/btrfs/locking.h:188 [inline]
    read_block_for_search+0x718/0xbb0 fs/btrfs/ctree.c:1610
    btrfs_search_slot+0x1274/0x3180 fs/btrfs/ctree.c:2237
    replace_path+0x1243/0x2740 fs/btrfs/relocation.c:1224
    merge_reloc_root+0xc46/0x1ad0 fs/btrfs/relocation.c:1692
    merge_reloc_roots+0x3b3/0x980 fs/btrfs/relocation.c:1942
    relocate_block_group+0xb0a/0xd40 fs/btrfs/relocation.c:3754
    btrfs_relocate_block_group+0x77d/0xd90 fs/btrfs/relocation.c:4087
    btrfs_relocate_chunk+0x12c/0x3b0 fs/btrfs/volumes.c:3494
    __btrfs_balance+0x1b0f/0x26b0 fs/btrfs/volumes.c:4278
    btrfs_balance+0xbdc/0x10c0 fs/btrfs/volumes.c:4655
    btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3670
    vfs_ioctl fs/ioctl.c:51 [inline]
    __do_sys_ioctl fs/ioctl.c:906 [inline]
    __se_sys_ioctl+0xf5/0x170 fs/ioctl.c:892
    do_syscall_x64 arch/x86/entry/common.c:52 [inline]
    do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
    entry_SYSCALL_64_after_hwframe+0x77/0x7f
   RIP: 0033:0x7f1ac6985d29
   Code: ff ff c3 (...)
   RSP: 002b:00007f1ac63fe038 EFLAGS: 00000246 ORIG_RAX: 0000000000000010
   RAX: ffffffffffffffda RBX: 00007f1ac6b76160 RCX: 00007f1ac6985d29
   RDX: 0000000020000180 RSI: 00000000c4009420 RDI: 0000000000000007
   RBP: 00007f1ac6a01b08 R08: 0000000000000000 R09: 0000000000000000
   R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
   R13: 0000000000000001 R14: 00007f1ac6b76160 R15: 00007fffda145a88
    </TASK>

Reported-by: [email protected]
Link: https://lore.kernel.org/linux-btrfs/[email protected]/
Fixes: 9978599 ("btrfs: reduce lock contention when eb cache miss for btree search")
Signed-off-by: Filipe Manana <[email protected]>
Reviewed-by: David Sterba <[email protected]>
Signed-off-by: David Sterba <[email protected]>
ramosian-glider pushed a commit that referenced this pull request Feb 17, 2025
libtraceevent parses and returns an array of argument fields, sometimes
larger than RAW_SYSCALL_ARGS_NUM (6) because it includes "__syscall_nr",
idx will traverse to index 6 (7th element) whereas sc->fmt->arg holds 6
elements max, creating an out-of-bounds access. This runtime error is
found by UBsan. The error message:

  $ sudo UBSAN_OPTIONS=print_stacktrace=1 ./perf trace -a --max-events=1
  builtin-trace.c:1966:35: runtime error: index 6 out of bounds for type 'syscall_arg_fmt [6]'
    #0 0x5c04956be5fe in syscall__alloc_arg_fmts /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:1966
    #1 0x5c04956c0510 in trace__read_syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2110
    #2 0x5c04956c372b in trace__syscall_info /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:2436
    #3 0x5c04956d2f39 in trace__init_syscalls_bpf_prog_array_maps /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:3897
    #4 0x5c04956d6d25 in trace__run /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:4335
    #5 0x5c04956e112e in cmd_trace /home/howard/hw/linux-perf/tools/perf/builtin-trace.c:5502
    #6 0x5c04956eda7d in run_builtin /home/howard/hw/linux-perf/tools/perf/perf.c:351
    #7 0x5c04956ee0a8 in handle_internal_command /home/howard/hw/linux-perf/tools/perf/perf.c:404
    torvalds#8 0x5c04956ee37f in run_argv /home/howard/hw/linux-perf/tools/perf/perf.c:448
    torvalds#9 0x5c04956ee8e9 in main /home/howard/hw/linux-perf/tools/perf/perf.c:556
    torvalds#10 0x79eb3622a3b7 in __libc_start_call_main ../sysdeps/nptl/libc_start_call_main.h:58
    torvalds#11 0x79eb3622a47a in __libc_start_main_impl ../csu/libc-start.c:360
    torvalds#12 0x5c04955422d4 in _start (/home/howard/hw/linux-perf/tools/perf/perf+0x4e02d4) (BuildId: 5b6cab2d59e96a4341741765ad6914a4d784dbc6)

     0.000 ( 0.014 ms): Chrome_ChildIO/117244 write(fd: 238, buf: !, count: 1)                                      = 1

Fixes: 5e58fcf ("perf trace: Allow allocating sc->arg_fmt even without the syscall tracepoint")
Signed-off-by: Howard Chu <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Signed-off-by: Namhyung Kim <[email protected]>
ramosian-glider pushed a commit that referenced this pull request Feb 17, 2025
This fixes the following hard lockup in isolate_lru_folios() during memory
reclaim.  If the LRU mostly contains ineligible folios this may trigger
watchdog.

watchdog: Watchdog detected hard LOCKUP on cpu 173
RIP: 0010:native_queued_spin_lock_slowpath+0x255/0x2a0
Call Trace:
	_raw_spin_lock_irqsave+0x31/0x40
	folio_lruvec_lock_irqsave+0x5f/0x90
	folio_batch_move_lru+0x91/0x150
	lru_add_drain_per_cpu+0x1c/0x40
	process_one_work+0x17d/0x350
	worker_thread+0x27b/0x3a0
	kthread+0xe8/0x120
	ret_from_fork+0x34/0x50
	ret_from_fork_asm+0x1b/0x30

lruvec->lru_lock owner:

PID: 2865     TASK: ffff888139214d40  CPU: 40   COMMAND: "kswapd0"
 #0 [fffffe0000945e60] crash_nmi_callback at ffffffffa567a555
 #1 [fffffe0000945e68] nmi_handle at ffffffffa563b171
 #2 [fffffe0000945eb0] default_do_nmi at ffffffffa6575920
 #3 [fffffe0000945ed0] exc_nmi at ffffffffa6575af4
 #4 [fffffe0000945ef0] end_repeat_nmi at ffffffffa6601dde
    [exception RIP: isolate_lru_folios+403]
    RIP: ffffffffa597df53  RSP: ffffc90006fb7c28  RFLAGS: 00000002
    RAX: 0000000000000001  RBX: ffffc90006fb7c60  RCX: ffffea04a2196f88
    RDX: ffffc90006fb7c60  RSI: ffffc90006fb7c60  RDI: ffffea04a2197048
    RBP: ffff88812cbd3010   R8: ffffea04a2197008   R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000001  R12: ffffea04a2197008
    R13: ffffea04a2197048  R14: ffffc90006fb7de8  R15: 0000000003e3e937
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
    <NMI exception stack>
 #5 [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
 #6 [ffffc90006fb7cf8] shrink_active_list at ffffffffa597f788
 #7 [ffffc90006fb7da8] balance_pgdat at ffffffffa5986db0
 torvalds#8 [ffffc90006fb7ec0] kswapd at ffffffffa5987354
 torvalds#9 [ffffc90006fb7ef8] kthread at ffffffffa5748238
crash>

Scenario:
User processe are requesting a large amount of memory and keep page active.
Then a module continuously requests memory from ZONE_DMA32 area.
Memory reclaim will be triggered due to ZONE_DMA32 watermark alarm reached.
However pages in the LRU(active_anon) list are mostly from
the ZONE_NORMAL area.

Reproduce:
Terminal 1: Construct to continuously increase pages active(anon).
mkdir /tmp/memory
mount -t tmpfs -o size=1024000M tmpfs /tmp/memory
dd if=/dev/zero of=/tmp/memory/block bs=4M
tail /tmp/memory/block

Terminal 2:
vmstat -a 1
active will increase.
procs ---memory--- ---swap-- ---io---- -system-- ---cpu--- ...
 r  b   swpd   free  inact active   si   so    bi    bo
 1  0   0 1445623076 45898836 83646008    0    0     0
 1  0   0 1445623076 43450228 86094616    0    0     0
 1  0   0 1445623076 41003480 88541364    0    0     0
 1  0   0 1445623076 38557088 90987756    0    0     0
 1  0   0 1445623076 36109688 93435156    0    0     0
 1  0   0 1445619552 33663256 95881632    0    0     0
 1  0   0 1445619804 31217140 98327792    0    0     0
 1  0   0 1445619804 28769988 100774944    0    0     0
 1  0   0 1445619804 26322348 103222584    0    0     0
 1  0   0 1445619804 23875592 105669340    0    0     0

cat /proc/meminfo | head
Active(anon) increase.
MemTotal:       1579941036 kB
MemFree:        1445618500 kB
MemAvailable:   1453013224 kB
Buffers:            6516 kB
Cached:         128653956 kB
SwapCached:            0 kB
Active:         118110812 kB
Inactive:       11436620 kB
Active(anon):   115345744 kB
Inactive(anon):   945292 kB

When the Active(anon) is 115345744 kB, insmod module triggers
the ZONE_DMA32 watermark.

perf record -e vmscan:mm_vmscan_lru_isolate -aR
perf script
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=2
nr_skipped=2 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=1 nr_requested=32 nr_scanned=28835844
nr_skipped=28835844 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=29
nr_skipped=29 nr_taken=0 lru=active_anon
isolate_mode=0 classzone=1 order=0 nr_requested=32 nr_scanned=0
nr_skipped=0 nr_taken=0 lru=active_anon

See nr_scanned=28835844.
28835844 * 4k = 115343376KB approximately equal to 115345744 kB.

If increase Active(anon) to 1000G then insmod module triggers
the ZONE_DMA32 watermark. hard lockup will occur.

In my device nr_scanned = 0000000003e3e937 when hard lockup.
Convert to memory size 0x0000000003e3e937 * 4KB = 261072092 KB.

   [ffffc90006fb7c28] isolate_lru_folios at ffffffffa597df53
    ffffc90006fb7c30: 0000000000000020 0000000000000000
    ffffc90006fb7c40: ffffc90006fb7d40 ffff88812cbd3000
    ffffc90006fb7c50: ffffc90006fb7d30 0000000106fb7de8
    ffffc90006fb7c60: ffffea04a2197008 ffffea0006ed4a48
    ffffc90006fb7c70: 0000000000000000 0000000000000000
    ffffc90006fb7c80: 0000000000000000 0000000000000000
    ffffc90006fb7c90: 0000000000000000 0000000000000000
    ffffc90006fb7ca0: 0000000000000000 0000000003e3e937
    ffffc90006fb7cb0: 0000000000000000 0000000000000000
    ffffc90006fb7cc0: 8d7c0b56b7874b00 ffff88812cbd3000

About the Fixes:
Why did it take eight years to be discovered?

The problem requires the following conditions to occur:
1. The device memory should be large enough.
2. Pages in the LRU(active_anon) list are mostly from the ZONE_NORMAL area.
3. The memory in ZONE_DMA32 needs to reach the watermark.

If the memory is not large enough, or if the usage design of ZONE_DMA32
area memory is reasonable, this problem is difficult to detect.

notes:
The problem is most likely to occur in ZONE_DMA32 and ZONE_NORMAL,
but other suitable scenarios may also trigger the problem.

Link: https://lkml.kernel.org/r/[email protected]
Fixes: b2e1875 ("mm, vmscan: begin reclaiming pages on a per-node basis")
Signed-off-by: liuye <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Yang Shi <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant