Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shell fails on SBRK out of heap (address wrap) on script #2180

Closed
wants to merge 0 commits into from

Conversation

floriangit
Copy link
Contributor

Note, that with this change I'm running into the SBRK FAIL, OUT OF HEAP (address wrap) message you added in:
#2134

As I'm puzzled why my changes would trigger that, I've created this PR.

@floriangit
Copy link
Contributor Author

floriangit commented Jan 11, 2025

When I simply put an exit 0 at this position:
https://github.com/ghaerr/elks/pull/2180/files#diff-a315aa3faca9cc819bbc35f960bf9a47dd24420634950b50df8dd2a96dcb11f3L119

OUT OF HEAP still triggers. So I assume it has something to do with the filesize rather than its commands/code?

Edit: Yea, maybe. Before it was 3070 bytes and now it's 4229 bytes. Crosses a magic 4096 (page?) size?

@ghaerr
Copy link
Owner

ghaerr commented Jan 11, 2025

Try increasing /bin/sh's heap size in elkscmd/ash/Makefile. It's set to 12k to try to keep the resident working set small since at least one shell is almost always running. This can be fine-tuned later if this is found to be the issue. The shell data segment is also alway duplicated entirely on 'fork' so keeping it small helps run programs when memory is extremely tight.

Regarding the changes to the script, I would recommend adding a -p option to mknod instead of using 'test' with all the duplicated script code.

Before it was 3070 bytes and now it's 4229 bytes. Crosses a magic 4096 (page?) size?

The 8086 has no notion of page size, since there is no MMU. However, /bin/sh itself could be requiring more heap with an increased script size, I am not familiar with the internals of shell script processing but can figure that out if this continues to be a problem.

The bin/sys script itself had been previously meticulously crafted down to below 3K bytes to keep the distribution image size smaller, since it is on every disk image. Of course, it will have to cross 3K with this enhancement, but can easily be made less than 4k bytes with the mknod -p change.

@floriangit
Copy link
Contributor Author

floriangit commented Jan 11, 2025

Thanks for the input. I'll explore the -p option.
HOWEVER, I was digging around your (massive) alloc changes, which had not been on my radar before. So my bias was more that your changes broke something (sorry, haha) and I found:

In my feature branch I still find libc/malloc/v7malloc.c which you renamed/moved in #2176.

This makes me believe that I have not a synced source tree. When I switch to my master the file is away! My git tree was made synced with yours before I started creating the feature branch today. And git pull says up-to-date regardless of my branch or master. I'm a bit puzzled, maybe I need to learn more re. git which keeps sometimes surprising me.
Any hints?

If not, I think your comment sounds "it's kind of expected" and I will check the options your wrote.

@ghaerr
Copy link
Owner

ghaerr commented Jan 11, 2025

maybe I need to learn more re. git which keeps sometimes surprising me.
Any hints?

I'm definitely not a git wizard, and don't normally like renaming files as I believe it might affect the searchability for file history.

If you think you're tree is screwed up, sometimes I will clone a new tree from start then move the cross/ directory over from the old tree so as to not have to recompile the GCC toolchain.

@ghaerr
Copy link
Owner

ghaerr commented Jan 11, 2025

I was digging around your (massive) alloc changes,
In my feature branch I still find libc/malloc/v7malloc.c which you renamed/moved

Neither of the alloc changes should affect any of this since /bin/sh doesn't use v7malloc (now named dmalloc.c). I do believe it strange that your tree still has v7malloc.c, as that was renamed as you stated.

@ghaerr
Copy link
Owner

ghaerr commented Jan 15, 2025

Hello @floriangit,

This makes me believe that I have not a synced source tree.

Have you determined whether this issue resulted from a repo sync issue, or is there a heap space problem with /bin/sh that needs to be looked into? If the former, please close this issue, otherwise I'll leave this open and lets work to get this duplicated on my side so I can trace down the issue.

Thank you!

@floriangit
Copy link
Contributor Author

Hi @ghaerr,

FWIW, I deleted as suggested my local repo and cloned again. I then tried things again with my "noisy/big sys".

I was still running out of heap, then 1 doubled the heap size in ash/Makefile, still same. I then quadrupled to 48k heap and still the same. I stopped at that point. 48k sounds a lot, though, so maybe there is another reason? I will raise it again when it should happen.

@ghaerr
Copy link
Owner

ghaerr commented Jan 15, 2025

I then quadrupled to 48k heap and still the same.

It sounds to me that there is a problem with /bin/sh and an address wrap on the sbrk system call, for some reason. Do you still have the offending shell script?

The "SBRK OUT OF HEAP (address wrap)" message indicates that the amount being requested to increase/decrease the heap overflows the 16-bit add/subtract and wraps - which is a bit different than just requesting a smaller amount of heap that isn't available. It may be the shell is doing this for some reason with your offending script. If I can duplicate with your original script, I can reopen this issue and debug it.

@ghaerr ghaerr changed the title add more error handling Shell fails on SBRK out of heap (address wrap) on script Jan 15, 2025
@floriangit
Copy link
Contributor Author

https://github.com/floriangit/elks/tree/heap_hunters

This fails for me when executing sys /dev/fd1 in qemu when booted from /dev/fd0.
The only changed files are sys and qemu.sh.

Do you need a PR?

@ghaerr
Copy link
Owner

ghaerr commented Jan 15, 2025

Do you need a PR?

No - I pulled your sys and will attempt duplication later today, thanks!

@ghaerr
Copy link
Owner

ghaerr commented Jan 16, 2025

@floriangit,

I pulled your sys from heap_hunters and ran it. It immediately failed with SBRK 6338 FAIL, OUT OF HEAP SPACE. This means an additional 6338 bytes of heap are needed:
fail

I then changed the following line in elkscmd/ash/Makefile:

-LDFLAGS += -maout-heap=12288 -maout-stack=2304
+LDFLAGS += -maout-heap=20480 -maout-stack=2304

Then I ran "rm elkscmd/ash/ash; make". Note that not removing elkscmd/ash/ash will result in not relinking the executable, and thus having no effect. I suspect this is what may be happening in your previous testing. The script then ran without memory errors, as follows:
after

Notice however that sys is getting "Can't create symlink errors: file exists", which likely means that the symlink should be rm -f before execution. Please plan to add that to your PR when fixing mknod -p return value.

I am unable to get the "SBRK ... (address wrap)" error - only the FAIL error. As stated above, these are different. Which error are you getting? If not address wrap, then the problem is just not enough heap for the shell: I quickly changed this from 12k to 20k and everything OK. If you get the address wrap error, then we have some other issue. Please change the heap setting to 20k in the sys fix PR also. I also notice that this script is not the same as your original failing script, as it contains a mknod -p which was not present initially. Thus we may be debugging a different issue than what you first stated as the problem.

Try all this on your system and let me know what you find, using your improved script w/o the test etc in it. We also likely need to add set -e/set +e as well, which will likely then fail the script on symlink creation.

If you can get your original script to produce (address wrap) error, I would be very interested in seeing that.

Thank you!

@floriangit
Copy link
Contributor Author

Thanks for checking all this. The error is the same as yours indeed. So I might have mixed that small semantics up. Of course I did not know to manually delete the ash binary before building (needs build system fix, no?), so I did not do that, which explains I could not fix the error by changing the heap on my end.

Noted the 'rm -f' hint for sys and the bump to 20k for the other PR, thanks.

I also notice that this script is not the same as your original failing script, as it contains a mknod -p which was not present initially

Also this is thankfully no problem. I just added the test -b ... conditions on top of my tree to repro the issue at hand, I can delete the branch, which was only for the purpose of our debugging efforts then.

@ghaerr
Copy link
Owner

ghaerr commented Jan 17, 2025

@floriangit, I think I've finally figured all this out. As you have seen, #2186 fixes the sys script and doesn't require a larger heap for /bin/sh. It appears the reason for this is the use of the shell test builtin - that's what seems to be requiring more heap.

It appears you ran into problems because elkscmd/ash/ash was not being rebuilt when elkscmd/ash/Makefile was modified. Is this a build system bug? I don't know - in order to fix this, every single Makefile would have to add "Makefile" as a dependency for every single target. I would think that make itself might want to see that Makefile has been modified and rebuild everything, rather than adding "Makefile" as a dependency everywhere. I'll leave this to another day.

So for now, we don't need to change anything. I've made a note to look further into this at a later time; I haven't learned yet whether the heap overflow is on execution of the test builtin, or just that the script itself is larger. But sys itself as committed does not require a larger heap, which is good, since /bin/sh is resident for all logins, and the data segment is forked for every command executed - requiring double the space.

@floriangit
Copy link
Contributor Author

Thanks, I see you brought it all together now and saved me some work!

I haven't learned yet whether the heap overflow is on execution of the test builtin, or just that the script itself is larger.

I mentioned earlier, that I placed an exit 0 as first instruction into sys to test it. I got out of heap directly, which is a strong hint, that's is about the file size, not the file instructions.

I don't know - in order to fix this, every single Makefile would have to add "Makefile" as a dependency for every single target.

I never thought about this to be honest. I implicitly expected make to figure out when the Makefile it uses has been changed. Seems not to be true after a short test ;-) I'll leave that in your TODO bucket, I don't mind much really but thought that in so many projects it just works, that I thought it's worth mentioning.

I'll test your sys and other changes on my 286 and write again if I see something unexpected.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants