
Add content reader to decouple NFS file reads [RHELDST-26339] #643

Draft · wants to merge 2 commits into master
Conversation

rajulkumar (Contributor)

The API to read the content of push items strongly depends on their presence as files on an NFS or locally-mounted filesystem. However, with the advent of new push item sources, it is likely that push items might not be stored as files or be locally available.

Hence, add a new content reader interface that will be implemented by content readers based on the source or content type. The readers can then be used by specific push item types to get their bits from the respective source.

Here, the File content reader implements the interface and is used by default for all push items. This reader reads the bits from the path provided as the push item's source and returns a file-like object as the push item's content attribute.

rajulkumar (Contributor, Author) commented Dec 4, 2024

@rohanpm Can you please review whether this is along the expected lines?
I tried using this patch with release-engineering/pubtools-pulp#324 and it seems to work as expected.

@rohanpm (Member) left a comment

I think currently this is a bit more complicated than what the issue was asking for. It looks like it can be simplified.

Although the issue is not asking for this yet, if I were working on this I would probably hack one of the Source backends to actually start using HTTP requests for fetching content. Just as a local change, not something to be pushed. The koji backend may be a good candidate, since koji is designed to serve up content via HTTP.

Doing that would help ensure that the new API is genuinely usable for the intended purpose.

content type reader.

Returns:
:class:`~pushsource.ContentReader`
rohanpm (Member)

I think this needs some rewording:

  1. I don't think we should be documenting that we return a ContentReader here. Can we just say that we're returning an io.BufferedIOBase? I don't think ContentReader should be public API at all; I don't see a need to introduce our own API for file-like objects.

  2. It shouldn't say "reading the content from the src". Yes, that is what the code does, but that's not what the API should promise to do. The main point of this change is to set us up for different ways of reading content in the future, with the user of this API not having to know or care about that. For example, this might do something like: read content remotely by doing an HTTP request.

The doc string should be generic enough to cover all that. Basically it should just say that it obtains a non-seekable read-only file-like object for accessing the content's bytes, but it should not say anything about how this is done.
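To make the suggestion concrete, a docstring reworded along those lines might look like the following sketch (FileBackedItem is a hypothetical stand-in, not the actual pushsource class):

```python
import io


class FileBackedItem:
    """Hypothetical stand-in for a push item whose 'src' is a local path."""

    def __init__(self, src):
        self.src = src

    def content(self):
        """Obtain a read-only, non-seekable file-like object for accessing
        this item's bytes.

        How the bytes are obtained (local file, HTTP request, ...) is an
        implementation detail which callers should not rely on.

        Returns:
            io.BufferedIOBase: a stream of this item's content.
        """
        # Today this happens to open a local file; the docstring
        # deliberately does not promise that.
        return open(self.src, "rb")
```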

"""
return ContentReader.get("file:" + self.src)

def exist(self):
rohanpm (Member)

What's this for? I don't think we really need it, do we?

I see this is used from wait_exist. If it's only for internal use it shouldn't be public API, but I don't see even why it's needed internally.

rajulkumar (Contributor, Author)

Yeah, this is used by wait_exist. I assumed that, depending on the push item source, there might be different ways to check for existence; hence I delegated it to the push item or content reader. Though, as per the comment elsewhere, this won't be needed.

LOG = logging.getLogger("pushsource")


class ContentReader(BufferedIOBase):
rohanpm (Member)

I have some issues with this class:

  • Firstly I think the concept of register_reader and get is not needed. The issue didn't ask for anything like this. I would expect in most cases whatever Source backend is providing a push item knows exactly what reader makes sense, and doesn't benefit from this additional layer of abstraction.
  • I think the concept of a base class for all readers doesn't really work either. Again looking at the idea of doing an HTTP request as an example: the HTTP client API is probably going to itself give you a file-like object. You can't make the client return some subclass of ContentReader. All you can do is accept the object it gives you, and wrap it in something else if you need to. So, this would seem to point us towards composition (rather than inheritance) being the right approach here.

Considering what this class actually needs to do beyond the standard file-like object API, I believe there are just two points requested:

  • it should have a name attribute (str)
  • it should not be seekable

So, I think the better design would be to have a class not designed to be subclassed, but designed to wrap other io objects and enforce these two behaviors.

io.BufferedReader is already designed to wrap another stream, so I think we could just extend that slightly to add a name attribute and enforce non-seekability.

Something like this:

from io import BufferedReader, DEFAULT_BUFFER_SIZE, UnsupportedOperation

class PushItemReader(BufferedReader):
    # A BufferedReader which always has a name and is always unseekable.
    def __init__(self, raw, *, name, buffer_size=DEFAULT_BUFFER_SIZE):
        super().__init__(raw, buffer_size)
        self._name = name

    @property
    def name(self):
        # BufferedReader.name is a read-only property delegating to the
        # raw stream, so expose the caller-supplied name instead.
        return self._name

    def seekable(self):
        return False

    def seek(self, pos, whence=0):
        raise UnsupportedOperation(f"Seek unsupported while reading {self.name}")

This would be a private class, not exposed or documented as public API.

Then the way it'd be used internally is something like this:

# Example of simple 'open file' case
def content(self):
    return PushItemReader(open(self.src, "rb"), name=self.src)

# (possible future) example of HTTP case
def content(self):
    r = requests.get(self.src, stream=True)
    check_success(r)
    return PushItemReader(r.raw, name=self.src)

rajulkumar (Contributor, Author) commented Dec 5, 2024

Thank you for sharing this helpful insight. It seems I over-analysed and over-complicated it, as you mentioned.

My understanding is that the Source backend gets the list and the metadata of the push items. There might be a case where the push items reside somewhere else, in a separate artifact repository.
Hence, the Source backend only knows the artifact repo and returns the corresponding PushItem with the metadata. E.g. Errata Tool provides push item metadata for items that may reside in an object store like S3, some other location reachable via HTTP, etc., or a mix of them depending on the push item types or creation time (old ones on brew, new ones on some other artifact store). So the ErrataSource backend will provide the push items accordingly, without getting into how to fetch them.

I thought it's the PushItem that should be responsible for getting the actual content from the artifact repo using the applicable PushItemReader. The PushItem delegates all the responsibility for getting the content to the PushItemReader, including getting the raw IO object, wrapping it, and any other steps like auth, config etc. that are required to fetch the content from the artifact repo. This could decouple the entire content-fetching responsibility from the Source or PushItem.

Eventually, if there is a new artifact location or a new workflow involving a new backend or push item type, a new PushItemReader would be registered to be used by the corresponding PushItem.

Extending io.BufferedReader in PushItemReader and wrapping the raw IO object definitely helps, and doesn't require this ContentReader class.

I understand that considering new backends, push items, artifact repos etc. might be overkill for now. However, what do you think about extending PushItemReader for fetching a specific content source or content type?

rohanpm (Member)

I haven't fully worked through the idea of how exactly to make the content/open method actually do something other than open(self.src). I would rather not commit to anything until it comes time to really implement an alternative method of opening, as I see this issue as being primarily about allowing the users of pushsource to prepare for alternative methods of opening. We need to change the interface to achieve that, but implementation details can be left for later.

But anyway, here's a concrete idea which I would use as a starting point.

The PushItem class can have an 'opener' attribute, something like this:

opener = attr.ib(default=open_src_local, repr=False)

(Originally I was thinking this should probably be private, but since we do allow Source implementations to be registered from outside of pushsource library, I guess it should actually be public.)

The default implementation would be something like:

def open_src_local(item):
    # default opener for push items, assumes that the item's
    # 'src' points at a locally-accessible file
    return open(item.src, 'rb')

Then a Source implementation can change the method used to open by passing a value during push item creation. For example, in KojiSource where RpmPushItems are currently instantiated:

        return [
            RpmPushItem(
                name=os.path.basename(rpm_path),
                src=rpm_path,
                dest=self._dest,
                signing_key=rpm_signing_key,
                build=build["nvr"],
            )
        ]

The code could potentially be updated to do something like:

        kwargs = dict(src=rpm_path)
        # We still default to reading files locally, but when koji source was constructed,
        # was an argument passed to request downloading over HTTP?
        if self._download_files:
          rpm_url = os.path.join(koji_download_root, rpm_path)
          kwargs['src'] = rpm_url
          kwargs['opener'] = utils.open_src_http

        return [
            RpmPushItem(
                name=os.path.basename(rpm_path),
                dest=self._dest,
                signing_key=rpm_signing_key,
                build=build["nvr"],
                **kwargs,
            )
        ]
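For reference, the `utils.open_src_http` helper mentioned above does not exist yet; a minimal sketch of what such an opener might look like, assuming the requests library is available, could be:

```python
import requests


def open_src_http(item):
    """Hypothetical opener: assumes item.src is an HTTP(S) URL and
    returns a file-like object streaming the response body."""
    response = requests.get(item.src, stream=True)
    response.raise_for_status()
    # Let urllib3 transparently decode Content-Encoding (e.g. gzip)
    # so callers see the actual payload bytes.
    response.raw.decode_content = True
    return response.raw
```

The returned `response.raw` is itself a file-like object, which fits the composition approach: the caller can wrap it in PushItemReader rather than requiring the HTTP client to return some special subclass.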

Still, I would not recommend adding anything like this now (except perhaps as a local experiment).
This change is first and foremost about making it possible for users of this library to transfer the responsibility for reading content into the pushsource library; it does not yet require us to make all the design decisions related to supporting alternative methods of reading.

@@ -63,7 +63,7 @@ def __iter__(self):
if not hasattr(item, "src") or not item.src or not item.src.startswith("/"):
yield item
else:
wait_exist(item.src, timeout, poll_rate)
wait_exist(item, timeout, poll_rate)
rohanpm (Member)

I think this change is probably not needed, or we should find some way to limit when this happens.

The logic here was introduced purely to support a replication problem between some specific NFS mounts. I believe it might be obsolete now, but even if not - this is NOT something I think we should support for anything other than this known NFS case.

I don't think this particular bit of code needs to support different ways of reading the item, I think it is reasonable to take this position:

  • if content is later customized to do something other than just open the file: we should try hard to avoid a situation where the new method of opening would for some reason need to poll for the content's existence
  • if we really can't avoid that situation: then it is within the scope of the content implementation to do the polling. For instance if there is an implementation of content which fetches content via HTTP requests, it could loop a few times in case of 404 errors, if and only if it's known to make sense in that particular case.

Overall, I would suggest leaving this code alone for now. It's not necessary to make this change to satisfy the main goal of this issue. If wait_exist is still needed at all, it's only needed for this specific NFS replication issue, but it's not obvious to me how to make the code only be applied in that case.

I think it'd be better to revisit it when it actually comes time to introduce a different implementation of content, and with any luck wait_exist could be completely removed by then.

@@ -248,3 +249,19 @@ def with_checksums(self):
updated_sums[attribute] = hasher.hexdigest()

return attr.evolve(self, **updated_sums)

@property
def content(self):
rohanpm (Member)

I'm not sure this should be a property, I think a plain method fits better.

It will work either way, but I think developers usually have an expectation that properties are cheap, and not subject to failure. It doesn't fit this case well, as this could potentially have to do HTTP requests or other slow and error-prone operations.

It also returns a different object on each call, which is not how a property would usually work. For instance someone might write code like this, intending to read the content in chunks:

while chunk := item.content.read():
  # oops, this is an infinite loop because every iteration
  # opens a new file and reads from the beginning!

This code is wrong but it doesn't look wrong. I think the above would more clearly stand out as wrong if it were a method rather than a property.
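For contrast, with content as a method the correct chunked-read pattern is unambiguous. A self-contained illustration (DemoItem is a stand-in for a push item, not the real class):

```python
import io


class DemoItem:
    """Stand-in for a push item whose content() returns a fresh stream."""

    def __init__(self, data):
        self._data = data

    def content(self):
        # Each call returns a new stream, as in the proposed API.
        return io.BufferedReader(io.BytesIO(self._data))


item = DemoItem(b"example bytes " * 100)

# Correct: obtain the stream once, then read it in chunks.
chunks = []
with item.content() as f:
    while chunk := f.read(1024):
        chunks.append(chunk)

assert b"".join(chunks) == b"example bytes " * 100
```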

rajulkumar (Contributor, Author)

Thanks for pointing this out. FileContentReader uses the same file object; however, I missed that this created a new FileContentReader every time.

rajulkumar (Contributor, Author) commented Jan 17, 2025

I tried caching it, but static analysis failed with "W0642: Invalid assignment to self in method (self-cls-assignment)":

if not getattr(self, "_content", None):
    self = attr.evolve(self, _content=PushItemReader(self.opener(self)))
return self._content

So I thought of using a weakref.WeakKeyDictionary() as a cache, keeping the reader for each instance, but then it might return a new object if garbage collected, and the cache size might grow too large.

Hence, I left it as-is, assuming it's implied that it will return a new object now that it's a method rather than an attribute and is not bound/expected to remain the same.

rohanpm (Member)

Yes, I think the way you've got it now is fine.

The API to read the content of push items strongly depends
on their presence as files on an NFS or locally-mounted filesystem.
However, with the advent of new push item sources, it is likely that
push items might not be stored as files or be locally available.

Hence, push items will have an `opener` that is capable of fetching
the corresponding bits and providing them via `content()`. The `Source`
will define the `opener` while creating the push items, so that the
bits are fetched from the specific content source. This transfers the
responsibility for fetching the content from the consumer of the
content/push items to the `Source` and the corresponding push item,
abstracting the reading mechanism from the user and providing
flexibility to fetch from different locations/protocols.
@@ -178,6 +180,10 @@ def _default_build_info(self):
doesn't enforce this.
"""

opener = attr.ib(type=callable, default=open_src_local, repr=False)
"""Callable that gets the content of this push item. The content could be
retrived from `content()` method"""
rohanpm (Member)

Suggested change
retrived from `content()` method"""
retrieved from `content()` method"""

rohanpm (Member)

I think this also needs to be a bit more specific regarding the contract that an "opener" needs to satisfy.

I would write something like:

The opener, when given a push item, should return a file-like object suitable for reading this item's bytes.


@@ -248,3 +254,12 @@ def with_checksums(self):
updated_sums[attribute] = hasher.hexdigest()

return attr.evolve(self, **updated_sums)

def content(self):
"""Returns a read-only, non-seekable content of this push item.
rohanpm (Member)

I think we should probably elaborate on this a bit with an example, and also clarify what's expected for types of push items not designed to be read in this way.

Here's my attempt at fleshing out the docs a bit:

For push items representing a single file, content will obtain a stream for reading that file's bytes;
for example, RpmPushItem.content can be used to read the content of an RPM; VMIPushItem.content
can be used to read the content of a VMI and so on.

Not every type of push item can be read in this way. For example, a single ContainerPushItem
may represent any number of artifacts making up a container image, to be accessed from a container
image registry in the usual way - not using this method. Other types of push items such as
ErratumPushItem may be understood as metadata-only items and do not themselves have any content. For items such as these, this method will return None.

:class:`~io.BufferedReader`
A non-seekable object of the push item content
"""
return PushItemReader(self.opener(self))
rohanpm (Member)

As mentioned in my proposed doc addition, I think we need to make this also support a "return None" case, given there are many types of push items where this doesn't make sense.

My suggestion is that opener should be allowed to be None. If it is None, then content should also return None. Then set up non-openable types such as ContainerPushItem to initialize opener to None.

One thing I'm not sure of: should opener on the base class default to open_src_local and then be overridden to None in non-openable types? Or should it default to None in the base class and then be overridden to open_src_local in openable types?

At the moment, I'm leaning towards saying it should default to None in the base class.
The reason is that, when someone introduces a new PushItem subclass, I think it makes sense for the default behavior to be "don't support reading - unless the developer thinks about it and implements it". If it instead defaults to open_src_local on the base class, then the default behavior will be "try to read, and crash if that doesn't make sense". This will probably lead to PushItem subclasses where content doesn't make sense but has not been gracefully overridden to return None.
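A minimal sketch of that suggestion (DemoPushItem and open_src_local here are illustrative stand-ins for the attrs-based classes, not the real pushsource implementation):

```python
def open_src_local(item):
    # Default opener for openable types: 'src' is a locally accessible file.
    return open(item.src, "rb")


class DemoPushItem:
    """Stand-in for the base PushItem, with opener defaulting to None."""

    def __init__(self, name, src=None, opener=None):
        self.name = name
        self.src = src
        self.opener = opener

    def content(self):
        # Metadata-only or registry-backed items have nothing to read here.
        if self.opener is None:
            return None
        # The real implementation would wrap this in the private
        # PushItemReader to enforce a name and non-seekability.
        return self.opener(self)


# A metadata-only item (e.g. an erratum) yields no content stream:
assert DemoPushItem(name="RHSA-2024:0001").content() is None
```

Openable types (or the Source constructing them) would then pass opener=open_src_local explicitly.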

def __init__(self, raw, name=None, **kwargs):
super(PushItemReader, self).__init__(raw, **kwargs)

# Attempt to assign name from the raw object if none is provided
rohanpm (Member)

I think that it might be better to drop this, and to just always force a name to come from the caller. Every push item has a name attribute, so maybe that can just be used.

The way you have it now, when combined with the current content implementation, means that the contract of opener is not only "returns a file-like object" but the more complicated "returns a file-like object with a non-empty name attribute".

I believe a lot of io streams do not have the name attribute, including those we're likely to see used in the future, such as the stream of an HTTP response body. So, insisting that opener must return something with a name seems to make the API more difficult to use.

@@ -24,20 +24,32 @@

LOG = logging.getLogger("test_baseline")

EXCLUDED_ATTRIBUTES = [
rohanpm (Member)

This looks like it might be a remnant of the past, as in the end content is just a plain method and not an attr.ib - it should already not appear in the output of asdict and therefore not need any filtering.


return NoNameBR(open(item.src, "rb"))


rohanpm (Member)

We'll need tests somewhere for the proposed "returns None" case too, e.g. attempting to use content on an ErratumPushItem, ContainerPushItem.

"""Read pushitem content from the source"""
item = PushItem(name="test", src=ITEM_SRC)

assert item.content().name == ITEM_SRC
rohanpm (Member)

I would strongly suggest that at least one test is using content via a with statement. This is the most likely way the API will actually be used in practice, and if not tested there is a possibility we've messed it up somehow.

e.g. this test could look more like:

with item.content() as f:
  assert f.name == ITEM_SRC
  assert ...

@@ -59,6 +63,7 @@ def add_enum_representers(cls):

ItemDumper.add_enum_representers()
ItemDumper.add_representer(frozendict, ItemDumper.represent_dict)
ItemDumper.add_representer(type(lambda: None), ItemDumper.represent_callable)
rohanpm (Member)

I would prefer to just filter these out rather than dumping them. (i.e. do not show opener field here, same as we don't show it in repr.)

Firstly because there is no way opener can survive a round-trip anyway, since we only serialize the callable's name and not the code. I'm pretty sure prior to this change, you can take the output of "pushsource-ls" and paste it into a python file and it'll actually work to construct push items, but that will be broken by this.

On top of that, this leaks some internal implementation details. open_src_local isn't public API, but its name is going to become visible here if we don't filter it out. If people actually use the serialized opener field for something, that can potentially block us from renaming or refactoring it later.
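One way to do that filtering with attrs is via the filter argument of attr.asdict; a sketch (DemoItem is a stand-in for the real PushItem):

```python
import attr


@attr.s
class DemoItem:
    """Stand-in for the attrs-based PushItem."""

    name = attr.ib(type=str)
    src = attr.ib(type=str, default=None)
    opener = attr.ib(default=None, repr=False)


def no_opener(attribute, value):
    # Exclude 'opener' from serialized output, mirroring repr=False.
    return attribute.name != "opener"


item = DemoItem(name="test", src="/some/file", opener=open)
assert attr.asdict(item, filter=no_opener) == {"name": "test", "src": "/some/file"}
```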
