Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EuroCrops: handle Nones in get_label #2499

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

burakekim
Copy link
Contributor

Closes #2497

Now we return 0 if hcat_code is None, not rendering the feature.

@github-actions github-actions bot added the datasets Geospatial or benchmark datasets label Jan 3, 2025
@github-actions github-actions bot added the testing Continuous integration testing label Jan 3, 2025
@adamjstewart
Copy link
Collaborator

adamjstewart commented Jan 3, 2025

Nice catch!

It looks like HCATv2 actually has a class for this:

"not_known_and_other,3399000000"

I wonder if we should use this instead:

hcat_code = feature['properties'][self.label_name] or '3399000000'

I'm not sure if 0 is ever actually being used or not. Are there any other instances where we return 0?

@favyen2 do you remember why you added 0? Just to appease mypy?

@adamjstewart adamjstewart added this to the 0.6.3 milestone Jan 3, 2025
@burakekim
Copy link
Contributor Author

burakekim commented Jan 4, 2025

It looks like HCATv2 actually has a class for this:

Under Harmonisation with HCAT, it seems EC_hcat_c is already the label itself, and that is the variable that is assigned as None. If I understand correctly, we should discard EC_hcat_c with None since they are not assigned any value, including not_known_and_other. I am unsure whether they serve any purpose in the creation of the dataset though

I am in favor of removing the print statement that notifies the user each time we hit None, as it becomes very verbose during training.

@favyen2
Copy link
Contributor

favyen2 commented Jan 11, 2025

do you remember why you added 0? Just to appease mypy?

Do you mean for the HCAT code or for the label returned by the get_label function?

I don't think I ran into cases where HCAT code was None or 0 so it may be in a new version of the data or in a geographic area that I did not test.

For returning 0 from get_label, the function needs to return some integer for cases where the HCAT code is not in the list of codes that the user provides (i.e. the list they care about). Currently it returns the background class 0, which is also used for pixels that are not covered by any polygon. There may be cases where the user wants to use a different code for background class but for this use case I think it would work for them to add a catch-all class to the list of classes passed to the constructor since it is hierarchical (e.g. classes=["1234", "5678", "0000"], now there is a separate "other" category that will be labeled as 3 after rasterization distinct from the background class 0).

@adamjstewart
Copy link
Collaborator

Should we use 3399000000 instead of 0? I can also ask the EuroCrops folks if necessary.

@favyen2
Copy link
Contributor

favyen2 commented Jan 11, 2025

As far as I know, get_label converts the HCAT code into a class integer suitable for training the segmentation model. So after the conversion these returned class indices should correspond to output neurons from the neural model rather than the original codes.

@favyen2
Copy link
Contributor

favyen2 commented Jan 11, 2025

Oh, I guess you are saying that some users might still find it useful to train on the feature, mapping it to an "other" category (which first setting the code to 3399000000 and then mapping it based on the user-provided category list would enable) rather than being limited to mapping it to 0 which matches the background category.

I'm not sure the answer to this, it may depend on why there are these weird features in the dataset that don't have a code. I guess the safest would be to mark them invalid but otherwise without digging more into the data I think both are reasonable solutions.

@adamjstewart
Copy link
Collaborator

Oh, I guess you are saying that some users might still find it useful to train on the feature, mapping it to an "other" category (which first setting the code to 3399000000 and then mapping it based on the user-provided category list would enable) rather than being limited to mapping it to 0 which matches the background category.

I'm actually saying the exact opposite. I'm wondering if the background class, None, and "not_known_and_other" should all be mapped to the same value. Otherwise, users will have to add all three to ignore_index during training.

@burakekim
Copy link
Contributor Author

I think it would be worth reaching out to EuroCrops people to ask whether None and not_known_and_other serve any specific purpose across the entire dataset. It is best to find out if there is any reason behind this design choice

@adamjstewart adamjstewart changed the title handle Nones in get_label EuroCrops: handle Nones in get_label Jan 12, 2025
@favyen2
Copy link
Contributor

favyen2 commented Jan 12, 2025

I'm actually saying the exact opposite. I'm wondering if the background class, None, and "not_known_and_other" should all be mapped to the same value. Otherwise, users will have to add all three to ignore_index during training.

I think there would be cases where users want to ignore None / not_known_and_other but not background, to try to have the model distinguish areas that are not crop fields.

@adamjstewart
Copy link
Collaborator

Reply from the EuroCrops folks:

Depending on the version of EuroCrops you downloaded some of the None fields either represent duplicated field parcels left blank intentionally or nothing was provided by the source data to begin with. In the latter case, background class would be the way to go. For the former, it depends on what you want to achieve. These are duplicate field parcels with either the same or different crop labels, since most people only wanted one label per parcel, we chose to only map the first appearing parcel/largest area. Depending on your needs you could also do your own processing since the original label is still in the file.
Regarding not_known_and_other these are predominately either artificial built-up areas or water bodies. Most crops that are non-specific/not-known are in other arable crops or the like. Hope this helps!
We are also very close to releasing a new HCAT so maybe everything would be clearer then?

@burakekim
Copy link
Contributor Author

some of the None fields either represent duplicated field parcels

These are duplicate field parcels with either the same or different crop labels, since most people only wanted one label per parcel

Does this mean some fields have multiple labels, and only one is used while the rest are mapped to None? If so, it is better not to map None to any other label.

Regarding not_known_and_other these are predominately either artificial built-up areas or water bodies.

This feels more like background to me.

@adamjstewart
Copy link
Collaborator

I guess I don't have strong opinions at the moment. We should make a decision here and get this merged though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
datasets Geospatial or benchmark datasets testing Continuous integration testing
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EuroCrops get_label cannot handle features with None attributes
3 participants