
Log and Uniform Binning for Random Forest Classifier#33

Open
theabbybault wants to merge 36 commits intoLSSTDESC:masterfrom
theabbybault:master


@theabbybault

I looked at how different binnings affect the scores of the random forest classifier, focusing on log binning and uniform binning. For the log binning, I created evenly spaced numbers on a log scale and then put galaxies into the bins based on their percentile. For the uniform binning, I drew the bin edges from a uniform random distribution and sorted the galaxies into the resulting bins. The binning for each method with 10 bins is shown below.
log:
RandomForest_log_{'bins'_ 10, 'colors'_ True, 'errors'_ True}_riz
uniform:
RandomForestUniform_{'bins': 10, 'colors': True, 'errors': True}_riz
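As a rough illustration of the two schemes described above, here is a minimal sketch using only the Python standard library. It is not the code in this PR: the function names, the `min_fraction` parameter (the smallest log-spaced percentile, as a fraction), and the choice to pin the outer edges to the minimum and maximum redshift are all assumptions made for the example.

```python
import math
import random


def log_spaced_fractions(n_bins, min_fraction=0.01):
    """Return n_bins + 1 cumulative fractions in [0, 1]; requires n_bins >= 2.

    The fractions above 0 are evenly spaced in log10 between
    min_fraction (an assumed parameter) and 1.
    """
    step = -math.log10(min_fraction) / (n_bins - 1)
    upper = [min_fraction * 10 ** (i * step) for i in range(n_bins)]
    upper[-1] = 1.0  # avoid floating-point drift on the last edge
    return [0.0] + upper


def quantile(sorted_values, fraction):
    """Linearly interpolated quantile of an already sorted list."""
    position = fraction * (len(sorted_values) - 1)
    lower = int(math.floor(position))
    upper = min(lower + 1, len(sorted_values) - 1)
    weight = position - lower
    return sorted_values[lower] * (1 - weight) + sorted_values[upper] * weight


def log_percentile_edges(redshifts, n_bins, min_fraction=0.01):
    """Bin edges placed at log-spaced percentiles of the redshift sample."""
    ordered = sorted(redshifts)
    fractions = log_spaced_fractions(n_bins, min_fraction)
    return [quantile(ordered, f) for f in fractions]


def random_uniform_edges(redshifts, n_bins, seed=None):
    """Bin edges whose interior values are drawn from a uniform distribution."""
    rng = random.Random(seed)
    z_min, z_max = min(redshifts), max(redshifts)
    interior = sorted(rng.uniform(z_min, z_max) for _ in range(n_bins - 1))
    return [z_min] + interior + [z_max]
```

With log-spaced percentiles the lowest bins hold very few galaxies and the highest bin holds most of them, whereas the uniform draw ignores the redshift distribution entirely and depends only on the seed.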

I also tried combining some of the bins for both the log and uniform binning. I started with 5 bins and merged 2 of them, leaving 4 bins. For each method I was only able to combine bin 0 with bin 1, 2, or 3. The combined bins were renamed, so the plot legends might be a bit confusing (without the renaming, the calculations threw an error). An example of combined bins for the uniform binning:
RF_Uniform_CombineBins_{'bins'_ 5, 'colors'_ True, 'errors'_ True}riz 0, 1
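The merge-and-rename step can be sketched as below. This is a hypothetical helper, not the PR's implementation, and it assumes the renaming exists to keep the bin labels contiguous from 0 after the merge (the source of the original error is not stated, so that is a guess).

```python
def combine_bins(labels, first, second):
    """Merge bin `second` into bin `first`, then renumber labels from 0."""
    merged = [first if label == second else label for label in labels]
    # Map the surviving labels, in ascending order, onto 0..n-1.
    remap = {old: new for new, old in enumerate(sorted(set(merged)))}
    return [remap[label] for label in merged]


# Five bins (0..4); merging bins 0 and 1 leaves four bins labelled 0..3.
labels = [0, 1, 2, 3, 4, 1, 0]
combined = combine_bins(labels, 0, 1)
```

After renumbering, the original bin 2 becomes bin 1, bin 3 becomes bin 2, and so on, which is why a legend built from the new labels no longer matches the original bin numbers.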

The notebook showing the results, 'log_uniform_and_combined_bins.ipynb', is in the main part of the repository. Below is an example plot showing the scores for each metric, calculated using Jax with log binning:
fig

There are more plots similar to this in the notebook.

@EiffL EiffL added the entry Challenge entry label Sep 1, 2020
@EiffL
Member

EiffL commented Sep 1, 2020

Thank you for your entry @theabbybault ! This is super interesting! So if I understand this right, aside from anything else, one should prefer log binning :-)

theabbybault and others added 8 commits September 7, 2020 14:10
merged all methods into one file
merged all methods into one file
merged all methods into one file
merged all methods into one file
File has options for log, random (previously called uniform) and combining bins, as well as setting a seed.
no longer needed. adding all methods to one file eliminated the need to change this file.
removes unnecessary files and update others to include new updates to code
@theabbybault
Author

Since my original submission, I've updated my methods and run them with more bins. I renamed the 'uniform' method to 'random', since it was suggested to me that 'random' made more sense (the method finds z_edges by drawing from a random uniform distribution).

I've merged my methods into one file, and you can select which binning type you want in the yaml file (examples are in example/funbins*.yaml). The classifier name is now funbins as well.

The results out to 20 bins are shown in the plots in the jupyter notebook that is in the home directory. I'll include the plots here for simplicity.

For the log binning out to 20 bins, each metric:
log_scores

For the random binning out to 20 bins, each metric:
random_scores
I suspect the large jump between bins 14 and 16 is due to the seed I used (123).
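One way to gauge how much the seed matters is to redraw the random edges for several seeds and compare the resulting bins. The sketch below only looks at the edges themselves (here, the narrowest bin width per seed); it does not rerun the classifier or the metrics, and the function names and redshift range are hypothetical.

```python
import random


def random_edges(z_min, z_max, n_bins, seed):
    """Random-uniform bin edges for one seed, with fixed outer edges."""
    rng = random.Random(seed)
    interior = sorted(rng.uniform(z_min, z_max) for _ in range(n_bins - 1))
    return [z_min] + interior + [z_max]


def narrowest_bin_width(edges):
    """Width of the narrowest bin defined by a sorted list of edges."""
    return min(b - a for a, b in zip(edges, edges[1:]))


def narrowest_width_by_seed(z_min, z_max, n_bins, seeds):
    """Map each seed to the narrowest bin width it produces."""
    return {
        seed: narrowest_bin_width(random_edges(z_min, z_max, n_bins, seed))
        for seed in seeds
    }


# Assumed redshift range 0 to 3; seed 123 is the one used in the runs above.
widths = narrowest_width_by_seed(0.0, 3.0, 16, seeds=[123, 124, 125])
```

If the widths (or the scores, when the full pipeline is rerun per seed) vary strongly between seeds, a jump at a particular bin count is more likely a property of the draw than of the bin count itself.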

Finally, I thought it would be good to compare the binning types directly for a given metric, so I created a plot that compares the 'FOM_DETF_3x2' scores for each binning method, out to 20 bins.
compare_scores

In the notebook you can change the metric if you wish to compare other metrics.

theabbybault and others added 7 commits September 7, 2020 14:57
fix the error of plots not going to the right folder
new binning method is from David Kirkby. Calculates the bin edges so they are equally spaced in comoving distance (I've called it 'chi').
@theabbybault
Author

This last update adds a new binning method, called 'chi' (code written by David Kirkby, here), in which the bins are equally spaced in comoving distance. I've also updated the notebook to include some plots for this method.
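For readers unfamiliar with the idea, here is a minimal sketch of edges equally spaced in comoving distance. It is not David Kirkby's code: it assumes a flat ΛCDM cosmology with illustrative parameters (`omega_m=0.3`, `h0=70`), integrates with the trapezoidal rule, and inverts the distance-redshift relation by linear interpolation. All function names are hypothetical.

```python
import bisect
import math

C_KM_S = 299792.458  # speed of light in km/s


def comoving_distance_grid(z_max, omega_m=0.3, h0=70.0, n_steps=2000):
    """Tabulate comoving distance chi(z) in Mpc for flat LCDM on a uniform z grid."""
    dz = z_max / n_steps
    z_grid = [i * dz for i in range(n_steps + 1)]
    inv_e = [1.0 / math.sqrt(omega_m * (1 + z) ** 3 + (1 - omega_m)) for z in z_grid]
    chi = [0.0]
    for i in range(1, n_steps + 1):
        # Trapezoidal rule for the integral of 1 / E(z).
        chi.append(chi[-1] + 0.5 * (inv_e[i - 1] + inv_e[i]) * dz)
    scale = C_KM_S / h0
    return z_grid, [scale * value for value in chi]


def interpolate(x, xs, ys):
    """Linear interpolation of y(x) for strictly increasing xs."""
    hi = min(max(bisect.bisect_right(xs, x), 1), len(xs) - 1)
    lo = hi - 1
    t = (x - xs[lo]) / (xs[hi] - xs[lo])
    return ys[lo] + t * (ys[hi] - ys[lo])


def chi_spaced_edges(z_min, z_max, n_bins, **cosmology):
    """Redshift bin edges that are equally spaced in comoving distance."""
    z_grid, chi_grid = comoving_distance_grid(z_max, **cosmology)
    chi_min = interpolate(z_min, z_grid, chi_grid)
    chi_max = chi_grid[-1]
    targets = [chi_min + (chi_max - chi_min) * i / n_bins for i in range(n_bins + 1)]
    # chi(z) is monotonic, so z(chi) can be read off by swapping the axes.
    return [interpolate(c, chi_grid, z_grid) for c in targets]
```

Because comoving distance grows more slowly with redshift at high z, equal steps in distance give bins that get wider in redshift as z increases.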

Focusing on the FOM_DETF_3x2 score, a table of scores for each method is shown below:
fom_detf_3x2_table
and the corresponding plot:
fom_detf_3x2_plot

A summary of what I've done:

  • All methods have been run with even numbers of bins from 2 to 20.
  • The classifier is called funbins (this was a throwaway name that stuck)
  • Only run on riz (should work for griz too)
  • 'log' is log-spaced percentile bin edges
  • 'random' (previously 'uniform') is bin edges drawn as random numbers from a uniform distribution
  • 'chi' is bin edges equally spaced in comoving distance

All plots posted here can be found in the notebook funbins_results.ipynb. The plots showing the bins can be found in results under the selected method (and then under jax).
