This repository was archived by the owner on Oct 19, 2019. It is now read-only.

@aidanvu1992
No description provided.

@llpk79
llpk79 commented Sep 11, 2019

  1. Load a dataset from Github (via its RAW URL)
  • Nice job here. Not much to get excited about.
  2. Load a dataset from your local machine
  • Using the colab library, nice.
  3. Load a dataset from UCI using !wget
  • Good job here as well. Would be interesting to see where you can take the poker data.

@llpk79
llpk79 commented Sep 16, 2019

Sprint challenge code review:

Part 1 - Load and validate the data

  • Load the data as a pandas data frame.
    • You've lost a row because the first row is read as the header by default.
    • You've added an unnecessary step. df = pd.read_csv(cancer_survival_url, names=['list', 'of', 'columns']) is sufficient: passing names tells pandas the file has no header row, so the first row is kept as data.
  • Validate that it has the appropriate number of observations (you can check the raw file, and also read the dataset description from UCI).
    • Use df.shape to view the number of rows in the dataframe. Run !cat <file.name> | wc -l to count the lines in the file, or go to the data page to confirm the expected number of rows.
  • Validate that you have no missing values.
    • Use df.isnull().sum() to see an easy-to-read count of null values per column.
  • Add informative names to the features.
    • Complete.
  • The survival variable is encoded as 1 for surviving >5 years and 2 for not - change this to be 0 for not surviving and 1 for surviving >5 years (0/1 is a more traditional encoding of binary variables)
    • Nicely done.
  • At the end, print the first five rows of the dataset to demonstrate the above.
    • Complete.
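The Part 1 steps above can be sketched roughly as follows. This is a minimal illustration, not the submitted notebook: the inline rows are made up to mimic a four-column file with no header, and the column names are placeholders chosen for the example.

```python
import io
import pandas as pd

# Made-up rows in a four-column, header-less layout; they stand in for
# the real file and are NOT actual dataset values.
raw_csv = io.StringIO(
    "30,64,1,1\n"
    "34,59,0,2\n"
    "41,60,23,2\n"
    "52,66,4,1\n"
    "63,61,0,1\n"
    "70,58,9,2\n"
)

# Hypothetical column names chosen for this sketch.
columns = ["age", "year_of_operation", "nodes", "survival"]

# Passing names= tells pandas the file has no header row, so the first
# line is kept as data instead of being consumed as column labels.
df = pd.read_csv(raw_csv, names=columns)

print(df.shape)           # (rows, columns); compare with the source file
print(df.isnull().sum())  # null count per column

# Recode survival: 1 (survived >5 years) stays 1, 2 (did not) becomes 0.
df["survival"] = df["survival"].map({1: 1, 2: 0})

print(df.head())
```

With a real URL in place of the in-memory buffer, the same read_csv call applies unchanged.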

Part 2 - Examine the distribution and relationships of the features

  • Explore the data - create at least 2 tables (can be summary statistics or crosstabulations) and 2 plots illustrating the nature of the data.
    • Good job exploring several cross-tabs. Consider binning the 'Number of positive axillary nodes detected' column as well.
    • Check out pd.qcut()
    • Try using bar graphs instead of line charts when comparing discrete values. Line graphs are usually best reserved for time series.
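As a rough sketch of the binning suggestion, pd.qcut can cut the node counts into quantile-based bins before cross-tabbing against survival. The values below are made up for illustration, and the column names nodes, survival, and nodes_bin are assumptions of this example.

```python
import pandas as pd

# Made-up node counts and outcomes, for illustration only.
df = pd.DataFrame({
    "nodes":    [0, 0, 0, 1, 1, 2, 3, 4, 7, 9, 13, 23],
    "survival": [1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0],
})

# qcut splits on quantiles, so each bin holds roughly the same number
# of rows. duplicates="drop" merges bins whose edges coincide, which
# can happen when one value (such as 0) is very common.
df["nodes_bin"] = pd.qcut(df["nodes"], q=4, duplicates="drop")

# Row-normalised cross-tab: share of each outcome within each bin.
table = pd.crosstab(df["nodes_bin"], df["survival"], normalize="index")
print(table)
```

Calling table.plot(kind="bar") on the result (with matplotlib installed) gives a bar graph with one group of bars per bin, which suits discrete categories better than a line chart.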

Part 3 - DataFrame Filtering

  • Use DataFrame filtering to subset the data into two smaller dataframes. You should make one dataframe for individuals who survived >5 years and a second dataframe for individuals who did not.
    • It would be more informative to cross-tab age vs nodes or year vs nodes rather than with survival, because within each dataframe we already know every row is either survived or not_survived.
  • Create a graph with each of the dataframes (can be the same graph type) to show the differences in Age and Number of Positive Axillary Nodes Detected between the two groups.
    • Try bar graphs and heat-maps with cross-tabs as above. Try plotting on the same bar graph with similar cross-tabs for both 'survived' and 'not_survived' populations.
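One way the Part 3 suggestions might look in code is sketched below. The rows are made up, and the bin edges and the helper name age_by_nodes are hypothetical choices for this example only.

```python
import pandas as pd

# Made-up rows for illustration; column names are assumptions.
df = pd.DataFrame({
    "age":      [31, 35, 42, 44, 50, 53, 58, 61, 66, 72],
    "nodes":    [0, 2, 0, 11, 1, 6, 0, 15, 3, 8],
    "survival": [1, 1, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Boolean-mask filtering into the two groups.
survived = df[df["survival"] == 1]
not_survived = df[df["survival"] == 0]

# Bin edges shared by both groups so the two tables are comparable.
age_bins = [30, 45, 60, 75]
node_bins = [0, 3, 30]

def age_by_nodes(group):
    """Cross-tab of age bin vs node-count bin for one group."""
    return pd.crosstab(
        pd.cut(group["age"], bins=age_bins),
        pd.cut(group["nodes"], bins=node_bins, include_lowest=True),
    )

survived_table = age_by_nodes(survived)
not_survived_table = age_by_nodes(not_survived)

print(survived_table)
print(not_survived_table)
```

Each table can then be drawn with .plot(kind="bar"), or passed to a heat-map function, to compare the two populations.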

Part 4 - Analysis and Interpretation

  • What is at least one feature that looks to have a positive relationship with survival? (As that feature goes up in value rate of survival increases)
    • 👎
    • year_of_operation and survival have a positive relationship.
  • What is at least one feature that looks to have a negative relationship with survival? (As that feature goes down in value rate of survival increases)
    • 👍
    • Age and number of nodes are each negatively correlated with survival.
  • How are those two features related with each other, and what might that mean?
    • 👍
    • Age and year_of_operation are positively correlated.
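A quick way to check the direction of these relationships is a pairwise correlation matrix. The sketch below uses made-up values and assumed column names, so its numbers say nothing about the real dataset; it only shows the mechanics.

```python
import pandas as pd

# Made-up values for illustration only; not the real dataset.
df = pd.DataFrame({
    "age":               [31, 35, 42, 44, 50, 53, 58, 61, 66, 72],
    "year_of_operation": [58, 60, 59, 62, 64, 61, 65, 63, 66, 67],
    "nodes":             [0, 2, 0, 11, 1, 6, 0, 15, 3, 8],
    "survival":          [1, 1, 1, 0, 1, 0, 1, 0, 1, 0],
})

# Pearson correlation between every pair of numeric columns.
corr = df.corr()

# Sign of each feature's correlation with survival:
# positive means the feature rises with survival, negative the opposite.
print(corr["survival"].drop("survival").sort_values())

# Relationship between two features, e.g. age and year_of_operation.
print(corr.loc["age", "year_of_operation"])
```

Correlation only captures linear association, so it is worth pairing it with the cross-tabs and plots from Parts 2 and 3.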

Not bad, Anh. I'm going to give this a 2. Do go back and review some of this material, as we will continue to build on it toward rapidly more complex applications.

3 participants