Update featurization.ipynb #750

Open
wants to merge 1 commit into base: main
24 changes: 12 additions & 12 deletions docs/examples/featurization.ipynb
@@ -241,7 +241,7 @@
"source": [
"### Using feature hashing\n",
"\n",
"In fact, the `StringLookup` layer allows us to configure multiple OOV indices. If we do that, any raw value that is not in the vocabulary will be deterministically hashed to one of the OOV indices. The more such indices we have, the less likley it is that two different raw feature values will hash to the same OOV index. Consequently, if we have enough such indices the model should be able to train about as well as a model with an explicit vocabulary without the disdvantage of having to maintain the token list."
"In fact, the `StringLookup` layer allows us to configure multiple OOV indices. If we do that, any raw value that is not in the vocabulary will be deterministically hashed to one of the OOV indices. The more such indices we have, the less likley it is that two different raw feature values will hash to the same OOV index. Consequently, if we have enough such indices the model should be able to train about as well as a model with an explicit vocabulary without the disadvantage of having to maintain the token list."
]
},
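For illustration (not part of this diff), a minimal sketch of configuring multiple OOV indices on a `StringLookup` layer; the vocabulary and bucket count here are made up:

```python
import tensorflow as tf

# A lookup with a tiny explicit vocabulary and 4 OOV buckets: any raw
# value outside the vocabulary is hashed deterministically into one of
# the 4 OOV slots (indices 0-3); known values start at index 4.
title_lookup = tf.keras.layers.StringLookup(
    vocabulary=["Star Wars (1977)", "Toy Story (1995)"],
    num_oov_indices=4,
)

print(title_lookup(tf.constant(["Star Wars (1977)", "Unseen Movie (2020)"])))
```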
{
@@ -250,7 +250,7 @@
"id": "t0gOaMjJAC17"
},
"source": [
"We can take this to its logical extreme and rely entirely on feature hashing, with no vocabulary at all. This is implemented in the `tf.keras.layers.Hashing` layer."
"We can take this to its logical extreme and rely entirely on feature hashing, with no vocabulary at all. This is implemented in the [`tf.keras.layers.Hashing`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Hashing) layer."
]
},
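A hedged sketch of vocabulary-free hashing with this layer; the bin count is illustrative:

```python
import tensorflow as tf

# Hash raw strings straight into a fixed number of bins: no vocabulary
# to build or maintain, at the cost of occasional collisions.
movie_title_hashing = tf.keras.layers.Hashing(num_bins=200_000)

print(movie_title_hashing(tf.constant(["Star Wars (1977)"])))
```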
{
@@ -314,7 +314,7 @@
"source": [
"movie_title_embedding = tf.keras.layers.Embedding(\n",
" # Let's use the explicit vocabulary lookup.\n",
" input_dim=movie_title_lookup.vocab_size(),\n",
" input_dim=movie_title_lookup.vocabulary_size(),\n",
" output_dim=32\n",
")"
]
@@ -356,7 +356,7 @@
},
"outputs": [],
"source": [
"movie_title_model([\"Star Wars (1977)\"])"
"movie_title_model(tf.constant([\"Star Wars (1977)\"]))"
]
},
{
@@ -379,7 +379,7 @@
"user_id_lookup = tf.keras.layers.StringLookup()\n",
"user_id_lookup.adapt(ratings.map(lambda x: x[\"user_id\"]))\n",
"\n",
"user_id_embedding = tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32)\n",
"user_id_embedding = tf.keras.layers.Embedding(user_id_lookup.vocabulary_size(), 32)\n",
"\n",
"user_id_model = tf.keras.Sequential([user_id_lookup, user_id_embedding])"
]
@@ -426,7 +426,7 @@
"\n",
"[Standardization](https://en.wikipedia.org/wiki/Feature_scaling#Standardization_(Z-score_Normalization)) rescales features to normalize their range by subtracting the feature's mean and dividing by its standard deviation. It is a common preprocessing transformation.\n",
"\n",
"This can be easily accomplished using the [`tf.keras.layers.Normalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/Normalization) layer:"
"This can be easily accomplished using the [`tf.keras.layers.Normalization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/Normalization) layer:"
]
},
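As a sketch (with synthetic data rather than the MovieLens timestamps the notebook adapts on), standardization with this layer typically looks like:

```python
import numpy as np
import tensorflow as tf

# Learn the feature's mean and variance from data, then standardize
# inputs to roughly zero mean and unit variance.
normalization = tf.keras.layers.Normalization(axis=None)

timestamps = np.random.uniform(0, 1e9, size=(1000,)).astype("float32")
normalization.adapt(timestamps)

print(normalization(timestamps[:3]))
```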
{
@@ -518,7 +518,7 @@
"\n",
"The first transformation we need to apply to text is tokenization (splitting into constituent words or word-pieces), followed by vocabulary learning, followed by an embedding.\n",
"\n",
"The Keras [`tf.keras.layers.TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/preprocessing/TextVectorization) layer can do the first two steps for us:"
"The Keras [`tf.keras.layers.TextVectorization`](https://www.tensorflow.org/api_docs/python/tf/keras/layers/TextVectorization) layer can do the first two steps for us:"
]
},
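For orientation, a minimal sketch of tokenization with this layer; the adapt data here is a toy list rather than the MovieLens titles:

```python
import numpy as np
import tensorflow as tf

# Build a word-level vocabulary from raw strings, then map each title
# to a padded sequence of integer token ids.
title_text = tf.keras.layers.TextVectorization()
title_text.adapt(np.array(["Star Wars (1977)", "Toy Story (1995)"]))

print(title_text(tf.constant(["Star Wars (1977)"])))
```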
{
@@ -584,7 +584,7 @@
"source": [
"This looks correct: the layer is tokenizing titles into individual words.\n",
"\n",
"To finish the processing, we now need to embed the text. Because each title contains multiple words, we will get multiple embeddings for each title. For use in a donwstream model these are usually compressed into a single embedding. Models like RNNs or Transformers are useful here, but averaging all the words' embeddings together is a good starting point."
"To finish the processing, we now need to embed the text. Because each title contains multiple words, we will get multiple embeddings for each title. For use in a downstream model these are usually compressed into a single embedding. Models like RNNs or Transformers are useful here, but averaging all the words' embeddings together is a good starting point."
]
},
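A hedged sketch of that averaging step: embed each token and pool the per-word embeddings into one vector with `GlobalAveragePooling1D` (the token budget and embedding width are illustrative):

```python
import tensorflow as tf

max_tokens = 10_000

title_text_embedding = tf.keras.Sequential([
    tf.keras.layers.TextVectorization(max_tokens=max_tokens),
    # mask_zero=True so padding tokens are ignored when averaging.
    tf.keras.layers.Embedding(max_tokens, 32, mask_zero=True),
    # Average the per-word embeddings into a single title embedding.
    tf.keras.layers.GlobalAveragePooling1D(),
])

# The TextVectorization sub-layer still needs `adapt` before use, e.g.:
# title_text_embedding.layers[0].adapt(ratings.map(lambda x: x["movie_title"]))
```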
{
@@ -624,7 +624,7 @@
"\n",
" self.user_embedding = tf.keras.Sequential([\n",
" user_id_lookup,\n",
" tf.keras.layers.Embedding(user_id_lookup.vocab_size(), 32),\n",
" tf.keras.layers.Embedding(user_id_lookup.vocabulary_size(), 32),\n",
" ])\n",
" self.timestamp_embedding = tf.keras.Sequential([\n",
" tf.keras.layers.Discretization(timestamp_buckets.tolist()),\n",
@@ -665,7 +665,7 @@
"user_model = UserModel()\n",
"\n",
"user_model.normalized_timestamp.adapt(\n",
" ratings.map(lambda x: x[\"timestamp\"]).batch(128))\n",
" ratings.map(lambda x: x[\"timestamp\"]).batch(128,drop_remainder=True))\n",
"\n",
"for row in ratings.batch(1).take(1):\n",
" print(f\"Computed representations: {user_model(row)[0, :3]}\")"
@@ -698,7 +698,7 @@
"\n",
" self.title_embedding = tf.keras.Sequential([\n",
" movie_title_lookup,\n",
" tf.keras.layers.Embedding(movie_title_lookup.vocab_size(), 32)\n",
" tf.keras.layers.Embedding(movie_title_lookup.vocabulary_size(), 32)\n",
" ])\n",
" self.title_text_embedding = tf.keras.Sequential([\n",
" tf.keras.layers.TextVectorization(max_tokens=max_tokens),\n",
@@ -749,7 +749,7 @@
"source": [
"## Next steps\n",
"\n",
"With the two models above we've taken the first steps to representing rich features in a recommender model: to take this further and explore how these can be used to build an effective deep recomender model, take a look at our Deep Recommenders tutorial."
"With the two models above we've taken the first steps to representing rich features in a recommender model: to take this further and explore how these can be used to build an effective deep recommender model, take a look at our Deep Recommenders tutorial."
]
}
],