content.json
{"meta":{"title":"Aerial Dev","subtitle":"Distributed databases and aerial arts all rolled up into one, oddly shaped...bundle","description":"Distributed databases and aerial arts all rolled up into one, oddly shaped...bundle","author":"David Gilardi","url":"https://sonicdmg.github.io"},"pages":[{"title":"About me","date":"2017-02-06T20:54:38.000Z","updated":"2017-02-10T13:37:08.000Z","comments":true,"path":"About/index.html","permalink":"https://sonicdmg.github.io/About/index.html","excerpt":"","text":"Hi there, I’m David Gilardi, a nerd, sci-fi, fantasy enthusiast who would like nothing more than to get on the Starship Enterprise (or anything remotely like it that won’t explode on its first voyage) and explore the universe. Until that happens I’ll continue playing games, building distributed database clusters, climbing up multi-story strips of fabric while calling it “art”, and “improving” every bit of my house that I can. You mentioned distributed databasesYuppers. I’m currently a Technical Evangelist at DataStax and let me tell you after 20+ years in my career coding, DBAing, building bare metal and cloud infrastructure, and managing I am a happy camper getting back into some seriously cool tech. I have a mixed-workload search/graph/Cassandra cluster using 4 Raspberry PI’s for my core Cassandra DC. You can’t get much more commodity hardware than PI’s and my cluster is humming along quite nicely. Strips of fabric?Yes! One of my favorite activities is aerial arts. Seriously, drop the gym, start climbing stuff and your body will thank you. It also might reward you with some torn, ripped, pulled tendons, ligaments, and muscles, but no matter. It’s all earned pain. I realize I don’t make it sound all that great, but after 5 years I’m still in one piece and love every moment of it (most of the time). Here is a fun example. Sorry for the potato quality, it won’t happen again. 
Turn up the sound and wait for the end."}],"posts":[{"title":"Moving from Cassandra tables to Search with DataStax: Part 2","slug":"Moving-from-Cassandra-tables-to-Search-with-DataStax-Part-2","date":"2018-02-12T21:11:12.000Z","updated":"2018-02-12T21:33:14.452Z","comments":true,"path":"2018/02/12/Moving-from-Cassandra-tables-to-Search-with-DataStax-Part-2/","link":"","permalink":"https://sonicdmg.github.io/2018/02/12/Moving-from-Cassandra-tables-to-Search-with-DataStax-Part-2/","excerpt":"Hello again and welcome to part 2 of my 3-part series on moving from Cassandra tables to using DataStax Enterprise Search. If you haven’t read part 1 yet, go take a look as it contains the backstory for this series. In part 1 we looked at the types of searches we perform in KillrVideo, scratched the surface on how those searches were implemented, and then I asked a set of questions that led us into why we might use DSE Search. Here in part 2, I’ll discuss why we moved to using DSE Search, detail what the transition encompassed, and explain considerations I took into account when making the switch. One thing I’d like to point out before we get started is that the Cassandra-only approach we replaced with DSE Search is perfectly valid. At times I make the case for why the Search approach is better IMO for our particular situation, but it is not broken or anything along those lines and it follows established denormalized query-first design patterns. The move to DSE SearchSo, we decided to switch from using only Cassandra tables for our searches to using DSE Search. That part is obvious enough, but some of you might be curious as to why we made this move.","text":"Hello again and welcome to part 2 of my 3-part series on moving from Cassandra tables to using DataStax Enterprise Search. If you haven’t read part 1 yet, go take a look as it contains the backstory for this series. 
In part 1 we looked at the types of searches we perform in KillrVideo, scratched the surface on how those searches were implemented, and then I asked a set of questions that led us into why we might use DSE Search. Here in part 2, I’ll discuss why we moved to using DSE Search, detail what the transition encompassed, and explain considerations I took into account when making the switch. One thing I’d like to point out before we get started is that the Cassandra-only approach we replaced with DSE Search is perfectly valid. At times I make the case for why the Search approach is better IMO for our particular situation, but it is not broken or anything along those lines and it follows established denormalized query-first design patterns. The move to DSE SearchSo, we decided to switch from using only Cassandra tables for our searches to using DSE Search. That part is obvious enough, but some of you might be curious as to why we made this move. All of the searches worked perfectly fine with Cassandra tables, they performed well, and honestly the code to implement was not terribly complex, so what was it that pushed us over the edge? It was the need to expand searches to include more than tags, provide more comprehensive, “fuzzy” searches, and enable more flexibility for future search enhancements. Why, why, whyIf you remember from part 1, I detailed what the various searches were all doing. One common denominator was the use of the tag column for all of our searches. Now, we wanted to expand our searches to include both video name and description columns along with tag. Sure, I could modify the existing schema to include the new columns, but then I would have a data migration to worry about to populate the new columns on existing rows. 
I would also have to touch all code points that intersected with any of the searches, update entity classes, change CQL queries, and potentially change or add code logic to handle the new columns, not to mention the inflexibility of it all if I needed to add more columns down the line. Now, in all frankness, I would have to do some of this if we switched to using DSE Search, but as you’ll see it’s more of a removal task than a data model and design change. While writing this article it also dawned on me I have the benefit of hindsight since we’re already using Search. It’s easy to say “well just do it this way”, but when coming at this fresh and looking at what we were trying to accomplish, it was a no-brainer. Comparing the Cassandra, Search, Analytics, and Graph workloads available to DSE, the requirements here almost read like a product page for Search. It’s the right tool for the job. Time to get dirtyOk, we made the decision to use DSE Search and now it’s time to update the application. Before I started hacking at code, though, I needed to figure out what I was dealing with from a query perspective. Remember, from part 1 of this series we have the following Cassandra-only CQL queries to perform our searches:

// typeahead
SELECT tag FROM tags_by_letter WHERE first_letter = ? AND tag >= ?

// tag and \"more videos like this\"
SELECT * FROM videos_by_tag WHERE tag = ?

Not very hard, but we need multiple, specialized tables to handle our searches. In order to start using DSE Search I needed to create a search index. At this point I just want to point out that my intent here is not to be a whole guide on how to create and implement search indexes. The focus is on our case moving from Cassandra tables to DSE Search. I will go over some of the highlights just to connect some dots and break down the sections that are relevant for this post. 
If you are not familiar with how to create Search indexes I highly suggest you take a look at this blog post on creating search indexes in DSE 5.1 along with the documentation on the same topic. If you want a total deep dive on DSE Search I highly suggest you check out the DataStax Academy course on DSE Search and prepare to have your mind blown. Back to itHere is the schema defined for our videos table. It was mostly auto-generated after enabling Search with dsetool. If you are curious about how the schema is configured take a look here. Why the videos table? Because it holds all of the information necessary for our searches, namely, tags (as before), name, and description, and since the videos table is populated on each video upload we get everything we need without having to write to multiple denormalized tables. As a matter of fact, implementing our searches against the videos table allowed us to remove the VideoAddedHandlers class and the asynchronous Cassandra queries contained within, but I am getting ahead of myself. For our current discussion let’s focus on the following snippet:

<!-- For search autocomplete functionality, copy name and tags fields to search_suggestions field -->
<field indexed=\"true\" multiValued=\"false\" name=\"search_suggestions\" stored=\"true\" type=\"textSuggest\" />
<copyField source=\"name\" dest=\"search_suggestions\" />
<copyField source=\"tags\" dest=\"search_suggestions\" />

Notice the field name of “search_suggestions” and both the “name” and “tags” copyField entries. We essentially copied name and tag column information into a single field called “search_suggestions”. This will come into play here in a moment. 
Then take a look at the “Basic fields” section:

<!-- Basic fields -->
<field indexed=\"true\" multiValued=\"false\" name=\"added_date\" stored=\"true\" type=\"TrieDateField\"/>
<field indexed=\"true\" multiValued=\"false\" name=\"location\" stored=\"true\" type=\"TextField\"/>
<field indexed=\"true\" multiValued=\"false\" name=\"preview_image_location\" stored=\"true\" type=\"TextField\"/>
<field indexed=\"true\" multiValued=\"false\" name=\"name\" termVectors=\"true\" stored=\"true\" type=\"TextField\"/>
<field indexed=\"true\" multiValued=\"true\" name=\"tags\" termVectors=\"true\" stored=\"true\" type=\"TextField\"/>
<field indexed=\"true\" multiValued=\"false\" name=\"userid\" stored=\"true\" type=\"UUIDField\"/>
<field indexed=\"true\" multiValued=\"false\" name=\"videoid\" stored=\"true\" type=\"UUIDField\"/>
<field indexed=\"true\" multiValued=\"false\" name=\"location_type\" stored=\"true\" type=\"TrieIntField\"/>
<field indexed=\"true\" multiValued=\"false\" name=\"description\" termVectors=\"true\" stored=\"true\" type=\"TextField\"/>

As I mentioned above most of this was auto-generated. This includes indexes for all of the columns in the videos table which allows us to perform searches against any of the columns in our table. For our case we are using tags, name, and description. If you are curious about why we might want the other columns I refer you to OutOfScopeOfThisArticleException. It is something I will cover in the future. The important takeaway is we have the needed columns indexed. Some of you might be thinking “wait, isn’t this more complex than just creating some Cassandra tables like you had before?”. It’s about the same IMO. We’re defining fields and their types in a schema, just a different type of schema, but once we do this we are all set to go…for all of our searches! Back to creating our indexI’ve pulled the following out of KillrVideo’s bootstrap script. 
We use this to create all of the initial database artifacts needed to run KillrVideo on cluster creation. This may obviously be different for you depending on your setup, but it should give you the general idea. We are using dsetool to both create and reload our core against the videos table in our killrvideo keyspace.

echo '=> Creating search core'
dsetool -h $YOUR_CLUSTER_IP create_core killrvideo.videos schema=$YOUR_SCHEMA_LOCATION/videos.schema.xml solrconfig=$YOUR_SCHEMA_LOCATION/videos.solrconfig.xml -l $USERNAME -p $PASSWORD
echo '=> Reloading search core'
dsetool -h $YOUR_CLUSTER_IP reload_core killrvideo.videos schema=$YOUR_SCHEMA_LOCATION/videos.schema.xml solrconfig=$YOUR_SCHEMA_LOCATION/videos.solrconfig.xml -l $USERNAME -p $PASSWORD

Once this completes, that’s it. Search indexes are in place and ready for use. As we insert data into the Cassandra-based videos table our indexes will automatically be updated. No extra queries, explicit code, or anything else needed. One more thing on this whole schema thing before I move on. Search index creation will be a whole lot easier in the upcoming DSE 6. I’ll have some posts digging into this in the near future and, to say the least, I’m totally stoked, so keep an eye out for updates. Let’s take stockI think it’s a good idea to summarize where we’re at so far. We’ve effectively replaced our 2 Cassandra-only tables with a Search core created against the videos table, and we’ve loaded that core for use within our database. Within our search core we created a field named “search_suggestions” that combines data from the tags and name columns into a single field, and now we can start performing searches using DSE Search against any of the fields we created in the search schema above. Cassandra-based search vs. DSE Search comparisonsSo now comes the fun part where we get to take a look at how the Cassandra-based and Search-based approaches compare. 
“Typeahead” searchLet’s start with the “typeahead” search from the search bar. For our examples I’m using query parameters with value ‘d’. This is the same as if I typed ‘d’ in the search bar from the UI. Here’s the original CQL:

SELECT tag
FROM tags_by_letter
WHERE first_letter = 'd' AND tag >= 'd';

Here’s the CQL with Search in place:

SELECT tags, name
FROM videos
WHERE solr_query='{\"q\":\"search_suggestions:d*\", \"paging\":\"driver\"}';

Right away there are a couple of differences to point out. One, notice that in the original query I am selecting only the tag column. This is because the tags_by_letter table only includes tags; there is nothing else to extract. The purpose of the table is to provide tags in a very efficient manner. In the Search-based query I am selecting both tags and name in this case, but I could get any column from the videos table if I wanted to. Also, notice the difference between the tag and tags columns across the two queries. Since the Search-based query is pulling from the videos table we return a set of tags in the tags column compared to a single tag per row. The final piece is probably the most obvious of all, the WHERE clause. In the original query we are using both our partition key and clustering column to find rows that have the letter ‘d’ with tags >= ‘d’. I should note this will return an alphabetically ordered list of results because of how clustering columns work. In the Search query we are looking for any words starting with the letter ‘d’ from the “search_suggestions” field. This field is a combination of all tags and names. Don’t forget tags is a set of tags per row. I’m totally glossing over the “paging” parameter you see above in the Search query. We will get to that in part 3. Now comes the fun. Watch what happens when I execute each query. 
Also remember that our goal was to expand our search results to provide more complex and varied results. I’d like to point out there is no LIMIT clause in my query and this is from a database with thousands of videos. Now, here is the Search query. Notice not only the amount of results, but that we are matching across both the tags and name columns and we are matching within the set of values in the tags column. I could have added description or any other relevant column from the videos table if I wanted to by simply adding it to my query. So, ok, we have more results, we have more variation in results, but you might notice we have a lot of repeated terms where the Cassandra-based query returned a more succinct set of results. In my case I handled this with a little regex and a TreeSet that ensures I don’t have repeats and results are naturally ordered alphabetically within the set. As a matter of fact here is that very snippet:

// Use a TreeSet to ensure 1) no duplicates, and 2) words are ordered naturally alphabetically
final Set<String> suggestionSet = new TreeSet<>();

/**
 * Here, we are inserting the request from the search bar, maybe something
 * like \"c\", \"ca\", or \"cas\" as someone starts to type the word \"cassandra\".
 * For each of these cases we are looking for any words in the search data that
 * start with the values above.
 */
final String pattern = \"(?i)\\\\b\" + request.getQuery() + \"[a-z]*\\\\b\";
final Pattern checkRegex = Pattern.compile(pattern);

int remaining = rows.getAvailableWithoutFetching();
for (Row row : rows) {
    String name = row.getString(\"name\");
    Set<String> tags = row.getSet(\"tags\", new TypeToken<String>() {});

    /**
     * Since I simply want matches from both the name and tags fields,
     * concatenate them together, apply the regex, and add any results into
     * our suggestionSet TreeSet. The TreeSet will handle any duplicates.
     */
    Matcher regexMatcher = checkRegex.matcher(name.concat(tags.toString()));
    while (regexMatcher.find()) {
        suggestionSet.add(regexMatcher.group().toLowerCase());
    }

    if (--remaining == 0) {
        break;
    }
}

It was a small price to pay for taking this particular approach and there was no noticeable performance degradation in doing so. I will explore some other options coming up here in the future. Almost thereBear with me for another example to bring this together. In the following case I purposely removed the constraint on my page size to illustrate the difference in results. Remember with our Cassandra-only query we returned just 5 tags out of thousands of videos. These were simply the only tags that started with the letter ‘d’ in the database. So, nothing wrong with this, it is doing exactly what it was designed for, but we wanted to provide a “richer” experience and expand beyond tags. With Search in place we could now do exactly that and provide a definite increase in variation all while working right out of the videos table. I could expand this further to include any column in my table by simply modifying my query, no data model change needed. Hopefully the amount and variation of the options in my search bar are obvious. Compare this to the 5 results we had previously. I don’t know about you, but I could use a break right about now. Ok, moving on. Tag search and the “more videos like this” sectionI’m just going to go ahead and combine these because the original Cassandra-only searches are effectively the same. For reference we are talking about the following query using ‘dsl’ as the query parameter for the tag column:

SELECT * FROM videos_by_tag WHERE tag = 'dsl';

From the UI perspective this query was used both if an end-user clicked on any of the tag buttons on the video detail page and when viewing the “More videos like this” section at the bottom of the video detail page. 
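If it helps to see the TreeSet-and-regex dedup idea in isolation, here is a self-contained sketch that swaps the driver Row objects for plain strings. The class name, method, and sample rows are all hypothetical, invented for illustration, not taken from KillrVideo:

```java
import java.util.*;
import java.util.regex.*;

public class SuggestionDedup {
    // Build an alphabetized, de-duplicated suggestion set from name/tags text,
    // mirroring the regex + TreeSet approach in the snippet above.
    static Set<String> suggestions(String query, List<String> rows) {
        // TreeSet gives us 1) no duplicates, and 2) natural alphabetical order.
        Set<String> suggestionSet = new TreeSet<>();
        // Match whole words that start with the typed query, case-insensitively.
        Pattern checkRegex = Pattern.compile("(?i)\\b" + Pattern.quote(query) + "[a-z]*\\b");
        for (String row : rows) {
            Matcher m = checkRegex.matcher(row);
            while (m.find()) {
                suggestionSet.add(m.group().toLowerCase());
            }
        }
        return suggestionSet;
    }

    public static void main(String[] args) {
        // Hypothetical "name [tags]" rows, standing in for videos table rows.
        List<String> rows = Arrays.asList(
            "DataStax DSE Search overview [datastax, dse]",
            "Datastax drivers deep dive [drivers, datastax]",
            "Intro to distributed databases [database]");
        // prints [database, databases, datastax, deep, distributed, dive, drivers, dse]
        System.out.println(suggestions("d", rows));
    }
}
```

Because the TreeSet is sorted and silently ignores duplicate adds, no separate sort or dedup pass is needed afterward.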
The former case would simply pass the clicked tag value to the back-end and execute a query similar to what you see above. In our example it would be as if I clicked on the “dsl” button below. The latter case would essentially loop through all of the different tags associated with a video and execute the above query for each tag in the list. In our example we have 4 queries for “datastax”, “dsl”, “graph”, and “gremlin”. The results were then combined and used to populate the “more videos like this” section. Something to point outAt this point I’m sure you have noticed that tags are a core component of how searches are powered. The UI enforces including at least one tag on video upload and they are included in the design for every search. However, we loosened this restriction when pulling videos from the back-end using the generator service. We did this because the difference in the number of videos available to us without tags compared to those with tags under the various topics we are pulling from YouTube is pretty huge. There are also some pretty useful/cool videos out there that don’t include tags. For each of the Cassandra-only searches, if there are no tags, you get no videos, nothing. Case in point, take a look at the above image. There are no tags at all, yet if you look at the “more videos like this” section at the bottom notice how relevant our results are when compared to the video we are viewing. This is a nice example of how using Search allowed us to provide a more comprehensive experience by making it easy to include multiple facets of data and even cover the case of missing one of our key pieces of data. In the previous solution the “more videos like this” section would be empty. Let’s wrap this upOk, so we talked about why we made the switch to using DSE Search, looked at some of the details of how this was done, discussed some considerations taken into account, and then viewed some result comparisons. 
That’s a good amount of stuff, but it’s not the full picture. My goal here was to demonstrate how using DSE Search enhanced our search capability and didn’t require us to radically change our overall design. Hopefully I accomplished my goal. In part 3, we’ll dive into simplifying our code base by removing the pieces we no longer needed after moving to Search, tie up some loose ends, and look at advanced search capabilities we got for “free” simply because we are using Search. See you soon :D","categories":[{"name":"Technical","slug":"Technical","permalink":"https://sonicdmg.github.io/categories/Technical/"}],"tags":[{"name":"killrvideo","slug":"killrvideo","permalink":"https://sonicdmg.github.io/tags/killrvideo/"},{"name":"datastax","slug":"datastax","permalink":"https://sonicdmg.github.io/tags/datastax/"},{"name":"search","slug":"search","permalink":"https://sonicdmg.github.io/tags/search/"},{"name":"DSE Search","slug":"DSE-Search","permalink":"https://sonicdmg.github.io/tags/DSE-Search/"}]},{"title":"Moving from Cassandra tables to Search with DataStax: Part 1","slug":"Moving-from-Cassandra-tables-to-Search-with-DataStax-Part-I","date":"2018-01-10T14:28:42.000Z","updated":"2018-02-12T21:11:14.747Z","comments":true,"path":"2018/01/10/Moving-from-Cassandra-tables-to-Search-with-DataStax-Part-I/","link":"","permalink":"https://sonicdmg.github.io/2018/01/10/Moving-from-Cassandra-tables-to-Search-with-DataStax-Part-I/","excerpt":"Hi there and welcome to part 1 of a three part series on upgrading our KillrVideo java reference application from Cassandra based tabular searches to using DSE Search. Here in part 1, we’ll take a look at the “before” picture and how we were previously performing searches. I’ll give some examples of the types of searches and how those were implemented with Cassandra tables. We’ll also talk a little about the “why” of moving to DSE Search. 
In part 2, I’ll explain the transition to DSE Search and what considerations I had to take into account along with a before and after code comparison. Finally, in part 3, we’ll take a look at our results along with some of the more advanced types of searches we can now perform. Ok, let’s do thisFirst things first…assumptions!If it isn’t obvious, we are using the Java based KillrVideo application for reference. If you aren’t familiar with KillrVideo go take a look here to get up to speed. In short, this is a real, open source, micro-service style application that we build and maintain to present examples and help folks understand the DataStax tech stack. It’s also a nice way that I personally get some code time against the stack in a real application as compared to punching out demo apps. We are using DataStax Enterprise from drivers to cluster. All of the capabilities we’re talking about here are assumed to be within that ecosystem.","text":"Hi there and welcome to part 1 of a three part series on upgrading our KillrVideo java reference application from Cassandra based tabular searches to using DSE Search. Here in part 1, we’ll take a look at the “before” picture and how we were previously performing searches. I’ll give some examples of the types of searches and how those were implemented with Cassandra tables. We’ll also talk a little about the “why” of moving to DSE Search. In part 2, I’ll explain the transition to DSE Search and what considerations I had to take into account along with a before and after code comparison. Finally, in part 3, we’ll take a look at our results along with some of the more advanced types of searches we can now perform. Ok, let’s do thisFirst things first…assumptions!If it isn’t obvious, we are using the Java based KillrVideo application for reference. If you aren’t familiar with KillrVideo go take a look here to get up to speed. 
In short, this is a real, open source, micro-service style application that we build and maintain to present examples and help folks understand the DataStax tech stack. It’s also a nice way that I personally get some code time against the stack in a real application as compared to punching out demo apps. We are using DataStax Enterprise from drivers to cluster. All of the capabilities we’re talking about here are assumed to be within that ecosystem. Do we really need to use DSE Search?No. Maybe. Yes? ¯\\_(ツ)_/¯Ok, it depends, but for the most basic searches it isn’t a requirement. As a matter of fact search was already implemented in the Java version without using DSE Search. So, why the change? Mostly it comes down to requirements and the right tool for the job. So, before I get into all of this why don’t we take a look at the various types of searches that exist in KillrVideo. In KillrVideo, you can get details and play any video from the video detail page. A quick look at the whole page and you can see all of the available searches. At the top left is the “typeahead” search bar, over to the right are the tag search buttons, and at the very bottom is the “more videos like this” search. “Typeahead” searchSo, nothing new here really. These types of searches have been around for quite some time. Start typing letters in the search bar and you are provided with a list of potential matches Tag searchThis next one is pretty straightforward as well. If you click on any of the tag buttons on the video detail page it will perform a search for other videos with the same tag. “More videos like this” searchAt the bottom of the video detail page is a section labeled “More Videos Like This”. This search will happen automatically when you navigate to any video and present a set of videos that are similar to the video you are currently viewing. 
Let’s take a look at the implementationRemember, I mentioned that before moving to DSE Search these were all powered with Cassandra tables. Let’s break some of this down and take a look at some details. Also, if you are interested check out this pull request up on GitHub. You can use this as a reference of the before and after changes if you so choose. OverviewSo overall the setup is pretty simple. We have 3 searches that are supported by a combination of 3 Cassandra tables (the number of searches and tables just happen to match, there is no correlation between them), 3 table entities that map to our Cassandra tables, and 3 mapping objects derived from our table entities. Here is a simple visual representation. Now, I’m really pointing out these particular items because they will come into play later once we move to using DSE Search, namely, we will need to remove most of them. We’ll leave that there for now and come back to it later. Of the 3 tables I just mentioned, 2 exist only to support Cassandra-based searches in KillrVideo, tags_by_letter and videos_by_tag. They follow the denormalized data model pattern we’ve come to love in Cassandra and were created solely to support this purpose. The videos table stores all videos inserted into KillrVideo. It is not specialized to Cassandra-based searches and will come into play a little later. I just mentioned that both tags_by_letter and videos_by_tag were specially created to support searches within KillrVideo. Let’s take a deeper look at both tables and see what’s going on. If you aren’t familiar with primary keys in Apache Cassandra™ I highly suggest you take 5 minutes to read Patrick McFadin’s post on their importance. This will better explain how they are applied below. Here is the CQL schema for the tags_by_letter table:

CREATE TABLE IF NOT EXISTS tags_by_letter (
    first_letter text,
    tag text,
    PRIMARY KEY (first_letter, tag)
);

The first_letter column is the partition key with tag as a clustering column. 
The partition key determines where data is located in your cluster while clustering columns handle how data is ordered within the partition. This is especially useful in cases like a “typeahead” search where searches typically start with the first letter of a given search term and usually provide an alphabetical list of results. This, in a sense, pre-optimizes query results and prevents us from having to sort our data, whether in query execution or code. Just to absolutely belabor this point (because who doesn’t like belaboring something) here is an example of this in action. Notice how the results are sorted automatically per the tag column with no sorting or extra commands needed in the query. Moving on, here is the CQL schema for the videos_by_tag table:

CREATE TABLE IF NOT EXISTS videos_by_tag (
    tag text,
    videoid uuid,
    added_date timestamp,
    userid uuid,
    name text,
    preview_image_location text,
    tagged_date timestamp,
    PRIMARY KEY (tag, videoid)
);

Again, this table was specially created to answer the question of what videos have a specified tag. It uses tag as the partition key which allows for fast retrieval of videos that match a tag when querying. The other fields you see listed are there to provide information required by our web tier UI. VideoAddedHandlersAnother portion we need to keep an eye on is the VideoAddedHandlers class. As the name implies this class is responsible for performing some action(s) every time a video is added to KillrVideo. 
If you take a look at the two prepared statements within the init() method you should notice they are inserting data into the 2 search tables we mentioned above, tags_by_letter and videos_by_tag.

videosByTagPrepared = dseSession.prepare(
    \"INSERT INTO \" + Schema.KEYSPACE + \".\" + videosByTagTableName + \" \" +
    \"(tag, videoid, added_date, userid, name, preview_image_location, tagged_date) \" +
    \"VALUES (?, ?, ?, ?, ?, ?, ?)\");

tagsByLetterPrepared = dseSession.prepare(
    \"INSERT INTO \" + Schema.KEYSPACE + \".\" + tagsByLetterTableName + \" \" +
    \"(first_letter, tag) VALUES (?, ?)\");

Every time a video is added these queries are fired off to power our searches. Notice some of the column names, things like “tag” and “first_letter”. Again, we will dig into the detailed logic here soon. Detailed logic here soonAlrighty, so, we’ve gone over a high-level overview of the various items and objects we are using to support our Cassandra-based searches. Now, let’s get into the searches themselves and see what they are doing. “Typeahead” searchAs I mentioned above the typeahead search simply takes the values the user types into the search bar and provides search suggestions based on the sequence of letters typed in, usually with a wildcard attached to the end of the sequence. An example is something like “d” which might return “database”, “decision”, “document”, etc… Then, if the user continues with something like “da” it could return “database”, “databases”, “datastax”, etc… and so on. In the Cassandra-based search case this is supported by the tags_by_letter table. Every time a video is added the related subscriber method in the VideoAddedHandlers class is called and ALL tags are inserted along with the first letter of the tag into their respective columns. Since multiple tags are allowed this means we end up looping through and batching up all commands for each tag. 
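To make that per-tag fan-out concrete, here is a driver-free sketch of the writes a single upload generates for those two tables. The class, helper method, and sample values are hypothetical; the real handler binds these values into the prepared statements above instead of building strings:

```java
import java.util.*;

public class TagFanOut {
    // For one uploaded video, list the denormalized inserts the handler performs:
    // one videos_by_tag row per tag, plus one tags_by_letter row per tag.
    public static List<String> rowsFor(String videoId, Collection<String> tags) {
        List<String> rows = new ArrayList<>();
        for (String tag : tags) {
            rows.add("videos_by_tag(tag=" + tag + ", videoid=" + videoId + ")");
            rows.add("tags_by_letter(first_letter=" + tag.substring(0, 1) + ", tag=" + tag + ")");
        }
        return rows;
    }

    public static void main(String[] args) {
        // A hypothetical video with two tags produces four inserts in total.
        rowsFor("v-001", Arrays.asList("datastax", "graph")).forEach(System.out::println);
    }
}
```

Each extra tag adds two more writes, which is exactly the insert-side overhead that goes away once the searches run against the videos table with a Search index.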
Then, when a user starts typing into the search bar we have the following query to get our results:

SELECT tag FROM tags_by_letter WHERE first_letter = ? AND tag >= ?

This returns all tags that match the query string from our search. We loop through those results and send our tags back to the UI. Pretty simple. Remember I previously mentioned the first_letter column is the partition key and tag is the clustering column which handles data ordering. This is where this all comes into play. At this point I’d like to point out that we are working only with tags. Neither the name nor the description of any video is considered. Sure, we could add support for this in our data model and code if we really wanted to, but it is something we have to explicitly take into account if we want that capability. Tag searchOk, let’s move to the tag-based search. This one is pretty straightforward. Click on a tag button in the video details page and return all videos that have the same tag. This search is supported by the videos_by_tag table. Every time a video is added the related subscriber method in the VideoAddedHandlers class is called and an entry is made for each tag associated with the video. If a video has one tag there will be a single entry, if it has 5 there will be 5 entries, and so on. Note that the videos_by_tag table is optimized specifically for this task. If you click on a tag button the following query is executed:

SELECT * FROM videos_by_tag WHERE tag = ?

This returns all videos that match the tag provided in the query. We send these back to the UI which provides a list of videos for you to choose from. “More videos like this” searchThe related videos or “More videos like this” section is very similar to the tag search. The difference in this case is that instead of matching only the selected tag, this search will find videos that match all tags of the selected video. 
So, if my selected video has tags of “datastax”, “dsl”, “graph”, and “gremlin”, then the search will return videos that include any of those tags. It uses the same query as the tag search above. The only difference is we’ll perform a query for each tag that exists on the video and combine the results.

A couple things to point out
For one, notice how our tables and searches work in lock-step. The tables were created to support a particular set of searches or “questions” asked by our application UI, and our code supports whatever CRUD operations are needed to maintain the data we use for searches. Essentially, this was all purpose-made to fit our search needs exactly, in a denormalized fashion. This is quite different from how we may have handled things in the relational world. Also notice the number of operations needed, mostly on the insert end, to constantly populate the search-based tables with data when videos are added to the system. Now, we’re talking about Cassandra here, so this is not really that much of an issue, but there is overhead associated with those operations and the code needed to support it.

Why move to DSE Search?
So, what happens now when we want to expand our searches to include more fields separate from tag, provide more varied results, or enable advanced searches? Is there a way we could reduce the number of overall actions and code needed to support our searches and also speed things up? Lastly, can we do this in a way that does not require a whole rethink of our data model? Well, I think at this point you know exactly what I’m going to suggest, but you’ll have to wait until part 2 of this series for details. Oooo…suspense…I know….totally suspenseful. Until then (quite soon honestly), thanks for reading and I hope you got something useful out of part 1. Always feel free to add comments or contact me directly for any thoughts or questions. 
Take care :D","categories":[{"name":"Technical","slug":"Technical","permalink":"https://sonicdmg.github.io/categories/Technical/"}],"tags":[{"name":"killrvideo","slug":"killrvideo","permalink":"https://sonicdmg.github.io/tags/killrvideo/"},{"name":"datastax","slug":"datastax","permalink":"https://sonicdmg.github.io/tags/datastax/"},{"name":"search","slug":"search","permalink":"https://sonicdmg.github.io/tags/search/"},{"name":"DSE Search","slug":"DSE-Search","permalink":"https://sonicdmg.github.io/tags/DSE-Search/"}]},{"title":"Ok, yea, so maybe it's been a while since I posted","slug":"Ok-yea-so-maybe-it-s-been-a-while-since-I-posted","date":"2018-01-09T15:38:20.000Z","updated":"2018-01-09T19:14:06.000Z","comments":true,"path":"2018/01/09/Ok-yea-so-maybe-it-s-been-a-while-since-I-posted/","link":"","permalink":"https://sonicdmg.github.io/2018/01/09/Ok-yea-so-maybe-it-s-been-a-while-since-I-posted/","excerpt":"So…I’ve been busy. Quite busy since the last time I posted. Let’s see, I got married, added 2 greyhounds to the family, repaired things from Hurricane Irma, been digging into all things DataStax, and I have a child on the way (due March 15th). On the DataStax and KillrVideo front I added both a graph based recommendation engine with DSE Graph and recently DSE Search capability for all video searches in the Java version. We also have SparkSQL fun coming up here soon as well. All of the application code is available for folks to really do whatever they want with it. Feel free to leave comments, issues, or make pull requests if you have something fun to add. My main goal is for KillrVideo to be useful to folks trying to figure this stuff out. Once Spark gets into the mix KillrVideo will cover the 4 horsemen of DataStax Enterprise, namely,Cassandra, Graph, Search, and Analytics.","text":"So…I’ve been busy. Quite busy since the last time I posted. 
Let’s see, I got married, added 2 greyhounds to the family, repaired things from Hurricane Irma, been digging into all things DataStax, and I have a child on the way (due March 15th). On the DataStax and KillrVideo front I added both a graph-based recommendation engine with DSE Graph and, recently, DSE Search capability for all video searches in the Java version. We also have SparkSQL fun coming up here soon as well. All of the application code is available for folks to really do whatever they want with it. Feel free to leave comments, issues, or make pull requests if you have something fun to add. My main goal is for KillrVideo to be useful to folks trying to figure this stuff out. Once Spark gets into the mix, KillrVideo will cover the 4 horsemen of DataStax Enterprise, namely: Cassandra, Graph, Search, and Analytics.

And then…
…this battle station will be fully operational!!!! Mostly, except for some unfinished turbo lasers, maybe some panels, like a whole hemisphere worth, just some small items really.

Sooooo much cool stuff coming!
Seriously, new OpsCenter, Studio, and DSE everything updates are on the way along with a child, but that last one is not part of the normal development cycle….kind of a side project. Once I can talk about some of the new changes I’ll start posting and getting things worked into code. Until then I’ll be posting about much of what I’ve been up to this last year and passing on some things I’ve learned while wrapping my tendrils around all of this NoSQL distributed database stuff. Fun for all, I’m sure. 
;)","categories":[{"name":"Something Else","slug":"Something-Else","permalink":"https://sonicdmg.github.io/categories/Something-Else/"}],"tags":[{"name":"killrvideo","slug":"killrvideo","permalink":"https://sonicdmg.github.io/tags/killrvideo/"},{"name":"datastax","slug":"datastax","permalink":"https://sonicdmg.github.io/tags/datastax/"}]},{"title":"Don't block your Async calls","slug":"Don-t-block-your-Async-calls","date":"2017-04-17T14:31:42.000Z","updated":"2017-04-20T19:58:33.000Z","comments":true,"path":"2017/04/17/Don-t-block-your-Async-calls/","link":"","permalink":"https://sonicdmg.github.io/2017/04/17/Don-t-block-your-Async-calls/","excerpt":"Or rather I should be saying that to myself. So, TIL (today I learned) something simple yet profound while working with asynchronous programming and the DSE java driver. Ensure that you are properly iterating through your results when making an async call. You cannot simply iterate all of your rows using a for loop or something along the lines. Ok, well, technically you can, but if you have more rows than your fetch size the DSE java driver will throw a big fat error your way letting you know you are blocking within an async call. I should point that I am still somewhat new to working with asynchronous calls (yes, someone finally pulled up the rock I was under) so for you veterans this may be knowledge already gained from async NOOB 101. By the way, here is the error the driver threw at me (thank you for doing so DSE driver peeps).","text":"Or rather I should be saying that to myself. So, TIL (today I learned) something simple yet profound while working with asynchronous programming and the DSE java driver. Ensure that you are properly iterating through your results when making an async call. You cannot simply iterate all of your rows using a for loop or something along the lines. 
Ok, well, technically you can, but if you have more rows than your fetch size the DSE java driver will throw a big fat error your way letting you know you are blocking within an async call. I should point out that I am still somewhat new to working with asynchronous calls (yes, someone finally pulled up the rock I was under) so for you veterans this may be knowledge already gained from async NOOB 101. By the way, here is the error the driver threw at me (thank you for doing so DSE driver peeps).

Detected a synchronous call on an I/O thread, this can cause deadlocks or unpredictable behavior. This generally happens when a Future callback calls a synchronous Session method (execute() or prepare()), or iterates a result set past the fetch size (causing an internal synchronous fetch of the next page of results). Avoid this in your callbacks, or schedule them on a different executor.
com.datastax.driver.core.AbstractSession.checkNotInEventLoop(AbstractSession.java:206)
com.datastax.driver.core.ArrayBackedResultSet$MultiPage.prepareNextRow(ArrayBackedResultSet.java:310)
com.datastax.driver.core.ArrayBackedResultSet$MultiPage.isExhausted(ArrayBackedResultSet.java:269)
com.datastax.driver.core.ArrayBackedResultSet$1.hasNext(ArrayBackedResultSet.java:143)
com.datastax.driver.mapping.Result$1.hasNext(Result.java:102...

The reason is stated here. I’ll quote it just to be clear: “If you consume a ResultSet in a callback, be aware that iterating the rows will trigger synchronous queries as you page through the results. To avoid this, use getAvailableWithoutFetching to limit the iteration to the current page, and fetchMoreResults to get a future to the next page”. Even though I read this before I started into this code, I must have glossed over this concept the first time through, as my implementation was acting very strange indeed. Let’s look at a simple example. 
At this point in my code I already made an asynchronous call with session.executeAsync(), created a future, and returned my results into a callback. The following examples are within my callback. In the case below I mapped my results to the UserVideos entity and now I am iterating through those results to do something with each “userVideo” object. This…DOES NOT work and will throw the error I mentioned above. Ehem, I have a utility class to handle callbacks if you were wondering where that was. I wanted to keep the example nice and simple. Just know that by the time you see “.handle” we are within the callback.

FutureUtils.buildCompletableFuture(userVideosMapper.mapAsync(future))
    .handle((userVideos, ex) -> {
        try {
            if (userVideos != null) {
                for (UserVideos userVideo : userVideos) {
                    \"do something with userVideo here\"
                }

It seems so simple. I returned my results and now I want to iterate over those results and do something with them, but these aren’t synchronous calls that block until complete. I need to handle them properly from an asynchronous standpoint and only grab those results that have actually been returned. The rest I will need to fetch with more asynchronous calls. Again, this is demonstrated very clearly here in the Async paging section. This…is a snippet pulled from the working code using the example given from the async page I keep referencing. Now, I see how many items I have remaining without fetching, loop through the remaining items, and break out once I have exhausted the list. 
You may not see it in my example below, but once I “break;” I exit out and grab futures for any more items that may be left, rinse and repeat.

FutureUtils.buildCompletableFuture(userVideosMapper.mapAsync(future))
    .handle((userVideos, ex) -> {
        try {
            if (userVideos != null) {
                int remaining = userVideos.getAvailableWithoutFetching();
                for (UserVideos userVideo : userVideos) {
                    \"do something with userVideo here\"
                    if (--remaining == 0) {
                        break;
                    }
                }

The whole point of this post was to point out a potential “gotcha” with a very simple fix when dealing with asynchronous programming and the DSE driver for Java. This one tripped me up for a moment until I realized my mistake. Now that I know better my “futures” are looking bright indeed….see my joke there….ha….haha…..ha…awkward pause. Honestly, this simple change tightened all of my async code up. No more strange artifacts.","categories":[{"name":"Technical","slug":"Technical","permalink":"https://sonicdmg.github.io/categories/Technical/"}],"tags":[{"name":"TIL","slug":"TIL","permalink":"https://sonicdmg.github.io/tags/TIL/"},{"name":"async","slug":"async","permalink":"https://sonicdmg.github.io/tags/async/"},{"name":"blocking","slug":"blocking","permalink":"https://sonicdmg.github.io/tags/blocking/"},{"name":"java","slug":"java","permalink":"https://sonicdmg.github.io/tags/java/"},{"name":"killrvideo","slug":"killrvideo","permalink":"https://sonicdmg.github.io/tags/killrvideo/"}]},{"title":"Dropping in on my cluster","slug":"Dropping-in-on-my-cluster","date":"2017-02-13T16:27:17.000Z","updated":"2017-02-13T16:35:19.000Z","comments":true,"path":"2017/02/13/Dropping-in-on-my-cluster/","link":"","permalink":"https://sonicdmg.github.io/2017/02/13/Dropping-in-on-my-cluster/","excerpt":"","text":"Looks like my nodes are healthy. 
:)","categories":[{"name":"Aerial","slug":"Aerial","permalink":"https://sonicdmg.github.io/categories/Aerial/"}],"tags":[{"name":"datastax","slug":"datastax","permalink":"https://sonicdmg.github.io/tags/datastax/"},{"name":"opscenter","slug":"opscenter","permalink":"https://sonicdmg.github.io/tags/opscenter/"},{"name":"aerial","slug":"aerial","permalink":"https://sonicdmg.github.io/tags/aerial/"},{"name":"drop","slug":"drop","permalink":"https://sonicdmg.github.io/tags/drop/"}]},{"title":"Mixed Workload DSE Cluster with Raspberry PI's","slug":"Mixed-Workload-DSE-Cluster-with-Raspberry-PI-s","date":"2017-02-07T18:52:11.000Z","updated":"2017-02-10T21:04:38.000Z","comments":true,"path":"2017/02/07/Mixed-Workload-DSE-Cluster-with-Raspberry-PI-s/","link":"","permalink":"https://sonicdmg.github.io/2017/02/07/Mixed-Workload-DSE-Cluster-with-Raspberry-PI-s/","excerpt":"Alrighty, as I mentioned in my previous post I have a mixed-workload cluster (Cassandra, DSE search, DSE graph) using a combination of 4 Raspberry PI’s and my laptop. I had multiple things in mind when I started into this. Low cost for learning Something I can break and not cry about How low can one really go when setting up a cluster? Get some DSE OpsCenter knowledge These are Raspberry PI’s, they are just damn cool, so why not set up a cluster?!","text":"Alrighty, as I mentioned in my previous post I have a mixed-workload cluster (Cassandra, DSE search, DSE graph) using a combination of 4 Raspberry PI’s and my laptop. I had multiple things in mind when I started into this:
- Low cost for learning
- Something I can break and not cry about
- How low can one really go when setting up a cluster?
- Get some DSE OpsCenter knowledge
- These are Raspberry PI’s, they are just damn cool, so why not set up a cluster?!

PIE? Raspberries? 3.14? PI?
If you are not already familiar with RaspberryPI’s go take a look. These are cool little machines for very little cost. 
Here are the specs for the model 3, but just to summarize, each node in our cluster only has 1GB of RAM and a 1.2GHz quad-core processor. This is clearly not a setup to use in your production environment. This is, however, a great way to learn and experiment.

The setup
- 4 RaspberryPI’s (wired ethernet). NOTE: These are designed to be “built up” so you will need to purchase microSD cards and the outer shells separately
- One 2.6GHz 8-core MacBook with 16GB RAM (wired ethernet)
- DataStax Enterprise OpsCenter 6.0.7 using tarball installation
- DataStax Agent 6.0.7 using tarball installation
- DataStax Enterprise 5.0.5, again, using the tarball installation
- RAVPower 6-port USB charger with a set of USB Male A to Micro B cables

I also purchased a 4x1 HDMI switch, but you can easily run these headless if you know your way around a Linux shell. Just a quick note about the tarball installations. DSE generally supports using installers across most major platforms and for each of the various installs, but since we are using RaspberryPI’s in this case, and since I used the NOOBS install to get up and going as fast as I could, it just so happens that combination only works with tarball installs. Please, learn from me and don’t spend the many, many hours I did eventually figuring this out. You can, in fact, install other operating systems on RaspberryPI’s which may allow the installers to work, but you’ll have to come back and tell me about it if you give it a go. Aaaaaaand for the reveal dun dunn dunnnnnnnnn! (Yes, those are shiny, lighted cables)

I may have cheated just a little bit
So, I mentioned above “mixed workload cluster with Raspberry PI’s”. This is 100% true, but also notice there is a laptop in the mix. I ended up using the laptop to house OpsCenter and my search/graph datacenter and I’m using the PI’s for my core Cassandra cluster. 
In my experience I don’t usually install operations tools and the like directly on production devices facing the public because they have a tendency to cause unpredictable load. The limited RAM on the PI’s (1GB) is also a factor which I will address here in a moment. As far as the search/graph datacenter portion, I simply used the laptop out of convenience because I already had a 4-node core Cassandra cluster running on the PI’s at the time and I wanted to observe the interaction between my search/graph and Cassandra datacenters in a “pure” fashion.

Get this working on PI’s
Raspberry PI’s are special snowflakes when it comes to making this all work. I will detail all this below, but here, for you, is the summarized list of what is needed.

Disable swap
sudo swapoff -a is your friend. It is also the quick and dirty way, not permanent on reboot. Take a look at the “Disable swapping” section of this post if you would like a more permanent solution.

Go Headless
This is quite easy to do using sudo raspi-config while ssh’d into your PI and it will help free up enough memory to make things stable. Just make sure you already know your PI’s IP address or know how to find it if things go sour.

Decrease RAM allocated to datastax-agent from 128MB to 64MB
Edit datastax-agent-env.sh located in [your datastax-agent install dir]/conf/datastax-agent-env.sh

From:
JVM_OPTS=\"$JVM_OPTS -Xmx128M -Djclouds.mpu.parts.magnitude=100000 -Djclouds.mpu.parts.size=16777216\"

To:
JVM_OPTS=\"$JVM_OPTS -Xmx64M -Djclouds.mpu.parts.magnitude=100000 -Djclouds.mpu.parts.size=16777216\"

Explicitly set Java HEAP settings for Cassandra node
Edit cassandra-env.sh located in [your DSE install dir]/resources/cassandra/conf/cassandra-env.sh. Search for “HEAP” to find and edit the lines. Notice mine are already set to the working values. These are commented out by default, which will allow the script to calculate values for you, but for our PI case we cannot use the calculated values. 
MAX_HEAP_SIZE=\"200M\"
HEAP_NEWSIZE=\"50M\"

Store collection data on a separate cluster
Remember that part about “cheating” with my OpsCenter laptop? Yup, this is part of it. I’ll give more details down below.

Let’s talk about OpsCenter, agents, memory, disk speed, and mucho frustrationo
That’s kind of a long list now that I see it all typed out, but there’s a lot to consider. We have nodes with 1GB of RAM, a 32-bit OS, 4 cores, and a 32GB microSD card acting as a hard disk. This is way below the recommended values of 16-32GB of RAM, 500GB-1TB of fast disk, and 64-bit with 8 cores for running Cassandra nodes, or really any database for that matter. I wasn’t really sure how well this would work, if at all, given memory constraints alone, not to mention the speed of microSD’s for a database that is known to need very fast disk.

OpsCenter
I decided right off the bat I wanted OpsCenter in the mix. Part of this whole project was to learn, and what better way than to go whole hog and see what it could do. If you use OpsCenter you must install DataStax agents on each of your nodes in order for magic to happen. That magic comes at a memory cost, not a huge one, but one that matters when only dealing with 1GB of RAM. In order to leave enough room for the Cassandra node itself to run I effectively cut this requirement in half. So far, after months of running, I have not seen an issue running agents at 64MB of RAM.

Cassandra memory
The default auto-calculated memory configuration for DSE managed Cassandra nodes works well enough even on the PI’s, but there’s a catch. Remember those agents we need for OpsCenter? Well, it turns out the agents need just enough extra memory to push things over the edge, and on a system with no swap file this means page faults, which is exactly what happened. 
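For reference, on Raspbian swap is managed by the dphys-swapfile service, so a more permanent version of the swapoff step from earlier might look something like the sketch below. This assumes a stock Raspbian/NOOBS image with the default dphys-swapfile setup; other Pi operating systems manage swap differently:

```shell
# Turn swap off right now (on its own this does not survive a reboot)
sudo dphys-swapfile swapoff

# Keep it off across reboots by disabling the managing service...
sudo systemctl disable dphys-swapfile

# ...or by setting the swap size to zero in its config file
sudo sed -i 's/^CONF_SWAPSIZE=.*/CONF_SWAPSIZE=0/' /etc/dphys-swapfile
```

Either of the last two steps works; disabling the service is the simpler one to reverse later if you want swap back.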
I tried to quash every little process I could to free up enough RAM for my nodes to remain stable, and I even made them headless, but to remain stable I had to explicitly configure the HEAP settings for my Cassandra nodes. The end result is listed up above.

Headless
So, before I went headless things were working…uhh…well enough. Not well enough that I could leave it alone, really, and any time the system was put under stress !BAM! I would lose a node. This ended up being the clincher. I noticed the Raspbian UI itself was eating up just enough RAM to prevent my nodes from allocating more in times of need. I chopped off their heads and since then, along with the other changes I made, my nodes have been rock solid on the memory front.

Memory is good, how about disk?
We already talked about the swap file. Not only is it strongly recommended to disable swap on nodes running Cassandra, but even more so on PI’s running on microSD’s. Before I did so it was clear my nodes were struggling, as even small tasks kept driving load up, and upon some inspection it was obvious I/O was mostly to blame. However, something else was lurking even after I disabled swap. At times I would see my nodes shoot up from a load of <1 upwards to 10+. At this point they would usually become unresponsive and either crash or eventually come back to reality, but always, always under heavy load.

Colllleccccctionnnnnn Daaaaatttttaaaaa
Sorry, couldn’t help myself. As stated above, move collection data storage off the PI’s onto a separate cluster. They simply cannot handle all of the I/O associated with collecting, storing, and repairing the collection data from the rollup* tables. Compaction was happening way too fast for the nodes to keep up, most likely a result of having very little memory to work with, the PI’s could not keep up with the amount of collection data itself, and read repair on the rollup* tables was a constant, never-ending stream of repair. 
Once I made the switch my PI nodes all quieted down to a normal load around 1, things have stabilized, and I no longer have gaps in my analytics data (except for when I HULK SMASH nodes myself for fun).

Finally, the mixed workload part
Yup, right there in the title and all, and I haven’t really mentioned it. Part of the reason I went and did all of this, aside from learning and seeing what could be done, was to extend Luke Tillman’s *cough*….shame…less..plug *cooouugh* freaking awesome KillrVideo reference app to hook up to clusters outside of its Dockerized container. This forced me to extend my existing Cassandra cluster into a mixed workload scenario with DSE Search. Right, I could have simply put search within the same cluster, but I was looking to emulate what I would do in a production scenario. I have an upcoming post on this very topic coming here in the future. I also had a need to extend into DSE Graph as well for some of my own projects, so I took the opportunity to go ahead and just do it all. The end result is a fully functional Cassandra/Search/Graph DSE managed mixed workload cluster being served up mostly on Raspberry PI’s, with a little help from a laptop, all hooked up to KillrVideo. One last thing before I go. I find that tailing the agent.log, opscenterd.log, and system.log files from all of the nodes is quite insightful, especially when watching the interaction between the nodes when performing regular CQL, search, and graph queries. I’m also the type of person who can watch a defrag for hours and find every little box color change to be useful information. 
Not sure what that says about me.","categories":[{"name":"Technical","slug":"Technical","permalink":"https://sonicdmg.github.io/categories/Technical/"}],"tags":[{"name":"killrvideo","slug":"killrvideo","permalink":"https://sonicdmg.github.io/tags/killrvideo/"},{"name":"raspberry PI","slug":"raspberry-PI","permalink":"https://sonicdmg.github.io/tags/raspberry-PI/"},{"name":"cluster","slug":"cluster","permalink":"https://sonicdmg.github.io/tags/cluster/"}]},{"title":"I'm Sure You Weren't Looking","slug":"I-m-Sure-You-Weren-t-Looking","date":"2017-02-07T18:51:11.000Z","updated":"2017-02-07T21:56:00.000Z","comments":true,"path":"2017/02/07/I-m-Sure-You-Weren-t-Looking/","link":"","permalink":"https://sonicdmg.github.io/2017/02/07/I-m-Sure-You-Weren-t-Looking/","excerpt":"","text":"Hi there and welcome to my blog. As the title suggests I am pretty sure you had no idea this blog or page even existed. This is most likely due to the fact that I had not published anything until….just now. Wow, you are like…. THE FIRST PERSON HERE OMG! Seriously though, thanks for taking a look and make sure to come back and check out my other posts as I muse on things ranging from mishaps while up on silks to technical discussions on distributed database clusters. In my next post, I’m going to bring you through my experience setting up a Raspberry PI mixed-workload cluster using Cassandra. See ya :)","categories":[{"name":"Something Else","slug":"Something-Else","permalink":"https://sonicdmg.github.io/categories/Something-Else/"}],"tags":[{"name":"hi there","slug":"hi-there","permalink":"https://sonicdmg.github.io/tags/hi-there/"},{"name":"welcome","slug":"welcome","permalink":"https://sonicdmg.github.io/tags/welcome/"},{"name":"fun times","slug":"fun-times","permalink":"https://sonicdmg.github.io/tags/fun-times/"},{"name":"OMG","slug":"OMG","permalink":"https://sonicdmg.github.io/tags/OMG/"}]}]}