Too messy commit structure to merge, check other PR #14
base: master
Conversation
Updating scrapers before introducing keywords functionality
The functionality is there and good, it's just not implemented in the right place. It needs a small rework so that future developers can understand the code quickly.
For bonus points: use a fixup! commit to remove any errant code like just_scrape.py so that it doesn't clutter the repo.
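A sketch of the fixup! workflow this suggests, using `git commit --fixup` plus an autosquash rebase. The repository contents below are made up for the demo; only the file name just_scrape.py comes from the PR:

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

# Commit 1: the real work, which accidentally includes just_scrape.py
echo "print('scratch')" > just_scrape.py
echo "class Scraper: pass" > scraper_kickstarter.py
git add .
git commit -q -m "Add keyword search to scraper"
target=$(git rev-parse HEAD)

# Fixup commit: remove the errant file, marked for squashing into commit 1
git rm -q just_scrape.py
git commit -q --fixup "$target"

# Autosquash replays history with the fixup folded into its target commit;
# the no-op sequence editor accepts the reordered todo list as-is
GIT_SEQUENCE_EDITOR=: git rebase -i --autosquash --root

git log --oneline   # history is back to a single clean commit
```

After the rebase the "fixup!" commit is gone and just_scrape.py never appears in the published history.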
forward43/scraper_kickstarter.py
Outdated
url_term_list = []

for term in keywords:
    search_term = 'term=' + term.replace(' ', '+')
    random_seed = 'seed=' + str(random.randint(1, 65536))
    get_params = self.default_url_params + [f"category_id={category}", f"page={page}", random_seed, search_term]  # added a keyword parameter to the url
    url = self.base_url + '&'.join(get_params)
    url_term_list.append(url)

return url_term_list
This is a good idea, however it's implemented in an illogical place: the function you've placed it in is called get_url, which is singular. Anyone who reads this without having been part of this PR will (rightfully) assume that this function returns exactly one URL, and the docstring will validate that belief. When it gets used, they will get a list of URLs.
I'm not sure whether we want to do a search of each combination of category_id and search_term, but I haven't had enough of a dive into Kickstarter data to know what to decide on that front.
I'd place this in another function like get_urls_by_keyword and extend the scrape script to do both a loop of the original crawl-by-category_id functionality and a loop of searching by keyword with (or without) category_id.
forward43/scraper_kickstarter.py
Outdated
except Exception as e:
    self.logger.exception('Failed to get projects from current page')

self.write_to_file(projects, str(category))
formatting: keep whitespace out of empty lines
forward43/scraper_kickstarter.py
Outdated
url_term_list = self.get_url(category, page)

print(url_term_list)
use the logger if you want to log what it's doing; otherwise remove the print statement
PS: please remember to request a review from someone (top of the right sidebar on GitHub) so that they know they should read your code and comment on it; otherwise no one knows it's ready to be reviewed
No description provided.