Plan of study scraper #23
base: main
Conversation
Todo: need to ignore empty JSON files (web pages with no plan of study).
src/urls/urls.ts
Outdated
"https://nextcatalog.northeastern.edu/", | ||
startYear === currentYear | ||
? "" | ||
: `archive/${startYear}-${startYear + 1}/#planofstudytext`, |
Is this the same code as scrapeMajorLinks()? If so, and only the suffix changes, we can abstract the two methods with something like a suffix parameter:
scrapeMajorLinks('') and scrapeMajorLinks('#plan-of-study')
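A minimal sketch of that abstraction, assuming the two methods differ only in the URL suffix (the function and parameter names here are hypothetical, not from the PR):

```typescript
// Hypothetical sketch: a single URL builder parameterized by suffix, so
// scrapeMajorLinks and the plan-of-study variant can share one code path.
function buildCatalogUrl(
  startYear: number,
  currentYear: number,
  suffix: string
): string {
  const base = "https://nextcatalog.northeastern.edu/";
  // Current-year pages live at the catalog root; older years are archived.
  const path =
    startYear === currentYear
      ? ""
      : `archive/${startYear}-${startYear + 1}/`;
  return `${base}${path}${suffix}`;
}

// The two scrapers would then call, e.g.:
//   buildCatalogUrl(year, currentYear, "")
//   buildCatalogUrl(year, currentYear, "#planofstudytext")
```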
Just a note: we don't want these files in the final version.
src/runtime/index.ts
Outdated
//PhaseLabel.ScrapeMajorLinks,
//scrapeMajorPlanLinks(year, currentYear),
//)
//.then(addPhase(spin, PhaseLabel.Classify, classify))
Are the classify and tokenize phases removed? It would be useful to log the tokens scraped before parsing.
I think there might be an issue where it scrapes everything twice.
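One way to log intermediate tokens without restructuring the phase chain is a pass-through tap that can sit between two `.then()` calls (this helper is a hypothetical sketch, not part of the PR's `addPhase` API):

```typescript
// Hypothetical pass-through logger: prints a labeled preview of an
// intermediate pipeline value and returns it unchanged, so it can be
// dropped between phases, e.g. .then(tokenize).then(logPhase("tokens")).then(parse)
function logPhase<T>(label: string): (value: T) => T {
  return (value: T) => {
    // Truncate the preview so large token arrays don't flood the log.
    console.log(`[${label}]`, JSON.stringify(value).slice(0, 200));
    return value;
  };
}
```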
Todo: scrape the note on the plan of study as well (e.g. https://catalog.northeastern.edu/archive/2022-2023/undergraduate/science/physics/applied-physics-bs/#planofstudytext).
Force-pushed from bed9498 to 04b755c
Scrapes plans of study. Still need to implement automated scraping for the list of URLs. Given a plan-of-study HTML table, it can scrape all properly formatted classes.