Plan of study scraper #23
base: main
Conversation
Todo: need to ignore empty JSON files (web pages with no plan of study).
src/urls/urls.ts
Outdated
"https://nextcatalog.northeastern.edu/", | ||
startYear === currentYear | ||
? "" | ||
: `archive/${startYear}-${startYear + 1}/#planofstudytext`, |
Is this the same code as scrapeMajorLinks()? If so, and only the suffix changes, we can abstract the two methods with something like a suffix parameter:
scrapeMajorLinks('') and scrapeMajorLinks('#plan-of-study')
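A minimal sketch of that abstraction, assuming the two methods differ only in the URL suffix (the function and parameter names here are hypothetical, not from the PR):

```typescript
// Hypothetical sketch: a single URL builder parameterized by suffix, so
// scrapeMajorLinks and the plan-of-study variant can share one code path.
function buildCatalogUrl(
  startYear: number,
  currentYear: number,
  suffix: string
): string {
  const base = "https://nextcatalog.northeastern.edu/";
  // Current-year pages live at the catalog root; older years are archived.
  const path =
    startYear === currentYear
      ? ""
      : `archive/${startYear}-${startYear + 1}/`;
  return `${base}${path}${suffix}`;
}

// The two scrapers would then call, e.g.:
//   buildCatalogUrl(year, currentYear, "")
//   buildCatalogUrl(year, currentYear, "#planofstudytext")
```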
Just a note: we don't want these files in the final version.
src/runtime/index.ts
Outdated
//PhaseLabel.ScrapeMajorLinks,
//scrapeMajorPlanLinks(year, currentYear),
//)
//.then(addPhase(spin, PhaseLabel.Classify, classify))
Are the classify and tokenize phases removed? It would be useful to log the tokens scraped before parsing.
I think there might be an issue where it scrapes everything twice.
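One way to log intermediate tokens without restructuring the phase chain is a pass-through tap that can sit between two `.then()` calls (this helper is a hypothetical sketch, not part of the PR's `addPhase` API):

```typescript
// Hypothetical pass-through logger: prints a labeled preview of an
// intermediate pipeline value and returns it unchanged, so it can be
// dropped between phases, e.g. .then(tokenize).then(logPhase("tokens")).then(parse)
function logPhase<T>(label: string): (value: T) => T {
  return (value: T) => {
    // Truncate the preview so large token arrays don't flood the log.
    console.log(`[${label}]`, JSON.stringify(value).slice(0, 200));
    return value;
  };
}
```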
Todo: scrape the note on the plan of study as well (e.g. https://catalog.northeastern.edu/archive/2022-2023/undergraduate/science/physics/applied-physics-bs/#planofstudytext).
Force-pushed from bed9498 to 04b755c
Scrapes plans of study. Still need to implement automated scraping for the list of URLs. Given a plan-of-study HTML table, it can scrape all properly formatted classes.