Plan of study scraper #23

Open: denniwang wants to merge 10 commits into main from plan-of-study-scraper

denniwang

Scrapes plans of study. Given a plan-of-study HTML table, the scraper extracts every properly formatted class. Automated scraping over a list of URLs still needs to be implemented.

@denniwang denniwang requested a review from KobeZ123 February 5, 2025 13:46
todo: need to ignore empty json files (web pages with no plan of study)
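
A minimal sketch of how that check might look, assuming "empty" means the parsed JSON is `null`, an empty array, or an object with no keys. `isEmptyJson` is a hypothetical helper, not code from this PR, and the actual shape of the scraper's output is not shown in this conversation.

```typescript
// Hypothetical helper: decides whether a parsed JSON value carries no data.
// Assumption: a page without a plan of study yields null, [] or {}.
function isEmptyJson(value: unknown): boolean {
  if (value === null || value === undefined) return true;
  if (Array.isArray(value)) return value.length === 0;
  if (typeof value === "object" && value !== null) {
    return Object.keys(value).length === 0;
  }
  // Primitives (strings, numbers, booleans) count as content.
  return false;
}

// Usage sketch: keep only results that contain something before writing files.
// The sample values are made up for illustration.
const parsedResults: unknown[] = [{}, [], { courses: ["CS 2500"] }];
const nonEmptyResults = parsedResults.filter((r) => !isEmptyJson(r));
```

Filtering before the write step avoids creating the file at all, which is simpler than deleting empty files afterwards.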
src/urls/urls.ts Outdated
"https://nextcatalog.northeastern.edu/",
startYear === currentYear
? ""
: `archive/${startYear}-${startYear + 1}/#planofstudytext`,

Is this the same code as scrapeMajorLinks()? If so, and only the suffix changes, we can merge the two methods by adding something like a suffix parameter:
scrapeMajorLinks('') and scrapeMajorLinks('#plan-of-study')
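
A minimal sketch of that suggestion, assuming the two methods differ only in the URL fragment. `buildCatalogUrl` and its `suffix` parameter are hypothetical names, not code from this PR. The base URL and archive path come from the snippet under review, and the example uses that snippet's `#planofstudytext` fragment. Unlike the snippet, the sketch appends the suffix for the current year as well, which is an assumption about the intended behavior.

```typescript
// Hypothetical helper: builds a catalog URL for a given start year.
// The current year lives at the site root; earlier years live under archive/.
function buildCatalogUrl(
  startYear: number,
  currentYear: number,
  suffix: string = "",
): string {
  const base = "https://nextcatalog.northeastern.edu/";
  const archivePath =
    startYear === currentYear ? "" : `archive/${startYear}-${startYear + 1}/`;
  return `${base}${archivePath}${suffix}`;
}

// Usage sketch: one function serves both callers.
const majorLinksUrl = buildCatalogUrl(2022, 2024);
const planLinksUrl = buildCatalogUrl(2022, 2024, "#planofstudytext");
```

With a default of `""`, existing callers that do not need a fragment would not have to change.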

Just a note: we don't want these files in the final version.

//PhaseLabel.ScrapeMajorLinks,
//scrapeMajorPlanLinks(year, currentYear),
//)
//.then(addPhase(spin, PhaseLabel.Classify, classify))

Are the classify and tokenize phases removed? It would be useful to log the scraped tokens before parsing.
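
One possible way to log intermediate results without changing what flows through the pipeline, sketched as a generic pass-through helper for a promise chain. `tap` is a hypothetical helper, not part of this repository, and the token type is left generic because the tokenizer's output shape is not shown in this conversation.

```typescript
// Hypothetical helper: runs a side effect on a value, then returns the value unchanged.
function tap<T>(effect: (value: T) => void): (value: T) => T {
  return (value: T): T => {
    effect(value);
    return value;
  };
}

// Usage sketch: record tokens between two steps of a promise chain.
// The token strings are made up for illustration.
const loggedTokens: string[][] = [];
const tokenCount: Promise<number> = Promise.resolve(["CS 2500", "CS 2510"])
  .then(tap((tokens: string[]) => { loggedTokens.push(tokens); }))
  .then((tokens) => tokens.length);
```

Because the helper returns its input, it can be dropped between any two `.then` steps and removed later without touching the surrounding phases.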

@denniwang denniwang force-pushed the plan-of-study-scraper branch from bed9498 to 04b755c Compare March 11, 2025 17:11
2 participants