Thank you for your interest in contributing to open-source-legislation! We welcome contributions of all kinds, including new features, bug fixes, documentation improvements, and especially new jurisdiction scrapers. This document will guide you through the process of contributing to the project.
- **Fork the Repository:**
  - Fork the repository on GitHub.
  - Clone your forked repository to your local machine.
  - Create a new branch for your work.
- **Set Up Your Development Environment:**
  - Create a virtual environment.
  - Install dependencies from `requirements.txt`.
  - Set `PYTHONPATH` to the repository root, for example:

    ```bash
    export PYTHONPATH=/path/to/open-source-legislation:$PYTHONPATH
    ```

  - Set up the database (refer to the `README.md` for instructions).
- **Make Your Changes:**
  - Ensure your code follows our coding standards.
  - Write or update tests as necessary.
  - Document your changes in the code and in the `docs/` directory.
- **Submit a Pull Request:**
  - Push your changes to your forked repository.
  - Submit a pull request to the `main` branch of the original repository.
  - Provide a clear and descriptive title and description for your pull request.
To add a new jurisdiction scraper, follow these detailed steps:
- Find the statutes web page.
- Find the Table of Contents page.
  - This is where all the `top_level_titles` live.
- Find the first section and understand the path to reach it.
  - Follow the path from top to bottom (usually Title -> Chapter -> Section links).
- Determine the format of the legislation (HTML, PDF, JavaScript-rendered, or mixed).
- Understand the level hierarchy.
  - Identify the levels (e.g., Title -> Chapter -> Section).
  - Determine whether the hierarchy and its order are consistent.
- Examine the HTML structure.
  - Identify containers and unique IDs for each level.
  - Note any specific data, such as node names or links, that is well tagged in the HTML.
- Identify reserved language in the node names.
  - Start a list of terms indicating reserved sections (e.g., "repealed", "reserved", "renumbered"), as in the sketch below.
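  A minimal sketch of such a check; the term list and the `is_reserved` helper name are illustrative, not part of the template:

  ```python
  # Illustrative reserved-node check; extend RESERVED_KEYWORDS as you
  # discover new reserved language for the jurisdiction.
  RESERVED_KEYWORDS = ["repealed", "reserved", "renumbered"]

  def is_reserved(node_name: str) -> bool:
      """Return True if the node name indicates a reserved section."""
      lowered = node_name.lower()
      return any(keyword in lowered for keyword in RESERVED_KEYWORDS)
  ```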
- Find the country/state code for the jurisdiction.
- Clone the `SCRAPE_TEMPLATE` directory.
- Rename the template files (e.g., `readTEMPLATE`, `scrapeTEMPLATE`, `processTEMPLATE`).
- Update the global variables in the renamed files (see the sketch below):
  - Set `TABLE_NAME` to the jurisdiction's `code_node` name (e.g., `ca_node` for California).
  - Set `TOC_URL` to the Table of Contents URL.
  - Set `BASE_URL` to the base URL of the jurisdiction's website.
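  For example, with hypothetical values (the URLs are placeholders, not a real legislature site):

  ```python
  # Hypothetical global variables for a California scraper.
  TABLE_NAME = "ca_node"
  TOC_URL = "https://www.example-state-legislature.gov/statutes/toc"  # placeholder
  BASE_URL = "https://www.example-state-legislature.gov"  # placeholder
  ```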
- Create a table in your database (e.g., using Postico).
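  If you prefer to create the table from code, here is a hedged sketch using psycopg2; the column set and types are assumptions based on the fields mentioned in this guide, and the connection string is a placeholder:

  ```python
  # Sketch: create the node table. Adjust columns and types to match the
  # project's actual schema (e.g., an embedding column for pgvector).
  import psycopg2

  conn = psycopg2.connect("dbname=legislation user=postgres")  # placeholder
  with conn, conn.cursor() as cur:
      cur.execute("""
          CREATE TABLE IF NOT EXISTS ca_node (
              id TEXT PRIMARY KEY,
              node_name TEXT,
              node_link TEXT,
              node_text TEXT,
              node_tags JSONB,
              node_direct_children JSONB,
              node_siblings JSONB
          );
      """)
  conn.close()
  ```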
- Navigate to the Table of Contents page.
- Open the `read(STATE).py` file.
- Create a `data` folder to store scraped data.
- Inspect the Table of Contents page to identify the HTML container for top-level title links.
- Set this container in BeautifulSoup.
- Iterate over each `top_level_title` link to extract the link and additional information if necessary.
- Save this information to `data/top_level_titles.txt` (see the sketch below).
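  A minimal sketch of this read step, assuming `requests` and BeautifulSoup; the URL and the container id are placeholders to replace after inspecting the real page:

  ```python
  # Sketch: collect top-level title links from the Table of Contents.
  import requests
  from bs4 import BeautifulSoup

  TOC_URL = "https://www.example-state-legislature.gov/statutes/toc"  # placeholder

  response = requests.get(TOC_URL)
  soup = BeautifulSoup(response.text, "html.parser")
  container = soup.find("div", id="toc-container")  # placeholder container

  with open("data/top_level_titles.txt", "w") as f:
      for link in container.find_all("a"):
          f.write(link["href"] + "\n")  # one top-level title URL per line
  ```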
- Open the `scrape.py` file.
- Read in the top-level titles from the `.txt` file.
- Iterate over each URL and determine the scraping method to use based on the legislation's properties and HTML structure.
- Choose a scraping method (Regular, Recursive, or Stack) and delete the unused methods.
- Implement the chosen method:
  - **Regular:** Create a scrape function for each level (`scrape_level`).
    - In each function, create a BeautifulSoup object for the parent container of the level elements.
    - Iterate over each level element to extract node information.
    - Check for reserved nodes and handle special formatting.
    - Insert nodes into the database and handle duplicates.
    - Call the next scrape function as needed. (A sketch of this pattern follows.)
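  A minimal sketch of the Regular pattern; `insert_node`, `scrape_chapter`, and the selectors are hypothetical placeholders for the template's real helpers:

  ```python
  # Sketch: scrape one level (titles), then descend to the next level.
  import requests
  from bs4 import BeautifulSoup

  def scrape_title(url: str) -> None:
      soup = BeautifulSoup(requests.get(url).text, "html.parser")
      container = soup.find("div", id="title-list")  # placeholder container
      for element in container.find_all("a"):
          node_name = element.get_text(strip=True)
          node_link = element["href"]
          if is_reserved(node_name):  # reserved-node check sketched earlier
              continue
          insert_node(node_name, node_link)  # hypothetical DB helper
          scrape_chapter(node_link)  # hypothetical next-level function
  ```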
  - **Recursive:** Implement the recursive scraping function (`scrape_structure`).
    - Handle each level by calling the function recursively until the section level is reached.
    - Extract section information and insert it into the database. (See the recursive sketch below.)
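  A minimal sketch of the recursive pattern; `is_section_page`, `insert_node`, and the selector are hypothetical placeholders:

  ```python
  # Sketch: recurse through structure levels until sections are reached.
  def scrape_structure(url: str) -> None:
      soup = BeautifulSoup(requests.get(url).text, "html.parser")
      if is_section_page(soup):  # hypothetical leaf-level check
          insert_node(soup.find("h1").get_text(strip=True), url)
          return
      for element in soup.select("div.toc a"):  # placeholder selector
          insert_node(element.get_text(strip=True), element["href"])
          scrape_structure(element["href"])  # descend one level
  ```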
  - **Stack:** Implement the stack-based scraping function.
    - Handle structure nodes on a single page using a stack data structure.
    - Extract section information and insert it into the database. (See the stack-based sketch below.)
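  A minimal sketch of the stack-based pattern for a page that lists every level in document order; `level_of` and `insert_node` are hypothetical helpers, and the selector is a placeholder:

  ```python
  # Sketch: track the current ancestry with a stack while walking the page.
  def scrape_page(url: str) -> None:
      soup = BeautifulSoup(requests.get(url).text, "html.parser")
      stack: list[str] = []  # names of the current node's ancestors
      for element in soup.select("div.code-body > *"):  # placeholder selector
          depth = level_of(element)  # hypothetical: 1=title, 2=chapter, 3=section
          while len(stack) >= depth:
              stack.pop()  # climb back up to this element's parent level
          name = element.get_text(strip=True)
          insert_node(name, url, parent=" / ".join(stack))  # hypothetical helper
          stack.append(name)
  ```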
- Add print statements to debug each level and node.
- Incrementally test by adding one level at a time.
- Check for common errors (e.g., mis-tagged reserved nodes, incorrect links, formatting issues).
- Perform thorough spot checks to ensure data integrity.
- Run the `process.py` file to generate vector embeddings for all valid nodes.
- Update the `node_direct_children` and `node_siblings` fields using SQL queries (see the sketch below).
- Ensure all fields are of the correct type (e.g., `node_text` as `vector`, `node_tags` as `jsonb`).
- Optionally run additional processing modules (e.g., definition extraction, reference extraction).
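  A hedged sketch of those SQL updates with psycopg2, assuming each row stores its parent's id in a `parent` column (the actual schema may differ):

  ```python
  # Sketch: populate node_direct_children and node_siblings.
  import psycopg2

  conn = psycopg2.connect("dbname=legislation user=postgres")  # placeholder
  with conn, conn.cursor() as cur:
      # Aggregate each parent's children ids into node_direct_children.
      cur.execute("""
          UPDATE ca_node AS p
          SET node_direct_children = c.children
          FROM (SELECT parent, jsonb_agg(id) AS children
                FROM ca_node GROUP BY parent) AS c
          WHERE p.id = c.parent;
      """)
      # Siblings: the parent's other children, excluding the node itself.
      cur.execute("""
          UPDATE ca_node AS n
          SET node_siblings = (SELECT jsonb_agg(s.id)
                               FROM ca_node AS s
                               WHERE s.parent = n.parent AND s.id <> n.id);
      """)
  conn.close()
  ```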
- Connect to your local database server and back it up.
- Disconnect from the local server and connect to Supabase (Postgres).
- Restore the database from the backup.
All code contributions should follow these coding standards:
- Follow PEP 8 guidelines for Python code.
- Use meaningful variable names and consistent naming conventions.
- Add type annotations for function arguments and return types.
- Document all functions and classes using docstrings.
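For example, a function in the expected style (the function and the `fetch_node` helper are hypothetical, shown only to illustrate the conventions):

```python
# Illustrative style example: PEP 8 naming, type annotations, docstring.
def get_node_text(node_id: str, max_length: int = 1000) -> str:
    """Return the text of a node, truncated to max_length characters.

    Args:
        node_id: Unique identifier of the legislation node.
        max_length: Maximum number of characters to return.
    """
    text = fetch_node(node_id).text  # hypothetical DB helper
    return text[:max_length]
```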
If you encounter any issues or have suggestions for improvements, please report them through the GitHub issue tracker. Provide as much detail as possible to help us understand and address the issue.
Please read and follow our Code of Conduct to create a welcoming and inclusive environment for all contributors.
Thank you for contributing to open-source-legislation! Together, we can build a comprehensive and reliable database of legislative information.