Merged
31 changes: 31 additions & 0 deletions CODE_OF_CONDUCT.md
@@ -0,0 +1,31 @@
# Code of Conduct

## Our Pledge

We are committed to providing a welcoming and inclusive environment for everyone. All participants in Lanka Data Foundation projects are expected to uphold the highest standards of professional and respectful conduct.

## Our Standards

Examples of behavior that contributes to a positive environment:

- Being respectful and considerate in all interactions
- Using welcoming and inclusive language
- Accepting constructive feedback gracefully
- Focusing on what is best for the community
- Showing empathy towards other community members

Examples of unacceptable behavior:

- Discriminatory, harassing, or harmful behavior
- Trolling, insulting, or derogatory comments
- Public or private harassment
- Publishing others' private information without permission
- Other conduct that could reasonably be considered inappropriate

## Enforcement

Instances of unacceptable behavior may be reported by contacting the project team at [[email protected]](mailto:[email protected]). All complaints will be reviewed and investigated promptly and fairly.

## Attribution

This Code of Conduct is adapted from the [Apache Code of Conduct](https://www.apache.org/foundation/policies/conduct.html).
129 changes: 129 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,129 @@
# Contributing Guidelines

Thank you for your interest in contributing to this project! We welcome contributions from everyone. This document provides guidelines and best practices for contributing.


## Code of Conduct

Please read our [Code of Conduct](CODE_OF_CONDUCT.md) before contributing. By participating in this project, you agree to maintain a respectful and inclusive environment for everyone.

## How to Contribute

There are many ways to contribute to this project:

- **Report Bugs**: Submit bug reports with detailed information
- **Suggest Features**: Propose new features or improvements
- **Improve Documentation**: Fix typos, clarify explanations, add examples
- **Submit Code**: Fix bugs or implement new features
- **Review Pull Requests**: Help review and test contributions from others

## Getting Started

### Prerequisites

- Python 3.8+
- DeepSeek LLM access
- Git

### Development Setup

<!-- Provide step-by-step instructions to set up the development environment -->
<!-- Example:
1. Fork the repository
2. Clone your fork: `git clone https://github.com/your-username/project.git`
3. Install dependencies: `pip install -r requirements.txt`
4. Create a branch: `git checkout -b feature/your-feature-name`
-->

## Making Changes

### Branching Strategy

- `feature/` - for new features
- `fix/` - for bug fixes
- `docs/` - for documentation changes


Create a topic branch from the main branch for your changes:

```bash
git checkout -b feature/your-feature-name
```

### Commit Messages

Write clear and meaningful commit messages. We recommend following this format:

```
[TYPE] Short description (max 50 chars)

Longer description if needed. Explain the "why" behind the change,
not just the "what". Reference any related issues.

Fixes #123
```

**Types**: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`
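
The format above can be checked mechanically before a commit is pushed. The following is a minimal sketch, not part of this project's tooling: the function name `check_commit_message` is hypothetical, and it assumes that the type is written literally inside square brackets (for example `[fix]`) and that the 50-character limit applies to the whole subject line.

```python
import re
from typing import List

# Types listed in the guidelines above.
COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")
MAX_SUBJECT_LENGTH = 50

# Assumption: the type appears literally in square brackets, e.g. "[fix] ...".
SUBJECT_PATTERN = re.compile(r"^\[(?P<type>[a-z]+)\] (?P<summary>\S.*)$")


def check_commit_message(message: str) -> List[str]:
    """Return a list of problems found in a commit message (empty if none)."""
    problems = []
    lines = message.splitlines()
    subject = lines[0] if lines else ""

    match = SUBJECT_PATTERN.match(subject)
    if match is None:
        problems.append("subject should look like '[type] Short description'")
    elif match.group("type") not in COMMIT_TYPES:
        problems.append(f"unknown type '{match.group('type')}'")

    # Assumption: the limit covers the whole subject line, prefix included.
    if len(subject) > MAX_SUBJECT_LENGTH:
        problems.append(f"subject is longer than {MAX_SUBJECT_LENGTH} characters")

    # A body, if present, is separated from the subject by a blank line.
    if len(lines) > 1 and lines[1].strip():
        problems.append("second line should be blank")

    return problems
```

A function like this could be called from a local `commit-msg` Git hook, which receives the path of the message file as its first argument.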

### Coding Standards

- Follow PEP 8 for Python code
- Run `black` for formatting
- Run `flake8` for linting


### Testing

- Add unit tests for new functionality
- Ensure all tests pass: `pytest`
- Maintain or improve code coverage

All changes should include appropriate tests. Run the test suite before submitting:

```bash
python3 -m pytest
```

## Submitting Changes

### Pull Request Process

1. Ensure your code follows the project's coding standards
2. Update documentation if needed
3. Add or update tests as appropriate
4. Run the full test suite and ensure it passes
5. Push your branch and create a Pull Request

### Pull Request Guidelines

- Provide a clear title and description
- Reference any related issues (e.g., "Fixes #123")
- Keep changes focused and atomic
- Be responsive to feedback and review comments


### Review Process

- PRs require at least one approval from a maintainer
- CI checks must pass
- Changes may be requested before merging


## Communication

- GitHub Issues: For bug reports and feature requests
- GitHub Discussions: For questions and general discussion
- Mail: [[email protected]](mailto:[email protected])
- Discord: [Lanka Data Foundation Discord Channel](https://discord.com/invite/mg94NtHD9Y)

## Recognition

We value all contributions and appreciate your effort to improve this project!

## Additional Resources

- Please refer to the [Project Documentation](README.md)

---

*These guidelines are inspired by the [Apache Way](https://www.apache.org/theapacheway/) and [Open Source Guides](https://opensource.guide/).*
96 changes: 96 additions & 0 deletions GETTING_STARTED.md
@@ -0,0 +1,96 @@
# Getting Started

## Installation
```bash
pip install git+https://github.com/LDFLK/gztarchiver.git
```

## How It Works

**Step 1: Create & Configure YAML File**
- Download the example [``config example``](config_example.yaml) file from the repository.
- Edit the configurations to specify your download preferences, storage locations, and other settings
- This file acts as the control center for your archiving operations

**Step 2: Run the Program**
- Run the program from the command line with the parameters you need (see the Usage section).
- Sit back and watch as your documents are systematically archived and categorized!


## Usage

After installation, run the program from a terminal.

**Show help:**
```bash
gztarchiver --help
```

**Extract data for a specific year:**
```bash
gztarchiver --year 2023 --lang en --config path-to-the-config-file
```

**Extract data for a specific month in a year:**
```bash
gztarchiver --year 2023 --month 06 --lang en --config path-to-the-config-file
```

**Extract data for a specific date:**
```bash
gztarchiver --year 2023 --month 06 --day 15 --lang en --config path-to-the-config-file
```
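
When the tool is driven from a script rather than typed by hand, the same invocations can be assembled programmatically. The sketch below is illustrative only: `build_gztarchiver_command` is a hypothetical helper, it uses only the flags shown in the examples above, and it assumes that `--day` is only meaningful together with `--month`.

```python
from typing import List, Optional


def build_gztarchiver_command(
    year: str,
    lang: str,
    config: str,
    month: Optional[str] = None,
    day: Optional[str] = None,
) -> List[str]:
    """Assemble an argument list using only the flags shown in the examples."""
    if day is not None and month is None:
        # Assumption: a day filter only makes sense together with a month.
        raise ValueError("--day requires --month")

    command = ["gztarchiver", "--year", year]
    if month is not None:
        command += ["--month", month]
    if day is not None:
        command += ["--day", day]
    command += ["--lang", lang, "--config", config]
    return command
```

The returned list can be passed to `subprocess.run(command, check=True)`, which avoids shell quoting issues in the config path.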

## Options

| Option | Description | Example | Default |
|--------|-------------|---------|---------|
| `--year` | Filter by year or download all | `--year 2023` | None |
| `--month` | Filter by specific month (01-12) | `--month 06` | None |
| `--day` | Filter by specific day (01-31) | `--day 15` | None |
| `--lang` | Specify language | `--lang en` | None |

## Language Codes

| Code | Language |
|------|----------|
| `en` | English |
| `si` | Sinhala |
| `ta` | Tamil |


## Output Structure

Downloaded documents are organized as:
```
~/doc-archive/
├── 2023/
│   ├── 01/
│   │   ├── 15/
│   │   │   └── gazette_id/
│   │   │       └── gazette_id_english.pdf
│   │   └── ...
│   ├── records/
│   │   ├── successfully_archived.csv
│   │   ├── failed_to_archive.csv
│   │   ├── document_unavailable.csv
│   │   └── document_classification.csv
│   └── ...
└── ...
```

## Log Files

For each year, the following log files are created:
- `{year}/records/successfully_archived.csv` - Successfully downloaded files
- `{year}/records/failed_to_archive.csv` - Failed downloads with retry information
- `{year}/records/document_unavailable.csv` - Documents that were unavailable
- `{year}/records/document_classification.csv` - Document classification metadata
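
To locate these files from a script, the paths can be derived from the archive root and the year. This is a small sketch under the layout described above; `record_paths` is a hypothetical helper and the archive root is whatever location your configuration uses (shown here as `~/doc-archive` only because that is the example above).

```python
from pathlib import Path
from typing import Dict

# File names taken from the list above.
RECORD_FILES = (
    "successfully_archived.csv",
    "failed_to_archive.csv",
    "document_unavailable.csv",
    "document_classification.csv",
)


def record_paths(archive_root: Path, year: str) -> Dict[str, Path]:
    """Map each record file name to its expected path for the given year."""
    records_dir = archive_root / year / "records"
    return {name: records_dir / name for name in RECORD_FILES}
```

For example, `record_paths(Path.home() / "doc-archive", "2023")` returns the four expected paths under `2023/records/`; it does not check that the files exist.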

## Error Messages

- **No gazettes found**: `❌ No gazettes found for year 2023 with month 06`
- **Invalid year**: `❌ Year '2025' not found in years.json`
- **Invalid month**: `❌ Invalid month '13'. Must be between 01-12`
- **Invalid day**: `❌ Invalid day '32'. Must be between 01-31`
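
The range checks behind the last two messages can be sketched as follows. This is an illustration of the checks the messages describe, not the tool's actual implementation; the function names are hypothetical, and the day check only covers the 01-31 range stated above, not the length of a particular month.

```python
from typing import Optional


def validate_month(month: str) -> Optional[str]:
    """Return an error message for an out-of-range month, or None if valid."""
    if not (month.isdecimal() and 1 <= int(month) <= 12):
        return f"❌ Invalid month '{month}'. Must be between 01-12"
    return None


def validate_day(day: str) -> Optional[str]:
    """Return an error message for an out-of-range day, or None if valid."""
    # Only the 01-31 range is checked; month lengths are not considered.
    if not (day.isdecimal() and 1 <= int(day) <= 31):
        return f"❌ Invalid day '{day}'. Must be between 01-31"
    return None
```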