diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
new file mode 100644
index 00000000..8f64687d
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,21 @@
+---
+name: Bug report
+about: Create a report to help us improve
+labels: bug
+---
+**Describe the bug**
+A clear and concise description.
+
+**Reproduction**
+Code snippet / dataset shape / Spark & Scala versions.
+
+**Expected behavior**
+
+**Environment**
+- Scala:
+- Spark:
+- JVM:
+- Module:
+
+**Additional context**
+Logs / stacktrace.
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
new file mode 100644
index 00000000..04a1340a
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,12 @@
+---
+name: Feature request
+about: Suggest an idea
+labels: enhancement
+---
+**Problem**
+
+**Proposed solution**
+
+**Alternatives**
+
+**Additional context**
diff --git a/.github/ISSUE_TEMPLATE/question.md b/.github/ISSUE_TEMPLATE/question.md
new file mode 100644
index 00000000..357fe179
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,9 @@
+---
+name: Question
+about: Ask a usage question
+labels: question
+---
+**Your question**
+
+**Context**
+Spark/Scala versions, sample code.
diff --git a/.github/PULL_REQUEST_TEMPLATE.md b/.github/PULL_REQUEST_TEMPLATE.md
new file mode 100644
index 00000000..dfeaa281
--- /dev/null
+++ b/.github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,16 @@
+### Summary
+
+### Type
+- [ ] Bug fix
+- [ ] Feature
+- [ ] Docs
+- [ ] Build/CI
+
+### Compatibility
+Spark: [ ] 3.4 [ ] 3.5
+Scala: [ ] 2.12 [ ] 2.13
+
+### Checklist
+- [ ] Added/updated tests
+- [ ] Updated docs
+- [ ] Ran `sbt clean test` locally
diff --git a/.github/dependabot.yml b/.github/dependabot.yml
new file mode 100644
index 00000000..c61c3a6d
--- /dev/null
+++ b/.github/dependabot.yml
@@ -0,0 +1,8 @@
+version: 2
+updates:
+  - package-ecosystem: "github-actions"
+    directory: "/"
+    schedule: { interval: "weekly" }
+  - package-ecosystem: "sbt"
+    directory: "/"
+    schedule: { interval: "weekly" }
diff --git a/.github/workflows/scala-ci.yml b/.github/workflows/scala-ci.yml
new file mode 100644
index 00000000..72472ac7
--- /dev/null
+++ b/.github/workflows/scala-ci.yml
@@ -0,0 +1,33 @@
+name: Scala CI
+on:
+  pull_request:
+  push:
+    branches: [ master, main ]
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        scala: [2.12.18, 2.13.13]
+        spark: [3.4.3, 3.5.1]
+    steps:
+      - uses: actions/checkout@v4
+      - uses: actions/setup-java@v4
+        with:
+          distribution: temurin
+          java-version: '11'
+      - name: Cache Ivy/Coursier
+        uses: actions/cache@v4
+        with:
+          path: |
+            ~/.ivy2/cache
+            ~/.cache/coursier
+            ~/.sbt
+          key: ${{ runner.os }}-ivy-${{ hashFiles('**/*.sbt','project/**') }}
+      - name: Create local.sbt overrides
+        run: |
+          echo 'ThisBuild / scalaVersion := "${{ matrix.scala }}"' > local.sbt
+          echo 'ThisBuild / crossScalaVersions := Seq("2.12.18","2.13.13")' >> local.sbt
+          echo 'val sparkVer = "${{ matrix.spark }}"' >> local.sbt
+      - name: Test
+        run: sbt -v clean test
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
index 37f84478..b28dc2f5 100644
--- a/CONTRIBUTING.md
+++ b/CONTRIBUTING.md
@@ -1,215 +1,21 @@
-# Contributing to Generalized K-Means Clustering
+# Contributing
 
-Thank you for your interest in contributing to this project! This document provides guidelines for contributing to the generalized K-means clustering library.
-
-## Development Environment Setup
-
-### Prerequisites
-
-- **Java 17** or higher
-- **SBT 1.x** (Scala Build Tool)
-- **Scala 2.12.18** (managed by SBT)
-- **Apache Spark 3.4.0** (managed by SBT)
-
-### Getting Started
-
-1. **Clone the repository:**
-   ```bash
-   git clone https://github.com/derrickburns/generalized-kmeans-clustering.git
-   cd generalized-kmeans-clustering
-   ```
-
-2. **Compile the project:**
-   ```bash
-   sbt compile
-   ```
-
-3. **Run tests:**
-   ```bash
-   sbt test
-   ```
-
-4. **Check code style:**
-   ```bash
-   sbt scalastyle
-   ```
-
-## Code Style Guidelines
-
-### Scala Style
-
-- Follow standard Scala naming conventions
-- Use 2-space indentation
-- Line length should not exceed 120 characters
-- Use meaningful variable and function names
-- Add scaladoc documentation for all public APIs
-
-### Code Quality
-
-- **Linting:** Run `sbt scalastyle` before submitting
-- **Testing:** Ensure all tests pass with `sbt test`
-- **Coverage:** Maintain or improve test coverage
-- **Dependencies:** Check for dependency updates with `sbt dependencyUpdates`
-
-### Error Handling
-
-- Use `ValidationUtils` for common validation patterns
-- Provide meaningful error messages with context
-- Handle edge cases gracefully
-- Use SLF4J logging instead of print statements
-
-## Project Structure
-
-```
-src/
-├── main/scala/com/massivedatascience/
-│   ├── clusterer/          # Core clustering algorithms
-│   ├── divergence/         # Bregman divergence implementations
-│   ├── linalg/             # Linear algebra utilities
-│   ├── transforms/         # Data transformation utilities
-│   └── util/               # Common utilities and validation
-└── test/scala/com/massivedatascience/
-    └── clusterer/          # Test suites
-```
-
-## Architecture Overview
-
-### Core Components
-
-- **BregmanDivergence**: Defines distance functions for clustering
-- **BregmanPointOps**: Point operations and factory methods
-- **KMeansModel**: Trained model with prediction capabilities
-- **MultiKMeansClusterer**: Interface for different clustering implementations
-
-### Key Design Patterns
-
-- **Weighted Vectors**: All operations use `WeightedVector` for weighted clustering
-- **Pluggable Distance Functions**: Easy addition of new Bregman divergences
-- **Iterative Training**: Multi-stage training support
-
-## Testing
-
-### Test Requirements
-
-- All new features must include comprehensive tests
-- Tests should cover edge cases and error conditions
-- Use ScalaTest framework with the existing `LocalClusterSparkContext` trait
-- Test files should be in `src/test/scala/com/massivedatascience/clusterer/`
-
-### Running Tests
+## Dev setup
+- Java 11 (Temurin)
+- sbt 1.10.x
+- Scala 2.12/2.13 (cross-build)
+- Spark 3.4/3.5 (provided)
+## Building
 
 ```bash
-# Run all tests
-sbt test
-
-# Run specific test suite
-sbt "testOnly *KMeansSuite"
-
-# Run with coverage
-sbt coverage test coverageReport
-```
-
-## Pull Request Process
-
-### Before Submitting
-
-1. **Ensure all tests pass:**
-   ```bash
-   sbt test
-   ```
-
-2. **Check code style:**
-   ```bash
-   sbt scalastyle
-   ```
-
-3. **Update documentation** if you've made API changes
-
-4. **Add tests** for new functionality
-
-### Pull Request Guidelines
-
-- **Title**: Use descriptive titles (e.g., "Add validation for negative weights in BregmanPointOps")
-- **Description**: Clearly explain what changes you made and why
-- **Testing**: Describe how you tested your changes
-- **Breaking Changes**: Clearly mark any breaking changes
-
-### Commit Message Format
-
-Use conventional commit messages:
-
-```
-type(scope): brief description
-
-Longer description if needed
-
-- List specific changes
-- Include reasoning for complex changes
+sbt +compile
+sbt +test
 ```
 
-**Types:**
-- `feat`: New feature
-- `fix`: Bug fix
-- `docs`: Documentation changes
-- `style`: Code style changes
-- `refactor`: Code refactoring
-- `test`: Adding or updating tests
-- `perf`: Performance improvements
-
-## Common Development Tasks
-
-### Adding a New Bregman Divergence
-
-1. Create a new trait or object extending `BregmanDivergence`
-2. Implement required methods: `convex`, `convexHomogeneous`, `gradientOfConvex`, `gradientOfConvexHomogeneous`
-3. Use `ValidationUtils` for input validation
-4. Add comprehensive tests in the test suite
-5. Update documentation
-
-### Improving Performance
-
-1. Profile your changes using appropriate tools
-2. Add benchmarks if introducing performance-critical code
-3. Consider memory usage and garbage collection impact
-4. Test with realistic data sizes
-
-### Adding Configuration Options
-
-1. Add new options to `KMeansConfig` if applicable
-2. Ensure backward compatibility
-3. Add validation for new configuration values
-4. Document the new options
-
-## Code Review Criteria
-
-### Code Quality
-- [ ] Code follows project style guidelines
-- [ ] Error handling is appropriate and consistent
-- [ ] No code duplication
-- [ ] Performance considerations addressed
-
-### Testing
-- [ ] Adequate test coverage
-- [ ] Tests cover edge cases
-- [ ] Tests are maintainable and readable
-
-### Documentation
-- [ ] Public APIs are documented
-- [ ] Complex algorithms are explained
-- [ ] Breaking changes are clearly marked
-
-## Getting Help
-
-- **Issues**: Check existing [GitHub issues](https://github.com/derrickburns/generalized-kmeans-clustering/issues)
-- **Discussions**: Start a discussion for questions about implementation
-- **Code Review**: Request review from maintainers
-
-## License
-
-By contributing to this project, you agree that your contributions will be licensed under the Apache License 2.0.
-
-## Recognition
-
-Contributors will be acknowledged in release notes and the project README.
+## Releasing
+- Tag `vX.Y.Z` and let CI publish (if configured).
+- Keep CHANGELOG up to date.
 
-Thank you for contributing to the generalized K-means clustering library!
\ No newline at end of file
+## Code style
+- Scalafmt recommended; enforce in CI if added.
+- PRs must pass CI and tests.
diff --git a/README_BADGES_SNIPPET.md b/README_BADGES_SNIPPET.md
new file mode 100644
index 00000000..9ddbb861
--- /dev/null
+++ b/README_BADGES_SNIPPET.md
@@ -0,0 +1,3 @@
+
+[![Scala CI](https://github.com/OWNER/REPO/actions/workflows/scala-ci.yml/badge.svg)](https://github.com/OWNER/REPO/actions/workflows/scala-ci.yml)
+[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
diff --git a/SECURITY.md b/SECURITY.md
new file mode 100644
index 00000000..3f60cffd
--- /dev/null
+++ b/SECURITY.md
@@ -0,0 +1,3 @@
+# Security Policy
+Report vulnerabilities via GitHub Security Advisories.
+We triage within 7 days and aim to patch critical issues within 30 days.
diff --git a/local.sbt.example b/local.sbt.example
new file mode 100644
index 00000000..6c49a18f
--- /dev/null
+++ b/local.sbt.example
@@ -0,0 +1,4 @@
+// Copy to local.sbt to override versions locally (git-ignored)
+ThisBuild / scalaVersion := "2.13.13"
+ThisBuild / crossScalaVersions := Seq("2.12.18","2.13.13")
+val sparkVer = "3.5.1"