21 changes: 21 additions & 0 deletions .github/ISSUE_TEMPLATE/bug_report.md
@@ -0,0 +1,21 @@
---
name: Bug report
about: Create a report to help us improve
labels: bug
---
**Describe the bug**
A clear and concise description.

**Reproduction**
Code snippet / dataset shape / Spark & Scala versions.

**Expected behavior**

**Environment**
- Scala:
- Spark:
- JVM:
- Module:

**Additional context**
Logs / stacktrace.
12 changes: 12 additions & 0 deletions .github/ISSUE_TEMPLATE/feature_request.md
@@ -0,0 +1,12 @@
---
name: Feature request
about: Suggest an idea
labels: enhancement
---
**Problem**

**Proposed solution**

**Alternatives**

**Additional context**
9 changes: 9 additions & 0 deletions .github/ISSUE_TEMPLATE/question.md
@@ -0,0 +1,9 @@
---
name: Question
about: Ask a usage question
labels: question
---
**Your question**

**Context**
Spark/Scala versions, sample code.
16 changes: 16 additions & 0 deletions .github/PULL_REQUEST_TEMPLATE.md
@@ -0,0 +1,16 @@
### Summary

### Type
- [ ] Bug fix
- [ ] Feature
- [ ] Docs
- [ ] Build/CI

### Compatibility
Spark: [ ] 3.4 [ ] 3.5
Scala: [ ] 2.12 [ ] 2.13

### Checklist
- [ ] Added/updated tests
- [ ] Updated docs
- [ ] Ran `sbt clean test` locally
8 changes: 8 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,8 @@
version: 2
updates:
  - package-ecosystem: "github-actions"
    directory: "/"
    schedule: { interval: "weekly" }
  - package-ecosystem: "sbt"
    directory: "/"
    schedule: { interval: "weekly" }
33 changes: 33 additions & 0 deletions .github/workflows/scala-ci.yml
@@ -0,0 +1,33 @@
name: Scala CI
on:
  pull_request:
  push:
    branches: [ master, main ]
jobs:
  test:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        scala: [2.12.18, 2.13.13]
        spark: [3.4.3, 3.5.1]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-java@v4
        with:
          distribution: temurin
          java-version: '11'
      - name: Cache Ivy/Coursier
        uses: actions/cache@v4
        with:
          path: |
            ~/.ivy2/cache
            ~/.cache/coursier
            ~/.sbt
          key: ${{ runner.os }}-ivy-${{ hashFiles('**/*.sbt','project/**') }}
      - name: Create local.sbt overrides
        run: |
          echo 'ThisBuild / scalaVersion := "${{ matrix.scala }}"' > local.sbt
          echo 'ThisBuild / crossScalaVersions := Seq("2.12.18","2.13.13")' >> local.sbt
          echo 'val sparkVer = "${{ matrix.spark }}"' >> local.sbt
      - name: Test
        run: sbt -v clean test
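Note: one cell of the matrix above could be reproduced locally with something like the sketch below. It assumes a POSIX-style shell; the function name `write_local_sbt` is hypothetical and not part of this PR. It writes the same three lines as the "Create local.sbt overrides" step. Whether the build actually picks up `sparkVer` from `local.sbt` depends on `build.sbt`, which this diff does not show.

```bash
# Hypothetical helper (not part of this PR): writes the same override
# lines that the CI step "Create local.sbt overrides" writes.
# Args: <scala version> <spark version> [output file, default ./local.sbt]
write_local_sbt() {
  local scala_ver="$1" spark_ver="$2" out="${3:-local.sbt}"
  {
    printf 'ThisBuild / scalaVersion := "%s"\n' "$scala_ver"
    printf 'ThisBuild / crossScalaVersions := Seq("2.12.18","2.13.13")\n'
    printf 'val sparkVer = "%s"\n' "$spark_ver"
  } > "$out"
}
```

For example, `write_local_sbt 2.13.13 3.5.1` followed by `sbt -v clean test` would mirror the Scala 2.13.13 / Spark 3.5.1 cell of the matrix.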
224 changes: 15 additions & 209 deletions CONTRIBUTING.md
@@ -1,215 +1,21 @@
# Contributing to Generalized K-Means Clustering
# Contributing

Thank you for your interest in contributing to this project! This document provides guidelines for contributing to the generalized K-means clustering library.

## Development Environment Setup

### Prerequisites

- **Java 17** or higher
- **SBT 1.x** (Scala Build Tool)
- **Scala 2.12.18** (managed by SBT)
- **Apache Spark 3.4.0** (managed by SBT)

### Getting Started

1. **Clone the repository:**
```bash
git clone https://github.com/derrickburns/generalized-kmeans-clustering.git
cd generalized-kmeans-clustering
```

2. **Compile the project:**
```bash
sbt compile
```

3. **Run tests:**
```bash
sbt test
```

4. **Check code style:**
```bash
sbt scalastyle
```

## Code Style Guidelines

### Scala Style

- Follow standard Scala naming conventions
- Use 2-space indentation
- Line length should not exceed 120 characters
- Use meaningful variable and function names
- Add scaladoc documentation for all public APIs

### Code Quality

- **Linting:** Run `sbt scalastyle` before submitting
- **Testing:** Ensure all tests pass with `sbt test`
- **Coverage:** Maintain or improve test coverage
- **Dependencies:** Check for dependency updates with `sbt dependencyUpdates`

### Error Handling

- Use `ValidationUtils` for common validation patterns
- Provide meaningful error messages with context
- Handle edge cases gracefully
- Use SLF4J logging instead of print statements

## Project Structure

```
src/
├── main/scala/com/massivedatascience/
│ ├── clusterer/ # Core clustering algorithms
│ ├── divergence/ # Bregman divergence implementations
│ ├── linalg/ # Linear algebra utilities
│ ├── transforms/ # Data transformation utilities
│ └── util/ # Common utilities and validation
└── test/scala/com/massivedatascience/
└── clusterer/ # Test suites
```

## Architecture Overview

### Core Components

- **BregmanDivergence**: Defines distance functions for clustering
- **BregmanPointOps**: Point operations and factory methods
- **KMeansModel**: Trained model with prediction capabilities
- **MultiKMeansClusterer**: Interface for different clustering implementations

### Key Design Patterns

- **Weighted Vectors**: All operations use `WeightedVector` for weighted clustering
- **Pluggable Distance Functions**: Easy addition of new Bregman divergences
- **Iterative Training**: Multi-stage training support

## Testing

### Test Requirements

- All new features must include comprehensive tests
- Tests should cover edge cases and error conditions
- Use ScalaTest framework with the existing `LocalClusterSparkContext` trait
- Test files should be in `src/test/scala/com/massivedatascience/clusterer/`

### Running Tests
## Dev setup
- Java 11 (Temurin)
- sbt 1.10.x
- Scala 2.12/2.13 (cross-build)
- Spark 3.4/3.5 (provided)

## Building
```bash
# Run all tests
sbt test

# Run specific test suite
sbt "testOnly *KMeansSuite"

# Run with coverage
sbt coverage test coverageReport
```

## Pull Request Process

### Before Submitting

1. **Ensure all tests pass:**
```bash
sbt test
```

2. **Check code style:**
```bash
sbt scalastyle
```

3. **Update documentation** if you've made API changes

4. **Add tests** for new functionality

### Pull Request Guidelines

- **Title**: Use descriptive titles (e.g., "Add validation for negative weights in BregmanPointOps")
- **Description**: Clearly explain what changes you made and why
- **Testing**: Describe how you tested your changes
- **Breaking Changes**: Clearly mark any breaking changes

### Commit Message Format

Use conventional commit messages:

```
type(scope): brief description

Longer description if needed

- List specific changes
- Include reasoning for complex changes
sbt +compile
sbt +test
```

**Types:**
- `feat`: New feature
- `fix`: Bug fix
- `docs`: Documentation changes
- `style`: Code style changes
- `refactor`: Code refactoring
- `test`: Adding or updating tests
- `perf`: Performance improvements

## Common Development Tasks

### Adding a New Bregman Divergence

1. Create a new trait or object extending `BregmanDivergence`
2. Implement required methods: `convex`, `convexHomogeneous`, `gradientOfConvex`, `gradientOfConvexHomogeneous`
3. Use `ValidationUtils` for input validation
4. Add comprehensive tests in the test suite
5. Update documentation

### Improving Performance

1. Profile your changes using appropriate tools
2. Add benchmarks if introducing performance-critical code
3. Consider memory usage and garbage collection impact
4. Test with realistic data sizes

### Adding Configuration Options

1. Add new options to `KMeansConfig` if applicable
2. Ensure backward compatibility
3. Add validation for new configuration values
4. Document the new options

## Code Review Criteria

### Code Quality
- [ ] Code follows project style guidelines
- [ ] Error handling is appropriate and consistent
- [ ] No code duplication
- [ ] Performance considerations addressed

### Testing
- [ ] Adequate test coverage
- [ ] Tests cover edge cases
- [ ] Tests are maintainable and readable

### Documentation
- [ ] Public APIs are documented
- [ ] Complex algorithms are explained
- [ ] Breaking changes are clearly marked

## Getting Help

- **Issues**: Check existing [GitHub issues](https://github.com/derrickburns/generalized-kmeans-clustering/issues)
- **Discussions**: Start a discussion for questions about implementation
- **Code Review**: Request review from maintainers

## License

By contributing to this project, you agree that your contributions will be licensed under the Apache License 2.0.

## Recognition

Contributors will be acknowledged in release notes and the project README.
## Releasing
- Tag `vX.Y.Z` and let CI publish (if configured).
- Keep CHANGELOG up to date.

Thank you for contributing to the generalized K-means clustering library!
## Code style
- Scalafmt recommended; enforce in CI if added.
- PRs must pass CI and tests.
3 changes: 3 additions & 0 deletions README_BADGES_SNIPPET.md
@@ -0,0 +1,3 @@
<!-- Badges: paste near the top of README.md -->
[![Scala CI](https://github.com/OWNER/REPO/actions/workflows/scala-ci.yml/badge.svg)](https://github.com/OWNER/REPO/actions/workflows/scala-ci.yml)
[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](LICENSE)
3 changes: 3 additions & 0 deletions SECURITY.md
@@ -0,0 +1,3 @@
# Security Policy
Report vulnerabilities via GitHub Security Advisories.
We triage within 7 days and aim to patch critical issues within 30 days.
4 changes: 4 additions & 0 deletions local.sbt.example
@@ -0,0 +1,4 @@
// Copy to local.sbt to override versions locally (git-ignored)
ThisBuild / scalaVersion := "2.13.13"
ThisBuild / crossScalaVersions := Seq("2.12.18","2.13.13")
val sparkVer = "3.5.1"