diff --git a/documents/analysis/connection-count-validation/ANALYSIS.md b/documents/analysis/connection-count-validation/ANALYSIS.md new file mode 100644 index 000000000..43170ebdf --- /dev/null +++ b/documents/analysis/connection-count-validation/ANALYSIS.md @@ -0,0 +1,687 @@ +# Analysis: Connection Count Validation for Pool Resizing + +## Executive Summary + +This document analyzes the requirements and design for querying the database to validate the number of user connections before resizing OJP server connection pools in a multinode deployment. The goal is to distinguish between true node failures and network partitions to avoid unnecessary pool resizing. + +**Recommendation**: Implement connection count validation as an **optional, configurable feature** with conservative defaults to minimize risk while providing value in network partition scenarios. + +## Problem Statement + +### Current Behavior + +In a multinode OJP deployment (e.g., 3 servers with 10 max connections each = 30 total): + +1. **Normal Operation**: Each server handles 10 connections +2. **Server Failure Detected**: When a client cannot reach a server, it reports the server as DOWN +3. **Pool Expansion**: Other servers increase their pools (e.g., 2 remaining servers → 15 connections each) +4. **Problem**: In a network partition, the "failed" server may still be serving other clients + +### Network Partition Scenario + +``` +Client Group A → Server1 (reachable) → Database +Client Group A → Server2 (reachable) → Database +Client Group A → Server3 (UNREACHABLE due to network partition) + +From Client Group A's perspective: +- Server3 appears DOWN +- Servers 1 & 2 expand pools to 15 connections each + +Reality: +- Server3 is UP and serving Client Group B with 10 connections +- Total database connections: 10 + 15 + 15 = 40 (exceeds original 30 limit) +``` + +### Desired Solution + +Query the database for the actual number of connections from the current database user. 
If the count suggests connections are still active on the "failed" server, skip the pool resize. + +## Detailed Analysis + +### 1. Feasibility Assessment + +#### Database Query Support + +**✅ Feasible** - All major databases support connection count queries: + +| Database | Query Support | Complexity | Permissions | +|----------|--------------|------------|-------------| +| PostgreSQL | ✅ Excellent | Simple | None required | +| MySQL/MariaDB | ✅ Excellent | Simple | PROCESS (optional) | +| Oracle | ✅ Good | Moderate | SELECT on v$session | +| SQL Server | ✅ Good | Moderate | VIEW SERVER STATE | +| DB2 | ✅ Good | Moderate | Monitoring privileges | +| H2 | ✅ Good | Simple | None required | +| CockroachDB | ✅ Excellent | Simple | None required | + +**Conclusion**: Technically feasible across all supported databases. + +#### Implementation Complexity + +**Moderate Complexity** - Requires: +1. Database-specific query mapping (7 databases × 2 variants = ~14 queries) +2. Connection count validation logic +3. Integration with existing health check mechanism +4. Configuration management +5. Error handling and fallback logic +6. Testing across all database types + +**Estimated Effort**: 3-5 days development + 2-3 days testing + +### 2. Questions and Concerns + +#### Critical Questions + +**Q1: What if multiple database users share the same OJP deployment?** + +**Concern**: If different applications use different database users but share the same OJP servers, the connection count query would only see connections for the current user, not the total server load. 
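Concretely, the per-user blind spot is the difference between two counts. A sketch in PostgreSQL syntax (the class and constant names are illustrative, and the `backend_type` filter assumes PostgreSQL 10+; this is not existing OJP code):

```java
// Two count variants for the same validation step (illustrative only).
public class CountQueryVariants {
    // Per-user: sees only the current user's connections; blind to app_b's load.
    public static final String PER_USER =
        "SELECT COUNT(*) FROM pg_stat_activity " +
        "WHERE usename = CURRENT_USER AND pid != pg_backend_pid()";

    // Cluster-wide: counts every client connection; on some databases this
    // variant needs extra privileges (e.g. MySQL's PROCESS).
    public static final String ALL_USERS =
        "SELECT COUNT(*) FROM pg_stat_activity " +
        "WHERE backend_type = 'client backend' AND pid != pg_backend_pid()";
}
```

A per-user/total configuration switch, as recommended below, could simply select which of the two statements the validator executes.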
**Impact**: HIGH - Could lead to incorrect decisions

**Example**:
```
Application A (user: app_a) → OJP Servers → Database
Application B (user: app_b) → OJP Servers → Database

If Server3 fails:
- Query shows app_a connections: 20 (below threshold, so a resize would proceed)
- But app_b also has 10 connections on Server3
- Total: 30 connections, but the query only sees 20, so a partition could go undetected
```

**Recommendation**:
- Document this limitation clearly
- Add configuration option to specify whether to use per-user or total connection count
- Consider adding a query variant that counts ALL connections (requires higher privileges)

---

**Q2: What about connection pooling at the database level?**

**Concern**: Some databases (Oracle, DB2) may use their own connection pooling at the server level. The query might not reflect actual OJP connections.

**Impact**: MEDIUM - Could show more connections than expected

**Example**:
- Database maintains 50 connections in its own pool
- Only 20 are actively used by OJP
- Query shows 50, suggesting a partition when there isn't one

**Recommendation**:
- Focus on "active" connections (with state filtering where available)
- Document this as a known limitation for databases with server-side pooling

---

**Q3: How do we handle query failures?**

**Concern**: Network issues, permission problems, or database errors could cause the validation query to fail.

**Impact**: HIGH - Could prevent pool resizing when needed

**Options**:
1. **Fail-open** (proceed with resize): Prioritizes availability but may create issues
2. **Fail-closed** (skip resize): Conservative but could lead to resource exhaustion
3. 
**Configurable**: Allow operators to choose behavior + +**Recommendation**: +- Default to **fail-open** (proceed with resize) for availability +- Make configurable via `ojp.pool.resize.validation.failureMode=PROCEED|SKIP` +- Log all failures prominently for investigation + +--- + +**Q4: What is the appropriate threshold for detection?** + +**Concern**: Setting the threshold too high or too low could lead to false positives/negatives. + +**Analysis**: +``` +Total Max Pool Size: 30 +3 servers → 10 connections each (normal) +1 server fails → 2 servers × 15 connections each = 30 total + +Thresholds: +- 90% (27 connections): Lenient - catches obvious partitions +- 85% (25 connections): Balanced +- 80% (24 connections): Strict - may miss some partitions +``` + +**Recommendation**: +- Default threshold: **85% of total max pool size** +- Configurable: `ojp.pool.resize.validation.connectionThreshold=0.85` +- Log the actual count vs threshold for tuning + +--- + +**Q5: Does this solve the problem completely?** + +**Concern**: Even with connection count validation, network partitions can create edge cases. 
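Before walking through the scenarios, note how small the core check actually is: a single count compared against a single cutoff. A minimal sketch (class, method, and rounding choice are hypothetical, not the actual OJP implementation):

```java
// The whole validation reduces to one comparison: a heuristic, not proof
// that a partition exists. Names and rounding are illustrative.
public class ResizeGate {
    public static boolean skipResize(int dbConnectionCount, int totalMaxPoolSize, double threshold) {
        int cutoff = (int) Math.ceil(threshold * totalMaxPoolSize); // e.g. 0.85 * 30 -> 26
        return dbConnectionCount >= cutoff; // at or above cutoff: likely partition, skip resize
    }
}
```

Anything that holds the observed count above or below that cutoff at the wrong moment produces the wrong decision, which is exactly what the following scenarios illustrate.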
+ +**Scenario 1 - Gradual Failure**: +``` +Time T0: Server3 fails, connections still open in database (closing takes time) +Time T1: Validation runs, sees high connection count, skips resize +Time T2: Database closes stale connections +Time T3: Clients experience connection shortage +``` + +**Scenario 2 - Partial Partition**: +``` +Server3 can reach database but not other servers +Clients can reach other servers but not Server3 +Connection count stays high, but coordination is broken +``` + +**Impact**: MEDIUM - Validation helps but doesn't eliminate all issues + +**Recommendation**: +- Position this as a **heuristic**, not a guarantee +- Combine with existing health checks and client-side tracking +- Consider adding a time-based override (e.g., force resize after 5 minutes regardless of count) + +#### Operational Questions + +**Q6: How will operators troubleshoot validation issues?** + +**Recommendation**: +- Comprehensive logging at INFO and DEBUG levels +- JMX metrics for validation attempts, successes, failures, and skips +- Admin API endpoint to manually trigger validation checks +- Documentation with common failure scenarios and resolutions + +--- + +**Q7: What's the performance impact on the database?** + +**Analysis**: +- Query cost: ~10-100ms per execution +- Frequency: Only when cluster health changes (rare, ~1-10 times per hour in stable environment) +- Impact: **Negligible** for typical workloads + +**Concern**: In a large deployment with many datasources and frequent health changes: +- 10 datasources × 10 health changes/hour = 100 queries/hour = ~1.67 queries/minute + +**Recommendation**: +- Add rate limiting: Maximum 1 validation query per datasource per 5 seconds +- Cache validation results for 5-10 seconds +- Monitor query performance with metrics + +--- + +**Q8: Should this be enabled by default?** + +**Arguments FOR default enable**: +- Solves a real problem (network partitions) +- Low overhead when working correctly +- Graceful fallback on errors 
+ +**Arguments AGAINST default enable**: +- Adds complexity and potential failure modes +- Not all deployments need it (single-region networks) +- Permission issues could surprise operators +- Introduces database-specific behavior + +**Recommendation**: +- **Disable by default** (opt-in) for initial release +- Provide clear documentation on when to enable +- Consider enabling by default in future release after field validation + +### 3. Design Opinions and Suggestions + +#### Opinion 1: Keep It Simple + +**Suggestion**: Start with a minimal implementation focused on the common case: + +1. Single database user per OJP deployment +2. Simple threshold-based decision (e.g., 85%) +3. Fail-open behavior (proceed with resize on errors) +4. Manual enable via configuration + +**Rationale**: Simpler implementation is easier to test, debug, and maintain. Add complexity only when needed. + +--- + +#### Opinion 2: Make It Observable + +**Suggestion**: Prioritize observability from day one: + +```java +// Metrics +meter("ojp.pool.resize.validation.attempts") +timer("ojp.pool.resize.validation.query.duration") +counter("ojp.pool.resize.validation.skipped", tags("reason", "network_partition")) +counter("ojp.pool.resize.validation.proceeded", tags("reason", "confirmed_failure")) +counter("ojp.pool.resize.validation.errors", tags("database", "postgres", "error", "timeout")) + +// Logs (structured) +log.info("Connection validation: connHash={}, dbConnections={}, threshold={}, decision={}", + connHash, actualCount, threshold, decision); +``` + +**Rationale**: Operators need visibility to tune thresholds, debug issues, and understand system behavior. + +--- + +#### Opinion 3: Provide Escape Hatches + +**Suggestion**: Allow operators to override or bypass validation: + +1. **Global disable**: `ojp.pool.resize.validation.enabled=false` +2. **Per-datasource disable**: `ojp.ds.mydb.pool.resize.validation.enabled=false` +3. 
**Manual override API**: Admin endpoint to force resize regardless of validation +4. **Timeout-based override**: After N minutes, ignore validation and resize anyway + +**Rationale**: No heuristic is perfect. Operators need ways to work around issues. + +--- + +#### Opinion 4: Consider XA vs Non-XA Differences + +**Current System**: +- XA mode: Connection tracking + automatic redistribution +- Non-XA mode: No connection tracking, pools manage distribution naturally + +**Suggestion**: Apply connection count validation differently: + +- **XA Mode**: Full validation logic (as designed) +- **Non-XA Mode**: Optional or simplified validation (less critical due to different architecture) + +**Rationale**: XA mode has more complex coordination needs and would benefit more from validation. + +### 4. Alternative Approaches + +#### Alternative 1: Improved Heartbeat Mechanism + +**Approach**: Instead of relying on connection-level errors, implement explicit server-to-server heartbeats. + +**Pros**: +- No database queries required +- More responsive (millisecond detection vs seconds) +- Clearer distinction between network issues and server failures + +**Cons**: +- Doesn't solve split-brain (both sides think they're healthy) +- Additional network overhead +- Complexity of heartbeat coordination + +**Verdict**: Could complement connection count validation but doesn't replace it. + +--- + +#### Alternative 2: Client-Side Consensus + +**Approach**: Clients coordinate to reach consensus on which servers are healthy before triggering resize. + +**Pros**: +- Distributed decision-making +- Handles partial partitions better + +**Cons**: +- Significant complexity (requires coordination protocol) +- Latency in decision-making +- Clients must communicate with each other + +**Verdict**: Too complex for the benefit. Overkill for connection pooling. 
+ +--- + +#### Alternative 3: Database-Triggered Notifications + +**Approach**: Database sends notifications when connection counts change significantly. + +**Pros**: +- Push-based (no polling) +- Real-time awareness + +**Cons**: +- Database-specific implementation (PostgreSQL LISTEN/NOTIFY, Oracle AQ, etc.) +- Requires database-side setup +- Complexity of maintaining notification channels + +**Verdict**: Interesting for future exploration but adds too much complexity for initial implementation. + +--- + +#### Alternative 4: Time-Based Dampening Only + +**Approach**: Instead of querying the database, simply wait longer before resizing (e.g., 5 minutes instead of immediate). + +**Pros**: +- Very simple +- No database queries +- Allows time for network issues to resolve + +**Cons**: +- Doesn't actually solve the problem +- Delays legitimate failover +- Just shifts the timing issue + +**Verdict**: Not sufficient on its own, but good to combine with validation (timeout-based override). + +### 5. 
Implementation Recommendations + +#### Phased Rollout + +**Phase 1: Foundation (Week 1-2)** +- Implement `ConnectionCountValidator` interface +- Add database-specific query classes +- Create `PoolResizeValidator` to orchestrate validation +- Add configuration properties +- Unit tests for each database type + +**Phase 2: Integration (Week 2-3)** +- Integrate with `ProcessClusterHealthAction` +- Add decision logic and thresholds +- Implement error handling and fallback +- Integration tests with real databases + +**Phase 3: Observability (Week 3-4)** +- Add metrics and logging +- Create admin API endpoints +- Performance testing +- Documentation + +**Phase 4: Validation (Week 4-5)** +- End-to-end testing with multinode setup +- Network partition simulation tests +- Load testing to verify overhead +- Beta testing with select users + +#### Configuration Design + +```properties +# Enable/disable connection count validation before pool resize +# Default: false (opt-in for initial release) +ojp.pool.resize.validation.enabled=false + +# Connection count threshold as fraction of total max pool size +# If actual connections >= threshold * maxPoolSize, skip resize (likely partition) +# Default: 0.85 (85%) +ojp.pool.resize.validation.connectionThreshold=0.85 + +# Behavior when validation query fails +# PROCEED: Ignore validation failure, proceed with resize (availability) +# SKIP: Skip resize on validation failure (conservative) +# Default: PROCEED +ojp.pool.resize.validation.failureMode=PROCEED + +# Query timeout in milliseconds +# Default: 5000 (5 seconds) +ojp.pool.resize.validation.queryTimeout=5000 + +# Rate limit: minimum time between validation queries for the same datasource +# Prevents excessive database queries during rapid health changes +# Default: 5000 (5 seconds) +ojp.pool.resize.validation.rateLimitMs=5000 + +# Time-based override: force resize after this duration regardless of validation +# Prevents permanent pool size mismatch in edge cases +# Default: 300000 
(5 minutes), set to 0 to disable
ojp.pool.resize.validation.forceResizeAfterMs=300000
```

#### Code Structure

```java
// New classes to add

// Interface for database-specific query implementations
public interface ConnectionCountQuery {
    String getQuery();
    int executeQuery(Connection conn) throws SQLException;
}

// Factory to get appropriate query for database type
public class ConnectionCountQueryFactory {
    public static ConnectionCountQuery getQuery(DbName dbName) { ... }
}

// Main validation orchestrator
public class PoolResizeValidator {
    private final ConnectionCountQueryFactory queryFactory;
    private final Map<String, Long> lastValidationTime;   // Rate limiting (connHash -> last query time)
    private final Map<String, Long> lastHealthChangeTime; // Time-based override

    public ValidationResult validate(String connHash,
                                     DataSource dataSource,
                                     MultinodePoolCoordinator.PoolAllocation allocation,
                                     int newHealthyServerCount) { ... }
}

// Result of validation
public class ValidationResult {
    enum Decision { PROCEED_WITH_RESIZE, SKIP_RESIZE }

    private final Decision decision;
    private final int actualConnectionCount;
    private final int threshold;
    private final String reason;
}

// Integration point in ProcessClusterHealthAction
public class ProcessClusterHealthAction {
    private final PoolResizeValidator resizeValidator;

    public void execute(ActionContext context, SessionInfo sessionInfo) {
        // ... existing health check logic ...

        if (healthChanged && validationEnabled) {
            ValidationResult result = resizeValidator.validate(...);

            if (result.getDecision() == ValidationResult.Decision.SKIP_RESIZE) {
                log.info("Skipping pool resize due to validation: {}", result.getReason());
                return; // Skip resize
            }
        }

        // ... proceed with resize ...
    }
}
```

### 6. Criticisms and Risks

#### Criticism 1: Added Complexity

**Critique**: This feature adds significant complexity for a relatively rare scenario (network partitions). 
+ +**Response**: +- Valid concern - this is why opt-in is recommended +- Complexity is localized (new classes, limited integration points) +- Alternative of doing nothing could lead to database overload in partition scenarios + +--- + +#### Criticism 2: Incomplete Solution + +**Critique**: Connection count validation doesn't solve all partition scenarios and may give false confidence. + +**Response**: +- Accurate - this is a heuristic, not a complete solution +- Documentation must clearly state limitations +- Should be one tool among many (health checks, monitoring, alerts) +- Still provides value by catching common partition cases + +--- + +#### Criticism 3: Database Permissions Burden + +**Critique**: Requiring additional database permissions increases operational complexity and may not be acceptable in some environments. + +**Response**: +- Most databases allow users to see their own connections without extra permissions +- For databases requiring permissions (Oracle, SQL Server, DB2), document clearly +- Validation automatically disables if query fails (fail-open) +- Operators can disable if permissions are an issue + +--- + +#### Criticism 4: Performance Risk + +**Critique**: Adding database queries in the critical path of pool resizing could introduce latency or failures. + +**Response**: +- Queries are only executed when health changes (rare) +- Queries are lightweight (system view queries, ~10-100ms) +- Timeout prevents hanging (5 second default) +- Fail-open ensures availability is prioritized +- Rate limiting prevents query storms + +--- + +#### Criticism 5: False Positives + +**Critique**: Thresholds are arbitrary and could cause unnecessary pool resize skips. + +**Response**: +- True - threshold tuning will be needed per environment +- Make threshold configurable (default 85%) +- Comprehensive logging helps operators tune +- Time-based override prevents permanent issues +- Metrics enable data-driven threshold adjustment + +### 7. 
Testing Strategy

#### Unit Tests

```java
// Test each database-specific query
PostgreSQLConnectionCountQueryTest
MySQLConnectionCountQueryTest
OracleConnectionCountQueryTest
... (one per database)

// Test validation logic
PoolResizeValidatorTest
- testValidationSkipsResizeWhenAboveThreshold()
- testValidationProceedsWhenBelowThreshold()
- testValidationProceedsOnQueryFailure()
- testRateLimitingPreventsDuplicateQueries()
- testTimeBasedOverrideForcesResize()
```

#### Integration Tests

```java
// Test with real databases
ConnectionCountValidationIntegrationTest
- testPostgreSQLConnectionCount()
- testMySQLConnectionCount()
- testH2ConnectionCount()

// Test multinode scenarios
MultinodeValidationTest
- testNetworkPartitionDetection()
- testTrueNodeFailureProceeds()
- testMixedScenarios()
```

#### Manual Testing

1. **3-Server Setup**: Deploy 3 OJP servers with PostgreSQL
2. **Network Partition**: Use firewall rules to simulate partition
3. **Verify**: Connection count query detects partition
4. **Verify**: Pool resize is skipped with appropriate logging
5. **Recover**: Remove partition, verify recovery

#### Load Testing

1. **Baseline**: Measure pool resize performance without validation
2. **With Validation**: Measure performance with validation enabled
3. **Failure Scenarios**: Measure with slow/failing validation queries
4. **Verify**: < 100ms overhead in 99th percentile

### 8. 
Documentation Requirements + +#### Operator Guide + +- **When to Enable**: Network environments prone to partitions +- **How to Configure**: Step-by-step with examples +- **Troubleshooting**: Common issues and solutions +- **Monitoring**: What metrics to watch +- **Tuning**: How to adjust thresholds + +#### Developer Guide + +- **Architecture**: How validation integrates with pool resizing +- **Adding Databases**: How to implement new database queries +- **Testing**: How to test validation locally + +#### Database-Specific Guides + +- **Permissions**: Required privileges per database +- **Query Details**: What each query does +- **Limitations**: Known issues per database + +## Conclusion + +### Summary of Recommendations + +1. **✅ Implement** connection count validation as described +2. **✅ Make it opt-in** (disabled by default) for initial release +3. **✅ Fail-open** (proceed with resize) when validation fails +4. **✅ Default threshold** of 85% of total max pool size +5. **✅ Comprehensive logging** and metrics for observability +6. **✅ Time-based override** (force resize after 5 minutes) +7. **✅ Rate limiting** (max 1 query per datasource per 5 seconds) +8. **✅ Clear documentation** of limitations and trade-offs + +### Value Proposition + +**Benefits**: +- ✅ Prevents unnecessary pool expansion in network partition scenarios +- ✅ Reduces risk of exceeding database connection limits +- ✅ Provides operators with more control over pool behavior +- ✅ Works across all major databases + +**Trade-offs**: +- ⚠️ Added complexity (~500-1000 lines of code) +- ⚠️ Requires database permissions in some cases +- ⚠️ Heuristic-based (not 100% accurate) +- ⚠️ Small performance overhead (~10-100ms per health change) + +### Go/No-Go Decision + +**Recommendation: GO** with following conditions: + +1. **Implement as opt-in feature** - minimize risk for existing users +2. **Focus on PostgreSQL and MySQL first** - cover 80% of users, add others incrementally +3. 
**Extensive testing** - especially network partition scenarios +4. **Beta period** - get feedback from select users before GA +5. **Clear documentation** - set expectations about limitations + +### Risk Mitigation + +| Risk | Severity | Mitigation | +|------|----------|------------| +| Query failures | HIGH | Fail-open, timeout, logging | +| Permission issues | MEDIUM | Auto-disable, clear docs | +| False positives | MEDIUM | Configurable threshold, time override | +| Performance impact | LOW | Rate limiting, caching, metrics | +| Complexity | MEDIUM | Phased rollout, comprehensive tests | + +### Success Criteria + +1. **Functional**: Correctly detects network partitions in ≥90% of test scenarios +2. **Performance**: < 100ms overhead in 99th percentile +3. **Reliability**: Zero production incidents related to validation +4. **Adoption**: ≥20% of multinode users enable validation after 6 months +5. **Feedback**: Positive feedback from early adopters + +### Next Steps + +If approved: +1. **Week 1-2**: Implement core validation logic and PostgreSQL support +2. **Week 3**: Add MySQL and Oracle support +3. **Week 4**: Integration, testing, and documentation +4. **Week 5**: Beta testing with select users +5. **Week 6**: Final QA and release preparation + +--- + +**Document Version**: 1.0 +**Date**: 2026-01-19 +**Author**: OJP Development Team +**Status**: Draft for Review diff --git a/documents/analysis/connection-count-validation/DATABASE_CONNECTION_COUNT_QUERIES.md b/documents/analysis/connection-count-validation/DATABASE_CONNECTION_COUNT_QUERIES.md new file mode 100644 index 000000000..0ae54b0d1 --- /dev/null +++ b/documents/analysis/connection-count-validation/DATABASE_CONNECTION_COUNT_QUERIES.md @@ -0,0 +1,416 @@ +# Database-Specific Connection Count Queries + +## Overview + +This document outlines database-specific SQL queries to retrieve the number of active connections for a specific database user. 
These queries are essential for the connection count validation mechanism in OJP's multinode pool resizing logic. + +## Purpose + +Before resizing connection pools due to perceived node failures, OJP can query the database to verify the actual number of connections from the current user. This helps distinguish between: +- **True node failure**: Node is down, connections are closed, database shows fewer connections +- **Network partition**: Node is up but unreachable to some clients, connections still active in database + +## Database-Specific Queries + +### PostgreSQL + +```sql +-- Count active connections for current user +SELECT COUNT(*) as connection_count +FROM pg_stat_activity +WHERE usename = CURRENT_USER + AND state != 'idle' + AND pid != pg_backend_pid(); -- Exclude the query connection itself +``` + +**Alternative (all connections including idle):** +```sql +SELECT COUNT(*) as connection_count +FROM pg_stat_activity +WHERE usename = CURRENT_USER + AND pid != pg_backend_pid(); +``` + +**Permissions Required**: None for current user's connections. Access to `pg_stat_activity` is granted by default. + +**Notes**: +- `state != 'idle'` filters out idle connections +- `pg_backend_pid()` excludes the current query connection +- Works with PostgreSQL 9.2+ + +### MySQL / MariaDB + +```sql +-- Count active connections for current user +SELECT COUNT(*) as connection_count +FROM information_schema.PROCESSLIST +WHERE USER = SUBSTRING_INDEX(USER(), '@', 1) + AND ID != CONNECTION_ID(); -- Exclude the query connection itself +``` + +**Alternative (with host filtering):** +```sql +SELECT COUNT(*) as connection_count +FROM information_schema.PROCESSLIST +WHERE USER = SUBSTRING_INDEX(USER(), '@', 1) + AND HOST LIKE CONCAT(SUBSTRING_INDEX(USER(), '@', -1), '%') + AND ID != CONNECTION_ID(); +``` + +**Permissions Required**: `PROCESS` privilege to see all processes, or user sees their own connections by default. 
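Whichever dialect applies, the statement would be executed the same way over JDBC. A minimal probe sketch (hypothetical helper, not an existing OJP API) that maps every failure to "unknown" so the caller can apply a fail-open policy:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;

// Hypothetical helper: run one dialect-specific count query with a timeout.
public class ConnectionCountProbe {
    /** Returns the count, or -1 ("unknown") on any failure so resizing is never blocked. */
    public static int countOwnConnections(Connection conn, String dialectQuery, int timeoutSeconds) {
        try (PreparedStatement ps = conn.prepareStatement(dialectQuery)) {
            ps.setQueryTimeout(timeoutSeconds); // bound the cost of validation
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt(1) : -1;
            }
        } catch (Exception e) {
            return -1; // fail-open: a broken probe must not prevent a needed resize
        }
    }
}
```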
+ +**Notes**: +- `USER()` returns 'username@hostname' +- `SUBSTRING_INDEX(USER(), '@', 1)` extracts just the username +- `CONNECTION_ID()` excludes the current query connection +- Works with MySQL 5.1+ and MariaDB 10.0+ + +### Oracle + +```sql +-- Count active sessions for current user +SELECT COUNT(*) as connection_count +FROM v$session +WHERE username = SYS_CONTEXT('USERENV', 'SESSION_USER') + AND sid != SYS_CONTEXT('USERENV', 'SID') -- Exclude current session + AND status = 'ACTIVE'; +``` + +**Alternative (all sessions including inactive):** +```sql +SELECT COUNT(*) as connection_count +FROM v$session +WHERE username = SYS_CONTEXT('USERENV', 'SESSION_USER') + AND sid != SYS_CONTEXT('USERENV', 'SID') + AND type = 'USER'; -- Exclude background processes +``` + +**Permissions Required**: +- `SELECT` privilege on `v$session` (typically granted via `SELECT_CATALOG_ROLE` or similar) +- For non-DBA users, Oracle may require explicit grants + +**Notes**: +- `v$session` is a performance view +- `SYS_CONTEXT('USERENV', 'SESSION_USER')` gets the current user +- `SYS_CONTEXT('USERENV', 'SID')` gets the current session ID +- Works with Oracle 10g+ + +### SQL Server + +```sql +-- Count active connections for current user +SELECT COUNT(*) as connection_count +FROM sys.dm_exec_sessions +WHERE login_name = SUSER_SNAME() + AND session_id != @@SPID -- Exclude current session + AND is_user_process = 1; +``` + +**Alternative (with database context):** +```sql +SELECT COUNT(*) as connection_count +FROM sys.dm_exec_sessions s +WHERE s.login_name = SUSER_SNAME() + AND s.session_id != @@SPID + AND s.is_user_process = 1 + AND s.database_id = DB_ID(); -- Only current database +``` + +**Permissions Required**: +- `VIEW SERVER STATE` permission (server-level) +- Without this permission, users can only see their own sessions + +**Notes**: +- `sys.dm_exec_sessions` is a dynamic management view +- `SUSER_SNAME()` returns the current login name +- `@@SPID` is the current session ID +- 
`is_user_process = 1` excludes system processes +- Works with SQL Server 2005+ + +### DB2 + +```sql +-- Count active connections for current user (DB2 LUW) +SELECT COUNT(*) as connection_count +FROM TABLE(MON_GET_CONNECTION(NULL, -1)) +WHERE APPLICATION_HANDLE != MON_GET_APPLICATION_HANDLE() + AND SESSION_AUTH_ID = SESSION_USER; +``` + +**Alternative (application-level):** +```sql +SELECT COUNT(*) as connection_count +FROM SYSIBMADM.APPLICATIONS +WHERE AUTHID = USER + AND APPL_ID != (SELECT CURRENT SERVER FROM SYSIBM.SYSDUMMY1); +``` + +**Permissions Required**: +- `EXECUTE` privilege on monitoring functions +- `SELECT` privilege on `SYSIBMADM.APPLICATIONS` +- `DATAACCESS` or `DBADM` authority may be required + +**Notes**: +- `MON_GET_CONNECTION()` is a table function for connection monitoring +- `SESSION_USER` and `USER` return the current user +- Works with DB2 10.1+ for LUW (Linux, Unix, Windows) +- For DB2 z/OS, different system tables may be required + +### H2 + +```sql +-- Count active connections (H2 in-memory/embedded) +SELECT COUNT(*) as connection_count +FROM INFORMATION_SCHEMA.SESSIONS +WHERE USER_NAME = USER() + AND ID != SESSION_ID(); +``` + +**Permissions Required**: None (accessible by all users for their own sessions) + +**Notes**: +- H2 provides `INFORMATION_SCHEMA.SESSIONS` +- `USER()` returns the current user +- `SESSION_ID()` returns the current session ID +- Works with H2 1.4+ +- Limited utility in embedded mode (single JVM) + +### CockroachDB + +```sql +-- Count active sessions for current user +SELECT COUNT(*) as connection_count +FROM crdb_internal.cluster_sessions +WHERE user_name = current_user() + AND session_id != crdb_internal.cluster_session_id(); +``` + +**Alternative (using pg_stat_activity for PostgreSQL compatibility):** +```sql +SELECT COUNT(*) as connection_count +FROM pg_stat_activity +WHERE usename = CURRENT_USER + AND pid != pg_backend_pid(); +``` + +**Permissions Required**: None for current user's sessions + +**Notes**: +- 
CockroachDB supports PostgreSQL-compatible queries +- `crdb_internal.cluster_sessions` is CockroachDB-specific +- `pg_stat_activity` provides PostgreSQL compatibility +- Works with CockroachDB 19.1+ + +## Implementation Considerations + +### 1. Query Execution Context + +**When to Execute:** +- Before expanding pool (server failure detected) +- Before contracting pool (server recovery detected) +- On-demand via admin API (optional) + +**Frequency:** +- Should be rate-limited to avoid overhead +- Only execute when cluster health changes +- Cache results for a short period (e.g., 5-10 seconds) + +### 2. Connection Overhead + +**Query Cost:** +- All queries are lightweight (system views/tables) +- Execution time: typically < 100ms +- Uses one additional database connection briefly + +**Mitigation:** +- Use connection pool's existing connections +- Set query timeout (e.g., 5 seconds) +- Handle failures gracefully (proceed with resize on error) + +### 3. Expected Connection Counts + +**Baseline Calculation:** +``` +Expected Connections Per Server = Total Max Pool Size / Number of Healthy Servers +``` + +**Example (3-server cluster):** +- Total max pool size: 30 +- Normal operation: Each server has ~10 connections +- One server appears down: Expected ~15 connections on remaining servers +- If database shows ~30 connections total: **Network partition** (server still serving) +- If database shows ~20 connections total: **True failure** (server is down) + +### 4. 
Decision Logic

```
Current DB Connections = Query Result
Expected Connections After Resize = (Total Max Pool Size / New Healthy Server Count)
Connection Threshold = Total Max Pool Size * 0.85 // 85% threshold (recommended default)

IF Current DB Connections >= Connection Threshold THEN
    // Network partition likely - do NOT resize
    Log: "Detected potential network partition - database shows {count} connections, expected resize would be premature"
    Skip pool resize
ELSE
    // True failure - proceed with resize
    Log: "Confirmed node failure - database shows {count} connections (below threshold), proceeding with pool resize"
    Resize pools
END IF
```

### 5. Error Handling

**Query Failures:**
- Network timeout: Proceed with resize (fail-open)
- Permission denied: Log warning, proceed with resize
- Unknown database: Skip validation, proceed with resize
- Query syntax error: Log error, proceed with resize

**Fallback Strategy:**
If validation fails, the system should default to the current behavior (resize based on cluster health) to maintain availability.

## Security Considerations

### 1. Least Privilege

- Queries only access the current user's connection information
- No need for elevated privileges in most databases
- PostgreSQL and H2: No special permissions required
- MySQL: May need `PROCESS` privilege (or sees own connections)
- Oracle, SQL Server, DB2: May require monitoring permissions

### 2. Information Leakage

- Queries reveal only connection counts, not sensitive data
- No access to other users' session details
- No query content or result data exposed

### 3. SQL Injection Prevention

- All queries use database functions (CURRENT_USER, USER(), etc.) 
+- No user-supplied parameters in queries +- Prepared statements can still be used for safety + +## Performance Impact + +### Resource Usage + +**Database Side:** +- Minimal CPU impact (system view queries are optimized) +- No disk I/O (data is in-memory) +- Negligible memory overhead + +**OJP Server Side:** +- One additional query per health change event +- Query executes in ~10-100ms typically +- Minimal memory for result set (single integer) + +### Scalability + +**Small Deployments (2-3 servers):** +- Health changes are rare (server failures/recoveries) +- Validation query overhead negligible +- Total overhead: ~1-2 queries per hour in stable environment + +**Large Deployments (10+ servers):** +- Health changes more frequent +- Consider rate limiting (max 1 query per 5 seconds per datasource) +- Total overhead: ~10-20 queries per hour per datasource + +## Testing Strategy + +### Unit Tests + +```java +@Test +void testPostgreSQLConnectionCountQuery() { + int count = connectionValidator.getConnectionCount( + dataSource, + DbName.POSTGRES + ); + assertTrue(count >= 0); +} +``` + +### Integration Tests + +1. **Baseline Test**: Query connection count in normal operation +2. **Server Failure Test**: Verify count decreases when server actually fails +3. **Network Partition Simulation**: Mock partition, verify count remains high +4. **Threshold Test**: Verify resize is skipped when count above threshold + +### Performance Tests + +1. **Query Latency**: Measure query execution time across databases +2. **Concurrent Queries**: Test multiple simultaneous validation queries +3. 
**Load Test**: Verify validation doesn't impact normal operations + +## Monitoring and Observability + +### Metrics + +- `ojp.pool.resize.validation.query.duration_ms` - Query execution time +- `ojp.pool.resize.validation.skipped` - Count of skipped resizes +- `ojp.pool.resize.validation.performed` - Count of performed resizes +- `ojp.pool.resize.validation.errors` - Count of validation errors + +### Logs + +``` +INFO: Validating connection count before pool resize: connHash={}, currentHealthy={} +DEBUG: Database connection count query result: {} connections (threshold: {}) +INFO: Skipping pool resize - network partition detected ({} connections above threshold) +INFO: Proceeding with pool resize - confirmed node failure ({} connections below threshold) +WARN: Connection count validation failed: {} - proceeding with resize (conservative) +``` + +## Alternatives Considered + +### 1. Client-Side Connection Tracking Only + +**Pros:** +- No database queries required +- Lower latency decision +- No database permissions needed + +**Cons:** +- Cannot detect network partitions +- No ground truth from database +- Client state may be stale + +### 2. Heartbeat-Based Detection + +**Pros:** +- Active monitoring of server health +- Can detect various failure modes + +**Cons:** +- Doesn't solve network partition problem +- Additional network overhead +- Complex implementation + +### 3. Distributed Consensus (e.g., Raft, Paxos) + +**Pros:** +- Definitive cluster membership decisions +- Handles split-brain scenarios + +**Cons:** +- Significant complexity +- Requires coordination infrastructure +- Overkill for connection pooling + +**Recommendation**: Database connection count validation provides a simple, effective solution with minimal overhead. + +## Conclusion + +Querying the database for connection counts before pool resizing is a pragmatic approach to distinguish between true node failures and network partitions. The solution: + +1. 
**Works across major databases** with standard system views +2. **Minimal overhead** (~10-100ms per validation) +3. **Simple to implement** with existing infrastructure +4. **Safe fallback** (proceeds with resize on validation failure) +5. **Effective detection** of network partition scenarios + +The main trade-off is the additional database query, but this is acceptable given the infrequency of health changes and the benefit of avoiding unnecessary pool resizing. diff --git a/documents/analysis/connection-count-validation/DECISION_SUMMARY.md b/documents/analysis/connection-count-validation/DECISION_SUMMARY.md new file mode 100644 index 000000000..8367b67e8 --- /dev/null +++ b/documents/analysis/connection-count-validation/DECISION_SUMMARY.md @@ -0,0 +1,252 @@ +# Decision Summary: Connection Count Validation for Pool Resizing + +## TL;DR + +**Should we implement this?** ✅ **YES** - as an opt-in feature + +**Core Idea**: Query the database for actual connection count before resizing OJP server pools to distinguish network partitions from true node failures. 
+ +**Bottom Line**: +- **Benefit**: Prevents unnecessary pool expansion that could exceed database limits +- **Cost**: ~3-4 weeks development, minimal runtime overhead +- **Risk**: Low (opt-in, fail-open, comprehensive testing) + +## Quick Facts + +| Aspect | Details | +|--------|---------| +| **Feasibility** | ✅ Supported on all 7 major databases | +| **Complexity** | 🟡 Moderate (~500-1000 LOC) | +| **Performance Impact** | ✅ Negligible (~10-100ms, rare execution) | +| **Risk Level** | ✅ Low (opt-in, graceful fallback) | +| **Development Time** | 3-4 weeks (complete with tests + docs) | +| **Default State** | Disabled (opt-in) | + +## The Problem in 30 Seconds + +``` +3 OJP servers, 30 max connections total (10 each) + +Network Partition Occurs: +├─ Server3 appears DOWN to Servers 1 & 2 +├─ Servers 1 & 2 expand to 15 connections each +└─ Server3 still serving other clients with 10 connections + +Result: 10 + 15 + 15 = 40 connections (exceeds 30 limit!) 🚨 +``` + +## The Solution in 30 Seconds + +Before resizing, query database: + +```sql +SELECT COUNT(*) FROM pg_stat_activity WHERE usename = CURRENT_USER +-- If result ≈ 30: Network partition → Skip resize +-- If result ≈ 20: True failure → Proceed with resize +``` + +## Key Decision Points + +### 1. Opt-In vs Default-Enabled + +**Recommendation: Opt-In (disabled by default)** + +**Rationale**: +- ✅ Minimal risk to existing deployments +- ✅ Users can enable when needed +- ✅ Time to gather field feedback +- ❌ Requires user awareness and action + +**Alternative**: Could enable by default in future release after field validation. + +### 2. Fail-Open vs Fail-Closed + +**Recommendation: Fail-Open (proceed with resize on errors)** + +**Rationale**: +- ✅ Prioritizes availability +- ✅ Matches current behavior on failure +- ✅ Prevents validation from blocking legitimate failover +- ❌ May resize when validation intended to prevent it + +**Alternative**: Make configurable (`PROCEED` or `SKIP`), default to `PROCEED`. + +### 3. 
Threshold Selection + +**Recommendation: 85% of total max pool size** + +**Example**: +- Total max pool: 30 connections +- Threshold: 30 × 0.85 = 25.5 → 26 connections +- Decision: If DB shows ≥26 connections → Skip resize (partition) + +**Rationale**: +- ✅ Balances false positives vs false negatives +- ✅ Allows for some connection variance +- ✅ Configurable per environment +- ❌ May need tuning per deployment + +### 4. Database Priority + +**Recommendation: Start with PostgreSQL & MySQL** + +**Rationale**: +- ✅ Covers ~80% of OJP users +- ✅ Simpler queries (no special permissions) +- ✅ Faster initial release +- ✅ Add others incrementally + +**Full Support**: PostgreSQL, MySQL, Oracle, SQL Server, DB2, H2, CockroachDB + +## Pros & Cons + +### Pros ✅ + +| Benefit | Impact | +|---------|--------| +| Prevents unnecessary pool expansion | HIGH - Protects against connection limit violations | +| Minimal performance overhead | HIGH - Only ~10-100ms when cluster health changes | +| Works across all major databases | HIGH - Universal solution | +| Safe fallback on errors | HIGH - Doesn't break existing functionality | +| Configurable and tunable | MEDIUM - Operators can adjust to their environment | +| Comprehensive logging/metrics | MEDIUM - Good observability | + +### Cons ⚠️ + +| Concern | Impact | Mitigation | +|---------|--------|------------| +| Added complexity | MEDIUM | Localized, well-tested | +| Database permissions needed | LOW-MEDIUM | Most DBs don't require extra perms | +| Heuristic-based (not perfect) | MEDIUM | Document limitations clearly | +| Maintenance burden | LOW | Database queries rarely change | +| Threshold tuning required | LOW | Good default + configurable | + +## Key Questions Answered + +### Q: What if validation query fails? + +**A**: Fail-open (proceed with resize) to maintain availability. Comprehensive logging helps debug. + +### Q: What about multiple database users? + +**A**: Query only sees current user's connections. 
Document this limitation. Consider total connection query variant for advanced users. + +### Q: What's the database performance impact? + +**A**: Negligible. Query runs only when cluster health changes (rare), takes 10-100ms, uses system views (no disk I/O). + +### Q: What if threshold is wrong for my environment? + +**A**: Configurable via `ojp.pool.resize.validation.connectionThreshold=0.85` (0.0 to 1.0). Comprehensive logging helps tune. + +### Q: Does this solve network partitions completely? + +**A**: No - it's a heuristic that catches common cases. Edge cases still exist (gradual failures, partial partitions). Document as one tool among many. + +## Configuration Example + +```properties +# Enable validation (opt-in) +ojp.pool.resize.validation.enabled=true + +# 85% threshold - adjust based on your environment +ojp.pool.resize.validation.connectionThreshold=0.85 + +# Fail-open on errors (availability over correctness) +ojp.pool.resize.validation.failureMode=PROCEED + +# Query timeout (5 seconds) +ojp.pool.resize.validation.queryTimeout=5000 + +# Rate limiting (prevent query storms) +ojp.pool.resize.validation.rateLimitMs=5000 + +# Time-based override (force resize after 5 minutes) +ojp.pool.resize.validation.forceResizeAfterMs=300000 +``` + +## Alternatives Considered (and why rejected) + +| Alternative | Why Not? 
|
+|-------------|----------|
+| **Do Nothing** | Doesn't solve the problem, partitions still cause issues |
+| **Heartbeat-based** | Doesn't distinguish partition from failure |
+| **Distributed Consensus** | Overkill complexity for connection pooling |
+| **Time-based dampening only** | Just delays the problem, doesn't solve it |
+
+## Risk Assessment
+
+| Risk | Likelihood | Impact | Mitigation |
+|------|-----------|--------|------------|
+| Query failures | LOW | MEDIUM | Fail-open, timeout, logging |
+| Permission issues | LOW | LOW | Auto-disable, clear docs |
+| False positives | MEDIUM | LOW | Configurable threshold, time override |
+| Performance impact | LOW | LOW | Rate limiting, metrics |
+| Deployment issues | LOW | MEDIUM | Opt-in, comprehensive testing, rollback plan |
+
+## Testing Strategy
+
+- ✅ Unit tests for all query implementations
+- ✅ Integration tests with real databases
+- ✅ Network partition simulation tests
+- ✅ Load tests to verify overhead
+- ✅ Beta testing with select users
+
+## Success Criteria
+
+1. **Functional**: Correctly detects network partitions in ≥90% of test scenarios
+2. **Performance**: < 100ms overhead at the 99th percentile
+3. **Reliability**: Zero production incidents related to validation
+4. **Adoption**: ≥20% of multinode users enable after 6 months
+5. **Feedback**: Positive feedback from early adopters
+
+## Timeline
+
+```
+Week 1-2: Core implementation (PostgreSQL, MySQL)
+Week 3: Remaining databases + integration
+Week 4: Testing (unit, integration, performance)
+Week 5: Documentation + beta testing
+Week 6: Final QA + release preparation
+```
+
+## Go/No-Go Recommendation
+
+### ✅ **GO** with the following conditions:
+
+1. ✅ Implement as **opt-in feature** (disabled by default)
+2. ✅ Start with **PostgreSQL and MySQL** (add others incrementally)
+3. ✅ Use **fail-open** behavior (proceed on errors)
+4. ✅ Provide **comprehensive documentation** on limitations
+5. ✅ Include **beta testing period** before GA
+6. 
✅ Add **extensive logging and metrics** for observability + +### Decision Confidence: **HIGH** 🟢 + +**Justification**: +- Problem is real and impactful (connection limit violations) +- Solution is pragmatic and works across databases +- Implementation risk is low (opt-in, fail-open, localized changes) +- Trade-offs are acceptable (small overhead, heuristic-based) +- Testing strategy is comprehensive +- Rollback plan is clear + +## For More Details + +- 📘 **[README.md](./README.md)** - Navigation and overview +- 📊 **[ANALYSIS.md](./ANALYSIS.md)** - Complete analysis (23KB, 687 lines) +- 🔍 **[DATABASE_CONNECTION_COUNT_QUERIES.md](./DATABASE_CONNECTION_COUNT_QUERIES.md)** - Query reference (13KB, 416 lines) +- 🛠️ **[IMPLEMENTATION_GUIDE.md](./IMPLEMENTATION_GUIDE.md)** - Implementation details (29KB, 794 lines) + +## Bottom Line + +This feature solves a real problem (network partition causing connection limit violations) with a pragmatic, low-risk solution that works across all major databases. The opt-in approach minimizes risk while providing value to users who need it. + +**Recommendation**: ✅ **Approve and implement** as outlined in the implementation guide. + +--- + +**Version**: 1.0 +**Date**: 2026-01-19 +**Status**: Ready for Review and Decision +**Recommendation**: GO with opt-in implementation diff --git a/documents/analysis/connection-count-validation/IMPLEMENTATION_GUIDE.md b/documents/analysis/connection-count-validation/IMPLEMENTATION_GUIDE.md new file mode 100644 index 000000000..4cd2e771f --- /dev/null +++ b/documents/analysis/connection-count-validation/IMPLEMENTATION_GUIDE.md @@ -0,0 +1,794 @@ +# Implementation Guide: Connection Count Validation for Pool Resizing + +## Overview + +This document provides a detailed implementation guide for adding connection count validation to OJP's multinode pool resizing logic. This feature queries the database before resizing connection pools to distinguish between true node failures and network partitions. 
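Before walking through the architecture, the decision rule at the heart of this feature can be previewed in isolation. The sketch below is a minimal, standalone version of the logic detailed later in this guide (class and method names are illustrative, not the OJP API): the validator derives a threshold from the original total pool size and a configurable fraction, then skips the resize when the observed connection count meets it.

```java
// Illustrative sketch of the resize-decision rule described in this guide.
// Not the OJP API: class and method names here are hypothetical.
public class ResizeDecisionSketch {

    /** Returns true when the pool resize should be SKIPPED (partition suspected). */
    static boolean shouldSkipResize(int actualDbConnections,
                                    int totalMaxPoolSize,
                                    double thresholdFraction) {
        // e.g. ceil(30 * 0.85) = 26 connections, matching the guide's example
        int threshold = (int) Math.ceil(totalMaxPoolSize * thresholdFraction);
        return actualDbConnections >= threshold;
    }

    public static void main(String[] args) {
        // 3-server example from the analysis: 30 total max connections, 85% threshold
        System.out.println(shouldSkipResize(28, 30, 0.85)); // partition suspected -> true
        System.out.println(shouldSkipResize(20, 30, 0.85)); // true failure -> false
    }
}
```

The real implementation wraps this comparison in rate limiting, a time-based override, and fail-open error handling, as shown in the sections that follow.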
+
+## Architecture
+
+### Component Diagram
+
+```
+┌────────────────────────────────────────────────────────────────┐
+│ ProcessClusterHealthAction                                     │
+│ (Existing class - trigger point for pool resizing)             │
+└────────────────────────────────┬───────────────────────────────┘
+                                 │
+                                 │ calls if validation enabled
+                                 ▼
+┌────────────────────────────────────────────────────────────────┐
+│ PoolResizeValidator                                            │
+│ - Orchestrates validation process                              │
+│ - Implements rate limiting                                     │
+│ - Handles time-based overrides                                 │
+│ - Makes resize decision                                        │
+└────────────────────────────────┬───────────────────────────────┘
+                                 │
+                                 │ gets query for database type
+                                 ▼
+┌────────────────────────────────────────────────────────────────┐
+│ ConnectionCountQueryFactory                                    │
+│ - Maps DbName to appropriate query implementation              │
+└────────────────────────────────┬───────────────────────────────┘
+                                 │
+                                 │ returns
+                                 ▼
+┌────────────────────────────────────────────────────────────────┐
+│ ConnectionCountQuery (Interface)                               │
+│ + getQuery(): String                                           │
+│ + executeQuery(Connection): int                                │
+└────────────────────────────────┬───────────────────────────────┘
+                                 │
+                                 │ implemented by (one per database)
+                                 ▼
+┌────────────────────────────────────────────────────────────────┐
+│ PostgreSQLConnectionCountQuery   MySQLConnectionCountQuery     │
+│ OracleConnectionCountQuery       SQLServerConnectionCountQuery │
+│ DB2ConnectionCountQuery          H2ConnectionCountQuery        │
+│ CockroachDBConnectionCountQuery                                │
+└────────────────────────────────────────────────────────────────┘
+```
+
+### Data Flow
+
+```
+1. Client reports cluster health change
+   ↓
+2. ProcessClusterHealthAction detects health change
+   ↓
+3. Check if validation is enabled (config)
+   ↓
+4. PoolResizeValidator.validate() called
+   ↓
+5. Check rate limit (last validation time)
+   ↓
+6. Get appropriate ConnectionCountQuery from factory
+   ↓
+7. Borrow connection from pool
+   ↓
+8. Execute query with timeout
+   ↓
+9. 
Compare result to threshold + ↓ +10. Return ValidationResult (PROCEED or SKIP) + ↓ +11. If SKIP: Log and return (no resize) + If PROCEED: Continue with pool resize +``` + +## Implementation Details + +### 1. Configuration Properties + +Add to `ServerConfiguration.java`: + +```java +public class ServerConfiguration { + // ... existing properties ... + + // Connection count validation properties + private boolean poolResizeValidationEnabled = false; // Opt-in + private double poolResizeValidationThreshold = 0.85; // 85% threshold + private String poolResizeValidationFailureMode = "PROCEED"; // PROCEED or SKIP + private int poolResizeValidationQueryTimeout = 5000; // 5 seconds + private int poolResizeValidationRateLimitMs = 5000; // 5 seconds + private long poolResizeValidationForceResizeAfterMs = 300000; // 5 minutes + + // Getters and setters + // ... +} +``` + +Load from properties file: + +```properties +# ojp-server.properties +ojp.pool.resize.validation.enabled=false +ojp.pool.resize.validation.connectionThreshold=0.85 +ojp.pool.resize.validation.failureMode=PROCEED +ojp.pool.resize.validation.queryTimeout=5000 +ojp.pool.resize.validation.rateLimitMs=5000 +ojp.pool.resize.validation.forceResizeAfterMs=300000 +``` + +### 2. ConnectionCountQuery Interface + +Create new interface: + +```java +package org.openjproxy.grpc.server.pool.validation; + +import java.sql.Connection; +import java.sql.SQLException; + +/** + * Interface for database-specific connection count queries. + * Implementations provide the SQL query and execution logic to count + * active connections for the current database user. + */ +public interface ConnectionCountQuery { + + /** + * Returns the SQL query to count connections. 
+ * The query should: + * - Count connections for the current user only + * - Exclude the connection used to execute the query + * - Optionally filter by connection state (active vs idle) + * + * @return SQL query string + */ + String getQuery(); + + /** + * Executes the connection count query. + * + * @param conn Database connection (from pool) + * @return Number of active connections for current user + * @throws SQLException if query execution fails + */ + int executeQuery(Connection conn) throws SQLException; + + /** + * Returns a human-readable description of what the query does. + * Used for logging and documentation. + * + * @return Description string + */ + default String getDescription() { + return "Counts connections for current user"; + } +} +``` + +### 3. Database-Specific Implementations + +#### PostgreSQL + +```java +package org.openjproxy.grpc.server.pool.validation.queries; + +import org.openjproxy.grpc.server.pool.validation.ConnectionCountQuery; + +import java.sql.Connection; +import java.sql.PreparedStatement; +import java.sql.ResultSet; +import java.sql.SQLException; + +public class PostgreSQLConnectionCountQuery implements ConnectionCountQuery { + + private static final String QUERY = + "SELECT COUNT(*) as connection_count " + + "FROM pg_stat_activity " + + "WHERE usename = CURRENT_USER " + + " AND pid != pg_backend_pid()"; + + @Override + public String getQuery() { + return QUERY; + } + + @Override + public int executeQuery(Connection conn) throws SQLException { + try (PreparedStatement stmt = conn.prepareStatement(QUERY); + ResultSet rs = stmt.executeQuery()) { + if (rs.next()) { + return rs.getInt("connection_count"); + } + throw new SQLException("Query returned no results"); + } + } + + @Override + public String getDescription() { + return "PostgreSQL: Counts active connections from pg_stat_activity"; + } +} +``` + +#### MySQL/MariaDB + +```java +public class MySQLConnectionCountQuery implements ConnectionCountQuery { + + private static 
final String QUERY = + "SELECT COUNT(*) as connection_count " + + "FROM information_schema.PROCESSLIST " + + "WHERE USER = SUBSTRING_INDEX(USER(), '@', 1) " + + " AND ID != CONNECTION_ID()"; + + @Override + public String getQuery() { + return QUERY; + } + + @Override + public int executeQuery(Connection conn) throws SQLException { + try (PreparedStatement stmt = conn.prepareStatement(QUERY); + ResultSet rs = stmt.executeQuery()) { + if (rs.next()) { + return rs.getInt("connection_count"); + } + throw new SQLException("Query returned no results"); + } + } + + @Override + public String getDescription() { + return "MySQL/MariaDB: Counts processes from INFORMATION_SCHEMA"; + } +} +``` + +#### Oracle + +```java +public class OracleConnectionCountQuery implements ConnectionCountQuery { + + private static final String QUERY = + "SELECT COUNT(*) as connection_count " + + "FROM v$session " + + "WHERE username = SYS_CONTEXT('USERENV', 'SESSION_USER') " + + " AND sid != SYS_CONTEXT('USERENV', 'SID') " + + " AND type = 'USER'"; + + @Override + public String getQuery() { + return QUERY; + } + + @Override + public int executeQuery(Connection conn) throws SQLException { + try (PreparedStatement stmt = conn.prepareStatement(QUERY); + ResultSet rs = stmt.executeQuery()) { + if (rs.next()) { + return rs.getInt("connection_count"); + } + throw new SQLException("Query returned no results"); + } + } + + @Override + public String getDescription() { + return "Oracle: Counts sessions from v$session"; + } +} +``` + +*(Similar implementations for SQL Server, DB2, H2, CockroachDB - see DATABASE_CONNECTION_COUNT_QUERIES.md)* + +### 4. ConnectionCountQueryFactory + +```java +package org.openjproxy.grpc.server.pool.validation; + +import com.openjproxy.grpc.DbName; +import org.openjproxy.grpc.server.pool.validation.queries.*; + +/** + * Factory for creating database-specific connection count queries. 
+ */ +public class ConnectionCountQueryFactory { + + /** + * Returns the appropriate ConnectionCountQuery for the given database type. + * + * @param dbName Database type + * @return ConnectionCountQuery implementation + * @throws IllegalArgumentException if database type is not supported + */ + public static ConnectionCountQuery getQuery(DbName dbName) { + switch (dbName) { + case POSTGRES: + return new PostgreSQLConnectionCountQuery(); + case MYSQL: + case MARIADB: + return new MySQLConnectionCountQuery(); + case ORACLE: + return new OracleConnectionCountQuery(); + case SQLSERVER: + return new SQLServerConnectionCountQuery(); + case DB2: + return new DB2ConnectionCountQuery(); + case H2: + return new H2ConnectionCountQuery(); + case COCKROACHDB: + return new CockroachDBConnectionCountQuery(); + default: + throw new IllegalArgumentException( + "Connection count query not supported for database: " + dbName); + } + } + + /** + * Checks if connection count validation is supported for the given database. + * + * @param dbName Database type + * @return true if supported, false otherwise + */ + public static boolean isSupported(DbName dbName) { + try { + getQuery(dbName); + return true; + } catch (IllegalArgumentException e) { + return false; + } + } +} +``` + +### 5. ValidationResult Class + +```java +package org.openjproxy.grpc.server.pool.validation; + +/** + * Result of connection count validation. 
+ */ +public class ValidationResult { + + public enum Decision { + /** Proceed with pool resize (node failure confirmed or validation failed) */ + PROCEED_WITH_RESIZE, + + /** Skip pool resize (network partition detected) */ + SKIP_RESIZE + } + + private final Decision decision; + private final int actualConnectionCount; + private final int threshold; + private final String reason; + private final boolean validationPerformed; + + private ValidationResult(Decision decision, int actualConnectionCount, + int threshold, String reason, boolean validationPerformed) { + this.decision = decision; + this.actualConnectionCount = actualConnectionCount; + this.threshold = threshold; + this.reason = reason; + this.validationPerformed = validationPerformed; + } + + public static ValidationResult proceed(String reason) { + return new ValidationResult(Decision.PROCEED_WITH_RESIZE, -1, -1, reason, false); + } + + public static ValidationResult proceed(int actualCount, int threshold, String reason) { + return new ValidationResult(Decision.PROCEED_WITH_RESIZE, actualCount, threshold, reason, true); + } + + public static ValidationResult skip(int actualCount, int threshold, String reason) { + return new ValidationResult(Decision.SKIP_RESIZE, actualCount, threshold, reason, true); + } + + // Getters + public Decision getDecision() { return decision; } + public int getActualConnectionCount() { return actualConnectionCount; } + public int getThreshold() { return threshold; } + public String getReason() { return reason; } + public boolean isValidationPerformed() { return validationPerformed; } + + @Override + public String toString() { + if (!validationPerformed) { + return String.format("ValidationResult{decision=%s, reason='%s'}", + decision, reason); + } + return String.format( + "ValidationResult{decision=%s, actualCount=%d, threshold=%d, reason='%s'}", + decision, actualConnectionCount, threshold, reason); + } +} +``` + +### 6. 
PoolResizeValidator
+
+```java
+package org.openjproxy.grpc.server.pool.validation;
+
+import com.openjproxy.grpc.DbName;
+import lombok.extern.slf4j.Slf4j;
+import org.openjproxy.grpc.server.MultinodePoolCoordinator;
+import org.openjproxy.grpc.server.ServerConfiguration;
+
+import javax.sql.DataSource;
+import java.sql.Connection;
+import java.sql.SQLException;
+import java.util.Map;
+import java.util.concurrent.ConcurrentHashMap;
+
+/**
+ * Validates whether pool resizing should proceed by querying the database
+ * for the current connection count.
+ */
+@Slf4j
+public class PoolResizeValidator {
+
+    private final ServerConfiguration config;
+    private final Map<String, Long> lastValidationTime = new ConcurrentHashMap<>();
+    private final Map<String, Long> lastHealthChangeTime = new ConcurrentHashMap<>();
+
+    public PoolResizeValidator(ServerConfiguration config) {
+        this.config = config;
+    }
+
+    /**
+     * Validates whether pool resize should proceed based on database connection count.
+     *
+     * @param connHash Connection hash identifying the datasource
+     * @param dataSource DataSource to query
+     * @param dbName Database type
+     * @param allocation Current pool allocation
+     * @param newHealthyServerCount New number of healthy servers
+     * @return ValidationResult indicating whether to proceed or skip resize
+     */
+    public ValidationResult validate(String connHash,
+                                     DataSource dataSource,
+                                     DbName dbName,
+                                     MultinodePoolCoordinator.PoolAllocation allocation,
+                                     int newHealthyServerCount) {
+
+        log.debug("Starting pool resize validation: connHash={}, dbName={}, newHealthyServers={}",
+                connHash, dbName, newHealthyServerCount);
+
+        // Check rate limiting
+        if (!checkRateLimit(connHash)) {
+            log.debug("Validation skipped due to rate limiting: connHash={}", connHash);
+            return ValidationResult.proceed("Rate limited - validation skipped");
+        }
+
+        // Check time-based override
+        if (shouldForceResize(connHash)) {
+            log.info("Forcing pool resize due to time-based override: connHash={}", connHash);
+            return 
ValidationResult.proceed("Time-based override triggered"); + } + + // Check if validation is supported for this database + if (!ConnectionCountQueryFactory.isSupported(dbName)) { + log.debug("Connection count validation not supported for {}: connHash={}", + dbName, connHash); + return ValidationResult.proceed("Database type not supported: " + dbName); + } + + // Execute validation query + try { + int actualCount = queryConnectionCount(dataSource, dbName); + int threshold = calculateThreshold(allocation); + + log.info("Connection count validation: connHash={}, actualCount={}, threshold={}", + connHash, actualCount, threshold); + + if (actualCount >= threshold) { + // Likely network partition - high connection count suggests "failed" server is still serving + String reason = String.format( + "Network partition detected: %d connections >= threshold %d", + actualCount, threshold); + log.warn("Skipping pool resize for {}: {}", connHash, reason); + return ValidationResult.skip(actualCount, threshold, reason); + } else { + // True failure - connection count dropped as expected + String reason = String.format( + "Node failure confirmed: %d connections < threshold %d", + actualCount, threshold); + log.info("Proceeding with pool resize for {}: {}", connHash, reason); + return ValidationResult.proceed(actualCount, threshold, reason); + } + + } catch (Exception e) { + log.warn("Connection count validation failed for {}: {} - {}", + connHash, e.getClass().getSimpleName(), e.getMessage()); + + // Fail-open: proceed with resize on validation failure + if ("PROCEED".equalsIgnoreCase(config.getPoolResizeValidationFailureMode())) { + return ValidationResult.proceed("Validation failed: " + e.getMessage()); + } else { + return ValidationResult.skip(-1, -1, "Validation failed (fail-closed): " + e.getMessage()); + } + } + } + + private boolean checkRateLimit(String connHash) { + long now = System.currentTimeMillis(); + Long lastTime = lastValidationTime.get(connHash); + + if (lastTime 
== null || (now - lastTime) >= config.getPoolResizeValidationRateLimitMs()) {
+            lastValidationTime.put(connHash, now);
+
+            // Track first health change time for time-based override
+            lastHealthChangeTime.putIfAbsent(connHash, now);
+
+            return true;
+        }
+
+        return false;
+    }
+
+    private boolean shouldForceResize(String connHash) {
+        if (config.getPoolResizeValidationForceResizeAfterMs() <= 0) {
+            return false; // Feature disabled
+        }
+
+        Long firstChangeTime = lastHealthChangeTime.get(connHash);
+        if (firstChangeTime == null) {
+            return false;
+        }
+
+        long elapsed = System.currentTimeMillis() - firstChangeTime;
+        if (elapsed >= config.getPoolResizeValidationForceResizeAfterMs()) {
+            // Clear tracking to reset timer for next cycle
+            lastHealthChangeTime.remove(connHash);
+            return true;
+        }
+
+        return false;
+    }
+
+    private int queryConnectionCount(DataSource dataSource, DbName dbName) throws SQLException {
+        ConnectionCountQuery query = ConnectionCountQueryFactory.getQuery(dbName);
+
+        log.debug("Executing connection count query: {}", query.getDescription());
+
+        try (Connection conn = dataSource.getConnection()) {
+            // Set query timeout; pass a same-thread Executor because some JDBC
+            // drivers reject a null Executor in setNetworkTimeout
+            conn.setNetworkTimeout(Runnable::run, config.getPoolResizeValidationQueryTimeout());
+
+            return query.executeQuery(conn);
+        }
+    }
+
+    private int calculateThreshold(MultinodePoolCoordinator.PoolAllocation allocation) {
+        int originalMaxPoolSize = allocation.getOriginalMaxPoolSize();
+        double thresholdFraction = config.getPoolResizeValidationThreshold();
+
+        return (int) Math.ceil(originalMaxPoolSize * thresholdFraction);
+    }
+
+    /**
+     * Resets validation tracking for a connection hash.
+     * Called when connection is closed or datasource is removed.
+     *
+     * @param connHash Connection hash to reset
+     */
+    public void reset(String connHash) {
+        lastValidationTime.remove(connHash);
+        lastHealthChangeTime.remove(connHash);
+        log.debug("Reset validation tracking for connHash={}", connHash);
+    }
+}
+```
+
+### 7. 
Integration with ProcessClusterHealthAction + +Modify `ProcessClusterHealthAction.execute()`: + +```java +public class ProcessClusterHealthAction { + + private final PoolResizeValidator resizeValidator; // Add this + + public ProcessClusterHealthAction(ServerConfiguration config) { + this.resizeValidator = new PoolResizeValidator(config); + } + + public void execute(ActionContext context, SessionInfo sessionInfo) { + // ... existing health check logic ... + + if (healthChanged) { + int healthyServerCount = context.getClusterHealthTracker() + .countHealthyServers(clusterHealth); + + log.info("[POOL-RESIZE] Cluster health changed for {}, healthy servers: {}", + connHash, healthyServerCount); + + // NEW: Validate before resizing if enabled + if (context.getServerConfiguration().isPoolResizeValidationEnabled()) { + ValidationResult validation = validateResize( + context, connHash, healthyServerCount); + + if (validation.getDecision() == ValidationResult.Decision.SKIP_RESIZE) { + log.info("[POOL-RESIZE] Skipping resize for {}: {}", + connHash, validation.getReason()); + return; // Skip resize + } + + log.info("[POOL-RESIZE] Proceeding with resize for {}: {}", + connHash, validation.getReason()); + } + + // Update the pool coordinator with new healthy server count + ConnectionPoolConfigurer.getPoolCoordinator() + .updateHealthyServers(connHash, healthyServerCount); + + // Apply pool size changes + // ... existing resize logic ... 
+ } + } + + private ValidationResult validateResize(ActionContext context, + String connHash, + int newHealthyServerCount) { + DataSource ds = context.getDatasourceMap().get(connHash); + DbName dbName = context.getDbNameMap().get(connHash); + MultinodePoolCoordinator.PoolAllocation allocation = + ConnectionPoolConfigurer.getPoolCoordinator().getPoolAllocation(connHash); + + if (ds == null || dbName == null || allocation == null) { + log.warn("Cannot validate resize: missing datasource, dbName, or allocation"); + return ValidationResult.proceed("Required components not available"); + } + + return resizeValidator.validate(connHash, ds, dbName, allocation, newHealthyServerCount); + } +} +``` + +## Testing + +### Unit Tests + +Create `PoolResizeValidatorTest.java`: + +```java +class PoolResizeValidatorTest { + + @Test + void testValidationProceedsWhenBelowThreshold() { + // Setup with 30 max pool size, 85% threshold = 26 + // Simulate 20 actual connections (below threshold) + // Expect: PROCEED + } + + @Test + void testValidationSkipsWhenAboveThreshold() { + // Setup with 30 max pool size, 85% threshold = 26 + // Simulate 28 actual connections (above threshold) + // Expect: SKIP + } + + @Test + void testRateLimitingPreventsFrequentQueries() { + // Call validate() twice within rate limit window + // Expect: Second call returns PROCEED without querying + } + + @Test + void testTimeBasedOverrideForcesResize() { + // Simulate time passing beyond force resize threshold + // Expect: PROCEED regardless of connection count + } + + @Test + void testFailOpenOnQueryError() { + // Configure fail-open mode + // Simulate query failure + // Expect: PROCEED + } + + @Test + void testFailClosedOnQueryError() { + // Configure fail-closed mode + // Simulate query failure + // Expect: SKIP + } +} +``` + +### Integration Tests + +Create `ConnectionCountValidationIntegrationTest.java`: + +```java +class ConnectionCountValidationIntegrationTest { + + @Test + void 
testPostgreSQLConnectionCountQuery() { + // Use testcontainers to start PostgreSQL + // Create 10 connections + // Query connection count + // Verify count is approximately 10 + } + + @Test + void testValidationWithRealDatabase() { + // 3-server setup with HikariCP + // Create connections + // Simulate server failure + // Verify validation correctly detects scenario + } +} +``` + +## Deployment + +### Phase 1: Development (Weeks 1-2) +- Implement all query classes +- Implement PoolResizeValidator +- Unit tests for all components + +### Phase 2: Integration (Week 3) +- Integrate with ProcessClusterHealthAction +- Add configuration properties +- Integration tests with real databases + +### Phase 3: Testing (Week 4) +- Load testing +- Network partition simulation +- Performance verification + +### Phase 4: Documentation (Week 5) +- Operator guide +- Configuration examples +- Troubleshooting guide + +### Phase 5: Release (Week 6) +- Beta testing with select users +- Bug fixes +- GA release + +## Rollback Plan + +If issues arise: + +1. **Immediate**: Disable via config (`ojp.pool.resize.validation.enabled=false`) +2. **Code-level**: Validation is opt-in, so disabling reverts to previous behavior +3. **Emergency**: Remove validation check from ProcessClusterHealthAction (simple if statement) + +No database schema changes, so rollback is clean. 
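As a quick sanity check of the decision arithmetic, the threshold logic from `calculateThreshold` can be exercised in isolation. The 30-connection pool and 85% default below are the values used in the unit-test scenarios above; `ThresholdSketch` is a hypothetical standalone class, not part of the implementation:

```java
public class ThresholdSketch {

    // Mirrors calculateThreshold(): round up so a fractional threshold
    // never under-counts (e.g., 30 * 0.85 = 25.5 -> 26)
    static int threshold(int originalMaxPoolSize, double thresholdFraction) {
        return (int) Math.ceil(originalMaxPoolSize * thresholdFraction);
    }

    public static void main(String[] args) {
        int limit = threshold(30, 0.85); // 30-connection pool, default 85% threshold
        System.out.println(limit);       // 26

        // Decision as described in the unit-test scenarios:
        // observed count below the threshold -> true failure -> proceed with resize;
        // at or above the threshold -> likely partition -> skip resize
        System.out.println(20 >= limit ? "SKIP_RESIZE" : "PROCEED"); // 20 active -> PROCEED
        System.out.println(28 >= limit ? "SKIP_RESIZE" : "PROCEED"); // 28 active -> SKIP_RESIZE
    }
}
```

With the default 85% threshold, a 30-connection pool skips resizing only when 26 or more of the user's connections are still observed at the database.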
+
+## Monitoring
+
+Add metrics:
+
+```java
+// In PoolResizeValidator — illustrative pseudocode; meter()/timer()/counter()
+// are placeholders to adapt to the project's metrics facade (e.g., Micrometer)
+meter("ojp.pool.resize.validation.attempts", tags("connHash", connHash));
+timer("ojp.pool.resize.validation.duration", tags("database", dbName.name()));
+counter("ojp.pool.resize.validation.decision", tags("decision", decision.name(), "connHash", connHash));
+counter("ojp.pool.resize.validation.errors", tags("error", errorType, "database", dbName.name()));
+```
+
+Add logs:
+
+```
+INFO: Pool resize validation enabled for connHash={}
+DEBUG: Executing connection count query: {} (timeout: {}ms)
+INFO: Connection count: {}, threshold: {}, decision: {}
+WARN: Connection count validation failed: {} - proceeding with resize
+INFO: Skipping pool resize - network partition detected
+```
+
+## Conclusion
+
+This implementation provides:
+
+- ✅ Clear separation of concerns (query, validation, integration)
+- ✅ Database-agnostic design with easy extensibility
+- ✅ Comprehensive error handling
+- ✅ Configurable behavior
+- ✅ Testable architecture
+- ✅ Production-ready monitoring
+
+Estimated implementation: **3-4 weeks** for the complete feature with testing and documentation.
diff --git a/documents/analysis/connection-count-validation/README.md b/documents/analysis/connection-count-validation/README.md
new file mode 100644
index 000000000..58e91062d
--- /dev/null
+++ b/documents/analysis/connection-count-validation/README.md
@@ -0,0 +1,197 @@
+# Connection Count Validation for Pool Resizing - Analysis
+
+## Overview
+
+This directory contains a comprehensive analysis of implementing database connection count validation before resizing OJP server connection pools in multinode deployments. The goal is to distinguish between true node failures and network partitions to prevent unnecessary pool resizing. 
+ +## Problem Statement + +In a multinode OJP setup (e.g., 3 servers with 10 max connections each), when a server appears down to some clients, the remaining servers expand their pools (e.g., to 15 connections each). However, in a network partition scenario, the "failed" server may still be serving other clients, leading to: + +- Total connections exceeding the configured limit (10 + 15 + 15 = 40 instead of 30) +- Potential database overload +- Unnecessary resource consumption + +## Proposed Solution + +Before resizing connection pools, query the database to verify the actual number of connections from the current user: + +- **High connection count** (near total max): Likely network partition → **Skip resize** +- **Low connection count** (well below max): True node failure → **Proceed with resize** + +## Documents in This Analysis + +### 1. [ANALYSIS.md](./ANALYSIS.md) + +**Primary analysis document** containing: + +- ✅ Detailed feasibility assessment +- ✅ Critical questions and concerns +- ✅ Design opinions and suggestions +- ✅ Criticisms and risk analysis +- ✅ Alternative approaches considered +- ✅ Implementation recommendations +- ✅ Testing strategy +- ✅ Go/No-Go decision framework + +**Key Findings:** + +- **Feasible** across all major databases (PostgreSQL, MySQL, Oracle, SQL Server, DB2, H2, CockroachDB) +- **Moderate complexity** (~500-1000 lines of code) +- **Low overhead** (~10-100ms per validation, rare execution) +- **Recommended approach**: Opt-in feature with fail-open behavior + +**Key Concerns:** + +1. Multiple database users sharing OJP deployment +2. Database-level connection pooling interference +3. Query failure handling strategy +4. Threshold tuning requirements +5. Incomplete solution for all partition scenarios + +**Recommendation**: **GO** with opt-in, fail-open implementation + +### 2. 
[DATABASE_CONNECTION_COUNT_QUERIES.md](./DATABASE_CONNECTION_COUNT_QUERIES.md) + +**Database-specific query reference** containing: + +- ✅ SQL queries for each supported database +- ✅ Permission requirements per database +- ✅ Query performance characteristics +- ✅ Implementation considerations +- ✅ Testing strategies +- ✅ Monitoring recommendations + +**Highlights:** + +- All queries use system views/tables (no application schema changes) +- Most databases require no special permissions for current user queries +- Queries exclude the validation query connection itself +- Typical execution time: 10-100ms + +### 3. [IMPLEMENTATION_GUIDE.md](./IMPLEMENTATION_GUIDE.md) + +**Detailed implementation guide** containing: + +- ✅ Architecture diagrams +- ✅ Component specifications +- ✅ Code structure and interfaces +- ✅ Complete implementation examples +- ✅ Integration points +- ✅ Testing approach +- ✅ Deployment plan + +**Key Components:** + +1. `ConnectionCountQuery` interface - Database-specific query abstraction +2. `ConnectionCountQueryFactory` - Maps database types to queries +3. `PoolResizeValidator` - Orchestrates validation logic +4. `ValidationResult` - Encapsulates validation decision +5. Integration with `ProcessClusterHealthAction` + +**Estimated Effort**: 3-4 weeks (development + testing + documentation) + +## Quick Decision Summary + +### Should We Implement This? 
+ +**YES**, with the following approach: + +| Aspect | Recommendation | Rationale | +|--------|----------------|-----------| +| **Default State** | Disabled (opt-in) | Minimize risk, let users enable when needed | +| **Failure Handling** | Fail-open (proceed with resize) | Prioritize availability over correctness | +| **Threshold** | 85% of max pool size | Balance between false positives and false negatives | +| **Rate Limiting** | 1 query per 5 seconds per datasource | Prevent excessive database load | +| **Time Override** | Force resize after 5 minutes | Prevent permanent pool size mismatch | +| **Initial Databases** | PostgreSQL, MySQL first | Cover 80% of users, add others incrementally | + +### Configuration Example + +```properties +# Enable connection count validation (opt-in) +ojp.pool.resize.validation.enabled=true + +# Connection count threshold (85% of max pool size) +ojp.pool.resize.validation.connectionThreshold=0.85 + +# Fail-open: proceed with resize on validation failure +ojp.pool.resize.validation.failureMode=PROCEED + +# Query timeout +ojp.pool.resize.validation.queryTimeout=5000 + +# Rate limiting (max 1 query per 5 seconds) +ojp.pool.resize.validation.rateLimitMs=5000 + +# Force resize after 5 minutes regardless of validation +ojp.pool.resize.validation.forceResizeAfterMs=300000 +``` + +## Benefits vs Trade-offs + +### Benefits ✅ + +1. **Prevents unnecessary pool expansion** in network partition scenarios +2. **Reduces risk** of exceeding database connection limits +3. **Provides operators control** over pool behavior +4. **Works across all major databases** with minimal overhead +5. **Safe fallback** (proceeds with resize on errors) + +### Trade-offs ⚠️ + +1. **Added complexity** (~500-1000 lines of code) +2. **Requires database permissions** in some cases (Oracle, SQL Server, DB2) +3. **Heuristic-based** (not 100% accurate) +4. **Small performance overhead** (~10-100ms per health change) +5. 
**Maintenance burden** (7 database-specific implementations) + +## When to Use This Feature + +### ✅ Good Use Cases + +- Multi-region deployments with potential network partitions +- Environments where exceeding connection limits is critical +- Deployments with strict database connection quotas +- Operators who can monitor and tune thresholds + +### ❌ Not Recommended For + +- Single-region, reliable network environments +- Deployments without monitoring/observability +- Environments where database permissions are restricted +- Simple single-node deployments + +## Next Steps + +If approved: + +1. **Week 1-2**: Implement core validation logic (PostgreSQL, MySQL first) +2. **Week 3**: Add remaining database support and integration +3. **Week 4**: Comprehensive testing (unit, integration, performance) +4. **Week 5**: Documentation and beta testing +5. **Week 6**: Final QA and release + +## Questions or Feedback? + +This analysis is intended to facilitate discussion and decision-making. Key topics for review: + +1. **Is the opt-in approach acceptable?** (vs default-enabled) +2. **Is the 85% threshold reasonable?** (configurable, but needs a good default) +3. **Is fail-open the right default?** (vs fail-closed) +4. **Should we support all databases initially?** (vs PostgreSQL/MySQL first) +5. **Are there use cases we haven't considered?** + +## Related Documentation + +- [Multinode Architecture](../../multinode/README.md) +- [Server Recovery and Redistribution](../../multinode/server-recovery-and-redistribution.md) +- [Connection Pool Configuration](../../connection-pool/configuration.md) +- [OJP Server Configuration](../../configuration/ojp-server-configuration.md) + +--- + +**Analysis Version**: 1.0 +**Date**: 2026-01-19 +**Status**: Complete - Ready for Review +**Recommendation**: Implement as opt-in feature with fail-open behavior