ANTLR filter parser Unicode handling #981

@LohithkumarAV

# SCIM Filter Parser Fails with Accented Characters

## Summary
The Apache Directory SCIM filter parser rejects SCIM search requests whose filter values contain accented (diacritic) characters. The server responds with `400 Bad Request` and the error `"Unable to map or parse JSON to SCIM schema"`.

## Environment
- **Library**: `org.apache.directory.scim:scim-spec`
- **Parser**: ANTLR-based filter parser in `org.apache.directory.scim.spec.filter.Filter`
- **Java Version**: 17+
- **Affected Component**: `GroupService.find()` method calling `buildFilterTree(filter)`

## Impact
- **Severity**: High
- **Scope**: Blocks the internationalized string handling described in RFC 7644 (Section 5, Preparation and Comparison of Internationalized Strings)
- **Affected Operations**: All SCIM search operations with accented characters in filter values

## Steps to Reproduce

### 1. Create a group with accented characters
**Request:**
```json
POST /scim/v2/Groups
{
  "schemas": ["urn:ietf:params:scim:schemas:core:2.0:Group"],
  "displayName": "José's Team"
}
```

**Response:** ✅ Success (201 Created)
```json
{
  "id": "468b6df5-80aa-4c94-ab39-75e36172d859",
  "displayName": "José's Team"
}
```

### 2. Search with exact accented characters
**Request:**
```json
POST /scim/v2/Groups/.search
{
  "schemas": ["urn:ietf:params:scim:api:messages:2.0:SearchRequest"],
  "filter": "displayName eq \"José's Team\"",
  "startIndex": 1,
  "count": 10
}
```

**Response:** ❌ Failure (400 Bad Request)
```json
{
  "status": 400,
  "scimType": "invalidSyntax",
  "error": "Unable to map or parse JSON to SCIM schema. Please check syntax and field types."
}
```

### 3. Search without accents (workaround)
**Request:**
```json
POST /scim/v2/Groups/.search
{
  "schemas": ["urn:ietf:params:scim:api:messages:2.0:SearchRequest"],
  "filter": "displayName eq \"Jose's Team\"",
  "startIndex": 1,
  "count": 10
}
```

**Response:** ✅ Success (200 OK), returns the group with displayName: "José's Team"
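
To rule out a transport-level encoding problem, the failing search can be replayed with the charset stated explicitly on both the header and the body bytes. The sketch below uses the JDK `java.net.http` client. The base URL `http://localhost:8080` and the class name `SearchRepro` are placeholders, not part of this report. Non-ASCII characters are written as Java Unicode escapes so the source compiles regardless of file encoding.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

final class SearchRepro {
    // Placeholder base URL; point this at the SCIM server under test.
    static final String BASE_URL = "http://localhost:8080";

    // Escapes backslashes and double quotes, which is enough for these filters.
    static String jsonEscape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Builds the SearchRequest body shown above for an arbitrary filter.
    static String searchBody(String filter) {
        return "{"
            + "\"schemas\":[\"urn:ietf:params:scim:api:messages:2.0:SearchRequest\"],"
            + "\"filter\":\"" + jsonEscape(filter) + "\","
            + "\"startIndex\":1,"
            + "\"count\":10"
            + "}";
    }

    // Builds the POST with UTF-8 declared in the header and used for the body bytes.
    static HttpRequest searchRequest(String filter) {
        return HttpRequest.newBuilder(URI.create(BASE_URL + "/scim/v2/Groups/.search"))
            .header("Content-Type", "application/scim+json; charset=utf-8")
            .POST(HttpRequest.BodyPublishers.ofString(searchBody(filter), StandardCharsets.UTF_8))
            .build();
    }

    public static void main(String[] args) throws Exception {
        // The escape in the literal is e-acute (U+00E9), giving the accented group name.
        String filter = "displayName eq \"Jos\u00e9's Team\"";
        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(searchRequest(filter), HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        System.out.println(response.statusCode());
        System.out.println(response.body());
    }
}
```

If the server still answers 400 with the charset pinned this way, the request encoding is unlikely to be the cause.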


## Test Results Matrix

| Filter Value | Expected | Actual | Status |
|-------------|----------|--------|--------|
| "José's Team" | 200 OK | 400 Bad Request | ❌ FAIL |
| "Jose's Team" | 200 OK | 200 OK | ✅ PASS |
| "JOSÉ'S TEAM" | 200 OK | 400 Bad Request | ❌ FAIL |
| "JOSE'S TEAM" | 200 OK | 200 OK | ✅ PASS |
| "Müller's Gruppe" | 200 OK | 400 Bad Request | ❌ FAIL |
| "Muller's Gruppe" | 200 OK | 200 OK | ✅ PASS |
| "Café Équipe" | 200 OK | 400 Bad Request | ❌ FAIL |
| "Cafe Equipe" | 200 OK | 200 OK | ✅ PASS |
| "Ñoño's Tëäm" | 200 OK | 400 Bad Request | ❌ FAIL |
| "Nono's Team" | 200 OK | 200 OK | ✅ PASS |
| "Åse's Øverhead" | 200 OK | 400 Bad Request | ❌ FAIL |
| "åse's øverhead" | 200 OK | 400 Bad Request | ❌ FAIL |


## Affected Character Sets

The parser fails with:
1. **Spanish accents**: José, JOSÉ
2. **German umlauts**: Müller, MÜLLER
3. **French accents**: Café, Équipe
4. **Multiple diacritics**: Ñoño's Tëäm
5. **Nordic characters**: Åse's Øverhead, åse's øverhead
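
When triaging cases like these, it can help to list the exact code points involved, because visually identical strings may be precomposed (one code point per accented letter) or decomposed (base letter plus combining mark). A minimal sketch using only the JDK; the class name `CodePoints` is illustrative and not part of the library:

```java
import java.util.ArrayList;
import java.util.List;

final class CodePoints {
    // Returns every code point above U+007F in the input, formatted as U+XXXX.
    static List<String> nonAscii(String value) {
        List<String> found = new ArrayList<>();
        value.codePoints()
             .filter(cp -> cp > 0x7F)
             .forEach(cp -> found.add(String.format("U+%04X", cp)));
        return found;
    }

    public static void main(String[] args) {
        // Precomposed e-acute: prints [U+00E9]
        System.out.println(nonAscii("Jos\u00e9's Team"));
        // Decomposed form, plain 'e' plus COMBINING ACUTE ACCENT: prints [U+0301]
        System.out.println(nonAscii("Jose\u0301's Team"));
        // Pure ASCII input: prints []
        System.out.println(nonAscii("Jose's Team"));
    }
}
```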


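## Root Cause Analysis

The ANTLR grammar used by the SCIM filter parser appears to mis-tokenize non-ASCII characters inside quoted string values. In the test matrix, every value containing a non-ASCII character fails and every pure-ASCII value passes:

1. **Accented characters next to an apostrophe**: José's, Müller's
2. **Multiple diacritics**: Ñoño's Tëäm
3. **Nordic characters**: Åse's Øverhead
4. **Accents without an apostrophe**: Café Équipe

The apostrophe itself is not the trigger: "Jose's Team" parses, while "Café Équipe" fails without one. The parser likely treats the non-ASCII characters as invalid token sequences rather than as part of a valid string literal.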
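
For comparison, RFC 7644 defines filter comparison values in terms of the JSON string grammar, under which any non-control character other than `"` and `\` may appear unescaped. The sketch below is a hand-written scanner for that rule. It is not the library's ANTLR grammar, and the class and method names are hypothetical. It illustrates that the failing values are well-formed string literals under the JSON rule.

```java
final class JsonStringScanner {
    // Returns the index just past the closing quote of the JSON-style string
    // literal that starts at 'start', or -1 if the literal is malformed.
    static int scanString(String input, int start) {
        if (start < 0 || start >= input.length() || input.charAt(start) != '"') {
            return -1;
        }
        int i = start + 1;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '"') {
                return i + 1;              // closing quote
            }
            if (c < 0x20) {
                return -1;                 // raw control characters are not allowed
            }
            if (c == '\\') {
                if (i + 1 >= input.length()) {
                    return -1;             // dangling backslash
                }
                char e = input.charAt(i + 1);
                if (e == 'u') {
                    // backslash, 'u', then exactly four hex digits
                    if (!isFourHexDigits(input, i + 2)) {
                        return -1;
                    }
                    i += 6;
                } else if ("\"\\/bfnrt".indexOf(e) >= 0) {
                    i += 2;                // two-character escape
                } else {
                    return -1;             // unknown escape
                }
                continue;
            }
            i++;                           // any other character, ASCII or not, is accepted
        }
        return -1;                         // unterminated literal
    }

    static boolean isFourHexDigits(String s, int from) {
        if (from + 4 > s.length()) {
            return false;
        }
        for (int k = from; k < from + 4; k++) {
            char c = s.charAt(k);
            boolean hex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
            if (!hex) {
                return false;
            }
        }
        return true;
    }
}
```

Calling `scanString(filter, filter.indexOf('"'))` on any filter from the matrix returns `filter.length()`, meaning the quoted value is consumed in full whether or not it contains accented characters.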


## Expected Behavior

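**RFC 7644 Section 3.4.2.2** (Filtering) defines filter comparison values using the JSON string grammar, so any valid Unicode character may appear inside a quoted value. String attributes are compared case-insensitively unless the attribute is defined as case-exact, and Section 5 of the same RFC covers preparation and comparison of internationalized strings.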

The filter parser should:
1. Accept any valid Unicode characters in string literals
2. Parse filter values containing accented characters without errors
3. Allow the application layer to perform normalization for comparison
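
As an illustration of point 3, an application-layer comparison key could be built with `java.text.Normalizer`. This is one possible approach that mirrors the accent-insensitive matching described in this report. It is not taken from the library or from the reporter's application, and the class name `FilterValueFolding` is hypothetical.

```java
import java.text.Normalizer;
import java.util.Locale;

final class FilterValueFolding {
    // Decomposes accented letters (NFD), drops the combining marks,
    // then lower-cases with a locale-independent rule.
    static String fold(String value) {
        String decomposed = Normalizer.normalize(value, Normalizer.Form.NFD);
        return decomposed.replaceAll("\\p{M}+", "").toLowerCase(Locale.ROOT);
    }

    // True when two values are equal after folding.
    static boolean matches(String stored, String requested) {
        return fold(stored).equals(fold(requested));
    }
}
```

Letters without a canonical decomposition, such as Ø (U+00D8), pass through this folding unchanged apart from case, so "Øverhead" does not fold to "overhead". Covering those would need an explicit mapping on top of this sketch.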


## Actual Behavior

The parser rejects filter values containing accented characters with a generic syntax error, preventing any normalization logic from executing.


## Code Flow Analysis

The failure occurs **before** application code is reached:

1. ✅ Apache Directory SCIM library receives the HTTP POST containing the filter string
2. ❌ ANTLR parser attempts to tokenize: `"displayName eq \"José's Team\""`
3. ❌ Parser fails on accented characters → throws exception
4. ❌ Returns 400 Bad Request
5. ⛔ Application's `find(Filter filter, ...)` method **never called**
6. ⛔ Custom normalization logic **never executes**

**Evidence**: When the filter has no accents (`"Jose's Team"`), parsing succeeds and application-level normalization correctly matches groups with accented names.
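
To confirm where the failure originates, the filter strings can be fed to the parser directly, outside the HTTP and JSON layers, and the exact exception recorded. The harness below is parser-agnostic: `FilterParser` is a hypothetical hook, and wiring it to the library (for example by constructing `org.apache.directory.scim.spec.filter.Filter` from the string) is an assumption to verify against the version in use. The `main` method uses a stand-in that rejects non-ASCII input only to demonstrate the output format.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ParserProbe {
    // Hypothetical hook: implementations should throw if the filter cannot be parsed.
    interface FilterParser {
        void parse(String filter) throws Exception;
    }

    // Runs each filter through the parser and records "OK" or the exception type and message.
    static Map<String, String> probe(List<String> filters, FilterParser parser) {
        Map<String, String> outcomes = new LinkedHashMap<>();
        for (String filter : filters) {
            try {
                parser.parse(filter);
                outcomes.put(filter, "OK");
            } catch (Exception e) {
                outcomes.put(filter, e.getClass().getSimpleName() + ": " + e.getMessage());
            }
        }
        return outcomes;
    }

    public static void main(String[] args) {
        // Stand-in parser that rejects anything outside ASCII; replace with the real parser.
        FilterParser standIn = f -> {
            if (!f.chars().allMatch(c -> c < 0x80)) {
                throw new IllegalArgumentException("non-ASCII input");
            }
        };
        List<String> filters = List.of(
            "displayName eq \"Jose's Team\"",
            "displayName eq \"Jos\u00e9's Team\"");
        // Print only the outcomes to avoid console encoding issues with the filters themselves.
        probe(filters, standIn).forEach((filter, result) -> System.out.println(result));
    }
}
```

Recording the exception class and message this way would show whether the error is raised by the filter parser itself or by the JSON mapping of the search request, which the generic 400 response does not distinguish.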
