ES|QL: problems with TopN #126030

luigidellaquila · 2025-04-01T11:02:05Z

With the CSV dataset

from * 
| keep `city`, `city.name`, `card`, nanos, year, network.bytes_in, author.keyword, `birth_date`, `ip0`, `height.half_float`, `languages.short`, zip_code, network.cost, `status`, `ratings`, emp_no, host, `salary_change.long`, `street`, `client_cidr`, `city.country.continent.planet.name`, `languages.int`, height, `network.bytes_in`, author, `is_rehired`, contai*, *, version, height.double, `first_name`, `author`, contains, `millis`, bytes_out, network.total_cost, `country.keyword`, still_hir*, `within`, gender, `location`, fi*, sha*, `lk`, `kibana.alert.risk_score`, `last_name`, salary_change.long, `name`, ip*, `book_no`, `city.country.continent.name`, scalerank, `id`, hire_date, *set, message, `ip1`, height.scaled_float, `height`, language.name.keyword, client_cidr, `client_ip`, `decade`, languages.int, `bytes_in` 
| drop `language.name`, `city.name`, pod, `num`, birth_date, description, `host_group`, `lk`, `@timestamp`, type, `birth_date`, `height.half_float`, book_no, `smaller`, height_range, `status`, ratings, `street`, event, `city.country.continent.planet.name`, date_range, still_hired, `is_rehired`, `pod`, event_duration, env, height.scaled_float, `first_name`, `languages.long`, *uthor, `client.ip`, height.half_float, `height.double`, `author`, `height`, bytes_out, `still_hired`, `language.code`, `event`, location, abbrev, millis, region 
| enrich languages_policy on author.keyword 
| sort city.country.name NULLS LAST

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "expected exactly [16] bytes but got [4]"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "expected exactly [16] bytes but got [4]"
    },
    "status": 400
}

Same query, but without KEEP

from * 
| drop `language.name`, `city.name`, pod, `num`, birth_date, description, `host_group`, `lk`, `@timestamp`, type, `birth_date`, `height.half_float`, book_no, `smaller`, height_range, `status`, ratings, `street`, event, `city.country.continent.planet.name`, date_range, still_hired, `is_rehired`, `pod`, event_duration, env, height.scaled_float, `first_name`, `languages.long`, *uthor, `client.ip`, height.half_float, `height.double`, `author`, `height`, bytes_out, `still_hired`, `language.code`, `event`, location, abbrev, millis, region 
| enrich languages_policy on author.keyword 
| sort city.country.name NULLS LAST

{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Expected [BYTES_REF] but was [DOUBLE]"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Expected [BYTES_REF] but was [DOUBLE]"
    },
    "status": 400
}

The text was updated successfully, but these errors were encountered:

elasticsearchmachine · 2025-04-01T11:02:38Z

Pinging @elastic/es-analytical-engine (Team:Analytics)

luigidellaquila · 2025-04-01T11:17:32Z

Pulling main branch I don't seem to be able to reproduce the first problem anymore. Recent fix or slightly different data set?

The second one is still there

java.lang.IllegalArgumentException: Expected [BOOLEAN] but was [BYTES_REF]
        at org.elasticsearch.compute.operator.topn.ValueExtractor.extractorFor(ValueExtractor.java:30)
        at org.elasticsearch.compute.operator.topn.TopNOperator$RowFiller.<init>(TopNOperator.java:150)
        at org.elasticsearch.compute.operator.topn.TopNOperator.addInput(TopNOperator.java:371)
        at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:276)
        at org.elasticsearch.compute.operator.Driver.run(Driver.java:184)
        at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:403)
        at [email protected]/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at org.elasticsearch.compute.operator.DriverScheduler$1.doRun(DriverScheduler.java:57)
        at [email protected]/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at [email protected]/org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)
        at [email protected]/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1044)
        at [email protected]/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

alex-spies · 2025-04-01T15:35:17Z

There's a chance the first problem was quick fixed by #125636.

There might be layout issues again that might be solved by #125764. I'll check.

alex-spies · 2025-04-02T12:53:44Z

Pulling main branch I don't seem to be able to reproduce the first problem anymore. Recent fix or slightly different data set?

I just pulled main and it still repro's for me.

Neither errors occur with #125764. I'm going to include these as test cases in the PR.

alex-spies · 2025-04-02T14:54:13Z

Was able to condense/change this to

from *
| keep author.keyword, `book_no`, scalerank, `bytes_in`
| enrich languages_policy on author.keyword
| sort book_no
| limit 1

While condensing, I realized that the failure is not fully deterministic. Sometimes it fails with a 500 in TopN, sometimes with a 400 due to IAE from an unexpected data type. Sometimes it even passes.

For the query from above, I've only seen 400s.

alex-spies · 2025-04-02T15:05:25Z

I just confirmed that the problem was not fixed by, and pre-existed #125636.

Before that, running the query from above resulted in an NPE.

{"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null"}],"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null","suppressed":[{"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null"},{"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null"},{"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null"}]},"status":500}

@costin , this means we should backport #125764 as far back as possible, given that we had NPEs in simple ENRICH queries, even.

Thanks again for generating this test case @luigidellaquila , this is gold.

alex-spies · 2025-04-02T15:11:37Z

Was able to condense/change this to
from *
| keep author.keyword, `book_no`, scalerank, `bytes_in`
| enrich languages_policy on author.keyword
| sort book_no
| limit 1
While condensing, I realized that the failure is not fully deterministic. Sometimes it fails with a 500 in TopN, sometimes with a 400 due to IAE from an unexpected data type. Sometimes it even passes.

For the query from above, I've only seen 400s.

Bleh, that doesn't reproduce deterministically after restarting the cluster.

alex-spies · 2025-04-02T15:24:47Z

This seems to repro reliably:

from *
 | enrich languages_policy on author.keyword
 | sort book_no
 | limit 1

alex-spies · 2025-04-02T16:10:47Z

Ok, got something more managable that still reproduces the problem, like, 50% of the times:

from *
  | keep author.keyword, book_no, scalerank, street, bytes_in, @timestamp, abbrev, city_location, distance, description, birth_date, language_code, intersects, client_ip, event_duration, version
 | enrich languages_policy on author.keyword
 | sort book_no
 | limit 1

The trick is to include as many indices as possible which do not have the author.keyword field, but some other field. When the local optimizer encounters such indices but its current batch doesn't include the books index (no author.keyword), the following happens:

ENRICH triggers a field extraction because it needs author.keyword
ReplaceMissingFieldWithNull inserts an EVAL some time after the ENRICH because author.keyword is missing. This reuses the name id of the previously extracted author.keyword, causing a wrong layout. More specifically, the layout has 1 more column than it should have, shifting the expected position of other columns by 1.
Because there are some fields present in the batch, like street or so, a field extraction is inserted for these fields. Due to the wrong layout, they are not in the right position and there is a type mismatch if the neighboring block has a different data type.

alex-spies · 2025-04-11T13:33:26Z

I just checked 8.16.0 and 8.16.last and the query from above also results in a 500 caused by the known NPE: Cannot read field \"nameIds\" because \"set\" is null.

luigidellaquila added >bug :Analytics/ES|QL AKA ESQL labels Apr 1, 2025

elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 1, 2025

luigidellaquila changed the title ~~ES|QL: problems with TopN encoder~~ ES|QL: problems with TopN Apr 1, 2025

alex-spies self-assigned this Apr 1, 2025

alex-spies mentioned this issue Apr 2, 2025

ESQL: Fix ReplaceMissingFieldsWithNull #125764

Merged

alex-spies closed this as completed in #125764 Apr 3, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ES|QL: problems with TopN #126030

ES|QL: problems with TopN #126030

luigidellaquila commented Apr 1, 2025

elasticsearchmachine commented Apr 1, 2025

luigidellaquila commented Apr 1, 2025

alex-spies commented Apr 1, 2025

alex-spies commented Apr 2, 2025

alex-spies commented Apr 2, 2025 •

edited

Loading

alex-spies commented Apr 2, 2025

alex-spies commented Apr 2, 2025

alex-spies commented Apr 2, 2025

alex-spies commented Apr 2, 2025 •

edited

Loading

alex-spies commented Apr 11, 2025 •

edited

Loading

ES|QL: problems with TopN #126030

ES|QL: problems with TopN #126030

Comments

luigidellaquila commented Apr 1, 2025

elasticsearchmachine commented Apr 1, 2025

luigidellaquila commented Apr 1, 2025

alex-spies commented Apr 1, 2025

alex-spies commented Apr 2, 2025

alex-spies commented Apr 2, 2025 • edited Loading

alex-spies commented Apr 2, 2025

alex-spies commented Apr 2, 2025

alex-spies commented Apr 2, 2025

alex-spies commented Apr 2, 2025 • edited Loading

alex-spies commented Apr 11, 2025 • edited Loading

alex-spies commented Apr 2, 2025 •

edited

Loading

alex-spies commented Apr 2, 2025 •

edited

Loading

alex-spies commented Apr 11, 2025 •

edited

Loading