Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ES|QL: problems with TopN #126030

Closed
luigidellaquila opened this issue Apr 1, 2025 · 10 comments · Fixed by #125764
Closed

ES|QL: problems with TopN #126030

luigidellaquila opened this issue Apr 1, 2025 · 10 comments · Fixed by #125764
Assignees
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@luigidellaquila
Copy link
Contributor

With the CSV dataset

from * 
| keep `city`, `city.name`, `card`, nanos, year, network.bytes_in, author.keyword, `birth_date`, `ip0`, `height.half_float`, `languages.short`, zip_code, network.cost, `status`, `ratings`, emp_no, host, `salary_change.long`, `street`, `client_cidr`, `city.country.continent.planet.name`, `languages.int`, height, `network.bytes_in`, author, `is_rehired`, contai*, *, version, height.double, `first_name`, `author`, contains, `millis`, bytes_out, network.total_cost, `country.keyword`, still_hir*, `within`, gender, `location`, fi*, sha*, `lk`, `kibana.alert.risk_score`, `last_name`, salary_change.long, `name`, ip*, `book_no`, `city.country.continent.name`, scalerank, `id`, hire_date, *set, message, `ip1`, height.scaled_float, `height`, language.name.keyword, client_cidr, `client_ip`, `decade`, languages.int, `bytes_in` 
| drop `language.name`, `city.name`, pod, `num`, birth_date, description, `host_group`, `lk`, `@timestamp`, type, `birth_date`, `height.half_float`, book_no, `smaller`, height_range, `status`, ratings, `street`, event, `city.country.continent.planet.name`, date_range, still_hired, `is_rehired`, `pod`, event_duration, env, height.scaled_float, `first_name`, `languages.long`, *uthor, `client.ip`, height.half_float, `height.double`, `author`, `height`, bytes_out, `still_hired`, `language.code`, `event`, location, abbrev, millis, region 
| enrich languages_policy on author.keyword 
| sort city.country.name NULLS LAST
{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "expected exactly [16] bytes but got [4]"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "expected exactly [16] bytes but got [4]"
    },
    "status": 400
}

Same query, but without KEEP

from * 
| drop `language.name`, `city.name`, pod, `num`, birth_date, description, `host_group`, `lk`, `@timestamp`, type, `birth_date`, `height.half_float`, book_no, `smaller`, height_range, `status`, ratings, `street`, event, `city.country.continent.planet.name`, date_range, still_hired, `is_rehired`, `pod`, event_duration, env, height.scaled_float, `first_name`, `languages.long`, *uthor, `client.ip`, height.half_float, `height.double`, `author`, `height`, bytes_out, `still_hired`, `language.code`, `event`, location, abbrev, millis, region 
| enrich languages_policy on author.keyword 
| sort city.country.name NULLS LAST
{
    "error": {
        "root_cause": [
            {
                "type": "illegal_argument_exception",
                "reason": "Expected [BYTES_REF] but was [DOUBLE]"
            }
        ],
        "type": "illegal_argument_exception",
        "reason": "Expected [BYTES_REF] but was [DOUBLE]"
    },
    "status": 400
}
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-analytical-engine (Team:Analytics)

@elasticsearchmachine elasticsearchmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Apr 1, 2025
@luigidellaquila
Copy link
Contributor Author

Pulling main branch I don't seem to be able to reproduce the first problem anymore. Recent fix or slightly different data set?

The second one is still there

java.lang.IllegalArgumentException: Expected [BOOLEAN] but was [BYTES_REF]
        at org.elasticsearch.compute.operator.topn.ValueExtractor.extractorFor(ValueExtractor.java:30)
        at org.elasticsearch.compute.operator.topn.TopNOperator$RowFiller.<init>(TopNOperator.java:150)
        at org.elasticsearch.compute.operator.topn.TopNOperator.addInput(TopNOperator.java:371)
        at org.elasticsearch.compute.operator.Driver.runSingleLoopIteration(Driver.java:276)
        at org.elasticsearch.compute.operator.Driver.run(Driver.java:184)
        at org.elasticsearch.compute.operator.Driver$1.doRun(Driver.java:403)
        at [email protected]/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at org.elasticsearch.compute.operator.DriverScheduler$1.doRun(DriverScheduler.java:57)
        at [email protected]/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at [email protected]/org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:34)
        at [email protected]/org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:1044)
        at [email protected]/org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:27)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
        at java.base/java.lang.Thread.run(Thread.java:1583)

@luigidellaquila luigidellaquila changed the title ES|QL: problems with TopN encoder ES|QL: problems with TopN Apr 1, 2025
@alex-spies
Copy link
Contributor

There's a chance the first problem was quick fixed by #125636.

There might be layout issues again that might be solved by #125764. I'll check.

@alex-spies alex-spies self-assigned this Apr 1, 2025
@alex-spies
Copy link
Contributor

Pulling main branch I don't seem to be able to reproduce the first problem anymore. Recent fix or slightly different data set?

I just pulled main and it still repro's for me.

Neither errors occur with #125764. I'm going to include these as test cases in the PR.

@alex-spies
Copy link
Contributor

alex-spies commented Apr 2, 2025

Was able to condense/change this to

from *
| keep author.keyword, `book_no`, scalerank, `bytes_in`
| enrich languages_policy on author.keyword
| sort book_no
| limit 1

While condensing, I realized that the failure is not fully deterministic. Sometimes it fails with a 500 in TopN, sometimes with a 400 due to IAE from an unexpected data type. Sometimes it even passes.

For the query from above, I've only seen 400s.

@alex-spies
Copy link
Contributor

I just confirmed that the problem was not fixed by, and pre-existed #125636.

Before that, running the query from above resulted in an NPE.

{"error":{"root_cause":[{"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null"}],"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null","suppressed":[{"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null"},{"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null"},{"type":"null_pointer_exception","reason":"Cannot read field \"nameIds\" because \"set\" is null"}]},"status":500}

@costin , this means we should backport #125764 as far back as possible, given that we had NPEs in simple ENRICH queries, even.

Thanks again for generating this test case @luigidellaquila , this is gold.

@alex-spies
Copy link
Contributor

Was able to condense/change this to

from *
| keep author.keyword, `book_no`, scalerank, `bytes_in`
| enrich languages_policy on author.keyword
| sort book_no
| limit 1

While condensing, I realized that the failure is not fully deterministic. Sometimes it fails with a 500 in TopN, sometimes with a 400 due to IAE from an unexpected data type. Sometimes it even passes.

For the query from above, I've only seen 400s.

Bleh, that doesn't reproduce deterministically after restarting the cluster.

@alex-spies
Copy link
Contributor

This seems to repro reliably:

from *
 | enrich languages_policy on author.keyword
 | sort book_no
 | limit 1

@alex-spies
Copy link
Contributor

alex-spies commented Apr 2, 2025

Ok, got something more managable that still reproduces the problem, like, 50% of the times:

from *
  | keep author.keyword, book_no, scalerank, street, bytes_in, @timestamp, abbrev, city_location, distance, description, birth_date, language_code, intersects, client_ip, event_duration, version
 | enrich languages_policy on author.keyword
 | sort book_no
 | limit 1

The trick is to include as many indices as possible which do not have the author.keyword field, but some other field. When the local optimizer encounters such indices but its current batch doesn't include the books index (no author.keyword), the following happens:

  • ENRICH triggers a field extraction because it needs author.keyword
  • ReplaceMissingFieldWithNull inserts an EVAL some time after the ENRICH because author.keyword is missing. This reuses the name id of the previously extracted author.keyword, causing a wrong layout. More specifically, the layout has 1 more column than it should have, shifting the expected position of other columns by 1.
  • Because there are some fields present in the batch, like street or so, a field extraction is inserted for these fields. Due to the wrong layout, they are not in the right position and there is a type mismatch if the neighboring block has a different data type.

@alex-spies
Copy link
Contributor

alex-spies commented Apr 11, 2025

I just checked 8.16.0 and 8.16.last and the query from above also results in a 500 caused by the known NPE: Cannot read field \"nameIds\" because \"set\" is null.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Analytics/ES|QL AKA ESQL >bug Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants