-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Implementation for regex_instr #15928
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Thank you. I'm wondering what's the reference system for this function's behavior (like postgres or others) |
The reference system for this function's behaviour is postgres. |
Thank you for this PR @nirnayroy Can you please resolve the CI error: https://github.com/apache/datafusion/actions/runs/14820525339/job/41754009017?pr=15928
|
I ran the bash script, but I’m not sure if the workflow succeeded. |
fixed the cippy errors showing up in the workflow |
fixed formatting error in workflow |
@Omega359 I wonder if you might have time to review this PR? |
Of course @alamb, not sure how I missed this one. It may be a day or two though |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR!
I've left some comments - a few are just nits, so feel free to dismiss them or raise separate issues on top of this PR.
My main concern right now is that I’d expect this function to be a superset of strpos
, but it currently behaves differently in some edge cases. For example:
SELECT regexp_instr('😀abcdef', 'abc');
SELECT strpos('😀abcdef', 'abc');
SELECT strpos(NULL, 'abc');
SELECT regexp_instr(NULL, 'abc');
SELECT strpos('abc', NULL);
SELECT regexp_instr('abc', NULL);
Do you think we should unify them? 🙏
Hi @blaginin, thanks for the review and regret the delay in reply. I think I have rectified a majority of the concerns raised. Please have a look again. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds the regexp_instr
SQL function, including documentation, SQL logic tests, benchmarking, and integration into the function registry.
- Introduce
regexp_instr
in docs and user guide - Add
regexp_instr
SQL tests and benches - Expose regex compilation helpers and register the new UDF
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.
Show a summary per file
File | Description |
---|---|
docs/source/user-guide/sql/scalar_functions.md | Add docs for regexp_instr , including signature and examples |
datafusion/sqllogictest/test_files/regexp/regexp_instr.slt | Add SQL logic tests covering regexp_instr behavior |
datafusion/functions/src/regex/regexpcount.rs | Make compile_and_cache_regex and compile_regex public |
datafusion/functions/src/regex/mod.rs | Register regexp_instr UDF and define its expression builder |
datafusion/functions/benches/regx.rs | Add benchmark cases for regexp_instr |
Comments suppressed due to low confidence (4)
docs/source/user-guide/sql/scalar_functions.md:1844
- The
start
argument description is duplicated (- **start**: - **start**:
). Remove the extra label for clarity.
- **start**: - **start**: Optional start position (the first position is 1) to search for the regular expression. Can be a constant, column, or function. Defaults to 1
docs/source/user-guide/sql/scalar_functions.md:1837
- The function signature is missing the
subexpr
(andreturn_option
) parameters implemented in code. Update the signature to match the actual argument order.
regexp_instr(str, regexp[, start[, N[, flags]]])
datafusion/functions/src/regex/mod.rs:71
- [nitpick] The parameter name
endoption
is unclear. Consider renaming it toreturn_option
or another descriptive name consistent with SQL terminology.
endoption: Option<Expr>,
datafusion/sqllogictest/test_files/regexp/regexp_instr.slt:1
- There are no tests covering the
subexpr
argument. Add test cases that use thesubexpr
parameter to validate capture-group position behavior.
# Licensed to the Apache Software Foundation (ASF) under one
regexp_instr_func(&[ | ||
Arc::clone(&data), | ||
Arc::clone(®ex), | ||
Arc::clone(&start), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The benchmark passes flags
as the fourth argument, but according to the UDF signature it should be the fifth. Add the missing n
(occurrence) argument or adjust the argument order.
Arc::clone(&start), | |
Arc::clone(&start), | |
Arc::new(n(&mut rng)) as ArrayRef, |
Copilot uses AI. Check for mistakes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work, @nirnayroy!
The PR is much better now, but I think there's still room for improvement:
- Tests are failing. If that helps, you can run the CI command locally to debug:

- The code around handling literals and arrays feels a bit complex. I think arrow/df does this in a simpler and more concise way – I’ve attached some links for inspiration :)
I haven’t reviewed the tests yet since I assume things might change as you iterate. But I can see you're covering a lot of cases, which is great ☘️
enum ScalarOrArray<T> { | ||
Scalar(T), | ||
Array(Vec<T>), | ||
} | ||
|
||
impl<T: Clone> ScalarOrArray<T> { | ||
fn iter(&self, len: usize) -> Box<dyn Iterator<Item = T> + '_> { | ||
match self { | ||
ScalarOrArray::Scalar(val) => Box::new(std::iter::repeat_n(val.clone(), len)), | ||
ScalarOrArray::Array(arr) => Box::new(arr.iter().cloned()), | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this construction is needed - it makes the code much more harder to read and no other regex function uses it. I think you may be reinventing Datum
which is exactly "A possibly Scalar"
https://docs.rs/arrow/latest/arrow/array/trait.Datum.html
Also see ColumnarValue
- this may also fit your needs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Example where that's used: datafusion/functions/src/regex/regexpmatch.rs
} else if len == 1 { | ||
ScalarOrArray::Scalar(1) | ||
} else { | ||
ScalarOrArray::Array(vec![1; len]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this code may be too complex (and expensive). For example, if start_input
was passed is None, you can pass none along later and handle that nullable case (same for other params). This will save you all this big processing "defaults" here
let flags_input = if let Some(ref flags) = flags_array { | ||
if is_flags_scalar || flags.len() == 1 { | ||
ScalarOrArray::Scalar(flags.value(0)) | ||
} else { | ||
let flags_vec: Vec<&str> = flags.iter().map(|v| v.unwrap_or("")).collect(); | ||
ScalarOrArray::Array(flags_vec) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same as above, you don't need to set flags be "" - they can be null. Can you please check existing implementations for example? e.g. https://arrow.apache.org/rust/src/arrow_string/regexp.rs.html#356
.map(|(value, regex, start, nth, flags, subexp)| match regex { | ||
None => Ok(None), | ||
Some("") => Ok(None), | ||
Some(regex) => get_index( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure that's wrong but not sure i get your idea - isn't regex a mandatory argument? how can it be none? and why is this allowed?
let match_start_byte_offset = byte_start_offset + mat.start(); | ||
let match_start_char_offset = | ||
value[..match_start_byte_offset].chars().count() as i64 + 1; | ||
Ok(Some(match_start_char_offset)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit), if you want, this can be each small own function
Which issue does this PR close?
Rationale for this change
Implements a regex SQL standard function in datafusion
What changes are included in this PR?
Implementation, tests, benches and docs for the regexp_instr function
Are these changes tested?
Yes
Are there any user-facing changes?
Yes
No