Skip to content

Implementation for regex_instr #15928

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

nirnayroy
Copy link

Which issue does this PR close?

Rationale for this change

Implements a regex SQL standard function in datafusion

What changes are included in this PR?

Implementation, tests, benches and docs for the regexp_instr function

Are these changes tested?

Yes

Are there any user-facing changes?

Yes

No

@github-actions github-actions bot added documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt) functions Changes to functions implementation labels May 2, 2025
@2010YOUY01
Copy link
Contributor

Thank you. I'm wondering what's the reference system for this function's behavior (like postgres or others)

@nirnayroy
Copy link
Author

Thank you. I'm wondering what's the reference system for this function's behavior (like postgres or others)

The reference system for this function's behaviour is postgres.

@alamb
Copy link
Contributor

alamb commented May 7, 2025

Thank you for this PR @nirnayroy

Can you please resolve the CI error: https://github.com/apache/datafusion/actions/runs/14820525339/job/41754009017?pr=15928

If you encounter an error, run './dev/update_function_docs.sh' and commit

@nirnayroy
Copy link
Author

Thank you for this PR @nirnayroy

Can you please resolve the CI error: https://github.com/apache/datafusion/actions/runs/14820525339/job/41754009017?pr=15928

If you encounter an error, run './dev/update_function_docs.sh' and commit

I ran the bash script, but I’m not sure if the workflow succeeded.

@nirnayroy
Copy link
Author

fixed the cippy errors showing up in the workflow

@nirnayroy
Copy link
Author

fixed formatting error in workflow

@blaginin blaginin self-requested a review May 21, 2025 16:54
@alamb
Copy link
Contributor

alamb commented May 21, 2025

@Omega359 I wonder if you might have time to review this PR?

@Omega359
Copy link
Contributor

Of course @alamb, not sure how I missed this one. It may be a day or two though

Copy link
Contributor

@blaginin blaginin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR!

I've left some comments - a few are just nits, so feel free to dismiss them or raise separate issues on top of this PR.

My main concern right now is that I’d expect this function to be a superset of strpos, but it currently behaves differently in some edge cases. For example:

SELECT regexp_instr('😀abcdef', 'abc');
SELECT strpos('😀abcdef', 'abc');

SELECT strpos(NULL, 'abc');
SELECT regexp_instr(NULL, 'abc');

SELECT strpos('abc', NULL);
SELECT regexp_instr('abc', NULL);

Do you think we should unify them? 🙏

@nirnayroy
Copy link
Author

Hi @blaginin, thanks for the review and regret the delay in reply. I think I have rectified a majority of the concerns raised. Please have a look again.

@blaginin blaginin self-requested a review June 17, 2025 20:25
@blaginin blaginin requested a review from Copilot June 17, 2025 20:35
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds the regexp_instr SQL function, including documentation, SQL logic tests, benchmarking, and integration into the function registry.

  • Introduce regexp_instr in docs and user guide
  • Add regexp_instr SQL tests and benches
  • Expose regex compilation helpers and register the new UDF

Reviewed Changes

Copilot reviewed 6 out of 6 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
docs/source/user-guide/sql/scalar_functions.md Add docs for regexp_instr, including signature and examples
datafusion/sqllogictest/test_files/regexp/regexp_instr.slt Add SQL logic tests covering regexp_instr behavior
datafusion/functions/src/regex/regexpcount.rs Make compile_and_cache_regex and compile_regex public
datafusion/functions/src/regex/mod.rs Register regexp_instr UDF and define its expression builder
datafusion/functions/benches/regx.rs Add benchmark cases for regexp_instr
Comments suppressed due to low confidence (4)

docs/source/user-guide/sql/scalar_functions.md:1844

  • The start argument description is duplicated (- **start**: - **start**:). Remove the extra label for clarity.
- **start**: - **start**: Optional start position (the first position is 1) to search for the regular expression. Can be a constant, column, or function. Defaults to 1

docs/source/user-guide/sql/scalar_functions.md:1837

  • The function signature is missing the subexpr (and return_option) parameters implemented in code. Update the signature to match the actual argument order.
regexp_instr(str, regexp[, start[, N[, flags]]])

datafusion/functions/src/regex/mod.rs:71

  • [nitpick] The parameter name endoption is unclear. Consider renaming it to return_option or another descriptive name consistent with SQL terminology.
        endoption: Option<Expr>,

datafusion/sqllogictest/test_files/regexp/regexp_instr.slt:1

  • There are no tests covering the subexpr argument. Add test cases that use the subexpr parameter to validate capture-group position behavior.
# Licensed to the Apache Software Foundation (ASF) under one

regexp_instr_func(&[
Arc::clone(&data),
Arc::clone(&regex),
Arc::clone(&start),
Copy link
Preview

Copilot AI Jun 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The benchmark passes flags as the fourth argument, but according to the UDF signature it should be the fifth. Add the missing n (occurrence) argument or adjust the argument order.

Suggested change
Arc::clone(&start),
Arc::clone(&start),
Arc::new(n(&mut rng)) as ArrayRef,

Copilot uses AI. Check for mistakes.

Copy link
Contributor

@blaginin blaginin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work, @nirnayroy!

The PR is much better now, but I think there's still room for improvement:

  • Tests are failing. If that helps, you can run the CI command locally to debug:
image
  • The code around handling literals and arrays feels a bit complex. I think arrow/df does this in a simpler and more concise way – I’ve attached some links for inspiration :)

I haven’t reviewed the tests yet since I assume things might change as you iterate. But I can see you're covering a lot of cases, which is great ☘️

Comment on lines +315 to +326
enum ScalarOrArray<T> {
Scalar(T),
Array(Vec<T>),
}

impl<T: Clone> ScalarOrArray<T> {
fn iter(&self, len: usize) -> Box<dyn Iterator<Item = T> + '_> {
match self {
ScalarOrArray::Scalar(val) => Box::new(std::iter::repeat_n(val.clone(), len)),
ScalarOrArray::Array(arr) => Box::new(arr.iter().cloned()),
}
}
Copy link
Contributor

@blaginin blaginin Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this construction is needed - it makes the code much more harder to read and no other regex function uses it. I think you may be reinventing Datum which is exactly "A possibly Scalar"

https://docs.rs/arrow/latest/arrow/array/trait.Datum.html

Also see ColumnarValue - this may also fit your needs

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Example where that's used: datafusion/functions/src/regex/regexpmatch.rs

} else if len == 1 {
ScalarOrArray::Scalar(1)
} else {
ScalarOrArray::Array(vec![1; len])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this code may be too complex (and expensive). For example, if start_input was passed is None, you can pass none along later and handle that nullable case (same for other params). This will save you all this big processing "defaults" here

Comment on lines +388 to +393
let flags_input = if let Some(ref flags) = flags_array {
if is_flags_scalar || flags.len() == 1 {
ScalarOrArray::Scalar(flags.value(0))
} else {
let flags_vec: Vec<&str> = flags.iter().map(|v| v.unwrap_or("")).collect();
ScalarOrArray::Array(flags_vec)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as above, you don't need to set flags be "" - they can be null. Can you please check existing implementations for example? e.g. https://arrow.apache.org/rust/src/arrow_string/regexp.rs.html#356

Comment on lines +436 to +439
.map(|(value, regex, start, nth, flags, subexp)| match regex {
None => Ok(None),
Some("") => Ok(None),
Some(regex) => get_index(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure that's wrong but not sure i get your idea - isn't regex a mandatory argument? how can it be none? and why is this allowed?

Comment on lines +522 to +525
let match_start_byte_offset = byte_start_offset + mat.start();
let match_start_char_offset =
value[..match_start_byte_offset].chars().count() as i64 + 1;
Ok(Some(match_start_char_offset))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit), if you want, this can be each small own function

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation functions Changes to functions implementation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add regexp function - regexp_instr()
5 participants