Conversation

TeodorDjelic
Contributor

What changes were proposed in this pull request?

This PR adds a new golden file test that contains 100 randomly generated SQL scripts. The scripts were generated using an LLM (Perplexity AI), and the 100 were hand-picked from the generated output. The selection criteria were:

  • The script has to run successfully
  • The script's runtime has to be under 30 seconds
  • The set as a whole has to be sufficiently diverse

The following prompt was used:

*ROLE: YOU ARE A RANDOM SCRIPT GENERATOR*

*TASK*

Hey, can you generate me a dataset of a couple of tables with many columns, and 300 SQL Scripts working on that dataset. You can learn about what the sql scripts are from this starting point, and then going through links inside that sql-ref-scripting link [https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-scripting](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-scripting). Study it thoroughly, and go through all the links in sql-ref-scripting, this step is very important! Study all these links:
https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-scripting
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/case-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/compound-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/for-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/get-diagnostics-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/if-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/iterate-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/leave-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/loop-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/repeat-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/resignal-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/signal-stmt
https://docs.databricks.com/aws/en/sql/language-manual/control-flow/while-stmt


*HERE ARE DIRECTIVES FOR DATA GENERATION:*

- A table cannot reference itself
- Don't use COMMENT
- Don't use UNIQUE
- Insert some data into the tables after creating them
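A table definition consistent with these directives might look like the following (table and column names are hypothetical, not taken from the actual generated dataset):

```sql
-- No COMMENT or UNIQUE clauses, no self-reference;
-- data is inserted immediately after creation.
CREATE TABLE products (
  product_id INT,
  name STRING,
  price DECIMAL(10, 2)
);

INSERT INTO products VALUES
  (1, 'widget', 9.99),
  (2, 'gadget', 19.50);
```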


*HERE ARE DIRECTIVES FOR SCRIPTS GENERATION:*

- Make SQL scripts as random and as exhaustive as you can
- Make scripts bigger and the queries inside them bigger, even if they don't make sense
- Scripts must follow the guidelines and rules from sql-ref-scripting
- Make each compound statement have at least 3 and at most 10 structures/statements
- Make scripts have nested compound statements up to a depth of 6, and at the very least a depth of 2
- Try to make many scopes; don't just go into depth with narrow but deep structures, but also make them wide. Don't just go to depth 2 and stay there, going down the depths and coming back to depth 2; rather, go through all the depths (for example, don't just enter a WHILE inside an IF and stay inside that scope for the entire script, but escape it and then open another WHILE, FOR, CASE, or something similar)
- Add one or sometimes more exception handlers to the script
- You don't even have to think hard about it, just make a huge randomly generated SQL script based on the dataset
- Scripts do not have to make sense, just make them as big, as random, and as extensive as possible
- Print all the SQL scripts in a file
- Do not repeat structural compositions (avoid repeating the same control-flow structures); each script should be unique in how the structures (FOR, WHILE, IF, etc.) are combined
- Each script has to be wrapped in a BEGIN ... END block, so it's a valid script
- Don't make infinite loops!
- Do not make a script that runs slower than 30 seconds
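As a rough sketch of the shape these directives ask for, a script with sibling scopes at the same depth (wide, rather than one narrow deep chain) and an exit handler might look like this (a hypothetical example based on the Databricks SQL scripting syntax linked above, not one of the generated scripts):

```sql
BEGIN
  DECLARE v_total INT DEFAULT 0;
  DECLARE EXIT HANDLER FOR SQLEXCEPTION
    BEGIN
      SELECT 'handled';
    END;
  -- first scope: a WHILE inside a labeled compound, escaped with LEAVE
  lbl_while: BEGIN
    DECLARE v_i INT DEFAULT 0;
    WHILE v_i < 3 DO
      SET v_i = v_i + 1;
      IF v_i = 2 THEN
        LEAVE lbl_while;
      END IF;
    END WHILE;
  END lbl_while;
  -- second sibling scope at the same depth, keeping the structure wide
  lbl_for: BEGIN
    FOR r AS SELECT 1 AS n DO
      SET v_total = v_total + r.n;
    END FOR;
  END lbl_for;
  SELECT v_total;
END
```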



*WATCH OUT FOR THE FOLLOWING PITFALLS:*

- SELECT ... INTO table is not valid SQL; don't use INTO, SELECT INTO is not supported!!!!!!!
- [UNSUPPORTED_FEATURE.CONTINUE_EXCEPTION_HANDLER] The feature is not supported: CONTINUE exception handler is not supported. Use EXIT handler. SQLSTATE: 0A000
- [INVALID_VARIABLE_DECLARATION.ONLY_AT_BEGINNING] Invalid variable declaration. Variable `region_name` can only be declared at the beginning of the compound. SQLSTATE: 42K0M. Variables have to be declared before the handler declarations of the same scope. (Also note: DECLARE ... CURSOR FOR is not valid in SQL scripting)
- Don't create or replace a procedure, just make it a standalone BEGIN ... END script
- RESIGNAL in DECLARE EXIT HANDLER FOR SQLEXCEPTION RESIGNAL; is not valid
- CURSOR FOR ... is not valid syntax in SQL scripting!
- Avoid this mistake: GET DIAGNOSTICS CONDITION 1 v_sqlstate = RETURNED_SQLSTATE, v_message_text = MESSAGE_TEXT, v_line_number = DB2RETURNED_SQLCODE; is wrong; the correct form is GET DIAGNOSTICS CONDITION 1 v_sqlstate = RETURNED_SQLSTATE, v_message_text = MESSAGE_TEXT, v_line_number = LINE_NUMBER;!
- LEAVE has to be followed by a label referencing the scope it is leaving, and LEAVE has to be inside that scope; otherwise: [[INVALID_LABEL_USAGE.DOES_NOT_EXIST](https://docs.databricks.com/error-messages/invalid-label-usage-error-class.html#does_not_exist)] The usage of the label SCRIPT_BLOCK is invalid. Label was used in the LEAVE statement, but the label does not belong to any surrounding block. SQLSTATE: 42K0L
- The label name goes in front of BEGIN, separated by a colon. Good: L1: BEGIN. Bad: BEGIN L1
- Inside FOR loops, it's not IN but AS
- The top-level scope (first BEGIN) must not have a label
- FOR i AS INT IN (SELECT product_id FROM products LIMIT 10) is not valid because INT IN ... is not valid ([[PARSE_SYNTAX_ERROR](https://docs.databricks.com/error-messages/error-classes.html#parse_syntax_error)] Syntax error at or near 'INT'. SQLSTATE: 42601)
- You cannot put LIMIT on an UPDATE statement
- FOR v_id AS SELECT product_id, price FROM products WHERE price < 200 LOOP BEGIN is invalid; it should be DO instead of LOOP
- [[UNSUPPORTED_SUBQUERY_EXPRESSION_CATEGORY.SCALAR_SUBQUERY_IN_VALUES](https://docs.databricks.com/error-messages/unsupported-subquery-expression-category-error-class.html#scalar_subquery_in_values)] Unsupported subquery expression: Scalar subqueries in the VALUES clause. SQLSTATE: 0A000 - not supported
- INSERT INTO customer (customer_id, total_spent) VALUES (99999, v_total) ON CONFLICT (customer_id) DO UPDATE SET total_spent = v_total; breaks with [[PARSE_SYNTAX_ERROR](https://docs.databricks.com/error-messages/error-classes.html#parse_syntax_error)] Syntax error at or near 'ON': missing ';'. SQLSTATE: 42601; don't do it
- Don't use RETURN; - it's not supported
- DECLARE CONTINUE CONDITION FOR '02000'; is invalid; you have to write DECLARE CONTINUE CONDITION FOR SQLSTATE '02000'; because it expects a SQLSTATE code, not a string literal (nothing to do with CONTINUE handlers, which aren't supported)
- ALTER TABLE product ADD COLUMN IF NOT EXISTS is not valid syntax; IF NOT EXISTS cannot be there
- [[INVALID_HANDLER_DECLARATION.WRONG_PLACE_OF_DECLARATION](https://docs.databricks.com/error-messages/invalid-handler-declaration-error-class.html#wrong_place_of_declaration)] Invalid handler declaration. Handlers must be declared after variable/condition declarations, and before other statements. SQLSTATE: 42K0Q
- [[TEMP_TABLE_CREATION_LEGACY_WITH_QUERY](https://docs.databricks.com/error-messages/error-classes.html#temp_table_creation_legacy_with_query)] CREATE TEMPORARY TABLE ... AS ... is not supported here, please use CREATE TEMPORARY VIEW instead. SQLSTATE: 0A000
- DO NOT USE "SELECT [something] INTO ..." SYNTAX, IT IS PROHIBITED BY LAW AND MANDATE!!
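Put together, a fragment that sidesteps these pitfalls might look like the following (a hypothetical sketch; the identifiers are illustrative and the syntax follows the scripting docs linked above):

```sql
BEGIN                                       -- top-level compound carries no label
  DECLARE v_sqlstate STRING;
  DECLARE v_message_text STRING;
  DECLARE v_line_number BIGINT;
  DECLARE EXIT HANDLER FOR SQLEXCEPTION     -- EXIT handler; CONTINUE is unsupported
    BEGIN
      GET DIAGNOSTICS CONDITION 1
        v_sqlstate     = RETURNED_SQLSTATE,
        v_message_text = MESSAGE_TEXT,
        v_line_number  = LINE_NUMBER;       -- LINE_NUMBER, not DB2RETURNED_SQLCODE
      SELECT v_sqlstate, v_message_text, v_line_number;
    END;
  lbl_loop: BEGIN                           -- label before BEGIN, joined by a colon
    FOR r AS SELECT product_id FROM products LIMIT 10 DO  -- AS ... DO, not IN/LOOP
      IF r.product_id = 1 THEN
        LEAVE lbl_loop;                     -- LEAVE names an enclosing label
      END IF;
    END FOR;
  END lbl_loop;
END
```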

*GENERATE AND IMMEDIATELY OUTPUT:*
- One file containing the SQL CREATE TABLE statements for the dataset
- One file containing 1000 different, big, randomly-structured standalone Databricks SQL scripts, making extensive use of procedural scripting, control flow, DML, DDL, error handlers, and deep nesting. Scripts do not have to make sense, just be large and syntactically valid, and strictly follow all guidelines from the [Databricks SQL scripting docs](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-scripting).

*OUTPUT:*
- Because you cannot write out all 300 files in one response, every time I send a request with the prompt "NEXT", give me the next queries, and if there are no more queries to be sent, return "DONE"
- After every SQL script, write "— TAGS:" and then insert some tags that classify the type of script or what it's testing, etc. (so that the scripts could be grouped by some parameter)
- After every SQL script, write what the expected output of the script is, and what should be executed
- Send only the SQL code contents of both files, wrapped in code blocks; no extra text or explanation

Why are the changes needed?

These tests will be used to catch regression errors inside scripting.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

It was manually tested using Databricks notebooks, and via inspection of the golden file generation output files.

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Perplexity AI

@github-actions github-actions bot added the SQL label Sep 9, 2025
@cloud-fan
Contributor

how did we verify the result? is there a reference system?

@TeodorDjelic TeodorDjelic changed the title [SPARK-48338][Core] Adding a Golden File Test With Randomly Generated SQL Scripts [SPARK-53536][Core] Adding a Golden File Test With Randomly Generated SQL Scripts Sep 10, 2025
@TeodorDjelic
Contributor Author

how did we verify the result? is there a reference system?

Every script was hand-picked and manually run using local Spark. Behavior and script code coverage were not tested, but every script does have its logical flow described under the script itself.

@cloud-fan
Contributor

So you manually verified the results?

@TeodorDjelic
Contributor Author

TeodorDjelic commented Sep 11, 2025

So you manually verified the results?

Yes, by manually running the scripts locally.

@cloud-fan
Contributor

I'm a bit worried about the golden answer. How did you know whether the result is correct or not when you verified it? By analyzing the script manually and coming up with the result? I would be more relieved if it's verified by an LLM...

@TeodorDjelic
Contributor Author

I'm a bit worried about the golden answer. How did you know whether the result is correct or not when you verified it? By analyzing the script manually and coming up with the result? I would be more relieved if it's verified by an LLM...

The scripts output by the LLM were not edited manually at all; they were only run to check for parsing errors. Semantics and logical flow were not thoroughly analyzed; the scripts were reviewed only for diversity, i.e. the combinations of control-flow blocks (IFs, FOR loops, WHILE loops, etc.). The final results of the scripts were not thoroughly analyzed/verified, and all the tags, "expected", and "executes" comments were generated by the LLM.

@cloud-fan
Contributor

Can we go a bit further and verify the results with a reference system like pgsql? It's good to have more tests, but without verifying the test results, the tests do not prove anything.
