@rgraber rgraber commented Oct 28, 2025

🗒️ Checklist

  1. run linter locally
  2. update developer docs (API, README, inline, etc.), if any
  3. for user-facing doc changes create a Zulip thread at #Support Docs Updates, if any
  4. draft PR with a title <type>(<scope>)<!>: <title> DEV-1234
  5. assign yourself, tag PR: at least Front end and/or Back end or workflow
  6. fill in the template below and delete template comments
  7. review thyself: read the diff and repro the preview as written
  8. open PR & confirm that CI passes & request reviewers, if needed
  9. delete this section before merging

💭 Notes

Fill out the method for converting old SubmissionExtra content dicts to the new format expected by SubmissionSupplemental for translations and transcripts. Qualitative analysis answers will be handled separately.
This code makes numerous assumptions to fill in information that is not present in the old structure but required in the new:

  1. If old[xpath]['transcript']['value'] == old[xpath]['googlets']['value'] and the language codes are the same, we assume the most recent transcript was automatically generated.
  2. If, for any revision in old[xpath]['transcript']['revisions'], revision['value'] is the same as old[xpath]['googlets']['value'] and the language codes match, we assume that revision was automatically generated. If multiple revisions match, we assume they were all automatically generated. This should be rare, but it is possible.
  3. Items 1-2 also apply to translations.
  4. old[xpath]['transcript']['dateModified'] is assumed to be the creation date of the most recent revision (i.e. whatever is in old[xpath]['transcript']['value']). The same goes for translations.
  5. All uuids are newly generated
  6. All old transcriptions/translations have status=complete with a _dateAccepted of now() (whenever the code is running)
  7. To determine the dependency of any old translation, whether automated or manual:
     • If we know the source language, look for the most recent transcript in that language that was created before the translation.
     • If there is none, take the most recent transcription in that language
     • If there are no transcriptions in the source language, take the most recent transcript
     • If we don't know the source language, take the most recent transcript that was created before the translation
     • If there is none, take the most recent transcript
  8. We can ignore any badly formatted revisions/transcripts/translations
  9. Most recent revisions will be first in the version array
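
The matching rule for automatic entries and the dependency lookup for translations could be sketched roughly as follows. This is only an illustration of the assumptions listed above, not the code in this PR: the function names, the 'languageCode' key, and the 'created' key (a datetime) are hypothetical.

```python
from datetime import datetime


def looks_automatic(entry: dict, service_result: dict) -> bool:
    """An entry counts as automatic when both its text and its language
    code equal what the automatic service returned.
    'languageCode' is an assumed key name."""
    return (
        entry.get('value') is not None
        and entry.get('value') == service_result.get('value')
        and entry.get('languageCode') == service_result.get('languageCode')
    )


def pick_dependency(transcripts, translation_created, source_language=None):
    """Choose the transcript that a translation depends on.

    `transcripts` is a list of dicts with the hypothetical keys
    'languageCode' and 'created'. Returns None if the list is empty.
    """
    def most_recent(candidates):
        return max(candidates, key=lambda t: t['created'], default=None)

    pool = transcripts
    if source_language is not None:
        same_language = [
            t for t in transcripts if t['languageCode'] == source_language
        ]
        if not same_language:
            # No transcript in the source language: take the most recent one.
            return most_recent(transcripts)
        pool = same_language

    # Prefer a transcript created before the translation.
    earlier = [t for t in pool if t['created'] < translation_created]
    if earlier:
        return most_recent(earlier)
    # Otherwise fall back to the most recent transcript in the pool.
    return most_recent(pool)
```

Each fallback returns as soon as it applies, so the order of the checks mirrors the order of the bullets under item 7.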

👀 Preview steps

On main:

  1. ℹ️ have an account and a project with an audio question
  2. Enable NLP
  3. Add a submission to the project
  4. Generate an automatic transcript in English
  5. Generate an automatic translation in Spanish
  6. Manually edit the English transcript and save
  7. Manually edit the Spanish translation and save

On PR branch:
  8. Clone the asset
  9. PATCH the asset with

{
  "_version": "20250820",
  "_actionConfigs": {
    "<question xpath>": {
      "manual_transcription": [{"language": "en"}],
      "automated_google_transcription": [{"language": "en"}],
      "manual_translation": [{"language": "es"}],
      "automated_google_translation": [{"language": "es"}]
    }
  }
}
  10. In a Python shell, run
from kobo.apps.subsequences.schemas import validate_submission_supplement
from kobo.apps.subsequences.utils.versioning import migrate_submission_supplementals
se = SubmissionExtras.objects.get(submission_uuid=<submission uuid>)
validate_submission_supplement(<cloned_asset>, migrate_submission_supplementals(se.content))
  11. 🟢 The validation should pass

@rgraber rgraber self-assigned this Oct 29, 2025
@rgraber rgraber marked this pull request as ready for review October 29, 2025 13:50
@rgraber rgraber requested a review from Guitlle as a code owner October 29, 2025 13:50
@rgraber rgraber requested review from jnm and noliveleger and removed request for Guitlle October 29, 2025 16:47
@noliveleger noliveleger removed the request for review from jnm October 30, 2025 17:06


def migrate_submission_supplementals(supplemental_data: dict) -> dict | None:
    if supplemental_data.get('_version', None) == SCHEMA_VERSIONS[0]:

Nit: .get() already defaults to None, so you can drop the second argument.
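
A minimal sketch of the simplified check, assuming a stand-in SCHEMA_VERSIONS list (the real list lives in the PR's versioning module and may differ) and a hypothetical helper name:

```python
# Hypothetical stand-in for the real SCHEMA_VERSIONS list.
SCHEMA_VERSIONS = ['20250820']


def is_current_version(supplemental_data: dict) -> bool:
    # dict.get() returns None when the key is absent, so no explicit
    # default argument is needed.
    return supplemental_data.get('_version') == SCHEMA_VERSIONS[0]
```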

new_version = {
    '_version': '20250820',
    'Audio_question': {
        'automatic_transcription': {

That's not the name of the action. It should be automatic_google_transcription. We are only using Google for NLP at the moment. The logic works great, but it should match the current action IDs.

Moreover, we need to rename every automated_* and Automated* to their (A|a)utomatic counterparts.

}
]
},
'automatic_translation': {

Same comment as above.

@noliveleger

Nit: the payload in step 9 ("PATCH the asset with") is badly formatted; trailing commas are missing inside the "<question xpath>" dictionary.
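
One quick way to catch this kind of formatting problem before sending a PATCH is to run the payload through Python's json module, which rejects both a missing comma between entries and a stray comma after the last one. The helper name and the sample payloads below are illustrative only.

```python
import json


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as strict JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


well_formed = '''
{
  "<question xpath>": {
    "manual_transcription": [{"language": "en"}],
    "manual_translation": [{"language": "es"}]
  }
}
'''

# Same payload with the comma between the two entries removed.
missing_comma = well_formed.replace('],', ']')

# Same payload with a stray comma after the last entry.
stray_comma = well_formed.replace('}]\n  }', '}],\n  }')
```

Calling is_valid_json() on each string shows which payloads a strict parser would accept.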

noliveleger commented Oct 30, 2025

validate_submission_supplement(<cloned_asset>, migrate_submission_supplementals(se.content)) exposes a bug: with the current code the validation should not pass, because automatic_transcription and automatic_translation should be rejected.

I think there’s an issue in get_submission_supplement_schema(asset: 'kpi.models.Asset') (unrelated to this PR though).
We don't set any validation at the question level (see jnm's comment below for the current schema).

We should probably have a schema that looks something like this (my question is named audio).

"type": "object",
  "additionalProperties": false,
  "required": ["_version", "audio"],
  "properties": {
    "_version": { "const": "20250820" },
    "audio": {
      "type": "object",
      "additionalProperties": false,
      "properties": {
        "manual_transcription": { "$ref": "#/$defs/manualTranscription" },
        "automatic_google_transcription": { "$ref": "#/$defs/autoGoogleTranscription" },
        "manual_translation": { "$ref": "#/$defs/manualTranslation" },
        "automatic_google_translation": { "$ref": "#/$defs/autoGoogleTranslation" }
      },
      "required": [
        "manual_transcription",
        "automatic_google_transcription",
        "manual_translation",
        "automatic_google_translation"
      ]
    }
  },

So everything nested inside the question_name dictionary is ignored/bypassed.

jnm commented Oct 30, 2025

So everything nested inside the question_name dictionary is ignored/bypassed.

Right 🤦 what we are doing now is NOT putting the action IDs inside properties. They're just hanging out at the top level of the object for each question, not doing anything:

{
  "additionalProperties": false,
  "properties": {
    "_version": {
      "const": "20250820"
    },
    "audio": {
      "additionalProperties": false,
      "properties": {},
      "type": "object",
      "automated_google_transcription": {
       

Unfortunately, correcting this causes other problems because the $defs paths are wrong.
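
The misplacement described here can be spotted with a small check like the one below. It is a rough sketch, not part of the PR: the helper name is hypothetical, and the empty dict stands in for the action's sub-schema, which is truncated in the snippet above.

```python
def misplaced_action_ids(question_schema: dict, action_ids: list[str]) -> list[str]:
    """Return action IDs that sit beside 'properties' instead of inside it."""
    declared = question_schema.get('properties', {})
    return [
        action_id
        for action_id in action_ids
        if action_id in question_schema and action_id not in declared
    ]


# Shape of the current schema: the action ID is a sibling of 'properties'.
current = {
    'additionalProperties': False,
    'properties': {},
    'type': 'object',
    'automated_google_transcription': {},  # placeholder sub-schema
}

# Shape after nesting the action ID under 'properties'.
nested = {
    'additionalProperties': False,
    'properties': {'automated_google_transcription': {}},
    'type': 'object',
}
```

A JSON Schema validator ignores keywords it does not recognize, which is why the action IDs in the first shape have no effect on validation.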

@rgraber rgraber requested a review from jnm as a code owner November 4, 2025 13:45