Skip to content

Conversation

@DoffeBupt
Copy link

Dear authors of Starcoder, while using the framework, I noticed that mask_user_labels sometimes does not function properly. Upon investigation, I found that there might be an issue with the function invocation. Here are my modifications:

  1. In the group_texts() function in train.py, the result["input_ids"] has a structure of lists nested within lists. However, mask_user_labels deals with a single-layered list structure. Therefore, mask_user_labels did not function as expected. I used a loop to process each label separately, allowing the function to be called normally.
  2. In the mask_user_labels function, I added masking for system-related labels. Moreover, the condition current_idx < len(labels) in the while loop should be placed at the beginning; otherwise, this condition becomes meaningless and can lead to out-of-bounds access with labels[current_idx].

@wanglongxingtianxia
Copy link

def group_texts(examples):
        # Concatenate all texts.
        # print(type(examples))
        concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split by chunks of max_len.
        labels = concatenated_examples["input_ids"].copy()
        mask_user_labels(tokenizer, dialogue_template, labels)
        concatenated_examples['labels'] = labels
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        return result

Can we change this?

@DoffeBupt
Copy link
Author

def group_texts(examples):
        # Concatenate all texts.
        # print(type(examples))
        concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
        total_length = len(concatenated_examples[list(examples.keys())[0]])
        # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
        # customize this part to your needs.
        if total_length >= block_size:
            total_length = (total_length // block_size) * block_size
        # Split by chunks of max_len.
        labels = concatenated_examples["input_ids"].copy()
        mask_user_labels(tokenizer, dialogue_template, labels)
        concatenated_examples['labels'] = labels
        result = {
            k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
            for k, t in concatenated_examples.items()
        }
        return result

Can we change this?

change
mask_user_labels(tokenizer, dialogue_template, labels)
to
for label in labels:
mask_user_labels(tokenizer, dialogue_template, label)

I think this should make sense

@wanglongxingtianxia
Copy link

concatenated_examples['input_ids'] is one dimensional list

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants