-
Notifications
You must be signed in to change notification settings - Fork 499
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add Phi4 #2197
base: main
Are you sure you want to change the base?
Add Phi4 #2197
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2197
Note: Links to docs will display an error until the docs builds have been completed. ✅ No FailuresAs of commit bdf478f with merge base aa8f365 (): This comment was automatically generated by Dr. CI and updates every 15 minutes. |
Should we wait until Phi-4 will be on HF? |
We require on changes in only tokenizer actually, as we already used full Attention in Phi3 without sliding one. |
I assume that I need to do run with Phi4 |
@joecummings It seems to me that you haven't gone to holidays) Maybe you can give me some comments about this PR? |
Haha yes I'm still (somewhat) here. I asked and it looks like the Phi4 team is planning on fixing some license issues with Hugging Face and should have the model on the Hub soon. So eventually the true test will be to grab the official model from Hugging Face and do a forward pass; however, if you want to iron out any potential discrepancies right away, I'd just grab one of the unofficial uploads like this for your testing. Happy holidays to you @krammnic - been a pleasure working with you on torchtune this year! |
@joecummings Thanks for the comments!) Will do some runs with this then |
Hi @krammnic just checking in on this PR. I saw the model is on Hugging Face (as of yesterday I believe). Have you done a parity check with their model? And is this ready for review? If so let me know and we can take a look |
Hi, will run tests today and will ping you when it will be ready for review! |
# Config for EleutherEvalRecipe in eleuther_eval.py | ||
# | ||
# To launch, run the following command: | ||
# tune run eleuther_eval --config phi3/evaluation |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/phi3/phi4
# Checkpointer | ||
checkpointer: | ||
_component_: torchtune.training.FullModelHFCheckpointer | ||
checkpoint_dir: /tmp/Phi-3-mini-4k-instruct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/Phi-3/Phi-4
] | ||
recipe_checkpoint: null | ||
output_dir: ${output_dir} | ||
model_type: PHI3_MINI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/PHI3_MINI/PHI4_MINI
# Tokenizer | ||
tokenizer: | ||
_component_: torchtune.models.phi3.phi3_mini_tokenizer | ||
path: /tmp/Phi-3-mini-4k-instruct/tokenizer.model |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/torchtune.models.phi3.phi3_mini_tokenizer/torchtune.models.phi4.phi4_mini_tokenizer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/Phi-3/Phi-4
recipes/configs/phi4/mini_full.yaml
Outdated
@@ -0,0 +1,105 @@ | |||
# Config for multi-device full finetuning in full_finetune_distributed.py | |||
# using a Phi3 Mini 4K Instruct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
/Phi3/Phi4
@@ -0,0 +1,106 @@ | |||
# Config for single device full finetuning in full_finetune_single_device.py | |||
# using a Phi3 Mini 4K Instruct |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
search all Phi3 and replace with Phi4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the comments! I'm still fixing naming actually. Will ping you when it will be ready!
Probably forward is now working (I get OOM cause my cards a busy with some experiments). There is pretty weird point that I had to set |
No, I can't do forward both for For 20 I get:
|
Hardcoding like this fixes the issue:
Probably we should revise formulas especially for phi4. |
Nit: For all configs should change tokenizer field |
Getting |
For
Already here:
Already have a problem here as it will be not 2560 in all cases, but 5120, 1280, 1280. Assume that we "hardcoded" in a way that I have shown earlier. But then we get same problem here:
Error: Part of config.json for reference:
So, the product should be twice less. Am I missing something? (I hope I have not miscalculated). Something weird is behind this problem. Will try to work out it asap. @ebsmothers I'm not really sure if it fixable without touching phi3 model or creating separate model for phi4. |
Oh, and also I assume that we first of all need to speak about this... #2212 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just left some comments, not a review yet. Will come back to it.
recipes/configs/phi3/evaluation.yaml
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
folder is phi3, but args are phi4
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, good point
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that you made a copy from phi3, but made the changes in phi3/evaluation, instead of here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think these two eval files need to be swapped
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n00b question: Is "mini" the right nomenclature? Or do they have a family of model sizes like phi4_7b, phi4_13B, etc?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty arguing moment, in the description of Phi4 it is "mini model" in real life it is not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should drop the mini and just stick to model sizes, since its more informative. @ebsmothers , any thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah seems like they are mostly using model sizes instead of "mini" in public docs, so maybe let's go with 14B instead of mini?
] | ||
recipe_checkpoint: null | ||
output_dir: ${output_dir} | ||
model_type: PHI3_MINI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n00b question: Are there are differences between PHI3 and PHI4? Even if there arent, should we update the model_type for clarity? I believe that this is used in the checkpointer to map the HF format to torchtune format.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to tech report there is difference in tokenizer and in attention in such way that it is not touching us. But some observations that I made upper might get us to different conclusion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am tempted to say that even if PHI3_MINI == PHI4_MINI, every model should have its own nomenclature, so there is less cognitive load for the user. @ebsmothers , what do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I would stick with the precedent we've set, which is to only use a new model type when the arch changes. This is what we do for the Llama family, where we have LLAMA3
, LLAMA3_2
, but not LLAMA3_1
or LLAMA3_3
. I do agree with your point though @felipemello1 -- we can consider the renaming in a follow-up (at that time I would also probably drop the MINI
from Phi model names too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that this was already the naming convention for Phi3, but we should probably add "single_device" to the config name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phi3 uses low_memory
. Personally I would like to change full_low_memory
-> full_single_device
across the board, but again would prioritize consistency with Phi3 in this PR.
# You can add specific overrides through the command line. For example | ||
# to override the checkpointer directory while launching training | ||
# you can run: | ||
# tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config phi4/mini_lora checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can probably remove --nnodes 1
. We dont usually add it to distributed configs. Same goes to the other dist configs
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good catch!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are working on providing better support for multi node. @joecummings , I am thinking that if you add some documentation about how to configure it (+ good error logging if we dont have have), i would be happy to bulk change every config to add --nnodes 1
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this PR @krammnic. Left some questions and minor bug fixes, e.g. phi3 -> phi4.
I personally don't feel comfortable approving this PR without a minimal forward pass comparison vs HF.
Ideally, we should be running evals to see if it matches HF: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=phi-4
Here is the script I used when i was checking llama 3.2. I don't think its ideal either because it doesn't test all of the tokenizer special tokens: https://gist.github.com/felipemello1/55ec8cdcf625b42c1542813c3f2ebf65
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should drop the mini and just stick to model sizes, since its more informative. @ebsmothers , any thoughts?
] | ||
recipe_checkpoint: null | ||
output_dir: ${output_dir} | ||
model_type: PHI3_MINI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i am tempted to say that even if PHI3_MINI == PHI4_MINI, every model should have its own nomenclature, so there is less cognitive load for the user. @ebsmothers , what do you think?
# You can add specific overrides through the command line. For example | ||
# to override the checkpointer directory while launching training | ||
# you can run: | ||
# tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config phi4/mini_lora checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we are working on providing better support for multi node. @joecummings , I am thinking that if you add some documentation about how to configure it (+ good error logging if we dont have have), i would be happy to bulk change every config to add --nnodes 1
self.eos_id = self.special_tokens["<|endoftext|>"] | ||
self.bos_id = self.special_tokens["<|endoftext|>"] | ||
self.pad_id = self.special_tokens["<|endoftext|>"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to Daniel Hans, this is not correct: https://x.com/danielhanchen/status/1877781452818968615. Do you mind taking a look?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also seems weird that begin of sentence (bos) would be "endoftext"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is straight from the HF repo, and is also what the underlying tokenizer class (GPT2Tokenizer) uses as defaults. We would need some confirmation from the Phi team that these are incorrect
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So maybe we apply all Unsloth fixes at this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is pretty comprehensive report about all this "incorrect" things with fixes
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the changes will be approved and merged. Daniel and the Microsoft team are discussing the fixes here: https://huggingface.co/microsoft/phi-4/discussions/21.
Go ahead and do the fixes
# m.model is a pretrained Sentencepiece model using the following command: | ||
# spm.SentencePieceTrainer.train('--input=<TRAIN_FILE> --model_prefix=m --vocab_size=2000') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that this comment is wrong, since it uses tiktoken
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, definitely will change all examples like this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah you can copy this comment pointing to the script on how the toy tiktoken tokenizer was trained
|
||
>>> # tokenize_messages encodes messages separately and concats | ||
>>> tokenizer.tokenize_messages(messages)[0] | ||
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this example still accurate for phi4?
if message.role == "user": | ||
tokenized_messages.append(self.special_tokens["<|im_start|>"]) | ||
encoded = self.encode( | ||
"system", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
n00b question: is it supposed to be "system" for all of them?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
typo :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see anywhere where phi4 is referred as phi4-mini. On HF it seems to be only called Phi4. My preference would be to match the commonly accepted name and drop the mini, so we don't confuse folks with Phi4 vs Phi4 Mini.
Would also like to see detailed testing in the PR summary. Specifically, on:
- Comparing tokenizer and model forward against HF implementation
- Loss curves
- Running generation and eval on the model to ensure it gets reasonable outputs
|
||
self.prompt_template = prompt_template | ||
|
||
self.tt_model = TikTokenBaseTokenizer( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GPT2Tokenizer is probably closer to TikToken than SentencePiece (someone who's more knowledgeable can correct me), but I'm not sure if this will create the correct token map. You would need to test against the HF version of the tokenizer on the same text.
This is probably highlighting our issue with converting from HF tokenizers as mentioned in #2212. We'll need to think of a better long-term solution here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sure!
mask.append(message.masked) | ||
|
||
# Add special tokens | ||
if message.role == "user": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would make this a separate method _tokenize_header
and then pass in the role into self.encode
to avoid all the if else statements
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, sure!
Returns: | ||
TransformerDecoder: Instantiation of Phi4 Mini 16K Instruct Model | ||
""" | ||
return phi3( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so there's no architectural difference between phi4 and phi3?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No real difference! See technical report. (We do not support sliding attention)
Only point about architecture is this. And this is not align with tech report actually |
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Felipe Mello <[email protected]>
Co-authored-by: Felipe Mello <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I'm excited to get this landed into the library. A couple comments:
-
The configs don't currently run. IIUC Phi4 is using grouped query attention, which Phi3 does not. So it's possible that the weight mapping function needs to be updated.
-
We should run tests to confirm forward parity with a known implementation (probably the one on HF). E.g. you can check out this file comparing our Llama2 implementation to the one from the original meta-llama repo. @joecummings may already have some scripts from his parity checks for Phi3 that you would be able to reuse here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I think these two eval files need to be swapped
|
||
# Checkpointer | ||
checkpointer: | ||
_component_: torchtune.training.FullModelHFCheckpointer | ||
checkpoint_dir: /tmp/Phi-3-mini-4k-instruct | ||
checkpoint_dir: /tmp/phi-4 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Make sure this matches the format of other directories
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah seems like they are mostly using model sizes instead of "mini" in public docs, so maybe let's go with 14B instead of mini?
] | ||
recipe_checkpoint: null | ||
output_dir: ${output_dir} | ||
model_type: PHI3_MINI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now I would stick with the precedent we've set, which is to only use a new model type when the arch changes. This is what we do for the Llama family, where we have LLAMA3
, LLAMA3_2
, but not LLAMA3_1
or LLAMA3_3
. I do agree with your point though @felipemello1 -- we can consider the renaming in a follow-up (at that time I would also probably drop the MINI
from Phi model names too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phi3 uses low_memory
. Personally I would like to change full_low_memory
-> full_single_device
across the board, but again would prioritize consistency with Phi3 in this PR.
# m.model is a pretrained Sentencepiece model using the following command: | ||
# spm.SentencePieceTrainer.train('--input=<TRAIN_FILE> --model_prefix=m --vocab_size=2000') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah you can copy this comment pointing to the script on how the toy tiktoken tokenizer was trained
return phi3( | ||
vocab_size=100_352, | ||
num_layers=40, | ||
num_heads=20, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think num_heads should be 40 based on the HF config? (And same comment for the LoRA builder)
prepend/append tags. | ||
|
||
Returns: | ||
Phi4MiniTikTokenTokenizer: Instantiation of the SPM tokenizer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Phi4MiniTikTokenTokenizer: Instantiation of the SPM tokenizer. | |
Phi4MiniTikTokenTokenizer: Instantiation of the tiktoken tokenizer. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ebsmothers I will fix this points ASAP(actually I have already fixed them locally) But, unfortunately our bottleneck will be related to tokenizer.
Hey @krammnic, sorry I completely missed this comment of yours previously. Did you get this figured out? If not I think we need to make some updates like this to the Phi3 convert weights function to support loading of models with GQA. I think you will also need to update num_heads from 20 to 40 like I suggested previously, but at least after these changes you are able to load the checkpoint. We should also make sure that the forward matches what's on HF -- it's possible that we may need to permute the fused QKV projection instead of naively splitting it as done in my changes. I've put together a minimal script showing roughly how you can do this here. The numbers do not line up, so it needs further investigation whether my torch.split is incorrect or whether I am just passing incorrect arguments to the HF version in that gist. Separately, you mentioned in another comment there are some tokenizer issues blocking you. Can you elaborate on that? I'm happy to take a look here as well. |
Context
What is the purpose of this PR? Is it to
Please link to any issues this PR addresses.
Changelog
What are the changes made in this PR?
Test plan
Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.
pre-commit install
)pytest tests
pytest tests -m integration_test
UX
If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example