Add Phi4 #2197

Open
wants to merge 7 commits into
base: main

Conversation

krammnic
Contributor

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Changelog

What are the changes made in this PR?

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any one of these just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example
and a tutorial example

  • I did not change any public API
  • I have added an example to docs or docstrings

pytorch-bot bot commented Dec 21, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2197

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit bdf478f with merge base aa8f365 (image):
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Dec 21, 2024
@krammnic
Contributor Author

Should we wait until Phi-4 is on HF?

@krammnic
Contributor Author

We actually only need changes in the tokenizer, since Phi3 already uses full attention rather than sliding-window attention.

@krammnic krammnic changed the title from "[WIP] Add Phi4" to "Add Phi4" Dec 24, 2024
@krammnic
Contributor Author

I assume I need to do a run with Phi4.

@krammnic
Contributor Author

@joecummings It seems you haven't left for the holidays yet :) Maybe you can give me some comments on this PR?

@joecummings
Contributor

@joecummings It seems you haven't left for the holidays yet :) Maybe you can give me some comments on this PR?

Haha yes I'm still (somewhat) here. I asked and it looks like the Phi4 team is planning on fixing some license issues with Hugging Face and should have the model on the Hub soon. So eventually the true test will be to grab the official model from Hugging Face and do a forward pass; however, if you want to iron out any potential discrepancies right away, I'd just grab one of the unofficial uploads like this for your testing.

Happy holidays to you @krammnic - been a pleasure working with you on torchtune this year!

@krammnic
Contributor Author

@joecummings Thanks for the comments! I'll do some runs with this then.

@ebsmothers
Contributor

Hi @krammnic just checking in on this PR. I saw the model is on Hugging Face (as of yesterday I believe). Have you done a parity check with their model? And is this ready for review? If so let me know and we can take a look

@krammnic
Contributor Author

Hi @krammnic just checking in on this PR. I saw the model is on Hugging Face (as of yesterday I believe). Have you done a parity check with their model? And is this ready for review? If so let me know and we can take a look

Hi, I will run tests today and ping you when it is ready for review!

# Config for EleutherEvalRecipe in eleuther_eval.py
#
# To launch, run the following command:
# tune run eleuther_eval --config phi3/evaluation
Contributor

/phi3/phi4

# Checkpointer
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Phi-3-mini-4k-instruct
Contributor

/Phi-3/Phi-4

]
recipe_checkpoint: null
output_dir: ${output_dir}
model_type: PHI3_MINI
Contributor

/PHI3_MINI/PHI4_MINI

# Tokenizer
tokenizer:
_component_: torchtune.models.phi3.phi3_mini_tokenizer
path: /tmp/Phi-3-mini-4k-instruct/tokenizer.model
Contributor

/torchtune.models.phi3.phi3_mini_tokenizer/torchtune.models.phi4.phi4_mini_tokenizer

Contributor

/Phi-3/Phi-4

@@ -0,0 +1,105 @@
# Config for multi-device full finetuning in full_finetune_distributed.py
# using a Phi3 Mini 4K Instruct
Contributor

/Phi3/Phi4

@@ -0,0 +1,106 @@
# Config for single device full finetuning in full_finetune_single_device.py
# using a Phi3 Mini 4K Instruct
Contributor

search all Phi3 and replace with Phi4

Contributor Author

Thank you for the comments! I'm still fixing the naming, actually. I'll ping you when it's ready!

@krammnic
Contributor Author

The forward pass is probably working now (I get OOM because my cards are busy with other experiments). One weird point: I had to set num_heads = 20, which is half the real num_heads (I assume this is a torchtune quirk?). Also, there is some inconsistency in naming. The official description is "Phi-4 small language model", but we probably can't name it "small".

@krammnic
Contributor Author

No, I can't do a forward pass with either num_heads = 20 or num_heads = 40.

For 20 I get:

        size mismatch for layers.39._checkpoint_wrapped_module.attn.q_proj.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([5120, 5120]).

For 40 (original value) I get:

        size mismatch for layers.39._checkpoint_wrapped_module.attn.k_proj.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).
        size mismatch for layers.39._checkpoint_wrapped_module.attn.v_proj.weight: copying a param with shape torch.Size([2560, 5120]) from checkpoint, the shape in current model is torch.Size([1280, 5120]).

The same issue occurs for each layer. I took the params directly from config.json. Am I missing something?

@krammnic
Contributor Author

Hardcoding like this fixes the issue:

 q_proj=nn.Linear(embed_dim, 2560, bias=False),
 k_proj=nn.Linear(embed_dim, 2560, bias=False),
 v_proj=nn.Linear(embed_dim, 2560, bias=False),

We should probably revise the formulas specifically for phi4.

@krammnic
Contributor Author

Nit: the tokenizer field should be changed in all configs.

@krammnic
Contributor Author

Getting RuntimeError: shape '[2, 308, 40, 128]' is invalid for input of size 1576960. Probably for the same reason.

@krammnic
Contributor Author

krammnic commented Jan 11, 2025

For num_heads=40, num_kv_heads=10, embed_dim=5120. Let's calculate:

head_dim = 5120 / 40 = 128

Already here:

 q_proj=nn.Linear(embed_dim, num_heads * head_dim, bias=False),
 k_proj=nn.Linear(embed_dim, num_kv_heads * head_dim, bias=False),
 v_proj=nn.Linear(embed_dim, num_kv_heads * head_dim, bias=False),

This is already a problem, because the output sizes will not be 2560 in all three cases but 5120, 1280, and 1280. Suppose we hardcode them the way I showed earlier; then we hit the same problem here:

 q_per_kv = self.num_heads // self.num_kv_heads
 q = q.view(b, s_x, self.num_kv_heads * q_per_kv, self.head_dim)

Error:
RuntimeError: shape '[2, 308, 40, 128]' is invalid for input of size 1576960

Part of config.json for reference:

  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 17920,
  "max_position_embeddings": 16384,
  "model_type": "phi3",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 10,
  "original_max_position_embeddings": 16384,

So the product should be half as large. Am I missing something? (I hope I haven't miscalculated.) Something odd is behind this problem; I'll try to work it out ASAP. @ebsmothers I'm not sure it's fixable without touching the phi3 model or creating a separate model for phi4.
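
The arithmetic in this comment can be checked with a short script. This is only a sketch, not code from the PR: it uses the values from the config.json excerpt and the error message, and it assumes (not confirmed in this comment) that the checkpoint stores one fused QKV weight that was split into three equal parts.

```python
# Values copied from the config.json excerpt above.
hidden_size = 5120
num_attention_heads = 40
num_key_value_heads = 10

head_dim = hidden_size // num_attention_heads   # 128
q_dim = num_attention_heads * head_dim          # 5120
kv_dim = num_key_value_heads * head_dim         # 1280

# Assumption: the checkpoint holds a fused QKV weight with q_dim + 2 * kv_dim rows.
fused_rows = q_dim + 2 * kv_dim                 # 7680

# Splitting those rows into three equal parts yields the 2560 from the error.
equal_third = fused_rows // 3                   # 2560

# The failing view needs batch * seq_len * num_heads * head_dim elements.
batch, seq_len = 2, 308
expected_elements = batch * seq_len * q_dim
actual_elements = batch * seq_len * equal_third

print(head_dim, q_dim, kv_dim, fused_rows, equal_third)
print(expected_elements, actual_elements)
```

Under that assumption, 2560 is one third of the fused row count, and 2 * 308 * 2560 equals the 1576960 elements reported in the RuntimeError, which is half of what the view into 40 heads of size 128 requires.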

@krammnic
Contributor Author

Oh, and I think we first need to discuss this: #2212

Contributor

@felipemello1 felipemello1 left a comment

just left some comments, not a review yet. Will come back to it.

Contributor

folder is phi3, but args are phi4

Contributor Author

Yep, good point

Contributor

It seems that you made a copy from phi3, but made the changes in phi3/evaluation, instead of here

Contributor

Yeah I think these two eval files need to be swapped

Contributor

n00b question: Is "mini" the right nomenclature? Or do they have a family of model sizes like phi4_7b, phi4_13B, etc?

Contributor Author

That's debatable: the Phi4 description calls it a "mini model", but in practice it is not.

Contributor

I wonder if we should drop the mini and just stick to model sizes, since it's more informative. @ebsmothers, any thoughts?

Contributor

Yeah seems like they are mostly using model sizes instead of "mini" in public docs, so maybe let's go with 14B instead of mini?

]
recipe_checkpoint: null
output_dir: ${output_dir}
model_type: PHI3_MINI
Contributor

n00b question: Are there any differences between PHI3 and PHI4? Even if there aren't, should we update the model_type for clarity? I believe that this is used in the checkpointer to map the HF format to torchtune format.

Contributor Author

According to the tech report, the differences are in the tokenizer and in attention, in a way that does not affect us. But some of the observations I made above might lead us to a different conclusion.

Contributor

I am tempted to say that even if PHI3_MINI == PHI4_MINI, every model should have its own nomenclature, so there is less cognitive load for the user. @ebsmothers, what do you think?

Contributor

For now I would stick with the precedent we've set, which is to only use a new model type when the arch changes. This is what we do for the Llama family, where we have LLAMA3, LLAMA3_2, but not LLAMA3_1 or LLAMA3_3. I do agree with your point though @felipemello1 -- we can consider the renaming in a follow-up (at that time I would also probably drop the MINI from Phi model names too)

Contributor

I think that this was already the naming convention for Phi3, but we should probably add "single_device" to the config name.

Contributor

Phi3 uses low_memory. Personally I would like to change full_low_memory -> full_single_device across the board, but again would prioritize consistency with Phi3 in this PR.

# You can add specific overrides through the command line. For example
# to override the checkpointer directory while launching training
# you can run:
# tune run --nnodes 1 --nproc_per_node 2 lora_finetune_distributed --config phi4/mini_lora checkpointer.checkpoint_dir=<YOUR_CHECKPOINT_DIR>
Contributor

We can probably remove --nnodes 1. We don't usually add it to distributed configs. The same goes for the other distributed configs.

Contributor Author

Good catch!

Contributor

We are working on providing better support for multi-node. @joecummings, I am thinking that if you add some documentation about how to configure it (+ good error logging if we don't have it), I would be happy to bulk change every config to add --nnodes 1

Contributor

@felipemello1 felipemello1 left a comment

Thanks for this PR @krammnic. Left some questions and minor bug fixes, e.g. phi3 -> phi4.

I personally don't feel comfortable approving this PR without a minimal forward pass comparison vs HF.

Ideally, we should be running evals to see if it matches HF: https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard#/?search=phi-4

Here is the script I used when I was checking Llama 3.2. I don't think it's ideal either, because it doesn't test all of the tokenizer's special tokens: https://gist.github.com/felipemello1/55ec8cdcf625b42c1542813c3f2ebf65

Comment on lines +80 to +82
self.eos_id = self.special_tokens["<|endoftext|>"]
self.bos_id = self.special_tokens["<|endoftext|>"]
self.pad_id = self.special_tokens["<|endoftext|>"]
Contributor

According to Daniel Han, this is not correct: https://x.com/danielhanchen/status/1877781452818968615. Do you mind taking a look?

Contributor

It also seems weird that beginning of sentence (bos) would be "endoftext"

Contributor

@RdoubleA RdoubleA Jan 14, 2025

This is straight from the HF repo, and is also what the underlying tokenizer class (GPT2Tokenizer) uses as defaults. We would need some confirmation from the Phi team that these are incorrect

Contributor Author

@krammnic krammnic Jan 14, 2025

So maybe we apply all Unsloth fixes at this point?

Contributor Author

There is a pretty comprehensive report about all of these "incorrect" things, with fixes.

Contributor

I think the changes will be approved and merged. Daniel and the Microsoft team are discussing the fixes here: https://huggingface.co/microsoft/phi-4/discussions/21.

Go ahead and do the fixes

Comment on lines +17 to +18
# m.model is a pretrained Sentencepiece model using the following command:
# spm.SentencePieceTrainer.train('--input=<TRAIN_FILE> --model_prefix=m --vocab_size=2000')
Contributor

I believe that this comment is wrong, since it uses tiktoken

Contributor Author

Yes, definitely will change all examples like this

Contributor

Yeah you can copy this comment pointing to the script on how the toy tiktoken tokenizer was trained

torchtune/models/phi4/_tokenizer.py

>>> # tokenize_messages encodes messages separately and concats
>>> tokenizer.tokenize_messages(messages)[0]
[1, 1788, 2643, 13, 1792, 9508, 13, 465, 22137, 2933, 2]
Contributor

is this example still accurate for phi4?

if message.role == "user":
tokenized_messages.append(self.special_tokens["<|im_start|>"])
encoded = self.encode(
"system",
Contributor

n00b question: is it supposed to be "system" for all of them?

Contributor Author

typo :)

Contributor

@RdoubleA RdoubleA left a comment

I don't see anywhere that phi4 is referred to as phi4-mini. On HF it seems to be called only Phi4. My preference would be to match the commonly accepted name and drop the mini, so we don't confuse folks with Phi4 vs Phi4 Mini.

Would also like to see detailed testing in the PR summary. Specifically, on:

  • Comparing tokenizer and model forward against HF implementation
  • Loss curves
  • Running generation and eval on the model to ensure it gets reasonable outputs


self.prompt_template = prompt_template

self.tt_model = TikTokenBaseTokenizer(
Contributor

GPT2Tokenizer is probably closer to TikToken than SentencePiece (someone who's more knowledgeable can correct me), but I'm not sure if this will create the correct token map. You would need to test against the HF version of the tokenizer on the same text.

This is probably highlighting our issue with converting from HF tokenizers as mentioned in #2212. We'll need to think of a better long-term solution here.
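
One way to run the suggested check against the HF tokenizer on the same text is to compare token IDs and report the first position where they differ. The sketch below is an illustration only: `first_token_mismatch` and the two toy encoders are hypothetical names, and in practice the two callables would wrap the torchtune tokenizer and the HF tokenizer.

```python
from typing import Callable, List, Optional


def first_token_mismatch(
    text: str,
    encode_a: Callable[[str], List[int]],
    encode_b: Callable[[str], List[int]],
) -> Optional[int]:
    """Return the index of the first differing token ID, or None if both match."""
    ids_a, ids_b = encode_a(text), encode_b(text)
    for i, (a, b) in enumerate(zip(ids_a, ids_b)):
        if a != b:
            return i
    if len(ids_a) != len(ids_b):
        # One sequence is a strict prefix of the other.
        return min(len(ids_a), len(ids_b))
    return None


# Toy stand-ins for the two tokenizers (hypothetical, for illustration only).
def toy_byte_encode(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def toy_codepoint_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


print(first_token_mismatch("hello", toy_byte_encode, toy_codepoint_encode))
print(first_token_mismatch("héllo", toy_byte_encode, toy_codepoint_encode))
```

Running this over a mix of plain text, non-ASCII text, and chat-formatted messages with special tokens would exercise more of the token map than a single English sentence.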

Contributor Author

Ok, sure!

mask.append(message.masked)

# Add special tokens
if message.role == "user":
Contributor

I would make this a separate method _tokenize_header and then pass in the role into self.encode to avoid all the if else statements
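
A minimal sketch of that refactor, assuming the header is just the `<|im_start|>` special token followed by the encoded role name, as in the snippet under review. The function is written standalone so it can run on its own; the token ID and the toy `encode` are made up, and any further header tokens the real tokenizer needs are left out.

```python
from typing import Callable, Dict, List


def tokenize_header(
    role: str,
    special_tokens: Dict[str, int],
    encode: Callable[[str], List[int]],
) -> List[int]:
    """Build the message header: start-of-message token, then the encoded role."""
    return [special_tokens["<|im_start|>"]] + encode(role)


# Hypothetical values for illustration; real IDs come from the tokenizer files.
toy_special_tokens = {"<|im_start|>": 1000}


def toy_encode(text: str) -> List[int]:
    return [ord(ch) for ch in text]


for role in ("system", "user", "assistant"):
    print(role, tokenize_header(role, toy_special_tokens, toy_encode))
```

Passing the role through as a parameter removes the per-role branches, which also makes it harder to hardcode the wrong role string in one branch.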

Contributor Author

Ok, sure!

Returns:
TransformerDecoder: Instantiation of Phi4 Mini 16K Instruct Model
"""
return phi3(
Contributor

so there's no architectural difference between phi4 and phi3?

Contributor Author

No real difference! See technical report. (We do not support sliding attention)

@krammnic
Contributor Author

For num_heads=40, num_kv_heads=10, embed_dim=5120. Let's calculate:

head_dim = 5120 / 40 = 128

Already here:

 q_proj=nn.Linear(embed_dim, num_heads * head_dim, bias=False),
 k_proj=nn.Linear(embed_dim, num_kv_heads * head_dim, bias=False),
 v_proj=nn.Linear(embed_dim, num_kv_heads * head_dim, bias=False),

This is already a problem, because the output sizes will not be 2560 in all three cases but 5120, 1280, and 1280. Suppose we hardcode them the way I showed earlier; then we hit the same problem here:

 q_per_kv = self.num_heads // self.num_kv_heads
 q = q.view(b, s_x, self.num_kv_heads * q_per_kv, self.head_dim)

Error: RuntimeError: shape '[2, 308, 40, 128]' is invalid for input of size 1576960

Part of config.json for reference:

  "hidden_size": 5120,
  "initializer_range": 0.02,
  "intermediate_size": 17920,
  "max_position_embeddings": 16384,
  "model_type": "phi3",
  "num_attention_heads": 40,
  "num_hidden_layers": 40,
  "num_key_value_heads": 10,
  "original_max_position_embeddings": 16384,

So the product should be half as large. Am I missing something? (I hope I haven't miscalculated.) Something odd is behind this problem; I'll try to work it out ASAP. @ebsmothers I'm not sure it's fixable without touching the phi3 model or creating a separate model for phi4.

This is the only point about the architecture. And it actually does not align with the tech report.

Contributor

@ebsmothers ebsmothers left a comment

Thanks for the PR! I'm excited to get this landed into the library. A couple comments:

  1. The configs don't currently run. IIUC Phi4 is using grouped query attention, which Phi3 does not. So it's possible that the weight mapping function needs to be updated.

  2. We should run tests to confirm forward parity with a known implementation (probably the one on HF). E.g. you can check out this file comparing our Llama2 implementation to the one from the original meta-llama repo. @joecummings may already have some scripts from his parity checks for Phi3 that you would be able to reuse here.
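
The core of such a parity test is comparing two sets of outputs within a tolerance. Below is a dependency-free sketch with hypothetical helper names and a placeholder tolerance; the outputs are flat lists of floats here, whereas a real check would compare logits from the torchtune and HF models on the same token IDs.

```python
from typing import Sequence


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two equal-length outputs."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max(abs(x - y) for x, y in zip(a, b))


def outputs_match(a: Sequence[float], b: Sequence[float], atol: float = 1e-5) -> bool:
    """True when every element differs by at most atol (placeholder tolerance)."""
    return max_abs_diff(a, b) <= atol


# Toy values standing in for flattened logits from the two implementations.
reference = [0.1, 0.2, 0.3]
candidate = [0.1, 0.2000001, 0.3]
print(max_abs_diff(reference, candidate), outputs_match(reference, candidate))
```

With PyTorch tensors, `torch.testing.assert_close` does a similar comparison and reports the mismatching elements; the right tolerance depends on dtype and would need to be chosen for this model.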


# Checkpointer
checkpointer:
_component_: torchtune.training.FullModelHFCheckpointer
checkpoint_dir: /tmp/Phi-3-mini-4k-instruct
checkpoint_dir: /tmp/phi-4
Contributor

Make sure this matches the format of other directories

return phi3(
vocab_size=100_352,
num_layers=40,
num_heads=20,
Contributor

@ebsmothers ebsmothers Jan 17, 2025

I think num_heads should be 40 based on the HF config? (And same comment for the LoRA builder)

prepend/append tags.

Returns:
Phi4MiniTikTokenTokenizer: Instantiation of the SPM tokenizer.
Contributor

Suggested change
Phi4MiniTikTokenTokenizer: Instantiation of the SPM tokenizer.
Phi4MiniTikTokenTokenizer: Instantiation of the tiktoken tokenizer.

Contributor Author

@ebsmothers I will fix these points ASAP (I have actually already fixed them locally). Unfortunately, our bottleneck will be the tokenizer.

@RdoubleA RdoubleA mentioned this pull request Jan 21, 2025
@ebsmothers
Contributor

Hey @krammnic, sorry I completely missed this comment of yours previously. Did you get this figured out? If not I think we need to make some updates like this to the Phi3 convert weights function to support loading of models with GQA. I think you will also need to update num_heads from 20 to 40 like I suggested previously, but at least after these changes you are able to load the checkpoint. We should also make sure that the forward matches what's on HF -- it's possible that we may need to permute the fused QKV projection instead of naively splitting it as done in my changes. I've put together a minimal script showing roughly how you can do this here. The numbers do not line up, so it needs further investigation whether my torch.split is incorrect or whether I am just passing incorrect arguments to the HF version in that gist.
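
To make the "naive split" concrete, here is a sketch of the row ranges for a fused QKV weight under GQA. It assumes the fused rows are laid out contiguously as [Q; K; V], which is exactly the point the comment above flags as unverified; the function name is hypothetical and the sizes come from the Phi-4 config quoted earlier in the thread.

```python
from typing import Tuple


def fused_qkv_row_ranges(
    num_heads: int, num_kv_heads: int, head_dim: int
) -> Tuple[slice, slice, slice]:
    """Row slices for Q, K, and V, assuming a contiguous [Q; K; V] layout."""
    q_dim = num_heads * head_dim
    kv_dim = num_kv_heads * head_dim
    q_rows = slice(0, q_dim)
    k_rows = slice(q_dim, q_dim + kv_dim)
    v_rows = slice(q_dim + kv_dim, q_dim + 2 * kv_dim)
    return q_rows, k_rows, v_rows


q_rows, k_rows, v_rows = fused_qkv_row_ranges(num_heads=40, num_kv_heads=10, head_dim=128)

# A list of row indices stands in for the 7680 rows of the fused weight.
fused_row_indices = list(range(7680))
sizes = [len(fused_row_indices[rows]) for rows in (q_rows, k_rows, v_rows)]
print(sizes)
```

On a tensor, the same unequal split can be written as `torch.split(weight, [q_dim, kv_dim, kv_dim], dim=0)`. If the forward pass still disagrees with HF after this split, the layout assumption is the first thing to revisit.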

Separately, you mentioned in another comment there are some tokenizer issues blocking you. Can you elaborate on that? I'm happy to take a look here as well.

7 participants