def forward(self, idx, targets=None):
    # idx is of shape (B, T)
    B, T = idx.size()
    assert T <= self.config.block_size, f"Cannot forward sequence of length {T}, block size is only {self.config.block_size}"
    # forward the token and position embeddings
    pos = torch.arange(0, T, dtype=torch.long, device=idx.device) # shape (T)
    pos_emb = self.transformer.wpe(pos) # position embeddings of shape (T, n_embd)
    tok_emb = self.transformer.wte(idx) # token embeddings of shape (B, T, n_embd)
    x = tok_emb + pos_emb
    # forward the blocks of the transformer
    for block in self.transformer.h:
        x = block(x)
    # forward the final layernorm and the classifier
    x = self.transformer.ln_f(x)
    logits = self.lm_head(x) # (B, T, vocab_size)
    loss = None
    if targets is not None:
        # flatten logits to (B*T, vocab_size) and targets to (B*T,) before cross_entropy
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))
    return logits, loss
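This is the forward function of a GPT-2 model. It takes a batch of token indices (the input, idx) and, optionally, targets. Both are matrices of shape (B, T), where B is the batch size and T is the sequence length. After the embedding lookup the input becomes a tensor of shape (B, T, n_embd), where n_embd is the embedding dimension. The final lm_head then projects it to logits of shape (B, T, vocab_size), where vocab_size is the size of the vocabulary. The shape of targets does not change.
The input data is really one very long text document, so why turn it into a matrix? We want the model to learn regularities from text sequences, so we cut the long token stream into small chunks and stack them: each row is one sequence, and the number of rows is the batch size.
Why do we call logits.view(-1, logits.size(-1)) and targets.view(-1) when computing the loss? F.cross_entropy expects logits of shape (N, C) and class indices of shape (N,), so the batch and time dimensions are flattened into one. The resulting cross-entropy measures how well each position predicts its next token. In other words, the model learns patterns from the sequence, and the final loss averages the quality of every next-token prediction.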
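The flattening step can be checked on a toy example. The sketch below uses made-up sizes and random tensors (none of them come from the model above) and compares F.cross_entropy on the flattened tensors with the mean negative log-probability computed by hand.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, T, vocab_size = 2, 3, 5  # toy sizes, chosen only for illustration
logits = torch.randn(B, T, vocab_size)
targets = torch.randint(0, vocab_size, (B, T))

# Flatten (B, T, vocab_size) -> (B*T, vocab_size) and (B, T) -> (B*T,)
flat_loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))

# Same quantity by hand: mean negative log-probability of each target token
log_probs = F.log_softmax(logits, dim=-1)
manual_loss = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).mean()

print(flat_loss.item(), manual_loss.item())
```

Both numbers should agree up to floating-point error, which shows that the loss is an average over all B*T next-token predictions.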
From the GPT-2 paper: We scale the weights of residual layers at initialization by a factor of 1/√N where N is the number of residual layers.
import torch

x = torch.zeros(768)
n = 100 # suppose there are 100 layers
for i in range(n):
    x += torch.randn(768)
print(x.std()) # e.g. tensor(9.5679), close to sqrt(100) = 10

x = torch.zeros(768)
n = 100
for i in range(n):
    x += n**-0.5 * torch.randn(768)
print(x.std()) # e.g. tensor(0.9772), close to 1
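Why the 1/√N factor keeps the scale fixed can be written out in one line. This is a short derivation under the same simplifying assumption as the experiment above: each residual contribution r_i is independent with zero mean and unit variance.

```latex
\operatorname{Var}\Big(\sum_{i=1}^{N} r_i\Big) = \sum_{i=1}^{N} \operatorname{Var}(r_i) = N
\;\Rightarrow\; \operatorname{std} = \sqrt{N},
\qquad
\operatorname{Var}\Big(\sum_{i=1}^{N} \tfrac{1}{\sqrt{N}}\, r_i\Big) = N \cdot \tfrac{1}{N} = 1 .
```

With N = 100 the unscaled sum has a standard deviation of about 10, and the scaled sum stays near 1, matching the two printed values.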
def _init_weights(self, module):
    if isinstance(module, nn.Linear):
        std = 0.02
        if hasattr(module, 'NANOGPT_SCALE_INIT'):
            # each block adds to the residual stream twice (attention and MLP)
            std *= (2 * self.config.n_layer)**-0.5
        torch.nn.init.normal_(module.weight, mean=0.0, std=std)
        if module.bias is not None:
            torch.nn.init.zeros_(module.bias)
    elif isinstance(module, nn.Embedding):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
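- The sign bit determines whether the number is positive or negative. It is a single bit: 0 for positive, 1 for negative.
- The exponent determines the range, or order of magnitude, of the value.
- In binary floating-point formats the exponent is stored with a bias: the actual exponent is the stored value minus an offset (127 for 32-bit floats).
- The width of the exponent determines the representable range, that is, how large or how small a number can be. For a fixed total width, a wider exponent allows larger and smaller values at the cost of mantissa bits, which can reduce precision.
- The mantissa stores the significant digits of the value and determines its precision.
- The mantissa is the significand of the floating-point number. In binary it is written as 1.xxxx, where the leading "1" is implicit and the "xxxx" bits are stored explicitly. This is called the normalized form.
- The width of the mantissa determines the precision, that is, how close two values can be and still be distinguished. The wider the mantissa, the more precisely values can be represented.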
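The three fields can be inspected directly. The following is a minimal sketch using only the standard library; the helper names float32_fields and rebuild are made up for this illustration, and rebuild handles normalized numbers only.

```python
import struct

def float32_fields(value):
    """Split a number, rounded to float32, into (sign, stored exponent, fraction bits)."""
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31                # 1 bit
    exponent = (bits >> 23) & 0xFF   # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF       # 23 explicitly stored mantissa bits
    return sign, exponent, fraction

def rebuild(sign, exponent, fraction):
    """Reconstruct a normalized float32 value from its three fields."""
    mantissa = 1 + fraction / 2**23  # the leading 1 is implicit
    return (-1) ** sign * mantissa * 2.0 ** (exponent - 127)

sign, exponent, fraction = float32_fields(-6.5)
print(sign, exponent - 127, 1 + fraction / 2**23)  # -6.5 = -1.625 * 2**2
print(rebuild(sign, exponent, fraction))
```

For -6.5 the sign bit is 1, the stored exponent is 129 (actual exponent 2), and the mantissa is 1.625.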
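FP32:
- FP32 is the most commonly used precision type and offers high numerical precision and stability.
- It suits applications that need high-precision computation, such as scientific computing and accurate numerical simulation.
- It is the default compute precision on most devices when no specialized hardware acceleration is used.
TF32 (Tensor Float 32):
- TF32 is a numeric format that NVIDIA designed for its Ampere-architecture GPUs.
- Internally it keeps the same 8-bit exponent as FP32, and therefore the same range, but shortens the mantissa to 10 bits, the same width as FP16.
- TF32 aims to stay close to FP32 accuracy for deep-learning training while delivering higher performance.
FP16:
- FP16 is a lower-precision format that reduces memory use and data-transfer requirements.
- It is sufficient for training networks in some deep-learning applications, especially with dedicated hardware support such as NVIDIA Tensor Cores.
- FP16 can speed up training and lower power consumption, but its narrow exponent may call for numerical-stability techniques such as gradient scaling.
BF16:
- BF16 is a widely used 16-bit format. It has the same size as FP16 but a much larger dynamic range.
- Compared with FP16, BF16 keeps an exponent as wide as FP32's but has fewer mantissa bits, so its precision is lower.
- The format is well suited to deep learning because it provides enough dynamic range for the values that occur there while reducing the model's memory footprint and increasing throughput.
- BF16 is supported on Google's TPUs and on recent Intel and AMD processors.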
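The trade-off between range and precision follows directly from the field widths. The sketch below assumes IEEE-style encoding for all four formats and uses the commonly cited bit widths; the helper names are made up for this illustration.

```python
# (exponent bits, explicitly stored mantissa bits) for each format
FORMATS = {
    "FP32": (8, 23),
    "TF32": (8, 10),
    "FP16": (5, 10),
    "BF16": (8, 7),
}

def max_normal(exp_bits, man_bits):
    """Largest finite value of an IEEE-style format with the given field widths."""
    bias = 2 ** (exp_bits - 1) - 1
    return (2 - 2.0 ** -man_bits) * 2.0 ** bias

def epsilon(man_bits):
    """Gap between 1.0 and the next representable number."""
    return 2.0 ** -man_bits

for name, (e, m) in FORMATS.items():
    print(f"{name}: max ~ {max_normal(e, m):.3e}, epsilon = {epsilon(m):.3e}")
```

FP16 tops out at 65504, while BF16 reaches roughly the same maximum as FP32 but with a much coarser epsilon, which is why FP16 training tends to need gradient scaling and BF16 usually does not.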
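Single precision and double precision are binary floating-point formats used to represent numbers in computers. They differ in the number of bits they use.
| Single precision | Double precision |
| --- | --- |
| Uses 32 bits per floating-point number: 1 sign bit, 8 exponent bits, 23 mantissa bits | Uses 64 bits per floating-point number: 1 sign bit, 11 exponent bits, 52 mantissa bits |
| 32 bits (4 bytes) | 64 bits (8 bytes) |
| Lower memory use and compute cost, so training is faster | Higher memory use and compute cost, so training is slower |
| Most modern GPUs are optimized for single precision. NVIDIA CUDA cores in particular provide dedicated single-precision units, which makes single-precision arithmetic far faster than double precision | Double precision is supported on some GPUs, but it is often slower, and not every GPU has efficient double-precision capability |
| Used in neural networks | Used in astronomical calculations and climate models |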
torch.cuda.amp provides convenience methods for mixed precision, where some operations use the torch.float32 (float) datatype and other operations use torch.float16 (half). Some ops, like linear layers and convolutions, are much faster in float16 or bfloat16. Other ops, like reductions, often require the dynamic range of float32. Mixed precision tries to match each op to its appropriate datatype, which can reduce your network’s runtime and memory footprint.
In these regions, CUDA ops run in a dtype chosen by autocast to improve performance while maintaining accuracy. See the Autocast Op Reference for details on what precision autocast chooses for each op, and under what circumstances.
for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        # Runs the forward pass under ``autocast``.
        with torch.autocast(device_type=device_type, dtype=torch.float16):
            output = net(input)
            # output is float16 because linear layers ``autocast`` to float16.
            assert output.dtype is torch.float16
            loss = loss_fn(output, target)
            # loss is float32 because ``mse_loss`` layers ``autocast`` to float32.
            assert loss.dtype is torch.float32
        # Exits ``autocast`` before backward().
        # Backward passes under ``autocast`` are not recommended.
        # Backward ops run in the same ``dtype`` ``autocast`` chose for corresponding forward ops.
        loss.backward()
        opt.step()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
- CUDA Ops that can autocast to float16:
__matmul__, addbmm, addmm, addmv, addr, baddbmm, bmm, chain_matmul, multi_dot, conv1d, conv2d, conv3d, conv_transpose1d, conv_transpose2d, conv_transpose3d, GRUCell, linear, LSTMCell, matmul, mm, mv, prelu, RNNCell
- CUDA Ops that can autocast to float32:
__pow__, __rdiv__, __rpow__, __rtruediv__, acos, asin, binary_cross_entropy_with_logits, cosh, cosine_embedding_loss, cdist, cosine_similarity, cross_entropy, cumprod, cumsum, dist, erfinv, exp, expm1, group_norm, hinge_embedding_loss, kl_div, l1_loss, layer_norm, log, log_softmax, log10, log1p, log2, margin_ranking_loss, mse_loss, multilabel_margin_loss, multi_margin_loss, nll_loss, norm, normalize, pdist, poisson_nll_loss, pow, prod, reciprocal, rsqrt, sinh, smooth_l1_loss, soft_margin_loss, softmax, softmin, softplus, sum, renorm, tan, triplet_margin_loss
# Constructs a ``scaler`` once, at the beginning of the convergence run, using default arguments.
# If your network fails to converge with default ``GradScaler`` arguments, please file an issue.
# The same ``GradScaler`` instance should be used for the entire convergence run.
# If you perform multiple convergence runs in the same script, each run should use
# a dedicated fresh ``GradScaler`` instance. ``GradScaler`` instances are lightweight.
scaler = torch.cuda.amp.GradScaler()
for epoch in range(0): # 0 epochs, this section is for illustration only
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16):
            output = net(input)
            loss = loss_fn(output, target)
        # Scales loss. Calls ``backward()`` on scaled loss to create scaled gradients.
        scaler.scale(loss).backward()
        # ``scaler.step()`` first unscales the gradients of the optimizer's assigned parameters.
        # If these gradients do not contain ``inf``s or ``NaN``s, optimizer.step() is then called,
        # otherwise, optimizer.step() is skipped.
        scaler.step(opt)
        # Updates the scale for next iteration.
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
use_amp = True # if False, torch.autocast and GradScaler are disabled
net = make_model(in_size, out_size, num_layers)
opt = torch.optim.SGD(net.parameters(), lr=0.001)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)
start_timer()
for epoch in range(epochs):
    for input, target in zip(data, targets):
        with torch.autocast(device_type=device, dtype=torch.float16, enabled=use_amp):
            output = net(input)
            loss = loss_fn(output, target)
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()
        opt.zero_grad() # set_to_none=True here can modestly improve performance
end_timer_and_print("Mixed precision:")
class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)

    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule()
opt_mod = torch.compile(mod)
print(opt_mod(torch.randn(10, 100)))
Speedup mainly comes from reducing Python overhead and GPU read/writes, and so the observed speedup may vary on factors such as model architecture and batch size.
- run
ldconfig -p|grep libcuda
- create a soft link
ln -s /usr/lib/x86_64-linux-gnu/libcuda.so.1 /usr/bin/libcuda.so
import torch

class MyModule(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.lin = torch.nn.Linear(100, 10)

    def forward(self, x):
        return torch.nn.functional.relu(self.lin(x))

mod = MyModule()
opt_mod = torch.compile(mod)
print(opt_mod(torch.randn(10, 100)))
python -c 'import torch; import torchvision; x=torch.ones(1,3,224,224); model=torchvision.models.resnet50(); compiled=torch.compile(model); compiled(x)'
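FlashAttention is an IO-aware exact attention algorithm. It uses tiling to reduce the number of memory reads and writes between GPU high-bandwidth memory (HBM) and on-chip SRAM, and achieves a 3x speedup on GPT-2 (sequence length 1K).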
att = (q @ k.transpose(-2, -1)) * (1.0/math.sqrt(k.size(-1)))
att = att.masked_fill(self.bias[:,:,:T,:T] == 0, float('-inf'))
att = F.softmax(att, dim=-1)
y = att @ v # (B, nh, T, T) x (B, nh, T, hs) -> (B, nh, T, hs)
y = F.scaled_dot_product_attention(q, k, v, is_causal=True) # flash attention
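Because FlashAttention is exact, the fused call should return the same values as the four-line manual version. The sketch below checks this on random toy tensors. It assumes PyTorch 2.0 or newer, and the explicit mask stands in for self.bias. On CPU the fused call may dispatch to a different backend than FlashAttention, but the result is the same attention output.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
B, nh, T, hs = 2, 4, 8, 16  # toy sizes for illustration
q, k, v = (torch.randn(B, nh, T, hs) for _ in range(3))

# Manual causal attention, mirroring the four-line version
mask = torch.tril(torch.ones(T, T)).view(1, 1, T, T)
att = (q @ k.transpose(-2, -1)) * (1.0 / math.sqrt(k.size(-1)))
att = att.masked_fill(mask == 0, float("-inf"))
att = F.softmax(att, dim=-1)
y_manual = att @ v

# Fused version; never materializes the (T, T) attention matrix in the FlashAttention path
y_fused = F.scaled_dot_product_attention(q, k, v, is_causal=True)

print(torch.allclose(y_manual, y_fused, atol=1e-5))
```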
To train all versions of GPT-3, we use Adam with β1 = 0.9, β2 = 0.95, and eps = 10^-8, we clip the global norm of the gradient at 1.0, and we use cosine decay for learning rate down to 10% of its value, over 260 billion tokens (after 260 billion tokens, training continues at 10% of the original learning rate). There is a linear LR warmup over the first 375 million tokens. We also gradually increase the batch size linearly from a small value (32k tokens) to the full value over the first 4-12 billion tokens of training, depending on the model size. Data are sampled without replacement during training (until an epoch boundary is reached) to minimize overfitting. All models use weight decay of 0.1 to provide a small amount of regularization [LH17].
During training we always train on sequences of the full nctx = 2048 token context window, packing multiple documents into a single sequence when documents are shorter than 2048, in order to increase computational efficiency. Sequences with multiple documents are not masked in any special way but instead documents within a sequence are delimited with a special end of text token, giving the language model the information necessary to infer that context separated by the end of text token is unrelated. This allows for efficient training without need for any special sequence-specific masking.
torch.optim.AdamW(model.parameters(), lr=3e-4, betas=(0.9, 0.95),eps=1e-8)
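torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
Gradient clipping limits the norm of the gradients so that they cannot grow too large, which guards against exploding gradients. When the total norm of all gradients exceeds the threshold max_norm, the function scales every gradient by the same factor so that the total norm no longer exceeds max_norm.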
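The rescaling rule can be written in a few lines of plain Python. This is a minimal sketch of global-norm clipping on lists of numbers, not PyTorch's implementation; the function name and the eps term in the denominator are choices made for this illustration.

```python
import math

def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Scale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the scaled gradients and the norm measured before clipping.
    """
    total_norm = math.sqrt(sum(g * g for vec in grads for g in vec))
    # Only ever shrink: the coefficient is capped at 1.0
    coef = min(1.0, max_norm / (total_norm + eps))
    return [[g * coef for g in vec] for vec in grads], total_norm

grads = [[3.0, 4.0], [12.0]]  # joint norm = sqrt(9 + 16 + 144) = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = math.sqrt(sum(g * g for vec in clipped for g in vec))
print(norm_before, norm_after)
```

All gradients are multiplied by the same coefficient, so the direction of the update is preserved and only its length changes.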
From the paper: we use cosine decay for learning rate down to 10% of its value, over 260 billion tokens (after 260 billion tokens, training continues at 10% of the original learning rate)
import math

max_lr = 6e-4
min_lr = max_lr * 0.1
warmup_steps = 715
max_steps = 19073 # 19,073 steps is ~1 epoch, if data is 10B tokens and batch size 0.5M tokens

def get_lr(it):
    # 1) warmup: while it < warmup_steps, the lr increases linearly
    if it < warmup_steps:
        return max_lr * (it+1) / warmup_steps
    # 2) after the decay: once it > max_steps, return the minimum lr
    if it > max_steps:
        return min_lr
    # 3) in between: cosine decay from max_lr down to min_lr
    decay_ratio = (it - warmup_steps) / (max_steps - warmup_steps)
    assert 0 <= decay_ratio <= 1
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio)) # coeff starts at 1 and goes to 0
    return min_lr + coeff * (max_lr - min_lr)
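In the paper the learning-rate warmup lasts 375M tokens (There is a linear LR warmup over the first 375 million tokens). With 2**19 tokens per step, that is 375e6 / 2**19 ≈ 715 warmup steps.
The full dataset holds 10B tokens and each step consumes 2**19 tokens, so roughly 19,073 steps cover the 10B tokens once. Hence max_steps = 19073.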
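The two step counts can be reproduced with integer arithmetic. The variable names below are chosen for this sketch; the token counts are the ones quoted above.

```python
tokens_per_step = 2**19            # 524,288 tokens per optimizer step
warmup_tokens = 375_000_000        # 375M tokens, from the GPT-3 paper
dataset_tokens = 10_000_000_000    # 10B tokens in the training set

warmup_steps = warmup_tokens // tokens_per_step
max_steps = dataset_tokens // tokens_per_step
print(warmup_steps, max_steps)  # 715 19073
```

Both divisions leave a remainder, so the last partial step is dropped and one pass covers slightly less than the full 10B tokens.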
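fused AdamW: PyTorch offers three implementations of the optimizer update: for-loop, foreach, and fused. Their performance is typically ranked for-loop < foreach < fused. The for-loop implementation iterates over the parameters and updates them one by one. The foreach implementation combines all parameters of a param group into a multi-tensor and updates them in one go, which uses more memory but reduces the number of kernel calls. The fused implementation performs the entire computation in a single kernel. Foreach and fused require all model parameters to be on CUDA, and fused additionally requires all parameters to be of floating-point type.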
import inspect

def configure_optimizers(self, weight_decay, learning_rate, device_type):
    # start with all of the candidate parameters (that require grad)
    param_dict = {pn: p for pn, p in self.named_parameters()}
    param_dict = {pn: p for pn, p in param_dict.items() if p.requires_grad}
    # create optim groups. Any parameters that is 2D will be weight decayed, otherwise no.
    # i.e. all weight tensors in matmuls + embeddings decay, all biases and layernorms don't.
    decay_params = [p for n, p in param_dict.items() if p.dim() >= 2]
    nodecay_params = [p for n, p in param_dict.items() if p.dim() < 2]
    optim_groups = [
        {'params': decay_params, 'weight_decay': weight_decay},
        {'params': nodecay_params, 'weight_decay': 0.0}
    ]
    num_decay_params = sum(p.numel() for p in decay_params)
    num_nodecay_params = sum(p.numel() for p in nodecay_params)
    print(f"num decayed parameter tensors: {len(decay_params)}, with {num_decay_params:,} parameters")
    print(f"num non-decayed parameter tensors: {len(nodecay_params)}, with {num_nodecay_params:,} parameters")
    # Create AdamW optimizer and use the fused version if it is available
    fused_available = 'fused' in inspect.signature(torch.optim.AdamW).parameters
    use_fused = fused_available and device_type == "cuda"
    print(f"using fused AdamW: {use_fused}")
    optimizer = torch.optim.AdamW(optim_groups, lr=learning_rate, betas=(0.9, 0.95), eps=1e-8, fused=use_fused)
    return optimizer
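GPT-3 uses a batch size of 0.5M tokens, roughly 500,000. If we want to reach a batch of the same size with sequences of length T = 1024 (the divisor used here), each batch needs B = 500,000 / 1024 ≈ 488 sequences. A batch this large cannot be loaded onto a GPU with 24 GB of memory. We therefore split it into several micro-batches and use gradient accumulation: the gradients of the micro-batches are summed, and the model parameters are updated once afterwards.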
total_batch_size = 524288 # 2**19, ~0.5M, in number of tokens
B = 8 # micro batch size
T = 1024 # sequence length
assert total_batch_size % (B * T) == 0, "make sure total_batch_size is divisible by B * T"
grad_accum_steps = total_batch_size // (B * T)
train_loader = DataLoaderLite(B=B, T=T)
for epoch in range(epochs):
    optimizer.zero_grad()
    for micro_step in range(grad_accum_steps):
        x, y = train_loader.next_batch()
        x, y = x.to(device), y.to(device)
        # train in bfloat16
        with torch.autocast(device_type=device, dtype=torch.bfloat16):
            logits, loss = model(x, y)
        # the loss is a mean over the micro-batch, so divide by the number of accumulation steps
        loss = loss / grad_accum_steps
        loss.backward() # gradients are summed across micro-steps
    optimizer.step() # one parameter update per total_batch_size tokens
The code below computes the gradient once on a batch of 4 examples and produces, for example, the following gradient: tensor([ 0.0331, 0.0307, 0.0061, -0.0363, 0.0187, -0.0415, 0.0159, 0.0077, 0.0364, -0.0054])
import torch

# super simple little MLP
net = torch.nn.Sequential(
    torch.nn.Linear(16, 32),
    torch.nn.Linear(32, 1)
)
x = torch.randn(4, 16)
y = torch.randn(4, 1)
net.zero_grad()
yhat = net(x)
loss = torch.nn.functional.mse_loss(yhat, y)
loss.backward()
print(net[0].weight.grad.view(-1)[:10])
# the loss objective here is (due to reduction='mean')
# L = 1/4 * [
#     (y[0] - yhat[0])**2 +
#     (y[1] - yhat[1])**2 +
#     (y[2] - yhat[2])**2 +
#     (y[3] - yhat[3])**2
# ]
# NOTE: 1/4!
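The code below splits the batch of 4 into 4 equal parts and computes the gradient on each part in turn, dividing the loss by 4 every time. It produces the gradient tensor([ 0.0331, 0.0307, 0.0061, -0.0363, 0.0187, -0.0415, 0.0159, 0.0077, 0.0364, -0.0054]), identical to the previous one. The reason is that the loss uses mean reduction by default. If the gradients were simply accumulated without dividing by the number of accumulation steps, the result would be k times larger, where k is the number of accumulation steps. So when a batch is too large to compute in one pass, gradient accumulation works, but the loss must be divided by the number of accumulation steps.
# now let's do it with grad_accum_steps of 4, and B=1
# the loss objective here is different because
# accumulation in gradient <---> SUM in loss
# i.e. we instead get:
# L0 = 1/4(y[0] - yhat[0])**2
# L1 = 1/4(y[1] - yhat[1])**2
# L2 = 1/4(y[2] - yhat[2])**2
# L3 = 1/4(y[3] - yhat[3])**2
# L = L0 + L1 + L2 + L3
# NOTE: the "normalizer" of 1/4 is lost
net.zero_grad()
for i in range(4):
    yhat = net(x[i])
    loss = torch.nn.functional.mse_loss(yhat, y[i])
    loss = loss / 4 # <-- have to add back the "normalizer"!
    loss.backward()
print(net[0].weight.grad.view(-1)[:10])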
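DDP (Distributed Data Parallel) is PyTorch's tool for distributed training. It launches multiple processes, one per GPU, each holding a replica of the model and working on a different slice of the data, and it averages the gradients across processes, which speeds up training. The first process (rank 0) is the master process and takes care of things like logging. Every process has its own rank; apart from that, the code they run is identical.
- Initialize DDP
# use of DDP atm demands CUDA, we set the device appropriately according to rank
assert torch.cuda.is_available(), "for now i think we need CUDA for DDP"
init_process_group(backend='nccl') # initialize the process group with nccl as the backend
ddp_rank = int(os.environ['RANK']) # global rank of this process
ddp_local_rank = int(os.environ['LOCAL_RANK']) # rank of this process on the local node
ddp_world_size = int(os.environ['WORLD_SIZE']) # total number of processes
device = f'cuda:{ddp_local_rank}' # map the local rank to the matching CUDA device
master_process = ddp_rank == 0 # this process will do logging, checkpointing etc.
- grad_accum_steps in gradient accumulation
- dataloader
- Wrap the model in the DDP container
- ❓Gradient accumulation: model.require_backward_grad_sync = (micro_step == grad_accum_steps-1), so that gradients are synchronized across processes only on the last micro-step
- Averaging across processes: dist.all_reduce(loss_accum, op=dist.ReduceOp.AVG) averages loss_accum over all processes in the distributed environment and writes the result back to every process
- Replace model with model.module where the unwrapped model is needed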
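The notes list grad_accum_steps and the dataloader as things to change under DDP without giving details. The sketch below shows one common way to do it, and both the formula and the strided read positions are assumptions rather than something stated above: each process takes B*T tokens per micro-step, so the accumulation count is divided by the world size, and each process starts at its own offset and strides past the slices of the others. The world_size value and the helper name are hypothetical.

```python
total_batch_size = 524288  # tokens per optimizer step, as in the accumulation example
B, T = 8, 1024             # micro-batch size and sequence length
world_size = 8             # hypothetical number of DDP processes

# All processes together cover B*T*world_size tokens per micro-step,
# so fewer accumulation steps are needed than on a single GPU.
assert total_batch_size % (B * T * world_size) == 0
grad_accum_steps = total_batch_size // (B * T * world_size)

def start_positions(rank, n_micro_steps):
    """Token offsets read by one process: start at its own slice, then stride past all others."""
    return [B * T * rank + step * B * T * world_size for step in range(n_micro_steps)]

print(grad_accum_steps)
print(start_positions(rank=0, n_micro_steps=2), start_positions(rank=1, n_micro_steps=2))
```

With 8 processes the 64 accumulation steps of the single-GPU setup drop to 8, and no two processes ever read the same tokens.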
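Another option is FineWeb. Because the model we train is fairly small, the sample-10BT version of fineweb-edu is enough. If access to Hugging Face is unstable, you can use a Hugging Face mirror in China; see the blog for the alternative download method, and run export HF_ENDPOINT=https://hf-mirror.com before using the mirror.
The data ends up split into 100 shards of 100M tokens each (the actual numbers differ slightly, and the last shard may hold fewer than 100M tokens).
- Download the data
- Preprocess the data into a suitable format
- Write the data iterator
- Evaluate the model
Below is the core of the evaluation code. Once the data is prepared, we feed it to the model, obtain the logits, and compute the cross-entropy loss between the logits and the labels (the token at the corresponding position). The cross-entropy at a position measures how much probability the model assigned to the correct token there, so passing in a minibatch yields a whole series of loss values, one per position. Here the minibatch refers to the token length of one sentence. Because we do not want the losses aggregated (summed or averaged), we set reduction='none'.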
logits = model(tokens).logits
# contiguous() ensures the tensor is stored contiguously in memory, which view() requires
shift_logits = (logits[..., :-1, :]).contiguous() # drop the logits of the last time step
shift_tokens = (tokens[..., 1:]).contiguous() # drop the token of the first time step
# flatten shift_logits and shift_tokens so that each position's logits line up with its target token
flat_shift_logits = shift_logits.view(-1, shift_logits.size(-1))
flat_shift_tokens = shift_tokens.view(-1)
shift_losses = F.cross_entropy(flat_shift_logits, flat_shift_tokens, reduction='none') # cross-entropy at every position; reduction='none' returns one loss per position
shift_losses = shift_losses.view(tokens.size(0), -1) # restore the (batch, T-1) shape
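In an autoregressive language model, the logits at each step are predictions for the next time step: logits[i, t, :] holds the scores that the model assigns, for sample i at time step t, to every possible next token. We therefore shift logits and tokens against each other so that shift_logits[i, t, :] lines up with shift_tokens[i, t], which is the original token at position t+1. The loss can then be computed position by position.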
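The alignment can be verified on random data. In the sketch below the logits tensor is a random stand-in for model(tokens).logits and all sizes are made up; it spot-checks that the loss at (i, t) is the negative log-probability of the token at position t+1 under the prediction made at position t.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch, T, vocab_size = 2, 5, 11             # toy sizes for illustration
tokens = torch.randint(0, vocab_size, (batch, T))
logits = torch.randn(batch, T, vocab_size)  # stand-in for model(tokens).logits

shift_logits = logits[..., :-1, :].contiguous()  # predictions made at positions 0..T-2
shift_tokens = tokens[..., 1:].contiguous()      # the tokens at positions 1..T-1

losses = F.cross_entropy(
    shift_logits.view(-1, vocab_size), shift_tokens.view(-1), reduction="none"
).view(batch, -1)

# Spot check one entry: loss at (i, t) is -log p(tokens[i, t+1] | logits[i, t])
i, t = 1, 2
expected = -F.log_softmax(logits[i, t], dim=-1)[tokens[i, t + 1]]
print(losses.shape, torch.allclose(losses[i, t], expected))
```

The loss tensor has T-1 columns because the last position has no next token to predict.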
GPT3 Zero-shot --> 33.7
Datasets Description:
"activity_label": "Removing ice from car",
"ctx": "Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles. then",
"ctx_a": "Then, the man writes over the snow covering the window of a car, and a woman wearing winter clothes smiles.",
"ctx_b": "then",
"endings": "[\", the man adds wax to the windshield and cuts it.\", \", a person board a ski lift, while two men supporting the head of the per...",
"ind": 4,
"label": "3",
"source_id": "activitynet~v_-1IBHYS3L-Y",
"split": "train",
"split_type": "indomain"
GPT3 Zero-shot --> 42.7
LAMBADA is used to evaluate the capabilities of computational models for text understanding by means of a word prediction task. LAMBADA is a collection of narrative texts sharing the characteristic that human subjects are able to guess their last word if they are exposed to the whole text, but not if they only see the last sentence preceding the target word. To succeed on LAMBADA, computational models cannot simply rely on local context, but must be able to keep track of information in the broader discourse.
GPT3 Zero-shot --> 63.3
This test requires a system to choose the correct ending to a four-sentence story.
Datasets Description
"answer_right_ending": 1,
"input_sentence_1": "Rick grew up in a troubled household.",
"input_sentence_2": "He never found good support in family, and turned to gangs.",
"input_sentence_3": "It wasn't long before Rick got shot in a robbery.",
"input_sentence_4": "The incident caused him to turn a new leaf.",
"sentence_quiz1": "He is happy now.",
"sentence_quiz2": "He joined a gang.",
"story_id": "138d5bfb-05cc-41e3-bf2c-fa85ebad14e2"
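Training on 8 V100 (16GB) GPUs in parallel takes about 48 hours.
The local system is Ubuntu; the steps below use git.
- Create an empty project on GitHub and get its push URL.
- Check the existing local keys, mainly to see their names and whether a key for connecting to GitHub already exists.
- Generate a new SSH key and add it to ssh-agent.
Adding the key to ssh-agent failed with Permissions 0664 for '/home/zhangyuanwang/.ssh/id_ed25519' are too open.
The error means that the permissions on the private key file are too loose. SSH requires that the private key file not be accessible by other users.
The permissions can be fixed with: chmod 600 ~/.ssh/id_ed25519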
git init
git add . # stage files for commit
git commit -m "first commit" # commit the staged files to the local repository
git branch -M main
git remote add origin git@github.com:zhangyuanwang777/build_nanoGPT.git
git push -u origin main
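Because I cloned karpathy's project and edited on top of it, the project already had a remote repository, so git remote add origin fails. Use git remote -v to check whether the remote URL is set correctly. (You could also add a new remote under a different name; the existing origin would not interfere with it.)
I need to run git remote set-url origin git@github.com:zhangyuanwang777/build_nanoGPT.git to change the address of the remote repository so that pushes go to my GitHub.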
git config user.name "Your Name"
git config user.email "[email protected]"
If the local repository has changes, push them to GitHub with the following commands:
git add .
git commit -m "Your commit message"
git push origin main
- Train a tokenizer
- Fine-tuning
- Walk through the full large-model pipeline end to end