Hell Fast
1. Discrete token
2. Vocabulary size
3. Embedding dimension
4. The "meaning" the model has learned for one token
Hell Fast
If you have a 48-dimensional embedding, you have 48 "features" describing each token's characteristics. (An example of what a "feature" is would help here.)
GPT-3 (175 billion parameters): C = 12,288.
The embedding combines a token's position information with its semantic features.
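A minimal sketch of that last point, assuming the common GPT-style setup where a token embedding row and a position embedding row are simply added together. The table sizes other than C = 48, the random initialization, and the names `token_table`, `position_table`, and `embed` are all made up for illustration.

```python
import random

VOCAB_SIZE = 3   # hypothetical tiny vocabulary
C = 48           # embedding dimension, as in the note above
T = 4            # hypothetical maximum sequence length

# In a trained model these tables are learned; here they are random stand-ins.
rng = random.Random(0)
token_table = [[rng.gauss(0.0, 0.02) for _ in range(C)] for _ in range(VOCAB_SIZE)]
position_table = [[rng.gauss(0.0, 0.02) for _ in range(C)] for _ in range(T)]

def embed(token_ids):
    """Return one C-dimensional vector per token: token row + position row."""
    return [
        [tok + pos for tok, pos in zip(token_table[tid], position_table[i])]
        for i, tid in enumerate(token_ids)
    ]

# Token 2 appears at positions 0 and 2, so it gets two different vectors.
vectors = embed([2, 0, 2])
```

Each of the 48 numbers in a vector is one "feature"; the same token gets a different vector at a different position because the position row differs.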
Hell Fast
1. Mean (μ) and variance (Var)
2. The epsilon term (ε = 1×10⁻⁵) is there to prevent division by zero.
3. Feature correction
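The three points above can be sketched as a plain-Python layer normalization over one token's feature vector. This is an illustration, not any particular library's implementation. Reading "feature correction" as the learned scale and shift (`gamma`, `beta`) applied after normalization is an assumption, and the function name `layer_norm` is made up.

```python
import math

EPS = 1e-5  # the epsilon from the note; keeps the division finite when Var is 0

def layer_norm(x, gamma=None, beta=None, eps=EPS):
    """Normalize one feature vector to roughly zero mean and unit variance."""
    n = len(x)
    mean = sum(x) / n                           # mu
    var = sum((v - mean) ** 2 for v in x) / n   # Var
    inv_std = 1.0 / math.sqrt(var + eps)
    # Assumed "feature correction": learned per-feature scale and shift.
    gamma = gamma if gamma is not None else [1.0] * n
    beta = beta if beta is not None else [0.0] * n
    return [(v - mean) * inv_std * g + b for v, g, b in zip(x, gamma, beta)]

normed = layer_norm([1.0, 2.0, 3.0, 4.0])
# A constant vector has Var = 0; without eps this would divide by zero.
constant = layer_norm([5.0, 5.0, 5.0])
```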
Hell Fast
1. A is the size of each attention head (open question: what is a head?)
2. What the dot product means
3. What Q, K, and V are
Hell Fast
1. Q means "I need an adjective (type 34 query)"
2. K means "I am an adjective (type 34 query)"
3. Q·K measures how relevant a key is to this query
4. softmax(Q·K) V is what you need to change the meaning of the token: you are an adjective, and now you are "red" or "tall"
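A toy sketch of the Q/K/V steps above for a single query, in plain Python. The division by sqrt(A) is the usual scaled dot-product convention and is an assumption here, since the note only writes softmax(KQ) V. The vectors and the names `softmax` and `attend` are made up for illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(q, keys, values):
    """Weight each value by how well its key matches the query."""
    A = len(q)  # head size
    scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(A) for k in keys]
    weights = softmax(scores)
    out = [sum(w * v[j] for w, v in zip(weights, values))
           for j in range(len(values[0]))]
    return weights, out

q = [4.0, 0.0]                      # "I need an adjective"
keys = [[4.0, 0.0], [0.0, 4.0]]     # first key: "I am an adjective"; second: something else
values = [[1.0, 0.0], [0.0, 1.0]]   # what each token contributes if attended to
weights, out = attend(q, keys, values)
```

Because the first key lines up with the query, almost all of the weight lands on the first value, so `out` is close to that value.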
Hell Fast
This is yet another feature-related step (like querying an abstract class): it decides whether a given feature is important, then adds that feature to the token itself.