Weight initialization - Wikipedia
From Wikipedia, the free encyclopedia
Technique for setting initial values of trainable parameters in a neural network

In deep learning, weight initialization or parameter initialization describes the initial step in creating a neural network. A neural network contains trainable parameters that are modified during training: weight initialization is the pre-training step of assigning initial values to these parameters.

The choice of weight initialization method affects the speed of convergence, the scale of neural activation within the network, the scale of gradient signals during backpropagation, and the quality of the final model. Proper initialization is necessary for avoiding issues such as vanishing and exploding gradients and activation function saturation.

Note that even though this article is titled "weight initialization", both weights and biases are used in a neural network as trainable parameters, so this article describes how both of these are initialized. Similarly, trainable parameters in convolutional neural networks (CNNs) are called kernels and biases, and this article also describes these.

Constant initialization

We discuss the main methods of initialization in the context of a multilayer perceptron (MLP). Specific strategies for initializing other network architectures are discussed in later sections.

For an MLP, there are only two kinds of trainable parameters, called weights and biases. Each layer $l$ contains a weight matrix $W^{(l)} \in \mathbb{R}^{n_{l-1} \times n_l}$ and a bias vector $b^{(l)} \in \mathbb{R}^{n_l}$, where $n_l$ is the number of neurons in that layer. A weight initialization method is an algorithm for setting the initial values for $W^{(l)}, b^{(l)}$ for each layer $l$.

The simplest form is zero initialization: $W^{(l)} = 0, b^{(l)} = 0$. Zero initialization is usually used for initializing biases, but it is not used for initializing weights, as it leads to symmetry in the network, causing all neurons to learn the same features.
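
The symmetry problem can be checked numerically. The following is a minimal sketch, assuming a hypothetical two-layer network $y = v \cdot \tanh(xW)$ with a squared-error loss; the function names, the loss, and the constant 0.5 are illustrative choices and do not come from the cited sources. When every weight starts at the same constant, each hidden neuron receives exactly the same gradient, so gradient descent cannot make the neurons differ.

```python
import math

def hidden_activations(W, x):
    """Return tanh(x @ W) for W given as a list of n_in rows of n_hidden entries."""
    n_hidden = len(W[0])
    return [math.tanh(sum(x[i] * W[i][j] for i in range(len(x))))
            for j in range(n_hidden)]

def hidden_weight_grads(W, v, x, target):
    """Gradient of 0.5 * (y - target)**2 with respect to W, where y = v . tanh(x @ W)."""
    h = hidden_activations(W, x)
    y = sum(vj * hj for vj, hj in zip(v, h))
    err = y - target
    # Chain rule: dL/dW[i][j] = err * v[j] * (1 - h[j]^2) * x[i]
    return [[err * v[j] * (1.0 - h[j] ** 2) * x[i] for j in range(len(v))]
            for i in range(len(x))]

n_in, n_hidden = 3, 4
x, target = [0.2, -1.0, 0.7], 1.0

# Constant initialization (assumed value 0.5): all hidden neurons are identical.
W_const = [[0.5] * n_hidden for _ in range(n_in)]
v_const = [0.5] * n_hidden
grads_const = hidden_weight_grads(W_const, v_const, x, target)

# Zero initialization: the hidden-layer gradient vanishes because v = 0.
W_zero = [[0.0] * n_hidden for _ in range(n_in)]
v_zero = [0.0] * n_hidden
grads_zero = hidden_weight_grads(W_zero, v_zero, x, target)

# Every row of the gradient has identical entries, so the columns of W
# (one per hidden neuron) stay equal after any number of gradient steps.
columns_identical = all(len(set(row)) == 1 for row in grads_const)
print(columns_identical, grads_zero[0])
```

Random initialization breaks this tie, which is why the methods in the following sections all draw the weights from a distribution.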

In this page, we assume $b = 0$ unless otherwise stated.

Recurrent neural networks typically use activation functions with bounded range, such as sigmoid and tanh, since unbounded activations may cause exploding values. (Le, Jaitly, Hinton, 2015)[1] suggested initializing the weights in the recurrent parts of the network to the identity matrix and the biases to zero, similar to the idea of residual connections and of an LSTM with no forget gate.
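
As a rough illustration of why the identity is a natural starting point, the sketch below runs a ReLU recurrent step with identity recurrent weights and zero bias. The function names and the pre-projected input term `u` are hypothetical simplifications; a real recurrent layer would also have an input weight matrix. With no input, a non-negative hidden state is copied unchanged from step to step, so it neither explodes nor vanishes.

```python
def identity_matrix(n):
    """n x n identity, used as the initial recurrent weight matrix."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def relu_rnn_step(W_rec, bias, h, u):
    """One step h' = ReLU(W_rec h + bias + u); u is an already-projected input."""
    pre = [sum(W_rec[i][j] * h[j] for j in range(len(h))) + bias[i] + u[i]
           for i in range(len(h))]
    return [max(0.0, p) for p in pre]

n = 3
W_rec = identity_matrix(n)   # identity recurrent weights
bias = [0.0] * n             # zero bias
h0 = [0.5, 2.0, 0.0]         # any non-negative hidden state
h = h0
for _ in range(100):
    h = relu_rnn_step(W_rec, bias, h, [0.0] * n)  # no external input
print(h)
```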

In most cases, the biases are initialized to zero, though some situations can use a nonzero initialization. For example, in multiplicative units, such as the forget gate of LSTM, the bias can be initialized to 1 to allow good gradient signal through the gate.[2] For neurons with ReLU activation, one can initialize the bias to a small positive value like 0.1, so that the gradient is likely nonzero at initialization, avoiding the dying ReLU problem.[3]: 305 [4]
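
The effect of the forget-gate bias can be illustrated with a small calculation. This sketch assumes, as a simplification, that at initialization the gate's pre-activation is dominated by its bias, so the gate value is roughly the sigmoid of the bias; the helper names are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def retained_fraction(forget_bias, steps):
    """Fraction of a cell value surviving `steps` updates when the gate sees only its bias."""
    return sigmoid(forget_bias) ** steps

gate_zero = sigmoid(0.0)                  # gate value with zero bias
gate_one = sigmoid(1.0)                   # gate value with bias 1
kept_zero = retained_fraction(0.0, 10)    # after 10 time steps, bias 0
kept_one = retained_fraction(1.0, 10)     # after 10 time steps, bias 1
print(gate_zero, gate_one, kept_zero, kept_one)
```

Because the same factor multiplies the gradient flowing back through the cell state, a gate that starts closer to 1 lets considerably more signal through over many steps.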

Random initialization

Random initialization means sampling the weights from a normal distribution or a uniform distribution, usually independently.

LeCun initialization

LeCun initialization, popularized in (LeCun et al., 1998),[5] is designed to preserve the variance of neural activations during the forward pass.

It samples each entry in $W^{(l)}$ independently from a distribution with mean 0 and variance $1/n_{l-1}$. For example, if the distribution is a continuous uniform distribution, then the distribution is $\mathcal{U}(\pm\sqrt{3/n_{l-1}})$.
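
A minimal sketch of the uniform variant, using only the Python standard library (the function name `lecun_uniform` and the layer sizes are illustrative assumptions). A uniform distribution on $(-a, a)$ has variance $a^2/3$, so choosing $a = \sqrt{3/n_{l-1}}$ gives the required variance $1/n_{l-1}$.

```python
import math
import random

def lecun_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-a, a) with a = sqrt(3 / fan_in)."""
    limit = math.sqrt(3.0 / fan_in)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)          # fixed seed so the sketch is reproducible
fan_in, fan_out = 200, 200
W = lecun_uniform(fan_in, fan_out, rng)

# Compare the empirical variance of the entries with the target 1 / fan_in.
entries = [w for row in W for w in row]
mean = sum(entries) / len(entries)
variance = sum((w - mean) ** 2 for w in entries) / len(entries)
print(variance, 1.0 / fan_in)
```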

Glorot initialization

Glorot initialization (or Xavier initialization) was proposed by Xavier Glorot and Yoshua Bengio.[6] It was designed as a compromise between two goals: to preserve activation variance during the forward pass and to preserve gradient variance during the backward pass.

For uniform initialization, it samples each entry in $W^{(l)}$ independently and identically from $\mathcal{U}(\pm\sqrt{6/(n_{l-1}+n_l)})$. In this context, $n_{l-1}$ is also called the "fan-in", and $n_l$ the "fan-out", matching the shape $n_{l-1} \times n_l$ of $W^{(l)}$. When the fan-in and fan-out are equal, Glorot initialization is the same as LeCun initialization.
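
The following sketch, with hypothetical helper names, samples a Glorot-uniform matrix and checks the remark about equal fan-in and fan-out: with both equal to $n$, the bound $\sqrt{6/(2n)}$ reduces to the LeCun bound $\sqrt{3/n}$.

```python
import math
import random

def glorot_limit(fan_in, fan_out):
    """Half-width a of the uniform distribution U(-a, a) used by Glorot initialization."""
    return math.sqrt(6.0 / (fan_in + fan_out))

def glorot_uniform(fan_in, fan_out, rng):
    """Sample a fan_in x fan_out matrix from U(-a, a)."""
    limit = glorot_limit(fan_in, fan_out)
    return [[rng.uniform(-limit, limit) for _ in range(fan_out)]
            for _ in range(fan_in)]

rng = random.Random(0)
W = glorot_uniform(64, 256, rng)

# With equal fan-in and fan-out the bound coincides with the LeCun bound.
square_limit = glorot_limit(128, 128)
lecun_limit = math.sqrt(3.0 / 128)
print(square_limit, lecun_limit)
```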

He initialization

As Glorot initialization performs poorly for ReLU activation,[7] He initialization (or Kaiming initialization) was proposed by Kaiming He et al.[8] for networks with ReLU activation. It samples each entry in $W^{(l)}$ from $\mathcal{N}(0, 2/n_{l-1})$.
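
The factor of 2 compensates for ReLU zeroing out roughly half of the pre-activations. The sketch below is an informal experiment, not taken from the cited paper: it pushes one random input through a stack of ReLU layers and compares the mean squared activation at the end for weight variance $2/n$ (He) and $1/n$ (LeCun). The width, depth, and function name are arbitrary assumptions.

```python
import math
import random

def final_mean_square(gain, width=64, depth=10, seed=0):
    """Push one random input through `depth` ReLU layers whose weights are
    drawn from N(0, gain / width) and return the mean squared activation."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(width)]
    std = math.sqrt(gain / width)
    for _ in range(depth):
        W = [[rng.gauss(0.0, std) for _ in range(width)] for _ in range(width)]
        x = [max(0.0, sum(w * xi for w, xi in zip(row, x))) for row in W]
    return sum(v * v for v in x) / width

he_scale = final_mean_square(gain=2.0)     # He: variance 2 / fan_in
lecun_scale = final_mean_square(gain=1.0)  # LeCun: variance 1 / fan_in
print(he_scale, lecun_scale)
```

With variance $1/n$ the mean square roughly halves at every ReLU layer, so after 10 layers it is smaller by a factor of about $2^{10}$, whereas with variance $2/n$ it stays on the same order of magnitude as the input.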

Orthogonal initialization

(Saxe et al. 2013)[9] proposed orthogonal initialization: initializing weight matrices as uniformly random (according to the Haar measure) semi-orthogonal matrices, multiplied by a factor that depends on the activation function of the layer. It was designed so that if one initializes a deep linear network this way, then its training time until convergence is independent of depth.[10]

Sampling a uniformly random semi-orthogonal matrix can be done by initializing $X$ with entries sampled IID from a standard normal distribution, then computing $\left(XX^{\top}\right)^{-1/2}X$ or its transpose, depending on whether $X$ is tall or wide.[11]
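
A dependency-free sketch of such a sampler follows. Note that it swaps in a different technique: instead of the inverse matrix square root, it orthonormalizes the Gaussian rows with the Gram-Schmidt process, which for Gaussian input also yields a uniformly distributed semi-orthogonal matrix. The function name is hypothetical, and only the wide case is handled; a tall matrix can be obtained by transposing the result.

```python
import math
import random

def random_semi_orthogonal(rows, cols, rng):
    """Return a rows x cols matrix with orthonormal rows (requires rows <= cols)."""
    if rows > cols:
        raise ValueError("sample the transposed shape and transpose the result")
    basis = []
    for _ in range(rows):
        v = [rng.gauss(0.0, 1.0) for _ in range(cols)]
        # Remove components along earlier rows; two passes improve numerical stability.
        for _ in range(2):
            for q in basis:
                dot = sum(a * b for a, b in zip(v, q))
                v = [a - dot * b for a, b in zip(v, q)]
        norm = math.sqrt(sum(a * a for a in v))
        basis.append([a / norm for a in v])
    return basis

rng = random.Random(0)
Q = random_semi_orthogonal(3, 5, rng)

# Q Q^T should be the 3 x 3 identity up to rounding error.
gram = [[sum(a * b for a, b in zip(Q[i], Q[j])) for j in range(3)] for i in range(3)]
print(gram)
```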

For CNN kernels with odd widths and heights, orthogonal initialization is done this way: initialize the central point by a semi-orthogonal matrix, and fill the other entries with zero. As an illustration, a kernel $K$ of shape $3 \times 3 \times c \times c'$ is initialized by filling $K[2,2,:,:]$ with the entries of a random semi-orthogonal matrix of shape $c \times c'$, and the other entries with zero. (Balduzzi et al., 2017)[12] used it with stride 1 and zero-padding. This is sometimes called the Orthogonal Delta initialization.[11][13]
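
The kernel construction can be sketched with nested lists as follows. The function name is hypothetical, and a fixed 2 x 2 rotation matrix stands in for the random semi-orthogonal matrix to keep the example short.

```python
def delta_orthogonal_kernel(Q, height=3, width=3):
    """Build a height x width x c x c' kernel (nested lists) that is zero
    everywhere except the central spatial position, which holds Q."""
    if height % 2 == 0 or width % 2 == 0:
        raise ValueError("kernel height and width must be odd")
    c, c_out = len(Q), len(Q[0])
    kernel = [[[[0.0] * c_out for _ in range(c)] for _ in range(width)]
              for _ in range(height)]
    # Zero-based centre index; this is K[2, 2, :, :] in one-based notation.
    kernel[height // 2][width // 2] = [row[:] for row in Q]
    return kernel

# Stand-in for a random semi-orthogonal matrix: a 2 x 2 rotation (orthogonal).
Q = [[0.6, -0.8], [0.8, 0.6]]
K = delta_orthogonal_kernel(Q)
print(K[1][1])
```

With stride 1 and zero-padding, a convolution with this kernel applies $Q$ independently at every spatial position, so the layer initially behaves like a per-pixel orthogonal map.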

Related to this approach, unitary initialization proposes to parameterize the weight matrices to be unitary matrices, with the result that at initialization they are random unitary matrices (and throughout training, they remain unitary). This is found to improve long-sequence modelling in LSTM.[14][15]

Orthogonal initialization has been generalized to layer-sequential unit-variance (LSUV) initialization. It is a data-dependent initialization method, and can be used in convolutional neural networks. It first initializes weights of each convolution or fully connected layer with orthonormal matrices. Then, proceeding from the first to the last layer, it runs a forward pass on a random minibatch, and divides the layer's weights by the standard deviation of its output, so that its output has variance approximately 1.[16][17]
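
The data-dependent rescaling step can be sketched as below. This is a simplified illustration under stated assumptions: it uses fully connected ReLU layers, starts from plain Gaussian weights rather than orthonormal matrices, and performs a single rescaling pass per layer. All function names are hypothetical.

```python
import math
import random

def matmul(X, W):
    """X is batch x n_in, W is n_in x n_out."""
    return [[sum(x[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]
            for x in X]

def std_of(M):
    vals = [v for row in M for v in row]
    mean = sum(vals) / len(vals)
    return math.sqrt(sum((v - mean) ** 2 for v in vals) / len(vals))

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def lsuv_rescale(weights, batch):
    """Rescale each layer, first to last, so its output on `batch` has unit variance."""
    h, scaled = batch, []
    for W in weights:
        s = std_of(matmul(h, W))
        W = [[w / s for w in row] for row in W]
        scaled.append(W)
        h = relu(matmul(h, W))  # feed the rescaled output to the next layer
    return scaled

def layer_output_stds(weights, batch):
    """Standard deviation of each layer's pre-activation output on `batch`."""
    h, stds = batch, []
    for W in weights:
        out = matmul(h, W)
        stds.append(std_of(out))
        h = relu(out)
    return stds

rng = random.Random(0)
sizes = [8, 16, 16, 4]
weights = [[[rng.gauss(0.0, 1.0) for _ in range(n_out)] for _ in range(n_in)]
           for n_in, n_out in zip(sizes, sizes[1:])]
batch = [[rng.gauss(0.0, 1.0) for _ in range(sizes[0])] for _ in range(32)]

before = layer_output_stds(weights, batch)
after = layer_output_stds(lsuv_rescale(weights, batch), batch)
print(before, after)
```

Because each layer is linear in its weights, dividing by the measured standard deviation makes the output variance on that minibatch equal to 1 up to rounding.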

Fixup initialization

In 2015, the introduction of residual connections allowed very deep neural networks to be trained, much deeper than the ~20 layers of the previous state of the art (such as the VGG-19). Residual connections gave rise to their own weight initialization problems and strategies. These are sometimes called "normalization-free" methods, since using residual connection could stabilize the training of a deep neural network so much that normalizations become unnecessary.

Fixup initialization is designed specifically for networks with residual connections and without batch normalization, as follows:[18]

  1. Initialize the classification layer and the last layer of each residual branch to 0.
  2. Initialize every other layer using a standard method (such as He initialization), and scale only the weight layers inside residual branches by $L^{-\frac{1}{2m-2}}$, where $L$ is the number of residual branches and $m$ is the number of layers inside each branch.
  3. Add a scalar multiplier (initialized at 1) in every branch and a scalar bias (initialized at 0) before each convolution, linear, and element-wise activation layer.
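
The scaling factor in step 2 is easy to compute. The sketch below takes $L$ to be the number of residual branches and $m$ the number of layers inside each branch, as in the Fixup paper; the function name and the example sizes are illustrative assumptions.

```python
def fixup_scale(num_branches, layers_per_branch):
    """Scaling factor L ** (-1 / (2m - 2)) for weight layers inside residual branches."""
    if layers_per_branch < 2:
        # The exponent divides by 2m - 2, so this sketch only covers m >= 2.
        raise ValueError("the exponent is undefined for m < 2")
    return num_branches ** (-1.0 / (2 * layers_per_branch - 2))

scale_two_layer = fixup_scale(16, 2)    # 16 ** (-1/2)
scale_three_layer = fixup_scale(16, 3)  # 16 ** (-1/4)
print(scale_two_layer, scale_three_layer)
```

Deeper networks (larger $L$) therefore start with smaller weights inside each branch, which keeps the combined update from all branches bounded.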

Similarly, T-Fixup initialization is designed for Transformers without layer normalization.[19]: 9 

Others

Instead of initializing all weights with random values on the order of $O(1/\sqrt{n})$, sparse initialization initializes only a small subset of the weights with larger random values and sets the other weights to zero, so that the total variance is still on the order of $O(1)$.[20]
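
A minimal sketch of sparse initialization follows. The function name, the choice of 15 nonzero incoming weights per unit, and the unit standard deviation are illustrative assumptions rather than values fixed by the text above.

```python
import random

def sparse_init(fan_in, fan_out, nonzero_per_unit, rng, std=1.0):
    """fan_in x fan_out matrix in which each output unit (column) gets exactly
    `nonzero_per_unit` nonzero incoming weights drawn from N(0, std^2)."""
    W = [[0.0] * fan_out for _ in range(fan_in)]
    for j in range(fan_out):
        # Pick which inputs connect to unit j, then draw those weights.
        for i in rng.sample(range(fan_in), nonzero_per_unit):
            W[i][j] = rng.gauss(0.0, std)
    return W

rng = random.Random(0)
W = sparse_init(fan_in=100, fan_out=20, nonzero_per_unit=15, rng=rng)
nonzeros_per_column = [sum(1 for i in range(100) if W[i][j] != 0.0)
                       for j in range(20)]
print(nonzeros_per_column)
```

Since the number of nonzero inputs per unit is fixed, the variance of each unit's pre-activation does not grow with the layer width.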

Random walk initialization was designed for MLP so that during backpropagation, the L2 norm of gradient at each layer performs an unbiased random walk as one moves from the last layer to the first.[21]

Looks linear initialization was designed to allow the neural network to behave like a deep linear network at initialization, since $W\,\mathrm{ReLU}(x) - W\,\mathrm{ReLU}(-x) = Wx$. It initializes a matrix $W$ of shape $\mathbb{R}^{\frac{n}{2} \times m}$ by any method, such as orthogonal initialization, then sets the $\mathbb{R}^{n \times m}$ weight matrix to be the concatenation of $W, -W$.[22]
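
The identity behind this method can be verified directly. In the sketch below the small matrix `W`, the input `x`, and the row-by-row layout of the mirrored matrix are arbitrary illustrative choices.

```python
def relu(v):
    return [max(0.0, a) for a in v]

def matvec(W, v):
    return [sum(w * a for w, a in zip(row, v)) for row in W]

W = [[1.0, -2.0, 0.5],
     [0.0, 3.0, -1.0]]
x = [0.7, -1.2, 2.0]

# Concatenated features [ReLU(x), ReLU(-x)]
features = relu(x) + relu([-a for a in x])
# Mirrored weights [W, -W], concatenated along the input dimension
W_mirrored = [row + [-w for w in row] for row in W]

looks_linear_output = matvec(W_mirrored, features)  # W ReLU(x) - W ReLU(-x)
linear_output = matvec(W, x)                        # W x
print(looks_linear_output, linear_output)
```

The two outputs agree up to rounding, so at initialization the layer acts as a linear map even though it contains ReLU units.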

Miscellaneous

For the hyperbolic tangent activation function, a particular scaling is sometimes used: $1.7159 \tanh(2x/3)$. This is sometimes called "LeCun's tanh". It was designed so that it maps the interval $[-1,+1]$ to itself, thus ensuring that the overall gain is around 1 in "normal operating conditions", and that $|f''(x)|$ is at its maximum when $x = -1, +1$, which improves convergence at the end of training.[23][5]
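
Both design properties can be checked numerically. In this sketch the function names are hypothetical, and the second derivative is written out analytically as $f''(x) = -2AB^2 \tanh(Bx)\,(1 - \tanh^2(Bx))$ with $A = 1.7159$ and $B = 2/3$.

```python
import math

A, B = 1.7159, 2.0 / 3.0

def lecun_tanh(x):
    return A * math.tanh(B * x)

def second_derivative(x):
    """Analytic f''(x) for f(x) = A * tanh(B * x)."""
    t = math.tanh(B * x)
    return -2.0 * A * B * B * t * (1.0 - t * t)

value_at_one = lecun_tanh(1.0)
value_at_minus_one = lecun_tanh(-1.0)

# |f''| at x = 1 compared with points on either side.
curvature = {x: abs(second_derivative(x)) for x in (0.5, 1.0, 1.5)}
print(value_at_one, value_at_minus_one, curvature)
```

The endpoints $\pm 1$ map almost exactly to $\pm 1$, and the curvature at $x = 1$ is larger than at the neighbouring sample points on either side.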

In self-normalizing neural networks, the SELU activation function $\mathrm{SELU}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha e^{x} - \alpha & \text{if } x \leq 0 \end{cases}$ with parameters $\lambda \approx 1.0507, \alpha \approx 1.6733$ makes it such that the mean and variance of the output of each layer have $(0,1)$ as an attracting fixed point. This makes initialization less important, though the authors recommend initializing weights randomly with variance $1/n_{l-1}$.[24]
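
The fixed point can be illustrated with a small Monte Carlo experiment: if the pre-activations are standard normal, the SELU outputs should again have mean close to 0 and variance close to 1. This is an informal check using the rounded constants quoted above, with an arbitrary sample size and seed; it does not demonstrate that the fixed point is attracting.

```python
import math
import random

LAMBDA = 1.0507  # rounded constants as quoted in the text
ALPHA = 1.6733

def selu(x):
    return LAMBDA * (x if x > 0 else ALPHA * math.exp(x) - ALPHA)

rng = random.Random(0)
samples = [selu(rng.gauss(0.0, 1.0)) for _ in range(100_000)]
mean = sum(samples) / len(samples)
variance = sum((s - mean) ** 2 for s in samples) / len(samples)
print(mean, variance)
```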

History

Random weight initialization has been used since Frank Rosenblatt's perceptrons. An early work that described weight initialization specifically was (LeCun et al., 1998).[5]

Before the 2010s era of deep learning, it was common to initialize models by "generative pre-training" using an unsupervised learning algorithm that is not backpropagation, as it was difficult to directly train deep neural networks by backpropagation.[25][26] For example, a deep belief network was trained by using contrastive divergence layer by layer, starting from the bottom.[27]

(Martens, 2010)[20] proposed Hessian-free optimization, a quasi-Newton method for directly training deep networks. The work generated considerable excitement that initializing networks without a pre-training phase was possible.[28] However, a 2013 paper demonstrated that with well-chosen hyperparameters, momentum gradient descent with weight initialization was sufficient for training neural networks, without needing either a quasi-Newton method or generative pre-training, a combination that is still in use as of 2024.[29]

Since then, the impact of initialization on tuning the variance has become less important, with methods developed to automatically tune variance, like batch normalization tuning the variance of the forward pass,[30] and momentum-based optimizers tuning the variance of the backward pass.[31]

There is a tension between using careful weight initialization to decrease the need for normalization, and using normalization to decrease the need for careful weight initialization, with each approach having its tradeoffs. For example, batch normalization causes training examples in the minibatch to become dependent, an undesirable trait, while weight initialization is architecture-dependent.[32]

See also

  • Backpropagation
  • Normalization (machine learning)
  • Gradient descent
  • Vanishing gradient problem

References

  1. ^ Le, Quoc V.; Jaitly, Navdeep; Hinton, Geoffrey E. (2015). "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units". arXiv:1504.00941 [cs.NE].
  2. ^ Jozefowicz, Rafal; Zaremba, Wojciech; Sutskever, Ilya (2015-06-01). "An Empirical Exploration of Recurrent Network Architectures". Proceedings of the 32nd International Conference on Machine Learning. PMLR: 2342–2350.
  3. ^ Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). Deep learning. Adaptive computation and machine learning. Cambridge, Massachusetts: The MIT Press. ISBN 978-0-262-03561-3.
  4. ^ Lu, Lu; Shin, Yeonjong; Su, Yanhui; Karniadakis, George Em (2019). "Dying ReLU and Initialization: Theory and Numerical Examples". Communications in Computational Physics. 28 (5): 1671–1706. arXiv:1903.06733. doi:10.4208/cicp.OA-2020-0165.
  5. ^ a b c LeCun, Yann; Bottou, Leon; Orr, Genevieve B.; Müller, Klaus-Robert (1998), Orr, Genevieve B.; Müller, Klaus-Robert (eds.), "Efficient BackProp", Neural Networks: Tricks of the Trade, Berlin, Heidelberg: Springer, pp. 9–50, doi:10.1007/3-540-49430-8_2, ISBN 978-3-540-49430-0, retrieved 2024-10-05
  6. ^ Glorot, Xavier; Bengio, Yoshua (2010-03-31). "Understanding the difficulty of training deep feedforward neural networks". Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 249–256.
  7. ^ Kumar, Siddharth Krishna (2017). "On weight initialization in deep neural networks". arXiv:1704.08863 [cs.LG].
  8. ^ He, Kaiming; Zhang, Xiangyu; Ren, Shaoqing; Sun, Jian (2015). "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". arXiv:1502.01852 [cs.CV].
  9. ^ Saxe, Andrew M.; McClelland, James L.; Ganguli, Surya (2013). "Exact solutions to the nonlinear dynamics of learning in deep linear neural networks". arXiv:1312.6120 [cs.NE].
  10. ^ Hu, Wei; Xiao, Lechao; Pennington, Jeffrey (2020). "Provable Benefit of Orthogonal Initialization in Optimizing Deep Linear Networks". arXiv:2001.05992 [cs.LG].
  11. ^ a b Martens, James; Ballard, Andy; Desjardins, Guillaume; Swirszcz, Grzegorz; Dalibard, Valentin; Sohl-Dickstein, Jascha; Schoenholz, Samuel S. (2021). "Rapid training of deep neural networks without skip connections or normalization layers using Deep Kernel Shaping". arXiv:2110.01765 [cs.LG].
  12. ^ Balduzzi, David; Frean, Marcus; Leary, Lennox; Lewis, J. P.; Ma, Kurt Wan-Duo; McWilliams, Brian (2017-07-17). "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". Proceedings of the 34th International Conference on Machine Learning. PMLR: 342–350.
  13. ^ Xiao, Lechao; Bahri, Yasaman; Sohl-Dickstein, Jascha; Schoenholz, Samuel; Pennington, Jeffrey (2018-07-03). "Dynamical Isometry and a Mean Field Theory of CNNs: How to Train 10,000-Layer Vanilla Convolutional Neural Networks". Proceedings of the 35th International Conference on Machine Learning. PMLR: 5393–5402. arXiv:1806.05393.
  14. ^ Arjovsky, Martin; Shah, Amar; Bengio, Yoshua (2016-06-11). "Unitary Evolution Recurrent Neural Networks". Proceedings of the 33rd International Conference on Machine Learning. PMLR: 1120–1128. arXiv:1511.06464.
  15. ^ Henaff, Mikael; Szlam, Arthur; LeCun, Yann (2017-03-15). "Recurrent Orthogonal Networks and Long-Memory Tasks". arXiv:1602.06662 [cs.NE].
  16. ^ Mishkin, Dmytro; Matas, Jiri (2016-02-19), All you need is a good init, arXiv:1511.06422
  17. ^ Xie, Di; Xiong, Jiang; Pu, Shiliang (2017). All You Need Is Beyond a Good Init: Exploring Better Solution for Training Extremely Deep Convolutional Neural Networks With Orthonormality and Modulation. IEEE Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6176–6185.
  18. ^ Zhang, Hongyi; Dauphin, Yann N.; Ma, Tengyu (2019). "Fixup Initialization: Residual Learning Without Normalization". arXiv:1901.09321 [cs.LG].
  19. ^ Huang, Xiao Shi; Perez, Felipe; Ba, Jimmy; Volkovs, Maksims (2020-11-21). "Improving Transformer Optimization Through Better Initialization". Proceedings of the 37th International Conference on Machine Learning. PMLR: 4475–4483.
  20. ^ a b Martens, James (2010-06-21). "Deep learning via Hessian-free optimization". Proceedings of the 27th International Conference on International Conference on Machine Learning. ICML'10. Madison, WI, USA: Omnipress: 735–742. ISBN 978-1-60558-907-7.
  21. ^ Sussillo, David; Abbott, L. F. (2014). "Random Walk Initialization for Training Very Deep Feedforward Networks". arXiv:1412.6558 [cs.NE].
  22. ^ Balduzzi, David; Frean, Marcus; Leary, Lennox; Lewis, JP; Kurt Wan-Duo Ma; McWilliams, Brian (2017). "The Shattered Gradients Problem: If resnets are the answer, then what is the question?". arXiv:1702.08591 [cs.NE].
  23. ^ LeCun, Y. (1989). "Generalization and network design strategies" (PDF). In Pfeifer, R.; Schreter, Z.; Fogelman, F.; Steels, L. (eds.). Connectionism in Perspective: Proceedings of the International Conference Connectionism in Perspective, University of Zurich, 10–13 October 1988. Amsterdam: Elsevier.
  24. ^ Klambauer, Günter; Unterthiner, Thomas; Mayr, Andreas; Hochreiter, Sepp (2017). "Self-Normalizing Neural Networks". Advances in Neural Information Processing Systems. 30. Curran Associates, Inc.
  25. ^ Bengio, Y. (2009). "Learning Deep Architectures for AI" (PDF). Foundations and Trends in Machine Learning. 2: 1–127. CiteSeerX 10.1.1.701.9550. doi:10.1561/2200000006.
  26. ^ Erhan, Dumitru; Courville, Aaron; Bengio, Yoshua; Vincent, Pascal (2010-03-31). "Why Does Unsupervised Pre-training Help Deep Learning?". Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 201–208.
  27. ^ Bengio, Yoshua; Lamblin, Pascal; Popovici, Dan; Larochelle, Hugo (2006). "Greedy Layer-Wise Training of Deep Networks". Advances in Neural Information Processing Systems. 19. MIT Press.
  28. ^ Glorot, Xavier; Bordes, Antoine; Bengio, Yoshua (2011-06-14). "Deep Sparse Rectifier Neural Networks". Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. JMLR Workshop and Conference Proceedings: 315–323.
  29. ^ Sutskever, Ilya; Martens, James; Dahl, George; Hinton, Geoffrey (2013-05-26). "On the importance of initialization and momentum in deep learning" (PDF). Proceedings of the 30th International Conference on Machine Learning. PMLR: 1139–1147.
  30. ^ Bjorck, Nils; Gomes, Carla P; Selman, Bart; Weinberger, Kilian Q (2018). "Understanding Batch Normalization". Advances in Neural Information Processing Systems. 31. Curran Associates, Inc. arXiv:1806.02375.
  31. ^ Balles, Lukas; Hennig, Philipp (2018-07-03). "Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients". Proceedings of the 35th International Conference on Machine Learning. PMLR: 404–413. arXiv:1705.07774.
  32. ^ Brock, Andrew; De, Soham; Smith, Samuel L.; Simonyan, Karen (2021). "High-Performance Large-Scale Image Recognition Without Normalization". arXiv:2102.06171 [cs.CV].

Further reading

  • Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "8.4 Parameter Initialization Strategies". Deep learning. Adaptive computation and machine learning. Cambridge, Mass: The MIT press. ISBN 978-0-262-03561-3.
  • Narkhede, Meenal V.; Bartakke, Prashant P.; Sutaone, Mukul S. (June 28, 2021). "A review on weight initialization strategies for neural networks". Artificial Intelligence Review. 55 (1). Springer Science and Business Media LLC: 291–322. doi:10.1007/s10462-021-10033-z. ISSN 0269-2821.
url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url 
url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url 
url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url url
Service Center

UNIVERSITAS TEKNOKRAT INDONESIA | ASEAN's Best Private University
Jl. ZA. Pagar Alam No.9 -11, Labuhan Ratu, Kec. Kedaton, Kota Bandar Lampung, Lampung 35132
Phone: (0721) 702022
Email: pmb@teknokrat.ac.id