Representation Learning: A Review and New Perspectives
09/04/2024

Abstract

The success of machine learning algorithms generally depends on data representation, and we hypothesize that this is because different representations can entangle and hide more or less the different explanatory factors of variation behind the data. (A good data representation helps the model learn factors that would otherwise stay entangled and hidden.)

Index Terms: Deep learning, representation learning, feature learning, unsupervised learning, likelihood, Boltzmann machine, autoencoder, neural nets, distributed representation

Introduction

The performance of machine learning methods is heavily dependent on the choice of data representation (or features) on which they are applied. An AI must fundamentally understand the world around us, and we argue that this can only be achieved if it can learn to identify and disentangle the underlying explanatory factors hidden in the observed milieu of low-level sensory data.

We consider some of the fundamental questions that have been driving research in this area. Specifically, what makes one representation better than another?

Why should we care about learning representation?

Neural net language models are all based on learning a distributed representation for each word, called a word embedding. Learning word embeddings can be combined with learning image representations in a way that allows text and images to be associated. Example applications include image captioning and text-based image search.

In transfer learning, we hypothesize that representation learning algorithms have an advantage for such tasks because they learn representations that capture underlying factors.

Figure: Illustration of representation learning discovering explanatory factors

Most impressive are the two transfer learning challenges held in 2011 and won by representation learning algorithms.

What makes a representation good?
Priors for Representation Learning in AI

One reason why explicitly dealing with representations is interesting is because they can be convenient to express many general priors about the world around us. Two examples:

Consecutive or spatially nearby observations tend to be associated with the same value of relevant categorical concepts. (Embedding?) (Tolerating small changes across consecutive data leads to more natural and more stable learned representations.)

In good high-level representations, the factors are related to each other through simple, typically linear dependencies. (Does this mean that high-level representations are modeled under the assumption that the relationships between them are linear?)

Depth and abstraction

Deep architectures can lead to abstract representations because more abstract concepts can often be constructed in terms of less abstract ones. (High-level abstractions are learned on top of low-level abstractions.)

More abstract concepts are generally invariant to most local changes of the input. (Abstracting consistently despite local changes in the input improves generalization. In a cat image, the position, size, or color of the cat may vary, but it is still a cat.)

Disentangling Factors of Variation

How can we cope with these complex interactions? How can we disentangle the objects and their shadows? Ultimately, we believe the approach we adopt for overcoming these challenges must leverage the data itself, using vast quantities of unlabeled examples, to learn representations that separate the various explanatory sources (unsupervised learning). (The goal is to preserve information while separating it.)

It is often difficult to determine a priori which set of features and variations will ultimately be relevant to the task at hand. (feature engineering)

Building Deep Representations

It was empirically observed that layerwise stacking of feature extraction often yielded better representations.

Alternatively, the outputs of the previous layer can be fed as extra inputs for the next layer (in addition to the raw input), as successfully done in Yu et al. (2010).
(Compare skip connections and residual connections.)

This joint training has brought substantial improvements, both in terms of likelihood and in terms of classification performance of the resulting deep feature learner. (Multiple losses are combined, e.g. by summing a feature-learning loss and a classification loss.)

The deep auto-encoder can then be jointly trained, with all the parameters optimized with respect to a global reconstruction error criterion. (Decreasing the reconstruction error corresponds to increasing the likelihood, e.g. squared error under a Gaussian noise model.)

Single-Layer Learning Modules

Even when two models share the same architecture, the meaning of the model can differ depending on how the hidden units are interpreted (probabilistic perspective, as in a VAE, or computational perspective, as in an AE).

The expressive power of linear features is very limited: they cannot be stacked to form deeper, more abstract representations since the composition of linear operations yields another linear operation. (Stacking several linear layers still gives a single linear transformation.)

\[ y = W_{3} \cdot W_{2} \cdot W_{1}x = Wx, \quad W = W_{3} W_{2} W_{1} \]

To extract non-linear features, simply insert a non-linearity between learned single-layer linear projections.

Probabilistic models

From the probabilistic modeling perspective, the question of feature learning can be interpreted as an attempt to recover a parsimonious set of latent variables that describe a distribution over the observed data. (Intuition: a normally distributed latent variable is a vector that is mapped so as to explain the original data well.)

Explaining away

A priori independent causes of an event can become non-independent given the observation of the event. Example: an alarm that can be triggered either by a burglary or by an earthquake. When the alarm goes off, both a burglary and an earthquake are plausible at first. Once a burglary is confirmed, the probability of an earthquake drops sharply. Conversely, once an earthquake is confirmed, the probability of a burglary drops sharply.
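The explaining-away effect in the alarm example can be checked numerically. The following is a minimal sketch, assuming made-up prior probabilities and a made-up conditional probability table for the alarm (none of these numbers come from the paper). The names `p_alarm`, `joint`, and `posterior_earthquake` are hypothetical helpers. It computes the posterior probability of an earthquake by enumerating the joint distribution.

```python
from itertools import product

# Hypothetical priors; the notes above give no numbers.
P_BURGLARY = 0.01
P_EARTHQUAKE = 0.02


def p_alarm(burglary: bool, earthquake: bool) -> float:
    """Assumed probability that the alarm rings given its two possible causes."""
    if burglary and earthquake:
        return 0.97
    if burglary:
        return 0.94
    if earthquake:
        return 0.30
    return 0.001


def joint(burglary: bool, earthquake: bool, alarm: bool) -> float:
    """P(B, E, A), with burglary and earthquake independent a priori."""
    pb = P_BURGLARY if burglary else 1 - P_BURGLARY
    pe = P_EARTHQUAKE if earthquake else 1 - P_EARTHQUAKE
    pa = p_alarm(burglary, earthquake)
    return pb * pe * (pa if alarm else 1 - pa)


def posterior_earthquake(alarm: bool, burglary=None) -> float:
    """P(E = true | A = alarm), optionally also conditioning on B = burglary."""
    numerator = denominator = 0.0
    for b, e in product([False, True], repeat=2):
        if burglary is not None and b != burglary:
            continue  # skip worlds that contradict the observed burglary value
        p = joint(b, e, alarm)
        denominator += p
        if e:
            numerator += p
    return numerator / denominator


p_e_given_alarm = posterior_earthquake(alarm=True)
p_e_given_alarm_and_burglary = posterior_earthquake(alarm=True, burglary=True)

print(f"P(E)        = {P_EARTHQUAKE:.4f}")
print(f"P(E | A)    = {p_e_given_alarm:.4f}")
print(f"P(E | A, B) = {p_e_given_alarm_and_burglary:.4f}")
```

With these assumed numbers, observing the alarm raises the probability of an earthquake above its prior, and additionally observing the burglary pushes it back down: the burglary "explains away" the alarm.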
Sparse Coding

Directly learning a parametric map from input to representation

Auto-Encoders

A non-probabilistic method for mapping the input to features. By transforming the input into a low-dimensional representation (latent space) and then reconstructing it, the auto-encoder directly learns a function that turns input data into features.

\[ \begin{aligned} h &= f_{\theta}(x), & r &= g_{\phi}(h) \\ f_{\theta}(x) &= \sigma(b + Wx), & g_{\phi}(h) &= \sigma(b' + W'h) \end{aligned} \]

Good generalization means low reconstruction error \( L(x, r) \) at test examples. To capture the structure of the data-generating distribution, it is therefore important that the parametrization prevents the auto-encoder from learning the identity function \(I\), which has zero reconstruction error everywhere.

\[ g_{\phi}(f_{\theta}(x)) = I(x) = x \]

One way to prevent this is to make the latent space lower-dimensional than the input, i.e. a bottleneck structure with \( d_{h} < d_{x} \).

Figure: Basic auto-encoder structure that prevents learning the identity function

In summary, basic auto-encoder training consists in finding values of the parameters \(\theta\) and \(\phi\) that minimize the reconstruction error.

\[ \mathcal{J}_{AE}(\theta, \phi) = \sum_{t} L\left(x^{(t)}, \,\, g_{\phi}(\,f_{\theta}(\,x^{(t)}\,)\,)\right) \]

Sparse Auto-Encoders

Sparse auto-encoders tend to favour overcomplete representations, i.e. \( d_{h} > d_{x} \). (Sharing the weights between the encoder and the decoder, i.e. tied weights, limits model complexity.)

Sparsity in the representation can be achieved by penalizing the hidden unit biases or by directly penalizing the output of the hidden unit activations, for example with a penalty on each hidden unit \(h^{(j)}\) of the form:

\[ \log\left(1 + (h^{(j)})^{2}\right) \]

Denoising Auto-Encoders

The DAE learns to reconstruct the clean input from a corrupted version. Learning the identity is no longer enough: the learner must capture the structure of the input distribution in order to optimally undo the effect of the corruption process, i.e.
the DAE learns a reconstruction function.

Figure: Illustration of DAE

Formally, the objective optimized by a DAE is:

\[ \mathcal{J}_{DAE} = \sum_{t} \mathbb{E}_{q(\tilde{x} \,|\, x^{(t)})} \left[ L\left( x^{(t)}, \,\, g_\phi \,(f_\theta \,(\tilde{x})) \right) \right] \]

where \( \mathbb{E}_{q(\tilde{x}|x^{(t)})} [\cdot] \) averages over corrupted examples \( \tilde{x} \) drawn from the corruption process \( q(\tilde{x}|x^{(t)}) \). (In practice, the expectation is approximated by drawing a corrupted sample \(\tilde{x}\) for each \(x^{(t)}\), computing the loss between the clean input \(x^{(t)}\) and the reconstruction \(g_\phi(f_\theta(\tilde{x}))\), and averaging over all samples in the batch.)

Common corruption processes:

Gaussian noise adds noise drawn from a Gaussian distribution.
Salt-and-pepper noise randomly sets some pixels to 0 or 1 (for gray-scale images).
Masking noise randomly sets chosen inputs to 0.

Contractive Auto-Encoders

By adding a contractive penalty (the squared Frobenius norm of the encoder's Jacobian), the CAE learns representations that are insensitive to small changes in the input. (The Frobenius norm measures the magnitude of the encoder's Jacobian matrix. The Jacobian matrix captures how much the encoder output changes under small changes in the input \(x\).)

The Jacobian matrix \(J(f)\) consists of the partial derivatives of each output variable with respect to each input variable, and so shows how each output changes as each input changes.
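As a concrete check of the Jacobian described above, here is a minimal sketch, assuming a sigmoid encoder \(f_\theta(x) = \sigma(b + Wx)\) as in the basic auto-encoder. The toy weights, the helper names (`encode`, `encoder_jacobian`, `frobenius_sq`, `numeric_jacobian`), and the plain-list representation are illustrative assumptions, not something taken from the paper. For this encoder the Jacobian has the closed form \(J_{ji} = h_j (1 - h_j) W_{ji}\), which the code compares against central finite differences before computing the squared Frobenius norm.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def encode(W, b, x):
    """Sigmoid encoder h = sigma(b + W x); W is a list of rows."""
    return [sigmoid(b_j + sum(w * x_i for w, x_i in zip(row, x)))
            for row, b_j in zip(W, b)]


def encoder_jacobian(W, b, x):
    """Analytic Jacobian of the sigmoid encoder: J[j][i] = h_j (1 - h_j) W[j][i]."""
    h = encode(W, b, x)
    return [[h_j * (1.0 - h_j) * w for w in row] for h_j, row in zip(h, W)]


def frobenius_sq(J):
    """Squared Frobenius norm: sum of squared entries."""
    return sum(v * v for row in J for v in row)


def numeric_jacobian(W, b, x, eps=1e-6):
    """Central finite differences, used only as a sanity check."""
    m, n = len(W), len(x)
    J = [[0.0] * n for _ in range(m)]
    for i in range(n):
        x_plus, x_minus = list(x), list(x)
        x_plus[i] += eps
        x_minus[i] -= eps
        h_plus, h_minus = encode(W, b, x_plus), encode(W, b, x_minus)
        for j in range(m):
            J[j][i] = (h_plus[j] - h_minus[j]) / (2 * eps)
    return J


# Hypothetical toy parameters: 3 inputs, 2 hidden units.
W = [[0.5, -1.0, 0.25], [2.0, 0.0, -0.5]]
b = [0.1, -0.2]
x = [1.0, 2.0, -1.0]

J = encoder_jacobian(W, b, x)
J_num = numeric_jacobian(W, b, x)
max_err = max(abs(p - q) for rp, rq in zip(J, J_num) for p, q in zip(rp, rq))
penalty = frobenius_sq(J)

# Larger weights push the units toward saturation, where the sigmoid
# derivative h (1 - h) is close to zero, so the penalty shrinks.
W_big = [[10 * w for w in row] for row in W]
penalty_saturated = frobenius_sq(encoder_jacobian(W_big, b, x))

print(f"max |analytic - numeric| = {max_err:.2e}")
print(f"penalty = {penalty:.4f}, saturated penalty = {penalty_saturated:.2e}")
```

The closed form avoids building the Jacobian by differentiation at every step, which is one reason the penalty is cheap to evaluate for a single sigmoid layer.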
For a function \(f\) with inputs \(x_1, \dots, x_n\) and outputs \(y_1, \dots, y_m\), the Jacobian matrix is:

\[ J(f) = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} & \cdots & \frac{\partial y_2}{\partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \frac{\partial y_m}{\partial x_2} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \]

The CAE's training objective is

\[ \mathcal{J}_{CAE} = \sum_t \left[ L\left(x^{(t)}, \,\, g_\phi(f_\theta(x^{(t)}))\right) + \lambda \, \left\| J(f_\theta(x^{(t)})) \right\|_F^2 \right] \]

where the Frobenius norm of a matrix \(A\) is

\[ \|A\|_F = \sqrt{\sum_{i,j} |a_{ij}|^2} \]

(In the CAE, the Jacobian matrix describes how much the encoder output \(f_\theta(x)\) changes with respect to the input \(x\), so the term that minimizes the Frobenius norm of \(J(f_\theta(x^{(t)}))\) pushes the encoder output to change as little as possible under small changes of the input.)

A potential disadvantage of the CAE is that its penalty only accounts for very small input changes and may not account for larger or more diverse ones. (This is remedied in CAE+H.)

The representation learned by the CAE tends to be saturated rather than sparse, i.e., most hidden units are near the extremes of their range (e.g. 0 or 1 after passing the sigmoid). (A saturated unit is one whose activation output is very close to 0 or 1, for a sigmoid. In this regime a small change in the input produces almost no change in the output, because the gradient is close to 0. A saturated unit therefore does not reflect fine-grained changes in the input and is insensitive to them. In a cat image, the position, size, or color of the cat may vary, but it is still a cat.)

Representation learning as manifold learning

Another important perspective on representation learning is based on the geometric notion of manifold. Its premise is the manifold hypothesis, according to which real-world data presented in high dimensional spaces are expected to concentrate in the vicinity of a manifold (loosely, a low-dimensional surface) \(\mathcal{M} \) of much lower dimensionality \(d_{\mathcal{M}}\) embedded in high dimensional input space \(\mathbb{R}^{d_x}\).
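To make the manifold hypothesis concrete, here is a minimal sketch on synthetic data, assuming the simplest possible case: a linear manifold with \(d_{\mathcal{M}} = 1\) (a line) embedded in \(\mathbb{R}^{d_x}\) with \(d_x = 2\), plus small off-manifold noise. The orientation, noise level, and sample count are arbitrary assumptions. The code recovers the dominant direction from the closed-form eigen-decomposition of the 2x2 sample covariance matrix and uses the projection onto it as an intrinsic coordinate.

```python
import math
import random

random.seed(0)

# Assumed toy setup: 2 observed dimensions, 1 intrinsic dimension.
THETA = math.radians(30)                      # orientation of the line (hypothetical)
DIRECTION = (math.cos(THETA), math.sin(THETA))
NOISE_STD = 0.05                              # small off-manifold noise (hypothetical)

latents = [random.uniform(-1.0, 1.0) for _ in range(500)]
points = [(t * DIRECTION[0] + random.gauss(0.0, NOISE_STD),
           t * DIRECTION[1] + random.gauss(0.0, NOISE_STD)) for t in latents]

# Sample covariance matrix [[a, b], [b, c]] of the centred data.
n = len(points)
mean_x = sum(p[0] for p in points) / n
mean_y = sum(p[1] for p in points) / n
centred = [(p[0] - mean_x, p[1] - mean_y) for p in points]
a = sum(u * u for u, _ in centred) / n
b = sum(u * v for u, v in centred) / n
c = sum(v * v for _, v in centred) / n

# Closed-form eigenvalues of a symmetric 2x2 matrix.
half_trace = (a + c) / 2.0
radius = math.sqrt(((a - c) / 2.0) ** 2 + b ** 2)
lambda_top, lambda_bottom = half_trace + radius, half_trace - radius
explained = lambda_top / (lambda_top + lambda_bottom)

# Leading eigenvector (b, lambda_top - a), normalised to unit length.
vx, vy = b, lambda_top - a
norm = math.hypot(vx, vy)
principal = (vx / norm, vy / norm)

# Intrinsic coordinate of each point: its projection on the leading direction.
coords = [u * principal[0] + v * principal[1] for u, v in centred]
alignment = abs(principal[0] * DIRECTION[0] + principal[1] * DIRECTION[1])

print(f"variance explained by one direction: {explained:.3f}")
print(f"|cos| between recovered and true direction: {alignment:.3f}")
```

Almost all of the variance lies along a single direction, so one coordinate per point is enough to locate it on the manifold. Curved manifolds need a non-linear encoder instead of this linear projection.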
Figure: Meaningful representation learned by the encoder of auto-encoders

With this perspective, the primary unsupervised learning task is then seen as modeling the structure of the data-supporting manifold. The associated representation being learned can be associated with an intrinsic coordinate system on the embedded manifold.

One example is Principal Component Analysis (PCA), which models a linear manifold. It was initially devised with the objective of finding the closest linear manifold to a cloud of data points. The principal components, i.e. the representation \(f_{\theta}(x)\) that PCA yields for an input point \(x\), uniquely locates its projection on that manifold: it corresponds to intrinsic coordinates on the manifold. (Because PCA models a linear manifold, it transforms the data in a fixed linear way and often fails to represent data with non-linear structure adequately.)

A parametrized neural network architecture simultaneously learns a manifold embedding and a classifier. The training criterion encourages training set neighbors to have similar representations.

Connections between probabilistic and direct encoding models

Direct encoding models focus on the encoding process, which transforms input data into latent representations, whereas probabilistic models focus on the decoding process \(P(x \,| \,h, \,\theta)\), which reconstructs data from latent representations.

(Reconstruction error on the test set can serve as a measure of feature-learning performance, but it becomes less reliable when model capacity is large. Because DAEs and CAEs are trained with noise or a regularization penalty, a larger model and a longer training time do not always mean better results.)

Global Training of deep models

Higher-level abstraction means more non-linearity. The hard part is learning a good representation that does this unfolding. (For auto-encoders, the encoder corresponds to the unfolding process and the decoder corresponds to the folding process.)

It is interesting to ask why the layerwise unsupervised pre-training procedure sometimes helps a supervised learner. (It learns intermediate representations: by adding hidden layers rather than relying on single-layer learning, each layer learns more abstract, higher-level patterns.)
Building-in invariance

(Integrating domain knowledge with ML.)

Generalization performance is usually improved by providing a larger quantity of representative data. This can be achieved by generating new examples by applying small random deformations to the original training examples, using deformations that are known not to change the target variables of interest. Convolution and pooling play an important role in learning invariance. (The goal of learning invariance is to help disentangle the important factors in the data.)

Conclusion

Practical Concerns and Guidelines

One of the criticisms addressed to artificial neural networks and deep learning algorithms is that they have many hyper-parameters and variants and that exploring their configurations and architectures is an art (it relies on know-how and experience). Recent work on automating hyper-parameter search is also making it more convenient, efficient and reproducible.

Incorporating Generic AI-level Priors

It is important to integrate generalizable priors or domain knowledge with ML in order to separate the important factors in the data and to improve explainability. For images, for example, the assumption that an object remains the same object when it is slightly rotated can be a prior that an image model should learn (improving generalization).
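The deformation-based augmentation described under "Building-in invariance" can be sketched as follows. This is a minimal illustration, assuming small random translations as the label-preserving deformation (the notes mention slight rotations, which would need interpolation and are left out here). The helpers `shift_image` and `augment`, the 4x4 toy image, and the label string are all hypothetical.

```python
import random


def shift_image(image, dx, dy, fill=0):
    """Translate a 2-D image (list of rows) by (dx, dy) pixels, padding with `fill`."""
    height, width = len(image), len(image[0])
    shifted = [[fill] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            src_y, src_x = y - dy, x - dx
            if 0 <= src_y < height and 0 <= src_x < width:
                shifted[y][x] = image[src_y][src_x]
    return shifted


def augment(image, label, n_copies, max_shift=1, rng=None):
    """Build (image, label) pairs from small random translations.

    The label is copied unchanged: the deformation is assumed not to
    affect the target variable.
    """
    rng = rng or random.Random()
    samples = []
    for _ in range(n_copies):
        dx = rng.randint(-max_shift, max_shift)
        dy = rng.randint(-max_shift, max_shift)
        samples.append((shift_image(image, dx, dy), label))
    return samples


# Hypothetical 4x4 binary image with a 2x2 "object" in the centre.
image = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
augmented = augment(image, label="square", n_copies=5, rng=random.Random(0))

for new_image, new_label in augmented:
    print(new_label, new_image)
```

Keeping `max_shift` no larger than the empty border ensures the object is never cut off, so the assumption that the label is unchanged holds for every generated example.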