Jekyll2018-08-19T21:31:12+00:00https://zjiayao.github.io/Jiayao’s BlogB.Eng. in Computer Science (Class of 2019) The University of Hong Kong Jiayao J. ZhangTransfer Learning via Geodesic Sampling2017-11-24T00:00:00+00:002017-11-24T00:00:00+00:00https://zjiayao.github.io/blog/2017/geodesic-sampling<h3 id="introduction"><strong>Introduction</strong></h3> <p>In our recent magazine paper (under review), one topic of interest is subspace-aided transfer learning. In this blog, we give an overview of the method we studied, with illustrations that are not included in the paper. This blog will nonetheless not go into mathematical details; rather, we ask motivated readers to bear with the partial presentation and postpone the enjoyment of the full derivation until our paper is ready. We hope our vivid illustrations and animations may partially explain why this seemingly ad-hoc method works <em>a posteriori</em>.</p> <p>More introductory texts on transfer learning can be found in <a href="https://zjiayao.github.io/blog/2017/cvpr17-domain-adaptation/">one of my earlier blogs</a>. We should remark that the terms “domain adaptation” and “transfer learning” are not exactly equivalent in a mathematical sense: subtle differences exist, which amount to different assumptions on the underlying latent distribution. We do not, nonetheless, differentiate them in this rather casual writing.</p> <h3 id="dataset"><strong>Dataset</strong></h3> <p>As a general setup, there are several popular datasets tailored for benchmarking transfer learning. We select the celebrated Office dataset <a href="#Saenko:2010:Office">[Saenko et al. 2010]</a> augmented by Caltech-256 <a href="#Griffin:2007:Caltech">[Griffin et al. 2007]</a>. 
These datasets have been used for many years and remain among the most popular choices.</p> <p>The Office dataset contains a handful of categories of objects whose images are collected either from an online merchant (‘amazon’), with a DSLR camera (‘dslr’), or with a webcam (‘webcam’). The Caltech-256 dataset, as the name suggests, contains 256 categories of objects. Usually, we select a small subset of categories in common, train a supervised model on one domain out of the four, and test on another domain. This procedure is known as unsupervised domain adaptation, since no label in the target domain is known. Alternatively, if we train the model using a few labeled examples from the target domain in addition to the source domain, we are doing semi-supervised domain adaptation.</p> <p>First, let us have a brief overview of several categories across the four domains:</p> <p><strong>Headphone</strong></p> <p><img src="https://zjiayao.github.io/assets/img/tl/headphone.png" alt="headphone" /></p> <p><strong>Projector</strong></p> <p><img src="https://zjiayao.github.io/assets/img/tl/projector.png" alt="projector" /></p> <p>As their names suggest, we see the core disparities between domains: “amazon” has generally better-aligned objects; images in “dslr” are of higher resolution; “webcam”, on the contrary, is much more casual and usually under unbalanced illumination; “caltech” is more in the wild. Each such discrepancy can be considered a new scenario to which we wish our model to generalize well from previously learnt domains.</p> <h2 id="geodesic-sampling"><strong>Geodesic Sampling</strong></h2> <p>The details of the methods we studied and implemented can be found in <a href="#Gopalan:2011:GGF">[Gopalan et al. 2011; Gong et al. 2012]</a>. 
Very roughly speaking, we model the source and target domains of images of the same category as two points on some highly non-linear manifold (namely, a Grassmann manifold), which, by some heuristic argument, may capture the geometric and statistical properties of the set of images by rendering a unified latent representation. Though the manifold is highly non-linear <em>per se</em>, its distinct elements are in fact vector spaces (i.e., linear subspaces); we may as well construct its tangent spaces as vector spaces. Furthermore, to measure the discrepancies between points (or rather, to define the metric over the manifold), we appeal to the notion of principal angles between subspaces. For example, in $\mathcal{R}^2$, each vector is by itself (the basis of) a one-dimensional subspace, and thus the discrepancy between two such spaces can be measured by the cosine of the angle they make. Principal angles are a natural extension of this construction to subspaces of general dimension. The above is a very coarse-grained discussion; we refer interested readers to classical texts on the subject, such as those by Boothby, Edelman, or Absil. Readers are also encouraged to check out our paper for the abundant details omitted here.</p> <p>The geodesic is the shortest constant-velocity curve joining two points (the velocity is an element of the tangent space). By sampling on the geodesic between the source and target domains, it has been heuristically argued and empirically tested that some intermediate points exhibit more consistency in terms of statistical and geometrical properties between source and target domains. 
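<p>To make the two ingredients above concrete, here is a minimal numerical sketch (our own illustration, not the reference implementation of <a href="#Gopalan:2011:GGF">[Gopalan et al. 2011; Gong et al. 2012]</a>): principal angles via the SVD of the product of two orthonormal bases, and a point at parameter $t$ on the geodesic between two subspaces. The function names are ours, and we assume a Euclidean ambient space with all principal angles strictly between $0$ and $\pi/2$ so that $\sin\Theta$ is invertible.</p>

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles between the column spans of A and B (D x d arrays)."""
    QA, _ = np.linalg.qr(A)                         # orthonormal basis of span(A)
    QB, _ = np.linalg.qr(B)
    s = np.linalg.svd(QA.T @ QB, compute_uv=False)  # singular values = cos(theta_i)
    return np.arccos(np.clip(s, 0.0, 1.0))

def geodesic_point(PS, PT, t):
    """A basis for the point at parameter t on the geodesic from span(PS) to span(PT).

    PS, PT are D x d with orthonormal columns; assumes all principal angles
    lie strictly in (0, pi/2).
    """
    D, d = PS.shape
    Q, _ = np.linalg.qr(PS, mode="complete")
    RS = Q[:, d:]                                   # orthonormal complement of PS
    U1, cos_theta, Vt = np.linalg.svd(PS.T @ PT)
    theta = np.arccos(np.clip(cos_theta, 0.0, 1.0))
    # Recover U2 from RS^T PT = -U2 diag(sin(theta)) V^T
    U2 = -RS.T @ PT @ Vt.T @ np.diag(1.0 / np.sin(theta))
    return PS @ U1 @ np.diag(np.cos(t * theta)) - RS @ U2 @ np.diag(np.sin(t * theta))
```

<p>At $t=0$ the returned basis spans the source subspace and at $t=1$ the target; intermediate values of $t$ give the sampled latent subspaces.</p>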
Such samples can thus be viewed as latent representations, and consequently a discriminatively-trained classifier should be able to generalize well on such latent spaces.</p> <p>In fact, suppose we parametrize the geodesic (recall it is simply a curve joining two points) in terms of $t \in [0, 1]$; we can then visualize the change of a <em>single</em> image in the source domain as we traverse the geodesic to another domain. For such a maneuver, we apply a procedure similar to that used in “Eigenfaces”, by way of principal component analysis.</p> <p>Concretely, assume we wish to generalize bikes from “caltech” to “amazon”. We first plot their mean images to get a feel for the two domains:</p> <p><img src="https://zjiayao.github.io/assets/img/tl/mean_cal_amazon.png" alt="mean" /></p> <p>The mean image is simply an average over all images in the domain. We note that the main differences may be the size of the wheels and the orientation of the bikes, while the shared characteristic is the overall structure of the bike.</p> <p><img src="https://zjiayao.github.io/assets/img/tl/caltech_bike.png" alt="sample_bike" /></p> <p>We choose an image from the source domain and visualize its latent representation at five samples along the geodesic:</p> <p><img src="https://zjiayao.github.io/assets/img/tl/caltech_bike_samples.png" alt="samples_bike" /></p> <p>The result shows some interesting features besides the over-saturated pixels: along the geodesic, there is a noticeable deformation of the front wheel, shrinking to the size that mostly appears in the target domain.</p> <p>As another example, consider the computer mouse from “dslr” to “webcam”:</p> <p><img src="https://zjiayao.github.io/assets/img/tl/dslr_mouse.png" alt="sample_mouse" /></p> <p><img src="https://zjiayao.github.io/assets/img/tl/dslr_mouse_samples.png" alt="samples_mouse" /></p> <p>and keyboards of the same source-target pair:</p> <p><img src="https://zjiayao.github.io/assets/img/tl/dslr_keyboard.png" alt="sample_keyboard" /></p> 
<p><img src="https://zjiayao.github.io/assets/img/tl/dslr_keyboard_samples.png" alt="samples_keyboard" /></p> <p>This example illustrates that one of the defining factors of the visual consistency of such a maneuver is the quality of the mean image.</p> <p>In fact, we can animate the whole process; take our now-familiar Caltech bike, for example:</p> <p><img src="https://zjiayao.github.io/assets/img/tl/animated_cal_bike_sample.png" alt="animated_cal_bike_sample" /></p> <p><img src="https://zjiayao.github.io/assets/img/tl/animated_cal_bike.gif" alt="animated_cal_bike" /></p> <p>How about a backpack from “webcam” to “amazon”?</p> <p><img src="https://zjiayao.github.io/assets/img/tl/animated_webcam_backpack_sample.png" alt="animated_webcam_backpack_sample" /></p> <p><img src="https://zjiayao.github.io/assets/img/tl/animated_webcam_backpack.gif" alt="animated_webcam_backpack" /></p> <h2 id="feature-visualization"><strong>Feature Visualization</strong></h2> <p>Going along this direction a little further, we are curious about what the latent features are and how they change along the geodesic. To do this, we follow the procedure of “Eigenfaces” again. Concretely, we animate the top eigenvector of the latent samples along the geodesic of calculators from “dslr” to “amazon”.</p> <p><img src="https://zjiayao.github.io/assets/img/tl/latent_calculator/1.gif" alt="latent_calculator_1" /></p> <h2 id="remark"><strong>Remark.</strong></h2> <p>During a recent talk by D. A. 
Forsyth that I attended, he commented on a general perception in the CV community which I consider to be very true:</p> <blockquote> <p>In the past, we understood the theories very well, but we could not do well; now the models perform very well but we do not know why.</p> </blockquote> <p>I believe the incorporation of deep models with mechanisms inspired by geometrically/physically-based perspectives may lead us to better understand and explore the capacity of deep models, and to better comprehend how to obtain models that really <em>understand</em> the image, not merely perform classification or regression.</p> <p>The experiments in this blog are purely illustrative: we did not even bother to feed HOG or SURF features; rather, we directly used normalized raw images. Nor did we augment the dataset by removing lousy images or localizing objects.</p> <h3 id="references"><strong>References</strong></h3> <ol class="bibliography"><li><span id="Gong:2012:GFK"><span style="font-variant: small-caps">Gong, B., Shi, Y., Sha, F., and Grauman, K.</span> 2012. Geodesic flow kernel for unsupervised domain adaptation. <i>Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on</i>, IEEE, 2066–2073.</span></li> <li><span id="Gopalan:2011:GGF"><span style="font-variant: small-caps">Gopalan, R., Li, R., and Chellappa, R.</span> 2011. Domain adaptation for object recognition: An unsupervised approach. <i>ICCV ’11</i>, IEEE, 999–1006.</span></li> <li><span id="Griffin:2007:Caltech"><span style="font-variant: small-caps">Griffin, G., Holub, A., and Perona, P.</span> 2007. Caltech-256 object category dataset.</span></li> <li><span id="Saenko:2010:Office"><span style="font-variant: small-caps">Saenko, K., Kulis, B., Fritz, M., and Darrell, T.</span> 2010. Adapting visual category models to new domains. <i>Computer Vision–ECCV 2010</i>, 213–226.</span></li></ol>Clustering with Mean Shift2017-11-08T00:00:00+00:002017-11-08T00:00:00+00:00https://zjiayao.github.io/blog/2017/mean-shift-2d<h3 id="introduction"><strong>Introduction</strong></h3> <p>I came across this handy and intricate unsupervised learning algorithm a few months ago in a work (to appear) where we wanted to do something similar on some non-linear manifolds. Contrary to the ubiquitous Lloyd (a.k.a. $k$-means) algorithm, mean shift does not require any prior knowledge or tuning of the number of clusters. I later wrote a simple few-liner in <code class="highlighter-rouge">C++</code> to experiment with it on planar toy data. Code for this blog is available at <a href="https://github.com/zjiayao/ms-2dpnts">GitHub</a>.</p> <p><img src="https://zjiayao.github.io/assets/img/ms/demo.gif" alt="demo" /></p> <p>Mean shift is categorized as an unsupervised kernel-density-estimation method for clustering. Speaking of density estimation, we mainly consider two general categories, parametric and non-parametric. Notable examples of the former include the familiar MLE, MAP, or BMA, where models are parametrized explicitly. The latter encompasses $k$-NN, kernel density estimation, and so forth. As an interesting side note, we often consider neural networks of a fixed architecture as parametric models; some work has been done to reduce the hyperparameters and make them more non-parametric, for example, a recent ICLR paper <a href="#Philipp:2017:NNN">[Philipp and Carbonell 2017]</a>.</p> <h3 id="the-algorithm"><strong>The Algorithm</strong></h3> <h4 id="density-estimation"><strong>Density Estimation</strong></h4> <p>Given a set of datapoints $\{\boldsymbol{x}_i\}_{i=1}^N$, where $\boldsymbol{x} \in \mathcal{R}^k$, we are interested in investigating the underlying data-generating distribution $f(\boldsymbol{x})$, which is, unfortunately, unknown. 
If we can obtain some (good) estimate of the density, we can then find its modes and cluster the observations in accordance with their proximity to the modes.</p> <p>This intuition suggests a two-step iterative algorithm for such a maneuver. Concretely, assume the underlying data lie on some metric space with metric $d$ (hence the notion of proximity is well defined via, e.g., the $\mathcal{l}_2$ norm in $\mathcal{R}^k$), which is differentiable. With a proper kernel function $K(\cdot)$ chosen as a smoother, the kernel density estimator is given as <a href="#Fukunaga:1975:MS">[Fukunaga and Hostetler 1975]</a>:</p> <script type="math/tex; mode=display">\hat{f}(\boldsymbol{x}) = \frac{c}{Nh^k} \sum_{i=1}^N K \left( \frac{ d(\boldsymbol{x}, \boldsymbol{x}_i)}{h}\right ),</script> <p>where $h$ is the bandwidth (since it determines the window of the kernel smoother, as we shall see later) and $c$ is the normalizing constant.</p> <h4 id="choices-of-kernel-function"><strong>Choices of Kernel Function</strong></h4> <p>In the original paper of Fukunaga and Hostetler, a function must satisfy several regularity conditions to be a valid kernel function, including integrating to unity ($\int K(\boldsymbol{x}) \mathrm{d} \boldsymbol{x} = 1$), no singularities ($\sup\lvert K(\cdot) \rvert \lt \infty$), absolute integrability ($\int \lvert K(\boldsymbol{x}) \rvert \mathrm{d} \boldsymbol{x} \lt \infty$), and $\lim_{\lVert \boldsymbol{x} \rVert \to \infty} \lVert\boldsymbol{x}\rVert^k K(\boldsymbol{x}) = 0$.</p> <p>Note that we define the kernels as functions from $\mathcal{R}$ to $\mathcal{R}$. That is, we take the norm before feeding data to the kernel. 
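<p>As a minimal illustrative sketch of this estimator (our own, assuming the Euclidean metric and the Gaussian kernel, with the normalizing constant $c$ folded into $K$ so that the kernel integrates to unity):</p>

```python
import numpy as np

def kde(x, data, h):
    """Kernel density estimate at point x; data has shape (N, k), bandwidth h."""
    N, k = data.shape
    d = np.linalg.norm(data - x, axis=1)                      # d(x, x_i)
    K = np.exp(-0.5 * (d / h) ** 2) / (2 * np.pi) ** (k / 2)  # Gaussian kernel K(d/h)
    return K.sum() / (N * h ** k)                             # (1 / N h^k) sum_i K
```
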
In practice, we generally use three kernels, namely:</p> <ul> <li>Gaussian Kernel</li> </ul> <p>Most often, we require the covariance matrix to be the identity, hence:</p> <script type="math/tex; mode=display">K(x) = \frac{1}{\sqrt{2\pi}} \exp\{-\frac{1}{2}\lVert x\rVert ^ 2\}.</script> <ul> <li>Uniform Kernel</li> </ul> <script type="math/tex; mode=display">K(x) = 1\{\lVert x\rVert \lt 1 \}.</script> <ul> <li>Epanechnikov Kernel</li> </ul> <script type="math/tex; mode=display">K(x)= \frac{3}{4}(1-\lVert x \rVert ^2) 1\{\lVert x \rVert \lt 1 \}.</script> <p><img src="https://zjiayao.github.io/assets/img/ms/kernels.svg" alt="kernels" /></p> <p>Note that in the case of the uniform and Epanechnikov kernels, we limit their supports to $[0,1]$; hence the hyperparameter $h$ hinted at earlier indeed controls the window of the smoother.</p> <h4 id="mode-seeking-via-gradient"><strong>Mode Seeking via Gradient</strong></h4> <p>Recall that to find the modes, we equate the gradient of the density to zero. Writing the estimator with the squared argument, $\hat{f}(\boldsymbol{x}) \propto \sum_{i} K(d^2(\boldsymbol{x}, \boldsymbol{x}_i)/h^2)$, and defining $g(x) = -K'(x)$, we have that:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \hat{\nabla f(\boldsymbol{x})} &= \frac{c}{Nh^{k+2}} \sum_{i=1}^N \nabla d^2(\boldsymbol{x}, \boldsymbol{x_i}) K' \left( \frac{d^2( \boldsymbol{x}, \boldsymbol{x_i})}{h^2} \right) \\ &= -\frac{c}{Nh^{k+2}} \left[ \sum_{i=1}^N g \left( \frac{d^2(\boldsymbol{x}, \boldsymbol{x_i})}{h^2}\right) \right] \left\{ \frac{\sum_{i=1}^N g \left( \frac{d^2(\boldsymbol{x}, \boldsymbol{x_i})}{h^2}\right) \nabla d^2(\boldsymbol{x}, \boldsymbol{x_i})} {\sum_{i=1}^N g \left( \frac{d^2(\boldsymbol{x}, \boldsymbol{x_i})}{h^2}\right)} \right\}. \end{aligned} %]]></script> <p>This is indeed a very general result with some subtle implications: to perform mean shift, it suffices to have a metric whose square is differentiable. 
This generalizes to piecewise-differentiable metrics as well.</p> <p>For example, using the canonical Euclidean norm, for which $\nabla d^2(\boldsymbol{x}, \boldsymbol{x}_i) = 2(\boldsymbol{x} - \boldsymbol{x}_i)$, we have that:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{aligned} \hat{\nabla f(\boldsymbol{x})} &= \frac{2c}{Nh^{k+2}} \sum_{i=1}^N g \left( \frac{\lVert \boldsymbol{x} - \boldsymbol{x}_i \rVert ^ 2}{h^2} \right) \left\{ \frac{\sum_{i=1}^N \boldsymbol{x}_i g \left( \frac{\lVert \boldsymbol{x} - \boldsymbol{x}_i \rVert ^2}{h^2}\right) } {\sum_{i=1}^N g \left( \frac{\lVert \boldsymbol{x} - \boldsymbol{x}_i \rVert ^ 2}{h^2}\right)} - \boldsymbol{x} \right\}. \end{aligned} %]]></script> <p>What does this suggest? Given a point, the leading factor is a positive scalar, hence only the second term matters in our iterative procedure. Define</p> <script type="math/tex; mode=display">\boldsymbol{m} = \left\{ \frac{\sum_{i=1}^N \boldsymbol{x}_i g \left( \frac{\lVert \boldsymbol{x} - \boldsymbol{x}_i \rVert ^ 2}{h^2}\right) } {\sum_{i=1}^N g \left( \frac{\lVert \boldsymbol{x} - \boldsymbol{x}_i \rVert ^ 2}{h^2}\right)} - \boldsymbol{x} \right\},</script> <p>the mean shift vector, which can be perceived as a smoothed centroid of $\{\boldsymbol{x}_i\}$; setting $\hat{\nabla f} (\boldsymbol{x}) = 0$ thus amounts to seeking the centroids under the kernel smoother $g(\cdot)$. Since $g(\cdot)$ is the actual kernel we use to weigh the points, the original kernel $K(\cdot)$ is commonly called its “shadow”.</p> <h4 id="relationship-between-em"><strong>Relationship with EM</strong></h4> <p>If we further consider the “real” centroids as missing data, the above optimization problem can be cast as a variant of the EM algorithm. Concretely, in the E-step, we take expectations conditioning on the previously sought centroids of the smoother. 
That is, in fact:</p> <script type="math/tex; mode=display">\sum_{i=1}^N f (\boldsymbol{x}_i) = \boldsymbol{E} \left[ F \vert \left\{\boldsymbol{x}_c\right\}\right],</script> <p>where $\{\boldsymbol{x}_c\}$ are the centroids observed from the last iteration, and $F$ is the actual objective in terms of the real unknown centroids expressed as a KDE. Hence in the subsequent M-step, the only work to be done is to equate the gradient to zero, completing one EM iteration.</p> <p>The major difference is whether the estimated results are fed to the next iteration. To be precise, there are typically two types of mean shift: one in which the modes do not replace the data points, which is commonly used for clustering; in the other, the learnt modes are substituted for the data points, which can be used for image segmentation.</p> <p>In the latter case, mean shift can readily be regarded as within the EM paradigm.</p> <h3 id="concrete-examples"><strong>Concrete Examples</strong></h3> <p>We consider a simple example: a mixture of four Gaussians that is visually well separable:</p> <p><img src="https://zjiayao.github.io/assets/img/ms/data.svg" alt="data" /></p> <p>We first run one iteration using the Gaussian kernel with bandwidth 18 and pruning criterion 18. These hyperparameters are highly problem-dependent.</p> <p><img src="https://zjiayao.github.io/assets/img/ms/gauss_modes.svg" alt="gauss" /></p> <p>We can see the modes of the four clusters largely shrink together. Hence applying a pruning algorithm such as DFS would do the trick:</p> <p><img src="https://zjiayao.github.io/assets/img/ms/gauss_cls.svg" alt="cluster" /></p> <p>In practice, one may alter the learning parameters and perform several iterations of the whole algorithm. 
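<p>For concreteness, here is a compact sketch of the procedure (our own illustration, not the <code class="highlighter-rouge">C++</code> code linked above), using the Gaussian shadow kernel and the Euclidean metric; the pruning pass that merges nearby modes into clusters is left out:</p>

```python
import numpy as np

def mean_shift(data, h, iters=200, tol=1e-6):
    """Shift every point to a mode of the KDE; data has shape (N, k)."""
    modes = data.copy()
    for _ in range(iters):
        # pairwise squared distances d^2 between current modes and the data
        sq = ((modes[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1)
        w = np.exp(-0.5 * sq / h ** 2)                   # shadow kernel g(d^2 / h^2)
        new = (w @ data) / w.sum(axis=1, keepdims=True)  # smoothed centroids
        if np.abs(new - modes).max() < tol:              # converged
            return new
        modes = new
    return modes
```

<p>Points drawn from the same cluster converge to (numerically) the same mode, after which a pruning pass over modes closer than some threshold, e.g., via a DFS as above, yields the cluster labels.</p>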
Nonetheless, since the problem is ill-posed <em>per se</em>, it is better to visualize the results to determine which is most desirable.</p> <p>For example, consider the following ring data (<a href="https://github.com/deric/clustering-benchmark/blob/master/src/main/resources/datasets/artificial/rings.arff">data source</a>):</p> <p><img src="https://zjiayao.github.io/assets/img/ms/ring.svg" alt="ring_data" /></p> <p>The following results are generated from different sets of configurations:</p> <p><img src="https://zjiayao.github.io/assets/img/ms/ring_1.svg" alt="ring_1" /></p> <p><img src="https://zjiayao.github.io/assets/img/ms/ring_2.svg" alt="ring_2" /></p> <h3 id="references"><strong>References</strong></h3> <ol class="bibliography"><li><span id="Fukunaga:1975:MS"><span style="font-variant: small-caps">Fukunaga, K. and Hostetler, L.</span> 1975. The estimation of the gradient of a density function, with applications in pattern recognition. <i>IEEE Transactions on Information Theory</i> <i>21</i>, 1, 32–40.</span></li> <li><span id="Philipp:2017:NNN"><span style="font-variant: small-caps">Philipp, G. and Carbonell, J.G.</span> 2017. Nonparametric Neural Networks. <i>ICLR ’17</i>.</span></li></ol>Bayesian Hierarchical Model: Gamma and Inverse Gamma Prior2017-09-26T00:00:00+00:002017-09-26T00:00:00+00:00https://zjiayao.github.io/blog/2017/bhm-gamma-invgamma<p>I have been taking a graduate course on computational statistics this semester. 
Although the choice of priors in Bayesian models is subjective and prone to human error, the paradigm of hierarchy comes in pretty handy when we deal with sequentially dependent models such as some RNNs.</p> <p>It is becoming clearer and clearer why Bayesian statistics is computationally intensive: <em>mind you, what else can you resort to when you wish to integrate out a parameter over products of mixtures of exponential-family densities?</em></p> <p>Indeed, consider the following simple hierarchical model:</p> <script type="math/tex; mode=display">z \vert s \mathbin{\sim} N(0, s),\quad s \mathbin{\sim} ?.</script> <p>We consider two cases: either $s \mathbin{\sim} \operatorname{\Gamma}(a, b)$ or $s \mathbin{\sim} \operatorname{Inv\operatorname{\Gamma}}(a, b)$. The integration for the first case is not very straightforward; nevertheless, it is given as a one-liner fact in the original paper <a href="#West:1989:MIX">[West 1987]</a>. We proceed using <a href="https://stats.stackexchange.com/questions/175458/show-that-a-scale-mixtures-of-normals-is-a-power-exponential/175850#175850">this technique introduced on Cross Validated</a>.</p> <h3 id="gamma-prior"><strong>Gamma Prior</strong></h3> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} p(z) &= \int_{\mathbb{R}^+} \operatorname{\mathcal{N}}(0, s)\operatorname{\Gamma}(a, b) \mathop{}\!\operatorname{d} s\\ &= \frac{b^a}{\sqrt{2\pi}\operatorname{\Gamma}(a)} \int_{\mathbb{R}^+} s^ {-\frac{1}{2}} s ^ {a-1} \exp\left\{-\frac{1}{2}\left( \frac{\lvert z \rvert ^2}{s} + 2bs \right)\right\} \mathop{}\!\operatorname{d} s \\ &= \frac{b^a \mathop{}\!\mathrm{e}^{-\sqrt{2b}\lvert z \rvert}}{\sqrt{2\pi}\operatorname{\Gamma}(a)} \int_{\mathbb{R}^+} s^ {-\frac{1}{2}} s ^ {a-1} \exp\left\{-\frac{1}{2}\left( \frac{\lvert z \rvert}{\sqrt{s}} - \sqrt{2b}\sqrt{s}\right)^2\right\} \mathop{}\!\operatorname{d} s. \end{align*} %]]></script> <p>This suggests a change of variable as $v = \sqrt{s} &gt; 0$. 
It follows that:</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} p(z) &= \frac{2 b^a \mathop{}\!\mathrm{e}^{-\sqrt{2b}\lvert z \rvert}}{\sqrt{2\pi}\operatorname{\Gamma}(a)} \int_{\mathbb{R}^+} v^{2(a-1)} \exp\left\{ -\frac{1}{2} \left( \frac{\lvert z \rvert}{v} - \sqrt{2b} v \right)^2 \right\} \mathop{}\!\operatorname{d} v. \end{align*} %]]></script> <p>Write $t = \sqrt{2b} v - \frac{\lvert z \rvert}{v} \in \mathbb{R}$. Note that $t(v)$ is monotone, with $t \rightarrow -\infty$ as $v \rightarrow 0$ and $t \rightarrow \infty$ as $v \rightarrow \infty$.</p> <p>Moreover, note that</p> <script type="math/tex; mode=display">v = \frac{1}{2\sqrt{2b}} \left(t + \sqrt{t^2 + 4\sqrt{2b}\lvert z \rvert}\right),</script> <p>where the other root is rejected since $v\ge 0$. Hence</p> <script type="math/tex; mode=display">\mathop{}\!\operatorname{d} v =\frac{1}{2\sqrt{2b}}\left(1 + \frac{t}{\sqrt{t^2 + 4\sqrt{2b} \lvert z \rvert}}\right) \mathop{}\!\operatorname{d} t.</script> <p>It follows that</p> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} p(z) &= \frac{2 b^a \mathop{}\!\mathrm{e}^{-\sqrt{2b}\lvert z \rvert}}{\sqrt{2\pi}\operatorname{\Gamma}(a)} \int_{\mathbb{R}^+} v^{2(a-1)} \exp\left\{ -\frac{1}{2} t ^2 \right\} \mathop{}\!\operatorname{d} v \\ &\overset{a=1}{=} \frac{2 b \mathop{}\!\mathrm{e}^{-\sqrt{2b}\lvert z \rvert}}{\sqrt{2\pi}} \int_{\mathbb{R}} \exp\left\{ -\frac{1}{2} t ^ 2 \right\} \frac{1}{2\sqrt{2b}} \left( 1 + \frac{t}{ \sqrt{t^2+ 4\sqrt{2b}\lvert z \rvert} } \right) \mathop{}\!\operatorname{d} t \\ &= \frac{\sqrt{2b}}{2} \mathop{}\!\mathrm{e}^{-\sqrt{2b} \lvert z \rvert}, \end{align*} %]]></script> <p>where the last equality follows since the second term in the Jacobian is odd.</p> <p>We therefore conclude that when $a=1$, $z\mathbin{\sim} \mathrm{DE}(\sqrt{2b})$.</p> <h3 id="inverse-gamma-prior"><strong>Inverse Gamma Prior</strong></h3> <script type="math/tex; mode=display">% <![CDATA[ \begin{align*} p(z) &= \int_{\mathbb{R}^+} 
\operatorname{Inv\operatorname{\Gamma}}(a,b) \operatorname{\mathcal{N}}(0, s) \mathop{}\!\operatorname{d} s \\ &= \int_{\mathbb{R}^+} \frac{1}{\sqrt{2 \pi s}} \mathop{}\!\mathrm{e}^{-\frac{1}{2s} z^2} \frac{b^a}{\operatorname{\Gamma}(a)} s^{-a-1} \mathop{}\!\mathrm{e}^{-\frac{b}{s}} \mathop{}\!\operatorname{d} s \\ &= \frac{b^a}{\sqrt{2\pi}\operatorname{\Gamma}(a)} \int_{\mathbb{R}^+} s^{-(a+\frac{1}{2})-1} \mathop{}\!\mathrm{e}^{-\left(\frac{z^2}{2} + b\right) \frac{1}{s}} \mathop{}\!\operatorname{d} s \\ &= \frac{\operatorname{\Gamma}\left(a + \frac{1}{2}\right)} {\sqrt{2b\pi}\operatorname{\Gamma}(a)} \left(1 + \frac{z^2}{2b} \right) ^ {-\left(a + \frac{1}{2}\right)}. \end{align*} %]]></script> <p>That is, when $a = b = \nu/2$, $z\mathbin{\sim} t_{\mathrm{df}=\nu}$, a Student’s $t$ with $\nu$ degrees of freedom.</p> <h3 id="remark"><strong>Remark.</strong></h3> <p>There may exist other, simpler and more elegant methods of tackling the first integral, for example, the use of moment generating functions or converting it into a contour integral. I may make a further update after I grasp them.</p> <p>Moreover, the setting of $a=1$ in the first case is solely for the sake of integration. I am not sure whether the general integral admits a closed form otherwise.</p> <h3 id="references"><strong>References</strong></h3> <ol class="bibliography"><li><span id="West:1989:MIX"><span style="font-variant: small-caps">West, M.</span> 1987. On scale mixtures of normal distributions. <i>Biometrika</i> <i>74</i>, 3, 646–648.</span></li></ol>Review: Advances in Style Transfer2017-07-25T00:00:00+00:002017-07-25T00:00:00+00:00https://zjiayao.github.io/blog/2017/cvpr17-style-transfer<p><strong>Abstract</strong></p> <p><em>We discuss recent advances in style transfer in CVPR ‘17 and SIGGRAPH ‘17, and provide pointers to relevant papers.</em></p> <h3 id="introduction"><strong>Introduction</strong></h3> <p>Since the introduction of neural style transfer by <a href="#Gatys:2016:NST">[Gatys et al. 2016]</a>, several pieces of commercial software harnessing the underlying deep model have appeared and arguably gained large popularity.</p> <p>The idea behind the scenes is quite simple: extract style and content features and generate a fused image based on those features. How? Deep neural networks come to the rescue. A pre-trained convnet is used to extract features from the style and the content image. Then, starting with a randomly initialized image, one may iteratively “refine” it by minimizing a notion of loss between the generated image and the extracted features at <em>each</em> layer. The mixture of content loss, style loss, and the perceptual loss proposed by <a href="#Johnson:2016:PLOSS">[Johnson et al. 2016]</a> is then back-propagated to the <em>output image</em> itself.</p> <p>After several iterations, we may then expect a stylized image that assumes structural content from the source and high-level style traits (e.g., texture patches, color histograms, etc.) from the reference image.</p> <p>Unlike traditional approaches that mostly operate on lower-level features such as texture patches, neural style transfer has proven more powerful at transferring higher-level features. 
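<p>To make the notion of style loss concrete, here is a sketch of the Gram-matrix loss of <a href="#Gatys:2016:NST">[Gatys et al. 2016]</a> for a single layer, written in plain NumPy on a pre-extracted feature map (in practice the features are convnet activations and the gradient is back-propagated to the image by automatic differentiation; the normalization below is one common variant, not necessarily the paper’s exact constant):</p>

```python
import numpy as np

def gram(F):
    """Gram matrix of a feature map F of shape (channels, height, width)."""
    C = F.reshape(F.shape[0], -1)        # channels x (H * W)
    return C @ C.T / C.shape[1]          # channel-channel correlations

def style_loss(F_gen, F_style):
    """Squared Frobenius distance between Gram matrices (one layer's style loss)."""
    G, A = gram(F_gen), gram(F_style)
    return ((G - A) ** 2).sum() / (4 * F_gen.shape[0] ** 2)
```

<p>The content loss is simply the squared distance between the raw feature maps at a chosen layer, and the total objective back-propagated to the output image is a weighted sum of the two over layers.</p>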
And it is, indeed, more interesting.</p> <h3 id="recent-progress"><strong>Recent Progress</strong></h3> <p>Nevertheless, the model introduced by <a href="#Gatys:2016:NST">[Gatys et al. 2016]</a> is not flawless. Indeed, in this year’s CVPR and SIGGRAPH, we see several papers attempting to address various problems and achieving intriguing results.</p> <h4 id="multiple-style-transfer"><strong>Multiple Style Transfer</strong></h4> <p>The original model learns each style separately; new styles are not possible without retraining the network on the new references. <a href="#Chen:2017:NIST">[Chen et al. 2017]</a> attacked this by introducing a bank of convolutional filters that enables the decoupling of content and style.</p> <p>Hence a new style can be trained incrementally on the existing <em>StyleBank</em>. It is also more convenient for style fusion, which can be done by weighting the filter bank.</p> <h4 id="controls-and-constraints"><strong>Controls and Constraints</strong></h4> <p>The goal of style transfer, at the end of the day, is to achieve perceptually consistent, aesthetically pleasant (well, at least in my opinion) results. The original model may go wild under certain circumstances, for example, due to the lack of context awareness. As has been widely noticed, styles from some context (say, the sky) may be applied to some irrelevant context (say, a house); such discrepancy is one problem people strive to overcome.</p> <p>We see <a href="#Gatys:2017:CTRL">[Gatys et al. 2017; Luan et al. 2017; Liao et al. 2017]</a> all more or less respond to this issue. <a href="#Gatys:2017:CTRL">[Gatys et al. 
2017]</a> introduces several controlling factors that may act on spatial context (to determine the corresponding regions for transferring, so that transfer only occurs between similar contexts); colors (thus patterns may be transferred without altering the color); and scale (applying separate styles at different scales).</p> <p>Unlike most existing models that apply some “painting-like” style to real-world images, the deep photo style transfer proposed by <a href="#Luan:2017:DPST">[Luan et al. 2017]</a>, on the other hand, learns styles from real photographs. One advantage of this insight is the prevention of structural distortion: as is common in many other algorithms, lines or areas are almost always distorted. This is done by constraining the transfer operation to occur only in color space.</p> <p>A direct observation from this constraint is that it limits the model to applying styles related to <em>color</em>, e.g., alternation between day and night, weather variation, etc. Nevertheless, their results are mostly photorealistic compared to others.</p> <p><a href="#Liao:2017:VAT">[Liao et al. 2017]</a> tackles this problem in a pair-wise supervised fashion, via what is known as transfer of <em>visual attributes</em> (that is, higher-level traits). Such attributes are learnt by image analogy (that is, the establishment of structural correspondences between input images).</p> <p>The utilization of image analogy offers this model a unique merit: style transfer acts on <em>pairs</em>. The model simultaneously generates two outputs given two inputs, with the styles interchanged. Their results are the most impressive so far as I can tell (interestingly enough, a comparison with <a href="#Luan:2017:DPST">[Luan et al. 2017]</a> is also given).</p> <p>Another work that might be in line would be <a href="#Shu:2017:PLT">[Shu et al. 2017]</a>, where the transfer occurs exclusively in lighting. 
In particular, their algorithm, without exploiting deep neural networks, is capable of applying the desired illumination from one portrait to another without sabotaging facial features.</p> <h4 id="faster-transfer"><strong>Faster Transfer</strong></h4> <ul> <li>Video Style Transfer</li> </ul> <p>Applying neural style transfer to video on a frame-by-frame basis may cause artifacts; indeed, the temporal relationship between frames is not exploited. <a href="#Huang:2017:VNST">[Huang et al. 2017]</a> noted the “flickers” caused by inconsistency of style patches applied to the same region, and propose to alleviate this problem by feeding two consecutive frames at a time to minimize temporal inconsistency. Their model achieves real-time consistent performance; also of interest, by feeding frames that may be “far away” in time, long-term consistency can also be preserved.</p> <ul> <li>High Resolution</li> </ul> <p>As far as high-resolution images are concerned, the transfer process often fails to capture the <em>intricate</em> fine details of the texture. In <a href="#Wang:2017:HIRES">[Wang et al. 2017]</a>, we see a coarse-to-fine approach applied in a hierarchical manner that is able to generate high-resolution stylized images with correctly scaled textures (surprisingly though, I did not perceive much difference compared to the baseline model).</p> <h3 id="remark"><strong>Remark</strong></h3> <p>Clearly, neural style transfer seems to dominate the arena of style transfer. We see many problems being addressed, and the resulting stylized images are more and more realistic (or unrealistic, depending on the application) and less flawed. We expect more discrepancies between fully stylized results to be settled in the future.</p> <p>Moreover, it is quite inspiring to note work such as <a href="#Shu:2017:PLT">[Shu et al. 
2017]</a> that achieves impressive results without the help of neural networks.</p> <h3 id="references"><strong>References</strong></h3> <ol class="bibliography"><li><span id="Chen:2017:NIST"><span style="font-variant: small-caps">Chen, D., Yuan, L., Liao, J., Yu, N., and Hua, G.</span> 2017. StyleBank: An Explicit Representation for Neural Image Style Transfer. <i>CVPR ’17</i>.</span></li> <li><span id="Gatys:2016:NST"><span style="font-variant: small-caps">Gatys, L.A., Ecker, A.S., and Bethge, M.</span> 2016. Image style transfer using convolutional neural networks. <i>CVPR ’16</i>, 2414–2423.</span></li> <li><span id="Gatys:2017:CTRL"><span style="font-variant: small-caps">Gatys, L.A., Ecker, A.S., Bethge, M., Hertzmann, A., and Shechtman, E.</span> 2017. Controlling Perceptual Factors in Neural Style Transfer. <i>CVPR ’17</i>.</span></li> <li><span id="Huang:2017:VNST"><span style="font-variant: small-caps">Huang, H., Wang, H., Luo, W., et al.</span> 2017. Real-Time Neural Style Transfer for Videos. <i>CVPR ’17</i>.</span></li> <li><span id="Johnson:2016:PLOSS"><span style="font-variant: small-caps">Johnson, J., Alahi, A., and Fei-Fei, L.</span> 2016. Perceptual losses for real-time style transfer and super-resolution. <i>ECCV ’16</i>, Springer, 694–711.</span></li> <li><span id="Liao:2017:VAT"><span style="font-variant: small-caps">Liao, J., Yao, Y., Yuan, L., Hua, G., and Kang, S.B.</span> 2017. Visual Attribute Transfer through Deep Image Analogy. <i>ACM Transactions on Graphics</i> <i>36</i>, 4.</span></li> <li><span id="Luan:2017:DPST"><span style="font-variant: small-caps">Luan, F., Paris, S., Shechtman, E., and Bala, K.</span> 2017. Deep Photo Style Transfer. <i>CVPR ’17</i>.</span></li> <li><span id="Shu:2017:PLT"><span style="font-variant: small-caps">Shu, Z., Hadap, S., Shechtman, E., Sunkavalli, K., Paris, S., and Samaras, D.</span> 2017. Portrait lighting transfer using a mass transport approach.
<i>ACM Transactions on Graphics</i> <i>36</i>, 4, 145a.</span></li> <li><span id="Wang:2017:HIRES"><span style="font-variant: small-caps">Wang, X., Oxholm, G., Zhang, D., and Wang, Y.-F.</span> 2017. Multimodal Transfer: A Hierarchical Deep Convolutional Neural Network for Fast Artistic Style Transfer. <i>CVPR ’17</i>.</span></li></ol>Jiayao J. ZhangAbstractReview: Domain Adaptation in the Age of Deep Models2017-07-20T00:00:00+00:002017-07-20T00:00:00+00:00https://zjiayao.github.io/blog/2017/cvpr17-domain-adaptation<p><strong>Abstract</strong></p> <p><em>We briefly introduce domain adaptation and discuss some highlights from relevant CVPR ‘17 papers.</em></p> <h3 id="what-is-domain-adaptation"><strong>What is domain adaptation?</strong></h3> <p>Modern supervised machine learning models usually need to be trained on extraordinarily large labeled data sets in order to achieve state-of-the-art accuracy. For example, the benchmark in object detection and localization, the <a href="http://www.image-net.org/index">ImageNet</a>, consists of more than fourteen million images from twenty-seven high-level categories (e.g., fish, flowers, &amp;c) - with more than twenty thousand subcategories, or synsets (e.g., begonia, orchid, &amp;c). Collecting datasets of this magnitude is usually not a problem, but annotating them surely is, and is more difficult than we may expect.</p> <p>Besides the inaccessibility of large labeled data sets, some models are hard to train; for example, a deep neural network usually costs hours, if not days, to converge on an average machine (that is to say, one with no GPU support).
Naturally, people wish their models to be <em>robust</em> in the sense that they generalize well to novel unseen scenarios with little or even no access to labeled data for further retraining or fine-tuning.</p> <p>Hence, we may ask ourselves: how can we have trained models <em>transfer</em> their knowledge well from the domain they have been trained on to a novel-but-somewhat-related domain? The related domain, or the <em>target</em> domain, may have some labeled data (semi-supervised) or no labeled data at all (unsupervised) - this is the core question to which <strong>Domain Adaptation</strong> seeks a decent answer.</p> <h3 id="what-are-common-applications"><strong>What are common applications?</strong></h3> <p>Domain adaptation addresses more problems than that. Indeed, transferring knowledge between related domains comes in different flavors; listed below are a few common use cases:</p> <h4 id="face-recognition"><strong>Face Recognition</strong></h4> <p>Faces in the wild differ from those taken under controlled scenes in terms of <em>pose</em>, <em>illumination</em>, <em>variations in the backgrounds</em>, etc. Traditional methods often learn a projection or transformation to augment the data. It is well established that projecting a set of faces under different conditions (provided they are faces of the same subject, of course) onto a lower dimensional subspace and discriminating them in such spaces is workable. Why might this work? Well, at the end of the last century, people showed that the same face under the aforementioned condition variations tends to lie on the same lower dimensional subspace (or <em>manifold</em>.
Indeed, this is usually referred to as “the manifold hypothesis”) and can be separated readily from the manifolds formed by other faces using an off-the-shelf classifier such as support vector machines.</p> <p>Recent work, in addition to continuing to explore the traditional territory, as we shall see later, tends to leverage the capability of deep neural networks as well.</p> <h4 id="object-detection"><strong>Object Detection</strong></h4> <p>As the earlier example suggests, having a well-trained model perform well in reality is very important. We have seen models such as <a href="https://github.com/ShaoqingRen/faster_rcnn">Faster-RCNN</a> doing this job pretty well. Nonetheless, there is almost surely more to achieve, especially in benchmarks, e.g., on the <a href="https://cs.stanford.edu/~jhoffman/domainadapt/#datasets_code">Office+Caltech</a> dataset.</p> <h3 id="whats-the-progress"><strong>What’s the progress?</strong></h3> <p>In this year’s CVPR, we see at least nine papers diving into the area of domain adaptation. Surely they address different issues at different levels, and we noticed a few interesting traits:</p> <h4 id="leveraging-deep-models"><strong>Leveraging Deep Models</strong></h4> <p>Flashing back only a few years to 2014, when DNNs had already been a hot topic for quite a while, a well-cited thorough survey <a href="#Patel:2015:DASURVEY">[Patel et al. 2015]</a> hardly mentioned DA using DNNs at all. This year, on the contrary, we see several deep models, including one that:</p> <ul> <li>achieves compactness, using as few as 59% of the parameters of GoogLeNet yet attaining similar DA task accuracy <a href="#Wu:2017:CONVM">[Wu et al. 2017]</a>;</li> <li>introduces a novel hashing layer and hash loss <a href="#Venkateswara:2017:DAH">[Venkateswara et al.
2017]</a>;</li> <li>trains on the target domain by jointly fine-tuning low-level features from the source domain <a href="#Ge:2017:SJFT">[Ge and Yu 2017]</a>; and most excitingly:</li> <li>borrows ideas from generative models <a href="#Bousmalis:2017:PIXELDA">[Bousmalis et al. 2017; Tzeng et al. 2017]</a></li> </ul> <p>We highlight some of the key features of those generative-based models:</p> <table> <thead> <tr> <th>Model</th> <th>Highlight</th> <th>Pointer</th> </tr> </thead> <tbody> <tr> <td>PixelDA</td> <td>Learns transformation of <em>pixel</em> space between domains, with results that look as if drawn from the target domain</td> <td><a href="#Bousmalis:2017:PIXELDA">[Bousmalis et al. 2017]</a></td> </tr> <tr> <td>ADDA</td> <td>Exploits adversarial loss; extends well to cross-modality tasks</td> <td><a href="#Tzeng:2017:ADDA">[Tzeng et al. 2017]</a></td> </tr> </tbody> </table> <p>We see they were inspired by GANs differently: PixelDA attempts to map <em>both</em> the source data and noise through a GAN such that the generated data seems to be sampled from the target domain as far as the classifier is concerned. Their model is also not unified, as the classifier may be changed according to the specific task; ADDA mainly incorporated an adversarial loss (which they referred to as <em>the GAN loss</em>).</p> <h4 id="exploring-shallow-methods"><strong>Exploring Shallow Methods</strong></h4> <p>Besides deep models, we see several non-DNN-based models (henceforth referred to as “shallow models”) that:</p> <ul> <li>enhance the classic Maximum Mean Discrepancy paradigm <a href="#Yan:2017:WMMD">[Yan et al. 2017]</a>;</li> <li>explore the traditional subspace learning methods further <a href="#Herath:2017:ILS">[Herath et al. 2017; Zhang et al. 2017; Koniusz et al. 2017]</a>;</li> </ul> <p>In particular, <a href="#Yan:2017:WMMD">[Yan et al.
2017]</a> proposed to weight source classes differently, in the hope of imposing class priors in case the cross-domain data are not well balanced (i.e., some classes from the source domain may be missing in the target domain). A weighted domain adaptation network based on weighted MMD and a CNN has been tested. Here we see a deep model again; nevertheless, we consider the key feature, i.e., the notion of weighted MMD, to be more related to the canonical DA approaches.</p> <p><a href="#Herath:2017:ILS">[Herath et al. 2017; Zhang et al. 2017; Koniusz et al. 2017]</a> all dug further into data-augmentation-related approaches. <a href="#Herath:2017:ILS">[Herath et al. 2017]</a> is motivated by several state-of-the-art geodesic flow kernel methods and directly learns to construct a latent Hilbert space (that is, a vector space equipped with an inner product) so as to project <em>both</em> the source and target data onto this space, in the hope of reducing the discrepancy between domains. Also worth highlighting, a notion of discriminatory power is proposed. This notion, as far as we are concerned, is analogous to classical discriminant analysis - considering both between-class dissimilarity and within-class similarity.</p> <p>In line with the idea of projection, <a href="#Zhang:2017:JGSA">[Zhang et al. 2017]</a> learns two coupled projections to reduce geometrical and distributional shift. <a href="#Koniusz:2017:SOHOT">[Koniusz et al. 2017]</a> deals with second- or higher-order scatter statistics.</p> <p>In a nutshell, these approaches all exploit the notion of subspace learning (that is, projection followed by optimization of certain discrepancy measures in order to align similar data from both domains and misalign distinct data); the difference lies in how they achieve this.</p> <h3 id="remark"><strong>Remark</strong></h3> <p>As we have seen, incorporating deep models into DA-related tasks seems to be pretty trendy.
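To make the discrepancy measures recurring above concrete, here is a minimal sketch of the plain (unweighted) MMD with an RBF kernel. This is my own toy illustration, not code from any of the surveyed papers, and the sample data and parameter names are invented for the example:

```python
import numpy as np

def rbf_kernel(X, Y, gamma=1.0):
    """Gram matrix of the RBF kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq = (X**2).sum(1)[:, None] + (Y**2).sum(1)[None, :] - 2 * X @ Y.T
    return np.exp(-gamma * sq)

def mmd2(Xs, Xt, gamma=1.0):
    """Biased estimate of the squared Maximum Mean Discrepancy
    between source samples Xs and target samples Xt."""
    return (rbf_kernel(Xs, Xs, gamma).mean()
            + rbf_kernel(Xt, Xt, gamma).mean()
            - 2 * rbf_kernel(Xs, Xt, gamma).mean())

rng = np.random.default_rng(0)
src = rng.normal(0.0, 1.0, size=(200, 3))   # "source domain" samples
near = rng.normal(0.1, 1.0, size=(200, 3))  # mild domain shift
far = rng.normal(2.0, 1.0, size=(200, 3))   # severe domain shift

# A larger domain shift yields a larger estimated discrepancy.
assert mmd2(src, near) < mmd2(src, far)
```

Minimizing such a quantity between projected source and target features is, in essence, what the MMD-based alignment methods above optimize; the weighted variant of [Yan et al. 2017] additionally reweights the per-class contributions to the source terms.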
This is not surprising though, as deep models generally achieve better performance than classical data-augmentation approaches in <em>supervised</em> tasks. As for unsupervised DA tasks, both can be improved further.</p> <p>We have also noted the introduction of generative networks. The ability of GANs to facilitate domain adaptation is starting to be harnessed.</p> <p>With the leading roles being played by deep models, traditional methods (that is, non-DNN-based methods) remain attractive: a handful of papers addressed subspace-learning-related topics. Interestingly though, most of them focused on subspace clustering.</p> <p>In the future, we expect to witness more work leveraging deep models and integrating them with traditional canons as well.</p> <h3 id="references"><strong>References</strong></h3> <ol class="bibliography"><li><span id="Bousmalis:2017:PIXELDA"><span style="font-variant: small-caps">Bousmalis, K., Silberman, N., Dohan, D., Erhan, D., and Krishnan, D.</span> 2017. Unsupervised Pixel-Level Domain Adaptation With Generative Adversarial Networks. <i>CVPR ’17</i>.</span></li> <li><span id="Ge:2017:SJFT"><span style="font-variant: small-caps">Ge, W. and Yu, Y.</span> 2017. Borrowing Treasures From the Wealthy: Deep Transfer Learning Through Selective Joint Fine-Tuning. <i>CVPR ’17</i>.</span></li> <li><span id="Herath:2017:ILS"><span style="font-variant: small-caps">Herath, S., Harandi, M., and Porikli, F.</span> 2017. Learning an Invariant Hilbert Space for Domain Adaptation. <i>CVPR ’17</i>.</span></li> <li><span id="Koniusz:2017:SOHOT"><span style="font-variant: small-caps">Koniusz, P., Tas, Y., and Porikli, F.</span> 2017. Domain Adaptation by Mixture of Alignments of Second- or Higher-Order Scatter Tensors. <i>CVPR ’17</i>.</span></li> <li><span id="Patel:2015:DASURVEY"><span style="font-variant: small-caps">Patel, V.M., Gopalan, R., Li, R., and Chellappa, R.</span> 2015. Visual domain adaptation: A survey of recent advances.
<i>IEEE Signal Processing Magazine</i> <i>32</i>, 3, 53–69.</span></li> <li><span id="Tzeng:2017:ADDA"><span style="font-variant: small-caps">Tzeng, E., Hoffman, J., Saenko, K., and Darrell, T.</span> 2017. Adversarial Discriminative Domain Adaptation. <i>CVPR ’17</i>.</span></li> <li><span id="Venkateswara:2017:DAH"><span style="font-variant: small-caps">Venkateswara, H., Eusebio, J., Chakraborty, S., and Panchanathan, S.</span> 2017. Deep Hashing Network for Unsupervised Domain Adaptation. <i>CVPR ’17</i>.</span></li> <li><span id="Wu:2017:CONVM"><span style="font-variant: small-caps">Wu, C., Wen, W., Afzal, T., Zhang, Y., Chen, Y., and (Helen) Li, H.</span> 2017. A Compact DNN: Approaching GoogLeNet-Level Accuracy of Classification and Domain Adaptation. <i>CVPR ’17</i>.</span></li> <li><span id="Yan:2017:WMMD"><span style="font-variant: small-caps">Yan, H., Ding, Y., Li, P., Wang, Q., Xu, Y., and Zuo, W.</span> 2017. Mind the Class Weight Bias: Weighted Maximum Mean Discrepancy for Unsupervised Domain Adaptation. <i>CVPR ’17</i>.</span></li> <li><span id="Zhang:2017:JGSA"><span style="font-variant: small-caps">Zhang, J., Li, W., and Ogunbona, P.</span> 2017. Joint Geometrical and Statistical Alignment for Visual Domain Adaptation. <i>CVPR ’17</i>.</span></li></ol>Jiayao J. ZhangAbstractLearn LaTeX2017-03-17T00:00:00+00:002017-03-17T00:00:00+00:00https://zjiayao.github.io/blog/2017/learn-latex<p>I used to typeset assignments and notes using the Microsoft formula editor. Only by the end of last year was I motivated to start using $\LaTeX$ by virtue of its simplicity, elegance and professionalism, especially the way it handles citations and references – easily and efficiently.</p> <p>I found it natural to transition from the formula editor to $\LaTeX$, as there is a consistency in syntax (more complicated, of course). During the transition, I used LyX as an intermediate typesetting tool.
So if starting to work with $\LaTeX$ directly seems unintuitive, maybe try LyX first. Tutorials are almost surely helpful.</p> <p>After all, the best way to learn $\LaTeX$, it seems, is to follow a template and write one’s own, say, paper.</p>Jiayao J. Zhang
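To give a flavour of how $\LaTeX$ handles citations and references, below is one minimal template a beginner could start from; the bibliography entry is Lamport’s manual, while the section labels and citation key are of course placeholders of my own choosing:

```latex
\documentclass{article}
\usepackage{amsmath}   % common mathematics environments

\begin{document}

\section{Introduction}
As explained in \cite{lamport1994latex}, cross-referencing is
handled for us: see Section~\ref{sec:conclusion}.

\section{Conclusion}\label{sec:conclusion}
Labels and citations are resolved automatically after two
(or, with BibTeX, three) compilation passes.

\begin{thebibliography}{1}
\bibitem{lamport1994latex}
L.~Lamport.
\newblock \emph{\LaTeX: A Document Preparation System}.
\newblock Addison-Wesley, 2nd edition, 1994.
\end{thebibliography}

\end{document}
```

From there, replacing the inline `thebibliography` environment with a `.bib` file and BibTeX is the natural next step once the citation workflow feels comfortable.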