Deep Kernel Learning
¤ Wilson et al. (arXiv 2015/11/6)
¤ Carnegie Mellon University
¤ Salakhutdinov
¤ Deep learning + Gaussian process
show that the proposed model outperforms state-of-the-art stand-alone deep learning architectures and Gaussian processes with advanced kernel learning procedures on a wide range of datasets, demonstrating its practical significance. We achieve scalability while retaining non-parametric model structure by leveraging the very recent KISS-GP approach (Wilson and Nickisch, 2015) and extensions in Wilson et al. (2015) for efficiently representing kernel functions, to produce scalable deep kernels.

3 Gaussian Processes

We briefly review the predictive equations and marginal likelihood for Gaussian processes (GPs), and the associated computational requirements, following the notational conventions in Wilson et al. (2015). See, for example, Rasmussen and Williams (2006) for a comprehensive discussion of GPs.

We assume a dataset D of n input (predictor) vectors X = {x_1, ..., x_n}, each of dimension D, which index an n × 1 vector of targets y = (y(x_1), ..., y(x_n))^⊤. If f(x) ∼ GP(μ, k_γ), then any collection of function values f has a joint Gaussian distribution,

    f = f(X) = [f(x_1), ..., f(x_n)]^⊤ ∼ N(μ, K_{X,X}),   (1)

with a mean vector, μ_i = μ(x_i), and covariance matrix, (K_{X,X})_{ij} = k_γ(x_i, x_j), determined from the mean function and covariance kernel of the Gaussian process. The kernel, k_γ, is parametrized by γ. Assuming additive Gaussian noise, y(x) | f(x) ∼ N(y(x); f(x), σ²), the predictive distribution of the GP evaluated at the n_* test points indexed by X_* is given by

    f_* | X_*, X, y, γ, σ² ∼ N(E[f_*], cov(f_*)),   (2)
    E[f_*]  = μ_{X_*} + K_{X_*,X} [K_{X,X} + σ² I]^{-1} y,
    cov(f_*) = K_{X_*,X_*} − K_{X_*,X} [K_{X,X} + σ² I]^{-1} K_{X,X_*}.

K_{X_*,X}, for example, is an n_* × n matrix of covariances between the GP evaluated at X_* and X. μ_{X_*} is the n_* × 1 mean vector, and K_{X,X} is the n × n covariance matrix evaluated at training inputs X. All covariance (kernel) matrices implicitly depend on the kernel hyperparameters γ.
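To make Eq. (2) concrete, here is a minimal NumPy sketch of exact GP prediction with a zero mean function and an RBF kernel; the kernel choice and hyperparameter values are illustrative assumptions, not settings from the paper.

import numpy as np

def rbf_kernel(A, B, lengthscale=1.0, variance=1.0):
    # (K_{A,B})_{ij} = variance * exp(-||a_i - b_j||^2 / (2 * lengthscale^2))
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return variance * np.exp(-0.5 * sq / lengthscale ** 2)

def gp_predict(X, y, Xstar, sigma2=0.1):
    """Predictive mean and covariance of Eq. (2), via a Cholesky factorization."""
    K = rbf_kernel(X, X)            # K_{X,X}
    Ks = rbf_kernel(Xstar, X)       # K_{X*,X}
    Kss = rbf_kernel(Xstar, Xstar)  # K_{X*,X*}
    L = np.linalg.cholesky(K + sigma2 * np.eye(len(X)))
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))   # [K_{X,X} + sigma^2 I]^{-1} y
    mean = Ks @ alpha                                      # E[f_*]
    V = np.linalg.solve(L, Ks.T)
    cov = Kss - V.T @ V                                    # cov(f_*)
    return mean, cov

# Toy usage: noisy samples of sin(x)
X = np.linspace(-3, 3, 25)[:, None]
y = np.sin(X).ravel() + 0.1 * np.random.randn(25)
mu, cov = gp_predict(X, y, np.linspace(-4, 4, 100)[:, None])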
[Figure 15.2] Left: some functions sampled from a GP prior with SE kernel. Right: some samples from a GP posterior, after conditioning on 5 noise-free observations. The shaded area represents E[f(x)] ± 2 std(f(x)).
Introduction to GPs (1)

• Writing y^{(1)}, ..., y^{(N)} jointly, we can express the model in matrix form as y = Φw (Φ: the design matrix):

    \begin{pmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{pmatrix}
    =
    \begin{pmatrix}
      \phi_1(x^{(1)}) & \cdots & \phi_H(x^{(1)}) \\
      \phi_1(x^{(2)}) & \cdots & \phi_H(x^{(2)}) \\
      \vdots & & \vdots \\
      \phi_1(x^{(N)}) & \cdots & \phi_H(x^{(N)})
    \end{pmatrix}
    \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_H \end{pmatrix}   (6)

• If the weights w follow a Gaussian prior p(w) = N(0, α^{-1}I), then y = Φw is also Gaussian,
• with mean 0 and covariance

    ⟨yy^⊤⟩ = ⟨(Φw)(Φw)^⊤⟩ = Φ⟨ww^⊤⟩Φ^⊤ = α^{-1}ΦΦ^⊤,   (7)

  i.e. y follows a normal distribution.
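A minimal NumPy sketch of this weight-space view: with Gaussian bumps as an illustrative choice of basis functions φ_h (an assumption, since the slide does not fix φ), drawing w ∼ N(0, α^{-1}I) and forming y = Φw gives a zero-mean Gaussian with covariance α^{-1}ΦΦ^⊤, as in Eq. (7).

import numpy as np

N, H, alpha = 60, 30, 1.0
x = np.linspace(-5, 5, N)
centers = np.linspace(-5, 5, H)

# Design matrix: Phi[n, h] = phi_h(x^(n)); Gaussian bumps are just one illustrative basis.
Phi = np.exp(-0.5 * (x[:, None] - centers[None, :]) ** 2)

# Draw w ~ N(0, alpha^{-1} I) and form y = Phi w  (Eq. (6)); each draw is one random function.
w = np.random.randn(H) / np.sqrt(alpha)
y = Phi @ w

# Integrating out w gives a zero-mean Gaussian with covariance <y y^T> = alpha^{-1} Phi Phi^T  (Eq. (7)).
K = Phi @ Phi.T / alpha

# Empirical check: the sample covariance over many draws of w approaches K.
W = np.random.randn(H, 10000) / np.sqrt(alpha)
Y = Phi @ W
print(np.abs(Y @ Y.T / 10000 - K).max())   # small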
Introduction to GPs (2)

    p(y) = N(y | 0, α^{-1}ΦΦ^⊤)   (8)

holds for any inputs {x_n}_{n=1}^N → this is the definition of a Gaussian process.

• When, for any inputs (x_1, x_2, ..., x_N), the corresponding outputs y = (y_1, y_2, ..., y_N) follow a Gaussian distribution, we say that p(y) follows a Gaussian process.
  − A Gaussian process is an infinite-dimensional Gaussian distribution.
  − Marginals of a Gaussian are again Gaussian, so in practice we only ever handle the finite dimensions where data actually exist.
• The Gaussian distribution is determined solely by the kernel function giving the elements of K = α^{-1}ΦΦ^⊤,

    k(x, x′) = α^{-1} φ(x)^⊤ φ(x′).   (9)

  − k(x, x′) measures how close x and x′ are: inputs x that are near each other → outputs y that are near each other.
Introduction to GPs (3)

• In practice, the observations carry noise ε:

    y = w^⊤φ(x) + ε,   ε ∼ N(0, β^{-1}I)
    ⟹ p(y | f) = N(w^⊤φ(x), β^{-1}I).   (10)

• Integrating out the intermediate f = w^⊤φ(x):

    p(y | x) = ∫ p(y | f) p(f | x) df   (11)
             = N(0, C).   (12)

  − Because this is a convolution of two independent Gaussians, the elements of C are sums of covariances:

    C(x_i, x_j) = k(x_i, x_j) + β^{-1} δ(i, j).   (13)

  − A GP can therefore be specified by the kernel function k(x, x′) and the hyperparameters α, β alone.
GP prediction

• The joint distribution of y_new (at a new input x_new) and the existing y is again Gaussian, so

    p(y_new | x_new, X, y, θ)   (17)
    = p((y, y_new) | (X, x_new), θ) / p(y | X, θ)   (18)
    ∝ exp( −(1/2) ( (y, y_new) \begin{pmatrix} K & k \\ k^⊤ & k \end{pmatrix}^{-1} \begin{pmatrix} y \\ y_new \end{pmatrix} − y^⊤ K^{-1} y ) )   (19)
    ∼ N( k^⊤ K^{-1} y,  k − k^⊤ K^{-1} k ).   (20)

  where
  − K = [k(x, x′)],
  − k = (k(x_new, x_1), ..., k(x_new, x_N)).
developed that have better scaling with training set size than the exact approach (Gibbs, 1997; Tresp, 2001; Smola and Bartlett, 2001; Williams and Seeger, 2001; Csató and Opper, 2002; Seeger et al., 2003). Practical issues in the application of Gaussian processes are discussed in Bishop and Nabney (2008).
We have introduced Gaussian process regression for the case of a single target variable. The extension of this formalism to multiple target variables, known as co-kriging (Cressie, 1993), is straightforward.

[Figure] Illustration of Gaussian process regression applied to the sinusoidal data set in Figure A.6 in which the three right-most data points have been omitted. The green curve shows the sinusoidal function from which the data points, shown in blue, are obtained by sampling and addition of Gaussian noise. The red line shows the mean of the Gaussian process predictive distribution, and the shaded region corresponds to plus and minus two standard deviations. Notice how the uncertainty increases in the region to the right of the data points.
¤ “How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?” [MacKay 1998]
¤ Bayesian Learning for Neural Networks [Neal 1996]
¤ Neal, Hinton, MacKay
¤ Kernel learning: [Wilson, 2014][Wilson and Adams, 2013][Lloyd et al., 2014][Yang et al., 2015]
¤ Gaussian process regression network [Wilson et al., 2012]
¤ Deep Gaussian processes: Damianou and Lawrence (2013)
¤ Salakhutdinov and Hinton (2008): GPs on features learned by a DBN
¤ Calandra et al. (2014): a NN input transformation to model sharp discontinuities
¤ This paper: a base kernel k applied to inputs transformed by a DNN g
¤ Base kernels: RBF and spectral mixture (SM) [Wilson and Adams, 2013]
4 Deep Kernel Learning

In this section we show how we can construct kernels which encapsulate the expressive power of deep architectures, and how to learn the properties of these kernels as part of a scalable probabilistic Gaussian process framework.

Specifically, starting from a base kernel k(x_i, x_j | θ) with hyperparameters θ, we transform the inputs (predictors) x as

    k(x_i, x_j | θ) → k(g(x_i, w), g(x_j, w) | θ, w),   (5)

where g(x, w) is a non-linear mapping given by a deep architecture, such as a deep convolutional network, parametrized by weights w. The popular RBF kernel (Eq. (3)) is a sensible choice of base kernel k(x_i, x_j | θ). For added flexibility, we also propose to use spectral mixture base kernels (Wilson and Adams, 2013):

    k_SM(x, x′ | θ) = \sum_{q=1}^{Q} a_q \frac{|Σ_q|^{1/2}}{(2π)^{D/2}} \exp\left(−\frac{1}{2} ‖Σ_q^{1/2}(x − x′)‖^2\right) \cos⟨x − x′, 2πμ_q⟩.   (6)

The parameters of the spectral mixture kernel θ = {a_q, Σ_q, μ_q} are mixture weights, bandwidths (inverse length-scales), and frequencies. The spectral mixture (SM) kernel, which forms an expressive basis for all stationary covariance functions, can discover quasi-periodic stationary structure with an interpretable and succinct representation, while the deep learning transformation g(x, w) captures non-stationary and hierarchical structure.
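The construction in Eq. (5) is easy to state in code. The sketch below pairs a small fully connected network g(·, w) with an RBF base kernel; it is a minimal illustration only — the framework (PyTorch), the architecture, and the output dimension are our assumptions, not the paper's Caffe/GPML setup.

import torch
import torch.nn as nn

class DeepRBFKernel(nn.Module):
    """k(x_i, x_j | theta, w) = RBF(g(x_i, w), g(x_j, w) | theta), as in Eq. (5)."""
    def __init__(self, d_in, d_feat=2):
        super().__init__()
        # g(x, w): a toy fully connected network (illustrative architecture).
        self.g = nn.Sequential(
            nn.Linear(d_in, 100), nn.ReLU(),
            nn.Linear(100, 50), nn.ReLU(),
            nn.Linear(50, d_feat),
        )
        # Base kernel hyperparameters theta (log-parametrized for positivity).
        self.log_lengthscale = nn.Parameter(torch.zeros(1))
        self.log_variance = nn.Parameter(torch.zeros(1))

    def forward(self, X1, X2):
        Z1, Z2 = self.g(X1), self.g(X2)       # transformed inputs g(x, w)
        d2 = torch.cdist(Z1, Z2).pow(2)       # squared distances in feature space
        ell2 = (2.0 * self.log_lengthscale).exp()
        return self.log_variance.exp() * torch.exp(-0.5 * d2 / ell2)

All parameters γ = {w, θ} are ordinary module parameters here, so they can be trained jointly by plugging the resulting covariance matrix into the log marginal likelihood of Eq. (4) and back-propagating, which is exactly the joint learning described in the excerpts below.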
¤ Inputs are passed through a DNN, and an RBF or SM base kernel is applied to the resulting features.

[Figure 1: Deep Kernel Learning] A Gaussian process with a deep kernel maps D-dimensional inputs x through L parametric hidden layers (weights W^{(1)}, ..., W^{(L)}), followed by a hidden layer with an infinite number of basis functions, with base kernel hyperparameters θ.

¤ The first L hidden layers form the parametric DNN g(x, w); the final, infinite layer corresponds to the GP with base kernel hyperparameters θ.
is determined by the interpretable length-scale hyperparameter ℓ. Shorter length-scales correspond to functions which vary more rapidly with the inputs x.

The structure of our data is discovered through learning interpretable kernel hyperparameters. The marginal likelihood of the targets y, the probability of the data conditioned only on kernel hyperparameters γ, provides a principled probabilistic framework for kernel learning:

    log p(y | γ, X) ∝ −[ y^⊤ (K_γ + σ² I)^{-1} y + log |K_γ + σ² I| ],   (4)

where we have used K_γ as shorthand for K_{X,X} given γ. Note that the expression for the log marginal likelihood in Eq. (4) pleasingly separates into automatically calibrated model fit and complexity terms (Rasmussen and Ghahramani, 2001). Kernel learning can be achieved by optimizing Eq. (4) with respect to γ.

The computational bottleneck for inference is solving the linear system (K_{X,X} + σ²I)^{-1} y, and for kernel learning is computing the log determinant log |K_{X,X} + σ²I| in the marginal likelihood. The standard approach is to compute the Cholesky decomposition of the n × n matrix K_{X,X}, which requires O(n³) operations and O(n²) storage. After inference is complete, the predictive mean costs O(n), and the predictive variance costs O(n²), per test point x_*.
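A minimal NumPy sketch of Eq. (4) via the Cholesky decomposition, which makes the O(n³) cost and O(n²) storage mentioned above explicit; constants are dropped, matching the proportionality in Eq. (4).

import numpy as np

def log_marginal_likelihood(K, y, sigma2):
    """Eq. (4) up to constants: -0.5 * (y^T (K + sigma^2 I)^{-1} y + log|K + sigma^2 I|)."""
    n = len(y)
    L = np.linalg.cholesky(K + sigma2 * np.eye(n))       # the O(n^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))  # (K + sigma^2 I)^{-1} y
    logdet = 2.0 * np.log(np.diag(L)).sum()              # log|K + sigma^2 I| from the Cholesky factor
    return -0.5 * (y @ alpha + logdet)

Kernel learning then amounts to maximizing this quantity with respect to the hyperparameters γ, e.g. by gradient-based optimization.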
function representation, our network effectively has a hidden layer with an infinite number of hidden units. The overall model is shown in Figure 1.

We emphasize, however, that we jointly learn all deep kernel hyperparameters, γ = {w, θ}, which include w, the weights of the network, and θ, the parameters of the base kernel, by maximizing the log marginal likelihood L of the Gaussian process (see Eq. (4)). Indeed compartmentalizing our model into a base kernel and deep architecture is for pedagogical clarity. When applying a Gaussian process one can use our deep kernel, which operates as a single unit, as a drop-in replacement for e.g., standard ARD or Matérn kernels (Rasmussen and Williams, 2006), since learning and inference follow the same procedures.

For kernel learning, we use the chain rule to compute derivatives of the log marginal likelihood with respect to the deep kernel hyperparameters:

    ∂L/∂θ = (∂L/∂K) (∂K/∂θ),      ∂L/∂w = (∂L/∂K) (∂K/∂g(x, w)) (∂g(x, w)/∂w).

The implicit derivative of the log marginal likelihood with respect to our n × n data covariance matrix K is given by

    ∂L/∂K = (1/2) ( K^{-1} y y^⊤ K^{-1} − K^{-1} ),   (7)

where we have absorbed the noise covariance σ²I into our covariance matrix, and treat it as part of the base kernel hyperparameters θ. ∂K/∂θ are the derivatives of the deep kernel with respect to the base kernel hyperparameters (such as length-scale), conditioned on the fixed transformation of the inputs g(x, w). Similarly, ∂K/∂g(x, w) are the implicit derivatives of the deep kernel with respect to g, holding θ fixed. The derivatives with respect to the weight variables ∂g(x, w)/∂w are computed using standard backpropagation.
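As a sanity check of Eq. (7), the short sketch below compares the closed-form gradient against a finite-difference estimate of the (constant-included) Gaussian log marginal likelihood, with the noise term already absorbed into K as stated above; the test matrix and vector are random, purely for illustration.

import numpy as np

def dL_dK(K, y):
    """Eq. (7): dL/dK = 0.5 * (K^{-1} y y^T K^{-1} - K^{-1})."""
    Kinv = np.linalg.inv(K)
    a = Kinv @ y
    return 0.5 * (np.outer(a, a) - Kinv)

def L(K, y):
    # Full Gaussian log marginal likelihood (constants included).
    _, logdet = np.linalg.slogdet(K)
    return -0.5 * (y @ np.linalg.solve(K, y) + logdet + len(y) * np.log(2 * np.pi))

rng = np.random.default_rng(0)
A = rng.normal(size=(5, 5))
K = A @ A.T + 5 * np.eye(5)          # random SPD covariance (noise absorbed)
y = rng.normal(size=5)

i, j, eps = 1, 3, 1e-6
E = np.zeros_like(K)
E[i, j] = E[j, i] = eps               # symmetric perturbation of K_ij and K_ji
numeric = (L(K + E, y) - L(K - E, y)) / (2 * eps)
G = dL_dK(K, y)
print(numeric, G[i, j] + G[j, i])     # the two values agree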
KISS-GP
¤ Replace the covariance matrix K with the KISS-GP approximation K_KISS = M K_{U,U} M^⊤ [Wilson and Nickisch, 2015][Wilson et al., 2015]
¤ M: sparse matrix of interpolation weights (4 non-zero entries per row)
¤ K_{U,U}: deep-kernel covariance over m inducing points placed on a regular lattice
¤ Inference solves linear systems with linear conjugate gradients (LCG), using only matrix-vector multiplications
¤ Training scales as O(n + h(m)), versus O(m²n + m³) for conventional inducing-point approximations [Quiñonero-Candela and Rasmussen, 2005]
For scalability, we replace all instances of K with the KISS-GP covariance matrix (Wilson and Nickisch, 2015; Wilson et al., 2015)

    K ≈ M K^{deep}_{U,U} M^⊤ := K_KISS,

where M is a sparse matrix of interpolation weights, containing only 4 non-zero entries per row for local cubic interpolation, and K_{U,U} is a covariance matrix created from our deep kernel, evaluated over m latent inducing points U = [u_i]_{i=1...m}. We place the inducing points over a regular multidimensional lattice, and exploit the resulting decomposition of K_{U,U} into a Kronecker product of Toeplitz matrices for extremely fast matrix vector multiplications (MVMs), without requiring any grid structure in the data inputs X or the transformed inputs g(x, w). Because KISS-GP operates by creating an approximate kernel which admits fast computations, and is independent from a specific inference and learning procedure, we can view the KISS approximation applied to our deep kernels as a stand-alone kernel, k(x, z) = m_x^⊤ K^{deep}_{U,U} m_z, which can be combined with Gaussian processes or other kernel machines for scalable learning.

For inference we solve K_KISS^{-1} y using linear conjugate gradients (LCG), an iterative procedure for solving linear systems which only involves matrix vector multiplications (MVMs). The number of iterations required for convergence to within machine precision is j ≪ n, and in practice j depends on the conditioning of the KISS-GP covariance matrix rather than the number of training points n. For estimating the log determinant in the marginal likelihood we follow the approach described in Wilson and Nickisch (2015) with extensions in Wilson et al. (2015).

KISS-GP training scales as O(n + h(m)) (where h(m) is typically close to linear in m), versus conventional scalable GP approaches which require O(m²n + m³) (Quiñonero-Candela and Rasmussen, 2005) computations and need m ≪ n for tractability, which results in severe deteriorations in predictive performance. The ability to have large m ≈ n allows KISS-GP to have near-exact accuracy in its approximation (Wilson and Nickisch, 2015), retaining a non-parametric representation, while providing linear scaling in n and O(1) time per test point prediction (Wilson et al., 2015).
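To illustrate the LCG step above, the sketch below solves (K_KISS + σ²I)^{-1} y with SciPy's conjugate gradient solver, using only matrix-vector products M(K_{U,U}(M^⊤v)); the dense random M and 1-D lattice are toy stand-ins, and the real method's sparse interpolation weights and Kronecker/Toeplitz structure of K_{U,U} are not exploited here.

import numpy as np
from scipy.sparse.linalg import LinearOperator, cg

rng = np.random.default_rng(0)
n, m, sigma2 = 2000, 200, 0.1

# Toy stand-ins: interpolation weights M (n x m) and inducing-point kernel K_UU (m x m).
M = rng.random((n, m)) * (rng.random((n, m)) < 4.0 / m)   # roughly 4 non-zeros per row (toy)
U = np.linspace(-5, 5, m)[:, None]
K_UU = np.exp(-0.5 * (U - U.T) ** 2)
y = rng.normal(size=n)

def mvm(v):
    # Matrix-vector product with M K_UU M^T + sigma^2 I, never forming the n x n matrix.
    return M @ (K_UU @ (M.T @ v)) + sigma2 * v

A = LinearOperator((n, n), matvec=mvm)
alpha, info = cg(A, y)      # linear conjugate gradients: only MVMs are needed
print(info, np.linalg.norm(mvm(alpha) - y))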
¤ Three experiments: regression on the UCI repository, face orientation extraction, and MNIST digit magnitude regression
¤ Implementation: DNNs in Caffe, KISS-GP in GPML
¤ DNNs are trained with SGD and ReLU activations, then combined with KISS-GP

UCI repository
¤ Fully connected DNN architectures: [d-1000-500-50-2] for n < 6,000 and [d-1000-1000-500-50-2] for n > 6,000
¤ Baseline GPs (RBF, SM, best) use Fastfood finite basis function expansions
Table 1: Comparative RMSE performance and runtime on UCI regression datasets, with n training points and d the input
dimensions. The results are averaged over 5 equal partitions (90% train, 10% test) of the data ± one standard deviation. The
best denotes the best-performing kernel according to Yang et al. (2015) (note that often the best performing kernel is GP-SM).
Following Yang et al. (2015), as exact Gaussian processes are intractable on the large data used here, the Fastfood finite basis
function expansions are used for approximation in GP (RBF, SM, Best). We verified on datasets with n < 10, 000 that exact
GPs with RBF kernels provide comparable performance to the Fastfood expansions. For datasets with n < 6, 000 we used a
fully-connected DNN with a [d-1000-500-50-2] architecture, and for n > 6000 we used a [d-1000-1000-500-50-2] architecture. We
consider scalable deep kernel learning (DKL) with RBF and SM base kernels. For the SM base kernel, we set Q = 4 for datasets
with n < 10, 000 training instances, and use Q = 6 for larger datasets.
Datasets | n | d | RMSE: GP (RBF) | GP (SM) | GP (best) | DNN | DKL (RBF) | DKL (SM) | Runtime(s): DNN | DKL (RBF) | DKL (SM)
Gas 2,565 128 0.21±0.07 0.14±0.08 0.12±0.07 0.11±0.05 0.11±0.05 0.09±0.06 7.43 7.80 10.52
Skillcraft 3,338 19 1.26±3.14 0.25±0.02 0.25±0.02 0.25±0.00 0.25±0.00 0.25±0.00 15.79 15.91 17.08
SML 4,137 26 6.94±0.51 0.27±0.03 0.26±0.04 0.25±0.02 0.24±0.01 0.23±0.01 1.09 1.48 1.92
Parkinsons 5,875 20 3.94±1.31 0.00±0.00 0.00±0.00 0.31±0.04 0.29±0.04 0.29±0.04 3.21 3.44 6.49
Pumadyn 8,192 32 1.00±0.00 0.21±0.00 0.20±0.00 0.25±0.02 0.24±0.02 0.23±0.02 7.50 7.88 9.77
PoleTele 15,000 26 12.6±0.3 5.40±0.3 4.30±0.2 3.42±0.05 3.28±0.04 3.11±0.07 8.02 8.27 26.95
Elevators 16,599 18 0.12±0.00 0.090±0.001 0.089±0.002 0.099±0.001 0.084±0.002 0.084±0.002 8.91 9.16 11.77
Kin40k 40,000 8 0.34±0.01 0.19±0.02 0.06±0.00 0.11±0.01 0.05±0.00 0.03±0.01 19.82 20.73 24.99
Protein 45,730 9 1.64±1.66 0.50±0.02 0.47±0.01 0.49±0.01 0.46±0.01 0.43±0.01 142.8 154.8 144.2
KEGG 48,827 22 0.33±0.17 0.12±0.01 0.12±0.01 0.12±0.01 0.11±0.00 0.10±0.01 31.31 34.23 61.01
CTslice 53,500 385 7.13±0.11 2.21±0.06 0.59±0.07 0.41±0.06 0.36±0.01 0.34±0.02 36.38 44.28 80.44
KEGGU 63,608 27 0.29±0.12 0.12±0.00 0.12±0.00 0.12±0.00 0.11±0.00 0.11±0.00 39.54 42.97 41.05
3Droad 434,874 3 12.86±0.09 10.34±0.19 9.90±0.10 7.36±0.07 6.91±0.04 6.91±0.04 238.7 256.1 292.2
Song 515,345 90 0.55±0.00 0.46±0.00 0.45±0.00 0.45±0.02 0.44±0.00 0.43±0.01 517.7 538.5 589.8
Buzz 583,250 77 0.88±0.01 0.51±0.01 0.51±0.01 0.49±0.00 0.48±0.00 0.46±0.01 486.4 523.3 769.7
Electric 2,049,280 11 0.230±0.000 0.053±0.000 0.053±0.000 0.058±0.002 0.050±0.002 0.048±0.002 3458 3542 4881
¤ The Olivetti face data, 28×28 images [Salakhutdinov and Hinton (2008)]

[Figure 2] Left: Randomly sampled examples of the training and test data (labels are orientation angles, e.g. 36.15, −43.10, −3.49, 17.35, −19.81). Right: The two-dimensional outputs of the convolutional network on a set of test cases. Each point is shown as a line segment that has the same orientation as the input face.
5.1 UCI regression tasks

We consider a large set of UCI regression problems of varying sizes and properties. Table 1 reports test root mean squared error (RMSE) for 1) many scalable Gaussian process kernel learning methods based on Fastfood (Yang et al., 2015); 2) stand-alone deep neural networks (DNNs); and 3) our proposed combined deep kernel learning (DKL) model using both RBF and SM base kernels.

For smaller datasets, where the number of training examples n < 6,000, we used a fully-connected neural network with a d-1000-500-50-2 architecture; for larger datasets we used a d-1000-1000-500-50-2 architecture.
A Appendix

A.1 Convolutional network architecture

Table 3 lists the architecture of the convolutional networks used in the tasks of face orientation extraction (Section 5.2) and digit magnitude extraction (Section 5.3). The CNN architecture originates from LeNet (LeCun et al., 1998), designed for digit classification, and is adapted to the above tasks with one or two additional fully-connected layers for feature transformation.

Layer       conv1  pool1  conv2  pool2  full3  full4  full5  full6
kernel size 5×5    2×2    5×5    2×2    -      -      -      -
stride      1      2      1      2      -      -      -      -
channel     20     20     50     50     1000   500    50     2

Table 3: The architecture of the convolutional network used in face orientation extraction. The CNN used in the MNIST digit magnitude regression has a similar architecture except that the full3 layer is omitted. Both pool1 and pool2 are max pooling layers. A ReLU layer is placed after full3 and full4.
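For concreteness, a minimal PyTorch sketch of the Table 3 architecture, assuming 28×28 single-channel inputs (the size quoted for the Olivetti data above); with 5×5 convolutions (no padding) and 2×2 max pooling, the flattened feature size works out to 50·4·4 = 800, which is our arithmetic rather than a number given in the table.

import torch.nn as nn

# Table 3 (face orientation extraction); 28x28 grayscale input is an assumption from the slides.
cnn = nn.Sequential(
    nn.Conv2d(1, 20, kernel_size=5, stride=1),    # conv1: 28 -> 24
    nn.MaxPool2d(2, 2),                           # pool1: 24 -> 12 (max pooling)
    nn.Conv2d(20, 50, kernel_size=5, stride=1),   # conv2: 12 -> 8
    nn.MaxPool2d(2, 2),                           # pool2: 8 -> 4  (max pooling)
    nn.Flatten(),                                 # 50 * 4 * 4 = 800 features
    nn.Linear(800, 1000), nn.ReLU(),              # full3 (ReLU after full3, per the caption)
    nn.Linear(1000, 500), nn.ReLU(),              # full4 (ReLU after full4)
    nn.Linear(500, 50),                           # full5
    nn.Linear(50, 2),                             # full6: 2-d output fed to the base kernel
)

For the MNIST variant, the caption says full3 is omitted, so full4 would take the flattened features directly.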
¤ Face orientation: DKL is trained on 12,000 labeled instances, whereas DBN-GP (as with the GP baseline) uses only 1,000 labels and models the remaining data through unsupervised DBN pretraining
¤ Comparison in RMSE (Table 2)
Table 2: RMSE performance on the Olivetti and MNIST. For comparison, in the face orientation
extraction, we trained DKL on the same amount (12,000) of training instances as with DBN-GP, but
used all labels; whereas DBN-GP (as with GP) scaled to only 1,000 labeled images and modeled the
remaining data through unsupervised pretraining of DBN. We used RBF base kernel within GPs.
Datasets GP DBN-GP CNN DKL
Olivetti 16.33 6.42 6.34 6.07
MNIST 1.25 1.03 0.59 0.53
combining KISS-GP with DNNs for deep kernels introduces only negligible runtime costs:
KISS-GP imposes an additional runtime of about 10% (one order of magnitude less than)
the runtime a DNN typically requires. Overall, these results show the general applicability
and practical significance of our scalable DKL approach.
5.2 Face orientation extraction
We now consider the task of predicting the orientation of a face extracted from a gray-
[Figure 3] Left: RMSE vs. n, the number of training examples. Middle: Runtime vs. n. Right: Total training time vs. n. The dashed line in black indicates a slope of 1. Convolutional networks are used within DKL. We set Q = 4 for the SM kernel.
¤ Comparison of the learned log spectral densities of the spectral mixture (SM) and RBF base kernels

[Figure 4] The log spectral densities of the DKL-SM and DKL-SE base kernels are in black and red, respectively.

We further see the benefit of an SM base kernel in Figure 5, where we show the learned covariance matrices constructed from the whole deep kernels (composition of base kernel and deep architecture) for RBF and SM base kernels. The covariance matrix is evaluated on a set of test inputs, where we randomly sample 400 instances from the test set and sort them in terms of the orientation angles of the input faces. We see that the deep kernels with both RBF and SM base kernels discover that faces with similar rotation angles are highly correlated.
¤ Induced covariance matrices of the learned deep kernels (DKL-SM, DKL-RBF) and of a regular RBF kernel, evaluated on test faces sorted by orientation angle

[Figure 5] Left: The induced covariance matrix using the DKL-SM kernel on a set of test cases, where the test samples are ordered according to the orientations of the input faces. Middle: The respective covariance matrix using the DKL-RBF kernel. Right: The respective covariance matrix using a regular RBF kernel. The models are trained with n = 12,000. We set Q = 4 for the SM base kernel.
5.3 Digit magnitude extraction
MNIST
¤ Regression of digit magnitude on MNIST; performance compared in RMSE (Table 2)
¤ Recovering a step function (Figure 6)

[Figure 6] Recovering a step function. We show the predictive mean and 95% of the predictive probability mass for regular GPs with RBF and SM kernels, and DKL with SM base kernel.
¤ DKL combines deep architectures with the non-parametric flexibility of Gaussian processes
¤ Scalability comes from combining deep kernels with KISS-GP
¤ Spectral mixture base kernels provide additional flexibility
¤ DKL outperforms stand-alone GPs and DNNs on a wide range of datasets
AI-proof your career by Olivier Vroom and David WIlliamson
UXPA Boston
 
Developing System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptxDeveloping System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptx
wondimagegndesta
 
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
Build With AI - In Person Session Slides.pdf
Build With AI - In Person Session Slides.pdfBuild With AI - In Person Session Slides.pdf
Build With AI - In Person Session Slides.pdf
Google Developer Group - Harare
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
Com fer un pla de gestió de dades amb l'eiNa DMP (en anglès)
CSUC - Consorci de Serveis Universitaris de Catalunya
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Artificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptxArtificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptx
03ANMOLCHAURASIYA
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptxTop 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
mkubeusa
 
Agentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community MeetupAgentic Automation - Delhi UiPath Community Meetup
Agentic Automation - Delhi UiPath Community Meetup
Manoj Batra (1600 + Connections)
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 
Unlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web AppsUnlocking Generative AI in your Web Apps
Unlocking Generative AI in your Web Apps
Maximiliano Firtman
 
AI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamsonAI-proof your career by Olivier Vroom and David WIlliamson
AI-proof your career by Olivier Vroom and David WIlliamson
UXPA Boston
 
Developing System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptxDeveloping System Infrastructure Design Plan.pptx
Developing System Infrastructure Design Plan.pptx
wondimagegndesta
 
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025Zilliz Cloud Monthly Technical Review: May 2025
Zilliz Cloud Monthly Technical Review: May 2025
Zilliz
 
How to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabberHow to Install & Activate ListGrabber - eGrabber
How to Install & Activate ListGrabber - eGrabber
eGrabber
 
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
RTP Over QUIC: An Interesting Opportunity Or Wasted Time?
Lorenzo Miniero
 
fennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solutionfennec fox optimization algorithm for optimal solution
fennec fox optimization algorithm for optimal solution
shallal2
 
Slack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teamsSlack like a pro: strategies for 10x engineering teams
Slack like a pro: strategies for 10x engineering teams
Nacho Cougil
 
Mastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B LandscapeMastering Testing in the Modern F&B Landscape
Mastering Testing in the Modern F&B Landscape
marketing943205
 
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Challenges in Migrating Imperative Deep Learning Programs to Graph Execution:...
Raffi Khatchadourian
 
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Kit-Works Team Study_아직도 Dockefile.pdf_김성호
Wonjun Hwang
 
Artificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptxArtificial_Intelligence_in_Everyday_Life.pptx
Artificial_Intelligence_in_Everyday_Life.pptx
03ANMOLCHAURASIYA
 
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Enterprise Integration Is Dead! Long Live AI-Driven Integration with Apache C...
Markus Eisele
 
Building the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdfBuilding the Customer Identity Community, Together.pdf
Building the Customer Identity Community, Together.pdf
Cheryl Hung
 
AsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API DesignAsyncAPI v3 : Streamlining Event-Driven API Design
AsyncAPI v3 : Streamlining Event-Driven API Design
leonid54
 
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptxTop 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
Top 5 Benefits of Using Molybdenum Rods in Industrial Applications.pptx
mkubeusa
 
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Crazy Incentives and How They Kill Security. How Do You Turn the Wheel?
Christian Folini
 

(DL hacks reading group) Deep Kernel Learning

• 2. Wilson et al. (arXiv, 2015/11/6), Carnegie Mellon University; the authors include Salakhutdinov. The paper combines deep learning with Gaussian processes.
• 3. Gaussian processes (review of the paper's Section 3, following the notation of Wilson et al. (2015); see Rasmussen and Williams (2006) for a comprehensive discussion). Given a dataset D of n input vectors X = {x_1, ..., x_n}, each of dimension D, indexing an n × 1 vector of targets y = (y(x_1), ..., y(x_n))^T: if f(x) ∼ GP(μ, k_γ), then any collection of function values has a joint Gaussian distribution,
f = f(X) = [f(x_1), ..., f(x_n)]^T ∼ N(μ, K_{X,X}),   (1)
with mean vector μ_i = μ(x_i) and covariance matrix (K_{X,X})_{ij} = k_γ(x_i, x_j) determined by the mean function and covariance kernel of the GP; the kernel k_γ is parametrized by γ. Assuming additive Gaussian noise, y(x) | f(x) ∼ N(y(x); f(x), σ²), the predictive distribution of the GP at the n_* test points X_* is
f_* | X_*, X, y, γ, σ² ∼ N(E[f_*], cov(f_*)),   (2)
E[f_*] = μ_{X_*} + K_{X_*,X} [K_{X,X} + σ² I]^{-1} y,
cov(f_*) = K_{X_*,X_*} - K_{X_*,X} [K_{X,X} + σ² I]^{-1} K_{X,X_*},
where K_{X_*,X} is the n_* × n matrix of covariances between the GP evaluated at X_* and X, μ_{X_*} is the n_* × 1 mean vector, and K_{X,X} is the n × n covariance matrix at the training inputs; all kernel matrices implicitly depend on the hyperparameters γ. [Slide figure: functions sampled from a GP prior with an SE kernel, and samples from the GP posterior after conditioning on 5 noise-free observations; the shaded area is E[f(x)] ± 2 std(f(x)).]
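To make the joint-Gaussian view concrete, here is a minimal NumPy sketch (not from the slides or the paper's code) that draws a few functions from a zero-mean GP prior with an RBF kernel, in the spirit of the prior samples in the figure above; the function and parameter names are my own.

```python
import numpy as np

def rbf_kernel(X1, X2, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel k(x, x') = s^2 exp(-||x - x'||^2 / (2 l^2))."""
    d2 = np.sum(X1**2, 1)[:, None] + np.sum(X2**2, 1)[None, :] - 2 * X1 @ X2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

# Draw a few functions from the GP prior f ~ N(0, K_{X,X}), as in Eq. (1).
X = np.linspace(-5, 5, 100)[:, None]
K = rbf_kernel(X, X) + 1e-8 * np.eye(len(X))   # small jitter for numerical stability
f_samples = np.random.multivariate_normal(np.zeros(len(X)), K, size=3)
```

Conditioning such a prior on observations yields the posterior samples described in the figure; the predictive equations themselves are exercised in the sketch after slide 6 below.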
• 4. GP from the weight-space view. Writing y^(1), ..., y^(N) jointly gives the matrix form y = Φw, where Φ is the design matrix with entries Φ_{nh} = φ_h(x^(n)) (Eq. 6). If the weights follow a Gaussian prior p(w) = N(0, α^{-1} I), then y = Φw is also Gaussian, with mean 0 and covariance ⟨y y^T⟩ = Φ⟨w w^T⟩Φ^T = α^{-1} Φ Φ^T (Eq. 7). Hence p(y) = N(y | 0, α^{-1} Φ Φ^T) (Eq. 8) holds for any set of inputs {x_n}, which is the definition of a Gaussian process: when the outputs y = (y_1, ..., y_N) corresponding to any inputs (x_1, ..., x_N) are jointly Gaussian, p(y) is said to follow a GP. A GP is an infinite-dimensional Gaussian distribution, but because marginals of a Gaussian are again Gaussian, in practice only the finite dimensions where data exist are needed. The distribution is determined entirely by the kernel function k(x, x') = α^{-1} φ(x)^T φ(x') (Eq. 9), the elements of K = α^{-1} Φ Φ^T; k(x, x') measures how close x and x' are, so nearby inputs x give nearby outputs y.
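The weight-space derivation can be checked numerically: under the assumptions stated above (a finite basis φ and Gaussian weights), sampling y = Φw reproduces a draw from N(0, α^{-1} Φ Φ^T). The sketch below uses a toy Gaussian-basis feature map; all names and sizes are illustrative, not from the slides.

```python
import numpy as np

def phi(x, centers, width=1.0):
    """Gaussian basis functions phi_h(x) = exp(-(x - c_h)^2 / (2 width^2)) (a hypothetical choice)."""
    return np.exp(-0.5 * (x[:, None] - centers[None, :])**2 / width**2)

alpha = 1.0                       # weight precision, p(w) = N(0, alpha^{-1} I)
centers = np.linspace(-3, 3, 20)  # basis centers
x = np.linspace(-3, 3, 50)

Phi = phi(x, centers)             # design matrix (Eq. 6)
K = Phi @ Phi.T / alpha           # covariance <y y^T> = alpha^{-1} Phi Phi^T (Eq. 7)

# Sampling y = Phi w with w ~ N(0, alpha^{-1} I) gives a draw from N(0, K) (Eq. 8).
w = np.random.randn(len(centers)) / np.sqrt(alpha)
y = Phi @ w
```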
• 5. Observation noise. In practice the observations carry noise ε: y = w^T φ(x) + ε with ε ∼ N(0, β^{-1} I), so p(y | f) = N(w^T φ(x), β^{-1} I) (Eq. 10). Integrating out the intermediate f = w^T φ(x) gives p(y | x) = ∫ p(y | f) p(f | x) df = N(0, C) (Eqs. 11-12). Since this is a convolution of two independent Gaussians, the elements of C are sums of covariances: C(x_i, x_j) = k(x_i, x_j) + β^{-1} δ(i, j) (Eq. 13). A GP can therefore be expressed entirely by the kernel function k(x, x') and the hyperparameters α, β.
• 6. GP prediction. The joint distribution of a new output y_new and the observed y is again Gaussian, so
p(y_new | x_new, X, y, θ) = p((y, y_new) | (X, x_new), θ) / p(y | X, θ)   (Eqs. 17-18)
∝ exp( -1/2 ( [y, y_new] [[K, k], [k^T, k]]^{-1} [y, y_new]^T - y^T K^{-1} y ) )   (Eq. 19)
∼ N(k^T K^{-1} y, k - k^T K^{-1} k),   (Eq. 20)
where K = [k(x, x')] is the covariance over the training inputs, k = (k(x_new, x_1), ..., k(x_new, x_N)), and the scalar k in the block matrix and in Eq. (20) is k(x_new, x_new).
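Slides 5 and 6 together give the standard GP regression recipe: add β^{-1} (noise_var below) to the diagonal of K (cf. Eq. 13) and apply the predictive mean and variance of Eq. (20). A minimal sketch, assuming the rbf_kernel helper from the earlier block and inputs stored as (n, d) arrays; it is not the paper's implementation.

```python
import numpy as np

def gp_predict(X_train, y_train, X_test, kernel, noise_var=0.1):
    """GP posterior: mean = k^T K^{-1} y, var = k(x*, x*) - k^T K^{-1} k,
    with the noisy covariance C = K + noise_var * I (Eq. 13)."""
    K = kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = kernel(X_train, X_test)            # n x n* cross-covariances
    K_ss = kernel(X_test, X_test)
    L = np.linalg.cholesky(K)                # stable solves against K
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                     # predictive mean, Eq. (20)
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                     # predictive covariance, Eq. (20)
    return mean, np.diag(cov)
```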
• 7. Example: GP regression applied to the sinusoidal data set in which the three right-most data points have been omitted. The green curve is the underlying sinusoid, the blue points are noisy samples obtained from it, the red line is the mean of the GP predictive distribution, and the shaded region corresponds to plus and minus two standard deviations; notice how the uncertainty increases in the region to the right of the data. Approximations with better scaling in training-set size than the exact approach have been developed (Gibbs, 1997; Tresp, 2001; Smola and Bartlett, 2001; Williams and Seeger, 2001; Csató and Opper, 2002; Seeger et al., 2003); practical issues are discussed in Bishop and Nabney (2008), and the extension to multiple target variables is known as co-kriging (Cressie, 1993).
• 9. "How can Gaussian processes possibly replace neural networks? Have we thrown the baby out with the bathwater?" [MacKay, 1998]. Bayesian Learning for Neural Networks [Neal, 1996]: Neal, a doctoral student of Hinton, showed that a Bayesian neural network with infinitely many hidden units converges to a Gaussian process, the result behind MacKay's question above. More recent work learns flexible, expressive kernels to bring some of that representational power back to GPs [Wilson, 2014][Wilson and Adams, 2013][Lloyd et al., 2014][Yang et al., 2015].
• 10. Prior work combining GPs with deep models: Gaussian process regression networks [Wilson et al., 2012]; deep Gaussian processes [Damianou and Lawrence, 2013]; Salakhutdinov and Hinton (2008), who use a deep belief network (DBN) to learn a covariance kernel for a GP; and Calandra et al. (2014), who transform the inputs with a neural network so that a GP can model sharp discontinuities.
• 11. Deep kernel learning: construct kernels that encapsulate the expressive power of deep architectures, and learn their properties within a scalable probabilistic GP framework. Starting from a base kernel k(x_i, x_j | θ) with hyperparameters θ, the inputs are transformed by a non-linear mapping g(x, w) given by a deep architecture (e.g. a deep convolutional network) with weights w:
k(x_i, x_j | θ) → k(g(x_i, w), g(x_j, w) | θ, w).   (5)
The RBF kernel (Eq. 3) is a sensible base kernel; for added flexibility, spectral mixture (SM) base kernels [Wilson and Adams, 2013] are also proposed:
k_SM(x, x' | θ) = Σ_{q=1}^{Q} a_q (|Σ_q|^{1/2} / (2π)^{D/2}) exp( -1/2 ||Σ_q^{1/2} (x - x')||² ) cos⟨x - x', 2π μ_q⟩.   (6)
Its parameters θ = {a_q, Σ_q, μ_q} are mixture weights, bandwidths (inverse length-scales), and frequencies. The SM kernel forms an expressive basis for all stationary covariance functions and can discover quasi-periodic stationary structure with an interpretable, succinct representation, while the deep transformation g(x, w) captures non-stationary and hierarchical structure.
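A minimal sketch of Eq. (5): a toy MLP stands in for the paper's deep architecture g(x, w), and an RBF base kernel stands in for the RBF/SM choice. The layer sizes, initialization, and helper names are illustrative and not the paper's architecture.

```python
import numpy as np

def mlp_features(X, weights):
    """Toy deep transform g(x, w): a stack of tanh layers (stand-in for the paper's DNN/CNN)."""
    h = X
    for W, b in weights:
        h = np.tanh(h @ W + b)
    return h

def rbf_kernel(Z1, Z2, lengthscale=1.0, variance=1.0):
    d2 = np.sum(Z1**2, 1)[:, None] + np.sum(Z2**2, 1)[None, :] - 2 * Z1 @ Z2.T
    return variance * np.exp(-0.5 * d2 / lengthscale**2)

def deep_kernel(X1, X2, weights, theta):
    """Eq. (5): k(x_i, x_j | theta) -> k(g(x_i, w), g(x_j, w) | theta, w)."""
    return rbf_kernel(mlp_features(X1, weights), mlp_features(X2, weights), **theta)

# Example: a tiny [d-10-2] network (hypothetical sizes, far smaller than the paper's).
rng = np.random.default_rng(0)
d = 5
weights = [(rng.standard_normal((d, 10)) * 0.5, np.zeros(10)),
           (rng.standard_normal((10, 2)) * 0.5, np.zeros(2))]
X = rng.standard_normal((8, d))
K = deep_kernel(X, X, weights, theta={"lengthscale": 1.0, "variance": 1.0})
```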
• 12. Architecture (Figure 1 of the paper): a Gaussian process with a deep kernel maps D-dimensional inputs x through L parametric hidden layers W^(1), ..., W^(L), followed by a hidden layer with an infinite number of basis functions whose base-kernel hyperparameters are θ. The parametric part is a DNN (or CNN), and the base kernel is RBF or SM.
• 13. Learning. All deep kernel hyperparameters γ = {w, θ}, i.e. the network weights w and the base-kernel parameters θ, are learned jointly by maximizing the log marginal likelihood L of the GP:
log p(y | γ, X) ∝ -[ y^T (K_γ + σ² I)^{-1} y + log |K_γ + σ² I| ],   (4)
where K_γ is shorthand for K_{X,X} given γ; the expression separates into automatically calibrated model-fit and complexity terms (Rasmussen and Ghahramani, 2001). Gradients follow from the chain rule:
∂L/∂θ = (∂L/∂K_γ)(∂K_γ/∂θ),   ∂L/∂w = (∂L/∂K_γ)(∂K_γ/∂g(x, w))(∂g(x, w)/∂w),
with ∂L/∂K_γ = 1/2 (K_γ^{-1} y y^T K_γ^{-1} - K_γ^{-1}) (7), where the noise covariance σ² I is absorbed into the covariance matrix and treated as part of the base-kernel hyperparameters; ∂g(x, w)/∂w is computed by standard backpropagation. Splitting the model into a base kernel and a deep architecture is only for clarity: the deep kernel operates as a single unit and can be used as a drop-in replacement for standard kernels (e.g. ARD or Matérn), since learning and inference follow the same procedures. Exact inference requires solving (K_{X,X} + σ² I)^{-1} y, and kernel learning requires log |K_{X,X} + σ² I|; the standard Cholesky approach costs O(n³) operations and O(n²) storage, with O(n) predictive mean and O(n²) predictive variance per test point.
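The objective in Eq. (4) can be evaluated with a Cholesky factorization, as in the standard GP sketch below (not the paper's GPML/Caffe implementation); in practice the gradients with respect to {w, θ} would come from the chain-rule expressions above or from automatic differentiation.

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, noise_var):
    """log p(y | gamma, X) = -1/2 y^T (K + s^2 I)^{-1} y - 1/2 log|K + s^2 I| - n/2 log(2 pi), cf. Eq. (4)."""
    n = len(y)
    Ky = K + noise_var * np.eye(n)
    L = np.linalg.cholesky(Ky)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    fit = -0.5 * y @ alpha                      # model-fit term
    complexity = -np.sum(np.log(np.diag(L)))    # -(1/2) log|K + s^2 I|
    return fit + complexity - 0.5 * n * np.log(2 * np.pi)
```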
• 14. Scalability via KISS-GP [Wilson and Nickisch, 2015; Wilson et al., 2015]. Every instance of K is replaced by the KISS-GP covariance
K ≈ M K^deep_{U,U} M^T := K_KISS,   (8)
where M is a sparse matrix of interpolation weights with only 4 non-zero entries per row (local cubic interpolation) and K_{U,U} is the deep kernel evaluated over m latent inducing points U = [u_i]_{i=1...m}. The inducing points are placed on a regular multidimensional lattice, so K_{U,U} decomposes into a Kronecker product of Toeplitz matrices, giving extremely fast matrix-vector multiplications (MVMs) without requiring any grid structure in the inputs X or the transformed inputs g(x, w). Because the approximation is an approximate kernel independent of any particular inference procedure, it can be viewed as a stand-alone kernel k(x, z) = m_x^T K^deep_{U,U} m_z, usable with GPs or other kernel machines. Inference solves K_KISS^{-1} y with linear conjugate gradients (LCG), which only needs MVMs; the number of iterations j ≪ n depends on the conditioning of the KISS-GP covariance rather than on n. The log determinant in the marginal likelihood is estimated following Wilson and Nickisch (2015), with extensions in Wilson et al. (2015). Training scales as O(n + h(m)), with h(m) typically close to linear in m, versus O(m² n + m³) for conventional scalable GP approaches [Quiñonero-Candela and Rasmussen, 2005], which need m ≪ n for tractability and consequently suffer severe deteriorations in predictive performance. Allowing m ≈ n gives near-exact accuracy, retains a non-parametric representation, and provides linear scaling in n with O(1) time per test point prediction.
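A toy 1-D illustration of the structured kernel interpolation idea behind Eq. (8): build a regular grid of inducing points U, form sparse interpolation weights M, and approximate K by M K_{U,U} M^T. For brevity this sketch uses linear rather than the paper's local cubic interpolation, and it ignores the Kronecker/Toeplitz algebra that gives KISS-GP its speed; all names are illustrative.

```python
import numpy as np

def interp_weights(x, grid):
    """Sparse interpolation weights M (toy 1-D version, linear instead of cubic interpolation)."""
    M = np.zeros((len(x), len(grid)))
    for i, xi in enumerate(x):
        j = np.clip(np.searchsorted(grid, xi) - 1, 0, len(grid) - 2)
        w = (xi - grid[j]) / (grid[j + 1] - grid[j])
        M[i, j], M[i, j + 1] = 1 - w, w
    return M

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)

# K ~= M K_{U,U} M^T (Eq. 8) with m inducing points U on a regular grid.
x = np.sort(np.random.rand(200) * 10)
U = np.linspace(0, 10, 50)
M = interp_weights(x, U)
K_kiss = M @ rbf(U, U) @ M.T
err = np.max(np.abs(K_kiss - rbf(x, x)))   # small for a smooth kernel and a dense enough grid
```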
• 15. Experiments on three types of tasks: regression on UCI repository datasets, face orientation extraction, and MNIST digit magnitude extraction. The DNN components are implemented in Caffe and the KISS-GP components in GPML; the DNNs are trained with SGD using ReLU activations, and the DNN and KISS-GP parts are then combined and trained jointly as DKL.
• 16. UCI repository regression. Table 1 reports test RMSE and runtime on 16 UCI datasets (n training points, d input dimensions), averaged over 5 equal partitions (90% train, 10% test) ± one standard deviation, for 1) scalable GP kernel-learning baselines (RBF, SM, and the best-performing kernel according to Yang et al. (2015), which is often GP-SM), using Fastfood finite basis function expansions since exact GPs are intractable at this scale; 2) stand-alone DNNs; and 3) DKL with RBF and SM base kernels. For n < 6,000 a fully connected [d-1000-500-50-2] DNN is used, and for n > 6,000 a [d-1000-1000-500-50-2] DNN; Q = 4 SM components for n < 10,000 and Q = 6 for larger datasets. On datasets with n < 10,000 it was verified that exact GPs with RBF kernels perform comparably to the Fastfood expansions.
Table 1 (RMSE: GP-RBF / GP-SM / GP-best / DNN / DKL-RBF / DKL-SM; runtime in seconds: DNN / DKL-RBF / DKL-SM):
Gas (n=2,565, d=128): 0.21±0.07 / 0.14±0.08 / 0.12±0.07 / 0.11±0.05 / 0.11±0.05 / 0.09±0.06; 7.43 / 7.80 / 10.52
Skillcraft (n=3,338, d=19): 1.26±3.14 / 0.25±0.02 / 0.25±0.02 / 0.25±0.00 / 0.25±0.00 / 0.25±0.00; 15.79 / 15.91 / 17.08
SML (n=4,137, d=26): 6.94±0.51 / 0.27±0.03 / 0.26±0.04 / 0.25±0.02 / 0.24±0.01 / 0.23±0.01; 1.09 / 1.48 / 1.92
Parkinsons (n=5,875, d=20): 3.94±1.31 / 0.00±0.00 / 0.00±0.00 / 0.31±0.04 / 0.29±0.04 / 0.29±0.04; 3.21 / 3.44 / 6.49
Pumadyn (n=8,192, d=32): 1.00±0.00 / 0.21±0.00 / 0.20±0.00 / 0.25±0.02 / 0.24±0.02 / 0.23±0.02; 7.50 / 7.88 / 9.77
PoleTele (n=15,000, d=26): 12.6±0.3 / 5.40±0.3 / 4.30±0.2 / 3.42±0.05 / 3.28±0.04 / 3.11±0.07; 8.02 / 8.27 / 26.95
Elevators (n=16,599, d=18): 0.12±0.00 / 0.090±0.001 / 0.089±0.002 / 0.099±0.001 / 0.084±0.002 / 0.084±0.002; 8.91 / 9.16 / 11.77
Kin40k (n=40,000, d=8): 0.34±0.01 / 0.19±0.02 / 0.06±0.00 / 0.11±0.01 / 0.05±0.00 / 0.03±0.01; 19.82 / 20.73 / 24.99
Protein (n=45,730, d=9): 1.64±1.66 / 0.50±0.02 / 0.47±0.01 / 0.49±0.01 / 0.46±0.01 / 0.43±0.01; 142.8 / 154.8 / 144.2
KEGG (n=48,827, d=22): 0.33±0.17 / 0.12±0.01 / 0.12±0.01 / 0.12±0.01 / 0.11±0.00 / 0.10±0.01; 31.31 / 34.23 / 61.01
CTslice (n=53,500, d=385): 7.13±0.11 / 2.21±0.06 / 0.59±0.07 / 0.41±0.06 / 0.36±0.01 / 0.34±0.02; 36.38 / 44.28 / 80.44
KEGGU (n=63,608, d=27): 0.29±0.12 / 0.12±0.00 / 0.12±0.00 / 0.12±0.00 / 0.11±0.00 / 0.11±0.00; 39.54 / 42.97 / 41.05
3Droad (n=434,874, d=3): 12.86±0.09 / 10.34±0.19 / 9.90±0.10 / 7.36±0.07 / 6.91±0.04 / 6.91±0.04; 238.7 / 256.1 / 292.2
Song (n=515,345, d=90): 0.55±0.00 / 0.46±0.00 / 0.45±0.00 / 0.45±0.02 / 0.44±0.00 / 0.43±0.01; 517.7 / 538.5 / 589.8
Buzz (n=583,250, d=77): 0.88±0.01 / 0.51±0.01 / 0.51±0.01 / 0.49±0.00 / 0.48±0.00 / 0.46±0.01; 486.4 / 523.3 / 769.7
Electric (n=2,049,280, d=11): 0.230±0.000 / 0.053±0.000 / 0.053±0.000 / 0.058±0.002 / 0.050±0.002 / 0.048±0.002; 3458 / 3542 / 4881
• 17. Face orientation extraction: predicting the orientation of a face from a gray-scale image. The Olivetti face data, rescaled to 28×28 [Salakhutdinov and Hinton, 2008], with images of 30 people used for training and the remaining 10 people held out for testing. Figure 2 of the paper shows, on the left, randomly sampled training and test examples with orientation labels (e.g. -36.15, -43.10, -3.49, 17.35, -19.81 degrees) and, on the right, the two-dimensional outputs of the convolutional network on test cases, each drawn as a line segment with the same orientation as the input face. The CNN used inside DKL (paper Appendix A.1, Table 3) is adapted from LeNet (LeCun et al., 1998) with additional fully connected layers:
Layer: conv1 / pool1 / conv2 / pool2 / full3 / full4 / full5 / full6
kernel size: 5×5 / 2×2 / 5×5 / 2×2 / - / - / - / -
stride: 1 / 2 / 1 / 2 / - / - / - / -
channels or units: 20 / 20 / 50 / 50 / 1000 / 500 / 50 / 2
pool1 and pool2 are max-pooling layers, and ReLU follows full3 and full4; the MNIST digit-magnitude CNN uses the same architecture with full3 omitted.
• 18. Results (RMSE). Table 2: RMSE on Olivetti (face orientation) and MNIST. For the face task, DKL is trained on the same 12,000 training instances as DBN-GP but uses all labels, whereas DBN-GP (like GP) scaled to only 1,000 labeled images and modeled the remaining data through unsupervised DBN pretraining; the RBF base kernel is used within the GPs.
Olivetti: GP 16.33, DBN-GP 6.42, CNN 6.34, DKL 6.07
MNIST: GP 1.25, DBN-GP 1.03, CNN 0.59, DKL 0.53
Combining KISS-GP with DNNs adds only negligible runtime: KISS-GP imposes about 10% on top of the DNN's runtime, an order of magnitude less than the DNN itself. Figure 3 of the paper plots RMSE, runtime, and total training time against the number of training examples n; the dashed black line indicates a slope of 1, convolutional networks are used within DKL, and Q = 4 for the SM kernel.
• 19. Learned spectral densities. Figure 4 of the paper shows the log spectral densities of the learned DKL-SM and DKL-SE (RBF) base kernels, in black and red respectively, with frequency on the x-axis: the SM base kernel learns a richer spectrum with two distinct components, whereas the RBF (SE) kernel is restricted to a single Gaussian spectral density. Figure 5 then shows the covariance matrices built from the whole deep kernels (base kernel composed with the deep architecture) for RBF and SM base kernels, evaluated on 400 randomly sampled test inputs sorted by the orientation angle of the input face; with either base kernel, the deep kernel discovers that faces with similar rotation angles are highly correlated.
• 20. Learned covariance matrices (Figure 5 of the paper). Left: the induced covariance matrix of the DKL-SM kernel on test cases ordered by the orientation of the input faces; middle: the DKL-RBF kernel; right: a regular RBF kernel. Models are trained with n = 12,000 and Q = 4 for the SM base kernel. The deep kernels (DKL-SM and DKL-RBF) show strong correlation between faces with similar orientations, structure that the plain RBF kernel largely misses, illustrating what is gained by moving from RBF to DKL.
• 21. MNIST digit magnitude extraction: regressing the magnitude (label value) of a handwritten digit. As reported in Table 2, DKL (RMSE 0.53) outperforms GP (1.25), DBN-GP (1.03), and the stand-alone CNN (0.59). The CNN has the same LeNet-style architecture as in the face orientation task, with the full3 layer omitted.
• 22. Recovering a step function (Figure 6 of the paper): the predictive mean and 95% of the predictive probability mass are shown for regular GPs with RBF and SM kernels and for DKL with an SM base kernel. The deep kernel captures the sharp discontinuity that the stationary kernels smooth over, illustrating the expressive power DKL inherits from deep learning.
• 23. Summary. DKL combines the structural properties of deep learning with the non-parametric flexibility of GPs: a deep architecture is placed inside the kernel, scalability comes from KISS-GP, and spectral mixture base kernels add further flexibility. The network weights and base-kernel hyperparameters are learned jointly through the GP marginal likelihood, and DKL improves over both stand-alone GPs and DNNs at little extra computational cost.