Signum optimizer #9220
Conversation
eric-haibin-lin
reviewed
Jan 2, 2018
Thanks for the contribution!! Please see detailed comments in code.
@@ -57,6 +58,10 @@ class Optimizer(object):
The weight decay (or L2 regularization) coefficient. Modifies objective
by adding a penalty for having large weights.
+ wd_lh: float, optional
eric-haibin-lin
Jan 2, 2018
Contributor
I don't see a change in the Optimizer class constructor. Why is this changed?
yuxiangw
Jan 4, 2018
Contributor
I added that to the constructor at some point, because wd_lh is something more generally applicable to other algorithms too (in particular, Adam).
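For context, the distinction between the classic L2 penalty (wd) and the decoupled decay (wd_lh) can be written as a minimal NumPy sketch; the function names below are illustrative only, not part of this PR:

```python
import numpy as np

def sgd_l2_step(weight, grad, lr, wd):
    # Classic L2 regularization: the decay term is folded into the gradient.
    return weight - lr * (grad + wd * weight)

def sgd_decoupled_step(weight, grad, lr, wd_lh):
    # Decoupled weight decay (Loshchilov & Hutter): shrink the weight
    # directly, independently of the gradient-based step.
    return (1 - lr * wd_lh) * weight - lr * grad
```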
+
+@register
+class Signum(Optimizer):
+ """The SGD optimizer with momentum and weight decay.
eric-haibin-lin
Jan 2, 2018
Contributor
The one line summary should also mention it only takes the sign. Otherwise the readers don't know it until they see line 547
+ momentum : float, optional
+ The momentum value.
+ wd_lh : float, optional
+ The amount of decoupled weight decay regularization.
yuxiangw
Jan 4, 2018
Contributor
Added a temporary link to the PDF hosted on Jeremy's site; will update to the arXiv or a published version when they are ready.
+ float lr;
+ float wd;
+ float rescale_grad;
+ float clip_gradient;
eric-haibin-lin
Jan 2, 2018
Contributor
If the clip_gradient param has no effect on both SignSGD and Signum, can we just remove this param from signsgd_update and signum_update? That would also simplify the C++ kernels.
yuxiangw
Jan 4, 2018
Contributor
It has an effect on Signum, because using the raw gradient versus the clipped gradient to compute the momentum leads to different results.
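To make this concrete, a small NumPy sketch (the helper momentum_sign is illustrative, not from the PR): clipping never changes the sign of an individual gradient, so SignSGD is unaffected, but clipped gradients feed Signum's momentum buffer, whose sign can flip.

```python
import numpy as np

def momentum_sign(grads, momentum, clip_c=None):
    """Sign of the momentum buffer after processing a sequence of gradients."""
    state = 0.0
    for g in grads:
        if clip_c is not None:
            g = float(np.clip(g, -clip_c, clip_c))
        state = momentum * state + (1 - momentum) * g
    return np.sign(state)

grads = [10.0, -1.0]
print(momentum_sign(grads, momentum=0.9))            # 1.0  (unclipped buffer: 0.9*1.0 - 0.1 = 0.8)
print(momentum_sign(grads, momentum=0.9, clip_c=1))  # -1.0 (clipped buffer:   0.9*0.1 - 0.1 = -0.01)
```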
+
+NNVM_REGISTER_OP(signsgd_update)
+// MXNET_ADD_SPARSE_OP_ALIAS(signsgd_update)
+.describe(R"code(Update function for SignSGD optimizer.
yuxiangw
Jan 4, 2018
Contributor
Done, and added the math description block similar to the other optimizers.
+ weight = weight - learning_rate * sign(gradient)
+
+
+** Sparse matrix not supported for this optimizer yet.
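For quick reference, the rule in this docstring transcribed into a NumPy sketch (signsgd_step is an illustrative name, not part of the operator API):

```python
import numpy as np

def signsgd_step(weight, grad, lr):
    # Documented rule: weight = weight - learning_rate * sign(gradient)
    return weight - lr * np.sign(grad)

w = np.array([0.5, -0.2, 0.0])
g = np.array([3.0, -0.01, 0.0])
print(signsgd_step(w, g, lr=0.1))  # [ 0.4 -0.1  0. ] -- the step size ignores gradient magnitude
```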
eric-haibin-lin
Jan 2, 2018
Contributor
Same comment for documentation rendering and FInferStorageType in signum_update.
+
+Where the parameter ``momentum`` is the decay rate of momentum estimates at each epoch.
+
+** Sparse matrix not supported for this optimizer yet.
eric-haibin-lin
Jan 2, 2018
Contributor
Not sure if a sentence starting with ** renders well in the API doc. What about adding a "note" section like rint?
incubator-mxnet/src/operator/tensor/elemwise_unary_op_basic.cc
Lines 432 to 434 in ae70769
Also, term "sparse ndarray" instead of "sparse matrix" is preferred :)
+
+** Sparse matrix not supported for this optimizer yet.
+
+If weight and momentum are both of ``row_sparse`` storage type,
eric-haibin-lin
Jan 2, 2018
Contributor
I'd rather remove lines 81-87 since sparse update is not supported anyway.
+.set_attr_parser(ParamParser<SignumParam>)
+.set_attr<nnvm::FInferShape>("FInferShape", ElemwiseShape<3, 1>)
+.set_attr<nnvm::FInferType>("FInferType", ElemwiseType<3, 1>)
+.set_attr<FInferStorageType>("FInferStorageType", ElemwiseStorageType<3, 1, false, true, false>)
eric-haibin-lin
Jan 2, 2018
Contributor
Since the sparse implementation is missing for this operator, removing the FInferStorageType registration is fine; it will by default infer all storage types as "default" (dense).
+ return std::vector<uint32_t>{2};
+ })
+.set_attr<FCompute>("FCompute<cpu>", SignumUpdate<cpu>)
+// .set_attr<FComputeEx>("FComputeEx<cpu>", SGDMomUpdateEx<cpu>)
@@ -28,6 +28,14 @@
namespace mxnet {
namespace op {
+NNVM_REGISTER_OP(signsgd_update)
+.set_attr<FCompute>("FCompute<gpu>", SignSGDUpdate<gpu>);
+// .set_attr<FComputeEx>("FComputeEx<gpu>", SignSGDUpdateEx<gpu>);
There are new conflicts now. Do you mind resolving them again?
yuxiangw
added some commits
Dec 22, 2017
Done fixing the conflicts.
@lx75249 could you help review the code for cpp-package?
@eric-haibin-lin LGTM
+ signum_update(weight, grad, state, out=weight,
+ lr=lr, wd=wd, **kwargs)
+ else:
+ signsgd_update(weight, grad, out=weight,
+
+ rescaled_grad = rescale_grad * clip(grad, clip_gradient) + wd * weight
+ state = momentum * state + (1-momentum)*rescaled_grad
+ weight = (1 - lr * wd_lh) * weight - lr * sign(state)
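For reference, the documented Signum update transcribed into a NumPy sketch (signum_step is an illustrative name; the real kernel updates weight and state in place):

```python
import numpy as np

def signum_step(weight, grad, state, lr, wd, momentum, wd_lh,
                rescale_grad=1.0, clip_gradient=None):
    # Mirrors the docstring above: rescale and optionally clip the gradient,
    # add the L2 term, update the momentum buffer, then step by its sign
    # with decoupled weight decay applied to the weight.
    g = rescale_grad * grad
    if clip_gradient is not None:
        g = np.clip(g, -clip_gradient, clip_gradient)
    rescaled_grad = g + wd * weight
    state = momentum * state + (1 - momentum) * rescaled_grad
    weight = (1 - lr * wd_lh) * weight - lr * np.sign(state)
    return weight, state
```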
eric-haibin-lin
Jan 11, 2018
Contributor
Since wd_lh is new, I suggest putting a reference link to the original paper by Loshchilov and Hutter in the documentation.
+ kwargs['wd_lh'] = self.wd_lh
+
+ if state is not None:
+ signum_update(weight, grad, state, out=weight,
piiswrong
Jan 9, 2018
Contributor
Call these signum_momentum_update and signum_update to be consistent with the others.
yuxiangw
Jan 9, 2018
Contributor
RE: naming.
- Signum means SIGN momentUM, so the semantics of momentum is already in the name.
- SignSGD is the special case of Signum without momentum, and that name has been used before.
Unless we change the names in our paper, let's keep them the way they are.
eric-haibin-lin
reviewed
Jan 11, 2018
One final comment. Otherwise LGTM. Thanks for the contribution!
Added the reference to the documentation as suggested. Thanks guys for reviewing the PR!
piiswrong merged commit 5251b86 into apache:master on Jan 12, 2018
Thanks
yuxiangw commented Dec 28, 2017
Description
Added the C++ implementation of the Signum optimizer.
Bernstein, Wang, Azizzadenesheli and Anandkumar (2017) "The Signum optimiser: a theory of momentum in quantised stochastic optimisation"
Link to pdf
Also included is the implementation of an option, 'wd_lh', to apply the alternative version of weight decay regularization due to Loshchilov and Hutter.
"Fixing Weight Decay Regularization in Adam"
Link to arxiv
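A hedged usage sketch (assuming the optimizer is registered under the name 'signum' and exposes the momentum and wd_lh arguments described above):

```python
import mxnet as mx

# Select the Signum optimizer by its registered name; wd_lh enables the
# decoupled (Loshchilov-Hutter style) weight decay discussed in this PR.
opt = mx.optimizer.create('signum', learning_rate=0.01,
                          momentum=0.9, wd_lh=1e-4)
```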
Checklist
Essentials
(make lint)
Changes
Comments