Add BLAS3 and LAPACK routines #6538
Conversation
+.. autosummary::
+   :nosignatures:
+
+   la_gemm
piiswrong
Jun 2, 2017
Contributor
I'm not so sure about exposing them as operators...
would linalg_* be a better namespace?
asmushetzel
Jun 2, 2017
Contributor
No preference on my side concerning the namespace; we can choose whatever we like.
What is your concern about exposing them as operators? That's exactly what people need in order to build loss functions based on advanced linear algebra. Or are you referring just to gemm?
piiswrong
Jun 2, 2017
Contributor
I think linalg_* looks better, more consistent with numpy.
I guess this is fine then.
BTW are you planning to add the GPU version?
asmushetzel
Jun 2, 2017
Contributor
Yes, Zhenwen is already working on this. We want to get it done soon, as we want to provide a truly complete and scalable solution. Though I would prefer to get the CPU versions in first and the GPU ones in a separate PR, even if this means we have to do a bit of code restructuring afterwards (in order to unify the CPU/GPU abstraction).
I will rename the stuff to linalg.
+.set_attr<nnvm::TIsBackward>("TIsBackward", true)
+.set_attr<FCompute>("FCompute<cpu>", LaOpBackward<cpu, 2, 2, 4, 3, gemm_backward>);
+
+NNVM_REGISTER_OP(la_gemm2)
asmushetzel
Jun 2, 2017
Contributor
gemm: alpha*A*B + beta*C
gemm2: alpha*A*B
gemm2 is equivalent to gemm with beta=0, but it is more convenient because you don't have to supply a third input tensor.
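For reference (nothing new, just writing the two operators and their standard matrix-calculus gradients out explicitly, ignoring any optional transpose flags):

$$\mathrm{gemm}:\; D = \alpha\,AB + \beta\,C, \qquad \mathrm{gemm2}:\; D = \alpha\,AB$$

so for gemm the backward pass reduces to $dA = \alpha\, dD\, B^{\top}$, $dB = \alpha\, A^{\top} dD$, $dC = \beta\, dD$, and gemm2 is the same with the $C$ terms dropped.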
I may eventually have to pass a flag from config.mk down to the compile step that lets me detect whether compilation happens on OSX (but not sure yet). In case that is needed, do I have to pass such things down to mshadow? (It looks to me as if its make finally sets all the relevant "-D" flags for the compile.)
+
+// D = gemm(A,B,C)
+struct gemm {
+  template<typename xpu, typename DType>
piiswrong
Jun 2, 2017
Contributor
This is basically what I was talking about. I think we should support GPU/CPU/DType on this layer.
asmushetzel
Jun 2, 2017
Contributor
Agree that this makes sense. Is it OK if we add this higher-level abstraction (which then also involves the GPU) when we add the GPU versions?
piiswrong
Jun 3, 2017
Contributor
Sure. Any reason this is using structs and partial specialization instead of function overloading?
asmushetzel
Jun 4, 2017
Contributor
Due to the structure of the usual operators, at the end there must be core operator code that can be compiled with any of the supported types (though it may not be able to run with all of them), i.e. it must compile even when DType is 16-bit float, int, etc. But the lapack/blas3 code only supports 32/64-bit floats and will likely never support anything else. Template specialization with structs allows me to write a default implementation (which just errors out if it is ever called) plus specializations for 32/64-bit.
Function overloading would require writing overloads for every potential DType (as you cannot have both a default and specialized function templates).
asmushetzel
Jun 5, 2017
Contributor
The above is not fully true; it's just that function templates allow no partial specialization (only full specialization). Anyway: I wanted to have a generic function that covers all the processing that is common to these operators (LaOpForward/LaOpBackward) and that then gets parametrized with a template for the specific functionality (the laop parameter to LaOpForward/LaOpBackward). This needed the struct approach; I can't pass a partially specialized function template like that (at least I have found no way).
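To make the pattern concrete, here is a minimal self-contained sketch of the idea (simplified, illustrative names such as gemm_example and la_op_forward_example; this is not the PR's actual LaOpForward code):

```cpp
#include <cstdlib>
#include <iostream>

// Default implementation: compiles for every DType (fp16, int, ...) but
// errors out if it is ever actually called.
template<typename DType>
struct gemm_example {
  static void op(const DType*, const DType*, DType*, int) {
    std::cerr << "matrix multiplication is only implemented for float/double\n";
    std::exit(1);
  }
};

// Specializations for the types that BLAS actually supports.
template<>
struct gemm_example<float> {
  static void op(const float* A, const float* B, float* C, int n) {
    // ... call sgemm here ...
  }
};

template<>
struct gemm_example<double> {
  static void op(const double* A, const double* B, double* C, int n) {
    // ... call dgemm here ...
  }
};

// Generic driver holding the processing common to all operators,
// parametrized by the functor struct; a partially specialized *function*
// template could not be passed around like this.
template<typename DType, template<typename> class laop>
void la_op_forward_example(const DType* A, const DType* B, DType* C, int n) {
  // ... shape/type checks common to all linalg operators would go here ...
  laop<DType>::op(A, B, C, n);
}

// Usage: la_op_forward_example<float, gemm_example>(A, B, C, n);
```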
+.set_attr<nnvm::FInferType>("FInferType", ElemwiseType<2, 1>)
+.set_attr<FCompute>("FCompute<cpu>", LaOpForward<cpu, 2, 2, 2, 1, gemm2>)
+.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{"_backward_la_gemm2"})
+.add_arguments(LaMatrixMultParam::__FIELDS__())
piiswrong
Jun 2, 2017
Contributor
add_arguments(LaMatrixMultParam::__FIELDS__()) should go after the ndarray-or-symbol arguments.
piiswrong
Jun 2, 2017
Contributor
The order will affect the function signature in the frontend.
For example, this needs to be fixed: http://mxnet.io/api/python/ndarray.html#mxnet.ndarray.sample_gamma
I'll submit a PR soon.
+      }
+    }
+    for ( int i = 0; i < N; ++i ) {
+      dA.dptr_[i*(N+1)] += dB / A.dptr_[i*N+i];
asmushetzel
Jun 4, 2017
Contributor
Certainly. Though these linear algebra operators comprise a package for advanced numerical applications. For such applications, numerical stability is a big issue in general and is usually the caller's responsibility; at this low level, there is no way to ensure numerical stability anymore. For example, the matrices that are valid input to potrf must be symmetric and positive semidefinite, and this may not be the case because of rounding problems in prior steps of the algorithm. It is the algorithm designer's responsibility to add proper mechanisms such that everything is numerically stable. Dividing by zero is only one special case here, so no special mechanisms are added for it.
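For reference, the snippet above is the backward pass of sumlogdiag: with $B = \sum_i \log A_{ii}$ we have $\partial B / \partial A_{ii} = 1 / A_{ii}$, hence the accumulation $dA_{ii} \mathrel{+}= dB / A_{ii}$; a (near-)zero diagonal entry is exactly the case where the caller has to guarantee well-conditioned inputs.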
@@ -106,6 +106,11 @@ else
endif
endif
+# lapack may be a separate library
+ifeq ($(USE_BLAS), openblas)
+  LDFLAGS += -llapack
asmushetzel
Jun 6, 2017
Contributor
I'm currently tracing the details. It is available, but at least on Ubuntu I have to get it as a separate package. Guess I will have it worked out by tomorrow.
So far:
- The only consistent way to use lapack in MXNet is to interface with the Fortran code directly. openblas has lapacke.h, but ATLAS and "apple" (Accelerate framework) use clapack.h, and the latter two have completely different signatures: ATLAS is a proper C wrapper, while "apple" provides just C declarations matching the Fortran signatures. So it is easier to interface with the Fortran code in all cases.
- We have some builds that just use "blas" or even "NONE". These are builds for docs and also the mxnet PR build (see failure below). My plan is to support such builds (which are not intended to generate an executable that actually runs) by passing a -DMXNET_USE_LAPACK flag in all other builds. If this flag is not present, then our lapack wrappers will be compiled into stubs that raise an error at runtime. I don't want to completely skip compiling operators that use lapack, as we may get many more lapack-based functionalities in the future and it would be cumbersome to handle them all specially in such builds.
What do you think?
Do all the blas libraries include lapack by default? Does this PR support all lapack variants? Is lapack support turned on by default? Is there an option to turn it off?
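A minimal sketch of what such a stub could look like (illustrative only; the wrapper name potrf_wrapper_example is made up and the PR's actual wrapper signatures may differ, only the -DMXNET_USE_LAPACK flag is taken from the description above):

```cpp
#include <stdexcept>

#if MXNET_USE_LAPACK

// Fortran LAPACK symbol (column-major, trailing underscore).
extern "C" void dpotrf_(char* uplo, int* n, double* a, int* lda, int* info);

// Thin wrapper; info > 0 signals that the matrix was not positive definite.
inline int potrf_wrapper_example(double* a, int n) {
  char uplo = 'L';
  int lda = n, info = 0;
  dpotrf_(&uplo, &n, a, &lda, &info);
  return info;
}

#else

// Build without any LAPACK library: the symbol still exists so everything
// compiles and links, but a lapack-based operator fails loudly at runtime.
inline int potrf_wrapper_example(double*, int) {
  throw std::runtime_error("this MXNet build was compiled without LAPACK support");
}

#endif
```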
mli changed the title from La op to Add BLAS3 and LAPACK routines (Jun 7, 2017)
IMO this should all be done now. I have squashed all the commits into a single one and pushed again. Unfortunately there are some build failures, but they all seem to be unrelated (scala, some GitHub issue with cub, etc.). They seem to have popped up today when I incorporated the latest MXNet code. Eric, can you take a look? Maybe you have an idea. Has something been pushed recently that explains this? Summary of the changes:
Once again, I would be happy if someone else could look into the mysterious build errors. Everything compiled clean before, but now they pop up.
Sorry, what build errors exactly?
Jenkins failed on some nodes immediately with this: "[build-cpu] Running batch script". Any ideas?
+// - It is desired to add some basic checking in the C++-wrappers in order
+//   to catch simple mistakes when calling these wrappers.
+// - Must support compilation without lapack-package but issue runtime error in this case.
+
jli05
Jun 8, 2017
Contributor
May we add one line of comment stipulating the expected layout of input/output matrix (always row-major vs column-major)?
Also, maybe we should turn off lapack for amalgamation. How big is the lapack lib?
@@ -5,7 +5,7 @@ MAINTAINER Mu Li <muli@cs.cmu.edu>
# First, build MXNet binaries (ref mxnet/docker/cpu/Dockerfile)
#
-RUN apt-get update && apt-get install -y build-essential git libopenblas-dev libopencv-dev
+RUN apt-get update && apt-get install -y build-essential git libopenblas-dev liblapack-dev libopencv-dev
jli05
Jun 9, 2017
Contributor
Please note that for Ubuntu 14.04 LTS, the libopenblas.so within libopenblas-base or libopenblas-dev packs in only the BLAS routines, not LAPACK. Only from Ubuntu 16.04 LTS does the libopenblas.so in the official packages contain all the BLAS and LAPACK routines.
If we install liblapack-dev, that's the unoptimised Netlib version, which bears no relation to OpenBLAS.
To rectify this, we have to base the test on the latest Ubuntu LTS (aka 16.04) image and remove liblapack-dev; using just libopenblas-base or libopenblas-dev would suffice to provide all the BLAS/LAPACK routines.
(If we install libblas-dev or liblapack-dev, that's the unoptimised Netlib version. We can run sudo update-alternatives --display liblapack.so to check.)
jli05
Jun 9, 2017
Contributor
Another thing is that we'd need to patch one or two other small things in the Docker test run file, as the Ubuntu 16.04 LTS Docker image is much more compact than 14.04 and has removed packages such as sudo; but there should barely be any need for sudo when running inside the container.
piiswrong
Jun 9, 2017
Contributor
We can't break the build for Ubuntu 14.04 users.
I see three possible options:
- add USE_LAPACK in config.mk and turn it off by default
- automatically detect whether a lapack lib is available and turn lapack on/off accordingly
- build our own lapack during make
jli05
Jun 9, 2017
Contributor
Did you mean Ubuntu 14.04 as the Host OS? Can we use Ubuntu 16.04 LTS as base Image for building the Docker and still run the Docker on Ubuntu 14.04?
If we have to stick with Ubuntu 14.04 LTS as base Image, effectively no official package provides OpenBLAS's LAPACK routines.
asmushetzel
Jun 12, 2017
Contributor
Not sure whether I fully understand all of this. Let me try to summarize.
Amalgamation build: this build is not affected by the current set of changes. It has its own (unchanged) Makefile, and that doesn't set the flag to include LAPACK. There is currently no way to change this behavior via a config, which is IMO the right thing, as amalgamation today only supports building one specific inference model and not all types of potential applications.
Ubuntu 14.04 does not have lapack mangled into openblas; it's separate. But it is part of the official Ubuntu packages for 14.04 (https://packages.ubuntu.com/search?keywords=liblapack-dev). And whether this package is the original Netlib version or something mangled into the OpenBLAS library doesn't really matter for us, as our internal wrapper code can work with either. So it's not clear to me why we shouldn't use it for 14.04 as well.
It may be necessary to put in a switch such that for Ubuntu 14.04 we add "-llapack" and for 16.04 we don't. But that should be straightforward to achieve.
Generally, we can have a global "USE_LAPACK" in the config file; I think that makes a lot of sense. But IMO this should be on by default whenever we also use blas/atlas/apple/openblas. I expect that we will have an increasing number of functionalities based on lapack once we add the basic support, and then it will be hard to track and document which functionality needs lapack and which doesn't. Otherwise this will be visible only after a user tries something and gets a runtime error, which is annoying.
I thought that the Dockerfile is only for CI tests? I just wished to point out that 'liblapack-dev' may not be the LAPACK binaries of OpenBLAS. If we consider it OK, things are fine with me.
added some commits (Jun 7, 2017)
I think this is done now. In the default mode (USE_LAPACK=1), it will add lapack support whenever we use one of the blas distributions that contain lapack (apple/atlas/openblas/mkl). Concerning Ubuntu, the state is that 14.04 and 16.04 are almost equivalent: in both cases, lapack is part of the official packages. The difference is that on 14.04, installing openblas will not automatically install the lapack library, while on 16.04 it does. In both cases,
In my most recent run, one of the travis-ci builds times out when doing some tests on neural networks (not related to the new functionalities herein). I have seen this happen in the past from time to time; maybe we want to give these tests a bit more time by default. So all is clean.
I will work on integrating cuda support for these operators as soon as this PR gets integrated; I don't want to pile more stuff on top of this. That will involve a small bit of internal restructuring: the goal is to provide a common internal abstraction layer that handles the different DTypes and CPU vs. GPU automatically for the blas3/lapack functions (as suggested by the reviews).
|
This is what I wanted to say as well: first merge this PR with the CPU code before adding anything else.
Let's either make USE_LAPACK=0 by default or automatically turn it off when we cannot detect lapack. In short, we shouldn't break the build for people who don't have lapack installed.
+   la_potri
+   la_trmm
+   la_trsm
+   la_sumlogdiag
I think we then eventually have to do the "detection" flow, in which we dynamically turn it off and issue a warning message. Though I would rather submit that flow in a separate PR soon, to keep us going here and not complicate things further.
Cool
piiswrong merged commit e852036 into apache:master (Jun 13, 2017)
Can I ask what you meant exactly by "We cannot break users' builds", say, for this PR? My sense is that the repo evolves naturally as long as we update the source files and Makefiles in a way that the build and tests succeed.
Also, could you confirm whether the storage space for a Tensor is contiguous or not? For CPU code, I often see the stride differ from the number of columns, whilst much of the GPU code seems to assume the storage space is contiguous.
Concerning stride: I'm looking into how we can support this in general. It should be easily possible using lda/ldb arguments. If so, there may be a small change to the signatures of the abstractions. This should all be rolled in when adding the CUDA support.
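To illustrate the lda/ldb point (a generic sketch, not the planned signature change): BLAS/LAPACK routines take a leading dimension per matrix, so they can work on a strided view or submatrix of a larger allocation without copying.

```cpp
// Fortran BLAS dgemm (column-major): C = alpha*op(A)*op(B) + beta*C.
extern "C" void dgemm_(const char* transa, const char* transb,
                       const int* m, const int* n, const int* k,
                       const double* alpha, const double* a, const int* lda,
                       const double* b, const int* ldb,
                       const double* beta, double* c, const int* ldc);

// Multiply an m x k block of A with a k x n block of B where each matrix
// lives inside a larger allocation; the physical strides are passed as
// lda/ldb/ldc, so stride != logical matrix size is handled for free.
void block_gemm_example(int m, int n, int k,
                        const double* A, int lda,
                        const double* B, int ldb,
                        double* C, int ldc) {
  const double alpha = 1.0, beta = 0.0;
  dgemm_("N", "N", &m, &n, &k, &alpha, A, &lda, B, &ldb, &beta, C, &ldc);
}
```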
asmushetzel commented Jun 2, 2017 (edited)
This brings in operators that wrap certain BLAS3 and LAPACK routines, in particular:
trsm
trmm
potrf
potri
gemm
(in addition, a two-input version of gemm and an operator for computing the sum of the logs of a matrix diagonal).
The goal is to bring in enough operators to support specific, commonly important applications, for example Gaussian Processes. With this set of operators, it is possible to
An example was given at AMLC. There is a lot of active work ongoing in this area. In addition, there are other groups that are specifically interested in the Cholesky factorization (potrf).
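To spell out the Gaussian-process case mentioned above (standard identities, not text from the PR): the GP marginal log-likelihood needs $\log|K|$ and $K^{-1}y$ for a covariance matrix $K$, and with the Cholesky factor both reduce to the operators in this PR:

$$K = LL^{\top}\ (\texttt{potrf}), \qquad \log|K| = 2\sum_i \log L_{ii}\ (\texttt{sumlogdiag}), \qquad K^{-1}y = L^{-\top}\!\left(L^{-1}y\right)\ (\texttt{trsm}).$$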
All operators provide proper backward gradients so that they can be integrated easily into custom loss layers of neural networks. Getting the gradients right is nontrivial, specifically for the matrix factorizations, so this PR is more than a simple wrapper around linear algebra routines.
This PR defines a framework into which we can add other LAPACK/BLAS-related operators in the future.
The plan is to first release these CPU versions and afterwards bring in the GPU ones (we are already working on that). Based on the outcome of the GPU work
The documentation for the operators is in line with the most recent general doc updates (I verified the layout on a local webserver).