Add BLAS3 and LAPACK routines #6538

Merged

piiswrong merged 6 commits into apache:master from asmushetzel:la_op on Jun 13, 2017

3 participants
Contributor

asmushetzel commented Jun 2, 2017

This brings in operators that wrap certain BLAS3 and LAPACK routines in particular
trsm
trmm
potrf
potri
gemm

(In addition, there is a two-operand version of gemm and an operator that computes the sum of the logarithms of the diagonal entries of a matrix.)
The goal is to bring in enough operators to support specific, commonly important applications, for example Gaussian Processes. With this set of operators, it is possible to

  • Use MXNet as a computing framework for a lot of Gaussian Process work
  • Develop new applications that combine neural networks and Gaussian Processes

An example was given at AMLC. There is a lot of active work ongoing in this area. In addition, there are other groups that are interested specifically in the Cholesky factorization (potrf).

All operators provide proper backward gradients so that they can be integrated easily into custom loss layers of neural networks. Getting gradients right is nontrivial, especially for the matrix factorizations, so this PR is more than a simple wrapper around linear algebra routines.

This PR defines a framework into which we can add other LAPACK/BLAS-related operators in the future.

The plan is to first release these CPU versions and afterwards bring in the GPU ones (we are already working on that). Based on the outcome of the GPU work, some of the shared code may need to be restructured in order to unify the CPU/GPU abstraction.

The documentation for the operators is in line with the most recent general doc-updates (verified the layout on a local webserver).

Contributor

asmushetzel commented Jun 2, 2017

docs/api/python/symbol.md
+.. autosummary::
+ :nosignatures:
+
+ la_gemm
@piiswrong

piiswrong Jun 2, 2017

Contributor

I'm not so sure about exposing them as operators...

would linalg_* be a better namespace?

@asmushetzel

asmushetzel Jun 2, 2017

Contributor

No preference on my side concerning the namespace; we can choose whatever we like.
What is your concern about exposing them as operators? That's exactly what people need in order to build loss functions based on advanced linear algebra. Or are you referring just to gemm?

@piiswrong

piiswrong Jun 2, 2017

Contributor

I think linalg_* looks better, more consistent with numpy.

I guess this is fine then.

BTW are you planning to add the GPU version?

@asmushetzel

asmushetzel Jun 2, 2017

Contributor

Yes, Zhenwen is already working on this. We want to get it done soon, as we want to provide a truly full and scalable solution. Though I would prefer to get the CPU versions in first and the GPU ones in a separate PR, even if this means that we have to do a bit of code restructuring again afterwards (in order to unify the CPU/GPU abstraction).
I will rename the operators to linalg.

src/operator/tensor/la_op.cc
+.set_attr<nnvm::TIsBackward>("TIsBackward", true)
+.set_attr<FCompute>("FCompute<cpu>", LaOpBackward<cpu, 2, 2, 4, 3, gemm_backward>);
+
+NNVM_REGISTER_OP(la_gemm2)
@piiswrong

piiswrong Jun 2, 2017

Contributor

What's the difference between gemm2 and gemm?

@asmushetzel

asmushetzel Jun 2, 2017

Contributor

gemm: alpha*A*B + beta*C
gemm2: alpha*A*B
gemm2 is equivalent to gemm with beta=0, but it is more convenient as you don't have to supply a third input tensor.
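Written out (using D for the output, as in the "// D = gemm(A,B,C)" comment further below):

$$D = \alpha\, A B + \beta\, C \quad (\text{gemm}), \qquad D = \alpha\, A B \quad (\text{gemm2}).$$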

Contributor

asmushetzel commented Jun 2, 2017

I may eventually have to pass a flag from config.mk down to the compile step that lets me detect whether compilation happens on OSX (but not sure yet). In case that is needed, do I have to pass such things down to mshadow (it looks to me as if its make finally sets all relevant "-D" flags in the compile)?

+
+// D = gemm(A,B,C)
+struct gemm {
+ template<typename xpu, typename DType>
@piiswrong

piiswrong Jun 2, 2017

Contributor

This is basically what I was talking about. I think we should support GPU/CPU/DType on this layer.

@asmushetzel

asmushetzel Jun 2, 2017

Contributor

Agree that this makes sense. OK if we do this higher-level abstraction, which also involves the GPU, when we add the GPU versions?

@piiswrong

piiswrong Jun 3, 2017

Contributor

Sure. Any reason this is using structs and partial specialization instead of function overloading?

@asmushetzel

asmushetzel Jun 4, 2017

Contributor

Due to the nature of the usual operator structure, at the end there must be core operator code that can be compiled with any of the supported types (though it may not be able to run with all of them). I.e. it must compile even when DType is 16-bit, int, etc. But the LAPACK/BLAS3 code only supports 32/64-bit and will likely never support anything else. Template specialization with structs allows me to write a default implementation (that would just error out if it were ever called) and specializations for 32/64-bit.
Function overloading would require writing overloads for every potential DType (as you cannot have a default plus specialized function templates).

@asmushetzel

asmushetzel Jun 5, 2017

Contributor

The above is not fully true; it's just that function templates allow no partial specialization (only full specialization). Anyway: I wanted to have a generic function that covers all the processing common to these operators (LaOpForward/LaOpBackward) and that then gets parameterized with a template for the specific functionality (the laop parameter to LaOpForward/LaOpBackward). This required the struct approach; I can't pass a partially specialized function template like this (at least I have found no way).
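A compilable sketch of the pattern being described (all names are illustrative, not the ones actually used in la_op.h/la_op.cc): the default struct template errors out at runtime for unsupported DTypes, full specializations cover 32/64-bit floats, and the generic driver takes the operator as a template parameter, which is what a partially specialized function template could not provide.

```cpp
#include <cstdlib>
#include <iostream>

// Default template: must compile for every DType, but errors out at runtime
// because BLAS3/LAPACK only handle 32/64-bit floats.
template<typename DType>
struct gemm_sketch {
  static void op(const DType*, const DType*, DType*, int) {
    std::cerr << "linear algebra ops support only float32/float64\n";
    std::abort();
  }
};

// Full specializations for the supported types; a real version would call
// sgemm/dgemm here instead of the naive loops.
template<>
struct gemm_sketch<float> {
  static void op(const float* A, const float* B, float* C, int n) {
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j) {
        float s = 0.f;
        for (int k = 0; k < n; ++k) s += A[i*n+k] * B[k*n+j];
        C[i*n+j] = s;
      }
  }
};

template<>
struct gemm_sketch<double> {
  static void op(const double* A, const double* B, double* C, int n) {
    for (int i = 0; i < n; ++i)
      for (int j = 0; j < n; ++j) {
        double s = 0.0;
        for (int k = 0; k < n; ++k) s += A[i*n+k] * B[k*n+j];
        C[i*n+j] = s;
      }
  }
};

// Generic driver in the spirit of LaOpForward: parameterized by the operator
// struct, which picks the right DType specialization internally.
template<typename DType, template<typename> class laop>
void la_forward_sketch(const DType* A, const DType* B, DType* C, int n) {
  laop<DType>::op(A, B, C, n);
}

int main() {
  float A[4] = {1, 2, 3, 4}, B[4] = {1, 0, 0, 1}, C[4];
  la_forward_sketch<float, gemm_sketch>(A, B, C, 2);   // runs
  // la_forward_sketch<int, gemm_sketch>(...) would compile but abort at runtime
  std::cout << C[0] << " " << C[3] << "\n";            // prints 1 4
}
```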

src/operator/tensor/la_op.cc
+.set_attr<nnvm::FInferType>("FInferType", ElemwiseType<2, 1>)
+.set_attr<FCompute>("FCompute<cpu>", LaOpForward<cpu, 2, 2, 2, 1, gemm2>)
+.set_attr<nnvm::FGradient>("FGradient", ElemwiseGradUseIn{"_backward_la_gemm2"})
+.add_arguments(LaMatrixMultParam::__FIELDS__())
@piiswrong

piiswrong Jun 2, 2017

Contributor

add_arguments(LaMatrixMultParam::__FIELDS__()): this should go after the ndarray-or-symbol arguments

@asmushetzel

asmushetzel Jun 2, 2017

Contributor

Will do. Just for curiosity: What is the reason?

@piiswrong

piiswrong Jun 2, 2017

Contributor

The order will affect the function signature in the frontend.
For example, this needs to be fixed: http://mxnet.io/api/python/ndarray.html#mxnet.ndarray.sample_gamma

I'll submit a PR soon.

+ }
+ }
+ for ( int i = 0; i < N; ++i ) {
+ dA.dptr_[i*(N+1)] += dB / A.dptr_[i*N+i];
@jli05

jli05 Jun 3, 2017

Contributor

Would it be a problem if the denominator is zero?

@asmushetzel

asmushetzel Jun 4, 2017

Contributor

Certainly. Though these linear algebra operators comprise a package for advanced numerical applications. For such applications, numerical stability is a big issue in general and is usually the caller's responsibility. At this low level, there is no way to ensure numerical stability anymore. For example, the matrices that are valid input to potrf must be symmetric and positive semidefinite, and this may not be the case because of rounding problems in prior steps of the algorithm. It is the algorithm designer's responsibility to ensure that proper mechanisms are in place such that everything is numerically stable. Dividing by zero is only one special case here, so no special mechanism is added for it.
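For reference, the gradient in the hunk above follows directly from differentiating the sum of the logs of the diagonal, which is exactly where a zero diagonal entry would blow up:

$$B = \sum_{i=1}^{N} \log A_{ii} \quad\Longrightarrow\quad \frac{\partial B}{\partial A_{ii}} = \frac{1}{A_{ii}}, \qquad dA_{ii} \mathrel{+}= \frac{dB}{A_{ii}}.$$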

Makefile
@@ -106,6 +106,11 @@ else
endif
endif
+# lapack may be a separate library
+ifeq ($(USE_BLAS), openblas)
+ LDFLAGS += -llapack
@piiswrong

piiswrong Jun 6, 2017

Contributor

does openblas always build lapack by default?

@asmushetzel

asmushetzel Jun 6, 2017

Contributor

I'm currently tracing the details. It is available, but at least on Ubuntu I have to get it as a separate package. Guess I will have it worked out by tomorrow.
So far:

  • The only consistent way to use LAPACK in MXNet via includes is to interface with the Fortran code. openblas has lapacke.h, but ATLAS and "apple" (Accelerate framework) use clapack.h. The latter two have completely different signatures: ATLAS is a proper C wrapper, while "apple" provides just C declarations matching the Fortran signatures. So it is easier to interface with the Fortran code in all cases.
  • We have some builds that just use "blas" or even "NONE". These are builds for docs and also mxnet-pr-r (see failure below). My plan is to support such builds (which are not intended to generate an executable that actually runs) by passing a -DMXNET_USE_LAPACK flag in all other builds. If this flag is not present, then our LAPACK wrappers are compiled into stubs that raise an error at runtime. I don't want to completely skip compiling operators that use LAPACK, as we may get a lot more LAPACK-based functionality in the future and it would be cumbersome to handle it all specially in such builds. (A rough sketch of both points follows below.)

What do you think?
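To make both points a bit more concrete, here is a rough sketch of the idea (the wrapper name and macro usage are illustrative assumptions, not the actual contents of c_lapack_api.h; only the Fortran signatures are standard):

```cpp
// Sketch only: declare the Fortran entry points directly (they are shipped by
// openblas, ATLAS, Apple's Accelerate and MKL alike), and compile stubs when
// -DMXNET_USE_LAPACK is not given.
#if MXNET_USE_LAPACK

extern "C" {
  void spotrf_(char *uplo, int *n, float *a, int *lda, int *info);
  void dpotrf_(char *uplo, int *n, double *a, int *lda, int *info);
}

// Thin C++ wrapper; Fortran passes every argument by reference.
inline int mxnet_lapack_dpotrf(char uplo, int n, double *a, int lda) {
  int info(0);
  dpotrf_(&uplo, &n, a, &lda, &info);
  return info;
}

#else

#include <cstdio>
#include <cstdlib>

// Builds without a LAPACK library still compile, but fail loudly at runtime.
inline int mxnet_lapack_dpotrf(char, int, double *, int) {
  std::fprintf(stderr, "LAPACK functionality not available in this build\n");
  std::abort();
}

#endif  // MXNET_USE_LAPACK
```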

Contributor

piiswrong commented Jun 6, 2017

Do all the BLAS libraries include LAPACK by default? Does this PR support all LAPACK variants?

Is lapack support turned on by default? Is there an option to turn it off?

@mli mli changed the title from La op to Add BLAS3 and LAPACK routines Jun 7, 2017

Contributor

asmushetzel commented Jun 7, 2017

IMO this should all be done now. I have squashed all the commits into a single one and pushed again. Unfortunately there are some build failures, but they all seem to be unrelated (Scala, some GitHub issue with cub, etc.). They seem to have popped up today when I incorporated the latest MXNet code. Eric, can you take a look? Maybe you have an idea. Has something been pushed recently that explains that?

Summary of the changes:

  • Added a set of BLAS3/LAPACK based operators. This set is rich enough to support a set of interesting linear algebra applications including Gaussian processes (standalone or integrated with deep learning strategies for parameter estimation). There is already interest from several other groups, specifically in the Cholesky factorization.

  • The operators are currently available only for CPU, but we are already working on GPU support and are committed to supplying it based on cuSolver.

  • For the LAPACK operators (potrf, potri), a low-level MXNet-specific abstraction has been implemented that enables us to switch the underlying LAPACK integration easily without touching the rest of the MXNet code. This abstraction currently supports only potrf/potri but can be extended easily whenever there is a need to make more LAPACK functions available.

  • This low-level abstraction currently covers just the CPU. We may combine GPU and CPU into a higher-level abstraction when bringing in GPU support.

  • The LAPACK abstraction interfaces directly with the core Fortran libraries on all platforms. This is the only consistent way, as the C wrappers for LAPACK differ heavily between platforms, and all LAPACK packages that are relevant for us (and likely all others as well) use the underlying Fortran library anyway. This is also how numpy integrates LAPACK (except that numpy uses an extra utility to generate the C declarations from the Fortran library automatically).

  • The above method should be stable, as the LAPACK Fortran interfaces are unlikely to ever change.

  • All of the usual BLAS packages that we integrate with MXNet (openblas, apple, atlas, mkl) already include LAPACK by default. Smaller differences: on Ubuntu, the LAPACK library is not installed automatically with openblas and must be installed explicitly. On Windows, BLAS and LAPACK are mangled into a single library. apple/atlas/mkl contain LAPACK automatically. When using openblas, we have to add a linker flag to include liblapack. Makefiles, configuration files, etc. have been updated accordingly, as has the installation documentation.

  • Whenever MXNet is compiled with one of the BLAS libraries openblas/atlas/mkl/apple, as well as for all compiles on Windows, a compiler flag -DMXNET_USE_LAPACK is set that activates the LAPACK abstraction in the C++ code and adds the LAPACK library to the linker step (if necessary for the platform). If the compiler flag is not present, then dummy stubs are generated in the C++ abstraction that cause a runtime error with an appropriate error message whenever a LAPACK function is called. This ensures that MXNet can always be compiled even when no LAPACK is available. Though as LAPACK will likely soon become a vital part of MXNet and is also available on all platforms, I expect we will only use that for some potential special "lightweight" builds.

Once again, I would be happy if someone else could look into the mysterious build errors. I had everything compiling cleanly before, but now they pop up.

Contributor

jli05 commented Jun 7, 2017

Sorry what build errors exactly?

Contributor

jli05 commented Jun 7, 2017

In c_lapack_api.h, all functions follow the same signature. It'd be great to have a couple more lines of code illustrating what the MXNet abstraction interface will look like for LAPACK functions with a different signature.
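As a purely illustrative example of what is being asked for (hypothetical code, not part of this PR): a LAPACK routine with a different signature, say the LU factorization dgetrf with its pivot array, could follow the same declare-and-forward pattern, just with more arguments:

```cpp
// Hypothetical addition in the same style as the potrf wrapper sketched above.
extern "C" void dgetrf_(int *m, int *n, double *a, int *lda,
                        int *ipiv, int *info);

inline int mxnet_lapack_dgetrf(int m, int n, double *a, int lda, int *ipiv) {
  int info(0);
  dgetrf_(&m, &n, a, &lda, ipiv, &info);  // ipiv must hold min(m, n) entries
  return info;
}
```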

Contributor

asmushetzel commented Jun 8, 2017

Jenkins failed on some nodes immediately with this:

[build-cpu] Running batch script
c:\jenkins_slave\workspace\build-cpu>git submodule update --init
error: no such remote ref 01347a797c620618d09e7d2d90bce4be4c42513e
Fetched in submodule path 'cub', but it did not contain 01347a797c620618d09e7d2d90bce4be4c42513e. Direct fetching of that commit failed.
script returned exit code 1

Any ideas?

+// - It is desired to add some basic checking in the C++-wrappers in order
+// to catch simple mistakes when calling these wrappers.
+// - Must support compilation without lapack-package but issue runtime error in this case.
+
@jli05

jli05 Jun 8, 2017

Contributor

May we add one line of comment stipulating the expected layout of the input/output matrices (always row-major vs. column-major)?
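For reference, the layout relationship behind this question (general linear algebra, not a statement about what the PR finally documents): a row-major buffer of $A$ is bit-for-bit the column-major buffer of $A^{\top}$, so a Fortran (column-major) LAPACK routine can work on row-major data by transposing the problem; for the symmetric input of potrf this just flips the triangle selector:

$$A = L L^{\top} \;(\text{row-major, uplo}=L) \iff A = U^{\top} U \;(\text{column-major, uplo}=U), \qquad L = U^{\top}.$$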

Contributor

piiswrong commented Jun 9, 2017

So apt-get install libopenblas-dev won't install lapack? In that case we probably should add a USE_LAPACK option in config.mk and turn it off by default for openblas.

Also, maybe we should turn off lapack for amalgamation. How big is the lapack lib?

@@ -5,7 +5,7 @@ MAINTAINER Mu Li <muli@cs.cmu.edu>
# First, build MXNet binaries (ref mxnet/docker/cpu/Dockerfile)
#
-RUN apt-get update && apt-get install -y build-essential git libopenblas-dev libopencv-dev
+RUN apt-get update && apt-get install -y build-essential git libopenblas-dev liblapack-dev libopencv-dev
@jli05

jli05 Jun 9, 2017

Contributor

Please note that for Ubuntu 14.04 LTS, the libopenblas.so within libopenblas-base or libopenblas-dev only packs in the BLAS routines but not LAPACK. Only from Ubuntu 16.04 LTS on does the libopenblas.so in the official packages contain all the BLAS and LAPACK routines.

If we install liblapack-dev, that is the unoptimised Netlib version, which bears no relation to OpenBLAS.

To rectify this, we would have to base the test on the latest Ubuntu LTS (aka 16.04) image and remove liblapack-dev -- using just libopenblas-base or libopenblas-dev would suffice to provide all the BLAS/LAPACK routines.

(If we install libblas-dev or liblapack-dev, that is the unoptimised Netlib version. We could run sudo update-alternatives --display liblapack.so to check.)

@jli05

jli05 Jun 9, 2017

Contributor

Another thing is that we'd need to patch one or two small things in the Docker test run file, as the Ubuntu 16.04 LTS Docker image is much more compact than 14.04 and has removed packages such as sudo -- but there should barely be a need for sudo when running inside the container.

@piiswrong

piiswrong Jun 9, 2017

Contributor

We can't break the build for ubuntu 14.04 users.

I see three possible options:

  1. Add USE_LAPACK in config.mk and turn it off by default
  2. Automatically detect whether a lapack lib is available and turn lapack on/off accordingly
  3. Build our own lapack during make
@jli05

jli05 Jun 9, 2017

Contributor

Did you mean Ubuntu 14.04 as the host OS? Can we use Ubuntu 16.04 LTS as the base image for building the Docker image and still run it on Ubuntu 14.04?

If we have to stick with Ubuntu 14.04 LTS as the base image, effectively no official package provides OpenBLAS's LAPACK routines.

@asmushetzel

asmushetzel Jun 12, 2017

Contributor

Not sure whether I fully understand all of this. Let me try to summarize.

Amalgamation build: this build is not affected by the current set of changes. It has its own (unchanged) Makefile and that doesn't set the flag to include LAPACK. There is currently no way to change this behavior via a config, which is IMO the right thing to do, as amalgamation today only supports building one specific inference model and not all types of potential applications.

Ubuntu 14.04 does not have LAPACK mangled into openblas; it's separate. But it is part of the official Ubuntu packages for 14.04 (https://packages.ubuntu.com/search?keywords=liblapack-dev). And whether this package is the original Netlib version or something mangled into the OpenBLAS library doesn't really matter for us, as our internal wrapper code can work with all of them. So it's not clear to me why we shouldn't use it for 14.04 as well.

It may be necessary to put in a switch such that for Ubuntu 14.04 we add "-llapack" and for 16.04 we don't. But that should be straightforward to achieve.

Generally, we can have a global USE_LAPACK in the config file. I think it makes a lot of sense. But IMO this should be on by default whenever we also use blas/atlas/apple/openblas. I expect that we will have an increasing number of functionalities based on LAPACK once we add the basic support, and then it will be hard to track and document which functionality needs LAPACK and which doesn't. If in doubt, this will be visible only after a user tries something and then gets a runtime error, which is annoying.

Contributor

jli05 commented Jun 12, 2017

I thought that the Dockerfile is only for the CI test? I just wished to point out that 'liblapack-dev' may not be the LAPACK binaries of OpenBLAS. If we consider that OK, things are fine with me.

Hetzel added some commits Jun 7, 2017

Contributor

asmushetzel commented Jun 13, 2017

I think this is done now.
I added an additional flag USE_LAPACK to the configs, which is 1 by default. So there is always an easy option to turn it off.

In the default mode (USE_LAPACK=1), LAPACK support is added whenever we use one of the BLAS distributions that contain LAPACK (apple/atlas/openblas/mkl). Concerning Ubuntu, the state is that 14.04 and 16.04 are almost equivalent: in both cases, LAPACK is part of the official packages. The difference is that on 14.04, installing openblas will not automatically install the LAPACK library, while on 16.04 it does. In both cases,
sudo apt-get install -y libopenblas-dev liblapack-dev
will get all the necessary pieces (it's just that on 16.04 there is no need to reference liblapack explicitly, but it works either way). I tested that explicitly on the different images.
So our Docker container can happily stay at 14.04, with the small change that it now also installs liblapack-dev. No need to migrate to 16.04.

In my most recent run, one of the travis-ci builds timed out while running some tests on neural networks (not related to the new functionality herein). I have seen this happen from time to time in the past; maybe we want to give these tests a bit more time by default. So all is clean.

I will work on integrating CUDA support for these operators as soon as this PR gets merged; I don't want to pile up more stuff on top of this. It will involve a small bit of internal restructuring. The goal is to provide a common internal abstraction layer that handles the different DTypes and CPU vs. GPU automatically for the BLAS3/LAPACK functions (as suggested by the reviews).
Let me know what you think.

Contributor

jli05 commented Jun 13, 2017

  • Alternatively, could we merge this PR with the CPU code first so that I can submit my PRs on tensors? I'll need to put more LAPACK interfaces into 'c_lapack_api.h'.

  • A tiny clarification: on Ubuntu 16.04 the official 'libopenblas.so' build mangles in both the BLAS and LAPACK symbols, but the one on 14.04 contains only the BLAS symbols. Always linking against Netlib's version is a viable way to get it to work. It really just depends on what we intend to do.

Contributor

asmushetzel commented Jun 13, 2017

This was what I wanted to say as well: First merge this PR with CPU code before adding anything else.

Contributor

piiswrong commented Jun 13, 2017

Let's either make USE_LAPACK=0 by default or automatically turn it off when we cannot detect lapack.

In short, we shouldn't break the build for people who don't have lapack installed.

docs/api/python/symbol.md
+ la_potri
+ la_trmm
+ la_trsm
+ la_sumlogdiag
@piiswrong

piiswrong Jun 13, 2017

Contributor

Change names to match function definition

Contributor

asmushetzel commented Jun 13, 2017

I think we will then eventually have to implement the "detection" flow, in which we dynamically turn LAPACK off and issue a warning message. Though I would rather submit that flow in a separate PR soon, to keep us going here and not complicate things further.
So I will make the default not to use LAPACK (but only for a short period). The downside is that I then also have to disable all the tests for these operators in test_operator.py, so we won't run automatic regressions (unless we modify the Jenkins setup). I think this is bad, but tolerable for a short time.
OK with that approach, Eric?

Contributor

piiswrong commented Jun 13, 2017

Cool

@piiswrong piiswrong merged commit e852036 into apache:master Jun 13, 2017

3 of 4 checks passed

continuous-integration/travis-ci/pr: The Travis CI build is in progress
continuous-integration/appveyor/pr: AppVeyor build succeeded
continuous-integration/jenkins/pr-head: This commit looks good
mxnet-pr-r
Contributor

jli05 commented Jun 14, 2017

Can I ask what you meant exactly by "we cannot break users' builds", say, for this PR? To my mind, the repo evolves naturally as long as we update the source files and Makefiles in a way that the build and tests succeed?

Could you confirm whether the storage space for a Tensor is contiguous or not? In the CPU code I often saw the stride differ from the number of columns, whilst much of the GPU code seems to assume that the storage space is contiguous?

Contributor

asmushetzel commented Jun 15, 2017

Concerning stride: I'm looking into how we can support this in general. It should be easily possible using lda/ldb arguments (see the sketch below). If so, there may be a small change to the signatures of the abstractions. This should all be rolled in when adding the CUDA support.
Concerning build stability, I'll leave that to Eric to answer.
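A minimal sketch of what "using lda/ldb arguments" could mean here (hypothetical function, not the signature that actually landed later): the wrapper simply forwards the row stride as LAPACK's leading dimension, so a non-contiguous view needs no copy.

```cpp
// Sketch only: factorize an n x n block inside a larger row-major buffer whose
// consecutive rows are 'stride' (>= n) elements apart.
extern "C" void dpotrf_(char *uplo, int *n, double *a, int *lda, int *info);

inline int potrf_strided_sketch(double *a, int n, int stride) {
  char uplo = 'U';   // upper triangle of the column-major view
  int info = 0;
  dpotrf_(&uplo, &n, a, &stride, &info);  // the stride plays the role of lda
  return info;
}
```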

@asmushetzel asmushetzel deleted the asmushetzel:la_op branch Jun 15, 2017

EvanzzzZ added a commit to EvanzzzZ/mxnet that referenced this pull request Jun 15, 2017

Add BLAS3 and LAPACK routines (#6538)
* Added linear algebra operators

* more comments about style of wrapper interface

* more appropriate fatal exit when lapack does not exist

* more comments on row/col-major ordering

* added config switch for lapack usage

* switched lapack usage off by default

Guneet-Dhillon pushed a commit to Guneet-Dhillon/mxnet that referenced this pull request Sep 13, 2017

Add BLAS3 and LAPACK routines (#6538)

kice added a commit to kice/mxnet that referenced this pull request Dec 20, 2017

Add BLAS3 and LAPACK routines (#6538)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment