This page was generated from unit-5.5-cuda/poisson_cuda.ipynb.

5.5.1 Solving the Poisson Equation on devices

  • After finite element discretization we obtain a (linear) system of equations.

  • The new ngscuda module moves the linear operators to the Cuda - decice.

  • The host is stearing, data stays on the device

  • The module is now included in NGSolve Linux - distributions, and can be used whenever an accelerator card by NVIDIA is available, and the cuda-runtime is installed.

[1]:
from ngsolve import *
from ngsolve.webgui import Draw
from time import time
[2]:
mesh = Mesh(unit_square.GenerateMesh(maxh=0.1))
for l in range(5): mesh.Refine()
fes = H1(mesh, order=2, dirichlet=".*")
print ("ndof =", fes.ndof)

u, v = fes.TnT()
with TaskManager():
    a = BilinearForm(grad(u)*grad(v)*dx+u*v*dx).Assemble()
    f = LinearForm(x*v*dx).Assemble()

gfu = GridFunction(fes)

jac = a.mat.CreateSmoother(fes.FreeDofs())

with TaskManager():
    inv_host = CGSolver(a.mat, jac, maxiter=2000)
    ts = time()
    gfu.vec.data = inv_host * f.vec
    te = time()
    print ("steps =", inv_host.GetSteps(), ", time =", te-ts)

# Draw (gfu);
ndof = 476417
steps = 1751 , time = 9.42453932762146

Now we import the NGSolve - cuda library.

It provides

  • an UnifiedVector, which allocates memory on both, host and device. The data is updated on demand either on host, or on device.

  • NGSolve - matrices can create their counterparts on the device. In the following, the conjugate gradients iteration runs on the host, but all operations involving big data are performed on the accelerator.

[3]:
try:
    from ngsolve.ngscuda import *
except:
    print ("no CUDA library or device available, using replacement types on host")

ngsglobals.msg_level=1
fdev = f.vec.CreateDeviceVector(copy=True)
no CUDA library or device available, using replacement types on host
No device creator function, creating host vector
[4]:
adev = a.mat.CreateDeviceMatrix()
jacdev = jac.CreateDeviceMatrix()

inv = CGSolver(adev, jacdev, maxsteps=2000, printrates=False)

ts = time()
res = (inv * fdev).Evaluate()
te = time()

print ("Time on device:", te-ts)
diff = Norm(gfu.vec - res)
print ("diff = ", diff)
Time on device: 30.29498028755188
diff =  0.0

On an A-100 device I got (for 5 levels of refinement, ndof=476417):

Time on device: 0.4084775447845459 diff = 3.406979028373306e-12

CG Solver with Block-Jacobi and exact low-order solver:

\[\begin{split}A = \left( \begin{array}{cc} A_{cc} & A_{cf} \\ A_{fc} & A_{ff} \end{array} \right)\end{split}\]

Additive Schwarz preconditioner:

\[C^{-1} = P A_{cc}^{-1} P^T + \sum_i E_i A_i^{-1} E_i^T\]

with * \(P\) .. embedding of low-order space * \(A_i\) .. blocks on edges/faces/cells * \(E_i\) .. embedding matrices

[5]:
fes = H1(mesh, order=5, dirichlet=".*")
print ("ndof =", fes.ndof)

u, v = fes.TnT()
with TaskManager():
    a = BilinearForm(grad(u)*grad(v)*dx+u*v*dx).Assemble()
    f = LinearForm(x*v*dx).Assemble()

gfu = GridFunction(fes)

jac = a.mat.CreateBlockSmoother(fes.CreateSmoothingBlocks())
lospace = fes.lospace
loinv = a.loform.mat.Inverse(inverse="sparsecholesky", freedofs=lospace.FreeDofs())
loemb = fes.loembedding

pre = jac + loemb@loinv@loemb.T
print ("mat", a.mat.GetOperatorInfo())
print ("preconditioner:")
print(pre.GetOperatorInfo())

with TaskManager():
    inv = CGSolver(a.mat, pre, maxsteps=2000, printrates=False)
    ts = time()
    gfu.vec.data = inv * f.vec
    te = time()
    print ("iterations =", inv.GetSteps(), "time =", te-ts)
ndof = 2972801
mat SparseMatrixd, h = 2972801, w = 2972801

preconditioner:
SumMatrix, h = 2972801, w = 2972801
  BlockJacobi-d, h = 2972801, w = 2972801
  EmbeddedTransposeMatrix, h = 2972801, w = 2972801
    EmbeddedMatrix, h = 2972801, w = 119425
      SparseCholesky-d, h = 119425, w = 119425

iterations = 37 time = 5.332625865936279
[6]:
adev = a.mat.CreateDeviceMatrix()
predev = pre.CreateDeviceMatrix()
fdev = f.vec.CreateDeviceVector()

with TaskManager():
    inv = CGSolver(adev, predev, maxsteps=2000, printrates=False)
    ts = time()
    gfu.vec.data = (inv * fdev).Evaluate()
    te = time()
    print ("iterations =", inv.GetSteps(), "time =", te-ts)
No device creator function, creating host vector
iterations = 37 time = 5.3534324169158936

on the A-100:

SumMatrix, h = 2896001, w = 2896001 N4ngla20DevBlockJacobiMatrixE, h = 2896001, w = 2896001 EmbeddedTransposeMatrix, h = 2896001, w = 2896001 EmbeddedMatrix, h = 2896001, w = 116353 N4ngla17DevSparseCholeskyE, h = 116353, w = 116353 iterations = 37 time= 0.6766986846923828

Using the BDDC preconditioner:

For the BDDC (balancing domain decomposition with constraints) preconditioning, we build a FEM system with relaxed connectivity:

38af1cf229284999ba9c6f44368f42d2

This allows for static condensation of all local and interface degrees of freedom, only the wirebasket dofs enter the global solver. The resulting matrix \(\tilde A\) is much cheaper to invert.

The preconditioner is

\[P = R {\tilde A}^{-1} R^T\]

with an averagingn operator \(R\).

[7]:
mesh = Mesh(unit_square.GenerateMesh(maxh=0.1))
for l in range(3): mesh.Refine()
fes = H1(mesh, order=10, dirichlet=".*")
print ("ndof =", fes.ndof)

u, v = fes.TnT()
with TaskManager():
    a = BilinearForm(grad(u)*grad(v)*dx+u*v*dx)
    pre = Preconditioner(a, "bddc")
    a.Assemble()
    f = LinearForm(x*v*dx).Assemble()

gfu = GridFunction(fes)
with TaskManager():
    inv = CGSolver(a.mat, pre, maxsteps=2000, printrates=False)
    ts = time()
    gfu.vec.data = (inv * f.vec).Evaluate()
    te = time()
    print ("iterations =", inv.GetSteps(), "time =", te-ts)
 Mesh bisection
 Bisection done
 Mesh bisection
 Bisection done
 Mesh bisection
 Bisection done
ndof = 744001
iterations = 53 time = 3.5901877880096436
[8]:
predev = pre.mat.CreateDeviceMatrix()
print (predev.GetOperatorInfo())
SumMatrix, h = 744001, w = 744001
  ProductMatrix, h = 744001, w = 744001
    ProductMatrix, h = 744001, w = 744001
      SparseMatrixd, h = 744001, w = 744001
      SparseCholesky-d, h = 744001, w = 744001
    SparseMatrixd, h = 744001, w = 744001
  SparseMatrixd, h = 744001, w = 744001

[9]:
adev = a.mat.CreateDeviceMatrix()
predev = pre.mat.CreateDeviceMatrix()
fdev = f.vec.CreateDeviceVector()

with TaskManager():
    inv = CGSolver(adev, predev, maxsteps=2000, printrates=False)
    gfu.vec.data = (inv * fdev).Evaluate()
    print ("iterations =", inv.GetSteps())
No device creator function, creating host vector
iterations = 52
A100: iterations = 53 time= 0.16417622566223145

Vite - traces:

1ed72949faee4c62ac9e4b10e75b9293

[ ]: