This page was generated from unit-5.5-cuda/poisson_cuda.ipynb.

5.5.1 Solving the Poisson Equation on devices¶

After finite element discretization we obtain a (linear) system of equations.
The new ngscuda module moves the linear operators to the Cuda - decice.
The host is stearing, data stays on the device
The module is now included in NGSolve Linux - distributions, and can be used whenever an accelerator card by NVIDIA is available, and the cuda-runtime is installed.

[1]:

from ngsolve import *
from ngsolve.webgui import Draw
from time import time

[2]:

mesh = Mesh(unit_square.GenerateMesh(maxh=0.1))
for l in range(5): mesh.Refine()
fes = H1(mesh, order=2, dirichlet=".*")
print ("ndof =", fes.ndof)

u, v = fes.TnT()
with TaskManager():
    a = BilinearForm(grad(u)*grad(v)*dx+u*v*dx).Assemble()
    f = LinearForm(x*v*dx).Assemble()

gfu = GridFunction(fes)

jac = a.mat.CreateSmoother(fes.FreeDofs())

with TaskManager():
    inv_host = CGSolver(a.mat, jac, maxiter=2000)
    ts = time()
    gfu.vec.data = inv_host * f.vec
    te = time()
    print ("steps =", inv_host.GetSteps(), ", time =", te-ts)

# Draw (gfu);

ndof = 476417
steps = 1751 , time = 9.334568738937378

Now we import the NGSolve - cuda library.

It provides

an UnifiedVector, which allocates memory on both, host and device. The data is updated on demand either on host, or on device.
NGSolve - matrices can create their counterparts on the device. In the following, the conjugate gradients iteration runs on the host, but all operations involving big data are performed on the accelerator.

[3]:

try:
    from ngsolve.ngscuda import *
except:
    print ("no CUDA library or device available, using replacement types on host")

ngsglobals.msg_level=1
fdev = f.vec.CreateDeviceVector(copy=True)

no CUDA library or device available, using replacement types on host
No device creator function, creating host vector

[4]:

adev = a.mat.CreateDeviceMatrix()
jacdev = jac.CreateDeviceMatrix()

inv = CGSolver(adev, jacdev, maxsteps=2000, printrates=False)

ts = time()
res = (inv * fdev).Evaluate()
te = time()

print ("Time on device:", te-ts)
diff = Norm(gfu.vec - res)
print ("diff = ", diff)

Time on device: 30.225558042526245
diff =  0.0

On an A-100 device I got (for 5 levels of refinement, ndof=476417):

Time on device: 0.4084775447845459 diff = 3.406979028373306e-12

CG Solver with Block-Jacobi and exact low-order solver:¶

\begin{array}{r} A = (\begin{array}{cc} A_{c c} & A_{c f} \\ A_{f c} & A_{f f} \end{array}) \end{array}

Additive Schwarz preconditioner:

C^{- 1} = P A_{c c}^{- 1} P^{T} + \sum_{i} E_{i} A_{i}^{- 1} E_{i}^{T}

with * $P$ .. embedding of low-order space * $A_{i}$ .. blocks on edges/faces/cells * $E_{i}$ .. embedding matrices

[5]:

fes = H1(mesh, order=5, dirichlet=".*")
print ("ndof =", fes.ndof)

u, v = fes.TnT()
with TaskManager():
    a = BilinearForm(grad(u)*grad(v)*dx+u*v*dx).Assemble()
    f = LinearForm(x*v*dx).Assemble()

gfu = GridFunction(fes)

jac = a.mat.CreateBlockSmoother(fes.CreateSmoothingBlocks())
lospace = fes.lospace
loinv = a.loform.mat.Inverse(inverse="sparsecholesky", freedofs=lospace.FreeDofs())
loemb = fes.loembedding

pre = jac + loemb@loinv@loemb.T
print ("mat", a.mat.GetOperatorInfo())
print ("preconditioner:")
print(pre.GetOperatorInfo())

with TaskManager():
    inv = CGSolver(a.mat, pre, maxsteps=2000, printrates=False)
    ts = time()
    gfu.vec.data = inv * f.vec
    te = time()
    print ("iterations =", inv.GetSteps(), "time =", te-ts)

ndof = 2972801
mat SparseMatrixd, h = 2972801, w = 2972801

preconditioner:
SumMatrix, h = 2972801, w = 2972801
  BlockJacobi-d, h = 2972801, w = 2972801
  EmbeddedTransposeMatrix, h = 2972801, w = 2972801
    EmbeddedMatrix, h = 2972801, w = 119425
      SparseCholesky-d, h = 119425, w = 119425

iterations = 37 time = 5.302971124649048

[6]:

adev = a.mat.CreateDeviceMatrix()
predev = pre.CreateDeviceMatrix()
fdev = f.vec.CreateDeviceVector()

with TaskManager():
    inv = CGSolver(adev, predev, maxsteps=2000, printrates=False)
    ts = time()
    gfu.vec.data = (inv * fdev).Evaluate()
    te = time()
    print ("iterations =", inv.GetSteps(), "time =", te-ts)

No device creator function, creating host vector
iterations = 37 time = 5.364409685134888

on the A-100:

SumMatrix, h = 2896001, w = 2896001 N4ngla20DevBlockJacobiMatrixE, h = 2896001, w = 2896001 EmbeddedTransposeMatrix, h = 2896001, w = 2896001 EmbeddedMatrix, h = 2896001, w = 116353 N4ngla17DevSparseCholeskyE, h = 116353, w = 116353 iterations = 37 time= 0.6766986846923828

Using the BDDC preconditioner:¶

For the BDDC (balancing domain decomposition with constraints) preconditioning, we build a FEM system with relaxed connectivity:

06394193b1664905a2af3efb6fbd83d6

This allows for static condensation of all local and interface degrees of freedom, only the wirebasket dofs enter the global solver. The resulting matrix $\tilde{A}$ is much cheaper to invert.

The preconditioner is

P = R {\tilde{A}}^{- 1} R^{T}

with an averagingn operator $R$ .

[7]:

mesh = Mesh(unit_square.GenerateMesh(maxh=0.1))
for l in range(3): mesh.Refine()
fes = H1(mesh, order=10, dirichlet=".*")
print ("ndof =", fes.ndof)

u, v = fes.TnT()
with TaskManager():
    a = BilinearForm(grad(u)*grad(v)*dx+u*v*dx)
    pre = Preconditioner(a, "bddc")
    a.Assemble()
    f = LinearForm(x*v*dx).Assemble()

gfu = GridFunction(fes)
with TaskManager():
    inv = CGSolver(a.mat, pre, maxsteps=2000, printrates=False)
    ts = time()
    gfu.vec.data = (inv * f.vec).Evaluate()
    te = time()
    print ("iterations =", inv.GetSteps(), "time =", te-ts)

 Mesh bisection
 Bisection done
 Mesh bisection
 Bisection done
 Mesh bisection
 Bisection done
ndof = 744001
iterations = 53 time = 3.435251474380493

[8]:

predev = pre.mat.CreateDeviceMatrix()
print (predev.GetOperatorInfo())

SumMatrix, h = 744001, w = 744001
  ProductMatrix, h = 744001, w = 744001
    ProductMatrix, h = 744001, w = 744001
      SparseMatrixd, h = 744001, w = 744001
      SparseCholesky-d, h = 744001, w = 744001
    SparseMatrixd, h = 744001, w = 744001
  SparseMatrixd, h = 744001, w = 744001

[9]:

adev = a.mat.CreateDeviceMatrix()
predev = pre.mat.CreateDeviceMatrix()
fdev = f.vec.CreateDeviceVector()

with TaskManager():
    inv = CGSolver(adev, predev, maxsteps=2000, printrates=False)
    gfu.vec.data = (inv * fdev).Evaluate()
    print ("iterations =", inv.GetSteps())

No device creator function, creating host vector
iterations = 52

A100: iterations = 53 time= 0.16417622566223145

Vite - traces:

9d0853db529d46f0bfdade0a79b7d855

[ ]: