5.5 FETI-DP in NGSolve IV: Inexact FETI-DP

This time, we will, instead of going to the schur-complement for \(\lambda\), directly iterate on the equation

\[\begin{split}\left( \begin{matrix} A_{\scriptscriptstyle DP} & B^T \\ B & 0 \end{matrix} \right) \left( \begin{matrix} u\\ \lambda \end{matrix} \right) = \left( \begin{matrix} f \\ 0 \end{matrix} \right)\end{split}\]

The preconditioner will be

\[\begin{split}\widehat{M}^{-1} = \left( \begin{matrix} A_{\scriptscriptstyle DP}^{-1} & 0 \\ -M_s^{-1} B A_{\scriptscriptstyle DP}^{-1} & M_s^{-1} \\ \end{matrix} \right)\end{split}\]

As \(\widehat{M}^{-1}\) is not symmetric, we will use GMRES.

For setting up the preconditioner, we only need pieces we already have.

In [1]:
num_procs = '100'
In [2]:
from usrmeeting_jupyterstuff import *
In [3]:
stop_cluster()
In [4]:
start_cluster(num_procs)
connect_cluster()
Waiting for connection file: ~/.ipython/profile_ngsolve/security/ipcontroller-kogler-client.json
connecting ... try:6 succeeded!
In [5]:
%%px
from ngsolve import *
import netgen.meshing as ngmeshing
from ngsolve.la import ParallelMatrix, FETI_Jump, SparseMatrixd, ParallelDofs, BlockMatrix
from dd_toolbox import FindFEV, DPSpace_Inverse, ScaledMat, LinMat

def load_mesh(nref=0):
    ngmesh = ngmeshing.Mesh(dim=3)
    ngmesh.Load('cube.vol')
    for l in range(nref):
        ngmesh.Refine()
    return Mesh(ngmesh)

def setup_space(mesh, order=1):
    comm = MPI_Init()
    fes = H1(mesh, order=order, dirichlet='right|top')
    a = BilinearForm(fes)
    u,v = fes.TnT()
    a += SymbolicBFI(grad(u)*grad(v))
    a.Assemble()
    f = LinearForm(fes)
    f += SymbolicLFI(x*y*v)
    f.Assemble()
    avg_dof = comm.Sum(fes.ndof) / comm.size
    if comm.rank==0:
        print('global,  ndof =', fes.ndofglobal, ', lodofs =', fes.lospace.ndofglobal)
        print('avg DOFs per core: ', avg_dof)
    return [fes, a, f]

def setup_FETIDP(fes, a):
    faces, edges, vertices = FindFEV(mesh.dim, mesh.nv, \
                                     fes.ParallelDofs(), fes.FreeDofs())
    primal_dofs = BitArray([ v in set(vertices) for v in range(fes.ndof) ]) & fes.FreeDofs()
    dp_pardofs = fes.ParallelDofs().SubSet(primal_dofs)
    ar = [(num_e[0],d,1.0) for num_e in enumerate(edges) for d in num_e[1] ]
    rows, cols, vals = [list(x) for x in zip(*ar)] if len(ar) else [[],[],[]]
    B_p = SparseMatrixd.CreateFromCOO(rows, cols, vals, len(edges), fes.ndof)
    edist_procs = [sorted(set.intersection(*[set(fes.ParallelDofs().Dof2Proc(v)) for v in edge])) for edge in edges]
    eavg_pardofs = ParallelDofs(edist_procs, comm)
    nprim = comm.Sum(sum([1 for k in range(fes.ndof) if primal_dofs[k] and comm.rank<fes.ParallelDofs().Dof2Proc(k)[0] ]))
    if comm.rank==0:
        print('# of global primal dofs: ', nprim)
    A_dp = ParallelMatrix(a.mat.local_mat, dp_pardofs)
    dual_pardofs = fes.ParallelDofs().SubSet(BitArray(~primal_dofs & fes.FreeDofs()))
    B = FETI_Jump(dual_pardofs, u_pardofs=dp_pardofs)
    if comm.rank==0:
        print('# of global multipliers = :', B.col_pardofs.ndofglobal)
    A_dp_inv = DPSpace_Inverse(mat=a.mat, freedofs=fes.FreeDofs(), \
                               c_points=primal_dofs, \
                               c_mat=B_p, c_pardofs=eavg_pardofs, \
                               invtype_loc='sparsecholesky', \
                               invtype_glob='masterinverse')
    F = B @ A_dp_inv @ B.T
    innerdofs = BitArray([len(fes.ParallelDofs().Dof2Proc(k))==0 for k in range(fes.ndof)]) & fes.FreeDofs()
    A = a.mat.local_mat
    Aiinv = A.Inverse(innerdofs, inverse='sparsecholesky')
    scaledA = ScaledMat(A, [0 if primal_dofs[k] else 1.0/(1+len(fes.ParallelDofs().Dof2Proc(k))) for k in range(fes.ndof)])
    scaledBT = ScaledMat(B.T, [0 if primal_dofs[k] else 1.0/(1+len(fes.ParallelDofs().Dof2Proc(k))) for k in range(fes.ndof)])
    Fhat = B @ scaledA @ (IdentityMatrix() - Aiinv @ A) @ scaledBT
    nFhat = B @ scaledA @ (Aiinv @ A - IdentityMatrix()) @ scaledBT
    return [A_dp, A_dp_inv, F, Fhat, nFhat, B, scaledA, scaledBT]
In [6]:
%%px
comm = MPI_Init()
mesh = load_mesh(nref=1)
fes, a, f = setup_space(mesh, order=2)
A_dp, A_dp_inv, F, Fhat, nFhat, B, scaledA, scaledBT = setup_FETIDP(fes, a)
[stdout:1]
global,  ndof = 576877 , lodofs = 75403
avg DOFs per core:  6607.56
# of global primal dofs:  399
# of global multipliers = : 87867

Setting up the block-matrices

For setting up the saddle point system and the preconditioner, we can use BlockMatrix.

We could implement \(\widehat{M}^{-1}\) more efficiently as

\[\begin{split}\widehat{M}^{-1} = \left( \begin{matrix} I & 0 \\ 0 & M_s^{-1} \end{matrix} \right) \cdot \left( \begin{matrix} I & 0 \\ B & I \end{matrix} \right) \cdot \left( \begin{matrix} A_{\scriptscriptstyle DP}^{-1} & 0 \\ 0 & I \end{matrix} \right)\end{split}\]

This would mean that we only have to apply \(A_{\scriptscriptstyle DP}^{-1}\) and \(M_s^{-1}\) once instead of twice.

However, this way we would still need multiple unnecessary vector copies.

We can avoid both double applications and annecessary vector copies if we really quickly implement this ourselfs.

In [7]:
%%px
M = BlockMatrix([[A_dp, B.T], \
                 [B, None]])
Mhat = BlockMatrix([[A_dp_inv, None], \
                    [Fhat @ B @ A_dp_inv, nFhat]])
class Mhat_v2(BaseMatrix):
    def __init__(self, Ahat, B, Fhat):
        super(Mhat_v2, self).__init__()
        self.Ahat = Ahat
        self.B = B
        self.Fhat = Fhat
        self.hv = Fhat.CreateColVector()
    def Mult(self, x, y):
        y[0].data = self.Ahat * x[0]
        self.hv.data = x[1] - self.B * y[0]
        y[1].data = self.Fhat * self.hv
    def CreateRowVector(self):
        return BlockVector([self.Ahat.CreateRowVector(), self.Fhat.CreateRowVector()])
    def CreateColVector(self):
        return BlockVector([self.Ahat.CreateRowVector(), self.Fhat.CreateRowVector()])
Mhat2 = Mhat_v2(A_dp_inv, B, Fhat)

There is one small problem. We have to use the C++-side GMRES, simply because that is currently the only one available to us form NGSolve.

BlockMatrix does not play nice with GMRES, because BlockMatrix works with BlockVectors, and GMRES expects standard NGSolve (Parallel-)Vectors.

“LinMat” is a temporary simple workaround, that just copies between BlockVectors and normal “linear” vectors.

In [8]:
%%px
M_lin = LinMat(M, [A_dp.col_pardofs, B.col_pardofs])
Mhat_lin = LinMat(Mhat, [A_dp.col_pardofs, B.col_pardofs])
Mhat2_lin = LinMat(Mhat2, [A_dp.col_pardofs, B.col_pardofs])

One last annoyance is that jupyter-notebooks do not capture C++ stdout, so we lose the output during the iteration.

this only affects us inside the notebooks

In [9]:
%%px
sol = M_lin.CreateRowVector()
rhs = sol.CreateVector()
rhs[:] = 0.0
rhs[0:fes.ndof] = f.vec.local_vec
rhs.SetParallelStatus(f.vec.GetParallelStatus())
ngsglobals.msg_level = 3
t1 = -comm.WTime()
gmr = GMRESSolver(mat=M_lin, pre=Mhat_lin, maxsteps=100,\
                  precision=1e-6, printrates=True)
sol.data = gmr * rhs
t1 += comm.WTime()
nsteps1 = gmr.GetSteps()
t2 = -comm.WTime()
gmr = GMRESSolver(mat=M_lin, pre=Mhat2_lin, maxsteps=100,\
                  precision=1e-6, printrates=True)
sol.data = gmr * rhs
t2 += comm.WTime()
nsteps2 = gmr.GetSteps()
if comm.rank==0:
    print('\ntook', nsteps1, 'steps for v1')
    print('time solve v1: ', t1)
    print('dofs per proc and second v1: ', fes.ndofglobal / ( t1 * comm.size))
    print('\ntook', nsteps2, 'steps for v2')
    print('time solve v2: ', t2)
    print('dofs per proc and second v2: ', fes.ndofglobal / ( t2 * comm.size))
[stdout:1]
took 13 steps for v1
time solve v1:  1.1863238600781187
dofs per proc and second v1:  4862.727788025885

took 13 steps for v2
time solve v2:  0.5696066429372877
dofs per proc and second v2:  10127.63820704796
In [10]:
%%px --target 1
for t in sorted(filter(lambda t:t['time']>0.5, Timers()), key=lambda t:t['time'], reverse=True):
    print(t['name'], ':  ', t['time'])
dummy - AllReduce :   1.266655683517456
dummy - AllReduce :   0.8690826892852783
In [11]:
%%px
t_chol = filter(lambda t: t['name'] == 'SparseCholesky<d,d,d>::MultAdd', Timers()).__next__()
maxt = comm.Max(t_chol['time'])
if t_chol['time'] == maxt:
    print('timers from rank ', comm.rank, ':')
    for t in sorted(filter(lambda t:t['time']>min(0.3*maxt, 0.5), Timers()), key=lambda t:t['time'], reverse=True):
        print(t['name'], ':  ', t['time'])
[stdout:12]
timers from rank  13 :
SparseCholesky<d,d,d>::MultAdd :   1.9462761878967285
SparseCholesky<d,d,d>::MultAdd fac1 :   1.1591949462890625
SparseCholesky<d,d,d>::MultAdd fac2 :   0.7608258724212646
In [12]:
%%px --targets 0:20
maxt = max(Timers(), key=lambda t:t['time'])
if maxt['time']>min(0.3*maxt['time'], 0.5):
    print('max timer rank', comm.rank, ': ', maxt['name'], ' ', maxt['time'])
[stdout:0] max timer rank 14 :  SparseCholesky<d,d,d>::MultAdd   1.7368290424346924
[stdout:1] max timer rank 0 :  dummy - AllReduce   1.266655683517456
[stdout:2] max timer rank 18 :  SparseCholesky<d,d,d>::MultAdd   1.413224220275879
[stdout:3] max timer rank 8 :  SparseCholesky<d,d,d>::MultAdd   1.675055980682373
[stdout:4] max timer rank 17 :  SparseCholesky<d,d,d>::MultAdd   1.814605951309204
[stdout:5] max timer rank 16 :  SparseCholesky<d,d,d>::MultAdd   1.579732894897461
[stdout:6] max timer rank 2 :  SparseCholesky<d,d,d>::MultAdd   0.8424544334411621
[stdout:7] max timer rank 12 :  SparseCholesky<d,d,d>::MultAdd   1.098663091659546
[stdout:8] max timer rank 10 :  dummy - AllReduce   0.6785235404968262
[stdout:9] max timer rank 6 :  SparseCholesky<d,d,d>::MultAdd   1.760845422744751
[stdout:10] max timer rank 1 :  SparseCholesky<d,d,d>::MultAdd   1.5053730010986328
[stdout:11] max timer rank 19 :  SparseCholesky<d,d,d>::MultAdd   1.6160402297973633
[stdout:12] max timer rank 13 :  SparseCholesky<d,d,d>::MultAdd   1.9462761878967285
[stdout:13] max timer rank 15 :  SparseCholesky<d,d,d>::MultAdd   1.866727352142334
[stdout:14] max timer rank 4 :  SparseCholesky<d,d,d>::MultAdd   1.491440773010254
[stdout:15] max timer rank 5 :  SparseCholesky<d,d,d>::MultAdd   1.719233751296997
[stdout:16] max timer rank 3 :  SparseCholesky<d,d,d>::MultAdd   1.3392870426177979
[stdout:17] max timer rank 11 :  SparseCholesky<d,d,d>::MultAdd   0.9613127708435059
[stdout:18] max timer rank 7 :  SparseCholesky<d,d,d>::MultAdd   1.774916648864746
[stdout:19] max timer rank 9 :  SparseCholesky<d,d,d>::MultAdd   1.0627970695495605
In [13]:
stop_cluster()