5.5 FETI-DP in NGSolve IV: Inexact FETI-DP¶
This time, we will, instead of going to the schur-complement for \(\lambda\), directly iterate on the equation
The preconditioner will be
As \(\widehat{M}^{-1}\) is not symmetric, we will use GMRES.
For setting up the preconditioner, we only need pieces we already have.
In [1]:
num_procs = '100'
In [2]:
from usrmeeting_jupyterstuff import *
In [3]:
stop_cluster()
In [4]:
start_cluster(num_procs)
connect_cluster()
Waiting for connection file: ~/.ipython/profile_ngsolve/security/ipcontroller-kogler-client.json
connecting ... try:6 succeeded!
In [5]:
%%px
from ngsolve import *
import netgen.meshing as ngmeshing
from ngsolve.la import ParallelMatrix, FETI_Jump, SparseMatrixd, ParallelDofs, BlockMatrix
from dd_toolbox import FindFEV, DPSpace_Inverse, ScaledMat, LinMat
def load_mesh(nref=0):
ngmesh = ngmeshing.Mesh(dim=3)
ngmesh.Load('cube.vol')
for l in range(nref):
ngmesh.Refine()
return Mesh(ngmesh)
def setup_space(mesh, order=1):
comm = MPI_Init()
fes = H1(mesh, order=order, dirichlet='right|top')
a = BilinearForm(fes)
u,v = fes.TnT()
a += SymbolicBFI(grad(u)*grad(v))
a.Assemble()
f = LinearForm(fes)
f += SymbolicLFI(x*y*v)
f.Assemble()
avg_dof = comm.Sum(fes.ndof) / comm.size
if comm.rank==0:
print('global, ndof =', fes.ndofglobal, ', lodofs =', fes.lospace.ndofglobal)
print('avg DOFs per core: ', avg_dof)
return [fes, a, f]
def setup_FETIDP(fes, a):
faces, edges, vertices = FindFEV(mesh.dim, mesh.nv, \
fes.ParallelDofs(), fes.FreeDofs())
primal_dofs = BitArray([ v in set(vertices) for v in range(fes.ndof) ]) & fes.FreeDofs()
dp_pardofs = fes.ParallelDofs().SubSet(primal_dofs)
ar = [(num_e[0],d,1.0) for num_e in enumerate(edges) for d in num_e[1] ]
rows, cols, vals = [list(x) for x in zip(*ar)] if len(ar) else [[],[],[]]
B_p = SparseMatrixd.CreateFromCOO(rows, cols, vals, len(edges), fes.ndof)
edist_procs = [sorted(set.intersection(*[set(fes.ParallelDofs().Dof2Proc(v)) for v in edge])) for edge in edges]
eavg_pardofs = ParallelDofs(edist_procs, comm)
nprim = comm.Sum(sum([1 for k in range(fes.ndof) if primal_dofs[k] and comm.rank<fes.ParallelDofs().Dof2Proc(k)[0] ]))
if comm.rank==0:
print('# of global primal dofs: ', nprim)
A_dp = ParallelMatrix(a.mat.local_mat, dp_pardofs)
dual_pardofs = fes.ParallelDofs().SubSet(BitArray(~primal_dofs & fes.FreeDofs()))
B = FETI_Jump(dual_pardofs, u_pardofs=dp_pardofs)
if comm.rank==0:
print('# of global multipliers = :', B.col_pardofs.ndofglobal)
A_dp_inv = DPSpace_Inverse(mat=a.mat, freedofs=fes.FreeDofs(), \
c_points=primal_dofs, \
c_mat=B_p, c_pardofs=eavg_pardofs, \
invtype_loc='sparsecholesky', \
invtype_glob='masterinverse')
F = B @ A_dp_inv @ B.T
innerdofs = BitArray([len(fes.ParallelDofs().Dof2Proc(k))==0 for k in range(fes.ndof)]) & fes.FreeDofs()
A = a.mat.local_mat
Aiinv = A.Inverse(innerdofs, inverse='sparsecholesky')
scaledA = ScaledMat(A, [0 if primal_dofs[k] else 1.0/(1+len(fes.ParallelDofs().Dof2Proc(k))) for k in range(fes.ndof)])
scaledBT = ScaledMat(B.T, [0 if primal_dofs[k] else 1.0/(1+len(fes.ParallelDofs().Dof2Proc(k))) for k in range(fes.ndof)])
Fhat = B @ scaledA @ (IdentityMatrix() - Aiinv @ A) @ scaledBT
nFhat = B @ scaledA @ (Aiinv @ A - IdentityMatrix()) @ scaledBT
return [A_dp, A_dp_inv, F, Fhat, nFhat, B, scaledA, scaledBT]
In [6]:
%%px
comm = MPI_Init()
mesh = load_mesh(nref=1)
fes, a, f = setup_space(mesh, order=2)
A_dp, A_dp_inv, F, Fhat, nFhat, B, scaledA, scaledBT = setup_FETIDP(fes, a)
[stdout:1]
global, ndof = 576877 , lodofs = 75403
avg DOFs per core: 6607.56
# of global primal dofs: 399
# of global multipliers = : 87867
Setting up the block-matrices¶
For setting up the saddle point system and the preconditioner, we can use BlockMatrix.
We could implement \(\widehat{M}^{-1}\) more efficiently as
This would mean that we only have to apply \(A_{\scriptscriptstyle DP}^{-1}\) and \(M_s^{-1}\) once instead of twice.
However, this way we would still need multiple unnecessary vector copies.
We can avoid both double applications and annecessary vector copies if we really quickly implement this ourselfs.
In [7]:
%%px
M = BlockMatrix([[A_dp, B.T], \
[B, None]])
Mhat = BlockMatrix([[A_dp_inv, None], \
[Fhat @ B @ A_dp_inv, nFhat]])
class Mhat_v2(BaseMatrix):
def __init__(self, Ahat, B, Fhat):
super(Mhat_v2, self).__init__()
self.Ahat = Ahat
self.B = B
self.Fhat = Fhat
self.hv = Fhat.CreateColVector()
def Mult(self, x, y):
y[0].data = self.Ahat * x[0]
self.hv.data = x[1] - self.B * y[0]
y[1].data = self.Fhat * self.hv
def CreateRowVector(self):
return BlockVector([self.Ahat.CreateRowVector(), self.Fhat.CreateRowVector()])
def CreateColVector(self):
return BlockVector([self.Ahat.CreateRowVector(), self.Fhat.CreateRowVector()])
Mhat2 = Mhat_v2(A_dp_inv, B, Fhat)
There is one small problem. We have to use the C++-side GMRES, simply because that is currently the only one available to us form NGSolve.
BlockMatrix does not play nice with GMRES, because BlockMatrix works with BlockVectors, and GMRES expects standard NGSolve (Parallel-)Vectors.
“LinMat” is a temporary simple workaround, that just copies between BlockVectors and normal “linear” vectors.
In [8]:
%%px
M_lin = LinMat(M, [A_dp.col_pardofs, B.col_pardofs])
Mhat_lin = LinMat(Mhat, [A_dp.col_pardofs, B.col_pardofs])
Mhat2_lin = LinMat(Mhat2, [A_dp.col_pardofs, B.col_pardofs])
One last annoyance is that jupyter-notebooks do not capture C++ stdout, so we lose the output during the iteration.
this only affects us inside the notebooks
In [9]:
%%px
sol = M_lin.CreateRowVector()
rhs = sol.CreateVector()
rhs[:] = 0.0
rhs[0:fes.ndof] = f.vec.local_vec
rhs.SetParallelStatus(f.vec.GetParallelStatus())
ngsglobals.msg_level = 3
t1 = -comm.WTime()
gmr = GMRESSolver(mat=M_lin, pre=Mhat_lin, maxsteps=100,\
precision=1e-6, printrates=True)
sol.data = gmr * rhs
t1 += comm.WTime()
nsteps1 = gmr.GetSteps()
t2 = -comm.WTime()
gmr = GMRESSolver(mat=M_lin, pre=Mhat2_lin, maxsteps=100,\
precision=1e-6, printrates=True)
sol.data = gmr * rhs
t2 += comm.WTime()
nsteps2 = gmr.GetSteps()
if comm.rank==0:
print('\ntook', nsteps1, 'steps for v1')
print('time solve v1: ', t1)
print('dofs per proc and second v1: ', fes.ndofglobal / ( t1 * comm.size))
print('\ntook', nsteps2, 'steps for v2')
print('time solve v2: ', t2)
print('dofs per proc and second v2: ', fes.ndofglobal / ( t2 * comm.size))
[stdout:1]
took 13 steps for v1
time solve v1: 1.1863238600781187
dofs per proc and second v1: 4862.727788025885
took 13 steps for v2
time solve v2: 0.5696066429372877
dofs per proc and second v2: 10127.63820704796
In [10]:
%%px --target 1
for t in sorted(filter(lambda t:t['time']>0.5, Timers()), key=lambda t:t['time'], reverse=True):
print(t['name'], ': ', t['time'])
dummy - AllReduce : 1.266655683517456
dummy - AllReduce : 0.8690826892852783
In [11]:
%%px
t_chol = filter(lambda t: t['name'] == 'SparseCholesky<d,d,d>::MultAdd', Timers()).__next__()
maxt = comm.Max(t_chol['time'])
if t_chol['time'] == maxt:
print('timers from rank ', comm.rank, ':')
for t in sorted(filter(lambda t:t['time']>min(0.3*maxt, 0.5), Timers()), key=lambda t:t['time'], reverse=True):
print(t['name'], ': ', t['time'])
[stdout:12]
timers from rank 13 :
SparseCholesky<d,d,d>::MultAdd : 1.9462761878967285
SparseCholesky<d,d,d>::MultAdd fac1 : 1.1591949462890625
SparseCholesky<d,d,d>::MultAdd fac2 : 0.7608258724212646
In [12]:
%%px --targets 0:20
maxt = max(Timers(), key=lambda t:t['time'])
if maxt['time']>min(0.3*maxt['time'], 0.5):
print('max timer rank', comm.rank, ': ', maxt['name'], ' ', maxt['time'])
[stdout:0] max timer rank 14 : SparseCholesky<d,d,d>::MultAdd 1.7368290424346924
[stdout:1] max timer rank 0 : dummy - AllReduce 1.266655683517456
[stdout:2] max timer rank 18 : SparseCholesky<d,d,d>::MultAdd 1.413224220275879
[stdout:3] max timer rank 8 : SparseCholesky<d,d,d>::MultAdd 1.675055980682373
[stdout:4] max timer rank 17 : SparseCholesky<d,d,d>::MultAdd 1.814605951309204
[stdout:5] max timer rank 16 : SparseCholesky<d,d,d>::MultAdd 1.579732894897461
[stdout:6] max timer rank 2 : SparseCholesky<d,d,d>::MultAdd 0.8424544334411621
[stdout:7] max timer rank 12 : SparseCholesky<d,d,d>::MultAdd 1.098663091659546
[stdout:8] max timer rank 10 : dummy - AllReduce 0.6785235404968262
[stdout:9] max timer rank 6 : SparseCholesky<d,d,d>::MultAdd 1.760845422744751
[stdout:10] max timer rank 1 : SparseCholesky<d,d,d>::MultAdd 1.5053730010986328
[stdout:11] max timer rank 19 : SparseCholesky<d,d,d>::MultAdd 1.6160402297973633
[stdout:12] max timer rank 13 : SparseCholesky<d,d,d>::MultAdd 1.9462761878967285
[stdout:13] max timer rank 15 : SparseCholesky<d,d,d>::MultAdd 1.866727352142334
[stdout:14] max timer rank 4 : SparseCholesky<d,d,d>::MultAdd 1.491440773010254
[stdout:15] max timer rank 5 : SparseCholesky<d,d,d>::MultAdd 1.719233751296997
[stdout:16] max timer rank 3 : SparseCholesky<d,d,d>::MultAdd 1.3392870426177979
[stdout:17] max timer rank 11 : SparseCholesky<d,d,d>::MultAdd 0.9613127708435059
[stdout:18] max timer rank 7 : SparseCholesky<d,d,d>::MultAdd 1.774916648864746
[stdout:19] max timer rank 9 : SparseCholesky<d,d,d>::MultAdd 1.0627970695495605
In [13]:
stop_cluster()