Fix AniaBug#2: Lifecycle of SelfSufficient slices was wrong (comments)

This commit describes the life cycle of a slice.
At some point, a rank gets a list of the slices it needs
in the next iteration and classifies them according
to the characteristics of each situation.

If, for instance, we are given a slice with
an abc tuple and we find that this tuple
was assigned to our own rank, then we know that
we have to create a SelfSufficient slice.

To do so, we find a blank slice in our
SliceUnion slices bucket. This buffer is blank,
so it is safe to do whatever we want with it.
Without CUDA, we just need to point this
blank slice to the memory address of the data,
which we (the SliceUnion) own.
This is done by the line

  blank.data = sources[from.source].data();

Of course, doing this with CUDA breaks everything,
as it did until now, since we are pointing to a host
address. Sadly, the way the casting fu is currently
implemented, the type checker did not catch this one
and I foolishly forgot about this important bit.
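
As an illustration of what it would take for the type checker
to catch such a mix-up, here is a minimal sketch. HostPtr,
DevicePtr, DeviceSlice and hostAddress are hypothetical names
that do not exist in the code base; the idea is only that the
address space becomes part of the type.

```cpp
#include <cassert>
#include <type_traits>
#include <vector>

// Hypothetical wrappers, not part of the code base: the address space
// is part of the type, so host and device pointers cannot be mixed.
template <typename F> struct HostPtr   { F* raw; };
template <typename F> struct DevicePtr { F* raw; };

// Hypothetical slice whose data field only accepts device pointers.
template <typename F> struct DeviceSlice {
  DevicePtr<F> data{nullptr};
};

// Taking the address of host memory yields a HostPtr, never a DevicePtr.
template <typename F>
HostPtr<F> hostAddress(std::vector<F>& source) {
  return HostPtr<F>{source.data()};
}

// An assignment like  slice.data = hostAddress(source);  would be
// rejected at compile time, which is what these two checks state.
static_assert(!std::is_convertible<HostPtr<double>, DevicePtr<double>>::value,
              "host pointers must not convert to device pointers");
static_assert(!std::is_assignable<DevicePtr<double>&, HostPtr<double>>::value,
              "a device slot must not accept a host pointer");

// Small usage check: the wrapper keeps the host address unchanged.
inline void hostAddressExample() {
  std::vector<double> source(4, 0.0);
  HostPtr<double> p = hostAddress(source);
  assert(p.raw == source.data());
}
```

With raw F* pointers on both sides, as in the line above,
the compiler has no way to tell the two address spaces apart.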

After its creation, at some point in its life cycle
the slice has to be destroyed, which we also
have to handle separately.
This is done every iteration in the

    void clearUnusedSlicesForNext(ABCTuple const& abc);

function. There, the SelfSufficient slice would normally
just forget the pointer it holds, slice.data,
since this pointer is part of the original data of the tensor
distributed in the SliceUnion. In the CUDA case, however,
we gave the SelfSufficient slice a freePointer from our
SliceUnion's bucket of freePointers allocated on the GPU
(of which we have around 6 per SliceUnion type).
This pointer needs to be marked as free again so that
a future slice can use it: it has to go back to the bucket,
since we can't afford to lose it.
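
The take-and-give-back cycle described above can be sketched
as follows. This is only an illustration under assumptions:
PointerBucket, acquire and release are hypothetical names, the
bucket is assumed to be a std::set, and plain host pointers
stand in for device addresses, so no CUDA is needed to run it.

```cpp
#include <cassert>
#include <set>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the SliceUnion's bucket of device pointers.
// F* plays the role of a device address; nothing is allocated here.
template <typename F = double>
struct PointerBucket {
  std::set<F*> freePointers;

  // Take one pointer out of the bucket for a new slice.
  F* acquire(std::string const& name) {
    if (freePointers.empty()) {
      std::stringstream stream;
      stream << "No more free pointers for name " << name;
      throw std::domain_error(stream.str());
    }
    auto it = freePointers.begin();
    F* pointer = *it;       // read the value first,
    freePointers.erase(it); // since erasing invalidates the iterator
    return pointer;
  }

  // Give the pointer back so that a future slice can use it.
  void release(F*& pointer) {
    if (pointer == nullptr) return; // nothing to give back
    freePointers.insert(pointer);
    pointer = nullptr;              // the slice forgets the pointer
  }
};
```

In this sketch the pointer value is copied before the erase,
because an iterator into a standard container must not be
dereferenced once its element has been erased.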
Gallo Alejandro 2022-09-12 19:29:35 +02:00
parent 5483325626
commit 5678ac0d9c

@@ -195,7 +195,28 @@ template <typename F=double>
         ;
       if (blank.info.state == Slice<F>::SelfSufficient) {
 #if defined(HAVE_CUDA)
-        blank.mpi_data = sources[from.source].data();
+        const size_t _size = sizeof(F) * sources[from.source].size();
+        // TODO: this is code duplication with downstairs
+        if (freePointers.size() == 0) {
+          std::stringstream stream;
+          stream << "No more free pointers "
+                 << "for type " << type
+                 << " and name " << name
+                 ;
+          throw std::domain_error(stream.str());
+        }
+        auto dataPointer = freePointers.begin();
+        freePointers.erase(dataPointer);
+        blank.data = *dataPointer;
+        //
+        //
+        // TODO [#A]: do cuMemcpy of
+        // sources[from.source].data() ⇒ blank.data
+        // Do this when everything else is working.
+        // This will probably be a bottleneck of the H-to-D communication,
+        // as most slices are SelfSufficient.
+        //
+        //
 #else
         blank.data = sources[from.source].data();
 #endif
@@ -316,6 +337,16 @@ template <typename F=double>
           }
         }
+#if defined(HAVE_CUDA)
+        // In cuda, SelfSufficient slices have an ad-hoc pointer
+        // since it is a pointer on the device and has to be
+        // brought back to the free pointer bucket of the SliceUnion.
+        // Therefore, only in the Recycled case it should not be
+        // put back the pointer.
+        if (slice.info.state == Slice<F>::Recycled) {
+          freeSlicePointer = false;
+        }
+#else
         // if the slice is self sufficient, do not dare touching the
         // pointer, since it is a pointer to our sources in our rank.
         if ( slice.info.state == Slice<F>::SelfSufficient
@@ -323,6 +354,7 @@ template <typename F=double>
            ) {
           freeSlicePointer = false;
         }
+#endif
         // make sure we get its data pointer to be used later
         // only for non-recycled, since it can be that we need
@@ -346,7 +378,7 @@ template <typename F=double>
         // << " info " << slice.info
         << "\n";
         slice.free();
-      }
+      } // we did not find the slice
     }
   }