#+title: ATRIP: An MPI-asynchronous implementation of CCSD(T) #+PROPERTY: header-args+ :noweb yes :comments noweb :mkdirp t * Introduction The algorithm uses two main data types, the =Slice= and the =SliceUnion= as a container and resource manager of the =Slice=. * Tests :noexport:test: We will try to test some programatic functionality in atrip too, and mostly we will append all tests in a single file, however this might change for special purpose test functionality. #+begin_src c++ :tangle (atrip-test-main) :noweb yes #include #include <> <> #define TESTCASE(_name, ...) { \ std::cout << "\x1b[35m-> \x1b[0m" \ << _name \ << std::endl; \ __VA_ARGS__ \ } int main() { <> return 0; } #+end_src Therefore, small tests can be written like #+begin_src example ,#+begin_src c++ :noweb-ref testcase // your c++ snippet of code ,#+end_src #+end_src * The slice The following section introduces the idea of a slice. ** Prolog :noexport: #+begin_src c++ :tangle (atrip-slice-h) #pragma once #include #include #include #include #include #include namespace atrip { template struct Slice { #+end_src ** Introduction A slice is the concept of a subset of values of a given tensor. As an example, for the doubles amplitudes $ T^{ab}_{ij} $, one need two kinds of objects: - the object $ \mathsf{T}(a)^b_{ij} $ which for every $ a $ gets assigned the tensor $ T^{ab}_{ij} $ of size $ N_\mathrm{o}^2 \times N_\mathrm{v} $ - the object $ \mathsf{T}(a,b)_{ij} $ which for every pair of $ a, b $ corresponds the $ N_\mathrm{o}^2 $-sized tensor $ T^{ab}_{ij} $. ** Location Every slice set, for instance, $$ S_k = \left\{ a \mapsto \mathsf{T}(a)^{b}_{ij} \mid a \in A_k \right\} $$ where $ A_k $ is some subset of $ \mathsf{N}_\mathrm{v} $, gets stored in some rank $ k $. In general however, the number of elements in $ A_k $ can be bigger than the number of processes $ n_p $. Therefore in order to uniquely indentify a given slice in $ S_k $ we need two identifiers, the rank $ k $, which tells us in which core's memory the slice is allocated, and an additional tag which we will call =source=. The datatype that simply models this state of affairs is therefore a simple structure: #+begin_src c++ :tangle (atrip-slice-h) struct Location { size_t rank; size_t source; }; #+end_src ** Type Due to the permutation operators in the equations it is noticeable that for every one dimensional slice and triple $ (a,b,c) $ $$ a \mapsto \mathsf{t}(a) $$ one needs at the same time $ \mathsf{t}(a) $, $ \mathsf{t}(b) $ and $ \mathsf{t}(c) $. For two dimensional slices, i.e., slices of the form $$ (a,b) \mapsto \mathsf{t}(a,b) $$ one needs in the equations the slices $ \mathsf{t}(a,b) $, $ \mathsf{t}(b,c) $ and $ \mathsf{t}(a,c) $. In addition, in the case of diagrams where the integral $ V^{ab}_{ci} $ appears, we additionaly need the permuted slices from before, i.e. $ \mathsf{t}(b,a) $, $ \mathsf{t}(c,b) $ and $ \mathsf{t}(c,a) $. This means, every slice has associated with it a type which denotes which permutation it is. #+begin_src c++ :tangle (atrip-slice-h) enum Type { A = 10 , B , C // Two-parameter slices , AB = 20 , BC , AC // for abci and the doubles , CB , BA , CA // The non-typed slice , Blank = 404 }; #+end_src ** State Every slice can be in different states and every state denotes which function the slice is going to provide and which relations they have between themselves. - Fetch :: A slice is in state =Fetch= when it has a valid data pointer that **must** be written to. A =Fetch= slice should not live very long, this means that after the database send and receive phase, =Fetch= slices should be changed into =Dispatched= in order to start the process of writing to the data pointer from some other rank. - Dispatched :: A =Dispatched= slice indicates that at some point send and receive MPI calls have been dispatched in order to get the data. However, the calls have just been dispatched and there is no warranty for the data to be there, for that, the slice must be unwrapped. - Ready :: =Ready= means that the data pointer can be read from directly. - SelfSufficient :: A slice is =SelfSufficient= when its contents are located in the same rank that it lives, so that it does not have to fetch from no other rank. This is important in order to handle the data pointers correctly and in order to save calls to MPI receive and send functions. - Recycled :: =Recycled= means that this slice gets its data pointer from another slice, so it should not be written to - Acceptor :: =Acceptor= means that the slice can accept a new slice, it is the counterpart of the =Blank= type, but for states Again the implementation is a simple enum type. #+begin_src c++ :tangle (atrip-slice-h) enum State { Fetch = 0, Dispatched = 2, Ready = 1, SelfSufficient = 911, Recycled = 123, Acceptor = 405 }; #+end_src ** Data pointer The data pointer type is an abstraction of where the data of the slice are. In a CPU architecture it will simply be =F*=, otherwise it will be a pointer to a GPU address in the case of a GPU architecture. #+begin_src c++ :tangle (atrip-types-h) #pragma once #include #include namespace atrip { template struct DataField; template <> struct DataField { using type = double; }; #if defined(HAVE_CUDA) template using DataPtr = CUdeviceptr; #define DataNullPtr 0x00 template <> struct DataField { using type = cuDoubleComplex; }; #else template using DataPtr = F*; #define DataNullPtr nullptr template <> struct DataField { using type = Complex; }; #endif template using DataFieldType = typename DataField::type; } #+end_src ** The Info structure Every slice has an information structure associated with it that keeps track of the **variable** type, state and so on. #+begin_src c++ :tangle (atrip-slice-h) struct Info { // which part of a,b,c the slice holds PartialTuple tuple; // The type of slice for the user to retrieve the correct one Type type; // What is the state of the slice State state; // Where the slice is to be retrieved Location from; // If the data are actually to be found in this other slice Type recycling; Info() : tuple{0,0} , type{Blank} , state{Acceptor} , from{0,0} , recycling{Blank} {} }; using Ty_x_Tu = std::pair< Type, PartialTuple >; #+end_src ** Name CCSD(T) needs in this algorithm 5 types of tensor slices, namely $ V^{ij}_{ka} $, $ V^{ab}_{ci} $, $ V^{ab}_{ij} $ and two times $ T^{ab}_{ij} $. The reason why we need two times the doubles amplitudes is because in the doubles contribution to the energy, the $ T $ amplidutes will be sliced through one parameter for the particle contribution and through two parameters for the hole contribution. #+begin_src c++ :tangle (atrip-slice-h) enum Name { TA = 100 , VIJKA = 101 , VABCI = 200 , TABIJ = 201 , VABIJ = 202 }; #+end_src ** Database The database is a simple representation of the slices of a slice union. Every element of the database is given by the name of the tensor it represents and the internal information structure. #+begin_src c++ :tangle (atrip-slice-h) struct LocalDatabaseElement { Slice::Name name; Slice::Info info; }; #+end_src A local database (of a given rank) and the global database is thus simply a vector of these elements. #+begin_src c++ :tangle (atrip-slice-h) using LocalDatabase = std::vector; using Database = LocalDatabase; #+end_src ** MPI Types #+begin_src c++ :tangle (atrip-slice-h) struct mpi { static MPI_Datatype vector(size_t n, MPI_Datatype const& DT) { MPI_Datatype dt; MPI_Type_vector(n, 1, 1, DT, &dt); MPI_Type_commit(&dt); return dt; } static MPI_Datatype sliceLocation () { constexpr int n = 2; // create a sliceLocation to measure in the current architecture // the packing of the struct Slice::Location measure; MPI_Datatype dt; const std::vector lengths(n, 1); const MPI_Datatype types[n] = {usizeDt(), usizeDt()}; static_assert(sizeof(Slice::Location) == 2 * sizeof(size_t), "The Location packing is wrong in your compiler"); // measure the displacements in the struct size_t j = 0; MPI_Aint base_address, displacements[n]; MPI_Get_address(&measure, &base_address); MPI_Get_address(&measure.rank, &displacements[j++]); MPI_Get_address(&measure.source, &displacements[j++]); for (size_t i = 0; i < n; i++) displacements[i] = MPI_Aint_diff(displacements[i], base_address); MPI_Type_create_struct(n, lengths.data(), displacements, types, &dt); MPI_Type_commit(&dt); return dt; } static MPI_Datatype usizeDt() { return MPI_UINT64_T; } static MPI_Datatype sliceInfo () { constexpr int n = 5; MPI_Datatype dt; Slice::Info measure; const std::vector lengths(n, 1); const MPI_Datatype types[n] = { vector(2, usizeDt()) , vector(sizeof(enum Type), MPI_CHAR) , vector(sizeof(enum State), MPI_CHAR) , sliceLocation() , vector(sizeof(enum Type), MPI_CHAR) // TODO: Why this does not work on intel mpi? /*, MPI_UINT64_T*/ }; static_assert(sizeof(enum Type) == 4, "Enum type not 4 bytes long"); static_assert(sizeof(enum State) == 4, "Enum State not 4 bytes long"); static_assert(sizeof(enum Name) == 4, "Enum Name not 4 bytes long"); // create the displacements from the info measurement struct size_t j = 0; MPI_Aint base_address, displacements[n]; MPI_Get_address(&measure, &base_address); MPI_Get_address(&measure.tuple[0], &displacements[j++]); MPI_Get_address(&measure.type, &displacements[j++]); MPI_Get_address(&measure.state, &displacements[j++]); MPI_Get_address(&measure.from, &displacements[j++]); MPI_Get_address(&measure.recycling, &displacements[j++]); for (size_t i = 0; i < n; i++) displacements[i] = MPI_Aint_diff(displacements[i], base_address); MPI_Type_create_struct(n, lengths.data(), displacements, types, &dt); MPI_Type_commit(&dt); return dt; } static MPI_Datatype localDatabaseElement () { constexpr int n = 2; MPI_Datatype dt; LocalDatabaseElement measure; const std::vector lengths(n, 1); const MPI_Datatype types[n] = { vector(sizeof(enum Name), MPI_CHAR) , sliceInfo() }; // measure the displacements in the struct size_t j = 0; MPI_Aint base_address, displacements[n]; MPI_Get_address(&measure, &base_address); MPI_Get_address(&measure.name, &displacements[j++]); MPI_Get_address(&measure.info, &displacements[j++]); for (size_t i = 0; i < n; i++) displacements[i] = MPI_Aint_diff(displacements[i], base_address); static_assert( sizeof(LocalDatabaseElement) == sizeof(measure) , "Measure has bad size"); MPI_Type_create_struct(n, lengths.data(), displacements, types, &dt); MPI_Type_commit(&dt); return vector(sizeof(LocalDatabaseElement), MPI_CHAR); // TODO: write tests in order to know if this works return dt; } }; #+end_src ** Static utilities This section presents some functions which are useful to work with slices and are inside the namespace created by the slice struct. The function =subtupleBySlice= gives to every =Slice::Type= its meaning in terms of the triples $ (a,b,c) $. Notice that since in general the relation $ a < b < c $ holds (in our implementation), the case of one-dimensional parametrizations =A=, =B= and =C= is well defined. The function should only throw if there is an implementation error where the =Slice::Type= enum has been expanded and this function has not been updated accordingly. #+begin_src c++ :tangle (atrip-slice-h) static PartialTuple subtupleBySlice(ABCTuple abc, Type sliceType) { switch (sliceType) { case AB: return {abc[0], abc[1]}; case BC: return {abc[1], abc[2]}; case AC: return {abc[0], abc[2]}; case CB: return {abc[2], abc[1]}; case BA: return {abc[1], abc[0]}; case CA: return {abc[2], abc[0]}; case A: return {abc[0], 0}; case B: return {abc[1], 0}; case C: return {abc[2], 0}; default: throw "Switch statement not exhaustive!"; } } #+end_src In the context of cleaning up slices during the main loop, it is important to check if a given slice has some slices referencing to it in quality of recycled slices. This function should therefore return a vector of pointers of slices referencing to the given slice's info, when the length of the vector is zero, then there are no dangling links. #+begin_src c++ :tangle (atrip-slice-h) static std::vector*> hasRecycledReferencingToIt ( std::vector> &slices , Info const& info ) { std::vector*> result; for (auto& s: slices) if ( s.info.recycling == info.type && s.info.tuple == info.tuple && s.info.state == Recycled ) result.push_back(&s); return result; } #+end_src The rest of the coming functions are utilities in order to find in a vector of slices a given slice by reference. Mostly they are merely convenience wrappers to the standard library function =std::find_if=. They are named as =find<...>=, where =<...>= represents some condition and must always return a reference to the found slice, i.e., =Slice&=. =Atrip= relies on these functions to find the sought for slices, therefore these functions will throw a =std::domain_error= if the given slice could not be found. #+begin_src c++ :tangle (atrip-slice-h) static Slice& findOneByType(std::vector> &slices, Slice::Type type) { const auto sliceIt = std::find_if(slices.begin(), slices.end(), [&type](Slice const& s) { return type == s.info.type; }); WITH_CRAZY_DEBUG WITH_RANK << "\t__ looking for " << type << "\n"; if (sliceIt == slices.end()) throw std::domain_error("Slice by type not found!"); return *sliceIt; } #+end_src #+begin_src c++ :tangle (atrip-slice-h) static Slice& findRecycledSource (std::vector> &slices, Slice::Info info) { const auto sliceIt = std::find_if(slices.begin(), slices.end(), [&info](Slice const& s) { return info.recycling == s.info.type && info.tuple == s.info.tuple && State::Recycled != s.info.state ; }); WITH_CRAZY_DEBUG WITH_RANK << "__slice__:find: recycling source of " << pretty_print(info) << "\n"; if (sliceIt == slices.end()) throw std::domain_error( "Slice not found: " + pretty_print(info) + " rank: " + pretty_print(Atrip::rank) ); WITH_RANK << "__slice__:find: " << pretty_print(sliceIt->info) << "\n"; return *sliceIt; } #+end_src #+begin_src c++ :tangle (atrip-slice-h) static Slice& findByTypeAbc ( std::vector> &slices , Slice::Type type , ABCTuple const& abc ) { const auto tuple = Slice::subtupleBySlice(abc, type); const auto sliceIt = std::find_if(slices.begin(), slices.end(), [&type, &tuple](Slice const& s) { return type == s.info.type && tuple == s.info.tuple ; }); WITH_CRAZY_DEBUG WITH_RANK << "__slice__:find:" << type << " and tuple " << pretty_print(tuple) << "\n"; if (sliceIt == slices.end()) throw std::domain_error( "Slice not found: " + pretty_print(tuple) + ", " + pretty_print(type) + " rank: " + pretty_print(Atrip::rank) ); return *sliceIt; } #+end_src #+begin_src c++ :tangle (atrip-slice-h) static Slice& findByInfo(std::vector> &slices, Slice::Info const& info) { const auto sliceIt = std::find_if(slices.begin(), slices.end(), [&info](Slice const& s) { // TODO: maybe implement comparison in Info struct return info.type == s.info.type && info.state == s.info.state && info.tuple == s.info.tuple && info.from.rank == s.info.from.rank && info.from.source == s.info.from.source ; }); WITH_CRAZY_DEBUG WITH_RANK << "__slice__:find:looking for " << pretty_print(info) << "\n"; if (sliceIt == slices.end()) throw std::domain_error( "Slice by info not found: " + pretty_print(info)); return *sliceIt; } #+end_src ** Attributes A slice object does not own data, it is just a container or a pointer to data together with additional bookkeeping facilities. It includes an info structure with the information about the slice, =Type=, =State= etc, which will be later communicated to other ranks. #+begin_src c++ :tangle (atrip-slice-h) Info info; #+end_src A pointer to data is also necessary for the =Slice= but not necessary to be communicated to other ranks. The =Slice= should never allocate or deallocate itself the pointer. #+begin_src c++ :tangle (atrip-slice-h) DataPtr data; #if defined(HAVE_CUDA) F* mpi_data; #endif #+end_src An =MPI_Request= handle is also included so that the slices that are to receive data through MPI can know which request they belong to. #+begin_src c++ :tangle (atrip-slice-h) MPI_Request request; #+end_src For practical purposes in MPI calls, the number of elements in =data= is also included. #+begin_src c++ :tangle (atrip-slice-h) const size_t size; #+end_src ** Member functions It is important to note that a ready slice should not be recycled from any other slice, so that it can have access by itself to the data. #+begin_src c++ :tangle (atrip-slice-h) void markReady() noexcept { info.state = Ready; info.recycling = Blank; } #+end_src The following function asks wether or not the slice has effectively been unwrapped or not, i.e., wether or not the data are accessible and already there. This can only happen in two ways, either is the slice =Ready= or it is =SelfSufficient=, i.e., the data pointed to was pre-distributed to the current node. #+begin_src c++ :tangle (atrip-slice-h) bool isUnwrapped() const noexcept { return info.state == Ready || info.state == SelfSufficient ; } #+end_src The function =isUnwrappable= answers which slices can be unwrapped potentially. Unwrapped slices can be unwrapped again idempotentially. Also =Recycled= slices can be unwrapped, i.e. the slices pointed to by them will be unwrapped. The only other possibility is that the slice has been dispatched in the past and can be unwrapped. The case where the state is =Dispatched= is the canonical intuitive case where a real process of unwrapping, i.e. waiting for the data to get through the network, is done. #+begin_src c++ :tangle (atrip-slice-h) bool isUnwrappable() const noexcept { return isUnwrapped() || info.state == Recycled || info.state == Dispatched ; } inline bool isDirectlyFetchable() const noexcept { return info.state == Ready || info.state == Dispatched; } void free() noexcept { info.tuple = {0, 0}; info.type = Blank; info.state = Acceptor; info.from = {0, 0}; info.recycling = Blank; data = DataNullPtr; } inline bool isFree() const noexcept { return info.tuple == PartialTuple{0, 0} && info.type == Blank && info.state == Acceptor && info.from.rank == 0 && info.from.source == 0 && info.recycling == Blank && data == DataNullPtr ; } #+end_src The function =isRecylable= answers the question, which slices can be recycled. A slice can only be recycled if it is Fetch or Ready and has a valid datapointer. In particular, SelfSufficient are not recyclable, since it is easier just to create a SelfSufficient slice than deal with data dependencies. Furthermore, a recycled slice is not recyclable, if this is the case then it is either bad design or a bug. #+begin_src c++ :tangle (atrip-slice-h) inline bool isRecyclable() const noexcept { return ( info.state == Dispatched || info.state == Ready || info.state == Fetch ) && hasValidDataPointer() ; } #+end_src The function =hasValidDataPointer= describes if a slice has a valid data pointer. This is important to know if the slice has some data to it, also some structural checks are done, so that it should not be =Acceptor= or =Blank=, if this is the case then it is a bug. #+begin_src c++ :tangle (atrip-slice-h) inline bool hasValidDataPointer() const noexcept { return data != DataNullPtr && info.state != Acceptor && info.type != Blank ; } #+end_src The function =unwrapAndMarkReady= calls the low-level MPI functions in order to wait whenever the state of the slice is correct. The main behaviour of the function should - return if state is =Ready=, since then there is nothing to be done. - throw if the state is not =Dispatched=, only a dispatched slice can be unwrapped through MPI. - throw if an MPI error happens. #+begin_src c++ :tangle (atrip-slice-h) void unwrapAndMarkReady() { if (info.state == Ready) return; if (info.state != Dispatched) throw std::domain_error("Can't unwrap a non-ready, non-dispatched slice!"); markReady(); MPI_Status status; #ifdef HAVE_OCD WITH_RANK << "__slice__:mpi: waiting " << "\n"; #endif const int errorCode = MPI_Wait(&request, &status); if (MPI_SUCCESS != MPI_Request_free(&request)) throw "Error freeing MPI request"; if (errorCode != MPI_SUCCESS) throw "MPI ERROR HAPPENED...."; #if defined(HAVE_CUDA) // copy the retrieved mpi data to the device cuMemcpyHtoD(data, (void*)mpi_data, sizeof(F) * size); std::free(mpi_data); #endif #ifdef HAVE_OCD char errorString[MPI_MAX_ERROR_STRING]; int errorSize; MPI_Error_string(errorCode, errorString, &errorSize); WITH_RANK << "__slice__:mpi: status " << "{ .source=" << status.MPI_SOURCE << ", .tag=" << status.MPI_TAG << ", .error=" << status.MPI_ERROR << ", .errCode=" << errorCode << ", .err=" << errorString << " }" << "\n"; #endif } #+end_src ** Epilog :noexport: #+begin_src c++ :tangle (atrip-slice-h) Slice(size_t size_) : info({}) , data(DataNullPtr) #if defined(HAVE_CUDA) , mpi_data(nullptr) #endif , size(size_) {} }; // struct Slice #+end_src ** Debug :noexport: #+begin_src c++ :tangle (atrip-slice-h) template std::ostream& operator<<(std::ostream& out, typename Slice::Location const& v) { // TODO: remove me out << "{.r(" << v.rank << "), .s(" << v.source << ")};"; return out; } template std::ostream& operator<<(std::ostream& out, typename Slice::Info const& i) { out << "«t" << i.type << ", s" << i.state << "»" << " ⊙ {" << i.from.rank << ", " << i.from.source << "}" << " ∴ {" << i.tuple[0] << ", " << i.tuple[1] << "}" << " ♲t" << i.recycling ; return out; } } // namespace atrip #+end_src * Utils This section presents some utilities ** Prolog :noexport: #+begin_src c++ :tangle (atrip-utils-h) #pragma once #include #include #include #include #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wvla" #pragma GCC diagnostic ignored "-Wint-in-bool-context" #pragma GCC diagnostic ignored "-Wunused-parameter" #pragma GCC diagnostic ignored "-Wdeprecated-copy" #include #pragma GCC diagnostic pop #include namespace atrip { #+end_src ** Pretty printing The pretty printing uses the [[https://github.com/sharkdp/dbg-macro][dbg-macro]] package. #+begin_src c++ :tangle (atrip-utils-h) #pragma GCC diagnostic push #pragma GCC diagnostic ignored "-Wunused-parameter" template std::string pretty_print(T&& value) { std::stringstream stream; #if ATRIP_DEBUG > 2 dbg::pretty_print(stream, std::forward(value)); #endif return stream.str(); } #pragma GCC diagnostic pop #+end_src ** Chrono The chrono is just a simple wrapper for a high resolution clock that can be found in the =std::chrono= namespace of the standard library. #+begin_src c++ :tangle (atrip-utils-h) #define WITH_CHRONO(__chrono_name, ...) \ Atrip::chrono[__chrono_name].start(); \ __VA_ARGS__ \ Atrip::chrono[__chrono_name].stop(); struct Timer { using Clock = std::chrono::high_resolution_clock; using Event = std::chrono::time_point; std::chrono::duration duration; Event _start; inline void start() noexcept { _start = Clock::now(); } inline void stop() noexcept { duration += Clock::now() - _start; } inline void clear() noexcept { duration *= 0; } inline double count() const noexcept { return duration.count(); } }; using Timings = std::map; #+end_src ** Epilog :noexport: #+begin_src c++ :tangle (atrip-utils-h) } #+end_src * The rank mapping This section introduces the concept of rank mapping, which defines how slices will be allocated to every rank. #+begin_src c++ :tangle (atrip-rankmap-h) #pragma once #include #include #include #include namespace atrip { template struct RankMap { static bool RANK_ROUND_ROBIN; std::vector const lengths; size_t const np, size; ClusterInfo const clusterInfo; RankMap(std::vector lens, size_t np_, MPI_Comm comm) : lengths(lens) , np(np_) , size(std::accumulate(lengths.begin(), lengths.end(), 1UL, std::multiplies())) , clusterInfo(getClusterInfo(comm)) { assert(lengths.size() <= 2); } size_t find(typename Slice::Location const& p) const noexcept { if (RANK_ROUND_ROBIN) { return p.source * np + p.rank; } else { const size_t rankPosition = p.source * clusterInfo.ranksPerNode + clusterInfo.rankInfos[p.rank].localRank ; return rankPosition * clusterInfo.nNodes + clusterInfo.rankInfos[p.rank].nodeId ; } } size_t nSources() const noexcept { return size / np + size_t(size % np != 0); } bool isPaddingRank(size_t rank) const noexcept { return size % np == 0 ? false : rank > (size % np - 1) ; } bool isSourcePadding(const size_t rank, const size_t source) const noexcept { return source == nSources() && isPaddingRank(rank); } typename Slice::Location find(ABCTuple const& abc, typename Slice::Type sliceType) const { // tuple = {11, 8} when abc = {11, 8, 9} and sliceType = AB // tuple = {11, 0} when abc = {11, 8, 9} and sliceType = A const auto tuple = Slice::subtupleBySlice(abc, sliceType); const size_t index = tuple[0] + tuple[1] * (lengths.size() > 1 ? lengths[0] : 0) ; size_t rank, source; if (RANK_ROUND_ROBIN) { rank = index % np; source = index / np; } else { size_t const // the node that will be assigned to nodeId = index % clusterInfo.nNodes // how many times it has been assigned to the node , s_n = index / clusterInfo.nNodes // which local rank in the node should be , localRank = s_n % clusterInfo.ranksPerNode // and the local source (how many times we chose this local rank) , localSource = s_n / clusterInfo.ranksPerNode ; // find the localRank-th entry in clusterInfo auto const& it = std::find_if(clusterInfo.rankInfos.begin(), clusterInfo.rankInfos.end(), [nodeId, localRank](RankInfo const& ri) { return ri.nodeId == nodeId && ri.localRank == localRank ; }); if (it == clusterInfo.rankInfos.end()) { throw "FATAL! Error in node distribution of the slices"; } rank = (*it).globalRank; source = localSource; } return { rank , source }; } }; } #+end_src * The slice union ** Prolog :noexport: #+begin_src c++ :tangle (atrip-slice-union-h) #pragma once #include #include #include namespace atrip { #+end_src #+begin_src c++ :tangle (atrip-slice-union-h) template class SliceUnion { public: using Tensor = CTF::Tensor; virtual void sliceIntoBuffer(size_t iteration, Tensor &to, Tensor const& from) = 0; /* ,* This function should enforce an important property of a SliceUnion. ,* Namely, there can be no two Slices of the same nature. ,* ,* This means that there can be at most one slice with a given Ty_x_Tu. ,*/ void checkForDuplicates() const { std::vector::Ty_x_Tu> tytus; for (auto const& s: slices) { if (s.isFree()) continue; tytus.push_back({s.info.type, s.info.tuple}); } for (auto const& tytu: tytus) { if (std::count(tytus.begin(), tytus.end(), tytu) > 1) throw "Invariance violated, more than one slice with same Ty_x_Tu"; } } std::vector::Ty_x_Tu> neededSlices(ABCTuple const& abc) { std::vector::Ty_x_Tu> needed(sliceTypes.size()); // build the needed vector std::transform(sliceTypes.begin(), sliceTypes.end(), needed.begin(), [&abc](typename Slice::Type const type) { auto tuple = Slice::subtupleBySlice(abc, type); return std::make_pair(type, tuple); }); return needed; } /* buildLocalDatabase * * It should build a database of slices so that we know what is needed * to fetch in the next iteration represented by the tuple 'abc'. * * 1. The algorithm works as follows, we build a database of the all * the slice types that we need together with their tuple. * * 2. Look in the SliceUnion if we already have this tuple, * if we already have it mark it (TODO) * * 3. If we don't have the tuple, look for a (state=acceptor, type=blank) * slice and mark this slice as type=Fetch with the corresponding type * and tuple. * * NOTE: The algorithm should certify that we always have enough blank * slices. * */ typename Slice::LocalDatabase buildLocalDatabase(ABCTuple const& abc) { typename Slice::LocalDatabase result; auto const needed = neededSlices(abc); WITH_RANK << "__db__:needed:" << pretty_print(needed) << "\n"; // BUILD THE DATABASE // we need to loop over all sliceTypes that this TensorUnion // is representing and find out how we will get the corresponding // slice for the abc we are considering right now. for (auto const& pair: needed) { auto const type = pair.first; auto const tuple = pair.second; auto const from = rankMap.find(abc, type); #ifdef HAVE_OCD WITH_RANK << "__db__:want:" << pretty_print(pair) << "\n"; for (auto const& s: slices) WITH_RANK << "__db__:guts:ocd " << s.info << " pt " << s.data << "\n"; #endif #ifdef HAVE_OCD WITH_RANK << "__db__: checking if exact match" << "\n"; #endif { // FIRST: look up if there is already a *Ready* slice matching what we // need auto const& it = std::find_if(slices.begin(), slices.end(), [&tuple, &type](Slice const& other) { return other.info.tuple == tuple && other.info.type == type // we only want another slice when it // has already ready-to-use data && other.isUnwrappable() ; }); if (it != slices.end()) { // if we find this slice, it means that we don't have to do anything WITH_RANK << "__db__: EXACT: found EXACT in name=" << name << " for tuple " << tuple[0] << ", " << tuple[1] << " ptr " << it->data << "\n"; result.push_back({name, it->info}); continue; } } #ifdef HAVE_OCD WITH_RANK << "__db__: checking if recycle" << "\n"; #endif // Try to find a recyling possibility ie. find a slice with the same // tuple and that has a valid data pointer. auto const& recycleIt = std::find_if(slices.begin(), slices.end(), [&tuple, &type](Slice const& other) { return other.info.tuple == tuple && other.info.type != type && other.isRecyclable() ; }); // if we find this recylce, then we find a Blank slice // (which should exist by construction :THINK) // if (recycleIt != slices.end()) { auto& blank = Slice::findOneByType(slices, Slice::Blank); // TODO: formalize this through a method to copy information // from another slice blank.data = recycleIt->data; blank.info.type = type; blank.info.tuple = tuple; blank.info.state = Slice::Recycled; blank.info.from = from; blank.info.recycling = recycleIt->info.type; result.push_back({name, blank.info}); WITH_RANK << "__db__: RECYCLING: n" << name << " " << pretty_print(abc) << " get " << pretty_print(blank.info) << " from " << pretty_print(recycleIt->info) << " ptr " << recycleIt->data << "\n" ; continue; } // in this case we have to create a new slice // this means that we should have a blank slice at our disposal // and also the freePointers should have some elements inside, // so we pop a data pointer from the freePointers container #ifdef HAVE_OCD WITH_RANK << "__db__: none work, doing new" << "\n"; #endif { WITH_RANK << "__db__: NEW: finding blank in " << name << " for type " << type << " for tuple " << tuple[0] << ", " << tuple[1] << "\n" ; auto& blank = Slice::findOneByType(slices, Slice::Blank); blank.info.type = type; blank.info.tuple = tuple; blank.info.from = from; // Handle self sufficiency blank.info.state = Atrip::rank == from.rank ? Slice::SelfSufficient : Slice::Fetch ; if (blank.info.state == Slice::SelfSufficient) { #if defined(HAVE_CUDA) blank.mpi_data = sources[from.source].data(); #else blank.data = sources[from.source].data(); #endif } else { if (freePointers.size() == 0) { std::stringstream stream; stream << "No more free pointers " << "for type " << type << " and name " << name ; throw std::domain_error(stream.str()); } auto dataPointer = freePointers.begin(); freePointers.erase(dataPointer); blank.data = *dataPointer; } result.push_back({name, blank.info}); continue; } } #ifdef HAVE_OCD for (auto const& s: slices) WITH_RANK << "__db__:guts:ocd:__end__ " << s.info << "\n"; #endif return result; } /* * Garbage collect slices not needed for the next iteration. * * It will throw if it tries to gc a slice that has not been * previously unwrapped, as a safety mechanism. */ void clearUnusedSlicesForNext(ABCTuple const& abc) { auto const needed = neededSlices(abc); // CLEAN UP SLICES, FREE THE ONES THAT ARE NOT NEEDED ANYMORE for (auto& slice: slices) { // if the slice is free, then it was not used anyways if (slice.isFree()) continue; // try to find the slice in the needed slices list auto const found = std::find_if(needed.begin(), needed.end(), [&slice] (typename Slice::Ty_x_Tu const& tytu) { return slice.info.tuple == tytu.second && slice.info.type == tytu.first ; }); // if we did not find slice in needed, then erase it if (found == needed.end()) { // We have to be careful about the data pointer, // for SelfSufficient, the data pointer is a source pointer // of the slice, so we should just wipe it. // // For Ready slices, we have to be careful if there are some // recycled slices depending on it. bool freeSlicePointer = true; // allow to gc unwrapped and recycled, never Fetch, // if we have a Fetch slice then something has gone very wrong. if (!slice.isUnwrapped() && slice.info.state != Slice::Recycled) throw std::domain_error("Trying to garbage collect " " a non-unwrapped slice! " + pretty_print(&slice) + pretty_print(slice.info)); // it can be that our slice is ready, but it has some hanging // references lying around in the form of a recycled slice. // Of course if we need the recycled slice the next iteration // this would be fatal, because we would then free the pointer // of the slice and at some point in the future we would // overwrite it. Therefore, we must check if slice has some // references in slices and if so then // // - we should mark those references as the original (since the data // pointer should be the same) // // - we should make sure that the data pointer of slice // does not get freed. // if (slice.info.state == Slice::Ready) { WITH_OCD WITH_RANK << "__gc__:" << "checking for data recycled dependencies\n"; auto recycled = Slice::hasRecycledReferencingToIt(slices, slice.info); if (recycled.size()) { Slice* newReady = recycled[0]; WITH_OCD WITH_RANK << "__gc__:" << "swaping recycled " << pretty_print(newReady->info) << " and " << pretty_print(slice.info) << "\n"; newReady->markReady(); assert(newReady->data == slice.data); freeSlicePointer = false; for (size_t i = 1; i < recycled.size(); i++) { auto newRecyled = recycled[i]; newRecyled->info.recycling = newReady->info.type; WITH_OCD WITH_RANK << "__gc__:" << "updating recycled " << pretty_print(newRecyled->info) << "\n"; } } } // if the slice is self sufficient, do not dare touching the // pointer, since it is a pointer to our sources in our rank. if ( slice.info.state == Slice::SelfSufficient || slice.info.state == Slice::Recycled ) { freeSlicePointer = false; } // make sure we get its data pointer to be used later // only for non-recycled, since it can be that we need // for next iteration the data of the slice that the recycled points // to if (freeSlicePointer) { freePointers.insert(slice.data); WITH_RANK << "~~~:cl(" << name << ")" << " added to freePointer " << pretty_print(freePointers) << "\n"; } else { WITH_OCD WITH_RANK << "__gc__:not touching the free Pointer\n"; } // at this point, let us blank the slice WITH_RANK << "~~~:cl(" << name << ")" << " freeing up slice " // TODO: make this possible because of Templates // TODO: there is a deduction error here // << " info " << slice.info << "\n"; slice.free(); } } } // CONSTRUCTOR SliceUnion( std::vector::Type> sliceTypes_ , std::vector sliceLength_ , std::vector paramLength , size_t np , MPI_Comm child_world , MPI_Comm global_world , typename Slice::Name name_ , size_t nSliceBuffers = 4 ) : rankMap(paramLength, np, global_world) , world(child_world) , universe(global_world) , sliceLength(sliceLength_) , sources(rankMap.nSources(), std::vector (std::accumulate(sliceLength.begin(), sliceLength.end(), 1UL, std::multiplies()))) , name(name_) , sliceTypes(sliceTypes_) , sliceBuffers(nSliceBuffers) //, slices(2 * sliceTypes.size(), Slice{ sources[0].size() }) { // constructor begin LOG(0,"Atrip") << "INIT SliceUnion: " << name << "\n"; for (auto& ptr: sliceBuffers) #if defined(HAVE_CUDA) cuMemAlloc(&ptr, sizeof(F) * sources[0].size()); #else ptr = (DataPtr)malloc(sizeof(F) * sources[0].size()); #endif slices = std::vector>(2 * sliceTypes.size(), { sources[0].size() }); // TODO: think exactly ^------------------- about this number // initialize the freePointers with the pointers to the buffers std::transform(sliceBuffers.begin(), sliceBuffers.end(), std::inserter(freePointers, freePointers.begin()), [](DataPtr ptr) { return ptr; }); LOG(1,"Atrip") << "rankMap.nSources " << rankMap.nSources() << "\n"; LOG(1,"Atrip") << "#slices " << slices.size() << "\n"; LOG(1,"Atrip") << "#slices[0] " << slices[0].size << "\n"; LOG(1,"Atrip") << "#sources " << sources.size() << "\n"; LOG(1,"Atrip") << "#sources[0] " << sources[0].size() << "\n"; LOG(1,"Atrip") << "#freePointers " << freePointers.size() << "\n"; LOG(1,"Atrip") << "#sliceBuffers " << sliceBuffers.size() << "\n"; LOG(1,"Atrip") << "#sliceLength " << sliceLength.size() << "\n"; LOG(1,"Atrip") << "#paramLength " << paramLength.size() << "\n"; LOG(1,"Atrip") << "GB*" << np << " " << double(sources.size() + sliceBuffers.size()) ,* sources[0].size() ,* 8 * np / 1073741824.0 << "\n"; } // constructor ends void init(Tensor const& sourceTensor) { CTF::World w(world); const int rank = Atrip::rank , order = sliceLength.size() ; std::vector const syms(order, NS); std::vector __sliceLength(sliceLength.begin(), sliceLength.end()); Tensor toSliceInto(order, __sliceLength.data(), syms.data(), w); LOG(1,"Atrip") << "slicing... \n"; // setUp sources for (size_t it(0); it < rankMap.nSources(); ++it) { const size_t source = rankMap.isSourcePadding(rank, source) ? 0 : it; WITH_OCD WITH_RANK << "Init:toSliceInto it-" << it << " :: source " << source << "\n"; sliceIntoBuffer(source, toSliceInto, sourceTensor); } } /** ,* \brief Send asynchronously only if the state is Fetch ,*/ void send( size_t otherRank , typename Slice::LocalDatabaseElement const& el , size_t tag) const noexcept { MPI_Request request; bool sendData_p = false; auto const& info = el.info; if (info.state == Slice::Fetch) sendData_p = true; // TODO: remove this because I have SelfSufficient if (otherRank == info.from.rank) sendData_p = false; if (!sendData_p) return; MPI_Isend( sources[info.from.source].data() , sources[info.from.source].size() , traits::mpi::datatypeOf() , otherRank , tag , universe , &request ); WITH_CRAZY_DEBUG WITH_RANK << "sent to " << otherRank << "\n"; } /** * \brief Receive asynchronously only if the state is Fetch */ void receive(typename Slice::Info const& info, size_t tag) noexcept { auto& slice = Slice::findByInfo(slices, info); if (Atrip::rank == info.from.rank) return; if (slice.info.state == Slice::Fetch) { // TODO: do it through the slice class slice.info.state = Slice::Dispatched; #if defined(HAVE_CUDA) slice.mpi_data = (F*)malloc(sizeof(F) * slice.size); MPI_Irecv( slice.mpi_data #else MPI_Irecv( slice.data #endif , slice.size , traits::mpi::datatypeOf() , info.from.rank , tag , universe , &slice.request //, MPI_STATUS_IGNORE ); } } void unwrapAll(ABCTuple const& abc) { for (auto type: sliceTypes) unwrapSlice(type, abc); } DataPtr unwrapSlice(typename Slice::Type type, ABCTuple const& abc) { WITH_CRAZY_DEBUG WITH_RANK << "__unwrap__:slice " << type << " w n " << name << " abc" << pretty_print(abc) << "\n"; auto& slice = Slice::findByTypeAbc(slices, type, abc); //WITH_RANK << "__unwrap__:info " << slice.info << "\n"; switch (slice.info.state) { case Slice::Dispatched: WITH_RANK << "__unwrap__:Fetch: " << &slice << " info " << pretty_print(slice.info) << "\n"; slice.unwrapAndMarkReady(); return slice.data; break; case Slice::SelfSufficient: WITH_RANK << "__unwrap__:SelfSufficient: " << &slice << " info " << pretty_print(slice.info) << "\n"; return slice.data; break; case Slice::Ready: WITH_RANK << "__unwrap__:READY: UNWRAPPED ALREADY" << &slice << " info " << pretty_print(slice.info) << "\n"; return slice.data; break; case Slice::Recycled: WITH_RANK << "__unwrap__:RECYCLED " << &slice << " info " << pretty_print(slice.info) << "\n"; return unwrapSlice(slice.info.recycling, abc); break; case Slice::Fetch: case Slice::Acceptor: throw std::domain_error("Can't unwrap an acceptor or fetch slice!"); break; default: throw std::domain_error("Unknown error unwrapping slice!"); } return slice.data; } ~SliceUnion() { for (auto& ptr: sliceBuffers) #if defined(HAVE_CUDA) cuMemFree(ptr); #else std::free(ptr); #endif } const RankMap rankMap; const MPI_Comm world; const MPI_Comm universe; const std::vector sliceLength; std::vector< std::vector > sources; std::vector< Slice > slices; typename Slice::Name name; const std::vector::Type> sliceTypes; std::vector< DataPtr > sliceBuffers; std::set< DataPtr > freePointers; }; template SliceUnion& unionByName(std::vector*> const& unions, typename Slice::Name name) { const auto sliceUnionIt = std::find_if(unions.begin(), unions.end(), [&name](SliceUnion const* s) { return name == s->name; }); if (sliceUnionIt == unions.end()) { std::stringstream stream; stream << "SliceUnion(" << name << ") not found!"; throw std::domain_error(stream.str()); } return **sliceUnionIt; } #+end_src ** Epilog :noexport: #+begin_src c++ :tangle (atrip-slice-union-h) } #+end_src * Tuples This section introduces the types for tuples $ (a,b,c) $ as well as their distribution to nodes and cores. ** Prolog :noexport: #+begin_src c++ :tangle (atrip-tuples-h) #pragma once #include #include #include // TODO: remove some #include #include #include #include #include #include #include #include #include #include namespace atrip { #+end_src ** Tuples types The main tuple types are simple type aliases for finite-size arrays. A tuple is thus simply 3 natural numbers $ (a,b,c) $ whereas a partial tuple is a two dimensional subset of these three. #+begin_src c++ :tangle (atrip-tuples-h) using ABCTuple = std::array; using PartialTuple = std::array; using ABCTuples = std::vector; constexpr ABCTuple FAKE_TUPLE = {0, 0, 0}; constexpr ABCTuple INVALID_TUPLE = {1, 1, 1}; #+end_src ** Distributing the tuples In general it is our task to distribute all the tuples $ (a,b,c) $ among the ranks. Every distribution should make sure to allocate the same amount of tuples to every rank, padding the list with =FAKE_TUPLE= elements as necessary. The interface that we propose for this is simplye #+begin_src c++ :tangle (atrip-tuples-h) struct TuplesDistribution { virtual ABCTuples getTuples(size_t Nv, MPI_Comm universe) = 0; virtual bool tupleIsFake(ABCTuple const& t) { return t == FAKE_TUPLE; } }; #+end_src ** Node information - nodeList :: List of hostnames of size $ N_n $ - nodeInfos :: List of (hostname, local rank Id) of size $ N_p $, i.e., size of ranks where local rank id goes from 0 to 48. =getNodeNames= gets the names of the nodes used, i.e., the size of the resulting vector gives the number of nodes. #+begin_src c++ :tangle (atrip-tuples-h) std::vector getNodeNames(MPI_Comm comm){ int rank, np; MPI_Comm_rank(comm, &rank); MPI_Comm_size(comm, &np); std::vector nodeList(np); char nodeName[MPI_MAX_PROCESSOR_NAME]; char *nodeNames = (char*)malloc(np * MPI_MAX_PROCESSOR_NAME); std::vector nameLengths(np) , off(np) ; int nameLength; MPI_Get_processor_name(nodeName, &nameLength); MPI_Allgather(&nameLength, 1, MPI_INT, nameLengths.data(), 1, MPI_INT, comm); for (int i(1); i < np; i++) off[i] = off[i-1] + nameLengths[i-1]; MPI_Allgatherv(nodeName, nameLengths[rank], MPI_BYTE, nodeNames, nameLengths.data(), off.data(), MPI_BYTE, comm); for (int i(0); i < np; i++) { std::string const s(&nodeNames[off[i]], nameLengths[i]); nodeList[i] = s; } std::free(nodeNames); return nodeList; } #+end_src =getNodeInfos= #+begin_src c++ :tangle (atrip-tuples-h) struct RankInfo { const std::string name; const size_t nodeId; const size_t globalRank; const size_t localRank; const size_t ranksPerNode; }; template A unique(A const &xs) { auto result = xs; std::sort(std::begin(result), std::end(result)); auto const& last = std::unique(std::begin(result), std::end(result)); result.erase(last, std::end(result)); return result; } std::vector getNodeInfos(std::vector const& nodeNames) { std::vector result; auto const uniqueNames = unique(nodeNames); auto const index = [&uniqueNames](std::string const& s) { auto const& it = std::find(uniqueNames.begin(), uniqueNames.end(), s); return std::distance(uniqueNames.begin(), it); }; std::vector localRanks(uniqueNames.size(), 0); size_t globalRank = 0; for (auto const& name: nodeNames) { const size_t nodeId = index(name); result.push_back({name, nodeId, globalRank++, localRanks[nodeId]++, (size_t) std::count(nodeNames.begin(), nodeNames.end(), name) }); } return result; } struct ClusterInfo { const size_t nNodes, np, ranksPerNode; const std::vector rankInfos; }; ClusterInfo getClusterInfo(MPI_Comm comm) { auto const names = getNodeNames(comm); auto const rankInfos = getNodeInfos(names); return ClusterInfo { unique(names).size(), names.size(), rankInfos[0].ranksPerNode, rankInfos }; } #+end_src ** Naive list The naive implementation of the global tuples list is simple three for loops creating tuples of the sort $ (a,b,c) $ where the following conditions are met at the same time: - $ a \leq b \leq c $ - $ a \neq b \land b \neq c $ This means, $ (1, 2, 3) , (1, 1, 3) , (1, 2, 2) $ are acceptable tuples wherease $ (2, 1, 1) $ and $ (1, 1, 1) $ are not. #+begin_src c++ :tangle (atrip-tuples-h) ABCTuples getTuplesList(size_t Nv, size_t rank, size_t np) { const size_t // total number of tuples for the problem n = Nv * (Nv + 1) * (Nv + 2) / 6 - Nv // all ranks should have the same number of tuples_per_rank , tuples_per_rank = n / np + size_t(n % np != 0) // start index for the global tuples list , start = tuples_per_rank * rank // end index for the global tuples list , end = tuples_per_rank * (rank + 1) ; LOG(1,"Atrip") << "tuples_per_rank = " << tuples_per_rank << "\n"; WITH_RANK << "start, end = " << start << ", " << end << "\n"; ABCTuples result(tuples_per_rank, FAKE_TUPLE); for (size_t a(0), r(0), g(0); a < Nv; a++) for (size_t b(a); b < Nv; b++) for (size_t c(b); c < Nv; c++){ if ( a == b && b == c ) continue; if ( start <= g && g < end) result[r++] = {a, b, c}; g++; } return result; } #+end_src and all tuples would simply be #+begin_src c++ :tangle (atrip-tuples-h) ABCTuples getAllTuplesList(const size_t Nv) { const size_t n = Nv * (Nv + 1) * (Nv + 2) / 6 - Nv; ABCTuples result(n); for (size_t a(0), u(0); a < Nv; a++) for (size_t b(a); b < Nv; b++) for (size_t c(b); c < Nv; c++){ if ( a == b && b == c ) continue; result[u++] = {a, b, c}; } return result; } #+end_src With =getTupleList= we can easily define a tuple distribution like #+begin_src c++ :tangle (atrip-tuples-h) struct NaiveDistribution : public TuplesDistribution { ABCTuples getTuples(size_t Nv, MPI_Comm universe) override { int rank, np; MPI_Comm_rank(universe, &rank); MPI_Comm_size(universe, &np); return getTuplesList(Nv, (size_t)rank, (size_t)np); } }; #+end_src ** Group and sort list *** Prolog :noexport: #+begin_src c++ :tangle (atrip-tuples-h) namespace group_and_sort { #+end_src *** Utils #+begin_src c++ :tangle (atrip-tuples-h) // Provides the node on which the slice-element is found // Right now we distribute the slices in a round robin fashion // over the different nodes (NOTE: not mpi ranks but nodes) inline size_t isOnNode(size_t tuple, size_t nNodes) { return tuple % nNodes; } // return the node (or all nodes) where the elements of this // tuple are located std::vector getTupleNodes(ABCTuple const& t, size_t nNodes) { std::vector nTuple = { isOnNode(t[0], nNodes) , isOnNode(t[1], nNodes) , isOnNode(t[2], nNodes) }; return unique(nTuple); } struct Info { size_t nNodes; size_t nodeId; }; #+end_src *** Distribution wording: home element = element which is located on the given node 1. we distribute the tuples such that each tuple has at least one 'home element' 2. we sort each tuple in a way that the 'home element' are the fastest indices 3. we sort the list of tuples on every node 4. we resort the tuples that for every tuple abc the following holds: a container1d(nNodes) , container2d(nNodes * nNodes) , container3d(nNodes * nNodes * nNodes) ; WITH_DBG if (info.nodeId == 0) std::cout << "\tGoing through all " << allTuples.size() << " tuples in " << nNodes << " nodes\n"; // build container-n-d's for (auto const& t: allTuples) { // one which node(s) are the tuple elements located... // put them into the right container auto const _nodes = getTupleNodes(t, nNodes); switch (_nodes.size()) { case 1: container1d[_nodes[0]].push_back(t); break; case 2: container2d[ _nodes[0] + _nodes[1] * nNodes ].push_back(t); break; case 3: container3d[ _nodes[0] + _nodes[1] * nNodes + _nodes[2] * nNodes * nNodes ].push_back(t); break; } } WITH_DBG if (info.nodeId == 0) std::cout << "\tBuilding 1-d containers\n"; // DISTRIBUTE 1-d containers // every tuple which is only located at one node belongs to this node { auto const& _tuples = container1d[info.nodeId]; nodeTuples.resize(_tuples.size(), INVALID_TUPLE); std::copy(_tuples.begin(), _tuples.end(), nodeTuples.begin()); } WITH_DBG if (info.nodeId == 0) std::cout << "\tBuilding 2-d containers\n"; // DISTRIBUTE 2-d containers //the tuples which are located at two nodes are half/half given to these nodes for (size_t yx = 0; yx < container2d.size(); yx++) { auto const& _tuples = container2d[yx]; const size_t idx = yx % nNodes // remeber: yx = idy * nNodes + idx , idy = yx / nNodes , n_half = _tuples.size() / 2 , size = nodeTuples.size() ; size_t nbeg, nend; if (info.nodeId == idx) { nbeg = 0 * n_half; nend = n_half; } else if (info.nodeId == idy) { nbeg = 1 * n_half; nend = _tuples.size(); } else { // either idx or idy is my node continue; } size_t const nextra = nend - nbeg; nodeTuples.resize(size + nextra, INVALID_TUPLE); std::copy(_tuples.begin() + nbeg, _tuples.begin() + nend, nodeTuples.begin() + size); } WITH_DBG if (info.nodeId == 0) std::cout << "\tBuilding 3-d containers\n"; // DISTRIBUTE 3-d containers for (size_t zyx = 0; zyx < container3d.size(); zyx++) { auto const& _tuples = container3d[zyx]; const size_t idx = zyx % nNodes , idy = (zyx / nNodes) % nNodes // remember: zyx = idx + idy * nNodes + idz * nNodes^2 , idz = zyx / nNodes / nNodes , n_third = _tuples.size() / 3 , size = nodeTuples.size() ; size_t nbeg, nend; if (info.nodeId == idx) { nbeg = 0 * n_third; nend = 1 * n_third; } else if (info.nodeId == idy) { nbeg = 1 * n_third; nend = 2 * n_third; } else if (info.nodeId == idz) { nbeg = 2 * n_third; nend = _tuples.size(); } else { // either idx or idy or idz is my node continue; } size_t const nextra = nend - nbeg; nodeTuples.resize(size + nextra, INVALID_TUPLE); std::copy(_tuples.begin() + nbeg, _tuples.begin() + nend, nodeTuples.begin() + size); } WITH_DBG if (info.nodeId == 0) std::cout << "\tswapping tuples...\n"; /* * sort part of group-and-sort algorithm * every tuple on a given node is sorted in a way that * the 'home elements' are the fastest index. * 1:yyy 2:yyn(x) 3:yny(x) 4:ynn(x) 5:nyy 6:nyn(x) 7:nny 8:nnn */ for (auto &nt: nodeTuples){ if ( isOnNode(nt[0], nNodes) == info.nodeId ){ // 1234 if ( isOnNode(nt[2], nNodes) != info.nodeId ){ // 24 size_t const x(nt[0]); nt[0] = nt[2]; // switch first and last nt[2] = x; } else if ( isOnNode(nt[1], nNodes) != info.nodeId){ // 3 size_t const x(nt[0]); nt[0] = nt[1]; // switch first two nt[1] = x; } } else { if ( isOnNode(nt[1], nNodes) == info.nodeId // 56 && isOnNode(nt[2], nNodes) != info.nodeId ) { // 6 size_t const x(nt[1]); nt[1] = nt[2]; // switch last two nt[2] = x; } } } WITH_DBG if (info.nodeId == 0) std::cout << "\tsorting list of tuples...\n"; //now we sort the list of tuples std::sort(nodeTuples.begin(), nodeTuples.end()); WITH_DBG if (info.nodeId == 0) std::cout << "\trestoring tuples...\n"; // we bring the tuples abc back in the order a 1 WITH_DBG if (info.nodeId == 0) std::cout << "checking for validity of " << nodeTuples.size() << std::endl; const bool anyInvalid = std::any_of(nodeTuples.begin(), nodeTuples.end(), [](ABCTuple const& t) { return t == INVALID_TUPLE; }); if (anyInvalid) throw "Some tuple is invalid in group-and-sort algorithm"; #endif WITH_DBG if (info.nodeId == 0) std::cout << "\treturning tuples...\n"; return nodeTuples; } #+end_src *** Main The main routine should return the list of tuples to be handled by the current rank. Let $ N_p $ be the number of ranks or processes. Let $ N_n $ be the number of nodes or sockets. Then we have the following #+begin_example Global rank | 0 1 2 3 4 5 6 7 8 key | global rank nodeId | 0 1 0 1 1 0 2 2 2 Local rank | 0 0 1 1 2 2 0 1 2 intra color | 0 1 0 1 1 0 2 2 2 #+end_example #+begin_src c++ :tangle (atrip-tuples-h) std::vector main(MPI_Comm universe, size_t Nv) { int rank, np; MPI_Comm_rank(universe, &rank); MPI_Comm_size(universe, &np); std::vector result; auto const nodeNames(getNodeNames(universe)); size_t const nNodes = unique(nodeNames).size(); auto const nodeInfos = getNodeInfos(nodeNames); // We want to construct a communicator which only contains of one // element per node bool const computeDistribution = nodeInfos[rank].localRank == 0; std::vector nodeTuples = computeDistribution ? specialDistribution(Info{nNodes, nodeInfos[rank].nodeId}, getAllTuplesList(Nv)) : std::vector() ; LOG(1,"Atrip") << "got nodeTuples\n"; // now we have to send the data from **one** rank on each node // to all others ranks of this node const int color = nodeInfos[rank].nodeId , key = nodeInfos[rank].localRank ; MPI_Comm INTRA_COMM; MPI_Comm_split(universe, color, key, &INTRA_COMM); #+end_src Every node has to distribute **at least** =nodeTuples.size() / nodeInfos[rank].ranksPerNode= tuples among the ranks. We have to communicate this quantity among all nodes. #+begin_src c++ :tangle (atrip-tuples-h) size_t const tuplesPerRankLocal = nodeTuples.size() / nodeInfos[rank].ranksPerNode + size_t(nodeTuples.size() % nodeInfos[rank].ranksPerNode != 0) ; size_t tuplesPerRankGlobal; MPI_Reduce(&tuplesPerRankLocal, &tuplesPerRankGlobal, 1, MPI_UINT64_T, MPI_MAX, 0, universe); MPI_Bcast(&tuplesPerRankGlobal, 1, MPI_UINT64_T, 0, universe); LOG(1,"Atrip") << "Tuples per rank: " << tuplesPerRankGlobal << "\n"; LOG(1,"Atrip") << "ranks per node " << nodeInfos[rank].ranksPerNode << "\n"; LOG(1,"Atrip") << "#nodes " << nNodes << "\n"; #+end_src Now we have the tuples that every rank has to have, i.e., =tuplesPerRankGlobal=. However before this, the tuples in =nodeTuples= now have to be sent from the local rank in every node to all the ranks in the given node, and we have to make sure that every rank inside a given node gets the same amount of tuples, in this case it should be =tuplesPerRankLocal=, and in our node the total number of tuples should be =tuplesPerRankLocal * nodeInfos[rank].ranksPerNode=, however this might not be the case up to now due to divisibility issues. Up to now we have exactly =nodeTuples.size()= tuples, we have to make sure by resizing that the condition above is met, i.e., so we can resize and add some fake tuples at the end as padding. #+begin_src c++ :tangle (atrip-tuples-h) size_t const totalTuples = tuplesPerRankGlobal * nodeInfos[rank].ranksPerNode; if (computeDistribution) { // pad with FAKE_TUPLEs nodeTuples.insert(nodeTuples.end(), totalTuples - nodeTuples.size(), FAKE_TUPLE); } #+end_src And now we can simply scatter the tuples in nodeTuples and send =tuplesPerRankGlobal= to the different ranks in the node, #+begin_src c++ :tangle (atrip-tuples-h) { // construct mpi type for abctuple MPI_Datatype MPI_ABCTUPLE; MPI_Type_vector(nodeTuples[0].size(), 1, 1, MPI_UINT64_T, &MPI_ABCTUPLE); MPI_Type_commit(&MPI_ABCTUPLE); LOG(1,"Atrip") << "scattering tuples \n"; result.resize(tuplesPerRankGlobal); MPI_Scatter(nodeTuples.data(), tuplesPerRankGlobal, MPI_ABCTUPLE, result.data(), tuplesPerRankGlobal, MPI_ABCTUPLE, 0, INTRA_COMM); MPI_Type_free(&MPI_ABCTUPLE); } #+end_src The next step is sending the tuples in the local root rank to the other ranks in the node, this we do with the MPI function =MPI_Scatterv=. Every rank gets =tuplesPerRankLocal= tuples and the =nodeTuples= vector is now homogeneous and divisible by the number of ranks per node in our node. Therefore, the =displacements= are simply the vector $$ \left\{ k * \mathrm{tuplesPerNodeLocal} \mid k \in \left\{ 0 , \ldots , \#\text{ranks in node} - 1 \right\} \right\} $$ and the =sendCounts= vector is simply the constant vector =tuplesPerRankLocal= of size =ranksPerNode=. #+begin_src c++ :tangle (atrip-tuples-h) return result; } #+end_src *** Interface The distribution interface will then simply be #+begin_src c++ :tangle (atrip-tuples-h) struct Distribution : public TuplesDistribution { ABCTuples getTuples(size_t Nv, MPI_Comm universe) override { return main(universe, Nv); } }; #+end_src *** Epilog :noexport: #+begin_src c++ :tangle (atrip-tuples-h) } // namespace group_and_sort #+end_src ** Epilog :noexport: #+begin_src c++ :tangle (atrip-tuples-h) } #+end_src * Unions Every slice pertaining to every different tensor is sliced differently. #+begin_src c++ :tangle (atrip-unions-h) #pragma once #include namespace atrip { template void sliceIntoVector ( std::vector &v , CTF::Tensor &toSlice , std::vector const low , std::vector const up , CTF::Tensor const& origin , std::vector const originLow , std::vector const originUp ) { // Thank you CTF for forcing me to do this struct { std::vector up, low; } toSlice_ = { {up.begin(), up.end()} , {low.begin(), low.end()} } , origin_ = { {originUp.begin(), originUp.end()} , {originLow.begin(), originLow.end()} } ; WITH_OCD WITH_RANK << "slicing into " << pretty_print(toSlice_.up) << "," << pretty_print(toSlice_.low) << " from " << pretty_print(origin_.up) << "," << pretty_print(origin_.low) << "\n"; #ifndef ATRIP_DONT_SLICE toSlice.slice( toSlice_.low.data() , toSlice_.up.data() , 0.0 , origin , origin_.low.data() , origin_.up.data() , 1.0); memcpy(v.data(), toSlice.data, sizeof(F) * v.size()); #endif } template struct TAPHH : public SliceUnion { TAPHH( CTF::Tensor const& sourceTensor , size_t No , size_t Nv , size_t np , MPI_Comm child_world , MPI_Comm global_world ) : SliceUnion( {Slice::A, Slice::B, Slice::C} , {Nv, No, No} // size of the slices , {Nv} , np , child_world , global_world , Slice::TA , 6) { this->init(sourceTensor); } void sliceIntoBuffer(size_t it, CTF::Tensor &to, CTF::Tensor const& from) override { const int Nv = this->sliceLength[0] , No = this->sliceLength[1] , a = this->rankMap.find({static_cast(Atrip::rank), it}); ; sliceIntoVector( this->sources[it] , to, {0, 0, 0}, {Nv, No, No} , from, {a, 0, 0, 0}, {a+1, Nv, No, No} ); } }; template struct HHHA : public SliceUnion { HHHA( CTF::Tensor const& sourceTensor , size_t No , size_t Nv , size_t np , MPI_Comm child_world , MPI_Comm global_world ) : SliceUnion( {Slice::A, Slice::B, Slice::C} , {No, No, No} // size of the slices , {Nv} // size of the parametrization , np , child_world , global_world , Slice::VIJKA , 6) { this->init(sourceTensor); } void sliceIntoBuffer(size_t it, CTF::Tensor &to, CTF::Tensor const& from) override { const int No = this->sliceLength[0] , a = this->rankMap.find({static_cast(Atrip::rank), it}) ; sliceIntoVector( this->sources[it] , to, {0, 0, 0}, {No, No, No} , from, {0, 0, 0, a}, {No, No, No, a+1} ); } }; template struct ABPH : public SliceUnion { ABPH( CTF::Tensor const& sourceTensor , size_t No , size_t Nv , size_t np , MPI_Comm child_world , MPI_Comm global_world ) : SliceUnion( { Slice::AB, Slice::BC, Slice::AC , Slice::BA, Slice::CB, Slice::CA } , {Nv, No} // size of the slices , {Nv, Nv} // size of the parametrization , np , child_world , global_world , Slice::VABCI , 2*6) { this->init(sourceTensor); } void sliceIntoBuffer(size_t it, CTF::Tensor &to, CTF::Tensor const& from) override { const int Nv = this->sliceLength[0] , No = this->sliceLength[1] , el = this->rankMap.find({static_cast(Atrip::rank), it}) , a = el % Nv , b = el / Nv ; sliceIntoVector( this->sources[it] , to, {0, 0}, {Nv, No} , from, {a, b, 0, 0}, {a+1, b+1, Nv, No} ); } }; template struct ABHH : public SliceUnion { ABHH( CTF::Tensor const& sourceTensor , size_t No , size_t Nv , size_t np , MPI_Comm child_world , MPI_Comm global_world ) : SliceUnion( {Slice::AB, Slice::BC, Slice::AC} , {No, No} // size of the slices , {Nv, Nv} // size of the parametrization , np , child_world , global_world , Slice::VABIJ , 6) { this->init(sourceTensor); } void sliceIntoBuffer(size_t it, CTF::Tensor &to, CTF::Tensor const& from) override { const int Nv = from.lens[0] , No = this->sliceLength[1] , el = this->rankMap.find({static_cast(Atrip::rank), it}) , a = el % Nv , b = el / Nv ; sliceIntoVector( this->sources[it] , to, {0, 0}, {No, No} , from, {a, b, 0, 0}, {a+1, b+1, No, No} ); } }; template struct TABHH : public SliceUnion { TABHH( CTF::Tensor const& sourceTensor , size_t No , size_t Nv , size_t np , MPI_Comm child_world , MPI_Comm global_world ) : SliceUnion( {Slice::AB, Slice::BC, Slice::AC} , {No, No} // size of the slices , {Nv, Nv} // size of the parametrization , np , child_world , global_world , Slice::TABIJ , 6) { this->init(sourceTensor); } void sliceIntoBuffer(size_t it, CTF::Tensor &to, CTF::Tensor const& from) override { // TODO: maybe generalize this with ABHH const int Nv = from.lens[0] , No = this->sliceLength[1] , el = this->rankMap.find({static_cast(Atrip::rank), it}) , a = el % Nv , b = el / Nv ; sliceIntoVector( this->sources[it] , to, {0, 0}, {No, No} , from, {a, b, 0, 0}, {a+1, b+1, No, No} ); } }; } #+end_src * Equations This section presents on of the main physical contents of the program, the equations used. The equations are of two types, energy equations and intermediate tensor contractions. ** Prolog :noexport: #+begin_src c++ :tangle (atrip-equations-h) #pragma once #include #include #include #if defined(HAVE_CUDA) #include #endif namespace atrip { using ABCTuple = std::array; using PartialTuple = std::array; using ABCTuples = std::vector; #+end_src #+begin_src c++ :tangle (atrip-equations-cxx) #include #if defined(HAVE_CUDA) #include #endif namespace atrip { #+end_src ** Energy For the energy we have two types of functions, a function for when the tuples $ (a,b,c) $ are distinct and where they are the same, they are accordingly called: #+begin_src c++ :tangle (atrip-equations-h) template double getEnergyDistinct ( F const epsabc , size_t const No , F* const epsi , F* const Tijk , F* const Zijk ); template double getEnergySame ( F const epsabc , size_t const No , F* const epsi , F* const Tijk , F* const Zijk ); #+end_src Their implementations follow, and they are written in a cache-friendly style for CPU architectures through the =blockSize= variable. #+begin_src c++ :tangle (atrip-equations-cxx) template double getEnergyDistinct ( F const epsabc , size_t const No , F* const epsi , F* const Tijk , F* const Zijk ) { constexpr size_t blockSize=16; F energy(0.); for (size_t kk=0; kk k ? jj : k; for (size_t j(jstart); j < jend; j++){ F const ej(epsi[j]); F const facjk = j == k ? F(0.5) : F(1.0); size_t istart = ii > j ? ii : j; for (size_t i(istart); i < iend; i++){ const F ei(epsi[i]) , facij = i == j ? F(0.5) : F(1.0) , denominator(epsabc - ei - ej - ek) , U(Zijk[i + No*j + No*No*k]) , V(Zijk[i + No*k + No*No*j]) , W(Zijk[j + No*i + No*No*k]) , X(Zijk[j + No*k + No*No*i]) , Y(Zijk[k + No*i + No*No*j]) , Z(Zijk[k + No*j + No*No*i]) , A(maybeConjugate(Tijk[i + No*j + No*No*k])) , B(maybeConjugate(Tijk[i + No*k + No*No*j])) , C(maybeConjugate(Tijk[j + No*i + No*No*k])) , D(maybeConjugate(Tijk[j + No*k + No*No*i])) , E(maybeConjugate(Tijk[k + No*i + No*No*j])) , _F(maybeConjugate(Tijk[k + No*j + No*No*i])) , value = 3.0 * ( A * U + B * V + C * W + D * X + E * Y + _F * Z ) + ( ( U + X + Y ) - 2.0 * ( V + W + Z ) ) * ( A + D + E ) + ( ( V + W + Z ) - 2.0 * ( U + X + Y ) ) * ( B + C + _F ) ; energy += 2.0 * value / denominator * facjk * facij; } // i } // j } // k } // ii } // jj } // kk return std::real(energy); } template double getEnergySame ( F const epsabc , size_t const No , F* const epsi , F* const Tijk , F* const Zijk ) { constexpr size_t blockSize = 16; F energy = F(0.); for (size_t kk=0; kk k ? jj : k; for(size_t j(jstart); j < jend; j++){ const F facjk( j == k ? F(0.5) : F(1.0)); const F ej(epsi[j]); const size_t istart = ii > j ? ii : j; for(size_t i(istart); i < iend; i++){ const F ei(epsi[i]) , facij ( i==j ? F(0.5) : F(1.0)) , denominator(epsabc - ei - ej - ek) , U(Zijk[i + No*j + No*No*k]) , V(Zijk[j + No*k + No*No*i]) , W(Zijk[k + No*i + No*No*j]) , A(maybeConjugate(Tijk[i + No*j + No*No*k])) , B(maybeConjugate(Tijk[j + No*k + No*No*i])) , C(maybeConjugate(Tijk[k + No*i + No*No*j])) , value = F(3.0) * ( A * U + B * V + C * W ) - ( A + B + C ) * ( U + V + W ) ; energy += F(2.0) * value / denominator * facjk * facij; } // i } // j } // k } // ii } // jj } // kk return std::real(energy); } #+end_src And we explicitly instantiate these templated functions: #+begin_src c++ :tangle (atrip-equations-cxx) // instantiate double template double getEnergyDistinct ( double const epsabc , size_t const No , double* const epsi , double* const Tijk , double* const Zijk ); template double getEnergySame ( double const epsabc , size_t const No , double* const epsi , double* const Tijk , double* const Zijk ); // instantiate Complex template double getEnergyDistinct ( Complex const epsabc , size_t const No , Complex* const epsi , Complex* const Tijk , Complex* const Zijk ); template double getEnergySame ( Complex const epsabc , size_t const No , Complex* const epsi , Complex* const Tijk , Complex* const Zijk ); #+end_src ** Contractions *** Singles contribution The first contraction we discuss is the one involving $ t_i^a $ singles amplitudes. For every $ (a,b,c) $ we construct the object \begin{align*} Z_{ijk} &= t^a_i V^{bc}_{jk} \\ & + t^b_j V^{ac}_{ik} \\ & + t^c_k V^{ab}_{ij} \end{align*} and for it we will need the corresponding slices of the $ V^{pp}_{ph} $ integrals, i.e., =AB=, =AC= and =BC=. #+begin_src c++ :tangle (atrip-equations-h) :exports none template void singlesContribution ( size_t No , size_t Nv , const ABCTuple &abc , F* const Tph , F* const VABij , F* const VACij , F* const VBCij , F* Zijk ); #+end_src #+begin_src c++ :tangle (atrip-equations-cxx) template void singlesContribution ( size_t No , size_t Nv , const ABCTuple &abc , F* const Tph , F* const VABij , F* const VACij , F* const VBCij , F* Zijk ) { const size_t a(abc[0]), b(abc[1]), c(abc[2]); const size_t NoNo = No*No; // TODO: change order of for loops for (size_t k = 0; k < No; k++) for (size_t i = 0; i < No; i++) for (size_t j = 0; j < No; j++) { const size_t ijk = i + j*No + k*NoNo; Zijk[ijk] += Tph[ a + i * Nv ] * VBCij[ j + k * No ]; Zijk[ijk] += Tph[ b + j * Nv ] * VACij[ i + k * No ]; Zijk[ijk] += Tph[ c + k * Nv ] * VABij[ i + j * No ]; } } // instantiate template void singlesContribution( size_t No , size_t Nv , const ABCTuple &abc , double* const Tph , double* const VABij , double* const VACij , double* const VBCij , double* Zijk ); template void singlesContribution( size_t No , size_t Nv , const ABCTuple &abc , Complex* const Tph , Complex* const VABij , Complex* const VACij , Complex* const VBCij , Complex* Zijk ); #+end_src *** Doubles contribution Next we build the tensor $ T_{ijk} $ from the contraction of the coulomb integrals with the doubles amplitudes $ t^{ab}_{ij} $. There are two kinds of contractions, which are differently challenging in terms of computation. \begin{align*} T_{ijk} &= \\ \sum_{{\color{red}L} = 0}^{N_\mathrm{o}} &- T^{{\color{blue} ab}}_{{\color{red}L}j} V_{ik}^{{\color{red}L}{\color{blue}c}} \\ &- T^{{\color{blue} ab}}_{i{\color{red}L}} V_{jk}^{{\color{red}L}{\color{blue}c}} \\ &- T^{{\color{blue} ac}}_{{\color{red}L}k} V_{ij}^{{\color{red}L}{\color{blue}b}} \\ &- T^{{\color{blue} ac}}_{i{\color{red}L}} V_{kj}^{{\color{red}L}{\color{blue}b}} \\ &- T^{{\color{blue} bc}}_{{\color{red}L}k} V_{ji}^{{\color{red}L}{\color{blue}a}} \\ &- T^{{\color{blue} bc}}_{j{\color{red}L}} V_{ki}^{{\color{red}L}{\color{blue}a}} \end{align*} \begin{align*} T_{ijk} &= \\ \sum_{{\color{red}e} = 0}^{N_\mathrm{v}} &\phantom{+} V^{{\color{blue}ab}}_{{\color{red}e}i} T^{{\color{blue}c}{\color{red}e}}_{ij} \\ &+ V^{{\color{blue}ba}}_{{\color{red}e}j} T^{{\color{blue}c}{\color{red}e}}_{ji} \\ &+ V^{{\color{blue}cb}}_{{\color{red}e}k} T^{{\color{blue}a}{\color{red}e}}_{kj} \\ &+ V^{{\color{blue}ac}}_{{\color{red}e}i} T^{{\color{blue}b}{\color{red}e}}_{ik} \\ &+ V^{{\color{blue}bc}}_{{\color{red}e}j} T^{{\color{blue}a}{\color{red}e}}_{jk} \\ &+ V^{{\color{blue}ca}}_{{\color{red}e}k} T^{{\color{blue}b}{\color{red}e}}_{ki} \\ \end{align*} #+begin_src c++ :tangle (atrip-equations-h) template void doublesContribution ( const ABCTuple &abc , size_t const No , size_t const Nv // -- VABCI , DataPtr const VABph , DataPtr const VACph , DataPtr const VBCph , DataPtr const VBAph , DataPtr const VCAph , DataPtr const VCBph // -- VHHHA , DataPtr const VhhhA , DataPtr const VhhhB , DataPtr const VhhhC // -- TA , DataPtr const TAphh , DataPtr const TBphh , DataPtr const TCphh // -- TABIJ , DataPtr const TABhh , DataPtr const TAChh , DataPtr const TBChh // -- TIJK // , DataPtr Tijk , DataFieldType* Tijk_ ); #+end_src #+begin_src c++ :tangle (atrip-equations-cxx) template void doublesContribution ( const ABCTuple &abc , size_t const No , size_t const Nv // -- VABCI , DataPtr const VABph , DataPtr const VACph , DataPtr const VBCph , DataPtr const VBAph , DataPtr const VCAph , DataPtr const VCBph // -- VHHHA , DataPtr const VhhhA , DataPtr const VhhhB , DataPtr const VhhhC // -- TA , DataPtr const TAphh , DataPtr const TBphh , DataPtr const TCphh // -- TABIJ , DataPtr const TABhh , DataPtr const TAChh , DataPtr const TBChh // -- TIJK // , DataPtr Tijk_ , DataFieldType* Tijk_ ) { const size_t a = abc[0], b = abc[1], c = abc[2] , NoNo = No*No, NoNv = No*Nv ; typename DataField::type* Tijk = (typename DataField::type*) Tijk_; #if defined(ATRIP_USE_DGEMM) #define _IJK_(i, j, k) i + j*No + k*NoNo #define REORDER(__II, __JJ, __KK) \ WITH_CHRONO("doubles:reorder", \ for (size_t k = 0; k < No; k++) \ for (size_t j = 0; j < No; j++) \ for (size_t i = 0; i < No; i++) { \ Tijk[_IJK_(i, j, k)] += _t_buffer_p[_IJK_(__II, __JJ, __KK)]; \ } \ ) #if defined(HAVE_CUDA) #define __TO_DEVICEPTR(_v) (DataFieldType*)(CUdeviceptr)thrust::raw_pointer_cast((_v)) #define DGEMM_PARTICLES(__A, __B) \ atrip::xgemm("T", \ "N", \ (int const*)&NoNo, \ (int const*)&No, \ (int const*)&Nv, \ &one, \ (DataFieldType*)__A, \ (int const*)&Nv, \ (DataFieldType*)__B, \ (int const*)&Nv, \ &zero, \ _t_buffer_p, \ (int const*)&NoNo); #define DGEMM_HOLES(__A, __B, __TRANSB) \ atrip::xgemm("N", \ __TRANSB, \ (int const*)&NoNo, \ (int const*)&No, \ (int const*)&No, \ &m_one, \ __TO_DEVICEPTR(__A), \ (int const*)&NoNo, \ (DataFieldType*)__B, \ (int const*)&No, \ &zero, \ _t_buffer_p, \ (int const*)&NoNo \ ); #define MAYBE_CONJ(_conj, _buffer) \ for (size_t __i = 0; __i < NoNoNo; ++__i) \ _conj[__i] = \ maybeConjugate>(((DataFieldType*)_buffer)[__i]); #else #define __TO_DEVICEPTR(_v) (_v) #define DGEMM_PARTICLES(__A, __B) \ atrip::xgemm("T", \ "N", \ (int const*)&NoNo, \ (int const*)&No, \ (int const*)&Nv, \ &one, \ __A, \ (int const*)&Nv, \ __B, \ (int const*)&Nv, \ &zero, \ _t_buffer_p, \ (int const*)&NoNo \ ); #define DGEMM_HOLES(__A, __B, __TRANSB) \ atrip::xgemm("N", \ __TRANSB, \ (int const*)&NoNo, \ (int const*)&No, \ (int const*)&No, \ &m_one, \ __A, \ (int const*)&NoNo, \ __B, \ (int const*)&No, \ &zero, \ _t_buffer_p, \ (int const*)&NoNo \ ); #define MAYBE_CONJ(_conj, _buffer) \ for (size_t __i = 0; __i < NoNoNo; ++__i) \ _conj[__i] = maybeConjugate(_buffer[__i]); #endif const size_t NoNoNo = No*NoNo; #ifdef HAVE_CUDA thrust::device_vector< DataFieldType > _t_buffer; #else std::vector _t_buffer; #endif _t_buffer.reserve(NoNoNo); DataFieldType* _t_buffer_p = __TO_DEVICEPTR(_t_buffer.data()); F one{1.0}, m_one{-1.0}, zero{0.0}; WITH_CHRONO("double:reorder", for (size_t k = 0; k < NoNoNo; k++) { Tijk[k] = DataFieldType{0.0}; }) // TOMERGE: replace chronos WITH_CHRONO("doubles:holes", { // Holes part %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% #ifdef HAVE_CUDA thrust::device_vector< DataFieldType > _vhhh(NoNoNo); #else std::vector _vhhh(NoNoNo); #endif // VhhhC[i + k*No + L*NoNo] * TABhh[L + j*No]; H1 MAYBE_CONJ(_vhhh, VhhhC) WITH_CHRONO("doubles:holes:1", DGEMM_HOLES(_vhhh.data(), TABhh, "N") REORDER(i, k, j) ) // VhhhC[j + k*No + L*NoNo] * TABhh[i + L*No]; H0 WITH_CHRONO("doubles:holes:2", DGEMM_HOLES(_vhhh.data(), TABhh, "T") REORDER(j, k, i) ) // VhhhB[i + j*No + L*NoNo] * TAChh[L + k*No]; H5 MAYBE_CONJ(_vhhh, VhhhB) WITH_CHRONO("doubles:holes:3", DGEMM_HOLES(_vhhh.data(), TAChh, "N") REORDER(i, j, k) ) // VhhhB[k + j*No + L*NoNo] * TAChh[i + L*No]; H3 WITH_CHRONO("doubles:holes:4", DGEMM_HOLES(_vhhh.data(), TAChh, "T") REORDER(k, j, i) ) // VhhhA[j + i*No + L*NoNo] * TBChh[L + k*No]; H1 MAYBE_CONJ(_vhhh, VhhhA) WITH_CHRONO("doubles:holes:5", DGEMM_HOLES(_vhhh.data(), TBChh, "N") REORDER(j, i, k) ) // VhhhA[k + i*No + L*NoNo] * TBChh[j + L*No]; H4 WITH_CHRONO("doubles:holes:6", DGEMM_HOLES(_vhhh.data(), TBChh, "T") REORDER(k, i, j) ) } ) #undef MAYBE_CONJ WITH_CHRONO("doubles:particles", { // Particle part %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% // TAphh[E + i*Nv + j*NoNv] * VBCph[E + k*Nv]; P0 WITH_CHRONO("doubles:particles:1", DGEMM_PARTICLES(TAphh, VBCph) REORDER(i, j, k) ) // TAphh[E + i*Nv + k*NoNv] * VCBph[E + j*Nv]; P3 WITH_CHRONO("doubles:particles:2", DGEMM_PARTICLES(TAphh, VCBph) REORDER(i, k, j) ) // TCphh[E + k*Nv + i*NoNv] * VABph[E + j*Nv]; P5 WITH_CHRONO("doubles:particles:3", DGEMM_PARTICLES(TCphh, VABph) REORDER(k, i, j) ) // TCphh[E + k*Nv + j*NoNv] * VBAph[E + i*Nv]; P2 WITH_CHRONO("doubles:particles:4", DGEMM_PARTICLES(TCphh, VBAph) REORDER(k, j, i) ) // TBphh[E + j*Nv + i*NoNv] * VACph[E + k*Nv]; P1 WITH_CHRONO("doubles:particles:5", DGEMM_PARTICLES(TBphh, VACph) REORDER(j, i, k) ) // TBphh[E + j*Nv + k*NoNv] * VCAph[E + i*Nv]; P4 WITH_CHRONO("doubles:particles:6", DGEMM_PARTICLES(TBphh, VCAph) REORDER(j, k, i) ) } ) #undef REORDER #undef DGEMM_HOLES #undef DGEMM_PARTICLES #undef _IJK_ #else for (size_t k = 0; k < No; k++) for (size_t j = 0; j < No; j++) for (size_t i = 0; i < No; i++){ const size_t ijk = i + j*No + k*NoNo , jk = j + k*No ; Tijk[ijk] = 0.0; // :important // HOLE DIAGRAMS: TABHH and VHHHA for (size_t L = 0; L < No; L++){ // t[abLj] * V[Lcik] H1 // t[baLi] * V[Lcjk] H0 TODO: conjugate T for complex Tijk[ijk] -= TABhh[L + j*No] * VhhhC[i + k*No + L*NoNo]; Tijk[ijk] -= TABhh[i + L*No] * VhhhC[j + k*No + L*NoNo]; // t[acLk] * V[Lbij] H5 // t[caLi] * V[Lbkj] H3 Tijk[ijk] -= TAChh[L + k*No] * VhhhB[i + j*No + L*NoNo]; Tijk[ijk] -= TAChh[i + L*No] * VhhhB[k + j*No + L*NoNo]; // t[bcLk] * V[Laji] H2 // t[cbLj] * V[Laki] H4 Tijk[ijk] -= TBChh[L + k*No] * VhhhA[j + i*No + L*NoNo]; Tijk[ijk] -= TBChh[j + L*No] * VhhhA[k + i*No + L*NoNo]; } // PARTILCE DIAGRAMS: TAPHH and VABPH for (size_t E = 0; E < Nv; E++) { // t[aEij] * V[bcEk] P0 // t[aEik] * V[cbEj] P3 // TODO: CHECK THIS ONE, I DONT KNOW Tijk[ijk] += TAphh[E + i*Nv + j*NoNv] * VBCph[E + k*Nv]; Tijk[ijk] += TAphh[E + i*Nv + k*NoNv] * VCBph[E + j*Nv]; // t[cEki] * V[abEj] P5 // t[cEkj] * V[baEi] P2 Tijk[ijk] += TCphh[E + k*Nv + i*NoNv] * VABph[E + j*Nv]; Tijk[ijk] += TCphh[E + k*Nv + j*NoNv] * VBAph[E + i*Nv]; // t[bEji] * V[acEk] P1 // t[bEjk] * V[caEi] P4 Tijk[ijk] += TBphh[E + j*Nv + i*NoNv] * VACph[E + k*Nv]; Tijk[ijk] += TBphh[E + j*Nv + k*NoNv] * VCAph[E + i*Nv]; } } #endif } // instantiate templates template void doublesContribution ( const ABCTuple &abc , size_t const No , size_t const Nv // -- VABCI , DataPtr const VABph , DataPtr const VACph , DataPtr const VBCph , DataPtr const VBAph , DataPtr const VCAph , DataPtr const VCBph // -- VHHHA , DataPtr const VhhhA , DataPtr const VhhhB , DataPtr const VhhhC // -- TA , DataPtr const TAphh , DataPtr const TBphh , DataPtr const TCphh // -- TABIJ , DataPtr const TABhh , DataPtr const TAChh , DataPtr const TBChh // -- TIJK , DataFieldType* Tijk ); template void doublesContribution ( const ABCTuple &abc , size_t const No , size_t const Nv // -- VABCI , DataPtr const VABph , DataPtr const VACph , DataPtr const VBCph , DataPtr const VBAph , DataPtr const VCAph , DataPtr const VCBph // -- VHHHA , DataPtr const VhhhA , DataPtr const VhhhB , DataPtr const VhhhC // -- TA , DataPtr const TAphh , DataPtr const TBphh , DataPtr const TCphh // -- TABIJ , DataPtr const TABhh , DataPtr const TAChh , DataPtr const TBChh // -- TIJK , DataFieldType* Tijk ); #+end_src ** Epilog :noexport: #+begin_src c++ :tangle (atrip-equations-h) } #+end_src #+begin_src c++ :tangle (atrip-equations-cxx) } #+end_src * Blas The main matrix-matrix multiplication method used in this algorithm is mainly using the =DGEMM= function, which we declare as =extern= since it should be known only at link-time. #+begin_src c++ :tangle (atrip-blas-h) :exports none #pragma once #include #include #include "config.h" namespace atrip { #if !defined(HAVE_CUDA) extern "C" { void dgemm_( const char *transa, const char *transb, const int *m, const int *n, const int *k, double *alpha, const double *a, const int *lda, const double *b, const int *ldb, double *beta, double *c, const int *ldc ); void zgemm_( const char *transa, const char *transb, const int *m, const int *n, const int *k, Complex *alpha, const Complex *A, const int *lda, const Complex *B, const int *ldb, Complex *beta, Complex *C, const int *ldc ); } #endif template void xgemm(const char *transa, const char *transb, const int *m, const int *n, const int *k, F *alpha, const DataFieldType *A, const int *lda, const DataFieldType *B, const int *ldb, F *beta, DataFieldType *C, const int *ldc); } #+end_src #+begin_src c++ :tangle (atrip-blas-cxx) #include #include #if defined(HAVE_CUDA) # include static inline cublasOperation_t char_to_cublasOperation(const char* trans) { if (strncmp("N", trans, 1) == 0) return CUBLAS_OP_N; else if (strncmp("T", trans, 1) == 0) return CUBLAS_OP_T; else return CUBLAS_OP_C; } #endif namespace atrip { template <> void xgemm(const char *transa, const char *transb, const int *m, const int *n, const int *k, double *alpha, const typename DataField::type *A, const int *lda, const typename DataField::type *B, const int *ldb, double *beta, typename DataField::type *C, const int *ldc) { #if defined(HAVE_CUDA) cublasDgemm(Atrip::cuda.handle, char_to_cublasOperation(transa), char_to_cublasOperation(transb), ,*m, *n, *k, alpha, A, *lda, B, *ldb, beta, C, *ldc); #else dgemm_(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc); #endif } template <> void xgemm(const char *transa, const char *transb, const int *m, const int *n, const int *k, Complex *alpha, const typename DataField::type *A, const int *lda, const typename DataField::type *B, const int *ldb, Complex *beta, typename DataField::type *C, const int *ldc) { #if defined(HAVE_CUDA) #pragma warning HAVE_CUDA cuDoubleComplex cu_alpha = {std::real(*alpha), std::imag(*alpha)}, cu_beta = {std::real(*beta), std::imag(*beta)}; cublasZgemm(Atrip::cuda.handle, char_to_cublasOperation(transa), char_to_cublasOperation(transb), ,*m, *n, *k, &cu_alpha, A, *lda, B, *ldb, &cu_beta, C, *ldc); #else zgemm_(transa, transb, m, n, k, alpha, A, lda, B, ldb, beta, C, ldc); #endif } } #+end_src * Atrip ** Header #+begin_src c++ :tangle (atrip-atrip-h) #pragma once #include #include #include #include #include "config.h" #if defined(HAVE_CUDA) #include #define CUBLASAPI #include #include #endif #include #include #define ADD_ATTRIBUTE(_type, _name, _default) \ _type _name = _default; \ Input& with_ ## _name(_type i) { \ _name = i; \ return *this; \ } namespace atrip { struct Atrip { static size_t rank; static size_t np; static MPI_Comm communicator; static Timings chrono; #if defined(HAVE_CUDA) struct CudaContext { cublasStatus_t status; cublasHandle_t handle; }; static CudaContext cuda; #endif static void init(MPI_Comm); template struct Input { CTF::Tensor *ei = nullptr , *ea = nullptr , *Tph = nullptr , *Tpphh = nullptr , *Vpphh = nullptr , *Vhhhp = nullptr , *Vppph = nullptr ; Input& with_epsilon_i(CTF::Tensor * t) { ei = t; return *this; } Input& with_epsilon_a(CTF::Tensor