# How do I use this?

The library is broken up into several components. From import, you have
the Attention, the Local, the Glimpses, and the Loss library.

## Attention

The attention library contains the majority of the interesting components.
The items here are layers which impliment a variety of attention and feedforward
mechanisms suitable for some kickass transformers. Notably, every layer possesses
three useful properties.

First, they all possess an 'ensembles' option on initialization, which 
sets the number of parallel layers to setup. This is optional, but quite useful. Ensemble layers are processed completely
in parallel, and must be used by having a tensor of the shape (..., ensemble, words, embeddings), 
vs the standard shape of (..., words, embeddings). It should be noted that if ensembles is not defined,
the system expects the latter shape

Second, each and every layer is torchscript compatible. This is 
required for serious work such as saving to ONNX format, or even 
compiling CUDA with custom kernels. It means you can use torch.jit.script
with little worry.

Third, with the exception of Multiheaded Attention,
every layer listed below impliments a variation of 
transformer existing in an O space of less than O(N^2) with
respect to words provided. It should be noted that parameter
usage is frequently moderately higher, and signicantly higher
if a naive ensemble is used.

As of 6/14/2022, the layers available are:

* FeedForward: 
* MultiHeadedAttention (MHA)
* Parameter Injection Memory Unit (PIMU)
* Parameter Injection Summary Unit (PISU)
* Local Context Self Attention (LCSA) (banded self attention)
* Ensemble Exchange Self Attention (EESA)
* Global-Local Self Attention (GLSA)
* Global Strategic Processing Unit (GSPU)

Their proper utilization is:

* Feedforward: Use this to make decisions.
* PIMU: Use this when dealing with tasks
which you suspect would best be approached by
subcatagorization.
* PISU: Use this if you need to create a fixed shape
tensor output summarizing global trends in an order
independent manner.
* LCSA: Use this to capture order based contextual information
among nearby tensors. This is a banded attention.
* EESA: Use this only when processing an ensemble tensor. It allows
the exchange of data from lower level ensembles to higher level ones, but not vice versa.
* GSPU: Use this, with an internal transformer stack, when the model really
needs to be able to reason about the big picture.

## Linear

Linear is a core layer, and a rebuild of torch's
linear layer in an ensemble capable format. See the 
class for details

## Glimpses

The Glimpses package contains a few functions which
are useful for dealing with local operations and reshaping.

In particular, Glimpses contains an operation called "local" 
which is capable of returning a view of a tensor which is 
exactly the same as would be seen from a convolution kernel.

## Loss

Loss contains some experimental loss functions which may
be useful on an ensemble of outputs.