Attention

Standard

The classes below are simply a reimplementation of the standard attention mechanisms in torch. They are reimplemented so that they support ensembles natively.

Feedforward

class supertransformerlib.Attention.FeedForward(d_model: int, d_internal: int = 2048, ensembles: Optional[int] = None)

A feedforward layer for attention purposes. Permits ensembling, but nothing clever beyond that.

Expects input tensors of shape (…, (ensemble), item, embedding), where the ensemble dimension is optional and only used if ensembles was defined on initialization. Returns a tensor of the same shape.

See ‘Attention is all you need’ for details.

forward(tensor: Tensor)
Parameters:

tensor – A tensor to perform feedforward on

Returns:

The result of feedforward processing.
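As a rough illustration of what "permits ensembling" can mean, here is a minimal sketch of a feedforward block that keeps one independent set of weights per ensemble member. The class name `EnsembleFeedForward`, the ReLU activation, and the initialization are assumptions made for this example; this is not the library's actual implementation.

```python
import torch
from torch import nn


class EnsembleFeedForward(nn.Module):
    """Hypothetical stand-in, not supertransformerlib's FeedForward."""

    def __init__(self, d_model: int, d_internal: int = 2048, ensembles: int = 1):
        super().__init__()
        # One independent weight matrix and bias per ensemble member.
        self.w1 = nn.Parameter(torch.randn(ensembles, d_model, d_internal) / d_model ** 0.5)
        self.b1 = nn.Parameter(torch.zeros(ensembles, 1, d_internal))
        self.w2 = nn.Parameter(torch.randn(ensembles, d_internal, d_model) / d_internal ** 0.5)
        self.b2 = nn.Parameter(torch.zeros(ensembles, 1, d_model))

    def forward(self, tensor: torch.Tensor) -> torch.Tensor:
        # tensor: (..., ensemble, item, d_model).
        # matmul broadcasts the (ensemble, d_in, d_out) weights over leading dims,
        # so each ensemble slice is multiplied by its own matrix.
        hidden = torch.relu(torch.matmul(tensor, self.w1) + self.b1)
        return torch.matmul(hidden, self.w2) + self.b2
```

Because the ensemble dimension is treated as a batch dimension of the matrix multiply, all members run in one call while sharing no parameters.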

MultiheadedAttention

class supertransformerlib.Attention.MultiHeadedAttention(d_query: int, d_content: int, d_output: int, heads: int, ensembles: Optional[int] = None)

An ensemble-enabled implementation of multiheaded attention.

This is implemented as seen in "Attention Is All You Need", with one additional option. The incoming content is projected to build heads, dot-product attention is performed, the heads are combined, and the result is returned.

A novel feature is the ensembles option. This may be left blank, but if defined it changes the expected input shape to (…, ensemble, items, embedding). Each ensemble entry is processed in parallel, with completely independent parameters.

forward(query: Tensor, key: Tensor, value: Tensor, mask: Optional[Tensor] = None) → Tensor
Parameters:
  • query – The query. Of shape (…, (ensemble), items, embedding)

  • key – The key. Of shape (…, (ensemble), content_items, embedding)

  • value – The value. Of shape (…, (ensemble), content_items, embedding)

  • mask – An optional bool mask. Entries that are True are masked out. Of shape (…, (ensemble), items, content_items)

Returns:

A tensor. The attention result.
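The mask convention above (True means "masked out") can be illustrated with a minimal single-head sketch of scaled dot-product attention. The function name `masked_dot_product_attention` is hypothetical, and the head projections and ensemble handling of the real class are deliberately left out.

```python
import math
import torch


def masked_dot_product_attention(query, key, value, mask=None):
    # query: (..., items, d); key and value: (..., content_items, d)
    scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(query.shape[-1])
    if mask is not None:
        # Following the documented convention: True entries are masked out.
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)  # (..., items, content_items)
    return torch.matmul(weights, value)      # (..., items, d)
```

One caveat of this simple formulation: a query row whose mask is True everywhere yields NaN after the softmax, so each item should be left at least one visible content item.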

Parameter Injection

Parameter Injected Memory Unit

class supertransformerlib.Attention.PIMU(d_model: int, mem_width: int, heads: int, ensembles: Optional[int] = None)

Parameter Injection Memory Unit (PIMU).

Parameter memory consists of large blocks of parameters that are compatible with an embedded stream, as though they were embeddings themselves.

Parameter injection is the process of conditionally injecting whole blocks of parameters into a running embedded stream, as though each block were an embedding itself. Two tasks exist. First, the module must figure out which parameter block to inject, and when. Second, the module must train the parameter blocks to provide useful context.

Parameter injection is most effective within a model that maps many inputs onto only a few results at some point in its logic. This may be a transformer unit, an ImageNet flow, or even just a standard dense network.

For high granularity, a high head count and softmax mode are desirable. In this case many options are considered available. For a case in which only a few options should be allowed at each step, a low head count is recommended. Generally, it is recommended to start with a high head count where possible; more heads do NOT slow the model down.

forward(query: Tensor) → Tensor
Parameters:

query – A tensor to gain insight on

Returns:

The calibrated result of the query.
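One way to picture parameter injection is as attention in which the keys and values are learned parameters rather than projections of some input. The sketch below is a single-head illustration under that assumption; the class name `ParameterMemorySketch` and the attribute names are hypothetical, and the real PIMU (with heads and ensembles) may be organized differently.

```python
import math
import torch
from torch import nn


class ParameterMemorySketch(nn.Module):
    """Hypothetical single-head illustration, not the library's PIMU."""

    def __init__(self, d_model: int, mem_width: int):
        super().__init__()
        # Learned memory: mem_width parameter blocks, each shaped like an embedding.
        self.mem_keys = nn.Parameter(torch.randn(mem_width, d_model))
        self.mem_values = nn.Parameter(torch.randn(mem_width, d_model))

    def forward(self, query: torch.Tensor) -> torch.Tensor:
        # query: (..., items, d_model)
        scores = torch.matmul(query, self.mem_keys.transpose(-1, -2))
        scores = scores / math.sqrt(query.shape[-1])
        # Softmax decides which memory blocks to inject for each item.
        weights = torch.softmax(scores, dim=-1)         # (..., items, mem_width)
        return torch.matmul(weights, self.mem_values)   # (..., items, d_model)
```

Both tasks named above show up here: the scores select which block to inject, and gradients flowing into `mem_values` train the blocks themselves.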

Parameter Injected Summary Unit

class supertransformerlib.Attention.PISU(d_model: int, d_output: int, output_items: int, heads: int, ensembles: Optional[int] = None)

Parameter Injected Summary Unit (PISU)

An attention layer designed to enable the collapse of a large number of items into something of fixed width. The sibling of PIMU.

A fixed-width, parameter-based query is presented as attention to heads generated from the incoming content. The result, an embedding of the indicated width, is then returned.

Note that, as with PIMU, an aggressive number of heads will allow more degrees of freedom, while fewer will allow less.
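The idea of a fixed-width, parameter-based query can be sketched as follows: a learned block of queries attends over however many content items arrive, so the output item count does not depend on the input length. This is a single-head illustration only; `SummarySketch` and its attribute names are hypothetical, and the `d_output` projection and heads of the real class are omitted.

```python
import math
import torch
from torch import nn


class SummarySketch(nn.Module):
    """Hypothetical single-head illustration, not the library's PISU."""

    def __init__(self, d_model: int, output_items: int):
        super().__init__()
        # Learned queries: one per output item.
        self.queries = nn.Parameter(torch.randn(output_items, d_model))

    def forward(self, content: torch.Tensor) -> torch.Tensor:
        # content: (..., items, d_model), with any number of items.
        scores = torch.matmul(self.queries, content.transpose(-1, -2))
        scores = scores / math.sqrt(content.shape[-1])
        weights = torch.softmax(scores, dim=-1)   # (..., output_items, items)
        return torch.matmul(weights, content)     # (..., output_items, d_model)
```

Feeding the same module sequences of 7 and 11 items produces summaries of identical shape, which is what makes the result usable as a fixed-width input to later layers.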

forward(content: Tensor) → Tensor
Parameters:

content – The content to summarize. Of shape (…, (ensembles), items, embeddings)

Returns:

A tensor. The fixed-width summary of the content.

Context Splitting

Local Context Self Attention

class supertransformerlib.Attention.LCSA(d_model: int, kernel_width: int, dilations: List[int], mode: str = 'center', ensemble: Optional[int] = None)

Local Context Self Attention (LCSA)

A banded self-attention class with positional intelligence. After it constructs a convolutional kernel, each dimension is projected using an independent linear action. The net effect is that the layer can learn to treat words at different positions differently.

Multiple padding options exist, allowing conditioning on only the words that came before, only the words that come after, or a centered view with both.

One thing of note: the number of words passed into the layer MUST be equal to or greater than the kernel width. Remember to pad to at least this length.

Combined with add+norm, this nicely handles local context.

forward(tensor: Tensor) → Tensor
Parameters:

tensor – The tensor to perform self attention with. Shape (…, (ensemble), item, embedding).

Returns:

A tensor. The result of self attention. Shape (…, (ensemble), item, embedding)

Raises:

RuntimeError – If the number of items is too small for the kernel.
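The local windows and the three padding styles can be sketched as below. The function `local_windows` is hypothetical; only 'center' appears in the signature above, so the mode names 'before' and 'after' are assumptions, and dilations and the per-position projections are omitted. The length check mirrors the documented RuntimeError.

```python
import torch
import torch.nn.functional as F


def local_windows(tensor: torch.Tensor, kernel_width: int, mode: str = "center") -> torch.Tensor:
    # tensor: (..., items, embedding)
    items = tensor.shape[-2]
    if items < kernel_width:
        raise RuntimeError(f"Got {items} items, need at least kernel_width={kernel_width}")
    if mode == "center":
        left = (kernel_width - 1) // 2
    elif mode == "before":   # assumed name: see only earlier items and self
        left = kernel_width - 1
    elif mode == "after":    # assumed name: see only self and later items
        left = 0
    else:
        raise ValueError(f"Unknown mode: {mode}")
    right = kernel_width - 1 - left
    # Zero-pad the items dimension so every item gets a full window.
    padded = F.pad(tensor, (0, 0, left, right))
    # Slide a window of size kernel_width along the items dimension.
    # Result: (..., items, embedding, kernel_width)
    return padded.unfold(padded.dim() - 2, kernel_width, 1)
```

With five items numbered 0 to 4 and a kernel width of 3, item 2 sees items 1, 2, 3 in the centered mode and items 0, 1, 2 in the 'before' mode.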

Global Strategic Processing Unit

class supertransformerlib.Attention.GSPU(d_model: int, d_summary: int, summary_width: int, pisu_heads: int, mha_heads: int, layers: Optional[List[Module]] = None, dropout: Optional[float] = 0.2, ensembles: Optional[int] = None)

Global Strategic Processing Unit.

This essentially performs a PISU summary, processes the summary by whatever arbitrary logic is desired, then uses the result to generate a conditioning tensor for each individual word. Combined with add+norm, it takes care of global context.

forward(tensor: Tensor) → Tensor
Parameters:

tensor – A tensor, of shape (…, (ensemble), items, embedding)

Returns:

A tensor, of shape (…, (ensemble), items, embedding)
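The summarize, process, condition flow described above might look roughly like the sketch below. Everything here is an assumption for illustration: the class name `GlobalContextSketch`, the single-head attention helper, and the small linear layer standing in for the user-supplied `layers`. Heads, dropout, and ensembles are omitted.

```python
import math
import torch
from torch import nn


class GlobalContextSketch(nn.Module):
    """Hypothetical illustration of summarize -> process -> condition."""

    def __init__(self, d_model: int, summary_width: int):
        super().__init__()
        self.summary_queries = nn.Parameter(torch.randn(summary_width, d_model))
        # Stand-in for the arbitrary processing layers.
        self.process = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU())

    @staticmethod
    def attend(query, key, value):
        scores = torch.matmul(query, key.transpose(-1, -2)) / math.sqrt(query.shape[-1])
        return torch.matmul(torch.softmax(scores, dim=-1), value)

    def forward(self, tensor: torch.Tensor) -> torch.Tensor:
        # 1. Collapse all items into a fixed-width summary.
        summary = self.attend(self.summary_queries, tensor, tensor)  # (..., summary_width, d)
        # 2. Process the summary.
        summary = self.process(summary)
        # 3. Let every item query the summary to get its conditioning vector.
        return self.attend(tensor, summary, summary)                 # (..., items, d)
```

The add+norm step mentioned above would then be applied by the caller, for example `nn.LayerNorm(d_model)(tensor + layer(tensor))`.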

Ensembles and Specialization

Ensemble Exchange Self Attention

class supertransformerlib.Attention.EESA(d_model: int, heads: int, ensembles: int)

Ensemble Exchange Self Attention (EESA)

Allows different ensembles to exchange information, while constraining the parameters available to lower-level units. Each ensemble is only allowed to perform attention with units whose index is equal to or lower than its own. This restriction is enforced by the attention mechanism.

This, it is hoped, will help provide good test behavior by ensuring that even if one section is overfit, others are not. It should have the effect of increasing fine-tuning speed as well.

forward(tensor: Tensor) → Tensor
Parameters:

tensor – A tensor. Of shape (…, ensemble, items, embedding)

Returns:

Another tensor. Shape (…, ensemble, items, embedding)
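The index constraint can be expressed as a triangular mask over the ensemble dimension. The sketch below assumes attention runs across ensembles independently for each item, and it uses no learned projections or heads; the function name `ensemble_exchange` is hypothetical and the real EESA may differ in both respects.

```python
import math
import torch


def ensemble_exchange(tensor: torch.Tensor) -> torch.Tensor:
    # tensor: (..., ensemble, items, embedding)
    ensembles = tensor.shape[-3]
    # Put the ensemble dimension where attention expects the sequence dimension.
    x = tensor.transpose(-3, -2)                       # (..., items, ensemble, embedding)
    scores = torch.matmul(x, x.transpose(-1, -2)) / math.sqrt(x.shape[-1])
    # True masks: ensemble i may not attend to any ensemble j > i.
    mask = torch.ones(ensembles, ensembles).triu(diagonal=1).bool()
    scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # (..., items, ensemble, ensemble)
    return torch.matmul(weights, x).transpose(-3, -2)  # (..., ensemble, items, embedding)
```

Under this mask, ensemble 0 sees only itself, so perturbing a higher-index ensemble cannot change the outputs of lower-index ones. Information flows one way, from low indices to high.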