anhinga_anhinga ([personal profile] anhinga_anhinga) wrote 2016-12-27 01:29 am

Dataflow matrix machines as generalized recurrent neural networks

A year ago I posted about dataflow programming and linear models of computation:

http://anhinga-anhinga.livejournal.com/82757.html

It turns out that those dataflow matrix machines are a fairly powerful generalization of recurrent neural networks.

The main feature of dataflow matrix machines (DMMs) is vector neurons. While recurrent neural networks process streams of numbers, dataflow matrix machines process streams of representations of arbitrary vectors (linear streams).

Another important feature of DMMs is that neurons of arbitrary input and output arity are allowed, and a rich set of built-in transformations of linear streams is provided.

Recurrent neural networks are Turing-complete, but they are an esoteric programming language, and not a convenient general-purpose programming platform. DMMs provide a formalism friendly to handling sparse vectors, conditionals, and more, and there are indications that DMMs will grow to become a powerful general-purpose programming platform, in addition to being a convenient machine learning platform.

In this context, it is possible to represent large classes of programs by matrices of real numbers, which allows us to modify programs in a continuous fashion and to synthesize programs by synthesizing matrices of real numbers.

Further details and preprints

Self-referential mechanism: Consider a linear stream of matrices describing the connectivity pattern and weights of a DMM. Select a dedicated neuron Self emitting such a stream on its output, and use the latest value of that stream as the current network matrix (the matrix describing the connectivity pattern and weights of our DMM). A typical Self neuron works as an accumulator taking additive updates from other neurons in the network. This mechanism enables reflection and powerful dynamic self-modification; in particular, the networks in question have facilities for dynamic expansion.

The recent DMM-related preprints by our group:

https://arxiv.org/abs/1603.09002

https://arxiv.org/abs/1605.05296

https://arxiv.org/abs/1606.09470

https://arxiv.org/abs/1610.00831

Modern recurrent neural networks with good machine learning properties, such as LSTM and Gated Recurrent Unit networks, are naturally understood in the DMM framework as networks having linear and bilinear neurons in addition to neurons with more traditional sigmoid activation functions.

Our new open source effort

The new open-source implementation of core DMM primitives in Clojure:

https://github.com/jsa-aerial/DMM

This open-source implementation features a new vector space of recurrent maps (space of "mixed rank tensors"), which allows us to represent a large variety of linear streams as streams of recurrent maps. The vector space of recurrent maps also makes it possible to express variadic neurons as neurons having just one argument.

Therefore a neuron type is simply a function transforming recurrent maps, which is a great simplification compared to the formalism presented in the preprints above. See the design notes within this open-source implementation for further details.
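
To make this concrete, here is a minimal Clojure sketch of recurrent maps as a vector space (the function names below are illustrative only, not the API of the actual implementation, which may structure this differently):

;; A recurrent map is a nested Clojure map whose leaves are numbers.
(def v1 {:char {"a" 1.0} :count 2.0})
(def v2 {:char {"b" 0.5} :count 1.0})

;; Vector addition merges the trees, adding coinciding leaves;
;; scalar multiplication scales every leaf.
(defn rec-map-plus [x y]
  (if (and (map? x) (map? y))
    (merge-with rec-map-plus x y)
    (+ x y)))

(defn rec-map-scale [a x]
  (if (number? x)
    (* a x)
    (into {} (map (fn [[k v]] [k (rec-map-scale a v)]) x))))

;; (rec-map-plus v1 v2)   ;=> {:char {"a" 1.0, "b" 0.5}, :count 3.0}
;; (rec-map-scale 2.0 v1) ;=> {:char {"a" 2.0}, :count 4.0}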

[identity profile] anhinga-anhinga.livejournal.com 2016-12-28 10:21 pm (UTC)(link)
The following post "The Unreasonable Effectiveness of Recurrent Neural Networks" by Andrej Karpathy illustrates the remarkable power of LSTM as a machine learning platform:

http://karpathy.github.io/2015/05/21/rnn-effectiveness/

I was particularly impressed by his examples of the nets learning to reproduce patterns observed in real-life C code and patterns observed in mathematical texts written in LaTeX. This is the post I recommend to people as a starting point for familiarizing themselves with the impressive capabilities of RNNs.

His examples are character-based nets, with characters represented as vectors whose dimension is the size of the alphabet in question. A particular character, say 'X', is represented on the input by a vector with value 1 at the coordinate corresponding to 'X' and zeros at the other coordinates. So the conventional scalar-based RNN he uses needs as many neurons as there are characters in the alphabet just to feed the characters into the network. This is inconvenient if one wants to program character-based algorithms in RNNs manually, and it becomes problematic for large alphabets such as Unicode, which has more than 100,000 characters.

This is the first illustration of why it might be good to have vector neurons. If one has neurons which process linear streams, one can treat characters as sparse vectors, and then one can have character-processing nets with the number of neurons independent of the size of the alphabet.
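
A tiny Clojure sketch of this representation (illustrative only, not code from the DMM repo):

;; A character as a sparse vector over the alphabet: a map with a single
;; non-zero coordinate, instead of a dense one-hot vector of alphabet size.
(defn char->sparse-vector [c]
  {c 1.0})

;; (char->sparse-vector \X) ;=> {\X 1.0}
;; The same representation works for Unicode without ever allocating
;; 100,000+ coordinates: only the non-zero entries are stored.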

For example, our preprint https://arxiv.org/abs/1606.09470 has a simple network with 9 neurons and 10 non-zero weights solving a standard coding interview problem with character strings (Section 1 explains the problem and Section 3, pages 8 and 9, solves it). It seems that this problem cannot be solved in a traditional RNN in such a way that the size of the RNN would be independent of the size of the alphabet (at least, there is no simple solution of this kind). This is one of the first indications that DMMs are indeed a more powerful programming platform than RNNs.

[identity profile] datjko.livejournal.com 2016-12-29 10:40 am (UTC)(link)
Misha, thank you, this looks very interesting, but I'm rather a newbie here. Could you answer a first bunch of my silly questions here, or give me a link to some introduction? I started learning ML more or less seriously only recently, and some looking into the papers you gave makes it clear to me: it's yet another interesting story that I started listening to from the middle of the book.

1. Why can't a neuron with a fixed number of inputs be approximated by a cascade of layers of ordinary neurons? Or, what are the advantages of a single multi-input neuron compared to a subnetwork of ordinary neurons?

2. Are the multiple inputs of a DMM neuron supposed to have different roles/interpretations (as in your example of a "main signal" and a "modulation signal", or as 'recurrent' connections in an RNN), or are they thought of as being fed from homogeneous streams, like the inputs to the linear combination before the activation function in regular neurons? Or is it up to the designer of the network?

3. Do I understand correctly that, generally, in DMMs the number of inputs of a neuron is not necessarily fixed?

4. And new neurons may even be created dynamically? Only at training time, or even while processing a test stream?

5. "friendly to handling sparse vectors" is friendly only to "one-of-all" representation as for letters in an alphabet or there is some more generic friendship with sparsity?

I'm sorry, I can only guess how silly my understanding may be.

[identity profile] anhinga-anhinga.livejournal.com 2016-12-30 04:22 am (UTC)(link)
Hi! Let me start by answering your questions, then point to some links.

1. Indeed, there is a theorem that any reasonable function can be approximated by a neural cascade. But pragmatically speaking, it's a nightmare, not something easy to use in practice. (E.g. try to craft a network performing multiplication of two reals. Also, in general, the more precise the approximation, the larger the cascade/subnetwork must be.) So it is convenient to allow adding built-in activation functions as needed.

2. It's up to the designer. E.g. even for multiplication, sometimes an asymmetric interpretation is convenient, and sometimes it might be convenient to think in a symmetric fashion, with both inputs modulating each other.

3. Typically, we think about a particular neuron as having a type. And this type says, among other things, how many inputs the neuron has and which vectors are associated with them. For example, one input might take a vector representing a letter (so it would have the dimension of the alphabet in question), another input might take a number (a one-dimensional vector), and so on. And the network can have neurons of different types.

So in that approach, the number of inputs for a single given neuron is fixed, but there can be neurons of varying type. This is our approach in the preprints.

Our latest version (the one in our new open-source project in Clojure) is more flexible. There we use the vector space of recurrent maps to implement variadic neurons. Basically, recurrent maps allow us to assemble any number of arguments into one single argument, and so we can code neurons with an arbitrary (and not necessarily fixed) number of inputs via functions of one argument.
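
A hypothetical Clojure illustration of this point (not the actual DMM code): a "variadic" neuron is just an ordinary function of one recurrent map whose keys name its inputs, however many there happen to be.

(defn sum-inputs
  "Illustrative neuron: sums however many numeric inputs arrive,
   packaged as a single recurrent map."
  [input-map]
  {:sum (reduce + 0.0 (vals input-map))})

;; (sum-inputs {:x 1.0 :y 2.0 :z 0.5}) ;=> {:sum 3.5}
;; (sum-inputs {:x 1.0})               ;=> {:sum 1.0}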

4. Yes, we think about networks of countable size, but with only a finite number of non-zero elements in the network matrix at any given time. Only the neurons with a non-zero weight associated with them are considered active. So at any given moment, only a finite subnetwork is active and the rest is silent. New neurons can be recruited from the silent part by making some of their associated weights non-zero. So as long as the network weights are allowed to change, new neurons can be created dynamically (by recruiting them from the countable silent part).

5. There is actually friendship with any reasonable representation of vectors in computers, even with very approximate representations. E.g. probability distributions, which are often infinite-dimensional as vectors, can be approximately represented by samples, and then variants of a stochastic sum can be used to implement linear combinations.
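
For example, here is one possible hedged sketch of such a stochastic sum in Clojure (mixture sampling; the variants actually used with DMMs may differ): to sample from a linear combination with non-negative weights, pick the i-th sample stream with probability proportional to its weight.

(defn stochastic-sum-sample
  "Draw one sample approximately distributed as the normalized linear
   combination of the given sample-based distributions."
  [weights sample-seqs]
  (let [total (reduce + weights)
        r     (* (rand) total)
        idx   (loop [i 0, acc 0.0]
                (let [acc (+ acc (nth weights i))]
                  (if (or (< r acc) (= i (dec (count weights))))
                    i
                    (recur (inc i) acc))))]
    (rand-nth (vec (nth sample-seqs idx)))))

;; (stochastic-sum-sample [0.7 0.3] [[1 1 2] [10 11]])
;; returns a sample from the first stream with probability ~0.7.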

I think this is somewhat difficult material. (I am telling various parts of this story to people, and it's not always easy for them to understand.)

It's not too difficult compared to many math texts, but as computer science it's not very simple.

I am happy to answer more questions. I can also put together more references to RNNs (let me know, if I should).

But speaking of DMMs, the links above are the complete set of literature available at the moment. We discovered them last summer, and this year we understood that what we have is a generalized version of RNNs; these preprints and the notes in the open-source projects (this one, and the previous one mentioned in the previous livejournal post) are all that exists on DMMs at the moment.

RNN-related literature is quite extensive. The link to Karpathy's post in my comments above is very useful. I found various Wikipedia pages helpful. On LSTMs, see Appendix C of the last of our preprints and the awesome paper it discusses: https://www.arxiv.org/abs/1603.09420 (this is the best way to understand LSTMs and similar architectures).

[identity profile] datjko.livejournal.com 2016-12-30 05:56 am (UTC)(link)
Hi!
Wow! So generic and still workable!
I'll try to read through and come back with new questions. How nice to be able to get answers first hand.
Maybe you could also write a post resolving some typical newbie questions and difficulties in understanding? Say, assuming the reader has a decent undergraduate background in CS/ML/FP? I believe many people would greatly appreciate it.
Thank you, Misha!

[identity profile] anhinga-anhinga.livejournal.com 2016-12-30 05:49 pm (UTC)(link)
Hi!

Yes, I think it is a very interesting class of models and a very interesting computational platform. All standard methods of machine learning should be applicable to it.

My real hope is, of course, that it will allow us to create better methods of machine learning in the coming year. So I invite people to try to make progress in that direction, both on their own and in collaboration with me :-)


***

Speaking of questions and difficulties, this conversation is a good start. Other people might read it and find it useful. So it would be great, if you have more questions!

[identity profile] datjko.livejournal.com 2016-12-30 08:57 pm (UTC)(link)
Misha, thank you, I will gladly try to read carefully through your group's papers, and maybe I'll try to port something to tensorflow next month. But I'm only an amateur, and my questions may actually lead away from the important essence rather than help good people see it. It would be much more useful if you (if I'm not mistaken, back when I was a schoolkid coming to CEMI to tinker with little things, we were on a first-name basis, right?) came up with and selected the important questions yourself and answered them in some text on github, where people would find it next to the working code. And I'll still ask plenty of unnecessary questions :-)
Hanukkah sameach and a great 2017!

[identity profile] anhinga-anhinga.livejournal.com 2016-12-31 02:51 am (UTC)(link)
I've also started looking a bit at tensorflow, to see how easy or hard it would be to port something from DMMs into it.

How much have you already tinkered with tensorflow? (One of my questions is what the situation there is with sparse tensors; I know they exist there, but how easy is it to use them, for example in the existing implementations of gradient steps?)

Yes, of course we were on a first-name basis :-)

***

Well, for now I have added here a comment with a list of Clojure links which were useful to me this year, for those who might want to learn Clojure or to read code written in it... But having versions in more widely accessible languages would, of course, be the right thing, as would having better tutorials/FAQs...

Hanukkah sameach and Happy New Year!

[identity profile] datjko.livejournal.com 2017-01-01 10:31 am (UTC)(link)
Misha, alas, I'm still a complete newbie, and I've only managed to play a little with tensorflow (and haven't played with any of the rest, caffe/torch/theano, at all). I didn't even expect tensorflow to already have sparse tensors; I was thinking about how to get by without them, and you opened my eyes. Well, now I just need to gradually distill everything I've learned this month into one funnel; I hadn't gotten around to it before.

Thanks for the Clojure links too! (blushing) I don't know it either, but I was planning to learn it when it fell on my head. Looks like it already has.

- and to have tutorials/FAQs
- Yes! I'll try to read carefully and figure out what to ask and how to play with it, so that it becomes clearer.
I'll keep you posted :-)

[identity profile] datjko.livejournal.com 2017-01-02 10:39 am (UTC)(link)
I'm trying to understand the basics.
1. What is "generalized animation"? Is it something from earlier works? Do I need to know it?

2. "Recurrent maps are maps which map arbitrary legal hash-map keys into numbers or other recurrent maps." - ок. Как вы из него делаете векторное пространство? Можешь в двух словах пересказать что написано в src/dmm/core.clj или куда посмотреть?

3. The Self neuron is not mentioned in http://www.cs.brandeis.edu/~bukatin/DataFlowGraphsAsMatrices.pdf. Is that because you introduced it later, together with the two-stroke engine, or is the two-stroke engine a particular implementation detail?

This is me trying to understand, before reading, how to read all of this.

[identity profile] datjko.livejournal.com 2017-01-02 11:17 am (UTC)(link)
Oh, about 3: I realized that the two-stroke engine is about inputs and outputs, and not what I extracted from the mush in my head, namely that on the up stroke we process data and on the down stroke we update the weights with the Self neuron.

[identity profile] anhinga-anhinga.livejournal.com 2017-01-02 06:54 pm (UTC)(link)
Hi!

1. Generalized animation is a rather beautiful thing. Here it is, in half a page: Section 1.1, page 2 of

http://easychair.org/publications/download/Linear_Models_of_Computation_and_Program_Learning

(If you want to read it at some point: it's a sandwich paper, two papers in one; the mathematical filling, which is Section 4, should be ignored (or treated as a separate paper).)

2. That's the right question; it prompted me to finally write this down and add it to github:

https://github.com/jsa-aerial/DMM/blob/master/design-notes/Early-2017/vector-space-of-mixed-tensors.md

3. We already understood the two-stroke engine back then (we called it a "bipartite graph" at the time, but it served the same purpose: alternating linear transformations with nonlinear ones, that is, exactly the two-stroke cycle). But we did not understand Self yet; we were trying to make the matrix modify itself at the level of individual elements, and that turned out to be terribly tedious and creaky...

Only later did we realize that we could introduce Self and work at the level of matrices and large fragments of matrices, and life became somewhat easier...

[identity profile] datjko.livejournal.com 2017-01-02 08:50 pm (UTC)(link)
1. thank you, it's indeed interesting.
" This architecture provides a direct way to incorporate aesthetic criteria into software systems."
" One should be able to formulate mechanisms of higher-order animation programming via variable illumination of elements of such structures."
- Wow, cool! I'm no longer put off by such a generic view in your papers. I'd rather regret that you did not make a series of masterpieces, making the concept generate explanations of itself on its own, with the formality/poetry ratio as a parameter :-)

2. Now I see, thank you.
"we are talking about a co-product (direct sum) of one-dimenstional spaces." - hmm. I'd say without this note it reads better. Simply, a linear space generated by finite linear combinations of strings of finite length over some alphabet ("a countable set L which is a subset of legal keys"), right? Or, if we merge common prefixes in the strings in these linear combinations, it's a space of prefix trees with coefficients in leaves (the strings of keys assumed to have an implicit sentinel end-of-string character at their ends which represents the leaf nodes in the trees), right?

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 12:16 am (UTC)(link)
1. It would certainly be nice to create intelligent masterpieces :-) Not a bad subject for meditation and, with some luck, experiments :-) :-) :-)

2. You are right, thanks! I've commented this note out.

***

Yes, these are prefix trees with coefficients (which are real numbers at the moment, although we might make them something more interesting).

***

We do use a very rich alphabet, where everything that Clojure can use as a hash key is a legal letter. So even the alphabet is countable. Yes, mathematically speaking, it would be enough to use just strings as letters (the alphabet is still countable), and then, if one uses an extra sentinel end-of-string character, one could concatenate the strings along a path, use the resulting finite alphabet, and get prefix trees over a finite alphabet, but with more depth.

This would indeed work.

***

Implementation-wise, we are actually in love with intermediate layers. E.g. one can see how we use them in our network matrix here (https://github.com/jsa-aerial/DMM/blob/master/design-notes/recurrent-maps.md and the code), with the columns and rows of the network matrix being indexed not by "flat indices", but by a three-level hierarchy of indices. So we might want to not actually concatenate in our implementation, but to keep the alphabet countable and to keep the "letters" expressive.

But in general, this "concatenation idea" might be quite fruitful in a number of contexts. (The actual implementation would probably replace a sequence of keys, e.g. key1, key2, key3, with a single key which would be a vector [key1 key2 key3].) Basically, instead of concatenating to increase the tree depth, we would be concatenating to decrease the tree depth (using whole paths as single dictionary keys). So we should keep this possibility in mind; it might come in handy in a number of situations.
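
A hypothetical Clojure sketch of this concatenation idea (not part of the DMM codebase): flatten a recurrent map into a map from whole key paths, represented as vectors, to the leaf coefficients.

(defn flatten-paths
  "Turn a nested recurrent map into a depth-1 map keyed by key paths."
  ([m] (flatten-paths [] m))
  ([prefix m]
   (if (number? m)
     {prefix m}
     (apply merge
            (map (fn [[k v]] (flatten-paths (conj prefix k) v)) m)))))

;; (flatten-paths {:a {:b 2.0, :c -1.0}})
;; ;=> {[:a :b] 2.0, [:a :c] -1.0}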

[identity profile] datjko.livejournal.com 2017-01-03 07:14 am (UTC)(link)
I keep reading, in small portions, from time to time.

1.
"
let's agree that the first level of hash-map keys would name the respective input and output streams and map to their latest values.
...
Therefore the root keys should be types of neurons.
"

Neuron types or instances of neurons? - OK, I see, instances are the next level.

2.
"
A map from names of the built-in transformers to (the map from names of the individual neurons to (the map from names of their output streams to the respective non-zero matrix coefficients)).
"

In the first quote you talked about naming both input and output streams, but here only about output streams?

3. Am I right that you don't keep the computational graph in mind "as a picture", and rather think of it as something implicitly given by the "computational matrix"?

Maybe the pictures of usual neural nets confuse me here. It seems I understand the idea. Could you please check whether I'm wrong in the following:

The "computation matrix" *is* that prefix tree of chains of keys that we were talking about (aka recurrent map), and a rule for computing the next step is to traverse it up to each individual neuron and apply the neuron's function to the subtree of arguments, producing a new subtree of output values, which is then split&replicated&wrapped with appropriate neurons keys to make up a computation matrix for next step.

I have a feeling that in the papers this (or some more generic operation?) is meant as part of some simple transformation of a giant (or even countably-dimensional) but sparse matrix or tensor; is that right? Most likely it's explained in the papers you gave the links to, but I have not read or understood it yet. Do you have a simple hint about what that matrix transformation is, which would apply the neuron functions to the appropriate data and produce a new computational matrix? Or maybe an example? Tomorrow I'll try to grasp the example of the interview task solution in https://arxiv.org/abs/1606.09470, so don't hurry with answers if you think it's better to wait for me there.

<s>
at any moment of time t (and on the up movement of the two-stroke engine) there is a directed bipartite computational graph g_t with neurons as edges (or terminal source or sink nodes) transforming an input recurrent map vector of values (generated for the moment t) to an output recurrent map vector of values. Nodes of g_t are neurons' inputs and outputs (corresp in input and output partitions of the graph), where several recurrent map streams are merged (concatenated) and split (or replicated).
</s> (I wanted to erase this statement as I thought I came to a better idea above, after thinking about recurrent maps vs vectors of values for input/output streams. I left it here just in case it's worth any comments).

4. What are the requirements for neuron functions? Generically, at least, they are assumed to always produce exactly zero on zero input, right? Are there other requirements? Stateless? Deterministic?

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 03:11 pm (UTC)(link)
Hi!

I'll answer in small pieces...

2. Columns correspond to outputs, and they are indexed by this three-level structure: {name of the built-in transformer {name of the individual neuron {name of the output stream ...}}}.

When one considers a particular matrix row, this corresponds to "the second index, which changes when one moves along the matrix row".

Rows correspond to inputs, and they are indexed by this three-level structure: {name of the built-in transformer {name of the individual neuron {name of the input stream ...}}}.

So the matrix elements are indexed by the 6-level structure: {name of the built-in transformer of neuron-i {name of the individual neuron-i {name of the input stream {name of the built-in transformer of neuron-j {name of the individual neuron-j {name of the output stream ...}}}}}}
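
For illustration, in Clojure this 6-level indexing might look as follows (the transformer, neuron, and stream names here are made up, not taken from the actual examples):

(def network-matrix
  {"plus"                ; built-in transformer of neuron-i
   {"neuron-i"           ; individual neuron-i
    {"accum"             ; input stream of neuron-i
     {"identity"         ; built-in transformer of neuron-j
      {"neuron-j"        ; individual neuron-j
       {"out" 1.0}}}}}}) ; output stream of neuron-j -> weight

;; Reading a single weight:
;; (get-in network-matrix
;;         ["plus" "neuron-i" "accum" "identity" "neuron-j" "out"])
;; ;=> 1.0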

3 (first paragraph). I love visualizing computational graphs. It is always good to have alternative angles of view (to view the network both as a matrix and as a graph).

In some of our earlier prototypes, we did visualize how the dataflow graph of a changing network changes with time. E.g. I have this auxiliary livejournal, and its top post has a video (made in June 2015) which shows the changing dataflow graph in the lower left corner: http://anhinga-drafts.livejournal.com/29929.html

This was made before we understood the matrix formalism. And I have always wanted to have the option of using input interfaces to edit the dataflow graph directly at the graphical level (being inspired in this by some visual programming languages like Pure Data). But it's always a lot of software engineering work, and we don't have anything like this yet in our current Clojure system.

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 03:35 pm (UTC)(link)
> The "computation matrix" *is* that prefix tree of chains of keys that we were talking about (aka recurrent map), and a rule for computing the next step is to traverse it up to each individual neuron and apply the neuron's function to the subtree of arguments, producing a new subtree of output values

Yes, this is an "up movement" of the two-stroke engine. In the current version of

https://github.com/jsa-aerial/DMM/blob/master/src/dmm/core.clj

see the comments starting from line 303 at the end of this file; the two functions at the bottom implement the traversal (when you decide you want to read the code, the important function to understand is reduce; this whole DMM codebase is constructed around it: https://clojuredocs.org/clojure.core/reduce )

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 03:45 pm (UTC)(link)
So this up-movement takes a three-level structure of all inputs and produces a three-level structure of all outputs: from a dictionary of the shape {name-of-the-built-in-transformer {name-of-the-individual-neuron {name-of-the-input-stream some-input}}} it builds a new dictionary of the shape {name-of-the-built-in-transformer {name-of-the-individual-neuron {name-of-the-output-stream some-output}}}.
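
A simplified Clojure sketch of this up movement (the real implementation in src/dmm/core.clj is built around reduce and differs in detail; the `transformers` registry below, mapping transformer names to functions from a neuron's input map to its output map, is an assumption made for illustration):

(defn up-movement [transformers all-inputs]
  (into {}
        (for [[t-name neurons] all-inputs]
          [t-name
           (into {}
                 (for [[n-name input-streams] neurons]
                   [n-name ((get transformers t-name) input-streams)]))])))

;; (up-movement {"sum" (fn [ins] {"out" (reduce + 0.0 (vals ins))})}
;;              {"sum" {"n1" {"x" 1.0, "y" 2.0}}})
;; ;=> {"sum" {"n1" {"out" 3.0}}}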

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 04:17 pm (UTC)(link)
> Do you have a simple hint of what is that matrix transformation which would apply the neuron functions to appropriate data and produce a new computational matrix. Or, may be an example?

Currently, we use a very simple form of Self. It has two arguments, and on the up movement it simply adds them together. Its output is connected to one of its own inputs with weight 1, so this allows it to act as an accumulator (to keep the computed value through time).

The other input of Self accepts additive contributions from other neurons. So on the down movement, those additive contributions are added together, and on the up movement they are added to the previously existing matrix.
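
A Clojure sketch of this Self behavior (the stream names :accum and :delta are illustrative, not the ones used in the actual code):

(defn rec-map-sum [x y]
  (if (and (map? x) (map? y))
    (merge-with rec-map-sum x y)
    (+ x y)))

(defn self-up
  "Up movement of Self: add the accumulated matrix (fed back from
   Self's own output with weight 1) to the additive update collected
   from the other neurons on the down movement."
  [inputs]
  {:new-matrix (rec-map-sum (get inputs :accum {})
                            (get inputs :delta {}))})

;; The :new-matrix output becomes the network matrix used on the
;; next down movement.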

***

A couple of simple examples are in Appendix D.2 on page 6 of https://arxiv.org/abs/1610.00831

The second of those examples is also implemented here: https://github.com/jsa-aerial/DMM/tree/master/examples/dmm

[identity profile] anhinga-anhinga.livejournal.com 2017-01-03 05:08 pm (UTC)(link)
4 is an interesting question.

On one hand, the requirements depend on which training methods are to be applicable. For example, people occasionally consider the step function (y = 1 if x > 0 else 0), but its derivative is 0 everywhere it is defined, so if one wants to do gradient descent, this is not a good neuron function to have.

At the same time, one can consider an integral of the step function (ReLU: y = max(0, x)). People ignored it for a long time because of the discontinuity in the derivative, but it turns out that it works fine in gradient methods, and that it is actually one of the best neuron functions available.

***

A priori, we don't impose many requirements. When a new neuron is recruited into the active part of the network, it happens because Self has just changed the network matrix on the up movement. So the next thing that happens is that the inputs of that newly recruited neuron are computed on the down movement. Hence we don't even need to impose any particular requirement at zero.

In fact, we typically think of input neurons as neurons which ignore their own input, if any, and just inject their output into the network. So this would be an example of a neuron which is not trying to map zeros to zeros.

They are usually stateless, although when we described the general formalism in the second of those preprints, we specifically made sure to be general enough to allow state, if desired.

We found stochastic neurons quite interesting. E.g. all the funky self-organizing patterns in the videos linked from the first comment at http://anhinga-anhinga.livejournal.com/82757.html are obtained using neurons which are "randomized propagators" (they usually pass their input through unchanged, but with a small probability they output zero instead; this turns out to be enough for all that complexity in those videos).
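
For a scalar stream, such a randomized propagator can be sketched in a couple of lines of Clojure (the 0.02 drop probability here is an arbitrary illustration):

(defn randomized-propagator [x]
  (if (< (rand) 0.02)
    0.0       ; with small probability, output zero instead
    x))       ; otherwise pass the input through unchanged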

[identity profile] datjko.livejournal.com 2017-01-03 06:23 pm (UTC)(link)
Misha, thank you a lot for such detailed answers! I'll keep making small steps and asking new questions.

[identity profile] anhinga-drafts.livejournal.com 2017-02-04 08:35 am (UTC)(link)
Hi!

This is one of the most interesting talks I've ever heard in my life:

http://anhinga-drafts.livejournal.com/29954.html

[identity profile] datjko.livejournal.com 2017-01-04 11:11 am (UTC)(link)
Misha, they say that "There's no sparse variable in TensorFlow":
https://groups.google.com/a/tensorflow.org/forum/?utm_medium=email&utm_source=footer#!msg/discuss/LXJIWkil_mc/IEWI1398EQAJ

Period. Currently they can't do gradient steps on sparse weights.
I have yet to learn what people do for ML on large graphs, for example. I believe there must be a workaround.

Before you pointed me to sparse tensors in TF, I was thinking of making a kd-tree out of a sparse tensor and applying to the tree an RNN adapted for recurrent processing of trees (instead of the usual recurrent processing of sequences). This way all weights are dense and, arguably, the recurrent tree RNN may mimic plain convolution-like updates or even have some advantage over them. On the other hand, in a kd-tree, spatially neighboring cells (leaves) can be quite far from their nearest common ancestor. That means the tree RNN would have to learn spatial neighboring relations for some nodes separated by quite long traversals through the tree, and that makes me skeptical about this approach.

Currently I have no idea when I will be able to play with such kd-tree RNNs, but if I suddenly learn something inspiring for this technique, I may want to give the idea a try as seriously as I can. Basically, I want to make it possible to apply DL to point clouds (which are, generically, sparse samples of whatever, embedded into whatever space).

[identity profile] anhinga-anhinga.livejournal.com 2017-01-04 08:28 pm (UTC)(link)
I'll also try to think about this...

[identity profile] anhinga-anhinga.livejournal.com 2016-12-30 05:57 am (UTC)(link)
(More on 1. Of course, if one has unary neurons with "log" and "exp" activation functions, then multiplication is doable at a relatively moderate cost. Multiplication only becomes difficult if one has the usual (rather impoverished) set of activation functions.

But even with "log" and "exp", there is some difficulty, because the most important case of multiplication is multiplication by zero. So a bit more thought would be needed on handling behavior near zero. And, of course, sign reversal would be a separate unary operation. Still, the idea of presenting multiplication in this fashion is neat and might be useful for some things, so let it be recorded here.)
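
For the record, a Clojure sketch of that decomposition, valid for strictly positive inputs (which is exactly why zero and sign need the extra care mentioned above):

;; x * y = exp(log x + log y), for x, y > 0
(defn mul-via-log-exp [x y]
  (Math/exp (+ (Math/log x) (Math/log y))))

;; (mul-via-log-exp 3.0 4.0) ;=> ~12.0 (up to floating-point error)
;; Negative inputs produce NaN (log of a negative number), and inputs
;; at or near zero push log toward -Infinity, so both cases need the
;; separate handling discussed above.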